Big Data Engineer Resume

Los Angeles, CA

SUMMARY

  • Around 9 years of extensive hands-on Big Data experience with the Hadoop ecosystem across on-premises and cloud-based platforms.
  • Expertise in cloud computing and Hadoop architecture and its components: HDFS (Hadoop Distributed File System), MapReduce, Spark, NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.
  • Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
  • Developed Spark-based applications to load streaming data with low latency, using Kafka and PySpark programming.
  • Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
  • Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.
  • Experience developing PySpark code to create RDDs, Paired RDDs and Data Frames.
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
  • Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
  • Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark DataFrame transformations for efficient data processing (a brief PySpark sketch follows this list).
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Experienced in optimizing PySpark jobs to run on Kubernetes clusters for faster data processing.
  • Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions in MySQL, MS SQL Server, DB2, and Oracle.
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
  • Experience with Software development tools such as JIRA, Play, GIT.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Good understanding of data modeling (dimensional and relational) concepts such as star schema modeling and fact and dimension tables.
  • Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
  • Experience working with different Google Cloud Platform technologies such as BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
  • Designed and developed an ingestion framework over Google Cloud and Hadoop clusters.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, cloud migration, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Strong understanding of Java Virtual Machines and multi-threading processes.
  • Experience in writing complex SQL queries, creating reports and dashboards.
  • Proficient in using Unix based Command Line Interface.
  • Strong experience with ETL and/or orchestration tools (e.g., Talend, Oozie, Airflow)
  • Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
  • Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
  • Used Informatica PowerCenter for ETL: extracting, transforming, and loading data from heterogeneous source systems into target databases.
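
A minimal PySpark sketch of the kind of Hive-to-Spark rewrite and partitioned Hive table work described above; the table and column names (sales_raw, sales_by_region, region, sale_date, amount) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled session, assuming the cluster exposes a Hive metastore
spark = (SparkSession.builder
         .appName("hive-to-spark-rewrite")
         .enableHiveSupport()
         .getOrCreate())

# Read an existing Hive table into a DataFrame (replaces a HiveQL SELECT)
raw = spark.table("default.sales_raw")

# Aggregation that previously ran as a MapReduce/Hive job
by_region = (raw.filter(F.col("amount") > 0)
                .groupBy("region", "sale_date")
                .agg(F.sum("amount").alias("total_amount")))

# Write back as a partitioned Hive table so downstream HQL queries prune by date
(by_region.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .format("parquet")
    .saveAsTable("default.sales_by_region"))
```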

TECHNICAL SKILLS

AWS environment: EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis.

Azure environment: Azure Databricks, Azure Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, AAD, Azure batch.

Scripting Languages: Python, PySpark, SQL, Scala, Shell, PowerShell, HiveQL.

Databases: Snowflake, MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

NoSQL Databases: HBase, DynamoDB

Big Data Ecosystem: HDFS, Yarn, MapReduce, Spark, Kafka, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Ambari, Oozie, Zookeeper, Nifi, Apache Hadoop, Cloudera CDP, Hortonworks HDP

Others: Jenkins, Tableau, Power BI, Grafana

SDLC- Methodologies: Agile, Waterfall, Hybrid

PROFESSIONAL EXPERIENCE

Confidential - Los Angeles, CA

Big Data Engineer

Responsibilities:

  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop and Spark.
  • Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.
  • Involved in migrating objects using the custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
  • Planned and designed the data warehouse in a star schema; designed table structures and documented them.
  • Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
  • Designed and implemented an end-to-end big data platform on a Teradata appliance.
  • Shared sample data with customers by granting access for UAT/BAT.
  • Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Worked extensively with AWS components such as Elastic MapReduce (EMR).
  • Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
  • Created data pipelines for ingestion and aggregation, and loaded consumer response data from AWS S3 buckets into Hive external tables in HDFS to feed Tableau dashboards.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Built ETL data pipelines for data movement to S3 and then to Redshift.
  • Scheduled different Snowflake jobs using NiFi.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers, and send it to Kafka brokers.
  • Experience with Snowflake multi-cluster warehouses.
  • Implemented a one-time migration of multistate-level data from SQL Server to Snowflake using Python and SnowSQL.
  • Day-to-day responsibilities included developing ETL pipelines in and out of the data warehouse and building major regulatory and financial reports using advanced SQL queries in Snowflake.
  • Staged API and Kafka data (in JSON format) into Snowflake by flattening it for different functional services.
  • Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files (see the sketch after this list).
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Wrote UDFs in PySpark to perform transformations and loads.
  • Developed Python code to gather data from HBase and designed the solution for implementation in PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premises environment.
  • Analyzed the SQL scripts and designed solutions to implement using PySpark.
  • Used NiFi to load data into HDFS as ORC files.
  • Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 into the Hadoop cluster.
  • Worked with ORC, Avro, JSON, and Parquet file formats, created external tables, and queried on top of these files using BigQuery.
  • Performed source analysis, tracing data back to its origins through Teradata, DB2, etc.
  • Identified the jobs that load the source tables and documented them.
  • Implemented a Continuous Integration and Continuous Delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
  • Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Implemented a batch process for heavy-volume data loading with a dataflow framework using NiFi, following an Agile development methodology.
  • Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
  • Created Snowpipe for continuous data loads from staged data residing on cloud gateway servers.
  • Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
  • Installed and configured applications such as Docker and Kubernetes for orchestration.
  • Developed an automation system using PowerShell scripts and JSON templates to remediate services.
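
A minimal sketch of the kind of Lambda handler described in the list above for flattening and sorting nested JSON, assuming the files arrive via an S3 put event; the bucket layout, output prefix, and field names are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

def flatten(record, parent_key="", sep="."):
    """Recursively flatten a nested JSON object into dotted keys."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

def lambda_handler(event, context):
    # S3 put event: read the uploaded JSON file
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    records = [flatten(r) for r in json.loads(body)]
    # Sort by a hypothetical timestamp field before staging downstream
    records.sort(key=lambda r: r.get("event.timestamp", ""))

    # Write the flattened, sorted output under a processed/ prefix
    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key}",
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"statusCode": 200, "records": len(records)}
```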

Environment: Apache Spark, Hadoop, PySpark, HDFS, Cloudera, AWS, Azure, Kafka, Snowflake, Docker, Jenkins, Ant, Maven, Kubernetes, NiFi, JSON, Teradata, DB2, SQL Server, MongoDB, Shell Scripting.

Confidential, Atlanta, GA

Sr. Data Engineer / Big Data Engineer

Responsibilities:

  • Met with business/user groups to understand business processes, gather requirements, and analyze, design, develop, and implement solutions according to client requirements.
  • Designed and developed Azure Data Factory (ADF) pipelines extensively for ingesting data from different source systems, both relational and non-relational, to meet business functional requirements.
  • Designed and developed event-driven architectures using blob triggers and Data Factory.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Automated jobs using different ADF triggers, such as event, schedule, and tumbling window triggers.
  • Created and provisioned different Databricks clusters, notebooks, and jobs, and configured autoscaling.
  • Ingested huge volumes and a wide variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations (a brief sketch follows this list).
  • Performed data flow transformation using the data flow activity.
  • Implemented Azure and self-hosted integration runtimes in ADF.
  • Developed streaming pipelines using Apache Spark with Python.
  • Created, provisioned multiple Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
  • Improved performance by optimizing computing time to process the streaming data and saved cost to the company by optimizing the cluster run time.
  • Performed ongoing monitoring, automation, and refinement of data engineering solutions.
  • Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues.
  • Created a linked service to land data from an SFTP location into Azure Data Lake.
  • Extensively used SQL Server Import and Export Data tool.
  • Worked with complex SQL views, stored procedures, triggers, and packages in large databases across various servers.
  • Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console in GCP.
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Developed a POC for project migration from an on-premises Hadoop MapR system to GCP.
  • Compared self-hosted Hadoop with GCP's Dataproc, and explored Bigtable (managed HBase) use cases and performance.
  • Experience working in both agile and waterfall methods in a fast-paced manner.
  • Generated alerts on daily event metrics for the product team.
  • Extensively used SQL Queries to verify and validate the Database Updates.
  • Suggested fixes to complex issues through thorough analysis of the root cause and impact of each defect.
  • Provided 24/7 on-call production support for various applications, provided resolution for night-time production jobs, and attended conference calls with business operations and system managers to resolve issues.
  • Designed and implemented Scala programs using Spark DataFrames and RDDs for transformations and actions on input data.
  • Improved Hive query performance by implementing partitioning and clustering and by using optimized file formats (ORC).
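
A minimal sketch of a Databricks PySpark table-to-table job of the kind described in the list above; the ADLS Gen2 account, container, paths, and table names are hypothetical, and the cluster is assumed to already have access to the storage account and a curated database.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook `spark` already exists; getOrCreate reuses it
spark = SparkSession.builder.getOrCreate()

# Source landed by ADF into ADLS Gen2 (hypothetical container/account/path)
source_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/orders/"

orders = spark.read.format("parquet").load(source_path)

# Basic cleanup and derivation before publishing to the curated zone
curated = (orders
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_ts"))
           .filter(F.col("status").isNotNull()))

# Write to a partitioned table for downstream batch and reporting jobs
(curated.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("curated.orders"))
```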

Environment: Azure Data Factory (ADF v2), Azure SQL Database, Azure Functions Apps, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, ADLS Gen2, Azure Cosmos DB, Azure Event Hub, Azure Machine Learning.

Confidential - New York, NY

Big Data Engineer

Responsibilities:

  • Successfully modernized and migrated all Workloads from on-premises to AWS.
  • Created Real-Time Streaming and Batch data pipelines for the same project.
  • Created an Automated Databricks workflow notebook to run multiple data loads (Databricks notebooks) in parallel using Python.
  • Implemented an Airflow configuration-as-code approach to automate the generation of workflows using Jenkins and Git (a minimal DAG sketch follows this list).
  • Implemented Airflow visualizations in Grafana dashboards and connected failure notifications to Slack channels.
  • Worked on developing data pipelines to ingest Hive tables and file feeds and generate insights into DynamoDB (DAX).
  • Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems and load nested JSON-formatted data into Snowflake tables.
  • Created functions and assigned roles in AWS Lambda to run python scripts to perform event-driven processing.
  • Created AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in the S3 bucket.
  • Imported structured data from S3 into multiple systems, including RedShift, and generated reports using AWS Athena.
  • Developed Snowflake views to load and unload data from and to an AWS S3 bucket and transferred the code to production.
  • Used Apache Spark with Python to develop and execute Big Data Analytics applications.
  • Used Terraform scripts to create and maintain the Hadoop cluster on AWS EMR.
  • Designed and developed RDD seeds using PySpark and EMR.
  • Involved in designing, developing, and maintaining complex SSAS databases, constantly monitoring the daily schedules, and providing data to the ad-hoc requests from business users.
  • Developed BI data lake POCs using AWS Services, including Athena, S3, Ec2, Glue, and Tableau.
  • Worked on Proof of Concepts to decide on Tableau as a BI strategy for enterprise reporting.
  • Built an ETL process using PySpark to identify potentially false submissions for BaNCS.
  • Refactored Python code for various projects to improve performance, readability, and flexibility.
  • Wrote and reviewed SQL queries using join clauses (inner, left, right) in Tableau Desktop to validate static and dynamic data.
  • Optimized data sources for the route distribution analytics dashboard in Tableau, reducing report runtime.
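
A minimal Airflow DAG sketch illustrating the configuration-as-code approach mentioned in the list above; the DAG id, schedule, and load callable are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_feed_to_dynamodb(**context):
    # Placeholder for the actual load logic (e.g., a Databricks or boto3 call)
    print("loading feed for", context["ds"])

with DAG(
    dag_id="daily_feed_ingest",          # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["ingest"],
) as dag:
    load = PythonOperator(
        task_id="load_feed",
        python_callable=load_feed_to_dynamodb,
    )
```

Keeping DAG files like this under Git lets Jenkins deploy workflow changes through the same CI process as application code.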

Environment: Spark, Python, Airflow, Snowflake, AWS, MapReduce, Pyspark, BI, Tableau, ETL, SQL.

Confidential

Data Engineer

Responsibilities:

  • Managed security groups on AWS, focusing on high-availability, fault-tolerance, and auto scaling using Terraform templates.
  • Implemented Continuous Integration and Continuous Deployment with AWS Lambda and AWS CodePipeline.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
  • Created various types of data visualizations using Python and Tableau.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters (see the sketch after this list).
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get each job done.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Involved in forward engineering logical models into physical models using Erwin and in their subsequent deployment to the enterprise data warehouse.
  • Defined and deployed monitoring, metrics, and logging systems on AWS
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
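
A minimal consumer-lag check of the kind described in the list above, sketched with the kafka-python client; the broker address, topic, and consumer group are hypothetical.

```python
from kafka import KafkaConsumer, TopicPartition

TOPIC = "orders"                   # hypothetical topic
GROUP = "orders-consumer-group"    # hypothetical consumer group

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

# Compare the latest broker offsets with the group's committed offsets
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")

consumer.close()
```

A check like this can be scheduled and wired into alerting to keep an eye on consumer lag across the cluster.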

Environment: Informatica, RDS, NoSQL, Snowflake Schema, Apache Kafka, Python, Zookeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL.

Confidential

Data Engineer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built an Oozie pipeline that performs several actions: moving files, Sqooping data from source Teradata or SQL systems into Hive staging tables, performing aggregations per business requirements, and loading into the main tables.
  • Ran Apache Hadoop, CDH, and MapR distros as Elastic MapReduce (EMR) on EC2.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Worked on retrieving data from the file system to S3 using Spark commands.
  • Built S3 buckets and managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Involved in creating UNIX shell scripts; defragmented tables and implemented partitioning, compression, and indexes for improved performance and efficiency.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Implemented data ingestion and handled clusters in real-time processing using Kafka.
  • Developed and implemented R and Shiny application which showcases machine learning for business forecasting.
  • Developed predictive models using Python and R to predict customer churn and classify customers (a brief sketch follows this list).
  • Performed forking whenever there was scope for parallel processing, to optimize data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Wrote a Pig script that picks up data from one HDFS path, performs aggregation, and loads the result into another path, which later populates another domain table.
  • Converted this script into a JAR and passed it as a parameter in the Oozie script.
  • Hands-on experience with Git Bash commands such as git pull to pull code from the source and develop against requirements, git add to stage files, git commit after the code builds, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts released into production.
  • Created logical data model from the conceptual model and its conversion into the physical database design using Erwin.
  • Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client, and presented the analysis and suggested solutions to investors.
  • Rapid model creation in Python using pandas, NumPy, scikit-learn, and Plotly for data visualization.
  • These models were then implemented in SAS, where they interface with MSSQL databases and are scheduled to update on a regular basis.
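
A minimal sketch of a churn-classification model along the lines described in the list above, using pandas and scikit-learn; the CSV path, feature columns, and label column are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical customer snapshot with usage features and a churn label
df = pd.read_csv("customers.csv")
features = ["tenure_months", "monthly_charges", "support_tickets"]
X, y = df[features], df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out set before handing scores to downstream SAS/MSSQL jobs
print(classification_report(y_test, model.predict(X_test)))
```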

Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark.

Confidential

Data Analyst

Responsibilities:

  • Worked with the business community to define business requirements and analyze the possible technical solutions.
  • Requirement gathering, Business Process flow, Business Process Modeling and Business Analysis.
  • Extensively used UML and Rational Rose for designing to develop various use cases, class diagrams and sequence diagrams.
  • Used JavaScript for client-side validations, and AJAX to create interactive front-end GUI.
  • Developed application using Spring MVC architecture.
  • Developed custom tags for table utility component
  • Used various Java, J2EE APIs including JDBC, XML, Servlets, and JSP.
  • Designed and implemented the UI using Java, HTML, JSP and JavaScript.
  • Designed and developed web pages using Servlets and JSPs and also used XML/XSL/XSLT as repository.
  • Involved in Java application testing and maintenance in development and production.
  • Involved in developing customer form data tables and maintaining customer support and customer data in MySQL database tables.
  • Involved in mentoring specific projects in application of the new SDLC based on the Agile Unified Process, especially from the project management, requirements and architecture perspectives.
  • Designed and developed View, Model and Controller components implementing MVC Framework.

Environment: JDK 1.3, J2EE, JDBC, Servlets, JSP, XML, XSL, CSS, HTML, DHTML, JavaScript, UML, Eclipse 3.0, Tomcat 4.1, MySQL.
