Senior Data Engineer Resume
Raritan, NJ
SUMMARY
- 7.5+ years of experience in Data Engineering and Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Experienced in designing, architecting, and implementing scalable cloud-based data pipelines using Azure and AWS.
- Experienced in migrating SQL databases to Azure Data Lake Storage, Azure Data Factory (ADF), Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Involved in software development, data warehousing, analytics, and data engineering projects using Hadoop, MapReduce, Hive, and other open-source tools and technologies.
- Expertise in major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Oozie, ZooKeeper, and Hue.
- Performed data validation and transformation using Python and Hadoop streaming.
- Experience in designing and developing POCs using Scala, Spark SQL, and MLlib, then deploying them on a YARN cluster.
- Experience with other Hadoop ecosystem tools such as ZooKeeper, Oozie, and Impala.
- Experienced in troubleshooting errors in the HBase shell/API, Pig, Hive, and MapReduce.
- Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC, along with compression codecs like Snappy and GZIP.
- Extensive experience importing and exporting data using Flume and Kafka.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Performed transformations on the imported data and exported back to RDBMS.
- Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase and HDFS.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
- Hands-on ETL experience ingesting data from RDBMS sources such as SQL Server and MySQL into Hive and HDFS using Sqoop.
- Involved in converting MapReduce programs into Spark RDD transformations using Scala and Python (a short sketch follows this summary).
- Used Oozie and ZooKeeper operational services for coordinating the cluster and scheduling workflows.
- Strong knowledge of Spark architecture and components; efficient in working with Spark Core and Spark SQL.
- Worked on generating JSON scripts and writing UNIX shell scripts to call Sqoop import/export jobs.
- Assisted in upgrading, configuring, and maintaining Hadoop ecosystem components such as Pig, Hive, and HBase.
- Responsible for building scalable distributed data solutions using Hadoop; handled job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Experience integrating Apache Kafka with Apache Storm and creating Storm data pipelines for real-time processing; used Kafka producers and consumers to build data pipelines that carry logs as streams of messages.
- Experienced in data manipulation using Python and libraries such as Pandas, NumPy, SciPy, and scikit-learn for data analysis, numerical computations, and machine learning.
- Developed reports, dashboards using Tableau for quick reviews to be presented to Business and IT users.
- Excellent understanding and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Strong experience in Java, Scala, SQL, PL/SQL, and RESTful web services.
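The following is a minimal, illustrative PySpark sketch of the kind of MapReduce-to-Spark RDD conversion mentioned above (a simple word count); the application name and HDFS paths are hypothetical placeholders, not taken from an actual project.

# Minimal PySpark sketch: a word-count MapReduce job expressed as RDD transformations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-rdd-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/raw/events.txt")          # hypothetical input path
      .flatMap(lambda line: line.split())               # map: emit one record per word
      .map(lambda word: (word, 1))                      # map: (key, 1) pairs
      .reduceByKey(lambda a, b: a + b)                  # reduce: sum counts per key
)
counts.saveAsTextFile("hdfs:///data/out/word_counts")   # hypothetical output path
spark.stop()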
TECHNICAL SKILLS
Languages: Python, Java, Scala
Big Data Tools: Spark, Hive, Kafka, Redshift, Cassandra, Hadoop, Impala, Hudi, Delta Lake, Glue, Athena, Sqoop, Oozie, Flume, Zookeeper, Airflow, Storm
Database: Oracle, HBase, Cassandra, Snowflake, MongoDB, MySQL, MS SQL Server, PostgreSQL, DynamoDB
Python Packages: PySpark, Pandas, Matplotlib, NumPy, SQLAlchemy, Requests, cx_Oracle, tox, Redis, python-ldap, python-saml, marshmallow, python-OAuth2, Boto3, confluent-kafka, fabric
Cloud Development: Azure Data Lake Storage, Azure Data Factory (ADF), Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, Azure SQL Data Warehouse, AWS EMR, AWS Lambda, AWS EC2
Scripting Languages: HTML, CSS, JavaScript, Shell
Version Control: GIT, SVN
Continuous Integration: Jenkins
Agile Methodologies: TDD, Scrum
Operating Systems: UNIX, Linux, Windows
PROFESSIONAL EXPERIENCE
Confidential, Raritan, NJ
Senior Data Engineer
Responsibilities:
- Designing and developing code, scripts and data pipelines that leverage structured and unstructured data integrated from multiple sources
- Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to an output directory in S3.
- Extracted real-time feeds using Kafka and Spark Streaming, processed the data as DataFrames, and saved it in Parquet format to S3 (see the sketch after this list).
- Worked on the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, ADF, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Using Azure Data Factory, created data pipelines and data flows and triggered the pipelines.
- Wrote Spark programs using Python, PySpark, and the Pandas package for performance tuning, optimization, and data quality validation.
- Worked on developing Kafka producers and Kafka consumers for streaming millions of events per second.
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
- Expertise in building PySpark, Spark Java, and Scala applications for interactive analysis, batch processing, and stream processing.
- Involved in performance tuning of Spark jobs by using caching and taking full advantage of the cluster environment.
- Mentoring junior members on the team in application architecture, design, and development best practices.
- Interface with business professionals, application developers and technical staff working in an agile process and environment.
- Create and maintain technical documentation, architecture designs and data flow diagrams.
- Used Broadcast variables in Spark, effective & efficient Joins, transformations, and other capabilities for data processing.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Fine-tuning spark applications/jobs to improve the efficiency and overall processing time for the pipelines.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files; proficient with columnar file formats such as ORC and Parquet.
- Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deployed them to Kubernetes; created and managed pods in Kubernetes to deploy custom Python scripts.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of Git repositories and access-control strategies.
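Below is a hedged sketch of the Kafka-to-S3 Parquet ingestion described above, written with Spark Structured Streaming's DataFrame API; the broker address, topic name, S3 paths, and trigger interval are illustrative assumptions, and the Spark-Kafka connector package is assumed to be available on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3-parquet-sketch").getOrCreate()

# Read the real-time feed from Kafka as a streaming DataFrame.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
         .option("subscribe", "events")                       # placeholder topic
         .option("startingOffsets", "latest")
         .load()
         .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Persist the stream to S3 in Parquet format, with a checkpoint for fault tolerance.
query = (
    events.writeStream
          .format("parquet")
          .option("path", "s3a://example-bucket/events/")             # placeholder bucket
          .option("checkpointLocation", "s3a://example-bucket/chk/")  # placeholder path
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()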
Environment: Python, Azure Databricks, Azure Data Factory, Azure DW, HDFS, Kafka, Hive, Sqoop, Linux, Maven, EC2 instances, Azure Storage, Kubernetes, Docker.
Confidential, Somerset, NJ
Senior Data Engineer
Responsibilities:
- Involved in loading and transforming sets of structured, semi-structured, and unstructured data and analyzing them by running Hive queries and Spark SQL.
- Involved in migrating SQL databases to Azure Data Lake, Data Lake Analytics, Databricks, and Azure SQL Data Warehouse.
- Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format (see the sketch after this list).
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Ingested data into one or more Azure services, processed it in Azure Databricks, and wrote the data out as text and Parquet files.
- Created RDDs and DataFrames for the required input data and performed data transformations using PySpark.
- Involved in requirement analysis, design, coding, and implementation. Used linked services to connect to SQL Server and Teradata and load the data into ADLS and Blob Storage.
- Used HBase to store the majority of the data, which needed to be divided based on region.
- Performed data manipulation on extracted data using Python Pandas.
- Experienced in running queries using Impala and used BI tools to run ad hoc queries directly on Hadoop.
- Worked on NoSQL databases including HBase and Cassandra.
- Work with subject matter experts and project team to identify, define, collate, document, and communicate the data migration requirements.
- Excellent understanding of Hadoop architecture and underlying framework including storage management.
- Designed and implemented big data ingestion pipelines to ingest multi-petabyte data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, and stored the output in efficient formats such as Parquet.
- Develop best practice, processes, and standards for effectively carrying out data migration activities. Work across multiple functional projects to understand data usage and implications for data migration.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
- Experienced in Maintaining the Hadoop cluster on Azure HDInsight.
- Implemented a Continuous Delivery pipeline with Docker, and Git Hub.
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies.
- Collaborate with team members and stakeholders in design and development of data environment
- Preparing associated documentation for specifications, requirements, and testing.
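A minimal sketch of the Avro-to-ORC flow described above, assuming the spark-avro package is available on the cluster; the input path, column names, and target table are hypothetical placeholders rather than actual project objects.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("avro-raw-to-orc-sketch")
    .enableHiveSupport()          # needed to write managed (internal) Hive tables
    .getOrCreate()
)

# Read the Avro-formatted raw layer (placeholder path).
raw_df = spark.read.format("avro").load("/mnt/raw/orders/")

# Apply Spark SQL transformations before publishing to the data service layer.
raw_df.createOrReplaceTempView("raw_orders")
curated = spark.sql("""
    SELECT order_id, customer_id, CAST(amount AS DECIMAL(18,2)) AS amount, order_date
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

# Write to an internal table in ORC format (placeholder database/table names).
curated.write.mode("overwrite").format("orc").saveAsTable("service_layer.orders")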
Environment: Azure Data Factory (ADF), Azure Databricks, Azure Data Lake Storage (ADLS), Azure HDInsight, Blob Storage, Cassandra, Kafka, Delta Lake, Python, PySpark, Docker.
Confidential, Madison, WI
Data Engineer
Responsibilities:
- Load and transform large sets of structured, semi structured, and unstructured data that includes Avro, sequence files and XML files.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
- Used Spark Data Frame Operations to perform required Validations in the data and to perform analytics on the Hive data.
- Developed a common framework to import data from external databases into HDFS and to export it back to external databases using Sqoop.
- Used Spark SQL with Python for creating data frames and performed transformations on data frames like adding schema manually, casting, joining data frames before storing them.
- Developed Hive queries on external tables to perform various analysis and utilized HUE interface for querying the data (Hive/Impala).
- Developed the Hive UDF'S to pre-process the data for analysis and scripts for using dynamic partitioning in Hive.
- Implemented dynamic partitioning in Hive tables and used appropriate file formats and compression techniques to improve the performance of daily Spark and MapReduce jobs (see the sketch after this list).
- Responsible for continuous monitoring and managing Elastic MapReduce cluster through AWS console.
- Good knowledge of applying rules and policies using the ILM (Information Lifecycle Management) workbench for Data Masking transformations and loading into targets.
- Used different Spark modules such as Spark Core, Spark SQL, Spark Streaming, Datasets, and DataFrames.
- Developed PySpark scripts, using both Data Frames/SQL and RDD in Spark for data aggregation and queries.
- Developed a Spark code and Spark-SQL/Streaming for processing of data.
- Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Work closely with various levels of individuals to coordinate and prioritize multiple projects. Estimate scope, schedule, and track projects throughout SDLC.
- Involved in source system analysis, data analysis, and data modeling through to ETL.
- Experienced in working with data analytics, web scraping, and extraction of data in Python.
- Designed & Implemented database cloning using Python and Built backend support for Applications using Shell scripts.
- Active involvement in Scrum meetings and Followed Agile Methodology for implementation.
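An illustrative sketch of Hive dynamic partitioning driven from Spark, as referenced above; the source table, target table, and partition column are assumed placeholders, and the compression choice mirrors the Snappy usage mentioned earlier in this resume.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dynamic-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow partitions to be derived from the data rather than specified statically.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

df = spark.table("staging.daily_events")            # placeholder source table

# Write ORC data partitioned by event_date; Snappy compression keeps files compact.
(
    df.write
      .mode("overwrite")
      .format("orc")
      .option("compression", "snappy")
      .partitionBy("event_date")                    # placeholder partition column
      .saveAsTable("analytics.daily_events")        # placeholder target table
)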
Environment: Python, Spark, EMR, HDFS, Kafka, Hive, Sqoop, Impala, Hue, XML, Spark SQL, Hive UDFs, Linux, Maven, Shell.
Confidential
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop
- Setup and benchmarked Hadoop/HBase clusters for internal use
- Developed simple to complex MapReduce jobs using the Java programming language, along with equivalent logic implemented in Hive and Pig.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Used UDFs to implement business logic in Hadoop.
- Used Impala to read, write and query the Hadoop data in HBase.
- Develop programs in Spark to use on the application for faster data processing than standard MapReduce programs
- Implemented business logic by writing UDFs in Java and used various UDFs from Piggybanks and other sources.
- Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
- Used Solr to navigate through data sets in the HDFS storage.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Worked on an ETL process to clean large data sets extracted from several websites (JSON/CSV files) and load them into SQL Server (see the sketch after this list).
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both managed and external Hive tables using Impala. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
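A brief, hedged sketch of the website-extract ETL described above, using Pandas and SQLAlchemy (both listed in the skills section); the connection string, file paths, join key, and staging table name are illustrative assumptions, and the pyodbc SQL Server driver is assumed to be installed.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder SQL Server connection string (assumes ODBC Driver 17 via pyodbc).
engine = create_engine(
    "mssql+pyodbc://user:password@dbhost/warehouse?driver=ODBC+Driver+17+for+SQL+Server"
)

# Load raw extracts pulled from the websites (placeholder paths).
json_df = pd.read_json("extracts/products.json")
csv_df = pd.read_csv("extracts/prices.csv")

# Basic cleaning: join the extracts, normalize column names, drop duplicates and empty rows.
combined = json_df.merge(csv_df, on="product_id", how="left")   # placeholder join key
combined.columns = [c.strip().lower() for c in combined.columns]
combined = combined.drop_duplicates().dropna(how="all")

# Load the cleaned data into a SQL Server staging table in batches.
combined.to_sql("stg_products", engine, if_exists="append", index=False, chunksize=1000)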
Environment: Java, Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Impala, SQL Server, Linux.