Sr. Data Engineer Resume
Charlotte, NC
SUMMARY
- 7+ years of IT experience across a variety of industries, including hands-on experience in Big Data (Hadoop) and Java development.
- Expertise with tools in the Hadoop ecosystem, including Spark, Hive, HDFS, MapReduce, Sqoop, Kafka, YARN, Oozie, and HBase.
- Excellent knowledge of distributed Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, and of the MapReduce programming paradigm.
- Experience in designing and developing production-ready data processing applications in Spark using Scala/Python.
- Strong experience creating efficient Spark applications for data transformations such as data cleansing, de-normalization, joins of various kinds, and data aggregation.
- Good experience fine-tuning Spark applications using techniques such as broadcast joins, increased shuffle parallelism, caching/persisting DataFrames, and sizing executors appropriately to use cluster resources effectively (a short tuning sketch appears at the end of this summary).
- Strong experience automating data engineering pipelines following proper standards and best practices (appropriate partitioning, suitable file formats, incremental loads that maintain previous state, etc.).
- Good knowledge of productionizing machine learning pipelines (featurization, training, scoring, evaluation), primarily using Spark ML libraries.
- Good exposure to the Agile software development process.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Strong experience with Hadoop distributions and platforms such as Cloudera, Hortonworks, AWS, and Azure Databricks.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase, Cassandra, and MongoDB.
- Experienced in writing complex MapReduce programs that work with different file formats such as Text, SequenceFile, XML, Parquet, and Avro.
- Experience with the Oozie workflow scheduler, managing Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Experience migrating data between HDFS and relational database systems, in both directions, using Sqoop.
- Extensive experience importing and exporting data using streaming and ingestion platforms such as Flume and Kafka.
- Very good experience with the complete project life cycle (design, development, testing, and implementation) of client-server and web applications.
- Excellent Java development skills using J2EE, J2SE, Servlets, JSP, EJB, JDBC, and SOAP and RESTful web services.
- Experience in database design using PL/SQL to write stored procedures, functions, and triggers, and strong experience writing complex queries for Oracle.
- Experienced in working with Amazon Web Services (AWS), including S3, EMR, Redshift, Athena, and the Glue metastore.
- Strong experience in Object-Oriented Design, Analysis, Development, Testing and Maintenance.
- Experienced in using agile approaches, including Extreme Programming, Test-Driven Development and Agile Scrum.
- Worked in large and small teams on systems requirements, design, and development.
- Key participant in all phases of the software development life cycle, including analysis, design, development, integration, implementation, debugging, and testing of software applications in client-server environments; experienced with IDEs such as Eclipse and IntelliJ and with SVN and Git repositories.
- Experience using build tools such as SBT and Maven.
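Illustrative Spark tuning sketch (referenced from the fine-tuning bullet above). This is a minimal, hypothetical example of broadcast joins, shuffle parallelism, and DataFrame caching; dataset names and S3 paths are placeholders, not taken from any specific project.

    # Minimal PySpark tuning sketch; dataset names and S3 paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        # raise shuffle parallelism for wide joins/aggregations
        .config("spark.sql.shuffle.partitions", "400")
        # executor sizing is normally set at submit time, e.g. --executor-memory / --executor-cores
        .getOrCreate()
    )

    orders = spark.read.parquet("s3://example-bucket/orders/")      # large fact table
    products = spark.read.parquet("s3://example-bucket/products/")  # small dimension table

    # broadcast the small dimension so the join avoids shuffling the large side
    enriched = orders.join(F.broadcast(products), "product_id", "left")

    # cache a DataFrame that is reused by multiple downstream aggregations
    enriched.cache()

    daily = enriched.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
    by_category = enriched.groupBy("category").agg(F.countDistinct("order_id").alias("order_count"))

    daily.write.mode("overwrite").parquet("s3://example-bucket/agg/daily/")
    by_category.write.mode("overwrite").parquet("s3://example-bucket/agg/by_category/")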
TECHNICAL SKILLS
Programming Languages: Python, Java, Scala
Big Data Technologies: Spark, HDFS, MapReduce, Hive, HBase, Sqoop, Flume, Oozie, Kafka, Impala
Distributed Platforms: Cloudera, Hortonworks, Azure Databricks, AWS EMR
NoSQL Databases: HBase, DynamoDB
Version Control: GitHub, Bitbucket, CVS, SVN
Build Tools: Ant, Maven, Gradle
Cloud Services: AWS S3, EMR, Redshift, Athena, Glue, Lambda
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Sr. Data Engineer
Responsibilities:
- Work across all phases of the software development life cycle to analyze, design, code, and implement high-quality, scalable solutions per business requirements.
- Interface with data scientists, product managers, architects, and business stakeholders to understand data needs and help build large data products that scale across the company.
- Define and implement best-practice approaches for processing large S3, Redshift, and Snowflake data sets for predictive analytics modeling.
- Develop Python scripts to automate validation, logging, and alerting for Spark applications running on AWS EMR.
- Create Athena external tables for consumption, implementing bucketing and date-based partitions for incremental data loads.
- Collaborate with Data Scientists to implement advanced analytics algorithms that exploit our rich data sets for statistical analysis, prediction, clustering and machine learning.
- Develop a Python framework to convert Hive/SQL queries into Spark transformations using Spark RDDs and DataFrames, and perform actions on datasets cached in memory.
- Develop PySpark-based analytical components for data modeling and data processing.
- Develop ETL pipelines in PySpark that link various datasets and store the results in Redshift and in S3 as Parquet files (see the PySpark ETL sketch after this role's Environment line).
- Build utilities, user-defined functions, libraries, and frameworks to better enable data flow patterns, using tools and languages prevalent in the big data ecosystem.
- Implement and leverage CI/CD to rapidly build and test application code using AWS CodeBuild and CloudFormation.
- Develop a consumption framework using Spark to ingest data from multiple sources into the AWS data lake and store it as Parquet in AWS S3.
- Participate in ad hoc stand-ups and architecture meetings to set daily priorities and track work status as part of a highly agile work environment.
- Drive the design and build of new data models and data pipelines in production.
- Build and apply analytical functions on data using PySpark DataFrames and feed that data into various ML models.
- Optimize and monitor the performance of Spark applications running in production, and take corrective action in case of failures.
- Create and schedule data pipelines in Airflow using Python (an illustrative DAG sketch follows this role's Environment line).
- Perform large-scale joins between large datasets and dimension tables to aggregate product information.
- Create staging layer to load partitioned datasets from S3 buckets into Snowflake.
- Develop Snowpipe ingestion to load data files from S3 buckets into Snowflake.
- Schedule and create DAGs in Airflow using Python.
Environment: Hadoop, Spark, Hive, SQL, Python, Linux, AWS EMR, Athena, Redshift, S3, EC2, Glue, SNS, Lambda, CodeBuild, Shell Scripting, VS Code, Bitbucket, Snowflake
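Illustrative PySpark ETL sketch (referenced from the ETL-pipeline bullet above). Bucket names, columns, and the Redshift connection details are hypothetical placeholders; the actual pipelines differ in detail.

    # Hedged PySpark ETL sketch: join source datasets and land date-partitioned Parquet in S3.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    clicks = spark.read.parquet("s3://example-raw/clicks/")
    users = spark.read.parquet("s3://example-raw/users/")

    joined = (
        clicks.join(users, "user_id", "inner")
              .withColumn("event_date", F.to_date("event_ts"))
              .dropDuplicates(["event_id"])
    )

    # date-partitioned Parquet lets Athena/Glue tables prune partitions on incremental loads
    (joined.write
           .mode("append")
           .partitionBy("event_date")
           .parquet("s3://example-curated/clicks_enriched/"))

    # optional Redshift load over generic JDBC (URL/credentials are placeholders;
    # the Redshift JDBC driver must be on the classpath)
    (joined.write
           .format("jdbc")
           .option("url", "jdbc:redshift://example-cluster:5439/dev")
           .option("dbtable", "analytics.clicks_enriched")
           .option("user", "example_user")
           .option("password", "example_password")
           .mode("append")
           .save())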
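Illustrative Airflow sketch (referenced from the Airflow scheduling bullets above). The DAG id, schedule, and script paths are hypothetical, and Airflow 2.x's BashOperator is an assumption about the setup.

    # Hedged Airflow DAG sketch: schedule a daily Spark pipeline with a validation step.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_clicks_pipeline",      # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:

        ingest = BashOperator(
            task_id="ingest_raw",
            bash_command="spark-submit /opt/jobs/ingest_raw.py {{ ds }}",
        )

        transform = BashOperator(
            task_id="transform_curated",
            bash_command="spark-submit /opt/jobs/transform_curated.py {{ ds }}",
        )

        validate = BashOperator(
            task_id="validate_counts",
            bash_command="python /opt/jobs/validate_counts.py {{ ds }}",
        )

        # run ingestion, then transformation, then validation/alerting
        ingest >> transform >> validate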
Confidential, CT
Sr. Data Engineer
Responsibilities:
- Worked on building a centralized data lake on AWS Cloud utilizing primary services such as S3, EMR, Redshift, and Athena.
- Worked on migrating datasets and ETL workloads from on-prem systems to AWS Cloud services.
- Built a series of Spark applications and Hive scripts to produce various analytical datasets needed by digital marketing teams.
- Worked extensively on building and automating data ingestion pipelines, moving terabytes of data from existing data warehouses to the cloud.
- Worked extensively on fine-tuning Spark applications and provided production support for various pipelines running in production.
- Worked closely with business and data science teams to ensure all requirements were translated accurately into our data pipelines.
- Worked on full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.
- Worked on automating infrastructure setup, including launching and terminating EMR clusters.
- Created Hive external tables on top of datasets loaded into S3 buckets and wrote various Hive scripts to produce a series of aggregated datasets for downstream analysis.
- Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift (see the streaming sketch after this role's Environment line).
- Worked on creating Kafka producers with the Kafka Java Producer API, connecting to an external REST live-stream application and producing messages to a Kafka topic.
Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka
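Illustrative streaming sketch (referenced from the real-time pipeline bullet above). Brokers, topic, schema, and paths are hypothetical, Spark Structured Streaming is assumed as the streaming API, and micro-batches are landed as Parquet for downstream Redshift loads; it also assumes the spark-sql-kafka package is available.

    # Hedged Spark Structured Streaming sketch: consume a Kafka topic and land Parquet micro-batches.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_ts", TimestampType()),
        StructField("payload", StringType()),
    ])

    raw = (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
             .option("subscribe", "web-events")                   # placeholder topic
             .option("startingOffsets", "latest")
             .load()
    )

    # Kafka values arrive as bytes; parse the JSON payload into typed columns
    events = (
        raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
           .select("e.*")
    )

    query = (
        events.writeStream
              .format("parquet")
              .option("path", "s3://example-stream/web_events/")
              .option("checkpointLocation", "s3://example-stream/checkpoints/web_events/")
              .trigger(processingTime="1 minute")
              .start()
    )

    query.awaitTermination()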
Confidential, Burbank, CA
Hadoop Developer
Responsibilities:
- Involved in importing and exporting data between Hadoop Data Lake and Relational Systems like Oracle, MySQL using Sqoop.
- Involved in developing Spark applications to perform ELT-style operations on the data.
- Migrated existing MapReduce jobs to Spark transformations and actions utilizing Spark RDDs, DataFrames, and the Spark SQL API.
- Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.
- Involved in creating Hive external tables to perform ETL on data produced on a daily basis.
- Validated the data being ingested into Hive for further filtering and cleansing.
- Developed Sqoop jobs to perform incremental loads from RDBMS into HDFS, then applied further Spark transformations.
- Loaded data into Hive tables from Spark using the Parquet columnar format (see the sketch after this role's Environment line).
- Created Oozie workflows to automate and productionize the data pipelines.
- Migrated MapReduce code into Spark transformations using Spark and Scala.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Documented and tracked operational problems by following standards and procedures using JIRA.
Environment: Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, Map Reduce, GIT, Confluence, Jenkins.
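Illustrative sketch of the transform-and-load step that followed the Sqoop incremental imports above (referenced from the Parquet/Hive bullet). Paths, columns, and table names are hypothetical placeholders.

    # Hedged PySpark sketch: pick up a day's Sqoop-landed files, clean them,
    # and append a new partition to a Parquet-backed Hive table.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("incremental-load-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    load_date = "2023-01-01"  # in practice passed in by the Oozie workflow

    raw = spark.read.option("header", "true").csv(f"/data/landing/orders/dt={load_date}/")

    cleaned = (
        raw.filter(F.col("order_id").isNotNull())
           .withColumn("amount", F.col("amount").cast("double"))
           .withColumn("dt", F.lit(load_date))
    )

    (cleaned.write
            .mode("append")
            .format("parquet")
            .partitionBy("dt")
            .saveAsTable("analytics.orders"))   # hypothetical database.table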
Confidential
Big Data/Hadoop Developer
Responsibilities:
- Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation and summarization activities according to the requirement.
- Loaded data into Spark RDDs and performed in-memory data computation to generate output as per the requirements.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
- Developed Spark jobs, Hive jobs to summarize and transform data.
- Worked on performance tuning of Spark applications to reduce job execution times.
- Performance-tuned Spark jobs by changing configuration properties and using broadcast variables.
- Streamed data in real time using Spark with Kafka; responsible for handling streaming data from web server console logs.
- Worked on different file formats such as text, Avro, Parquet, JSON, XML, and flat files using MapReduce programs.
- Developed daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
- Wrote Pig Scripts to generate transformations and performed ETL procedures on the data in HDFS.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used Hive queries (HQL) to query data in Hive tables and loaded data into HBase tables.
- Extensively worked with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and worked on optimizing Hive queries (see the sketch after this role's Environment line).
- Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
- Assisted analytics team by writing Pig and Hive scripts to perform further detailed analysis of the data.
- Designed Oozie workflows for job scheduling and batch processing.
Environment: Java, Scala, Apache Spark, MySQL, CDH, IntelliJ IDEA, Hive, HDFS, YARN, Map Reduce, Sqoop, PIG, Flume, Unix Shell Scripting, Python, Apache Kafka
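Illustrative sketch of the Hive external-table and dynamic-partitioning pattern described above, driven from Spark SQL (bucketing was handled similarly in the Hive DDL). Database, table, and column names and locations are hypothetical.

    # Hedged Spark SQL sketch: external staging table plus a managed, partitioned table
    # loaded with a dynamic-partition insert.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("CREATE DATABASE IF NOT EXISTS staging")
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

    # allow the partition value to come from the data itself
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # external table over raw files landed by Flume/Sqoop
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS staging.page_views_raw (
            user_id STRING,
            page    STRING,
            view_ts TIMESTAMP
        )
        STORED AS PARQUET
        LOCATION '/data/staging/page_views'
    """)

    # managed table partitioned by day so date filters prune partitions
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.page_views (
            user_id STRING,
            page    STRING,
            view_ts TIMESTAMP
        )
        PARTITIONED BY (view_date STRING)
        STORED AS PARQUET
    """)

    # dynamic-partition insert: view_date is derived from each row's timestamp
    spark.sql("""
        INSERT INTO TABLE analytics.page_views PARTITION (view_date)
        SELECT user_id, page, view_ts, CAST(to_date(view_ts) AS STRING) AS view_date
        FROM staging.page_views_raw
    """)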
Confidential
Java Developer
Responsibilities:
- Worked on developing the application involving Spring MVC implementations and Restful web services.
- Responsible for designing Rich user Interface Applications using JavaScript, CSS, HTML, XHTML and AJAX.
- Developed code using Core Java to implement technical enhancement following Java Standards.
- Worked with Swing and RCP using Oracle ADF to develop a search application which is a migration project.
- Implemented Hibernate utility classes, session factory methods, and different annotations to work with back-end database tables.
- Implemented Ajax calls using JSF-Ajax integration and implemented cross-domain calls using jQuery Ajax methods.
- Implemented object-relational mapping in the persistence layer using the Hibernate framework in conjunction with Spring functionality.
- Used JPA (Java Persistence API) with Hibernate as Persistence provider for Object Relational mapping.
- Used JDBC and Hibernate for persisting data to different relational databases.
- Developed and implemented a Swing-, Spring-, and J2EE-based MVC (Model-View-Controller) framework for the application.
- Implemented application-level persistence using Hibernate and Spring.
- Integrated Data Warehouse (DW) data from different sources in different formats (PDF, TIFF, JPEG, web crawl, and RDBMS data from MySQL, Oracle, SQL Server, etc.).
- Used XML and JSON for transferring/retrieving data between different Applications.
- Wrote complex PL/SQL queries using joins, stored procedures, functions, triggers, cursors, and indexes in the data access layer.
- Developed back-end interfaces using embedded SQL, PL/SQL packages, stored procedures, functions, triggers, and exception handling in PL/SQL programs.
- Used Log4j to capture logs, including runtime exceptions, and for logging info.
- Used ANT as the build tool and developed build files for compiling the code and creating WAR files.
- Used Tortoise SVN for Source Control and Version Management.