Data Engineer Resume

Newark, NJ

SUMMARY

  • 7+ years of IT experience in analysis, design and development with Big Data technologies such as Spark, MapReduce, Hive, Kafka and HDFS, and programming languages including Java, Scala and Python.
  • Strong experience in end-to-end data engineering, including data ingestion, data cleansing, data transformation, data validation/auditing and feature engineering.
  • Strong experience working with Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka and Oozie.
  • Good understanding of Distributed Systems architecture and design principles behind Parallel Computing.
  • Expertise in developing production-ready Spark applications in Scala and Python.
  • Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.
  • Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, Spark SQL and Spark ML frameworks to build end-to-end data pipelines.
  • Worked extensively on building real-time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real-time consumption and processing.
  • Strong experience using Hive for data analysis.
  • Good hands-on experience working with various Hadoop distributions, mainly Cloudera (CDH), Hortonworks (HDP) and Amazon EMR.
  • Implemented Databricks data flow pipelines to load and transform data into Azure SQL DW.
  • Implemented CI/CD using Azure DevOps.
  • Good exposure to NoSQL databases, including the column-oriented HBase and Cassandra and the document-based MongoDB.
  • Experience importing and exporting data between HDFS and relational database management systems (RDBMS) using Sqoop.
  • Solid experience working with CSV, text, SequenceFile, Avro, Parquet, ORC and JSON data formats.
  • Proficient knowledge of and hands-on experience in writing shell scripts on Linux.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
  • Excellent technical and analytical skills, with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Expertise in all phases of the Software Development Life Cycle (SDLC), Agile software development, Scrum methodology and test-driven development.
  • Defined user stories and drove the Agile board in JIRA during project execution; participated in sprint demos and retrospectives.
  • Experience using version control tools such as Git and SVN.

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, HDFS, MapReduce, Hive, Sqoop, Pig, HBase, Flume, Oozie, Impala, Kafka, Spark

Programming Languages: Python, Scala and Java

Scripting Languages: Shell Scripting

Databases: MySQL, Teradata, Oracle

IDE & ETL Tools: Eclipse, IntelliJ, Maven, Jenkins

NoSQL Databases: HBase, Cassandra, MongoDB

Other Tools: PuTTY, WinSCP, Amazon AWS Console, Apache Ambari, Cloudera Manager

Version Control: GitHub, SVN, CVS

Methodologies: Agile, Waterfall

Operating Systems: Windows, Mac, Linux

PROFESSIONAL EXPERIENCE

Confidential, Newark, NJ

Data Engineer

Responsibilities:

  • Developed Spark applications in Scala using the DataFrame and Spark SQL APIs for faster data processing.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirements.
  • Built a data pipeline consisting of Spark, Hive, Sqoop and custom-built input adapters to ingest, transform and analyze user behavior (clickstream) data.
  • Developed Spark jobs and Hive jobs to summarize and transform data.
  • Worked closely with machine learning team to build feature datasets on a continuous basis for assisting them in model training and model scoring activities.
  • Built the entire data pipeline on AWS using native services: S3 as the data lake, EMR as the Spark/Hive cluster, Redshift and Athena as downstream query engines, and Simple Workflow (SWF) as the orchestrator.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Automated creation and termination of AWS EMR clusters using the AWS SDK for Java.
  • Involved in deploying Spark and Hive applications on the AWS stack.
  • Imported data from different data sources into S3 using Sqoop and performed transformations using Hive and Spark.
  • Helped DevOps engineers deploy code and debug issues.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed HiveQL scripts to denormalize and aggregate the data.

Environment: AWS EMR, S3, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java

Confidential, Columbus, Ohio

Big Data Engineer

Responsibilities:

  • Responsible for the design, implementation and architecture of very large-scale data intelligence solutions around big data platforms.
  • Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
  • Created data flow pipelines to process the CSV data using Spark.
  • Implemented Databricks data flow pipelines to load and transform data into Azure SQL DW.
  • Implemented CI/CD using Azure DevOps.
  • Created an ingestion framework in Python that extracts data from on-premises SQL servers to Azure using Sqoop.
  • Designed and developed CI/CD pipelines through Azure DevOps.
  • Worked with Azure services including Azure HDInsight, Azure Databricks, Azure Data Factory and Azure SQL DW.
  • Worked on SQL queries in dimensional and relational data warehouses. Performed data analysis and data profiling using complex SQL queries on various systems.
  • Troubleshot and resolved data processing issues and proactively engaged in data modeling discussions.
  • Worked with the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions.
  • Wrote Spark programs in Python (PySpark) for performance tuning, optimization and data quality validation.
  • Developed Kafka producers and consumers for streaming millions of events per second.
  • Implemented a distributed messaging queue using Apache Kafka to integrate with Cassandra.
  • Fetched live stream data from UDB into HBase tables using PySpark Streaming and Apache Kafka.
  • Used Tableau to build customized interactive reports, worksheets and dashboards.

Environment: HDFS, Python, SQL, Spark, Scala, Kafka, Hive, YARN, Sqoop, Tableau, Azure Cloud, GitHub, Shell Scripting.

Confidential

Big Data Developer

Responsibilities:

  • Developed a shell script to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs.
  • Developed Oozie workflows for executing Sqoop and Hive actions, and created HBase tables to load large sets of semi-structured data coming from various sources.
  • Performed performance optimizations on Spark and Python code.
  • Diagnosed and resolved performance issues in Spark.
  • Developed Python wrapper scripts that extract a specific date range using Sqoop by passing the custom properties required for the workflow.
  • Developed scripts to run Oozie workflows, capture the logs of all jobs that run on the cluster and create a metadata table that records the execution time of each job.
  • Converted existing MapReduce applications to PySpark applications as part of an overall effort to streamline legacy jobs and create a new framework.

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Oozie, Impala, Java (JDK 1.8), Cloudera, Python, UNIX Shell Scripting, Flume, Scala, Spark, Sqoop, Kafka, Oracle.

Confidential

Java Developer

Responsibilities:

  • Involved in the complete software development life cycle (SDLC) of the application, from requirement analysis to testing.
  • Involved in designing database connections using JDBC.
  • Developed the business components (in core Java) used for the calculation module (calculating various entitlement attributes).
  • Created complex SQL queries and PL/SQL stored procedures and functions for the back end.
  • Prepared the Functional, Design and Test case specifications.
  • Wrote stored procedures in Oracle to perform database-side validations.
  • Performed unit testing, system testing and integration testing.
  • Used Oracle SQL 4.0 as the database and wrote SQL queries in the DAO layer.
  • Developed the application using Core Java, JDBC, JSP, Servlets, Spring, Hibernate, Web Services, SOAP and WSDL.
  • Used RESTful services to interact with the client by providing RESTful URL mappings.
  • Implemented the project using Agile Scrum methodology; took part in daily stand-up meetings, sprint showcases and sprint retrospectives.
  • Used SVN and GitHub as version control tools.
  • Implemented Hibernate in the data access object layer to access and update information in the Oracle 10g Database.
  • Developed the presentation layer using HTML, JSP, Ajax, CSS and jQuery.
  • Used JIRA to track test results and worked with developers to resolve issues.
  • Used XSLT to transform XML data structures into HTML pages.
  • Deployed EJB components on Tomcat. Used the JDBC API to interact with the Oracle database.
  • Wrote build and deployment scripts using shell, Perl and Ant.

Environment: Java, Spring, Hibernate, PL/SQL, Oracle, HTML, JavaScript, Ajax, Servlets, JSP, SOAP, SDLC, Scrum, JIRA, GitHub, jQuery, CSS, XML, Ant, Tomcat Server, Jasper Reports.