Sr. Data Engineer Resume

Bentonville, AR


  • 8 years of experience in analysis, design, development, and implementation as a Data Engineer.
  • Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies such as HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, Apache Cassandra, and HBase.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience in using various tools like Sqoop, Flume, Kafka, NiFi and Pig to ingest structured, semi-structured and unstructured data into the cluster.
  • Proficient with the Apache Spark ecosystem, including Spark and Spark Streaming, using Scala and Java.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
  • Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Experience in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
  • Expertise in writing MapReduce jobs in Java for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
  • Experience working with Python, UNIX, and shell scripting.
  • Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files and Databases.
  • Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, and with Microsoft Azure.
  • Experience with complete Software Development Life Cycle (SDLC) process which includes Requirement Gathering, Analysis, Designing, Developing, Testing, Implementing and Documenting.
  • Hands-on experience with Spark architecture and its integrations, such as Spark SQL, DataFrames, and Datasets APIs.
  • Worked on Spark to improve the performance of existing Hadoop processing, using SparkContext, Spark SQL, DataFrames, and RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Python.
  • Hands-on experience accessing Hive tables from Spark, performing transformations, and creating DataFrames on Hive tables.
  • Used Spark Structured Streaming to perform necessary transformations.
  • Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
  • Strong understanding of AWS components such as EC2 and S3
  • Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.
  • Exposure to CI/CD tools - Jenkins for Continuous Integration, Ansible for continuous deployment.
  • Worked with waterfall and Agile methodologies.


Sr. Data Engineer

Confidential, Bentonville, AR


  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Designed and Implemented Big Data Analytics architecture/pipeline.
  • Experienced in writing Spark Applications in Scala/Python.
  • Analyzed large and critical datasets using Azure Cloud, ADLS Gen2, ADB, Delta Lake, Hive, Hive UDFs, and Spark.
  • Developed production-ready Spark code to extract batch data from Event Hubs and various other sources into Azure Data Lake Storage (ADLS Gen2) in Parquet format, generating one Parquet file per minute.
  • Built multiple notebooks and chained the jobs together into an ETL pipeline that generates the wear model's predictions and writes the outputs to Azure Cosmos DB.
  • Analyzed the SQL scripts and designed the solution to implement using Spark.
  • Used the Kafka consumer API in Scala to consume data from Kafka topics.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Implemented ADB job scheduler for authoring, scheduling, and monitoring Data Pipelines
  • Developed Python code to gather data from APIs and designed the solution to implement using Spark.
  • Implemented a Kafka model that pulls the latest records into Hive external tables.
  • Imported data into ADLS from various APIs, SQL databases, and files using ETL, and from streaming systems.
  • Loaded all datasets into Hive from source CSV files using Spark.
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a NoSQL store (HBase).
  • Migrated the computational code in HQL to PySpark.
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Worked with the data science team to integrate the data engineering pipeline with the Tire Wear model so that it produces correct predictions, and fixed other issues.
  • Sound knowledge in programming Spark using Scala.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Experienced in Importing and exporting data into HDFS and Hive using Sqoop.
  • Populated HDFS and HBase with large amounts of data using Apache Kafka.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Deployed Spark jobs on Amazon EMR and ran them on AWS clusters.
  • Strong experience in implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
  • Leveraged cloud platforms such as AWS and GCP, along with GPU computing, for automated machine learning and analytics pipelines.

Environment: Hadoop, Hive, Scala, Azure Databricks, Kafka, Flume, HBase, Java, AWS, Hortonworks, Oracle 10g/11g/12C, Teradata, Cassandra, HDFS, Data Lake, Spark, MapReduce, Ambari, Tableau, NoSQL, Shell Scripting, Ubuntu.

Sr. Data Engineer

Confidential, Rochester MN


  • Designed a stream processing job for Spark Streaming, coded in Scala.
  • Developed real time data processing applications by using Scala and Java and implemented Apache Spark Streaming from various streaming sources like Kafka and JMS.
  • Used PySpark and Hive to analyze sensor data and cluster users based on their behavior in the events.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data.
  • Developed Spark programs and created the data frames and worked on transformations.
  • Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and partitions.
  • Applied Kafka custom encoders for custom input format to load data into Kafka Partitions.
  • Implemented a POC with Hadoop, extracting data with Spark into HDFS.
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed code to read data streams from Kafka and send them to the respective bolts through the respective streams.
  • Created Databricks notebooks using SQL and Python, and automated the notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Implemented applications with Scala along with Akka and Play framework.
  • Optimized the code using Spark for better performance
  • Worked on Spark Streaming using Apache Kafka for real-time data processing.
  • Developed MapReduce jobs using the MapReduce Java API and HiveQL.
  • Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
  • Developed scripts and batch jobs to schedule a bundle (group of coordinators) consisting of various Hadoop programs using Oozie.
  • Prepared data warehouse using Star/Snowflake schema concepts in Snowflake using SnowSQL.
  • Experienced in optimizing Hive queries and joins to handle different data sets.
  • Involved in ETL, Data Integration and Migration by writing Pig scripts.
  • Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources. Processed metadata files into AWS S3 and an Elasticsearch cluster.
  • Integrated Hadoop with Solr and implemented search algorithms.
  • Experience in Storm for handling real-time processing.
  • Hands on Experience working in Hortonworks distribution.
  • Worked hands-on with NoSQL databases such as MongoDB for POC purposes, storing images and URIs.
  • Designed and implemented MongoDB and associated RESTful web service.
  • Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks.
  • Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
  • Skilled at using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
  • Experience in processing large volumes of data and in parallel execution of processes using Talend functionality.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Used Talend tool to create workflows for processing data from multiple source systems.

Environment: MapReduce, HDFS, Sqoop, Java, Spark, Linux, Oozie, Hadoop, Pig, Hive, Solr, Spark Streaming, Kafka, Storm, Scala, Akka, MongoDB, Hadoop Cluster, Amazon Web Services, Talend.

Data Engineer

Confidential, Lake Success, NY


  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
  • Extracted a real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, NameNode recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from different sources such as HDFS and HBase into Spark RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Java.
  • Involved in migrating ETL processes from Oracle to Hive to test ease of data manipulation, and worked on importing and exporting data between Oracle, DB2, HDFS, and Hive using Sqoop.
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Java on top of YARN/MRv2 for interactive and batch analysis.
  • Monitored the cluster for performance, networking, and data integrity issues, and troubleshot MapReduce job failures by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
  • Developed and analyzed SQL scripts and designed the solution to implement using Spark.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Supported MapReduce programs and distributed applications running on the Hadoop cluster, and scripted Hadoop package installation and configuration to support fully automated deployments.
  • Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for processing and storing large data sets, worked with Elastic MapReduce, and set up the Hadoop environment on AWS EC2 instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Performed maintenance, monitoring, deployments, and upgrades across the infrastructure supporting all our Hadoop clusters.
  • Created Hive external tables, loaded data into them, and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Worked with and learned a great deal from AWS Cloud services such as EC2, S3, EBS, RDS, and VPC.
  • Monitored the Hadoop cluster and maintained it by adding and removing nodes, using tools such as Nagios, Ganglia, and Cloudera Manager.
  • Worked on Hive to expose data for further analysis and to transform files from different analytical formats to text files.

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Hortonworks, Cloudera Manager, ZooKeeper, Cloudera, Oracle, Kerberos, and RedHat 6.5

Bigdata Developer



  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing and assisted with data capacity planning and node forecasting.
  • Involved in the design and ongoing operation of several Hadoop clusters; configured and deployed the Hive Metastore using MySQL and the Thrift server.
  • Implemented and operated on-premises Hadoop clusters from the hardware to the application layer including compute and storage.
  • Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
  • Designed custom deployment and configuration automation systems to allow for hands-off management of clusters via Cobbler, FUNC, and Puppet.
  • Prepared complete documentation of the Phase-II Talend job design and goals based on knowledge transfer, and documented the support and maintenance work to be followed in Talend.
  • Deployed the company's first Hadoop cluster, running Cloudera's CDH2 on 44 nodes storing 160 TB and connected via 1 Gb Ethernet.
  • Debugged and resolved major issues with Cloudera Manager by working with the Cloudera team.
  • Modified reports and Talend ETL jobs based on feedback from QA testers and users in development and staging environments.
  • Handled importing other enterprise data from different data sources into HDFS using Sqoop and performing transformations using Hive, MapReduce and then loading data into HBase tables.
  • Involved in cluster maintenance and removal of nodes using Cloudera Manager.
  • Collaborated with application development teams to provide operational support, platform expansion, and upgrades for Hadoop Infrastructure including upgrades to CDH3.
  • Participated in the Hadoop development Scrum; installed and configured Cognos 8.4/10 and Talend ETL on single- and multi-server environments.

Environment: Apache Hadoop, Cloudera, Pig, Hive, Talend, MapReduce, Sqoop, UNIX, Cassandra, Java, Linux, Oracle 11gR2, UNIX Shell Scripting, Kerberos

Software Developer



  • Involved in requirement analysis, design, coding, and unit testing
  • Developed the middle tier using J2EE technologies under Struts framework
  • Developed enterprise application using JSP, Servlet, JDBC and Hibernate.
  • Used Spring to implement MVC (Model View Controller) architecture and Hibernate for mapping Java objects to database tables.
  • Used Spring AOP for cross-cutting concerns such as transaction management and logging web service calls.
  • Used JAXP with SAX for event-driven, serial-access, element-by-element processing of XML.
  • Used JAXP's XSLT support to control the presentation of data and to convert XML documents to other formats, such as HTML.
  • Extensively used design patterns like Singleton, Value Object, Service Delegator and Data Access Object.
  • Developed the core component of recovery management module using Spring MVC Framework
  • Used Spring IOC and configured the Dependency Injection using Spring Context
  • Involved in design and coding utilizing Spring Dependency Injection.
  • Used Log4j as the logging framework to debug the code.
  • Handled the database management using PL/SQL DML and DDL SQL statements.
  • Maintained source code versioning using CVS.
  • Was an integral part of the Scrum process, JSON, and Agile (TDD) methodology.
  • Conducted code review sessions for both features and bug fixes.
  • Used JUnit for Unit testing.

Environment: Java 1.7, Apache Tomcat 6, Spring MVC, MySQL, Hibernate 3.0, JUnit, Log4j, JavaScript, jQuery, HTML, JSP.
