
Data Engineer Resume

New York, NY


  • 4+ years of professional IT experience in Big Data technologies, architecture, and systems.
  • Hands-on experience with CDH and HDP Hadoop ecosystem components such as Hadoop, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Oozie, ZooKeeper, Kafka and Flume.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
  • Experienced in importing and exporting data using the stream-processing platforms Flume and Kafka
  • Wrote Hive UDFs as required and executed complex HQL queries to extract data from Hive tables
  • Used partitioning and bucketing in Hive and designed both managed and external tables for performance optimization
  • Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
  • Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra
  • Experienced in workflow scheduling and locking tools/services like Oozie and Zookeeper
  • Practiced ETL methods in enterprise-wide solutions, data warehousing, reporting and data analysis
  • Experienced in working with AWS, using EMR and EC2 for compute and S3 for storage
  • Developed Impala scripts for extracting, transforming, and loading data into the data warehouse
  • Good knowledge in using Apache NiFi to automate the data movement between Hadoop systems
  • Used Pig scripts for transformations, event joins, filters and pre-aggregations for HDFS storage
  • Used Sqoop to import and export data between HDFS and RDBMSs including Oracle, MySQL and MS SQL Server
  • Good knowledge of UNIX shell scripting for automating deployments and other routine tasks
  • Experienced in using IDEs like Eclipse, NetBeans, IntelliJ.
  • Used JIRA and Rally for bug tracking and GitHub and SVN for various code reviews and unit testing
  • Experienced in working in all phases of SDLC - both agile and waterfall methodologies
  • Good understanding of Agile Scrum methodology, Test Driven Development and CI-CD


Data Engineer

Confidential, New York, NY


  • Built scalable distributed data solutions using Hadoop.
  • Imported data from various sources into HDFS using Sqoop, applied transformations using Hive and Apache Spark, and loaded the results into Hive tables or AWS S3 buckets.
  • Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop.
  • Configured Splunk alerts to capture log files during execution and stored them in an S3 bucket location while the cluster was running.
  • Involved in implementing Hive/SQL queries as Spark transformations using Spark RDDs and Python (PySpark).
  • Wrote Oozie scripts to schedule and automate jobs on the EMR cluster.
  • Used Bitbucket as the code repository and integrated it with Bamboo for continuous integration.
  • Experienced in bringing up EMR clusters and deploying code stored in S3 buckets to the cluster.
  • Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.
  • Developed Apache Spark applications in Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Experienced in developing various Spark Streaming jobs using Python (PySpark) and Scala.
  • Developed Spark code applying various transformations and actions for faster data processing.
  • Working knowledge of Apache Spark Streaming, which enables scalable, high-throughput, fault-tolerant processing of live data streams.
  • Used Spark Streaming with Scala to load data into memory, implemented RDD transformations, and performed actions.
  • Involved in using various Python libraries with Spark to create DataFrames and store them in Hive.
  • Created Sqoop jobs and Hive queries to ingest data from relational databases for comparison with historical data.
  • Experienced in working with Amazon Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Experienced in migrating HiveQL queries to Impala to minimize query response time.
  • Knowledge of running Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
  • Knowledge of creating user-defined functions (UDFs) in Hive.
  • Worked with file formats such as text, Avro, and Parquet for Hive querying and processing based on business logic.
  • Involved in pulling data from an Amazon S3 bucket into the data lake, building Hive tables on top of it, and creating DataFrames in Spark for further analysis.
  • Worked on SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning to improve Hive performance and storage.
  • Involved in test-driven development, writing unit and integration test cases for the code.
  • Implemented Hive UDFs for business logic and performed extensive data validation using Hive.
  • Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
  • Developed code to generate various DataFrames based on business requirements and created temporary tables in Hive.
  • Utilized AWS CloudWatch to monitor instances in the performance environment for operational and performance metrics during load testing.
  • Experienced in writing build scripts with Maven and in continuous integration with Bamboo.
  • Used JIRA to create user stories and created branches in the Bitbucket repositories for each story.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Environment: Hadoop, HDFS, Hive, Oozie, Sqoop, Spark, ETL, ESP Workstation, Shell Scripting, HBase, GitHub, Tableau, Oracle, MySQL, Agile/Scrum

Data Engineer

Confidential, Herndon, VA


  • Responsible for building scalable distributed data solutions using Hadoop and Spark with Scala.
  • Installed and configured Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Worked with a senior engineer on configuring Kafka for streaming data.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Performed processing on large sets of structured, unstructured and semi-structured data.
  • Created applications that monitor consumer lag within Apache Kafka clusters.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Worked with the Spark ecosystem using Scala and Hive queries on data formats such as text files and Parquet.
  • Implemented business logic by writing UDFs in Java and used various UDFs.
  • Responsible for migrating from Hadoop to the Spark framework, using in-memory distributed computing for real-time fraud detection.
  • Used Spark to store data in-memory.
  • Implemented batch processing of data sources using Apache Spark.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Developed predictive analytics using Apache Spark Scala APIs.
  • Managed cluster coordination services through ZooKeeper.
  • Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Apache Sqoop, Spark, Oozie, HBase, AWS, PL/SQL, MySQL and Windows

Java Developer

Confidential, Detroit, MI


  • Understood and analyzed the requirements.
  • Implemented server-side programs by using Servlets and JSP.
  • Designed, developed, and validated the user interface using HTML, JavaScript, XML and CSS.
  • Implemented MVC using Struts Framework.
  • Handled the database access by implementing Controller Servlet.
  • Implemented PL/SQL stored procedures and triggers.
  • Used JDBC prepared statements, called from Servlets, for database access.
  • Designed and documented the stored procedures.
  • Widely used HTML for web-based design.
  • Involved in Unit testing for various components.
  • Worked on the database interaction layer for insert, update, and retrieval operations on the Oracle database by writing stored procedures.
  • Involved in developing, in C/C++, a simulator used by controllers to simulate real-time scenarios.
  • Used Spring Framework for Dependency Injection and integrated with Hibernate.
  • Involved in writing JUnit Test Cases.
  • Used Log4j to log errors in the application.

Environment: Java, J2EE, JSP, Servlets, HTML, DHTML, XML, JavaScript, Struts, C/C++, Eclipse, WebLogic, PL/SQL, and Oracle.
