We provide IT Staff Augmentation Services!

Data Engineer Resume


  • Big Data professional with 6+ years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
  • 4 years of experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
  • High Exposure on Big Data technologies and Hadoop ecosystem, In - depth understanding of Map Reduce and Hadoop Infrastructure.
  • Expertise in writing end to end Data processing Jobs to analyze data using MapReduce, Spark and Hive.
  • Experience with Apache Spark ecosystem using Spark-Core, SQL, Data Frames, RDD's and knowledge on Spark MLLib.
  • Extensive Knowledge on developing Spark Streaming jobs by developing RDD’s (Resilient Distributed Datasets) using Scala, PySpark and Spark-Shell.
  • Experienced in data manipulation using python for loading and extraction as well as with python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
  • Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
  • Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
  • Expertise in writing Map-Reduce Jobs in Python for processing large sets of structured, semi-structured and unstructured data sets and stores them in HDFS.
  • Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instance.
  • Hands on experience working Amazon Web Services (AWS) using Elastic Map Reduce (EMR), Redshift, and EC2 for data processing.
  • Hands on experience in SQL and NOSQL database such as Snowflake, HBase, Cassandra and MongoDB.
  • Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Strong experience in working with UNIX/LINUX environments, writing shell scripts.
  • Excellent knowledge of J2EE architecture, design patterns, object modeling using various J2EE technologies and frameworks with Comprehensive experience in Web-based applications using J2EE Frameworks like Spring, Hibernate, Struts and JMS.
  • Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
  • Experienced in working in SDLC, Agile and Waterfall Methodologies.
  • Strong skills in analytical, presentation, communication, problem solving with the ability to work independently as well as in a team and had the ability to follow the best practices and principles defined for the team.


Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Hadoop Distribution: Cloudera distribution and Horton works.

Programming Languages: Scala, Spring, Hibernate, JDBC, JSON, HTML, CSS

Script Languages: JavaScript, jQuery, Python, Shell Script(bash,sh)

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB

Operating Systems: Linux, Windows, Ubuntu, Unix

Web/Application server: Apache Tomcat, WebLogic, WebSphere Tools Eclipse, NetBeans

IDE: Intellij, Eclipse and NetBeans

Version controls and Tools: GIT, Maven, SBT, CBT



Data Engineer


  • Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
  • Implemented a proof of concept deploying this product in AWS S3 bucket and Snowflake.
  • Utilize AWS services with focus on big data architect /analytics / enterprise Data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, performance, and to provide meaningful and valuable information for better decision-making.
  • Developed Scala scripts, UDF's using both data frames/SQL and RDD in Spark for data aggregation, queries and writing back into S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using Python and Scala as needed through various sources such as API, AWS S3,Teradata and snowflake.
  • Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
  • Implemented Spark RDD transformations to Map business analysis and apply actions on top of transformations.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created scripts to read CSV, json and parquet files from S3 buckets in Python and load into AWS S3, DynamoDB and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB table or S3 bucket or to HTTP requests using Amazon API gateway
  • Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
  • Worked on Snowflake Schemas and Data Warehousing and processed batch and streaming data load pipeline using Snow Pipe and Matillion from data lake Confidential AWS S3 bucket.
  • Profile structured, unstructured, and semi-structured data across various sources to identify patterns in data and Implement data quality metrics using necessary query’s or python scripts based on source.
  • Install and configure Apache Airflow for S3 bucket and Snowflake data warehouse and created dags to run the Airflow.
  • Created DAG to use the Email Operator, Bash Operator and spark Livy operator to execute and in EC2 instance.
  • Deploy the code to EMR via CI/CD using Jenkins
  • Extensively used Code cloud for code check-in and checkouts for version control.

Environment: AgileScrum,MapReduce,Snowflake,Pig,Spark,Scala,Hive,Kafka,Python,Airflow,JSON,Parquet,CSV,Codecloud, AWS


Big Data Engineer


  • Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
  • Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities like Apache Spark written in Scala
  • Developed spark applications for performing large scale transformations and denormalization of relational datasets.
  • Have real-time experience of Kafka-Storm on HDP 2.2 platform for real time analysis.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Created reports for the BI team using Sqoop to export data into HDFS and Hive.
  • Performed analysis on the unused user navigation data by loading into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front end developers and lucent team.
  • Loading the data from multiple Data sources like (SQL, DB2, and Oracle) into HDFS using Sqoop and load into Hive tables.
  • Created HIVE Queries to process large sets of structured, semi-structured and unstructured data and store in Managed and External tables.
  • Developed Complex HiveQL's using SerDe JSON
  • Created HBase tables to load large sets of structured data.
  • Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports
  • Performed Real time event processing of data from multiple servers in the organization using Apache Storm by integrating with apache Kafka.
  • Managed and reviewed Hadoop log files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Worked on PySpark APIs for data transformations.
  • Data ingestion to Hadoop (Sqoop imports). To perform validations and consolidations for the imported data.
  • Extending Hive and Pig core functionality by writing custom UDF's for Data Analysis.
  • Upgraded current Linux version to RHEL version 5.6
  • Expertise in hardening, Linux Server and Compiling, Building and installing Apache Server from sources with minimum modules
  • Worked on JSON, Parquet, Hadoop File formats.
  • Worked on different Java technologies like Hibernate, spring, JSP, Servlets and developed code for both server side and client side for our web application.
  • Used Git hub for continuous integration services.

Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, GIT, JSON SerDe, Cloudera.

Hire Now