
Data Engineer Resume


  • Over 5 years of professional IT experience, including hands-on experience in Big Data analytics and development.
  • Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Scala, Kafka, YARN, Oozie, and Zookeeper, applied to complex business problems.
  • Advanced T-SQL knowledge, including stored procedures and functions.
  • Experience in designing, developing, testing, and maintaining object-oriented applications in Python, Scala, and Java.
  • Experience in writing complex Hive queries that work with different file formats, such as CSV, SequenceFile, XML, ORC, Parquet, and Avro.
  • Applied Hive optimization techniques such as distributed cache, map-side joins, partitioning, and bucketing.
  • Executed Hive queries on Parquet tables to perform data analysis and meet business requirements.
  • Strong knowledge of Pig and Hive analytical functions; extended Hive and Pig core functionality by writing custom UDFs and macros.
  • Used Python modules such as requests, urllib, and urllib2 for web crawling.
  • Experienced in SQL database connectivity using PySpark and Python.
  • Worked on developing ETL processes to load data from multiple sources into HDFS using Kafka and Sqoop, performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and data processing.
  • Wrote Lambda functions in Python for AWS Lambda that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
  • Hands-on expertise in running Spark and Spark SQL on Amazon Elastic MapReduce (EMR).
  • Followed the Agile SDLC during project development.
  • Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
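The map-side join mentioned in the optimization bullet above can be sketched in plain Python: the small dimension table is broadcast to every mapper as an in-memory dictionary, so each fact row is joined locally without shuffling the large table. This is an illustrative sketch only (the table contents are hypothetical), not Hive's actual implementation:

```python
# Minimal map-side (broadcast) join sketch: the small table is held in
# memory on every mapper, so no shuffle of the large fact table is needed.
small_dim = {"US": "United States", "DE": "Germany"}  # hypothetical lookup table

fact_rows = [("order1", "US", 120.0), ("order2", "DE", 75.5), ("order3", "FR", 9.99)]

def map_side_join(rows, dim):
    """Join each fact row against the broadcast dimension; drop non-matches (inner join)."""
    return [(oid, dim[cc], amt) for oid, cc, amt in rows if cc in dim]

joined = map_side_join(fact_rows, small_dim)
# "FR" has no entry in the dimension, so order3 is dropped by the inner join.
```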


Big Data Technologies: Apache Spark, PySpark, Apache Hadoop, MapReduce, Apache Hive, Apache Pig, Apache Sqoop, Apache Flume, Apache Oozie, Hue, Apache Zookeeper, HDFS, Amazon S3, EC2, EMR.

Programming Language: Scala, Python, Java, SQL

IDE: IntelliJ, Eclipse, Jupyter Notebook, Visual Studio 2005/2008/2012/2015/2017

Databases: Oracle 11g/12C, MySQL, MS-SQL Server, SSDT, SSIS, Teradata

Hadoop Distributions: Cloudera, Hortonworks

Operating systems: Mac OS, Windows 7/10, Linux (CentOS, Redhat, Ubuntu).

Methodologies: Agile, UML, Waterfall, CI/CD

Development Tools: IntelliJ, Maven, Scala Test, GitHub, Jenkins


Data Engineer



  • Proficient in programming with Resilient Distributed Datasets (RDDs)
  • Developed Spark programs using Python APIs to compare Spark performance with Hive and SQL.
  • Experienced in tuning and debugging running Spark applications.
  • Experienced in SQL database connectivity using PySpark and Python.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Provided data extraction and modeling support from Hive, PySpark, and Teradata using complex SQL queries.
  • Fine-tuned big data SQL queries across Hive, PySpark, and Teradata.
  • Implemented Spark jobs using Python and Spark SQL for faster testing and data processing.
  • Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
  • Involved in integrating Kafka with Spark for real-time data processing.
  • Imported data using Sqoop from DB2 and SQL Server into Hive; performed performance tuning and designed and executed complex Hive HQL queries.
  • Developed user-defined functions in Scala to support Spark transformations.
  • Loaded and presented data sets in multiple formats from multiple sources, including JSON, text files, and log data.
  • Developed ingestion scripts to load data from SQL Server, DB2, and various flat-file formats into Hive using PySpark.
  • Built ETL processes using Python scripts with PySpark RDDs.
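The bucketed Hive tables mentioned above rely on a simple idea: each row is routed to a bucket file by hashing its clustering key modulo the bucket count, so equal keys always co-locate. A plain-Python sketch of that assignment (illustrative only; Hive's actual hash function differs, and the key values are hypothetical):

```python
import hashlib

# Sketch of Hive-style bucketing: bucket id = stable hash of the clustering
# key, modulo the configured bucket count.
NUM_BUCKETS = 4

def bucket_id(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # md5 gives a hash that is stable across runs and processes,
    # unlike Python's built-in hash() with randomization enabled.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

rows = ["user1", "user2", "user3", "user1"]
assignments = [bucket_id(r) for r in rows]
# The same key always lands in the same bucket, which is what makes
# bucketed (sort-merge) joins possible without a full shuffle.
```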

Environment: Cloudera, HDFS, MapReduce, Scala, Python, Spark, Hive, Pig, Sqoop, Shell Scripting, MySQL, SQL Server, Tableau, HBase


Data Engineer (Co-op)


  • Handled importing of data from various data sources and performed transformations using Hive and MapReduce.
  • Loaded data into HDFS and extracted the data from MySQL into HDFS using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Managed and scheduled jobs on a Hadoop cluster.
  • Developed simple to complex MapReduce jobs using Hive.
  • Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
  • Optimized MapReduce jobs to use HDFS efficiently via various compression mechanisms.
  • Implemented Partitioning and bucketing in Hive.
  • Mentored the analyst and test teams in writing Hive queries.
  • Managed and reviewed Hadoop log files.
  • Extensively used Pig for data cleansing.
  • Configured Flume to extract the data from the web server output files to load into HDFS.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
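The simple-to-complex MapReduce jobs above all follow the same map/shuffle/reduce pattern. A minimal word-count sketch of the three phases in plain Python (illustrative of the model, not actual Hadoop API code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts per word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hive pig hive", "pig sqoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
# → {"hive": 2, "pig": 2, "sqoop": 1}
```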

Environment: Cloudera Hadoop, MapReduce, HDFS, Hive, Java (JDK 1.7), Pig, Linux, XML, HBase, Zookeeper, Sqoop, Amazon Web Services (AWS).


Data Engineer Intern


  • Worked on Hortonworks Data Platform (HDP 2.4) Hadoop distribution for data querying using Hive to store and retrieve data.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Kafka for incremental loads from a variety of sources, including web servers, RDBMSs, and data APIs.
  • Experienced in implementing Spark RDD transformations and actions for business analysis; worked with accumulators and broadcast variables, and wrote Spark SQL queries using PySpark.
  • Extensive experience in Spark Streaming through the core Spark API, running Scala and Python scripts to transform raw data from several sources into baseline data.
  • Developed MapReduce/ EMR jobs to analyze the data and provide heuristics and reports. The heuristics were used for improving campaign targeting and efficiency.
  • Performed real-time analysis of the incoming data using Kafka consumer API, Kafka topics, Spark Streaming utilizing Scala.
  • Created Hive external tables and views, on the data imported into the HDFS and developed and implemented Hive scripts for transformations such as evaluation, filtering and aggregation.
  • Worked on partitioning and bucketing of Hive tables and running the scripts in parallel to reduce run-time of the scripts.
  • Developed user-defined functions (UDFs) in Python for Hive queries where required.
  • Worked with data in multiple file formats, including Parquet, SequenceFile, ORC, text (delimited)/CSV, and JSON.
  • Worked on creating end-to-end data pipeline orchestration using Oozie.
  • Developed bash scripts to automate the above process of Extraction, Transformation and Loading.
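The Kafka/Spark Streaming work above boils down to grouping a stream of timestamped events into windows and aggregating each window. A plain-Python sketch of a tumbling-window count (the event data is hypothetical; Spark Streaming performs this micro-batch aggregation for you):

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into tumbling windows and count keys
    per window — a plain-Python sketch of Spark Streaming's windowed counts."""
    windows = {}
    for ts, key in events:
        # Each event belongs to exactly one non-overlapping window.
        window_start = ts - (ts % window_seconds)
        windows.setdefault(window_start, Counter())[key] += 1
    return windows

events = [(0, "click"), (3, "view"), (7, "click"), (12, "click")]
result = tumbling_window_counts(events, window_seconds=10)
# Window [0, 10) holds 2 clicks and 1 view; window [10, 20) holds 1 click.
```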

Environment: Hadoop 2.2, AWS, MapReduce, Hive, Python, PySpark, Avro, Kafka, Storm, Linux, Sqoop, Shell Scripting, SQL Server, Oozie, Cassandra, Git, XML, Scala, Java, Maven, Eclipse, Hortonworks.


Big Data Developer


  • Installed and configured Hadoop, Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Installed and configured Pig for ETL jobs.
  • Troubleshot the cluster by reviewing Hadoop log files.
  • Imported data using Sqoop from Teradata using Teradata connector.
  • Used Oozie to orchestrate the work flow.
  • Created Hive tables and worked on them for data analysis to meet business requirements.
  • Designed and implemented MapReduce-based large-scale parallel relation-learning system.
  • Installed and benchmarked Hadoop/HBase clusters for internal use.
  • Experienced with data modeling concepts: star-schema dimensional modeling and relational (ER) design.
  • Created tables, stored procedures in SQL for data manipulation and retrieval, Database Modification using SQL, PL/SQL, Stored procedures, triggers, Views in Oracle 9i.
  • Used Python modules such as requests, urllib, urllib2 for web crawling.
  • Used other packages such as Beautiful Soup for data parsing; wrote and read data in CSV and Excel file formats.
  • Worked on resulting reports of the application and Tableau reports.
  • Utilized PyUnit, the Python unit-testing framework, for all Python applications.
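The crawling, parsing, and PyUnit bullets above combine naturally: fetch a page, extract data, and cover the extraction with a unit test. A self-contained sketch using only the standard library (`html.parser` stands in for Beautiful Soup here so no third-party install is needed; the HTML sample is hypothetical):

```python
import unittest
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, akin to Beautiful Soup's find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html: str):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

class TestLinkExtractor(unittest.TestCase):
    """PyUnit coverage for the extractor; run with `python -m unittest`."""
    def test_extracts_hrefs_in_order(self):
        html = '<p><a href="/a">A</a> and <a href="/b">B</a></p>'
        self.assertEqual(extract_links(html), ["/a", "/b"])

    def test_ignores_other_tags(self):
        self.assertEqual(extract_links("<p>no links here</p>"), [])
```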

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Cloudera, Pig, HBase, Linux, XML, MySQL Workbench, Python, Java 6, Eclipse, Oracle 10g, PL/SQL, SQL*Plus.


SQL Developer Intern


  • Worked as an SQL Developer with experience in analytical techniques, writing and optimizing SQL queries in SQL Server 2008.
  • Used a ColdFusion front end to develop multiple web applications and automation tools.
  • Coordinated automation change testing roll-outs
  • Provided recommendations based on technical issue research
  • Participated in software evaluation, business requirements analysis and application integration.
  • Knowledge and work experience in RDBMS concepts; experienced in producing tables, reports, graphs, and listings using various procedures and in handling large databases for data manipulation.
  • Analyzed components of end-to-end designs for web interfaces
  • Advised on appropriate reporting applications and analysis architecture
  • Collaborated with team members to optimize viable solutions
  • Documented user security-related constraints and functionality
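A common query-optimization step behind the SQL work above is adding an index on a frequently filtered column so the optimizer can avoid a full table scan. A minimal sketch using Python's built-in sqlite3 (standing in for SQL Server 2008; the table and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
)
# An index on the filter column lets the optimizer seek instead of scanning.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]
# total == 15.0

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("alice",)
).fetchall()
# The plan detail names the index rather than a scan of the whole table.
```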

Environment: SQL Server 2008, Visual Basic, Windows 2008, Excel.
