
Data Engineer Resume

Dallas, TX


  • Hadoop Developer with 7+ years of IT experience designing and implementing complete end-to-end Hadoop ecosystems, including HDFS, MapReduce, YARN, Pig, Hive, HBase, Flume, Sqoop, Oozie, Spark, Kafka, and ZooKeeper.
  • Excellent hands-on knowledge of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
  • Hands-on experience with real-time streaming into HDFS using Spark Streaming and Kafka.
  • Experience handling large datasets using partitions, Spark in-memory capabilities, and broadcast variables in Spark.
  • Strong knowledge of Spark concepts such as RDD operations, caching, and persistence.
  • Developed analytical components using Spark SQL and Spark Streaming.
  • Experience analyzing data using Hive UDFs, Hive UDTFs, and custom MapReduce programs in Java.
  • Expert in working with the Hive data warehouse tool: creating tables and distributing data by implementing partitioning and bucketing.
  • Expertise in Hive functionality and in migrating data from databases such as Oracle, DB2, MySQL, and MongoDB.
  • Handled upgrades of Apache Ambari and of CDH and HDP clusters.
  • Experienced working with Amazon EMR, Cloudera (CDH3 & CDH4), and Hortonworks Hadoop distributions.
  • Hands on Experience with AWS infrastructure services, Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2).
  • Experienced in writing test cases and implementing unit tests using testing frameworks such as JUnit, EasyMock, and Mockito.
  • Experienced with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Hands-on experience in application development: writing MapReduce jobs in Java and Python, Hive queries, RDBMS work, and Linux shell scripting.
  • Experienced in writing build scripts using Maven and working with continuous integration systems such as Jenkins.
  • Well experienced with application servers such as WebLogic and WebSphere and with Java tools in client-server environments.
  • Excellent knowledge in Python, Java, Collections, J2EE, Servlets, JSP, Spring, Hibernate, JDBC/ODBC.
  • Hands on experience with NoSQL databases like HBase, Cassandra and Relational Databases like Oracle and MySQL
  • Deep Analytics and understanding of Big Data and Algorithms using Hadoop, MapReduce, NoSQL and distributed computing tools.
  • Experienced in data warehousing and ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.
  • Good understanding of XML methodologies (XML, XSL, XSD) including Web Services and SOAP.
  • Experienced in designing both time-driven and data-driven automated workflows using Oozie to run Hadoop MapReduce and Pig jobs.
  • Good experience in project impact assessment, project schedule planning, onsite-offshore team coordination, and end-user coordination, from requirement gathering through live support.
  • Successful in meeting new technical challenges and finding solutions to meet customer needs.
  • Work successfully in fast-paced environments, both independently and in collaborative teams.
  • Strong business, analytical, and communication skills.
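
The broadcast-variable and join work summarized above can be sketched in plain Python (the data and field names are illustrative, not from any actual project): the small dimension side is materialized once as an in-memory lookup, so the large fact side is joined without a shuffle, which is the idea behind Spark's broadcast joins.

```python
def broadcast_join(fact_rows, dim_rows, key):
    """Map-side join: build a small in-memory lookup from the dimension
    rows (the 'broadcast' side) and stream the large fact side past it."""
    lookup = {row[key]: row for row in dim_rows}  # built once, reused per fact row
    joined = []
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:  # inner-join semantics: unmatched fact rows are dropped
            joined.append({**dim, **fact})
    return joined

# Illustrative data: a large fact side and a small dimension side.
sales = [{"sku": 1, "qty": 3}, {"sku": 2, "qty": 1}, {"sku": 9, "qty": 5}]
products = [{"sku": 1, "name": "bolt"}, {"sku": 2, "name": "nut"}]
result = broadcast_join(sales, products, "sku")
```

In Spark the same shape is achieved by broadcasting the small side so every executor holds its own copy of the lookup.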



Real-Time/Stream Processing: Apache Storm, Apache Spark, Flume.

Distributed Message Broker: Apache Kafka

Databases: Oracle 9.x/10g/11g, MS SQL Server, MySQL Server, DB2, HBase, MongoDB, Cassandra.

Database/NoSQL: HBase, Oracle 9i/10g/12c, MySQL

Scripting Languages: JavaScript, Shell, Python

Network & Protocols: TCP/IP, Telnet, HTTP, HTTPS, FTP, SNMP, LDAP, DNS.

Operating Systems: Linux, UNIX, macOS, Windows NT/98/2000/XP/Vista, Windows 7, Windows 8.


Confidential - Dallas, TX

Data Engineer


  • Involved in requirement gathering and business analysis; translated business requirements into technical designs in Hadoop and Big Data.
  • Hands-on experience working with Hadoop ecosystem components such as Pig, Hive, Sqoop, Spark, and Kafka.
  • In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
  • Implemented an incremental load approach in Spark for huge tables.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Worked with different file formats such as JSON, CSV, and XML using Spark SQL.
  • Imported and exported data between HDFS and relational databases using Sqoop.
  • Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
  • Worked with both CDH4 and CDH5 applications; transferred large volumes of data back and forth between development and production clusters.
  • Created partitions and buckets for both managed and external Hive tables to optimize performance.
  • Implemented Hive join queries to join multiple tables of a source system and load them into Elasticsearch tables.
  • Built reusable Hive UDF libraries for business requirements, enabling users to call these UDFs in Hive queries.
  • Developed Data Lake as a Data Management Platform for Hadoop.
  • Used Talend to run the ETL processes instead of Hive queries.
  • Moved data from Hadoop to Cassandra using the bulk output format class.
  • Extracted data from a Teradata database and loaded it into the data warehouse using Spark JDBC.
  • Handled Data Movement, Data transformation, Analysis and visualization across the lake by integrating it with various tools.
  • Experienced in code repositories like GitHub.
  • Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase, Cassandra, and MongoDB.
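
The incremental load approach listed above can be sketched in plain Python rather than Spark (the field names and timestamps are illustrative): only rows newer than the last high-water mark are pulled, and the mark is advanced for the next run.

```python
def incremental_load(rows, last_watermark, ts_field="updated_at"):
    """Select only rows modified since the previous run and advance
    the high-water mark so the next run starts where this one ended."""
    fresh = [r for r in rows if r[ts_field] > last_watermark]
    new_watermark = max((r[ts_field] for r in fresh), default=last_watermark)
    return fresh, new_watermark

# Illustrative source table with a modification timestamp per row.
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
    {"id": 3, "updated_at": 300},
]
fresh, mark = incremental_load(source, last_watermark=200)
```

In Spark the same filter would typically be pushed down as a predicate on the timestamp column, with the watermark persisted between batch runs.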

Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Talend, Oozie, Jenkins, Cloudera, Oracle 12c, Linux

Confidential - Denver, CO

Data Engineer


  • Responsible for Building Scalable Distributed Data solutions using Hadoop.
  • Processed real-time data using Kafka, Spark Streaming, and Spark Structured Streaming; worked on Spark SQL, Structured Streaming, MLlib, and the core Spark API to build data pipelines; implemented and fine-tuned Spark streaming applications to reduce shuffling.
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Worked in Performance Tuning of Spark Applications for setting right batch interval time, correct level of parallelism and memory tuning.
  • Worked on a cluster of 105 nodes.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Handled Importing of data from various data sources, performed transformations using Hive, MapReduce, loaded data into HDFS and Extracted the data from MySQL into HDFS using Sqoop
  • Implemented schema extraction for Parquet and Avro file formats in Hive.
  • Used Hive to create Hive tables and load data from the local file system to HDFS.
  • Used Hive to do transformations, event joins, and pre-aggregations before storing the data in HDFS.
  • Implemented Partitioning, Dynamic Partitions, and Buckets on huge datasets to analyze and compute various metrics for reporting.
  • Involved in HBase setup and storing data into HBase for future analysis.
  • Good experience working with Tableau and Spotfire, enabling JDBC/ODBC data connectivity from those tools to Hive tables.
  • Used Oozie workflows to coordinate Pig and Hive scripts.
  • Used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), and Athena, as well as Snowflake.
  • Wrote Hive queries for ad hoc data analysis to meet business requirements.
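
The partitioning and bucketing work described above can be illustrated with a small Python sketch (the column names and bucket count are made up): rows are first split by a partition column, then hashed into a fixed number of buckets, mirroring what Hive's PARTITIONED BY and CLUSTERED BY ... INTO N BUCKETS clauses produce on disk.

```python
import zlib
from collections import defaultdict

def partition_and_bucket(rows, part_col, bucket_col, num_buckets):
    """Group rows by partition value, then assign each row to a bucket
    via a stable hash of the clustering column, as Hive bucketing does."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # crc32 gives a stable hash, so equal keys always land in the same bucket
        bucket = zlib.crc32(str(row[bucket_col]).encode()) % num_buckets
        layout[row[part_col]][bucket].append(row)
    return layout

# Illustrative event rows partitioned by date and bucketed by user.
events = [
    {"dt": "2020-01-01", "user": "a", "n": 1},
    {"dt": "2020-01-01", "user": "b", "n": 2},
    {"dt": "2020-01-02", "user": "a", "n": 3},
]
layout = partition_and_bucket(events, part_col="dt", bucket_col="user", num_buckets=4)
```

Because the bucket assignment is a pure function of the clustering key, equal keys co-locate across partitions, which is what makes bucketed map-side joins possible in Hive.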

Environment: HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Oozie, Flume, Kafka, Zookeeper, Amazon AWS, Spark SQL, Spark DataFrames, PySpark, Python, Java, JSON, SQL scripting and Linux shell scripting, Avro, Parquet, Hortonworks.

Confidential - Herndon, VA

Data Engineer


  • Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
  • Configured Sqoop Jobs to import data from RDBMS into HDFS using Oozie workflows.
  • Experienced in Job management using Fair scheduler and Developed job processing scripts using Oozie workflow.
  • Involved in installation and configuration of Hadoop MapReduce, HDFS and Developed multiple MapReduce jobs in Java for Data Cleaning and Processing.
  • Loaded and transformed huge datasets of structured and semi-structured data using Hive.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
  • Created Hive tables and Developed Hive queries for De-normalizing the Data.
  • Created Pig Latin scripts to sort, group, join, and filter the enterprise-wide data.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Created batch analysis job prototypes using Hadoop, Pig, Oozie, Hue and Hive.
  • Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
  • Worked on root cause analysis for issues occurring in batch processing and provided permanent fixes.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action.
  • Created and maintained technical documentation for all the workflows.
  • Created database access layer using JDBC and SQL stored procedures.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
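
The Pig Latin sort/group/join/filter scripts mentioned above can be mirrored in a few lines of plain Python (the record shapes and threshold are illustrative): each step corresponds to a FILTER, ORDER BY, and GROUP BY statement in a Pig script.

```python
from itertools import groupby

def filter_group_total(records, min_amount):
    """Pig-style pipeline: FILTER the rows, ORDER them by the grouping
    key, then GROUP and aggregate a total per group."""
    filtered = [r for r in records if r["amount"] >= min_amount]  # FILTER ... BY
    ordered = sorted(filtered, key=lambda r: r["dept"])           # ORDER ... BY
    return {                                                      # GROUP ... BY + SUM
        dept: sum(r["amount"] for r in rows)
        for dept, rows in groupby(ordered, key=lambda r: r["dept"])
    }

# Illustrative enterprise records grouped by department.
rows = [
    {"dept": "hw", "amount": 40},
    {"dept": "sw", "amount": 10},
    {"dept": "hw", "amount": 5},
    {"dept": "sw", "amount": 25},
]
totals = filter_group_total(rows, min_amount=10)
```

Note that `itertools.groupby` requires the sort step first, which is also why Pig's GROUP triggers a shuffle on the cluster.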

Environment: Hadoop YARN, Hive, Sqoop, Amazon AWS, Java, Python, Oozie, Jenkins, Cassandra, Oracle 12c, Linux.

Confidential - Lakewood, OH

Data Engineer


  • Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Manage and review data backups and log files.
  • Used Oozie to orchestrate the workflow.
  • Involved in loading data from the Linux file system to HDFS.
  • Analyzed data using Hadoop components Hive and Pig.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Moved the data from Oracle, MSSQL Server in to HDFS using Sqoop and importing various formats of flat files in to HDFS.
  • Wrote HBase client programs in Java and web services.
  • Mentored the analyst and test teams in writing Hive queries.
  • Implemented test scripts to support test driven development and continuous integration
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Gained excellent hands on knowledge on Hadoop cluster, MapReduce Jobs, Data Migration concepts in Hive.
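
The trend-spotting Hive queries mentioned above (comparing fresh data against EDW reference tables and historical metrics) reduce to a simple comparison that can be sketched in Python; the metric names and ratio threshold are illustrative only.

```python
def spot_emerging_trends(current, historical, ratio=1.5):
    """Flag categories whose fresh metric exceeds the historical baseline
    by a given ratio, i.e. the comparison the Hive queries express as a
    join between new data and EDW reference tables."""
    return sorted(
        cat for cat, value in current.items()
        if cat in historical and value > ratio * historical[cat]
    )

# Illustrative metrics: only "gears" exceeds 1.5x its baseline.
fresh = {"widgets": 120, "gears": 80, "bolts": 300}
baseline = {"widgets": 100, "gears": 30, "bolts": 310}
trending = spot_emerging_trends(fresh, baseline)
```

In HiveQL the equivalent would be a join of the fresh table to the reference table on the category key with the ratio test in the WHERE clause.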

Environment: Hadoop, MapReduce, HDFS, Sqoop, Hive, Java, Cloudera, Pig, HBase, Linux, XML, MySQL Workbench, Eclipse, Oracle 10g, PL/SQL, SQL*Plus.
