Hadoop Data Engineer Resume

Southbury, CT


  • Certified Cloudera Spark and Hadoop Developer with 6+ years of overall IT experience, including 3+ years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, Hive, Spark, Flume, Sqoop, AWS and ZooKeeper.
  • Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as YARN, NameNode, DataNode and the HDFS framework.
  • Experienced in loading data into Hive partitions, creating buckets in Hive and developing MapReduce jobs to automate data transfer from HBase.
  • Capable of processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture.
  • Worked on Spark using Python on a cluster for computational analytics and built advanced analytical applications using Spark with Hive and SQL/Oracle.
  • Hands-on experience in developing Spark applications using Spark Core, RDD transformations and Spark SQL.
  • Good knowledge of AWS EC2 instances and working knowledge of S3 services.
  • Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data.
  • Practiced Agile Scrum methodology and contributed to TDD, CI/CD and all aspects of the SDLC.
  • Explored multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and store it in EMRFS.
  • Strong testing and debugging skills with exposure to the complete software development life cycle, from requirements gathering to product release.
  • Good experience in creating business intelligence solutions and designing ETL workflows using IBM DataStage.
  • Good experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Proven ability to manage all stages of project development, with strong problem-solving and analytical skills and the ability to make balanced, efficient decisions.


Hadoop Distributions: Cloudera, Hortonworks, AWS EMR

Languages: Java, Scala, Python, SQL, HTML

NoSQL Databases: Cassandra, MongoDB and HBase

Development / Build Tools: PyCharm, Eclipse, Maven, Gradle, IntelliJ, JUnit

DB Languages: MySQL, PL/SQL and Oracle

RDBMS: Teradata, Oracle 9i, 10g, 11i, MS SQL Server, MySQL and DB2

Operating systems: Windows, macOS, CentOS, RHEL

Data analytical tools: R, SAS and MATLAB

ETL Tools: IBM DataStage


Confidential, Southbury, CT

Hadoop Data Engineer

  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Involved in migrating ETL processes from Oracle to Hive to test ease of data manipulation, and worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop.
  • Developed Hive queries to load and process data in the Hadoop file system.
  • Developed Hive UDFs to sort struct fields and return complex data types.
  • Created Hive tables with static and dynamic partitions and bucketing.
  • Developed Spark jobs using Python on top of YARN/MRv2 for interactive and batch analysis.
  • Imported data in various formats such as JSON, Sequential, Text, CSV, Avro and Parquet into the HDFS cluster and stored it with compression for optimization using PySpark.
  • Performed data analysis using PySpark (Spark SQL).
  • Developed PySpark scripts to filter and aggregate data and write the results to a variety of file formats with compression for optimization, as per business needs.
  • Developed a data pipeline using Flume and Sqoop to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from different sources such as HDFS and HBase into Spark RDDs.
  • Managed and reviewed large Hadoop log files. Used the PyCharm IDE to develop and debug code.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; the initial work was done in Python (PySpark).

Environment: Hadoop, MapReduce, Hive, PySpark, Sqoop, Python, SparkSQL, AWS EMR, AWS S3, Flume, HBase, Cloudera Manager, Zookeeper, Oracle, DB2

Confidential, Southbury, CT

Hadoop Data Engineer

  • Created and maintained Technical documentation for launching Hadoop clusters and for executing Hive queries and Pig Scripts.
  • Monitored and reviewed Hadoop log files and wrote queries to analyze them.
  • Conducted POCs and mocks with the client to understand business requirements, and attended defect triage meetings with the UAT and QA teams to ensure defects were resolved in a timely manner.
  • Worked with Kafka for the proof of concept for carrying out log processing on a distributed system.
  • Studied the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MRv2, Hive and Sqoop.
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and Flume, and then into partitioned Hive tables.
  • Developed HQL queries, mappings, tables and external tables in Hive for analysis across different banners and worked on partitioning, optimization, compilation and execution.
  • Wrote complex queries to load data into HBase and was responsible for executing Hive queries using the Hive command line and Hue.
  • Designed and implemented proprietary data solutions by correlating data from SQL and NoSQL databases using Kafka.
  • Used Pig as an ETL tool to perform transformations and some pre-aggregations before storing the analyzed data in HDFS.
  • Developed PySpark code for saving data in Avro and Parquet formats and building Hive tables on top of them.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL (DataFrames) with Python.
  • Automated workflows using shell scripts to pull data from various databases into Hadoop.
  • Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
  • Developed Spark programs using Python, created Spark SQL queries and developed Oozie workflows for Spark jobs.

Environment: HDFS, Hadoop 2.x YARN, Teradata, NoSQL, PySpark, MapReduce, Pig, Hive, Sqoop, Spark, Scala, Oozie, Java, Python, MongoDB, Shell and Bash scripting.

Confidential, CA

ETL Developer

  • Gathered ETL requirements from business analysts, then developed and supported the resulting jobs through the end of the project lifecycle.
  • Worked on new ETL enhancements, supporting and guiding them through all phases.
  • Provided technical input and guidance to the team and coordinated with the team on ETL development work.
  • Streamlined and automated ETL processes, allowing them to complete within Service Level Agreement timelines.
  • Worked on defects in ETL jobs and supported the jobs through production deployment.
  • Wrote and gathered documentation in all phases of the project lifecycle.
  • Created scheduling plans and job execution timings and shared them with the scheduling team.
  • Built code according to the design documents.
  • Extensively used DataStage Designer and DataStage Director to develop jobs and to view log files for execution errors.
  • Integrated data from disparate sources such as flat files (fixed width, delimited), COBOL files, Microsoft Access database files (.accdb), Extensible Markup Language (XML) files, and relational databases (RDBMS - DB2, Teradata, SQL Server, Oracle).
  • Developed job sequencers with proper job dependencies, job control stages and triggers.
  • Created DataStage jobs using different stages like Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Column Generator, Difference, Row Generator, Sequencer, Email Communication activity, Command activity, Sequential File, CFF stage, Dataset, Terminator activity.
  • Transformed data using the available conversion functions and created custom conversion functions.
  • Used DataStage Director and its run-time engine for job monitoring, testing and debugging its components, and monitoring the resulting executable versions on an ad hoc or scheduled basis.
  • Documented ETL test plans, test cases, test scripts and validations based on design specifications for unit, system and functional testing; prepared test data for testing, error handling and analysis.
  • Involved in performance tuning of long-running jobs. Reviewed code developed by subordinates for adherence to naming standards and best practices.
  • Developed UNIX shell scripts to track DataStage job logs for the developer group and analysts.


FAS Specialist - II

  • Created complex stored procedures and queries to pull data out of the database for generating reports.
  • Created databases, tables, stored procedures, DDL/DML triggers, views, functions and cursors.
  • Extensively used joins and sub-queries for complex queries involving multiple tables from different databases. Analyzed data and re-mapped fields based on business requirements.
  • Developed aggregation strategies to aggregate data and to sort and join tables.
  • Assembled and converted user requirements from multiple clients into our standard reporting solution and customized specific client requests when necessary.
  • Performed quality control of data before release to clients, ensuring the quality and standards of the deliverables were met.
  • Performed monthly patching and functional testing on the processing and hosting tools.
  • Assisted with daily, weekly and monthly jobs that had to be run manually using the SQL Server Agent scheduler, and provided technical support by identifying, prioritizing and troubleshooting production-related issues and leveraging all available resources to resolve them.
  • Worked on finding and resolving bugs in daily, weekly and monthly jobs and communicated these issues to development groups and management as required.
  • Experience with processing tools like eCapture, Clearwell and Venio and hosting platforms like Relativity and Concordance.
  • End-to-end knowledge of the e-discovery lifecycle for data processing, hosting and productions.