
Data Engineer Resume

Walnut Creek, CA


  • 5+ years of overall IT experience, including 2+ years in Hadoop (Cloudera Distribution CDH 4 and 5) on a 30-node cluster.
  • Worked with datasets over 60 TB in size.
  • Extensive experience in HDFS, Sqoop, Flume, Hive, Pig, Spark, Oozie, Impala.
  • Good understanding and working experience with Hadoop distributions such as Cloudera and Hortonworks.
  • Experience importing and exporting multiple terabytes of data between relational database management systems and HDFS using Sqoop.
  • Experience using HiveQL to query and analyze large datasets.
  • Experience writing simple to complex Pig scripts for processing and analyzing large volumes of data.
  • Experience querying both managed and external Hive tables using Impala.
  • Extensive experience with big data ETL and query tools such as HiveQL and Pig Latin.
  • Experience loading logs from multiple sources into HDFS using Flume.
  • Experience with the Oozie workflow engine, building workflows with Sqoop, Pig, and Hive actions.
  • Experience using the Spark API in place of MapReduce to perform analytics on data.
  • Experience creating Resilient Distributed Datasets (RDDs) from input data and applying transformations using PySpark.
  • Experienced with Spark processing components such as Spark SQL.
  • Experience in data warehousing and ETL processes.
  • Experience processing large structured, semi-structured, and unstructured datasets.
  • Experience working with different file formats such as Avro, Parquet, ORC, SequenceFile, and JSON.
  • Background with traditional databases such as MySQL and SQL Server.
  • Good analytical, interpersonal, communication, and problem-solving skills, with the ability to quickly master new concepts and to work both in a team and independently.


Hadoop Distribution: Cloudera, Hortonworks

Big Data Ecosystem: HDFS, Sqoop, Flume, Hive, Pig, Impala, Oozie, Spark

Databases: MySQL, MS SQL Server

NoSQL / Storage: HBase, AWS Redshift, S3, EMR

Languages: Java, Python

Operating System: Windows XP/7/8/10, Linux, Mac OS


Confidential, Walnut Creek, CA

Data Engineer


  • Worked on the Cloudera CDH 5.4 distribution of Hadoop.
  • Worked extensively with MySQL to identify the tables and views required for export into HDFS.
  • Responsible for moving data from MySQL into HDFS on the development cluster for validation and cleansing.
  • Responsible for creating Hive tables on top of HDFS and developing Hive queries to analyze the data.
  • Developed Hive tables using different SerDes, storage formats, and compression techniques.
  • Optimized datasets by implementing dynamic partitioning and bucketing in Hive.
  • Used Pig Latin to analyze datasets and perform transformations according to requirements.
  • Implemented custom Hive UDFs for comprehensive data analysis.
  • Involved in loading data from local file systems into the Hadoop Distributed File System (HDFS).
  • Experience working with Spark SQL and creating RDDs using PySpark.
  • Extensive experience performing ETL on large datasets in HDFS using PySpark.
  • Developed an ETL workflow that pushes web server logs to an Amazon S3 bucket.
  • Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Sqoop scripts, Pig scripts, and Hive queries.
  • Exported data from the HDFS environment into an RDBMS using Sqoop.
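The dynamic partitioning mentioned above routes each row into a key=value subdirectory under the table's storage path, so queries that filter on the partition column only scan matching directories. A minimal stdlib-Python sketch of that on-disk layout (the rows, column names, and file names here are hypothetical illustrations, not taken from the actual project):

```python
import csv
import os
import tempfile
from collections import defaultdict

def dynamic_partition_write(rows, partition_key, base_dir):
    """Mimic Hive-style dynamic partitioning: group rows by the
    partition column and write each group into its own
    key=value directory, as Hive does under the table's HDFS path."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[partition_key]].append(row)
    written = {}
    for value, part_rows in buckets.items():
        part_dir = os.path.join(base_dir, f"{partition_key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-00000.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(part_rows[0]))
            writer.writeheader()
            writer.writerows(part_rows)
        written[value] = path
    return written

rows = [
    {"id": 1, "event": "click", "dt": "2015-01-01"},
    {"id": 2, "event": "view",  "dt": "2015-01-02"},
    {"id": 3, "event": "click", "dt": "2015-01-01"},
]
out = dynamic_partition_write(rows, "dt", tempfile.mkdtemp())
print(sorted(out))  # ['2015-01-01', '2015-01-02']
```

In real Hive this layout is produced by an `INSERT ... PARTITION (dt)` statement; the sketch only shows why partition pruning works, one directory per distinct key.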

Confidential, Denver, CO

Data Engineer


  • Worked on a live 30-node Hadoop cluster running CDH 4.4.
  • Worked with 20 TB of highly unstructured and semi-structured data.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Moved data from various file systems into HDFS using UNIX command-line utilities.
  • Involved in importing and exporting data between RDBMS and HDFS using Sqoop.
  • Created Hive tables on top of the loaded data and wrote Hive queries for ad hoc analysis.
  • Implemented partitioning, dynamic partitioning, and bucketing in Hive for efficient data access.
  • Queried both managed and external Hive tables using Impala.
  • Developed Pig scripts for data analysis and transformation.
  • Used Pig as an ETL tool for transformations, event joins, and pre-aggregations before storing data in HDFS.
  • Developed Spark code and Spark SQL for faster testing and processing of data.
  • Converted Hive SQL queries into Spark transformations using Spark RDDs and Python.
  • Developed UNIX shell scripts to load large numbers of files into HDFS from the Linux file system.
  • Implemented Oozie workflows with Sqoop, Pig, and Hive actions.
  • Exported the analyzed data to relational databases using Sqoop.
  • Debugged results to identify any missing data in the output.
  • Analyzed large datasets to determine the optimal way to aggregate and report on them.
  • Involved in performance tuning and bug fixing.
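Converting a Hive `GROUP BY` into Spark code typically means the classic map → reduceByKey pattern on an RDD. Without a live cluster, the same per-key reduction can be sketched in plain Python (the page names and counts are hypothetical sample data):

```python
# Hive: SELECT page, COUNT(*) FROM logs GROUP BY page
# Spark: rdd.map(lambda page: (page, 1)).reduceByKey(lambda a, b: a + b)
logs = [("home", 1), ("search", 1), ("home", 1), ("cart", 1), ("home", 1)]

def reduce_by_key(pairs, fn):
    """Merge values per key, the way Spark's reduceByKey combines
    (key, value) pairs with an associative function."""
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v
    return out

counts = reduce_by_key(logs, lambda a, b: a + b)
print(counts)  # {'home': 3, 'search': 1, 'cart': 1}
```

The function passed in must be associative (here, addition), since Spark applies it in arbitrary order across partitions before merging partial results.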


SQL Developer


  • Involved in database development and creating SQL scripts.
  • Involved in requirement study, UI design, development, implementation, code review, validation, and testing.
  • Managed database-related activities.
  • Designed tables and indexes.
  • Wrote SQL queries to fetch business data.
  • Developed views, sequences, and indexes.
  • Created joins and subqueries involving multiple tables.
  • Analyzed SQL data, identified issues, and modified SQL scripts to fix them.
  • Involved in troubleshooting and fine-tuning databases for performance and concurrency.
  • Involved in fixing bugs and in different forms of testing, including black-box and white-box testing.
  • Handled issues regarding database connectivity and maintenance.
  • Managed the priorities, deadlines, and deliverables of individual projects and related issues.
  • Effectively prioritized work considering business need and urgency.
  • Worked effectively and efficiently on multiple tasks under deadlines and produced high-quality results.
  • Involved in improving web application performance for a user-friendly experience and resolving a critical production issue.
