
Senior Data Engineer Resume

Costa Mesa, CA


  • Over 7 years of professional experience in designing, developing, integrating and testing software applications, including 5 years of experience with Hadoop Big Data technologies such as Map-Reduce, Hive, Spark (Core and Spark SQL), Impala and Sqoop, and 3+ years of experience in programming.
  • Hands-on experience working with Big Data in the Hadoop ecosystem using Spark, Hive, Impala and Map-Reduce.
  • Hands-on experience programming in Java, Scala and Python, with strong knowledge of object-oriented and functional programming concepts.
  • In-depth knowledge of implementing data visualization techniques in Spark using Apache Zeppelin.
  • Highly skilled at SQL and shell scripting operations.
  • Good knowledge of data architecture including data ingestion pipeline designs, Lambda architecture, data streams, data lakes and data warehouses.
  • Good knowledge of data modeling and advanced data processing techniques for Structured, Semi Structured and Unstructured data.
  • Able to assess business rules, collaborate with stakeholders and perform source-to-target data mapping, design and review with strong analytical and communication skills.
  • Hands-on experience in tuning mappings, with expertise in identifying and resolving performance bottlenecks at various levels.
  • Excellent skills in analyzing system architecture usage and in defining and implementing procedures.
  • A quick learner, punctual and trustworthy.
  • Motivated problem solver and resourceful team member with decent written and verbal communication skills.


Hadoop Platforms: Hortonworks (HDP) and Cloudera (CDH)

Filesystems: HDFS, S3

Databases: Hive, Impala

Scheduling: Oozie and crontab

Data Ingestion and Streaming Tools: Sqoop, Flume and Kafka

Querying Engines: Beeline, Phoenix

NoSQL DBs: HBase, MongoDB and Cassandra

Visualization and Other Tools: Apache Zeppelin, Tableau, Arcadia, Microsoft Office (PowerPoint, Word and Excel) and GitHub.

Development Technologies: Spark, Python, Scala, Java, SQL and Shell Scripting.

IDEs, FTP and SSH tools: Eclipse, IntelliJ, PyCharm, DbVisualizer, Xshell, PuTTY, FileZilla and WinSCP.

Build tools: SBT, Maven

File Formats: Structured (delimiter-separated values), Semi-Structured (JSON, XML, HTML, etc.), Compressed (Gzip, Snappy, LZO, etc.) and Binary (Sequence, Avro, Parquet, ORC, etc.).

Operating Systems: Mac, Ubuntu, Linux and Windows.


Senior Data Engineer

Confidential, Costa Mesa, CA

Environment: Cloudera (CDH), Hadoop, PuTTY, Spark, Beeline, Impala, MySQL, HDFS, Python, Scala, SQL scripting, Sqoop, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.


  • Worked on designing an ETL pipeline using Spark, Hive and HBase components.
  • Worked on batch data ingestion by creating a data pipeline using Sqoop and Spark.
  • Worked on near-real-time data ingestion using a data pipeline built on Kafka and Spark.
  • Worked on integrating Hive with HBase using the HBaseStorageHandler to enable inserting and updating data in the NoSQL HBase store.
  • Worked on enabling transactional tables in Hive to support row-level updates.
  • Worked on encrypting sensitive data using Hive, in addition to providing a mechanism to encrypt the Sqoop passwords used to connect to the legacy systems.
  • Worked on designing a framework that stores the RDBMS and JDBC connection details in MySQL, to automate and unify the data pull processing for Sqoop ingestions.
  • Used Oozie and crontab to schedule the Hadoop application jobs so that they use the cluster effectively.
  • Worked on real-time data analytics in Spark Streaming for streaming text data by integrating Flume and Kafka with Spark Streaming.
  • Worked on schema tuning, performance triage/troubleshooting and data distribution for the ingested and existing data in the enterprise data platform.
  • Worked on performance tuning, debugging and optimization of Hive queries.

Hadoop Solutions Engineer

Confidential, Colorado Springs, CO

Environment: Hortonworks (HDP), Hadoop, PuTTY, Oracle, MySQL, HDFS, Spark, Hive, Arcadia, Python, Scala, SQL scripting, Sqoop, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.


  • Worked on creating an ETL pipeline using Spark, Hive and Arcadia components.
  • Stored and retrieved data in HDFS in different formats such as text, JSON, Sequence, Avro, Parquet and ORC, and in compressed formats.
  • Worked on app and visual creation in Arcadia Data to enable data visualization and descriptive analytics.
  • Tuned Spark RDD parallelism to improve the performance of Spark jobs on the Hadoop cluster.
  • Designed Hive table schemas using partitioning and bucketing, storing tables as both external and internal tables.
  • Worked on developing Hive UDFs in Python to define custom analytical functions.
  • Worked on programming Spark applications in Python and Scala, in addition to optimizing the memory parameters for efficient cluster utilization.
  • Worked on moving data between RDBMS and HDFS using Spark and JDBC connectors to integrate Hadoop with MySQL and Oracle.

Hadoop Developer

Confidential, Southlake, TX, US

Environment: Cloudera (CDH), Hadoop, PuTTY, Oracle, MySQL, HDFS, Spark, Hive, Impala, Python, Scala, SQL scripting, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.


  • Worked on parsing and filtering semi-structured data such as JSON using DataFrames/Spark SQL, both with case classes and by specifying the schema programmatically.
  • Worked on modeling data with Avro schemas and writing it to Parquet format using Spark SQL.
  • Worked on real-time data analytics in Spark Streaming for streaming text and Kafka topic data.
  • Worked on data preparation methods in Spark DataFrames using set operations, regular expressions, sorting, parsing arbitrary date/time inputs and converting JSON array values into lists.
  • Worked on performance tuning, debugging and optimization of Hive queries by changing the default YARN values.
  • Worked on developing queries to analyze data of different formats in Impala and Hive.
  • Worked on moving data between RDBMS and HDFS using Spark and JDBC connectors.
  • Loaded and retrieved data between local systems and HDFS.

Hadoop Developer

Confidential, Mayfield Village, OH

Environment: Big Data Platform - CDH 5.0.3, Hadoop HDFS, Map-Reduce, Hive, Sqoop, Spark, Impala, Java, Shell Scripts, Oracle 10g, Eclipse, Tableau, PuTTY and IntelliJ.


  • Prepared technical design documents and data flow diagrams based on business requirements.
  • Implemented new designs as per technical specifications.
  • Integrated Hadoop with Oracle in order to load and then cleanse raw unstructured data in the Hadoop ecosystem, making it suitable for processing in Oracle using stored procedures and functions.
  • Used the Map-Reduce programming model for batch processing of data stored in HDFS.
  • Developed Java Map-Reduce programs to transform log data into a structured form and find user location, login/logout time, time spent and errors.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
  • Used Sqoop for importing data into HDFS and exporting data from HDFS to the Oracle database.
  • Built reusable Hive UDF libraries for business requirements, which enabled users to use these UDFs in Hive queries.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Optimized the Hive tables using techniques such as partitioning and bucketing to provide better performance with HiveQL queries.
  • Developed Spark Scala scripts for ETL operations on captured data and for delta record processing between newly arrived data and existing data in HDFS.
  • Extensively used Pig for data cleansing.
  • Used PySpark to perform transformations and event joins, filter bot traffic and run some pre-aggregations before storing the data in HDFS.
  • Extended Hive and Pig core functionality by writing custom UDFs in Java and Python.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Worked extensively on performance optimization by adopting appropriate design patterns for the Map-Reduce jobs, based on analysis of I/O latency, map time, combiner time, reduce time, etc.
  • Troubleshooting: Used Hadoop logs to debug the scripts.

Hadoop Developer

Confidential, Houston, TX

Environment: Big Data Platform - CDH 4.2.1, Hadoop HDFS, Map Reduce, Hive, Sqoop, IBM DB2, PL/SQL, UNIX, Python, Eclipse.


  • Integrated, managed and optimized utility systems, including assets, devices, networks, servers, applications and data.
  • Ensured quality integration of smart meter functions into the system's data acquisition and processing.
  • Enabled the use of metering data for a variety of applications such as billing, outage detection and recovery, fraud detection, finance, energy efficiency, customer care and a variety of analytics.
  • Analyzed large amounts of raw data to create information. Compiled technical specifications that allowed IT to create data systems, which supported the smart metering system.
  • Responsible for technical reviews and provided quick-fix solutions to customers for production defects.
  • Developed Map-Reduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the enterprise data warehouse (EDW).
  • Worked on importing and exporting data between HDFS and relational database systems using Sqoop.
  • Worked on developing Scala code to filter and parse raw data in HDFS using Spark.
  • Wrote Map-Reduce Java programs to analyze log data for large-scale weather data sets.
  • Involved in testing Map-Reduce programs using MRUnit and JUnit testing frameworks.
  • Customized the parser/loader application for data migration to HBase.
  • Provided support for data analysts in running ad hoc Pig and Hive queries.
  • Developed PL/SQL procedures, functions and packages using Oracle utilities such as PL/SQL and SQL*Loader, with exception handling, to implement key business logic.
  • Utilized the PL/SQL bulk collect feature to optimize ETL performance. Fine-tuned and optimized a number of SQL queries and performed code debugging.
  • Developed UNIX and SQL scripts to load large volumes of data for data mining and data warehousing.

Hadoop Developer

Confidential, NC

Environment: Big Data Platform - CDH 4.0.1, XML, Hadoop HDFS, Spark, Hive, Sqoop, Impala, Oracle 10g, Java, Eclipse.


  • Involved in design and development of the server-side layer using XML, JDBC and JDK patterns in the Eclipse IDE.
  • Involved in unit testing, system integration testing and enterprise user testing.
  • Extensively used Core Java, Servlets, and JDBC.
  • Developed a data pipeline using Hive, Sqoop, Spark and Map-Reduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from various sources.
  • Wrote MRUnit test cases to test and debug Map-Reduce programs on a local machine.
  • Involved in creating Hive tables, loading data and running Hive queries on that data.
  • Imported data from Oracle to HDFS using Sqoop on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Wrote Hive queries to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
  • Developed Pig UDFs to pre-process data for analysis.
  • Developed complex, multi-step data pipelines using Spark.
  • Wrote Spark SQL queries for data analysis.

Hadoop Developer

Confidential, NY

Environment: Big Data Platform - CDH 3, Map-Reduce, Hive, Spark Scripting, JDK 1.6, and Oracle.


  • Involved in analysis, design and development of data collection, data ingestion, data profiling and data aggregation.
  • Worked on development of the controller, batch and logging modules using JDK 1.6.
  • Worked on development of the data ingestion process using FS Shell and data loading into HDFS.
  • Worked on defining Hive queries for different profiling rules such as business checks, outlier checks, and domain and data range validation.
  • Worked on automating the generation of Hive queries and Map-Reduce programs.
  • Developed user-defined functions in Java and Python to facilitate data analysis in Hive and Pig.
  • Managed the end-to-end delivery during the different phases of the software implementation.
  • Involved in initial POC implementation using Hadoop - Map Reduce, Spark Scripting, and Hive Scripting.
  • Designed the framework for data ingestion, data profiling and generation of the Risk Aggregation report based on various business entities.
  • Mapped the business requirements and rules to the Risk Aggregation System.
  • Used JDBC for database connectivity to Oracle and to invoke stored procedures.
  • Debugged code and created documentation for future use.
