
Data Engineer Resume

Palo Alto


  • 7+ years of technical expertise in all phases of the SDLC (Software Development Life Cycle), with a major concentration on Big Data analysis frameworks, relational databases, NoSQL databases, and Java/J2EE technologies, following recommended software practices.
  • 3+ years of industry IT experience in data manipulation using Big Data Hadoop ecosystem components: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, Solr, and ZooKeeper.
  • Extensive experience working with the Cloudera (CDH4 and CDH5) and Hortonworks Hadoop distributions and Amazon EMR on AWS to fully leverage and implement new Hadoop features.
  • Experience with Azure Data Factory (ADF), creating multiple pipelines and activities for full and incremental data loads.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume. Good experience writing Spark applications in Python and Scala.
  • Worked on replacing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing.
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
  • Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies.
  • Hands-on experience with the data ingestion tools Kafka and Flume and the workflow management tool Oozie.
  • Experience processing Avro data files using Avro tools and MapReduce programs.
  • Hands-on experience writing MapReduce programs in Java to handle different data sets using map and reduce tasks.
  • Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as NameNode and DataNode, MapReduce concepts, Spark execution concepts, and the HDFS framework.
  • Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
  • Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
  • Implemented ad-hoc queries using Hive to perform analytics on structured data.
  • Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
  • Experienced in optimizing Hive queries by tuning configuration parameters.
  • Involved in designing the data model in Hive for migration of the ETL process into Hadoop and wrote Pig scripts to load data into the Hadoop environment.
  • Experience in developing data ingestion, data processing, and analytical pipelines for Big Data, relational databases, and NoSQL databases.
  • Compared the performance of Hive and Big SQL for our data warehousing systems.
  • Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
  • Extensively used Apache Flume to collect logs and error messages across the cluster.
  • Experience in composing shell scripts to dump the shared information from MySQL servers to HDFS.
  • Worked on implementing and optimizing Hadoop/MapReduce algorithms for Big Data analytics.
  • Developed graphs using the Graphical Development Environment (GDE) with various Ab Initio components and migrated a few graphs to Hadoop.
  • Team player with good interpersonal, communication, and presentation skills.
  • Exceptional ability to learn and master new technologies and to deliver outputs on short deadlines.
  • Detailed understanding of Software Development Life Cycle (SDLC) and experience in project implementation methodologies including Waterfall and Agile.


Big Data Ecosystems: Hadoop, MapReduce, HDFS, ZooKeeper, Hive, Pig, Sqoop, Oozie, Flume, YARN, Spark, NiFi

Database Languages: SQL, PL/SQL, Oracle

Programming Languages: Java, Scala, Python (can read and understand)

Frameworks: Spring, Hibernate, JMS

Scripting Languages: JSP, Servlets, JavaScript, XML, HTML, Python

Web Services: RESTful web services

Databases: RDBMS, HBase, Cassandra

IDE: Eclipse, IntelliJ

Platforms: Windows, Linux, Unix

Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss

Methodologies: Agile, Waterfall

ETL Tools: Talend


Confidential, Palo Alto

Data Engineer


  • Migrated code from Ab Initio (ETL tool) to Hadoop using Hive and Spark, according to the complexity of the Ab Initio graphs.
  • Developed SQL logic using Spark SQL to meet business requirements.
  • Used various window functions and developed advanced clustered queries in Spark SQL.
  • Queried data using Spark SQL on top of the Spark engine for faster data set processing.
  • Designed and developed an efficient architecture to process data using PySpark programs.
  • Developed optimized distributed applications with Spark Core and Spark SQL in Python, integrating REST, fact, and dimensional data, and fed the data to HDFS and SQL Server.
  • Optimized and migrated data-intensive batch jobs from Ab Initio into Spark ETLs.
  • Developed scalable transformation, aggregation, and rollup operations with Hive and optimized SLAs using Hive-based partitions and buckets, storing the data in different file formats (Parquet, Avro, ORC) with suitable compression codecs (Snappy, LZ4, gzip, LZO, bzip2) based on application needs.
  • Developed graphs using Graphical Development Environment (GDE) with various Ab Initio components.
  • Developed MapReduce batch jobs in Java for loading data to HDFS in sequential format.
  • Ingested structured data from RDBMS to HDFS as incremental imports using Sqoop.
  • Developed Sqoop scripts to import and export data from relational sources and handled incremental loading of the transaction data by date.
  • Involved in writing Pig scripts to wrangle the raw data, store it in HDFS, and load it into Hive tables using HCatalog.
  • Created Hive external tables with clustering and partitioning on the date for optimizing the performance of ad-hoc queries.
  • Involved in creating Hive tables on a wide range of data formats such as text, sequential, Avro, Parquet, and ORC.
  • Transformed the semi-structured log data to fit into the schema of the Hive tables using Pig.
  • Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications to eventually adopt them as part of the Big Data Hadoop initiative.
  • Coordinated with Hadoop Admin team on implementing the DDLs for new applications.
  • Worked on Incident and Change Management, creating tickets and CRQs using ASK NOW.
  • Worked in an Agile framework, tracking tasks on a sprint basis using a JIRA board.
  • Worked on ESP and D-series to create collections for scheduling Job Docs in Production DCs.
  • Followed the D2P process to test and debug scripts from lower to higher environments.
  • Worked with Distributed Copy to move application data across clusters.

Confidential, TX

Hadoop Developer


  • Developed Spark applications in Python using DataFrames and the Spark SQL API for faster data processing.
  • Worked with Spark libraries to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Developed Spark jobs and Hive jobs to summarize and transform data.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data.
  • Worked on batch processing and real-time data processing with Spark Streaming.
  • Developed Spark applications in Python and implemented an Apache Spark data processing project to handle data from NoSQL databases and streaming sources.
  • Worked with cloud services such as Azure and was involved in ETL, data integration, and migration.
  • Created Azure Data Factory pipelines to consume data from external sources and load it into Azure SQL databases.
  • Created multiple Azure Data Factory (ADF) pipelines and activities for full and incremental data loads.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using Azure Data Factory, ingesting data into Azure Blob Storage.
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
  • Collaborated with analytics and business teams to improve data models and increase data accessibility; performed data analysis to troubleshoot data quality issues at the source and assisted business teams.
  • Responsible for understanding the business analytics requirements for HDInsight, analyzing the data and correlating it with business requirements, and building data pipelines to generate the data.
  • Worked on solving performance issues in Hive, drawing on an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Developed Hive scripts in HiveQL to denormalize and aggregate the data.
  • Expertise in creating Hive tables and loading and analyzing data using Hive queries.
  • Performed transformations, cleaning and filtering on imported data using Hive and loaded final data into HDFS.
  • Developed Hive queries on different tables to find insights. Automated the building of data pipelines for data scientists to support predictive, classification, descriptive, and prescriptive analytics.
  • Built a NiFi system for replicating the whole database.
  • Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on failures.

Confidential, WA

Big Data Engineer


  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Serialized JSON data and stored it in tables using Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Knowledge of cloud infrastructure technologies in Azure.
  • Experience with Confidential Azure Cloud services, Storage Accounts, Azure data storage, Azure Data Factory, Data Lake, and Virtual Networks.
  • Part of a team that helps Confidential customers build Big Data and advanced analytics solutions on the Confidential Azure cloud using Azure data services or open-source software.
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
  • Worked with Azure Monitoring and Data Factory.
  • Supported migrations from on-premises environments to Azure.
  • Provided support services to enterprise customers related to Microsoft Azure cloud networking; experienced in handling critical-situation cases.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
  • Experience in writing Shell scripts to automate the process flow.
  • Experience in writing business analytics scripts using HiveQL.
  • Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform, targeting high availability of services and low operational costs.
  • Optimized test content and process with a reduction of 20% in false positives. Used SQL and Excel to pull, analyze, polish, and visualize data.
  • Followed Agile methodology with Scrum meetings to track, optimize, and tailor features to customer needs.
