
Hadoop & Spark Developer/ Data Engineer Resume


Weehawken, NJ


  • 9+ years of professional IT experience in Big Data using the Hadoop framework, spanning analysis, design, development, documentation, deployment and integration with SQL and Big Data technologies as well as Java/J2EE technologies on AWS and Azure.
  • Experience in Hadoop ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka and Pig.
  • Experience in architecting, designing, installing, configuring and managing Apache Hadoop clusters and MapR, Hortonworks and Cloudera Hadoop distributions.
  • Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as ResourceManager, NodeManager, NameNode and DataNode, along with MapReduce concepts and the HDFS framework.
  • Expertise in data migration, data profiling, data ingestion, data cleansing, transformation, data import and data export using multiple ETL tools such as Informatica PowerCenter.
  • Working knowledge of Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL and Spark Streaming.
  • Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Worked on HQL for data extraction and join operations as required, with good experience in optimizing Hive queries.
  • Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Developed Spark code using Scala, Python and Spark SQL/Streaming for faster processing of data.
  • Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets) and used PySpark and spark-shell accordingly.
  • Profound experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka and Flume.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Good experience in handling messaging services using Apache Kafka.
  • Knowledge of data mining and data warehousing using ETL tools; proficient in building reports and dashboards in Tableau (BI tool).
  • Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and ZooKeeper.
  • Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
  • Good understanding of Amazon Web Services (AWS) such as EC2 for compute and S3 for storage, as well as EMR, Step Functions, Lambda, Redshift and DynamoDB.
  • Good understanding and knowledge of Microsoft Azure services like HDInsight clusters, Blob storage, ADLS, Data Factory and Logic Apps.
  • Worked with various file formats like delimited text files, JSON files and XML files. Mastered different columnar file formats like RC, ORC and Parquet, with a good understanding of compression techniques used in Hadoop processing such as Gzip, Snappy and LZO.
  • Hands-on experience building enterprise applications utilizing Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS, JavaScript, SQL and PL/SQL.
  • Experienced in the Software Development Lifecycle (SDLC) using Scrum and Agile methodologies.


Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake

Languages: Scala, Python, SQL, HiveQL, KSQL.

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, ZooKeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Software Methodologies: Agile, Scrum, Waterfall


Confidential, Weehawken, NJ

Hadoop & Spark Developer/ Data Engineer


  • Developed and added features to existing data analytics applications built with Spark and Hadoop on a Scala, Java and Python development platform on top of AWS services.
  • Programmed in Python and Scala with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, ZooKeeper, etc.).
  • Involved in developing Spark applications using Scala and Python for data transformation, cleansing and validation using the Spark API.
  • Worked on all the Spark APIs, such as RDD, DataFrame, Data Source and Dataset, to transform the data.
  • Worked on both batch processing and streaming data sources. Used Spark Streaming and Kafka for streaming data processing.
  • Developed a Spark Streaming script that consumes topics from the distributed messaging source Kafka and periodically pushes batches of data to Spark for real-time processing.
  • Built data pipelines for reporting, alerting and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL and Kafka.
  • Worked on Apache NiFi to automate data movement between RDBMS and HDFS.
  • Created shell scripts to handle various jobs like MapReduce, Hive, Pig, Spark, etc., based on the requirement.
  • Used Hive techniques like bucketing and partitioning to create tables.
  • Used Spark SQL to process large amounts of structured data.
  • Experienced in working with source formats including CSV, JSON, Avro, Parquet, etc.
  • Worked on AWS to aggregate clean files in Amazon S3 and on Amazon EC2 clusters to deploy files into buckets.
  • Designed and architected solutions to load multipart files which can't rely on a scheduled run and must be event driven, leveraging AWS SNS,
  • Involved in data modeling using star schema and snowflake schema.
  • Used AWS EMR to create Hadoop and Spark clusters. These clusters are used for submitting and executing Scala and Python applications in production.
  • Responsible for developing a data pipeline with AWS to extract data from weblogs and store it in HDFS.
  • Migrated the data from AWS S3 to HDFS using Kafka.
  • Integrated Kubernetes with network, storage and security to provide a comprehensive infrastructure, and orchestrated Kubernetes containers across multiple hosts.
  • Implemented Jenkins and built pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes.
  • Experienced in loading and transforming large sets of structured and semi-structured data using the ingestion tool Talend.
  • Worked with NoSQL databases like HBase and Cassandra to retrieve and load data for real-time processing using REST APIs.
  • Worked on creating data models for Cassandra from the existing Oracle data model.
  • Responsible for transforming and loading large sets of structured, semi-structured and unstructured data.

Environment: Hadoop 2.7.7, HDFS 2.7.7, Apache Hive 2.3, Apache Kafka 0.8.2.X, Apache Spark 2.3, Spark SQL, Spark Streaming, ZooKeeper, Pig, Oozie, Java 8, Python 3, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH 5.X).

Confidential, Greenwood Village, CO

Data Engineer


  • Developed a data set process for data mining and data modeling, and recommended ways to improve data quality, efficiency and reliability.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Responsible for writing Hive queries to analyze data in the Hive warehouse using Hive Query Language (HQL). Involved in developing Hive DDLs to create, drop and alter tables.
  • Extracted data and loaded it into HDFS using Sqoop import from various sources like Oracle, Teradata, SQL Server, etc.
  • Implemented Apache Airflow for authoring, scheduling and monitoring data pipelines.
  • Created Hive staging tables and external tables and also joined the tables as required.
  • Implemented Dynamic Partitioning, Static Partitioning and also Bucketing.
  • Installed and configured Hadoop MapReduce, Hive, HDFS, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Worked on Microsoft Azure services like HDInsight clusters, Blob storage, ADLS, Data Factory and Logic Apps, and also completed a POC on Azure Databricks.
  • Implemented Sqoop jobs for data ingestion from Oracle to Hive.
  • Worked with various file formats like delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files. Mastered different columnar file formats like RC, ORC and Parquet.
  • Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after configuring the NameNode and DataNodes respectively.
  • Developed job workflows in Oozie for automating the tasks of loading the data into HDFS.
  • Implemented compact and efficient storage of big data by using file formats like Avro, Parquet and JSON, with compression methods like Gzip and Snappy applied on top of the files.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Worked on Spark using Python as well as Scala and Spark SQL for faster testing and processing of data.
  • Worked on various data modelling concepts like star schema, snowflake schema in the project.
  • Extensively used Stash, Bitbucket and GitHub for source code control.
  • Migrated MapReduce jobs to Spark jobs to achieve better performance.

Environment: Hadoop 2.7, HDFS, Microsoft Azure services (HDInsight, Blob storage, ADLS, Logic Apps, etc.), Hive 2.2, Sqoop 1.4.6, Snowflake, Apache Spark 2.3, Airflow, Spark SQL, ETL, Maven, Oozie, Java 8, Python 3, Unix shell scripting.

Confidential, New York, NY

Hadoop Data Engineer /Data Analyst


  • Developed PySpark applications in Python and implemented a PySpark data processing project to handle data from various RDBMS and streaming sources.
  • Handled importing of data from various data sources, performed data control checks using PySpark and loaded data into HDFS.
  • Involved in converting Hive/SQL queries into PySpark transformations using Spark RDDs and Python.
  • Used PySpark SQL to load JSON data, create a SchemaRDD and load it into Hive tables; handled structured data using Spark SQL.
  • Developed PySpark programs in Python and performed transformations and actions on RDDs.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
  • Implemented PySpark jobs in Python, utilizing DataFrames and the PySpark SQL API for faster processing of data.
  • Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in handling large datasets using partitions, PySpark in-memory capabilities, broadcasts in PySpark, efficient joins, transformations and other techniques during the ingestion process itself.
  • Processed schema-oriented and non-schema-oriented data using Python and Spark.
  • Involved in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
  • Worked on a streaming pipeline that uses PySpark to read data from Kafka, transform it and write it to HDFS.
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data Modeling tools.
  • Worked on the Snowflake database, writing queries and stored procedures for normalization.
  • Worked with Snowflake’s stored procedures, used procedures with corresponding DDL statements, and used the JavaScript API to easily wrap and execute numerous SQL queries.

Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.6, Linux (CentOS), MapReduce, Python 2, Eclipse 4.6.


Data Modeller/Engineer


  • When working with the open-source Apache distribution, Hadoop admins have to manually set up all the configuration files: core-site, hdfs-site, yarn-site and mapred-site. However, when working with a popular Hadoop distribution like Hortonworks, Cloudera or MapR, the configuration files are set up on startup and the Hadoop admin need not configure them manually.
  • Used Sqoop to import data from Relational Databases like MySQL, Oracle.
  • Involved in importing structured and unstructured data into HDFS.
  • Responsible for fetching real-time data using Kafka and processing using Spark and Scala.
  • Worked on Kafka to import real-time weblogs and ingested the data to Spark Streaming.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Worked on building and implementing a real-time streaming ETL pipeline using the Kafka Streams API.
  • Worked on Hive to implement web interfacing and stored the data in Hive tables.
  • Migrated MapReduce programs into Spark transformations using Spark and Scala.
  • Experienced with SparkContext, Spark SQL and Spark on YARN.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
  • Implemented data quality checks using Spark Streaming and flagged records as passable or bad.
  • Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
  • Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
  • Extensively used ZooKeeper as a backup server and job scheduler for Spark jobs.
  • Developed Spark scripts using Scala shell commands as per the business requirement.
  • Worked on Cloudera distribution and deployed on AWS EC2 Instances.
  • Experienced in loading the real-time data to a NoSQL database like Cassandra.
  • Experience in retrieving the data present in Cassandra cluster by running queries in CQL (Cassandra Query Language).
  • Worked on connecting Cassandra database to the Amazon EMR File System for storing the database in S3.
  • Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Deployed the project on Amazon EMR with S3 connectivity for setting a backup storage.
  • Well versed in using Elastic Load Balancer with Auto Scaling for EC2 servers.
  • Coordinated with the Scrum team in delivering agreed user stories on time for every sprint.

Environment: Hadoop, MapReduce, Hive, Spark, Oracle, GitHub, Tableau, UNIX, Cloudera, Kafka, Sqoop, Scala, NiFi, HBase, Amazon EC2, S3.


Data Modeler/Analyst


  • Assess and document data requirements and client-specific requirements to develop user-friendly BI solutions - reports, dashboards, and decision aids.
  • Design, develop, and maintain T-SQL - stored procedures, joins, complex sub-queries for ad-hoc data retrieval and management.
  • Develop logical and physical data flow models for ETL applications.
  • Build, test and maintain automated ETL processes to ensure data accuracy and integrity.
  • Adapt and optimize ETL processes to accommodate changes in source systems and new business user requirements.
  • Build, test, and manage BI standard reporting templates and dashboards for internal and external use. Build packages that record and update historical attributes of employees using slowly changing dimensions.
  • Configure percentage sampling transformation to identify sample respondents for survey research.
  • Apply lookup and cache transformations and use reference tables to populate missing columns in the data warehouse.
  • Troubleshoot data integration issues and debug reasons for ETL failure.
  • Undertake regular data mapping, parsing and ETL scanning.
  • Create custom reports for requested projects and modify existing queries and reports in Power BI desktop and SSRS.
  • Create complex stored procedures, triggers, functions, indexes and joins to implement business rules. Implement different types of constraints on tables for consistency.
  • Document the data environment and solutions to support end users and analysts.
