
Spark/Hadoop Developer Resume

Durham, NC


  • Around 8 years of IT experience, including work as a Hadoop/Spark developer with Big Data technologies across the Hadoop and Spark ecosystems, and application development using J2EE.
  • Experience in working on various Hadoop data access components like MAPREDUCE, PIG, HIVE, HBASE, SPARK and KAFKA.
  • Experience in handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Having good knowledge on Hadoop data management components like HDFS and YARN.
  • Hands-on experience using various Hadoop workflow components like SQOOP, FLUME and KAFKA.
  • Worked on Hadoop data operation components like ZOOKEEPER and OOZIE.
  • Working knowledge of AWS technologies like S3 and EMR for storage, big data processing and analysis.
  • Good understanding of Hadoop security components like RANGER and KNOX.
  • Good experience working with Hadoop distributions such as HORTONWORKS and CLOUDERA.
  • Excellent programming skills at a higher level of abstraction using SCALA and JAVA.
  • Experience in Java programming with skills in analysis, design, testing and deploying with various technologies like J2EE, JavaScript, JSP, JDBC, HTML, XML and JUNIT.
  • Having good knowledge on Apache Spark components including SPARK CORE, SPARK SQL, SPARK STREAMING and SPARK MLLIB.
  • Experience in performing transformations and actions on Spark RDDs using Spark Core.
  • Experience in using Broadcast variables, Accumulator variables and RDD caching in Spark.
  • Experience in troubleshooting cluster jobs using the Spark UI.
  • Experience working with Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP).
  • Expert in Hadoop and Big data ecosystem including Hive, HDFS, Spark, Kafka, MapReduce, Sqoop, Oozie and Zookeeper
  • Good Knowledge on Hadoop Cluster architecture and monitoring the cluster
  • Hands-on experience in distributed systems technologies, infrastructure administration, monitoring and configuration
  • Expertise in data transformation & analysis using Spark, Hive
  • Knowledge of writing Hive Queries to generate reports using Hive Query Language
  • Hands on experience with the Spark SQL for complex data transformations using Scala programming language.
  • Developed Spark code using Python/Scala and Spark-SQL for faster testing and processing of data
  • Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce concepts
  • Extensive experience in data ingestion technologies like Flume, Kafka, Sqoop and NiFi
  • Utilized Flume, Kafka and NiFi to ingest real-time and near-real-time streaming data into HDFS from different data sources
  • Skilled in analyzing data using HiveQL and custom MapReduce programs in Java
  • Good Knowledge in working with AWS (Amazon Web Services) cloud platform
  • Good knowledge in Unix shell commands
  • Experience in analyzing log files for Hadoop ecosystem services to find root causes, and in setting up and managing the batch scheduler on Oozie
  • Thorough knowledge of Release management, CI/CD process using Jenkins and Configuration management using Visual Studio Online
  • Experience in extracting data from RDBMSs into HDFS using Sqoop ingestion, and collecting logs from a log collector into HDFS using Flume
  • Used Project Management services like JIRA for handling service requests and tracking issues.
  • Good experience with Software methodologies like Agile and Waterfall.
  • Experienced working with Zookeeper to provide coordination services to the cluster
  • Skilled in Tableau 9 for data visualization, Reporting and Analysis
  • Extensively involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation of the projects, using Agile and Waterfall methodologies
  • Good team player with ability to solve problems, organize and prioritize multiple tasks.
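The RDD transformation/action pattern mentioned above (lazy map/filter transformations that only materialize when an action such as collect or reduce runs) can be sketched without a cluster. This is a minimal pure-Python analogue, not Spark itself; the class and names are illustrative only:

```python
from functools import reduce

# A tiny stand-in for an RDD lineage: transformations are recorded lazily
# and only evaluated when an action is invoked (as Spark does at job time).
class FakeRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded (lazy) transformations

    def map(self, fn):
        return FakeRDD(self.data, self.ops + [("map", fn)])

    def filter(self, fn):
        return FakeRDD(self.data, self.ops + [("filter", fn)])

    def collect(self):                # action: evaluate the whole lineage
        out = self.data
        for kind, fn in self.ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

    def reduce(self, fn):             # action: fold the evaluated data
        return reduce(fn, self.collect())

rdd = FakeRDD([1, 2, 3, 4, 5]).map(lambda x: x * x).filter(lambda x: x > 4)
print(rdd.collect())                   # [9, 16, 25]
print(rdd.reduce(lambda a, b: a + b))  # 50
```

Nothing executes until `collect` or `reduce` is called, which mirrors why Spark can optimize and re-run a whole lineage from source data.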


Data Access Tools: Hive, Pig, HBase, Solr, Impala, Spark Core, Spark SQL, Spark Streaming

Data Management: HDFS, YARN

Data Workflow: Sqoop, Flume, Kafka

Data Operation: Zookeeper, Oozie

Data Security: Ranger, Knox

Big Data Distributions: Hortonworks, Cloudera

Cloud Technologies: AWS (Amazon Web Services) EC2, S3, IAM, CloudWatch, DynamoDB, SNS, SQS, EMR, Kinesis

Programming & Languages: Java, Scala, Pig Latin, HQL, SQL, Shell Scripting, HTML, CSS, JavaScript

IDE/Build Tools: Eclipse, IntelliJ

Java/J2EE Technologies: XML, Junit, JDBC, AJAX, JSON, JSP

Operating Systems: Linux, Windows, Kali Linux

SDLC: Agile/SCRUM, Waterfall


Confidential, Durham NC

Spark/Hadoop Developer


  • Implemented Spark using Scala and utilizing Data frames and Spark SQL API for faster processing of data
  • Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, Data frames and Spark SQL APIs.
  • Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze customer behavioral data.
  • Worked on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
  • Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka.
  • Developed Spark jobs and Hive Jobs to summarize and transform data
  • Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
  • Experienced in developing Spark scripts for data analysis in Scala.
  • Used Spark-Streaming APIs to perform necessary transformations.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
  • Worked with Spark to consume data from Kafka and convert it to a common format using Scala.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
  • Wrote new Spark jobs in Scala to analyze customer data and sales history.
  • Involved in requirement analysis, design, coding and implementation phases of the project.
  • Used Spark API over Hadoop YARN to perform analytics on data in Hive.
  • Experience with both the SQLContext and SparkSession entry points.
  • Developed Scala based Spark applications for performing data cleansing, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
  • Worked on troubleshooting Spark applications to make them more fault tolerant.
  • Involved in HDFS maintenance and loading of structured and unstructured data; imported data from mainframe datasets into HDFS using Sqoop and wrote PySpark scripts to process the HDFS data.
  • Extensively worked on the core and Spark SQL modules of Spark.
  • Involved in Spark and Spark Streaming creating RDD's, applying operations -Transformation and Actions.
  • Created partitioned tables and loaded data using both static partition and dynamic partition method.
  • Implemented POCs on migrating to Spark Streaming to process live data.
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
  • Ingested data from RDBMSs, performed data transformations, and then exported the transformed data to HDFS as per the business requirement.
  • Used Impala to read, write and query the data in HDFS.
  • Stored output files on HDFS for export, where they were later picked up by downstream systems.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
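The dynamic-partitioning work described above routes each record to a Hive partition derived from the record's own column values rather than a fixed clause. A minimal Python analogue of that routing logic (column names and partition paths are hypothetical, and this stands in for what Hive/Spark does on write):

```python
from collections import defaultdict

# Hypothetical input rows; 'country' plays the role of the partition column.
records = [
    {"id": 1, "country": "US", "amount": 120.0},
    {"id": 2, "country": "IN", "amount": 75.5},
    {"id": 3, "country": "US", "amount": 42.0},
]

# Dynamic partitioning: the partition a row lands in comes from the row
# itself, producing one directory per distinct partition value.
partitions = defaultdict(list)
for row in records:
    partitions[f"country={row['country']}"].append(row)

for path, rows in sorted(partitions.items()):
    print(path, len(rows))
# country=IN 1
# country=US 2
```

A static partition, by contrast, would write every row into one partition named up front in the load statement.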

Environment: Hadoop 2.x, Spark Core, Spark SQL, Spark Streaming, Scala, PySpark, Hive, Pig, Kafka, Oozie, Amazon EMR, Tableau, Impala, RDBMS, HDFS, YARN, JIRA, MapReduce.

Confidential, New York, NY

Spark/Hadoop Developer


  • Responsible for collecting, cleaning and storing data for analysis using Kafka, Sqoop, Spark and HDFS
  • Used Kafka and Spark framework for real time and batch data processing
  • Ingested large amount of data from different data sources into HDFS using Kafka
  • Implemented Spark using Scala and performed cleansing of data by applying Transformations and Actions
  • Used Case Classes in Scala to convert RDDs into DataFrames in Spark
  • Processed and analyzed data stored in HBase and HDFS
  • Developed Spark jobs using Scala on top of Yarn for interactive and Batch Analysis.
  • Developed Unix shell scripts to load large numbers of files into HDFS from the Linux file system.
  • Experience in querying data using Spark SQL for faster processing of the data sets.
  • Offloaded data from EDW into Hadoop Cluster using Sqoop.
  • Developed Sqoop scripts for importing and exporting data into HDFS and Hive
  • Created Hive internal and external tables with partitioning and bucketing for further analysis using Hive
  • Used Oozie workflow to automate and schedule jobs
  • Used Zookeeper for maintaining and monitoring clusters
  • Exported the data into RDBMS using Sqoop for BI team to perform visualization and to generate reports
  • Continuously monitored and managed the Hadoop Cluster using Cloudera Manager
  • Used JIRA for project tracking and participated in daily scrum meetings
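The case-class-to-DataFrame step above attaches a schema to raw tuples so they can be queried like a table. A rough Python analogue (field names are illustrative) uses a dataclass the way a Scala case class is used, then aggregates the rows like a grouped DataFrame:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Sale:            # plays the role of a Scala case class: named, typed fields
    region: str
    amount: float

raw = [("east", 10.0), ("west", 5.0), ("east", 7.5)]
sales = [Sale(region, amount) for region, amount in raw]  # tuples -> schema'd rows

# Equivalent in spirit to df.groupBy("region").sum("amount")
totals = defaultdict(float)
for s in sales:
    totals[s.region] += s.amount

print(dict(totals))  # {'east': 17.5, 'west': 5.0}
```

In Spark the same idea is `rdd.map { case (r, a) => Sale(r, a) }.toDF()`, after which Spark SQL can optimize the grouped aggregation.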

Environment: Spark, Sqoop, Scala, Hive, Kafka, YARN, Teradata, RDBMS, HDFS, Oozie, Zookeeper, HBase, Tableau, Hadoop (Cloudera), JIRA

Confidential, New York, NY

Hadoop Developer


  • Actively participated in interaction with users to fully understand the requirements of the system
  • Experience with the Hadoop ecosystem and NoSQL database
  • Migrated the needed data from Oracle and MySQL into HDFS using Sqoop, and imported various formats of flat files into HDFS
  • Imported data from RDBMS (MySQL, Teradata) to HDFS and vice versa using Sqoop (Big Data ETL tool) for Business Intelligence, visualization and report generation
  • Worked with Kafka to get near-real-time data onto the big data cluster and the required data into Spark for analysis
  • Used Spark Streaming to receive near-real-time data from Kafka and stored the streaming data, using Scala, in HDFS and in NoSQL databases such as Cassandra
  • Involved in Analyzing data by writing queries using HiveQL for faster data processing
  • Designing and creating Hive external tables using shared meta-store instead of derby with partitioning, dynamic partitioning and buckets
  • Optimized queries in Hive to increase performance and query execution time
  • Involved in writing Flume and Hive scripts to extract, transform and load the data into Database
  • Created tables in DataStax Cassandra and loaded large sets of data for processing
  • Worked on Oozie workflows, coordinators to run multiple Hive jobs
  • Used Git for version control, JIRA for project tracking and Jenkins for continuous integration
  • Utilized Agile and Scrum methodology to help manage and organize a team of developers with regular code review sessions.
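Spark Streaming consumes sources like Kafka in small time-sliced micro-batches, transforming each batch and writing results to a sink such as HDFS or Cassandra. This dependency-free sketch illustrates that micro-batch pattern; the queue, message format, and sink list are stand-ins, not real Kafka or HDFS APIs:

```python
from collections import deque

# Stand-in for a Kafka topic: messages waiting to be consumed.
incoming = deque(["click:home", "click:cart", "view:item", "click:home"])
BATCH_SIZE = 2
sink = []  # stand-in for an HDFS/Cassandra write path

def process_batch(batch):
    # Transformation step: parse messages and keep only click events.
    return [msg.split(":")[1] for msg in batch if msg.startswith("click:")]

# Micro-batch loop: drain a fixed-size slice, transform it, write it out.
while incoming:
    batch = [incoming.popleft() for _ in range(min(BATCH_SIZE, len(incoming)))]
    sink.extend(process_batch(batch))   # output step, once per micro-batch

print(sink)  # ['home', 'cart', 'home']
```

In real Spark Streaming the batch interval is time-based rather than count-based, and fault tolerance comes from checkpointing the Kafka offsets rather than draining a queue.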

Environment: HDFS, Kafka, Sqoop, Scala, Java, Hive, Oozie, NoSQL, Oracle, MySQL, Git, Zookeeper, DataStax Cassandra, Hortonworks Data Platform, Jenkins, JIRA, Agile (Scrum).
