
Hadoop / Spark Developer Resume

Phoenix, AZ


  • 5 years of IT experience in Big Data analytics, including analysis, design, development, deployment and maintenance of projects using Apache Hadoop: HDFS, Amazon S3, MapReduce, YARN, Hive, Pig Latin, Impala, Hue, Sqoop, Kafka, Flume, Spark, Scala, Oozie and Hadoop APIs.
  • Experience working with Hadoop clusters on Cloudera and MapR distributions under Agile Scrum methodologies.
  • Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments working with various file formats like Avro, Parquet, JSON and ORC.
  • Experience in data ingestion, processing and analysis using Spark with Scala, Spark Streaming, Kafka, Flume, Sqoop and Shell Script.
  • Efficient in developing Sqoop jobs for migrating data from RDBMS to Hive / HDFS and vice versa.
  • Experience in developing NoSQL applications using MongoDB, HBase and Cassandra.
  • Thorough knowledge of Hadoop architecture and core components such as NameNode, DataNodes, JobTracker and TaskTrackers, along with Oozie, Hue, Flume, HBase, etc.
  • Strong experience with partitioning and bucketing in Hive; designed both managed and external tables to optimize performance.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
  • Experience in extending Hive and Pig core functionality with custom user-defined functions (UDFs).
  • Working knowledge of Oozie, a workflow scheduler system used to manage jobs that run on Pig, Hive and Sqoop.
  • Experience in Spark applications using Scala for easy Hadoop transitions.
  • Knowledge of utilizing Kafka / Flume technologies for real time data streaming and ingestion.
  • Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatch.
  • Wrote multiple customized MapReduce Programs for various Input file formats.
  • Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
  • Involved in converting SQL queries into HiveQL.
  • Designed and created Hive external tables using a shared metastore instead of the embedded Derby database, with partitioning, dynamic partitioning and buckets.
  • Experience using the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
  • Experience in integrating Hive and HBase for effective operations.
  • Designed and developed a full-text search feature with multi-tenant Elasticsearch, fed by real-time data collected through Spark Streaming.
  • Experienced in working with Apache Spark ecosystem using Spark-SQL and Scala queries on different data file formats like .txt, .csv etc.
  • Developed data pipeline for real time use cases using Kafka, Flume and Spark Streaming.
  • Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
  • An excellent team player with good organizational, interpersonal, communication and leadership skills; a quick learner with a positive attitude and the flexibility to adapt to an ever-changing industry.
  • Technically strong, with the ability to work with business users, project managers, team leads, architects and peers, helping maintain a healthy project environment.


Distributed Computing: Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Zookeeper, Hue, Impala, Oozie, Kafka, Spark

SDLC Methodologies: Agile Scrum

Relational Databases: Teradata, Netezza, Oracle, MySQL

Distributed Databases: NoSQL (HBase, Cassandra, MongoDB)

Distributed Filesystems: HDFS, Amazon S3

Distributed Query Engines: Hive, Presto

Distributed Computing Environment: Cloudera, MapR

Operating Systems: Windows, Mac OS, Unix, Ubuntu

Programming: Java, Python, Scala, UNIX, Pig Latin, HiveQL

Scripting: Shell Scripting

Version Control: GitHub

IDE: Scala IDE, PyCharm, Jupyter Notebook, Eclipse (PyDev)


Confidential, Phoenix, AZ

Hadoop / Spark Developer

Responsibilities -

  • Worked with Hadoop Ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive and Pig with MapR Hadoop distribution.
  • Wrote Pig Scripts for sorting, joining, filtering and grouping the data.
  • Developed programs in Spark based on the application for faster data processing than standard MapReduce programs.
  • Developed Spark programs in Scala, created Spark SQL queries and developed Oozie workflows for Spark jobs.
  • Developed the Oozie workflows with Sqoop actions to migrate the data from relational databases like Oracle, Teradata to HDFS.
  • Used Hadoop FS actions to move the data from upstream location to local data locations.
  • Wrote extensive Hive queries to transform data for use by downstream models.
  • Developed map reduce programs as a part of predictive analytical model development.
  • Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
  • Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop, Apache Spark and ingested streaming data into Hadoop using Spark Framework and Scala.
  • Worked with NoSQL databases such as MongoDB.
  • Extensively used Git as a code repository and VersionOne to manage the day-to-day Agile development process and to keep track of issues and blockers.
  • Wrote Spark code in Python for the model integration layer.
  • Implemented Spark using Scala, Java and utilizing Data frames and Spark SQL API for faster processing of data.
  • Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
  • Wrote new spark jobs in Scala to analyze the data of the customers and sales history.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Developed a data pipeline using Kafka, HBase, Mesos, Spark and Hive to ingest, transform and analyze customer behavioral data.

Technologies Used - MapR, Linux, Hadoop, HBase, Hive, Impala, Oracle, Spark, Scala, Python, Pig, Sqoop, Teradata, Zookeeper, Oozie, MongoDB, MapReduce, GitHub.

Confidential, Waltham, MA

Big Data Hadoop Engineer

Responsibilities -

  • Involved in designing the data pipeline using various Hadoop components like Spark, Hive, Impala, Sqoop with Cloudera distribution.
  • Developed Spark (Python API) applications in PyCharm using spark core libraries (RDD, Spark-SQL) for performing ETL transformations, thereby eliminating the need of utilizing ETL tool (SSIS/ODI).
  • Developed Spark application for Batch processing.
  • Implemented partitioning, dynamic partitioning and bucketing in Hive on internal and external tables for more efficient data access.
  • Using Open Source packages, designed POC to demonstrate Integration of Kafka/Flume with Spark Streaming for real-time data Ingestion and processing.
  • Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
  • Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
  • Performed data validation using Sqoop on the exported data.
  • Performed transformations / analysis by writing complex HQL queries in Hive and exported result to HDFS in discrete file format (JSON, AVRO, Parquet, ORC).
  • Implemented data ingestion and transformation using automated workflows using Oozie.
  • Used Cloudera Navigator to create audit reports that flag security threats and track all user and tool activity across the various Hadoop components.
  • Developed strategy for various Hadoop components used to track Data Lineage and Meta-Data extracted from Pipeline using Cloudera Navigator.
  • Designed various plots showing HDFS Analytics and Other operations performed on the environment.
  • Worked with the infrastructure team to test the environment after patches, upgrades and migrations.
  • Developed multiple python scripts for delivering End-To-End support and common routines, while maintaining product integrity.
  • Performed analytical queries using Impala (Cloudera) for swifter responses and exported result to Tableau for data visualization.
  • Documented all the applications worked on and presented them to higher-level management.

Technologies Used - Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Kafka, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.


Hadoop Developer

Responsibilities -

  • Imported data from relational data sources such as Teradata and other RDBMSs into HDFS using Sqoop.
  • Analyzed large data sets by running Hive queries and exported the results as views/ Flat-files, etc.
  • Worked with the Data Science team to gather requirements for various data mining projects and conducted POC.
  • Involved in creating Hive managed / external tables while maintaining raw files integrity and analyzed data using hive queries.
  • Worked on Spark core Libraries like RDD, Spark SQL, Spark Streaming modules of Spark extensively by handling structured and unstructured data.
  • Designed Spark applications performing ETL transformations using Python API.
  • Developed Simple to complex MapReduce Jobs using Hive.
  • Wrote Hive Queries to have a consolidated view of the mortgage and retail data.
  • Orchestrated hundreds of sqoop scripts, pig scripts, hive queries using oozie workflows and sub-workflows.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed a Python utility to validate Hive tables ingested via Sqoop against the source RDBMS tables.
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Responsible for managing data from multiple sources using Flume.
  • Loaded and transformed large data sets consisting of structured, semi-structured and unstructured data.
  • Created and maintained Technical documentation for launching HADOOP Clusters and for executing Hive queries.
  • Performed statistical analysis on datasets using Python libraries such as Pandas, NumPy and SciPy; discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
  • Integrated Tableau with Impala as a source to create interactive BI dashboard.

Technologies Used - Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.


Hadoop Developer

Responsibilities -

  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Imported and exported data into and out of HDFS and Hive using Sqoop.
  • Experience in defining job flows using Oozie and shell scripts.
  • Experienced in implementing various customizations in MapReduce at various levels by implementing custom input formats, custom record readers, partitioners and data types in java.
  • Experience in ingesting data using flume from web server logs and telnet sources.
  • Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
  • Experienced in managing disaster recovery cluster and responsible for data migration and backup.
  • Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
  • Implemented encryption and masking of sensitive customer data in Flume by building a custom interceptor that masks and encrypts the data according to rules stored in MySQL.
  • Experience in managing and reviewing Hadoop log files.
  • Extracted files from RDBMS through Sqoop and placed in HDFS and processed.
  • Experience in running Hadoop streaming jobs to process terabytes of xml format data.
  • Supported MapReduce programs running on the cluster.
  • Involved in loading data from UNIX file system to HDFS.
  • Involved in creating Hive tables, loading them with data and writing Hive queries, which run internally as MapReduce jobs.
  • Implemented Hive-HBase integration by creating Hive external tables and using the HBase storage handler.
  • Executed queries using Hive and developed MapReduce jobs to analyze data.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Developed Hive queries for the analysts.
  • Involved in loading data from LINUX and UNIX filesystem to HDFS.
  • Designed and implemented MapReduce based large scale parallel relation learning system.
  • Developed master tables in Hive using a JSON SerDe or Hive's get_json_object and json_tuple functions.
  • Designed the end-to-end data flow in HDFS so that it could be orchestrated with Oozie workflows.

Technologies Used - Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, Pig, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.


Jr. Hadoop Developer

Responsibilities -

  • Involved in creating Hive tables, loading them with data and writing Hive queries to process the data.
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive.
  • Implemented Partitioning, Bucketing in Hive for better organization of the data.
  • Worked with the team fetching live streaming data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
  • Developed a Spark Streaming program in Scala to import data from Kafka topics into HBase tables.
  • Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text file, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
  • Involved in the Design Phase for getting live event data from the database to the front-end application using Spark Ecosystem.
  • Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs using Spark SQL.
  • Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
  • Collected the log data from web servers and integrated into HDFS using Flume.
  • Responsible for managing data coming from different sources.
  • Extracted files from CouchDB and placed them into HDFS using Sqoop, then pre-processed the data for analysis.
  • Developed subqueries in Hive.
  • Partitioned and bucketed the imported data using HiveQL.
  • Partitioned data dynamically using the dynamic-partition insert feature.
  • Moved the partitioned data into different tables as per business requirements.

Technologies Used - Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Spark, Kafka, Flume, HBase, CouchDB, Apache Maven.
