Hadoop/Spark Developer Resume

Redmond, WA


  • Around 8 years of full Software Development Life Cycle experience in software and system analysis, design, development, testing, deployment, maintenance, enhancement, re-engineering, migration, troubleshooting and support of multi-tiered web applications in high-performing environments.
  • 4 years of comprehensive experience in the Big Data technology stack, including Spark Core, Spark SQL, Spark Streaming, Kafka streaming and Kafka security.
  • Working knowledge of the AWS environment and Spark on AWS, with strong experience in cloud computing platforms such as AWS services and Google Cloud Services.
  • Expertise in deploying Hadoop architecture and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce/YARN concepts.
  • Good experience with setting up and configuring a Hadoop cluster on cloud infrastructure like Amazon Web Services (EC2 and S3), Azure and ADF.
  • Experience with commercial distributions of Hadoop, including HDP (Hortonworks Data Platform) and CDH.
  • Very good understanding of NoSQL databases such as Cassandra.
  • Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, preparing estimates, designing custom solutions, development and production support.
  • Excellent interpersonal and communication skills, creative, research-minded, technically competent and result-oriented with problem solving and leadership skills.


Confidential, Redmond, WA

Hadoop/Spark Developer


  • Parsed JSON data using the Spark explode function in Scala.
  • Handled importing of data from various data sources, performed transformations using Spark and Python, and loaded data into Cassandra.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Optimized the PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Analysed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed the PySpark code for AWS Glue jobs and for EMR.
  • Wrote live real-time processing jobs using Spark Streaming with Kafka on AWS EMR.
  • Worked with AWS Glue jobs to transform data to a format that optimizes query performance for Athena.
  • Used the Spark API on an EMR cluster running Hadoop YARN to perform analytics on data.
  • Analysed real-time data from Kafka using Spark Streaming.
  • Installed the Apache Kafka cluster and Confluent Kafka open source in different environments.
  • Installed Confluent Enterprise in Docker and Kubernetes in an 18-node cluster.
  • Deployed the corresponding subject areas into DEV, UAT and Prod environments.
  • Addressed technical and operational challenges while facing issues in production using Kafka security.
  • Estimated effort for new applications to be developed and the code changes pertaining to them.
  • Installed and configured MapReduce, HiveQL and HDFS; implemented a CDH3 Hadoop cluster on CentOS. Assisted with performance tuning and monitoring.
  • Completed production programming, deployments and deliveries within defined SLAs.
  • Negotiated, concluded and monitored service level agreements (SLAs) with key stakeholders.
  • Ensured SLAs are clearly defined and agreed to by the Business Partner.
  • Imported and exported data between MySQL and HDFS using the ETL tool Sqoop.
  • Involved in tuning and debugging of existing ETL processes.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3.
  • Performed data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark.
  • Queried and analysed data from Cassandra for quick searching, sorting and grouping.
  • Derived business insights from extremely large datasets using Google BigQuery.
  • Enabled and automated data pipelines for moving data from Oracle and DB2 source tables to Hadoop and Google BigQuery, using GitHub for source control and Jenkins.
  • Used Teradata and Oracle databases with the Informatica DW tool to load source data.
  • Created Teradata schemas with constraints and created macros, functions and procedures in Teradata. Loaded the data using the FastLoad utility.
  • Developed batch and streaming workflows with the built-in Stonebranch scheduler and bash scripts to automate the Data Lake systems.
  • Involved in writing SCOPE scripts in Data Lake and HDFS to structure the petabytes of unstructured data stored in Azure Data Lake.
  • Created various parser programs to extract data from Confidential, Confidential Business Objects, XML, Informatica, Java, and database views using Scala.
  • Strong programming skills in designing and implementation of multi-tier applications using web-based technologies like Spring MVC and Spring Boot
  • Developed a server-side application to interact with the database using Spring Boot and Hibernate.
  • Worked on data masking/de-identification using Informatica.
  • Extensively used the Akka actor architecture for scalable and hassle-free multi-threading.
  • Worked on integrating independent microservices for real-time bidding (Scala/Akka, Firebase, Cassandra, Elasticsearch).
  • Involved in developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.
  • Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, KSQL, Sqoop, Pig, Hive, Impala and NoSQL databases.
  • Involved in building ETL pipelines on Kubernetes with Apache Airflow and Spark in GCP.
  • Worked with Docker-based containers for running Airflow.
  • Completed proofs of concept (POCs) on newly adopted technologies such as Apache Airflow, Snowflake and GitLab.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Involved in SQL query tuning and Informatica Performance Tuning.
  • Used Spark SQL to read data from external sources and processed the data using the Scala computation framework.
  • Involved in writing efficient search queries against Solr indexes using the Solr REST/Java API.
  • Developed microservices using Spring Boot and core Java/J2EE hosted on AWS.
  • Extensive experience with search engines like Apache Lucene, Apache Solr and Elasticsearch.
  • Experienced in deploying the Apache Solr/ZooKeeper cluster in the cloud and on premises, and in working on data storage and disaster recovery for Solr/ZooKeeper.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Good Knowledge of Atlassian SDK, JIRA Java API, JIRA REST APIs and JIRA plugin development.
  • Developed a Python-based API for converting files sourced to the Splunk Forwarder into key-value pairs.
  • Worked on multiple data formats like JSON, ORC and Parquet on HDFS using PySpark.
  • Enhanced and provided core design impacting the Splunk framework and components.
  • Built and published customized interactive reports and dashboards, and scheduled reports using Tableau Server.
  • Deployed various microservices such as Spark, MongoDB and Cassandra in Kubernetes and Hadoop clusters using Docker.
  • Involved in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
  • Deployed Kubernetes in both AWS and Google Cloud. Set up the cluster and replicator. Deployed multiple containers in a pod.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Automated applications and MySQL container deployment in Docker using Python
  • Deployed Resource Manager templates. Resource Manager deployment is the deployment method used by Data Factory in the Azure portal.
  • Assigned the built-in contributor role on the Azure data factory resource for the user.
  • Developed Spark SQL queries and shell scripts for transforming and processing huge amounts of data.
  • Worked on Data Migration and Data Reconciliation Projects.
  • Coordinated with other teams and the release management team for deployment-related activities.
  • Provided production support and fixes for the migrated applications.
  • Coordinated team meetings, client meetings and scrum calls for day-to-day activities.
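
As an illustration of the key-value conversion for Splunk-bound files listed above, here is a minimal sketch. It assumes CSV input with a header row; the function name `rows_to_key_value_lines` and the sample data are hypothetical and are not taken from the original API.

```python
import csv
import io


def rows_to_key_value_lines(csv_text):
    """Convert CSV text (with a header row) into key="value" lines.

    Hypothetical helper: one output line per input row, which is a
    format that log indexers commonly parse as field/value pairs.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        pairs = []
        for key, value in row.items():
            # Treat missing values as empty and escape embedded double quotes.
            escaped = (value or "").replace('"', '\\"')
            pairs.append(f'{key}="{escaped}"')
        lines.append(" ".join(pairs))
    return lines


# Assumed sample input, for illustration only.
sample = "host,status,bytes\nweb01,200,512\nweb02,404,0\n"
for line in rows_to_key_value_lines(sample):
    print(line)
```

Each row becomes one line of space-separated `key="value"` pairs, so the header names carry through as field names.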

Technologies: HDFS, Pig, Hive QL, Oozie, Zookeeper, MapReduce, Google Cloud, MongoDB, AWS, Azure, Java 1.8, Scala 2.11.8, Hive, BigQuery, Java API, YARN, Apache Spark 1.6.1, SOLR, Cassandra, Tableau, Sqoop, Kafka, Python, Shell Scripting.

Confidential, Dallas, TX

Hadoop/Spark Developer


  • Migrated 160 tables from Oracle to Cassandra using Apache Spark.
  • Handled importing of data from various data sources, performed transformations using Spark and loaded data into Cassandra.
  • Worked on the Core, Spark SQL and Spark Streaming modules of Spark extensively.
  • Used Scala to write code for all Spark use cases.
  • Assigned a name to each column using the case class option in Scala.
  • Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Created Airflow scheduling scripts in Python.
  • Involved in creating, debugging, scheduling and monitoring jobs using Airflow and Oozie.
  • Developed distributed applications using the actor model with Akka for extreme scalability.
  • Worked with the Play framework and Akka parallel processing.
  • Used modern technologies like Scala, the Spray framework, Akka and the Play framework.
  • Involved in developing microservices with Spring Boot using Java and with the Akka framework using Scala.
  • Built data pipelines using Hadoop ecosystem components such as Hive, Spark and Airflow.
  • Served as administrator for Pig, Hive and HBase, installing updates, patches and upgrades.
  • Involved in migrating objects from Teradata to Snowflake.
  • Developed reusable components in Informatica.
  • Worked on the Hortonworks-based Hadoop platform deployed on a 120-node cluster to build the Data Lake, utilizing Spark, Hive and NoSQL for data processing.
  • Tuned Hive queries, SQL queries and Informatica sessions.
  • Migrated mappings from lower to higher environments in Informatica.
  • Enhanced search query performance based on Splunk search queries.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
  • Built a real-time pipeline for streaming data using Kafka/Microsoft Azure Queue and Spark Streaming.
  • Used Spark Streaming to collect this data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a Cassandra cluster.
  • Installed Confluent Kafka, applied security to it and monitored it with Confluent Control Center.
  • Implemented a real-time log analytics pipeline using Confluent Kafka, Storm and Elasticsearch.
  • Tuned Spark Streaming performance, e.g. by setting the right batch interval and the correct level of parallelism, selecting the correct serialization, and tuning memory.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns.
  • Responsible for estimating the cluster size, and for monitoring and troubleshooting the Spark Databricks cluster.
  • Deployed the production environment using AWS EC2 instances and ECS with Docker.
  • Involved in building and maintaining Docker container clusters managed by Kubernetes on GCP, using Linux, Bash, Git and Docker. Utilized Kubernetes and Docker for the runtime environment of the CI/CD system to build, test and deploy.
  • Queried and analysed data from Cassandra for quick searching, sorting and grouping.
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
  • Worked with Cassandra UDT (User-Defined Type) extensively.
  • Involved in Spark-Cassandra data modelling.

Technologies: Java 8, Scala 2.11.8, Hive QL, HDFS, YARN, Apache Spark 1.6.1, Apache Kafka, Cassandra 2.1.12, CDH 5.x, Sqoop, Python, Airflow, Oracle 12c.

Confidential, Southfield, MI

Hadoop Developer


  • Involved in all phases of development activities from requirements collection to production support.
  • Migrated data from different RDBMS systems and focused on migrating from the Cloudera distribution to Azure to reduce project cost.
  • Worked on deploying and tuning live Hortonworks production HDP (Hortonworks Data Platform) clusters.
  • Worked with different data feeds such as JSON, CSV and XML, and implemented the data lake concept.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Very good experience with UNIX shell scripting and Python.
  • Used UNIX scripts to run Teradata DDL in BTEQ and write to a log table.
  • Good experience in developing ETL scripts for data cleansing and transformation.
  • Used Pig as an ETL tool to do transformations, event joins, bot traffic filtering and some pre-aggregations before storing the data in HDFS.
  • Developed end to end ETL batch and streaming data integration into Hadoop (MapR), transforming data.
  • Expertise in designing Python scripts to interact with middleware/back-end services.
  • Worked on Python scripts to analyse customer data.
  • Used Jira for bug tracking.
  • Responsible for coding batch pipelines, RESTful services, MapReduce programs and Hive queries, as well as testing, debugging, peer code review, troubleshooting and maintaining status reports.
  • Loaded data into MongoDB.
  • Used Git to check in and check out code changes.

Technologies: Hadoop-Java API, Hortonworks, Linux, Python, HDFS, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, MapReduce, Google Cloud, MongoDB, AWS EC2, EMR, Azure, Jenkins, SOLR, Restful Service, Teradata, Tableau, Amazon Data Pipeline.

Confidential, Philadelphia PA

Big Data/Hadoop Developer


  • Involved in the complete SDLC of the project, including requirements gathering, design documents, development, testing and production environments.
  • Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solutions from disparate sources.
  • Developed Java MapReduce programs to transform log data into a structured form.
  • Developed MapReduce programs using Python programming.
  • Developed optimal strategies for distributing the web log data over the cluster; importing and exporting the stored web log data into HDFS and Hive using Sqoop.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL
  • Involved in Agile methodologies, daily scrum meetings and sprint planning.
  • Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, manage and review data backups and log files.
  • Collected log data from web servers and integrated it into HDFS using Flume, queries and Pig scripts.

Technologies: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Zookeeper, Java, ETL, Linux, Oozie, Maven, Shell Scripting, Python, RHEL, Rational tools, Hortonworks HDP.


Software Engineer


  • Analysed, designed and developed J2EE-based applications using Struts, Spring and Hibernate.
  • Involved in developing the user interface using Struts and worked with JSP, Servlets, JSF, JSTL/EL.
  • Worked with JDBC and Hibernate.
  • Configured and Maintained Subversion version control.
  • Implemented Data Access Object, MVC design patterns.
  • Worked in an Agile methodology.
  • Worked with both SOAP and RESTful web services.
  • Used PL/SQL for queries and stored procedures in Oracle as the backend RDBMS.
  • Worked with Complex SQL queries, Functions and Stored Procedures.
  • Developed Test Scripts using JUnit and JMockit.
  • Used core Java features, including generics and annotations.
  • Involved in refactoring the existing code.
  • Implemented Struts, the Spring REST API and J2EE design patterns such as MVC, DAO, Singleton and DTO.
  • Improved the Splunk tool's reporting mechanisms for clients.
  • Developed Web Services using XML messages that use SOAP.
  • Developed Spring Configuration file to define data source, beans and Hibernate properties.
  • Used WebSphere Application Server to deploy the application.
  • Used SVN for version control.
  • Designed middleware components as POJOs (Plain Old Java Objects, such as JavaBeans).
  • Developed controller and bean classes using Spring and configured them in the Spring configuration file.
  • Worked with the Struts Validation Framework to implement client-side and server-side validations.
  • Worked with the log4j utility to implement runtime log events.
  • Worked with ANT and Maven to develop build scripts.
  • Worked with Hibernate, JDBC to handle data needs.
  • Configured Development Environment using Tomcat and Apache Web Server.

Technologies: Struts 1.x/2.x, Spring 2.0, J2SE 1.6, JEE 6, JSP 2.1, J2EE Design Patterns, HTML 5, JavaScript, Java API, JSF, jQuery 1.6/1.7, jQuery UI, XML, Servlets 2.5, WSDL, JUnit, JMockit, CSS, AJAX, Apache 2.0, Java Beans, Tomcat 5.5, Oracle 9i/10g, Oracle Application Server.
