AWS Spark Developer Resume

New Jersey


  • 7 years of experience in the IT industry as a software engineer, including 3 years of design and development using Hadoop big data ecosystem tools.
  • Experience in using Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, Kafka, and crontab.
  • Experience in developing ETL applications on large volumes of data using MapReduce, PySpark, Spark SQL, Hive, and Pig.
  • Well versed in big data on AWS cloud services, including EC2, S3, EMR, Lambda, and CloudWatch.
  • Well versed in using the MapReduce programming model to analyze data stored in HDFS, with experience in writing MapReduce code in Java per business requirements.
  • Experience in importing and exporting data between RDBMS and HDFS/Hive using Sqoop.
  • Expert in creating Spark UDFs in Python to analyze data sets for complex aggregation requirements.
  • Experience in developing Spark jobs to run entity resolution on PII data for the United States population.
  • Responsible for running, maintaining, and optimizing Spark jobs, along with data validation and automation.
  • Experience in performing extensive data analysis in Spark using complex DataFrame and RDD transformations, and in creating external Hive tables on top of the aggregated data for downstream consumption.
  • Used shell scripting to automate the triggering of Spark jobs.
  • Used HBase for real-time, low-latency reads and writes for multiple applications.
  • Well versed in developing complex SQL queries using Hive and Spark SQL.
  • Experienced in preparing and executing unit test plans and unit test cases during software development.
  • Strong understanding of object-oriented programming concepts and implementation.
  • Experience in providing training and guidance to new team members on the project.
  • Experience in detailed system design using use case analysis, functional analysis, and modeling programs with class, sequence, activity, and state diagrams using UML and Rational Rose.
  • Used JIRA for creating tasks, logging work, and tracking issues.
  • Very good experience in customer specification study, requirements gathering, and system architectural design, and in turning requirements into the final product or service.
  • Experience in interacting with customers and working at client locations for real-time field testing of products and services.
  • Ability to communicate and work effectively with associates at all levels within the organization.
  • Strong background in mathematics, with very good analytical and problem-solving skills.


Programming: Python, Core Java and SQL.

Big Data Ecosystem: HDFS, YARN, MapReduce, Spark Core, Spark SQL, Impala, Hive, Pig, Kafka and Sqoop

AWS Stack: EC2, S3, EMR, Lambda and CloudWatch

ELK Stack: Elasticsearch, Logstash and Kibana

Scripting Languages: UNIX Shell scripting and Python scripting.

DBMS / RDBMS: Oracle 11g, SQL Server

Version Control: Git and Bitbucket

CI/CD Tool: Jenkins


Confidential, New Jersey

AWS Spark Developer


  • Develop software to combine data from legacy databases and files into various data marts using PySpark.
  • Worked extensively on AWS components such as Elastic MapReduce (EMR), DynamoDB, Lambda, and RDS.
  • Worked with Parquet files and converted data between formats. Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Deployed a scalable Hadoop cluster on AWS using S3 as the underlying file system for Hadoop.
  • Connected to a MySQL database through the Spark driver.
  • Experienced in Agile methodologies, including Scrum stories and sprints, in a Python-based environment, along with data analytics, data wrangling, and Excel data extracts.
  • Develop and run serverless Spark-based applications using the AWS Lambda service and PySpark to compute metrics for various business requirements.
  • Develop Python, shell scripting, and Spark-based applications using the PyCharm and Anaconda development environments.
  • Use Git as the version control tool for maintaining software and Jenkins as the continuous integration and continuous deployment tool for deploying applications to production servers.
  • Push AWS EMR and AWS Lambda logs to Elasticsearch using Logstash for log analysis.
  • Create dashboards from documents in Elasticsearch, using Kibana as the visualization tool.
  • Analyze failed jobs on the AWS EMR Spark cluster, identify the causes of failure, and improve PySpark job performance to minimize or avoid failures.
  • Use JIRA for updating the tasks and stories created by the Agile lead.
  • Analyze, store, and process data captured from different sources using AWS cloud services such as S3, CloudFormation, Lambda, and EMR.
  • Automate jobs using shell scripting and schedule those jobs to run at a specific time using crontab.
  • Test and validate the developed applications in development and QA environments and deploy them in Production environment.
  • Develop unit test cases for the software developed before deploying it in production servers.
  • Use Git with Jenkins to integrate the work done by all team members.
  • Follow the Agile Scrum methodology, along with JIRA, for the project.
  • Responsible for debugging and troubleshooting the running applications in production.
  • Participated in writing scripts for test automation.

Environment: Spark Core, AWS S3, EMR, Lambda, CloudFormation, CloudWatch, Python, Hive, Presto, Crontab, Elasticsearch and Kibana.

Confidential, Texas

Hadoop Developer


  • Extensively used Spark, including RDDs, DataFrames, and Spark SQL, to develop multiple applications in both Python and Scala.
  • Built multiple data pipelines using Pig scripts to process data for specific applications.
  • Used different file formats such as Parquet, Avro, and ORC for storing and retrieving data in Hadoop.
  • Used Spark Streaming to consume event-based data from Kafka and joined this data set with existing Hive table data to generate performance indicators for an application.
  • Developed analytical queries on different tables using Spark SQL to find insights, and built data pipelines that data scientists consume for applying ML models.
  • Tuned Spark performance by choosing optimal parallelism and the serialization format used when shuffling data, using broadcast variables, and optimizing joins, aggregations, and memory management.
  • Wrote multiple custom Sqoop import scripts to load data from Oracle into HDFS directories and Hive tables.
  • Used NiFi for automating and managing data flows between multiple systems.
  • Used compression codecs such as Snappy and Gzip when storing data in Hive tables to improve performance.
  • Used Impala for faster querying to generate reports for a time-critical application.
  • Used HBase for OLTP workloads for an application requiring high scalability on Hadoop.
  • Wrote Sqoop export scripts to write data from HDFS into an Oracle database.
  • Used Control-M to simplify and automate different batch workload applications.
  • Worked closely with multiple data science and machine learning teams to build a data ecosystem to support AI.
  • Developed a Java-based application to automate most of the manual work in onboarding a tenant to a multi-tenant environment. This saves around 4 to 5 hours of manual work per tenant per person every day.
  • Applied different job tuning techniques while processing data using the Hive and Spark frameworks to improve job performance.

Environment: Spark Core, Spark Streaming, Core Java, Python, Hive, Impala, HBase, Sqoop, Kerberos (security), LDAP, and Control-M.


Software Engineer


  • Analyzed the Hadoop cluster and different big data analytics tools, including Pig, Hive, and Sqoop.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
  • Developed an ETL framework using Spark, Pig and Hive.
  • Developed Spark programs for batch and real-time processing that consume incoming data streams from Kafka sources, transform them into DataFrames, and load the DataFrames into Hive and HDFS.
  • Experience in developing SQL scripts using Spark for handling different data sets and verifying the performance of MapReduce jobs.
  • Developed Spark programs using the Spark SQL library to perform analytics on data in Hive.
  • Developed various Java UDFs for use in both Hive and Impala to simplify handling of various requirements.
  • Created multiple MapReduce jobs in Pig and Hive for data cleaning and preprocessing.
  • Created Hive views and tables to provide a SQL-like interface.
  • Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
  • Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
  • Used Hive to analyze partitioned data and compute various metrics for reporting.
  • Converted Impala queries into Hive scripts that can be run directly from shell commands for higher performance.
  • Created shell scripts that can be scheduled using Oozie workflows and Oozie coordinators.
  • Developed Oozie workflows to generate monthly report files automatically.
  • Managed and reviewed Hadoop log files.
  • Exported data from HDFS into an RDBMS using Sqoop for report generation and visualization.

Environment: Hadoop, MapReduce, Sqoop, HDFS, Hive, Pig, Oozie, Java, Oracle 10g, MySQL, and Impala.
