
Hadoop, Spark Developer/Data Engineer Resume

Plano, TX


  • 6+ years of overall IT experience across all phases of the Software Development Life Cycle (SDLC), with skills in data analysis, design, development, testing and deployment of software systems.
  • 4 years of relevant experience in design and development of Big Data analytics using Apache Hadoop ecosystem components: MapReduce, HDFS, HBase, Hive, Impala, Sqoop, Pig, Oozie, ZooKeeper, Spark and Informatica BDE 10.x.
  • 2+ years of relevant experience in mainframe applications using COBOL, IBM DB2, Oracle DB, JCL, CICS, Control-M, Jira, CHGMAN and SQL.
  • Good knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, ResourceManager, NameNode and DataNode, and of MapReduce concepts.
  • Wrote Hive queries and Pig scripts for data analysis to meet requirements.
  • Extended the functionality of Hive and Pig by writing UDFs.
  • Experience in importing and exporting data between HDFS and RDBMS using Sqoop.
  • Worked on MapReduce programs in Java for ETL operations with multiple file formats, including XML, JSON, CSV and other compressed file formats.
  • Used Flume to collect log files and write them into HDFS for analysis.
  • Worked with the HBase NoSQL database for faster reads and writes using its index-based architecture.
  • Experienced with job schedulers such as Oozie and the Control-M workflow engine.
  • Good experience with big data ingestion, storage, processing and analysis.
  • Used SVN and GitHub for version control, pushing and pulling code to stay in sync with the repository.
  • Worked with AWS cloud services such as EC2, S3 and RDS.
  • Worked on Kerberized clusters.
  • Replaced existing MapReduce jobs with Spark transformations and actions for faster in-memory operations.
  • Developed Spark SQL jobs on Hive tables to load data into HDFS and run queries on top of it.
  • Applied Spark best practices such as partitioning, caching and checkpointing for performance, and wrote UDFs.
  • Worked on migrating mappings from Informatica Developer (BDE) to Spark.


Hadoop Ecosystem: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Kafka, Impala, ZooKeeper, CDH, Spark

Spark Components: Apache Spark, DataFrames, Spark SQL

Programming Languages: SQL, Java, Pig Latin, HiveQL, COBOL, Scala, Python

Databases: IBM DB2, VSAM, MySQL, HBase, Cassandra 2.1, Oracle DB, AWS

Operating Systems: Windows, UNIX, Linux Distributions

IDEs & other tools: Eclipse, NetBeans, IntelliJ, Informatica BDE


Confidential, Plano, TX

Hadoop, Spark Developer/Data Engineer


  • Worked in Scrum-based projects with weekly/bi-weekly sprints.
  • Gathered requirements from the business and converted them into technical implementations.
  • Worked with Informatica BDE 10.1 to leverage the Informatica UI tool and Hadoop cluster computing capabilities.
  • Worked with native and Blaze execution modes and created multiple mappings and mapplets.
  • Imported data quality rules into BDE using reference data from Oracle.
  • Converted the existing Informatica Developer mappings to a dynamic Spark framework (single-table approach) using a few driver and logging tables.
  • Implemented custom functions in BDE using the SHA-1 algorithm to support UUID5 functionality in Informatica.
  • Worked on Informatica workflows, application deployment and running applications through Control-M jobs.
  • Worked on CDH 5.4 and 5.8.4 distributions.
  • Worked with Hadoop ecosystem components such as HDFS, Hive, Impala, YARN, Oozie, HBase, Pig, Sqoop, Hue and Spark APIs.
  • Used optimization techniques on Hive tables such as partitioning, bucketing and map-side joins.
  • Used Impala only for small tactical queries, since an Impala query does not recover if a node fails during execution.
  • Designed and developed pipelines from different sources using Sqoop, Informatica BDE and Spark to build the data model.
  • Used SVN and Git for version control of code.
  • Worked on an HBase POC for a use case involving many concurrent reads and writes.
  • Performed ETL on large sets of structured data using Spark, since Spark scales well as data size grows.
  • Created UDFs in Spark for UUID3, UUID5 and random UUIDs.
  • Worked with the Spark RDD, Dataset and SQL APIs.
  • Used both Scala and Python for Spark development.
  • Resolved the small-files issue in Spark jobs by using repartition/coalesce or custom methods, which produced fewer partitions and faster query execution.
  • Tuned Spark configuration settings such as memory overhead, spark.driver.maxResultSize, Kryo as the default serializer, the auto-broadcast join threshold, and disabling dynamic allocation.
  • Used Spark best practices such as cache, persist and broadcast for frequently used DataFrames.
  • Used the Spark-Oracle API to import small datasets from the Oracle database rather than using Sqoop.
  • Used the Parquet format with Snappy compression to optimize storage, as recommended by Cloudera.
  • Created wrapper scripts in Bash and Python to schedule jobs in Control-M.
  • Worked with regular expressions in Spark SQL and Hive.
  • Worked with the YARN ResourceManager and optimized jobs by tuning YARN configuration settings.
  • Ingested data into the Hadoop data lake from different RDBMS databases using Sqoop.
  • Exported data from HDFS to RDBMS using Sqoop export so the BI team could analyze it and generate reports.
  • Worked with different file formats such as text, Avro, JSON, ORC and Parquet.
  • Used AWS services such as S3 for raw data storage used by the BI team.
  • Worked on EMR clusters with on-demand auto scaling.
  • Worked on a POC for Apache Kafka and Spark Streaming.
  • Replaced jobs that previously ran in HiveQL with Spark SQL for better performance.
  • Worked with the Kafka message queue for Spark Streaming.
  • As part of the Data Engineering team, was responsible for cluster and Hadoop ecosystem issues.
  • Worked on Jira tickets for edge node directory access, AD group creation and user database access.
  • Worked on code elevation requests to higher environments in SVN.
  • Ensured that the pre-prod and prod clusters ran smoothly and that tenants had enough resources for their applications to run.
  • Worked with the Cloudera team to find temporary workarounds and permanent fixes for user issues.
  • Worked with Informatica admin teams to create various connections to BDE.
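
As an illustration of the SHA-1 based UUID5 work mentioned above, here is a minimal Python sketch of how a name-based version 5 UUID can be derived from a SHA-1 digest. It is not the original BDE or Spark implementation; the function name uuid5_from_sha1 and the sample inputs are hypothetical, and the layout follows the standard RFC 4122 rules.

```python
import hashlib
import uuid


def uuid5_from_sha1(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Build a version 5 UUID by hand from a SHA-1 digest (hypothetical helper)."""
    # Hash the 16 namespace bytes followed by the UTF-8 encoded name.
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    # Only the first 16 of the 20 digest bytes are used.
    raw = bytearray(digest[:16])
    # Set the four version bits to 0101 (version 5).
    raw[6] = (raw[6] & 0x0F) | 0x50
    # Set the two variant bits to 10 (RFC 4122 variant).
    raw[8] = (raw[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(raw))


if __name__ == "__main__":
    # Sample input chosen for illustration only.
    manual = uuid5_from_sha1(uuid.NAMESPACE_DNS, "example.com")
    builtin = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
    print(manual, manual == builtin)
```

The same input always yields the same UUID, which is what makes this approach usable as a deterministic key when the same logic has to be reproduced in more than one tool.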


Hadoop Developer


  • Worked as SME (Subject Matter Expert) for the ITS subsystem and took part in business calls during the migration project.
  • Worked on multiple POCs to choose the best tool for each use case, considering cost, SLA and CPU.
  • Worked closely with BSAs to gather requirements for the migration project and converted them into technical specifications.
  • Worked with HDFS, Hive, Pig, Sqoop, Impala and YARN on CDH 4.3.
  • Used Oozie as a scheduling tool to automate job runs.
  • Created Hive tables, loaded data and wrote Hive queries, which run internally as MapReduce jobs.
  • Wrote multiple shell scripts for automation.
  • Worked extensively with Hive Query Language to rewrite code snippets and used partitioning to organize data.
  • Imported data from IBM DB2 into HDFS on a regular basis using Sqoop.
  • Worked on incremental/delta loads using Sqoop import.

Environment: Hadoop, CDH, MapReduce, HDFS, Pig, Hive, Oozie, Java, UNIX, Impala, HBase, Oracle, AutoSys, Mainframes, JCL, IBM DB2, NDM.


Programmer Analyst/SQL Developer


  • Estimated, prioritized, allocated and reviewed requirements, and delivered with the highest quality.
  • Worked as a mainframe developer and gained extensive knowledge of the health insurance domain.
  • Wrote COBOL programs with embedded SQL, using IBM DB2 as the backend database.
  • Worked with different VSAM files, GDGs and flat files on mainframes.
  • Wrote SQL queries on IBM DB2 tables to pull reports or process data.
  • Wrote JCLs to execute the main business logic in the source code.
  • Conducted business domain and technical sessions for trainees and team members.
  • Attended code reviews, test reviews, status meetings and defect management meetings at the project level.
  • Worked on various reporting tasks that involved calling stored procedures and writing many SQL queries.

Environment: COBOL, IBM DB2, Oracle DB, JCL, CICS, Control-M, Jira, CHGMAN, SQL.
