Data Engineer Resume


  • 11+ years of IT industry experience, including 4+ years designing and developing Big Data/Apache Hadoop ecosystem components (data ingestion, data modeling, querying, processing, storage, analysis, and data integration) and implementing enterprise-level systems that transform big data in the cloud using the Apache Spark framework.
  • Experience in design, development, and testing of large-scale distributed data processing solutions in the cloud using Big Data/Hadoop ecosystem tools: Spark, Hadoop, YARN, Sqoop, S3, HDFS, Hive, Impala, Spark SQL, Pig, HBase, Oozie, and Kafka, with ORC and Parquet file formats.
  • Experienced in application development using Spark, MapReduce, Hive, Spark SQL, and Linux shell scripting.
  • Experience in various Big Data application phases like Data ingestion, Data transformations and Data visualization.
  • Strong experience in Spark SQL UDFs, Spark SQL and Performance Tuning.
  • Experience in efficient processing of large datasets using Spark features such as partitions, in-memory capabilities, broadcast variables, joins, RDD transformations, actions, and case classes.
  • In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
  • Experience working with relational databases such as MySQL, Oracle, MS SQL Server and NoSQL databases such as HBase.
  • Hands-on experience working with file formats such as ORC, Avro, SequenceFile, text, and Parquet.
  • Strong AWS experience setting up EC2 instances and RDS with Oracle, MySQL, PostgreSQL, Aurora, Redshift, etc., plus experience working with S3, EBS, RDS, Lambda functions, and VPC.
  • Good troubleshooting skills, with an understanding of system capacity and bottlenecks and the basics of memory, CPU, OS, storage, and networking.
  • Developed a few reports using Tableau 9 for data visualization, reporting, and analysis.
  • Able to meet deadlines and handle multiple tasks; decisive, with strong leadership qualities, flexibility in work schedules, and good communication skills.
  • Involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation, using Agile and Waterfall methodologies.
  • Skilled in using version control software such as Git and GitHub, and Jira for project tracking.
  • Extensively used ETL methodology to support data extraction, transformation, and loading processes in corporate-wide ETL solutions using IBM InfoSphere DataStage 8.5, 9.1, and 11.5.
  • Good team player with ability to solve problems, organize and prioritize multiple tasks.
  • Good interpersonal skills; committed, results-oriented, and hardworking, with a zeal for learning new technologies.
  • Flexible and versatile to adapt to any new environment and work on any project.


Big Data Technologies: Spark/Scala, HDFS, MapReduce, Tez, Pig, Sqoop, Hive, Impala, Spark Streaming, Lambda, Kafka, and Oozie

File Formats: ORC, Avro, SequenceFile, Text, JSON, and Parquet

ETL/BI Tools: InfoSphere DataStage, AWS Glue, Tableau

Scripting Languages: Shell Scripting

Distributions: AWS EMR, Cloudera CDH

Database Technologies: Oracle, MS SQL Server, MySQL, HBase

Software IDE Tools: Eclipse and IntelliJ

Integration Tools: Dell Boomi

Version Control Systems: Git and GitHub

Automation Tools: Jenkins

Scheduling Tools: Control-M, Airflow



Data Engineer

Environment: AWS EMR, Spark/Scala, Oracle, SQL Server, S3, Hive, SparkSQL, AWS Athena, Linux, Control-M


  • Used AWS EMR to provision a fully managed Hadoop distributed framework.
  • Implemented the process to migrate data from multiple source systems to AWS S3.
  • Built Apache Sqoop jobs to ingest data from relational databases (SQL Server and Oracle) into the S3 landing/raw area.
  • Wrote shell and Python scripts to extract JSON data from several web portals via REST APIs into S3 buckets.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
  • Wrote Spark/Scala programs to validate transactional data in Parquet files.
  • Created Hive tables on S3 files to implement business logic using Spark DataFrames.
  • Developed and enhanced Spark jobs to cleanse and transform data in Hive tables and S3 files.
  • Worked on the design, build, and system testing of complex data aggregation pipelines using Spark/Scala.
  • Used broadcast variables, efficient joins, transformations, and other Spark capabilities for data processing.
  • Used the Spark library to push data from S3 into RDBMS tables.
  • Built Athena external tables to query data in AWS S3.
  • Coordinated and completed various Business Intelligence functions, including data preparation (sourcing, acquisition, and integration), data warehousing, data exploration, and information delivery.
  • Worked with internal clients to ensure that operational systems and IT functions were maintained at peak efficiency, with no disruptions or unscheduled downtime.
  • Extensively used Git and GitHub for code repositories, code review, and version control, and JIRA for project tracking.
  • Built Control-M jobs to schedule Spark jobs and coordinated with other platforms to schedule series of pipeline jobs, including cluster creation, Spark job execution, and S3 data operations.
  • Enhanced and supported legacy data warehouse DataStage jobs.
  • Reviewed code with the technical team and attended code review sessions.
  • Tested all components in the development and test environments and updated the test results document.
  • Responsible for production support, including on-call duty.


Big Data Developer

Environment: Cloudera CDH, Spark/Scala, Hive, Pig, RDBMS, Sqoop, SparkSQL, Kafka, HDFS, Oozie, HBase


  • Member of the design and implementation team, responsible for working with users to fully understand requirements and implement them.
  • Involved in the requirements and design phases of a streaming architecture for real-time processing using Spark and Kafka.
  • Used the Spark Core/SQL API to read text files from HDFS and convert them into DataFrames with an explicit schema for better performance.
  • Analyzed data by running Hive queries (HiveQL) and Pig scripts (Pig Latin).
  • Created Pig scripts for sorting, joining, and grouping data.
  • Worked with SequenceFile and ORC files, map-side joins, bucketing, and partitioning to improve Hive performance and storage efficiency.
  • Loaded processed data into Hive tables and HDFS, and finally into an RDBMS (using Sqoop) for storage and BI requirements.
  • Developed watcher jobs using shell scripting to notify process owners when files were delayed or missing relative to the agreed schedule.
  • Used the Oozie operational service to create workflows and automate Hive and Pig jobs.
  • Followed Agile methodology, participating in sprints and daily scrums to deliver software tasks on time and with good quality in coordination with onsite and offshore teams.
  • Reviewed code with the technical team and attended code review sessions.
  • Used JIRA for incident creation and bug tracking, and Bitbucket to check in and check out code changes.
  • Involved in the post-implementation support phases of the project.
  • Involved in unit testing and documentation of unit test cases.
