
Data Engineer Resume

Nashville, TN


  • 8+ years of experience in data warehouse environments, with extensive SQL Server, Oracle, Teradata, Redshift and Azure SQL database development in the Retail, Manufacturing, Financial and Communication sectors.
  • Experience in Big Data technologies, the Hadoop ecosystem, data warehousing and SQL-related technologies.
  • Experience in Big Data analytics using various Hadoop ecosystem tools and the Spark framework; designed pipelines using PySpark and Spark SQL.
  • Experience installing, configuring and maintaining Apache Hadoop clusters for application development, along with Hadoop tools such as Sqoop, Hive, Pig, HBase, Kafka, Hue and Oozie, plus Spark, Scala and Python.
  • Implemented multiple big data projects in the cloud using AWS components such as S3, DynamoDB, Glue, Athena, Data Pipeline, EMR, EC2, Lambda, CloudWatch, CloudFormation and Redshift.
  • Developed data pipelines in PySpark on EMR clusters and scheduled jobs using Airflow.
  • Developed AWS CloudFormation templates to create custom-sized EC2 instances, ELBs, Lambda functions, S3 buckets, Glue crawlers, Glue ETL jobs and security groups.
  • Extensive experience developing applications that perform data processing tasks against Teradata, Oracle, SQL Server and MySQL databases.
  • Worked with major distributions such as Cloudera (CDH) and Hortonworks, as well as AWS; also worked on UNIX and data warehouse support across these distributions.
  • Hands-on experience developing and deploying enterprise applications using the major components of the Hadoop ecosystem.
  • Experience handling large datasets using partitioning, Spark's in-memory capabilities, broadcast variables in Spark with Python, and effective, efficient joins and transformations applied during the ingestion process itself.
  • Experience developing data pipelines using Pig, Sqoop and Flume to extract data from weblogs and store it in HDFS; developed Pig Latin scripts and used HiveQL for data analytics.
  • Worked extensively with Spark Streaming and Apache Kafka to ingest live streaming data.
  • Strong expertise in troubleshooting and performance tuning of Spark, MapReduce and Hive applications.
  • Worked with data warehousing, ETL and reporting tools such as Informatica, Pentaho and Tableau.
  • Familiar with Agile and Waterfall methodologies; handled numerous client-facing meetings with strong communication skills.
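The broadcast-join pattern referenced above (shipping a small lookup table to every worker so the large dataset never has to shuffle) can be sketched in plain Python; the function and sample data below are hypothetical illustrations, not code from any listed project:

```python
# Map-side ("broadcast") join: the small dimension table is held as an
# in-memory dict on every worker, so the large fact stream is joined
# with a lookup instead of a shuffle. In PySpark this dict would come
# from spark.sparkContext.broadcast(...).
def broadcast_join(fact_rows, dim_lookup, key):
    """Enrich each fact row with matching dimension attributes."""
    for row in fact_rows:
        dim = dim_lookup.get(row[key])
        if dim is not None:          # inner-join semantics: drop misses
            yield {**row, **dim}

# Hypothetical dimension and fact data.
products = {1: {"name": "widget"}, 2: {"name": "gadget"}}
facts = [{"pid": 1, "qty": 3}, {"pid": 9, "qty": 5}]
joined = list(broadcast_join(facts, products, "pid"))
```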


Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Hadoop Distribution: Cloudera, Hortonworks, Apache, AWS

Languages: Java, SQL, PL/SQL, Python, Pig Latin, HiveQL, Scala, Regular Expressions

Web Technologies: HTML, CSS, JavaScript, XML, JSP, Restful, SOAP

Operating Systems: Windows (XP/7/8/10), UNIX, LINUX, UBUNTU, CENTOS

Portals/Application servers: WebLogic, WebSphere Application server, WebSphere Portal server, JBOSS

Build Automation tools: SBT, Ant, Maven

Version Control: GIT

IDE & Build Tools, Design: Eclipse, Visual Studio, NetBeans, Rational Application Developer, JUnit

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, Cassandra, MongoDB), Teradata


Data Engineer

Confidential, Nashville, TN


  • Involved in the complete big data flow of the application, from ingesting upstream data into HDFS through processing and analyzing the data in HDFS.
  • Developed a Spark API to import data into HDFS from Teradata and created Hive tables.
  • Developed Sqoop jobs to import data in Avro file format from an Oracle database and created Hive tables on top of it.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from the Avro Hive tables.
  • Involved in running Hive scripts through Hive, Impala, Hive on Spark and Spark SQL.
  • Involved in performance tuning of Hive from design, storage and query perspectives.
  • Developed a Flume ETL job that reads from an HTTP source and sinks to HDFS.
  • Collected JSON data from the HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.
  • Developed Spark scripts to import large files from Amazon S3 buckets.
  • Developed Spark Core and Spark SQL scripts in Scala for faster data processing.
  • Developed Kafka consumer APIs in Scala for consuming data from Kafka topics.
  • Involved in designing and developing HBase tables and storing aggregated data from Hive tables.
  • Integrated Hive with Tableau Desktop reports and published them to Tableau Server.
  • Developed shell scripts for running Hive scripts through Hive and Impala.
  • Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them with the Oozie coordinator.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
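The partitioned, Snappy-compressed Parquet tables described above follow a DDL of roughly this shape; the Python sketch below only generates the statement (table and column names are made up for illustration):

```python
# Build a Hive DDL string for a partitioned Parquet table with Snappy
# compression. This mirrors the table layout described above; the
# specific names here are hypothetical.
def parquet_table_ddl(table, columns, partitions):
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partitions)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        "STORED AS PARQUET "
        "TBLPROPERTIES ('parquet.compression'='SNAPPY')"
    )

ddl = parquet_table_ddl(
    "sales", [("id", "BIGINT"), ("amount", "DOUBLE")], [("dt", "STRING")]
)
```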

Environment: HDFS, Yarn, MapReduce, Hive, Sqoop, Flume, Oozie, HBase, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX shell scripting, Cloudera.

Data Engineer

Confidential, Evansville, IN


  • Responsible for architecting Hadoop clusters with CDH3; involved in installing CDH3 and upgrading to CDH4.
  • Created keyspaces in Cassandra for saving Spark batch output.
  • Worked on a Spark application to compact the small files in the Hive ecosystem so their size approaches the HDFS block size.
  • Managed migration of on-prem servers to AWS by creating golden images for upload and deployment.
  • Managed multiple AWS accounts with multiple VPCs for production and non-production, where the primary objectives were automation, build-out, integration and cost control.
  • Implemented real-time streaming ingestion using Kafka and Spark Streaming.
  • Loaded data using Spark Streaming with Scala and Python.
  • Involved in the requirement and design phases to implement a streaming Lambda architecture for real-time processing using Spark, Kafka and Scala.
  • Loaded data into Spark RDDs and performed in-memory computation to generate output responses.
  • Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions.
  • Developed a full-text search platform using NoSQL, Logstash and the Elasticsearch engine, allowing much faster, more scalable and more intuitive user searches.
  • Developed Sqoop scripts to handle the interaction between Pig and a MySQL database.
  • Worked on performance enhancement of Pig, Hive and HBase across multiple nodes.
  • Worked with distributed n-tier and client/server architectures.
  • Supported MapReduce programs running on the cluster and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Developed a MapReduce application using Hadoop, MapReduce programming and HBase.
  • Evaluated Oozie for workflow orchestration and experienced in cluster coordination using Zookeeper.
  • Developed ETL jobs to organization- and project-defined standards and processes.
  • Experienced in enabling Kerberos authentication in the ETL process.
  • Designed the GUI using the Model-View-Controller architecture (Struts framework).
  • Integrated Spring DAO for data access using Hibernate and was involved in developing Spring Framework controllers.
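The stateful streaming aggregation at the heart of the Lambda architecture above can be illustrated with a toy Python stand-in for Spark Streaming's `updateStateByKey`: each micro-batch of events updates a running count per key. The keys and events below are invented:

```python
from collections import Counter

# Toy stand-in for Spark Streaming's stateful aggregation: each
# micro-batch (a list of Kafka-style event dicts) folds into a
# running per-user count, the way updateStateByKey maintains state
# across batch intervals.
def update_state(state, batch):
    state.update(event["user"] for event in batch)
    return state

# Two hypothetical micro-batches of events.
batches = [
    [{"user": "a"}, {"user": "b"}],
    [{"user": "a"}],
]
counts = Counter()
for batch in batches:
    counts = update_state(counts, batch)
```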

Environment: HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, HBase, Java, J2EE, Eclipse, HQL.

Big Data Engineer

Confidential, Branchburg, NJ


  • Extensively involved in installation and configuration of the Cloudera Hadoop distribution.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
  • Real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Created reports for the BI team, using Sqoop to export data into HDFS and Hive.
  • Performed analysis of unused user navigation data by loading it into HDFS and writing MapReduce jobs; the analysis provided inputs to the new APM front-end developers and the Lucent team.
  • Loaded data from multiple data sources (SQL Server, DB2 and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
  • Created Hive queries to process large sets of structured, semi-structured and unstructured data and store it in managed and external tables.
  • Developed complex HiveQL queries using a JSON SerDe.
  • Created HBase tables to load large sets of structured data.
  • Involved in importing real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
  • Managed and reviewed Hadoop log files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Worked on PySpark APIs for data transformations.
  • Performed data ingestion into Hadoop (Sqoop imports) with validations and consolidations of the imported data.
  • Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
  • Upgraded the current Linux version to RHEL 5.6.
  • Expertise in hardening Linux servers and in compiling, building and installing Apache Server from source with minimal modules.
  • Worked with JSON, Parquet and Hadoop file formats.
  • Worked with Java technologies such as Hibernate, Spring, JSP and Servlets, developing both server-side and client-side code for our web application.
  • Used GitHub with continuous integration services.
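Custom row-level logic of the kind the UDF and JSON SerDe bullets describe can also run as a streaming script via Hive's TRANSFORM clause: Hive pipes rows to the script on stdin and reads tab-separated columns back. A minimal Python sketch, with hypothetical field names (a real script would loop over `sys.stdin`):

```python
import json

# Flatten one JSON record into the tab-separated columns Hive's
# TRANSFORM clause expects back on stdout. Field names ("id",
# "status") are illustrative only.
def to_tsv(line):
    rec = json.loads(line)
    return f'{rec["id"]}\t{rec.get("status", "unknown")}'

row = to_tsv('{"id": 7, "status": "ok"}')
```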

Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, GIT, JSON SerDe, Cloudera.

Sr ETL Developer



  • Involved in migrating historical as-built data from the Link Tracker Oracle database to Teradata using Ab Initio.
  • Implemented the historical purge process for Clickstream, Order Broker and Link Tracker to Teradata using Ab Initio.
  • Implemented the centralized-graphs concept.
  • Extensively used Ab Initio components such as Reformat, Rollup, Lookup, Join and Redefine, and developed many subgraphs.
  • Created Ab Initio sandboxes at both the GDE and air-command levels, and scheduled the interdependent jobs (deployed Ab Initio graphs) through a UNIX wrapper template.
  • Performed tuning of Ab Initio graphs.
  • Created sandboxes and added parameters based on requirements.
  • Involved in loading transformed data files into Teradata staging tables through Teradata load utilities (FastLoad and MultiLoad scripts), and created Teradata macros for loading data from staging to target tables.
  • Performed data validation on the Teradata warehouse data against a set of standard test cases.
  • Led the module to load all PARTY RELSHIP tables: responsible for requirement gathering; creating specification and test-case documents; designing and validating the ETL mappings; development through unit testing; validating the data populated in the database; providing UAT support; and resolving issues raised by users and other groups.
  • Responsible as the E-R consultant for the Extract-Replicate (GoldenGate) tool, which extracts real-time data to the warehouse without hitting the database by pulling data from Oracle archive logs, with Oracle 10g supporting the ASM (Automatic Storage Management) method.
  • Also involved in designing the Data Allegro post-scripts to load data from LRF files into the DA database.
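A FastLoad script of the kind referenced above has a small, regular shape, so it is often generated rather than hand-written. The Python generator below is a rough sketch only: the control-statement ordering, VARCHAR sizing and names are illustrative assumptions, not production FastLoad:

```python
# Hypothetical generator for a minimal Teradata FastLoad script that
# loads a pipe-delimited LRF-style file into a staging table. All
# names and the VARCHAR(64) sizing are illustrative assumptions.
def fastload_script(table, datafile, columns):
    defs = ",\n".join(f"  {c} (VARCHAR(64))" for c in columns)
    values = ", ".join(f":{c}" for c in columns)
    cols = ", ".join(columns)
    return (
        f"BEGIN LOADING {table} ERRORFILES {table}_e1, {table}_e2;\n"
        'SET RECORD VARTEXT "|";\n'
        f"DEFINE\n{defs}\nFILE={datafile};\n"
        f"INSERT INTO {table} ({cols}) VALUES ({values});\n"
        "END LOADING;"
    )

script = fastload_script("stg_party", "party.lrf", ["party_id", "name"])
```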

Data Analyst



  • Developed stored procedures in MS SQL to fetch the data from different servers using FTP and processed these files to update the tables.
  • Responsible for designing logical and physical data models for various data sources on Confidential Redshift.
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into the data mart in Redshift.
  • Experience building data pipelines in Python, PySpark, HiveQL, Presto and BigQuery, and building Python DAGs in Apache Airflow.
  • Created an ETL pipeline using Spark and Hive to ingest data from multiple sources.
  • Involved in using SAP, with transactions in the SAP SD module, for handling the client's customers and generating sales reports.
  • Coordinated with clients directly to get data from different databases.
  • Worked on MS SQL Server, including SSRS, SSIS, and T-SQL.
  • Designed and developed schema data models.
  • Documented business workflows for stakeholder review.
