Data Engineer Resume

Charlotte, NC


  • Data Engineering professional with 8+ years of combined experience in data engineering, Big Data implementations, and Spark technologies.
  • Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
  • Extensive exposure to Big Data technologies and the Hadoop ecosystem, with in-depth understanding of MapReduce and Hadoop infrastructure.
  • Expertise in writing end to end Data processing Jobs to analyze data using MapReduce, Spark and Hive.
  • Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
  • Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computation.
  • Solid experience in and understanding of designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL.
  • Experience extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.
  • Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
  • Strong knowledge of the architecture and components of Spark, and efficient in working with Spark Core.
  • Strong knowledge of Hive analytical functions and of extending Hive functionality by writing custom UDFs.
  • Expertise in writing MapReduce jobs in Python to process large structured, semi-structured, and unstructured data sets and store the results in HDFS.
  • Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
  • Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
  • Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
  • Hands-on experience setting up workflows using the Apache Airflow and Oozie workflow engines for managing and scheduling Hadoop jobs.
  • Experience in data warehousing and business intelligence across various domains.
  • Created Tableau dashboards designed for large data volumes sourced from SQL Server.
  • Extract, Transform and Load (ETL) source data into respective target tables to build the required data marts.
  • Active involvement in all scrum ceremonies - Sprint Planning, Daily Scrum, Sprint Review and Retrospective meetings and assisted Product owner in creating and prioritizing user stories.
  • Strong experience in working with UNIX/LINUX environments, writing shell scripts.
  • Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
  • Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.


Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Hadoop Distribution: Cloudera distribution and Hortonworks.

Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting

Script Languages: JavaScript, jQuery, Python.

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB).

Operating Systems: Linux, Windows, Ubuntu, Unix

Web/Application server: Apache Tomcat, WebLogic, WebSphere

Tools: Eclipse, NetBeans

Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL

OLAP/Reporting: SQL Server Analysis Services and Reporting Services.

Cloud Technologies: MS Azure, Amazon Web Services (AWS).

Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis, Linear Regression, Naïve Bayes.


Confidential, Charlotte, NC

Data Engineer


  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing results back into an S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
  • Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB table or S3 bucket or to HTTP requests using Amazon API gateway
  • Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake in an AWS S3 bucket.
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data, and implemented data quality metrics using queries or Python scripts as appropriate to the source.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Created a DAG that uses the Email Operator, Bash Operator, and Spark Livy operator to execute jobs on an EC2 instance.
  • Deploy the code to EMR via CI/CD using Jenkins
  • Extensively used Code cloud for code check-in and checkouts for version control.

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.

Confidential, Rochester, MN

Data Engineer


  • Worked on designing and developing the Real-Time Tax Computation Engine using Oracle, StreamSets, Kafka, Spark Structured Streaming, and MySQL.
  • Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
  • Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Kafka, MySQL, and Spark.
  • Involved in data ingestion into MySQL using a Kafka-MySQL pipeline for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Worked on Spark Data sources, Spark Data frames, Spark SQL and Streaming using Scala.
  • Worked extensively on AWS Components such as Elastic Map Reduce (EMR), Elastic Compute Cloud (EC2), Simple Storage Service (S3)
  • Experience in developing Spark application using Scala SBT
  • Experience in integrating Spark-MySQL connector and JDBC connector to save the data processed in Spark to MySQL.
  • Responsible for creating tables and MySQL pipelines which are automated to load the data into tables from Kafka topics
  • Performed a POC to compare the time taken for Change Data Capture (CDC) of Oracle data across Stream, StreamSets, and Dbvisit.
  • Expertise in using different file formats like Text files, CSV, Parquet, JSON
  • Experience in custom compute functions using Spark SQL and performed interactive querying.
  • Responsible for masking and encrypting the sensitive data on the fly
  • Responsible for creating multiple applications for reading the data from different Oracle instances to Kafka topics using Stream
  • Responsible for setting up a MySQL cluster on AWS EC2 Instance
  • Experience in real-time streaming of data using Spark with Kafka.
  • Responsible for creating a Kafka cluster using multiple brokers.
  • Experience working with Vagrant boxes to set up local Kafka and StreamSets pipelines.

Environment: Spark 2.2, Scala, Linux, MySQL 5.8, Kafka 1.0, Stream, StreamSets, Spark SQL, Spark Structured Streaming, AWS EC2, EMR, IntelliJ, SBT, Git, Vagrant.

Confidential, MI

Big Data Engineer


  • Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Developed spark applications for performing large scale transformations and denormalization of relational datasets.
  • Hands-on experience with Kafka-Storm on the HDP 2.2 platform for real-time analysis.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Created reports for the BI team using Sqoop to export data into HDFS and Hive.
  • Performed analysis on the unused user navigation data by loading into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front end developers and lucent team.
  • Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and then into Hive tables.
  • Created Hive queries to process large sets of structured, semi-structured, and unstructured data and store the results in managed and external tables.
  • Developed complex HiveQL queries using the JSON SerDe.
  • Created HBase tables to load large sets of structured data.
  • Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports
  • Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
  • Managed and reviewed Hadoop log files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Worked on PySpark APIs for data transformations.
  • Ingested data into Hadoop using Sqoop imports and performed validations and consolidations on the imported data.
  • Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
  • Upgraded current Linux version to RHEL version 5.6
  • Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
  • Worked on JSON, Parquet, and Hadoop file formats.
  • Worked on different Java technologies such as Hibernate, Spring, JSP, and Servlets, and developed both server-side and client-side code for our web application.
  • Used GitHub for continuous integration services.

Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, Git, JSON SerDe, Cloudera.


Data Analyst


  • Gathered and understood the data visualization requirements from the business users.
  • Wrote SQL queries to extract data from the Sales data marts as per the requirements.
  • Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
  • Designed and deployed rich graphic visualizations with drill-down and drop-down menu options, parameterized using Tableau.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Explored traffic data from databases, connected it with transaction data, and presented and wrote a report for every campaign, with suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
  • Performed data cleaning, merging, and export of the dataset in Tableau Prep.
  • Applied data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.

Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.


Data Analyst


  • Processed data received from vendors and loaded it into the database. The process was carried out on a weekly basis and reports were delivered on a bi-weekly basis. The extracted data was checked for integrity.
  • Documented requirements and obtained signoffs.
  • Coordinated between the Business users and development team in resolving issues.
  • Documented data cleansing and data profiling.
  • Wrote SQL scripts to meet the business requirement.
  • Analyzed views and produced reports.
  • Tested cleansed data for integrity and uniqueness.
  • Automated the existing system to achieve faster and more accurate data loading.
  • Generated and documented weekly and bi-weekly reports for the client business team using Business Objects.
  • Learned to create Business Process Models.
  • Managed multiple projects simultaneously, tracking them effectively against varying timelines through a combination of business and technical skills.
  • Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing, including process flow diagrams.
  • Assisted QA team in creating test scenarios that cover a day in the life of the patient for Inpatient and Ambulatory workflows.

Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Machine Learning models.
