
Data Engineer Resume

Rochester, MN


  • 9+ years of IT experience across a variety of industries, working with Big Data technologies on the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities, written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs and Spark on YARN.
  • Experience integrating various data sources such as Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, PowerBI and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, Name Node, Data Node and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Used IDEs and editors such as Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
  • Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase; developed real-time read/write access to very large datasets via HBase.
  • Developed Spark applications that can handle data from various RDBMS (MySQL, Oracle Database) and streaming sources.
  • Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
  • Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
  • Experience working with GitHub/Git 2.12 source and version control systems.
  • Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like the Collections Framework, exception handling and the I/O system.


Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, Yarn, Spark, Spark SQL, Kafka

Hadoop Distributions: Hortonworks and Cloudera Hadoop

Languages: C, C++, Python, Scala, UNIX Shell Script, COBOL, SQL and PL/SQL

Tools: Teradata SQL Assistant, Pycharm, Autosys

Google Cloud Platform: GCP Cloud Storage, Big Query, Composer

Operating Systems: Linux, UNIX, z/OS and Windows

Databases: Teradata, Oracle 9i/10g, DB2, SQL Server, MySQL 4.x/5.x

ETL Tools: IBM InfoSphere Information Server V8, V8.5 & V9.1

Reporting: Tableau


Data Engineer

Confidential, Rochester MN


  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Using AWS Redshift, extracted, transformed and loaded data from various heterogeneous data sources and destinations such as Access, Excel, CSV, Oracle and flat files, using connectors, tasks and transformations provided by AWS Data Pipeline.
  • Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Wrote shell scripts to trigger DataStage jobs.
  • Assisted service developers in finding relevant content in the existing reference models.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed PySpark scripts to protect raw data by applying hashing algorithms to client-specified columns.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Experience in GCP Dataproc, GCS, Cloud Functions and BigQuery.
  • Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage and BigQuery services.
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Created a KPI calculator sheet and maintained that sheet within SharePoint.
  • Created Tableau reports with complex calculations and worked on ad-hoc reporting using PowerBI.
  • Created a data model that correlates all the metrics and gives a valuable output.
  • Tuned SQL queries to bring down run time by working on indexes and execution plans.
  • Performed ETL testing activities such as running jobs, extracting data from the database using the necessary queries, transforming it, and uploading it into the data warehouse servers.
  • Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB).
  • Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
  • Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Developed data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Ensured deliverables (daily, weekly and monthly MIS reports) were prepared to satisfy the project requirements, cost and schedule.
  • Worked on a direct query using PowerBI to compare legacy data with the current data, and generated and stored reports and dashboards.
  • Designed SSIS packages to extract, transform and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
  • Using SQL Server Reporting Services (SSRS), created and formatted Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, Subreports, ad-hoc reports, parameterized reports, interactive reports and custom reports.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets using PowerBI
  • Developed visualizations and dashboards using PowerBI
  • Used ETL to implement the Slowly Changing Dimension transformation to maintain historical data in the data warehouse.
  • Created dashboards for analyzing POS data using Power BI

Data Engineer

Confidential, Columbus, OH


  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
  • Designed several DAGs (Directed Acyclic Graph) for automating ETL pipelines
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
  • Strong understanding of AWS components such as EC2 and S3
  • Performed Data Migration to GCP
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Responsible for data services and data movement infrastructures
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
  • Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
  • Loaded application analytics data into data warehouse in regular intervals of time
  • Designed and built infrastructure for the Google Cloud environment from scratch
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP
  • Worked with Confluence and Jira
  • Designed and implemented configurable data delivery pipeline for scheduled updates to customer-facing data stores built with Python
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
  • Compiled data from various sources to perform complex analysis for actionable results
  • Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
  • Optimized the TensorFlow model for efficiency
  • Analyzed the system for new enhancements/functionalities and perform Impact analysis of the application for implementing ETL changes
  • Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Collaborated with team members and stakeholders in design and development of data environment
  • Prepared associated documentation for specifications, requirements, and testing

Environment: AWS, GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Data Engineer

Confidential, Mountain View, CA


  • Experience in building and architecting multiple Data pipelines, end to end ETL and ELT process for Data ingestion and transformation in GCP
  • Strong understanding of AWS components such as EC2 and S3
  • Implemented a Continuous Delivery pipeline with Docker and GitHub
  • Used Cloud Functions with Python to load CSV files into BigQuery on arrival in a GCS bucket
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
  • Developed and deployed data pipeline in cloud such as AWS and GCP
  • Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
  • Responsible for data services and data movement infrastructures; good experience with ETL concepts, building ETL solutions and data modeling
  • Architected several DAGs (Directed Acyclic Graph) for automating ETL pipelines
  • Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.
  • Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, writing applications)
  • Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Developed logistic regression models (Python) to predict subscription response rate based on customer variables such as past transactions, response to prior mailings, promotions, demographics, interests, and hobbies.
  • Developed a near-real-time data pipeline using Spark
  • Hands-on experience in GCP, BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Stackdriver
  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
  • Worked with Confluence and Jira; skilled in data visualization with libraries such as Matplotlib and seaborn.
  • Hands-on experience with big data tools like Hadoop, Spark, Hive
  • Experience implementing machine learning back-end pipeline with Pandas, NumPy.

Environment: GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, seaborn, text mining, NumPy, Scikit-learn, heat maps, bar charts, line charts, ETL workflows, linear regression, multivariate regression, Scala, Spark

Jr. Big Data Engineer



  • Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive, and MapReduce.
  • Involved in loading data from LINUX file system to HDFS.
  • Importing and exporting data into HDFS and Hive using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Configured performance tuning and monitoring for Cassandra read and write processes for fast I/O operations and low latency. Used the Java API and Sqoop to export data from RDBMS into a DataStax Cassandra cluster.
  • Experience working on processing unstructured data using Pig and Hive.
  • Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Developed Pig Latin scripts to extract data from the web server output files to load into HDFS.
  • Extensively used Pig for data cleansing.
  • Implemented SQL, PL/SQL Stored Procedures.
  • Worked on debugging, performance tuning of Hive & Pig Jobs.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on tuning the performance of Pig queries.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Actively involved in code review and bug fixing for improving the performance.

Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, LINUX, Cloudera, Big Data, Java APIs, Java collection, SQL.

Application Developer (ETL DataStage)



  • Used IBM InfoSphere suite products for ETL development, enhancement, testing, support, maintenance and debugging of software applications that support business units and support functions in the consumer banking sector.
  • Utilized Hadoop Ecosystem for Big Data sources in Customer Relationship Hub and Master Data Management: for data ingestion: Kafka, Storm and Spark Streaming; for data landing: HBase, Phoenix relational DB layer on HBase; for query and ETL used Phoenix, Pig and HiveQL; for job runtime management: Yarn and Ambari.
  • Developed ETL packages using SQL Server Integration Services (SSIS) to perform data migration from legacy systems such as DB2, SQL Server, Excel sheets, XML files and flat files to SQL Server databases.
  • Performed daily database health checks and tasks including backup and restore using SQL Server tools such as SQL Server Management Studio, SQL Server Profiler, SQL Server Agent, and Database Engine Tuning Advisor on Development and UAT environments.
  • Performed the ongoing delivery, migrating client mini-data warehouses or functional data-marts from different environments to MS SQL server.
  • Involved in Implementation of database design and administration of SQL based database.
  • Developed SQL scripts, Stored Procedures, functions and Views.
  • Worked on DTS packages and DTS Import/Export for transferring data from Oracle databases and text-format files to SQL Server 2005.
  • Designed and implemented various machine learning models (e.g., customer propensity scoring model, customer churn model) using Python (NumPy, SciPy, pandas, scikit-learn), Apache Spark (SparkSQL, MLlib).
  • Provided performance tuning and optimization of data integration frameworks and distributed database system architecture that is optimized for Hadoop.
  • Designed and developed a solution in Apache Spark to extract transactional data from various HDFS sources and ingest it into Apache HBase tables.
  • Designed and developed Streaming jobs to send events and logs from Gateway systems to Kafka.

Environment: HortonWorks, DataStage 11.3, Oracle, DB2, UNIX, Mainframe, Autosys.
