Data Engineer Resume

Philadelphia, PA


  • 8+ years of professional experience in Information Technology, with expertise in Big Data using the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
  • Expertise in the major components of the Hadoop ecosystem, such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
  • Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
  • Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
  • Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading data into partitioned Hive tables.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Experience developing MapReduce programs with Apache Hadoop to analyze big data as required.
  • Experience designing star and snowflake schemas for data warehouse and ODS architectures.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience with proofs of concept (PoCs) and gap analysis: gathered the necessary data from different sources and prepared it for exploration using data munging and Teradata.
  • Experienced in building automated regression scripts in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Excellent in performing data transfer activities between SAS and various databases and data file formats like XLS, CSV, etc.
  • Experienced in developing and supporting Oracle, SQL, PL/SQL, and T-SQL queries.
  • Experience designing and implementing data structures and using common business intelligence tools for data analysis.
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
  • Experience in data manipulation, data analysis, and data visualization of structured data, semi-structured data, and unstructured data
  • Understanding of the Hadoop Architecture and its ecosystem such as HDFS, YARN, MapReduce, Sqoop, Avro, Spark, Hive, HBase, Flume, and Zookeeper
  • Creative skills in developing elegant solutions to challenges related to pipeline engineering
  • Knowledge of the Spark Architecture and programming Spark applications


Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.

Cloud Technologies: AWS, Azure, Google cloud platform (GCP)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML - Scikit-Learn, MLjar, etc.

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE

Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, etc. AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Shell, Linux, PuTTY, Bash Shell, Unix, etc., Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.


Confidential, Philadelphia, PA

Data Engineer


  • Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
  • Wrote shell scripts to trigger DataStage jobs.
  • Assisted service developers in finding relevant content in the existing reference models.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to protect raw data by applying hashing algorithms to client-specified columns.
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers.
  • Built an Oozie pipeline that performs several actions: moving files, importing data with Sqoop from the source Teradata or SQL databases into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Created a KPI calculator sheet and maintained it in SharePoint.
  • Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
  • Created a data model that correlates all the metrics and produces meaningful output.
  • Developed data sources, visualizations, and dashboards using Tableau Desktop, published them to Tableau Server, and scheduled reports on Tableau Server.
  • Developed dashboards and worksheets in Tableau with action filters, parameters, and calculated sets.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool.
  • Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Created numerous pipelines in Azure Data Factory v2 to get data from disparate source systems using Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Developed data pipeline using Spark, Hive, Pig, python, Impala, and HBase to ingest customer
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Used a direct query in Power BI to compare legacy data with current data, and generated and stored reports and dashboards.
  • Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes.
  • Used SQL Server Reporting Services (SSRS) to create and format crosstab, conditional, drill-down, Top N, summary, form, OLAP, sub-, ad-hoc, parameterized, interactive, and custom reports.
  • Built data pipelines to move data from source to destination, scheduled with Airflow.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
  • Developed visualizations and dashboards using Power BI.
  • Worked on all phases of the data integration development lifecycle, real-time/batch data pipeline design and implementation, and support of the WU Digital Big Data ETL & Reporting track.
  • Adhered to the ANSI SQL specification wherever possible, and provided context about similar functionality in other industry-standard engines (e.g. referencing PostgreSQL function documentation).
  • Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
  • Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it to the data warehouse servers.
  • Created dashboards for analyzing POS data using Power BI

Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Advanced Excel (formulas, pivot tables, HLOOKUP, VLOOKUP, macros), Spark, Python, ETL, Power BI, Tableau, Presto, Hive/Hadoop, Snowflake, AWS Data Pipeline.
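
To illustrate the column-hashing work listed above, here is a minimal sketch in plain Python. It is not the original PySpark script: the `hash_value` and `hash_columns` helpers, the column names, and the salt are hypothetical, and SHA-256 is an assumed choice of algorithm.

```python
import hashlib


def hash_value(value, salt=""):
    """Return the hex SHA-256 digest of salt + value; None is passed through."""
    if value is None:
        return None
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def hash_columns(rows, columns, salt=""):
    """Return copies of the row dicts with the named columns replaced by digests."""
    return [
        {k: hash_value(v, salt) if k in columns else v for k, v in row.items()}
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical sample record; only the "ssn" column is hashed.
    rows = [{"id": 1, "ssn": "123-45-6789", "city": "Philadelphia"}]
    masked = hash_columns(rows, columns={"ssn"}, salt="demo-salt")
    print(masked[0]["id"], masked[0]["city"], len(masked[0]["ssn"]))
```

In PySpark, the same idea would typically use the built-in `sha2` function from `pyspark.sql.functions` on each selected column instead of a Python loop.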

Confidential, Foster City, CA

Data Engineer


  • Extensively used Agile methodology as the organization standard to implement the data models.
  • Created several types of data visualizations using Python and Tableau.
  • Extracted Mega Data from AWS using SQL Queries to create reports.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used AWS Redshift to extract, transform, and load data from various heterogeneous data sources and destinations.
  • Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
  • Developed a data pipeline using Kafka to store data into HDFS.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
  • Working experience with data streaming process with Kafka, Apache Spark, Hive.
  • Worked with various HDFS file formats such as Avro, SequenceFile, NiFi, and JSON, and various compression formats such as Snappy and bzip2.
  • Created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Designed and configured topics in the new Kafka cluster across all environments.
  • Successfully secured the Kafka cluster with Kerberos.
  • Implemented Kafka security features using SSL without Kerberos. For finer-grained security, later set up Kerberos with users and groups to enable more advanced security features.
  • Experience converting existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Performed Regression testing for Golden Test Cases from State (end to end test cases) and automated the process using python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
  • Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Developed a Python script that uses REST APIs to extract data from on-premises systems and transfer it to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Ingested data through cleansing and transformation steps using AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Worked with sources such as Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server.
  • Extensively used Tableau for customer marketing data visualization.
  • Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Navigator.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
  • Generated various reports using SQL Server Reporting Services (SSRS) for business analysts and the management team.
  • Created HBase tables to store variable data formats of PII data coming from different portfolios.
  • Designed data models with industry standards up to 3rd NF (OLTP) and denormalized (OLAP) data marts with star and snowflake schemas.
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
  • Installed a Kerberos-secured Kafka cluster with no encryption on Dev and Prod, and set up Kafka ACLs on it.
  • Set up an unauthenticated Kafka listener in parallel with the Kerberos (SASL) listener, and tested an unauthenticated (anonymous) user alongside a Kerberos user.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks. Migrated data from on-premises to AWS storage buckets.
  • Followed Agile methodology, including test-driven development and pair programming.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
  • Implemented Spark using Scala and utilizing Data frames and Spark SQL API for faster processing of data.
  • Implemented best income logic using Pig scripts.
  • Scheduled and monitored job workflows using tools like Oozie.
  • Developed parallel reports using SQL and Python to validate the daily, monthly and quarterly reports.

Environment: Hadoop, HDFS, Python 3.6, AWS Glue, Lambda, Kafka, PyCharm, Informatica PowerCenter, CodeBuild, CodePipeline, EventBridge, Athena, Oozie, Spark, OLTP, OLAP, PL/SQL, SQL Server, NoSQL, Scala, Linux Shell Scripting

Confidential, Atlanta, GA

Big Data Engineer


  • Managed jobs using the Fair Scheduler and developed job processing scripts using Oozie workflows.
  • Used Spark and Hive to implement the transformations needed to join the daily ingested data to historical data.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
  • Developed Spark scripts by using Scala, shell commands as per the requirement.
  • Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using DataFrames, Spark SQL, Datasets, and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Developed logistic regression models (using R programming and Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
  • Created Tableau dashboards/reports for data visualization, Reporting and Analysis and presented it to Business.
  • Created/ Managed Groups, Workbooks and Projects, Database Views, Data Sources and Data Connections
  • Worked with the Business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing, and technical delivery.
  • Knowledge in Tableau Administration Tool for Configuration, adding users, managing licenses and data connections, scheduling tasks, embedding views by integrating with other platforms.
  • Developed dimensions and fact tables for data marts like Monthly Summary, Inventory data marts with various Dimensions like Time, Services, Customers and policies.
  • Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
  • Assisted operation support team for transactional data loads in developing SQL Loader & Unix scripts
  • Implemented Python script to call the Cassandra Rest API, performed transformations and loaded the data into Hive.
  • Worked extensively in Python and built a custom ingest framework.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
  • Experienced in writing real-time processing jobs using Spark Streaming with Kafka.
  • Created Cassandra tables to store various data formats of data coming from different sources.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Used Nebula Metadata to register business and technical datasets for the corresponding SQL scripts.
  • Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.
  • Developed spark code and spark-SQL/streaming for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
  • Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.

Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, AWS S3, Oracle 12c, SQL, Shell, Cassandra, Linux.


Data Analyst


  • Installed Hadoop, MySQL, PostgreSQL, SQL Server, Sqoop, Hive, and HBase.
  • Created bashrc files and all other xml configurations to automate the deployment of Hadoop VMs over AWS EMR.
  • Experience creating and organizing HDFS over a staging area.
  • Troubleshot RSA SSH keys in Linux for authorization purposes.
  • Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
  • Utilized Sqoop to import structured data from MySQL, SQL Server, and PostgreSQL, and a semi-structured CSV file dataset, into the HDFS data lake.
  • Developed a raw layer of external tables within S3 containing copied data from HDFS.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data which are also integrated with HBase.
  • Developed complex SQL statements to extract data, and packaged and encrypted data for delivery to customers.
  • Provided business intelligence analysis to decision-makers using an interactive OLAP tool
  • Created T-SQL statements (select, insert, update, delete) and stored procedures.
  • Defined Data requirements and elements used in XML transactions.
  • Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and Update Strategy.
  • Performed Tableau administration using Tableau admin commands.
  • Involved in defining the source to target Data mappings, business rules and Data definitions.
  • Ensured the compliance of the extracts to the Data Quality Center initiatives
  • Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
  • Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
  • Built reports and report models using SSRS to enable end user report builder usage.
  • Created Excel charts and pivot tables for the Ad-hoc Data pull.

Environment: Hadoop, Sqoop, Hive, HBase, HDFS, SQL, PL/SQL, T-SQL, MySQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP


Data Analyst


  • Involved in review of functional and non-functional requirements.
  • Installed and configured Hadoop MapReduce, HDFS, Developed multiple MapReduce
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
  • Imported data using Sqoop to load data from Oracle to HDFS on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Created Hive tables and worked with them using HiveQL. Experienced in defining job flows. Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
  • Partnered with ETL developers, using Pig, to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes.
  • Selected data, generated CSV files, and stored them in AWS S3 using AWS EC2, then structured and loaded the data into AWS Redshift.
  • Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse.
  • Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Inserted data into DSL internal tables from RAW external tables.
  • Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data which are also integrated with HBase
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.

Environment: Hadoop, MapReduce, Hive, AWS Redshift, SQL, PL/SQL, T-SQL, XML, Informatica, Python, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.
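
The moving average and RSI calculations mentioned above can be sketched as follows. This is a plain-Python illustration rather than the PySpark/Pandas code used on the job: the function names and sample prices are hypothetical, and the RSI shown is the simple-average variant (no Wilder smoothing), which is an assumption.

```python
def moving_average(prices, window):
    """Simple moving average: one value for each full window of prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes.

    Uses simple averages of gains and losses. Returns 100.0 when there
    are no losses in the window (a common convention).
    """
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    # Pair each of the last `period` prices with the price before it.
    changes = [b - a for a, b in zip(prices[-period - 1 : -1], prices[-period:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0
    rs = gains / losses  # equals average gain / average loss
    return 100.0 - 100.0 / (1.0 + rs)


if __name__ == "__main__":
    closes = [44.0, 44.5, 44.2, 45.1, 45.6, 45.3, 46.0]  # hypothetical prices
    print(moving_average(closes, window=3))
    print(rsi(closes, period=6))
```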
