Data Engineer Resume

Boston, MA


  • More than 5 years of experience implementing Big Data Engineering, Cloud Data Engineering, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality, and Data Virtualization solutions
  • Experience in Data transformation, Data mapping from source to target database schema, Data Cleansing procedures
  • Adept in programming languages such as R and Python, and in Big Data technologies such as Hadoop and Hive
  • Experienced in Normalization (1NF, 2NF, 3NF and BCNF) and De-normalization techniques for effective and optimum performance in OLTP and OLAP environments
  • Collaborated with lead Data Architect to model Data warehouse in accordance with 3NF format, and Star/Snowflake schema
  • Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and methodologies
  • Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture
  • Expertise in OLTP/OLAP System Study, Analysis and E-R modeling, developing Database Schemas like Star schema and Snowflake schema used in relational, dimensional and multidimensional modeling
  • Experienced in Data Management solution that covers DWH/Data Architecture design, Data Governance Implementation and Big Data
  • Experience in designing, building and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark
  • Experienced in Data Architecture and data modeling using Erwin, ER-Studio and MS Visio
  • Experience in coding SQL for developing Procedures, Triggers, and Packages
  • Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS
  • Good knowledge on Tableau Metadata tables
  • Experienced in handling Big Data using Hadoop ecosystem components like Sqoop, Pig and Hive
  • Experience writing Spark Streaming and Spark batch jobs, and using Spark MLlib for analytics
  • Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) such as Oracle, DB2 and SQL Server, and from RDBMS to HDFS
  • Experienced in Data Analysis, Design, Development, Implementation and Testing using Data Conversions, Extraction, Transformation and Loading (ETL) and SQL Server, Oracle and other relational and non-relational databases
  • Well experienced in Normalization, De-Normalization and Standardization techniques for optimal performance in relational and dimensional database environments
  • Solid understanding of AWS (Amazon Web Services) S3 and EC2, and of Apache Spark and Scala processes and concepts
  • Hands-on experience in machine learning, big data, data visualization, R and Python development, Linux, SQL, Git/GitHub
  • Experience with data visualization using tools like ggplot, Matplotlib, Seaborn and Tableau, and using Tableau software to publish and present dashboards and storylines on web and desktop platforms
  • Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations
  • Hands-on experience with RStudio for data pre-processing and building machine learning algorithms on different datasets
  • Experienced in Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import and Data Export through use of multiple tools such as SSIS and Informatica Power Center
  • Experienced in Data Modeling retaining concepts of RDBMS, Logical and Physical Data Modeling up to 3NF, and Multidimensional Data Modeling Schema (Star schema, Snowflake Modeling, Facts and dimensions)
  • Experienced working on NoSQL databases like MongoDB and HBase.
  • Technical proficiency in designing and data modeling Online Applications, Data Warehouse, and Business Intelligence Applications
  • Worked and extracted data from various database sources like Oracle, SQL Server, and DB2
  • Extensive working experience with Python including Scikit-learn, SciPy, Pandas, and NumPy developing machine learning models, manipulating and handling data
  • Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, Python
  • Expertise in complex Data design/development, Master data and Metadata and hands-on experience on Data analysis in planning, coordinating, and executing on records and databases
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors, Random Forest, SVM, Bayesian, XGBoost, K-means Clustering, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor analysis/PCA, Ensembles
  • Implemented machine learning algorithms on large datasets to understand hidden patterns and capture insights


Big Data Tools: Hadoop, HDFS, Sqoop, Hbase, Hive, MapReduce, Spark, Cassandra

Cloud Technologies: Snowflake, AWS

ETL Tools: SSIS, Informatica Power Center

Data Modeling: Erwin, ER Studio, Star Schema and Snowflake Schema Modeling, FACT and dimension tables, Pivot Tables

Database: Snowflake Cloud Database, Oracle, MS SQL Server, Teradata, MySQL, DB2

Operating Systems: Microsoft Windows and Unix

Reporting Tools: MS Excel, Tableau, Tableau server, Tableau Reader, Power BI, QlikView

Methodologies: Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball, Waterfall Model

Machine Learning: Regression Models, Classification Models, Clustering, Linear regression, Logistic regression, Decision trees, Random Forest, Gradient Boosting, K-nearest neighbors (KNN), K-means, Naïve Bayes, Time Series Analysis, PCA, Avro, MLbase

R: tidyr, tidyverse, dplyr, lubridate, ggplot2, tseries

Python: Beautiful Soup, numpy, scipy, matplotlib, seaborn, pandas, scikit-learn, keras

Programming Languages: SQL, R (Shiny, RStudio), Python (Jupyter Notebook, PyCharm IDE), Scala


Confidential, Boston, MA

Data Engineer


  • Worked as Data Engineer to review business requirements and compose source-to-target data mapping documents
  • Followed Agile development methodology as an active member in scrum meetings
  • Involved in Data Profiling and merging data from multiple data sources
  • Involved in Big Data requirement analysis, and in developing and designing solutions for ETL and Business Intelligence platforms
  • Designed 3NF data models for ODS, OLTP systems and dimensional data models using Star and Snowflake Schemas
  • Worked in the Snowflake environment to remove redundancy, and loaded real-time data from various data sources into HDFS using Kafka
  • Developed data warehouse model in Snowflake for over 100 datasets
  • Designed and implemented a fully operational, production-grade, large-scale data solution on Snowflake Data Warehouse
  • Worked with structured/semi-structured data ingestion and processing on AWS using S3 and Python. Migrated on-premises big data workloads to AWS
  • Involved in migrating data from existing RDBMS to Hadoop using Sqoop for processing, and evaluated the performance of various algorithms/models/strategies based on real-world data sets
  • Created Hive tables for loading and analyzing data and developed Hive queries to process data and generate data cubes for visualizing
  • Extracted data from HDFS using Hive and Presto, performed data analysis and feature selection using Spark with Scala and PySpark, and created nonparametric models in Spark
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS
  • Captured unstructured data that was otherwise not used and stored it in HDFS and HBase/MongoDB. Scraped data using Beautiful Soup and saved it into MongoDB (JSON format)
  • Worked on AWS S3 buckets and secure intra-cluster file transfer between PNDA and S3
  • Design & Implementation of Data Mart, DBA coordination, DDL & DML generation & usage
  • Provided data architecture support to enterprise data management efforts, such as development of the enterprise data model and master and reference data, as well as support to projects, such as development of physical data models, data warehouses and data marts
  • Worked with Data governance, Data quality, data lineage, Data architect to design various models and processes
  • Independently coded new programs and designed tables to load and test programs effectively for given POCs using Big Data/Hadoop
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS
  • Worked on Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications, and executed machine learning use cases under Spark ML and MLlib
  • Built and analyzed datasets using R, SAS and Python, designed data models and data flow diagrams using Erwin and MS Visio
  • Used pandas, numpy, seaborn, scipy, matplotlib, scikit-learn and NLTK in Python, along with R, to develop various machine learning algorithms for predictive modeling
  • Implemented a Python-based distributed random forest via Python streaming
  • Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, & KNN for data analysis

Environment: Python, R/RStudio, SQL, Oracle, Cassandra, MongoDB, AWS, Snowflake, Hadoop, Hive, MapReduce, Scala, Spark, Kafka, MLlib, regression, Tableau

Confidential, Boston, MA

Data Engineer


  • Gathered, analyzed, and translated business requirements to technical requirements, communicated with other departments to collect client business requirements and access available data
  • Acquired, cleaned and structured data from multiple sources and maintained databases/data systems. Identified, analyzed, and interpreted trends or patterns in complex data sets
  • Developed, prototyped and tested predictive algorithms. Filtered and cleaned data, and reviewed reports and performance indicators
  • Developed and implemented data collection systems and other strategies that optimize statistical efficiency and data quality
  • Created and statistically analyzed large data sets of internal and external data
  • Worked closely with the marketing team to deliver actionable insights from huge volumes of data coming from different marketing campaigns and customer interaction metrics such as web portal usage, email campaign responses, public site interaction, and other customer-specific parameters
  • Performed incremental loads as well as full loads to transfer data from OLTP to a snowflake-schema Data Warehouse using different data flow and control flow tasks, and provided maintenance for existing jobs
  • Designed and implemented secure data pipelines into a Snowflake data warehouse from on-premise and cloud data sources
  • Created best practices and standards for data pipelining and integration with Snowflake data warehouses
  • Responsible for data cleaning, feature scaling and feature engineering using NumPy and Pandas in Python
  • Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features
  • Worked with NoSQL databases like HBase in creating tables to load large sets of semi structured data coming from source systems
  • Used information value, principal components analysis, and Chi-square feature selection techniques
  • Used Python and R scripting to implement machine learning algorithms for predicting and forecasting data
  • Used Python and R scripting to visualize data and implemented machine learning algorithms
  • Developed packages in RStudio with a Shiny interface
  • Improved efficiency and accuracy by evaluating models in Python and R
  • Used Python and R scripts to improve models
  • Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, AdaBoost and Gradient Boosting, using Python Scikit-Learn, and evaluated performance on customer discount optimization across millions of customers
  • Built models using Python and Pyspark to predict probability of attendance for various campaigns and events
  • Implemented classification algorithms such as Logistic Regression, K-Nearest Neighbors and Random Forests to predict Customer churn and Customer interface
  • Performed data visualization and designed dashboards with Tableau, and generated complex reports, including charts, summaries, and graphs, to interpret findings for the team and stakeholders

Environment: OLTP Data Warehouse, Hadoop, Hive, HBase, Spark, Snowflake, R/RStudio, Python (Pandas, NumPy, Scikit-Learn, TensorFlow, SciPy, Seaborn, Matplotlib), SQL, Machine Learning, ggplot, lattice, MASS, mice and logit.


Data Engineer


  • Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce
  • Worked with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce, Hive, Spark jobs
  • Performed Data Mapping and Data design (Data Modeling) to integrate data across multiple databases into EDW
  • Responsible for design and development of advanced R/Python programs to prepare, transform and harmonize data sets in preparation for modeling
  • Hands-on experience with Hadoop/Big Data technologies in storage, querying, processing and analysis of data
  • Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources. Used clustering techniques like K-means to identify outliers and to classify unlabeled data
  • Data gathering, data cleaning and data wrangling performed using Python and R
  • Transformed raw data into actionable insights by incorporating various statistical techniques, data mining, data cleaning, data quality and integrity checks, utilizing Python (Scikit-Learn, NumPy, Pandas, and Matplotlib) and SQL
  • Calculated errors using various machine learning algorithms such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net regression, KNN, Decision Tree Regressor, SVM, Bagging Decision Trees, Random Forest, AdaBoost, and XGBoost. Selected the best model based on MAE
  • Experimented with Ensemble methods to increase accuracy of training model with different Bagging and Boosting methods
  • Identified target groups by conducting Segmentation analysis using Clustering techniques like K-means
  • Conducted model optimization and comparison using stepwise function based on AIC value
  • Used cross-validation to test models with different batches of data to optimize models and prevent overfitting
  • Worked and collaborated with various business teams (operations, commercial, innovation, HR, logistics, safety, environmental, accounting) to analyze and understand changes in key financial metrics and provide ad-hoc analysis that can be leveraged to build long term points of view where value can be captured
  • Explored and analyzed customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau

Environment: Machine Learning, R Language, Hadoop, Big Data, Python, DB2, MongoDB, Web Services


Data Engineer


  • Analyzed and translated Functional Specifications and Change Requests into Technical Specifications
  • Designed and implemented Big Data Analytics architecture, transferring data from Oracle Data Warehouse, external APIs and flat files to Hadoop using Hortonworks
  • Designed and developed Use Cases, Activity Diagrams, Swim Lane Diagrams and Process flows using Unified Modeling Language (UML)
  • Ran SQL queries for data validation and performed quality analysis on data extracts to ensure data quality and integrity across various database systems
  • Involved with Data Profiling activities for new sources before creating new subject areas in warehouse
  • Created DDL scripts for implementing Data Modeling changes
  • Worked with ETL processes to transfer/migrate data from relational databases and flat files in various formats, through common staging tables, to meaningful data in Oracle and MS SQL
  • Tested the ETL process both before and after data validation. Tested messages published by the ETL tool and data loaded into various databases
  • Responsible for different Data mapping activities from Source systems
  • Performed extensive data cleansing, data manipulations and date transforms and data auditing
  • Involved in SQL Development, Unit Testing and Performance Tuning, and ensured testing issues were resolved on the basis of defect reports
  • Involved in Data mapping specifications to create and execute detailed system test plans. Data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity
  • Developed Data Migration and Cleansing rules for Integration Architecture (OLTP, ODS, DW)
  • Responsible for defining naming standards for data warehouse
  • Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data, user reviews) to generate KPIs using Tableau
  • Wrote SQL queries for visualization and reporting systems. Good experience with the visualization tool Tableau
  • Wrote ETL scripts in SQL for extracting and validating data

Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel
