Data Engineer/Scientist Resume
SUMMARY
- Data Engineer with 6 years of professional experience in the e-commerce and supply-chain domains, performing Statistical Modelling, Data Extraction, Data Screening, Data Cleaning, Data Exploration and Data Visualization on structured and unstructured datasets, and implementing large-scale Machine Learning algorithms to deliver insights and inferences that significantly impacted business revenue and user experience.
- Experienced in facilitating the entire lifecycle of a data science project: Data Extraction, Data Pre-Processing, Feature Engineering, Dimensionality Reduction, Algorithm Implementation, Back Testing and Validation.
- Expert at working with statistical tests: two-way independent & paired t-test, one-way & two-way ANOVA.
- Proficient in transforming data using log, square-root, reciprocal, differencing and full Box-Cox transformations, depending on the dataset.
- Adept at analyzing missing data by exploring correlations and similarities, introducing dummy variables for missing values and choosing among imputation methods such as the iterative imputer in Python.
- Hands-on experience in implementing Naive Bayes, Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering and Principal Component Analysis using the scikit-learn package in Python.
- In-depth Knowledge of Dimensionality Reduction (PCA, LDA), Hyper-parameter tuning, Model Regularization (Ridge, Lasso, Elastic Net) and Grid Search techniques to optimize model performance.
- Adept with Python and OOP concepts such as Inheritance, Polymorphism, Abstraction, Association, etc.
- Experienced in developing Deep Learning models such as Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks and LSTM networks to implement AI solutions.
- Expertise in creating executive Tableau Dashboards for Data Visualization and deploying them to servers.
- Skilled in using Pandas, Matplotlib, Seaborn and Plotly in Python for performing Exploratory Data Analysis.
- Highly skilled with Data Visualization and exploration tools such as Power BI, Tableau, Qlik and Plotly.
- Hands-on experience with Big Data tools and data stores such as Hadoop HDFS, MapReduce, Spark, Spark SQL, PySpark, Hive, Kafka, MySQL, Oracle SQL, Cassandra, MongoDB and Redshift SQL.
- Excellent exposure to Data Visualization with Tableau, Power BI, Seaborn, Matplotlib, pyplot and ggplot2.
- Working knowledge of Azure, Databricks, AWS and other data-lake-related technologies.
- Created and maintained data models using both RDBMS and NoSQL databases like Oracle, DB2, MongoDB, MySQL, Cassandra and Microsoft SQL Server, and normalized data up to third normal form (3NF) using SQL functions.
- Keep the strengths and limitations of statistical models in view while designing a model for various business contexts, and can evaluate and effectively communicate the uncertainty in the results.
- Approach each analysis in multiple ways to evaluate alternatives, compare results and present the findings.
- Skilled Business Analyst in several Software Development Life Cycle (SDLC) methodologies such as Waterfall, Waterfall-Agile Hybrid and Agile frameworks like Scrum, Kanban and Scaled Agile Framework (SAFe), with knowledge of Scrumban, Extreme Programming (XP), Spiral and Rational Unified Process (RUP).
- Strong expertise in the core Business Analyst tasks of gathering requirements and creating artifacts.
- Experienced in conducting As-Is and To-Be analysis as part of GAP analysis, with strong knowledge of processes like Risk Analysis, Cost-Benefit Analysis, Feasibility Study and Change Management. Efficient in handling Change Requests, conducting impact analysis and assessing their impact on the Triple Constraints.
- Strong business writing skills in documenting artifacts like Business Requirements Document (BRD), Use Case specifications, Functional Requirements Document (FRD) and System Requirements Specifications (SRS).
- Accustomed to a variety of requirements elicitation techniques such as JAD Sessions, Document Analysis, Interviews, Brainstorming, Focus Groups, Prototyping, Observations, and Requirements Workshop.
- Well acquainted with Use Case design, Use Case Scenarios, Business Process Modeling (BPM), workflow diagrams and technical documentation using behavioral Unified Modeling Language (UML) diagrams.
- Extensive expertise with management deliverables like Project Scope Statement, Project Charter, SWOT Analysis, Work Breakdown Structure (WBS), Critical Path Analysis and Earned Value Management.
- Broadly experienced with various Agile Prioritization techniques like MoSCoW, the Kano model and the 100 Point Method, and well-versed in writing user stories and splitting them to satisfy the INVEST criteria.
- Experience with collaboration tools like Confluence and SharePoint to improve teamwork and transparency.
- Strongly qualified in Data Modeling (creating conceptual and logical models), Entity-Relationship (E-R) diagrams, Data Mapping (from source to target fields), Data Migration, Data Profiling and Data Integration.
- In-depth understanding of Enterprise Data Warehouse systems, working on several DWH layers, Dimensional Modeling using Facts, Dimensions, Star Schema & Snowflake Schema, and OLAP Cubes (MOLAP, ROLAP).
- Executed various OLAP operations of Slicing, Dicing, Roll-Up, Drill-Down and Pivot on multidimensional data.
- Operational knowledge of the complete Extract, Transform and Load (ETL) process and working experience with the ETL tool Informatica PowerCenter (IPC) (Source Analyzer, Target Designer, Transformation Developer, Mapplet Designer, Mapping Designer, Repository Manager and Workflow Manager), performing a variety of transformations and mappings to pass data through to the target repository while monitoring the workflow.
- Six Sigma Green Belt practitioner certified by the Institute of Industrial and Systems Engineers (IISE).
TECHNICAL SKILLS
Languages: Python, Matlab, SQL
Database: MySQL, Oracle, MongoDB, Microsoft SQL Server, Cassandra, HBase
Statistical Tests: Hypothesis Testing, ANOVA tests, t-tests, Chi-Square Goodness-of-Fit test, Regression.
Validation Techniques: k-fold cross validation, Out-of-Bag Estimates, A/B Tests.
Optimization Techniques: Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Gradient Optimization - Adam, Momentum, RMSProp
Data Visualization: Tableau, ggplot2, Matplotlib, Seaborn, Plotly
Data modeling: Entity-Relationship Diagrams (ERD), Snowflake and Star Schema
Big Data: Apache Hadoop, HDFS, Kafka, MapReduce, Spark, PySpark, Databricks, Hive, Azure, AWS
Project Management Tools: Jira, Confluence, MS Project, VersionOne
Cloud Technologies: Google Colab, Google Compute
PROFESSIONAL EXPERIENCE
Data Engineer/Scientist
Confidential
Responsibilities:
- Worked independently and collaboratively throughout the analytical lifecycle including data extraction & preparation, design and implementation of scalable analysis and solutions, and documentation of results.
- Partnered with technical/non-technical resources across the business to support and integrate our efforts.
- Tested multiple classification models like Random Forest, SVM, Logistic Regression and Gradient Boosting. Also performed hyper-parameter tuning on the models in Python to optimize their predictive power.
- Experimented with various DL algorithms and ensured that the model had a low False Positive Rate.
- Obtained knowledge of image processing algorithms: encoding/decoding, feature detection and matching, image segmentation and transformation.
- Used a combination of various filter sizes for convolutions (with padding), Max Pooling, activation functions (like ReLU and Softmax), Dropout and Batch Normalization to regularize the model.
- Initially used a ResNet-50 model pretrained on ImageNet and refined it by adding, deleting and modifying blocks and layers and by fine-tuning the hyper-parameters.
- Built another CNN model to predict the product category using the label (title) of the image.
- Modified the image Convolutional Neural Network (CNN) model to accommodate the CNN for the text blocks, then integrated the two (image model and text model) to produce a reliable and highly accurate model.
- Extensively used Hadoop and Databricks clusters. Used Hive and PySpark to extract and analyze data.
- Gained hands-on experience with Linux while handling the Hadoop clusters.
- Involved in all jobs that pipelined data from the databases through to analysis and reporting.
- Ran the model multiple times while varying hyper-parameters such as learning rate, number of epochs, batch size, activation function, number of hidden layers and units, dropout and initial weights.
- Gained experience dealing with huge amounts of data from HBase, Azure Data Lake and Databricks.
- Worked on RDBMS (MySQL) and NoSQL (Cassandra) databases integrated with the Hadoop environment.
Environment: Python, MySQL, PySpark, Hadoop, Databricks, RDBMS, NoSQL, Cassandra, Regression analysis, K Nearest Neighbors, Random Forest, HBase, Hive, Azure, Naïve Bayes, SVM, K-Means Clustering and Convolutional Neural Nets.
Data Engineer/Scientist
Confidential
Responsibilities:
- Visualized the time series data and detected production flow and abnormal patterns in Python.
- Conducted data cleaning and imputed missing values to improve the consistency of the data.
- Pre-processed the data because of its influence on demand-forecasting accuracy: removed outliers from the historical data using smoothing and a Hampel filter, and replaced values that coincided with non-working holidays, since these samples were spurious outliers in the series that could harm the fitting of the forecasting models.
- Performed EDA using PySpark and Tableau, accessing the data from cloud data lakes like Azure & AWS.
- Aggregated the data on a daily basis and categorized it hierarchically.
- Reduced operating costs significantly during the first year by decreasing inconsistencies in master data, matching old products with new products and using the inventory on hand.
- Performed Double Seasonal Exponential Smoothing due to dual seasonality in the data.
- Tested the data using different models such as Naïve Method, Moving Average, Weighted Moving Average, Simple Exponential Smoothing, Holt-Winters method and General Multiplicative Double Seasonal ARIMA.
- Limited the parameters p, d, q, P1, P2, Q1, Q2 to low-order values to prevent over-fitting.
- Identified KPIs by item and region to assist in reducing inventory and cutting warehouse costs.
- Tracked the Root Mean Square Error (RMSE) to evaluate model accuracy while running all the models with different parameter values. This helped identify the best model across the combinations tried and also helped in understanding the behavior of the data.
- Integrated demand prediction with an inventory optimization model so that the model considers unavailability, resupply and warehousing costs.
- Developed new ways to forecast client product needs and used them to create purchasing forecasts.
- Worked almost exclusively on creating inventory forecasts for several retail locations.
- Assisted purchasing and logistics with a plan that would keep all stores supplied with popular products.
- Customized the forecast accuracy report and was accountable for analyzing, presenting and explaining forecast accuracy and deviations in the Sales and Operations Planning (S&OP) process.
- Ran the whole project extensively on the AWS platform, gaining exposure to different AWS clusters.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Auto Scaling, DynamoDB, Elastic MapReduce, Redshift Spectrum, Athena and other services of the AWS family.
Environment: Python, Scrum, SQL, JIRA, Time Series Forecasting, AWS, Sales and Operations Planning (S&OP) process, Naïve Method, Moving Average, Weighted Moving Average, Simple Exponential Smoothing, Holt-Winters method and General Multiplicative Double Seasonal ARIMA.
Data Engineer
Confidential
Responsibilities:
- Created Logical & Physical Data Models for Relational (OLTP) systems and Dimensional Data Models (OLAP) on a Star schema for Fact & Dimension tables using Erwin.
- Analyzed the business requirements and designed Conceptual and Logical Data Models.
- Prepared ETL technical Mapping documents, along with Mapping test cases, for future needs.
- Extensively worked on Normalization and De-normalization of data for both OLTP and OLAP systems.
- Designed and implemented Data Integration modules to perform Extract/Transform/Load functions.
- Involved in Data flow analysis, Data modeling, Data Mapping, performance analysis and tuning.
- Involved in extracting the data from various sources like Oracle, SQL, Teradata and XML.
- Managed database design and implemented a comprehensive Star-Schema with shared dimensions.
- Built Source Definition Documents from multiple sources and enterprise database management systems.
- Performed Data Analysis and Data Profiling using complex SQL queries on various source systems, including Oracle and Teradata.
- Participated in client discussions to gather scope information to provide inputs for project documents.
- Assisted in defining OLAP cube dimensions and performed roll up and drill down OLAP operations.
- Actively involved in creating ETL Mapping Sheets to define Source Target Mapping (STM).
- Created Informatica jobs, sessions and workflows to load organization-related fact and dimension tables.
- Involved in GAP analysis and Reverse mapping to check data quality and identify any missing data.
Environment: Waterfall-Scrum Hybrid, JIRA, Informatica, Erwin, MS Project, SQL Server, SharePoint, UML, Python (NumPy, Pandas, Matplotlib), Spyder.
Business Analyst/ Data Engineer
Confidential
Responsibilities:
- Elicited requirements from stakeholders by conducting Joint Application Development (JAD) sessions, User Story Workshops and semi-structured Interviews (individually and in groups).
- Actively involved in documenting user stories and assisting PO in estimating and prioritizing stories.
- Documented/Contributed to GAP analysis with the Project Manager in planning the system transition.
- Reviewed organizational projects, processes, policies and procedures to get a detailed understanding of the project. Analyzed and modified Business Requirements (BRD) and Systems Requirements (SRS).
- Developed logical and conceptual data models. Created data dictionaries and defined metadata fields.
- Created Use Cases and designed UML diagrams like Use Case, Activity and Sequence using MS Visio.
- Involved in creating conceptual/logical ER diagrams and designing DB Schemas for developing the DW.
- Assisted the Systems architect in creating dimensional models, fact tables and dimensions, and defined the technical requirements of the project for the development team to implement.
- Spearheaded the design of Data Models.
- Conducted normalization and de-normalization of data to fit OLTP and OLAP frameworks.
- Performed slicing & dicing on data before loading into Data Marts for analytical decision-making support.
- Documented the confirmed dimensional grids to define dimensional dependencies.
- Helped with data normalization up to 3NF to eliminate data redundancy and speed up query processing.
- Facilitated Scrum events and gathered stakeholder feedback and change requests. Used JIRA to manage stories, tasks and change requests. Utilized SharePoint to store all related artifacts and meeting minutes.
- Worked closely with QA and the development team to create test cases and test plans for ETL testing.
- Documented User Acceptance Tests (UAT) and collected test data from business users for the UAT.
- Extensively used inner and outer joins, select statements and aggregate functions in SQL during source data extraction and target data validation. Created different views to assist users in monitoring the DW.
- Assisted the development team in documenting and testing API requests and responses using POSTMAN.
- Collaborated with ETL developers while conducting unit testing on data model changes, ETL, and data warehouse loads to ensure accurate results and integrity within the context of the defined requirements.
- Documented Integration, Data Integrity and Acceptance Testing results at various checkpoints.
- Engaged in creating user manuals, providing user training and post-production support.
- Performed Data Visualization and Exploratory Data Analysis (EDA) to find correlation and patterns within the data using Python and Tableau. Built a report summarizing the patterns from EDA.
- Determined from the EDA reports which factors would affect the Business outcome the most.
- Analyzed internal operations and helped to identify areas where the company could improve efficiency and profitability.
Environment: Waterfall-Scrum Hybrid, JIRA, Tableau, Informatica, MS Visio, MS Project, SQL Server, SharePoint, UML, UAT, Python (NumPy, Pandas, Matplotlib), Jupyter Notebook, POSTMAN.
