- Data Scientist with 6+ years of experience in statistical testing methods, data analytics, data management and predictive analytics, delivering resourceful insights and business strategies
- Experience working in various domains such as Banking, E-commerce, Education and Healthcare, performing result-oriented statistical experiments to identify solutions for business problems
- Experience in Data Profiling, Data cleansing, Data mapping, Data chunking, creating workflows and Data validation using data integration tools like Informatica, Talend Open Studio during ETL and ELT processes
- Extensive knowledge in Machine Learning techniques like Regression Modeling, Classification, Neural Networks, SVM, Clustering, Decision Tree & Random Forest, Association Rule Mining
- Experience working with R packages like tidyverse and ggplot2
- Experience in statistical hypothesis testing, including ANOVA, t-tests and the Chi-Square goodness-of-fit test
- Experience with the Apache Hadoop ecosystem, with good knowledge of Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, HBase, Sqoop, Flume, Cassandra, Spark, Oozie, Kafka
- Experience working with various RDBMS like Oracle and MySQL, with expertise in creating tables, data manipulation, data extraction and data screening in these databases
- Skilled in using dplyr in R and pandas in Python for performing exploratory data analysis.
- Experience in Text Mining and good working knowledge of NLP components such as Natural Language Generation (NLG) and Natural Language Understanding (NLU), using the Python NLTK package.
- Experience in ingesting datasets from various data sources including HDFS, Cassandra, AWS and RDBMS like MySQL, Oracle, SQL Server, DB2, Teradata, SAP HANA etc.
- Experience in data modeling and data analysis with OLTP and OLAP systems, and with data warehousing approaches such as EDW, MOLAP DM and ROLAP.
- Experience with DBA tasks involving database creation, data profiling, data cleaning, performance tuning, creation of indexes, and creating and modifying tablespaces for better performance
- Expertise in Cost Benefit Analysis, Feasibility Analysis, Impact Analysis, Gap Analysis, SWOT analysis and ROI analysis, SCRUM, leading JAD sessions and Dashboard Reporting using tools like Tableau & Power BI
- Experience in SAS/STAT, Stata, R, SQL, Tableau, Python, MS Excel (VLOOKUP, Pivot tables, Macros).
- Skilled in ERD and UML modeling, and in deriving physical models from logical models.
- Expertise in creating Tableau dashboards for data visualization and deploying them to servers.
- Expertise in SQL queries, with 5 years of experience creating and populating databases and extracting data using tables, subqueries, joins, views, indexes and SQL functions
- Experience in data warehousing concepts like Star Schema, Galaxy Schema, Snowflake Schema, Data Marts and the Kimball methodology, as used in relational and multidimensional data modeling.
- Good knowledge of normalization and denormalization techniques for optimal schema design.
- Experience with conceptual, logical and physical data modeling in line with metadata standards.
- Proficient with Python, R and Object-Oriented Programming concepts such as Inheritance, Polymorphism, Abstraction, Encapsulation, Association, Aggregation, etc.
- Expertise in cloud technologies such as AWS, Azure and Google Cloud; retrieved data from the cloud for data screening and analytical operations to provide insights
- Expertise in Exploratory Data Analysis (EDA), using numerical summaries and relevant visualizations to guide feature engineering and assess feature importance
Statistical Tests: Hypothesis testing, ANOVA tests, t-tests, Chi-Square goodness-of-fit test, Regression, Time series analysis.
Machine Learning Algorithms: Regression Models (Linear, Polynomial, Support Vector, Decision Trees); Classification Models (Logistic Regression, Decision Trees, Support Vector Machines); Ensemble Learning (Random Forest, Bagging Trees, Gradient Boosting Machine); Text Mining (NLP)
Validation Tests: Monte Carlo methods, k-fold cross validation, Out-of-Bag (OOB) Estimate
Analytical tools: Google Analytics, RStudio, SAS, MATLAB, Azure Data Lake Analytics, Google Ads
Data Visualization: Tableau, Microsoft Power BI, R ggplot2 and plotly, Python matplotlib, seaborn, bokeh
Data modeling: Entity Relationship Diagrams (ERD), Snowflake schema, Star schema
Languages: SQL, HiveQL, C, R, Python, SAS
Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7, Teradata, DB2, Amazon Redshift, SAP HANA
NoSQL Databases: HBase, Apache Cassandra
ETL Tools: Informatica PowerCenter 9.0, Informatica IDQ, Talend Open Studio, Kafka, Flume, Microsoft SSIS, Apache Spark
Big Data: Apache Hadoop, HDFS, Sqoop, Spark, Flume, Kafka, Hive, Impala, MapReduce, Splunk ML-SPL, Splunk Hadoop Connect, Oozie
Confidential, New York City, NY
- Cleaned a large dataset sourced from Hadoop that contained many missing values and extreme outliers
- Used MICE in R and the iterative imputer in Python to impute missing observations from the existing observations, and identified extreme outliers for removal using Mahalanobis distance with a chi-square cutoff
- Used Cook’s distance to detect influential observations in the dataset and removed the outliers
- Used two-sample independent t-tests to assess the differences in mean purchases across dichotomous variables such as gender and marital status, and used one-way ANOVA with Tukey's test to assess differences in mean purchases across polychotomous variables such as occupation and age
- Trained Multiple Linear Regression, Decision Tree Regression, Support Vector Regression and ensemble models such as Bagging, Random Forests and Gradient Boosting Machine on 70% of the data, optimized the models using Grid Search, and made predictions on the test set with each trained model
- Computed absolute and relative returns from the simulation and plotted histograms of the selections to find the best ad and strategy using the reinforcement learning algorithm (Thompson Sampling)
- Selected the final model using the Gain Plot Curve, Relative Gini Score, Root Mean Squared Error and Mean Absolute Error, and validated it using ten-fold cross validation
- Summarized the final results in a Tableau dashboard and presented it to the client
Environment: Tableau, Excel, Python (Pandas, Scikit-Learn, NumPy), Jupyter Notebook, R, MySQL, Apache Hadoop Distribution 2.7.X, HDFS, Linux, MS Office suite, Apache Spark
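The outlier screening listed for this project (Mahalanobis distance with a chi-square cutoff) can be sketched roughly as follows. This is an illustrative sketch, not the project code: the function name `mahalanobis_outliers`, the restriction to two features, and the `alpha` level are all assumptions made for the example. With two features the chi-square quantile has a closed form, so only the standard library is needed.

```python
import math


def mahalanobis_outliers(points, alpha=0.001):
    """Flag 2-D points whose squared Mahalanobis distance exceeds a
    chi-square cutoff (hypothetical helper; alpha is an assumed level)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (divide by n - 1).
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the (1 - alpha) quantile is -2 * ln(alpha).
    cutoff = -2.0 * math.log(alpha)
    flags = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Squared distance d^T S^{-1} d using the explicit 2x2 inverse.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > cutoff)
    return flags
```

Because the covariance is estimated from the same data, a single extreme point inflates it; with very small samples no point can exceed the cutoff, so this kind of screen only makes sense on reasonably large datasets.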
Confidential, Santa Clara, CA
- Performed extensive data exploration and generated features from 4 TB of data using Hadoop
- Used Python to develop a variety of models and algorithms to predict loan default from the available parameters
- Applied advanced techniques (e.g., text mining, statistical analysis) and performed A/B testing to assess historical loan defaults across binary categories, and validated the results using Chi-Square tests.
- Successfully formulated the problem and built a classification model to predict the probability of loan default using Logistic Regression and Decision Tree classification
- Employed ensemble learning methods like Random Forest and Gradient Boosting to predict the probability of loan default and improved the recall by 40% over the existing system
- Used Python libraries such as NumPy, Matplotlib and pandas to work with dataframes and to plot graphs
- Validated the model using ten-fold cross validation, used hyperparameter tuning techniques such as OOB estimates and Grid Search to find the optimal model, and selected the best model based on precision, recall, ROC curve, lift charts, AUC and Pseudo R-square.
- Developed simple and compound features from over 50 tables containing details of user accounts, transactions, location, logging, time of the user transaction, system interactions etc.
Environment: SQL Server, ETL, Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), R, Hadoop Framework, HDFS, Jupyter Notebook, Apache Spark
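As a rough illustration of validating a binary comparison with a Chi-Square test, as mentioned for this project, the sketch below computes the Pearson statistic and p-value for a 2x2 table. The function name `chi_square_2x2` and the cell layout are assumptions for the example, no continuity correction is applied, and the code is not taken from the project.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table
    [[a, b], [c, d]] (hypothetical helper, no continuity correction)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    # Shortcut formula for the chi-square statistic of a 2x2 table.
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Here the rows could hold the two levels of a binary category and the columns the counts of defaulted and non-defaulted loans; a small p-value suggests the default rate differs between the two levels.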
Confidential, Bentonville, AR
- Applied Digital Marketing analytics using Facebook Analytics, Tableau visualizations and Looker
- Collected, processed and cleansed raw data from a wide variety of sources using R and performed statistical testing such as two-sample independent t-tests to assess the differences in mean purchases across variables
- Enabled analysis by producing product information, contributed to research and development efforts, and created data-based customer profiles to build a geo-demographic segmentation model and efficiently allocate resources in future expanding markets
- Directly worked with cross-functional teams including analysts, engineers, product managers & executive management to understand the business needs & build data-driven strategies to help them meet their goals
- Provided analytical insights and dashboards using Tableau and presented them to the client
- Created a recommendation system based on customer purchasing history using Machine Learning algorithms such as K-NN and association rule mining
- Analyzed and evaluated performance results from model execution to find key trends and opportunities to expand the business of the client
Environment: SQL Server 2012, Python 3.x (Scikit-Learn, NumPy, Pandas, Matplotlib), Tableau, Looker, Facebook Analytics, R, Linux, MS Excel.
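The association rule mining mentioned for the recommendation system rests on three standard measures: support, confidence and lift. The sketch below shows how they might be computed for a single rule over purchase baskets. The function name `rule_metrics` and the basket representation (one collection of item names per transaction) are assumptions for illustration, not the project's implementation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule
    antecedent -> consequent (hypothetical helper)."""
    antecedent, consequent = set(antecedent), set(consequent)
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    # Count baskets containing the antecedent, the consequent, and both.
    n_a = sum(1 for b in baskets if antecedent <= b)
    n_c = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("rule items never occur in the transactions")
    support = n_both / n           # share of baskets containing the whole rule
    confidence = n_both / n_a      # P(consequent | antecedent)
    lift = confidence / (n_c / n)  # confidence relative to the base rate
    return support, confidence, lift
```

A lift above 1 indicates that customers who buy the antecedent items buy the consequent more often than the base rate, which is what makes a rule useful as a recommendation.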
Junior Data Scientist
- Collected data from the end client, performed ETL and defined a uniform standard format
- Wrote queries to retrieve data from the SQL Server database to get a sample dataset containing basic fields
- Performed string formatting on the dataset, converting hours from date format to integers
- Used Python libraries like Matplotlib and Seaborn to visualize the numerical columns of the dataset such as day of week, age, hour and number of screens.
- Developed and implemented predictive models like Logistic Regression, Decision Tree, Support Vector Machine (SVM) to predict the probability of enrollment
- Used Ensemble learning methods like Random Forest, Bagging & Gradient Boosting & picked the final model based on confusion matrix, ROC & AUC & predicted the probability of customer enrollment
- Tuned the hyperparameters of the above models using Grid Search to find the optimum models
- Designed and implemented K-Fold Cross-validation to test and verify the model's significance
- Developed a dashboard and story in Tableau showing the benchmarks and a summary of the model's metrics.
Environment: SQL Server 2012/2014, Python 3.x (Scikit-Learn, NumPy, Pandas, Matplotlib, Dateutil, Seaborn), Tableau, Hadoop
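The K-Fold Cross-validation step listed for this role can be illustrated by the index-splitting logic below. This is a minimal standard-library sketch with an assumed function name (`k_fold_indices`) and an assumed fixed seed; in practice a library utility such as Scikit-Learn's `KFold` would typically handle this.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples
    (hypothetical helper; the seed is fixed only for reproducibility)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Training set is everything outside the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits
```

Each sample appears in exactly one test fold, so averaging a metric across the k held-out folds gives an estimate that uses every observation for validation once.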
- Gathered and managed data in SQL Server 2008 and MS Access, conducted in-depth data analysis and predictive modelling to uncover hidden patterns, and communicated the insights to the product, sales and marketing teams
- Performed data analysis on target data after transfer to the Data Warehouse
- Created and automated dashboards using Excel VBA
- Worked with the Data Architect on a dimensional model using both Star and Snowflake schemas
- Created an ETL solution using Informatica to read Product and Order data from files on a shared network into a SQL Server database
- Made business recommendations based on data collected to improve business efficiency.
- Created data visualizations, dashboards and advanced storytelling reports using Tableau and MS Power BI
Environment: Windows XP, SQL Server 2005/2008, PostgreSQL, MSSQL, SQLite, Excel VBA, MS Office 2010, MS Access 2010, Tableau, SSIS, MS Power BI