Sr. Data Scientist/ Machine Learning Lead Resume
PROFESSIONAL SUMMARY:
- An innovative Data Scientist/AI expert with over six years of corporate experience and a Bachelor's degree in Electrical Engineering and Mathematics, with a successful background in delivering end-to-end solutions that address customer pain points across domains such as Recommender Systems, Anomaly Detection, NLP, Machine Vision, Collaborative Filtering, Clustering, Classification, Regression, Deep Learning, SLAM, and statistical modelling.
- Expert in statistical programming languages like R, SPSS, and Python (Pandas, NumPy, scikit-learn, Beautiful Soup) and Mahout for implementing machine learning algorithms in production environments. Strong experience using TensorFlow, MXNet, Theano, Caffe, and other open-source frameworks.
- Actively participated in all phases of the project life cycle, including data acquisition (web scraping), data cleaning, data engineering (dimensionality reduction via PCA and LDA, normalization, weight of evidence, information value), feature selection, feature scaling and feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), testing and validation (ROC plots, k-fold cross-validation), and data visualization.
- Experience working the full data insight cycle: discussions with the business, understanding business logic and business drivers, exploratory data analysis, identifying predictors, enriching data, handling missing values, exploring data dynamics, and building predictive models where predictability can be found.
- Excellent data visualization experience, either with proprietary code in R or Python or using other visualization tools, delivering insights ready for digestion by the business and for decision making at senior management level (Global CTO, Global BI leadership).
- Extensive experience using R packages such as ggplot2, caret, and dplyr.
- Extensive experience creating visualizations and dashboards using R Shiny.
- Experience collating sparse data into single source, working with unstructured data, writing custom data logic validation scripts
- Extensive experience in data cleaning, web scraping, fetching live streaming data, data loading, and data parsing using a wide variety of Python and R packages such as Beautiful Soup.
- Hands-on experience implementing SVM, Naïve Bayes, Logistic Regression, LDA, Decision Trees, Random Forests, recursive partitioning (CART), Passive-Aggressive classifiers, Bagging, and Boosting.
- Experienced with Big Data tools like Hadoop (HDFS), SAP HANA, Hive, and Pig.
- Expertise in writing effective test cases and Requirement Traceability Matrices to ensure adequate software testing and manage scope creep.
- Experience in working with Data Management and Data Governance based assignments.
- Proficient with high-level Logical Data Models, Data Mapping and Data Analysis.
- Extensive knowledge in Data Validation in Oracle and MySQL by writing SQL queries.
- Experience in healthcare management and retail, with excellent domain knowledge of the financial industry's instruments and markets (capital and money markets). Excellent communication, analytical, interpersonal, and presentation skills; expert at managing multiple projects simultaneously.
- Experience working with on-shore, off-shore, on-site, and off-site individuals and teams.
- Strong understanding of Software Testing Techniques especially those performed or supervised by BA including Black Box testing, Regression testing, and UAT .
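The testing-and-validation workflow mentioned above (k-fold cross-validation scored with ROC) can be sketched in Python; the data set and model below are synthetic placeholders, not taken from any project listed here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data standing in for real project data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold stratified cross-validation, scored with ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean())  # mean ROC-AUC across the five folds
```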
TECHNICAL SKILLS:
Platform: Windows 98/2000/XP/Vista/7, UNIX, Mac OSX, LINUX, ERP, SOA.
Databases: Hadoop (HDFS), NoSQL, SAP HANA, IBM DB2, Oracle, MS Access, MS SQL, Pig, Hive, Spark SQL
Languages: R, SPSS, Python, SQL, C, HTML, VB scripting, ASP, Visual Studio, SOA (XML)
Packages: Pandas, NumPy, scikit-learn, Beautiful Soup, ggplot2, caret, Mahout, dplyr, ggmcmc, ReporteRs, knitr, RJSONIO, shinyjs, etc.
Business Process Modeling Tools: MS VISIO, RUP Tools (Rational Requisite Pro, Rational Rose, Rational Clear Case)
Documentation Tools: SharePoint 2013, MS Office (Word/Excel/PowerPoint/Visio)
PROFESSIONAL EXPERIENCE:
Confidential
Sr. Data Scientist/ Machine Learning Lead
Responsibilities:
- The goal of this project was to increase basket size, deliver individualized experiences, and drive sales. The project was based on an ensemble model implemented entirely in Spark (Scala).
- This was a complex model built on the Machine Learning pipelines available through the Databricks API.
- The input is the set of items present in the basket; the output is a recommended SKU.
- Filters were added to apply human guidance to the machine-learning-driven experiences.
- Boosters were added to incorporate click-stream behavior for individualized experiences.
- Variations were added to randomize the mix.
- All levels of the product hierarchy are passed through an FPGrowth container to create associations at each level.
- A cosine-similarity-based approach is applied to the last two levels of the product hierarchy.
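The cosine-similarity step above can be illustrated in plain NumPy; the production pipeline ran in Spark (Scala), and the SKU co-occurrence vectors below are invented purely for illustration:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two item co-occurrence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy basket co-occurrence vectors for two hypothetical SKUs
sku_a = np.array([3.0, 0.0, 1.0, 2.0])
sku_b = np.array([2.0, 1.0, 0.0, 2.0])
print(round(cosine_sim(sku_a, sku_b), 3))
```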
Confidential
Data Scientist
Responsibilities:
- Given a pair of shoes, find pairs of shoes that look similar.
- Solved using representations of an image in a high-dimensional space.
- Locality-Sensitive Hashing (LSH) was used to find neighbourhoods.
- Raw image representations were transformed into higher-level representations using a Deep Image Featurizer to capture information such as edges, texture, etc.
- All the algorithms were stitched together into Machine Learning Pipelines using Transformers and Estimators.
- This is an image recognition task consisting of finding a SKU based on an image. The project is still in inception and will rely heavily on CNNs and RNNs.
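The LSH neighbourhood idea above can be sketched with random-hyperplane hashing in NumPy; the dimensions and hyperplane count are arbitrary examples, not the production settings (which used Spark's pipeline implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(16, 128))  # 16 random hyperplanes over 128-d features

def lsh_signature(vec):
    """Hash a feature vector to a bucket via its sign pattern vs. each plane."""
    return tuple((planes @ vec > 0).astype(int))

v = rng.normal(size=128)                 # a stand-in image feature vector
noisy = v + 0.01 * rng.normal(size=128)  # a near-duplicate of it

# Near-duplicates agree on (almost) every signature bit, so they land
# in the same or a nearby bucket -- that is the neighbourhood search.
same = sum(a == b for a, b in zip(lsh_signature(v), lsh_signature(noisy)))
print(same, "of 16 bits match")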
Environment: Shiny, Hadoop, Hive, Scala, Spark, Zeppelin, PySpark, Jupyter, R, SparkR, D3.js, PuTTY, Databricks, MicroStrategy, Power BI, Git, Bitbucket, web technologies including HTML, CSS, JavaScript, RStudio Server, HPC, SQL
Confidential, Boston, MA
Data Science consultant/ R developer
Responsibilities:
- Exploratory Data Analysis using R/Matlab
- Identifying different algorithms for dynamic reporting for the advisory services
- Built the model and tested its performance
- Ensured the coding was done in adherence to established coding standards
- Deployed the tool in the production environment
- Interacted with the client on a daily basis for requirements gathering
Confidential, Indianapolis, IN
Data Scientist
Responsibilities:
- Went through the data cleaning and manipulation phase on labeled and unlabeled image data sets.
- Handled unbalanced data set problems, such as models failing to learn and label-imbalance issues.
- Implemented several resampling methods.
- An overfitting issue was present when the model failed to generalize on the resampled data; data augmentation, batch normalization, L2 regularization, and dropout helped overcome this issue.
- A ResNet architecture was used, which allows a comparatively smaller network.
- Used Keras for the implementation and trained with a cyclic learning rate schedule.
- Using the cyclic learning rate, an automatic schedule was implemented over three cycles, totalling about 20 hours of training time.
- Accuracy, kappa, precision, and F1 score were calculated to compare the results of four different approaches: naïve, resampled, weighted, and ResNet.
- About 80% accuracy was achieved using ResNet.
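The cyclic learning-rate schedule mentioned above can be sketched as a simple triangular schedule; the bounds and cycle length below are illustrative assumptions, not the values used in the actual Keras training:

```python
def cyclic_lr(step, base_lr=1e-4, max_lr=1e-2, cycle_len=2000):
    """Triangular cyclic schedule: ramp up for half a cycle, then back down."""
    pos = step % cycle_len
    half = cycle_len / 2
    frac = pos / half if pos <= half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac

# Start of a cycle, peak at mid-cycle, back to the base at the next cycle
print(cyclic_lr(0), cyclic_lr(1000), cyclic_lr(2000))
```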
Environment: TensorFlow, Keras, Python, HPC, sparklyR
Confidential
Data Scientist
Responsibilities:
- Went through the data cleaning and manipulation phase on clinical trial data sets for different drugs
- R functions were written for data piping and manipulation before the data was fed into the Bayesian models for meta-analysis
- Included several likelihood models such as normal, binomial, time-to-event (TTE), cloglog, survival, and Poisson models
- Argument parsing was done using the docopt package.
- MCMC sampling was implemented using the JAGS sampler.
- WinBUGS code was included for data processing and model implementation
- Several visualizations (density plots, forest plots, leverage plots, network plots, covariate adjustment plots, etc.) were made using packages such as ggplot2, ggmcmc, and animation
- Customized reports and presentations were generated automatically by the tool for different models using R packages, e.g. rmarkdown, animation, knitr, and ReporteRs
- Eventually everything was put into a package for Lilly-internal use.
- The tool was tested under system testing and user acceptance testing in a regulated environment.
- Created a Shiny dashboard app with the capability of saving different sessions.
- These sessions can be saved and reactivated for later use.
- Worked on an internally developed package, "GLMCMP"
- Integrated jqPlot charts and graphs for jQuery within Shiny for draggable plots; added a capability to add up to 500 single/reverse distributions using modules.
- Summary reports were made in different formats using custom plots.
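The MCMC sampling step can be illustrated with a toy random-walk Metropolis sampler. The project itself used JAGS from R, so this Python sketch, targeting a standard normal for simplicity, is only a stand-in for the idea:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(x):
    """Log-density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x

x, samples = 0.0, []
for _ in range(20000):
    prop = x + rng.normal(scale=1.0)   # random-walk proposal
    # Metropolis accept/reject on the log scale
    if np.log(rng.uniform()) < log_post(prop) - log_post(x):
        x = prop
    samples.append(x)

# Discard burn-in, then check the chain recovers mean 0 and std 1
post = np.asarray(samples[5000:])
print(float(post.mean()), float(post.std()))
```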
Environment: R, Shiny, JavaScript, jQuery
Confidential, Mooresville, NC
Data Scientist
Responsibilities:
- Performed Data Profiling to learn about user behavior
- Merged user data from multiple data sources
- Performed Exploratory Data Analysis using R and Hive on Hadoop HDFS
- Prototype machine learning algorithm for POC (Proof of Concept)
- Performed data cleaning, feature scaling, and feature engineering
- Developed novel approach to build machine learning algorithm and implement it in production environment
- Performed ad-hoc data analysis for customer insights using Hive
- Developed Performance metrics to evaluate Algorithm’s performance
- Used RMSE, F-score, precision, recall, and A/B testing to evaluate the recommender's performance in both simulated and real-world environments
- Fine-tuned the algorithm using a regularization term to overcome the problem of overfitting
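The offline metrics above (precision, recall, RMSE) can be sketched by hand on toy recommendation data; the labels, scores, and cutoff below are invented for illustration:

```python
import numpy as np

actual = np.array([1, 0, 1, 1, 0, 1])            # did the user buy the item?
scores = np.array([0.9, 0.3, 0.8, 0.4, 0.6, 0.7])  # recommender scores
pred = (scores >= 0.5).astype(int)               # recommend above a cutoff

tp = int(np.sum((pred == 1) & (actual == 1)))    # true positives
precision = tp / int(np.sum(pred == 1))          # of recommendations, how many bought
recall = tp / int(np.sum(actual == 1))           # of purchases, how many recommended
rmse = float(np.sqrt(np.mean((scores - actual) ** 2)))
print(precision, recall, round(rmse, 3))
```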
Environment: Teradata, Oracle, Hadoop (HDFS), Pig, MySQL, RStudio, Python, Java, Mahout, Hive, Spark
Confidential, Chicago, IL
Data Scientist
Responsibilities:
- Defined Project Scope, project Charter & Business Case
- Prototype machine learning algorithm for POC (Proof Of Concept)
- Performed data cleaning, feature scaling, and feature engineering
- Developed predictive models for use in machine learning platform using the scikit-learn python framework
- Improved statistical models using learning curves, parameter curves, feature selection, and regularization.
- Performed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster
- Developed MapReduce pipeline for feature extraction
- Implemented Support Vector Machine (lite)
- Performed Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)
- Fine-tuned the low-bias/high-variance trade-off
- Defined the technical requirements of the analytic solutions.
- Defined the data requirements of the analytic solution.
- Worked on commercial data from disparate source systems; built data models and transformed data to provide added value in IT applications by streamlining processes, reducing cost, maximizing profits, and rolling out business solutions that met the stated objectives
- Worked closely with subject-matter experts and business analysts from SAP and non-SAP systems and platforms, investigating statistical, predictive, and prescriptive patterns in the data to build business solutions
- Made iterative changes to analytic/predictive models and decision logic embedded in operational applications and business process platforms
- Worked with multiple relational, dimensional, and OLAP databases
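The PCA dimensionality-reduction step listed above can be sketched with scikit-learn; the data here is synthetic (the original work ran against production feature sets):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] *= 5.0  # give one direction dominant variance, as real features often do

# Project 10-d features down to the 3 directions of largest variance
pca = PCA(n_components=3)
X_low = pca.fit_transform(X)
print(X_low.shape, float(pca.explained_variance_ratio_.sum()))
```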
Environment: MS SQL, Oracle, Hadoop (HDFS), Pig, MySQL, SAP Sybase, RStudio, Python, Java, .NET, Hive, Mahout
Confidential, Northbrook, IL
Data Scientist
Responsibilities:
- Data analysis and visualization (Python, R)
- Designed, implemented and automated modeling and analysis procedures on existing and experimentally created data
- Increased pace & confidence of learning algorithm by combining state of the art technology and statistical methods; provided expertise and assistance in integrating advanced analytics into ongoing business processes
- Parsed data, producing concise conclusions from raw data in a clean, well-structured and easily maintainable format
- Implemented topic modelling, Passive-Aggressive, and other linear classifier models
- Performed TF-IDF weighting and normalization
- Performed scheduled and ad-hoc data-driven statistical analysis, supporting existing processes
- Developed clustering models for customer segmentation using R
- Created dynamic linear models to perform trend analysis on customer transactional data in R
- Performed Topic modeling
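The TF-IDF weighting and normalization step above can be sketched with scikit-learn; the documents below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["customer bought shoes", "customer returned shoes",
        "new store opening downtown"]
vec = TfidfVectorizer(norm="l2")  # rows come out L2-normalized
X = vec.fit_transform(docs)       # sparse document-term TF-IDF matrix
print(X.shape, sorted(vec.vocabulary_)[:3])
```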
Environment: R, SQL, Python, Tableau, SAP HANA, SAS, Java, PCA & LDA, regression, logistic regression, random forest, neural networks, topic modelling, NLTK, SVM (Support Vector Machine), JSON, XML, Hive, Hadoop, Pig, Mahout