- Innovative Data Scientist/AI expert with over six years of corporate experience and a Bachelor’s degree in Electrical Engineering and Mathematics, with a successful background in delivering end-to-end solutions that address customer pain points across domains such as Recommender Systems, Anomaly Detection, NLP, Machine Vision, Collaborative Filtering, Clustering, Classification, Regression, Deep Learning, SLAM, and statistical modelling.
- Expert in statistical programming languages such as R, SPSS, and Python (Pandas, NumPy, scikit-learn, Beautiful Soup), and in Mahout for implementing machine learning algorithms in production environments.
- Strong experience using TensorFlow, MXNet, Theano, Caffe, and other open-source frameworks.
- Actively participated in all phases of the project life cycle, including data acquisition (web scraping), data cleaning, data engineering (dimensionality reduction with PCA and LDA, normalization, weight of evidence, information value), feature selection, feature scaling and feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), testing and validation (ROC plot, k-fold cross-validation), and data visualization.
- Experience working the full data insight cycle: from discussions with the business, understanding business logic and business drivers, Exploratory Data Analysis, identifying predictors, enriching data, working with missing values, and exploring data dynamics, to building predictive data models (if predictability can be found).
- Excellent data visualization experience, either with proprietary code in R or Python or using other visualization tools, producing output ready for insight digestion by the business and decision making by senior management (Global CTO, Global BI Leadership level).
- Extensive experience using R packages such as ggplot2, caret, and dplyr.
- Extensive experience in creating visualizations and dashboards using R Shiny.
- Experience collating sparse data into a single source, working with unstructured data, and writing custom data logic validation scripts.
- Extensive experience in data cleaning, web scraping, fetching live streaming data, data loading, and data parsing using a wide variety of Python and R packages such as Beautiful Soup.
- Hands-on experience in implementing SVM, Naïve Bayes, Logistic Regression, LDA, Decision Trees, Random Forests, recursive partitioning (CART), Passive Aggressive, Bagging, and Boosting.
- Experienced with Big Data tools such as Hadoop (HDFS), SAP HANA, Hive, and Pig.
- Expertise in writing effective Test Cases and Requirement Traceability Matrices to ensure adequate software testing and manage scope creep.
- Experience working on Data Management and Data Governance assignments.
- Proficient with high-level Logical Data Models, Data Mapping and Data Analysis.
- Extensive knowledge of data validation in Oracle and MySQL using SQL queries.
- Experience in Healthcare Management and Retail, with excellent domain knowledge of the financial industry’s instruments and markets (Capital & Money).
- Excellent communication, analytical, interpersonal, and presentation skills; expert at managing multiple projects simultaneously.
- Experience working with onshore, offshore, on-site, and off-site individuals and teams.
- Strong understanding of Software Testing Techniques, especially those performed or supervised by a BA, including Black Box testing, Regression testing, and UAT.
Platform: Windows 98/2000/XP/Vista/7, UNIX, Mac OSX, LINUX, ERP, SOA.
Databases: Hadoop (HDFS), NoSQL, SAP HANA, IBM DB2, Oracle, MS Access, MS SQL, Pig, Hive, Spark SQL
Languages: R, SPSS, Python, SQL, C, HTML, VB scripting, .ASP, Visual Studio, SOA (XML)
Packages: Pandas, NumPy, scikit-learn, Beautiful Soup, ggplot2, caret, Mahout, dplyr, ggmcmc, ReporteRs, knitr, RJSONIO, shinyjs, etc.
Business Process Modeling Tools: MS Visio, RUP Tools (Rational RequisitePro, Rational Rose, Rational ClearCase)
Documentation Tools: SharePoint 2013, MS Office (Word/Excel/PowerPoint/Visio)
Confidential, Indianapolis, IN
Sr. Data Scientist
- Cleaned and manipulated labeled and unlabeled image data sets.
- Addressed problems caused by the unbalanced data set, such as models failing to learn and label imbalance.
- Implemented several resampling methods.
- Overfitting appeared when the model failed to generalize on the resampled data. Data augmentation, batch normalization, L2-norm regularization, and dropout helped to overcome this issue.
- Used the ResNet architecture, which uses a smaller network
- Used Keras for implementation and trained with a cyclic learning rate schedule.
- Implemented the cyclic learning rate as an automatic schedule that ran for three cycles over about 20 hours.
- Calculated accuracy, kappa, precision, and F1 score to compare the results of four approaches: naïve, resampled, weighted, and ResNet
- Achieved about 80% accuracy using ResNet
Environment: TensorFlow, Keras, Python, HPC
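To illustrate the cyclic learning rate schedule mentioned in the bullets above, here is a minimal sketch of a triangular schedule in plain Python. The function name `triangular_lr`, the bounds `base_lr` and `max_lr`, and the `step_size` value are assumptions chosen for illustration; the schedule and Keras callback actually used on the project are not described here.

```python
import math


def triangular_lr(step, base_lr=1e-4, max_lr=1e-2, step_size=1000):
    """Triangular cyclic learning rate (illustrative values, not the project's).

    The rate climbs linearly from base_lr to max_lr over step_size steps,
    falls back to base_lr over the next step_size steps, then repeats.
    """
    cycle = math.floor(1 + step / (2 * step_size))
    # x is the normalized distance from the peak of the current cycle (0 at peak).
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Sample the schedule every 500 steps across three full cycles.
schedule = [triangular_lr(s) for s in range(0, 3 * 2 * 1000 + 1, 500)]
```

In a Keras training loop, a function like this could be wrapped in a learning rate callback so that the rate is updated per batch or per epoch; that wiring is omitted because it depends on the training setup.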
- Cleaned and manipulated clinical trial data sets for different drugs
- Wrote R functions for data piping and manipulation before the data was fed into the Bayesian models for meta-analysis
- Included several likelihood models such as Normal, Binomial, TTE, cLogLog, survival models, Poisson, etc.
- Parsed data using the docopt package.
- Implemented MCMC sampling using the JAGS sampler.
- Included WinBUGS code for data processing and model implementation
- Created several visualizations (density plots, forest plots, leverage plots, network plots, covariate adjustment plots, etc.) using packages such as ggplot2, ggmcmc, and animation
- Generated customized reports and presentations automatically from the tool for different models using R packages such as rmarkdown, animation, knitr, and ReporteRs
- Bundled everything into a package for Lilly internal use.
- Tested the tool through system testing and user acceptance testing in a regulated environment.
Environment: R, Matlab, HPC, JavaScript, Java, SQL, C++
- Created a Shiny dashboard app that can save different sessions and reactivate them for later use.
- Worked on an internally developed package “GLMCMP”
- Integrated jqPlot charts and graphs for jQuery within Shiny for draggable plots. Added the capability to add up to 500 single/reverse distributions using modules.
- Produced summary reports in different formats using custom plots.
Environment: R, Shiny, JavaScript, jQuery
Confidential, Mooresville, NC
Principal Data Scientist
- Performed Data Profiling to learn about user behavior
- Merged user data from multiple data sources
- Performed Exploratory Data Analysis using R and Hive on Hadoop HDFS
- Prototyped a machine learning algorithm for a POC (Proof of Concept)
- Performed data cleaning, feature scaling, and feature engineering
- Developed a novel approach to building a machine learning algorithm and implemented it in a production environment
- Performed ad-hoc data analysis for customer insights using Hive
- Developed performance metrics to evaluate the algorithm’s performance
- Used RMSE, F-score, precision, recall, and A/B testing to evaluate the recommender’s performance in both simulated and real-world environments.
- Fine-tuned the algorithm using a regularization term to overcome overfitting
Environment: Teradata, Oracle, Hadoop (HDFS), Pig, MySQL, RStudio, Python, Java, Mahout, Hive, Spark
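The offline metrics named in the bullets above (RMSE, precision, recall, F-score) can be sketched in a few lines of plain Python. The function names and the sample ratings and item IDs below are hypothetical and only illustrate the definitions; they are not the project's code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 for one user's recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: four recommended items, three truly relevant items.
error = rmse([3.0, 4.0, 5.0], [2.5, 4.0, 4.0])
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "c", "e"])
```

A/B testing is not covered by this sketch, since it compares live variants rather than scoring predictions against held-out data.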
Confidential, Northbrook, IL
- Data analysis and visualization (Python, R)
- Designed, implemented and automated modeling and analysis procedures on existing and experimentally created data
- Increased the pace and confidence of the learning algorithm by combining state-of-the-art technology and statistical methods; provided expertise and assistance in integrating advanced analytics into ongoing business processes
- Parsed data, producing concise conclusions from raw data in a clean, well-structured and easily maintainable format
- Implemented topic modelling, Passive Aggressive, and other linear classifier models
- Performed tf-idf weighting and normalization
- Performed scheduled and ad hoc data-driven statistical analysis, supporting existing processes
- Developed clustering models for customer segmentation using R
- Created dynamic linear models to perform trend analysis on customer transactional data in R
- Performed Topic modeling
Environment: R, SQL, Python, Tableau, SAP HANA, SAS, Java, PCA & LDA, regression, logistic regression, random forest, neural networks, Topic Modeling, NLTK, SVM (Support Vector Machine), JSON, XML, Hive, Hadoop, Pig, Mahout
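As an illustration of the tf-idf weighting and normalization step listed above, the following is a minimal sketch in plain Python. It assumes whitespace tokenization, a commonly used smoothed idf formula, and L2 normalization; the function name `tfidf_vectors` and the sample documents are hypothetical, and the original work may have used a library implementation with different settings.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one L2-normalized tf-idf weight dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumed variant): rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical documents used only to exercise the function.
docs = ["late delivery refund", "refund request", "great delivery service"]
vectors = tfidf_vectors(docs)
```

Vectors of this form are the usual input to linear text classifiers such as Passive Aggressive models, which is why weighting and normalization appear together in the list above.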