- Data Scientist with 7 years of experience in Data Acquisition, Data Screening, Statistical Modelling, Data Exploration and Data Visualization on large sets of Structured and Unstructured data, and in implementing Machine Learning algorithms to facilitate important business decisions.
- Experience working in domains including E-commerce, Real Estate, Telecom and Retail.
- Proficient in utilizing analytical applications like Python and R to identify trends and relationships between different data points to draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Experience in facilitating the entire Data Science project lifecycle and actively involved in Data Extraction, Data Cleaning, Feature Engineering, Dimensionality Reduction, Prototyping & Training (Data processing & Encoding), Model selection, Backtesting, Model tuning and Productionization.
- Adept in working with statistical tests: t-tests, One-way & Two-way ANOVA; Normality tests: Jarque-Bera, Shapiro-Wilk tests; Non-parametric tests: Chi-square, Wilcoxon signed-rank tests.
- Experience in working with Machine Learning algorithms including Regression models like Linear, Polynomial, Support Vector, Decision trees; Classification models including Logistic Regression, Support Vector Machines, K-Nearest Neighbor, Naïve Bayes, Decision trees; Ensemble learning methods like Random forests, Bagging, Boosting, Stacking; Clustering techniques like K-means, DBSCAN, Hierarchical clustering; Association Rule learning with Apriori, Eclat; Reinforcement learning with Upper Confidence Bound, Thompson Sampling.
- Extensive knowledge of Dimensionality reduction (PCA, LDA), Hyper-parameter tuning, Model regularization, Grid search techniques to optimize the cost function and model performance.
- Expertise in the Data Cleaning process: outlier detection and removal using Isolation Forest, Grubbs' test for univariate analysis, and Mahalanobis & Cook's distance for multivariate analysis; imputing null values using Multivariate Imputation by Chained Equations (MICE) in R and IterativeImputer in Python.
- Skilled in Big Data technologies like Spark, SparkSQL, PySpark, Hadoop Distributed File System, MapReduce & Kafka.
- Experience in Web Data mining with Python's Scrapy and BeautifulSoup packages along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
- Experience in working with text by implementing Recurrent Neural Networks using the Long Short-Term Memory (LSTM) architecture in a Many-to-One configuration for Sentiment Analysis.
- Good knowledge of Database Creation and maintenance of physical data models with Oracle, DB2 and SQL Server databases.
- Excellent exposure to Data Visualization with Tableau & PowerBI.
- Experience with Python libraries including NumPy, Pandas, SciPy, scikit-learn, Matplotlib, Seaborn, Dask, Theano, TensorFlow, NLTK and R libraries including ggplot2, dplyr, esquisse.
- Experience developing algorithms to create Artificial Neural networks to implement AI solutions to optimize business processes and minimize costs.
- Expertise in Computer Vision for image classification and face detection using Convolutional Neural Networks with the ResNet architecture.
- Utilized Excel Pivot Tables and VLOOKUP for data pre-processing, created ANOVA sheets and regressions, and performed hypothesis testing using the Data Analysis add-in in Excel.
Programming Languages: Python, R, Matlab
Database: MySQL, PostgreSQL, Oracle, MongoDB, Microsoft SQL Server
Hypothesis Testing: Independent & pairwise t-tests, one-way and two-way factorial ANOVA, Pearson's correlation
Regression Methods: Linear, Multiple, Polynomial, Decision trees and Support vector
Classification: Logistic, K-NN, Naïve Bayes, Decision trees and SVM
Clustering: K-means, DBSCAN, Hierarchical, Expectation maximization
Association Rule Learning: Apriori, Eclat
Reinforcement Learning: Upper Confidence Bound, Thompson Sampling
Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks with Long Short-Term Memory (LSTM), Deep Boltzmann machines
Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Autoencoders
Text Mining: Natural Language Processing
Ensemble Learning: Random forests, Bagging, Stacking, Gradient Boosting
Algorithms: Gradient Descent, Stochastic Gradient Descent, Gradient Optimization (Adam, Momentum, RMSProp), K-fold Cross-Validation, Monte Carlo simulations, Out-of-bag sample estimate
Data Visualization: Tableau, Microsoft PowerBI, ggplot2, Matplotlib, Seaborn and Bokeh
Data modeling: Entity relationship Diagrams (ERD), Snowflake Schema
Big Data: Apache Hadoop, HDFS, Kafka, MapReduce, Spark
Confidential, Jersey City, NJ
- Assisted in developing a Recommendation System with Machine learning algorithms such as K-Nearest Neighbor, Apriori and Deep Neural Networks.
- Incorporated Deep Boltzmann Machine and Autoencoder architectures to develop a highly accurate model. Coordinated with the end client to collect data and performed ETL to bring it into a uniform standard format.
- Implemented dimensionality reduction using a Deep Autoencoder collaborative filtering technique to increase recall by 6% and customer reach by 11%.
- Performed correlation analysis in the data exploration phase using graphical techniques in Matplotlib and Seaborn to produce insights about product sales by region and type of business.
- Explored the product sales data for cannibalization, when similar products were launched in the same category.
- Segmented the data using K-Means clustering and analyzed clients' behavior according to their demographic details, regions and monthly revenues in each cluster.
- Incorporated image embeddings like t-distributed stochastic neighbor embedding (t-SNE) obtained from deep convolutional networks for improved recommendation of items.
- Designed and implemented Cross-validation and statistical tests, including k-fold, stratified k-fold and hold-out schemes, to test and verify the models' significance.
- Developed a multivariate Gaussian anomaly detection algorithm in Python to identify suspicious patterns in network traffic. Employed Expectation-Maximization clustering using Gaussian Mixture Models (GMM).
- Developed dashboards in Tableau to visualize the suspicious patterns/activities in real time for business users.
- Used Python to perform text mining to find meaningful patterns in unstructured textual feedback. Created word clouds and word corpora that were used by senior management.
- Implemented topic modeling using Latent Dirichlet Allocation (LDA) to predict the product category of feedback and detect specific issues in each category by analyzing the feedback collected from customer service.
- Performed sentiment analysis on customer feedback after every release.
Environment: Python 3.6, PyTorch, Tableau
Confidential, Framingham, MA
- Gathered and reviewed business requirements and analyzed data sources.
- Involved in various text pre-processing phases such as Tokenization, Stemming and Lemmatization, and in converting the raw text to structured data.
- Performed Data Collection, Data Cleaning, Feature Engineering (Deep Feature Synthesis), Validation and Visualization, reported findings and developed strategic uses of data.
- Implemented sampling, Principal component analysis and t-SNE for visualizing high dimensional data.
- Worked with NLP to classify text with the data drawn from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set.
- One of the goals was to target users through automated classification. This assisted in creating cohorts to improve marketing.
- The NLP text analysis monitored, tracked and classified online user discussions about products and services.
- The gradient boosted decision trees classifier was trained to identify whether a cohort was a promoter or detractor.
- Constructed a new vocabulary to encode the variables in a machine-readable format using Bag of Words, TF-IDF, Word2vec and Average Word2vec.
- Implemented a Long Short-Term Memory (LSTM) network of moderate depth to capture the information in the sequence.
- Optimized the performance of the neural network by Pruning and choosing the right number of hidden layers and neurons per layer.
- Executed processes in parallel using TensorFlow's distributed environment across multiple devices (CPUs & GPUs).
- The overall project improved the marketing Return on Investment (ROI) by 15% and customer satisfaction by 20%.
Environment: Python, NLTK
- Screened for missing values by rows and columns and removed variables with missing values above the cutoff point.
- Used Mahalanobis distance, Cook's distance and leverage statistics, along with a chi-square cutoff on the numerical variables, to detect outliers.
- Checked for correlations in the data and examined the distributions of all numeric and categorical variables.
- Analyzed customer data for churn prediction using Logistic Regression, Support Vector Machines, Decision Trees and Random Forests and compared the results.
- Optimized the decision tree model using ensemble learning methods, mainly Bagging, Random Forests, Stacking and Extreme Gradient Boosting.
- Analyzed and grouped customers into different clusters based on purchase and historic data using techniques such as k-means clustering.
- Built a Logistic Regression Classifier to determine users' purchase intention and target potential buyers from historical data.
- The models were validated using Backtesting through K-fold cross validation; the learning rate was optimized through Hyper-parameter tuning and Grid search.
- The models were refined on the basis of values obtained from the ROC plot and CAP curve. Various metrics such as RMSE, MAE & Confusion matrix were used to evaluate the performance of the model.
- The final results were summarized using a dashboard in PowerBI and presented to the client.
Environment: Python, SQL, R, Microsoft PowerBI, Spark
- Led an initiative to build statistical models using historical data to predict real estate prices in several economic markets. Focused on analyzing the factors affecting the value of properties in Bangalore, India.
- Developed prediction algorithm using advanced data mining techniques to classify similar properties together and to develop sub-markets based on zip codes.
- Created database designs through data-mapping using ER diagrams and normalization up to the 3rd normal form and extracted relevant data whenever required using joins in PostgreSQL, Microsoft SQL Server and SQLite.
- Extracted terabytes of relevant data using HDFS & MapReduce from Hadoop.
- Conducted data preparation and outlier detection using Microsoft SQL Server; built the model using Python.
- Provided statistical insights using t-tests, ANOVA and chi-square tests, and performed Post-hoc analysis including Bonferroni correction and Tukey's HSD to assess differences across levels of categories, test significance of proportional differences & assess whether sample size is large enough to detect the differences.
- Provided statistical insights into 5% VaR, expected shortfall, semi-deviation & skewness-to-kurtosis ratio to guide investment decisions.
- Predicted house prices and area population income using regression methods in Excel and Octave (Matlab).
- Worked with Portfolio managers to arrive at an optimal solution to the problem. Increased the revenue of the firm by 5%.
- Created and presented executive dashboards and scorecards to show the trends in the data using Excel and VBA-Macros.
Environment: Microsoft Excel, Python, Microsoft SQL Server