- Data Scientist with more than 7 years of experience managing the project life cycle, involved in all phases including data extraction, data cleaning, statistical modelling and data visualization on large data sets.
- Proficient in statistical modelling and machine learning techniques: Linear and Logistic Regression, Decision Trees, Clustering (K-means, Hierarchical), K-Nearest Neighbours, Naive Bayes, Forecasting/Predictive Analytics, segmentation methodologies, regression-based models, hypothesis testing, PCA and ensembles.
- Good knowledge of Random Forest and SVM.
- Worked extensively on data analysis using R Studio, SQL, Tableau and other BI tools.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing, computational linguistics/natural language processing (NLP), data mining, ANOVA and chi-square tests, implemented in R and Python.
- Ability to use dimensionality reduction techniques and regularization techniques.
- Skilled in using R, Python and SAS for exploratory data analysis.
- Experienced in applying data mining to business problems: predictive modelling, finding patterns through data visualizations in R, Python and Tableau, and providing recommendations to the client.
- Worked with large, complex datasets that include structured, semi-structured and unstructured data to discover meaningful business insights.
- Strong analytical skills with the ability to collect, organize, analyze and disseminate significant amounts of information with attention to detail and accuracy, covering estimation, use case review, scenario preparation, test planning and strategy decision making, test execution, test results analysis, team management and test result reporting.
Languages: R, Python, SQL, Java
Analytical Tools: R Studio, Python, Tableau, MySQL, MS Excel and other Microsoft Office tools
Database: SQL Server, DB2, MS SQL, MS Access
Operating system: Windows 7/8/10, Linux
Statistical Techniques: Logistic Regression, Linear Regression, Multi-Linear Regression, Machine Learning, Decision Tree, Random Forest, Support Vector Machine (SVM), Principal Component Analysis (PCA), Clustering, Hierarchical Clustering, K-nearest neighbour (KNN), K-means Clustering, Neural Networks, ANN & RNN, Naive Bayes.
Statistical Methods: Descriptive and Inferential Statistics, Histograms, Pie Charts, Box-Plots, Data distributions - Standard normal, Exponential/Poisson, Binomial, Standard deviation and variance, Hypothesis testing, P-values test for significance, F-test, T-test, Chi-squared, ANOVA testing, A/B Testing.
Confidential, Denver, CO
- Involved in the entire data science project life cycle, including data extraction, data cleansing, transforming and preparing the data ahead of the analysis and data visualization phases.
- Used R to manipulate data, develop and validate quantitative models.
- Cleansed the data by eliminating duplicate and inaccurate data in R and Python.
- Identified missing values, used pattern recognition techniques to find the pattern of missingness, and handled the gaps with K-NN imputation.
- Scaled unstructured data and combined it with structured data to apply statistical methods.
- Created dummy variables for categorical variables.
- Sorted attributes into categorical and numerical variables.
- Visualized data with scatter plots, box plots and histograms from the ggplot2 package in R.
- Tried and implemented multiple models to evaluate predictions and performance.
- Built machine learning models such as Random Forest and logistic regression on correlated data sets, with emphasis on advanced algorithms like neural networks and SVM.
- Segmented data into training, testing and validation sets in a 0.5/0.25/0.25 ratio.
- Fine-tuned models to favor recall over accuracy, managing the trade-off between false positives and false negatives.
- Evaluated models using Recall, Precision, Cross Validation and ROC.
- The model reached a prediction accuracy of 76.7%.
- Applied Z-score standardization, the Laplace estimator and other techniques to the model to improve performance.
- After performance tuning, the final model reached an accuracy of 83.4%.
- Prepared the final documents with all recommendations and ensured delivery to the client before EOD.
Environment: Programming, Python, MS Excel, MS Access, SQL Server 2008, Microsoft Office, SQL Server Management Studio, PowerPoint.
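For illustration only, the 0.5/0.25/0.25 split, Z-score standardization and precision/recall evaluation mentioned in this role can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the project's actual code: the function names `split_rows`, `z_scores` and `precision_recall` are hypothetical, and the data below is made up.

```python
import random
from statistics import mean, pstdev


def split_rows(rows, ratios=(0.5, 0.25, 0.25), seed=0):
    """Shuffle rows and cut them into train, test and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation


def z_scores(values):
    """Standardize values to zero mean and unit (population) std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    train, test, validation = split_rows(range(100))
    print(len(train), len(test), len(validation))
    print(z_scores([2.0, 4.0, 6.0]))
    print(precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))
```

Tuning for recall, as described above, amounts to accepting more false positives (lower precision) in exchange for fewer false negatives.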
Confidential, Boston, MA
- Worked with the business analyst to understand the domain and the attributes.
- Held brainstorming sessions with the audit and trading teams to identify which data was appropriate for building an algorithm to classify obfuscated data.
- Transformed and prepared data ahead of the analysis and data-mining phase.
- Performed missing value treatment, outlier capping and anomaly treatment using statistical methods, and derived customized key metrics.
- Created dummy variables for certain categorical variables.
- Responsible for variable selection using forward stepwise regression, R-squared and VIF values.
- Tried and implemented multiple models to evaluate predictions and performance (Linear regression, CART, Random Forest).
- Tested multiple regression techniques and finalized robust regression based on the feasibility and accuracy of results.
- Robust regression is an alternative to least squares regression when data is contaminated with outliers or influential observations.
- Used robust regression to detect influential observations. MAPE was given priority among the evaluation statistics.
- Conducted validation of data models by different measures such as AUC, ROC, and confusion matrix.
- Formed groups based on past issues and assigned each group a specific set of issues.
- The model reached a prediction accuracy of 89%.
- Documented the model and forwarded recommendations to the company, specifying the target customer base for the policy to achieve maximum success.
Environment: R, Python, Analytics Dashboard, Web services, Text Mining, MS Office.
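As a rough illustration of two steps named in this role, outlier capping and MAPE, the following plain-Python sketch may help. It rests on assumptions not stated in the resume: capping is done at nearest-rank 5th/95th percentiles, MAPE skips rows whose actual value is zero, and the names `cap_outliers` and `mape` are hypothetical.

```python
def cap_outliers(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values to approximate lower/upper percentile bounds.

    Percentiles are taken by nearest rank on the sorted data, which is
    a simplification of the interpolated percentiles most libraries use.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(round(lower_pct * last))]
    high = ordered[int(round(upper_pct * last))]
    return [min(max(v, low), high) for v in values]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Rows with an actual value of zero are skipped (an assumption made
    here to avoid division by zero).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


if __name__ == "__main__":
    raw = list(range(1, 21)) + [1000]  # one extreme value
    capped = cap_outliers(raw)
    print(min(capped), max(capped))
    print(mape([100, 200], [110, 180]))
```

Capping keeps every row in the dataset while limiting the pull of extreme values, which is one reason it pairs naturally with an error measure like MAPE that is sensitive to individual large misses.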
- Collected data from the mill and prepared the data health report.
- Performed univariate and bivariate analysis to understand the variables.
- Visualized the data with box plots and scatter plots to understand its distribution.
- Cleansed, transformed and prepared data ahead of the analysis and data-mining phase.
- Treated missing values and outliers on new input datasets.
- Created dummy variables wherever there were categorical variables.
- Responsible for variable selection using Lasso Regression, Decision Tree, Random Forest, forward stepwise regression, R-squared and VIF values.
- Tried and implemented different prediction models such as Linear Regression, Decision Tree and Support Vector Machine, choosing the best model based on the trade-off between accuracy and interpretability.
- Designed accurate and scalable prediction algorithms.
- Responsible for training and evaluating predictive statistical models (Linear Regression, SVM and Decision Tree).
- Strong experience validating data models with measures such as RMSE, R-squared and adjusted R-squared.
- Received extremely positive feedback from global programming leads, managers and higher management for my analytical and technical skills, commitment and dedication.
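The validation measures listed for this role (RMSE, R-squared and adjusted R-squared) can be sketched as follows. This is an illustrative sketch with hypothetical function names and made-up numbers, using the standard textbook definitions rather than any code from the project.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted = [1.5, 2.0, 3.0, 4.0, 4.5]
    r2 = r_squared(actual, predicted)
    print(rmse(actual, predicted))
    print(r2)
    print(adjusted_r_squared(r2, n_samples=len(actual), n_features=1))
```

Adjusted R-squared is the more useful of the two when comparing models with different numbers of selected variables, since plain R-squared never decreases as predictors are added.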
Jr. Data Scientist
- Participated in all phases of research including requirement gathering, data cleaning, data mining, model development and visualization.
- Collaborated with data analysts and others to gain insight into and understanding of the data.
- Used R to manipulate and analyze data for the solution. R packages were used for text mining.
- Performed data mining, data analytics, data collection, and data cleaning.
- Developed, validated and visualized models, and performed gap analysis.
- Cleansed and transformed the data by treating outliers and imputing missing values.
- Used predictive analysis to create models of customer behaviour that correlate positively with historical data, and used these models to forecast future results.
- Translated business needs into mathematical models and algorithms, and built machine learning algorithms.
- Tried and implemented multiple models to evaluate predictions and performance.
- Applied machine learning algorithms and statistical models such as Logistic Regression, Decision Tree and SVM to identify volume, using R and the scikit-learn package in Python.
- Applied a boosting method to the predictive model to improve its efficiency.
- Improved efficiency and accuracy by evaluating the model in R.
- The model reached a prediction accuracy of 87.4%.
- Documented the visualizations and results and submitted to HR management.
- Presented Dashboards to Higher Management for more Insights using Power BI and Tableau.
Environment: R/R Studio, Python, Tableau, MS SQL Server 2005/2008, MS Access.