- Data Scientist with around 7 years of experience in the Healthcare, E-commerce, Automobile and Insurance domains. Skilled at Data Extraction, Data Screening, Data Cleaning, Data Exploration, Data Visualization and Statistical Modelling of varied structured and unstructured datasets, as well as implementing large-scale Machine Learning and Deep Learning algorithms to deliver insights and inferences that significantly impact business revenue and user experience.
- Experienced in facilitating the entire lifecycle of a data science project: Data Extraction, Data Pre-Processing, Feature Engineering, Algorithm Implementation & Selection, Back Testing and Validation.
- Expert at working with Statistical Tests, both Parametric (t-test, ANOVA) and Non-parametric (Chi-squared, Mann-Whitney U and Kruskal-Wallis tests).
- Skilled in using Python libraries NumPy, Pandas for performing Exploratory Data Analysis.
- Proficient in Data transformations using log, square-root, reciprocal, cube-root, square and Box-Cox transformations, chosen according to the dataset.
- Adept at handling Missing Data by exploring the underlying mechanism (MAR, MCAR, MNAR), analyzing correlations and similarities, introducing dummy variables and applying various Imputation methods.
- Experienced in Machine Learning techniques for Regression and Classification, including Linear Regression, Logistic Regression, Decision Trees and Support Vector Machines, using scikit-learn in Python.
- In-depth Knowledge of Dimensionality Reduction (PCA, LDA), Hyper-parameter tuning, Model Regularization (Ridge, Lasso, Elastic net) and Grid Search techniques to optimize model performance.
- Skilled at Python, SQL, R and Object Oriented Programming (OOP) concepts such as Inheritance, Polymorphism, Abstraction, Encapsulation.
- Working knowledge of Database Creation and maintenance of Physical data models with Oracle, DB2 and SQL Server databases, as well as normalizing databases up to third normal form (3NF) using SQL functions.
- Experience in Web Data Mining with Python’s Scrapy and BeautifulSoup packages, along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
- Proficient in Natural Language Processing (NLP) concepts like Tokenization, Stemming, Lemmatization, Stop Words and Phrase Matching, and libraries like spaCy and NLTK.
- Skilled in Big Data Technologies like Apache Spark (PySpark, Spark Streaming, MLlib) and the Hadoop Ecosystem (MapReduce, HDFS, Hive, Kafka, Ambari).
- Proficient in Ensemble Learning using Bagging, Boosting (AdaBoost, XGBoost) and Random Forests, and in clustering techniques like K-means.
- Experienced in developing Supervised Deep Learning algorithms, including Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, LSTM and GRU, and Unsupervised Deep Learning techniques like Self-Organizing Maps (SOMs), in Keras and TensorFlow.
- Built and deployed an LSTM recurrent neural network architecture in one project to improve model accuracy; also knowledgeable in other Deep Learning approaches such as traditional Artificial Neural Networks and Convolutional Neural Networks.
- Skilled at Data Visualization with Tableau, PowerBI, Seaborn, Matplotlib, ggplot2, Bokeh and interactive graphs using Plotly & Cufflinks.
- Knowledge of Cloud services like Amazon Web Services (AWS) and Microsoft Azure for building, training and deploying scalable models.
- Proficient in using PostgreSQL, Microsoft SQL Server and MySQL to extract data using multiple types of SQL queries, including Create, Join, Select, Conditionals, Drop, Case, etc.
Languages and Platforms: Python (NumPy, Pandas, Scikit-learn, TensorFlow, etc.), Spyder, JupyterLab, RStudio (ggplot2, dplyr, lattice, highcharter, etc.), SQL, SAS.
Analytical Techniques: Regression Methods: Linear, Polynomial, Decision Trees; Classification: Logistic Regression, K-NN, Naïve Bayes, Support Vector Machines (SVM); Ensemble Learning: Random Forests, Gradient Boosting, Bagging; Clustering: K-means clustering, Hierarchical clustering; Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks; Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA); Time-series Forecasting: Using Python Statsmodels and R for time-series modelling.
Database: SQL, PostgreSQL, MongoDB, Microsoft SQL Server, NoSQL, Oracle
Statistical Tests: Hypothesis Testing, ANOVA, z-test, t-test, Chi-Squared Goodness-of-Fit test
Validation Techniques: Monte Carlo simulations, k-fold cross validation, A/B Testing
Optimization Techniques: Gradient Descent, Stochastic Gradient Descent, Gradient Optimization - Momentum, RMSProp, Adam
Big Data: Apache Hadoop, HDFS, MapReduce, Apache Spark, HiveQL, Pig, Kafka
Data Visualization: Tableau, Microsoft PowerBI, ggplot2, Matplotlib, Seaborn, Bokeh, Plotly
Data Modeling: Entity Relationship Diagrams (ERD), Snowflake Schema, SPSS Modeler
Operating Systems: Microsoft Windows, iOS, Linux Ubuntu
Database Systems: SQL Server, Oracle, MySQL, Teradata Processing System, NoSQL (MongoDB, HBase, Cassandra), AWS (DynamoDB, ElastiCache)
- Worked closely with Business Analysts to understand business requirements and derive solutions accordingly.
- Worked with Data Engineers and Data Analysts in a cross-functional team to deploy models and deliver projects.
- Extracted customer review data from HTML and XML files by web scraping with Beautiful Soup, and pre-processed raw data from the company’s data warehouse.
- Performed Data collection, Data cleaning, Feature scaling, Feature engineering, Validation, Visualization and Data Resampling, reported findings and developed strategic uses of data with Python libraries like NumPy, Pandas, SciPy, Scikit-Learn and TensorFlow.
- Carried out various pre-processing phases on text data, such as Tokenization, Stemming and Lemmatization, and converted the raw text data to structured data.
- Performed collaborative filtering to generate item recommendations. Rank-based and content-based recommendations were used to address the problem of cold start.
- Performed data post-processing using NLP techniques like TF-IDF, Word2Vec and Bag-of-Words (BoW) to identify the most pertinent Product Subject Headings terms that describe items.
- Performed Data Visualization in RStudio using ggplot2, lattice, highcharter, Leaflet, Plotly & Cufflinks, sunburstR and RGL to build interactive and exploratory plots.
- Implemented various statistical techniques to prepare the data, such as missing data imputation, and used Principal Component Analysis and t-SNE for dimensionality reduction.
- Applied Naïve Bayes, K-NN, Logistic Regression, Random Forest, SVM and K-means to categorize customers into groups.
- Performed Linear Regression on the customer groups derived from K-NN and K-means.
- Employed statistical methodologies such as A/B testing, experiment design and hypothesis testing.
- Used AWS SageMaker to train the model using protobuf and to deploy it, choosing SageMaker over Beanstalk for its relative simplicity and computational efficiency.
- Employed evaluation techniques and metrics such as Cross-Validation, Confusion Matrix, ROC and AUC to assess the performance of each model.
Environment: Python 3.6 (NumPy, Pandas, Matplotlib), PySpark 2.4.1, AWS, SQL Server, RStudio
Confidential, Norfolk, VA
- Performed data collection, data cleaning, data profiling, data visualization and report creation.
- Extracted the required medical data from Azure Data Lake Storage into PySpark DataFrames for further exploration and visualization, to find insights and build a prediction model.
- Cleaned the medical dataset, which had missing data and extreme outliers, in PySpark DataFrames and explored the data to identify relationships and correlations between variables.
- Implemented data pre-processing using Scikit-Learn, including imputation of missing values, scaling, logarithmic transformation and one-hot encoding.
- Analyzed the applicants’ medical data to find relationships, which were then plotted using Alteryx and Tableau to better understand the available data.
- Utilized Python's data visualization libraries like Matplotlib and Seaborn to communicate findings to the data science, marketing and engineering teams.
- Performed univariate, bivariate and multivariate analysis on BMI, age and employment to check how the features related to each other and to the risk factor.
- Trained several machine learning models, such as Logistic Regression, Random Forest and Support Vector Machines (SVM), on selected features to predict customer churn.
- Improved model accuracy by 5% by introducing Ensemble techniques: Bagging, Gradient Boosting, Extreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost).
- Applied Statistical methods such as data-driven Hypothesis Testing and A/B Testing to draw inferences, determining significance levels and deriving p-values to evaluate the impact of various risk factors.
- Furthered Hypothesis Testing by evaluating Type I and Type II errors to eliminate skewed inferences.
- Implemented and tested the model on AWS EC2 and collaborated with the development team to select the best algorithms and parameters.
- Designed data-visualization dashboards in Tableau and generated complex reports, including summaries and graphs, to interpret the findings for the team.
Environment: Python (NumPy, Pandas, Matplotlib, Scikit-learn), AWS, Jupyter Notebook, HDFS, Hadoop MapReduce, PySpark, Tableau, SQL
Junior Data Scientist
- Performed exploratory data analysis and data cleaning along with data visualization using Seaborn and Matplotlib.
- Performed feature engineering to create new features and identify transactions associated with completed offers.
- Visualized the data using Matplotlib and Seaborn across features like age, income, membership duration, etc.
- Built machine learning models using customer models and transaction data.
- Built a classification model using Logistic Regression and a Decision Tree Classifier to identify customers for promotional deals and increase likelihood of purchase.
- Tested the performance of classifiers like Logistic Regression, Naïve Bayes, Decision Trees and Support Vector Classifiers.
- Employed ensemble learning techniques like Random Forests, AdaBoost and Gradient Boosting to improve the model by 15%.
- Selected the final model using ROC and AUC, and fine-tuned the hyperparameters of the above models using Grid Search to find the optimum values.
- Used k-fold cross-validation to test and verify model accuracy.
- Prepared a dashboard in PowerBI to summarize the model and its performance measures.
Environment: Python (NumPy, Pandas, Matplotlib, Scikit-learn), Tidyverse, R, MySQL, pgAdmin
- Drew statistical inferences using t-tests, ANOVA and chi-squared tests, and performed post-hoc analysis using Tukey’s HSD and Bonferroni correction to assess differences across levels of raw material categories, test the significance of proportional differences and assess whether the sample size was large enough to detect the differences.
- Provided statistical insights into semi-deviation & skewness-to-kurtosis ratio to guide vendor decisions and inferences into optimum pricing for order quantities.
- Performed Data Analysis on target data after transfer to Data Warehouse.
- Developed interactive executive dashboards using PowerBI and Excel VBA to provide a reporting tool for organizational metrics and data.
- Worked with Data Architects on Dimensional Models using both Star and Snowflake Schemas.
- Created an ETL solution using Informatica to read Product and Order data from files on a shared network into a SQL Server database.
- Created Database designs through data mapping using ER diagrams and normalization up to third normal form, and extracted relevant data as required using joins in PostgreSQL and Microsoft SQL Server.
- Conducted data preparation and outlier detection using Python.
- Worked with the Team Manager to develop a system for classifying auditions and identifying the vendors best suited to the company in the long run.
- Presented executive dashboards and scorecards to visualize and present trends in the data using Excel and VBA-Macros.
Environment: Microsoft Office, Microsoft PowerBI, Oracle SQL, Informatica, SPSS.