Data Scientist Resume
Torrance, CA
SUMMARY
- 6+ years of professional experience in the Manufacturing, Logistics and Healthcare domains, performing Statistical Modeling, Data Extraction, Data Screening and Cleaning, Data Exploration, Feature Selection, Feature Engineering, Data Visualization, Linear and Logistic Modeling, Dimensionality Reduction and A/B Testing, and implementing Machine Learning algorithms at large scale to deliver insights and inferences aimed at boosting revenue and enriching customer experience.
- Performed Univariate Analysis: analyzed Descriptive Statistics (Mean, Median, Mode, Range, Standard Deviation, Variance), treated Missing Data, detected and treated Outliers, checked Normality with Skewness and Kurtosis, and presented the results with Histograms, Box Plots, etc.
- Performed Bivariate Analysis using Inferential Statistical tests such as the Z-test, T-test, Chi-Square and ANOVA to check for Multicollinearity and Singularity, and presented the results using scatter plots, bar charts, line charts, etc.
- Experienced in solving Classification and Regression problems using algorithms such as Linear Regression, Logistic Regression, Decision Trees, Random Forests, Support Vector Machines and Naïve Bayes, and in optimizing performance with Ensemble Learning techniques such as Bagging, Boosting and Stacking. Performed Clustering using K-means and DBSCAN.
- Experienced in handling petabytes of data using Apache Spark (PySpark). Hands-on experience with Spark APIs such as Spark SQL, Spark Streaming, Spark MLlib, Spark ML and GraphX, and with Spark data structures such as DataFrames, RDDs and Datasets.
- Worked extensively with Spark SQL DataFrames: performed basic DataFrame operations, created UDFs, and worked on Caching, Persisting and Repartitioning DataFrames (a brief PySpark sketch follows this section).
- Hands-on experience with Dimensionality Reduction techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Model Regularization (Ridge, Lasso, Elastic Net), Hyper-parameter tuning (Grid Search).
- Experience in Text Mining and Natural Language Processing techniques such as Topic Modeling, Tokenizing, Stemming and Lemmatizing, Part-of-Speech (POS) Tagging and Sentiment Analysis. Used the Natural Language Toolkit (NLTK) and spaCy while building an Aspect-Based Sentiment Analysis model.
- Proficient in evaluating and analyzing model performance with Cross Validation, ROC-AUC, Confusion Matrix, Precision, Recall, Log-Loss, MSPE, MAE, R-Squared and F1-Score, and in creating performance reports using tables and visualizations.
- Expertise in solving problems such as Underfitting, Overfitting, poor-quality data, irrelevant features, insufficient data and non-representative data through Data Mining, Predictive Modeling, Data Cleaning, Data Screening, Feature Selection, Feature Engineering, Model Building, Evaluation, and Hyper-parameter Optimization with Grid Search and Random Search.
- Experience with Python libraries including NumPy, Pandas, SciPy, Scikit-learn, spaCy, Plotly, Matplotlib, Seaborn, Theano, TensorFlow, Keras and NLTK, and R modules such as ggplot2.
- Expertise in using big data frameworks such as Hadoop, Spark, PySpark, Spark SQL, Hive, Pig, Apache Kafka, Sqoop, Oozie, Apache Atlas, Flume, Storm, YARN, and NoSQL databases such as Cassandra and MongoDB.
- Proficient in using Amazon Web Services such as EC2, EMR, SageMaker, S3, Redshift, Kinesis, Lambda, Glue and Athena, and experienced with Microsoft Azure and Google Cloud Platform (GCP).
- Expert at Web Scraping using Python libraries such as Scrapy, BeautifulSoup, urllib and Regular Expressions (a short scraping sketch also follows this section).
- Knowledge of CI/CD using Jenkins, Ansible, Kubernetes, Mesos and OpenStack for model deployment.
- Experience in Data Gathering with tools such as Informatica, Pentaho, AWS Glue and Sqoop; transformed data using Hive and Spark and moved it to stores such as Redshift, S3, HDFS, MongoDB and Cassandra.
- Good hands-on experience with Master Data Management, Metadata Management and Data Quality checks, ensuring compliance with regulatory standards such as GDPR.
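A minimal PySpark sketch of the DataFrame work described above (a UDF, repartitioning and caching); the column names and sample rows are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("df-ops-sketch").getOrCreate()

# Hypothetical customer rows; in practice these would be read from S3/HDFS.
df = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 80.5), (3, "carol", 210.3)],
    ["id", "name", "monthly_spend"],
)

# A simple UDF that buckets customers by spend.
@udf(returnType=StringType())
def spend_tier(spend):
    return "high" if spend >= 100 else "low"

df = df.withColumn("tier", spend_tier(df.monthly_spend))

# Repartition before a wide aggregation, then cache because the result
# is reused by several downstream queries.
df = df.repartition(8, "tier").cache()
df.groupBy("tier").count().show()
```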
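And a short web-scraping sketch with requests, BeautifulSoup and a regular expression; the URL and page markup are placeholders:

```python
import re

import requests
from bs4 import BeautifulSoup

# Placeholder URL; any page with product listings would do.
resp = requests.get("https://example.com/products", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Hypothetical markup: each product sits in a <div class="product">.
for item in soup.select("div.product"):
    name_tag = item.select_one("h2")
    name = name_tag.get_text(strip=True) if name_tag else "unknown"
    # Pull the first price-looking token with a regular expression.
    price = re.search(r"\$\d+(?:\.\d{2})?", item.get_text())
    print(name, price.group() if price else "n/a")
```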
TECHNICAL SKILLS
Regression Methods: Linear, Multiple, Polynomial, Decision Trees and Support Vector Regression;
Classification: Logistic Regression, K-NN, Naïve Bayes, Decision trees and SVM;
Clustering: K-means, DBSCAN, Hierarchical, Expectation Maximization;
Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks with Long Short-Term Memory (LSTM), Mask R-CNN;
Dimensionality Reduction: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA);
Text Analytics/Natural Language Processing: Stemming, NLTK, spaCy, TF-IDF, Word2Vec, Doc2Vec, Topic Modeling, Aspect-Based Sentiment Analysis;
Ensemble Learning: Random forests, Bagging, Stacking, Gradient Boosting.
Scripting Languages: Python (NumPy, Pandas, Scikit-learn, TensorFlow, Keras, PyTorch, OpenCV, re, pickle, Seaborn, Matplotlib), R, SQL.
ETL: Hadoop (Sqoop/Hive), Spark, Informatica, Pentaho, AWS (Glue).
Web Scraping: BeautifulSoup, Scrapy, Selenium, urllib, requests, Regular Expressions.
Deployment: Kubernetes, Docker, Ansible, Jenkins, SageMaker, OpenStack, OpenShift, Flask.
Databases: MongoDB, Cassandra, MySQL, PostgreSQL, Oracle, Microsoft SQL Server, Amazon DynamoDB, Redshift.
Validation Techniques: k-Fold Cross Validation, Out-of-Bag Estimates, A/B Tests, Monte Carlo Simulations.
Optimization Techniques: Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent.
Data Visualization: Tableau, Microsoft Power BI, ggplot2, Plotly, Matplotlib, Seaborn.
Big Data Tools: Apache Hadoop, Hive, Spark, Sqoop, Oozie, Pig, Kafka, Flume, Atlas, HDFS, YARN, Zeppelin Notebook.
Cloud Services & VCS: AWS (EC2, EMR, SageMaker, S3, Glue, Redshift), Microsoft Azure, Git, GitHub, Bitbucket.
PROFESSIONAL EXPERIENCE
Confidential, Torrance, CA
Data Scientist
Responsibilities:
- Focused on Natural Language Processing and used NLP methods for information extraction, topic modeling and parsing to explore trends in customer retention data.
- Worked with text feature engineering techniques such as n-grams, TF-IDF and Word2Vec.
- Performed SMOTE (Synthetic Minority Over-Sampling Technique) to create synthetic samples of the minority class (churned customers) and evaluated classification performance using the ROC (Receiver Operating Characteristic) AUC (Area Under Curve).
- Applied various Classification models such as Naïve Bayes, Logistic Regression, Random Forests, Support Vector Classifiers and Stochastic Gradient Descent from scikit-learn, and RNN/LSTM models from Keras, and improved performance using Ensemble Learning methods such as Random Forests, XGBoost and Gradient Boosting (a brief pipeline sketch follows this list).
- Addressed Overfitting and Underfitting by using K-fold Cross Validation.
- Generated Confusion Matrices and Classification Reports to evaluate the accuracy and performance of the different models, using metrics such as Precision, Recall, F1-Score and AUC-ROC, and used Cross Validation to test the models on different batches of data.
- Tuned hyper-parameters both manually and with Grid Search to improve model performance.
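A minimal sketch of the kind of pipeline described above: TF-IDF features, SMOTE inside an imbalanced-learn Pipeline so oversampling happens only on training folds, and Grid Search scored by ROC AUC. The toy texts and labels are placeholders:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder comments with churn labels (1 = churned).
texts = ["great service", "very happy", "love the product", "works well",
         "slow support", "billing issue", "keeps crashing", "cancelled plan"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# imblearn's Pipeline applies SMOTE during fit only, never at predict time.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("smote", SMOTE(k_neighbors=1)),  # tiny toy data forces k_neighbors=1
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid Search over regularization strength, scored by ROC AUC.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=2)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```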
Confidential, Dearborn, MI
Data Scientist and Data Engineer
Responsibilities:
- Performed Variable Identification and checked for the percentage of Missing Values, Data Types, Outliers, etc.
- Performed Univariate Analysis: analyzed Descriptive Statistics (Mean, Median, Mode, Range, Standard Deviation, Variance), checked for Missing Data, detected Outliers, checked Normality with Skewness and Kurtosis, and presented the results with Histograms, Box Plots, etc.
- Performed Bivariate Analysis using Correlation and Inferential Statistical tests such as the Z-test, T-test, Chi-Square and ANOVA to check for Multicollinearity and Singularity, and presented the results using scatter plots, bar charts, line charts, etc.
- Performed Outlier Detection and Treatment in Python using techniques such as Median Absolute Deviation (MAD), Minimum Covariance Determinant, Histograms and Box Plots (a brief MAD sketch follows this list).
- Performed Feature Selection using the Python scikit-learn library, applying Filter Methods (Z-test, t-test, ANOVA, F-test), Wrapper Methods (Step Forward Selection, Step Backward Selection, Exhaustive Selection) and Embedded Methods (Random Forests, LASSO, Ridge Regression).
- Performed Feature Engineering such as Missing Value Imputation, Normalization and Scaling, Outlier Detection and Treatment, One-Hot Encoding and Feature Splitting, and used Label Encoder to convert categorical variables to numerical values with the Python scikit-learn library.
- Performed Exploratory Data Analysis (EDA) to understand and discover patterns in the data, visualizing through various plots and graphs using the Matplotlib, NumPy, Pandas, Scikit-learn and Seaborn libraries. Calculated Pearson Correlation Coefficients to deal with Multicollinearity.
- Performed SMOTE (Synthetic Minority Over-Sampling Technique) to create synthetic samples of the minority class (churned customers) and evaluated classification performance using the ROC (Receiver Operating Characteristic) AUC (Area Under Curve).
- Applied various Classification models such as Naïve Bayes, Logistic Regression, Random Forests and Support Vector Classifiers from scikit-learn, and improved performance using Ensemble Learning methods such as Random Forests, XGBoost and Gradient Boosting.
- Addressed Overfitting and Underfitting by using K-fold Cross Validation.
- Performed Principal Component Analysis (PCA) together with Linear Discriminant Analysis (LDA), which yielded better Dimensionality Reduction and Classification.
- Developed a hybrid model to overcome the shortcomings of PCA and LDA used individually (a brief pipeline sketch follows this list).
- Applied K-means clustering to look for churn patterns among customers based on various features.
- Generated Confusion Matrices and Classification Reports to evaluate the accuracy and performance of the different models, using metrics such as Precision, Recall, F1-Score and AUC-ROC, and used Cross Validation to test the models on different batches of data.
- Tuned hyper-parameters both manually and with Grid Search to improve performance, and deployed the model as a Python Flask REST API (a brief deployment sketch follows this list).
- Used Spark (PySpark) on an Amazon EMR cluster to train the model: created a 3-node Spark cluster, applied various Transformations and Actions, and used Spark APIs such as Spark SQL to create Spark DataFrames and spark.ml/spark.mllib to build machine learning models.
- Worked on Caching, Persisting and Repartitioning DataFrames.
- Performed data visualization with Tableau dashboards and Tableau data stories using Line and Scatter Plots, Bar Charts, Histograms, Pie Charts and Box Plots.
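A minimal sketch of the MAD-based outlier detection mentioned above; the sample data is made up, and the 3.5 cutoff on the modified z-score is the conventional choice:

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """Flag points whose modified z-score, based on the Median Absolute
    Deviation, exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # 0.6745 scales the MAD to be comparable with the standard
    # deviation under normality.
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold

print(mad_outliers([10.1, 9.8, 10.4, 10.0, 9.9, 42.0]))  # flags only 42.0
```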
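One plausible reading of the PCA + LDA hybrid is to chain both reducers in front of a classifier, letting PCA strip noisy low-variance directions before LDA finds the class-separating ones; the wine dataset is a stand-in for the real features:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in for the churn features

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),                        # denoise first
    LinearDiscriminantAnalysis(n_components=2),  # then separate classes
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```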
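And a minimal Flask REST sketch for serving the trained model; the pickle file name and the feature layout are placeholders:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder path: a scikit-learn model serialized with pickle.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[0.1, 3.2, ...]]}.
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```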
Confidential, Downers Grove, IL
Data Scientist
Responsibilities:
- Defined the appropriate churn scope and the target metric for the project, and created feature requirements by working with a team of data engineers and business analysts.
- Gathered data from multiple sources such as ERP, CRM, social media and web logs through Amazon Kinesis.
- Performed Variable Identification and checked for the percentage of Missing Values, Data Types, Outliers, etc.
- Performed Univariate Analysis: analyzed Descriptive Statistics (Mean, Median, Mode, Range, Standard Deviation, Variance), checked for Missing Data, detected Outliers, checked Normality with Skewness and Kurtosis, and presented the results with Histograms, Box Plots, etc.
- Performed Bivariate Analysis using Correlation and Inferential Statistical tests such as the Z-test, T-test, Chi-Square and ANOVA to check for Multicollinearity and Singularity, and presented the results using scatter plots, bar charts, line charts, etc.
- Performed Outlier Detection and Treatment in Python using techniques such as Median Absolute Deviation (MAD), Minimum Covariance Determinant, Histograms and Box Plots.
- Performed Feature Selection using the Python scikit-learn library, applying Filter Methods (Z-test, t-test, ANOVA, F-test), Wrapper Methods (Step Forward Selection, Step Backward Selection, Exhaustive Selection) and Embedded Methods (Random Forests, LASSO, Ridge Regression).
- Performed Feature Engineering such as Missing Value Imputation, Normalization and Scaling, Outlier Detection and Treatment, One-Hot Encoding and Feature Splitting, and used Label Encoder to convert categorical variables to numerical values with the Python scikit-learn library.
- Performed Exploratory Data Analysis (EDA) to understand and discover patterns in the data, visualizing through various plots and graphs using the Matplotlib, NumPy, Pandas, Scikit-learn and Seaborn libraries. Calculated Pearson Correlation Coefficients to deal with Multicollinearity.
- Applied various Classification models such as Naïve Bayes, Logistic Regression, Random Forests and Support Vector Classifiers from scikit-learn, and improved performance using Ensemble Learning methods such as Random Forests, XGBoost and Gradient Boosting.
- Generated Confusion Matrices and Classification Reports to evaluate the accuracy and performance of the different models, using metrics such as Precision, Recall, F1-Score and AUC-ROC, and used Cross Validation to test the models on different batches of data (a brief evaluation sketch follows this list).
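A minimal evaluation sketch tying these metrics together; the synthetic, imbalanced dataset stands in for the real churn data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the churn dataset.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))  # precision, recall, F1 per class
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
# Cross validation on the full data gives a more stable estimate.
print("CV AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```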
