Data Scientist Resume
SUMMARY
- Around 5 years of experience in Data Science, focused on working with data using Python, R, and SQL
- Experienced Data Scientist with expertise in data mining, data cleaning, exploring and visualizing data, building and evaluating statistical models, preparing dashboards, and implementing the best-suited machine learning models to support sound business decisions
- Hands-on experience using Python 3.x for data analytics and visualization with core analytical libraries such as NumPy, SciPy, Pandas, and Scikit-learn
- Well-versed in data wrangling, loading data into SQL Server, and writing complex queries efficiently
- Strong Knowledge in Statistical methodologies such as Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), and Time Series Analysis
- Strong ability to apply optimization techniques such as Gradient Descent and Stochastic Gradient Descent to regression models
- Profound knowledge of various Supervised and Unsupervised machine learning algorithms such as Ensemble Methods, Clustering algorithms, Classification algorithms and Time Series models (AR, MA, ARMA, ARIMA)
- Exposure to text mining using NLP along with data cleaning using Natural Language Toolkit (NLTK)
- Expertise in data transformation techniques such as log and root transforms, skewness correction, normalization, and aggregation of data
- In-depth Knowledge of Dimensionality Reduction, Hyper-parameter tuning, Model Regularization (Ridge, Lasso) and Grid Search techniques to optimize model performance
- Proficient in creating reports and dashboards in Tableau using advanced features such as animations and interactive charts
- Expertise in developing analytical approaches to meet business requirements
- Dell EMC Data Science Associate (EMCDSA) certified
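The model regularization and Grid Search techniques mentioned above can be sketched as follows. This is an illustrative example on synthetic data, not taken from any specific project; the alpha grid is an assumption chosen for demonstration.

```python
# Illustrative sketch: tuning the Ridge regularization strength (alpha)
# with GridSearchCV on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic dataset stands in for real project data
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Search over a hypothetical grid of regularization strengths
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Swapping `Ridge` for `Lasso` applies the same tuning pattern to L1 regularization.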
TECHNICAL SKILLS
Machine Learning: Classification, Regression, Feature Engineering, One-Hot Encoding, Clustering, Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, KNN, Ensemble Methods, K-Means Clustering, Time Series Analysis, Confidence Intervals, Principal Component Analysis, Dimensionality Reduction
Statistical Analysis: Hypothesis testing, A/B testing, ANOVA, MANOVA, normal distribution, mean/median/mode, standard deviation, regression, F-test
Exploratory Data Analysis: Univariate/multivariate analysis, outlier detection, data cleansing, missing-value imputation, normalization, linearity checks, histograms/density estimation, data visualization using Python libraries such as Yellowbrick and Seaborn
Programming Languages: Python (Pandas, Seaborn, NumPy, Scikit-learn), R (dplyr, ggplot2), SQL (complex queries using window functions and aggregation)
Big Data: Hadoop Ecosystem - Hive, Pig, MapReduce, PySpark
Databases: MS SQL Server, Oracle, PostgreSQL, HDFS, HBase
Data Visualization Tools: Tableau 9.x/10.x, Seaborn/Matplotlib, Plotly, SSRS, Shiny, ggplot2
Cloud Data Systems: AWS (Redshift, S3)
IDEs: JupyterLab, RStudio, Eclipse, Spyder, Atom, Notepad++, PyCharm
Version Control Tools: Git, GitHub
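The one-hot encoding listed under Machine Learning can be illustrated with a minimal Pandas example; the `plan`/`tenure` columns are hypothetical, made up for demonstration.

```python
# Toy example of one-hot encoding a categorical column with pandas
import pandas as pd

# Hypothetical customer table with one categorical feature
df = pd.DataFrame({"plan": ["basic", "pro", "basic"], "tenure": [3, 12, 7]})

# Expand the categorical "plan" column into binary indicator columns
encoded = pd.get_dummies(df, columns=["plan"])
print(encoded.columns.tolist())
```

Scikit-learn's `OneHotEncoder` offers the same transformation as a reusable pipeline step.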
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Cleaned and manipulated complex datasets to create the data foundation for further analysis and the development of key insights (MS SQL server, R, Tableau, Excel)
- Applied machine learning algorithms and advanced statistical analyses such as decision trees, regression models, SVM, and clustering using the scikit-learn package in Python
- Performed data pre-processing and cleaning for feature engineering, and applied imputation techniques for missing values in the dataset using Python
- Performed exploratory data analysis (EDA) using Matplotlib and Seaborn to visualize and discover patterns in the data; examined feature correlations using heatmaps and performed hypothesis testing to check feature significance
- Developed analytical approaches to answer high-level questions and provided insightful recommendations
- Conducted statistical analyses such as linear regression, ANOVA, and classification modeling on the data
- Extracted customers' big data from various sources (Excel, flat files, Oracle, SQL Server, MongoDB, Teradata, and server log data) into Hadoop HDFS
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, and Scikit-learn in Python to develop machine learning models
- Assured data quality and data integrity, and optimized data collection procedures on a weekly and monthly basis
- Created data quality scripts using SQL and Hive to validate successful data loads
- Worked with data formats such as JSON, XML, CSV, and .dat, and exported data into visualization/ETL platforms
- Evaluated model performance using metrics such as R-squared, adjusted R-squared, confusion matrix, ROC curve, AUC, and root mean squared error (RMSE)
- Developed and applied metrics and prototypes used to drive business decisions
- Participated in ongoing research, and evaluation of new technologies and analytical solutions to optimize the model performance
- Applied problem-solving skills to find and correct data problems, and used statistical methods to adjust and project results when necessary
- Worked with cross-functional teams to understand data requirements and delivered detailed analytical reports to support business decisions
Environment: Model: Logistic Regression, Decision Trees, Random Forest, SVM
Evaluation Metrics: Confusion Matrix, Recall, Precision, ROC, Cross Validation
Technologies: Python, MySQL, Machine Learning Libraries - NumPy, Pandas, scikit-learn, statsmodels
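The evaluation workflow listed in the Environment block above (logistic regression scored with a confusion matrix and ROC/AUC) can be sketched like this; the dataset and split parameters are synthetic placeholders.

```python
# Hypothetical sketch: logistic regression evaluated with a confusion
# matrix and ROC AUC, mirroring the metrics listed above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data in place of real project data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print(confusion_matrix(y_te, pred))                              # 2x2 counts
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))      # AUC in [0, 1]
```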
Confidential
Data Scientist
Responsibilities:
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation, and visualization
- Built the data pipeline: collected required data from different sources and converted it into structured form using SQL
- Worked with large data volumes: more than 10M customer records with 20+ features
- Performed exploratory data analysis using NumPy and Pandas to identify correlations between variables, multicollinearity, and hidden patterns, trends, and seasonality
- Analyzed customer churn distribution over the different attributes such as tenure, different services a customer has signed up for and demographic data using visualization libraries matplotlib and seaborn in python
- Performed PCA, backward feature selection, and correlation analysis for dimensionality reduction and improved model accuracy
- Applied SMOTE to address the imbalanced class distribution in the training data
- Implemented classification models such as Logistic Regression, Decision Trees, Random Forest, KNN, XGBoost, and SVM, and selected the best-performing model for prediction
- Performed K-Fold cross-validation to test models with different batches to optimize the model and prevent overfitting
- Performed feature engineering such as feature generation, PCA, feature normalization, and label encoding with Scikit-learn preprocessing
- Achieved 90% monthly customer retention by predicting the likelihood of returning customers using a logistic regression model in R
- Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, and Power BI
- Improved the R- and SQL-based data cleansing and mining process, reducing processing time by 50%
- Analyzed patterns in customer needs across locations, categories, and months using time series modeling techniques
Environment: Python 3.x, (Scikit-Learn/Scipy/Numpy/Pandas/Matplotlib/Seaborn), Tableau, Machine Learning algorithms (KNN, Decision Tree, Random Forest, Logistic Regression, Support Vector Machines, XGBoost), Microsoft Excel, MySQL
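The K-fold cross-validation described above can be sketched as follows; the data is synthetic and deliberately class-imbalanced (as in the churn setting), and the fold count and model choice are illustrative assumptions.

```python
# Illustrative sketch: stratified 5-fold cross-validation of a
# churn-style classifier, as described in the bullets above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (80/20 class split) stands in for churn data
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio stable across batches,
# helping detect overfitting across different data splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(scores.mean())
```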
Confidential, MA
Data Scientist
Responsibilities:
- Performed statistical analyses such as ANOVA, linear regression, and logistic regression on data using Python and advanced MS Excel
- Collected data from end client, performed ETL using various platforms and defined the uniform standard format
- Wrote queries to retrieve data from SQL Server database to get the sample dataset containing basic fields
- Used ensemble learning methods such as Random Forest, Bagging, and Gradient Boosting; selected the final model based on confusion matrix, ROC, and AUC; and predicted the probability of customer enrollment
- Tuned the hyperparameters of the above models using Grid Search to find the optimal model
- Designed and implemented K-fold cross-validation to test and verify model significance
- Developed a dashboard and story in Tableau presenting benchmarks and a summary of model metrics
- Created database tables, constraints, views, synonyms, sequences, indexes, and object types
- Created entity-relationship diagrams for database integrations using Microsoft Visio
- Explored sales across different customer categories with graphs built in Matplotlib and Seaborn, and analyzed whether sales showed trend or seasonality patterns
- Built predictive models such as Linear Regression, AR, MA, ARMA, and ARIMA to forecast sales using Scikit-learn and the statsmodels library
- Developed and modified database procedures and triggers to enhance and improve functionality using T-SQL