Data Scientist Resume
SUMMARY
- Extensive experience as a data scientist / data analyst performing data acquisition, data integration, data processing, data wrangling, data visualization, data analytics, machine learning, deep learning, predictive modeling, and big data analytics.
- Extensive experience in building supervised machine learning models by applying algorithms such as Linear Regression, Logistic Regression, Decision Tree, Random Forest, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Naive Bayes, XGBoost, etc.
- Strong experience in developing unsupervised machine learning models by applying algorithms and techniques of K-Means, DBSCAN, Hierarchical Clustering, Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc.
- Extensive experience in building machine learning and deep learning models of Natural Language Processing (NLP) to perform text preprocessing, feature extraction, text mining, sentiment analysis, topic modeling, etc.
- Strong experience in querying, processing, analyzing and building machine learning models on large amounts of data using big data technologies including Spark (Spark Core, Spark SQL, Spark MLlib), HDFS, etc.
- Extensive experience in performing text classification using deep learning algorithms including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM).
- Strong experience in performing web scraping to acquire data and information from public websites and save results as CSV or PDF files.
- Strong experience in performing data cleansing, data manipulation, data wrangling, feature engineering, feature selection, and Exploratory Data Analysis (EDA).
- Extensive experience in performing data preprocessing tasks including imputing missing values, identifying and handling outliers, feature scaling, categorical feature encoding.
- Strong experience in performing Model Validation using K-fold Cross Validation, Leave One Out and Stratified K-fold Cross Validation methods, and performing Hyper-Parameter Tuning using Grid Search method.
- Extensive experience in building compelling visualizations using Python Matplotlib, Seaborn, Plotly packages to show data distribution, feature correlation, trends and patterns, and model performance.
- Expertise in utilizing Python data science packages to perform data processing, build visualizations and develop machine learning models.
- Strong experience in developing content-based recommendation systems and model-based Collaborative Filtering recommendation systems.
- Proficient in using advanced SQL techniques including complex joins, subqueries, views, temp tables, CTEs, window functions to retrieve data, implement business logic, and conduct data conditioning.
- Extensive experience in developing visualizations, reports, and dashboards using BI tools including Tableau, Power BI, and SSRS, and presenting them to senior management for review and decision-making.
- Strong experience in eliciting and gathering business requirements from various internal and external stakeholders and translating them into technical specifications and actionable data tasks.
- Hands-on experience in performing ETL processes to integrate data from heterogeneous data sources (Excel, flat files, databases, Web APIs, JSON-format data, XML-format data, etc.).
- Extensive experience in designing, developing, and implementing A/B testing plans to test hypotheses and evaluate the effectiveness of these tests.
- Strong experience in developing and deploying machine learning models and solutions using cloud computing platforms including AWS (SageMaker, Redshift, EMR, EC2, S3), GCP (BigQuery), and Microsoft Azure.
- Excellent leadership, analytical problem-solving, cross-functional collaboration, and communication skills; a fast-learning, self-motivated, and proactive team player with strong troubleshooting skills.
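As an illustration of the model-validation and hyper-parameter tuning workflow described in the summary (Stratified K-fold Cross Validation with Grid Search), a minimal Scikit-Learn sketch on synthetic data; the estimator and parameter grid are illustrative choices, not taken from any actual project:

```python
# Grid search over a small parameter grid, scored with stratified 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, None]},
    scoring="accuracy",
    cv=cv,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern extends to Leave-One-Out validation by swapping the `cv` object for `LeaveOneOut()`.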
TECHNICAL SKILLS
Programming: Python, R, SQL, PySpark
Data Science Packages: NumPy, Pandas, SciPy, Matplotlib, Seaborn, Scikit-Learn, NLTK, TextBlob, Gensim, spaCy, TensorFlow, Keras, Selenium, PySpark
Machine Learning: Linear Regression, Ridge Regression, Lasso Regression, Logistic Regression, Decision Tree, Random Forest, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Naive Bayes, XGBoost, K-Means, DBSCAN, t-SNE, Gaussian Mixture Models (GMM)
Deep Learning: Neural Network, Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
Big Data: Spark (Spark Core, Spark MLlib, Spark SQL), HDFS
Mathematics: Likelihood Inference, Hypothesis Testing, A/B Testing, Bayesian Inference, Statistical Modeling, Forecasting, Optimization
Business Intelligence Tools: Tableau, Power BI, SSRS
Databases: MS SQL Server, MySQL, MongoDB
Cloud Platforms: AWS (SageMaker, EMR, EC2, Redshift, S3), GCP (BigQuery), Microsoft Azure, Databricks
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Gathered, identified, and prioritized business requirements by interviewing various stakeholders across departments and translated them into actionable data tasks.
- Developed web scraping pipelines to collect information and documents from various websites and convert them to PDF files using Python Selenium, Requests, Pdfkit, and Pywin32.
- Built routines to extract images from PDF files using Python PyMuPDF.
- Developed models to compare a set of images and pick out the most appropriate company logos using Python Scikit-image, Matplotlib and OpenCV.
- Built various data analytical reports on operational metrics, developed compelling visualizations and dashboards in Power BI and presented to key stakeholders.
- Collaborated with the data engineering and DevOps team to retrieve data from AWS Redshift and deploy web scraping models and solutions using AWS EC2 and S3.
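The image-comparison step above (picking the candidate closest to a reference logo) can be sketched in simplified form. The real pipeline used Scikit-image and OpenCV; plain NumPy mean-squared error is substituted here so the example is self-contained, and the arrays stand in for decoded image pixels:

```python
# Rank candidate images by similarity to a reference image (lower MSE = closer).
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally sized grayscale images."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

reference = np.zeros((32, 32))  # hypothetical reference logo
candidates = {
    "close_match": reference + 1.0,                                   # tiny offset
    "noisy": reference + np.random.default_rng(0).normal(0, 50, (32, 32)),
}

best = min(candidates, key=lambda name: mse(reference, candidates[name]))
print(best)  # close_match
```

In practice a perceptual metric such as structural similarity (SSIM) is more robust than raw MSE for logo matching.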
Confidential
Data Scientist
Responsibilities:
- Gathered, identified, and prioritized business requirements by interviewing various stakeholders across departments and translated them into actionable data tasks.
- Performed data cleansing, data wrangling, missing-value imputation, outlier detection, feature engineering, feature scaling and normalization, and data profiling for further modeling using Python NumPy and Pandas.
- Performed Exploratory Data Analysis (EDA) to get deeper understanding about the data, discover patterns, test business assumptions, generate hypotheses for further analysis, and prepare the data for modeling using Python Matplotlib and Seaborn.
- Developed Content-based Recommendation System models based on preferences, profiles, and features of both business buyers and business sellers using Scikit-Learn.
- Built Collaborative Filtering recommendation models by applying Alternating Least Square (ALS) algorithm to perform Non-Negative Matrix Factorization (NNMF) on implicit preferences of users using Spark MLlib.
- Performed NLP tasks of text manipulation and preprocessing including removing stop words, stemming and lemmatization, and extracting features using Count Vectorizers, TF-IDF Vectorizers and Word Embeddings.
- Developed complex SQL queries by using advanced SQL techniques of complex joins, subqueries, views, temp tables, CTEs, window functions to retrieve data and conduct data conditioning in MySQL.
- Partnered with the marketing team and the engineering team to develop and implement A/B testing strategies and plans and evaluated effectiveness of tests on marketing campaigns and changes made to the website.
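The content-based recommendation approach above (matching buyers to sellers on profile features) can be sketched with TF-IDF and cosine similarity in Scikit-Learn; the toy buyer/seller descriptions are invented for illustration:

```python
# Content-based matching: TF-IDF features over profile text, ranked by
# cosine similarity to the buyer profile.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "buyer_profile": "wholesale electronics accessories bulk orders",
    "seller_a": "electronics accessories wholesale supplier",
    "seller_b": "organic produce farm vegetables",
}
names = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())

# Similarity of each seller (rows 1..n) to the buyer profile (row 0).
sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
best_seller = names[1:][int(sims.argmax())]
print(best_seller)  # seller_a
```

The collaborative-filtering side described above would instead use Spark MLlib's `ALS` on a user-item ratings matrix, which is omitted here since it requires a running Spark session.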
Confidential
Data Scientist
Responsibilities:
- Gathered, analyzed, documented, and translated business requirements into data models and supported standardization of documentation and adoption of standards and practices related to data.
- Performed Exploratory Data Analysis (EDA) to uncover underlying structures, test underlying assumptions, and detect outliers and anomalies by leveraging NumPy and Pandas methods and building visualizations (bar plots, scatter plots, box plots, line plots, heatmaps, pair plots, etc.) using Matplotlib and Seaborn.
- Performed data cleaning, data manipulation, and data preprocessing on 10 million records of SME insurance and retirement plan datasets using PySpark in Databricks.
- Developed predictive models including Ridge Regression, Lasso Regression, Logistic Regression, Random Forest, and XGBoost to predict macroeconomic indicators (GDP, CPI, interest rates, etc.) and identified the features most relevant to the target variable.
- Developed machine learning models to perform customer segmentation using clustering algorithms of K-Means, DBSCAN, Hierarchical Clustering, Gaussian Mixture Models (GMM).
- Performed dimensionality reduction using Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Generated analysis reports on model performance and predicted results, developed compelling visualizations and dashboards using Matplotlib, Seaborn, Plotly packages and Power BI. Made presentations to key stakeholders, incorporated feedback and obtained approval for adoption.
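The segmentation workflow above (dimensionality reduction followed by clustering) can be sketched as follows with Scikit-Learn on synthetic data; the feature counts and cluster count are illustrative:

```python
# Scale features, project to 2 principal components, then cluster with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 8-feature data with 4 well-separated groups.
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=42)

X_scaled = StandardScaler().fit_transform(X)        # scale before PCA
X_2d = PCA(n_components=2).fit_transform(X_scaled)  # reduce to 2 components
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_2d)
print(sorted(set(labels)))  # [0, 1, 2, 3]
```

PCA is linear; t-SNE (mentioned above) serves the same reduction role when cluster structure is non-linear, though it is typically used for visualization rather than as a modeling input.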
Confidential
Data Analyst
Responsibilities:
- Performed Exploratory Data Analysis (EDA) to uncover key words of tweets on each target company using Pandas, Matplotlib, and Seaborn packages.
- Built a web scraping bot to scrape tweets related to acquisition target companies using Tweepy and GetOldTweets packages.
- Performed NLP tasks of text cleaning and preprocessing on scraped tweets using Regular Expressions (RegEx) library and NLTK package for further modeling.
- Performed feature engineering on text data using vectorizing techniques of Count Vectorizers, TF-IDF Vectorizers and Word Embeddings such as GloVe, Word2Vec and FastText.
- Built sentiment analysis models of Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Random Forest, XGBoost, etc. using Scikit-Learn to predict sentiment of tweets on target companies.
- Developed NLP deep learning models of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) using Keras to perform sentiment analysis on tweets of target companies.
- Built visualizations and dashboards using Tableau, made presentations to senior management, and delivered investment recommendations.
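The classical sentiment-analysis pipeline above (TF-IDF features into a Naive Bayes classifier) can be sketched with Scikit-Learn; the tiny labeled corpus is invented for illustration, standing in for preprocessed tweets:

```python
# TF-IDF vectorization piped into Multinomial Naive Bayes for sentiment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = [
    "great earnings strong growth love this company",
    "excellent product amazing outlook",
    "terrible quarter huge losses avoid",
    "awful management stock is crashing",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(tweets, labels)
prediction = model.predict(["strong growth and great outlook"])[0]
print(prediction)  # positive
```

The CNN/LSTM models mentioned above replace the TF-IDF step with word embeddings (GloVe, Word2Vec, FastText) feeding a Keras network, at the cost of needing far more training data.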
Confidential
Energy Analyst
Responsibilities:
- Collected and identified requirements from various stakeholders and translated business requirements into technical specifications for reports and dashboards development.
- Merged data from heterogeneous data sources and performed data cleaning, data munging, regularization, and feature engineering using SSIS.
- Retrieved data from SQL Server by developing complex SQL queries using advanced techniques including complex joins, subqueries, temp tables, CTEs, window functions.
- Performed Exploratory Data Analysis (EDA), handled duplicate records, missing data, and outliers per business and technical constraints, and performed data normalization to align data with business objectives.
- Developed predictive models in SSAS to predict energy consumption in buildings, identify energy usage trend, improve facility operations efficiency, and save annual energy costs.
- Designed, developed, and supported BI solutions including dashboards, scorecards, operational and scheduled reports, ad-hoc queries using SSRS.
- Built, maintained and managed reporting templates for internal use, and maintained technical documentation to describe report development, logic, testing, changes, and corrections.
- Presented results to different stakeholders across departments and partnered with the engineering team to design and implement new energy systems to boost energy efficiency and save energy cost.
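The energy-consumption prediction above can be illustrated with a simple regression sketch. The original work was built in SSAS; scikit-learn is used here as a stand-in, and the synthetic data and coefficients are invented for illustration:

```python
# Linear regression of building energy use on temperature and occupancy.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temperature = rng.uniform(0, 35, 200)          # daily mean temperature (C)
occupancy = rng.integers(0, 500, 200)          # building headcount
# Hypothetical relationship: baseline load + cooling load + per-person load.
energy = 100 + 2.5 * temperature + 0.8 * occupancy + rng.normal(0, 5, 200)

X = np.column_stack([temperature, occupancy])
model = LinearRegression().fit(X, energy)
r2 = model.score(X, energy)
print(round(r2, 2))
```

A model like this makes usage trends explicit: the fitted coefficients quantify the cost of each degree of temperature and each additional occupant, which is what drives the efficiency recommendations described above.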