Data Scientist Resume


  • Data Scientist/Data Analyst with several years of extensive experience in translating business requests into analytical models built on structured and unstructured data sources
  • Experienced in industries such as Finance, E-Commerce, Pharmaceutical, and Technology
  • Knowledgeable in the entire Data Science Project Life Cycle, including ETL, Data Collection, Data Cleaning, Data Imputation, Data Mining, Data Visualization, Machine Learning Algorithms, Natural Language Processing (NLP), and Deep Learning Algorithms
  • Expert in machine learning algorithms such as Linear Regression, Logistic Regression, Decision Tree (CART), Random Forest, Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), K-Nearest-Neighbors (KNN), Naïve Bayes, K-Means Clustering, Neural Networks, Recommendation System Design and more.
  • Proficient in NLP: text pre-processing and normalization techniques such as tokenization, stemming, lemmatization, POS tagging, BILOU tagging, chunking, Named Entity Recognition (NER), TF-IDF, Seq2Seq with attention, and sentiment analysis, using toolkits such as NLTK, CoreNLP, OpenNLP, spaCy and Gensim
  • Skilled in deep learning algorithms such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), and frameworks such as Keras, TensorFlow, and PyTorch
  • Experienced in Python 3.6-3.8 with toolkits: Numpy, Pandas, Scipy, Scikit-Learn, Matplotlib, Seaborn, Plotly, PyOD and Theano
  • Skilled in model optimization with Grid Search, Random Search, Bayesian optimization, and gradient-based optimization, and in using hyperparameter tuning for model selection
  • Experience in using model regularization: Lasso (L1) & Ridge (L2) Regression
  • Deep understanding of model validation using Train-Test Split, K-fold Cross-Validation, Stratified K-fold Cross-Validation, and Stratified Shuffle Split
  • Proficient in Model Testing: Confusion Matrix, Accuracy, Precision, Recall or Sensitivity, Specificity, F1 Score, and AUC-ROC curve to evaluate model performance
  • Proficient in MySQL; R 3.4-3.6 with caret, randomForest, dplyr, ggplot2, glmnet, and shiny; and SAS 9.4 (SAS 9.4 Specialist Certification), using procedures (statistics, tables, reports, transpose, macro, SQL, and so on) and functions to analyze data sets
  • Experience in ETL processes with Snowflake 3.0, Talend Open Studio 7.0.1, Pentaho Kettle 7.1, Informatica PowerCenter 10.2, and SQL Server Integration Services 2017 (SSIS)
  • Strong skills in statistical methodologies such as Hypothesis Testing, A/B Testing, ANOVA, Principal Component Analysis (PCA), and ARIMA Forecasting
  • Experienced in data analysis software: Minitab 17.3.1, SPSS 25, JMP 14, SAS EM 14.3, SSAS 2017, Excel VBA and Matlab 8.4 and data visualization: Tableau 2020.2.1, Power BI 2.83, and SSRS 2017
  • Solid ability to write and optimize diverse SQL Queries, knowledge of RDBMS (MySQL, PostgreSQL), NoSQL (Amazon DynamoDB, Redis, MongoDB) and Graph Database (Neo4j)
  • Strong skills in web tools (AngularJS, HTML5, CSS3, JavaScript), web analytics (Google Analytics and Google Trends for SEO), and web scraping with Python
  • Excellent understanding of SDLC (Systems Development Life Cycle), Agile and Scrum
  • Familiar with the version control tool Git, GitLab and GitHub and their built-in Continuous Integration tools, the container tool Docker, cloud platforms such as Azure, and project management tools such as Azure DevOps and JIRA 8.1.0
  • Experienced with AWS services, including Kinesis, S3, DynamoDB, Elastic MapReduce (EMR), Redshift, SageMaker, and Athena
  • Proficient in Big Data tools: Apache Hadoop 3.1-3.3, Hive 3.1.2, Spark 3.0 and Teradata 16
  • Trustworthy, effective, and enthusiastic team player with a strong ability to adapt and to learn new technologies and new business lines rapidly


Languages: Python 3.6/3.7/3.8, SQL, R 3.4-3.6, SAS 9.4, VBA 7.1, HTML5, CSS3, JavaScript, AngularJS

Integrated Development Environment (IDE): Anaconda, Spyder, Jupyter Notebook, PyCharm, Eclipse, RStudio, Visual Studio

Packages: Python (NumPy, Pandas, Scikit-learn, SciPy, Matplotlib, Seaborn, Plotly, PyTorch, PyOD, TensorFlow), R (caret, randomForest, e1071, dplyr, ggplot2, wordcloud, glmnet, shiny)

Database: MySQL 5.7, PostgreSQL 9/10/11, MongoDB, DynamoDB, Redis, Microsoft SQL Server 2017, BigQuery, Neo4j

BI Tools: Tableau 9.4/9.2, Microsoft Power BI 2.83, MS Office (Word/Excel/PowerPoint/Visio), Qlik Sense 3.1, QlikView 12, Superset

Cloud Platform: Amazon Web Services (AWS SageMaker, EC2, S3), Google Cloud Platform, Azure DevOps, Jira 8.1.0

Analytical Tools: Minitab 17.3.1, SPSS 25, JMP 14, SAS EM 14.3, MSSQL (SSIS/SSAS/SSRS) 2017, Matlab 8.4, Snowflake 3.0

Operating Systems: Windows 10/8/7/XP, Linux (Ubuntu, RedHat, CentOS), macOS



Data Scientist


  • Configured AWS Redshift and set up clusters as a data warehouse to store data
  • Conducted Exploratory Data Analysis (EDA) using Python 3.8 (Seaborn, Matplotlib) to summarize the main characteristics of data sets with univariate and multivariate graphical methods
  • Preprocessed data using Python 3.8 (Scikit-learn, Pandas, NumPy, PyOD), including normalization, missing value replacement, outlier detection, resampling, and feature selection
  • Conducted one-hot encoding and label encoding to convert categorical data to numerical data
  • Handled an imbalanced dataset with resampling techniques such as under-sampling and SMOTE
  • Performed feature selection by filter (Pearson correlation coefficient), wrapper (RFE), and embedded (L1, L2 regularization) methods
  • Developed the baseline model by using Logistic regression
  • Built machine learning models, including K-Nearest Neighbor (KNN) models, Decision Tree, Random Forest, Naive Bayes, and Support Vector Machine (SVM)
  • Utilized K-fold cross-validation and stratified K-fold cross-validation to detect overfitting and obtain more reliable performance estimates
  • Evaluated the performance of models by Confusion Matrix, F1 score, and AUC-ROC curve
  • Used Random Search to implement Hyperparameter Tuning for model optimization
  • Applied an LSTM-RNN time series model to predict occupancy rate and future cancellations
  • Built a pipeline to move data into a data lake, using a Kinesis data stream to capture the data and a Firehose delivery stream to deliver it to S3 for storage
  • Visualized the data using Tableau and created interactive dashboards and stories to produce business insights and present them to the executive leadership team

Environment: Python 3.8, AWS Redshift, AWS Kinesis, AWS S3, Tableau, Spyder, GitHub, Microsoft SQL Server, Azure DevOps


Data Scientist


  • Designed and built a database for clinical data from EDC in AWS Cloud and managed the data in AWS Redshift
  • Designed, coded, and executed programs to extract, aggregate, and manipulate pharmacy and medical claims from various sources and to identify data integrity issues
  • Followed Standard Operating Procedures (SOPs) to ensure quality and timely delivery of information
  • Utilized Qlik 10.0 and Snowflake 3.0 to develop API interfaces to logistical systems, pull data from SQL-based databases, and run ETL processes
  • Used Python (NumPy, Pandas, Scikit-learn, SciPy, Matplotlib, etc.) to wrangle raw structured and unstructured data sets into a format suitable for advanced analytical methods
  • Developed new and maintained existing Python programs to analyze healthcare claims in support of ad hoc data requests, including data processing and exploratory data analysis (EDA)
  • Used AWS SageMaker to configure machine learning models and optimize parameters
  • Applied machine learning, deep learning, and other advanced techniques, such as Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), and combined models through ensemble modeling to improve performance
  • Exported data to AWS S3 and documented the steps and progress of the work
  • Used Git for version control to keep programs and code up to date

Environment: AWS Cloud, AWS Redshift, AWS SageMaker, AWS S3, Python 3.8, Excel VBA, Git, Snowflake 3.0, Qlik 10.0
