- Data scientist with multiple years of experience transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data in the finance, healthcare, and public transportation industries.
- Worked through the entire data science project life cycle, including data acquisition, data cleaning, data manipulation, data mining, machine learning, testing and validation, and data visualization.
- Skilled in supervised and unsupervised machine learning algorithms, including linear regression, logistic regression, decision trees (CART), LightGBM, XGBoost, random forest, gradient boosting, k-nearest neighbors, naïve Bayes, Bayesian networks, and k-means clustering.
- Applied knowledge of deep learning models, including convolutional neural networks, recurrent neural networks, and LSTMs, and of deep learning libraries such as Keras and TensorFlow.
- Strong knowledge of statistical methodologies such as hypothesis testing, ANOVA, principal component analysis, correspondence analysis, and ARIMA time series analysis.
- Worked with data of varying scales and structures, applying data normalization and regularization.
- Skilled in model optimization and tuning, including grid search for hyperparameter tuning, k-fold cross-validation, and customized evaluation metrics.
- Proficient in Python 3.x with the NumPy, Pandas, SciPy, scikit-learn, Matplotlib, and Plotly packages.
- Extensive experience with R 3 for data manipulation and statistical modeling using dplyr, tidyr, ggplot2, randomForest, rpart, nnet, and mlr.
- Solid ability to write and optimize diverse SQL queries; performed RDBMS design and maintenance; familiar with data schemas such as 3NF, star schema, and snowflake schema.
- Adept with big data tools such as Hadoop MapReduce and Databricks Spark 2.0.1 (PySpark, Spark SQL, Spark ML).
- Practical skills in cloud services and cloud computing, such as AWS S3 and AWS Lambda.
- Experience with ETL tools such as AWS Glue, Xplenty, and Airflow.
- Familiar with Jira bug tracking and project management.
- Adept in agile mindset, workflow, and methodology.
- Familiar with GitHub version control and code sharing.
- Deep understanding of building and publishing customized interactive dashboards and reports with custom parameters and user filters using Tableau 2018.x/2019.x, R Shiny, Spotfire 10.1, Qlik Sense 13.32, Power BI, Visio, and PowerPoint 2016.
- Fast learner with strong adaptability to new fields; good at collaboration and communication; works effectively as both team leader and team member.
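The tuning workflow mentioned above (grid search over hyperparameters with k-fold cross-validation) can be sketched as follows. This is a minimal illustration on a synthetic dataset; the model, parameter grid, and metric are assumptions for the example, not taken from any specific project:

```python
# Minimal sketch: grid search with 5-fold cross-validation in scikit-learn.
# Dataset and parameter grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid; a real grid would cover more hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",  # a customized metric could be passed here instead
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` refits the best configuration on the full data after the search, so `search` can be used directly as the tuned model.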
Languages: Python 3.6.5, R 3, SQL
Operating Systems: Windows 10/8/7, OS X
Other Data Tools: AWS S3, AWS Lambda, Jira, GitHub, ArcGIS 10.6, TrackWise 8
Big Data Tools: Spark 2.3 (PySpark, Spark SQL, Spark ML), Hadoop
BI Tools: Tableau 2018.x/2019.x, Minitab 17.3.1, Spotfire 10.1, Qlik Sense 13.32, MS Office 2016/2013 (Word/Excel/PowerPoint/Visio/Outlook/Power BI)
Database: MySQL 8.0, SQL Server 2016, MongoDB
Packages: Python (scikit-learn, Keras, PyTorch, XGBoost, TensorFlow, Pandas, NumPy, SciPy, Matplotlib, Seaborn, Plotly, BeautifulSoup, statsmodels); R (caret, dplyr, tidyr, rjson, mice, rpart, devtools, randomForest, nnet, bnlearn, ggplot2, forecast, Shiny, lubridate, RCrawler, stringr, rvest, rminer, RCurl, SparseM, pls, RTextTools)
Confidential, Jacksonville, FL
- Extracted, cleaned, and manipulated data from a MySQL database
- Built the data cleaning and data preprocessing pipelines for the project
- Performed feature engineering and feature selection to create new features and provide predictions for different groups (e.g., age group, education group)
- Performed logistic regression to predict the success rate of loan repayment
- Used the Imbalanced-Learn package for under-sampling and over-sampling to address problems caused by the imbalanced dataset
- Improved model accuracy by 5% and addressed overfitting using random forest and gradient boosting
- Further improved model accuracy by 9% using LightGBM
- Implemented an RNN/LSTM model in Keras with a TensorFlow backend
- Applied L1 and L2 regularization to mitigate overfitting
- Added dropout layers to the neural network
- Performed hyperparameter tuning to balance model accuracy and time complexity
- Validated and selected models using k-fold cross-validation and error metrics, and optimized models for more stable performance
- Used AWS S3 for data storage and AWS Lambda to re-train the model and run the model periodically.
- Used GitHub for version control and code sharing with team members.
- Created data visualization dashboards using Tableau, Python Matplotlib, and PowerPoint
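The imbalance-handling step above can be sketched as follows. The résumé used the Imbalanced-Learn package; to keep the sketch self-contained, random under-sampling of the majority class is done here with NumPy alone, on a synthetic loan-style dataset (all names and ratios are illustrative assumptions):

```python
# Sketch: random under-sampling for an imbalanced binary classification task,
# followed by logistic regression. Done with NumPy instead of imbalanced-learn
# so the example is self-contained; the 9:1 imbalance is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=0  # ~9:1 class imbalance
)

# Under-sample the majority class down to the minority class size.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([keep, minority])  # balanced training subset

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
print(round(model.score(X, y), 3))
```

Training on the balanced subset trades some majority-class accuracy for better recall on the rare class, which is the usual goal when predicting loan defaults.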
Environment: MySQL 8.0, Python 3.6 (NumPy, Pandas, Scikit-learn, statsmodels, Keras, TensorFlow), Microsoft Office 2016 (PowerPoint/Word/Excel), Tableau 2019.2, AWS S3, AWS Lambda, GitHub
Confidential, Boston, MA
- Extracted and manipulated data from MySQL database
- Transformed the data into forms better suited for visualization, modeling, and storage
- Built an interactive dashboard with R Shiny, integrated with ggplot2 and dplyr, to visualize the import and export data of an international trading company using a heatmap, a bar chart, and a scatterplot
- Performed ARIMA time series analysis and correlation analysis on the import and export data
- Designed an interactive dashboard with Tableau to visualize the insurance sales data by hierarchy groups and timeline, including multiple pie charts and line charts with filtering options
- Performed linear regression on insurance pricing data and KNN on insurance fraud detection data
- Created two interactive dashboards with Excel, Power BI, and Tableau to visualize census and economic data of the Philippines, including income, age, and population
- Designed UML diagrams, data model and data dictionary, hosted requirements gathering sessions
- Efficiently communicated with manager and research group on business demands and plans
Environment: MySQL 8.0, R 3 (forecast, ggplot2, glm, kNN, ARIMA), Tableau 2019.1, Microsoft Office 2016 (PowerPoint/Word/Excel/Power BI/Visio)
Confidential, Boston, MA
- Cleaned and restructured data from Excel sheets into a relational database
- Conducted EDA (exploratory data analysis) on social media sentiment data and stock price history from the same period
- Designed time series analysis (ARIMA) for both stock data and sentiment data
- Created many datetime-related features by shifting the input time forward and backward
- Conducted correlation analysis between sentiment data and stock data across the various shifted input times
- Created a quantified correlation scale for pricing and sentiment data and used it to identify stocks influenced by sentiment from previous days
- Transformed the prediction target from numeric (price change) to binary (up/down), which significantly improved prediction accuracy
- Summarized a list of stocks that can be influenced by sentiment from previous days
- Generated narrative insights and suggestions through natural language generation and combined them with Matplotlib, Tableau, and Spotfire visualizations for business reports
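The lag-correlation step described above (shifting the sentiment series and measuring its correlation with price changes at each lag) can be sketched as follows. The data are synthetic, constructed so that price changes partly follow the previous day's sentiment; column names and the lag range are assumptions for the example:

```python
# Sketch: finding the sentiment lag most correlated with price changes,
# using pandas shift(). Data are synthetic with a built-in one-day lag.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sentiment = rng.normal(size=200)
# Price change partly driven by the PREVIOUS day's sentiment plus noise.
price_change = 0.7 * np.roll(sentiment, 1) + 0.3 * rng.normal(size=200)

df = pd.DataFrame({"sentiment": sentiment, "price_change": price_change})

# Shift sentiment forward by k days and measure correlation at each lag.
lags = {k: df["sentiment"].shift(k).corr(df["price_change"]) for k in range(4)}
best_lag = max(lags, key=lambda k: abs(lags[k]))
print(best_lag, round(lags[best_lag], 3))
```

With real data, a lag whose correlation stands clearly above the others is the signal that yesterday's sentiment carries information about today's move, which is what the quantified correlation scale above captures per stock.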
Environment: MySQL 8.0, Python 3.6 (NumPy, Pandas, Scikit-learn, statsmodels), R 3 (forecast, ggplot2), Microsoft Office 2016 (PowerPoint/Word/Excel), Tableau 2019.1