Data Analytics Tools: R, Python, Spark, RapidMiner, SQL, Tableau, Hadoop, Hive, GIS
Modeling Skills: Statistical & Predictive Modeling, Text Mining, Clustering, Simulation & Heuristic Modeling
Lab Analytics Skills: GC-MS, DNA Extraction, qPCR, High-Throughput Sequencing, Cell Culture
Specialties: Classification, Regression, Exploratory Analysis, Visualization, Recommender Systems, Genetic Analysis
Confidential, Minneapolis, MN
Data Science Intern
- Engaged with three stakeholders to understand their strategic needs and data generation processes
- Analyzed 8,000 anti-epileptic drugs (AEDs) against patient characteristics by preprocessing 18+ million hospital and Rx claim records
- Identified symptom-history patterns associated with AED use among non-epileptic patients using association rules
- Developed a regression model to extract the key factors affecting patient claim counts, to improve personalized care and reduce cost
- Partnered with three analysts to predict hospital-acquired infections (HAIs) in a regional hospital
- Addressed sparse data using dummy variables and feature selection with PCA
- Developed classification models with 8 machine learning algorithms (Decision Tree, k-NN, Logistic Regression, Support Vector Machine, Neural Networks, Naive Bayes, Random Forest, Ensemble Model)
- Built the best-performing model (60% accuracy, 10% misclassification error) to minimize financial impact and reduce patient risk
- Used review and product textual data of Amazon's clothing segment to develop a recommender system via rating prediction
- Preprocessed sparse data in R by setting density thresholds for user and item data; also used the original sparse data
- Used trigrams to extract item content features and word entropy (0.8) to select features for the content-based model in Python
- Compared collaborative, content-based, and hybrid models to find the best model for rating prediction, with 0.966 RMSE
- Collaborated with a team of five analysts, serving as the subject matter expert and business analyst
- Scrubbed panel data on 335 lakes and property taxes through outlier detection, inflation adjustment, and derived attributes
- Described and visualized 35 years of water quality trends and property characteristics using R and ArcGIS
- Analyzed interactions between lakes and properties with a mixed model, and refined it with external data
- Presented analysis insights to 100 data scientists, helping stakeholders develop a community management strategy
- Developed queries in Spark SQL to analyze meetup.com API data using both DataFrame and RDD based approaches
- Built a SparkML pipeline and cost-sensitive classification models to predict MRSA & pneumonia patient readmissions
- Performed data pre-processing and built both unsupervised (clustering) and supervised (regression) models in SparkR to assess and predict Minneapolis meetups
- Led a project team of five analysts to drive new marketing strategies by analyzing 3+ million transaction records
- Analyzed 50 retail companies and segmented them by product market share for further promotion analysis
- Used temporal anomaly detection to determine which retailers were more competitive in terms of market share
- Developed statistical models with 90% average accuracy to identify the short- and long-term effects of market share
- Created an executive-level dashboard to help the client identify the most efficient promotion combination and market positioning
- Partnered with four analysts to predict hotel booking volume, price, and length of stay at different granularities
- Retrieved and transformed 3+ million rows of data from the client database using SQL
- Developed an autoregressive forecasting model at weekly, monthly, and quarterly levels with 90%+ average accuracy
- Improved the model with stock price, oil price, and S&P data to capture the industry component
- Created an interactive executive-level dashboard to present model performance to clients
- Partnered with four analysts to predict the likelihood of an Airbnb user booking within 90 days of sign-up
- Explored customer demographic and online behavior data through data cleaning and transformation in Python
- Analyzed and visualized differences in behavior characteristics between customers and non-customers
- Built a logistic regression model with 0.72 AUC using user demographics, enrollment methods, and online behavior attributes
- Improved the model by 0.13 AUC with the Dow Jones Industrial Average and a consumer sentiment index
Water Analyst & Research Associate
- Collaborated with 21 cities to manage water quality and implemented 8 river remediation and evaluation policies
- Collected, analyzed, and reported water data from 124 manual monitoring sections and 28 automatic monitoring stations
- Mapped hydrography using ArcGIS and presented biannual and annual reports to the agency director
- Responded on call to emergency water pollution events as data analyst and pollution source investigator
- Developed water quality database for water remediation extension project by collaborating with technical team
- Awarded a 100K yuan National Science Foundation grant to conduct a bio-toxicity monitoring system application project
- Developed a bio-toxicity monitoring standard with a three-organism monitoring system: fish, luminescent bacteria, and microbial fuel cell
- Assessed the water quality of the delta area with the Biotic Ligand Model, based on lab toxicity data and water quality parameters
- Predicted the bio-toxicity of river water from 24 physical and chemical parameters using an ensemble model
- Assisted in building a drinking water precautionary system by analyzing risk with decision trees using SPSS and R
Confidential, Saint Paul, MN
- Studied microbial mechanisms associated with greenhouse gas emission mitigation under the supervision of a Confidential soil scientist
- Worked with a lab technician to sample soil and collect CO2/N2O data from the Rosemount Confidential corn field
- Designed thesis experiments, conducted soil incubation in the lab, and tested CO2/N2O and physical-chemical parameters
- Collaborated with a postdoctoral researcher to perform functional gene expression experiments with soil amendments
- Analyzed factors for correlation and performed visualization in R, incorporating results in a published paper