- 5 years of industry experience focused on big data analysis, data mining, statistical inference, A/B testing, machine learning, data visualization and ETL data pipelines.
- Solid working experience with SQL, including MySQL and MS SQL Server. Highly skilled in writing stored procedures, triggers and complex queries containing multiple joins, subqueries and window functions to create reports and perform analysis.
- Created dashboards in Tableau for reporting and data visualization, and guided business decision-making for multiple stakeholders.
- Developed Python scripts for the whole data science project life cycle, including data acquisition, data cleaning, data exploration and data modeling, using libraries such as Pandas and Scikit-learn.
- Hands-on experience performing statistical analysis, causal inference and statistical modeling in R and Python, interpreting the results of hypothesis tests, A/B tests and multivariate tests, and providing data-based recommendations.
- Experience building and interpreting machine learning models including Linear Regression, Logistic Regression, Random Forest, XGBoost, K-means, KNN and Neural Networks.
- Experience working with NoSQL databases and big data tools including Hadoop, Hive and Spark.
- Built ETL pipelines to extract, transform and load data into analytical databases, and scheduled and automated the pipelines using Apache Airflow.
- Solid experience working with cloud platforms such as AWS and Google Cloud.
- Experience with shell scripting on Linux and with version-control tools such as Git.
- Detail-oriented self-starter with strong communication skills, experienced in presenting analysis results to both technical and non-technical audiences and in collaborating within cross-functional teams.
- Designed, developed and tracked Key Performance Indicators (KPIs) and created dashboards to monitor them.
- Experience translating business requirements into technical requirements, and enabling decision-making by retrieving and aggregating data from multiple sources.
- Strong time management skills for managing work plans, tight deadlines and multiple projects.
Languages: Python, R, SQL, Bash
Packages: NumPy, Pandas, Scikit-learn, TensorFlow, Matplotlib, Seaborn, Plotly, NLTK
Cloud: AWS (EC2, S3, RDS, Redshift, EMR), Google Cloud (BigQuery, Kubernetes)
Databases: MySQL, PostgreSQL, MS SQL Server, MongoDB, Hive, Presto, AWS RDS, AWS Redshift, AWS Redis, BigQuery
Tools: Tableau, Hadoop, Hive, Apache Airflow, Apache Spark, Flask, Apache Kafka, Jupyter Notebook, Excel, Jira, Git, Docker, Kubernetes
- Responsible for providing data analysis that focused on improving user experience and optimizing user retention.
- Wrote complex ad-hoc MySQL queries involving correlated subqueries, window functions and common table expressions to track user activity metrics such as retention rate and daily active users. Optimized existing queries to run faster on large datasets.
- Designed and maintained MySQL databases, and created pipelines using user-defined functions and stored procedures for daily reporting tasks.
- Developed dashboards in Tableau and presented data to track KPIs and product performance using data from various sources such as MS Excel, AWS S3, AWS RDS, AWS Redshift, JSON and XML.
- Spotted and analyzed trends in user activity data, identified underperforming user segments and reported insights to stakeholders.
- Designed metrics for A/B testing, created dashboards in Tableau to monitor test processes, analyzed and interpreted test results, and gave recommendations to stakeholders.
- Performed statistical analysis such as hypothesis testing, causal inference and Bayesian analysis.
- Communicated with business stakeholders, product managers and data engineers to ensure data quality.
- Built ETL pipelines to retrieve data from NoSQL databases and load aggregated data into the analytical platform.
- Managed data storage and processing using big data tools such as Hadoop, HDFS, Hive and Spark.
- Developed Python scripts to automate data validation and data cleaning processes, such as deduplication and data consistency checks, using Pandas and Apache Airflow.
- Analyzed large-scale user log data and generated features for classification models using Spark SQL.
- Implemented scalable Random Forest models using Spark ML and Python to predict customer churn.
- Trained advanced classification models such as XGBoost, SVM and Neural Networks using Python packages such as Scikit-learn.
- Defined metrics to estimate the impact of new features and gave recommendations for business decisions based on data analysis.
- Participated in data project planning, gathering business requirements and translating them into technical requirements.
- Responsible for creating reporting dashboards, performing data mining and analysis to understand customer purchase behavior.
- Created real-time dashboards in Tableau to visualize and monitor key metrics and A/B test processing using both external and internal data.
- Collaborated with the marketing team to analyze marketing campaign data and performed segmentation and cohort analyses.
- Designed MySQL table schemas and implemented stored procedures to extract and store customer purchase and session data.
- Queried data from MySQL databases, then validated it and detected inconsistencies using the Python packages Pandas and NumPy.
- Actively involved in designing A/B tests, defining metrics to validate new user interface features, calculating sample size and checking statistical assumptions for tests.
- Performed statistical analysis in R, such as hypothesis testing, regression analysis, and confidence interval and p-value calculation, to find insights for increasing click-through rate and sales, and built web applications for ad-hoc interactive dashboards.
- Performed Exploratory Data Analysis to identify trends using Tableau and Python (Matplotlib, Seaborn, Plotly Dash).
- Wrote scripts to store data in Hadoop HDFS from various sources including AWS S3, AWS RDS, web APIs and the NoSQL database MongoDB.
- Deployed the big data tools Spark and Hive to analyze large datasets of up to 2 TB stored in Hadoop HDFS, including filtering and aggregation using Spark SQL on Spark DataFrames.
- Developed Python scripts to preprocess data for predictive models, including missing value imputation, label encoding and feature engineering.
- Implemented machine learning models in Python, including Decision Trees and Logistic Regression, to predict revenue from returning customers and help the marketing team choose appropriate promotion strategies.
- Communicated key findings from data to multiple stakeholders to facilitate data-driven decisions, using tools such as MS PowerPoint, Tableau and Jupyter Notebook.
- Performed quantitative analysis for credit portfolios, along with data quality control and data profiling.
- Developed interactive dashboards and storyboards in Tableau to explore credit portfolio data and report insights to the manager.
- Performed exploratory data analysis and identified data patterns that can translate into business rules.
- Deployed the MS SQL Server platform to store, transform and query data for analysis and reporting.
- Wrote T-SQL queries involving multiple joins, subqueries and common table expressions to perform data aggregation for reporting.
- Performed statistical analysis such as t-tests, F-tests, ANOVA, MANOVA, confidence interval calculation and distribution identification using R, and generated reports using R Markdown.
- Used the ETL tool SSIS to extract data from both external and internal sources, including CSV, JSON and Excel files, and load it into the database.
- Collected data from various sources using Python and loaded it into Pandas DataFrames to detect outliers and impute missing values.
- Used MS Excel to compute summary statistics of datasets, and VBA to automate data transformation.
- Monitored email delivery statistics and optimized email efficiency to increase customer engagement.
- Analyzed A/B test results using R to find better email delivery times for different topics.
- Set up automated systems using Excel VBA to pull data smoothly into Tableau for creating reports and dashboards.
- Collected data from internal sources using MySQL and performed data cleaning and aggregation for reporting and further analysis.
- Manipulated data and calculated key metrics for reporting using MySQL queries (window functions, subqueries) and MS Excel.
- Visualized data from MySQL and MS Excel in Tableau dashboards to monitor email performance.
- Connected to the MySQL database using the R package RMySQL and retrieved data for statistical analysis and visualization.
- Performed exploratory data analysis (EDA) using ggplot2 in R to find insights such as differences in efficiency across devices.
- Used Google Docs, Google Sheets and Google Slides to collaborate on data and code documentation and to report insights to the manager.