- Over 5 years of industry and research experience, with a strong passion for data analysis, data visualization, business intelligence, machine learning, and project management.
- Extensive experience with relational (SQL) and non-relational (NoSQL) databases such as MySQL, Oracle, and MongoDB. Deep knowledge of, and extensive experience in, writing and troubleshooting complex SQL procedures.
- Extensive experience in ETL (extract, transform, load) and in storing and analyzing large datasets with HDFS (Hadoop Distributed File System), Hive, Pig, and Spark.
- Extensive experience with Linux and writing Bash scripts.
- Experience in Tableau and Kibana for data visualization and information design.
- Experience in processing CSV, TXT, XML and JSON files.
- Extensive experience in Python and its libraries, such as Pandas, NumPy, SciPy, Plotly, Matplotlib, Scikit-Learn, PySQL, and PySpark, for data analysis and visualization.
- Experience in using statistical tools such as SAS, R and JMP for data analysis.
- Experience with, and strong knowledge of, building REST APIs with web frameworks such as Flask.
- Experience with project management methodologies such as Agile and Scrum, applying inspection and adaptation throughout the project management process.
- Experience with Amazon Web Services (EC2, EBS, S3).
- Outstanding written communication and presentation skills; experienced in developing and presenting conclusions and recommendations to senior executives and in working with cross-functional development teams.
Languages: SQL, Python, R, Pig Latin, Hive, PySpark, Scala, SAS, Shell, Java, MATLAB
Tools & Skills: ETL, MySQL/Teradata SQL, Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Spark, Sqoop, Flume, Tableau, MongoDB, Linux, Bash, Git, Data Warehousing, SSDT (SSIS, SSAS, SSRS), TensorFlow, ELK Stack (Elasticsearch, Logstash, Kibana), OLAP, OLTP, Agile, Scrum
Cloud: AWS, Amazon EC2, EBS, S3
Confidential, Sunnyvale, CA
Data Quality Analyst
- The project retires an old database and rolls out a new one. Analyze the data quality (validity, completeness, consistency) of the new database based on data sent by Confidential factories, which produce everything from modules and accessories to finished products including iPhone, iMac, iPad, and Apple Watch.
- Query databases (MySQL, Oracle, Hive) and visualize data (Tableau) on a daily basis to monitor data quality, identify potential error sources, and communicate with international data-generating and data-administering parties to fix quality issues.
- Update the data dictionary and design ER diagrams for manufacturing schemas using MySQL Workbench. Update tables after ad-hoc tests and clean up the database (create, rename, and drop tables). Proficient in ETL tasks such as importing/exporting CSV, text, and XML to/from MySQL databases, and converting JSON files to pandas DataFrames and Excel files for loading into the database.
- Work in a Linux environment. Write Bash scripts to run Python and Hive jobs. Schedule daily and weekly cron jobs on different manufacturing servers to extract, transform, and load data automatically into the database, with email notifications reporting on the process. Work on script consolidation and migration across servers.
- Write MySQL and Oracle queries for master data extraction and analysis (using full/left outer join, inner join, union, where, group by, having, order by, etc.). Write Hive queries for high-volume transactional data extraction and analysis (creating internal and external tables with partitions defined, joining tables from different schemas, and pivoting and unpivoting tables using LATERAL VIEW explode, map functions, etc.).
- Use Tableau to visualize data by connecting to MySQL databases or by importing CSV and other flat files. Create worksheets and dashboards containing dynamic graphs such as stacked bars, side-by-side bars, symbol maps, and highlight tables. Use Tableau dashboard filter actions to select information in one worksheet and retrieve related information from another sheet in the dashboard.
- Use Python to modify SQL and Bash scripts. Use the os module to create and run dynamic shell scripts from Python. Use Python to convert data from nested JSON structures to Pandas DataFrames.
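To illustrate the nested-JSON conversion mentioned in the last bullet, here is a minimal sketch. It uses only the standard library; the payload and its field names (`unit`, `serial`, `line`, `result`) are hypothetical and not taken from the project. The flattened records could then be handed to `pandas.DataFrame`, and `pandas.json_normalize` offers similar flattening out of the box.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical payload: field names are invented for illustration only.
raw = (
    '[{"unit": {"serial": "A1", "line": 3}, "result": "pass"},'
    ' {"unit": {"serial": "B2", "line": 4}, "result": "fail"}]'
)

# One flat dict per record; pandas.DataFrame(rows) would turn this into a table.
rows = [flatten(r) for r in json.loads(raw)]
```

Lists inside a record are left as-is by this sketch; exploding them into separate rows would need an extra step.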
Confidential, Milpitas, CA
- Analyzed large sales datasets for a client, a sales company, and recommended sales strategies to increase quarterly sales revenue. Analyzed customer behavior based on online browsing and purchasing information.
- Wrote Pig Latin scripts in the Grunt shell to transform and load large volumes of unstructured web log data into HDFS.
- Used Hive to build or modify tables by writing Data Definition Language statements such as CREATE, DROP, SHOW, TRUNCATE, DESCRIBE and ALTER.
- Wrote Hive queries for operations such as joining (INNER JOIN and LEFT, RIGHT, and FULL OUTER JOIN), sorting (ORDER BY, CLUSTER BY), filtering (WHERE, HAVING), sampling (TABLESAMPLE), and windowing (OVER, RANK).
- Maintained and managed a relational database and analyzed data from the company’s physical stores using SQL. Wrote queries to select data for desired periods and regions, leveraging SQL clauses such as GROUP BY, HAVING, LIMIT, and OFFSET.
- Created customized SQL tables using subqueries, MERGE, JOIN, and UNION, and used CASE to create new fields. Used SQL functions such as COUNT, AVG, IF, and IFNULL to query key sales performance metrics from the database.
- Used Python to analyze product features and sales patterns that may contribute to sales performance. Worked with Pandas Series and DataFrames for data manipulation, including handling missing data (dropna, fillna), column/row operations (ix, assign, where, astype, apply, sort), combining data (merge), and grouping data (groupby, agg).
- Leveraged the Matplotlib and Seaborn libraries to draw statistical plots for data exploration, including histograms, box plots, kernel density estimation plots, correlation tables, and heatmaps.
- Conducted A/B tests to explore whether certain marketing patterns had a significant effect on sales performance: proposed statistical hypotheses, calculated sample sizes, randomly assigned tested features to treatment groups (of stores and websites), collected and cleaned data, analyzed the data, and developed recommendations.
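The sample-size step of an A/B test like the one above can be sketched as follows. This is an illustration under stated assumptions, not the calculation used in the project: it assumes a two-sided test comparing two conversion rates with the standard normal-approximation formula, and the rates, significance level, and power shown are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate units needed per group to detect p_control vs p_treatment.

    Normal-approximation formula for a two-sided test of two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical rates: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(0.10, 0.12)
```

Smaller expected lifts or higher power targets drive the required sample size up quickly, which is why the calculation comes before assignment to treatment groups.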
Confidential, Santa Clara, CA
- Wrote efficient Python code with packages such as NumPy, SciPy, Pandas, and PySpark to load and process large-scale customer data for a credit card company's card recommendation system.
- Used Matplotlib and Seaborn to generate plots such as power spectra, bar charts, time-series charts, error charts, scatterplots, and correlation tables. Predicted customer behavior, such as credit card gaming probability and spending capability, based on data from website cookies and third-party providers.
- Applied machine learning models such as Logistic Regression, Classification Trees, Random Forests, SVM, and Gradient Boosting to train gamer and spend models using the Scikit-learn and NLTK packages, achieving accuracy of up to 0.82.
- Designed an end-to-end pipeline using ELK (Elasticsearch, Logstash, Kibana), based on time-series data, for the international marketing department. Converted CSV files to JSON files that include a self-defined hash code generated with SHA-256 in Python, and then bulk-uploaded the data into Elasticsearch using RESTful APIs.
- Developed internal Kibana dashboards to monitor metrics on how credit card recommendations affect customers. Explored metric patterns to derive business insights that contributed to improving the existing recommendation models. Provided troubleshooting, analysis, and solutions for back-end application and third-party data provider issues.
- Building on the company’s Python-based infrastructure, designed new REST APIs with Flask. Wrote Python code to improve the infrastructure and make API access fast, easy, and reliable.
- Worked closely with the data science team and collaborated with front-end engineering teams. Interpreted high-level requirements and refined them into Agile implementation stories.
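The CSV-to-Elasticsearch step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the hash is computed over the whole row and used as the document `_id`, and the index name and column names are hypothetical. It builds the newline-delimited body that the Elasticsearch `_bulk` endpoint accepts, with one action line followed by one document line per row.

```python
import csv
import hashlib
import io
import json


def csv_to_bulk(csv_text, index_name):
    """Convert CSV text to an Elasticsearch bulk (NDJSON) request body."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: hash the canonical JSON of the row to get a stable ID,
        # so re-uploading the same row overwrites instead of duplicating.
        canonical = json.dumps(row, sort_keys=True)
        doc_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample data and index name, for illustration only.
sample = "timestamp,region,clicks\n2020-01-01T00:00:00Z,EU,12\n"
body = csv_to_bulk(sample, "marketing-metrics")
```

The resulting `body` could be sent in a POST request to the cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. Note that `csv.DictReader` yields every value as a string, so numeric fields would need casting or an index mapping.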
Confidential, Philadelphia, Pennsylvania
Junior Data Analyst
- Developed a CRUD (Create, Read, Update, Delete) customer management web application using Flask. Designed the application's component structure following the MVC architectural pattern.
- Deployed Microsoft SQL Server BI Platform to assemble, manage and report data across multiple sources.
- Used SSIS to extract data from different external data sources, including TXT, Excel, JSON, XML, and CSV files.
- Used SSAS to clean, transform, join, and merge data to meet business requirements. Processed SSAS cubes to load the data into a target data warehouse and create an index for the data that can be accessed by SSRS.
- Used SSRS to report data to customers. Used SQL queries in Tableau to design internal dashboards. Analyzed metrics to explore business insights for the business department. Developed an email alert system to monitor essential business metrics.
- Built shell scripts and used crontab to automate daily execution of the ETL process. Designed a custom logging system to capture data issues caused by server crashes or other bugs. Wrote documentation for team members handling related tasks.
- Worked with both business and technical teams. Collaborated with product managers and other business systems analysts to establish business and technical visions using Microsoft Excel, Access, SPSS, and JMP.
- Converted JSON data into relational database tables for reporting. Synthesized data and feedback from users, vendors, and management to make data-driven decisions and recommendations.
- Conducted time-series analysis and multivariate linear regressions to study whether the passage of the Sarbanes-Oxley Act of 2002 (SOX) had an impact on the financial statement comparability of U.S. firms, and whether that impact varies among industries with different levels of regulation.
- Plotted histograms, scatterplots, box plots, and correlation tables with JMP.
- Wrote SAS programs to conduct t-tests, F-tests, ANOVA, and MANOVA, and used SAS macros to combine and manipulate financial datasets.
- Prepared summary documentation and created PowerPoint presentations and dashboards for professors and other members of the research group.