- Data scientist with 6 years of healthcare, technology, fashion, and e-commerce experience.
- Over 4 years of experience across the entire data science project life cycle, including Data Acquisition, Data Cleaning, Data Manipulation, Data Mining, Machine Learning Algorithms, Data Validation, and Data Visualization.
- Expertise in transforming business requirements into analytical models, applying algorithms, and reporting solutions that scale across massive volumes of structured and unstructured data.
- Experienced with linear and logistic regression, Bayesian inference, SVM, neural networks, ANOVA, Gaussian mixture models, recommendation systems, and maximum likelihood estimation.
- Strong skills in statistical methodologies and dimension reduction methods such as PCA, correspondence analysis, and variable clustering.
- Tested and validated models using k-fold cross-validation and regularization.
- Extensive experience in developing time series models, including but not limited to ARIMA and GARCH, using SAS 9.4, SAS Enterprise Miner, SAS Enterprise Guide, and SAS/JMP.
- Worked with Python 3.3 to develop machine learning models, such as decision trees, random forests, lasso regression, and k-means clustering, using the Numpy, Pandas, Scikit-learn, SFrame, Scipy, Matplotlib, and nltk packages.
- Strong ability to write and optimize diverse SQL queries; working knowledge of RDBMS and NoSQL databases such as MySQL, SQL Server, HBase, Cassandra, and MongoDB.
- Deep understanding of text mining, data visualization, and project delivery using R packages such as ggplot2, dplyr, caret, twitteR, NLP, rjson, openNLP, tm, GoogleVis, and Shiny.
- Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of the big data ecosystem, including Hadoop 2.0 (HDFS, Hive, Pig, Impala) and Spark (SparkSql, Spark MLlib, Spark Streaming).
- Skilled in building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau 9.4/9.2.
- Excellent understanding of SDLC, Agile, and Scrum.
- Experience with Git for version control.
- Effective team player with strong communication and interpersonal skills and a strong ability to adapt and to learn new technologies and business lines rapidly.
BI Tools: Tableau 9.4/9.2, SharePoint 2016/2013, MS Office (Word/Excel/PowerPoint/Visio)
Languages: Python 3.3/2.7, R 3, SQL, SAS 9.4, VBA, HiveQL, Pig Latin
Big Data Tools: Hadoop 2 (Hive, HDFS, Pig, Impala), Spark 2.1 (SparkSql, MLlib), MapReduce
Operating Systems: Windows 10/8/7, UNIX, Linux
Packages: Python (Numpy, Pandas, Scikit-learn, SFrame, Scipy, Matplotlib, nltk); R (ggplot2, dplyr, caret, twitteR, NLP, openNLP, rjson, tm, GoogleVis, Shiny)
Database: Oracle 11g, MS Access 2013, SQL Server 2014/2012, MySQL 5.5, HBase 1.2, MongoDB 3.2, Cassandra 3.0
Confidential, Stamford, CT
- Continuously collected business requirements during the whole project life cycle
- Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to extract and merge large volumes of historical data stored in Oracle 11g, and validated the ETL-processed data in the target database
- Pulled and aggregated unstructured data from MongoDB
- Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for completeness, duplication, accuracy, and consistency
- Cleaned and reshaped data and generated segmented subsets using Numpy and Pandas in Python
- Applied various machine learning algorithms and statistical models, such as decision trees, logistic regression, and Gradient Boosting Machine, to build predictive models using the scikit-learn package in Python
- Identified the variables that significantly affect the target
- Optimized and compared models using a stepwise function based on AIC values
- Selected models based on confusion matrices, minimizing Type II error
- Produced a cost-benefit analysis quantifying the effect of the model implementation compared with the former situation
- Generated data analysis reports using Matplotlib and Tableau, and presented the results to C-level decision makers
Environment: Tableau 9.4, Python 3.3, Numpy, Pandas, Matplotlib, Scikit-Learn, Machine Learning, MongoDB, Oracle 11g, SQL
Confidential
- Collected business requirements and data analysis needs from other departments
- Parsed and profiled large volumes of varied data, including transactional data, call center history, and customer personal profiles, to learn about behavior across various features
- Processed the primary quantitative and qualitative market research and loaded the survey responses into the database in preparation for data exploration
- Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for duplication, completeness, accuracy, and validity
- Cleaned data and ensured data quality, consistency, and integrity using Numpy and SFrame in Python
- Used k-means clustering to identify outliers and to group unlabeled data
- Applied Principal Component Analysis in feature engineering to analyze high-dimensional data
- Applied various machine learning algorithms and statistical models (decision tree, lasso regression, multivariate regression) to identify key features using the scikit-learn package in Python
- Evaluated models using k-fold cross-validation and the log loss function
- Ensured that the model had a low false positive rate and validated it by interpreting the ROC plot
- Experimented with text mining on customer complaints using nltk in Python
- Built repeatable processes in support of implementation of new features and other initiatives
- Created various types of data visualizations using Tableau
- Presented the results to the product development team to support better decisions
Environment: Python 3.3, Hadoop 2, HiveQL, HBase, MapReduce, Tableau 9.4, Numpy, SFrame, Scikit-Learn, nltk
Confidential, Hartford, CT
- Implemented and delivered all requirements outlined in the contractual agreement between the company and the university
- Prepared and executed complex SparkSql queries involving multiple joins and advanced analytical functions to validate the ETL-processed data in the target database
- Collected data from external sources and integrated it with the primary database. Created a SparkSql context to load data from JSON files and run SQL queries
- Extracted, compiled, and manipulated data to ensure data quality, consistency, and integrity using SFrame in Python
- Built time series (ARIMA) models to capture data patterns and traffic trends, and forecast occupancy rates by parking lot
- Communicated effectively with the business development team and ensured the implementation and completion of initiatives that could increase opportunities
- Delivered data interpretation through interactive analysis reports built in Tableau to identify business solutions and support marketing and operations decisions
Environment: Hadoop 2, Spark, SparkSql, MS Office (Excel), Tableau 9.2, Python, SFrame
Confidential
Healthcare Data Analyst
- Extracted and consolidated information from working data. Conducted primary and secondary competitive intelligence gathering for distribution in impactful bi-weekly and monthly reports
- Performed initial descriptive data analysis on datasets using SAS and generated statistical reports with PROC UNIVARIATE and PROC FREQ
- Conducted hypothesis tests and analysis on the content of clinical datasets to assess quality, completeness, and volumes of data
- Coordinated with the research team and system owners to understand the origins, contents, and structure of datasets and to ensure that research objectives could be met
- Effectively communicated the results and reported to colleagues and partners
- Created decision-driving competitive intelligence reporting from scientific conferences
- Comprehensive knowledge of drug development and commercial landscapes
Environment: SAS 9.4, SAS Enterprise Guide, SQL Server 2012, MS Office 2013 (Access/PowerPoint/Word/Excel), SPSS
Confidential
- Prepared and executed complex SQL queries involving multiple joins and advanced analytical functions to validate the ETL-processed data in the target database
- Completed a client study covering buying behaviors, client profiles, and segmentation
- Analyzed the traffic and business performance of commercial and marketing operations to support continuous improvement of digital devices. Created data visualizations using Shiny in R to track the performance of business campaigns (newsletters, mailings)
- Implemented strategic initiatives using historical data, and built and tested predictive models to better estimate the impact of new campaigns
- Developed a recommendation system applying collaborative filtering and content-based filtering to a large-scale dataset, improving the accuracy and timeliness of customized recommendations
- Created materials emphasizing product knowledge, brand heritage, website user experience, and luxury service to support CRM initiatives and drive sales results
- Assembled monthly product performance analysis for use by C-level executives
Environment: MySQL, R, dplyr, caret, mle2, Shiny
Confidential
- Participated in data entry and data extraction using SQL queries in MySQL
- Identified the key parameters by clearly defining treatment and control groups and marking target audiences who would be incremental and profitable to the business
- Conducted A/B testing for the implementation of new initiatives and produced documentation in support of the web design team
- Created different kinds of charts to visualize data analysis results
- Successfully generated decision-driving reports
Environment: Python 2.7, MySQL, R, MS Office (Excel/PowerPoint/Word), Pandas