- A passionate Data Scientist with over 5 years of experience in Data Mining, Data Modeling, Data Visualization and Machine Learning, with domain knowledge and experience in the Healthcare, Banking and Travel industries.
- Proficient in data preparation, including Data Extraction, Data Cleansing, Data Validation and Exploratory Data Analysis, to ensure data quality.
- Expert in Feature Engineering, including both Feature Selection and Feature Extraction.
- Strong skills in machine learning algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors, K-means Clustering, Neural Networks and Ensemble Methods.
- Familiar with Recommendation System Design by implementing Collaborative Filtering, Matrix Factorization and Clustering Methods.
- Experienced with Natural Language Processing, including Topic Modeling and Sentiment Analysis.
- Experienced with statistical methodologies such as Hypothesis Testing, ANOVA and the Chi-Square Test.
- Ability to write SQL queries for various RDBMS such as SQL Server, MySQL, Teradata and Oracle; worked on NoSQL databases such as MongoDB and Cassandra to handle unstructured data.
- Experienced with the Kafka streaming platform.
- In-depth understanding of building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau and SSRS.
- Expertise in Python programming with various packages including NumPy, Pandas, SciPy and scikit-learn.
- Proficient in Data visualization tools such as Tableau, Plotly, Python Matplotlib and Seaborn.
- Familiar with the Hadoop ecosystem, including HDFS, HBase, Hive, Pig and Oozie.
- Experienced in building models using Spark (PySpark, SparkSQL, Spark MLlib, Spark ML).
- Experienced in cloud services such as AWS EC2, EMR, RDS and S3 to support big data tools, data storage and deployment solutions.
- Experienced with issue-tracking and collaboration tools such as Jira and Confluence, and version control tools such as GitHub.
- Worked on deployment tools such as Azure Machine Learning Studio, Oozie, AWS Lambda.
- Strong understanding of SDLC in Agile methodology and Scrum process.
- Strong experience working in fast-paced, multi-tasking environments, both independently and as part of a collaborative team. Comfortable with challenging projects and with working through ambiguity to solve complex problems. A self-motivated, enthusiastic learner.
Database: Microsoft SQL Server 2008/2012/2014/2016, Oracle 9i/10g, MySQL 5.5/5.6, MongoDB 3.x, Hadoop HBase, AWS RDS, Kafka 0.10
ETL Tools: SSIS, Informatica
Data Visualization and Reporting: Tableau 9.x/10.x, Python Seaborn/Matplotlib, Plotly, SSRS
Deployment Tools: Azure Machine Learning Studio, Oozie 4.2, AWS Lambda, Anaconda Enterprise v5
Hadoop Ecosystem: HDFS, MapReduce, Spark 2.x (PySpark, SparkSQL, Spark MLlib), Pig 0.15, Hive 1.x/2.x, HBase 0.98, ZooKeeper 3.4
Machine Learning: Regression analysis, Naive Bayes, Decision Tree, Random Forest, Support Vector Machine, Neural Network, KNN, Ensemble Methods, K-Means Clustering, Natural Language Processing (NLP), Sentiment Analysis, Latent Dirichlet Allocation, Collaborative Filtering
Operating System: Windows XP/7/8/10, Linux (Ubuntu 12.04/14.04/16.04)
Confidential, White Plains, NY
- Collected and analyzed business requirements to understand the specific Fraud/AML challenges the client faced.
- Participated in data integration with the Data Engineering team to combine traditional transaction data with external source data.
- Transferred data from the SQL Server database to Hadoop clusters set up on AWS EMR.
- Conducted data cleansing and feature engineering using Python with NumPy and Pandas.
- Implemented Naive Bayes, Logistic Regression, SVM, Random Forest and Gradient Boosting with a weighted loss function using Python scikit-learn.
- Implemented multi-layer Neural Networks using Google TensorFlow and Spark.
- Performed extensive Behavioral modeling and Customer Segmentation to discover behavior patterns of customers by using K-means Clustering.
- Scheduled and managed batch model runs using Oozie.
- Updated and saved fraud predictions to AWS S3 for the application team.
- Tested the business performance of the AML models by evaluating detection rate and false positive rate, and continuously improved the models.
- Created Tableau reports and dashboards to communicate data insights, significant features, model scores and the performance of the new transaction monitoring system to both technical and business teams.
- Used GitHub for version control with the Data Engineering team and fellow Data Scientists.
Environment: SQL Server 2014, Hadoop 2.0, Hive 2.0, Spark (PySpark, SparkSQL), Python 3.X, TensorFlow, Oozie 4.2, Tableau 10.X, AWS S3/EC2/EMR, GitHub
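As an illustration of the model evaluation described in this role, detection rate and false positive rate can be computed from labeled outcomes roughly as follows. This is a minimal sketch, not the project's actual code: the function name and the sample labels are hypothetical, and labels are assumed to be binary with 1 marking a fraudulent or suspicious transaction.

```python
def detection_and_false_positive_rate(y_true, y_pred):
    """Return (detection rate, false positive rate) for binary labels.

    Assumption: 1 = flagged/fraud, 0 = legitimate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

    # Detection rate is the share of true fraud cases that were flagged.
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate is the share of legitimate cases that were flagged.
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


# Hypothetical sample data for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
rates = detection_and_false_positive_rate(y_true, y_pred)
print(rates)
```

Tracking both numbers together matters in transaction monitoring, since raising the detection rate by flagging more transactions usually raises the false positive rate as well.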
Confidential, Bronx, NY
- Reverse-engineered demo reports to understand the data in the absence of documentation.
- Produced new data mapping documentation and redefined the requirements in detail.
- Built Data Marts to gather the required tables (Member info, Claim info, Transaction info, Appointment info, Diagnosis info) from the SQL Server database.
- Created ETL packages using SSIS to transform data into the right format and join tables to obtain all required features.
- Processed transaction data using Python pandas to identify outliers and inconsistencies.
- Conducted exploratory data analysis using Python NumPy and Seaborn to gain insight into the data and validate each feature through charts and graphs.
- Built predictive models, including Linear Regression, Lasso Regression, Random Forest Regression and Support Vector Regression, to predict the claim closing gap using Python scikit-learn.
- Used GridSearchCV to evaluate each model and find its best parameter set.
- Created reports and an app demo using Tableau to show the client how predictions can help the business.
- Deployed and hosted the models using Azure Machine Learning Studio and shared an API with the application development team.
- Used Confluence to share and collaborate on projects with team members and to keep documentation up to date.
Environment: SQL Server 2012, SQL Server Data Tools 2010, SQL Server Integration Services, Python 2.7/3.3, Tableau 9.4, Azure Machine Learning Studio
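As an illustration of the outlier identification mentioned in this role, one common approach is the interquartile range (IQR) rule. The sketch below is an assumption about technique, not the method actually used on the project; the function name, the 1.5 multiplier and the sample amounts are hypothetical, and it uses only the Python standard library rather than pandas.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts for illustration only.
amounts = [100, 102, 98, 101, 99, 103, 97, 500]
print(iqr_outliers(amounts))
```

Flagged values would still need manual review, since an extreme amount can be a legitimate transaction rather than a data error.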
Confidential, Morristown, NJ
Junior Data Scientist
Responsibilities:
- Communicated and coordinated with other departments to gather business requirements.
- Gathered data from multiple sources and applied resampling methods to handle imbalanced data.
- Worked with the ETL team and doctors to understand the data and define a uniform standard format.
- Conducted data cleansing by using advanced SQL queries in SQL Server Database.
- Split the data into smaller datasets by diagnosis and took charge of exploratory data analysis for three of the diagnosis datasets (diabetes, cold/flu, allergy).
- Created the full data preprocessing pipeline (imputing, scaling, label encoding) using Python pandas to prepare the data for modeling.
- Built predictive models using Python scikit-learn, including Support Vector Machine, Decision Tree, Naive Bayes Classifier and Neural Network, to predict potential readmission cases.
- Applied ensemble methods, including Gradient Boosting, Random Forest and a customized ensemble method, to produce more accurate predictions.
- Designed and implemented cross-validation and statistical tests, including Hypothesis Testing, ANOVA and the Chi-Square Test, to verify the models' significance.
- Created an API using Flask, shared the idea with the application team and helped them define the requirements of the new application.
- Used Agile methodology and the Scrum process for project development.
Environment: SQL Server 2012, SQL Server Integration Services, Python 2.7, Jupyter Notebook, Flask 0.10, SharePoint 2013
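As an illustration of the resampling step mentioned in this role, random oversampling of the minority class can be sketched as follows. This is an assumption about technique, since the resume does not say which resampling method was used; the function name, seed and sample rows are hypothetical, and the sketch uses only the Python standard library.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible

    # Group rows by their class label.
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)

    target = max(len(group) for group in by_label.values())

    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra rows with replacement to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: four non-readmitted cases (0), two readmitted (1).
rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 0, 1, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.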
- Gathered user and project requirements from business users and IT managers, translated them into functional and non-functional specifications, and created documentation for the project.
- Assisted in design and data modeling efforts of Data Marts and Enterprise Data Warehouse.
- Used T-SQL in SQL Server to develop complex stored procedures, triggers, clustered and non-clustered indexes, views and User-Defined Functions (UDFs).
- Designed SSIS packages to extract, transform and load existing data into SQL Server, using many SSIS components such as Pivot Transformation, Fuzzy Lookup, Derived Column, Conditional Split, Aggregate, Execute SQL Task, Data Flow Task and Execute Package Task.
- Created SSIS packages that handled different source formats (text files, XML, database tables).
- Debugged and troubleshot the ETL packages in SSIS by using breakpoints, analyzing the process flow and capturing error information with SQL commands.
- Created SSRS reports of various types, such as tabular, matrix, drill-down and chart reports, in accordance with user requirements.
- Maintained and updated existing reports, and analyzed the SQL queries and logic behind them to improve performance.
- Helped deploy reports, including configuring scheduling, subscriptions and history snapshots.
- Developed in an Agile environment throughout the project.
Environment: SQL Server 2008/2012, SQL Server Management Studio (SSMS), MS BI Suite (SSIS, SSRS)