Sr. Data Scientist Resume
New York, NY
SUMMARY:
- Over 6+ years of data analysis experience encompassing machine learning, data mining with large datasets of structured and unstructured data, data acquisition, data validation, predictive modeling, and data visualization.
- Hands-on experience with machine learning algorithms such as regression analysis, clustering, boosting, classification, and principal component analysis, as well as data visualization tools.
- Strong programming skills in a variety of languages such as Python, R, SAS, and SQL.
- Proficient in machine learning techniques (decision trees, linear/logistic regression, random forests, SVM, Bayesian methods, k-nearest neighbors).
- Statistical modeling in forecasting/predictive analytics, segmentation methodologies, regression-based models, hypothesis testing, and factor analysis/PCA.
- Experience designing visualizations using Tableau and ggplot2, building storylines on web and desktop platforms, and publishing and presenting dashboards.
- Hands-on experience implementing LDA and Naive Bayes; skilled in decision trees, random forests, linear and logistic regression, SVM, clustering, and neural networks, with good knowledge of recommender systems.
- Adept in statistical programming languages such as R and Python, including Big Data technologies like Hadoop and Hive.
- Experience developing SQL procedures on complex datasets for data cleaning and automating reports.
- Experience developing SAS macros for ad-hoc reporting in SAS Enterprise Guide using Query Builder and SQL.
- Knowledge of Teradata tools such as SQL Assistant and Microsoft SQL Server for accessing and manipulating data on ODBC-compliant database servers.
- Expertise in transforming business requirements into models, designing algorithms, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Good domain knowledge of retail, payment processing, supply chain, and healthcare.
- Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
PROFESSIONAL EXPERIENCE:
Confidential, New York
Sr. Data Scientist
Roles & Responsibilities:
- Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions.
- Built machine learning models to identify fraudulent loan pre-approval applications and fraudulent credit card transactions from customer transaction history using supervised learning methods.
- Extracted data from the database, copied it into HDFS, and used Hadoop tools such as Hive and Pig to retrieve the data required for building models.
- Worked on data cleaning and ensured data quality, consistency, and integrity using Pandas and NumPy.
- Tackled a highly imbalanced fraud dataset using sampling techniques such as down-sampling, up-sampling, and SMOTE (Synthetic Minority Over-sampling Technique) with Python Scikit-learn.
- Used PCA and other feature engineering techniques to reduce the high-dimensional data, along with feature normalization and label encoding, using the Scikit-learn library in Python.
- Used Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn in Python to develop various machine learning models such as logistic regression, gradient-boosted decision trees, and neural networks.
- Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
- Experimented with ensemble methods, including various bagging and boosting techniques, to increase the accuracy of the trained model.
- Implemented a Python-based distributed random forest via PySpark and MLlib.
- Used AWS S3, DynamoDB, AWS Lambda, and AWS EC2 for data storage and model deployment.
- Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.
- In the preprocessing phase, used Pandas to handle missing data, cast data types, and merge or group tables for the EDA process.
- Used Scikit-learn preprocessing techniques, including PCA, feature engineering, feature normalization, and label encoding, to reduce the high-dimensional data (>150 features).
- In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
- Experimented with predictive models, including logistic regression, support vector machines (SVC), and random forests from Scikit-learn, XGBoost, LightGBM, and neural networks in Keras, to predict show-up probability and visit counts.
- Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold, and hold-out schemes, to test and verify the models' significance.
- Implemented, tuned and tested the model on AWS Lambda with the best performing algorithm and parameters.
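The up-/down-sampling idea described above can be sketched with `sklearn.utils.resample` (SMOTE itself lives in the separate `imbalanced-learn` package, not Scikit-learn); the toy fraud-like dataset below is an illustrative assumption, not data from the original project:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Split by class, then up-sample the minority class with replacement
# until it matches the majority class size.
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up, y_min_up = resample(
    X_min, np.ones(len(X_min), dtype=int),
    replace=True, n_samples=len(X_maj), random_state=0,
)

# Recombine into a balanced training set.
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int), y_min_up])
```

Down-sampling is the mirror image: resample the majority class with `replace=False` down to the minority class size.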
Environment: Oracle 11g, Hadoop 2.x, HDFS, Hive, Pig Latin, Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, AWS, GitHub, Linux, machine learning algorithms, Tableau.
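The dimensionality-reduction and stratified cross-validation steps listed for this role could look roughly like the following Scikit-learn sketch; the synthetic dataset, component count, and model choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a high-dimensional dataset (>150 features).
X, y = make_classification(n_samples=300, n_features=160,
                           n_informative=20, random_state=0)

# Scale, project down to 20 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=20),
                      LogisticRegression(max_iter=1000))

# Stratified k-fold keeps the class ratio constant in every fold,
# which matters on imbalanced data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
```

Putting the scaler and PCA inside the pipeline ensures they are re-fit on each training fold, avoiding leakage from the held-out fold.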
Confidential, Des Moines, Iowa
Sr. Data Scientist
Roles & Responsibilities:
- Led the full machine learning system implementation process: collecting data, model design, feature selection, system implementation, and evaluation.
- Worked with machine learning algorithms such as regressions (linear, logistic), SVMs, and decision trees.
- Developed a Machine Learning test-bed with different model learning and feature learning algorithms.
- Through a thorough, systematic search, demonstrated performance surpassing the state of the art (deep learning).
- Used text mining and NLP techniques to find the sentiment about the organization.
- Developed unsupervised machine learning models in the Hadoop/Hive environment on AWS EC2 instance.
- Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
- Worked with data-sets of varying degrees of size and complexity including both structured and unstructured data.
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; also performed gap analysis.
- Used the R programming language to graphically examine the datasets and gain insights into the nature of the data.
- Implemented predictive analytics and machine learning algorithms to forecast key metrics, delivered as dashboards on AWS (S3/EC2) and the Django platform for the company's core business.
- Performed data wrangling to clean, transform, and reshape the data using the NumPy and Pandas libraries.
- Contribute to data mining architectures, modeling standards, reporting, and data analysis methodologies.
- Conduct research and make recommendations on data mining products, services, protocols, and standards in support of procurement and development efforts.
- Involved in defining the Source to Target data mappings, Business rules, data definitions.
- Worked with different data science teams and provided respective data as required on an ad-hoc basis.
- Assisted both application engineering and data scientist teams in mutual agreements/provisions of data.
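One common way to use K-Means for outlier detection, as mentioned above, is to flag points that sit unusually far from their assigned cluster centroid. This is a minimal sketch on synthetic data; the cluster count and the percentile threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two tight clusters plus one obvious outlier (illustrative data).
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
               rng.normal(5, 0.2, size=(50, 2)),
               [[20.0, 20.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of its assigned cluster.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points beyond the 99th-percentile distance as outliers.
outliers = np.where(dists > np.percentile(dists, 99))[0]
```

The same `km.labels_` array doubles as a classification of the unlabeled points, which matches the dual use described in the bullet above.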
Environment: RStudio 3.5.1, AWS S3, NLP, EC2, neural networks, SVM, decision trees, MLbase, ad hoc, Mahout, NoSQL, PL/SQL, MDM, MLlib & Git.
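A sentiment-analysis pipeline of the kind mentioned for this role (text mining plus NLP) is often built as TF-IDF features feeding a linear classifier. The tiny corpus and labels below are purely illustrative; real work would use a labeled dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; 1 = positive sentiment, 0 = negative.
texts = ["great service and friendly staff",
         "terrible experience, very slow",
         "loved the support team",
         "awful response time",
         "excellent product",
         "poor quality and bad support"]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF turns each document into a weighted term vector;
# logistic regression then learns a sentiment boundary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
pred = model.predict(["friendly and excellent support"])
```

At this toy scale the prediction is not meaningful; the point is the pipeline shape: vectorizer and classifier fit together so the vocabulary is learned only from training text.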
Confidential, San Francisco, CA
Data Scientist
Roles & Responsibilities
- Responsible for retrieving data from the database using SQL/Hive queries and performing analysis enhancements.
- Used R, SAS and SQL to manipulate data, and develop and validate quantitative models.
- Worked as a RLC (Regulatory and Legal Compliance) Team Member and undertook user stories (tasks) with critical deadlines in Agile Environment.
- Applied regression to identify the probability of the agent's location with respect to the insurance policies sold.
- Used advanced Microsoft Excel functions such as pivot tables and VLOOKUP to analyze the data and prepare programs.
- Performed various statistical tests to give the client a clear understanding of the data.
- Actively involved in Analysis, Development and Unit testing of the data and delivery assurance of the user story in Agile Environment.
- Cleaned data by analyzing and eliminating duplicate and inaccurate entries using R.
- Experienced in retrieving unstructured data from different sites in formats such as HTML and XML.
- Worked with data frames and other data interfaces in R for retrieving and storing the data.
- Responsible for ensuring that the data is accurate, with no outliers.
- Applied various machine learning algorithms such as Decision Trees, K-Means, Random Forests and Regression in R with the required packages installed.
- Applied K-Means algorithm in determining the position of an Agent based on the data collected.
- Read data from various file formats, including HTML, CSV, and sas7bdat, using SAS/Python.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Coded, tested, debugged, implemented and documented data using R.
- Researched multi-layer classification algorithms and built natural language processing models through ensembling.
- Worked with Quality Control Teams to develop Test Plan and Test Cases.
- Worked closely with data scientists to assist on feature engineering, model training frameworks, and model deployments implementing documentation discipline.
- Involved with Data Analysis primarily Identifying Data Sets, Source Data, Source Meta Data, Data Definitions and Data Formats.
- Worked with the ETL team to document the Transformation Rules for Data Migration from OLTP to Warehouse Environment for reporting purposes.
- Performed data testing, tested ETL mappings (Transformation logic), tested stored procedures, and tested the XML messages.
- Created Use cases, activity report, logical components to extract business process flows and workflows involved in the project using Rational Rose, UML and Microsoft Visio.
- Involved in development and implementation of SSIS, SSRS and SSAS application solutions for various business units across the organization.
- Developed mappings to load Fact and Dimension tables, SCD Type 1 and SCD Type 2 dimensions and Incremental loading and unit tested the mappings.
- Wrote test cases, developed Test scripts using SQL and PL/SQL for UAT.
- Created and modified T-SQL queries per business requirements, and worked on creating role-playing dimensions, factless fact tables, and snowflake and star schemas.
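The duplicate-elimination and outlier checks described above (done in R in this role) can be sketched in Python with Pandas; the dataset, the column names, and the 1.5 × IQR rule are illustrative assumptions:

```python
import pandas as pd

# Illustrative dataset with one exact-duplicate row and one extreme value.
df = pd.DataFrame({
    "agent_id": [1, 2, 2, 3, 4, 5],
    "premium":  [100.0, 110.0, 110.0, 95.0, 105.0, 10_000.0],
})

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Flag outliers on the premium column with the 1.5 * IQR rule.
q1, q3 = df["premium"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["premium"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[mask]
```

In practice the IQR multiplier (and whether to drop or cap outliers) is a judgment call that depends on the downstream model.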
Environment: R 3.5, Decision Trees, K-Means, Random Forests, Microsoft Excel, Agile, SAS, SQL, NLP
Confidential
Data Scientist
Roles & Responsibilities:
- Responsible for data identification, collection, exploration, and cleaning for modeling; participated in biological model development.
- Performed data analysis using industry leading text mining, data mining, and analytical tools and open source software.
- Used Jira for defect tracking and project management.
- Worked on writing data to, as well as reading data from, CSV and Excel file formats.
- Visualized, interpreted, and reported findings, and developed strategic uses of data with R libraries such as ggplot2 and resources such as The Cancer Genome Atlas (TCGA) Data Portal, ClinVar, and ENCODE.
- Responsible for loading, extracting and validation of client data.
- Created statistical analyses using distributed and standalone models to build various diagnostic, predictive, and prescriptive solutions.
- Performed missing-value treatment, outlier detection, and anomaly treatment using statistical methods, deriving customized key metrics with R packages.
- Evaluated models using Cross Validation, Log loss function, ROC curves and used AUC for feature selection.
- Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and the MLlib library.
- Experienced in parsing JSON within R and turning R data frames into JSON using MongoDB.
- Experienced in using the robmixglm v1.0-2 package to implement robust generalized linear models (GLMs) using a mixture method.
- Worked with several R packages including knitr, dplyr, SparkR, CausalInfer, and spacetime.
- Strong data visualization skills with ggplot2, Shiny, and Plotly, creating charts such as heat maps, bar charts, and line charts.
- Responsible for creating/revising and implementing standard operating procedures (SOPs), laboratory records, and other related documentation.
- Analyzed the ClinVar data for the NonHotspot rule proposal.
- Performed UAT testing for patient variant files.
- Experienced in handling complex database objects like Stored Procedures, Functions, Packages and Triggers using SQL and PL/SQL.
- Worked on production issues and resolving production tickets.
- Involved in the integration of multiple layers in the application.
- Knowledge of generating Hibernate mapping files and Java classes, including creating reverse-engineering files and generating Hibernate mapping files and POJOs from a database.
- Basic knowledge of creating an XML configuration file for Hibernate database connectivity.
- Responsible for reviewing PIK3CA novel aMOI reports and the sub-protocol document for Arm Z1F; summarized the evidence used for the sub-protocol variants and compared it to that used for the novel aMOIs.
- Performed QA testing on the application.
- Held meetings with the client and delivered the entire project with limited help from the client.
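The model-evaluation step listed earlier for this role (log loss, ROC curves, and AUC) might look like the following Scikit-learn sketch; the synthetic dataset and the logistic-regression model are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for real patient data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)  # ranking quality: 0.5 = random, 1.0 = perfect
ll = log_loss(y_te, proba)        # penalizes confident wrong predictions
```

Both metrics consume predicted probabilities rather than hard labels, which is why `predict_proba` is used instead of `predict`.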
Environment: Java 1.8, Core Java, Eclipse, Apache Tomcat 5.0, JSP, XML, JIRA, RDBMS, SQL, JSON, JavaScript, HTML5, CSS3, Git, PL/SQL, GRID, Linux.