Data Scientist Resume
SUMMARY
- 8+ years of experience in the IT industry, encompassing a wide range of skills.
- 3+ years of experience as a Data Scientist specializing in Machine Learning, Predictive Modeling, Data Analysis, Data Mining, Text Mining, Big Data, Data Lineage and Data Visualization with large sets of structured and unstructured data.
- Experience in Data Analysis and Reporting in Finance, Healthcare and Underwriting domains.
- Adept in statistical programming languages such as Python, R and SAS.
- Developed statistical models in R and Python using supervised and unsupervised Machine Learning algorithms such as Linear Regression, Logistic Regression, Classification, Decision Trees, Random Forests, Gradient Boosting, Ensemble Models, KNN, Support Vector Machines, Naïve Bayes, K-Means Clustering, Neural Networks, Principal Component Analysis and Recommender Systems on structured and unstructured data.
- Proficient in managing the entire data science project life cycle, with active involvement in all phases including data collection, data preprocessing, feature engineering, model building, model selection, model tuning and model validation.
- Experience with Python libraries including Matplotlib, NumPy, SciPy, Pandas, Seaborn, Scikit-learn and NLTK for analysis.
- Experience in using various R packages such as ggplot2, caret, dplyr, reshape2, sqldf, apriori, arules, caTools, rpart and randomForest.
- Hands-on experience in Deep Learning, building Neural Networks (Feed-Forward, Convolutional and Recurrent Neural Networks) using Keras.
- Adept with big data tools such as Spark, Hadoop, HDFS, MapReduce, Hive, Pig and Sqoop.
- Experience in Machine Learning with Spark (PySpark, Spark SQL, Spark MLlib, Spark ML).
- Experience with Natural Language Processing along with Topic modeling and Sentiment Analysis.
- Experience with statistical methodologies such as Hypothesis Testing, ANOVA and Chi-Square Test.
- Experience in Cloud Services such as AWS EC2, EMR, Redshift and S3 to support big data tools and address data storage needs.
- Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
- Experience working on analyzing and reporting Facility, Professional and Pharmacy Claims data using SAS.
- Extensive experience in using SAS Procedures, Macros and other SAS applications for data extraction, data cleansing, data loading and reporting.
- Experience working on various RDBMS such as Oracle and MS SQL Server and NoSQL DBs such as Cassandra, DynamoDB.
- Experience in writing advanced SQL programs for sorting and grouping data, joining multiple tables, creating views, indexes, stored procedures and metadata analysis.
- Strong understanding of SDLC in Agile methodology and Scrum process.
- Good industry knowledge, analytical and problem-solving skills, and the ability to work well both within a team and as an individual.
- Highly creative, innovative, committed, intellectually curious and business savvy, with good communication and interpersonal skills.
- Excellent problem-solving skills for delivering useful and compact solutions. Always keen to take on challenges with innovative ideas.
TECHNICAL SKILLS
Programming: R, Python, SAS, SQL, Java, VB Script, C++
Databases: Oracle, SQL Server, MySQL, MS Access, Cassandra
Business Intelligence Tools: Tableau 9.3, MS Excel - Analytical Solver.
Machine Learning: Decision Trees, Naive Bayes classification, OLS, Logistic Regression, Neural Networks, Support Vector Machines, Clustering Algorithms and PCA.
R Packages: dplyr, caret, data.table, reshape, ggplot2, quantmod, sqldf, ggmap, ggvis, FSelector, lattice, randomForest, rpart, lm, glm, nnet, xgboost, ksvm, lda, qda, adabag, adaboost, lars and lasso.
Python packages: NumPy, Pandas, Scikit-learn, SciPy, matplotlib, networkx
SAS Software: SAS/BASE, SAS/MACROS, SAS/SQL, SAS Enterprise Guide
Big Data Technologies: Apache Hadoop, Spark, Sqoop, Hive
Cloud Technologies: AWS EC2, S3, EMR, Redshift
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Worked on identifying data lineage and instrumenting governance across various data movement tools such as NDM, Informatica, DataStage, SFTP and B2Bi. This enterprise-level implementation provides an end-to-end view of data lineage from origination to end reports such as FR Y-14A and CCAR.
- Involved in analyzing business requirements and preparing the functional requirements document and the technical specification document.
- Performed data cleaning, data manipulation, data de-duplication and data aggregation
- Involved in designing the schema, as well as the conceptual, logical and physical models, for the Enterprise Data Repository (EMR).
- Developed a network topology using Python to analyze the data lineage of files/feeds across all enterprise applications as they pertain to derived data domains using networkx, pandas, numpy, pandasql, matplotlib.
- Developed workflows for the complete end-to-end ETL process: getting data into HDFS, copying it into AWS S3, validating it and applying business logic, storing the aggregated data in AWS Redshift for reporting, and monitoring for data quality issues.
- Developed business logic to aggregate the data using Spark and stored the data in Redshift
- Explored Spark (Python) to develop ad hoc reports on the existing data in Redshift using SparkContext, Spark SQL, DataFrames, etc.
- Designed dashboards using Tableau to identify the coverage of movement data exhaust against enterprise applications
- Presented the analytical solutions to senior management
Environment: Python (networkx, pandas, numpy, pandasql, matplotlib, pyodbc), Hadoop, AWS, Amazon S3, Amazon Redshift, Spark, Hive
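A minimal sketch of the kind of lineage traversal described in this role. The original work used networkx; this version uses only the Python standard library, and every feed and application name below is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical file/feed movements: (source application, target application).
MOVEMENTS = [
    ("loan_origination", "staging"),
    ("staging", "risk_warehouse"),
    ("deposits", "risk_warehouse"),
    ("risk_warehouse", "regulatory_report"),
]


def build_graph(edges):
    """Build an adjacency list for a directed lineage graph."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Return every node reachable from `start` using breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


graph = build_graph(MOVEMENTS)
# Everything that ultimately receives data originating in "loan_origination".
print(sorted(downstream(graph, "loan_origination")))
```

Tracing downstream from an originating application gives the set of systems and reports that depend on it, which is the end-to-end view the lineage work aims for.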
Confidential, Jersey City, NJ
Data Scientist
Responsibilities:
- Built a Text Classifier using Python to retrieve counter quote information from insurance policy documents. The goal was an automated process for faster retrieval of the desired information from policy submission documents
- Developed a Bag-of-words model using CountVectorizer and Term Frequency - Inverse Document Frequency (TF-IDF) to identify rare words associated with each label class. Built the Text Classifier on top of this Bag-of-words model.
- Built predictive models including Support Vector Machines, Naïve Bayes, KNN, Random Forest and Gradient Boosting (GBM) techniques
- Developed Deep Neural Networks using Keras to apply to the Text Classifier
- Evaluated the models using metrics such as confusion matrix, sensitivity, specificity and F1 score, and used K-Fold Cross Validation and GridSearchCV to find the best parameter set for each model
- Developed a workflow to import the data into AWS S3 and deploy the model on an EC2 instance.
- Performed data cleaning (field boundary detection, missing value treatment, line breaks, white spaces) to prepare data for building the model
- Authored a program in Python to retrieve the matching Submission information from Amazon RDS to assign the labels for every line in the submission document.
- Developed regular expressions in Python to match the patterns within the data and retrieve the necessary information
Environment: Machine Learning, Python, Text Mining, NLP, NLTK, AWS S3, AWS EC2, Amazon RDS, sklearn, pandas, numpy, neural networks, keras
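A minimal sketch of the TF-IDF weighting behind the Bag-of-words model in this role. The original work used scikit-learn; this version uses only the standard library with the plain, unsmoothed formula tf * log(N / df), so its numbers will differ from scikit-learn's output. The sample lines are hypothetical.

```python
import math
from collections import Counter

# Hypothetical lines from policy submission documents.
DOCS = [
    "premium quote for property coverage",
    "counter quote premium revised",
    "broker contact address",
]


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(DOCS)
# Terms that occur in fewer documents receive a higher weight.
top_term = max(weights[1], key=weights[1].get)
print(top_term, round(weights[1][top_term], 3))
```

Terms shared across many documents are down-weighted, so the highest-weighted terms in a line are the rare ones that help separate label classes.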
Confidential, Boston, MA
Data Scientist
Responsibilities:
- Created predictive models in Python to analyze customer behavior when purchasing an Auto Insurance policy
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, regression models, SVM, Random Forest, Gradient Boosting and XGBoost using the scikit-learn package in Python.
- Collected data needs and requirements by interacting with other departments.
- Performed data cleaning, feature scaling and feature selection using the pandas, scipy and numpy packages in Python.
- Handled class imbalance in the dataset using undersampling, oversampling and Synthetic Minority Over-sampling Technique (SMOTE)
- Performed a training/test data split to better manage the bias-variance tradeoff in the model building process.
- Produced confusion matrices and precision and recall scores to assess the performance of the models. Evaluated models using Cross Validation, ROC curves and AUC.
- Validated and selected models using k-fold cross validation, LOOCV and worked on optimizing models for high recall rate.
- Used GridSearchCV to evaluate each model and to find the best parameter set for each model.
- Using graphical packages, produced ROC curves to visually represent True Positive Rate versus False Positive Rate. Also produced Precision-Recall curves and computed the area under the curve.
- Used Principal Component Analysis in feature engineering to analyze high dimensional data.
- Explored Spark MLlib, Spark's machine learning library, to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Created Data Quality Scripts using SQL to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated the results to the operations team to support decision making.
Environment: Python, sklearn, Machine Learning, Tableau, SQL, Spark, MLlib
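A minimal sketch of how the confusion matrix, precision and recall mentioned in this role relate to one another. The original work used scikit-learn; this standard-library version and its labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = customer purchased a policy, 0 = did not.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Optimizing for a high recall rate means driving down false negatives, usually at the cost of more false positives and therefore lower precision.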
Confidential, Jersey City, NJ
Data Science Graduate Assistant
Responsibilities:
- Performed Text analysis to identify the fraudulent information (fraud name, location, crime type) from the data.
- Used the Stanford NER, NLTK and geocoder packages in Python to perform text and part-of-speech analysis
- Worked on text preprocessing, using techniques such as contraction mapping, word and sentence tokenization, stop word removal, parsing and lemmatization
- Worked on sentiment analysis in Python to analyze the severity of crime/fraudulent activity from the articles.
- Developed regular expressions to effectively identify the patterns in the documents, which improved the efficiency of identifying the fraudulent activity.
- Performed POS tagging, chunking and semantic analysis (word vector analysis, IR techniques)
- Imported the crime data (Insurance related articles) from NoSQL DB Cassandra and exported the predictions to Cassandra
Environment: NLP, Python, NLTK, SQL, Machine Learning
Confidential, Cleveland, OH
Data Scientist
Responsibilities:
- Analyzed business requirements, system requirements and data mapping requirement specifications, and documented functional and supplementary requirements.
- Developed Predictive Models to identify unsatisfied customers by analyzing the customer survey data with variables like past transactions, promotions, response to prior mailings, demographics, interests and hobbies, etc.
- Used R to implement different machine learning algorithms, including Generalized Linear Model, Random Forest, SVM and Gradient Boosting
- Also implemented advanced machine learning models such as Deep Neural Networks to achieve higher accuracy.
- Improved prediction performance by using optimization algorithms such as stochastic gradient descent.
- Tuned parameters with K-Fold Cross Validation and grid search to optimize model performance.
- Produced confusion matrices and precision and recall scores to assess the performance of the models. Evaluated models using Cross Validation, ROC curves and AUC.
- Performed RFM Analysis with independent RFM scoring using R and SQL to understand customer behavior and how it affects the value customers can generate for the client.
- Performed market basket analysis and provided association rules to understand customer’s behavioral evolution. This enabled the client to target current customers by suggesting the products which fall under these conditions and thus promoting the products based on the analysis.
- Performed product clustering analysis to identify products that are likely to appear in the same basket (from market basket analysis) and to make product offer selections for cross-sell and up-sell marketing.
- Carried out segmentation and customer classification in R using K-means clustering and provided association rules to provide specific conditions of buying patterns.
- Extracted data from multiple data sources including SQL Server and Cassandra.
- Prepared data through cleaning, extraction, missing value treatment, transformations and other statistical techniques using R
- Analyzed the data and provided insights about the customers using Tableau.
- Developed advanced SQL programs for sorting and grouping data, joining multiple tables, creating views, indexes and stored procedures
Environment: R (dplyr, ggplot2, sqldf, SQLite, caTools), SQL Server, Cassandra, Tableau
Confidential, Tampa, FL
Programmer Analyst/ Data Analyst
Responsibilities:
- Calculated total payments by beneficiary for inpatient, skilled nursing facilities, home health agencies and hospice claims and total Medicare Part-A payments using various SAS procedures
- Created Member Months data for a certain population of interest and for any period of interest
- Designed programs and analyzed the bucket costs incurred by our client in paying providers for their services.
- Integrated the member data with the cost data to generate Per Member Per Month (PMPM) cost metrics.
- Programmed SAS code to identify primary care physicians using the taxonomy codes present in the Registry
- Cleaned and managed Member Data to handle overlapping and collapsing records with respect to enrollment period
- Created SAS reports to analyze the duration of membership remaining for each member, members who uniquely enrolled with the provider for at least one month, and the total number of member months carried by each provider.
- Designed required Tables, Views, Indexes, Stored Procedures, User Defined Functions and constraints such as Primary Key, Foreign Key, Check and Not Null/Null.
- Merged Member data and Claim data at header level and detail level using complex joins and subqueries.
- Decreased double data entry by deleting the duplicate records and updated the data as mentioned in the client specifications.
- Visualized the cost metrics via Line plots and Bar charts to interpret the variation of costs with respect to months and members using Tableau
Environment: SAS BASE, SAS Enterprise Guide, SQL, Tableau, MS Excel.
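A minimal sketch of the PMPM calculation described in this role: total paid cost in a month divided by the number of members enrolled in that month. The original work was done in SAS; this Python version, its record layout and its sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical claim records: (member_id, month, paid_amount).
CLAIMS = [
    ("A", "2015-01", 120.0),
    ("A", "2015-02", 80.0),
    ("B", "2015-01", 300.0),
]

# Hypothetical enrollment records: (member_id, month enrolled).
ENROLLMENT = [
    ("A", "2015-01"),
    ("A", "2015-02"),
    ("B", "2015-01"),
    ("C", "2015-01"),
]


def pmpm_by_month(claims, enrollment):
    """Return {month: total paid cost / member months} for each enrolled month."""
    cost = defaultdict(float)
    for _, month, paid in claims:
        cost[month] += paid
    # A set per month removes duplicate enrollment rows for the same member.
    members = defaultdict(set)
    for member, month in enrollment:
        members[month].add(member)
    return {month: cost[month] / len(ids) for month, ids in members.items()}


print(pmpm_by_month(CLAIMS, ENROLLMENT))
```

Members with no claims in a month still count in the denominator, which is why enrollment data must be integrated with cost data before the metric is computed.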
Confidential, Troy, Michigan
Programmer Analyst Trainee
Responsibilities:
- Worked on Loan Application data, Customer Demographics data and Customer Bank Transaction data to analyze customers' willingness to proceed with the application after applying for a loan (Acceptance rate).
- Analyzed customer behavior with respect to existing loan payments by generating several payment reports.
- Authored SQL Code to deliver reports on LTV ratio for various customers
- Generated reports to track the type of documents like Income Proof, Address Proof and Identity proof submitted by the customer.
- Calculated the amount of time taken by the customer to submit requested documents.
- Designed SQL procedures to calculate the monthly average balance (MAB) & monthly weighted average for each customer.
- Authored SQL code to produce the results of loan approval.
- Generated reports to track the weekly and monthly deposits and withdrawals of each customer
- Calculated the mean statistics of customers for each type of loan across different cities
- Extensively used SQL joins to merge various datasets to generate Customer Transactional dataset
Environment: SQL Server, MS Excel.
