
Data Scientist Resume

Buffalo, NY


  • Around 8 years of experience in the IT industry encompassing a wide range of skills
  • Over 4 years of experience in Data Governance, Data Lineage, Data Analysis, Machine Learning, and Data Mining with large data sets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization
  • Experience in Data Analysis and Reporting in Finance, Healthcare and Underwriting domains.
  • Adept in statistical programming languages like Python, R and SAS
  • Hands-on experience with big data tools such as Hadoop, Spark, Hive, PySpark, Spark SQL and Sqoop.
  • Proficient in managing the entire data science project life cycle, and actively involved in data acquisition, data cleaning, data engineering, feature scaling and feature engineering, statistical modeling (Random Forests, Decision Trees, Linear and Logistic Regression), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross validation, and data visualization.
  • Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison and validation.
  • Experience with metadata, data lineage, defining data quality policies, procedures and standards, business glossary, data dictionary, etc.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Experience in using various packages in R and Python such as ggplot2, caret, dplyr, gmodels, NLP, reshape2, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, networkx and pandasql.
  • Experience in Text Analytics, generating data visualizations using R, Python.
  • Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
  • Experience working on analyzing and reporting Facility, Professional and Pharmacy Claims data using SAS.
  • Excellent experience in utilizing SAS procedures, macros, and other SAS applications for data extraction, data cleansing, data loading and reporting.
  • Experience working on various RDBMS such as Oracle and MS SQL Server.
  • Experience in writing advanced SQL programs for sorting and grouping data, joining multiple tables, creating views, indexes, stored procedures and metadata analysis.
  • Extensive experience in writing functional specifications, translating business requirements into technical specifications, and creating/maintaining/modifying database design documents with detailed descriptions of logical entities and physical tables.
  • Good industry knowledge, analytical and problem-solving skills, and the ability to work well within a team as well as individually.
  • Highly creative, innovative, committed, intellectually curious, business savvy with good communication and interpersonal skills.
  • Excellent problem-solving skills for delivering useful and compact solutions; always keen to take on challenges with innovative ideas.


Programming: R, Python, SAS, SQL, Java, VB Script, C++

Databases: Oracle, SQL Server, MySQL, MS Access

Business Intelligence Tools: Tableau 9.3, MS Excel - Analytical Solver.

Machine Learning: Decision Trees, Naive Bayes classification, OLS, Logistic Regression, Neural Networks, Support Vector Machines, Clustering Algorithms and PCA.

R Packages: dplyr, caret, data.table, reshape, ggplot2, quantmod, sqldf, ggmap, ggvis, FSelector, lattice, randomForest, rpart, lm, glm, nnet, xgboost, ksvm, lda, qda, adabag, adaboost, lars and lasso.

Python packages: Numpy, Pandas, Scikit-learn, SciPy, matplotlib, networkx

SAS Software: SAS/BASE, SAS/MACROS, SAS/SQL, SAS Enterprise Guide

Big Data Technologies: Apache Hadoop, Hive and Spark


Confidential, Buffalo, NY

Data Scientist


  • Worked with the Data Governance group to identify, classify and define each assigned Critical Data Element (CDE) and ensure that each element has a clear and unambiguous definition.
  • Analyzed data lineage processes and documentation for the CDEs to identify vulnerable points, control gaps, data quality issues, and overall lack of data governance.
  • Proposed data checks and standard operational procedures on the source systems to enhance data quality.
  • Reviewed various project management documents, such as the Business Requirements document and the Functional Specification document, and suggested changes to ensure compliance with policies and standards.
  • Worked with the Data Governance group in creating a custom data dictionary template to be used across the various business lines.
  • Worked with data stewards to ensure awareness of data quality standards and data requirements.
  • Linked data lineage to data quality and business glossary work within the overall data governance program.
  • Defined process metrics to measure data implementations.
  • Managed communication and training with data owners/stewards to ensure awareness of policies and standards.
  • Gathered requirements by working with business users on the Business Glossary, Data Dictionary and Reference Data.

Environment: Data Governance, Data Lineage, Data Quality, Data Checks, Standard Operational Procedures
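
The kinds of data checks proposed in this role, such as completeness and uniqueness of a Critical Data Element, can be sketched as simple rule functions. This is an illustrative sketch only; the field names and sample records are invented, not the client's actual implementation.

```python
# Illustrative data quality checks for a Critical Data Element (CDE).
# Field names and sample records are made up for the example.

def check_completeness(records, field):
    """Fraction of records in which the CDE is populated."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def check_uniqueness(records, field):
    """Fraction of populated CDE values that are unique."""
    values = [r[field] for r in records if r.get(field) not in (None, "")]
    return len(set(values)) / len(values)

records = [
    {"customer_id": "C1", "tax_id": "111"},
    {"customer_id": "C2", "tax_id": "222"},
    {"customer_id": "C3", "tax_id": ""},     # CDE not populated
    {"customer_id": "C4", "tax_id": "222"},  # duplicate CDE value
]

completeness = check_completeness(records, "tax_id")  # 3 of 4 populated
uniqueness = check_uniqueness(records, "tax_id")      # 2 unique of 3 values
```

A production version would typically run such rules per source system and report failures against agreed thresholds.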

Confidential, Jersey City, NJ

Data Scientist


  • Worked on identifying data lineage and instrumenting governance across various data movement tools such as NDM, Informatica, DataStage, SFTP and B2Bi. This enterprise-level implementation provides an end-to-end view of data lineage from origination to end reports such as FR Y-14A and CCAR.
  • Involved in analyzing business requirements and preparing the functional requirements document and the technical specification document.
  • Performed data cleaning, data manipulation, data de-duplication and data aggregation
  • Involved in designing the schema for the Enterprise Data Repository (EMR), as well as its conceptual, logical and physical models.
  • Developed a network topology using Python to analyze the data lineage of files/feeds across all enterprise applications as they pertain to derived data domains using networkx, pandas, numpy, pandasql, matplotlib.
  • Performed ad hoc tasks such as fuzzy string matching in Python between the file names exposed by data movement and their corresponding feed names, scoring the matches using Levenshtein distance.
  • Worked on exporting the data from Hadoop to SQL Server using Sqoop.
  • Rewrote the Python scripts in PySpark so they could run at big data scale.
  • Designed dashboards using Tableau to identify the coverage of movement data exhaust against enterprise applications
  • Presented the analytical solutions to the senior management

Environment: Python (networkx, pandas, numpy, pandasql, matplotlib, pyodbc), SQL Server, Hadoop, Hive, PySpark, Sqoop
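
The fuzzy matching of data-movement file names against feed names, scored by Levenshtein distance, can be sketched from scratch as below. The feed names are invented for the example; the production work scored matches across the real enterprise inventory.

```python
# Sketch of fuzzy name matching by Levenshtein (edit) distance.
# Feed names below are made up for the example.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_match(name, candidates):
    """Candidate feed name with the smallest edit distance to `name`."""
    return min(candidates, key=lambda c: levenshtein(name, c))

feeds = ["CUST_DAILY_FEED", "TXN_MONTHLY_FEED", "RISK_EOD_FEED"]
match = best_match("CUST_DAILY_FEAD", feeds)  # typo still matches
```

In practice a distance threshold (or a normalized similarity score) decides whether a match is accepted or flagged for manual review.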

Confidential, Jersey City, NJ

Data Science/Analytics Intern


  • Performed RFM analysis on underwriting data to understand customer behavior and its impact on the value generated for the client, with independent RFM scoring using R and SQL.
  • Performed market basket analysis and provided association rules to understand customer’s behavioral evolution and predict churn. This enabled the client to target current customers who fall under these conditions and thus promoting the products based on the analysis.
  • Performed product clustering analysis to identify products that are more likely to appear in the same basket and to select product offers for cross-sell and up-sell marketing.
  • Carried out segmentation and customer classification in R using K-means clustering and provided association rules to provide specific conditions of buying patterns.
  • Prepared data by cleaning, extraction, missing value treatment, transformations, and other statistical techniques using R.
  • Analyzed the data and provided insights about the customers using Tableau.
  • Developed advanced SQL programs for sorting and grouping data, joining multiple tables, creating views, indexes and stored procedures

Environment: R, R Studio, SQL, Tableau
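
The independent RFM scoring described above was done in R and SQL. A minimal Python sketch of the idea, with invented customer data, scores each dimension 1-3 by ranking a customer against the population (lower recency is better; higher frequency and monetary value are better).

```python
# Toy sketch of independent RFM scoring; customer data is invented.

def tertile_score(value, population, higher_is_better=True):
    """Score 1-3 by rank: best third -> 3, middle -> 2, worst -> 1."""
    ranked = sorted(population, reverse=higher_is_better)
    return 3 - (3 * ranked.index(value)) // len(ranked)

customers = {
    "A": {"recency": 5,  "frequency": 20, "monetary": 900},
    "B": {"recency": 40, "frequency": 8,  "monetary": 300},
    "C": {"recency": 90, "frequency": 2,  "monetary": 50},
}

rfm = {}
for name, c in customers.items():
    r = tertile_score(c["recency"],
                      [x["recency"] for x in customers.values()],
                      higher_is_better=False)  # recent purchase = better
    f = tertile_score(c["frequency"], [x["frequency"] for x in customers.values()])
    m = tertile_score(c["monetary"], [x["monetary"] for x in customers.values()])
    rfm[name] = f"{r}{f}{m}"  # e.g. "333" = best segment
```

"Independent" scoring means each dimension is ranked on its own, rather than nesting frequency ranks within recency ranks.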

Confidential, Boston, MA

Data Analytics/Analyst


  • Created predictive models to analyze customer behavior in the purchase of an Auto Insurance policy using R and Python.
  • Applied various machine learning algorithms and statistical models such as decision trees, regression models, SVM and clustering to predict volume using the scikit-learn package in Python.
  • Collected data needs and requirements by interacting with other departments.
  • Used clustering technique K-Means to identify outliers and to classify unlabeled data.
  • Performed data cleaning, feature scaling and feature engineering using the pandas and numpy packages in Python.
  • Performed a training data/test data split to better manage the variance/bias tradeoff in the model building process.
  • Produced confusion matrices and classification reports to visualize the performance of the logistic regression model in terms of accuracy, precision and recall scores. Evaluated models using cross validation, ROC curves and AUC.
  • Using graphical packages, produced ROC curves to visually represent the True Positive Rate versus the False Positive Rate, and likewise visualized the Precision-Recall curve and its area under the curve.
  • Used Principal Component Analysis in feature engineering to analyze high dimensional data.
  • Used Spark MLlib, Spark’s Machine learning library to build and evaluate different models.
  • Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
  • Created and designed reports that will use gathered metrics to infer and draw logical conclusions of past and future behavior.
  • Communicated the results with operations team for taking best decisions.

Environment: Python, R, Plotly, Hadoop, Spark, MLlib
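
The evaluation steps described above (confusion matrix, precision/recall, ROC AUC) can be reproduced from scratch on toy data; in practice scikit-learn's metrics module does the same job. Labels and scores below are invented.

```python
# From-scratch sketch of binary classification evaluation on toy data.

def confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def roc_auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]              # invented ground truth
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # invented model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)
```

Note that AUC is threshold-free, while precision and recall depend on the 0.5 cutoff chosen above.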

Confidential, Cleveland, OH

Data Analytics/Analyst


  • Analyzed business requirements, system requirements, data mapping requirement specifications, and responsible for documenting functional requirements and supplementary requirements.
  • Performed data aggregation, data preprocessing, data cleaning, descriptive and inferential analysis.
  • Identified missing data, outliers and invalid data, and applied appropriate data management techniques.
  • Worked on data manipulation and raw marketing data of different formats from multiple sources and prepared the data for further analysis using ggvis and ggplot2 packages.
  • Evaluated data for distribution, correlation and Variance Inflation Factor (VIF).
  • Analyzed different trends and market segmentation based on historical data using K means clustering, Classification techniques.
  • Wrote simple and advanced SQL queries for extracting data and created dashboard and stories for senior managers.
  • Provided technical assistance for development and execution of test plans and cases as per client requirements.
  • Supported technical team members in the development of automated processes for data extraction and analysis.
  • Created rich graphic visualizations/dashboards in Tableau to enable a fast read on claims and key business drivers and to direct attention to key areas.

Environment: R (dplyr, ggplot2, sqldf, SQLite), SQL Server 2008 R2, Oracle 10g, SQL, Microsoft Project/Office, Tableau 8.x.
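
The K-means segmentation above was done in R on multivariate marketing data. As a toy single-feature sketch of the algorithm itself (assuming k >= 2, with invented spend values):

```python
# Toy 1-D K-means: assign points to the nearest center, then recompute
# centers as cluster means. Assumes k >= 2; data is invented.

def kmeans_1d(values, k, iters=20):
    svals = sorted(values)
    # seed centers spread across the sorted range
    centers = [svals[i * (len(svals) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(cl) / len(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

spend = [12, 15, 14, 80, 85, 90]  # e.g. annual spend per customer
centers, clusters = kmeans_1d(spend, k=2)
```

Real segmentation runs on several standardized features at once; the update rule is the same, with Euclidean distance replacing the absolute difference.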

Confidential, Tampa, FL

Programmer Analyst/ Data Analyst


  • Calculated total payments by beneficiary for inpatient, skilled nursing facilities, home health agencies and hospice claims and total Medicare Part-A payments using various SAS procedures
  • Created Member Months data for a certain population of interest and for any period of interest
  • Designed programs and analyzed the bucket costs incurred to our client for paying the providers for their services.
  • Integrated the members' data with the cost data to generate Per Member Per Month (PMPM) cost metrics.
  • Programmed SAS code to identify primary care physicians using the taxonomy codes present in the registry.
  • Cleaned and managed member data to handle overlapping and collapsing records with respect to the enrollment period.
  • Created SAS reports to analyze the remaining membership duration for each member, members who uniquely enrolled with the provider for at least one month, and the total number of member months carried by each provider.
  • Designed required tables, views, indexes, stored procedures, user-defined functions, and constraints such as primary keys, foreign keys, CHECK and NOT NULL/NULL.
  • Merged member data and claim data at the header and detail levels using complex joins and subqueries.
  • Decreased double data entry by deleting duplicate records and updating the data per client specifications.
  • Visualized the cost metrics via Line plots and Bar charts to interpret the variation of costs with respect to months and members using Tableau

Environment: SAS BASE, SAS Enterprise Guide, SQL, Tableau, MS Excel.
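
The PMPM metric built above reduces to total paid claims divided by total member months for the period. A minimal sketch with invented figures (the actual work was done in SAS):

```python
# Per Member Per Month (PMPM) cost metric; all figures are invented.

member_months = {"M1": 12, "M2": 6, "M3": 9}  # enrollment months per member
claims = [  # (member id, paid amount)
    ("M1", 1200.0),
    ("M2", 300.0),
    ("M2", 150.0),
    ("M3", 540.0),
]

total_cost = sum(amount for _, amount in claims)
total_member_months = sum(member_months.values())
pmpm = total_cost / total_member_months
```

The same ratio can be cut by provider, product line or month to compare cost trends across populations of different sizes.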

Confidential, Troy, Michigan

Programmer/Data Analyst


  • Worked on loan application data, customer demographics data and customers' bank transaction data to analyze customers' willingness to proceed with an application after applying for a loan (acceptance rate).
  • Analyzed customer behavior with respect to existing loan payments by generating several payment reports.
  • Authored SQL Code to deliver reports on LTV ratio for various customers
  • Generated reports to track the type of documents like Income Proof, Address Proof and Identity proof submitted by the customer.
  • Calculated the amount of time taken by the customer to submit requested documents.
  • Designed SQL procedures to calculate the monthly average balance (MAB) & monthly weighted average for each customer.
  • Authored SQL code to produce the results of loan approval.
  • Generated reports to track the weekly and monthly deposits and withdrawals of each customer
  • Calculated the mean statistics of customers for each type of loan across different cities
  • Extensively used SQL joins to merge various datasets to generate Customer Transactional dataset

Environment: SQL Server, MS Excel.
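
A monthly average balance (MAB) of the kind computed above is the mean of end-of-day balances over the month. This sketch reconstructs the daily balances from transactions; the dates and amounts are made up, and the original work used SQL procedures.

```python
# Sketch of a monthly average balance (MAB) from transactions.
# Opening balance, dates and amounts are invented for the example.

from datetime import date, timedelta

def daily_balances(opening, txns, start, days):
    """End-of-day balances for `days` days; txns maps date -> net amount."""
    bal, out = opening, []
    for d in range(days):
        day = start + timedelta(days=d)
        bal += txns.get(day, 0.0)
        out.append(bal)
    return out

txns = {date(2024, 6, 10): 500.0, date(2024, 6, 20): -200.0}
balances = daily_balances(1000.0, txns, date(2024, 6, 1), 30)
mab = sum(balances) / len(balances)
```

A monthly weighted average follows the same shape, with each balance weighted by the number of days it was held.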


Software Trainee


  • Developed business process models using MS Visio to create case diagrams and flow diagrams to show flow of steps that are required.
  • Worked with other teams to analyze customers and marketing parameters.
  • Used MS Excel, MS Access and SQL to write and run various queries.
  • Used traceability matrix to trace the requirements of the organization.
  • Recommended structural changes and enhancements to systems and databases

Environment: SQL Server, MS Access, MS Excel.
