Senior Data Scientist/ Data Analyst Resume

Danville, PA


  • Highly analytical, process-oriented data analyst with 10+ years of experience in data analytics and data management and a proven ability to work efficiently both independently and in teams. Excellent communication and interpersonal skills with solid team-leading capabilities.
  • Expertise and experience in domains like Healthcare, Finance and Telecom.
  • Proficient knowledge of the SDLC and extensive experience in Agile (Scrum and XP) and Waterfall models.
  • Expertise in Cost Benefit Analysis, Feasibility Analysis, Impact Analysis, Gap Analysis, SWOT analysis and ROI analysis, SCRUM, leading JAD Sessions and Dashboard Reporting.
  • Experience in data modeling using Erwin data modeler, data analysis and working with OLTP and OLAP systems and data mining by using EDW, MOLAP and ROLAP.
  • Worked with various RDBMS such as Oracle, MySQL, SQL Server and Teradata, with expertise in creating tables, populating data and extracting data from these databases.
  • Experience in implementing Data warehouse solutions in Teradata, Oracle and SQL Server.
  • Worked with NoSQL databases such as Apache Cassandra and HBase for stream and real-time processing.
  • Experience in extracting, transforming and loading (ETL) data from spreadsheets, database tables, flat files and other sources using Talend Open Studio and Informatica PowerCenter.
  • Good knowledge of normalization and denormalization techniques for optimal schema design.
  • Data chunking, data wrangling, data profiling, data cleansing, data mapping and workflow creation using the data integration tools Informatica PowerCenter and Talend Open Studio during ETL and ELT processes.
  • Understanding of AWS services including EMR, Kinesis, Redis, Glacier, EC2, S3, RDS, DynamoDB and Redshift.
  • Broad experience in ERD and multidimensional modeling, including conceptualizing these models and deriving physical models from logical models.
  • Experience in Data warehousing concepts like Star Schema, galaxy and Snowflake Schema, DataMarts, Kimball Methodology used in Relational and Multidimensional data modelling.
  • Experience with Google Analytics and the Apache Hadoop ecosystem, with good knowledge of the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Sqoop, Kafka, Flume, Cassandra, Oozie, Impala and Spark.
  • Experience with conceptual, logical and physical data modeling in line with metadata standards.
  • Experience with DBA tasks involving database creation, performance tuning, creation of indexes, creating and modifying table spaces for optimization purposes.
  • Knowledge of machine learning and statistical techniques such as regression models, artificial neural networks, clustering analysis, decision trees, ANOVA, t-tests and SVM.
  • Experience in Base SAS/STAT, STATA, R, SQL, Tableau, Python, MS EXCEL (VLOOKUP, Pivot charts, Macros).
  • Expertise in data manipulation using SAS/STATA with procedures such as PROC SQL, PROC MEANS, PROC FREQ, PROC REPORT, PROC TABULATE, PROC UNIVARIATE, GPLOT and GCHART; arrays, DO loops and macros; and data-management procedures such as PROC APPEND, PROC DATASETS, PROC SORT and PROC TRANSPOSE.
  • Expertise in building Tableau dashboards for data visualization, creating reports and deploying them to servers.
  • Proficient in Service Oriented architecture (SOA), REST/SOAP, API and Cloud technologies, Web Services.


Analytical Techniques: Hypothesis testing, Predictive analysis, Machine Learning, Regression Modelling, Logistic Modelling, Time Series Analysis, Decision Tree, Neural Networks, Support Vector Machines (SVM), Monte Carlo methods, Random Forest

Analytical Tools: STATA, MEGASTAT, RapidMiner, Google Analytics, RStudio, SAS/STAT, Google Ads, Google Website Optimizer, Azure Data Lake Analytics, SAS Enterprise Miner

Data Visualization Tools: Tableau, QlikView, Datawrapper, Microsoft Power BI, Excel, Visio

Data Modeling: Entity Relationship Diagrams (ERD), Snowflake schema, Star schema

Languages: SQL, U-SQL, HiveQL, C, R, Python, SAS

Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7, Teradata, DB2, Amazon Redshift

NoSQL Databases: HBase, Apache Cassandra

ETL Tools: Microsoft SSIS, Pentaho DI, IBM Cognos, Talend Open Studio, Informatica PowerCenter 9.0

Testing Tools: JIRA, HP Quality Center / HP ALM, Basecamp

Big Data: Apache Hadoop, HDFS, Sqoop, Flume, Kafka, Hive, Impala, MapReduce, Splunk ML-SPL, Splunk Hadoop Connect, Apache Airflow, Elasticsearch

SDLC Methodology and Tools: Waterfall, Agile/Scrum Methodology/XP, BIZAGI BPMN, SeeNowDoScrum, MS Project


Confidential, Danville, PA

Senior Data Scientist/ Data Analyst


  • Collaborated with key stakeholders and subject matter experts in brainstorming sessions to assess the state of the reporting structure, identify gaps and shape the requirements for new work.
  • Defined the data sources and the methodology for the project.
  • Collaborated in creating the requirements document (BRD), Data source documents and data dictionaries.
  • Explored, identified, and aggregated all the meaningful data sources of the organization and performed the data profiling to ensure the Quality of the data using Informatica IDQ.
  • Analyzed Confidential's big data stored in the UDA (Hadoop distribution) through Hive on top of MapReduce for substance use disorder treatments, using patient data stored in EPIC systems.
  • Used Ambari to manage and monitor the Hadoop cluster, which also helped improve the performance of Hive queries.
  • Developed monthly Center of Excellence reports using Hive scripts to assess the performance of care providers and track the progress of opioid-addicted patients, a mandatory activity for the Pennsylvania Department of Human Services (DHS).
  • Created data dictionaries and Hive scripts for all reports and automated the reports through Crystal Reports.
  • Created interactive live Tableau dashboards covering all the KPIs of the Opioid Addiction Program.
  • During state reporting under the guidelines of the Centers for Medicare & Medicaid Services (CMS), linked all patients to their Medicaid IDs with the help of Promise.
  • Developed a reporting system for claims processing (revenue cycle / ACHIP billing) for the opioid addiction program.
  • Developed the format, logic and Hive script of the Huddle report for the opioid addiction Huddle team to support better-coordinated treatment for patients. The Huddle report reads Hadoop data through Hive scripts and generates daily reports in the required format through Tableau.
  • Analyzed patient Electronic Health Records (EHR) and Electronic Medical Records (EMR) through EPIC systems and participated in the creation of smart data elements for them.
  • Created an opioid addiction outcomes survey for the opioid addiction program through REDCap.
  • Collaborated with the Breast Cancer Department in developing and automating the breast cancer analytical report. Developed and evaluated different algorithms and risk models to calculate the breast cancer risk score.
  • Tracked data lineage, ensured data integrity, performed data reconciliation, optimized queries and analyzed the EHR/EMR data of the EPIC systems to improve the efficiency of the existing reporting and analytics structure.
  • Worked on the EPIC Clarity database (optimized for reporting), accessing it with industry-standard tools such as Crystal Reports. Examined and analyzed the Clarity data for slot utilization, appointment statistics, appointment cycle time and the number of new patients seen in the last year across all departments.
  • Created internal tables, external tables and views on the Hadoop framework using Hive to gain insights into periodic patient assessments.
  • Performed data analytics on the output of Spark Core using MLlib, Apache Spark's machine learning library.
  • Partitioned the data set into training, testing and validation sets for use in supervised learning.
  • Performed predictive analysis in Python and R using popular machine learning algorithms such as linear regression, logistic regression and artificial neural networks.
  • Collaborated with physicians, pharmacists, nurse practitioners and other provider-side stakeholders to gather their requirements. Applied the resulting logic and analytics in Hive code and translated the output into meaningful data, creating an accurate reporting system to improve healthcare quality.
  • Ensured the proper process was followed to demonstrate that the data provided had gone through a stringent data governance process and complied with standards such as HIPAA and HL7.

Environment: Apache Hadoop, Hive, Informatica PowerCenter, Ambari, Crystal Reports, Spark MLlib, Tableau, MS Office suite, Excel, EPIC systems, Promise, REDCap, Apache Tez, Informatica IDQ, RStudio, Python, SQL, SSIS, SQL Server, EPIC Clarity, Teradata, SAS, ICD, CPT, HCPCS, WebI, SharePoint, scikit-learn, Keras, Elasticsearch
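
The partitioning step listed above (splitting a data set into training, validation and testing sets before supervised learning) can be sketched as follows. This is a minimal illustration in plain Python, not the code used on the project; the function name `partition`, the 60/20/20 proportions and the fixed seed are assumptions chosen for the example.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and testing sets.

    The fractions and the seed are illustrative defaults (assumptions),
    not values taken from the resume.
    """
    if not 0 < train < 1 or not 0 <= validation < 1 or train + validation >= 1:
        raise ValueError("fractions must leave room for a testing set")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the testing set
    )


train_set, validation_set, test_set = partition(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before slicing avoids any ordering in the source data (for example by date or patient ID) leaking into one of the three sets.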

Confidential, Columbus, OH

Senior Data Scientist


  • Collaborated with the product owner, key stakeholders and subject matter experts in brainstorming sessions to assess the state of the reporting structure, identify gaps and shape the need for event correlation over log data.
  • Participated in as-is/to-be analysis comparing the existing enterprise security information and event management (SIEM) tool with the Splunk Enterprise platform to prepare the business case.
  • Assisted the Product Owner in creating a Proof of Concept by analyzing months of syslog data in the proposed Splunk platform in terms of reporting, monitoring and alerts.
  • Built machine learning models to identify the crucial patterns that lead to MiFID violations.
  • Expertise in Splunk components (indexer, forwarder, search head, deployment server), heavy and universal forwarders, parsing, indexing and searching concepts, and hot, warm, cold and frozen bucketing.
  • Created data models and lookup knowledge objects for day-to-day operational analysis.
  • Coached a team of data analysts on how Splunk can be leveraged for machine data analysis at scale.
  • Wrote shell scripts and made extensive use of regular expressions (regex) in search strings and for data anonymization.
  • Used the built-in Search Processing Language (SPL) to analyze time-series events and identify event correlations.
  • Used DB Connect to create lookups into a MySQL database to support key day-to-day management decisions.
  • Collaborated with SMEs, identified data sources, profiled the data and documented the analysis undertaken.
  • Stored the processed data in the Hadoop Distributed File System for use in predictive analysis.
  • Integrated the R platform with the Hadoop ecosystem and performed predictive and prescriptive analysis.
  • Used machine learning algorithms such as Random Forest, KNN, artificial neural networks (ANN), linear regression and logistic regression to predict customer and market behavior.

Environment: Apache Hadoop Distribution 2.7.X, HDFS, Splunk Hadoop Connect, JIRA Suite, MS Office Suite, Splunk ML-SPL (Machine Learning Toolkit), MySQL, Machine Learning.

Confidential, Nashville, TN

Data Scientist


  • Identified the edge node(s): an IoT gateway or a cloud aggregator.
  • Explored and classified all IoT devices in the healthcare infrastructure and profiled their data with Informatica to understand them better.
  • During the collaboration phase, performed as-is/to-be analysis and arrived at two options, analyzing and comparing Azure Data Lake Analytics and Apache Hadoop distributions.
  • Compared the performance of both options and built the final business case around Apache Spark.
  • Used Apache Kafka to bring data from all IoT devices on the healthcare network into the Hadoop/Spark system.
  • Through Sqoop, imported structured healthcare data from Teradata (an RDBMS) into HBase.
  • Used Teradata for EDW solutions, to gain Business intelligence insights regarding healthcare operations.
  • Processed streaming data with Apache Spark Streaming, which splits the stream into micro-batches that are then processed by Spark Core, keeping latency low.
  • Stored the processed data in HDFS and generated reports through Spark SQL queries.
  • Performed data analytics on the output of Spark Core using MLlib, Apache Spark's machine learning library.
  • Performed descriptive analysis of the data, such as correlations and scatter plots, to understand the current performance of the healthcare IoT devices, improve their efficiency and optimize their usage.
  • Partitioned the data set into training, testing and validation sets for use in supervised learning.
  • Performed predictive analysis using popular machine learning algorithms such as linear regression, logistic regression and artificial neural networks.
  • Visualized model performance with the receiver operating characteristic (ROC) curve by plotting sensitivity against 1 - specificity at different thresholds.
  • Measured the predictive ability of each classifier by the area under the curve (AUC). An AUC greater than 0.75 was the model acceptance criterion.
  • Performed data reporting, created the dashboard and shared it with all major stakeholders.
  • Performed statistical analysis to improve system processes and the quality of healthcare delivered. Modeled processes with the BIZAGI BPMN MODELER tool to remove inefficient tasks and processes from the systems.
  • Documented all analytical results and findings from Apache Spark in Apache Zeppelin.
  • Ensured the proper process was followed to demonstrate to the monitoring government entity that the data provided to them had gone through a stringent data governance process.

Environment: Apache Hadoop, HBase, Apache Spark, Kafka, Apache Zeppelin, Informatica, BIZAGI BPMN MODELER, Spark MLlib, Tableau, R, SAS/STAT, Spark SQL, Teradata, Predictive analysis, Machine Learning, MS Office suite, U-SQL, Azure Data Lake Analytics, Basecamp, Hive (UDF), Apache Airflow, Keras, Elasticsearch.
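
The model evaluation described above (an ROC curve built from different thresholds, with acceptance when the AUC exceeds 0.75) can be illustrated with a small plain-Python sketch. The labels and scores below are made up for the example, and the helper names `roc_points` and `auc` are hypothetical; the project itself may well have relied on a library instead.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels are 1 (positive) or 0 (negative); scores are classifier outputs.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves the curve towards (1, 1).
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


ACCEPTANCE_THRESHOLD = 0.75  # acceptance criterion stated in the bullet list

# Made-up example data: true classes and predicted scores.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.2]

area = auc(roc_points(labels, scores))
accepted = area > ACCEPTANCE_THRESHOLD
print(f"AUC = {area:.4f}, accepted = {accepted}")
```

The x-axis here is the false positive rate, which equals 1 - specificity, and the y-axis is sensitivity. For these eight observations the area works out to 13/16 = 0.8125, so the model would pass the 0.75 criterion.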

Confidential, Foster City, CA

Data Scientist / Senior Data Analyst


  • Used the Facebook and Twitter developer platforms to extract user tweets and comments through their APIs for analysis.
  • Created R scripts that connect to the Twitter and Facebook APIs for data extraction.
  • Built data frames from the extracted data through a call function in R to get the data in tabular form.
  • Cleansed the data with the R functions sapply and gsub by removing emoticons and URLs. This streamlined processing and steadily improved data quality over time.
  • Performed lexical analysis on the cleaned tweets and Facebook comments to analyze their sentiment and convert it into numerical values.
  • Scanned each tweet and comment with a scan function against repositories of positive and negative words to count the positive and negative words it contains.
  • Quantified the sentiment as a positive score, a negative score and an overall score.
  • Performed descriptive analysis of the data using correlations, scatter plots and measures of central tendency.
  • Visualized the results of the lexical analysis in Tableau with histograms, bubble charts and pie charts, and created a dashboard for reporting.
  • Provided departments with data-driven solutions based on the lexical analysis findings to support key decisions on the manufacturing and marketing of key products.
  • Optimized data collection procedures and proposed solutions to improve system efficiencies and reduce total expenses through BIZAGI BPMN MODELER tool.
  • Analyzed customer web trails in real time using Google Analytics to gain insight into pages per visit, average visit duration and bounce rate.
  • Used the AdWords / Google Ads platform for paid online advertising of the products.
  • Measured and analyzed online marketing campaigns through revenue metrics, exit rate, bounce rate and conversion metrics, and performed A/B and multivariate testing.
  • Performed search engine optimization (SEO) using white hat, black hat and grey hat SEO techniques.
  • Performed ROI analysis to check the effectiveness of the marketing campaigns.
  • Designed, implemented and tracked KPIs to measure performance against goals.
  • Ensured the proper process was followed to demonstrate to the monitoring government entity that the data provided to them had gone through a stringent data governance process.

Environment: R, Lexical Analysis, Tableau, MS Office suite, Microsoft Power BI, SAS/STAT, SQL, MySQL, Google Analytics, Google Ads, Control-M
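
The lexical sentiment scoring described above (remove URLs and emoticons, count positive and negative words, then derive an overall score) can be sketched roughly as below. The original work was done in R with sapply and gsub; this sketch uses Python purely for illustration. The word lists, the regular expressions and the function names `clean` and `sentiment_scores` are all assumptions, and the overall score is taken here to be positive minus negative, which the resume does not specify.

```python
import re

# Hypothetical, deliberately tiny word repositories; a real analysis would
# load a full positive/negative lexicon.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "broken"}

URL_PATTERN = re.compile(r"https?://\S+")
# Matches a few common standalone emoticons such as :) ;-) :D =P
EMOTICON_PATTERN = re.compile(r"(?<!\S)[:;=]-?[)(DPp](?!\S)")


def clean(text):
    """Strip URLs and emoticons, then lower-case the text."""
    text = URL_PATTERN.sub(" ", text)
    text = EMOTICON_PATTERN.sub(" ", text)
    return text.lower()


def sentiment_scores(text):
    """Count positive and negative words and derive an overall score."""
    tokens = re.findall(r"[a-z']+", clean(text))
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return {
        "positive": positive,
        "negative": negative,
        "overall": positive - negative,  # assumed definition of the overall score
    }


tweet = "I love this phone, great camera :) http://example.com/x but the battery is terrible"
print(sentiment_scores(tweet))
```

A plain word count like this ignores negation ("not good") and sarcasm, so it is best read as a baseline rather than a full sentiment model.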

Confidential, Reno, NV

Data Analyst


  • Explored, identified, and aggregated all the data sources and performed the data profiling through SPSS.
  • Extracted data from the OLAP cube into the Python environment, using ODBC drivers to create the connection between the OLAP source and PyCharm.
  • Built data frames from the extracted data with the pandas library.
  • Performed exploratory data analysis (EDA) to analyze the data sets to summarize their main characteristics in terms of Box plot, Histogram, Multi-vari chart, Run chart, Pareto chart, Scatter plot, and Odds ratio.
  • Analyzed, interpreted and explained complex medical and pharmacy trends using Python functions.
  • Applied logistic and multiple linear regression to model how hospital readmissions depend on other variables, taking HCPCS, CPT and ICD-10 codes into account.
  • Performed ANOVA, individual t-tests and F-tests to assess which variables significantly affect hospital readmissions and which are not useful for the model.
  • Provided recommendations for cost and performance improvement to internal management and clients through the documentation and data reporting by Tableau.
  • Visualized the results of the analysis in Tableau with histograms, pie charts, box plots, bubble charts and other charts. Created the dashboard and deployed it to the servers.
  • Leveraged the evidence-based learning capabilities of IBM Watson in clinical decision support systems to aid physicians in the treatment of their patients.
  • Trusted with the management of confidential professional information and HIPAA compliance.

Environment: OLAP, PyCharm, pandas, scikit-learn, NumPy, Tableau, SPSS, Excel, Microsoft Power BI, Python, R, SAS/STAT, SQL, MySQL, Predictive analysis, Neural Networks, HCPCS, CPT, ICD-10 codes, IBM Watson


Data Analyst


  • Collaborated with the different stakeholders and identified the potential variables from the broad categories like Demographic data, Policy-related data, Claims and Complaints related variables.
  • Explored customer behavior through data available on social media and networking sites, website logs, blog posts, surveys and similar sources to get a 360-degree view.
  • Identified and addressed the key customer-side issues the organization was facing: lack of feedback, sudden inactivity and friction while accessing the services.
  • Profiled the data using k-means clustering to segment the population, checked the clusters for anomalies and cleansed the data using SPSS.
  • Balanced the dataset through under- and over-sampling techniques using the R platform.
  • Performed the Comparative analysis of churning and non-churning profiles to generalize the model.
  • Performed the predictive analysis of the extracted data by using machine learning algorithms like Regression analysis, Support vector machines, decision tree and neural networks to predict the churn.
  • Partitioned the data set into training and testing sets and ran each model on them.
  • Measured the performance of the models through a confusion matrix, classification accuracy, sensitivity, specificity, precision, AUC, ROC, AUK and used it for the best model selection.
  • Performed the cost benefit analysis (CBA) to investigate the models and to identify the minimum percentage of churners to be contacted for the profitability.
  • Visualized the analysis in Tableau and created a dashboard with the key indicators of the customer retention analysis.

Environment: R, Tableau, SPSS, Excel, SAS/STAT, SQL, MYSQL, Machine Learning, MS OFFICE 2012.
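
Several of the model-selection measures named above (confusion matrix, classification accuracy, sensitivity, specificity and precision) follow directly from four counts. The sketch below shows the arithmetic on made-up churn predictions; the function names and the data are hypothetical, and 1 is assumed to mean "churned".

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide safely; return None when the measure is undefined."""
    return numerator / denominator if denominator else None


def classification_metrics(tp, fp, tn, fn):
    """Derive common measures from the confusion matrix counts."""
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # share of actual churners caught
        "specificity": ratio(tn, tn + fp),  # share of non-churners left alone
        "precision": ratio(tp, tp + fp),    # share of flagged customers who churn
    }


# Made-up example: 4 churners and 6 non-churners.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)
print(classification_metrics(*counts))
```

Comparing these numbers across candidate models, rather than accuracy alone, matters for churn data because the classes are usually imbalanced.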


Data Engineer


  • Created connections to access data from relational databases like Oracle, MySQL, and SQL Server and other data sources. Created the connection to create data objects, preview data, run profiles, and run mappings.
  • Imported metadata to create data objects for sources and targets for the mapping. Used data objects to define the input and output of the mapping.
  • Developed mappings to implement data integration tasks through Informatica. Linked the sources and targets with transformation objects that define the rules for data transformation.
  • Created a workflow to define a sequence of events, tasks, and decisions based on a business process architected through BIZAGI BPMN MODELER tool.
  • Deployed the workflow to Informatica and ran it.
  • Implemented concepts like Star Schema and Snowflake Schema, Data Marts, Relational and Multidimensional data modelling, with facts and dimension tables.
  • Monitored workflow instance runs on the Monitoring tab of the Informatica Administrator tool.
  • Visualized data with the data visualization tools Excel and Tableau.
  • Understood the business requirements, identified gaps in different processes and implemented process improvement initiatives across the business improvement model.
  • Maintained and updated all data archives and conducted periodic internal audits.

Environment: BIZAGI BPMN MODELER, Tableau, MySQL, Oracle, Postgres, Informatica, Windows XP/NT/2000, SQL Server 2005/2008, SQL, MYSQL, Microsoft Visio 2009, MS Office 2010, MS Access 2010, MS Project


Data Support Engineer


  • Provided Data analytics support for high-priority decisions, customer service improvement and organizational realignment through Data reporting and analysis on Excel.
  • Worked on the RAMCO Systems Enterprise Resource Planning (ERP) system to generate and manage Management Information System (MIS) reports.
  • Developed database objects, including tables, views and materialized views, using SQL queries on a MySQL database to normalize the data and secure its integrity.
  • Developed SQL Queries with multiple table joins, functions, subqueries, set operations and T-SQL stored procedures and user defined functions for data analysis.
  • Coordinated statistical data analysis, design, and information flow.
  • Analyzed and interpreted consumer behavior, market opportunities and conditions, marketing results, trends, and investment levels through Content and Collaborative analysis to do the market segmentation.
  • Worked directly with internal users on implementing the ERP system to meet business needs.

Environment: Ramco Systems ERP, SQL Server 2005/2008, SQL, MYSQL, Oracle, Microsoft Visio, MS Office 2010.
