
Data Scientist Resume


Nashville, TN

SUMMARY:

  • Highly analytical and process-oriented data analyst with 7+ years of experience in data analysis and data management, with a proven ability to work efficiently both independently and in team environments. Excellent team player with good communication and interpersonal skills and solid team-leading capabilities.
  • Expertise and experience in domains like Healthcare, Banking, Insurance and e-commerce.
  • 5+ years of experience working with the Python, R and SAS analytical platforms.
  • Expertise in SQL and 7+ years of experience creating and populating databases and extracting data from data tables, along with creating tables, subqueries, joins, views, indexes, SQL functions, set operators and other functionality.
  • Proficient knowledge of the SDLC and extensive experience in Agile (Scrum and XP) and Waterfall models.
  • Expertise in Cost Benefit Analysis, Feasibility Analysis, Impact Analysis, Gap Analysis, SWOT analysis and ROI analysis, SCRUM, leading JAD Sessions and Dashboard Reporting.
  • Experience in data modeling, data analysis and working with OLTP and OLAP systems, and with data warehousing structures such as EDW, MOLAP and ROLAP data marts, along with data mining techniques.
  • Worked with various RDBMS like Oracle, MySQL, SQL Server, DB2, Postgres, Teradata and SAP HANA; expertise in creating tables and in populating and extracting data from these databases.
  • Worked with NoSQL databases like Apache Cassandra for stream processing / real-time analysis of unstructured data using Kafka, and performed analytics on that data using Pentaho DI.
  • Strong Experience in implementing Data warehouse solutions in Amazon Redshift, Oracle and SQL Server.
  • Experience in extracting, transforming and loading (ETL) data from spreadsheets, database tables, flat files and other sources using Talend Open Studio and Informatica.
  • Good knowledge of normalization and de-normalization techniques for optimal schema design.
  • Skilled in data chunking, data profiling, data cleansing, data mapping, workflow creation and data validation during ETL and ELT processes using data integration tools like SPSS, Informatica and Talend Open Studio.
  • Supported AWS services including IAM, EC2, S3, VPC, RDS, DynamoDB and Redshift.
  • Strong experience in ERD and UML modelling, and in deriving physical models from conceptual and logical models.
  • Experience in data warehousing concepts like Star, Galaxy and Snowflake schemas, data marts and the Kimball methodology used in relational and multidimensional data modelling.
  • Experience on IBM Watson, Google analytics and Apache Hadoop Ecosystem with good knowledge of Apache Hadoop Distributed file system (HDFS), Map Reduce, Hive, Pig, Python, HBase, Sqoop, Kafka, Flume, Cassandra, Oozie, Impala, Spark.
  • Experience with conceptual, logical and physical data modeling considering Meta data standards.
  • Experience with DBA tasks involving database creation, performance tuning, creation of indexes, creating and modifying table spaces for optimization purposes.
  • Knowledge of machine learning techniques like regression models, artificial neural networks, clustering analysis, decision trees and SVM, and of statistical tests such as ANOVA and t-tests. Experience in Base SAS/STAT, STATA, R, SQL, Tableau, Python and MS Excel (VLOOKUP, pivot charts, macros).
  • Expertise in data manipulation using SAS/STATA with procedures such as PROC SQL, PROC MEANS, PROC FREQ, PROC REPORT, PROC TABULATE, PROC UNIVARIATE, PROC APPEND, PROC DATASETS, PROC SORT and PROC TRANSPOSE, along with arrays, DO loops, GPLOT/GCHART, macros and merges.
  • Expertise in creating Tableau dashboards for data visualization and deploying them to servers.
  • Proficient in Service-Oriented Architecture (SOA), REST/SOAP APIs, web services and cloud technologies.

TECHNICAL SKILLS:

Analytical Techniques: Hypothesis testing, Predictive analysis, Machine Learning, Regression Modelling, Logistic Modelling, Time Series Analysis, Decision Trees, Neural Networks, Support Vector Machines (SVM), Monte Carlo methods, Random Forest

Data Visualization Tool: Tableau, Qlikview, Datawrapper, Microsoft Power BI, Excel, VISIO

Analytical tools: STATA, MegaStat, RapidMiner, Google Analytics, IBM Watson, RStudio, SAS/STAT, Google Ads, Azure Data Lake Analytics, SAS Enterprise Miner, PyCharm, Jupyter Notebook, NLP, MATLAB, ggplot, WEKA

Data modeling: Entity Relationship Diagrams (ERD), Snowflake schema, Star schema

Languages: SQL, U-SQL, HiveQL, C, R, Python, SAS

Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7, Teradata, DB2, Amazon Redshift, Sybase IQ, SAP HANA

NoSQL Databases: HBase, Apache Cassandra

ETL Tools: Microsoft SSIS, Pentaho DI, IBM Cognos, Talend Open Studio, Informatica PowerCenter 9.0, Informatica IDQ, Collibra, Kafka, Flume

Testing tools: JIRA, HP Quality Center / HP ALM, Basecamp

Big Data: Apache Hadoop, HDFS, Sqoop, Flume, Kafka, Hive, Impala, MapReduce, Splunk ML-SPL, Splunk Hadoop Connect, Apache Airflow

SDLC Methodology and Tools: Waterfall, Agile / Scrum Methodology / XP, BIZAGI BPMN MODELER, SeeNowDoScrum, MS Project

PROFESSIONAL EXPERIENCE:

Confidential, Nashville, TN

Data Scientist

Roles and Responsibility:

  • Identified the edge node(s), an IoT gateway or a cloud aggregator.
  • Explored and classified all the IoT devices in the healthcare infrastructure and performed data profiling through Informatica to gain a better understanding of them.
  • During the collaboration phase, performed as-is/to-be analyses and narrowed the choice to two options, analyzing and comparing Azure Data Lake Analytics, IBM Watson Analytics and the Apache Hadoop distributions.
  • Later analyzed the performance of Apache Spark and Data Lake Analytics within the Apache Hadoop distributions running on top of YARN.
  • Compared the performance of Spark's query language and U-SQL, and built the business case around Apache Spark.
  • Used Kafka within the Apache Hadoop framework to ingest data from all the IoT devices of the healthcare network into the Spark system.
  • Used Sqoop to import structured healthcare data from SAP HANA, an RDBMS, into HBase.
  • Used SAP HANA for the data warehouse solutions, gaining business intelligence insights into the healthcare organization's operations.
  • Applied digital transformation and migrated data from the Teradata DB into HDFS through Sqoop.
  • Processed the stream data in Spark Streaming by breaking it into micro-batches that were then processed by the Spark core, resulting in lower latency (a minimal sketch of this ingestion pattern follows this list).
  • Stored the processed data in HDFS and generated reports through Spark SQL queries.
  • Performed data analytics on the output from the Spark core using Spark MLlib, Apache Spark's machine learning library.
  • Performed descriptive analysis on the data, such as correlations and scatter plots, to understand the current performance of the healthcare IoT devices and to improve their efficiency and optimize their usage.
  • Partitioned the data set into training, testing and validation sets for use in the supervised learning processes.
  • Performed predictive analysis on the Python and R platforms using popular machine learning algorithms such as linear regression, logistic regression and artificial neural networks.
  • Visualized model performance through the ROC curve (Receiver Operating Characteristic curve) by plotting sensitivity (true positive rate) against 1 - specificity (false positive rate) at different thresholds.
  • Measured the predictive ability of each classifier by the area under the curve (AUC); an AUC greater than 0.75 was the model acceptance criterion.
  • Used natural language processing algorithms to process radiology reports from imaging studies such as X-rays, CT scans and MRIs.
  • Performed data reporting, created dashboards and shared them with all the major stakeholders.
  • Performed statistical analysis to improve system processes and deliver better quality healthcare. Performed process modelling through the BIZAGI BPMN Modeler to remove inefficient tasks and processes from the systems.
  • Documented all analytical results and findings from Apache Spark in Apache Zeppelin.
  • Ensured the proper process was followed to demonstrate to the monitoring government entity that the data provided to them had gone through a stringent data governance process.
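The micro-batch ingestion pattern described above can be illustrated with a minimal PySpark Structured Streaming sketch. This is only an illustrative example: the broker address, topic name and HDFS paths are placeholders, and the project itself may have used the DStream API rather than Structured Streaming.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Placeholder broker, topic and HDFS paths -- not the project's actual values.
    spark = SparkSession.builder.appName("iot-healthcare-stream").getOrCreate()

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "iot-device-events")
           .load())

    # Kafka delivers the payload as binary; cast it to string and keep the event timestamp.
    events = raw.select(col("value").cast("string").alias("payload"), col("timestamp"))

    # Each trigger interval becomes one micro-batch handled by the Spark core and
    # appended to HDFS, where Spark SQL queries can later generate the reports.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/iot/events")
             .option("checkpointLocation", "hdfs:///checkpoints/iot-events")
             .trigger(processingTime="30 seconds")
             .outputMode("append")
             .start())

    query.awaitTermination()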

Environment: Apache Hadoop, HBase, Apache Spark, Kafka, Apache Zeppelin, Informatica, BIZAGI BPMN Modeler, Spark MLlib, Tableau, R, SAS/STAT, Spark SQL, SAP HANA, Predictive analysis, Machine Learning, MS Office suite, U-SQL, Azure Data Lake Analytics, Basecamp, Hive (UDF), Apache Airflow, NLP (Natural Language Processing), Python, NumPy, SciPy, scikit-learn, TensorFlow, ggplot, PyTorch, Informatica IDQ, Cassandra

Confidential, Columbus, OH

Senior Data Scientist

Roles and Responsibility:

  • Collaborated with the product owner, key stakeholders and subject matter experts in brainstorming sessions to identify the state of the reporting structure, identify gaps, and shape the need for event correlation over log data.
  • Participated in as-is/to-be analysis between the available security information and event management (SIEM) enterprise tool and the Splunk enterprise platform to prepare the business case.
  • Assisted the Product Owner in creating a Proof of Concept by analyzing months of syslog data in the proposed Splunk platform in terms of reporting, monitoring and alerts.
  • Built models in machine learning tool kit provided in the Splunk platform to identify those crucial patterns that lead to MiFID violations.
  • Expert knowledge of the components within Splunk (indexer, forwarder, search head, deployment server), heavy and universal forwarders, parsing, indexing and searching concepts, hot/warm/cold/frozen buckets and the license model.
  • Created data models and lookup knowledge objects for day-to-day operational analysis
  • Coached a team of data analysts on how Splunk can be leveraged for machine data analysis at scale.
  • Wrote shell scripts and extensively used regular expressions (regex) in search strings and for data anonymization.
  • Utilized the built-in Search Processing Language (SPL) to analyze massive numbers of time-series events and identify event correlations.
  • Used IBM Watson to make the client interface more interactive, fast and efficient to process the queries.
  • Used DB Connect to create lookups into the operational Teradata DB and migrated data into Splunk after applying digital transformation, supporting key day-to-day management decisions.
  • Assisted the Information Security team in building visualizations of operational syslogs to identify security anomalies and detect outliers.
  • Collaborated with subject matter experts and the project manager to identify data input sources, performed input data profiling, and documented the analysis undertaken.
  • Authored technical case documentation for Splunk Hadoop Connect provided as an input to the architectural runway setup for the project
  • Stored the processed data in the Hadoop Distributed File System for use in predictive analysis.
  • Integrated the R platform with the Hadoop Ecosystem and performed predictive analysis and prescriptive analysis.
  • Used machine learning algorithms like Random Forest, KNN, artificial neural networks (ANN), regression and logistic regression on the Python and R platforms to predict customer and market behavior (see the model-comparison sketch after this list).
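As a rough illustration of the model comparison mentioned in the last bullet, the sketch below trains a logistic regression and a random forest in scikit-learn and compares them on held-out AUC. The file name and the feature/label columns are hypothetical placeholders, not the project's actual data.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical extract of engineered event features with a binary outcome label.
    df = pd.read_csv("customer_events.csv")          # placeholder file name
    X = df.drop(columns=["label"])
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    }

    # Fit each candidate and compare on area under the ROC curve.
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")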

Environment: Apache Hadoop Distribution 2.7.x, HDFS, Splunk Hadoop Connect, JIRA Suite, MS Office Suite, Splunk ML-SPL (Machine Learning Toolkit), Machine Learning, IBM Watson, Python, R, NumPy, SciPy, scikit-learn, TensorFlow, H2O, PyTorch, Reinforcement Learning, Cassandra, Teradata, Snowflake Schema

Confidential, Foster city, CA

Senior Data Analyst

Roles and Responsibility:

  • Worked with the Facebook and Twitter developer platforms to extract user tweets and comments through their APIs for analytical purposes.
  • Created R scripts on the R platform and connected them to the Twitter and Facebook APIs to handle the data extraction.
  • Built data frames of the extracted data through call functions in R to get the data into tabular form.
  • Performed data cleansing through the R functions sapply and gsub, removing emoticons and URLs; this streamlined the process and steadily improved data quality over time.
  • Performed lexical analysis on the cleaned Twitter tweets and Facebook comments to analyze their sentiment and convert it into numerical values.
  • Scanned each tweet and comment with a scan function against positive- and negative-word repositories to count the positive and negative words (a minimal sketch of this scoring step follows this list).
  • Quantified the sentiments in terms of a positive score, a negative score and an overall score.
  • Performed descriptive analysis on the data: correlations, scatter plots and measures of central tendency.
  • Visualized the results of the lexical analysis in Tableau using histograms, bubble charts and pie charts, and created dashboards for reporting purposes.
  • Assisted the departments by giving data driven solutions, based on lexical analysis findings in taking key decisions regarding manufacturing and marketing of key products.
  • Optimized data collection procedures and proposed solutions to improve system efficiencies and reduce total expenses through BIZAGI BPMN MODELER tool.
  • Analyzed the customer web trail in real time using Google Analytics to gain insight into pages per visit, average visit duration and bounce rate.
  • Used the AdWords / Google Ads platform for paid online advertisement of the products.
  • Measured and analyzed online marketing campaigns through revenue metrics, exit rate, bounce rate and conversion metrics, and performed A/B and multivariate testing within Google Analytics.
  • Performed search engine optimization (SEO) using white hat, black hat and grey hat SEO techniques.
  • Performed ROI analysis to check the effectiveness of the marketing campaigns.
  • Designed, implemented and tracked KPIs to measure performance against goals.
  • Ensured the proper process was followed to demonstrate to the monitoring government entity that the data provided to them had gone through a stringent data governance process.
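The word-count sentiment scoring described above was done in R on the project; the sketch below shows the same idea in Python, assuming generic positive/negative word lists (the file names and example text are placeholders).

    import re

    # Hypothetical word-list files standing in for the positive/negative word repositories.
    positive_words = set(open("positive-words.txt").read().split())
    negative_words = set(open("negative-words.txt").read().split())

    def clean(text):
        # Strip URLs, handles, emoticons and other non-alphabetic characters.
        text = re.sub(r"http\S+|@\w+", " ", text)
        text = re.sub(r"[^A-Za-z\s]", " ", text)
        return text.lower()

    def sentiment_score(text):
        words = clean(text).split()
        positive = sum(w in positive_words for w in words)
        negative = sum(w in negative_words for w in words)
        return {"positive": positive, "negative": negative, "overall": positive - negative}

    print(sentiment_score("Love the new release! http://example.com but the battery is terrible :("))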

Environment: R, Lexical Analysis, Tableau, MS Office suite, Microsoft Power BI, SAS/STAT, SQL, MySQL, Google Analytics, Google Ads, Control-M

Confidential, Reno, NV

Data Analyst

Roles and Responsibility:

  • Explored, identified, and aggregated all the meaningful data sources of the healthcare group and performed the data profiling through SPSS.
  • Performed the extraction from the OLAP cube into the Python environment: using the PyCharm platform and ODBC drivers, created the connections between the OLAP source and PyCharm.
  • Performed exploratory data analysis (EDA) to analyze the data sets to summarize their main characteristics in terms of Box plot, Histogram, Multi-vari chart, Run chart, Pareto chart, Scatter plot, and Odds ratio.
  • Made data frames of the extracted data using the Pandas library.
  • Analyzed, interpreted and explained complex medical and pharmacy trends through Python functions.
  • Applied logistic and multiple linear regression algorithms to model the dependency of hospital readmissions on the explanatory variables, taking into account HCPCS, CPT and ICD-10 codes (a minimal sketch of this modelling step follows this list).
  • Performed ANOVA, individual t-tests and F-tests to determine which variables significantly affect hospital readmissions and which are not useful for the model.
  • Provided recommendations for cost and performance improvement to internal management and clients through the documentation and data reporting by Tableau.
  • Visualized the results of the analysis in Tableau using histograms, pie charts, box plots, bubble charts and other chart types; created dashboards and deployed them to the servers.
  • Leveraged the evidence-based learning capabilities of IBM Watson in clinical decision support systems to aid physicians in the treatment of their patients.
  • Trusted with management of confidential professional and personal information / HIPAA Compliance.
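A minimal sketch of the readmission regression step referenced above, assuming a hypothetical claims extract with placeholder column names; statsmodels is used so that the per-coefficient significance tests mirror the t-test/F-test checks described.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical claims extract; the column names are placeholders, not the real schema.
    claims = pd.read_csv("claims_extract.csv")
    y = claims["readmitted_30d"]
    X = sm.add_constant(claims[["length_of_stay", "num_prior_admissions",
                                "age", "chronic_condition_count"]])

    # Logistic regression of 30-day readmission on the candidate predictors.
    model = sm.Logit(y, X).fit()
    print(model.summary())          # per-coefficient significance tests
    print(np.exp(model.params))     # odds ratios for the retained predictors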

Environment: OLAP, PyCharm, Pandas, scikit-learn, NumPy, Tableau, SPSS, Excel, Microsoft Power BI, Python, R, SAS/STAT, SQL, MySQL, Predictive analysis, Neural Networks, HCPCS, CPT, ICD-10 codes, IBM Watson, Informatica IDQ

Confidential, New York, NY

Data Analyst

Roles and Responsibility:

  • Collaborated with the different stakeholders and identified the potential variables from the broad categories like Demographic data, Policy-related data, Claims and Complaints related variables.
  • Explored customer behavior through data available on social media and networking sites, website logs, blog posts, surveys, etc., to get a 360-degree view.
  • Identified and addressed the key issues the organization was facing on the customer side in terms of lack of feedback, sudden inactivity and friction while accessing the services.
  • Performed data profiling using k-means clustering to segment the population into clusters, checked for anomalies and cleansed the data using SPSS.
  • Balanced the dataset through under and over sampling techniques, by using the R platform.
  • Performed the Comparative analysis of churning and non-churning profiles to generalize the model by using the Hypothesis testing.
  • Performed the predictive analysis of the extracted data by using machine learning algorithms like Regression analysis, Support vector machines, decision tree and neural networks to predict the churn.
  • Partitioned the data set into training and testing sets and executed on each model.
  • Measured the performance of the models through the confusion matrix, classification accuracy, sensitivity, specificity, precision, AUC, ROC and AUK, and used these metrics for best-model selection (a minimal sketch of this evaluation step follows this list).
  • Performed the cost benefit analysis (CBA) to investigate the models and to identify the minimum percentage of churners to be contacted for the profitability.
  • Visualized the predictive analysis results over the tableau and created the Dashboard of key indicators in the customer retention analysis and deployed it on the servers.
  • Documented the findings and provided data driven recommendations to the decision makers.
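A minimal sketch of the evaluation step referenced above, computing the confusion-matrix metrics and AUC used to compare the candidate churn models; the labels and scores below are toy values for illustration only.

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    def evaluate(y_true, y_prob, threshold=0.5):
        # Threshold the predicted probabilities and derive the comparison metrics.
        y_pred = (y_prob >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),          # recall on churners
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
            "auc":         roc_auc_score(y_true, y_prob),
        }

    # Toy labels and predicted probabilities, purely illustrative.
    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])
    print(evaluate(y_true, y_prob))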

Environment: R platform, Tableau, SPSS, Excel, SAS/STAT, SQL, MySQL, Machine Learning, Time Series Analysis, Neural Networks.

Confidential, Chicago, IL

Data Analyst

Roles and Responsibility:

  • Performed data segmentation through decision trees in SAS/STAT to better understand the riskiness of borrowers.
  • Defined the data sources and the methodology for the project and collaborated with key stakeholders.
  • Performed data integration through the Talend Open Studio ETL tool on data coming from both internal and external sources, through merging, concatenation and manipulation of relational tables.
  • Developed a Probability of Default (PD) model through a logistic regression algorithm to estimate the probability that a loan will be repaid or will fall into default.
  • Developed a Loss Given Default (LGD) model through an artificial neural network (ANN) algorithm to estimate the economic loss incurred if an obligor defaults, expressed as a percentage of exposure.
  • Created behavioral scorecards and application scorecards to know the applicant in a better way.
  • Cleaned the data through KNN imputation and multiple imputation to handle outliers and missing values.
  • Performed variable selection through the stepwise selection method to remove irrelevant and redundant variables from the data set, with the aim of improving the performance of the regression techniques.
  • Developed the credit scorecard model using logistic regression, model validation and scaling processes (a minimal sketch of the PD-model pipeline follows this list).
  • Assessed the model accuracy by complexity, Mean Absolute Error, RMS Error, Confusion Matrix, etc.
  • Visualized model performance through the ROC curve by plotting sensitivity against 1 - specificity at different thresholds, and measured the predictive ability of the classifier by the area under the curve (AUC).
  • Performed segmentation and reject inference to eliminate selection bias.
  • Addressed the problem of overfitting through bootstrapping and cross-validation frameworks.
  • Visualized the results of the credit score modelling in Tableau and created dashboards.
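A minimal sketch of the PD-model pipeline referenced above, combining KNN imputation with logistic regression in scikit-learn. The project tooling listed in the Environment line was R/SAS/Talend; this Python sketch only illustrates the technique, and the file name and feature columns are placeholders.

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Hypothetical loan-level extract; feature names are placeholders.
    loans = pd.read_csv("loan_applications.csv")
    y = loans["defaulted"]
    X = loans[["loan_amount", "income", "debt_to_income", "utilization", "months_on_book"]]

    pd_model = Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),      # fill missing values from similar borrowers
        ("scale",  StandardScaler()),
        ("logit",  LogisticRegression(max_iter=1000)),
    ])

    # Cross-validated AUC guards against overfitting before scores are scaled into a scorecard.
    print(cross_val_score(pd_model, X, y, cv=5, scoring="roc_auc").mean())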

Environment: Talend Open Studio, Tableau, R, SAS/STAT, SQL, MySQL, Machine Learning, Time Series Analysis, Neural Networks, MS Office 2010

Confidential, Providence, RI

Data Engineer

Roles and Responsibility:

  • Created connections to access data from relational databases like Oracle, MySQL, Postgres and SQL Server and other data sources; used the connections to create data objects, preview data, run profiles and run mappings (an illustrative connect-and-profile sketch follows this list).
  • Imported metadata to create data objects for sources and targets for the mapping. Used data objects to define the input and output of the mapping.
  • Executed profiles to analyze the structure, content and quality of the data; Informatica applied the profiling rules and ran the profiles.
  • Developed mappings to implement data integration tasks through Informatica. Linked the sources and targets with transformation objects that define the rules for data transformation.
  • Created a workflow to define a sequence of events, tasks, and decisions based on a business process architected through BIZAGI BPMN MODELER tool.
  • Deployed the workflow to Informatica and ran it.
  • Implemented concepts like Star Schema and Snowflake Schema, Data Marts, Relational and Multidimensional data modelling, with facts and dimension tables.
  • Experience in monitoring the workflow instance run on the Monitoring tab of the Informatica Administrator tool.
  • Extensive working knowledge of different types of data load strategies and scenarios, such as historical dimensions, surrogate keys and summary facts.
  • Experience in data visualization using Excel and Tableau.
  • Understood the business requirements, identified gaps in different processes and implemented process improvement initiatives across the business improvement model.
  • Maintained and updated all data archives and conducted periodic internal audits.
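The connection, preview and profiling steps above were performed in Informatica Developer. As a rough, tool-agnostic illustration of the same connect-preview-profile idea, the Python sketch below uses SQLAlchemy and pandas; the connection string and table name are assumptions, not the project's actual sources.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string -- host, database and credentials are illustrative only.
    engine = create_engine("postgresql+psycopg2://analyst:secret@db-host:5432/warehouse")

    # Preview the source object and compute simple column-level profile statistics,
    # comparable to the structure/content/quality checks reviewed in Informatica.
    preview = pd.read_sql("SELECT * FROM customer_orders LIMIT 1000", engine)
    profile = pd.DataFrame({
        "dtype":    preview.dtypes.astype(str),
        "nulls":    preview.isna().sum(),
        "distinct": preview.nunique(),
    })
    print(profile)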

Environment: BIZAGI BPMN Modeler, Tableau, MySQL, Oracle, Postgres, Informatica, Windows XP/NT/2000, SQL Server 2005/2008, SQL, Microsoft Visio 2009, MS Office 2010, MS Access 2010, MS Project, MATLAB, Snowflake schema.
