Data Scientist Resume

Irving, Texas

PROFESSIONAL SUMMARY:

  • More than 6 years of experience in quantitative and qualitative research, with experience in data analysis, predictive analytics, and constructing models to deliver impactful strategies and creative insights.
  • Actively participated in all phases of the project life cycle, including data acquisition (web scraping), data cleaning, data engineering (dimensionality reduction (PCA & LDA), normalization, weight of evidence, information value), feature selection, feature scaling and feature engineering.
  • Expertise in statistical modeling (decision trees, regression models, neural networks, SVM, clustering), testing and validation (ROC plot, k-fold cross validation), Association Rule Learning, Reinforcement Learning, Deep Learning and data visualization.
  • Good experience with normalization (1NF, 2NF and 3NF) and denormalization techniques for improved database performance in OLTP, OLAP, Data Warehouse and Data Mart environments.
  • Experience working the full data insight cycle: discussions with the business, understanding business logic and business drivers, exploratory data analysis, identifying predictors, enriching data, working with missing values, exploring data dynamics and meaning, and building predictive data models.
  • Extensive working experience with Python 3.x, including Pandas, NumPy, Matplotlib, Seaborn and scikit-learn, and with R packages such as ggplot2, caret and dplyr.
  • Excellent data visualization experience, either with proprietary code in R or Python or with visualization tools such as Spotfire, Tableau and Power BI, producing output ready for insight digestion by the business and for decision making by senior management (Global CTO, Global BI Leadership level).
  • Experience in text mining and good knowledge of NLP components such as Natural Language Understanding (NLU) and Natural Language Generation (NLG), using the Python NLTK package.
  • Ingested datasets from various data sources, including HDFS, AWS and Cassandra, and RDBMSs such as Oracle, MySQL, SQL Server, DB2, Postgres, Teradata and SAP HANA.
  • Extensive experience in data cleaning, web scraping, fetching live streaming data, data loading and data parsing using a wide variety of Python and R packages such as Beautiful Soup.
  • Experienced with Big Data tools like Hadoop (HDFS), SAP HANA, Hive and Pig.
  • Experienced in the full software development life cycle (SDLC) under Agile and Scrum methodologies.
  • Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
  • Expertise in writing effective Test Cases and Requirement Traceability Matrix to ensure adequate software testing and manage Scope Creep.
  • Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type etc.
  • Strong understanding of Software Testing Techniques especially those performed or supervised by BA including Black Box testing, Regression testing, and UAT.
  • Experience using markup languages such as HTML and XML.
  • Experienced in Big Data with Hadoop 2, HDFS, MapReduce and Spark.
  • Experience working with data modeling tools like Erwin, PowerDesigner and ER Studio.
  • Experience in designing star and snowflake schemas for Data Warehouse and ODS architectures.
  • Excellent experience in Extract, Transform and Load (ETL) processes using tools like DataStage, Informatica, Data Integrator and SSIS for data migration and data warehousing projects.
  • Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14 and Cosmos.

TECHNICAL SKILLS:

Analytical Techniques: Hypothesis testing, Predictive analysis, Machine Learning, Regression Modelling, Logistic Modelling, Time Series Analysis, Decision Tree, Neural Networks, Support Vector Machines (SVM), Monte Carlo methods, Random Forest.

Analytical Tools: RapidMiner, Google Analytics, IBM Watson, R Studio, SAS/STAT, Google Ads, Azure Data Lake Analytics, SAS Enterprise Miner, PyCharm, Jupyter Notebook, NLP, MATLAB, ggplot2, WEKA

Data Visualization Tools: Tableau, QlikView, Qlik Sense, Datawrapper, Microsoft Power BI, Excel, Visio, Looker

Data Modeling: Entity Relationship Diagrams (ERD), Snowflake schema, Star schema

Languages: SQL, U-SQL, HiveQL, C, R, Python, SAS

Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7, Teradata, DB2, Amazon Redshift, Sybase IQ, SAP HANA, Salesforce

NoSQL Databases: HBase, Apache Cassandra, MongoDB, Redis

ETL Tools: Microsoft SSIS, Pentaho DI, IBM Cognos, Talend Open Studio, DataStage 11.x, Informatica PowerCenter 9.0, Informatica IDQ, Collibra, Kafka, Flume

PROFESSIONAL EXPERIENCE:

Confidential - Irving, Texas

Data Scientist

Responsibilities:

  • Performed preliminary data analysis using descriptive statistical analysis. Handled anomalies by removing duplicates and imputing missing values using R Tidyverse and Python Pandas/NumPy packages.
  • Applied various ML algorithms and statistical models such as decision trees, regression models, social network analysis, neural networks, deep learning, SVM and clustering, using Python and R libraries, to identify and predict customer behavior.
  • Developed NLP models for topic extraction (unsupervised learning) to predict why customers call the client, using algorithms such as TextRank, RAKE, LDA, NMF, GLDA and sentiment analysis in Python and R.
  • When creating the NLP model, R's Tidyverse package was initially used for importing and preprocessing the data, and the TextRank and RAKE packages were used to create an unsupervised model for classification. The R script was later converted to Python scripts because the R dependencies needed to deploy the model could not be satisfied on the production server.
  • Pre-processed and processed data for NLP by parsing call transcripts in JSON format from IBM Watson and performing lemmatization, stop word removal, part-of-speech tagging, named entity removal, bag-of-words, chunking, tf-idf, regular expressions and word embeddings using the NLTK, JSON, Pandas and NumPy libraries in Python.
  • Developed models in both R and Python in conjunction because we used Openscoring, a standards-based, open-source middleware for predictive analytics applications. R was used to convert all the models into PMML format, which served as input for Openscoring.
  • Extracted customer-centric data from various sources like Salesforce, Teradata and Big Data (Hive and Spark) to build datasets particular to each problem statement.
  • Assisted in building a semantic layer for a proposed data lake by studying existing data sources and leveraging the concepts of exploratory data analysis and entity relationship diagrams using Erwin Data Modeler.
  • Built a business and technical metadata dictionary for business use while performing exploratory data analysis, by interacting with SMEs/data owners and using the MMT tool (in-house metadata tool).
  • Wrote complex ad-hoc Teradata SQL queries to pull transactional data from the Teradata warehouse and matched it to all digital touchpoints from big data to move it to a data lake.

Environment: Teradata, Tableau Desktop (9.x/10.x), Tableau Server (9.x/10.x), HDFS, Git, Erwin r9.6, Hive, Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas, RAKE, NLTK, LDA, GLDA), Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), R Studio, R 3.5 (Tidyverse, dplyr, caret, RPMML), Spark (PySpark, MLlib, Spark SQL).

Confidential - Johnston, Rhode Island

Data Scientist

Responsibilities:

  • Collaborated with data engineers and the operations team to implement the ETL process using IBM InfoSphere DataStage; wrote and optimized SQL queries to perform data extraction to fit the analytical requirements after performing data profiling.
  • Worked with Data Architects and IT Architects to understand the movement of data and its storage in ER Studio 9.7.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture.
  • Performed source system analysis, database design and data modeling for the warehouse layer and package layer using dimensional modeling in ER Studio 9.7.
  • Performed extensive data validation and data verification against the Data Warehouse and debugged the SQL statements and stored procedures.
  • Developed and maintained data models, data dictionaries, data maps and other artifacts across the organization, including the conceptual and physical models, as well as the metadata repository.
  • Participated in feature engineering such as feature filters, wrapper methods, feature extraction and construction, dimensionality reduction, feature intersection generation, feature normalization and label encoding with Python scikit-learn and R caret to reduce computational cost and time.
  • Worked on data cleaning to ensure data quality, consistency, integrity using Pandas, NumPy and Tidyverse.
  • Tackled highly imbalanced dataset using Over-sampling and Under-sampling techniques (SMOTE, ROS, RUS) to improve data and model accuracy.
  • Applied Naïve Bayes, KNN, Logistic Regression, Random Forest, SVM, XGBoost and ensemble methods to identify spoofing patterns and collusion patterns.
  • Used various metrics (RMSE, MAE, F-Score, ROC and AUC) to evaluate the model performance in R.
  • Improved model performance by using random forest and gradient boosting for feature selection.
  • Performed data analysis by using Spectrum to run Redshift SQL queries against Amazon S3 to retrieve data directly, thereby reducing the query run time and improving performance.
  • Ran feasibility and performance tests and, as a result, used Spark (PySpark, Spark SQL, MLlib) to conduct real-time analysis for spoofing detection on AWS.
  • Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
  • Wrote complex Spark SQL queries for data analysis to meet business requirements.
  • Created multiple custom SQL queries to prepare the right data sets for Tableau dashboards. Queries retrieved data from multiple tables using various join conditions, which enabled efficient use of optimized data extracts for Tableau workbooks.

Environment: SQL Server, Oracle, Teradata, ETL, Alteryx, Tableau Desktop (9.x/10.x), Tableau Server (9.x/10.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), AWS Redshift, EC2, EMR, Hadoop Framework, S3, Spectrum, Spark (PySpark, MLlib, Spark SQL), HDFS, Git, Erwin r9.6, IBM InfoSphere DataStage, R Studio, R 3.5 (Tidyverse, dplyr, ggplot2, caret, Random Forest, XGBoost).

Confidential, California

Data Scientist / Big Data Analyst

Responsibilities:

  • Extracted data such as Medicare, Medicaid and ACA claims from multiple sources like Oracle and Hadoop, using HiveQL, Pig Latin and SQL, for creating the NLP models.
  • Designed ER diagrams (physical and logical, using Erwin) along with data architects, mapped the data into database objects, produced logical/physical data models and validated the data-mapping document from source to target for data quality assessments of the source data.
  • Designed the prototype of the data mart and documented possible outcomes from it for end users.
  • Developed and maintained data dictionary to create metadata reports for technical and business requirements.
  • Created several visualizations (density plots, forest plots, leverage plots, network plots, covariate adjustment plots etc.) using packages such as ggplot2 and ggmcmc to perform exploratory analysis.
  • Performed data pre-processing such as data cleaning, text preprocessing, noise removal, lexicon normalization, named entity recognition, word stemming and object standardization using SpaCy and tm.
  • Performed feature engineering such as word embedding using word2vec models to reduce overfitting.
  • Successfully delivered multiple NLP projects, such as building a chatbot that helps customers troubleshoot claim issues and recommends actions, using the Python NLTK package.
  • Built seq2seq models using a recurrent neural network (LSTM/Memory Network) at the backend, which can take a question of arbitrary length and return an answer in natural language.
  • Analyzed administrative claims data - Medicare and ACA Marketplace to answer health services research questions on costs, utilization or outcomes, using advanced statistical and econometric methods in R.
  • Developed Oozie workflows to ingest/parse the raw data, populate staging tables and store the refined data in partitioned tables in Hive.
  • Hands-on experience with data ingestion into a Big Data platform from disparate data sources using Sqoop, Hive, Pig, Flume and Spark, manipulating large data sets and integrating diverse data sources to create end-to-end data analytics solutions and models.
  • Worked with a team of developers to design, develop and implement BI solutions in Tableau to measure Point of Sale KPIs at micro and macro level.

Environment: Tableau, Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas/NLTK/SpaCy), PyCharm, Statistics, Machine Learning (RNN), Alteryx, Hadoop, Hive, Pig, NoSQL, Salesforce, PL/SQL, Excel, AWS Redshift, EC2, EMR, Hadoop Framework, S3, R 3.5 (ggmcmc, ggplot2, dplyr, tm), Oozie, Spark (PySpark, MLlib, Spark SQL).

Confidential - New York, NY

Senior Data Scientist

Responsibilities:

  • Collaborated with teams of health services researchers and business analysts to draw data insight and intelligence from large administrative claims data, electronic medical records (EMR) and various healthcare registry data to update the billing.
  • Identified solutions to strategic business problems through high-level modeling and statistical analysis techniques.
  • Utilized Spark SQL to perform advanced-level data extraction, data transformation and data management tasks, providing on-the-go responses to management questions by performing complex joins and queries.
  • Responsible for fully documenting, managing library of source code, algorithms for future use.
  • Developed and tested hypotheses (t-test, F-test) using R to support research and product offerings, and communicated findings through data reports and visualizations in a clear, precise, actionable manner.
  • Responded to operational data requests and created ad-hoc queries to support research projects.
  • Worked closely with data management and data integration teams to identify, understand and resolve data issues to improve the efficiency, productivity and scalability of data and data production processes.
  • Used SMOTE to treat highly imbalanced data before prediction to improve model accuracy for symptom prediction.
  • Implemented NLP methods using Python NLTK and SpaCy to process client data such as prescriptive data and customer comments data to improve customer satisfaction.
  • Used data analysis packages (e1071, caTools, scikit-learn) in R and Python, as well as big data tools (AWS, Spark SQL), for query, extraction and manipulation of the data to validate data quality and prepare data.
  • Built many machine learning models such as Random Forests, Logistic Regression, Naive Bayes and RNN to predict the billing amount for each client.
  • Enhanced and tuned the accuracy of existing statistical models (linear models) for predicting the best prices for commercialization by applying an ensemble model with Linear Regression, Logistic Regression, Random Forest, XGBoost and a feed-forward neural network.
  • Collaborated with other data analysts and key stakeholders to identify underlying internal and external trends impacting current and future enrollment and financial considerations, and incorporated the resulting trends into forecast models to make improved predictions.
  • Worked independently to develop models that address specific business problems related to enrollment management, retention, marketing, class scheduling.

Environment: Tableau, Python (Scikit-Learn/SciPy/NumPy/Pandas/NLTK/SpaCy), Statistics, Machine Learning (Random Forests, Logistic Regression, Naive Bayes, RNN), Hadoop, Hive, Pig, NoSQL, PL/SQL, Excel, AWS Redshift, EC2, EMR, Hadoop Framework, S3, R, Spark (PySpark, MLlib, Spark SQL).

Confidential

Data Modeler

Responsibilities:

  • Conducted one-to-one sessions with business users to gather data for Data Warehouse requirements.
  • Part of the team analyzing database requirements in detail with the project stakeholders through Joint Requirements Development (JRD) sessions.
  • Developed an object model in UML for the Conceptual Data Model using Enterprise Architect.
  • Developed logical and physical data models using Erwin to design OLTP systems for different applications.
  • Worked with the DBA group to create a Best-Fit Physical Data Model with DDL from the Logical Data Model using forward engineering.
  • Created entity process association matrices using the Zachman Framework, functional decomposition diagrams and data flow diagrams from business requirements documents.
  • Involved in the detailed design of data marts using Star Schema with shared dimensions.
  • Used the Model Manager option in Erwin to synchronize the data models in the ModelMart approach.
  • Reverse engineered the reports and identified data elements (in the source system) such as Dimensions, Facts and Measures required for proposed new reports.
  • Worked with the ETL team to document the transformation rules for data migration from OLTP to Warehouse environment for reporting purposes.
  • Integrated Salesforce and SQL Server using SQL Server Integration Services.
  • Integrated Spotfire visualization into client's Salesforce environment.
  • Developed data migration and cleansing rules for the Integration Architecture (OLTP, ODS, DW).
  • Used Teradata utilities such as FastExport and MultiLoad for handling various tasks.
  • Involved in projects migrating data from data warehouses on Oracle/DB2 to Teradata.
  • Developed data mapping documents between Legacy, Production, and User Interface Systems.
  • Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
  • Generated ad-hoc reports using Crystal Reports 9 and SQL Server Reporting Services (SSRS).

Environment: Erwin r9.6, DB2, Teradata, SQL Server 2008, Oracle, Tableau, Talend, Informatica 8.1, IBM InfoSphere DataStage 11.x, Enterprise Architect, Power Designer, MS SSAS, Crystal Reports, SSRS, ER Studio, Lotus Notes, Salesforce, MS Excel, Word and Access.

Confidential

Data Modeler/Data Analyst

Responsibilities:

  • Communicated effectively, both verbally and in writing, with the client team.
  • Completed documentation on all assigned systems and databases, including business rules, and processes.
  • Created Test data and Test Cases documentation for regression to validate performance.
  • Designed, built, and implemented relational databases.
  • Determined changes in physical database by studying project requirements.
  • Developed intermediate business knowledge of the functional area and its processes to understand how data and information support the business function.
  • Facilitated gathering of moderately complex business requirements by defining the business problem.
  • Utilized SPSS statistical software to track and analyze data.
  • Optimized data collection procedures to generate reports on a weekly, monthly, and quarterly basis.
  • Used advanced Microsoft Excel features, including pivot tables, VLOOKUP and other Excel functions.
  • Successfully interpreted data to draw conclusions for managerial action and strategy.
  • Created Data chart presentations and coded variables from original data, conducted statistical analysis as and when required and provided summaries of analysis.
  • Maintained data integrity during extraction, manipulation, processing, analysis and storage.

Environment: Data Analysis, SQL, FTP, SFTP, XML, Web Services, MATLAB, Oracle, T-SQL, UNIX Shell Scripting, DB2, Windows XP/NT/2000, Linux CentOS, SQL Server 2005/2008, Sybase IQ, Microsoft Visio, MS Office 2010, MS Access 2010.
