Data Scientist Resume

Wilmington, DE

SUMMARY

  • Over 8 years of IT experience with comprehensive industry knowledge of Data Analysis, Data Manipulation, Data Engineering, Machine Learning, Artificial Intelligence, Statistical Modeling, Predictive Analysis, Data Mining, Data Visualization, and Business Intelligence.
  • Experience in transforming business requirements into actionable data models across a variety of industries, including the Banking/Financial, Healthcare, Pharmaceutical, and Insurance domains.
  • Expertise in performing Feature Selection, Linear Regression, Logistic Regression, k-Means Clustering, Classification, Decision Tree, Support Vector Machine (SVM), Naive Bayes, K-Nearest Neighbors (KNN), Random Forest, Gradient Descent, Hidden Markov Model, and Neural Network algorithms to train and test models on large data sets.
  • Adept in statistical programming languages such as Python, R, and SAS, as well as Big Data technologies such as Hadoop, Hive, HDFS, MapReduce, and NoSQL databases.
  • Experienced in data extraction and manipulation with Python, using widely adopted libraries such as NumPy, Pandas, and Matplotlib for data analysis.
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Proficient in HiveQL, SparkSQL, and PySpark, with in-depth knowledge of the Spark machine learning library MLlib.
  • Hands-on experience provisioning virtual clusters on Amazon Web Services (AWS), including Elastic Compute Cloud (EC2), S3, and EMR.
  • Proficient in designing and creating data visualization dashboards, worksheets, and analytical reports in Tableau, according to end-user requirements, to help users identify critical KPIs and facilitate strategic planning in the organization.
  • Strong knowledge in Statistical methodologies such as Hypothesis Testing, Principal Component Analysis (PCA), Sampling Distributions, ANOVA, Chi-Square tests, Time Series, Factor Analysis, Discriminant Analysis.
  • Extensively worked with other Python libraries such as Seaborn, Scikit-learn, and SciPy, and familiar with TensorFlow and NLTK for deep learning.
  • Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R and Python.
  • Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
  • Exposure to manipulating large data sets using R packages such as tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and to data visualization using ggplot2.
  • Experience in working with Big Data technologies such as Hadoop, MapReduce jobs, HDFS, Apache Spark, Hive, Pig, Sqoop, Flume, Kafka and familiar with Scala Programming.
  • Good understanding of Teradata SQL Assistant, Teradata Administrator, and data load/export utilities such as BTEQ, FastLoad, MultiLoad, and FastExport.
  • Knowledge and experience working in Waterfall as well as Agile environments including the Scrum process and used Project Management tools like ProjectLibre, Jira/Confluence and version control tools such as GitHub/Git.
  • Exposure to Azure Data Lake and Azure Storage.
  • Experienced in Data Integration Validation and Data Quality controls for ETL process and Data Warehousing using MS Visual Studio, SSAS, SSIS and SSRS.

PROFESSIONAL EXPERIENCE

Data Scientist

Confidential, Wilmington, DE

Responsibilities:

  • Develop, monitor, and maintain custom risk scorecards using advanced machine learning and statistical methods; recommend and implement model changes with the credit risk management team to improve the performance of credit functions.
  • Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver the data science solutions.
  • Clean the data using exploratory data analysis (EDA) and Python libraries (NumPy, Pandas), loading it into Pandas DataFrames and replacing missing values with imputation techniques.
  • Used PCA and other feature engineering techniques to reduce high-dimensional data, along with feature normalization and label encoding, using the Scikit-learn library in Python (a minimal sketch follows this list).
  • Pull data from the database using SQL and perform data preprocessing such as cleaning (outlier and missing-value analysis, etc.) and data visualization (scatter plots, box plots, histograms, etc.) using the Matplotlib, Seaborn, and ggplot2 libraries.
  • Experimented with machine learning models including Linear Regression, Logistic Regression, Hidden Markov Models (HMM), Naive Bayes, Decision Trees, Random Forests, KNN, Clustering, Support Vector Machines (SVM), Neural Networks, Principal Component Analysis, and Bayesian methods.
  • Used cross-validation to test the models on various batches of data in order to optimize them and prevent overfitting.
  • Extracted data from SQL Server database, copied into HDFS using Hadoop tools such as Sqoop and Hive to retrieve data required for building the predictive models.
  • Work on AWS services including Amazon Kinesis and Amazon Simple Storage Service (Amazon S3), with Spark Streaming, PySpark, and Spark SQL on top of an Amazon EMR cluster.
  • Stored and retrieved data from data-warehouses using Amazon Redshift.
  • Perform MapReduce jobs in Hadoop and implement Spark analysis using Python for machine learning and predictive analytics on the AWS platform.
  • Create DataFrames in Spark on the Hadoop system using PySpark, then apply HiveQL queries and Spark transformations over Spark RDDs and DataFrames with Python libraries (see the PySpark sketch after this list).
  • Implement statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and NoSQL databases such as MongoDB.
  • Worked on large-scale data sets and extracted data from various database sources such as Oracle, SQL Server, DB2, and Teradata.
  • Utilized PySpark, Spark Streaming, and MLlib in the Spark ecosystem with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elasticsearch on EC2.
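
A minimal sketch of the preprocessing and evaluation workflow described in the bullets above, assuming an illustrative credit-risk extract; the file name, the "risk_grade" target column, the median-imputation and 10-component PCA settings, and the logistic regression model are hypothetical placeholders rather than the production scorecard.

    # Sketch only: imputation, scaling, label encoding, PCA, and cross-validated ROC AUC.
    # File name, columns, and model settings are illustrative assumptions.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    df = pd.read_csv("credit_applications.csv")         # hypothetical extract

    # Label-encode a string target (e.g. "good"/"bad") into 0/1.
    y = LabelEncoder().fit_transform(df["risk_grade"])
    X = df.drop(columns=["risk_grade"]).select_dtypes(include="number")

    model = Pipeline([
        ("impute", SimpleImputer(strategy="median")),    # replace missing values
        ("scale", StandardScaler()),                     # feature normalization
        ("pca", PCA(n_components=10)),                   # reduce high-dimensional data
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # 5-fold cross-validation scored by ROC AUC to guard against overfitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"Mean ROC AUC: {scores.mean():.3f}")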
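
A minimal PySpark sketch of the DataFrame and Spark SQL pattern referenced above; the HDFS paths, the "accounts" view, and its columns are illustrative assumptions, not the actual pipeline.

    # Sketch only: load data into a Spark DataFrame, register a temp view,
    # run a HiveQL-style query, and show an RDD-level transformation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("risk-features").enableHiveSupport().getOrCreate()

    # Read raw records from HDFS into a Spark DataFrame (path is hypothetical).
    accounts = spark.read.parquet("hdfs:///data/accounts")

    # Expose the DataFrame to Spark SQL and aggregate with a HiveQL-style query.
    accounts.createOrReplaceTempView("accounts")
    balances = spark.sql("""
        SELECT customer_id, AVG(balance) AS avg_balance, COUNT(*) AS n_accounts
        FROM accounts
        GROUP BY customer_id
    """)

    # A related aggregation (total balance per customer) expressed directly on Spark RDDs.
    pairs = accounts.rdd.map(lambda row: (row["customer_id"], row["balance"]))
    totals = pairs.reduceByKey(lambda a, b: a + b)
    print(totals.take(5))

    balances.write.mode("overwrite").parquet("hdfs:///data/balances")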

Environment: Python 3.x, Cloudera, Hadoop, Apache Spark, Hive, NumPy, NLTK, Pandas, SciPy, MapReduce, Tableau, Sqoop, HBase, Oozie, HDFS, PySpark, NoSQL, MongoDB, Teradata, SQL Server.

Jr. Data Scientist

Confidential, Irvine, California

Responsibilities:

  • Worked closely with health services researchers and business analysts to draw insight and intelligence from large administrative claims datasets, electronic medical records (EMR) and various healthcare registry datasets.
  • In the data preprocessing phase, used Pandas to remove or replace missing data and balanced the dataset by over-sampling the minority label class and under-sampling the majority label class (see the resampling sketch after this list).
  • Used correlation analysis and graphical techniques with the Python library Matplotlib to gain insights from the data during the exploration stage.
  • Trained and tested models such as Logistic Regression, SVM, Random Forests, and KNN on large data sets.
  • Designed and implemented cross-validation and statistical tests, including ANOVA and chi-square tests, to verify the models' significance (see the SciPy sketch after this list).
  • Implemented big data processing applications to collect, clean, and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive, and HBase.
  • Worked with various file formats such as text, JSON, XML, Avro, and Parquet, and performed machine learning techniques using Python and R.
  • Collaborated with data engineers and ETL team to process and write SQL queries to perform data extraction to fit the analytical requirement.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
  • Developed MapReduce jobs written in Java and used Hive for data cleaning and preprocessing.
  • Identified problems with customer data and developed cost-effective models through root cause analysis.
  • Exported the required data to an RDBMS using Sqoop to make it available to the claims processing team to assist in processing claims.
  • Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
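
A minimal sketch of the over-/under-sampling step referenced above, assuming an illustrative claims extract with a binary "readmitted" label; the file name, the label column, and the sampling sizes are hypothetical.

    # Sketch only: balance an imbalanced label with Pandas resampling.
    import pandas as pd

    df = pd.read_csv("claims.csv")                         # hypothetical claims extract
    df = df.dropna(subset=["readmitted"]).fillna(df.median(numeric_only=True))

    majority = df[df["readmitted"] == 0]
    minority = df[df["readmitted"] == 1]

    # Under-sample the majority class and over-sample (with replacement) the
    # minority class so both classes end up the same size.
    target_size = min(len(majority), 2 * len(minority))
    balanced = pd.concat([
        majority.sample(n=target_size, random_state=42),
        minority.sample(n=target_size, replace=True, random_state=42),
    ])
    balanced = balanced.sample(frac=1, random_state=42)    # shuffle rows
    print(balanced["readmitted"].value_counts())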
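
A small SciPy sketch of the chi-square and ANOVA checks referenced above; the contingency counts and group samples are synthetic placeholders rather than project data.

    # Sketch only: chi-square test of independence and one-way ANOVA with SciPy.
    import numpy as np
    from scipy import stats

    # Chi-square test on a 2x2 contingency table
    # (e.g. model-flagged vs. actual outcome counts, synthetic here).
    contingency = np.array([[420, 80],
                            [60, 440]])
    chi2, p_chi, dof, _ = stats.chi2_contingency(contingency)
    print(f"chi-square={chi2:.2f}, p={p_chi:.4f}, dof={dof}")

    # One-way ANOVA comparing a metric (e.g. cost) across three synthetic groups.
    rng = np.random.default_rng(0)
    group_a = rng.normal(100, 15, 200)
    group_b = rng.normal(105, 15, 200)
    group_c = rng.normal(110, 15, 200)
    f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
    print(f"F={f_stat:.2f}, p={p_anova:.4f}")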

Environment: Python 2.x, Scikit-learn, RStudio, ggplot2, XML, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, SQL Server 2014, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Tableau Data Analyst

Confidential, Santa Ana, California

Responsibilities:

  • Interacted with Business Analysts and Data Modelers to define mapping and design documents for various data sources.
  • Created visual analytics on sales and marketing data using Tableau to assure data integrity and identify the root cause of data inconsistencies.
  • Worked on data cleaning and ensured data quality, consistency, and integrity using the Python libraries Pandas and NumPy (a brief sketch follows this list).
  • Worked on the dashboard reports for the Key Performance Indicators (KPI) for the top management.
  • Extensively worked on creating Aggregates, Hierarchies, Charts, Histograms, Filters, Quick Filters, Cascading Filters, Table Calculations, Trend Lines, Calculated Measures, LOD Expressions, Sets, Groups, Actions, Parameters, etc.
  • Worked with various data sources such as Excel, MySQL, Teradata, and Oracle databases by blending data into a single sheet.
  • Analyzed various reports, dashboards, and scorecards in MicroStrategy and re-created them using Tableau Desktop.
  • Provided customer support to Tableau users and wrote custom SQL to support business requirements.
  • Worked closely with the reporting team to deploy Tableau reports and publish them on the Tableau and SharePoint servers.
  • Optimized query performance by modifying T-SQL queries, removing unnecessary columns and redundant data, normalizing tables, establishing joins, and creating indexes.
  • Worked closely with the ETL team on various troubleshooting issues.
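
A brief Pandas/NumPy sketch of the data-quality checks referenced above; the sales extract, column names, and validity thresholds are illustrative assumptions.

    # Sketch only: basic cleaning and consistency checks before Tableau blending.
    import numpy as np
    import pandas as pd

    sales = pd.read_csv("sales_extract.csv", parse_dates=["order_date"])

    # Quality checks: missing values, duplicates, and out-of-range amounts.
    print(sales.isna().sum())                              # missing values per column
    sales = sales.drop_duplicates(subset=["order_id"])     # one row per order
    sales["amount"] = pd.to_numeric(sales["amount"], errors="coerce")
    sales = sales[sales["amount"].between(0, 1_000_000)]   # drop impossible amounts

    # Standardize region labels before blending with other sources in Tableau.
    sales["region"] = sales["region"].str.strip().str.title().replace({"N/A": np.nan})
    sales.to_csv("sales_clean.csv", index=False)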

Environment: Tableau Desktop 9.x/10.0, Tableau Server 9.x/10.0, Tableau Repository, T-SQL, SQL Server, Pandas, NumPy, MySQL, Crystal Reports, MS Excel, Teradata, SharePoint, Agile.

Data Analyst

Confidential

Responsibilities:

  • Interacted with the Client and documented the Business Reporting needs to analyze the data.
  • Used SAS for pre-processing data, SQL queries, data analysis, generating reports, graphics, and statistical analyses.
  • Supported research staff with technical and programming help; worked with biostatisticians to analyze the results obtained from statistical procedures such as ANOVA and t-tests.
  • Performed data analysis and statistical analysis, and generated reports, listings, and graphs using SAS tools: SAS/Base, SAS/Macros, SAS/Graph, SAS/SQL, SAS/Connect, and SAS/Access.
  • Worked on SSIS Control Flow items (For Loop, Execute package/SQL tasks, Script task, and send mail task) and SSIS Data Flow items (Conditional Split, Data Conversion, Fuzzy lookup, Fuzzy Grouping, Pivot).
  • Used normalization methods up to 3NF and de-normalization techniques for effective OLTP systems.
  • Developed data mapping documentation to establish relationships between source and target tables including transformation processes using SQL.
  • Worked with Entity-Relationship concepts, fact and dimension tables, slowly changing dimensions, and Dimensional Modeling (Star Schema and Snowflake Schema).
  • Tested the database to examine field size validation, check constraints, and stored procedures, verifying them against metadata.

Environment: SAS/Enterprise Guide, SAS/SQL, SAS/BASE, SAS/MACROS, SAS/GRAPH, ANOVA, SQL Server 2008 R2, DB2, MS BI Suite (SSIS/SSRS), T-SQL, SharePoint 2010, Visual Studio 2010, Crystal Reports, Agile/Scrum

BI Developer

Confidential

Responsibilities:

  • Performed data analysis, data migration, data preparation, graphical presentation, statistical analysis, reporting, validation and documentation.
  • Created new query subjects, arranged them in different folder structures for the business view, and manually created query subjects using complex SQL queries.
  • Responsible for the integration of the various data sources, validation, and system monitoring.
  • Accountable for report generation using SQL Server Reporting Services (SSRS) and Crystal Reports based on business requirements, connecting to the Teradata database to generate daily reports.
  • Created sub-reports, drill down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS.
  • Performed gap analysis by assessing the existing system and documenting the enhancements needed to meet end-state requirements.
  • Created, maintained, and adhered to Enterprise Data Modeling Standards while performing analysis of possibly sensitive data, and made recommendations in accordance with the objectives of the project.
  • Developed reports with appropriate properties to render in various output formats such as HTML, PDF, CSV, and Excel, with particular expertise in formatting reports for Excel output.

Environment: ER Studio, MS Access, Teradata, DB2, T-SQL, SSIS, SSRS, SSAS, ETL, SQL Server 2008

TECHNICAL SKILLS

Languages: Python, R, Bash scripting

Packages/libraries: Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, BeautifulSoup, MLlib, ggplot2, Rpy2, caret, dplyr, RWeka, gmodels, NLP, Reshape2, plyr.

Machine Learning: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Association Rule Mining (Market Basket Analysis), Clustering (K-Means, Hierarchical), Gradient Descent, SVM (Support Vector Machines), Deep Learning (CNN, RNN, ANN) using TensorFlow (Keras).

Statistical Tools: Time Series, Regression models, splines, confidence intervals, principal component analysis, Dimensionality Reduction, bootstrapping

Big Data: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, Flume, Oozie, Spark

BI Tools: Tableau, Amazon Redshift, MSBI

Data Modeling Tools: Erwin r, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Databases: MySQL, SQL Server, Oracle, Hadoop/HBase, Cassandra, DynamoDB, Azure Table Storage, Teradata.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, SSRS, IBM Cognos 7.0/6.0.

Version Control Tools: SVN, GitHub

Operating Systems: Windows, Linux, Unix
