Data Scientist Resume

Irving, TX


  • 7+ years of hands-on experience in Data Science and Analytics, including Data Mining and Statistical Analysis.
  • Actively involved in all phases of the Data Science project life cycle, including Data Extraction, Data Cleaning, Data Visualization and model building with large sets of structured and unstructured data.
  • Strong mathematical knowledge and hands-on experience in implementing Machine Learning algorithms like K-Nearest Neighbors, Logistic Regression, Linear Regression, Naïve Bayes, Support Vector Machines, Decision Trees and Random Forests.
  • Experience with statistical programming languages such as Python and R.
  • Extensive knowledge in building models with Deep Learning frameworks like TensorFlow and Keras.
  • Performed multiple Data Mining techniques like Classification, Clustering, Outlier detection and derived new insights from the data during exploratory analysis.
  • Experience in building machine learning solutions using PySpark for large sets of data on Hadoop System.
  • Hands-on experience in Natural Language Processing (NLP) and the Natural Language Toolkit (NLTK).
  • Performed data manipulation and analysis with Pandas and data visualization with Matplotlib in Python while estimating product demand.
  • Proficient in building and publishing interactive reports and dashboards with design customizations in Tableau.
  • Strong understanding of AWS services like Amazon Redshift and S3.
  • Proficient in writing complex SQL, including triggers, cursors, joins, subqueries, clone tables, null functions and stored procedures.
  • Experience in developing Shell Scripts for system management and automating routine tasks.
  • In-depth knowledge of creating AWS data pipelines to automate data transformation to AWS S3.
  • Demonstrated expertise in administering various databases like Oracle, MySQL and Microsoft SQL Server.
  • Well experienced in normalization & de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Extensive experience in data management, including data analysis, gap analysis and data mapping.
  • Experience with front end technologies like HTML (5), CSS, JavaScript, XML and jQuery.
  • Knowledge and experience in GitHub/Git version control tools.
  • Familiar with data compression techniques to reduce the file size for efficient data transfer across the network.
  • Experience in writing bash commands in Linux and Windows.
  • In-depth knowledge of Software Development Life Cycle (SDLC), Waterfall, Agile/Scrum methodologies, as well as Test-Driven Development.
  • Quick learner who adapts to new requirements and challenges in any software environment to deliver the best solutions.


Languages: Python, R, T-SQL, PL/SQL

Packages/libraries: Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, BeautifulSoup, MLlib, ggplot2, Rpy2, caret, dplyr, RWeka, gmodels, NLP, Reshape2, plyr.

Machine Learning: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Association Rule Mining (Market Basket Analysis), Clustering (K-Means, Hierarchical), Gradient Descent, SVM (Support Vector Machines), Deep Learning (CNN, RNN, ANN) using TensorFlow (Keras).

Statistical Tools: Time Series, Regression models, splines, confidence intervals, principal component analysis, Dimensionality Reduction, bootstrapping

Big Data: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, Flume, Oozie, Spark

BI Tools: Tableau, Amazon Redshift, Birst

Data Modeling Tools: Erwin r, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Databases: MySQL, SQL Server, Oracle, Hadoop/HBase, Cassandra, DynamoDB, Azure Table Storage, Netezza

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, SSRS, IBM Cognos 7.0/6.0.

Version Control Tools: SVN, GitHub

Operating Systems: Windows, Linux, Ubuntu


Confidential, Irving, TX

Data Scientist


  • Developed and deployed predictive models for analyzing customer churn and retention
  • Performed Data Extraction, Data Manipulation and Data Analysis on terabytes of structured and unstructured data
  • Developed machine learning models using Logistic Regression, Naïve Bayes, Random Forest and KNN
  • Performed Data Imputation using the scikit-learn package in Python
  • Created interactive analytic dashboards using Tableau
  • Analyzed customer consumption behavior and assessed customer value with RFM (Recency, Frequency, Monetary) analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering
  • Collaborated with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to extract data that fit the analytical requirements
  • Performed sentiment analysis on feedback forms to capture customer sentiment and categorize customers as positive, negative, angry or happy
  • Ensured that the models had a low False Positive Rate; performed text classification and sentiment analysis on unstructured and semi-structured data
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from SQL Server; used ETL for data transformation
  • Developed a MapReduce pipeline for feature extraction using Hive and Pig
  • Used MLlib, Spark's machine learning library, to build and evaluate different models
  • Communicated findings to team members, leadership and stakeholders to ensure models were well understood and incorporated into business processes

Environment: Python, R, Hadoop, Hive, Pig, Apache Spark, SQL Server 2014, Tableau Desktop, Microsoft Excel, PySpark, Linux, Azure

Confidential - South Plainfield, NJ

Data Scientist


  • Used Tableau to automatically generate reports. Worked with partially adjudicated insurance flat files, internal records, 3rd party data sources, JSON, XML and more.
  • Built models using Spark (PySpark, Spark SQL, Spark MLlib and Spark ML).
  • Used cloud services such as AWS EC2, EMR, RDS and S3 to support big data tools, solve data storage issues and work on deployment solutions.
  • Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, spacetime.
  • Performed Exploratory Data Analysis and Data Visualization using R and Tableau.
  • Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Mahout, Hadoop and MongoDB.
  • Gathered the required data from multiple data sources and created datasets for use in analysis.
  • Extracted knowledge from notes using NLP (Python, NLTK, MLlib, PySpark).
  • Independently coded new programs and designed tables to load and test them effectively for the given POCs using Big Data/Hadoop.
  • Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
  • Built and optimized NLP and text analytics data mining pipelines to extract information.
  • Coded R functions to interface with the Caffe Deep Learning Framework.
  • Worked in the Amazon Web Services cloud computing environment.
  • Interacted with other departments to understand and identify data needs and requirements, and worked with other members of the IT organization to deliver data visualization and reporting solutions that address those needs.
  • Performed EDA, including univariate and bivariate analysis, to understand intrinsic and combined effects.
  • Designed data models and data flow diagrams using Erwin and MS Visio.
  • Established Data architecture strategy, best practices, standards, and roadmaps.
  • Performed data cleaning and imputation of missing values using R.
  • Developed, Implemented & Maintained the Conceptual, Logical & Physical Data Models using Erwin for Forward/Reverse Engineered Databases.
  • Worked with the Hadoop ecosystem, covering HDFS, HBase, YARN and MapReduce.
  • Created customized business reports and shared insights with management.
  • Handled ad-hoc requests from different departments and locations.
  • Used Hive to store the data and perform data cleaning steps for huge datasets.
  • Created dashboards and visualizations on a regular basis using ggplot2 and Tableau.

Environment: R 3.0, Erwin 9.5, Tableau 8.0, MDM, QlikView, MLlib, PL/SQL, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, RStudio, Mahout, Java, Hive, AWS.

Confidential - Bloomfield, CT

Data Scientist


  • Wrote optimized SQL queries for data extraction and implemented the ETL process together with data engineers.
  • Used Matplotlib in Python and dashboards in Tableau as a part of Exploratory Data Analysis (EDA) on customer interest.
  • Ensured data quality, consistency and integrity by cleaning the data using NumPy and Pandas.
  • Analyzed the data and performed data imputation using Scikit-learn package in Python.
  • Imported customer data with Pandas and performed analyses that revealed patterns, which helped the company make key decisions.
  • Applied feature engineering techniques such as PCA and label encoding with Scikit-learn.
  • Developed and built predictive models including ensemble methods such as Gradient boosting trees by Keras to predict Sales amount.
  • Analyzed customers' shopping habits across locations, categories and months, and found patterns using a time series model.
  • Presented data in human-readable form, enriched with data visualizations built in Tableau and Matplotlib.
  • Used Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) to evaluate different models' performance.
  • Collected feedback after deployment and retrained the model to improve performance.

Environment: Machine Learning algorithms (Random Forest, Gradient Boosting trees by Keras), Python 3.x (Scikit-learn, SciPy, NumPy, Pandas, Matplotlib), Tableau.

Confidential - Fairfax, VA

Data Analyst


  • Responsible for gathering requirements from Business Analysts and Operational Analysts and identifying the data sources required for each request.
  • Performed data analysis on many ad hoc requests and critical projects that informed critical business decisions.
  • Performed data verification and validation to confirm that the data generated per the requirements was appropriate and consistent.
  • Served as team database captain, which included monitoring the container space allocated to the team, running weekly user usage reports, checking tables for skew and statistics, and recommending changes to users.
  • Optimized the data environment for efficient access to data marts and implemented efficient data extraction routines for data delivery.
  • Analyzed, designed, coded, tested, implemented and supported data warehousing extract programs and end-user reports and queries.
  • Executed aggregate functions on measures in the OLAP cube to generate information about dynamic trends, including bandwidth consumption and its cost analysis.
  • Helped develop and implement specialized training for effective use of data and reporting resources.
  • Converted an Xcelsius dashboard to a Tableau dashboard with high-quality visualization and good flexibility.
  • Built bar charts, line charts, combo charts and heat maps and incorporated them into dashboards.
  • Worked with the ETL team to fix production data issues such as load delays, missing data and data quality problems. Also involved in modifying existing data warehouse table designs and creating new ones.
  • Wrote hundreds of DDL scripts to create tables and views in the company data warehouse; developed ad hoc reports using Teradata SQL and UNIX.

Environment: Scrum, VersionOne, Oracle, HTML5, Tableau, MS Excel, IdeaBoardz, Server Services, Informatica PowerCenter v9.1, SQL, Microsoft Test Manager, Adobe Connect, MS Office Suite, LDAP, Kerberos, Knox, Ranger, Atlas, Hive, Spark, Pig, Oozie, ZooKeeper, Zeppelin, Sqoop, Kafka, NiFi.

Confidential, Irvine, CA

Data Analyst


  • Identified problems with customer data and developed cost-effective models through root cause analysis.
  • Worked closely with teams of health services researchers and business analysts to draw insight and intelligence from large administrative claims datasets, electronic medical records and various healthcare registry datasets.
  • Developed and tested hypotheses in support of research and product offerings, and communicated findings to stakeholders in a clear, precise and actionable manner.
  • Implemented big data processing applications to collect, clean and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive and HBase.
  • Applied various machine learning algorithms like decision trees, regression models, clustering and SVM to identify volume using Scikit-learn and R packages.
  • Worked with various data formats such as JSON and XML; applied machine learning techniques using Python and R.
  • Generated graphs and reports using ggplot and ggplot2 in RStudio for analyzing models.
  • Integrated R into MicroStrategy to expose metrics determined by more sophisticated and detailed models than natively available in the tool, and communicated with users to build an understanding of the business process.
  • Worked with the BI team in gathering the report requirements and used Sqoop to import data into the Hadoop Distributed File System (HDFS) and Hive.
  • Collected and analyzed internal and external data, corrected data entry errors, and defined criteria for missing values.
  • Developed MapReduce jobs in Java and used Hive for data cleaning and preprocessing.
  • Exported the required data to an RDBMS using Sqoop to make it available to the claims processing team for processing claims.
  • Developed MapReduce programs in Hadoop to extract and transform data sets; exported the results back to an RDBMS using Sqoop.

Environment: Python 2.x, Scikit-learn, RStudio, ggplot2, XML, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, SQL Server, Microsoft Excel, MATLAB, Spark SQL, PySpark.


Data Analyst


  • Analyzed business information requirements and modeled class diagrams and/or conceptual domain models.
  • Managed the project requirements, documents and use cases using IBM Rational RequisitePro.
  • Assisted in building an integrated logical data design and proposed a physical database design for building the data mart.
  • Gathered and reviewed customer information requirements for OLAP and for building the data mart.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Responsible for defining the functional requirement documents for each source to target interface.
  • Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.
  • Updated the Enterprise Metadata Library with any changes or updates.
  • Documented data quality and traceability for each source interface.
  • Performed document analysis involving creation of Use Cases and Use Case narrations using Microsoft Visio, in order to present the efficiency of the gathered requirements.
  • Analyzed business process workflows and assisted in the development of ETL procedures for mapping data from source to target systems.
  • Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
  • Calculated and analyzed claims data for provider incentive and supplemental benefit analysis using Microsoft Access and Oracle SQL.
  • Established standards of procedure.
  • Generated weekly and monthly asset inventory reports.
  • Documented all data mapping and transformation processes in the Functional Design documents based on the business requirements.

Environment: SQL Server 2008 R2/2005 Enterprise, SSRS, SSIS, Crystal Reports, Windows Enterprise Server 2000, DTS, SQL Profiler and Query Analyzer.
