PySpark Data Scientist Resume

Cleveland, Ohio


  • Over 10 years of experience in data manipulation, wrangling, model building, and visualization with large data sets.
  • An analytical, detail-oriented data science professional with a proven record of success in collecting and manipulating large datasets.
  • Demonstrated expertise in decisive leadership and in delivering research-based, data-driven solutions that move an organization's vision forward.
  • Highly competent at researching, visualizing and analyzing raw data in order to identify recommendations for meeting organizational challenges.
  • Proven excellence in personal management and program development.
  • Unparalleled capacity to link quantitative and qualitative statistics to improvements in operating standards.
  • Ability to perform data preparation and exploration to build the appropriate machine learning model.
  • Proficient in statistical modeling and machine learning techniques for forecasting/predictive analytics, segmentation methodologies, regression-based models, hypothesis testing, factor analysis/PCA, and ensembles.
  • Expertise in machine learning models such as linear regression, logistic regression, decision trees, random forest, SVM, K-Nearest Neighbors, clustering (K-means, hierarchical), and Bayesian methods.
  • Implement and practice Machine learning techniques on structured and unstructured data with equal proficiency.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, Pivot Tables and OLAP reporting.
  • Ability to use dimensionality reduction techniques and regularization techniques.
  • Expert in data flow between primary DB and various reporting tools. Expert in finding Trends and Patterns within Datasets and providing recommendations accordingly.
  • Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
  • Proficient in requirement gathering, writing, analysis, estimation, use case review, scenario preparation, test planning and strategy decision making, test execution, test results analysis, team management and test result reporting.
  • Conducted POCs and research on different customer objectives and generated ROI estimates.
  • Expertise in Hive for data handling; used Azure ML for modeling datasets. Also worked on many dataset problems on Kaggle.


R 3.2.2, SAS 9.2, Python 3.5 & 2.7, PySpark, Spark 2.1, Big data ecosystems (Hive, Spark RDD, Spark ML), JavaScript.

MS Office (Excel, PowerPoint), Tableau, R (ggplot2, Shiny), D3.js.

Topical Expertise: Statistical Modeling, Data Analytics, Machine Learning, Text Mining and Optimization, Web scraping, Factor analysis.

Techniques: Regression, GLM, Trees (Decision trees, Oblique decision trees, CHAID), Random Forest, Clustering (K-means, Hierarchical, SOM), Association Rules, K-Nearest Neighbors, Neural Networks, XGBoost, SVM, Bayesian, Linear Programming, Quadratic Programming, Genetic Algorithm, Ant Colony Optimization, Collaborative Filtering.

Linux administration, Oracle DBA, Teradata SQL, Shell & Batch Scripting, SAS Macros, SSMS.


Confidential, Cleveland, Ohio

PySpark Data Scientist


  • Understood the CCAR and DFAST regulatory requirements applicable to the company.
  • Responsible for researching and developing the action plan required for the development of the model.
  • Provided stakeholders with end-to-end scenarios across the project life cycle.
  • Developed the model entirely in PySpark, creating a PySpark function to replicate each SAS macro.
  • Imported SAS data files into Python (pandas) to test the initial phase of the project.
  • Dealt with data ambiguity and leveraged PySpark's lazy evaluation for code optimization.
  • Created Spark DataFrames and performed quantitative analysis on them.
  • Designed the visualization of the core competencies in the model parameters using ggplot2, Seaborn and Matplotlib.
  • End-to-end responsibility for designing, developing, and debugging the code.
  • Worked in conjunction with the SME (Subject Matter Expert) team to ensure the project objective was delivered successfully.

Environment: Tools: Python 3.6, PySpark, Spark 2.1, Anaconda 4.x

Data sources: External SAS data files, Hadoop, Parquet, CSV.

Reporting platforms: MS Excel, ggplot2, Seaborn and Matplotlib.

Confidential, Austin, Texas

Python Developer/Data Scientist


  • Responsible for researching and developing algorithms that can efficiently learn diverse and powerful representations for many kinds of data, including representations for electrical device power consumption.
  • End-to-end responsibility for designing and architecting the models, and for setting up the cloud platforms for Action and lab manuals.
  • The data collected from the chips is raw time-domain signal data, which is tedious to analyze directly. Transformations were applied to the raw signal to obtain frequency-domain data, making the behavior easier to understand and analyze.
  • Built models using statistical techniques like Bayesian HMM and machine learning classification models like XGBoost, SVM, and Random Forest.
  • A Lambda architecture was proposed for deployment on a Raspberry Pi. The model built on the data was pickled and deployed in a UNIX-like environment to run every second.
  • The outputs of the pickled model were written to a database running on AWS.
  • The data from the database was redirected to the web client and smart app cloud for display using a unified REST API written in Flask.
  • Automated the creation of daily reports, including total power consumption and individual appliance power consumption.
  • The model was further leveraged to detect appliance health and utilization.
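
The time-domain to frequency-domain step mentioned above can be sketched as follows. This is a minimal illustration using a naive discrete Fourier transform in pure Python. The sample rate, the synthetic 5 Hz test signal, and the function names `dft_magnitudes` and `dominant_frequency` are assumptions made for the example, not details of the original project, which may have used a different transform or library.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return the magnitude of each DFT bin up to the Nyquist bin (naive O(n^2) DFT)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = 0j
        for t, x in enumerate(samples):
            total += x * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(total))
    return magnitudes


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component."""
    mags = dft_magnitudes(samples)
    # Skip bin 0 (the DC offset) when looking for the peak.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate_hz / len(samples)


# Hypothetical signal: a 5 Hz sine wave sampled at 64 Hz for one second.
SAMPLE_RATE_HZ = 64
signal = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE_HZ) for t in range(SAMPLE_RATE_HZ)]
peak_hz = dominant_frequency(signal, SAMPLE_RATE_HZ)
```

In the frequency domain the periodic component shows up as a single peak, which is much easier to inspect than the raw waveform. For real workloads an FFT routine would normally replace the naive loop.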

Environment: Statistical tools: R 3.2.2, Python 3.5, Anaconda 2-5.0, Flask

Data sources: MS Excel, PostgreSQL 9.5.5

Reporting platforms: MS Excel, MS PowerPoint.

Confidential, Atlanta, GA

R Developer/Data Scientist


  • Understood the business, the problem statement, and the manual approaches the company had followed for years.
  • Gathered all required data from multiple data sources and created the datasets used in the analysis.
  • Performed data cleaning and transformations to make the data suitable for modeling.
  • Performed EDA, including univariate and bivariate analysis, to understand individual and combined effects.
  • Worked on multiple predictive models to predict future electricity and gas bills based on current usage patterns.
  • Automated stepwise regression and model selection based on AIC values and p-values.
  • Performed cross-validation, checking MAPE values, to self-train the model.
  • Generated a log file for every step so that errors can be traced once deployed in production.
  • Performed statistical tests and applied regression techniques to predict the power.
  • Created dashboards and visualizations on a regular basis using ggplot2 and Tableau.
  • Created customized business reports and shared insights with management.
  • Took up ad-hoc requests from different departments and locations.
  • Created a demand chart comparing a full week’s consumption.
  • Engaged with the client on a day-to-day basis to report progress.
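
The cross-validation and MAPE check listed above can be illustrated with a short sketch. The original work was done in R; this Python version is only an illustration, and the single-predictor model, the interleaved fold scheme, the synthetic usage and bill figures, and the function names are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def cross_validated_mape(xs, ys, folds=4):
    """Average held-out MAPE over `folds` interleaved folds."""
    scores = []
    for fold in range(folds):
        train = [i for i in range(len(xs)) if i % folds != fold]
        test = [i for i in range(len(xs)) if i % folds == fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [intercept + slope * xs[i] for i in test]
        scores.append(mape([ys[i] for i in test], predictions))
    return sum(scores) / len(scores)


# Hypothetical data: monthly usage (kWh) and bill amount with a small alternating deviation.
usage = [100 + 25 * i for i in range(12)]
bills = [20 + 0.12 * u + (1.5 if i % 2 == 0 else -1.5) for i, u in enumerate(usage)]
cv_mape = cross_validated_mape(usage, bills)
```

Scoring each fold on data the model did not see gives a more honest error estimate than the in-sample MAPE, which is why it is a reasonable gate before deploying a forecasting model.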

Environment: Statistical tools: R 3.2.2, Python 2.7, Microsoft Excel

Data sources: MS Excel, MySQL.

Reporting platforms: ggplot2, MS Excel and MS PowerPoint.


Data Scientist


  • Data from Teradata server was pulled at Week-Market level.
  • Additional required data, such as calamities, holidays, and market factors, was collected from third-party data sources.
  • Validated data elements using exploratory data analysis (univariate, bivariate, and multivariate analysis).
  • Performed missing value treatment, outlier capping, and anomaly treatment using statistical methods, and derived customized key metrics.
  • Dummy variables were created for certain datasets to feed into the regression.
  • Variable selection was done using forward stepwise regression, R-squared, and VIF values.
  • Multiple regression techniques were used and tested. Robust regression was finalized based on the feasibility and accuracy of results.
  • Robust regression is an alternative to least squares regression when data is contaminated with outliers or influential observations, and it can also be used to detect influential observations. MAPE was prioritized among the evaluation statistics.
  • The outputs from the regression were used in calculating the price elasticity.
  • After rigorous testing and analysis, the process was automated to run for all markets without human intervention.
  • Used Hadoop Hive to fit the complete data and Hive queries to perform data munging.
  • Designed the data engineering and an optimized architecture for efficient use of resources.
  • Used Tableau to refresh and make changes to the dashboards.
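
One common way to derive price elasticity from regression output is a log-log model, where the slope of log(quantity) on log(price) is the elasticity. The sketch below assumes that form, which the resume does not confirm, and it uses ordinary least squares rather than the robust regression named above to keep the example short. The price and quantity figures and the function name `log_log_elasticity` are hypothetical.

```python
import math


def log_log_elasticity(prices, quantities):
    """Estimate price elasticity as the OLS slope of log(quantity) on log(price)."""
    log_p = [math.log(p) for p in prices]
    log_q = [math.log(q) for q in quantities]
    n = len(log_p)
    mean_p = sum(log_p) / n
    mean_q = sum(log_q) / n
    sxx = sum((x - mean_p) ** 2 for x in log_p)
    sxy = sum((x - mean_p) * (y - mean_q) for x, y in zip(log_p, log_q))
    return sxy / sxx


# Hypothetical weekly prices and units sold for one market, generated from
# quantity = 500 * price ** -1.5 so the true elasticity is known to be -1.5.
prices = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
quantities = [500 * p ** -1.5 for p in prices]
elasticity = log_log_elasticity(prices, quantities)
```

An elasticity below -1 indicates elastic demand: a 1% price increase reduces volume by more than 1%. With outlier-contaminated market data, a robust estimator would replace the plain least squares slope used here.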

Environment: Statistical tools: R 3.2.2, R - Hive, Hadoop 2.6, SAP HANA, MS Excel

Data sources: External data files in Hive, Teradata 13.10.

Reporting platforms: Tableau 8.3, MS Excel, MS PowerPoint.


Data Scientist


  • Engaged with management to define the scope and draft requirements.
  • Obtained the churned-customer flag by understanding the parameters that influence the objective of the study.
  • Examined the feasibility of data requirements along with the data warehousing team.
  • Defined the problem DNA, factor map, and hypothesis matrix to get approvals from the respective technical manager for data marts and data fields covering customer demographics, income, calls, age, gender, etc.
  • Performed EDA using PROC SQL.
  • Built programming logic for developing analysis datasets by integrating various data marts in the sandbox environment.
  • Validated data elements using exploratory data analysis.
  • Performed missing value treatment, outlier capping, and anomaly treatment using statistical methods, and derived customized key metrics.
  • Segmented the customer base using demographics and transactional behaviors with advanced statistical methods: cluster analysis (agglomerative and divisive).
  • Treated multicollinearity by conducting statistical tests.
  • Calculated churn scores using logistic regression and CHAID, satisfying business assumptions, model fit, and significance of the model parameters.
  • Measured model fit using hit ratio, ROC curves, concordance, lift charts, and gains charts to select the best machine learning model.
  • Identified segments that are highly likely to churn relative to the rest of the customer segments.
  • Identified potential reasons for churn against CRM data sources.
  • Determined customer lifetime values for customers highly likely to churn.
  • Implemented the churn score model in the production environment after sign-off from management.
  • Teamed up with the digital channel team to ensure customer retention promotions reach the right customers at the right time.
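
Two of the model-fit measures named above, concordance and hit ratio, can be sketched in a few lines. The original analysis was done in SAS; this Python version is illustrative only. The scores, labels, 0.5 cutoff, and function names are hypothetical, and counting tied pairs as half is an assumption that makes the concordance figure equal to the area under the ROC curve.

```python
def concordance(scores, labels):
    """Share of (churner, non-churner) pairs where the churner has the higher score.

    Ties count as half, which makes this equal to the area under the ROC curve.
    """
    churners = [s for s, y in zip(scores, labels) if y == 1]
    stayers = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for c in churners:
        for s in stayers:
            if c > s:
                concordant += 1.0
            elif c == s:
                concordant += 0.5
    return concordant / (len(churners) * len(stayers))


def hit_ratio(scores, labels, threshold=0.5):
    """Fraction of customers classified correctly at the given score cutoff."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)


# Hypothetical churn scores and observed outcomes (1 = churned).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = concordance(scores, labels)
accuracy = hit_ratio(scores, labels)
```

Concordance does not depend on a cutoff, so it is useful for comparing candidate models, while the hit ratio reflects performance at the specific threshold used to flag customers for retention offers.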

Environment: Statistical tools: SAS Base 9.2, SAS Enterprise Guide 4.3, SQL, Microsoft Excel

Data sources: External Data files in .txt, .csv, .xlsx, and transactional databases - Teradata

Reporting platforms: SAS ODS, MS Excel.
