Data Scientist Resume
Dallas, TX
SUMMARY
- Data scientist with 7 years of cross-industry experience, including healthcare, handling large volumes of structured and unstructured data using Python, R, SQL, Microsoft Excel, and the Hadoop ecosystem (Hive, Sqoop, PySpark, Spark SQL) for data mining, data cleansing, data munging, and machine learning.
- Experienced in data preprocessing steps such as exploration, aggregation, missing-data imputation, sampling, feature selection, dimensionality reduction, and outlier detection.
- Well versed in machine learning algorithms such as linear and logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors.
- Experienced in time series forecasting with auto-ARIMA models and in NLP projects such as text analytics and sentiment analysis in RStudio.
- Experienced with various Python IDEs, including PyCharm, PyScripter, Jupyter, Spyder, and Sublime Text.
- Experienced in data wrangling, data visualization, and reporting using Python and Tableau.
- Built time series and statistical models for sales prediction, with descriptive visualizations of sales data to surface hidden trends and anomalies.
- Built models to associate product sales with products from the same category, aiding business decisions by cleansing, standardizing, and pre-processing data and inferring hidden trends.
- Built models to validate the effect of holiday-season seasonality on product sales, supporting shelf and store cluster analysis.
- Experienced in extracting data from non-traditional sources such as web scraping (see the sketch after this list).
- Extensive experience developing dashboards and reports with tools such as Tableau and Power BI.
- Extensive knowledge of developing Spark SQL jobs using DataFrames.
- Wrote Sqoop scripts for moving data between relational databases, HDFS, and S3 storage.
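An illustrative sketch of the web-scraping extraction mentioned above, assuming a requests + BeautifulSoup workflow; the URL and CSS selectors are hypothetical placeholders, not project specifics.

```python
# Minimal web-scraping sketch: pull a product listing into a DataFrame.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical source page
resp = requests.get(url, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = [
    {
        "name": item.select_one("h2").get_text(strip=True),
        "price": item.select_one("span.price").get_text(strip=True),
    }
    for item in soup.select("div.product")  # hypothetical selectors
]

df = pd.DataFrame(rows)  # hand off to the usual pandas cleaning steps
```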
TECHNICAL SKILLS
Languages: Python, R Programming, SAS, SQL
Cloud Platforms: AWS, GCP (BigQuery), Azure Databricks
Big Data: Hive, Impala, Sqoop, Pig, PySpark
Analysis: Supervised Learning (Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, k-NN), Unsupervised Learning (Clustering, Factor Analysis, PCA), Natural Language Processing, Time Series Forecasting
Serverless ML: BigML, DataRobot, H2O.ai
Relational Databases: Oracle 10g, IBM DB2, PostgreSQL, SAP HANA
NoSQL: MongoDB
Data Visualization: Tableau, Power BI, AWS QuickSight
Specialties (Machine Learning / Predictive Analytics / Text Mining / Market Basket Analysis):
Regression: Simple Linear Regression, Multiple Linear Regression, Logistic Regression
Ensemble: Boosting, Bagging, Stacking
Instance-based: k-Nearest Neighbor (kNN)
Decision Tree Learning: Classification and Regression Tree (CART), Gradient Boosting Machines (GBM)
Clustering: k-Means
Deep Learning: TensorFlow, Keras, PyTorch
Time Series: Moving Average, ARIMA
Text Analytics: NLTK, Pandas, Word Cloud
PROFESSIONAL EXPERIENCE
Data Scientist
Confidential, Dallas, TX
Techniques: Data wrangling, Logistic Regression, Dataset creation, Predictive Modeling, PCA, Time Series Analysis, Random Forest, Decision Tree
Tools: AWS (SageMaker, ECS, Kinesis)
Responsibilities:
- Performed exploratory analysis on product data in Python to understand the structure, attributes, dimensions, missing values, and outliers in the data.
- Detected and treated outliers, then ran stepwise and all-subsets regression to choose effective variables for the machine learning model (see the variable-selection sketch after this list).
- Worked in Jupyter and RStudio on EC2 instances, accessing unstructured data stored in S3 and leveraging AWS machine learning services.
- Well versed in the Hadoop big data ecosystem on AWS EMR clusters and in serverless ML frameworks such as BigML, DataRobot, and H2O.ai.
- Resolved performance issues in Hive and Pig scripts through a solid understanding of joins, grouping, and aggregation.
- Loaded and transformed large sets of structured data from SQL Server via Sqoop into HDFS/Hive for further processing.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Enabled speedy reviews and first-mover advantage by using Oozie to automate data loading into HDFS and Pig to pre-process the data.
- Created Hive tables, loaded them with data, and wrote Hive queries to run test cases.
- Created rich dashboards in AWS QuickSight and Tableau, preparing user stories so each dashboard delivered actionable insights.
- Wrote Dockerfiles, pushed images to ECR, and wrote task definitions to orchestrate containers across target groups, using ECS as the container clustering/management resource.
- Architected a continuous machine learning system that generates predictive models for every product channel, evaluates model performance, selects the best model per channel, and deploys it to production automatically using AWS Lambda and Docker containers (see the model-selection sketch below).
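A minimal sketch of the stepwise variable selection mentioned above, assuming an AIC-driven forward search with statsmodels; the helper name and column names are hypothetical.

```python
# Forward stepwise variable selection by AIC (a common stepwise variant).
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedily add the predictor that most improves AIC; stop when none helps."""
    selected, remaining = [], list(X.columns)
    best_aic = float("inf")
    while remaining:
        aic, col = min(
            (sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic, c)
            for c in remaining
        )
        if aic >= best_aic:  # no candidate improves the fit; stop searching
            break
        best_aic = aic
        selected.append(col)
        remaining.remove(col)
    return selected

# usage (hypothetical columns): chosen = forward_stepwise(df[features], df["sales"])
```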
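And a rough sketch of the per-channel "pick the best model" step, assuming scikit-learn candidates scored by cross-validation; the estimator list and the hand-off to the Lambda/Docker deployment are illustrative, not the production code.

```python
# Score several candidate models and keep the best one per product channel.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

CANDIDATES = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "forest": RandomForestClassifier(n_estimators=200),
}

def best_model(X, y):
    """Return (fitted_model, name, score) for the highest mean CV accuracy."""
    scores = {name: cross_val_score(est, X, y, cv=5).mean()
              for name, est in CANDIDATES.items()}
    name = max(scores, key=scores.get)
    return CANDIDATES[name].fit(X, y), name, scores[name]

# The winning model per channel would then be containerized and promoted
# to production by the Lambda-driven deployment step described above.
```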
Data Scientist
Confidential
Projects / Techniques: Cluster Analysis, Customer Attrition, Predictive Modeling, Data Wrangling
Tools Used: Hadoop cluster, Python, SQL
Responsibilities:
- Delivered end-to-end analytical solutions to business problems: understood the problem and the data, built solutions using statistical techniques in Python or R, and provided recommendations.
- Extracted data from databases and other sources using SAS, SQL, and Excel; prepared it for analysis and modeling; and validated it to ensure integrity and consistency.
- Carried out data cleansing: converted data into structured format, removed outliers, dropped irrelevant columns, and handled missing values by dropping or imputing them with the median, mode, mean, min, or max (see the imputation sketch after this list).
- Dropped highly correlated and low-variance variables and applied transformations to bring distributions closer to normal.
- Developed and scaled machine learning models such as logistic regression, random forests, gradient boosting machines, and support vector machines (SVM) for classification.
- Used Python to prototype and deploy machine learning, deep learning, predictive, probabilistic, and statistical models, along with user interface development.
- Analyzed customer survey data with text analytics (Python's Natural Language Toolkit, NLTK) to enable better customer targeting and to understand concerns raised by consumers.
- Built time series forecasting models (ARIMA, ARIMAX, exponential smoothing) and visualizations using tools such as Tableau, Excel, ggplot2, and Matplotlib (see the ARIMA sketch after this list).
- Applied big data tools including Hadoop, Sqoop, MapReduce, Hive, and Pig.
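A minimal sketch of the median/mode imputation step described above, using pandas; the per-column rule is an illustrative assumption.

```python
# Impute numeric columns with the median and categorical columns with the mode.
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())        # numeric -> median
        elif out[col].notna().any():
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # categorical -> mode
    return out
```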
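And a brief ARIMA forecasting sketch with statsmodels; the file name, column names, and (p, d, q) order are hypothetical placeholders rather than tuned project values.

```python
# Fit an ARIMA model to a sales series and forecast the next 12 periods.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

sales = pd.read_csv("sales.csv", parse_dates=["date"], index_col="date")["units"]

fit = ARIMA(sales, order=(1, 1, 1)).fit()  # hypothetical order
print(fit.forecast(steps=12))              # point forecasts, next 12 periods
```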
Senior Data Analyst
Confidential
Responsibilities:
- Created SQL and SWOT analysis reports and dashboards to provide visibility into products in the supply chain pipeline at major trans-load sites and in transit to delivery centers, improving delivery efficiency by 7%
- Performed data analysis and reporting and supported strategic initiatives for Regional Distribution Center operations, working cross-functionally with Yard Operations, Finance, and Quality to collect, measure, visualize, and interpret large data sets
- Collected data from disparate sources such as MS SQL Server, Excel, and flat files; integrated, analyzed, and interpreted the data; and presented findings as reports and briefings for senior management
- Utilized, created, and maintained SQL scripts used for data exchange and validation
- Investigated ETL job and process failures by checking underlying queries and log files
- Developed database structures, SQL queries, and ETL scripts to extract, transform, and load data from the departmental transaction processing systems into the data warehouse.
Associate - Marketing Analytics
Confidential
Responsibilities:
- Collected and analyzed data on customer demographics, preferences, needs, and buying habits to identify potential markets and factors affecting product demand.
- Helped strategize sampling and data collection for research data analysis
- Maintained, updated, and cleaned datasets generated from various sources to ensure data integrity for downstream analysis
- Explored candidate analytical methods using intuitive, interactive data visualization techniques
- Conducted statistical analyses, including correlation tests and univariate and multivariate modeling, on cross-sectional and longitudinal data to serve the aims of each study (see the correlation sketch after this list)
- Provided clear interpretation of results to meeting participants to facilitate collaboration on current and future projects
- Communicated daily with the project manager and weekly with the principal investigator to provide updates, develop methods, and prevent problems
- Prepared reports of findings, illustrating data graphically and translating complex results into written text.
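A minimal sketch of the correlation testing mentioned above, assuming SciPy on a tabular extract; the file and column names are hypothetical.

```python
# Pearson correlation test between two hypothetical survey variables.
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")  # hypothetical survey extract
r, p = stats.pearsonr(df["ad_spend"], df["purchases"])
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")
```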
Data Analyst
Confidential
Responsibilities:
- Evaluated outcomes experienced by children and adults with severe mental illness (SMI), including the incidence of restrictive behavioral health treatment, involuntary examinations, and criminal justice encounters occurring before and after plan enrollment
- Served as a technical consultant to the Agency, developing reports on patient demographics, disease, and related MMA data
- Developed SAS code to analyze data from multiple sources, such as Baker Act data and healthcare claims data
- Identified and tracked high-risk Medicaid recipients by analyzing patient-, prescriber-, plan-, and provider-level data
- Processed monthly MMA encounter data in SAS, including institutional, professional, and dental claims files
- Extracted data from flat files using PROC IMPORT and merged the datasets as required using PROC SQL (see the sketch after this list)
- Developed SAS macros for weekly, monthly, and quarterly reports
- Created RTF- and PDF-formatted reports using SAS ODS for presentation to the Agency for Health Care Administration
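A rough pandas analogue of the flat-file import and merge step above (PROC IMPORT to read flat files, plus a PROC SQL-style join); the file names, delimiter, and join key are hypothetical placeholders.

```python
# Pandas analogue of PROC IMPORT (read flat files) + PROC SQL (inner join).
import pandas as pd

claims = pd.read_csv("claims.txt", sep="|")          # hypothetical flat file
recipients = pd.read_csv("recipients.txt", sep="|")  # hypothetical flat file

# Merge the datasets on a hypothetical recipient identifier.
merged = claims.merge(recipients, on="recipient_id", how="inner")
```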