- Data science and statistics professional; creative thinker and problem solver.
- Able to distill high-performing solutions from data to drive business strategy.
- Versatile, results-driven, and meticulous professional in data science and programming.
- Experience in Machine Learning and Data Mining with large structured and unstructured datasets, performing data acquisition, data validation, predictive modeling, and data visualization.
- Experience in text mining: transforming words and phrases in unstructured data into numerical values.
- Used statistical packages in Python and R, together with SQL, to build complex statistical models for predictive analysis, principal component analysis, and cluster analysis. Experienced in designing informative visualizations with Tableau and in publishing and presenting dashboards and storylines on web and desktop platforms.
- Familiarity with developing, deploying, and maintaining production NLP models with scalability in mind.
- Hands-on experience implementing linear discriminant analysis (LDA), linear and logistic regression models, Naïve Bayes, support vector machine classifiers, K-nearest neighbors, Random Forests, Decision Trees, and neural networks, applying know-how of Principal Component Analysis to strengthen Recommender Systems.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGboost, KNN, SVM, neural network, linear regression, lasso regression, and k-means.
- Adept in statistical programming languages like R and Python, and Big Data technologies like Spark, Hadoop 2.0, Hive, and HDFS; experienced in Spark 2.1, Spark SQL, and PySpark.
- Experienced with visualization tools like Tableau, Matplotlib, and ggplot2.
- Skilled in using dplyr and ggplot2 in R, and Pandas, NumPy, Matplotlib, and Seaborn in Python, for performing exploratory data analysis.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables.
- Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
- Expert knowledge in statistics, mathematics, machine learning, recommendation algorithms and analytics with excellent understanding of business operations and analytics tools for effective analysis of data.
- Able to balance the “art and science” of analytics, solving problems with quantitative and qualitative approaches to drive high-end business value.
- Establish scalable, efficient, automated processes for large scale data analyses, model development, model validation and model implementation.
- Drives the analytics roadmap proactively by identifying opportunities in the data based on business priorities, working with all divisions.
- Responsible for solving problems in the domains of e-commerce, shipping, Internet of Things, and spatial analytics with batch, real-time, and predictive models.
- Analyzes large data sets comprising e-commerce data (clickstream, order data, tracking data, competitive price changes, currency fluctuations) to optimize business goals.
- Stays current with research in data science, machine learning, operations research and Natural Language Processing to ensure we are leveraging best-in-class techniques, algorithms, and technologies.
- Works closely with Senior Leadership to champion informatics-based innovation efforts and to develop and execute a prioritized roadmap of analytic studies that targets advanced analytics initiatives.
- Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction
- Proactively researches and develops moderately complex Proofs of Concept that have the potential to serve as conceptual designs that analysts and data science practitioners can use in their respective initiatives.
- Researches and implements methodologies to measure the impact of the technologies.
- Provides business expertise and supports the development of models and analysis to provide the organization with insights.
Data Science Specialties: Natural Language Processing, Machine Learning, Internet of Things (IoT) analytics, Social Analytics, Predictive Maintenance
Version Control: GitHub, Git, SVN
IDE: Jupyter, Spyder, IntelliJ, Eclipse
Data Frameworks: R, Python, HiveQL, Spark, Spark SQL, Storm, Scala, Impala, MapReduce, Kinesis, EMR
Analytic Tools: Classification and Regression Trees (CART), Support Vector Machine, Random Forest, Gradient Boosting Machine (GBM), TensorFlow, PCA, RNN, Regression, Naïve Bayes
Visualization: Tableau, R, R shiny, ggPlot2, PowerBI, seaborn, matplotlib
Modeling and Methods: Bayesian Analysis, Inference, Models, Regression Analysis, Linear models, Multivariate analysis, Stochastic Gradient Descent, Sampling methods, Forecasting, Segmentation, Clustering, Sentiment Analysis, Predictive Analytics
Databases: Azure, Google, Amazon Redshift; HDFS, RDBMS; various SQL and NoSQL databases, data warehouses, and data lakes.
Deep Learning: Machine perception, Data Mining, Machine Learning algorithms, Neural Networks, TensorFlow, Keras, PyTorch
Soft Skills: Able to deliver presentations and highly technical reports; collaboration with stakeholders and cross-functional teams, advisement on how to leverage analytical insights. Development of clear analytical reports which directly address strategic goals.
Senior Data Scientist
Confidential, Denver, CO
- Led a data science product unit using structured and unstructured dialysis patient biometric data with machine learning to predict whether a patient requires rehospitalization and to enable interventions that prevent future hospital visits.
- Identified important and interesting questions about large datasets, then translated those questions into concrete analytical tasks.
- Researched and tested survival models for the data, including state-of-the-art neural networks for survival analysis using the Python deep learning packages Theano, TensorFlow, and Keras.
- Provided evidence that survival analysis was the incorrect machine learning approach for the project and convinced the principal project lead to change to a classification approach.
- Implemented the XGBoost classification algorithm in Python on structured patient biometric data.
- Delivered feature engineering on structured patient biometric data to improve results. Approaches included:
- One-hot encoding categorical data
- Converting data labeled “MISSING” by the original source providers into NumPy NaN format to be usable by the algorithm
- Testing small subsamples of features to determine feature importance
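The encoding steps above can be sketched with pandas; the data and column names here are purely illustrative, not from the actual patient dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical biometric-style data (illustrative column names only).
df = pd.DataFrame({
    "blood_pressure": ["120/80", "MISSING", "130/85", "110/70"],
    "access_type": ["fistula", "graft", "catheter", "fistula"],
})

# Convert source-provider "MISSING" sentinels into NaN so numeric
# algorithms can recognize them as absent values.
df = df.replace("MISSING", np.nan)

# One-hot encode the categorical column into indicator features.
encoded = pd.get_dummies(df, columns=["access_type"], prefix="access")
print(sorted(encoded.columns))
```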
- Tested and implemented multiple ways to handle missing values in the data, including replacing with a measure of central tendency (mean, median), removing values, using tree-based algorithms that can use missing values as decider nodes, and imputing the missing values using the R package MICE (Multiple Imputations by Chained Equations).
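The first two missing-value strategies (central-tendency replacement and row removal) can be sketched with scikit-learn and pandas; the values below are toy data, and MICE itself would require the R package or a chained-equations implementation such as scikit-learn's experimental IterativeImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with missing values (illustrative data only).
X = pd.DataFrame({"potassium": [4.2, np.nan, 5.1, 3.9],
                  "albumin": [3.5, 4.0, np.nan, 3.8]})

# Strategy 1: replace with a measure of central tendency.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(X), columns=X.columns)

# Strategy 2: simply drop rows containing missing values.
dropped = X.dropna()

print(mean_imputed.isna().sum().sum(), len(dropped))
```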
- Introduced new features into the dataset in collaboration with the data engineer and principal project lead, most significantly previous hospital admission count, which led to a significant lift in accuracy.
- Worked in an Anaconda environment with coding in Python and R-Programming.
- Implemented grid search from the scikit learn package in Python to efficiently test multiple hyperparameters for the machine learning algorithm
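The grid-search step can be sketched with scikit-learn's GridSearchCV; GradientBoostingClassifier stands in for XGBoost here so the sketch needs no extra dependency, and the data and parameter grid are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the patient biometric data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate hyperparameters to test exhaustively via cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```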
- Implementations done in collaboration with data engineer led to an over 30% gain in accuracy over previously tested machine learning models
- Produced rank-order feature importance tables to provide subject matter experts with a list of important drivers of dialysis hospitalization.
- Used values from the SHAP library in Python to give subject matter experts individualized drivers on a patient level to help plan treatment and interventions.
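The rank-order feature-importance step can be sketched as follows; SHAP itself requires the `shap` package, so this dependency-light sketch uses scikit-learn's permutation importance as a stand-in for the ranking, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the patient biometric features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Rank features by how much shuffling each one degrades model score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = sorted(enumerate(result.importances_mean),
                 key=lambda p: p[1], reverse=True)
for idx, score in ranking[:3]:
    print(f"feature_{idx}: {score:.3f}")
```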
- Project received significant attention from C-level executives, and as a result of changes implemented, project was approved for pilot testing
- Collaborated with data engineer to introduce Python code into data pipeline to produce machine learning predictions quickly and efficiently.
- Collaborated with data engineer to encode unstructured doctor’s notes into features identified by subject matter experts using Doc2Vec and cosine similarity values for machine learning in Python.
- Experimented with ensemble methods of machine learning analysis to improve prediction results, including stacking Random Forest, Stochastic Gradient Descent Classifier, Support Vector Machines, Naïve Bayes, and K-Nearest Neighbors.
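A stacking ensemble over those base learners can be sketched with scikit-learn's StackingClassifier; the data is synthetic and the meta-learner choice (logistic regression) is an assumption of this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Stack the named base learners behind a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("sgd", SGDClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0)),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression())
scores = cross_val_score(stack, X, y, cv=3)
print(round(scores.mean(), 3))
```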
- Made use of Anaconda environments for dependency control in Python
- Became familiar with HIPAA regulations to protect privacy of subjects in dataset and anonymize data points
- Documented changes and results of experiments through use of Jupyter Notebooks in Python to track versions
- In collaboration with data engineer and subject matter experts, discovered errors in dataset and identified source for correction.
- Created visualizations to help explain the prediction results using a ROC curve in the matplotlib library in Python
- Determined cross correlations among the feature data by producing a heatmap in the seaborn library in Python
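The computations behind those two visualizations (the ROC curve plotted with matplotlib and the correlation heatmap drawn with seaborn) can be sketched as follows, with hypothetical scores and features:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_curve

# Hypothetical model scores and true labels.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

# Points for the ROC curve, and the area under it.
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Cross-correlation matrix that would feed the seaborn heatmap.
features = pd.DataFrame({"bp": [120, 130, 110, 140],
                         "weight": [70, 80, 65, 90],
                         "age": [50, 60, 45, 70]})
corr = features.corr()
print(round(roc_auc, 3))
```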
- Developed a dashboard in Tableau to provide valuable insights to stakeholders
- Created visualizations to help interpret model predictions and explain feature importance
Confidential, Austin, TX
- Used machine learning and statistical techniques to analyze invoices and transactions for large oil company.
- Used Python and Excel to create flat files from invoice data
- Developed Python script to automate comparisons between internal company data and subcontractor invoices
- Used machine learning to detect error rates and flag invoices in need of correction
- Successfully lowered rate of error from subcontracting company
- Along with software engineer, successfully standardized subcontractor reporting system
- Along with software engineer, improved efficiency of cataloging itemized lists of charges on subcontractor invoices using SQL tables
- Worked on creating filters and calculated sets for preparing dashboards and worksheets in Tableau.
- Identified areas of inefficiency and waste that could be improved upon using Excel graphs and Tableau dashboards
- Delivered various complex scorecards, dashboards, and reports.
- Collaborated on database design, data ingestion schemas.
- Developed interfaces with RESTful services.
- Utilized TensorFlow and Keras in Python to create an artificial neural network for the productionized model
Confidential, New York City, NY
- Used Excel to create analytics spreadsheets for outside firms
- Applied Bayesian statistics to financial data to model outcomes of investments using R programming language
- Used time-series analysis and ARIMA modeling in R to predict bond trade fluctuations
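The ARIMA work was done in R; the autoregressive core of such a model can be sketched in Python as a minimal AR(1) least-squares fit on simulated data (in practice a library implementation such as statsmodels' ARIMA would be used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series y_t = 0.6 * y_{t-1} + noise (illustrative data).
n, phi = 500, 0.6
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal(scale=0.5)

# Estimate the AR(1) coefficient by least squares on lagged values;
# this is the autoregressive component of an ARIMA(1, 0, 0) model.
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])

# One-step-ahead forecast from the fitted coefficient.
forecast = phi_hat * y[-1]
print(round(phi_hat, 3))
```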
- Created dashboards of financial data using Tableau and Power BI to present to executive level stakeholders
- Along with business intelligence analyst, drafted and created a proposal to increase efficiency of company's recruitment and training program
- Delivered presentations to C-suite level executives and other nontechnical audiences
- Wrote SQL queries to pull financial transaction data from on-premise Oracle database
- Used R and SQL to clean and transform normalized financial data into flat files for analysis
Senior Data Scientist
Confidential, Austin, TX
- Worked as a data scientist to analyze sentiment in preparation for the iPhone X launch and the critical response to the product release.
- Gathered data from various social media sources to perform sentiment analysis
- Evaluated performance of bag-of-words and TFIDF tokenization
- Performed stemming and lemmatization as well as stop word removal
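The tokenization comparison can be sketched with scikit-learn; the corpus is invented for illustration, and stemming/lemmatization would additionally require a library such as NLTK or spaCy:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny illustrative corpus of product reviews.
docs = ["the phone screen is amazing",
        "battery life is terrible and the phone is slow",
        "amazing camera but terrible battery"]

# Bag-of-words counts with English stop words removed.
bow = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF weighting of the same corpus for comparison.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
print(bow.shape, tfidf.shape)
```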
- Implemented sentiment analysis on large dataset of many customer reviews of products
- Created a convolutional neural network model using TensorFlow and Keras in Python
- Grouped reviews by sentiment score to perform topic modeling and provide insight into data trends
- Created an LDA model in Python with gensim to extract topics from a large corpus of documents
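The LDA topic-modeling step can be sketched as follows; the original project used gensim, but scikit-learn's implementation keeps this sketch self-contained, and the corpus is invented:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus with two loose themes (display vs. battery).
docs = ["screen display resolution bright",
        "battery charge power drain",
        "display screen bright color",
        "battery power charging slow"]

# Term counts feed the LDA topic model.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic distribution
print(doc_topics.shape)
```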
- Created data presentations that reduced bias and told the true story of the people behind the data, pulling millions of rows with SQL and performing exploratory data analysis.
- Applied a breadth of programming knowledge (Python, R) to descriptive and inferential statistics
- Utilized a diverse array of technologies and tools as needed to deliver insights, such as R, SAS, MATLAB, Tableau, and more.
- Involved in extensive ad-hoc reporting, routine operational reporting, and data manipulation to produce routine metrics and dashboards for management
- Created parameters, action filters and calculated sets for preparing dashboards and worksheets in Tableau.
- Interacted with other data scientists and architects to build custom solutions for data visualization using tools like Tableau and Python packages.
- Involved in running Spark jobs for processing millions of records.
- Built and published customized interactive reports and dashboards, including report scheduling, using Tableau Server.
Lead Data Scientist
Confidential, Austin, TX
- Involved in evaluating and prescribing methods for company processes and procedures
- Created content for mentoring individuals in on-level statistics
- Utilized data-driven methodologies for analyzing junior statistician performance, resulting in more effective assessment of junior statistician needs and better support for struggling employees
- Compiled performance data in csv files using R and created reports for review by administration with ggplot2
- Developed statistical models using Bayesian probabilities to predict likelihood of churn
- Examined conditional and marginal probabilities to create a recommender system using collaborative filtering and similarity scores
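A user-based collaborative-filtering recommender built on similarity scores can be sketched with NumPy; the rating matrix is hypothetical:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items).
ratings = np.array([[5.0, 4.0, 0.0, 1.0],
                    [4.0, 5.0, 1.0, 0.0],
                    [1.0, 0.0, 5.0, 4.0]])

# Cosine similarity between users.
unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
sim = unit @ unit.T

# Score items for user 0 by weighting other users' ratings by similarity.
weights = sim[0].copy()
weights[0] = 0.0  # exclude the user themselves
scores = weights @ ratings / weights.sum()
print(np.round(sim, 2))
```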
- Performed Z-tests and t-tests to optimize price points for various products sold by the company
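A two-sample t-test of the kind used for price-point comparisons can be sketched with SciPy; the sales figures are simulated, not from the company's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated weekly sales at two candidate price points (illustrative).
sales_price_a = rng.normal(loc=100, scale=10, size=40)
sales_price_b = rng.normal(loc=108, scale=10, size=40)

# Two-sample t-test: is mean demand different between the price points?
t_stat, p_value = stats.ttest_ind(sales_price_a, sales_price_b)
print(round(p_value, 4))
```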
- Investigated the usability of machine learning in R&D for new products and finding appropriate price points based on similar features to existing products
Lead Data Scientist
Confidential, Harker Heights, TX
- Designed a new training methodology in statistics that met financial standards and requirements
- Researched available data sources and examined the common thread of class imbalance in financial fraud detection
- Instructed employees in statistical methods and data visualization techniques
- Improved model performance by 3 percentage points by utilizing Gaussian Mixture model
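The Gaussian Mixture approach can be sketched with scikit-learn; the two-cluster transaction data below is synthetic, standing in for the normal/anomalous split seen in fraud detection:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic transactions: a dense "normal" cluster and a sparse outlying one.
normal = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
unusual = rng.normal(loc=[4, 4], scale=0.5, size=(20, 2))
X = np.vstack([normal, unusual])

# Fit a two-component Gaussian Mixture; points assigned to the minority
# component (or with low likelihood) can be flagged for fraud review.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print(len(set(labels)))
```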
- Demonstrated to administrative executives the statistical significance of the improvement in model performance and the increased recall in fraud detection
- Led a large team of statisticians in creating mathematical and statistical models to evaluate trends and provide insight into data
- Led a project among employees in predicting the outcome of loans using Bayesian statistics and clustering on customer data in Python with scikit-learn
- Created a dashboard using R to report to stakeholders the estimated monthly return on investments as well as weekly number of fraudulent purchase requests correctly identified
Confidential, Austin, TX
- Examined the relationship between SAT/ACT scores and college admissions
- Performed deep mathematical analysis of large datasets, using R and ggplot to produce visualizations that revealed the relationships and trends within the data
- Investigated the correlations between temperature and energy demand
- Created logistic regression model to demonstrate likelihood of acceptance into various industries
- Performed NLP, topic modeling, and clustering analysis on job titles and descriptions to identify multiple employment opportunities in the same field with different names
- Utilized decision trees in Python to explain feature importance and observe effect of weather data on product sales