- Strong grounding in statistics and probability, including statistical modeling and hypothesis testing; proven track record executing machine learning projects
- Familiarity with trends in relevant technologies and shifts in the data analytics climate
- Strong leadership skills with specific experience in the Agile framework; excellent communication skills, both verbal and written
- Competent in taking machine learning models from experimentation to full deployment
- Extensive experience with third-party cloud resources: AWS, Google Cloud, and Azure
- Developed neural network architectures from scratch, including convolutional networks (CNNs), LSTMs, and Transformers; also built unsupervised approaches such as k-means, Gaussian mixture models, and autoencoders
- Proficient in supervised machine learning methods: Linear Regression, Logistic Regression, Support Vector Machines, Random Forests, Gradient Boosting, and Survival Modeling
- Proficient with the NumPy stack (NumPy, SciPy, Pandas, and Matplotlib) and scikit-learn
- Proficient in TensorFlow and PyTorch for building, validating, testing, and deploying reliable deep learning algorithms for specific business challenges
- Experience with ensemble techniques, including Bagging, Boosting, and Stacking; knowledge of Natural Language Processing (NLP) methods, in particular FastText, word2vec, and sentiment analysis
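The bagging/boosting/stacking techniques listed above can be sketched with scikit-learn; the synthetic dataset, model choices, and hyperparameters below are illustrative assumptions, not drawn from any actual project.

```python
# Minimal sketch comparing bagging, boosting, and stacking ensembles
# on synthetic classification data (all settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("bag", BaggingClassifier(n_estimators=10, random_state=0)),
                    ("gbm", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression()),  # meta-learner over base outputs
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Stacking fits the meta-learner on cross-validated base predictions, so it can only help when the base models make complementary errors.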
Programming: Python, Spark, SQL, R, Git, bash
Libraries: NumPy, Pandas, SciPy, scikit-learn, TensorFlow, Keras, PyTorch, statsmodels, Prophet, lifelines, PyFlux, arch, Featuretools, LIME
Version Control: GitHub, Git, BitBucket
IDE: PyCharm, Sublime Text, Atom, Jupyter Notebook, Spyder
Data Stores: large SQL and NoSQL stores, data warehouses, data lakes, Hadoop HDFS, S3
RDBMS: MySQL, MariaDB, PostgreSQL, PL/SQL, T-SQL
NoSQL: Cassandra, MongoDB
Data Warehousing: Amazon Redshift (AWS)
Computer Vision: Convolutional Neural Network (CNN), Faster R-CNN, YOLO
Big Data Ecosystems: Hadoop (HBase, Hive, Pig, RHadoop, Spark, HDFS), Elasticsearch, Cloudera Impala
Cloud Data Systems: AWS (RDS, S3, EC2, Lambda), Azure, GCP
Data Visualization: Matplotlib, Seaborn, Plotly, Bokeh
SENIOR DATA SCIENTIST
- Manipulated GeoTIFF files in MATLAB and Python to conduct spatial analysis, plotting population densities and overlaying other socioeconomic data across regions of the United States and the world.
- Built scraping modules using the Scrapy, BeautifulSoup, and Requests libraries to extract Confidential’s reseller data (locations, discounts, and additional pricing data), along with associated demographic data for specific US regions.
- Created scripts to load, concatenate, and clean multiple data files used by data science team members to analyze and forecast the future behavior of Confidential’s vendors.
- Applied several techniques for forecasting month-ahead loss/gain at each SSG level in Python, including ARIMA, Prophet, and LSTM.
- Constructed production-level code to process new vendor data feeding into Tableau for data analysts to present.
- Used Data Lab, Confidential’s cloud platform, to train different time-series models for vendor forecasting.
- Maintained version control using Confidential’s Box and Quip platforms to synchronize code and data files with data science team members.
SENIOR DATA SCIENTIST
- Built the architecture and trained convolutional neural networks (CNNs) with the PyTorch Python API.
- Exploited transfer learning with custom-built classifiers in PyTorch to speed up production time and improve results.
- Fine-tuned ResNet-50, ResNet-101, and ResNet-152 models to adapt their pre-trained weights to our use case.
- Used a pre-trained YOLOv3 fully convolutional network (FCN) to speed up predictions.
- Accounted for prediction time and overhead to ensure predictions ran in real time.
- Regularized training through data augmentation, applying image transformations with Pillow.
- Worked with large stores of video imaging data stored on AWS S3 buckets for training the model.
- Supplied our pickled model to the software development team to integrate into the drone pilot’s heads-up display (HUD).
- Employed proper version control using git with BitBucket to coordinate with fellow team members.
- Employed AWS SageMaker to explore object detection at a high level and to train the model before opting for a lower-level approach.
- Replaced proprietary software with custom-built algorithms for greater control over the outcomes.
- Evaluated multiple approaches for predicting day-ahead energy demand with Python, including exponential smoothing, ARIMA, Prophet, TBATS, and RNNs (LSTM).
- Built a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model using PyFlux to capture the uncertainty of Dominion’s other time series, ensuring a ‘safety’ stock of generating units.
- Incorporated geographical and socio-economic data scraped from outside resources to improve accuracy.
- Continuously validated models using a train-validate-test split to ensure forecasts were sufficient to drive optimal output from the number of generation facilities needed to meet system load.
- Prevented over-fitting with the use of a validation set while training.
- Built a meta-model to ensemble the predictions of several different models.
- Engineered time-series features using NumPy, Pandas, and Featuretools.
- Coordinated with facility engineers to understand the problem and ensure our predictions were beneficial.
- Participated in daily standups while working in an Agile Kanban environment.
- Queried Hive via Spark using Python’s PySpark library.
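The meta-model ensembling described above (combining predictions of several different models) can be sketched with scikit-learn's stacking regressor; the synthetic demand-like data and the choice of base and final estimators are illustrative assumptions.

```python
# Sketch: a linear meta-model stacked on top of several base regressors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                          # stand-in engineered features
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

meta = StackingRegressor(
    estimators=[("lin", LinearRegression()),
                ("rf", RandomForestRegressor(n_estimators=50, random_state=0))],
    final_estimator=Ridge())                           # meta-model over base predictions
meta.fit(X_tr, y_tr)                                   # base models fit on CV folds
r2 = meta.score(X_te, y_te)
```

The final estimator learns how to weight each base model's out-of-fold predictions, which is what lets the ensemble outperform any single forecaster.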