Programming: Python (Pandas, NumPy, SciPy, Scikit-Learn, Matplotlib, Seaborn, Statsmodels, NLTK, regex, Plotly), PyTorch, TensorFlow, Keras, SQL, OpenCV, Linux (Ubuntu, Mint), RStudio
Machine Learning & Statistics: Regression (Linear, Logistic, Lasso, Ridge), Random Forest, Gradient Boosting Machines (XGBoost, LightGBM), Ensemble, Stacking, Bayes and Naïve Bayes, Support Vector Machines, K-Nearest Neighbors, Principal Component Analysis, K-Means Clustering
Deep Learning: Artificial Neural Networks, Computer Vision (CNN, YoloV3, Mask-RCNN, SSD), Natural Language Processing (RNN, LSTM, GRU), Word Embedding
Tools: PySpark, Hive, Amazon Web Services (AWS), Google Cloud, Docker, MySQL, Git, GitHub, Tableau, MS Excel.
Confidential, Livonia, MI
- Detected duplicate claims with an Artificial Intelligence solution, improving accuracy and lowering labor costs.
- Used Apache Spark (PySpark) for data cleaning, preprocessing, merging, and feature engineering to handle around 30 million records efficiently, and saved the preprocessed data into HDFS for future analysis.
- Used Spark UDFs and window functions to create new features from raw features for descriptive statistics and modeling.
- Created an end-to-end pipeline including all stages: standardization, encoding, VectorAssembler, and modeling.
- Performed error analysis on wrongly predicted claim records using SHAP. Engaged with subject-matter experts to improve model performance by adding/deleting/updating features and retraining the models.
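A minimal sketch of the standardization -> encoding -> assembler -> model stages described above, written here with scikit-learn's Pipeline rather than Spark ML for brevity; the column names and toy claims data are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical claims data: one numeric and one categorical feature.
df = pd.DataFrame({
    "claim_amount": [120.0, 85.5, 300.0, 120.0, 99.9, 310.0],
    "claim_type": ["glass", "body", "glass", "glass", "body", "glass"],
    "is_duplicate": [0, 0, 0, 1, 0, 1],
})

# Scaling and encoding feed an assembled feature matrix into the classifier,
# mirroring the Standardization -> Encoding -> VectorAssembler -> model stages.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["claim_amount"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["claim_type"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(df[["claim_amount", "claim_type"]], df["is_duplicate"])
preds = pipe.predict(df[["claim_amount", "claim_type"]])
print(preds.shape)  # one prediction per record
```

In Spark ML the same shape would use `StringIndexer`/`OneHotEncoder`, `StandardScaler`, and `VectorAssembler` chained inside a `pyspark.ml.Pipeline`.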
- Developed classification models to classify windshields based on available advanced features in cars and trucks.
- Performed exploratory data analysis by make and model, including cleaning, visualization, missing-value analysis, imputation, correlation analysis, and feature engineering.
- Used models such as SVM, Random Forest, and XGBoost. Built a pipeline to create a separate model for each Car ID; accuracy improved from ~51% to ~72%, resulting in savings of ~$250k per year.
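The per-Car-ID modeling could be sketched as below; the toy data, column names, and the choice of RandomForestClassifier are illustrative assumptions, not the actual production setup:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical windshield data for two Car IDs.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "car_id": np.repeat(["A", "B"], 50),
    "feat1": rng.normal(size=100),
    "feat2": rng.normal(size=100),
})
df["label"] = (df["feat1"] + rng.normal(scale=0.1, size=100) > 0).astype(int)

# Fit one model per Car ID instead of a single global model.
models = {}
for car_id, grp in df.groupby("car_id"):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(grp[["feat1", "feat2"]], grp["label"])
    models[car_id] = clf

print(sorted(models))
```

Training per group lets each model specialize on that Car ID's feature distribution, which is one plausible route to the accuracy gain described above.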
- Built prediction models for SO2 emissions to meet the Environmental Protection Agency's (EPA) requirements.
- Performed extensive cleaning on powerhouse and black-liquor boiler features, and built detailed storytelling visualizations.
- Started with a Linear Regression model for interpretability of internal process variables, then used more advanced models such as Lasso Regression and XGBoost for better accuracy, tuning them with Grid Search and Bayesian Optimization.
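The Grid Search tuning step can be sketched with scikit-learn's `GridSearchCV` over a Lasso model; the synthetic regression data and the alpha grid below are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Hypothetical SO2-emission regression data: 5 process variables, 2 irrelevant.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Exhaustive grid search over the regularization strength with 5-fold CV.
search = GridSearchCV(
    Lasso(),
    {"alpha": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_)
```

Bayesian Optimization would replace the exhaustive grid with a sequential search (e.g. via a library such as `scikit-optimize`), which scales better to larger hyperparameter spaces.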
- Performed RFM analysis using K-Means clustering (an unsupervised model) on customers and sales representatives.
- Used the elbow method and silhouette analysis to determine the optimal number of clusters (4), calculated inter-cluster movement of customers over the last 365 days, extracted the customers the business needed to target, and computed Stock Keeping Units (SKUs) per customer.
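The elbow and silhouette selection described above can be sketched as follows; the blob data stands in for the actual RFM features, which are not shown in this document:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical RFM-style data with well-separated groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_            # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)

# The elbow is read off the inertia curve; the silhouette peak gives a
# complementary, quantitative choice of k.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Using both criteria together guards against the ambiguity of a shallow elbow.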
- Developed predictive analytics models in Python to assess the utility of a summer reading program and the summer slide in reading scores for 100,000+ Pre-K to fifth-grade kids in Ohio schools.
- Achieved accuracy, precision, and recall in the range of 75-80% on average on the validation dataset.
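The three validation metrics quoted above are standard scikit-learn calls; the labels and predictions below are made-up placeholders:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical validation labels vs. model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
print(acc, prec, rec)
```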
Graduate Student Researcher
- Developed a regression algorithm to predict patients' Length of Stay in a Veterans Emergency Department based on features such as arrival time and mode, complaint, diagnosis time, and acuity index, using Random Forest, XGBoost, and LightGBM.
- Achieved a Mean Absolute Error of 73.72 minutes on the validation dataset using LightGBM.
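A sketch of the train/validate/MAE evaluation loop, using scikit-learn's GradientBoostingRegressor as a stand-in for LightGBM; the synthetic length-of-stay data and coefficients are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for LightGBM
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical length-of-stay data in minutes, driven by 4 intake features.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 60 + 30 * X[:, 0] - 15 * X[:, 2] + rng.normal(scale=10, size=400)

# Hold out a validation split and report MAE in the target's units (minutes).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
mae = mean_absolute_error(y_val, model.predict(X_val))
print(round(mae, 2))
```

With LightGBM the only change would be swapping in `lightgbm.LGBMRegressor`; MAE is reported in minutes either way, matching the 73.72-minute figure's units.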