Data Scientist Resume
Holmdel, New Jersey
SUMMARY:
- Strong hands-on experience in Data Science, transforming business requirements into actionable data models, predictive models and informative reporting solutions across a variety of industries including Banking and Manufacturing.
- Expert in the Data Science process life cycle including Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine Learning Algorithms, Validation, Visualization and Deployment.
- Strong knowledge in Statistical methodologies such as Hypothesis Testing, Principal Component Analysis (PCA), Sampling Distributions, ANOVA, Chi-Square tests, Time Series, Factor Analysis, Discriminant Analysis.
- Proficient in Python and its libraries such as NumPy, Pandas, Scikit-learn, Matplotlib and Seaborn.
- Proficient in preprocessing data in Python using Visualization, Data Cleaning, Correlation Analysis, Imputation, Feature Selection, Scaling and Normalization, and Dimensionality Reduction methods.
- Experienced in building various machine learning predictive models using algorithms such as Linear Regression, Logistic Regression, Naïve Bayes Classifier, Support Vector Machines (SVM), Neural Networks, KNN, K-means Clustering, Decision Trees, Ensemble methods (Random Forest, AdaBoost, Gradient Boosting, and Bagging).
- Knowledge in Text Mining, Topic Modelling, Association Rules, Sentiment Analysis, Market Basket Analysis, Recommendation Systems, Natural Language Processing (NLP).
- Knowledge of Time Series Analysis using AR, MA, ARIMA, ARCH and GARCH models.
- Experienced in tuning models using Grid Search, Randomized Search and K-Fold Cross Validation (see the tuning sketch after this summary).
- Experience working with Big Data tools such as Hadoop (HDFS, MapReduce), HiveQL, Sqoop, Pig Latin and Apache Spark (PySpark).
- Extensive experience working with RDBMS such as SQL Server, MySQL, Oracle and NoSQL databases such as MongoDB, Cassandra, HBase.
- Adept in developing and debugging Stored Procedures, User-defined Functions (UDFs), Triggers, Indexes, Constraints, Views, Transactions and Queries using Transact-SQL (T-SQL).
- Proficient in developing and designing ETL packages and reporting solutions using MS BI Suite (SSIS/SSRS).
- Experience in building and publishing interactive reports and dashboards with design customizations based on the client requirements in Tableau, Looker, Power BI and SSRS.
- Proficient in data visualization tools such as Tableau, Python Matplotlib, Python Seaborn, R Shiny, R ggplot2 to create visually powerful and actionable interactive reports and dashboards.
- Knowledge and experience working in Waterfall as well as Agile environments including the Scrum process and using Project Management tools like ProjectLibre, Jira/Confluence and version control tools such as GitHub/Git.
- Self-motivated fast learner, effective team lead and team player, with strong management and communication skills.
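A minimal sketch of the model tuning workflow mentioned above, pairing scikit-learn's GridSearchCV with K-Fold Cross Validation; the estimator, parameter grid and synthetic dataset are illustrative assumptions rather than values from any specific engagement.

```python
# Illustrative Grid Search with K-Fold Cross Validation (assumed values).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Candidate hyperparameters (illustrative values).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```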
TECHNICAL SKILLS:
Databases: MS SQL Server 2008/2008 R2/2012/2014/2016, MongoDB 3.x, MySQL 5.x, Oracle, HBase, Amazon Redshift, Teradata
Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Chi-square test, Chebyshev's inequality
Machine Learning: Linear Regression, Logistic Regression, Naïve Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), Neural Networks, Sentiment Analysis, K-Means Clustering, K-Nearest Neighbors (KNN), Ensemble Methods, Gradient Boosting Trees, AdaBoost, PCA, LDA
Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, HiveQL, HDFS, Sqoop, Pig Latin
BI Reporting Tools: Tableau 10.x/9.x, MS SQL Server Integration Services and Reporting Services (SSIS/SSRS), Power BI
Data Visualization: Tableau, Python (Matplotlib, Seaborn), R(ggplot2), Looker, Power BI, QlikView
Languages: Python 2.x/3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R (dplyr, ggplot2, rpart, caret, randomForest, gbm, neuralnet), SQL (T-SQL, MySQL), C++, MATLAB, Octave
Operating Systems: UNIX/UNIX Shell Scripting (via PuTTY client), Linux and Windows XP/7/8/10, Mac OS
Other tools and technologies: Azure ML Studio, Google TensorFlow, Apache Tomcat Webserver, MS Office Suite, Lucid Chart, StatTools, ProjectLibre, Google Analytics, Google Tag Manager, Salesforce, MS SharePoint, Trello, JIRA, Confluence, GitHub/Git, AWS (EC2/S3/Redshift/EMR/Lambda)
PROFESSIONAL EXPERIENCE:
Confidential, Holmdel, New Jersey
Data Scientist
- Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions
- Built machine learning models to predict the lifetime value of the bank's customers using supervised and unsupervised learning methods.
- Extracted data from source systems such as SQL Server and the Hadoop HDFS file system into an Oracle database.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Tackled a highly imbalanced fraud dataset using sampling techniques such as undersampling and oversampling with SMOTE (Synthetic Minority Over-Sampling Technique) via the scikit-learn-compatible imbalanced-learn package in Python (a sketch follows this role's technology stack).
- Applied PCA and other feature engineering techniques to reduce high-dimensional data, along with feature normalization and label encoding, using the Scikit-learn library in Python.
- Used Pandas, NumPy, Seaborn, Matplotlib and Scikit-learn in Python to develop machine learning models such as Gradient Boosting, Lasso/Ridge Regression, Random Forest and stepwise regression.
- Worked with Amazon Web Services cloud offerings to run machine learning on big data, including AWS Lambda functions.
- Developed Spark Python modules for machine learning & predictive analytics in Hadoop.
- Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
- Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods.
- Deployed the model on AWS EC2 using Flask (see the deployment sketch below).
- Created and maintained Tableau reports to display the status and performance of the deployed model and algorithm.
Technology Stack: Hadoop 2.x, HDFS, Hive, Pig Latin, Oracle, MS SQL Server, Apache Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, Spyder, AWS, GitHub, Linux, Tableau.
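A minimal sketch of the SMOTE resampling step from this role, assuming the scikit-learn-compatible imbalanced-learn package; the synthetic dataset merely stands in for the actual fraud data.

```python
# Illustrative rebalancing of an imbalanced dataset with SMOTE.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, highly imbalanced data standing in for the fraud dataset.
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolation.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```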
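And a minimal sketch of serving a trained model with Flask, as in the EC2 deployment bullet; the model artifact name and endpoint path are hypothetical.

```python
# Hypothetical Flask scoring service; "model.pkl" is an assumed artifact name.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the previously trained scikit-learn model from disk.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[...], [...]]}.
    features = request.get_json()["features"]
    preds = model.predict(features).tolist()
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    # On EC2 this would typically run behind a production WSGI server.
    app.run(host="0.0.0.0", port=5000)
```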
Confidential, Arlington, TX
Data Scientist
- Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format.
- Queried and retrieved data from Oracle database servers to get the sample dataset.
- In the preprocessing phase, used Pandas to remove or replace missing data and balanced the dataset by oversampling the minority label class and undersampling the majority label class.
- Applied Scikit-learn preprocessing techniques including PCA, feature engineering, feature normalization and label encoding to reduce high-dimensional data (>150 features) built from full patient visit history, proprietary comorbidity flags and comorbidity scores derived from over 12 million EMR and claims records.
- In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into patient admission and discharge data.
- Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC), Gradient Boosting and Random Forest using Python Scikit-learn to predict whether a patient might be readmitted.
- Designed and implemented cross-validation and statistical tests, including ANOVA and Chi-square tests, to verify the models' significance.
- Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
- Set up a data preprocessing pipeline to guarantee consistency between the training data and incoming data (see the pipeline sketch after this role's technology stack).
- Deployed the model on AWS Lambda.
- Collected feedback after deployment and retrained the model to improve its performance.
- Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
- Used Agile methodology and the Scrum process for project development.
Technology Stack: AWS EC2, S3, Oracle DB, AWS Lambda, Linux, Python (Scikit-Learn/NumPy/Pandas/Matplotlib), Machine Learning (Logistic Regression/Support Vector Machine/Gradient Boosting/Random Forest), Tableau
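A minimal sketch of the kind of preprocessing pipeline referenced in this role, using a scikit-learn Pipeline so that transformations fit on training data are applied identically to incoming data; the column names and steps are assumptions for illustration.

```python
# Hypothetical preprocessing + model pipeline; column names are assumed.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["num_prior_visits", "comorbidity_score"]   # assumed names
categorical_cols = ["admission_type"]                      # assumed name

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# pipe.fit(train_df, train_labels)   # fit once on training data
# pipe.predict(incoming_df)          # identical transforms on new data
```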
Confidential, Atlanta, GA
Data Scientist
- Modeled and simulated the warranty and lease operations of an electromechanical RMC plant using machine learning, statistical modelling and WITNESS simulation software.
- Implemented queuing theory concepts to model the system, verified and validated the model using statistical techniques.
- Measured performance statistics such as number of products shipped, WIP and idle time, and analyzed the primary cost required to run the facility.
- Performed cost analysis, identified bottlenecks in the production process and devised effective solutions that reduced facility cost by up to 7%.
- Utilized decision theory and linear programming methods such as simplex and dual-simplex to find optimal solutions for shift schedules and logistics (see the sketch after this list).
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
- Explored and analyzed the customer specific features by using Matplotlib in Python and ggplot2 in R.
- Performed data imputation using Scikit-learn package in Python.
- Participated in feature engineering such as feature generation, PCA, feature normalization and label encoding with Scikit-learn preprocessing.
- Acquired roughly 120k records from various sources and performed querying operations to obtain the data required for the analysis.
- Loaded the data into SAS, analyzed the dataset and prepared prediction models for various target variables.
- Log & (1/x) Transformations were used before creating prediction models and eliminated outliers using partial regression plots.
- Selected the best model from all candidates using techniques such as forward selection, backward elimination and the stepwise approach.
- Used F-score, AUC/ROC, confusion matrix, precision, and recall to evaluate the performance of different models (a metrics sketch also follows this list).
- Used Python 3.x (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and R (caret, trees, arules) to develop a variety of models and algorithms for analytic purposes.
- Provided delivery recommendations on optimal shift schedules, material handling solutions, staffing and purchases of additional equipment as a function of service demand and production control logic.
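A minimal sketch of the linear programming approach named in the shift scheduling bullet, using scipy.optimize.linprog as a stand-in for the simplex/dual-simplex solvers; every number below is an illustrative assumption.

```python
# Hypothetical shift-scheduling LP; all figures are made-up assumptions.
import numpy as np
from scipy.optimize import linprog

# A[p, s] = 1 if shift s covers period p (three overlapping shifts).
A = np.array([
    [1, 0, 1],   # period 1 covered by shifts 1 and 3
    [1, 1, 0],   # period 2 covered by shifts 1 and 2
    [0, 1, 1],   # period 3 covered by shifts 2 and 3
])
demand = np.array([4, 6, 5])       # staff required per period
cost = np.array([1.0, 1.0, 1.2])   # relative cost per worker per shift

# Minimize total cost subject to A @ x >= demand (linprog expects <=).
res = linprog(c=cost, A_ub=-A, b_ub=-demand, bounds=[(0, None)] * 3)
print(res.x, res.fun)
```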
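And a minimal sketch of the evaluation metrics listed above, computed with scikit-learn on a synthetic train/test split; the classifier choice is arbitrary.

```python
# Illustrative model evaluation with F-score, AUC/ROC, confusion matrix,
# precision and recall on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # scores for the positive class

print("confusion matrix:\n", confusion_matrix(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall:", recall_score(y_te, y_pred))
print("F1:", f1_score(y_te, y_pred))
print("ROC AUC:", roc_auc_score(y_te, y_prob))
```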
Confidential
BI Developer
- Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Experimented with and built predictive models, including ensemble models, using machine learning algorithms such as Logistic Regression, Random Forests and KNN to predict output requirements from demand.
- Conducted analysis of operator behaviors and discovered the value of unaccounted time with RFM analysis; applied Six Sigma/Lean implementations with clustering algorithms such as K-Means Clustering, Gaussian Mixture Models and Hierarchical Clustering.
- Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, utilizing collaborative filtering techniques to recommend courses to different customers (a sketch follows this role's technology stack).
- Designed rich data visualizations to model data into human-readable form with Tableau.
Technology Stack: SQL Server 2008 R2, SQL Server Management Studio, Microsoft BI Suite (SSIS/SSRS), T-SQL, Visual Studio 2010, AWS Redshift, Hadoop, HDFS, Python 3.x (Scikit-learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), R (ggplot2/caret/trees), Tableau 9.x/10.x, Machine Learning (Logistic Regression/Random Forests/KNN/K-Means Clustering/Gaussian Mixture Model/Hierarchical Clustering/Ensemble Methods/Collaborative Filtering), JIRA, GitHub, Agile/Scrum
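A minimal sketch of item-based collaborative filtering along the lines of the recommendation system described above; the interaction matrix is synthetic and the recommend helper is a hypothetical name.

```python
# Hypothetical item-based collaborative filtering on a user-course matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = courses; 1 = interacted (synthetic data).
interactions = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
])

# Course-to-course similarity derived from co-interaction patterns.
item_sim = cosine_similarity(interactions.T)

def recommend(user_idx, top_n=2):
    """Score unseen courses by similarity to the user's courses."""
    seen = interactions[user_idx].astype(bool)
    scores = item_sim[:, seen].sum(axis=1)
    scores[seen] = -np.inf   # never re-recommend a seen course
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))   # top course indices for the first user
```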