
Data Scientist Resume


Holmdel, New Jersey

SUMMARY:

  • Strong hands on experience in the field of Data Sciences transforming business requirements into actionable data models, prediction models and informative reporting solutions working in a variety of industries including Banking and Manufacturing.
  • Expert in the Data Science process life cycle including Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine Learning Algorithms, Validation, Visualization and Deployment.
  • Strong knowledge in Statistical methodologies such as Hypothesis Testing, Principal Component Analysis (PCA), Sampling Distributions, ANOVA, Chi - Square tests, Time Series, Factor Analysis, Discriminant Analysis.
  • Proficient in Python and its libraries such as NumPy, Pandas, Scikit-learn, Matplotlib and Seaborn.
  • Efficient in preprocessing data in Python using Visualization, Data cleaning, Correlation analysis, Imputations, Feature Selection, Scaling and Normalization, and Dimensionality Reduction methods.
  • Experienced in building various machine learning predictive models using algorithms such as Linear Regression, Logistic Regression, Naïve Bayes Classifier, Support Vector Machines (SVM), Neural Networks, KNN, K-means Clustering, Decision Trees, Ensemble methods (Random Forest, AdaBoost, Gradient Boosting, and Bagging).
  • Knowledge in Text Mining, Topic Modelling, Association Rules, Sentiment Analysis, Market Basket Analysis, Recommendation Systems, Natural Language Processing (NLP).
  • Knowledge on Time Series Analysis using AR, MA, ARIMA, GARCH and ARCH model.
  • Experienced in tuning models using Grid Search, Randomized Search, K-Fold Cross Validation.
  • Experience working with Big Data tools such as Hadoop - HDFS and MapReduce, Hive QL, Sqoop, Pig Latin and Apache Spark (PySpark).
  • Extensive experience working with RDBMS such as SQL Server, MySQL, Oracle and NoSQL databases such as MongoDB, Cassandra, HBase.
  • Adept in developing and debugging Stored Procedures, User-defined Functions (UDFs), Triggers, Indexes, Constraints, Views, Transactions and Queries using Transact-SQL (T-SQL).
  • Proficient in developing and designing ETL packages and reporting solutions using MS BI Suite (SSIS/SSRS).
  • Experience in building and publishing interactive reports and dashboards with design customizations based on the client requirements in Tableau, Looker, Power BI and SSRS.
  • Proficient in data visualization tools such as Tableau, Python Matplotlib, Python Seaborn, R Shiny, R ggplot2 to create visually powerful and actionable interactive reports and dashboards.
  • Knowledge and experience working in Waterfall as well as Agile environments including the Scrum process and using Project Management tools like ProjectLibre, Jira/Confluence and version control tools such as GitHub/Git.
  • Self-motivated fast learner; effective team lead and team player with strong management and communication skills.
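As an illustration of the preprocessing and validation workflow described above, the following is a minimal Scikit-learn sketch on synthetic data (the dataset and all parameters are assumptions for demonstration, not drawn from any project listed here):

```python
# Minimal sketch: imputation -> scaling -> PCA -> classifier,
# validated with K-fold cross-validation (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X[::50, 0] = np.nan  # simulate missing values for the imputer to fill

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# K-fold cross-validation guards against overfitting to a single split.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Bundling the preprocessing steps in a `Pipeline` ensures the exact same transforms are refit inside each cross-validation fold, avoiding leakage from the held-out data.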

TECHNICAL SKILLS:

Databases: MS SQL Server 2008/2008 R2/2012/2014/2016, MongoDB 3.x, MySQL 5.x, Oracle, HBase, Amazon Redshift, Teradata

Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Chi-square test, Chebyshev's inequality

Machine Learning: Linear Regression, Logistic Regression, Naïve Bayes, Decision Trees, Random Forest, Support Vector Machine (SVM), Neural Networks, Sentiment Analysis, K-Means Clustering, K-Nearest Neighbors (KNN), Ensemble Methods, Gradient Boosting Trees, AdaBoost, PCA, LDA

Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, Hive QL, HDFS, Sqoop, Pig Latin

BI Reporting Tools: Tableau 10.x / 9.x, MS SQL Server Integration Service and Reporting Service (SSIS/SSRS), Power BI

Data Visualization: Tableau, Python (Matplotlib, Seaborn), R(ggplot2), Looker, Power BI, QlikView

Languages: Python 2.x/3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R (dplyr, ggplot2, rpart, caret, randomForest, gbm, neuralnet), SQL (T-SQL, MySQL), C++, MATLAB, Octave

Operating Systems: UNIX/UNIX Shell Scripting (via PuTTY client), Linux and Windows XP/7/8/10, Mac OS

Other tools and technologies: Azure ML Studio, Google TensorFlow, Apache Tomcat Webserver, MS Office Suite, Lucid Chart, StatTools, ProjectLibre, Google Analytics, Google Tag Manager, Salesforce, MS SharePoint, Trello, JIRA, Confluence, GitHub/Git, AWS (EC2/S3/Redshift/EMR/Lambda)

PROFESSIONAL EXPERIENCE:

Confidential, Holmdel, New Jersey

Data Scientist

  • Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions
  • Built machine learning models to predict customer lifetime value for the bank's customers using supervised and unsupervised learning methods.
  • Extracted data from various source systems, such as SQL Server and the Hadoop HDFS file system, into an Oracle database.
  • Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Tackled a highly imbalanced fraud dataset using sampling techniques such as undersampling and oversampling with SMOTE (Synthetic Minority Over-sampling Technique) in Python Scikit-learn.
  • Used PCA and other feature-engineering techniques to reduce high-dimensional data, along with feature normalization and label encoding, using the Scikit-learn library in Python.
  • Used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python for developing various machine learning models such as Gradient Boosting, Lasso/Ridge Regression, Random forest and step-wise regression.
  • Worked with Amazon Web Services (AWS) cloud services to run machine learning on big data, including the use of Lambda functions.
  • Developed Spark Python modules for machine learning & predictive analytics in Hadoop.
  • Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
  • Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods.
  • Deployed the model on AWS EC2 using Flask.
  • Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.

Technology Stack: Hadoop 2.x, HDFS, Hive, Pig Latin, Oracle, MS SQL Server, Apache Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, Spyder, AWS, GitHub, Linux, Tableau.
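The imbalanced-data handling mentioned in this role can be sketched as follows. This shows plain random oversampling with Scikit-learn utilities as a stand-in, since SMOTE itself ships in the separate imbalanced-learn package; the dataset is synthetic:

```python
# Sketch of handling class imbalance by oversampling the minority class.
# (Random oversampling duplicates minority rows; SMOTE from the
# imbalanced-learn package would generate synthetic samples instead.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic fraud-like dataset: ~5% positive (minority) class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Oversample the minority class up to the majority class size.
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])

# Train a boosting model on the balanced data.
model = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
```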

Confidential, Arlington, TX

Data Scientist

  • Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format.
  • Queried and retrieved data from Oracle database servers to get the sample dataset.
  • In preprocessing phase, used Pandas to remove or replace all the missing data and balanced the dataset with Over-sampling the minority label class and Under-sampling the majority label class.
  • Used PCA and other Scikit-learn preprocessing techniques, including feature engineering, feature normalization and label encoding, to reduce high-dimensional data (>150 features) built from entire patient visit histories, proprietary comorbidity flags and comorbidity scores drawn from over 12 million EMR and claims records.
  • In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
  • Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC), Gradient Boosting and Random Forest using Python Scikit-learn to predict whether a patient might be readmitted.
  • Designed and implemented Cross-validation and statistical tests including ANOVA, Chi-square test to verify the models’ significance.
  • Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
  • Set up a data preprocessing pipeline to guarantee consistency between the training data and incoming data.
  • Deployed the model on AWS Lambda.
  • Collected the feedback after deployment, retrained the model to improve the performance.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
  • Used Agile methodology and the Scrum process for project development.

Technology Stack: AWS EC2, S3, Oracle DB, AWS Lambda, Linux, Python (Scikit-Learn/NumPy/Pandas/Matplotlib), Machine Learning (Logistic Regression/Support Vector Machine/Gradient Boosting/Random Forest), Tableau
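The model-significance checks mentioned above, such as the chi-square test, can be illustrated with SciPy; the contingency counts below are invented for demonstration:

```python
# Sketch of a chi-square test of independence, e.g. checking whether
# readmission is associated with a categorical patient attribute.
# (Counts are hypothetical, for illustration only.)
from scipy.stats import chi2_contingency

# Rows: attribute present / absent; columns: readmitted / not readmitted.
table = [[120, 380],
         [60, 440]]
chi2, p, dof, expected = chi2_contingency(table)
print(p < 0.05)  # True means independence is rejected at the 5% level
```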

Confidential, Atlanta, GA

Data Scientist

  • Modeled and simulated the warranty and lease operations of an electromechanical RMC plant using machine learning, statistical modeling and WITNESS simulation software.
  • Implemented queuing theory concepts to model the system, verified and validated the model using statistical techniques.
  • Measured performance statistics such as number of products shipped, WIP and idle time, and analyzed the primary cost required to run the facility.
  • Developed the cost analysis, identified bottlenecks in the production process and devised effective solutions that reduced facility costs by up to 7%.
  • Utilized decision theory and linear programming methods such as simplex and dual-simplex to find optimal solutions for shift schedules and logistics.
  • Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Explored and analyzed the customer specific features by using Matplotlib in Python and ggplot2 in R.
  • Performed data imputation using Scikit-learn package in Python.
  • Participated in features engineering such as feature generating, PCA, feature normalization and label encoding with Scikit-learn preprocessing.
  • Acquired roughly 120k records from various sources and performed querying operations to obtain the data required for the analysis.
  • Loaded the data into SAS, analyzed the data set and prepared prediction model for various prediction variables.
  • Applied log and reciprocal (1/x) transformations before creating prediction models and eliminated outliers using partial regression plots.
  • Selected the best model out of all the models using techniques like forward elimination, backward elimination and stepwise approach.
  • Used F-Score, AUC/ROC, Confusion Matrix, Precision and Recall to evaluate the performance of different models.
  • Used Python 3.x (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and R (caret, trees, arules) to develop a variety of models and algorithms for analytic purposes.
  • Provided delivery recommendations on optimal shift schedules, material handling solutions, staff employment and purchase of additional equipment as a function of service demand and production control logic.
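The simplex-style shift-scheduling optimization described above can be sketched as a small linear program with SciPy; the cost and demand figures are hypothetical:

```python
# Hypothetical LP sketch of shift scheduling: minimize staffing cost
# subject to covering demand in two shift periods.
from scipy.optimize import linprog

# x = [day_shift_workers, night_shift_workers]; per-worker costs.
c = [100, 120]
# Coverage constraints written in A_ub @ x <= b_ub form
# (i.e., -workers <= -demand means workers >= demand).
A_ub = [[-1, 0],   # day shift must cover day demand (30)
        [0, -1]]   # night shift must cover night demand (20)
b_ub = [-30, -20]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
print(res.x)  # optimal staffing levels per shift
```

With only coverage constraints the optimum sits exactly at the demand levels; in practice, additional constraints (max hours, skill mix, overlap rules) would be added as extra rows of `A_ub`.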

Confidential

BI Developer

  • Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
  • Experimented with and built predictive models, including ensemble models, using machine learning algorithms such as Logistic Regression, Random Forests and KNN to predict output requirements from demand.
  • Conducted analysis on operator behaviors and discovered the value of unaccounted time with RMF analysis; applied Six Sigma/Lean implementation with clustering algorithms such as K-Means Clustering, Gaussian Mixture Models and Hierarchical Clustering.
  • Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, using collaborative filtering techniques to recommend courses to different customers.
  • Designed rich data visualizations to model data into human-readable form with Tableau.

Technology Stack: SQL Server 2008 R2, SQL Server Management Studio, Microsoft BI Suite (SSIS/SSRS), T-SQL, Visual Studio 2010, AWS Redshift, Hadoop, HDFS, Python 3.x (Scikit-learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), R (ggplot2/caret/trees), Tableau (9.x/10.x), Machine Learning (Logistic Regression/Random Forests/KNN/K-Means Clustering/Gaussian Mixture Model/Hierarchical Clustering/Ensemble methods/Collaborative filtering), JIRA, GitHub, Agile/Scrum
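The collaborative-filtering recommendation approach described in this role can be sketched with plain NumPy; the ratings matrix below is invented for illustration:

```python
# Minimal sketch of item-based collaborative filtering with cosine
# similarity. Rows = users, columns = courses; 0 means "not rated".
# (All ratings are made up for demonstration.)
import numpy as np

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

# Cosine similarity between course columns.
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Score courses for user 0 as a similarity-weighted average of their
# ratings, then recommend the highest-scoring unrated course.
user = ratings[0]
scores = sim @ user / np.abs(sim).sum(axis=1)
recommend = int(np.argmax(np.where(user == 0, scores, -np.inf)))
print(recommend)  # index of the course to recommend to user 0
```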
