
Lead Data Scientist Resume

New York


Data scientist and data engineer with strong analytical skills and 5 years of professional experience, including manager-level roles, in researching, building, and deploying AI models. Skilled in Python, PySpark, Hive, AWS services, SQL and NoSQL databases, data mining, machine learning, neural networks (deep learning), customer analytics, financial modeling, data visualization, A/B testing, and statistics, with strong domain knowledge in the finance and healthcare sectors.


Programming Languages: R, Python, SQL, PySpark.

Big Data Technologies: PySpark, Hadoop, Google Cloud, Informatica, Azure.

Tools: Tableau, Advanced Excel, MATLAB, BI Reporting.

Databases: MySQL, HDFS, Neo4j, AWS DynamoDB, AWS RDS.

AWS Services: EC2, S3, DynamoDB, RDS, Auto Scaling, Kafka, CloudWatch, ELB, SageMaker, Lambda.

Other Tools: Docker for containerizing code.

Technology Experience: Python, R, Informatica (6 years); SQL, Oracle, Excel, Tableau, Power BI, Hadoop, PySpark (6 years); Machine Learning, Deep Learning, NLP (5 years); AWS (EC2, S3 Buckets, DynamoDB, RDS, Lambda), SAS (4 years).

Machine Learning Techniques: (NumPy, Pandas, Scikit-learn, SVM, Linear & Logistic Regression, Random Forests, Decision Trees, Nearest Neighbors, Apriori, K-Means, DBSCAN clustering, XGBoost, LGBMRegressor)

NLP Techniques: (Tokenization, Bag of Words, TF-IDF, Word Embeddings, Word2Vec, Regular Expressions, Stemming, Lemmatization, LDA, NMF, Naïve Bayes, Latent Semantic Analysis, NLTK, Gensim), OCR, RPA.

Deep Learning Techniques: (CNN, R-CNN, LSTMs, GRUs, Stacked AutoEncoders, GANs), OpenCV, OCR, OpenAI Gym for Reinforcement Learning (TensorFlow + Keras, PyTorch, Theano backend)


Lead Data Scientist

Confidential, New York


  • Analyzed large datasets and extracted insights using Python and PySpark.
  • Ran all computation in a PySpark environment and on Colab GPUs for fast analysis of large data and for model building.
  • Extracted data from EHRs and performed in-depth analysis of it for insights.
  • Created a secured, automated ETL on medical data using Informatica PowerCenter.
  • Stored all data in a graph database to support semantic queries over nodes and edges.
  • Built customized entity extraction on patient-doctor conversations from IBM Watson-generated data by integrating ML and deep learning (DL) with natural language processing (NLP), including:
  • Topic clustering (K-Means, hierarchical): found the optimal number of topics and clusters with optimization algorithms, and built classification models using the techniques below.
  • Recurrent neural networks (single and bidirectional LSTMs, deep bidirectional LSTMs (DBLSTMs), GRUs) built into autoencoders; Transformers (BERT, ALBERT), ULMFiT; word vectors and word embeddings.

Techniques: Coding, cleaning, transformation, and model building in Python, SQL, AWS services, PySpark, Hive, NLP, ML, NLTK, spaCy, Stanford CoreNLP, Inception Annotation, Gensim, T5 Transformers, PyTorch, SageMaker, S3, Tableau.

Health Outcomes Lead Researcher/ Data Engineer

Confidential, Florida


  • Used PySpark for exploratory data analysis and built end-to-end ETL pipelines.
  • Performed exploratory data analysis on billions of patient EHRs, using AWS SageMaker for computation and Scala in a Spark environment for big data analysis; used OCR to extract information from images and diagnostic image reports.
  • Performed feature engineering on 180 features to find the most influential features for predicting the target.
  • Shortlisted the ICD-9 and ICD-10 codes for risky surgeries to help reduce mortality.
  • Built and trained a model that predicts ICD-9 and ICD-10 codes from the patient's EHR.
  • Used SageMaker and S3 buckets for computation.
  • Built a final model that predicts the exact principal diagnosis ICD code with 95.78% accuracy.

Techniques: SQL, PySpark, Python, Pandas, NumPy, NLTK, Scikit-learn, Keras, TensorFlow, SageMaker, AWS Lambda, S3.

Lead Data Scientist

Confidential, NJ


  • Led two projects. The first was a healthcare analytics project to improve patient care quality by predicting the optimal length of stay (LOS) and readmissions, improving operational efficiency.
  • Performed text data engineering (data cleaning, text preprocessing, data extraction) on electronic health records (EHR), along with EDA and visualization (Tableau, Power BI, Excel); used OCR to extract information from images.
  • Predicted the optimal LOS and readmissions for different cases and worked on reducing mortality for risky surgeries.
  • The second project aimed to improve the business and customer satisfaction: applied NLP topic modeling to find the most-discussed negative topics in 400K reviews and built efficient models.
  • Applied text preprocessing, text classification, document similarity (cosine similarity and Euclidean distance), LDA (Latent Dirichlet Allocation), NMF (non-negative matrix factorization), topic modeling on nouns, entity recognition, latent semantic analysis, n-grams, TF-IDF, Naïve Bayes text classification, and visualizations (word phrases, word nets, and word clouds).
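As an illustration of the TF-IDF and cosine-similarity document comparison listed above, here is a minimal standard-library sketch. It is not the project's code: the whitespace tokenizer, the smoothed IDF formula, the function names `tfidf_vectors` and `cosine_similarity`, and the sample reviews are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up reviews, purely for illustration.
reviews = [
    "long wait time at clinic",
    "long wait time at clinic",
    "friendly staff and clean rooms",
]
vectors = tfidf_vectors(reviews)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Identical reviews score close to 1 and reviews with no shared terms score 0; a production pipeline would more likely use a library vectorizer than this hand-rolled version.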

Techniques: Python, NLP, NLTK, Pandas, NumPy, Gensim, spaCy, Transformers, OCR, Tableau, Excel, RPA, PyTorch, SQL, Power BI.

Lead Data Scientist/Data Engineer

Confidential, New York


  • Used Informatica and SQL to extract, transform, and load data from sources such as MySQL Server and Oracle DB, and performed various transformations in PowerCenter.
  • Handled data flow design, flow management, and workflow monitoring across the stages of ETL.
  • Used PySpark and Informatica for ETL pipelines and fast exploration of millions of records.
  • Integrated customer data, purchasing data, service order data, and other datasets.
  • Built analytics-ready datasets from the integrated data.
  • Built an ML model to predict and forecast future sales:
  • Built and deployed a forecasting ML model using past years' sales, reviews, and star ratings.
  • Identified positive and negative opinions in text data using NLP techniques, and predicted future sales from sales data using predictive analytics.
  • Built a machine learning model that classifies and predicts solutions by comparing domain competitors, supporting a cost-efficient marketing strategy; the analysis covered numerical, categorical, and time-series data.
  • Used A/B testing to select the best model and applied classification metrics for model validation. Reported results to the CEO using Tableau.

Techniques: Python, Web Scraping, Data Engineering, Feature Engineering, Scikit-learn, NLTK, Pandas, OCR, NumPy, SciPy and Seaborn.

Research Data Science Intern

Confidential, NJ


  • Performed all phases of text data acquisition, text cleaning, model development, validation, and visualization to deliver data science solutions using Tableau.
  • Used RNNs (LSTMs) for text modeling. Integrated NLP and CNNs to build an image classification model that categorizes images using their associated text, with VGG-16 and VGG-19 for feature extraction.
  • Developed a new model using transfer learning by fine-tuning key hyperparameters of convolutional neural networks (CNNs). Performed sentiment analysis on text data using NLP.

Data Engineer



  • Analyzed large real-world data on company products and sales, delivering meaningful insights through exploratory data analysis (EDA) and predictive analytics.
  • Used H2O GPU computing for model building and fast analysis of large datasets.
  • Used Informatica PowerCenter for data transformations such as manipulation, integration, and aggregation.
  • Built and deployed a recommendation engine to improve sales, using machine learning algorithms such as linear and logistic regression, SVC, SVR, decision trees, and ensembling techniques.
  • Used MySQL, PySpark, and AWS servers for big data processing, and MATLAB for data analytics.
  • Collected, aggregated, and stored web log data from web servers in HDFS.
  • Stored all manipulated datasets in a graph database.

Techniques: Python, Scikit-learn, NumPy, Pandas, Hadoop, Hive, Impala, Spark, MapReduce, PySpark, Informatica, SQL, NoSQL.

Financial Research Analyst



  • Focused on data security, working with SAP Lumira and Informatica to build secured ETL pipelines.
  • Participated in all phases of data acquisition, data cleaning, model development, validation, and visualization to deliver data science solutions that predict fraudulent transactions.
  • Performed fraud detection analysis on payment transactions using transaction history and supervised learning methods.
  • Used ensemble methods, including various bagging and boosting techniques, to increase model accuracy.
  • Used cross-validation to test the models on different batches of data, tuning them and guarding against overfitting.
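To illustrate the cross-validation step mentioned above, here is a minimal k-fold sketch using only the standard library. It is an assumption-laden example rather than the code used in this role: the helper names `k_fold_indices` and `cross_val_score`, the majority-class "model", and the toy labels are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_val_score(fit, score, X, y, k=5):
    """Average held-out score across k folds (hypothetical helper)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return sum(scores) / len(scores)


# Toy stand-in for a classifier: always predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(1 for label in y_test if label == model) / len(y_test)


# Made-up data: 0 = legitimate, 1 = fraudulent.
X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(cross_val_score(fit_majority, accuracy, X, y, k=5))
```

Because every score is computed on data the model did not see during fitting, a large gap between training and cross-validated scores is a common sign of overfitting.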

Techniques: Python, NumPy, Pandas, Machine Learning, SAP, Informatica.
