- 12 years of experience in the IT industry as a Data Scientist and Data Analyst, working in the Insurance, Banking and Healthcare domains.
- As a data scientist, implemented various machine learning algorithms based on data processing, image processing and text processing to deliver insights and implement action-oriented solutions to complex business problems.
- Experienced in building both supervised and unsupervised machine learning models such as linear regression, logistic regression, decision trees, deep learning neural networks (CNN, RNN, ANN) and K-Means clustering by employing various machine learning languages and technologies - Python, R, Hive, Hadoop, PIG, PySpark, AWS and Azure. Expert at designing data visualizations using Tableau.
- Work with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions.
- Identify relevant data sources and sets to mine for client business needs, and collect large structured and unstructured datasets and variables.
- Perform feature engineering on the data using appropriate Python libraries such as pandas and numpy.
- Identify, analyze and interpret trends or patterns in complex data sets using various regression, classification or clustering ML approaches - Linear & Logistic Regression, Naïve Bayes, Decision Trees, Random Forests, Clustering, SVM, Neural Networks, Principal Component Analysis, Bayesian, XGBoost, Forecasting and Recommender Systems.
- Extensive experience working in cloud-computing environments such as AWS and Azure.
- Identify and integrate new datasets from various data sources, including SQL Server, AWS and Azure, using technologies such as Spark, Hive, HBase and Pig, working closely with the data engineering team to strategize and execute the development of models.
- Perform error analysis by employing appropriate loss metrics such as MSE, RMSE and cross-entropy loss to validate the results for performance and accuracy.
- Enable automation of business processes by developing predictive models for image processing, text processing (NLP) and time series forecasting, using CNN and Naïve Bayes models built in Python, R and PySpark with machine learning libraries including Keras, TensorFlow, Theano and scikit-learn.
- Present data insights using visualization software such as Tableau and Gephi to drive business value and enable marketing decisions.
- Implement Informatica design and development for extracting data from the source systems like SQL, flat files, CSV files, PSV files, XML files and loading into data mart by making new ETL designs.
- Exhibit strong communication skills with the ability to present to technical and business audiences
Machine Learning: Linear Regression, Logistic Regression, Decision Trees, NLP, AI, Ensemble Models (Random Forest), Association Rule Mining (Market Basket Analysis), Apriori, PCA, Factor Analysis, Clustering (K-Means, Hierarchical), Gradient Descent, AdaBoost, XGBoost, SVM, Deep Learning (CNN, RNN, ANN) using TensorFlow (Keras), Image Processing, Theano, Time series / Forecasting, Recommendation Systems, Jupyter UI Widgets
Programming Languages: Python (Scikit Learn), PySpark, Processing and Java
Big data Skills: Hadoop, HDFS, MapReduce, SPARK, Pig, HBASE (NoSQL) and Hive
Devops: Docker, GitHub
Visual analytics tools: Python (3.6), RStudio, PySpark ETL, Informatica Processing, Gephi, Tableau, Orange, WEKA
Cloud (AWS and Azure): EC2, Kinesis, Redshift, Azure with Databricks, HDInsight Cluster
Traditional Database: MySQL, DB2 and Oracle
Statistical Skills: Descriptive and inferential Techniques
Senior Data Scientist
- Worked with stakeholders to understand the business problem and the available data, and drew meaningful insights from the data using machine learning techniques and statistics
- Spearheaded the extraction of information from various data sources including AWS Cloud on EC2 cluster using PySpark socket connection
- Created AWS EC2 clusters, set up Spark jobs in AWS using Python and implemented iterative algorithms such as breadth-first search as part of data management
- Performed image pre-processing to convert each vehicle damage image into a vector of normalized RGB values to be fed into the input layer of the CNN
- Performed image augmentation (rotation, shear, zoom, shift) to increase the number of images available and improve the model’s performance
- Assessed the GPU computational power required for training and testing the model, and implemented it on an Nvidia GeForce RTX 2080 Ti GPU
- Employed an ensemble of a CNN on image data and an ANN on historical claims data to make predictions using a comprehensive collection of data
- Trained a ShallowNet model in a Python Jupyter Notebook to speed up training with minimal computational power
- Analyzed the accuracy and loss metrics and adjusted the model parameters to improve the model’s performance
Environment and Techniques: CNN, ANN, TensorFlow-Keras, Scikit-Learn, PySpark, AWS, Python, Jupyter Notebook, Docker, GitHub
Senior Data Scientist
- Worked with business stakeholders to identify the business requirements and the expected outcome
- Collaborated with subject matter experts to select the relevant sources of information - Data Warehouse and AWS EMR cluster. Used Apache Spark API to interact with data from Apache Spark shell on the cluster
- Employed feature engineering to identify and understand the correlations between features, missing data, temporal variables, outliers by employing K-Means clustering techniques and other visualizations in PySpark.
- Employed feature selection by selecting necessary features, slicing and dicing the data and developing necessary data frames for each of the prediction algorithms using pandas data frames and numpy utilities in Jupyter Notebook.
- Performed data enumeration on string variables and one-hot encoding on the target variable, and applied a standard scaler to normalize the data and improve the loss and accuracy of the model
- Employed and compared classification algorithms (decision trees, Support Vector Machines (SVM), logistic regression and neural networks) and regression algorithms (linear regression and feed-forward neural networks) to pick the best-performing algorithms.
- Used pre-configured tools such as Orange and Weka to compare the model’s performance to that of the models built using machine learning algorithms
- As part of deployment, worked with the data engineering team to re-engineer the code to enable the creation of Docker images and deployment into the AWS cloud, and created a REST API endpoint.
Environment: Python, NumPy, Scikit Learn, Jupyter Notebook, TensorFlow, Theano, XGBoost, Orange, WEKA, Neural Networks, Docker, REST API, AWS - EC2, S3, PySpark, Spark MLlib, Decision Trees - Random Forests, Linear Regression, Logistic Regression, SVM, GitHub, Jenkins.
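For illustration only, the standard scaling and one-hot encoding steps mentioned above could be sketched in plain Python as follows. The function names and sample values are hypothetical and are not taken from the project; the scaling here uses the population standard deviation, which is one common convention.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Return the sorted class list and one 0/1 vector per input label."""
    classes = sorted(set(labels))
    position = {label: i for i, label in enumerate(classes)}
    vectors = [
        [1 if position[label] == j else 0 for j in range(len(classes))]
        for label in labels
    ]
    return classes, vectors


# Hypothetical sample data, used only to show the call pattern.
scaled = standard_scale([1.0, 2.0, 3.0])
classes, encoded = one_hot(["low", "high", "low"])
```

Scaling puts features on a comparable range before they reach a distance-based or gradient-based model, while one-hot encoding gives each class its own column so no artificial ordering is implied.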
- Acquired the necessary understanding of the business need and available data to conduct appropriate analysis through communication with SMEs (Subject Matter Experts) and business stakeholders.
- Identified distinct customer groups based on key attributes such as policy history and feedback, to be used as the data set.
- Imported data from heterogeneous data sources and generated the necessary data frames, such as pandas data frames, using PySpark MLlib (enumerating, one-hot encoding, applying a standard scaler) in order to aid with visualizations.
- Used best practices to develop statistical and machine learning techniques (Neural Networks, Linear Regression, Logistic Regression, Decision Trees) in building customer segmentation and CLTV prediction models in Python.
- Prepared presentations for executives with consolidated reports and meaningful visualizations (bubble charts, heat maps, scatter plots, histograms, etc.) in Tableau, Processing and R to influence marketing decisions.
Environment and Techniques: Python, PySpark, pandas, NumPy, Scikit Learn, Neural Networks, Linear Regression, Logistic Regression, Decision Trees, Processing, Jupyter Notebook, TensorFlow, Tableau, R, AWS, GitHub, Jenkins.
- Created Azure HDInsight clusters, used Hive to establish a connection with the cluster and accessed the relevant data in Avro file format from the HDInsight clusters
- Imported necessary libraries including NLTK, gensim, tokenizer and Keras to enable text processing in Python
- Performed data processing by employing UTF-8 decoding, spell correction, tokenization, lemmatizing, stemming, TF-IDF matrix creation, Word2Vec similarity matrix generation and LDA topic modelling to prepare the data to be fed into the algorithm
- Developed various models including neural networks, decision trees and Naïve Bayes models in Python to compare the performance of the models
- Employed early stopping with a monitored metric to stop training when the model’s accuracy no longer improves significantly
- Prepared and processed the available data and split it into training and testing data sets in order to evaluate the model’s performance
Environment and Techniques: Keras, TensorFlow, Jupyter Notebook, Hadoop, Tokenizer, GitHub, Docker, Azure, Hive, Python, NLTK Libraries, XGBoost, Python dictionaries, Early Stopping, Monitor, LDA, neural networks, decision trees, Naïve Bayes models and Topic Modelling
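As a rough sketch of the early-stopping idea mentioned above, the patience rule can be written in plain Python. This is not the project's code: the function name, the `patience` and `min_delta` parameters, and the sample scores are assumptions chosen for illustration.

```python
def early_stopping_epoch(val_scores, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    val_scores: monitored validation accuracy per epoch (higher is better).
    Training stops once `patience` consecutive epochs fail to beat the best
    score seen so far by more than `min_delta`. If that never happens, the
    index of the last epoch is returned.
    """
    best = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score   # new best: reset the patience counter
            waited = 0
        else:
            waited += 1    # no meaningful improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_scores) - 1


# Hypothetical validation accuracies, one per epoch.
history = [0.70, 0.75, 0.78, 0.78, 0.77, 0.79]
stop_at = early_stopping_epoch(history, patience=2)
```

A larger `patience` tolerates noisy validation curves at the cost of extra epochs; a positive `min_delta` ignores improvements too small to matter.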
- Imported data from various data sources including Azure HDInsight Cluster, Hadoop and Data warehouse
- Performed tokenization, case conversion, word replacement, lemmatizing and stemming during the data preprocessing stage using the NLTK package in Python.
- Used a document-term matrix (DTM) to store the TF-IDF values and then computed cosine similarity in the vector space model to find the similarity between queries and product descriptions.
- Built a machine learning model to predict positive and negative sentiment of texts based on unigram word frequency features as well as other features such as total words in documents and TF-IDF score.
- Pre-Processing was done in Python.
- Built multiple models based on different algorithms such as RNN-LSTM, logistic regression, Naïve Bayes, Random Forest, SVM and AdaBoost. The final model was selected using cross-validation.
- Used a model ensemble (majority voting) as an alternative to improve the performance metrics.
- Communicated findings to leadership and stakeholders to ensure models are well understood and incorporated into business processes
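The TF-IDF and cosine-similarity step described above can be illustrated with a minimal, standard-library sketch. It assumes one common TF-IDF variant (term frequency times log of inverse document frequency); the function names and the toy documents are hypothetical and do not reflect the actual data or code used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy tokenized texts: a query followed by two product descriptions.
docs = [
    ["red", "car", "door"],
    ["blue", "car", "bumper"],
    ["red", "door", "panel"],
]
query_vec, desc_a, desc_b = tfidf_vectors(docs)
sim_a = cosine_similarity(query_vec, desc_a)
sim_b = cosine_similarity(query_vec, desc_b)
```

In this toy case the second description shares two terms with the query and the first shares only one, so `sim_b` comes out higher than `sim_a`.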
Techniques: NLP, Python, Jupyter Notebook, Jupyter UI Widgets, RNN-LSTM, NLTK, TF-IDF, PySpark, logistic regression, Naïve Bayes, Random Forest, SVM, AdaBoost
iGATE (now Capgemini)
- Worked on Informatica tools such as Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Responsible for building the Informatica workflows and source-to-target mappings to load data into the data warehouse.
- Collaborated with the business to identify key functionalities, analyze the source data and design Informatica mappings to point at new data sources.
- Worked on creating Slowly Changing Dimension (SCD) methodologies.
- Worked on Informatica design and development for extracting data from the source systems like SQL, flat files, CSV files, PSV files, XML files and loading into data mart by making new ETL designs.
- Used various transformations like Filter, Expression, Sequence Generator, Update Strategy, Joiner, Stored Procedure, and Union to develop robust mappings in the Informatica Designer.
- Developed cleanup utility shell scripts for Informatica EDD migration.
Environment: Informatica 8.6, DB2, SQL Server, MS Access, Web services, HP Service Manager
- Developed COBOL, DB2 and CICS online programs for accounting and funds-transfer updates
- Developed JCLs and PROCs for batch execution
- Enhanced VSAM for batch programs
- Scheduled, processed and monitored thousands of batch jobs scheduled through CA-7 scheduling system and automated panels.
- Assisted the support team in resolving system ABENDs generated by scheduled and manually submitted batch jobs.
- Developed and updated code in COBOL on the OS/400 operating system for AS/400 systems.
Environment: COBOL, DB2, JCL, VSAM and UNIX UI transactions
- Developed IMS and DB2 batch programs
- Provided on-call production support
- Analyzed COBOL and PL/1 code for system testing and debugging.
- Implemented new programs and supported monthly production jobs
- Performed issue triaging for abended jobs using SPUFI
Technologies: COBOL, JCL, DB2, SPUFI, Abend-Aid