- Over 6 years of professional experience as a Data Scientist across Healthcare, Supply Chain Management, and Insurance, with a master's degree in Business Analytics.
- Experience in Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
- Adept in statistical programming languages such as R and Python, and in Big Data technologies such as Hadoop and Hive.
- Experienced in Data Cleaning, Model Training and Building, Model Testing and Model Deployment.
- Knowledge of the SDLC, including Development, Component Integration, Performance Testing, Deployment, and Support and Maintenance.
- Built machine learning models such as Linear and Logistic Regression, Decision Trees, Random Forest, Extra Trees, KNN, K-Means, and Naïve Bayes for structured, semi-structured, and unstructured data.
- Worked with Data Visualization tools like Tableau and Google Analytics. Performed both descriptive and predictive statistical analysis using machine learning algorithms.
- Used Python libraries like NumPy, pandas, SciPy, scikit-learn, Matplotlib, Seaborn, and datetime.
- Worked on the Hadoop platform using Hive and Pig. Used Hive to convert queries into MapReduce jobs.
- Experienced in SQL programming and creation of relational database models. Experienced in creating cutting-edge data processing algorithms to meet project demands.
- Wrote complex structured queries using views, triggers, and joins. Worked with packages like Matplotlib, Seaborn, and pandas in Python.
- Connected Python to Hadoop, Hive, and Spark to perform data analytics. Worked on large datasets of structured, semi-structured, and unstructured data.
- Experienced in Linear Regression, Logistic Regression, Random Forest, Decision Trees, Naïve Bayes, K-Means.
- Worked with current techniques and approaches in Natural Language Processing. Strong understanding of statistical analysis and modeling, algorithms, and multivariate analysis; familiar with model selection, testing, comparison, and validation.
- Experience in machine learning for NLP text classification and prediction using Python.
- Worked with the Amazon Web Services environment for database storage. Identified business problems and provided solutions using data processing and data visualization.
- Worked with Python libraries like Matplotlib, NumPy, SciPy, and pandas for data analysis. Connected Python to Hadoop, using Hive and Spark for data analysis.
- Able to work independently and to solve problems as part of a team. Excellent skills in using pandas in Python and dplyr in R for exploratory analysis.
- Good understanding of installing, configuring, supporting, and managing Hadoop clusters on Amazon Web Services (AWS).
- Good understanding of Natural Language Processing. Good analytical and problem-solving skills; able to work both within a team and individually.
- Extensive experience in Data Visualization using tables, lists, and tools like Tableau. Experience with Business Intelligence and ETL tools like SSIS and SSRS.
- Proficient in the design and development of dashboards and reports using Tableau visualizations such as bar graphs, scatter plots, pie charts, and geographic maps, making use of actions, local and global filters, cascading filters, context filters, quick filters, and parameters according to end-user requirements.
- Experience in Apache Spark and Kafka for Big Data processing, and in Scala functional programming. In-depth knowledge and hands-on experience of the Big Data/Hadoop ecosystem (MapReduce, HDFS, Hive, Pig, and Sqoop).
- Experience in manipulating large datasets with R packages like tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and visualizing the data using the lattice and ggplot2 packages. Experience in dimensionality reduction using techniques like PCA, LDA, and factor analysis.
- Strong experience in provisioning virtual clusters on the AWS cloud using services like EC2, S3, and EMR.
- Experience in migration from heterogeneous sources, including Oracle, to MS SQL Server. Experience in writing SQL queries and working with various databases (MS Access, MySQL, Oracle DB).
- Worked with Jupyter Notebook and PySpark on a cloud EC2 instance accessed through PuTTY; evaluated models using cross-validation, log loss, and ROC curves, and used AUC for feature selection.
- Experience in data analytics and predictive analysis, including Classification, Regression, and Recommender Systems. Experience in developing custom reports and different types of tabular reports, matrix reports, ad hoc reports, and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
- Used Elasticsearch as a NoSQL data store to store, retrieve, and manage documents in JSON format.
- Good knowledge of CNN, RNN, and LSTM.
- Used TensorFlow to build data flow graphs and to understand deep learning models.
Languages: R, Python, MySQL, C++
Operating Systems: Microsoft Windows, macOS, Linux
Document Management: SharePoint 2013
Development Tools: Anaconda, Geany, RStudio, Jupyter, MySQL Workbench, Hive, Sqoop, HBase, Spark, ETL, HDFS, Hadoop, Pig
Server Software: MySQL, Oracle, MS Access, SQL, T-SQL
Web Technologies: HTML, CSS
Productivity Software: Microsoft Excel, PowerPoint
Visualization Platforms: Tableau.
Machine Learning Algorithms: Supervised: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Extra Trees, KNN, Support Vector Machines
Confidential, St. Louis, MO
- Compiled, defined, and analyzed data requirements for projects. Prepared and presented data reports and offered analytical and statistical interpretations using Machine Learning Algorithms.
- Performed data pre-processing and cleaning to prepare the data sets for further statistical analysis.
- Developed machine learning models to estimate the number of health insurance claims and provided insights for smarter healthcare.
- Involved in all phases, including data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Performed Data Analysis using Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn and NLTK and developed various machine learning algorithms such as linear regression, multivariate regression, naïve Bayes, K-means, KNN and random forest.
- Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC), and Random Forest from scikit-learn, as well as XGBoost and LightGBM.
- Worked with different data formats like JSON and XML and applied various machine learning algorithms to them.
- Used ensemble methods like Bagging, Boosting, Gradient Boosting Machines, and XGBoost to improve on the accuracy of weak learners.
- Managed the entire data science project life cycle and was actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and k-fold cross-validation, and data visualization.
- Performed data parsing, manipulation, and preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold, and hold-out schemes, to test and verify the models' significance.
- Used Gaussian Naïve Bayes and other Naïve Bayes classifiers to predict new data from the training data.
- Used backpropagation in neural networks to compute the gradients of the cost function.
- Used Natural Language Processing for dictated documentation and speech-to-text transcription in the healthcare industry.
- Worked on large datasets: acquired and cleaned the data and analyzed trends with visualizations built in Matplotlib. Created reports and graphs that presented the insights to customers and supported decisions to advance patient care, reduce prices, and improve health.
- Good knowledge of Hadoop architecture and its components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
- Applied Spark, Scala, Hadoop, Cassandra, HBase, Spark Streaming, ML, and Python.
- Implemented, tuned, and tested the model on AWS Lambda with the best-performing algorithm and parameters.
- Used the Spark platform for analysis with the PySpark library and split the data into clusters on AWS.
- Developed the load script for extracting, transforming, and loading data into Tableau applications.
- Designed and developed new interface elements and objects as required, using set analysis to provide functionality in Tableau.
- Visualized the data with graphs and reports using the Matplotlib, Seaborn, and pandas packages in Python, to identify missing values, correlations between features, and outliers in the datasets used for analytical models.
- Used Tableau for data visualization and derived insights across datasets from various sources for grouping, clustering, and forecasting.
- Used Kanban boards to track the project's flow, status, and progress.
Environment: Python, AWS, HDFS, OLTP, MS Excel, Hive, OLAP, Metadata, MapReduce, SQL, MongoDB, DB2, Oracle 10g.
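The stratified k-fold scheme named in the bullets above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name stratified_k_fold and the example labels are hypothetical and are not taken from the project itself.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs in which every fold keeps roughly
    the same class proportions as the full label list."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels: 8 negative and 4 positive samples.
example_labels = [0] * 8 + [1] * 4
splits = list(stratified_k_fold(example_labels, k=4))
```

With these labels, each of the four test folds holds two negatives and one positive, which is the property that separates stratified from plain k-fold splitting.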
Confidential, San Francisco, CA
- Worked on data identification, collection, exploration, and cleaning, and participated in model identification.
- Worked with both supervised and unsupervised learning algorithms; evaluated, tested, and validated the models before selecting the best-fit model for predictions.
- Worked with large, complex datasets that include structured, semi-structured, and unstructured data to discover meaningful business insights.
- Utilized Machine Learning Algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Random Forests, K-means, & KNN for data analysis.
- Skilled in Advanced Regression Modelling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools, and the application of Statistical Concepts.
- Extracted, transformed, and loaded data into a database using Python scripts. Involved in gathering information while uncovering and defining multiple dimensions.
- Used F-score, AUC/ROC, confusion matrix, precision, and recall to evaluate the performance of different models.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with scikit-learn preprocessing. Performed data imputation using various methods in the scikit-learn package in Python. Worked with Tableau reports to test and validate the data integrity of the reports. Evaluated the performance of various models on real datasets.
- Completed tasks ranging from collecting and exploring the data to interpreting the statistical information.
- Identified data needs and requirements and worked with other members of the IT organization to deliver data visualization and reporting solutions for those needs.
- Created classification models to recognize web requests with product associations, classifying orders and scoring products for analytics, which improved the online sales percentage by 13%.
- Used pruning algorithms to cut away connections and perceptrons, significantly improving the performance of the back-propagation algorithm.
- Performed customer segmentation based on behavior or specific characteristics like age, region, income, and geographical location, applying clustering algorithms to group customers with similar behavior patterns.
- Conducted studies, produced rapid plots, and used advanced data mining and statistical modeling techniques to build a solution that optimizes the quality and performance of data.
- Worked with several outlier detection methods like Z-score, PCA, LMS, and DBSCAN to better process the data for higher accuracy.
- Developed the model with ~1.4 million data points and used the elbow method to find the optimal value of K, using the sum of squared errors as the error measure.
- Implemented Pearson's Correlation and Maximum Variance techniques to find the key predictors for the Regression models.
- Used Tableau dashboards to communicate results to team members and to other data science, marketing, and engineering teams. Performed model validation using test and validation sets via k-fold cross-validation and statistical significance testing.
- Designed predictive models using the Machine Learning platform. Used query languages such as SQL, HiveQL, and Pig Latin. Worked with a large dataset and deep learning class using TensorFlow.
- Worked with numerous data visualization tools in Python, such as Matplotlib, Seaborn, ggplot, and Pygal.
Environment: Apache Spark 2.0.2, Tableau 9.3, logistic regression, random forest, neural networks, SVM, JSON, XML, MLlib, TensorFlow.
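The elbow method mentioned in the bullets above, with the sum of squared errors (SSE) as the error measure, can be sketched as follows. This is an illustrative standard-library version on a handful of hypothetical 2-D points, not the production model: the function names, the random initialization, and the fixed iteration count are all assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_sse(points, k, iterations=20, seed=0):
    """Run basic k-means (Lloyd's algorithm) and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each non-empty cluster's centroid to its mean.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    # SSE: squared distance from each point to its closest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical data: three loose groups of points.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
```

Plotting sse_by_k against k and picking the value where the curve stops dropping sharply gives the "elbow". For a dataset of the size described above, a library implementation would be the practical choice.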
- Collaborated with my manager to gather all the information needed for data analysis and databases and analyzed the raw data.
- Designed and implemented predictive models using Natural Language Processing techniques and machine learning algorithms such as linear, logistic, and multivariate regression, random forests, k-means clustering, KNN, and PCA for data analysis.
- Involved in all aspects, including data collection, data cleaning, model development, and visualization.
- Maintained large datasets, combining data from various sources like Excel and SQL queries. Wrote SQL scripts to select data from the servers, modified it as needed with pandas in Python, and stored it back to the different database servers.
- Created action filters, parameters, and calculated sets for dashboards and worksheets in Tableau. Published customized reports and dashboards, scheduled reports using Tableau Server, and wrote Teradata SQL queries using Teradata SQL Assistant.
- Performed data cleaning, exploratory analysis, and feature engineering using R. Performed data visualization with Tableau, generated findings, and enhanced customer satisfaction. Programmed in Python using packages like NumPy, pandas, and SciPy. Developed content involving data manipulation, visualization, Machine Learning, and SQL.
- Used different kinds of statistical tests and methods like the chi-square test, hypothesis testing, t-test, ANOVA, correlation testing, and descriptive testing.
- Implemented classification using supervised algorithms like Decision trees, KNN, Logistic Regression, Naive Bayes.
- Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
- Understood and analyzed the data using appropriate statistical models to generate insights.
Environment: R, Python, Tableau 6.0, MLlib, PL/SQL, Teradata 12, Hadoop (HDFS), Hive, AWS.
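As an illustration of the chi-square test listed in the bullets above, the following minimal sketch computes the Pearson chi-square statistic for a contingency table in plain Python. The function name and the example counts are hypothetical. No continuity correction is applied and no p-value is computed, so this shows the statistic only.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table given as a
    list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 table of observed counts.
example_table = [[10, 20], [30, 40]]
chi2 = chi_square_statistic(example_table)
```

The statistic would then be compared against a chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom, which is one for this table.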
- Experience in translating logic into database design and maintaining it in SQL objects like Stored Procedures, Views, and User-Defined Functions.
- Worked with Database Analysts and Architect for the design of Summary Tables required.
- Used T-SQL stored procedures to transfer data from the source databases to the staging area and then into the data warehouse.
- Formatted report and sub-report layouts using Expressions, Global Variables, and Functions.
- Worked with SQL and T-SQL Views, Joins, Stored Procedures, and Indexes. Tuned and optimized SQL queries using execution plans.
- Used Joins and Sub-Queries to simplify complex queries involving multiple tables. Worked with Stored Procedures, Triggers, Functions, User-Defined Functions, and Indexes.
- Analyzed user requirements and specifications for various database applications, and identified and resolved database issues.
- Developed charts and graphs like line charts, pie charts, bar charts by using chart expert.
- Enhanced the database by creating clustered and non-clustered indexes and indexed views.
- Performed Extract, Transform, and Load (ETL) development using SQL Server and SQL Server Integration Services (SSIS) 2008.
- Analyzed data using complex SQL queries across various databases.
- Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to the staging area and finally into data marts.
- Performed maintenance of stored procedures to improve performance of different front-end applications.
- Scheduled reports to run daily and weekly in Report Manager for analysts to review as Excel sheets.
- Actively involved in the Software Development Life Cycle (SDLC), including Systems Initiation, Analysis, Planning, Design, Development, and Implementation.
Environment: SQL Server 2008, Visual Studio 2008, Microsoft Office 2010 (PowerPoint).