- Data Scientist with 8+ years of experience transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data, with expertise across a variety of industries including banking and healthcare.
- Expert in Data Science process life cycle: Data Acquisition, Data Preparation, Modeling (Feature Engineering, Model Evaluation) and Deployment.
- Experienced in applying statistical techniques including hypothesis testing, Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), ANOVA, sampling distributions, chi-square tests, time-series and discriminant analysis, Bayesian inference, and multivariate analysis.
- Efficient in preprocessing data, including data cleaning, correlation analysis, imputation, visualization, feature scaling, and dimensionality reduction, using Python data science packages (Scikit-learn, Pandas, NumPy).
- Expertise in building various machine learning models using algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Support Vector Machines (SVM), Decision trees, KNN, K-means Clustering, Ensemble methods (Bagging, Gradient Boosting).
- Experience in text mining, topic modeling, Natural Language Processing (NLP), content classification, sentiment analysis, market basket analysis, recommendation systems, and entity recognition.
- Applied text pre-processing and normalization techniques such as tokenization, POS tagging, and parsing. Expertise using NLP techniques (BOW, TF-IDF, Word2Vec) and toolkits such as NLTK, Gensim, and SpaCy.
- Experienced in tuning models using Grid Search, Randomized Search, K-Fold Cross-Validation, and t-SNE.
- Strong understanding of artificial neural networks, convolutional neural networks, and deep learning.
- Skilled in using statistical methods including exploratory data analysis, regression analysis, regularized linear models, time-series analysis, cluster analysis, goodness of fit, Monte Carlo simulation, sampling, cross-validation, ANOVA, and A/B testing.
- Working experience in Natural Language Processing (NLP) and a deep understanding of statistics, linear algebra, calculus, and optimization algorithms such as gradient descent.
- Familiar with key data science concepts (statistics, data visualization, machine learning, etc.). Experienced in Python, R, and PySpark programming for statistical and quantitative analysis.
- Knowledge of Time Series Analysis using AR, MA, ARIMA, GARCH, and ARCH models.
- Experienced in working with Jupyter Notebook and PySpark on cloud-hosted EC2 instances via PuTTY; evaluated models using cross-validation, the log-loss function, and ROC curves, and used AUC for feature selection.
- Experience in building production quality and large-scale deployment of applications related to natural language processing and machine learning algorithms.
- Experience with high-performance computing (cluster computing on AWS with Spark/Hadoop) and building real-time analysis with Kafka and Spark Streaming. Knowledge of Qlik, Tableau, and Power BI.
- Exposure to AI and Deep learning platforms such as TensorFlow, Keras, AWS ML, Azure ML studio
- Experience working with Big Data tools such as Hadoop - HDFS and MapReduce, Hive QL, Sqoop, Pig Latin and Apache Spark (PySpark).
- Experience in implementing Stored Procedures, Triggers, and Functions using T-SQL.
- Extensive experience working with RDBMS such as SQL Server, MySQL, and NoSQL databases MongoDB & HBase.
- Experience in designing and deploying AWS Solutions using EC2, S3, EBS, Elastic Load Balancer (ELB), auto scaling groups, optimizing volumes and EC2 instances.
- Experience in creating separate virtual data warehouses with different size classes in AWS Snowflake.
- Experienced in working with different file formats such as JSON, CSV, and XML in Anaconda Navigator, Jupyter Notebook, Visual Studio Code, and Spyder. Experience using Git and GitHub for source code management.
- Tackled highly imbalanced fraud datasets using undersampling, oversampling with SMOTE, and cost-sensitive algorithms using Python scikit-learn.
- Generated data visualizations using tools such as Tableau, Python Matplotlib, Python Seaborn, R.
- Knowledge and experience working in Agile environments including the scrum process and used Project Management tools like Jira and version control tools such as GitHub/Git.
- Maintains a fun, casual, professional, and productive team atmosphere.
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
Data Sources: MS SQL Server, MongoDB, MySQL, HBase, Amazon Redshift, Teradata
Statistical Methods: Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), Time Series, Correlation (Chi-square test, covariance), Multivariate Analysis, Bayes' Law.
Machine Learning: Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), Hierarchical Clustering, K-Means Clustering, K-Nearest Neighbors (KNN), Gradient Boosting Trees, AdaBoost, PCA, t-distributed Stochastic Neighbor Embedding (t-SNE), LDA, Natural Language Processing
Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, RNN, Deep Learning on AWS, Cloud AI
Hadoop Ecosystem: Hadoop, Spark, MapReduce, HiveQL, HDFS, Sqoop, Pig Latin
Data Visualization: Tableau, Python (Matplotlib, Seaborn), R (ggplot2), Power BI, QlikView, D3.js
Languages: Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R, SQL, Spark SQL, Spark, Java, C#
Operating Systems: UNIX Shell Scripting (via PuTTY client), Linux, Windows, Mac OS
Other tools and technologies: TensorFlow, Keras, AWS ML, Azure ML Studio, GCP, NLTK, SpaCy, Gensim, MS Office Suite, Google Analytics, GitHub, AWS (EC2/Lambda/Docker/Kubernetes)
Confidential, Dallas, TX
- Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation and visualization to deliver data science solutions.
- Built machine learning models to identify whether a user is legitimate using real-time data analysis and prevent fraudulent transactions using the history of customer transactions with supervised learning.
- Extracted data from SQL Server Database, copied into HDFS File system and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
- Performed data cleaning including transforming variables and dealing with missing value and ensured data quality, consistency, integrity using Pandas, NumPy.
- Tackled a highly imbalanced fraud dataset using sampling techniques such as undersampling and oversampling with SMOTE (Synthetic Minority Over-sampling Technique) using Python Scikit-learn.
- Utilized PCA, t-SNE, and other feature engineering techniques to reduce the high-dimensional data; applied feature scaling and handled categorical attributes using the one-hot encoder of the scikit-learn library.
- Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
- Worked on Amazon Web Services (AWS) cloud services to do machine learning on big data.
- Developed Spark Python modules for machine learning & predictive analytics in Hadoop.
- Implemented a Python-based distributed random forest via PySpark and MLlib.
- Used cross-validation to test the model on different batches of data, found the best parameters, and optimized the model, which boosted performance.
- Experimented with Ensemble methods to increase the accuracy of the model with different Bagging and Boosting methods and deployed the model on AWS.
- Experience in setting up CI/CD pipelines using Jenkins, GitHub, and Terraform.
- Experience in implementing continuous integration and deployment using CI tools such as Jenkins, Docker, and Kubernetes; monitored EC2 instances and other AWS services with Nagios and logs with Splunk.
- Visualized the data with graphs and reports using the Matplotlib, Seaborn, and pandas packages in Python to identify missing values, correlations between features, and outliers in the datasets used for analytical models.
- Created & maintained reports to display the status & performance of deployed model & algorithm with Tableau.
- Participated in business meetings and Data Night Live (Confidential & Confidential ML Meetup) to understand Confidential & Confidential's growth in the Machine Learning world.
- Self-starter, Self-learner, can work either independently or as part of a team.
- Pro-active in preventing issues: provide valuable suggestions for improvement, simplify complex problems with proper planning, and predict problems before they occur to avoid unnecessary escalations.
Technology Stack: Machine Learning, AWS, Python (Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn), SQL Server, Hadoop, HDFS, Hive, Pig Latin, SQL, Spark/PySpark/MLlib, GitHub, Linux, Tableau.
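The SMOTE-style oversampling used in the fraud work above can be sketched as interpolation between minority-class neighbours. This is a simplified hand-rolled illustration in NumPy; `smote_like_oversample` is a hypothetical helper, not a library function, and the production work used scikit-learn:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Simplified SMOTE-style oversampling: create synthetic minority
    samples by interpolating between a point and one of its k nearest
    minority-class neighbours. (Illustrative sketch only.)"""
    rng = np.random.default_rng(rng)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample
        j = neighbours[i, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Four toy minority-class points in the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=6, k=3, rng=0)
print(X_new.shape)  # (6, 2)
```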
Confidential, Kansas City, MO
Data Scientist / Machine learning Engineer
- Collaborated with data engineers and operation team to implement the ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis by retrieving the data from the Hadoop cluster.
- Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
- Explored and analyzed customer-specific features by using Matplotlib in Python and ggplot2 in R.
- Performed data imputation using Scikit-learn package in Python.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with Scikit-learn preprocessing.
- Used Python (NumPy, SciPy, pandas, Scikit-learn, seaborn) and R to develop a variety of models and algorithms for analytic purposes.
- Worked on Natural Language Processing with the NLTK module in Python and developed NLP models for sentiment analysis.
- Experimented and built predictive models including ensemble models using machine learning algorithms such as Logistic regression, Random Forests, and KNN to predict customer churn.
- Analyzed customer behaviors and discovered customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering, Gaussian Mixture Models, and Hierarchical Clustering.
- Used F-score, AUC/ROC, confusion matrix, precision, and recall to evaluate different models' performance.
- Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, using collaborative filtering techniques to recommend courses to different customers.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
Technology Stack: Hadoop, HDFS, Python, R, Tableau, Machine Learning (Logistic regression/ Random Forests/ KNN/ K-Means Clustering/ Hierarchical Clustering/ Ensemble methods/ Collaborative filtering), JIRA, GitHub, Agile/ SCRUM, GCP
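The evaluation metrics listed above (confusion matrix, precision, recall, AUC/ROC) can be sketched with scikit-learn; the labels and scores below are toy values standing in for real churn predictions:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Toy ground-truth labels and model scores (illustrative only).
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2, 0.7, 0.6]

# Threshold the scores at 0.5 to get hard class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print(confusion_matrix(y_true, y_pred))   # rows: true class, cols: predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))     # threshold-free ranking quality
```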
Machine Learning Engineer / Sr. Data Analyst
- Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format. Queried and retrieved data from Oracle database servers to get the dataset.
- In the preprocessing phase, used Pandas to remove or replace all missing data and balanced the dataset by oversampling the minority class and undersampling the majority class.
- Used PCA and other feature engineering, feature scaling, and Scikit-learn preprocessing techniques to reduce high-dimensional data drawn from the entire patient visit history, proprietary comorbidity flags, and comorbidity scoring across over 12 million EMR and claims records.
- In data exploration stage used correlation analysis and graphical techniques in Matplotlib and Seaborn to get some insights about the patient admission and discharge data.
- Developed time series forecasting models for various business databases using ARIMA and ARIMAX.
- Experimented with predictive models including Logistic Regression, Support Vector Machine (SVM), Gradient Boosting and Random Forest using Python Scikit-learn to predict whether a patient might be readmitted.
- Designed and implemented Cross-validation and statistical tests including ANOVA, Chi-square test to verify the models’ significance.
- Implemented, tuned & tested the model on AWS EC2 with the best performing algorithm and parameters.
- Set up a data preprocessing pipeline to guarantee consistency between existing and incoming data.
- Deployed the model on AWS Lambda. Collected the feedback after deployment, retrained the model and tweaked the parameters to improve the performance.
- Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
- Used Agile methodology and the Scrum process for project development.
Technology Stack: AWS EC2, S3, Oracle DB, AWS, Linux, Python (Scikit-Learn/NumPy/Pandas/Matplotlib), Machine Learning (Logistic Regression/Support Vector Machine/Gradient Boosting/Random Forest), Tableau
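The cross-validated tuning described above can be sketched with scikit-learn; synthetic data stands in for the proprietary readmission features, and the parameter grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for real (proprietary) patient features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5-fold cross-validated grid search over the regularization strength C,
# scored by AUC as in the readmission work.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```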
- Created a database in Microsoft Access starting from a blank database, created tables, entered the dataset and data types manually, and produced an ER diagram and basic SQL queries against that database.
- Used Microsoft Excel to format data as tables and to visualize and analyze data using methods such as conditional formatting, duplicate removal, pivot and unpivot tables, charts, and sorting and filtering of data sets.
- Wrote application code to do SQL queries in MySQL and to organize useful information based on the business requirements
- Applied concepts of probability distributions and statistical inference on the given dataset to unearth interesting findings using comparisons, t-tests, F-tests, R-squared, P-values, etc.
- Performed Statistical Analysis and Hypothesis Testing in Excel by using Data Analysis Tool.
- Analyzed the partitioned and bucketed data and computed various metrics for reporting.
- Integrated data from disparate sources and mined large data sets to identify patterns using predictive analytics.
- Conducted intermediate and advanced statistical analysis, such as linear regression, ANOVA, time-series analysis, classification models, and forecasting future sales.
- Created Entity Relationship Diagrams and Data mapping for a better understanding of the dataset
- Interacted with the other departments to understand and identify data needs and requirements and work with other members of the IT organization to deliver data visualization and reporting solutions to address those needs.
- Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.
- Performed module specific configuration duties for implemented applications to include establishing role- based responsibilities for user access, administration, maintenance, and support.
- Worked closely with internal business units to understand business drivers and objectives which can be enabled through effective application deployment.
Technology Stack: SQL Server, Tableau, Excel, SQL server management studio, Microsoft BI Suite, SQL, Visual Studio.
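The t-test-based inference mentioned above can be sketched with SciPy; the two samples here are synthetic and purely illustrative (e.g. a metric measured for two groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two illustrative samples drawn from normal distributions with
# different means, standing in for two real groups being compared.
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=11.0, scale=2.0, size=50)

# Independent two-sample t-test: is the difference in means significant?
t_stat, p_value = stats.ttest_ind(a, b)
print(round(t_stat, 3), round(p_value, 4))
```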
Java Full Stack Developer
- Involved in design and development of all modules in the application.
- Developed and deployed Java REST web services using Spring Framework.
- Converted the mock-ups into hand-written HTML, AJAX, XML and JSON.
- Designed and implemented Spring ORM module to integrate Hibernate.
- Implemented SOA architecture with web services using SOAP, WSDL, UDDI and XML.
- Designed and Implemented User Interface in Model-View-Controller Architecture, which accomplishes a tight and neat co-ordination of Spring MVC, JSP, Servlets.
- Designed and developed REST web services using the Jersey framework, which builds on JAX-RS.
- Prepared Unit test cases for existing functionality as per the requirement and execute the same.
- Used WS-Security for authenticating the SOAP messages along with encryption and decryption.
- Used Entity Beans for accessing data from the SQL Server database.
- Used Java Messaging Services (JMS) for reliable and asynchronous exchange of important information such as payment status report.
- Worked exclusively on MDB, Messaging Queues and Posting Error Messages to the Queue.
- Worked on Spring MVC RESTful web services, exposing services and consuming third-party services.
- Developed UML models consisting of Use Case Diagrams, and Sequence Diagrams using Rational Rose software.
- Involved in handling Hibernate as part of DB connectivity and persistence as ORM tool and writing HQL queries.
- Used JUnit Testing Framework for performing Unit testing.
- Strong experience implementing web services (WSDL using the SOAP protocol, JAXB, JAX-RS, RESTful).
Technology Stack: JSP, AJAX, Struts framework, Hibernate Framework, JMS, SOAP, XML, Spring Framework, Log4j, Java Script, HTML, Oracle9i, SQL, PL/SQL, Web Sphere, WSAD, JSTL, Struts tags, Junit, Mockito, SQL, Struts, CSS, Jenkins.
- Involved in creating stored procedures & SQL queries to import data from SQL server to Tableau.
- Created filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Acquired strong experience in all areas of SQL server development including tables, user functions, views, indexes, stored procedures, functions, joins.
- Explored the dataset using various diagrams such as histograms and boxplots, and examined skewness, in RStudio.
- Analyzed the customer data and business rules to maintain data quality and integrity.
- Extensively created excel charts, pivot tables, functions in Microsoft Excel to analyze the data.
- Cleaned datasets by removing missing values and outliers using RStudio.
- Performed various mathematical functions such as max, min, log, round, sum, mean, and standard deviation in RStudio.
- Performed ETL process to Extract, Transform & Load the data from OLTP tables into staging tables & data warehouse.
- Established credibility and strong working relationships with stakeholders and customers.
- Produced reports on an ad hoc basis per requirements.
- Conducted User Acceptance Testing (UAT) for various system releases.
Technology Stack: SQL Server 2008, SQL Management Studio, SQL Profiler, Visual Studio, MS Excel, TFS
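The outlier removal described above was done in RStudio; an equivalent pandas sketch using Tukey's IQR rule looks like this (the helper name and sample data are illustrative):

```python
import pandas as pd

def drop_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).
    Illustrative helper, not a library function."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

# Toy data: 100 is an obvious outlier relative to the rest.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])
print(drop_iqr_outliers(s).tolist())  # [1, 2, 2, 3, 3, 3, 4, 4]
```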