- 6 years of extensive IT experience, including 4+ years in data science applying machine learning algorithms to statistical data. Performed advanced analytics, predictive modeling, and data science to solve business problems and enable fact-based decision-making.
- Significant expertise in data acquisition, storage, analysis, and integration, as well as machine learning, predictive modeling, logistic regression, decision trees, data mining, forecasting, factor analysis, ad hoc analysis, A/B testing, multivariate testing, time series analysis, cluster analysis, ANOVA, neural networks, and other advanced statistical and econometric techniques.
- Expertise includes abstracting and quantifying the computational aspects of problems, designing and applying new statistical algorithms, and systems-level software design and implementation on platforms such as R, SAS, Python, and Spark. Experienced in applying machine learning and statistical modeling techniques to solve business problems.
- Expert in distilling vast amounts of data into meaningful discoveries at the requisite depth. Able to analyze the most complex projects at various levels.
- Experience building big data, data-intensive applications and products using Hadoop ecosystem components such as Hadoop, Pig, Hive, Sqoop, Apache Spark, and Apache Kafka.
- Experience working on text understanding, classification, pattern recognition, recommendation systems, targeting systems, and ranking systems using Python.
- Deep understanding of statistical modeling, multivariate analysis, big data analytics, and standard procedures. Highly efficient with dimensionality reduction methods such as PCA (Principal Component Analysis).
- Experienced with job workflow scheduling and monitoring tools such as Oozie and ESP. Experience using various Hadoop distributions (Pivotal, Hortonworks, MapR, etc.) to fully implement and leverage new Hadoop features.
- Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC).
- Visualization and dashboarding using Tableau, Python's Matplotlib, and graphing in R.
Machine Learning: Regression analysis, Ridge and Lasso regression, K-NN, Decision Trees, Support Vector Machines (SVM), Artificial Neural Networks (ANN), CNN, RNN, ensemble methods (Bagging, Boosting, Stacking), K-Means clustering, and hierarchical clustering.
Python Libraries: NumPy, Pandas, SciPy, Scikit-learn, Matplotlib, Seaborn, Statistics.
Databases: MySQL, SQL Server 2008/2012/2014, MongoDB, AWS DynamoDB.
Hadoop Ecosystem: Hadoop, HDFS, Hive, Pig, Sqoop, Spark, Kafka, Oozie.
Cloud Services: AWS (EC2, S3, Lambda, EMR, DynamoDB), Azure.
Reporting & Visualization Tools: Tableau, SSRS, Seaborn, Matplotlib, ggplot2.
Languages: Python, R, SAS, Scala, SQL.
Operating Systems: Linux (Ubuntu 14.x - 16.x), Windows 7 - 10, Mac OS.
Confidential, Philadelphia, PA
- Developed computational and data science solutions for the storage, management, analysis, and visualization of genomic data.
- Leveraged existing tools and publicly available genomics data to develop, test, or implement bioinformatics pipelines.
- Extracted patent text and numerical features with the Python library Beautiful Soup; built a Decision Tree model to predict patent classification by disease.
- Detected near-duplicate news articles by applying NLP methods (e.g., word2vec) and developing machine learning models such as label spreading and clustering.
- Provided expertise in statistical methods or machine learning with the goal of applying these techniques to health data.
- Extracted data from the database, copied it into HDFS, and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Tackled highly imbalanced Fraud dataset using sampling techniques like down-sampling, up-sampling and SMOTE (Synthetic Minority Over-Sampling Technique) using Python Scikit-learn.
- Used the K-Means clustering technique to identify outliers and classify unlabeled data.
- Cleaned, analyzed and selected data to gauge customer experience.
- Used algorithms and programming to efficiently go through large datasets and apply treatments, filters, and conditions as needed.
- Used PCA and other feature engineering techniques to reduce the high-dimensional data, along with feature normalization and label encoding, using the Scikit-learn library in Python.
- Used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python for developing various machine learning models such as Logistic regression, Gradient Boost Decision Tree and Neural Network.
- Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
- Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods.
- Implemented a Python-based distributed random forest via PySpark and MLlib.
- Used AWS S3, DynamoDB, AWS Lambda, and AWS EC2 for data storage and model deployment.
- Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.
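The K-Means outlier-detection bullet above can be sketched as follows; the synthetic data, cluster count, and 2% threshold are illustrative assumptions, not the project's actual values:

```python
# Flag points far from their K-Means cluster center as candidate outliers.
# Dataset, k, and the distance threshold are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each point to its assigned cluster center.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag the farthest ~2% of points as candidate outliers.
threshold = np.quantile(dist, 0.98)
outliers = X[dist > threshold]
print(len(outliers))
```

The same distances can also feed a percentile or z-score rule per cluster rather than a global cutoff.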
Technology Stack: Oracle 11g, Hadoop 2.x, HDFS, Hive, Pig Latin, Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, AWS, GitHub, Linux, machine learning algorithms, Tableau.
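The PCA, normalization, and cross-validation workflow described in this role can be sketched as a Scikit-learn pipeline; the dataset and component count here are stand-ins, not the project's data:

```python
# Minimal sketch: scaling -> PCA -> logistic regression, scored with
# 5-fold cross-validation to guard against overfitting to one split.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),    # feature normalization
    ("pca", PCA(n_components=10)),  # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

Wrapping the steps in one `Pipeline` keeps the scaler and PCA fitted only on each training fold, avoiding leakage into the validation fold.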
Confidential - Morristown, NJ
- Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format.
- Queried and retrieved data from SQL Server database to get the sample dataset.
- In the pre-processing phase, used Pandas to handle missing data, cast datatypes, and merge or group tables for the EDA process.
- Used PCA and other feature engineering techniques, along with feature normalization and label encoding from Scikit-learn pre-processing, to reduce the high-dimensional data (>150 features).
- In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
- Experimented with predictive models including Logistic Regression, Support Vector Machines (SVC), and Random Forest from Scikit-learn, plus XGBoost, LightGBM, and a Keras neural network, to predict show-up probability and visit counts.
- Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold, and hold-out schemes, to test and verify the models' significance.
- Participated in feature engineering such as feature-intersection generation, feature normalization, and label encoding with Scikit-learn pre-processing.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
- Used Python (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Implemented, tuned and tested the model on AWS Lambda with the best performing algorithm and parameters.
- Implemented a hypothesis-testing kit for sparse sample data by writing R packages.
- Collected the feedback after deployment, retrained the model to improve the performance.
- Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
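The tree-based feature-selection step for fraud prediction mentioned above can be sketched like this; the synthetic dataset and the "mean importance" threshold are assumptions for illustration:

```python
# Select features by gradient-boosting importance: keep only columns
# whose importance exceeds the mean importance across all features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a fraud dataset: 30 features, 5 informative.
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=5, random_state=0)

selector = SelectFromModel(
    GradientBoostingClassifier(random_state=0),
    threshold="mean",
).fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)  # far fewer than 30 columns survive
```

A random forest can be swapped in for the gradient-boosting estimator with no other changes, since both expose `feature_importances_`.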
Technology Stack: SQL Server 2012/2014, AWS EC2, AWS Lambda, AWS S3, AWS EMR, Linux, Python 3.x (Scikit-learn, NumPy, Pandas, Matplotlib), R, Machine Learning algorithms, Tableau.
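The validation schemes named in this role (hold-out vs stratified k-fold) can be contrasted in a short sketch; the imbalanced synthetic data and logistic model are illustrative stand-ins:

```python
# Compare a single hold-out estimate with stratified 5-fold CV, which
# preserves the class ratio in every fold of an imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=600, weights=[0.9, 0.1],
                           random_state=0)
clf = LogisticRegression(max_iter=1000)

# Hold-out: one split, fast but a higher-variance estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
holdout = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Stratified k-fold: every fold keeps the 90/10 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X, y, cv=skf)
print(round(holdout, 3), round(cv_scores.mean(), 3))
```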
Confidential - Indianapolis, IN
Intern Data Scientist
- Independently delivered 4 projects, from data orchestration and workflow design to production code for online software releases.
- Developed custom intent classification techniques to be used during the intent creation and testing, by modifying the Word Mover Distance algorithm.
- Diagnosed performance issues that occurred only on the server and not locally; used JProfiler to monitor memory utilization.
- Analyzed incoming new data and identified possible problems with intent design.
- Diagnosed problems that were rooted in bad SQL schema design.
- Used local and Azure cloud multiprocessing to produce time-series forecasts for 50+ million search terms.
- Optimized key features for ad campaigns to generate the best ROI across ad bids, ad budgets, and sales margins.
- Used feature importance to find the top search terms that generated the most revenue for the top 20+ million products.
- Applied computer vision and split testing to optimize product pictures for the best sales conversion.
- Performed data profiling in the source systems that are required for New Customer Engagement (NCE) Data mart.
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Manipulated, cleansed, and processed data using Excel, Access, and SQL.
- Responsible for loading, extracting, and validating client data.
- Liaised with end users and third-party suppliers; analyzed raw data, drew conclusions, and developed recommendations; wrote SQL scripts to manipulate data for data loads and extracts.
- Developed analytical databases from complex financial source data; performed daily system checks; handled data entry, data auditing, data reporting, and accuracy monitoring; designed, developed, and implemented new functionality.
- Monitored the automated loading processes; advised on the suitability of methodologies and suggested improvements.
- Involved in defining source-to-target data mappings, business rules, and business and data definitions. Responsible for defining the key identifiers for each mapping/interface.
- Responsible for defining the functional requirements documents for each source-to-target interface.
- Coordinated with business users to design new reporting needs in an appropriate, effective, and efficient way, building on the existing functionality.
- Documented data quality and traceability for each source interface.
- Designed and implemented data integration modules for Extract/Transform/Load (ETL) functions.
- Involved in Data warehouse and DataMart design.
- Worked with internal architects, assisting in the development of current- and target-state data architectures.
- Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
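The time-series forecasting mentioned earlier in this role can be illustrated with simple exponential smoothing; the series and smoothing factor below are made-up examples, not the actual search-term data:

```python
# Simple exponential smoothing: the one-step-ahead forecast is the
# final smoothed level. Alpha and the demand series are illustrative.
import numpy as np

def ses_forecast(series, alpha=0.3):
    """Return the one-step-ahead forecast from exponential smoothing."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

demand = np.array([100, 102, 101, 105, 107, 110], dtype=float)
print(round(ses_forecast(demand), 2))  # 105.45
```

At large scale, a vectorized or multiprocessing wrapper applies the same per-series smoothing to millions of independent series.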
Technology Stack: SQL Server, Oracle, MS Office, Teradata, Enterprise Architect, Informatica Data Quality, ER/Studio, TOAD, Business Objects, Greenplum Database, PL/SQL.
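The ETL modules described above follow a standard extract-transform-load shape, sketched here with in-memory SQLite; the `orders` table and the high-value business rule are hypothetical examples:

```python
# Minimal extract-transform-load sketch: read rows from a source table,
# apply a business rule, and load the result into a target mart table.
import sqlite3

# Source system (hypothetical orders table).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 19.99), (2, 5.00), (3, 42.50)])

# Extract.
rows = src.execute("SELECT id, amount FROM orders").fetchall()

# Transform: flag high-value orders (illustrative business rule).
transformed = [(i, amt, 1 if amt > 20 else 0) for i, amt in rows]

# Load into the target data mart.
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE orders_mart "
            "(id INTEGER, amount REAL, high_value INTEGER)")
tgt.executemany("INSERT INTO orders_mart VALUES (?, ?, ?)", transformed)

n_flagged = tgt.execute("SELECT COUNT(*) FROM orders_mart "
                        "WHERE high_value = 1").fetchone()[0]
print(n_flagged)  # 1
```

A source-to-target mapping document, as in the bullets above, would enumerate each source column, its target column, and the rule applied in the transform step.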