Data Scientist/Machine Learning Resume
Philadelphia, PA
SUMMARY:
- Over 8 years of experience in Machine Learning with large structured and unstructured datasets, spanning predictive modeling, data analysis, data acquisition, data validation, and data visualization.
- Hands-on experience with Machine Learning algorithms such as Regression Analysis, Clustering, Boosting, Classification, and Principal Component Analysis, along with data visualization tools.
- Data Scientist with proven expertise in Data Analysis, Machine Learning, and Modeling.
- Experience in Machine Learning algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, K-Means Clustering and Association Rules.
- Experience in applying predictive modeling and machine learning algorithms for analytical reports.
- Experience using technology to work efficiently with large datasets: scripting, data cleaning tools, and statistical software packages.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Very strong in Python, statistical analysis, statistical tools, and modeling.
- Experienced in Machine Learning and Statistical Analysis with Python Scikit-Learn.
- Strong programming skills in a variety of languages such as Python, R and SQL.
- Valuable experience working with large datasets and Deep Learning algorithms with TensorFlow.
- Worked on various applications using Python-integrated IDEs such as Anaconda and PyCharm.
- Experience in Data Cleaning, Transformation, Integration, Data Imports and Data Exports.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, and k-means.
- Good knowledge of Data Validation, Data Cleaning, Data Verification, and identifying data mismatches.
- Experienced with Machine Learning, Regression Analysis, Clustering, Boosting, Classification, Principal Component Analysis and Data Visualization Tools.
- Experienced with tuning parameters for different machine learning models to improve performance.
- Interacted with various clients and teams to update and modify deliverables to meet business needs.
- Hands-on experience applying SVM, Random Forest, and K-Means clustering.
- Experienced in writing complex SQL, including stored procedures, triggers, joins, and subqueries.
- Extensive experience in Text Analytics, developing different Statistical Machine Learning models, Data Mining solutions to various business problems and generating Data visualizations using R, Python and Tableau.
- Used Python to generate regression models for statistical forecasting (see the sketch after this list) and applied clustering algorithms such as K-Means to segment customers into groups.
- Performed data manipulation, data preparation, normalization, and predictive modeling; improved efficiency and accuracy by evaluating models in Python.
- Worked on SQL Server concepts SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services).
- Experience building and optimizing big data pipelines, architectures, and datasets using Hadoop, Spark, Hive, and Python.
- Experience implementing machine learning back-end pipelines with Spark MLlib, Scikit-learn, Pandas, and NumPy.
- Working knowledge of Extract, Transform, and Load (ETL) components and process flow using Talend.
- Experience with process mining using Microsoft Visio.
- Experience with AWS cloud services such as EC2 and S3.
- Experience building and implementing architecture roadmaps for next-generation Artificial Intelligence solutions for clients.
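A minimal sketch of the regression-based forecasting workflow referenced above, using scikit-learn with synthetic data in place of the original datasets:

```python
# Minimal sketch: a scikit-learn regression model for statistical forecasting.
# The data is synthetic; the original datasets are not available.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                      # hypothetical predictor matrix
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```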
TECHNICAL SKILLS:
Databases: MS SQL Server, Oracle, HBase, Amazon Redshift
Statistical Methods: Hypothesis Testing, Exploratory Data Analysis (EDA), Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation
Machine Learning: Regression Analysis, Naïve Bayes, Decision Tree, Random Forests, Support Vector Machine, Neural Network, Sentiment Analysis, Collaborative Filtering, K-Means Clustering, KNN, CNN, RNN, and AdaBoost
Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
Reporting Tools: Tableau Suite of Tools 10.x (Desktop, Server, and Online), SQL Server Reporting Services (SSRS)
Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, Hive, HDFS, Pig
Cloud Services: Amazon Web Services (AWS) EC2/S3/Redshift
Operating Systems: Microsoft Windows, Linux (Ubuntu)
Productivity Tools: Microsoft Office Suite (Word, PowerPoint, Excel)
PROFESSIONAL EXPERIENCE:
Confidential - Philadelphia, PA
Data Scientist/Machine Learning
Roles & Responsibilities:
- Responsible for applying machine learning techniques (regression and classification) to predict outcomes.
- Performed ad-hoc reporting and customer profiling/segmentation using R and Python.
- Tracked various campaigns, generating customer profiling analyses and performing data manipulation.
- Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
- Utilized label encoders in Python to convert significant non-numerical variables to numerical form, then identified their impact on pre- and post-acquisition outcomes using a two-sample paired t-test (see the sketch after this list).
- Worked with ETL in SQL Server Integration Services (SSIS) for data investigation and mapping; extracted data with fast parsing and enhanced efficiency by 17%.
- Developed Data Science content involving Data Manipulation and Visualization, Web Scraping, Machine Learning, Python programming, SQL, GIT and ETL for Data Extraction.
- Designed a suite of interactive dashboards, which made it possible to scale and measure HR department statistics for the first time, and scheduled and published reports.
- Created data presentations that reduced bias and told an accurate story, pulling millions of rows of data using SQL and performing Exploratory Data Analysis.
- Applied breadth of knowledge in programming (Python, R), Descriptive, Inferential, and Experimental Design statistics, advanced mathematics, and database functionality (SQL, Hadoop).
- Migrated data from Heterogeneous Data Sources and legacy system (DB2, Access, Excel) to centralized SQL Server databases using SQL Server Integration Services (SSIS).
- Involved in defining source-to-target business rules, data mappings, and data definitions.
- Interpreted and analyzed data and performed predictive modeling using Python with the NumPy and Pandas packages.
- Performed data validation and reconciliation between disparate source and target systems for various projects.
- Utilized a diverse array of technologies and tools as needed to deliver insights, including R, SAS, MATLAB, and Tableau.
- Built a regression model using Scikit-learn in Python to understand an order-fulfillment time-lag issue.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Used T-SQL queries to pull data from disparate systems and the data warehouse in different environments.
- Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
- Extracted data from different databases per business requirements using SQL Server Management Studio.
- Interacted with the ETL and BI teams to understand and support various ongoing projects.
- Used MS Excel extensively for data validation.
- Involved in data analysis using various analytic and modeling techniques.
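A minimal sketch of the label-encoding and paired t-test step described above, assuming scikit-learn's LabelEncoder and SciPy's ttest_rel; the column names and values are hypothetical stand-ins for the acquisition data:

```python
# Minimal sketch of the label-encoding and paired t-test workflow described above.
# Column names and data are hypothetical; the original acquisition data is unavailable.
import pandas as pd
from scipy import stats
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "region": ["east", "west", "west", "north", "east"],   # non-numerical variable
    "pre_acquisition":  [10.2, 11.5, 9.8, 12.1, 10.9],
    "post_acquisition": [11.0, 12.3, 10.1, 13.0, 11.4],
})

# Convert a significant non-numerical variable to numerical form.
df["region_encoded"] = LabelEncoder().fit_transform(df["region"])

# Paired t-test on pre- vs. post-acquisition measurements of the same subjects.
t_stat, p_value = stats.ttest_rel(df["pre_acquisition"], df["post_acquisition"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```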
Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel (PivotTables, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, MDM, SharePoint, Data Quality, Tableau, and Reference Data Management.
Confidential - VA
Data Engineer
Roles & Responsibilities:
- Developed Machine Learning, Statistical Analysis, and Data Visualization applications for challenging data processing problems in the sustainability and biomedical domains.
- Compiled data from various public and private databases to perform complex analyses and data manipulation for actionable results.
- Designed and developed Natural Language Processing models for sentiment analysis.
- Worked on Natural Language Processing with the NLTK module of Python to develop an automated customer response application.
- Used predictive modeling with tools in SAS, SPSS, R, Python.
- Applied concepts of probability, distributions, and statistical inference to the given dataset to unearth interesting findings through the use of comparisons, t-tests, F-tests, R-squared, p-values, etc.
- Applied linear regression, multiple regression, the ordinary least squares method, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, fitting functions, etc., to data with the help of the Scikit-learn, SciPy, NumPy, and Pandas modules in Python.
- Applied clustering algorithms (hierarchical and K-means) with the help of Scikit-learn and SciPy.
- Developed visualizations and dashboards using ggplot2 and Tableau.
- Worked on development of data warehouse, Data Lake, and ETL systems using relational and non-relational tools such as SQL and NoSQL.
- Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
- Applied linear regression in Python and SAS to understand the relationships between different attributes of the dataset and the causal relationships between them.
- Performed complex pattern recognition on financial time series data and forecast returns through ARMA and ARIMA models and exponential smoothing for multivariate time series data (see the sketch after this list).
- Pipelined (ingest/clean/munge/transform) data for feature extraction toward downstream classification.
- Used Cloudera Hadoop YARN to perform analytics on data in Hive.
- Wrote Hive queries for data analysis to meet the business requirements.
- Expertise in Business Intelligence and data visualization using R and Tableau.
- Expert in Agile and Scrum processes.
- Validated macroeconomic data (e.g., from BlackRock, Moody's) and performed predictive analysis of world markets using key indicators in Python and machine learning concepts such as regression, Bootstrap Aggregation, and Random Forest.
- Worked in large-scale database environments like Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
- Interfaced with large-scale database system through an ETL server for data extraction and preparation.
- Identified patterns, data quality issues, and opportunities, and leveraged the insights by communicating them to business partners.
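A minimal sketch of the ARIMA forecasting referenced above, assuming statsmodels; the return series is synthetic and the (p, d, q) order is illustrative, not tuned:

```python
# Minimal sketch of ARIMA-based forecasting of financial returns.
# The series is synthetic; the order (p, d, q) is illustrative, not tuned.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.01, size=250),
                    index=pd.date_range("2017-01-02", periods=250, freq="B"))

model = ARIMA(returns, order=(1, 0, 1))   # ARMA(1, 1) on returns (d = 0)
fitted = model.fit()
print(fitted.summary())
print(fitted.forecast(steps=5))           # 5-business-day-ahead forecast
```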
Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.
Confidential - New York, NY
Data Scientist
Roles & Responsibilities:
- Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation and visualization to deliver data science solutions.
- Built machine learning models that identify whether a user is legitimate using real-time data analysis and prevent fraudulent transactions using supervised learning on the history of customer transactions.
- Extracted data from a SQL Server database into the HDFS file system and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
- Performed data cleaning, including transforming variables and dealing with missing values, and ensured data quality, consistency, and integrity using Pandas and NumPy.
- Tackled a highly imbalanced fraud dataset using sampling techniques such as undersampling and oversampling with SMOTE (Synthetic Minority Over-sampling Technique) in Python Scikit-learn (see the sketch after this list).
- Utilized PCA, t-SNE, and other feature engineering techniques to reduce the high-dimensional data; applied feature scaling and handled categorical attributes using Scikit-learn's one-hot encoder.
- Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
- Worked on Amazon Web Services (AWS) cloud services to do machine learning on big data.
- Developed Spark Python modules for machine learning & predictive analytics in Hadoop.
- Implemented a Python-based distributed random forest via PySpark and MLlib (see the PySpark sketch after this list).
- Used cross-validation to test the model with different batches of data and to find the best parameters for the model, which eventually boosted performance.
- Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods and deployed the model on AWS.
- Created and maintained Tableau reports to display the status and performance of the deployed model and algorithm.
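A minimal sketch of the SMOTE and cross-validation steps referenced above, assuming the imbalanced-learn package; the fraud data here is synthetic:

```python
# Minimal sketch: SMOTE oversampling inside a cross-validated pipeline.
# Assumes the imbalanced-learn package; the data is synthetic.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# SMOTE runs inside the pipeline so oversampling happens only on training folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", GradientBoostingClassifier(random_state=42)),
])
grid = GridSearchCV(pipe, {"clf__n_estimators": [100, 200]}, cv=5, scoring="roc_auc")
grid.fit(X, y)
print("best AUC:", grid.best_score_, "best params:", grid.best_params_)
```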
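A minimal sketch of the distributed random forest referenced above, using the PySpark spark.ml API; the input path and column names are hypothetical:

```python
# Minimal sketch of a distributed random forest with PySpark MLlib (spark.ml API).
# The parquet path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("fraud-rf").getOrCreate()
df = spark.read.parquet("hdfs:///data/transactions.parquet")   # hypothetical path

# Assemble feature columns into the single vector column spark.ml expects.
assembler = VectorAssembler(inputCols=["amount", "age", "n_prior_txns"],
                            outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(labelCol="is_fraud", featuresCol="features", numTrees=100)
model = rf.fit(train)
model.transform(test).select("is_fraud", "prediction", "probability").show(5)
```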
Environment: Machine Learning, AWS, Python (Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn), SQL Server, Hadoop, HDFS, Hive, Pig Latin, Apache Spark/PySpark/MLlib, GitHub, Linux, Tableau.
Confidential - Charlotte, NC
Data Scientist II
Roles & Responsibilities:
- SAP Altiscale Cloud Platform - Big Data Services (BDS), Interim Big Data Administrator
- Provided leadership for the transition to a Big Data platform stack (Hadoop, Spark, etc.): migrated propensity models to the dev/prod cluster in BDS, automated production models (Python), built ML and ETL data pipelines (PySpark), and onboarded the team to Hadoop.
- The transition to cloud infrastructure will help BHF exit its current service agreements with MetLife, yielding cost savings of ~$10MM.
- Marketing: Third Party Distribution & Primerica Propensity Model
- Built a propensity model from various data sources to score the Financial Advisors most likely to sell the Flex/Shield annuity product.
- Used logistic regression, Lasso, Random Forest, etc., combining seven models in three layers (Face-to-Face, Active, and Inactive advisors, plus product models); performed driver analysis to measure the success of email campaigns, used for lead generation.
- Targeted non-sellers based on propensity score output, generating $60MM in incremental quarterly sales revenue; scored the model quarterly.
- Built classification models (SVM, Naive Bayes, GBM) to score Advisors for eight firms having a strategic relationship with BHF.
- Product: Guaranteed Minimum Income Benefit (GMIB) variable annuity product utilization and withdrawals
- Analyzed how GMIB has been utilized by consumers based on demographics and geography; examined withdrawal rates and surrenders.
- Built a survival (Cox proportional hazards) model to predict customer churn (policy surrenders) and to find statistically significant drivers of policy lapse (see the sketch after this list).
- These insights will drive improvements in future product design (new features), pricing decisions, and risk management by stakeholders.
- Distribution: Wholesaler Effective Analysis
- Collaborated with experts from the University of Missouri to (a) determine the optimal number of wholesalers, (b) align territories, and (c) design the wholesaler incentive plan; translated results from sales response models into actionable insights for stakeholders using Tableau.
- Aligned territories based on top advisor prospects, channels, and opportunity; estimated an average increase in sales revenue of $300MM.
- Select Projects and Experience:
- Natural Language Processing (NLP): Live Twitter Sentiment Analysis with NLTK
- Performed sentiment analysis on live Twitter data using its streaming API. Five different classifiers (Naive Bayes, Bernoulli NB, Linear SVC, etc.) are trained on a labeled movie-reviews dataset and combined with a voting classifier that reports a confidence level. Project on GitHub.
- Completed a Data Science and Data Engineering boot camp (Seattle, 2017) conducted by Data Science Dojo.
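A minimal sketch of the Cox proportional hazards churn model referenced above, assuming the lifelines package; the column names and values are hypothetical stand-ins for the policy data:

```python
# Minimal sketch of a Cox proportional hazards model for policy-surrender churn.
# Assumes the lifelines package; column names and data are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "tenure_months": [12, 34, 5, 48, 22, 60],   # duration until surrender/censoring
    "surrendered":   [1, 0, 1, 0, 1, 0],        # 1 = policy surrendered (event)
    "withdrawal_rate": [0.06, 0.02, 0.09, 0.01, 0.05, 0.00],
    "age_at_issue":    [61, 55, 67, 50, 63, 58],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="surrendered")
cph.print_summary()   # hazard ratios and p-values flag significant lapse drivers
```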
Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel (PivotTables, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, MDM, SharePoint, Data Quality, Tableau, and Reference Data Management.
Confidential
Data Scientist
Roles & Responsibilities:
- Initial work started as a Mainframe Developer, which transitioned into designing the Physical Data Architecture.
- Involved in designing the Physical Data Architecture of the Machine-to-Machine (M2M) model based on the Consumer Model.
- Improved processing efficiency to achieve defect-free bulk orders for New Multiple Connections under the VISION system.
- Worked with consumers and different teams to gain insights about the data concepts behind their business.
- Analyzed business requirements, system requirements, data mapping requirement specifications, and responsible for documenting functional requirements and supplementary requirements.
- Involved in initial data pattern recognition and data cleaning using the dplyr package in R.
- Developed Tabulation datasets and Analysis datasets as per the specifications.
- Coordinated with team members to reach targets successfully.
- Provided programming support for the generation of analysis datasets and M2M data for interim and final analyses.
- Developed ad-hoc reports upon request for M2M and the Data Manager.
- Identified, reviewed, and documented the business requirements for pricing calculation and billing processing.
- Responsible for periodic reporting using reports and graphs in Tableau and Excel.
- Compared actual results to expected results and recorded test results.
- Validated the data between source tables and target table.
- Involved in coordinating testing activities with different testing and development teams.
- Implemented checkpoints for back-end testing.
- Used Quality Center and the CMIS ticketing system for test management and defect tracking.
- Created several efficacy tables, such as Summaries of Best Response, and performed survival analysis using SAS PROC LIFETEST.
- Identified missing data, outliers, and invalid data, and applied appropriate data management techniques.
- Analyzed different trends and market segmentation based on historical data using K-Means clustering and classification techniques (see the sketch after this list).
- Created dashboards and stories for senior managers.
- Involved in creating dashboards and reports in Tableau 8.1.1, maintaining server activities, user activity, and customized views on Server Analysis.
- Created rich graphic visualizations/dashboards in Tableau to enable a fast read on claims and key business drivers and to direct attention to key areas.
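A minimal sketch of the K-Means segmentation referenced above, using scikit-learn with synthetic customer features in place of the historical data:

```python
# Minimal sketch of K-Means-based market segmentation on historical data.
# Uses scikit-learn with synthetic customer features in place of the real data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2)) + rng.choice([0, 5], size=(300, 1))  # loose groups

# Scale features first so no single attribute dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_scaled)
print("segment sizes:", np.bincount(kmeans.labels_))
```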
Environment: Python, R, RStudio, DB2, OLAP, OLTP, Multi-dimensional modeling, Data Warehousing, SQL, Microsoft Office, Tableau 8.1.