Data Scientist/Data Analyst Resume
New Orleans, LA
SUMMARY
- Over 9 years of experience in Machine Learning, Deep Learning, and Data Mining with large structured and unstructured datasets, including Data Validation, Data Acquisition, Data Visualization, and Predictive Modeling; developed predictive models that help provide intelligent solutions.
- Experience with statistical programming languages such as R and Python.
- Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems and generating Data Visualizations using R and Python.
- Hands-on experience with Customer Churn, Sales Forecasting, Market Mix Modeling, Customer Classification, Survival Analysis, Sentiment Analysis, Text Mining, and Recommendation Systems.
- Experience in using Statistical procedures and Machine Learning algorithms such as ANOVA, Clustering, Regression and Time Series Analysis to analyze data for further Model Building.
- Strong mathematical knowledge and hands-on experience in implementing Machine Learning algorithms like K-Nearest Neighbors, Logistic Regression, Linear Regression, Naïve Bayes, Support Vector Machines, Decision Trees, Random Forests, Gradient Boosted Decision Trees, and Stacking Models.
- Experience in building models with Deep Learning frameworks like TensorFlow and Keras.
- Expertise in unsupervised Machine Learning algorithms such as K-Means, Density-Based Clustering (DBSCAN), and Hierarchical Clustering, with strong knowledge of Recommender Systems.
- Hands-on experience in implementing Dimensionality Reduction Techniques like Truncated SVD, Principal Component Analysis, and t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Proficient in advising on the use of data for compiling personnel and statistical reports, preparing personnel action documents, identifying patterns within data, analyzing data, and interpreting results.
- Good knowledge on Deep Learning concepts like Multi-Layer Perceptron, Deep Neural Networks, Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks.
- Hands-on experience with Deep Learning techniques such as Backpropagation, choosing Activation Functions, Weight Initialization based on the Optimizer, avoiding Vanishing and Exploding Gradient problems, Dropout, Regularization and Batch Normalization, Gradient Monitoring and Clipping, Padding and Striding, Max Pooling, and LSTM (see the sketch following this summary).
- Experience in using Optimization Techniques like Gradient Descent, Stochastic Gradient Descent, Adam, Adadelta, RMSProp, and Adagrad.
- Actively involved in all phases of data science project life cycle including Data Extraction, Data Cleaning, Data Visualization and building Models.
- Extensive hands-on experience and high proficiency in writing complex SQL queries involving stored procedures, triggers, joins, and subqueries; also used MongoDB for data extraction.
- Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, MapReduce concepts, and ecosystems including Hive and Pig.
- Experience with data visualization using tools like ggplot, Matplotlib, Seaborn, Tableau, and R Shiny, and using Tableau to publish and present dashboards and storylines on web and desktop platforms.
- Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas and Spark 2.0 (PySpark, MLlib), to develop a variety of models and algorithms for analytic purposes.
- Well experienced in Normalization, De-Normalization and Standardization techniques for optimal performance in relational and dimensional database environments.
- Proficient knowledge on Mathematical Matrix Operations, Statistics, Linear Algebra, Probability, Differentiation, Integration and Geometry.
- Extensive experience working in Test-Driven Development and Agile/Scrum environments.
- Experience in Amazon Web Services (AWS) Cloud services like EC2, S3.
- Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
- Highly skilled in using visualization tools like Tableau, ggplot2, and D3.js for creating dashboards.
- Worked with and extracted data from various database sources such as Oracle, SQL Server, and DB2; regularly used JIRA and other internal issue trackers for project development.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Knowledge of working with Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using Data Munging and Teradata.
- Experience in using the Git version control system. Implemented Kafka for building data pipelines and analytics modules.
- Power user of Python libraries including NumPy, Pandas, SciPy, Scikit-learn, statsmodels, Requests, Matplotlib, Plotly, Seaborn, NLTK, TensorFlow, Keras, SQLAlchemy, and Flask.
- Utilize NLP applications such as topic models and sentiment analysis to identify trends and patterns within massive data sets.
- In-depth understanding of Enterprise Data Warehouse systems, Dimensional Modeling using Facts, Dimensions, Star Schema & Snowflake Schema, and OLAP Cubes like MOLAP, ROLAP, and HOLAP (hybrid). Executed various OLAP operations of Slicing, Dicing, Roll-Up, Drill-Down, and Pivot on multidimensional data and analyzed reports with the Analysis ToolPak in MS Excel.
- Excellent initiative and innovative thinking skills, with the ability to analyze details while maintaining a big-picture view; excellent organizational, project management, and problem-solving skills.
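The sketch below is a minimal, illustrative Keras example of several techniques named above (dropout, batch normalization, He weight initialization, and Adam with gradient clipping); the layer sizes, feature count, and synthetic data are placeholders rather than details from any engagement.

```python
# Minimal Keras sketch: dropout, batch normalization, He weight initialization,
# and Adam with gradient clipping. Sizes and data are illustrative placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(30,)),                      # placeholder feature count
    layers.Dense(64, activation="relu",
                 kernel_initializer="he_normal"),   # weight init matched to ReLU
    layers.BatchNormalization(),
    layers.Dropout(0.3),                            # regularization against overfitting
    layers.Dense(32, activation="relu",
                 kernel_initializer="he_normal"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),          # binary output
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),  # gradient clipping
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Illustrative training call on synthetic data.
X = np.random.rand(256, 30)
y = np.random.randint(0, 2, size=256)
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```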
TECHNICAL SKILLS
Languages: Python, SQL, Java, R, MATLAB
Databases: MySQL, Microsoft SQL Server, Oracle, MongoDB
Hypothesis Testing: Independent & pairwise t-tests, one-way and two-way factorial ANOVA, Pearson's correlation;
Regression Methods: Linear, Multiple, Polynomial, Decision trees and Support vector;
Classification: Logistic, K-NN, Naïve Bayes, Decision trees and SVM;
Clustering: K-means, DBSCAN, Hierarchical, Expectation maximization;
Association Rule Learning: Apriori, Eclat;
Reinforcement Learning: Upper Confidence Bound, Thompson Sampling;
Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks with Long Short-Term Memory (LSTM), Deep Boltzmann Machines;
Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Autoencoders;
Text mining: Natural Language processing;
Ensemble Learning: Random forests, Bagging, Stacking, Gradient Boosting;
Validation Techniques: K-fold cross Validation, A/B Tests, Out of bag sample estimate
Data Visualization: Tableau, Microsoft PowerBI, ggplot2, Matplotlib, Seaborn
Data modeling: Entity relationship Diagrams (ERD), Snowflake Schema
Big Data: Apache Hadoop, HDFS, Kafka, MapReduce, Spark
Cloud Technologies: AWS EC2, S3, Kinesis, Google Colab, Google Compute, Microsoft Azure
Business Intelligence Tools: Tableau, Power BI, SAP Business Intelligence
Other Tools: Spring Boot, Maven, Stata
PROFESSIONAL EXPERIENCE
Confidential - New Orleans, LA
Data Scientist/Data Analyst
Responsibilities:
- Gathered and reviewed business requirements and analyzed data sources.
- Performed data collection, data cleaning, feature scaling, feature engineering, and validation; visualized, interpreted, and reported findings and developed strategic uses of data with Python libraries such as NumPy, Pandas, SciPy, and Scikit-Learn.
- Involved with Recommendation Systems such as Collaborative filtering and content-based filtering.
- Studied and implemented fraud detection models to monitor unconventional purchases across the customer base and alert customers with updates.
- Worked with Credit Analysis and Risk Modeling algorithms to implement customer acquisition strategies in the real-time business.
- Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, sampling, and t-SNE for visualizing high-dimensional data.
- Worked with Customer Churn models including Random Forest regression and Lasso regression, along with pre-processing of the data.
- Created a text classification model using RNN and LSTM with TensorFlow (see the sketch at the end of this role).
- Explored and visualized the data to get descriptive statistics and inferential statistics for better understanding the dataset.
- Built predictive models including Support Vector Machine, Decision Tree, Naive Bayes Classifier, and Neural Network, plus ensemble methods of these models, to evaluate how the likelihood to recommend of customer groups would change under different sets of services, using Python scikit-learn.
- Implemented the training process using cross-validation and test sets, evaluated the results based on different performance metrics, and collected feedback and retrained the model to improve performance.
- Performed multiple MapReduce jobs in PIG and Hive for data cleaning and pre-processing.
- Configured SQL database to store Hive metadata.
- Loaded unstructured data into Hadoop File System (HDFS).
- Performed customer segmentation based on behavior and specific characteristics such as age, region, income, and geographic location, applying clustering algorithms to group customers with similar behavior patterns (see the clustering sketch at the end of this role).
- The segmentation results help determine the Customer Lifetime Value of each segment, discover high-value and low-value segments, and improve customer service to retain customers.
- Used Principal Component Analysis and t-SNE in feature engineering to analyze high dimensional data.
- Analyzed and implemented several research proof-of-concept models for real-time fraud detection of credit card and online banking purchases.
- Performed Data Profiling to learn about behavior across various features such as traffic pattern, location, date, and time; integrated with external data sources and APIs to discover interesting trends.
- Involved in various pre-processing phases of text data like Tokenizing, Stemming, Lemmatization and converting the raw text data to structured data.
- Predicted potential credit card defaulters with 82% accuracy using Random Forest.
- Provided expertise and consultation on consumer and small business behavior score modeling issues, and gave advice and guidance to risk managers using the models in their strategies.
- Participated in strategically critical analytic initiatives around customer segmentation, channel preference, and targeting/propensity scoring.
- Built customer journey analytic maps and utilized NLP to enhance the customer experience and reduce customer friction points.
- Worked on Personalization, Target Marketing, Customer Segmentation, and Profiling.
- Performed data cleaning, feature scaling, featurization, and feature engineering.
- Used Pandas, NumPy, SciPy, Matplotlib, Seaborn, and Scikit-learn in Python at various stages of developing machine learning models, and utilized machine learning algorithms such as Linear Regression, Naive Bayes, Random Forests, Decision Trees, K-Means, and KNN.
- Implemented a number of Natural Language Processing mechanisms for chatbots.
- Performed Clustering with historical, demographic, and behavioral data as features to implement personalized marketing that offers the right product to the right person at the right time on the right device.
- Addressed overfitting and underfitting by tuning the hyperparameters of the algorithm and by using L1 and L2 Regularization.
- Used Spark's Machine learning library to build and evaluate different models.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
- Applied image processing techniques for general beautification and for computer vision purposes, using an image processing toolkit with 3D matrices and manipulating individual pixel values.
- Worked in AWS EC2, configuring the servers for Auto scaling and Elastic load balancing.
- Completed detailed sentiment analysis using SPSS Modeler Premium (Text Analytics).
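A minimal sketch of an LSTM text classifier of the kind described above, assuming TensorFlow/Keras; the vocabulary size, sequence length, and synthetic data are hypothetical placeholders rather than project values.

```python
# Illustrative TensorFlow/Keras LSTM text classifier; data is synthetic.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000     # assumed vocabulary size
MAX_LEN = 100          # assumed padded sequence length

model = keras.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # learn word embeddings
    layers.LSTM(64),                                         # recurrent layer over token sequences
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                   # binary class probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic integer-encoded sequences stand in for tokenized text.
X = np.random.randint(1, VOCAB_SIZE, size=(512, MAX_LEN))
y = np.random.randint(0, 2, size=512)
model.fit(X, y, epochs=2, batch_size=64, validation_split=0.2, verbose=0)
```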
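A rough sketch of the segmentation workflow referenced above: scale customer features, reduce dimensionality with PCA, cluster with K-Means, and project with t-SNE for visualization. The feature names and data are invented for illustration only.

```python
# Sketch of customer segmentation: standardize, PCA, K-Means, t-SNE projection.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Hypothetical customer attributes; real features would come from the warehouse.
rng = np.random.default_rng(42)
customers = pd.DataFrame(
    rng.normal(size=(500, 5)),
    columns=["age", "income", "tenure", "monthly_spend", "visits"],
)

X_scaled = StandardScaler().fit_transform(customers)        # standardize features
X_pca = PCA(n_components=3).fit_transform(X_scaled)         # reduce dimensionality

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)   # group similar customers
customers["segment"] = kmeans.fit_predict(X_pca)

# t-SNE projection of the PCA output for 2-D visualization of the segments.
embedding = TSNE(n_components=2, random_state=42).fit_transform(X_pca)
print(customers["segment"].value_counts())
```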
Confidential - Hicksville, NY
Data Scientist/ Data Analyst
Responsibilities:
- Collaborated with data engineers and operation team to implement the ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis by retrieving the data from the Hadoop cluster.
- Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
- Explored and analyzed the customer specific features by using Matplotlib in Python and ggplot2 in R.
- Performed data imputation using Scikit-learn package in Python.
- Participated in features engineering such as feature generating, PCA, feature normalization and label encoding with Scikit-learn preprocessing.
- Used Python (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and R to develop a variety of models and algorithms for analytic purposes.
- Experimented with and built predictive models, including ensemble models, using machine learning algorithms such as Logistic Regression, Random Forests, and KNN to predict customer churn (see the evaluation sketch at the end of this role).
- Conducted analysis of customer behaviors and discovered the value of customers with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering, Gaussian Mixture Models, and Hierarchical Clustering.
- Used F-Score, AUC/ROC, Confusion Matrix, Precision, and Recall to evaluate different models' performance.
- Designed and implemented a recommendation system which leveraged Google Analytics data and the machine learning models and utilized Collaborative filtering techniques to recommend courses for different customers.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
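A hedged sketch of the churn modeling and evaluation flow described above: fit Logistic Regression and Random Forest models and compare them with F1, ROC AUC, and a confusion matrix. The dataset here is synthetic and imbalanced to mimic churn labels.

```python
# Sketch: train two churn classifiers and compare F1, ROC AUC, confusion matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=2000, n_features=15, weights=[0.8, 0.2],
                           random_state=0)                  # imbalanced "churn" labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    preds = model.predict(X_test)
    print(name,
          "f1=", round(f1_score(y_test, preds), 3),
          "roc_auc=", round(roc_auc_score(y_test, proba), 3))
    print(confusion_matrix(y_test, preds))
```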
Confidential - Cincinnati, OH
Data Scientist
Responsibilities:
- Extracted, transformed and loaded data from multiple data stores into HDFS using Sqoop.
- Used Spark Streaming API to collect real time transactional data.
- Used Python 3.6 programming for handling various datasets and preparing them for further analysis.
- Carried out Statistical Analysis such as Hypothesis and Chi-square tests using R 3.4.
- Initial models were built using supervised classification techniques like K-Nearest Neighbor (KNN), Logistic Regression and Random Forests with Principal component analysis to identify important features.
- Built models using K-means clustering algorithm to create user groups.
- Generated PL/SQL scripts for data manipulation, validation and materialized views for remote instances.
- Reviewed basic SQL queries and edited inner, left, and right joins in Tableau Desktop by connecting live/dynamic and static datasets.
- Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored procedures, Packages, Functions and Triggers using SQL and PL/SQL.
- Wrote Python scripts to parse XML documents and load the data into the database (see the parsing sketch at the end of this role).
- Developed Python scripts to clean the raw data.
- Developed live reports in a drill down mode to facilitate usability and enhance user interaction.
- Queried data from Hadoop/Hive and MySQL data sources to build visualizations in Tableau.
- Facilitated the automation process for the Delinquency Report, which was required to run on a monthly basis.
- Validated regulatory finance data and created automated adjustments using advanced SAS Macros, PROC SQL and various reporting procedures.
- Developed statistical reports with Charts, Bar Charts, Box Plots, and Line Plots using PROC SGPLOT, PROC GCHART, and PROC GBARLINE.
- Extensive use of PROC FREQ, PROC REPORT, and PROC TABULATE for reporting purposes.
- Designed and developed various analytical reports from multiple data sources by blending data on a single worksheet in Tableau Desktop. Involved in creating Tree Map, Heat maps and background maps.
- Involved in generating dual-axis bar chart, Pie chart and Bubble chart with multiple measures and data blending in case of merging different sources.
- Tested dashboards to ensure data was matching as per the business requirements and if there were any changes in underlying data.
- Created reports using analysis output and exported them to the web to enable the customers to have access through Internet.
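An illustrative sketch of parsing an XML document and loading records into a database, as mentioned above; the XML layout, function name, and SQLite target are hypothetical stand-ins for the actual sources and schema.

```python
# Sketch: parse <record> elements from an XML file and insert them into SQLite.
import sqlite3
import xml.etree.ElementTree as ET

def load_xml_to_db(xml_path: str, db_path: str) -> None:
    """Parse <record> elements and insert them into a SQLite table."""
    tree = ET.parse(xml_path)
    rows = [
        (rec.findtext("id"), rec.findtext("name"), rec.findtext("amount"))
        for rec in tree.getroot().iter("record")
    ]
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, amount TEXT)"
        )
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        conn.commit()

# Example usage with placeholder paths:
# load_xml_to_db("export.xml", "analytics.db")
```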
Confidential - Mahwah, NJ
Data Scientist/Data Analyst
Responsibilities:
- Collaborated with database engineers to implement the ETL process; wrote and optimized SQL queries to perform data extraction and merging from the SQL Server database.
- Gathered, analyzed, and translated business requirements, and communicated with other departments to collect client business requirements and assess available data.
- Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python.
- Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features.
- Used Information Value, Principal Component Analysis, and Chi-square feature selection techniques to identify significant features.
- Applied resampling methods like the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes in large datasets (see the resampling sketch at the end of this role).
- Designed and implemented a customized Linear Regression model to predict sales, utilizing diverse sources of data to predict demand, risk, and price elasticity.
- Experimented with multiple classification algorithms such as Logistic Regression, Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Boosting using Python Scikit-Learn, and evaluated performance on customer discount optimization across millions of customers.
- Used F-Score, AUC/ROC, Confusion Matrix, and RMSE to evaluate different models' performance.
- Performed data visualization and designed dashboards with Tableau; generated complex reports, including charts, summaries, and graphs, to communicate findings to the team and stakeholders.
- Used Keras for implementation and trained models using a cyclic learning rate schedule.
- Overfitting issues were resolved with batch normalization and dropout.
- Conducted in-depth analysis and predictive modeling to uncover hidden opportunities; communicated insights to the product, sales, and marketing teams.
- Built models using Python and PySpark to predict the probability of attendance for various campaigns and events.
Environment: NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn, Tableau, SQL, Linux, Git, Microsoft Excel, PySpark-ML, Random Forests, SVM, TensorFlow, Keras.
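A minimal sketch of the SMOTE resampling step mentioned above, assuming the imbalanced-learn package; the dataset here is synthetic and stands in for the actual imbalanced classes.

```python
# Sketch: balance a rare positive class with SMOTE from imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=1)  # rare positive class
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)  # synthesize minority samples
print("after: ", Counter(y_res))
```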
Confidential
Data Scientist/Data Analyst
Responsibilities:
- Performed statistical modeling with machine learning to derive insights from data under the guidance of the Principal Data Scientist.
- Performed data modeling with Pig, Hive, and Impala.
- Performed data ingestion with Sqoop and Flume.
- Used SVN to commit the Changes into the main EMM application trunk.
- Understanding and implementation of text mining concepts, graph processing and semi structured and unstructured data processing.
- Worked with Ajax API calls to communicate with Hadoop through an Impala connection and SQL to render the required data; these API calls are similar to Microsoft Cognitive API calls.
- Good grasp of Cloudera and HDP ecosystem components.
- Used Elasticsearch (Big Data) to retrieve data into the application as required.
- Ran MapReduce programs on the cluster.
- Developed multiple MapReduce jobs in java for data cleaning and preprocessing.
- Developed scalable machine learning solutions within distributed computation frameworks (e.g. Hadoop, Spark, Storm); see the Spark ML sketch at the end of this role.
- Analyzed the partitioned and bucketed data and computed various metrics for reporting.
- Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
- Worked on loading the data from MySQL to HBase where necessary using Sqoop.
- Developed Hive queries for Analysis across different banners.
- Extracted data from Twitter using Java and Twitter API. Parsed JSON formatted twitter data and uploaded to database.
- Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the launched instances with respect to specific applications to improve robustness.
- Exported the result set from Hive to MySQL using Sqoop after processing the data.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Hands-on experience working with Sequence files, Avro, and HAR file formats and compression.
- Used Hive to partition and bucket data.
- Experience in writing MapReduce programs with Java API to cleanse Structured and unstructured data.
- Wrote Pig Scripts to perform ETL procedures on the data in HDFS.
- Created HBase tables to store various data formats of data coming from different portfolios.
- Worked on improving performance of existing Pig and Hive Queries.
Environment: SQL/Server, Oracle 9i, MS-Office, Teradata, Informatica, ER Studio, XML, Business Objects, HDFS, Teradata 14.1, JSON, HADOOP (HDFS), MapReduce, PIG, Spark, R Studio, MAHOUT, JAVA, HIVE, AWS.
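A rough PySpark ML sketch of distributed model training along the lines described above; the schema and in-memory data are hypothetical placeholders for data that would normally be read from Hive or HDFS.

```python
# Sketch: assemble features and fit a logistic regression with Spark ML.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Tiny in-memory stand-in for data that would normally come from Hive/HDFS.
df = spark.createDataFrame(
    [(0.0, 1.2, 3.4), (1.0, 0.4, 1.1), (0.0, 2.2, 0.3), (1.0, 0.1, 0.9)],
    ["label", "f1", "f2"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()

spark.stop()
```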
Confidential
ETL Developer
Responsibilities:
- Developing Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management (MDM) Architecture involving OLTP, ODS and OLAP.
- Providing source to target mappings to the ETL team to perform initial, full, and incremental loads into the target data mart.
- Conducting JAD sessions, writing meeting minutes, collecting requirements from business users and analyze based on the requirements.
- Involved in defining the source to target data mappings, business rules, and data definitions.
- Performing transformations on files received from clients and consumed by SQL Server.
- Working closely with the ETL, SSIS, and SSRS developers to explain the complex data transformation logic.
- Worked on DTS Packages and DTS Import/Export for transferring data from SQL Server 2000 to 2005.
- Working with data profiling, cleansing, integration, and extraction tools.
- Defining the list codes and code conversions between the source systems and the data mart using Reference Data Management (RDM).
- Applying data cleansing/data scrubbing techniques to ensure consistency amongst data sets (see the cleansing sketch at the end of this section).
- Extensively using ETL methodology for supporting data extraction, transformations and loading processing, in a complex EDW.
- Designed and implemented an internal reporting tool named I-CUBE using Python to automate sales and financial operational data accessible through a built-in SharePoint for leaders globally.
- Used API for I-Cube to extract sales data on an hourly-basis.
- Built and customized interactive reports on forecasts, targets and actuals data using BI/ETL tools such as SAS, SSAS, SSIS in the CRM which slashed manual efforts by 8%.
- Conducted operational analyses for business worth $3M working through all phases such as requirements gathering, developing use cases, data mapping and creating workflow diagrams.
- Accomplished data cleansing and analysis results using Excel pivot tables, VLOOKUPs, data validation, graphs and chart manipulation in Excel.
- Designed complex SQL queries, Views, Stored Procedures, Functions and Triggers to handle database manipulation and performance.
- Used SQL and PL/SQL scripts for automating repeatable tasks of customer feedback survey data collection and distribution, which increased departmental efficiency by 8%.
Environment: MS Excel, Agile, Oracle 11g, Sql Server, SOA, SSIS, SSRS, ETL, UNIX, T-SQL, HP Quality Center 11, RDM (Reference Data Management).
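A hedged pandas sketch of routine data cleansing/scrubbing steps like those described above (deduplication, missing-value handling, type coercion, text standardization); the column names and sample frame are illustrative only.

```python
# Sketch: common cleansing steps on a small, invented customer frame.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, None],
    "region": [" East", "East", "west", "WEST", "North"],
    "amount": ["100", "100", "not_available", "250.5", "80"],
})

clean = (
    raw.drop_duplicates()                                        # remove exact duplicates
       .dropna(subset=["customer_id"])                           # drop rows missing the key
       .assign(
           region=lambda d: d["region"].str.strip().str.title(), # standardize text values
           amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
       )
)
clean["amount"] = clean["amount"].fillna(clean["amount"].median())  # impute remaining gaps
print(clean)
```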