Data Scientist Resume
SUMMARY:
- Dedicated and self-motivated data scientist seeking a position in data science or data analytics where I can apply my professional and academic experience to advance organizational goals.
- Overall 9+ years of IT experience in Data Science, Business Analytics, Business Intelligence (BI) Systems, Data Mining, Reporting, Marketing Analytics, Data Advisory & Decision support systems.
- 5+ years of hands-on Data Science/Machine Learning experience across a variety of business contexts and data sources. Proficient in quickly extracting hidden insights from data and building useful models by leveraging a repertoire of diverse and deep technical skills.
- Expertise in Python, scikit-learn, R, Tableau, and Java, with working knowledge of TensorFlow, Theano, and Keras.
- Strong data visualization skills in communicating statistical findings to business users and rolling out insights into day-to-day operations using Tableau, Lumira, and Power BI
- Expertise in Hadoop ecosystem as Hadoop Architect
- Expertise in machine learning techniques such as support vector machines, random forests, neural networks, logistic regression, and decision trees
- Strong experience in natural language processing and text analytics techniques
- Experience in development of recommendation engines using collaborative filtering and association rule mining techniques
- Good working experience in advanced statistical analysis techniques including logistic regression, ANOVA, hypothesis testing, discriminant analysis, the chi-squared test, and the F-test
- Development experience in scripting languages like Python and Scala
- Experienced with analyzing large data sets and developing analytics solutions
- Implemented ASP Aggregator Data Warehouse using MS SQL Server Data Tools
- Advanced knowledge in data management and relational databases
- Excellent understanding of HDFS, MapReduce framework and extensive experience in developing MapReduce Jobs
- Strong knowledge of Spark along with Scala and Python for handling large-scale data processing
- In-depth understanding of Hadoop architecture and various components such as HDFS, YARN, Zookeeper, Oozie, Hue, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce concepts
- Hands-on experience and knowledge in streaming systems like Flume and Kafka
TECHNICAL SKILLS:
Programming Languages: R Programming, Python, Scikit-Learn, SQL, Base SAS, Scala, Java, C#, VBA
Machine Learning techniques: Natural Language Processing (Sentiment Analysis, LDA, LSA, PLSA, Named Entity Recognition, Word Vectorization, Document Vectorization, Stemming, Lemmatization), Linear Regression (LASSO, Ridge, Elastic Net), Classification (Logistic Regression, LDA, Naïve Bayes, KNN, SVM), Decision Trees (XGBoost & Random Forest), PCA, Association Rules & Recommendation Engines, Survival Analysis, Time Series Analysis, Clustering, Deep Learning. Excellent skill in conducting exploratory data analysis (EDA)
Big Data Ecosystem: Hadoop 3.0, RHadoop, MongoDB, Spark 1.6, Pig, HBase, Hive, Impala, Flume, Kafka, Solr, NoSQL
Relational Databases: MySQL, MS SQL Server, Oracle 10g
Data Visualization: R ggplot2, Python Matplotlib, Tableau Desktop 9.3, SAS Enterprise Miner
Web Analytics: Google Analytics, Google Adwords, Facebook Ads
Other: Weka, MATLAB, Statistical Modeling, Econometrics, A/B Testing
PROFESSIONAL EXPERIENCE:
Confidential
Data Scientist
Environment: Python (Xgboost, nltk, scikit-learn, pandas, scipy, numpy, seaborn, matplotlib, re, gensim, wordnet), R, Spark, Scala, Hadoop, Solr
Responsibilities:
- Analyzed unstructured textual data from Quora questions and performed exploratory data analysis
- Cleaned and prepared textual data by lemmatization, stemming, removing stopwords, and by regular expression
- Recognized named entities using named entity recognition technique and tagged parts of speech of each sentence
- Used sentence vectorization and cosine distance techniques to check similarity of Quora questions, then classified question pairs with the gradient boosting algorithm XGBoost
- Detected spam SMS messages in a large SMS dataset using Naïve Bayes, decision tree, and neural network classifiers
- Designed and implemented a platform using text mining packages in R and Python (tm, openNLP, NLTK 3.0) for cleansing, harmonizing, classifying, matching, merging, de-duping, and profiling buyer, supplier, and third-party data for use within solutions, and created a business taxonomy model covering millions of rows of customer transaction text data
- Developed customer segmentation analysis and created campaign planning tools for use by marketing managers
- Performed statistical modeling in R and Lumira to predict backorder probability for various products using logistic regression, LDA, and random forest models
- Implemented ensemble models (bagging and boosting) to improve model efficiency and performance
- Built a predictive model based on advanced statistical analysis, hypothesis testing, and machine learning (multivariate regression) to predict future sales, explaining approximately 35% of the variance
- Automated large scale data processing in a distributed environment
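The duplicate-question work above can be sketched with a minimal bag-of-words cosine-similarity step (illustrative only; the actual pipeline used trained sentence vectors and an XGBoost classifier, neither reproduced here, and all data below is made up):

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy question pairs: the first two overlap heavily, the third does not
q1 = "How do I learn machine learning?"
q2 = "What is the best way to learn machine learning?"
q3 = "Where can I buy a used car?"
```

In the full pipeline, similarity scores like these would feed into the classifier as one feature among many.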
Confidential
Data Scientist
Environment: Python (Scikit-learn, Nltk, Pandas, Seaborn, Matplotlib, Beautifulsoup, Numpy, Scipy), R (Arules, RandomForest, Caret, Tree), SAS (Proc SQL, Proc Panel, Proc Logistic, ODS), MySQL, Tableau, R-ggplot2, R-Shiny, Spark, Scala, Hadoop, Hive, Impala, Flume, Solr, Weka
Responsibilities:
- Detected patterns of physician frauds from US Medicare datasets using logistic regression, random forest, and neural network models and generated heat maps, geographic maps, bubble charts, and dashboards in Tableau to show regional and procedural variances in Medicare costs
- Time series forecasting: built univariate and multivariate time series models (ETS, STL, naïve, Holt, Holt-Winters, ARIMA, ARIMAX, VAR, and GARCH) to forecast sales for various products
- Extracted insights from a manufacturing industry dataset using Spark DataFrames, RDDs, Spark SQL, and Scala in the Hadoop ecosystem
- Responsible for quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand how users interact with consumer products.
- Configured Apache Flume in Hadoop ecosystem to stream Twitter data to HDFS and Apache Solr
- Created sentiment analysis model and complex query model of Twitter data using Hadoop ecosystem, HiveQL, Impala, and regular expression
- Analyzed trends and regional variances of sales of various products of an office supply company and identified top products in terms of seasonal and regional performances. Prescribed ways to improve sales by performing machine learning and regression analysis.
- Worked on large-scale Hadoop YARN clusters for distributed data processing and analysis using Databricks Connectors, Spark core, Spark SQL, Sqoop, Pig, Hive, Impala and NoSQL databases
- Implemented Spark scripts using Scala, Python, and Spark SQL to load Hive tables into Spark for faster data processing
- Designed and implemented importing data to HDFS using Sqoop from different RDBMS servers.
- Developed multinomial logistic regression model to predict brand choice of US deodorant industry and developed clustering model to identify valuable customers.
- Developed collaborative filtering-based recommendation engines using Python and R to recommend retail products
- Investigated association between physicians’ qualities and their online reviews by scraping physician review data using Python, applying natural language processing techniques, and then using panel regression techniques
- Investigated several natural language techniques such as LDA, LSA, PLSA, and Doc2Vec to evaluate their performances in mining consumer review datasets
- Compared performances of classification algorithms including Naïve Bayes, Decision Tree, Neural Network, kNN, and Logistic regression techniques
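The collaborative-filtering recommendation work above can be illustrated with a small item-based sketch (data, names, and scoring are all illustrative; a production engine would use a real ratings matrix and neighborhood tuning):

```python
import math

# Toy user -> {item: rating} matrix (made-up data)
ratings = {
    "alice": {"bread": 5, "milk": 4, "jam": 1},
    "bob":   {"bread": 4, "milk": 5, "jam": 2},
    "carol": {"jam": 5, "tea": 4},
}

def item_vector(item):
    """Column of the ratings matrix: users who rated this item."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v1, v2):
    common = set(v1) & set(v2)
    dot = sum(v1[u] * v2[u] for u in common)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def recommend(user, k=1):
    """Score unseen items by similarity to the user's rated items."""
    seen = ratings[user]
    items = {i for r in ratings.values() for i in r}
    scores = {}
    for i in items - set(seen):
        scores[i] = sum(cosine(item_vector(i), item_vector(j)) * seen[j]
                        for j in seen)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Item-based filtering scales well because item-item similarities are more stable over time than user-user ones.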
Confidential
Graduate Teaching Assistant
Environment: Python, R, SAS, MySQL, MS SQL Server, Tableau, R-ggplot2, R-Shiny, MS Access, MS Excel Pivot Table, MS Excel V-lookup
Responsibilities:
- Created and tested skill-based development assignments and projects for Data Visualization, Systems Analysis and Project Management, IT Security, IT for Management, IT Strategy and Management, and IT for Business courses using machine learning techniques, Tableau, R-ggplot2, R-Shiny, MS Excel pivot tables and V-lookup, MS Access, and predictive and prescriptive analytics techniques
- Analyzed and troubleshot issues in machine learning algorithms, SQL, and Tableau
- Worked as a project manager for IT Strategy and Management Course
- Evaluated the assignments submitted by students.
- Mentored students in a weekly development lab for student assignments.
- Responsible for handling technical issues for 300 customers
- Managed paid search campaigns and monitored budgets for same
- Analyzed keyword list for searching website and expanded it as required
- Optimized search engine and landing page experience of client’s website
- Created advertisement campaign for the client through Google Adwords and Facebook Ads and optimized click-through rate for the client. Achieved a click-through rate of over 15%
- Used Google Analytics to create ad hoc reports on business segment performances
- Analyzed Google Adwords and Facebook Ads data of the client and prescribed ways to improve search engine and landing page experiences for users
- Created weekly queries and trending reports on advertisements
- Used statistical modeling to identify and rank top business school programs based on customer perception
Confidential
Senior Data Analyst
Environment: Base SAS, SAS Enterprise Miner, MS SQL Server, SSIS, SSRS, SSAS, Tableau, Excel Pivot Table, R
Responsibilities:
- Evaluated performances of neural network, logistic regression, and decision tree algorithms to predict success of a direct marketing campaign for a European bank, using Base SAS and SAS Enterprise Miner
- Created data warehouse using ETL techniques and then used OLAP cubes to analyze the data
- Developed integrated sales analysis using sales transaction data, customer data, and product data by modeling calculation views with joined analytical views (OLAP cubes) using actual and planned sales data
- Created business intelligence reports of OLAP cube data using Tableau and Pivot Tables
- Created data models, dimension fact model (DFM), and star schema for relational data
- Used SAS procedures such as Proc Datasets and Proc Contents to make data dictionary
- Analyzed large data sets using SAS and Proc SQL & SAS Macro
- Used Proc Compare for comparing datasets from different source systems
- Created datasets to be used by Tableau for visualization
- Used Statistical SAS Procedures such as Proc Freq for univariate statistical analysis
- Developed parametric and non-parametric hazard models for client churn
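The non-parametric churn hazard work above can be sketched with a plain Kaplan-Meier estimator (a standard survival-analysis tool; the data below is invented for illustration, and the parametric models are not reproduced):

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve for client churn.

    durations: time until churn or censoring for each client
    observed:  1 if the client actually churned, 0 if censored
    Returns a list of (time, survival probability) steps.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)   # still subscribed at t
        churned = sum(1 for d, e in zip(durations, observed) if d == t and e)
        surv *= 1 - churned / at_risk                   # product-limit step
        curve.append((t, surv))
    return curve

# Toy cohort: five clients, two of them censored (still active)
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored clients still count in the at-risk denominator until their last observed month, which is exactly what makes the estimator unbiased under right-censoring.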
Confidential
Senior Data Analyst
Environment: Excel Pivot Table, Python, R, Tableau
Responsibilities:
- Developed content analysis models to identify most popular descriptive, predictive, and prescriptive analytics tools and techniques used in different value chain activities of firms
- Developed statistical models to compare performances of online and in-class students of University of Confidential at Greensboro
- Created social media analytics models to analyze Twitter and Facebook data
- Created statistical models to explore relationships between student performance and the following variables: attendance, study time, weekly assignment grades, and instructor’s teaching experience.
- Performed Market Basket analysis to identify customer buying patterns, preferences and behaviors to better manage sales and inventory
- Developed data visualization models in Tableau to develop heat maps, trend reports, dashboards, etc.
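The market basket analysis above rests on support and confidence over transaction sets; a minimal sketch (toy transactions, single-antecedent rules only; a real analysis would use Apriori or FP-Growth over millions of baskets):

```python
from itertools import combinations

# Toy transactions (made-up data)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules(min_support=0.4, min_confidence=0.6):
    """Enumerate single-antecedent rules x -> y meeting both thresholds."""
    items = {i for t in transactions for i in t}
    out = []
    for a, b in combinations(sorted(items), 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y})
            if s >= min_support:
                conf = s / support({x})
                if conf >= min_confidence:
                    out.append((x, y, round(s, 2), round(conf, 2)))
    return out
```

Support filters out rare itemsets; confidence then measures how reliably the antecedent implies the consequent.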
Confidential
Data Scientist/ Graduate Research Assistant
Environment: R, Python, MATLAB, Linux, FORTRAN
Responsibilities:
- Developed a Markov Chain Monte Carlo (MCMC) based Ensemble Kalman Filter forecasting model in MATLAB with 79% more accuracy compared to existing models to predict contaminant transport in groundwater
- Evaluated performances of singular value decomposition (SVD) and eigen-value decomposition (EVD) techniques for Ensemble Squared-Root Kalman filter (EnSRKF) models
- Compared performances of Kalman filter, Ensemble Kalman filter, and Ensemble Square-Root Kalman filter in groundwater contaminant transport modeling
- Performed unsupervised k-means clustering on Fisher’s Iris dataset and checked performances of k-means algorithm with different parameters
- Performed Bayes’ classification on Fisher’s Iris dataset with Euclidean and Mahalanobis distances
- Developed feature selection models on Fisher’s Iris dataset using Divergence, Transformed Divergence and Bhattacharyya Distance algorithms
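The Kalman-filter comparison above builds on the basic measurement update; a scalar sketch of that single step (the ensemble and square-root variants extend this to state vectors and sampled covariances, which are not reproduced here):

```python
def kalman_update(x, P, z, R):
    """One scalar Kalman filter measurement update.

    x, P: prior state estimate and its variance
    z, R: measurement and its noise variance
    Returns the posterior estimate and variance.
    """
    K = P / (P + R)           # Kalman gain: how much to trust the measurement
    x_post = x + K * (z - x)  # blend prior estimate with the innovation
    P_post = (1 - K) * P      # uncertainty shrinks after each update
    return x_post, P_post
```

Repeated updates drive the posterior variance down monotonically, which is why the filter converges on noisy contaminant-concentration measurements.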
Confidential
Senior Analyst/ Assistant Manager
Environment: Python, R, SQL, Excel, SAS
Responsibilities:
- Developed advanced analytics-based and statistical models to predict market potential of untapped geographic locations
- Developed customer churn analytics models using randomForest in R for business clients to predict probability of churn from financial products
- Investigated performances of parametric and non-parametric hazard model (Cox Proportional Hazard) in churn prediction
- Developed decision tree and logistic regression-based models to predict default probability of small and medium business customers
- Achieved a 32% increase in market reach by suggesting changes based on descriptive and predictive analytics
- Performed cluster analysis to identify important client segments for marketing campaign
- Created business intelligence reports involving v-lookup, pivot table, bubble chart, heat map for top management of the company
- Worked on development of a client fraud detection platform using various machine learning algorithms in R and Python
- Applied discriminant analysis, greedy forward selection, greedy backward selection, and feature reduction algorithms such as Principal Component Analysis (PCA) and Factor Analysis
- Collaborated closely with subject matter experts to identify and define business reporting requirements
- Developed Generalized Linear Models such as Poisson Regression model and Negative Binomial Distribution model for count data of number of accounts for each customer
- Developed multinomial discrete choice models to predict popularity of financial products of the company
- Developed analytics-based solutions based on predictive, behavioral or other models using statistical analysis and relevant modeling techniques
- Participated in all the phases of knowledge discovery: data collection, data cleaning, developing models, validation and visualization
- Analyzed key data points and variables to optimize cross-sell, up-sale, and renewal possibilities of existing clients
- Developed decision tree-based models for cross-sell, up-sales and renewal possibilities
- Performed recency, frequency, monetary (RFM) analysis for customer segmentation using k-means and k-medoids clustering
- Performed t-tests and other statistical models to identify performance differences of two teams
- Developed advanced regression models to predict sales in presence of interaction of advertisements, consumer demographics, and geographic variances.
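The RFM segmentation above can be sketched with a plain k-means pass over (recency, frequency, monetary) vectors (toy data and a fixed seed for illustration; production work would standardize features and choose k by silhouette or elbow analysis):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on RFM feature vectors (recency, frequency, monetary)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep old center
        # if a cluster went empty.
        centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious customer segments: low-value vs high-value (made-up RFM rows)
rfm = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (9, 9, 9), (9, 8, 9), (8, 9, 9)]
centers, clusters = kmeans(rfm, 2)
```

k-medoids follows the same loop but restricts centers to actual data points, which makes it more robust to monetary-value outliers.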
Confidential
MBA Intern
Environment: SQL, SAS
Responsibilities:
- Worked on a monthly expenditure tracking project; responsibilities included cleaning, aggregating, analyzing, and interpreting data, carrying out quality analysis of the tracker, and preparing a weekly growth analysis report for financial products using advanced statistical techniques (cluster, factor, and tree analysis) in SAS and SQL
- Developed time series ARIMA models to predict trend and seasonality of product sales
- Created data visualization reports involving v-lookup, pivot table, bubble chart, heat map for business leaders of the company
- Developed attrition models based on recency, frequency, monetary (RFM) analysis.
- Built a customer churn model using logistic regression and random forest algorithms to predict the churn probability of customers. The model, with an accuracy of 81%, helped the client retain customers worth $5M.
Confidential
Analyst
Environment: Excel, Python, R, MS SQL Server, MATLAB
Responsibilities:
- Developed databases of company clients and construction projects
- Managed several construction projects as a project manager
- Developed simulation models to determine structure behavior under various loadings
- Developed pivot table, heat maps, bar charts, bubble charts for top management of the company
- Created interactive visualization models for product sales stages
- Developed procurement and spend analytics models to find optimal solution for the client
- Developed Gantt chart, CPM, PERT models for various projects
- Developed BPMN and UML-based business process models
- Created pattern recognition-based models in MATLAB to identify possibility of project delay
- Wrote advanced SQL queries to get subtle insights from data and created dashboards and BI reports.
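The CPM scheduling work above reduces to a forward pass over the task dependency graph; a minimal sketch (task names and durations are invented, and the backward pass for slack is omitted):

```python
def critical_path_length(tasks):
    """Forward-pass CPM: earliest finish per task and total project duration.

    tasks: {name: (duration, [predecessor names])}, assumed acyclic.
    """
    finish = {}

    def earliest_finish(name):
        if name not in finish:
            duration, preds = tasks[name]
            # A task can start only after its latest predecessor finishes.
            finish[name] = duration + max(
                (earliest_finish(p) for p in preds), default=0)
        return finish[name]

    for name in tasks:
        earliest_finish(name)
    return finish, max(finish.values())

# Toy construction project (made-up activities)
project = {
    "design":  (3, []),
    "procure": (2, ["design"]),
    "build":   (5, ["design", "procure"]),
    "inspect": (1, ["build"]),
}
finish, total = critical_path_length(project)
```

The critical path is whichever chain of predecessors realizes the project total; tasks off that chain carry slack, which the omitted backward pass would quantify.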