Data Scientist Resume

Jersey City, NJ

SUMMARY

  • Data Scientist with 7 years of experience in Data Acquisition, Data Screening, Statistical Modelling, Data Exploration, Data Visualization with large data sets of Structured and Unstructured data and implementing Machine Learning algorithms to facilitate important business decisions.
  • Experience working in domains including E-commerce, Real Estate, Telecom and Retail.
  • Proficient in using analytical tools such as Python and R to identify trends and relationships between data points, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Experience in facilitating the entire Data Science project lifecycle, with active involvement in Data Extraction, Data Cleaning, Feature Engineering, Dimensionality Reduction, Prototyping & Training (Data processing & Encoding), Model selection, Backtesting, Model tuning and Productionization.
  • Adept in working with statistical tests: t-tests, One-way & Two-way ANOVA; Normality tests: Jarque-Bera, Shapiro-Wilk tests; Non-parametric tests: Chi-square, Wilcoxon signed-rank tests.
  • Experience in working with Machine Learning algorithms including Regression models like Linear, Polynomial, Support Vector, Decision trees; Classification models including Logistic Regression, Support Vector Machines, K-Nearest Neighbor, Naïve Bayes, Decision trees; Ensemble learning methods like Random forests, Bagging, Boosting, Stacking; Clustering techniques like K-means, DBSCAN, Hierarchical clustering; Association Rule learning with Apriori, Eclat; Reinforcement learning with Upper Confidence Bound, Thompson Sampling.
  • Extensive knowledge of Dimensionality reduction (PCA, LDA), Hyper-parameter tuning, Model regularization, Grid search techniques to optimize the cost function and model performance.
  • Expertise in the Data Cleaning process of outlier detection and removal using Isolation Forest, Grubbs’ test for univariate analysis, Mahalanobis & Cook’s distance for multivariate analysis; imputing null values using Multivariate Imputation by Chained Equations (MICE) in R and IterativeImputer in Python.
  • Skilled in Big Data technologies like Spark, SparkSQL, PySpark, Hadoop Distributed File System, MapReduce & Kafka.
  • Experience in Web Data mining with Python’s Scrapy and BeautifulSoup packages along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
  • Experience in working with text by implementing Recurrent Neural Networks using Long Short-Term Memory (LSTM) architecture with Many-to-One combination for Sentiment Analysis.
  • Good knowledge of Database Creation and maintenance of physical data models with Oracle, DB2 and SQL Server databases.
  • Excellent exposure to Data Visualization with Tableau & PowerBI.
  • Experience with Python libraries including NumPy, Pandas, SciPy, scikit-learn, Matplotlib, Seaborn, Dask, Theano, TensorFlow, NLTK and R libraries including ggplot2, dplyr, esquisse.
  • Experience developing algorithms to create Artificial Neural networks to implement AI solutions to optimize business processes and minimize costs.
  • Expertise in Computer vision for image classification and face detection using Convolutional Neural Networks with ResNet architecture.
  • Utilized Excel Pivot Tables and VLOOKUP for data pre-processing, created ANOVA sheets and regressions, and performed hypothesis testing using the Data Analysis add-in in Excel.

TECHNICAL SKILLS

Programming Languages: Python, R, MATLAB

Database: MySQL, PostgreSQL, Oracle, MongoDB, Microsoft SQL Server

Analytical Techniques:

  • Hypothesis testing: Independent & pairwise t-tests, one-way and two-way factorial ANOVA, Pearson’s correlation
  • Regression Methods: Linear, Multiple, Polynomial, Decision trees and Support vector
  • Classification: Logistic, K-NN, Naïve Bayes, Decision trees and SVM
  • Clustering: K-means, DBSCAN, Hierarchical, Expectation maximization
  • Association Rule Learning: Apriori, Eclat
  • Reinforcement Learning: Upper Confidence Bound, Thompson Sampling
  • Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks with Long Short-Term Memory (LSTM), Deep Boltzmann machines
  • Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Autoencoders
  • Text mining: Natural Language Processing
  • Ensemble Learning: Random forests, Bagging, Stacking, Gradient Boosting

Algorithms: Gradient Descent, Stochastic Gradient Descent, Gradient Optimization - Adam, Momentum, RMSProp

Validation Techniques: Gradient, K-fold cross Validation, Monte-Carlo simulations, Out of bag sample estimate

Data Visualization: Tableau, Microsoft PowerBI, ggplot2, Matplotlib, Seaborn and Bokeh

Data modeling: Entity relationship Diagrams (ERD), Snowflake Schema

Big Data: Apache Hadoop, HDFS, Kafka, MapReduce, Spark

PROFESSIONAL EXPERIENCE

Confidential, Jersey City, NJ

Data Scientist

Responsibilities:

  • Assisted in developing a Recommendation System with Machine learning algorithms such as K-Nearest Neighbor, Apriori and Deep Neural Networks.
  • Incorporated Deep Boltzmann Machines and AutoEncoders architectures of the Recurrent Neural Networks class to develop a highly accurate model. Communicated and coordinated with the end client to collect data and performed ETL to define a uniform standard format.
  • Implemented dimensionality reduction using Deep Autoencoder collaborative filtering technique to increase the recall by 6% and customer reach by 11%.
  • Performed correlation analysis in the data exploration phase using graphical techniques in Matplotlib and Seaborn to produce insights about the product sales by region and type of business.
  • Explored the product sales data for cannibalization when similar products were launched in the same category.
  • Segmented the data using K-Means clustering and analyzed clients' behavior according to their demographic details, regions and monthly revenues in each cluster.
  • Incorporated image embeddings like t-distributed stochastic neighbor embedding (t-SNE) obtained from deep convolutional networks for improved recommendation of items.
  • Designed and implemented Cross-validation and statistical tests including k-fold, stratified k-fold and hold-out schemes to test and verify the models' significance.
  • Developed multivariate Gaussian anomaly detection algorithm in Python to identify suspicious patterns in network traffic. Employed Expectation-maximization clustering using Gaussian Mixture models (GMM).
  • Developed dashboards in Tableau to visualize the suspicious patterns/activities in real time for business users.
  • Used Python to perform text mining to find meaningful patterns in unstructured textual feedback. Created word clouds and word corpora that were used by higher management.
  • Implemented topic modeling using LDA to predict the product category of feedback and detect specific issues in each category by analyzing the feedback collected from customer service.
  • Performed sentiment analysis on customer feedback after every release.

Environment: Python 3.6, PyTorch, Tableau

Confidential, Framingham, MA

Data Scientist

Responsibilities:

  • Gathered and reviewed business requirements and analyzed data sources.
  • Involved in various pre-processing phases of text like Tokenizing, Stemming, Lemmatization and converting the raw text to structured data.
  • Performed Data Collection, Data Cleaning, Feature Engineering (Deep Feature Synthesis), Validation, Visualization and reporting of findings, and developed strategic uses of data.
  • Implemented sampling, Principal component analysis and t-SNE for visualizing high dimensional data.
  • Worked with NLP to classify text with the data drawn from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set.
  • One of the goals was to target users through automated classification. This assisted in creating cohorts to improve marketing.
  • The NLP text analysis monitored, tracked and classified user discussion about products and services in online discussions.
  • The gradient boosted decision trees classifier was trained to identify whether a cohort was a promoter or detractor.
  • Constructed new vocabulary to encode the variables in a machine-readable format using Bag of words, TF-IDF, Word2vec, Average Word2vec.
  • Implemented Long Short-Term Memory (LSTM) layer network of moderate depth to capture the information in the sequence.
  • Optimized the performance of the neural network by Pruning and choosing the right number of hidden layers and neurons per layer.
  • Executed processes in parallel using the distributed environment of TensorFlow across multiple devices (CPUs & GPUs).
  • The overall project improved the marketing Return on Investment (ROI) by 15% and customer satisfaction by 20%.

Environment: Python, NLTK
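
For illustration only, the Bag of words and TF-IDF encodings named in the responsibilities above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the project's code: the sample documents and the names `tokenize` and `tf_idf` are hypothetical, and it uses the plain `log(N / df)` weighting, whereas libraries often apply smoothed variants.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag-of-words counts for this document
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical feedback snippets, for illustration only.
docs = [
    "great product fast delivery",
    "poor product slow delivery",
    "great service",
]
weights = tf_idf(docs)
```

Terms that appear in fewer documents (such as "fast") receive a higher weight than terms shared across documents (such as "product").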

Confidential

Data Scientist

Responsibilities:

  • Screened for missing values by rows and columns and removed variables with missing values above the cutoff point.
  • Used Mahalanobis distance, Cook’s distance and leverage statistics along with a chi-square cutoff on the numerical variables to detect outliers.
  • Checked for correlation in data to observe the distributions of all numeric and categorical variables.
  • Analyzed customer data for churn prediction using Logistic Regression, Support Vector Machines, Decision Trees and Random Forests and compared the results.
  • Optimized the decision tree model using ensemble learning methods, mainly Bagging, Random Forests, Stacking and Extreme Gradient Boosting techniques.
  • Analyzed and grouped customers into different clusters based on purchase and historic data using techniques such as k-means clustering.
  • Built a Logistic Regression Classifier to determine user's purchase intention and target potential buyers from past data history.
  • The models were validated using Backtesting through K-fold cross validation; the learning rate was optimized through Hyper-parameter tuning and Grid search.
  • The models were refined on the basis of values obtained from the ROC plot and CAP curve. Various metrics such as RMSE, MAE & Confusion matrix were used to evaluate the performance of the model.
  • The final results were summarized using a dashboard in PowerBI and presented to the client.

Environment: Python, SQL, R, Microsoft PowerBI, Spark
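
As a rough illustration of the K-fold cross validation mentioned above, the index splitting can be sketched in plain Python. This is an assumption-laden sketch rather than the code used on the project: the function name `k_fold_indices` is hypothetical, and no shuffling or stratification is applied.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs. Each sample
    appears in exactly one test fold; the first n_samples % k folds get
    one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    indices = list(range(n_samples))
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]

    splits = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


# Hypothetical example: 10 samples, 3 folds.
splits = k_fold_indices(10, 3)
```

A model would be fitted on each `train` set and scored on the matching `test` set, and the k scores averaged; hyper-parameter grid search repeats this loop once per candidate setting.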

Confidential

Data Analyst

Responsibilities:

  • Led initiative to build statistical models using historical data to predict real estate prices in several economic markets. Focused on analyzing the factors affecting the value of properties in Bangalore, India.
  • Developed prediction algorithm using advanced data mining techniques to classify similar properties together and to develop sub-markets based on zip codes.
  • Created database designs through data-mapping using ER diagrams and normalization up to the 3rd normal form and extracted relevant data whenever required using joins in PostgreSQL, Microsoft SQL Server and SQLite.
  • Extracted terabytes of relevant data using HDFS & MapReduce from Hadoop.
  • Conducted data preparation and outlier detection using MS SQL Server; built the model using Python.
  • Provided statistical insights using t-tests, ANOVA and chi-square tests, and performed post-hoc analysis including Bonferroni correction and Tukey’s HSD to assess differences across levels of categories, test the significance of proportional differences and assess whether the sample size is large enough to detect the differences.
  • Provided statistical insights into 5% VaR, expected shortfall, semi-deviation & skewness-to-kurtosis ratio to guide investment decisions.
  • Predicted house prices and area population income using regression methods in Excel and Octave (MATLAB).
  • Worked with Portfolio managers to arrive at an optimal solution to the problem. Increased the revenue of the firm by 5%.
  • Created and presented executive dashboards and scorecards to show the trends in the data using Excel and VBA macros.

Environment: Microsoft Excel, Python, Microsoft SQL Server
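
The Bonferroni correction listed among the post-hoc analyses above can be illustrated with a short sketch. The p-values are made up for the example and the function name `bonferroni` is hypothetical; each p-value is multiplied by the number of comparisons (capped at 1) before being compared with the significance level.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p_values, reject_flags) under the Bonferroni correction."""
    m = len(p_values)  # number of comparisons being made
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values from three pairwise comparisons.
adjusted, reject = bonferroni([0.01, 0.02, 0.04])
```

With three comparisons, only the first p-value remains significant at the 5% level after adjustment.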
