Sr. Data Scientist Resume
Boston, MA
PROFESSIONAL SUMMARY:
- A data science professional with 7 years of progressive experience in Data Analytics, Statistical Modeling, Visualization, and Machine Learning. Strong collaborator who learns and adapts quickly.
- Experience in data mining with large structured and unstructured datasets, including data acquisition, data validation, predictive modeling, and data visualization.
- Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.
- Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability and statistics, experiment analysis, confidence intervals, A/B testing, and (iv) algorithms and data structures.
- Extensive knowledge of Azure Data Lake and Azure Storage.
- Experience migrating data from heterogeneous sources, including Oracle, to MS SQL Server.
- Experience in writing SQL queries and working with various databases (MS Access, MySQL, Oracle DB).
- Hands-on experience in the design, management, and visualization of databases using Oracle, MySQL, and SQL Server.
- In-depth knowledge and hands-on experience of the Big Data/Hadoop ecosystem (MapReduce, HDFS, Hive, Pig, and Sqoop).
- Experience with Apache Spark and Kafka for big data processing, and with functional programming in Scala.
- Experience in manipulating large datasets with R packages such as tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and in visualizing data using the lattice and ggplot2 packages.
- Experience in dimensionality reduction using techniques such as PCA and LDA (see the PCA sketch following this list).
- Completed an intensive hands-on Data Analytics boot camp spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
- Experience in data analytics and predictive analysis, including classification, regression, and recommender systems.
- Good exposure to Factor Analysis and to Bagging and Boosting algorithms.
- Experience in descriptive analysis problems such as Frequent Pattern Mining, Clustering, and Outlier Detection.
- Worked on machine learning algorithms for classification and regression, including KNN, Decision Tree, Naïve Bayes, Logistic Regression, SVM, and Latent Factor models.
- Hands-on experience with Python and libraries such as NumPy, Pandas, Matplotlib, Seaborn, NLTK, Scikit-learn, and SciPy.
- Expertise in TensorFlow for machine learning and deep learning in Python.
- Good knowledge of Microsoft Azure SQL, Azure Machine Learning, and HDInsight.
- Good exposure to SAS analytics.
- Good exposure to deep learning with TensorFlow in Python.
- Good knowledge of Natural Language Processing (NLP) and of Time Series Analysis and Forecasting using the ARIMA model in Python and R (see the ARIMA sketch following this list).
- Good knowledge of Tableau and Power BI for interactive data visualizations.
- In-depth understanding of NoSQL databases such as MongoDB and HBase.
- Strong experience provisioning virtual clusters on the AWS cloud using services such as EC2, S3, and EMR.
- Experience developing software in Java and C++ (data structures and algorithms).
- Experience with Business Intelligence tools such as SSIS and SSRS, and with ETL processes.
- Proficient in the design and development of dashboards and reports in Tableau using visualizations such as bar graphs, scatter plots, pie charts, and geographic maps, making use of actions, local and global filters, cascading filters, context filters, quick filters, and parameters according to end-user requirements.
- Good exposure to creating pivot tables and charts in Excel.
- Experience developing custom reports and tabular, matrix, ad hoc, and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
- Excellent database administration (DBA) skills, including user authorization, database creation, tables, indexes, and backups.
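To illustrate the dimensionality-reduction work noted above, here is a minimal PCA sketch with scikit-learn; the feature matrix is synthetic, not project data.

```python
# A minimal sketch of PCA-based dimensionality reduction with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                 # 200 samples, 10 features (synthetic)

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=3)                      # keep the top 3 components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (200, 3)
print(pca.explained_variance_ratio_)           # variance captured per component
```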
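Likewise, a minimal ARIMA forecasting sketch, assuming statsmodels (one common implementation of the model named above) and an illustrative synthetic series:

```python
# A minimal sketch of ARIMA time-series forecasting with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic monthly series: upward drift plus noise
y = pd.Series(np.cumsum(rng.normal(1.0, 2.0, 48)),
              index=pd.date_range("2015-01-01", periods=48, freq="MS"))

model = ARIMA(y, order=(1, 1, 1))   # (p, d, q): AR, differencing, MA orders
fit = model.fit()
forecast = fit.forecast(steps=6)    # forecast 6 periods ahead
print(forecast)
```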
TECHNICAL SKILLS:
Languages: Java 8, Python, R
Python and R: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitteR, NLP, reshape2, rjson, plyr, Beautiful Soup, rpy2
Algorithms: Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD
NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML
Cloud: Google Cloud Platform, AWS, Azure, Bluemix
Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL
Data Modelling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka
Databases: SQL Server, MySQL, MS Access, Teradata, Netezza, MongoDB, Cassandra, HBase; query engines: SQL, Hive, Impala, Pig, Spark SQL; storage: HDFS
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0
ETL Tools: Informatica PowerCenter, SSIS
Version Control Tools: SVN, GitHub
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse
Operating Systems: Windows, Linux, Unix, macOS, Red Hat
PROFESSIONAL EXPERIENCE:
Confidential - Boston, MA
Sr. Data Scientist
Responsibilities:
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, and date and time.
- Extracted data from Hive tables by writing efficient Hive queries.
- Performed preliminary data analysis using descriptive statistics and handled anomalies by removing duplicates and imputing missing values.
- Analyzed data and performed data preparation by applying a historical model to the dataset in Azure ML.
- Applied machine learning algorithms and statistical modeling (decision trees, regression models, SVMs, neural networks, deep learning, and clustering, both supervised and unsupervised), along with text analytics, natural language processing (NLP), and social network analysis, to identify volume, using the scikit-learn package in Python and MATLAB.
- Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
- Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
- Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified meaningful segments of customers through a discovery approach.
- Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources. Used the K-Means clustering technique to identify outliers and classify unlabeled data.
- Evaluated models using cross-validation, the log loss function, and ROC curves, and used AUC for feature selection (see the cross-validation sketch following this job entry); also worked with Elastic technologies such as Elasticsearch and Kibana.
- Worked with the NLTK library for NLP data processing and pattern discovery.
- Categorized comments from different social networking sites into positive and negative clusters using Sentiment Analysis and Text Analytics.
- Analyzed traffic patterns by calculating autocorrelation at different time lags.
- Ensured that the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
- Addressed overfitting by applying regularization methods such as L2 and L1 (see the regularization sketch following this job entry).
- Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
- Created and designed reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM modeling to classify whether a package would be delivered on time on a new route.
- Implemented models such as Logistic Regression, Random Forest, and Gradient-Boosted Trees to predict whether a given die would pass or fail the test.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed a MapReduce pipeline for feature extraction using Hive and Pig.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various data visualizations using Python and Tableau.
- Communicated results to the operations team to support sound decision-making.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.
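A minimal sketch of the cross-validated ROC/AUC evaluation described above, on a synthetic binary-classification dataset rather than the project data:

```python
# Evaluate a classifier with 5-fold cross-validated AUC and log loss.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
ll = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print("AUC per fold:", auc)
print("log loss per fold:", -ll)   # negate: scikit-learn reports negated loss
```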
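And a minimal sketch of L2/L1 regularization against overfitting, using ridge and lasso on a synthetic regression problem:

```python
# Compare L2 (ridge) and L1 (lasso) regularization on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("L2 (ridge)", Ridge(alpha=1.0)),
                    ("L1 (lasso)", Lasso(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    print(name, "test R^2:", round(model.score(X_te, y_te), 3))
```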
Confidential - Austin, TX
Data Scientist
Responsibilities:
- Implemented Data Exploration to analyze patterns and to select features using Python SciPy.
- Built Factor Analysis and Cluster Analysis models using Python SciPy to classify customers into different target groups.
- Built predictive models including Support Vector Machine, Random Forests and Naïve Bayes Classifier using Python Scikit-Learn to predict the personalized product choice for each client.
- Using R's dplyr and ggplot2 packages, performed extensive graphical visualization of the overall data, including customized representations of revenue reports and item-level sales statistics.
- Designed and implemented cross-validation and statistical tests, including hypothesis testing, ANOVA, and autocorrelation, to verify the models' significance.
- Designed an A/B experiment to test the business performance of the new recommendation system (see the A/B test sketch following this job entry).
- Supported MapReduce Programs running on the cluster.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Configured the Hadoop cluster with a NameNode and slave nodes and formatted HDFS.
- Used Oozie workflow engine to run multiple Hive and Pig jobs.
- Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Hadoop MapReduce and HDFS.
- Performed data enrichment jobs to handle missing values, normalize data, and select features using HiveQL.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Analyzed the partitioned and bucketed data and computed various metrics for reporting.
- Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
- Worked on loading the data from MySQL to HBase where necessary using Sqoop.
- Developed Hive queries for Analysis across different banners.
- Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and uploaded it to the database.
- Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the launched instances for specific applications.
- Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after processing the data.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Created HBase tables to store various data formats of data coming from different portfolios.
- Worked on improving performance of existing Pig and Hive Queries.
- Created reports and dashboards, by using D3.js and Tableau 9.x, to explain and communicate data insights, significant features, models scores and performance of new recommendation system to both technical and business teams.
- Utilized SQL, Excel, and several marketing/web analytics tools (Google Analytics, AdWords) to complete business and marketing analysis and assessment.
- Used Git 2.x for version control with Data Engineer team and Data Scientists colleagues.
- Used Agile methodology and the SCRUM process for project development.
Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.
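A minimal sketch of the A/B experiment analysis mentioned above, using a two-proportion z-test from statsmodels; the conversion counts are illustrative, not experiment data:

```python
# Two-proportion z-test comparing control vs. treatment conversion rates.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 355]   # control, treatment (illustrative counts)
visitors = [5000, 5000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the new recommendation system changed conversion.
```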
Confidential - Freeport, ME
Data Scientist
Responsibilities:
- Involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions.
- Created classification models to recognize web requests with product associations in order to classify orders and score products for analytics, which improved the online sales percentage by 13%.
- Used Pandas, NumPy, and Scikit-learn in Python to develop machine learning models such as Random Forest and stepwise regression.
- Worked with the NLTK library in Python to perform sentiment analysis on customer product reviews gathered from third-party websites via web scraping.
- Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
- Developed a fraud detection model by implementing a feed-forward multilayer perceptron, a type of ANN (see the MLP sketch following this job entry).
- Worked with ANN (Artificial Neural Networks) and BBN (Bayesian Belief Networks).
- Used pruning algorithms to cut away connections and perceptrons, significantly improving the performance of the back-propagation algorithm.
- Applied dimensionality reduction, model selection, and model boosting methods using Principal Component Analysis (PCA), K-fold cross-validation, and Gradient Tree Boosting.
- Implemented a structure learning method based on a search-and-score approach.
- Performed customer segmentation based on behavior and specific characteristics such as age, region, income, and geographic location, applying clustering algorithms to group customers with similar behavior patterns.
- Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.
- Worked with numerous data visualization tools in Python, including Matplotlib, Seaborn, ggplot, and Pygal.
Environment: Pandas, NumPy, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, PCA, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x.
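A minimal sketch of the feed-forward multilayer perceptron approach described above, using scikit-learn's MLPClassifier on synthetic, fraud-like imbalanced data:

```python
# Feed-forward MLP for binary classification on imbalanced synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.95],
                           random_state=1)   # ~5% positives, like fraud data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = make_pipeline(
    StandardScaler(),                        # neural nets need scaled inputs
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1),
)
clf.fit(X_tr, y_tr)
print("test accuracy:", round(clf.score(X_te, y_te), 3))
```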
Confidential - Plan, RI
Data Scientist
Responsibilities:
- Gathered business requirements, definition and design of the data sourcing, worked with the data warehouse architect on the development of logical data models.
- Collaborated with Data Engineers to filter data as per the project requirements.
- Conducted reverse engineering based on demo reports to understand undocumented data, redefined the proper requirements, and negotiated them with our client.
- Implemented an automated ticket-routing algorithm using a term affinity matrix, one of the NLP models.
- Implemented various statistical techniques to manipulate the data (missing data imputation, principal component analysis, and sampling).
- Applied dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to the feature matrix.
- Identified outliers and inconsistencies by conducting exploratory data analysis (EDA) with Python's NumPy and Seaborn to gain insight into the data and validate each feature.
- Performed univariate and multivariate analysis to identify underlying patterns in the data and associations between the variables.
- Worked with Market Mix Modeling to strategize advertisement investments and better balance the ROI on advertisements.
- Worked on several proof-of-concept models using deep learning and neural networks.
- Performed feature engineering, including feature intersection generation, feature normalization, and label encoding, with Scikit-learn preprocessing.
- Developed and implemented predictive models using machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN.
- Used clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
- Used grid search to evaluate the best hyperparameters for the model and K-fold cross-validation to train it for best results (see the grid search sketch following this job entry).
- Worked with customer churn models, including Random Forest regression and lasso regression, along with pre-processing of the data.
- Implemented time-based learning rate decay and drop-based learning rate schedules, reducing the model's computation time by 2.8 minutes over 10 epochs (see the learning-rate schedule sketch following this job entry).
- Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
- Designed rich data visualizations to model data into human-readable form with Matplotlib.
Environment: MS Excel, Agile, Oracle 11g, Sql Server, SOA, SSIS, SSRS, ETL, UNIX, T-SQL, HP Quality Center 11, RDM (Reference Data Management).
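A minimal sketch of the grid search plus K-fold cross-validation workflow described above, on a synthetic dataset rather than the project data:

```python
# Hyperparameter tuning with GridSearchCV and 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"n_estimators": [100, 300],
              "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```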
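And a minimal sketch of the two learning-rate schedules mentioned above; the constants are illustrative, not the project's values:

```python
# Time-based vs. drop-based (step) learning-rate decay schedules.
import math

initial_lr = 0.1

def time_based(epoch, decay=0.01):
    # lr shrinks smoothly: lr_0 / (1 + decay * epoch)
    return initial_lr / (1.0 + decay * epoch)

def drop_based(epoch, drop=0.5, epochs_per_drop=10):
    # lr is halved once every `epochs_per_drop` epochs
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))

for epoch in (0, 5, 10, 20):
    print(epoch, round(time_based(epoch), 4), round(drop_based(epoch), 4))
```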
Confidential
Data Analyst
Responsibilities:
- Database management, maintenance and data analysis, processing and testing.
- Tested and ensured data accuracy through the creation and implementation of data integrity queries, debugging, and troubleshooting.
- Worked with business analysts for understanding the problem statement and their requirements.
- Extracted data from various relational databases and wrote SQL queries to modify the data as needed; used FTP to download SAS-formatted data.
- Developed new code and modified existing code to extract data from sources such as DB2 and Oracle, and used MS Excel and SQL Server extensively to manipulate data for business requirements.
- Created new datasets from raw or uncleaned data using various import techniques, and modified existing datasets using join, set, sort, merge, update, and other conditional statements.
- Worked closely with machine learning engineers to analyze data based on their requirements; created pivot tables for analyzing data in Excel (see the pivot-table sketch following this job entry).
- Performed hands-on data analysis and wrote MySQL queries tuned for improved performance.
- Optimized MySQL queries through targeted manipulations and modifications, and removed unwanted columns and duplicate data.
- Exceeded expectations by helping the machine learning team with the data cleaning and preprocessing used to build their machine learning algorithms.
- Performed data cleaning and data manipulation activities using a NoSQL utility.
- Worked with ETL to load data from SQL tables into NoSQL databases (MongoDB) with referential integrity, and developed queries using SQL, SQL*Plus, and Rapid SQL.
- Created dashboards in Tableau for report submissions.
Environment: R/R studio, Python, Tableau, MS SQL Server, MS Access, MS Excel, Outlook, Power BI.
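A minimal sketch of the pivot-table analysis described above, reproduced in pandas on an illustrative sales dataset (the columns and values are hypothetical):

```python
# Pivot-table style aggregation in pandas: revenue by region and product.
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120, 80, 200, 150, 90],
})

pivot = pd.pivot_table(df, values="revenue", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)
```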
Confidential
Data Reporting Analyst
Responsibilities:
- Designed and implemented an internal reporting tool named I-CUBE using Python to automate sales and financial operational data, accessible through a built-in SharePoint for leaders globally. Used an API for I-CUBE to extract sales data on an hourly basis.
- Built and customized interactive reports on forecasts, targets, and actuals using BI/ETL tools such as SAS, SSAS, and SSIS in the CRM, which cut manual effort by 8%.
- Conducted operational analyses for business worth $3M, working through all phases including requirements gathering, developing use cases, data mapping, and creating workflow diagrams.
- Accomplished data cleansing and analysis using Excel pivot tables, VLOOKUPs, data validation, and graph and chart manipulation in Excel.
- Designed complex SQL queries, Views, Stored Procedures, Functions and Triggers to handle database manipulation and performance.
- Used SQL and PL/SQL scripts to automate repeatable tasks of customer feedback survey data collection and distribution, which increased departmental efficiency by 8% (see the automation sketch below).
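A minimal Python analog of the SQL-script automation described above, assuming a hypothetical survey_feedback table in a local SQLite file; table and file names are illustrative only:

```python
# Automate a repeatable task: query survey responses and export them as CSV.
import csv
import sqlite3

conn = sqlite3.connect("feedback.db")                 # hypothetical local DB
conn.execute("""CREATE TABLE IF NOT EXISTS survey_feedback
                (customer_id INTEGER, score INTEGER, comment TEXT)""")
conn.execute("INSERT INTO survey_feedback VALUES (1, 9, 'Great service')")
conn.commit()

# Pull high-scoring responses with a parameterized query.
rows = conn.execute(
    "SELECT customer_id, score, comment FROM survey_feedback WHERE score >= ?",
    (7,),
).fetchall()

# Distribute the result set as a CSV file.
with open("promoters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "score", "comment"])
    writer.writerows(rows)
conn.close()
```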