
Data Scientist Resume


SUMMARY

  • A data science professional with over 5 years of progressive experience in data analytics, statistical modeling, visualization, machine learning, and deep learning. Collaborates well, learns quickly, and adapts readily.
  • Experience in data mining on large structured and unstructured datasets, including data acquisition, data validation, predictive modeling, and data visualization.
  • Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
  • Experience in writing subqueries, stored procedures, triggers, cursors, and functions in MySQL.
  • Good experience using various Python libraries (Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas, and MySQLdb for database connectivity).
  • Strong Experience in Big data technologies including Apache Spark, HDFS, Hive, MongoDB.
  • Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.
  • Extensive knowledge on Azure Data Lake and Azure Storage.
  • Experience in migration from heterogeneous sources including Oracle to MS SQL Server.
  • Hands on experience in design, management and visualization of databases using Oracle, MySQL and SQL Server.
  • Experienced with the Hadoop 2.x ecosystem (Hive, Pig, Sqoop) and the Apache Spark 2.x framework (PySpark).
  • Experience in Apache Spark and Kafka for big data processing, and in Scala functional programming.
  • Experience in manipulating large datasets with R packages such as tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and in visualizing data using the lattice and ggplot2 packages.
  • Experience in dimensionality reduction using techniques like PCA and LDA (a minimal sketch follows this summary).
  • Completed an intensive hands-on data analytics boot camp spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
  • Experience in data analytics, predictive analysis like Classification, Regression, Recommender Systems.
  • Worked on Machine Learning algorithms like Classification and Regression with KNN Model, Decision Tree Model, Naïve Bayes Model, Logistic Regression, SVM Model and Latent Factor Model.
  • Hands-on experience on Python and libraries like NumPy, Pandas, Matplotlib, Seaborn, NLTK, Sci-Kit learn, SciPy.
  • Expertise in TensorFlow as a machine learning/deep learning package in Python.
  • Good knowledge of Microsoft Azure SQL, Machine Learning, and HDInsight.
  • Good exposure to SAS analytics.
  • Good exposure to deep learning with TensorFlow in Python.
  • Good knowledge of natural language processing (NLP) and of time series analysis and forecasting using the ARIMA model in Python and R (see the forecasting sketch after this summary).
  • Good knowledge of Tableau and Power BI for interactive data visualization.
  • In-depth understanding of NoSQL databases such as MongoDB and HBase.
  • Experienced with Amazon Web Services (AWS) and Microsoft Azure, including AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, and Azure Data Lake. Very good experience provisioning virtual clusters on AWS using services such as EC2, S3, and EMR.
  • Experience in developing Custom Report and different types of Tabular Reports, Matrix Reports, Ad hoc reports and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
  • Excellent database administration (DBA) skills, including user authorization, database creation, tables, indexes, and backup creation.
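
To illustrate the dimensionality-reduction experience above, a minimal scikit-learn sketch follows; the feature matrix X, labels y, and the 95% variance target are hypothetical stand-ins, not taken from any project described here.

```python
# Minimal sketch: dimensionality reduction with PCA and LDA (scikit-learn).
# X and y below are synthetic stand-ins for a real feature matrix and labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # 500 samples, 20 features
y = rng.integers(0, 3, size=500)    # 3 hypothetical classes

# PCA: unsupervised projection keeping enough components for 95% variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print("PCA components kept:", pca.n_components_)

# LDA: supervised projection onto at most (n_classes - 1) dimensions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("LDA output shape:", X_lda.shape)
```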
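
Similarly, the ARIMA forecasting noted above can be sketched with statsmodels; the synthetic monthly series and the (1, 1, 1) order are assumptions chosen only for illustration.

```python
# Minimal sketch: time series forecasting with an ARIMA model (statsmodels).
# The random-walk series and the (p, d, q) order are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=120, freq="MS")  # monthly points
series = pd.Series(np.cumsum(rng.normal(size=120)) + 50, index=idx)

model = ARIMA(series, order=(1, 1, 1))
fit = model.fit()
print(fit.forecast(steps=12))   # 12-month-ahead forecast
```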

TECHNICAL SKILLS

Languages: Java 8, Python, R

Python and R: NumPy, SciPy, pandas, scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, reshape2, rjson, plyr, Beautiful Soup, rpy2

Algorithms: Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD

NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML

Cloud: Google Cloud Platform, AWS, Azure, Bluemix

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL Server, MySQL, MS Access, HBase, Teradata, Netezza, MongoDB, Cassandra; query layers: SQL, Hive, Impala, Pig, Spark SQL; storage: HDFS.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating System: Windows, Linux, Unix, macOS, Red Hat

PROFESSIONAL EXPERIENCE

Confidential

Data Scientist

Responsibilities:

  • Extracted data from Hive tables by writing efficient Hive queries.
  • Performed preliminary data analysis using descriptive statistics and handled anomalies by removing duplicates and imputing missing values (a minimal pandas sketch follows this list).
  • Analyzed data and performed data preparation by applying the historical model to the dataset in Azure ML.
  • Applied various machine learning algorithms and statistical modeling techniques (decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, neural networks, deep learning, SVM, clustering) to identify volume, using the scikit-learn package in Python and MATLAB.
  • Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
  • Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified meaningful segments of customers through a discovery approach.
  • Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data sources. Used the K-means clustering technique to identify outliers and classify unlabeled data (see the outlier sketch after this list).
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana (an evaluation sketch follows this list).
  • Worked with the NLTK library for NLP data processing and pattern discovery.
  • Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
  • Ensured that the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
  • Addressed overfitting by applying regularization methods such as L1 and L2.
  • Used principal component analysis (PCA) in feature engineering to analyze high-dimensional data.
  • Created and designed reports that use the gathered metrics to infer and draw logical conclusions about past and future behavior.
  • Performed multinomial logistic regression, random forest, decision tree, and SVM modeling to classify whether a package will be delivered on time on a new route.
  • Implemented different models like Logistic Regression, Random Forest and Gradient-Boost Trees to predict whether a given die will pass or fail the test.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the database, and used ETL for data transformation.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Communicated results to the operations team to support sound decisions.
  • Collected data needs and requirements by interacting with other departments.
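
A minimal, hedged sketch of the duplicate-removal and imputation step described above; the toy DataFrame and its column names are hypothetical stand-ins for data extracted from Hive.

```python
# Minimal sketch: remove duplicates and impute missing values with pandas.
# The DataFrame and column names are hypothetical, not project data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "route":  ["A", "A", "B", "C", "C"],
    "weight": [1.2, 1.2, np.nan, 3.4, 3.4],
    "days":   [2.0, 2.0, 5.0, np.nan, np.nan],
})

df = df.drop_duplicates()                         # drop exact duplicate rows
for col in ["weight", "days"]:
    df[col] = df[col].fillna(df[col].median())    # median imputation
print(df.describe())                              # descriptive statistics
```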
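
The K-means outlier detection mentioned in this list can be sketched as follows; flagging the farthest 2% of points from their cluster centers is an assumed threshold, not the project's actual rule.

```python
# Minimal sketch: K-means distances used to flag outliers.
# The synthetic data and the 98th-percentile cutoff are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (490, 5)),    # bulk of the data
               rng.normal(8, 1, (10, 5))])    # injected anomalies

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance from each point to its assigned cluster center.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = np.quantile(dists, 0.98)          # flag the farthest 2%
print("Flagged outliers:", np.where(dists > threshold)[0])
```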
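
Finally, a sketch combining the regularization and cross-validated evaluation bullets above; the synthetic dataset and the C value are illustrative assumptions.

```python
# Minimal sketch: L2-regularized logistic regression scored with
# cross-validated ROC AUC and log loss; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# penalty="l2" shrinks weights to curb overfitting; C is the inverse
# regularization strength, so smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l2", C=0.5, max_iter=1000)

auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
ll = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print("ROC AUC: %.3f, log loss: %.3f" % (auc.mean(), -ll.mean()))
```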

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, San Francisco, CA

Data scientist

Responsibilities:

  • Gathered, analyzed, documented and translated application requirements into data models, supported standardization of documentation and the adoption of standards and practices related to data and applications.
  • Queried and aggregated data from Amazon Redshift to get the sample dataset.
  • Identified patterns, data quality issues, and leveraged insights by communicating with BI team.
  • In the preprocessing phase, used pandas to remove or replace all missing data and applied feature engineering to eliminate unrelated features.
  • Balanced the dataset by over-sampling the minority label class and under-sampling the majority label class (a minimal sketch follows this list).
  • In the data exploration stage, used correlation analysis and graphical techniques to gain insights into the claim data.
  • Applied machine learning techniques to tap into new markets and new customers, and presented recommendations to top management, which increased the customer base by 5% and the customer portfolio by 9%.
  • Built multi-layer neural networks for deep learning using TensorFlow and Keras (see the Keras sketch after this list).
  • Performed hyperparameter tuning with distributed cross-validation in Spark to speed up computation.
  • Exported trained models to Protobuf to be served by TensorFlow Serving, and performed the integration work with the client's application.
  • Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore cross-selling opportunities for financial products; the share of customers holding more than 6 products rose from 40% to 60%.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn (a feature-selection sketch follows this list).
  • Tested classification algorithms such as Logistic Regression, Gradient Boosting and Random Forest using Pandas and Scikit-learn and evaluated the performance.
  • Worked extensively with data governance team to maintain data models, Metadata and dictionaries.
  • Developed advanced models using multivariate regression, Logistic regression, Random forests, decision trees and clustering.
  • Applied predictive analysis and statistical modeling techniques to analyze customer behavior and offer customized products, reducing the delinquency rate and the default rate; default rates fell from 5% to 2%.
  • Applied various machine learning algorithms and statistical modeling techniques (decision trees, regression models, neural networks, SVM, clustering) to identify volume, using the scikit-learn package in Python and MATLAB.
  • Implemented, tuned, and tested the model on AWS EC2 with the best algorithm and parameters.
  • Set up a data preprocessing pipeline to guarantee consistency between the training data and incoming data.
  • Deployed the model on AWS Lambda and collaborated with the development team to build the business solutions.
  • Collected feedback after deployment and retrained the model to improve performance.
  • Discovered flaws in the methodology being used to calculate weather peril zone relativities; designed and implemented a 3D algorithm based on k-means clustering and Monte Carlo methods.
  • Observed groups of customers being neglected by the pricing algorithm; used hierarchical clustering to improve customer segmentation and increase profits by 6%.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
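
A minimal sketch of the class-balancing step above, using plain pandas resampling; the DataFrame, label column, and target class size are hypothetical.

```python
# Minimal sketch: over-sample the minority class and under-sample the
# majority class to a common size; df and "claim" are hypothetical names.
import pandas as pd

def balance(df: pd.DataFrame, label: str, target_n: int) -> pd.DataFrame:
    """Resample every label class to target_n rows (up or down as needed)."""
    parts = []
    for _, grp in df.groupby(label):
        # replace=True lets a small class be over-sampled past its size.
        parts.append(grp.sample(n=target_n, replace=len(grp) < target_n,
                                random_state=0))
    return pd.concat(parts).sample(frac=1, random_state=0)   # shuffle rows

df = pd.DataFrame({"x": range(10), "claim": [0] * 8 + [1] * 2})
print(balance(df, "claim", target_n=5)["claim"].value_counts())
```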
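
The multi-layer network bullet can be sketched with the Keras API shipped inside TensorFlow; the layer sizes, synthetic data, and training settings are assumptions, not the production architecture.

```python
# Minimal sketch: a small multi-layer network in Keras for binary
# classification; the data and architecture are illustrative only.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 30)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")   # hypothetical target

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(30,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```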
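
And a sketch of using random-forest importances for feature selection, as in the fraud-prediction bullet; the synthetic dataset and the median-importance threshold are assumptions.

```python
# Minimal sketch: feature selection from random-forest importances.
# Synthetic data; the "median" threshold is an illustrative choice.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=2000, n_features=25,
                           n_informative=6, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_sel = selector.transform(X)
print("Features kept:", X_sel.shape[1], "of", X.shape[1])
```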

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, NY

Data Modeler/ Data Scientist

Responsibilities:

  • Involved in designing context flow diagrams, structure charts, and ER diagrams.
  • Gathered and documented requirements for a QlikView application from users.
  • Extracted data from various sources (SQL Server, Oracle, text files, and Excel sheets) and used ETL load scripts to manipulate, concatenate, and clean source data.
  • Carried out extensive system study, design, development, and testing in the Oracle environment to meet customer requirements.
  • Served as a member of a development team providing business data requirements analysis services, producing logical and physical data models using Erwin 7.1.
  • Ensured that the first-cut physical data model was generated with business definitions of the fields (columns) and records (tables).
  • Wrote and executed customized SQL code for ad hoc reporting and used other tools for routine report generation.
  • Worked as part of a team of Data Management professionals supporting a Portfolio of development projects both regional and global in scope.
  • Applied organizational best practices to enable application project teams to produce data structures that fully meet application needs for accurate, timely, and consistent data that fully meets its intended purposes.
  • Conducted peer reviews of completed data models and plans to ensure quality and integrity from data capture through usage and archiving. Used advanced Excel features such as pivot tables and charts to generate graphs.
  • Designed and developed weekly and monthly reports using MS Excel techniques (charts, graphs, pivot tables) and PowerPoint presentations (a pandas equivalent of these pivot summaries is sketched below).
  • Applied strong Excel skills, including pivots, VLOOKUP, conditional formatting, and large record sets, for data manipulation and cleaning.
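
As a hedged aside, the Excel pivot-table summaries above have a direct pandas equivalent; the DataFrame and column names below are hypothetical.

```python
# Minimal sketch: a pandas pivot table mirroring the Excel summaries.
# Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "revenue": [100, 120, 90, 110],
})

summary = pd.pivot_table(df, values="revenue", index="region",
                         columns="month", aggfunc="sum", margins=True)
print(summary)
```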

Environment: Erwin, Oracle 11g, SQL Server 2012, Informatica PowerCenter 9.1, Cognos.
