
Sr. Data Scientist Resume

Wallingford, CT


  • A Data Scientist with over 7 years of progressive experience in Data Analytics, Statistical Modeling, Visualization, Machine Learning, and Deep Learning. Excellent capability in collaboration, quick learning and adaptation.
  • Experience in Data mining with large datasets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive modeling, Data Visualization.
  • Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, schema design.
  • Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions in MySQL.
  • Experience in applying machine learning algorithms for a variety of programs.
  • Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
  • Good experience in using various Python libraries (Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, Pandas, MySQLdb for database connectivity).
  • Strong experience in Big Data technologies including Apache Spark, HDFS, Hive, MongoDB.
  • Hands-on experience with Git.
  • Good working experience in processing large datasets with Spark using Python.
  • Sound understanding of deep learning using CNN, RNN, ANN, reinforcement learning, transfer learning.
  • Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.
  • Extensive knowledge of Azure Data Lake and Azure Storage.
  • Experience in migration from heterogeneous sources including Oracle to MS SQL Server.
  • Hands-on experience in design, management and visualization of databases using Oracle, MySQL and SQL Server.
  • Experienced in the Hadoop 2.x ecosystem and Apache Spark 2.x framework, including Hive, Pig, Sqoop, PySpark.
  • Experience in Apache Spark and Kafka for Big Data processing, and functional programming in Scala.
  • Experience in manipulating large data sets with R packages like tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and visualizing the data using the lattice and ggplot2 packages.
  • Experience in dimensionality reduction using techniques like PCA and LDA.
  • Completed an intensive hands-on boot camp in Data Analytics spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
  • Experience in data analytics, predictive analysis like Classification, Regression, Recommender Systems.
  • Good exposure to Factor Analysis, Bagging and Boosting algorithms.
  • Experience in Descriptive Analysis Problems like Frequent Pattern Mining, Clustering, Outlier Detection.
  • Worked on Machine Learning algorithms like Classification and Regression with KNN Model, Decision Tree Model, Naïve Bayes Model, Logistic Regression, SVM Model and Latent Factor Model.
  • Hands-on experience with Python and libraries like NumPy, Pandas, Matplotlib, Seaborn, NLTK, Scikit-learn, SciPy.
  • Expertise in using the TensorFlow package for machine learning and deep learning in Python.
  • Good knowledge of Microsoft Azure SQL, Machine Learning and HDInsight.
  • Good exposure to SAS analytics.
  • Good exposure to deep learning with TensorFlow in Python.
  • Good knowledge of Natural Language Processing (NLP) and Time Series Analysis and Forecasting using the ARIMA model in Python and R.
  • Good knowledge of Tableau and Power BI for interactive data visualizations.
  • In-depth understanding of NoSQL databases like MongoDB, HBase.
  • Experienced in Amazon Web Services (AWS) and Microsoft Azure, such as AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Data Lake. Very good experience and knowledge in provisioning virtual clusters under AWS cloud, which includes services like EC2, S3, and EMR.
  • Experience and Knowledge in developing software using Java, C++ (Data Structures and Algorithms) technologies.
  • Good exposure to creating pivot tables and charts in Excel.
  • Experience in developing Custom Reports and different types of Tabular Reports, Matrix Reports, Ad hoc reports and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
  • Excellent Database administration (DBA) skills including user authorizations, Database creation, Tables, indexes and backup creation.


Languages: Java 8, Python, R

NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, Reshape2, rjson, plyr, Beautiful Soup, Rpy2

Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD

NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML

Cloud: Google Cloud Platform, AWS, Azure, Bluemix

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modeling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating System: Windows, Linux, Unix, Macintosh HD, Red Hat


Confidential, Wallingford, CT

Sr. Data Scientist


  • Analyze data and perform data preparation by applying a historical model to the data set in Azure ML.
  • Clean data by applying backward and forward filling methods to handle missing values.
  • Apply data transformation methods for rescaling and normalizing variables.
  • Develop a predictive model and validate a KNN model to predict the feature label.
  • Plan, develop, and apply leading-edge analytic and quantitative tools and modeling techniques to help clients gain insights and improve decision-making.
  • Utilize Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Apply various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the Scikit-learn package in Python and MATLAB.
  • Perform data cleaning and feature selection using the MLlib package in PySpark, and work with deep learning frameworks such as Caffe and Neon.
  • Leverage the most appropriate algorithms and be prepared to justify your decisions.
  • Work closely with key stakeholders in product, finance and operations to form deep understanding of growth and marketplace dynamics, including product and pricing patterns, outlier detection, forecasting, and imputation.
  • Collaborate with product and engineering to integrate various sources of data.
  • Apply strict sampling, statistical inference, and survey techniques to derive insights from small samples of data.
  • Utilize Sqoop to ingest real-time data. Use analytics libraries Scikit-learn, MLlib and MLxtend.
  • Extensively use Python data science packages like Pandas, NumPy, matplotlib, Seaborn, SciPy, Scikit-learn and NLTK.
  • Pre-process and clean data for feature engineering, and apply data imputation techniques for missing values in the dataset using Python.
  • Implement machine learning models (logistic regression, XGBoost, SVM) with Python Scikit-learn.
  • Work with different data formats such as JSON and XML, and apply machine learning algorithms in Python.
  • Perform exploratory data analysis to find trends and clusters.
  • Develop rigorous data science models to aggregate inconsistent real-time signals into strong predictors of market trends.
  • Automate and own the end-to-end process of modeling and data visualization.
  • Collaborate with Data Engineers and Software Developers to develop experiments and deploy solutions to production.
  • Create and publish multiple dashboards and reports using Tableau server.
  • Work on Text Analytics, Naive Bayes, Sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Work on data that was a combination of unstructured and structured data from multiple sources and automate the cleaning using Python scripts.
  • Perform data analysis by using Hive to retrieve the data from Hadoop cluster, SQL to retrieve data from Oracle database.
  • Create Data Quality Scripts using SQL to validate successful data load and quality of the data.
  • Create data visualizations using Python and Tableau.
  • Extract data from HDFS and prepare it for exploratory analysis using data munging.
  • Interface with supervisors, artists, systems administrators, and production to ensure production deadlines are met.
  • Extensively perform large data reads and writes to and from CSV and Excel files using pandas.
  • Maintain RDDs using Spark SQL.
  • Communicate and coordinate with other departments to collect business requirements.
  • Tackle a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
  • Improve fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
  • Optimize the algorithm with stochastic gradient descent and fine-tune its parameters with manual tuning and automated tuning such as Bayesian optimization.
  • Write research reports describing the experiment conducted, results, and findings and also make strategic recommendations to technology, product, and senior management.

Confidential, Dallas, TX

Data Scientist


  • Implemented Data Exploration to analyze patterns and to select features using Python SciPy.
  • Built Factor Analysis and Cluster Analysis models using Python SciPy to classify customers into different target groups.
  • Built predictive models including Support Vector Machine, Random Forests and Naïve Bayes Classifier using Python Scikit-Learn to predict the personalized product choice for each client.
  • Using R’s dplyr and ggplot2 packages, performed an extensive graphical visualization of overall data, including customized graphical representation of revenue reports, specific item sales statistics and visualization.
  • Designed and implemented cross-validation and statistical tests including hypothesis testing, ANOVA, and autocorrelation to verify the models’ significance.
  • Designed an A/B experiment for testing the business performance of the new recommendation system.
  • Supported MapReduce Programs running on the cluster.
  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Configured Hadoop cluster with Namenode and slaves and formatted HDFS.
  • Used Oozie workflow engine to run multiple Hive and Pig jobs.
  • Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Hadoop MapReduce and HDFS.
  • Performed data enrichment jobs to handle missing values, normalize data, and select features using HiveQL.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and Twitter API. Parsed JSON formatted twitter data and uploaded to database.
  • Launched Amazon EC2 cloud instances using Amazon images (Linux/Ubuntu) and configured the launched instances for specific applications.
  • Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.
  • Created reports and dashboards, by using D3.js and Tableau 9.x, to explain and communicate data insights, significant features, models scores and performance of new recommendation system to both technical and business teams.
  • Utilized SQL, Excel and several marketing/web analytics tools (Google Analytics, AdWords) to complete business and marketing analysis and assessment.
  • Used Git 2.x for version control with the Data Engineer team and Data Scientist colleagues.
  • Used Agile methodology and the Scrum process for project development.

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.

Confidential, NY

Data Scientist/Data Analyst


  • Performed statistical modelling with ML to bring insights from data under the guidance of the Principal Data Scientist.
  • Performed data modeling with Pig, Hive, Impala.
  • Performed data ingestion with Sqoop, Flume.
  • Used SVN to commit changes to the main EMM application trunk.
  • Understood and implemented text mining concepts, graph processing, and semi-structured and unstructured data processing.
  • Worked with Ajax API calls to communicate with Hadoop through an Impala connection and SQL to render the required data. These API calls are similar to Microsoft Cognitive API calls.
  • Good grip on Cloudera and HDP ecosystem components.
  • Used Elasticsearch (Big Data) to retrieve data into the application as required.
  • Ran MapReduce programs on the cluster.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Developed scalable machine learning solutions within a distributed computation framework (e.g. Hadoop, Spark, Storm etc.).
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and Twitter API. Parsed JSON formatted twitter data and uploaded to database.
  • Launched Amazon EC2 cloud instances using Amazon images (Linux/Ubuntu) and configured the launched instances for specific applications to improve robustness.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Worked hands-on with Sequence files, Avro, and HAR file formats and compression.
  • Used Hive to partition and bucket data.
  • Wrote MapReduce programs with the Java API to cleanse structured and unstructured data.
  • Wrote Pig Scripts to perform ETL procedures on the data in HDFS.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.

Environment: SQL Server, Oracle 9i, MS Office, Teradata 14.1, Informatica, ER Studio, XML, Business Objects, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, RStudio, Mahout, Java, Hive, AWS.

Confidential, Austin, TX

Data Analyst/Data Scientist


  • Integrated data from multiple data sources or functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.
  • Leveraged SQL, Excel and Tableau to manipulate, analyze and present data.
  • Performed analyses of structured and unstructured data to solve multiple and/or complex business problems utilizing advanced statistical techniques and mathematical analyses.
  • Developed advanced models using multivariate regression, logistic regression, random forests, decision trees and clustering.
  • Used Pandas, NumPy, Seaborn, Scikit-learn in Python for developing various machine learning algorithms.
  • Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data.
  • Worked with distributed computing technologies (Apache Spark, Hive).
  • Applied predictive analysis and statistical modeling techniques to analyze customer behavior and offer customized products, reducing the delinquency rate and default rate. This led to a fall in default rates from 5% to 2%.
  • Applied machine learning techniques to tap into new markets and new customers, and presented recommendations to top management, which resulted in an increase in customer base by 5% and customer portfolio by 9%.
  • Analyzed customer master data to identify prospective business, understand business needs, build client relationships and explore opportunities for cross-selling of financial products. The share of customers using more than 6 products increased from 40% to 60%.
  • Collaborated with business partners to understand their problems and goals, and developed predictive models, statistical analyses, data reports and performance metrics.
  • Participated in the ongoing design and development of a consolidated data warehouse supporting key business metrics across the organization.
  • Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.
  • Developed dashboards and reports using Tableau.
  • Involved in full SDLC of BI Project including Data Analysis, Designing, Development of Data Warehouse environment.
  • Used Oracle Data Integrator Designer to develop processes for extracting, cleansing, transforming, integrating, and loading data into data warehouse database.
  • Developed and customized PL/SQL packages, procedures, functions, triggers and reports using Oracle SQL Developer.
  • Responsible for designing, developing and testing of the ETL strategy to populate the data from various source systems (Flat files, Oracle).
  • Worked with the Business units to identify data quality rule requirements against identified anomalies.
  • Developed data mappings, joins and validation queries, and addressed data queries raised by the project team in a timely manner.
  • Worked closely with the Business Analyst and interacted with business users to gather new business requirements and to understand current business requirements accurately.
  • Created Repositories, Agents, Contexts, and both Physical and Logical Schemas in Topology Manager for all the source and target schemas.
  • Performed data mapping and logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data within the Oracle database.
  • Installed and set up the ODI Master Repository, Work Repository, and Execution Repository.
  • Used Topology Manager to manage the data describing the information systems physical and logical architecture.
  • Extensively worked and utilized ODI Knowledge Modules (Reverse Engineering, Loading, Integration, Check, Journalizing and service).
  • Created various procedures and variables.
  • Created ODI Packages, Jobs of various complexities and automated process data flow.
  • Configured and set up ODI: Master Repository, Work Repository, projects, models, sources, targets, packages, Knowledge Modules, interfaces, scenarios, filters, conditions, and metadata.
