Sr. Data Scientist Resume

Boston, MA

PROFESSIONAL SUMMARY:

  • A Data Science professional with 7+ years of progressive experience in data analytics, statistical modeling, visualization, machine learning, and deep learning; a strong collaborator who learns and adapts quickly.
  • Experience in data mining with large structured and unstructured datasets, including data acquisition, data validation, predictive modeling, and data visualization.
  • Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, Sharding, replication, schema design.
  • Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions in MySQL.
  • Experience in applying machine learning algorithms for a variety of programs.
  • Experienced with Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
  • Good experience using various Python libraries (Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, Pandas, and MySQLdb for database connectivity).
  • Strong Experience in Big data technologies including Apache Spark, HDFS, Hive, MongoDB.
  • Hands on experience of Git.
  • Good working experience in processing large datasets with Spark using Python.
  • Sound understanding of deep learning using CNNs, RNNs, ANNs, reinforcement learning, and transfer learning.
  • Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.
  • Extensive knowledge on Azure Data Lake and Azure Storage.
  • Experience in migration from heterogeneous sources including Oracle to MS SQL Server.
  • Hands on experience in design, management and visualization of databases using Oracle, MySQL and SQL Server.
  • Experienced in the Hadoop 2.x ecosystem and the Apache Spark 2.x framework, including Hive, Pig, Sqoop, and PySpark.
  • Experience in Apache Spark, Kafka for Big Data Processing & Scala Functional programming.
  • Experience in manipulating large datasets with R packages such as tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and in visualizing data using the lattice and ggplot2 packages.
  • Experience in dimensionality reduction using techniques like PCA and LDA.
  • Completed an intensive hands-on data analytics boot camp spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
  • Experience in data analytics, predictive analysis like Classification, Regression, Recommender Systems.
  • Good Exposure with Factor Analysis, Bagging and Boosting algorithms.
  • Experience in Descriptive Analysis Problems like Frequent Pattern Mining, Clustering, Outlier Detection.
  • Worked on Machine Learning algorithms like Classification and Regression with KNN Model, Decision Tree Model, Naïve Bayes Model, Logistic Regression, SVM Model and Latent Factor Model.
  • Hands-on experience with Python and libraries such as NumPy, Pandas, Matplotlib, Seaborn, NLTK, scikit-learn, and SciPy (an illustrative sketch follows this list).
  • Expertise in TensorFlow for machine learning and deep learning in Python.
  • Good knowledge on Microsoft Azure SQL, Machine Learning and HDInsight.
  • Good exposure to SAS analytics.
  • Good exposure to deep learning with TensorFlow in Python.
  • Good knowledge of natural language processing (NLP) and of time series analysis and forecasting using ARIMA models in Python and R.
  • Good knowledge in Tableau, Power BI for interactive data visualizations.
  • In-depth Understanding in NoSQL databases like MongoDB, HBase.
  • Experienced with Amazon Web Services (AWS) and Microsoft Azure, including AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, and Azure Data Lake. Very good experience and knowledge provisioning virtual clusters on AWS using services such as EC2, S3, and EMR.
  • Experience and Knowledge in developing software using Java, C++ (Data Structures and Algorithms) technologies.
  • Good exposure in creating pivot tables and charts in Excel.
  • Experience in developing Custom Report and different types of Tabular Reports, Matrix Reports, Ad hoc reports and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
  • Excellent Database administration (DBA) skills including user authorizations, Database creation, Tables, indexes and backup creation.
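
A minimal, illustrative sketch of the kind of scikit-learn workflow referenced in the Python-libraries bullet above. The dataset file and column names are hypothetical placeholders, not taken from any actual project.

    # Hedged sketch: a basic train/evaluate loop with scikit-learn.
    # "customers.csv" and the "churned" label column are hypothetical.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    df = pd.read_csv("customers.csv")
    X = df.drop(columns=["churned"])
    y = df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # fit scaling on training data only
    X_test = scaler.transform(X_test)        # reuse the same scaling at test time

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))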

TECHNICAL SKILLS:

Languages: Java 8, Python, R

Python and R: NumPy, SciPy, Pandas, scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitteR, NLP, reshape2, rjson, plyr, Beautiful Soup, rpy2

Algorithms: Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD

NLP/Machine Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs

Deep Learning: ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML

Cloud: Google Cloud Platform, AWS, Azure, Bluemix

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL Server, MySQL, MS Access, Teradata, Netezza, MongoDB, Cassandra, HBase, HDFS, Hive, Impala, Pig, Spark SQL

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0

ETL Tools: Informatica PowerCenter, SSIS

Version Control Tools: SVN, Git, GitHub

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating Systems: Windows, Linux, Unix, macOS, Red Hat

PROFESSIONAL EXPERIENCE:

Confidential, Boston, MA

Sr. Data Scientist

Responsibilities:

  • Performed data profiling to learn about behavior across various features such as traffic pattern, location, and date and time.
  • Extracted the data from hive tables by writing efficient Hive queries.
  • Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.
  • Analyzed data and performed data preparation by applying a historical model to the dataset in Azure ML.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
  • Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
  • Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified meaningful segments of customers through a discovery approach.
  • Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources. Used K-Means clustering to identify outliers and to classify unlabeled data.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana.
  • Worked with the NLTK library for NLP data processing and pattern discovery.
  • Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
  • Analyzed traffic patterns by calculating autocorrelation at different time lags.
  • Ensured that the model had a low false-positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
  • Addressed overfitting by implementing regularization methods such as L1 and L2.
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data (see the sketch after this list).
  • Created and designed reports that use gathered metrics to draw logical conclusions about past and future behavior.
  • Performed multinomial logistic regression, random forest, decision tree, and SVM modeling to classify whether a package would be delivered on time on a new route.
  • Implemented different models like Logistic Regression, Random Forest and Gradient-Boost Trees to predict whether a given die will pass or fail the test.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python.
  • Developed a MapReduce pipeline for feature extraction using Hive and Pig.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Communicated results to the operations team to support decision-making.
  • Collected data needs and requirements by interacting with other departments.
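
A sketch of the PCA-based feature engineering, regularized classification, and cross-validated AUC evaluation described in the bullets above; the synthetic data and parameter choices are illustrative assumptions, not the original project's settings.

    # Hedged sketch: PCA + L2-regularized logistic regression,
    # evaluated with cross-validated AUC (synthetic data stands in
    # for the real feature matrix).
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

    pipe = Pipeline([
        ("pca", PCA(n_components=10)),                     # dimensionality reduction
        ("clf", LogisticRegression(penalty="l2", C=1.0)),  # L2 regularization against overfitting
    ])

    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print("mean AUC: %.3f" % scores.mean())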

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, San Francisco, CA

Data Scientist

Responsibilities:

  • Gathered, analyzed, documented and translated application requirements into data models, supported standardization of documentation and the adoption of standards and practices related to data and applications.
  • Queried and aggregated data from Amazon Redshift to get the sample dataset.
  • Identified patterns, data quality issues, and leveraged insights by communicating with BI team.
  • In the preprocessing phase, used Pandas to remove or replace all missing data and applied feature engineering to eliminate unrelated features.
  • Balanced the dataset by over-sampling the minority label class and under-sampling the majority label class.
  • In the data exploration stage, used correlation analysis and graphical techniques to gain insights into the claims data.
  • Applied machine learning techniques to tap into new markets and new customers and presented recommendations to top management, resulting in a 5% increase in the customer base and a 9% increase in the customer portfolio.
  • Built multi-layer neural networks to implement deep learning using TensorFlow and Keras (see the sketch after this list).
  • Performed hyperparameter tuning via distributed cross-validation in Spark to speed up computation.
  • Exported trained models to Protobuf to be served by TensorFlow Serving and performed the integration with the client's application.
  • Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore cross-selling opportunities for financial products; the share of customers holding more than six products increased from 40% to 60%.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
  • Tested classification algorithms such as Logistic Regression, Gradient Boosting and Random Forest using Pandas and Scikit-learn and evaluated the performance.
  • Worked extensively with data governance team to maintain data models, Metadata and dictionaries.
  • Developed advanced models using multivariate regression, Logistic regression, Random forests, decision trees and clustering.
  • Applied predictive analysis and statistical modeling techniques to analyze customer behavior, offer customized products, and reduce delinquency and default rates; default rates fell from 5% to 2%.
  • Applied various machine learning algorithms and statistical models, including decision trees, regression models, neural networks, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
  • Implemented, tuned and tested the model on AWS EC2 with the best algorithm and parameters.
  • Set up a data preprocessing pipeline to guarantee consistency between the training data and newly arriving data.
  • Deployed the model on AWS Lambda and collaborated with the development team to build the business solutions.
  • Collected feedback after deployment and retrained the model to improve its performance.
  • Discovered flaws in the methodology being used to calculate weather peril zone relativities; designed and implemented a 3D algorithm based on k-means clustering and Monte Carlo methods.
  • Observed groups of customers being neglected by the pricing algorithm; used hierarchical clustering to improve customer segmentation and increase profits by 6%.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
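
A minimal sketch of a multi-layer neural network in Keras, as referenced in the deep learning bullet above; the input width, layer sizes, and class weights are illustrative assumptions, and class_weight is shown as one alternative to the over-/under-sampling described earlier.

    # Hedged sketch: small feed-forward binary classifier in Keras.
    # Random arrays stand in for the real features and labels.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    X = np.random.rand(1000, 20).astype("float32")  # placeholder features
    y = np.random.randint(0, 2, size=(1000,))       # placeholder binary labels

    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(20,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # probability of the positive class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # class_weight is one way to handle label imbalance (weights assumed)
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2,
              class_weight={0: 1.0, 1: 3.0})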

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, Pittsburgh, PA

Data Scientist, Data Analyst

Responsibilities:

  • Performed statistical modeling with machine learning to derive insights from data, under the guidance of the Principal Data Scientist.
  • Data modeling with Pig, Hive, Impala.
  • Ingestion with Sqoop, Flume.
  • Used SVN to commit changes to the main EMM application trunk.
  • Understanding and implementation of text mining concepts, graph processing and semi structured and unstructured data processing.
  • Worked with Ajax API calls to communicate with Hadoop through an Impala connection, using SQL to render the required data; these API calls are similar to Microsoft Cognitive API calls.
  • Developed a strong command of Cloudera and HDP ecosystem components.
  • Used ElasticSearch (Big Data) to retrieve data into application as required.
  • Ran MapReduce programs on the cluster.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Developed scalable machine learning solutions within a distributed computation framework (e.g. Hadoop, Spark, Storm etc.).
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and the Twitter API; parsed JSON-formatted Twitter data and loaded it into a database.
  • Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the launched instances for specific applications to improve robustness.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Worked hands-on with SequenceFile, Avro, and HAR file formats and compression.
  • Used Hive to partition and bucket data (see the sketch after this list).
  • Wrote MapReduce programs with the Java API to cleanse structured and unstructured data.
  • Wrote Pig Scripts to perform ETL procedures on the data in HDFS.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.
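
An illustrative PySpark analogue of the Hive partitioning and bucketing work described above (the original work also used HiveQL and Java MapReduce directly); the database, table, and column names are hypothetical, and Spark 2.3+ with Hive support is assumed.

    # Hedged sketch: write a Hive table partitioned by date and
    # bucketed by customer id from PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partition-bucket-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("raw.transactions")  # hypothetical source table

    (df.write
       .partitionBy("txn_date")           # one directory per date partition
       .bucketBy(8, "customer_id")        # hash customer_id into 8 buckets
       .sortBy("customer_id")
       .saveAsTable("analytics.transactions_bucketed"))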

Environment: SQL Server, Oracle 9i, MS Office, Teradata 14.1, Informatica, ER Studio, XML, Business Objects, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, R Studio, Mahout, Java, Hive, AWS.

Confidential, Framingham, MA

Data Analyst, Data Scientist

Responsibilities:

  • Integrated data from multiple data sources and functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.
  • Leveraged SQL, Excel, and Tableau to manipulate, analyze, and present data.
  • Performed analyses of structured and unstructured data to solve multiple and/or complex business problems using advanced statistical techniques and mathematical analyses.
  • Developed advanced models using multivariate regression, Logistic regression, Random forests, decision trees and clustering.
  • Used Pandas, Numpy, Seaborn, Scikit-learn in Python for developing various machine learning algorithms.
  • Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data (see the sketch after this list).
  • Worked with distributed computing technologies (Apache Spark, Hive).
  • Applied predictive analysis and statistical modeling techniques to analyze customer behavior, offer customized products, and reduce delinquency and default rates; default rates fell from 5% to 2%.
  • Applied machine learning techniques to tap into new markets and new customers and presented recommendations to top management, resulting in a 5% increase in the customer base and a 9% increase in the customer portfolio.
  • Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore cross-selling opportunities for financial products; the share of customers holding more than six products increased from 40% to 60%.
  • Collaborated with business partners to understand their problems and goals, develop predictive modeling, statistical analysis, data reports and performance metrics.
  • Participated in the ongoing design and development of a consolidated data warehouse supporting key business metrics across the organization.
  • Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.
  • Developed dashboards and reports using Tableau.
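
A short sketch of the NLP-based text classification mentioned above, using a TF-IDF representation with a linear classifier; the example documents and labels are toy placeholders.

    # Hedged sketch: TF-IDF features + logistic regression for
    # sentiment-style text classification (toy data).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    docs = ["great service, very happy",
            "terrible delays, poor support",
            "quick response and helpful team",
            "worst experience, will not return"]
    labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
        ("clf", LogisticRegression()),
    ])
    pipe.fit(docs, labels)
    print(pipe.predict(["happy with the support"]))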

Confidential, Cherry Hill, NJ

ETL Developer

Responsibilities:

  • Involved in the full SDLC of a BI project, including data analysis, design, and development of the data warehouse environment.
  • Used Oracle Data Integrator Designer to develop processes for extracting, cleansing, transforming, integrating, and loading data into data warehouse database.
  • Developed and customized PL/SQL packages, procedures, functions, triggers, and reports using Oracle SQL Developer.
  • Responsible for designing, developing, and testing the ETL strategy to populate data from various source systems (flat files, Oracle); a minimal illustrative sketch follows this list.
  • Worked with the Business units to identify data quality rule requirements against identified anomalies.
  • Developed data mappings, joins, and queries, performed validation, and addressed/fixed data queries raised by the project team in a timely manner.
  • Worked closely with business analysts and interacted with business users to gather new business requirements and to accurately understand current requirements.
  • Created repositories, agents, contexts, and both physical and logical schemas in Topology Manager for all source and target schemas.
  • Performed data mapping and logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data within the Oracle database.
  • Installed and Setup ODI Master Repository, Work Repository, Execution Repository.
  • Used Topology Manager to manage the data describing the information system's physical and logical architecture.
  • Worked extensively with ODI Knowledge Modules (Reverse-Engineering, Loading, Integration, Check, Journalizing, and Service).
  • Created various procedures and variables.
  • Created ODI Packages, Jobs of various complexities and automated process data flow.
  • Configured and set up the ODI master repository, work repository, projects, models, sources, targets, packages, knowledge modules, interfaces, scenarios, filters, conditions, and metadata.
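
ODI itself is configured through its graphical tools, so the following is only a minimal Python analogue of the extract-transform-load flow described above; the connection strings, file names, and table names are hypothetical.

    # Hedged sketch: extract from a flat file and a source database,
    # transform with pandas, and load into a warehouse table.
    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("oracle+cx_oracle://user:pass@source-db")     # hypothetical
    target = create_engine("oracle+cx_oracle://user:pass@warehouse-db")  # hypothetical

    orders = pd.read_csv("orders.csv")                          # flat-file source
    customers = pd.read_sql("SELECT * FROM customers", source)  # database source

    # Transform: cleanse nulls and join the two sources
    orders = orders.dropna(subset=["customer_id"])
    merged = orders.merge(customers, on="customer_id", how="inner")

    # Load into the warehouse
    merged.to_sql("fact_orders", target, if_exists="append", index=False)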

Confidential, New York City, NY

Systems Analyst

Responsibilities:

  • Participated in the full life cycle of SQL database development; created conceptual, logical, and physical database models to support project requirements.
  • Designed, implemented, and maintained databases for the Corporate Data, Finance, and Operations business units.
  • Responsible for maintaining the integrity of the SQL database and reporting any issues to the database architect.
  • Assisted in creating and presenting informational reports to Management based on SQL data.
  • Built, managed, and maintained all project documentation, including Business Requirements Documents (BRDs), technical specifications, process flows, and client-specific user guides.
  • Developed and validated conceptual data models, including implementing logical and physical data mart data models.
  • Worked closely with clients and internal teams to elicit and document business and functional requirements.
  • Delivered projects on time using Agile and Waterfall methodologies.
