
Data Scientist Resume


Stamford, Connecticut


  • 9+ years of work experience in Data Science using R, SAS and Python. Expertise in analyzing data and building predictive models to help provide intelligent solutions.
  • Experience in working on Windows, Linux and UNIX platforms including programming and debugging skills in UNIX Shell Scripting.
  • Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14, Cosmos.
  • Defined job flows in the Hadoop environment using tools like Oozie for data scrubbing and processing.
  • Experience in Data migration from existing data stores to Hadoop.
  • Developed MapReduce programs to perform data transformation and analysis.
  • Experience in analyzing data with Hive and Pig using schema-on-read.
  • Created development environments in Amazon Web Services using services like VPC, ELB, EC2, ECS and RDS instances.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • Proficient in Data Science programming using R, Python and SQL.
  • Proficient in SQL, databases, data modeling, data warehousing, ETL and reporting tools.
  • Strong knowledge of NoSQL databases, including column-oriented stores like HBase and Cassandra as well as MongoDB, and their integration with Hadoop clusters.
  • Proficient in using AJAX for implementing dynamic Web Pages.
  • Natural Language Processing: Sentiment Analysis, LDA, Named Entity Recognition, POS tagging, parsing, text summarization, information extraction, Word2Vec, TF-IDF, TDM, Text Classification, Word Cloud, n-grams, bi-grams.
  • Statistical Techniques: Logistic Regression, Linear Regression, Decision Tree, Cluster Analysis, Time Series Analysis, PCA, TOH, EDA, Univariate/Multivariate Analysis.
  • Experienced in the Web Services approach for Service-Oriented Architecture (SOA), SOAP programming, WSDL, and XML technologies (DTD, SAX and XSLT).
  • Strong development skills in Java, J2EE, JDBC, JSP, Servlets, C, R, Python, HTML, XML, XSL, JavaScript and SQL.
  • Worked with several popular relational database management systems including IBM DB2, Oracle and MS SQL Server, and NoSQL databases like MongoDB.
  • Machine Learning (Supervised and Unsupervised Techniques).
  • Experienced in performing Business Intelligence & Reporting and Data Analysis & Reporting to business owners, defining key metrics & business trends for partner & management teams.
  • Tech-savvy professional with expertise in requirement gathering for statistical & predictive analysis enhancement while ensuring the optimal resolutions are achieved; insightful knowledge of business process analysis & design and process optimization across various technological solutions.
  • Capable of modeling data for business process management and executing quantitative analysis that translates data into actionable insights and drives data-driven decision-making.
  • Extensive knowledge of building & maintaining data warehouses, working with disparate data sources & transforming them into a unified solution; record of liaising & coordinating with respective teams for gathering and analyzing requirements for the enterprise-wide data warehouse.
  • Subject matter expertise in using quantitative approaches, creative data analysis, algorithm development and modeling to create assumptions based on historical data.
  • Key skill areas: Data Science/Data Analytics, R & Python Methodologies, Data Mining & Analysis, Machine Learning, Client Engagement, Project Coordination & Execution, Cross-functional Coordination, Soft Skills.


Data Analytics Tools: Python (numpy, scipy, pandas, Gensim, Keras), R (Caret, Weka, ggplot), MATLAB.

Analysis & Modelling Tools: Erwin, Sybase PowerDesigner, Oracle Designer, Rational Rose, ER/Studio, TOAD, MS Visio, SAS, Django, Flask, pip, NPM, Node.js, Spring MVC.

Data Visualization: Tableau, Visualization packages, Microsoft Office.

Machine Learning: Simple Linear Regression, Multivariate Linear Regression, Regression, Classification, Clustering, Association, Polynomial Regression, Decision Trees, Random Forest, Logistic Regression, Softmax, K-NN, K-Means, Kernel SVM, Gradient Descent, Backprop, Feed Forward ANN, CNN, RNN and Word2Vec.

Machine Learning Frameworks: Spark ML, Kafka, Spark MLlib, Scikit-Learn & NLTK.

Big Data Tools: Hadoop, MapReduce, Sqoop, Pig, Hive, NoSQL, Spark, Apache Kafka, Shiny, YARN, Data Frames, pandas, ggplot2, Sklearn, Theano, CUDA, Azure, HDInsight, etc.

ETL Tools: Informatica PowerCenter, DataStage 7.5, Ab Initio, Talend.

OLAP Tools: MS SQL Analysis Manager, DB2 OLAP, Cognos PowerPlay.

Programming Languages: SQL, PL/SQL, T-SQL, XML, HTML, UNIX Shell Scripting, Microsoft SQL Server, Oracle PLSQL, Python, Scala, C, C++, AWK, JavaScript.

R Package: dplyr, sqldf, data.table, randomForest, gbm, caret, elastic net and all sorts of machine learning packages.

Databases: Oracle, Teradata, DB2 UDB, MS SQL Server, Netezza, Sybase ASE, Informix, AWS RDS, Cassandra, MongoDB and PostgreSQL.

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

Tools & Software: SAS/STAT, SAS/ETS, SAS E-Miner, SPSS, R, Advanced R, TOAD, MS Office, BTEQ, Teradata SQL Assistant.

Methodologies: Ralph Kimball, COBOL.

Version Control: Git, SVN.

Reporting Tools: Business Objects XI R2/6.5/5.0/5.1, Cognos Impromptu 7.0/6.0/5.0, Informatica Analytics Delivery Platform, MicroStrategy, SSRS, Tableau.

Operating Systems: Windows 2007/8, UNIX (Sun-Solaris, HP-UX), Windows NT/XP/Vista, MS-DOS.


Confidential, Stamford, Connecticut

Data Scientist


  • Developed a recommender system using the Matchbox recommender in Azure ML to assign the top 5 Senior Agents to Agents seeking help regarding a topic. This facilitated effective query handling and increased operational efficiency by 40%.
  • Performed statistical modeling and developed a model using hierarchical clustering and logistic regression to identify employees who would need help. This reduced average call handle time by 30% and enhanced customer satisfaction.
  • Achieved an accuracy of more than 90% for the predictive models for each of the projects and presented the results to the clients. Used ensembles across all models for client delivery on a monthly basis.
  • Built proficiency in the rare disease space and generated revenue growth of at least 7 million dollars for each of the clients.
  • Identified the leading indicators and important variables based on claims, physician and demographic level data. Used dimension reduction based on mean decrease in accuracy and PCA, and checked for collinearity, to reduce the number of variables from 6,500 to 20.
  • Experienced in working with high-dimensional claims and third-party data sets (274 million rows and 6,500 columns).
  • Built efficient SQL queries for data mining, data preparation and cleaning.
  • Ran chi-square tests to compare values between groups of severe (STEMI) and non-severe (NSTEMI) heart attacks based on age, gender, ethnicity, geographic location, insurance method, BMI, in-hospital procedures, etc. Conducted ANOVA to compare values between different groups and within levels. Used the Wilcoxon test to compare medians between groups and calculated the risk ratio between groups.
  • Managed a 6-member team to build predictive models, conduct statistical analysis and define KPIs for the patient journey (demographics, co-morbidity, payer, physician, line of therapy analysis) to help clients make decisions.
  • Designed appropriate reports, visualizations and written analyses for clients using R, MS Excel and PowerPoint.
  • Extracted meaningful analyses, interpreted raw data, conducted quality assurance and provided meaningful conclusions and recommendations to the clients based on the data results.
  • Conducted training and knowledge-sharing sessions for offshore and onsite team members and interns on various analytical, statistical testing and machine learning concepts and tools.
  • Performed social network analysis and topic modeling in R on employee chat data, and developed Sankey plots to understand the communication paths, the strength of relations between Agents and the topics frequently discussed between them.
  • Analyzed employee behavior and performance data, and developed Shiny dashboards to evaluate team preparedness through metrics, which helped evaluate leadership skills, agent experiences, agent behavior and customer sentiments.
  • Developed SQL procedures to synchronize the dynamic data generated from GTID systems with the Azure SQL Server.
  • Created intelligent benchmarks for claims KPIs using machine learning to reduce the noise in the existing alert framework.
  • Performed time series forecasting using a combination of methodologies to forecast future KPI values with dynamic tolerance limits based on the historical pattern.
  • Automated processes using Python/R scripts with an Oracle database to generate and write the results to the production environment on a weekly basis.
  • Intelligent matching of truck delay data & work order data
  • Root cause analysis using text mining of work order descriptions to find the reason behind machine breakdown and the failed part(s) involved
  • Sequence mining to identify pattern of machine breakdown
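The group comparisons listed above (chi-square tests and risk ratios between two patient groups) can be sketched for a single 2x2 table. This is a minimal Python illustration under stated assumptions, not the original analysis: the counts are invented, and the function names `chi_square_2x2` and `risk_ratio` are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected count for each cell = row total * column total / grand total.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def risk_ratio(a, b, c, d):
    """Risk of the outcome in group 1 divided by the risk in group 2.
    Rows are groups; the first column counts 'outcome present'."""
    return (a / (a + b)) / (c / (c + d))


# Made-up counts: group 1 has 30 of 100 with the outcome, group 2 has 15 of 100.
stat = chi_square_2x2(30, 70, 15, 85)
rr = risk_ratio(30, 70, 15, 85)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
significant = stat > 3.841
print(round(stat, 3), rr, significant)
```

In practice a statistics library would normally be used instead of hand-rolled formulas; the sketch only shows what the two quantities measure.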

Environment: Erwin r9.0, Informatica 9.0, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, Rational Rose.

Confidential, Chicago, Illinois

Data Scientist


  • Delivered machine learning projects based on Python, SQL, Spark and SAS advanced programming. Performed exploratory data analysis, data visualization and feature selection.
  • Applied machine learning algorithms, including random forest and boosted trees, SVM, SGD, neural networks, and deep learning using CNTK and TensorFlow.
  • Performed data analysis, natural language processing, statistical analysis, generated reports, listings and graphs.
  • Performed big data analytics with Hadoop, HiveQL, Spark RDD and Spark SQL.
  • Tested Python/SAS on AWS cloud service and CNTK modeling on MS-Azure cloud service.
  • Built prediction models of major subsurface properties for underground image, geologic interpretation and drilling decisions.
  • Utilized advanced methods of big data analytics, machine learning, artificial intelligence, wave equation modeling, and statistical analysis. Provided executive summaries on oil/gas seismic data and well profiles, and conducted predictive analyses and data mining to support interpretation and operations.
  • Applied a cross-correlation based data analysis method in Python and Matlab on multiple offset wells to help predict the models and pore pressure slightly ahead for real-time drilling. Performed big data modeling incorporating seismic, rock physics, statistical analysis, well logs and geological information into the 'beyond image'.
  • Used Python to develop, operationalize and productionize machine learning models with significant impact on geological pattern identification and subsurface model prediction. Analyzed seismic and log data with sub-group analysis (classification, clustering) and model prediction methods (regression, decision tree, genetic programming, etc.).
  • Used SAS statistical regression methods and SAS/REG polynomial simulation in Excel to simulate the anisotropic trend as 1D depth functions. Validated the simulated function by image quality of depth migration.
  • Tested the migrated data processing system on Google Cloud with velocity model updating tasks.
  • Used ETL to convert unstructured data to structured data and import the data into Hadoop HDFS. Utilized MapR as a low-risk big data solution to build a digital oilfield. Efficiently integrated and analyzed the data to increase drilling performance and interpretation quality. Analyzed sensor and well log data in HDFS with HiveQL and prepared it for prediction learning models.
  • Constantly monitored the data and models to identify the scope of improvement in the processing and business. Manipulated and prepared the data for data visualization and report generation. Performed data analysis, statistical analysis, generated reports, listings and graphs.
  • Co-leader of mathematics community 2015, Schlumberger Eureka.
  • Accomplished customer segmentation using the K-means algorithm in R, based on behavioral and demographic tendencies, to improve campaigning strategies. This helped reduce marketing expenses by 10% and helped boost the client’s revenue.
  • Built customer lifetime value prediction model using historical telecom data in SAS to better serve high priority customers through loyalty bonus, personalized services and draft customer retention plans and strategies
  • Developed PL/SQL procedures and functions to automate billing operations, customer barring and number generation.
  • Redesigned the workflows of service requests and bulk service orders using UNIX cron jobs and PL/SQL procedures, thereby reducing order processing time; average slippages per month dropped by 40%.
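The K-means customer segmentation mentioned above was done in R; as a rough illustration of the technique only, here is a minimal Python sketch of Lloyd's algorithm. The customer tuples, the feature choice (monthly spend, visits per month) and the function name `kmeans` are assumptions made for the example, and the centroids are simply seeded with the first k points to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical customers: (monthly spend, visits per month).
customers = [(20, 1), (200, 9), (25, 2), (210, 8), (22, 1), (190, 10)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Real segmentation work would usually scale the features first and try several values of k; both are left out here for brevity.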

Environment: SQL Server 2008R2/2005 Enterprise, SSRS, SSIS, Crystal Reports, Hadoop, Windows Enterprise Server 2000, DTS, SQL Profiler, and Query Analyzer.

Confidential, Washington

Data Analyst


  • Responsible for applying machine learning techniques (regression/classification) to predict outcomes.
  • Responsible for design and development of advanced R/Python programs to prepare, transform and harmonize data sets in preparation for modeling.
  • Developed large data sets from structured and unstructured data. Performed data mining.
  • Partnered with modelers to develop data frame requirements for projects.
  • Performed Ad-hoc reporting/customer profiling, segmentation using R/Python.
  • Tracked various campaigns, generating customer profiling analysis and data manipulation.
  • Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
  • Analyzed large datasets to answer business questions by generating reports and outcomes.
  • Worked in a team of programmers and data analysts to develop insightful deliverables that support data-driven marketing strategies.
  • Executed SQL queries from R/Python on complex table configurations.
  • Retrieving data from database through SQL as per business requirements.
  • Create, maintain, modify and optimize SQL Server databases.
  • Manipulated data using Base SAS programming.
  • Adhered to best practices for project support and documentation.
  • Understood the business problem, built the hypothesis and validated it using the data.
  • Managed the reporting/dashboarding for the key metrics of the business.
  • Involved in data analysis using different analytic and modeling techniques.

Environment: Erwin r9.0, Informatica 9.0, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes MS Visio, Rational Rose, Requisite Pro, Hadoop, PL/SQL, etc.

Confidential, Princeton, NJ

Data Analyst


  • Performed statistical modeling with ML to bring insights from data under the guidance of the Principal Data Scientist.
  • Data modeling with Pig, Hive, Impala.
  • Ingestion with Sqoop, Flume.
  • Used SVN to commit the changes into the main EMM application trunk.
  • Understanding and implementation of text mining concepts, graph processing, and semi-structured and unstructured data processing.
  • Worked with Ajax API calls to communicate with Hadoop through an Impala connection and SQL to render the required data. These API calls are similar to Microsoft Cognitive API calls.
  • Good grip on Cloudera and HDP ecosystem components.
  • Used Elasticsearch (Big Data) to retrieve data into the application as required.
  • Ran MapReduce programs on the cluster.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and uploaded it to the database.
  • Launched Amazon EC2 cloud instances using Amazon Images (Linux/Ubuntu) and configured the launched instances with respect to specific applications.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Hands-on experience working with Sequence files, Avro and HAR file formats and compression.
  • Used Hive to partition and bucket data.
  • Experience in writing MapReduce programs with the Java API to cleanse structured and unstructured data.
  • Wrote Pig scripts to perform ETL procedures on the data in HDFS.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.

Environment: SQL/Server, Oracle 9i, MS-Office, Teradata, Informatica, ER Studio, XML, Business Objects, HDFS, Teradata 14.1, JSON, HADOOP (HDFS), MapReduce, PIG, Spark, R Studio, MAHOUT, JAVA, HIVE, AWS.


Data Architect/Data Modeler


  • Worked with large amounts of structured and unstructured data.
  • Knowledge in Machine Learning concepts (Generalized Linear models, Regularization, Random Forest, Time Series models, etc.)
  • Worked in Business Intelligence tools and visualization tools such as Business Objects, Tableau, ChartIO, etc.
  • Deployed GUI pages by using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript, and AJAX.
  • Configured the project on WebSphere 6.1 application servers.
  • Implemented the online application using Core Java, JDBC, JSP, Servlets and EJB 1.1, Web Services, SOAP, WSDL.
  • Handled end-to-end project from data discovery to model deployment.
  • Monitored the automated loading processes.
  • Exchanged health care information with other systems using Web Services, with the help of SOAP, WSDL and JAX-RPC.
  • Used Singleton, Factory and DAO design patterns based on the application requirements.
  • Used SAX and DOM parsers to parse the raw XML documents.
  • Used RAD as Development IDE for web applications.
  • Preparing and executing Unit test cases
  • Used the Log4J logging framework to write log messages with various levels.
  • Involved in fixing bugs and minor enhancements for the front-end modules.
  • Used Microsoft Visio and Rational Rose for designing the use case diagrams, class model, sequence diagrams and activity diagrams for the SDLC process of the application.
  • Supported the testing team with maintenance for System testing/Integration/UAT.
  • Ensured quality in the deliverables.
  • Conducted design reviews and technical reviews with other project stakeholders.

Environment: R 3.0, Erwin 9.5, Tableau 8.0, MDM, QlikView, MLLib, PL/SQL, HDFS, Teradata 14.1, JSON, HADOOP (HDFS), MapReduce, PIG, Spark, R Studio, MAHOUT, JAVA, HIVE, AWS.


Data Analyst/Data Modeler


  • Developed Internet traffic scoring platform for ad networks, advertisers and publishers (rule engine, site scoring, keyword scoring, lift measurement, linkage analysis).
  • Responsible for defining the key identifiers for each mapping/interface.
  • Clients include eBay, Click Forensics, Cars.com, Turn.com, Microsoft, and LookSmart.
  • Implementation of Metadata Repository, Maintaining Data Quality, Data Cleanup procedures, Transformations, Data Standards, Data Governance program, Scripts, Stored Procedures, triggers and execution of test plans.
  • Automated bidding for advertiser campaigns based either on keyword or category (run-of-site) bidding.
  • Created multimillion bid keyword lists using extensive web crawling. Identified metrics to measure the quality of each list (yield or coverage, volume, and keyword average financial value).
  • Updated the Enterprise Metadata Library with any changes or updates.
  • Documented data quality and traceability for each source interface.
  • Established standards of procedures.
  • Designed the architecture for one of the first Analytics 3.0 online platforms: all-purpose scoring, with on-demand, SaaS, API services. Currently under implementation.
  • Web crawling and text mining techniques to score referral domains, generate keyword taxonomies, and assess commercial value of bid keywords.
  • Developed new hybrid statistical and data mining techniques known as hidden decision trees and hidden forests.
  • Reverse engineering of keyword pricing algorithms in the context of pay-per-click arbitrage.
  • Performed data quality checks in Talend Open Studio.
  • Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor systems.

Environment: Erwin r7.0, SQL Server 2000/2005, Windows XP/NT/2000, Oracle 8i/9i, MS-DTS, UML, UAT, SQL Loader, OOD, OLTP, PL/SQL, MS Visio, Informatica.
