
Data Scientist Resume

Fremont, CA


  • More than 8 years of experience in Machine Learning, Data Mining with large structured and unstructured datasets, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
  • Independently led analytics, visualization, modeling, and reporting; ran business-critical meetings with clients; managed SLAs; and provided actionable insights to managers and C-level executives.
  • Built classification and forecasting models, process automation, text mining, sentiment analysis, statistical models, risk analysis, platform integrations, optimization models, models to improve user experience, and A/B tests using R, SAS, Python, SPSS, SAS Enterprise Miner, EViews, Tableau, etc.
  • Good knowledge and understanding of web design languages such as HTML, CSS, and JavaScript.
  • Experience in verifying the integration between databases and the user interface.
  • Developed complex database objects such as Stored Procedures, Functions, and Packages using SQL and PL/SQL.
  • Excellent experience in setting up, configuring, and monitoring Hadoop clusters on the Cloudera and Hortonworks distributions, and in the wider Hadoop ecosystem.
  • Very good knowledge of AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
  • Experience in interacting with business users to analyze business processes and transform requirements into screens, performing ETL, documenting, and rolling out deliverables.
  • Strong Experience in Creating, Configuring, Deploying, and Testing SSIS Packages.
  • Worked extensively on extracting, transforming, and loading data from Oracle, DB2, Access, Excel, flat files, and XML using DTS and SSIS.
  • Advanced experience with Python (2.x, 3.x) and its libraries such as NumPy, Pandas, Scikit-learn, XGBoost, LightGBM, Keras, Matplotlib, and Seaborn.
  • Strong Knowledge in Statistical methodologies such as Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), Monte Carlo Sampling and Time Series Analysis.
  • Extensive experience in using Flume to transfer log data files to Hadoop Distributed File System (HDFS).
  • Expertise in using tools such as Sqoop and Kafka to ingest data into Hadoop.
  • Expertise in deploying code through web application servers such as WebSphere, WebLogic, and Apache Tomcat on AWS, and in using AMIs, IAM, EC2 instances, S3, and other AWS resources.
  • Proficient in Big Data, Hadoop, Hive, MapReduce, Pig and NoSQL databases like MongoDB, HBase, Cassandra.
  • Experienced in writing and optimizing SQL queries in Oracle, SQL Server, DB2, PostgreSQL, Netezza, and Teradata.
  • Strong experience in maintaining and upgrading PostgreSQL, Oracle, and Big Data databases.
  • Experience in installing, configuring, and maintaining databases such as PostgreSQL, Oracle, and Big Data HDFS systems.
  • Hands-on experience with clustering algorithms such as K-means and K-medoids, and with predictive and descriptive algorithms.
  • Expertise in Model Development, Data Mining, Predictive Modeling, Descriptive Modeling, Data Visualization, Data Cleaning and Management, and Database Management.
  • Expertise in applying data mining techniques and optimization techniques in B2B and B2C industries and proficient in Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.
  • Used DFAST Modeling and Solutions for expected loss calculations and viewing the results in a dashboard for further insights.
  • Experienced in designing star schemas (identification of facts, measures, and dimensions), snowflake schemas for Data Warehouses, and ODS architecture using tools such as Erwin Data Modeler, PowerDesigner, ER/Studio, and Microsoft Visio.
  • Expertise in Excel macros, pivot tables, VLOOKUP, and other advanced functions; expert R user with knowledge of the statistical programming language SAS.
  • Excellent experience with Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad (MLOAD), TPump, FastLoad, and FastExport.
  • Experience in Data Mining, Text Mining, Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
  • Experienced in designing architecture for modeling a Data Warehouse using tools such as Erwin, PowerDesigner, and ER/Studio.


Skills: SQL, ETL (Extract, Transform, and Load), MS SQL Server, R, Python, Machine Learning, Data Science, NLP, Algorithms.

Languages: C, C++, WSDL, R/RStudio, SAS Enterprise Guide, SAS R, R (Caret, Weka, ggplot), Perl, MATLAB, Java, Scala, Python (NumPy, SciPy, Pandas).

IDEs: PyCharm, Emacs, Eclipse, NetBeans, MS Visual Studio, Sublime Text, SoapUI

Application Servers: WebLogic, WebSphere, JBoss, Tomcat.

Cloud Computing Tools: Amazon AWS

Databases: Microsoft SQL Server 2008, MySQL 4.x/5.x, Oracle 10g/11g/12c, DB2, Teradata, Netezza

NoSQL Databases: HBase, Cassandra, MongoDB, MariaDB

Build Tools: Jenkins, Maven, Ant, Toad, SQL*Loader, RTC, RSA, Control-M, Oozie, Hue, SoapUI

Business Intelligence Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Development Tools: Microsoft SQL Studio, Eclipse, NetBeans, IntelliJ

Database Tools: SQL Server Data Tools, Visual Studio, Spotlight, SQL Server Management Studio, Query Analyzer, Enterprise Manager, JIRA, Profiler

ETL Tools: Informatica PowerCenter, SSIS

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.

Operating Systems: All versions of UNIX, Windows, Linux, Mac OS, Sun Solaris.


Data Scientist

Confidential, Fremont, CA


  • Developed Machine Learning, Statistical Analysis, and Data Visualization applications for challenging data-processing problems in the sustainability and biomedical domains.
  • Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with Python's NLTK module to develop applications for automated customer response.
  • Built predictive models using SAS, SPSS, R, and Python.
  • Applied concepts of probability, distributions, and statistical inference to the given dataset to surface findings using comparisons, t-tests, F-tests, R-squared, p-values, etc.
  • Applied linear regression, multiple regression, the ordinary least squares method, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data with the help of the Scikit-learn, SciPy, NumPy, and Pandas modules of Python.
  • Applied clustering algorithms such as hierarchical clustering and K-means with the help of Scikit-learn and SciPy.
  • Worked on the development of data warehouse, data lake, and ETL systems using relational and non-relational tools such as SQL and NoSQL databases.
  • Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationships, including potential causal relationships, between attributes of the dataset.
  • Worked in large-scale data environments such as Hadoop and MapReduce, with a working understanding of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
  • Interfaced with a large-scale database system through an ETL server for data extraction and preparation.

Environment: Machine learning, AWS, MS Azure, Cassandra, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.

Data Scientist

Confidential, Philadelphia, PA


  • Analyzed and solved business problems and found patterns and insights within structured and unstructured data.
  • Used K-means clustering to identify outliers and to group unlabeled data.
  • Cleaned, analyzed, and selected data to gauge customer experience.
  • Used algorithms and programming to efficiently go through large datasets and apply treatments, filters, and conditions as needed.
  • Created meaningful data visualizations to communicate findings and relate them back to how they create business impact.
  • Used a diverse array of tools, such as R, SAS, MATLAB, and Tableau, to deliver insights.
  • Worked on Natural Language Processing with Python's NLTK module for application development and automated customer response.
  • Used statistical Natural Language Processing for sentiment analysis, to mine unstructured data, and to create insights.
  • During data exploration, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
  • Worked on feature engineering such as feature creation, feature scaling, and one-hot encoding with Scikit-learn.
  • Produced forecasts using exponential smoothing, ARIMA models, transfer function models, and other statistical algorithms and analyses.
  • Conducted studies, produced rapid plots, and used advanced data mining and statistical modeling techniques to build a solution that optimizes the quality and performance of data.
  • Conducted analyses to support Methods programs.
  • Worked on trend analysis, identification of gaps for improvements in coverage, and data integration.
  • Performed data profiling, data cleaning, and data quality checks.
  • Analyzed data from complex relational databases for various analyses and requests, using Access, SQL, Excel, and statistical packages such as SAS and Minitab, to meet business goals and objectives.
  • Used SAS for multivariate analysis, business intelligence, data management and predictive analytics.
  • Worked with cross-functional teams to design, implement, and test new consumer and audience measurement methodologies.
  • Used a combination of data visualization tools such as Tableau and communication tools to clearly and effectively explain problems, root causes, and recommendations.
  • Assisted in the designing and testing of data collection methodologies for Nielsen panels and surveys.
  • Took part in identifying and implementing methods and best practices to improve respondent cooperation and optimize data collection approaches.
  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Developed SQL procedures to synchronize the dynamic data generated from GTID systems with the Azure SQL Server.
  • Automated processes using Python/R scripts against an Oracle database to generate and write results in the production environment on a weekly basis.
  • Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
  • Used Data Quality validation techniques to validate Critical Data elements (CDE) and identified various anomalies.
  • Performed Data Validation / Data Reconciliation between the disparate source and target systems for various projects.
  • Interacted with the Business teams and Project Managers to clearly articulate the anomalies, issues, and findings during data validation.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Cognos.

Environment: SAS, R, MLlib, Python, Data Governance, MDM, MATLAB, Tableau, Azure SQL Server, NLTK.

Data Scientist/Data Modeler

Confidential, Hartford, CT


  • Provided architectural leadership in shaping strategic business technology projects, with an emphasis on application architecture.
  • Used domain knowledge and application portfolio knowledge to play a key role in defining the future state of large business technology programs.
  • Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop machine learning models, applying algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
  • Conducted studies, produced rapid plots, and used advanced data mining and statistical modeling techniques to build a solution that optimizes the quality and performance of data.
  • Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
  • Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models using best-in-class modeling techniques.
  • Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of database.
  • Worked on customer segmentation using an unsupervised learning technique - clustering.
  • Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, and BTEQ.
  • Used Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
  • Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
  • Designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for a client.

Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.

Data Analyst/Data Modeler



  • Performed data profiling and analysis on the different source systems required for Customer Master.
  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Used T-SQL queries to pull the data from disparate systems and Data warehouse in different environments.
  • Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
  • Presented DQ analysis reports and scorecards on all validated data elements to the business teams and stakeholders.
  • Used Data Quality validation techniques to validate Critical Data elements (CDE) and identified various anomalies.
  • Developed clinical NLP methods that ingest large unstructured clinical data sets, separate signal from noise, and provide personalized insights at the patient level that directly improve our analytics platform.
  • Used NLP methods for information extraction, topic modeling, parsing, and relationship extraction.
  • Worked with the NLTK library for NLP data processing and finding the patterns.
  • Extensively used open source tools, RStudio (R) and Spyder (Python), for statistical analysis and building machine learning models.
  • Involved in defining the source-to-target data mappings, business rules, and data definitions.
  • Performed Data Validation / Data Reconciliation between the disparate source and target systems (Salesforce, Cisco-UIC, Cognos, Data Warehouse) for various projects.
  • Interacted with the Business teams and Project Managers to clearly articulate the anomalies, issues, and findings during data validation.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Cognos.
  • Extracted data from different databases as per the business requirements using SQL Server Management Studio.
  • Worked with the Data Governance group to identify, classify, and define each assigned Critical Data Element (CDE) and ensure that each element has a clear and unambiguous definition.
  • Analyzed data lineage processes and documentation for the CDEs to identify vulnerable points, control gaps, data quality issues, and overall lack of data governance.
  • Proposed data checks and standard operating procedures on the source systems to enhance data quality.
  • Reviewed various Project Management documents such as Business Requirements document, Functional Specification document and suggested changes to ensure it complies with policies and standards.
  • Worked with the Data Governance group in creating a custom data dictionary template to be used across the various business lines.
  • Worked with data stewards to ensure awareness of data quality standards and data requirements.
  • Linked data lineage to data quality and business glossary work within the overall data governance program.
  • Managed communication and training with data owners/stewards to ensure awareness of policies and standards.
  • Gathered requirements by working with the business users on Business Glossary, Data Dictionary, and Reference data.
  • Generated weekly and monthly reports for various business users according to the business requirements. Manipulated and mined data from database tables (Redshift, Oracle, Data Warehouse).
  • Created statistical models, both distributed and standalone, to build various diagnostic, predictive, and prescriptive solutions.
  • Interfaced with other technology teams to extract, transform, and load (ETL) data from a wide variety of data sources.

Environment: Data Governance, SQL Server, ETL, MS Office Suite - Excel (Pivot, VLOOKUP), DB2, R, Python, Visio, HP ALM, Agile, Azure, Data Quality, Tableau and Reference Data Management.

Jr. Data Analyst



  • Worked with internal architects, assisting in the development of current and target state data architectures.
  • Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
  • Involved in defining the business/transformation rules applied for sales and service data.
  • Implemented the Metadata Repository, transformations, data quality maintenance, data standards, the Data Governance program, scripts, stored procedures, and triggers, and executed test plans.
  • Defined the list codes and code conversions between the source systems and the data mart.
  • Involved in defining the source-to-target data mappings, business rules, and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Remained knowledgeable in all areas of business operations in order to identify system needs and requirements.
  • Performed data quality checks in Talend Open Studio.
  • Updated the Enterprise Metadata Library with any changes or updates.

Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.
