
Data Scientist Resume

Pleasanton, CA


  • 8+ years of IT industry experience encompassing machine learning, data mining with large datasets of structured and unstructured data, data acquisition, data validation, predictive modeling, and data visualization.
  • Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python and Tableau.
  • Experience with advanced SAS programming techniques, such as PROC SQL (JOIN/UNION), PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
  • Integration Architect & Data Scientist experience in Analytics, BigData, BPM, SOA, ETL and Cloud technologies.
  • Good experience in integrating Google Cloud Vision API detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.
  • Good knowledge of Gazebo and Point Cloud Library.
  • Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
  • Experience in foundational machine learning models and concepts: regression, random forest, boosting, GBM, NNs, HMMs, CRFs, MRFs, deep learning.
  • Proficiency in statistical and other tools/languages - R, Python, C, C++, Java, SQL, UNIX, the QlikView data visualization tool and the Anaplan forecasting tool.
  • Proficient in the integration of various data sources with multiple relational databases like Oracle, MS SQL Server, DB2, Teradata and flat files into the staging area, ODS, Data Warehouse and Data Mart.
  • Ensure accurate loading of 110, 210, 310, 214, and other transportation data via BizTalk
  • Experience in extracting data to create value-added datasets using Python, R, SAS, Azure and SQL, analyzing behaviour to target specific sets of customers and uncover hidden insights in the data that support project objectives.
  • Worked with NoSQL databases including HBase, Cassandra and MongoDB.
  • Extensively worked on statistical analysis tools and adept at writing code in Advanced Excel, R, MATLAB, Python.
  • Good knowledge of TensorFlow.
  • Implemented deep learning models and numerical computation with data flow graphs using TensorFlow.
  • Good experience in text mining to convert words and phrases in unstructured data into numerical values.
  • Light mapping of customized data forms
  • Worked with complex applications such as R, Stata, Scala, Perl, Linear, SAS and SPSS to develop neural networks and cluster analyses.
  • Experienced in the full software life cycle under SDLC, Agile and Scrum methodologies.
  • Experience in designing stunning visualizations using Tableau software and publishing and presenting dashboards, Storyline on web and desktop platforms.
  • Designed the Physical Data Architecture of new system engines.
  • Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks and Principal Component Analysis, with good knowledge of Recommender Systems.
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means.
  • Developed Logical Data Architecture with adherence to Enterprise Architecture.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • Adept in statistical programming languages like R and Python, and in Big Data technologies like Hadoop 2, Hive, HDFS, MapReduce, and Spark.
  • Experienced in Spark 2.1, Spark SQL and PySpark.
  • Skilled in using dplyr in R and pandas in Python for exploratory data analysis.
  • Experience working with data modeling tools like Erwin, PowerDesigner and ER/Studio.
  • Good understanding of Teradata SQL Assistant, Teradata Administrator and data load/export utilities like BTEQ, FastLoad, MultiLoad, FastExport.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Worked with and extracted data from various database sources like Oracle, SQL Server, DB2, and Teradata.
  • Proficient knowledge in statistics, mathematics, machine learning, recommendation algorithms and analytics with excellent understanding of business operations and analytics tools for effective analysis of data.


Languages: Java 8, Python, R

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, rpy2.

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modelling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner, Text Mining, and Google Cloud Vision.

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub.

Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.

Operating System: Windows, Linux, Unix, Macintosh HD, Red Hat.


Confidential, Pleasanton, CA

Data Scientist


  • Perform Data Profiling to learn about user behavior and merge data from multiple data sources.
  • Implemented big data processing applications to collect, clean and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive, and HBase.
  • Designing and developing various machine learning frameworks using Python, R, and Matlab.
  • Integrate R into MicroStrategy to expose metrics determined by more sophisticated and detailed models than natively available in the tool.
  • Independently coded new programs and designed tables to load and test the programs effectively for the given POCs using Big Data/Hadoop.
  • Architected Big Data solutions for projects and proposals using Hadoop, Spark, ELK Stack, Kafka and TensorFlow.
  • Correct minor data errors that prevent loading of EDI files
  • Worked on clustering and classification of data using machine learning algorithms. Used TensorFlow to perform sentiment and time series analysis.
  • Develop documents and dashboards of predictions in MicroStrategy and present them to the Business Intelligence team.
  • Used the Cloud Vision API to integrate vision detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.
  • Implemented text mining to convert words and phrases in unstructured data into numerical values.
  • Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files and Big Data.
  • Good knowledge of Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, SecondaryNameNode, and MapReduce concepts.
  • As Architect, delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
  • Track and enable communication across multiple departments to make sure all parties are as educated about potential issues as they can be.
  • Utilized OpenCV for human face recognition and tackled the challenge of long running time on a personal computer for face recognition.
  • Programmed a utility in Python that used multiple packages (scipy, numpy, pandas)
  • Implemented classification using supervised algorithms like Logistic Regression, Decision Trees, KNN and Naive Bayes.
  • Gained knowledge of OpenCV and applied it to identify red-colored objects with a drone's camera.
  • Used Teradata 15 utilities such as FastExport and MLOAD for data migration/ETL tasks from OLTP source systems to OLAP target systems.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Collaborate with data engineers to implement ETL processes; write and optimize SQL queries to perform data extraction from Cloud and merging from Oracle 12c.
  • Collect unstructured data from MongoDB 3.3 and complete data aggregation.
  • Perform data integrity checks, data cleaning, exploratory analysis and feature engineering using R 3.4.0.
  • Work with freight carriers to correct EDI issues as they arise
  • Conducted analysis to assess customer consumption behaviors and discover customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering.
  • Work on outlier identification with box plots and K-means clustering using pandas and NumPy.
  • Participate in feature engineering such as feature intersection generation, feature normalization and label encoding with scikit-learn preprocessing.
  • Use Python 3.0 (numpy, scipy, pandas, scikit-learn, seaborn, NLTK) and Spark 1.6 / 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Analyze data and perform data preparation by applying historical models to the data set in Azure ML.
  • Experienced in delivery, portfolio, team/career, vendor and program management. Competency in solution architecture, implementation and delivery of Big Data, data science analytics and DWH projects on Greenplum, Spark, Keras, Python and TensorFlow.
  • Coordinate the execution of A/B tests to measure the effectiveness of a personalized recommendation system.
  • Perform data visualization with Tableau 10 and generate dashboards to present the findings.
  • Recommend and evaluate marketing approaches based on quality analytics of customer consumption behavior.
  • Determine customer satisfaction and help enhance customer experience using NLP.
  • Work on text analytics, Naive Bayes, sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Use Git 2.6 to apply version control. Tracked changes in files and coordinated work on the files among multiple team members.

Environment: R, Matlab, MongoDB, exploratory analysis, feature engineering, K-Means Clustering, Hierarchical Clustering, Machine Learning, Python, Spark (MLlib, PySpark), Tableau, MicroStrategy, Git, Unix, SAS, TensorFlow, regression, logistic regression, Hadoop 2.7, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML, MapReduce and OpenCV.

Confidential, San Jose, CA

Data Scientist


  • Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn and NLTK in Python for developing various machine learning algorithms, and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, and KNN for data analysis.
  • Demonstrated experience in design and implementation of statistical models, predictive models, enterprise data models, metadata solutions and data life cycle management in both RDBMS and Big Data environments.
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Performed Source System Analysis, database design, data modeling for the warehouse layer using MLDM concepts and package layer using Dimensional modeling.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem)
  • Developed LINUX Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to Netezza database.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
  • Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
  • Hands-on experience with database design, relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF) and de-normalization of databases.
  • Worked on customer segmentation using an unsupervised learning technique - clustering.
  • Worked with various Teradata 15 tools and utilities like Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ and other Teradata utilities.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods including classification, regression and dimensionality reduction.
  • Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.

Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Cassandra, MapReduce, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, AWS.

Confidential - Boise, ID

Data Scientist


  • Developed applications of machine learning, statistical analysis and data visualization for challenging data processing problems in the sustainability and biomedical domains.
  • Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with the NLTK module of Python to develop an application for automated customer response.
  • Used predictive modeling with tools in SAS, SPSS, R, Python.
  • Applied concepts of probability, distribution and statistical inference to the given dataset to unearth interesting findings through the use of comparison, T-test, F-test, R-squared, P-value etc.
  • Applied linear regression, multiple regression, ordinary least squares, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, Poisson distribution, Bayes, Naive Bayes, function fitting etc. to data with the help of the scikit-learn, SciPy, NumPy and pandas modules of Python.
  • Applied clustering algorithms such as hierarchical and K-means clustering with the help of scikit-learn and SciPy.
  • Developed visualizations and dashboards using ggplot and Tableau.
  • Worked on development of data warehouse, data lake and ETL systems using relational and non-relational tools like SQL and NoSQL.
  • Built and analyzed datasets using R, SAS, Matlab and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationships, including causal relationships, between different attributes of the dataset.
  • Performed complex pattern recognition on financial time series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
  • Pipelined (ingest/clean/munge/transform) data for feature extraction toward downstream classification.
  • Used Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Expertise in Business Intelligence and data visualization using R and Tableau.
  • Expert in Agile and Scrum Process.
  • Validated macroeconomic data (e.g. BlackRock, Moody's etc.) and performed predictive analysis of world markets using key indicators in Python and machine learning concepts like regression, Bootstrap Aggregation and Random Forest.
  • Worked in large-scale data environments like Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes and the Hadoop Distributed File System (HDFS).
  • Interfaced with large scale database system through an ETL server for data extraction and preparation.
  • Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.

Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/Scipy/Numpy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.

Confidential - Long Beach, CA

Data Scientist/Data Modeller


  • Developing propensity models for Retail liability products to drive proactive campaigns.
  • Extraction and tabulation of data from multiple data sources using R, SAS.
  • Data cleansing, transformation and creating new variables using R.
  • Built predictive scorecards for cross-selling Car loan, Life Insurance, TD and RD.
  • Scoring predictive models as per regulatory requirements & ensuring deliverables with PSI.
  • Data modeling and formulation of statistical equations using advanced statistical forecasting techniques.
  • Provide guidance and mentoring to team members.
  • Arrange and chair Data Workshops with SMEs and related stakeholders to understand the required data catalogue.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Responsible for defining the functional requirement documents for each source to target interface.
  • Document, clarify, and communicate change requests with the requestor and coordinate with the development and testing team.
  • Work with users to identify the most appropriate source of record and profile the data required for sales and service.
  • Document the complete process flow to describe program development, logic, testing, and implementation, application integration, coding.
  • Design a Logical Data Model that fits and adopts the Teradata Financial Services Logical Data Model (FSLDM11) using the Erwin data modeler tool.
  • Present and approve the designed Logical Data Model in the Data Model Governance Committee (DMGC).
  • Identify the customer and account attributes required for MDM implementation from disparate sources and prepare detailed documentation.
  • Validated the machine learning classifiers using ROC Curves and Lift Charts.
  • Extracted data from HDFS and prepared data for exploratory analysis using data munging.

Environment: Erwin 8, Teradata 13, SQL Server 2008, Oracle 9i, SQL*Loader, PL/SQL, ODS, OLAP, OLTP, SSAS, Informatica PowerCenter 8.1.


Data Analyst


  • Used Microsoft Visio and Rational Rose to design the use case diagrams, class model, sequence diagrams, and activity diagrams for the SDLC process of the application.
  • Worked with other teams to analyze customers and marketing parameters.
  • Conducted Design reviews and Technical reviews with other project stakeholders.
  • Took part in the complete life cycle of the project, from requirements to production support.
  • Created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access and SQL to write and run various queries.
  • Used a traceability matrix to trace the requirements of the organization.
  • Recommended structural changes and enhancements to systems and databases.
  • Provided maintenance support in the testing team for system testing, integration testing and UAT.
  • Ensured quality in the deliverables.

Environment: UNIX, SQL, Oracle 10g, MS Office, MS Visio.


Data Analyst


  • Developed ETL processes for data conversions and construction of data warehouse using IBM InfoSphere DataStage.
  • Used Star Schema and designed Mappings between sources to operational staging targets.
  • Involved in defining the business/transformation rules applied for sales and service data.
  • Define the list codes and code conversions between the source systems and the data mart.
  • Provided On-call Support for the project and gave a knowledge transfer for the clients.
  • Used Rational Application Developer (RAD) for version control.
  • Developed transformations using jobs like Filter, Join, Lookup, Merge, Hashed file, Aggregator, Transformer and Dataset.
  • Worked with internal architects, assisting in the development of current and target state data architectures.
  • Coordinate with business users to provide an appropriate, effective and efficient way to design for new reporting needs, based on user requirements and the existing functionality.
  • Remain knowledgeable in all areas of business operations to identify systems needs and requirements.
  • Document the complete process flow to describe program development, logic, testing, and implementation, application integration, coding.

Environment: IBM Rational ClearCase & ClearQuest, IBM InfoSphere Metadata Workbench 8.7, IBM InfoSphere DataStage and QualityStage, IBM InfoSphere CDC version 6.5.1, XML files.
