Data Scientist Resume

St. Louis, MO


  • Over 8 years of experience as a Data Scientist/Data Analyst/Data Modeler with emphasis on Data Mapping and Data Validation in Data Warehousing environments.
  • Extensive experience with business intelligence (BI) tools and technologies such as OLAP, data warehousing, reporting and querying tools, data mining, and spreadsheets
  • Worked with a variety of Python modules such as requests, Boto, flake8, Flask, mock, and nose
  • Efficient in developing Logical and Physical Data models and organizing data per business requirements using Sybase PowerDesigner, Erwin, and ER Studio in both OLTP and OLAP applications
  • Strong understanding of when to use an ODS, a data mart, or a data warehouse.
  • Experienced in employing R Programming, MATLAB, SAS, Tableau, and SQL for data cleaning, data visualization, risk analysis and predictive analytics
  • Adept at using SAS Enterprise suite, R, Python, and Big Data related technologies including Hadoop, Hive, Pig, Sqoop, Cassandra, Oozie, Flume, MapReduce, and Cloudera Manager for the design of business intelligence applications
  • Ability to provide wing-to-wing analytic support including pulling data, preparing analysis, interpreting data, making strategic recommendations and presenting to client/product teams.
  • Hands-on experience with Machine Learning, Regression Analysis, Clustering, Boosting, Classification, Principal Component Analysis and Data Visualization Tools
  • Strong programming skills in a variety of languages such as Python and SQL.
  • Familiarity with Crystal Reports and SSRS for querying, reporting, analysis, and enterprise information management
  • Created many calculated columns and measures using DAX in Power BI based on report requirements and published Power BI reports to end users
  • Experienced with databases including Oracle, XML, DB2, Teradata 15/14, Netezza, SQL Server, Big Data, and NoSQL.
  • Worked with engineering teams to integrate algorithms and data into Return Path solutions
  • Worked closely with other data scientists to create data-driven products
  • Strong experience in Statistical Modeling/Machine Learning and Visualization Tools
  • Expert at Full SDLC processes involving Requirements Gathering, Source Data Analysis, Creating Data Models, Source-to-target data mapping, DDL generation, and performance tuning for data models.
  • Extensively used the Agile methodology as the Organization Standard to implement the Data Models
  • Experienced with machine learning tools and libraries such as Scikit-learn, R, Spark, and Weka
  • Hands-on experience with NLP and mining of structured, semi-structured, and unstructured data


Big Data/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Zookeeper, and Oozie

Languages: C, C++, HTML5, DHTML, WSDL, CSS3, XML, R/RStudio, SAS Enterprise Guide, SAS R, R (Caret, Weka, ggplot), Python (NumPy, SciPy, Pandas), SQL, PL/SQL, Pig Latin, HiveQL, Shell Scripting.

Cloud Computing Tools: Amazon AWS

Databases: Microsoft SQL Server 2008, MySQL 4.x/5.x, Oracle 10g, 11g, 12c, DB2, Teradata, Netezza

NoSQL Databases: HBase, Cassandra, MongoDB, MariaDB

Build Tools: Maven, ANT, Toad, SQL Loader, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Development Tools: Microsoft SQL Studio, Eclipse, NetBeans, IntelliJ

Development Methodologies: Agile/Scrum, Waterfall, UML, Design Patterns

Version Control Tools and Testing: API, Git, GitHub, SVN, and JUnit

ETL Tools: Informatica PowerCenter, SSIS

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.

Operating Systems: All versions of UNIX, Windows, LINUX, Macintosh HD, Sun Solaris


Confidential - St. Louis, MO.

Data Scientist


  • Review suspicious activity and complex fraud cases to help identify and resolve fraud risk trends and issues.
  • Clearly and thoroughly document investigation findings and conclusions.
  • Perform offline analysis of customer data to tune rules, expose patterns, research anomalies, reduce false positives, and build executive and project-level reports.
  • Identify meaningful insights from chargeback data. Interpret and communicate findings from analysis to engineers, product, and stakeholders.
  • Utilize Sqoop to ingest real-time data. Used the analytics libraries Scikit-learn, MLlib, and MLxtend.
  • Extensively use Python's multiple data science packages like Pandas, NumPy, matplotlib, Seaborn, SciPy, Scikit-learn, and NLTK.
  • Analytically solved packaging, distribution and traveling problems in Xpress IVE optimization solver
  • Modeled algorithms using column generation, constraint generation and integer programming techniques
  • Developed modules for extracting encounter data from multiple systems, converted to common format, and loaded into a regional Health Information Exchange for regional patient care management and clinical decision support.
  • Performed Exploratory Data Analysis, trying to find trends and clusters.
  • Built models using techniques like Regression, Tree-based ensemble methods, Time Series forecasting, KNN, Clustering and Isolation Forest methods.
  • Work with data combining unstructured and structured sources and automate the cleaning using Python scripts.
  • Extensively perform large data reads/writes to and from CSV and Excel files using pandas.
  • Tasked with maintaining RDDs using SparkSQL.
  • Communicate and coordinate with other departments to collect business requirements.
  • Tackle a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
  • Implemented machine learning models (logistic regression, XGBoost) with Python Scikit-learn.
  • Optimize the algorithm with stochastic gradient descent. Fine-tuned the algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
  • Develop a technical brief based on the business brief. This contains detailed steps and stages of developing and delivering the project including timelines.
  • After sign-off from the client on the technical brief, started developing the SAS code.
  • Write data validation SAS code using the UNIVARIATE and FREQ procedures.
  • Separately calculated the KPIs for Target and Mass campaigns at pre-promo-post periods with respect to their transactions, spend and visits.
  • Implement cluster analysis (PROC CLUSTER and PROC FASTCLUS) iteratively.
  • Work extensively with data governance team to maintain data models, Metadata, and dictionaries.
  • Use Python to preprocess data and attempt to find insights.
  • Configured automatic and scheduled refresh in the Power BI service.
  • Extensively use SQL queries for legacy data retrieval jobs.
  • Migrated the Django database from MySQL to PostgreSQL.
  • Responsible for maintaining and analyzing large datasets used by domain experts to analyze risk.
  • Develop Hive queries that compared new incoming data against historical data. Built tables in Hive to store large volumes of data.
  • Use the big data tool Spark (SparkSQL, MLlib) to conduct real-time analysis of credit card fraud on AWS.
  • Perform Data audit, QA of SAS code/projects and sense check of results.

Environment: Spark, Hadoop, AWS, SAS Enterprise Guide, SAS/MACROS, SAS/ACCESS, SAS/STAT, SAS/SQL, ORACLE, MS-OFFICE, Python (Scikit-learn, pandas, NumPy), Machine Learning (logistic regression, XGBoost).

Confidential - Collegeville, PA.

Data Scientist


  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction. Utilized the engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Performed Data Profiling to learn about behavior with various features such as traffic pattern, location, date, and time.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, regression models, neural networks, SVM, and clustering to identify volume using the Scikit-learn package in Python and MATLAB.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources. Used K-Means clustering to identify outliers and to classify unlabeled data.
  • Determined customer satisfaction and helped enhance customer experience using NLP
  • Performed data visualization with Tableau and D3.js, and generated dashboards to present the findings
  • Recommended and evaluated marketing approaches based on quality analytics of customer consuming behavior
  • Evaluated models using Cross Validation, Log loss function, ROC curves and used AUC for feature selection.
  • Partnered with the sales and marketing teams and collaborated with a cross-functional team to frame and answer important data questions, prototyping and experimenting with ML/DL algorithms and integrating them into production systems for different business needs.
  • Researched the existing client processes and guided the team in aligning with the HIPAA rules and regulations for the systems for all the EDI transaction sets.
  • Consulted with healthcare insurance company to develop conversion specifications for other insurance Coordination of Benefits (including Medicare).
  • Analyzed traffic patterns by calculating autocorrelation with different time lags.
  • Ensured that the model had a low false positive rate.
  • Addressed overfitting by applying regularization methods such as L1 and L2.
  • Used Principal Component Analysis in feature engineering to analyze high dimensional data.
  • Created and designed reports that use gathered metrics to draw logical conclusions about past and future behavior.
  • Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package will be delivered on time on a new route.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
  • Used Python and Spark to implement different machine learning algorithms including Generalized Linear Model, SVM, Random Forest, Boosting and Neural Network
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Communicated the results to the operations team to support decision-making.
  • Collected data needs and requirements by interacting with the other departments.
  • Developed a MapReduce pipeline for feature extraction using Hive.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.

Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, Pennsylvania.

Data Scientist.


  • Developed applications of Machine Learning, Statistical Analysis, and Data Visualization for challenging data processing problems in the sustainability and biomedical domains.
  • The goal was to identify subtypes of autism for the development of targeted and more effective therapies.
  • Used hierarchical clustering methods to identify clusters in the data based on important features; further analysis to identify the most significant brain volumes is under way.
  • Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with the NLTK module of Python to develop an application for automated customer response.
  • Applied concepts of probability, distribution, and statistical inference to the given dataset to unearth interesting findings through the use of comparison, T-test, F-test, R-squared, P-value, etc.
  • Applied linear regression, multiple regression, ordinary least squares, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data with the help of the Scikit-learn, SciPy, NumPy, and Pandas modules of Python.
  • Applied clustering algorithms such as Hierarchical and K-means with the help of Scikit-learn and SciPy.
  • Developed visualizations and dashboards using ggplot, Tableau
  • Worked on development of data warehouse, Data Lake, and ETL systems using relational and non-relational tools like SQL and NoSQL.
  • Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationships, including causal relationships, between different attributes of the dataset
  • Performed complex pattern recognition on financial time series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data
  • Used Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Expertise in Business Intelligence and data visualization using R and Tableau.
  • Expert in Agile and Scrum Process.
  • Validated macroeconomic data (e.g. BlackRock, Moody's, etc.) and performed predictive analysis of world markets using key indicators in Python and machine learning concepts like regression, Bootstrap Aggregation, and Random Forest.
  • Worked in large-scale database environments like Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).

Environment: AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector.

Confidential - IN

Data Analyst.


  • Designed and built dimensions and cubes with star and snowflake schemas using SQL Server Analysis Services (SSAS).
  • Participated in JAD sessions with business users and sponsors to understand and document the business requirements in alignment with the financial goals of the company.
  • Involved in the analysis of business requirements, the design and development of high-level and low-level designs, and unit and integration testing
  • Performed data analysis and data profiling using complex SQL on various source systems including Teradata and SQL Server.
  • Developed the logical data models and physical data models that capture current state/future state data elements and data flows using ER Studio
  • Reviewed and implemented the naming standards for the entities, attributes, alternate keys, and primary keys for the logical model.
  • Normalized the ER data model of the OLTP system to second and third normal form
  • Worked with the data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
  • Translated business and data requirements into Logical data models in support of Enterprise Data Models, ODS, OLAP, OLTP, Operational Data Structures, and Analytical systems.
  • Designed and modeled the reporting data warehouse considering current and future reporting requirements
  • Involved in the daily maintenance of the database that involved monitoring the daily run of the scripts as well as troubleshooting in the event of any errors in the entire process.
  • Worked with data scientists to create data marts for data science-specific functions.
  • Determined data rules and conducted Logical and Physical design reviews with business analysts, developers, and DBAs.
  • Used external loaders such as MultiLoad, TPump, and FastLoad to load data into Oracle, and performed database analysis, development, testing, implementation, and deployment.
  • Reviewed the logical model with application developers, ETL Team, DBAs, and testing team to provide information about the data model and business requirements.

Environment: Erwin r7.0, Informatica 6.2, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, Rational Rose, Requisite Pro, Hadoop, PL/SQL.

Confidential - IN

Data Analyst.


  • Used SAS Proc SQL pass-through facility to connect to Oracle tables and created SAS datasets using various SQL joins such as left join, right join, inner join and full join.
  • Performed data validation and transformed data from Oracle RDBMS to SAS datasets.
  • Produced quality customized reports using PROC TABULATE, PROC REPORT styles, and ODS RTF, and provided descriptive statistics using PROC MEANS, PROC FREQ, and PROC UNIVARIATE.
  • Developed SAS macros for data cleaning, reporting, and to support routine processing.
  • Performed advanced querying using SAS Enterprise Guide: calculated computed columns, applied filters, and manipulated and prepared data for reporting, graphing, summarization, and statistical analysis, finally generating SAS datasets.
  • Involved in developing, debugging, and validating project-specific SAS programs to generate derived SAS datasets, summary tables, and data listings according to study documents.
  • Created datasets per the approved specifications; collaborated with project teams to complete scientific reports and reviewed reports to ensure accuracy and clarity.
  • Performed different calculations like Quick table calculations, Date Calculations, Aggregate Calculations, String and Number Calculations.
  • Good expertise in building dashboards and stories based on the available data points.
  • Created action filters, user filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Expertise in Agile Scrum Methodology to implement project life cycles of reports design and development
  • Combined Tableau visualizations into Interactive Dashboards using filter actions, highlight actions etc. and published them to the web.
  • Created rich dashboards using Tableau Dashboard and prepared user stories to create compelling dashboards that deliver actionable insights
  • Worked with the manager to prioritize requirements and prepared reports on a weekly and monthly basis.

Environment: SQL Server, Oracle 11g/10g, MS Office Suite, PowerPivot, PowerPoint, SAS Base, SAS Enterprise Guide, SAS/MACRO, SAS/SQL, SAS/ODS, SQL, PL/SQL, Visio.
