
Data Scientist Resume


Dallas, TX

PROFESSIONAL SUMMARY:

  • 8+ years of experience in Data Analysis, Machine Learning, and Data Mining with large sets of structured and unstructured data, including Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
  • Actively participated in all phases of the project life cycle, including data acquisition (web scraping), data cleaning, Data Engineering (dimensionality reduction via PCA and LDA, normalization, weight of evidence, information value), feature selection, feature scaling and feature engineering, Statistical Modeling (decision trees, regression models, neural networks, SVM, clustering), testing and validation (ROC plots, k-fold cross-validation), Association Rule Learning, Reinforcement Learning, Deep Learning, and data visualization (see the pipeline sketch after this list).
  • Experience working the full data-insight cycle: discussions with the business to understand business logic and drivers, Exploratory Data Analysis, identifying predictors, enriching data, handling missing values, exploring data dynamics, and building predictive models where predictability can be found.
  • Excellent data visualization experience, whether with custom code in R or Python or with dedicated visualization tools, producing output ready for digestion by the business and for decision making by senior management (Global CTO, Global BI leadership level).
  • Extensive experience using R packages such as ggplot2, caret, and dplyr.
  • Extensive experience in creating visualizations and dashboards using R Shiny.
  • Developed numerous visualizations in D3.js.
  • Hands-on experience with Natural Language Processing.
  • Ingested datasets from various data sources, including HDFS and AWS.
  • Packaged applications with Docker and Vagrant.
  • Developed visualizations in numerous tools such as Spotfire, Tableau, and Power BI.
  • Knowledge of Groovy, Scala, and Ruby.
  • Developed numerous reports in R Markdown and Jupyter notebooks.
  • Experience collating sparse data into a single source, working with unstructured data, and writing custom data-validation scripts.
  • Extensive experience in data cleaning, web scraping, fetching live streaming data, data loading, and data parsing using a wide variety of Python and R packages such as Beautiful Soup.
  • Hands-on experience implementing SVM, Naïve Bayes, Logistic Regression, LDA, Decision Trees, Random Forests, recursive partitioning (CART), Passive Aggressive classifiers, and Bagging & Boosting.
  • Experienced with Big Data tools like Hadoop (HDFS), SAP HANA, Hive, and Pig.
  • Expertise in writing effective Test Cases and Requirement Traceability Matrix to ensure adequate software testing and manage Scope Creep.
  • Experience in working with Data Management and Data Governance based assignments.
  • Proficient with high-level Logical Data Models, Data Mapping and Data Analysis.
  • Extensive knowledge in Data Validation in Oracle and MySQL by writing SQL queries.
  • Experience in Healthcare Management and Retail, with excellent domain knowledge of the financial industry's instruments and markets (capital and money markets). Excellent communication, analytical, interpersonal, and presentation skills; expert at managing multiple projects simultaneously.
  • Experience working with onshore, offshore, on-site, and off-site individuals and teams.
  • Strong understanding of software testing techniques, especially those performed or supervised by a BA, including Black Box testing, Regression testing, and UAT.
  • Experience with Object Oriented Programming, Data structures and Algorithms and Design Patterns.
  • Experience using markup and scripting languages.
  • Working knowledge of source control systems including Git, SVN, and CVS.
  • Experience building web services using SOAP and REST.
  • Software development experience in Java and Java libraries such as Hibernate and Spring.
  • Experience using various IDEs.
  • Developed an application in Node.js using Gulp, Browserify, Sass, ESLint, image compression, and Material Design Lite.
  • Proficient in Machine Learning techniques (Decision Trees, Linear and Logistic Regression, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, segmentation methodologies, regression-based models, hypothesis testing, Factor Analysis/PCA, and Ensembles.
  • Experience in designing visualizations using Tableau, including Storylines, on web and desktop platforms, and in publishing and presenting dashboards.
  • Experience with advanced SAS programming techniques, such as PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • Excellent knowledge of the Hadoop Ecosystem and Big Data tools such as Pig, Hive, and Spark.
  • Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
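
The following is a minimal, self-contained sketch of the kind of modeling pipeline described in this summary, assuming scikit-learn; the feature matrix X and labels y are synthetic stand-ins, and the steps (scaling, PCA for dimensionality reduction, a logistic regression classifier, 5-fold cross-validation) illustrate the listed techniques rather than any specific project:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data; any numeric feature matrix / label vector fits here.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Scaling -> PCA (dimensionality reduction) -> classifier, as one pipeline
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # 5-fold cross-validation for testing and validation
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")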

TECHNICAL SKILLS:

Exploratory Data Analysis: Univariate/Multivariate Outlier detection, Missing value imputation, Histograms/Density estimation, EDA in Tableau

Supervised Learning: Linear/Logistic Regression, Lasso, Ridge, Elastic Nets, Decision Trees, Ensemble Methods, Random Forests, Support Vector Machines, Gradient Boosting, XGB, Deep Neural Networks, Bayesian Learning

Unsupervised Learning: Principal Component Analysis, Association Rules, Factor Analysis, K-Means, Hierarchical Clustering, Gaussian Mixture Models, Market Basket Analysis, Collaborative Filtering and Low Rank Matrix Factorization

Sampling Methods: Bootstrap sampling methods and Stratified sampling

Model Tuning/Selection: Cross Validation, Walk Forward Estimation, AIC/BIC criteria, Grid Search and Regularization

Time Series: ARIMA, Holt-Winters, Exponential Smoothing, Bayesian structural time series
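
As a brief illustration of the ARIMA modeling listed above, here is a sketch assuming the statsmodels library; the monthly series and the (p, d, q) = (1, 1, 1) order are illustrative placeholders, not project data:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic monthly series with trend plus noise (stand-in data)
    idx = pd.date_range("2015-01-01", periods=48, freq="MS")
    y = pd.Series(
        np.linspace(100, 148, 48) + np.random.default_rng(1).normal(0, 2, 48),
        index=idx,
    )

    # Fit ARIMA(p=1, d=1, q=1); the order here is purely for illustration
    model = ARIMA(y, order=(1, 1, 1)).fit()
    print(model.forecast(steps=6))  # six-month-ahead forecast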

Machine Learning / Deep Learning: R (caret, glmnet, forecast); Python (xgboost, Keras, PyTorch, Theano, scikit-learn)

SAS: Forecast server, SAS Procedures and Data Steps.

Spark: MLlib, GraphX.

SQL: Subqueries, joins, DDL/DML statements.

Databases/ETL/Query: Teradata, SQL Server, Redshift, Postgres and Hadoop (MapReduce); SQL, Hive, Pig and Alteryx, Talend Open Studio, SSIS

DW-BI tools: Tableau, ggplot2, R Shiny, Microsoft Power BI

PROFESSIONAL EXPERIENCE:

Confidential, DALLAS, TX

Data Scientist

Responsibilities:

  • Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications, and executed machine learning use cases under Spark ML and MLlib (see the sketch after this list).
  • Served as solutions architect, transforming business problems into Big Data and Data Science solutions and defining Big Data strategy and roadmap.
  • Identified areas of improvement in existing business by unearthing insights from vast amounts of data using machine learning techniques, with TensorFlow, Scala, Spark, MLlib, Python, and other tools and languages as needed.
  • Created and validated machine learning models with Azure Machine Learning.
  • Designed a machine learning pipeline using Microsoft Azure Machine Learning for prediction and prescription, and implemented machine learning scenarios for given data problems.
  • Used Scala for coding the components in Play and Akka.
  • Worked on different machine learning models such as Logistic Regression, Multilayer Perceptron classifiers, and K-Means clustering, packaged with Scala/SBT and run in the Spark shell (Scala), as well as an autoencoder model built in R.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant fine-grained access to AWS resources.
  • Created detailed AWS Security Groups, which behaved as virtual firewalls controlling the traffic allowed to reach one or more AWS EC2 instances.
  • Wrote scripts and an indexing strategy for a migration to Redshift from Postgres 9.2 and MySQL databases.
  • Wrote Kinesis agents to pipe data from a streaming app into S3.
  • Good knowledge of Azure cloud services, Azure Storage, Azure Active Directory, and Azure Service Bus. Created and managed Azure AD tenants and configured application integration with Azure AD; integrated on-premises Windows AD with Azure Active Directory.
  • Working knowledge of Azure Service Fabric, microservices, IoT, and Docker containers in Azure. Azure infrastructure management and PaaS solution architecture (Azure AD, licenses, Office 365, DR on cloud using Azure Recovery Vault, Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction, and used the resulting engine to increase user lifetime by 45% and triple user conversions for target categories.
  • Designed and developed NLP models for sentiment analysis.
  • Led discussions with users to gather business process and data requirements and develop a variety of Conceptual, Logical, and Physical Data Models. Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.
  • Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route, and performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
  • Worked on machine learning on large size data using Spark and MapReduce.
  • Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-Means clustering, Naïve Bayes, and other approaches.
  • Power user of machine learning algorithms and libraries like NumPy, Pandas, scikit-learn, Shiny, and ggplot2.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Stored and retrieved data from data-warehouses using Amazon Redshift.
  • Worked on Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, regression models, neural networks, SVM, and clustering, to identify volume, using the scikit-learn package in Python as well as MATLAB.
  • Used Data Warehousing concepts such as the Ralph Kimball and Bill Inmon methodologies, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Tables, and Dimension Tables.
  • Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.
  • Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
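
A condensed sketch of the Spark ML usage described in the first bullet above, assuming PySpark; the tiny DataFrame and its column names (f1, f2, label) are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # Tiny stand-in DataFrame; column names are placeholders
    df = spark.createDataFrame(
        [(1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 4.0, 0), (4.0, 3.0, 1)],
        ["f1", "f2", "label"],
    )

    # Assemble feature columns into a single vector, then fit and evaluate
    train = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(train))
    print(f"AUC: {auc:.3f}")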

Environment: Hortonworks - Hadoop MapReduce, PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.

Confidential, Mclean, VA

Data Scientist

Responsibilities:

  • Worked closely with business, data governance, SMEs, and vendors to define data requirements.
  • Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases on cloud infrastructure (AWS EMR and S3).
  • Selected statistical algorithms (Two-Class Logistic Regression, Boosted Decision Trees, Decision Forest classifiers, etc.).
  • Used MLlib, Spark's Machine learning library to build and evaluate different models.
  • Worked with Teradata 14 tools such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT), and BTEQ.
  • Involved in creating a Data Lake by extracting customers' Big Data from various data sources into Hadoop HDFS, including data from Excel, Flat Files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
  • Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib libraries (see the sketch after this list).
  • Created high level ETL design document and assisted ETL developers in the detail design and development of ETL maps using Informatica.
  • Used R, SQL to create Statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random forest models, Decision trees, Support Vector Machine for estimating the risks of welfare dependency.
  • Helped in migration and conversion of data from the Sybase database into Oracle database, preparing mapping documents and developing partial SQL scripts as required.
  • Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy Oracle and SQL Server database systems
  • Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
  • Worked on predictive and what-if analysis using R from HDFS and successfully loaded files to HDFS from Teradata, and loaded from HDFS to HIVE.
  • Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
  • Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
  • Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
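
A minimal sketch of the Spark DataFrame / Spark SQL work described above, assuming a PySpark session with Hive support; the Hive table lake.customers, its columns, and the S3 output path are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("datalake-sketch")
             .enableHiveSupport()   # requires a Hive-enabled Spark deployment
             .getOrCreate())

    # "lake.customers", its columns, and the S3 path are placeholders
    customers = spark.table("lake.customers")
    summary = (customers
               .groupBy("segment")
               .agg(F.count("*").alias("n"),
                    F.avg("balance").alias("avg_balance")))

    summary.write.mode("overwrite").parquet("s3://example-bucket/summary/")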

Environment: R, Machine Learning, Teradata 14, Hadoop MapReduce, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Informatica MDM, Cognos, SQL Server 2012, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.

Confidential, Dallas, TX

Data Scientist/Machine Learning

Responsibilities:

  • Responsible for applying machine learning techniques (regression/classification) to predict outcomes.
  • Performed Ad-hoc reporting/customer profiling, segmentation using R/Python.
  • Tracked various campaigns, generating customer profiling analysis and data manipulation.
  • Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
  • Utilized label encoders in Python to convert significant non-numerical variables to numerical form and identified their pre- and post-acquisition impact using a two-sample paired t-test (see the sketch after this list).
  • Worked with ETL SQL Server Integration Services (SSIS) for data investigation and mapping to extract data, and applied fast parsing, enhancing efficiency by 17%.
  • Developed Data Science content involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, and ETL for data extraction.
  • Designed a suite of interactive dashboards, providing a previously unavailable way to scale and measure the statistics of the HR department, and scheduled and published reports.
  • Created data presentations that reduce bias and tell the true story of people, pulling millions of rows of data using SQL and performing Exploratory Data Analysis.
  • Applied breadth of knowledge in programming (Python, R), Descriptive, Inferential, and Experimental Design statistics, advanced mathematics, and database functionality (SQL, Hadoop).
  • Migrated data from heterogeneous data sources and legacy systems (DB2, Access, Excel) to centralized SQL Server databases using SQL Server Integration Services (SSIS).
  • Involved in defining source-to-target business rules, target data mappings, and data definitions.
  • Successfully interpreted, analyzed, and performed predictive modeling using Python with the NumPy and Pandas packages.
  • Performing Data Validation / Data Reconciliation between disparate source and target systems for various projects.
  • Utilized a diverse array of technologies and tools as needed to deliver insights, such as R, SAS, MATLAB, Tableau, and more.
  • Built Regression model to understand order fulfillment time lag issue using Scikit-learn in Python.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Used T-SQL queries to pull the data from disparate systems and Data warehouse in different environments.
  • Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
  • Extracted data from different databases per business requirements using SQL Server Management Studio.
  • Interacted with the ETL and BI teams to understand and support various ongoing projects.
  • Extensively used MS Excel for data validation.
  • Involved in data analysis using various analytic and modeling techniques.
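
A small sketch of the label-encoding and paired t-test step mentioned above, assuming scikit-learn and SciPy; the category values and the pre/post-acquisition measurements are made up for illustration:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from scipy import stats

    # Encode a categorical variable as integers (values are placeholders)
    segments = ["retail", "wholesale", "retail", "online", "online"]
    encoded = LabelEncoder().fit_transform(segments)  # e.g. [1, 2, 1, 0, 0]
    print(encoded)

    # Paired t-test on pre- vs post-acquisition measurements (illustrative data)
    pre = np.array([10.2, 11.5, 9.8, 12.1, 10.9])
    post = np.array([10.9, 12.0, 10.1, 12.6, 11.4])
    t_stat, p_value = stats.ttest_rel(pre, post)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")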

Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel (Pivot, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, MDM, SharePoint, Data Quality, Tableau, and Reference Data Management.

Confidential, Bethesda, Maryland

Machine Learning Engineer

Responsibilities:

  • Coded R functions to interface with the Caffe Deep Learning Framework.
  • Worked in an Amazon Web Services cloud computing environment.
  • Worked with several Python packages including NumPy, pandas, PySpark, CausalInfer, and spacetime.
  • Implemented end-to-end systems for Data Analytics and Data Automation, integrated with custom visualization tools, using R, Mahout, Hadoop, and MongoDB.
  • Gathered all the required data from multiple data sources and created datasets for use in analysis.
  • Performed Exploratory Data Analysis and data visualizations using R and Tableau.
  • Performed thorough EDA, with univariate and bivariate analysis, to understand intrinsic and combined effects.
  • Worked with Data Governance, Data Quality, data lineage, and Data Architecture to design various models and processes.
  • Independently coded new programs and designed tables to load and test programs effectively for the given POCs using Big Data/Hadoop.
  • Designed data models and data flow diagrams using Erwin and MS Visio.
  • As an architect, implemented an MDM hub to provide clean, consistent data for an SOA implementation.
  • Developed, implemented, and maintained Conceptual, Logical, and Physical Data Models using Erwin for forward/reverse-engineered databases.
  • Established data architecture strategy, best practices, standards, and roadmaps.
  • Led the development and presentation of a data analytics data-hub prototype with the help of the other members of the emerging solutions team.
  • Performed data cleaning and imputation of missing values using R (see the sketch after this list).
  • Worked with the Hadoop ecosystem, covering HDFS, HBase, YARN, and MapReduce.
  • Took up ad-hoc requests from different departments and locations.
  • Used Hive to store the data and performed data cleaning steps for huge datasets.
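
A compact illustration of the missing-value imputation step above; the resume cites R for this work, but the sketch below uses pandas to stay consistent with the other examples, and the DataFrame contents are placeholders:

    import numpy as np
    import pandas as pd

    # Stand-in data with gaps in a numeric and a categorical column
    df = pd.DataFrame({
        "age": [34, np.nan, 29, 41, np.nan],
        "segment": ["a", "b", None, "b", "b"],
    })

    df["age"] = df["age"].fillna(df["age"].median())              # numeric: median
    df["segment"] = df["segment"].fillna(df["segment"].mode()[0]) # categorical: mode
    print(df)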

Environment: Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, Rational Rose, Requisite Pro, Hadoop, PL/SQL, etc.

Confidential

Data Analyst/Data Modeler

Responsibilities:

  • Analyzed data sources and requirements and business rules to perform logical and physical data modeling.
  • Analyzed and designed best fit logical and physical data models and relational database definitions using DB2. Generated reports of data definitions.
  • Involved in Normalization/De-normalization, Normal Form and database design methodology.
  • Maintained existing ETL procedures, fixed bugs and restored software to production environment.
  • Developed the code as per the client's requirements using SQL, PL/SQL and Data Warehousing concepts.
  • Involved in Dimensional modeling (Star Schema) of the Data warehouse and used Erwin to design the business process, dimensions and measured facts.
  • Worked with Data Warehouse Extract and load developers to design mappings for Data Capture, Staging, Cleansing, Loading, and Auditing.
  • Developed an enterprise data model management process to manage multiple data models developed by different groups.
  • Designed and created Data Marts as part of a data warehouse.
  • Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects XI R2 (see the sketch after this list).
  • Used the Erwin modeling tool to publish a data dictionary, review the model and dictionary with subject matter experts, and generate data definition language.
  • Coordinated with DBAs in implementing database changes and updating data models with changes implemented in development, QA, and production. Worked extensively with the DBA and reporting teams to improve report performance through the use of appropriate indexes and partitioning.
  • Developed Data Mapping, Transformation, and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS, and OLAP.
  • Tuned and optimized code using different techniques like dynamic SQL, dynamic cursors, SQL query tuning, and writing generic procedures, functions, and packages.
  • Experienced in GUI, Relational Database Management System (RDBMS), designing of OLAP system environment as well as Report Development.
  • Extensively used SQL, T-SQL and PL/SQL to write stored procedures, functions, packages and triggers.
  • Prepared analyzed data reports weekly, biweekly, and monthly using MS Excel, SQL, and UNIX.
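
A self-contained sketch of the kind of data-validation query described above (rows loaded to a target reconciled against the source); it uses Python's sqlite3 so it runs anywhere, and the table names and data are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src (id INTEGER, amount REAL);
        CREATE TABLE tgt (id INTEGER, amount REAL);
        INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
        INSERT INTO tgt VALUES (1, 10.0), (2, 20.0);
    """)

    # Rows present in the source but missing from the target
    missing = conn.execute("""
        SELECT s.id FROM src s
        LEFT JOIN tgt t ON s.id = t.id
        WHERE t.id IS NULL
    """).fetchall()
    print("missing from target:", missing)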

Environment: ER Studio, Informatica PowerCenter 8.1/9.1, PowerConnect/PowerExchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, XML, Windows NT 4.0, Tableau, Workday, SPSS, SAS, Business Objects, Unix Shell Scripting, Teradata, Netezza, and Aginity.

Confidential

Data Analyst

Responsibilities:

  • Worked with internal architects, assisting in the development of current and target state data architectures.
  • Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
  • Involved in defining the business/transformation rules applied for sales and service data.
  • Implementation of a Metadata Repository and transformations; maintenance of Data Quality, Data Standards, and a Data Governance program; scripts, stored procedures, triggers, and execution of test plans.
  • Defined the list codes and code conversions between the source systems and the data mart.
  • Involved in defining source-to-target business rules, data mappings, and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Remained knowledgeable in all areas of business operations in order to identify system needs and requirements.
  • Performed data quality checks in Talend Open Studio.
  • Updated the Enterprise Metadata Library with any changes or updates.
  • Documented data quality and traceability for each source interface.
  • Established standard operating procedures.
  • Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.

Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.
