Data Scientist Resume

Bowie, MD

PROFESSIONAL SUMMARY:

  • Over 8 years of experience in Machine Learning, Data Mining on large structured and unstructured datasets, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
  • Independently led analytics, visualization, modeling, and reporting; ran business-critical meetings with clients; managed SLAs; and provided actionable insights to managers and C-level executives.
  • Built classification and forecasting models, automated processes, and delivered text mining, sentiment analysis, statistical models, risk analysis, platform integrations, optimization models, models to improve user experience, and A/B testing using R, SAS, Python, SPSS, SAS Enterprise Miner, EViews, Tableau, etc.
  • Good knowledge and understanding of web technologies such as HTML, CSS, and JavaScript.
  • Experience in verifying the integration between databases and the user interface.
  • Developed complex database objects such as Stored Procedures, Functions, and Packages using SQL and PL/SQL.
  • Excellent experience in setting up, configuring, and monitoring Hadoop clusters on Cloudera and Hortonworks distributions, and in the wider Big Data and Hadoop ecosystem.
  • Very good Knowledge of Amazon AWS concepts like EMR and EC2 web services which provide fast and efficient processing of Big Data.
  • Experience in interacting with business users to analyze business processes and transform requirements into screens, performing ETL, documenting, and rolling out the deliverables.
  • Strong Experience in Creating, Configuring, Deploying, and Testing SSIS Packages.
  • Worked extensively on Extraction, Transformation, and Loading (ETL) of data from Oracle, DB2, Access, Excel, flat files, and XML using DTS and SSIS.
  • Advanced experience with Python (2.x, 3.x) and its libraries such as NumPy, Pandas, Scikit-learn, XGBoost, LightGBM, Keras, Matplotlib, and Seaborn.
  • Strong Knowledge in Statistical methodologies such as Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), Monte Carlo Sampling and Time Series Analysis.
  • Extensive experience in using Flume to transfer log data files to Hadoop Distributed File System (HDFS).
  • Expertise in using tools like Sqoop and Kafka to ingest data into Hadoop.
  • Expertise in deploying code through web application servers such as WebSphere, WebLogic, and Apache Tomcat on AWS, and in using AMIs, IAM, instances, S3, and other AWS resources.
  • Proficient in Big Data, Hadoop, Hive, MapReduce, Pig and NoSQL databases like MongoDB, HBase, Cassandra.
  • Experienced in SQL Queries and optimizing the queries in Oracle, SQL Server, DB2, PostgreSQL, Netezza, and Teradata.
  • Strong experience in maintaining PostgreSQL, Oracle, and Big Data databases and upgrading their versions.
  • Experience in installing, configuring, and maintaining databases such as PostgreSQL, Oracle, and Big Data HDFS systems.
  • Hands-on experience with clustering algorithms such as K-means and K-medoids, and with predictive and descriptive algorithms.
  • Expertise in Model Development, Data Mining, Predictive Modeling, Descriptive Modeling, Data Visualization, Data Cleaning and Management, and Database Management.
  • Expertise in applying data mining techniques and optimization techniques in B2B and B2C industries and proficient in Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.
  • Used DFAST Modelling and Solutions for expected loss calculations and viewing the results in a dashboard for further insights.
  • Experienced in designing star schema (identification of facts, measures, and dimensions), Snowflake schema for Data Warehouse, ODS Architecture by using tools like Erwin Data Modeler, Power Designer, E-R Studio and Microsoft Visio.
  • Expertise in Excel Macros, Pivot Tables, VLOOKUPs, and other advanced functions; expert R user with knowledge of the SAS statistical programming language.
  • Excellent experience with Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
  • Experience in Data Mining, Text Mining, Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
  • Experienced in designing Architecture for Modelling a Data Warehouse by using tools like Erwin, Power Designer, and E-R Studio.

TECHNICAL SKILLS:

Languages: C, C++, WSDL, R/R Studio, SAS Enterprise Guide, SAS R, R (Caret, Weka, ggplot), Perl, MATLAB, Java, Scala, Python (NumPy, SciPy, Pandas).

IDEs: PyCharm, Emacs, Eclipse, NetBeans, MS Visual Studio, Sublime, SOAP UI

Application Servers: Web Logic, Web Sphere, JBoss, Tomcat.

Cloud Computing Tools: Amazon AWS

Databases: Microsoft SQL Server 2008, MySQL 4.x/5.x, Oracle 10g, 11g, 12c, DB2, Teradata, Netezza

NoSQL Databases: HBase, Cassandra, MongoDB, MariaDB

Build Tools: Jenkins, Maven, ANT, Toad, SQL Loader, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Business Intelligence Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Development Tools: Microsoft SQL Studio, Eclipse, NetBeans, IntelliJ

Database Tools: SQL Server Data Tools, Visual Studio, Spotlight, SQL Server Management Studio, Query Analyzer, Enterprise Manager, JIRA, Profiler

ETL Tools: Informatica PowerCenter, SSIS

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.

Operating Systems: All versions of UNIX, Windows, Linux, Macintosh, Sun Solaris

PROFESSIONAL EXPERIENCE:

Confidential, Bowie, MD

Data Scientist

Responsibilities:

  • Analyzed and solved business problems and found patterns and insights within structured and unstructured data.
  • Used clustering technique K-Means to identify outliers and to classify unlabelled data.
  • Cleaned, analyzed and selected data to gauge customer experience
  • Used algorithms and programming to efficiently go through large datasets and apply treatments, filters, and conditions as needed.
  • Created meaningful data visualizations to communicate findings and relate them back to how they create business impact.
  • Used a diverse array of technologies and tools, such as R, SAS, MATLAB, and Tableau, to deliver insights.
  • Worked on Natural Language Processing with the NLTK module in Python for application development and automated customer response.
  • Used statistical Natural Language Processing for sentiment analysis, mining unstructured data, and generating insights.
  • During data exploration, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into patient admission and discharge data.
  • Worked on feature engineering, including feature creation, feature scaling, and one-hot encoding, with Scikit-learn.
  • Produced forecasts using exponential smoothing, ARIMA modeling, transfer function models, and other statistical algorithms and analyses.
  • Conducted studies and rapid plots, and used advanced data mining and statistical modeling techniques to build a solution that optimizes the quality and performance of data.
  • Conducted analyses to support Methods programs.
  • Worked on the topics related to trend analysis, identifying gaps for improvements in coverage, data integration.
  • Performed data profiling, data cleaning, and data quality.
  • Analyzed data from complex relational databases for various analyses and requests, using Access, SQL, Excel, and statistical packages such as SAS and Minitab, to meet business goals and objectives.
  • Used SAS for multivariate analysis, business intelligence, data management and predictive analytics.
  • Worked with cross-functional teams to design, implement, and test new consumer and audience measurement methodologies.
  • Used a combination of data visualization tools, such as Tableau, and communication tools to clearly and effectively explain the problem, root cause, and recommendations.
  • Assisted in the designing and testing of data collection methodologies for Nielsen panels and surveys.
  • Took part in identifying and implementing methods and best practices to improve respondent cooperation and optimize data collection approaches.
  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Developed SQL procedures to synchronize the dynamic data generated from GTID systems with the Azure SQL Server.
  • Automated processes using Python/R scripts against an Oracle database to generate results and write them to the production environment on a weekly basis.
  • Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
  • Used Data Quality validation techniques to validate Critical Data elements (CDE) and identified various anomalies.
  • Performed Data Validation / Data Reconciliation between the disparate source and target systems for various projects.
  • Interacted with the business teams and Project Managers to clearly articulate the anomalies, issues, and findings from data validation.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Cognos.
  • Provided input and recommendations on technical issues to Business & Data Analysts, BI Engineers, and Data Scientists.

Environment: SAS, R, MLlib, Python, Data Governance, MDM, MATLAB, Tableau, Azure SQL Server, NLTK.

Confidential, Boston, MA

Data Scientist

Responsibilities:

  • Performed data profiling and analysis on the different source systems required for Customer Master.
  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Used T-SQL queries to pull the data from disparate systems and Data warehouse in different environments.
  • Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
  • Prepared DQ analysis reports and scorecards on all the validated data elements and presented them to the business teams and stakeholders.
  • Used Data Quality validation techniques to validate Critical Data elements (CDE) and identified various anomalies.
  • Developed clinical NLP methods that ingest large unstructured clinical data sets, separate signal from noise, and provide personalized insights at the patient level that directly improve our analytics platform.
  • Used NLP methods for information extraction, topic modeling, parsing, and relationship extraction.
  • Worked with the NLTK library for NLP data processing and finding the patterns.
  • Extensively used open source tools, RStudio (R) and Spyder (Python), for statistical analysis and building machine learning models.
  • Involved in defining source-to-target data mappings, business rules, and data definitions.
  • Performed Data Validation / Data Reconciliation between the disparate source and target systems (Salesforce, Cisco-UIC, Cognos, Data Warehouse) for various projects.
  • Interacted with the business teams and Project Managers to clearly articulate the anomalies, issues, and findings from data validation.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Cognos.
  • Extracted data from different databases as per the business requirements using SQL Server Management Studio.
  • Worked with the Data Governance group to identify, classify, and define each assigned Critical Data Element (CDE) and ensure that each element has a clear and unambiguous definition.
  • Analyzed data lineage processes and documentation for the CDEs to identify vulnerable points, control gaps, data quality issues, and overall lack of data governance.
  • Proposed data checks and standard operating procedures on the source systems to enhance data quality
  • Reviewed various Project Management documents such as Business Requirements document, Functional Specification document and suggested changes to ensure it complies with policies and standards.
  • Worked with the Data Governance group in creating a custom data dictionary template to be used across the various business lines.
  • Worked with data stewards to ensure awareness of data quality standards and data requirements
  • Linked data lineage to data quality and business glossary work within the overall data governance program.
  • Managed communication and training with data owners/stewards to ensure awareness of policies and standards
  • Gathered requirements by working with the business users on Business Glossary, Data Dictionary, and Reference data
  • Generated weekly and monthly reports for various business users according to the business requirements, manipulating and mining data from database tables (Redshift, Oracle, Data Warehouse).
  • Created statistical models, both distributed and standalone, to build various diagnostic, predictive, and prescriptive solutions.
  • Interfaced with other technology teams to extract, transform, and load (ETL) data from a wide variety of data sources.
  • Provided input and recommendations on technical issues to Business & Data Analysts, BI Engineers, and Data Scientists.

Environment: Data Governance, SQL Server, ETL, MS Office Suite - Excel (Pivot, VLOOKUP), DB2, R, Python, Visio, HP ALM, Agile, Azure, Data Quality, Tableau and Reference Data Management.

Confidential, Alpharetta, GA

Data Scientist

Responsibilities:

  • Developed applications of Machine Learning, Statistical Analysis, and Data Visualization for challenging data processing problems in the sustainability and biomedical domains.
  • Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with the NLTK module in Python for application development and automated customer response.
  • Used predictive modeling with tools in SAS, SPSS, R, and Python.
  • Applied concepts of probability, distribution and statistical inference on the given dataset to unearth interesting findings through the use of comparison, T-test, F-test, R-squared, P-value etc.
  • Applied linear regression, multiple regression, the ordinary least squares method, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data with the help of the Scikit-learn, SciPy, NumPy, and Pandas modules in Python.
  • Applied clustering algorithms such as hierarchical clustering and K-means with the help of Scikit-learn and SciPy.
  • Worked on development of data warehouse, Data Lake, and ETL systems using relational and non-relational tools (SQL and NoSQL).
  • Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationships between different attributes of the dataset and the causal relationships between them.
  • Worked in large-scale data environments such as Hadoop and MapReduce, with working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
  • Interfaced with large-scale database system through an ETL server for data extraction and preparation.
  • Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.

Environment: Machine learning, AWS, MS Azure, Cassandra, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.

Confidential, Irvine, CA

Data Scientist/Data Modeler

Responsibilities:

  • Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
  • Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
  • Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
  • Conducted studies and rapid plots, and used advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
  • Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
  • Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
  • Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of database.
  • Worked on customer segmentation using an unsupervised learning technique - clustering.
  • Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
  • Used Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
  • Developed LINUX Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to Netezza database.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
  • Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.

Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.

Confidential

Data Analyst/Data Modeler

Responsibilities:

  • Analyzed data sources and requirements and business rules to perform logical and physical data modeling
  • Analyzed and designed best fit logical and physical data models and relational database definitions using DB2. Generated reports of data definitions
  • Involved in Normalization/De-normalization, Normal Form and database design methodology.
  • Maintained existing ETL procedures, fixed bugs and restored software to production environment.
  • Developed the code as per the client's requirements using SQL, PL/SQL and Data Warehousing concepts
  • Worked with Data Warehouse Extract and load developers to design mappings for Data Capture, Staging, Cleansing, Loading, and Auditing.
  • Developed enterprise data model management process to manage multiple data models developed by different groups
  • Designed and created Data Marts as part of a data warehouse.
  • Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects XIR2.
  • Used the Erwin modeling tool to publish a data dictionary, review the model and dictionary with subject matter experts, and generate data definition language (DDL).
  • Coordinated with the DBA in implementing database changes and updating data models with changes implemented in development, QA, and production. Worked extensively with the DBA and reporting team to improve report performance through appropriate indexes and partitioning.
  • Developed Data Mapping, Transformation, and Cleansing rules for the Master Data Management architecture involving OLTP, ODS, and OLAP.
  • Tuned and optimized code using techniques such as dynamic SQL, dynamic cursors, and SQL query tuning, and wrote generic procedures, functions, and packages.
  • Experienced in GUI, Relational Database Management System (RDBMS), designing of OLAP system environment as well as Report Development.
  • Prepared weekly, biweekly, and monthly data analysis reports using MS Excel, SQL, and UNIX.

Environment: ER Studio, Informatica Power Center 8.1/9.1, Power Connect/Power Exchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, XML, Windows NT 4.0, Tableau, Workday, SPSS, SAS, Business Objects, Unix Shell Scripting, Teradata, Netezza, Aginity.

Confidential

Jr. Data Analyst

Responsibilities:

  • Worked with internal architects, assisting in the development of current and target state data architectures.
  • Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
  • Involved in defining the business/transformation rules applied for sales and service data.
  • Implemented the Metadata Repository and transformations; maintained Data Quality, Data Standards, and the Data Governance program; and developed scripts, stored procedures, and triggers and executed test plans.
  • Defined the list codes and code conversions between the source systems and the data mart.
  • Involved in defining source-to-target data mappings, business rules, and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Remained knowledgeable in all areas of business operations in order to identify system needs and requirements.
  • Performed data quality checks in Talend Open Studio.
  • Updated the Enterprise Metadata Library with any changes or updates.
  • Documented data quality and traceability for each source interface.

Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.
