
Sr. Data Scientist/Data Architect Resume


Richmond, VA

SUMMARY:

  • 8+ years of experience in Data Science, Data Modeling, Data Analysis, Data Warehousing, Machine Learning, Data Mining with large data sets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Statistical Modeling, Data Visualization, Web Crawling and Web Scraping. Adept in statistical programming languages and tools such as R, Python, SAS, Apache Spark and MATLAB, and in Big Data technologies such as Hadoop, Hive and Pig.
  • Experienced in provisioning virtual clusters on the AWS cloud using services such as EC2, S3 and EMR.
  • Expertise in managing the entire data science project life cycle, with active involvement in all phases: data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
  • Extensive experience in Relational and Dimensional Data modeling for creating Logical and Physical Design of Database and ER Diagrams using multiple data modeling tools like Erwin and ER Studio.
  • Strong knowledge of all phases of the SDLC (Software Development Life Cycle), from analysis and design through development, testing, implementation and maintenance.
  • Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
  • Experienced in Data Modeling techniques employing Data warehousing concepts like star/snowflake schema and Extended Star.
  • Experience in using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup and rpy2.
  • Expertise in writing functional specifications, translating business requirements into technical specifications, and creating, maintaining and modifying database design documents with detailed descriptions of logical entities and physical tables.
  • Excellent knowledge of Machine Learning, Mathematical Modeling and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
  • Excellent experience in Normalization (1NF, 2NF, 3NF and BCNF) and De-normalization techniques for effective and optimum performance in OLTP and OLAP environments.
  • Expertise in OLTP/OLAP System Study, Analysis and E-R modeling, developing Database Schemas like Star schema and Snowflake schema used in relational, dimensional and multidimensional modeling.
  • Excellent experience in Extract, Transform and Load (ETL) processes using ETL tools like DataStage, Informatica, Data Integrator and SSIS for Data Migration and Data Warehousing projects.
  • Expertise in Data Analysis, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import and Data Export through the use of multiple ETL tools such as Informatica PowerCenter.
  • Proficient in Machine Learning, Data/Text Mining, Statistical Analysis & Predictive Modeling.
  • Experienced in Teradata RDBMS using the FastLoad, FastExport, MultiLoad, TPump, Teradata SQL Assistant and BTEQ utilities.
  • Experienced in SQL and PL/SQL packages, functions, stored procedures, triggers and materialized views to implement business logic in Oracle databases.
  • Experience in developing analytics solutions based on the Azure Machine Learning platform and in selecting statistical algorithms (Two-Class Logistic Regression, Boosted Decision Tree, Decision Forest classifiers, etc.).
  • Strong experience and knowledge in Data Visualization with Tableau, creating Line and Scatter plots, Bar charts, Histograms, Pie charts, Dot charts, Box plots, Time series, Error Bars, Multiple Chart types, Multiple Axes, subplots, etc.
  • Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau.
  • Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark and Spark SQL.
  • Strong Oracle/SQL Server programming skills, including work with functions, packages and triggers.
  • Expertise in Excel Macros, Pivot Tables, VLOOKUPs and other advanced functions.

TECHNICAL SKILLS:

Data Analytics Tools/Programming: Python (NumPy, SciPy, pandas, Gensim, Keras), R (caret, Weka, ggplot), MATLAB, Microsoft SQL Server, Oracle PL/SQL.

Data Visualization: Tableau, Visualization packages, Microsoft Excel.

Machine Learning Algorithms: Classifications, Regression, Clustering, Feature Engineering.

Data Modeling: Erwin 9.x, Star Schema, Snowflake Schema, ER Studio.

Big Data Tools: Hadoop, MapReduce, Sqoop, Pig, Hive, NoSQL, Cassandra, MongoDB, Spark, Scala.

Databases: Oracle, SQL Server, Teradata, Netezza, DB2.

ETL: Informatica, SSIS.

Others: Deep Learning, Graph Mining, Text Mining, C, C++, Java, JavaScript, ASP, Shell Scripting, Scala NLP, Spark MLlib, SAS, SPSS, Cognos, Azure.

PROFESSIONAL EXPERIENCE:

Confidential, Richmond, VA

Sr. Data Scientist/Data Architect

Responsibilities:

  • Provided architectural leadership in shaping strategic business technology projects, with an emphasis on application architecture, and used domain and application portfolio knowledge to play a key role in defining the future state of large business technology programs.
  • Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
  • Designed and implemented statistical models, predictive models, enterprise data models, metadata solutions and data lifecycle management in both RDBMS and Big Data environments.
  • Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of the database.
  • Worked with various Teradata 15 tools and utilities like Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ and other Teradata utilities.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods, including classification, regression and dimensionality reduction.
  • Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for the client.
  • Tested complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables. Built and published customized interactive reports and dashboards, and scheduled reports using Tableau Server.
  • Used TensorFlow to develop deep learning models for classification and regression in direct response modeling, and convolutional neural networks for image classification.
  • Utilized various supervised and unsupervised machine learning techniques, such as multivariate regression with linear and logistic models, naive Bayes, KNN, PCA and K-means clustering.
  • Used the Oracle External Tables feature to read data from flat files into Oracle staging tables.
  • Analyzed weblog data using HiveQL to extract the number of unique visitors per day, page views, visit duration and the most purchased product on the website; managed and reviewed Hadoop log files.
  • Used Erwin 9.6 for effective model management, including sharing, dividing and reusing model information and designs to improve productivity.
  • Wrote ad hoc data normalization jobs for new data ingested into Redshift.
  • Used JSON schema to define table and column mapping from S3 data to Redshift.
  • Developed and implemented SSIS, SSRS and SSAS application solutions for various business units across the organization.

Environment: Erwin 9.6, Teradata, Oracle 12c, Hadoop, HDFS, Pig, Hive, MapReduce, PL/SQL, UNIX, Informatica PowerCenter, MDM, SQL Server, Netezza, DB2, Tableau, Aginity, Architecture, SAS/Graph, SAS/SQL, SAS/Connect and SAS/Access, Python, SQL, AWS, EC2, MongoDB, HBase.

Confidential, MN

Sr. Data Scientist/Data Architect

Responsibilities:

  • Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications, and executed machine learning use cases under Spark ML and MLlib.
  • Served as solutions architect, transforming business problems into Big Data and Data Science solutions and defining the Big Data strategy and roadmap.
  • Identified areas of improvement in the existing business by analyzing vast amounts of data with machine learning techniques.
  • Interpreted business problems and provided solutions using data analysis, data mining, optimization tools, machine learning techniques and statistics.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods, including classification, regression and dimensionality reduction; utilized the engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Designed and developed NLP models for sentiment analysis.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries.
  • Wrote the indexing strategy for Redshift tables, which involved designing the most efficient sortkey and distkey.
  • Led discussions with users to gather business process and data requirements to develop a variety of Conceptual, Logical and Physical Data Models. Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.
  • Built analytical data pipelines to port data in and out of Hadoop/HDFS from structured and unstructured sources, and designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for the client.
  • Applied Multinomial Logistic Regression, Random Forest, Decision Tree and SVM to classify whether a package would be delivered on time on a new route; performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
  • Worked on machine learning on large data sets using Spark and MapReduce.
  • Designed conceptual/theoretical frameworks to support richer models of language use (e.g. incorporating frame semantics, the syntax/semantics interface, discourse and pragmatics, etc).
  • Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes and other approaches.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
  • Extracted, transformed and loaded data sources to generate CSV data files using Python and SQL queries.
  • Stored and retrieved data from data warehouses using Amazon Redshift.
  • Worked on Teradata SQL queries, Teradata indexes and utilities such as MultiLoad, TPump, FastLoad and FastExport.
  • Worked on data pre-processing and cleaning to support feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.

Environment: Python, ER Studio, Hadoop, MapReduce, EC2, S3, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Netezza, SAS, MATLAB, AWS, Scala NLP, SPSS, Cassandra, Oracle, Amazon Redshift, MongoDB, SQL Server 2012, Teradata, DB2, T-SQL, PL/SQL, Flat Files, XML.

Confidential, Atlanta, GA

Sr. Data Scientist/Data Architect

Responsibilities:

  • Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees and Support Vector Machines for estimating risks.
  • Managed data operations team and collaborated with data warehouse developers to meet business user needs, promote data security, and maintain data integrity.
  • Used R and Python for Exploratory Data Analysis, A/B testing, ANOVA tests and Hypothesis tests to compare and identify the effectiveness of Creative Campaigns.
  • Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases under Cloud infrastructure, AWS, EMR, S3.
  • Wrote archiving scripts to periodically transfer older data from S3 to Glacier to minimize storage costs.
  • Wrote ETL code in Python 3.x to clean and normalize unstructured data in a legacy Postgres DB to accommodate schema updates in Redshift.
  • Implemented public segmentation with unsupervised machine learning, using the k-means algorithm in PySpark.
  • Used ETL Tools for masking and cleaning data and mined data from various sources.
  • Involved in creating a Data Lake by extracting the customer's Big Data from various data sources into Hadoop HDFS. This included data from Excel, Flat Files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata and Netezza, as well as log data from servers.
  • Developed Python code for data analysis and curve fitting, using NumPy and SciPy.
  • Performed extensive Data Validation, Data Verification against Data Warehouse and performed debugging of the SQL-Statements and stored procedures for business scenarios.
  • Used Spark DataFrames, Spark SQL and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL and MLlib libraries.
  • Created and reviewed Informatica mapping documents with business and data governance rules.
  • Worked on predictive and what-if analysis using R on data from HDFS, and successfully loaded files to HDFS from Teradata and from HDFS to Hive.
  • Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
  • Developed ETL mappings, testing, correction and enhancement and resolved data integrity issues and coordinated multiple OLAP and ETL projects for various data lineage and reconciliation.
  • Performed transformations of data using Spark and Hive according to business requirements for generating various analytical datasets.
  • Used NLTK, Stanford NLP and RAKE for data preprocessing, entity extraction and keyword extraction.
  • Used concepts of Data Modeling Star Schema/Snowflake modeling, FACT & Dimensions tables and Logical & Physical data modeling.
  • Analyzed the data statistically and prepared statistical reports using SAS.
  • Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
  • Participated in big data architecture for both batch and real-time analytics and mapped data using a scoring system over large data sets on HDFS.

Environment: Hortonworks Hadoop MapReduce, PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, Scala NLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, XML.

Confidential, Chicago, IL

Sr. Data Architect/Data Modeler

Responsibilities:

  • Analyzed business data requirements, architected an accurate, extensible, flexible and logical data model, and defined and implemented conceptual, logical and physical data modeling concepts.
  • Designed and built world-class, high-volume, real-time data ingestion frameworks and automated the ingestion of various data sources into Big Data technologies like Hadoop.
  • Performed data mapping between source and target systems and logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data.
  • Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
  • Analyzed database infrastructure to ensure compliance with customer security standards and database performance considerations, and reverse engineered existing database environments.
  • Created Hive tables and was involved in data loading and writing Hive UDFs.
  • Created BTEQ, FastExport, MultiLoad, TPump and FastLoad scripts for extracting data from various production systems.
  • Created database objects like tables, views, materialized views, procedures and packages using Oracle tools like PL/SQL, SQL*Plus and SQL*Loader, and handled exceptions.
  • Worked on Hive for exposing data for further analysis and for transforming files from different analytical formats to text files.
  • Extensively used Erwin for developing data model using star schema methodologies.
  • Worked on importing and exporting data from Oracle and DB2 into HDFS using Sqoop.
  • Created, optimized, reviewed and executed Teradata SQL test queries to validate transformation rules used in source to target mappings/source views, and to verify data in target tables.
  • Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Provided ad hoc queries and data metrics to the Business Users using Hive and Pig.

Environment: Erwin 9.x, Informatica PowerMart (Source Analyzer, Data warehousing designer, Mapping Designer, Transformations), MS SQL Server, Oracle, SQL, Hive, MapReduce, Pig, Sqoop, HDFS, Hadoop, Teradata, Netezza, PL/SQL, Informatica, SSIS, SSRS.

Confidential, Ohio

Sr. Data Modeler/Data Analyst

Responsibilities:

  • Worked with business users to gather requirements and create data flows, process flows and functional specification documents.
  • Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP and ODS.
  • Based on client requirements, created design documents for Workday reporting and a dashboard that presents all the information regarding those reports.
  • Developed, enhanced and maintained Snowflake Schemas within the data warehouse and data mart with conceptual data models.
  • Designed 3rd normal form target data model and mapped to logical model.
  • Involved in extensive data validation using SQL queries and back-end testing, and used SQL for querying the database in a UNIX environment.
  • Involved in Data Analysis, primarily identifying Data Sets, Source Data, Source Metadata, Data Definitions and Data Formats.
  • Involved in data analysis and creating data mapping documents to capture source-to-target transformation rules.
  • Used ER Studio and Visio to create 3NF and dimensional data models and published them to the business users and ETL/BI teams.
  • Involved in data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity.
  • Developed Informatica SCD Type I, Type II and Type III mappings and tuned them for better performance. Extensively used almost all of the Informatica transformations, including complex lookups, Stored Procedures, Update Strategy, mapplets and others.
  • Created or modified T-SQL queries per business requirements and worked on creating role-playing dimensions, factless fact tables, and snowflake and star schemas.
  • Used the ER Studio modeling tool to publish a data dictionary, review the model and dictionary with subject matter experts, and generate data definition language (DDL).
  • Extracted data from Oracle, Teradata, Netezza, SQL Server and DB2 databases using Informatica to load it into a single repository for data analysis.
  • Involved in development and implementation of SSIS, SSRS and SSAS application solutions for various business units across the organization.
  • Managed full SDLC processes involving requirements management, workflow analysis, source data analysis, data mapping, metadata management, data quality, testing strategy and maintenance of the model.
  • Created custom Workday reports and modified/troubleshot existing custom reports.
  • Used Teradata utilities such as FastExport and MultiLoad for handling various tasks.

Environment: ER Studio, Informatica PowerCenter 8.1/9.1, PowerConnect/PowerExchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, XML, Windows NT 4.0, Tableau, Workday, SPSS, SAS, Business Objects, Unix Shell Scripting, Teradata, Netezza, Aginity.
