Sr. Data Scientist/Data Architect Resume
Burlington, NJ
SUMMARY:
- Highly efficient Data Scientist with 8+ years of experience in Machine Learning, Data Mining with large data sets of structured and unstructured data, Data Acquisition, Data Validation, Statistical Modeling, NLP, Text Mining, Predictive Modeling, Data Visualization, Web Crawling, and Web Scraping. Adept in statistical programming languages such as R and Python, as well as Big Data technologies such as Hadoop, Hive, Pig, Spark, and Scala.
- Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, Python and creating dashboards using tools like Tableau.
- Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in performing data parsing, data manipulation, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Experience in using various packages in R and Python like ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, Reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, Rpy2.
- Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL.
- Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, and Principal Component Analysis.
- Excellent experience in SQL Loader, SQL Data, SQL Data Modeling, Reporting, and SQL Database Development to load data from legacy systems into Oracle databases using control files; used the Oracle External Tables feature to read data from flat files into Oracle staging tables. Used EXPORT/IMPORT Oracle utilities to help the DBA migrate databases from Oracle 12c/11g/10g.
- Extensive experience using ETL and reporting tools like SQL Server Integration Services (SSIS) and SQL Server Reporting Services (SSRS).
- Expertise in Data Analysis, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export through the use of multiple ETL tools such as Informatica Power Center.
- Expertise in Excel macros, pivot tables, VLOOKUPs, and other advanced functions; expert R user with knowledge of the statistical programming language SAS.
- Experience in BI/DW solutions (ETL, OLAP, data marts), Informatica, and BI reporting tools like Tableau and QlikView; also experienced in leading teams of application, ETL, and BI developers and testers.
- Expertise in data acquisition, storage, analysis, integration, predictive modeling, logistic regression, decision trees, data mining methods, forecasting, factor analysis, cluster analysis, and other advanced statistical techniques.
- Good knowledge of proofs of concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using data munging.
- Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
- Good industry knowledge, analytical and problem-solving skills, and ability to work well both within a team and as an individual.
TECHNICAL SKILLS:
Data Analytics Tools/Programming: Python (numpy, scipy, pandas, Gensim, Keras), R (Caret, Weka, ggplot), MATLAB, Microsoft SQL Server, Oracle PL/SQL, SQL, T-SQL, UNIX shell scripting, Java, SAS.
Data Modeling: Erwin, ER Studio, MS Visio.
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, Reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, Rpy2.
Databases: Oracle, Teradata, Netezza, SQL Server, MongoDB, HBase, Cassandra.
Big Data Techs: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, MongoDB.
Reporting Tools: Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0, Tableau.
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.
Other Tools: MS-Office suite (Word, Excel, MS Project and Outlook), Spark MLlib, Scala NLP, MariaDB, Azure.
WORK EXPERIENCE:
Confidential, Burlington NJ
Sr. Data Scientist/Data Architect
Responsibilities:
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; also performed gap analysis.
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, date, and time.
- Developed strategies for data acquisition, archive recovery, and database implementation; worked in a data warehouse environment covering data design, database architecture, and metadata and repository creation.
- Applied various machine learning algorithms and statistical models such as decision trees, regression models, neural networks, SVM, and clustering to identify volume using the scikit-learn package in Python and MATLAB.
- Used pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, NLTK in Python for developing various machine learning algorithms.
- Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
- Led the strategy, architecture, and process improvements for data architecture and data management, balancing long- and short-term needs of the business.
- Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Evaluated models using Cross Validation, Log loss function, ROC curves and used AUC for feature selection.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and numpy packages in Python.
- Built statistical and predictive models and business models using raw data and primary research to help the organization reduce future risk and implement a better strategy on the project concerned.
- Processed data to build a matrix-based collaborative filtering recommendation model via Spark (MLlib ALS) to drive the financial product recommendation web application.
- Worked with advanced NLP, clustering, classification, and graph analytics algorithms.
- Developed a MapReduce pipeline for feature extraction using Hive.
- Involved in designing and architecting data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata), and big data (Hadoop: MongoDB, Hive, Cassandra, and HBase) databases.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data.
- Created various types of data visualizations using Python and Tableau.
Environment: Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLLib, regression, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.
Confidential, NJ
Sr. Data Scientist/Data Architect
Responsibilities:
- Collaborated with cross-functional teams in support of business case development and identified modeling methods to provide business solutions. Determined the appropriate statistical and analytical methodologies to solve business problems within specific areas of expertise.
- Implemented public segmentation with unsupervised machine learning, using the k-means algorithm in PySpark.
- Implemented end-to-end systems for data analytics and data automation, integrated with custom visualization tools, using R, Hadoop, MongoDB, and Cassandra.
- Generated data models using Erwin 9.5, developed a relational database system, and performed logical modeling using dimensional modeling techniques such as Star Schema and Snowflake Schema.
- Explored and extracted data from source XML in HDFS and prepared data for exploratory analysis using data munging.
- Used R and Python for exploratory data analysis, A/B testing, ANOVA, and hypothesis testing to compare and identify the effectiveness of creative campaigns.
- Designed and developed data models and data marts that support the Business Intelligence data warehouse, and supported the data warehouse with Star Schema and dimensional modeling.
- Worked on different data formats such as JSON and XML and applied machine learning algorithms in R.
- Used Spark for test data analytics using MLlib, analyzed performance to identify bottlenecks, and used supervised learning techniques such as classifiers and neural networks to identify patterns in these data sets.
- Worked on Linux shell scripts for business process and loading data from different interfaces to HDFS.
- Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
- Created various types of data visualizations using R, Python, and Tableau.
- Designed and developed data warehouse architecture, data modeling/conversion solutions, and ETL mapping solutions within structured data warehouse environments.
- Independently coded new programs and designed tables to load and test the programs effectively for the given POCs using Hadoop.
- Created MDM, OLAP data architecture, analytical data marts, and cubes optimized for reporting.
- Used an S3 bucket to store the JARs and input datasets, and used DynamoDB to store the processed output from the input data set.
- Worked with different sources such as Oracle, Teradata, SQL Server 2012, Excel, flat files, complex flat files, Cassandra, MongoDB, HBase, and COBOL files.
- Developed complex mappings to extract data from diverse sources including flat files, RDBMS tables, legacy system files, XML files, applications, and Teradata.
- Created partitioned and bucketed tables in Hive. Created Hive internal and external tables, loaded them with data, and wrote Hive queries involving multiple join scenarios.
- Performed K-means clustering, Multivariate analysis and Support Vector Machines in R.
- Used Python, R, and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.
- Performed data cleaning and data preparation tasks to convert data into a meaningful data set using R.
- Identified and targeted high-risk welfare groups with machine learning algorithms.
Environment: R 3.2, Python, MDM, QlikView, MLlib, PL/SQL, Tableau, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, SQL Server, Scala NLP, SSMS, ERP, CRM, Netezza, Cassandra, SQL, AWS, SSRS, Informatica, Pig, Spark, Azure, R Studio, MongoDB, Mahout, Java, Hive.
Confidential, Bentonville AR
Sr. Data Scientist/Data Modeler
Responsibilities:
- Deployed different predictive models using the Python scikit-learn framework and prototyped machine learning algorithms for a POC (proof of concept).
- Participated in all phases of data mining; data collection, data cleaning, developing models, validation and visualization.
- Improved statistical model performance by using learning curves, feature selection methods, and regularization.
- Developed predictive models for use in a machine learning platform using the scikit-learn Python framework.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Developed Data Mapping, Data Governance and transformation and cleansing rules for the Master Data Management Architecture involving OLTP, ODS.
- Performed exploratory data analysis using R and Hive on Hadoop HDFS and performed data cleaning, feature scaling, and feature engineering.
- Collected data for modeling by querying several tables using SQL. The extracted tables were further appended or merged to create tables for modeling using the SAS DATA step SET and MERGE statements.
- Created dynamic linear models to perform trend analysis on customer transactional data in R
- Created SAS datasets from Oracle database with random sampling technique and created Oracle tables from SAS datasets by using SAS Macros.
- Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
- Implemented Principal Component Analysis and Linear Discriminant Analysis.
- Used ER Studio and Visio to create 3NF and dimensional data models and published them to the business users and ETL/BI teams.
- Performed ad-hoc data analysis for customer insights using Hive and developed performance metrics to evaluate algorithm performance.
- Extracted data from Oracle, Teradata, Netezza, SQL Server, and DB2 databases using Informatica and loaded it into a single repository for data analysis.
- Involved in development and implementation of SSIS, SSRS and SSAS application solutions for various business units across the organization.
- Applied database performance tuning techniques (database objects, SQL, T-SQL, and PL/SQL) including normalization/de-normalization, indexes, table partitioning, parallel processing, caching, and data compression.
- Used Teradata utilities such as FastExport and MultiLoad (MLOAD) for handling various tasks.
- Developed, tested, corrected, and enhanced ETL mappings and resolved data integrity issues.
- Created tables and indexes, designed constraints, wrote T-SQL statements for data retrieval, and tuned the performance of T-SQL queries and stored procedures.
- Developed Informatica SCD Type I, Type II, and Type III mappings and tuned them for better performance. Extensively used almost all of the Informatica transformations, including complex lookups, Stored Procedures, Update Strategy, mapplets, and others.
- Checked and validated data in the mapping document by running queries on Teradata.
- Built and published customized interactive reports and dashboards and scheduled reports using Tableau Server.
- Analyzed the data statistically and prepared statistical reports using SAS.
- Involved in fixing invalid mappings, testing of stored procedures and functions, and unit and integration testing of Informatica sessions, batches, and the target data.
- Translated cell formulas for business users in Excel into VBA code to design, analyze, and deploy programs for their ad-hoc needs.
- Created numerous dashboards in Tableau Desktop based on the data collected from zonal and compass, blending data from MS Excel and CSV files with MS SQL Server databases.
Environment: Python, R, Hadoop (HDFS), Pig, SQL Server, ER Studio, Informatica, SSRS, Tableau, Teradata, Netezza, Java, SQL, Oracle, T-SQL, PL/SQL, SAS Macros, SAS, VBA, Machine Learning.
Confidential, San Francisco CA
Sr. Data Analyst/Data Modeler
Responsibilities:
- Responsible for direct interaction with the Business Consultants, DBA team and End User to gain a thorough understanding of the data being requested.
- Created and maintained Logical and Physical models for the data mart. Created partitions and indexes for the tables in the data mart.
- Gathered business requirements, translated them into detailed Business Requirement Documents and Functional Requirement Specifications, and analyzed them.
- Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules.
- Performed extensive data validation by writing several complex SQL queries, took part in back-end testing, and worked on data quality issues.
- Developed mappings to load Fact and Dimension tables, SCD Type 1 and SCD Type 2 dimensions, and incremental loads, and unit tested the mappings.
- Generated data extractions, summary tables, and graphical representations in SAS, Excel, R, and PowerPoint.
- Developed and executed load scripts using Teradata client utilities MULTILOAD, FASTLOAD and BTEQ.
- Performed linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering using R.
- Created the Source to target mapping documents and developed the file processing rules.
- Used R programming techniques for data management and analysis, including geographical datasets.
- Involved in Data Analysis, Data Cleansing, Requirements gathering, Business Analysis, Data Mapping, Entity Relationship diagrams (ERD), Architectural design docs, Functional and Technical design docs, and Process Flow diagrams.
- Prepared scripts to ensure proper data access, manipulation, and reporting functions with the R programming language.
- Worked with the ETL team to document the Transformation Rules for Data Migration from OLTP to Warehouse Environment for reporting purposes.
- Prepared complex SQL/R scripts for ODBC and Teradata servers for analysis and modeling.
- Created and maintained Database Objects (Tables, Views, Indexes, Partitions, Synonyms, Database triggers, and Stored Procedures) in the data model.
- Involved in data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
- Designed and developed database objects (databases, tables, stored procedures, DTS packages) to support the collection, tracking, and reporting of business data.
- Performed R programming for statistical data analysis and data management.
- Designed different types of star schemas for detailed data marts and plan data marts in the OLAP environment.
- Worked extensively on data quality (running data profiling, examining profile outcomes) and metadata management (loading metadata, mapping metadata, performing data linkage).
Environment: R, MySQL, Erwin, Netezza, DB2, Oracle, HTML5, ETL, CSS3, JavaScript, Shell, Linux & Windows, SAS, SQL, T-SQL, Excel, Pivot Tables, Teradata, SQL Server, PL/SQL, Metadata.
Confidential
Data Analyst
Responsibilities:
- Created various Data Mapping Repository documents as part of Metadata services (EMR).
- Collaborated with data modelers and ETL developers in creating the Data Functional Design documents.
- Provided inputs to the development team in performing extraction, transformation, and load for data marts and data warehouses.
- Performed in-depth data analysis and prepared weekly, biweekly, and monthly reports using SQL, MS Excel, MS Access, and UNIX.
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Documented various data quality mapping documents and audit and security compliance adherence.
- Good understanding of advanced statistical modeling and logical modeling using SAS.
- Comfortable manipulating and analyzing complex, high-volume, and high-dimensionality data from varying data sources.
- Wrote SQL scripts to test the mappings and developed a Traceability Matrix of business requirements mapped to test scripts to ensure that any change control in requirements leads to a test case update.
- Interacted with Business System Analysts and Software Developers to transform business requirements and application requirements into appropriate data model solutions.
- Worked with the business and the ETL developers in the analysis and resolution of data-related problem tickets.
- Performed data analysis and data profiling using complex SQL on various source systems including Oracle.
Environment: MySQL 5.x, ORACLE, HTML5, CSS3, JavaScript, Shell, Linux & Windows, SAS, SQL, T-SQL, ETL, Excel, Pivot Tables, Teradata, SQL Server.
