- 8+ years of experience in Machine Learning and Data Mining with large datasets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Data Visualization and design of Physical Data Architecture for new system engines.
- Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems, and generating data visualizations using R, Python and Tableau.
- Good experience in NLP with Apache Hadoop and Python, and hands-on experience with Spark MLlib utilities including classification, regression, clustering, collaborative filtering and dimensionality reduction.
- Experienced with Hadoop in the cloud on Azure, AWS EMR and Cloudera Manager, as well as Hadoop run directly on EC2 (non-EMR).
- Proficient in Statistical Modeling and Machine Learning techniques (Linear and Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian methods, XGBoost) for Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor Analysis/PCA and Ensembles.
- Excellent working experience with and knowledge of big data tools like Hadoop, Azure Data Lake and AWS Redshift.
- Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks and Principal Component Analysis; good knowledge of Recommender Systems and of developing Logical Data Architecture in adherence to Enterprise Architecture.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Adept in statistical programming languages like Python and Scala, and in Big Data technologies like Hadoop, Hive, Pig, MapReduce, HBase, MongoDB and Cassandra.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Experienced with data modeling tools like Erwin, Power Designer and ER Studio, and skilled in using dplyr (R) and pandas (Python) for exploratory data analysis.
- Experienced in designing stunning visualizations using Tableau, and in publishing and presenting dashboards and Storylines on web and desktop platforms.
- Experienced in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
- Good experience with and understanding of Teradata SQL Assistant, Teradata Administrator and data load/export utilities like BTEQ, FastLoad, MultiLoad and FastExport.
- Experience and technical proficiency in designing and data modeling online applications, and as Solution Lead for architecting Data Warehouse/Business Intelligence applications.
- Experienced in maintaining database architecture and metadata that support the Enterprise Data Warehouse.
- Experienced with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, Pivot Tables and OLAP reporting.
- Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards, and in using Hadoop (Pig and Hive) for basic analysis, extraction and summarization of data in the infrastructure.
- Expertise in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
Data Modeling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Programming Languages: Oracle PL/SQL, Python, Scala, Hive, SQL, T-SQL, UNIX shell scripting, Java.
Scripting Languages: Python (NumPy, SciPy, Pandas, Gensim, Keras), R (Caret, Weka, ggplot2).
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, HBase, MongoDB, Sqoop.
Reporting Tools: Crystal Reports XI, Business Intelligence, SSRS, Business Objects, Cognos and Tableau.
ETL: Informatica PowerCenter, SSIS.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD) and Joint Application Development (JAD).
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.
Tools: MS-Office suite (Word, Excel, MS Project and Outlook), Spark MLlib, Scala NLP, MariaDB, Azure, SAS.
Databases: Oracle, Teradata, Netezza, Microsoft SQL Server, MongoDB, HBase, Cassandra.
Operating Systems: Windows, UNIX, MS-DOS, Sun Solaris.
Sr. Data Scientist/Architect
Confidential, Chicago, IL
- As an Architect, designed conceptual, logical and physical models using Erwin, built data marts using hybrid Inmon and Kimball DW methodologies, and worked closely with business, data governance, SMEs and vendors to define data requirements.
- Interacted with Business Analysts, SMEs and other Data Architects to understand business needs and functionality for various project solutions.
- Designed the prototype of the Data Mart and documented its possible outcomes for end users, and worked with data investigation, discovery and mapping tools to scan every data record from many sources.
- Designed and architected AWS Cloud solutions for data and analytical workloads such as warehouses, Big Data, data lakes, real-time streams and advanced analytics.
- Applied Multinomial Logistic Regression, Random Forest, Decision Tree and SVM to classify whether a package would be delivered on time on a new route.
- Designed and implemented system architecture for an Amazon EC2-based cloud-hosted solution for the client.
- Worked on Amazon Redshift and AWS, architecting a solution to load data, create data models and run BI on it.
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn and NLTK in Python to develop various machine learning algorithms, and applied algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means and KNN for data analysis.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used MLlib, Spark's machine learning library, to build and evaluate different models.
- Involved in business process modeling using UML and in implementing Spark MLlib utilities including classification, regression, clustering, collaborative filtering and dimensionality reduction.
- Designed, coded and unit tested ETL packages for source marts and subject marts using Informatica ETL processes for the Oracle database.
- Designed both 3NF data models for ODS and OLTP systems and dimensional data models using Star and Snowflake Schemas.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from Oracle into HDFS using Sqoop.
- Drove the technical design of AWS solutions by working with customers to understand their needs, and conducted numerous POCs (Proofs of Concept) to efficiently import large data sets into the database from AWS S3 buckets.
- Designed the Enterprise Conceptual, Logical and Physical Data Model for the 'Bulk Data Storage System' using Embarcadero ER Studio; the data models were designed in 3NF.
- Developed Linux shell scripts using NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Developed ETL processes that extracted data daily and loaded it into an SSIS-based Decision Support Warehouse.
- Worked on delivery of Data & Analytics applications involving structured and unstructured data on Hadoop-based platforms on AWS EMR.
Environment: ER Studio, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.
Sr. Data Scientist/Architect
Confidential, Mentor, OH
- Analyzed the business requirements of the project by studying the Business Requirement Specification document, worked as a Data Modeler/Analyst to generate Data Models using Erwin, and developed a relational database system.
- Reviewed the Conceptual EDW (Enterprise Data Warehouse) Data Model with Business Users, App Dev and Information Architects to make sure all requirements were fully covered. Extensively used the Erwin design tool and Erwin Model Manager to create and maintain the Data Mart, and designed logical and physical data models for multiple OLTP and analytic applications.
- Used Python for Exploratory Data Analysis, A/B testing, ANOVA and hypothesis tests to compare and identify the effectiveness of Creative Campaigns.
- Tuned the performance of existing Data Warehouse applications to increase the efficiency of the existing system.
- Worked on full-lifecycle implementation as Data Modeler/Data Analyst for Data Warehouses and Data Marts with Star Schemas, Snowflake Schemas, SCD and Dimensional Modeling using Erwin.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from Oracle into HDFS using Sqoop.
- Designed, built and deployed a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for predicting various user behaviors.
- Worked on NoSQL databases including HBase, MongoDB and Cassandra. Implemented a multi-datacenter and multi-rack Cassandra cluster.
- Implemented models and data on AWS Redshift and RDS.
- Used Python to generate regression models for statistical forecasting and applied clustering algorithms such as K-Means to categorize customers into groups.
- Performed data manipulation, data preparation, normalization and predictive modeling. Improved efficiency and accuracy by evaluating models in Python.
- Used classification techniques including Random Forest and Logistic Regression to quantify the likelihood of each user referring
- Implemented Partitioning, Dynamic Partitions and Buckets in Hive, and supported MapReduce programs running on the cluster.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala; deployed algorithms in Scala with Spark using sample datasets, and performed Spark-based development in Scala.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Worked with Teradata utilities (BTEQ, FastLoad, FastExport, MultiLoad and TPump) on both Windows and Mainframe platforms.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods including classification, regression and dimensionality reduction.
- Created Hive queries that helped analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics, and processed the data using HQL (SQL-like) on top of MapReduce.
- Designed and developed user interfaces and customized reports using Tableau and OBIEE, and designed cubes for data visualization and mobile/web presentation with parameterization and cascading.
- Worked on an OLAP model based on Dimensions and Facts for efficient data loads across report levels, using multi-dimensional models such as Star and Snowflake Schemas.
- Stored, loaded and backed up data from HDFS to Amazon AWS S3, and created tables in the AWS cluster with S3 storage.
Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.
Confidential, Dallas, TX
- Worked in the Amazon Web Services cloud computing environment and coded Python functions to interface with the Caffe deep learning framework.
- Led the development and presentation of a data analytics data-hub prototype with the help of the other members of the emerging solutions team.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using Python, Mahout, Hadoop and MongoDB.
- Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
- Performed Exploratory Data Analysis and Data Visualization using Python and Tableau, and worked with several R and Python packages including knitr, dplyr, PySpark, CausalInfer and spacetime.
- Gathered all required data from multiple data sources and created datasets for use in analysis.
- Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources and make it suitable for ingestion into the Hive schema for analysis.
- Deployed different predictive models using the Python scikit-learn framework, prototyped machine learning algorithms for POC (Proof of Concept), and improved statistical model performance using learning curves, feature selection methods and regularization.
- Worked with Data Governance, Data Quality, Data Lineage and Data Architects to design various models and processes, and performed EDA with univariate and bivariate analysis to understand intrinsic and combined effects.
- Designed data models and data flow diagrams using Erwin and MS Visio; independently coded new programs and designed tables to load and test them effectively for the given POCs using Big Data/Hadoop.
- Developed, Implemented & Maintained the Conceptual, Logical & Physical Data Models using Erwin for Forward/Reverse Engineered Databases.
- Established Data Architecture strategy, best practices, standards and roadmaps, and worked with the Hadoop ecosystem covering HDFS, HBase, YARN and MapReduce.
- Implemented Spark using Java and Spark SQL for processing of event data to calculate various usage metrics of the app like search relevance, active users and others.
- Performed Source System Analysis, database design, data modeling for the warehouse layer using MLDM concepts and package layer using Dimensional modeling
- Performed data cleaning and imputation of missing values using Python and used Hive to store the data and perform data cleaning steps for huge datasets.
- Created dashboards and visualizations on a regular basis using ggplot2 and Tableau.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing, and developed Spark SQL to load tables into HDFS and run select queries on top of them.
- Used the MySQLdb package and the Python MySQL connector to write and execute several MySQL database queries from Python.
- Interacted with other departments to understand and identify data needs and requirements, and worked with other members of the IT organization to deliver data visualization and reporting solutions addressing those needs.
- Extracted data from flat files and other RDBMS databases into the staging area and populated the Data Warehouse.
- Created dynamic linear models to perform trend analysis on customer transactional data in Python.
- Handled ad-hoc requests from different departments and locations, and worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Created customized business reports and shared insights with management.
- Used Normalization (1NF, 2NF and 3NF) and De-normalization techniques for effective performance in OLTP and OLAP systems.
- Extensively used ETL methodology to support data extraction, transformation and loading in a complex EDW using Informatica.
Environment: Erwin r9.x, Teradata V14, Teradata SQL Assistant, Informatica, Oracle 11g, OLAP, OLTP, ODS, MapReduce, Netezza, Mainframes, SQL, PL/SQL, XML, Hive, Hadoop, Pig, Python, AWS, AWS S3, AWS EMR, AWS Redshift, HBase, MongoDB, Flume, Sqoop.
Sr. Data Modeler/Analyst
Confidential, Minneapolis, MN
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Supported MapReduce Programs running on the cluster and used Oozie workflow engine to run multiple Hive and Pig jobs.
- Configured a Hadoop cluster with NameNode and slave nodes, formatted HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Ran MapReduce programs on the cluster and loaded data from RDBMS and web logs into HDFS using Sqoop and Flume.
- Analyzed the partitioned and bucketed data, computed various metrics for reporting, and developed Hive queries for analysis across different banners.
- Loaded data from MySQL to HBase where necessary using Sqoop, launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu), and configured the launched instances for specific applications.
- Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy Oracle and SQL Server database systems.
- Used ERWIN for reverse engineering to connect to existing databases and ODS, create graphical representations in the form of Entity Relationships and elicit more information.
- Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to guide building the data model.
- Used Model Mart of ERWIN for effective model management of sharing, dividing and reusing model information and design for productivity improvement.
- Worked with SQL Server components: SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services).
- Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and uploaded it to the database.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Used external loaders like MultiLoad, TPump and FastLoad to load data into the Teradata database; involved in analysis, development, testing, implementation and deployment.
- Implemented slowly changing dimensions Type 2 and Type 3 for accessing the history of reference data changes.
- Exported the result set from Hive to MySQL using Sqoop after processing the data.
- Used Hive to partition and bucket data and worked on Sequence files, Avro and HAR file formats and compression.
- Wrote Pig scripts to perform ETL procedures on data in HDFS and wrote MapReduce programs with the Java API to cleanse structured and unstructured data.
- Worked on improving performance of existing Pig and Hive Queries and created HBase tables to store various data formats of data coming from different portfolios.
Environment: Teradata 14, ERWIN r9.1, SQL Server 2005/2008, Business Objects XI, MS Excel 2010, Informatica, Rational Rose, Oracle 10g, SAS, SQL, PL/SQL, SSRS, SSIS, T-SQL, XML, DDL, TOAD for Data Analysis, Teradata SQL Assistant 13, Hadoop, Hive, Pig, HDFS, Java, SSAS.
Sr. Data Analyst/Modeler
- Extensively analyzed the Ralph Kimball methodology and implemented it successfully, and created logical and physical data models and metadata to support the requirements.
- Designed the ER diagrams, logical model (relationships, cardinality, attributes and candidate keys) and physical database (capacity planning, object creation and aggregation strategies) for Oracle as per business requirements using Erwin.
- Walked through the Logical Data Models of all source systems for data quality analysis and created Tasks, Workflows and Worklets using Workflow Manager.
- Tuned performance of Informatica sessions for large data files by increasing block size, data cache size and sequence buffer length.
- Extensively used Teradata utilities (BTEQ, FastLoad, MultiLoad, TPump) to import/export and load data from Oracle and flat files.
- Developed normalized Logical and Physical database models to design OLTP systems for different applications.
- Involved in Data Modeling using ERWIN, Star/Snowflake schemas, Fact and Dimension tables, and physical and logical data modeling.
- Automated load runs on Informatica sessions through UNIX cron and PL/SQL scripts, implemented pre- and post-session scripts, and automated email notification of load failures and successes.
- Worked with SQL, SQL*Plus, Oracle PL/SQL Stored Procedures, Triggers and SQL queries, and loaded data into the Data Warehouse/Data Marts.
- Developed Logical and Physical data models that capture current state/future state data elements and data flows using Erwin / Star Schema.
- Led enterprise logical data modeling project (in third normal form 3NF) to gather data requirements for OLTP enhancements and converted third normal form ERDs into dimensional ERDs for data warehouse effort.
- Worked on mapping spreadsheets that provide the Data Warehouse Development (ETL) team with source-to-target data mapping, inclusive of logical names, physical names, data types, domain definitions and corporate metadata definitions.
- Converted logical models into physical database models to build/generate DDL scripts.
- Maintained warehouse metadata, naming standards and warehouse standards for future application development.
- Extensively used ETL to load data from DB2 and Oracle databases, and utilized SQL Server's SSRS and SSIS to support reporting requirements.
- Built, managed, customized ETL (Extraction, Transformation, and Loading) Mappings & workflows using Informatica workflow manager & Designer tools.
- Extensively worked on ETL transformations (Lookup, Update Strategy, Joiner, Router and Stored Procedure), and designed and developed Informatica mappings for data loads and data cleansing.
- Maintained the Meta Data Repository for storing table definitions, tablespaces and entity definitions, tuned Informatica Mappings for optimum performance and scheduled ETL Sessions.
- Applied naming standards to the Data Model and followed company standards for Project Documentation.
Environment: Erwin, Informatica Power Center (Workflow Manager, Workflow Monitor, Worklets, Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, Transformations), SQL Server 2000 (Query Analyzer, DTS, T-SQL), Oracle 10g, SQL, XML, Excel, Teradata, SSRS, SSIS, DDL, PL/SQL, SQL*Loader, TOAD, Business Objects 6.0, Sun Solaris 2.7.