- 9+ years of hands-on experience and comprehensive industry knowledge of Machine Learning (ML), Deep Learning (DL), Statistical Modeling, Predictive Modeling, Data Analytics, Data Modeling, Data Analysis, Data Mining, Text Mining & Natural Language Processing (NLP), Artificial Intelligence algorithms, Business Intelligence, Analytics Models (like Decision Trees, Linear & Logistic Regression), Hadoop (Hive, Pig), R, Python, Spark MLlib, NLP, Scala, MS Excel, SQL and PostgreSQL, AWS, and Erwin.
- Experienced in utilizing analytical applications like R, SAS, and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions, and translate analytical findings into risk management and marketing strategies that drive value.
- Good understanding of Artificial Neural Networks and Deep Learning models using the Theano and TensorFlow packages in Python.
- Experienced in designing star schema (identification of facts, measures and dimensions), Snowflake schema for Data Warehouse, ODS Architecture by using tools like Erwin Data Modeler, Power Designer, and Microsoft Visio.
- Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, SAS and Python and creating dashboards using tools like Tableau.
- Expertise in applying data mining techniques and optimization techniques in B2B and B2C industries and proficient in Machine Learning, Data/Text Mining, Statistical Analysis & Predictive Modeling.
- Experienced in writing Spark Streaming and Spark batch jobs using Spark MLlib for analytics, and proficient in Hadoop, Hive, MapReduce, Pig and NoSQL databases like MongoDB, HBase, and Cassandra.
- Experienced in writing and optimizing SQL queries in Oracle, SQL Server, DB2, Netezza & Teradata.
- Experienced Data Modeler with Conceptual, Logical and Physical Data Modeling skills, Data Profiling skills, maintaining Data Quality, and Teradata 15/14; experienced with JAD sessions for requirements gathering, creating data mapping documents, and writing functional specifications and queries.
- Hands-on experience with clustering algorithms like K-means & K-medoids and predictive algorithms, and expertise in Model Development, Data Mining, Predictive Modeling, Data Visualization, Data Cleaning and Management, and Database Management.
- Excellent experience in SQL Loader, SQL Data, SQL Data Modeling, Reporting, SQL Database Development to load data from the Legacy systems into Oracle Databases using control files and used Oracle External Tables feature to read the data from flat files into Oracle staging tables. Used EXPORT/IMPORT Oracle utilities to help the DBA to migrate the databases from Oracle 12c/11g/10g.
- Experienced in Data Modeling, retaining concepts of RDBMS, Logical and Physical Data Modeling up to Third Normal Form (3NF), and Multidimensional Data Modeling Schema (Star schema, Snowflake modeling, Facts and Dimensions).
- Experienced in SAS/BASE, SAS/STAT, SAS/SQL, SAS/MACROS, SAS/GRAPH, SAS/ACCESS, SAS/ODS, SAS/QC, SAS/ETS in Mainframe, Windows and UNIX environments.
- Expertise in Excel Macros, Pivot Tables, VLOOKUPs and other advanced functions; expert R user with knowledge of statistical programming languages such as SAS.
- Experienced in Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through use of multiple ETL tools such as SSIS and Informatica PowerCenter.
- Excellent experience with Teradata SQL queries, Teradata Indexes, and utilities such as MultiLoad, TPump, FastLoad and FastExport.
- Strong experience and knowledge in Data Visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc., and skilled in using visualization tools like ggplot2 and d3.js for creating dashboards.
- Experienced in Database performance tuning and Data Access optimization, writing complex SQL queries and PL/SQL blocks like stored procedures, Functions, Triggers, Cursors and ETL packages.
Analytical Tools and Languages: Python (numpy, scipy, pandas, Gensim, Keras), R (Caret, Weka, ggplot), MATLAB, Microsoft SQL Server, Oracle PL/SQL, SAS, SQL, T-SQL, UNIX shell scripting.
Data Modeling Tools: Erwin r9.6, 9.5, 9.1, ER/Studio, MS Visio and SAP PowerDesigner.
Operating Systems: Windows 10/8/7, UNIX, MS DOS, Sun Solaris.
Databases: Oracle, Teradata, Netezza, SQL Server, MongoDB, Cassandra
Big Data Techs: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka and Spark
Reporting Tools: Crystal reports XI, Power BI, SSRS, Business Objects, Cognos, Tableau
ETL: Informatica PowerCenter, SSIS
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse
Other Tools: MS-Office suite (Word, Excel, MS Project and Outlook), Spark MLLib, Scala NLP, MariaDB, Azure.
Algorithms Tools: Machine Learning, Neural Networks, Deep Learning, NLP, Bayesian Learning, Optimization, Prediction, Pattern Identification, Data / Text mining, Regression, Logistic Regression, Bayesian Belief, Clustering, Classification, Statistical modeling
Development Tools: R Studio, Notepad++, Python, Jupyter, Spyder IDE
Cloud Technologies: AWS (EC2, S3, RDS, Security Groups), Microsoft Azure
Sr. Data Scientist/Engineer
Confidential, Chicago IL
- Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications, and executed machine learning use cases under Spark ML and MLlib.
- Involved in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes and other approaches.
- Developed Spark Python modules for machine learning & predictive analytics in Hadoop on AWS and implemented a Python-based distributed random forest via Python streaming.
- Used pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive bayes, Random Forests, K-means, & KNN for data analysis.
- Developed a text classification algorithm using classical machine learning algorithms and applied state-of-the-art machine learning algorithms such as deep neural networks and RNNs.
- Conducted studies, produced rapid plots, and used advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
- Reduced the log-loss error to below 1.0 for text classification problem using the machine learning & deep learning algorithms.
- Deployed the model on AWS Lambda and collaborated with the development team to build the business solutions.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
- Extensively used SQL, NumPy, Pandas, Scikit-learn, Spark, and Hive for Data Analysis and Model building.
- Extracted the data from Teradata into HDFS using Sqoop and exported the analyzed patterns back into Teradata using Sqoop.
- Developed the platform with word cloud and world map views using various data visualization libraries, text mining, NLP, and sentiment analysis.
- Worked with the NLTK library for NLP data processing and pattern discovery, and developed NLP models for Topic Extraction and Sentiment Analysis.
- Demonstrated experience in design and implementation of Statistical models, Predictive models, enterprise data model, metadata solution and data life cycle management in both RDBMS and Big Data environments.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked with various Teradata 15 tools and utilities like Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ and other Teradata utilities.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
Environment: Python, SQL, Oracle 12c, Machine Learning, Deep Learning, Netezza, SQL Server, Informatica, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, Erwin r9.6, MariaDB, SAP CRM, HDFS, NLTK, SVM, JSON, XML, MapReduce, AWS S3, EC2, RDS.
Sr. Data Scientist/Engineer
Confidential, Burlington NJ
- Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases under Cloud infrastructure, AWS, EMR, and S3 and designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Worked with business and technical subject matter experts to develop graphical representations of data and work flows from a business and system level perspective and maintain database performance through the resolution of application development and production issues.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Hadoop and MongoDB, Cassandra.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, R, and a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
- Worked on building clustering and predictive models using Spark MLlib to predict fault code occurrences.
- Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
- Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
- Worked on different data formats such as JSON and XML, ran machine learning algorithms in R, used Spark for test data analytics using MLlib, and analyzed the performance to identify bottlenecks.
- Performed NLP using KNIME (text mining and analysis, topic modeling, n-grams, and sentiment analysis) on survey data to understand customer reactions and build an Attrition Model.
- Provided solutions within the Hadoop environment using technologies such as HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper, Storm, and other Big Data technologies.
- Created Control-M and Oozie workflow scripts to automate the data ingestion process into the Hadoop Data Lake, partitioned the data in Hive tables, and persisted the data in Parquet format for faster data loads when extracting data using Tableau.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, regression models, neural networks, SVM, and clustering to identify Volume using the scikit-learn package, R, and MATLAB.
- Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
- Developed predictive data models in R using techniques like regression, clustering, decision trees, and random forests.
- Created various B2B predictive and descriptive analytics using R and Tableau and performed data cleaning and data preparation tasks to convert data into a meaningful data set using R.
- Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees, and Support Vector Machines.
- Worked with ETL developer for design and development of Extract, Transform and Load processes for data integration projects to build data marts.
- Performed extensive data discovery and research activities to draw meaningful insights into data and trends from public sources using Tableau for visualization.
- Involved in Building predictive models to identify High risk cases using Regression and Machine learning techniques by using SAS and R and Performed Data analysis, statistical analysis, generated reports, listings and graphs using SAS Tools-SAS/Base, SAS/Macros and SAS/Graph, SAS/SQL, SAS/Connect, SAS/Access.
- Extensively Used Sqoop to import/export data between RDBMS and hive tables, incremental imports and created Sqoop jobs for last saved value.
- Created partitioned and bucketed tables in Hive and involved in creating Hive internal and external tables, loading with data and writing hive queries which involves multiple join scenarios.
- Used MLlib, Spark's Machine learning library to build and evaluate different models and performed K-means clustering, Multivariate analysis and Support Vector Machines in R.
- Used external loaders like MultiLoad, TPump and FastLoad to load data into the Teradata 14.1 database.
- Assisted in model building and model training for conversational analytics leveraging Natural Language Processing (NLP).
- Used an S3 bucket to store the JARs and input datasets and used DynamoDB to store the processed output from the input data set.
- Used Spark for test data analytics using MLlib, analyzed the performance to identify bottlenecks, and used supervised learning techniques such as classifiers and neural networks to identify patterns in these data sets.
- Developed Tableau visualizations and dashboards using Tableau Desktop, and built Tableau workbooks from multiple data sources using Data Blending.
Environment: R 3.x, MDM, Machine Learning, Deep Learning, QlikView, MLlib, PL/SQL, Tableau, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, SQL Server, Scala NLP, SSMS, Erwin 9.5.2, ERP, CRM, Matlab, Netezza, SAS, Cassandra, SQL, AWS, SSRS, Informatica, Pig, Spark, R Studio, MongoDB, Mahout, Hive.
Sr. Data Engineer/Modeler
- Used R, SQL to create Statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random forest models, Decision trees, Support Vector Machine for estimating the risks.
- Managed data operations team and collaborated with data warehouse developers to meet business user needs, promote data security, and maintain data integrity.
- Used R and Python for Exploratory Data Analysis, A/B testing, ANOVA tests and hypothesis tests to compare and identify the effectiveness of Creative Campaigns.
- Used Hive to create ORC-formatted tables and used ADF for data orchestration to Azure databases; data was copied from ADLS to Azure SQL database using ADF pipelines invoked through PowerShell scripting.
- Implemented public segmentation with unsupervised machine learning, using the k-means algorithm in PySpark.
- Used ETL Tools for masking and cleaning data and mined data from various sources and performed extensive Data Validation, Data Verification against Data Warehouse and performed debugging of the SQL-Statements and stored procedures for business scenarios.
- Developed ADF pipelines to move data from on-premises source systems to COSMOS, from COSMOS (with data transformation) to Azure Warehouse (staging), and from Azure Warehouse to Azure ML (for scoring), appending scores back to the data in Azure Warehouse.
- Involved in creating a Data Lake by extracting the customer's Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
- Developed Python code for data analysis (also using NumPy and SciPy) and curve fitting, used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL and MLlib libraries.
- Worked on predictive and what-if analysis using R from HDFS and successfully loaded files to HDFS from Teradata, and loaded from HDFS to HIVE.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed ETL mappings, testing, correction and enhancement and resolved data integrity issues and coordinated multiple OLAP and ETL projects for various data lineage and reconciliation.
- Analyzed data and predicted end-customer behaviors and product performance by applying machine learning algorithms using Spark MLlib, and used NLTK, Stanford NLP, and RAKE for data preprocessing, entity extraction and keyword extraction.
- Performed transformations of data using Spark and Hive according to business requirements for generating various analytical datasets.
- Analyzed the bug reports in BO reports by running similar SQL queries against the source system(s) to perform root-cause analysis.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform and load data to and from different sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool.
- Used concepts of Data Modeling Star Schema/Snowflake modeling, FACT & Dimensions tables and Logical & Physical data modeling.
- Wrote code using Teradata analytical functions and Teradata BTEQ SQL, and wrote UNIX scripts to validate, format and execute the SQL in a UNIX environment.
- Created numerous dashboards in Tableau Desktop based on the data collected from zonal and compass, while blending data from MS Excel and CSV files with MS SQL Server databases.
Environment: Machine Learning, Deep Learning, PySpark, Azure ADLS, Azure Data Factory, Azure Blob, Azure DW, Azure SQL, Azure Databricks, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, Scala NLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.
Sr. Data Modeler/Data Analyst
- Worked with data compliance and data governance teams to maintain data models, Metadata, and Data Dictionaries, and to define source fields and their definitions.
- Performed Source System Analysis, database design, and data modeling for the warehouse layer using MLDM concepts and for the package layer using Dimensional modeling.
- Documented logical, physical, relational and dimensional data models and designed the Data Marts in dimensional data modeling using star and snowflake schemas.
- Transformed the Logical Data Model into a Physical Data Model in ER Studio, ensuring the Primary Key and Foreign Key relationships in the PDM, consistency of definitions of Data Attributes, and Primary Index considerations.
- Conducted several Physical Data Model-Training sessions with the ETL Developers. Worked with them on day-to-day basis to resolve any questions on Physical Model.
- Extensively developed Oracle 10g stored packages, procedures, functions and database triggers using PL/SQL for ETL processes, data handling, logging, and archiving, and to perform Oracle back-end validations for batch processes.
- Used Netezza SQL, Stored Procedures, and NZload utilities as part of the DWH appliance framework.
- Developed normalized Logical and Physical database models to design OLTP system for education finance applications.
- Worked with the UNIX team and installed TIDAL job scheduler on QA and Production Netezza environment.
- Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Designed and documented Use Cases, Activity Diagrams, Sequence Diagrams, OOD (Object Oriented Design) using UML and Visio.
- Worked in development and maintenance using Oracle SQL, PL/SQL, SQL*Loader, and Informatica PowerCenter 9.1.
- Involved in designing the ETL process to extract, transform and load data from the OLTP Oracle database system to the Teradata data warehouse.
- Involved in the design and development of user interfaces and customization of Reports using Tableau and OBIEE and designed cubes for data visualization, mobile/web presentation with parameterization and cascading.
- Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules.
- Created SSIS Packages using Pivot Transformation, Execute SQL Task, Data Flow Task, etc. to import data into the data warehouse.
- Performed administrative tasks, including creation of database objects such as databases, tables, and views, using SQL DCL, DDL, and DML requests.
- Used Erwin for reverse engineering to connect to existing database and ODS to create graphical representation in the form of Entity Relationships and elicit more information.
- Created high level ETL design document and assisted ETL developers in the detail design and development of ETL maps using Informatica.
- Involved in development and implementation of SSIS, SSRS and SSAS application solutions for various business units across the organization.
- Built and published customized interactive reports and dashboards, and scheduled reports using Tableau Server.
Environment: ER Studio, Teradata, Oracle 10g, Hadoop, HDFS, Pig, Hive, MapReduce, PL/SQL, UNIX, Informatica PowerCenter, MDM, SQL Server, Netezza, DB2, Tableau, Aginity, Architecture, SAS/Graph, SAS/SQL, SAS/Connect and SAS/Access.