- Highly efficient Data Scientist/Data Engineer with 8+ years of experience in data analysis, statistical analysis, machine learning, predictive modeling, and data mining with large structured and unstructured data sets in the banking, automobile, food, and market research sectors.
- Involved in the entire data science project life cycle, including data extraction, data cleansing, transformation, modeling, data visualization, and documentation.
- Developed predictive models using regression, multiple linear regression, logistic regression, decision trees, random forests, Naive Bayes, cluster analysis, association rules/market basket analysis, and neural networks.
- Experience in using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, SciPy, scikit-learn, BeautifulSoup, and rpy2.
- Extensive experience with statistical programming languages such as R and Python.
- Proficient in predictive modeling, data mining methods, factor analysis, ANOVA, hypothesis testing, the normal distribution, and other advanced statistical and econometric techniques.
- Worked extensively on data analysis using RStudio, SQL, Tableau, and other BI tools.
- Expertise in Exploratory Data Analysis (EDA), using numerical summaries and relevant visualizations to guide feature engineering and assess feature importance.
- Skilled in using dplyr in R and pandas in Python for exploratory data analysis.
- Skilled in using Principal Component Analysis for dimensionality reduction.
- Extensive hands-on experience with structured, semi-structured, and unstructured data using R, Python, Spark MLlib, SQL, and scikit-learn.
- Strong in ETL, data warehousing, data store concepts, and data mining.
- Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
- Knowledge of Twitter text analytics using R functions such as sapply, Corpus, tm_map, and searchTwitter, and packages such as twitteR, RCurl, tm, and wordcloud.
- Proficient in SAS/BASE, SAS EG, SAS/SQL, SAS MACRO, SAS/ACCESS.
- Proficient in writing complex SQL, including stored procedures, triggers, joins, and subqueries.
- Extensive working experience with Python, including scikit-learn, pandas, and NumPy.
- Experienced in Python data manipulation for loading and extraction, and with Python libraries such as NumPy, SciPy, and pandas for data analysis and numerical computation.
- Skilled in data wrangling, correlation analysis, and handling multicollinearity, missing values, and imbalanced data.
- Proficient in statistical modeling and machine learning techniques (linear regression, logistic regression, decision trees, random forest, SVM, k-nearest neighbors, XGBoost) for forecasting/predictive analytics, segmentation methodologies, regression-based models, hypothesis testing, factor analysis/PCA, and ensembles.
- Experience in designing visualizations using Tableau and in publishing and presenting dashboards and storylines on web and desktop platforms.
- Experience in using the Git version control system.
- Knowledge of time series analysis using AR, MA, ARIMA, ARCH, and GARCH models.
- Good knowledge of Apache Hive, Sqoop, Flume, Hue, and Oozie.
- Knowledge of Big Data with Hadoop 2, HDFS, MapReduce, and Spark.
- Knowledge of star schema and snowflake schema for data warehouse and ODS architecture.
- Good knowledge of Amazon Web Services (AWS), including Amazon SageMaker and Amazon S3, for machine learning.
- Collaborated with data warehouse developers to meet business user needs, promote data security, and maintain data integrity.
Data Analytics Tools/Programming: Python (NumPy, SciPy, pandas, Gensim, Keras), R (caret, RWeka, ggplot2), MATLAB, Microsoft SQL Server, Oracle PL/SQL.
Analysis & Modelling Tools: Erwin, Sybase PowerDesigner, Oracle Designer, Rational Rose, ER/Studio, TOAD, MS Visio, SAS.
Data Visualization: Tableau, Visualization packages, Microsoft Excel.
Big Data Tools: Hadoop, MapReduce, Sqoop, Pig, Hive, NoSQL, Cassandra, MongoDB, Spark, Scala.
ETL Tools: Informatica PowerCenter, DataStage 7.5, Ab Initio, Talend.
OLAP Tools: MS SQL Analysis Manager, DB2 OLAP, Cognos PowerPlay.
Databases: Oracle 12c/11g/10g/9i/8i/8.0/7.x, Teradata 14.0, DB2 UDB 8.1, MS SQL Server 2008/2005, Netezza 4.0, Sybase ASE 12.5.3/15, Informix 9, AWS RDS.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.
Methodologies: Ralph Kimball, COBOL.
Reporting Tools: Business Objects XI R2/6.5/5.0/5.1, Cognos Impromptu 7.0/6.0/5.0, Informatica Analytics Delivery Platform, MicroStrategy, SSRS, Tableau.
Tools: MS-Office suite (Word, Excel, MS Project and Outlook), VSS.
Programming Languages: SQL, T-SQL, Base SAS and SAS/SQL, HTML, XML.
Operating Systems: Windows 7/8, UNIX (Sun Solaris, HP-UX), Windows NT/XP/Vista, MS-DOS.
Confidential, San Francisco, CA
- Worked closely with business, data governance, SMEs, and vendors to define data requirements.
- Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases on cloud infrastructure using AWS, EMR, and S3.
- Selected statistical algorithms (two-class logistic regression, boosted decision tree, decision forest classifiers, etc.).
- Actively participated in data modeling, data warehousing, and complex database design.
- Designed and developed NLP models for sentiment analysis.
- Developed models using NLP to enhance the performance of media service encoders.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Worked with Teradata 14 utilities such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT), and BTEQ.
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; also performed gap analysis.
- Interpreted business problems and provided solutions using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
- Involved in creating a data lake by extracting the customer's big data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
- Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib.
- Created a high-level ETL design document and assisted ETL developers in the detailed design and development of ETL maps using Informatica.
- Adept at using the SAS Enterprise suite, Python, and Big Data technologies, including Hadoop, Hive, Sqoop, Oozie, Flume, and MapReduce.
- Used R and SQL to build statistical models involving multivariate regression, linear regression, logistic regression, PCA, random forests, decision trees, and support vector machines to estimate the risks of welfare dependency.
- Helped in migration and conversion of data from the Sybase database into Oracle database, preparing mapping documents and developing partial SQL scripts as required.
- Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from legacy Oracle and SQL Server database systems.
- Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
- Strong SQL Server and Python programming skills, with experience in working with functions.
- Worked on predictive and what-if analysis using R on data from HDFS; loaded files from Teradata to HDFS and from HDFS to Hive.
- Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Analyzed data and predicted end-customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
- Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
Environment: R, Python, Machine Learning, Teradata 14, Hadoop MapReduce, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Informatica MDM, Cognos, SQL Server 2012, DB2, SPSS, T-SQL, PL/SQL, Flat Files, XML.
Confidential, Dallas, Texas
- Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
- Created deep learning models to detect various objects.
- Designed the prototype of the data mart and documented its possible outcomes for end users.
- Analyzed various aspects of the data to understand user behavior.
- Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
- Developed various QlikView data models by extracting and using data from various sources, including DB2, Excel, flat files, and big data.
- Designed the procedures for getting data from all systems into the data warehouse system.
- Applied various machine learning algorithms and statistical modeling techniques, including decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python and MATLAB.
- Translated marketing and sales objectives into SQL scripts and data mining initiatives.
- Collected data from various sources and collaborated on EDA and visualization.
- Built prediction models using linear and ridge regression to predict future customers based on historical data. Developed the model with 3 million historical data points and evaluated it with F-score and adjusted R-squared measures.
- Built customer profiling models using K-means and K-means++ clustering algorithms to enable targeted marketing. Developed the model with 1.4 million data points and used the elbow method to find the optimal value of K, with the sum of squared errors as the error measure.
- Designed and implemented a probabilistic churn prediction model on data for 80k customers to predict the probability of customer churn using logistic regression in Python. The client used the results to finalize the list of customers to offer a discount.
- Implemented dimensionality reduction using Principal Component Analysis and k-fold cross-validation as part of model improvement.
- Implemented Pearson's correlation and maximum variance techniques to find the key predictors for the regression models.
- Performed data analysis using Exploratory Data Analysis techniques in Python and R, including generating univariate and multivariate graphical plots.
- Performed correlation analysis (chi-square and Pearson correlation tests).
- Coordinated with onsite actuaries, senior management, and the client to interpret and report the results, assisting the incorporation of results into business scenarios.
- Implemented various big data pipelines to build machine learning models.
- Analyzed various customer behaviors on the product to find the root cause of problems.
Confidential, Durham, NC
- Architected and designed solutions for complex business requirements, including data processing, analytics, ETL, and reporting processes, to improve the performance of data loads and processes.
- Developed a high-performance, scalable data architecture solution that incorporates a matrix of technologies to relate architectural decisions to business needs.
- Conducted strategy and architecture sessions and delivered artifacts such as MDM strategy (current state, interim state, and target state) and MDM architecture (conceptual, logical, and physical) at a detailed level.
- Conducted studies and rapid plots, and used advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
- Currently implementing a chatbot POC using OpenNLP, machine learning, and deep learning.
- Owned and managed all changes to the data models; created data models, solution designs, and data architecture documentation for complex information systems.
- Designed and developed a dimensional data model on Redshift to provide an advanced selection analytics platform, and developed simple to complex MapReduce jobs using Hive and Pig.
- Worked with AWS Redshift and RDS to implement models and host data.
- Worked with SMEs and other stakeholders to determine the requirements and identify entities and attributes to build conceptual, logical, and physical data models.
- Worked with data warehousing methodologies and dimensional data modeling techniques such as star/snowflake schema using Erwin 9.1.
- Designed and implemented near-real-time ETL and analytics using the Redshift database.
- Extensively used Netezza utilities like NZLOAD and NZSQL and loaded data directly from Oracle to Netezza without any intermediate files.
- Created a logical design and physical design in Erwin.
- Implemented Hive generic UDFs to incorporate business logic into Hive queries.
- Created Hive tables and worked with them using HiveQL.
- Developed data mapping, data governance, and transformation and cleansing rules for the Master Data Management architecture involving OLTP and ODS, and generated ad-hoc reports using OBIEE.
- Worked on the process that loads data from the CMS to the EMS library database, and was involved in data modeling and in providing Teradata-related technical solutions to the team.
- Built a real-time event analytics system using a dynamic Amazon Redshift schema.
- Wrote SQL queries, PL/SQL procedures/packages, triggers and cursors to extract and process data from various source tables of database.
- Determined customer satisfaction and helped enhance customer experience using NLP.
- Created Hive tables, loaded transactional data from Teradata using Sqoop, and created and ran Sqoop jobs with incremental load to populate Hive external tables.
- Worked with AWS cloud technologies such as Redshift, S3, and EC2, and extracted data from Oracle Financials and the Redshift database.
- Designed and customized data models for a data warehouse supporting real-time data from multiple sources. Performed requirements elicitation and data analysis, and implemented ETL best practices.
- Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
- Created data models for AWS Redshift and Hive from dimensional data models.
- Developed complex SQL scripts on Teradata to create a BI layer on the data warehouse for Tableau reporting.
- Extensively used ETL methodology to support data extraction, transformation, and loading in a complex EDW using Informatica.
- Created ActiveBatch jobs to load data from distribution servers to a PostgreSQL database using *.bat files, and worked on a CDC schema to keep track of all transactions.
Data Analyst/Data Modeler
- Designed and developed data warehouse architecture, data modeling/conversion solutions, and ETL mapping solutions within structured data warehouse environments.
- Reconciled data and ensured data integrity and consistency across various organizational operating platforms for business impact.
- Optimized Python code for a variety of data mining and machine learning tasks.
- Used Erwin for model management, including sharing, dividing, and reusing model information and designs to improve productivity.
- Involved in preparing Logical Data Models/Physical Data Models.
- Worked extensively in both Forward Engineering as well as Reverse Engineering using data modeling tools.
- Provided and applied quality assurance best practices for data mining/analysis services.
- Involved in the creation and maintenance of the data warehouse and repositories containing metadata.
- Used the Informatica ETL tool to populate the database and transform data from the old database to the new database using Oracle and SQL Server.
- Identified inconsistencies or issues in incoming HL7 messages, documented them, and worked with clients to resolve the data inconsistencies.
- Resolved data type inconsistencies between the source systems and the target system using the mapping documents and by analyzing the database with SQL queries.
- Extensively used both star schema and snowflake schema methodologies in building and designing the logical data model in both Type 1 and Type 2 dimensional models.
- Worked with the DBA group to create a best-fit physical data model from the logical data model using forward engineering.
- Worked with the Data Steward team on designing, documenting, and configuring Informatica Data Director to support management of MDM data.
- Conducted HL7 integration testing with client systems, testing business scenarios to ensure that information flows correctly between applications.
- Extensively worked with MySQL and Redshift performance tuning and reduced the ETL job load time by 31% and DW space usage by 50%.
- Used Teradata SQL Assistant, Teradata Administrator, PMON, and data load/export utilities such as BTEQ, FastLoad, MultiLoad, FastExport, and TPump in UNIX/Windows environments, and ran batch processes for Teradata.
- Created dimensional models based on star schemas and designed them using Erwin.
- Carried out HL7 interface unit testing to confirm that HL7 messages sent or received by each application conform to the HL7 interface specification.
- Used tools such as SAS/ACCESS and SAS/SQL to create and extract Oracle tables.
- Enabled SSIS package configurations so that connection strings could be passed to connection managers and values to package variables explicitly, based on environment.
- Responsible for implementing HL7 to build Orders, Results, ADT, and DFT interfaces for client hospitals.
- Connected to Amazon Redshift through Tableau to extract live data for real-time analysis.
- Developed SQL queries to fetch complex data from different tables in remote databases using joins, database links, and bulk collects.
Environment: Erwin, Oracle, SQL Server 2008, Power BI, MS Excel, Netezza, Agile, MS Visio, Rational Rose, Requisite Pro, SAS, SSIS, SSRS, Windows 7, PL/SQL, MDM, Teradata, MS Office, MS Access, SQL, Tableau, Informatica, Amazon Redshift.
- Designed logical and physical data models for multiple OLTP and Analytic applications.
- Involved in analyzing business requirements and keeping track of data available from various data sources; transformed and loaded the data into target tables using Informatica PowerCenter.
- Extensively used the Erwin design tool and Erwin Model Manager to create and maintain the data mart.
- Extensively used star schema methodologies in building and designing the logical data model into dimensional models.
- Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
- Created stored procedures using PL/SQL and tuned the databases and backend processes.
- Involved in data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
- Performed database performance tuning, including indexing, optimizing SQL statements, and monitoring the server.
- Developed Informatica mappings, sessions, and workflows, and wrote PL/SQL code for effective and optimized data flow.
- Wrote SQL queries, dynamic queries, subqueries, and complex joins for generating complex stored procedures, triggers, user-defined functions, views, and cursors.
- Created a new HL7 interface based on the requirements using XML and XSLT.
- Created UNIX scripts for file transfer and file manipulation, and used SDLC and Agile methodologies such as Scrum.
- Scheduled and monitored DataStage jobs, analyzed the performance of individual stages, and ran multiple instances of a job using DataStage Director.
- Led the successful integration of HL7 lab interfaces, used SQL expertise to integrate HL7 interfaces, and carried out detailed test cases on the newly built HL7 interface.
- Wrote simple and advanced SQL queries and scripts to create standard and ad-hoc reports for senior managers.
Environment: SQL Server, UML, Business Objects 5, Teradata, Windows XP, SSIS, SSRS, Embarcadero ER/Studio, Erwin, DB2, Informatica, HL7, Oracle, Query Management Facility (QMF), DataStage, ClearCase forms, SAS, Agile, UNIX, and Shell Scripting.
- Developed data mapping, data governance, and transformation and cleansing rules for the Master Data Management architecture involving OLTP and ODS.
- Created new conceptual, logical, and physical data models using Erwin and reviewed these models with the application team and modeling team.
- Fulfilled numerous data pull requests using SQL for analysis, and created databases for OLAP metadata catalog tables using forward engineering of models in Erwin.
- Enforced referential integrity in the OLTP data model for consistent relationship between tables and efficient database design.
- Proficient in importing/exporting large amounts of data from files to Teradata and vice versa.
- Identified and tracked the slowly changing dimensions, heterogeneous sources and determined the hierarchies in dimensions.
- Used ODBC for connectivity to Teradata and MS Excel to automate reports and graphical representations of data for the business and operational analysts.
- Extracted data from existing data sources, and developed and executed departmental reports for performance and response purposes using Oracle SQL and MS Excel.
- Extracted data from existing data sources, performed ad-hoc queries, and used BTEQ to run Teradata SQL scripts to create the physical data model.
Environment: UNIX scripting, Oracle SQL Developer, SSRS, SSIS, Teradata, Windows XP, SAS data sets.