- 11+ years of working experience as a Data Scientist, Data Analyst, and Business Intelligence (BI) professional, with high proficiency in Big Data Analytics, Predictive Modeling, Text Mining, and Machine Learning.
- Experience in Analytics, Visualization, Data Modeling, Data Mining, Text Mining, Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, Data Export and Reporting.
- Experienced Data Scientist with hands-on experience in developing Machine Learning models.
- Strong focus on R and Python Statistical Analysis with ML techniques in challenging environments.
- Experienced in data analysis, designing experiments, interpreting data behavior, business decision support.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand how users interact with core/business products.
- Experience with a diverse set of methods including, but not limited to, machine learning, deep learning, Bayesian algorithms, regression, cluster analysis, decision trees, time series, resampling and regularization, and NLP.
- Strong working experience with various Python libraries such as NumPy and SciPy for mathematical calculations, Pandas for data preprocessing/wrangling, Matplotlib and Seaborn for data visualization, Scikit-learn for machine learning, Theano, TensorFlow, and Keras for deep learning, and NLTK for NLP.
- Experience in the latest Big Data technologies like Spark, Hadoop, Impala, HBase, Hive, MapReduce, and Pig with ETL, and NoSQL databases like MongoDB and Cassandra.
- Experienced in SQL Queries and optimizing the queries in Oracle, SQL Server, DB2, PostgreSQL, and Teradata.
- Experience in using Tableau, creating dashboards, and quality storytelling.
- Ability to provide wing-to-wing analytic support including pulling data, preparing analysis, interpreting data, making strategic recommendations, and presenting to client/product teams.
- Excellent visual representation of data and communication of analysis with Python, Tableau, and WEKA to all levels of business users within the organization; automated analyses and built analytics data pipelines via SQL and a Python-based ETL framework.
- Experience with common Machine Learning Algorithms like Regression, Classification and Ensemble models.
- Experience in analytics, visualization, client meetings on business priorities, SLA management, modeling, reporting, and providing actionable insights to managers and C-level executives.
- Good knowledge of establishing classification and forecast models, automating processes, text mining, sentiment analysis, statistical models, risk analysis, platform integrations, optimization models, models to improve user experience, and A/B testing using R, SAS, Python, Tableau, etc.
- Strong familiarity with building data models by extracting and stitching data from various sources, and with integrating systems with R to support efficient data analysis.
- Experience using machine learning models such as random forest, KNN, SVM, and logistic regression, and packages such as ggplot2, dplyr, lm, e1071, rpart, randomForest, nnet, and tree in R; PROC (pca, dtree, corr, princomp, gplot, logistic, cluster) in SAS; and NumPy, scikit-learn, and pandas in Python.
- Experience in developing complex database objects like Stored Procedures, Functions, Packages, and Triggers using SQL and PL/SQL.
- Experience in installing, configuring, and maintaining databases such as PostgreSQL and Oracle, as well as Big Data HDFS systems.
- Expertise in Model Development, Data Mining, Predictive Modeling, Descriptive Modeling, Data Visualization, Data Cleaning and Management, and Database Management.
- Expertise in applying data mining techniques and optimization techniques and proficient in Machine Learning, Data/Text Mining, Statistical Analysis and Predictive Modeling.
- Good communication, problem solving and interpersonal skills, versatile team player as well as independent contributor with adaptability and understanding of business processes.
- Highly motivated team player with analytical, organizational and technical skills, unique ability to adapt quickly to challenges and changing environment.
- Excellent interpersonal skills, proven team player with an analytical bent to problem solving and delivering in high-stress environments.
- Knowledge and experience in the role of enterprise data model and enterprise data governance.
- Significant exposure to Talend Open Studio; well versed in Autosys job scheduling and in analyzing data flows for validation and data quality issues.
- Experience in designing star schemas and snowflake schemas for online analytical processing (OLAP) and online transactional processing (OLTP) systems.
- Experience in evaluating data sources, with a strong understanding of data warehouse and data mart design, BI, and client/server applications.
- Experience in developing SQL, PL/SQL, NoSQL, and T-SQL scripts, stored procedures, triggers, functions, and packages for business logic implementation.
- Experience in building data marts, data quality reports, data profiling, data analysis and data archives.
- Experience with various data extraction and data wrangling techniques in R and Python using varied data sets.
- Worked extensively on forward engineering, reverse engineering, and naming standards processes. Created DDL scripts for implementing data modeling changes.
- Maintained data integrity and consistency by passing data through several analysis steps such as parsing and prototyping.
- Experience in creating tables, constraints, views, and materialized views using ERwin, ER Studio, Power Designer, and SQL Modeler.
- Experience in creating partitions, indexes, indexed views to improve the performance, reduce contention and increase the availability of data.
- Optimized procedures and functions utilized by ETL packages to reduce ETL process time.
- Having good knowledge in normalization (1NF, 2NF, 3NF) and de-normalization techniques for improved database performance.
- Experience in data transformation and data mapping from source to target database schemas, and in data cleansing.
Languages: Java, Python, R
Packages: ggplot2, caret, dplyr, RWeka, gmodels, twitteR, NLP, reshape2, rjson, pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, Beautiful Soup, Rpy2.
Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL.
Data Modelling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.
Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, MongoDB, Cassandra.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Business Intelligence, SSRS.
ETL Tools: Informatica Power Center, SSIS.
Version Control Tools: SVN, GitHub.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
BI Tools: Tableau, Tableau Server, Tableau Reader, QlikView, SAP Business Intelligence, Amazon Redshift.
Operating System: Windows, Linux, Unix, Macintosh HD, Red Hat.
Confidential, Stamford, CT
Data Analyst/Data Scientist
- Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Participated in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization and performed Gap analysis.
- Used a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Developed jobs in Talend Enterprise edition from stage to source, intermediate, conversion, and target.
- Performed Source System Analysis, database design, data modeling for the warehouse layer using MLDM concepts and package layer using Dimensional modeling.
- Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized algorithms such as linear regression, multivariate regression, naive Bayes, random forests, K-means, and KNN for data analysis.
- Used machine learning libraries and frameworks such as NumPy, SciPy, Pandas, Theano, Caffe, Scikit-learn, Matplotlib, Seaborn, TensorFlow, Keras, NLTK, PyTorch, Gensim, urllib, and Beautiful Soup.
- Demonstrated experience in design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
- Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
- Worked on outlier identification with Gaussian Mixture Models using Pandas, NumPy and matplotlib.
- Adopted feature engineering techniques with 200+ predictors in order to find the most important features for the models.
- Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
- Hands-on experience with database design, relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF), and de-normalization of databases.
- Worked on customer segmentation using an unsupervised learning technique - clustering.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
Environment: Python, SSRS, MLlib, Regression, Cluster analysis, Scala, NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, Random Forest, OLAP, MariaDB, HDFS, ODS, NLTK, SVM, JSON, Tableau, XML, Cassandra, MapReduce, AWS.
Confidential, Holmdel, NJ
Data Analyst/Data Scientist
- Built models using statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Worked with data compliance teams, data governance team to maintain data models, Metadata, data Dictionaries, define source fields and its definitions.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Set up storage and data analysis tools in Amazon Web Services cloud computing infrastructure.
- Developed REST APIs using Python Flask for several custom NLP modules, containerized them using Docker, and deployed them to production using cloud services like AWS.
- Created Talend jobs to load data into various Oracle tables. Utilized Oracle stored procedures and wrote some Java code to capture global map variables and use them in the job.
- A highly immersive Data Science program involving Data Manipulation & Visualization, Web Scraping, Machine Learning, Python programming, SQL, GIT, Unix Commands, NoSQL, MongoDB, Hadoop.
- Transformed the Logical Data Model to a Physical Data Model (PDM) in Erwin, ensuring the Primary Key and Foreign Key relationships in the PDM, consistency of definitions of data attributes, and Primary Index considerations.
- Documented logical, physical, relational and dimensional data models. Designed the Data Marts in dimensional data modeling using star and snowflake schemas.
- Worked with NLP to classify text with data drawn from a big data system. The text categorization involved labeling natural language texts with relevant categories from a predefined set.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Designed and documented Use Cases, Activity Diagrams, Sequence Diagrams, OOD (Object Oriented Design) using UML and Visio.
- Created Hive queries that helped analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics, and processed the data using HQL (similar to SQL) on top of MapReduce.
- Hands-on development and maintenance using Oracle SQL, PL/SQL, SQL*Loader, and Informatica Power Center 9.1.
- Designed the ETL process to extract, transform, and load data from the OLTP Oracle database system to the Teradata data warehouse.
- Created and implemented MDM data model for Consumer/Provider for HealthCare MDM product from Variant.
- Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server.
- Used the Oracle External Tables feature to read data from flat files into Oracle staging tables.
- Analyzed the web log data using the HiveQL to extract number of unique visitors per day, page views, visit duration, most purchased product on website and managed and reviewed Hadoop log files.
- Used Erwin 9.1 for effective model management of sharing, dividing, and reusing model information and design for productivity improvement.
- Designed and developed user interfaces and customization of Reports using Tableau and OBIEE and designed cubes for data visualization, mobile/web presentation with parameterization and cascading.
- Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules.
- Created SSIS Packages using Pivot Transformation, Execute SQL Task, Data Flow Task, etc., to import data into the data warehouse.
- Developed and implemented SSIS, SSRS and SSAS application solutions for various business units across the organization.
Environment: Data Science, Machine Learning, snowflake schemas, ERwin 9.6, Oracle 10g, Hadoop, HDFS, Pig, Hive, MapReduce, PL/SQL, UNIX, Informatica Power Center, MDM, DB2, Tableau, Aginity, Architecture, SAS/Graph, SAS/SQL, SAS/Connect and SAS/Access.
Confidential, West Des Moines, IA
- Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Analyzed large customer data and website click data using Python and Spark in AWS, and built machine learning models to predict consumer purchase intent and optimize marketing efforts.
- Explored and analyzed the customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau.
- Selected deep learning and gradient boosting methods to achieve high performance metrics in cross-validation and on independent test data.
- Performed data imputation using Scikit-learn package in Python.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with Scikit-learn preprocessing.
- Used Python 2.x/3.x (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) to develop a variety of models and algorithms for analytic purposes.
- Experimented with and built predictive models, including ensemble methods such as gradient boosting trees and neural networks in Keras, to predict sales amount.
- Analyzed patterns in customers' shopping habits across different locations, categories, and months using time series modeling techniques.
- Used RMSE/MSE to evaluate different models' performance.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
- Provided Level of Efforts, activity status, identified dependencies impacting deliverables, identified risks, and mitigation plans, identified issues and impacts.
- Designed database solution for applications, including all required database design components and artifacts.
- Developed and maintained Data Dictionary to create Metadata Reports for technical and business purpose using Erwin report designer.
- Generated ad-hoc reports using Crystal Reports 9 and SQL Server Reporting Services (SSRS).
Environment: Python 2.x (Scikit-Learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Tableau, Deep Learning, Machine Learning algorithms (Random Forest, Gradient Boosting tree, Neural network by Keras), GitHub.
Confidential, Wichita, KS
Jr Data Scientist
- Participated in all phases of data acquisition, data cleaning, developing models, validation, and visualization to deliver data science solutions.
- Transformed real-world problems into statistical models based on clients' needs.
- Retrieved data from SQL Server databases by writing SQL queries using stored procedures, temp tables, and views.
- Helped clients design experiments, clean data, analyze data, and apply predictive analytics to build predictive models.
- Connected Database with Jupyter notebook for modeling and Tableau for visualization and reporting.
- Worked on fraud detection analysis on loan applications using the history of loan taking with supervised learning methods.
- Used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python for developing various machine learning models and utilized algorithms such as Logistic regression, Random Forest, Gradient Boost Decision Tree and Neural Network.
- Performed feature engineering such as PCA for high-dimensional datasets and important-feature selection with tree-based models.
- Performed model tuning and selection using cross-validation and parameter tuning to prevent overfitting.
- Used ensemble methods with different bagging and boosting approaches to increase the accuracy of the trained model.
- Managed and tracked project issues and debugged them if needed and developed SQL queries in SQL Server management studio, Toad and generated complex reports for the end users.
Environment: SQL Server 2008, Python 2.x (NumPy/Pandas/Scikit-Learn), Machine Learning algorithms (Logistic regression/Random Forests/KNN/K-Means Clustering/DBSCAN/Ensemble methods), GitHub.
Confidential, Jacksonville, FL
- Handled globally distributed team on confidential product management specifically 'Government Pricing' module and drove the team in resolving various issues on production, enhancements to the system and proposing solutions while mapping to the product road map.
- Wrote custom procedures and triggers to improve performance and maintain referential integrity.
- Optimized queries with modifications in SQL code, removed unnecessary columns and eliminated data discrepancies.
- Normalized tables, established joins, and created indexes wherever necessary, utilizing profiler and execution plans.
- Performed storage management: managed space, tablespaces, segments, extents, rollback segments, and the data dictionary.
- Used SQL*Loader to move data from flat files into an Oracle database.
- Utilized dimensional data modeling techniques and storyboarded ETL processes.
- Developed ETL procedures to ensure conformity, compliance with standards and lack of redundancy, translated business rules and functionality requirements into ETL procedures using Informatica Power Center.
- Performed Unit Testing and tuned the Informatica mappings for better performance.
- Re-engineered existing Informatica ETL processes to improve performance and maintainability.
- Interacted directly with the business teams in arriving at the workarounds or resolutions while involving appropriate groups within Model N.
- Exceeded expectations by collaborating with a testing team in creating and executing the test cases for a giant pharmaceutical company.
- Organized and delegated tasks effectively while working with multiple customers at the same time.
- Demonstrated leadership by guiding newcomers with the best practices of the company.
Environment: Model N Application, Informatica, Oracle 11g, DB2, UNIX, Toad, Putty, HP Quality Center, SSIS, SSAS, SSRS.
Confidential, Pleasanton, CA
- Gathered requirements from users by participating in JAD sessions; conducted a series of meetings with the business system users to gather the requirements for reporting.
- Created and designed conceptual data models using ERwin Data Modeler.
- Created logical and physical data models for dimensional data modeling using best practices to ensure high data quality and reduced redundancy with the IDW standards and guidelines.
- Designed and developed Oracle PL/SQL procedures and UNIX shell scripts for data import/export and data conversions.
- Performed legacy application data cleansing, data anomaly resolution and developed cleansing rule sets for ongoing cleansing and data synchronization.
- Extensively used star schema methodologies in building and designing the logical data model into dimensional models.
- Involved in the project cycle plan for the data warehouse, source data analysis, the data extraction process, and transformation and loading strategy design.
- Worked on database design for OLTP and OLAP systems.
- Designed a STAR schema for sales data involving shared dimensions (conformed) using ERwin Data Modeler.
- Designed and built the OLAP cubes for star schema and snowflake schema using native OLAP service manager.
- Extensively used Teradata utilities (BTEQ, FastLoad, MultiLoad, TPump) to import/export and load data from Oracle and flat files.
- Performed data analysis tasks on warehouses from several sources like Oracle, DB2, and XML etc. and generated various reports and documents.
- Created database maintenance plans for SQL Server performance, covering database integrity checks, database statistics updates, and re-indexing.
- Involved in workflows and monitored jobs using Informatica tools.
- Used SSIS to create ETL packages to validate, extract, transform and load data into data warehouse and data mart.
- Developed stored procedures on Netezza and SQL server for data manipulation and data warehouse population.
- Actively involved in normalization (3NF) & de-normalization of database.
- Developed multiple processes for Daily Data Ingestion from Client associated data vendors and Production Team, Client site employees using SSIS and SSRS.
- Created multiple custom SQL queries in Teradata SQL workbench to prepare the right data sets for Tableau dashboards.
- Worked with the team in the mortgage domain to implement designs based on free cash flow, acquisition, and capital efficiency.
- Resolved the data type inconsistencies between the source systems and the target system using the mapping documents and analyzing the database using SQL queries.
Environment: ERwin 9.1, Data Modeling, Informatica Power Center 9.6, Teradata SQL, PL/SQL, BTEQ, DB2, Oracle, Agile, ETL, Tableau, Cognos, Business Objects, Netezza, UNIX, SQL Server 2008, TOAD, SAS, SSRS, SSIS, T-SQL, etc.