- Highly efficient Data Scientist/Data Engineer with around 7 years of experience in Data Analysis, Statistical Analysis, Machine Learning, Deep Learning, and Data Mining with large sets of structured and unstructured data in domains such as banking, travel services, and manufacturing, with strong functional knowledge of business processes and the latest market trends.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and applied algorithms such as Linear Regression, Multivariate Regression, Naive Bayes, Random Forests, K-Means, and KNN for data analysis.
- Responsible for the design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Developed Natural Language Processing (NLP) and machine learning algorithms for the development of key features.
- Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis, with good knowledge of Recommender Systems.
- Proficient in statistical modeling techniques (Linear, Logistic, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian, XGBoost) in Forecasting/Predictive Analytics, Segmentation methodologies, regression-based models, Hypothesis testing, Factor analysis/PCA, and Ensembles.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Wrote Python modules to extract/load asset data from the MySQL source database.
- Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
- Highly skilled in using visualization tools like Tableau, ggplot2, and D3.js for creating dashboards.
- Worked with and extracted data from various database sources like Oracle, SQL Server, and DB2; regularly used JIRA and other internal issue trackers for project development.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Knowledge of working with Proofs of Concept (PoCs), SAP HANA/Ariba, and gap analysis; gathered necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Adept in statistical programming languages like R and Python, including Big Data technologies like Hadoop and Hive; also worked on LSTM and recurrent neural networks and D3.js.
- Used supply chain strategy to improve sales volume, brand strategy, and marketing methods and campaigns.
- Proficient in exploratory data analysis using the Python libraries Pandas and Matplotlib and the R packages dplyr and ggplot2.
- Experience working with data modeling tools like Erwin, PowerDesigner, and ER/Studio.
- Experience in designing Star schema and Snowflake schema for Data Warehouse and ODS architectures.
- Proficient in writing SQL queries for various RDBMSs such as Microsoft SQL Server, MySQL, PostgreSQL, Teradata, and Oracle, and for NoSQL databases such as MongoDB, HBase, and Cassandra to handle unstructured data.
- Practically engaged in evaluating model performance using A/B testing, K-fold cross-validation, R-squared, CAP curves, confusion matrices, ROC plots, the Gini coefficient, and grid search.
- Excellent knowledge of the Software Development Life Cycle (SDLC), with good working knowledge of testing methodologies, disciplines, tasks, resources, and scheduling.
- Excellent knowledge of data analysis, data validation, data cleansing, data verification, and identifying data mismatches.
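The model-evaluation workflow noted above (K-fold cross-validation with ROC-based metrics) could be sketched as follows; this is a minimal illustration on synthetic data, and the dataset, model, and parameters are hypothetical stand-ins for the real projects:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a real project dataset (hypothetical)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated ROC AUC, one of the evaluation metrics listed above
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Grid search over hyperparameters would follow the same pattern via `sklearn.model_selection.GridSearchCV`.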
Data Modeling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Programming Languages: Oracle PL/SQL, Python, Scala, SQL, T-SQL, UNIX shell scripting.
Scripting Languages: Python (NumPy, SciPy, Pandas, Gensim, Keras), R (caret, Weka, ggplot2)
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, Sqoop, Oozie, Spark and Scala.
Reporting Tools: Crystal reports, Business Intelligence, SSRS, Business Objects, Tableau.
ETL: Informatica Power Centre, SSIS.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure.
Data Warehouse Tools: MS Office suite (Word, Excel, MS Project and Outlook), Spark MLlib, Scala NLP, MariaDB, Azure, SAS.
Databases: Oracle, Teradata, Netezza, Microsoft SQL Server, MongoDB, HBase, Cassandra.
Operating Systems: Windows, UNIX, MS DOS.
Confidential, New York, NY
- Wrote SQL/T-SQL queries to transform data and load it into e-discovery tools (Ringtail/Relativity); also prepared custom deliverables according to the clients' needs.
- Generated document productions using e-discovery tools, including LFP, OPT, and DAT (metadata) files, according to the requirements specified by the law firms' business analysts and project managers.
- Developed stored procedures and PySpark scripts to manipulate the data.
- Built ETL pipeline over GCP to automate the data migration tasks.
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, and date and time.
- Applied various machine learning algorithms and statistical models like decision trees, regression models, neural networks, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
- Evaluated models using Cross Validation, Log loss function, ROC curves and used AUC for feature selection.
- Analyzed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model had a low False Positive Rate.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Used MLlib, Spark’s Machine learning library to build and evaluate different models.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed MapReduce pipeline for feature extraction using Hive.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.
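The K-Means outlier identification described above can be sketched with scikit-learn; the synthetic data, cluster count, and 95th-percentile cutoff below are illustrative choices, not the project's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D feature matrix standing in for real traffic data (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
# Distance of each point to its assigned cluster centre
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# Flag the farthest 5% of points as outliers (threshold is a modeling choice)
threshold = np.quantile(dists, 0.95)
outliers = X[dists > threshold]
print(len(outliers))  # 10 of 200 points flagged
```

Points far from every centroid fit no cluster well, which is what makes distance-to-centroid a simple outlier score.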
Confidential, Fairfax City, VA
- Performed feature selection, data visualization and data exploration using python and Apache Spark.
- Implemented and tested predictive models on Apache Spark; created a documentation guide on getting started with Apache Spark in Python for our lab's reference.
- Built an NLP/machine learning model (multi-task learning algorithm) for forecasting crime based on online news articles, which helped identify features, their importance, and patterns, and improved the recall of the model by 75%.
- Used HTML text-parsing tools to collect data from various web sources.
- Designed an energy-efficient machine learning model by performing approximation for both computation and memory accesses, achieving 1.40X energy benefits with virtually no loss (< 0.5%); integrated predictive models with Dialogflow on Google Cloud Platform.
- Worked extensively with clinical data, extracting and transforming it into the data warehouse.
- Worked extensively with Siebel architecture and data model.
- Worked with the Siebel Data Model; extracted data from the Siebel source system and integrated it into the Enterprise Data Warehouse.
- Performed data modeling by gathering specifications and requirements and interacting with end users, and carried out extensive business analysis.
- Worked on multiple Data warehousing projects involving separation and integration of Boston Scientific and Confidential data.
- Performed analysis, design, coding, testing, implementation, and production support for the Company Code Duplication project.
- Wrote multiple Oracle stored procedures and PL/SQL code to enhance and implement new ETL processes.
- Worked extensively with Oracle 9i and Informatica 7.1.
- Designed, coded and implemented Informatica Mappings as a part of solution implementation for NAM Reporting.
- Worked with Cognos Reportnet for OLAP Reporting.
- Worked extensively with Controlm Scheduling tool to schedule jobs and understand the process flow.
- Worked with UNIX Korn Shell Scripts for FTP and Copy files between Systems.
Environment: Informatica Power Center 7.1/8.1.1, Oracle 9i/10g, Cognos Series 7 - Cognos Impromptu, PL/SQL, SQL*Loader, Erwin 3.5.5, Siebel, Vertica, Windows 2000/XP, Korn Shell, MongoDB
- Designed and developed the database using ER diagrams, normalization, and relational database concepts.
- Provided data analysis for client engagements across a range of technical industries.
- Assisted to build analytic tools to manage data and conducted data analysis using Python.
- Mined data from multiple sources and developed analytics providing insights on current trends.
- Developed SQL Server Stored Procedures, Tuned SQL Queries.
- Analyzed and prepared data, identifying patterns in the dataset by applying historical models.
- Collaborated with senior data scientists to understand the data.
- Performed data manipulation, data preparation, normalization, and predictive modeling.
- Improved efficiency and accuracy by evaluating the model in R.
- Presented the existing model to stakeholders and provided insights into the model using different visualization methods in Power BI.
- Used R and Python programming to improve the model.
- Upgraded the entire set of models to improve the product.
- Performed the data cleaning process and applied backward-forward filling methods on the dataset to handle missing values.
- Developed a predictive model and validated a Neural Network classification model to predict the feature label.
- Applied boosting methods to the predictive model to improve its efficiency.
- Presented dashboards to higher management for more insights using Power BI.
Environment: R/R Studio, Python, SQL Enterprise Manager, GitHub, Microsoft Power BI, Outlook.
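The backward-forward filling of missing values mentioned above maps directly onto pandas' fill methods; the toy series below is a hypothetical stand-in for the real dataset:

```python
import numpy as np
import pandas as pd

# Toy series with gaps (hypothetical data)
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()       # propagate the last valid value forward
both = s.ffill().bfill()  # backward fill then handles any leading gaps
print(both.tolist())      # [1.0, 1.0, 1.0, 4.0, 4.0]
```

Chaining the two fills guarantees no NaNs remain as long as the series has at least one valid value.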
- Neural Network API development and containerization for easy deployment of CNN models on AWS EKS and EC2.
- API development to process tens of TB of data in Python on a Linux platform in a multithreaded framework.
- Rearchitected the Neural Network pipeline as IaaS on the AWS cloud to accelerate model building on multiple GPUs and to integrate all the application components.
- Client facing operations involving product requirement gathering and identification of development goals.
- Used Business Intelligence and data visualization tools to simplify decision making.
- Performed data cleaning to ensure data quality, consistency, and integrity using Pandas and NumPy.
- Documentation of process workflows like implementation, integration, and reporting services.
- Wrote bash scripts to automate tests and tasks for various services.
- Worked with large data sets on the order of terabytes for data association pairing and extracting meaning from the results.
- Developed data transformation tools from different formats like TSV, JSON, CSV, etc.
Environment: Machine learning, Neural Networks, AWS, EC2, Digital Ocean, Linux, Python (Scikit-Learn/SciPy/Numpy/Pandas), R, MySQL, Eclipse, PL/SQL, SQL connector, Git, JIRA, NLP.
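The TSV/JSON/CSV transformation tooling mentioned above can be sketched with pandas; the inline TSV payload and column names here are hypothetical (in practice the data would come from files):

```python
import io

import pandas as pd

# Hypothetical TSV payload; in practice this would be read from a file
tsv = "id\tname\n1\tAlice\n2\tBob\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")

csv_out = df.to_csv(index=False)         # TSV -> CSV
json_out = df.to_json(orient="records")  # TSV -> JSON records
print(json_out)  # [{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]
```

Going through a DataFrame as the intermediate representation is what lets one tool fan out to any of the supported formats.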