- Professionally qualified Data Scientist/Data Analyst with over 7 years of experience in Data Science and Analytics, including Machine Learning, Data Mining and Statistical Analysis
- Involved in all phases of the data science project life cycle, including data extraction, data cleaning, statistical modeling and data visualization, with large data sets of structured and unstructured data
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means
- Implemented Bagging and Boosting to enhance the model performance.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing and ANOVA
- Worked extensively with Python 3.5/2.7 (NumPy, pandas, Matplotlib, NLTK and scikit-learn)
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.X, R 3.0 (ggplot2, caret, dplyr) and Excel.
- Solid ability to write and optimize diverse SQL queries, with working knowledge of RDBMS like SQL Server 2008 and NoSQL databases like MongoDB
- Strong experience in Big Data technologies like Spark 1.6, Spark SQL, PySpark, Hadoop 2.X, HDFS, Hive 1.X
- Experience in visualization tools like Tableau 9.X and 10.X for creating dashboards
- Excellent understanding of Agile and Scrum development methodologies
- Used version control tools like Git 2.X
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) under Agile and Scrum methodologies.
- Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Proficient in Predictive Modeling, Data Mining Methods, Factor Analysis, ANOVA, hypothesis testing, normal distribution and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python scikit-learn.
- Experienced in using Python to manipulate data for data loading and extraction, and worked with Python libraries like Matplotlib, NumPy, SciPy and pandas for data analysis.
- Hands-on experience with Machine Learning, Regression Analysis, Clustering, Boosting, Classification, Principal Component Analysis and Data Visualization Tools
- Strong programming skills in a variety of languages such as Python and SQL.
- Created many calculated columns and measures using DAX in Power BI based on report requirements and published Power BI reports to end users
- Worked with complex applications such as R, SAS, Matlab and SPSS to develop neural network and cluster analysis models.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in data parsing, data manipulation and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt and reshape.
- Strong SQL programming skills, with experience in working with functions, packages and triggers.
- Experienced in developing applications with Visual Basic for Applications (VBA) and VB.
- Worked with NoSQL databases including HBase, Cassandra and MongoDB.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Experienced in Data Integration Validation and Data Quality controls for ETL process and Data Warehousing using MS Visual Studio SSIS, SSAS, SSRS.
- Proficient in Tableau and R-Shiny data visualization tools to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms like Tableau.
- Worked in development environments using Git and virtual machines.
- Excellent communication skills. Works successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
Programming & Scripting Languages: R, C, C++, Java, JCL, COBOL, HTML, CSS, JSP, JavaScript
Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.
Statistical Software: SPSS, R, SAS.
Web Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, SQLAlchemy.
Bigdata Ecosystem: HDFS, Pig, MapReduce, Hive, Sqoop, Flume, HBase, Storm, Kafka, Elasticsearch, Redis.
Statistical Methods: Time series, regression models, splines, confidence intervals, principal component analysis and dimensionality reduction, bootstrapping
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse
Database Design Tools and Data Modeling: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Cloud: AWS, S3, EC2.
ETL Tools: Informatica PowerCenter, SSIS.
Big Data / Grid Technologies: Cassandra, Coherence, MongoDB, Zookeeper, Titan, Elasticsearch, Storm, Kafka, Hadoop
Tools and Utilities: SQL Server Management Studio, SQL Server Enterprise Manager, SQL Server Profiler, Import & Export Wizard, Visual Studio .NET, Microsoft Management Console, Visual SourceSafe 6.0, DTS, Crystal Reports, Power Pivot, ProClarity, Microsoft Office, Excel Power Pivot, Excel Data Explorer, Tableau, JIRA, Spark MLlib.
Confidential, Newark, NJ
- Performed data profiling to learn about behavior across features such as traffic pattern, location, date and time.
- Applied various machine learning algorithms and statistical models, such as decision trees, regression models, neural networks, SVM and clustering, to identify volume using the scikit-learn package in Python and Matlab.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods, including classification, regression and dimensionality reduction; used the resulting engine to increase user lifetime by 45% and triple user conversations for target categories.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources. Used K-means clustering to identify outliers and to classify unlabeled data.
- Evaluated models using cross-validation, log loss and ROC curves, and used AUC for feature selection.
- Analyzed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model had a low false positive rate.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high dimensional data.
- Created and designed reports that use gathered metrics to draw logical conclusions about past and future behavior.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree and SVM to classify whether a package will be delivered on time on a new route.
- Used Python's data science packages extensively, including pandas, NumPy, Matplotlib, seaborn, SciPy, scikit-learn and NLTK.
- Analytically solved packaging, distribution and traveling problems using the Xpress IVE optimization solver.
- Configured automatic and scheduled refresh in the Power BI service.
- Used SQL queries extensively for legacy data retrieval jobs.
- Migrated the Django database from MySQL to PostgreSQL.
- Responsible for maintaining and analyzing large datasets used by domain experts to analyze risk.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Performed data cleaning, feature scaling and feature engineering using the pandas and NumPy packages in Python.
- Developed a MapReduce pipeline for feature extraction using Hive.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated the results to the operations team to support decision making.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, Matlab, Spark SQL, PySpark.
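The lag-based autocorrelation analysis mentioned in the bullets above can be illustrated with a minimal pure-Python sketch. This is not the project's code: the function name `autocorrelation` and the sample `traffic` counts are hypothetical, chosen only to show the calculation.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: co-movement between the series and its lagged copy.
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


# Hypothetical traffic counts with a repeating pattern of period 4.
traffic = [10, 20, 30, 20] * 6

# Autocorrelation at lags 0 through 8.
acf = {lag: autocorrelation(traffic, lag) for lag in range(0, 9)}
```

In a series like this, lags that match the period of the pattern give high positive values, while lags that fall half a period apart give negative values, which is one way a recurring traffic cycle would show up.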
Confidential, Deerwood, Maryland
- Provided Configuration Management and Build support for more than 5 different applications, built and deployed to the production and lower environments.
- Implemented public segmentation with unsupervised machine learning, using the k-means algorithm in PySpark.
- Explored and extracted data from source XML in HDFS, preparing data for exploratory analysis using data munging.
- Responsible for different Data mapping activities from Source systems to Teradata
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS
- Used R and Python for exploratory data analysis, A/B testing, ANOVA and hypothesis testing to compare and identify the effectiveness of creative campaigns.
- Created clusters to classify control and test groups and conducted group campaigns.
- Analyzed and calculated the lifetime cost of everyone in the welfare system using 20 years of historical data.
- Developed Linux shell scripts using NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Developed triggers, stored procedures, functions and packages using cursors and ref cursor concepts associated with the project using PL/SQL
- Created various types of data visualizations using R, Python and Tableau.
- Used Python, R and SQL to create statistical algorithms involving multivariate regression, linear regression, logistic regression, PCA, random forest models, decision trees and support vector machines for estimating the risks of welfare dependency.
- Identified and targeted welfare high-risk groups with Machine learning algorithms.
- Conducted campaigns and ran real-time trials to quickly determine what works and track the impact of different initiatives.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create new database designs via an easy-to-use graphical interface.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards
- Performed analyses such as regression analysis, logistic regression, discriminant analysis and cluster analysis using SAS programming.
- Used a metadata tool for importing metadata from the repository, adding new job categories and creating new data elements.
- Scheduled weekly update tasks and model runs in the workflow. Automated the entire process flow for generating the analysis and reports.
Environment: R 3.x, HDFS, Hadoop 2.3, Pig, Hive, Linux, R-Studio, Tableau 10, SQL Server, MS Excel, PySpark.
Confidential - Sunnyvale, CA
- Developed applications of Machine Learning, Statistical Analysis and Data Visualization for challenging data processing problems in the sustainability and biomedical domains.
- Worked to identify subtypes in autism for the development of targeted and more effective therapies.
- Used hierarchical clustering methods to identify clusters in the data based on important features; further analysis to identify the most significant brain volumes is under way.
- Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
- Designed and developed Natural Language Processing models for sentiment analysis.
- Worked on Natural Language Processing with the NLTK module of Python to develop an application for automated customer response.
- Applied concepts of probability, distribution and statistical inference to the given datasets to uncover findings using comparison, t-tests, F-tests, R-squared, p-values, etc.
- Applied linear regression, multiple regression, ordinary least squares, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data with the help of the scikit-learn, SciPy, NumPy and pandas modules of Python.
- Applied clustering algorithms such as hierarchical clustering and K-means with the help of scikit-learn and SciPy.
- Developed visualizations and dashboards using ggplot and Tableau
- Worked on development of data warehouse, data lake and ETL systems using relational and non-relational tools like SQL and NoSQL.
- Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
- Applied linear regression in Python and SAS to understand the relationships, including possible causal relationships, between different attributes of the dataset
- Performed complex pattern recognition on financial time series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data
- Used Cloudera Hadoop YARN to perform analytics on data in Hive.
- Wrote Hive queries for data analysis to meet the business requirements.
- Expertise in Business Intelligence and data visualization using R and Tableau.
- Expert in Agile and Scrum Process.
- Validated macroeconomic data (e.g. BlackRock, Moody's) and performed predictive analysis of world markets using key indicators in Python with machine learning concepts like regression, bootstrap aggregation and random forest.
- Worked in large-scale database environments like Hadoop and MapReduce, with working knowledge of Hadoop clusters, nodes and the Hadoop Distributed File System (HDFS).
Environment: AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector.
Data Modeler/ Data Analyst
Confidential, New York, NY
- Involved in requirement gathering and data analysis, and interacted with business users to understand the reporting requirements and analyze BI needs for the user community.
- Created Entity/Relationship Diagrams, grouped and created the tables, validated the data, identified PKs for lookup tables.
- Involved in modeling (Star Schema methodologies) in building and designing the logical data model into Dimensional Models.
- Created and maintained logical, dimensional data models for different Claim types and HIPAA Standards.
- Implemented one-to-many and many-to-many entity relationships in the data warehouse data model.
- Worked with the MDM team on various business operations within the organization.
- Identified the primary key and foreign key relationships across the entities and across subject areas.
- Developed ETL routines using SSIS packages, planned an effective package development process and designed the control flow within the packages.
- Worked with Big Data architects to set up the Big Data platform in the organization, and on the Hive platform to create Hive data models
- Developed customized training documentation based on each client's technical needs and built a curriculum to help each client learn both basic and advanced techniques for using PostgreSQL.
- Took an active role in the design, architecture, and development of user interface objects in QlikView applications. Connected to various data sources like SQL Server, Oracle, and flat files.
- Presented the dashboard to business users and cross-functional teams, defined KPIs (Key Performance Indicators), and identified data sources.
- Designed ETL data flows that extract, transform and load data, and optimized SSIS performance.
- Delivered end-to-end mapping from source (Guidewire application) to target (CDW), and from legacy system coverages to the Landing Zone and the Guidewire Reporting Pack.
- Involved in loading the data from Source Tables to Operational Data Source tables using Transformation and Cleansing Logic.
- Performed data accuracy, data analysis and data quality checks before and after loading the data.
- Resolved the data type inconsistencies between the source systems and the target system using the Mapping Documents.
- Generated Tableau dashboards for Claims with forecast and reference lines.
- Designed, developed, implemented and maintained Informatica PowerCenter and Informatica Data Quality (IDQ) applications for the matching and merging process.
- Created ad-hoc reports for users in Tableau by connecting to various data sources.
- Worked on the reporting requirements for the data warehouse.
- Created support documentation and worked closely with production support and testing team.
Environment: Erwin 8.2, Oracle 11g, OBIEE, Crystal Reports, Toad, Sybase PowerDesigner, Datahub, MS Visio, DB2, QlikView 11.6, Informatica.
- Designed and built dimensions and cubes with star and snowflake schemas using SQL Server Analysis Services (SSAS).
- Participated in JAD sessions with business users and sponsors to understand and document the business requirements in alignment with the financial goals of the company.
- Involved in the analysis of business requirements, the design and development of high-level and low-level designs, and unit and integration testing
- Performed data analysis and data profiling using complex SQL on various source systems including Teradata and SQL Server.
- Developed logical and physical data models that capture current-state and future-state data elements and data flows using ER/Studio
- Normalized the ER data model of the OLTP system to second and third normal form
- Worked with data compliance and data governance teams to maintain data models, metadata and data dictionaries, and defined source fields and their definitions.
- Translated business and data requirements into logical data models in support of Enterprise Data Models, ODS, OLAP, OLTP, Operational Data Structures and Analytical systems.
- Designed and modeled the reporting data warehouse considering current and future reporting requirements
- Involved in the daily maintenance of the database that involved monitoring the daily run of the scripts as well as troubleshooting in the event of any errors in the entire process.
- Worked with data scientists to create data marts for data-science-specific functions.
- Determined data rules and conducted Logical and Physical design reviews with business analysts, developers, and DBAs.
- Used external loaders like MultiLoad, TPump and FastLoad to load data into Oracle, and performed database analysis, development, testing, implementation and deployment.
- Reviewed the logical model with application developers, ETL Team, DBAs, and testing team to provide information about the data model and business requirements.
Environment: Erwin r7.0, Informatica 6.2, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, Rational Rose, Requisite Pro, Hadoop, PL/SQL.
- Used SAS Proc SQL pass-through facility to connect to Oracle tables and created SAS datasets using various SQL joins such as left join, right join, inner join and full join.
- Performed data validation and transformed data from Oracle RDBMS into SAS datasets.
- Produced quality customized reports using PROC TABULATE, PROC REPORT styles and ODS RTF, and provided descriptive statistics using PROC MEANS, PROC FREQ and PROC UNIVARIATE.
- Developed SAS macros for data cleaning, reporting and to support routine processing.
- Performed advanced querying using SAS Enterprise Guide: calculated computed columns, applied filters, and manipulated and prepared data for reporting, graphing, summarization and statistical analysis, finally generating SAS datasets.
- Involved in developing, debugging and validating the project-specific SAS programs to generate derived SAS datasets, summary tables and data listings according to study documents.
- Created datasets per the approved specifications, collaborated with project teams to complete scientific reports, and reviewed reports to ensure accuracy and clarity.
- Performed different calculations like Quick table calculations, Date Calculations, Aggregate Calculations, String and Number Calculations.
- Good expertise in building dashboards and stories based on the available data points.
- Created action filters, user filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
- Created rich dashboards using Tableau Dashboard and prepared user stories to create compelling dashboards that deliver actionable insights.
- Worked with the manager to prioritize requirements and prepared reports on a weekly and monthly basis.
Environment: SQL Server, Oracle 11g/10g, MS Office Suite, PowerPivot, PowerPoint, SAS Base, SAS Enterprise Guide, SAS/MACRO, SAS/SQL, SAS/ODS, SQL, PL/SQL, Visio.