Data Scientist Resume
New York, NY
PROFESSIONAL SUMMARY:
- Qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Artificial Intelligence, Deep Learning, Machine Learning, Data Mining and Statistical Analysis.
- Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modelling and data visualization with large structured and unstructured data sets; created ER diagrams and schemas.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means.
- Implemented Bagging and Boosting to enhance the model performance.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing and ANOVA.
- Extensively worked on Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK and scikit-learn).
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, caret, dplyr) and Excel.
- Solid ability to write and optimize diverse SQL queries, with working knowledge of RDBMS such as SQL Server 2008 and NoSQL databases such as MongoDB 3.2.
- Developed API libraries and coded business logic using C# and XML, and designed web pages using the .NET framework, C#, Python, Django, HTML and AJAX.
- Excellent understanding of Agile and Scrum development methodologies.
- Used version control tools such as Git 2.x and build tools such as Apache Maven/Ant.
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) under Agile, DevOps and Scrum methodologies, including creating requirements and test plans.
- Skilled in Advanced Regression Modelling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and the application of Statistical Concepts.
- Proficient in Predictive Modelling, Data Mining Methods, Factor Analysis, ANOVA, hypothesis testing, normal distribution and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Social Network Analysis, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python scikit-learn.
- Experienced in using Python for data loading, extraction and manipulation, and worked with Python libraries such as Matplotlib, NumPy, SciPy and Pandas for data analysis.
- Worked with analytical applications such as R, SAS, MATLAB and SPSS to develop neural networks and cluster analyses.
- Strong C# and SQL programming skills, with experience in working with functions, packages and triggers.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in performing data parsing, data ingestion, data manipulation, data architecture, data modelling and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt and reshape (illustrated in the sketch following this list).
- Experienced in Visual Basic for Applications (VBA), VB, C# and the .NET framework for application development.
- Worked with NoSQL databases including Cassandra and MongoDB.
- Experienced in Data Integration Validation and Data Quality controls for ETL processes and Data Warehousing using MS Visual Studio SSIS, SSAS and SSRS.
- Proficient in Tableau and R Shiny data visualization tools to analyse and obtain insights into large datasets and to create visually powerful, actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms such as Tableau.
- Worked in development environments using Git and VMs.
- Excellent communication skills; successfully work in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
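A minimal pandas sketch of the data preparation steps listed above (describe, descriptive statistics, regex split/combine, remap, merge, subset, reindex, melt/reshape); the frame and column names are hypothetical, shown only to illustrate the methods.

```python
import pandas as pd

# Hypothetical exam-results frame used only to illustrate the preparation steps above.
df = pd.DataFrame({
    "student_id": ["S1", "S2", "S3"],
    "exam": ["Step1 2016", "Step1 2017", "Step2 2016"],
    "score": [221, 245, 233],
})

print(df.describe())  # descriptive statistics of numeric columns

# Regex split/combine: pull the step name and year out of a single text column.
extracted = df["exam"].str.extract(r"(?P<step>Step\d)\s+(?P<year>\d{4})")
df = df.join(extracted)

# Remap coded values and subset rows.
df["step"] = df["step"].map({"Step1": "USMLE Step 1", "Step2": "USMLE Step 2"})
high_scores = df[df["score"] > 230]
print(high_scores)

# Merge with another (hypothetical) table, reindex, then melt/reshape to long format.
demo = pd.DataFrame({"student_id": ["S1", "S2", "S3"], "region": ["NE", "MW", "NE"]})
merged = (df.merge(demo, on="student_id", how="left")
            .set_index("student_id")
            .reindex(["S1", "S2", "S3"]))
long_form = merged.reset_index().melt(id_vars=["student_id", "region"],
                                      value_vars=["score"], var_name="metric")
print(long_form)
```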
TECHNICAL SKILLS:
Data Analytics Tools/Programming: Python (NumPy, SciPy, Pandas, Gensim, Keras), R (caret, Weka, ggplot), MATLAB, Microsoft SQL Server, Oracle PL/SQL.
Analysis & Modelling Tools: Erwin, Sybase PowerDesigner, Oracle Designer, Rational Rose, ER/Studio, TOAD, MS Visio and SAS.
Data Visualization: Tableau, Visualization packages, Microsoft Excel.
ETL Tools: Informatica PowerCenter, DataStage, Ab Initio, Talend.
OLAP Tools: MS SQL Analysis Manager, DB2 OLAP, Cognos PowerPlay.
Languages: SQL, PL/SQL, T-SQL, XML, HTML, UNIX Shell Scripting, C, C++, AWK, JavaScript.
Databases: Oracle, Teradata, DB2 UDB, MS SQL Server, Netezza, Sybase ASE, Informix, MongoDB, HBase, Cassandra, AWS RDS.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.
Legacy Languages: COBOL.
Reporting Tools: Business Objects XI R2, Cognos Impromptu, Informatica Analytics Delivery Platform, MicroStrategy, SSRS, Tableau.
Tools: MS Office Suite, Scala, NLP, MariaDB, SAS, Spark MLlib, Kibana, Elasticsearch packages, VSS.
Languages: SQL, T-SQL, Base SAS and SAS/SQL, HTML, XML.
Operating Systems: Windows NT/XP/Vista, UNIX (Sun Solaris, HP-UX), MS-DOS.
PROFESSIONAL EXPERIENCE:
Confidential, New York, NY
Data Scientist
Responsibilities:
- Performed data profiling to learn about student behaviour across various features of the USMLE examinations.
- Evaluated models using cross-validation, the log loss function and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Implemented statistical modelling with the XGBoost machine learning package in Python to determine the predicted probabilities of each model.
- Created master data for modelling by combining various tables and derived fields from client data, students' LORs, essays and various performance metrics.
- Formulated a basis for variable selection and used grid search with K-Fold cross-validation to find optimal hyperparameters (see the sketch following this list).
- Utilized boosting algorithms to build a model for predictive analysis of the behaviour of students who took the USMLE exam and applied for residency.
- Used NumPy, SciPy, Pandas, NLTK (Natural Language Toolkit) and matplotlib to build the model.
- Formulated several graphs to show the performance of the students by demographics and their mean score in different USMLE exams.
- Applied various artificial intelligence (AI)/machine learning algorithms and statistical modelling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, and regression models.
- Used Principal Component Analysis in feature engineering to analyse high-dimensional data.
- Performed data cleaning, feature scaling and feature engineering using the Pandas and NumPy packages in Python, and built models using deep learning frameworks.
- Created deep learning models using TensorFlow and Keras by combining all tests into a single normalized score to predict residency attainment of students.
- Used the XGBoost classifier for categorical variables and the XGBoost regressor for continuous variables, and combined pipelines using the FeatureUnion and FunctionTransformer utilities in natural language processing workflows.
- Used the OneVsRestClassifier to fit one classifier per class against all other classes for multiclass classification problems.
- Applied various machine learning algorithms and statistical modelling such as decision trees, text analytics, sentiment analysis, Naive Bayes, logistic regression and linear regression in Python to determine the accuracy rate of each model.
- Created and designed reports that will use gathered metrics to infer and draw logical conclusions of past and future behaviour.
- Generated various models by using different machine learning and deep learning frameworks and tuned the best performance model using Signal Hub.
- Created data layers as signals for Signal Hub to predict new unseen data with performance not less than that of the static model built using a deep learning framework.
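A minimal sketch of the grid search with K-Fold cross-validation and the log loss/ROC AUC evaluation described above, using synthetic data in place of the USMLE dataset; the hyperparameter values are illustrative, not the tuned production settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the combined student/performance table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Grid search over a few illustrative hyperparameters with K-Fold cross-validation.
param_grid = {
    "max_depth": [3, 5],
    "n_estimators": [100, 300],
    "reg_lambda": [1.0, 10.0],  # L2 regularization to address overfitting
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, scoring="neg_log_loss", cv=cv)
search.fit(X_train, y_train)

# Evaluate the tuned model with log loss and ROC AUC on held-out data.
proba = search.predict_proba(X_test)[:, 1]
print("best params:", search.best_params_)
print("log loss:", log_loss(y_test, proba))
print("ROC AUC:", roc_auc_score(y_test, proba))
```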
Environment: Python, Hive, AWS, Linux, Tableau Desktop, Microsoft Excel, NLP, deep learning frameworks such as TensorFlow and Keras, boosting algorithms, etc.
Confidential, St. Louis, MO
Data Scientist
Responsibilities:
- Performed data profiling to learn about behaviour across various features such as traffic pattern, location, date and time.
- Applied various artificial intelligence (AI)/machine learning algorithms and statistical modelling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Utilized Spark, Snowflake, Scala, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS and Python with a broad variety of machine learning methods including classification, regression and dimensionality reduction; used the engine to increase user lifetime by 45% and triple user conversions for target categories.
- Created a SQL engine connection through C# to connect to the database, and developed API libraries and business logic using C#, XML and Python.
- Explored DAGs, their dependencies and logs using Airflow pipelines for automation.
- Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, Python and R code for a regular expression (regex) project. Used the K-Means clustering technique to identify outliers and to classify unlabelled data (see the sketch following this list).
- Created a user-friendly interface for quick viewing of reports using C#, JSP and XML, and developed an expandable menu that shows drill-down data on graph click.
- Evaluated models using cross-validation, the log loss function and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana.
- Categorised comments into positive and negative clusters from different social networking sites using Sentiment Analysis and Text Analytics
- Tracked operations using sensors until certain criteria were met, using Airflow.
- Responsible for different data mapping activities from source systems to Teradata using utilities such as TPump, FastExport, MultiLoad, BTEQ and FastLoad.
- Analysed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyse high dimensional data.
- Created and designed reports that will use gathered metrics to infer and draw logical conclusions of past and future behaviour.
- Performed multinomial logistic regression, random forest, decision tree and SVM modelling to classify whether a package would be delivered on time for the new route.
- Used MLlib, Spark's Machine learning library to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Performed data cleaning, feature scaling and feature engineering using the Pandas and NumPy packages in Python, and built models using SAP Predictive Analytics.
- Developed MapReduce pipeline for feature extraction using Hive and Pig .
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau/Spotfire.
- Communicated the results with the operations team to support the best decisions.
- Collected data needs and requirements by interacting with other departments.
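A minimal PySpark sketch of K-Means clustering used to flag outliers, written against the DataFrame-based Spark ML API; the table and column names are hypothetical and only illustrate the technique, not the production traffic pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans-outliers").getOrCreate()

# Hypothetical traffic table; column names are illustrative only.
df = spark.createDataFrame(
    [(1, 10.0, 0.5), (2, 12.0, 0.6), (3, 11.5, 0.4), (4, 95.0, 9.0)],
    ["id", "volume", "delay"],
)

# Assemble features and fit K-Means.
assembler = VectorAssembler(inputCols=["volume", "delay"], outputCol="features")
features = assembler.transform(df)
model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a "prediction" column

# Treat points falling into very small clusters as outlier candidates.
counts = clustered.groupBy("prediction").count()
small_clusters = [r["prediction"] for r in counts.collect() if r["count"] <= 1]
outliers = clustered.filter(F.col("prediction").isin(small_clusters))
outliers.show()
```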
Environment: Python, Scala, NZSQL, Teradata, PostgreSQL, Tableau, EC2, Netezza, Architecture, SAS/Graph, SAS/SQL, SAS/Access, Time-series analysis, ARIMA.
Confidential, Farmington, Connecticut
Data Scientist
Responsibilities:
- Provided Configuration Management and Build support for more than 5 different applications, built and deployed to the production and lower environments.
- Implemented segmentation of the public using unsupervised machine learning by implementing the k-means algorithm in PySpark.
- Used Airflow to keep track of job statuses in repositories such as MySQL and PostgreSQL databases.
- Responsible for different data mapping activities from source systems to Teradata, and for text mining and building models using topic analysis and sentiment analysis for both semi-structured and unstructured data.
- Used R and Python for exploratory data analysis and A/B testing, along with HQL, VQL, Data Lake, AWS Redshift, Oozie and PySpark, applying ANOVA and hypothesis tests to compare and identify the effectiveness of creative campaigns (see the sketch following this list).
- Computed A/B testing frameworks and clickstream and time-spent databases using Airflow.
- Created clusters for control and test groups and conducted group campaigns using text analytics.
- Created positive and negative clusters from merchants' transactions using sentiment analysis to test the authenticity of transactions and resolve any chargebacks.
- Analysed and calculated the lifetime cost of everyone in the welfare system using 20 years of historical data.
- Created and developed classes and web page elements using C# and AJAX; used JSP for validating client-side responses and connected C# to the database to retrieve SQL data.
- Developed Linux shell scripts using NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Developed triggers, stored procedures, functions and packages using cursor and REF CURSOR concepts associated with the project in PL/SQL.
- Created various types of data visualizations using R, C#, Python and Tableau/Spotfire; also connected Pipeline Pilot with Spotfire to create more interactive, business-driven layouts.
- Used Python, R and SQL to create statistical algorithms involving multivariate regression, linear regression, logistic regression, PCA, random forest models, decision trees and support vector machines for estimating the risks of welfare dependency.
- Identified and targeted welfare high-risk groups with machine learning/deep learning algorithms.
- Conducted campaigns and ran real-time trials to determine what works fast and to track the impact of different initiatives.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create a new database design via an easy-to-use graphical interface.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards
- Performed analyses such as regression analysis, logistic regression, discriminant analysis and cluster analysis using SAS programming.
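A minimal sketch of the A/B testing and ANOVA comparisons described above, using synthetic campaign metrics with SciPy; the group names and effect sizes are illustrative only, not the actual campaign results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic campaign metrics; groups and effect sizes are hypothetical.
control   = rng.normal(loc=0.10, scale=0.03, size=500)  # e.g. click-through rate, control group
variant_a = rng.normal(loc=0.11, scale=0.03, size=500)  # creative campaign A
variant_b = rng.normal(loc=0.12, scale=0.03, size=500)  # creative campaign B

# Two-sample t-test (Welch's): is variant A different from control?
t_stat, p_two = stats.ttest_ind(variant_a, control, equal_var=False)
print(f"A vs control: t={t_stat:.2f}, p={p_two:.4f}")

# One-way ANOVA across all three groups, as in the campaign comparison above.
f_stat, p_anova = stats.f_oneway(control, variant_a, variant_b)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
```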
Environment: R, C#, Pig, Hive, Linux, RStudio, Tableau, SQL Server, MS Excel, PySpark.
Confidential, Colmar, Pennsylvania
Data Modeler/ Data Analyst
Responsibilities:
- Created and maintained Logical and Physical models for the data mart. Created partitions and indexes for the tables in the data mart.
- Performed data profiling and analysis, applied various data cleansing rules, designed data standards and architecture, and designed the relational models.
- Maintained metadata (data definitions of table structures) and version controlling for the data model.
- Developed SQL scripts for creating tables, sequences, triggers, views and materialized views.
- Worked on query optimization and performance tuning using SQL Profiler and performance monitoring.
- Developed mappings to load Fact and Dimension tables, SCD Type 1 and SCD Type 2 dimensions and incremental loads, and unit tested the mappings (see the sketch following this list).
- Utilized Erwin's forward/reverse engineering tools and target database schema conversion process.
- Worked on creating an enterprise-wide data model (EDM) for products and services in the Teradata environment based on data from the PDM; conceived, designed, developed and implemented this model from scratch.
- Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server.
- Wrote SQL scripts to test the mappings and developed a Traceability Matrix of business requirements mapped to test scripts to ensure that any change control in requirements leads to test case updates.
- Responsible for development and testing of conversion programs for importing data from text files into the Oracle database utilizing Perl shell scripts and SQL*Loader.
- Involved in extensive data validation by writing several complex SQL queries, back-end testing, and working with data quality issues.
- Developed and executed load scripts using Teradata client utilities MULTILOAD, FASTLOAD and BTEQ.
- Exported and imported data between different platforms such as SAS and MS Excel.
- Generated periodic reports based on the statistical analysis of the data using SQL Server Reporting Services ( SSRS ).
- Worked with the ETL team to document the Transformation Rules for Data Migration from OLTP to Warehouse Environment for reporting purposes.
- Created SQL scripts to find data quality issues and to identify keys, data anomalies, and data validation issues.
- Formatted data sets read into SAS using the FORMAT statement in the DATA step as well as PROC FORMAT.
- Applied Business Objects best practices during development with a strong focus on reusability and better performance.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create a new database design via an easy-to-use graphical interface.
- Designed different types of star schemas for detailed data marts and plan data marts in the OLAP environment.
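The SCD Type 2 loads above were built as ETL mappings; the following is only a conceptual pandas sketch of the Type 2 pattern (expire the changed current row, then insert a new current row), with hypothetical table and column names.

```python
import pandas as pd

HIGH_DATE = "9999-12-31"  # open-ended end date, kept as a string for simplicity

# Existing dimension rows (hypothetical customer dimension).
dim = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Albany", "Buffalo"],
    "effective_date": ["2015-01-01", "2015-01-01"],
    "end_date": [HIGH_DATE, HIGH_DATE],
    "is_current": [True, True],
})

# Incoming source rows: customer 2 moved, customer 3 is new.
src = pd.DataFrame({"customer_id": [2, 3], "city": ["Rochester", "Syracuse"]})
load_date = "2016-06-01"

merged = src.merge(dim[dim["is_current"]], on="customer_id", how="left", suffixes=("", "_dim"))
changed = merged[merged["city_dim"].notna() & (merged["city"] != merged["city_dim"])]
new_rows = merged[merged["city_dim"].isna()]

# Type 2: expire the current version of changed rows...
expire_mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
dim.loc[expire_mask, ["end_date", "is_current"]] = [load_date, False]

# ...then insert new current versions for changed and brand-new customers.
inserts = pd.concat([changed, new_rows])[["customer_id", "city"]].assign(
    effective_date=load_date, end_date=HIGH_DATE, is_current=True)
dim = pd.concat([dim, inserts], ignore_index=True)
print(dim.sort_values(["customer_id", "effective_date"]))
```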
Environment: Erwin, MS SQL Server 2008, DB2, Oracle SQL Developer, PL/SQL, Business Objects, MS Office Suite, Windows XP, TOAD, SQL*Plus, SQL*Loader, Teradata, Netezza, SAS, Tableau, SSRS, SQL Assistant, Informatica, XML.
Confidential
Data Engineer
Responsibilities:
- Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Redshift for large-scale handling of millions of records every day.
- Implemented and managed ETL solutions and automated operational processes.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Wrote various data normalization jobs for new data ingested into Redshift (see the sketch following this list).
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Migrated the on-premise database structure to the Confidential Redshift data warehouse.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Defined and deployed monitoring, metrics, and logging systems on AWS .
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
- Worked on publishing interactive data visualization dashboards, reports and workbooks on Tableau and SAS Visual Analytics.
- Responsible for designing logical and physical data models for various data sources on Confidential Redshift.
- Designed and developed ETL jobs to extract data from the Salesforce replica and load it into the data mart in Redshift.
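A minimal sketch of the kind of Redshift load-and-normalize job described above, assuming a hypothetical cluster, S3 bucket, IAM role and staging/target tables; the connection details and SQL are illustrative only, not the production pipeline.

```python
import psycopg2

# Hypothetical connection details; in practice these would come from configuration/secrets.
conn = psycopg2.connect(host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="...")

COPY_SQL = """
    COPY staging.orders_raw
    FROM 's3://example-bucket/incoming/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
    FORMAT AS CSV IGNOREHEADER 1;
"""

# Normalization step: standardize values while moving rows from staging into the target table.
NORMALIZE_SQL = """
    INSERT INTO warehouse.orders (order_id, customer_id, order_ts, amount_usd)
    SELECT order_id,
           TRIM(UPPER(customer_id)),
           order_ts::TIMESTAMP,
           ROUND(amount::DECIMAL(12, 2), 2)
    FROM staging.orders_raw
    WHERE order_id IS NOT NULL;
"""

with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)        # bulk load from S3 into the staging table
    cur.execute(NORMALIZE_SQL)   # normalize and insert into the reporting table
    cur.execute("TRUNCATE staging.orders_raw;")
conn.close()
```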
Environment: Redshift, AWS Data Pipeline, SQL Server Integration Services, SQL Server, AWS Data Migration Services, DQS, SAS Visual Analytics, SAS Forecast server and Tableau.
Confidential
Data Analyst/ Data Modeler
Responsibilities:
- Developed UI screens for the data entry application in Java Swing.
- Worked on a backend service in Spring MVC and OpenEJB for interaction with Oracle and the mainframe using DAO and model objects.
- Introduced Spring IoC to increase application flexibility and replace the need for hard-coded, class-based application functions.
- Used Spring IoC dependency injection to autowire different beans and data sources into the application.
- Used Spring JDBC templates for database interactions and used declarative Spring AOP transaction management.
- Used mainframe screen scraping for adding forms to mainframe through the claims data entry application.
- Worked on Jasper Reports (iReport) to generate reports for various people (executive secretary and commissioners) based on their authorization.
- Generated electronic letters for attorneys and insurance carriers using iReport.
- Worked on application deployment on various Tomcat server instances using PuTTY.
- Worked in TOAD for PL/SQL in the Oracle database, writing queries, functions, stored procedures and triggers.
- Worked on JSP, Servlets, HTML, CSS, JavaScript, JSON, jQuery and AJAX for the Vault web-based project.
- Used the Spring MVC architecture with the DispatcherServlet and view resolver for the web applications.