Data Scientist Resume New York City, NY - Hire IT People

SUMMARY:

Data Scientist with 6+ years of experience in statistical testing methods, data analytics, data management, and predictive analytics, delivering resourceful insights and business strategies
Experience working in domains such as Banking, E-commerce, Education, and Healthcare, performing result-oriented statistical experiments to identify solutions to business problems
Experience in Data Profiling, Data Cleansing, Data Mapping, Data Chunking, creating workflows, and Data Validation using data integration tools such as Informatica and Talend Open Studio during ETL and ELT processes
Extensive knowledge in Machine Learning techniques like Regression Modeling, Classification, Neural Networks, SVM, Clustering, Decision Tree & Random Forest, Association Rule Mining
Experience working with R packages like tidyverse, and ggplot2
Experience in statistical hypothesis testing, including ANOVA, t-tests, and the Chi-Square goodness-of-fit test
Experience on Apache Hadoop Ecosystem with good knowledge of Apache Hadoop Distributed file system (HDFS), Map Reduce, Hive, Pig, HBase, Sqoop, Flume, Cassandra, Spark, Oozie, Kafka
Experience working with RDBMS such as Oracle and MySQL, with expertise in creating tables, manipulating and extracting data from these databases, and performing the necessary data screening
Skilled in using dplyr in R and pandas in Python for exploratory data analysis.
Experience in Text Mining and good working knowledge of NLP components such as Natural Language Generation (NLG) and Natural Language Understanding (NLU), using the Python NLTK package.
Experience in ingesting datasets from data sources ranging from HDFS, Cassandra, and AWS to RDBMS such as MySQL, Oracle, SQL Server, DB2, Teradata, SAP HANA, etc.
Experience in data modeling and data analysis, working with OLTP and OLAP systems and with data warehousing approaches such as EDW, MOLAP DM and ROLAP.
Experience with DBA tasks involving database creation, data profiling, data cleaning, performance tuning, creation of indexes, and creating and modifying tablespaces for optimized performance
Expertise in Cost Benefit Analysis, Feasibility Analysis, Impact Analysis, Gap Analysis, SWOT analysis and ROI analysis, SCRUM, leading JAD sessions and Dashboard Reporting using tools like Tableau & Power BI
Experience in SAS/STAT, STATA, R, SQL, Tableau, Python, MS EXCEL (VLOOKUP, Pivot table, Macros).
Skilled in ERD and UML modeling, and in translating logical models into physical models.
Expertise in creating Tableau Dashboards for data visualization and deploying it to the servers.
Expertise in SQL queries, with 5 years of experience in creating and populating databases and extracting data from tables, including creation of tables, Subqueries, Joins, Views, Indexes, and SQL Functions
Experience in Data Warehousing concepts such as Star Schema, Galaxy Schema, Snowflake Schema, Data Marts, and the Kimball Methodology used in Relational and Multidimensional data modeling.
Good knowledge in Normalization and De-Normalization techniques for optimum schema designing.
Experience with conceptual, logical and physical data modeling considering Meta data standards.
Proficient with Python, R and Object-Oriented Programming concepts such as Inheritance, Polymorphism, Abstraction, Encapsulation, Association, Aggregation, etc.
Expertise in cloud technologies such as AWS, Azure, and Google Cloud, retrieving data from the cloud to perform data screening and analytical operations that provide insights
Expertise in Exploratory Data Analysis (EDA), using numerical summaries and relevant visualizations to support feature engineering and assess feature importance

TECHNICAL SKILLS:

Statistical Tests: Hypothesis testing, ANOVA tests, t-tests, Chi-Square goodness-of-fit test, Regression, Time series analysis.

Machine Learning Algorithms: Regression Models (Linear, Polynomial, Support Vector, Decision Trees); Classification Models (Logistic Regression, Decision Trees, Support Vector Machines); Ensemble Learning (Random Forest, Bagging Trees, Gradient Boosting Machine); Text Mining (NLP)

Validation Tests: Monte Carlo methods, k-fold cross-validation, Out-of-Bag (OOB) estimate

Analytical tools: Google analytics, R Studio, SAS, MATLAB, Azure data lake analytics, Google Ads

Data Visualization: Tableau, Microsoft Power BI, R ggplot2 and plotly, Python matplotlib, seaborn, bokeh

Data modeling: Entity relationship Diagrams (ERD), Snowflake schema, Star schema

Languages: SQL, HIVE QL, C, R, Python, SAS

Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7, Teradata, DB2, Amazon Redshift, SAP HANA

NoSQL Databases: HBase, Apache Cassandra

ETL Tools: Informatica PowerCenter 9.0, Informatica IDQ, Talend Open Studio, Kafka, Flume, Microsoft SSIS, Apache Spark

Big Data: Apache Hadoop, HDFS, Sqoop, Spark, Flume, Kafka, Hive, Impala, MapReduce, Splunk ML-SPL, Splunk Hadoop Connect, Oozie

PROFESSIONAL EXPERIENCE:

Confidential, New York City, NY

Data Scientist

Responsibilities:

Cleaned a large dataset sourced from Hadoop that contained many missing values and extreme outliers
Used MICE in R and IterativeImputer in Python to impute missing observations from the existing observations, and flagged extreme outliers for removal using Mahalanobis distance with a chi-square cutoff
Used Cook’s distance to detect observations with undue influence on the dataset and removed those outliers
Used two-sample independent t-tests to assess the differences in mean purchases across dichotomous variables such as gender and marital status, and used one-way ANOVA with Tukey's post-hoc test to assess differences in mean purchases across polychotomous variables such as occupation and age
Trained Multiple Linear Regression, Decision Tree Regression, Support Vector Regression, and ensemble models (Bagging, Random Forests, and Gradient Boosting Machine) on 70% of the data, optimized the models using Grid Search, and made predictions on the test set with each trained model
Computed Absolute and Relative return based on the simulation and plotted histograms for the selections to find the best ad and strategy based on the reinforcement learning algorithm (Thompson Sampling)
Selected the final model using the gain chart, relative Gini score, Root Mean Squared Error, and Mean Absolute Error, and validated it using ten-fold cross-validation
Summarized the final results as a dashboard in Tableau and presented it to the client

Environment: Tableau, Excel, Python (Pandas, Scikit-Learn, NumPy), Jupyter Notebook, R, MySQL, Apache Hadoop Distribution 2.7.X, HDFS, Linux, MS Office suite, Apache Spark
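
As an illustration of the Thompson Sampling ad-selection simulation listed among the responsibilities above, the idea can be sketched roughly as follows. This is a minimal sketch and not the project's actual code: the click-through rates, the number of rounds, and the function name `thompson_sampling` are hypothetical, and clicks are simulated with the Python standard library only.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling over a set of ads.

    true_rates holds hypothetical click-through rates that are used only
    to simulate feedback; a real system would observe clicks instead.
    """
    rng = random.Random(seed)
    n_ads = len(true_rates)
    wins = [0] * n_ads        # simulated clicks per ad
    losses = [0] * n_ads      # simulated non-clicks per ad
    selections = [0] * n_ads  # how often each ad was shown

    for _ in range(rounds):
        # Draw one sample from each ad's Beta(wins + 1, losses + 1) posterior.
        samples = [
            rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_ads)
        ]
        # Show the ad whose sampled rate is highest.
        chosen = max(range(n_ads), key=samples.__getitem__)
        selections[chosen] += 1
        # Simulate the click and update that ad's counts.
        if rng.random() < true_rates[chosen]:
            wins[chosen] += 1
        else:
            losses[chosen] += 1
    return selections, wins


# Hypothetical rates for three ads; the third is assumed to be the best.
selections, wins = thompson_sampling([0.05, 0.10, 0.20], rounds=5000, seed=42)
total_reward = sum(wins)
```

The `selections` list is what a histogram of ad selections would be plotted from, and `total_reward` can be compared against a random-selection baseline to obtain absolute and relative returns.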

Confidential, Santa Clara, CA

Data Scientist

Responsibilities:

Performed extensive data exploration and generated features from 4 TB of data using Hadoop
Used Python to develop a variety of models and algorithms to predict loan default from the available parameters
Deployed advanced techniques (e.g., text mining, statistical analysis) and performed A/B testing to assess historical loan defaults based on binary categories, validating the results using Chi-Square tests.
Successfully formulated the problem and built a classification model to predict the probability of loan default using Logistic Regression and Decision Tree classification
Employed ensemble learning methods like Random Forest and Gradient Boosting to predict the probability of loan default and improved the recall by 40% over the existing system
Used Python libraries such as NumPy, Matplotlib, and Pandas to work with dataframes and to plot graphs
Validated the model using ten-fold cross-validation, used hyperparameter tuning techniques such as OOB estimates and Grid Search to find the optimal model, and selected the best model based on precision-recall, ROC curve, lift charts, AUC, and pseudo R-squared.
Developed simple and compound features from over 50 tables containing details of user accounts, transactions, location, logging, time of the user transaction, system interactions, etc.

Environment: SQL Server, ETL, Python 3.x (Scikit-Learn, SciPy, NumPy, Pandas), R, Hadoop Framework, HDFS, Jupyter Notebook, Apache Spark
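
The ten-fold cross-validation and precision/recall evaluation mentioned above can be sketched in outline as follows. This is an illustrative sketch under stated assumptions, not the original workflow: the helper names `k_fold_indices` and `precision_recall` are hypothetical, the labels are made up, and it uses plain Python where the project itself lists Scikit-Learn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Ten folds over 25 hypothetical rows; each row is held out exactly once.
splits = list(k_fold_indices(25, k=10, seed=1))

# Made-up labels and predictions, only to show the metric calculation.
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In practice each split would be used to fit a model on `train_idx` and score it on `test_idx`, with the per-fold precision and recall averaged to compare candidate models.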

Confidential, Bentonville, AR

Data Scientist

Responsibilities:

Applied Digital Marketing analytics using Facebook Analytics, Tableau visualizations and Looker
Collected, processed, and cleansed raw data from a wide variety of sources using R, and performed statistical testing such as two-sample independent t-tests to assess the differences in mean purchases across variables
Enabled analysis by producing product information, took part in research and development efforts, and created data-based customer profiles to build a geo-demographic segmentation model and efficiently allocate resources in future expanding markets
Directly worked with cross-functional teams including analysts, engineers, product managers & executive management to understand the business needs & build data-driven strategies to help them meet their goals
Provided analytical insights and dashboards using Tableau and presented them to the client
Created a recommendation system based on customer purchasing history using Machine Learning algorithms such as K-NN and association rule mining
Analyzed and evaluated performance results from model execution to find key trends and opportunities to expand the business of the client

Environment: SQL Server 2012, Python 3.x (Scikit-Learn, NumPy, Pandas, Matplotlib), Tableau, Looker, Facebook Analytics, R, Linux, MS Excel.

Confidential

Junior Data Scientist

Responsibilities:

Collected data from end client, performed ETL and defined the uniform standard format
Wrote queries to retrieve data from SQL Server database to get the sample dataset containing basic fields
Performed string formatting on the dataset converting hours from date format to a numerical integer
Used Python libraries like Matplotlib and Seaborn to visualize the numerical columns of the dataset such as day of week, age, hour and number of screens.
Developed and implemented predictive models like Logistic Regression, Decision Tree, Support Vector Machine (SVM) to predict the probability of enrollment
Used Ensemble learning methods like Random Forest, Bagging & Gradient Boosting & picked the final model based on confusion matrix, ROC & AUC & predicted the probability of customer enrollment
Tuned the hyper parameters of the above models using Grid Search to find the optimum models
Designed and implemented K-Fold Cross-validation to test and verify the models' performance
Developed a dashboard and story in Tableau showing the benchmarks and a summary of the models' performance measures.

Environment: SQL Server 2012/2014, Python 3.x (Scikit-Learn, NumPy, Pandas, Matplotlib, Dateutil, Seaborn), Tableau, Hadoop
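
The conversion of date-formatted values into an integer hour, listed among the responsibilities above, might look roughly like the sketch below. The timestamp format and the helper names `hour_of_day` and `day_of_week` are assumptions made for illustration, and the sketch uses the standard-library `datetime` module in place of the Dateutil package named in the environment.

```python
from datetime import datetime

# Assumed timestamp layout; the actual source format is not stated.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def hour_of_day(timestamp, fmt=TIMESTAMP_FORMAT):
    """Parse a timestamp string and return the hour as an integer (0-23)."""
    return datetime.strptime(timestamp, fmt).hour


def day_of_week(timestamp, fmt=TIMESTAMP_FORMAT):
    """Parse a timestamp string and return the weekday (Monday = 0)."""
    return datetime.strptime(timestamp, fmt).weekday()


# Hypothetical sample values, not taken from the client dataset.
sample = ["2017-03-14 09:26:53", "2017-03-15 23:05:00"]
hours = [hour_of_day(ts) for ts in sample]
weekdays = [day_of_week(ts) for ts in sample]
```

Numeric columns produced this way (hour, day of week) are the kind of fields that can then be plotted with Matplotlib or Seaborn.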

Confidential

Data Analyst

Responsibilities:

Gathered and managed Data in SQL server 2008, MS Access & conducted in-depth data analysis & predictive modelling to uncover hidden patterns & communicated the insights to the product, sales & marketing teams
Performed Data Analysis on target data after transfer to the Data Warehouse
Created and automated dashboards using Excel VBA
Worked with the Data Architect on a Dimensional Model using both Star and Snowflake Schemas
Created ETL solution using Informatica tool to read Product & Order data from files on shared network into SQL Server database
Made business recommendations based on data collected to improve business efficiency.
Created data visualizations, dashboards, and advanced storytelling reports using Tableau and MS Power BI

Environment: Windows XP, SQL Server 2005/2008, PostgreSQL, MSSQL, SQLite, Excel VBA, MS Office 2010, MS Access 2010, Tableau, SSIS, MS Power BI
