Data Scientist Resume
San Diego, CA
SUMMARY
- Over 6 years of experience in Statistical Modeling, Hypothesis Testing, Predictive Modeling, Machine Learning, Multivariate Analysis, Correlation, ANOVA, Data Analytics, Text Mining, SQL, and Database Management.
- Expert in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured data.
- Well-versed in predictive statistical modeling using Python and R.
- Proven track record of successfully handling data science projects from start to finish.
- Proficient in Data Analysis, Cleansing, Transformation, Model Building, Model Evaluation, and Presentation.
- Extensively worked with Python 3.5/2.7 (NumPy, SciPy, Pandas, Matplotlib, Seaborn, NLTK, and Scikit-Learn).
- Experienced in extracting data from various sources, including CSV, text, tab-separated, and pipe-separated files, and reading tables from the web using Python.
- Expert in performing data manipulation operations using Python.
- Skilled in R programming using packages like caret, ggplot2, dplyr.
- Expert in implementing machine learning algorithms such as K-Means Clustering, KNN, Naïve Bayes, SVM, Decision Trees, and Linear and Logistic Regression.
- Experienced in applying ensemble methods such as Random Forests and Gradient Boosting to improve prediction accuracy.
- Exposure to Natural Language Processing (NLP) and text mining.
- Experienced in dimensionality reduction methods such as PCA (Principal Component Analysis) and regularization techniques such as Lasso and Ridge.
- Possess a solid understanding of the bias-variance trade-off, cross-validation, and the steps required to avoid overfitting.
- Experienced in writing complex SQL queries and designing database objects such as Stored Procedures, Indexes, Views, and Triggers.
- Designed and implemented critical database solutions that are reliable, scalable, and high-performing, meeting the service levels of the software that supports the organization's core products.
- Solid ability to write and optimize diverse SQL queries, working knowledge of RDBMS.
- Well-versed in manipulating data using string and aggregate functions in MySQL and Sybase databases.
- Exposure to cloud computing using Amazon AWS (S3 and EMR).
- Able to use MLlib, Spark's machine learning library, to build and evaluate different models.
- Skilled in pivot-table operations in Microsoft Excel for slicing and dicing data and computing counts, sums, and other aggregate metrics.
- Excellent ability to communicate technical details to non-technical audiences.
- Proficient at summarizing and presenting key insights through effective visualizations using Microsoft PowerPoint.
- Delivered projects under various project life cycle and SDLC methodologies, including Scrum and Waterfall.
- Able to work independently and collaboratively within a team setting.
TECHNICAL SKILLS
Programming Languages: Python (NumPy, Pandas, Matplotlib), R, SQL
Database: Sybase, SQL Server
Big Data: Spark, PySpark, Spark ML
Machine Learning Methodology: Supervised and Unsupervised Learning
Analytical models: Linear Regression, Logistic Regression, PCA, SVM, Classification, Predictive Modelling
Tools: RStudio, Jupyter Notebook, PuTTY, Spark on AWS, Microsoft Excel, Scikit-Learn, SciPy
Others: Exploratory Data Analysis, Data visualization, NLP, Statistical Analysis, Regression Analysis, Correlation
PROFESSIONAL EXPERIENCE
Confidential, San Diego, CA
Data Scientist
Responsibilities:
- Loaded dataset and merged customer information table and subscription table into one table. Performed exploratory data analysis to answer business questions.
- Visualized and explored the data with scatter plots, pie charts, bar charts, box plots, and histograms using Seaborn and Matplotlib.
- Cleansed data using techniques such as missing-value treatment, outlier treatment, data normalization, and hypothesis testing.
- Manipulated and summarized data efficiently to support downstream analysis.
- Identified the factors/variables responsible for churn by slicing and dicing the dataset.
- Identified which groups of customers were likely to churn through exploratory analysis and visualized the findings.
- Discovered which services were most and least preferred among customers.
- Performed regression analysis and correlation analysis.
- Applied class-imbalance handling techniques in Python to balance the dataset, improving churn-prediction accuracy by 50%.
- Increased business revenue by $6,000 per month.
- Provided recommendations for customer engagement.
- Presented findings and business impact to the technical team using Microsoft PowerPoint.
Environment: Python 3.6, NumPy, Pandas, Matplotlib, Seaborn, Microsoft Excel, Microsoft PowerPoint.
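The churn workflow above can be sketched in a few lines; the table names, columns, and the balanced-class-weight choice here are illustrative assumptions, not the actual production setup:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for the customer-information and subscription tables.
customers = pd.DataFrame({
    "customer_id": range(8),
    "tenure_months": [1, 24, 3, 36, 2, 48, 5, 60],
})
subscriptions = pd.DataFrame({
    "customer_id": range(8),
    "monthly_fee": [70, 30, 80, 25, 90, 20, 85, 15],
    "churned": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Merge the two tables into one, as in the first step above.
df = customers.merge(subscriptions, on="customer_id")

# Handle class imbalance by weighting classes inversely to their frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(df[["tenure_months", "monthly_fee"]], df["churned"])
print(model.predict(df[["tenure_months", "monthly_fee"]]))
```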
Confidential
Data Analyst/Software Engineer
Responsibilities:
- Predicted whether a borrower would default on a loan using Python.
- Discovered patterns among borrowers likely to default using NumPy and Pandas, and visualized the findings using Seaborn and Matplotlib.
- Merged different data sets into a single data set, reading data from various sources including CSV, text, Excel, and JSON formats.
- Performed pre-processing steps like imputing missing values, removing highly correlated variables and converting categorical variables to dummy variables.
- Selected the best features using Recursive Feature Elimination to mitigate the curse of dimensionality.
- Built models with cross-validation to avoid over-fitting, implementing classification algorithms such as Logistic Regression, Decision Trees, Random Forest, and Support Vector Machines.
- Validated the classifiers using the ROC AUC curve and accuracy; the best model (Decision Trees) achieved 82.7% accuracy.
- Presented the findings to the technical team.
Environment: Python, Pandas, NumPy, Scikit-Learn, Seaborn, PCA, Linear models, Non-linear models, Ensemble models.
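A minimal, self-contained sketch of the feature-selection and validation steps above; synthetic data stands in for the borrower dataset, and the feature and fold counts are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the borrower dataset (real data came from CSV/Excel/JSON).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Recursive Feature Elimination keeps the strongest predictors,
# mitigating the curse of dimensionality.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
X_selected = X[:, selector.support_]

# 5-fold cross-validation estimates out-of-sample accuracy and flags over-fitting.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5)
print(round(scores.mean(), 3))
```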
Confidential
Responsibilities:
- Involved in all phases of the SDLC (Software Development Life Cycle), from analysis and design through development, testing, implementation, and maintenance, with timely delivery against aggressive deadlines.
- Developed stored procedures and wrote complex SQL queries using multiple table joins and sub-queries, applying optimal performance techniques, and generated reports for business stakeholders and customers.
- Migrated code from testing to production, ensuring accuracy and allowing stakeholders to leverage the information for strategic business decisions.
- Monitored systems post go live. Proactively delivered solutions for continuous improvement.
- Worked as a developer in a Scrum team setting with a strong understanding of each of the Scrum roles.
- As a Scrum team player, liaised with the product manager and product owner to understand client requirements and helped develop software that exceeded client expectations.
- Collaborated with team members to eliminate unwanted dependencies among AutoSys batch jobs.
- Automated reports to end clients using TCL scripts and a job scheduler, saving 72 hours of manual effort per month.
- Improved stored-procedure performance using techniques such as bulk insertion of data from the source into the database.
Environment: Sybase, TCL scripting, SQL, Python scripting, PuTTY, UNIX.