Data Scientist Resume
South Plainfield, NJ
SUMMARY:
- A passionate, team-oriented Data Scientist with over 6 years of experience in Statistical Modeling, Data Mining, Data Visualization, and Machine Learning, with rich domain knowledge in the Retail, Healthcare, and Banking industries.
- Expertise in transforming business resources and tasks into regularized data and analytical models, designing algorithms, developing data mining and reporting solutions across a massive volume of structured and unstructured data.
- Involved in the entire data science project life cycle, including Data Acquisition, Data Cleansing, Data Manipulation, Feature Engineering, Modeling, Evaluation, Optimization, Testing, and Deployment.
- Proficient in Machine Learning algorithms and Predictive Modeling including Linear Regression, Logistic Regression, Naive Bayes, Decision Tree, Neural Networks, Random Forest, Ensemble Models, SVM, KNN and K-means clustering.
- Solid experience in Deep Learning techniques with Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), max pooling, normalization, and architectures such as AlexNet, VGG, and Darknet.
- Excellent proficiency in model validation and optimization with Model selection, Parameter tuning and K-fold cross validation.
- Deep understanding of Statistical Methodologies including Hypothesis Testing, ANOVA, and Chi-Square tests.
- Strong experience with Python (2.x, 3.x) and R Programming to develop analytic models and solutions.
- Extensive experience in RDBMS such as SQL Server 2012 and Oracle 9i/10g.
- Experienced in non-relational databases such as MongoDB 3.x.
- Familiar with the Hadoop ecosystem and Apache Spark framework, including HDFS, MapReduce, Pig Latin, HiveQL, Spark SQL, and PySpark.
- Proficient in data visualization tools such as Tableau, Python Matplotlib/Seaborn, R ggplot2/Shiny to create visually impactful and actionable interactive reports and dashboards.
- Experienced in Amazon Web Services (AWS), such as AWS EC2, EMR, S3, RDS, and Redshift.
- Experienced in designing and developing T-SQL queries, ETL packages and business reports using SQL Server Management Studio (SSMS) and BI Suite (SSIS/SSRS).
- Adept in developing and debugging Stored Procedures, User-defined Functions (UDFs), Triggers, Indexes, Constraints, Transactions and Queries using Transact-SQL (T-SQL).
- Experienced in ticketing systems such as JIRA/Confluence and version control tools such as GitHub.
- Excellent understanding of Systems Development Life Cycle (SDLC) such as Agile and Waterfall.
- Strong business acumen and analytical skills to translate numbers into actionable business decisions. Great passion in learning cutting-edge theories and algorithms for Machine Learning and always looking for new challenges.
TECHNICAL SKILLS:
Databases: MS SQL Server 2008/2008R2/2012/2014, Oracle, HBase, Amazon Redshift, MongoDB 3.x, Teradata
Statistical Methods: Hypothesis Testing, ANOVA, Chi-Square, Exploratory Data Analysis (EDA), Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation
Machine Learning: Regression Analysis, Naïve Bayes, Decision Tree, Random Forests, Support Vector Machine, Neural Network, Sentiment Analysis, Collaborative Filtering, K-Means Clustering, KNN, CNN, RNN, and AdaBoost
Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, Hive, HDFS, Pig
Cloud Services: Amazon Web Services (AWS) EC2/S3/Redshift
Deep Learning: Keras, TensorFlow, Theano, AlexNet, VGG, CNN, and RNN
Reporting Tools: Tableau Suite of Tools 7.x/8.x/9.x/10.x (Server and Online), SQL Server Reporting Services (SSRS)
Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
Languages: Python (2.x/3.x), R, Java, SQL
Operating Systems: Microsoft Windows, Linux (Ubuntu)
Other Tools: Microsoft Office Suite (Word, PowerPoint, Excel)
PROFESSIONAL EXPERIENCE:
Confidential, South Plainfield, NJ
Data Scientist
Responsibilities:
- Collaborated with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from Redshift.
- Worked on data cleaning and ensured data quality, consistency, and integrity using Pandas and NumPy.
- Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
- Explored and analyzed the customer specific features by using Matplotlib in Python and ggplot2 in R.
- Performed data imputation using Scikit-learn package in Python.
- Participated in features engineering such as feature generating, PCA, feature normalization and label encoding with Scikit-learn preprocessing.
- Used Python 3.x (NumPy, SciPy, Pandas, Scikit-Learn, Seaborn) and R (caret, trees, arules) to develop a variety of models and algorithms for analytic purposes.
- Experimented and built predictive models including ensemble models using machine learning algorithms such as Logistic regression, Random Forests and KNN to predict customer churn.
- Conducted analysis of customer behaviors and discovered customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering, Gaussian Mixture Model, and Hierarchical Clustering.
- Used F-Score, AUC/ROC, Confusion Matrix, Precision, and Recall to evaluate different models’ performance.
- Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, utilizing collaborative filtering techniques to recommend items to different customers.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
Environment: AWS RedShift, Hadoop, HDFS, Python 3.x (Scikit-Learn/Scipy/Numpy/Pandas/Matplotlib/Seaborn), R (ggplot2/caret/trees/arules), Tableau (9.x), Machine Learning (Logistic regression/Random Forests/KNN/K-Means Clustering/Gaussian Mixture Model/Hierarchical Clustering/Ensemble methods/Collaborative filtering), JIRA, GitHub, Agile/SCRUM
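The churn-modeling workflow described above (fit a classifier, then score it with AUC/ROC and a confusion matrix) can be sketched roughly as follows. This is a minimal illustration using synthetic data; the feature names (tenure, monthly spend, support calls) are hypothetical, not the original dataset.

```python
# Minimal churn-prediction sketch with scikit-learn, assuming synthetic,
# illustrative features -- not the original customer data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
n = 1000
# Hypothetical customer features: tenure (months), monthly spend, support calls
X = np.column_stack([
    rng.integers(1, 72, n),
    rng.normal(60, 20, n),
    rng.poisson(2, n),
])
# Synthetic churn label loosely tied to short tenure plus frequent support calls
y = ((X[:, 0] < 12) & (X[:, 2] > 2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate with AUC/ROC on predicted probabilities and a confusion matrix
proba = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
print(confusion_matrix(y_test, model.predict(X_test)))
```

The same pattern extends to the Random Forest and KNN experiments mentioned above by swapping in the corresponding scikit-learn estimator.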
Confidential, Union, NJ
Data Scientist
Responsibilities:
- Oversaw the ETL process; extracted and merged data from SQL Server 2012 using optimized SQL queries.
- Aggregated unstructured data collected in MongoDB 3.3.
- Performed data cleaning, exploratory analysis, and data integrity analysis using Pandas and NumPy.
- Analyzed customer behavior and value using RFM analysis.
- Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, and AdaBoost, using Python Scikit-Learn and evaluated the performance.
- Researched segmentation of customers using Random Forest, K-Means, and Hierarchical Clustering.
- Developed the product recommendation engine using Content-Based Filtering, Collaborative Filtering, and Gradient Boosting Tree algorithms.
- Generated dashboards and reports using Tableau.
- Evaluated the marketing strategy using A/B testing.
- Conducted sentiment analysis of customer service based on survey responses.
Environment: Python 3.X (Scikit-Learn/Numpy/Pandas/Matplotlib/Seaborn), SQL Server 2012, MongoDB 3.X, Tableau 8.X, Git 2.X, AWS EC2, S3
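The RFM (Recency, Frequency, Monetary) analysis mentioned in both roles above can be sketched with Pandas as follows. The transaction table and its column names are illustrative assumptions, not real data; the score bands are an example of one common binning scheme, not the exact one used.

```python
# Hedged RFM-scoring sketch: synthetic transactions, hypothetical column names.
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "days_since_purchase": [5, 40, 90, 120, 200, 10],
    "amount": [50.0, 20.0, 5.0, 15.0, 10.0, 300.0],
})

rfm = tx.groupby("customer_id").agg(
    recency=("days_since_purchase", "min"),  # days since most recent purchase
    frequency=("amount", "size"),            # number of transactions
    monetary=("amount", "sum"),              # total spend
)

# Band each dimension into scores (3 = best); recency is inverted because
# fewer days since the last purchase is better.
rfm["r_score"] = pd.cut(
    rfm["recency"], bins=[0, 30, 90, float("inf")], labels=[3, 2, 1]
).astype(int)
rfm["f_score"] = rfm["frequency"].rank(method="dense").astype(int)
rfm["m_score"] = rfm["monetary"].rank(method="dense").astype(int)
rfm["rfm"] = rfm[["r_score", "f_score", "m_score"]].sum(axis=1)
print(rfm)
```

The resulting per-customer scores can then feed segmentation, e.g. as input features to the K-Means clustering described above.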
Confidential, Paterson, NJ
Data Scientist
Responsibilities:
- Gathered, analyzed, and translated business requirements; communicated with other departments to collect client business requirements and assess available data.
- Collected data in Hadoop and performed data preparation using Pig Latin to convert it into the required format.
- In the preprocessing phase, used Pandas and Scikit-Learn to remove or impute missing values, detect outliers, scale features, and apply feature selection (filtering) to eliminate irrelevant features.
- Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features.
- Balanced the dataset by over-sampling the minority label class and under-sampling the majority label class.
- Used Python (NumPy, SciPy, Pandas, Scikit-Learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, and AdaBoost, using Python Scikit-Learn and evaluated the performance.
- Implemented, tuned, and tested the model on AWS EC2 to get the best algorithm and parameters.
- Used F-Score, AUC/ROC, Confusion Matrix, and RMSE to evaluate the performance of different models.
- Tracked model performance on unseen data and retrained the model to improve accuracy.
Environment: AWS EC2, S3, Hadoop, Pig, HDFS, Spark (PySpark/MLlib/Spark SQL), Python 3.x (Numpy/ Pandas/ Matplotlib/ Seaborn/ Scipy/ Scikit-Learn), MS SQL Server 2012
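The class-balancing step described above (over-sampling the minority label class and under-sampling the majority class) can be sketched with scikit-learn's `resample` utility. The data is synthetic and the common target class size is an assumed illustration, not the original setting.

```python
# Hedged sketch of balancing an imbalanced dataset by resampling each class
# to a common size; data and target size are illustrative assumptions.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)   # roughly 10% minority class

target = 300                               # assumed common class size
X_min, X_maj = X[y == 1], X[y == 0]

# Over-sample the minority class (with replacement) and under-sample the
# majority class (without replacement) to the same target size.
X_min_up = resample(X_min, replace=True, n_samples=target, random_state=0)
X_maj_down = resample(X_maj, replace=False, n_samples=target, random_state=0)

X_bal = np.vstack([X_min_up, X_maj_down])
y_bal = np.array([1] * target + [0] * target)
print(X_bal.shape, np.bincount(y_bal))   # (600, 4) [300 300]
```

Shuffling the balanced arrays before training avoids a block of identical labels; dedicated libraries (e.g. SMOTE-style synthetic over-sampling) are a common alternative to plain duplication.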
Confidential, New York, NY
SQL BI Developer/Data Analyst
Responsibilities:
- Worked on the company's database and business model and was actively involved in gathering user/project requirements from different stakeholders; worked on documentation required for the project at hand.
- Extracted data using T-SQL in SQL Server, writing queries, stored procedures, triggers, views, temp tables, and User-Defined Functions (UDFs).
- Designed and developed ETL packages using SSIS to create Data Warehouses from different tables and file sources like Flat and Excel files.
- Used SSIS transformations such as Derived Column, Aggregate, Merge Join, Row Count, and Conditional Split to transform the data.
- Developed reporting solutions in SSRS for different stakeholders, from mock-up to deployment, in areas such as Claims, Transactions, Supply, and Assets.
- Optimized T-SQL queries by removing redundancies, retrieving only essential data, and using joins efficiently.
Environment: MS SQL Server 2008/2008R2/2012 (T-SQL), SQL Server Management Studio, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Windows 7, MS Office Suite 2010, Tableau (6.X)