Data Scientist Resume
Mount Laurel, NJ
SUMMARY:
- Data Scientist with more than 6 years of data analytics and software development experience solving real-world problems in the Finance, E-Commerce and Real-Estate domains by leveraging data science skills, coding ability, logical reasoning and technical acumen.
- Expert in managing the full life cycle of Data Science projects, including Data Collection, Cleaning and Predictive Modeling, on structured and unstructured data sources.
- Skilled in implementing supervised and unsupervised machine learning techniques such as Linear and Logistic Regression, SVM, Naive Bayes, Random Forests, Decision Trees, KNN, K-means and Principal Component Analysis for classification, regression and clustering problems.
- Working knowledge of Recommender Systems, Feature Creation and Validation.
- Efficient in using K-fold cross-validation to detect over-fitting.
- Experienced in using RMSE and Mean Average Precision to evaluate model accuracy.
- Summarized model performance using Confusion Matrix, F1 score, Precision and Recall (see the sketch following this summary).
- Hands-on experience with Python libraries such as Pandas, NumPy, SciPy, Scikit-Learn, Matplotlib, Seaborn, Beautiful Soup and Orange.
- Familiar with packages in R such as dplyr, tidyr, plyr, ggplot2, caret, wordcloud, stringr, e1071, MASS and tidytext.
- Sound understanding of the Hadoop 2.0 ecosystem, including MapReduce, and capable of using Hive to extract, transform and load (ETL) data from large datasets on HDFS.
- Working knowledge of Apache Spark 1.6+, using Spark Core and Spark SQL for data transformation and batch processing.
- Used Kafka with the Spark Streaming API for real-time streaming data analysis.
- Proficient in programming languages, namely Java, Python and R.
- Adept in using web technologies to build interactive dynamic webpages with HTML5, CSS3, JavaScript and jQuery 1.7.
- Hands-on experience with RDBMS including Oracle 11g, MySQL 5.7 and PostgreSQL to query data, perform CRUD operations, and write functions, triggers, views and stored procedures.
- Experienced in working with NoSQL data stores including MongoDB, HBase and Cassandra.
- Well-versed in data visualization techniques to create interactive layouts using Power BI & Tableau 10.2 for publishing and presenting Dashboards on web and desktop platforms.
- Experienced in handling various phases of Software Development Life Cycle (SDLC) including Requirement Analysis, Design Specification, Implementation, Testing, Deployment and Maintenance in both Waterfall and Agile methodologies.
- Well versed in using Tortoise SVN and Git repositories for version control.
- Familiar with bug fixing and bug tracking tools like Bugzilla.
- Quick learner with excellent analytical and problem-solving skills; strives to lead from the front and is self-motivated to deliver the best results on deadline, whether working independently or in a team.
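The validation workflow referenced above can be summarized in a short sketch. This is a minimal, illustrative example using scikit-learn on synthetic data; the dataset, model choice and parameter values are assumptions for demonstration, not taken from any project below.

# Minimal sketch: K-fold cross-validation to detect over-fitting, plus a
# confusion matrix with precision / recall / F1 on a hold-out split.
# Synthetic data and illustrative parameters only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: a large gap between fold scores and training
# accuracy is a signal of over-fitting.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("5-fold F1 scores:", scores.round(3), "mean:", scores.mean().round(3))

# Hold-out evaluation summarized with a confusion matrix and
# per-class precision, recall and F1.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = model.fit(X_train, y_train).predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))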
TECHNICAL SKILLS:
Big Data Technologies: Apache Hadoop 2.x, MapReduce 2.6+, HDFS, Sqoop, Apache Spark 1.6+, Hive 0.14, Kafka
Programming Languages: Python 2.7/3.x, R 3.3, Java 1.7/8, PL/SQL
Python Packages: NumPy 1.12, Pandas 0.18, Matplotlib 2.0, Seaborn 0.7, Scikit-Learn 0.18, Orange, Beautiful Soup, LibSVM
R Packages: dplyr, tidyr, plyr, ggplot2, caret, wordcloud, stringr, e1071, MASS, Shiny, Markdown
Analytical Techniques: EDA, Feature Engineering, Supervised Learning, Unsupervised Learning, Statistical Modeling, Regression, Classification, Clustering, Generalized Linear Models
Machine Learning: Linear and Logistic Regression, Naive Bayes, Random Forests, SVM, K-Means, KNN, Decision Trees, XGBoost, Bagging, Boosting, Gradient Descent, PCA
Databases: RDBMS - Oracle 11g, MySQL 5.7, PostgreSQL; NoSQL - MongoDB 3.2, Cassandra 2.2, HBase 1.x
Cloud Platforms: Hortonworks (Sandbox), AWS S3 (EC2), Anaconda Cloud (Jupyter Notebook)
Web Technologies: Core Java, Struts 2.0, J2EE, HTML5, CSS3, JavaScript, jQuery, JSON, XML
IDEs: Eclipse, Visual Studio Code, R Studio, SQL Developer, PyCharm
Visualization Tools: Tableau 10.2, Power BI
Methodologies: Agile (Scrum), Waterfall
Operating Systems: Windows, Linux (Fedora, Ubuntu)
Version Control: Tortoise SVN, Git
PROFESSIONAL EXPERIENCE:
Confidential - Mount Laurel, NJ
Data Scientist
Responsibilities:
- Gathered data from a hybrid of sources, including relational databases like Oracle and MySQL, and transferred it to HDFS using Sqoop.
- Collected geo-location data from MongoDB.
- Used Kafka with the Spark Streaming API to fetch real-time data from transaction logs.
- Performed Exploratory Data Analysis, Data Cleaning and Aggregation using Spark Core and Python packages like Pandas 0.18 and NumPy 1.12 in the PySpark Shell.
- Used Matplotlib and Seaborn libraries in Python to visualize the data for detecting outliers, missing values and interpreting variables.
- Stored the customer and transaction data in MongoDB, supporting expressive ad-hoc queries in real time with low-latency responses.
- Reduced dimensionality of the dataset using Principal Component Analysis (PCA) and feature importance ranked by tree-based classifiers.
- Built user behavior models to find activity patterns and evaluate risk scores for every transaction, using historical data to train supervised learning models such as Decision Trees, Random Forests, XGBoost and SVM.
- Used K-fold cross validation to evaluate models for detecting over-fitting.
- Used Root Mean Squared Error (RMSE) to check the accuracy of the model's predicted risk scores.
- Classified incoming transactions as fraudulent or legitimate based on the risk score assigned by the supervised learning model.
- Re-trained the models using Random Forests and XGBoost classifiers to improve whitelisting and to determine a cutoff point for accepting/declining transactions (a simplified sketch of this scoring-and-cutoff step follows this list).
- Summarized the performance of models using Confusion matrix, F1 score, Recall and Precision.
- Generated reports and created interactive dashboards using Tableau 10.2.
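A simplified sketch of the risk-scoring and cutoff step described above. The features, class balance, model and the 0.7 cutoff are illustrative assumptions, not the actual project values.

# Sketch: score transactions with a classifier's fraud probability and
# apply a cutoff for accept/decline. All values are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for transaction features (~3% fraud).
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Risk score = predicted probability of the fraudulent class.
risk_scores = clf.predict_proba(X_test)[:, 1]

# Cutoff point for accepting/declining transactions; in practice this is
# tuned against precision/recall targets rather than fixed at 0.7.
CUTOFF = 0.7
decisions = np.where(risk_scores >= CUTOFF, "decline", "accept")
print(dict(zip(*np.unique(decisions, return_counts=True))))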
Environment: Python 3.5, Hortonworks, Apache Hadoop 2.0, Apache Sqoop, Apache Kafka, Apache Spark 2.0, Hive, PySpark, MongoDB 3.2, Scikit-Learn, Tableau 10.2, MS Excel, Linux.
Confidential - Manhattan, NY
Data Scientist
Responsibilities:
- Integrated data from various sources, including customer behavior data, transactional data and portfolio data, by querying and processing large volumes of data using Hive on HDFS.
- Collected product information from a Cassandra database.
- Used Spark to load the data and perform cluster computations on HDFS.
- Performed Exploratory Data Analysis and Data Pre-Processing on order history using R packages like dplyr and tidyr in SparkR environment.
- Extracted patterns in the structured and unstructured data sets and displayed them with interactive charts using ggplot2 and ggiraph packages in R.
- Stored the pre-processed user and item data in HBase.
- Built initial models using supervised classification techniques like K-Nearest Neighbor (KNN), Logistic Regression and Random Forests.
- Measured feature correlation using the Pearson Correlation Coefficient (PCC) to identify highly correlated features for dimensionality reduction.
- Used K-Fold cross validation to overcome the problem of over-fitting.
- Built models using K-means clustering to create user groups.
- Experimented with the initial models using item-based and user-based recommendation approaches.
- Used item-based Collaborative Filtering to improve prediction accuracy (a simplified sketch follows this list).
- Created a hybrid model to support recommendations both for new users and for existing users with changing trends.
- Used RMSE and Mean Average Precision to evaluate the recommender's performance in both simulated and real-world environments.
- Helped deploy the model in production and monitored user activity and add-on sales from items that were recommended without being searched.
- Used the results to tune the model parameters and rebuild the model.
- Created visualizations to convey results and analyze data using Power BI.
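A minimal sketch of item-based collaborative filtering as described above, on a toy rating matrix; the data and weighting scheme are illustrative assumptions (the actual project used R and SparkR).

# Sketch: item-based collaborative filtering with cosine similarity.
# The rating matrix is a toy example; 0 means "not rated".
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 3, 0, 1],   # rows = users
    [4, 0, 0, 1],   # columns = items
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Item-item similarity computed over the columns of the rating matrix.
item_sim = cosine_similarity(ratings.T)
np.fill_diagonal(item_sim, 0)  # an item should not recommend itself

# Predicted score for each (user, item) pair: similarity-weighted
# average of the user's existing ratings.
pred = ratings @ item_sim / (np.abs(item_sim).sum(axis=0) + 1e-9)

# Recommend the highest-scoring item the user has not rated yet.
user = 0
unrated = ratings[user] == 0
print("recommended item:", int(np.argmax(np.where(unrated, pred[user], -np.inf))))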
Environment: R 3.3, AWS S3, AWS EC2, Apache Hadoop 2.0, Apache Spark 1.6, Apache Hive, Apache HBase 1.1, SparkR, Cassandra 3.1, Power BI, MS Excel, Linux.
Confidential
PL/SQL Developer
Responsibilities:
- Successfully translated the client's business requirements into interactive webpages, accommodating continuous change requests and providing technical advice wherever necessary.
- Worked with a large Oracle database spanning several schemas and more than 500 tables, and played an active part in redesigning the database structure to remove redundancy.
- Optimized the performance of complex queries, triggers and stored procedures, and wrote functions and views using PL/SQL.
- Well versed in object-oriented programming in PL/SQL, with hands-on experience working with the Project Lead and Database Admin to debug and modify the Ledger Proc.
- Good working knowledge of reporting tools like JasperReports and ERD tools like Dia.
- Worked closely with developers and the client to complete several modules of the project, led a couple of them, and provided mentorship to junior employees.
Environment: Apache Tomcat 7.0, Struts MVC 2.0, Oracle 11g, Java, JDK 1.7, JDBC, HTML 5.0, CSS 3, jQuery-1.7, iReport 4.7, SQL Developer, Dia Diagram Editor, Eclipse (Juno/Kepler/Luna).
Confidential
Data Analyst
Responsibilities:
- Worked closely with a highly skilled team of middle school and high school teachers to prepare a database of the important conceptual questions for each grade.
- Categorized these questions into specific groups and subgroups from which they were probabilistically chosen to generate different sets of tests (a simplified sketch of the weighted selection follows this list).
- Designed ER diagrams using Dia Diagram Editor to understand the database schema, define different types of constraints and normalize the tables to eliminate redundancy.
- Used MySQL for CRUD operations and writing functions, triggers and stored procedures.
- Generated individual student and class performance reports using JasperReports with JDBC connector for MySQL database.
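A simplified sketch of the probabilistic test generation described above; the group names, weights and question IDs are made-up placeholders for demonstration.

# Sketch: weighted random selection of questions from groups to build
# distinct test sets. All names and weights are illustrative.
import random

question_bank = {
    "algebra":  ["Q1", "Q2", "Q3"],
    "geometry": ["Q4", "Q5"],
    "physics":  ["Q6", "Q7", "Q8", "Q9"],
}
group_weights = {"algebra": 0.5, "geometry": 0.2, "physics": 0.3}

def generate_test(n_questions):
    """Draw questions group-by-group so each generated test differs."""
    groups = list(question_bank)
    weights = [group_weights[g] for g in groups]
    test = []
    while len(test) < n_questions:
        group = random.choices(groups, weights=weights)[0]
        question = random.choice(question_bank[group])
        if question not in test:   # no duplicate questions in one test
            test.append(question)
    return test

print(generate_test(5))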
Environment: WampServer 2.2, PHP 5.5, MySQL Server 5.5, HTML 4.0, CSS3, JavaScript, Dia Diagram Editor, iReport 4.0, SQL Developer.
Confidential
Programmer Analyst
Responsibilities:
- Focused on building a highly scalable, lightweight complaint-lodging web portal allowing the public to lodge complaints about grievances related to various municipal services such as Water Management, Waste Disposal Management and Property Disputes.
- Gained hands-on experience in designing and developing dynamic JavaServer Pages (JSPs) using HTML, CSS and JavaScript, adding client-side validations and functionality.
- Used PL/SQL to query the database, perform CRUD operations, and write and debug functions, triggers and stored procedures.
- Used core Java to write servlets that get, post and manipulate data per user requests and connect to and retrieve results from the Oracle database.
Environment: Apache Tomcat 7.0, Oracle 11g, Java, JDK 1.7, JDBC, HTML 4.0, CSS 3, JavaScript, Eclipse 3.7 (Indigo), Adobe Dreamweaver 10.0