Data Scientist Resume
Springfield, Massachusetts
SUMMARY
- Professionally qualified Data Scientist/Data Analyst with over 8 years of experience in Data Science and Analytics, including Machine Learning, Data Mining, and Statistical Analysis
- Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modeling, and data visualization with large sets of structured and unstructured data
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means
- Implemented Bagging and Boosting to enhance model performance (a minimal sketch follows this summary).
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing, and ANOVA
- Extensively worked with Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK, and scikit-learn)
- Experience implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, caret, dplyr), and Excel
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMSs such as SQL Server 2008 and NoSQL databases such as MongoDB 3.2
- Strong experience in Big Data technologies such as Spark 1.6, Spark SQL, PySpark, Hadoop 2.x, HDFS, and Hive 1.x
- Experience with visualization tools such as Tableau 9.x/10.x for creating dashboards
- Excellent understanding of Agile and Scrum development methodologies
- Used version control tools such as Git 2.x
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) under Agile and Scrum methodologies.
- Skilled in advanced regression modeling, correlation, multivariate analysis, model building, Business Intelligence tools, and the application of statistical concepts.
- Proficient in predictive modeling, data mining methods, factor analysis, ANOVA, hypothesis testing, normal distributions, and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Trees, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python scikit-learn.
- Experienced in using Python to manipulate data for loading and extraction; worked with Python libraries such as Matplotlib, NumPy, SciPy, and Pandas for data analysis.
- Worked with complex applications such as R, SAS, Matlab, and SPSS to develop neural networks and cluster analyses.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in performing data parsing, data manipulation, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Strong SQL programming skills, with experience working with functions, packages, and triggers.
- Experienced in Visual Basic for Applications (VBA) and VB for developing applications.
- Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Experienced in Data Integration Validation and Data Quality controls for ETL processes and Data Warehousing using MS Visual Studio (SSIS, SSAS, SSRS).
- Proficient in Tableau and R-Shiny data visualization tools to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platform like Tableau.
- Worked in development environments with Git and VMs.
- Excellent communication skills; work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
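A minimal sketch of the bagging and boosting comparison referenced in the summary, using scikit-learn; the synthetic dataset and hyperparameters are illustrative assumptions, not taken from any specific project.

```python
# Minimal sketch: comparing a bagged ensemble with a boosted ensemble.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging averages many high-variance trees to reduce variance;
# the default base estimator is a decision tree.
bagging = BaggingClassifier(n_estimators=100, random_state=42)

# Boosting fits shallow trees sequentially, each one correcting
# the errors of the previous ensemble.
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      random_state=42)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```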
TECHNICAL SKILLS
Languages: Java 8, Python, R
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, SQLAlchemy.
Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL
Data Modelling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.
Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.
ETL Tools: Informatica PowerCenter, SSIS.
Version Control Tools: SVN, GitHub.
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure SQL Data Warehouse.
Operating Systems: Windows, Linux, Unix, macOS, Red Hat.
PROFESSIONAL EXPERIENCE
Confidential, Springfield, Massachusetts
Data Scientist
Responsibilities:
- Performed data profiling to learn about user behavior across various features such as traffic pattern, location, date, and time.
- Applied various machine learning algorithms and statistical models such as decision trees, regression models, neural networks, SVM, and clustering to identify volume, using the scikit-learn package in Python as well as Matlab.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction, and used the resulting engine to increase user lifetime by 45% and triple user conversions for target categories.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; used the k-means clustering technique to identify outliers and classify unlabeled data.
- Evaluated models using cross-validation, the log loss function, and ROC curves, and used AUC for feature selection (see the sketch following this role).
- Analyzed traffic patterns by calculating autocorrelation at different time lags.
- Ensured that the model had a low false positive rate.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
- Created and designed reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
- Performed multinomial logistic regression, random forest, decision tree, and SVM modeling to classify whether a package would be delivered on time for a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed a MapReduce pipeline for feature extraction using Hive.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data; created various types of data visualizations using Python and Tableau.
- Communicated the results to the operations team to support decision-making.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, Matlab, Spark SQL, PySpark.
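A minimal sketch of the evaluation approach described above (cross-validation scored with ROC AUC, with L2 regularization against overfitting); the synthetic data and model choice are illustrative assumptions.

```python
# Minimal sketch: stratified k-fold cross-validation scored with ROC AUC.
# The synthetic, imbalanced dataset and the model choice are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=30,
                           weights=[0.9, 0.1], random_state=0)

# L2-regularized logistic regression; C is the inverse regularization strength.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# Stratified folds preserve the class ratio in each split, which matters
# when the goal is a low false positive rate on imbalanced data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
```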
Confidential, Jersey City, New Jersey
Data Scientist
Responsibilities:
- Provided configuration management and build support for more than 5 different applications that were built and deployed to production and lower environments.
- Implemented public segmentation using unsupervised machine learning by applying the k-means algorithm in PySpark (see the sketch following this role).
- Explored and extracted data from source XML files in HDFS, preparing the data for exploratory analysis via data munging.
- Responsible for various data mapping activities from source systems to Teradata.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Used R and Python for exploratory data analysis, A/B testing, ANOVA tests, and hypothesis tests to compare and identify the effectiveness of creative campaigns.
- Created clusters to classify control and test groups and conducted group campaigns.
- Analyzed and calculated the lifetime cost of each individual in the welfare system using 20 years of historical data.
- Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Developed triggers, stored procedures, functions, and packages using cursor and ref cursor concepts in PL/SQL.
- Created various types of data visualizations using R, python and Tableau.
- Used Python, R, and SQL to create statistical algorithms involving multivariate regression, linear regression, logistic regression, PCA, random forest models, decision trees, and support vector machines for estimating the risk of welfare dependency.
- Identified and targeted high-risk welfare groups with machine learning algorithms.
- Conducted campaigns and ran real-time trials to determine what works quickly, and tracked the impact of different initiatives.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create new database designs via an easy-to-use graphical interface.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis, and cluster analysis using SAS programming.
- Used a metadata tool for importing metadata from the repository, creating new job categories, and creating new data elements.
- Scheduled tasks for weekly updates and ran the model in a workflow; automated the entire process flow for generating analyses and reports.
Environment: R 3.x, HDFS, Hadoop 2.3, Pig, Hive, Linux, RStudio, Tableau 10, SQL Server, MS Excel, PySpark.
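A minimal sketch of the k-means segmentation in PySpark described above; the input schema, feature columns, and number of clusters are illustrative assumptions.

```python
# Minimal sketch: k-means segmentation with PySpark ML.
# The schema and k are illustrative assumptions; a real job would
# load the data from HDFS rather than build it inline.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation").getOrCreate()

df = spark.createDataFrame(
    [(25, 40000.0), (31, 52000.0), (58, 91000.0), (45, 67000.0)],
    ["age", "income"],
)

# Assemble raw columns into the single vector column expected by Spark ML.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
features = assembler.transform(df)

kmeans = KMeans(k=2, seed=1, featuresCol="features", predictionCol="segment")
model = kmeans.fit(features)
model.transform(features).select("age", "income", "segment").show()
```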
Confidential - New York, NY
Data Scientist
Responsibilities:
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC)
- Performed data ETL by collecting, exporting, merging and massaging data from multiple sources and platforms including SSIS (SQL Server Integration Services) in SQL Server.
- Worked with cross-functional teams (including the data engineering team) to rapidly extract data from MongoDB through the MongoDB Connector for Hadoop.
- Performed data cleaning and feature selection using the MLlib package in PySpark.
- Performed partitional clustering into 100 clusters via k-means using the scikit-learn package in Python, grouping together similar hotels for a given search.
- Used Python to perform ANOVA tests to analyze the differences among hotel clusters (see the sketch following this role).
- Implemented various machine learning algorithms and statistical models such as decision tree, naive Bayes, logistic regression, and linear regression using Python to determine the accuracy rate of each model.
- Determined the most accurate prediction model based on the accuracy rate.
- Used a text-mining process on reviews to determine customers' main concerns.
- Delivered analysis support for hotel recommendations and provided an online A/B test.
- Designed Tableau bar graphs, scatter plots, and geographical maps to create detailed summary reports and dashboards.
- Developed a hybrid model to improve the accuracy rate.
- Delivered the results to the operations team for better decisions and feedback.
Environment: Python, PySpark, Tableau, MongoDB, Hadoop, SQL Server, SDLC, ETL, SSIS, recommendation systems, machine learning algorithms, text mining, A/B testing
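A minimal sketch of the one-way ANOVA across hotel clusters mentioned above, using SciPy; the per-cluster samples and the metric are illustrative assumptions.

```python
# Minimal sketch: one-way ANOVA across three hotel clusters.
# The per-cluster samples of a metric (e.g., booking rate) are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.30, 0.05, 200)
cluster_b = rng.normal(0.32, 0.05, 200)
cluster_c = rng.normal(0.35, 0.05, 200)

# H0: all cluster means are equal; a small p-value suggests that at
# least one cluster's mean differs from the others.
f_stat, p_value = stats.f_oneway(cluster_a, cluster_b, cluster_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```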
Confidential - Wilmington, DE
Data Scientist
Responsibilities:
- Participated in all phases of research including data collection, data cleaning, data mining, developing models and visualizations.
- Collaborated with data engineers and operation team to collect data from internal system to fit the analytical requirements.
- Redefined many attributes and relationships and cleansed unwanted tables/columns using SQL queries.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Performed data imputation using the scikit-learn package in Python.
- Performed data processing using Python libraries such as NumPy and Pandas.
- Performed data analysis using the ggplot2 library in R to create data visualizations for a better understanding of customer behavior.
- Visually plotted data using Tableau for dashboards and reports.
- Implemented statistical modeling with the XGBoost machine learning package in R to determine the predicted probabilities of each model (see the sketch following this role).
- Delivered the results to the operations team for better decisions.
Environment: Python, R, SQL, Tableau, Spark, XGBoost, recommendation systems.
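The XGBoost modeling above was done in R; for consistency with the other sketches, here is the equivalent workflow in Python with the xgboost package. The data and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: an XGBoost classifier producing predicted probabilities,
# analogous to the R workflow described above. Data and hyperparameters
# are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_train, y_train)

# predict_proba returns per-class probabilities; column 1 is the
# predicted probability of the positive class.
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])
```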
Confidential
Python Developer
Responsibilities:
- Developed the entire set of frontend and backend modules using Python on the Django web framework (see the sketch following this role).
- Implemented the presentation layer with HTML, CSS and JavaScript.
- Involved in writing stored procedures using Oracle.
- Optimized the database queries to improve the performance.
- Designed and developed data management system using Oracle.
Environment: MySQL, Oracle, HTML5, CSS3, JavaScript, Shell, Linux & Windows, Django, Python
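A minimal sketch of a Django view wired to a URL route, of the kind described above; the app, model, and template names are hypothetical stand-ins.

```python
# Minimal sketch of a Django view plus URL route. The Order model and
# template path are hypothetical stand-ins for the real application.

# views.py
from django.shortcuts import render
from .models import Order  # hypothetical model backed by Oracle/MySQL

def order_list(request):
    # Query the backend via Django's ORM and render the presentation
    # layer (an HTML/CSS/JavaScript template).
    orders = Order.objects.order_by("-created_at")[:50]
    return render(request, "orders/order_list.html", {"orders": orders})

# urls.py
from django.urls import path

urlpatterns = [
    path("orders/", order_list, name="order-list"),
]
```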
Confidential
Programmer Analyst
Responsibilities:
- Effectively communicated with the stakeholders to gather requirements for different projects
- Used the MySQLdb package and MySQL Connector/Python for writing and executing several MySQL database queries from Python (see the sketch following this role).
- Created functions, triggers, views, and stored procedures using MySQL.
- Worked closely with back-end developer to find ways to push the limits of existing Web technology.
- Involved in the code review meetings.
Environment: Python, MySQL.
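A minimal sketch of executing a parameterized MySQL query from Python with the mysql-connector-python package, as described above; the connection details and schema are illustrative assumptions.

```python
# Minimal sketch: a parameterized query via mysql-connector-python.
# Connection details and the orders table are illustrative assumptions.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="report_user", password="secret", database="sales"
)
try:
    cursor = conn.cursor()
    # Parameterized queries let the driver escape values, which
    # guards against SQL injection.
    cursor.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE order_date >= %s GROUP BY customer_id",
        ("2015-01-01",),
    )
    for customer_id, total in cursor.fetchall():
        print(customer_id, total)
finally:
    conn.close()
```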