Data Scientist Resume
Plano, TX
PROFESSIONAL SUMMARY:
- Qualified Data Scientist with 6+ years of experience in Data Science and Analytics, including Deep Learning/Machine Learning, Data Mining, and Statistical Analysis.
- Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modeling, and data visualization with large sets of structured and unstructured data; created ER diagrams and schemas.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
- Implemented Bagging and Boosting to enhance model performance (a minimal sketch follows this summary).
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing, and ANOVA.
- Extensively worked with Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK, and scikit-learn).
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, caret, dplyr), and Excel.
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMSs such as SQL Server 2008 and NoSQL databases such as MongoDB 3.2.
- Strong experience in Big Data technologies such as Spark 1.6, Spark SQL, PySpark, Hadoop 2.x, HDFS, and Hive 1.x.
- Experience with visualization tools such as Tableau 9.x/10.x for creating dashboards.
- Excellent understanding of Agile and Scrum development methodologies.
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools, and the application of Statistical Concepts.
- Proficient in Predictive Modeling, Data Mining Methods, Factor Analysis, ANOVA, hypothesis testing, normal distribution, and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Social Network Analysis, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python scikit-learn.
- Experienced in using Python to manipulate data for loading and extraction, and worked with Python libraries such as Matplotlib, NumPy, SciPy, and Pandas for data analysis.
- Worked with analytical applications such as R, SAS, MATLAB, and SPSS to develop neural networks and cluster analyses.
- Strong SQL programming skills, with experience working with functions, packages, and triggers.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Proficient in Tableau and R-Shiny data visualization tools for analyzing large datasets and creating visually powerful, actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms such as Tableau.
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
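For illustration, a minimal sketch of the bagging and boosting approach mentioned above, using scikit-learn. The synthetic dataset and hyperparameter values are assumptions for demonstration, not settings from any project described here.

```python
# Sketch: compare a single decision tree against bagged and boosted
# ensembles with cross-validation. Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42),
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("%s: mean CV accuracy = %.3f" % (name, scores.mean()))
```

Bagging reduces variance by averaging trees trained on bootstrap samples; boosting reduces bias by fitting trees sequentially to the previous trees' errors.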
TECHNICAL SKILLS:
Machine Learning: Classification, Regression, Feature Engineering, Clustering, Neural Networks
Statistical Methods: Time Series, Regression models, Splines, Confidence Intervals, Principal Component Analysis and Dimensionality Reduction, Bootstrapping
Programming languages: Python (pandas, NumPy, scikit-learn), R, SQL, ML, Excel
Selected Coursework: Machine Learning, Linear Algebra, Multivariate Calculus, Probability and Statistics, Visualization, Big Data Analysis
Hadoop Components: MapReduce V2, HBase, HIVE, Pig, Sqoop, Oozie, Kafka
Spark Components: Spark Core 1.6, Spark SQL, Spark Streaming
WORK EXPERIENCE:
Data Scientist
Confidential, Plano, TX
Responsibilities:
- Performed Data Profiling to learn about user behavior, using features such as traffic pattern, location, and date and time.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
- Explored DAGs, their dependencies, and logs using Airflow pipelines for automation. Performed data cleaning and feature selection using the MLlib package in PySpark (a PySpark sketch follows this section) and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, Python, and R code for a regular-expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources. Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
- Evaluated models using cross-validation, the log-loss function, and ROC curves, and used AUC for feature selection (see the pipeline sketch at the end of this section); worked with Elastic technologies such as Elasticsearch and Kibana.
- Categorized comments from different social networking sites into positive and negative clusters using Sentiment Analysis and Text Analytics.
- Analyzed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model had a low false-positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
- Created and designed reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed a MapReduce pipeline for feature extraction using Hive and Pig.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated results to the operations team to support better decision making.
- Collected data needs and requirements by interacting with other departments.
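As referenced in the bullets above, a minimal sketch of the evaluation approach: feature scaling, PCA, and L1/L2-regularized logistic regression scored with cross-validated ROC AUC. The synthetic data and parameter grid are illustrative assumptions, not the original project settings.

```python
# Sketch: scaling + PCA + regularized logistic regression,
# selected by cross-validated ROC AUC. Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),        # feature scaling
    ("pca", PCA(n_components=0.95)),    # keep 95% of the variance
    ("clf", LogisticRegression(solver="saga", max_iter=5000)),
])

# Compare L1 vs. L2 regularization and strengths via cross-validation.
grid = GridSearchCV(
    pipe,
    {"clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV ROC AUC: %.3f" % grid.best_score_)
```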
Environment: Python 2.x, R, CDH5, HDFS, Hadoop 2.3, Hive, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.
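A minimal sketch of feature selection with Spark's ML library (MLlib) in PySpark, as referenced in the responsibilities above. The column names and toy rows are hypothetical stand-ins.

```python
# Sketch: chi-squared feature selection in PySpark.
# Column names ("f1".."f4", "label") and data are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

spark = SparkSession.builder.appName("feature-selection-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 0.0, 3.0, 2.0, 1), (0.0, 1.0, 1.0, 4.0, 0), (2.0, 0.0, 5.0, 1.0, 1)],
    ["f1", "f2", "f3", "f4", "label"],
)

# Combine raw columns into a single feature vector, then keep the
# two features most associated with the label.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3", "f4"], outputCol="features")
selector = ChiSqSelector(
    numTopFeatures=2, featuresCol="features",
    outputCol="selected_features", labelCol="label",
)

assembled = assembler.transform(df)
selected = selector.fit(assembled).transform(assembled)
selected.select("selected_features", "label").show()
```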
Data Scientist
Confidential - Charlotte, NC
Responsibilities:
- Identified and targeted welfare high-risk groups with Machine Learning/Deep Learning algorithms. Conducted campaigns and ran real-time trials to determine what works fast and tracked the impact of different initiatives.
- Created positive and negative clusters from merchant transactions using Sentiment Analysis to test the authenticity of transactions and resolve any chargebacks.
- Analyzed and calculated the lifetime cost of everyone in the welfare system using 20 years of historical data.
- Explored and extracted data from XML sources into HDFS, and used ETL to prepare the data for exploratory analysis via data munging.
- Handled importing data from various data sources into HDFS, and performed transformations using Hive and Pig.
- Created control and test group clusters and conducted group campaigns using Text Analytics.
- Used Python and R to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.
- Used R and Python for Exploratory Data Analysis, A/B testing, ANOVA tests, and hypothesis tests to compare and identify the effectiveness of creative campaigns (see the sketch at the end of this section).
- Responsible for various data mapping activities from source systems to Teradata, as well as text mining and building models using topic analysis and sentiment analysis for both semi-structured and unstructured data.
- Created various types of data visualizations using R, Python, and Tableau.
- Developed Tableau visualizations and dashboards using Tableau Desktop .
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis, and cluster analysis using Python.
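A minimal sketch of the A/B-testing and ANOVA workflow referenced above, using SciPy. The campaign response data is hypothetical, generated only to make the example runnable.

```python
# Sketch: one-way ANOVA across campaign groups, plus an A/B-style
# two-sample t-test. Group data is hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
campaign_a = rng.normal(loc=0.12, scale=0.05, size=200)  # e.g. response rates
campaign_b = rng.normal(loc=0.15, scale=0.05, size=200)
campaign_c = rng.normal(loc=0.11, scale=0.05, size=200)

# One-way ANOVA: do any of the campaign means differ?
f_stat, p_anova = stats.f_oneway(campaign_a, campaign_b, campaign_c)
print("ANOVA: F = %.2f, p = %.4f" % (f_stat, p_anova))

# A/B follow-up: two-sample t-test between campaigns A and B.
t_stat, p_ttest = stats.ttest_ind(campaign_a, campaign_b)
print("A vs B t-test: t = %.2f, p = %.4f" % (t_stat, p_ttest))
```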
Environment: R 3.x, HDFS, Hadoop 2.3, Pig, Hive, Linux, RStudio, Tableau 10, SQL Server, MS Excel.
Data Scientist
Confidential, New York, NY
Responsibilities:
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC)
- Performed data ETL by collecting, exporting, merging, and massaging data from multiple sources and platforms, including SSIS (SQL Server Integration Services) in SQL Server.
- Worked with cross-functional teams (including the data engineering team) to extract data and rapidly execute queries against MongoDB through the MongoDB Connector for Hadoop.
- Performed data cleaning and feature selection using the MLlib package in PySpark.
- Performed partitional clustering of hotels into 100 groups with k-means using the scikit-learn package in Python, so that similar hotels for a search are grouped together (see the sketch at the end of this section).
- Used Python to perform an ANOVA test to analyze the differences among hotel clusters.
- Applied various machine learning algorithms and statistical models, such as Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression, and Linear Regression, using Python to determine the accuracy rate of each model.
- Determined the most accurate prediction model based on the accuracy rate.
- Used a text-mining process on reviews to determine customers' concentrations.
- Delivered analysis support for hotel recommendations and provided an online A/B test.
- Designed Tableau bar graphs, scatter plots, and geographical maps to create detailed summary reports and dashboards.
- Developed a hybrid model to improve the accuracy rate.
- Delivered the results to the operations team for better decisions and feedback.
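A minimal sketch of the k-means partitioning referenced above, using scikit-learn. The feature matrix is a hypothetical stand-in for per-hotel search features; the choice of 100 clusters follows the bullet above.

```python
# Sketch: partition hotels into 100 groups with k-means.
# The feature matrix is hypothetical random data for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
hotel_features = rng.normal(size=(10000, 12))  # hypothetical hotel features

X = StandardScaler().fit_transform(hotel_features)  # scale before k-means
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_  # cluster assignment per hotel
print("first ten cluster sizes:", np.bincount(labels)[:10])
```

The per-cluster labels can then feed the ANOVA step above to test whether the clusters differ on a metric of interest.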
Environment: Python, Tableau, MongoDB, Hadoop, SQL Server, SDLC, ETL, SSIS, Recommendation Systems, Machine Learning Algorithms, Text-mining Process, A/B test
Big Data Analyst
Confidential, Basking Ridge, NJ
Responsibilities:
- Loaded and transformed large sets of structured, semi structured and unstructured data in various formats like text, zip, XML and JSON.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
- Wrote many independent Pig scripts and Python UDFs for extracting analytical information.
- Developed simple to complex MapReduce jobs for data cleaning and preprocessing using Hive, Pig, and MapReduce.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Responsible for managing data from multiple sources.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Developed multiple MapReduce jobs to clean and preprocess large amounts of customer behavioral data, obtaining business insights on TV and Internet end users.
- Analyzed data and predicted end customer behaviors and TV product performance by applying machine learning algorithms using R.
- Imported and transformed large volumes of data from various data sources to HDFS and HBase.
- Developed Python scripts for Hadoop Streaming jobs to process XML, JSON, and CSV data (a mapper sketch follows this section).
- Integrated Oozie with Hadoop jobs, including MapReduce, Pig, Hive, Sqoop, and Kafka.
- Collaborated with the network, database, and BI teams to ensure data quality and availability.
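A minimal sketch of a Python mapper for a Hadoop Streaming job, as referenced above. The JSON field names ("user_id", "event") are hypothetical.

```python
#!/usr/bin/env python
# Sketch: Hadoop Streaming mapper that flattens JSON records to
# tab-separated key/value pairs. Field names are hypothetical.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip malformed records
    # Emit "user_id <TAB> event" so a reducer can aggregate per user.
    print("%s\t%s" % (record.get("user_id", ""), record.get("event", "")))
```

Such a script is typically submitted with the hadoop-streaming JAR (the JAR path varies by distribution) alongside a reducer that aggregates the emitted key/value pairs.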
Environment: Hadoop, Cloudera Manager, HDFS, Hive, Pig, RStudio, Python, Java, Kafka, Sqoop.