Data Scientist Resume
Houston, TX
PROFESSIONAL SUMMARY:
- Around 5 years of Machine Learning/Data Science experience developing and implementing large-scale algorithms that have significantly impacted business revenue and user experience.
- Developed intricate algorithms based on deep-dive Statistical Analysis and Predictive Data Modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.
- Experience in Data Mining, Machine Learning and Spark development with large datasets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling and Data Visualization.
- Analyzed and processed complex data sets using advanced Querying, Visualization and Analytics tools.
- Hands-on experience in applying several Machine Learning/Statistical Algorithms to real-world problems by using Deep Learning, Gradient Boosted Trees, Natural Language Processing, Random Forests, Clustering, Generalized Linear Models, Simulation Models and Gaussian Mixture Models.
- Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including Data Extraction, Data Cleaning, Data Engineering, Data Loading, Data Wrangling, Feature Scaling, Statistical Modeling (Decision Trees, Regression Models, Neural Networks, SVM, Clustering), Dimensionality Reduction and Factor Analysis, testing and validation using ROC plots and K-fold Cross Validation, Predictive Modeling using R and Python, and Data Visualization using Tableau.
- Expertise in implementing Dimensionality Reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Kernel PCA and Quadratic Discriminant Analysis (QDA) for Feature Extraction, and Backward Elimination, Forward Selection, Bidirectional Elimination and Score Comparison for Feature Selection.
- Strong experience in building end-to-end Machine Learning platforms using Java and Big Data technologies like Cassandra, Spark, Apache Hadoop, MapReduce, HDFS, HBase, Sqoop, Pig, MLlib, ETL and Hive.
- Highly skilled in Statistical Thinking, including Graphical and Quantitative EDA, Sentiment Analysis, Bootstrap Confidence Intervals, Correlation, Hypothesis Testing, Collaborative Filtering, Recommender Systems, Time Series, Inferential Statistics and Matrix Factorization, as well as Modeling Techniques to draw valid inferences.
- Strong understanding of the principles of Data Warehousing (OLAP) using the Kimball Methodology, Business Intelligence applications, Online Transaction Processing (OLTP), Fact Tables, Dimension Tables, and Star and Snowflake schema modeling.
- Highly skilled in Tableau Desktop for Data Visualization using Cross Map, Scatter Plots, Geographic Map, Pie Charts and Bar Charts, Page Trails, Heat Map and Density Chart.
- Expertise in dealing with Relational Database Management Systems including Normalization, Stored Procedures, Constraints, Querying, Joins, Keys, Indexes, data import/export, Triggers and Cursors.
- Comprehensive knowledge of and experience in writing queries in SQL, MySQL, NoSQL, PostgreSQL and R to Extract, Transform and Load (ETL) data from large datasets. Strong Data Analysis skills using Business Intelligence (BI), SQL and MS Office tools.
- Skilled in examining large databases such as Microsoft Azure, MongoDB, Cassandra, Oracle, SQL Server and DB2.
- Highly skilled in using various Data Science related libraries in Python such as Scikit-learn, OpenCV, NumPy, SciPy, Matplotlib, Pandas, Seaborn, Bokeh, NLTK, Gensim, NetworkX, Statsmodels, TensorFlow, Theano and Keras.
- Expertise in using various R libraries such as ggplot2, caret, caTools, Amelia, e1071, lubridate, missForest, CORElearn, bigrf, rpart, igraph, tree, randomForest, ltsa, lsmeans, ROCR, RWeka, arules, sqldf, RODBC and R Markdown.
- Strong understanding of AWS (Amazon Web Services) processes and concepts, including S3, Amazon RDS and Apache Spark RDDs. Developed Logical Data Architecture with adherence to Enterprise Architecture.
- Experienced in working with data modeling tools like CA Erwin, PowerDesigner, MS Visio and ER/Studio, and with data quality tools Informatica IDQ and Informatica MDM.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
- Extensive involvement with Git, Agile methodology and the SCRUM process. Strong business sense and the ability to communicate data insights to both technical and nontechnical clients.
TECHNICAL SKILLS:
Relational Databases: SQL Server, MS Access, Teradata, Oracle
Programming Languages: Python, R, Java, C, Scala, MATLAB, Pig
Cloud Technologies: Microsoft Azure, AWS
Markup Languages: XML, HTML, DHTML, XSLT, XPath, XQuery and UML
ETL Tools: Informatica PowerCenter, SSIS, SAS
Deployment Tools: Anaconda Enterprise, R-Studio, Azure Machine Learning Studio
Data Modeling Tools: MS Visio, Rational Rose, Erwin, ER/Studio 9.7
Big Data Tools: Hadoop, Hive, Apache Spark
Operating Systems: Red Hat Linux, Unix, Ubuntu, Windows
Reporting & Visualization: Microsoft BI, Tableau, Matplotlib, Seaborn, ggplot, SAP Business Objects, Crystal Reports, Cognos, Shiny, Splunk, QlikView
PROFESSIONAL EXPERIENCE:
Confidential, Houston, TX
Data Scientist
Responsibilities:
- Retrieved data from the Hadoop cluster by developing a pipeline using Hive (HQL) and SQL to pull data from an Oracle database, and used ETL for data transformation (see the Hive/Spark sketch after this section).
- Performed data wrangling to clean, transform and reshape the data utilizing the pandas library. Analyzed data using SQL, R, Java, Scala, Python and Apache Spark and presented analytical reports to management and technical teams.
- Worked with datasets of varying complexity, including both structured and unstructured data, and participated in all phases of data mining: data collection, data cleaning, variable selection, feature engineering, developing models, validation and visualization.
- Developed predictive models on large scale datasets to address various business problems through leveraging advanced statistical modeling, machine learning and deep learning.
- Analyzed historical data using various machine learning algorithms such as clustering, multiple linear regression, logistic regression, SVM, Naive Bayes, Random Forests, K-means and KNN.
- Conducted exploratory data analysis using Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, SciPy, NLTK in Python for developing various machine learning algorithms.
- Implemented Data Quality validation techniques to validate data and identified many anomalies. Extensively worked on statistical analysis tools and adept at writing code in Advanced Excel, R and Python.
- Performed model validation using test and validation sets via K-fold cross validation and statistical significance testing (see the cross-validation sketch after this section).
- Worked with various kinds of data (open source as well as internal); developed models for labeled and unlabeled datasets and worked with big data technologies such as Hadoop and Spark and cloud resources such as Azure and Google Cloud.
- Utilized Amazon Web Services' (AWS) S3, EC2 and EMR, as well as Python (Pandas, SciPy) and Spark, to model large-scale time series data.
- Used F-score, AUC/ROC, confusion matrix, precision and recall to evaluate different models' performance (see the metrics sketch after this section).
- Built multi-layer neural networks in Python using the Scikit-learn, Theano, TensorFlow and Keras packages to implement machine learning models (see the Keras sketch after this section).
- Developed ETL code to extract data from multiple sources and load it into the data warehouse using Informatica, and loaded data into AWS Redshift.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data.
- Created complex charts and graphs with drill-downs that allow various divisions to quickly locate outliers and correct anomalies.
- Developed stored procedures, functions, views, triggers and complex SQL queries using SQL Server T-SQL and Oracle PL/SQL.
- Worked with various data sources across multiple relational databases such as Oracle 11g/10g/9i and MS SQL Server, loading relational and flat files into the staging area, ODS, Data Warehouse and Data Mart.
- Used the R programming language to graphically critique the data and performed data mining. Interpreted business requirements and data mapping specifications and was responsible for extracting data per the business requirements.
- Participated in feature engineering such as feature generation, PCA, feature normalization and label encoding with Scikit-learn preprocessing; performed data imputation using various methods in the Scikit-learn package in Python (see the preprocessing-pipeline sketch after this section).
- Created reports, dashboards and data visualizations using Tableau to clearly communicate data insights, significant features, and model scores and performance to both technical and business teams.
Environment: Python 3.6.4, R Studio, MLlib, Regression, NoSQL, SQL Server, Hive, Hadoop Cluster, ETL, Spyder 3.6, Agile, Tableau, Java, NumPy, Pandas, Matplotlib, Power BI, Scikit-learn, Seaborn, e1071, ggplot2, Shiny, TensorFlow, AWS, Azure, HTML.
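A minimal sketch of the Hive/Spark extraction pipeline described above, assuming a Hive metastore reachable from Spark; the database, table and column names (sales_db.orders, order_id, amount, order_date) are hypothetical placeholders.

```python
# Sketch: pull data out of a Hive table with HQL via PySpark, then do light ETL.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("hive-extract")
         .enableHiveSupport()   # lets Spark run HiveQL against the metastore
         .getOrCreate())

# HQL query against a (hypothetical) Hive table
df = spark.sql("""
    SELECT order_id, customer_id, amount, order_date
    FROM sales_db.orders
    WHERE order_date >= '2018-01-01'
""")

# Light transformation step: deduplicate, fix types, then hand off locally
clean = (df.dropDuplicates(["order_id"])
           .withColumn("amount", col("amount").cast("double")))
pdf = clean.toPandas()   # suitable for small result sets only
```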
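A short sketch of the K-fold cross-validation step mentioned above, using scikit-learn; the synthetic dataset and the logistic regression model are illustrative stand-ins.

```python
# Sketch: 5-fold stratified cross-validation with AUC as the validation metric.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

print("AUC per fold:", np.round(scores, 3))
print(f"Mean AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```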
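A sketch of the evaluation metrics named above (F-score, AUC/ROC, confusion matrix, precision, recall) computed with scikit-learn; the label and probability arrays are made-up examples.

```python
# Sketch: standard classification metrics on held-out predictions.
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1])                   # true labels
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])                   # hard predictions
y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.3, 0.7, 0.95])  # scores

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_prob))
```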
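A sketch of a small multi-layer neural network in Keras of the kind referenced above; the layer sizes, the synthetic data and the binary target are illustrative assumptions.

```python
# Sketch: a two-hidden-layer network for binary classification in Keras.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(500, 20).astype("float32")   # stand-in feature matrix
y = (X.sum(axis=1) > 10).astype("float32")      # stand-in binary target

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```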
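A sketch of the feature-engineering steps listed above (imputation, normalization, PCA) wired into a scikit-learn Pipeline; the dataset and the final classifier are illustrative.

```python
# Sketch: imputation -> scaling -> PCA -> model, as one reproducible pipeline.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=30, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # data imputation
    ("scale", StandardScaler()),                   # feature normalization
    ("pca", PCA(n_components=10)),                 # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("Variance explained by kept components:",
      round(pipe.named_steps["pca"].explained_variance_ratio_.sum(), 3))
```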
Confidential, South Jordan, UT
Data Scientist
Responsibilities:
- Performed data wrangling to clean, transform and reshape the data utilizing the pandas library. Analyzed data using SQL, R, Java, Scala, Python and Apache Spark and presented analytical reports to management and technical teams.
- Worked with different datasets including both structured and unstructured data, and participated in all phases of data mining: data collection, data cleaning, variable selection, feature engineering, developing models, validation and visualization.
- Developed predictive models on large scale datasets to address various business problems through leveraging advanced statistical modeling, machine learning and deep learning.
- Implemented population segmentation with unsupervised machine learning, applying the K-means algorithm in PySpark after data munging (see the PySpark K-means sketch after this section).
- Applied machine learning in Python for NLP text classification and churn prediction (see the text-classification sketch after this section).
- Worked on different Machine Learning models like Logistic Regression, Multi-layer perceptron classifier and K-means clustering.
- Led discussions with users to gather business process requirements and data requirements, and developed a variety of conceptual, logical and physical data models.
- Expertise in Business intelligence and Data Visualization tools like Tableau.
- Handled importing data from various data sources, performed transformations using Hive, MapReduce and loaded data into HDFS.
- Good knowledge of Azure cloud services and Azure Storage for managing and configuring data.
- Used R and Python for Exploratory Data Analysis to compare and identify the effectiveness of the data.
- Created clusters to classify control and test groups.
- Analyzed and calculated the lifetime cost of each individual in a welfare system using 20 years of historical data.
- Developed triggers, stored procedures, functions and packages using cursors associated with the project in PL/SQL.
- Used Python, R and SQL to create statistical models involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees and SVM for estimating and identifying the risks of welfare dependency.
- Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, utilizing collaborative filtering techniques to recommend policies to different customers (see the collaborative-filtering sketch after this section).
- Performed analysis such as Regression analysis, Logistic Regression, Discriminant Analysis, Cluster analysis using SAS programming.
- Worked on NoSQL databases including Cassandra, MongoDB, MarkLogic and HBase to assess their advantages and disadvantages for particular project goals.
Environment: Hadoop, HDFS, Python 3.x (Scikit-learn/Keras/SciPy/NumPy/Pandas/Matplotlib/NLTK/Seaborn), R (ggplot2/caret/trees/arules), Tableau (9.x/10.x), Machine Learning (Logistic Regression/Random Forests/KNN/K-Means Clustering/Hierarchical Clustering/Ensemble Methods/Collaborative Filtering), GitHub, Agile/SCRUM
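A minimal sketch of the PySpark K-means segmentation described above; the input columns (recency, frequency, monetary) and the tiny inline dataset are hypothetical.

```python
# Sketch: assemble numeric columns into a feature vector and cluster with K-means.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("segmentation").getOrCreate()
df = spark.createDataFrame(
    [(10.0, 3.0, 120.0), (2.0, 9.0, 480.0), (30.0, 1.0, 40.0)],
    ["recency", "frequency", "monetary"])

assembler = VectorAssembler(inputCols=["recency", "frequency", "monetary"],
                            outputCol="features")
features = assembler.transform(df)

model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
model.transform(features).select("recency", "frequency", "prediction").show()
```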
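A sketch of the NLP text-classification approach mentioned above, using TF-IDF features and logistic regression in scikit-learn; the tiny labeled corpus is purely illustrative.

```python
# Sketch: TF-IDF vectorization + logistic regression for churn-flavored text labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["great service, very happy",
         "slow support, canceling soon",
         "love the product",
         "terrible experience, will churn"]
labels = [0, 1, 0, 1]   # 1 = likely to churn (made-up labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram + bigram features
    ("model", LogisticRegression()),
])
clf.fit(texts, labels)
print(clf.predict(["support was terrible"]))
```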
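A sketch of item-based collaborative filtering of the kind used in the recommendation system above; the user-policy interaction matrix and the policy names are made up.

```python
# Sketch: recommend the unheld policy most similar to what a customer already holds.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows = customers, columns = policies; 1 = customer holds that policy
interactions = pd.DataFrame(
    [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 1]],
    columns=["auto", "home", "life", "travel"])

sim = cosine_similarity(interactions.T)   # item-item similarity

held = interactions.iloc[0].to_numpy()    # customer 0's current policies
scores = sim @ held                       # similarity-weighted scores
scores[held == 1] = -np.inf               # don't re-recommend held policies
print("Recommend:", interactions.columns[int(np.argmax(scores))])
```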
Confidential
Data Analyst
Responsibilities:
- Developed complex SQL queries, stored procedures, views, functions and reports that qualify customer requirements using Microsoft SQL Server.
- Worked with the ETL team to document the transformation rules for Data migration from OLTP to Warehouse environment for reporting purposes.
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms and utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, & KNN for data analysis.
- Optimized query performance by modifying T-SQL queries, removing unnecessary columns and redundant data, normalizing tables, establishing joins and creating indexes.
- Implemented predictive analytics and machine learning algorithms to forecast key metrics, delivered as dashboards built on AWS and the Django platform.
- Implemented supervised, semi-supervised and unsupervised machine learning algorithms for tasks such as classification, regression and clustering.
- Used various machine learning algorithms, like decision trees and forests, support vector machines, and deep networks (CNNs, RNNs, and LSTMs).
- Performed univariate and multivariate analysis on the data to identify underlying patterns and associations between variables.
- Explored and analyzed customer-specific features using Matplotlib and ggplot2. Extracted structured data from MySQL databases, developed basic visualizations and analyzed A/B test results.
- Designed and implemented statistical tests including hypothesis testing, ANOVA and Chi-square tests to verify models' significance using R (see the Chi-square sketch after this section).
- Participated in feature engineering such as feature generation, PCA, feature normalization and label encoding with Scikit-learn preprocessing; performed data imputation using various methods in the Scikit-learn package in Python.
- Worked with business stakeholders to refine and respond to their ad hoc requests and to improve their existing reporting and dashboards as necessary.
- Applied predictive analytics to target the right customer at the right time based on past behavior and choices, helping boost revenue through proper planning and reducing long-term operational costs.
- Analyzed customer sentiment, customer experience and company positioning to make the customer experience richer and smoother.
- Developed large-scale data analytics solutions in machine learning, such as regression, KNN, random forest, SVM and K-means, to solve classification and clustering problems.
Environment: Python, Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, NLTK, Seaborn, RStudio, ggplot2, trees, arules, Tableau (9.x/10.x), Machine Learning, Logistic Regression, Random Forests, KNN, K-Means Clustering.
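A sketch of a Chi-square test of independence like those mentioned above; it is shown here in Python with SciPy rather than R (the original work used R), and the contingency counts are hypothetical.

```python
# Sketch: Chi-square test of independence on a 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows = groups, columns = outcome counts (made-up numbers)
table = np.array([[48, 152],
                  [65, 135]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p:.4f}, dof={dof}")
# Reject the independence hypothesis at the 5% level when p < 0.05
```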
Confidential
Data Analyst
Responsibilities:
- Built various machine learning models on home loan customer data to predict credit risk, fraud, customer churn and target marketing.
- Performed data wrangling: preprocessed and transformed the data into useful formats, including joining multiple tables using Pandas (see the Pandas wrangling sketch at the end of this section).
- Performed data pre-processing tasks like sorting, merging, treating the missing values, outliers, lexical errors, transforming the date-time fields and more, preparing it for statistical analysis like univariate, multivariate and correlation analysis.
- Implemented numerous Feature Engineering techniques to generate new features from the existing ones and tested their performance.
- Performed a complete data analysis on the Home loan data, extracted meaningful insights from the data regarding customer demographics, No. of previous loans, timely payments ranking and more.
- Used Tableau for building and publishing customized interactive visualizations to present the analysis results by finding patterns, anomalies and predictions.
- Worked with big data tools such as Apache Spark and Hadoop for data processing and parallel computing.
- Identified the target groups for the home loans by conducting Segmentation analysis using Clustering techniques like K-Means.
- Implemented regression models to estimate loan interest rates for customers with low credit scores, validating various regression models such as Support Vector Regression, Random Forest Regression and Multiple Linear Regression (see the regression-comparison sketch at the end of this section).
- Developed multiple classification models such as Naïve Bayes, Random Forests and XGBoost to predict potential defaulters, and compared their performance with that of the current working model.
Environment: Python 2.x/3.x, PyCharm, Tableau 10.x/9.x, Apache Spark, SQL, Spark MLlib, HDFS, Regression, Cluster analysis, Sklearn, NLTK, Hadoop Hive
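A minimal sketch of the Pandas table-joining and cleaning work described above; the customer and loan tables, and all column names, are hypothetical.

```python
# Sketch: left-join customer and loan tables, fix types, and aggregate per customer.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "credit_score": [640, 710, 580]})
loans = pd.DataFrame({"customer_id": [1, 1, 3],
                      "amount": [12000, 5000, 20000],
                      "opened": ["2017-03-01", "2018-06-15", "2018-01-20"]})

merged = customers.merge(loans, on="customer_id", how="left")
merged["opened"] = pd.to_datetime(merged["opened"])   # transform date-time fields
merged["amount"] = merged["amount"].fillna(0)         # treat missing values

per_customer = merged.groupby("customer_id")["amount"].agg(["count", "sum"])
print(per_customer)
```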
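A sketch of validating the regression models listed above via cross-validation; the synthetic data stands in for the real loan features, and R^2 is used as the comparison score.

```python
# Sketch: compare SVR, random forest and linear regression with 5-fold CV.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=12, noise=10, random_state=7)

models = {
    "Support Vector Regression": SVR(),
    "Random Forest Regression": RandomForestRegressor(random_state=7),
    "Multiple Linear Regression": LinearRegression(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```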