
Data Scientist Resume


Plano, TX

SUMMARY

  • Highly efficient Data Scientist/Data Analyst with 9+ years of experience in Data Analysis, Machine Learning, and Data Mining with large data sets of structured and unstructured data, as well as Data Acquisition, Data Validation, Predictive Modeling, Data Visualization, and Web Scraping. Adept in statistical programming languages such as R and Python, and in Big Data technologies such as Hadoop and Hive.
  • Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization (see the PCA/cross-validation sketch following this summary).
  • Adept in, and with a deep understanding of, statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
  • Able to build advanced statistical and predictive models, such as generalized linear models, decision trees, neural networks, ensemble models, Support Vector Machines (SVM), and Random Forests.
  • Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction, and with developing, deploying, and maintaining scalable production NLP models. Creative thinker who proposes innovative ways to look at problems by applying data mining approaches to the available information.
  • Experience in working with relational databases (Teradata, Oracle) with advanced SQL programming skills.
  • Experience with Big Data platforms such as Hadoop (MapR, Hortonworks & others), Aster, and graph databases.
  • Identifies/creates the appropriate algorithms to discover patterns and validates findings using an experimental and iterative approach.
  • Worked closely with product managers, service development managers, and the product development team to productize the algorithms developed.
  • Experience in operations research / optimization.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
  • Skilled in performing data parsing, data manipulation, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape (see the pandas sketch following this summary).
  • Experience using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and Beautiful Soup.
  • Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau.
  • Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, PySpark, and Spark SQL.
  • Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
  • Good knowledge of Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using data munging.
  • Good industry knowledge, analytical & problem-solving skills, and the ability to work well within a team as well as individually.
  • Experience in designing stunning visualizations using Tableau software and publishing and presenting dashboards, Storyline on web and desktop platforms.
  • Experience and technical proficiency in designing and data modeling online applications; solution lead for architecting Data Warehouse/Business Intelligence applications.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Highly skilled in using visualization tools like Tableau, ggplot2, Dash, and Flask for creating dashboards.
  • Worked with and extracted data from various database sources like Oracle, SQL Server, and DB2, regularly using JIRA and other internal issue trackers for project development.
  • Highly creative, innovative, committed, intellectually curious, business savvy with good communication and interpersonal skills.
  • Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
  • Experience implementing deep learning algorithms such as Artificial Neural Networks (ANN) and Recurrent Neural Networks (RNN), tuning hyperparameters and improving models with Python packages such as TensorFlow.
  • Extensive experience operating Big Data pipelines (Spark, Hive, Presto, SQL engines), both batch and streaming.
  • Extracted data from HDFS and prepared data for exploratory analysis using data munging.
  • Implemented, tuned and tested the model on AWS EC2 with the best algorithm and parameters.
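
A minimal, illustrative sketch (not project code) of the PCA-plus-K-fold-cross-validation workflow referenced in the summary above, using scikit-learn's bundled breast cancer dataset as a stand-in:

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Scale features, reduce to 5 principal components, then classify
    model = make_pipeline(StandardScaler(), PCA(n_components=5),
                          LogisticRegression(max_iter=1000))

    # 5-fold cross-validated ROC AUC
    print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())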
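
A minimal sketch of the pandas-style data preparation referenced in the summary (describe, melt, split, remap, and reshape); the columns and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"region": ["east", "west", "east"],
                       "q1_sales": [100, 80, 120],
                       "q2_sales": [110, 90, 115]})

    # Describe data contents and compute descriptive statistics
    print(df.describe())

    # Melt wide quarterly columns into a long format
    long_df = df.melt(id_vars="region", var_name="quarter", value_name="sales")

    # Split the column labels, then reshape back with a pivot table
    long_df["quarter"] = long_df["quarter"].str.split("_").str[0]
    print(long_df.pivot_table(index="region", columns="quarter", values="sales"))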

TECHNICAL SKILLS

Languages: C, C++, Java, Scala, Python 2.x/3.x, R/R Studio (packages: stats, zoo, Matrix, data.table, OpenSSL), SAS, SAS Enterprise Guide, SQL, XML, Shell Scripting, Maven, Spark 2.x (Spark SQL, Spark Streaming), Hadoop, MapReduce, HDFS, Eclipse, Anaconda, Jupyter Notebook

NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Statistics: Hypothesis Testing, ANOVA, Confidence Intervals, Bayes' Law, MLE, Fisher Information, Principal Component Analysis (PCA), Cross-Validation, Correlation

BI Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP BusinessObjects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse

Algorithms: Logistic Regression, Linear Regression, Lasso Regression, Generalized Linear Models, Random Forest, XGBoost, KNN, SVM, Neural Networks, K-Means Clustering, Boxplots

Tools: SVN, PuTTY, WinSCP, Redmine (Bug Tracking, Documentation, Scrum), Teradata, Tableau, H2O Flow, Splunk, GitHub

Data Analysis and Data Science: Deep Neural Networks, Logistic Regression, Decision Trees, Random Forests, KNN, XGBoost, Ensembles (Bagging, Boosting), Support Vector Machines, Neural Networks, graph/network analysis, time series analysis (ARIMA models), NLP

Big Data: Hadoop, HDFS, Hive, PuTTY, Spark, Scala, Sqoop

Reporting Tools: MS Office (Word/Excel/PowerPoint/ Visio/Outlook), Crystal Reports XI, SSRS

Database Design Tools and Data Modeling: MS Visio, ERwin 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimension tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball & Inmon methodologies

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

Data Scientist

Responsibilities:

  • Performed data science processes such as data mining, data collection, data cleansing, and dataset preparation for machine learning models in an RDBMS database using T-SQL and Netezza.
  • Used analysis processes such as trend analysis, predictive modeling, machine learning, statistics, and other data analysis techniques to collect, explore, and identify data that explains customer behavior and segmentation, text analytics, big data analytics, product-level analysis, and customer experience analysis.
  • Used ARIMA models to forecast time series from past values; built an optimal ARIMA model from scratch and extended it to Seasonal ARIMA (SARIMA) and SARIMAX models using Python (a SARIMAX sketch follows this section).
  • Analyzed data and performed data preparation by applying a historical model to the data set in Azure ML.
  • Performed data cleaning, applying backward/forward-filling methods to handle missing values in the dataset (a fill sketch follows this section).
  • Performed data transformation methods for rescaling and normalizing variables.
  • Developed and validated a KNN predictive model to predict the target label.
  • Planned, developed, and applied leading-edge analytic and quantitative tools and modeling techniques to help clients gain insights and improve decision-making.
  • Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
  • Built and tested different ensemble models such as bootstrap aggregating (bagged decision trees and Random Forest), gradient boosting, XGBoost, and AdaBoost to improve accuracy, reduce variance and bias, and improve model stability (an ensemble sketch follows this section).
  • Used advanced SQL queries to extract data from SQL Server and integrate the data with Tableau.
  • Extensively used Power BI, Pivot Tables, and Tableau to manipulate large data sets and develop visualization dashboards that represent the vulnerability remediation of Confidential assets.
  • Worked in an agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure projects progressed satisfactorily. Also worked in a Kanban environment.
  • Used Python packages such as pandas, NumPy, SciPy, scikit-learn, TensorFlow, and Keras in Jupyter Notebook to design and implement logistic regression, multivariate regression, clustering algorithms, simulation, modeling, neural networks, GLM, Random Forest, boosting, text mining, NLP, support vector machines, and ensemble trees.
  • Used Convolutional Neural Networks (ConvNets or CNNs) for image recognition and classification.
  • Used natural language processing (NLP) to identify and separate words, extract topics in text, and build a fake-news classifier; libraries such as NLTK were used, applying deep learning to common NLP problems such as sentiment analysis (an NLTK sketch follows this section).
  • Optimized performance using hyperparameter tuning, debugging, parameter fitting, and troubleshooting of models, and automated these processes.
  • Developed reports, charts, tables, and other visual aids in support of findings to recommend business direction or outcomes.
  • Used MLlib, Spark's machine learning (ML) functionality, for machine learning problems such as binary classification, regression, clustering, and collaborative filtering, as well as the underlying gradient descent optimization (a Spark ML sketch follows this section).
  • Used a deep learning package (LSTM recurrent neural network, RNN) for a sequence-to-sequence (Seq2Seq) LSTM model.
  • Optimized algorithms with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Implemented machine learning models (logistic regression, XGBoost, SVM) with Python scikit-learn.
  • Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
  • Configured a Hadoop cluster with a NameNode and slave nodes and formatted HDFS.
  • Worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
  • Utilized SQL, Excel, and several marketing/web analytics tools (Google Analytics, AdWords) to complete business & marketing analysis and assessment.

Environment: HDFS, Hive, Sqoop, Pig, Google Cloud Platform (GCP), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.
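
A minimal sketch of the SARIMAX forecasting described above, using statsmodels; the series and the (p, d, q)(P, D, Q, s) orders are hypothetical stand-ins, not project values:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Hypothetical monthly series with trend and noise
    rng = np.random.default_rng(0)
    idx = pd.date_range("2015-01-01", periods=48, freq="MS")
    y = pd.Series(np.arange(48) + rng.normal(0, 2, 48), index=idx)

    # Fit a seasonal ARIMA model; the orders here are illustrative
    result = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)

    # Forecast the next 12 months
    print(result.forecast(steps=12))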
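
A minimal sketch of the backward/forward-fill handling of missing values mentioned above; the series is a hypothetical example:

    import numpy as np
    import pandas as pd

    s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])

    # Forward-fill interior gaps, then backward-fill any leading NaNs
    print(s.ffill().bfill())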
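
A minimal sketch comparing the ensemble families listed above (bagged trees, Random Forest, gradient boosting, AdaBoost) with scikit-learn on synthetic data; XGBoost would slot in the same way via its scikit-learn-compatible XGBClassifier:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  GradientBoostingClassifier, RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    models = {
        "bagged trees": BaggingClassifier(random_state=0),
        "random forest": RandomForestClassifier(random_state=0),
        "gradient boosting": GradientBoostingClassifier(random_state=0),
        "adaboost": AdaBoostClassifier(random_state=0),
    }
    # 5-fold cross-validated accuracy for each ensemble
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())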
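
A minimal sketch of NLTK-based sentiment analysis along the lines described above, using the VADER lexicon (one possible approach, not necessarily the project's exact pipeline):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time lexicon download

    sia = SentimentIntensityAnalyzer()
    # Returns negative/neutral/positive/compound scores for the text
    print(sia.polarity_scores("The product was surprisingly good."))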
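
A minimal sketch of Spark ML binary classification as referenced above; the two-row training frame is a hypothetical placeholder and assumes a local Spark installation:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # Tiny hypothetical training set: (label, feature vector)
    train = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0))],
        ["label", "features"],
    )

    model = LogisticRegression(maxIter=10).fit(train)
    print(model.coefficients)
    spark.stop()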

Confidential - Mountain View, CA

Data Scientist


Responsibilities:

  • Designed a lightweight convolutional neural network (CNN) on TensorFlow to recognize keywords, phrases, tone, and inflection.
  • Re-trained semantic segmentation models and applied state-of-the-art Machine Learning (ML) / Deep Learning (DL) technologies to evaluate the sentiment score.
  • Used classifiers built in SciKit-learn, on top of hand-engineered linguistic features extracted by spaCy, an industrial-strength NLP library.
  • Developed a DL module with Keras, using GloVe (Wikipedia-trained word vectors) as the embedding layer (an embedding sketch follows this section).
  • Our linguistic model achieved 88.2% accuracy on IMDB (our benchmark performance); DL models reached the benchmark after 2 epochs of training, and the ensemble DL model achieved 90.6% after 4 epochs.
  • Reported that the CNN architecture in our application led to fast convergence and quick overfitting; the bidirectional LSTM architecture cost 10x the training time and tended to underfit the training data initially.
  • Migrated eDiscovery search engine from monolith to microservice architecture using NLTK, SciKit-Learn, and DynamoDB.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
  • Achieved a 16x performance increase compared to the industry standard, i.e., systems based on LSA embeddings.
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
  • Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
  • Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified meaningful segments of customers through a discovery approach.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies like Elasticsearch and Kibana (an evaluation sketch follows this section).
  • Engineered pipelines for tokenization and word vector encoding with CoreNLP and Python and deployed them on premise.
  • Created pipeline for server-side rendering word highlighting visualizations using Flask, HTML and CSS.
  • Used data mining and machine learning algorithms, theories, principles and practices.
  • Built and tested different ensemble models such as bootstrap aggregating (bagged decision trees and Random Forest), gradient boosting, XGBoost, and AdaBoost to improve accuracy, reduce variance and bias, and improve model stability.
  • Connected relational and non-relational databases using Azure services for data science and machine learning, which were also used to build, train, host, and deploy models from a Python environment.
  • Documented logical data models, semantic data models and physical data models.
  • Implemented real-time data processing using Storm and Spark, as well as Spark ML.
  • Implemented model on batch data using Spark SQL.
  • Worked on cloud services such as Amazon Web Services (AWS), S3, and Redshift to assist with big data tools and solve storage issues.
  • Performed MapReduce jobs and Spark analysis using Python for machine learning and predictive analytics models on big data in the Hadoop ecosystem on the AWS cloud platform, as well as on some data from on-premise SQL.
  • Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; used the K-Means clustering technique to identify outliers and classify unlabeled data (a K-Means sketch follows this section).
  • Worked in an agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure projects progressed satisfactorily.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.
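
A minimal sketch of a Keras model with a frozen GloVe embedding layer, as described above; the vocabulary size, dimensions, sequence length, and random stand-in matrix are hypothetical (in practice the matrix is filled from a glove.*.txt file):

    import numpy as np
    from tensorflow import keras

    vocab_size, embed_dim, max_len = 10000, 100, 200

    # Stand-in for a matrix loaded from pre-trained GloVe vectors
    embedding_matrix = np.random.rand(vocab_size, embed_dim)

    model = keras.Sequential([
        keras.layers.Input(shape=(max_len,)),
        keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=keras.initializers.Constant(embedding_matrix),
            trainable=False,  # keep the pre-trained vectors frozen
        ),
        keras.layers.Bidirectional(keras.layers.LSTM(64)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()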
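
A minimal sketch of the model evaluation described above (cross-validation scored with log loss and ROC AUC), on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    X, y = make_classification(n_samples=500, random_state=0)

    # 5-fold cross-validation scored with log loss and ROC AUC
    scores = cross_validate(
        LogisticRegression(max_iter=1000), X, y, cv=5,
        scoring=["neg_log_loss", "roc_auc"],
    )
    print("log loss:", -scores["test_neg_log_loss"].mean())
    print("ROC AUC:", scores["test_roc_auc"].mean())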
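
A minimal sketch of the K-Means outlier-identification idea mentioned above: points unusually far from their nearest centroid are flagged (the data and threshold are hypothetical):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one planted outlier

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Distance of each point to its assigned cluster center
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    # Flag the farthest 1% of points as outliers
    print(np.where(dist > np.quantile(dist, 0.99))[0])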

Confidential - Atlanta, GA

Data Analyst/Data Scientist

Responsibilities:

  • Created real-time data pipelines using Spark Streaming and Spark.
  • Developed pipelines using Spark ML that drive data for the automation of training and testing the models (a pipeline sketch follows this section).
  • Worked with supervised model types including Generalized Linear Models, Random Forests, Gradient Boosting Machines, Support Vector Machines, Deep Learning Neural Nets, and Ensemble Learning/Stacking, and with unsupervised model types like Principal Component Analysis, K-Means Clustering, Hierarchical Clustering, and Autoencoders.
  • Built models for highly imbalanced data sets, managing the bias/variance tradeoff, using model quality metrics like R-squared and AUC, and performing outlier detection and removal.
  • Applied advanced statistics/math: ANOVA/ANCOVA, bootstrapping, and confidence intervals.
  • Worked in Big Data Hadoop Hortonworks, HDFS architecture, R, Python, Jupyter, pandas, NumPy, scikit-learn, Matplotlib, PyHive, Keras, Hive, NoSQL (HBase), Sqoop, Pig, MapReduce, Oozie, and Spark MLlib.
  • Used Cloudera Hadoop YARN to perform analytics on data in Hive and built models with big data frameworks like Cloudera Manager and Hadoop.
  • Worked with different data science models and machine learning algorithms such as Linear and Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, KNN, and deep learning.
  • Performed causal modeling in both experimental and observational data sets, using Bayesian networks and Bayesian regression.
  • Advanced programming in Python using the scikit-learn and NumPy libraries.
  • Predicted the Remaining Useful Life (RUL), or Time to Failure (TTF) using Regression.
  • Predicted whether an asset will fail within a certain time frame (e.g., days) with binary classification.
  • Used LSTM to predict the probability of failure at different time intervals, compensating for independent variables reflecting states of wear (an LSTM sketch follows this section).
  • Worked in an agile environment to implement agile management ideals such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure projects progressed satisfactorily.
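
A minimal sketch of a Spark ML training pipeline like the one described above; the column names and toy rows are hypothetical:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0), (5.0, 1.0, 0.0), (2.0, 5.0, 1.0)],
        ["f1", "f2", "label"],
    )

    # Assemble raw columns into a feature vector, then train a classifier
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
        RandomForestClassifier(labelCol="label", featuresCol="features"),
    ])
    model = pipeline.fit(df)
    model.transform(df).select("label", "prediction").show()
    spark.stop()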
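
A minimal sketch of an LSTM that outputs a failure probability from a window of sensor readings, as described above; the shapes and random data are hypothetical stand-ins:

    import numpy as np
    from tensorflow import keras

    timesteps, n_features = 50, 3

    # Hypothetical sensor windows and failure-within-horizon labels
    X = np.random.rand(200, timesteps, n_features).astype("float32")
    y = np.random.randint(0, 2, size=(200,))

    model = keras.Sequential([
        keras.layers.Input(shape=(timesteps, n_features)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1, activation="sigmoid"),  # failure probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    model.fit(X, y, epochs=2, batch_size=32, verbose=0)
    print(model.predict(X[:1], verbose=0))  # failure probability for one window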

Confidential - Austin, TX

Data Analyst/Data Scientist

Responsibilities:

  • Integrated data from multiple data sources or functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.
  • Leveraged SQL, Excel, and Tableau to manipulate, analyze, and present data.
  • Performed analyses of structured and unstructured data to solve multiple and/or complex business problems utilizing advanced statistical techniques and mathematical analyses.
  • Developed advanced models using multivariate regression, logistic regression, random forests, decision trees, and clustering.
  • Used pandas, NumPy, Seaborn, and scikit-learn in Python to develop various machine learning algorithms.
  • Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data.
  • Worked with distributed computing technologies (Apache Spark, Hive).
  • Applied predictive analysis and statistical modeling techniques to analyze customer behavior and offer customized products, reducing the delinquency rate and default rate; default rates fell from 5% to 2% (a default-model sketch follows this section).
  • Applied machine learning techniques to tap into new markets and new customers, and put forth recommendations to top management, resulting in a 5% increase in the customer base and a 9% increase in the customer portfolio.
  • Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore cross-selling opportunities for financial products; the share of customers using more than 6 products increased from 40% to 60%.
  • Collaborated with business partners to understand their problems and goals, develop predictive modeling, statistical analysis, data reports and performance metrics.
  • Participated in the ongoing design and development of a consolidated data warehouse supporting key business metrics across the organization.
  • Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.
  • Developed dashboards and reports using Tableau.
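
A minimal sketch of a delinquency/default classifier of the kind described above; the imbalanced synthetic data stands in for the confidential customer features:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Imbalanced classes mimic rare default events
    X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                        random_state=0)

    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))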
