
Data Scientist/ML Engineer Resume


Richardson, TX

SUMMARY

  • Highly accomplished Data Scientist/Data Analyst with 5+ years of experience in data analysis, data mining, machine learning with large sets of structured and unstructured data, data acquisition, data validation, predictive modeling, and data visualization.
  • Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
  • Able to build advanced statistical and predictive models, such as generalized linear models, decision trees, neural networks, ensembles, Support Vector Machines (SVM), and Random Forests.
  • Skilled in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, clustering), dimensionality reduction using Principal Component Analysis, testing and validation using ROC plots and K-fold cross-validation (see the cross-validation sketch after this list), and data visualization.
  • Experience with a variety of NLP methods for information extraction, topic modeling, parsing, and relationship extraction, and with developing, deploying, and maintaining production NLP models at scale. Creative thinker who proposes innovative ways to look at problems by applying data mining approaches to the available information.
  • Experience and technical proficiency in designing and data modeling online applications; solution lead for architecting data warehouse/business intelligence applications.
  • Experience in working with relational databases (Teradata, Oracle) with advanced SQL programming skills.
  • Identifies or creates the appropriate algorithm to discover patterns and validates findings using an experimental and iterative approach.
  • Experience in designing visualizations using Tableau and in publishing and presenting dashboards and storylines on web and desktop platforms.
  • Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools such as Tableau.
  • Experience with Big Data platforms such as Hadoop distributions (MapR, Hortonworks, and others), Aster, and graph databases.
  • Experienced in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design.
  • Skilled in performing data parsing, data manipulation, and data preparation, including describing data contents, computing descriptive statistics, regex, splitting and combining, remapping, reindexing, merging, subsetting, and reshaping.
  • Experience in using various packages in R and Python, such as ggplot2, gmodels, twitteR, NLP, pandas, NumPy, Seaborn, SciPy, Matplotlib, and scikit-learn.
  • Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, PySpark, and Spark SQL.
  • Hands-on experience implementing LDA; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, and Principal Component Analysis.
  • Good knowledge of proofs of concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using data munging.
  • Good industry knowledge, analytical & problem-solving skills, and the ability to work well within a team as well as individually.
  • Extensive experience in operating batch and streaming Big Data pipelines (Spark, Hive, SQL engines).
  • Highly skilled in using visualization tools such as ggplot2, Dash, Tableau, and Flask for creating dashboards.
  • Experience with Data Analytics, Data Reporting, Graphs, Scales, Ad-hoc Reporting, OLAP reporting and PivotTables
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Extracted data from HDFS and prepared it for exploratory analysis using data munging.
  • Worked with and extracted data from various database sources such as Oracle, SQL Server, and DB2, and regularly accessed JIRA and other internal issue trackers for project development.
  • Extensive experience in data visualization, including producing tables, listings, and graphs using various procedures and tools such as Tableau.
  • Implemented, tuned and tested the model on AWS EC2 with the best algorithm and parameters.
  • Experience implementing deep learning algorithms such as Artificial Neural Networks (ANN) and Recurrent Neural Networks (RNN); tuned hyperparameters and improved models with Python packages such as TensorFlow.
  • Highly innovative, creative, committed, intellectually curious, business savvy with good communication and interpersonal skills.
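
As an illustration of the K-fold cross-validation and ROC-based validation workflow referenced above, here is a minimal scikit-learn sketch; the synthetic dataset and the Random Forest model are placeholders rather than actual project artifacts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold stratified cross-validation, scoring each fold by ROC AUC.
aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                           random_state=42).split(X, y):
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], probs))

print(f"Mean ROC AUC over 5 folds: {np.mean(aucs):.3f}")
```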

TECHNICAL SKILLS

Languages: Python 2.x/3.x, R, Java, C, C++, Scala, SQL, Spark SQL, SAS, XML, Shell Scripting

Tools & Environments: Anaconda, Jupyter Notebook, R Studio, SAS Enterprise Guide, Eclipse, Maven, Spark 2.x/2.3, Spark Streaming, Hadoop, MapReduce, HDFS, R packages (stats, zoo, Matrix, data.table, openssl)

BI Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP BusinessObjects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse

Statistics: Hypothesis Testing, ANOVA, Confidence Intervals, Correlation, Bayes' Law, MLE, Fisher Information, Principal Component Analysis (PCA), Cross-Validation.

NoSQL Databases: MariaDB, MongoDB, Cassandra

Big Data: Hadoop, HDFS, Hive, Spark, Scala, Sqoop, PuTTY

Data Analysis and Data Science: Neural networks, graph/network analysis, time series analysis (ARIMA models), NLP, deep neural networks, logistic regression, decision trees, Random Forests, KNN, XGBoost, ensembles (bagging, boosting), Support Vector Machines.

Algorithms: Logistic Regression, Random Forest, XGBoost, KNN, SVM, Neural Networks, Generalized Linear Models, Lasso Regression, K-Means Clustering, Boxplots

Tools: SVN, PuTTY, WinSCP, Redmine (bug tracking, documentation, Scrum), Teradata, Tableau, H2O Flow, Splunk, GitHub

Database Design Tools and Data Modeling: Normalization and de-normalization techniques, Kimball & Inmon methodologies, MS Visio, ERwin 4.5/4.0, Star Schema/Snowflake Schema modeling, fact & dimension tables, physical & logical data modeling

Reporting Tools: MS Office (Word/Excel/PowerPoint/ Visio/Outlook), Crystal Reports XI, SSRS

PROFESSIONAL EXPERIENCE

Confidential, Richardson, TX

Data Scientist/ML Engineer

Responsibilities:

  • Performed various data science processes, such as data cleansing, data mining, data collection, and dataset preparation, for machine learning models in an RDBMS database using T-SQL and Netezza.
  • Used predictive modeling, trend analysis, machine learning, statistics, and other data analysis techniques to collect, explore, and identify data to explain customer behavior and segmentation, text analytics, product-level analysis, big data analytics, and customer experience analysis.
  • Applied backward- and forward-filling methods to the dataset to handle missing values during data cleaning (see the pandas sketch after this list).
  • Analyzed and performed data preparation by applying a historical model to the dataset in Azure ML; validated a KNN model and developed a predictive model to predict the feature label.
  • Used a deep learning package (LSTM recurrent neural networks, RNNs) to build a sequence-to-sequence (Seq2Seq) LSTM model.
  • Developed, planned, and applied leading-edge analytic and quantitative tools and modeling techniques to help clients gain insights and improve decision-making.
  • Used ARIMA models to forecast time series from their past values; built an optimal ARIMA model from scratch and extended it to Seasonal ARIMA (SARIMA) and SARIMAX models in Python (see the SARIMAX sketch after this list).
  • Performed data transformations for normalizing and rescaling variables.
  • Utilized Spark, Scala, Hadoop, PySpark, Data Lake, TensorFlow, Spark Streaming, MLlib, AWS, Python, HBase, Cassandra, Redshift, MongoDB, and Kafka, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Experienced in data streaming and batch processing of bulk data on Google Cloud.
  • Built and deployed data pipelines on Google Cloud to enable AI & ML capabilities.
  • Stored and processed very large amounts of data, including streaming and real-time data, in a cloud-based big data lake.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
  • Built and tested different ensemble models, such as bootstrap aggregating (bagged decision trees and Random Forest), gradient boosting, XGBoost, and AdaBoost, to improve accuracy, reduce variance and bias, and improve model stability (see the ensemble sketch after this list).
  • Optimized algorithms with stochastic gradient descent; fine-tuned parameters both manually and with automated tuning such as Bayesian optimization.
  • Extensively used Power BI, pivot tables, and Tableau to manipulate large datasets and develop visualization dashboards tracking the vulnerability remediation of USAA assets.
  • Used Python packages such as pandas, NumPy, SciPy, scikit-learn, TensorFlow, and Keras in Jupyter Notebook to design and implement logistic regression, neural networks, NLP, support vector machines, ensemble trees, multivariate regression, clustering algorithms, simulation, GLMs, Random Forest, boosting, and text mining.
  • Used Convolutional Neural Networks (ConvNets or CNNs) for image recognition and classification.
  • Used natural language processing (NLP) to identify and separate words, extract topics from text, and build a fake-news classifier; used libraries such as NLTK with deep learning to solve common NLP problems such as sentiment analysis.
  • Optimized performance through hyperparameter tuning, debugging, parameter fitting, and troubleshooting of models, and automated these processes.
  • Developed reports, charts, tables, and other visual aids in support of findings to recommend business direction or outcomes.
  • Used MLlib, Spark's machine learning (ML) library, for problems such as binary classification, regression, clustering, and collaborative filtering, as well as the underlying gradient descent optimization.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Implemented machine learning models (logistic regression, XGBoost, SVM) with Python scikit-learn.
  • Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
  • Worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
  • Successfully read, cleaned, filtered, and preprocessed data, removed outliers, subset the data, fit models (linear regression, Random Forest regression), and selected the most reasonable model on the basis of R-squared and accuracy.
  • Improved estimated customer delivery date (ECDD) accuracy by up to 70% using AI/ML models, which increased customer satisfaction by 15% and reduced customer calls by 15%.
  • Successfully connected to different data sources using SSH and SFTP from the Hadoop cluster and Azure Data Factory.
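
A minimal pandas sketch of the backward/forward filling described in the bullets above; the series values and dates are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps, standing in for the real dataset.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range("2021-01-01", periods=5))

# Forward-fill propagates the last known value; backward-fill then covers
# any NaNs remaining at the start of the series.
filled = s.ffill().bfill()
print(filled)
```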
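
A sketch of the ARIMA-to-SARIMAX workflow mentioned above, assuming statsmodels and a monthly series; the (p, d, q)(P, D, Q, s) orders are illustrative and would normally come from a grid search:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with a trend and yearly seasonality.
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
y = pd.Series(np.linspace(50, 120, 72)
              + 10 * np.sin(2 * np.pi * np.arange(72) / 12), index=idx)

# SARIMA(1,1,1)(1,1,1,12): the last element of seasonal_order is the period.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)

print(fit.forecast(steps=12))  # 12-month-ahead forecast
```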
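
And a sketch comparing the ensemble families named above on one synthetic dataset, using only scikit-learn implementations (XGBoost is omitted to keep the example dependency-free):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=2000, n_features=25, random_state=0)

# One representative model per ensemble family.
models = {
    "bagged decision trees": BaggingClassifier(n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```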

Environment: HDFS, Hive, Sqoop, Pig, Google Cloud Platform (GCP), Python 3.x (SciPy, scikit-learn), Tableau 9.x, D3.js, SVM, linear regression, Random Forests, Naïve Bayes classifier, A/B experiments, Git 2.x, Agile/SCRUM.

Confidential - Philadelphia, PA

Data Scientist/ML Engineer

Responsibilities:

  • Worked with data mining, data collection, data cleansing, dataset preparation, and ETL; queried and stored information for machine learning models in an RDBMS database and programmed/automated it using SQL.
  • Created advanced machine learning and statistical models: simulation, scenario analysis, clustering, decision trees, neural networks, GLM/regression, Random Forest, boosting, text mining, and social network analysis.
  • Delivered analytics use cases with Aster analytics functions, primarily nPath, Text Parser, Collaborative Filtering, SQL-MR, query vectors, regression models, correlations, pattern matching, and text mining.
  • Applied state-of-the-art Machine Learning (ML) / Deep Learning (DL) technologies to evaluate sentiment scores by re-training semantic segmentation models.
  • Used classifiers built in scikit-learn on top of hand-engineered linguistic features extracted by spaCy, an industrial-strength NLP library (see the spaCy sketch after this list).
  • Designed new language models (NLP/NLU) to understand and build predictions or summaries from customer calls, claim notes, web chats, medical records, and beyond.
  • Experience with speech recognition, NLP, NLU, and similar fields.
  • Designed a lightweight convolutional neural network (CNN) on TensorFlow to recognize keywords, phrases, tone, and inflection.
  • Performed hyperparameter tuning, debugging, and troubleshooting of machine learning models, and automated processes to optimize performance.
  • Analyzed performance and conducted experiments with different types of algorithms and models to identify the best algorithms to employ.
  • Reported that the CNN architecture in our application led to fast convergence and quick overfitting, while the bidirectional LSTM architecture cost 10x the training time and initially tended to underfit the training data.
  • Migrated the eDiscovery search engine from a monolith to a microservice architecture using NLTK and scikit-learn.
  • Developed a DL module with Keras, using GloVe (Wikipedia-trained word vectors) as the embedding layer (see the GloVe/Keras sketch after this list).
  • Conducted a hybrid of Hierarchical and K-means Cluster Analysis
  • Implemented use of real-time data using Storm and Spark as well as Spark ML.
  • Implemented model on batch data using Spark SQL.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana (see the evaluation sketch after this list).
  • Worked in an Agile environment, implementing practices such as sprint planning, daily standups, and managing project timelines, and communicated with clients to ensure satisfactory project progress; also worked in a Kanban environment.
  • Documented logical data models, semantic data models and physical data models.
  • Used advanced SQL query to extract data from SQL server and to integrate the data with Tableau.
  • Engineered pipelines for tokenization and word vector encoding with CoreNLP and Python and deployed them on premise.
  • Created pipeline for server-side rendering word highlighting visualizations using Flask, HTML and CSS.
  • Experience with Big Data techniques, i.e., Hadoop, MapReduce, NoSQL, Pig/Hive, Spark, and Spark MLlib.
  • Used data mining and machine learning algorithms, theories, principles and practices.
  • Connected relational and non-relational databases using Azure services for data science and machine learning, which were also used to build, train, host, and deploy models from a Python environment.
  • Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks.
  • Worked on cloud services such as Amazon Web Services (AWS), S3, and Redshift to assist with big data tools and solve storage issues.
  • Performed MapReduce jobs and Spark analysis using Python for machine learning and predictive analytics models on big data in the Hadoop ecosystem on the AWS cloud platform, as well as on some data from on-premise SQL.
  • Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; used the K-means clustering technique to identify outliers and classify unlabeled data.
  • Worked with cloud storage in AWS via the Amazon S3 object store to store, retrieve, and share large quantities of data; read from and wrote to S3 from Apache Hadoop, Apache Spark, and Apache Hive. Used PCA for dimensionality reduction and created K-means clusterings.
  • Used Apache Flume and Apache Sqoop to load both structured and unstructured streaming data into HDFS, Hive, and HBase.
  • Worked in the Agile methodology with self-organizing, cross-functional teams sprinting toward results in fast, iterative, incremental, and adaptive steps.
  • Splunk was used to search, investigate, troubleshoot, monitor, visualize, alert, and report machine-generated big data. Splunk Enterprise Security (ES) was used to identify and track security incidents, analyze security risks, use predictive analytics, and threat discovery.
  • Used K-nearest neighbors, K-means clustering, and Support Vector Machines (SVM) for anomaly detection; used K-means clustering for customer segmentation based on sales behavior and for text mining by clustering text documents in scikit-learn.
  • Involved in the design, execution, improvement, and integration of an Artificial Intelligence solution. Performed sentiment analysis, mined unstructured information, and developed insights using statistical Natural Language Processing; analyzed and modeled structured data using advanced statistical methods and implemented algorithms.
  • Created data quality scripts using SQL and Hive to validate successful data loads and data quality. Created various types of data visualizations using Python and Tableau.
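
A minimal sketch of the spaCy-plus-scikit-learn setup described above; the example texts, labels, and lemma-based features are hypothetical stand-ins for the claim-note data:

```python
# Requires: pip install spacy scikit-learn
#           python -m spacy download en_core_web_sm
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text: str) -> str:
    # Hand-engineered linguistic features: alphabetic, non-stop-word lemmas.
    return " ".join(t.lemma_ for t in nlp(text) if t.is_alpha and not t.is_stop)

texts = ["the claim was approved quickly", "customer disputed the denied claim"]
labels = [1, 0]  # toy positive/negative labels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([lemmatize(t) for t in texts], labels)
print(clf.predict([lemmatize("the claim was approved")]))
```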
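
A sketch of a Keras model with a frozen GloVe embedding layer as described above, assuming the TensorFlow 2.x Keras API; the GloVe file, vocabulary size, and classification head are assumptions for illustration:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Sequential

vocab_size, embed_dim, seq_len = 10000, 100, 50

# In practice this matrix is filled from a GloVe file such as
# glove.6B.100d.txt, mapping each tokenizer word index to its pretrained
# vector; the file and word_index mapping are assumed here.
embedding_matrix = np.zeros((vocab_size, embed_dim))

model = Sequential([
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=seq_len, trainable=False),  # frozen GloVe layer
    GlobalAveragePooling1D(),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary text-classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```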
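
And a sketch of the evaluation loop above (cross-validation scored by log loss and ROC AUC) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1500, random_state=1)

# Score the same model on both metrics across 5 folds.
cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                    scoring=["neg_log_loss", "roc_auc"])
print("log loss:", -cv["test_neg_log_loss"].mean())
print("ROC AUC: ", cv["test_roc_auc"].mean())
```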

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, scikit-learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes classifier, LightGBM classifier, XGBoost classifier, A/B experiments, Git 2.x, Agile/SCRUM.

Confidential - Columbia, SC

Data Scientist/ML Engineer

Responsibilities:

  • Involved in the design, implementation, development & integration of an Artificial Intelligence solution. Utilized statistical Natural Language Processing for sentiment analysis, mine unstructured data, and create insights; analyze and model structured data using advanced statistical methods and implement algorithms.
  • Used Python libraries such as pandas, scikit-learn, NumPy, SciPy, Keras, and TensorFlow to predict categories based on location, time, and other features via linear regression, K-means clustering, deep neural networks, logistic regression, decision trees, Random Forests, KNN, XGBoost, ensembles (bagging, boosting), Support Vector Machines, neural networks, graph/network analysis, and time series analysis (ARIMA models).
  • Trained a large list of models, then evaluated, compared, and selected the best models for prediction and forecasting. Established and maintained effective processes for K-fold validation and confusion matrices, and updated and validated different models. Developed Convolutional Neural Networks (CNNs) in Python using the TensorFlow deep learning library to address image recognition problems.
  • Developed LSTM recurrent neural networks (RNNs) in Python using the Keras and TensorFlow deep learning libraries to address time-series prediction problems, such as forecasting sales through 2022 (see the LSTM sketch after this list). Used LSTM RNNs for use cases such as natural language processing (NLP).
  • Used the H2O Flow notebook interface with H2O to capture and share workflows; used H2O to import files, build models, and improve models for data analytics and prediction.
  • Built ensemble models such as bootstrap aggregating (bagging: bagged decision trees and Random Forest) and boosting (gradient boosting, XGBoost, and AdaBoost) to improve accuracy, reduce variance and bias, and improve model stability; used Random Forest importance, SelectKBest, and RFE for feature selection.
  • Familiar with Azure Machine Learning Model Management to manage and deploy machine-learning workflows and models into production
  • Used Spark SQL for ETL of raw data. Worked for feature selection, data wrangling, and feature extraction and worked on ETL on Hadoop.
  • Tableau was used to connect to files, relational and Big Data sources to acquire, visually analyze, and process data. Tableau was also used to create and distribute an interactive and shareable dashboard to see the trends, variations, and density of the data in the form of graphs and charts.
  • Took sole ownership of the analytics solution from requirements through to delivery.
  • Wrote complex SQL queries in Oracle, PostgreSQL, and MySQL and developed data models for data analysis and extraction.
  • Familiar with AR (autoregressive), MA (moving average), and ARIMA (autoregressive integrated moving average) time series models; predicted CVS market demand using an ARIMA forecasting model tuned with grid search.
  • Captured and elaborated analytics solution requirements, working with customers and product managers, and created advanced analytics solutions. Combined Zeppelin, Spark SQL, and Spark MLlib to simplify exploratory data science.
  • Apache Zeppelin web-based notebook was used to bring data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark.
  • Used Kafka as a message broker to collect large volumes of data and analyze the collected data in the distributed system. Business data was fed into Kafka and processed using Spark Streaming in Scala for real-time analytics and data science.
  • Used Spark Streaming for real-time sentiment analysis, crisis management, and service adjustment. Used Amazon Simple Storage Service (Amazon S3) to store and retrieve data, and GitHub for version control.
  • Designed, developed, and implemented end-to-end cloud based machine learning production pipelines (data exploration, sampling, training data generation, feature engineering, model building, and performance evaluation)
  • Worked with different clustering methods such as K-means, DBSCAN, and Gaussian mixture models (see the clustering sketch after this list).
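
A minimal sketch of the Keras LSTM time-series setup described above, assuming the TensorFlow Keras API; the sine-wave "sales" series and the 12-step window are stand-ins for the real data:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

# Synthetic "sales" series; sliding windows of 12 steps predict the next value.
series = np.sin(np.linspace(0, 20, 500))
X = np.array([series[i:i + 12] for i in range(len(series) - 12)])[..., None]
y = series[12:]

model = Sequential([
    LSTM(32, input_shape=(12, 1)),  # 12 timesteps, 1 feature per step
    Dense(1),                       # next-step regression output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

print(model.predict(X[-1:], verbose=0))  # one-step-ahead forecast
```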
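
And a sketch comparing the clustering methods named above on synthetic blobs, scored by silhouette; parameters such as eps and the cluster counts are illustrative:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, random_state=7)

labels = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X),
    "dbscan": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
    "gaussian mixture": GaussianMixture(n_components=3,
                                        random_state=7).fit_predict(X),
}

for name, lab in labels.items():
    # Silhouette needs at least two clusters; DBSCAN may mark noise as -1.
    if len(set(lab)) > 1:
        print(f"{name}: silhouette = {silhouette_score(X, lab):.3f}")
```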

Environment: R/R Studio, Python, Scala, SAS, Tableau, MS SQL Server, MS Excel, Power BI, statistical models

Confidential - Austin, TX

Jr. Data Scientist

Responsibilities:

  • Conducted two-factor, three-factor, and multi-factor ANOVA models; tested two- and three-way interactions; derived random- and mixed-effects ANOVA models; and tested nested designs and repeated measures.
  • Performed comprehensive chi-square, t, z, and non-parametric tests; justified claims and drew conclusions on the basis of p-values.
  • Fitted logistic regression, LDA, QDA, and KNN models for binomial classification and recommended reasonable models with justification; used confusion matrices and ROC charts to evaluate the classification models (see the sketch after this list).
  • Successfully fitted linear and multiple regression models using continuous and categorical predictors, verified regression assumptions with homoscedasticity, normality, independence, and linearity tests, and predicted on the test data.
  • Performed multiple linear regression using least squares, best subset selection, forward stepwise selection, and backward stepwise selection, and recommended the appropriate model with justification.
  • Constructed machine learning models using NumPy, SciPy, NLTK, scikit-learn, mlpy, and OpenCV.
  • Optimized queries with some manipulations and modifications in MySQL code and removed unwanted columns and duplicate data.
  • Worked closely with machine learning engineers to analyze data based on their requirements; experienced in creating pivot tables for analyzing data in Excel.
  • Tested and ensured data accuracy through the creation and implementation of data integrity queries, debugging, and troubleshooting.
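
A sketch of the classifier comparison described above (logistic regression, LDA, QDA, and KNN evaluated with confusion matrices and ROC AUC), using scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           n_redundant=0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
}

for name, m in models.items():
    m.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
    print(confusion_matrix(y_te, m.predict(X_te)))
```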

Environment: R/R Studio, Python, Scala, SAS, Tableau, MS SQL Server, MS Excel, Power BI, statistical models
