Data Engineer/ Machine Learning Engineer Resume

SUMMARY

  • Data Scientist with 7+ years of experience transforming business requirements into analytical models, designing algorithms, and building data mining and reporting solutions that scale across massive volumes of structured and unstructured data, with expertise across industries including Banking and Healthcare.
  • Expert in Data Science process life cycle: Data Acquisition, Data Preparation, Modeling (Feature Engineering, Model Evaluation) and Deployment.
  • Experienced in statistical techniques including hypothesis testing, Principal Component Analysis (PCA), ANOVA, sampling distributions, chi-square tests, time-series analysis, discriminant analysis, Bayesian inference, and multivariate analysis.
  • Efficient in data preprocessing, including data cleaning, correlation analysis, imputation, visualization, feature scaling, and dimensionality reduction, using Python data science packages (Scikit-learn, Pandas, NumPy).
  • Expertise in building various machine learning models using algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Support Vector Machines (SVM), Decision trees, KNN, K-means Clustering, Ensemble methods (Bagging, Gradient Boosting).
  • Experience in text mining, topic modeling, Natural Language Processing (NLP), content classification, sentiment analysis, market basket analysis, recommendation systems, and entity recognition.
  • Applied text pre-processing and normalization techniques, such as tokenization, POS tagging, and parsing. Expertise using NLP techniques (BOW, TF-IDF, Word2Vec) and toolkits such as NLTK, Gensim, and SpaCy.
  • Experienced in tuning models using Grid Search, Randomized Grid Search, K-Fold Cross Validation.
  • Strong understanding of artificial neural networks, convolutional neural networks, and deep learning.
  • Skilled in statistical methods including exploratory data analysis, regression analysis, regularized linear models, time-series analysis, cluster analysis, goodness of fit, Monte Carlo simulation, sampling, cross-validation, ANOVA, and A/B testing.
  • Working experience in Natural Language Processing (NLP) and a deep understanding of statistics, linear algebra, and calculus, as well as optimization algorithms such as gradient descent.
  • Familiar with key data science concepts (statistics, data visualization, machine learning, etc.). Experienced in Python, R, MATLAB, SAS, and PySpark programming for statistical and quantitative analysis.
  • Knowledge of Time Series Analysis using AR, MA, ARIMA, GARCH, and ARCH models.
  • Experience in building production quality and large-scale deployment of applications related to natural language processing and machine learning algorithms.
  • Experience with high-performance computing (cluster computing on AWS with Spark/Hadoop) and building real-time analyses with Kafka and Spark Streaming. Knowledge of Qlik, Tableau, and Power BI.
  • Exposure to AI and deep learning platforms such as TensorFlow, Keras, AWS ML, and Azure ML Studio.
  • Experience working with Big Data tools such as Hadoop - HDFS and MapReduce, Hive QL, Sqoop, Pig Latin and Apache Spark (PySpark).
  • Extensive experience working with RDBMS such as SQL Server, MySQL, and NoSQL databases such as MongoDB, HBase.
  • Generated data visualizations using tools such as Tableau, Python Matplotlib, Python Seaborn, R.
  • Knowledge and experience working in Agile environments, including the Scrum process; used project management tools such as ProjectLibre and Jira, and version control tools such as Git/GitHub.
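As an illustration of the bag-of-words/TF-IDF weighting mentioned above, a minimal pure-Python sketch (the toy corpus is invented for demonstration and is not production code):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a tokenized corpus (a list of token lists)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = [
    "the customer filed a claim".split(),
    "the claim was approved".split(),
    "customer satisfaction improved".split(),
]
weights = tf_idf(corpus)
# Common words like "the" get low weight; words unique to one
# document, like "satisfaction", get higher weight.
```

In practice this would be done with scikit-learn's TfidfVectorizer or Gensim; the sketch only shows the weighting idea behind the technique.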

TECHNICAL SKILLS

Data Sources: AWS Snowflake, PostgreSQL, MS SQL Server, MongoDB, MySQL, HBase, Amazon Redshift, Teradata.

Statistical Methods: Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), Time Series, Correlation (chi-square test, covariance), Multivariate Analysis, Bayes' Law.

Machine Learning: Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Means Clustering, K-Nearest Neighbors (KNN), Gradient Boosting Trees, AdaBoost, PCA, LDA, Natural Language Processing

Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, RNN, Deep Learning on AWS, Keras API.

Hadoop Ecosystem: Hadoop, Spark, MapReduce, Hive QL, HDFS, Sqoop, Pig Latin

Data Visualization: Tableau, Python (Matplotlib, Seaborn), R (ggplot2), Power BI, QlikView, D3.js

Languages: Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R, SQL, MATLAB, Spark, Java, C#

Operating Systems: UNIX Shell Scripting (via PuTTY client), Linux, Windows, Mac OS

Other tools and technologies: TensorFlow, Keras, AWS ML, Azure ML Studio, GCP, NLTK, SpaCy, Gensim, MS Office Suite, Google Analytics, GitHub, AWS (EC2/S3/Redshift/EMR/Lambda/Snowflake)

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer/ Machine learning Engineer

Responsibilities:

  • Worked with the data governance team on CCPA (California Consumer Privacy Act) and GDPR (General Data Protection Regulation) projects.
  • Worked on the end-to-end machine learning workflow: wrote Python code for gathering data from AWS Snowflake, data preprocessing, feature extraction, feature engineering, modeling, model evaluation, and deployment. Wrote Python code for exploratory data analysis using data science packages including NumPy, Pandas, Matplotlib, Seaborn, statsmodels, and pandas-profiling.
  • Trained a Random Forest algorithm on customer web activity data from media applications to predict potential customers. Worked with Google TensorFlow and the Keras API, building convolutional neural networks for classification problems.
  • Wrote code for feature engineering, Principal Component Analysis (PCA), and hyperparameter tuning to improve model accuracy.
  • Worked with various machine learning algorithms, including Linear Regression, Logistic Regression, Decision Trees, Random Forests, K-Means Clustering, Support Vector Machines, and XGBoost, based on client requirements.
  • Developed machine learning models using recurrent neural networks (LSTM) for time series and predictive analytics.
  • Developed machine learning models using the Google TensorFlow Keras API (convolutional neural networks) for classification problems; fine-tuned model performance by adjusting the number of epochs, the batch size, and the Adam optimizer.
  • Good knowledge of image classification using Keras models with weights pre-trained on ImageNet, such as VGG16, VGG19, ResNet, ResNetV2, and InceptionV3. Knowledge of OpenCV for real-time computer vision.
  • Worked on natural language processing for document classification and text summarization, using NLTK, SpaCy, and TextBlob to find sensitive information in electronically stored files.
  • Developed a Python automation script for consuming data subject requests from AWS Snowflake tables and posting the data to the Adobe Analytics Privacy API.
  • Developed a Python script to automate data cataloging in the Alation data catalog tool. Tagged all Personally Identifiable Information (PII) in the Alation enterprise data catalog to identify sensitive consumer information.
  • Consumed the Adobe Analytics web API and wrote a Python script to load Adobe consumer information for digital marketing into Snowflake. Worked on Adobe Analytics ETL jobs.
  • Wrote stored procedures in AWS Snowflake to find sensitive information across all data sources and hash it with a salt value, anonymizing the sensitive data to comply with the CCPA.
  • Worked with the AWS boto3 API to call AWS services such as S3, AWS Secrets Manager, and AWS SQS.
  • Created an integration to consume HBO consumer subscription information posted to AWS SQS (Simple Queue Service), loaded it into Snowflake tables for data processing, and stored the metadata in Postgres tables.
  • Generated reports providing WarnerMedia brands' consumer information to data subjects through Python automation jobs.
  • Implemented AWS Lambda functions and a Python script that pulls privacy files from AWS S3 buckets and posts them to the Malibu data privacy endpoints.
  • Involved in different phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver solutions.
  • Worked with the Python NumPy, SciPy, Pandas, Matplotlib, and statistics packages to perform dataset manipulation, data mapping, data cleansing, and feature engineering. Built and analyzed datasets using R and Python.
  • Extracted the data required for building models from the AWS Snowflake database. Performed data cleaning, including transforming variables and handling missing values, and ensured data quality, consistency, and integrity using Pandas and NumPy.
  • Tackled a highly imbalanced fraud dataset using sampling techniques such as under-sampling and over-sampling with SMOTE in Python.
  • Utilized PCA and other feature engineering techniques to reduce high-dimensional data, applied feature scaling, and handled categorical attributes using the one-hot encoder of the Scikit-learn library.
  • Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
  • Elucidated continuous improvement opportunities in current predictive modeling algorithms. Proactively collaborated with business partners to determine identified population segments and develop actionable plans enabling the identification of patterns related to quality, use, cost, and other variables.
  • Experimented with ensemble methods to increase the accuracy of the training model with different bagging and boosting methods and deployed the model on AWS.
  • Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.
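The salted-hashing approach described above for CCPA anonymization can be sketched in Python (the salt value, field name, and record are invented for illustration; the actual logic lived in Snowflake stored procedures):

```python
import hashlib

SALT = "example-salt"  # illustrative only; in practice kept in a secrets manager, never in code

def anonymize(value: str, salt: str = SALT) -> str:
    """Hash a sensitive value with a salt so it no longer identifies a person,
    while identical inputs still map to the same token, preserving joins."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Hypothetical consumer record: only the PII field is replaced.
record = {"email": "jane.doe@example.com", "state": "CA"}
record["email"] = anonymize(record["email"])
```

The salt prevents simple rainbow-table lookups of common values (e.g. known email addresses), which is why plain unsalted hashing is not enough for this use case.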

Technology Stack: Python, Postgres, AWS Snowflake, Alation data catalog tool, SnowSQL, AWS EC2, S3, AWS Lambda, AWS Secrets Manager, AWS SQS, Adobe Analytics, Linux, Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn, JIRA, GitHub, Agile/SCRUM.

Confidential, Duluth, GA

Data Scientist

Responsibilities:

  • Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation and visualization to deliver data science solutions.
  • Built machine learning models to identify whether a user is legitimate using real-time data analysis and prevent fraudulent transactions using the history of customer transactions with supervised learning.
  • Extracted data from SQL Server Database, copied into HDFS File system and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
  • Performed data cleaning including transforming variables and dealing with missing value and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Tackled highly imbalanced Fraud dataset using sampling techniques like under sampling and oversampling with SMOTE (Synthetic Minority Over-Sampling Technique) using Python Scikit-learn.
  • Utilized PCA, t-SNE, and other feature engineering techniques to reduce high-dimensional data, applied feature scaling, and handled categorical attributes using the one-hot encoder of the Scikit-learn library.
  • Developed various machine learning models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python.
  • Worked on Amazon Web Services (AWS) cloud services to do machine learning on big data.
  • Developed Spark Python modules for machine learning & predictive analytics in Hadoop.
  • Implemented a Python-based distributed random forest via PySpark and MLlib.
  • Used cross-validation to test the model with different batches of data to find the best parameters for the model and optimized, which eventually boosted the performance.
  • Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods and deployed the model on AWS.
  • Created and maintained reports to display the status and performance of deployed model and algorithm with Tableau.
  • Used GitHub and Jenkins for CI/CD (DevOps operations). Also familiar with Tortoise SVN, Bitbucket, JIRA, and Confluence.
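The cross-validation step described above can be sketched in plain Python (toy data and a trivial majority-vote "model", both invented for illustration; the actual work used Scikit-learn and PySpark MLlib):

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, k, train_and_score):
    """Average a model's held-out score over k train/test splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        scores.append(train_and_score(
            [data[j] for j in train_idx], [labels[j] for j in train_idx],
            [data[j] for j in test_idx], [labels[j] for j in test_idx],
        ))
    return sum(scores) / k

def majority_classifier(X_train, y_train, X_test, y_test):
    """Toy 'model': predict the majority training label; return test accuracy."""
    majority = max(set(y_train), key=y_train.count)
    return sum(1 for y in y_test if y == majority) / len(y_test)

data = list(range(20))
labels = [0] * 14 + [1] * 6
accuracy = cross_validate(data, labels, 5, majority_classifier)
```

Because the folds partition the data, averaging per-fold accuracy of the majority classifier here recovers the base rate of the majority class (0.7), which is the kind of baseline cross-validation is used to beat.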

Technology Stack: Machine Learning, AWS, Python (Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn), SQL Server, Hadoop, HDFS, Hive, Pig Latin, Apache Spark/PySpark/MLlib, GitHub, Linux, Tableau.

Confidential, Southfield, MI

Data Scientist / Machine learning Engineer

Responsibilities:

  • Collaborated with data engineers and operation team to implement the ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
  • Performed data analysis by retrieving the data from the Hadoop cluster.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Explored and analyzed the customer specific features by using Matplotlib in Python and ggplot2 in R.
  • Performed data imputation using Scikit-learn package in Python.
  • Participated in features engineering such as feature generating, PCA, feature normalization and label encoding with Scikit-learn preprocessing.
  • Used Python (NumPy, SciPy, pandas, Scikit-learn, seaborn) and R to develop a variety of models and algorithms for analytic purposes.
  • Worked on Natural Language Processing with the NLTK module in Python and developed NLP models for sentiment analysis.
  • Experimented and built predictive models including ensemble models using machine learning algorithms such as Logistic regression, Random Forests, and KNN to predict customer churn.
  • Conducted analysis of customer behaviors and discovered customer value with RFM (Recency, Frequency, Monetary) analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering, Gaussian Mixture Models, and Hierarchical Clustering.
  • Used F-Score, AUC/ROC, Confusion Matrix, Precision, and Recall to evaluate different models' performance.
  • Designed and implemented a recommendation system which leveraged Google Analytics data and the machine learning models and utilized Collaborative filtering techniques to recommend courses for different customers.
  • Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
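The RFM scoring behind the customer-value analysis above can be sketched with a minimal pure-Python example (the transaction log, customer IDs, and dates are invented for illustration):

```python
from datetime import date

# Toy transaction log: (customer_id, order_date, amount)
transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 1), 80.0),
    ("c2", date(2023, 6, 10), 40.0),
    ("c3", date(2024, 2, 20), 300.0),
    ("c3", date(2024, 3, 15), 150.0),
    ("c3", date(2024, 3, 28), 90.0),
]

def rfm(transactions, today):
    """Compute Recency (days since last order), Frequency (order count),
    and Monetary (total spend) per customer."""
    out = {}
    for cust, d, amount in transactions:
        last, f, m = out.get(cust, (None, 0, 0.0))
        last = d if last is None else max(last, d)
        out[cust] = (last, f + 1, m + amount)
    return {c: ((today - last).days, f, m) for c, (last, f, m) in out.items()}

scores = rfm(transactions, today=date(2024, 4, 1))
# Customers with low recency and high frequency/monetary values
# (like "c3" here) would feed into segmentation, e.g. K-Means on (R, F, M).
```

The resulting (R, F, M) triples are the features that clustering algorithms such as K-Means would then segment.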

Technology Stack: Hadoop, HDFS, Python, R, Tableau, Machine Learning (Logistic regression/ Random Forests/ KNN/ K-Means Clustering/ Hierarchical Clustering/ Ensemble methods/ Collaborative filtering), JIRA, GitHub, Agile/ SCRUM, GCP

Confidential

Machine Learning Engineer

Responsibilities:

  • Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format. Queried and retrieved data from Oracle database servers to get the dataset.
  • In the preprocessing phase, used Pandas to remove or replace all missing data, and balanced the dataset by over-sampling the minority label class and under-sampling the majority label class.
  • Used PCA, other feature engineering techniques, feature scaling, and Scikit-learn preprocessing to reduce high-dimensional data drawn from the entire patient visit history, proprietary comorbidity flags, and comorbidity scoring across over 12 million EMR and claims records.
  • In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into patient admission and discharge data.
  • Experimented with predictive models including Logistic Regression, Support Vector Machine (SVM), Gradient Boosting and Random Forest using Python Scikit-learn to predict whether a patient might be readmitted.
  • Designed and implemented Cross-validation and statistical tests including ANOVA, Chi-square test to verify the models’ significance.
  • Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
  • Set up data preprocessing pipeline to guarantee the consistency between the training data and new coming data.
  • Deployed the model on AWS Lambda. Collected the feedback after deployment, retrained the model and tweaked the parameters to improve the performance.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
  • Used Agile methodology and the Scrum process for project development.
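The class-balancing step described above can be sketched in plain Python (random over-/under-sampling on toy data, invented for illustration; the actual work used Pandas, and SMOTE-style synthetic sampling is more involved):

```python
import random

def balance_classes(samples, labels, seed=0):
    """Random over/under-sampling for a binary dataset: the majority class is
    under-sampled and the minority class over-sampled (with replacement) so
    both classes end up at the midpoint of their original counts."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    (maj, maj_rows), (mino, min_rows) = sorted(
        by_label.items(), key=lambda kv: len(kv[1]), reverse=True)
    target = (len(maj_rows) + len(min_rows)) // 2
    kept_maj = rng.sample(maj_rows, target)         # under-sample majority
    boosted_min = min_rows + [rng.choice(min_rows)  # over-sample minority
                              for _ in range(target - len(min_rows))]
    return kept_maj + boosted_min, [maj] * target + [mino] * target

# 90/10 imbalance, as in a readmission-style minority-outcome dataset
X = list(range(100))
y = [0] * 90 + [1] * 10
Xb, yb = balance_classes(X, y)
```

Balancing only the training split this way keeps the model from trivially predicting the majority class; the held-out test data stays at its natural class ratio.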

Technology Stack: AWS EC2, S3, Oracle DB, AWS, Linux, Python (Scikit-Learn/NumPy/Pandas/Matplotlib), Machine Learning (Logistic Regression/Support Vector Machine/Gradient Boosting/Random Forest), Tableau.
