Principal Data Scientist/Data Engineer Resume
SUMMARY
- Extensive working experience in Data Science, Data Engineering, Machine Learning, Deep Learning, Data Mining, Predictive Modeling, Recommendation Systems, ETL Development, and Data Visualization
- Comprehensive programming skills in Python 2/3, R, Scala, MATLAB, SQL, Bash, JavaScript, HTML5, CSS3, C, C#, and Java
- Expertise in supervised machine learning algorithms like Linear and Logistic Regression, Decision Trees, AdaBoost, Gradient Boosting, XGBoost, Random Forest, Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, LDA (Linear Discriminant Analysis), and Neural Networks; and unsupervised learning algorithms like K-Means Clustering and PCA (Principal Component Analysis)
- Skilled in the deep learning frameworks TensorFlow, Keras, and PyTorch; familiar with deep learning models such as DNNs, CNNs, RNNs, and LSTMs
- Experienced in building Data Warehousing and Extract Transform Load (ETL) pipelines using Spark, Airflow and cloud tools
- Experience in defining project scope across Data Science and Data Analytics projects in collaboration with senior management and clients
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design specification, in both Waterfall and Agile methodologies
- Adept in using Python libraries such as Pandas, NumPy, SciPy, Seaborn, Matplotlib, Scikit-learn, Keras, NLTK
- Experience in using Anaconda Navigator (Jupyter Notebook), PyCharm, RStudio for Python and R programming
- Working knowledge with Big Data technologies like Hadoop, MapReduce, Spark, SparkSQL, HDFS, Hive, HBase
- Expert in designing visualizations using Tableau 10.3, Dash, R-Shiny, Power BI, and D3.js
- Experience in using A/B tests, hypothesis tests, and ANOVA to evaluate model accuracy
- Professional experience handling structured and unstructured data (social media, text, photographs, and videos) using relational databases like MySQL 5.x and Oracle 11g
- Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and SQL Azure
- Involved in migrating existing on-premise systems/applications to the Azure cloud
- Implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB)
- Strong virtualization experience, including datacenter migration and Azure Data Services
- Designed and implemented streaming solutions using Kafka or Azure Stream Analytics
- Expert in dealing with big data on NoSQL databases like Cassandra 3.0 and MongoDB 3.2
- In-depth knowledge of cloud infrastructure such as AWS S3, AWS EC2, and Docker
- Experience working with version control systems like Git, and with source code management client tools like Git Bash and GitHub
- Excellent communication, analytical, interpersonal, and presentation skills; expert at managing multiple projects simultaneously
- Familiar with current industry standards, such as ISO, Six Sigma, and the Capability Maturity Model (CMM)
- Good knowledge of Microsoft Project, Microsoft Office, WordPress, Photoshop, etc.
TECHNICAL SKILLS
Machine Learning/Deep Learning: Regression models, Naive Bayes, Decision Trees, Random Forests, AdaBoost, XGBoost, SVM, KNN, Bagging, Gradient Boosting, LDA, K-Means, Neural Networks, CNN, RNN
Packages: NumPy, Pandas, SciPy, Seaborn, Matplotlib, Plotly, Keras, Scikit-learn, NLTK, PyTorch, Beautiful Soup, WordCloud, TensorFlow, Flask
Languages: Python 2.7/3.6, R, SQL, JavaScript, Scala, Pig, HTML5, XML, CSS3, Shell, Markdown
Databases: MySQL 5.x, Oracle 11g, PostgreSQL 9.6, MongoDB 3.2, Cassandra 3.0
BI Tools: Tableau 10.3, Microsoft Power BI, MicroStrategy, Dash, R-Shiny
Infrastructure: Databricks, Docker, AWS, GCP, Microsoft Azure, Git, Bitbucket
Report/Document Tools: MS Office 2016, MS Project, Outlook, Excel, Word, PowerPoint
Big Data Tools: Spark, Spark SQL, Hadoop, MapReduce, Hive, HBase
Operating Systems: Linux, Ubuntu, macOS, CentOS, Windows
PROFESSIONAL EXPERIENCE
Confidential
Principal Data Scientist/Data Engineer
Responsibilities:
- Project Development: Designed and developed scalable production-level recommendation systems leveraging machine learning, deep learning, natural language processing, and statistical modeling in Python to solve real-world business problems; collaborated with backend and frontend engineers to integrate the recommendation systems into a Flask REST framework and successfully deployed them on CentOS
- Content-Based Dog Breed Selector: Built a new algorithm flow on top of the existing rule-based dog breed selector; designed a principled approach (TF-IDF + cosine similarity) to compute the similarity between text data; constructed a predictive model of user behavior; implemented the Personalized Recommendation API to deliver recommended breeds to dog seekers (see the similarity sketch after this role's environment list)
- Machine Learning API: Deployed the dog breed selector model as a REST API using Flask; built and pickled the content-based model; deployed it on a Linux (CentOS) server (see the serving sketch below)
- Model Management: Used the MLflow framework to manage the machine learning lifecycle; evaluated different models and shipped the best one to production; created a docker-compose.yml file and managed multiple isolated environments on a single host
Environment: Python 3.6, JavaScript, Docker Compose, CentOS, MySQL 8.0.17, React, flask-restful, iTerm, AWS, NumPy, Pandas, Scikit-learn, Keras, NLTK, TensorFlow, Git, JIRA, VS Code
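A minimal sketch of the TF-IDF + cosine-similarity matching used in the breed selector above; the breed descriptions and user query are made-up placeholders, not production data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical breed profiles and a hypothetical user's free-text answers
breed_docs = [
    "calm apartment-friendly companion, low shedding",
    "high-energy working dog, needs daily exercise",
    "gentle family dog, good with children",
]
user_query = ["quiet low-shedding dog for a small apartment"]

vectorizer = TfidfVectorizer(stop_words="english")
breed_matrix = vectorizer.fit_transform(breed_docs)  # one row per breed
query_vec = vectorizer.transform(user_query)         # project query into the same space

# Rank breeds by cosine similarity to the user's description
scores = cosine_similarity(query_vec, breed_matrix).ravel()
best = scores.argmax()
print(f"best match: breed #{best} (score={scores[best]:.3f})")
```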
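A minimal sketch of serving the pickled content-based model behind a REST endpoint with flask-restful (listed in the environment); the file name, payload shape, and `recommend` method are assumptions for illustration:

```python
import pickle

from flask import Flask, request
from flask_restful import Api, Resource

# Hypothetical pickled content-based model, loaded once at startup
with open("breed_selector.pkl", "rb") as f:
    model = pickle.load(f)

app = Flask(__name__)
api = Api(app)

class Recommend(Resource):
    def post(self):
        payload = request.get_json(force=True)
        # Assumes the model exposes a recommend(text, k) method
        breeds = model.recommend(payload["text"], k=5)
        return {"breeds": breeds}

api.add_resource(Recommend, "/recommend")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```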
Confidential
Data Scientist
Responsibilities:
- Project Development: Designed and developed scalable production-level recommendation systems leveraging machine learning, deep learning, natural language processing, and statistical modeling in Python to solve real-world business problems; collaborated with backend and frontend engineers to integrate the recommendation systems into the Django REST Framework and successfully deployed them on AWS EC2
- Data Analysis: Translated data into meaningful facts to help businesses make better decisions; performed cleansing, manipulation, analysis, and visualization of client data; generated data visualization dashboards using Tableau 10.3 and the Python libraries Matplotlib and Seaborn
- Data Preprocessing: Collected 6 GB of data through the company's API; built a data processing pipeline and performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python; built a streaming ETL in Spark that writes only the data that changed from the previous batch (see the incremental-write sketch after this role's environment list)
- NLP (Natural Language Processing) Techniques: Built projects utilizing NLP knowledge including text mining, regex, bag of words, TF-IDF, Word2Vec, PCA, LSTMs, cosine similarity, sentiment analysis, NER, and information extraction
- Log Classification: Applied feature selection based on tree importance to obtain the 8 most important features from IVR data; extracted features from modem logs and trained a Random Forest to classify intents (labels); then built a content-based recommender that suggests improvements based on the status of a given cable modem (see the feature-importance sketch below)
- Recommendation Algorithm: Designed user-based and item-based collaborative filtering based on Pearson correlation between users/items; hybridized the content-based recommender with collaborative filtering (see the Pearson-similarity sketch below)
- Model Evaluation: Measured model performance using the confusion matrix and AUC-ROC curve, deriving accuracy, precision, recall, and F1 score from the confusion matrix; used grid search to tune hyperparameters, evaluating a model for each combination of algorithm parameters specified in a grid, ultimately increasing accuracy by 5% (see the grid-search sketch below)
- Agile Project Coordinator: Pitched machine learning ideas, showed exploratory data analysis (EDA), and presented project demos to front-desk business users; suggested, collected, and synthesized business requirements based on use cases; created an effective roadmap toward the deployment of a production-level machine learning application
Environment: Python 3.6, Golang, Flask, Celery, iTerm, NumPy, Pandas, Seaborn, Matplotlib, NLTK, Scikit-learn, AWS S3, Databricks, Tableau 10.3, Spark, Spark SQL, PL/SQL, Git, JIRA
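A minimal sketch of the changed-rows-only Spark write from the Data Preprocessing bullet, assuming a left-anti join against the previously processed batch; the storage paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

current = spark.read.parquet("s3://bucket/raw/current_batch/")   # hypothetical path
previous = spark.read.parquet("s3://bucket/processed/latest/")   # hypothetical path

# Keep only rows that are new or changed relative to the previous batch
changed = current.join(previous, on=current.columns, how="left_anti")

# Append just the delta to the processed table
changed.write.mode("append").parquet("s3://bucket/processed/latest/")
```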
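A minimal sketch of the tree-importance feature selection from the Log Classification bullet; synthetic data stands in for the proprietary IVR/modem-log features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the IVR/modem-log feature matrix
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance and keep the top 8
top8 = np.argsort(forest.feature_importances_)[::-1][:8]
print("top-8 feature indices:", top8)
X_selected = X[:, top8]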
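A minimal sketch of the Pearson-correlation item similarity behind the collaborative-filtering bullet; the ratings matrix is a toy example:

```python
import pandas as pd

# Toy user-item ratings matrix (NaN = not rated)
ratings = pd.DataFrame(
    {"item_a": [5, 3, 4, None], "item_b": [4, 2, 4, 1], "item_c": [1, None, 2, 5]},
    index=["u1", "u2", "u3", "u4"],
)

# corr() computes Pearson correlation by default, pairwise-ignoring NaNs
item_sim = ratings.corr(method="pearson")

# Items most similar to item_a, for item-based neighborhood recommendations
print(item_sim["item_a"].drop("item_a").sort_values(ascending=False))
```

The same `corr` call on the transposed matrix (`ratings.T.corr()`) yields user-user Pearson similarities for the user-based variant.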
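A minimal sketch of the grid search and confusion-matrix evaluation from the Model Evaluation bullet; the estimator and parameter grid are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one model per parameter combination in the grid, keep the best by CV score
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
print(confusion_matrix(y_te, pred))       # TP/FP/FN/TN counts
print(classification_report(y_te, pred))  # precision, recall, F1 per class
```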
Confidential
Data Scientist
Responsibilities:
- Strategy Building: Member of a five-person group charged with building resume-parsing systems for a recruiting platform, using NLP strategies based on machine learning and deep learning
- Implementation: Transformed resumes from PDF, Word, and other formats to text files using Tika (see the Tika sketch after this role's environment list); created corpus word lists including a segment keyword list, a university list, a company list, etc.; searched for segment keywords and created bounding boxes near keywords using hierarchical layout, then stored each sentence in its respective segment; performed feature extraction by creating segment-specific feature lists and searching for the main features in each segment
- Machine Learning/Deep Learning: Developed machine learning algorithms for Named Entity Recognition (NER), such as recognizing candidate names and company names; used Support Vector Machine and Naïve Bayes classifiers to improve segmentation results; applied regular expressions for information extraction, such as extracting email addresses (see the regex sketch below); implemented deep learning multi-class classification using RNN and CNN networks; designed a confusion matrix and calculated precision, recall, and F1 score to measure model performance, with accuracy reaching 99.9%
- Data Engineering: Constructed a data pipeline on AWS by deploying a Linux environment with Jupyter Notebook to query and clean data, run the pipeline's ETL, and prepare machine-learning-oriented feature tables; applied cloud technology (Google Cloud, AWS, and Databricks) to synchronize and deploy a parse server (Docker container) on AWS EC2; processed one million resume files and improved time efficiency by a factor of 20
- Interpersonal Communication and Leadership: Served as group leader for all interns in developing an adaptive information extraction algorithm based on about 100 academic papers; reviewed and refined all interns' information extraction strategies by testing their results; collaborated with product managers, marketing analysts, and front-end engineers to deliver features
Environment: Python 3.6, JRE 8, Docker, PostgreSQL 9.6, Apache Tika, Databricks, PyCharm, iTerm, AWS, NumPy, Pandas, Scikit-learn, Keras, NLTK, TensorFlow, Git
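A minimal sketch of the document-to-text step using the tika-python bindings for Apache Tika (requires a Java runtime); the file name and segment keywords are placeholders:

```python
from tika import parser

# Tika handles PDF, .doc/.docx, and many other formats behind one call
parsed = parser.from_file("resume.pdf")
text = parsed.get("content") or ""

# Downstream segmentation scanned this text for section keywords
# drawn from the curated corpus word lists described above
for line in text.splitlines():
    if any(kw in line.lower() for kw in ("education", "experience", "skills")):
        print("segment header candidate:", line.strip())
```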
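A minimal sketch of the regular-expression information extraction (here, email addresses); the pattern is a common simplified form, not the full RFC 5322 grammar:

```python
import re

# Simplified email pattern: local part, "@", domain, dot, TLD
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

sample = "Contact: jane.doe@example.com or (555) 123-4567"
print(EMAIL_RE.findall(sample))  # ['jane.doe@example.com']
```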