Sr. Data Scientist/Machine Learning Engineer Resume
Hillsboro, OR
PROFESSIONAL SUMMARY:
- Data Scientist with around 5 years of experience in areas including Data Analysis, Statistical Analysis, Machine Learning, Deep Learning, and Data Mining on large sets of structured and unstructured data.
- Experienced in using various Python libraries (Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, Pandas, and MySQLdb for database connectivity).
- Experience in building end-to-end data science solutions using R, Python, SQL and Tableau by leveraging machine-learning-based algorithms, Statistical Modeling, Data Mining, Natural Language Processing (NLP) and Data Visualization.
- Deep understanding of Statistical Modeling, Multivariate Analysis, model testing, problem analysis, model comparison and validation.
- Well versed in Python and OOP concepts such as Inheritance, Polymorphism, Abstraction, Association, etc.
- Sound understanding of deep learning (CNN, RNN, ANN), reinforcement learning, and transfer learning.
- Experienced in developing machine learning models for real-world problems using R and Python.
- Experienced in Agile methodologies, including Scrum stories and sprints, in a Python-based environment, along with data analytics and data wrangling.
- Worked on theoretical foundations and practical hands-on projects related to supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines (SVM), neural networks (NN), NLP), unsupervised learning (clustering, dimensionality reduction, recommender systems), probability & statistics, experiment analysis, confidence intervals, A/B testing, algorithms and data structures.
- Excellent understanding of machine learning techniques and algorithms, such as K-NN, Naive Bayes, SVM, Decision Forests, and Random Forest.
- Experience with command-line scripting, data structures and algorithms.
- Experienced in processing large datasets with Spark using Python.
- Solid understanding of big data technologies like Hadoop, Spark, HDFS, MapReduce, Pig and Hive.
- Experience with machine learning tools and libraries such as Scikit-learn, R, Spark and Weka
- Experience working with large, real-world data (unsupervised data): big, messy, incomplete, and full of errors.
- Hands-on experience with NLP, mining of structured, semi-structured, and unstructured data
- Expertise in database Performance Tuning using Oracle Hints, Explain plan, TKPROF, Partitioning and Indexes
- Experience in manipulating large data sets with R packages like tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and visualizing the data using the lattice and ggplot2 packages.
- Completed an intensive hands-on Data Analytics boot camp spanning statistics to programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
- Strong background in Machine Learning, Predictive Analysis and Data Mining with a broad understanding of Supervised and Unsupervised learning techniques and algorithms (e.g., Regression, K-NN, SVM, Naïve Bayes, Decision trees, Clustering, etc.)
- Proficient in data visualization tools such as Tableau 10.5, Power BI 2.30, Python Matplotlib/Seaborn, and R ggplot2/Shiny to generate charts such as box plots, scatter charts, pie charts, and histograms, and to create visually impactful and actionable interactive reports and dashboards.
- Experience in using Teradata ETL tools and utilities such as BTEQ, MLOAD, FASTLOAD, TPT, and FastExport.
- Experience with R programming, visualization tools, SAS, and open-source tools.
- Strong experience writing stored procedures, functions, triggers and ad hoc queries using PL/SQL.
- Experienced in integration of various relational and non-relational sources such as DB2, Oracle, Netezza, SQL Server, NoSQL, COBOL, XML and flat files into a Netezza database.
- Extensive experience in Normalization (1NF, 2NF, 3NF and BCNF) and De-normalization techniques for improved database performance in Data Warehouse/Data Mart environments.
TECHNICAL SKILLS:
Machine Learning: Neural Networks, Deep Learning, NLP, Recommendation Systems, IoT
Software Development: Agile, Scrum, Jira, Wiki, Git, SVN, AWS, Predix, Microsoft Azure, Third Party API integration, Unit Testing, Code coverage
Database: MySQL, MSSQL, DB2, PostgreSQL, Cassandra, HDFS
Python: SciPy, NumPy, IPython, Scikit-learn, PySpark, Pandas, Flask, TensorFlow, Keras
R: recommenderlab, randomForest, glm, rpart, xgboost
SAS: Logistic Regression, Decision Tree, Proc
Visualization: Tableau, PowerBI, ggplot, matplotlib
Operating System: Windows, Linux, Unix, Mac OS, Red Hat
Data Modeling Tools: Erwin r 9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner
Hadoop Ecosystem: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka
PROFESSIONAL EXPERIENCE:
Confidential, Hillsboro OR
Sr. Data Scientist/Machine Learning Engineer
Responsibilities:
- Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
- Set up storage and data analysis tools in Amazon Web Services (AWS) cloud computing infrastructure.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Mahout, Hadoop and MongoDB.
- Worked with Data Architects and IT Architects to understand the movement of data and its storage, using ER/Studio 9.7.
- Worked with several R packages including knitr, dplyr, SparkR, CausalInfer, Space-Time.
- Coded R functions to interface with Caffe Deep Learning Framework.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for developing various machine learning algorithms.
- Demonstrated experience in design and implementation of Statistical models, Predictive models, enterprise data model, metadata solution and data life cycle management in both RDBMS and Big Data environments.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Used machine learning algorithms such as decision trees and random forests to predict the urgency of problem statements received by the company, by calculating weighted totals of each statement's polarity and subjectivity and classifying each statement accordingly.
- Converted the text data in the problem statements into numerical/ordinal data using parameters such as polarity and subjectivity, developing a mathematical model to integrate the two statistics.
- Installed and used the Caffe Deep Learning Framework.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib.
- Used Data Quality Validation techniques to validate Critical Data Elements (CDE) and identified various anomalies.
- Developed various QlikView Data Models by extracting and using data from various source files, DB2, Excel, Flat Files and Big Data.
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
- As Architect, delivered various complex OLAP Databases/Cubes, Scorecards, Dashboards and Reports.
- Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
- Designed both 3NF data models for ODS and OLTP systems and Dimensional Data Models using Star and Snowflake Schemas.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we would be able to assign each document a response label for further classification.
- Created SQL tables with referential integrity and developed queries using SQL, SQL*Plus and PL/SQL.
- Designed and developed Use Case, Activity Diagrams, Sequence Diagrams, and OOD (Object-Oriented Design) using UML and Visio.
Environment: AWS, R, Informatica, Machine Learning Algorithms, Anaconda, Market Basket Analysis, Sentiment Analysis, Polarity, Predictive Analytics, Deep Learning Algorithms, CNN, HCNN, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.
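Illustration of the urgency scoring described in the responsibilities above: a minimal sketch, assuming polarity lies in [-1, 1] and subjectivity in [0, 1]. The weights, thresholds, and names (Statement, urgency_score, urgency_label) are hypothetical, and the simple threshold rule stands in for the decision tree and random forest classifiers named above.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
POLARITY_WEIGHT = 0.7
SUBJECTIVITY_WEIGHT = 0.3


@dataclass
class Statement:
    text: str
    polarity: float      # assumed range: -1.0 (negative) to 1.0 (positive)
    subjectivity: float  # assumed range: 0.0 (objective) to 1.0 (subjective)


def urgency_score(s: Statement) -> float:
    """Combine the two statistics into one weighted total in [0, 1]."""
    # Map polarity onto 0..1 so that strongly negative text scores highest.
    negativity = (1.0 - s.polarity) / 2.0
    return POLARITY_WEIGHT * negativity + SUBJECTIVITY_WEIGHT * s.subjectivity


def urgency_label(score: float) -> str:
    """Bucket the score into ordinal classes (thresholds are assumptions)."""
    if score >= 0.66:
        return "high"
    if score >= 0.33:
        return "medium"
    return "low"
```

In practice the weighted total would more likely be one input feature to a trained classifier than the sole basis for the label.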
Confidential, Malvern, PA
Data Scientist
Responsibilities:
- Implemented machine learning methods, optimization, and visualization, applying statistical methods such as Regression Models, Decision Trees, Naïve Bayes, Ensemble Classifiers, Hierarchical Clustering and Semi-Supervised Learning to different datasets using Python.
- Researched and implemented various Machine Learning Algorithms using the R language.
- Devised a machine learning algorithm using Python for facial recognition.
- Used R to prototype on sample data exploration and identify the best algorithmic approach, then wrote Scala scripts using the Spark machine learning module.
- Used Scala scripts to run Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms.
- Worked on migrating an on-premises virtual machine to an Azure Resource Manager subscription with Azure Site Recovery.
- Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform, targeting high availability of services and low operational costs.
- Developed structured, efficient and error-free code for Big Data requirements using knowledge of Hadoop and its ecosystem.
- Developed a web service using Windows Communication Foundation and .NET to receive and process XML files, deployed as a Cloud Service on Microsoft Azure.
- Worked on various methods including data fusion and machine learning and improved the accuracy of distinguishing correct rules from potential rules.
- Developed Merge jobs in Python to extract and load data into a MySQL database.
- Used a test-driven approach to develop the application and implemented unit tests using the Python unittest framework.
- Wrote unit test cases in Python and Objective-C for other API calls in the customer frameworks.
- Tested various machine learning algorithms such as Support Vector Machine (SVM), Random Forest, and trees with XGBoost, and selected Decision Trees as the champion model.
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
Environment: Machine Learning, R Language, Hadoop, Big Data, Azure, Python, Java, J2EE, Spring, Struts, JSF, Dojo, JavaScript, DB2, CRUD, PL/SQL, JDBC, Coherence, MongoDB, Apache CXF, SOAP, Web Services, Eclipse
Confidential, Boston, GA
Data Scientist
Responsibilities:
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
- Analyzed a pre-existing predictive model developed by the advanced analytics team and the factors considered during model development.
- Experienced in all phases of data mining: data collection, data cleaning, developing models, validation and visualization.
- Analyzed metadata and processed data to get better insights into the data.
- Created initial data visualizations in Tableau to provide basic insights into the data to the project stakeholders.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, regression models, clustering, and SVM to identify Volume using the scikit-learn package in Python.
- Communicated regularly with leaders of other teams to gain a deeper understanding of the data.
- Analyzed a dataset of 14M records and reduced it to 1.3M by filtering out rows with duplicate customer IDs, and removed outliers using boxplots and univariate algorithms.
- Performed extensive exploratory data analysis using Teradata to improve the quality of the dataset and developed Machine Learning algorithms using Python for predicting the model quality and created Data Visualizations using Tableau.
- Developed visualizations using R packages like ggplot2, choroplethr to identify patterns and trends in the preprocessed data.
- Used RStudio packages and Python libraries like scikit-learn to improve the model accuracy from 65% to 86%.
- Experienced in Python libraries such as Pandas and NumPy (one- and two-dimensional arrays).
- Experienced in using the PyTorch library and implementing natural language processing.
- Developed data visualizations in Tableau to display day-to-day accuracy of the model with newly incoming data.
- Held a point of view on the strengths and limitations of statistical models and analyses in various business contexts and was able to evaluate and effectively communicate the uncertainty in the results.
- Used the Keras library to build and train deep learning models, with good results.
- Developed a propensity model that delivered greater ROI than other models, achieving $0.95 million ROI per cycle, with a cycle duration of one quarter.
- Implemented a complete data science project involving data acquisition, data wrangling, exploratory data analysis (EDA), model development and model evaluation.
Environment: MS Access, SQL Server, Teradata, Advanced SQL, RStudio (ggplot2, caret), Python (Pandas, NumPy, Scikit-learn), Machine Learning (Logistic Regression, Decision trees, SVM, Random forest), PyTorch, Keras, Tableau, Excel
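Illustration of the data reduction described in the responsibilities above (duplicate customer IDs filtered out, outliers removed with a boxplot rule): a minimal sketch using only the Python standard library. The field names and the choice to keep the first row per customer ID are assumptions; the original work may have handled duplicates differently.

```python
from statistics import quantiles


def drop_duplicate_ids(rows, key="customer_id"):
    """Keep the first row seen for each customer ID (an assumed policy)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def remove_iqr_outliers(rows, field, k=1.5):
    """Drop rows outside the boxplot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[field] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if low <= row[field] <= high]
```

The factor k=1.5 is the conventional boxplot whisker length; a larger value keeps more of the tail.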
Confidential
Data Analyst
Responsibilities:
- Collected data from the end client, performed ETL, and defined a uniform standard format
- Wrote queries to retrieve data from SQL Server database to get the sample dataset containing basic fields
- Performed string formatting on the dataset converting hours from date format to a numerical integer
- Used Python libraries like Matplotlib and Seaborn to visualize the numerical columns of the dataset such as day of week, age, hour and number of screens.
- Developed and implemented predictive models like Logistic Regression, Decision Tree, and Support Vector Machine (SVM) to predict the probability of enrollment
- Used ensemble learning methods like Random Forest, Bagging, and Gradient Boosting; picked the final model based on confusion matrix, ROC, and AUC to predict the probability of customer enrollment
- Worked on missing value imputation and outlier identification with statistical methodologies using Pandas and NumPy
- Tuned the hyperparameters of the above models using Grid Search to find the optimum models
- Designed and implemented K-Fold Cross-validation to test and verify the model’s significance
- Developed a dashboard and story in Tableau showing the benchmarks and a summary of the model’s measures.
- Made extensive use of tools such as R, Python, ODS, DB2, Metadata, and MS Excel to analyze data from multiple perspectives and provide a robust machine learning algorithm.
- Created new tools and business processes that simplify, standardize and enable operational excellence.
- Used tools like Tableau for drilling down into data, creating insightful reports, and garnering actionable business insights.
Environment: Tableau report builder, MS Outlook, SQL Server 2012/2014, Python (Scikit-Learn, NumPy, Pandas, Matplotlib, Dateutil, Seaborn), Tableau, Hadoop.
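Illustration of the grid search with K-fold cross-validation mentioned in the responsibilities above: a minimal, library-free sketch. The helper names and the caller-supplied fit_and_score function are hypothetical; scikit-learn provides the same idea through GridSearchCV, which is what a Scikit-Learn workflow would normally use.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search(fit_and_score, param_grid, n_samples, k=5):
    """Return (best_params, best_mean_score) over every combination in param_grid.

    fit_and_score(params, train_idx, test_idx) is a hypothetical caller-supplied
    function that trains on the train rows and returns a score on the test rows.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = mean(
            fit_and_score(params, train, test)
            for train, test in k_fold_indices(n_samples, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Averaging the score over all folds, rather than trusting a single split, is what makes the chosen hyperparameters less sensitive to how the data happened to be partitioned.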
