
Data Scientist / Machine Learning Engineer Resume

San Francisco, CA


  • Data Scientist with 7+ years of progressive experience in Data Analytics, Statistical Modeling, Visualization and Machine Learning. Strong collaborator who learns quickly and adapts readily.
  • Experience in data mining with large datasets of structured and unstructured data, data acquisition, data validation, predictive modeling and data visualization.
  • Experience in data integration, profiling, validation, cleansing, transformation and visualization using R and Python.
  • Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.
  • Extensive knowledge of Azure Data Lake and Azure Storage.
  • Experience in migrating data from heterogeneous sources, including Oracle, to MS SQL Server.
  • Hands-on experience in design, management and visualization of databases using Oracle, MySQL and SQL Server.
  • In-depth knowledge of and hands-on experience with the Big Data / Hadoop ecosystem (MapReduce, HDFS, Hive, Pig and Sqoop).
  • Experience with Apache Spark and Kafka for Big Data processing, and with functional programming in Scala.
  • Experience in manipulating large data sets with R packages like tidyr, tidyverse, dplyr, reshape, lubridate and caret, and visualizing the data using the lattice and ggplot2 packages.
  • Experience in dimensionality reduction using techniques like PCA and LDA.
  • Completed an intensive hands-on boot camp on Data Analytics, spanning from statistics to programming and including data engineering, data visualization, machine learning and programming in R and SQL.
  • Experience in data analytics and predictive analysis such as Classification, Regression and Recommender Systems.
  • Good exposure to Factor Analysis, Bagging and Boosting algorithms.
  • Experience in descriptive analysis problems like Frequent Pattern Mining, Clustering and Outlier Detection.
  • Worked on Machine Learning algorithms for classification and regression, including KNN, Decision Tree, Naïve Bayes, Logistic Regression, SVM and Latent Factor models.
  • Hands-on experience with Python and libraries like NumPy, Pandas, Matplotlib, Seaborn, NLTK, scikit-learn and SciPy.
  • Expertise in using the TensorFlow package for machine learning and deep learning in Python.
  • Good knowledge of Microsoft Azure SQL, Machine Learning and HDInsight.
  • Good exposure to SAS analytics.
  • Good exposure to deep learning with TensorFlow in Python.
  • Good knowledge of Natural Language Processing (NLP) and of Time Series Analysis and Forecasting using the ARIMA model in Python and R.
  • Good knowledge of Tableau and Power BI for interactive data visualizations.
  • In-depth understanding of NoSQL databases like MongoDB and HBase.
  • Strong experience in provisioning virtual clusters on the AWS cloud, using services like EC2, S3 and EMR.
  • Experience in developing software using Java and C++ (data structures and algorithms).
  • Good exposure to creating pivot tables and charts in Excel.
  • Experience in developing custom reports and different types of tabular reports, matrix reports, ad hoc reports and distributed reports in multiple formats using SQL Server Reporting Services (SSRS).
  • Excellent database administration (DBA) skills, including user authorizations and creation of databases, tables, indexes and backups.


Languages: Java 8, Python, R

NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, Reshape2, rjson, plyr, Beautiful Soup, Rpy2

Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD

NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML

Cloud: Google Cloud Platform, AWS, Azure, Bluemix

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modelling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating System: Windows, Linux, Unix, Macintosh, Red Hat


Confidential, San Francisco, CA

Data Scientist / Machine Learning Engineer


  • Performed data profiling to learn about behavior across features such as traffic pattern, location, date and time.
  • Extracted data from Hive tables by writing efficient Hive queries.
  • Performed preliminary data analysis using descriptive statistics and handled anomalies by removing duplicates and imputing missing values.
  • Analyzed data and performed data preparation by applying historical models to the data set in Azure ML.
  • Applied various machine learning algorithms and statistical modeling techniques, including decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM and clustering, to identify volume using the scikit-learn package in Python and Matlab.
  • Explored DAGs, their dependencies and logs using Airflow pipelines for automation.
  • Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
  • Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified meaningful segments of customers through a discovery approach.
  • Developed Spark/Scala, Python and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources. Used K-Means clustering to identify outliers and to classify unlabeled data.
  • Evaluated models using cross-validation, log loss and ROC curves, used AUC for feature selection, and worked with Elastic technologies like Elasticsearch and Kibana.
  • Worked with the NLTK library for NLP data processing and pattern finding.
  • Categorized comments from different social networking sites into positive and negative clusters using Sentiment Analysis and Text Analytics.
  • Analyzed traffic patterns by calculating autocorrelation at different time lags.
  • Ensured that the model had a low false positive rate, and performed text classification and sentiment analysis on unstructured and semi-structured data.
  • Addressed overfitting by implementing regularization methods such as L1 and L2.
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
  • Created and designed reports that use gathered metrics to draw logical conclusions about past and future behavior.
  • Applied Multinomial Logistic Regression, Random Forest, Decision Tree and SVM to classify whether a package would be delivered on time on a new route.
  • Implemented different models like Logistic Regression, Random Forest and Gradient-Boosted Trees to predict whether a given die will pass or fail the test.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Performed data cleaning, feature scaling and feature engineering using the pandas and NumPy packages in Python.
  • Developed a MapReduce pipeline for feature extraction using Hive and Pig.
  • Created data quality scripts using SQL and Hive to validate successful data loads and data quality. Created various types of data visualizations using Python and Tableau.
  • Communicated results to the operations team to support decision making.
  • Collected data needs and requirements by interacting with other departments.

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, Matlab, Spark SQL, PySpark.

Confidential, Minneapolis, MN

Data Scientist


  • Implemented Data Exploration to analyze patterns and to select features using Python SciPy.
  • Built Factor Analysis and Cluster Analysis models using Python SciPy to classify customers into different target groups.
  • Built predictive models including Support Vector Machine, Random Forests and Naïve Bayes Classifier using Python Scikit-Learn to predict the personalized product choice for each client.
  • Using R’s dplyr and ggplot2 packages, performed an extensive graphical visualization of overall data, including customized graphical representation of revenue reports, specific item sales statistics and visualization.
  • Designed and implemented cross-validation and statistical tests, including Hypothesis Testing, ANOVA and Autocorrelation, to verify the models' significance.
  • Designed an A/B experiment for testing the business performance of the new recommendation system.
  • Supported MapReduce Programs running on the cluster.
  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Configured a Hadoop cluster with a NameNode and slave nodes, and formatted HDFS.
  • Used Oozie workflow engine to run multiple Hive and Pig jobs.
  • Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Hadoop MapReduce and HDFS.
  • Performed data enrichment jobs to handle missing values, normalize data and select features using HiveQL.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and uploaded it to a database.
  • Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the launched instances for specific applications.
  • Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.
  • Created reports and dashboards using D3.js and Tableau 9.x to explain and communicate data insights, significant features, model scores and the performance of the new recommendation system to both technical and business teams.
  • Utilized SQL, Excel and several marketing/web analytics tools (Google Analytics, AdWords) to complete business and marketing analysis and assessment.
  • Used Git 2.x for version control with the Data Engineer team and Data Scientist colleagues.
  • Used Agile methodology and the SCRUM process for project development.

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.

Confidential, NY

Data Analyst/Data Scientist


  • Studied the business and its functionalities through communication with Business Analysts.
  • Analyzed the existing database for performance and suggested methods to redesign the model for improving the performance of the system.
  • Supported ad-hoc, standard reporting and production projects.
  • Designed and implemented many standard processes that are maintained and run on a scheduled basis.
  • Created reports using MS Access and Excel, applying filters to retrieve the best results.
  • Developed stored procedures, SQL joins and SQL queries for data retrieval and analysis, and exported the data into CSV and Excel files.
  • Developed Data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity.
  • Analyzed business requirements, system requirements and data mapping requirement specifications, and communicated them to developers effectively.
  • Documented functional requirements and supplementary requirements in Quality Center.
  • Set up environments to be used for testing and defined the range of functionalities to be tested as per technical specifications.
  • Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
  • Wrote and executed unit, system, integration and UAT scripts in data warehouse projects.
  • Wrote and executed SQL queries to verify that data had been moved from the transactional system to the DSS, data warehouse and data mart reporting systems in accordance with requirements.
  • Troubleshot test scripts, SQL queries, ETL jobs and data warehouse/data mart/data store models.
  • Responsible for different data mapping activities from source systems to Teradata.
  • Developed SQL scripts, stored procedures and views for data processing, maintenance and other database operations.
  • Performed SQL tuning, optimized the database and created technical documents.
  • Imported Excel sheets, CSV and delimited data, and ODBC-compliant data sources into the Oracle database, using advanced Excel features, for data extraction, data processing and business needs.
  • Designed and optimized SQL queries, pass-through queries, make-table queries and joins in MS Access 2003, and exported the data into the Oracle database server.
  • Compiled sales production and market penetration data for executive management. Data included employee activity, client coverage, and territory alignment analysis.
  • Conducted business analysis, project assessment, and feasibility determination.
  • Analyzed data feed requirements for Risk Management, Customer Information Management, and Analytic Support.
  • Familiar with data and content migration using SAS migration utility for products that rely on metadata.
  • Developed CSV files and reported offshore progress to management with the use of Excel Templates, Excel macros, Pivot tables and functions.
  • Improved the accuracy and relevance of credit card clients' planning process reports and budget reports used to make high-level decisions.
  • Managed all UAT deliverables to completion with overlapping releases.

Environment: SAS Enterprise Guide 4.0, OLAP Cube Studio, Stored Processes, SAS Management Console, Informatica 8.1, MS Excel, MS PowerPoint, MS Visio, MS Project Management, Teradata SQL Assistant, Enterprise Miner, SAS DI Studio, MS Access, SQL, SPSS, VBA, PL/SQL, Shell Scripting, Oracle 10g.

Confidential, MA

Data Analyst/Data Scientist


  • Involved in complete Software Development Life Cycle (SDLC) process by analyzing business requirements and understanding the functional work flow of information from source systems to destination systems.
  • Completed a highly immersive Data Science program involving Data Manipulation & Visualization, Web Scraping, Machine Learning, Python programming, SQL, Unix Commands, NoSQL and Hadoop.
  • Used pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, NLTK in Python for developing various machine learning algorithms.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Analyzed sentiment data and detected trends in customer usage and other services.
  • Analyzed and prepared data, and identified patterns in the dataset by applying historical models.
  • Collaborated with Senior Data Scientists to understand the data.
  • Used Python and R scripting to implement machine learning algorithms for prediction and forecasting.
  • Used Python and R scripting to visualize the data and implement machine learning algorithms.
  • Developed packages in R with a Shiny interface.
  • Used predictive analysis to create models of customer behavior that are correlated positively with historical data and use these models to forecast future results.
  • Predicted user preference based on segmentation using Generalized Additive Models, combined with feature clustering, to understand non-linear patterns between user segmentation and related monthly platform usage features (time series data).
  • Performed data manipulation, data preparation, normalization and predictive modeling.
  • Improved efficiency and accuracy by evaluating models in Python and R.
  • Used Python and R scripts to improve the model.
  • Applied various machine learning algorithms and statistical modeling techniques, such as Decision Trees, Random Forest, Regression Models, neural networks, SVM and clustering, to identify volume using the scikit-learn package.
  • Performed data cleaning, applying backward and forward filling methods to the dataset to handle missing values.
  • Developed and validated a Neural Network classification model to predict the label.
  • Applied a boosting method to the predictive model to improve its efficiency.
  • Presented Dashboards to Higher Management for more Insights using Power BI and Tableau.
  • Hands-on experience with Hive, Hadoop, HDFS and related Big Data technologies.

Environment: R/R studio, Python, Tableau, Hadoop, Hive, MS SQL Server, MS Access, MS Excel, Outlook, Power BI.


Data Analyst/Data Scientist


  • Developing Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management (MDM) Architecture involving OLTP, ODS and OLAP.
  • Providing source to target mappings to the ETL team to perform initial, full, and incremental loads into the target data mart.
  • Conducting JAD sessions, writing meeting minutes, collecting requirements from business users and analyzing them.
  • Involved in defining the source to target data mappings, business rules, and data definitions.
  • Performing transformations on files received from clients for consumption by SQL Server.
  • Working closely with the ETL, SSIS and SSRS developers to explain complex data transformation logic.
  • Working on DTS Packages and DTS Import/Export for transferring data to and from SQL Server.
  • Working with data profiling, cleansing, integration and extraction tools.
  • Defining the list codes and code conversions between the source systems and the data mart using Reference Data Management (RDM).
  • Applying data cleansing/data scrubbing techniques to ensure consistency amongst data sets.
  • Extensively using ETL methodology for supporting data extraction, transformations and loading processing, in a complex EDW.

Environment: MS Excel, Agile, Oracle 11g, SQL Server, SOA, SSIS, SSRS, ETL, UNIX, T-SQL, HP Quality Center 11, RDM (Reference Data Management).
