Data Scientist Resume
Irvine, CA
PROFESSIONAL SUMMARY:
- 9+ years of professional experience across all phases of diverse technology projects, specializing in Data Science, Big Data, Azure Machine Learning, Google Cloud, and Tableau on cloud-based infrastructure.
- Knowledge of R and bioinformatics pipelines.
- Expertise in machine learning techniques such as convolutional neural networks, recurrent neural networks, LSTMs, and random forests.
- Experience applying advanced data analysis methodologies from computer science, machine learning, computational statistics, and uncertainty quantification for prediction and scientific discovery, grounded in a thorough understanding of the mechanisms generating the data.
- Expert in the entire Data Science life cycle, including Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine Learning Algorithms, Validation, and Visualization.
- Strong knowledge in Statistical methodologies such as Hypothesis Testing, Principal Component Analysis (PCA), Sampling Distributions and Time Series Analysis.
- Experience in building machine learning models using algorithms such as Linear Regression, Gradient Descent, Support Vector Machines (SVM), Logistic Regression, KNN, Decision Trees, and ensembles such as Random Forest, AdaBoost, and Gradient Boosting Trees.
- Demonstrated ability to apply relevant techniques to drive business impact, including optimization, causal inference, and choice modeling.
- Great intuition for consumer products and marketplace dynamics.
- Expertise in applying computational analysis to life sciences data, with demonstrated integrity and professionalism in handling patient data in a HIPAA-compliant and ethically responsible manner.
- Experience with machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.).
- Experience using statistical computer languages (R, Python, SQL) to manipulate large data sets.
- Experience with database querying languages (SQL), programming languages for reproducible data analysis (Python, R), Electronic Medical Records (Epic Systems), Research database systems (REDCap), and project management frameworks (Lean, A3 Thinking, Agile, Six Sigma).
- Expertise in NLP methods such as LSA, LDA, Semantic Hashing, Word2Vec, LSTM, BiDAF, etc. (a Word2Vec sketch follows this summary).
- Experience using Amazon Web Services (AWS) cloud services such as EC2 and S3 to work with different virtual machines.
- Experience in Apache Spark and Kafka for big data processing, and in Scala functional programming.
- Experience in manipulating large data sets with R packages such as tidyr, tidyverse, dplyr, reshape2, lubridate, and caret, and visualizing data using the lattice and ggplot2 packages.
- Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, (iv) algorithms and data structures.
- Experience building machine learning solutions using PySpark for large data sets on the Hadoop ecosystem.
- Experience in Natural Language Processing (NLP) and in Time Series Analysis and Forecasting using the ARIMA model in Python and R (a minimal ARIMA sketch follows this summary).
- Extensive knowledge on Azure Data Lake and Azure Storage.
- Knowledge and experience in Agile environments such as Scrum, using project management tools like Jira/Confluence and version control tools such as GitHub/Git.
- Deep learning programming experience with Python/TensorFlow or similar libraries in a GPU environment.
- Experience working with external datasets such as SQuAD, SemEval, MSRP, WikTable, WikiQA, AllenAI, etc.
- Tuning and optimization of sequential deep learning models.
- Experience in R, SQL, Python (NumPy, Pandas), Scala/Java (Apache Spark), and data visualization (Matplotlib, Seaborn).
- Experience in machine learning and deep learning libraries such as scikit-learn, TensorFlow, Keras, PyTorch, and Apache Spark MLlib.
- Experience in working with notebooks - Jupyter, Zeppelin, Databricks notebook.
- Experience with Elasticsearch, graph databases (Neo4j), and semantic parsing.
- Experience in big data platforms such as Hadoop, Apache Spark, Hive, and HBase.
- Experience in BI and data visualization tools such as Excel, Qlikview, Tableau.
- Experience using web services (Redshift, S3, Spark).
- Cooperate with BI teams across the organization to find insights and develop the right data solutions.
- Effective at organizing large amounts of data from multiple data sources and building predictive and machine learning models from these large data sets to support solutions to business problems.
- Experience working with SQL server, Google Cloud Platform/Google Big Query and Hadoop ecosystem.
- Understanding of SAS Enterprise Miner/ Enterprise Guide/ Base SAS.
- Hands-on experience and large-scale academic projects involving recommender systems, reinforcement learning, contextual bandits, RNNs (LSTM, GRU, etc.), non-linear optimization, and learning to rank.
- Extensive knowledge and understanding of principles, theories, and concepts relevant to Artificial Intelligence (AI), Machine Learning model development, and Control Systems.
- Knowledge of application and data security concepts, and demonstrated ability to contribute research and technical content to grant proposals.
- Self-motivated, quick to learn, able to meet deadlines, with strong organizational and problem-solving skills.
- Demonstrated effective communication and interpersonal skills, with the ability to convey technical information to technical and non-technical personnel at various levels in the organization and to external research audiences.
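A minimal sketch of the Word2Vec usage referenced in this summary, assuming the gensim library; the toy corpus and hyperparameters are illustrative rather than drawn from any actual project:

# Word2Vec sketch using gensim (toy corpus; all parameters illustrative).
from gensim.models import Word2Vec

corpus = [
    ["patient", "records", "were", "deidentified"],
    ["clinical", "notes", "contain", "rich", "context"],
    ["records", "and", "notes", "support", "nlp", "models"],
]

# Train a small skip-gram model; vector_size, window, and min_count
# would be tuned on a real corpus.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Query nearest neighbours in the learned embedding space.
print(model.wv.most_similar("records", topn=3))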
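Likewise, a minimal sketch of the ARIMA forecasting workflow mentioned above, assuming statsmodels; the file name, column name, and (p, d, q) order are assumptions for illustration:

# ARIMA forecasting sketch (hypothetical 'sales.csv' with a date index).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.read_csv("sales.csv", index_col="date", parse_dates=True)["sales"]

# Fit an ARIMA(1, 1, 1); in practice the order is chosen via ACF/PACF
# inspection or information criteria such as AIC.
fitted = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next 12 periods with confidence intervals.
forecast = fitted.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int())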
SKILLS:
Languages: C, C++, Python, R, SAS, Java, SQL, PL/SQL, MATLAB, DA
Python and R: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitteR, NLP, reshape2, rjson, plyr, Beautiful Soup, Rpy2
Algorithms: Kernel Density Estimation and Non-parametric Bayes Classifier, K-Means, Linear Regression, Neighbors (Nearest, Farthest, Range, k, Classification), Non-Negative Matrix Factorization, Dimensionality Reduction, Decision Tree, Gaussian Processes, Logistic Regression, Naïve Bayes, Random Forest, Ridge Regression, Matrix Factorization/SVD
NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML
Cloud: Google Cloud Platform, AWS, Azure, Bluemix
Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL
Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka
Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.
ETL Tools: Informatica PowerCenter, SSIS.
Version Control Tools: SVN, GitHub
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse
Operating Systems: Windows, Linux, Unix, macOS, Red Hat
PROFESSIONAL EXPERIENCE:
Confidential, Irvine, CA
Data Scientist
Responsibilities:
- Apply data mining techniques, conduct statistical analysis and build high quality prediction models to solve challenging business problems.
- Analyze whole-genome and whole-exome NGS data to characterize patients and cell lines.
- Perform data wrangling, mash up disparate data sets, and develop algorithms using statistical analysis, natural language processing, and machine learning to generate insights.
- Generate insights/outcomes from these models/algorithms for business/user consumption and integrate these with the Tableau interface.
- Query the health records database, validate the results, curate the data, and present the results to stakeholders.
- Write programming scripts for reproducibility, upload data to a HIPAA-compliant repository, and create data visualizations.
- Develop and maintain key relationships with clinicians, researchers, analytics peers, and other stakeholders such as Biospecimen Resources Program, Clinical & Translational Science Institute.
- Perform preliminary data analysis using descriptive statistics and handle anomalies by removing duplicates and imputing missing values.
- Develop strategies for extracting clinical context out of clinical documents from different clinical areas.
- Leverage cutting-edge NLP and machine learning technologies/frameworks such as Keras, TensorFlow, and PyTorch.
- Train deep learning models with internal and external NLP datasets, and define the metrics to be used to measure the NLP engine performance.
- Design and build a feedback control loop system that measures NLP performance and calculates F1 score, recall, and retention (see the metrics sketch after this section).
- Maintain up-to-date list of national and international clinical trials, with an emphasis on targeted therapies.
- Aggregate existing genomic databases (government, academic, and industry).
- Execute statistical and data mining techniques (e.g. hypothesis testing, machine learning and retrieval processes) on large data sets to identify trends, figures and other relevant information.
- Retrain existing predictive models using new data source and possibly new advanced techniques and perform rigorous model evaluation, design hypothesis tests, oversee test execution and result evaluation.
- Play a key role in helping develop and refine novel functional genomics assays and sequencing data pipelines.
- Perform data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Communicate results to the operations team to support decision-making.
- Collect data needs and requirements by interacting with other departments.
- Analyze data and perform data preparation by applying a historical model to the data set in Azure ML.
- Apply various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, neural networks, deep learning, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
- Conduct a hybrid of hierarchical and K-means cluster analysis using IBM SPSS and identify meaningful customer segments through a discovery approach.
- Develop Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; use K-means clustering to identify outliers and classify unlabeled data.
- Work closely with a cross-functional team of life scientists, bioengineers and machine learning scientists to design and analyze experiments to collect high-throughput in vitro genomic data.
- Analyze human-level data from clinical trials (including genetics, transcriptomics, and pathology) and integrate it with in-house genomic data to identify therapeutic targets and develop drugs with high efficacy and low toxicity.
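A minimal sketch of the NLP evaluation step described above (F1 score and recall), assuming scikit-learn; the label arrays are toy data, not real engine output:

# Evaluating an NLP classifier with precision, recall, and F1
# (toy labels; in practice these come from a held-out test set).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # gold annotations
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # NLP engine predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))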
Confidential, Boston, MA
Data Scientist
Responsibilities:
- Help develop and implement effective fraud detection and prevention strategies to mitigate fraud losses while maintaining an appropriate balance between risk and customer experience.
- Design, develop, and validate statistical and machine learning models for credit card risk and loss forecasting to support the credit card business in the US.
- Proactively identify fraud detection issues at the strategy and portfolio level and provide analytical/modeling solutions.
- Leverage customer data to build risk segmentation and mitigation strategies.
- Evaluate new tools and products that enhance risk detection and prevention.
- Interact with Policy, Operations, and other functional business partners to optimize business strategies.
- Develop Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; use K-means clustering to identify outliers and classify unlabeled data.
- Evaluate models using cross-validation, log loss, and ROC curves, use AUC for feature selection, and work with Elastic technologies such as Elasticsearch and Kibana (see the evaluation sketch after this section).
- Work with the NLTK library for NLP data processing and pattern finding.
- Ensure that the model has a low false positive rate; perform text classification and sentiment analysis on unstructured and semi-structured data.
- Use Jupyter Notebook to create and share documents that contain live code, equations, visualizations, and explanatory text.
- Understand and implement text mining concepts, graph processing, and semi-structured and unstructured data processing.
- Work with Ajax API calls to communicate with Hadoop through an Impala connection, using SQL to render the required data.
- Load data from RDBMS and web logs into HDFS using Sqoop and Flume.
- Load data from MySQL to HBase where necessary using Sqoop.
- Develop Hive queries for analysis across different banners.
- Launch Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configure launched instances for specific applications to improve robustness.
- Ensure adherence to strict audit and control requirements; participate in audit-related activities and assist the Fraud Analytics Team Lead with completion of audit and compliance activities.
- Work collaboratively with a team of scientists, analysts, compliance specialists, and the business lines to implement advanced methods to maintain and enhance the Bank's risk mitigation program.
- Work closely with business stakeholders, Financial Analysts, Data Engineers, Data Visualization Specialists and other team members to turn data into critical information and knowledge that can be used to make sound organizational decisions. Propose innovative ways to look at problems by using data mining (the process of discovering new patterns from large datasets) approaches across a wide range and variety of data assets.
- Strengthen the business and help support clients by using data to describe and model the outcomes of investment and business decisions.
- Validate findings using an experimental and iterative approach, and present findings to the business team by exposing assumptions and validation work in a way that business counterparts can easily understand.
- Use analytics techniques to model complex business problems, discovering insights and identifying opportunities through the use of statistical, algorithmic, mining and visualization techniques.
- Integrate and prepare large, varied datasets, implement specialized database and computing environments, and communicate results.
- Improve organizational performance through the application of original thinking to existing and emerging analytic methods, processes, products, and services, and employ sound judgment in determining how innovations will be deployed to produce return on investment.
- Work with Data Engineers and determine how to best source data, including identification of potential proxy data sources, and design business analytics solutions, considering current and future needs, infrastructure and security requirements, and load frequencies.
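A hedged sketch of the cross-validation and ROC/AUC evaluation mentioned in this section, assuming scikit-learn; synthetic data stands in for the actual fraud features:

# Cross-validated ROC AUC for a fraud-style binary classifier
# (synthetic, imbalanced data; real features are not reproduced here).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95], random_state=0)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("ROC AUC per fold:", scores)
print("mean ROC AUC:", scores.mean())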
Confidential, Framingham, MA
Data Scientist
Responsibilities:
- Act as a liaison between Operations, IT, Marketing, and Ecommerce team to aid in the flawless execution of launches and proactively avoid and resolve catalog issues.
- Support Team via web scraping, data mining and presentation of data to inform the day to day business and product launches (data cleansing, modeling, implementation as needed).
- Perform exploratory analysis (building user behavior models, identifying long-term trends) to provide Account Managers with actionable insights that improve forecasting, retailer sell-in, sell-through, inventory management, and other relevant key performance metrics.
- Drive product innovation in testing and insights automation within an e-commerce/retail environment.
- Enable leading e-commerce companies and retailers to make better investment and pricing decisions.
- Develop tools and prototypes for different topics: budget and revenue scenarios in online marketing, CRM analysis, customer cohort analysis, churn prediction, stock level models, pricing strategies, customer journey and attribution model analysis, and automated A/B testing.
- Collaborate with the team to road-map the years ahead for solving various e-commerce problems with Machine Learning and Deep Learning (e.g., customer lifetime modeling, NLP for SEO models).
- Work closely with the data team and online marketing practitioners to design and put models into production.
- Focus on customer clustering through ML and statistical modeling, including building predictive models and generating data products to support customer classification and segmentation.
- Develop an estimation model for various bundled product and service offerings to optimize and predict gross margin.
- Build a sales model for various bundled product and service offerings.
- Develop a predictive causal model using annual failure rate and standard cost basis for the new bundled services.
- Design and develop analytics, machine learning models, and visualizations that drive performance and provide insights, from prototyping to production deployment and product recommendation and allocation planning.
- Work with the Sales and Marketing teams, partnering and collaborating with a cross-functional team to frame and answer important data questions; prototype and experiment with ML algorithms and integrate them into production systems for different business needs.
- Apply machine learning algorithms with standalone Spark MLlib and R/Python.
- Design, build, and deploy a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for user behavior prediction and support multiple marketing segmentation programs.
- Segment customers based on demographics using K-means clustering (see the segmentation sketch after this section).
- Use classification techniques including Random Forest and Logistic Regression to quantify the likelihood of each user referring.
- Design and implement end-to-end systems for data analytics and automation, integrating custom visualization tools using R, Tableau, and Power BI.
- Work closely with a team of data scientists and analysts on requirements around propensity models and methods for growing customer lifetime value; these models are used directly for inference or as input consumed by other business partners and stakeholders across the company.
- Actively support research and implementation of advanced statistical methodologies and applied behavioral science concepts in the design and analysis of qualitative, observational and randomized controlled experiments. These experiments would be conducted in collaboration with business partners or stakeholders across the company.
- Collaborate with team of data scientists and analysts on core requirements for building machine learning models and methods focused on understanding and predicting customer behavior.
- Leverage multi-channel customer insights to deliver enhanced personalized customer experience; design interventions to optimize customer lifetime value; build bigger shopper baskets; improve product discoverability.
- Ensure operational and business metric health by monitoring production decision points; investigate adversarial trends, identify behavior patterns, and respond with agile logic changes.
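A minimal sketch of the K-means demographic segmentation mentioned in this section, assuming scikit-learn; the feature names and values are invented for illustration:

# K-means customer segmentation sketch (toy demographic matrix:
# [age, income_k, visits_per_month]; not actual customer data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40, 2], [32, 55, 4], [47, 90, 1],
              [51, 95, 1], [23, 35, 6], [38, 60, 3]])

# Standardize so no single feature dominates the distance metric.
X_std = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
print("segment labels:", kmeans.labels_)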
Confidential, New York City, NY
Data Engineer/ Data Scientist
Responsibilities:
- Worked with data scientists and the research team to gain valuable insights.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Performed data imputation using the scikit-learn package in Python (see the imputation sketch after this section).
- Explored and analyzed the customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
- Worked with Amazon EC2 based cloud-hosted architecture systems to provide solutions for client.
- Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
- Worked on different data formats such as JSON and XML, and applied machine learning algorithms in Python.
- Used the AWS environment to load data files from cloud servers.
- Collaborated with business leaders to analyze problems, optimize processes, and build presentation dashboards.
- Merged data into the AWS environment so that several teams could access the data from different locations, saving time and increasing security.
- Programmed R and Python scripts and modules for data collection, cleaning, analysis, and visualization.
- Updated legacy data systems to convert hard copies to searchable online database format.
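A minimal sketch of the scikit-learn imputation mentioned in this section; the matrix and strategy are illustrative:

# Missing-value imputation with scikit-learn's SimpleImputer (toy data).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 6.0]])

# Median imputation is often preferred over mean for skewed features.
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X))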
Confidential
Data Analyst/ Data Scientist
Responsibilities:
- Integrated data from multiple data sources and functional areas, ensured data accuracy and integrity, and updated data as needed using SQL and Python.
- Leveraged SQL, Excel, and Tableau to manipulate, analyze, and present data.
- Performed analyses of structured and unstructured data to solve multiple and/or complex business problems using advanced statistical techniques and mathematical analyses.
- Developed advanced models using multivariate regression, logistic regression, random forests, decision trees, and clustering (see the modeling sketch after this section).
- Used pandas, NumPy, Seaborn, and scikit-learn in Python to develop various machine learning algorithms.
- Built and improved models using natural language processing (NLP) and machine learning to extract insights from unstructured data.
- Worked with distributed computing technologies (Apache Spark, Hive).
- Applied predictive analysis and statistical modeling techniques to analyze customer behavior and offer customized products, reducing the delinquency rate and cutting default rates from 5% to 2%.
- Applied machine learning techniques to tap into new markets and reach new customers, and presented recommendations to top management, resulting in a 5% increase in the customer base and a 9% increase in the customer portfolio.
- Analyzed customer master data to identify prospective business, understand business needs, build client relationships, and explore cross-selling opportunities for financial products; the share of customers holding more than 6 products rose from 40% to 60%.
- Collaborated with business partners to understand their problems and goals, develop predictive modeling, statistical analysis, data reports and performance metrics.
- Participated in the ongoing design and development of a consolidated data warehouse supporting key business metrics across the organization.
- Designed, developed, and implemented data quality validation rules to inspect and monitor the health of the data.
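A minimal sketch of the classification modeling described in this section (logistic regression and random forests), assuming scikit-learn; the data is synthetic, not customer data:

# Fitting and comparing logistic regression and random forest models
# (synthetic data; hyperparameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=1)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__)
    print(classification_report(y_te, model.predict(X_te)))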