
Data Scientist Resume


Los Angeles, CA

PROFESSIONAL SUMMARY

  • Data science professional with 7 years of progressive experience in data analytics, statistical modeling, visualization, and machine learning.
  • Strong collaborator with a record of quick learning and adaptation.
  • Experience in data mining on large structured and unstructured datasets, covering data acquisition, data validation, predictive modeling, and data visualization.
  • Experience in data integration, profiling, validation, cleansing, transformation, and visualization using R and Python.
  • Theoretical foundations and practical hands-on projects related to (i) supervised learning (linear and logistic regression, boosted decision trees, Support Vector Machines, neural networks, NLP), (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems), (iii) probability & statistics, experiment analysis, confidence intervals, A/B testing, and (iv) algorithms and data structures.
  • Extensive knowledge of Azure Data Lake and Azure Storage.
  • Experience in migrating data from heterogeneous sources, including Oracle, to MS SQL Server.
  • Hands-on experience in the design, management, and visualization of databases using Oracle, MySQL, and SQL Server.
  • In-depth knowledge of and hands-on experience with the Big Data / Hadoop ecosystem (MapReduce, HDFS, Hive, Pig, and Sqoop).
  • Experience with Apache Spark and Kafka for big data processing and functional programming in Scala.
  • Experience in manipulating large datasets with R packages such as tidyr, tidyverse, dplyr, reshape, lubridate, and caret, and visualizing data using the lattice and ggplot2 packages.
  • Experience in dimensionality reduction using techniques like PCA and LDA (a minimal PCA sketch follows this summary).
  • Completed an intensive hands-on data analytics boot camp spanning statistics and programming, including data engineering, data visualization, machine learning, and programming in R and SQL.
  • Experience in data analytics and predictive analysis, including classification, regression, and recommender systems.
  • Good exposure to factor analysis, bagging, and boosting algorithms.
  • Experience with descriptive analysis problems such as frequent pattern mining, clustering, and outlier detection.
  • Worked on machine learning algorithms for classification and regression, including KNN, decision trees, Naïve Bayes, logistic regression, SVM, and latent factor models.
  • Hands-on experience with Python and libraries such as NumPy, Pandas, Matplotlib, Seaborn, NLTK, scikit-learn, and SciPy.
  • Expertise in TensorFlow for machine learning and deep learning in Python.
  • Good knowledge of Microsoft Azure SQL, Azure Machine Learning, and HDInsight.
  • Good exposure to SAS analytics.
  • Good knowledge of natural language processing (NLP) and of time series analysis and forecasting using ARIMA models in Python and R.
  • Good knowledge of Tableau and Power BI for interactive data visualization.
  • In-depth understanding of NoSQL databases such as MongoDB and HBase.
  • Strong experience provisioning virtual clusters on AWS using services such as EC2, S3, and EMR.
  • Experience developing software in Java and C++, with a grounding in data structures and algorithms.
  • Good exposure to creating pivot tables and charts in Excel.
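
A minimal sketch of the dimensionality-reduction workflow referenced above, using scikit-learn's PCA; the feature matrix and component count here are made-up placeholders, not project data:

    # Illustrative PCA: reduce a standardized feature matrix to its top components.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 20))                 # hypothetical feature matrix

    X_scaled = StandardScaler().fit_transform(X)   # scale features before PCA
    pca = PCA(n_components=5)                      # keep the top 5 components
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                         # (500, 5)
    print(pca.explained_variance_ratio_.sum())     # variance retained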

SKILLS SECTION

Languages: Java 8, Python, R

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, C50, twitter, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2.

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, and Azure Data Warehouse

Operating Systems: Windows, Linux, Unix, macOS, Red Hat

PROJECT EXPERIENCE

Confidential - Los Angeles, CA

Data Scientist

Responsibilities:

  • Designed and implemented cross-validation and statistical tests, including hypothesis testing, ANOVA, and autocorrelation, to verify model significance (a cross-validation sketch follows this list).
  • Designed an A/B experiment to test the business performance of the new recommendation system (see the two-proportion z-test sketch after this list).
  • Supported MapReduce Programs running on the cluster.
  • Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
  • Configured the Hadoop cluster with the NameNode and DataNodes and formatted HDFS.
  • Used Oozie workflow engine to run multiple Hive and Pig jobs.
  • Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Hadoop MapReduce and HDFS.
  • Performed data enrichment jobs in HiveQL to handle missing values, normalize data, and select features.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for analysis across different banners.
  • Extracted data from Twitter using Java and the Twitter API; parsed JSON-formatted Twitter data and loaded it into the database.
  • Launched Amazon EC2 cloud instances from Amazon Machine Images (Linux/Ubuntu) and configured the launched instances for specific applications.
  • Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Created HBase tables to store data in various formats coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.
  • Created reports and dashboards using D3.js and Tableau 9.x to explain and communicate data insights, significant features, model scores, and the performance of the new recommendation system to both technical and business teams.
  • Used SQL, Excel, and several marketing/web analytics tools (Google Analytics, AdWords) to complete business and marketing analysis and assessment.
  • Used Git 2.x for version control with Data Engineer team and Data Scientists colleagues.
  • Used Agile methodology and the SCRUM process for project development.
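
A minimal sketch of the cross-validation step mentioned in the first bullet; the classifier choice and the synthetic dataset are assumptions for illustration:

    # Illustrative 5-fold cross-validation with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data
    clf = RandomForestClassifier(n_estimators=200, random_state=0)

    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")  # AUC per fold
    print(scores.mean(), scores.std())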
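
A sketch of how the A/B experiment could be read out as a two-proportion z-test; the conversion counts below are made-up numbers, not results from the project:

    # Two-proportion z-test for an A/B experiment (illustrative numbers only).
    import numpy as np
    from scipy import stats

    conv_a, n_a = 420, 10000   # control: conversions / visitors (hypothetical)
    conv_b, n_b = 480, 10000   # treatment

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                   # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))                        # two-sided p-value

    print("z =", round(z, 2), "p =", round(p_value, 4))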

Environment: HDFS, Hive, Sqoop, Pig, Oozie, Amazon Web Services (AWS), Python 3.x (SciPy, Scikit-Learn), Tableau 9.x, D3.js, SVM, Random Forests, Naïve Bayes Classifier, A/B experiment, Git 2.x, Agile/SCRUM.

Confidential - Dallas, TX

Sr. Data Scientist

Responsibilities:

  • Performed data profiling to understand behavior across features such as traffic pattern, location, date, and time.
  • Extracted data from Hive tables by writing efficient Hive queries.
  • Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.
  • Analyzed data and performed data preparation by applying a historical model to the dataset in Azure ML.
  • Applied various machine learning algorithms and statistical models, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume, using the scikit-learn package in Python as well as MATLAB.
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
  • Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
  • Conducted a hybrid of hierarchical and K-means cluster analysis using IBM SPSS and identified meaningful customer segments through a discovery approach (a K-means sketch follows this list).
  • Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive environment on Linux and Windows for big data resources. Used the K-means clustering technique to identify outliers and classify unlabeled data.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana (an evaluation sketch follows this list).
  • Worked with the NLTK library for NLP data processing and pattern finding.
  • Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
  • Analyzed traffic patterns by calculating autocorrelation with different time lags.
  • Ensured the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
  • Addressed overfitting by applying L1 and L2 regularization (a brief sketch follows this list).
  • Used principal component analysis in feature engineering to analyze high-dimensional data.
  • Created and designed reports that used the gathered metrics to draw logical conclusions about past and future behavior.
  • Performed multinomial logistic regression, random forest, decision tree, and SVM modeling to classify whether a package would be delivered on time on the new route.
  • Implemented different models, such as logistic regression, random forest, and gradient-boosted trees, to predict whether a given die would pass or fail the test.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Developed a MapReduce pipeline for feature extraction using Hive and Pig.
  • Created data quality scripts using SQL and Hive to validate successful data loads and data quality. Created various types of data visualizations using Python and Tableau.
  • Communicated results to the operations team to support decision-making.
  • Collected data needs and requirements by interacting with other departments.
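
A minimal K-means sketch in the spirit of the segmentation and outlier-flagging bullets above; the customer features, cluster count, and outlier threshold are all assumptions:

    # K-means segmentation with a simple distance-based outlier flag (illustrative).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = np.random.RandomState(0).normal(size=(1000, 6))   # hypothetical customer features

    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

    # Distance of each point to its assigned centroid; flag the farthest 1% as outliers.
    dists = np.linalg.norm(X_scaled - km.cluster_centers_[km.labels_], axis=1)
    outliers = dists > np.quantile(dists, 0.99)
    print(km.labels_[:10], int(outliers.sum()))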
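
A sketch of the model-evaluation step (log loss and ROC/AUC) on stand-in data with scikit-learn; the model and dataset are placeholders:

    # Illustrative evaluation: log loss, ROC AUC, and ROC curve points.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss, roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=15, random_state=1)  # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]

    print("log loss:", log_loss(y_te, probs))
    print("ROC AUC :", roc_auc_score(y_te, probs))
    fpr, tpr, _ = roc_curve(y_te, probs)   # points for plotting the ROC curve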
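
A brief sketch of comparing L1 and L2 regularization to curb overfitting, using scikit-learn logistic regression; the data and regularization strength are assumptions:

    # Comparing L1 vs. L2 regularization on noisy stand-in data (illustrative).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                               random_state=2)  # stand-in data with many noisy features

    for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
        clf = LogisticRegression(penalty=penalty, C=0.5, solver=solver, max_iter=2000)
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(penalty, round(score, 3))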

Environment: Python 2.x, R, HDFS, Hadoop 2.3, Hive, Linux, Spark, IBM SPSS, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, Denver

Data Scientist

Responsibilities:

  • Developed data mapping, data governance, transformation, and cleansing rules for the Master Data Management (MDM) architecture involving OLTP, ODS, and OLAP.
  • Provided source-to-target mappings to the ETL team to perform initial, full, and incremental loads into the target data mart.
  • Conducted JAD sessions, wrote meeting minutes, collected requirements from business users, and analyzed them.
  • Involved in defining source-to-target data mappings, business rules, and data definitions.
  • Performed transformations on files received from clients and consumed by SQL Server.
  • Worked closely with the ETL, SSIS, and SSRS developers to explain complex data transformation logic.
  • Worked on DTS packages and DTS Import/Export for transferring data between SQL Server 2000 and 2005.
  • Performed data profiling, cleansing, integration, and extraction using dedicated tools.
  • Defined the list codes and code conversions between the source systems and the data mart using Reference Data Management (RDM).
  • Applied data cleansing/data scrubbing techniques to ensure consistency among datasets.
  • Extensively used ETL methodology to support data extraction, transformation, and loading in a complex EDW.

Environment: MS Excel, Agile, Oracle 11g, SQL Server, SOA, SSIS, SSRS, ETL, UNIX, T-SQL, HP Quality Center 11, RDM (Reference Data Management).

Confidential, Hoboken, NJ

Data Scientist

Responsibilities:

  • Involved in the complete Software Development Life Cycle (SDLC) process, analyzing business requirements and understanding the functional workflow of information from source systems to destination systems.
  • Completed a highly immersive data science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Unix commands, NoSQL, and Hadoop.
  • Used pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms.
  • Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
  • Analyzed sentiment data and detected trends in customer usage and other services.
  • Analyzed and prepared data, identifying patterns in the dataset by applying historical models.
  • Collaborated with senior data scientists to understand the data.
  • Used Python and R scripting to implement machine learning algorithms, visualize data, and forecast results.
  • Developed packages in R with a Shiny interface.
  • Used predictive analysis to create models of customer behavior correlated with historical data and used these models to forecast future results.
  • Predicted user preference based on segmentation using General Additive Models, combined with feature clustering, to understand non-linear patterns between user segmentation and related monthly platform usage features (time series data). Performed data manipulation, data preparation, normalization, and predictive modeling.
  • Improved model efficiency and accuracy by evaluating and refining models in Python and R.
  • Applied various machine learning algorithms and statistical models, such as decision trees, random forests, regression models, neural networks, SVM, and clustering, to identify volume using the scikit-learn package.
  • Performed data cleaning, applying backward/forward filling methods to handle missing values in the dataset (a pandas sketch follows this list).
  • Developed a predictive model and validated a neural network classification model to predict the target label.
  • Applied boosting to the predictive model to improve its efficiency (see the boosting sketch after this list).
  • Presented dashboards to higher management for further insights using Power BI and Tableau.
  • Gained hands-on experience with Hive, Hadoop, HDFS, and related big data topics.
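
A minimal pandas sketch of the backward/forward-filling approach to missing values mentioned above; the frame contents are made up:

    # Forward fill then backward fill to impute gaps in a small frame (illustrative).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "usage":  [10.0, np.nan, np.nan, 14.0, 15.0],
        "visits": [np.nan, 3, 4, np.nan, 6],
    })

    filled = df.ffill().bfill()   # forward fill first, then backward fill leading gaps
    print(filled)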
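
A sketch of the boosting step, using scikit-learn's gradient boosting on stand-in data; the hyperparameters are illustrative, not the project's:

    # Illustrative boosting with GradientBoostingClassifier.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1500, n_features=20, random_state=3)  # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)

    gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                     max_depth=3, random_state=3).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, gbm.predict(X_te)))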

Environment: R/RStudio, Python, Tableau, Hadoop, Hive, MS SQL Server, MS Access, MS Excel, Outlook, Power BI.

Confidential

Data Reporting Analyst

Responsibilities:

  • Designed and implemented an internal reporting tool named I-CUBE using Python to automate sales and financial operational reporting, made accessible to leaders globally through a built-in SharePoint site. Used an API to extract sales data from I-CUBE on an hourly basis (a simplified sketch follows this list).
  • Built and customized interactive reports on forecast, target, and actuals data using BI/ETL tools such as SAS, SSAS, and SSIS in the CRM, which cut manual effort by 8%.
  • Conducted operational analyses for business worth $3M, working through all phases such as requirements gathering, use case development, data mapping, and workflow diagram creation.
  • Performed data cleansing and analysis using pivot tables, VLOOKUPs, data validation, graphs, and chart manipulation in Excel.
  • Designed complex SQL queries, views, stored procedures, functions, and triggers to handle database manipulation and performance.
  • Used SQL and PL/SQL scripts to automate repeatable tasks for customer feedback survey data collection and distribution, which increased departmental efficiency by 8%.
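
A heavily simplified sketch of the kind of scheduled extraction described in the first bullet; the endpoint, authentication scheme, and payload shape are purely hypothetical, not the real I-CUBE API:

    # Hypothetical hourly sales-data pull (endpoint and fields are illustrative only).
    import pandas as pd
    import requests

    API_URL = "https://example.internal/api/sales"   # placeholder, not the real endpoint

    def pull_hourly_sales(token):
        """Fetch the latest hourly sales records and return them as a DataFrame."""
        resp = requests.get(
            API_URL,
            headers={"Authorization": "Bearer " + token},
            params={"granularity": "hourly"},
            timeout=30,
        )
        resp.raise_for_status()
        return pd.DataFrame(resp.json())             # assumes a list-of-records payload

    if __name__ == "__main__":
        print(pull_hourly_sales(token="...").head())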
