Data Scientist Resume

San Antonio, TX

SUMMARY

  • Data Scientist with 6 years of experience in Predictive analysis, Data Acquisition, Data Mining, Machine Learning, Data Visualization with various data sets.
  • Programmer with experience in programming languages and technologies like Python, R, Hadoop, Spark, Sqoop, Kafka, Pig, Hive, MapReduce, Airflow and Presto.
  • Experience in Data Visualization tools like Tableau, Power BI, and Google Analytics. Performed advanced statistical analysis like Descriptive and Predictive analysis using machine learning algorithms.
  • Good knowledge of Python libraries: NumPy, pandas, SciPy, scikit-learn, matplotlib, Seaborn, datetime.
  • Built models using various machine learning algorithms: Regression (Linear, Logistic), Classification (Decision Trees, Random Forest, Extra Trees, KNN), Clustering (K-means, Hierarchical), Naive Bayes, SVM on Structured, Unstructured and Semi-structured data.
  • Applied dimensionality reduction techniques like PCA and t-SNE, and used decision tree pruning (pre-pruning, post-pruning) to reduce model complexity.
  • Performed data clustering using the EM algorithm and, in NLP, used the Baum-Welch algorithm (an instance of EM) for hidden Markov models.
  • Used the Model-Agnostic Meta-Learning (MAML) approach to adapt models with a small number of gradient steps while avoiding over-fitting.
  • Utilized SMOTE oversampling technique in data analysis for classification problems to adjust the class distribution of the data set.
  • Detected outliers based on density using the Local Outlier Factor (LOF) to produce more stable results within clusters.
  • Experience working with R data mining packages (e.g., Rattle, RMiner), data manipulation packages such as dplyr, tidyr, stringr and lubridate, and visualization packages such as ggplot2, ggvis, rgl and htmlwidgets.
  • Manipulated complex datasets, wrote functions, and developed visualizations using the ggplot2 package in RStudio.
  • Developed SQL databases, wrote applications to interface with them, and tested the code. Designed tables, stored procedures, functions and views.
  • Extracted data from various data sources, including relational databases (MySQL), non-relational databases (MongoDB, Cassandra) and column store (Redshift).
  • Worked in Waterfall as well as Agile (Scrum) environments, using project management tools such as JIRA and version control tools such as GitHub.
  • Built various reports, lists, pages, libraries, apps, features, content types, documentation and sub-sites on the SharePoint platform.
  • Expertise in processing, tagging, and indexing unstructured, semi-structured and structured classified and unclassified data sets.
  • Worked with Jupyter Notebook and PySpark on a cloud platform (EC2 instance accessed via PuTTY); evaluated models using cross-validation, log loss and ROC curves, and used AUC for feature selection.
  • Used matplotlib to support data preprocessing (identifying missing values and outliers) and for data visualization (scatter plots, box plots, histograms).
  • Built models in R and imported the results into Tableau, which has built-in support for R via Rserve, for visualization.
  • Created reports in various output formats like HTML, PDF, CSV, Excel and expertise in formatting the reports for Excel output format.
  • Calculated specificity and sensitivity from the error (confusion) matrix to assess the performance of a classification model on test data for which the true values are known.
  • Determined which model predicts the classes best in classification analysis using AUC and ROC by plotting true positive rates against false positive rates.
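
As an illustration of the evaluation metrics named in the last two points, here is a minimal sketch in plain Python. It is not project code: the function names and the sample labels are hypothetical, labels are assumed to be binary (1 = positive, 0 = negative), and the area under the ROC curve is approximated with the trapezoidal rule.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a score >= threshold counts as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical example data.
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(roc_points(y_true, scores))
```

When comparing classifiers on the same test set, the one with the larger area is generally the better ranker of positives over negatives.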

TECHNICAL SKILLS

Languages: R, Python, MySQL, C++

Software: Maximo

Operating Systems: Microsoft Windows, macOS, Linux

Document Management: SharePoint 2013

Development Tools: Anaconda, Geany, RStudio, Jupyter, MySQL Workbench, Hive, Sqoop, HBase, Spark, ETL, HDFS, Hadoop, Pig

Server Software: MySQL, Oracle, MS Access, SQL, T-SQL

Web Technologies: HTML, CSS

Productivity Software: Microsoft Excel, Word, PowerPoint

Visualization Platforms: Tableau, Power BI, Google Analytics

Machine Learning Algorithms: Supervised: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Extra Trees, KNN, Support Vector Machine

PROFESSIONAL EXPERIENCE

Confidential, San Antonio, TX

Data Scientist

Responsibilities:

  • Perform ad-hoc exploratory statistics and data mining tasks on diverse datasets from small scale to "big data".
  • Retrieved, maintained, and standardized both internal and external data for DFAST testing, a task that is usually difficult and time-consuming.
  • Participate in data architecture and engineering decision-making to support analytics.
  • Take initiative in evaluating and adapting new approaches from data science research.
  • Investigate data visualization and summarization techniques for conveying key findings.
  • Communicate findings and obstacles to stakeholders to help drive the delivery to market.
  • Developed bottom-up stress test models in R for the bank's residential real estate loan portfolio.
  • Developed and automated the data manipulation process for the above models using stored procedures/views in SQL Server.
  • Developed code as per the client's requirements using SQL, PL/SQL and data warehousing concepts.
  • Developed, updated and manipulated the PostgreSQL database architecture.
  • Automated the scraping and cleaning of data from various data sources in R.
  • Developed the bank's loss forecasting process using relevant forecasting and regression algorithms in R.
  • Delivered an interactive dashboard in Tableau to visualize 8 billion rows (1.2 TB) of credit data.
  • Designed a scalable data cube structure for a 10x improvement in refresh rate.
  • Built credit risk scorecards and marketing response models using SQL and SAS. Presented results and recommendations to executives and managers from two large banks. Researched performance inference techniques (that reduce sample bias) using statistical and machine learning packages in R.
  • Designed and developed user interfaces and customization of Reports using Tableau and designed cubes for data visualization, mobile/web presentation with parameterization and cascading.
  • Integrated various relational and non-relational sources such as DB2, Teradata, Oracle, SFDC, Netezza, SQL Server, COBOL, XML and Flat Files.
  • Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules.
  • Used Base SAS, SAS/Macro, SAS/SQL, and Excel extensively to develop code and generate various analytical reports.
  • Created SSIS Packages using Pivot Transformation, Execute SQL Task, Data Flow Task, etc. to import data into the data warehouse.
  • Performed administrative tasks, including creation of database objects such as database, tables, and views, using SQL DCL, DDL, and DML requests.
  • Created predictive models to analyze customer purchase behavior using Python and R.
  • Solved problems by applying a broad range of statistical, mathematical, and computational approaches to data.
  • Proficient in model-building languages such as R and Python.
  • Experienced in application of various machine learning algorithms and statistical modeling techniques like decision trees, regression models, and SVM in Python.
  • Performed data cleaning, feature scaling and feature selection using the pandas, NumPy and scikit-learn packages in Python.
  • Solid understanding of AWS (Amazon Web Services) S3, EC2, RDS and IAM, Azure ML, Apache Spark, Scala process and concepts.
  • Good Knowledge of Spark internals and Performance tuning of jobs.
  • Used graphical packages to produce ROC curves that visually represent True Positive Rate versus False Positive Rate. Also produced Precision-Recall curves and reported the area under the curve.
  • Used Principal Component Analysis in feature engineering to analyze high dimensional data.
  • Investigated key business problems through quantitative analyses of utilization and costs: gathered report requirements from customers, acquired data from internal and external sources by writing and executing SQL queries, and maintained mapping documents of data files and systems.
  • Created Data Quality Scripts using SQL to validate successful data load and quality of the data.
  • Created various types of data visualizations using Python.
  • Communicated the results to the operations team to support better decisions.
  • Objectives were reducing fraudulent claims, reducing cost, improving operational efficiency, and accelerating claims processing.
  • Built intelligent decision models using decision trees, segmentation, regression and clustering to analyze customer response behaviors, interaction patterns and propensity.

Environment: Python (2 & 3), R and SQL. Regression analysis, Decision Tree and SVM. NumPy and Pandas. ROC curve.

Confidential, Atlanta, GA

Data Scientist

Responsibilities:

  • Collaborated with the business analyst on the requirements of the project and explored the data using database querying (SQL), search techniques, web services, etc.
  • Prepared data using dimensionality reduction (PCA, t-SNE) to reduce the number of features, and cleaned the data using Python libraries.
  • Applied advanced statistical techniques (Bayesian methods, sampling and experimental design) while running machine learning algorithms on heterogeneous data.
  • Used advanced analytical tools and programming languages such as Python (NumPy, pandas, SciPy) for data analysis.
  • Built and evaluated machine learning models on various types of datasets using algorithms and statistical modeling techniques such as clustering, classification, regression, decision trees, support vector machines, anomaly detection, sequential pattern discovery, and text mining from Python libraries (scikit-learn).
  • Applied post-pruning techniques to reduce the complexity of the final classifier, improving predictive performance by reducing over-fitting, using Python libraries (sklearn).
  • Applied predictive analytics and machine learning algorithms, especially supervised (SVM, Logistic Regression, Boosting, Random Forests), unsupervised (K-Means, LDA, EM) and reinforcement learning methods.
  • Improved predictive performance to 81% accuracy using ensemble methods like bootstrap aggregation (Bagging) and Boosting (AdaBoost, Gradient Boosting).
  • Built decision tree and random forest models using entropy, information gain and Gini impurity as split criteria.
  • Addressed the over-fitting problem by adding a regularization penalty to the loss function (LASSO or Ridge) and by performing cross-validation.
  • Iterated as required to improve the scalability, reliability and performance of our streaming data pipelines, which were built on top of Spark.
  • Imported data from various sources such as MySQL, Oracle and mainframe into the Hadoop Distributed File System (HDFS), transformed the data with Hadoop MapReduce, and then exported the data back into an RDBMS using Sqoop.
  • Worked with the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Spark, for managing data processing and storage for big data applications running in clustered systems.
  • Read different data formats such as API responses (JSON), XML, CSV, Rich Text Format (.rtf), Open Document Text (.odt), HTML (.htm, .html), Parquet and Avro.
  • Deployed the Spark ecosystem, including Spark SQL, Spark DataFrames, MLlib, GraphX, Spark Streaming and the Spark Core API, whose components increase productivity and can be seamlessly combined to create complex workflows.
  • Visualized graphs and reports using the matplotlib, seaborn and pandas packages in Python to identify missing values, outliers and correlations between features in datasets for analytical models.
  • Used Tableau to visualize model results by transforming data into polished, interactive dashboards.
  • Created user stories, sub-tasks and epics in JIRA for the project, and used a Kanban board to track the flow of the project through the different phases of its lifecycle.

Environment: MySQL Workbench 5.7, Python 3.6.3, Jupyter Notebook 5.5.0, Apache Spark 2.2, Tableau 10.4, Hive 2.3.0, Sqoop 1.4.6, Hadoop 2.7.1, JIRA
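
The split criteria mentioned in this role (entropy, information gain and Gini impurity) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the project code, which used library implementations; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with two balanced classes, split perfectly into two children.
parent = [0, 0, 1, 1]
gain = information_gain(parent, [0, 0], [1, 1])
```

A tree learner evaluates candidate splits with one of these measures and keeps the split that gives the largest impurity reduction.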

Confidential, Cleveland, Ohio

Machine Learning Engineer

Responsibilities:

  • Implemented and engineered reinforcement learning and machine learning algorithms that enabled analysis of massive quantities of data.
  • Worked within a diverse team following an agile methodology, preparing specific work for review at every sprint meeting.
  • Investigated and solved exciting challenges in classification, deep learning, image recognition, and content analysis.
  • Proficient with programming languages like R and Python; created a next-generation spam filtering model using the Random Forest algorithm, optimizing the hyperparameters and tuning the architecture to minimize false positives and false negatives and improve accuracy.
  • Used ensemble techniques (Bagging and Boosting) to reduce over-fitting to the training data.
  • Used hierarchical clustering to handle non-elliptical cluster shapes while accounting for time and space requirements.
  • Solved k-medoids problem using PAM and CLARANS algorithms. Clustered categorical data using ROCK by neighbor and link analysis, LIMBO and COOLCAT using information theoretic tools.
  • Used the EM algorithm to perform data clustering and, in NLP, used the Baum-Welch algorithm for Hidden Markov Models.
  • Handled clusters of different sizes and shapes using the DBSCAN algorithm, which is resistant to noise, and determined suitable Eps and MinPts parameters.
  • Addressed class imbalance in classification problems using the SMOTE oversampling technique. Used the MAML approach to adapt models with a small number of gradient steps while avoiding over-fitting.
  • Able to debug production code and deliver results on time. Architected, tested, tuned and deployed algorithms to the production platform.
  • Implemented user profile system based on behaviors of user and applied the recommendation systems algorithm based on profiles of users.
  • Performed predictive analysis using scikit-learn, HBase. Hands-on implementation and coding in Scala.
  • Used “big data” technologies including MapReduce, Kafka, HBase, Pig, Cassandra. Proficient with website analytics tools like Google Analytics.

Environment: Python 3.4.0, RStudio 0.98.493, Apache HBase, Pig v0.13.0, Classification, Clustering, Regression, NLP, Deep Learning, Reinforcement learning.
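
For readers unfamiliar with the SMOTE oversampling technique mentioned in this role, the core idea can be sketched as follows: each synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. This is a simplified illustration in plain Python, not the implementation used on the project; the function name, parameters and sample points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points at the corners of a unit square.
minority_points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_sketch(minority_points, n_new=5)
```

The synthetic points are added to the training set only, so that the test set keeps the original class distribution.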

Confidential, CA

Data Analyst

Responsibilities:

  • Designed, developed and delivered the development life cycle with detailed level of data analysis using technical tools.
  • Knowledge of data life cycle - data acquisition, data quality management, data governance, and metadata management.
  • Capable of building, articulating, and presenting new ideas to technical, non-technical, and business communities.
  • Used analytics data preparation methods, e.g. data validation, data quality assurance, data transformation.
  • Assembled large, complex data sets that meet functional or non-functional business requirements.
  • Used Python's pandas for file conversion, file counts, column analysis and formatting.
  • Designed and created multiple worksheets, analytical reports and data visualization dashboards in Tableau, per end-user requirements, to help users identify Key Performance Indicators and support strategic planning in the firm.
  • Used various statistical methods like Hypothesis Testing, Chi-Square test, Control charts, t-Test, ANOVA, Correlation Techniques, Statistical Process Control and Descriptive Statistics.
  • Used various Python libraries like seaborn, scikit-learn and SciPy to visualize and analyze the data for machine learning.
  • Developed various statistical methods, applied text analytics, created data visualizations, and built data mining solutions using R and Python.
  • Analyzed large, complex, multi-dimensional datasets using a variety of tools like R and SPSS.
  • Hands-on knowledge of the Big Data technologies - Hadoop, Hive, Impala, HBase, Sqoop, Pig, Kafka etc.
  • Created SQL schema objects such as Functions, Views, Procedures, Sequences, Record Types, Triggers and Object Types, in coordination with SQL Developer.
  • Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and AWS ‘big data’ technologies.
  • Implemented data pre-processing such as text preprocessing, data cleaning, noise removal, object standardization and lexicon normalization.
  • Performed data ingestion into Big Data platform from distinct data sources using Sqoop.
  • Coordinated with the development team to design, develop and implement solutions for BI cases to determine sales KPIs at macro and micro level.

Environment: Python, Hadoop, Hive, NumPy, Pandas, SciPy, MapReduce, Tableau, Sqoop, HBase, HDFS, SQL Server.
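
As an example of the statistical methods listed in this role, the chi-square test statistic for a contingency table can be computed as sketched below. This is a minimal illustration in plain Python, with a hypothetical function name and made-up counts; in practice a statistics library would also supply the p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 table of observed counts.
observed_counts = [[10, 20], [20, 10]]
stat = chi_square_statistic(observed_counts)
# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed_counts) - 1) * (len(observed_counts[0]) - 1)
```

The statistic is then compared against the chi-square distribution with the given degrees of freedom to decide whether to reject independence.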
