
Data Scientist Resume


Bentonville, AR

SUMMARY:

  • Research and data scientist specializing in machine learning, visualization, predictive modeling, pattern detection, and algorithmic development.
  • Management and consulting experience in projects requiring data mining and pattern detection from multiple, massive, and disparate sources.
  • As a program developer, designed and implemented automated processing pipelines for image, geographic, and statistical analysis.
  • Experienced system administrator of high-performance computers, including both Linux and Microsoft workstations and servers.

TECHNICAL SKILLS:

Programming Languages and Libraries: Python, NumPy/SciPy, Matplotlib, Pandas, scikit-learn, lxml, NLTK (Python natural language toolkit), Beautiful Soup, Bash

Scientific: R Statistical package, Octave, Zeppelin, Spark, See5, RandomForest, Tesseract, Weka, Vegetation Dynamics Development Tool

System Administration: Linux/Unix, Hadoop/Hive, VMware, Windows (2003 and 2008 R2)

DBMS: MySQL, PostgreSQL, Microsoft SQL Server

Networking: routers, switches, monitoring, Sendmail

PROFESSIONAL EXPERIENCE:

Confidential, Bentonville, AR

Data Scientist

Responsibilities:

  • Developed Zeppelin and Spark applications within Hadoop environment to optimize and scale machine learning tasks in the retail domain
  • Worked within an Agile development framework to document work in tickets and stories
  • Documented advanced procedures on Confluence webpage and developed training videos for knowledge dissemination
  • Maintained code base with a Git software control system
  • Developed and implemented automated system of data analysis and visualization of RFID tracking with Python to facilitate inventory tracking
  • Managed offshore team as workstream leader, organizing work tasks and presenting results to senior leadership
  • Applied Markov models and time-series analysis of clickstream data in R and Python to evaluate website usage and design effectiveness
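
A first-order Markov model of clickstream data can be sketched as below; this is a minimal illustration with hypothetical page names and sessions, not the actual production analysis:

```python
from collections import defaultdict

def transition_matrix(sessions):
    """Estimate first-order Markov transition probabilities
    from clickstream sessions (each session is a list of page names)."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for page, nexts in counts.items():
        total = sum(nexts.values())
        probs[page] = {nxt: n / total for nxt, n in nexts.items()}
    return probs

# Hypothetical visitor sessions
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "product"],
]
probs = transition_matrix(sessions)
print(probs["home"])  # probability of each next page after "home"
```

Inspecting the transition probabilities (e.g. how often "search" leads to "product" versus an exit) is one way to quantify whether a page design routes users where intended.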

Confidential, San Francisco, CA

Data Scientist

Responsibilities:

  • Implemented multiple time-series econometric models to meet Federal Reserve DFAST and CCAR stress test requirements under a variety of macroeconomic scenarios for Credit Risk Stress Analysis
  • Streamlined production code in R and Python integrated with SAS database
  • Optimized statistical techniques such as logistic regression, fractional regression, and state-transition models to enable automation of credit loss forecasting
  • Reconciled disparate information from multiple databases to provide consistent and reliable results
  • Developed within an agile/pair programming working environment
  • Implemented test suite to provide validation of results between code set development cycles
  • Documented work progress using the Trac project management system integrated with Subversion and Git

Confidential, San Ramon, CA

Data Scientist

Responsibilities:

  • Developed automated scripts in an Agile work environment using Python integrated with Tesseract, OpenCV, and R to convert engineering diagrams and other documents into usable and searchable text
  • Implemented fuzzy text search with a modified Levenshtein distance rule to improve text extraction
  • Designed and implemented protocol to extract specific keywords from supplier form PDF files using associative text and other rules and formatted results into structured tables
  • Integrated extracted data into structured tables for input into Tableau and other databases, enabling further analysis with machine learning
  • Designed statistical accuracy assessment protocols providing quality control
  • Implemented program logging to document and debug OCR and data extraction process
  • Developed configuration file setup to provide fine-grained program control
  • Documented work progress using the Trac project management system integrated with a Subversion server
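
A fuzzy keyword search of the kind described above can be sketched with a standard Levenshtein dynamic program; the configurable substitution cost is one simple way to tolerate common OCR confusions (the keyword and tokens below are hypothetical):

```python
def levenshtein(a, b, sub_cost=1):
    """Edit distance between strings a and b with a configurable
    substitution cost (OCR confusions could be given a lower cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else sub_cost
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(keyword, tokens, max_dist=2):
    """Return OCR tokens within max_dist edits of the keyword."""
    return [t for t in tokens
            if levenshtein(keyword.lower(), t.lower()) <= max_dist]

print(fuzzy_find("serial", ["Serlal", "number", "ser1al"]))  # → ['Serlal', 'ser1al']
```

Matching within a small edit distance recovers keywords that exact search misses when Tesseract misreads a character or two.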

Confidential, Missoula, MT

Data Scientist

Responsibilities:

  • Integrated automated, multi-process Linux-based Python programs into the eDiscovery process to forensically analyze and extract metadata and contents from a wide variety of legal documents
  • Extracted and indexed text based information from processed legal documents using data mining and Python natural language toolkit to classify as confidential, privileged, responsive or needing redaction
  • Produced internal tracking database of all processing steps to document chain of custody of data from original source to storage in SQL relational database. Evaluated software such as Tableau.
  • Augmented automated OCR Tesseract operations with image enhancement routines to make available previously unusable document scans as evidence
  • Administered office network, Linux servers, VMware virtual machines, Microsoft Domain and SQL servers
  • Managed internal Web-based project management and software revision control system
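
A text-classification step like the one above can be sketched with a bag-of-words pipeline; this toy version uses scikit-learn rather than NLTK for brevity, and the labeled snippets and category names are hypothetical stand-ins for a real review corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training snippets; a production corpus is far larger
docs = [
    ("Attorney-client communication regarding settlement", "privileged"),
    ("Draft legal advice memo from counsel", "privileged"),
    ("Quarterly sales figures for the region", "responsive"),
    ("Invoice and purchase order records", "responsive"),
    ("Employee social security numbers on file", "redact"),
    ("Customer credit card details attached", "redact"),
]
texts, labels = zip(*docs)

# TF-IDF features feeding a multinomial Naive Bayes classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["Memo of legal advice from outside counsel"]))
```

In practice such a classifier only triages documents; anything near a decision boundary still goes to human review.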

Confidential

Research/Data Scientist

Responsibilities:

  • Developed Succession Class (sclass) Mapping Tool using ArcGIS Python libraries
  • Designed mapping tool to evaluate and reconcile relationships among multiple data layers depicting historic, current, and potential vegetation characteristics
  • Applied categorical rules from state and transition models, and produced sclass map
  • Designed output tables and visualizations based on statistical measures and machine learning designed to evaluate results, pinpoint potential data discrepancies and guide rule modification
  • Using Linux high performance computer and Cray cluster, improved process efficiency and streamlined data flow for FIREHARM (fire hazard and risk mapping program) layer preparation across CONUS (continental U.S.)
  • Implemented Python process using API’s from ArcGIS and open source GDAL to reduce layer preparation time by more than 97%
  • Integrated the gridded 18-year Daymet climate record into the FIREHARM production environment maintained on a Linux server
  • Used Python-based Trac project management to share methods, source code, and data with fellow researchers
  • Designed and maintained data processing pipeline and databases for Biophysical and Fire Regime Products within LANDFIRE 2001/2008 Refresh effort
  • Established database and Python based geoprocessing system linking multiple primary and secondary layers across CONUS and Alaska to ensure smooth, consistent and efficient data and layer production
  • Designed machine learning geostatistical and text-based filtering techniques in R and Python to impute species-based plots from SSURGO soil data and ecotype descriptions to improve spatial models depicting landscape patterns
  • Administered a Web-based Subversion, issue-tracking, and wiki system (Trac) to enable cloud-based collaboration among a dispersed working team
  • Trained users for production runs of automated processes
  • Developed machine learning based algorithms in Python to extract scale appropriate data based on bootstrap and nonparametric statistical analysis
  • Integrated object-based (raster polygon) and pixel-based classification methods into a forest height classification algorithm using FIA data, implemented through Python and GDAL with R randomForest, See5 (CART), and ArcGIS
  • Developed and implemented geostatistical filtering procedures using multi-scale/multi-temporal imagery to evaluate and reconcile disparate data and provide relevant classification scale data
  • Collaborated with field scientists to identify and resolve local issues of vegetation dynamics
  • Processed multiple zones with Classification and Regression Tree (CART - See5, Random Forest) to model Environmental Site Potential (ESP) and Biophysical Settings (BpS) with biophysical gradients and ground truth extracted from the LANDFIRE Reference Database (LFRDB)
  • Improved efficiency and reproducibility of automated procedures for QA/QC of plot data used for CART based mapping of landscape features
  • Developed rulesets assigning LANDFIRE BpS map units to LFRDB records based on statistical analysis and visualization of floristic composition
  • Trained fellow researchers in use of analysis programs
  • Using Landsat TM data and other ancillary data, designed and implemented automated procedures for modeling land cover and landscape patterns
  • Designed a common vegetation classification scheme derived from multiple and often conflicting existing thematic vegetation layers using probabilistic statistics measures of association
  • Developed statistical methods to evaluate both plot data and predictors resulting in significantly increased cross-validation accuracy within regression tree analyses
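
A random-forest classification of plots from biophysical gradients, as in the CART/Random Forest work above, can be sketched as follows; the predictors, thresholds, and class labels are synthetic illustrations, not LANDFIRE data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical biophysical gradients per plot
n = 300
X = np.column_stack([
    rng.uniform(500, 3000, n),   # elevation (m)
    rng.uniform(200, 1200, n),   # annual precipitation (mm)
    rng.uniform(-2, 15, n),      # mean annual temperature (C)
])

# Synthetic vegetation-class labels driven by elevation bands
y = np.digitize(X[:, 0], [1200, 2200])  # 0=valley, 1=montane, 2=subalpine

rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Cross-validation here plays the same role as in the plot-data QA/QC above: it flags predictor sets whose accuracy collapses out of sample before a full map run is committed.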

Confidential, Missoula, MT

Research Data Scientist

Responsibilities:

  • Developed innovative automated algorithm for modeling multi-scale landscape features within a raster polygon database created from multispectral satellite data.
  • Evaluated methods and results of polygon delineations and classifications derived from in-house image segmentation methods with those performed through eCognition using IKONOS and TM imagery
  • Developed techniques using bootstrap statistical techniques with R enabling the analysis and visualization of spatial accuracy for classified thematic digital land cover maps
  • Developed method to integrate coarse scale Confidential data with fine scale TM data to develop regression equations in R predicting seasonal variation of forest stand leaf area index and productivity
  • Researched and improved procedures for classifying and mapping fire severity, now implemented on a national basis through the MTBS program by EROS and USFS/RSAC
  • Developed automated algorithm for spectral matching to create a Landsat Thematic Mapper Mosaic of 36 scenes encompassing the State of Montana
  • Managed and maintained geospatial databases for imagery, land cover, digital line data, digital elevation, and plot data models for spatial analysis of large geographic areas under contract with the USFS
  • Designed and maintained database of ground-truth data of over 100,000 locations providing for easy retrieval, analysis and use
  • Developed and implemented distributed computing system within AIX/Linux environment using laboratory computers to efficiently parse demanding and routine tasks
  • Provided system administration for multiple workstations and maintained local area network for increased reliability and efficiency for researchers.
  • Recommended and prepared technical reports, research manuscripts, and contributed to grant proposals, with research teams
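
The bootstrap accuracy assessment mentioned above can be sketched as a percentile confidence interval on per-pixel agreement; the agreement rate and sample size here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-pixel agreement between a classified map and ground truth
agreement = rng.random(1000) < 0.85  # ~85% of pixels correctly classified

def bootstrap_ci(sample, n_boot=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the mean."""
    n = len(sample)
    means = np.array([
        sample[rng.integers(0, n, n)].mean() for _ in range(n_boot)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(agreement)
print(f"accuracy {agreement.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling pixels (or, better, spatial blocks of pixels to respect autocorrelation) turns a single accuracy number into an interval a map user can act on.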
