class: title-slide # The important role of spatial autocorrelation in hyperparameter tuning and predictive performance of machine-learning algorithms for spatial data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> ### Patrick Schratz<sup>1</sup>, Jannes Muenchow<sup>1</sup>, Eugenia Iturritxa<sup>2</sup>, Jakob Richter<sup>3</sup>, Alexander Brenning<sup>1</sup> <p style="margin-left:15px;"> <br>
<sup>1</sup> Department of Geography, GIScience group, University of Jena <a href="http://www.geographie.uni-jena.de/en/Geoinformatik_p_1558.html">
</a> <br>
<sup>2</sup> NEIKER, Vitoria-Gasteiz, Spain <a href="http://www.neiker.net/">
</a> <br>
<sup>3</sup> Department of Statistics, TU Dortmund <a href="https://www.statistik.tu-dortmund.de/aktuelles.html">
</a> <br><br>
<a href="https://pjs-web.de">https://pjs-web.de</a>  
<a href="https://twitter.com/pjs_228">@pjs_228</a>  
<a href="https://github.com/pat-s">@pat-s</a>  
<a href="https://stackoverflow.com/users/4185785/pat-s">@pat-s</a>   <br>
<a href="mailto:patrick.schratz@uni-jena.de">patrick.schratz@uni-jena.de</a>
<a href="https://www.linkedin.com/in/patrick-schratz/">Patrick Schratz</a>  </p> <div class="my-header"><img src="figs/life.jpg" style="width: 5%;" /></div> --- layout: true <div class="my-header"><img src="figs/life.jpg" style="width: 5%;" /></div> --- # Outline .pull-left[ .font150[ 1. Introduction 2. Data and study area 3. Methods 4. Results 5. Discussion ]] .pull-right[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Slides of my upcoming talk at LMU Munich on Jun 20th: <a href="https://t.co/SyWRky6sGn">https://t.co/SyWRky6sGn</a></p>— Patrick Schratz (@pjs_228) <a href="https://twitter.com/pjs_228/status/1008803282029088774?ref_src=twsrc%5Etfw">June 18, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] --- class: inverse, center, middle # Introduction <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Introduction #### \# Whoami - "Data Scientist/Analyst" - B.Sc. **Geography** & M.Sc. **Geoinformatics** at University of Jena - Self-taught programmer - Interested in model optimization, R package development, server administration - Arch Linux package maintainer - PhD student (since 2016) #### Contributions to `mlr` - Integrated new sampling scheme for CV: spatial sampling - Redesigned the tutorial site (`mkdocs` -> `pkgdown`) - Added getter for inner resampling indices - more to come ;) --- # Introduction .pull-left[ ### LIFE Healthy Forest
Early detection and advanced management systems to reduce forest decline caused by invasive and pathogenic agents. **Main task**: Spatial modeling and analysis to support the early detection of various pathogens. ## Pathogens
* Fusarium circinatum * **Diplodia sapinea** (
needle blight) * Armillaria root disease * Heterobasidion annosum ] .pull-right[ .center[  .font70[**Fig. 1:** Needle blight caused by **Diplodia sapinea**] ] ] --- # Introduction ## Motivation * Find the model with the **highest predictive performance**. * Results are assumed to be representative for data sets with similar predictors and different pathogens (response). * Be aware of **spatial autocorrelation**
* Analyze differences between spatial and non-spatial hyperparameter tuning (no research here yet!). * Analyze differences in performance between algorithms and sampling schemes in CV (both performance estimation and hyperparameter tuning) --- class: inverse, center, middle # Data
& Study Area
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Data
& Study Area
.code70[ ``` ## Skim summary statistics ## n obs: 926 ## n variables: 12 ## ## Variable type: factor ## ## variable missing n n_unique top_counts ## ----------- --------- ----- ---------- -------------------------------------------- ## diplo01 0 926 2 0: 703, 1: 223, NA: 0 ## lithology 0 926 5 clas: 602, chem: 143, biol: 136, surf: 32 ## soil 0 926 7 soil: 672, soil: 151, soil: 35, pron: 22 ## year 0 926 4 2009: 401, 2010: 261, 2012: 162, 2011: 102 ## ## Variable type: numeric ## ## variable missing n mean p0 p50 p100 hist ## --------------- --------- ----- ---------- ------- -------- -------- ---------- ## age 0 926 18.94 2 20 40 ▂▃▅▆▇▂▂▁ ## elevation 0 926 338.74 0.58 327.22 885.91 ▃▇▇▇▅▅▂▁ ## hail_prob 0 926 0.45 0.018 0.55 1 ▇▅▁▂▆▇▃▁ ## p_sum 0 926 234.17 124.4 224.55 496.6 ▅▆▇▂▂▁▁▁ ## ph 0 926 4.63 3.97 4.6 6.02 ▃▅▇▂▂▁▁▁ ## r_sum 0 926 -0.00004 -0.1 0.0086 0.082 ▁▂▅▃▅▇▃▂ ## slope_degrees 0 926 19.81 0.17 19.47 55.11 ▃▆▇▆▅▂▁▁ ## temp 0 926 15.13 12.59 15.23 16.8 ▁▁▃▃▆▇▅▁ ``` ] --- # Data
& Study Area
.center[  .font70[**Fig. 2:** Study area (Basque Country, Spain)] ] --- class: inverse, center, middle # Methods
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Methods
## Machine-learning models * Boosted Regression Trees (`BRT`) * Random Forest (`RF`) * Support Vector Machine (`SVM`) * k-nearest Neighbor (`KNN`) ## Parametric models * Generalized Additive Model (`GAM`) * Generalized Linear Model (`GLM`) ## Performance Measure Brier Score --- # Methods
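## Performance Measure

The Brier score is the mean squared difference between the predicted probability and the observed binary outcome (lower is better). A minimal sketch in Python — the analysis itself was done in R with `mlr`; this is purely illustrative:

```python
# Brier score for a binary response: mean((p_i - y_i)^2), lower is better.
def brier_score(y_true, y_prob):
    assert len(y_true) == len(y_prob)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Confident, mostly correct probabilities score close to zero:
print(brier_score([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.0375
```

A constant prediction of 0.5 always yields a Brier score of 0.25 — a useful no-skill baseline.
---
# Methods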
## Nested Cross-Validation * Cross-validation for **performance estimation** * Cross-validation for **hyperparameter tuning** (sequential model-based optimization, Bischl, Richter, Bossek, et al. (2017)) Different sampling strategies (Performance estimation/Tuning): * Non-Spatial/Non-Spatial * Spatial/Non-Spatial * Spatial/Spatial * Non-Spatial/No Tuning * Spatial/No Tuning --- # Methods
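## Nested Cross-Validation (sketch)

The nesting above — an outer CV for performance estimation wrapped around an inner CV for tuning — can be sketched as follows. This is a toy Python illustration, not the actual R/`mlr` implementation; the "model" is a hypothetical prevalence predictor with a single shrinkage hyperparameter:

```python
import random

def kfold(n, k, rng):
    """Randomly shuffle indices 0..n-1 and split them into k folds
    (this corresponds to non-spatial partitioning)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def brier(y, p):
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def fit_predict(train_y, shrink):
    """Toy 'model': predict the training prevalence, shrunk towards 0.5 by
    the hyperparameter `shrink` in [0, 1] (stands in for a real learner)."""
    prev = sum(train_y) / len(train_y)
    return shrink * 0.5 + (1 - shrink) * prev

def nested_cv(y, grid, k_outer=5, k_inner=5, seed=1):
    rng = random.Random(seed)
    outer_scores = []
    for test_idx in kfold(len(y), k_outer, rng):
        train_y = [y[i] for i in range(len(y)) if i not in set(test_idx)]
        inner_folds = kfold(len(train_y), k_inner, rng)

        def inner_error(shrink):
            # inner CV: evaluate a candidate using the outer-training data only
            errs = []
            for inner_test in inner_folds:
                inner_train = [train_y[i] for i in range(len(train_y))
                               if i not in set(inner_test)]
                p = fit_predict(inner_train, shrink)
                errs.append(brier([train_y[i] for i in inner_test],
                                  [p] * len(inner_test)))
            return sum(errs) / len(errs)

        best = min(grid, key=inner_error)  # hyperparameter tuning
        p = fit_predict(train_y, best)     # refit on all outer-training data
        outer_scores.append(brier([y[i] for i in test_idx],
                                  [p] * len(test_idx)))
    return sum(outer_scores) / len(outer_scores)  # performance estimate

print(nested_cv([0] * 30 + [1] * 10, grid=[0.0, 0.25, 0.5, 1.0]))
```

The key point: the outer test folds never influence tuning, so the outer score is an unbiased performance estimate for the tuned model.
---
# Methods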
## Nested (spatial) cross-validation .center[  .font70[**Fig. 3:** Nested spatial/non-spatial cross-validation]] --- # Methods
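## Spatial partitioning (sketch)

Spatial partitioning keeps nearby observations in the same fold, so test folds are spatially separated from the training data and autocorrelation cannot leak information across the split. The study uses k-means clustering of the observation coordinates; the block-based sketch below (illustrative Python only, not the actual implementation) captures the same idea:

```python
def spatial_folds(coords, k, block_size=1.0):
    """Assign points to folds by coarse spatial blocks: all points falling in
    the same block share a fold, so folds are spatially separated.
    (The talk's analysis uses k-means clustering of coordinates instead.)"""
    blocks = {}
    for i, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks.setdefault(key, []).append(i)
    folds = [[] for _ in range(k)]
    # hand out whole blocks round-robin, largest first, to balance fold sizes
    for j, members in enumerate(sorted(blocks.values(), key=len, reverse=True)):
        folds[j % k].extend(members)
    return folds

coords = [(0.1, 0.2), (0.3, 0.4), (5.1, 5.2), (5.3, 5.0), (9.9, 0.1)]
print(spatial_folds(coords, k=3))  # [[0, 1], [2, 3], [4]]
```

Neighboring points (indices 0/1 and 2/3) always land in the same fold — exactly what a random, non-spatial split would not guarantee.
---
# Methods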
## Nested (spatial) cross-validation <br> .center[  .font70[**Fig. 4:** Comparison of spatial and non-spatial partitioning of the data set.] ] --- # Methods
#### Hyperparameter tuning search spaces RF : Probst, Wright, and Boulesteix (2018) BRT, SVM, KNN: Self-defined limits based on evaluation of estimated hyperparameters .center[  .font70[**Table 1:** Hyperparameter limits and types of each model. Notations of hyperparameters from the respective R packages were used. `\(p\)` = Number of variables.] ] --- class: inverse, center, middle # Results
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Results
## Hyperparameter tuning .center[  ] .font70[**Fig 4:** SMBO optimization paths of the first five folds of the **spatial/spatial** and **spatial/non-spatial** CV setting for RF. The dashed line marks the border between the initial design (30 randomly composed hyperparameter settings) and the sequential optimization part in which each setting was proposed using information from the prior evaluated settings. ] --- # Results
## Hyperparameter tuning .center[  ] .font70[**Fig 5:** Best hyperparameter settings by fold (500 total) each estimated from 100 (30/70) SMBO tuning iterations per fold using five-fold cross-validation. Split by spatial and non-spatial partitioning setup and model type. Red crosses indicate the default hyperparameters of the respective model. Black dots represent the winning hyperparameter setting of each fold. The labels ranging from one to five show the winning hyperparameter settings of the first five folds. ] --- # Results
## Predictive Performance .center[  ] .font70[**Fig. 7:** (Nested) CV estimates of model performance at the repetition level using 100 SMBO iterations for hyperparameter tuning. CV setting refers to performance estimation/hyperparameter tuning of the respective (nested) CV, e.g. "Spatial/Non-Spatial" means that spatial partitioning was used for performance estimation and non-spatial partitioning for hyperparameter tuning. ] --- class: inverse, center, middle # Discussion
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Discussion
## Predictive performance * `RF` showed the best predictive performance
-- * High bias in performance when using non-spatial CV --- # Discussion
.center[  ] .font70[**Fig 6:** (Nested) CV estimates of model performance at the repetition level using 100 SMBO iterations for hyperparameter tuning. CV setting refers to performance estimation/hyperparameter tuning of the respective (nested) CV, e.g. "Spatial/Non-Spatial" means that spatial partitioning was used for performance estimation and non-spatial partitioning for hyperparameter tuning. ] --- # Discussion
## Predictive Performance * `RF` showed the best predictive performance
* High bias in performance when using non-spatial CV -- * The `GLM` performs on par with `BRT`, `KNN` and `SVM` -- * The `GAM` suffers from overfitting --- # Discussion
## Hyperparameter tuning * Almost no effect on predictive performance -- * Differences between algorithms are larger than the effect of hyperparameter tuning -- * Spatial hyperparameter tuning has no substantial effect on predictive performance compared to non-spatial tuning -- * Optimal parameters estimated from spatial hyperparameter tuning show a wide spread across the search space --- # Discussion
## Tuning .center[  ] .font70[**Fig. 6:** Best hyperparameter settings by fold (500 total), each estimated from 100 (30/70) SMBO tuning iterations per fold using five-fold cross-validation. Split by spatial and non-spatial partitioning setup and model type. Red crosses indicate the default hyperparameters of the respective model. Black dots represent the winning hyperparameter setting of each fold. The labels ranging from one to five show the winning hyperparameter settings of the first five folds. ] --- # Discussion
## Hyperparameter tuning * Almost no effect on predictive performance. * Differences between algorithms are larger than the effect of hyperparameter tuning. * Spatial hyperparameter tuning has no substantial effect on predictive performance compared to non-spatial tuning. * Optimal parameters estimated from spatial hyperparameter tuning show a wide spread across the search space.
Nevertheless, spatial hyperparameter tuning should be used for spatial data sets to keep the resampling scheme consistent.
--- # References
Bischl, B., J. Richter, J. Bossek, et al. (2017). "mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions". In: _ArXiv e-prints_. arXiv: [1703.03373 [stat]](https://arxiv.org/abs/1703.03373). Probst, P., M. Wright, and A. Boulesteix (2018). "Hyperparameters and Tuning Strategies for Random Forest". In: _ArXiv e-prints_. arXiv: [1804.03515 [stat.ML]](https://arxiv.org/abs/1804.03515). <br> .center[ ## Thanks for listening! Questions? Slides can be found here: https://t.co/SyWRky6sGn And now, let's have a
;) ] --- class: inverse, center, middle # Backup
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Backup
.center[  ]