class: title-slide # The important role of spatial autocorrelation in hyperparameter tuning and predictive performance of machine-learning algorithms for spatial data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> ### Patrick Schratz<sup>1</sup>, Jannes Muenchow<sup>1</sup>, Eugenia Iturritxa<sup>2</sup>, Jakob Richter<sup>3</sup>, Alexander Brenning<sup>1</sup> <p style="margin-left:15px;"> <br>
<sup>1</sup> Department of Geography, GIScience group, University of Jena <a href="http://www.geographie.uni-jena.de/en/Geoinformatik_p_1558.html">
</a> <br>
<sup>2</sup> NEIKER, Vitoria-Gasteiz, Spain <a href="http://www.neiker.net/">
</a> <br>
<sup>3</sup> Department of Statistics, TU Dortmund <a href="https://www.statistik.tu-dortmund.de/aktuelles.html">
</a> <br><br>
<a href="https://pjs-web.de">https://pjs-web.de</a>  
<a href="https://twitter.com/pjs_228">@pjs_228</a>  
<a href="https://github.com/pat-s">@pat-s</a>  
<a href="https://stackoverflow.com/users/4185785/pat-s">@pat-s</a>   <br>
<a href="mailto:patrick.schratz@uni-jena.de">patrick.schratz@uni-jena.de</a>
<a href="https://www.linkedin.com/in/patrick-schratz/">Patrick Schratz</a>  </p> <div class="my-header"><img src="figs/life.jpg" style="width: 5%;" /></div> --- layout: true <div class="my-header"><img src="figs/life.jpg" style="width: 5%;" /></div> --- # Outline .pull-left[ .font150[ 1. Introduction 2. Data and study area 3. Methods 4. Results 5. Discussion ]] .pull-right[ <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Slides of my upcoming talk at LMU Munich on Jun 20th: <a href="https://t.co/SyWRky6sGn">https://t.co/SyWRky6sGn</a></p>— Patrick Schratz (@pjs_228) <a href="https://twitter.com/pjs_228/status/1008803282029088774?ref_src=twsrc%5Etfw">June 18, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] --- class: inverse, center, middle # Introduction <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Introduction #### \# Whoami - "Data Scientist/Analyst" - B.Sc. **Geography** & M.Sc. **Geoinformatics** at University of Jena - Self-taught programmer - Interested in model optimization, R package development, server administration - Arch Linux package maintainer - PhD student (since 2016) #### Contributions to `mlr` - Integrated new sampling scheme for CV: spatial sampling - Redesigned the tutorial site (`mkdocs` -> `pkgdown`) - Added getter for inner resampling indices - more to come ;) --- # Introduction .pull-left[ ### LIFE Healthy Forest
Early detection and advanced management systems to reduce forest decline caused by invasive and pathogenic agents. **Main task**: Spatial modeling and analysis to support the early detection of various pathogens. ## Pathogens
* Fusarium circinatum * **Diplodia sapinea** (
needle blight) * Armillaria root disease * Heterobasidion annosum ] .pull-right[ .center[  .font70[**Fig. 1:** Needle blight caused by **Diplodia sapinea**] ] ] --- # Introduction ## Motivation * Find the model with the **highest predictive performance**. * Results are assumed to be representative for data sets with similar predictors and different pathogens (response). * Be aware of **spatial autocorrelation**
* Analyze differences between spatial and non-spatial hyperparameter tuning (no research here yet!). * Analyze differences in performance between algorithms and sampling schemes in CV (both performance estimation and hyperparameter tuning) --- class: inverse, center, middle # Data
& Study Area
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Data
& Study Area
.code70[ ``` ## Skim summary statistics ## n obs: 926 ## n variables: 12 ## ## Variable type: factor ## ## variable missing n n_unique top_counts ## ----------- --------- ----- ---------- -------------------------------------------- ## diplo01 0 926 2 0: 703, 1: 223, NA: 0 ## lithology 0 926 5 clas: 602, chem: 143, biol: 136, surf: 32 ## soil 0 926 7 soil: 672, soil: 151, soil: 35, pron: 22 ## year 0 926 4 2009: 401, 2010: 261, 2012: 162, 2011: 102 ## ## Variable type: numeric ## ## variable missing n mean p0 p50 p100 hist ## --------------- --------- ----- ---------- ------- -------- -------- ---------- ## age 0 926 18.94 2 20 40 ▂▃▅▆▇▂▂▁ ## elevation 0 926 338.74 0.58 327.22 885.91 ▃▇▇▇▅▅▂▁ ## hail_prob 0 926 0.45 0.018 0.55 1 ▇▅▁▂▆▇▃▁ ## p_sum 0 926 234.17 124.4 224.55 496.6 ▅▆▇▂▂▁▁▁ ## ph 0 926 4.63 3.97 4.6 6.02 ▃▅▇▂▂▁▁▁ ## r_sum 0 926 -0.00004 -0.1 0.0086 0.082 ▁▂▅▃▅▇▃▂ ## slope_degrees 0 926 19.81 0.17 19.47 55.11 ▃▆▇▆▅▂▁▁ ## temp 0 926 15.13 12.59 15.23 16.8 ▁▁▃▃▆▇▅▁ ``` ] --- # Data
& Study Area
.center[  .font70[**Fig. 2:** Study area (Basque Country, Spain)] ] --- class: inverse, center, middle # Methods
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Methods
## Machine-learning models * Boosted Regression Trees (`BRT`) * Random Forest (`RF`) * Support Vector Machine (`SVM`) * k-nearest Neighbor (`KNN`) ## Parametric models * Generalized Additive Model (`GAM`) * Generalized Linear Model (`GLM`) ## Performance Measure Brier Score --- # Methods
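## Performance Measure

The Brier score is the mean squared difference between the predicted probability and the observed binary outcome (lower is better). A minimal sketch in Python — the analysis itself was done in R with `mlr`; this is purely illustrative:

```python
# Brier score for a binary response: mean((p_i - y_i)^2), lower is better.
def brier_score(y_true, y_prob):
    assert len(y_true) == len(y_prob)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Confident, mostly correct probabilities score close to zero:
print(brier_score([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.0375
```

A constant prediction of 0.5 always yields a Brier score of 0.25 — a useful no-skill baseline.
---
# Methods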
## Nested Cross-Validation * Cross-validation for **performance estimation** * Cross-validation for **hyperparameter tuning** (sequential model-based optimization, Bischl, Richter, Bossek, et al. (2017)) Different sampling strategies (Performance estimation/Tuning): * Non-Spatial/Non-Spatial * Spatial/Non-Spatial * Spatial/Spatial * Non-Spatial/No Tuning * Spatial/No Tuning --- # Methods
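## Nested Cross-Validation (sketch)

The nesting above — an outer CV for performance estimation wrapped around an inner CV for tuning — can be sketched as follows. This is a toy Python illustration, not the actual R/`mlr` implementation; the "model" is a hypothetical prevalence predictor with a single shrinkage hyperparameter:

```python
import random

def kfold(n, k, rng):
    """Randomly shuffle indices 0..n-1 and split them into k folds
    (this corresponds to non-spatial partitioning)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def brier(y, p):
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def fit_predict(train_y, shrink):
    """Toy 'model': predict the training prevalence, shrunk towards 0.5 by
    the hyperparameter `shrink` in [0, 1] (stands in for a real learner)."""
    prev = sum(train_y) / len(train_y)
    return shrink * 0.5 + (1 - shrink) * prev

def nested_cv(y, grid, k_outer=5, k_inner=5, seed=1):
    rng = random.Random(seed)
    outer_scores = []
    for test_idx in kfold(len(y), k_outer, rng):
        train_y = [y[i] for i in range(len(y)) if i not in set(test_idx)]
        inner_folds = kfold(len(train_y), k_inner, rng)

        def inner_error(shrink):
            # inner CV: evaluate a candidate using the outer-training data only
            errs = []
            for inner_test in inner_folds:
                inner_train = [train_y[i] for i in range(len(train_y))
                               if i not in set(inner_test)]
                p = fit_predict(inner_train, shrink)
                errs.append(brier([train_y[i] for i in inner_test],
                                  [p] * len(inner_test)))
            return sum(errs) / len(errs)

        best = min(grid, key=inner_error)  # hyperparameter tuning
        p = fit_predict(train_y, best)     # refit on all outer-training data
        outer_scores.append(brier([y[i] for i in test_idx],
                                  [p] * len(test_idx)))
    return sum(outer_scores) / len(outer_scores)  # performance estimate

print(nested_cv([0] * 30 + [1] * 10, grid=[0.0, 0.25, 0.5, 1.0]))
```

The key point: the outer test folds never influence tuning, so the outer score is an unbiased performance estimate for the tuned model.
---
# Methods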
## Nested (spatial) cross-validation .center[  .font70[**Fig. 3:** Nested spatial/non-spatial cross-validation]] --- # Methods
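## Spatial partitioning (sketch)

Spatial partitioning keeps nearby observations in the same fold, so test folds are spatially separated from the training data and autocorrelation cannot leak information across the split. The study uses k-means clustering of the observation coordinates; the block-based sketch below (illustrative Python only, not the actual implementation) captures the same idea:

```python
def spatial_folds(coords, k, block_size=1.0):
    """Assign points to folds by coarse spatial blocks: all points falling in
    the same block share a fold, so folds are spatially separated.
    (The talk's analysis uses k-means clustering of coordinates instead.)"""
    blocks = {}
    for i, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks.setdefault(key, []).append(i)
    folds = [[] for _ in range(k)]
    # hand out whole blocks round-robin, largest first, to balance fold sizes
    for j, members in enumerate(sorted(blocks.values(), key=len, reverse=True)):
        folds[j % k].extend(members)
    return folds

coords = [(0.1, 0.2), (0.3, 0.4), (5.1, 5.2), (5.3, 5.0), (9.9, 0.1)]
print(spatial_folds(coords, k=3))  # [[0, 1], [2, 3], [4]]
```

Neighboring points (indices 0/1 and 2/3) always land in the same fold — exactly what a random, non-spatial split would not guarantee.
---
# Methods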
## Nested (spatial) cross-validation <br> .center[  .font70[**Fig. 4:** Comparison of spatial and non-spatial partitioning of the data set.] ] --- # Methods
#### Hyperparameter tuning search spaces RF : Probst, Wright, and Boulesteix (2018) BRT, SVM, KNN: Self-defined limits based on evaluation of estimated hyperparameters .center[  .font70[**Table 1:** Hyperparameter limits and types of each model. Notations of hyperparameters from the respective R packages were used. `\(p\)` = Number of variables.] ] --- class: inverse, center, middle # Results
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Results
## Hyperparameter tuning .center[  ] .font70[**Fig 4:** SMBO optimization paths of the first five folds of the **spatial/spatial** and **spatial/non-spatial** CV setting for RF. The dashed line marks the border between the initial design (30 randomly composed hyperparameter settings) and the sequential optimization part in which each setting was proposed using information from the prior evaluated settings. ] --- # Results
## Hyperparameter tuning .center[  ] .font70[**Fig 5:** Best hyperparameter settings by fold (500 total) each estimated from 100 (30/70) SMBO tuning iterations per fold using five-fold cross-validation. Split by spatial and non-spatial partitioning setup and model type. Red crosses indicate the default hyperparameters of the respective model. Black dots represent the winning hyperparameter setting of each fold. The labels ranging from one to five show the winning hyperparameter settings of the first five folds. ] --- # Results
## Predictive Performance .center[  ] .font70[**Fig. 7:** (Nested) CV estimates of model performance at the repetition level using 100 SMBO iterations for hyperparameter tuning. CV setting refers to performance estimation/hyperparameter tuning of the respective (nested) CV, e.g. "Spatial/Non-Spatial" means that spatial partitioning was used for performance estimation and non-spatial partitioning for hyperparameter tuning. ] --- class: inverse, center, middle # Discussion
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Discussion
## Predictive performance * `RF` showed the best predictive performance
-- * High bias in performance when using non-spatial CV --- # Discussion
.center[  ] .font70[**Fig 6:** (Nested) CV estimates of model performance at the repetition level using 100 SMBO iterations for hyperparameter tuning. CV setting refers to performance estimation/hyperparameter tuning of the respective (nested) CV, e.g. "Spatial/Non-Spatial" means that spatial partitioning was used for performance estimation and non-spatial partitioning for hyperparameter tuning. ] --- # Discussion
## Predictive Performance * `RF` showed the best predictive performance
* High bias in performance when using non-spatial CV -- * The `GLM` performs on par with `BRT`, `KNN` and `SVM` -- * The `GAM` suffers from overfitting --- # Discussion
## Hyperparameter tuning * Almost no effect on predictive performance -- * Differences between algorithms are larger than the effect of hyperparameter tuning -- * Spatial hyperparameter tuning has no substantial effect on predictive performance compared to non-spatial tuning -- * Optimal parameters estimated from spatial hyperparameter tuning show a wide spread across the search space --- # Discussion
## Tuning .center[  ] .font70[**Fig. 6:** Best hyperparameter settings by fold (500 total), each estimated from 100 (30/70) SMBO tuning iterations per fold using five-fold cross-validation. Split by spatial and non-spatial partitioning setup and model type. Red crosses indicate the default hyperparameters of the respective model. Black dots represent the winning hyperparameter setting of each fold. The labels ranging from one to five show the winning hyperparameter settings of the first five folds. ] --- # Discussion
## Hyperparameter tuning * Almost no effect on predictive performance. * Differences between algorithms are larger than the effect of hyperparameter tuning. * Spatial hyperparameter tuning has no substantial effect on predictive performance compared to non-spatial tuning. * Optimal parameters estimated from spatial hyperparameter tuning show a wide spread across the search space.
Nevertheless, spatial hyperparameter tuning should be used for spatial data sets to keep the resampling scheme consistent.
--- # References
Bischl, B., J. Richter, J. Bossek, et al. (2017). "mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions". In: _ArXiv e-prints_. arXiv: [1703.03373 [stat]](https://arxiv.org/abs/1703.03373). Probst, P., M. Wright, and A. Boulesteix (2018). "Hyperparameters and Tuning Strategies for Random Forest". In: _ArXiv e-prints_. arXiv: [1804.03515 [stat.ML]](https://arxiv.org/abs/1804.03515). <br> .center[ ## Thanks for listening! Questions? Slides can be found here: https://t.co/SyWRky6sGn And now, let's have a
;) ] --- class: inverse, center, middle # Backup
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Backup
.center[  ]