class: center, middle, inverse, title-slide

# Machine Learning for spatiotemporal data using {mlr3}
### Marc Becker, Patrick Schratz et al.
### 2021-10-27, Zurich R User Group

---

## Who I am

.pull-left[

.center[
![:scale 35%](https://avatars.githubusercontent.com/u/8430564?v=4)
]

- M.Sc. Geoinformatics
- Previously researcher at University of **Jena** and LMU **Munich**
- Now R consultant in Zurich, Switzerland
- PhD Candidate (Environmental modelling)

]

.pull-right[

- Unix & R enthusiast
- [Gitea](https://gitea.io/en-us/) (https://gitea.io) contributor
- Member of mlr-org core team; Machine learning in R

![:scale 30%](https://raw.githubusercontent.com/mlr-org/mlr/main/man/figures/logo.png)

- [mlr3](https://github.com/mlr-org/mlr3) - https://github.com/mlr-org/mlr3
- [mlr](https://github.com/mlr-org/mlr) - https://github.com/mlr-org/mlr

]

---

## Where I work

.pull-left[

- Swiss-based R consulting company (Zurich), founded in 2018
- [www.cynkra.com](https://www.cynkra.com)
- 5 - 10 people from 7 different countries
- Strong Free and Open-Source (FOSS) philosophy
- [RStudio Certified Partner](https://www.rstudio.com/certified-partners/)

]

.pull-right[

![:scale 90%](https://github.com/cynkra/cynkratemplate/blob/master/man/figures/cynkra-logo.png?raw=true)

]

---

class: middle, inverse, center

# 1. mlr3 Overview

---

## mlr3: Overview

- Why do we want to use mlr3?
- Key principles of mlr3

<br>

**Code** https://gist.github.com/pat-s/ae290bd6dd8c2970c7aa0baf200483c4

**Slides** https://github.com/pat-s/presentations

---

## <i class="fas fa-question"></i> Why use mlr3

*Users* want to efficiently **train**/**predict**/**benchmark**

- many **methods**
- on many **datasets**
- using different **tuning methods**
- using different **feature selection methods**
- preferably using the **same syntax**

<br>

<i class="fas fa-arrow-right"></i> *Design principles* of {mlr3}

---

## mlr3: Overview

.center.middle[
![:scale 75%](https://raw.githubusercontent.com/mlr-org/mlr3book/f7d27abdf83b0b5d2fef9700ee9a6bc7d4d7e97d/bookdown/images/ml_abstraction.svg)
]

---

## Motivation: Make benchmarking easy!
.fl.w-70.pa0[

By unifying

- interfaces to **train** and **predict** methods,
- interfaces to learner **hyperparameters** and **optimizers** (tuning),
- **resampling** (performance estimation),
- **preprocessing** independently from the data,
- **parallelization**, and
- **error handling**

]

.fl.w-30.pl1.extrasmall[
<iframe src="https://giphy.com/embed/xUPGck7rzlAftbFZza" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/nba-warriors-golden-state-xUPGck7rzlAftbFZza">Source: https://giphy.com/gifs/nba-warriors-golden-state-xUPGck7rzlAftbFZza</a></p>
]

---

## Is it worth "learning" mlr3?

.pull-left[

- Avoid making mistakes by relying on **tested functionality**
  - Predefined performance measures
  - Resampling
- **Easily scale up** your benchmark
  - Integrated parallelization
  - Benchmarking functions
- New methods can be easily integrated into the {mlr3verse}

]

.pull-right[
![:scale 85%](https://imgs.xkcd.com/comics/is_it_worth_the_time.png)
]

---

## mlr3 in a nutshell

.fl.w-50.pa2[

```r
library("mlr3verse", quietly = TRUE)
set.seed(42)

# example tasks
tasks <- tsks(c("iris", "german_credit"))

# from {mlr3learners}
learners <- lrns(c("classif.rpart", "classif.ranger"))

# set up and run a cross-validation benchmark
bmg <- benchmark_grid(
  tasks,
  learners,
  rsmp("cv")
)
bmr <- benchmark(bmg)

# visualize by classification error
autoplot(bmr, measure = msr("classif.ce"))
```

]

.fl.w-50.pa2[
![](2021-10-zurich_files/figure-html/mlr3-nutshell-1.png)
]

---

exclude: true

## Principles of mlr3

.fl.w-60.pa2.small[

- Overcome limitations of S3 with the help of **{R6}**
  - Truly object-oriented: data and methods live in the same object
  - Make use of inheritance
  - Make slight use of reference semantics
- Embrace **{data.table}**, both for arguments and internally
  - Fast operations for tabular data
  - List columns to arrange complex objects in tabular structure
- Be **light on dependencies**:
  - `{R6}`, `{data.table}`, `{lgr}`
  - Plus some of our own packages (`{backports}`, `{checkmate}`)
  - Special packages are loaded from mlr3 extension libraries

]

.fl.w-34.pl2[
![](https://raw.githubusercontent.com/mlr-org/mlr-outreach/main/2021_whyr/fig/mlr3_principles.png)
]

---
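## mlr3 in a nutshell

The design built in the `benchmark_grid()` example is an exhaustive cross of tasks, learners and resamplings. A minimal base-R sketch of that idea follows. The character labels are placeholders for the real mlr3 objects, and `expand.grid()` only mimics the shape of the design; it is not how `benchmark_grid()` is implemented.

```r
# Illustration only: plain character labels stand in for mlr3 objects
tasks <- c("iris", "german_credit")
learners <- c("classif.rpart", "classif.ranger")
resamplings <- "cv"

# one row per (task, learner, resampling) combination
design <- expand.grid(
  task = tasks,
  learner = learners,
  resampling = resamplings,
  stringsAsFactors = FALSE
)
print(design)

# assumed number of folds per resampling (10 is the default of rsmp("cv"))
folds <- 10
n_iterations <- nrow(design) * folds
print(n_iterations)
```

With 2 tasks, 2 learners and one 10-fold resampling, the design has 4 rows and amounts to 40 train/predict iterations.

---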
## The mlr3verse

.center.middle[
![:scale 75%](https://raw.githubusercontent.com/mlr-org/mlr3/master/man/figures/mlr3verse.svg)
]

---

class: middle, inverse, center

# 2. mlr3 + spatiotemporal data

---

class: middle

## mlr3 + spatiotemporal data

- How does mlr3 help in spatiotemporal/environmental/ecological modelling?
- What things do I need to be aware of?
- What is still missing?
- Can I contribute?

---

## mlr3 + spatiotemporal data

There are currently two packages for spatiotemporal analysis in mlr3:

[{mlr3spatiotempcv}](https://github.com/mlr-org/mlr3spatiotempcv)

<i class="fas fa-arrow-right"></i> Spatiotemporal **resampling methods** (for cross-validation)

<br>

[{mlr3spatial}](https://github.com/mlr-org/mlr3spatial)

<i class="fas fa-arrow-right"></i> Spatial **DataBackends** and (parallelized) **prediction** support

<br>

.small[
Planned but unfinished (and currently unmaintained): [mlr3temporal](https://github.com/mlr-org/mlr3temporal). Please reach out to us if you have knowledge in this area and would like to contribute <i class="fas fa-hands-helping"></i>
]

---

class: middle, inverse, center

# 2.1 mlr3spatial

---

## mlr3spatial

.extrasmall[<i class="fas fa-info-circle"></i> {mlr3spatial} is new and not on CRAN yet]

### What's inside the tin?

<i class="fas fa-check" style="color:green;"></i> `DataBackendRaster` for [{terra}](https://cran.r-project.org/web/packages/terra/index.html), [{raster}](https://cran.r-project.org/web/packages/raster/index.html), [{stars}](https://cran.r-project.org/web/packages/stars/index.html)

<i class="fas fa-check" style="color:green;"></i> `DataBackendVector` for [{sf}](https://cran.r-project.org/web/packages/sf/index.html)

<i class="fas fa-check" style="color:green;"></i> Parallel (future-based) predictions via `<learner>$predict()`

<i class="fas fa-check" style="color:green;"></i> Memory-aware chunked predictions

---

## mlr3spatial

Predict the first band (`layer.1`) of the `l7data` Landsat-7 raster (see `?stars::L7_ETMs`).
.fl.w-60.pa1[

```r
library("mlr3")
library("mlr3learners")
library("mlr3spatial")

tif <- system.file("tif/L7_ETMs.tif",
  package = "stars"
)
l7data <- stars::read_stars(tif)

# create mlr3 backend from stars data
backend <- as_data_backend(l7data)
```

]

.fl.w-40.pa1[

- Load required packages
- Load the `L7` data
- Create a `DataBackendRaster`

]

---

## mlr3spatial

.fl.w-60.pa1[

```r
# create a "Random Forest" learner and train it
learner <- lrn("regr.ranger")
task <- as_task_regr(backend, target = "layer.1")

rows_train <- sample(1:task$nrow, 1000)
rows_pred <- setdiff(1:task$nrow, rows_train)
learner$train(task, row_ids = rows_train)
```

]

.fl.w-40.pa1[

- Create a `TaskRegr` with `layer.1` as the response
- Train a Random Forest learner ({ranger} package) on a subset of the data (1000 obs.)

]

.small[<i class="fas fa-info-circle"></i> Usually one would not split a raster file into train and predict sets. More often, the training set is built from point observations and a raster is used for predictions into unknown space.]

---

## mlr3spatial

<i class="far fa-newspaper"></i> Also available as vignette ["Getting Started"](https://mlr3spatial.mlr-org.com/articles/meuse.html).

```r
# predict with the trained learner and return a stars object
pred <- predict_spatial(task, learner, format = "stars")
pred
```

```
## INFO [18:09:00.238] Start raster prediction
## INFO [18:09:00.246] Prediction is executed with a chunksize of 200, 1 chunk(s) in total, 122848 values per chunk
## INFO [18:09:08.742] Chunk 1 of 1 finished
## INFO [18:09:08.770] Finished raster prediction in 8 seconds
```

```
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##            Min.  1st Qu.   Median    Mean  3rd Qu.    Max.
## lyr.1  56.29713 67.00904 78.60435 78.9765 89.28517 150.165
## dimension(s):
##   from  to  offset delta                       refsys point values x/y
## x    1 349  288776  28.5 UTM Zone 25, Southern Hem... FALSE   NULL [x]
## y    1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE   NULL [y]
```

---

## mlr3spatial

```r
plot(pred,
  col = c(
    "#440154FF", "#443A83FF", "#31688EFF", "#21908CFF",
    "#35B779FF", "#8FD744FF", "#FDE725FF"
  ),
  main = "Predicted layer.1"
)
```

<img src="2021-10-zurich_files/figure-html/mlr3spatial-ex-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## mlr3spatial

**Parallel predictions**

Spatial predictions often take a long time because of the number of points to predict. In remote sensing in particular, this can be **millions** of points or more.

While some spatial classes come with built-in parallelization, {mlr3} provides a more efficient and generalized methodology to speed up such large prediction tasks.

<br>

Check out this benchmark <i class="fas fa-hand-point-right"></i>

.extrasmall[
Source: https://mlr3spatial.mlr-org.com/articles/benchmark.html
]

---

.fl.w-40[

- 500 MB file on disk
- ~25 million values

.small[
`mlr3spatial::demo_stack_spatraster(500)`
]

]

.fl.w-60[
![](2021-10-zurich_files/figure-html/plot-benchmark-1.png)
]

---

class: middle, inverse, center

# 2.2 mlr3spatiotempcv

---

## mlr3spatiotempcv

- Spatiotemporal resampling methods for {mlr3}
- Aims to simplify/structure the jungle of spatiotemporal resampling methods

<i class="fas fa-check" style="color:green;"></i> Generic `ggplot2::autoplot()` for all methods

<i class="far fa-file-alt"></i> Submitted paper (JSS) - preprint: [https://arxiv.org/abs/2110.12674](https://arxiv.org/abs/2110.12674)

<i class="fas fa-check" style="color:green;"></i> Currently wraps **8** resampling methods from **4** packages

- {blockCV}
- {sperrorest}
- {CAST}
- {skmeans}

---

## mlr3spatiotempcv

Spatiotemporal performance estimations - Essentials

<i class="fas fa-arrow-right" style="color:green"></i> Non-spatial resampling methods **overestimate** model performance due to **spatial autocorrelation** between train and test data

<i class="fas fa-exclamation" style="color:orange"></i> There is **no single best** method; the choice of method should be **target-oriented** (what do I want to predict?)

<i class="fas fa-question" style="color:grey"></i> There is a debate whether spatiotemporal resampling methods **might be too pessimistic**

<br>

<i class="fas fa-arrow-right"></i> Ongoing research <i class="far fa-lightbulb"></i>

---

## mlr3spatiotempcv

Example:

- Spatial cross-validation with Random Forest <i class="fas fa-trees" style="color:green"></i>;
- Predicting **landslide** events (0/1) in Ecuador.

```r
library("mlr3spatiotempcv")

# create 'sf' object from example data
data_sf <- sf::st_as_sf(ecuador, coords = c("x", "y"), crs = 32717)
```

---

## mlr3spatiotempcv
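
Why the partitioning matters can be sketched in base R. The snippet uses hypothetical random toy coordinates (not the `ecuador` data) and `stats::kmeans()` to mimic coordinate-based partitioning. It illustrates the idea only and is not the {mlr3spatiotempcv} implementation; the helpers `nn_dist()` and `mean_gap()` are made up for this sketch.

```r
set.seed(42)

# hypothetical toy coordinates, not the ecuador data
coords <- data.frame(x = runif(200), y = runif(200))
k <- 4

# spatial folds: cluster the coordinates so each fold is a compact region
fold_spatial <- kmeans(coords, centers = k)$cluster

# non-spatial folds: assign observations to folds at random
fold_random <- sample(rep_len(seq_len(k), nrow(coords)))

# distance from every test point to its nearest training point
nn_dist <- function(test, train) {
  d <- as.matrix(dist(rbind(test, train)))
  n_test <- nrow(test)
  apply(d[seq_len(n_test), -seq_len(n_test), drop = FALSE], 1, min)
}

# mean nearest-neighbour distance when fold 1 is the test set
mean_gap <- function(fold) {
  test <- coords[fold == 1, ]
  train <- coords[fold != 1, ]
  mean(nn_dist(test, train))
}

gap_spatial <- mean_gap(fold_spatial)
gap_random <- mean_gap(fold_random)
print(c(spatial = gap_spatial, random = gap_random))
```

With spatial folds, test points tend to lie further away from their nearest training point. This is the mechanism that counteracts the optimistic bias caused by spatial autocorrelation.
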
---

## mlr3spatiotempcv

```r
# create ClassifST task
task <- TaskClassifST$new("ecuador_sf",
  backend = data_sf, target = "slides", positive = "TRUE"
)
print(task)
```

```
## <TaskClassifST:ecuador_sf> (751 x 11)
## * Target: slides
## * Properties: twoclass
## * Features (10):
##   - dbl (10): carea, cslope, dem, distdeforest, distroad,
##     distslidespast, hcurv, log.carea, slope, vcurv
## * Coordinates:
##             X       Y
##   1: 712882.5 9560002
##   2: 715232.5 9559582
##   3: 715392.5 9560172
##   4: 715042.5 9559312
##   5: 715382.5 9560142
##  ---
## 747: 714472.5 9558482
## 748: 713142.5 9560992
## 749: 713322.5 9560562
## 750: 715392.5 9557932
## 751: 713802.5 9560862
```

---

## mlr3spatiotempcv

```r
library("mlr3learners")
library("ranger")

task <- tsk("ecuador")
learner <- lrn("classif.ranger", predict_type = "prob")

resampling_sp <- rsmp("repeated_spcv_coords",
  folds = 4, repeats = 2
)
rr_sp <- resample(
  task = task, learner = learner,
  resampling = resampling_sp
)
rr_sp$aggregate(measures = msr("classif.ce"))
```

```
## classif.ce
##  0.3585072
```

---

## mlr3spatiotempcv

```r
autoplot(resampling_sp, task, fold_id = c(1:2), size = 0.7)
```

<img src="2021-10-zurich_files/figure-html/mlr3spatiotempcv-ex3-1.png" width="100%" />

---

## More resources

- See the **"Spatiotemporal Analysis"** chapter in the mlr3book (https://mlr3book.mlr-org.com/special-tasks.html#spatiotemporal)
- Function reference of {mlr3spatiotempcv}: https://mlr3spatiotempcv.mlr-org.com/reference/index.html
- Literature:
  - Roberts et al. 2017: [Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure](https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881)
  - Schratz et al. 2019: [Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data](http://www.sciencedirect.com/science/article/pii/S0304380019302145)
  - Schratz et al. 2021: [Spatiotemporal resampling methods for machine learning in R](https://arxiv.org/abs/2110.12674)

---

## mlr3spatiotempcv

What about (spatio)-temporal methods?

- Two methods (`"sptcv_cstf"` and `"sptcv_cluto"`) support both space and time
- Spatiotemporal resampling is non-trivial due to the involvement of multiple dimensions
- We would love to see help/contributions from the community for {mlr3temporal}

---

## Acknowledgements

.fl.w-50.pa2[

Thanks to **Marc Becker** for his help developing the mlr3 spatial packages.

Thanks to mlr-org's GitHub sponsors (especially **OpenGeoHub** and **cynkra**).

Thanks to **you** for being interested in / using mlr3!

]

.fl.w-10.pl2[
.extrasmall[Bernd Bischl]
![](https://avatars.githubusercontent.com/u/1225974?s=400&v=4)
.extrasmall[Patrick Schratz]
![](https://avatars.githubusercontent.com/u/8430564?s=400&v=4)
]

.fl.w-10.pl2[
.extrasmall[Michel Lang]
![](https://avatars.githubusercontent.com/u/5163152?s=400&v=4)
.extrasmall[Flo Pfisterer]
![](https://avatars.githubusercontent.com/u/13337694?s=400&u=4fd75b47d0b99b77a5ca5cfd6d8c1a2ee20a8d02&v=4)
]

.fl.w-10.pl2[
.extrasmall[Lars Kotthoff]
![](https://avatars.githubusercontent.com/u/1236203?s=400&v=4)
<p style="margin-bottom:2.08cm;">
.extrasmall[Marc Becker]
![](https://avatars.githubusercontent.com/u/33069354?s=400&u=fd1ea8aebc9530a10a1d8158266c23bc89ebcd57&v=4)
]

.fl.w-10.pl2[
.extrasmall[Jakob Richter]
![](https://avatars.githubusercontent.com/u/1888623?s=400&v=4)
<p style="margin-bottom:1.35cm;">
.extrasmall[L. Schneider]
![](https://avatars.githubusercontent.com/u/56672870?s=400&u=de7bac2afc9e0a77040a2df54b5b2faadf99b347&v=4)
]

.fl.w-10.pl2[
.extrasmall[Martin Binder]
![](https://avatars.githubusercontent.com/u/15801081?s=400&u=ce0c3c7d0ea5bc3cde8c9d8b1179d7d3ca4a8b91&v=4)
<p style="margin-bottom:2.08cm;">
.extrasmall[R. Sonabend]
![](https://avatars.githubusercontent.com/u/25639974?s=400&u=706e0c0b1e54ee7d52b1c81f8e5e7f13d8e3acb4&v=4)
]