Overview (mixed models essentials and subject-matter intro)
- Mixed models essentials
- Science intro
Coding
extras
references
- references

Overview (mixed models essentials and subject-matter intro)

Mixed models essentials

What are mixed models good for?

a broad class of statistical models that extend linear and generalized linear models to handle data where observations are measured within discrete groups such as field sites; years or other temporal blocks; individuals that are observed multiple times; genotypes; species; etc. They can be thought of (equivalently) as (1) accounting for the correlation among observations from the same group; (2) estimating the variability among groups, or (3) parsimoniously estimating the effects of groups. They are most useful when the experimental or observational design includes a large number of groups with varying numbers of observations per group.

What is a random effect anyway?

A grouping variable g (must be discrete!) and a varying term f
denoted as (f | g) in most R MM packages
"the effect of f (e.g. slope) varies across groups defined by g (e.g. species; f = 1 → intercept)
effects of f for each group g via shrinkage estimation/partial pooling (empirical Bayes, joint Bayesian prior, …)

When should you use a random effect?

Some rules of thumb

don’t want to test hypotheses about differences between groups
do want to quantify the variability across groups
do want to make predictions for unobserved groups
do want to combine information across groups
do have variation in information per group (samples, noise)
do have groups randomly sampled from a population
do have a categorical nuisance variable
do have exchangeable groups
do have “many” (> 5-6) groups; small \(n\) per group; unbalanced groups

cf. Crawley (2002), Gelman (2005)

What tools?

Mostly lme4; gamm4 to deal with spatial autocorrelation
Some ggplot2, maybe some tidyverse
diagnostics etc.; broom.mixed, DHARMa, car (using :: notation as appropriate)
Other (G)LMM-adjacent packages listed here

Science intro

What is this project?

Paper in progress/review with Max Moritz (UCSB) and Enric Batllori Presas (Univ Barcelona)
Use synthetic global-scale databases of species richness, primary productivity, wildfire to quantify relationships between (NPP, fire consumption) and richness

Pictures?

Analytical goals?

estimate effects of NPP (g C m\(^2\)/year), fire consumption (% of NPP), interannual CV of NPP and fire consumption, and their interactions, on species richness
at the global level and possibly variation across different geographic scales

need for mixed models: geographic variation

realms (large-scale) (e.g. “Neotropics”)
biomes (medium-scale, environmental) (e.g. “tropical grassland”)
biome × realm interaction (“tropical grasslands in the Neotropics”)
“ecoregion”: sampling unit (Olson et al. 2001)¹

nesting and crossing of random effects

Nested: sub-unit IDs only measured within a single larger unit. e.g.: Class1 in School1 independent of Class1 in School2: (1|School/Class)

Crossed: sub-unit IDs can be measured in multiple larger units. e.g. class = “history”, “geography”, “mathematics”: (1|School) + (1|Class)

Unique coding: removes ambiguity

Robert Long, Cross Validated

random effects terms

“random effect of X” usually means intercept variation only, by default (but see Schielzeth and Forstmeier (2009))
- one parameter (variance/std dev of intercept across groups)
RE terms with n parameters (intercept + slope = 2)
- n*(n+1)/2 parameters
- n=5 → 15 for this example
RE terms with n independent effects (|| shortcut); only n parameters
- 5 for this example

more preliminaries

we’ll work with log-scaled NPP/fire, raw CVs, all centered (Schielzeth 2010)
main effects all evaluated at geometric mean of other variables
coefficients are approximately elasticities

before we start coding …

knitr::include_graphics("pix/gelman_hill_complexity.png")

Gelman and Hill (2006)

knitr::include_graphics("pix/uriarte_yackulic_complexity.png")

Uriarte and Yackulic (2009)

Coding

load packages

Load packages up front, note what they’re used for …

library(tidyverse); theme_set(theme_bw())
library(lme4)
library(gamm4)
## diagnostics
library(DHARMa)
library(car) ## influencePlot
## extraction/graphics
library(broom.mixed)
library(emmeans)
library(dotwhisker)
library(gridExtra)
## from GitHub: see utils.R
library(r2glmm) ## bbolker/r2glmm
library(remef)  ## hohenstein/remef
source("utils.R")
source("gamm4_utils.R")

load data

dd <- readRDS("data/ecoreg.rds")

simplest model

Single-level model (biomes), intercept variation only. All pairwise interactions of main variables ((...)^2), plus (log of) ecoregion area:

m1 <- lmer(mbirds_log ~ scale(log(area_km2)) + (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +
             (1 | biome),
           data = dd)
## "Feat" = "fire eaten"
## may get
## Warning message:
## Some predictor variables are on very different scales: consider rescaling

diagnostics

Best practice: check diagnostics as early as possible (before summary(), coeff plots) to reduce snooping.

plot(m1, type = c("p", "smooth"))

## heteroscedasticity
plot(m1, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth")) ## slight heteroscedasticity?

car::influencePlot(m1)

plot(DHARMa::simulateResiduals(m1))  ## we'll get back to this

coefficient plots

## basic coefficient plot
dotwhisker::dwplot(m1, effects="fixed") + geom_vline(xintercept = 0, lty = 2)

## ordered coefficient plot (from utils.R)
dwplot_ordered(m1, effects = "fixed")

add RE terms

allow the main effects to vary across biomes (in a correlated way)
update() is your friend

m2 <- update(m1, . ~ . - (1|biome) + (1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome))

## compare plots
dwplot_ordered(list(intercept_only = m1, random_slopes = m2), effects = "fixed")

three-level model

Now go to the (almost) maximal model; variation of main effects at all three geographic levels

## ~ 30 seconds
max_model <- lmer(mbirds_log ~ scale(log(area_km2)) + (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +
                    (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome) +
                    (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | flor_realms) +
                    (Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc | biome_FR),
                  data = dd,
                  ## for speed/skip convergence warnings
                  control = lmerControl(calc.derivs = FALSE))

checking the covariance parameter estimates

isSingular(max_model)

## [1] TRUE

lwr <- getME(max_model, "lower"); theta <- getME(max_model, "theta"); min(theta[lwr == 0])

## [1] 0

VarCorr(max_model)

##  Groups      Name        Std.Dev. Corr                       
##  biome_FR    (Intercept) 0.108634                            
##              Feat_log_sc 0.028956 -0.998                     
##              Feat_cv_sc  0.058083  0.968 -0.975              
##              NPP_log_sc  0.112312 -0.011 -0.027  0.239       
##              NPP_cv_sc   0.036536 -0.987  0.991 -0.996 -0.152
##  biome       (Intercept) 0.126970                            
##              Feat_log_sc 0.011065 -1.000                     
##              Feat_cv_sc  0.050601  0.936 -0.936              
##              NPP_log_sc  0.148002 -0.468  0.468 -0.749       
##              NPP_cv_sc   0.091323 -0.621  0.621 -0.856  0.983
##  flor_realms (Intercept) 0.229352                            
##              Feat_log_sc 0.130501  0.561                     
##              Feat_cv_sc  0.072461  0.396  0.440              
##              NPP_log_sc  0.119123 -0.358 -0.964 -0.512       
##              NPP_cv_sc   0.031699  0.249 -0.660 -0.093  0.791
##  Residual                0.193239

maximal model

idea
why it usually doesn’t work
- complexity
- confounding with residual variance: starling mixed model example

model simplification

avoid singularity/non-convergence (Barr et al. 2013; Schielzeth and Forstmeier 2009)
data-driven (AIC, p-value) (Bates et al. 2015; Matuschek et al. 2017)

non-convergence vs singularity

convergence warnings: historical reasons
very unreliable (and slow!) for large data sets (>10,000 observations)
gold standard: run allFit(), diagnose/evaluate differences in effects of interest

AIC table/strategy

for loop over table
fitting many models is a code smell but sometimes hard to avoid
reformulate() is useful

all_vars <- "1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc"
v1 <- expand.grid(c("1 | ", paste(all_vars, "|"), paste(all_vars, "||")),
                  c("biome", "flor_realms", "biome_FR"))
v2 <- sprintf("(%s)", apply(v1, 1, paste, collapse = " "))
## use cross_df instead of expand.grid, want chars
v3 <- cross_df(list(biome = v2[1:3], FR = v2[4:6], biome_FR = v2[7:9]))
v3[1,]

## # A tibble: 1 × 3
##   biome        FR                 biome_FR       
##   <chr>        <chr>              <chr>          
## 1 (1 |  biome) (1 |  flor_realms) (1 |  biome_FR)

model_list <- list()
p1 <- proc.time()
for (i in 1:nrow(v3)) {
  cat(i, unlist(v3[i,]), "\n")
  form <- reformulate(
      c(sprintf("(%s)^2", all_vars), ## main effects and 2-way interax
        "log(area_km2)",             ## plus ecoregion area
        unlist(v3[i,])),             ## plus specified REs
      response = "mbirds_log")
  model_list[[i]] <- lmer(form, data = dd) ## fit model
}
saveRDS(model_list, file = "data/model_list.rds")
proc.time() - p1 ## about 10 minutes

extract summary info from models, take a look

model_list <- readRDS("data/model_list.rds")
aic_vec <- sapply(model_list, AIC)
is_sing <- sapply(model_list, isSingular)
conv_warn <- sapply(model_list, has_warning)
tibble(model=1:27, aic_vec, is_sing, conv_warn) %>% arrange(aic_vec)

## # A tibble: 27 × 4
##    model aic_vec is_sing conv_warn
##    <int>   <dbl> <lgl>   <lgl>    
##  1    19    35.7 FALSE   FALSE    
##  2    16    38.6 TRUE    FALSE    
##  3    25    39.7 TRUE    FALSE    
##  4    10    40.9 TRUE    FALSE    
##  5    21    43.2 TRUE    FALSE    
##  6    27    46.2 TRUE    FALSE    
##  7    18    46.8 TRUE    FALSE    
##  8    12    48.4 TRUE    FALSE    
##  9     9    49.5 FALSE   FALSE    
## 10    13    51.2 FALSE   TRUE     
## # … with 17 more rows

find best-AIC, non-singular model

best_index <- which(aic_vec == min(aic_vec) & !is_sing & !conv_warn)
best_model <- model_list[[best_index]]

check it out

best_model

## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## mbirds_log ~ (1 + Feat_log_sc + Feat_cv_sc + NPP_log_sc + NPP_cv_sc)^2 +  
##     log(area_km2) + (1 | biome) + (1 | flor_realms) + ((1 | biome_FR) +  
##     (0 + Feat_log_sc | biome_FR) + (0 + Feat_cv_sc | biome_FR) +  
##     (0 + NPP_log_sc | biome_FR) + (0 + NPP_cv_sc | biome_FR))
##    Data: dd
## REML criterion at convergence: -4.3321
## Random effects:
##  Groups      Name        Std.Dev.
##  biome_FR    NPP_cv_sc   0.07758 
##  biome_FR.1  NPP_log_sc  0.13922 
##  biome_FR.2  Feat_cv_sc  0.08691 
##  biome_FR.3  Feat_log_sc 0.05442 
##  biome_FR.4  (Intercept) 0.13560 
##  biome       (Intercept) 0.10908 
##  flor_realms (Intercept) 0.23103 
##  Residual                0.19400 
## Number of obs: 620, groups:  biome_FR, 55; biome, 14; flor_realms, 6
## Fixed Effects:
##            (Intercept)             Feat_log_sc              Feat_cv_sc  
##               5.714624                0.091608               -0.006491  
##             NPP_log_sc               NPP_cv_sc           log(area_km2)  
##               0.260378                0.066486               -0.029917  
## Feat_log_sc:Feat_cv_sc  Feat_log_sc:NPP_log_sc   Feat_log_sc:NPP_cv_sc  
##              -0.011307               -0.051931               -0.047096  
##  Feat_cv_sc:NPP_log_sc    Feat_cv_sc:NPP_cv_sc    NPP_log_sc:NPP_cv_sc  
##               0.013720               -0.036622                0.006426

diagnostics

basics

plot(best_model, type = c("p", "smooth"))
## heteroscedasticity
plot(best_model, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
car::influencePlot(best_model)
plot(sr <- simulateResiduals(best_model))

What’s going on with DHARMa?

Computes residuals at population level (usually a good default but not necessarily appropriate here?)

DHARMa::plotResiduals(sr, dd$NPP_log_sc)

## https://www.stat.ubc.ca/~jenny/STAT545A/block16_colorsLattice.html
line_style <- list(plot.line = list(alpha=1, col = "red", lty = 1, lwd = 2))
pop_resids <- model.response(model.frame(best_model)) - predict(best_model, re.form = NA)
p1 <- lattice::xyplot(pop_resids ~ dd$NPP_log_sc, cex = 1, type = c("p", "smooth"),
                      main = "pop-level resids (à la DHARMa)",
                      par.settings = line_style)
p2 <- plot(best_model, resid(.) ~ NPP_log_sc, type = c("p", "smooth"),
           main = "unit-level resids (à la lme4)",
           par.settings = line_style)
gridExtra::grid.arrange(p1, p2, nrow = 1)

(lowest resid is Palearctic, lon=24, lat=

spatial correlation

Easiest to explore spatial correlations graphically:

dd$res1 <- residuals(best_model)
ggplot(dd, aes(x, y, colour = res1, size = abs(res1))) +
  geom_point() +
  scale_colour_gradient2() +
  scale_size(range=c(2,7))

Can see sub-realm geographic variation
Could work on spatial variograms, Moran’s I statistic, etc..

refit with `gamm4`

Quickest (not necessarily best) fix for spatial autocorr; bs="sos" is a spherical smooth

gamm4_form  <- update(formula(best_model), . ~ . + s(y, x, bs="sos"))
best_gamm4 <- gamm4(formula = nobars(gamm4_form),
                    random = as.formula(reOnly(gamm4_form)),
                    data = dd)
class(best_gamm4) <- c("gamm4", "list") ## so we can tidy etc. (using gamm4_utils.R)

compare results

dwplot_ordered(list(gamm4 = best_gamm4, lme4 = best_model), effects="fixed")

display/description

coefficient plots

with dotwhisker::dwplot() as shown before, or broom.mixed::tidy() + ggplot tweaking
scaled or unscaled coefficients?

\(R^2\) and partial \(R^2\) values

rsqvals <- r2glmm::r2beta(best_model, method = "sgv", partial = TRUE)
## set (reverse)factor order (or forcats::fct_inorder())
rsqvals$Effect <- factor(rsqvals$Effect, levels = rev(rsqvals$Effect))
ggplot(rsqvals, aes(x=Rsq, xmin = lower.CL, xmax = upper.CL, y = Effect)) +
  geom_pointrange()

all taxa (from the paper)

prediction plots

emmeans, effects and ggeffects package are useful for extracting predictions
need to be mindful of whether random effects components are included in the predictions (re.form)

NPPvec <- with(dd, seq(min(NPP_log_sc), max(NPP_log_sc), length.out = 51))
pred <- emmeans::emmeans(best_model, ~NPP_log_sc, at = list(NPP_log_sc = NPPvec))

## NOTE: Results may be misleading due to involvement in interactions

gg0 <- ggplot(as.data.frame(pred),
              aes(NPP_log_sc, emmean)) + geom_line() +
  geom_ribbon(aes(ymin=lower.CL, ymax = upper.CL), alpha = 0.2, colour = NA)
gg1 <- gg0 + geom_point(data = dd, aes(y = mbirds_log, colour = biome))
print(gg1)

Can also back-transform, unscale, etc. (see e.g. Stack Overflow Q1, Stack Overflow Q2)

partial residuals

Compute partial residuals and display:

dd$remef <- remef::remef(best_model, fix = "NPP_log_sc")
grid.arrange(gg1 + ggtitle("full"),
             gg0 + geom_point(data = dd, aes(y = remef, colour = biome)) +
             scale_y_continuous(limits = range(dd$mbirds_log)) +
             ggtitle("partial"),
             nrow=1)

random effects

Can extract, plot, etc. examine random effects
Built-in dotplot, qqmath methods
Or use as.data.frame(ranef()) or broom.mixed::tidy(model, effects = "ran_vals")
For this problem, REs were interesting but too noisy to draw conclusions from

re_plots <- lattice::dotplot(ranef(best_model))
print(re_plots$biome_FR)

print(grid.arrange(grobs = re_plots[2:3], nrow = 1))

extras

GLMMs

the entire layer of GLMMs can be added
standard families (Poisson/Gamma/binomial etc.) via glmer
zero-inflation, Tweedie, neg binom … glmmTMB, brms
more complex machinery (Gauss-Hermite quadrature, GLMMadaptive: or go Bayesian)
observation-level random effects

Bayesian approaches

slower but you get more
regularization is easy
informative priors
better handling of uncertainty, especially for complex models

regularization

if you want to fit a very complex model but prevent singular fits etc., you need to regularize the model somehow (i.e. force it to be sensible)
this is easiest to do in a Bayesian or semi-Bayesian framework
blme, MCMCglmm, rstanarm, brms …

model simplification/covariance structures

beyond intercept-only and diagonal structures

Barr et al. (2013) also discusses slope-only models
compound symmetry (all-equal correlations)
- (1|g/f) instead of (f|g) for factor terms
- cs() in glmmTMB, other possibilities in nlme::lme, MCMCglmm, brms, …
factor-analytic models (Maeve McGillycuddy, UNSW)

autocorrelation

mgcv: Markov random field, etc.
brms: can use all mgcv smooths
INLA: probably best for spatial/temporal, but a whole other world
glmmTMB: simple space/time stuff with complex GLMMs

REML vs ML

not as important as people think?
analogous to dividing by \(n-1\) instead of \(n\) when computing variance
REML better for RE covariance estimates
ML required for model comparison

convergence example

system.time(af <- allFit(model_list[[13]],
                         parallel = "multicore",
                         ncpus = 8))
saveRDS(af, "allfit.rds")

if (file.exists("allfit.rds")) {
  af <- readRDS("allfit.rds")
  summary(af)
}

mixed-model ecosystem in R

Google sheets list

references

Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78. https://doi.org/10.1016/j.jml.2012.11.001.

Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv:1506.04967 [Stat], June. http://arxiv.org/abs/1506.04967.

Bolker, Benjamin M. 2015. “Linear and Generalized Linear Mixed Models.” In Ecological Statistics: Contemporary Theory and Application, edited by Gordon A. Fox, Simoneta Negrete-Yankelevich, and Vinicio J. Sosa. Oxford University Press.

Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. 2009. “Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution.” Trends in Ecology & Evolution 24: 127–35. https://doi.org/10.1016/j.tree.2008.10.008.

Crawley, Michael J. 2002. Statistical Computing: An Introduction to Data Analysis Using S-PLUS. John Wiley & Sons.

Gelman, Andrew. 2005. “Analysis of Variance: Why It Is More Important Than Ever.” Annals of Statistics 33 (1): 1–53. https://doi.org/doi:10.1214/009053604000001048.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, England: Cambridge University Press. http://www.stat.columbia.edu/~gelman/arm/.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman; Hall/CRC.

Olson, D. M., E. Dinerstein, E. D. Wikramanayake, N. D. Burgess, G. V. Powell, E. C. Underwood, J. A. D’Amico, et al. 2001. “Terrestrial Ecoregions of the World: A New Map of Life on Earth: A New Global Map of Terrestrial Ecoregions Provides an Innovative Tool for Conserving Biodiversity.” BioScience 51 (11): 933–38.

Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients.” Methods in Ecology and Evolution 1: 103–13. https://doi.org/10.1111/j.2041-210X.2010.00012.x.

Schielzeth, Holger, and Wolfgang Forstmeier. 2009. “Conclusions Beyond Support: Overconfident Estimates in Mixed Models.” Behavioral Ecology 20 (2): 416–20. https://doi.org/10.1093/beheco/arn145.

Uriarte, María, and Charles B. Yackulic. 2009. “Preaching to the Unconverted.” Ecological Applications 19 (3): 592–96. http://www.jstor.org/stable/27646001.

Wood, Simon. 2017. Generalized Additive Models: An Introduction with R. 2d ed. CRC Texts in Statistical Science. Chapman & Hall.

Zuur, Alain F., Elena N. Ieno, Neil J. Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Springer.

“relatively large units of land containing a distinct assemblage of natural communities and species, with boundaries that approximate the original extent of natural communities prior to major land-use change”↩︎

EFI mixed models example

Ben Bolker

01 Nov 2021

Overview (mixed models essentials and subject-matter intro)

Mixed models essentials

What are mixed models good for?

What is a random effect anyway?

When should you use a random effect?

What tools?

Science intro

What is this project?

Pictures?

Analytical goals?

need for mixed models: geographic variation

nesting and crossing of random effects

random effects terms

more preliminaries

before we start coding …

Coding

load packages

load data

simplest model

diagnostics

coefficient plots

add RE terms

three-level model

checking the covariance parameter estimates

maximal model

model simplification

non-convergence vs singularity

AIC table/strategy

extract summary info from models, take a look

find best-AIC, non-singular model

check it out

diagnostics

basics

What’s going on with DHARMa?

spatial correlation

refit with gamm4

compare results

display/description

coefficient plots

\(R^2\) and partial \(R^2\) values

all taxa (from the paper)

prediction plots

partial residuals

random effects

extras

further reading

GLMMs

Bayesian approaches

regularization

model simplification/covariance structures

autocorrelation

REML vs ML

convergence example

mixed-model ecosystem in R

references

references

refit with `gamm4`