NeuralEstimators: Incomplete gridded data

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser

In this vignette, we demonstrate neural Bayes estimation with gridded data using convolutional neural networks (CNNs). These architectures require completely-observed gridded data, so we also show how CNNs can be used with incomplete data via two different techniques.

Before proceeding, we load the required packages (see here for instructions on installing Julia and the Julia packages NeuralEstimators and Flux):

library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
juliaEval('using NeuralEstimators, Flux')
#> Starting Julia ...

1 Complete data

We first consider the case where the data are collected completely over a regular grid.

We develop a neural Bayes estimator for a spatial Gaussian process model, where \(\boldsymbol{Z} \equiv (Z_{1}, \dots, Z_{n})'\) are data collected at locations \(\{\boldsymbol{s}_{1}, \dots, \boldsymbol{s}_{n}\}\) in a spatial domain that is a subset of \(\mathbb{R}^2\). The data are assumed to be spatially-correlated mean-zero Gaussian random variables with exponential covariance function, \[ \textrm{cov}(Z_i, Z_j) = \textrm{exp}(-\|\boldsymbol{s}_i - \boldsymbol{s}_j\|/\theta), \] with unknown range parameter \(\theta > 0\). Here, we take the spatial domain to be the unit square, we simulate data on a grid with \(16^2 = 256\) possible observation locations, and we adopt a uniform prior, \(\theta \sim \textrm{Unif}(0, 0.4)\).

To begin, we define a function for sampling from the prior distribution and a function for marginal simulation from the statistical model. Note that our simulation code can be made faster (e.g., using parallel versions of lapply, utilising GPUs, etc.), but we provide a relatively simple implementation for pedagogical purposes:

# Sampling from the prior distribution
# K: number of samples to draw from the prior
prior <- function(K) { 
  theta <- 0.4 * runif(K) # draw from the prior distribution
  theta <- t(theta)       # reshape to matrix
  return(theta)
}

# Marginal simulation from the statistical model
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(theta, m = 1) { 
  
  # Spatial locations, a grid over the unit square, and spatial distance matrix
  N <- 16 # number of locations along each dimension of the grid
  S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
  D <- as.matrix(dist(S))         
  
  # Simulate conditionally independent replicates for each parameter vector
  Z <- lapply(1:ncol(theta), function(k) {
    Sigma <- exp(-D/theta[k])  # covariance matrix
    L <- t(chol(Sigma))        # lower Cholesky factor of Sigma
    n <- nrow(L)               # number of observation locations
    mm <- if (length(m) == 1) m else sample(m, 1) # allow for variable sample sizes
    z <- matrix(rnorm(n*mm), nrow = n, ncol = mm) # standard normal variates
    Z <- L %*% z               # conditionally independent replicates from the model
    Z <- array(Z, dim = c(N, N, 1, mm)) # reshape to multidimensional array
    Z
  })
  
  return(Z)
}

A note on data format: When working with the package NeuralEstimators, the following general rules regarding assumed data format apply:

When using CNNs, each data set must be stored as a multi-dimensional array. The penultimate dimension stores the so-called channels (this dimension has size one for univariate processes, two for bivariate processes, etc.), while the final dimension stores conditionally independent replicates. For example, to store \(50\) conditionally independent replicates of a bivariate spatial process measured over a \(10\times15\) grid, one would construct an array of dimension \(10\times15\times2\times50\).
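For instance, the following minimal sketch constructs a randomly filled array with the appropriate dimensions for this hypothetical setting (the values themselves are arbitrary):

# 50 replicates of a bivariate process on a 10 x 15 grid (random values, for illustration only)
Z_example <- array(rnorm(10 * 15 * 2 * 50), dim = c(10, 15, 2, 50))
dim(Z_example)  # 10 15 2 50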

Now, let’s visualise a few realisations from our spatial Gaussian process model.
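One way to do so with ggplot2 is sketched below; the parameter values and plotting choices are arbitrary illustrations, and the grid coordinates are assumed to match those used in simulate():

theta_plot <- matrix(c(0.05, 0.2, 0.4), nrow = 1)  # illustrative range parameters
Z_plot     <- simulate(theta_plot)                 # one realisation per parameter
N <- 16
S <- expand.grid(s1 = seq(1, 0, len = N), s2 = seq(0, 1, len = N))
df <- do.call(rbind, lapply(seq_along(Z_plot), function(k) {
  data.frame(S, Z = c(Z_plot[[k]][, , 1, 1]), theta = theta_plot[k])
}))
ggplot(df) + 
  geom_tile(aes(x = s2, y = s1, fill = Z)) + 
  facet_wrap(~ theta, labeller = label_both, nrow = 1) + 
  theme_bw()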

Next, we design our neural-network architecture.

The package NeuralEstimators is centred on the DeepSets framework (Zaheer et al., 2017), which allows for making inference with an arbitrary number of independent replicates and for incorporating both neural (learned) and user-defined summary statistics. Specifically, the neural Bayes estimator is taken to be of the form, \[ \hat{\boldsymbol{\theta}}(\boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m) = \boldsymbol{\phi}\bigg\{\frac{1}{m} \sum_{i=1}^m\boldsymbol{\psi}(\boldsymbol{Z}_i)\bigg\}, \] where \(\boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m\) are replicates of \(\boldsymbol{Z}\), \(\boldsymbol{\psi}(\cdot)\) is a conventional neural network whose architecture depends on the multivariate structure of the data (e.g., a CNN for gridded data), and \(\boldsymbol{\phi}(\cdot)\) is (always) a multilayer perceptron (MLP). See Sainsbury-Dale et al. (2024) for further details on the use of DeepSets in the context of neural Bayes estimation and a discussion on the framework’s connection to conventional estimators. See also Dumoulin and Visin (2016) for a detailed description of CNNs and their construction, which is beyond the scope of this vignette:

estimator <- juliaEval('
  # Summary network
  psi = Chain(
      Conv((3, 3), 1 => 32, relu),   # 3x3 convolutional filter, 1 input channel to 32 output channels, ReLU activation
      MaxPool((2, 2)),               # 2x2 max pooling layer for dimension reduction
      Conv((3, 3), 32 => 64, relu),  # 3x3 convolutional filter, 32 input channels to 64 output channels, ReLU activation
      MaxPool((2, 2)),               # 2x2 max pooling layer for dimension reduction
      Flux.flatten                   # flatten the output to feed into a fully connected layer
  )
  
  # Inference network
  phi = Chain(
      Dense(256, 64, relu),          # fully connected layer, 256 input neurons to 64 output neurons, ReLU activation
      Dense(64, 1)                   # fully connected layer, 64 input neurons to 1 output neuron
  )
  
  # Construct DeepSet object and initialise a point estimator
  deepset = DeepSet(psi, phi)
  estimator = PointEstimator(deepset)
')

Next, we train the estimator using parameter vectors sampled from the prior distribution, and data simulated from the statistical model conditional on these parameter vectors:

# Construct training and validation sets 
K <- 25000                       # size of the training set 
theta_train <- prior(K)          # parameter vectors used in stochastic-gradient descent during training
theta_val   <- prior(K/10)       # parameter vectors used to monitor performance during training
Z_train <- simulate(theta_train) # data used in stochastic-gradient descent during training
Z_val   <- simulate(theta_val)   # data used to monitor performance during training

# Train the estimator 
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val
)
#> Computing the initial validation risk... Initial validation risk = 0.27961352
#> Computing the initial training risk... Initial training risk = 0.27962294
#> Epoch: 1  Training risk: 0.075  Validation risk: 0.053  Run time of epoch: 30.346 seconds
#> Epoch: 2  Training risk: 0.043  Validation risk: 0.033  Run time of epoch: 6.594 seconds
#> Epoch: 3  Training risk: 0.031  Validation risk: 0.028  Run time of epoch: 6.537 seconds
#> Epoch: 4  Training risk: 0.027  Validation risk: 0.027  Run time of epoch: 6.498 seconds
#> Epoch: 5  Training risk: 0.024  Validation risk: 0.023  Run time of epoch: 6.63 seconds
#> Epoch: 6  Training risk: 0.023  Validation risk: 0.023  Run time of epoch: 6.551 seconds
#> Epoch: 7  Training risk: 0.022  Validation risk: 0.022  Run time of epoch: 6.513 seconds
#> Epoch: 8  Training risk: 0.022  Validation risk: 0.022  Run time of epoch: 6.526 seconds
#> Epoch: 9  Training risk: 0.021  Validation risk: 0.021  Run time of epoch: 6.815 seconds
#> Epoch: 10  Training risk: 0.021  Validation risk: 0.021  Run time of epoch: 6.791 seconds
#> Epoch: 11  Training risk: 0.02  Validation risk: 0.021  Run time of epoch: 6.583 seconds
#> Epoch: 12  Training risk: 0.02  Validation risk: 0.021  Run time of epoch: 6.705 seconds
#> Epoch: 13  Training risk: 0.02  Validation risk: 0.02  Run time of epoch: 6.775 seconds
#> Epoch: 14  Training risk: 0.019  Validation risk: 0.02  Run time of epoch: 6.652 seconds
#> Epoch: 15  Training risk: 0.019  Validation risk: 0.02  Run time of epoch: 6.586 seconds
#> Epoch: 16  Training risk: 0.019  Validation risk: 0.021  Run time of epoch: 6.554 seconds
#> Epoch: 17  Training risk: 0.019  Validation risk: 0.021  Run time of epoch: 6.73 seconds
#> Epoch: 18  Training risk: 0.019  Validation risk: 0.02  Run time of epoch: 6.554 seconds
#> Epoch: 19  Training risk: 0.018  Validation risk: 0.02  Run time of epoch: 6.556 seconds
#> Epoch: 20  Training risk: 0.018  Validation risk: 0.02  Run time of epoch: 6.57 seconds
#> Epoch: 21  Training risk: 0.018  Validation risk: 0.02  Run time of epoch: 6.613 seconds
#> Epoch: 22  Training risk: 0.018  Validation risk: 0.02  Run time of epoch: 6.571 seconds
#> Epoch: 23  Training risk: 0.018  Validation risk: 0.02  Run time of epoch: 6.54 seconds
#> Epoch: 24  Training risk: 0.017  Validation risk: 0.02  Run time of epoch: 6.648 seconds
#> Epoch: 25  Training risk: 0.017  Validation risk: 0.02  Run time of epoch: 6.695 seconds
#> Epoch: 26  Training risk: 0.017  Validation risk: 0.02  Run time of epoch: 6.587 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

It is good practice to assess the performance of a trained estimator using out-of-sample test data:

# Test the estimator using out-of-sample data
theta <- prior(1000)
Z <- simulate(theta)
assessment <- assess(estimator, theta, Z, estimator_names = "NBE")
#>  Running estimator NBE...
plotestimates(assessment)

Once the neural Bayes estimator has been trained and assessed, it can be applied to observed data collected completely over a \(16\times 16\) grid using estimate(). Below, we use simulated data as a surrogate for observed data:

theta    <- matrix(0.1)         # true parameter
Z        <- simulate(theta)     # pretend that this is observed data
estimate(estimator, Z)          # point estimates from the "observed" data
#>           [,1]
#> [1,] 0.1008148
#> attr(,"JLDIM")
#> [1] 1 1

In practice, data are often incomplete and, in the next section, we describe methods that facilitate neural Bayes estimation in these settings.

2 Incomplete data

We now consider the case where the data are collected over a regular grid, but where some elements of the grid are unobserved. This situation often arises in, for example, remote-sensing applications, where the presence of cloud cover prevents measurement in some places.

For instance, our data may look like:

We here consider two techniques that facilitate neural Bayes estimation with incomplete data: the “masking approach” of Wang et al. (2024) and the “neural EM algorithm”.

2.1 The masking approach

The first technique that we consider is the so-called masking approach of Wang et al. (2024). The strategy involves completing the data by replacing missing values with zeros, and passing auxiliary variables that encode the missingness pattern into the neural network alongside the completed data.

Let \(\boldsymbol{Z}\) denote the complete-data vector. Then, the masking approach considers inference based on \(\boldsymbol{W}\), a vector of indicator variables that encode the missingness pattern (with elements equal to one or zero if the corresponding element of \(\boldsymbol{Z}\) is observed or missing, respectively), and \[ \boldsymbol{U} \equiv \boldsymbol{Z} \odot \boldsymbol{W}, \] where \(\odot\) denotes elementwise multiplication and the product of a missing element and zero is defined to be zero. Irrespective of the missingness pattern, \(\boldsymbol{U}\) and \(\boldsymbol{W}\) have the same fixed dimensions and hence may be processed easily using a single neural network. A neural point estimator is then trained on realisations of \(\{\boldsymbol{U}, \boldsymbol{W}\}\) which, by construction, do not contain any missing elements.

Since the missingness pattern \(\boldsymbol{W}\) is now an input to the neural network, it must be incorporated during the training phase. When interest lies only in making inference from a single already-observed data set, \(\boldsymbol{W}\) is fixed and known. However, amortised inference, whereby one trains a single neural network that will be used to make inference with many data sets, requires a model for the missingness pattern \(\boldsymbol{W}\).

Below, we define a helper function that removes a specified proportion of the data completely at random, and which serves to define our missingness model when using the masking approach in this example:

# Replaces a proportion of elements with NA
removedata <- function(Z, proportion = runif(1, 0.2, 0.8)) {
  
  # Ensure proportion is between 0 and 1
  if (proportion < 0 || proportion > 1) stop("Proportion must be between 0 and 1")
  
  # Randomly sample indices to replace
  n <- length(Z)
  n_na <- round(proportion * n)
  na_indices <- sample(1:n, n_na)
  
  # Replace selected elements with NA
  Z[na_indices] <- NA
  
  return(Z)
}

To make the general strategy concrete, consider the encoded data set \(\{\boldsymbol{U}, \boldsymbol{W}\}\) constructed from the following incomplete data \(\boldsymbol{Z}_1\).
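A minimal sketch of this construction is given below; the parameter value and missingness proportion are arbitrary choices:

Z  <- simulate(matrix(0.2))[[1]]  # complete data, dimension 16 x 16 x 1 x 1
Z1 <- removedata(Z, 0.3)          # incomplete data, with 30% of the grid cells removed
UW <- encodedata(Z1)              # encoded data set {U, W}
dim(UW)                           # expected to be 16 x 16 x 2 x 1: U in channel 1, W in channel 2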

Next, we construct and train a masked neural Bayes estimator. Here, the first convolutional layer takes two input channels, since we store the encoded data \(\boldsymbol{U}\) in the first channel and the missingness pattern \(\boldsymbol{W}\) in the second. Otherwise, the architecture remains as given above:

masked_estimator <- juliaEval('
  # Summary network
  psi = Chain(
    Conv((3, 3), 2 => 32, relu),
    MaxPool((2, 2)),
    Conv((3, 3),  32 => 64, relu),
    MaxPool((2, 2)),
    Flux.flatten
    )
    
    # Inference network
  phi = Chain(Dense(256, 64, relu), Dense(64, 1))

  deepset = DeepSet(psi, phi)
')

Next, we generate incomplete data for training and validation using removedata(), and construct the corresponding encoded data sets \(\{\boldsymbol{U}, \boldsymbol{W}\}\) using encodedata():

UW_train <- encodedata(lapply(Z_train, removedata))
UW_val   <- encodedata(lapply(Z_val, removedata))

Training and assessment of the neural Bayes estimator then proceeds as usual:

# Train the estimator 
masked_estimator <- train(
  masked_estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = UW_train,
  Z_val   = UW_val
)
#> Computing the initial validation risk... Initial validation risk = 0.22933221
#> Computing the initial training risk... Initial training risk = 0.22850168
#> Epoch: 1  Training risk: 0.092  Validation risk: 0.076  Run time of epoch: 7.184 seconds
#> Epoch: 2  Training risk: 0.064  Validation risk: 0.06  Run time of epoch: 6.849 seconds
#> Epoch: 3  Training risk: 0.056  Validation risk: 0.052  Run time of epoch: 6.775 seconds
#> Epoch: 4  Training risk: 0.051  Validation risk: 0.048  Run time of epoch: 6.82 seconds
#> Epoch: 5  Training risk: 0.047  Validation risk: 0.049  Run time of epoch: 6.753 seconds
#> Epoch: 6  Training risk: 0.045  Validation risk: 0.044  Run time of epoch: 6.736 seconds
#> Epoch: 7  Training risk: 0.043  Validation risk: 0.044  Run time of epoch: 6.843 seconds
#> Epoch: 8  Training risk: 0.041  Validation risk: 0.043  Run time of epoch: 6.899 seconds
#> Epoch: 9  Training risk: 0.04  Validation risk: 0.041  Run time of epoch: 6.886 seconds
#> Epoch: 10  Training risk: 0.039  Validation risk: 0.041  Run time of epoch: 6.888 seconds
#> Epoch: 11  Training risk: 0.038  Validation risk: 0.04  Run time of epoch: 6.817 seconds
#> Epoch: 12  Training risk: 0.037  Validation risk: 0.04  Run time of epoch: 6.762 seconds
#> Epoch: 13  Training risk: 0.037  Validation risk: 0.041  Run time of epoch: 6.806 seconds
#> Epoch: 14  Training risk: 0.036  Validation risk: 0.042  Run time of epoch: 6.882 seconds
#> Epoch: 15  Training risk: 0.035  Validation risk: 0.039  Run time of epoch: 6.758 seconds
#> Epoch: 16  Training risk: 0.035  Validation risk: 0.038  Run time of epoch: 6.807 seconds
#> Epoch: 17  Training risk: 0.034  Validation risk: 0.038  Run time of epoch: 6.791 seconds
#> Epoch: 18  Training risk: 0.034  Validation risk: 0.039  Run time of epoch: 6.823 seconds
#> Epoch: 19  Training risk: 0.033  Validation risk: 0.04  Run time of epoch: 6.727 seconds
#> Epoch: 20  Training risk: 0.033  Validation risk: 0.039  Run time of epoch: 6.781 seconds
#> Epoch: 21  Training risk: 0.032  Validation risk: 0.039  Run time of epoch: 6.764 seconds
#> Epoch: 22  Training risk: 0.032  Validation risk: 0.038  Run time of epoch: 6.716 seconds
#> Epoch: 23  Training risk: 0.031  Validation risk: 0.039  Run time of epoch: 6.728 seconds
#> Epoch: 24  Training risk: 0.031  Validation risk: 0.038  Run time of epoch: 6.668 seconds
#> Epoch: 25  Training risk: 0.031  Validation risk: 0.038  Run time of epoch: 6.66 seconds
#> Epoch: 26  Training risk: 0.03  Validation risk: 0.041  Run time of epoch: 6.677 seconds
#> Epoch: 27  Training risk: 0.03  Validation risk: 0.038  Run time of epoch: 6.656 seconds
#> Epoch: 28  Training risk: 0.03  Validation risk: 0.038  Run time of epoch: 6.652 seconds
#> Epoch: 29  Training risk: 0.029  Validation risk: 0.038  Run time of epoch: 6.664 seconds
#> Epoch: 30  Training risk: 0.029  Validation risk: 0.039  Run time of epoch: 6.792 seconds
#> Epoch: 31  Training risk: 0.029  Validation risk: 0.038  Run time of epoch: 6.664 seconds
#> Epoch: 32  Training risk: 0.029  Validation risk: 0.038  Run time of epoch: 6.781 seconds
#> Epoch: 33  Training risk: 0.028  Validation risk: 0.04  Run time of epoch: 6.902 seconds
#> Epoch: 34  Training risk: 0.028  Validation risk: 0.039  Run time of epoch: 6.867 seconds
#> Epoch: 35  Training risk: 0.027  Validation risk: 0.038  Run time of epoch: 6.988 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

# Test the estimator with many data sets, each with a missingness proportion of 25%
theta <- prior(1000)
Z <- simulate(theta)
Z1 <- lapply(Z, removedata, 0.25)
UW <- lapply(Z1, encodedata)
assessment <- assess(masked_estimator, theta, UW, estimator_names = "Masked NBE")
#>  Running estimator Masked NBE...
plotestimates(assessment)

Note that the variance of the estimates is larger than that of the estimates obtained from complete data; this is to be expected, since missingness reduces the information available for making inference on \(\boldsymbol{\theta}\).

Once trained and assessed, we can apply our masked neural Bayes estimator to (incomplete) observed data. The data must be encoded in the same manner as during training. Below, we use simulated data as a surrogate for real data, with a missingness proportion of 0.25:

theta <- matrix(0.2)
Z <- simulate(theta)[[1]] 
Z1 <- removedata(Z, 0.25)
UW <- encodedata(Z1)
c(estimate(masked_estimator, UW))
#> [1] 0.1975793

2.2 The neural EM algorithm

Let \(\boldsymbol{Z}_1\) and \(\boldsymbol{Z}_2\) denote the observed and unobserved (i.e., missing) data, respectively, and let \(\boldsymbol{Z} \equiv (\boldsymbol{Z}_1', \boldsymbol{Z}_2')'\) denote the complete data. A classical approach to facilitating inference when data are missing is the expectation-maximisation (EM) algorithm (Dempster et al., 1977).

The neural EM algorithm is an approximate version of the conventional (Bayesian) Monte Carlo EM algorithm (Wei and Tanner, 1990) which, at the \(l\)th iteration, updates the parameter vector through \[ \boldsymbol{\theta}^{(l)} = \textrm{argmax}_{\boldsymbol{\theta}} \sum_{h = 1}^H \ell(\boldsymbol{\theta}; \boldsymbol{Z}_1, \boldsymbol{Z}_2^{(lh)}) + \log \pi_H(\boldsymbol{\theta}), \] where realisations of the missing-data component, \(\{\boldsymbol{Z}_2^{(lh)} : h = 1, \dots, H\}\), are sampled from the probability distribution of \(\boldsymbol{Z}_2\) given \(\boldsymbol{Z}_1\) and \(\boldsymbol{\theta}^{(l-1)}\), and where \(\pi_H(\boldsymbol{\theta}) \propto \{\pi(\boldsymbol{\theta})\}^H\) is a concentrated version of the chosen prior. Note that when \(\pi(\boldsymbol{\theta})\) is uniform, as is the case in our working example, the distribution implied by \(\pi_H(\cdot)\) is the same as that implied by \(\pi(\cdot)\).

Given the conditionally simulated data, the neural EM algorithm performs the above EM update using a neural network that returns the MAP estimate from those data. Such a neural network, which we refer to as a neural MAP estimator, can be obtained by training a neural Bayes estimator under a continuous relaxation of the 0–1 loss function, for example, the loss function, \[ L_{\kappa}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \textrm{tanh}(\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|/\kappa), \quad \kappa > 0, \] which yields the 0–1 loss function in the limit as \(\kappa \to 0\).
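To make this loss concrete, here is a small, purely illustrative R implementation of \(L_{\kappa}(\cdot, \cdot)\); during training, the loss is specified via tanhloss() in Julia, as shown below:

# Illustrative only: the tanh surrogate for the 0-1 loss
tanh_loss <- function(theta_hat, theta, kappa = 0.1) {
  tanh(sqrt(sum((theta_hat - theta)^2)) / kappa)  # tanh(||theta_hat - theta|| / kappa)
}
tanh_loss(0.19, 0.2)  # approximately 0.1: an estimate close to the truth incurs a small loss
tanh_loss(0.35, 0.2)  # approximately 0.9: an estimate far from the truth incurs a loss near one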

An advantage of using the neural EM algorithm is that the neural-network architecture does not need to be altered compared with that used in the complete-data case, and we can therefore use our complete-data estimator trained earlier as the starting point of our neural MAP estimator; this is known as pretraining, and it can substantially reduce the computational cost of training.

Below, we train a neural MAP estimator, employing the surrogate for the 0–1 loss given above with \(\kappa = 0.1\):

# Train the estimator under the tanh loss, a surrogate for the 0-1 loss 
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val, 
  loss = tanhloss(0.1)
)
#> Computing the initial validation risk... Initial validation risk = 0.18875616962708677
#> Computing the initial training risk... Initial training risk = 0.16363383117491395
#> Epoch: 1  Training risk: 0.17  Validation risk: 0.198  Run time of epoch: 7.31 seconds
#> Epoch: 2  Training risk: 0.168  Validation risk: 0.191  Run time of epoch: 6.45 seconds
#> Epoch: 3  Training risk: 0.167  Validation risk: 0.193  Run time of epoch: 6.311 seconds
#> Epoch: 4  Training risk: 0.167  Validation risk: 0.188  Run time of epoch: 6.317 seconds
#> Epoch: 5  Training risk: 0.165  Validation risk: 0.19  Run time of epoch: 6.306 seconds
#> Epoch: 6  Training risk: 0.163  Validation risk: 0.191  Run time of epoch: 6.299 seconds
#> Epoch: 7  Training risk: 0.162  Validation risk: 0.192  Run time of epoch: 6.316 seconds
#> Epoch: 8  Training risk: 0.16  Validation risk: 0.192  Run time of epoch: 6.31 seconds
#> Epoch: 9  Training risk: 0.159  Validation risk: 0.189  Run time of epoch: 6.299 seconds
#> Epoch: 10  Training risk: 0.158  Validation risk: 0.191  Run time of epoch: 6.313 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

Since the neural network and its training are exactly the same as in the complete-data case, the methods for assessing the estimator that we describe in Section 1 can be applied directly here.

Next, we define a function for conditional simulation. For the current model, this simply involves sampling from a conditional multivariate Gaussian distribution (see, e.g., here).
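Specifically, writing \(\boldsymbol{\Sigma}_{11} \equiv \textrm{cov}(\boldsymbol{Z}_1, \boldsymbol{Z}_1)\), \(\boldsymbol{\Sigma}_{22} \equiv \textrm{cov}(\boldsymbol{Z}_2, \boldsymbol{Z}_2)\), and \(\boldsymbol{\Sigma}_{12} \equiv \textrm{cov}(\boldsymbol{Z}_1, \boldsymbol{Z}_2)\), the standard Gaussian conditioning result gives \[ \boldsymbol{Z}_2 \mid \boldsymbol{Z}_1, \theta \sim \textrm{Gau}\big(\boldsymbol{\Sigma}_{12}'\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{Z}_1,\, \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{12}'\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\big), \] and the function below computes this conditional mean and covariance via the lower Cholesky factor of \(\boldsymbol{\Sigma}_{11}\):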

simulateconditional <- function(Z, theta, nsims = 1){
  
  # Coerce theta to scalar if given as a matrix
  theta <- theta[1] 
  
  # Save the original dimensions
  dims <- dim(Z) 
  N <- nrow(Z)
  
  # Convert to vector 
  Z <- c(Z) 
  
  # Indices of observed and missing elements 
  I_1 <- which(!is.na(Z)) 
  I_2 <- which(is.na(Z))  
  n_1 <- length(I_1)
  n_2 <- length(I_2)
  
  # Extract the observed data 
  Z_1 <- Z[I_1] 
  
  # Spatial locations and distance matrices
  S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
  D <- as.matrix(dist(S))  
  D_22 <- D[I_2, I_2]
  D_11 <- D[I_1, I_1]
  D_12 <- D[I_1, I_2]
  
  # Marginal covariance matrices
  Sigma_22 <- exp(-D_22 / theta)
  Sigma_11 <- exp(-D_11 / theta)
  Sigma_12 <- exp(-D_12 / theta)
  
  # Conditional covariance matrix, cov(Z_2 | Z_1, theta), its Cholesky factor, 
  # and the conditional mean, E(Z_2 | Z_1, theta)
  L_11 <- t(chol(Sigma_11))
  x <- solve(L_11, Sigma_12)
  y <- solve(L_11, Z_1)
  Sigma <- Sigma_22 - crossprod(x)
  L <- t(chol(Sigma))
  mu <- c(crossprod(x, y))
  
  # Simulate from the conditional distribution Z_2 | Z_1, theta ~ Gau(mu, Sigma)
  z <- matrix(rnorm(n_2 * nsims), nrow = n_2, ncol = nsims)
  Z_2 <- mu + L %*% z
  
  # Combine Z_1 with conditionally-simulated replicates of Z_2
  Z <- sapply(1:nsims, function(l) {
    z <- rep(NA, n_1 + n_2)
    z[I_1] <- Z_1
    z[I_2] <- Z_2[, l]
    z
  })
  
  # Convert Z to an array with appropriate dimensions
  Z <- array(Z, dim = c(dims, 1, nsims))
  
  return(Z)
}

Let’s visualise a few conditional simulations given incomplete data \(\boldsymbol{Z}_1\). Below, the left panel shows the incomplete data \(\boldsymbol{Z}_1\), and the remaining panels show conditional simulations given \(\boldsymbol{Z}_1\) and the true parameter \(\theta = 0.2\).
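A minimal sketch for producing such a plot is given below, assuming \(\boldsymbol{Z}_1\) is the incomplete data generated at the end of Section 2.1 (the plotting choices are illustrative):

Z_sims <- simulateconditional(drop(Z1), theta = 0.2, nsims = 3)
N <- 16
S <- expand.grid(s1 = seq(1, 0, len = N), s2 = seq(0, 1, len = N))
panels <- c("Z1 (incomplete)", paste("Conditional simulation", 1:3))
df <- do.call(rbind, lapply(0:3, function(l) {
  field <- if (l == 0) c(drop(Z1)) else c(Z_sims[, , 1, l])
  data.frame(S, Z = field, panel = panels[l + 1])
}))
df$panel <- factor(df$panel, levels = panels)
ggplot(df) + 
  geom_tile(aes(x = s2, y = s1, fill = Z)) + 
  facet_wrap(~ panel, nrow = 1) + 
  theme_bw()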

The final step is to define a function that implements the Monte Carlo EM algorithm. This involves the specification of an initial estimate \(\boldsymbol{\theta}^{(0)}\), the maximum number of iterations, and a convergence criterion:

# Monte Carlo EM algorithm 
EM <- function(Z1,                # incomplete data
               estimator,         # (neural) MAP estimator
               theta_0,           # initial estimate
               niterations = 100, # maximum number of iterations
               tolerance = 0.01,  # convergence tolerance
               nconsecutive = 3,  # number of consecutive iterations for which the convergence criterion must be met
               nsims = 1,         # Monte Carlo sample size
               verbose = TRUE,    # print current estimate to console if TRUE
               return_iterates = FALSE  # return all iterates if TRUE
               ) {
  
  if(verbose) print(paste("Initial estimate:", theta_0))
  theta_l <- theta_0          # initial estimate
  convergence_counter <- 0    # initialise counter for consecutive convergence
  
  # Initialize a matrix to store all iterates as columns
  p <- length(theta_0)
  theta_all <- matrix(NA, nrow = p, ncol = niterations + 1)
  theta_all[, 1] <- theta_0

  for (l in 1:niterations) {
    # complete the data by conditional simulation
    Z <- simulateconditional(drop(Z1), theta_l, nsims = nsims)
    # compute the MAP estimate from the conditionally sampled replicates
    if ("JuliaProxy" %in% class(estimator)) {
      theta_l_plus_1 <- c(estimate(estimator, Z)) # neural MAP estimator
    } else {
      theta_l_plus_1 <- estimator(Z, theta_l)     # analytic MAP estimator
    }
    # check convergence criterion
    if (max(abs(theta_l_plus_1 - theta_l) / abs(theta_l)) < tolerance) {
      # increment counter if condition is met
      convergence_counter <- convergence_counter + 1  
      # check if convergence criterion has been met for required number of iterations
      if (convergence_counter == nconsecutive) {        
        if(verbose) message("The EM algorithm has converged")
        theta_all[, l + 1] <- theta_l_plus_1  # store the final iterate
        break
      }
    } else {
      # reset counter if condition is not met
      convergence_counter <- 0  
    }
    theta_l <- theta_l_plus_1  
    theta_all[, l + 1] <- theta_l  # store the iterate
    if(verbose) print(paste0("Iteration ", l, ": ", theta_l))
  }
  
  # Remove unused columns if convergence occurred before max iterations
  theta_all <- theta_all[, 1:(l + 1), drop = FALSE]

  # Return all iterates if return_iterates is TRUE, otherwise return the last iterate
  if (return_iterates) {
    return(theta_all)
  } else {
    return(theta_all[, ncol(theta_all)])
  }
}

We are now ready to apply the neural EM algorithm with incomplete data. Here, we use the same incomplete data \(\boldsymbol{Z}_1\) simulated conditionally on \(\theta = 0.2\) at the end of the preceding subsection:

theta_0 <- 0.1
EM(Z1, estimator, theta_0, nsims = 100)
#> [1] "Initial estimate: 0.1"
#> [1] "Iteration 1: 0.162930652499199"
#> [1] "Iteration 2: 0.191277891397476"
#> [1] "Iteration 3: 0.201464504003525"
#> [1] "Iteration 4: 0.208807781338692"
#> [1] "Iteration 5: 0.207885518670082"
#> [1] "Iteration 6: 0.205317720770836"
#> [1] "Iteration 7: 0.204709783196449"
#> [1] "Iteration 8: 0.206982254981995"
#> [1] "Iteration 9: 0.209661170840263"
#> [1] "Iteration 10: 0.21269303560257"
#> [1] "Iteration 11: 0.212645277380943"
#> [1] "Iteration 12: 0.211384132504463"
#> The EM algorithm has converged
#> [1] 0.2095863

Finally, we visualise the Monte Carlo variability of the EM iterates for several values of the Monte Carlo sample size, \(H\):

all_H <- c(1, 10, 100)
dfs <- list()
for (H in all_H) {
    estimates <- c(EM(Z1, estimator, theta_0, nsims = H, return_iterates = TRUE, verbose = FALSE, tolerance = 0.0001))
    df <- data.frame(
      iteration = 1:length(estimates),
      estimate = estimates, 
      H = H
      )
    dfs <- c(dfs, list(df))
}
df <- do.call(rbind, dfs)
ggplot(df) + 
  geom_line(aes(x = iteration, y = estimate)) + 
  facet_wrap(~ H, labeller = labeller(H = function(labels) paste0("H = ", labels)), nrow = 1) + 
  theme_bw()

2.3 Summary

We have considered two approaches that facilitate neural Bayes estimation with incomplete data.

  1. The masking approach, where the input to the neural network is the complete-data vector with missing entries replaced by a constant (typically zero), along with a vector of indicator variables that encode the missingness pattern.

    + Does not require conditional simulation, and is therefore broadly applicable.

    + Can be used with all loss functions that are amenable to gradient-based training.

    - Needs the neural network to take an additional input (the missingness pattern).

    - More complicated learning task.

    - Requires a model for the missingness mechanism.

  2. The neural EM algorithm, a Monte Carlo EM algorithm where the incomplete data are completed using conditional simulation.

    + Neural-network architecture is the same as that used in the complete-data case.

    + Simpler learning task (mapping from the complete data to the parameter space).

    + Does not require a model for the missingness mechanism.

    - Requires conditional simulation, which places restrictions on the applicable class of models.

    - Limited to providing MAP estimates.