This package accompanies the group project report for the COMPASS CDT, entitled Pseudo-Marginal Inference for Gaussian Process Classification with Large Datasets.
This package provides functionality to select an informative, entropy-maximising subset of a large dataset (ivm_subset_selection), to fit a Gaussian process classification model using a pseudo-marginal approach (gpc), and to plot, print and predict from the fitted model (plot.gpc, print.gpc and predict.gpc).
In the report, we show that this classification model provides a more accurate classification of the binary problem encountered in the e-mail spam dataset. The problem is as follows: given a large dataset of e-mails, with their corresponding word and character frequencies for different cases, can we accurately predict whether a particular e-mail is spam or not spam?
We have provided access to the spam dataset as part of the package, which can be loaded as shown below.
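A minimal sketch of loading the data, assuming the dataset is exported in the usual way (the exact call may differ if the package uses lazy data loading):
data(spam)  # e-mail spam dataset shipped with the package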
We then separate the response from the predictor variables and split the data into training and testing sets.
# 60/40 split of the spam data into training and testing sets
cut = round(0.6 * nrow(spam))
train_ind = sample(1:nrow(spam), cut)
test_ind = (1:nrow(spam))[-train_ind]

# column 1 holds the response; the remaining columns are the predictors
y = spam[train_ind, 1]
X = spam[train_ind, 2:ncol(spam)]
yp = spam[test_ind, 1]
Xp = spam[test_ind, 2:ncol(spam)]
The immediate problem with fitting to these data is their size.
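A quick check of the dimensions makes this concrete; the counts in the comment below can be read off the training and testing sizes reported later in the model output.
dim(spam)  # 4601 e-mails, 58 columns (1 response + 57 predictors)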
Since we are using MCMC methods with a pseudo-marginal approach, fitting to these data for a sufficient number of iterations would be incredibly time consuming. We instead choose a subset which maximises the entropy score for a given subset size. We use the ivm_subset_selection function implemented here, and choose a subset size of \(d = 150\). We also need to specify the covariance function (or kernel) \(k\), as well as the initial values of the hyperparameter vector \(\boldsymbol{\theta}\), since these are used in the subset selection.
# squared-exponential (Gaussian) kernel: theta[1] scales the amplitude, theta[2] is the lengthscale
kernel = function(x, y, theta) theta[1] * exp(-0.5 / theta[2]^2 * sum((x - y)^2))
init_theta = c(1, 1)

# select the d = 150 training rows that maximise the entropy score
subset = ivm_subset_selection(X, kernel, init_theta, nsub = 150)
Now we can construct the subset model matrix Xd and response vector yd that we will use for training the classification model.
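A minimal sketch of this step, assuming ivm_subset_selection returns the row indices of the selected points:
yd = y[subset]    # responses for the selected subset
Xd = X[subset, ]  # subset model matrix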
The function gpc is a wrapper that calls the Rcpp functions that fit the classification model, including the pseudo-marginal approximation to the likelihood and the Laplace approximation. These Rcpp functions are parallelised using RcppParallel, which improves the speed of building the Gram matrix \(K\) and of calculating the pseudo-marginal approximation.
We fit the model as follows:
gp_fit = gpc(y = yd,                 # subset response vector
             X = Xd,                 # subset model matrix
             nsteps = 500,           # MCMC iterations per chain
             nburn = 250,            # burn-in iterations
             nchains = 2,            # number of independent chains
             nimp = 100,             # samples for the pseudo-marginal likelihood estimate
             init_theta = c(1, 1),   # initial hyperparameter values
             kernel = "gaussian",
             print_every = Inf)
#> Fitting a GP classification model.
#> Running with 2 chain, 500 steps.
#> Data length n = 150, dimension p = 57.
#> Number of hyperparameters: 2
#> _____________________________________
#> Chain 1 / 2 starting...
#> Warning in algo_1(y, X, nsteps, nburn, nimp, init_theta, init_marginal_lik, :
#> NAs introduced by coercion to integer range
#> Chain 1 Running: 1/500 iterations completed. || Acceptance ratio: 0
#>
#> Chain completed successfully.
#> Chain 2 / 2 starting...
#> Warning in algo_1(y, X, nsteps, nburn, nimp, init_theta, init_marginal_lik, :
#> NAs introduced by coercion to integer range
#> Chain 2 Running: 1/500 iterations completed. || Acceptance ratio: 0
#>
#> Chain completed successfully.
Most arguments are self-explanatory, but for a description of all inputs to gpc, see ?gpc. Note that we have only used 500 steps and nimp = 100, which is lower than the values used in the report; when fitting the full model, higher values are used.
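The fitted object can be summarised in the console with the print.gpc method listed above, for example:
print(gp_fit)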
Now we can inspect the model using plot.gpc, an S3 method used for plotting gpc objects.
By default, plot.gpc will plot the trace and density plots of the hyperparameters and of the log pseudo-marginal likelihood approximation. The argument f = TRUE will produce the same series of plots for the latent variables \(f\).
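For example, to produce both sets of plots:
plot(gp_fit)            # trace and density plots for theta and the log pseudo-marginal likelihood
plot(gp_fit, f = TRUE)  # the same plots for the latent variables f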
To obtain predictions from this fit, we can use predict.gpc, which averages over the sampled values of the hyperparameter vector theta across all samples and chains to produce, at each data point, the probability of the positive class.
pred_train = predict(gp_fit, X)
#> Warning in predict_gp(object$y, object$X, newdata, object$kernel,
#> fit_all_chains, : NAs introduced by coercion to integer range
#> Predicting from GP classification model.
#> Averaging across all 500 steps, 2 chain(s).
#> Fit Data length n = 150, dimension p = 57.
#> New data length n = 2761, dimension p = 57.
#> _____________________________________
#>
#> Completed.
Note that we fit the model to Xd
, the subset model matrix, but predictions can be evaluated on the full training dataset. We do the same for the testing set.
pred_test = predict(gp_fit, Xp)
#> Warning in predict_gp(object$y, object$X, newdata, object$kernel,
#> fit_all_chains, : NAs introduced by coercion to integer range
#> Predicting from GP classification model.
#> Averaging across all 500 steps, 2 chain(s).
#> Fit Data length n = 150, dimension p = 57.
#> New data length n = 1840, dimension p = 57.
#> _____________________________________
#>
#> Completed.
From here, we can compute the percentage of correct predictions in both cases (using a 0-1 loss: if the predicted probability is greater than 0.5 we predict the positive class, otherwise the negative class).
# convert predicted probabilities into class labels in {-1, 1}
output_pred_train = ifelse(pred_train > 0.5, 1, -1)
output_pred_test = ifelse(pred_test > 0.5, 1, -1)
The percentages of correct predictions are then the proportions of predicted labels that match the true labels.
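A minimal sketch of that calculation, assuming the responses y and yp are coded as -1/1 to match the predicted labels above:
mean(output_pred_train == y) * 100  # percentage correct on the full training set
mean(output_pred_test == yp) * 100  # percentage correct on the testing set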