Using the gpc package

Introduction

This package is presented as an accompaniment for the group project report for the COMPASS CDT, entitled Pseudo-Marginal Inference for Gaussian Process Classification with Large Datasets.

This package provides functionality to fit binary Gaussian process classification models to large datasets using a pseudo-marginal MCMC approach.

library(gpc)
#> Loading required package: mvtnorm

Usage on the Spam Dataset

In the report, we show that this classification model provides a more accurate classification for the binary problem posed by the e-mail spam dataset. The problem is as follows: given a large dataset of e-mails, each with its corresponding word and character frequencies, can we accurately predict whether a particular e-mail is spam or not?

Loading the Data and Taking a Subset

We have provided access to the spam dataset as part of the package, which can be loaded with
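A minimal sketch of this step, assuming the dataset is exported under the name `spam` (the object name is an assumption):

```r
# Load the spam dataset shipped with the package (dataset name assumed)
library(gpc)
data(spam)

# Inspect the available predictors and the response
str(spam)
```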

We then separate the response from the predictor variables, and divide the data into training and testing sets.
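One way to perform this split, assuming the response is the final column of a data frame named `spam` (the object name and column layout are assumptions):

```r
# Separate the response from the predictors (column layout assumed)
y_all <- spam[, ncol(spam)]            # binary response: spam / not spam
X_all <- as.matrix(spam[, -ncol(spam)])

# Hold out a random 20% of rows as a test set
set.seed(1)
test_idx <- sample(nrow(X_all), floor(0.2 * nrow(X_all)))
X_test  <- X_all[test_idx, ];  y_test  <- y_all[test_idx]
X_train <- X_all[-test_idx, ]; y_train <- y_all[-test_idx]
```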

The immediate difficulty in fitting to this data is its size:
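For example, assuming the dataset object is named `spam`, the size can be checked with:

```r
# Number of observations and variables in the full dataset
dim(spam)
```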

Since we are using MCMC methods with a pseudo-marginal approach, fitting to this data for a sufficient number of iterations would be incredibly time-consuming. We instead choose a subset which maximises the entropy score for a given subset size. We use the ivm_subset_selection function implemented here, and choose a subset size \(d=150\). We also need to specify the covariance function (or kernel) \(k\), as well as the initial values of the hyperparameter vector \(\boldsymbol{\theta}\), since these are used in the subset selection.
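A sketch of this step; the argument names, kernel label, and hyperparameter values below are assumptions, not the package's actual interface (see ?ivm_subset_selection for that):

```r
# Initial hyperparameter values for the kernel (values assumed for illustration)
theta0 <- c(1, 1)
d      <- 150   # subset size used in the text

# Select the d training points maximising the entropy score
# (function signature assumed; X_train is the training model matrix)
sub_idx <- ivm_subset_selection(X_train, d = d, kernel = "rbf", theta = theta0)
```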

Now we can specify the model matrix \(X\) and response vector \(y\) that we will use for training the classification model.
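Continuing the sketch, and assuming `sub_idx` holds the indices returned by the subset selection (all object names here are assumptions):

```r
# Subset model matrix and response used for training the classifier
Xd <- X_train[sub_idx, ]
yd <- y_train[sub_idx]
```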

Fitting the Model

The function gpc is a wrapper that calls the Rcpp functions that fit the classification model, including the pseudo-marginal approximate likelihood and the Laplace approximation. These Rcpp functions are also parallelised using RcppParallel, improving the speed of building the Gram matrix \(K\) and of calculating the pseudo-marginal approximation.

We fit the model as follows:
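A sketch of the call; only the step count of 500 and the Nimp value of 200 come from the text below, while the remaining argument names are assumptions (see ?gpc for the actual interface):

```r
# Fit the GP classification model to the entropy-selected subset
# (argument names assumed; Xd / yd are the subset matrix and response)
fit <- gpc(Xd, yd, kernel = "rbf", theta = theta0,
           steps = 500, Nimp = 200)
```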

Most arguments are self-explanatory, but for a description of all inputs to gpc, see ?gpc. Note that we have only used 500 steps and an Nimp value of 200, which is lower than the values used in the report. When fitting the full model, higher values are used.

Now we can inspect the model using plot.gpc, an S3 method used for plotting gpc objects.
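For example, assuming the fitted object is stored as `fit` (the name is an assumption), S3 dispatch means a plain plot() call suffices:

```r
# Trace and density plots of the hyperparameters and
# the log pseudo-marginal likelihood approximation (the default)
plot(fit)

# The same series of plots for the latent variables f
plot(fit, f = TRUE)
```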

By default, plot.gpc will plot the trace and density plots of the hyperparameters and the log pseudo marginal likelihood approximation. The argument f=TRUE will plot a series of the same plots for the latent variables \(f\).

Predicting from the Model

To obtain predictions from this fit, we can use predict.gpc, which averages over the sampled values of the hyperparameter vector \(\boldsymbol{\theta}\) across all samples and chains to produce, at each data point, a probability of belonging to the positive class.

Note that we fit the model to Xd, the subset model matrix, but predictions can be evaluated on the full training dataset. We do the same for the testing set.
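A sketch of both prediction calls, assuming the fitted object is named `fit` and that predict.gpc accepts the new model matrix as its second argument (the object names and signature are assumptions):

```r
# Probabilities of the positive class on the full training data
p_train <- predict(fit, X_train)

# And on the held-out test set
p_test <- predict(fit, X_test)
```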

From here, we can see the percentage of correct predictions in both cases (using a 0-1 loss: if the predicted probability is greater than 0.5 we predict the positive class, otherwise the negative class).
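This 0-1 loss calculation can be written as a small helper; it is a sketch in base R, with the probabilities and labels below made up purely for illustration:

```r
# Percentage of correct predictions under a 0.5 threshold (0-1 loss)
accuracy <- function(probs, y) {
  100 * mean((probs > 0.5) == (y == 1))
}

# Toy illustration: three of the four thresholded predictions match the labels
accuracy(c(0.9, 0.2, 0.7, 0.4), c(1, 0, 0, 0))   # 75
```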

The percentages are given by