The following vignette provides an overview of the functions in
sinkr with specific relevance for Principal Component
Analysis (PCA), which is sometimes referred to as Empirical Orthogonal
Function (EOF) analysis in the climate sciences.
For most cases, R’s base prcomp function is all that is
needed to conduct PCA. What sinkr provides is additional functionality in a few aspects: identifying the number of “significant” PCA patterns, EOF analysis of spatiotemporal fields, and PCA of “gappy” data (i.e. data containing missing values).
This vignette demonstrates the use of functions for each of these aspects. The following packages should be installed and loaded in order to reproduce the examples:
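The exact load chunk is not reproduced here; a minimal version, based on the packages actually used in the examples below, would be:
library(sinkr)    # eof, dineof, pca_kcv, pca_loocv, kfold, etc.
library(viridis)  # color palettes for the image plots
library(maps)     # world map overlay in the SST example
library(ggplot2)  # comparison plots of loadings and reconstruction error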
Three main data sets will be used in the examples:
- iris - available with the base installation of R, containing morphometric data from 3 species of iris
- wine - available in sinkr, containing chemical properties of 3 different varieties of Italian wine
- Xt - available in sinkr, containing a larger synthetic example of a matrix comprised of 9 mixed signals
The following code chunk loads these data sets and applies some pre-processing.
In PCA, centering (subtracting the mean) is typically done to focus
the analysis on the covariances among variables (i.e. columns in the
matrix). Additional scaling of variables (dividing by the standard
deviation) may be done to remove potential differences in units, but
also to focus on correlation patterns rather than covariance. Centering
and scaling options are applied internally by the two main PCA functions
used in the vignette (prcomp and eof), but
also conducted here before visualizing the iris and
wine data sets.
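As a quick aside (an illustrative snippet, not part of the original workflow), pre-scaling with scale() yields the same component variances as letting prcomp do the centering and scaling internally:
# illustrative check: pre-scaled input vs. prcomp's center/scale. arguments
p1 <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
p2 <- prcomp(scale(iris[, 1:4]), center = FALSE, scale. = FALSE)
all.equal(p1$sdev, p2$sdev)  # TRUE - identical component variances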
# iris morphology
iris2 <- iris[,1:4]
# wine characteristics
data(wine)
wine2 <- wine[,names(wine)!="Type"]
# Synthetic data field
data(Xt)
# The "noisy" field
# noise standard deviation at 10% noise-to-signal ratio
noise_sd <- 0.1 * sd(as.vector(Xt))
# Add Gaussian noise
set.seed(123) # For reproducibility
Xn <- Xt + rnorm(length(Xt), mean = 0, sd = noise_sd)
Xn <- array(Xn, dim = dim(Xt))
par(mfrow = c(1,3), mar = c(3,3,2,1), mgp = c(2,0.5,0))
image(t(as.matrix(scale(iris2))), col = viridis(100), axes = F)
mtext("iris2 (scaled)", line = 0.5)
image(t(as.matrix(scale(wine2))), col = viridis(100), axes = F)
mtext("wine2 (scaled)", line = 0.5)
image(t(Xn), col = viridis(100), axes = F)
mtext("Synthetic", line = 0.5)
mtext(text = "columns", side = 1, line = -2, outer = TRUE)
mtext(text = "rows", side = 2, line = -2, outer = TRUE)PCA will identify patterns that explain the largest variability in a data set, and translates these into a new coordinate system. In this new coordinate system, principal components (PCs) are orthogonal to one another (i.e. uncorrelated) and ranked according to decreasing explained variation.
In many cases, users may be primarily interested in the leading patterns to help with understanding relations in the higher dimensional data set. However, later patterns may be of interest as carriers of information. The process of identifying “significant” PCs may be desired to help differentiate signal from noise. This may be important for guiding interpretation of patterns, but also has applications in further analyses that benefit from the reduced dimensionality of the PCs.
A relatively straightforward approach is to plot the variance
explained by each PCA pattern (prcomp()$sdev^2 or
eof()$Lambda), referred to as a “Scree plot”. These plots
often use a log-scale, whereby the transition to a linear slope may be
associated with the point at which subsequent patterns are only
describing small-scale features or noise.
The following examples show Scree plots for the three example datasets (both the true and noisy versions of the synthetic data are shown). As you can see, eyeballing the transition to noisy patterns is not always straightforward.
p_iris2 <- prcomp(iris2, center = TRUE, scale. = TRUE)
p_wine2 <- prcomp(wine2, center = TRUE, scale. = TRUE)
p_Xt <- prcomp(Xt)
p_Xn <- prcomp(Xn)
par(mfcol = c(1,3), mar = c(3,3,2,1), mgp = c(2,0.5,0))
plot(p_iris2$sdev^2, t = "b", log = "y", ylab = "Variances (log-scaled)")
mtext("iris2 (scaled)", line = 0.5)
plot(p_wine2$sdev^2, t = "b", log = "y", ylab = "Variances (log-scaled)")
mtext("wine2 (scaled)", line = 0.5)
plot(p_Xt$sdev[1:20]^2, t = "b", log = "y", ylab = "Variances (log-scaled)")
mtext("Synthetic", line = 0.5)
lines(p_Xn$sdev[1:20]^2, t = "b", col = 2, pch = 2)
legend("bottomleft", legend = c("Xt (True)", "Xn (w/ noise)"), pch = c(1,2), col = 1:2, bty = "n")As an alternative, sinkr offers two main approaches for
determining “significant” PCA patterns: cross validation and
imputation.
For the cross-validation option, leave-one-out
(pca_loocv) or k-fold (pca_kcv) functions are
available, while dineof is used for imputation.
Depending on the size of the data set, one or the other approach may
be preferable; dineof can be much faster, but requires a
somewhat subjective choice of convergence threshold
(delta.rms).
For cross validation, leave-one-out (pca_loocv)
may be somewhat more robust, but k-fold (pca_kcv) offers
similar performance while being faster.
As shown below for the wine data set, both pca_loocv and pca_kcv arrive at the lowest squared error for the holdout data with 3 PCA patterns.
res_loocv <- pca_loocv(wine2, center = TRUE, scale = TRUE)
res_loocv <- lapply(res_loocv, colSums)$pseudoinverse
loocv_nsig <- which(res_loocv==min(res_loocv))
set.seed(1)
res_kcv <- pca_kcv(wine2, ks = 5, center = TRUE, scale = TRUE, verbose = FALSE)
res_kcv <- lapply(res_kcv, colSums)$pseudoinverse
kcv_nsig <- which(res_kcv==min(res_kcv))
par(mar = c(4,4,1,1))
plot(res_loocv, log = "y", t = "b", pch = 21, col = 3, bg = c(NA,3)[(seq(res_loocv)==loocv_nsig)+1])
lines(res_kcv, t = "b", pch = 21, col = 4, bg = c(NA,4)[(seq(res_kcv)==kcv_nsig)+1])dineof was mainly developed to allow for the
interpolation of missing data by means of PCA; however, the function
will work on full (complete) data sets in the same way.
The function requires the definition of reference data points
(ref.pos) by which to measure convergence. By default the
function will choose 30 values or 1% of the values, whichever is larger,
but users can also supply their own matrix positions for reference. The
algorithm sets these reference values (and any missing values) to zero,
and then iteratively runs a PCA followed by a reconstruction of their
values using a truncated number of PCs. The error of the reconstruction
of reference values is evaluated for convergence, whereby additional PCs
are added if the error of the reconstruction is further reduced. Thus, the maximum number of PCs that minimizes the reconstruction error can be used to define the threshold between larger-scale signals and noise.
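For intuition, the following is a heavily simplified sketch of this idea (a hypothetical helper, dineof_sketch, not the sinkr implementation; it assumes a complete, already centered/scaled input matrix and uses a fixed number of inner iterations instead of a convergence test on delta.rms):
# Simplified DINEOF-style sketch (hypothetical; not sinkr::dineof):
# reference values are zeroed, then repeatedly filled in from a rank-k
# truncated SVD; the rank with the lowest RMS error at the reference
# positions is taken as the number of "significant" patterns.
dineof_sketch <- function(X, ref.pos, max.eof = 10, n.iter = 50){
  Xa <- X
  Xa[ref.pos] <- 0                          # treat reference values as unknown
  rms <- rep(NA, max.eof)
  for(k in seq_len(max.eof)){
    for(i in seq_len(n.iter)){              # iterate the fill-in at rank k
      S <- svd(Xa, nu = k, nv = k)
      recon <- S$u %*% diag(S$d[seq_len(k)], k, k) %*% t(S$v)
      Xa[ref.pos] <- recon[ref.pos]         # update only the unknown positions
    }
    rms[k] <- sqrt(mean((X[ref.pos] - Xa[ref.pos])^2))
  }
  list(Xa = Xa, n.eof = which.min(rms), rms = rms)
}
# usage sketch: X <- as.matrix(scale(wine2)); dineof_sketch(X, sample(length(X), 100))$n.eof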
The choice of reference points will influence the convergence criteria, and thus should be chosen carefully. For most cases, randomly selected values are likely to be appropriate. A sufficient fraction of the data set should also be selected to provide a good representation of the full set, although similar results are observed across a range of percentages (e.g. 1-20%, depending on the size of the data).
The following example uses a similar approach to the k-fold cross
validation, but uses each fold’s holdout set as the reference data.
Using a 20-fold cross-validation, ~5% of the data is held out for each
fold. The function returns the optimal number of patterns
(dineof()$n.eof). Some variability is seen among folds, but
the median value of the solution provides a more robust overall
result.
Note that dineof returns an interpolated matrix
(dineof()$Xa) rather than PCs etc., and thus requires a
subsequent call to prcomp or eof on the
interpolated matrix if desired. Furthermore, any centering or scaling of the data must currently be done beforehand, outside of the dineof function.
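A minimal sketch of that workflow for the wine data (settings chosen for illustration only):
# pre-process outside of dineof, interpolate, then run a standard PCA
wine2sc <- as.matrix(scale(wine2))                 # centering/scaling done beforehand
D <- dineof(wine2sc, delta.rms = 1e-2, verbose = FALSE)
P <- prcomp(D$Xa, center = FALSE, scale. = FALSE)  # D$Xa is already centered/scaled
D$n.eof                                            # estimated number of significant PCs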
Similar to the cross-validation approaches, 3 PCA patterns are
identified for the wine data set.
set.seed(2)
wine2sc <- as.matrix(scale(wine2))
ks <- kfold(length(as.matrix(wine2sc)), k = 20)
dineof_nsig <- lapply(ks, FUN = function(x){
dineof(as.matrix(wine2sc), ref.pos = x, delta.rms = 1e-2, verbose = FALSE)$n.eof
})
unlist(dineof_nsig)
 [1] 3 3 3 3 3 5 3 3 3 3 3 3 2 5 5 2 4 3 2 4
median(unlist(dineof_nsig))
[1] 3
The following table shows the number of significant PCA patterns
identified by pca_kcv and dineof as applied to
all data sets.
Both approaches arrive at the same number of PCA patterns for
iris and wine. In the case of the synthetic
example, where the true field, Xt, is comprised of 9 signals, some differences appear. For pca_kcv, the result reflects the number of variance-carrying PCs seen in the Scree plot above and is higher than 9, possibly due to mixtures of signals that become non-linear. dineof identifies fewer than 9 signals, which is due to the error threshold used (delta.rms = 1e-2), essentially converging before the addition of the less important trailing PCs. In
the noisy version of the data, Xn, these trailing signals
are overcome by noise, and both approaches return the same number of PCA
patterns.
   method dataset nsig
1 pca_kcv iris2sc    3
2  dineof iris2sc    3
3 pca_kcv wine2sc    3
4  dineof wine2sc    3
5 pca_kcv      Xt   14
6  dineof      Xt    6
7 pca_kcv      Xn    5
8  dineof      Xn    5
EOF is the terminology used in meteorological disciplines, and
typically refers to the decomposition of a spatiotemporal field. The
following example uses monthly sea surface temperature anomaly data
from the Equatorial Pacific (included in sinkr).
Columns usually represent the spatial dimension and rows the temporal dimension; i.e. each column is a time series for a given location.
The example shows the rough equivalence between the
prcomp and eof functions. Scaling is not used
in order to focus the analysis on spatial covariance rather than correlation (which is unitless). This is followed by a visualization of the
patterns in space (EOFs) and time (PCs).
data(sst)
# prcomp(sst$field) # this would fail due to empty columns
# remove empty columns for prcomp
incl <- which(colSums(is.na(sst$field)) == 0)
dat <- sst$field[,incl]
P <- prcomp(dat, center = TRUE, scale = FALSE)
# or use eof directly
E <- eof(sst$field, center = TRUE, scale = FALSE)
# results are nearly identical
par(mar = c(4,4,1,1))
plot(P$sdev[1:50]^2, E$Lambda[1:50], log = "xy")
abline(0,1)
eof.num <- 1 # EOF number to plot
par(no.readonly=TRUE, mgp = c(2,0.5,0))
layout(matrix(c(1,3,2,3),nrow=2, ncol=2), widths=c(5,1), heights=c(3,3), respect = TRUE)
par(cex=1, mar=c(4,4,1,1))
PAL <- colorPalette(c("blue", "cyan", "grey90", "yellow", "red"), c(10,1,1,10))
ZLIM <- c(-1,1)*max(abs(E$u[,eof.num]))
COL <- val2col(E$u[,eof.num], col=PAL(100), zlim=ZLIM)
plot(lat ~ lon, data=sst$grid, pch=22, bg=COL, col=COL, cex=2)
mtext(paste("EOF", eof.num))
map("world", add=TRUE)
par(mar=c(4,0,1,4))
imageScale(E$u[,eof.num], col=PAL(100), zlim=ZLIM, axis.pos=4)
par(mar=c(4,4,1,4))
plot(sst$date, E$A[,eof.num], t="l", xlab = "", ylab = "")
lines(loess.smooth(sst$date, E$A[,eof.num], span=1/3), col=rgb(0.5,0.5,1), lwd=2) # smoothed signal
abline(h=0, v=seq(as.Date("1000-01-01"), as.Date("2100-01-01"), by="10 years"), col=8, lty=3)
mtext(paste0("PC ", eof.num, " (expl. var = ",
round((E$Lambda[eof.num]/E$tot.var)*100, 1), "%)"))
The data matrix is quite large (1906 x 252) and is likely to contain many potential signals across space and time, possibly requiring a high-rank model with many significant dimensions to explain its modes of variability.
With such a large matrix, pca_kcv and
pca_loocv may take too long, and dineof may be
preferred. Here is an example with a single dineof
run, using one fold (here 10% of the data) as the reference holdout set. The
setting for delta.rms can be important, and some knowledge about the field’s measured parameter can help guide this level of precision.
As before, the holdout data is randomly selected, which should cover a wide range of spatiotemporal locations to prevent overfitting to some local dynamic.
datsc <- scale(dat, center = TRUE, scale = FALSE)
set.seed(3)
ks <- kfold(length(as.matrix(dat)), k = 10)
D <- dineof(datsc, delta.rms = 1e-2, ref.pos = ks[[1]], verbose = F)
par(mar = c(4,4,2,1), mfcol = c(1,2))
plot(E$Lambda[1:40]/E$tot.var*100, log = "y", ylab = "Expl. var (log-scaled) [%]")
abline(v = D$n.eof, lty = 2)
plot(cumsum(E$Lambda[1:40]/E$tot.var*100), xlab = "", ylab = "Cum. expl. var. [%]")
abline(v = D$n.eof, lty = 2)
21 patterns were identified, although some of these trailing patterns are likely smaller-scale features given that they lie on the linearly decreasing part of the Scree plot (left). Depending on the focus of the analysis, other truncation thresholds may be preferred.
Finally, one of the main features offered by sinkr is
when conducting PCA on “gappy” data - i.e. containing missing
values.
We will start with the wine data set and add some gaps (20%).
dat <- as.matrix(wine2)
set.seed(6)
# 20% gaps
frac_gaps <- 0.2
gaps <- sample(length(dat), length(dat)*frac_gaps)
dat_g <- dat
dat_g[gaps] <- NaN
image(t(scale(dat_g)), col = viridis(100))
mtext("wine2 (scaled) with 20% gaps", line = 0.5)Two approaches are available in sinkr. The first is the
previously used dineof (Data Interpolating Empirical
Orthogonal Functions), where both gaps and reference values are
interpolated. This is then followed by a call to another PCA function on
the interpolated matrix (e.g. prcomp).
The second approach uses eof, which always calculates
PCA via a covariance matrix; specifically cov4gappy, which
can handle missing values. Using the argument
recursive = TRUE, the function calculates Recursively
Subtracted Empirical Orthogonal Functions (rseof), which performs
better on covariance matrices that are not positive definite (as is
typically the case with gappy data).
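For intuition only, a rough base-R analogue (not the cov4gappy implementation) is a covariance matrix computed from the pairs of observations that are jointly non-missing; its eigen-decomposition yields EOF-like loadings, but such a matrix need not be positive definite:
# conceptual analogue of a gappy covariance (not sinkr::cov4gappy)
C_pw <- cov(dat_g, use = "pairwise.complete.obs")
e <- eigen(C_pw)
tail(e$values)   # trailing eigenvalues may be near zero or slightly negative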
The following shows the application of both approaches.
# Method #1: dineof to interpolate, then prcomp
# data should probably be pre-processed - centered and (possibly) scaled
dat_g_sc <- scale(dat_g)
# 10% of non-gaps as reference
ref.pos <- sample(seq(dat_g)[-gaps], sum(!is.na(dat_g))*0.1)
D <- dineof(dat_g_sc, ref.pos = ref.pos, delta.rms = 1e-3, verbose = F, method = "svds")
Xa <- unscale(D$Xa) # unscale data
P <- prcomp(Xa, center = T, scale. = T)
# Method # 2: direct rseof
E <- eof(dat_g, recursive = TRUE, center = T, scale = T)
Both approaches produce leading PCs that are highly correlated with those of the full, non-gappy data set, with some lower correlations in the intermediate PCs.
# PCA on full data set for comparison
Pfull <- prcomp(wine2, center = TRUE, scale. = TRUE)
# prep of pc loadings
mat1 <- Pfull$rot
mat2 <- E$u
dimnames(mat2) <- dimnames(mat1)
mat3 <- P$rot
dimnames(mat3) <- dimnames(mat1)
# calculation of correlation among pc loadings between full and gappy data sets
tmp1 <- as.data.frame(as.table(
abs(cor(mat1, mat3))))
names(tmp1) <- c("loadings_full", "loadings_gappy", "abs_cor")
tmp1$method <- "dineof+prcomp"
tmp2 <- as.data.frame(as.table(
abs(cor(mat1, mat2))))
names(tmp2) <- c("loadings_full", "loadings_gappy", "abs_cor")
tmp2$method <- "rseof"
tmp <- merge(tmp1, tmp2, all = T)
ggplot(tmp) + aes(x = loadings_full, y = loadings_gappy, fill = abs_cor) +
facet_wrap(~method) +
geom_tile() +
scale_fill_viridis_c() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Finally, we can see how the two approaches do in terms of
reconstruction. The following plots show the error in the reconstructed
data values as compared to the full (true) data. Error is shown as a
function of using progressively more PCA patterns for the
reconstruction, and uses the functions prcompRecon and
eofRecon for PCA results derived from prcomp
and eof, respectively.
Both approaches perform very similarly. A minimum error in the gaps is observed using around 4-7 PCs, while the error in the non-gaps shows a similar decrease with increasing PCs.
res <- expand.grid(n = seq(E$Lambda), method = c("dineof+prcomp", "rseof"),
data = c("gaps", "non-gaps", "all"), stringsAsFactors = FALSE)
res$rmse <- NaN
for(i in seq(E$Lambda)){
R1 <- prcompRecon(P, pcs = seq(i), unscale = T, uncenter = T)
R2 <- eofRecon(E, pcs = seq(i), uncenter = T, unscale = T)
# gaps
idx <- which(res$n == i & res$method == "dineof+prcomp" & res$data == "gaps")
res$rmse[idx] <- sqrt(mean((dat[gaps] - R1[gaps])^2))
idx <- which(res$n == i & res$method == "rseof" & res$data == "gaps")
res$rmse[idx] <- sqrt(mean((dat[gaps] - R2[gaps])^2))
# non-gaps
idx <- which(res$n == i & res$method == "dineof+prcomp" & res$data == "non-gaps")
res$rmse[idx] <- sqrt(mean((dat[-gaps] - R1[-gaps])^2))
idx <- which(res$n == i & res$method == "rseof" & res$data == "non-gaps")
res$rmse[idx] <- sqrt(mean((dat[-gaps] - R2[-gaps])^2))
# all
idx <- which(res$n == i & res$method == "dineof+prcomp" & res$data == "all")
res$rmse[idx] <- sqrt(mean((dat[] - R1[])^2))
idx <- which(res$n == i & res$method == "rseof" & res$data == "all")
res$rmse[idx] <- sqrt(mean((dat[] - R2[])^2))
}
ggplot(res) + aes(x = n, y = rmse, group = method, color = method) +
facet_grid(~data) +
geom_vline(xintercept = D$n.eof, lty = 2) +
geom_line() +
labs(title = "Reconstruction error", x = "Principal Components [n]",
y = "Root mean squared error (RMSE)")