This vignette demonstrates how to load a trained MuVI
model using the MOFA2 R package, and perform a typical downstream analysis for multi-omics data. Briefly, MuVI
is a multi-view latent variable model with domain-informed structured sparsity, that integrates (noisy) feature sets e.g. gene set annotations. In this analysis we rely on a pretrained MuVI
model according to this python notebook on RNA-seq and ATAC-seq, in order to infer a joint latent space that is informed a priori by cell type gene set annotations.
library(data.table)
library(tibble)
library(ggplot2)
library(rhdf5)
# GSEA
library(msigdbr)
# make sure that `MOFA2` is imported last, to avoid collisions with functions from other packages
library(MOFA2)
Load the MuVI
model with the training data. Essentially, we call the underlying load_model
from MOFA2
but rename the latent factors according to the prior information provided when training MuVI
. For instance, instead of having Factor1
, we have the actual label of the prior gene set informing this particular factor, e.g. Naive T Cell
.
load_muvi <- function(file) {
factors_names <- h5read(file, "/factors/group_0")
model <- load_model(file,
sort_factors = FALSE,
remove_inactive_factors = FALSE)
factors_names(model) <- factors_names
return(model)
}
muvi <- load_muvi("10x_multiome_muvi.hdf5")
muvi
## Trained MOFA with the following characteristics:
## Number of views: 2
## Views names: rna atac
## Number of features (per view): 3026 6000
## Number of groups: 1
## Groups names: group_0
## Number of samples (per group): 9534
## Number of factors: 15
Variance decomposition provides two main insights.
First, the overall variance explained serves as a check for the quality of the data fit. In this case, the decomposition performed by MuVI
explains slightly less than 50% of variance in the RNA and around 30% in the ATAC view.
plot_variance_explained(muvi, plot_total = TRUE)[[2]]
Second, it quantifies the percentage of variance explained by each factor in each view, thereby identifying shared and private factors. For instance, we see that the factor informed by the Naive T Cell
gene set explains a large portion of the variance, and is also shared across RNA and ATAC.
plot_variance_explained(muvi)
Since the latent factors in MuVI
are pre-labelled by design, we may directly inspect how they differentiate the cell population. In this case, we see that the latent factor informed by the Naive T Cell
gene set assigns higher scores to the 4 groups of T cells. In addition, the Follicular B Cell
factor helps distinguish B cells from the rest of the cell population. The same can be said about Nk Cells
regarding natural killer cells and Dendritic Cell
regarding mDCs and pDCs.
relevant_factors <-
c("Naive T Cell", "Follicular B Cell", "Nk Cells", "Dendritic Cell")
plot_factor(muvi,
factors = relevant_factors,
group_by = "rna.celltype",
color_by = "rna.celltype") +
theme(axis.text.x = element_text(
color = "black",
angle = 90,
vjust = 0.5,
hjust = 1
))
The factor loadings quantify the contribution of each feature (gene) to each latent factor, e.g. pathway.
For the rest of the analysis we focus on the Follicular B Cell
factor/pathway.
We first plot the distribution of all factor loadings in the RNA. Notice we only have non-negative weights, as the model has been constrained during training. Among others we notice several B cell markers such as BANK1
, IGHM
, MS4A1
, PAX5
and so on.
plot_weights(
muvi,
view = "rna",
factors = "Follicular B Cell",
nfeatures = 20,
text_size = 4
)
We get a clearer picture of the relevant features for this particular factor by plotting only the top loadings.
plot_top_weights(
muvi,
view = "rna",
factors = "Follicular B Cell",
sign = "positive",
nfeatures = 20,
)
Using MOFA2
we may also assess the correlation of the factor scores with respect to the most relevant features.
plot_data_scatter(
muvi,
view = "rna",
factor = "Follicular B Cell",
features = 6,
color_by = "rna.celltype",
add_lm = T,
dot_size = 1
)
A powerful aspect of MuVI
is pathway information transfer. In particular, we have only informed the RNA, but expect the ATAC view to be implicitly informed via shared factors. For instance, the corresponding gene symbol CD74
and BLK
are also deemed highly relevant for the Follicular B Cell
pathway, even in the uninformed ATAC view. Potentially, we may even extract a Follicular B Cell
ATAC signature from the top loadings.
plot_top_weights(
muvi,
view = "atac",
factors = "Follicular B Cell",
sign = "positive",
nfeatures = 20,
)
As opposed to scatterplots which focus mostly on two dimensions, we may render heatmaps to visualise the overall patterns of each factor. In this case, we observe that B cells are clearly identifiable by significantly larger Follicular B Cell
scores.
plot_data_heatmap(
muvi,
view = "rna",
factor = "Follicular B Cell",
features = 25,
show_rownames = T,
show_colnames = F,
cluster_rows = T,
cluster_cols = F,
annotation_samples = "rna.celltype"
)
As we already notice a very strong correlation of the Follicular B Cell
with BANK1
we may visualise a stripplot of the factor scores (y-axis) grouped by each cell type (x-axis) and coloured by the BANK1
expression.
plot_factor(muvi,
factors = "Follicular B Cell",
group_by = "rna.celltype",
color_by = "BANK1") +
theme(axis.text.x = element_text(
color = "black",
angle = 30,
vjust = 1,
hjust = 1
))
Similar to the principal components of a PCA decomposition, MuVI
infers a compact latent space governed by domain knowledge in terms of feature (gene) sets. Therefore, we may apply clustering algorithms that learn non-linear manifolds such as t-SNE or UMAP. Here, we apply UMAP and further compress the latent space into two dimensions for better visualisation.
muvi <- run_umap(muvi,
n_neighbors = 15,
min_dist = 0.50)
plot_dimred(
muvi,
method = "UMAP",
color_by = "rna.celltype",
label = TRUE,
stroke = 0.05,
dot_size = 1,
legend = FALSE
)
We can then go over each latent factor, labelled by their corresponding gene set, and colour each sample by the inferred factor score. When comparing to the previous UMAP plot, we clearly see how each latent factor can be strongly identified with their corresponding cell type.
for (i in relevant_factors) {
p <- plot_dimred(
muvi,
method = "UMAP",
color_by = i,
stroke = 0.05,
dot_size = 1
) + ggtitle(i)
print(p)
}
Although MuVI
accommodates gene set information a priori, it is always advisable to perform a GSEA procedure after training in order to test whether the inferred latent factors (posterior gene sets) still resemble the prior gene sets. In cases where the prior information is not suitable or is severely noisy, the model will attempt to refine the prior gene set annotations given enough evidence from the data.
We perform GSEA on the prior gene set collection from the C8 category in MSigDB. The original gene set names contain the prefix HAY_BONE_MARROW_...
but have been prettified for the plots, that is HAY_BONE_MARROW_NAIVE_T_CELL
is renamed to Naive T Cell
, and so on.
As the test results indicate, not only can we ensure that each inferred latent factor significantly matches the prior gene set, we also see that the prior gene set is typically ranked first as the most significant posterior gene set candidate.
msigdb.matrix <- msigdbr(species = "Homo sapiens",
category = "C8") %>% as.data.table %>% .[, id := 1] %>%
dcast(gs_name ~ gene_symbol, value.var = "id", fill = 0) %>%
remove_rownames %>%
column_to_rownames(var = "gs_name")
# GSEA on positive weights only because we have no negative weights
gsea.positive <- run_enrichment(muvi,
feature.sets = msigdb.matrix == 1,
view = "rna",
sign = "positive")
for (i in relevant_factors) {
p <-
plot_enrichment(gsea.positive, factor = i, max.pathways = 15) + ggtitle(i)
print(p)
}
We may also inspect the top loadings that contributed to the test results.
plot_enrichment_detailed(
gsea.positive,
factor = "Follicular B Cell",
max.genes = 10,
max.pathways = 5
)
We perform GSEA on another gene set collection, e.g. GO terms, in order to examine whether we can recover similar themes regarding each inferred factor.
Again, the test results point towards the same underlying biological processes, even when comparing to a different gene set collection.
msigdb.matrix <- msigdbr(species = "Homo sapiens",
category = "C5",
subcategory = "BP") %>% as.data.table %>% .[, id := 1] %>%
dcast(gs_name ~ gene_symbol, value.var = "id", fill = 0) %>%
remove_rownames %>%
column_to_rownames(var = "gs_name")
# GSEA on positive weights only because we have no negative weights
gsea.positive <- run_enrichment(muvi,
feature.sets = msigdb.matrix == 1,
view = "rna",
sign = "positive")
for (i in relevant_factors) {
p <-
plot_enrichment(gsea.positive, factor = i, max.pathways = 15) + ggtitle(i)
print(p)
}
plot_enrichment_detailed(
gsea.positive,
factor = "Follicular B Cell",
max.genes = 10,
max.pathways = 5
)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Berlin
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_1.16.0 msigdbr_7.5.1 rhdf5_2.50.1 ggplot2_3.5.1
## [5] tibble_3.2.1 data.table_1.16.4 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
## [4] filelock_1.0.3 fastmap_1.2.0 digest_0.6.37
## [7] lifecycle_1.0.4 magrittr_2.0.3 compiler_4.4.2
## [10] rlang_1.1.4 sass_0.4.9 tools_4.4.2
## [13] utf8_1.2.4 yaml_2.3.10 corrplot_0.95
## [16] knitr_1.49 ggsignif_0.6.4 S4Arrays_1.6.0
## [19] labeling_0.4.3 reticulate_1.40.0 DelayedArray_0.32.0
## [22] plyr_1.8.9 RColorBrewer_1.1-3 abind_1.4-8
## [25] HDF5Array_1.34.0 Rtsne_0.17 babelgene_22.9
## [28] withr_3.0.2 purrr_1.0.2 BiocGenerics_0.52.0
## [31] grid_4.4.2 stats4_4.4.2 fansi_1.0.6
## [34] ggpubr_0.6.0 colorspace_2.1-1 Rhdf5lib_1.28.0
## [37] scales_1.3.0 tinytex_0.54 cli_3.6.3
## [40] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
## [43] reshape2_1.4.4 cachem_1.1.0 stringr_1.5.1
## [46] splines_4.4.2 zlibbioc_1.52.0 parallel_4.4.2
## [49] BiocManager_1.30.25 XVector_0.46.0 matrixStats_1.4.1
## [52] basilisk_1.18.0 vctrs_0.6.5 Matrix_1.7-1
## [55] carData_3.0-5 jsonlite_1.8.9 dir.expiry_1.14.0
## [58] car_3.1-3 bookdown_0.41 IRanges_2.40.1
## [61] S4Vectors_0.44.0 ggrepel_0.9.6 rstatix_0.7.2
## [64] irlba_2.3.5.1 Formula_1.2-5 tidyr_1.3.1
## [67] jquerylib_0.1.4 glue_1.8.0 codetools_0.2-19
## [70] cowplot_1.1.3 uwot_0.2.2 RcppAnnoy_0.0.22
## [73] stringi_1.8.4 gtable_0.3.6 munsell_0.5.1
## [76] pillar_1.9.0 basilisk.utils_1.18.0 htmltools_0.5.8.1
## [79] rhdf5filters_1.18.0 R6_2.5.1 evaluate_1.0.1
## [82] lattice_0.22-5 png_0.1-8 backports_1.5.0
## [85] pheatmap_1.0.12 broom_1.0.7 bslib_0.8.0
## [88] Rcpp_1.0.13-1 nlme_3.1-165 SparseArray_1.6.0
## [91] mgcv_1.9-1 xfun_0.49 MatrixGenerics_1.18.0
## [94] forcats_1.0.0 pkgconfig_2.0.3