1 Introduction
2 Model Loading
3 Variance Decomposition
4 Investigating Factors
5 Non-linear Dimensionality Reduction
6 Gene Set Enrichment Analysis
- 6.1 Significance to Prior Information
- 6.2 Matching to Other Collections

1 Introduction

This vignette demonstrates how to load a trained MuVI model using the MOFA2 R package, and perform a typical downstream analysis for multi-omics data. Briefly, MuVI is a multi-view latent variable model with domain-informed structured sparsity, that integrates (noisy) feature sets e.g. gene set annotations. In this analysis we rely on a pretrained MuVI model according to this python notebook on RNA-seq and ATAC-seq, in order to infer a joint latent space that is informed a priori by cell type gene set annotations.

library(data.table)
library(tibble)
library(ggplot2)
library(rhdf5)

# GSEA
library(msigdbr)

# make sure that `MOFA2` is imported last, to avoid collisions with functions from other packages
library(MOFA2)

2 Model Loading

Load the MuVI model with the training data. Essentially, we call the underlying load_model from MOFA2 but rename the latent factors according to the prior information provided when training MuVI. For instance, instead of having Factor1, we have the actual label of the prior gene set informing this particular factor, e.g. Naive T Cell.

load_muvi <- function(file) {
  factors_names <- h5read(file, "/factors/group_0")
  model <- load_model(file,
                      sort_factors = FALSE,
                      remove_inactive_factors = FALSE)
  factors_names(model) <- factors_names
  return(model)
}
muvi <- load_muvi("10x_multiome_muvi.hdf5")
muvi

## Trained MOFA with the following characteristics: 
##  Number of views: 2 
##  Views names: rna atac 
##  Number of features (per view): 3026 6000 
##  Number of groups: 1 
##  Groups names: group_0 
##  Number of samples (per group): 9534 
##  Number of factors: 15

3 Variance Decomposition

Variance decomposition provides two main insights.
First, the overall variance explained serves as a check for the quality of the data fit. In this case, the decomposition performed by MuVI explains slightly less than 50% of variance in the RNA and around 30% in the ATAC view.

plot_variance_explained(muvi, plot_total = TRUE)[[2]]

Second, it quantifies the percentage of variance explained by each factor in each view, thereby identifying shared and private factors. For instance, we see that the factor informed by the Naive T Cell gene set explains a large portion of the variance, and is also shared across RNA and ATAC.

plot_variance_explained(muvi)

4 Investigating Factors

Since the latent factors in MuVI are pre-labelled by design, we may directly inspect how they differentiate the cell population. In this case, we see that the latent factor informed by the Naive T Cell gene set assigns higher scores to the 4 groups of T cells. In addition, the Follicular B Cell factor helps distinguish B cells from the rest of the cell population. The same can be said about Nk Cells regarding natural killer cells and Dendritic Cell regarding mDCs and pDCs.

relevant_factors <-
  c("Naive T Cell", "Follicular B Cell", "Nk Cells", "Dendritic Cell")
plot_factor(muvi,
            factors = relevant_factors,
            group_by = "rna.celltype",
            color_by = "rna.celltype") +
  theme(axis.text.x = element_text(
    color = "black",
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))

4.1 Factor Loadings

The factor loadings quantify the contribution of each feature (gene) to each latent factor, e.g. pathway.
For the rest of the analysis we focus on the Follicular B Cell factor/pathway.
We first plot the distribution of all factor loadings in the RNA. Notice we only have non-negative weights, as the model has been constrained during training. Among others we notice several B cell markers such as BANK1, IGHM, MS4A1, PAX5 and so on.

plot_weights(
  muvi,
  view = "rna",
  factors = "Follicular B Cell",
  nfeatures = 20,
  text_size = 4
)

We get a clearer picture of the relevant features for this particular factor by plotting only the top loadings.

plot_top_weights(
  muvi,
  view = "rna",
  factors = "Follicular B Cell",
  sign = "positive",
  nfeatures = 20,
)

4.2 Associating Factors and Features

Using MOFA2 we may also assess the correlation of the factor scores with respect to the most relevant features.

plot_data_scatter(
  muvi,
  view = "rna",
  factor = "Follicular B Cell",
  features = 6,
  color_by = "rna.celltype",
  add_lm = T,
  dot_size = 1
)

4.3 Prior Knowledge Transfer

A powerful aspect of MuVI is pathway information transfer. In particular, we have only informed the RNA, but expect the ATAC view to be implicitly informed via shared factors. For instance, the corresponding gene symbol CD74 and BLK are also deemed highly relevant for the Follicular B Cell pathway, even in the uninformed ATAC view. Potentially, we may even extract a Follicular B Cell ATAC signature from the top loadings.

plot_top_weights(
  muvi,
  view = "atac",
  factors = "Follicular B Cell",
  sign = "positive",
  nfeatures = 20,
)

4.4 Factor-specific Patterns

As opposed to scatterplots which focus mostly on two dimensions, we may render heatmaps to visualise the overall patterns of each factor. In this case, we observe that B cells are clearly identifiable by significantly larger Follicular B Cell scores.

plot_data_heatmap(
  muvi,
  view = "rna",
  factor = "Follicular B Cell",
  features = 25,
  show_rownames = T,
  show_colnames = F,
  cluster_rows = T,
  cluster_cols = F,
  annotation_samples = "rna.celltype"
)

As we already notice a very strong correlation of the Follicular B Cell with BANK1 we may visualise a stripplot of the factor scores (y-axis) grouped by each cell type (x-axis) and coloured by the BANK1 expression.

plot_factor(muvi,
            factors = "Follicular B Cell",
            group_by = "rna.celltype",
            color_by = "BANK1") +
  theme(axis.text.x = element_text(
    color = "black",
    angle = 30,
    vjust = 1,
    hjust = 1
  ))

5 Non-linear Dimensionality Reduction

Similar to the principal components of a PCA decomposition, MuVI infers a compact latent space governed by domain knowledge in terms of feature (gene) sets. Therefore, we may apply clustering algorithms that learn non-linear manifolds such as t-SNE or UMAP. Here, we apply UMAP and further compress the latent space into two dimensions for better visualisation.

muvi <- run_umap(muvi,
                 n_neighbors = 15,
                 min_dist = 0.50)

plot_dimred(
  muvi,
  method = "UMAP",
  color_by = "rna.celltype",
  label = TRUE,
  stroke = 0.05,
  dot_size = 1,
  legend = FALSE
)

We can then go over each latent factor, labelled by their corresponding gene set, and colour each sample by the inferred factor score. When comparing to the previous UMAP plot, we clearly see how each latent factor can be strongly identified with their corresponding cell type.

for (i in relevant_factors) {
  p <- plot_dimred(
    muvi,
    method = "UMAP",
    color_by = i,
    stroke = 0.05,
    dot_size = 1
  ) + ggtitle(i)
  print(p)
}

6 Gene Set Enrichment Analysis

Although MuVI accommodates gene set information a priori, it is always advisable to perform a GSEA procedure after training in order to test whether the inferred latent factors (posterior gene sets) still resemble the prior gene sets. In cases where the prior information is not suitable or is severely noisy, the model will attempt to refine the prior gene set annotations given enough evidence from the data.

6.1 Significance to Prior Information

We perform GSEA on the prior gene set collection from the C8 category in MSigDB. The original gene set names contain the prefix HAY_BONE_MARROW_... but have been prettified for the plots, that is HAY_BONE_MARROW_NAIVE_T_CELL is renamed to Naive T Cell, and so on.
As the test results indicate, not only can we ensure that each inferred latent factor significantly matches the prior gene set, we also see that the prior gene set is typically ranked first as the most significant posterior gene set candidate.

msigdb.matrix <- msigdbr(species = "Homo sapiens",
                         category = "C8") %>% as.data.table %>% .[, id := 1] %>%
  dcast(gs_name ~ gene_symbol, value.var = "id", fill = 0) %>%
  remove_rownames %>%
  column_to_rownames(var = "gs_name")

# GSEA on positive weights only because we have no negative weights
gsea.positive <- run_enrichment(muvi,
                                feature.sets = msigdb.matrix == 1,
                                view = "rna",
                                sign = "positive")

for (i in relevant_factors) {
  p <-
    plot_enrichment(gsea.positive, factor = i, max.pathways = 15) + ggtitle(i)
  print(p)
}

We may also inspect the top loadings that contributed to the test results.

plot_enrichment_detailed(
  gsea.positive,
  factor = "Follicular B Cell",
  max.genes = 10,
  max.pathways = 5
)

6.2 Matching to Other Collections

We perform GSEA on another gene set collection, e.g. GO terms, in order to examine whether we can recover similar themes regarding each inferred factor.
Again, the test results point towards the same underlying biological processes, even when comparing to a different gene set collection.

msigdb.matrix <- msigdbr(species = "Homo sapiens",
                         category = "C5",
                         subcategory = "BP") %>% as.data.table %>% .[, id := 1] %>%
  dcast(gs_name ~ gene_symbol, value.var = "id", fill = 0) %>%
  remove_rownames %>%
  column_to_rownames(var = "gs_name")

# GSEA on positive weights only because we have no negative weights
gsea.positive <- run_enrichment(muvi,
                                feature.sets = msigdb.matrix == 1,
                                view = "rna",
                                sign = "positive")

for (i in relevant_factors) {
  p <-
    plot_enrichment(gsea.positive, factor = i, max.pathways = 15) + ggtitle(i)
  print(p)
}

plot_enrichment_detailed(
  gsea.positive,
  factor = "Follicular B Cell",
  max.genes = 10,
  max.pathways = 5
)

Session Info

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Berlin
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_1.16.0      msigdbr_7.5.1     rhdf5_2.50.1      ggplot2_3.5.1    
## [5] tibble_3.2.1      data.table_1.16.4 BiocStyle_2.34.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1      dplyr_1.1.4           farver_2.1.2         
##  [4] filelock_1.0.3        fastmap_1.2.0         digest_0.6.37        
##  [7] lifecycle_1.0.4       magrittr_2.0.3        compiler_4.4.2       
## [10] rlang_1.1.4           sass_0.4.9            tools_4.4.2          
## [13] utf8_1.2.4            yaml_2.3.10           corrplot_0.95        
## [16] knitr_1.49            ggsignif_0.6.4        S4Arrays_1.6.0       
## [19] labeling_0.4.3        reticulate_1.40.0     DelayedArray_0.32.0  
## [22] plyr_1.8.9            RColorBrewer_1.1-3    abind_1.4-8          
## [25] HDF5Array_1.34.0      Rtsne_0.17            babelgene_22.9       
## [28] withr_3.0.2           purrr_1.0.2           BiocGenerics_0.52.0  
## [31] grid_4.4.2            stats4_4.4.2          fansi_1.0.6          
## [34] ggpubr_0.6.0          colorspace_2.1-1      Rhdf5lib_1.28.0      
## [37] scales_1.3.0          tinytex_0.54          cli_3.6.3            
## [40] rmarkdown_2.29        crayon_1.5.3          generics_0.1.3       
## [43] reshape2_1.4.4        cachem_1.1.0          stringr_1.5.1        
## [46] splines_4.4.2         zlibbioc_1.52.0       parallel_4.4.2       
## [49] BiocManager_1.30.25   XVector_0.46.0        matrixStats_1.4.1    
## [52] basilisk_1.18.0       vctrs_0.6.5           Matrix_1.7-1         
## [55] carData_3.0-5         jsonlite_1.8.9        dir.expiry_1.14.0    
## [58] car_3.1-3             bookdown_0.41         IRanges_2.40.1       
## [61] S4Vectors_0.44.0      ggrepel_0.9.6         rstatix_0.7.2        
## [64] irlba_2.3.5.1         Formula_1.2-5         tidyr_1.3.1          
## [67] jquerylib_0.1.4       glue_1.8.0            codetools_0.2-19     
## [70] cowplot_1.1.3         uwot_0.2.2            RcppAnnoy_0.0.22     
## [73] stringi_1.8.4         gtable_0.3.6          munsell_0.5.1        
## [76] pillar_1.9.0          basilisk.utils_1.18.0 htmltools_0.5.8.1    
## [79] rhdf5filters_1.18.0   R6_2.5.1              evaluate_1.0.1       
## [82] lattice_0.22-5        png_0.1-8             backports_1.5.0      
## [85] pheatmap_1.0.12       broom_1.0.7           bslib_0.8.0          
## [88] Rcpp_1.0.13-1         nlme_3.1-165          SparseArray_1.6.0    
## [91] mgcv_1.9-1            xfun_0.49             MatrixGenerics_1.18.0
## [94] forcats_1.0.0         pkgconfig_2.0.3

Non-negative Integration of Gene Expression and Chromatin Accessibility of 10k PBMCs

2024-12-14

Contents