This vignette demonstrates the simultaneous multi-view and multi-group integration framework of MOFA+.
We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA+ provides a method for delineating coordinated variation between the transcriptome and the epigenome and for detecting at which stage(s) of development it occurs.
As input to the model we quantified DNA methylation and chromatin accessibility values over different sets of regulatory elements. Here we considered gene promoters and enhancer elements (distal H3K27ac sites). RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout. Cells were grouped according to their developmental stage (E5.5, E6.5 and E7.5). For details in the data processing, see the following github repository
The data set we use here is a simplified version of the original data set published in (Nature)[https://www.nature.com/articles/s41586-019-1825-8]. The full data set can be downloaded from this FTP.
Load dependencies. Make sure that MOFA is imported last, to avoid collisions with functions from other packages
Define cell type colors for the visualisations
In this vignette we skip all the data processing and model training part and we focus on the downstream characterisation of the model. For details on the data preparation and setting up the MOFA object we refer to the following github repository
MOFA models are saved in hdf5 format and can be loaded into R with the function load_model
. In this case, however, we provide the trained model as an RData file, which already contains the cell metadata.
# MOFAmodel <- load_model("(...)/model.hdf5")
# load("/Users/ricard/data/mofa2_vignettes/gastrulation_scnmt_mofa.RData")
load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scnmt_gastrulation/gastrulation_scnmt_mofa.RData"))
Explore the cell metadata:
* sample: cell identity.
* stage: developmental stage.
* lineage: cell type annotation (derived from mapping the cells to the 10x atlas).
* pass_rnaQC: did the cell pass QC for RNA expression?.
* pass_metQC: did the cell pass QC for DNA methylation? NA
if the cell was only profiled for RNA.
* pass_accQC: did the cell pass QC for chromatin accessibility? NA
if the cell was only profiled for RNA.
* group: the grouping used for MOFA, corresponds to the stage.
head(MOFAmodel@samples_metadata)
## sample plate pass_rnaQC pass_metQC pass_accQC
## 1 E4.5-5.5_new_Plate1_E09 E4.5-5.5_new_Plate1 TRUE FALSE FALSE
## 2 E4.5-5.5_new_Plate1_D09 E4.5-5.5_new_Plate1 TRUE NA NA
## 3 E4.5-5.5_new_Plate1_G09 E4.5-5.5_new_Plate1 TRUE TRUE TRUE
## 4 E4.5-5.5_new_Plate1_F09 E4.5-5.5_new_Plate1 TRUE NA NA
## 5 E4.5-5.5_new_Plate1_A09 E4.5-5.5_new_Plate1 TRUE FALSE FALSE
## 6 E4.5-5.5_new_Plate1_C09 E4.5-5.5_new_Plate1 TRUE TRUE TRUE
## stage stage_lineage group lineage
## 1 E5.5 E5.5_Epiblast E5.5 Epiblast
## 2 E5.5 E5.5_ExE Endoderm E5.5 ExE Endoderm
## 3 E5.5 E5.5_Epiblast E5.5 Epiblast
## 4 E5.5 E5.5_ExE Endoderm E5.5 ExE Endoderm
## 5 E5.5 E5.5_Epiblast E5.5 Epiblast
## 6 E5.5 E5.5_Epiblast E5.5 Epiblast
The function plot_data_overview
can be used to obtain an overview of the input data.
It shows how many views (rows) and how many groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).
view.colors <- c(
"RNA expression" = "#3CB54E",
"Enhancer accessibility" = "#00BFC4",
"Promoter accessibility" = "#00BFC4",
"Enhancer methylation" = "#F37A71",
"Promoter methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAmodel)]
plot_data_overview(MOFAmodel, colors = view.colors)
As a sanity check, one should verify that the factors are (fairly) uncorrelated. Otherwise it suggests that the model has not converged or that perhaps you are using too many factors.
cor <- plot_factor_cor(MOFAmodel)
Quantifying the variance explained across groups and views is probably the most important plot that MOFA+ generates. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.
When having a multi-group and multi-view setting, it is advised to plot one factor at a time:
Factor 1
plot_variance_explained(MOFAmodel, x="group", y="view", factor=1, legend = T)
Factor 2
plot_variance_explained(MOFAmodel, x="group", y="view", factor=2, legend = T)
Factor 1 and 2 show a very structured and interesting signal. When looking at the signal across views, you can notice that these factors connect transcriptome variation to changes in DNA methylation and chromatin accessibility. When looking at the signal across groups, you can notice that Factor 1 is active across all groups, whereas Factor 2 only appears at E7.5.
Each factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs indicate opposite phenotypes, with higher absolute value indicating a stronger phenotype. Let’s plot Factor 1 values and we color cells by lineage assignment.
Clearly, this factor captures the signal associated with the formation of ExE endoderm.
plot_factor(MOFAmodel,
factor = 1,
color_by = "lineage",
scale = TRUE,
add_violin = TRUE, color_violin = TRUE,
dodge = TRUE, dot_size = 1, legend = TRUE
) + scale_color_manual(values=colors) + scale_fill_manual(values=colors)
Here is another way of representing the same plot:
plot_factor(MOFAmodel,
factor = 1,
color_by = "lineage",
dot_size = 1.5,
scale = TRUE, legend = FALSE
) + scale_color_manual(values=colors) + scale_fill_manual(values=colors)
The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the loading indicates the direction of the effect: a positive loading indicates that the feature is more active in the cells with positive factor values, and viceversa.
Let’s plot the distribution of weights for Factor 1.
plot_weights(MOFAmodel,
view = "RNA expression",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
We can also highlight some genes of interest using the argument manual
to see where in the distribution they lie:
plot_weights(MOFAmodel,
view = "RNA expression",
factor = 1,
nfeatures = 5,
manual = list(c("Snai1","Mesp1","Phlda2"), c("Spink1","Amot")),
color_manual = c("darkgreen","red"),
scale = T
)
If you are not interested in the full distribution, but just on the top weights, you can do:
plot_top_weights(MOFAmodel,
view = "RNA expression",
factor = 1,
nfeatures = 10,
scale = T,
abs = T
)
We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the ExE Endoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes:
genes <- c("Spink1","Dab2")
for (i in genes) {
p <- plot_factor(MOFAmodel,
factor = 1,
dot_size = 2.5,
color_by = i
) + scale_colour_gradientn(colours = terrain.colors(10)) # change color scale
print(p)
}
The weights are useful to get an idea of which are top genes that drive the factors. However, to get an idea of how well Factors are associated with genomic features we can generate a heatmap plot of the samples sorted by Factor values against gene expression for the genes with the largest weights:
plot_data_heatmap(MOFAmodel,
view = "RNA expression",
factor = 1,
features = 25,
show_colnames = F, cluster_cols = F # extra arguments passed to `pheatmap`
)
Interestingly, we provide the option to plot the “denoised” observations obtained by reconstructing the data using the MOFA factors. This essentially removes all the variation that is not captured by the model:
plot_data_heatmap(MOFAmodel,
view = "RNA expression",
factor = 1,
denoise = TRUE,
features = 25,
show_colnames = F, cluster_cols = F # extra arguments passed to `pheatmap`
)
We observe that essentially all genes have a large and negative weight. This suggests that samples with a positive factor value (ExE endoderm cells) are globally demethylated as compared to cells with negative factor values.
plot_weights(MOFAmodel,
view = c("Promoter methylation"),
factor = 1,
nfeatures = 3,
scale = T
)
Let’s visualise the coordinated variation that MOFA captures using a heatmap. We will select the top features with largest loading and plot its DNA methylation levels, with sampels ordered according to Factor 1. In addition, here we are going to use the imputation capabilities of MOFA to obtain a cleaner representation.
First, we need to impute missing values:
MOFAmodel <- impute(MOFAmodel)
Generate Heatmap of the top 25 features with largest loading, before imputation (impute=FALSE
):
plot_data_heatmap(MOFAmodel,
view = "Promoter methylation",
factor = 1,
impute = FALSE,
features = 25,
show_colnames = F, cluster_cols = F # extra arguments passed to `pheatmap`
)
After imputation (impute=TRUE
):
plot_data_heatmap(MOFAmodel,
view = "Promoter methylation",
factor = 1,
impute = TRUE,
features = 25,
show_colnames = F, cluster_cols = F # extra arguments passed to `pheatmap`
)
Clearly, as the distribution of the weights indicate, ExE endoderm cells are in a state of global demethylation.
We can now follow a similar approach to characterise Factor 1.
plot_factor(MOFAmodel,
factor = 2,
color_by = "lineage",
scale = TRUE,
add_violin = TRUE, color_violin = TRUE,
dodge = TRUE, dot_size = 1, legend = TRUE
) + scale_color_manual(values=colors) + scale_fill_manual(values=colors)
plot_weights(MOFAmodel,
view = "RNA expression",
factor = 2,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
If we compare the loadings of promoter DNA methylation between Factor 1 and Factor 2 we observe that in Factor 2 they are very small! This indicates that formation of mesoderm is not associated with changes in promoter DNA methylation status.
plot_weights(MOFAmodel,
view = c("Promoter methylation"),
factor = 1:2,
nfeatures = 0,
scale = T
)
Enhancers however look more promising. The dynamics of DNA methylation are not as strong as in Factor 1, but there is definitely something there.
plot_weights(MOFAmodel,
view = c("Enhancer methylation"),
factor = 1:2,
nfeatures = 0,
scale = T
)
Next step would be to zoom in into those enhancers and characterise them in detail:
df <- MOFAmodel@samples_metadata[,c("sample","lineage")]
rownames(df) <- df$sample; df$sample <- NULL
plot_data_heatmap(MOFAmodel,
view = c("Enhancer methylation"),
factor = 2,
impute = TRUE,
features = 15,
# extra arguments passed to `pheatmap`
show_colnames = F, show_rownames = F,
annotation_samples = df, annotation_colors = list("lineage"=colors),
cluster_cols = F
)
You can extract the factor values and the weights to do your own downstream analysis This is done with the functions get_factors
and get_weights
.
Here we are going to plot the correlation between the DNA methylation and chromatin accessibility weights for promoters.
# Fetch weights
w.met <- get_weights(MOFAmodel, factors = 1, views = "Promoter methylation", as.data.frame=T) %>%
as.data.table
w.acc <- get_weights(MOFAmodel, factors = 1, views = "Promoter accessibility", as.data.frame=T) %>%
as.data.table
# Remove the met_ and acc_ prefix from the feature names
w.met[,feature:=substr(feature,5,nchar(as.character(feature)))]
w.acc[,feature:=substr(feature,5,nchar(as.character(feature)))]
# Scale the loadings
w.met[,value:=value/max(abs(value))]
w.acc[,value:=value/max(abs(value))]
# Merge loadings
w.dt <- merge(
w.met[,c("feature","factor","value")],
w.acc[,c("feature","factor","value")],
by=c("feature","factor")
)
ggplot(w.dt, aes(x=value.x, y=value.y)) +
geom_point() + stat_smooth(method="lm") +
theme_classic() +
facet_wrap(~factor, scales="free", ncol=3) +
labs(x="DNA methylation (weight)", y="Chr. accessibility weight")
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_0.99.0 ggplot2_3.3.2 purrr_0.3.4 data.table_1.12.8
## [5] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.8.2 Rcpp_1.0.4.6 lattice_0.20-41
## [4] tidyr_1.1.0 digest_0.6.25 R6_2.4.1
## [7] plyr_1.8.6 stats4_4.0.0 evaluate_0.14
## [10] pillar_1.4.4 rlang_0.4.6 magick_2.4.0
## [13] S4Vectors_0.26.1 Matrix_1.2-18 reticulate_1.16
## [16] rmarkdown_2.3 splines_4.0.0 labeling_0.3
## [19] stringr_1.4.0 pheatmap_1.0.12 munsell_0.5.0
## [22] DelayedArray_0.14.0 HDF5Array_1.16.1 compiler_4.0.0
## [25] vipor_0.4.5 xfun_0.15 pkgconfig_2.0.3
## [28] BiocGenerics_0.34.0 ggbeeswarm_0.6.0 mgcv_1.8-31
## [31] htmltools_0.5.0 tidyselect_1.1.0 tibble_3.0.1
## [34] bookdown_0.20 IRanges_2.22.2 matrixStats_0.56.0
## [37] crayon_1.3.4 dplyr_1.0.0 withr_2.2.0
## [40] grid_4.0.0 nlme_3.1-148 jsonlite_1.7.0
## [43] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [46] scales_1.1.1 stringi_1.4.6 farver_2.0.3
## [49] reshape2_1.4.4 ellipsis_0.3.1 generics_0.0.2
## [52] vctrs_0.3.1 cowplot_1.0.0 Rhdf5lib_1.10.0
## [55] RColorBrewer_1.1-2 tools_4.0.0 forcats_0.5.0
## [58] glue_1.4.1 beeswarm_0.2.3 parallel_4.0.0
## [61] yaml_2.2.1 colorspace_1.4-1 rhdf5_2.32.1
## [64] BiocManager_1.30.10 corrplot_0.84 knitr_1.29
## [67] ggrastr_0.1.9