This tutorial demonstrates how to use MOFA for the integration of single-cell multi-omics data. We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA provides a method for delineating coordinated variation between the transcriptome and the epigenome.
The data set we use here is a simplified version of the original data set (only E7.5 cells) published in Nature. The full data set can be downloaded from this FTP.
In this vignette we skip all the data processing details as well as the model training part. We focus on the downstream characterisation of a pretrained MOFA model. If you are interested in the code to train the model from the data linked above, have a look at the scripts in the train_MOFA folder.
Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages
library(data.table) # fast manipulation of data.frames
library(purrr) # pipes to make the code more readable
library(ggplot2)
library(MOFA2)
Define cell type colors for the visualisations
colors <- c(
"Mesoderm" = "#CD3278",
"Endoderm" = "#43CD80",
"Ectoderm" = "steelblue"
)
As input to the model we quantified DNA methylation and chromatin accessibility values over two different sets of regulatory elements: gene promoters and enhancer elements. RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout.
Note that for this tutorial we selected the MOFA Factors that explained at least 1% of variation in the RNA expression. The trained model can be found here ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds.
MOFAobject <- readRDS("day_2_data/MOFAmodel.rds")
# MOFAobject <- readRDS(url("ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds"))
MOFAobject
## Trained MOFA with the following characteristics:
## Number of views: 5
## Views names: Enhancers accessibility Enhancers methylation Promoters accessibility Promoters methylation RNA
## Number of features (per view): 1500 1500 1500 1500 2500
## Number of groups: 1
## Groups names: single_group
## Number of samples (per group): 801
## Number of factors: 6
Explore the cell metadata:
- sample: cell ID
- stage: developmental stage.
- lineage: cell type annotation (derived from mapping the cells to the 10x reference atlas).
- pass_rnaQC: did the cell pass QC for RNA expression?.
- pass_metQC: did the cell pass QC for DNA methylation? NA
if the cell was only profiled for RNA.
- pass_accQC: did the cell pass QC for chromatin accessibility? NA
if the cell was only profiled for RNA.
- group: ignore this column
head(samples_metadata(MOFAobject))
## sample stage stage_lineage lineage pass_rnaQC pass_metQC pass_accQC
## 1 E7.5_Plate1_C9 E7.5 E7.5_Mesoderm Mesoderm TRUE NA NA
## 2 E7.5_Plate1_D9 E7.5 E7.5_Mesoderm Mesoderm TRUE NA NA
## 3 E7.5_Plate1_A9 E7.5 E7.5_Ectoderm Ectoderm TRUE NA NA
## 4 E7.5_Plate1_C3 E7.5 E7.5_Mesoderm Mesoderm TRUE NA NA
## 5 E7.5_Plate1_E3 E7.5 E7.5_Mesoderm Mesoderm TRUE NA NA
## 6 E7.5_Plate1_A3 E7.5 E7.5_Endoderm Endoderm TRUE TRUE TRUE
## group
## 1 single_group
## 2 single_group
## 3 single_group
## 4 single_group
## 5 single_group
## 6 single_group
(Q) How many cells from each lineages are in the object?
table(samples_metadata(MOFAobject)$lineage)
##
## Ectoderm Endoderm Mesoderm
## 189 131 481
Notice that there a lot of cells that only have RNA expression measurements. One of the major advantages of MOFA is that it handles missing values, so we don’t have to remove these cells prior to model training
(Q) How many cells do not have DNA methylation measurements?Visualisation of Factor values
mean(!is.na(samples_metadata(MOFAobject)$pass_metQC))
## [1] 0.4069913
The function plot_data_overview
can be used to obtain an overview of the input data.
It shows how many views (rows) and how many cells (columns) exist, what are their corresponding dimensionalities are. It also shows which views each cell is missing.
view.colors <- c(
"RNA" = "#3CB54E",
"Enhancers accessibility" = "#00BFC4",
"Promoters accessibility" = "#00BFC4",
"Enhancers methylation" = "#F37A71",
"Promoters methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAobject)]
plot_data_overview(MOFAobject, colors = view.colors)
Notice that RNA expression has already been normalised (using scran
), but the distribution looks zero-inflated. This is a statistical challenge for some single-cell RNA-seq assays. To model this data we probably want to use a zero-inflated Gaussian distribution, but this not implemented in the MOFA framework. Instead, we used a Gaussian distribution, which should provide a decent approximation but it will likely underestimate the mean expression values.
rna <- get_data(MOFAobject, views="RNA")[[1]][[1]]
hist(rna)
As seen in the previous session, we use M-values instead of Beta-values.
However, when looking at the distribution of M-values we can notice that the epigenetic modalities are not well modeled by a Gaussian likelihood. Ideally this data modality should be modeled with a binomial distribution: for every cell \(i\) and region \(j\) the total number of CpGs correspond to the total number of trials and the number of methylated CpGs to the number of successes. Unfortunately, the binomial likelihood is not implemented in the MOFA framework and thus we need to calculate M-values that can be approximated with a Gaussian distribution.
met <- get_data(MOFAobject, views="Promoters methylation")[[1]][[1]]
hist(met)
acc <- get_data(MOFAobject, views="Promoters accessibility")[[1]][[1]]
hist(acc)
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit.
plot_factor_cor(MOFAobject)
The most important insight that MOFA generates is the variance decomposition analysis using plot_variance_explained
. This plot shows the percentage of variance explained by each factor in each data modality. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.
plot_variance_explained(MOFAobject) +
theme(
axis.text.x = element_text(angle=25, hjust=1, vjust=1.05)
)
(Q) What insights from the data can we learn just from inspecting this plot?
(Q) What do you notice about promoters?
There are a few systematic strategies to characterise the molecular signal that underlies each MOFA factor:
correlate_factors_with_covariates
plot_factor
(one factor at a time) and plot_factors
(combinations of factors)plot_weights
(all weights), plot_top_weights
(only the top weights)run_enrichment
, followed by plot_enrichment
.Plotting Factor 1 values and colouring cells by lineage assignment cleartly shows that this factor captures the variation that is associated with the separation between Mesoderm (positive Factor values) and non-Mesoderm cells (negative Factor values).
plot_factor(MOFAobject,
factor = 1,
color_by = "lineage",
add_violin = TRUE,
dodge = TRUE
) + scale_fill_manual(values=colors)
How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.
The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa.
Let’s plot the distribution of weights for Factor 1.
plot_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
If you are not interested in the full distribution, but just on the top weights, you can instead do:
plot_top_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10,
scale = T,
abs = T
)
We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with positive weight:
genes <- c("Phlda2","Mesp1")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
Similarly, we expect that genes with large negative weights For Factor 1 to be lowly expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with negative weight:
genes <- c("Cldn6","Pim2")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
The weights are useful to identify which genes are driving each factors. After inspecting the weights it is good practice to go back to the high-dimensional space and check if the variability that MOFA captures is real.
For example, one could generate a heatmap plot of the RNA expression for the top genes, where samples are sorted by the corresponding factor values. This is the aim of the plot_data_heatmap
function:
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`,
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE
)
An interesting option of plot_data_heatmap
is to plot “denoised” observations. This is obtained by reconstructing the data using the matrix factorisation equation from MOFA:
\[\hat{\mathbf{Y}}^m = \mathbf{W}^m\mathbf{Z}\]
where \(\mathbf{W}^m\) is the weight matrix for the \(m\)-th view, and \(\mathbf{Z}\) is the (shared) factor matrix.
This data reconstruction step essentially removes all the variation that is not captured by the model:
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
denoise = TRUE,
features = 25,
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F
)
As we have done with RNA, we can also visualise the distribution of weights for the epigenetic modalities. The problem about this is that the large majority of enhancers are not well annotated and we only have the genomic coordinates for them…
plot_weights(MOFAobject,
view = c("Enhancers methylation"),
factor = 1,
nfeatures = 5,
scale = F
)
As done with the RNA above, let’s visualise in the high-dimensional space the DNA methylation variation that MOFA captures using the plot_data_heatmap
function. Notice how noisy and sparse DNA methylation data is.
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
We will use MOFA to impute the missing values. This is based on the data reconstruction equation shown above.
MOFAobject <- impute(MOFAobject)
Plot heatmap with impute=TRUE
argument. Despite the gaussian likelihood model not being optimal for DNA methylation data, this heatmap looks much better!
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
As we guessed from the variance decomposition analysis, the promoters do not display interesting signal during germ layer commitment
plot_data_heatmap(MOFAobject,
view = "Promoters methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
(Q) Your task is to provide a characterisation for Factor 2.
Try a similar pipeline as for Factor 1 and answer the following questions:
Visualisation of Factor values
plot_factor(MOFAobject,
factor = 2,
color_by = "lineage",
add_violin = TRUE,
dodge = TRUE
) + scale_fill_manual(values=colors)
Visualisation of RNA weights
plot_weights(MOFAobject,
view = "RNA",
factor = 2,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
Plot the expression of the genes with largest weight
genes <- c("Foxa2","Krt8")
for (i in genes) {
plot_factor(MOFAobject,
factor = 2,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
impute
)Visualisation of DNA methylation patterns in the high-dimensional space
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 2,
impute = FALSE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 2,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
This scatter plot captures in two dimensions all the variation that is required to separate the three germ layers! It corresponds to Figure 2b in the paper.
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "lineage",
dot_size = 2,
legend = TRUE
) + scale_fill_manual(values=colors)
We can colour this plot by the molecular measurements, for example gene expression:
# Ectoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Pim2",
dot_size = 2
)
# Mesoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Mesp1",
dot_size = 2
)
# Endoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Foxa2",
dot_size = 2
)
(Q) Extract (1) the RNA expression data from the mofa model using the get_data
function, and (2) the MOFA RNA expression predictions using the predict
function. Then, plot a histogram of the two matrices. What do you notice?
Do predictions
rna.true <- get_data(MOFAobject, views="RNA")[[1]][[1]]
rna.pred <- predict(MOFAobject, views="RNA")[[1]][[1]]
Plot histograms. Notice that the Gaussian likelihood model implemented in MOFA (red) underestimates the mean of zero-inflated data (blue).
hist(rna.true, col=rgb(0,0,1,0.5), main="")
hist(rna.pred, col=rgb(1,0,0,0.5), add=T)
In the lectures you have also learnt about extensions of MOFA to account for temporal or spatial information on the samples. You will hear more about spatial data tomorrow. Time course data at a good temporal resolution is still rare for single cell studies. In the full scNMT-seq gastrulation data we have 4 different time points, which does not provide enough temporal resolution for MEFISTO but instead we can consider using a measure of pseudotime and use this in MEFISTO to distinguish patterns of variation that vary smoothly along pseudotime from other non-smooth sources of variation, e.g. cell cycle.
For this we can train a similar model on the full data from all 4 stages and simply additionally pass the pseudo-time values for each cell to MOFA. If you are interested in how this works and what other types of down-stream analysis are possible have a look at this tutorial.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_1.1.21 ggplot2_3.3.3 purrr_0.3.4 data.table_1.14.0
## [5] BiocStyle_2.19.2
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.3.1 sass_0.3.1 tidyr_1.1.3
## [4] jsonlite_1.7.2 splines_4.1.0 bslib_0.2.4
## [7] assertthat_0.2.1 BiocManager_1.30.12 highr_0.9
## [10] stats4_4.1.0 yaml_2.2.1 ggrepel_0.9.1
## [13] corrplot_0.84 pillar_1.6.0 lattice_0.20-44
## [16] glue_1.4.2 reticulate_1.19 digest_0.6.27
## [19] RColorBrewer_1.1-2 colorspace_2.0-0 cowplot_1.1.1
## [22] htmltools_0.5.1.1 Matrix_1.3-3 plyr_1.8.6
## [25] pkgconfig_2.0.3 pheatmap_1.0.12 dir.expiry_0.99.4
## [28] bookdown_0.21 scales_1.1.1 HDF5Array_1.19.14
## [31] Rtsne_0.15 tibble_3.1.1 mgcv_1.8-35
## [34] generics_0.1.0 farver_2.1.0 IRanges_2.25.11
## [37] ellipsis_0.3.1 withr_2.4.2 BiocGenerics_0.37.2
## [40] magrittr_2.0.1 crayon_1.4.1 evaluate_0.14
## [43] fansi_0.4.2 nlme_3.1-152 forcats_0.5.1
## [46] tools_4.1.0 lifecycle_1.0.0 matrixStats_0.58.0
## [49] basilisk.utils_1.3.9 stringr_1.4.0 Rhdf5lib_1.13.4
## [52] S4Vectors_0.29.15 munsell_0.5.0 DelayedArray_0.17.10
## [55] compiler_4.1.0 jquerylib_0.1.3 rlang_0.4.10
## [58] rhdf5_2.35.2 grid_4.1.0 rhdf5filters_1.3.4
## [61] labeling_0.4.2 rmarkdown_2.7 basilisk_1.3.7
## [64] gtable_0.3.0 DBI_1.1.1 reshape2_1.4.4
## [67] R6_2.5.0 knitr_1.32 dplyr_1.0.5
## [70] uwot_0.1.10 utf8_1.2.1 filelock_1.0.2
## [73] stringi_1.5.3 parallel_4.1.0 Rcpp_1.0.6
## [76] vctrs_0.3.7 tidyselect_1.1.0 xfun_0.22