1 Description
2 Load libraries
3 Load pre-computed MOFA model
4 Overview of training data
5 Overview of the MOFA model
6 Characterisation of Factors
7 Zero-inflation in scRNA-seq data
8 Are the changes in DNA methylation and chromatin accessibility correlated at the loci level?
9 Optional: Using pseudo-time annotations of cells with MEFISTO
10 sessionInfo

1 Description

This tutorial demonstrates how to use MOFA for the integration of single-cell multi-omics data. We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA provides a method for delineating coordinated variation between the transcriptome and the epigenome.

The data set we use here is a simplified version of the original data set (only E7.5 cells) published in Nature. The full data set can be downloaded from this FTP.

In this vignette we skip all the data processing details as well as the model training part. We focus on the downstream characterisation of a pretrained MOFA model. If you are interested in the code to train the model from the data linked above, have a look at the scripts in the train_MOFA folder.

2 Load libraries

Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages

library(data.table)  # fast manipulation of data.frames
library(purrr)       # pipes to make the code more readable
library(ggplot2)
library(MOFA2)

Define cell type colors for the visualisations

colors <- c(
  "Mesoderm" = "#CD3278",
  "Endoderm" = "#43CD80",
  "Ectoderm" = "steelblue"
)

3 Load pre-computed MOFA model

As input to the model we quantified DNA methylation and chromatin accessibility values over two different sets of regulatory elements: gene promoters and enhancer elements. RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout.

Note that for this tutorial we selected the MOFA Factors that explained at least 1% of variation in the RNA expression. The trained model can be found here ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds.

MOFAobject <- readRDS("day_2_data/MOFAmodel.rds")
# MOFAobject <- readRDS(url("ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds"))
MOFAobject

## Trained MOFA with the following characteristics: 
##  Number of views: 5 
##  Views names: Enhancers accessibility Enhancers methylation Promoters accessibility Promoters methylation RNA 
##  Number of features (per view): 1500 1500 1500 1500 2500 
##  Number of groups: 1 
##  Groups names: single_group 
##  Number of samples (per group): 801 
##  Number of factors: 6

Explore the cell metadata:
- sample: cell ID
- stage: developmental stage.
- lineage: cell type annotation (derived from mapping the cells to the 10x reference atlas).
- pass_rnaQC: did the cell pass QC for RNA expression?.
- pass_metQC: did the cell pass QC for DNA methylation? NA if the cell was only profiled for RNA.
- pass_accQC: did the cell pass QC for chromatin accessibility? NA if the cell was only profiled for RNA.
- group: ignore this column

head(samples_metadata(MOFAobject))

##           sample stage stage_lineage  lineage pass_rnaQC pass_metQC pass_accQC
## 1 E7.5_Plate1_C9  E7.5 E7.5_Mesoderm Mesoderm       TRUE         NA         NA
## 2 E7.5_Plate1_D9  E7.5 E7.5_Mesoderm Mesoderm       TRUE         NA         NA
## 3 E7.5_Plate1_A9  E7.5 E7.5_Ectoderm Ectoderm       TRUE         NA         NA
## 4 E7.5_Plate1_C3  E7.5 E7.5_Mesoderm Mesoderm       TRUE         NA         NA
## 5 E7.5_Plate1_E3  E7.5 E7.5_Mesoderm Mesoderm       TRUE         NA         NA
## 6 E7.5_Plate1_A3  E7.5 E7.5_Endoderm Endoderm       TRUE       TRUE       TRUE
##          group
## 1 single_group
## 2 single_group
## 3 single_group
## 4 single_group
## 5 single_group
## 6 single_group

(Q) How many cells from each lineages are in the object?

Answer

table(samples_metadata(MOFAobject)$lineage)

## 
## Ectoderm Endoderm Mesoderm 
##      189      131      481

Notice that there a lot of cells that only have RNA expression measurements. One of the major advantages of MOFA is that it handles missing values, so we don’t have to remove these cells prior to model training

(Q) How many cells do not have DNA methylation measurements?

Answer

Visualisation of Factor values

mean(!is.na(samples_metadata(MOFAobject)$pass_metQC))

## [1] 0.4069913

4 Overview of training data

The function plot_data_overview can be used to obtain an overview of the input data. It shows how many views (rows) and how many cells (columns) exist, what are their corresponding dimensionalities are. It also shows which views each cell is missing.

view.colors <- c(
  "RNA" = "#3CB54E",
  "Enhancers accessibility" = "#00BFC4",
  "Promoters accessibility" = "#00BFC4",
  "Enhancers methylation" = "#F37A71",
  "Promoters methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAobject)]

plot_data_overview(MOFAobject, colors = view.colors)

4.1 Visualise RNA expression data.

Notice that RNA expression has already been normalised (using scran), but the distribution looks zero-inflated. This is a statistical challenge for some single-cell RNA-seq assays. To model this data we probably want to use a zero-inflated Gaussian distribution, but this not implemented in the MOFA framework. Instead, we used a Gaussian distribution, which should provide a decent approximation but it will likely underestimate the mean expression values.

rna <- get_data(MOFAobject, views="RNA")[[1]][[1]]
hist(rna)

4.2 Visualise DNA methylation and chromatin accessibility data.

As seen in the previous session, we use M-values instead of Beta-values.
However, when looking at the distribution of M-values we can notice that the epigenetic modalities are not well modeled by a Gaussian likelihood. Ideally this data modality should be modeled with a binomial distribution: for every cell \(i\) and region \(j\) the total number of CpGs correspond to the total number of trials and the number of methylated CpGs to the number of successes. Unfortunately, the binomial likelihood is not implemented in the MOFA framework and thus we need to calculate M-values that can be approximated with a Gaussian distribution.

met <- get_data(MOFAobject, views="Promoters methylation")[[1]][[1]]
hist(met)

acc <- get_data(MOFAobject, views="Promoters accessibility")[[1]][[1]]
hist(acc)

5 Overview of the MOFA model

5.1 Correlation between factors

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit.

plot_factor_cor(MOFAobject)

5.2 Variance decomposition analysis

The most important insight that MOFA generates is the variance decomposition analysis using plot_variance_explained. This plot shows the percentage of variance explained by each factor in each data modality. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.

plot_variance_explained(MOFAobject) +
  theme(
    axis.text.x = element_text(angle=25, hjust=1, vjust=1.05)
  )

(Q) What insights from the data can we learn just from inspecting this plot?

Answer

Factor 1 and Factor 2 capture strong sources of variability that drive significant amounts of variation in RNA expression and epigenetic status of enhancer elements, but not in promoter elements.
Factor 3 and Factor 5 capture a source of variation that is exclusive to the RNA expression.
Factor 4 captures a strong source of variation that is shared between RNA expression and DNA methylation of enhancer elements.

(Q) What do you notice about promoters?

6 Characterisation of Factors

There are a few systematic strategies to characterise the molecular signal that underlies each MOFA factor:

Association analysis between the sample metadata and the Factor values: function correlate_factors_with_covariates
Inspection of factor values: functions plot_factor (one factor at a time) and plot_factors (combinations of factors)
Inspection of the feature weights: functions plot_weights (all weights), plot_top_weights (only the top weights)
Gene set enrichment analysis on the mRNA weights: functions run_enrichment, followed by plot_enrichment.

6.1 Characterisation of Factor 1

6.1.1 Visualisation of Factor values

Plotting Factor 1 values and colouring cells by lineage assignment cleartly shows that this factor captures the variation that is associated with the separation between Mesoderm (positive Factor values) and non-Mesoderm cells (negative Factor values).

plot_factor(MOFAobject,
  factor = 1,
  color_by = "lineage", 
  add_violin = TRUE,
  dodge = TRUE
) + scale_fill_manual(values=colors)

How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.

6.1.2 Visualisation of RNA weights

The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa.

Let’s plot the distribution of weights for Factor 1.

plot_weights(MOFAobject,
  view = "RNA",
  factor = 1,
  nfeatures = 10,     # Top number of features to highlight
  scale = T           # Scale weights from -1 to 1
)

If you are not interested in the full distribution, but just on the top weights, you can instead do:

plot_top_weights(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  nfeatures = 10,
  scale = T, 
  abs = T
)

We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with positive weight:

genes <- c("Phlda2","Mesp1")

for (i in genes) {
  plot_factor(MOFAobject,
    factor = 1,
    dot_size = 2.5,
    group_by = "lineage",
    color_by = i
  ) %>% print
}

Similarly, we expect that genes with large negative weights For Factor 1 to be lowly expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes with negative weight:

genes <- c("Cldn6","Pim2")

for (i in genes) {
  plot_factor(MOFAobject,
    factor = 1,
    dot_size = 2.5,
    group_by = "lineage",
    color_by = i
  ) %>% print
}

6.2 Visualisation of RNA expression patterns in the high-dimensional space

The weights are useful to identify which genes are driving each factors. After inspecting the weights it is good practice to go back to the high-dimensional space and check if the variability that MOFA captures is real.
For example, one could generate a heatmap plot of the RNA expression for the top genes, where samples are sorted by the corresponding factor values. This is the aim of the plot_data_heatmap function:

plot_data_heatmap(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`,
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE
)

An interesting option of plot_data_heatmap is to plot “denoised” observations. This is obtained by reconstructing the data using the matrix factorisation equation from MOFA:

\[\hat{\mathbf{Y}}^m = \mathbf{W}^m\mathbf{Z}\] where \(\mathbf{W}^m\) is the weight matrix for the \(m\)-th view, and \(\mathbf{Z}\) is the (shared) factor matrix.
This data reconstruction step essentially removes all the variation that is not captured by the model:

plot_data_heatmap(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  denoise = TRUE,
  features = 25,
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F 
)

6.3 Visualisation of DNA methylation weights

As we have done with RNA, we can also visualise the distribution of weights for the epigenetic modalities. The problem about this is that the large majority of enhancers are not well annotated and we only have the genomic coordinates for them…

plot_weights(MOFAobject,
  view = c("Enhancers methylation"),
  factor = 1,
  nfeatures = 5,
  scale = F
)

6.4 Visualisation of DNA methylation patterns in the high-dimensional space

As done with the RNA above, let’s visualise in the high-dimensional space the DNA methylation variation that MOFA captures using the plot_data_heatmap function. Notice how noisy and sparse DNA methylation data is.

plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 1, 
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)

We will use MOFA to impute the missing values. This is based on the data reconstruction equation shown above.

MOFAobject <- impute(MOFAobject)

Plot heatmap with impute=TRUE argument. Despite the gaussian likelihood model not being optimal for DNA methylation data, this heatmap looks much better!

plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 1, 
  impute = TRUE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)

As we guessed from the variance decomposition analysis, the promoters do not display interesting signal during germ layer commitment

plot_data_heatmap(MOFAobject, 
  view = "Promoters methylation", 
  factor = 1, 
  impute = TRUE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)

6.5 Characterisation of Factor 2

(Q) Your task is to provide a characterisation for Factor 2.

Try a similar pipeline as for Factor 1 and answer the following questions:

Which germ layer underlies Factor 2?

Answer

Visualisation of Factor values

plot_factor(MOFAobject,
  factor = 2,
  color_by = "lineage", 
  add_violin = TRUE,
  dodge = TRUE
) + scale_fill_manual(values=colors)

Can you identify mRNA markers? Validate them using the scRNA-seq reference atlas

Answer

Visualisation of RNA weights

plot_weights(MOFAobject,
  view = "RNA",
  factor = 2,
  nfeatures = 10,     # Top number of features to highlight
  scale = T           # Scale weights from -1 to 1
)

Plot the expression of the genes with largest weight

genes <- c("Foxa2","Krt8")

for (i in genes) {
  plot_factor(MOFAobject,
    factor = 2,
    dot_size = 2.5,
    group_by = "lineage",
    color_by = i
  ) %>% print
}

Generate a heatmap that displays the DNA methylation variation (try with and without imputation of missing values, see the function impute)

Answer

Visualisation of DNA methylation patterns in the high-dimensional space

plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 2, 
  impute = FALSE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)

plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 2, 
  impute = TRUE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)

6.6 Plot combinations of Factors

This scatter plot captures in two dimensions all the variation that is required to separate the three germ layers! It corresponds to Figure 2b in the paper.

plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "lineage",
  dot_size = 2,
  legend = TRUE
) + scale_fill_manual(values=colors)

We can colour this plot by the molecular measurements, for example gene expression:

# Ectoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Pim2", 
  dot_size = 2
)

# Mesoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Mesp1", 
  dot_size = 2
)

# Endoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Foxa2", 
  dot_size = 2
)

7 Zero-inflation in scRNA-seq data

(Q) Extract (1) the RNA expression data from the mofa model using the get_data function, and (2) the MOFA RNA expression predictions using the predict function. Then, plot a histogram of the two matrices. What do you notice?

Answer

Do predictions

rna.true <- get_data(MOFAobject, views="RNA")[[1]][[1]]
rna.pred <- predict(MOFAobject, views="RNA")[[1]][[1]]

Plot histograms. Notice that the Gaussian likelihood model implemented in MOFA (red) underestimates the mean of zero-inflated data (blue).

hist(rna.true, col=rgb(0,0,1,0.5), main="")
hist(rna.pred, col=rgb(1,0,0,0.5), add=T)

8 Are the changes in DNA methylation and chromatin accessibility correlated at the loci level?

Using MOFA we have identified coordinated axis of variation between the three omics (Factor1 and Factor2). However, this does not necessarily imply that the same features are driving the signal in all of the omics. It could be that DNA methylation changes associated with endoderm commitment (Factor 2) are driven by enhancers A,B,C whereas chromatin accessibility changes are driven by enhancers D,E,F.
To explore this, we will correlate the weights for both epigenetc modalities. This cannot be done internally within MOFA so we have to extract the weights manually do our own downstream analysis This is done with the function get_weights:

# Fetch weights
w.met <- get_weights(MOFAobject, 
  factors = c(1,2), 
  views = "Enhancers methylation", 
  scale = TRUE,
  as.data.frame = T
) %>% as.data.table

w.acc <- get_weights(MOFAobject, 
  factors = c(1,2), 
  views = "Enhancers accessibility", 
  scale = TRUE,
  as.data.frame = T
) %>% as.data.table

# Remove the met_ and acc_ prefix from the feature names
w.met[,feature:=substr(feature,5,nchar(as.character(feature)))]
w.acc[,feature:=substr(feature,5,nchar(as.character(feature)))]

# Merge weights
w.dt <- merge(
  w.met[,c("feature","factor","value")], 
  w.acc[,c("feature","factor","value")], 
  by=c("feature","factor")
)

Notice how the DNA methylation and chromatin accessibility weighs are highly correlated, suggesting that the variability in both epigenetic layers is highly coupled not just globally but also at the individual loci level.

ggplot(w.dt, aes(x=value.x, y=value.y)) +
  geom_point() + 
  stat_smooth(method="lm") +
  theme_classic() +
  facet_wrap(~factor, scales="free") +
  labs(x="DNA methylation weight", y="Chromatin accessibility weight")

9 Optional: Using pseudo-time annotations of cells with MEFISTO

In the lectures you have also learnt about extensions of MOFA to account for temporal or spatial information on the samples. You will hear more about spatial data tomorrow. Time course data at a good temporal resolution is still rare for single cell studies. In the full scNMT-seq gastrulation data we have 4 different time points, which does not provide enough temporal resolution for MEFISTO but instead we can consider using a measure of pseudotime and use this in MEFISTO to distinguish patterns of variation that vary smoothly along pseudotime from other non-smooth sources of variation, e.g. cell cycle.

For this we can train a similar model on the full data from all 4 stages and simply additionally pass the pseudo-time values for each cell to MOFA. If you are interested in how this works and what other types of down-stream analysis are possible have a look at this tutorial.

10 sessionInfo

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_1.1.21      ggplot2_3.3.3     purrr_0.3.4       data.table_1.14.0
## [5] BiocStyle_2.19.2 
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.3.1 sass_0.3.1           tidyr_1.1.3         
##  [4] jsonlite_1.7.2       splines_4.1.0        bslib_0.2.4         
##  [7] assertthat_0.2.1     BiocManager_1.30.12  highr_0.9           
## [10] stats4_4.1.0         yaml_2.2.1           ggrepel_0.9.1       
## [13] corrplot_0.84        pillar_1.6.0         lattice_0.20-44     
## [16] glue_1.4.2           reticulate_1.19      digest_0.6.27       
## [19] RColorBrewer_1.1-2   colorspace_2.0-0     cowplot_1.1.1       
## [22] htmltools_0.5.1.1    Matrix_1.3-3         plyr_1.8.6          
## [25] pkgconfig_2.0.3      pheatmap_1.0.12      dir.expiry_0.99.4   
## [28] bookdown_0.21        scales_1.1.1         HDF5Array_1.19.14   
## [31] Rtsne_0.15           tibble_3.1.1         mgcv_1.8-35         
## [34] generics_0.1.0       farver_2.1.0         IRanges_2.25.11     
## [37] ellipsis_0.3.1       withr_2.4.2          BiocGenerics_0.37.2 
## [40] magrittr_2.0.1       crayon_1.4.1         evaluate_0.14       
## [43] fansi_0.4.2          nlme_3.1-152         forcats_0.5.1       
## [46] tools_4.1.0          lifecycle_1.0.0      matrixStats_0.58.0  
## [49] basilisk.utils_1.3.9 stringr_1.4.0        Rhdf5lib_1.13.4     
## [52] S4Vectors_0.29.15    munsell_0.5.0        DelayedArray_0.17.10
## [55] compiler_4.1.0       jquerylib_0.1.3      rlang_0.4.10        
## [58] rhdf5_2.35.2         grid_4.1.0           rhdf5filters_1.3.4  
## [61] labeling_0.4.2       rmarkdown_2.7        basilisk_1.3.7      
## [64] gtable_0.3.0         DBI_1.1.1            reshape2_1.4.4      
## [67] R6_2.5.0             knitr_1.32           dplyr_1.0.5         
## [70] uwot_0.1.10          utf8_1.2.1           filelock_1.0.2      
## [73] stringi_1.5.3        parallel_4.1.0       Rcpp_1.0.6          
## [76] vctrs_0.3.7          tidyselect_1.1.0     xfun_0.22

MOFA applied to the gastrulation scNMT-seq data set

2021-05-31

Contents

1 Description

2 Load libraries

3 Load pre-computed MOFA model

4 Overview of training data

4.1 Visualise RNA expression data.

4.2 Visualise DNA methylation and chromatin accessibility data.

5 Overview of the MOFA model

5.1 Correlation between factors

5.2 Variance decomposition analysis

6 Characterisation of Factors

6.1 Characterisation of Factor 1

6.1.1 Visualisation of Factor values

6.1.2 Visualisation of RNA weights

6.2 Visualisation of RNA expression patterns in the high-dimensional space

6.3 Visualisation of DNA methylation weights

6.4 Visualisation of DNA methylation patterns in the high-dimensional space

6.5 Characterisation of Factor 2

6.6 Plot combinations of Factors

7 Zero-inflation in scRNA-seq data

8 Are the changes in DNA methylation and chromatin accessibility correlated at the loci level?

9 Optional: Using pseudo-time annotations of cells with MEFISTO

10 sessionInfo