This vignette show how to use MOFA+ on the bulk multi-omics data set that was used in the first publication of MOFA and the original vignette of the MOFA package.
Briefly, the data consists of four omics including DNA methylation, RNA-seq, somatic mutations and drug response data from blood for N=200 patients with Chronic Lymphocytic Leukemia (CLL). The data set is explained in detail in this article and is publicly available here
library(MOFA2)
library(MOFAdata)
library(data.table)
library(ggplot2)
library(tidyverse)
Data is stored as a list of matrices. Features are stored in the rows and samples in the columns
utils::data("CLL_data")
lapply(CLL_data,dim)
## $Drugs
## [1] 310 200
##
## $Methylation
## [1] 4248 200
##
## $mRNA
## [1] 5000 200
##
## $Mutations
## [1] 69 200
Sample metadata are stored as a data.frame. Important columns are:
The full meta data can be obtained from the Bioconductor package BloodCancerMultiOmics2017
as data("patmeta")
.
CLL_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/sample_metadata.txt")
Create the MOFA object
MOFAobject <- create_mofa(CLL_data)
MOFAobject
## Untrained MOFA model with the following characteristics:
## Number of views: 4
## Views names: Drugs Methylation mRNA Mutations
## Number of features (per view): 310 4248 5000 69
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 200
Visualise the number of views (rows) and the number of groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).
plot_data_overview(MOFAobject)
Important arguments:
FALSE
FALSE
data_opts <- get_default_data_options(MOFAobject)
data_opts
## $scale_views
## [1] FALSE
##
## $scale_groups
## [1] FALSE
##
## $views
## [1] "Drugs" "Methylation" "mRNA" "Mutations"
##
## $groups
## [1] "group1"
Important arguments:
FALSE
.TRUE
.TRUE
if using multiple groups.TRUE
if using multiple views.model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 15
model_opts
## $likelihoods
## Drugs Methylation mRNA Mutations
## "gaussian" "gaussian" "gaussian" "bernoulli"
##
## $num_factors
## [1] 15
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] TRUE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
Important arguments:
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"
train_opts$seed <- 42
train_opts
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "slow"
##
## $drop_factor_threshold
## [1] -1
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 1
##
## $stochastic
## [1] FALSE
##
## $gpu_mode
## [1] FALSE
##
## $seed
## [1] 42
##
## $outfile
## NULL
##
## $save_interrupted
## [1] FALSE
Prepare the MOFA object
MOFAobject <- prepare_mofa(MOFAobject,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts
)
Train the model: this should take ~2min
# NOTE: The software has evolved since the original publication and the results will not be 100% identical to the original publication, please use the pretrained model if you are running through the vignette for the fist time
# MOFAobject <- run_mofa(MOFAobject, outfile="/Users/ricard/Downloads/MOFA2_CLL.hdf5")
# saveRDS(MOFAobject,"MOFA2_CLL.rds")
# Load precomputed model
MOFAobject <- readRDS(url("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/MOFA2_CLL.rds"))
The MOFA object consists of multiple slots where relevant data and information is stored. For descriptions, you can read the documentation using ?MOFA
. The most important slots are:
slotNames(MOFAobject)
## [1] "data" "intercepts" "imputed_data"
## [4] "samples_metadata" "features_metadata" "expectations"
## [7] "training_stats" "training_options" "stochastic_options"
## [10] "data_options" "model_options" "dimensions"
## [13] "on_disk" "dim_red" "cache"
## [16] "status"
Data:
names(MOFAobject@data)
## [1] "Drugs" "Methylation" "mRNA" "Mutations"
dim(MOFAobject@data$Drugs$group1)
## [1] 310 200
Factor and Weight values (expectations of the posterior distributions):
names(MOFAobject@expectations)
## [1] "Z" "W"
# Dimensionality of the factor matrix: 200 samples, 15 factors
dim(MOFAobject@expectations$Z$group1)
## [1] 200 15
# Dimensionality of the mRNA Weight matrix: 5000 features, 15 factors
dim(MOFAobject@expectations$W$mRNA)
## [1] 5000 15
The sample metadata must be provided as a data.frame and it must contain a column sample
with the sample IDs. Make sure that the samples in the metadata match the samples in the model
# Sanity check
# stopifnot(all(sort(CLL_metadata$sample)==sort(unlist(samples_names(MOFAobject)))))
# Add sample metadata to the model
samples_metadata(MOFAobject) <- CLL_metadata
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit. Reasons? Perhaps you used too many factors or perhaps the normalisation is not adequate.
plot_factor_cor(MOFAobject)
The most important insight that MOFA generates is the variance decomposition analysis. This plot shows the percentage of variance explained by each factor across each data modality (and group, if provided). It summarises the sources of variation from a complex heterogeneous data set in a single figure.
plot_variance_explained(MOFAobject, max_r2=15)
What insights from the data can we learn just from inspecting this plot?
A reasonable question is whether the model is providing a good fit to the data. For this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. Some general guidelines:
In this data set, using only K=15 factors the model explains up to ~54% of the variation in the Drug response and ~42% of the variation in the mRNA data. This is quite remarkable for a linear model.
plot_variance_explained(MOFAobject, plot_total = T)[[2]]