- 1 Introduction
- 2 Load libraries and data
- 3 Create the MOFA obejct and train the model
- 4 Overview of the trained MOFA model
- 5 Characterisation of Factor 1
- 6 Characterisation of Factor 3
- 7 Inspection of combinations of Factors
- 8 Prediction of clinical subgroups
- 9 Gene set enrichment analysis (GSEA)
- 10 Customized analysis
- 11 Imputation of missing values
- 12 Building predictive models of clinical outcome
- 13 sessionInfo

This vignette show how to use MOFA+ on the bulk multi-omics data set that was used in the first publication of MOFA and the original vignette of the MOFA package.

Briefly, the data consists of four omics including DNA methylation, RNA-seq, somatic mutations and drug response data from blood for N=200 patients with Chronic Lymphocytic Leukemia (CLL). The data set is explained in detail in this article and is publicly available here

```
library(MOFA2)
library(MOFAdata)
library(data.table)
library(ggplot2)
library(tidyverse)
```

Data is stored as a list of matrices. Features are stored in the rows and samples in the columns

```
utils::data("CLL_data")
lapply(CLL_data,dim)
```

```
## $Drugs
## [1] 310 200
##
## $Methylation
## [1] 4248 200
##
## $mRNA
## [1] 5000 200
##
## $Mutations
## [1] 69 200
```

Sample metadata are stored as a data.frame. Important columns are:

**Gender**: m (male), f (female)**Age**: age in years**TTT**: time (in years) which passed from taking the sample to the next treatment**TTD**: time (in years) which passed from taking the sample to patients’ death**treatedAfter**: (TRUE/FALSE)**Died**: whether the patient died (TRUE/FALSE)

The full meta data can be obtained from the Bioconductor package `BloodCancerMultiOmics2017`

as `data("patmeta")`

.

`CLL_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/sample_metadata.txt")`

Create the MOFA object

`MOFAobject <- create_mofa(CLL_data)`

`MOFAobject`

```
## Untrained MOFA model with the following characteristics:
## Number of views: 4
## Views names: Drugs Methylation mRNA Mutations
## Number of features (per view): 310 4248 5000 69
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 200
```

Visualise the number of views (rows) and the number of groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).

`plot_data_overview(MOFAobject)`

Important arguments:

**scale_groups**: scale groups to the same total variance? Default is`FALSE`

**scale_views**: scale views to the same total variance? Default is`FALSE`

**views**: views names**groups**: groups names

```
data_opts <- get_default_data_options(MOFAobject)
data_opts
```

```
## $scale_views
## [1] FALSE
##
## $scale_groups
## [1] FALSE
##
## $views
## [1] "Drugs" "Methylation" "mRNA" "Mutations"
##
## $groups
## [1] "group1"
```

Important arguments:

**num_factors**: number of factors**likelihoods**: likelihood per view (options are “gaussian”, “poisson”, “bernoulli”). By default they are inferred automatically.**spikeslab_factors**: use spike-slab sparsity prior in the factors? default is`FALSE`

.**spikeslab_weights**: use spike-slab sparsity prior in the weights? default is`TRUE`

.**ard_factors**: use ARD prior in the factors? Default is`TRUE`

if using multiple groups.**ard_weights**: use ARD prior in the weights? Default is`TRUE`

if using multiple views.

```
model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 15
model_opts
```

```
## $likelihoods
## Drugs Methylation mRNA Mutations
## "gaussian" "gaussian" "gaussian" "bernoulli"
##
## $num_factors
## [1] 15
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] TRUE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
```

Important arguments:

**maxiter**: number of iterations. Default is 1000.**convergence_mode**: “fast”, “medium” (default), “slow”. For exploration, the fast mode is good enough.**seed**: random seed

```
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"
train_opts$seed <- 42
train_opts
```

```
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "slow"
##
## $drop_factor_threshold
## [1] -1
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 1
##
## $stochastic
## [1] FALSE
##
## $gpu_mode
## [1] FALSE
##
## $seed
## [1] 42
##
## $outfile
## NULL
##
## $save_interrupted
## [1] FALSE
```

Prepare the MOFA object

```
MOFAobject <- prepare_mofa(MOFAobject,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts
)
```

Train the model: this should take ~2min

```
# NOTE: The software has evolved since the original publication and the results will not be 100% identical to the original publication, please use the pretrained model if you are running through the vignette for the fist time
# MOFAobject <- run_mofa(MOFAobject, outfile="/Users/ricard/Downloads/MOFA2_CLL.hdf5")
# saveRDS(MOFAobject,"MOFA2_CLL.rds")
# Load precomputed model
MOFAobject <- readRDS(url("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/MOFA2_CLL.rds"))
```

The MOFA object consists of multiple slots where relevant data and information is stored. For descriptions, you can read the documentation using `?MOFA`

. The most important slots are:

**data**: input data used to train the model (features are centered at zero mean)**samples_metadata**: sample metadata information**expectations**: expectations of the posterior distributions for the Weights and the Factors

`slotNames(MOFAobject)`

```
## [1] "data" "intercepts" "imputed_data"
## [4] "samples_metadata" "features_metadata" "expectations"
## [7] "training_stats" "training_options" "stochastic_options"
## [10] "data_options" "model_options" "dimensions"
## [13] "on_disk" "dim_red" "cache"
## [16] "status"
```

Data:

`names(MOFAobject@data)`

`## [1] "Drugs" "Methylation" "mRNA" "Mutations"`

`dim(MOFAobject@data$Drugs$group1)`

`## [1] 310 200`

Factor and Weight values (expectations of the posterior distributions):

`names(MOFAobject@expectations)`

`## [1] "Z" "W"`

```
# Dimensionality of the factor matrix: 200 samples, 15 factors
dim(MOFAobject@expectations$Z$group1)
```

`## [1] 200 15`

```
# Dimensionality of the mRNA Weight matrix: 5000 features, 15 factors
dim(MOFAobject@expectations$W$mRNA)
```

`## [1] 5000 15`

The sample metadata must be provided as a data.frame and it must contain a column `sample`

with the sample IDs. Make sure that the samples in the metadata match the samples in the model

```
# Sanity check
# stopifnot(all(sort(CLL_metadata$sample)==sort(unlist(samples_names(MOFAobject)))))
# Add sample metadata to the model
samples_metadata(MOFAobject) <- CLL_metadata
```

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit. Reasons? Perhaps you used too many factors or perhaps the normalisation is not adequate.

`plot_factor_cor(MOFAobject)`

**The most important insight that MOFA generates is the variance decomposition analysis**. This plot shows the percentage of variance explained by each factor across each data modality (and group, if provided). It summarises the sources of variation from a complex heterogeneous data set in a single figure.

`plot_variance_explained(MOFAobject, max_r2=15)`

What insights from the data can we learn just from inspecting this plot?

**Factor 1**captures a source of variability that is present across all data modalities. Thus, its etiology is likely to be something very important for the disease**Factor 2**captures a very strong source of variation that is exclusive to the drug response data.**Factor 3**captures variation that is present across multiple data modalities, except for DNA methylation. This is likely to be important too.**Factor 5**is capturing some co-variation between the mRNA and the drug response assay.

A reasonable question is whether the model is providing a good fit to the data. For this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. Some general guidelines:

- Noisy data sets with strong non-linearities will result in small amounts of variance explained (<10%).
- The higher the number of samples the smaller the total variance explained
- The higher the number of factors, the higher the total variance explained.
- MOFA is a linear and sparse model. This is helpful to prevent overfitting, but it will never explain 100% of the variance, even if using a lot of Factors.

In this data set, using only K=15 factors the model explains up to ~54% of the variation in the Drug response and ~42% of the variation in the mRNA data. This is quite remarkable for a linear model.

`plot_variance_explained(MOFAobject, plot_total = T)[[2]]`