1 Introduction
2 Load libraries and data
3 Create the MOFA obejct and train the model
4 Overview of the trained MOFA model
5 Characterisation of Factor 1
6 Characterisation of Factor 3
7 Inspection of combinations of Factors
8 Prediction of clinical subgroups
9 Gene set enrichment analysis (GSEA)
- 9.1 Load Reactome gene set annotations.
- 9.2 Run enrichment analysis
  - 9.2.1 Plot enrichment analysis results
10 Customized analysis
11 Imputation of missing values
12 Building predictive models of clinical outcome
13 sessionInfo

1 Introduction

This vignette show how to use MOFA+ on the bulk multi-omics data set that was used in the first publication of MOFA and the original vignette of the MOFA package.

Briefly, the data consists of four omics including DNA methylation, RNA-seq, somatic mutations and drug response data from blood for N=200 patients with Chronic Lymphocytic Leukemia (CLL). The data set is explained in detail in this article and is publicly available here

2 Load libraries and data

library(MOFA2)
library(MOFAdata)
library(data.table)
library(ggplot2)
library(tidyverse)

Data is stored as a list of matrices. Features are stored in the rows and samples in the columns

utils::data("CLL_data")       
lapply(CLL_data,dim)

## $Drugs
## [1] 310 200
## 
## $Methylation
## [1] 4248  200
## 
## $mRNA
## [1] 5000  200
## 
## $Mutations
## [1]  69 200

Sample metadata are stored as a data.frame. Important columns are:

Gender: m (male), f (female)
Age: age in years
TTT: time (in years) which passed from taking the sample to the next treatment
TTD: time (in years) which passed from taking the sample to patients’ death
treatedAfter: (TRUE/FALSE)
Died: whether the patient died (TRUE/FALSE)

The full meta data can be obtained from the Bioconductor package BloodCancerMultiOmics2017 as data("patmeta").

CLL_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/sample_metadata.txt")

3 Create the MOFA obejct and train the model

Create the MOFA object

MOFAobject <- create_mofa(CLL_data)

MOFAobject

## Untrained MOFA model with the following characteristics: 
##  Number of views: 4 
##  Views names: Drugs Methylation mRNA Mutations 
##  Number of features (per view): 310 4248 5000 69 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 200

3.1 Plot data overview

Visualise the number of views (rows) and the number of groups (columns) exist, what are their corresponding dimensionalities and how many missing information they have (grey bars).

plot_data_overview(MOFAobject)

3.2 Define MOFA options

3.2.1 Data options

Important arguments:

scale_groups: scale groups to the same total variance? Default is FALSE
scale_views: scale views to the same total variance? Default is FALSE
views: views names
groups: groups names

data_opts <- get_default_data_options(MOFAobject)
data_opts

## $scale_views
## [1] FALSE
## 
## $scale_groups
## [1] FALSE
## 
## $views
## [1] "Drugs"       "Methylation" "mRNA"        "Mutations"  
## 
## $groups
## [1] "group1"

3.2.2 Model options

Important arguments:

num_factors: number of factors
likelihoods: likelihood per view (options are “gaussian”, “poisson”, “bernoulli”). By default they are inferred automatically.
spikeslab_factors: use spike-slab sparsity prior in the factors? default is FALSE.
spikeslab_weights: use spike-slab sparsity prior in the weights? default is TRUE.
ard_factors: use ARD prior in the factors? Default is TRUE if using multiple groups.
ard_weights: use ARD prior in the weights? Default is TRUE if using multiple views.

model_opts <- get_default_model_options(MOFAobject)
model_opts$num_factors <- 15

model_opts

## $likelihoods
##       Drugs Methylation        mRNA   Mutations 
##  "gaussian"  "gaussian"  "gaussian" "bernoulli" 
## 
## $num_factors
## [1] 15
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] TRUE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE

3.2.3 Training options

Important arguments:

maxiter: number of iterations. Default is 1000.
convergence_mode: “fast”, “medium” (default), “slow”. For exploration, the fast mode is good enough.
seed: random seed

train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"
train_opts$seed <- 42

train_opts

## $maxiter
## [1] 1000
## 
## $convergence_mode
## [1] "slow"
## 
## $drop_factor_threshold
## [1] -1
## 
## $verbose
## [1] FALSE
## 
## $startELBO
## [1] 1
## 
## $freqELBO
## [1] 1
## 
## $stochastic
## [1] FALSE
## 
## $gpu_mode
## [1] FALSE
## 
## $seed
## [1] 42
## 
## $outfile
## NULL
## 
## $save_interrupted
## [1] FALSE

3.3 Train the MOFA model

Prepare the MOFA object

MOFAobject <- prepare_mofa(MOFAobject,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts
)

Train the model: this should take ~2min

# NOTE: The software has evolved since the original publication and the results will not be 100% identical to the original publication, please use the pretrained model if you are running through the vignette for the fist time
# MOFAobject <- run_mofa(MOFAobject, outfile="/Users/ricard/Downloads/MOFA2_CLL.hdf5")
# saveRDS(MOFAobject,"MOFA2_CLL.rds")

# Load precomputed model
MOFAobject <- readRDS(url("http://ftp.ebi.ac.uk/pub/databases/mofa/cll_vignette/MOFA2_CLL.rds"))

4 Overview of the trained MOFA model

4.1 Slots

The MOFA object consists of multiple slots where relevant data and information is stored. For descriptions, you can read the documentation using ?MOFA. The most important slots are:

data: input data used to train the model (features are centered at zero mean)
samples_metadata: sample metadata information
expectations: expectations of the posterior distributions for the Weights and the Factors

slotNames(MOFAobject)

##  [1] "data"               "intercepts"         "imputed_data"      
##  [4] "samples_metadata"   "features_metadata"  "expectations"      
##  [7] "training_stats"     "training_options"   "stochastic_options"
## [10] "data_options"       "model_options"      "dimensions"        
## [13] "on_disk"            "dim_red"            "cache"             
## [16] "status"

Data:

names(MOFAobject@data)

## [1] "Drugs"       "Methylation" "mRNA"        "Mutations"

dim(MOFAobject@data$Drugs$group1)

## [1] 310 200

Factor and Weight values (expectations of the posterior distributions):

names(MOFAobject@expectations)

## [1] "Z" "W"

# Dimensionality of the factor matrix: 200 samples, 15 factors
dim(MOFAobject@expectations$Z$group1)

## [1] 200  15

# Dimensionality of the mRNA Weight matrix: 5000 features, 15 factors
dim(MOFAobject@expectations$W$mRNA)

## [1] 5000   15

4.2 Add sample metadata to the model

The sample metadata must be provided as a data.frame and it must contain a column sample with the sample IDs. Make sure that the samples in the metadata match the samples in the model

# Sanity check
# stopifnot(all(sort(CLL_metadata$sample)==sort(unlist(samples_names(MOFAobject)))))

# Add sample metadata to the model
samples_metadata(MOFAobject) <- CLL_metadata

4.3 Correlation between factors

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit. Reasons? Perhaps you used too many factors or perhaps the normalisation is not adequate.

plot_factor_cor(MOFAobject)

4.4 Plot variance decomposition

4.4.1 Variance decomposition by Factor

The most important insight that MOFA generates is the variance decomposition analysis. This plot shows the percentage of variance explained by each factor across each data modality (and group, if provided). It summarises the sources of variation from a complex heterogeneous data set in a single figure.

plot_variance_explained(MOFAobject, max_r2=15)

What insights from the data can we learn just from inspecting this plot?

Factor 1 captures a source of variability that is present across all data modalities. Thus, its etiology is likely to be something very important for the disease
Factor 2 captures a very strong source of variation that is exclusive to the drug response data.
Factor 3 captures variation that is present across multiple data modalities, except for DNA methylation. This is likely to be important too.
Factor 5 is capturing some co-variation between the mRNA and the drug response assay.

4.4.2 Total variance explained per view

A reasonable question is whether the model is providing a good fit to the data. For this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. Some general guidelines:

Noisy data sets with strong non-linearities will result in small amounts of variance explained (<10%).
The higher the number of samples the smaller the total variance explained
The higher the number of factors, the higher the total variance explained.
MOFA is a linear and sparse model. This is helpful to prevent overfitting, but it will never explain 100% of the variance, even if using a lot of Factors.

In this data set, using only K=15 factors the model explains up to ~54% of the variation in the Drug response and ~42% of the variation in the mRNA data. This is quite remarkable for a linear model.

plot_variance_explained(MOFAobject, plot_total = T)[[2]]

5 Characterisation of Factor 1

There are a few systematic strategies to characterise the molecular etiology underlying the MOFA Factors and to relate them to the sample covariates:

Association analysis between the sample metadata and the Factor values.
Inspection of factor values.
Inspection of the feature weights.
Gene set enrichment analysis on the mRNA weights.

5.1 Association analysis

Let’s test the association between MOFA Factors and Gender, survival outcome (dead vs alive) and age:

correlate_factors_with_covariates(MOFAobject, 
  covariates = c("Gender","died","age"), 
  plot="log_pval"
)

## Warning in correlate_factors_with_covariates(MOFAobject, covariates =
## c("Gender", : There are non-numeric values in the covariates data.frame,
## converting to numeric...

Most Factors don’t have a clear association with any of the covariates. Only Factor 11 has a small association with survival outcome. We will go back to associations with clinical information at the end of the vignette.

5.2 Plot factor values

How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. As the data is centered prior to running MOFA, each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect. Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.

plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "Factor1"
)

5.3 Plot feature weights

How do we interpret the weights?
The weights provide a score for each feature on each factor. Features with no association with the corresponding factor are expected to have values close to zero, whereas features with strong association with the factor are expected to have large absolute values. The sign of the weights indicates the direction of the effect: a positive weights indicates that the feature has higher levels in the cells with positive factor values, and vice-versa.

5.3.1 Plot feature weights for somatic mutations

By looking at the variance explained plot, we saw that Factor 1 captures variation in all data modalities. Out of all omics, the somatic mutation data is a good place to start, as somatic mutations are very sparse, easy to interpret and any change in the DNA is likely to have downstream consequences to all other molecular layers. Let’s plot the weights:

plot_weights(MOFAobject,
 view = "Mutations",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)

Notice that most features lie at zero, indicating that most features have no association with Factor 1. There is however one gene that clearly stands out: IGHV (immunoglobulin heavy chain variable region). This is the main clinical marker for CLL.

An alternative visualistion to the full distribution of weights is to do a line plot that displays only the top features with the corresponding weight sign on the right:

plot_top_weights(MOFAobject,
 view = "Mutations",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)

IGHV has a positve weight. This means that samples with positive Factor 1 values have IGHV mutation whereas samples with negative Factor 1 values do not have the IGHV mutation. To confirm this, let’s plot the Factor values and colour the IGHV mutation status.

plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "IGHV",
  add_violin = TRUE,
  dodge = TRUE
)

We can also plot Factor values coloured by other covariates, for example Gender. As shown above, this variable has no association with Factor 1:

plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "Gender",
  dodge = TRUE,
  add_violin = TRUE
)

5.3.2 Plot gene weights for mRNA expression

From the variance explained plot we know that Factor 1 drives variation across all data modalities. Let’s visualise the mRNA expression changes that are associated with Factor 1:

plot_weights(MOFAobject, 
  view = "mRNA", 
  factor = 1, 
  nfeatures = 10
)

5.3.3 Plot molecular signatures in the input data

In this case we have a large amount of genes that have large positive and negative weights. Genes with large positive values will be more expressed in the samples with IGHV mutation, whereas genes with large negative values will be more expressed in the samples without the IGHV mutation. Let’s verify this. The function plot_data_scatter generates a scatterplot of Factor 1 values (x-axis) versus expression values (y-axis) for the top 4 genes with largest positive weight. Samples are coloured by IGHV status:

plot_data_scatter(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 4,
  sign = "positive",
  color_by = "IGHV"
) + labs(y="RNA expression")

This function generates a scatterplot of Factor 1 values (x-axis) versus expression values (y-axis) for the top 4 genes with largest negative weight. Samples are coloured by IGHV status:

plot_data_scatter(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 4,
  sign = "negative",
  color_by = "IGHV"
) + labs(y="RNA expression")

An alternative visualisation is to use a heatmap

plot_data_heatmap(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 25,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = FALSE,
  scale = "row"
)

plot_data_heatmap has an interesting argument to “beautify” the heatmap: denoise = TRUE. Instead of plotting the (noisy) input data, we can plot the data reconstructed by the model, where noise has been removed:

plot_data_heatmap(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 25,
  denoise = TRUE,
  cluster_rows = FALSE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = FALSE,
  scale = "row"
)

6 Characterisation of Factor 3

6.1 Plot feature weights

Following a similar strategy as for Factor 1, we notice that Factor 3 is also active in the somatic mutation view. Thus, there must be a mutation that underlies this phenotype. Let’s plot the corresponding weights:

plot_weights(MOFAobject, 
  view = "Mutations", 
  factor = 3, 
  nfeatures = 10,
  abs = F
)

In this case we have two mutations that have large weight. One of them is the trisomy of chromosome 12, which is the second most important clinical marker in CLL!

6.2 Plot Factor values

Let’s verify this by plotting the Factor values grouping samples by the presence or absence of trisomy12:

plot_factor(MOFAobject, 
  factors = 3, 
  color_by = "trisomy12",
  dodge = TRUE,
  add_violin = TRUE
)

6.3 Plot molecular signatures in the input data

Again, we can also inspect the molecular signatures in the input data with the functions plot_data_scatter and plot_data_heatmap:

plot_data_scatter(MOFAobject, 
  view = "Drugs",
  factor = 3,  
  features = 4,
  sign = "positive",
  color_by = "trisomy12"
) + labs(y="Drug response (cell viability)")

plot_data_heatmap(MOFAobject, 
  view = "mRNA",
  factor = 3,  
  features = 25,
  denoise = TRUE,
  cluster_rows = TRUE, cluster_cols = FALSE,
  show_rownames = TRUE, show_colnames = FALSE,
  scale = "row"
)

7 Inspection of combinations of Factors

Now that we have characterised the etiology of the two main Factors, let’s explore them together:

p <- plot_factors(MOFAobject, 
  factors = c(1,3), 
  color_by = "IGHV",
  shape_by = "trisomy12",
  dot_size = 2.5,
  show_missing = T
)

p <- p + 
  geom_hline(yintercept=-1, linetype="dashed") +
  geom_vline(xintercept=(-0.5), linetype="dashed")

print(p)

## Warning: Removed 27 rows containing missing values (geom_point).

This plot is extremely important. It classifies the patients into four different subgroups depending on their (multi-omic) molecular profile. As shown in the analysis above, both factors are associated with differences in the drug response assay they are are strongly linked to somatic mutations (IGHV and trisomy12) that are easy to profile in clinical practice. This is fantastic for the aim of personalised medicine.

8 Prediction of clinical subgroups

The scatterplot of Factor 1 vs Factor 3 reveals that a few samples are missing the somatic mutation status. In this case, the doctors were not able to classify patients into their clinical subgroups. But we can now use MOFA to exploit the molecular profiles and attempt to impute the IGHV and trisomy12 status.

suppressPackageStartupMessages(library(randomForest))

# Prepare data
df <- as.data.frame(get_factors(MOFAobject, factors=c(1,2))[[1]])

# Train the model for IGHV
df$IGHV <- as.factor(MOFAobject@samples_metadata$IGHV)
model.ighv <- randomForest(IGHV ~ ., data=df[!is.na(df$IGHV),], ntree=10)
df$IGHV <- NULL

# Do predictions
MOFAobject@samples_metadata$IGHV.pred <- stats::predict(model.ighv, df)

# Train the model for Trisomy12
df$trisomy12 <- as.factor(MOFAobject@samples_metadata$trisomy12)
model.trisomy12 <- randomForest(trisomy12 ~ ., data=df[!is.na(df$trisomy12),], ntree=10)
df$trisomy12 <- NULL

MOFAobject@samples_metadata$trisomy12.pred <- stats::predict(model.trisomy12, df)

Plot predictions for IGHV

MOFAobject@samples_metadata$IGHV.pred_logical <- c("True","Predicted")[as.numeric(is.na(MOFAobject@samples_metadata$IGHV))+1]

p <- plot_factors(MOFAobject, 
  factors = c(1,3), 
  color_by = "IGHV.pred",
  shape_by = "IGHV.pred_logical",
  dot_size = 2.5,
  show_missing = T
)

p <- p + 
  geom_hline(yintercept=-1, linetype="dashed") +
  geom_vline(xintercept=(-0.5), linetype="dashed")

print(p)

9 Gene set enrichment analysis (GSEA)

In addition to exploring the individual weights for each factor, we can use enrichment analysis to look for signiificant associations of factors to genesets. Here, we use the Reactome genesets for illustrations, which is contained in the MOFAdata package. For more details on how the GSEA works we encourage the users to read the GSEA vignette

9.1 Load Reactome gene set annotations.

Gene set annotations are provided as a binary membership matrix. Genes are stored in the rows, pathways are stored in the columns. A value of 1 indicates that gene \(j\) belongs to the pathway \(i\).

utils::data(reactomeGS)

head(colnames(reactomeGS))

## [1] "ENSG00000187634" "ENSG00000188976" "ENSG00000187961" "ENSG00000187583"
## [5] "ENSG00000187642" "ENSG00000188290"

head(rownames(reactomeGS))

## [1] "Interleukin-6 signaling"                                  
## [2] "Apoptosis"                                                
## [3] "Hemostasis"                                               
## [4] "Intrinsic Pathway for Apoptosis"                          
## [5] "Cleavage of Growing Transcript in the Termination Region "
## [6] "PKB-mediated events"

9.2 Run enrichment analysis

These are the steps for doing Gene Set Enrichment Analysis (GSEA) with MOFA:

(1) Define your gene set matrix: this can be specified as a binary matrix where rows are gene sets and columns are genes. A value of 1 indicates that gene j belongs to pathway i. A value of 0 indicates elsewise.
(2) Select a gene set statistic: the statistic used to quantify the scores at the pathway level. Must be one of the following: mean.diff (difference in the average weight between foreground and background genes) or rank.sum (difference in the sum of ranks between foreground and background genes).
(3) Select a statistical test: the statistical test used to compute the significance of the gene set statistics under a competitive null hypothesis. Must be one of the following: parametric (a simple and very liberal parametric t-test), cor.adj.parametric (parametric t-test adjusted by the correlation between features), permutation (unparametric, the null distribution is created by permuting the weights. This option is computationally expensive, but it preserves the correlation structure between features in the data.).

An important consideration when running GSEA is that MOFA contains positive and negative weights. There will be cases where the genes with negative weights all belong to a specific pathway but genes with positive weights belong to other pathways. If this is true, doing GSEA with all of them together could dilute the signal. Hence, we recommend the user to do GSEA separately for (+) and (-) weights, and possibly also jointly with all weights.

# GSEA on positive weights, with default options
res.positive <- run_enrichment(MOFAobject, 
  feature.sets = reactomeGS, 
  view = "mRNA",
  sign = "positive"
)

# GSEA on negative weights, with default options
res.negative <- run_enrichment(MOFAobject, 
  feature.sets = reactomeGS, 
  view = "mRNA",
  sign = "negative"
)

The enrichment analysis returns a list of 5 elements:

feature.sets: the feature set matrix filtered by the genes that overlap with the MOFA model.
pval: the nominal p-values.
pval.adj: the FDR-adjusted p-values.
feature.statistics: the feature statistics (i.e. the weights).
set.statistics: matrices with the gene set statistics.
sigPathways: list with significant pathways per factor at a specified FDR threshold

names(res.positive)

## [1] "feature.sets"       "pval"               "pval.adj"          
## [4] "feature.statistics" "set.statistics"     "sigPathways"

9.2.1 Plot enrichment analysis results

Plot an overview of the number of significant pathways per factor.
It seems that most of the Factors do not have clear gene set signatures. A clear exception is Factor 5, which has a very strong enrichment for genes with positive weights.

plot_enrichment_heatmap(res.positive)

plot_enrichment_heatmap(res.negative)

Let’s plot the GSEA results for Factor 5. It seems that this Factor is capturing differences in the stress response of the blood cells.

plot_enrichment(res.positive, factor = 5, max.pathways = 15)

It is always advised to not rely only on the p-values and to visualise which genes are driving the enrichment within each pathways. There are problematic cases where a single gene is driving the enrichment results in very small pathways.

plot_enrichment_detailed(
  enrichment.results = res.positive,
  factor = 5, 
  max.pathways = 3
)

10 Customized analysis

For customized exploration of weights and factors, you can directly fetch the variables from the model using ‘get’ functions: get_weights, get_factors and get_data:

weights <- get_weights(MOFAobject, 
  views = "all", 
  factors = "all", 
  as.data.frame = TRUE 
)
head(weights)

##   feature  factor         value  view
## 1 D_001_1 Factor1 -0.0001029638 Drugs
## 2 D_001_2 Factor1 -0.0061835683 Drugs
## 3 D_001_3 Factor1 -0.0484350338 Drugs
## 4 D_001_4 Factor1 -0.0396829204 Drugs
## 5 D_001_5 Factor1 -0.0240856562 Drugs
## 6 D_002_1 Factor1 -0.0004036970 Drugs

factors <- get_factors(MOFAobject, 
  factors = "all", 
  as.data.frame = TRUE
)
head(factors)

##   sample  factor     value  group
## 1   H045 Factor1 -1.418579 group1
## 2   H109 Factor1 -1.415237 group1
## 3   H024 Factor1  1.663551 group1
## 4   H056 Factor1  1.224276 group1
## 5   H079 Factor1 -1.217420 group1
## 6   H164 Factor1 -1.791768 group1

data <- get_data(MOFAobject, 
  views = "all", 
  as.data.frame = TRUE
)
head(data)

##    view  group feature sample      value
## 1 Drugs group1 D_001_1   H045 0.02363938
## 2 Drugs group1 D_001_2   H045 0.04623274
## 3 Drugs group1 D_001_3   H045 0.31874706
## 4 Drugs group1 D_001_4   H045 0.82370272
## 5 Drugs group1 D_001_5   H045 0.89627769
## 6 Drugs group1 D_002_1   H045 0.09432807

11 Imputation of missing values

With the impute function all missing values are imputed based on the MOFA model. The imputed data is then stored in the imputed_data slot of the MOFAobject and can be accessed via the get_imputed_data function.

MOFAobject <- impute(MOFAobject)

Before imputation

MOFAobject@data$mRNA[[1]][1:5,190:195]

##                 H136 H178 H166 H174 H177 H259
## ENSG00000244734   NA   NA   NA   NA   NA   NA
## ENSG00000158528   NA   NA   NA   NA   NA   NA
## ENSG00000198478   NA   NA   NA   NA   NA   NA
## ENSG00000175445   NA   NA   NA   NA   NA   NA
## ENSG00000174469   NA   NA   NA   NA   NA   NA

After imputation

MOFAobject@imputed_data$mRNA[[1]][["mean"]][1:5,190:195]

##                      H136     H178     H166     H174     H177     H259
## ENSG00000244734  8.437186 6.634943 9.558299 4.506726 0.147641 9.142674
## ENSG00000158528  7.811139 5.195533 6.924045 6.253729 6.583781 8.099277
## ENSG00000198478  8.193635 7.851333 7.285100 7.347293 4.890448 6.041116
## ENSG00000175445 10.007591 8.087473 9.101702 9.020709 9.958857 9.480526
## ENSG00000174469  9.198478 8.696589 9.370480 7.688797 8.307647 8.521276

12 Building predictive models of clinical outcome

The factors inferred by MOFA can be related to clinical outcomes such as time to treatment or survival times. As this type of data is censored (not for all samples we have already observed the event) we will use Cox models for this purpose. In a Cox proportional hazards model we model the hazard of an event ocurring (e.g. death or treatment) as a function of some covariates (here the factors). If a factor has a influence on the surivival time or time to treatment it will receive a high absoulte coefficient in the Cox model. In particular:

If the coefficient is positive, samples with large factor values have an increased hazard (of death or treatment) compared to samples with small factor values.
If the coefficient is negative, samples with small factor values have an increased hazard compared to samples with a large factor values.

To fit these models we will use the coxph function in the survival package. The survival data is stored in a survival object that contains both the time a sample has been followed up and whether the event has occured (as 0,1).

Let’s take time to treatment as an example here. The sample metadata contains the follow-up times per sample in years in the column TTT, and the column treatedAfter indicated whether a treatment occured.

12.0.1 Fit Cox models

library(survival)
library(survminer)

SurvObject <- Surv(MOFAobject@samples_metadata$TTT, MOFAobject@samples_metadata$treatedAfter)
Z <- get_factors(MOFAobject)[[1]]
fit <- coxph(SurvObject ~ Z) 
fit

## Call:
## coxph(formula = SurvObject ~ Z)
## 
##               coef exp(coef) se(coef)      z        p
## ZFactor1  -0.43422   0.64777  0.10944 -3.968 7.26e-05
## ZFactor2   0.53961   1.71534  0.13380  4.033 5.51e-05
## ZFactor3   0.10059   1.10582  0.14057  0.716 0.474259
## ZFactor4   0.04846   1.04965  0.10528  0.460 0.645339
## ZFactor5   0.27567   1.31741  0.11328  2.434 0.014952
## ZFactor6  -0.25121   0.77786  0.12383 -2.029 0.042500
## ZFactor7   0.23077   1.25957  0.11332  2.036 0.041709
## ZFactor8   0.40083   1.49306  0.10826  3.702 0.000214
## ZFactor9  -0.04353   0.95740  0.10580 -0.411 0.680751
## ZFactor10 -0.12880   0.87915  0.10962 -1.175 0.240011
## ZFactor11  0.54195   1.71935  0.13455  4.028 5.63e-05
## ZFactor12 -0.80444   0.44734  0.12448 -6.463 1.03e-10
## ZFactor13 -0.12435   0.88307  0.11611 -1.071 0.284175
## ZFactor14 -0.05307   0.94831  0.08940 -0.594 0.552783
## ZFactor15  0.03688   1.03757  0.12081  0.305 0.760168
## 
## Likelihood ratio test=114.6  on 15 df, p=< 2.2e-16
## n= 174, number of events= 96 
##    (26 observations deleted due to missingness)

We can see that several factors have a significant association to time to treatment. For example, Factor 1 has a negative coefficient. Samples with low factor values have an increased hazard compared to samples with a large factor values.

12.0.2 Plot Hazard ratios

s <- summary(fit)
coef <- s[["coefficients"]]

df <- data.frame(
  factor = factor(rownames(coef), levels = rev(rownames(coef))),
  p      = coef[,"Pr(>|z|)"], 
  coef   = coef[,"exp(coef)"], 
  lower  = s[["conf.int"]][,"lower .95"], 
  higher = s[["conf.int"]][,"upper .95"]
)

ggplot(df, aes(x=factor, y=coef, ymin=lower, ymax=higher)) +
  geom_pointrange( col='#619CFF') + 
  coord_flip() +
  scale_x_discrete() + 
  labs(y="Hazard Ratio", x="") + 
  geom_hline(aes(yintercept=1), linetype="dotted") +
  theme_bw()

12.0.3 Kaplan-Meier plots

For illustration purposes we can also split the samples based on the factor values into two groups using the maximally selected rank statistics from the maxstat R package and plot the Kaplan Meier plots per group.

df <- data.frame(
  time = SurvObject[,1], 
  event = SurvObject[,2], Z1 = Z[,1]
)
cut <- surv_cutpoint(df, variables='Z1')
df$FactorCluster <- df$Z1 > cut$cutpoint$cutpoint
fit <- survfit(Surv(time, event) ~ FactorCluster, df)

ggsurvplot(fit, data = df,
  conf.int = TRUE, pval = TRUE,
  fun = function(y) y * 100,
  legend = "top", legend.labs = c(paste("low LF 1"), paste("high LF 1")),
  xlab = "Time to treatment", ylab="Survival probability (%)", title= "Factor 1"
)$plot

13 sessionInfo

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] survminer_0.4.7     ggpubr_0.4.0        survival_3.2-3     
##  [4] randomForest_4.6-14 forcats_0.5.0       stringr_1.4.0      
##  [7] dplyr_1.0.0         purrr_0.3.4         readr_1.3.1        
## [10] tidyr_1.1.0         tibble_3.0.1        tidyverse_1.3.0    
## [13] ggplot2_3.3.2       data.table_1.12.8   MOFAdata_1.4.0     
## [16] MOFA2_1.1           BiocStyle_2.16.0   
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1      ggsignif_0.6.0        ellipsis_0.3.1       
##   [4] rio_0.5.16            fs_1.4.1              rstudioapi_0.11      
##   [7] farver_2.0.3          ggrepel_0.8.2         mvtnorm_1.1-1        
##  [10] fansi_0.4.1           lubridate_1.7.9       xml2_1.3.2           
##  [13] splines_4.0.0         mnormt_2.0.0          knitr_1.29           
##  [16] jsonlite_1.7.0        broom_0.5.6           km.ci_0.5-2          
##  [19] dbplyr_1.4.4          pheatmap_1.0.12       HDF5Array_1.16.1     
##  [22] BiocManager_1.30.10   compiler_4.0.0        httr_1.4.1           
##  [25] backports_1.1.8       assertthat_0.2.1      Matrix_1.2-18        
##  [28] cli_2.0.2             htmltools_0.5.0       tools_4.0.0          
##  [31] gtable_0.3.0          glue_1.4.1            reshape2_1.4.4       
##  [34] Rcpp_1.0.4.6          carData_3.0-4         cellranger_1.1.0     
##  [37] vctrs_0.3.1           nlme_3.1-148          exactRankTests_0.8-31
##  [40] psych_1.9.12.31       xfun_0.15             openxlsx_4.1.5       
##  [43] rvest_0.3.5           lifecycle_0.2.0       rstatix_0.6.0        
##  [46] zoo_1.8-8             scales_1.1.1          hms_0.5.3            
##  [49] parallel_4.0.0        rhdf5_2.32.1          RColorBrewer_1.1-2   
##  [52] yaml_2.2.1            curl_4.3              reticulate_1.16      
##  [55] gridExtra_2.3         KMsurv_0.1-5          stringi_1.4.6        
##  [58] S4Vectors_0.26.1      corrplot_0.84         BiocGenerics_0.34.0  
##  [61] zip_2.0.4             rlang_0.4.6           pkgconfig_2.0.3      
##  [64] matrixStats_0.56.0    evaluate_0.14         lattice_0.20-41      
##  [67] Rhdf5lib_1.10.0       labeling_0.3          cowplot_1.0.0        
##  [70] tidyselect_1.1.0      plyr_1.8.6            magrittr_1.5         
##  [73] bookdown_0.20         R6_2.4.1              IRanges_2.22.2       
##  [76] magick_2.4.0          generics_0.0.2        DelayedArray_0.14.0  
##  [79] DBI_1.1.0             pillar_1.4.4          haven_2.3.1          
##  [82] foreign_0.8-80        withr_2.2.0           mgcv_1.8-31          
##  [85] abind_1.4-5           modelr_0.1.8          crayon_1.3.4         
##  [88] car_3.0-8             survMisc_0.5.5        maxstat_0.7-25       
##  [91] tmvnsim_1.0-2         rmarkdown_2.3         grid_4.0.0           
##  [94] readxl_1.3.1          blob_1.2.1            reprex_0.3.0         
##  [97] digest_0.6.25         xtable_1.8-4          stats4_4.0.0         
## [100] munsell_0.5.0

Applying MOFA+ to the Chronic Lymphocytic Leukemia cohort

2020-09-03

Contents

1 Introduction

2 Load libraries and data

3 Create the MOFA obejct and train the model

3.1 Plot data overview

3.2 Define MOFA options

3.2.1 Data options

3.2.2 Model options

3.2.3 Training options

3.3 Train the MOFA model

4 Overview of the trained MOFA model

4.1 Slots

4.2 Add sample metadata to the model

4.3 Correlation between factors

4.4 Plot variance decomposition

4.4.1 Variance decomposition by Factor

4.4.2 Total variance explained per view

5 Characterisation of Factor 1

5.1 Association analysis

5.2 Plot factor values

5.3 Plot feature weights

5.3.1 Plot feature weights for somatic mutations

5.3.2 Plot gene weights for mRNA expression

5.3.3 Plot molecular signatures in the input data

6 Characterisation of Factor 3

6.1 Plot feature weights

6.2 Plot Factor values

6.3 Plot molecular signatures in the input data

7 Inspection of combinations of Factors

8 Prediction of clinical subgroups

9 Gene set enrichment analysis (GSEA)

9.1 Load Reactome gene set annotations.

9.2 Run enrichment analysis

9.2.1 Plot enrichment analysis results

10 Customized analysis

11 Imputation of missing values

12 Building predictive models of clinical outcome

12.0.1 Fit Cox models

12.0.2 Plot Hazard ratios

12.0.3 Kaplan-Meier plots

13 sessionInfo