library(MOFA2)
library(magrittr)
library(tidyverse)
library(cowplot)
library(reshape2)

1 Load data

The data for this vignette can be downloaded from here and placed in a data/ directory. You should then have two files:

data/evodevo.RData containing the expression data used as input
data/evodevo_model.hdf5 containing the pre-trained MEFISTO model

We first load the dataframe that contains the gene expression data for all organs (views) and species (groups) as well as the (unmatched) time annotation for each sample.

load("data/evodevo.RData")
head(evodevo)

## # A tibble: 6 x 6
##   group view  sample      feature               value  time
##   <chr> <chr> <chr>       <chr>                 <dbl> <dbl>
## 1 Human Brain 10wpc_Human ENSG00000000457_Brain  8.57     7
## 2 Human Brain 10wpc_Human ENSG00000001084_Brain  8.88     7
## 3 Human Brain 10wpc_Human ENSG00000001167_Brain 11.3      7
## 4 Human Brain 10wpc_Human ENSG00000001461_Brain  7.37     7
## 5 Human Brain 10wpc_Human ENSG00000001561_Brain  7.31     7
## 6 Human Brain 10wpc_Human ENSG00000001617_Brain  9.96     7

2 Prepare and train MOFA

Create the MOFA object

First, we need to create a MOFA object from this data. This step is analogous to use of MOFA without the time information and can be done using create_mofa, which results in an untrained MOFA object.

# create the MOFA object and add times as covariate
MOFAobject_untrained <- create_mofa(data = evodevo)
MOFAobject_untrained

## Untrained MOFA model with the following characteristics: 
##  Number of views: 5 
##  Views names: Brain Cerebellum Heart Liver Testis 
##  Number of features (per view): 7696 7696 7696 7696 7696 
##  Number of groups: 5 
##  Groups names: Human Mouse Opossum Rabbit Rat 
##  Number of samples (per group): 23 14 15 15 16 
##

Next, we want to add the time information for each sample, which we can do using set_covariates. As the time is already contained in the data.frame that we passed to the MOFAobject in create_mofa, we can just specify to use this column as covariate. Alternatively, we could also supply a new matrix or data frame providing the sample names and covariates. See ?set_covariates.

head(samples_metadata(MOFAobject_untrained))

##                  sample group time
## 10wpc_Human 10wpc_Human Human    7
## 11wpc_Human 11wpc_Human Human    8
## 12wpc_Human 12wpc_Human Human    9
## 13wpc_Human 13wpc_Human Human   10
## 16wpc_Human 16wpc_Human Human   11
## 18wpc_Human 18wpc_Human Human   12

MOFAobject_untrained <- set_covariates(MOFAobject_untrained, covariates = "time")
MOFAobject_untrained

## Untrained MEFISTO model with the following characteristics: 
##  Number of views: 5 
##  Views names: Brain Cerebellum Heart Liver Testis 
##  Number of features (per view): 7696 7696 7696 7696 7696 
##  Number of groups: 5 
##  Groups names: Human Mouse Opossum Rabbit Rat 
##  Number of samples (per group): 23 14 15 15 16 
##  Number of covariates per sample: 1 
##

Show data and time covariate Before moving towards the training, let’s first take a look at the data in the untrained object. The plot shows for species (groups) which organ (view) are available for which samples. The right plots show the time values for each sample.

gg_input <- plot_data_overview(MOFAobject_untrained,
                               show_covariate = TRUE,
                               show_dimensions = TRUE) 
gg_input

Prepare the MOFA object Next, we prepare the model for training. For this we specify the different options. Similar to the standard MOFA workflow we define data, model and training options. Take a look at get_default_data_options, get_default_model_options and get_default_training_options to see what options can be specified. Here, we specify that we want to use 5 factors and set a seed for reproducibility.

In addition, for MEFISTO we can specify MEFISTO-specific options, see get_default_mefisto_options. Here, we specify that we want to align the times across species using Mouse as a reference group. Furthermore, we can optionally specify time points (when using warping these correspond to times in the reference group), for which to interpolate the factors in all groups.

data_opts <- get_default_data_options(MOFAobject_untrained)

model_opts <- get_default_model_options(MOFAobject_untrained)
model_opts$num_factors <- 5

train_opts <- get_default_training_options(MOFAobject_untrained)
train_opts$seed <- 2020
train_opts$convergence_mode <- "medium" # use "fast" for faster training

mefisto_opts <- get_default_mefisto_options(MOFAobject_untrained)
mefisto_opts$warping <- TRUE
mefisto_opts$warping_ref <- "Mouse"
mefisto_opts$new_values <- 1:14

We then pass all these options to the object.

MOFAobject_untrained <- prepare_mofa(
  object = MOFAobject_untrained,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts,
  mefisto_options = mefisto_opts
)

Train the MOFA model Now, we are ready to train the model.

outfile <- "data/evodevo_model.hdf5"
# MOFAobject <- run_mofa(MOFAobject_untrained, outfile = outfile)

As this might take ~15 min, for all further analysis we will use a pre-trained model. In addition to the factor values at the observed data points we can optionally load the interpolated data.

MOFAobject <- load_model(outfile, load_interpol_Z = TRUE)

# We can add a species column to the metadata
samples_metadata(MOFAobject) %<>% mutate(species = group)

3 Factor overview and visualization

3.1 Variance decomposition and factor correlation

To obtain a first overview of the factors we can take a look at the variance that a factor explains in each organ and species.

p <- plot_variance_explained(MOFAobject, plot_total = T)
p[[1]] + theme(axis.text.x = element_text(angle = 90))

p[[2]] + theme(axis.text.x = element_text(angle = 90))

3.2 Factors versus developmental time (before/after alignement)

To look at the inferred factors, we can plot them against the developmental stages. Before alignment, the timing of common developmental process does not match, after alignment, we see a clearer match.

# Factors versus developmental time (before alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
                    covariate = "time",  warped = FALSE)

# Factors versus developmental time (after alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
                    covariate = "time", scale = FALSE)

3.3 Interpolation

Using the underlying Gaussian process for each factor we can interpolate to unseen time points for species that are missing data in these time points or intermediate time points. If you want to show uncertainties, set only_mean to FALSE. Note that these are only available if the new values for interpolation have been specified before training.

plot_interpolation_vs_covariate(MOFAobject, only_mean = FALSE)

# if the interpolation was not specified in mefisto_options before training
# or you want to interpolate to new values you can use interpolate_factors
# this however does not provide uncertainties
MOFAobject_new <- interpolate_factors(MOFAobject,
                                  new_values = seq(1, 14, 0.1))

## We recommend doing interpolation from python where additionally uncertainties are provided for the interpolation.

## Warning in interpolate_factors(MOFAobject, new_values = seq(1, 14, 0.1)): Object
## already contains interpolated factor values, overwriting it.

plot_interpolation_vs_covariate(MOFAobject_new)

3.4 Factor scatterplot

A scatterplot of the first two factors shows the embedding of timepoint-species combination in the latent space. Again we see, that the aligned times show a gradual development along the developmental trajectory captured by the conserved developmental programs on Factors 1 and 2.

plot_factors(MOFAobject, 1:2, color_by = "time_warped")

plot_factors(MOFAobject, 1:2, color_by = "time")

plot_factors(MOFAobject, 1:2, color_by = "species")

4 Alignment

Next, we further look into the alignment that was learnt by the model by plotting the mapping of mouse developmental stages to the stages of other species.

plot_alignment(MOFAobject)

5 Smoothness and sharedness of factors: Investigating the learnt hyper parameters

In addition to the factor values and the alignment the model also inferred an underlying Gaussian process that generated these values. By looking into it we can extract information on the smoothness of each factor, i.e. how smoothly it varies along developmental time, as well as the sharedness of each factor, i.e. how much the species (groups) show the same underlying developmental pattern and how the shape of their developmental trajectory related to a given developmental module (Factor) clusters between species.

5.1 Smoothness score

The scale parameters of the Gaussian process capture the smoothness of the model. A scale of 1 indicates high smoothness, a scale of 0 variation independent of time. Here all factors show a very high degree of smoothness.

get_scales(MOFAobject)

##   Factor1   Factor2   Factor3   Factor4   Factor5 
## 0.9182398 0.8032560 0.8039605 0.8853431 0.9534674

For visualization we visualize the scale of the Gaussian process as a smoothness score in a barplot for each factor. No color indicates no smoothness, a fully colored bar a very smooth factor.

plot_smoothness(MOFAobject)

5.2 Sharedness

The group kernel of the Gaussian process can give us insights into the extent to which a temporal pattern is shared across species for the developmental module captured by each process. Here, we see a strong clustering of the first processes, indicating that these are conserved in evolution of the species, while the last two processes show distinct clusters.

# We can use the following, below we make some more custom plot
plot_group_kernel(MOFAobject, factors = "all",
                  cellheight =  15, cellwidth = 15,
                  treeheight_col = 3, treeheight_row = 3)

For visualization we calculate an overall sharedness score per factor between 0 and 1. No color indicates no sharedness, a fully colored bar a fully shared factor.

plot_sharedness(MOFAobject)

6 Factor weights: Gene set enrichment analysis and inspection of genes with top weights per factor

6.1 Inspection of top genes per factor

To obtain insights into the molecular signatures associated with each factor in each organ, we can inspect the genes that have high weights on the factors as well as carry out an enrichment analysis to obtain gene sets that are associated to the process.

plot_top_weights(MOFAobject, view = "Brain", factors = 1)

Let’s take a look at these genes over the developmental time.

Factor 1 captures genes with a smooth up/downregulation of expression across all species.

plot_data_vs_cov(MOFAobject, factor = 1,
                 view = "Brain",
                 features = 3)

Factor 3 captures a sharp expression change in testis.

plot_data_vs_cov(MOFAobject, factor = 3,
                 view ="Testis",
                 features = 3)

For factor 4, we see that human expression trajectories stand out.

plot_data_vs_cov(MOFAobject, factor = 4,
                 view ="Liver",
                 features = 3)

6.2 Gene set enrichment analysis

To aggregate information on factors from individual genes to gene sets we can use enrichment analysis on each factor and for each organ by running run_enrichment. We illustrate this here for Brain, for other organs you can do this in an analogous manner.

See our MOFA tutorial on enrichment analysis for details.

data("reactomeGS", package = "MOFAdata")
reactomeGS <- reactomeGS[!is.na(rownames(reactomeGS)),]

# we rename the feature to match the gene names in the gene sets
# by removing the organ parts
MOFAobject_renamed <- MOFAobject
features_names(MOFAobject_renamed) <- lapply(features_names(MOFAobject_renamed),
                                             function(l) gsub("_.*","", l))

res <- run_enrichment(MOFAobject_renamed,
    view = "Brain", factors = "all",
    feature.sets = reactomeGS,
    statistical.test = "parametric")

## Intersecting features names in the model and the gene set annotation results in a total of 7682 features.

## 
## Running feature set Enrichment Analysis with the following options...
## View: Brain 
## Number of feature sets: 890 
## Set statistic: mean.diff 
## Statistical test: parametric

##

# show top 10 pathways for Factor 1
plot_enrichment(res, factor = 1, max.pathways = 10)

Session Info

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4   cowplot_1.1.0    forcats_0.5.0    stringr_1.4.0   
##  [5] dplyr_1.0.2      purrr_0.3.4      readr_1.4.0      tidyr_1.1.2     
##  [9] tibble_3.0.4     ggplot2_3.3.2    tidyverse_1.3.0  magrittr_2.0.1  
## [13] MOFA2_1.1.15     BiocStyle_2.18.1
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.57.0   fs_1.5.0             lubridate_1.7.9.2   
##  [4] filelock_1.0.2       RColorBrewer_1.1-2   httr_1.4.2          
##  [7] tools_4.0.3          backports_1.2.1      utf8_1.1.4          
## [10] R6_2.5.0             HDF5Array_1.18.0     uwot_0.1.10         
## [13] DBI_1.1.0            BiocGenerics_0.36.0  colorspace_2.0-0    
## [16] rhdf5filters_1.2.0   withr_2.3.0          tidyselect_1.1.0    
## [19] compiler_4.0.3       cli_2.2.0            rvest_0.3.6         
## [22] basilisk.utils_1.2.1 xml2_1.3.2           DelayedArray_0.16.0 
## [25] labeling_0.4.2       bookdown_0.21        scales_1.1.1        
## [28] rappdirs_0.3.1       digest_0.6.27        rmarkdown_2.6       
## [31] basilisk_1.2.1       pkgconfig_2.0.3      htmltools_0.5.0     
## [34] MatrixGenerics_1.2.0 dbplyr_2.0.0         rlang_0.4.9         
## [37] readxl_1.3.1         rstudioapi_0.13      farver_2.0.3        
## [40] generics_0.1.0       jsonlite_1.7.2       Matrix_1.2-18       
## [43] Rcpp_1.0.5           munsell_0.5.0        S4Vectors_0.28.1    
## [46] Rhdf5lib_1.12.0      fansi_0.4.1          reticulate_1.18     
## [49] lifecycle_0.2.0      stringi_1.5.3        yaml_2.2.1          
## [52] rhdf5_2.34.0         Rtsne_0.15           plyr_1.8.6          
## [55] grid_4.0.3           parallel_4.0.3       ggrepel_0.9.0       
## [58] crayon_1.3.4         lattice_0.20-41      haven_2.3.1         
## [61] hms_0.5.3            magick_2.5.2         knitr_1.30          
## [64] pillar_1.4.7         stats4_4.0.3         reprex_0.3.0        
## [67] glue_1.4.2           evaluate_0.14        BiocManager_1.30.10 
## [70] modelr_0.1.8         vctrs_0.3.6          cellranger_1.1.0    
## [73] gtable_0.3.0         assertthat_0.2.1     xfun_0.19           
## [76] broom_0.7.3          pheatmap_1.0.12      IRanges_2.24.1      
## [79] corrplot_0.84        ellipsis_0.3.1

MEFISTO application to evodevo data

2021-02-15