library(MOFA2)
library(magrittr)
library(tidyverse)
library(cowplot)
library(reshape2)
The data for this vignette can be downloaded from here and placed in a data/
directory. You should then have two files:
data/evodevo.RData
containing the expression data used as inputdata/evodevo_model.hdf5
containing the pre-trained MEFISTO modelWe first load the dataframe that contains the gene expression data for all organs (views) and species (groups) as well as the (unmatched) time annotation for each sample.
load("data/evodevo.RData")
head(evodevo)
## # A tibble: 6 x 6
## group view sample feature value time
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Human Brain 10wpc_Human ENSG00000000457_Brain 8.57 7
## 2 Human Brain 10wpc_Human ENSG00000001084_Brain 8.88 7
## 3 Human Brain 10wpc_Human ENSG00000001167_Brain 11.3 7
## 4 Human Brain 10wpc_Human ENSG00000001461_Brain 7.37 7
## 5 Human Brain 10wpc_Human ENSG00000001561_Brain 7.31 7
## 6 Human Brain 10wpc_Human ENSG00000001617_Brain 9.96 7
Create the MOFA object
First, we need to create a MOFA object from this data. This step is analogous to use of MOFA without the time information and can be done using create_mofa
, which results in an untrained MOFA object.
# create the MOFA object and add times as covariate
MOFAobject_untrained <- create_mofa(data = evodevo)
MOFAobject_untrained
## Untrained MOFA model with the following characteristics:
## Number of views: 5
## Views names: Brain Cerebellum Heart Liver Testis
## Number of features (per view): 7696 7696 7696 7696 7696
## Number of groups: 5
## Groups names: Human Mouse Opossum Rabbit Rat
## Number of samples (per group): 23 14 15 15 16
##
Next, we want to add the time information for each sample, which we can do using set_covariates
. As the time is already contained in the data.frame that we passed to the MOFAobject in create_mofa
, we can just specify to use this column as covariate. Alternatively, we could also supply a new matrix or data frame providing the sample names and covariates. See ?set_covariates
.
head(samples_metadata(MOFAobject_untrained))
## sample group time
## 10wpc_Human 10wpc_Human Human 7
## 11wpc_Human 11wpc_Human Human 8
## 12wpc_Human 12wpc_Human Human 9
## 13wpc_Human 13wpc_Human Human 10
## 16wpc_Human 16wpc_Human Human 11
## 18wpc_Human 18wpc_Human Human 12
MOFAobject_untrained <- set_covariates(MOFAobject_untrained, covariates = "time")
MOFAobject_untrained
## Untrained MEFISTO model with the following characteristics:
## Number of views: 5
## Views names: Brain Cerebellum Heart Liver Testis
## Number of features (per view): 7696 7696 7696 7696 7696
## Number of groups: 5
## Groups names: Human Mouse Opossum Rabbit Rat
## Number of samples (per group): 23 14 15 15 16
## Number of covariates per sample: 1
##
Show data and time covariate Before moving towards the training, let’s first take a look at the data in the untrained object. The plot shows for species (groups) which organ (view) are available for which samples. The right plots show the time values for each sample.
gg_input <- plot_data_overview(MOFAobject_untrained,
show_covariate = TRUE,
show_dimensions = TRUE)
gg_input
Prepare the MOFA object
Next, we prepare the model for training. For this we specify the different options. Similar to the standard MOFA workflow we define data, model and training options. Take a look at get_default_data_options
, get_default_model_options
and get_default_training_options
to see what options can be specified. Here, we specify that we want to use 5 factors and set a seed for reproducibility.
In addition, for MEFISTO we can specify MEFISTO-specific options, see get_default_mefisto_options
. Here, we specify that we want to align the times across species using Mouse as a reference group. Furthermore, we can optionally specify time points (when using warping these correspond to times in the reference group), for which to interpolate the factors in all groups.
data_opts <- get_default_data_options(MOFAobject_untrained)
model_opts <- get_default_model_options(MOFAobject_untrained)
model_opts$num_factors <- 5
train_opts <- get_default_training_options(MOFAobject_untrained)
train_opts$seed <- 2020
train_opts$convergence_mode <- "medium" # use "fast" for faster training
mefisto_opts <- get_default_mefisto_options(MOFAobject_untrained)
mefisto_opts$warping <- TRUE
mefisto_opts$warping_ref <- "Mouse"
mefisto_opts$new_values <- 1:14
We then pass all these options to the object.
MOFAobject_untrained <- prepare_mofa(
object = MOFAobject_untrained,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts,
mefisto_options = mefisto_opts
)
Train the MOFA model Now, we are ready to train the model.
outfile <- "data/evodevo_model.hdf5"
# MOFAobject <- run_mofa(MOFAobject_untrained, outfile = outfile)
As this might take ~15 min, for all further analysis we will use a pre-trained model. In addition to the factor values at the observed data points we can optionally load the interpolated data.
MOFAobject <- load_model(outfile, load_interpol_Z = TRUE)
# We can add a species column to the metadata
samples_metadata(MOFAobject) %<>% mutate(species = group)
To obtain a first overview of the factors we can take a look at the variance that a factor explains in each organ and species.
p <- plot_variance_explained(MOFAobject, plot_total = T)
p[[1]] + theme(axis.text.x = element_text(angle = 90))
p[[2]] + theme(axis.text.x = element_text(angle = 90))
To look at the inferred factors, we can plot them against the developmental stages. Before alignment, the timing of common developmental process does not match, after alignment, we see a clearer match.
# Factors versus developmental time (before alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
covariate = "time", warped = FALSE)
# Factors versus developmental time (after alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
covariate = "time", scale = FALSE)
Using the underlying Gaussian process for each factor we can interpolate to unseen time points for species that are missing data in these time points or intermediate time points. If you want to show uncertainties, set only_mean to FALSE. Note that these are only available if the new values for interpolation have been specified before training.
plot_interpolation_vs_covariate(MOFAobject, only_mean = FALSE)
# if the interpolation was not specified in mefisto_options before training
# or you want to interpolate to new values you can use interpolate_factors
# this however does not provide uncertainties
MOFAobject_new <- interpolate_factors(MOFAobject,
new_values = seq(1, 14, 0.1))
## We recommend doing interpolation from python where additionally uncertainties are provided for the interpolation.
## Warning in interpolate_factors(MOFAobject, new_values = seq(1, 14, 0.1)): Object
## already contains interpolated factor values, overwriting it.
plot_interpolation_vs_covariate(MOFAobject_new)
A scatterplot of the first two factors shows the embedding of timepoint-species combination in the latent space. Again we see, that the aligned times show a gradual development along the developmental trajectory captured by the conserved developmental programs on Factors 1 and 2.
plot_factors(MOFAobject, 1:2, color_by = "time_warped")
plot_factors(MOFAobject, 1:2, color_by = "time")
plot_factors(MOFAobject, 1:2, color_by = "species")
Next, we further look into the alignment that was learnt by the model by plotting the mapping of mouse developmental stages to the stages of other species.
plot_alignment(MOFAobject)
To obtain insights into the molecular signatures associated with each factor in each organ, we can inspect the genes that have high weights on the factors as well as carry out an enrichment analysis to obtain gene sets that are associated to the process.
plot_top_weights(MOFAobject, view = "Brain", factors = 1)
Let’s take a look at these genes over the developmental time.
Factor 1 captures genes with a smooth up/downregulation of expression across all species.
plot_data_vs_cov(MOFAobject, factor = 1,
view = "Brain",
features = 3)
Factor 3 captures a sharp expression change in testis.
plot_data_vs_cov(MOFAobject, factor = 3,
view ="Testis",
features = 3)
For factor 4, we see that human expression trajectories stand out.
plot_data_vs_cov(MOFAobject, factor = 4,
view ="Liver",
features = 3)
To aggregate information on factors from individual genes to gene sets we can use enrichment analysis on each factor and for each organ by running run_enrichment
. We illustrate this here for Brain, for other organs you can do this in an analogous manner.
See our MOFA tutorial on enrichment analysis for details.
data("reactomeGS", package = "MOFAdata")
reactomeGS <- reactomeGS[!is.na(rownames(reactomeGS)),]
# we rename the feature to match the gene names in the gene sets
# by removing the organ parts
MOFAobject_renamed <- MOFAobject
features_names(MOFAobject_renamed) <- lapply(features_names(MOFAobject_renamed),
function(l) gsub("_.*","", l))
res <- run_enrichment(MOFAobject_renamed,
view = "Brain", factors = "all",
feature.sets = reactomeGS,
statistical.test = "parametric")
## Intersecting features names in the model and the gene set annotation results in a total of 7682 features.
##
## Running feature set Enrichment Analysis with the following options...
## View: Brain
## Number of feature sets: 890
## Set statistic: mean.diff
## Statistical test: parametric
##
# show top 10 pathways for Factor 1
plot_enrichment(res, factor = 1, max.pathways = 10)
Session Info
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4 cowplot_1.1.0 forcats_0.5.0 stringr_1.4.0
## [5] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
## [9] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 magrittr_2.0.1
## [13] MOFA2_1.1.15 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] matrixStats_0.57.0 fs_1.5.0 lubridate_1.7.9.2
## [4] filelock_1.0.2 RColorBrewer_1.1-2 httr_1.4.2
## [7] tools_4.0.3 backports_1.2.1 utf8_1.1.4
## [10] R6_2.5.0 HDF5Array_1.18.0 uwot_0.1.10
## [13] DBI_1.1.0 BiocGenerics_0.36.0 colorspace_2.0-0
## [16] rhdf5filters_1.2.0 withr_2.3.0 tidyselect_1.1.0
## [19] compiler_4.0.3 cli_2.2.0 rvest_0.3.6
## [22] basilisk.utils_1.2.1 xml2_1.3.2 DelayedArray_0.16.0
## [25] labeling_0.4.2 bookdown_0.21 scales_1.1.1
## [28] rappdirs_0.3.1 digest_0.6.27 rmarkdown_2.6
## [31] basilisk_1.2.1 pkgconfig_2.0.3 htmltools_0.5.0
## [34] MatrixGenerics_1.2.0 dbplyr_2.0.0 rlang_0.4.9
## [37] readxl_1.3.1 rstudioapi_0.13 farver_2.0.3
## [40] generics_0.1.0 jsonlite_1.7.2 Matrix_1.2-18
## [43] Rcpp_1.0.5 munsell_0.5.0 S4Vectors_0.28.1
## [46] Rhdf5lib_1.12.0 fansi_0.4.1 reticulate_1.18
## [49] lifecycle_0.2.0 stringi_1.5.3 yaml_2.2.1
## [52] rhdf5_2.34.0 Rtsne_0.15 plyr_1.8.6
## [55] grid_4.0.3 parallel_4.0.3 ggrepel_0.9.0
## [58] crayon_1.3.4 lattice_0.20-41 haven_2.3.1
## [61] hms_0.5.3 magick_2.5.2 knitr_1.30
## [64] pillar_1.4.7 stats4_4.0.3 reprex_0.3.0
## [67] glue_1.4.2 evaluate_0.14 BiocManager_1.30.10
## [70] modelr_0.1.8 vctrs_0.3.6 cellranger_1.1.0
## [73] gtable_0.3.0 assertthat_0.2.1 xfun_0.19
## [76] broom_0.7.3 pheatmap_1.0.12 IRanges_2.24.1
## [79] corrplot_0.84 ellipsis_0.3.1