This vignette demonstrates how to do Gene Set Enrichment Analysis on the MOFA factors.
We consider a single-cell RNA-seq data set where 16,152 cells were isolated from a total of 8 mouse embryos from developmental stages E6.5, E6.75, E7.0 and E7.25 (two embryos per stage), spanning post-implantation and early gastrulation. The original data set can be visualised and downloaded from here.
The vignette where the MOFA model is trained on this data set can be found here
Sometimes factors cannot be easily characterised by simply inspecting the genes with the largest weight in each factor. Sometimes it is useful to combine information across genes and work instead with gene sets (i.e. biological pathways).
These are the steps for doing Gene Set Enrichment Analysis (GSEA) with MOFA:
j
belongs to pathway i
. A value of 0 indicates elsewise.mean.diff
(difference in the mean weight between foreground and background) or rank.sum
(difference in the sum of ranks between foreground and background).parametric
(a simple and very liberal parametric t-test), cor.adj.parametric
(parametric t-test adjusted by the correlation between features), permutation
(unparametric, the null distribution is created by permuting the weights. This option is computationally expensive, but it preserves the correlation structure between features in the data.).For more details, we refer the reader to the following paper: Principal component gene set enrichment (PCGSE). The function we implemented is based on the function with some modifications.
library(data.table)
library(purrr)
library(ggplot2)
library(cowplot)
library(MOFAdata)
library(MOFA2)
There are a large number of gene set annotations, and the right one to use will depend on your data set. Some generic and commonly used ones are MSigDB, Reactome and Gene Ontology.
We have manually processed some gene sets, which can be found in the MOFAdata package:
Reactome (human):
data("reactomeGS")
head(rownames(reactomeGS), n=3)
## [1] "Interleukin-6 signaling" "Apoptosis"
## [3] "Hemostasis"
head(colnames(reactomeGS), n=3)
## [1] "ENSG00000187634" "ENSG00000188976" "ENSG00000187961"
MSigDB 6.0 (human):
# C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
data("MSigDB_v6.0_C2_human")
# C5: extracted from the Gene Ontology data.base
data("MSigDB_v6.0_C5_human")
head(rownames(MSigDB_v6.0_C2_human), n=3)
## [1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS" "KEGG_CITRATE_CYCLE_TCA_CYCLE"
## [3] "KEGG_PENTOSE_PHOSPHATE_PATHWAY"
head(colnames(MSigDB_v6.0_C2_human), n=3)
## [1] "ENSG00000186092" "ENSG00000237683" "ENSG00000235249"
MSigDB 6.0 (mouse):
# C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
data("MSigDB_v6.0_C2_mouse")
# C5: extracted from the Gene Ontology data.base
data("MSigDB_v6.0_C5_mouse")
head(rownames(MSigDB_v6.0_C2_mouse), n=3)
## [1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS" "KEGG_CITRATE_CYCLE_TCA_CYCLE"
## [3] "KEGG_PENTOSE_PHOSPHATE_PATHWAY"
head(colnames(MSigDB_v6.0_C2_mouse), n=3)
## [1] "XKR4" "RP1" "SOX17"
Give them a shot, but it is likely that you will have to create a tailored gene set for your application. Also, If you have an annotation that other people could benefit from, please let us know and we will upload it.
load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scrna_gastrulation/gastrulation10x_mofa.RData"))
#model <- readRDS("/Users/ricard/data/gastrulation10x_mofa/model.rds")
model
## Trained MOFA with the following characteristics:
## Number of views: 1
## Views names: RNA
## Number of features (per view): 5000
## Number of groups: 6
## Groups names: E6.5 (1) E6.5 (2) E7.0 (1) E7.0 (2) E7.25 (1) E7.25 (2)
## Number of samples (per group): 337 1106 2152 1135 5537 5885
## Number of factors: 10
For this example we will use the MSigDB_v6.0_C5_mouse gene set annotations, which are derived from the Gene Ontology data base.
First, we need to match the gene names in the MOFA object to the gene names in the gene set annotation. For this we just have to capitalise the gene names in MOFA:
features_names(model)[["RNA"]] <- toupper(features_names(model)[["RNA"]])
head(features_names(model)[["RNA"]])
## [1] "SOX17" "OPRK1" "NPBWR1" "SGK3" "MCMDC2" "PPP1R42"
An important consideration when running GSEA is that MOFA has positive and negative weights. The features with positive weights are “high” in the samples with positive factor values, whereas the features with negative weights are “high” in the samples with negative factor values.
Taking this into account, you may want to do GSEA specifically with the positive weights or with the negative weights. Merging them and taking the absolute value could dilute the signal. Hence, we recommend the user to do GSEA separately for (+) and (-) weights as well as jointly with all weights.
As a demonstration, we will run GSEA with a simple parametric t-test only using features with negative weights. For simplicity we’ll use Factors 1 to 3:
enrichment.parametric <- run_enrichment(model,
view = "RNA", factors = 1:3,
feature.sets = MSigDB_v6.0_C5_mouse,
sign = "negative",
statistical.test = "parametric"
)
The enrichment analysis returns a list of 5 elements:
names(enrichment.parametric)
## [1] "feature.sets" "pval" "pval.adj"
## [4] "feature.statistics" "set.statistics" "sigPathways"
Let’s explore the output:
enrichment.parametric$set.statistics[1:5,1]
## GO_CARDIAC_CHAMBER_DEVELOPMENT
## -1.1974283
## GO_CIRCADIAN_RHYTHM
## -0.3816839
## GO_SPINAL_CORD_DEVELOPMENT
## -0.8425902
## GO_PLATELET_DERIVED_GROWTH_FACTOR_RECEPTOR_SIGNALING_PATHWAY
## -0.7649083
## GO_POSITIVE_REGULATION_OF_EPITHELIAL_CELL_DIFFERENTIATION
## -1.0308289
enrichment.parametric$pval.adj[1:5,1]
## GO_CARDIAC_CHAMBER_DEVELOPMENT
## 0.7114479
## GO_CIRCADIAN_RHYTHM
## 0.8347303
## GO_SPINAL_CORD_DEVELOPMENT
## 0.7360531
## GO_PLATELET_DERIVED_GROWTH_FACTOR_RECEPTOR_SIGNALING_PATHWAY
## 0.7360531
## GO_POSITIVE_REGULATION_OF_EPITHELIAL_CELL_DIFFERENTIATION
## 0.7360531
There are three main functions to plot the results of the gene set enrichment analysis:
The heatmap shows that all three factors have some enriched pathways
plot_enrichment_heatmap(enrichment.parametric)
The lineplot shows that Factor 1 is enriched for lipid metabolism
plot_enrichment(enrichment.parametric,
factor = 1,
max.pathways = 15
)
This plot shows that all enriched pathways are driven by the same genes: Apolipoproteins such as APOE, APOA1 and APOM.
plot_enrichment_detailed(enrichment.parametric,
factor = 1,
max.genes = 8,
max.pathways = 5
)
The results above suggest that Apolipoproteins are highly expressed in samples with negative factor values. Let’s check so:
genes <- list("APOA1","APOE")
genes %>% map(~ plot_factors(model,
factors = c(1,2),
color_by = .,
scale = T,
legend = F
)) %>% cowplot::plot_grid(plotlist=., nrow=1)
Let’s run the correlation-adjusted parametric test:
enrichment.parametric.adj <- run_enrichment(model,
view = "RNA", factors = 1:5,
feature.sets = MSigDB_v6.0_C5_mouse,
sign = "negative",
statistical.test = "cor.adj.parametric"
)
Compare the histogram of the p-values. Clearly the correlation-adjusted test results in more conservative estimates:
dt <- rbind(
enrichment.parametric$pval[,1:3] %>% as.data.table %>% .[,c("test","pathway"):=list("parametric",1:.N)],
enrichment.parametric.adj$pval[,1:3] %>% as.data.table %>% .[,c("test","pathway"):=list("parametric.adj",1:.N)]
) %>% melt(id.vars=c("test","pathway"), variable.name="factor")
ggplot(dt, aes(x=value, fill=test)) +
facet_wrap(~factor, scales="free_y", nrow=1) +
geom_histogram() +
theme_bw() +
theme(
legend.position = "top",
legend.title = element_blank()
)
Yet, the resulting p-values are highlighy correlated between the two tests, as expected:
dt2 <- dt %>% dcast(factor+pathway~test)
ggplot(dt2, aes(x=parametric, y=parametric.adj)) +
geom_point(size=0.5) +
geom_abline(slope=1, intercept=0, color="orange") +
facet_wrap(~factor, scales="free_y", nrow=1) +
labs(x="Parametric p-value", y="Adjusted parametric p-value") +
theme_bw() +
theme(
legend.position = "top"
)
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MOFA2_0.99.0 MOFAdata_1.4.0 cowplot_1.0.0 ggplot2_3.3.2
## [5] purrr_0.3.4 data.table_1.12.8 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.8.2 Rcpp_1.0.4.6 lattice_0.20-41
## [4] tidyr_1.1.0 digest_0.6.25 R6_2.4.1
## [7] plyr_1.8.6 stats4_4.0.0 evaluate_0.14
## [10] pillar_1.4.4 rlang_0.4.6 magick_2.4.0
## [13] S4Vectors_0.26.1 Matrix_1.2-18 reticulate_1.16
## [16] rmarkdown_2.3 labeling_0.3 stringr_1.4.0
## [19] pheatmap_1.0.12 munsell_0.5.0 DelayedArray_0.14.0
## [22] HDF5Array_1.16.1 compiler_4.0.0 vipor_0.4.5
## [25] xfun_0.15 pkgconfig_2.0.3 BiocGenerics_0.34.0
## [28] ggbeeswarm_0.6.0 htmltools_0.5.0 tidyselect_1.1.0
## [31] tibble_3.0.1 bookdown_0.20 IRanges_2.22.2
## [34] matrixStats_0.56.0 crayon_1.3.4 dplyr_1.0.0
## [37] withr_2.2.0 grid_4.0.0 jsonlite_1.7.0
## [40] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
## [43] scales_1.1.1 stringi_1.4.6 farver_2.0.3
## [46] reshape2_1.4.4 ellipsis_0.3.1 generics_0.0.2
## [49] vctrs_0.3.1 Rhdf5lib_1.10.0 RColorBrewer_1.1-2
## [52] tools_4.0.0 forcats_0.5.0 glue_1.4.1
## [55] beeswarm_0.2.3 parallel_4.0.0 yaml_2.2.1
## [58] colorspace_1.4-1 rhdf5_2.32.1 BiocManager_1.30.10
## [61] corrplot_0.84 knitr_1.29 ggrastr_0.1.9