1 Description
2 Why doing Gene Set Enrichment Analysis?
3 Load libraries
4 Where can I find gene set annotations?
5 Load pre-computed model
6 Run enrichment analysis
7 Plot results of enrichment analysis
8 Compare different statistical tests
9 sessionInfo

1 Description

This vignette demonstrates how to do Gene Set Enrichment Analysis on the MOFA factors.

We consider a single-cell RNA-seq data set where 16,152 cells were isolated from a total of 8 mouse embryos from developmental stages E6.5, E6.75, E7.0 and E7.25 (two embryos per stage), spanning post-implantation and early gastrulation. The original data set can be visualised and downloaded from here.

The vignette where the MOFA model is trained on this data set can be found here

2 Why doing Gene Set Enrichment Analysis?

Sometimes factors cannot be easily characterised by simply inspecting the genes with the largest weight in each factor. Sometimes it is useful to combine information across genes and work instead with gene sets (i.e. biological pathways).

These are the steps for doing Gene Set Enrichment Analysis (GSEA) with MOFA:

(1) Define your gene set matrix: this can be specified as a binary matrix where rows are gene sets and columns are genes. A value of 1 indicates that gene j belongs to pathway i. A value of 0 indicates elsewise.
(2) Select a gene set statistic: the statistic used to quantify the scores at the pathway level. Must be one of the following: mean.diff (difference in the mean weight between foreground and background) or rank.sum (difference in the sum of ranks between foreground and background).
(3) Select a statistical test: the statistical test used to compute the significance of the gene set statistics under a competitive null hypothesis. Must be one of the following: parametric (a simple and very liberal parametric t-test), cor.adj.parametric (parametric t-test adjusted by the correlation between features), permutation (unparametric, the null distribution is created by permuting the weights. This option is computationally expensive, but it preserves the correlation structure between features in the data.).

For more details, we refer the reader to the following paper: Principal component gene set enrichment (PCGSE). The function we implemented is based on the function with some modifications.

3 Load libraries

library(data.table)
library(purrr)
library(ggplot2)
library(cowplot)
library(MOFAdata)
library(MOFA2)

4 Where can I find gene set annotations?

There are a large number of gene set annotations, and the right one to use will depend on your data set. Some generic and commonly used ones are MSigDB, Reactome and Gene Ontology.

We have manually processed some gene sets, which can be found in the MOFAdata package:

Reactome (human):

data("reactomeGS")
head(rownames(reactomeGS), n=3)

## [1] "Interleukin-6 signaling" "Apoptosis"              
## [3] "Hemostasis"

head(colnames(reactomeGS), n=3)

## [1] "ENSG00000187634" "ENSG00000188976" "ENSG00000187961"

MSigDB 6.0 (human):

# C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
data("MSigDB_v6.0_C2_human") 

# C5: extracted from the Gene Ontology data.base
data("MSigDB_v6.0_C5_human") 

head(rownames(MSigDB_v6.0_C2_human), n=3)

## [1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS" "KEGG_CITRATE_CYCLE_TCA_CYCLE"   
## [3] "KEGG_PENTOSE_PHOSPHATE_PATHWAY"

head(colnames(MSigDB_v6.0_C2_human), n=3)

## [1] "ENSG00000186092" "ENSG00000237683" "ENSG00000235249"

MSigDB 6.0 (mouse):

# C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
data("MSigDB_v6.0_C2_mouse") 

# C5: extracted from the Gene Ontology data.base
data("MSigDB_v6.0_C5_mouse") 

head(rownames(MSigDB_v6.0_C2_mouse), n=3)

## [1] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS" "KEGG_CITRATE_CYCLE_TCA_CYCLE"   
## [3] "KEGG_PENTOSE_PHOSPHATE_PATHWAY"

head(colnames(MSigDB_v6.0_C2_mouse), n=3)

## [1] "XKR4"  "RP1"   "SOX17"

Give them a shot, but it is likely that you will have to create a tailored gene set for your application. Also, If you have an annotation that other people could benefit from, please let us know and we will upload it.

5 Load pre-computed model

 load(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/scrna_gastrulation/gastrulation10x_mofa.RData"))
#model <- readRDS("/Users/ricard/data/gastrulation10x_mofa/model.rds")

model

## Trained MOFA with the following characteristics: 
##  Number of views: 1 
##  Views names: RNA 
##  Number of features (per view): 5000 
##  Number of groups: 6 
##  Groups names: E6.5 (1) E6.5 (2) E7.0 (1) E7.0 (2) E7.25 (1) E7.25 (2) 
##  Number of samples (per group): 337 1106 2152 1135 5537 5885 
##  Number of factors: 10

For this example we will use the MSigDB_v6.0_C5_mouse gene set annotations, which are derived from the Gene Ontology data base.
First, we need to match the gene names in the MOFA object to the gene names in the gene set annotation. For this we just have to capitalise the gene names in MOFA:

features_names(model)[["RNA"]] <- toupper(features_names(model)[["RNA"]])
head(features_names(model)[["RNA"]])

## [1] "SOX17"   "OPRK1"   "NPBWR1"  "SGK3"    "MCMDC2"  "PPP1R42"

6 Run enrichment analysis

An important consideration when running GSEA is that MOFA has positive and negative weights. The features with positive weights are “high” in the samples with positive factor values, whereas the features with negative weights are “high” in the samples with negative factor values.
Taking this into account, you may want to do GSEA specifically with the positive weights or with the negative weights. Merging them and taking the absolute value could dilute the signal. Hence, we recommend the user to do GSEA separately for (+) and (-) weights as well as jointly with all weights.

As a demonstration, we will run GSEA with a simple parametric t-test only using features with negative weights. For simplicity we’ll use Factors 1 to 3:

enrichment.parametric <- run_enrichment(model,
  view = "RNA", factors = 1:3,
  feature.sets = MSigDB_v6.0_C5_mouse,
  sign = "negative",
  statistical.test = "parametric"
)

The enrichment analysis returns a list of 5 elements:

feature.sets: the feature set matrix filtered by the genes that overlap with the MOFA model.
pval: the nominal p-values.
pval.adj: the FDR-adjusted p-values.
feature.statistics: the feature statistics (i.e. the weights).
set.statistics: matrices with the gene set statistics.
sigPathways: list with significant pathways per factor at a specified FDR threshold

names(enrichment.parametric)

## [1] "feature.sets"       "pval"               "pval.adj"          
## [4] "feature.statistics" "set.statistics"     "sigPathways"

Let’s explore the output:

enrichment.parametric$set.statistics[1:5,1]

##                               GO_CARDIAC_CHAMBER_DEVELOPMENT 
##                                                   -1.1974283 
##                                          GO_CIRCADIAN_RHYTHM 
##                                                   -0.3816839 
##                                   GO_SPINAL_CORD_DEVELOPMENT 
##                                                   -0.8425902 
## GO_PLATELET_DERIVED_GROWTH_FACTOR_RECEPTOR_SIGNALING_PATHWAY 
##                                                   -0.7649083 
##    GO_POSITIVE_REGULATION_OF_EPITHELIAL_CELL_DIFFERENTIATION 
##                                                   -1.0308289

enrichment.parametric$pval.adj[1:5,1]

##                               GO_CARDIAC_CHAMBER_DEVELOPMENT 
##                                                    0.7114479 
##                                          GO_CIRCADIAN_RHYTHM 
##                                                    0.8347303 
##                                   GO_SPINAL_CORD_DEVELOPMENT 
##                                                    0.7360531 
## GO_PLATELET_DERIVED_GROWTH_FACTOR_RECEPTOR_SIGNALING_PATHWAY 
##                                                    0.7360531 
##    GO_POSITIVE_REGULATION_OF_EPITHELIAL_CELL_DIFFERENTIATION 
##                                                    0.7360531

7 Plot results of enrichment analysis

There are three main functions to plot the results of the gene set enrichment analysis:

plot_enrichment_heatmap: plot a heatmap of gene sets (rows) versus factors (columns) where each entry correspons to the log p-value. This is useful to get an overview on which factors show statistically enriched pathways.
plot_enrichment: plot the top significant pathways for a specific factor.
plot_enrichment_detailed: plot a detailed output, highlighting the genes that are contributing to the enrichment of each pathway.

The heatmap shows that all three factors have some enriched pathways

plot_enrichment_heatmap(enrichment.parametric)

The lineplot shows that Factor 1 is enriched for lipid metabolism

plot_enrichment(enrichment.parametric, 
  factor = 1, 
  max.pathways = 15
)

This plot shows that all enriched pathways are driven by the same genes: Apolipoproteins such as APOE, APOA1 and APOM.

plot_enrichment_detailed(enrichment.parametric, 
  factor = 1, 
  max.genes = 8, 
  max.pathways = 5
)

The results above suggest that Apolipoproteins are highly expressed in samples with negative factor values. Let’s check so:

genes <- list("APOA1","APOE")

genes %>% map(~ plot_factors(model, 
  factors = c(1,2), 
  color_by = ., 
  scale = T,
  legend = F
)) %>% cowplot::plot_grid(plotlist=., nrow=1)

8 Compare different statistical tests

Let’s run the correlation-adjusted parametric test:

enrichment.parametric.adj <- run_enrichment(model,
  view = "RNA", factors = 1:5,
  feature.sets = MSigDB_v6.0_C5_mouse,
  sign = "negative",
  statistical.test = "cor.adj.parametric"
)

Compare the histogram of the p-values. Clearly the correlation-adjusted test results in more conservative estimates:

dt <- rbind(
  enrichment.parametric$pval[,1:3] %>% as.data.table %>% .[,c("test","pathway"):=list("parametric",1:.N)],
  enrichment.parametric.adj$pval[,1:3] %>% as.data.table %>% .[,c("test","pathway"):=list("parametric.adj",1:.N)]
) %>% melt(id.vars=c("test","pathway"), variable.name="factor")

ggplot(dt, aes(x=value, fill=test)) +
  facet_wrap(~factor, scales="free_y", nrow=1) +
  geom_histogram() +
  theme_bw() +
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )

Yet, the resulting p-values are highlighy correlated between the two tests, as expected:

dt2 <- dt %>% dcast(factor+pathway~test)

ggplot(dt2, aes(x=parametric, y=parametric.adj)) +
  geom_point(size=0.5) +
  geom_abline(slope=1, intercept=0, color="orange") +
  facet_wrap(~factor, scales="free_y", nrow=1) +
  labs(x="Parametric p-value", y="Adjusted parametric p-value") +
  theme_bw() +
  theme(
    legend.position = "top"
  )

9 sessionInfo

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_0.99.0      MOFAdata_1.4.0    cowplot_1.0.0     ggplot2_3.3.2    
## [5] purrr_0.3.4       data.table_1.12.8 BiocStyle_2.16.0 
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.8.2       Rcpp_1.0.4.6        lattice_0.20-41    
##  [4] tidyr_1.1.0         digest_0.6.25       R6_2.4.1           
##  [7] plyr_1.8.6          stats4_4.0.0        evaluate_0.14      
## [10] pillar_1.4.4        rlang_0.4.6         magick_2.4.0       
## [13] S4Vectors_0.26.1    Matrix_1.2-18       reticulate_1.16    
## [16] rmarkdown_2.3       labeling_0.3        stringr_1.4.0      
## [19] pheatmap_1.0.12     munsell_0.5.0       DelayedArray_0.14.0
## [22] HDF5Array_1.16.1    compiler_4.0.0      vipor_0.4.5        
## [25] xfun_0.15           pkgconfig_2.0.3     BiocGenerics_0.34.0
## [28] ggbeeswarm_0.6.0    htmltools_0.5.0     tidyselect_1.1.0   
## [31] tibble_3.0.1        bookdown_0.20       IRanges_2.22.2     
## [34] matrixStats_0.56.0  crayon_1.3.4        dplyr_1.0.0        
## [37] withr_2.2.0         grid_4.0.0          jsonlite_1.7.0     
## [40] gtable_0.3.0        lifecycle_0.2.0     magrittr_1.5       
## [43] scales_1.1.1        stringi_1.4.6       farver_2.0.3       
## [46] reshape2_1.4.4      ellipsis_0.3.1      generics_0.0.2     
## [49] vctrs_0.3.1         Rhdf5lib_1.10.0     RColorBrewer_1.1-2 
## [52] tools_4.0.0         forcats_0.5.0       glue_1.4.1         
## [55] beeswarm_0.2.3      parallel_4.0.0      yaml_2.2.1         
## [58] colorspace_1.4-1    rhdf5_2.32.1        BiocManager_1.30.10
## [61] corrplot_0.84       knitr_1.29          ggrastr_0.1.9

MOFA+: tutorial on Gene Set Enrichment Analysis

2020-09-03

Contents