This vignette demonstrates how MOFA can be used to integrate scRNA-seq and scATAC-seq data from the Chromium Single Cell Multiome ATAC + Gene Expression assay recently commercialised by 10x Genomics. This vignette results from a collaboration between the 10x Genomics R&D team and the MOFA team. The data set consists of the conventional Peripheral Blood Mononuclear Cells (PBMC) from a single healthy donor, which is available here.
MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion. Intuitively, it can be viewed as a versatile and statistically rigorous generalisation of principal component analysis (PCA) to multi-omics data. Briefly, the model performs unsupervised dimensionality reduction simultaneously across multiple data modalities, thereby capturing the global sources of cell-to-cell variability via a small number of inferred factors. Importantly, it distinguishes variation that is shared between assays from variation that is unique to a specific assay. Thus, in this data set MOFA can be useful to disentangle the RNA and the ATAC activity of the different cellular populations that exist in PBMCs.
Make sure that MOFA2
is imported last, to avoid collisions with functions from other packages
library(data.table)
library(ggplot2)
library(Seurat)
library(Signac)
# for GSEA analysis
library(msigdbr)
# For motif enrichment analysis
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg38)
# MOFA
library(MOFA2)
We have created a Seurat object with RNA and ATAC data modalities stored as different assays. Here is the script to create the Seurat object from the CellRanger output
# run this to download the data set
# seurat <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/seurat.rds"))
seurat <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/seurat.rds")
seurat
## An object of class Seurat
## 138109 features across 11909 samples within 2 assays
## Active assay: RNA (29732 features, 0 variable features)
## 1 other assay present: ATAC
The metadata slot contains the cell type annotations that have been done a priori by the 10x Genomics R&D team. This will be useful to characterise the MOFA factors. One could employ the MOFA factors to perform clustering and cell type annotation, but we will demonstrate this in a future vignette.
head(seurat@meta.data[,c("celltype","broad_celltype","pass_rnaQC","pass_accQC")])
## celltype broad_celltype pass_rnaQC pass_accQC
## AAACAGCCAAGGAATC naive CD4 T cells Lymphoid TRUE TRUE
## AAACAGCCAATCCCTT memory CD4 T cells Lymphoid TRUE TRUE
## AAACAGCCAATGCGCT naive CD4 T cells Lymphoid TRUE TRUE
## AAACAGCCACACTAAT <NA> <NA> FALSE FALSE
## AAACAGCCACCAACCG <NA> <NA> FALSE FALSE
## AAACAGCCAGGATAAC <NA> <NA> FALSE FALSE
table(seurat@meta.data$celltype)
##
## CD56 (bright) NK cells CD56 (dim) NK cells classical monocytes
## 407 472 1929
## effector CD8 T cells intermediate monocytes MAIT T cells
## 385 664 106
## memory B cells memory CD4 T cells myeloid DC
## 420 1611 242
## naive B cells naive CD4 T cells naive CD8 T cells
## 295 1462 1549
## non-classical monocytes plasmacytoid DC
## 383 107
table(seurat@meta.data$broad_celltype)
##
## Lymphoid Myeloid
## 6814 3218
Keep cells that pass QC for both omics
seurat <- seurat %>%
.[,seurat@meta.data$pass_accQC==TRUE & seurat@meta.data$pass_rnaQC==TRUE]
seurat
## An object of class Seurat
## 138109 features across 10032 samples within 2 assays
## Active assay: RNA (29732 features, 0 variable features)
## 1 other assay present: ATAC
The RNA expression consists of 29,732 genes and 10,032 cells
seurat@assays[["RNA"]]
## Assay data with 29732 features for 10032 cells
## First 10 features:
## AL627309.1, AL627309.2, AL627309.5, AL627309.4, AP006222.2, AL669831.2,
## LINC01409, FAM87B, LINC01128, LINC00115
The ATAC expression consists of 108,377 peaks and 10,032 cells
seurat@assays[["ATAC"]]
## Assay data with 108377 features for 10032 cells
## First 10 features:
## chr1:10109-10357, chr1:180730-181630, chr1:191491-191736,
## chr1:267816-268196, chr1:586028-586373, chr1:629721-630172,
## chr1:633793-634264, chr1:777634-779926, chr1:816881-817647,
## chr1:819912-823500
Collect a list of position-specific weight matrices (PWM) from the JASPAR database, we’ll use this in the downstream analysis.
pfm <- getMatrixSet(JASPAR2020,
opts = list(species = "Homo sapiens")
)
Feature metadata contains genomic information for both RNA expression and ATAC peaks
feature_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/filtered_feature_bc_matrix/features.tsv.gz") %>%
setnames(c("ens_id","gene","view","chr","start","end"))
Fetch RNA metadata
feature_metadata.rna <- feature_metadata[view=="Gene Expression"]
head(feature_metadata.rna,n=3)
## ens_id gene view chr start end
## 1: ENSG00000243485 MIR1302-2HG Gene Expression chr1 29553 30267
## 2: ENSG00000237613 FAM138A Gene Expression chr1 36080 36081
## 3: ENSG00000186092 OR4F5 Gene Expression chr1 65418 69055
Fetch ATAC metadata
feature_metadata.atac <- feature_metadata[view=="Peaks"] %>%
.[,ens_id:=NULL] %>% setnames("gene","peak")
head(feature_metadata.atac,n=3)
## peak view chr start end
## 1: chr1:10109-10357 Peaks chr1 10109 10357
## 2: chr1:180730-181630 Peaks chr1 180730 181630
## 3: chr1:191491-191736 Peaks chr1 191491 191736
Classify peaks into promoter-overlapping and distal
foo <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/atac_peak_annotation.tsv") %>%
.[,c("peak","peak_type")] %>%
.[peak_type%in%c("distal", "promoter")]
feature_metadata.atac <- feature_metadata.atac %>%
merge(foo,by="peak",all.x=TRUE)
table(feature_metadata.atac$peak_type)
##
## distal promoter
## 91822 16074
Split ATAC matrix depending on the peak type and create a ChromatinAssay
for each modality using the Signac package. This object requires a GRanges object with the peak metadata. We also provide an optional motif matrix that we’ll use for downstream analysis.
for (i in c("distal","promoter")) {
# Create GRanges
peaks.granges <- feature_metadata.atac %>%
.[peak_type==i] %>%
.[,c("chr","start","end","peak")] %>%
makeGRangesFromDataFrame(keep.extra.columns = TRUE, ignore.strand = TRUE)
# Scan motifs throughout the DNA sequence of each peak and create a binary matrix of motif-peak presence.
motif.matrix <- CreateMotifMatrix(
features = peaks.granges,
pwm = pfm,
genome = 'hg38',
use.counts = FALSE
) %>% as.matrix
# AddChromatinAssay to the Seurat object
seurat@assays[[paste0("ATAC_",i)]] <- CreateChromatinAssay(
seurat@assays$ATAC@counts[peaks.granges$peak,],
ranges = peaks.granges,
motifs = CreateMotifObject(motif.matrix, pfm)
)
}
seurat
## An object of class Seurat
## 245401 features across 10032 samples within 4 assays
## Active assay: RNA (29732 features, 0 variable features)
## 3 other assays present: ATAC, ATAC_distal, ATAC_promoter
Standard Log normalisation, nothing fancy. One could possibly regress out the nFeature_RNA
covariate or use the alternative SCTransform
normalisation
seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", assay = "RNA")
seurat <- ScaleData(seurat, do.center = TRUE, do.scale = FALSE)
TFIDF normalisation, as implemented in Signac
. This is important to transform the binary measurements to continuous readouts, which is a mandatory step to use the Gaussian likelihood in MOFA. Keep in mind that other strategies are possible (and perhaps even more recommended). One could also try to scale the data and regress out nFeature_ATAC
for (i in c("ATAC_distal","ATAC_promoter")) {
seurat <- RunTFIDF(seurat, assay = i)
}
Select most variable genes using the VST procedure
seurat <- FindVariableFeatures(seurat,
selection.method = "vst",
nfeatures = 5000,
assay = "RNA",
verbose = FALSE
)
Select the most variable peaks using the FindTopFeatures
function
for (i in c("ATAC_distal","ATAC_promoter")) {
seurat <- FindTopFeatures(seurat, assay=i, min.cutoff = 2000)
print(length(seurat[[i]]@var.features))
}
## [1] 8456
## [1] 11198
MOFA can take as input a multi-assay Seurat object.
mofa <- create_mofa(seurat, assays = c("RNA","ATAC_distal","ATAC_promoter"))
mofa
## Untrained MOFA model with the following characteristics:
## Number of views: 3
## Views names: RNA ATAC_distal ATAC_promoter
## Number of features (per view): 5000 8456 11198
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 10032
The most important hyperparameter is the number of factors. It needs to be specified by the user and it generally depends on the complexity of the data set. It is usually recommended to provide a large enough number and trim factors a posteriori based on a minimum variance explained criteria. Here let’s define K=15 factors
model_opts <- get_default_model_options(mofa)
model_opts$num_factors <- 15
There also other options that can be tuned, see ?get_default_model_options
, ?get_default_training_options
and ?get_default_data_options
for details.
mofa <- prepare_mofa(mofa,
model_options = model_opts
)
This step can take around 2h in a MacBook Pro, but it can be speed up using multiple cores (~2-3x) or using GPUs (~10x). The training is done in Python, and the connection with R is done via reticulate. If this step fails please read our FAQ, as this is the most common source of errors when running MOFA
# mofa <- run_mofa(mofa)
Load pretrained MOFA object
# run this to download the data set
# mofa <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/mofa.rds"))
mofa <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/mofa/mofa.rds")
mofa
## Trained MOFA with the following characteristics:
## Number of views: 3
## Views names: RNA ATAC_distal ATAC_promoter
## Number of features (per view): 5000 8456 11198
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 10032
## Number of factors: 13
The sample metadata must be provided as a data.frame and it must contain a column named sample with the sample IDs.
samples_metadata(mofa) <- seurat@meta.data %>%
tibble::rownames_to_column("sample") %>%
as.data.table
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit
plot_factor_cor(mofa)
The variance decomposition analysis is my favourite MOFA functionality. It calculates the percentage of variance explained by each factor and data modality. What insight can we learn just from visualising this? Factor 1 has a very strong signal from all data modalities, so it must be the most important source of variability in the data. Factor 2 and Factor 3 are weaker but also shared between all data modalities, so they are also likely to be impotant sources of variability. Interestingly, most Factors have a stronger signal in distal peaks than in promoter peaks.
plot_variance_explained(mofa, max_r2 = 4)
A reasonable question is whether the model is providing a good fit to the data. To assess this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. For single-cell data the values tend to be quite low. Here we have a 30% for RNA, a very reasonable value, versus ~8-10% for ATAC data, which is also a decent value taking into account how noisy scATAC data is. In fact, compared to my previous analysis on the SNARE-seq data set these are very good values. In the SNARE-seq analysis the the ATAC modality was extremely noisy and we had to rely on denoising strategies to get meaningful signal. As you’ll see in this vignette, the 10x Genomics technology provides much better quality measurements and no denoising strategy is required.
plot_variance_explained(mofa, plot_total = TRUE)[[2]]