Contents

1 Description

This vignette demonstrates how MOFA can be used to integrate scRNA-seq and scATAC-seq data from the Chromium Single Cell Multiome ATAC + Gene Expression assay recently commercialised by 10x Genomics. This vignette results from a collaboration between the 10x Genomics R&D team and the MOFA team. The data set consists of the conventional Peripheral Blood Mononuclear Cells (PBMC) from a single healthy donor, which is available here.

MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion. Intuitively, it can be viewed as a versatile and statistically rigorous generalisation of principal component analysis (PCA) to multi-omics data. Briefly, the model performs unsupervised dimensionality reduction simultaneously across multiple data modalities, thereby capturing the global sources of cell-to-cell variability via a small number of inferred factors. Importantly, it distinguishes variation that is shared between assays from variation that is unique to a specific assay. Thus, in this data set MOFA can be useful to disentangle the RNA and the ATAC activity of the different cellular populations that exist in PBMCs.

2 Load libraries

Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages

library(data.table)
library(ggplot2)
library(Seurat)
library(Signac)

# for GSEA analysis
library(msigdbr)

# For motif enrichment analysis
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg38)

# MOFA
library(MOFA2)

3 Load data

3.1 Load multi-modal Seurat object

We have created a Seurat object with RNA and ATAC data modalities stored as different assays. Here is the script to create the Seurat object from the CellRanger output

# run this to download the data set
# seurat <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/seurat.rds"))

seurat <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/seurat.rds")

seurat
## An object of class Seurat 
## 138109 features across 11909 samples within 2 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  1 other assay present: ATAC

The metadata slot contains the cell type annotations that have been done a priori by the 10x Genomics R&D team. This will be useful to characterise the MOFA factors. One could employ the MOFA factors to perform clustering and cell type annotation, but we will demonstrate this in a future vignette.

head(seurat@meta.data[,c("celltype","broad_celltype","pass_rnaQC","pass_accQC")])
##                            celltype broad_celltype pass_rnaQC pass_accQC
## AAACAGCCAAGGAATC  naive CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCAATCCCTT memory CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCAATGCGCT  naive CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCACACTAAT               <NA>           <NA>      FALSE      FALSE
## AAACAGCCACCAACCG               <NA>           <NA>      FALSE      FALSE
## AAACAGCCAGGATAAC               <NA>           <NA>      FALSE      FALSE
table(seurat@meta.data$celltype)
## 
##  CD56 (bright) NK cells     CD56 (dim) NK cells     classical monocytes 
##                     407                     472                    1929 
##    effector CD8 T cells  intermediate monocytes            MAIT T cells 
##                     385                     664                     106 
##          memory B cells      memory CD4 T cells              myeloid DC 
##                     420                    1611                     242 
##           naive B cells       naive CD4 T cells       naive CD8 T cells 
##                     295                    1462                    1549 
## non-classical monocytes         plasmacytoid DC 
##                     383                     107
table(seurat@meta.data$broad_celltype)
## 
## Lymphoid  Myeloid 
##     6814     3218

Keep cells that pass QC for both omics

seurat <- seurat %>%
  .[,seurat@meta.data$pass_accQC==TRUE & seurat@meta.data$pass_rnaQC==TRUE]
seurat
## An object of class Seurat 
## 138109 features across 10032 samples within 2 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  1 other assay present: ATAC

The RNA expression consists of 29,732 genes and 10,032 cells

seurat@assays[["RNA"]]
## Assay data with 29732 features for 10032 cells
## First 10 features:
##  AL627309.1, AL627309.2, AL627309.5, AL627309.4, AP006222.2, AL669831.2,
## LINC01409, FAM87B, LINC01128, LINC00115

The ATAC expression consists of 108,377 peaks and 10,032 cells

seurat@assays[["ATAC"]]
## Assay data with 108377 features for 10032 cells
## First 10 features:
##  chr1:10109-10357, chr1:180730-181630, chr1:191491-191736,
## chr1:267816-268196, chr1:586028-586373, chr1:629721-630172,
## chr1:633793-634264, chr1:777634-779926, chr1:816881-817647,
## chr1:819912-823500

3.2 Load additional information

Collect a list of position-specific weight matrices (PWM) from the JASPAR database, we’ll use this in the downstream analysis.

pfm <- getMatrixSet(JASPAR2020,
  opts = list(species = "Homo sapiens")
)

4 Load metadata

4.1 Cell metadata

4.2 Feature metadata

Feature metadata contains genomic information for both RNA expression and ATAC peaks

feature_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/filtered_feature_bc_matrix/features.tsv.gz") %>%
  setnames(c("ens_id","gene","view","chr","start","end"))

Fetch RNA metadata

feature_metadata.rna <- feature_metadata[view=="Gene Expression"]
head(feature_metadata.rna,n=3)
##             ens_id        gene            view  chr start   end
## 1: ENSG00000243485 MIR1302-2HG Gene Expression chr1 29553 30267
## 2: ENSG00000237613     FAM138A Gene Expression chr1 36080 36081
## 3: ENSG00000186092       OR4F5 Gene Expression chr1 65418 69055

Fetch ATAC metadata

feature_metadata.atac <- feature_metadata[view=="Peaks"] %>% 
  .[,ens_id:=NULL] %>% setnames("gene","peak")
head(feature_metadata.atac,n=3)
##                  peak  view  chr  start    end
## 1:   chr1:10109-10357 Peaks chr1  10109  10357
## 2: chr1:180730-181630 Peaks chr1 180730 181630
## 3: chr1:191491-191736 Peaks chr1 191491 191736

Classify peaks into promoter-overlapping and distal

foo <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/atac_peak_annotation.tsv") %>%
  .[,c("peak","peak_type")] %>%
  .[peak_type%in%c("distal", "promoter")]

feature_metadata.atac <- feature_metadata.atac %>% 
  merge(foo,by="peak",all.x=TRUE)

table(feature_metadata.atac$peak_type)
## 
##   distal promoter 
##    91822    16074

5 Parse Seurat object

Split ATAC matrix depending on the peak type and create a ChromatinAssay for each modality using the Signac package. This object requires a GRanges object with the peak metadata. We also provide an optional motif matrix that we’ll use for downstream analysis.

for (i in c("distal","promoter")) {
  
  # Create GRanges
  peaks.granges <- feature_metadata.atac %>%
    .[peak_type==i] %>%
    .[,c("chr","start","end","peak")] %>%
    makeGRangesFromDataFrame(keep.extra.columns = TRUE, ignore.strand = TRUE)

  # Scan motifs throughout the DNA sequence of each peak and create a binary matrix of motif-peak presence.
  motif.matrix <- CreateMotifMatrix(
    features = peaks.granges,
    pwm = pfm,
    genome = 'hg38',
    use.counts = FALSE
  ) %>% as.matrix
  
  # AddChromatinAssay to the Seurat object
  seurat@assays[[paste0("ATAC_",i)]] <- CreateChromatinAssay(
    seurat@assays$ATAC@counts[peaks.granges$peak,], 
    ranges = peaks.granges,
    motifs = CreateMotifObject(motif.matrix, pfm)
  )
  
}
seurat
## An object of class Seurat 
## 245401 features across 10032 samples within 4 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  3 other assays present: ATAC, ATAC_distal, ATAC_promoter

6 Normalization

6.1 RNA

Standard Log normalisation, nothing fancy. One could possibly regress out the nFeature_RNA covariate or use the alternative SCTransform normalisation

seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", assay = "RNA")
seurat <- ScaleData(seurat, do.center = TRUE, do.scale = FALSE)

6.2 ATAC

TFIDF normalisation, as implemented in Signac. This is important to transform the binary measurements to continuous readouts, which is a mandatory step to use the Gaussian likelihood in MOFA. Keep in mind that other strategies are possible (and perhaps even more recommended). One could also try to scale the data and regress out nFeature_ATAC

for (i in c("ATAC_distal","ATAC_promoter")) {
  seurat <- RunTFIDF(seurat, assay = i)
}

7 Feature selection

7.1 RNA

Select most variable genes using the VST procedure

seurat <- FindVariableFeatures(seurat, 
  selection.method = "vst", 
  nfeatures = 5000,
  assay = "RNA",
  verbose = FALSE
)

7.2 ATAC

Select the most variable peaks using the FindTopFeatures function

for (i in c("ATAC_distal","ATAC_promoter")) {
  seurat <- FindTopFeatures(seurat, assay=i, min.cutoff = 2000)
  print(length(seurat[[i]]@var.features))
}
## [1] 8456
## [1] 11198

8 Train the MOFA model

8.1 Create MOFA object

MOFA can take as input a multi-assay Seurat object.

mofa <- create_mofa(seurat, assays = c("RNA","ATAC_distal","ATAC_promoter"))
mofa
## Untrained MOFA model with the following characteristics: 
##  Number of views: 3 
##  Views names: RNA ATAC_distal ATAC_promoter 
##  Number of features (per view): 5000 8456 11198 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 10032

8.2 Define MOFA options

The most important hyperparameter is the number of factors. It needs to be specified by the user and it generally depends on the complexity of the data set. It is usually recommended to provide a large enough number and trim factors a posteriori based on a minimum variance explained criteria. Here let’s define K=15 factors

model_opts <- get_default_model_options(mofa)
model_opts$num_factors <- 15

There also other options that can be tuned, see ?get_default_model_options, ?get_default_training_options and ?get_default_data_options for details.

mofa <- prepare_mofa(mofa,
  model_options = model_opts
)

8.3 Run MOFA

This step can take around 2h in a MacBook Pro, but it can be speed up using multiple cores (~2-3x) or using GPUs (~10x). The training is done in Python, and the connection with R is done via reticulate. If this step fails please read our FAQ, as this is the most common source of errors when running MOFA

# mofa <- run_mofa(mofa)

Load pretrained MOFA object

# run this to download the data set
# mofa <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/mofa.rds"))

mofa <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/mofa/mofa.rds")

mofa
## Trained MOFA with the following characteristics: 
##  Number of views: 3 
##  Views names: RNA ATAC_distal ATAC_promoter 
##  Number of features (per view): 5000 8456 11198 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 10032 
##  Number of factors: 13

9 MOFA downstream analysis

9.1 Add cell metadata to the model

The sample metadata must be provided as a data.frame and it must contain a column named sample with the sample IDs.

samples_metadata(mofa) <- seurat@meta.data %>%
  tibble::rownames_to_column("sample") %>%
  as.data.table

9.2 Asses the correlation between factors

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit

plot_factor_cor(mofa)

9.3 Variance decomposition

The variance decomposition analysis is my favourite MOFA functionality. It calculates the percentage of variance explained by each factor and data modality. What insight can we learn just from visualising this? Factor 1 has a very strong signal from all data modalities, so it must be the most important source of variability in the data. Factor 2 and Factor 3 are weaker but also shared between all data modalities, so they are also likely to be impotant sources of variability. Interestingly, most Factors have a stronger signal in distal peaks than in promoter peaks.

plot_variance_explained(mofa, max_r2 = 4)

A reasonable question is whether the model is providing a good fit to the data. To assess this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. For single-cell data the values tend to be quite low. Here we have a 30% for RNA, a very reasonable value, versus ~8-10% for ATAC data, which is also a decent value taking into account how noisy scATAC data is. In fact, compared to my previous analysis on the SNARE-seq data set these are very good values. In the SNARE-seq analysis the the ATAC modality was extremely noisy and we had to rely on denoising strategies to get meaningful signal. As you’ll see in this vignette, the 10x Genomics technology provides much better quality measurements and no denoising strategy is required.

plot_variance_explained(mofa, plot_total = TRUE)[[2]]