1 Description
2 Load libraries
3 Load data
- 3.1 Load multi-modal Seurat object
- 3.2 Load additional information
4 Load metadata
- 4.1 Cell metadata
- 4.2 Feature metadata
5 Parse Seurat object
6 Normalization
- 6.1 RNA
- 6.2 ATAC
7 Feature selection
- 7.1 RNA
- 7.2 ATAC
8 Train the MOFA model
9 MOFA downstream analysis
10 Conclusions

1 Description

This vignette demonstrates how MOFA can be used to integrate scRNA-seq and scATAC-seq data from the Chromium Single Cell Multiome ATAC + Gene Expression assay recently commercialised by 10x Genomics. This vignette results from a collaboration between the 10x Genomics R&D team and the MOFA team. The data set consists of the conventional Peripheral Blood Mononuclear Cells (PBMC) from a single healthy donor, which is available here.

MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion. Intuitively, it can be viewed as a versatile and statistically rigorous generalisation of principal component analysis (PCA) to multi-omics data. Briefly, the model performs unsupervised dimensionality reduction simultaneously across multiple data modalities, thereby capturing the global sources of cell-to-cell variability via a small number of inferred factors. Importantly, it distinguishes variation that is shared between assays from variation that is unique to a specific assay. Thus, in this data set MOFA can be useful to disentangle the RNA and the ATAC activity of the different cellular populations that exist in PBMCs.

2 Load libraries

Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages

library(data.table)
library(ggplot2)
library(Seurat)
library(Signac)

# for GSEA analysis
library(msigdbr)

# For motif enrichment analysis
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg38)

# MOFA
library(MOFA2)

3 Load data

3.1 Load multi-modal Seurat object

We have created a Seurat object with RNA and ATAC data modalities stored as different assays. Here is the script to create the Seurat object from the CellRanger output

# run this to download the data set
# seurat <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/seurat.rds"))

seurat <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/seurat.rds")

seurat

## An object of class Seurat 
## 138109 features across 11909 samples within 2 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  1 other assay present: ATAC

The metadata slot contains the cell type annotations that have been done a priori by the 10x Genomics R&D team. This will be useful to characterise the MOFA factors. One could employ the MOFA factors to perform clustering and cell type annotation, but we will demonstrate this in a future vignette.

head(seurat@meta.data[,c("celltype","broad_celltype","pass_rnaQC","pass_accQC")])

##                            celltype broad_celltype pass_rnaQC pass_accQC
## AAACAGCCAAGGAATC  naive CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCAATCCCTT memory CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCAATGCGCT  naive CD4 T cells       Lymphoid       TRUE       TRUE
## AAACAGCCACACTAAT               <NA>           <NA>      FALSE      FALSE
## AAACAGCCACCAACCG               <NA>           <NA>      FALSE      FALSE
## AAACAGCCAGGATAAC               <NA>           <NA>      FALSE      FALSE

table(seurat@meta.data$celltype)

## 
##  CD56 (bright) NK cells     CD56 (dim) NK cells     classical monocytes 
##                     407                     472                    1929 
##    effector CD8 T cells  intermediate monocytes            MAIT T cells 
##                     385                     664                     106 
##          memory B cells      memory CD4 T cells              myeloid DC 
##                     420                    1611                     242 
##           naive B cells       naive CD4 T cells       naive CD8 T cells 
##                     295                    1462                    1549 
## non-classical monocytes         plasmacytoid DC 
##                     383                     107

table(seurat@meta.data$broad_celltype)

## 
## Lymphoid  Myeloid 
##     6814     3218

Keep cells that pass QC for both omics

seurat <- seurat %>%
  .[,seurat@meta.data$pass_accQC==TRUE & seurat@meta.data$pass_rnaQC==TRUE]
seurat

## An object of class Seurat 
## 138109 features across 10032 samples within 2 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  1 other assay present: ATAC

The RNA expression consists of 29,732 genes and 10,032 cells

seurat@assays[["RNA"]]

## Assay data with 29732 features for 10032 cells
## First 10 features:
##  AL627309.1, AL627309.2, AL627309.5, AL627309.4, AP006222.2, AL669831.2,
## LINC01409, FAM87B, LINC01128, LINC00115

The ATAC expression consists of 108,377 peaks and 10,032 cells

seurat@assays[["ATAC"]]

## Assay data with 108377 features for 10032 cells
## First 10 features:
##  chr1:10109-10357, chr1:180730-181630, chr1:191491-191736,
## chr1:267816-268196, chr1:586028-586373, chr1:629721-630172,
## chr1:633793-634264, chr1:777634-779926, chr1:816881-817647,
## chr1:819912-823500

3.2 Load additional information

Collect a list of position-specific weight matrices (PWM) from the JASPAR database, we’ll use this in the downstream analysis.

pfm <- getMatrixSet(JASPAR2020,
  opts = list(species = "Homo sapiens")
)

4 Load metadata

4.1 Cell metadata

4.2 Feature metadata

Feature metadata contains genomic information for both RNA expression and ATAC peaks

feature_metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/filtered_feature_bc_matrix/features.tsv.gz") %>%
  setnames(c("ens_id","gene","view","chr","start","end"))

Fetch RNA metadata

feature_metadata.rna <- feature_metadata[view=="Gene Expression"]
head(feature_metadata.rna,n=3)

##             ens_id        gene            view  chr start   end
## 1: ENSG00000243485 MIR1302-2HG Gene Expression chr1 29553 30267
## 2: ENSG00000237613     FAM138A Gene Expression chr1 36080 36081
## 3: ENSG00000186092       OR4F5 Gene Expression chr1 65418 69055

Fetch ATAC metadata

feature_metadata.atac <- feature_metadata[view=="Peaks"] %>% 
  .[,ens_id:=NULL] %>% setnames("gene","peak")
head(feature_metadata.atac,n=3)

##                  peak  view  chr  start    end
## 1:   chr1:10109-10357 Peaks chr1  10109  10357
## 2: chr1:180730-181630 Peaks chr1 180730 181630
## 3: chr1:191491-191736 Peaks chr1 191491 191736

Classify peaks into promoter-overlapping and distal

foo <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/atac_peak_annotation.tsv") %>%
  .[,c("peak","peak_type")] %>%
  .[peak_type%in%c("distal", "promoter")]

feature_metadata.atac <- feature_metadata.atac %>% 
  merge(foo,by="peak",all.x=TRUE)

table(feature_metadata.atac$peak_type)

## 
##   distal promoter 
##    91822    16074

5 Parse Seurat object

Split ATAC matrix depending on the peak type and create a ChromatinAssay for each modality using the Signac package. This object requires a GRanges object with the peak metadata. We also provide an optional motif matrix that we’ll use for downstream analysis.

for (i in c("distal","promoter")) {
  
  # Create GRanges
  peaks.granges <- feature_metadata.atac %>%
    .[peak_type==i] %>%
    .[,c("chr","start","end","peak")] %>%
    makeGRangesFromDataFrame(keep.extra.columns = TRUE, ignore.strand = TRUE)

  # Scan motifs throughout the DNA sequence of each peak and create a binary matrix of motif-peak presence.
  motif.matrix <- CreateMotifMatrix(
    features = peaks.granges,
    pwm = pfm,
    genome = 'hg38',
    use.counts = FALSE
  ) %>% as.matrix
  
  # AddChromatinAssay to the Seurat object
  seurat@assays[[paste0("ATAC_",i)]] <- CreateChromatinAssay(
    seurat@assays$ATAC@counts[peaks.granges$peak,], 
    ranges = peaks.granges,
    motifs = CreateMotifObject(motif.matrix, pfm)
  )
  
}
seurat

## An object of class Seurat 
## 245401 features across 10032 samples within 4 assays 
## Active assay: RNA (29732 features, 0 variable features)
##  3 other assays present: ATAC, ATAC_distal, ATAC_promoter

6 Normalization

6.1 RNA

Standard Log normalisation, nothing fancy. One could possibly regress out the nFeature_RNA covariate or use the alternative SCTransform normalisation

seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", assay = "RNA")
seurat <- ScaleData(seurat, do.center = TRUE, do.scale = FALSE)

6.2 ATAC

TFIDF normalisation, as implemented in Signac. This is important to transform the binary measurements to continuous readouts, which is a mandatory step to use the Gaussian likelihood in MOFA. Keep in mind that other strategies are possible (and perhaps even more recommended). One could also try to scale the data and regress out nFeature_ATAC

for (i in c("ATAC_distal","ATAC_promoter")) {
  seurat <- RunTFIDF(seurat, assay = i)
}

7 Feature selection

7.1 RNA

Select most variable genes using the VST procedure

seurat <- FindVariableFeatures(seurat, 
  selection.method = "vst", 
  nfeatures = 5000,
  assay = "RNA",
  verbose = FALSE
)

7.2 ATAC

Select the most variable peaks using the FindTopFeatures function

for (i in c("ATAC_distal","ATAC_promoter")) {
  seurat <- FindTopFeatures(seurat, assay=i, min.cutoff = 2000)
  print(length(seurat[[i]]@var.features))
}

## [1] 8456
## [1] 11198

8 Train the MOFA model

8.1 Create MOFA object

MOFA can take as input a multi-assay Seurat object.

mofa <- create_mofa(seurat, assays = c("RNA","ATAC_distal","ATAC_promoter"))
mofa

## Untrained MOFA model with the following characteristics: 
##  Number of views: 3 
##  Views names: RNA ATAC_distal ATAC_promoter 
##  Number of features (per view): 5000 8456 11198 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 10032

8.2 Define MOFA options

The most important hyperparameter is the number of factors. It needs to be specified by the user and it generally depends on the complexity of the data set. It is usually recommended to provide a large enough number and trim factors a posteriori based on a minimum variance explained criteria. Here let’s define K=15 factors

model_opts <- get_default_model_options(mofa)
model_opts$num_factors <- 15

There also other options that can be tuned, see ?get_default_model_options, ?get_default_training_options and ?get_default_data_options for details.

mofa <- prepare_mofa(mofa,
  model_options = model_opts
)

8.3 Run MOFA

This step can take around 2h in a MacBook Pro, but it can be speed up using multiple cores (~2-3x) or using GPUs (~10x). The training is done in Python, and the connection with R is done via reticulate. If this step fails please read our FAQ, as this is the most common source of errors when running MOFA

# mofa <- run_mofa(mofa)

Load pretrained MOFA object

# run this to download the data set
# mofa <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/mofa.rds"))

mofa <- readRDS("/Users/ricard/data/10x_rna_atac/PBMC/mofa/mofa.rds")

mofa

## Trained MOFA with the following characteristics: 
##  Number of views: 3 
##  Views names: RNA ATAC_distal ATAC_promoter 
##  Number of features (per view): 5000 8456 11198 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 10032 
##  Number of factors: 13

9 MOFA downstream analysis

9.1 Add cell metadata to the model

The sample metadata must be provided as a data.frame and it must contain a column named sample with the sample IDs.

samples_metadata(mofa) <- seurat@meta.data %>%
  tibble::rownames_to_column("sample") %>%
  as.data.table

9.2 Asses the correlation between factors

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit

plot_factor_cor(mofa)

9.3 Variance decomposition

The variance decomposition analysis is my favourite MOFA functionality. It calculates the percentage of variance explained by each factor and data modality. What insight can we learn just from visualising this? Factor 1 has a very strong signal from all data modalities, so it must be the most important source of variability in the data. Factor 2 and Factor 3 are weaker but also shared between all data modalities, so they are also likely to be impotant sources of variability. Interestingly, most Factors have a stronger signal in distal peaks than in promoter peaks.

plot_variance_explained(mofa, max_r2 = 4)

A reasonable question is whether the model is providing a good fit to the data. To assess this we can plot the total variance explained (using all factors). The resulting values will depend on the nature of the data set, the number of samples, the number of factors, etc. For single-cell data the values tend to be quite low. Here we have a 30% for RNA, a very reasonable value, versus ~8-10% for ATAC data, which is also a decent value taking into account how noisy scATAC data is. In fact, compared to my previous analysis on the SNARE-seq data set these are very good values. In the SNARE-seq analysis the the ATAC modality was extremely noisy and we had to rely on denoising strategies to get meaningful signal. As you’ll see in this vignette, the 10x Genomics technology provides much better quality measurements and no denoising strategy is required.

plot_variance_explained(mofa, plot_total = TRUE)[[2]]

9.4 Characterisation of Factors

There are a few systematic strategies to characterise the molecular etiology underlying each MOFA Factor:

Association analysis between cell covariates and Factor values.
Visualisation of factor values.
Visualisation of feature weights.
Gene set enrichment analysis.

9.4.1 Correlate factors with covariates

As a first approach, let’s correlate the standard quality control metrics with the Factor values. We see that Factor 4 is strongly associated with the number of expressed genes per cell and the Factors 1 and 7 are associated with the number of accessible peaks per cell. Some of these are likely technical factors that should be removed with a better normalisation strategy.

correlate_factors_with_covariates(mofa, 
  covariates = c("nFeature_RNA","nFeature_ATAC")
  )

9.4.2 Characterisation of Factor 1

The variance expalined plot tells us that Factor 1 is the strongest source of variation and has a large contribution from all data modalities.

9.4.2.1 Visualisation of Factor values

How do we interpret the factor values?
Mathematically, each Factor is defined by a linear combination of the input features. As the data is centered prior to running MOFA, each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite “effects” along the inferred axis of variation, with higher absolute value indicating a stronger effect. Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.

Let’s plot the Factor 1 values grouped by cell type and coloured by lineage of origin (Lymphoid or Myeloid). Clearly Factor 1 captures the molecular variation associated with lineage of origin.

plot_factor(mofa, factors=1, group_by = "celltype", color_by="broad_celltype") +
  theme(
    axis.text.x = element_text(color="black", angle=40, vjust=1, hjust=1)
  )

9.4.2.2 Visualisation of feature weights

how do we interpret the weights?
The weights provide a score for each feature on each factor. Features with no association with the corresponding factor are expected to have values close to zero, whereas features with strong association with the factor are expected to have large absolute values. The sign of the weights indicates the direction of the effect: a positive weights indicates that the feature has higher levels in the cells with positive factor values, and vice-versa.

Let’s plot the associated RNA weights. These are all good markers of myeloid (LYN, SLC8A1, LYZ, etc.) vs lymphoid fate (BCL11B)

plot_weights(mofa, 
  view = "RNA", 
  factors = 1, 
  nfeatures = 20, 
  text_size = 4
)

9.4.2.3 Visualisation covariation patterns in the input data

The weights are useful to get an idea of which genes are driving each factor. However, to get an idea of how good the association between features and factors is we can generate a scatterplot of the Factor values against mRNA expression for the genes with the largest weights:

plot_data_scatter(mofa, 
  view = "RNA", 
  factor = 1, 
  features = 6,
  color_by = "broad_celltype",
  add_lm = T,
  dot_size = 1
)

Now let’s visualise the ATAC signatures by inspecting the weights, although they are not very interpretable. We’ll have to rely on motif-based approaches to get TF/motif signatures.

plot_top_weights(mofa, 
  view = "ATAC_distal", 
  factors = 1, 
  sign = "positive",
  nfeatures = 15,
)

Again, we can visualise the ATAC patterns recovered by MOFA on the high-dimensional space. Instead of scatterplots we can use heatmaps:

plot_data_heatmap(mofa, 
  view = "ATAC_promoter", 
  factor = 1, 
  features = 50,
  show_rownames = F, show_colnames = F, 
  cluster_rows = T, cluster_cols = F,
  annotation_samples = "broad_celltype"
)

Again, we can visualise the ATAC patterns recovered by MOFA on the high-dimensional space. Instead of scatterplots we can also use heatmaps:

MOFA has an interesting option to denoise/smooth the data by reconstructing the data modalities using the latent factors. We can check how this works in the plot_data_heatmap function using denoise = TRUE. Instead of plotting the (noisy) input data, this plots the data reconstructed by the model:

plot_data_heatmap(mofa, 
  view = "ATAC_promoter", 
  factor = 1, 
  features = 50,
  show_rownames = F, show_colnames = F, 
  cluster_rows = T, cluster_cols = F,
  annotation_samples = "broad_celltype",
  denoise = TRUE
)

9.4.3 Characterisation of Factor 2

Let’s proceed now in a similar way to characterise Factor 2. It is clearly capturing B cell commitment

plot_factor(mofa, factors=2, group_by = "celltype") +
  theme(
    axis.text.x = element_text(color="black", angle=40, vjust=1, hjust=1)
  )

Let’s plot the associated RNA weights. We see good B cell markers, including CD74 and BANK1

plot_weights(mofa, 
  view = "RNA", 
  factors = 2, 
  nfeatures = 10, 
  text_size = 4
)

Let’s color the factor values by BANK1 expression

plot_factor(mofa, factors=2, group_by = "celltype", color_by="BANK1") +
  theme(
    axis.text.x = element_text(color="black", angle=40, vjust=1, hjust=1)
  )

Again, we can visualise the ATAC signatures:

plot_data_heatmap(mofa, 
  view = "ATAC_promoter", 
  factor = 2, 
  features = 50,
  show_rownames = F, show_colnames = F, 
  cluster_rows = T, cluster_cols = F,
  annotation_samples = "celltype",
  denoise = TRUE
)

9.4.4 Characterisation of Factor 4

In the correlation analysis we found that Factor 4 is strongly correlated to the total number of ATAC peaks, so it is likely a technical source of variation that has not been removed in the normalisation steps. The signal is clear, and it affects all cell types in a similar fashion:

plot_factor(mofa, factors=4, group_by = "celltype", color_by="nFeature_ATAC") +
  theme(
    axis.text.x = element_text(color="black", angle=40, vjust=1, hjust=1)
  )

If we plot the associated ATAC weights notice that the distribution is massively shifted towards negative values. A typical signature of a technical factor is that most of the weights tend to non-zero:

plot_weights(mofa, 
  view = "ATAC_promoter", 
  factors = 4, 
  nfeatures = 0, 
  text_size = 4
)

9.5 Non-linear dimensionality reduction

9.5.1 Using MOFA factors

The latent space inferred by MOFA can replace the PCA embedding as input to algorithms that learn non-linear manifolds such as t-SNE or UMAP. This can be very useful to identify cellular populations and reconstruct complex pseudotime trajectories. The advantage of MOFA is that (1) we use information from all available omics, and (2) we can characterise the Factors and remove the technical ones.

factors <- 1:get_dimensions(mofa)[["K"]]
factors <- factors[!factors%in%c(4,7)]

mofa <- run_umap(mofa, 
  factors = factors, 
  n_neighbors = 15,  
  min_dist = 0.30
)

plot_dimred(mofa, 
  method = "UMAP", 
  color_by = "celltype", 
  label = TRUE, 
  stroke=0.05, 
  dot_size = 1, 
  legend = FALSE
) + scale_fill_manual(values=colors)

We can try to add some interpretatibility on the UMAP by visualising the contribution of each Factor on the different groups of cells.

for (i in paste0("Factor",1:3)) {
  p <- plot_dimred(mofa, 
    method = "UMAP", 
    color_by = i, 
    stroke = 0.05, 
    dot_size = 1
  )
  print(p)
}

In this case however we notice that the resulting UMAP is not very different from the one you could obtain using the RNA expression or the ATAC data alone. This is because most, if not all, biological factors have contributions from both the RNA and ATAC modalities.

9.5.2 Using RNA data alone

DefaultAssay(seurat) <- "RNA"
seurat <- RunPCA(seurat, npcs = K, verbose = FALSE)
seurat <- RunUMAP(seurat, reduction = 'pca', dims = 1:K, verbose = FALSE)
DimPlot(seurat, label = TRUE, reduction="umap") + 
  NoLegend() + NoAxes() + scale_fill_manual(values=colors)

9.5.3 Using ATAC data alone

DefaultAssay(seurat) <- "ATAC_distal"
seurat <- RunSVD(seurat, n = K, verbose = FALSE)
seurat <- RunUMAP(seurat, reduction = 'lsi', dims = 1:K, verbose = FALSE)
DimPlot(seurat, label = TRUE, reduction="umap") + 
  NoLegend() + NoAxes() + scale_fill_manual(values=colors)

9.6 Gene set enrichment analysis

In addition to exploring the individual weights for each factor, we can use enrichment analysis to look for significant associations of factors to genesets. For more details on how GSEA works in MOFA we encourage the users to read the GSEA vignette.

9.6.1 Load gene set annotations

First we define the gene set matrix. We’ll use the C5 category and the Biological Process subcategory from the MSigDB data base:

msgidb.matrix <- msigdbr(
  species = "Homo sapiens",
  category = "C5", 
  subcategory = "BP"
  ) %>% as.data.table %>% .[,id:=1] %>%
  dcast(gs_name~gene_symbol, value.var="id", fill=0) %>% 
  matrix.please

9.6.2 Run GSEA

Second, we run the enrichment analysis with default options. An important consideration when running GSEA is that MOFA Factors contain positive (+) and negative (-) weights. There will be cases where the genes with (-) weights all belong to a specific pathway but genes with (+) weights belong to other pathways. If this is true, doing GSEA with all of them together could dilute the signal. Hence, we recommend the user to do GSEA separately for (+) and (-) weights.

# GSEA on positive weights
gsea.positive <- run_enrichment(mofa, 
  feature.sets = msgidb.matrix, 
  view = "RNA",
  sign = "positive"
)

# GSEA on negative weights
gsea.negative <- run_enrichment(mofa, 
  feature.sets = msgidb.matrix, 
  view = "RNA",
  sign = "negative"
)

The enrichment analysis returns a list of 5 elements:

feature.sets: the gene set matrix filtered by the genes that overlap with the MOFA model.
pval: the nominal p-values.
pval.adj: the FDR-adjusted p-values.
feature.statistics: the feature statistics (i.e. the mofa weights, by default).
set.statistics: the gene set statistics.
sigPathways: list with significant pathways per factor at a specified FDR threshold

names(gsea.positive)

## [1] "feature.sets"       "pval"               "pval.adj"          
## [4] "feature.statistics" "set.statistics"     "sigPathways"

9.6.3 Visualise GSEA rsults

Factor 1 captured lineage of origin, where lymphoid cells wer linked to (-) factor values and myeloid cells to (+) factor values. Consistently, when plotting the pathways enriched in Factor 1 for (+) weight values we can see Myeloid-associated pathways:

plot_enrichment(gsea.positive, factor = 1, max.pathways = 15)

and when plotting the pathways enriched in Factor 1 for (-) weight values we can see Lymphoid-associated pathways:

plot_enrichment(gsea.negative, factor = 1, max.pathways = 15)

It is always advised to not rely only on the p-values and to visualise which genes are driving the enrichment within each pathways. This can be done with the plot_enrichment_detailed function.

plot_enrichment_detailed(gsea.positive,
  factor = 1,
  max.genes = 10,
  max.pathways = 5
)

9.7 Motif enrichment

Although peak annotation provides functional interpretation, it does not directly explain the underlying mechanism. Open chromatin can affect transcription through TFs, which facilitate transcription by recognizing and binding to specific DNA motifs. Let’s attempt to do motif enrichment analysis using the weights of the ATAC modality.

Here we can use the GSEA functionality implemented in MOFA. We just replace the gene set annotations by peak-motif annotations.

9.7.1 Run

The variance explained plot suggests that there is more signal in distal peaks than in promoter peaks in the first two factors. Let’s do motif enrichment on distal peaks. Again, we compare positive weights (for Factor 1, peaks more open and motifs more enriched in Lymphoid cells) versus negative weights (for Factor 1, peaks more open and motifs more enriched in Myeloid cells)

# define motif matrix
motif.matrix <- t(as.matrix(seurat[["ATAC_distal"]]@motifs@data))

# Run GSEA enrichment analysis using the motif-peak matrix, (+) weights
motif.enrichment.positive <- run_enrichment(mofa,
  view = "ATAC_distal", 
  factors = 1:2,
  feature.sets = motif.matrix,
  sign = "positive"
)

# Run GSEA enrichment analysis using the motif-peak matrix, (-) weights
motif.enrichment.negative <- run_enrichment(mofa,
  view = "ATAC_distal", 
  factors = 1:2,
  feature.sets = motif.matrix,
  sign = "negative"
)

9.7.2 Visualise

Plot motif enrichment results for Factor 1

plot_enrichment(motif.enrichment.positive, factor = 1, max.pathways = 15)

plot_enrichment(motif.enrichment.negative, factor = 1, max.pathways = 15)

We can visualise the motifs using the MotifPlot function from the Signac package.

Myeloid motifs (positive Factor values): CEBPs are one of the most studied and important regulator of myeloid differentiation.

sig.motifs.positive <- motif.enrichment.positive$pval.adj[,"Factor1"] %>%
  sort %>% head(n=6) %>% names
MotifPlot(seurat[["ATAC_distal"]], motifs = sig.motifs.positive)

Lymphoid motifs (negative Factor values):

LEF1 is essential for T-cell functionality.
TCFL2 also known as TCF4, is a T-cell specific TF.
RUNX3 is highly expressed in cells of lymphoid origin.
TCF7 is highly expressed in cells of lymphoid origin.

Seems like we are getting sensible results!

sig.motifs.negative <- motif.enrichment.negative$pval.adj[,"Factor1"] %>%
  sort %>% head(n=6) %>% names
MotifPlot(seurat[["ATAC_distal"]], motifs = sig.motifs.negative)

9.7.3 Validation using chromVAR

To validate the motif enrichment results, we can calculate motif activity scores per cell using chromVAR. Briefly, this method computes an accessibility z-score for each motif in each cell that is adjusted by technical confounders. Positive values denote more accessible than background, and negative values denote less accessible than background

seurat <- RunChromVAR(
  object = seurat, 
  assay = "ATAC_distal",
  genome = BSgenome.Hsapiens.UCSC.hg38, 
  new.assay.name = "chromvar"
)

DefaultAssay(seurat) <- "chromvar"

Indeed, the top hits for Factor 1 have very clear chromVAR patterns that distinguish lymphoid versus myeloid cells

motifs.to.plot <- c(sig.motifs.positive[1:2], sig.motifs.negative[1:2])

FeaturePlot(seurat,
  features = motifs.to.plot,
  reduction = "umap",
  combine = TRUE
) & NoLegend() & NoAxes()

10 Conclusions

This vignette demonstrates how MOFA can be used to integrate the RNA and ATAC modalities from the Chromium Single Cell Multiome ATAC + Gene Expression assay recently introduced by 10x Genomics. I want to thank Vijay, Shamoni, Paul and Kamila from the 10x R&D team for this collaboration and for feedback on the analysis.

If you have questions, suggestions about MOFA or the vignette you can contact me by email (ricard@ebi.ac.uk). Also, we have a Slack group where we provide quick and personalised advise if you are interested in MOFA.

Session Info

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MOFA2_1.0                         BSgenome.Hsapiens.UCSC.hg38_1.4.3
##  [3] BSgenome_1.56.0                   rtracklayer_1.48.0               
##  [5] Biostrings_2.56.0                 XVector_0.28.0                   
##  [7] GenomicRanges_1.40.0              GenomeInfoDb_1.24.2              
##  [9] IRanges_2.22.2                    S4Vectors_0.26.1                 
## [11] BiocGenerics_0.34.0               TFBSTools_1.26.0                 
## [13] JASPAR2020_0.99.10                msigdbr_7.1.1                    
## [15] Signac_1.0.0                      Seurat_3.2.0                     
## [17] ggplot2_3.3.2                     data.table_1.13.0                
## [19] BiocStyle_2.16.0                 
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.1              SnowballC_0.7.0            
##   [3] GGally_2.0.0                R.methodsS3_1.8.1          
##   [5] nabor_0.5.0                 tidyr_1.1.2                
##   [7] bit64_4.0.5                 knitr_1.29                 
##   [9] irlba_2.3.3                 DelayedArray_0.14.1        
##  [11] R.utils_2.10.1              rpart_4.1-15               
##  [13] KEGGREST_1.28.0             RCurl_1.98-1.2             
##  [15] AnnotationFilter_1.12.0     generics_0.0.2             
##  [17] GenomicFeatures_1.40.1      cowplot_1.0.0              
##  [19] RSQLite_2.2.0               RANN_2.6.1                 
##  [21] chromVAR_1.10.0             future_1.18.0              
##  [23] bit_4.0.4                   spatstat.data_1.4-3        
##  [25] httpuv_1.5.4                SummarizedExperiment_1.18.2
##  [27] assertthat_0.2.1            DirichletMultinomial_1.30.0
##  [29] xfun_0.16                   hms_0.5.3                  
##  [31] evaluate_0.14               promises_1.1.1             
##  [33] progress_1.2.2              readxl_1.3.1               
##  [35] caTools_1.18.0              dbplyr_1.4.4               
##  [37] tmvnsim_1.0-2               igraph_1.2.5               
##  [39] DBI_1.1.0                   htmlwidgets_1.5.1          
##  [41] reshape_0.8.8               purrr_0.3.4                
##  [43] ellipsis_0.3.1              RSpectra_0.16-0            
##  [45] corrplot_0.84               ggpubr_0.4.0               
##  [47] dplyr_1.0.2                 backports_1.1.9            
##  [49] bookdown_0.20               annotate_1.66.0            
##  [51] biomaRt_2.44.1              deldir_0.1-28              
##  [53] vctrs_0.3.4                 Biobase_2.48.0             
##  [55] ensembldb_2.12.1            ROCR_1.0-11                
##  [57] abind_1.4-5                 withr_2.2.0                
##  [59] checkmate_2.0.0             sctransform_0.2.1          
##  [61] GenomicAlignments_1.24.0    prettyunits_1.1.1          
##  [63] mnormt_2.0.2                goftest_1.2-2              
##  [65] cluster_2.1.0               ape_5.4-1                  
##  [67] lazyeval_0.2.2              seqLogo_1.54.3             
##  [69] crayon_1.3.4                labeling_0.3               
##  [71] pkgconfig_2.0.3             nlme_3.1-148               
##  [73] vipor_0.4.5                 ProtGenerics_1.20.0        
##  [75] nnet_7.3-14                 rlang_0.4.7                
##  [77] globals_0.12.5              lifecycle_0.2.0            
##  [79] miniUI_0.1.1.1              BiocFileCache_1.12.1       
##  [81] rsvd_1.0.3                  dichromat_2.0-0            
##  [83] cellranger_1.1.0            ggrastr_0.1.9              
##  [85] polyclip_1.10-0             matrixStats_0.56.0         
##  [87] lmtest_0.9-37               graph_1.66.0               
##  [89] Matrix_1.2-18               ggseqlogo_0.1              
##  [91] carData_3.0-4               Rhdf5lib_1.10.1            
##  [93] zoo_1.8-8                   base64enc_0.1-3            
##  [95] beeswarm_0.2.3              pheatmap_1.0.12            
##  [97] ggridges_0.5.2              png_0.1-7                  
##  [99] viridisLite_0.3.0           bitops_1.0-6               
## [101] R.oo_1.24.0                 KernSmooth_2.23-17         
## [103] blob_1.2.1                  stringr_1.4.0              
## [105] qvalue_2.20.0               rstatix_0.6.0              
## [107] readr_1.3.1                 jpeg_0.1-8.1               
## [109] ggsignif_0.6.0              CNEr_1.24.0                
## [111] scales_1.1.1                memoise_1.1.0              
## [113] magrittr_1.5                plyr_1.8.6                 
## [115] ica_1.0-2                   zlibbioc_1.34.0            
## [117] compiler_4.0.2              RColorBrewer_1.1-2         
## [119] fitdistrplus_1.1-1          Rsamtools_2.4.0            
## [121] listenv_0.8.0               patchwork_1.0.1            
## [123] pbapply_1.4-3               htmlTable_2.0.1            
## [125] Formula_1.2-3               MASS_7.3-51.6              
## [127] mgcv_1.8-31                 tidyselect_1.1.0           
## [129] forcats_0.5.0               stringi_1.4.6              
## [131] yaml_2.2.1                  askpass_1.1                
## [133] latticeExtra_0.6-29         ggrepel_0.8.2              
## [135] grid_4.0.2                  VariantAnnotation_1.34.0   
## [137] fastmatch_1.1-0             tools_4.0.2                
## [139] rio_0.5.16                  future.apply_1.6.0         
## [141] rstudioapi_0.11             TFMPvalue_0.0.8            
## [143] foreign_0.8-80              lsa_0.73.2                 
## [145] gridExtra_2.3               farver_2.0.3               
## [147] Rtsne_0.15                  digest_0.6.25              
## [149] BiocManager_1.30.10         shiny_1.5.0                
## [151] pracma_2.2.9                motifmatchr_1.10.0         
## [153] Rcpp_1.0.5                  car_3.0-9                  
## [155] broom_0.7.0                 later_1.1.0.1              
## [157] RcppAnnoy_0.0.16            OrganismDbi_1.30.0         
## [159] httr_1.4.2                  AnnotationDbi_1.50.3       
## [161] ggbio_1.36.0                biovizBase_1.36.0          
## [163] psych_2.0.8                 colorspace_1.4-1           
## [165] XML_3.99-0.5                tensor_1.5                 
## [167] reticulate_1.16             splines_4.0.2              
## [169] uwot_0.1.8                  RBGL_1.64.0                
## [171] RcppRoll_0.3.0              spatstat.utils_1.17-0      
## [173] plotly_4.9.2.1              xtable_1.8-4               
## [175] jsonlite_1.7.0              poweRlaw_0.70.6            
## [177] spatstat_1.64-1             R6_2.4.1                   
## [179] Hmisc_4.4-1                 pillar_1.4.6               
## [181] htmltools_0.5.0             mime_0.9                   
## [183] DT_0.15                     glue_1.4.2                 
## [185] fastmap_1.0.1               BiocParallel_1.22.0        
## [187] codetools_0.2-16            lattice_0.20-41            
## [189] tibble_3.0.3                curl_4.3                   
## [191] ggbeeswarm_0.6.0            leiden_0.3.3               
## [193] gtools_3.8.2                zip_2.1.1                  
## [195] openxlsx_4.1.5              GO.db_3.11.4               
## [197] openssl_1.4.2               survival_3.1-12            
## [199] rmarkdown_2.3               munsell_0.5.0              
## [201] rhdf5_2.32.2                GenomeInfoDbData_1.2.3     
## [203] HDF5Array_1.16.1            haven_2.3.1                
## [205] reshape2_1.4.4              gtable_0.3.0

MOFA analysis of the Chromium Single Cell Multiome ATAC + Gene Expression assay

2020-10-01

Contents