This vignette demonstrates how MOFA+ can be used to integrate scRNA-seq and scATAC-seq data from the same cell.
There are very few protocols to date that enable this, one of them is SNARE-seq. As a demonstration we will use a simple data set of ~1000 cells where they mixed four cell lines together: BJ, GM12878, H1 and K562.
In this setting, MOFA+ should be able to detect the (coordinated) variability in the RNA expression and in chromatin accessibiliy that determines the four different cell types.
Make sure that MOFA2
is imported last, to avoid collisions with functions from other packages
library(data.table)
library(purrr)
library(Seurat)
library(ggplot2)
library(gridExtra)
library(MOFA2)
# reticulate::use_python("/Users/ricard/anaconda3/envs/base_new/bin/python", required = T)
Load RNA modality as Seurat object
seurat_rna <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/snare_seq/seurat_rna.rds"))
# seurat_rna <- readRDS("/Users/ricard/data/Chen2019/CellLineMixture/rna/seurat.rds")
dim(seurat_rna)
## [1] 5262 1047
Load ATAC modality as Seurat object
# seurat_atac <- readRDS("/Users/ricard/data/Chen2019/CellLineMixture/atac/seurat.rds")
seurat_atac <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/snare_seq/seurat_atac.rds"))
dim(seurat_atac)
## [1] 58099 1047
Load metadata
# metadata <- fread("/Users/ricard/data/Chen2019/CellLineMixture/cell_metadata.txt")
metadata <- fread("ftp://ftp.ebi.ac.uk/pub/databases/mofa/snare_seq/cell_metadata.txt", header=T)
Let’s make sure that we subset cells that match in both assays:
barcodes <- intersect(colnames(seurat_atac), colnames(seurat_rna))
seurat_atac <- seurat_atac[,barcodes]
seurat_rna <- seurat_rna[,barcodes]
Subset top 2500 most variable genes and top 10000 most variable ATAC peaks
# RNA
seurat_rna <- FindVariableFeatures(seurat_rna,
selection.method = "vst",
nfeatures = 2500
)
rna.features <- seurat_rna@assays$RNA@var.features
# ATAC
seurat_atac <- FindVariableFeatures(seurat_atac,
selection.method = "disp",
nfeatures = 5000
)
atac.features <- seurat_atac@assays$peaks@var.features
mofa <- create_mofa(list(
"RNA" = as.matrix( seurat_rna@assays$RNA@data[rna.features,] ),
"ATAC" = as.matrix( seurat_atac@assays$peaks@data[atac.features,] )
))
mofa
## Untrained MOFA model with the following characteristics:
## Number of views: 2
## Views names: RNA ATAC
## Number of features (per view): 2500 5000
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 1047
# Model options: let's use only 4 factors, should be enough to distinguish the four cell lines.
model_opts <- get_default_model_options(mofa)
model_opts$num_factors <- 4
# Training options: let's use default options
train_opts <- get_default_training_options(mofa)
train_opts$seed <- 42
mofa <- prepare_mofa(
object = mofa,
model_options = model_opts,
training_options = train_opts
)
mofa <- run_mofa(mofa)
Plot variance explained by factor
plot_variance_explained(mofa)
Plot total variance explained per view
plot_variance_explained(mofa, plot_total = TRUE)[[2]]
In general, we observe that most of the variation the model captures is driven by the RNA expression and little variance is explained by the ATAC data. Why? We will explore below. First let’s have a look whether the factors make sense.
Looks like indeed the MOFA factors are capturing cell line variation
plot_factors(mofa, factors=1:4, color_by = "cell_line")
Let’s confirm this by plotting a UMAP based on the MOFA factors:
mofa <- run_umap(mofa)
plot_dimred(mofa, method="UMAP", color_by = "cell_line")
Plot distribution of RNA weights per factor, highlighting the top ones
plot_weights(mofa,
view = "RNA",
factors = 1:2,
nfeatures = 6,
text_size = 4
)
Before denoising (plotting the actual input data)
plot_data_heatmap(mofa,
view = "RNA",
factor = 1,
features = 20,
show_rownames = T, show_colnames = F,
cluster_rows = T, cluster_cols = F,
annotation_samples = "cell_line"
)
After denoising (plotting model predictions)
plot_data_heatmap(mofa,
view = "RNA",
factor = 1,
features = 20,
show_rownames = T, show_colnames = F,
cluster_rows = T, cluster_cols = F,
annotation_samples = "cell_line",
denoise = TRUE
)
RNA looks great, but what is going on with the ATAC data?
Let’s plot a histogram of the data.
hist(mofa@data$ATAC[[1]])
mean(mofa@data$ATAC[[1]]>0)
## [1] 0.005447947
Wow, this is massively sparse. Only 0.5% of values are different than zero. No wonder why the model is uncapable of explaining more variance.
Yet, based on the variance explained estimates it seems that a bit of signal is indeed recovered. Let’s explore it. First, we plot the variance explained values for the top 25 ATAC peaks from every factor.
# Fetch top 25 features per factor, sorted by absolute value of the weight
weights.dt <- get_weights(mofa, views="ATAC", factors="all", as.data.frame = T) %>%
as.data.table %>%
.[,abs_value:=abs(value)] %>%
.[,.SD %>% setorder(abs_value) %>% tail(n=25), by="factor"]
top_features <- weights.dt[,feature] %>% as.character
# Fetch variance explained values per feature
variance.dt <- plot_variance_explained_per_feature(mofa,
view = "ATAC",
features = top_features,
return_data = T
) %>% as.data.table
# Merge
to.plot <- merge(
variance.dt[,c("feature","value")],
weights.dt[,c("feature","factor")], by="feature")
For the features with the largest weight (per factor) the model is capturing a bit of signal, but it is indeed very small. At most only 5% of the variance is explained:
ggplot(to.plot, aes(x=feature, y=value*100)) +
geom_bar(stat="identity") +
facet_wrap(~factor, scales="free_x") +
labs(x="ATAC peaks", y="Variance explained (%)") +
theme_classic() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
As we did before with the RNA, let’s visualise the actual data (figure below).
plot_data_heatmap(mofa,
view = "ATAC",
factor = 1,
features = 20,
show_rownames = T, show_colnames = F,
cluster_rows = T, cluster_cols = F,
annotation_samples = "cell_line"
)
If we plot the (denoised) model predictions this actually looks a bit better, but the input data is just of very poor quality.
plot_data_heatmap(mofa,
view = "ATAC",
factor = 1,
features = 20,
show_rownames = T, show_colnames = F,
cluster_rows = T, cluster_cols = F,
annotation_samples = "cell_line",
denoise = TRUE
)
cisTopic is a method for the simultaneous identification of cis-regulatory topics and cell states from single cell ATAC-seq data. The underlying Latent Dirichlet Allocation (LDA) model is a latent variable model (i.e. analogous to factor analysis) that is very well suited to deal with the sparsity of scATAC-seq data. Similar to MOFA, cisTopic extracts cell and feature (i.e. peaks) embeddings that can be used for downstream analysis.
Here we will use cisTopic to create a (cell,topics) matrix that we will use as input to MOFA.
library(cisTopic)
cisTopicObject <- createcisTopicObject(
count.matrix = seurat_atac@assays$peaks@counts,
project.name = 'CellLineMixture'
)
cisTopicObject
## An object of class cisTopic in project CellLineMixture
## 58099 regions across 1047 samples.
topics <- c(10, 15, 20, 25)
cisTopicObject <- runCGSModels(
cisTopicObject,
topic = topics,
burnin = 250,
iterations = 500,
nCores = 2
)
# Model selection
cisTopicObject <- selectModel(cisTopicObject)
Load pre-computed solution
cisTopicObject <- readRDS("/Users/ricard/data/Chen2019/CellLineMixture/atac/cistopics.rds")
cistopics_embeddings <- modelMatSelection(
cisTopicObject,
target = "cell",
method = "Z-score"
)
# Sort samples to match same order as in the RNA
cistopics_embeddings <- cistopics_embeddings[,colnames(seurat_rna)]
dim(cistopics_embeddings)
## [1] 25 1047
Notice that we replace the ATAC peak matrix by the cisTopic cell embedding.
mofa <- create_mofa(list(
"RNA" = as.matrix( seurat_rna@assays$RNA@data[rna.features,] ),
"ATAC" = cistopics_embeddings
))
mofa
## Untrained MOFA model with the following characteristics:
## Number of views: 2
## Views names: RNA ATAC
## Number of features (per view): 2500 25
## Number of groups: 1
## Groups names: group1
## Number of samples (per group): 1047
Define MOFA options
# Model options
model_opts <- get_default_model_options(mofa)
model_opts$num_factors <- 4
# Training options
train_opts <- get_default_training_options(mofa)
train_opts$seed <- 42
# Prepare MOFA object
mofa <- prepare_mofa(
object = mofa,
model_options = model_opts,
training_options = train_opts
)
Train the model
mofa <- run_mofa(mofa)
Plot variance explained per factor
plot_variance_explained(mofa)
Plot total variance explained per view
plot_variance_explained(mofa, plot_total = TRUE)[[2]]
This looks much better now! a lot of variance is explained by the ATAC topics. The reason is that the topics provide a compressed and denoised representation of the ATAC-seq data based on the Latent Dirichlet Allocation model.
Let’s again plot a UMAP based on the MOFA factors to confirm that the factors capture the cell line variability:
mofa <- run_umap(mofa)
plot_dimred(mofa, method="UMAP", color_by = "cell_line")
Plot UMAP coloured by the z-score values (per cell) for some topics. Some nice signal in there!
p1 <- plot_dimred(mofa, method="UMAP", color_by = "Topic1")
p2 <- plot_dimred(mofa, method="UMAP", color_by = "Topic3")
p3 <- plot_dimred(mofa, method="UMAP", color_by = "Topic7")
p4 <- plot_dimred(mofa, method="UMAP", color_by = "Topic13")
grid.arrange(p1, p2, p3, p4, nrow= 2, ncol = 2)
plot_weights(mofa,
view = "ATAC",
factors = 1:3,
nfeatures = 3,
text_size = 5,
dot_size = 1.5
)
plot_data_heatmap(mofa,
view = "ATAC",
factor = 1,
features = 100,
show_rownames = T, show_colnames = F,
cluster_rows = T, cluster_cols = F,
)
There are multiple ways of characterising the topics. We’ll show a couple of examples, but we recommend you to visit the cisTopic tutorials.
cisTopic returns two distributions that represent: (1) the topic contributions per cell and (2) the region contribution to a topic. As in MOFA, you can fetch the corresponding values and try interpretring the etiology of each topic.
To analyze the regions, the first step is always to derive a score that evaluates how likely is for a region to belong to a topic. This is done using getRegionsScores()
. These scores can be rescaled into the range [0,1], which will be useful for the susbsequent binarization step:
cisTopicObject <- getRegionsScores(cisTopicObject, method = "NormTop", scaled = TRUE)
cisTopicObject <- binarizecisTopics(cisTopicObject, plot = FALSE)
Extract region loadings (in z-scores)
loadings <- modelMatSelection(cisTopicObject, target = "region", method = "Z-score")
hist(loadings)
loadings[1:5,1:3]
## chr20:62121663-62122648 chr14:64387638-64388462
## Topic1 4.7498801 4.74827774
## Topic2 -0.2555989 -0.26586624
## Topic3 -0.2555989 0.08395776
## Topic4 -0.2555989 0.31717376
## Topic5 -0.2555989 0.08395776
## chr2:130342575-130343233
## Topic1 4.6877146
## Topic2 -0.2640966
## Topic3 -0.2640966
## Topic4 -0.2640966
## Topic5 -0.1461963
library(ChIPseeker)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# Add peak annotation to the regions (e.g. type of region, closest gene)
cisTopicObject <- annotateRegions(
cisTopicObject,
txdb = TxDb.Hsapiens.UCSC.hg38.knownGene,
annoDb = 'org.Hs.eg.db'
)
## >> preparing features information... 2020-02-10 15:50:57
## >> identifying nearest features... 2020-02-10 15:51:00
## >> calculating distance from peak to TSS... 2020-02-10 15:51:02
## >> assigning genomic annotation... 2020-02-10 15:51:02
## >> adding gene annotation... 2020-02-10 15:51:45
## >> assigning chromosome lengths 2020-02-10 15:51:45
## >> done... 2020-02-10 15:51:45
Heatmap containing the row normalised AUC values for the signatures in the topics
signaturesHeatmap(
cisTopicObject,
selected.signatures = 'annotation'
)
# cisTopicObject <- binarizedcisTopicsToCtx(cisTopicObject, genome='hg19')
#
# cisTopicObject <- scoredRegionsToCtx(cisTopicObject, genome='hg19')
#
# pathToFeather <- "hg19-regions-9species.all_regions.mc8nr.feather"
#
# cisTopicObject <- topicsRcisTarget(cisTopicObject, genome='hg19', pathToFeather, reduced_database=FALSE, nesThreshold=3, rocthr=0.005, maxRank=20000, nCores=5)
#
# Topic6_motif_enr <- cisTopicObject@binarized.RcisTarget[[6]]
#
# DT::datatable(Topic6_motif_enr[,-c("enrichedRegions", "TF_lowConf"), with=FALSE], escape = FALSE, filter="top", options=list(pageLength=5))
The count ATAC-seq matrix is too sparse to be used as input to MOFA. We recommend applying some pre-processing options to enrich the signal. A simple solution could be to aggregate peaks and summarise them. The second option, which we demonstrated here, is to use a signal extraction method such as cis-Topics, and then take their output as input to MOFA.
I want to thank Carmen Bravo for feedback on how to best use CisTopic, and Song Chen for sharing the scripts to analyse SNARE-seq data.
Having said this, one should not request scripts to reproduce results by e-mail in 2020, we should all document our code and release it upon publication.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] doRNG_1.7.1
## [2] rngtools_1.4
## [3] pkgmaker_0.27
## [4] registry_0.5-1
## [5] foreach_1.4.7
## [6] ComplexHeatmap_2.3.2
## [7] fastcluster_1.1.25
## [8] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.6
## [9] GenomicFeatures_1.36.4
## [10] GenomicRanges_1.36.1
## [11] GenomeInfoDb_1.20.0
## [12] org.Hs.eg.db_3.8.2
## [13] AnnotationDbi_1.46.1
## [14] IRanges_2.18.3
## [15] S4Vectors_0.22.1
## [16] Biobase_2.44.0
## [17] BiocGenerics_0.30.0
## [18] ChIPseeker_1.20.0
## [19] fitdistrplus_1.0-14
## [20] npsurv_0.4-0
## [21] lsei_1.2-0
## [22] survival_3.1-8
## [23] MASS_7.3-51.4
## [24] cisTopic_0.3.0
## [25] MOFA2_0.99.1
## [26] gridExtra_2.3
## [27] ggplot2_3.2.1
## [28] Seurat_3.1.1
## [29] purrr_0.3.3
## [30] data.table_1.12.6
## [31] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] rtracklayer_1.44.4
## [2] GGally_1.4.0
## [3] R.methodsS3_1.7.1
## [4] tidyr_1.0.0
## [5] bit64_0.9-7
## [6] knitr_1.25
## [7] irlba_2.3.3
## [8] DelayedArray_0.10.0
## [9] R.utils_2.9.0
## [10] RCurl_1.95-4.12
## [11] doParallel_1.0.15
## [12] metap_1.1
## [13] snow_0.4-3
## [14] cowplot_1.0.0
## [15] lambda.r_1.2.4
## [16] RSQLite_2.1.2
## [17] RANN_2.6.1
## [18] europepmc_0.3
## [19] future_1.14.0
## [20] bit_1.1-14
## [21] enrichplot_1.4.0
## [22] xml2_1.2.2
## [23] httpuv_1.5.2
## [24] SummarizedExperiment_1.14.1
## [25] assertthat_0.2.1
## [26] viridis_0.5.1
## [27] xfun_0.10
## [28] hms_0.5.2
## [29] evaluate_0.14
## [30] promises_1.1.0
## [31] progress_1.2.2
## [32] caTools_1.17.1.2
## [33] igraph_1.2.4.1
## [34] DBI_1.0.0
## [35] htmlwidgets_1.5.1
## [36] futile.logger_1.4.3
## [37] reshape_0.8.8
## [38] feather_0.3.5
## [39] ellipsis_0.3.0
## [40] corrplot_0.84
## [41] RSpectra_0.15-0
## [42] dplyr_0.8.3
## [43] ggpubr_0.2.3
## [44] backports_1.1.5
## [45] bookdown_0.14
## [46] annotate_1.62.0
## [47] gbRd_0.4-11
## [48] gridBase_0.4-7
## [49] RcppParallel_4.4.4
## [50] biomaRt_2.40.5
## [51] vctrs_0.2.0
## [52] ROCR_1.0-7
## [53] withr_2.1.2
## [54] ggforce_0.3.1
## [55] triebeard_0.3.0
## [56] sctransform_0.2.0
## [57] GenomicAlignments_1.20.1
## [58] prettyunits_1.0.2
## [59] cluster_2.1.0
## [60] DOSE_3.10.2
## [61] ape_5.3
## [62] lazyeval_0.2.2
## [63] crayon_1.3.4
## [64] pkgconfig_2.0.3
## [65] labeling_0.3
## [66] tweenr_1.0.1
## [67] nlme_3.1-142
## [68] vipor_0.4.5
## [69] rlang_0.4.1
## [70] globals_0.12.4
## [71] lifecycle_0.1.0
## [72] doSNOW_1.0.18
## [73] rsvd_1.0.2
## [74] polyclip_1.10-0
## [75] matrixStats_0.55.0
## [76] lmtest_0.9-37
## [77] graph_1.62.0
## [78] urltools_1.7.3
## [79] Matrix_1.2-18
## [80] Rhdf5lib_1.6.3
## [81] boot_1.3-23
## [82] zoo_1.8-6
## [83] beeswarm_0.2.3
## [84] GlobalOptions_0.1.1
## [85] ggridges_0.5.1
## [86] pheatmap_1.0.12
## [87] rjson_0.2.20
## [88] png_0.1-7
## [89] viridisLite_0.3.0
## [90] RcisTarget_1.7.1
## [91] bitops_1.0-6
## [92] R.oo_1.22.0
## [93] lda_1.4.2
## [94] KernSmooth_2.23-16
## [95] Biostrings_2.52.0
## [96] blob_1.2.0
## [97] shape_1.4.4
## [98] stringr_1.4.0
## [99] qvalue_2.16.0
## [100] gridGraphics_0.4-1
## [101] ggsignif_0.6.0
## [102] scales_1.0.0
## [103] memoise_1.1.0
## [104] GSEABase_1.46.0
## [105] magrittr_1.5
## [106] plyr_1.8.4
## [107] ica_1.0-2
## [108] gplots_3.0.1.1
## [109] bibtex_0.4.2
## [110] gdata_2.18.0
## [111] zlibbioc_1.30.0
## [112] compiler_3.6.2
## [113] RColorBrewer_1.1-2
## [114] clue_0.3-57
## [115] plotrix_3.7-7
## [116] Rsamtools_2.0.3
## [117] XVector_0.24.0
## [118] listenv_0.7.0
## [119] pbapply_1.4-2
## [120] formatR_1.7
## [121] text2vec_0.5.1
## [122] tidyselect_0.2.5
## [123] stringi_1.4.3
## [124] forcats_0.4.0
## [125] doMC_1.3.6
## [126] yaml_2.2.0
## [127] GOSemSim_2.10.0
## [128] ggrepel_0.8.1
## [129] fastmatch_1.1-0
## [130] tools_3.6.2
## [131] future.apply_1.3.0
## [132] circlize_0.4.8
## [133] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [134] AUCell_1.6.1
## [135] farver_2.0.3
## [136] Rtsne_0.15
## [137] ggraph_2.0.1
## [138] rvcheck_0.1.7
## [139] digest_0.6.22
## [140] BiocManager_1.30.9
## [141] FNN_1.1.3
## [142] shiny_1.4.0
## [143] Rcpp_1.0.2
## [144] SDMTools_1.1-221.1
## [145] later_1.0.0
## [146] RcppAnnoy_0.0.13
## [147] httr_1.4.1
## [148] Rdpack_0.11-0
## [149] colorspace_1.4-1
## [150] XML_3.98-1.20
## [151] reticulate_1.13
## [152] splines_3.6.2
## [153] uwot_0.1.4
## [154] graphlayouts_0.5.0
## [155] ggplotify_0.0.4
## [156] plotly_4.9.0
## [157] xtable_1.8-4
## [158] jsonlite_1.6
## [159] futile.options_1.0.1
## [160] tidygraph_1.1.2
## [161] UpSetR_1.4.0
## [162] zeallot_0.1.0
## [163] R6_2.4.0
## [164] mlapi_0.1.0
## [165] pillar_1.4.2
## [166] htmltools_0.4.0
## [167] mime_0.7
## [168] glue_1.3.1
## [169] fastmap_1.0.1
## [170] BiocParallel_1.18.1
## [171] codetools_0.2-16
## [172] fgsea_1.10.1
## [173] tsne_0.1-3
## [174] lattice_0.20-38
## [175] tibble_2.1.3
## [176] ggbeeswarm_0.6.0
## [177] leiden_0.3.1
## [178] gtools_3.8.1
## [179] GO.db_3.8.2
## [180] rmarkdown_1.16
## [181] munsell_0.5.0
## [182] GetoptLong_0.1.8
## [183] DO.db_2.9
## [184] rhdf5_2.28.1
## [185] GenomeInfoDbData_1.2.1
## [186] iterators_1.0.12
## [187] HDF5Array_1.12.3
## [188] reshape2_1.4.3
## [189] gtable_0.3.0