Abstract
Dna based EvolutioNary tree preDiction by scRna-seq technOlogy (DENDRO) is a statistical framework that takes scRNA-seq data for a tumor and accurately reconstructs its phylogeny, assigning each single cell from the scRNA-seq data to a subclone. Our approach allows us to (1) cluster cells based on genomic profiles while accounting for transcriptional bursting, technical dropout and sequencing error, as benchmarked on in silico mixture and a spike-in analysis, (2) make robust mutation profile inference in subclonal resolution, and (3) perform DNA-RNA joint analysis with same set of cells and examine the relationship between transcriptomic variation and mutation profile. For more detail, please check our biorixv preprint, no link yet
Install all packages in the latest version of R.
devtools::install_github("zhouzilu/DENDRO")
If you have any questions or problems when using DENDRO, please feel free to open a new issue here. You can also email the maintainers of the package – the contact information is below.
Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn
Figure 1 illustrate the overall pipeline. DENDRO starts from scRNA-seq raw data. We recommend STAR 2-pass method for mapping because it is more robust with splicing junction. SNA detection was applied to mapped BAM files. Both counts of total allele reads and counts of alternative allele reads for each cell \(c\) at mutation position \(g\) are collected. In the next step, a cell-to-cell genetic divergence matrix is calculated using a genetic divergence evaluation function. DENDRO further clusters the cells and polls cells from same cluster together and re-estimate SNA profiles. Based on the re-estimated SNA profiles, DENDRO generates a parsimony tree which shows the evolution relationship between subclones.
Figure 1. A flowchart outlining the procedures for DENDRO. We separate our analysis pipeline into three stages. The subtask is labeled on the right.
Starting with scRNA-seq dataset, we first detect primary mutation with GATK tools. Due to large amount of cells, a map-reduce ERC GVCF approach is necessary. An detailed illustration is attached here. After generated the VCF files, DENDRO utilized information of (1) number of alternative allele read counts \(X\), (2) number of total allele read counts \(N\), and (3) mutation profile matrix \(Z\), where \(Z=1\) indicates mutations for each cell and loci. \(X, N, Z \in R^{M \times C}\). \(M\) is total number of mutation loci and \(C\) is total number of cells. We also provided an example R script here for \(X\),\(N\),\(Z\) matrix extraction from VCF.
Here we load our demo dataset (200 mutation locus \(\times\) 130 cells) generated using spike-in.
library(DENDRO)
data("DENDRO_demo")
str(demo)
#> List of 5
#> $ X : int [1:200, 1:130] 0 0 0 0 0 0 0 0 0 13 ...
#> $ N : int [1:200, 1:130] 0 0 0 0 0 4 0 0 4 13 ...
#> $ Z : num [1:200, 1:130] 0 1 0 0 0 0 0 1 0 1 ...
#> $ label: int [1:130] 2 3 2 3 2 1 1 2 2 3 ...
#> $ Info :'data.frame': 200 obs. of 4 variables:
#> ..$ #CHROM: chr [1:200] "chr2" "chr16" "chr6" "chr5" ...
#> ..$ POS : int [1:200] 181856031 2303284 73428614 112211786 120692337 125748791 58364166 154313322 51631588 73388675 ...
#> ..$ REF : chr [1:200] "T" "T" "T" "T" ...
#> ..$ ALT : chr [1:200] "G" "A" "C,*" "C" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
where Info
indicates mutation information such as chromosome, allele nucleitide and position, and label
indicates the true label, which we don’t have in a real data analysis.
Given \(X, N\) and \(Z\), DENDRO first apply quality control (QC). In default, DENDRO filter out (1) mutations with less than 5% cells detected (too rare) or more than 95% of cells detected (too common), and (2) cells with total read counts deviated than 5 standard deviation. Check ??FilterCellMutation
for more details.
demo_qc = FilterCellMutation(demo$X,demo$N,demo$Z,demo$Info,demo$label,cut.off.VAF = 0.05, cut.off.sd = 5)
#> Filtering variants: 152 out of 200 variants retained; filter creatiera: 10 < # of cells detected < 190
#> Filtering cells: 130 out of 130 cells retained; filter creatiera: 0 < total # of variants detected < 4371
The above two plots illustrate two distributions: (left) total number of cells for each mutations and (right) total number of mutations for each cell, with the red line indicating filter thresholds.
str(demo_qc)
#> List of 6
#> $ X : int [1:152, 1:130] 0 0 0 0 0 0 0 13 0 0 ...
#> $ N : int [1:152, 1:130] 0 0 0 0 4 0 4 13 0 0 ...
#> $ Z : num [1:152, 1:130] 0 1 0 0 0 1 0 1 1 0 ...
#> $ Info :'data.frame': 152 obs. of 4 variables:
#> ..$ #CHROM: chr [1:152] "chr2" "chr16" "chr6" "chr10" ...
#> ..$ POS : int [1:152] 181856031 2303284 73428614 120692337 125748791 154313322 51631588 73388675 141506144 82340585 ...
#> ..$ REF : chr [1:152] "T" "T" "T" "A" ...
#> ..$ ALT : chr [1:152] "G" "A" "C,*" "G" ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> $ label : int [1:130] 2 3 2 3 2 1 1 2 2 3 ...
#> $ filted_cell: logi [1:130] TRUE TRUE TRUE TRUE TRUE TRUE ...
Now DENDRO can calculate the genetic divergence matrix. Check ??DENDRO.dist
for more details.
demo_qc$dist = DENDRO.dist(demo_qc$X,demo_qc$N,demo_qc$Z,show.progress=FALSE)
Let’s apply hierachical clustering and plot out the clustering result colored by known true label: demo_qc$clade
. To allow user to customize your tree DENDRO.cluster
inherent arguments from plot.phylo
, Check DENDRO.cluster
for more details.
demo_qc$cluster = DENDRO.cluster(demo_qc$dist,label=demo_qc$label,type='fan')
Let’s decided the optimal number of clusters using an intra-cluster divergence (icd) measurements.
demo_qc$icd = DENDRO.icd(demo_qc$dist,demo_qc$cluster)
demo_qc$optK = 3
demo_qc$DENDRO_label = cutree(demo_qc$cluster,demo_qc$optK)
We decide the optimal number of cluster by identifying kink or “elbow point” in the icd plot. In this example, optK = 3
. It is crucial that if there are multiple “elbow point”, the smallest one is the most robust.
Let’s re-plot our data with DENDRO label
demo_qc$cluster = DENDRO.cluster(demo_qc$dist,label=demo_qc$DENDRO_label,type='fan')
DENDRO further re-esimate the subclone-level mutation profiles by pooling all reads within each reads together with a maximum liklihood approach (B. Li et al. 2012). Check ??DENDRO.recalculate
for more details.
demo_cluster = DENDRO.recalculate(demo_qc$X,demo_qc$N, demo_qc$Info, demo_qc$DENDRO_label, cluster.name=c('Cluster3','Cluster2','Cluster1'))
#> Before QC, there are total 152 mutations across 3 subclones
#> After QC, there are total 68 mutations across 3 subclones
cluster.name
specifies the cluster name given the clustering order (1, 2, 3, …).
Given the filtered cluster-level mutation profiles, we now can construct an neighbor-joining tree using algorithm implemented in package phangorn
. See ??DENDRO.tree
for more details.
DENDRO.tree(demo_cluster$Z)
In this phylogenetic tree, Cluster1 has greater genetic divergence compared with Cluster2 and Cluster3, which is consistent with our data generating process.
User could further perform joint differential expression analysis and differential mutation analysis between different subclone groups. Mutation profile across clones is sored at demo_cluster$Z
.
Differential expression analysis packages are wide-spread. Two methods that I personally preferred are Seurat MAST implementation (Butler et al. 2018) and scDD (Korthauer et al. 2016).
Gene set enrichment analysis is available at MSigDB, Broad Institute (Subramanian et al. 2005).
sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DENDRO_0.1.1 ggplot2_3.1.1 mclust_5.4.3
#> [4] phyclust_0.1-24 colorspace_1.4-1 svMisc_1.1.0
#> [7] phangorn_2.5.3 ape_5.3 dendextend_1.10.0
#> [10] TailRank_3.2.1 oompaBase_3.2.6
#>
#> loaded via a namespace (and not attached):
#> [1] modeltools_0.2-22 tidyselect_0.2.5 xfun_0.5
#> [4] purrr_0.3.2 kernlab_0.9-27 lattice_0.20-38
#> [7] oompaData_3.1.1 htmltools_0.3.6 stats4_3.5.3
#> [10] viridisLite_0.3.0 yaml_2.2.0 rlang_0.4.2
#> [13] pillar_1.3.1 withr_2.1.2 glue_1.3.1
#> [16] prabclus_2.2-7 BiocGenerics_0.28.0 fpc_2.1-11.1
#> [19] plyr_1.8.4 robustbase_0.93-4 stringr_1.4.0
#> [22] munsell_0.5.0 gtable_0.3.0 mvtnorm_1.0-10
#> [25] evaluate_0.13 Biobase_2.42.0 knitr_1.22
#> [28] flexmix_2.3-15 parallel_3.5.3 class_7.3-15
#> [31] DEoptimR_1.0-8 trimcluster_0.1-2.1 Rcpp_1.0.1
#> [34] scales_1.0.0 diptest_0.75-7 fastmatch_1.1-0
#> [37] gridExtra_2.3 digest_0.6.18 stringi_1.4.3
#> [40] dplyr_0.8.0.1 grid_3.5.3 quadprog_1.5-5
#> [43] tools_3.5.3 magrittr_1.5 lazyeval_0.2.2
#> [46] tibble_2.1.1 cluster_2.0.7-1 crayon_1.3.4
#> [49] whisker_0.3-2 pkgconfig_2.0.2 Matrix_1.2-15
#> [52] MASS_7.3-51.1 assertthat_0.2.1 rmarkdown_1.12
#> [55] viridis_0.5.1 R6_2.4.0 igraph_1.2.4
#> [58] nlme_3.1-137 nnet_7.3-12 compiler_3.5.3
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology. doi:10.1038/nbt.4096.
Korthauer, Keegan D., Li-Fang Chu, Michael A. Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. 2016. “A Statistical Approach for Identifying Differential Distributions in Single-Cell Rna-Seq Experiments.” Genome Biology 17 (1): 222. doi:10.1186/s13059-016-1077-y.
Li, Bingshan, Wei Chen, Xiaowei Zhan, Fabio Busonero, Serena Sanna, Carlo Sidore, Francesco Cucca, Hyun M Kang, and Gonçalo R Abecasis. 2012. “A Likelihood-Based Framework for Variant Calling and de Novo Mutation Detection in Families.” PLoS Genetics 8 (10). Public Library of Science: e1002944.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. doi:10.1073/pnas.0506580102.