Abstract
Before conducting a single cell RNA-seq experiment on a tumor sample, it is important to project how subclone detection power depends on the number of cells sequenced and the coverage per cell. To facilitate experiment design, we developed a tool, DENDROplan (Figure 2), that predicts the expected clustering accuracy by DENDRO given sequencing parameters. Given a tree structure and a target accuracy, DENDROplan computes the necessary read depth and number of cells needed based on the spike-in procedure described above. For more detail, please check our biorixv preprint
Install all packages in the latest version of R.
devtools::install_github("zhouzilu/DENDRO")
The reference dataset from Deng et al. (Deng et al. 2014) is pre-stored in the DENDRO package.
If you have any questions or problems when using DENDROplan, please feel free to open a new issue here. You can also email the maintainers of the package – the contact information is below.
Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn
Figure 2 illustrate the overall pipeline. The analysis starts with a designed tree with an interested clade (purple clade in the example). Based on the tree model, number of cells, sequencing depth and sequencing error rate, we simulate single cell mutation profile. scRNA data was sampled from a reference scRNA-seq dataset given expression level in bulk. A phylogeny computed by DENDRO is further compared with underlining truth, which measured by three statistics - adjusted Rand index (global accuracy statistics), capture rate (subclone specific statistic) and purity (subclone specific statistic). Figure 2. A flowchart outlining the procedures for DENDROplan.
Our reference dataset is from Deng et al. (Deng et al. 2014) with great sequencing quality and depth. First, let’s load DENDROplan_ref
and check the structure of the reference variable ref
library(DENDRO)
data("DENDROplan_ref")
str(ref)
#> List of 3
#> $ X1 : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#> $ X2 : int [1:22958, 1:133] 0 0 0 0 0 0 0 0 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:133] "GSM1112611_earlyblast_2-1_expression.txt" "GSM1112612_earlyblast_2-10_expression.txt" "GSM1112613_earlyblast_2-12_expression.txt" "GSM1112614_earlyblast_2-15_expression.txt" ...
#> $ exp_level: chr [1:22958] "low" "low" "low" "low" ...
There are total 22958 potential mutation loci across 133 cells. As a result, when we simulate trees, the maximum number of cells is 133 and the maximum number of mutations is 22958.
The trees that DENDROplan is able to simulate show at Figure 3. The corresponding k
and subtype
value are also included.
Figure 3. Example tree structure DENDRO can simulate with corresponding parameters.
Numbers on branches illustrate index of mutation probability variable lprob
, and numbers at node illustrate index of cell proportion variable kprob
. The interested clade is labeled by a star. Capture rate and purity are measured on that specific star cluster. k
is the parameter indicating number of subclones. When k=4
, the subclones can form two types of trees, differetiated by the subtype
parameter.
We measure DENDROplan results by four statistics, explained here:
Adjusted Rand index: Adjusted Rand index is a measure of the similarity between two data clusterings after adjusted for the chance grouping of elements. For details, see here
Capture rate: Capture rate is a measure of “false negative rate” of a specific star clade. Out of all the cells from the specific clade, how many of them is detected by the algorithm.
Purity: Purity is a measure of “false positive rate” of a specific star clade. Out of all the cells in the “specific cluster” you detected, how many are actually from the true specific clade.
Observation probability: This is only one single value, which measures the probability of observe all clades in our multipe simulation round.
Assume we only have 100 mutation sites and our tree looks like Figure 3c with mutations and cells uniformly distributed
res=DENDRO.simulation(n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulation
res
#> AdjustedRandIndex Capture rate Purity Observation probability
#> CI_low 0.7173185 0.8181818 0.7500000 NA
#> Mean 0.8701641 0.9520010 0.9370848 1
#> CI_up 1.0000000 1.0000000 1.0000000 NA
Result shows the example tree structure as well as statistics with 95% Confidence Interval (CI).
In the 2nd example, we want to specify the similar tree but with customized cell proportion and mutation loads. We want the four clusters have cell proportion 0.2, 0.2, 0.2 and 0.4, i.e. we want one major subclone. In our code we need to specify this by parameter kprob
. Also, we want our mutation load unequally distributed with branch 4 (cluster 4 specific) having more mutations lprob=c(0.15,0.15,0.15,0.25,0.15,0.15)
.
res=DENDRO.simulation(kprob=c(0.2,0.2,0.2,0.4),lprob=c(0.15,0.15,0.15,0.25,0.15,0.15),n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulation
res
#> AdjustedRandIndex Capture rate Purity Observation probability
#> CI_low 0.6393633 0.6800000 0.9056604 NA
#> Mean 0.8819690 0.9511894 0.9830158 1
#> CI_up 0.9815705 1.0000000 1.0000000 NA
Example tree shows consistency with our input.
One important reason for user to use this tool is that it can estimate the cluster accuracy given different sequencing depth. In the original Deng et al. paper (Deng et al. 2014), they collect around 10,000,000 50bp-reads per cell in mice, resulting 46 mapped reads for each mutation site, which is super high depth (Figure 4).
Figure 4. scRNA-seq library statistics for Deng et al. (Deng et al. 2014). Modified from original paper.
In real life, usually 1,000,000 reads per cell is pretty good depth. Proportionally, there are around 4.5 mapped reads per mutation site. Let’s change the read depth parameter RD
and see how well DENDRO performs.
res=DENDRO.simulation(RD=4.5,n=100,ref=ref,k=4,subtype=1)
#> There are total 4 clades in our simulation
res
#> AdjustedRandIndex Capture rate Purity Observation probability
#> CI_low 0.5795291 0.6521739 0.6206897 NA
#> Mean 0.8090934 0.9205561 0.9324983 1
#> CI_up 0.9730729 1.0000000 1.0000000 NA
Feel free to play with different parameters. The detailed function document can be found by typing ??DENDRO.simulation
in R.
sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DENDRO_0.1.1 ggplot2_3.1.1 mclust_5.4.3
#> [4] phyclust_0.1-24 colorspace_1.4-1 svMisc_1.1.0
#> [7] phangorn_2.5.3 ape_5.3 dendextend_1.10.0
#> [10] TailRank_3.2.1 oompaBase_3.2.6
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.1 mvtnorm_1.0-10 lattice_0.20-38
#> [4] class_7.3-15 assertthat_0.2.1 digest_0.6.18
#> [7] R6_2.4.0 plyr_1.8.4 stats4_3.5.3
#> [10] evaluate_0.13 pillar_1.3.1 rlang_0.3.4
#> [13] lazyeval_0.2.2 diptest_0.75-7 whisker_0.3-2
#> [16] kernlab_0.9-27 Matrix_1.2-15 rmarkdown_1.12
#> [19] labeling_0.3 stringr_1.4.0 igraph_1.2.4
#> [22] munsell_0.5.0 compiler_3.5.3 xfun_0.5
#> [25] pkgconfig_2.0.2 BiocGenerics_0.28.0 htmltools_0.3.6
#> [28] nnet_7.3-12 tidyselect_0.2.5 tibble_2.1.1
#> [31] gridExtra_2.3 quadprog_1.5-5 viridisLite_0.3.0
#> [34] crayon_1.3.4 dplyr_0.8.0.1 withr_2.1.2
#> [37] MASS_7.3-51.1 grid_3.5.3 nlme_3.1-137
#> [40] gtable_0.3.0 magrittr_1.5 scales_1.0.0
#> [43] stringi_1.4.3 viridis_0.5.1 flexmix_2.3-15
#> [46] robustbase_0.93-4 fastmatch_1.1-0 tools_3.5.3
#> [49] fpc_2.1-11.1 Biobase_2.42.0 glue_1.3.1
#> [52] trimcluster_0.1-2.1 DEoptimR_1.0-8 purrr_0.3.2
#> [55] parallel_3.5.3 yaml_2.2.0 oompaData_3.1.1
#> [58] cluster_2.0.7-1 prabclus_2.2-7 knitr_1.22
#> [61] modeltools_0.2-22
Deng, Qiaolin, Daniel Ramsköld, Björn Reinius, and Rickard Sandberg. 2014. “Single-Cell Rna-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells.” Science 343 (6167). American Association for the Advancement of Science: 193–96. doi:10.1126/science.1245316.