DENDRO Vignette

Zilu Zhou

10/25/2019

Abstract

Dna based EvolutioNary tree preDiction by scRna-seq technOlogy (DENDRO) is a statistical framework that takes scRNA-seq data for a tumor and accurately reconstructs its phylogeny, assigning each single cell from the scRNA-seq data to a subclone. Our approach allows us to (1) cluster cells based on genomic profiles while accounting for transcriptional bursting, technical dropout and sequencing error, as benchmarked on in silico mixture and a spike-in analysis, (2) make robust mutation profile inference in subclonal resolution, and (3) perform DNA-RNA joint analysis with same set of cells and examine the relationship between transcriptomic variation and mutation profile. For more detail, please check our biorixv preprint, no link yet

1. Installation
2. Questions & issues
3. DENDRO analysis pipeline
4. Session info
5. References

1. Installation

Install all packages in the latest version of R.

devtools::install_github("zhouzilu/DENDRO")

2. Questions & issues

If you have any questions or problems when using DENDRO, please feel free to open a new issue here. You can also email the maintainers of the package – the contact information is below.

Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn

3. DENDRO analysis pipeline

3.1 Overall pipeline

Figure 1 illustrate the overall pipeline. DENDRO starts from scRNA-seq raw data. We recommend STAR 2-pass method for mapping because it is more robust with splicing junction. SNA detection was applied to mapped BAM files. Both counts of total allele reads and counts of alternative allele reads for each cell $c$ at mutation position $g$ are collected. In the next step, a cell-to-cell genetic divergence matrix is calculated using a genetic divergence evaluation function. DENDRO further clusters the cells and polls cells from same cluster together and re-estimate SNA profiles. Based on the re-estimated SNA profiles, DENDRO generates a parsimony tree which shows the evolution relationship between subclones.

Figure 1. A flowchart outlining the procedures for DENDRO. We separate our analysis pipeline into three stages. The subtask is labeled on the right.

3.2 Stage I

3.2.1 Initial SNA detection with GATK

Starting with scRNA-seq dataset, we first detect primary mutation with GATK tools. Due to large amount of cells, a map-reduce ERC GVCF approach is necessary. An detailed illustration is attached here. After generated the VCF files, DENDRO utilized information of (1) number of alternative allele read counts $X$, (2) number of total allele read counts $N$, and (3) mutation profile matrix $Z$, where $Z=1$ indicates mutations for each cell and loci. $X, N, Z \in R^{M \times C}$. $M$ is total number of mutation loci and $C$ is total number of cells. We also provided an example R script here for $X$,$N$,$Z$ matrix extraction from VCF.

Here we load our demo dataset (200 mutation locus $\times$ 130 cells) generated using spike-in.

library(DENDRO)
data("DENDRO_demo")
str(demo)
#> List of 5
#>  $ X    : int [1:200, 1:130] 0 0 0 0 0 0 0 0 0 13 ...
#>  $ N    : int [1:200, 1:130] 0 0 0 0 0 4 0 0 4 13 ...
#>  $ Z    : num [1:200, 1:130] 0 1 0 0 0 0 0 1 0 1 ...
#>  $ label: int [1:130] 2 3 2 3 2 1 1 2 2 3 ...
#>  $ Info :'data.frame':   200 obs. of  4 variables:
#>   ..$ #CHROM: chr [1:200] "chr2" "chr16" "chr6" "chr5" ...
#>   ..$ POS   : int [1:200] 181856031 2303284 73428614 112211786 120692337 125748791 58364166 154313322 51631588 73388675 ...
#>   ..$ REF   : chr [1:200] "T" "T" "T" "T" ...
#>   ..$ ALT   : chr [1:200] "G" "A" "C,*" "C" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr>

where Info indicates mutation information such as chromosome, allele nucleitide and position, and label indicates the true label, which we don’t have in a real data analysis.

3.2.2 Cell and mutation filtering

Given $X, N$ and $Z$, DENDRO first apply quality control (QC). In default, DENDRO filter out (1) mutations with less than 5% cells detected (too rare) or more than 95% of cells detected (too common), and (2) cells with total read counts deviated than 5 standard deviation. Check ??FilterCellMutation for more details.

demo_qc = FilterCellMutation(demo$X,demo$N,demo$Z,demo$Info,demo$label,cut.off.VAF = 0.05, cut.off.sd = 5)
#> Filtering variants:  152  out of  200  variants retained; filter creatiera: 10 < # of cells detected < 190
#> Filtering cells: 130  out of  130 cells retained; filter creatiera: 0 < total # of variants detected < 4371

The above two plots illustrate two distributions: (left) total number of cells for each mutations and (right) total number of mutations for each cell, with the red line indicating filter thresholds.

str(demo_qc)
#> List of 6
#>  $ X          : int [1:152, 1:130] 0 0 0 0 0 0 0 13 0 0 ...
#>  $ N          : int [1:152, 1:130] 0 0 0 0 4 0 4 13 0 0 ...
#>  $ Z          : num [1:152, 1:130] 0 1 0 0 0 1 0 1 1 0 ...
#>  $ Info       :'data.frame': 152 obs. of  4 variables:
#>   ..$ #CHROM: chr [1:152] "chr2" "chr16" "chr6" "chr10" ...
#>   ..$ POS   : int [1:152] 181856031 2303284 73428614 120692337 125748791 154313322 51631588 73388675 141506144 82340585 ...
#>   ..$ REF   : chr [1:152] "T" "T" "T" "A" ...
#>   ..$ ALT   : chr [1:152] "G" "A" "C,*" "G" ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ label      : int [1:130] 2 3 2 3 2 1 1 2 2 3 ...
#>  $ filted_cell: logi [1:130] TRUE TRUE TRUE TRUE TRUE TRUE ...

3.3 Stage II

3.3.1 Genetic divergence matrix calculation

Now DENDRO can calculate the genetic divergence matrix. Check ??DENDRO.dist for more details.

demo_qc$dist = DENDRO.dist(demo_qc$X,demo_qc$N,demo_qc$Z,show.progress=FALSE)

3.3.2 Clustering with the genetic divergence matrix

Let’s apply hierachical clustering and plot out the clustering result colored by known true label: demo_qc$clade. To allow user to customize your tree DENDRO.cluster inherent arguments from plot.phylo, Check DENDRO.cluster for more details.

demo_qc$cluster = DENDRO.cluster(demo_qc$dist,label=demo_qc$label,type='fan')

Let’s decided the optimal number of clusters using an intra-cluster divergence (icd) measurements.

demo_qc$icd = DENDRO.icd(demo_qc$dist,demo_qc$cluster)

demo_qc$optK = 3
demo_qc$DENDRO_label = cutree(demo_qc$cluster,demo_qc$optK)

We decide the optimal number of cluster by identifying kink or “elbow point” in the icd plot. In this example, optK = 3. It is crucial that if there are multiple “elbow point”, the smallest one is the most robust.

Let’s re-plot our data with DENDRO label

demo_qc$cluster = DENDRO.cluster(demo_qc$dist,label=demo_qc$DENDRO_label,type='fan')

3.3.3 Re-estimate mutation profile within each cluster and QC

DENDRO further re-esimate the subclone-level mutation profiles by pooling all reads within each reads together with a maximum liklihood approach (B. Li et al. 2012). Check ??DENDRO.recalculate for more details.

demo_cluster = DENDRO.recalculate(demo_qc$X,demo_qc$N, demo_qc$Info, demo_qc$DENDRO_label, cluster.name=c('Cluster3','Cluster2','Cluster1'))
#> Before QC, there are total  152  mutations across  3  subclones 
#> After QC, there are total  68  mutations across  3  subclones

cluster.name specifies the cluster name given the clustering order (1, 2, 3, …).

3.4 Stage III

3.4.1 Evolutionary tree construction

Given the filtered cluster-level mutation profiles, we now can construct an neighbor-joining tree using algorithm implemented in package phangorn. See ??DENDRO.tree for more details.

DENDRO.tree(demo_cluster$Z)

In this phylogenetic tree, Cluster1 has greater genetic divergence compared with Cluster2 and Cluster3, which is consistent with our data generating process.

3.4.2 Other analysis

User could further perform joint differential expression analysis and differential mutation analysis between different subclone groups. Mutation profile across clones is sored at demo_cluster$Z.

Differential expression analysis packages are wide-spread. Two methods that I personally preferred are Seurat MAST implementation (Butler et al. 2018) and scDD (Korthauer et al. 2016).

Gene set enrichment analysis is available at MSigDB, Broad Institute (Subramanian et al. 2005).

4. Session info

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] DENDRO_0.1.1      ggplot2_3.1.1     mclust_5.4.3     
#>  [4] phyclust_0.1-24   colorspace_1.4-1  svMisc_1.1.0     
#>  [7] phangorn_2.5.3    ape_5.3           dendextend_1.10.0
#> [10] TailRank_3.2.1    oompaBase_3.2.6  
#> 
#> loaded via a namespace (and not attached):
#>  [1] modeltools_0.2-22   tidyselect_0.2.5    xfun_0.5           
#>  [4] purrr_0.3.2         kernlab_0.9-27      lattice_0.20-38    
#>  [7] oompaData_3.1.1     htmltools_0.3.6     stats4_3.5.3       
#> [10] viridisLite_0.3.0   yaml_2.2.0          rlang_0.4.2        
#> [13] pillar_1.3.1        withr_2.1.2         glue_1.3.1         
#> [16] prabclus_2.2-7      BiocGenerics_0.28.0 fpc_2.1-11.1       
#> [19] plyr_1.8.4          robustbase_0.93-4   stringr_1.4.0      
#> [22] munsell_0.5.0       gtable_0.3.0        mvtnorm_1.0-10     
#> [25] evaluate_0.13       Biobase_2.42.0      knitr_1.22         
#> [28] flexmix_2.3-15      parallel_3.5.3      class_7.3-15       
#> [31] DEoptimR_1.0-8      trimcluster_0.1-2.1 Rcpp_1.0.1         
#> [34] scales_1.0.0        diptest_0.75-7      fastmatch_1.1-0    
#> [37] gridExtra_2.3       digest_0.6.18       stringi_1.4.3      
#> [40] dplyr_0.8.0.1       grid_3.5.3          quadprog_1.5-5     
#> [43] tools_3.5.3         magrittr_1.5        lazyeval_0.2.2     
#> [46] tibble_2.1.1        cluster_2.0.7-1     crayon_1.3.4       
#> [49] whisker_0.3-2       pkgconfig_2.0.2     Matrix_1.2-15      
#> [52] MASS_7.3-51.1       assertthat_0.2.1    rmarkdown_1.12     
#> [55] viridis_0.5.1       R6_2.4.0            igraph_1.2.4       
#> [58] nlme_3.1-137        nnet_7.3-12         compiler_3.5.3

5. References

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology. doi:10.1038/nbt.4096.

Korthauer, Keegan D., Li-Fang Chu, Michael A. Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. 2016. “A Statistical Approach for Identifying Differential Distributions in Single-Cell Rna-Seq Experiments.” Genome Biology 17 (1): 222. doi:10.1186/s13059-016-1077-y.

Li, Bingshan, Wei Chen, Xiaowei Zhan, Fabio Busonero, Serena Sanna, Carlo Sidore, Francesco Cucca, Hyun M Kang, and Gonçalo R Abecasis. 2012. “A Likelihood-Based Framework for Variant Calling and de Novo Mutation Detection in Families.” PLoS Genetics 8 (10). Public Library of Science: e1002944.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43). National Academy of Sciences: 15545–50. doi:10.1073/pnas.0506580102.