cTP-net Vignette (Seurat v3)

Zilu Zhou

11/1/2019

Abstract

While single cell RNA sequencing (scRNA-seq) is invaluable for studying cell populations, cell-surface proteins are often integral markers of cellular function and serve as primary targets for therapeutic intervention. Here we propose a transfer learning framework, single cell Transcriptome to Protein prediction with deep neural network (cTP-net), to impute surface protein abundances from scRNA-seq data by learning from existing single-cell multi-omic resources. For more detail, please see our bioRxiv preprint (listed under References). See also the list of surface proteins that cTP-net can currently predict.

1. Installation

1.1 Install cTP-net

1.1.1 Support Python package

First, install the supporting Python package ctpnetpy, which is installed with pip under the name cTPnet. See the source code of the package here.

pip install cTPnet

If you run into problems installing PyTorch, refer to the PyTorch website for more details.
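Before moving on, it can help to confirm from R that the Python modules are visible to reticulate. The sketch below is a minimal check, not part of cTPnet: `check_python_modules` is a hypothetical helper, `torch` is the import name of PyTorch, and `ctpnetpy` is assumed (not confirmed here) to be the import name of the supporting package.

```r
# Hypothetical helper (not part of cTPnet): report which Python modules
# can be imported. `is_available` defaults to reticulate's own checker and
# is only evaluated when the helper is actually called.
check_python_modules <- function(modules,
                                 is_available = reticulate::py_module_available) {
  vapply(modules, is_available, logical(1))
}

# "ctpnetpy" as the import name is an assumption; adjust it if your install differs.
if (interactive() && requireNamespace("reticulate", quietly = TRUE)) {
  print(check_python_modules(c("torch", "ctpnetpy")))
}
```

A `FALSE` entry suggests that the active Python environment is not the one where `pip install cTPnet` was run.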

1.1.2 R package

Next, open R and install the R package cTPnet:

devtools::install_github("zhouzilu/cTPnet")

1.1.3 Pretrained model

Download the pretrained model from weights.

1.2 Install SAVER-X (Highly recommended)

In addition, if you want to denoise your raw scRNA-seq counts, please follow the SAVER-X installation steps below, which are adapted from https://github.com/jingshuw/SAVERX

1.2.1 Support Python package

Install the supporting Python package sctransfer.

pip install sctransfer

1.2.2 R package

Install the R package.

devtools::install_github("jingshuw/SAVERX")

1.2.3 Pretrained model

Download the pretrained model from weights.

Currently, SAVER-X does not support very large data sets (a test with 270,000 cells and 200GB of RAM failed). cTP-net, on the other hand, can predict surface protein abundance relatively accurately without denoising.

2. Questions & issues

If you have any questions or problems when using cTPnet or ctpnetpy, please feel free to open a new issue here. You can also email the maintainers of the corresponding packages:

Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn

3. cTP-net analysis pipeline

To accurately impute surface protein abundance from scRNA-seq data, cTP-net employs two steps: (1) denoising of the scRNA-seq count matrix and (2) imputation based on the denoised data through a transcriptome-protein mapping (Figure 1). The initial denoising, by SAVER-X, produces more accurate estimates of the relative abundance of RNA transcripts in each cell. Compared to the raw counts, the denoised relative expression values have significantly improved correlation with their cognate protein measurements.

Figure 1. (a) Overview of cTP-net analysis pipeline, which learns a mapping from the denoised scRNA-seq data to the relative abundance of surface proteins, capturing multi-gene features that reflect the cellular environment and related processes. (b) For three example proteins, cross-cell scatter and correlation of CITE-seq measured abundances vs. (1) raw RNA count, (2) SAVER-X denoised RNA level, and (3) cTP-net predicted protein abundance.

3.1 Raw counts denoising with SAVER-X

Please refer to the SAVER-X package for detailed instructions. For this vignette, we load a demo data set (17009 genes \(\times\) 2000 cells) of bone marrow mononuclear cells that has already been denoised with SAVER-X.

library(cTPnet)
library(Seurat)
library(reticulate)
# Set the Python path and virtual environment using reticulate
use_virtualenv("C:/Users/zhouzilu/Documents/test_ctpnet")
# The line above must be called right after loading the reticulate library!
data("cTPnet_demo")
head(demo_data[,1:6])
#>               a_GTTACAGCAGTCGTGC.1 b_TGGCTGGAGTCAAGGC.1
#> FO538757.2                   0.085                0.085
#> AP006222.2                   0.019                0.019
#> RP4-669L17.10                0.001                0.001
#> RP11-206L10.9                0.017                0.017
#> LINC00115                    0.003                0.004
#> FAM41C                       0.011                0.009
#>               b_TGAGCATAGTGAAGAG.1 b_ACGGCCAAGATCTGAA.1
#> FO538757.2                   0.086                0.086
#> AP006222.2                   0.019                0.019
#> RP4-669L17.10                0.001                0.001
#> RP11-206L10.9                0.018                0.018
#> LINC00115                    0.004                0.003
#> FAM41C                       0.009                0.008
#>               b_CGTGAGCCAGTATGCT.1 b_GGGACCTGTTGCGCAC.1
#> FO538757.2                   0.086                0.085
#> AP006222.2                   0.019                0.019
#> RP4-669L17.10                0.001                0.001
#> RP11-206L10.9                0.018                0.017
#> LINC00115                    0.006                0.006
#> FAM41C                       0.021                0.011
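When substituting your own data for the demo matrix, a quick structural check can catch common problems before the prediction step. The sketch below assumes the layout shown above (genes in rows, cells in columns, non-negative expression values). `check_expression_matrix` is a hypothetical helper, not a cTPnet function, and the toy matrix reuses a few values from the printed output.

```r
# Hypothetical helper (not part of cTPnet): list structural problems in a
# genes x cells expression matrix. An empty result means no problem was found.
check_expression_matrix <- function(mat) {
  mat <- as.matrix(mat)
  problems <- character(0)
  if (!is.numeric(mat)) {
    return("values are not numeric")
  }
  if (is.null(rownames(mat))) problems <- c(problems, "missing gene names (rownames)")
  if (is.null(colnames(mat))) problems <- c(problems, "missing cell barcodes (colnames)")
  if (anyDuplicated(rownames(mat)) > 0) problems <- c(problems, "duplicated gene names")
  if (anyNA(mat)) problems <- c(problems, "missing values")
  if (any(mat < 0, na.rm = TRUE)) problems <- c(problems, "negative values")
  problems
}

# Toy matrix built from three genes and two cells of the demo output above
toy <- matrix(
  c(0.085, 0.019, 0.003,
    0.085, 0.019, 0.004),
  nrow = 3,
  dimnames = list(
    c("FO538757.2", "AP006222.2", "LINC00115"),
    c("a_GTTACAGCAGTCGTGC.1", "b_TGGCTGGAGTCAAGGC.1")
  )
)
check_expression_matrix(toy)
```

The same call can be applied to `demo_data` or to any matrix you plan to pass to `CreateSeuratObject`.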

3.2 Immunophenotype (surface protein) imputation

3.2.1 Seurat v3 pipeline

Let’s create a Seurat object, demo, and generate the prediction.

model_file_path="C:/Users/zhouzilu/Documents/cTPnet_weight_24"
data_type='Seurat3'
demo = CreateSeuratObject(demo_data)
demo = cTPnet(demo,data_type,model_file_path)
#> Start data preprocessing...
#> Start imputation. Running python ...
#> Postprocess...
#> Done!

3.2.2 Downstream analysis (modified from Seurat v3.0)

# standard log-normalization (Seurat v3 uses `verbose`, not the v2 `display.progress`)
demo <- NormalizeData(demo, verbose = FALSE)
# select highly variable features with the Seurat v3 defaults
demo <- FindVariableFeatures(demo, verbose = FALSE)

# standard scaling (no regression)
demo <- ScaleData(demo, verbose = FALSE)

# Run PCA, then use the first 25 PCs for tSNE visualization and graph-based clustering
demo <- RunPCA(demo, verbose = FALSE)
ElbowPlot(demo, ndims = 25)


demo <- FindNeighbors(demo, dims = 1:25, k.param = 20)
demo <- FindClusters(demo, resolution = 0.8)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 2000
#> Number of edges: 63646
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8842
#> Number of communities: 15
#> Elapsed time: 0 seconds
demo <- RunTSNE(demo, dims = 1:25, method = "FIt-SNE", max_iter=2000)

DimPlot(demo, label = TRUE, pt.size = 0.5)

3.2.3 Visualize imputed protein levels on RNA clusters

FeaturePlot(demo, features = c(
  "ctpnet_CD34", "ctpnet_CD4", "ctpnet_CD8", 
  "CD34", "CD4", "CD8A",
  "ctpnet_CD16", "ctpnet_CD11c", "ctpnet_CD19", 
  "FCGR3A",'ITGAX','CD19',
  "ctpnet_CD45RA", "ctpnet_CD45RO", "ctpnet_CD27", 
  "PTPRC",'PTPRC','CD27'
     ), min.cutoff = "q25", max.cutoff = "q95", ncol = 3, pt.size=0.5)
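The feature vector above alternates a row of three imputed proteins (prefixed with `ctpnet_`) with a row of their cognate genes, so that each protein sits directly above its RNA in the `ncol = 3` grid. As a sketch, the same vector can be built from a lookup table. The protein-gene pairs are taken from the call above, while `paired_features` is a hypothetical helper introduced only for illustration.

```r
# Protein -> cognate gene pairs, as used in the FeaturePlot call above
protein_to_gene <- c(
  CD34 = "CD34", CD4 = "CD4", CD8 = "CD8A",
  CD16 = "FCGR3A", CD11c = "ITGAX", CD19 = "CD19",
  CD45RA = "PTPRC", CD45RO = "PTPRC", CD27 = "CD27"
)

# Hypothetical helper: for every `per_row` proteins, emit the prefixed protein
# names followed by the matching gene names.
paired_features <- function(map, per_row = 3, prefix = "ctpnet_") {
  idx <- seq_along(map)
  groups <- split(idx, ceiling(idx / per_row))
  unlist(
    lapply(groups, function(i) c(paste0(prefix, names(map)[i]), unname(map[i]))),
    use.names = FALSE
  )
}

features <- paired_features(protein_to_gene)
features
```

The resulting `features` vector can be passed to `FeaturePlot(demo, features = features, ncol = 3)` in place of the hand-written list.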

3.2.4 Determine cell types with the help of imputed proteins

The cell type information can be easily determined by canonical immunophenotypes (i.e. surface protein markers).

# CD4 and CD8 are markers for CD4 T cells and CD8 T cells
# CD45RA and CD45RO are markers for naive T cells and differentiated T cells
# CD19 is a marker for B cells
# CD27 is a marker for memory B cells
# CD16 is a marker for NK cells
# CD34 is a marker for developing precursor cells
# CD11c is a marker for traditional monocytes
# One label per cluster, in cluster order (15 clusters from FindClusters above)
new.cluster.ids <- c(
  "Mono", "naive CD4/CD8 T", "Mono", "CD8 T", "naive CD4 T",
  "CD4 T", "naive CD8 T", "Pre.", "B", "NK",
  "memory B", "Pre.", "Unknown", "CD16+ Mono", "Unknown"
)
names(new.cluster.ids) <- levels(demo)
demo <- RenameIdents(demo, new.cluster.ids)
DimPlot(demo, label = TRUE, pt.size = 0.5)
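The renaming step assigns 15 labels to 15 clusters, and several clusters share a label, so the annotated object has fewer identities than clusters. The base R sketch below illustrates this bookkeeping without Seurat. The `clusters` factor is a toy stand-in for the cluster identities (one cell per cluster, numbered from 0 as `FindClusters` does), not the real `demo` object.

```r
# Labels in cluster order, copied from the annotation step above
new.cluster.ids <- c(
  "Mono", "naive CD4/CD8 T", "Mono", "CD8 T", "naive CD4 T",
  "CD4 T", "naive CD8 T", "Pre.", "B", "NK",
  "memory B", "Pre.", "Unknown", "CD16+ Mono", "Unknown"
)

# Toy stand-in for the cluster identities: one cell per cluster, labelled 0..14
clusters <- factor(0:14)

# The label vector must have exactly one entry per cluster level
stopifnot(length(new.cluster.ids) == nlevels(clusters))
names(new.cluster.ids) <- levels(clusters)

# Look up each cell's label by its cluster ID; shared labels collapse together
cell_types <- factor(unname(new.cluster.ids[as.character(clusters)]))
nlevels(cell_types)
table(cell_types)
```

Running the length check before `RenameIdents` is a cheap way to catch a missing or extra label when the number of clusters changes, for example after adjusting `resolution`.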

RidgePlot(demo, features = c("ctpnet_CD3", "ctpnet_CD11c", "ctpnet_CD8", "ctpnet_CD16"), ncol = 2)
#> Picking joint bandwidth of 0.0677
#> Picking joint bandwidth of 0.0605
#> Picking joint bandwidth of 0.162
#> Picking joint bandwidth of 0.0577

4. Session info

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cTPnet_1.0.2    Seurat_3.1.1    reticulate_1.13
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-137        tsne_0.1-3          bitops_1.0-6       
#>  [4] RcppAnnoy_0.0.13    RColorBrewer_1.1-2  httr_1.4.0         
#>  [7] sctransform_0.2.0   tools_3.5.3         R6_2.4.0           
#> [10] irlba_2.3.3         KernSmooth_2.23-15  uwot_0.1.4         
#> [13] lazyeval_0.2.2      colorspace_1.4-1    npsurv_0.4-0       
#> [16] tidyselect_0.2.5    gridExtra_2.3       compiler_3.5.3     
#> [19] plotly_4.8.0        labeling_0.3        caTools_1.17.1.2   
#> [22] scales_1.0.0        lmtest_0.9-36       ggridges_0.5.1     
#> [25] pbapply_1.4-0       stringr_1.4.0       digest_0.6.18      
#> [28] rmarkdown_1.12      R.utils_2.8.0       pkgconfig_2.0.2    
#> [31] htmltools_0.3.6     bibtex_0.4.2        htmlwidgets_1.3    
#> [34] rlang_0.3.4         zoo_1.8-4           jsonlite_1.6       
#> [37] ica_1.0-2           gtools_3.8.1        dplyr_0.8.0.1      
#> [40] R.oo_1.22.0         magrittr_1.5        Matrix_1.2-15      
#> [43] Rcpp_1.0.1          munsell_0.5.0       ape_5.3            
#> [46] R.methodsS3_1.7.1   stringi_1.4.3       yaml_2.2.0         
#> [49] gbRd_0.4-11         MASS_7.3-51.1       gplots_3.0.1.1     
#> [52] Rtsne_0.15          plyr_1.8.4          grid_3.5.3         
#> [55] parallel_3.5.3      gdata_2.18.0        listenv_0.7.0      
#> [58] ggrepel_0.8.1       crayon_1.3.4        lattice_0.20-38    
#> [61] cowplot_0.9.4       splines_3.5.3       SDMTools_1.1-221   
#> [64] knitr_1.22          pillar_1.3.1        igraph_1.2.4       
#> [67] future.apply_1.2.0  reshape2_1.4.3      codetools_0.2-16   
#> [70] leiden_0.3.1        glue_1.3.1          evaluate_0.13      
#> [73] lsei_1.2-0          metap_1.1           RcppParallel_4.4.4 
#> [76] data.table_1.12.0   png_0.1-7           Rdpack_0.10-1      
#> [79] gtable_0.3.0        RANN_2.6.1          purrr_0.3.2        
#> [82] tidyr_0.8.3         future_1.13.0       assertthat_0.2.1   
#> [85] ggplot2_3.1.1       xfun_0.5            rsvd_1.0.1         
#> [88] survival_2.43-3     viridisLite_0.3.0   tibble_2.1.1       
#> [91] cluster_2.0.7-1     globals_0.12.4      fitdistrplus_1.0-14
#> [94] ROCR_1.0-7

5. References

Surface protein imputation from single cell transcriptomes by deep neural networks

Zilu Zhou, Chengzhong Ye, Jingshu Wang, Nancy R. Zhang

bioRxiv 671180; doi: https://doi.org/10.1101/671180