Abstract
While single cell RNA sequencing (scRNA-seq) is invaluable for studying cell populations, cell-surface proteins are often integral markers of cellular function and serve as primary targets for therapeutic intervention. Here we propose a transfer learning framework, single cell Transcriptome to Protein prediction with deep neural network (cTP-net), to impute surface protein abundances from scRNA-seq data by learning from existing single-cell multi-omic resources. For more detail, see our bioRxiv preprint. See also the list of surface proteins that cTP-net can currently predict.
First, install the supporting Python package ctpnetpy. See the source code of the package here
pip install cTPnet
If you run into problems with PyTorch, refer to the PyTorch website for details.
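Before moving on to the R side, it can help to confirm that the Python pieces are visible to the interpreter you plan to use. The sketch below is only an illustration: the helper `missing_modules` is hypothetical, and the import name `ctpnetpy` is an assumption based on the package name above (`torch` is the import name of PyTorch).

```python
import importlib.util


def missing_modules(names):
    """Return the names in `names` that cannot be located as importable modules."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for a missing parent package or a module without a spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names: "ctpnetpy" (supporting package) and "torch" (PyTorch).
    print("Missing modules:", missing_modules(["ctpnetpy", "torch"]))
```

An empty list means both modules were found. Run the check with the same Python environment that reticulate will later be pointed at.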
Next, open R and install the R package cTPnet
devtools::install_github("zhouzilu/cTPnet")
Download the pretrained model from weights.
In addition, if you want to denoise your raw scRNA-seq counts, follow the SAVER-X installation steps below, adapted from https://github.com/jingshuw/SAVERX
Install the supporting Python package sctransfer.
pip install sctransfer
Install the R package.
devtools::install_github("jingshuw/SAVERX")
Download the pretrained model from weights.
Currently, SAVER-X does not support very large data sets (a test with 270,000 cells failed on a machine with 200 GB of RAM). cTP-net, on the other hand, can predict surface protein abundance relatively accurately without denoising.
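To get a rough feeling for the scale involved, the sketch below estimates the size of a single dense genes-by-cells matrix. It is a back-of-the-envelope calculation, not a profile of SAVER-X: the gene count of 17009 is an assumption borrowed from the demo data set used later in this vignette, 8 bytes per value assumes double precision, and real pipelines typically hold several copies or intermediate objects at once.

```python
def dense_matrix_gib(n_genes, n_cells, bytes_per_value=8):
    """Size in GiB of one dense n_genes x n_cells matrix."""
    return n_genes * n_cells * bytes_per_value / 2**30


if __name__ == "__main__":
    # Assumption: 17009 genes (as in the demo data) and 270,000 cells.
    size = dense_matrix_gib(17009, 270_000)
    print(f"{size:.1f} GiB per dense double-precision copy")
```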
If you have any questions or problems when using cTPnet or ctpnetpy, please feel free to open a new issue here. You can also email the maintainers of the corresponding packages:
Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn
To accurately impute surface protein abundance from scRNA-seq data, cTP-net employs two steps: (1) denoising of the scRNA-seq count matrix and (2) imputation based on the denoised data through a transcriptome-protein mapping (Figure 1). The initial denoising, by SAVER-X, produces more accurate estimates of the RNA transcript relative abundances for each cell. Compared to the raw counts, the denoised relative expression values have significantly improved correlation with their cognate protein measurements.
Figure 1. (a) Overview of cTP-net analysis pipeline, which learns a mapping from the denoised scRNA-seq data to the relative abundance of surface proteins, capturing multi-gene features that reflect the cellular environment and related processes. (b) For three example proteins, cross-cell scatter and correlation of CITE-seq measured abundances vs. (1) raw RNA count, (2) SAVER-X denoised RNA level, and (3) cTP-net predicted protein abundance.
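The cross-cell correlations in panel (b) can be reproduced for any protein once you have one value per cell for the protein and for the RNA-level quantity you want to compare it with. The sketch below assumes a Pearson correlation, which the caption does not specify, and the numbers in the usage part are made-up toy values for five cells, not measurements.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values for five cells (illustrative only, not real data).
    protein = [0.2, 1.5, 0.9, 2.8, 0.1]
    raw_rna = [0, 1, 0, 2, 0]
    denoised_rna = [0.1, 0.8, 0.5, 1.6, 0.1]
    print("raw vs. protein:     ", round(pearson(raw_rna, protein), 3))
    print("denoised vs. protein:", round(pearson(denoised_rna, protein), 3))
```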
Please refer to the SAVER-X package for detailed instructions. For this vignette, we load a demo data set (17009 genes \(\times\) 2000 cells) of bone marrow mononuclear cells that has already been denoised with SAVER-X.
library(cTPnet)
library(Seurat)
library(reticulate)
# Set python path and virtual environment using reticulate
use_virtualenv("C:/Users/zhouzilu/Documents/test_ctpnet")
# The line above must be called right after loading the reticulate library!
data("cTPnet_demo")
head(demo_data[,1:6])
#> a_GTTACAGCAGTCGTGC.1 b_TGGCTGGAGTCAAGGC.1
#> FO538757.2 0.085 0.085
#> AP006222.2 0.019 0.019
#> RP4-669L17.10 0.001 0.001
#> RP11-206L10.9 0.017 0.017
#> LINC00115 0.003 0.004
#> FAM41C 0.011 0.009
#> b_TGAGCATAGTGAAGAG.1 b_ACGGCCAAGATCTGAA.1
#> FO538757.2 0.086 0.086
#> AP006222.2 0.019 0.019
#> RP4-669L17.10 0.001 0.001
#> RP11-206L10.9 0.018 0.018
#> LINC00115 0.004 0.003
#> FAM41C 0.009 0.008
#> b_CGTGAGCCAGTATGCT.1 b_GGGACCTGTTGCGCAC.1
#> FO538757.2 0.086 0.085
#> AP006222.2 0.019 0.019
#> RP4-669L17.10 0.001 0.001
#> RP11-206L10.9 0.018 0.017
#> LINC00115 0.006 0.006
#> FAM41C 0.021 0.011
Let’s create a Seurat object demo and generate the prediction.
model_file_path="C:/Users/zhouzilu/Documents/cTPnet_weight_24"
data_type='Seurat3'
demo = CreateSeuratObject(demo_data)
demo = cTPnet(demo,data_type,model_file_path)
#> Start data preprocessing...
#> Start imputation. Running python ...
#> Postprocess...
#> Done!
# standard log-normalization
demo <- NormalizeData(demo, verbose = FALSE)
# select variable features (Seurat 3 default: 2000 features)
demo <- FindVariableFeatures(demo, verbose = FALSE)
# standard scaling (no regression)
demo <- ScaleData(demo, verbose = FALSE)
# Run PCA, then use the first 25 PCs for tSNE visualization and graph-based clustering
demo <- RunPCA(demo, verbose = FALSE)
ElbowPlot(demo, ndims = 25)
demo <- FindNeighbors(demo, dims = 1:25, k.param = 20)
demo <- FindClusters(demo, resolution = 0.8)
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#>
#> Number of nodes: 2000
#> Number of edges: 63646
#>
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8842
#> Number of communities: 15
#> Elapsed time: 0 seconds
demo <- RunTSNE(demo, dims = 1:25, method = "FIt-SNE", max_iter=2000)
DimPlot(demo, label = TRUE, pt.size = 0.5)
FeaturePlot(demo, features = c(
"ctpnet_CD34", "ctpnet_CD4", "ctpnet_CD8",
"CD34", "CD4", "CD8A",
"ctpnet_CD16", "ctpnet_CD11c", "ctpnet_CD19",
"FCGR3A",'ITGAX','CD19',
"ctpnet_CD45RA", "ctpnet_CD45RO", "ctpnet_CD27",
"PTPRC",'PTPRC','CD27'
), min.cutoff = "q25", max.cutoff = "q95", ncol = 3, pt.size=0.5)
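The min.cutoff = "q25" and max.cutoff = "q95" arguments in the FeaturePlot call above limit the color scale to the range between the 25th and 95th percentiles of each feature, so that a few extreme cells do not dominate the plot. The sketch below illustrates that idea outside of Seurat. The helper names are hypothetical, and the quantile definition (linear interpolation, as in R's default `quantile`) is an assumption about how the cutoffs are computed.

```python
import math


def quantile(values, prob):
    """Quantile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= prob <= 1:
        raise ValueError("prob must be between 0 and 1")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * prob
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def clip_to_quantiles(values, q_min=0.25, q_max=0.95):
    """Clamp every value to the interval [quantile(q_min), quantile(q_max)]."""
    low = quantile(values, q_min)
    high = quantile(values, q_max)
    return [min(max(v, low), high) for v in values]


if __name__ == "__main__":
    # Toy expression values; the last one is an outlier that gets clamped.
    print(clip_to_quantiles([0.0, 0.1, 0.2, 0.3, 0.4, 9.0]))
```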
The cell type information can be easily determined by canonical immunophenotypes (i.e. surface protein markers).
# CD4 and CD8 are markers for CD4 T cells and CD8 T cells
# CD45RA and CD45RO are markers for naive T cells and differentiated T cells
# CD19 is a marker for B cells
# CD27 is a marker for memory B cells
# CD16 is a marker for NK cells
# CD34 is a marker for developing precursor cells
# CD11c is a marker for traditional monocytes
new.cluster.ids <- c("Mono","naive CD4/CD8 T", "Mono", "CD8 T", "naive CD4 T", "CD4 T", "naive CD8 T", "Pre.", "B", "NK", "memory B", "Pre.", "Unknown", "CD16+ Mono", "Unknown")
names(new.cluster.ids) <- levels(demo)
demo <- RenameIdents(demo, new.cluster.ids)
DimPlot(demo, label = TRUE, pt.size = 0.5)
RidgePlot(demo, features = c("ctpnet_CD3", "ctpnet_CD11c", "ctpnet_CD8", "ctpnet_CD16"), ncol = 2)
#> Picking joint bandwidth of 0.0677
#> Picking joint bandwidth of 0.0605
#> Picking joint bandwidth of 0.162
#> Picking joint bandwidth of 0.0577
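The relabeling step above pairs each cluster level with a cell-type name by position, and clusters that receive the same name (for example "Mono") end up merged under one identity. The sketch below shows that positional mapping on a toy example. The function `rename_clusters` is hypothetical and only mirrors the idea, not Seurat's implementation.

```python
from collections import Counter


def rename_clusters(cell_clusters, levels, new_names):
    """Map each cell's cluster id to a label, pairing levels and names by position."""
    if len(levels) != len(new_names):
        raise ValueError("levels and new_names must have the same length")
    mapping = dict(zip(levels, new_names))
    return [mapping[cluster] for cluster in cell_clusters]


if __name__ == "__main__":
    # Toy example: clusters "0" and "2" share a label and are merged.
    levels = ["0", "1", "2"]
    names = ["Mono", "naive CD4/CD8 T", "Mono"]
    cells = ["0", "2", "1", "0"]
    labels = rename_clusters(cells, levels, names)
    print(labels)
    print(Counter(labels))
```

Because the pairing is positional, the vector of names must contain exactly one entry per cluster level, in the same order as the levels.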
sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cTPnet_1.0.2 Seurat_3.1.1 reticulate_1.13
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-137 tsne_0.1-3 bitops_1.0-6
#> [4] RcppAnnoy_0.0.13 RColorBrewer_1.1-2 httr_1.4.0
#> [7] sctransform_0.2.0 tools_3.5.3 R6_2.4.0
#> [10] irlba_2.3.3 KernSmooth_2.23-15 uwot_0.1.4
#> [13] lazyeval_0.2.2 colorspace_1.4-1 npsurv_0.4-0
#> [16] tidyselect_0.2.5 gridExtra_2.3 compiler_3.5.3
#> [19] plotly_4.8.0 labeling_0.3 caTools_1.17.1.2
#> [22] scales_1.0.0 lmtest_0.9-36 ggridges_0.5.1
#> [25] pbapply_1.4-0 stringr_1.4.0 digest_0.6.18
#> [28] rmarkdown_1.12 R.utils_2.8.0 pkgconfig_2.0.2
#> [31] htmltools_0.3.6 bibtex_0.4.2 htmlwidgets_1.3
#> [34] rlang_0.3.4 zoo_1.8-4 jsonlite_1.6
#> [37] ica_1.0-2 gtools_3.8.1 dplyr_0.8.0.1
#> [40] R.oo_1.22.0 magrittr_1.5 Matrix_1.2-15
#> [43] Rcpp_1.0.1 munsell_0.5.0 ape_5.3
#> [46] R.methodsS3_1.7.1 stringi_1.4.3 yaml_2.2.0
#> [49] gbRd_0.4-11 MASS_7.3-51.1 gplots_3.0.1.1
#> [52] Rtsne_0.15 plyr_1.8.4 grid_3.5.3
#> [55] parallel_3.5.3 gdata_2.18.0 listenv_0.7.0
#> [58] ggrepel_0.8.1 crayon_1.3.4 lattice_0.20-38
#> [61] cowplot_0.9.4 splines_3.5.3 SDMTools_1.1-221
#> [64] knitr_1.22 pillar_1.3.1 igraph_1.2.4
#> [67] future.apply_1.2.0 reshape2_1.4.3 codetools_0.2-16
#> [70] leiden_0.3.1 glue_1.3.1 evaluate_0.13
#> [73] lsei_1.2-0 metap_1.1 RcppParallel_4.4.4
#> [76] data.table_1.12.0 png_0.1-7 Rdpack_0.10-1
#> [79] gtable_0.3.0 RANN_2.6.1 purrr_0.3.2
#> [82] tidyr_0.8.3 future_1.13.0 assertthat_0.2.1
#> [85] ggplot2_3.1.1 xfun_0.5 rsvd_1.0.1
#> [88] survival_2.43-3 viridisLite_0.3.0 tibble_2.1.1
#> [91] cluster_2.0.7-1 globals_0.12.4 fitdistrplus_1.0-14
#> [94] ROCR_1.0-7
Surface protein imputation from single cell transcriptomes by deep neural networks
Zilu Zhou, Chengzhong Ye, Jingshu Wang, Nancy R. Zhang
bioRxiv 671180; doi: https://doi.org/10.1101/671180