Robust Cell Type Decomposition, or RCTD, is an statistical method for learning cell types from spatial transcriptomics data. In this Vignette, we will assign cell types to a cerebellum Slide-seq dataset. We will define cell type profiles using an annotated single nucleus RNA-sequencing (snRNA-seq) cerebellum dataset. Please note that the reference could also be a single-cell dataset or a cell type-specific bulk RNA-seq dataset.
Let’s begin by loading in the data to be usable for RCTD.
In order to run RCTD, the first step is to process the single cell reference. The reference is created using the RCTD
Reference constructor function, which requires 3 parameters:
1. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/cell names. Counts should be untransformed count-level data.
A named (by cell barcode) factor of cell type for each cell. The ‘levels’ of the factor would be the possible cell type identities.
Optional, a named (by cell barcode) list of total counts or UMI’s appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel.
The reference may come from various data types, but it needs to be loaded into the R envirnoment. Here, we emphasize that our protocol for loading in the counts, cell_types, and nUMI obejects does not need to be the way that you load your objects into R. In this case our reference is stored in the ‘Reference/Vignette’ folder as two csv files:
1. meta_data.csv: a CSV file (with 3 columns, with headers “barcode”, “cluster”, and “nUMI”) containing the nUMI and cell type assignment for each cell.
2. dge.csv: a Digital Gene Expression (DGE) (barcodes by gene counts) CSV file in the standard 10x format.
We use the
Reference constructor function:
### Load in/preprocess your data, this might vary based on your file type refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI) cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list cell_types <- as.factor(cell_types) # convert to factor data type nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list ### Create the Reference object reference <- Reference(counts, cell_types, nUMI) #> Warning in Reference(counts, cell_types, nUMI): Reference: nUMI does not match #> colSums of counts. If this is unintended, please correct this discrepancy. If #> this is intended, there is no problem. ## Examine reference object (optional) print(dim(reference@counts)) #observe Digital Gene Expression matrix #>  384 475 table(reference@cell_types) #number of occurences for each cell type #> #> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 #> 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 ## Save RDS object (optional) saveRDS(reference, file.path(refdir,'SCRef.rds'))
Next, we will load the Spatial Transcriptomics data into a
SpatialRNA object. Similarly to the reference, we will first load the data into R, and then pass the data into the RCTD
SpatialRNA constructor function, which requires 3 parameters:
1. coords: A numeric data.frame (or matrix) representing the spatial pixel locations. rownames are barcodes/pixel names, and there should be two columns for ‘x’ and for ‘y’.
2. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/pixel names. Counts should be untransformed count-level data.
Optional, a named (by pixel barcode) list of total counts or UMI’s appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel.
These elements can be constructed differently, depending on how you store your spatial transcriptomic data. In our case (one of many ways to do it), our (Slide-seq) data is stored in the ‘SpatialRNA/Vignette’ file as two csv files:
1. BeadLocationsForR.csv: a CSV file (with 3 columns, with headers “barcodes”, “xcoord”, and “ycoord”) containing the spatial locations of the pixels.
2. MappedDGEForR.csv: a DGE (gene counts by barcodes) CSV file. Represents raw counts at each pixel.
datadir <- system.file("extdata",'SpatialRNA/Vignette',package = 'spacexr') # directory for sample Slide-seq dataset counts <- read.csv(file.path(datadir,"MappedDGEForR.csv")) # load in counts matrix coords <- read.csv(file.path(datadir,"BeadLocationsForR.csv")) rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames rownames(coords) <- coords$barcodes; coords$barcodes <- NULL # Move barcodes to rownames nUMI <- colSums(counts) # In this case, total counts per pixel is nUMI ### Create SpatialRNA object puck <- SpatialRNA(coords, counts, nUMI) ## Examine SpatialRNA object (optional) print(dim(puck@counts)) # observe Digital Gene Expression matrix hist(log(puck@nUMI,2)) # histogram of log_2 nUMI
print(head(puck@coords)) # start of coordinate data.frame barcodes <- colnames(puck@counts) # pixels to be used (a list of barcode names). # This list can be restricted if you want to crop the puck e.g. # puck <- restrict_puck(puck, barcodes) provides a basic plot of the nUMI of each pixel # on the plot: plot_puck_continuous(puck, barcodes, puck@nUMI, ylimit = c(0,round(quantile(puck@nUMI,0.9))), title ='plot of nUMI')
The RDS file ‘puck.RDS’ saves the ‘SpatialRNA’ file we have created, and from now on it can be loaded in by the init_RCTD function. ## Running RCTD
We are now ready to create an
RCTD object using the
create.RCTD function. We simply need to pass in the
SpatialRNA and scRNA-seq objects. There are several configuration options that can be set with this function:
max_cores:for parallel processing, the number of cores used. If set to 1, parallel processing is not used. The system will additionally be checked for number of available cores. Note, that we recommend setting
max_coresto at least
8to improve efficiency.
gene_cutoff, fc_cutoff, gene_cutoff_reg, fc_cutoff_reg:are used for differentially expressed gene selection, with
gene_cutofffiltering for average expression and
fc_cutofffiltering for log-fold-change across cell types.
UMI_min, UMI_max:are the minimum and maximum read depth for pixels in the
Now, we are ready to run RCTD, using the
run.RCTD function. This function is equivalent to sequentially running the functions
doublet_mode argument sets whether RCTD will be run in ‘doublet mode’ (at most 1-2 cell types per pixel), ‘full mode’ (no restrictions on number of cell types), or ‘multi mode’ (finitely many cell types per pixel, e.g. 3 or 4).
myRCTD <- create.RCTD(puck, reference, max_cores = 1) #> Begin: process_cell_type_info #> process_cell_type_info: number of cells in reference: 475 #> process_cell_type_info: number of genes in reference: 384 #> 25252525252525252525252525252525252525 #> End: process_cell_type_info #> create.RCTD: getting regression differentially expressed genes: #> get_de_genes: 1 found DE genes: 35 #> get_de_genes: 2 found DE genes: 38 #> get_de_genes: 3 found DE genes: 57 #> get_de_genes: 4 found DE genes: 18 #> get_de_genes: 5 found DE genes: 26 #> get_de_genes: 6 found DE genes: 16 #> get_de_genes: 7 found DE genes: 20 #> get_de_genes: 8 found DE genes: 46 #> get_de_genes: 9 found DE genes: 59 #> get_de_genes: 10 found DE genes: 51 #> get_de_genes: 11 found DE genes: 44 #> get_de_genes: 12 found DE genes: 19 #> get_de_genes: 13 found DE genes: 27 #> get_de_genes: 14 found DE genes: 55 #> get_de_genes: 15 found DE genes: 51 #> get_de_genes: 16 found DE genes: 57 #> get_de_genes: 17 found DE genes: 50 #> get_de_genes: 18 found DE genes: 75 #> get_de_genes: 19 found DE genes: 56 #> get_de_genes: total DE genes: 313 #> create.RCTD: getting platform effect normalization differentially expressed genes: #> get_de_genes: 1 found DE genes: 58 #> get_de_genes: 2 found DE genes: 61 #> get_de_genes: 3 found DE genes: 85 #> get_de_genes: 4 found DE genes: 33 #> get_de_genes: 5 found DE genes: 35 #> get_de_genes: 6 found DE genes: 29 #> get_de_genes: 7 found DE genes: 28 #> get_de_genes: 8 found DE genes: 89 #> get_de_genes: 9 found DE genes: 82 #> get_de_genes: 10 found DE genes: 78 #> get_de_genes: 11 found DE genes: 79 #> get_de_genes: 12 found DE genes: 26 #> get_de_genes: 13 found DE genes: 38 #> get_de_genes: 14 found DE genes: 87 #> get_de_genes: 15 found DE genes: 79 #> get_de_genes: 16 found DE genes: 64 #> get_de_genes: 17 found DE genes: 59 #> get_de_genes: 18 found DE genes: 93 #> get_de_genes: 19 found DE genes: 89 #> get_de_genes: total DE genes: 340 myRCTD <- run.RCTD(myRCTD, doublet_mode = 'doublet') #> fitBulk: decomposing bulk #> chooseSigma: using initial Q_mat with sigma = 1 #> Likelihood value: 1970.42788150886 #> Sigma value: 0.84 #> Likelihood value: 1964.49817810465 #> Sigma value: 0.84
The results of RCTD are located in the
@results field. Of particular interest is
@results$weights, a data frame of cell type weights for each pixel (for full mode). This section will generate various plots which can be found in
resultsdir. The results of ‘doublet_mode’ are stored in
@results$weights_doublet, the weights of each cell type. More specifically, the
results_df object contains one column per pixel (barcodes as rownames). Important columns are:
spot_class, a factor variable representing RCTD’s classification in doublet mode: “singlet” (1 cell type on pixel), “doublet_certain” (2 cell types on pixel), “doublet_uncertain” (2 cell types on pixel, but only confident of 1), “reject” (no prediction given for pixel).
first_typecolumn gives the first cell type predicted on the bead (for all spot_class conditions except “reject”).
second_type columngives the second cell type predicted on the bead for doublet spot_class conditions (not a confident prediction for “doublet_uncertain”).
Note that in multi-mode, results consists of a list of results for each pixel, which contains
all_weights (weights from full-mode),
cell_type_list (cell types on multi-mode),
conf_list (which cell types are confident on multi-mode) and
sub_weights (proportions of cell types on multi-mode).
Note, some of the plots are not displayed here, but rather saved as pdf files in the ‘RCTD_Plots’ directory.
results <- myRCTD@results # normalize the cell type proportions to sum to 1. norm_weights = normalize_weights(results$weights) cell_type_names <- myRCTD@cell_type_info$info[] #list of cell type names spatialRNA <- myRCTD@spatialRNA resultsdir <- 'RCTD_Plots' ## you may change this to a more accessible directory on your computer. dir.create(resultsdir) #> Warning in dir.create(resultsdir): 'RCTD_Plots' already exists
# make the plots # Plots the confident weights for each cell type as in full_mode (saved as # 'results/cell_type_weights_unthreshold.pdf') plot_weights(cell_type_names, spatialRNA, resultsdir, norm_weights) # Plots all weights for each cell type as in full_mode. (saved as # 'results/cell_type_weights.pdf') plot_weights_unthreshold(cell_type_names, spatialRNA, resultsdir, norm_weights) # Plots the weights for each cell type as in doublet_mode. (saved as # 'results/cell_type_weights_doublets.pdf') plot_weights_doublet(cell_type_names, spatialRNA, resultsdir, results$weights_doublet, results$results_df) # Plots the number of confident pixels of each cell type in 'full_mode'. (saved as # 'results/cell_type_occur.pdf') plot_cond_occur(cell_type_names, resultsdir, norm_weights, spatialRNA)
# makes a map of all cell types, (saved as # 'results/all_cell_types.pdf') plot_all_cell_types(results$results_df, spatialRNA@coords, cell_type_names, resultsdir)
# doublets #obtain a dataframe of only doublets doublets <- results$results_df[results$results_df$spot_class == "doublet_certain",] # Plots all doublets in space (saved as # 'results/all_doublets.pdf') plot_doublets(spatialRNA, doublets, resultsdir, cell_type_names)
# Plots all doublets in space for each cell type (saved as # 'results/all_doublets_type.pdf') plot_doublets_type(spatialRNA, doublets, resultsdir, cell_type_names) # a table of frequency of doublet pairs doub_occur <- table(doublets$second_type, doublets$first_type) # Plots a stacked bar plot of doublet ocurrences (saved as # 'results/doublet_stacked_bar.pdf') plot_doub_occur_stack(doub_occur, resultsdir, cell_type_names)