Applying RCTD to Spatial Transcriptomics Data

library(spacexr)
library(Matrix)

Introduction

Robust Cell Type Decomposition, or RCTD, is an statistical method for learning cell types from spatial transcriptomics data. In this Vignette, we will assign cell types to a cerebellum Slide-seq dataset. We will define cell type profiles using an annotated single nucleus RNA-sequencing (snRNA-seq) cerebellum dataset. Please note that the reference could also be a single-cell dataset or a cell type-specific bulk RNA-seq dataset.

Data Preprocessing

Let’s begin by loading in the data to be usable for RCTD.

Single-Cell Reference

In order to run RCTD, the first step is to process the single cell reference. The reference is created using the RCTD Reference constructor function, which requires 3 parameters:
1. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/cell names. Counts should be untransformed count-level data.
2. cell_types:
A named (by cell barcode) factor of cell type for each cell. The ‘levels’ of the factor would be the possible cell type identities.
3. nUMI:
Optional, a named (by cell barcode) list of total counts or UMI’s appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel.

The reference may come from various data types, but it needs to be loaded into the R envirnoment. Here, we emphasize that our protocol for loading in the counts, cell_types, and nUMI obejects does not need to be the way that you load your objects into R. In this case our reference is stored in the ‘Reference/Vignette’ folder as two csv files:
1. meta_data.csv: a CSV file (with 3 columns, with headers “barcode”, “cluster”, and “nUMI”) containing the nUMI and cell type assignment for each cell.

2. dge.csv: a Digital Gene Expression (DGE) (barcodes by gene counts) CSV file in the standard 10x format.
We use the Reference constructor function:

### Load in/preprocess your data, this might vary based on your file type
refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference
counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI)
cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list

### Create the Reference object
reference <- Reference(counts, cell_types, nUMI)
#> Warning in Reference(counts, cell_types, nUMI): Reference: nUMI does not match
#> colSums of counts. If this is unintended, please correct this discrepancy. If
#> this is intended, there is no problem.

## Examine reference object (optional)
print(dim(reference@counts)) #observe Digital Gene Expression matrix
#> [1] 384 475
table(reference@cell_types) #number of occurences for each cell type
#> 
#>  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 
#> 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25

## Save RDS object (optional)
saveRDS(reference, file.path(refdir,'SCRef.rds'))

Spatial Transcriptomics data

Next, we will load the Spatial Transcriptomics data into a SpatialRNA object. Similarly to the reference, we will first load the data into R, and then pass the data into the RCTD SpatialRNA constructor function, which requires 3 parameters:
1. coords: A numeric data.frame (or matrix) representing the spatial pixel locations. rownames are barcodes/pixel names, and there should be two columns for ‘x’ and for ‘y’.
2. counts: A matrix (or dgCmatrix) representing Digital Gene Expression (DGE). Rownames should be genes and colnames represent barcodes/pixel names. Counts should be untransformed count-level data.
3. nUMI:
Optional, a named (by pixel barcode) list of total counts or UMI’s appearing at each pixel. If not provided, nUMI will be assumed to be the total counts appearing on each pixel.

These elements can be constructed differently, depending on how you store your spatial transcriptomic data. In our case (one of many ways to do it), our (Slide-seq) data is stored in the ‘SpatialRNA/Vignette’ file as two csv files:
1. BeadLocationsForR.csv: a CSV file (with 3 columns, with headers “barcodes”, “xcoord”, and “ycoord”) containing the spatial locations of the pixels.
2. MappedDGEForR.csv: a DGE (gene counts by barcodes) CSV file. Represents raw counts at each pixel.

datadir <- system.file("extdata",'SpatialRNA/Vignette',package = 'spacexr') # directory for sample Slide-seq dataset
counts <- read.csv(file.path(datadir,"MappedDGEForR.csv")) # load in counts matrix
coords <- read.csv(file.path(datadir,"BeadLocationsForR.csv"))
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
rownames(coords) <- coords$barcodes; coords$barcodes <- NULL # Move barcodes to rownames
nUMI <- colSums(counts) # In this case, total counts per pixel is nUMI

### Create SpatialRNA object
puck <- SpatialRNA(coords, counts, nUMI)

## Examine SpatialRNA object (optional)
print(dim(puck@counts)) # observe Digital Gene Expression matrix
hist(log(puck@nUMI,2)) # histogram of log_2 nUMI

print(head(puck@coords)) # start of coordinate data.frame
barcodes <- colnames(puck@counts) # pixels to be used (a list of barcode names). 

# This list can be restricted if you want to crop the puck e.g. 
# puck <- restrict_puck(puck, barcodes) provides a basic plot of the nUMI of each pixel
# on the plot:
plot_puck_continuous(puck, barcodes, puck@nUMI, ylimit = c(0,round(quantile(puck@nUMI,0.9))), 
                     title ='plot of nUMI')

The RDS file ‘puck.RDS’ saves the ‘SpatialRNA’ file we have created, and from now on it can be loaded in by the init_RCTD function. ## Running RCTD

Creating RCTD Object

We are now ready to create an RCTD object using the create.RCTD function. We simply need to pass in the SpatialRNA and scRNA-seq objects. There are several configuration options that can be set with this function:

max_cores: for parallel processing, the number of cores used. If set to 1, parallel processing is not used. The system will additionally be checked for number of available cores. Note, that we recommend setting max_cores to at least 4 or 8 to improve efficiency.
gene_cutoff, fc_cutoff, gene_cutoff_reg, fc_cutoff_reg: are used for differentially expressed gene selection, with gene_cutoff filtering for average expression and fc_cutoff filtering for log-fold-change across cell types.
UMI_min, UMI_max: are the minimum and maximum read depth for pixels in the SpatialRNA dataset.

Running RCTD

Now, we are ready to run RCTD, using the run.RCTD function. This function is equivalent to sequentially running the functions fitBulk, choose_sigma_c, and fitPixels. The doublet_mode argument sets whether RCTD will be run in ‘doublet mode’ (at most 1-2 cell types per pixel), ‘full mode’ (no restrictions on number of cell types), or ‘multi mode’ (finitely many cell types per pixel, e.g. 3 or 4).

myRCTD <- create.RCTD(puck, reference, max_cores = 1)
#> Begin: process_cell_type_info
#> process_cell_type_info: number of cells in reference: 475
#> process_cell_type_info: number of genes in reference: 384
#> 25252525252525252525252525252525252525
#> End: process_cell_type_info
#> create.RCTD: getting regression differentially expressed genes:
#> get_de_genes: 1 found DE genes: 35
#> get_de_genes: 2 found DE genes: 38
#> get_de_genes: 3 found DE genes: 57
#> get_de_genes: 4 found DE genes: 18
#> get_de_genes: 5 found DE genes: 26
#> get_de_genes: 6 found DE genes: 16
#> get_de_genes: 7 found DE genes: 20
#> get_de_genes: 8 found DE genes: 46
#> get_de_genes: 9 found DE genes: 59
#> get_de_genes: 10 found DE genes: 51
#> get_de_genes: 11 found DE genes: 44
#> get_de_genes: 12 found DE genes: 19
#> get_de_genes: 13 found DE genes: 27
#> get_de_genes: 14 found DE genes: 55
#> get_de_genes: 15 found DE genes: 51
#> get_de_genes: 16 found DE genes: 57
#> get_de_genes: 17 found DE genes: 50
#> get_de_genes: 18 found DE genes: 75
#> get_de_genes: 19 found DE genes: 56
#> get_de_genes: total DE genes: 313
#> create.RCTD: getting platform effect normalization differentially expressed genes:
#> get_de_genes: 1 found DE genes: 58
#> get_de_genes: 2 found DE genes: 61
#> get_de_genes: 3 found DE genes: 85
#> get_de_genes: 4 found DE genes: 33
#> get_de_genes: 5 found DE genes: 35
#> get_de_genes: 6 found DE genes: 29
#> get_de_genes: 7 found DE genes: 28
#> get_de_genes: 8 found DE genes: 89
#> get_de_genes: 9 found DE genes: 82
#> get_de_genes: 10 found DE genes: 78
#> get_de_genes: 11 found DE genes: 79
#> get_de_genes: 12 found DE genes: 26
#> get_de_genes: 13 found DE genes: 38
#> get_de_genes: 14 found DE genes: 87
#> get_de_genes: 15 found DE genes: 79
#> get_de_genes: 16 found DE genes: 64
#> get_de_genes: 17 found DE genes: 59
#> get_de_genes: 18 found DE genes: 93
#> get_de_genes: 19 found DE genes: 89
#> get_de_genes: total DE genes: 340
myRCTD <- run.RCTD(myRCTD, doublet_mode = 'doublet')
#> fitBulk: decomposing bulk
#> chooseSigma: using initial Q_mat with sigma =  1
#> Likelihood value: 1970.42788150886
#> Sigma value:  0.84
#> Likelihood value: 1964.49817810465
#> Sigma value:  0.84

RCTD results

The results of RCTD are located in the @results field. Of particular interest is @results$weights, a data frame of cell type weights for each pixel (for full mode). This section will generate various plots which can be found in resultsdir. The results of ‘doublet_mode’ are stored in @results$results_df and @results$weights_doublet, the weights of each cell type. More specifically, the results_df object contains one column per pixel (barcodes as rownames). Important columns are:

spot_class, a factor variable representing RCTD’s classification in doublet mode: “singlet” (1 cell type on pixel), “doublet_certain” (2 cell types on pixel), “doublet_uncertain” (2 cell types on pixel, but only confident of 1), “reject” (no prediction given for pixel).
Next, the first_type column gives the first cell type predicted on the bead (for all spot_class conditions except “reject”).
The second_type column gives the second cell type predicted on the bead for doublet spot_class conditions (not a confident prediction for “doublet_uncertain”).

Note that in multi-mode, results consists of a list of results for each pixel, which contains all_weights (weights from full-mode), cell_type_list (cell types on multi-mode), conf_list (which cell types are confident on multi-mode) and sub_weights (proportions of cell types on multi-mode).

Note, some of the plots are not displayed here, but rather saved as pdf files in the ‘RCTD_Plots’ directory.

results <- myRCTD@results
# normalize the cell type proportions to sum to 1.
norm_weights = normalize_weights(results$weights) 
cell_type_names <- myRCTD@cell_type_info$info[[2]] #list of cell type names
spatialRNA <- myRCTD@spatialRNA
resultsdir <- 'RCTD_Plots' ## you may change this to a more accessible directory on your computer.
dir.create(resultsdir)
#> Warning in dir.create(resultsdir): 'RCTD_Plots' already exists

# make the plots 
# Plots the confident weights for each cell type as in full_mode (saved as 
# 'results/cell_type_weights_unthreshold.pdf')
plot_weights(cell_type_names, spatialRNA, resultsdir, norm_weights) 
# Plots all weights for each cell type as in full_mode. (saved as 
# 'results/cell_type_weights.pdf')
plot_weights_unthreshold(cell_type_names, spatialRNA, resultsdir, norm_weights) 
# Plots the weights for each cell type as in doublet_mode. (saved as 
# 'results/cell_type_weights_doublets.pdf')
plot_weights_doublet(cell_type_names, spatialRNA, resultsdir, results$weights_doublet, 
                     results$results_df) 
# Plots the number of confident pixels of each cell type in 'full_mode'. (saved as 
# 'results/cell_type_occur.pdf')
plot_cond_occur(cell_type_names, resultsdir, norm_weights, spatialRNA)

# makes a map of all cell types, (saved as 
# 'results/all_cell_types.pdf')
plot_all_cell_types(results$results_df, spatialRNA@coords, cell_type_names, resultsdir)




# doublets
#obtain a dataframe of only doublets
doublets <- results$results_df[results$results_df$spot_class == "doublet_certain",] 
# Plots all doublets in space (saved as 
# 'results/all_doublets.pdf')
plot_doublets(spatialRNA, doublets, resultsdir, cell_type_names)

# Plots all doublets in space for each cell type (saved as 
# 'results/all_doublets_type.pdf')
plot_doublets_type(spatialRNA, doublets, resultsdir, cell_type_names) 
# a table of frequency of doublet pairs 
doub_occur <- table(doublets$second_type, doublets$first_type) 
# Plots a stacked bar plot of doublet ocurrences (saved as 
# 'results/doublet_stacked_bar.pdf')

plot_doub_occur_stack(doub_occur, resultsdir, cell_type_names)