library(spacexr)
library(Matrix)
library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

library(ggplot2)
datadir <- system.file("extdata",'SpatialRNA/Vignette',package = 'spacexr')# directory for sample Slide-seq dataset
datadir2 <- system.file("extdata",'SpatialRNA/Vignette_rep2',package = 'spacexr')
datadir3 <- system.file("extdata",'SpatialRNA/Vignette_rep3',package = 'spacexr')
datadir4 <- system.file("extdata",'SpatialRNA/Vignette_rep4',package = 'spacexr')
if(!dir.exists(datadir))
  dir.create(datadir)
savedir <- 'RCTD_results'
if(!dir.exists(savedir))
  dir.create(savedir)

Introduction

Cell type-Specific Inference of Differential Expression, or CSIDE, is part of the spacexr R package for learning cell type-specific differential expression from spatial transcriptomics data. In this Vignette, we will demonstrate running RCTD and CSIDE on multiple replicates (slices) of cerebellum Slide-seq toy data.

The ultimate goal will be to test for differential expression across two regions in the cerbellum. First, we will first use RCTD to assign cell types to a cerebellum Slide-seq dataset. We will define cell type profiles using an annotated single nucleus RNA-sequencing (snRNA-seq) cerebellum dataset. We will test for differential expression between the nodulus and anterior lobe regions.

Data Preprocessing and Creating an RCTD.replicates object

First, we run RCTD on the data to annotated cell types (doublet mode). Please refer to the spatial transcriptomics vignette for more explanation on the RCTD algorithm, such as precise details for loading in the SpatialRNA object for this dataset. Since we need to load in the SpatialRNA object for four replicates, we will apply the function read.SpatialRNA to the directories of each of the four datasets to generate puck1, puck2, puck3, and puck4. Please note that in general, we recommend applying to each replicate a custom function that can load in a single replicate.

### Load in/preprocess your data, this might vary based on your file type
puck1 <- read.SpatialRNA(datadir, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
puck2 <- read.SpatialRNA(datadir2, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
puck3 <- read.SpatialRNA(datadir3, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')
puck4 <- read.SpatialRNA(datadir4, count_file = "MappedDGEForR.csv", coords_file = 'BeadLocationsForR.csv')

Create explanatory variable / covariate

In order to run CSIDE, we will for each replicate create a explanatory variable (i.e. covariate) used for predicting differential expression in CSIDE. In general one should set the explanatory variable to biologically relevant predictors of gene expression such as spatial position. In this case, we will use the explanatory variable two divide each cerebellum replicate into two regions (left and right). In general, one can also create a custom function that will compute the explanatory variable on each SpatialRNA replicate. We will pass in a list of the explanatory variables to the run.CSIDE.replicates function.

Here, we also artifically upregulate the expression of the Kcnip4 gene in each dataset (in regions of high explanatory variable) to see whether CSIDE can detect this differentially expressed gene.

explanatory.variable1 <- as.integer(puck1@coords$x > 3000)
names(explanatory.variable1) <- rownames(puck1@coords)
explanatory.variable2 <- as.integer(puck2@coords$x > 3000)
names(explanatory.variable2) <- rownames(puck2@coords)
explanatory.variable3 <- as.integer(puck3@coords$x > 3000)
names(explanatory.variable3) <- rownames(puck3@coords)
explanatory.variable4 <- as.integer(puck4@coords$x > 3000)
names(explanatory.variable4) <- rownames(puck4@coords)
exvar_list <- list(explanatory.variable1, explanatory.variable2, explanatory.variable3, explanatory.variable4)

# Differentially upregulate one gene
change_gene <- 'Kcnip4'
high_barc1 <- names(explanatory.variable1[explanatory.variable1 > 0.5])
puck1@counts[change_gene, high_barc1] <- puck1@counts[change_gene, high_barc1] * 2 
high_barc2 <- names(explanatory.variable2[explanatory.variable2 > 0.5])
puck2@counts[change_gene, high_barc2] <- puck1@counts[change_gene, high_barc2] * 2 
high_barc3 <- names(explanatory.variable3[explanatory.variable3 > 0.5])
puck3@counts[change_gene, high_barc3] <- puck3@counts[change_gene, high_barc3] * 2 
high_barc4 <- names(explanatory.variable4[explanatory.variable4 > 0.5])
puck4@counts[change_gene, high_barc4] <- puck4@counts[change_gene, high_barc4] * 2

We also load in the Reference object, as follows.

refdir <- system.file("extdata",'Reference/Vignette',package = 'spacexr') # directory for the reference
counts <- read.csv(file.path(refdir,"dge.csv")) # load in counts matrix
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
meta_data <- read.csv(file.path(refdir,"meta_data.csv")) # load in meta_data (barcodes, clusters, and nUMI)
cell_types <- meta_data$cluster; names(cell_types) <- meta_data$barcode # create cell_types named list
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- meta_data$nUMI; names(nUMI) <- meta_data$barcode # create nUMI named list
reference <- Reference(counts, cell_types, nUMI)

Now, we are in a position to create a RCTD.replicates object, an object that can store multiple SpatialRNA objects allowing for joint processing with RCTD and CSIDE. We pass in spatial.replicates, a list of the four SpatialRNA objects, replicate_names, a list of names for each replicate, and group_ids, a list of the group number for each replicate. In this case, all replicates will come from the same group. However, we have the option to assign different replicates to different groups, which will allow for extra variation to be modeled across groups.

spatial.replicates <- list(puck1, puck2, puck3, puck4)
replicate_names <- c('1','2','3','4')
group_ids <- c(1,1,1,1)
myRCTD.reps <- create.RCTD.replicates(spatial.replicates, reference, replicate_names, group_ids = group_ids, max_cores = 2)
#> 
#>  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 
#> 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25

Running RCTD

We are now ready to run RCTD with the run.RCTD.replicates function to assign cell types to these replicates. This function runs RCTD on each replicate in the RCTD.replicate object.

myRCTD.reps <- run.RCTD.replicates(myRCTD.reps)
#> run.RCTD.replicates: running RCTD for replicate 1
#> fitBulk: decomposing bulk
#> chooseSigma: using initial Q_mat with sigma =  1
#> Likelihood value: 1972.07376580346
#> Sigma value:  0.84
#> Likelihood value: 1966.20345908295
#> Sigma value:  0.84
#> run.RCTD.replicates: running RCTD for replicate 2
#> fitBulk: decomposing bulk
#> chooseSigma: using initial Q_mat with sigma =  1
#> Likelihood value: 1974.77542702319
#> Sigma value:  0.84
#> Likelihood value: 1969.28192633095
#> Sigma value:  0.84
#> run.RCTD.replicates: running RCTD for replicate 3
#> fitBulk: decomposing bulk
#> chooseSigma: using initial Q_mat with sigma =  1
#> Likelihood value: 1973.42952348242
#> Sigma value:  0.84
#> Likelihood value: 1967.73067452082
#> Sigma value:  0.84
#> run.RCTD.replicates: running RCTD for replicate 4
#> fitBulk: decomposing bulk
#> chooseSigma: using initial Q_mat with sigma =  1
#> Likelihood value: 1972.59047252235
#> Sigma value:  0.84
#> Likelihood value: 1966.71685085192
#> Sigma value:  0.84

Running CSIDE

After creating the explanatory variables and running RCTD, we are now ready to run CSIDE using the run.CSIDE.replicates function. Here, we pass in exvar_list, the list of explanatory variables for each replicate. We will use cell types 10 and 18 and a false discovery rate of 0.25. Next, we will set a gene threshold (i.e. minimum gene expression) of 0.01, and we will set a cell_type_threshold (minimum instances per cell type) of 3.

Warning: On this toy dataset, we have made several choices of parameters that are not recommended for regular use. On real datasets, we recommend first consulting the CSIDE default parameters. This includes gene_threshold (default 5e-5), cell_type_threshold (default 125), and fdr (default 0.01). Please see ?run.CSIDE.replicates for more information on these parameters.

#de
cell_types <- c('10','18', '6')
myRCTD.reps <- run.CSIDE.replicates(myRCTD.reps, cell_types, exvar_list, gene_threshold = .01, cell_type_threshold = 3, fdr = 0.25, weight_threshold = 0.8) 
#> run.CSIDE.replicates: running CSIDE for replicate 1
#> Warning in run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types,
#> cell_type_threshold = cell_type_threshold, : run.CSIDE.general: some parameters
#> are set to the CSIDE vignette values, which are intended for testing but
#> not proper execution. For more accurate results, consider using the default
#> parameters to this function.
#> run.CSIDE.general: running CSIDE with cell types 10, 18, 6
#> run.CSIDE.general: configure params_to_test = 2,
#> filter_genes: filtering genes based on threshold = 0.01
#> set_global_Q_all: begin
#> set_global_Q_all: finished
#> run.CSIDE.replicates: running CSIDE for replicate 2
#> Warning in run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types,
#> cell_type_threshold = cell_type_threshold, : run.CSIDE.general: some parameters
#> are set to the CSIDE vignette values, which are intended for testing but
#> not proper execution. For more accurate results, consider using the default
#> parameters to this function.
#> run.CSIDE.general: running CSIDE with cell types 10, 18, 6
#> run.CSIDE.general: configure params_to_test = 2,
#> filter_genes: filtering genes based on threshold = 0.01
#> set_global_Q_all: begin
#> set_global_Q_all: finished
#> run.CSIDE.replicates: running CSIDE for replicate 3
#> Warning in run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types,
#> cell_type_threshold = cell_type_threshold, : run.CSIDE.general: some parameters
#> are set to the CSIDE vignette values, which are intended for testing but
#> not proper execution. For more accurate results, consider using the default
#> parameters to this function.
#> run.CSIDE.general: running CSIDE with cell types 10, 18, 6
#> run.CSIDE.general: configure params_to_test = 2,
#> filter_genes: filtering genes based on threshold = 0.01
#> set_global_Q_all: begin
#> set_global_Q_all: finished
#> run.CSIDE.replicates: running CSIDE for replicate 4
#> Warning in run.CSIDE.general(myRCTD, X1, X2, barcodes, cell_types,
#> cell_type_threshold = cell_type_threshold, : run.CSIDE.general: some parameters
#> are set to the CSIDE vignette values, which are intended for testing but
#> not proper execution. For more accurate results, consider using the default
#> parameters to this function.
#> run.CSIDE.general: running CSIDE with cell types 10, 18, 6
#> run.CSIDE.general: configure params_to_test = 2,
#> filter_genes: filtering genes based on threshold = 0.01
#> set_global_Q_all: begin
#> set_global_Q_all: finished

saveRDS(myRCTD.reps,file.path(savedir,'myRCTDde_reps.rds'))

Note that the RCTD and CSIDE results on individual replicates are stored in myRCTD.reps@RCTD.reps, which is a list of RCTD objects containing individual CSIDE results as discussed in other vignettes.

CSIDE population inference

After running CSIDE on each replicate, the next step is to compute population-level differential expression across all samples using CSIDE.population.inference. There are two ways to run this function, and we will explore both.

Population inference: default

The default mode for CSIDE.population.inference is to aggregate differential expression estimates across samples to form a population “average” of differential expression:

myRCTD.reps <- CSIDE.population.inference(myRCTD.reps, fdr = 0.25)
#> CSIDE.population.inference: running population DE inference with use.groups=FALSE, and meta = FALSE
#> [1] "one_ct_genes: population inference on cell type, 10"
#> [1] "done"
#> [1] "one_ct_genes: population inference on cell type, 18"
#> [1] "done"
#> [1] "one_ct_genes: population inference on cell type, 6"
#> [1] "done"

CSIDE population results

We will focus on cell type 10. Furthermore, we will examine the original Kcnip4 gene, which was detected to be significantly differentially expressed in cell type 10. Other genes, including nonsignificant genes, can be observed in myRCTD.reps@population_de_results. As shown below, DE results on significant genes are stored in population_sig_gene_df. In particular, notice the columns Z_est (Z-score), log_fc_est (estimated DE coefficient), sd_est (standard error of DE coefficient), p (p-value), and q_val (q-value). Other columns of interest include tau, the estimated variance across samples, and mean i and sd i, which represent the DE point estimate and standard error on each of the four replicates.

#print results for cell type '18'
cell_type <- '10'
sig_gene <- change_gene
print(paste("following results hold for", sig_gene))
#> [1] "following results hold for Kcnip4"
print(myRCTD.reps@population_sig_gene_df[[cell_type]]) 
#>               tau log_fc_est    sd_est     Z_est   p_cross ct_prop        expr
#> Kcnip4 0.00000000  1.0554374 0.2758673  3.825888 0.7345539       1 0.019770883
#> Rora   0.42836080  0.6876000 0.3963400  1.734874 0.1014808       1 0.002325036
#> Kcnd2  0.09014169 -0.4050107 0.1781208 -2.273799 0.3290451       1 0.014091530
#>              q_val            p     mean 1     mean 2     mean 3     mean 4
#> Kcnip4 0.000912112 0.0001303017  0.8401400  1.5089602 1.25066697  0.7088606
#> Rora   0.144835438 0.0827631075  1.0187850  2.4330355 0.54854271 -0.1151301
#> Kcnd2  0.080423340 0.0229780973 -0.2303601 -0.8156025 0.03587612 -0.5407490
#>             sd 1      sd 2      sd 3      sd 4
#> Kcnip4 0.5837606 0.6737911 0.4744952 0.5298941
#> Rora   0.6652035 1.0544742 0.5611421 0.5648168
#> Kcnd2  0.3620835 0.3392899 0.3558457 0.3247958

Finally, we will save population CSIDE results in the savedir directory:

if(!dir.exists(file.path(savedir,'population')))
  dir.create(file.path(savedir,'population'))
save.CSIDE.replicates(myRCTD.reps, file.path(savedir,'population'))

Population inference: meta regression

The second way to run CSIDE population inference is meta regression, which is more flexible than aggregating across samples. For example, one may control for covariates such as age and gender, as well as test across covariates such as timepoints or case vs. control. In this example, we will assume that the four samples came from two cases and two controls. Meta regression works by including a design matrix, meta.design.matrix, of named covariates across samples. In CSIDE.population.inference, we set the option meta = TRUE, and meta.test_var = 'treat', the variable we would like to test.

meta.design.matrix <- data.frame('treat' = c(0,1,0,1))
print(meta.design.matrix)
#>   treat
#> 1     0
#> 2     1
#> 3     0
#> 4     1
myRCTD.reps_meta <- CSIDE.population.inference(myRCTD.reps, fdr = 0.25,
                                          meta = TRUE,
                                       meta.design.matrix = meta.design.matrix,
                                       meta.test_var = 'treat')
#> CSIDE.population.inference: running population DE inference with use.groups=FALSE, and meta = TRUE
#> [1] "one_ct_genes: population inference on cell type, 10"
#> [1] "done"
#> [1] "one_ct_genes: population inference on cell type, 18"
#> [1] "done"
#> [1] "one_ct_genes: population inference on cell type, 6"
#> [1] "done"
head(myRCTD.reps_meta@population_de_results$`10`)
#>             tau  log_fc_est    sd_est       Z_est    p_cross  ct_prop
#> Snap25 0.000000  0.51359922 0.3103883  1.65469895 0.82539758 1.000000
#> Kcnip4 0.000000 -0.07273497 0.5559337 -0.13083389 0.55737760 1.000000
#> Kcnd2  0.000000 -0.57725078 0.3456309 -1.67013660 0.73438010 1.000000
#> Rora   0.884643  0.05103254 1.1293624  0.04518704 0.08938231 1.000000
#> Lsamp  0.000000  0.50366348 0.4924901  1.02268751 0.97895051 1.000000
#> Calm2  1.199270 -1.09960186 1.3023646 -0.84431184 0.00118826 0.859978
#>                expr     q_val          p
#> Snap25 0.0016977534 0.3429495 0.09798558
#> Kcnip4 0.0197708833 0.9639582 0.89590671
#> Kcnd2  0.0140915301 0.3429495 0.09489234
#> Rora   0.0023250360 0.9639582 0.96395822
#> Lsamp  0.0065846141 0.6973665 0.30645562
#> Calm2  0.0005844512 0.6973665 0.39849517

Compared to the default analysis, this population-level analysis is measuring something different, the effect of treatment on differential expression. For example, the lowest p-value gene (yet not < 0.05), Snap25, has an estimated effect of treatment of 0.56. Notice that Kcnip4 is no longer estimated to have a large effect, because the question of difference between cases and controls is fundamentally different from estimating the average across samples.