Applying CSIDE to Spatial Transcriptomics Data: Smooth spatial patterns in MERFISH

library(spacexr)
library(Matrix)
library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

library(ggplot2)
datadir <- system.file("extdata",'SpatialRNA/MerfishVignette',package = 'spacexr') # directory for sample MERFISH dataset
if(!dir.exists(datadir))
  dir.create(datadir)
savedir <- 'RCTD_results'
if(!dir.exists(savedir))
  dir.create(savedir)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  out.width = "100%"
)

Introduction

Generalized Linear Admixture Models for Differential Expression, or CSIDE, is part of the spacexr R package for learning cell type-specific differential expression from spatial transcriptomics data. In this Vignette, we will use CSIDE to test for nonparametric differential expression in a toy hypothalamus MERFISH dataset. First, we will first use RCTD to assign cell types to a hypothalamus MERFISH dataset. We will define cell type profiles using an annotated single cell RNA-sequencing (scRNA-seq) hypothalamus dataset. We will detect differential expression as nonparametric smooth spatial patterns.

Data Preprocessing and running RCTD

First, we run RCTD on the data to annotate cell types. Please note that this follows exactly the content of the spatial transcriptomics RCTD vignette, except that we use RCTD on doublet mode. Since doublet mode can discover 1-2 cell types per pixel, it is a reasonable choice for technologies such as MERFISH that can have precise spatial resolution. Please refer to the spatial transcriptomics vignette for more explanation on the RCTD algorithm.

### Load in/preprocess your data, this might vary based on your file type
if(!file.exists(file.path(savedir,'myRCTD_merfish.rds'))) {
  counts <- as.data.frame(readr::read_csv(file.path(datadir,"counts.csv"))) # load in counts matrix
  coords <- read.csv(file.path(datadir,"coords.csv"))
  rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
  rownames(coords) <- coords[,1]; coords[,1] <- NULL # Move first column to rownames
  nUMI <- colSums(counts) # In this case, total counts per pixel is nUMI
  puck <- SpatialRNA(coords, counts, nUMI)
  barcodes <- colnames(puck@counts) # pixels to be used (a list of barcode names). 
  plot_puck_continuous(puck, barcodes, puck@nUMI, ylimit = c(0,round(quantile(puck@nUMI,0.9))), 
                       title ='plot of nUMI') 
  refdir <- system.file("extdata",'Reference/Merfish_Ref',package = 'spacexr') # directory for the reference
  counts <- as.data.frame(readr::read_csv(file.path(refdir,"counts.csv"))) # load in cell types
  rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
  cell_types <- read.csv(file.path(refdir,"cell_types.csv")) # load in cell types
  cell_types <- setNames(cell_types[[2]], cell_types[[1]])
  cell_types <- as.factor(cell_types) # convert to factor data type
  nUMI <- read.csv(file.path(refdir,"nUMI.csv")) # load in cell types
  nUMI <- setNames(nUMI[[2]], nUMI[[1]])
  reference <- Reference(counts, cell_types, nUMI)
  myRCTD <- create.RCTD(puck, reference, max_cores = 2)
  myRCTD <- run.RCTD(myRCTD, doublet_mode = 'doublet')
  saveRDS(myRCTD,file.path(savedir,'myRCTD_merfish.rds'))
}

Running CSIDE

We are now ready to run CSIDE using the run.CSIDE.nonparametric function. We will use two cores, and a false discovery rate of 0.25. Next, we will set a gene threshold (i.e. minimum gene expression) of 0.01, and we will set a cell_type_threshold (minimum instances per cell type) of 10.

Warning: On this toy dataset, we have made several choices of parameters that are not recommended for regular use. On real datasets, we recommend first consulting the CSIDE default parameters. This includes gene_threshold (default 5e-5), cell_type_threshold (default 125), fdr (default 0.01), and weight_threshold (default NULL). Please see ?run.CSIDE.nonparametric for more information on these parameters.

#de
myRCTD <- readRDS(file.path(savedir,'myRCTD_merfish.rds'))
myRCTD@config$max_cores <- 2
cell_types <- c('Excitatory', 'Inhibitory', 'Astrocytes')
myRCTD <- run.CSIDE.nonparam(myRCTD, df = 6, cell_types = cell_types, gene_threshold = .001, cell_type_threshold = 10, fdr = 0.25) 
#> choose_cell_types: running de with cell types Excitatory, Inhibitory, Astrocytes
#> run.CSIDE.general: configure params_to_test = 2run.CSIDE.general: configure params_to_test = 3run.CSIDE.general: configure params_to_test = 4run.CSIDE.general: configure params_to_test = 5run.CSIDE.general: configure params_to_test = 6
saveRDS(myRCTD,file.path(savedir,'myRCTDde_merfish.rds'))

Equivalently to using the run.CSIDE.nonparametric function, those who want to have more precise control can build the design matrix directly. In this case, we use the build.designmatrix.nonparametric function, which constructs the design matrix given df the number of basis functions. After constructing the design matrix, one can use the run.CSIDE function to run CSIDE with this general design matrix. We also plot the third basis function below.

X <- build.designmatrix.nonparam(myRCTD,df = 6)
print(head(X))
#>                                      X4        X5        X6        X1       X2
#> f4e9c023-f3db-4f6d-9212-72b4dc77a3e3  1 -1.693155 -1.621206 -2.058753 1.989622
#> b918064b-14e5-4c76-a868-7e2b552f098e  1 -1.679173 -1.733760 -2.146166 2.030787
#> f5a6205b-717e-4aa6-9b4a-2bb7de0e76e7  1 -1.642168 -1.642054 -2.009435 1.984991
#> 9ebf1a3c-46d1-4ac1-a519-f438234e2183  1 -1.616511 -1.856374 -2.178610 2.041507
#> e381f712-3bac-4626-9b1d-da94a3bc768f  1 -1.504434 -1.717352 -1.892619 1.954967
#> ed0231aa-9e40-48d3-87c3-9211a6099631  1 -1.501188 -1.857376 -2.028886 1.984552
#>                                               X3
#> f4e9c023-f3db-4f6d-9212-72b4dc77a3e3 -0.16896993
#> b918064b-14e5-4c76-a868-7e2b552f098e  0.01604569
#> f5a6205b-717e-4aa6-9b4a-2bb7de0e76e7 -0.05543589
#> 9ebf1a3c-46d1-4ac1-a519-f438234e2183  0.28730155
#> e381f712-3bac-4626-9b1d-da94a3bc768f  0.28690614
#> ed0231aa-9e40-48d3-87c3-9211a6099631  0.48274495
barcodes <- rownames(X)
myRCTD_identical <- run.CSIDE(myRCTD, X, barcodes, cell_types, gene_threshold = .001, 
                        cell_type_threshold = 10, fdr = 0.25)
#> choose_cell_types: running de with cell types Excitatory, Inhibitory, Astrocytes
#> run.CSIDE.general: configure params_to_test = 2
plot_puck_continuous(myRCTD@spatialRNA, myRCTD@internal_vars_de$barcodes, X[,3], ylimit = c(0,1), 
                       title ='plot of third basis function')

CSIDE results

After running CSIDE, it is time to examine CSIDE results. We will focus on cell type Excitatory. Furthermore, we will examine the Irs4 gene. Notice below that the largest coefficient in magnitude is the third basis function, which was plotted above. This implies that Irs4 has large change in expression along the y-axis, which we will verify with spatial plots next.

#print results for cell type 'Excitatory'
cell_type <- 'Excitatory'
gene <- 'Irs4'
print(paste("following results hold for", gene))
#> [1] "following results hold for Irs4"
print("check for covergence of each cell type")
#> [1] "check for covergence of each cell type"
print(myRCTD@de_results$gene_fits$con_mat[gene, ])
#> Excitatory Inhibitory Astrocytes 
#>       TRUE       TRUE       TRUE
print('estimated coefficients for Excitatory in each basis function')
#> [1] "estimated coefficients for Excitatory in each basis function"
cell_type_ind <- which(cell_type == cell_types)
print(myRCTD@de_results$gene_fits$all_vals[gene, , cell_type_ind]) 
#> [1] -3.3167228 -0.3712502 -3.3505948  2.6041749  0.6933264 -1.6264236

Look at significant genes.

We will now examine the myRCTD@de_results$sig_gene_list object. sig_gene_list is a list, for each cell type, of data.frames that contain the hypothesis testing results for significant genes. For nonparametric CSIDE, each coefficient is tested for significance. In particular, notice the following columns: p_val (p value), log_fc (log-fold-change along that coefficient axis), Z_score (Z-score of testing that coefficient), se (standard error), conv (convergence), and paramindex_best (index of the parameter that achieved significance). For example, for Irs4, the third basis function, as discussed above, achieved signifcance.

print(head(myRCTD@de_results$sig_gene_list[[cell_type]]))
#>         Z_score    log_fc        se paramindex_best conv        p_val
#> Pak3   6.543918 -1.331429 0.2034605               3 TRUE 2.996392e-10
#> Irs4   6.231948 -3.350595 0.5376480               3 TRUE 2.303346e-09
#> Ntng1  5.522369  3.038562 0.5502279               3 TRUE 1.672293e-07
#> Pgr    5.247607  2.892039 0.5511159               4 TRUE 7.704395e-07
#> Prlr   4.904799 -2.118372 0.4318979               3 TRUE 4.676162e-06
#> Necab1 4.881544  2.828793 0.5794874               3 TRUE 5.262916e-06

Finally, we will plot CSIDE results in the savedir directory!

The following plot shows a spatial visualization of the Irs4 gene, which was determined by CSIDE to be significantly differentially expressed in excitatory neurons. We below use the plot_gene_raw function to observe the raw gene values, whereas plot_prediction_gene plots the CSIDE fitted spatial patterns. Observe how the CSIDE fit smooths over the noisy single-pixel gene expression values.

The function make_all_de_plots will automatically generate several types of plots displaying the CSIDE results across all genes and cell types.

myRCTD <- readRDS(file.path(savedir,'myRCTDde_merfish.rds'))
gene <- 'Irs4'
plot_prediction_gene(myRCTD, cell_type, gene)

plot_gene_raw(myRCTD, gene, cell_type, ymax = 40)

make_all_de_plots(myRCTD, savedir)