library(spacexr)
library(Matrix)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(ggplot2)
datadir <- system.file("extdata",'SpatialRNA/MerfishVignette',package = 'spacexr') # directory for sample MERFISH dataset
if(!dir.exists(datadir))
dir.create(datadir)
savedir <- 'RCTD_results'
if(!dir.exists(savedir))
dir.create(savedir)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
cache = TRUE,
out.width = "100%"
)
Generalized Linear Admixture Models for Differential Expression, or CSIDE, is part of the spacexr R package for learning cell type-specific differential expression from spatial transcriptomics data. In this Vignette, we will use CSIDE to test for nonparametric differential expression in a toy hypothalamus MERFISH dataset. First, we will first use RCTD to assign cell types to a hypothalamus MERFISH dataset. We will define cell type profiles using an annotated single cell RNA-sequencing (scRNA-seq) hypothalamus dataset. We will detect differential expression as nonparametric smooth spatial patterns.
First, we run RCTD on the data to annotate cell types. Please note that this follows exactly the content of the spatial transcriptomics RCTD vignette, except that we use RCTD on doublet mode. Since doublet mode can discover 1-2 cell types per pixel, it is a reasonable choice for technologies such as MERFISH that can have precise spatial resolution. Please refer to the spatial transcriptomics vignette for more explanation on the RCTD algorithm.
### Load in/preprocess your data, this might vary based on your file type
if(!file.exists(file.path(savedir,'myRCTD_merfish.rds'))) {
counts <- as.data.frame(readr::read_csv(file.path(datadir,"counts.csv"))) # load in counts matrix
coords <- read.csv(file.path(datadir,"coords.csv"))
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
rownames(coords) <- coords[,1]; coords[,1] <- NULL # Move first column to rownames
nUMI <- colSums(counts) # In this case, total counts per pixel is nUMI
puck <- SpatialRNA(coords, counts, nUMI)
barcodes <- colnames(puck@counts) # pixels to be used (a list of barcode names).
plot_puck_continuous(puck, barcodes, puck@nUMI, ylimit = c(0,round(quantile(puck@nUMI,0.9))),
title ='plot of nUMI')
refdir <- system.file("extdata",'Reference/Merfish_Ref',package = 'spacexr') # directory for the reference
counts <- as.data.frame(readr::read_csv(file.path(refdir,"counts.csv"))) # load in cell types
rownames(counts) <- counts[,1]; counts[,1] <- NULL # Move first column to rownames
cell_types <- read.csv(file.path(refdir,"cell_types.csv")) # load in cell types
cell_types <- setNames(cell_types[[2]], cell_types[[1]])
cell_types <- as.factor(cell_types) # convert to factor data type
nUMI <- read.csv(file.path(refdir,"nUMI.csv")) # load in cell types
nUMI <- setNames(nUMI[[2]], nUMI[[1]])
reference <- Reference(counts, cell_types, nUMI)
myRCTD <- create.RCTD(puck, reference, max_cores = 2)
myRCTD <- run.RCTD(myRCTD, doublet_mode = 'doublet')
saveRDS(myRCTD,file.path(savedir,'myRCTD_merfish.rds'))
}
We are now ready to run CSIDE using the run.CSIDE.nonparametric
function. We will use two cores, and a false discovery rate of 0.25. Next, we will set a gene threshold (i.e. minimum gene expression) of 0.01, and we will set a cell_type_threshold (minimum instances per cell type) of 10.
Warning: On this toy dataset, we have made several choices of parameters that are not recommended for regular use. On real datasets, we recommend first consulting the CSIDE default parameters. This includes gene_threshold
(default 5e-5), cell_type_threshold
(default 125), fdr
(default 0.01), and weight_threshold
(default NULL). Please see ?run.CSIDE.nonparametric
for more information on these parameters.
#de
myRCTD <- readRDS(file.path(savedir,'myRCTD_merfish.rds'))
myRCTD@config$max_cores <- 2
cell_types <- c('Excitatory', 'Inhibitory', 'Astrocytes')
myRCTD <- run.CSIDE.nonparam(myRCTD, df = 6, cell_types = cell_types, gene_threshold = .001, cell_type_threshold = 10, fdr = 0.25)
#> choose_cell_types: running de with cell types Excitatory, Inhibitory, Astrocytes
#> run.CSIDE.general: configure params_to_test = 2run.CSIDE.general: configure params_to_test = 3run.CSIDE.general: configure params_to_test = 4run.CSIDE.general: configure params_to_test = 5run.CSIDE.general: configure params_to_test = 6
saveRDS(myRCTD,file.path(savedir,'myRCTDde_merfish.rds'))
Equivalently to using the run.CSIDE.nonparametric
function, those who want to have more precise control can build the design matrix directly. In this case, we use the build.designmatrix.nonparametric
function, which constructs the design matrix given df
the number of basis functions. After constructing the design matrix, one can use the run.CSIDE
function to run CSIDE with this general design matrix. We also plot the third basis function below.
X <- build.designmatrix.nonparam(myRCTD,df = 6)
print(head(X))
#> X4 X5 X6 X1 X2
#> f4e9c023-f3db-4f6d-9212-72b4dc77a3e3 1 -1.693155 -1.621206 -2.058753 1.989622
#> b918064b-14e5-4c76-a868-7e2b552f098e 1 -1.679173 -1.733760 -2.146166 2.030787
#> f5a6205b-717e-4aa6-9b4a-2bb7de0e76e7 1 -1.642168 -1.642054 -2.009435 1.984991
#> 9ebf1a3c-46d1-4ac1-a519-f438234e2183 1 -1.616511 -1.856374 -2.178610 2.041507
#> e381f712-3bac-4626-9b1d-da94a3bc768f 1 -1.504434 -1.717352 -1.892619 1.954967
#> ed0231aa-9e40-48d3-87c3-9211a6099631 1 -1.501188 -1.857376 -2.028886 1.984552
#> X3
#> f4e9c023-f3db-4f6d-9212-72b4dc77a3e3 -0.16896993
#> b918064b-14e5-4c76-a868-7e2b552f098e 0.01604569
#> f5a6205b-717e-4aa6-9b4a-2bb7de0e76e7 -0.05543589
#> 9ebf1a3c-46d1-4ac1-a519-f438234e2183 0.28730155
#> e381f712-3bac-4626-9b1d-da94a3bc768f 0.28690614
#> ed0231aa-9e40-48d3-87c3-9211a6099631 0.48274495
barcodes <- rownames(X)
myRCTD_identical <- run.CSIDE(myRCTD, X, barcodes, cell_types, gene_threshold = .001,
cell_type_threshold = 10, fdr = 0.25)
#> choose_cell_types: running de with cell types Excitatory, Inhibitory, Astrocytes
#> run.CSIDE.general: configure params_to_test = 2
plot_puck_continuous(myRCTD@spatialRNA, myRCTD@internal_vars_de$barcodes, X[,3], ylimit = c(0,1),
title ='plot of third basis function')
After running CSIDE, it is time to examine CSIDE results. We will focus on cell type Excitatory. Furthermore, we will examine the Irs4 gene. Notice below that the largest coefficient in magnitude is the third basis function, which was plotted above. This implies that Irs4 has large change in expression along the y-axis, which we will verify with spatial plots next.
#print results for cell type 'Excitatory'
cell_type <- 'Excitatory'
gene <- 'Irs4'
print(paste("following results hold for", gene))
#> [1] "following results hold for Irs4"
print("check for covergence of each cell type")
#> [1] "check for covergence of each cell type"
print(myRCTD@de_results$gene_fits$con_mat[gene, ])
#> Excitatory Inhibitory Astrocytes
#> TRUE TRUE TRUE
print('estimated coefficients for Excitatory in each basis function')
#> [1] "estimated coefficients for Excitatory in each basis function"
cell_type_ind <- which(cell_type == cell_types)
print(myRCTD@de_results$gene_fits$all_vals[gene, , cell_type_ind])
#> [1] -3.3167228 -0.3712502 -3.3505948 2.6041749 0.6933264 -1.6264236
We will now examine the myRCTD@de_results$sig_gene_list
object. sig_gene_list
is a list, for each cell type, of data.frames that contain the hypothesis testing results for significant genes. For nonparametric CSIDE, each coefficient is tested for significance. In particular, notice the following columns: p_val
(p value), log_fc
(log-fold-change along that coefficient axis), Z_score
(Z-score of testing that coefficient), se
(standard error), conv
(convergence), and paramindex_best
(index of the parameter that achieved significance). For example, for Irs4, the third basis function, as discussed above, achieved signifcance.
print(head(myRCTD@de_results$sig_gene_list[[cell_type]]))
#> Z_score log_fc se paramindex_best conv p_val
#> Pak3 6.543918 -1.331429 0.2034605 3 TRUE 2.996392e-10
#> Irs4 6.231948 -3.350595 0.5376480 3 TRUE 2.303346e-09
#> Ntng1 5.522369 3.038562 0.5502279 3 TRUE 1.672293e-07
#> Pgr 5.247607 2.892039 0.5511159 4 TRUE 7.704395e-07
#> Prlr 4.904799 -2.118372 0.4318979 3 TRUE 4.676162e-06
#> Necab1 4.881544 2.828793 0.5794874 3 TRUE 5.262916e-06
Finally, we will plot CSIDE results in the savedir
directory!
The following plot shows a spatial visualization of the Irs4 gene, which was determined by CSIDE to be significantly differentially expressed in excitatory neurons. We below use the plot_gene_raw
function to observe the raw gene values, whereas plot_prediction_gene
plots the CSIDE fitted spatial patterns. Observe how the CSIDE fit smooths over the noisy single-pixel gene expression values.
The function make_all_de_plots
will automatically generate several types of plots displaying the CSIDE results across all genes and cell types.
myRCTD <- readRDS(file.path(savedir,'myRCTDde_merfish.rds'))
gene <- 'Irs4'
plot_prediction_gene(myRCTD, cell_type, gene)
plot_gene_raw(myRCTD, gene, cell_type, ymax = 40)
make_all_de_plots(myRCTD, savedir)