library(Seurat)
library(dplyr)
library(data.table)
library(stringr)
library(Gviz)
library(ggplot2)
library(ggthemes)
library(rtracklayer)
#load gene based analysis Seurat object
= readRDS("./neuro_DGE.RDS") dge.obj
Downstream analysis on scRNA-seq from human neuronal differentiation using SCALPEL
Note: Install scalpelR for downstream analysis
For execution of several function used in this downstream analysis with SCALPEL, we will implemented the R package scalpelR that can be installed from its Github page as:
install.packages("devtools")
::install_github("Franzx7/scalpelR")
devtools
library(scalpelR)
1 Introduction
In this exemple, we did a gene and isoform based single-cell analysis from this Drop-seq single-cell dataset. Using Seurat, we first performed a standard single-cell gene-level analysis of the human neuronal differentiation dataset, which enabled the identification of distinct cell populations.
#Visualize gene based sc cells
DimPlot(dge.obj, label=T, repel = T, pt.size = 1., label.size = 6) %>% theme_umap() +
ggtitle("iPS & NPC cells clusters")
Prior to obtain this final clusters of population, we performed some quality filtering steps to obtain this set of reliable cells. Using those same cells barcode tags, we executed SCALPEL on this Drop-seq based single-cell dataset. All the information required to run SCALPEL can be found on its Github page.
Note: SCALPEL requires the FASTQ, BAM and Gene count expression associated to each sample for its execution. For Drop-seq experiments, the BAM and Gene count expression can be obtained using the Drop-seq tools and for 10x Genomics experiments, the BAM and Gene count expression can be obtained using the CellRanger pipeline.
SCALPEL execution can be performed on a specific cell population using the - -barcodes argument option.
Following SCALPEL execution, a result folder is generated that contains all the information required to perform downstream analysis.
> tree -L 1 scalpel_output
scalpel_output
├── DIU_table.csv
├── hiPSCs_APADGE.txt
├── hiPSCs_filtered.bam
├── hiPSCs_filtered.bam.bai
├── hiPSCs_seurat.RDS
├── iDGE_seurat.RDS
├── NPCs_APADGE.txt
├── NPCs_filtered.bam
├── NPCs_filtered.bam.bai
├── NPCs_seurat.RDS
└── Runfiles
2 directories, 10 files
2 Downstream analysis of SCALPEL results
2.1 Loading isoform based Seurat object
Let’s load the isoform-based Seurat object generated by SCALPEL using the reliable cell barcodes obtained from the gene-based analysis:
= readRDS("./iDGE_seurat.RDS")
idge.obj idge.obj
An object of class Seurat
68874 features across 2535 samples within 1 assay
Active assay: RNA (68874 features, 3837 variable features)
3 layers present: data, counts, scale.data
2 dimensional reductions calculated: pca, umap
2.2 Quality filtering on low expressed isoforms features
For this analysis, we already used a reliable set of cells obtained from the gene-based analysis. However, we can still perform some filtering steps to discard lowly expressed isoforms features that are not informative. For example, we can filter out isoforms that are detected in less than 3 cells:
# Filter out isoforms that are detected in less than 3 cells
= subset(idge.obj, features = rownames(idge.obj)[Matrix::rowSums(idge.obj@assays$RNA@counts > 0) >= 3]) idge.obj
2.3 Dimensionality reduction and clustering
Let’s perform the dimensionality reduction and clustering on this isoform-based Seurat object:
# Standard workflow for dimensionality reduction and clustering (Adjust the parameters)
= NormalizeData(idge.obj)
idge.obj # as isoform data is more sparse and noisy, we recommend to use a higher number of variable features
= FindVariableFeatures(idge.obj, nfeatures = 4000)
idge.obj = ScaleData(idge.obj)
idge.obj = RunPCA(idge.obj, npcs = 50)
idge.obj = RunUMAP(idge.obj, dims = 1:30)
idge.obj
#Clustering
= FindNeighbors(idge.obj, dims = 1:30)
idge.obj = FindClusters(idge.obj, resolution = 0.5)
idge.obj $Clusters = idge.obj$seurat_clusters idge.obj
Now, we can visualize the clusters obtained from this isoform-based analysis. We can observe that while briging a new layer of information, the isoform quantification conserves the cell type specificity identified from the gene-based analysis
# Filter out isoforms that are detected in less than 3 cells
DimPlot(idge.obj, group.by="Clusters", label=T, repel = T, pt.size = 1., label.size = 6) %>% theme_umap() + ggtitle("iPS & NPC cells clusters (isoform-based)")
2.4 Differential isoform expression analysis
We can perform a differential isoform expression analysis between the different clusters obtained from the isoform-based analysis. For example, let’s perform a differential isoform expression analysis between the iPSC and NP cell populations:
= scalpelR::FindIsoforms(
diu.table seurat_obj = idge.obj,
group_by = "Clusters",
assay="RNA",
threshold_pval = 0.05)
print(diu.table)
Aggregating counts in conditions… Performing filtering and comparison test… |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=51s
gene_tr gene_name transcript_name iPS NP p_value
<char> <char> <char> <num> <num> <num>
1: ABCE1***ABCE1-201 ABCE1 ABCE1-201 355 93 0.0004997501
2: ABCE1***ABCE1-202 ABCE1 ABCE1-202 97 90 0.0004997501
3: ABCE1***ABCE1-203 ABCE1 ABCE1-203 267 88 0.0004997501
4: ABI2***ABI2-201 ABI2 ABI2-201 154 209 0.0004997501
5: ABI2***ABI2-208 ABI2 ABI2-208 232 104 0.0004997501
---
3069: SCARB2***SCARB2-225 SCARB2 SCARB2-225 54 44 0.0094952524
3070: ZNF429***ZNF429-201 ZNF429 ZNF429-201 24 9 0.0094952524
3071: ZNF429***ZNF429-202 ZNF429 ZNF429-202 14 9 0.0094952524
3072: ZNF429***ZNF429-205 ZNF429 ZNF429-205 5 14 0.0094952524
3073: ZNF429***ZNF429-208 ZNF429 ZNF429-208 5 5 0.0094952524
statistic p_value_adjusted
<num> <num>
1: 51.570918 0.004359955
2: 51.570918 0.004359955
3: 51.570918 0.004359955
4: 50.017008 0.004359955
5: 50.017008 0.004359955
---
3069: 9.907011 0.048981042
3070: 10.927779 0.048981042
3071: 10.927779 0.048981042
3072: 10.927779 0.048981042
3073: 10.927779 0.048981042
We can visualize a particular gene expression, for example the gene EIF1, that have differentially expressed isoforms between the iPSC and NPC populations:
= diu.table %>% dplyr::filter(gene_name == "EIF1")
gene.tab print(gene.tab)
gene_tr gene_name transcript_name iPS NP p_value statistic
<char> <char> <char> <num> <num> <num> <num>
1: EIF1***EIF1-201 EIF1 EIF1-201 2517 1938 0.0004997501 386.0618
2: EIF1***EIF1-204 EIF1 EIF1-204 5169 1799 0.0004997501 386.0618
p_value_adjusted
<num>
1: 0.004359955
2: 0.004359955
2.5 Visualization of isoform expression in cells
2.5.1 UMAP
We can visualize the expression of these differentially expressed isoforms in the cells using a UMAP plot using Seurat. For example, let’s visualize the expression of the two differentially expressed isoforms of the gene EIF1:
::FeaturePlot(idge.obj,
Seuratfeatures = gene.tab$gene_tr,
pt.size = 1,
label.size=6,
order=T,
label=T,
cols = c("gray80", "red"))
2.5.2 Violin plots
Aditionally, we can visualize the expression of these differentially expressed isoforms within the condition using violin plot as:
= scalpelR::plotExpression(
p1 seurat_obj = idge.obj,
features = gene.tab$gene_tr,
group_var = "Clusters",
plot = "group")
= scalpelR::plotExpression(
p2 seurat_obj = idge.obj,
features = gene.tab$gene_tr,
group_var = "Clusters",
plot = "features")
list(p1, p2) %>% patchwork::wrap_plots(ncol=1)
Using features as id variables
Using features as id variables
2.6 Visualization of read coverage on the genome
Furthermore, we can visualize the read coverage of these differentially expressed isoforms using the BAM files generated by SCALPEL. For example, let’s visualize the read coverage of the two differentially expressed isoforms of the gene EIF1.
As we used a curated set of cell barcodes from the gene-based analysis to run SCALPEL, the BAM files generated by SCALPEL already contains only the reads from those reliable cells. Therefore, we can directly use these filtered BAM files to visualize the read coverage of the differentially expressed isoforms. However, if you want to visualize the read coverage of a specific cell population, you can filter the BAM files using the cell barcodes associated to that population and scalpelR subsetBAM as:
= dplyr::filter(idge.obj[[]], Clusters=="iPS") %>% rownames()
ips_barcodes = dplyr::filter(idge.obj[[]], Clusters=="NP") %>% rownames()
np_barcodes
# Filter BAM file to obtain reads from a specific cell population
#IPS population
::subsetBAM(
scalpelRbam_file = "./scalpel_output/hiPSCs_filtered.bam",
barcodes = ips_barcodes,
output_file = "./subset_ips_filtered.bam",
tag = "XC",
cores = 4)
#NP population
::subsetBAM(
scalpelRbam_file = "./scalpel_output/NPCs_filtered.bam",
barcodes = ips_barcodes,
output_file = "./subset_np_filtered.bam",
tag = "XC",
cores = 4)
#Plot coverage using the subset BAM files and GTF annotation reference
= rtracklayer::import("./gencode.v41.annotation.gtf.gz")
gtrack ::CoveragePlot(
scalpelRgenome_gr = gtrack,
bamfiles = c("IPS"="./subset_ips_filtered.bam",
"NP"="./subset_np_filtered.bam"),
gene_in = "EIF1",
genome_sp = "hg38")
$`BAM file:`
IPS NP
"~/Projects/iPS_filtered.bam" "~/Projects/NP_filtered.bam"
$`Gene:`
[1] "EIF1"
$`Transcripts:`
NULL
$`Genome annotation org:`
[1] "hg38"
Building Gene Track...
Building Alignment Track...
By default, the CoveragePlot function will plot all the isoforms of the gene of interest. However, you can also specify a set of isoforms to plot using the features argument option. Furthermore, you can also specify the distance to zoom from the 3’UTR using the distZOOM argument option as:
::CoveragePlot(
scalpelRgenome_gr = gtrack,
bamfiles = c("IPS"="~/Projects/iPS_filtered.bam",
"NP"="~/Projects/NP_filtered.bam"),
features = gene.tab$transcript_name,
gene_in = "EIF1",
genome_sp = "hg38",
distZOOM = 2000)
$`BAM file:`
IPS NP
"~/Projects/iPS_filtered.bam" "~/Projects/NP_filtered.bam"
$`Gene:`
[1] "EIF1"
$`Transcripts:`
[1] "EIF1-201" "EIF1-204"
$`Genome annotation org:`
[1] "hg38"
Building Gene Track...
Building Alignment Track...
2.7 3’UTR length analysis in single-cells
Using the isoform quantification obtained from SCALPEL, and the genome annotation reference, we can estimate the 3’UTR length of each isoform in each cell. For this, we can use the function scalpelR::weightedUTR_cell as:
= scalpelR::weightedUTR_cell(
utr3length_tab seurat_obj = idge.obj,
annotation_gr = gtrack,
group.by = "Clusters",
assay = "RNA"
)
Processing annotation file...(1/3)
Processing isoform expression...(2/3)
Integrating Seurat metadata...(3/3)
print(utr3length_tab)
cells group wUTRlength.cell
<char> <char> <num>
1: AAAAAAGAACCC-1 iPS 426.7257
2: AAAAAAGAACTG-1 iPS 309.9807
3: AAAAAATGTAAC-1 iPS 284.2699
4: AAAAGAATCGGC-1 iPS 206.4187
5: AAAAGTCAACTA-1 iPS 459.1815
---
2531: TTTTCAGAGTTT-2 NP 549.5643
2532: TTTTCGCAATTC-2 NP 565.8120
2533: TTTTGGCTCAGT-2 NP 241.0014
2534: TTTTGTCACCTT-2 NP 407.8529
2535: TTTTTCCGTACG-2 NP 301.0732
Then, based on this metric, we can visualize the 3’UTR length distribution in the different cell populations using plots as:
%>%
utr3length_tab ::ggplot(aes(x=group, y=wUTRlength.cell, fill=group)) +
ggplot2geom_jitter(na.rm=T, size=.5) +
::geom_violin(scale="width", trim=T, na.rm = T, alpha=.6) +
ggplot2::geom_boxplot(fill="white", width=.3) +
ggplot2::ylim(c(0,1000)) +
ggplot2::ggtitle("average 3'UTR length distribution per cell") +
ggplot2::theme(title = element_text(size=9)) +
ggplot2::theme_bw(base_size = 9) ggplot2
3 Further downstream analysis
Using the R Seurat Object generated by SCALPEL, you can perform further downstream analysis in the same way as a standard gene-based Seurat object. For example, you can perform trajectory analysis using Monocle3, etc…
For more information, please refer to the Seurat documentation here.