SCALPEL application on Dropseq scRNA-seq: neuronal differentiation

author: Franz AKE

Introduction

This vignette demonstrates how to run the SCALPEL Nextflow pipeline on a Dropseq dataset from the study by Franz Ake et al. on human neuronal cells.

The full dataset is publicly available on GEO at accession GSE268222.

For each sample, SCALPEL requires as input the FASTQ files, the BAM file, and the gene expression (DGE) count matrix generated by Dropseq tools. For this demonstration, we provide a reduced version of the original data, reduced_GSE268222, intended for test runs.

> wget -O reduced_GSE268222.tar.gz "https://zenodo.org/records/17176865/files/reduced_GSE268222.tar.gz?download=1"
> tar -xvf reduced_GSE268222.tar.gz
> tree -L 2 ./reduced_GSE268222/
./reduced_GSE268222/
├── dropseq_files
│   ├── hiPSCs.bam
│   ├── hiPSCs.counts.txt
│   ├── NPCs.bam
│   └── NPCs.counts.txt
├── fastq_files
│   ├── hiPSCs_2000000_lib_16058AAB_CTCTCTAC_read1_001.fastq.gz
│   ├── hiPSCs_2000000_lib_16058AAB_CTCTCTAC_read2_001.fastq.gz
│   ├── NPCs_2000000_lib_16057AAB_TAGGCATG_read1_001.fastq.gz
│   └── NPCs_2000000_lib_16057AAB_TAGGCATG_read2_001.fastq.gz
├── hiPSCs_barcodes.txt
└── NPCs_barcodes.txt

3 directories, 10 files

To run, SCALPEL requires a GTF genome annotation, a FASTA transcript annotation, and a BED reference file of genomic internal priming sites for your organism (mouse or human). For the current analysis, we use a reduced annotation reference from GENCODE for human (GENCODE v41).

> wget -O reduced_GENCODE.v41.annotation.gtf.gz "https://zenodo.org/records/17176838/files/reduced_GENCODE.v41.annotation.gtf.gz?download=1"
> wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.transcripts.fa.gz
> wget -O hg38_ipriming_sites.bed.tar.gz "https://zenodo.org/records/15717592/files/hg38_ipriming_sites.bed.tar.gz?download=1"

Complete GTF genome annotations and FASTA transcript annotations can be downloaded from GENCODE or Ensembl.

SCALPEL running

To run SCALPEL, you must provide a samplesheet.csv file, which specifies the input files for each sample. This file is required and passed to the pipeline using the [--samplesheet] argument.

Each row in the samplesheet.csv should contain the following columns, in order, using absolute paths (starting from the root or project directory) for all files and directories:

Sample Name: Sample name (…as used by CellRanger).
R1 FASTQ Path: The absolute path to the R1 FASTQ file for the sample.
R2 FASTQ Path: The absolute path to the R2 FASTQ file for the sample.
BAM Path: The absolute path to the BAM file for the sample.
DGE Count Matrix Path: The absolute path to the DGE count matrix file for the sample.
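
As a pre-flight check, the column layout above can be verified before launching the pipeline. The sketch below assumes a samplesheet with no header row and no commas inside paths. validate_samplesheet is a hypothetical helper written for this illustration, not part of SCALPEL.

```shell
# validate_samplesheet is a hypothetical helper, not part of SCALPEL.
# Assumes: no header row, exactly 5 comma-separated columns, no commas in paths.
# The function body is a subshell "( ... )" so its variables stay local.
validate_samplesheet() (
    status=0
    n=0
    # "|| [ -n "$sample" ]" also processes a last line that lacks a newline.
    while IFS=, read -r sample r1 r2 bam dge extra || [ -n "$sample" ]; do
        n=$((n + 1))
        # Fewer than 5 columns leaves dge empty; more than 5 fills extra.
        if [ -z "$dge" ] || [ -n "$extra" ]; then
            echo "line $n: expected 5 comma-separated columns"
            status=1
            continue
        fi
        # Columns 2 to 5 must point at existing files.
        for f in "$r1" "$r2" "$bam" "$dge"; do
            [ -e "$f" ] || { echo "line $n ($sample): file not found: $f"; status=1; }
        done
    done < "$1"
    exit "$status"
)

# Usage (prints nothing and returns 0 when every row is well-formed):
#   validate_samplesheet samplesheet.csv
```

The check only looks at the number of columns and at file existence; it does not inspect the content of the FASTQ, BAM, or DGE files.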

Here, a samplesheet.csv for two samples would look like:

> cat samplesheet.csv
hiPSCs,reduced_GSE268222/fastq_files/hiPSCs_2000000_lib_16058AAB_CTCTCTAC_read1_001.fastq.gz,reduced_GSE268222/fastq_files/hiPSCs_2000000_lib_16058AAB_CTCTCTAC_read2_001.fastq.gz,reduced_GSE268222/dropseq_files/hiPSCs.bam,reduced_GSE268222/dropseq_files/hiPSCs.counts.txt
NPCs,reduced_GSE268222/fastq_files/NPCs_2000000_lib_16057AAB_TAGGCATG_read1_001.fastq.gz,reduced_GSE268222/fastq_files/NPCs_2000000_lib_16057AAB_TAGGCATG_read2_001.fastq.gz,reduced_GSE268222/dropseq_files/NPCs.bam,reduced_GSE268222/dropseq_files/NPCs.counts.txt
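
Rows like these can also be generated rather than typed by hand. The following is a minimal sketch, assuming the directory layout of reduced_GSE268222 shown earlier (FASTQ names containing _read1_ and _read2_, and <sample>.bam and <sample>.counts.txt under dropseq_files). first_match and make_samplesheet are hypothetical helpers, not SCALPEL commands.

```shell
# first_match and make_samplesheet are hypothetical helpers, not SCALPEL commands.

# Print the first existing path among the arguments (used on glob expansions).
first_match() {
    for candidate in "$@"; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# make_samplesheet DATA_DIR SAMPLE...
# Prints one row per sample: name, R1, R2, BAM, DGE matrix, with absolute paths.
# The function body is a subshell "( ... )" so its variables stay local.
make_samplesheet() (
    data_dir=$(cd "$1" && pwd) || exit 1   # resolve DATA_DIR to an absolute path
    shift
    for sample in "$@"; do
        r1=$(first_match "$data_dir"/fastq_files/"$sample"_*_read1_*.fastq.gz) || exit 1
        r2=$(first_match "$data_dir"/fastq_files/"$sample"_*_read2_*.fastq.gz) || exit 1
        printf '%s,%s,%s,%s,%s\n' "$sample" "$r1" "$r2" \
            "$data_dir/dropseq_files/$sample.bam" \
            "$data_dir/dropseq_files/$sample.counts.txt"
    done
)

# Usage:
#   make_samplesheet reduced_GSE268222 hiPSCs NPCs > samplesheet.csv
```

The function stops with a non-zero status if the data directory or a FASTQ file for a sample cannot be found, so an incomplete samplesheet is easier to notice.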

Optionally, you can provide a whitelist_barcodes.csv file that gives the path to a curated barcode file for each sample, passed with the [--barcodes] argument. This allows SCALPEL to use custom barcode lists during processing. For the current analysis, we use only a subset of 100 barcodes for each sample, provided in the reduced_GSE268222 directory:

> cat whitelist_barcodes.csv
hiPSCs,reduced_GSE268222/hiPSCs_barcodes.txt
NPCs,reduced_GSE268222/NPCs_barcodes.txt
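
To confirm that each whitelist holds the expected number of barcodes, the entries can be counted per sample. This is a sketch under the assumption that each barcode file lists one barcode per line; the format is not described here, so treat that as a guess. count_barcodes is a hypothetical helper, not part of SCALPEL.

```shell
# count_barcodes is a hypothetical helper, not part of SCALPEL.
# Assumes whitelist_barcodes.csv rows of the form "sample,path" and
# barcode files with one barcode per line (an assumption).
# The function body is a subshell "( ... )" so its variables stay local.
count_barcodes() (
    while IFS=, read -r sample path || [ -n "$sample" ]; do
        # grep -c . counts the non-empty lines of the barcode file.
        printf '%s,%s\n' "$sample" "$(grep -c . "$path")"
    done < "$1"
)

# Usage:
#   count_barcodes whitelist_barcodes.csv
```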

Execution

To execute SCALPEL with the provided samplesheet and optional barcode whitelist, use the following command:

nextflow run -resume SCALPEL/main.nf \
    --sequencing dropseq \
    --samplesheet samplesheet.csv \
    --barcodes whitelist_barcodes.csv \
    --outputDir scalpel_output \
    --transcriptome gencode.v41.transcripts.fa.gz \
    --gtf reduced_GENCODE.v41.annotation.gtf.gz \
    --ipdb hg38_ipriming_sites.bed.tar.gz \
    -with-conda SCALPEL/requirements.yml \
    --cpus 32

Warning

Ensure that chromosome naming is consistent across your reference and input files. For example, some files use ‘chr1’, ‘chr2’, …, ‘chrM’, while others use ‘1’, ‘2’, …, ‘MT’. Mismatches in chromosome names between the GTF and BAM files can lead to errors or missing data in downstream steps. Verify that the GTF and all BAM files use the same chromosome naming scheme, and convert them if necessary.
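
One way to check this is to compare the sequence names in the first column of the GTF with the @SQ entries of the BAM header. The sketch below assumes samtools is installed; gtf_chroms, bam_chroms, and missing_in_bam are hypothetical helper names, not SCALPEL commands.

```shell
# gtf_chroms, bam_chroms and missing_in_bam are hypothetical helpers.

# Print the unique sequence names (column 1) of a plain or gzipped GTF.
gtf_chroms() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F'\t' '!/^#/ && NF > 1 { print $1 }' | sort -u
}

# Print the unique sequence names declared in the @SQ lines of a BAM header.
# Requires samtools on the PATH.
bam_chroms() {
    samtools view -H "$1" |
        awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' |
        sort -u
}

# missing_in_bam GTF NAMES_FILE
# Print GTF sequence names absent from NAMES_FILE (one name per line, sorted).
missing_in_bam() {
    gtf_chroms "$1" | comm -23 - "$2"
}

# Usage:
#   bam_chroms reduced_GSE268222/dropseq_files/hiPSCs.bam > bam_names.txt
#   missing_in_bam reduced_GENCODE.v41.annotation.gtf.gz bam_names.txt
```

Empty output from missing_in_bam means every chromosome name used in the GTF also appears in the BAM header. Names such as ‘chr1’ in the output, with only ‘1’ in the BAM, point to the mismatch described above.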

Results

nextflow run -resume ~/Projects/SCALPEL/main.nf --samplesheet samplesheet.csv --sequencing dropseq --barcodes wht_barcodes.txt --outputDir scalpel_output --transcriptome ~/Projects/scalpel_wiki2/datas/gencode.v41.transcripts.fa.gz --gtf ~/Projects/scalpel_wiki2/datas/reduced_GENCODE.v41.annotation.gtf.gz --ipdb ~/Projects/scalpel_wiki2/datas/hg38_ipriming_sites.bed.tar.gz -with-conda ~/Projects/SCALPEL/requirements.yml

 N E X T F L O W   ~  version 25.04.7

Launching `/Users/franz/Projects/SCALPEL/main.nf` [lonely_neumann] DSL2 - revision: 007d8bbf88

===============================
SCALPEL - N F   P I P E L I N E
===============================
Author: PLASS Lab ; Franz AKE
*****************
P-CMRC - Barcelona, SPAIN

input files:
- Annotation required files(required):
    - transcriptome reference [--transcriptome]: /Users/franz/Projects/scalpel_wiki2/datas/gencode.v41.transcripts.fa.gz
    - annotation GTF reference [--gtf]: /Users/franz/Projects/scalpel_wiki2/datas/reduced_GENCODE.v41.annotation.gtf.gz
    - internal priming annotation [--ipdb]: /Users/franz/Projects/scalpel_wiki2/datas/hg38_ipriming_sites.bed.tar.gz


- Reads processing files (required):
    - samplesheet [--samplesheet]: samplesheet.csv

- Params:
    Required:
    - sequencing type (required): dropseq
    Optional:
    - barcodes whitelist [--barcodes] (optional): wht_barcodes.txt
    - cell clusters annotation [--clusters] (optional): null
    - transcriptomic distance threshold [--dt_threshold] (optional, default 600bp): 600
    - transcriptomic end distance threshold [--de_threshold] (optional, default 30bp): 30
    - minimal distance of internal priming sites (IP) from isoform 3'ends [--ip_threshold] (optional, 60nuc): 60
    - gene fraction abundance threshold [--gene_fraction] (optional, default '98%'): 98%
    - binsize threshold for transcriptomic distance based probability [--binsize] (optional, default '20): 20
    - output directory for the Nextflow workflow [--outputDir] (optional, default './results'): scalpel_output
    - reads subsampling threshold [--subsample] (optional, default 1): 1
    - cpus [--cpus] (optional, default 16): 16


executor >  local (262)
[24/30603a] annotation_preprocessing:salmon_transcriptome_indexing                    [100%] 1 of 1, cached: 1 ✔
[09/53e66f] annotation_preprocessing:salmon_bulk_quantification (hiPSCs)              [100%] 2 of 2 ✔
[83/4c7d2e] annotation_preprocessing:tpm_counts_average ([NPCs.sf, hiPSCs.sf])        [100%] 1 of 1 ✔
[1d/01aa7a] ann…n_preprocessing:isoform_selection_weighting (chr17, merge_quants.txt) [100%] 4 of 4 ✔
[05/bbafbd] reads_processing:samples_loading:bam_splitting (hiPSCs, chr17)            [100%] 8 of 8 ✔
[90/3f0803] reads_processing:bedfile_conversion (hiPSCs, chr17)                       [100%] 8 of 8 ✔
[e0/fee27f] rea…g:reads_mapping_and_filtering (hiPSCs, chr17, chr17.bed, chr17.exons) [100%] 8 of 8 ✔
[d3/7d5bc7] reads_processing:ip_splitting (chr17)                                     [100%] 4 of 4 ✔
[19/7e0981] reads_processing:ip_filtering (hiPSCs, chr17, chr17_sp.ipdb)              [100%] 8 of 8 ✔
[7e/64dcea] isoform_quantification:probability_distribution (NPCs)                    [100%] 2 of 2 ✔
[c0/93645d] isoform_quantification:fragment_probabilities (NPCs,chr12)                [100%] 8 of 8 ✔
[12/bfd956] isoform_quantification:cells_splitting (hiPSCs)                           [100%] 2 of 2 ✔
[1a/cacdea] isoform_quantification:em_algorithm (hiPSCs, TTTGTAGATCCA-1)              [100%] 200 of 200 ✔
[38/aa8d3a] isoform_quantification:cells_merging (NPCs)                               [100%] 2 of 2 ✔
[2a/b0125c] iso…:dge_generation (NPCs, NPCs_isoforms_quantified.txt, NPCs.counts.txt) [100%] 2 of 2 ✔
[89/b8af07] apa…ion:differential_isoform_usage ([hiPSCs_seurat.RDS, NPCs_seurat.RDS]) [100%] 1 of 1 ✔
[43/a7ff20] apa_characterization:generation_filtered_bams (NPCs)                      [100%] 2 of 2 ✔

Completed at: 22-Mar-2025 11:41:38
Duration    : 14m 12s
CPU hours   : 2.8 (15.5% cached)
Succeeded   : 262
Cached      : 1

After SCALPEL completes successfully, the output directory (scalpel_output) contains the results and the temporary run files:

iDGE_seurat.RDS:
The Seurat object containing the isoform expression data (iDGE) for all samples. This object can be used for downstream analysis and visualization in R.
DIU_table.csv:
A table summarizing the detected differential isoform usage (DIU) events, computed across the samples by default.
Runfiles:
A directory containing all the temporary run files generated during the analysis.
sampleXXX_seurat.RDS:
The Seurat object for each individual sample containing isoform expression data (iDGE).
sampleXXX_APADGE.txt:
The isoform expression count data (iDGE) for each sample.
sampleXXX_filtered.bam:
The filtered BAM file for each sample. Duplicated reads and reads associated with internal priming are discarded, so only reads passing the default SCALPEL filters remain.
sampleXXX_filtered.bam.bai:
The index file for the filtered BAM.

> tree -L 1 scalpel_output
scalpel_output
├── DIU_table.csv
├── hiPSCs_APADGE.txt
├── hiPSCs_filtered.bam
├── hiPSCs_filtered.bam.bai
├── hiPSCs_seurat.RDS
├── iDGE_seurat.RDS
├── NPCs_APADGE.txt
├── NPCs_filtered.bam
├── NPCs_filtered.bam.bai
├── NPCs_seurat.RDS
└── Runfiles

2 directories, 10 files
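
The listing above can be turned into a quick completeness check after a run. This is a minimal sketch based only on the file names shown in this vignette; check_outputs is a hypothetical helper, not part of SCALPEL.

```shell
# check_outputs is a hypothetical helper, not part of SCALPEL.
# check_outputs OUTPUT_DIR SAMPLE...
# Reports every expected output that is missing; returns 1 if any is missing.
# The function body is a subshell "( ... )" so its variables stay local.
check_outputs() (
    outdir=$1
    shift
    status=0
    # Outputs shared by all samples.
    for name in DIU_table.csv iDGE_seurat.RDS Runfiles; do
        [ -e "$outdir/$name" ] || { echo "missing: $outdir/$name"; status=1; }
    done
    # Per-sample outputs, named <sample><suffix>.
    for sample in "$@"; do
        for suffix in _APADGE.txt _filtered.bam _filtered.bam.bai _seurat.RDS; do
            [ -e "$outdir/$sample$suffix" ] || { echo "missing: $outdir/$sample$suffix"; status=1; }
        done
    done
    exit "$status"
)

# Usage:
#   check_outputs scalpel_output hiPSCs NPCs
```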

Downstream analysis

For downstream analysis of the SCALPEL output files, refer to SCALPEL_downstream_analysis.

email: fake@idibell.cat