Linux / amd64
DeepSAP is a transformer-based workflow designed to enhance splice junction detection in RNA-seq data. By default, DeepSAP uses the highly sensitive GSNAP TGGA aligner for FASTQ inputs. Alternatively, it can directly process pre-aligned BAM files generated by GSNAP.
We evaluated DeepSAP in our article titled DeepSAP: Improved RNA-Seq Alignment by Integrating Transcriptome Guidance with Transformer-Based Splice Junction Scoring. In our benchmark on the Baruzzo et al. datasets, DeepSAP achieved consistently strong results across all evaluated metrics.
For additional resources, including data, detailed analyses, and other supplementary materials related to the DeepSAP paper, please refer to the DeepSAP GitHub repository.
--batch: Larger batch sizes significantly improve throughput but require more GPU memory.
--fp16: Half-precision floating point (--fp16, enabled by default) reduces VRAM usage by nearly half and speeds up computation on compatible GPUs. To disable this feature, add the --no-fp16 flag to your command.
The figures below assume --fp16 is enabled. Disabling it will approximately double the memory requirement for any given batch size.
Batch Size (--batch) | Approximate VRAM Usage (with --fp16) |
---|---|
64 | ~1.2 GB |
128 | ~1.6 GB |
256 | ~2.2 GB |
2048 | ~10.4 GB |
8192 | ~39.5 GB |
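The table above can be turned into a small helper for choosing a --batch value. The following is a minimal bash sketch, not part of DeepSAP: the function name pick_batch is hypothetical, the thresholds are copied from the table with 1 GB taken as 1000 MB, and the doubling without fp16 follows the approximation stated above. It only picks among the tabulated batch sizes, so it is conservative for values in between.

```shell
#!/usr/bin/env bash
# pick_batch VRAM_MB [on|off]
# Hypothetical helper: prints the largest tabulated batch size whose
# approximate VRAM need fits in VRAM_MB. Second argument is the fp16
# state ("on" by default, "off" when running with --no-fp16).
pick_batch() {
  local vram_mb=$1 fp16=${2:-on}
  local best="" row batch need
  # batch:approx_MB pairs taken from the table (assumes 1 GB = 1000 MB).
  for row in 64:1200 128:1600 256:2200 2048:10400 8192:39500; do
    batch=${row%%:*}
    need=${row##*:}
    # Assumption from the text: without fp16 the need roughly doubles.
    if [ "$fp16" = "off" ]; then
      need=$((need * 2))
    fi
    if [ "$need" -le "$vram_mb" ]; then
      best=$batch
    fi
  done
  if [ -z "$best" ]; then
    echo "no tabulated batch size fits in ${vram_mb} MB" >&2
    return 1
  fi
  echo "$best"
}

# Example (requires an NVIDIA driver; nvidia-smi reports MiB, and total
# memory is an upper bound on what is actually free):
#   pick_batch "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)"
```

The result can then be passed to DeepSAP through the --batch argument. Leaving some headroom below the reported GPU memory is advisable, since the table values are approximate.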
This guide demonstrates how to quickly test DeepSAP's functionality using the malaria_short_pe dataset. Follow these steps to set up your environment and run DeepSAP:
This step downloads the latest DeepSAP Docker container and all required reference files and test sequencing data.
# Pull the DeepSAP Parabricks Docker image
docker pull nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest
# Download reference genome and annotation files
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa
# Download downsampled FASTQ sequence reads (10K) from DeepSAP GitHub
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_1.fastq.gz
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_2.fastq.gz
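Before starting a run, it can help to confirm that all four downloads completed. The sketch below is an optional sanity check, not part of DeepSAP: check_inputs is a hypothetical function name, and it only verifies that each file exists, is non-empty, and (for the FASTQ files) passes a gzip integrity test.

```shell
#!/usr/bin/env bash
# check_inputs DIR
# Hypothetical helper: verifies the four test files downloaded above.
# Returns non-zero if any file is missing, empty, or a corrupt gzip.
check_inputs() {
  local dir=$1 status=0 f
  for f in \
    Plasmodium_falciparum.ASM276v2.60.gtf \
    Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
    SRR14793977_10K_1.fastq.gz \
    SRR14793977_10K_2.fastq.gz
  do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      status=1
    elif [[ $f == *.gz ]] && ! gzip -t "$dir/$f" 2>/dev/null; then
      echo "not a valid gzip file: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage, from the directory where the wget commands were run:
#   check_inputs test/malaria_short_pe && echo "inputs look complete"
```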
This command executes DeepSAP with the downloaded test dataset. Since the --gsnap_idx parameter is not specified, DeepSAP will automatically generate the GSNAP index required for alignment as part of this run.
# Run DeepSAP with the test dataset (GSNAP index will be generated)
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm \
--volume $(pwd)/test:/workdir \
--volume $(pwd)/test/outputdir:/outputdir \
nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest \
--out /outputdir/ \
--prefix test_run_10K \
--mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz \
--mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz \
--gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf \
--fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa
If you have already generated a GSNAP index (e.g., from a previous DeepSAP run or a separate gmap_build command), you can provide its path using the --gsnap_idx parameter. This instructs DeepSAP to reuse the existing index instead of generating a new one.
# Run DeepSAP with the test dataset and pre-generated GSNAP index
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm \
--volume $(pwd)/test:/workdir \
--volume $(pwd)/test/outputdir:/outputdir \
nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest \
--out /outputdir/ \
--prefix test_run_10K \
--mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz \
--mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz \
--gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf \
--fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
--gsnap_idx /outputdir/gsnap_idx/
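The two commands above differ only in the --gsnap_idx argument, so a script can decide at run time whether to pass it. The following bash sketch is one way to do that, under assumptions: gsnap_idx_args is a hypothetical helper, and it treats any non-empty index directory on the host as reusable without checking its contents. The host path test/outputdir/gsnap_idx mirrors the container path used in the example above and may differ in your setup.

```shell
#!/usr/bin/env bash
# gsnap_idx_args HOST_IDX_DIR CONTAINER_IDX_DIR
# Hypothetical helper: prints "--gsnap_idx CONTAINER_IDX_DIR" when
# HOST_IDX_DIR exists and is non-empty, and prints nothing otherwise,
# so DeepSAP falls back to building the index itself.
gsnap_idx_args() {
  local host_dir=$1 container_dir=$2
  if [ -d "$host_dir" ] && [ -n "$(ls -A "$host_dir" 2>/dev/null)" ]; then
    printf '%s %s\n' --gsnap_idx "$container_dir"
  fi
}

# Usage: append the (intentionally unquoted) substitution as the last
# argument of the docker run command shown above, for example:
#   ... --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
#   $(gsnap_idx_args "$(pwd)/test/outputdir/gsnap_idx" /outputdir/gsnap_idx/)
```

Leaving the substitution unquoted lets the shell split it into the flag and its value, or drop it entirely when no index is found.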
[2025-07-18 12:51:27] [LOG] Running DeepSAP v0.0.3
[2025-07-18 12:51:32] [LOG] Running GSNAP
[2025-07-18 12:51:32] [LOG] Building GSNAP TGGA index
[2025-07-18 12:52:44] [LOG] Running GSNAP TGGA
[2025-07-18 12:52:46] [LOG] Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:46] [LOG] Parsing GTF file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf'
[2025-07-18 12:52:47] [LOG] Transcript information:
Number of transcripts: 5767
Shortest transcript: 67 EPT00050203058
Longest transcript: 30863 CAG25094
Transcripts length mean: 2456.79
Transcripts length median: 1618
Transcripts length mode: 71
Shortest intron: 1 PF3D7_1478200: 14__-__3219919__3220323 -> 14__-__3220325__3220534
Longest intron: 2425 CZU00099: 14__+__1639681__1639728 -> 14__+__1642154__1642455
Introns length mean: 163.03
Introns length median: 141.0
Introns length mode: 1
Number of multi exons transcripts: 3064 53.13%
Number of mono exon transcripts: 2703 46.87%
Type of transcripts:
BioType Count Percentage
0 protein_coding 5358 92.91
1 pseudogene 153 2.65
3 ncRNA 102 1.77
4 tRNA 79 1.37
5 rRNA 44 0.76
7 sRNA 17 0.29
6 snRNA 10 0.17
2 nontranslating_CDS 4 0.07
[2025-07-18 12:52:47] [LOG] Collecting splice junctions from GTF
[2025-07-18 12:52:47] [LOG] Collecting splice junctions in mode=NotStrict and window=150
[2025-07-18 12:52:47] [LOG] Collecting splice junctions from transcript types: All
Number of duplicated junctions: 328
Number of short junctions (intron): 0
Number of short junctions (donor): 0
Number of short junctions (acceptor): 0
Number of junctions contains N: 0
Number of accepted junctions: 8764
The First 10 Splicing Signals Types:
Signal Forward Reverse Percentage
GTAG 4096 4431 97.30
AAAA 18 17 0.40
TATA 12 8 0.23
GCAG 9 9 0.21
TTTT 6 9 0.17
ATAT 4 7 0.13
GAGA 5 6 0.13
AGAG 3 6 0.10
TATT 3 6 0.10
TAAT 4 5 0.10
[2025-07-18 12:52:47] [LOG] Collecting splice junctions from SAM/BAM file '/outputdir/test_run_10K_gsnap.bam'
[2025-07-18 12:52:47] [INFO] Sense junctions 518
[2025-07-18 12:52:47] [INFO] Antisense junctions 551
[2025-07-18 12:52:47] [INFO] Total number of reads 20479
[2025-07-18 12:52:47] [INFO] Total number of spliced reads 2233 10.903852727183946%
[2025-07-18 12:52:47] [LOG] Finished parsing a SAM file, len(found_junctions_table)= 1069
[2025-07-18 12:52:47] [LOG] Generating splice-junction prediction dataset batch: 1
[2025-07-18 12:52:47] [LOG] Writting dev.csv file for predicting into '/outputdir/test_run_10K_prediction_batch_1/'
[2025-07-18 12:52:47] [LOG] dev.csv file contains: 0: 1069, 1: 1069
[2025-07-18 12:52:47] [LOG] Predicting found splice junctions using DNABERT MS150
100%|██████████| 67/67 [00:01<00:00, 58.23it/s]
[2025-07-18 12:52:51] [LOG] Generating genome regions
[2025-07-18 12:52:51] [LOG] Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:53] [LOG] Finished writing BAM successfully into '/outputdir/test_run_10K'
[2025-07-18 12:52:53] [LOG] Number of SAM records: 20479
[2025-07-18 12:52:53] [LOG] Number of reads IDs: 12644
[2025-07-18 12:52:53] [LOG] Number of processed reads IDs: 1405 11.11%
[2025-07-18 12:52:54] [LOG] Finished successfuly
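For scripted runs, the key figures can be pulled out of a saved copy of this log. The sketch below assumes the log was captured to a file (for example by appending `2>&1 | tee run.log` to the docker command) and that the message wording matches the v0.0.3 output shown above, including the spelling of the final line; summarize_log is a hypothetical helper and other DeepSAP versions may log differently.

```shell
#!/usr/bin/env bash
# summarize_log LOGFILE
# Hypothetical helper: prints total reads, spliced reads, and whether
# the run reached its final log line. Returns non-zero if it did not.
summarize_log() {
  awk '
    # "... Total number of reads 20479": the count is the last field.
    /Total number of reads/         { total = $NF }
    # "... Total number of spliced reads 2233 10.9%": count, then percent.
    /Total number of spliced reads/ { spliced = $(NF - 1) }
    # Final line, spelled exactly as DeepSAP v0.0.3 prints it.
    /Finished successfuly/          { ok = 1 }
    END {
      printf "reads=%s spliced=%s finished=%s\n", total, spliced, (ok ? "yes" : "no")
      exit (ok ? 0 : 1)
    }
  ' "$1"
}

# Usage:
#   summarize_log run.log || echo "DeepSAP did not finish" >&2
```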
Argument | Description | Required |
---|---|---|
-o, --out | Path to the output folder | Yes |
--prefix | Output files prefix string | Yes |
-g, --gtf | Path to the GTF annotation file compatible with the BAM file | Yes |
-f, --fasta | Path to the FASTA genome file compatible with the BAM file | Yes |
-s, --sam | Path to the SAM/BAM file or directory of files | Yes (if BAM) |
--mate_1 | Path to FASTQ file of mate 1 (for paired-end reads) | Yes (if FASTQ) |
--mate_2 | Path to FASTQ file of mate 2 (for paired-end reads) | Yes (if FASTQ) |
--gsnap_idx | Path to GSNAP index | No |
-c, --config | Config .json file to control DeepSAP internal parameters | No |
--batch | Batch size for inference | No |
--no-fp16 | Disable fp16 half-precision floating point | No |
--set_size | Set size to split datasets for inference | No |
-t, --threads | Number of threads | No |
--score_reads | Also classify reads using the transformer model and add scores to the SAM output, as opposed to scoring only splice junctions | No |
--n_reads | Number of reads to classify if --score_reads is used | No |
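The table implies two input modes: FASTQ input through --mate_1 and --mate_2, or a pre-aligned GSNAP BAM through --sam. As an illustration of the BAM mode, the bash sketch below assembles and prints (but does not run) a command that swaps the mate arguments for --sam. The function name deepsap_bam_cmd is hypothetical, the volume mounts simply reuse the layout of the test commands, and the exact behavior of --sam should be confirmed against the DeepSAP documentation.

```shell
#!/usr/bin/env bash
# deepsap_bam_cmd BAM GTF FASTA PREFIX
# Hypothetical helper: prints a docker command for BAM input mode.
# All paths are as seen inside the container (/workdir, /outputdir).
deepsap_bam_cmd() {
  local bam=$1 gtf=$2 fasta=$3 prefix=$4
  local cmd=(
    docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm
    --volume "$PWD/test:/workdir"
    --volume "$PWD/test/outputdir:/outputdir"
    nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest
    --out /outputdir/
    --prefix "$prefix"
    --sam "$bam"        # replaces --mate_1/--mate_2
    --gtf "$gtf"
    --fasta "$fasta"
  )
  # %q quotes each word so the printed line can be pasted into a shell.
  printf '%q ' "${cmd[@]}"
  echo
}

# Usage, reusing the GSNAP BAM that the test run's log reports
# under /outputdir:
#   deepsap_bam_cmd /outputdir/test_run_10K_gsnap.bam \
#     /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf \
#     /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
#     test_run_10K_from_bam
```

Printing the command first makes it easy to review the arguments before committing GPU time to the run.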
2025-04-19.
By pulling and using the Parabricks container, you accept the governing terms: The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); except for the model, which is governed by the NVIDIA Models Community License Agreement (found at NVIDIA Community Model License). ADDITIONAL INFORMATION: Apache 2.0.