DeepSAP

Description: DeepSAP is a transformer-based workflow designed to enhance splice junction detection in RNA-seq data.
Publisher: NVIDIA
Latest Tag: v0.0.3
Modified: July 19, 2025
Compressed Size: 18.01 GB
Multinode Support: No
Multi-Arch Support: No
Platform: Linux / amd64

DeepSAP is a transformer-based workflow designed to enhance splice junction detection in RNA-seq data. By default, DeepSAP utilizes the highly sensitive GSNAP TGGA aligner for FASTQ inputs. Alternatively, it can also process pre-aligned BAM files generated by GSNAP directly.

We evaluated DeepSAP in our article, "DeepSAP: Improved RNA-Seq Alignment by Integrating Transcriptome Guidance with Transformer-Based Splice Junction Scoring." In that benchmark, DeepSAP achieved consistently outstanding results across all evaluated metrics on the Baruzzo et al. datasets.

For additional resources, including data, detailed analyses, and other supplementary materials related to the DeepSAP paper, please refer to the DeepSAP GitHub repository.

Table of Contents

  • Requirements
  • Usage
  • Command-line Arguments
  • Version History
  • License/Terms of Use

Requirements

System Software:

  • Docker with GPU support

System Hardware:

  • CPU: 8 cores or more recommended.
  • System RAM: 32 GB minimum for human genome-sized references.

GPU Hardware:

  • VRAM: 16 GB minimum.
  • Note on GPU VRAM: The VRAM requirement depends heavily on two parameters:
    • --batch: Larger batch sizes significantly improve throughput but require more GPU memory.
    • --fp16: Half-precision floating point (--fp16, enabled by default) cuts VRAM usage nearly in half and speeds up computation on compatible GPUs. To disable it, add the --no-fp16 flag to your command.
    • The VRAM usage estimates below assume --fp16 is enabled. Passing --no-fp16 approximately doubles the memory requirement for any given batch size.

      Batch Size (--batch)    Approximate VRAM Usage (with --fp16)
      64                      ~1.2 GB
      128                     ~1.6 GB
      256                     ~2.2 GB
      2048                    ~10.4 GB
      8192                    ~39.5 GB
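
The table above can serve as a simple lookup when choosing --batch for a particular GPU. The following is a minimal sketch, not part of DeepSAP: pick_batch is a hypothetical helper that returns the largest tabulated batch size whose approximate VRAM usage fits a budget given in GB, and doubles each estimate when fp16 is disabled, as described above. The figures are the approximate values from the table, stored in tenths of a GB so the comparison stays in integer arithmetic.

```shell
# Hypothetical helper (not part of DeepSAP): choose the largest tabulated
# --batch value whose approximate VRAM usage fits the given budget.
# Usage: pick_batch <vram_budget_gb> [fp16|no-fp16]
pick_batch() {
    local budget_gb="$1"
    local mode="${2:-fp16}"
    # batch:usage pairs from the table above, usage in tenths of a GB (with fp16)
    local table="64:12 128:16 256:22 2048:104 8192:395"
    local budget_tenths entry batch need best=""

    # Convert the budget to tenths of a GB; the small epsilon guards against
    # floating-point values such as 15.999999 being truncated downwards.
    budget_tenths=$(awk -v g="$budget_gb" 'BEGIN { printf "%d", g * 10 + 0.000001 }')

    for entry in $table; do
        batch="${entry%%:*}"
        need="${entry##*:}"
        # Without fp16 the estimate roughly doubles
        if [ "$mode" = "no-fp16" ]; then
            need=$((need * 2))
        fi
        # The table is in ascending order, so the last fit is the largest
        if [ "$need" -le "$budget_tenths" ]; then
            best="$batch"
        fi
    done

    if [ -z "$best" ]; then
        echo "no tabulated batch size fits ${budget_gb} GB" >&2
        return 1
    fi
    echo "$best"
}
```

For example, pick_batch 16 prints 2048, while pick_batch 16 no-fp16 prints 256. Because the table values are approximate, it is sensible to leave some headroom rather than use the full VRAM of the card as the budget.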

Input Data:

  • RNA-seq reads in FASTQ format.
  • Reference file in FASTA format.
  • Annotation file in GTF format.
  • Optionally, a path to a GSNAP index.

Usage

This guide demonstrates how to quickly test DeepSAP's functionality using the malaria_short_pe dataset. Follow these steps to set up your environment and run DeepSAP:

Step 1: Prepare Environment and Download Test Data

This step downloads the latest DeepSAP Docker container and all required reference files and test sequencing data.

# Pull the DeepSAP Parabricks Docker image
docker pull nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest

# Download reference genome and annotation files
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa

# Download downsampled FASTQ sequence reads (10K) from DeepSAP GitHub
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_1.fastq.gz
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_2.fastq.gz
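
Before moving on, it can help to confirm that the four downloads completed. The following is a minimal sketch, assuming the files were saved under test/malaria_short_pe/ as in the commands above; check_inputs is a hypothetical helper, not something shipped with DeepSAP. It checks that each file exists and is non-empty, and runs gzip -t on the compressed FASTQ files.

```shell
# Hypothetical helper (not part of DeepSAP): verify the downloaded test files.
# Usage: check_inputs [directory]   (defaults to test/malaria_short_pe)
check_inputs() {
    local dir="${1:-test/malaria_short_pe}"
    local f
    local status=0
    for f in \
        Plasmodium_falciparum.ASM276v2.60.gtf \
        Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
        SRR14793977_10K_1.fastq.gz \
        SRR14793977_10K_2.fastq.gz
    do
        if [ ! -s "$dir/$f" ]; then
            # File is absent or has zero length
            echo "missing or empty: $dir/$f" >&2
            status=1
        elif [ "${f##*.}" = "gz" ] && ! gzip -t "$dir/$f" 2>/dev/null; then
            # gzip -t tests archive integrity without extracting
            echo "corrupt gzip: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_inputs with no arguments from the directory where the wget commands were run returns 0 only if all four files pass.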

Step 2: Run DeepSAP (Initial Run - Index Generation)

This command executes DeepSAP with the downloaded test dataset. Since the --gsnap_idx parameter is not specified, DeepSAP will automatically generate the GSNAP index required for alignment as part of this run.

# Run DeepSAP with the test dataset (GSNAP index will be generated)
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm              \
    --volume $(pwd)/test:/workdir                                                   \
    --volume $(pwd)/test/outputdir:/outputdir                                       \
    nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest                            \
    --out /outputdir/                                                               \
    --prefix test_run_10K                                                           \
    --mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz                   \
    --mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz                   \
    --gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf           \
    --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa

Step 3: Run DeepSAP with a Pre-existing GSNAP Index

If you have already generated a GSNAP index (e.g., from a previous DeepSAP run or a separate gmap_build command), you can provide its path using the --gsnap_idx parameter. This will instruct DeepSAP to reuse the existing index instead of generating a new one.

# Run DeepSAP with the test dataset and pre-generated GSNAP index
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm              \
    --volume $(pwd)/test:/workdir                                                   \
    --volume $(pwd)/test/outputdir:/outputdir                                       \
    nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest                            \
    --out /outputdir/                                                               \
    --prefix test_run_10K                                                           \
    --mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz                   \
    --mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz                   \
    --gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf           \
    --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
    --gsnap_idx /outputdir/gsnap_idx/
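
DeepSAP can also start from a BAM file already aligned by GSNAP, using the -s/--sam argument in place of --mate_1 and --mate_2. The following is a sketch under stated assumptions rather than a tested recipe: it assumes the GSNAP BAM from the earlier run exists at /outputdir/test_run_10K_gsnap.bam inside the container (the path reported in the log output), and run_deepsap_bam and DRY_RUN are hypothetical conveniences that are not part of DeepSAP.

```shell
# Sketch only: run DeepSAP on a pre-aligned GSNAP BAM.
# run_deepsap_bam and DRY_RUN are hypothetical, not DeepSAP features.
# Usage: run_deepsap_bam <bam_path_in_container> <output_prefix>
run_deepsap_bam() {
    local bam="$1"
    local prefix="$2"
    # Build the command as positional parameters so quoting is preserved
    set -- docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm \
        --volume "$(pwd)/test:/workdir" \
        --volume "$(pwd)/test/outputdir:/outputdir" \
        nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest \
        --out /outputdir/ \
        --prefix "$prefix" \
        --sam "$bam" \
        --gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf \
        --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the assembled command instead of executing it
        echo "$*"
    else
        "$@"
    fi
}

# Example (prefix chosen so the earlier outputs are not overwritten):
# DRY_RUN=1 run_deepsap_bam /outputdir/test_run_10K_gsnap.bam test_run_10K_from_bam
```

Setting DRY_RUN=1 prints the assembled command so it can be reviewed before any GPU time is spent.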

DeepSAP Expected Output

[2025-07-18 12:51:27]   [LOG]   Running DeepSAP v0.0.3
[2025-07-18 12:51:32]   [LOG]   Running GSNAP
[2025-07-18 12:51:32]   [LOG]   Building GSNAP TGGA index
[2025-07-18 12:52:44]   [LOG]   Running GSNAP TGGA 
[2025-07-18 12:52:46]   [LOG]   Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:46]   [LOG]   Parsing GTF file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf'
[2025-07-18 12:52:47]   [LOG]   Transcript information: 
Number of transcripts:             5767
Shortest transcript:               67   EPT00050203058
Longest transcript:                30863        CAG25094
Transcripts length mean:           2456.79
Transcripts length median:         1618
Transcripts length mode:           71
Shortest intron:                   1    PF3D7_1478200: 14__-__3219919__3220323 -> 14__-__3220325__3220534
Longest intron:                    2425 CZU00099: 14__+__1639681__1639728 -> 14__+__1642154__1642455
Introns length mean:               163.03
Introns length median:             141.0
Introns length mode:               1
Number of multi exons transcripts: 3064 53.13%
Number of mono exon transcripts:   2703 46.87%

Type of transcripts:
              BioType  Count  Percentage
0      protein_coding   5358       92.91
1          pseudogene    153        2.65
3               ncRNA    102        1.77
4                tRNA     79        1.37
5                rRNA     44        0.76
7                sRNA     17        0.29
6               snRNA     10        0.17
2  nontranslating_CDS      4        0.07
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from GTF
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions in mode=NotStrict and window=150
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from transcript types: All
Number of duplicated junctions:        328
Number of short junctions (intron):    0
Number of short junctions (donor):     0
Number of short junctions (acceptor):  0
Number of junctions contains N:        0
Number of accepted junctions:          8764
The First 10 Splicing Signals Types: 
Signal  Forward  Reverse  Percentage
  GTAG     4096     4431       97.30
  AAAA       18       17        0.40
  TATA       12        8        0.23
  GCAG        9        9        0.21
  TTTT        6        9        0.17
  ATAT        4        7        0.13
  GAGA        5        6        0.13
  AGAG        3        6        0.10
  TATT        3        6        0.10
  TAAT        4        5        0.10
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from SAM/BAM file '/outputdir/test_run_10K_gsnap.bam'
[2025-07-18 12:52:47]   [INFO]  Sense junctions 518
[2025-07-18 12:52:47]   [INFO]  Antisense junctions 551
[2025-07-18 12:52:47]   [INFO]  Total number of reads 20479
[2025-07-18 12:52:47]   [INFO]  Total number of spliced reads 2233 10.903852727183946%
[2025-07-18 12:52:47]   [LOG]   Finished parsing a SAM file, len(found_junctions_table)= 1069
[2025-07-18 12:52:47]   [LOG]   Generating splice-junction prediction dataset batch: 1
[2025-07-18 12:52:47]   [LOG]   Writting dev.csv file for predicting into '/outputdir/test_run_10K_prediction_batch_1/'
[2025-07-18 12:52:47]   [LOG]   dev.csv file contains:   0: 1069, 1: 1069
[2025-07-18 12:52:47]   [LOG]   Predicting found splice junctions using DNABERT MS150
100%|██████████| 67/67 [00:01<00:00, 58.23it/s]
[2025-07-18 12:52:51]   [LOG]   Generating genome regions 
[2025-07-18 12:52:51]   [LOG]   Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:53]   [LOG]   Finished writing BAM successfully into '/outputdir/test_run_10K'
[2025-07-18 12:52:53]   [LOG]   Number of SAM records: 20479 
[2025-07-18 12:52:53]   [LOG]   Number of reads IDs:   12644 
[2025-07-18 12:52:53]   [LOG]   Number of processed reads IDs: 1405  11.11% 

[2025-07-18 12:52:54]   [LOG]   Finished successfuly
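
To pull the headline read counts out of a log like the one above, a short awk filter is enough. The following is a minimal sketch, assuming the container output was saved to a file, for example by appending 2>&1 | tee deepsap.log to the docker run command; the file name deepsap.log and the helper spliced_summary are hypothetical. It relies only on the two [INFO] lines "Total number of reads" and "Total number of spliced reads" shown in the log.

```shell
# Hypothetical helper (not part of DeepSAP): summarize spliced-read counts
# from a saved DeepSAP log. Reads the named files, or stdin if none are given.
spliced_summary() {
    awk '
        # e.g. "... Total number of spliced reads 2233 10.90...%":
        # the count is the second-to-last field
        /Total number of spliced reads/ { spliced = $(NF - 1); next }
        # e.g. "... Total number of reads 20479": the count is the last field
        /Total number of reads/         { total = $NF }
        END {
            if (total > 0)
                printf "reads=%d spliced=%d fraction=%.2f%%\n", total, spliced, 100 * spliced / total
            else
                exit 1   # no read counts found in the input
        }
    ' "$@"
}

# Example:
# spliced_summary deepsap.log
```

For the log shown above, this prints reads=20479 spliced=2233 fraction=10.90%.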

Command-line Arguments

Argument        Required        Description
-o, --out       Yes             Path to the output folder
--prefix        Yes             Output file prefix string
-g, --gtf       Yes             Path to the GTF annotation file compatible with the BAM file
-f, --fasta     Yes             Path to the FASTA genome file compatible with the BAM file
-s, --sam       Yes (if BAM)    Path to the SAM/BAM file or directory of files
--mate_1        Yes (if FASTQ)  Path to FASTQ file of mate 1 (for paired-end reads)
--mate_2        Yes (if FASTQ)  Path to FASTQ file of mate 2 (for paired-end reads)
--gsnap_idx     No              Path to GSNAP index
-c, --config    No              Config .json file to control DeepSAP internal parameters
--batch         No              Batch size for inference
--no-fp16       No              Disable fp16 half-precision floating point
--set_size      No              Set size used to split datasets for inference
-t, --threads   No              Number of threads
--score_reads   No              Also classify reads with the transformer model and add scores to the SAM, as opposed to scoring only splice junctions (SJ)
--n_reads       No              Number of reads to classify if --score_reads is used


Version History

v0.0.3

  • Fixed key error in parsing FASTA files.
  • Fixed gene_id pattern error in parsing GTF files.

v0.0.2

  • Updated GSNAP aligner to version 2025-04-19.

v0.0.1

  • Initial release.

License/Terms of Use

By pulling and using the Parabricks container, you accept the governing terms: The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/), except for the model, which is governed by the NVIDIA Models Community License Agreement (found at NVIDIA Community Model License). ADDITIONAL INFORMATION: Apache 2.0.