Description

DeepSAP is a transformer-based workflow designed to enhance splice junction detection in RNA-seq data.

Publisher

NVIDIA

Latest Tag

v0.0.3

Modified

August 1, 2025

Compressed Size

18.01 GB

Linux / amd64

DeepSAP

DeepSAP is a transformer-based workflow designed to enhance splice junction detection in RNA-seq data. By default, DeepSAP uses the highly sensitive GSNAP TGGA aligner for FASTQ inputs. Alternatively, it can directly process pre-aligned BAM files generated by GSNAP.

We evaluated DeepSAP in our article, "DeepSAP: Improved RNA-Seq Alignment by Integrating Transcriptome Guidance with Transformer-Based Splice Junction Scoring." In our benchmark on the Baruzzo et al. datasets, DeepSAP achieved consistently outstanding results across all evaluated metrics.

For additional resources, including data, detailed analyses, and other supplementary materials related to the DeepSAP paper, please refer to the DeepSAP GitHub repository.

Requirements
Usage
Command-line Arguments
Version History
License/Terms of Use

Requirements

System Software:

Docker with GPU support

System Hardware:

CPU: 8 cores or more recommended.
System RAM: 32 GB minimum for human genome-sized references.

GPU Hardware:

VRAM: 16 GB minimum.
Note on GPU VRAM: The VRAM requirement is highly dependent on two key parameters:
- --batch: Larger batch sizes significantly improve throughput but require more GPU memory.
- --fp16: Half-precision floating-point (enabled by default) reduces VRAM usage by nearly half and speeds up computation on compatible GPUs. To disable it, add the --no-fp16 flag to your command.
- The VRAM usage estimates below assume fp16 is enabled. Passing --no-fp16 approximately doubles the memory requirement for any given batch size.
  

Batch Size (`--batch`)	Approximate VRAM Usage (with `--fp16`)
64	~1.2 GB
128	~1.6 GB
256	~2.2 GB
2048	~10.4 GB
8192	~39.5 GB
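
The table above can be turned into a small helper for choosing a `--batch` value. This is a minimal sketch and not part of DeepSAP: the function name `suggest_batch` is hypothetical, the figures are copied from the table, and the doubling for `--no-fp16` follows the approximation stated above. Because the figures are estimates, it is prudent to leave some headroom.

```shell
# Pick the largest --batch value from the table above whose approximate
# VRAM usage fits within the available memory (hypothetical helper).
# Usage: suggest_batch <available_vram_gb> [fp16|no-fp16]
suggest_batch() {
    local vram_gb="$1" mode="${2:-fp16}"
    awk -v avail="$vram_gb" -v mode="$mode" 'BEGIN {
        # batch:approximate VRAM in GB with fp16, copied from the table
        n = split("64:1.2 128:1.6 256:2.2 2048:10.4 8192:39.5", rows, " ")
        # Assumption from the text: disabling fp16 roughly doubles usage
        factor = (mode == "no-fp16") ? 2 : 1
        best = 0
        for (i = 1; i <= n; i++) {
            split(rows[i], kv, ":")
            if (kv[2] * factor <= avail) best = kv[1]
        }
        print best   # 0 means none of the listed batch sizes fit
    }'
}
```

For example, `suggest_batch 16` prints 2048, while `suggest_batch 16 no-fp16` prints 256. Total GPU memory can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB and therefore needs converting to GB first.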

Input Data:

RNA-seq reads in FASTQ format.
Reference file in FASTA format.
Annotation file in GTF format.
Optionally, a path to a GSNAP index.
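
Before starting a long run, it can help to confirm that the inputs listed above are in place. The following pre-flight check is a sketch under stated assumptions: the function name `check_inputs` is hypothetical, the reference is assumed to be an uncompressed FASTA file, and only file presence and the FASTA header are checked, not GTF or FASTQ content.

```shell
# Hypothetical pre-flight check: confirm each input exists and is non-empty,
# and that the reference looks like FASTA (first line starts with '>').
# Usage: check_inputs <reference.fa> <annotation.gtf> <reads.fastq[.gz]>...
check_inputs() {
    local fasta="$1" path status=0
    for path in "$@"; do
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done
    # Inspect the FASTA header only if the file itself passed the check above.
    if [ -s "$fasta" ] && ! head -n 1 "$fasta" | grep -q '^>'; then
        echo "not FASTA (no '>' header on first line): $fasta" >&2
        status=1
    fi
    return "$status"
}
```

The function returns a non-zero status if any check fails, so it can guard a run with `check_inputs ref.fa ann.gtf reads_1.fastq.gz reads_2.fastq.gz && docker run ...`.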

Usage

This guide demonstrates how to quickly test DeepSAP's functionality using the malaria_short_pe dataset. Follow these steps to set up your environment and run DeepSAP:

Step 1: Prepare Environment and Download Test Data

This step downloads the latest DeepSAP Docker container and all required reference files and test sequencing data.

# Pull the DeepSAP Parabricks Docker image
docker pull nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest

# Download reference genome and annotation files
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa

# Download downsampled FASTQ sequence reads (10K) from DeepSAP GitHub
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_1.fastq.gz
wget -P test/malaria_short_pe/ https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe/SRR14793977_10K_2.fastq.gz
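
The four downloads above share one base URL, so they can also be expressed as a list that is easy to loop over or re-run. This is a sketch only: the function name `deepsap_test_urls` is hypothetical, while the base URL and file names are copied from the commands above.

```shell
# Print the four test-data URLs used in Step 1, one per line.
deepsap_test_urls() {
    local base="https://raw.githubusercontent.com/clara-parabricks-workflows/DeepSAP/main/test/malaria_short_pe"
    local name
    for name in \
        Plasmodium_falciparum.ASM276v2.60.gtf \
        Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
        SRR14793977_10K_1.fastq.gz \
        SRR14793977_10K_2.fastq.gz
    do
        printf '%s/%s\n' "$base" "$name"
    done
}

# Example: download only files that are not already present, then verify
# that the compressed FASTQ files are intact.
#   deepsap_test_urls | while read -r url; do
#       [ -s "test/malaria_short_pe/${url##*/}" ] || wget -P test/malaria_short_pe/ "$url"
#   done
#   gzip -t test/malaria_short_pe/*.fastq.gz
```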

Step 2: Run DeepSAP (Initial Run - Index Generation)

This command executes DeepSAP with the downloaded test dataset. Since the --gsnap_idx parameter is not specified, DeepSAP will automatically generate the GSNAP index required for alignment as part of this run.

# Run DeepSAP with the test dataset (GSNAP index will be generated)
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm              \
    --volume $(pwd)/test:/workdir                                                   \
    --volume $(pwd)/test/outputdir:/outputdir                                       \
    nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest                            \
    --out /outputdir/                                                               \
    --prefix test_run_10K                                                           \
    --mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz                   \
    --mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz                   \
    --gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf           \
    --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa

Step 3: Run DeepSAP with a Pre-existing GSNAP Index

If you have already generated a GSNAP index (e.g., from a previous DeepSAP run or a separate gmap_build command), you can provide its path using the --gsnap_idx parameter. This will instruct DeepSAP to reuse the existing index instead of generating a new one.

# Run DeepSAP with the test dataset and pre-generated GSNAP index
docker run --gpus all --ulimit memlock=-1 --ulimit stack=67108864 --rm              \
    --volume $(pwd)/test:/workdir                                                   \
    --volume $(pwd)/test/outputdir:/outputdir                                       \
    nvcr.io/nvidia/clara/clara-parabricks-deepsap:latest                            \
    --out /outputdir/                                                               \
    --prefix test_run_10K                                                           \
    --mate_1 /workdir/malaria_short_pe/SRR14793977_10K_1.fastq.gz                   \
    --mate_2 /workdir/malaria_short_pe/SRR14793977_10K_2.fastq.gz                   \
    --gtf /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf           \
    --fasta /workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa \
    --gsnap_idx /outputdir/gsnap_idx/
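
Steps 2 and 3 differ only in the `--gsnap_idx` argument, so a wrapper can add it when an index is already present. The sketch below makes two assumptions: the function name `deepsap_args` is hypothetical, and the index is expected in a `gsnap_idx` directory under the host output directory, matching the path used in Step 3.

```shell
# Print the DeepSAP arguments from Steps 2 and 3, one per line, adding
# --gsnap_idx only when an index directory already exists on the host.
# Usage: deepsap_args <host_output_dir>
deepsap_args() {
    local host_out="$1"
    local data=/workdir/malaria_short_pe
    printf '%s\n' \
        --out /outputdir/ \
        --prefix test_run_10K \
        --mate_1 "$data/SRR14793977_10K_1.fastq.gz" \
        --mate_2 "$data/SRR14793977_10K_2.fastq.gz" \
        --gtf "$data/Plasmodium_falciparum.ASM276v2.60.gtf" \
        --fasta "$data/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa"
    # Assumption: a previous run left its index in <output dir>/gsnap_idx.
    if [ -d "$host_out/gsnap_idx" ]; then
        printf '%s\n' --gsnap_idx /outputdir/gsnap_idx/
    fi
}
```

The output can be appended to the `docker run` command shown above, after the image name, as `$(deepsap_args "$(pwd)/test/outputdir")`. The unquoted substitution relies on word splitting, which is safe here because none of the paths contain whitespace.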

DeepSAP Expected Output

[2025-07-18 12:51:27]   [LOG]   Running DeepSAP v0.0.3
[2025-07-18 12:51:32]   [LOG]   Running GSNAP
[2025-07-18 12:51:32]   [LOG]   Building GSNAP TGGA index
[2025-07-18 12:52:44]   [LOG]   Running GSNAP TGGA 
[2025-07-18 12:52:46]   [LOG]   Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:46]   [LOG]   Parsing GTF file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.60.gtf'
[2025-07-18 12:52:47]   [LOG]   Transcript information: 
Number of transcripts:             5767
Shortest transcript:               67   EPT00050203058
Longest transcript:                30863        CAG25094
Transcripts length mean:           2456.79
Transcripts length median:         1618
Transcripts length mode:           71
Shortest intron:                   1    PF3D7_1478200: 14__-__3219919__3220323 -> 14__-__3220325__3220534
Longest intron:                    2425 CZU00099: 14__+__1639681__1639728 -> 14__+__1642154__1642455
Introns length mean:               163.03
Introns length median:             141.0
Introns length mode:               1
Number of multi exons transcripts: 3064 53.13%
Number of mono exon transcripts:   2703 46.87%

Type of transcripts:
              BioType  Count  Percentage
0      protein_coding   5358       92.91
1          pseudogene    153        2.65
3               ncRNA    102        1.77
4                tRNA     79        1.37
5                rRNA     44        0.76
7                sRNA     17        0.29
6               snRNA     10        0.17
2  nontranslating_CDS      4        0.07
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from GTF
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions in mode=NotStrict and window=150
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from transcript types: All
Number of duplicated junctions:        328
Number of short junctions (intron):    0
Number of short junctions (donor):     0
Number of short junctions (acceptor):  0
Number of junctions contains N:        0
Number of accepted junctions:          8764
The First 10 Splicing Signals Types: 
Signal  Forward  Reverse  Percentage
  GTAG     4096     4431       97.30
  AAAA       18       17        0.40
  TATA       12        8        0.23
  GCAG        9        9        0.21
  TTTT        6        9        0.17
  ATAT        4        7        0.13
  GAGA        5        6        0.13
  AGAG        3        6        0.10
  TATT        3        6        0.10
  TAAT        4        5        0.10
[2025-07-18 12:52:47]   [LOG]   Collecting splice junctions from SAM/BAM file '/outputdir/test_run_10K_gsnap.bam'
[2025-07-18 12:52:47]   [INFO]  Sense junctions 518
[2025-07-18 12:52:47]   [INFO]  Antisense junctions 551
[2025-07-18 12:52:47]   [INFO]  Total number of reads 20479
[2025-07-18 12:52:47]   [INFO]  Total number of spliced reads 2233 10.903852727183946%
[2025-07-18 12:52:47]   [LOG]   Finished parsing a SAM file, len(found_junctions_table)= 1069
[2025-07-18 12:52:47]   [LOG]   Generating splice-junction prediction dataset batch: 1
[2025-07-18 12:52:47]   [LOG]   Writting dev.csv file for predicting into '/outputdir/test_run_10K_prediction_batch_1/'
[2025-07-18 12:52:47]   [LOG]   dev.csv file contains:   0: 1069, 1: 1069
[2025-07-18 12:52:47]   [LOG]   Predicting found splice junctions using DNABERT MS150
100%|██████████| 67/67 [00:01<00:00, 58.23it/s]
[2025-07-18 12:52:51]   [LOG]   Generating genome regions 
[2025-07-18 12:52:51]   [LOG]   Parsing FASTA file '/workdir/malaria_short_pe/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa'
[2025-07-18 12:52:53]   [LOG]   Finished writing BAM successfully into '/outputdir/test_run_10K'
[2025-07-18 12:52:53]   [LOG]   Number of SAM records: 20479 
[2025-07-18 12:52:53]   [LOG]   Number of reads IDs:   12644 
[2025-07-18 12:52:53]   [LOG]   Number of processed reads IDs: 1405  11.11% 

[2025-07-18 12:52:54]   [LOG]   Finished successfuly
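
When runs are scripted, the log above can be checked automatically. The following sketch assumes the log was captured to a file, for example by appending `2>&1 | tee deepsap.log` to the `docker run` command. The file name is hypothetical, and `2>&1` is used because it is not stated whether DeepSAP logs to stdout or stderr. The function name `deepsap_log_summary` is also hypothetical, and the success marker is matched exactly as it is printed in the log.

```shell
# Summarize a captured DeepSAP log: completion status and spliced-read rate.
# Usage: deepsap_log_summary <logfile>
deepsap_log_summary() {
    local log="$1" status=0
    # Match the final log line verbatim, including its spelling.
    if grep -q 'Finished successfuly' "$log"; then
        echo "status: ok"
    else
        echo "status: incomplete"
        status=1
    fi
    # The [INFO] line ends with the spliced-read count and a percentage.
    awk '/Total number of spliced reads/ {
        printf "spliced reads: %s (%.2f%%)\n", $(NF-1), $NF + 0
    }' "$log"
    return "$status"
}
```

The function returns a non-zero status when the success marker is missing, which makes it usable as a gate in a pipeline script.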

Command-line Arguments

Argument	Description	Required
`-o, --out`	Path to the output folder	Yes
`--prefix`	Output files prefix string	Yes
`-g, --gtf`	Path to the GTF annotation file compatible with the BAM file	Yes
`-f, --fasta`	Path to the FASTA genome file compatible with the BAM file	Yes
`-s, --sam`	Path to the SAM/BAM file or directory of files	Yes (if BAM)
`--mate_1`	Path to FASTQ file of mate 1 (for paired-end reads)	Yes (if FASTQ)
`--mate_2`	Path to FASTQ file of mate 2 (for paired-end reads)	Yes (if FASTQ)
`--gsnap_idx`	Path to GSNAP index	No
`-c, --config`	Config `.json` file to control DeepSAP internal parameters	No
`--batch`	Batch size for inference	No
`--no-fp16`	Disable fp16 half-precision floating-point (enabled by default)	No
`--set_size`	Set size to split datasets for inference	No
`-t, --threads`	Number of threads	No
`--score_reads`	Also classify reads using the transformer model and add scores to the SAM output, as opposed to scoring only splice junctions (SJ)	No
`--n_reads`	Number of reads to classify if `--score_reads` is used	No
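
The usage steps only cover FASTQ input, so the sketch below shows how the arguments in the table might be combined for a pre-aligned GSNAP BAM. It is illustrative rather than an official recipe: the function name `deepsap_bam_args`, the prefix `test_run_bam`, and the default batch and thread values are hypothetical, and only flags listed in the table are used.

```shell
# Print arguments for a run on a pre-aligned GSNAP BAM, one per line.
# Usage: deepsap_bam_args <bam_path_in_container> [batch] [threads]
deepsap_bam_args() {
    # Defaults are illustrative values, not DeepSAP's own defaults.
    local bam="$1" batch="${2:-256}" threads="${3:-8}"
    local data=/workdir/malaria_short_pe
    printf '%s\n' \
        --out /outputdir/ \
        --prefix test_run_bam \
        --sam "$bam" \
        --gtf "$data/Plasmodium_falciparum.ASM276v2.60.gtf" \
        --fasta "$data/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa" \
        --batch "$batch" \
        --threads "$threads"
}
```

For instance, `$(deepsap_bam_args /outputdir/test_run_10K_gsnap.bam 128)` could be appended to the `docker run` command after the image name. This assumes the BAM named in the expected output is still present in the output directory after the earlier run.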

Version History

v0.0.3

Fixed key error in parsing FASTA files.
Fixed gene_id pattern error in parsing GTF files.

v0.0.2

Updated GSNAP aligner to version 2025-04-19.

v0.0.1

Initial release.

License/Terms of Use

By pulling and using the Parabricks container, you accept the governing terms: The software and materials are governed by the NVIDIA Software License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the Product-Specific Terms for NVIDIA AI Products (found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/), except for the model, which is governed by the NVIDIA Models Community License Agreement (found at NVIDIA Community Model License). ADDITIONAL INFORMATION: Apache 2.0.
