Mask Grounding DINO CC

Open vocabulary multi-modal instance segmentation model trained on commercial data.

TAO Pretrained Mask Grounding DINO with Commercial License

Description

Open vocabulary instance segmentation is a computer vision technique that segments one or more objects in a frame based on text input. Object segmentation recognizes the individual objects in an image and predicts a bounding box and a segmentation mask for each of them. This model card contains pretrained weights for the Mask Grounding DINO model, trained on a commercially licensed dataset. The goal of this card is to facilitate transfer learning through the Train Adapt Optimize (TAO) Toolkit.

This model is ready for commercial use.

License

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License.

Deployment Geography:

Global

Use Case:

  1. Open-Set (or Open-Vocabulary) Object Detection and Segmentation: Grounding DINO can detect objects that were not pre-defined in the training categories — you can give it new category names via text (e.g., “lion”, “bench”, “ear”) and it will localise them in images. This is useful in systems that need flexibility to recognise arbitrary categories without retraining for each new class.

  2. Referring Expression / Text-Guided Object Detection and Segmentation: The model accepts a referring expression (“the bottom man with his head up”, “the dog next to the tree”) and detects and localises the object it describes.

Release Date:

NGC [11/25/2025]

References

  • Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
  • Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
  • Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
  • Tian, Z., Shen, C., Chen, H.: Conditional Convolutions for Instance Segmentation.
  • Liu, C., Ding, H., Jiang, X.: GRES: Generalized Referring Expression Segmentation.

Model Architecture

Architecture Type: Transformer-Based Segmentation Model
Network Architecture: Swin-Tiny
Number of Parameters: 182,668,026 params

This model is an NVIDIA proprietary open-vocabulary object segmentation model that takes RGB images and either a list of phrases or a referring expression as input, and outputs bounding boxes, segmentation masks, and corresponding labels.

The model was initialized from a commercial GroundingDINO checkpoint. It employs a Swin-Tiny backbone with a transformer-based segmentation head. All components—including the ReLA module—were pretrained end-to-end on a large-scale mixed corpus of approximately 1.2 million image–expression or image–category name pairs, sourced from Open Images V5, Localized Narratives, COCO, and RefCOCO variants.

Following large-scale pretraining, the model was fine-tuned on RefCOCO, RefCOCO+, and RefCOCOg, using roughly 100,000 referring expression–image pairs to improve grounding accuracy and region-level localization.

All raw images used during training were verified to have commercial licenses, ensuring safe and compliant commercial deployment.

Input(s):

Input Type(s):
Image and Text (list of tokenized captions processed via Hugging Face).

Input Format(s):

  • Image: Red, Green, Blue (RGB)
  • Text: Tokenized string inputs
  • Resolution: Minimum 32 × 32 pixels required
  • Other: No alpha channel; standard 8-bit color depth

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)

Other Properties Related to Input:

  • inputs: B × 3 × 544 × 960 — RGB image tensor (Batch Size × Channels × Height × Width)
  • input_ids: B × 256 — tokenized text input IDs (Batch Size × Max Token Length)
  • attention_mask: B × 256 — attention mask for text tokens
  • position_ids: B × 256 — positional encoding indices
  • token_type_ids: B × 256 — token type indicators for multi-sentence text
  • text_token_mask: B × 256 × 256 — attention map between tokens (Batch Size × Max Token Length × Max Token Length)

Because ONNX and TensorRT do not support string inputs directly, the tokenization step is performed externally using the Hugging Face tokenizer.
For deployment details, refer to the TAO Deploy Repository.
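As a rough illustration of the external text preprocessing, the sketch below pads a list of token IDs to a fixed length and derives a block-diagonal token mask with per-phrase position IDs. It is a sketch under stated assumptions, not the TAO implementation: the helper names are hypothetical, the pad ID (0) and separator IDs (101, 102, 1012, 1029) are assumed to match the bert-base-uncased vocabulary, and the masking scheme follows the approach used in the open-source Grounding DINO code, which TAO's preprocessing may or may not mirror exactly.

```python
MAX_TOKENS = 256  # documented maximum token length

# Assumed bert-base-uncased IDs for [CLS], [SEP], "." and "?" (not from this card).
SPECIAL_IDS = {101, 102, 1012, 1029}
PAD_ID = 0  # assumed [PAD] id


def pad_ids(ids, max_len=MAX_TOKENS, pad_id=PAD_ID):
    """Pad or truncate one token-ID list; return (input_ids, attention_mask)."""
    ids = list(ids)[:max_len]
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def phrase_mask_and_positions(input_ids, special_ids=SPECIAL_IDS):
    """Build a block-diagonal token mask and per-phrase position IDs.

    Tokens may attend only to tokens of the same phrase, where a phrase
    ends at a separator token. Every token can always attend to itself.
    """
    n = len(input_ids)
    mask = [[i == j for j in range(n)] for i in range(n)]
    positions = [0] * n
    prev = 0  # index of the previous separator
    for col, tok in enumerate(input_ids):
        if tok not in special_ids:
            continue
        if col != 0 and col != n - 1:
            # Open up the block covering the phrase that ends at this separator.
            for i in range(prev + 1, col + 1):
                for j in range(prev + 1, col + 1):
                    mask[i][j] = True
                positions[i] = i - (prev + 1)
        prev = col
    return mask, positions


# Illustrative IDs standing in for "[CLS] cat . dog . [SEP]", padded to 8 tokens.
ids, attention = pad_ids([101, 4937, 1012, 3899, 1012, 102], max_len=8)
mask, positions = phrase_mask_and_positions(ids)
print(attention)  # [1, 1, 1, 1, 1, 1, 0, 0]
print(positions)  # [0, 0, 1, 0, 1, 0, 0, 0]
```

With the default `max_len` of 256, the outputs for one sample have the per-sample shapes listed above (256 for `input_ids`, `attention_mask`, and `position_ids`; 256 × 256 for `text_token_mask`).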

Output(s):

Output Type(s):
Bounding Boxes, Confidence Scores, Segmentation Masks for each detected object in the input image, and a No-Target Indicator.

Output Format(s):

  • Bounding Boxes: Predicted 2D boxes in (cx, cy, w, h) format
  • Confidence Scores: Scalar scores for each predicted object
  • Segmentation Masks: 3D masks aligned with the input image dimensions
  • No-Target Indicator: Binary logits indicating whether a query corresponds to a valid object or not
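To make the (cx, cy, w, h) box format concrete, here is a minimal conversion to corner coordinates. The function name is hypothetical, and the sketch assumes the boxes are normalized to [0, 1] relative to the image size, as is common for DETR-style detectors; this card does not state the normalization explicitly.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert one (cx, cy, w, h) box to pixel-space (x1, y1, x2, y2).

    Assumes the input is normalized to [0, 1]; pass 1 for both image
    dimensions if the box is already expressed in pixels.
    """
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return x1, y1, x2, y2


# A centered box covering a quarter of the width and half of the height
# of a 960 x 544 input.
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), 960, 544))
# (360.0, 136.0, 600.0, 408.0)
```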

Output Parameters:

  • Bounding Boxes: Two-Dimensional (2D)
  • Segmentation Masks: Three-Dimensional (3D)
  • Confidence Scores: One-Dimensional (1D)
  • No-Target Indicator: Two-Dimensional (B × 2)

Other Properties Related to Output:

  • pred_logits: B × 900 — classification logits for each query (Batch Size × Number of Queries)
  • pred_boxes: B × 900 × 4 — predicted bounding boxes in (cx, cy, w, h) format
  • pred_masks: B × 900 × H × W — segmentation masks for each predicted object
  • no_targets: B × 2 — binary logits indicating positive/negative (object/no-object) classes
  • union_logit_masks: B × H × W — union of predicted object masks across all queries (objectness map)
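One plausible way to reduce the 900 per-query outputs to a handful of detections is to turn each logit into a score and keep the queries above a threshold. The sketch below assumes a sigmoid activation and a threshold of 0.5; both are illustrative choices rather than values taken from this card, and `select_queries` is a hypothetical helper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_queries(logits, threshold=0.5, top_k=None):
    """Return (query_index, score) pairs at or above threshold, best first.

    `logits` is the pred_logits row for one image (one value per query).
    """
    scored = [(i, sigmoid(v)) for i, v in enumerate(logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Four toy logits instead of 900; indices map back into pred_boxes / pred_masks.
for index, score in select_queries([-2.0, 0.0, 3.0, 1.0]):
    print(index, round(score, 3))
```

The returned indices can then be used to pick the matching rows of `pred_boxes` and `pred_masks` for the same image.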

Hardware Optimization Note:
This model is designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries and TensorRT), it achieves faster training and inference than CPU-only solutions.

Software Integration

Runtime Engines:

  • TAO 5.5.0

Supported Hardware Architectures:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems:

  • Linux
  • Linux for Tegra (L4T)

Model Versions

  • mask_grounding_dino_swin_tiny_commercial_trainable_v2.0 - Fine-tuned Swin-Tiny Mask Grounding DINO model on Commercial dataset.

Training and Evaluation

This model was trained using the mask_grounding_dino entrypoint in TAO. The training algorithm optimizes the network to minimize the localization loss, the contrastive embedding loss between text and visual features, and the mask losses between the predicted and ground-truth masks.

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, they can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The intended use for these models is detecting objects in a color (RGB) image. The model can be used to detect objects from photos and videos by using appropriate video or image decoding and pre-processing.

These models are intended for training and fine-tuning with the TAO Toolkit and your own datasets for object detection. High-fidelity models can be trained for new use cases. A Jupyter Notebook is available as part of the TAO container and can be used for re-training.

The models are also intended for easy edge deployment using TensorRT.

Using the Model with TAO

To use these models as pretrained weights for transfer learning, use the following snippet as a template for the model and train components of the experiment spec file. For more information on the experiment spec file, see the TAO Toolkit User Guide.

train:
  pretrained_model_path: /path/to/the/mask_groundingdino.pth
  freeze: ["backbone.0", "bert"]  # freeze the image backbone and the BERT text encoder during fine-tuning
model:
  backbone: swin_tiny_224_1k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_region_queries: 100
  loss_types: ['labels', 'boxes', 'masks', 'rela']

Training Dataset

Training Data

RefCOCO

Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: The RefCOCO dataset provides referring expressions linking natural-language descriptions to specific object instances in images.

  • Data size: 36,459 samples
  • Expression style: Short (~3.5 words average), allows spatial terms.

Dataset License(s): Commercial-use permitted


RefCOCO+

Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: RefCOCO+ focuses on appearance descriptors rather than spatial descriptions.

  • Data size: 36,302 samples
  • Expression style: Short (~3.5 words average), no spatial terms allowed.

RefCOCOg

Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: RefCOCOg offers longer, more descriptive referring expressions for visual grounding tasks.

  • Data size: 23,695 samples
  • Expression style: Long, descriptive, more challenging for grounding models.

COCO (Subset)

Link: COCO dataset
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human-verified annotations.
Properties: This subset of COCO is used for open-vocabulary grounding with a fixed set of 80 category names.

  • Data size: 36,055 images
  • Object categories: 80
  • Provides bounding boxes and category labels for open-vocabulary object segmentation.

OpenImages V5

Link: Open Images V7 overview
Data Collection Method by dataset: Hybrid: Automated web image collection with human verification
Labeling Method by dataset: Human
Properties:
OpenImages provides large-scale annotations for object detection and segmentation, suitable for open-vocabulary grounding.

  • Data size: 662,956 samples (image-category name pairs)
  • Object categories: 350+

Localized Narratives

Link: Localized Narratives
Data Collection Method by dataset: Human.
Labeling Method by dataset: Pseudo-labeling.
Properties:
Provides dense vision-language annotations for fine-grained grounding of text to image regions.

  • Data size: 459,999 samples
  • Supports detailed region grounding for open-vocabulary segmentation

All datasets were selected for commercial-use license compatibility. These sources collectively provide high-quality visual-language supervision for both targeted object grounding and no-target reasoning tasks.
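As a quick consistency check, the per-dataset sizes listed above can be summed and compared with the totals quoted in the Model Architecture section (approximately 1.2 million pretraining pairs and roughly 100,000 referring-expression pairs for fine-tuning). The sketch simply adds the figures from this card; note that it mixes units, since the COCO subset is counted in images while the other entries are counted in samples.

```python
# Sizes copied from the dataset descriptions in this card.
DATASET_SIZES = {
    "RefCOCO": 36_459,
    "RefCOCO+": 36_302,
    "RefCOCOg": 23_695,
    "COCO (subset)": 36_055,  # counted in images
    "OpenImages V5": 662_956,
    "Localized Narratives": 459_999,
}
REF_VARIANTS = ("RefCOCO", "RefCOCO+", "RefCOCOg")

total = sum(DATASET_SIZES.values())
ref_total = sum(DATASET_SIZES[name] for name in REF_VARIANTS)

print(total)      # 1255466
print(ref_total)  # 96456
```

Both sums are in line with the rounded figures quoted earlier.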

Evaluation Data

Data Collection Method by dataset:
Human

Labeling Method by dataset:
Human

Properties (Evaluation Sets and General Information):

Referring-expression segmentation (RES) requires models to output object masks given natural language expressions. Classic datasets (RefCOCO, RefCOCO+, RefCOCOg) mostly focus on single-target expressions, while gRefCOCO supports single-, multi-, and no-target expressions, making it suitable for generalized RES (GRES) tasks.

  • gRefCOCO:

    • Validation Set: 16,870 expressions / 8,163 sets / 1,500 images
    • Test Set A (People): 18,712 expressions / 6,266 sets / 750 images
    • Test Set B (Objects): 14,933 expressions / 4,618 sets / 750 images
    • Expression Types: Mixed (single-, multi-, no-target)
    • Use Case: Phrase-level grounding; multi-target and no-target scenarios
  • RefCOCO:

    • Validation Set: 3,811 expressions / 1,500 images
    • Test Set A (People): 1,975 expressions / 750 images
    • Test Set B (Objects): 1,810 expressions / 750 images
    • Expression Style: Short (~3.6 words), allows spatial terms
    • Use Case: Single-target RES; people-vs-objects split
  • RefCOCO+:

    • Validation Set: 3,805 expressions / 1,500 images
    • Test Set A (People): 1,975 expressions / 750 images
    • Test Set B (Objects): 1,798 expressions / 750 images
    • Expression Style: Short (~3.5 words), no spatial terms; appearance-based
    • Use Case: Single-target RES; appearance-based grounding
  • RefCOCOg:

    • Validation Set (Google): 10,234 expressions / 4,896 images
    • Validation Set (UMD): 5,000 expressions / 5,257 images
    • Test Set (UMD): 5,023 expressions / 5,096 images
    • Expression Style: Long, descriptive expressions
    • Use Case: Single-target RES; per-object split

Methodology and KPI

The evaluation metrics are generalized Intersection over Union (gIoU), target accuracy (T_acc), no-target accuracy (N_acc), and mask mean Average Precision (mask_mAP):

  • gIoU: Mean of the per-image IoU values across all samples. For a no-target sample, IoU = 1 if the model correctly predicts no target (a true positive) and IoU = 0 otherwise.
  • T_acc: Accuracy on target samples (the fraction of target samples that are not misclassified as no-target); it measures how much no-target handling affects performance on samples that do contain a target.
  • N_acc: Accuracy in correctly identifying no-target samples (true positive no-target predictions divided by total no-target samples).
  • mask_mAP: Mean Average Precision (mAP) computed across multiple IoU thresholds (0.5 to 0.95); used to evaluate segmentation performance.
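The gIoU, T_acc, and N_acc definitions above can be sketched in a few lines. This is an illustrative reading of those definitions rather than the TAO evaluation code: masks are represented as sets of foreground pixel coordinates, the function names are hypothetical, and a target sample that the model flags as no-target is assumed to score IoU = 0.

```python
def mask_iou(pred, target):
    """IoU of two masks given as sets of (row, col) foreground pixels."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0


def gres_metrics(samples):
    """Compute gIoU, T_acc, and N_acc.

    Each sample is (pred_mask, target_mask, predicted_no_target); an empty
    target mask marks a no-target sample.
    """
    ious = []
    target_hits = target_total = 0
    no_target_hits = no_target_total = 0
    for pred, target, predicted_no_target in samples:
        if not target:  # no-target sample
            no_target_total += 1
            no_target_hits += int(predicted_no_target)
            ious.append(1.0 if predicted_no_target else 0.0)
        else:  # target sample
            target_total += 1
            target_hits += int(not predicted_no_target)
            ious.append(0.0 if predicted_no_target else mask_iou(pred, target))
    return {
        "gIoU": sum(ious) / len(ious) if ious else float("nan"),
        "T_acc": target_hits / target_total if target_total else float("nan"),
        "N_acc": no_target_hits / no_target_total if no_target_total else float("nan"),
    }


samples = [
    ({(0, 0), (0, 1)}, {(0, 0), (0, 1), (1, 0), (1, 1)}, False),  # IoU 0.5
    (set(), set(), True),                                         # correct no-target
    ({(2, 2)}, set(), False),                                     # missed no-target
    ({(0, 0)}, {(0, 0)}, False),                                  # IoU 1.0
]
print(gres_metrics(samples))
# {'gIoU': 0.625, 'T_acc': 1.0, 'N_acc': 0.5}
```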

Evaluation Results: gRefCOCO

Eval Set   gIoU (%)   T_acc (%)   N_acc (%)   mask_mAP (%)
val        46.35      66.94       59.75       33.73
testA      46.48      71.22       61.45       40.95
testB      38.54      57.89       54.50       34.87

Evaluation Results: RefCOCO, RefCOCO+, RefCOCOg

Dataset    Eval Set   gIoU (%)   mask_mAP (%)
RefCOCO    val        58.24      75.30
RefCOCO    testA      64.08      76.55
RefCOCO    testB      50.70      73.46
RefCOCO+   val        48.61      70.16
RefCOCO+   testA      55.03      72.68
RefCOCO+   testB      37.34      63.58
RefCOCOg   valG       48.27      72.25
RefCOCOg   valU       45.17      68.21
RefCOCOg   testU      44.51      67.71

Inference

Engine: TensorRT
Test Hardware:

  • A2
  • A30
  • DGX H100
  • DGX A100
  • Jetson AGX Xavier
  • L4
  • L40
  • Jetson AGX Orin 64GB
  • Orin Nano 8GB
  • Orin NX

The inference is run on the provided unpruned model at FP16 precision. The Jetson devices are running at Max-N configuration for maximum GPU frequency.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Publisher: NVIDIA
License: NVIDIA proprietary
Latest Version: mask_grounding_dino_swin_tiny_commercial_deployable_v2.1_wo_mask_arm
Updated: March 10, 2026 UTC
Compressed Size: 685.8 MB