Open vocabulary multi-modal instance segmentation model trained on commercial data.
TAO Pretrained Mask Grounding DINO with Commercial License
Description
Open-vocabulary instance segmentation is a computer vision technique that segments one or more objects in a frame based on text input. Object segmentation recognizes the individual objects in an image and predicts their bounding boxes and segmentation masks. This model card contains pretrained weights for the Mask Grounding DINO model, trained on a commercially licensed dataset. The goal of this card is to facilitate transfer learning through the Train Adapt Optimize (TAO) Toolkit.
This model is ready for commercial use.
License
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License
Deployment Geography:
Global
Use Case:
- Open-Set (or Open-Vocabulary) Object Detection and Segmentation: Grounding DINO can detect objects that were not pre-defined in the training categories. You can give it new category names via text (e.g., “lion”, “bench”, “ear”) and it will localize them in images. This is useful in systems that need the flexibility to recognize arbitrary categories without retraining for each new class.
- Referring Expression / Text-Guided Object Detection and Segmentation: Given a referring expression (“the bottom man with his head up”, “the dog next to the tree”), the model detects and localizes the object described.
Release Date:
NGC [11/25/2025]
References
- Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
- Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
- Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
- Tian, Z., Shen, C., Chen, H.: Conditional Convolutions for Instance Segmentation.
- Liu, C., Ding, H., Jiang, X.: GRES: Generalized Referring Expression Segmentation.
Model Architecture
Architecture Type: Transformer-Based Segmentation Model
Network Architecture: Swin-Tiny
Number of Parameters: 182,668,026 params
This model is an NVIDIA proprietary open-vocabulary object segmentation model that takes RGB images and either a list of phrases or a referring expression as input, and outputs bounding boxes, segmentation masks, and corresponding labels.
The model was initialized from a commercial GroundingDINO checkpoint. It employs a Swin-Tiny backbone with a transformer-based segmentation head. All components—including the ReLA module—were pretrained end-to-end on a large-scale mixed corpus of approximately 1.2 million image–expression or image–category name pairs, sourced from Open Images V5, Localized Narratives, COCO, and RefCOCO variants.
Following large-scale pretraining, the model was fine-tuned on RefCOCO, RefCOCO+, and RefCOCOg, using roughly 100,000 referring expression–image pairs to improve grounding accuracy and region-level localization.
All raw images used during training were verified to have commercial licenses, ensuring safe and compliant commercial deployment.
Input(s):
Input Type(s):
Image and Text (list of tokenized captions processed via Hugging Face).
Input Format(s):
- Image: Red, Green, Blue (RGB)
- Text: Tokenized string inputs
- Resolution: Minimum 32 × 32 pixels required
- Other: No alpha channel; standard 8-bit color depth
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Other Properties Related to Input:
- inputs: B × 3 × 544 × 960, RGB image tensor (Batch Size × Channels × Height × Width)
- input_ids: B × 256, tokenized text input IDs (Batch Size × Max Token Length)
- attention_mask: B × 256, attention mask for text tokens
- position_ids: B × 256, positional encoding indices
- token_type_ids: B × 256, token type indicators for multi-sentence text
- text_token_mask: B × 256 × 256, attention map between tokens (Batch Size × Max Token Length × Max Token Length)
Because ONNX and TensorRT do not support string inputs directly, the tokenization step is performed externally using the Hugging Face tokenizer.
For deployment details, refer to the TAO Deploy Repository.
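As an illustration of the external text preprocessing, the sketch below builds a text_token_mask and position_ids for a single tokenized caption. It follows the block-diagonal scheme used by the open-source Grounding DINO code, where tokens attend only to tokens in the same phrase; whether the TAO pipeline uses exactly this scheme is an assumption. The function name build_text_token_mask and the special-token IDs (the usual bert-base-uncased IDs for [CLS], [SEP], "." and "?") are also assumptions, and padding to 256 tokens is left to the caller.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: bert-base-uncased IDs for [CLS], [SEP], "." and "?".
SPECIAL_TOKEN_IDS: FrozenSet[int] = frozenset({101, 102, 1012, 1029})


def build_text_token_mask(
    input_ids: Sequence[int],
    special_ids: FrozenSet[int] = SPECIAL_TOKEN_IDS,
) -> Tuple[List[List[bool]], List[int]]:
    """Return (text_token_mask, position_ids) for one tokenized caption.

    A phrase ends at (and includes) a separator token. Tokens may attend
    only to tokens of the same phrase, and positions restart per phrase.
    """
    n = len(input_ids)
    # Start from the identity: every token attends to itself.
    mask = [[i == j for j in range(n)] for i in range(n)]
    position_ids = [0] * n
    prev = 0  # index of the previous special token
    for col, tok in enumerate(input_ids):
        if tok not in special_ids:
            continue
        if col != 0 and col != n - 1:
            # Open a square attention block covering the phrase that just ended.
            for i in range(prev + 1, col + 1):
                for j in range(prev + 1, col + 1):
                    mask[i][j] = True
                position_ids[i] = i - (prev + 1)
        # A special token in the first or last slot keeps only self-attention.
        prev = col
    return mask, position_ids


# Illustrative IDs standing in for "[CLS] cat . dog . [SEP]".
mask, position_ids = build_text_token_mask([101, 4937, 1012, 3899, 1012, 102])
print(position_ids)
```

In a batched setting, each caption would be padded to 256 tokens before this step, so that the stacked masks have shape B × 256 × 256.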
Output(s):
Output Type(s):
Bounding Boxes, Confidence Scores, Segmentation Masks for each detected object in the input image, and a No-Target Indicator.
Output Format(s):
- Bounding Boxes: Predicted 2D boxes in (cx, cy, w, h) format
- Confidence Scores: Scalar scores for each predicted object
- Segmentation Masks: 3D masks aligned with the input image dimensions
- No-Target Indicator: Binary logits indicating whether a query corresponds to a valid object or not
Output Parameters:
- Bounding Boxes: Two-Dimensional (2D)
- Segmentation Masks: Three-Dimensional (3D)
- Confidence Scores: One-Dimensional (1D)
- No-Target Indicator: Two-Dimensional (B × 2)
Other Properties Related to Output:
- pred_logits: B × 900, classification logits for each query (Batch Size × Number of Queries)
- pred_boxes: B × 900 × 4, predicted bounding boxes in (cx, cy, w, h) format
- pred_masks: B × 900 × H × W, segmentation masks for each predicted object
- no_targets: B × 2, binary logits indicating positive/negative (object/no-object) classes
- union_logit_masks: B × H × W, union of predicted object masks across all queries (objectness map)
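A minimal post-processing sketch for one image is shown below. It converts (cx, cy, w, h) boxes to pixel-space corner coordinates and keeps the queries whose score passes a threshold. It assumes that box coordinates are normalized to [0, 1] and that scores are obtained by applying a sigmoid to pred_logits; neither is stated in this card. The helper names and the 0.5 threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cxcywh_to_xyxy(box: Sequence[float], width: int, height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


def filter_queries(
    pred_logits: Sequence[float],
    pred_boxes: Sequence[Sequence[float]],
    width: int,
    height: int,
    score_threshold: float = 0.5,  # hypothetical default, tune per use case
) -> List[Tuple[int, float, Box]]:
    """Return (query_index, score, xyxy_box) for queries above the threshold."""
    kept = []
    for idx, (logit, box) in enumerate(zip(pred_logits, pred_boxes)):
        score = sigmoid(logit)
        if score >= score_threshold:
            kept.append((idx, score, cxcywh_to_xyxy(box, width, height)))
    return kept


# Two dummy queries on a 960 × 544 image; only the first passes the threshold.
detections = filter_queries(
    [2.0, -2.0],
    [(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.1, 0.1)],
    width=960,
    height=544,
)
print(detections)
```

The returned query indices can then be used to select the matching rows of pred_masks.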
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Hardware Optimization Note:
This model is optimized for NVIDIA GPU-accelerated systems.
By leveraging CUDA, TensorRT, and other NVIDIA frameworks, it achieves significantly faster training and inference performance compared to CPU-only systems.
Software Integration
Runtime Engines:
- TAO 5.5.0
Supported Hardware Architectures:
- NVIDIA Ampere
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
Supported Operating Systems:
- Linux
- Linux 4 Tegra
Model Versions
- mask_grounding_dino_swin_tiny_commercial_trainable_v2.0 - Fine-tuned Swin-Tiny Mask Grounding DINO model on Commercial dataset.
Training and Evaluation
This model was trained using the mask_grounding_dino entrypoint in TAO. The training algorithm optimizes the network to minimize the localization loss, the contrastive embedding loss between text and visual features, and the mask losses between predicted and ground-truth masks.
Using this Model
These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU including NVIDIA Jetson devices. These models can only be used with Train Adapt Optimize (TAO) Toolkit, or TensorRT.
The intended use for these models is detecting objects in a color (RGB) image. The model can be used to detect objects from photos and videos by using appropriate video or image decoding and pre-processing.
These models are intended for training and fine-tuning with the TAO Toolkit and your own datasets for object detection. High-fidelity models can be trained for new use cases. A Jupyter Notebook is available as part of the TAO container and can be used for retraining.
The models are also intended for easy edge deployment using TensorRT.
Using the Model with TAO
To use these models as pretrained weights for transfer learning, use the following snippet as a template for the model and train components of the experiment spec file. For more information on the experiment spec file, see the TAO Toolkit User Guide.
train:
  pretrained_model_path: /path/to/the/mask_groundingdino.pth
  freeze: ["backbone.0", "bert"] # freeze the backbone for finetuning
model:
  backbone: swin_tiny_224_1k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_region_queries: 100
  loss_types: ['labels', 'boxes', 'masks', 'rela']
Training Dataset
Training Data
RefCOCO
Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: The RefCOCO dataset provides referring expressions linking natural-language descriptions to specific object instances in images.
- Data size: 36,459 samples
- Expression style: Short (~3.5 words average), allows spatial terms.
Dataset License(s): Commercial-use permitted
RefCOCO+
Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: RefCOCO+ focuses on appearance descriptors rather than spatial descriptions.
- Data size: 36,302 samples
- Expression style: Short (~3.5 words average), no spatial terms allowed.
RefCOCOg
Link: https://github.com/lichengunc/refer
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human.
Properties: RefCOCOg offers longer, more descriptive referring expressions for visual grounding tasks.
- Data size: 23,695 samples
- Expression style: Long, descriptive, more challenging for grounding models.
COCO (Subset)
Link: COCO dataset
Data Collection Method by dataset: Human.
Labeling Method by dataset: Human-verified annotations.
Properties: This subset of COCO is used for open-vocabulary grounding with a fixed set of 80 category names.
- Data size: 36,055 images
- Object categories: 80
- Provides bounding boxes and category labels for open-vocabulary object segmentation.
OpenImages V5
Link: Open Images V7 overview
Data Collection Method by dataset: Hybrid: Automated web image collection with human verification
Labeling Method by dataset: Human
Properties:
OpenImages provides large-scale annotations for object detection and segmentation, suitable for open-vocabulary grounding.
- Data size: 662,956 samples (image-category name pairs)
- Object categories: 350+
Localized Narratives
Link: Localized Narratives
Data Collection Method by dataset: Human.
Labeling Method by dataset: Pseudo-labeling.
Properties:
Provides dense vision-language annotations for fine-grained grounding of text to image regions.
- Data size: 459,999 samples
- Supports detailed region grounding for open-vocabulary segmentation
All datasets were selected for commercial-use license compatibility. These sources collectively provide high-quality visual-language supervision for both targeted object grounding and no-target reasoning tasks.
Evaluation Data
Data Collection Method by dataset:
Human
Labeling Method by dataset:
Human
Properties (Evaluation Sets and General Information):
Referring-expression segmentation (RES) requires models to output object masks given natural language expressions. Classic datasets (RefCOCO, RefCOCO+, RefCOCOg) mostly focus on single-target expressions, while gRefCOCO supports single-, multi-, and no-target expressions, making it suitable for generalized RES (GRES) tasks.
- gRefCOCO:
  - Validation Set: 16,870 expressions / 8,163 sets / 1,500 images
  - Test Set A (People): 18,712 expressions / 6,266 sets / 750 images
  - Test Set B (Objects): 14,933 expressions / 4,618 sets / 750 images
  - Expression Types: Mixed (single-, multi-, no-target)
  - Use Case: Phrase-level grounding; multi-target and no-target scenarios
- RefCOCO:
  - Validation Set: 3,811 expressions / 1,500 images
  - Test Set A (People): 1,975 expressions / 750 images
  - Test Set B (Objects): 1,810 expressions / 750 images
  - Expression Style: Short (~3.6 words), allows spatial terms
  - Use Case: Single-target RES; people-vs-objects split
- RefCOCO+:
  - Validation Set: 3,805 expressions / 1,500 images
  - Test Set A (People): 1,975 expressions / 750 images
  - Test Set B (Objects): 1,798 expressions / 750 images
  - Expression Style: Short (~3.5 words), no spatial terms; appearance-based
  - Use Case: Single-target RES; appearance-based grounding
- RefCOCOg:
  - Validation Set (Google): 10,234 expressions / 4,896 images
  - Validation Set (UMD): 5,000 expressions / 5,257 images
  - Test Set (UMD): 5,023 expressions / 5,096 images
  - Expression Style: Long, descriptive expressions
  - Use Case: Single-target RES; per-object split
Methodology and KPI
The evaluation metrics are generalized Intersection over Union (gIoU), target accuracy (T_acc), no-target accuracy (N_acc), and mask mean Average Precision (mask_mAP):
- gIoU: Mean of the per-image IoU values across all samples. For a no-target sample, IoU = 1 if the model correctly predicts no target (a true positive) and IoU = 0 otherwise.
- T_acc: Accuracy on target samples, that is, the fraction of target samples that the model does not misclassify as no-target. It indicates how much no-target handling affects performance on target samples.
- N_acc: Accuracy in correctly identifying no-target samples (true positive no-target predictions divided by total no-target samples).
- mask_mAP: Mean Average Precision (mAP) averaged over IoU thresholds from 0.5 to 0.95; used to evaluate segmentation performance.
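The gIoU, T_acc, and N_acc definitions above can be sketched as follows. This is an illustrative sketch, not the TAO evaluation code: masks are flattened binary lists, an all-zero predicted mask is treated as a no-target prediction (an assumption; the model itself exposes a separate no_targets output), and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Mask = Sequence[int]  # flattened binary mask: 1 = foreground, 0 = background


def mask_iou(pred: Mask, gt: Mask) -> float:
    """IoU of two binary masks of equal length (0.0 if both are empty)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def _ratio(num: float, den: float) -> float:
    """num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan


def gres_metrics(preds: Sequence[Mask], gts: Sequence[Mask]) -> Dict[str, float]:
    """Compute gIoU, N_acc and T_acc, treating no-target as the positive class."""
    ious: List[float] = []
    tp = fn = tn = fp = 0
    for pred, gt in zip(preds, gts):
        pred_empty = not any(pred)  # assumption: empty mask means "no target"
        if not any(gt):  # no-target sample
            if pred_empty:
                tp += 1
                ious.append(1.0)  # correct no-target prediction counts as IoU 1
            else:
                fn += 1
                ious.append(0.0)
        else:  # target sample
            if pred_empty:
                fp += 1  # target wrongly rejected as no-target
            else:
                tn += 1
            ious.append(mask_iou(pred, gt))
    return {
        "gIoU": _ratio(sum(ious), len(ious)),
        "N_acc": _ratio(tp, tp + fn),
        "T_acc": _ratio(tn, tn + fp),
    }


# Four toy samples: two with targets, two without.
preds = [[1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
gts = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0]]
print(gres_metrics(preds, gts))
```

mask_mAP is omitted here because it additionally requires ranking predictions by confidence across several IoU thresholds.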
Evaluation Results: gRefCOCO
| Eval Set | gIoU (%) | T_acc (%) | N_acc (%) | mask_mAP (%) |
|---|---|---|---|---|
| val | 46.35 | 66.94 | 59.75 | 33.73 |
| testA | 46.48 | 71.22 | 61.45 | 40.95 |
| testB | 38.54 | 57.89 | 54.50 | 34.87 |
Evaluation Results: RefCOCO, RefCOCO+, RefCOCOg
| Dataset | Eval Set | gIoU (%) | mask_mAP (%) |
|---|---|---|---|
| RefCOCO | val | 58.24 | 75.30 |
| RefCOCO | testA | 64.08 | 76.55 |
| RefCOCO | testB | 50.70 | 73.46 |
| RefCOCO+ | val | 48.61 | 70.16 |
| RefCOCO+ | testA | 55.03 | 72.68 |
| RefCOCO+ | testB | 37.34 | 63.58 |
| RefCOCOg | valG | 48.27 | 72.25 |
| RefCOCOg | valU | 45.17 | 68.21 |
| RefCOCOg | testU | 44.51 | 67.71 |
Inference
Engine: TensorRT
Test Hardware:
- A2
- A30
- DGX H100
- DGX A100
- Jetson AGX Xavier
- L4
- L40
- Jetson AGX Orin 64GB
- Orin Nano 8GB
- Orin NX
The inference is run on the provided unpruned model at FP16 precision. The Jetson devices are running at Max-N configuration for maximum GPU frequency.
Technical Blogs
- Train like a ‘pro’ without being an AI expert using TAO AutoML
- Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
- Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
- Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
- Customize Action Recognition with TAO and deploy with DeepStream
- Read the two-part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
- Learn how to train a real-time License plate detection and recognition app with TAO and DeepStream.
- Model accuracy is extremely important; learn how you can achieve state of the art accuracy for classification and object detection models using TAO.
Suggested Reading
- More information on TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone.
- Refer to the TAO documentation.
- Read the TAO Toolkit Quick Start Guide and release notes.
- If you have any questions or feedback, see the discussions on the TAO Toolkit Developer Forums.
- Deploy your models for video analytics application using the DeepStream SDK.
- Deploy your models in Riva for ConvAI use case.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.