Mask Grounding DINO

Description

Open vocabulary multi-modal instance segmentation model trained on commercial data.

Publisher

NVIDIA

Latest Version

mask_grounding_dino_swin_tiny_commercial_trainable_v1.0

Modified

November 12, 2024

Size

685.44 MB

TAO Pretrained Mask Grounding DINO with Commercial License

Description

Open vocabulary instance segmentation is a computer vision technique that can segment one or multiple objects in a frame based on the text input. Object segmentation recognizes the individual objects in an image and predicts bounding boxes and the segmentation masks. This model card contains pre-trained weights for the Mask Grounding DINO model pretrained on the commercial dataset. The goal of this card is to facilitate transfer learning through the Train Adapt Optimize (TAO) Toolkit.

This model is ready for commercial/non-commercial use.

References

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
Tian, Z., Shen, C., Chen, H.: Conditional Convolutions for Instance Segmentation.

License

The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.

Model Architecture

Architecture Type: Transformer-Based Segmentation Model
Network Architecture: Swin-Tiny

The models in this instance are NVIDIA proprietary open vocabulary object segmentors that take RGB images and a list of phrases as input and produce bounding boxes and masks, along with their labels, as output. More specifically, this model was fine-tuned from the commercial Grounding DINO pretrained model on the pseudo-labeled Open Images dataset, which allows commercial usage. The model uses a Swin-Tiny backbone, and its segmentation head was fine-tuned end-to-end on about 830k images that have pseudo-labeled ground-truth masks. All raw images used during training have commercial licenses to ensure safe commercial usage.

Input

Input Types: Image and a list of captions tokenized through HuggingFace
Input Formats: Red, Green, Blue (RGB) and tokenized inputs. Minimum 32 x 32 Resolution required; no alpha channel or bits
Input Parameters: See below for detailed input shapes
Other Properties Related to Input:

inputs: B x 3 x 544 x 960 (Batch Size x Channel x Height x Width)
input_ids: B x 256 (Batch Size x Max Token Length)
attention_mask: B x 256 (Batch Size x Max Token Length)
position_ids: B x 256 (Batch Size x Max Token Length)
token_type_ids: B x 256 (Batch Size x Max Token Length)
text_token_mask: B x 256 x 256 (Batch Size x Max Token Length x Max Token Length)
Because ONNX and TensorRT cannot take strings as input, the tokenizer runs outside the model graph. See the TAO-Deploy repo for how to run tokenization through HuggingFace.
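The input shapes listed above can be checked before feeding tensors to an engine. The following is a minimal sketch in plain Python, not part of TAO or TAO-Deploy: the helper names `expected_input_shapes` and `find_shape_mismatches` are hypothetical, and only the tensor names and dimensions come from this card.

```python
from typing import Dict, List, Tuple

# Dimensions taken from the input description in this card.
MAX_TOKENS = 256            # maximum token length
IMAGE_CHW = (3, 544, 960)   # channel, height, width

Shape = Tuple[int, ...]


def expected_input_shapes(batch_size: int) -> Dict[str, Shape]:
    """Return the expected shape of every model input for a given batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    per_token = (batch_size, MAX_TOKENS)
    return {
        "inputs": (batch_size, *IMAGE_CHW),
        "input_ids": per_token,
        "attention_mask": per_token,
        "position_ids": per_token,
        "token_type_ids": per_token,
        "text_token_mask": (batch_size, MAX_TOKENS, MAX_TOKENS),
    }


def find_shape_mismatches(actual: Dict[str, Shape], batch_size: int) -> List[str]:
    """Compare actual shapes against the expected ones and describe each problem."""
    problems = []
    for name, want in expected_input_shapes(batch_size).items():
        got = actual.get(name)
        if got is None:
            problems.append(f"{name}: missing")
        elif tuple(got) != want:
            problems.append(f"{name}: expected {want}, got {tuple(got)}")
    return problems
```

With array libraries, the `actual` dictionary could be built from each array's `shape` attribute; an empty result list means every input matches the shapes in this card.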

Output

Output Types: Bounding Boxes, Confidence Scores and Masks for each detected object in the input image.
Output Formats: One Dimensional (1D), Two Dimensional (2D), Three-Dimensional (3D) vectors
Other Properties Related to Output:

pred_logits: B x 900 (Batch Size x Number of Queries)
pred_boxes: B x 900 x 4 (Batch Size x Number of Queries x Coordinates in cxcywh format)
pred_masks: B x 900 x H x W (Batch Size x Number of Queries x Height x Width)
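Boxes in `pred_boxes` use the cxcywh format (center x, center y, width, height), while many drawing and evaluation tools expect corner coordinates. The sketch below shows the conversion for a single box. It assumes the coordinates are normalized to [0, 1], which this card does not state, so the `width` and `height` scaling arguments are optional; the function name `cxcywh_to_xyxy` is hypothetical.

```python
from typing import Sequence, Tuple


def cxcywh_to_xyxy(
    box: Sequence[float], width: float = 1.0, height: float = 1.0
) -> Tuple[float, float, float, float]:
    """Convert one (cx, cy, w, h) box to (x1, y1, x2, y2) corner format.

    Assumption: if the box is normalized to [0, 1], pass the image width and
    height to scale the result to pixels. With the defaults, the box is
    returned in the same units as the input.
    """
    cx, cy, w, h = box
    x1 = (cx - w / 2.0) * width
    y1 = (cy - h / 2.0) * height
    x2 = (cx + w / 2.0) * width
    y2 = (cy + h / 2.0) * height
    return (x1, y1, x2, y2)


# Example: a centered box covering a quarter of the width and half the height,
# scaled to the 960 x 544 input resolution listed in this card.
corners = cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), width=960, height=544)
```

The same arithmetic applies to all 900 queries per image; a vectorized version would apply it along the last axis of `pred_boxes`.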

Software Integration

Runtime Engines:

TAO 5.5.0

Supported Hardware Architectures:

NVIDIA Ampere
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Pascal
NVIDIA Turing
NVIDIA Volta

Supported Operating Systems:

Linux
Linux 4 Tegra

Model Versions

mask_grounding_dino_swin_tiny_commercial_trainable_v1.0 - Pre-trained Swin-Tiny Mask Grounding DINO model for fine-tuning.
mask_grounding_dino_swin_tiny_commercial_deployable_v1.0 - Deployable Swin-Tiny Mask Grounding DINO model.

Training and Evaluation

This model was trained using the mask_grounding_dino entrypoint in TAO. The training algorithm optimizes the network to minimize the localization and contrastive embedding loss between text and visual features as well as the mask losses between predictions and mask groundtruth.

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU including NVIDIA Jetson devices. These models can only be used with Train Adapt Optimize (TAO) Toolkit, or TensorRT.

The intended use for these models is detecting objects in a color (RGB) image. The model can be used to detect objects from photos and videos by using appropriate video or image decoding and pre-processing.

These models are intended for training and fine-tuning with the TAO Toolkit and your own datasets for object detection. High-fidelity models can be trained for new use cases. A Jupyter Notebook is available as part of the TAO container and can be used for re-training.

The models are also intended for easy edge deployment using TensorRT.

Using the Model with TAO

To use these models as pretrained weights for transfer learning, use the following snippet as a template for the model and train components of the experiment spec file. For more information on the experiment spec file, see the TAO Toolkit User Guide.

train:
  pretrained_model_path: /path/to/the/mask_groundingdino.pth
  freeze: ["backbone.0", "bert"]  # freeze the backbone for finetuning
model:
  backbone: swin_tiny_224_1k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  loss_types: ['labels', 'boxes', 'masks']

Training Dataset

Training Data

Data Collection Method by dataset:

Unknown

Labeling Method by dataset:

Unknown

Properties:
Mask Grounding DINO was pretrained on the object segmentation subset of Open Images V5. The annotations were pseudo-labeled with the 80 COCO categories. The model was trained on 832,362 images and 8,155,684 instances with bounding box and mask ground truth.

Evaluation Data

Data Collection Method by dataset:

Unknown

Labeling Method by dataset:

Human

Properties:

The COCO dataset contains 5K validation images and corresponding annotation files. The annotation includes bounding boxes and segmentation masks of the 80 thing categories.

Methodology and KPI

The key performance indicator is the mean average precision (mAP), following the standard evaluation protocol for object detection. The KPIs for the evaluation data are:

model	precision	mAP	mAP50	mask_mAP	mask_mAP50
mask_grounding_dino_swin_tiny	BF16	47.5	61.7	32.9	55.7

Inference

Engine: TensorRT
Test Hardware:

A2
A30
DGX H100
DGX A100
Jetson AGX Xavier
L4
L40
Jetson AGX Orin 64GB
Orin Nano 8GB
Orin NX

Inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Xavier, Xavier NX, Orin, Orin NX, NVIDIA T4, and Ampere GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here are inference-only performance. End-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.

Swin-Tiny + Bert-Base + Mask Grounding DINO

Platform	BS	FPS
AGX Orin 64GB	1	9.01
Orin NX 16GB	1	3.66
Orin Nano 8GB	1	2.55
A2	1	7.2
A30	1	30.8
L4	1	22.4
L40	1	63.0
A100	1	51.4
H100	1	70.3
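Because every row in the table uses a batch size of 1, the FPS figures translate directly into an approximate per-frame inference latency of 1000 / FPS milliseconds. The sketch below applies that arithmetic to the table values; the helper name `latency_ms` is hypothetical, and the result covers inference only, not decoding, tokenization, or post-processing.

```python
# FPS values copied from the table above (batch size 1 on every platform).
FPS_BY_PLATFORM = {
    "AGX Orin 64GB": 9.01,
    "Orin NX 16GB": 3.66,
    "Orin Nano 8GB": 2.55,
    "A2": 7.2,
    "A30": 30.8,
    "L4": 22.4,
    "L40": 63.0,
    "A100": 51.4,
    "H100": 70.3,
}


def latency_ms(fps: float, batch_size: int = 1) -> float:
    """Approximate time in milliseconds to process one batch at the given FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 * batch_size / fps


if __name__ == "__main__":
    # Print platforms from fastest to slowest with their per-frame latency.
    for platform, fps in sorted(FPS_BY_PLATFORM.items(), key=lambda kv: -kv[1]):
        print(f"{platform:15s} {fps:6.2f} FPS  ~{latency_ms(fps):7.1f} ms/frame")
```

This gives a quick way to judge whether a platform fits a latency budget, for example whether a device can keep up with a camera stream at a given frame rate.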

Technical Blogs

Train like a ‘pro’ without being an AI expert using TAO AutoML
Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
Customize Action Recognition with TAO and deploy with DeepStream
Read the two-part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
Learn how to train a real-time License plate detection and recognition app with TAO and DeepStream.
Model accuracy is extremely important; learn how you can achieve state of the art accuracy for classification and object detection models using TAO.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
