Open-vocabulary object detection is a computer vision technique that detects one or more objects in a frame based on input text. Object detection recognizes the individual objects in an image and places bounding boxes around them. This model card contains pre-trained weights for a Grounding DINO object detection network pretrained on commercial datasets. The goal of this card is to facilitate transfer learning through the Train Adapt Optimize (TAO) Toolkit. Note that the model in this model card can be used for commercial purposes.
The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.
Architecture Type: Transformer-based Network Architecture
Network Architecture: Grounding DINO (Swin-Tiny image backbone with a BERT-Base text encoder)
Input Type(s): Image and a list of captions tokenized through the HuggingFace tokenizer
Input Format(s): Red, Green, Blue (RGB) images and tokenized text inputs. Any input resolution is supported, and images do not need additional pre-processing (e.g., alpha channels or specific bit depths)
Input Parameters: Multiple dimensions. See below for detailed input shapes
Other Properties Related to Input:
inputs: B x 3 x 544 x 960 (Batch Size x Channel x Height x Width)
input_ids: B x 256 (Batch Size x Max Token Length)
attention_mask: B x 256 (Batch Size x Max Token Length)
position_ids: B x 256 (Batch Size x Max Token Length)
token_type_ids: B x 256 (Batch Size x Max Token Length)
text_token_mask: B x 256 x 256 (Batch Size x Max Token Length x Max Token Length)
Output Type(s): Bounding Boxes and Confidence Scores for each detected object in the input image.
Output Format(s): One Dimensional (1D), Two Dimensional (2D) vectors
Other Properties Related to Output:
pred_logits: B x 900 (Batch Size x Number of Queries)
pred_boxes: B x 900 x 4 (Batch Size x Number of Queries x Coordinates in cxcywh format)
Runtime Engine(s):
Supported Hardware Architecture(s):
Supported Operating System(s):
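As an illustration of the tensor layout above, the sketch below assembles the six input tensors for a single image and converts the cxcywh pred_boxes back to pixel-space corner coordinates. It is a minimal sketch, not the TAO pre-processing pipeline: the build_inputs and cxcywh_to_xyxy helpers are hypothetical, and the bert-base-uncased tokenizer checkpoint, ImageNet normalization statistics, simplified position_ids, and dense text_token_mask are all assumptions that should be checked against the TAO Grounding DINO configuration.

```python
# Minimal sketch (not the official TAO pre-processing): builds the input tensors
# listed above for batch size 1 and converts cxcywh boxes to pixel xyxy corners.
import numpy as np
from PIL import Image
from transformers import AutoTokenizer

MAX_TEXT_LEN = 256            # Max Token Length from the shapes above
INPUT_H, INPUT_W = 544, 960   # Height x Width from the shapes above


def build_inputs(image_path, captions):
    # Image branch: resize to 544 x 960 and normalize.
    # ImageNet mean/std are an assumption; verify against the TAO spec.
    img = Image.open(image_path).convert("RGB").resize((INPUT_W, INPUT_H))
    img = np.asarray(img, dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    img = (img - mean) / std
    inputs = img.transpose(2, 0, 1)[None]                     # 1 x 3 x 544 x 960

    # Text branch: Grounding DINO-style prompt is a "."-separated caption list.
    # The tokenizer checkpoint is assumed to be bert-base-uncased.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    prompt = " . ".join(captions) + " ."
    tok = tokenizer(prompt, padding="max_length", max_length=MAX_TEXT_LEN,
                    return_tensors="np")
    attention_mask = tok["attention_mask"]                     # 1 x 256
    # Simplification: sequential position ids and a dense self-attention mask;
    # the real model restricts text attention per phrase.
    position_ids = np.arange(MAX_TEXT_LEN, dtype=np.int64)[None]
    text_token_mask = (attention_mask[:, :, None]
                       * attention_mask[:, None, :]).astype(bool)  # 1 x 256 x 256

    return {
        "inputs": inputs,
        "input_ids": tok["input_ids"],
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "token_type_ids": tok["token_type_ids"],
        "text_token_mask": text_token_mask,
    }


def cxcywh_to_xyxy(pred_boxes, img_w, img_h):
    # pred_boxes are (cx, cy, w, h); DETR-style models usually normalize them
    # to [0, 1], so scale by the original image size to get pixel corners.
    cx, cy, w, h = np.moveaxis(pred_boxes, -1, 0)
    return np.stack([(cx - 0.5 * w) * img_w, (cy - 0.5 * h) * img_h,
                     (cx + 0.5 * w) * img_w, (cy + 0.5 * h) * img_h], axis=-1)
```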
This model was trained using the grounding_dino entrypoint in TAO. The training algorithm optimizes the network to minimize the localization and contrastive embedding loss between text and visual features.
These models must be used with NVIDIA hardware and software. The models can run on any NVIDIA GPU, including NVIDIA Jetson devices, and can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.
The intended use for these models is detecting objects in a color (RGB) image. The model can be used to detect objects from photos and videos by using appropriate video or image decoding and pre-processing.
These models are intended for training and fine-tuning with the TAO Toolkit and your datasets for object detection. High-fidelity models can be trained for new use cases. A Jupyter notebook is available as part of the TAO container and can be used to re-train the model.
The models are also intended for easy edge deployment using TensorRT.
To use these models as pretrained weights for transfer learning, use the following snippet as a template for the model and train components of the experiment spec file to train a Grounding DINO model. For more information on the experiment spec file, see the TAO Toolkit User Guide.
train:
  pretrained_model_path: /path/to/the/groundingdino.pth
  freeze: ["backbone.0", "bert"]  # freeze the image backbone and text encoder for fine-tuning
model:
  backbone: swin_tiny_224_1k
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
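With the spec file in place, training is normally launched through the TAO launcher. The command below is only an illustration built around the grounding_dino entrypoint mentioned above; the exact subcommand layout and flags depend on your TAO Toolkit version, so confirm the syntax in the TAO Toolkit User Guide.

```shell
# Illustrative only; verify the exact syntax for your TAO Toolkit release.
tao model grounding_dino train -e /path/to/experiment_spec.yaml
```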
Grounding DINO was pretrained on a wide range of commercial datasets where the annotations were either human-generated or pseudo-labeled. The model was trained on 1,815,237 images and 14,794,974 instances of both object detection (OD) and grounding annotations. Please refer to the section below for details of every dataset used to train Grounding DINO.
Dataset | Data Collection Method by dataset | Labeling Method by dataset | # of Images | # of Annotations |
---|---|---|---|---|
Subset of OpenImagesv5 | Unknown | Automated. Pseudo-labeled raw images with Objects365 trained CO-DETR. | 803,826 | 7,345,546 |
Localized Narrative OpenImages | Unknown | Automated. Pseudo-labeled raw images and global captions with Grounding DINO. | 670,553 | 6,098,908 |
Subset of LVIS | Unknown | Human-labeled (only contains commercial subset). | 30,740 | 391,840 |
Subset of Mixed Grounding | Unknown | Human-labeled (only contains commercial subset). | 150,668 | 777,178 |
Subset of RefCOCO | Unknown | Human-labeled (only contains commercial subset). | 36,459 | 36,459 |
Subset of RefCOCO+ | Unknown | Human-labeled (only contains commercial subset). | 36,302 | 36,302 |
Subset of RefCOCOg | Unknown | Human-labeled (only contains commercial subset). | 23,718 | 23,718 |
Subset of gRefCOCO | Unknown | Human-labeled (only contains commercial subset). | 62,971 | 85,023 |
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
The key performance indicator is the mean average precision (mAP), following the standard evaluation protocol for object detection. The KPIs for the evaluation data are:
model | precision | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl |
---|---|---|---|---|---|---|---|
grounding_dino_swin_tiny | BF16 | 46.1 | 59.9 | 51.0 | 30.5 | 49.3 | 60.8 |
Engine: TensorRT
Test Hardware:
The inference is run on the provided unpruned model at FP16 precision. The inference performance is measured using trtexec on Jetson AGX Xavier, Xavier NX, Orin, Orin NX, and NVIDIA T4 and Ampere GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only; end-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.
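For reference, a trtexec run for this model would look roughly like the snippet below. The file names are placeholders and the flags should be checked against your TensorRT version; the --shapes values simply mirror the input tensor dimensions listed earlier in this card, with a batch size of 8.

```shell
# Illustrative benchmark command; paths are placeholders.
trtexec --onnx=grounding_dino_swin_tiny.onnx \
        --fp16 \
        --shapes=inputs:8x3x544x960,input_ids:8x256,attention_mask:8x256,position_ids:8x256,token_type_ids:8x256,text_token_mask:8x256x256 \
        --saveEngine=grounding_dino_swin_tiny.engine
```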
Swin-Tiny + Bert-Base + Grounding DINO
Platform | Batch Size | FPS |
---|---|---|
AGX Orin 64GB | 8 | 11.94 |
Orin NX 16GB | 8 | 4.61 |
Orin Nano 8GB | 4 | 3.36 |
RTX 4090 | 32 | 84.06 |
T4 | 32 | 15.78 |
A2 | 32 | 9.44 |
A30 | 32 | 48.05 |
L4 | 32 | 23.57 |
L40 | 32 | 69.22 |
A100 | 32 | 97.67 |
H100 | 32 | 178.07 |
Grounding DINO was trained on images collected from the web and text data of everyday noun phrases. The model might not perform well on different data distributions. Conducting further fine-tuning on the target domain is recommended to get a higher mAP.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.