This model card contains the pretrained weights of the NV-Dinov2 model, which can be used as a backbone for popular computer vision tasks such as classification, segmentation, and detection. This model is ready for commercial use.
These weights may be used as a starting point for classification, segmentation, detection, and change detection applications in the Train Adapt Optimize (TAO) Toolkit to facilitate transfer learning.
Architecture Type: Transformer-Based
NV-Dinov2 is a visual foundation model trained on an NVIDIA proprietary large-scale dataset. Dinov2 is a self-supervised learning method that uses a combination of two SSL techniques: DINO and iBOT. These models simplify the use of images in any system by producing all-purpose visual features, that is, features that work across image distributions and tasks without fine-tuning. Trained on large curated datasets, our model has learned robust, fine-grained representations useful for localization and classification tasks. This model can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, see: Dinov2.
Input Types: Images
Input Formats: Red, Green, Blue (RGB)
Input Parameters: Three-Dimensional (3D)
Other Properties Related to Input:
Minimum Resolution: 224 x 224
Maximum Resolution: 518 x 518
Alpha Channel: No alpha channel
Note: ViT-G was fine-tuned for high-resolution images. It works with any input resolution from 224 x 224 x 3 to 518 x 518 x 3. Channel ordering of the input: NCHW, where N = batch size, C = number of channels (3), H = height of the images (336), W = width of the images (336)
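A minimal sketch of building the NCHW input layout described above, assuming standard numpy preprocessing (the exact TAO preprocessing pipeline may differ):

```python
import numpy as np

def to_nchw_batch(images):
    """Convert a list of identically sized HWC uint8 RGB images into an
    NCHW float32 batch (resolution must lie between 224 x 224 and 518 x 518)."""
    batch = np.stack(images).astype(np.float32) / 255.0  # N x H x W x C, scaled to [0, 1]
    return batch.transpose(0, 3, 1, 2)                   # N x C x H x W

# Two dummy 224 x 224 RGB images stand in for real inputs.
imgs = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(2)]
x = to_nchw_batch(imgs)
print(x.shape)  # (2, 3, 224, 224)
```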
Output Types: Embedding - Float tensor
Output Format: 3D Vector
Other Properties Related to Output:
The output of this model is an embedding of the input image, of size 1024 for the ViT-L variant and 1536 for ViT-G.
Runtime Engines:
Supported Hardware Architectures:
Supported Operating Systems:
This model was trained using our implementation of DINOV2 on NVIDIA-commercial dataset.
These models must be used with NVIDIA hardware and software. The models can run on any NVIDIA GPU, including NVIDIA Jetson devices, and can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.
The primary use case for these models is getting feature embeddings from images. These embeddings can then be used for downstream tasks such as classification, segmentation, and detection by adding relevant heads.
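For example, the embeddings can be compared directly for image-similarity tasks. Below is a minimal cosine-similarity sketch; the random vectors stand in for actual model outputs (1024-dimensional for ViT-L, per the output description):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.standard_normal(1024)  # ViT-L embedding size
emb_b = rng.standard_normal(1024)

print(cosine_similarity(emb_a, emb_a))  # ~1.0 for identical embeddings
print(cosine_similarity(emb_a, emb_b))
```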
These models are intended for training and fine-tuning using the TAO Toolkit and your datasets for image comparison. High-fidelity models can be trained on new use cases. A Jupyter Notebook is available as part of the TAO container and can be used for re-training.
The models are also intended for edge deployment using TensorRT.
To use these models as pretrained weights for transfer learning, use the snippet below as a template for the `model` and `train` components of the experiment spec file to train a Dinov2 classification model. For more information on the experiment spec file, see the TAO Toolkit User Guide.
For ViT-L NV-Dinov2:
```yaml
model:
  init_cfg:
    checkpoint: None
  backbone:
    type: vit_large_patch14_dinov2_swiglu
    pretrained: /path/to/nvdinov2.pth
    freeze: True
  head:
    type: TAOLinearClsHead
```
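Conceptually, the frozen backbone produces an embedding and the linear classification head maps it to class logits. The numpy sketch below illustrates that idea only; the names and sizes are illustrative, not the actual TAOLinearClsHead implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
embed_dim, num_classes = 1024, 10  # ViT-L embedding size; example class count

# Illustrative linear head: logits = embedding @ W + b
W = rng.standard_normal((embed_dim, num_classes)) * 0.01
b = np.zeros(num_classes)

embedding = rng.standard_normal(embed_dim)  # stands in for a frozen-backbone output
logits = embedding @ W + b

# Softmax over logits to get class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (10,)
```

With `freeze: True` only the head's parameters (`W`, `b` here) would be updated during training, which is why a small labeled dataset can suffice.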
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
| Dataset | No. of Images |
|---|---|
| NV Internal Data | 130M |
| NV Internal Data | 700M |
Link: https://www.image-net.org/
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
50,000 validation images from the ImageNet dataset
The key performance indicator is accuracy, following the standard evaluation protocol for image classification. The KPIs for the evaluation data are reported below.
| Model | Top-1 Accuracy |
|---|---|
| ViT-L NV-Dinov2 (ImageNet validation) | 79.9 |
| ViT-G NV-Dinov2 (ImageNet validation) | 80.4 |
Engine: TensorRT
Test Hardware:
The inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec on Orin, Orin NX, NVIDIA T4, and Ampere GPUs. The Jetson devices run in Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.
NVDinoV2 (224x224 resolution)
| Platform | BS | FPS |
|---|---|---|
| Orin NX 16GB | 16 | 31.55 |
| AGX Orin 64GB | 16 | 81.41 |
| A2 | 16 | 72.7 |
| T4 | 4 | 110.3 |
| A30 | 16 | 461.0 |
| L4 | 4 | 275.0 |
| L40 | 8 | 579.0 |
| A100 | 32 | 1031.0 |
| H100 | 64 | 2500.6 |
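The FPS figures above translate directly into approximate per-frame latency (latency_ms = 1000 / FPS), which can be a more useful number when budgeting a pipeline:

```python
def latency_ms(fps):
    # Approximate per-frame latency in milliseconds from a throughput figure.
    return 1000.0 / fps

# A few rows from the table above.
for platform, fps in [("T4", 110.3), ("A100", 1031.0), ("H100", 2500.6)]:
    print(f"{platform}: {latency_ms(fps):.2f} ms/frame")
```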
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.