This model card contains the pretrained weights of the NV-Dinov2 model, which can be used as a backbone for popular computer vision tasks such as classification, segmentation, and detection. This model is ready for commercial use.
These weights may be used as a starting point for classification, segmentation, detection, and change detection applications in the Train Adapt Optimize (TAO) Toolkit to facilitate transfer learning.
Architecture Type: Transformer-Based
NV-Dinov2 is a visual foundation model trained on an NVIDIA proprietary large-scale dataset. Dinov2 is a self-supervised learning method that uses a combination of two SSL techniques: DINO and iBOT. These models simplify the use of images in any system by producing all-purpose visual features, that is, features that work across image distributions and tasks without fine-tuning. Trained on large curated datasets, our model has learned robust, fine-grained representations useful for localization and classification tasks. This model can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, see: Dinov2.
Input Types: Images
Input Formats: Red, Green, Blue (RGB)
Input Parameters: Three-Dimensional (3D)
Other Properties Related to Input:
Minimum Resolution: 224 x 224
Maximum Resolution: 518 x 518
Alpha Channel: No alpha channel
Note: ViT-G was fine-tuned for high-resolution images. It works with any input resolution from 224 x 224 x 3 to 518 x 518 x 3. Channel ordering of the input: NCHW, where N = batch size, C = number of channels (3), H = height of the images (336), W = width of the images (336)
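A minimal sketch of building the NCHW input layout described above, assuming standard numpy preprocessing (the exact TAO preprocessing pipeline may differ):

```python
import numpy as np

def to_nchw_batch(images):
    """Convert a list of identically sized HWC uint8 RGB images into an
    NCHW float32 batch (resolution must lie between 224 x 224 and 518 x 518)."""
    batch = np.stack(images).astype(np.float32) / 255.0  # N x H x W x C, scaled to [0, 1]
    return batch.transpose(0, 3, 1, 2)                   # N x C x H x W

# Two dummy 224 x 224 RGB images stand in for real inputs.
imgs = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(2)]
x = to_nchw_batch(imgs)
print(x.shape)  # (2, 3, 224, 224)
```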
Output Types: Embedding - Float tensor
Output Format: 3D Vector
Other Properties Related to Output:
The output of this model is an embedding of the input image, of size 1024 for the ViT-L variant and 1536 for ViT-G.
Runtime Engines:
Supported Hardware Architectures:
Supported Operating Systems:
This model was trained using our implementation of DINOV2 on NVIDIA-commercial dataset.
These models must be used with NVIDIA hardware and software. The models can run on any NVIDIA GPU, including NVIDIA Jetson devices, and can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.
The primary use case for these models is getting feature embeddings from images. These embeddings can then be used for downstream tasks such as classification, segmentation, and detection by adding relevant heads.
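For example, the embeddings can be compared directly for image-similarity tasks. Below is a minimal cosine-similarity sketch; the random vectors stand in for actual model outputs (1024-dimensional for ViT-L, per the output description):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.standard_normal(1024)  # ViT-L embedding size
emb_b = rng.standard_normal(1024)

print(cosine_similarity(emb_a, emb_a))  # ~1.0 for identical embeddings
print(cosine_similarity(emb_a, emb_b))
```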
These models are intended for training and fine-tuning using the TAO Toolkit and your datasets for image comparison. High-fidelity models can be trained on new use cases. A Jupyter Notebook is available as part of the TAO container and can be used for re-training.
The models are also intended for edge deployment using TensorRT.
To use these models as pretrained weights for transfer learning, use the snippet below as a template for the `model` and `train` components of the experiment spec file to train a Dinov2 classification model. For more information on the experiment spec file, see the TAO Toolkit User Guide.
For ViT-L NV-Dinov2:
```yaml
model:
  init_cfg:
    checkpoint: None
  backbone:
    type: vit_large_patch14_dinov2_swiglu
    pretrained: /path/to/nvdinov2.pth
    freeze: True
  head:
    type: TAOLinearClsHead
```
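Conceptually, the frozen backbone produces an embedding and the linear classification head maps it to class logits. The numpy sketch below illustrates that idea only; the names and sizes are illustrative, not the actual TAOLinearClsHead implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
embed_dim, num_classes = 1024, 10  # ViT-L embedding size; example class count

# Illustrative linear head: logits = embedding @ W + b
W = rng.standard_normal((embed_dim, num_classes)) * 0.01
b = np.zeros(num_classes)

embedding = rng.standard_normal(embed_dim)  # stands in for a frozen-backbone output
logits = embedding @ W + b

# Softmax over logits to get class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (10,)
```

With `freeze: True` only the head's parameters (`W`, `b` here) would be updated during training, which is why a small labeled dataset can suffice.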
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
| Dataset | No. of Images |
|---|---|
| NV Internal Data | 130M |
| NV Internal Data | 700M |
Link: https://www.image-net.org/
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
50,000 validation images from the ImageNet dataset
The key performance indicator is accuracy, following the standard evaluation protocol for image classification. The KPIs for the evaluation data are reported below.
| Model | Top-1 Accuracy |
|---|---|
| ViT-L NV-Dinov2 (ImageNet validation) | 79.9 |
| ViT-G NV-Dinov2 (ImageNet validation) | 80.4 |
Engine: TensorRT
Test Hardware:
The inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec on Orin, Orin NX, NVIDIA T4, and Ampere GPUs. The Jetson devices run in Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.
NVDinoV2 (224x224 resolution)
| Platform | BS | FPS |
|---|---|---|
| Orin NX 16GB | 16 | 31.55 |
| AGX Orin 64GB | 16 | 81.41 |
| A2 | 16 | 72.7 |
| T4 | 4 | 110.3 |
| A30 | 16 | 461.0 |
| L4 | 4 | 275.0 |
| L40 | 8 | 579.0 |
| A100 | 32 | 1031.0 |
| H100 | 64 | 2500.6 |
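The FPS figures above translate directly into approximate per-frame latency (latency_ms = 1000 / FPS), which can be a more useful number when budgeting a pipeline:

```python
def latency_ms(fps):
    # Approximate per-frame latency in milliseconds from a throughput figure.
    return 1000.0 / fps

# A few rows from the table above.
for platform, fps in [("T4", 110.3), ("A100", 1031.0), ("H100", 2500.6)]:
    print(f"{platform}: {latency_ms(fps):.2f} ms/frame")
```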
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.