TAO Commercial Pretrained NV-CLIP Model

Description
TAO Commercial Pretrained NV-CLIP ViT-H Model
Publisher
NVIDIA
Latest Version
nv_clip_336_vit_h_trainable_v1.0
Modified
October 2, 2024
Size
1.54 GB

NVCLIP (Commercial Foundation Model)

Model Overview

NVCLIP is an NVIDIA version of the Contrastive Language-Image Pre-Training (CLIP) model, which transforms images into textual embeddings. This model is ready for commercial/non-commercial use.

References:

  • Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.

Model Architecture:

Architecture Type: Transformer-Based

In TAO, you can use NVCLIP in conjunction with TAO-MMclassification.

As a backbone, NVCLIP can be used for various downstream tasks such as classification, detection, segmentation, and text-based image retrieval.

Input:

Input Type(s): Images
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Three-Dimensional (3D)
Other Properties Related to Input:

  • Input image format: RGB image of dimensions 336 x 336 x 3 (H x W x C)

Channel Ordering of the Input: NCHW, where N = Batch Size, C = number of channels (3), H = Height of images (336), W = Width of the images (336)
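
As a rough illustration of this layout, here is a minimal preprocessing sketch using torchvision that produces a 336 x 336 NCHW float tensor. The resize strategy and normalization statistics below are assumptions (they are the values commonly used for CLIP-style models); NVCLIP's exact preprocessing may differ.

from PIL import Image
from torchvision import transforms

# CLIP-style normalization statistics; NVCLIP's exact values are an assumption here.
preprocess = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = Image.open("example.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # NCHW: 1 x 3 x 336 x 336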

Output:

Output Type(s): Embedding - Float tensor
Output Format: One-dimensional (1D) vector per image
Other Properties Related to Output:
The output of this model is an embedding of the input image, of size 1024 for the ViT-H variant and 768 for ViT-L.
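
Because the embeddings live in a shared image-text space, they are typically compared with cosine similarity. A minimal sketch with placeholder embeddings (in practice these come from the NVCLIP encoders):

import torch
import torch.nn.functional as F

# Placeholder embeddings; the ViT-H variant produces 1024-dimensional vectors.
emb_a = torch.randn(1, 1024)
emb_b = torch.randn(1, 1024)

# Cosine similarity: L2-normalize, then take the dot product.
sim = F.cosine_similarity(emb_a, emb_b)  # tensor of shape (1,), values in [-1, 1]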

Software Integration:

Runtime Engine(s):

  • TAO - 5.2

Supported Hardware Architecture(s):

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Linux 4 Tegra

Model Version(s):

  • nv_clip_336_vit_l_trainable_v1.0 - NVCLIP ViT-L with 336 resolution; a trainable foundation model.
  • nv_clip_336_vit_h_trainable_v1.0 - NVCLIP ViT-H with 336 resolution; a trainable foundation model.

Training & Evaluation:

This model can be used as a backbone and trained using the classification_pyt entrypoint in TAO. The training algorithm performs linear-probe fine-tuning for the classification task, as sketched below.
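
Conceptually, linear probing freezes the backbone and trains only a linear classifier on top of the frozen embeddings. A minimal PyTorch sketch of the idea, not the TAO implementation; the `backbone` object, embedding size, and class count are assumptions:

import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    # `backbone` stands in for the frozen NVCLIP image encoder (ViT-H -> 1024-d embeddings).
    def __init__(self, backbone, embed_dim=1024, num_classes=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the backbone; only the head is trained
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)  # frozen embeddings
        return self.head(feats)           # linear classifier on top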

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The primary use case for these models is extracting feature embeddings from images. These embeddings can then be used for curation, clustering, and zero-shot or few-shot downstream tasks such as classification, as illustrated below. They can also be used for text-based image retrieval.
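
For example, zero-shot classification reduces to ranking the cosine similarities between an image embedding and one text embedding per candidate class. A minimal sketch with placeholder embeddings (producing the text embeddings requires the matching NVCLIP text encoder, which is an assumption here):

import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these come from the image and text encoders.
image_emb = F.normalize(torch.randn(1, 1024), dim=-1)   # one image embedding
class_embs = F.normalize(torch.randn(5, 1024), dim=-1)  # one text embedding per class

# Cosine similarities; the highest-scoring class is the zero-shot prediction.
scores = image_emb @ class_embs.T  # shape: 1 x 5
pred = scores.argmax(dim=-1)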

These models are intended for training and fine-tuning with the TAO Toolkit and your datasets for image comparison. High-fidelity models can be trained on new use cases. A Jupyter notebook is available as part of the TAO container and can be used for re-training.

The models are also intended for edge deployment using TensorRT.

Using the Model with TAO

To use these models as pretrained weights for transfer learning, use the following as a template for the model and train components of the experiment spec file to train an NVCLIP model. For more information on the experiment spec file, see the TAO Toolkit User Guide - NVCLIP.

model:
  backbone:
    type: "open_clip"                           # load the backbone through the open_clip integration
    custom_args:
      model_name: "ViT-L-14-SigLIP-CLIPA-336"   # backbone variant to instantiate
    freeze: true                                # freeze backbone weights (linear probe)
  init_cfg:
    checkpoint: "Path to the checkpoint"        # path to the downloaded pretrained weights

Training Dataset:

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Automated

Properties:

Dataset           No. of Images
NV Internal Data  700M

Evaluation Dataset:

Link: https://www.image-net.org/

Data Collection Method by dataset:

  • Unknown

Labeling Method by dataset:

  • Unknown

Properties:
50,000 validation images from the ImageNet dataset

Methodology and KPI

Zero-shot top-1 accuracy of NVCLIP on the ImageNet validation dataset:

Model      Top-1 Accuracy
ViT-H-336 0.7786
ViT-L-336 0.7629

Inference:

Engine: TensorRT
Test Hardware:

  • A2
  • A30
  • DGX A100
  • DGX H100
  • JAO 64GB
  • Jetson AGX Xavier
  • L4
  • L40
  • NVIDIA T4
  • Orin
  • Orin Nano 8GB
  • Orin NX
  • Orin NX 16GB
  • Xavier NX

Inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Xavier, Xavier NX, Orin, Orin NX, NVIDIA T4, and Ampere GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data might vary depending on other bottlenecks in the hardware and software.

NVCLIP ViT-H

Platform  Batch Size  FPS
A2 128 34.88
L4 128 107.80
A30 128 230.04
L40 128 286.69
A100 128 466.98
H100 128 782.47

Technical Blogs

  • Learn how to transform Industrial Defect Detection with NVIDIA TAO and Vision AI Models.
  • Read the two-part blog on training and optimizing a 2D body pose estimation model with TAO - Part 1 | Part 2.
  • Learn how to train a real-time license plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important; learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO.
  • Learn how to train an instance segmentation model using MaskRCNN with TAO.
  • Read the technical tutorial on how the PeopleNet model can be trained with custom data using the Transfer Learning Toolkit.
  • Learn how to train and deploy real-time intelligent video analytics apps and services using the DeepStream SDK.

Suggested Reading

  • More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone.
  • Read the TAO Quick Start guide and release notes.
  • If you have any questions or feedback, see the discussions on TAO Toolkit Developer Forums.
  • Deploy your model on the edge using DeepStream. Learn more about DeepStream SDK.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.