RADIO-CLIP object search model

RADIO-CLIP

Model Overview

Description:

RADIO-CLIP is a contrastive vision-language model that combines C-RADIO (image embeddings) and the SigLIP 2-G text adapter (text embeddings). The TAO checkpoint contains both. SigLIP 2-G is the only text adapter currently supported; additional text adapters may be supported in the future. RADIO-CLIP uses a ViT-H/16 backbone for images and produces 1024-dimensional embeddings suitable for text-to-image and image-to-image retrieval, object search, and re-identification in applications such as transportation, warehouse operations, and other industrial contexts. This model is ready for commercial use.


License/Terms of Use

Governing Terms: Use of this model is governed by the NVIDIA Open Model License. Additional Information: Apache 2.0.

Deployment Geography:

Global

Use Case:

Object search, re-identification, and cross-modal retrieval (text-to-image and image-to-image) in applications such as transportation, warehouse operations, and other industrial contexts. Users include developers and integrators building vision-language retrieval systems for these domains.

Release Date:

NGC 03/09/2026 via URL

Reference(s):

RADIO: GitHub | AM-RADIO Paper (CVPR 2024). RADIO-CLIP combines C-RADIO (image encoder) and the SigLIP 2-G text adapter; both are included in the TAO checkpoint.

Model Architecture:

Architecture Type: Vision Transformer (ViT) with contrastive adapter for images; SigLIP 2-G text adapter for text (only option for now; more adapters may be supported in the future).

Network Architecture: RADIO-CLIP: C-RADIO v3 with ViT-H/16 backbone (image encoder) and SigLIP 2-G (text adapter). The TAO checkpoint contains both.

This model was developed based on: C-RADIO v3 (image encoder) and SigLIP 2-G (text adapter).

Number of model parameters: Approximately 6.5×10^8 (C-RADIO v3-H image encoder); full RADIO-CLIP checkpoint includes both image and text encoders.

Input(s):

Input Type(s): Image, Text

Input Format(s):

  • Image: Red, Green, Blue (RGB)
  • Text: String (tokenized via model tokenizer)

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)

Other Properties Related to Input: Image: Fixed resolution 224×224 pixels (W×H); RGB, 8-bit; pre-processing: resize and crop to 224×224, normalization. Text: Context length and vocabulary per SigLIP 2-G tokenizer; pre-processing: tokenization.
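The image pre-processing described above (resize, crop to 224×224, normalize) can be sketched as follows. This is an illustration only: the short-side resize policy, the function names, and the mean and standard deviation values are assumptions, not taken from this model card. Use the values shipped with the TAO checkpoint or experiment spec in practice.

```python
def resize_and_crop_geometry(width, height, target=224):
    """Return (new_w, new_h, left, top) for a short-side resize followed by a center crop.

    Assumption: the short side is scaled to `target` and the long side is
    center-cropped. The model card only says "resize and crop".
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, left, top


# Placeholder statistics (hypothetical); replace with the checkpoint's real values.
ASSUMED_MEAN = 0.5
ASSUMED_STD = 0.5


def normalize_pixel(value_8bit, mean=ASSUMED_MEAN, std=ASSUMED_STD):
    """Map an 8-bit channel value to [0, 1], then standardize it."""
    return (value_8bit / 255.0 - mean) / std


# A 640x480 frame: resized so the short side is 224, then cropped horizontally.
geometry = resize_and_crop_geometry(640, 480)
```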

Output(s)

Output Type(s): Embedding vectors (floating-point)

Output Format(s): Real-valued vectors (L2-normalized)

Output Parameters: One-Dimensional (1D) embedding vectors

Other Properties Related to Output: Embedding dimension 1024 (for C-RADIO v3-H with SigLIP 2-G adapter); L2-normalized for cosine similarity; no character or resolution limits.
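Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, which is what makes text-to-image and image-to-image retrieval a simple ranking problem. The sketch below shows that ranking step in plain Python. The helper names and the toy 4-dimensional vectors are hypothetical; real embeddings would come from the image encoder or the text adapter.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_scores(query, gallery):
    """Cosine similarity of the query against every gallery vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]


def top_k(query, gallery, k=5):
    """Return (index, score) pairs for the k most similar gallery items."""
    scores = cosine_scores(query, gallery)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy 4-dimensional vectors stand in for the model's embeddings.
gallery = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [2.0, 0.0, 0.0, 0.0]  # normalized inside cosine_scores
results = top_k(query, gallery, k=2)
```

Re-normalizing inside `cosine_scores` is redundant for vectors that are already unit length, but it keeps the sketch safe if an un-normalized vector slips in.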

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • TAO - minimally compatible versions as specified on NGC
  • PyTorch with torch.hub (NVlabs/RADIO); TAO checkpoint includes C-RADIO and SigLIP 2-G text adapter for image and text encoding

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Trainable v1.0 and deployable v1.0 are available on NGC (e.g., C-RADIO v3-H with SigLIP 2-G adapter).

Training, Testing, and Evaluation Datasets:

Training Dataset

Data Modality:

  • Image
  • Text

Image Training Data Size:

  • Less than a Million Images (for fine-tuning workflows; base model training may use larger sets). TAO fine-tuning training set sizes: 50,019 train images (6,668 persons).

Text Training Data Size:

  • Less than a Billion Tokens

Data Collection Method by dataset:

  • Hybrid: Human, Automated

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties: Training data comprises image-text pairs: images (RGB) and associated text (captions or attribute-based descriptions). Content includes person and object imagery for retrieval and re-identification. Data sources used in TAO fine-tuning may include internal and licensed datasets (e.g., person attribute search); see the CLIP CLI notebook and experiment specs for exact sources and counts.

Testing Dataset:

Data Collection Method by dataset:

  • Hybrid: Human, Automated

Labeling Method by dataset:

  • Human

Properties: Test sets include image-text pairs with same modalities and nature as training (images and captions/attributes); used for retrieval and re-identification metrics.

Evaluation Dataset:

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Human

Properties: Evaluation uses benchmark datasets for retrieval and person re-identification (e.g., person attribute search benchmarks); image and text modalities; descriptive and attribute-based captions.

Benchmark Score: Retrieval metrics (e.g., Recall@K, mAP) on person re-identification and text-to-image retrieval benchmarks.
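For readers unfamiliar with these metrics, here is a minimal sketch of how Recall@K and mAP are commonly computed from ranked retrieval results. It assumes the hit-based Recall@K used in re-identification (a query counts as a success if any relevant item is in the top K); the evaluation code behind this model card may define the metrics differently, and all function names here are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top k results, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(queries):
    """Average the per-query AP over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Two toy queries: the first finds its matches at ranks 2 and 4, the second at rank 1.
queries = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["x", "y"], {"x"}),
]
map_score = mean_average_precision(queries)
```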

Inference

Acceleration Engine: TensorRT (vision encoder), PyTorch

Test Hardware:

  • NVIDIA datacenter GPUs (e.g., L4, L40, L40S, A100, H100, H200, RTX PRO 6000 Blackwell Server Edition, B200, GB200)
  • Jetson AGX Orin, Jetson AGX Thor T5000, Jetson IGX Thor T7000 (Stargazer), DGX Spark
  • Minimum hardware as required by your TAO / PyTorch / TensorRT deployment

Datacenter GPU results below are vision encoder only (C-RADIO ViT-H/16; pixel_values input). RADIO-CLIP also ships a SigLIP 2-G text adapter in the full checkpoint. Measurements used TensorRT FP16 and trtexec. Throughput is inference-only; end-to-end latency with decoding, pre/post-processing, or full application pipelines may differ.

Environment (dGPU)

Component | Version
TensorRT | 10.14.1.48
CUDA | 13.1 (V13.1.115)
cuDNN | 9.17.1
Driver | 580.105.08
Container | gitlab-master_26.01_py3_stage_252181976
OS | Ubuntu 24.04

Environment (edge)

Jetson IGX Thor T7000 (Stargazer) — flashed with IGX-r38v2.0.11; TensorRT 10.13.3.9; CUDA 13.0; cuDNN 9.12.0.46; power mode 120W; jetson_clocks applied.

Jetson AGX Thor T5000 — flashed with JP7.1 b148; TensorRT 10.13.3.9; CUDA 13.0; cuDNN 9.12.0.46; power mode 120W; jetson_clocks applied.

DGX Spark — FastOS OTA2 Mainline Release 1.120.36; driver 580.126.09 / 590.48.01; CUDA 13.0 / 13.1; cuDNN 9.12 / 9.17; TensorRT 10.13.2 / 10.14.1.48.

Jetson AGX Orin — flashed with JP7.2 b19; TensorRT 10.13.3.9; CUDA 12.9; cuDNN 9.12.0.46; power mode MAXN.

Model configuration and TensorRT build

Export input size is 224×224 in the export header. TensorRT uses input tensor pixel_values:1x3x224x224. ONNX opset 17. Output embedding dimension 1536 for this TensorRT vision encoder export. Note: RADIO-CLIP pairs the image encoder with the SigLIP 2-G text adapter; this performance report covers the vision encoder only.

TensorRT conversion command (from export log):

trtexec --onnx=onnx_exports/radlip_v1.0.onnx --saveEngine=radlip_v1.0.engine --shapes=pixel_values:1x3x224x224 --fp16
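When the same conversion is scripted for several batch sizes, the command can be assembled programmatically. The sketch below only rebuilds the argument list shown above, using the flags from the export log. The helper name is hypothetical, and batch sizes other than 1 assume the ONNX export has a dynamic batch dimension, which this model card does not confirm.

```python
import shlex


def build_trtexec_args(onnx_path, engine_path, batch_size=1, fp16=True):
    """Assemble the trtexec argument list for the vision encoder export.

    Assumption: the input tensor is named pixel_values with shape Bx3x224x224,
    as in the export log. Only flags that appear in that log are used.
    """
    shape = f"pixel_values:{batch_size}x3x224x224"
    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--shapes={shape}",
    ]
    if fp16:
        args.append("--fp16")
    return args


cmd = build_trtexec_args("onnx_exports/radlip_v1.0.onnx", "radlip_v1.0.engine")
command_line = shlex.join(cmd)  # printable form; pass `cmd` to subprocess.run to execute
```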

Vision encoder throughput (FP16)

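Platform | BS=1 | BS=2 | BS=4 | BS=8 | BS=16 | BS=32 | BS=64 | BS=128
NVIDIA L4 | 55 | 43 | 30 | 18 | 10 | 5 | 2 | 1
NVIDIA L40 | 71 | 71 | 69 | 50 | 26 | 13 | 6 | 3
NVIDIA L40S | 80 | 79 | 72 | 57 | 37 | 18 | 8 | 4
NVIDIA A100-SXM4-80GB | 113 | 102 | 83 | 56 | 35 | 17 | 9 | 5
NVIDIA H100 NVL | 153 | 141 | 134 | 100 | 60 | 31 | 15 | 8
NVIDIA H100 80GB HBM3 | 169 | 161 | 146 | 119 | 84 | 44 | 23 | 12
NVIDIA H200 141GB HBM3 | 165 | 158 | 148 | 126 | 86 | 43 | 23 | 12
NVIDIA RTX PRO 6000 Blackwell Server Edition | 116 | 114 | 114 | 75 | 57 | 29 | 14 | 7
NVIDIA B200 | 236 | 212 | 216 | 178 | 128 | 71 | 37 | 19
NVIDIA GB200 | 223 | 229 | 213 | 169 | 129 | 76 | 39 | 20
Jetson AGX Orin | 45 | 31 | 18 | 9 | 5 | 2 | 1 | 1
Jetson AGX Thor T5000 | 78 | 76 | 54 | 32 | 16 | 8 | 4 | 2
Jetson IGX Thor T7000 (Stargazer) | 79 | 72 | 54 | 30 | 16 | 8 | 4 | 2
DGX Spark | 78 | 59 | 40 | 24 | 10 | 5 | 2 | 1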

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Publisher: NVIDIA
License: NVIDIA proprietary
Latest Version: trainable_v1.0
Updated: March 12, 2026 UTC
Compressed Size: 15.4 GB