NVIDIA

RADIO-CLIP

RADIO-CLIP object search model

Model Overview

Description:

RADIO-CLIP is a contrastive vision-language model that combines C-RADIO (image embeddings) and the SigLIP 2-G text adapter (text embeddings). The TAO checkpoint contains both. SigLIP 2-G is the only text adapter currently supported; additional text adapters may be supported in the future. RADIO-CLIP uses a ViT-H/16 backbone for images and produces 1024-dimensional embeddings suitable for text-to-image and image-to-image retrieval, object search, and re-identification in applications such as transportation, warehouse operations, and other industrial contexts. This model is ready for commercial use.

License/Terms of Use

Governing Terms: Use of this model is governed by the NVIDIA Open Model License. Additional Information: Apache 2.0.

Deployment Geography:

Global

Use Case:

Object search, re-identification, and cross-modal retrieval (text-to-image and image-to-image) in applications such as transportation, warehouse operations, and other industrial contexts. Users include developers and integrators building vision-language retrieval systems for these domains.

Release Date:

NGC 03/09/2026 via URL

Reference(s):

RADIO: GitHub | AM-RADIO Paper (CVPR 2024). RADIO-CLIP combines C-RADIO (image encoder) and the SigLIP 2-G text adapter; both are included in the TAO checkpoint.

Model Architecture:

Architecture Type: Vision Transformer (ViT) with contrastive adapter for images; SigLIP 2-G text adapter for text (only option for now; more adapters may be supported in the future).

Network Architecture: RADIO-CLIP: C-RADIO v3 with ViT-H/16 backbone (image encoder) and SigLIP 2-G (text adapter). The TAO checkpoint contains both.

This model was developed based on: C-RADIO v3 (image encoder) and SigLIP 2-G (text adapter).

Number of model parameters: Approximately 6.5×10^8 (C-RADIO v3-H image encoder); full RADIO-CLIP checkpoint includes both image and text encoders.

Input(s):

Input Type(s): Image, Text

Input Format(s):

Image: Red, Green, Blue (RGB)
Text: String (tokenized via model tokenizer)

Input Parameters:

Image: Two-Dimensional (2D)
Text: One-Dimensional (1D)

Other Properties Related to Input:

Image: Fixed resolution of 224×224 pixels (W×H); RGB, 8-bit. Pre-processing: resize and crop to 224×224, then normalization.
Text: Context length and vocabulary as defined by the SigLIP 2-G tokenizer. Pre-processing: tokenization.
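
The card specifies a resize and crop to 224×224 followed by normalization, but it names neither the crop policy nor the normalization constants. The sketch below works through the geometry under one common assumption (shorter-side resize, then center crop). The function names and the mean/std values are hypothetical placeholders, not values taken from the checkpoint; the real ones should come from the preprocessing configuration shipped with the model.

```python
TARGET = 224  # fixed input resolution stated in the model card (W x H)


def resize_then_center_crop_box(width, height, target=TARGET):
    """Return (new_w, new_h, box) for a shorter-side resize plus center crop.

    Assumption: the card says "resize and crop" without naming the policy.
    box is (left, top, right, bottom) in resized-image pixel coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)


def normalize_pixel(rgb, mean, std):
    """Scale 8-bit RGB to [0, 1], then apply per-channel (x - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# Placeholder constants (hypothetical, for illustration only).
PLACEHOLDER_MEAN = (0.5, 0.5, 0.5)
PLACEHOLDER_STD = (0.5, 0.5, 0.5)

# A 1920x1080 frame is resized so its shorter side is 224, then cropped.
new_w, new_h, box = resize_then_center_crop_box(1920, 1080)
pixel = normalize_pixel((255, 0, 128), PLACEHOLDER_MEAN, PLACEHOLDER_STD)
```

For a 1920×1080 frame this yields a 398×224 resized image and a crop box that discards equal margins on the left and right.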

Output(s)

Output Type(s): Embedding vectors (floating-point)

Output Format(s): Real-valued vectors (L2-normalized)

Output Parameters: One-Dimensional (1D) embedding vectors

Other Properties Related to Output: Embedding dimension 1024 (for C-RADIO v3-H with SigLIP 2-G adapter); L2-normalized for cosine similarity; no character or resolution limits.
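
Because the embeddings are L2-normalized, the dot product of two embeddings equals their cosine similarity, so retrieval reduces to ranking a gallery by dot product against the query. The following is a minimal pure-Python sketch of that ranking. The helper names are hypothetical, and the 3-dimensional vectors are toy stand-ins for real model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(query, gallery):
    """For unit vectors, the dot product is the cosine similarity."""
    return [sum(q * g for q, g in zip(query, item)) for item in gallery]


def top_k(query, gallery, k):
    """Return indices of the k gallery items most similar to the query."""
    scores = cosine_scores(query, gallery)
    order = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy vectors (illustrative values only, not model outputs).
gallery = [
    l2_normalize(v)
    for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])
]
query = l2_normalize([0.9, 0.1, 0.0])
ranked = top_k(query, gallery, k=2)
```

The same ranking applies to text-to-image and image-to-image search, since both modalities are embedded into a shared space. A production system would typically replace the Python loops with a batched matrix product or a vector index.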

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

TAO - minimally compatible versions as specified on NGC
PyTorch with torch.hub (NVlabs/RADIO); TAO checkpoint includes C-RADIO and SigLIP 2-G text adapter for image and text encoding

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Pascal
NVIDIA Turing
NVIDIA Volta

Preferred/Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Trainable v1.0 and deployable v1.0 are available on NGC (e.g., C-RADIO v3-H with SigLIP 2-G adapter).

Training, Testing, and Evaluation Datasets:

Training Dataset

Data Modality:

Image
Text

Image Training Data Size:

Less than a Million Images (for fine-tuning workflows; base model training may use larger sets). TAO fine-tuning training set sizes: 50,019 train images (6,668 persons).

Text Training Data Size:

Less than a Billion Tokens

Data Collection Method by dataset:

Hybrid: Human, Automated

Labeling Method by dataset:

Hybrid: Human, Automated

Properties: Training data comprises image-text pairs: images (RGB) and associated text (captions or attribute-based descriptions). Content includes person and object imagery for retrieval and re-identification. Data sources used in TAO fine-tuning may include internal and licensed datasets (e.g., person attribute search); see the CLIP CLI notebook and experiment specs for exact sources and counts.

Testing Dataset:

Data Collection Method by dataset:

Hybrid: Human, Automated

Labeling Method by dataset:

Human

Properties: Test sets include image-text pairs with same modalities and nature as training (images and captions/attributes); used for retrieval and re-identification metrics.

Evaluation Dataset:

Data Collection Method by dataset:

Human

Labeling Method by dataset:

Human

Properties: Evaluation uses benchmark datasets for retrieval and person re-identification (e.g., person attribute search benchmarks); image and text modalities; descriptive and attribute-based captions.

Benchmark Score: Retrieval metrics (e.g., Recall@K, mAP) on person re-identification and text-to-image retrieval benchmarks.
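
The card names Recall@K and mAP without giving their exact definitions. As a point of reference, the sketch below implements one common formulation: hit-based Recall@K (a query counts as a hit if any relevant item is in the top K) and average precision normalized by the number of relevant items. This is an assumption about the definitions, not a statement of how the reported benchmarks were scored, and all function names and data are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def average_precision(ranked_ids, relevant_ids):
    """Average the precision at each rank where a relevant item occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean(values):
    """Arithmetic mean over per-query metric values."""
    return sum(values) / len(values)


# Two toy queries: (ranked gallery IDs, set of relevant IDs).
queries = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"z"}),
]

recall_at_1 = mean([recall_at_k(r, rel, 1) for r, rel in queries])
recall_at_3 = mean([recall_at_k(r, rel, 3) for r, rel in queries])
mean_ap = mean([average_precision(r, rel) for r, rel in queries])
```

In this toy case the first query finds a relevant item at rank 1 and the second only at rank 3, so Recall@1 is 0.5 while Recall@3 is 1.0.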

Inference

Acceleration Engine: TensorRT (vision encoder), PyTorch

Test Hardware:

NVIDIA datacenter GPUs (e.g., L4, L40, L40S, A100, H100, H200, RTX PRO 6000 Blackwell Server Edition, B200, GB200)
Jetson AGX Orin, Jetson AGX Thor T5000, Jetson IGX Thor T7000 (Stargazer), DGX Spark
Minimum hardware as required by your TAO / PyTorch / TensorRT deployment

Datacenter GPU results below are vision encoder only (C-RADIO ViT-H/16; pixel_values input). RADIO-CLIP also ships a SigLIP 2-G text adapter in the full checkpoint. Measurements used TensorRT FP16 and trtexec. Throughput is inference-only; end-to-end latency with decoding, pre/post-processing, or full application pipelines may differ.

Environment (dGPU)

Component	Version
TensorRT	10.14.1.48
CUDA	13.1 (V13.1.115)
cuDNN	9.17.1
Driver	580.105.08
Container	gitlab-master_26.01_py3_stage_252181976
OS	Ubuntu 24.04

Environment (edge)

Jetson IGX Thor T7000 (Stargazer) — flashed with IGX-r38v2.0.11; TensorRT 10.13.3.9; CUDA 13.0; cuDNN 9.12.0.46; power mode 120W; jetson_clocks applied.

Jetson AGX Thor T5000 — flashed with JP7.1 b148; TensorRT 10.13.3.9; CUDA 13.0; cuDNN 9.12.0.46; power mode 120W; jetson_clocks applied.

DGX Spark — FastOS OTA2 Mainline Release 1.120.36; driver 580.126.09 / 590.48.01; CUDA 13.0 / 13.1; cuDNN 9.12 / 9.17; TensorRT 10.13.2 / 10.14.1.48.

Jetson AGX Orin — flashed with JP7.2 b19; TensorRT 10.13.3.9; CUDA 12.9; cuDNN 9.12.0.46; power mode MAXN.

Model configuration and TensorRT build

Export input size is 224×224 in the export header. TensorRT uses input tensor pixel_values:1x3x224x224. ONNX opset 17. Output embedding dimension 1536 for this TensorRT vision encoder export. Note: RADIO-CLIP pairs the image encoder with the SigLIP 2-G text adapter; this performance report covers the vision encoder only.

TensorRT conversion command (from export log):

trtexec --onnx=onnx_exports/radlip_v1.0.onnx --saveEngine=radlip_v1.0.engine --shapes=pixel_values:1x3x224x224 --fp16

Vision encoder throughput (FP16)

Platform	BS=1	BS=2	BS=4	BS=8	BS=16	BS=32	BS=64	BS=128
NVIDIA L4	55	43	30	18	10	5	2	1
NVIDIA L40	71	71	69	50	26	13	6	3
NVIDIA L40S	80	79	72	57	37	18	8	4
NVIDIA A100-SXM4-80GB	113	102	83	56	35	17	9	5
NVIDIA H100 NVL	153	141	134	100	60	31	15	8
NVIDIA H100 80GB HBM3	169	161	146	119	84	44	23	12
NVIDIA H200 141GB HBM3	165	158	148	126	86	43	23	12
NVIDIA RTX PRO 6000 Blackwell Server Edition	116	114	114	75	57	29	14	7
NVIDIA B200	236	212	216	178	128	71	37	19
NVIDIA GB200	223	229	213	169	129	76	39	20
Jetson AGX Orin	45	31	18	9	5	2	1	1
Jetson AGX Thor T5000	78	76	54	32	16	8	4	2
Jetson IGX Thor T7000 (Stargazer)	79	72	54	30	16	8	4	2
DGX Spark	78	59	40	24	10	5	2	1
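
The table does not state its unit. If the values are trtexec throughput in queries per second, with one query being one batch (an assumption that the card does not confirm), then per-image throughput is the table value multiplied by the batch size. The sketch below applies that conversion to two rows copied from the table; the function name is hypothetical.

```python
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

# Rows copied verbatim from the throughput table.
ROWS = {
    "NVIDIA L4": [55, 43, 30, 18, 10, 5, 2, 1],
    "NVIDIA H100 80GB HBM3": [169, 161, 146, 119, 84, 44, 23, 12],
}


def images_per_second(batches_per_second, batch_sizes=BATCH_SIZES):
    """Convert per-batch throughput to per-image throughput.

    Assumption: each table value counts batches per second, not images.
    """
    return [qps * bs for qps, bs in zip(batches_per_second, batch_sizes)]


per_image = {name: images_per_second(row) for name, row in ROWS.items()}
for name, values in per_image.items():
    print(name, values)
```

Because the published values are rounded to integers, figures derived this way are coarse at large batch sizes: a table value of 1 at BS=128 could correspond to anything from roughly 64 to 192 images per second.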

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Publisher: NVIDIA

License: NVIDIA proprietary

Latest Version: trainable_v1.0

Updated: March 12, 2026 UTC

Compressed Size: 15.4 GB

Labels: NSPECT-6PVB-PJAJ TAO Toolkit