NVILA HighRes
Description: NVILA visual language model with high-resolution support
Publisher: NVIDIA
Latest Version: nvila-lite-15b-highres-lita
Modified: May 1, 2025
Size: 28.44 GB

Model Overview

Description:

Vision-language (VILA) models support single-image, multi-image, and video reasoning. The VILA family has an augmented series of checkpoints with enhanced vision encoders and large language models (LLMs). New VILA (NVILA) is a family of models with an enhanced vision encoder and LLM that improves on the performance of the previous VILA models.

NVILA-Lite-15B-HighRes-LITA is a variant of NVILA-Lite-15B that can process high-resolution images and videos and adds temporal localization capabilities. Common use cases include captioning, visual Q&A, search, and summarization.
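A minimal loading-and-inference sketch follows, assuming the checkpoint uses the Hugging Face remote-code pattern common to the VILA family; the repository id and the generate_content entry point are assumptions, so check the checkpoint's README for the actual API.

```python
# Minimal inference sketch, assuming the checkpoint follows the Hugging Face
# remote-code pattern used by the VILA family. The repo id and the
# generate_content() entry point are assumptions, not a confirmed API.
from transformers import AutoModel
from PIL import Image

model = AutoModel.from_pretrained(
    "Efficient-Large-Model/NVILA-Lite-15B-HighRes-LITA",  # hypothetical repo id
    trust_remote_code=True,  # VILA-family checkpoints ship custom modeling code
    device_map="auto",
)

image = Image.open("frame.jpg").convert("RGB")  # RGB image input, per the spec below
prompt = "Describe this image in one sentence."

# Assumed multimodal entry point; interleave images/frames with text.
print(model.generate_content([image, prompt]))
```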

NVILA Finetuning Microservices (FTMS) is a visual language model (VLM) finetuning microservice that allows customers to finetune the pre-trained NVILA-Lite-15B-HighRes-LITA video model with video/image-text data at scale. Please see the container card for this offering here.

The model is for research and non-commercial use.

License/Terms of Use

This model has been released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) deed. Additional licensing information for the base pretrained models: Gemma Terms of Use and Gemma Prohibited Use Policy (Google AI for Developers) for PaliGemma 2, and Apache License, Version 2.0 for Qwen2.5.

References:

  • NVILA: Efficient Frontier Visual Language Models
  • LITA: Language Instructed Temporal-Localization Assistant
  • CVPR paper: VILA: On Pre-training for Visual Language Models
  • Vision Language Models

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP, Qwen2.5

Input:

Input Type: Image, Video, Text

Input Format:

Image: Red, Green, Blue (RGB)
Video: MP4
Text: String

Input Parameters:

Image: 2D
Video: 3D
Text: 1D
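
Video inputs are decoded from MP4 into the 3D frame stacks described above before they reach the model. The sketch below shows one common preprocessing approach, uniform frame sampling with OpenCV; the 128-frame default mirrors the @128f evaluation setting reported under Methodology and KPI, and the helper name is illustrative.

```python
# Uniformly sample RGB frames from an MP4 clip for VLM input.
# Illustrative preprocessing sketch; the model's own pipeline may differ.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 128) -> np.ndarray:
    """Return a (num_frames, H, W, 3) uint8 RGB array sampled uniformly."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes BGR
    cap.release()
    return np.stack(frames)

frames = sample_frames("clip.mp4")  # shape: (128, H, W, 3)
```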

Output:

Output Type: Text
Output Format: String
Output Parameters: 1D

Software Integration:

Runtime Engine: HF Trainer 4.46.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating System(s):

Linux

Model Version(s):

NVILA-Lite-15B-HighRes-LITA

Training Dataset:

All datasets used to train the released models come from approved NVIDIA JIRA tickets:

  • Dataset JIRA tickets approved for previous rounds of release: DGPTT-1347, DGPTT-1568, DGPTT-1843, & DGPTT-233.
  • New dataset JIRA ticket approved for this round of release: DGPTT-2846.

We do not plan to release any datasets.

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

60 million image-text pairs or interleaved image-text content.

Methodology and KPI

Generic Video Benchmarks:

  Benchmark                 Accuracy
  VideoMME w/o Sub @128f    67.3
  VideoMME w/ Sub @128f     70.9

Temporal Localization Benchmarks:

  Benchmark          Mean IoU
  ActivityNet RTL    32.07
  Charades-STA       52.8
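
Mean IoU here is the temporal intersection-over-union between a predicted time interval and the ground-truth interval, averaged over all localization queries. A small self-contained sketch of the metric (interval values and function name are illustrative):

```python
# Temporal IoU between predicted and ground-truth intervals, in seconds.
# Illustrative metric sketch; benchmark harnesses may differ in details.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Mean IoU over a set of localization queries:
preds = [(10.0, 25.0), (3.0, 8.0)]
gts = [(12.0, 24.0), (5.0, 9.0)]
mean_iou = sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
print(f"Mean IoU: {mean_iou:.2f}")
```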

Inference:

Engine:

  • PyTorch
  • TensorRT-LLM
  • TinyChat

Test Hardware:

  • A100
  • Jetson Orin
  • RTX 4090

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.