NVILA HighRes
Description: NVILA visual language model with high-resolution support
Publisher: NVIDIA
Latest Version: nvila-lite-15b-highres-lita
Modified: May 1, 2025
Size: 28.44 GB

Model Overview

Description:

Vision-language (VILA) models support single-image, multi-image, and video reasoning. The VILA family has an augmented series of checkpoints with enhanced vision encoders and large language models (LLMs). New VILA (NVILA) is a family of models with an enhanced vision encoder and LLM that improves on the performance of the previous VILA models.

NVILA-Lite-15B-HighRes-LITA is a variant of NVILA-Lite-15B that can process high-resolution images and videos and adds temporal localization capabilities. Common use cases include captioning, visual Q&A, search, and summarization.
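A minimal loading-and-inference sketch follows, assuming the checkpoint uses the Hugging Face remote-code pattern common to the VILA family; the repository id and the generate_content entry point are assumptions, so check the checkpoint's README for the actual API.

```python
# Minimal inference sketch, assuming the checkpoint follows the Hugging Face
# remote-code pattern used by the VILA family. The repo id and the
# generate_content() entry point are assumptions, not a confirmed API.
from transformers import AutoModel
from PIL import Image

model = AutoModel.from_pretrained(
    "Efficient-Large-Model/NVILA-Lite-15B-HighRes-LITA",  # hypothetical repo id
    trust_remote_code=True,  # VILA-family checkpoints ship custom modeling code
    device_map="auto",
)

image = Image.open("frame.jpg").convert("RGB")  # RGB image input, per the spec below
prompt = "Describe this image in one sentence."

# Assumed multimodal entry point; interleave images/frames with text.
print(model.generate_content([image, prompt]))
```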

NVILA Finetuning Microservices (FTMS) is a visual language model (VLM) finetuning microservice that allows customers to finetune the pre-trained NVILA-Lite-15B-HighRes-LITA video model with video/image-text data at scale. Please see the container card for this offering here.

The model is for research and non-commercial use.

License/Terms of Use

This model has been released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) deed. Additional licensing information for the base pretrained models: Gemma Terms of Use and Gemma Prohibited Use Policy (Google AI for Developers) for PaliGemma 2, and Apache License, Version 2.0 for Qwen2.5.

References:

  • NVILA: Efficient Frontier Visual Language Models
  • LITA: Language Instructed Temporal-Localization Assistant
  • CVPR paper: VILA: On Pre-training for Visual Language Models
  • Vision Language Models

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP, Qwen2.5

Input:

Input Type: Image, Video, Text

Input Format:

Image: Red, Green, Blue (RGB)
Video: MP4
Text: String

Input Parameters:

Image: 2D
Video: 3D
Text: 1D
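
Video inputs are decoded from MP4 into the 3D frame stacks described above before they reach the model. The sketch below shows one common preprocessing approach, uniform frame sampling with OpenCV; the 128-frame default mirrors the @128f evaluation setting reported under Methodology and KPI, and the helper name is illustrative.

```python
# Uniformly sample RGB frames from an MP4 clip for VLM input.
# Illustrative preprocessing sketch; the model's own pipeline may differ.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 128) -> np.ndarray:
    """Return a (num_frames, H, W, 3) uint8 RGB array sampled uniformly."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes BGR
    cap.release()
    return np.stack(frames)

frames = sample_frames("clip.mp4")  # shape: (128, H, W, 3)
```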

Output:

Output Type: Text
Output Format: String
Output Parameters: 1D

Software Integration:

Runtime Engine: HF Trainer 4.46.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating System(s):

Linux

Model Version(s):

NVILA-Lite-15B-HighRes-LITA

Training Dataset:

All datasets used to train the released models come from approved NVIDIA JIRA tickets:

  • Dataset JIRA tickets approved for previous rounds of release: DGPTT-1347, DGPTT-1568, DGPTT-1843, & DGPTT-233.
  • New dataset JIRA ticket approved for this round of release: DGPTT-2846.

We do not plan to release any datasets.

Data Collection Method by dataset:

  • Hybrid: Automated, Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties:

60 million image-text pairs or interleaved image-text content.

Methodology and KPI

Generic Video Benchmarks:

  Benchmark                 Accuracy
  VideoMME w/o Sub @128f    67.3
  VideoMME w/ Sub @128f     70.9

Temporal Localization Benchmarks:

  Benchmark          Mean IoU
  ActivityNet RTL    32.07
  Charades-STA       52.8
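
Mean IoU here is the temporal intersection-over-union between a predicted time interval and the ground-truth interval, averaged over all localization queries. A small self-contained sketch of the metric (interval values and function name are illustrative):

```python
# Temporal IoU between predicted and ground-truth intervals, in seconds.
# Illustrative metric sketch; benchmark harnesses may differ in details.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Mean IoU over a set of localization queries:
preds = [(10.0, 25.0), (3.0, 8.0)]
gts = [(12.0, 24.0), (5.0, 9.0)]
mean_iou = sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
print(f"Mean IoU: {mean_iou:.2f}")
```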

Inference:

Engine:

  • PyTorch
  • TensorRT-LLM
  • TinyChat

Test Hardware:

  • A100
  • Jetson Orin
  • RTX 4090

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.