Vision-language (VILA) models support single-image, multi-image, and video reasoning. VILA models are offered as a series of checkpoints with enhanced vision encoders and large language models (LLMs). New VILA (NVILA) is a family of models with an enhanced vision encoder and LLM that improves on the performance of the previous VILA models.
NVILA-Lite-15B-HighRes-LITA is a variant of NVILA-Lite-15B that can process high-resolution images and videos with temporal localization capabilities. Common use cases include captioning, visual Q&A, search, and summarization.
NVILA Finetuning Microservices (FTMS) is a visual language model (VLM) finetuning microservice that allows customers to finetune the pre-trained NVILA-Lite-15B-HighRes-LITA video model with video/image-text data at scale. Please see the container card for this offering here.
The model is for research and non-commercial use.
This model has been released under the following governing terms: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Additional licensing information for the base pretrained models: the Gemma Terms of Use and Gemma Prohibited Use Policy (Google AI for Developers) for PaliGemma 2, and the Apache License, Version 2.0 for Qwen2.5.
Architecture Type: Transformer
Network Architecture: SigLip, Qwen2.5
Input Type: Image, Video, Text
Input Format:
Image: Red, Green, Blue (RGB)
Video: MP4
Text: String
Input Parameters:
Image: 2D
Video: 3D
Text: 1D
Output Type: Text
Output Format: String
Output Parameters: 1D
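For illustration only, the sketch below assembles inputs in the formats listed above: an RGB image as a 2D pixel grid and a text prompt as a plain string. The inference call at the end is a hypothetical placeholder, not an API documented on this card.

```python
import numpy as np
from PIL import Image  # pip install pillow

# Image input: a 2D grid of RGB pixels (height x width x 3), per the spec above.
# In practice you would load a real file, e.g. Image.open("photo.jpg").convert("RGB");
# a solid dummy image keeps this sketch self-contained.
image = Image.new("RGB", (640, 480), color=(30, 120, 200))
image_array = np.asarray(image)  # shape: (480, 640, 3), dtype uint8

# Text input: a plain string prompt.
prompt = "Describe what is happening in this image."

# Hypothetical inference call -- the actual serving API for
# NVILA-Lite-15B-HighRes-LITA is not specified on this card.
# response_text = model.generate(images=[image_array], prompt=prompt)
```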
Runtime Engine: HF Trainer 4.46.0
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
Linux
NVILA-Lite-15B-HighRes-LITA
All datasets used to train the models we release come from three approved NVIDIA JIRA tickets.
We do not plan to release any datasets.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
60 million image-text pairs or interleaved image-text content.
| Benchmark | Accuracy |
|---|---|
| VideoMME w/o Sub @128f | 67.3 |
| VideoMME w/ Sub @128f | 70.9 |
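Here "@128f" denotes the per-video frame budget: the MP4 is decoded and 128 frames are sampled before being passed to the model. The sketch below shows generic uniform frame sampling with OpenCV; it is an illustrative assumption, not the exact VideoMME evaluation pipeline, and the file path is a placeholder.

```python
import cv2          # pip install opencv-python
import numpy as np

def sample_frames_uniform(video_path: str, num_frames: int = 128) -> np.ndarray:
    """Decode an MP4 and return `num_frames` uniformly spaced RGB frames.

    Returns an array of shape (num_frames, height, width, 3). Generic sketch
    of uniform frame sampling, not the preprocessing used for the numbers above.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick evenly spaced frame indices across the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; convert to RGB to match the input format above.
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames) if frames else np.empty((0,))

# frames = sample_frames_uniform("example.mp4")  # placeholder path; shape (128, H, W, 3)
```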
| Benchmark | Mean IoU |
|---|---|
| ActivityNet RTL | 32.07 |
| Charades-STA | 52.8 |
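Mean IoU measures the overlap between predicted and ground-truth time spans in temporal localization. Below is a minimal sketch of interval IoU averaged over examples; the span values are invented for illustration, and the official ActivityNet RTL and Charades-STA protocols may apply additional rules.

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start_sec, end_sec)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def mean_iou(predictions, ground_truths):
    """Average temporal IoU over paired predicted and ground-truth spans."""
    scores = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative values only, not taken from the benchmarks above.
preds = [(2.0, 9.5), (10.0, 18.0)]
gts = [(3.0, 10.0), (12.0, 17.0)]
print(mean_iou(preds, gts))  # ~0.72
```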
Engine:
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.