Linux / amd64
NVILA is a Vision Language Model (VLM) developed by NVIDIA that achieves state-of-the-art image and video understanding. Several works have followed VILA, including VILA^2, LongVILA, and NVILA. This container card walks through the tools required to finetune the high-resolution video NVILA model using two popular approaches: LoRA (Low-Rank Adaptation) and full finetuning.
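For intuition, LoRA freezes the pre-trained weights and trains only a small low-rank update on top of them, which is why it is far cheaper than full finetuning. Below is a minimal PyTorch sketch of that idea; the `LoRALinear` class and the `rank`/`alpha` values are illustrative assumptions, not the microservice's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: frozen base weight plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        # The pre-trained weight stays frozen during finetuning
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: effective weight is W + (alpha/rank) * B @ A
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero-init: no change at step 0
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```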
The license for the container is included in its banner. Licenses for the pre-trained models are available with the model cards on NGC. By pulling and using the VLM container, you accept the terms and conditions of the NVIDIA Software Evaluation License Agreement.
NVILA FTMS is a visual language model (VLM) finetuning microservice that lets customers finetune the pre-trained NVILA-Lite-15B high-resolution video model on video/image-text data at scale, enabling multi-image and video VLMs for user-specific downstream use cases.
The NVILA FTMS EA package comprises the containers, pre-trained model, and getting-started resources listed below.
All containers needed to run the finetuning microservice can be pulled from this location. See the list below for all available containers in this registry.
| Container Type | `container_name:tag` |
|---|---|
| NVILA Finetuning Microservice - Early Access | `nvcr.io/nvidia/tao/vlm-finetuning-ea:0.3.0-ea` |
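Assuming you have logged in to nvcr.io with an NGC API key (`docker login nvcr.io`), the container can be pulled with the standard Docker command:

```
docker pull nvcr.io/nvidia/tao/vlm-finetuning-ea:0.3.0-ea
```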
| Model Name | Link |
|---|---|
| NVILA-Lite-15B-HighRes | `nvidia/tao/nvila:nvila-lite-15b-highres-lita` |
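The pre-trained model can be fetched with the NGC CLI, assuming it is installed and configured; the command below follows the CLI's usual `download-version` pattern with the model path from the table above:

```
ngc registry model download-version "nvidia/tao/nvila:nvila-lite-15b-highres-lita"
```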
Note: Video Search and Summarization (VSS) currently does not support LoRA adapter injection at inference time. You must merge the LoRA weights into the base model before using it with VSS.
Use the following script to load the base model and merge in the LoRA weights:
```python
import argparse

import llava
import torch


def parse_config():
    parser = argparse.ArgumentParser(description="Merge LoRA weights into an NVILA base model")
    parser.add_argument("--model_base", type=str, default="nvila_vnvila-lite-15b-highres-lita",
                        help="Name or path of the base model")
    parser.add_argument("--model_path", type=str, default=None,
                        help="Path to the finetuned LoRA checkpoint")
    parser.add_argument("--save_path", type=str, default=None,
                        help="Directory in which to save the merged model")
    return parser.parse_args()


def main():
    args = parse_config()
    device = "cuda:0"
    torch.cuda.set_device(device)
    # Loading with model_base merges the LoRA weights into the base model
    model = llava.load(args.model_path, model_base=args.model_base)
    model.save_pretrained(args.save_path)


if __name__ == "__main__":
    main()
```
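Assuming the script is saved as `merge_lora.py` (a placeholder filename), a typical invocation looks like the following, with the checkpoint and output paths replaced by your own:

```
python merge_lora.py \
    --model_base nvila_vnvila-lite-15b-highres-lita \
    --model_path /path/to/lora_checkpoint \
    --save_path /path/to/merged_model
```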
| NGC Resource | Link |
|---|---|
| VLM Getting Started - Early Access | `nvidia/tao/vlm-getting-started-ea:0.2.0-ea` |
Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0.
More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone.
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended. Please report security vulnerabilities or NVIDIA AI Concerns here.
Security Vulnerabilities in Open Source Packages
Please review the Security Scanning (LINK) tab to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning (LINK) tab.