NGC Catalog
NVILA Finetuning Microservice - Early Access

Description
Container to help get started with NVILA Finetuning Early Access features.
Publisher
NVIDIA
Latest Tag
0.3.0-ea
Modified
June 9, 2025
Compressed Size
13.09 GB
Multinode Support
Yes
Multi-Arch Support
No
0.3.0-ea (Latest) Security Scan Results

Linux / amd64


NVILA Finetuning Microservice - Early Access

NVILA is a vision language model developed by NVIDIA that achieves state-of-the-art image and video understanding. It builds on a line of work that includes VILA, VILA^2, and LongVILA. This container card walks through the tools required to finetune the high-resolution video NVILA model using two popular approaches: LoRA (Low-Rank Adaptation) and full finetuning.
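For background, LoRA finetunes a small pair of low-rank matrices alongside each frozen weight; "merging" folds that update back into the dense weight, which is what the merge script later on this page does. A minimal sketch (the shapes, rank, and alpha below are illustrative, not NVILA's actual dimensions):

```python
import torch

# Illustrative LoRA merge: shapes, rank, and alpha are made-up values.
d_out, d_in, r, alpha = 8, 8, 2, 4

W = torch.randn(d_out, d_in)   # frozen base weight
A = torch.randn(r, d_in)       # LoRA down-projection (learned)
B = torch.randn(d_out, r)      # LoRA up-projection (zero-initialized before training)

scaling = alpha / r
W_merged = W + scaling * (B @ A)  # fold the adapter update into the base weight

# The merged weight reproduces base output plus adapter output exactly
x = torch.randn(d_in)
assert torch.allclose(W_merged @ x, W @ x + scaling * (B @ (A @ x)), atol=1e-5)
```

After merging, inference needs no adapter machinery, which is why downstream services that lack LoRA injection support require a merged checkpoint.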

License

The license for the container is included in the banner of the container. Licenses for the pre-trained models are available with the model cards on NGC. By pulling and using the VLM container, you accept the terms and conditions of the NVIDIA Software Evaluation License Agreement.

NVILA Finetuning MS EA

NVILA FTMS is a visual language model (VLM) finetuning microservice that lets customers finetune a pre-trained NVILA-Lite-15B high-resolution video model with video/image-text data at scale, enabling multi-image and video VLMs for user-specific downstream use cases.

The NVILA FTMS EA package comprises:

  • A finetuning microservice container with scripts and APIs to finetune an NVILA-Lite-15B High Res LITA model
  • A sample tutorial notebook that walks through the end-to-end finetuning workflow for the NVILA-Lite-15B High Res LITA model
  • A pre-trained NVILA-Lite-15B High Res VLM with LITA

Containers

All containers needed to run the finetuning microservice can be pulled from this registry. See the table below for all available containers.

Container Type                                  container_name:tag
NVILA Finetuning Microservice - Early Access    nvcr.io/nvidia/tao/vlm-finetuning-ea:0.3.0-ea
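The container can be pulled with Docker using the coordinates above (you must first authenticate against nvcr.io with your NGC API key); the pull is guarded below so the snippet is safe to copy-paste on a machine without Docker:

```shell
# Image coordinates from the table above
IMAGE="nvcr.io/nvidia/tao/vlm-finetuning-ea:0.3.0-ea"

# Log in first with your NGC API key:
#   docker login nvcr.io -u '$oauthtoken'
if command -v docker >/dev/null 2>&1; then
    docker pull "$IMAGE"
fi
```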

Pre-trained Models

Model Name               Link
NVILA-Lite-15B-HighRes   nvidia/tao/nvila:nvila-lite-15b-highres-lita
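The pre-trained model can be fetched with the NGC CLI using the coordinates above (assuming the `ngc` CLI is installed and configured with your API key); the download is guarded so the snippet runs safely where the CLI is absent:

```shell
# Model coordinates from the table above
MODEL="nvidia/tao/nvila:nvila-lite-15b-highres-lita"

if command -v ngc >/dev/null 2>&1; then
    ngc registry model download-version "$MODEL"
fi
```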

Merge LoRA with Base Model

Note: Video Search and Summarization (VSS) currently does not support LoRA adapter injection at inference time. You must merge the LoRA weights into the base model before using it with VSS.

Use the following script to load a base model and integrate LoRA weights:

import argparse

import torch

import llava


def parse_config():
    parser = argparse.ArgumentParser(description="Merge LoRA weights into a base NVILA model")
    parser.add_argument("--model_base", type=str, default="nvila_vnvila-lite-15b-highres-lita",
                        help="Name or path of the base model")
    parser.add_argument("--model_path", type=str, default=None,
                        help="Path to the finetuned LoRA checkpoint")
    parser.add_argument("--save_path", type=str, default=None,
                        help="Directory to write the merged model to")
    return parser.parse_args()


def main():
    args = parse_config()
    device = "cuda:0"
    torch.cuda.set_device(device)

    # Loading with model_base set merges the LoRA weights into the base model
    model = llava.load(args.model_path, model_base=args.model_base)
    model.save_pretrained(args.save_path)


if __name__ == "__main__":
    main()
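Assuming the script above is saved as merge_lora.py (the filename and paths below are hypothetical placeholders, not shipped defaults), a typical invocation looks like this; the call is guarded so the snippet does nothing if the script is not present:

```shell
# Hypothetical paths -- substitute your own checkpoint and output locations
LORA_CKPT=/workspace/results/lora_checkpoint
MERGED_OUT=/workspace/results/merged_model

if [ -f merge_lora.py ]; then
    python merge_lora.py \
        --model_base nvila_vnvila-lite-15b-highres-lita \
        --model_path "$LORA_CKPT" \
        --save_path "$MERGED_OUT"
fi
```

The merged checkpoint written to the save path can then be pointed at directly by VSS, with no adapter files needed at inference.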

Resources

NGC Resource                         Link
VLM Getting Started - Early Access   nvidia/tao/vlm-getting-started-ea:0.2.0-ea

Technical blogs

  • Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0

  • NVIDIA Announces Nemotron Model Families to Advance Agentic AI
  • Visual Language Models on NVIDIA Hardware with VILA
  • Vision Language Model Prompt Engineering Guide for Image and Video Understanding
  • Build Multimodal Visual AI Agents Powered by NVIDIA NIM

Suggested reading

More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone.

  • Vision Language Models
  • NVILA: Efficient Frontier Visual Language Models

Ethical Considerations

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended. Please report security vulnerabilities or NVIDIA AI Concerns here.

Security

Security Vulnerabilities in Open Source Packages: Please review the Security Scanning tab to view the latest security scan results. For certain open-source vulnerabilities listed in the scan results, NVIDIA provides a response in the form of a Vulnerability Exploitability eXchange (VEX) document. The VEX information can be reviewed and downloaded from the Security Scanning tab.