VITA: Vision-Language Temporal Assistant Model Card
"Vision-Language Instructed Temporal Assistant (VITA) is a Generative AI video understanding model that can do video question answering, video captioning, and event localization. VILA [2] is an Image VLM that does image level question-answering. Vision-Language Temporal Assistant (VITA) is the LITA [1] model that uses the VILA [2] as encoder. This model is for research and development only.
Architecture Type: Transformer
Network Architecture: VILA-336 + LLAMA-3
Model Version:
Input Type(s): Image, Video, Text
Input Format: .jpg / .png / .mp4 / .mov, Text String
Input Parameters: Two-Dimensional (2D)
Output Type(s): Sequence of characters
Output Format: Text String(s)
Other Properties Related to Output: None
Other Properties Related to Input:
Runtime(s): NVIDIA AI Enterprise
Supported Hardware Platform(s): NVIDIA Ampere, NVIDIA Hopper, NVIDIA Jetson, NVIDIA Lovelace, NVIDIA Pascal, and NVIDIA Turing.
Supported Operating System(s): Linux
Engine: TensorRT
Test Hardware:
The VITA model is a video understanding model that performs video question answering, video captioning, and event localization. It is based on the SigLIP-336 vision encoder and the LLAMA-3 language model and follows a LLaVA-like architecture: at a high level, VITA consists of a vision encoder, a linear projector, and an LLM.
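To make the data flow concrete, below is a minimal PyTorch-style sketch of the LLaVA-like flow described above (vision encoder, linear projector, LLM). The module names, feature dimensions, and the `inputs_embeds` interface are illustrative assumptions, not the released VITA implementation.

```python
# Minimal sketch of the LLaVA-style flow (vision encoder -> linear projector -> LLM).
# All names, dimensions, and the LLM call signature are illustrative assumptions.
import torch
import torch.nn as nn

class VitaLikeModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a SigLIP-336 backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # maps visual features into the LLM embedding space
        self.llm = llm                                    # e.g. a LLaMA-3 causal language model

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, 3, 336, 336) sampled from the input video
        b, t, c, h, w = frames.shape
        feats = self.vision_encoder(frames.view(b * t, c, h, w))  # assumed shape: (b*t, num_patches, vision_dim)
        vis_tokens = self.projector(feats).view(b, -1, self.projector.out_features)
        # Visual tokens are prepended to the text prompt embeddings and decoded by the LLM.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```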
The model is trained with an internal framework. It is trained on the VILA dataset and fine-tuned on the video datasets mentioned below.
VITA is fine-tuned, starting from pretrained VILA model weights, on the following datasets: llava-pretrain, activitynet, sharegpt4v, youcook2, and nextqa.
In total, the training data contains about 1M images and 30K videos. The images cover diverse domains and concepts encompassing world knowledge.
The model is evaluated on multiple metrics that measure text output quality and temporal localization performance.
| Model | RTL | | ActivityNet-QA | | MSVD-QA | | GQA | MME | | Video ChatGPT | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VITA | 0.3207 | 0.3205 | 56.61% | 3.65 | 76.45% | 4.04 | 64.60% | 1548 | 375 | 3.28 | 3.13 | 3.74 | 2.61 | 3.04 | 3.19 |
The key performance indicators (KPIs) for the evaluation data are reported below.
| Model | Precision | Video Duration | Chunk Length (min) | Num Chunks | Average Memory Utilization (when GPU Utilization was 100%) | FPS | VLM Pipeline Time (sec) | Summarization Time (sec) | E2E Latency (sec) | Milvus Fetch Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 x A100, VLM Batch Size = 1, FP16 | | | | | | | | | | |
| VITA | FP16 | ~24 hours | 10 min | 143 | 2701 MB (no instance of 100% GPU utilization seen) | 20984.83 | 122.66 | 154.14 | 276.8 | 10.26 |
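As a quick sanity check on the timing columns (values copied directly from the table above), the end-to-end latency is the sum of the VLM pipeline and summarization times, and 143 chunks of 10 minutes each cover roughly the ~24-hour input video:

```python
# Consistency check of the A100 timing figures reported above.
vlm_pipeline_s = 122.66      # VLM Pipeline Time (sec)
summarization_s = 154.14     # Summarization Time (sec)
e2e_latency_s = 276.8        # E2E Latency (sec)
assert abs((vlm_pipeline_s + summarization_s) - e2e_latency_s) < 0.01  # 122.66 + 154.14 = 276.80

# 143 chunks of 10 minutes each span roughly the ~24-hour input video.
num_chunks, chunk_min = 143, 10
print(num_chunks * chunk_min / 60)   # ~23.8 hours
```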
This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, with the Vision Insights Agent.
The primary use case for this model is video understanding tasks such as video question answering, video captioning, and event localization.
This vision-language model cannot generate vision outputs such as detection boxes, segmentation masks, or classification labels.
The NVIDIA VITA model is trained on limited datasets and may not perform well on data from other domains. To get better accuracy in a specific domain, more data is usually required to fine-tune the pre-trained model with the TAO Toolkit.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.
The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/), Built with Meta Llama 3.