VITA: Vision-Language Temporal Assistant Model Card
"Vision-Language Instructed Temporal Assistant (VITA) is a Generative AI video understanding model that can do video question answering, video captioning, and event localization. VILA [2] is an Image VLM that does image level question-answering. Vision-Language Temporal Assistant (VITA) is the LITA [1] model that uses the VILA [2] as encoder. This model is for research and development only.
Architecture Type: Transformer
Network Architecture: VILA-336 + LLAMA-3
Model Version:
Input Type(s): Image, Video, Text
Input Format: .jpg / .png / .mp4 / .mov, Text String
Input Parameters: Two-Dimensional (2D)
Output Type(s): Sequence of characters
Output Format: Text String(s)
Other Properties Related to Output: None
Other Properties Related to Input:
Runtime(s): NVIDIA AI Enterprise
Supported Hardware Platform(s): NVIDIA Ampere, NVIDIA Hopper, NVIDIA Jetson, NVIDIA Lovelace, NVIDIA Pascal, and NVIDIA Turing.
Supported Operating System(s): Linux
Engine: TensorRT
Test Hardware:
The VITA model is a video understanding model that performs video question answering, video captioning, and event localization. It is based on the SigLIP-336 vision encoder and the LLAMA-3 language model and follows a LLaVA-like architecture: at a high level, VITA consists of a vision encoder, a linear projector, and an LLM.
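To make the data flow concrete, below is a minimal PyTorch-style sketch of the LLaVA-like flow described above (vision encoder, linear projector, LLM). The module names, feature dimensions, and the `inputs_embeds` interface are illustrative assumptions, not the released VITA implementation.

```python
# Minimal sketch of the LLaVA-style flow (vision encoder -> linear projector -> LLM).
# All names, dimensions, and the LLM call signature are illustrative assumptions.
import torch
import torch.nn as nn

class VitaLikeModel(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a SigLIP-336 backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # maps visual features into the LLM embedding space
        self.llm = llm                                    # e.g. a LLaMA-3 causal language model

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, 3, 336, 336) sampled from the input video
        b, t, c, h, w = frames.shape
        feats = self.vision_encoder(frames.view(b * t, c, h, w))  # assumed shape: (b*t, num_patches, vision_dim)
        vis_tokens = self.projector(feats).view(b, -1, self.projector.out_features)
        # Visual tokens are prepended to the text prompt embeddings and decoded by the LLM.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```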
The model is trained with an internal framework. It is trained on the VILA dataset and fine-tuned on the video datasets mentioned below.
VITA is fine-tuned, starting from pretrained VILA model weights, on the following datasets: llava-pretrain, activitynet, sharegpt4v, youcook2, and nextqa.
In total, the training data contains about 1M images and 30K videos. The images cover diverse domains and concepts encompassing world knowledge.
The model is evaluated on multiple metrics that measure text output quality and temporal localization performance.
| Model | RTL | | ActivityNet-QA | | MSVD-QA | | GQA | MME | | Video ChatGPT | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VITA | 0.3207 | 0.3205 | 56.61% | 3.65 | 76.45% | 4.04 | 64.60% | 1548 | 375 | 3.28 | 3.13 | 3.74 | 2.61 | 3.04 | 3.19 |
The key performance indicators (KPIs) for the evaluation data are reported below.
| Model | Precision | Video Duration | Chunk Length (min) | Num Chunks | Average Memory Utilization (when GPU Utilization was 100%) | FPS | VLM Pipeline Time (sec) | Summarization Time (sec) | E2E Latency (sec) | Milvus Fetch Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 x A100, VLM Batch Size = 1, FP16 | | | | | | | | | | |
| VITA | FP16 | ~24 hours | 10 min | 143 | 2701 MB (no instance of 100% GPU utilization seen) | 20984.83 | 122.66 | 154.14 | 276.8 | 10.26 |
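As a quick sanity check on the timing columns (values copied directly from the table above), the end-to-end latency is the sum of the VLM pipeline and summarization times, and 143 chunks of 10 minutes each cover roughly the ~24-hour input video:

```python
# Consistency check of the A100 timing figures reported above.
vlm_pipeline_s = 122.66      # VLM Pipeline Time (sec)
summarization_s = 154.14     # Summarization Time (sec)
e2e_latency_s = 276.8        # E2E Latency (sec)
assert abs((vlm_pipeline_s + summarization_s) - e2e_latency_s) < 0.01  # 122.66 + 154.14 = 276.80

# 143 chunks of 10 minutes each span roughly the ~24-hour input video.
num_chunks, chunk_min = 143, 10
print(num_chunks * chunk_min / 60)   # ~23.8 hours
```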
This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, with the Vision Insights Agent.
The primary use case for this model is video understanding tasks such as video question answering, video captioning, and event localization.
This vision-language model cannot generate vision outputs such as detection boxes, segmentation masks, or classification labels.
The NVIDIA VITA model is trained on limited datasets and may not perform well on data from other domains. To get better accuracy in a specific domain, more data is usually required to fine-tune the pre-trained model with the TAO Toolkit.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.
The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/), Built with Meta Llama 3.