Description: Language Instructed Temporal Assistant
Publisher: NVIDIA
Latest Version: 1.0
Modified: July 23, 2024

LITA: Language Instructed Temporal Assistant Model Card

LITA Model Overview

Description

"Language Instructed Temporal Assistant (LITA) is a Generative AI video understanding model that can do video question answering, video captioning, and event localization. VILA [2] is an Image VLM that does image level question-answering. Vision Temporal Assistant is the LITA model that uses the VILA as encoder. This model is for research and development only.

Terms of use

The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/). Built with Meta Llama 3.

References:

  1. Huang, De-An, et al. "LITA: Language Instructed Temporal-Localization Assistant." arXiv preprint arXiv:2403.19046 (2024).
  2. Lin, Ji, et al. "VILA: On Pre-training for Visual Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  3. Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP-336 + LLAMA-3
Model Version:

  • LITA-llama-3-8b-inference: the LITA-llama3-8B model, packaged for inference.

Input:

Input Type(s): Image, Video, Text
Input Format: Image (.jpg, .png), Video (.mp4, .mov), Text String
Input Parameters: Two-Dimensional (2D)

Output:

Output Type(s): Sequence of characters
Output Format: Text String(s)
Other Properties Related to Output: None

Other Properties Related to Input:

  • Video of any resolution can be used as input; the model resizes frames to 336x336.
  • Videos shorter than 1 minute are recommended for better performance.
  • The input text must not exceed 1,000 words.
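
As a quick illustration of these constraints, here is a minimal sketch that resizes frames to 336x336 and enforces the 1,000-word prompt limit. The helper names are illustrative assumptions, not the model's shipped preprocessing code.

```python
# Illustrative sketch of the documented input constraints; these helpers
# are assumptions for clarity, not LITA's actual preprocessing pipeline.
from PIL import Image

TARGET_SIZE = (336, 336)   # the model resizes all frames to 336x336
MAX_PROMPT_WORDS = 1000    # input text may not exceed 1,000 words

def preprocess_frame(frame: Image.Image) -> Image.Image:
    """Accept a frame of any resolution and resize it to 336x336."""
    return frame.resize(TARGET_SIZE, Image.BICUBIC)

def validate_prompt(prompt: str) -> str:
    """Reject prompts longer than the documented 1,000-word limit."""
    n_words = len(prompt.split())
    if n_words > MAX_PROMPT_WORDS:
        raise ValueError(f"Prompt has {n_words} words; limit is {MAX_PROMPT_WORDS}.")
    return prompt
```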

Software Integration:

Runtime(s): NVIDIA AI Enterprise
Supported Hardware Platform(s): NVIDIA Ampere, NVIDIA Hopper, NVIDIA Jetson, NVIDIA Lovelace, NVIDIA Pascal, and NVIDIA Turing.
Supported Operating System(s): Linux

Training & Finetuning:

Model versions:

  • Trainable_v1.0: The PyTorch LITA model that can be used for fine-tuning.
  • Deployable_v1.0: The PyTorch LITA model for inference.

Inference:

Engine: TensorRT
Test Hardware:

  • L4
  • L40
  • A2
  • A30
  • A100
  • H100

Model Overview

The LITA model is a video understanding model that performs video question answering, video captioning, and event localization. It combines the SigLIP-336 vision encoder with the LLAMA-3 language model, is initialized from a VILA pretrained model, and is finetuned on the video datasets listed below.

Model Architecture

This model is based on a LLaVA-like architecture. It uses a SigLIP vision encoder and the LLAMA-3 language model. At a high level, LITA consists of a vision encoder, a linear projector, and an LLM.
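
A minimal PyTorch sketch of this LLaVA-style layout is shown below. The hidden dimensions (1152 for SigLIP-336, 4096 for LLAMA-3-8B) and the module interfaces are assumptions for illustration, not the actual LITA implementation.

```python
# Sketch of a LLaVA-style VLM: vision encoder -> linear projector -> LLM.
# Dimensions and interfaces are assumptions, not the real LITA code.
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. SigLIP-336
        self.projector = nn.Linear(vision_dim, llm_dim)  # linear projector
        self.llm = llm                                   # e.g. LLAMA-3

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, 3, 336, 336) video frames
        b, t = frames.shape[:2]
        # Assumed encoder interface: returns (b*t, tokens, vision_dim).
        vis = self.vision_encoder(frames.flatten(0, 1))
        vis = self.projector(vis)                 # map into LLM embedding space
        vis = vis.view(b, -1, vis.shape[-1])      # concatenate per-frame tokens
        # Prepend visual tokens to the text embeddings and run the LLM
        # (assumed HF-style `inputs_embeds` interface).
        inputs = torch.cat([vis, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```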

Training

The model is trained using an internal framework. It is initialized from a VILA pretrained model and finetuned on the video datasets listed below.

Training, Testing, and Evaluation Datasets:

Training Dataset:

LITA LLAMA-3 is finetuned from a VILA pretrained model on the following datasets:

  • llava-pretrain
  • activitynet
  • sharegpt4v
  • youcook2
  • nextqa

Properties

The dataset has about 1M images and 30k videos. The images cover diverse domains and concepts encompassing world knowledge.

Performance

Evaluation Data

The model is evaluated on multiple metrics to evaluate the text output quality and the temporal localization performance.

Model: VITA
  • RTL (mIoU / P@0.5): 0.3207 / 0.3205
  • ActivityNet-QA (Accuracy / Score): 56.61% / 3.65
  • MSVD-QA (Accuracy / Score): 76.45% / 4.04
  • GQA: 64.60%
  • MME (Perception / Cognition): 1548 / 375
  • Video ChatGPT: 3.28 / 3.13 / 3.74 / 2.61 / 3.04 / 3.19

Methodology and KPI

The key performance indicator is the harmonic mean (hmean) of detection. The KPIs for the evaluation data are reported in the table above.
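
For reference, the harmonic mean of detection precision and recall (the hmean named above, equivalent to the F1 score) is computed as follows; this is a sketch of the standard metric, not the card's evaluation script.

```python
# Harmonic mean (hmean) of detection precision and recall, i.e. F1.
def hmean(precision: float, recall: float) -> float:
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

print(hmean(0.80, 0.60))  # 0.6857...
```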

Real-time Inference Performance

Hardware: 8 x A100 (VLM batch size = 1, FP16)

Model: VITA
  • Network Precision: FP16
  • Video Duration: ~24 hours
  • Chunk Length: 10 min
  • Num Chunks: 143
  • Average Memory Utilization: 2701 MB (no instance of 100% GPU utilization seen)
  • FPS: 20984.83
  • VLM Pipeline Time: 122.66 sec
  • Summarization Time: 154.14 sec
  • End-to-End Latency: 276.8 sec
  • Milvus Fetch Time: 10.26 ms
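
As a back-of-envelope consistency check, the reported figures relate as sketched below. The 30 fps source frame rate is an assumption; the card only states "~24 hours" of video.

```python
# Consistency check on the figures above. The 30 fps frame rate is an
# assumption; the card only says "~24 hours" of video.
video_seconds = 24 * 60 * 60            # ~24 hour video
chunk_seconds = 10 * 60                 # 10 minute chunks
print(video_seconds // chunk_seconds)   # 144 chunks (reported: 143, so the
                                        # video is slightly under 24 h)

frames = video_seconds * 30             # assumed 30 fps source
print(frames / 122.66)                  # ~21132 FPS vs. reported 20984.83

print(122.66 + 154.14)                  # 276.8 s: end-to-end latency equals
                                        # VLM pipeline + summarization time
```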

Software and Hardware Requirements

This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, with the Vision Insights Agent.

The primary use case for this model is video understanding tasks such as video question answering, video captioning, and event localization.

Limitations

This vision-language model cannot generate visual outputs such as detection boxes, segmentation masks, or classification labels.

Restricted Usage in Different Fields

The NVIDIA LITA model is trained on limited datasets and may not perform well on data from other fields. To achieve better accuracy in a specific field, additional data is usually required to fine-tune the pretrained model with the TAO Toolkit.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.

License

The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/). Built with Meta Llama 3.