Description: Language Instructed Temporal Assistant
Publisher: NVIDIA
Latest Version: 1.0
Modified: July 23, 2024

LITA: Language Instructed Temporal Assistant Model Card

LITA Model Overview

Description

"Language Instructed Temporal Assistant (LITA) is a Generative AI video understanding model that can do video question answering, video captioning, and event localization. VILA [2] is an Image VLM that does image level question-answering. Vision Temporal Assistant is the LITA model that uses the VILA as encoder. This model is for research and development only.

Terms of use

The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/). Built with Meta Llama 3.

References:

  1. Huang, De-An, et al. "LITA: Language Instructed Temporal-Localization Assistant." arXiv preprint arXiv:2403.19046 (2024).
  2. Lin, Ji, et al. "VILA: On Pre-training for Visual Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  3. Touvron, Hugo, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

Model Architecture:

Architecture Type: Transformer
Network Architecture: SigLIP-336 + LLAMA-3
Model Version:

  • LITA-llama-3-8b-inference: the LITA-llama3-8B model, packaged for inference.

Input:

Input Type(s): Image, Video, Text
Input Format: Image (.jpg, .png), Video (.mp4, .mov), Text String
Input Parameters: Two-Dimensional (2D)

Output:

Output Type(s): Sequence of characters
Output Format: Text String(s)
Other Properties Related to Output: None

Other Properties Related to Input:

  • Video of any resolution can be used as input; the model resizes frames to 336x336.
  • Videos shorter than 1 minute are recommended for better performance.
  • The input text must not exceed 1,000 words.
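
As a quick illustration of these constraints, here is a minimal sketch that resizes frames to 336x336 and enforces the 1,000-word prompt limit. The helper names are illustrative assumptions, not the model's shipped preprocessing code.

```python
# Illustrative sketch of the documented input constraints; these helpers
# are assumptions for clarity, not LITA's actual preprocessing pipeline.
from PIL import Image

TARGET_SIZE = (336, 336)   # the model resizes all frames to 336x336
MAX_PROMPT_WORDS = 1000    # input text may not exceed 1,000 words

def preprocess_frame(frame: Image.Image) -> Image.Image:
    """Accept a frame of any resolution and resize it to 336x336."""
    return frame.resize(TARGET_SIZE, Image.BICUBIC)

def validate_prompt(prompt: str) -> str:
    """Reject prompts longer than the documented 1,000-word limit."""
    n_words = len(prompt.split())
    if n_words > MAX_PROMPT_WORDS:
        raise ValueError(f"Prompt has {n_words} words; limit is {MAX_PROMPT_WORDS}.")
    return prompt
```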

Software Integration:

Runtime(s): NVIDIA AI Enterprise
Supported Hardware Platform(s): NVIDIA Ampere, NVIDIA Hopper, NVIDIA Jetson, NVIDIA Lovelace, NVIDIA Pascal, and NVIDIA Turing.
Supported Operating System(s): Linux

Training & Finetuning:

Model versions:

  • Trainable_v1.0: The PyTorch LITA model that can be used for fine-tuning.
  • Deployable_v1.0: The PyTorch LITA model for inference.

Inference:

Engine: TensorRT
Test Hardware:

  • L4
  • L40
  • A2
  • A30
  • A100
  • H100

Model Overview

The LITA model is a video understanding model that performs video question answering, video captioning, and event localization. It combines the SigLIP-336 vision encoder with the LLAMA-3 language model, is initialized from a VILA pretrained model, and is finetuned on the video datasets listed below.

Model Architecture

This model is based on a LLaVA-like architecture. It uses a SigLIP vision encoder and the LLAMA-3 language model. At a high level, LITA consists of a vision encoder, a linear projector, and an LLM.
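
A minimal PyTorch sketch of this LLaVA-style layout is shown below. The hidden dimensions (1152 for SigLIP-336, 4096 for LLAMA-3-8B) and the module interfaces are assumptions for illustration, not the actual LITA implementation.

```python
# Sketch of a LLaVA-style VLM: vision encoder -> linear projector -> LLM.
# Dimensions and interfaces are assumptions, not the real LITA code.
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. SigLIP-336
        self.projector = nn.Linear(vision_dim, llm_dim)  # linear projector
        self.llm = llm                                   # e.g. LLAMA-3

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, 3, 336, 336) video frames
        b, t = frames.shape[:2]
        # Assumed encoder interface: returns (b*t, tokens, vision_dim).
        vis = self.vision_encoder(frames.flatten(0, 1))
        vis = self.projector(vis)                 # map into LLM embedding space
        vis = vis.view(b, -1, vis.shape[-1])      # concatenate per-frame tokens
        # Prepend visual tokens to the text embeddings and run the LLM
        # (assumed HF-style `inputs_embeds` interface).
        inputs = torch.cat([vis, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```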

Training

The model is trained using an internal framework. It is initialized from a VILA pretrained model and finetuned on the video datasets listed below.

Training, Testing, and Evaluation Datasets:

Training Dataset:

LITA LLAMA-3 is finetuned from a VILA pretrained model on the following datasets:

  • llava-pretrain
  • activitynet
  • sharegpt4v
  • youcook2
  • nextqa

Properties

The dataset has about 1M images and 30k videos. The images cover diverse domains and concepts encompassing world knowledge.

Performance

Evaluation Data

The model is evaluated on multiple metrics to evaluate the text output quality and the temporal localization performance.

Model: VITA
  • RTL (mIoU / P@0.5): 0.3207 / 0.3205
  • ActivityNet-QA (Accuracy / Score): 56.61% / 3.65
  • MSVD-QA (Accuracy / Score): 76.45% / 4.04
  • GQA: 64.60%
  • MME (Perception / Cognition): 1548 / 375
  • Video ChatGPT: 3.28 / 3.13 / 3.74 / 2.61 / 3.04 / 3.19

Methodology and KPI

The key performance indicator is the harmonic mean (hmean) of detection. The KPIs for the evaluation data are reported in the table above.
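
For reference, the harmonic mean of detection precision and recall (the hmean named above, equivalent to the F1 score) is computed as follows; this is a sketch of the standard metric, not the card's evaluation script.

```python
# Harmonic mean (hmean) of detection precision and recall, i.e. F1.
def hmean(precision: float, recall: float) -> float:
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

print(hmean(0.80, 0.60))  # 0.6857...
```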

Real-time Inference Performance

Hardware: 8 x A100 (VLM batch size = 1, FP16)

Model: VITA
  • Network Precision: FP16
  • Video Duration: ~24 hours
  • Chunk Length: 10 min
  • Num Chunks: 143
  • Average Memory Utilization: 2701 MB (no instance of 100% GPU utilization seen)
  • FPS: 20984.83
  • VLM Pipeline Time: 122.66 sec
  • Summarization Time: 154.14 sec
  • End-to-End Latency: 276.8 sec
  • Milvus Fetch Time: 10.26 ms
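
As a back-of-envelope consistency check, the reported figures relate as sketched below. The 30 fps source frame rate is an assumption; the card only states "~24 hours" of video.

```python
# Consistency check on the figures above. The 30 fps frame rate is an
# assumption; the card only says "~24 hours" of video.
video_seconds = 24 * 60 * 60            # ~24 hour video
chunk_seconds = 10 * 60                 # 10 minute chunks
print(video_seconds // chunk_seconds)   # 144 chunks (reported: 143, so the
                                        # video is slightly under 24 h)

frames = video_seconds * 30             # assumed 30 fps source
print(frames / 122.66)                  # ~21132 FPS vs. reported 20984.83

print(122.66 + 154.14)                  # 276.8 s: end-to-end latency equals
                                        # VLM pipeline + summarization time
```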

Software and Hardware Requirements

This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, with the Vision Insights Agent.

The primary use case for this model is video understanding tasks such as video question answering, video captioning, and event localization.

Limitations

This vision-language model cannot generate visual outputs such as detection boxes, segmentation masks, or classification labels.

Restricted Usage in Different Fields

The NVIDIA LITA model is trained on limited datasets and may not perform well on data from other fields. To achieve better accuracy in a specific field, additional data is usually required to fine-tune the pretrained model with the TAO Toolkit.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.

License

The model is governed by the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/). ADDITIONAL INFORMATION: Meta Llama 3 Community License (https://llama.meta.com/llama3/license/). Built with Meta Llama 3.