NGC | Catalog

Description

NeVA is a multi-modal vision-language model that understands text and images and generates informative responses.

Curator

NVIDIA

Modified

November 15, 2023

Model Overview

Description:

NeVA is NVIDIA's version of the LLaVA model, in which the open-source LLaMA language model is replaced with a GPT model trained by NVIDIA. At a high level, the image is encoded using a frozen Hugging Face CLIP model and projected to the text embedding dimension. The projected image features are then concatenated with the embeddings of the prompt and passed through the language model. Training happens in two stages:

  • Pretraining: The language model is frozen and only the projection layer (which maps the image encoding into the text embedding space) is trained, using image-caption pairs.
  • Finetuning: The language model is trained along with the projection layer, using synthetic instruction data generated with GPT-4.
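The forward pass described above can be sketched as follows. This is a minimal illustration of the data flow, not NeVA's actual implementation; the dimensions, the `clip_encode` stub, and the plain matrix-multiply projection are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not NeVA's actual sizes).
IMG_TOKENS, CLIP_DIM = 257, 1024   # CLIP patch tokens and feature width
TEXT_DIM = 4096                    # language-model embedding width

def clip_encode(image):
    """Stand-in for the frozen CLIP vision encoder: returns patch features."""
    return rng.standard_normal((IMG_TOKENS, CLIP_DIM))

# The trainable projection layer: maps CLIP features into the text embedding
# space. During pretraining this is the only component that is updated.
W_proj = rng.standard_normal((CLIP_DIM, TEXT_DIM)) * 0.01

def build_lm_input(image, prompt_embeddings):
    image_features = clip_encode(image)        # (257, 1024), frozen encoder
    projected = image_features @ W_proj        # (257, 4096), trained projection
    # Concatenate projected image tokens with the prompt's token embeddings;
    # the result is the sequence fed to the language model.
    return np.concatenate([projected, prompt_embeddings], axis=0)

prompt_embeddings = rng.standard_normal((16, TEXT_DIM))  # 16 prompt tokens
lm_input = build_lm_input(None, prompt_embeddings)
print(lm_input.shape)  # (273, 4096): 257 image tokens + 16 text tokens
```

During finetuning, gradients would also flow into the language model itself, while the CLIP encoder stays frozen in both stages.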

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: GPT + CLIP
Model version: 8B, 22B, 43B

Input:

Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: temperature, max output tokens, quality, toxicity, humor, creativity, violence, helpfulness, not_appropriate
Other Properties Related to Input: None

Output:

Output Format: Text
Output Parameters: None
Other Properties Related to Output: None
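To make the input specification above concrete, here is a hypothetical request payload built from the listed parameters. The field names are taken from the Input section, but the payload structure, value ranges, and the 0-9 attribute scale are assumptions for illustration, not the documented API schema.

```python
# Hypothetical inference request for a NeVA-style endpoint.
# Structure and value ranges are assumed; only the parameter names
# come from the model card's Input section.
request = {
    "prompt": "Describe this image.",
    "image": "<base64-encoded RGB image>",   # placeholder, not real data
    "temperature": 0.2,
    "max_output_tokens": 256,
    # Response-attribute controls (0-9 scale assumed):
    "quality": 9,
    "toxicity": 0,
    "humor": 0,
    "creativity": 5,
    "violence": 0,
    "helpfulness": 9,
    "not_appropriate": 0,
}

# The model's output would be plain text, per the Output section.
print(sorted(k for k in request if k not in ("prompt", "image")))
```

Such attribute knobs steer the generated text toward or away from the named qualities, while `temperature` and `max_output_tokens` control sampling as usual.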

Software Integration:

Runtime(s): N/A
Supported Hardware Platform(s): Hopper, Ampere, Turing
Supported Operating System(s): Linux

Training & Finetuning:

Pretraining Dataset:

Link: CC-3M

Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of CC-3M images and captions, filtered to 595,000 samples.

Dataset License:

Finetuning Dataset:

Link: Synthetic data generated by GPT-4

Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset has 158,000 samples generated synthetically by GPT-4. It consists of a mix of short question-answer pairs, detailed image descriptions, and higher-level reasoning questions.

Dataset License: CC BY-NC 4.0

Inference:

Engine: Triton
Test Hardware: Other