NeVA is NVIDIA's version of the LLaVA model, with the open-source LLaMA model replaced by a GPT model trained by NVIDIA. At a high level, the image is encoded by a frozen Hugging Face CLIP model and projected into the text embedding space. The projected image embeddings are then concatenated with the embeddings of the text prompt and passed through the language model. Training happens in two stages: a pretraining stage on image-caption pairs (the filtered CC-3M set described below), followed by a fine-tuning stage on synthetic instruction-following data generated by GPT-4.
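To make this data flow concrete, below is a minimal PyTorch sketch of the forward pass. The module name, the specific CLIP checkpoint, and the embedding dimensions are illustrative assumptions, not NeVA's actual implementation:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

# Hypothetical dimensions; the real sizes depend on the model variant (8B/22B/43B).
CLIP_DIM = 1024   # hidden size of the frozen CLIP vision encoder
TEXT_DIM = 4096   # text embedding size of the GPT language model

class NeVAForwardSketch(nn.Module):
    """Illustrative sketch of NeVA's image-to-LLM data flow, not the real implementation."""

    def __init__(self, language_model: nn.Module, text_embeddings: nn.Embedding):
        super().__init__()
        # Frozen CLIP vision encoder from Hugging Face (checkpoint name is an assumption).
        self.vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
        self.vision_encoder.requires_grad_(False)
        # Trainable linear projection from CLIP space into the text embedding space.
        self.projection = nn.Linear(CLIP_DIM, TEXT_DIM)
        self.text_embeddings = text_embeddings  # embedding table of the GPT model
        self.language_model = language_model    # stand-in for the NVIDIA-trained GPT

    def forward(self, pixel_values: torch.Tensor, prompt_ids: torch.Tensor):
        # Encode the RGB image with the frozen CLIP encoder.
        with torch.no_grad():
            patch_features = self.vision_encoder(pixel_values).last_hidden_state
        # Project image features to the text embedding dimension.
        image_embeds = self.projection(patch_features)    # (B, num_patches, TEXT_DIM)
        # Embed the text prompt and prepend the projected image embeddings.
        prompt_embeds = self.text_embeddings(prompt_ids)  # (B, seq_len, TEXT_DIM)
        inputs_embeds = torch.cat([image_embeds, prompt_embeds], dim=1)
        # Pass the combined sequence through the language model.
        return self.language_model(inputs_embeds=inputs_embeds)
```

Under the two-stage scheme described above, only `self.projection` would receive gradients during pretraining; fine-tuning would additionally unfreeze the language model.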
Architecture Type: Transformer
Network Architecture: GPT + CLIP
Model version: 8B, 22B, 43B
Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: temperature, max output tokens, quality, toxicity, humor, creativity, violence, helpfulness, not_appropriate (see the example request after this list)
Other Properties Related to Input: None
Output Format: Text
Output Parameters: None
Other Properties Related to Output: None
Runtime(s): N/A
Supported Hardware Platform(s): Hopper, Ampere, Turing
Supported Operating System(s): Linux
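As an illustration of how the input parameters listed above might be supplied, here is a hypothetical request payload. The endpoint URL, field names, and attribute values are assumptions for illustration, not NeVA's documented API:

```python
import requests  # hypothetical REST call; endpoint and schema are illustrative only

payload = {
    "messages": [
        {"role": "user", "content": "Describe this image.", "image": "<base64-encoded RGB image>"}
    ],
    # Sampling controls listed under Input Parameters above.
    "temperature": 0.2,
    "max_tokens": 512,
    # Steering attributes listed under Input Parameters above (values are illustrative).
    "quality": 9,
    "toxicity": 0,
    "humor": 0,
    "creativity": 5,
    "violence": 0,
    "helpfulness": 9,
    "not_appropriate": 0,
}

response = requests.post("https://example.invalid/v1/neva/generate", json=payload)
print(response.json()["text"])  # Output Format: Text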
Link: CC-3M
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of CC-3M image-caption pairs, filtered down to 595,000 samples.
Dataset License:
Link: Synthetic data generated by GPT-4
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of 158,000 samples generated synthetically by GPT-4. It is a mix of short question-answer pairs, detailed image descriptions, and higher-level reasoning questions.
Dataset License: CC BY-NC 4.0
Engine: Triton
Test Hardware: Other