# Model Overview ## Description: The Nemovision-4B-v2-Instruct model uses Mistral-NeMo-Minitron-4B-Instruct language model and RADIO vision encoder to be performant on a broad range of RTX GPUs with the accuracy developers need. The vision language model is based on VILA VLM architecture and trained with the VILA and NeMo frameworks and datasets. This is a model for generating responses for roleplaying, retrieval augmented generation, and function calling with vision understanding and reasoning capabilities. This model is ready for commercial use. ## License/Terms of Use The use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/) ## Model Architecture: **Architecture Type:** Transformer
**Network Architecture** - Vision Encoder: radio:768:nvidia/C-RADIO - Language Encoder: MN-Minitron-4B-128k-Instruct ## Input **Input Type(s):** Video, Image(s), Text
**Input Format(s):** Video (.mp4), Image (Red, Green, Blue (RGB)), and Text (String)
**Input Parameters:** Video (3D), Image (2D), Text (1D)
**Other Properties Related to Input:** The model has a maximum of 8192 input tokens.
## Output **Output Type(s):** Text
**Output Format(s):** String
**Output Parameters:** 1D
**Other Properties Related to Input:** The model has a maximum of 8192 input tokens. Maximum output for both versions can be set apart from input.
## Prompt Format: **Single Turn** ``` ~~System {system prompt}~~ ~~User {prompt}~~ Assistant\n ``` ``` ~~System {system prompt}~~ ~~User {prompt}~~ Assistant\n ``` **Multi-image** ``` ~~System {system prompt}~~ ~~User {prompt}~~ ~~Assistant\n ``` **Multi-Turn or Few-shot** ``` ~~System {system prompt}~~ [...]~~ ~~User {prompt}~~ ~~Assistant [ ... ]~~ ~~User {prompt}~~ Assistant\n ``` ## Software Integration: **Runtime(s):** AI Inference Manager (NVAIM) Version 1.0.0
**Supported Hardware Microarchitecture Compatibility:** GPU supporting DirectX 11/12 and Vulkan 1.2 or higher
**[Preferred/Supported] Operating System(s):**
* Windows
## Software Integration: (Cloud) **[Preferred/Supported] Operating System(s):**
* Linux
# Training & Evaluation: ## Training Dataset: NV-Pretraining and NV-VILA-SFT data were used. Additionally,the following datasets were used: * [OASST1](https://huggingface.co/datasets/OpenAssistant/oasst1) * [OASST2](https://huggingface.co/datasets/OpenAssistant/oasst2) * [Localized Narratives](https://google.github.io/localized-narratives/) * [TextCaps](https://textvqa.org/textcaps/dataset/) * [TextVQA](https://textvqa.org/dataset/) * [RefCOCO](https://github.com/lichengunc/refer) * [VQAv2](https://visualqa.org/) * [GQA](https://cs.stanford.edu/people/dorarad/gqa/index.html) * [SynthDoG-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en) * [A-OKVQ](https://github.com/allenai/aokvqa) * [WIT](https://github.com/google-research-datasets/wit) * [CLEVR](https://cs.stanford.edu/people/jcjohns/clevr/) * [CLEVR-X](https://github.com/ExplainableML/CLEVR-X) * [CLEVR-Math](https://huggingface.co/datasets/dali-does/clevr-math) * [ScreenQA](https://github.com/google-research-datasets/screen_qa) * [WikiSQL](https://github.com/salesforce/WikiSQL) * [WikiTablQuestions](https://github.com/ppasupat/WikiTableQuestions/) * [RenderedText](https://github.com/GbotHQ/ocr-dataset-rendering/) * [FinQA](https://github.com/czyssrs/FinQA) * [TAT-QA](https://nextplusplus.github.io/TAT-QA/) * [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) * [Websight](https://huggingface.co/datasets/HuggingFaceM4/WebSight) * [RAVEN](https://github.com/WellyZhang/RAVEN) * [VizWiz](https://vizwiz.org/tasks-and-datasets/vqa/) * [Inter-GPS](https://github.com/lupantech/InterGPS) * [YouCook2](http://youcook2.eecs.umich.edu/) * [ActivityNet Captions](https://cs.stanford.edu/people/ranjaykrishna/densevid/) * [Video Localized Narratives](https://google.github.io/video-localized-narratives/) * [CLEVRER](https://google.github.io/video-localized-narratives/) * [Perception Test](https://github.com/google-deepmind/perception_test) * [Next-QA](https://github.com/doc-doc/NExT-QA) **Data Collection Method by dataset:**
- Hybrid: Automated, Human
**Labeling Method by dataset:**
- Hybrid: Automated, Human
**Properties:**
NV-Pretraining data was collected from 5M subsampled NV-CLIP dataset. Stage 3 NV-SFT data has 2.8M images and 3.58M annotations on images that only have commercial license. Additionally, 355K videos with commercial license and 400K annotations on videos were used. ## Evaluation Dataset: **Data Collection Method by dataset:**
- Hybrid: Human, Automatic/Sensors
**Labeling Method by dataset:**
- Hybrid: Human, Automatic/Sensors
**Properties:** A collection of different benchmarks, including academic VQA benchmarks and recent benchmarks specifically proposed for language understanding and reasoning, instruction-following, and function calling LMMs.
* [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html) * [ScienceQA Image](https://scienceqa.github.io/) * [Text VQA](https://textvqa.org/) * [POPE](https://github.com/AoiDragon/POPE) * [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models) * [SEED-Bench](https://github.com/AILab-CVC/SEED-Bench) * [MMMU](https://mmmu-benchmark.github.io/) * [Video MME](https://video-mme.github.io/home_page.html) * [Egoschema](https://egoschema.github.io/) * [Perception Test](https://github.com/google-deepmind/perception_test) * [IF-Eval](https://github.com/google-research/google-research/tree/master/instruction_following_eval) Image Benchmarks |Benchmark|GQA |SQA Image|Text VQA|POPE (Popular)|MME_sum |SEED |SEED Image|MMMU val (beam 5)| |---------|------|---------|--------|--------------|--------------|-----|----------|-----------------| |Accuracy |60.78 |76.1 |75.48 |88.33 |1842.7 |69.98|74 |41.22 | Video benchmarks |Benchmark|VideoMME w/o Sub @32f|VideoMME w/ Sub @32f|Egoschema (val)|Perception Test| |---------|---------------------|--------------------|---------------|----------------------| |Accuracy |53.11 |57.7 |58.6 |65.63 | Text Benchmarks |Benchmark|IFEval |MMLU(5-shot) |GSM8K |MBPP | |---------|---------------------|-------------|------------|------| |Accuracy |54.34 |64.98 |63.76 |59.14 | ## Inference: **Framework:**
* PyTorch **Test Hardware:**
* H100
* A100
* A10g
* L40s
**Supported Hardware Platform(s):** L40s, A10g, A100, H100
## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards . Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).