Meta Llama 3.2 3B Instruct ONNX INT4 RTX

Publisher
Meta
Latest Version
1.0
Modified
November 15, 2024
Size
2.87 GB

Model Overview

Description:

Built with Llama. The Meta Llama 3.2 collection of multilingual large language models (LLMs) comprises pretrained and instruction-tuned text-to-text generative models in 1B and 3B sizes. The Llama 3.2 instruction-tuned text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open-source and closed chat models on common industry benchmarks.

This model is ready for commercial use.

Llama 3.2 3B Instruct INT4 ONNX

The Llama 3.2 3B Instruct model is quantized to INT4 with AutoAWQ (activation-aware weight quantization) and converted to ONNX using Onnxruntime-GenAI.
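
The exact quantization recipe for this release is not published on this page, but a typical AutoAWQ INT4 workflow looks like the sketch below; the model paths, group size, and other settings are assumptions rather than the values used to build this artifact.

```python
# pip install autoawq transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-3B-Instruct"   # source FP16 checkpoint (placeholder)
quant_path = "llama-3.2-3b-instruct-awq-int4"     # output folder (placeholder)

# Typical AWQ settings: 4-bit weights, group size 128 (assumed, not the published recipe).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Activation-aware weight quantization: calibrate scales on sample text, then pack INT4 weights.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The quantized checkpoint is then exported to ONNX with the Onnxruntime-GenAI model builder (`python -m onnxruntime_genai.models.builder ...`); refer to that project's documentation for the exact options.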

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Llama 3.2 Model Card.

License/Terms of Use:

This model is governed by the NVIDIA AI Foundation Models Community License Agreement. Additional information: Llama 3.2 Community License Agreement and Acceptable Use Policy (Built with Llama).

Reference(s):

The Llama 3 Herd of Models

Llama 3.2 Model Card

Model Architecture:

Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Architecture Type: Large Language Model (LLM)

Network Architecture: Llama

Training, Testing, and Evaluation Datasets:

Refer to the Llama 3.2 Model Card for details.

Input:

Input Type: Text and Code

Input Format: String

Input Parameters: Temperature, Top-P

Other Properties Related to Input: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai

Output:

Output Type(s): Text and Code

Output Format: String

Output Parameters:  Max output tokens
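
Because the input format is a plain string, the instruct model expects the prompt to be rendered in the Llama 3-style chat template before it reaches the runtime. The sketch below illustrates the shape of that input; the authoritative template ships with the Llama 3.2 tokenizer, so the special tokens here are an assumption.

```python
def build_prompt(user_message: str,
                 system_message: str = "You are a helpful assistant.") -> str:
    """Render a single-turn chat as one input string (Llama 3-style template, assumed)."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

# Any supported language can appear in the user turn, e.g. German:
prompt = build_prompt("Fasse diesen Absatz in zwei Sätzen zusammen: ...")
```

Temperature and Top-P (input) and the maximum number of output tokens (output) are not part of the string itself; they are passed to the inference runtime as sampling/search options, as shown in the Inference section below.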

Software Integration:

Runtime(s): Not Applicable

Supported Hardware Platform(s): RTX 4090. GPUs with 6 GB or more VRAM are recommended; more VRAM may be required for larger context-length use cases.

Supported Operating System(s): Windows

Model Version(s): 1.0

Inference:

Inference Backend: Onnxruntime-GenAI-DirectML

(Note: refer to ReadMe.txt for detailed instructions.)
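
A minimal generation loop with the onnxruntime-genai-directml Python package is sketched below. The model folder, prompt, and sampling values are placeholders, and the calls match the 0.4-era API (newer releases feed tokens via generator.append_tokens instead of params.input_ids), so treat ReadMe.txt as the authoritative reference.

```python
import onnxruntime_genai as og

# Folder containing the INT4 ONNX weights and genai_config.json from this download (placeholder path).
model = og.Model("./llama-3.2-3b-instruct-onnx-int4")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Prompt already rendered in the chat template (see the Input section above).
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Write a haiku about GPUs.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
input_ids = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
# The documented input/output parameters: temperature, Top-P, and max output tokens.
params.set_search_options(do_sample=True, temperature=0.7, top_p=0.9, max_length=256)
params.input_ids = input_ids

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Stream each new token back as text.
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```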

Accuracy Scores:

MMLU (5-shot): 58.58

Evaluation Dataset:

Link: https://people.eecs.berkeley.edu/~hendrycks/data.tar

Data Collection Method by dataset = Unknown

Labeling Method by dataset = Not Applicable
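
The evaluation harness used to produce the score above is not specified on this page; the conventional Hendrycks-style 5-shot setup builds each prompt from the dataset's dev split as in the sketch below (file names follow the layout of the data.tar archive, and each CSV row is question, four options, answer letter).

```python
import csv

CHOICES = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    # Row layout in the MMLU CSVs: question, options A-D, correct answer letter.
    question, *options, answer = row
    text = question + "\n" + "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, options)) + "\nAnswer:"
    return text + (f" {answer}\n\n" if include_answer else "")

def five_shot_prompt(subject, dev_rows, test_row):
    # Five worked examples from the dev split, then the unanswered test question.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "".join(format_example(r) for r in dev_rows[:5])
    return header + shots + format_example(test_row, include_answer=False)

with open("data/dev/abstract_algebra_dev.csv", newline="") as f:
    dev_rows = list(csv.reader(f))
with open("data/test/abstract_algebra_test.csv", newline="") as f:
    test_rows = list(csv.reader(f))

print(five_shot_prompt("abstract algebra", dev_rows, test_rows[0]))
```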

Test configuration:

  • GPU: RTX 4090

  • OS: Windows 11, version 23H2

  • NVIDIA graphics driver: R560 or higher