Built with Meta Llama 3.1 - The Meta Llama 3.1 collection of multilingual large language models (LLMs) comprises pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.1 instruction-tuned, text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases.
We downloaded the Meta Llama 3.1 8B Instruct model in PyTorch bfloat16 format from Hugging Face and used AutoAWQ to quantize it to a Meta Llama 3.1 8B PyTorch INT4 model. We then used the ONNX Runtime GenAI SDK to convert the PyTorch INT4 model to a Meta Llama 3.1 8B ONNX INT4 model. The resulting ONNX INT4 model files are posted here.
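The two-step conversion above can be sketched as follows. This is an illustrative outline, not the exact script used: the Hugging Face repo id, output directory names, and quantization settings are assumptions, and it requires the `autoawq`, `transformers`, and `onnxruntime-genai` packages.

```python
"""Sketch of the PyTorch bf16 -> AWQ INT4 -> ONNX INT4 pipeline described above.

The repo id, paths, and quant settings are illustrative assumptions.
"""
import subprocess

HF_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # bfloat16 PyTorch weights (assumed repo id)
AWQ_DIR = "llama3.1-8b-instruct-awq-int4"       # AWQ INT4 PyTorch output (assumed path)
ONNX_DIR = "llama3.1-8b-instruct-onnx-int4"     # final ONNX INT4 output (assumed path)

# 4-bit weight-only AWQ settings; group size 128 is a common AutoAWQ default (assumption).
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantize_to_awq_int4() -> None:
    """Step 1: quantize the PyTorch model to INT4 with AutoAWQ."""
    from awq import AutoAWQForCausalLM          # pip install autoawq
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(HF_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(AWQ_DIR)
    tokenizer.save_pretrained(AWQ_DIR)


def convert_to_onnx_int4() -> None:
    """Step 2: convert the INT4 PyTorch model to ONNX with the
    onnxruntime-genai model builder, targeting the DML execution provider."""
    subprocess.run(
        ["python", "-m", "onnxruntime_genai.models.builder",
         "-i", AWQ_DIR, "-o", ONNX_DIR, "-p", "int4", "-e", "dml"],
        check=True,
    )


if __name__ == "__main__":
    quantize_to_awq_int4()
    convert_to_onnx_int4()
```

Running the script end to end downloads the 8B checkpoint and requires a CUDA-capable GPU for the quantization step.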
This model is not owned or developed by NVIDIA. It has been developed and built to a third-party’s requirements for this application and use case; see the link to the Non-NVIDIA Meta-Llama-3.1-8B-Instruct Model Card.
GOVERNING TERMS: This model is governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License.
ADDITIONAL INFORMATION: Meta Llama 3.1 Community License Agreement, Acceptable Use Policy, and Meta’s Privacy Policy. Built with Meta Llama 3.1.
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
Architecture Type: Transformer
Network Architecture: Llama 3.1
Input Type: Text and Code
Input Format: Text
Input Parameters: Temperature, TopP
Other Properties Related to Input: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
Output Type(s): Text and Code
Output Format: Text and Code
Output Parameters: Max output tokens
Runtime(s): N/A
Supported Hardware Platform(s): RTX 4090. GPUs with 6 GB or more VRAM are recommended; larger context lengths may require more VRAM.
Supported Operating System(s): Windows
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Refer to the [Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) for details.
We used the GenAI ORT->DML backend for inference. Instructions for using this backend are given in the readme.txt file available under the “Files” tab.
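A minimal inference sketch with the onnxruntime-genai Python API is shown below (see readme.txt for the authoritative instructions). The model directory path and search-option values are illustrative assumptions, and API details may vary between onnxruntime-genai versions.

```python
"""Sketch of running the ONNX INT4 model via the GenAI ORT->DML backend.

The model directory and search options are illustrative assumptions.
"""

MODEL_DIR = "model"  # folder containing the ONNX model files (assumed path)


def generate(prompt: str, max_length: int = 256,
             temperature: float = 0.7, top_p: float = 0.9) -> str:
    """Token-by-token generation loop using the onnxruntime-genai API."""
    import onnxruntime_genai as og  # pip install onnxruntime-genai-directml

    model = og.Model(MODEL_DIR)
    tokenizer = og.Tokenizer(model)

    # Temperature and TopP (the input parameters listed above) and the
    # max-output-tokens budget are passed as search options.
    params = og.GeneratorParams(model)
    params.set_search_options(do_sample=True, temperature=temperature,
                              top_p=top_p, max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))


if __name__ == "__main__":
    print(generate("What is the capital of France?"))
```

On Windows the DirectML build (`onnxruntime-genai-directml`) selects the DML execution provider automatically for models converted with `-e dml`.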
MMLU Accuracy (5-shot):
With the GenAI ORT->DML backend, we measured the accuracy below on a desktop RTX 4090 GPU system.
"overall_accuracy": 66.61
Link: https://people.eecs.berkeley.edu/~hendrycks/data.tar
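For reference, a 5-shot MMLU prompt is typically assembled by prepending five solved examples from the dev split to each test question. The sketch below follows the common format from the original MMLU evaluation harness; it is an illustration, not the exact harness used for the number above.

```python
"""Illustrative 5-shot MMLU prompt construction (common harness format)."""

CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one MMLU item; the answer letter is appended for few-shot examples."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer:{' ' + answer if answer else ''}")
    return "\n".join(lines)


def build_5shot_prompt(subject, dev_examples, test_question, test_options):
    """Concatenate 5 solved dev examples, then the unsolved test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(format_example(q, opts, ans)
                        for q, opts, ans in dev_examples[:5])
    return header + shots + "\n\n" + format_example(test_question, test_options)
```

The model's next-token prediction after the trailing "Answer:" is compared against the gold choice letter to score each question.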
Data Collection Method by dataset - Unknown
Labeling Method by dataset - Not Applicable
GPU - RTX 4090 Desktop system
OS - Windows 11 23H2