Built with Meta Llama 3.1 - The Meta Llama 3.1 collection of multilingual large language models (LLMs) comprises pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes (text in/text out). The Llama 3.1 instruction-tuned, text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases.
We downloaded the Meta Llama 3.1 8B Instruct model in PyTorch bfloat16 format from Hugging Face and used AutoAWQ to quantize it to a Meta Llama 3.1 8B PyTorch INT4 model. We then used the ONNX Runtime GenAI SDK to convert the PyTorch INT4 model to a Meta Llama 3.1 8B ONNX INT4 model. The resulting ONNX INT4 model files are posted here.
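The two-step conversion above can be sketched as follows. This is an illustrative outline, not the exact script used: the Hugging Face repo id, output directory names, and quantization settings are assumptions, and it requires the `autoawq`, `transformers`, and `onnxruntime-genai` packages.

```python
"""Sketch of the PyTorch bf16 -> AWQ INT4 -> ONNX INT4 pipeline described above.

The repo id, paths, and quant settings are illustrative assumptions.
"""
import subprocess

HF_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # bfloat16 PyTorch weights (assumed repo id)
AWQ_DIR = "llama3.1-8b-instruct-awq-int4"       # AWQ INT4 PyTorch output (assumed path)
ONNX_DIR = "llama3.1-8b-instruct-onnx-int4"     # final ONNX INT4 output (assumed path)

# 4-bit weight-only AWQ settings; group size 128 is a common AutoAWQ default (assumption).
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}


def quantize_to_awq_int4() -> None:
    """Step 1: quantize the PyTorch model to INT4 with AutoAWQ."""
    from awq import AutoAWQForCausalLM          # pip install autoawq
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(HF_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(AWQ_DIR)
    tokenizer.save_pretrained(AWQ_DIR)


def convert_to_onnx_int4() -> None:
    """Step 2: convert the INT4 PyTorch model to ONNX with the
    onnxruntime-genai model builder, targeting the DML execution provider."""
    subprocess.run(
        ["python", "-m", "onnxruntime_genai.models.builder",
         "-i", AWQ_DIR, "-o", ONNX_DIR, "-p", "int4", "-e", "dml"],
        check=True,
    )


if __name__ == "__main__":
    quantize_to_awq_int4()
    convert_to_onnx_int4()
```

Running the script end to end downloads the 8B checkpoint and requires a CUDA-capable GPU for the quantization step.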
This model is not owned or developed by NVIDIA. It has been developed and built to a third-party’s requirements for this application and use case; see the link to the Non-NVIDIA Meta-Llama-3.1-8B-Instruct Model Card.
GOVERNING TERMS: This model is governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License.
ADDITIONAL INFORMATION: Meta Llama 3.1 Community License Agreement, Acceptable Use Policy, and Meta’s Privacy Policy. Built with Meta Llama 3.1.
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
Architecture Type: Transformer
Network Architecture: Llama 3.1
Input Type: Text and Code
Input Format: Text
Input Parameters: Temperature, TopP
Other Properties Related to Input: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
Output Type(s): Text and Code
Output Format: Text and Code
Output Parameters: Max output tokens
Runtime(s): N/A
Supported Hardware Platform(s): RTX 4090. GPUs with 6 GB or more VRAM are recommended; larger context lengths may require more VRAM.
Supported Operating System(s): Windows
Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Refer to the [Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md) for details.
We used the GenAI ORT->DML backend for inference. Instructions for using this backend are given in the readme.txt file available under the “Files” tab.
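A minimal inference sketch with the onnxruntime-genai Python API is shown below (see readme.txt for the authoritative instructions). The model directory path and search-option values are illustrative assumptions, and API details may vary between onnxruntime-genai versions.

```python
"""Sketch of running the ONNX INT4 model via the GenAI ORT->DML backend.

The model directory and search options are illustrative assumptions.
"""

MODEL_DIR = "model"  # folder containing the ONNX model files (assumed path)


def generate(prompt: str, max_length: int = 256,
             temperature: float = 0.7, top_p: float = 0.9) -> str:
    """Token-by-token generation loop using the onnxruntime-genai API."""
    import onnxruntime_genai as og  # pip install onnxruntime-genai-directml

    model = og.Model(MODEL_DIR)
    tokenizer = og.Tokenizer(model)

    # Temperature and TopP (the input parameters listed above) and the
    # max-output-tokens budget are passed as search options.
    params = og.GeneratorParams(model)
    params.set_search_options(do_sample=True, temperature=temperature,
                              top_p=top_p, max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))
    while not generator.is_done():
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))


if __name__ == "__main__":
    print(generate("What is the capital of France?"))
```

On Windows the DirectML build (`onnxruntime-genai-directml`) selects the DML execution provider automatically for models converted with `-e dml`.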
MMLU Accuracy (5-shot):
With the GenAI ORT->DML backend, we measured the accuracy below on a desktop RTX 4090 GPU system.
"overall_accuracy": 66.61
Link: https://people.eecs.berkeley.edu/~hendrycks/data.tar
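For reference, a 5-shot MMLU prompt is typically assembled by prepending five solved examples from the dev split to each test question. The sketch below follows the common format from the original MMLU evaluation harness; it is an illustration, not the exact harness used for the number above.

```python
"""Illustrative 5-shot MMLU prompt construction (common harness format)."""

CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one MMLU item; the answer letter is appended for few-shot examples."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer:{' ' + answer if answer else ''}")
    return "\n".join(lines)


def build_5shot_prompt(subject, dev_examples, test_question, test_options):
    """Concatenate 5 solved dev examples, then the unsolved test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(format_example(q, opts, ans)
                        for q, opts, ans in dev_examples[:5])
    return header + shots + "\n\n" + format_example(test_question, test_options)
```

The model's next-token prediction after the trailing "Answer:" is compared against the gold choice letter to score each question.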
Data Collection Method by dataset - Unknown
Labeling Method by dataset - Not Applicable
GPU - RTX 4090 Desktop system
OS - Windows 11 23H2