NGC Catalog
Llama-2-70B

Description
Llama 2 70B is one of a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters developed by Meta.
Publisher
Meta
Latest Version
1.0
Modified
November 12, 2024
Size
128.48 GB

Redistribution Information

NVIDIA Validated

  • Supported Runtime(s): TensorRT-LLM
  • Supported Hardware(s): Ampere, Hopper
  • Supported OS(s): Linux

Meta Terms of Use: By using this model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta's privacy policy.

Llama 2

Llama 2 70B is one of a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters, developed by Meta. Model details can be found here. This model is optimized with the NVIDIA NeMo Framework and is provided as a .nemo checkpoint.

Benefits of using Llama 2 checkpoints in NeMo Framework

The following resources reference different checkpoints from the Llama 2 model family, but they can easily be adapted to Llama 2 70B by changing the model reference.

P-Tuning and LoRA

NeMo Framework offers support for various parameter-efficient fine-tuning (PEFT) methods for the Llama 2 model family.

PEFT techniques customize a foundation model for specific tasks by training only a small number of additional parameters while the base model weights stay frozen.

Two of these methods, P-Tuning and Low-Rank Adaptation (LoRA), are supported out of the box for Llama 2 and are described in detail in the NeMo Framework user guide, which shows how to tune Llama 2 to answer biomedical questions from the PubMedQA dataset.
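
To illustrate the idea behind LoRA (a conceptual sketch, not NeMo's implementation), a frozen weight matrix W is augmented with a trainable low-rank update B·A, scaled by alpha/r, so only a small fraction of the parameters is tuned:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass of a linear layer with a LoRA adapter.

    W is the frozen pretrained weight (out_dim x in_dim); A (r x in_dim)
    and B (out_dim x r) are the small trainable low-rank matrices, so only
    r * (in_dim + out_dim) parameters are trained instead of
    in_dim * out_dim.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
in_dim, out_dim, r = 1024, 1024, 8
W = rng.standard_normal((out_dim, in_dim))
A = rng.standard_normal((r, in_dim)) * 0.01
B = np.zeros((out_dim, r))  # B starts at zero, so the adapter is initially a no-op

x = rng.standard_normal((2, in_dim))
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

With rank r = 8 this trains about 16K parameters per 1024x1024 layer instead of roughly a million, which is why PEFT runs fit on much smaller hardware than full fine-tuning.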

Supervised Fine-tuning

NeMo Framework offers supervised fine-tuning (SFT) support for the Llama 2 model family.

Fine-tuning modifies the weights of a pre-trained foundation model with additional custom data. Supervised fine-tuning (SFT) unfreezes all the weights and layers of the model and trains it on a newly labeled set of examples. It can be used to incorporate new, domain-specific knowledge or to teach the foundation model what type of response to provide. One specific type of SFT, known as instruction tuning, teaches a model to follow instructions better.

NeMo Framework offers out-of-the-box SFT support for Llama 2, described in detail in the NeMo Framework user guide, which shows how to tune Llama 2 to follow instructions using the databricks-dolly-15k dataset.
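
As a rough sketch of what SFT training data looks like, the function below turns a databricks-dolly-15k style record into a single prompt/response training string. The template itself is a hypothetical choice for illustration; NeMo lets you configure the prompt format:

```python
def format_sft_example(record, eos_token="</s>"):
    """Turn a databricks-dolly-15k style record (instruction, optional
    context, response) into one labeled training string for SFT.
    The "User:"/"Assistant:" template is a hypothetical example format.
    """
    context = record.get("context", "").strip()
    prompt = record["instruction"].strip()
    if context:
        prompt = f"{prompt}\n\nContext: {context}"
    return f"User: {prompt}\nAssistant: {record['response'].strip()}{eos_token}"

example = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "The capital of France is Paris.",
}
print(format_sft_example(example))
# → User: What is the capital of France?
#   Assistant: The capital of France is Paris.</s>
```

During training, the loss is typically computed only on the response tokens, so the model learns to produce the labeled answer given the prompt.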

SteerLM

NeMo Toolkit supports SteerLM for the Llama 2 model family.

SteerLM is an attribute-conditioned supervised fine-tuning method proposed by NVIDIA as a user-steerable alternative to reinforcement learning from human feedback (RLHF). Models tuned with SteerLM offer flexible alignment at inference time.

An example of SteerLM applied to the Llama-2 13B model is available on Hugging Face.
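
Conceptually, attribute conditioning can be pictured as tagging each prompt with target attribute values, so that varying those values at inference time steers the response. The template below is purely illustrative and is not the actual SteerLM prompt format:

```python
def steerlm_prompt(user_prompt, attributes):
    """Build an attribute-conditioned prompt (illustrative only).

    During SteerLM training, responses are annotated with attribute
    scores (e.g. helpfulness, verbosity) and the model is fine-tuned
    conditioned on them; at inference, the user sets the scores to
    steer the output without retraining.
    """
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(attributes.items()))
    return f"<attributes>{attr_str}</attributes>\nUser: {user_prompt}\nAssistant:"

p = steerlm_prompt("Explain LoRA briefly.", {"helpfulness": 9, "verbosity": 2})
print(p)
# → <attributes>helpfulness:9,verbosity:2</attributes>
#   User: Explain LoRA briefly.
#   Assistant:
```

Because the attributes are ordinary conditioning tokens, changing `verbosity` from 2 to 9 at inference requests a longer answer from the same checkpoint, which is the "flexible alignment at inference time" property described above.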

Optimized Deployment with TensorRT-LLM

Using TensorRT-LLM, NeMo Framework can export Llama 2 checkpoints to formats optimized for deployment on NVIDIA GPUs.

TensorRT-LLM is a library for building TensorRT engines that contain state-of-the-art optimizations for efficient inference on NVIDIA GPUs.

As a result, state-of-the-art performance can be reached with the export and deployment methods that NVIDIA built for Llama 2. This process is described in detail in the NeMo Framework user guide.
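
As a back-of-the-envelope illustration of why optimized inference matters at this scale, the snippet below estimates the FP16 KV-cache footprint using Llama 2 70B's published architecture (80 layers, grouped-query attention with 8 KV heads of dimension 128); the helper function is just an illustration, not part of any NVIDIA API:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per generated token: one key and one value
    vector per KV head per layer, at dtype_bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama 2 70B: 80 layers, grouped-query attention with 8 KV heads, head dim 128.
per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token)  # → 327680 bytes (320 KiB) per token in FP16

# A 4096-token context for a batch of 8 sequences:
total_gib = per_token * 4096 * 8 / 2**30
print(f"{total_gib:.1f} GiB")  # → 10.0 GiB
```

On top of the roughly 128 GB of weights, the cache alone claims several gigabytes per batch, which is why engine-level optimizations such as those in TensorRT-LLM (and grouped-query attention itself, which cuts KV heads from 64 to 8) are important for serving this model economically.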

Detailed performance results are available here.