Meta Terms of Use: By using this model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta's privacy policy.
Llama 2 13B is one of a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters, developed by Meta. Model details can be found here. This model is optimized with the NVIDIA NeMo Framework and is provided as a .nemo checkpoint.
The following resources reference different checkpoints of the Llama 2 family of models, but can be easily adapted to Llama 2 13B by changing the model reference.
NeMo Framework supports various parameter-efficient fine-tuning (PEFT) methods for the Llama 2 model family.
PEFT techniques allow customizing foundation models to improve performance on specific tasks.
Two of them, P-Tuning and Low-Rank Adaptation (LoRA), are supported out of the box for Llama 2 and are described in detail in the NeMo Framework user guide, which shows how to tune Llama 2 to answer biomedical questions based on PubMedQA.
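To make the LoRA idea concrete, here is a minimal plain-Python sketch (an illustration only, not NeMo's implementation): a frozen weight matrix W is adapted by adding a scaled low-rank product B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.

```python
# Minimal numerical sketch of LoRA (illustration only; NeMo's actual
# implementation lives inside the framework). W is frozen; only the
# small factors A (r x d_in) and B (d_out x r) are trained.

def matmul(X, Y):
    """Plain-Python matrix multiply for the illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight LoRA applies at inference."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

d_out, d_in, r = 4, 6, 2
W = [[0.1] * d_in for _ in range(d_out)]   # frozen pretrained weight
A = [[0.5] * d_in for _ in range(r)]       # trainable, small random init in practice
B = [[0.0] * r for _ in range(d_out)]      # trainable, zero init

# With B initialized to zero, the adapted weight equals W before training,
# so tuning starts exactly from the pretrained model's behavior.
W_eff = lora_effective_weight(W, A, B, alpha=16, r=r)
assert W_eff == W

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
```

For realistic layer sizes (e.g. d_in = d_out = 5120 for Llama 2 13B) the trainable parameter count drops by several orders of magnitude, which is what makes PEFT practical on modest hardware.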
NeMo Framework offers supervised fine-tuning (SFT) support for the Llama 2 model family.
Fine-tuning refers to modifying the weights of a pre-trained foundation model with additional custom data. Supervised fine-tuning (SFT) refers to unfreezing all the weights and layers of the model and training on a newly labeled set of examples. One can fine-tune to incorporate new, domain-specific knowledge or to teach the foundation model what type of response to provide. One specific type of SFT is also referred to as instruction tuning, where SFT is used to teach a model to follow instructions better.
NeMo Framework offers out-of-the-box SFT support for Llama 2, described in detail in the NeMo Framework user guide, which shows how to tune Llama 2 to follow instructions using the databricks-dolly-15k dataset.
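For instruction tuning, each dataset record is typically rendered into a prompt/response text pair before training. The sketch below uses the public field names of databricks-dolly-15k (instruction, context, response, category); the prompt template itself is an illustrative assumption, not the exact template from the NeMo user guide.

```python
# Sketch: turn a databricks-dolly-15k record into an SFT training pair.
# Field names match the public dataset; the template is an assumption.

def format_dolly_example(record):
    """Build an (input prompt, target response) pair for supervised fine-tuning."""
    prompt = f"Instruction: {record['instruction']}\n"
    if record.get("context"):  # context is empty for many dolly records
        prompt += f"Context: {record['context']}\n"
    prompt += "Answer: "
    return prompt, record["response"]

record = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "The capital of France is Paris.",
    "category": "open_qa",
}
prompt, target = format_dolly_example(record)
print(prompt + target)
```

During SFT, the loss is usually computed only on the target tokens, so the model learns to produce the response given the formatted prompt.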
NeMo Framework supports SteerLM for the Llama 2 model family.
SteerLM is an attribute-conditioned supervised fine-tuning method proposed by NVIDIA as a user-steerable alternative to reinforcement learning from human feedback (RLHF). Models tuned with SteerLM offer flexible alignment at inference time.
An example of applying SteerLM to the Llama 2 13B model is available on Hugging Face.
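The core idea of attribute conditioning can be sketched as follows: prompts are annotated with attribute scores during training, and at inference the user sets the desired scores to steer the response. The attribute names and the annotation string format below are illustrative assumptions, not SteerLM's exact format.

```python
# Illustration of the attribute-conditioned idea behind SteerLM.
# Attribute names and the "<attributes ...>" format are assumptions
# made for this sketch, not the real SteerLM annotation scheme.

def steer_prompt(user_prompt, attributes):
    """Prefix the prompt with attribute scores so generation is conditioned on them."""
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(attributes.items()))
    return f"<attributes {attr_str}> {user_prompt}"

# At inference time the same tuned model can be "steered" simply by
# changing the requested attribute values:
concise = steer_prompt("Explain quicksort.", {"helpfulness": 4, "verbosity": 0})
detailed = steer_prompt("Explain quicksort.", {"helpfulness": 4, "verbosity": 4})
print(concise)
print(detailed)
```

Because steering happens through the conditioning text rather than through separate reward-model training runs, one SteerLM model can serve multiple alignment preferences.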
Using TensorRT-LLM, NeMo Framework allows exporting Llama 2 checkpoints to formats that are optimized for deployment on NVIDIA GPUs.
TensorRT-LLM is a library that allows building TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
This makes it possible to reach state-of-the-art inference performance using the export and deployment methods NVIDIA built for Llama 2. The process is described in detail in the NeMo Framework user guide.
Detailed performance results are available here.