
Llama 3 70B Fine-Tuning 25.02 (DGXC Benchmarking)

Description
This recipe contains information and scripts to produce performance results for Llama3 70B fine-tuning with the Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) schemes.
Publisher
NVIDIA
Latest Version
25.02
Modified
April 17, 2025
Compressed Size
17.54 KB

Overview

This recipe contains information and scripts to produce performance results for Llama3 70B fine-tuning with the Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) schemes. The workload supports BF16 and FP8 precision. It is best suited for clusters with at least 8 (for LoRA) or 32 (for SFT) H100 GPUs with at least 80 GB of memory each; fine-tuning this 70-billion-parameter model will not fit on fewer GPUs or on GPUs with less memory.

Expected Performance

Performance for Llama 3 fine-tuning is measured in seconds per iteration, i.e. seconds per training step. This metric is logged in a log-*.out file generated in the $STAGE_PATH/logs/experiments/<experiment_name>/<experiment_name>_<timestamp>/<experiment_name>/ folder. $STAGE_PATH is defined in the later stages of this document.

Example of an output file for a LoRA fine-tuning run of the Llama3 70B model with BF16 precision on 16 GPUs, where experiment_name = lora_nemo_llama3_70b_bf16_16:

$STAGE_PATH/logs/experiments/lora_nemo_llama3_70b_bf16_16/lora_nemo_llama3_70b_bf16_16_1741198733/lora_nemo_llama3_70b_bf16_16/log-<slurm-account>.lora_nemo_llama3_70b_bf16_16_1741198733_0.out

Since performance fluctuates significantly at the beginning of a run due to high warmup step times, the throughput value is computed from the average time of the last 20 training steps.

Example output line

peak_memory_usage: 35475423232 | memory_allocated: 16209118208 | reduced_train_loss: 0.208 | train_step_timing in s: 11.1 | consumed_samples: 3168
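For convenience, the average of the last 20 step timings can be extracted with a short shell pipeline. Below is a sketch, assuming the log line format shown above; <logfile> stands for the log-*.out file described earlier:

# Sketch: average the last 20 train_step_timing values from a results log.
grep -o 'train_step_timing in s: [0-9.]*' <logfile> \
  | awk '{print $NF}' \
  | tail -n 20 \
  | awk '{ sum += $1 } END { printf "average step time: %.2f s\n", sum / NR }'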

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)

E.g. 8192 * 128 / 11.1 = 94466

To calculate a time-to-train estimate:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 94466 / 86400 = 122.52 days
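Both worked examples above can be reproduced with a quick awk computation; this sketch uses the numbers from the example log line, which you should replace with values from your own run:

awk 'BEGIN {
  seq = 8192; gbs = 128; step = 11.1   # sequence length, global batch size, step time (s)
  tps = seq * gbs / step               # throughput in tokens per second
  printf "throughput: %.0f tokens/s\n", tps
  printf "time to train 1T tokens: %.2f days\n", 1e12 / tps / 86400
}'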

To calculate the model flops utilization (MFU) for SFT fine-tuning:

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

To calculate the model flops utilization (MFU) for LoRA fine-tuning, use the same formula as for SFT above, multiplied by 2/3.

E.g. for a Llama 70B BF16 SFT fine-tuning run on 32 GPUs with global batch size 32:

32 * 1.70E+15 / 3.36 / 32 / 9.89E+14 = 51.1%

Note: Model flops for Llama 70B fine-tuning is 1.70E+15. The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.
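As a sanity check, the MFU example above can be reproduced the same way (values from the SFT BF16 example; peak FLOPS per the note):

awk 'BEGIN {
  gbs = 32; model_flops = 1.70e15; step = 3.36; gpus = 32; peak = 9.89e14
  printf "MFU: %.0f%%\n", 100 * gbs * model_flops / step / gpus / peak
}'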

Llama3 70b Fine-tuning SFT BF16 (TP=4, PP=4, CP=1, VP=5)

                                        32x H100 GPUs (GBS=32)
Training step time (seconds per step)   3.36
Throughput in tokens per second         39009
Model flops utilization                 51.1%
Time to train 1T tokens in days         296

Llama3 70b Fine-tuning SFT FP8 (TP=4, PP=4, CP=1, VP=5)

                                        32x H100 GPUs (GBS=32)
Training step time (seconds per step)   2.26
Throughput in tokens per second         57996
Model flops utilization                 38%
Time to train 1T tokens in days         199

Llama3 70b Fine-tuning LoRA BF16 (TP=2, PP=4, CP=1, VP=20)

                                        8x H100 GPUs (GBS=32)   16x H100 GPUs (GBS=64)   32x H100 GPUs (GBS=128)
Training step time (seconds per step)   9.118                   9.4                      9.77
Throughput in tokens per second         14375                   27887                    53663
Model flops utilization                 50.3%                   48.7%                    47.3%
Time to train 1T tokens in days         805                     415                      215

Llama3 70b Fine-tuning LoRA FP8 (TP=2, PP=4, CP=1, VP=20)

                                        8x H100 GPUs (GBS=32)   16x H100 GPUs (GBS=64)   32x H100 GPUs (GBS=128)
Training step time (seconds per step)   6.5                     6.7                      7.6
Throughput in tokens per second         20164                   39125                    68985
Model flops utilization                 35.2%                   34.2%                    30.1%
Time to train 1T tokens in days         573                     295                      167

Prerequisites

This recipe requires access to Llama 3 and its model weights on Hugging Face. Instructions are below if needed.

A Python virtual environment must be created using Python 3.10.12 or newer before running the workload.

Request Access

A Hugging Face account is required, and you will need to create a Hugging Face access token; the token is used to run the fine-tuning scripts. Add the generated token to your environment via export HF_TOKEN=<your token>.

Access to Llama 3 must be requested through Meta's website and then on the Hugging Face Llama page. The approval process is not automatic and could take a day or more.

Prepare Environment

Slurm

We reference a number of Slurm commands and parameters in this document; a brief summary is included below. Note that these are a guide and might not be applicable to all environments. Please consult your system administrator for the parameters specific to your system.

Common parameters:

  • SBATCH_PARTITION or -p - Partition (or queue) to use.
  • SBATCH_ACCOUNT or -A - Slurm account to associate with your job, different from your user. Meant for accounting purposes.
    • Errors such as 'GPUs not found' or 'Cannot submit to this partition without GPU resources' mean this setting is required.

These parameters can be set either by exporting the environment variable or using the corresponding sbatch flag.
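For example, both can be set once per shell session (the values are site-specific; check with your administrator):

export SBATCH_PARTITION=<partition name>
export SBATCH_ACCOUNT=<account name>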

Workload Setup

Create a staging area by running the attached setup.sh. The script clones two repositories, Megatron-LM and NeMo, and builds Megatron-LM, which is needed to run the benchmark scripts.

Set the environment variables

export STAGE_PATH=<path to your shared file system folder>
export HF_TOKEN=<your_HF_token_value>

Important: The STAGE_PATH used in this step must also be used when running the workload.

Prepare python virtual environment

The workload relies on a Python virtual environment to ensure there are no conflicts between the required dependencies and the user's packages.

There are multiple options for setting up the virtual environment:

  • conda
  • python venv

Conda

To install Miniconda and set up a conda virtual environment:

# pick INSTALL_PATH with sufficient disk space
INSTALL_PATH=~
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $INSTALL_PATH/miniconda.sh
bash $INSTALL_PATH/miniconda.sh -b -p $INSTALL_PATH/miniconda3
$INSTALL_PATH/miniconda3/bin/conda init
source ~/.bashrc
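After initialization, create and activate an environment for the benchmark. A minimal sketch, where the environment name nemo_env is illustrative:

# environment name is illustrative; Python 3.10.12 or newer is required
conda create -y -n nemo_env python=3.10
conda activate nemo_env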

When you are finished running this benchmark, deactivate the environment with this command:

conda deactivate

Python venv

To create and activate a Python venv:

python3 -m venv $STAGE_PATH/test_sft
source $STAGE_PATH/test_sft/bin/activate

When you are finished running this benchmark, deactivate the environment with this command:

deactivate

Setup script

The setup.sh script converts the Docker image into a .sqsh file and stores it in STAGE_PATH. It then clones the Megatron-LM and NeMo repositories and installs all the necessary packages.

The following command will get the environment set up for you.

Run setup.sh

Ensure that your python environment has been created and activated before running the setup.sh script.

sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N 1 ./setup.sh

The above command launches a Slurm job, which must finish successfully before you can proceed to the next step.
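You can monitor the job and confirm that it completed successfully with standard Slurm commands, for example:

squeue -u $USER                                            # show your queued/running jobs
sacct -j <jobid> --format=JobID,JobName,State,ExitCode     # check final state after completion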

After running the setup script, the directory structure of STAGE_PATH should look like below:

|--|STAGE_PATH
|--|--|HF_ckpt
|--|--|logs
|--|--|Megatron-LM
|--|--|NeMo

Run fine-tuning

Ensure that your python environment has been created and activated before running the launch.sh script.

The environment variable SFT_SCHEME allows the user to choose between the two fine-tuning schemes (lora or sft). Fine-tuning runs for the first 100 steps and stops afterwards. Log files and results are located under the $STAGE_PATH/logs/... folder.

Below is an example of how to launch the workload:

DTYPE=fp8 MODEL_SIZE=70b SFT_SCHEME=lora JOB_TOTAL_GPUS=16 RUN_CONF_GPUS_PER_NODE=8 ./launch.sh

Where:

  • DTYPE, MODEL_SIZE, JOB_TOTAL_GPUS and SFT_SCHEME can be changed by the user.
    • DTYPE can be either fp8 or bf16.
    • MODEL_SIZE should be 70b in this case.
    • SFT_SCHEME can be either sft or lora.
    • JOB_TOTAL_GPUS can be 8, 16, or 32 (SFT requires 32; LoRA supports all three).
  • See the Slurm section above for more information on Slurm parameters.

Below is a summary of the configurations we support:

Llama Model Type   Fine-tuning scheme   Precision   Scale (#GPUs)   Run command with scale options
70b                SFT                  FP8         32              MODEL_SIZE=70b SFT_SCHEME=sft DTYPE=fp8 JOB_TOTAL_GPUS=32 ./launch.sh
70b                SFT                  BF16        32              MODEL_SIZE=70b SFT_SCHEME=sft DTYPE=bf16 JOB_TOTAL_GPUS=32 ./launch.sh
70b                LoRA                 FP8         8, 16, 32       MODEL_SIZE=70b SFT_SCHEME=lora DTYPE=fp8 JOB_TOTAL_GPUS=<8/16/32> ./launch.sh
70b                LoRA                 BF16        8, 16, 32       MODEL_SIZE=70b SFT_SCHEME=lora DTYPE=bf16 JOB_TOTAL_GPUS=<8/16/32> ./launch.sh

Note: Running multiple concurrent jobs using the same model can cause inconsistent or unintended results.

Advanced information for the user

The sections below are for informational purposes only.

Model Weights

Model weights are downloaded from Hugging Face during the first fine-tuning run. A script block takes care of downloading the model weights and converting them to the format NeMo expects. A Slurm executor block with a fixed duration is set for the weight download.

Note: The download and checkpoint-conversion steps run only once per model. Subsequent runs use the cached weights.

Note: The model weights require approximately 150 GB of space, so make sure you have enough space to download the weights and run the scripts.
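A quick way to confirm free space in the staging area before the first run:

df -h $STAGE_PATH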

The model weights and context are stored in the HF_HOME location. The directory structure looks like below:

-HF_ckpt
|--|models
|--|--|meta-llama
|--|--|--|Meta-Llama-3-70B
|--|--|--|--|context
|--|--|--|--|weights

The directory contains the tokenizer files downloaded in the previous steps. The weights are generic for each model and are reused across all precisions and fine-tuning schemes.

Common Environment Issues

  • Slurm parameters might not be applicable to all environments. Please consult with your system administrator and update or remove parameters as needed.