Llama2 (DGXC Benchmarking)

Description
This recipe contains information and scripts to produce performance results for the Llama 2 training workload.
Publisher
-
Latest Version
24.08.1
Modified
December 23, 2024
Compressed Size
24.16 KB

Overview of the Llama 2 7b parameter model:

This recipe contains information and scripts to produce performance results for the Llama 2 training workload. The scripts help perform environment setup, dataset setup, and launch benchmark jobs. This variant of the workload is best-suited for GPU clusters with

  • At least 8 GPUs with at least 80 GB of memory each. This 7-billion-parameter variant of the workload will not fit on fewer GPUs or on GPUs with less memory.
  • H100 GPUs. This workload runs with FP8, which is supported by H100 GPUs. BF16 recipes for A100 GPUs will be available shortly for this workload.

Overview of the Llama 2 70b parameter model:

This recipe contains information and scripts to produce performance results for the Llama 2 training workload. The scripts help perform environment setup, dataset setup, and launch benchmark jobs. This variant of the workload is best-suited for GPU clusters with

  • At least 64 GPUs with at least 80 GB of memory each. This 70-billion-parameter variant of the workload will not fit on fewer GPUs or on GPUs with less memory.
  • H100 GPUs. This workload runs with FP8, which is supported by H100 GPUs. BF16 recipes for A100 GPUs will be available shortly for this workload.

Expected Performance for 7b parameter model:

Performance for Llama 2 training is measured by seconds per iteration, or in other words, seconds per training step. This metric is logged for every training step in a .out file generated inside the $STAGE_PATH/results/ folder.

Since performance fluctuates significantly at the beginning of training, we use the timing of the last training step to obtain the throughput value.

grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 100/100 [09:37<00:00, v_num=, reduced_train_loss=5.990, global_step=99.00, consumed_samples=25600.0, train_step_timing in s=5.550]
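
For convenience, the last reported step timing can also be pulled out with a one-liner. This is a minimal sketch that assumes the log line format shown above:

# Print the last reported train_step_timing value (seconds per step)
grep -o 'train_step_timing in s=[0-9.]*' results/*.out | tail -n 1 | awk -F= '{print $2}'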

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)

E.g. 4096 * 128 / 4.57 = 114724
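
As a quick sanity check, the same arithmetic can be done on the command line with the values from the example above:

# (sequence length) * (global batch size) / (training step time)
awk 'BEGIN { printf "%.0f tokens/s\n", 4096 * 128 / 4.57 }'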

To calculate time to train estimate:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 114724 / 86400 = 100.90 days
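
And the corresponding time-to-train estimate, again with the example values:

# (total tokens) / (throughput in tokens per second) / (seconds in a day)
awk 'BEGIN { printf "%.1f days\n", 1e12 / 114724 / 86400 }'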

To calculate the model flops utilization (MFU):

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS.

The model flops for Llama 2 7b at GBS=1 is 1.89E+14. The calculation is shown in the Notes section below.

E.g. Llama 2 7b FP8 on 8x H100 GPUs (GBS=128)

peak FLOPS for H100 = 1979 TFLOPS
training step time = 4.57 s
model flops = 1.89E+14

MFU = 128 * 1.89E+14 / 4.57 / 8 / 1979E+12 = 33.39%
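
The MFU figure above can be reproduced on the command line. Note that the model flops value used here is the unrounded one derived in the Notes section below; using the rounded 1.89E+14 gives roughly 33.4%:

awk 'BEGIN {
  model_flops = 4096 * (19327352832 + 25971130368 + 786432000)   # from the Notes section
  printf "MFU = %.2f%%\n", 100 * 128 * model_flops / 4.57 / 8 / 1979e12
}'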

Llama 2 7b BF16 (TP=1, PP=1, MBS=1, VP=1, GA=16)

| GPUs (GBS) | Training step time (seconds per step) | Throughput (tokens per second) | Model flops utilization | Time to train 1T tokens (days) |
|---|---|---|---|---|
| 8x H100 (GBS=128) | 5.57 | 94127 | 27.43% | 122.96 |
| 16x H100 (GBS=256) | 5.61 | 186912 | 27.24% | 61.92 |
| 32x H100 (GBS=512) | 5.61 | 373824 | 27.24% | 30.96 |
| 64x H100 (GBS=1024) | 5.63 | 744992 | 27.14% | 15.54 |
| 128x H100 (GBS=2048) | 5.65 | 1484709 | 27.05% | 7.8 |
| 256x H100 (GBS=4096) | 5.72 | 2933080 | 26.71% | 3.95 |
| 512x H100 (GBS=8192) | 5.93 | 5658420 | 25.77% | 2.05 |
| 1024x H100 (GBS=16384) | 5.96 | 11259877 | 25.64% | 1.03 |
| 2048x H100 (GBS=32768) | 6.39 | 21004339 | 23.91% | 0.55 |

Llama 2 7b FP8 (TP=1, PP=1, MBS=1, VP=1, GA=16)

| GPUs (GBS) | Training step time (seconds per step) | Throughput (tokens per second) | Time to train 1T tokens (days) |
|---|---|---|---|
| 8x H100 (GBS=128) | 4.01 | 130691 | 88.56 |
| 16x H100 (GBS=256) | 4.03 | 260247 | 44.47 |
| 32x H100 (GBS=512) | 4.05 | 517360 | 22.37 |
| 64x H100 (GBS=1024) | 4.08 | 1027235 | 11.27 |
| 128x H100 (GBS=2048) | 4.08 | 2054437 | 5.63 |
| 256x H100 (GBS=4096) | 4.20 | 3993308 | 2.9 |
| 512x H100 (GBS=8192) | 4.30 | 7811077 | 1.48 |
| 1024x H100 (GBS=16384) | 4.41 | 15211221 | 0.76 |
| 2048x H100 (GBS=32768) | 4.97 | 26997974 | 0.43 |

Expected Performance for 70b parameter model:

Performance for Llama 2 training is measured by seconds per iteration, or in other words, seconds per training step. This metric is logged for every training step in a .out file generated inside the $STAGE_PATH/results/ folder.

Since performance fluctuates significantly at the beginning of training, we use the timing of the last training step to obtain the throughput value.

grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 100/100 [09:37<00:00, v_num=, reduced_train_loss=5.990, global_step=99.00, consumed_samples=25600.0, train_step_timing in s=5.550]

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)

E.g. 4096 * 128 / 5.55 = 94466

To calculate time to train estimate:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 94466 / 86400 = 122.53 days

To calculate the model flops utilization (MFU):

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS.

The model flops for Llama 2 70b at GBS=1 is 1.82E+15. The calculation is shown in the Notes section below.

E.g. Llama 2 70b FP8 on 64x H100 GPUs (GBS=128)

peak FLOPS for H100 = 1979 TFLOPS
training step time = 5.55 s
model flops = 1.82E+15

MFU = 128 * 1.82E+15 / 5.55 / 64 / 1979E+12 = 33.15%
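
The 70b worked example can be checked end to end in the same way; the model flops value is again the unrounded one from the Notes section:

awk 'BEGIN {
  seq_len = 4096; gbs = 128; step_time = 5.55; num_gpus = 64; peak_flops = 1979e12
  model_flops = seq_len * (104689827840 + 338228674560 + 1572864000)   # from the Notes section
  tokens_per_sec = seq_len * gbs / step_time
  printf "throughput = %.0f tokens/s\n", tokens_per_sec
  printf "time to train 1T tokens = %.1f days\n", 1e12 / tokens_per_sec / 86400
  printf "MFU = %.2f%%\n", 100 * gbs * model_flops / step_time / num_gpus / peak_flops
}'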

Llama 2 70b BF16 (TP=4, PP=4, MBS=1, GA=32)

| GPUs (GBS) | Training step time (seconds per step) | Throughput (tokens per second) | Model flops utilization | Time to train 1T tokens (days) |
|---|---|---|---|---|
| 64x H100 (GBS=128) | 7.22 | 72616 | 25.48% | 159.39 |
| 128x H100 (GBS=256) | 7.24 | 144831 | 25.40% | 79.91 |
| 256x H100 (GBS=512) | 7.26 | 288864 | 25.33% | 40.07 |
| 512x H100 (GBS=1024) | 7.26 | 577728 | 25.33% | 20.03 |
| 1024x H100 (GBS=2048) | 7.27 | 1153866 | 25.30% | 10.03 |
| 2048x H100 (GBS=4096) | tbd | tbd | tbd | tbd |

Llama 2 70b FP8 (TP=4, PP=4, MBS=1, GA=32)

| GPUs (GBS) | Training step time (seconds per step) | Throughput (tokens per second) | Time to train 1T tokens (days) |
|---|---|---|---|
| 64x H100 (GBS=128) | 4.88 | 107422 | 107.74 |
| 128x H100 (GBS=256) | 4.89 | 214595 | 53.93 |
| 256x H100 (GBS=512) | 4.90 | 427915 | 27.05 |
| 512x H100 (GBS=1024) | 4.91 | 854098 | 13.55 |
| 1024x H100 (GBS=2048) | 4.95 | 1695216 | 6.83 |
| 2048x H100 (GBS=4096) | 4.96 | 3379335 | 3.42 |

Prerequisites

This recipe requires access to Llama 2. Instructions for requesting access are below if needed.

Prepare Environment

Create a staging area by running the attached setup.sh. The script converts the docker image from nvcr.io/nvidia/nemo:24.03.01.framework to the nvidia+nemo+24.03.01.framework.sqsh file under the $STAGE_PATH folder and copies NeMo Launcher code from the container.

# Set the path where all artifacts will be downloaded, e.g. /lustre/myproject/nemo
export STAGE_PATH=<path to your shared file system folder>
# Set the Slurm partition to launch against
export SLURM_PARTITION="batch"
# Set the Slurm account to launch against
export SLURM_ACCOUNT="account_name"
# Set the number of GPUs per node according to Slurm's gres; this is usually 8 or null - https://slurm.schedmd.com/gres.html
export SLURM_GPU_PER_NODE=null

# Run the setup
bash ./setup.sh
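
Once setup.sh completes, a quick way to confirm that the image conversion succeeded is to check for the squash file in the staging area:

# The converted container image should exist under $STAGE_PATH
ls -lh $STAGE_PATH/nvidia+nemo+24.03.01.framework.sqsh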

Request Access

Access to Llama 2 must be requested through Meta's website and then on the Hugging Face Llama page. The approval process is not automatic and could take a day or more. Once access is granted, download the Llama tokenizer. The tokenizer is needed to prepare the dataset in the next section and must be copied to $STAGE_PATH.

Prepare Dataset

Pre-training a Llama 2 model requires a text-based dataset to be downloaded and pre-processed so that the NeMo Framework can ingest the data optimally. The Pile is often used as the dataset for pre-training models. The NeMo Framework contains helper scripts to download and pre-process the dataset. The following steps outline how to download and pre-process the dataset on DGX Cloud, with an explanation of key points afterwards.

Copy the tokenizer.model from wherever it was downloaded to $STAGE_PATH/llama-dataset/llama/tokenizer.model on the Slurm cluster.
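
For example (the source path below is hypothetical; use wherever you downloaded the tokenizer):

# Source path is a placeholder; the target path is the one expected by the dataset scripts
mkdir -p $STAGE_PATH/llama-dataset/llama
cp /path/to/downloaded/tokenizer.model $STAGE_PATH/llama-dataset/llama/tokenizer.model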

Run the generate_dataset.sh script. The script launches several Slurm jobs that download the dataset from The Pile, pre-process it, and save it in a form suitable for subsequent training. The resulting dataset files will be saved under the $STAGE_PATH/llama-dataset folder. Dataset creation may use up to 100 GB of disk space, so make sure sufficient space is available.

bash ./generate_dataset.sh

If the dataset generation step was successful, there should be 4 .idx and 4 .bin files in the $STAGE_PATH/llama-dataset folder.

my-llama_00_text_document.bin
my-llama_00_text_document.idx
my-llama_01_text_document.bin
my-llama_01_text_document.idx
my-llama_02_text_document.bin
my-llama_02_text_document.idx
my-llama_03_text_document.bin
my-llama_03_text_document.idx

If that is not the case, check the log files in: $STAGE_PATH/results.data_preparation
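
As a quick check, the following command should list the eight files above:

ls $STAGE_PATH/llama-dataset/my-llama_*_text_document.{bin,idx}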

Run Training

Once the environment has been prepared, it is time to train a model. The NeMo Framework contains many predefined configuration files for various models, including the 7-billion and 70-billion parameter Llama 2 models. This section demonstrates how to initiate training of the model. You can see the default configuration for Llama 2 70b in the NeMo-Megatron Launcher GitHub repository. We will modify some of these parameters with our launch command.

NeMo Launcher uses the Hydra framework to process command-line arguments and pass them down as hyperparameters to the multi-node job that performs the training.

Below are sample commands for launching training of the Llama 2 7b and 70b models on 16 nodes (128 GPUs). The training will run for the first 100 steps and will stop afterwards.

Log files and results will be located under the $STAGE_PATH/results/ folder.

# Number of nodes to train on; 16 nodes x 8 GPUs = 128 GPUs in this example
export NUM_NODES=16
sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_7b.sh
sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_70b.sh

If you would like to run with FP8 precision, simply set the environment variable ENABLE_FP8=True before launching the script; otherwise, the training will be done using BF16 precision. For instance:

ENABLE_FP8=True sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_7b.sh
ENABLE_FP8=True sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_70b.sh 

Note that on certain clusters it might be necessary to pass --gres gpu:8 to sbatch if you encounter errors such as GPU not found. See https://slurm.schedmd.com/gres.html.
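
For example:

# Request 8 GPUs per node explicitly via gres (only needed on clusters that require it)
sbatch --gres=gpu:8 -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_7b.sh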

It is important to maintain the following model parallelism settings in order to accurately compare performance results for completed jobs against the expected baselines.

For the 7b parameter model:

  • training.model.tensor_model_parallel_size=1
  • training.model.pipeline_model_parallel_size=1

For the 70b parameter model:

  • training.model.tensor_model_parallel_size=4
  • training.model.pipeline_model_parallel_size=4

The global batch size (training.model.global_batch_size) should be set to <number of nodes> * 16, e.g. 16 * 16 = 256 in the example above.

Notes

The calculations for the 7b parameter model:

model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
    attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups)/(number of attention heads) + (sequence length)/(hidden size))
    mlp flops = 18 * (number of layers) * (FFN size) * (hidden size)
    embedding flops = 6 * (vocab size) * (hidden size)

Llama 2 7b calculation:
    sequence length = 4096
    attention flops = 12 * 32 * 4096^2 * (1 + 32/32 + 4096/4096) = 19,327,352,832
    mlp flops = 18 * 32 * 11008 * 4096 = 25,971,130,368
    embedding flops = 6 * 32000 * 4096 = 786,432,000

    model flops = 4096 * (19,327,352,832 + 25,971,130,368 + 786,432,000) = 1.89E+14

The calculations for the 70b parameter model:

model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
    attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups)/(number of attention heads) + (sequence length)/(hidden size))
    mlp flops = 18 * (number of layers) * (FFN size) * (hidden size)
    embedding flops = 6 * (vocab size) * (hidden size)

Llama 2 70b calculation: 
    sequence length = 4096
    attention flops = 12 * 80 * 8192^2 * (1 + 8/64 + 4096/8192) = 104,689,827,840
    mlp flops = 18 * 80 * 28672 * 8192 = 338,228,674,560
    embedding flops = 6 * 32000 * 8192 = 1,572,864,000

    model flops = 4096 * (104,689,827,840 + 338,228,674,560 + 1,572,864,000) = 1.82E+15
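
The two calculations above can be reproduced with a small helper. This is a minimal sketch, not part of the recipe scripts; the parameter values are exactly those listed in this section:

# Reproduces the model flops numbers above
model_flops() {  # args: layers hidden_size ffn_size query_groups attn_heads seq_len vocab_size
  awk -v L="$1" -v H="$2" -v F="$3" -v G="$4" -v A="$5" -v S="$6" -v V="$7" 'BEGIN {
    attn = 12 * L * H^2 * (1 + G/A + S/H)
    mlp  = 18 * L * F * H
    emb  = 6 * V * H
    printf "%.3e\n", S * (attn + mlp + emb)
  }'
}
model_flops 32 4096 11008 32 32 4096 32000   # Llama 2 7b  -> ~1.888e+14
model_flops 80 8192 28672 8  64 4096 32000   # Llama 2 70b -> ~1.821e+15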