Nemotron (DGXC Benchmarking)

Description
This recipe contains information and scripts to produce performance results for the Nemotron training workload. 
Publisher
-
Latest Version
24.08.1
Modified
December 23, 2024
Compressed Size
2.1 MB

Nemotron 4 15b Overview

This recipe contains information and scripts to produce performance results for the Nemotron 4 15B training workload. The scripts help perform environment setup and launch benchmark jobs.

This variant of the workload is best-suited for GPU clusters with:

  • At least 8 GPUs with at least 80 GB memory each. Training of this 15-billion parameter variant of the workload will not fit on fewer GPUs with less memory.
  • This workload runs with BF16 or FP8 precision. FP8 is only supported by H100 GPUs. BF16 recipes are suitable for both A100 and H100 GPUs.
  • With BF16 precision, a minimum of 16 H100 GPUs is required; using fewer will result in out-of-memory errors.

Nemotron 4 340b Overview

This recipe contains information and scripts to produce performance results for the Nemotron 4 340B training workload. The scripts help perform environment setup and launch benchmark jobs.

This variant of the workload is best-suited for GPU clusters with:

  • At least 128 GPUs with at least 80 GB memory each. Training of this 340-billion parameter variant of the workload will not fit on fewer GPUs with less memory.
  • This workload supports BF16 or FP8 precision. FP8 is only supported by H100 GPUs. BF16 recipes are suitable for both A100 and H100 GPUs.

Expected Performance of 15b parameter model:

Performance for Nemotron 4 training is measured in seconds per iteration, or in other words seconds per training step. This metric is logged for every training step in a .out file generated inside the $STAGE_PATH/results/ folder.

Since performance fluctuates significantly at the beginning of training, we use the timing of the last training step to obtain the throughput value.

grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 100/100 [10:48<00:00, reduced_train_loss=0.0172, global_step=99.00, consumed_samples=25600.0, train_step_timing in s=3.130]
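
If helpful, a small shell snippet like the following can pull just the last reported step time out of the log; it assumes the log line format shown above and is not part of the recipe scripts.

# Extract the last "train_step_timing in s=<value>" entry from the generated logs
STEP_TIME=$(grep -oh 'train_step_timing in s=[0-9.]*' $STAGE_PATH/results/*.out | tail -n 1 | cut -d= -f2)
echo "last train_step_timing: ${STEP_TIME} s"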

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)

E.g. 4096 * 256 / 3.13 = 335008

To calculate time to train estimate:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 335008 / 86400 = 34.55 days
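
Putting the two formulas together, a quick awk sketch can turn a step time into tokens per second and a 1T-token training estimate. The sequence length, global batch size, and step time below are simply the example values above; substitute the settings of your own run.

# Illustrative only: compute throughput and time to train 1T tokens from a step time
awk -v seq_len=4096 -v gbs=256 -v step_time=3.13 -v total_tokens=1e12 'BEGIN {
    tps = seq_len * gbs / step_time;       # throughput in tokens per second
    days = total_tokens / tps / 86400;     # time to train in days
    printf "throughput: %.0f tokens/s, time to train: %.2f days\n", tps, days
}'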

To calculate the model flops utilization (MFU):

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.

The model flops for Nemotron 4 15b at GBS=1 is 434.5e12; the calculation is shown in the Notes section below.

E.g. Nemotron 4 15b BF16 on 64x H100 GPUs (GBS=256)

peak FLOPS for H100 BF16 = 989 TFLOPS
training step time = 3.13 s
model flops = 434.5e12

MFU = 256 * 434.5e12 / 3.13 / 64 / 989e+12 = 56%
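
The MFU arithmetic can be scripted the same way. This sketch merely reproduces the BF16 example above; for FP8 runs the peak_flops value would change to 1979e12.

# Illustrative only: MFU for the BF16 example above (64x H100, GBS=256)
awk -v gbs=256 -v model_flops=434.5e12 -v step_time=3.13 -v ngpus=64 -v peak_flops=989e12 'BEGIN {
    printf "MFU: %.1f%%\n", 100 * gbs * model_flops / step_time / ngpus / peak_flops
}'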

Nemotron4 15b BF16 (TP=4, PP=1, MBS=4, GA=4):

| H100 GPUs | GBS | Training step time (seconds per step) | Throughput in tokens per second | Model flops utilization | Time to train 1T tokens in days |
|---|---|---|---|---|---|
| 16 | 64 | 3.12 | 84020 | 56.30% | 137.75 |
| 32 | 128 | 3.14 | 166971 | 55.94% | 69.32 |
| 64 | 256 | 3.14 | 333941 | 55.94% | 34.66 |
| 128 | 512 | 3.15 | 665762 | 55.76% | 17.38 |
| 256 | 1024 | 3.18 | 1318963 | 55.24% | 8.77 |
| 512 | 2048 | 3.19 | 2629658 | 55.06% | 4.4 |
| 1024 | 4096 | 3.26 | 5146385 | 53.88% | 2.24 |
| 2048 | 8192 | 3.28 | 10223219 | 53.56% | 1.13 |

Nemotron4 15b FP8 (TP=4, PP=1, MBS=4, GA=4):

| H100 GPUs | GBS | Training step time (seconds per step) | Throughput in tokens per second | Time to train 1T tokens in days |
|---|---|---|---|---|
| 8 | 32 | 2.28 | 57371 | 201.74 |
| 16 | 64 | 2.29 | 114392 | 101.18 |
| 32 | 128 | 2.3 | 228073 | 50.75 |
| 64 | 256 | 2.31 | 453655 | 25.51 |
| 128 | 512 | 2.32 | 902901 | 12.82 |
| 256 | 1024 | 2.33 | 1796813 | 6.44 |
| 512 | 2048 | 2.35 | 3568772 | 3.24 |
| 1024 | 4096 | 2.37 | 7066002 | 1.64 |
| 2048 | 8192 | 2.46 | 13623924 | 0.85 |

Expected Performance of 340b parameter model:

Performance for Nemotron 4 training is measured in seconds per iteration, or in other words seconds per training step. This metric is logged for every training step in a .out file generated inside the $STAGE_PATH/results/ folder.

Since performance fluctuates significantly at the beginning of training, we use the timing of the last training step to obtain the throughput value.

grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 100/100 [07:57<00:00, reduced_train_loss=7.310, global_step=99.00, consumed_samples=25600.0, train_step_timing in s=3.590]

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)

E.g. 4096 * 256 / 3.59 = 292082

To calculate time to train estimate:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 292082 / 86400 = 39.6 days

To calculate the model flops utilization (MFU):

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.

The model flops for Nemotron 4 340b at GBS=1 is 1.01e16; the calculation is shown in the Notes section below.

E.g. Nemotron 4 340b BF16 on 128x H100 GPUs (GBS=32)

peak FLOPS for H100 BF16 = 989 TFLOPS
training step time = 4.93 s
model flops = 1.01e16

MFU = 32 * 1.01e16 / 4.93 / 128 / 989e+12 = 51.7%

Nemotron4 340b BF16 (TP=8, PP=8, MBS=1, GA=16, VP=12):

| H100 GPUs | GBS | Training step time (seconds per step) | Throughput in tokens per second | Model flops utilization | Time to train 1T tokens in days |
|---|---|---|---|---|---|
| 128 | 32 | 4.93 | 26608 | 51.76% | 435 |
| 256 | 64 | 4.97 | 52757 | 51.36% | 219 |
| 512 | 128 | 4.99 | 105051 | 51.14% | 110 |
| 1024 | 256 | 5.02 | 208696 | 50.84% | 55 |
| 2048 | 512 | 5.11 | 410003 | 49.94% | 28 |

Nemotron4 340b FP8 (TP=8, PP=8, MBS=1, GA=16, VP=12):

| H100 GPUs | GBS | Training step time (seconds per step) | Throughput in tokens per second | Time to train 1T tokens in days |
|---|---|---|---|---|
| 128 | 32 | 3.39 | 38671 | 299 |
| 256 | 64 | 3.43 | 76434 | 151 |
| 512 | 128 | 3.49 | 150296 | 77 |
| 1024 | 256 | 3.56 | 294170 | 39 |
| 2048 | 512 | 3.7 | 567415 | 20 |

Prepare Environment

Create a staging area by running the attached setup.sh. The script converts the docker image from nvcr.io/nvidia/nemo:24.05 to the nvidia+nemo+24.05.sqsh file under the $STAGE_PATH folder and copies NeMo Launcher code from the container.

# Set the path where all artifacts will be downloaded
export STAGE_PATH=<path to your shared file system folder>  # e.g. /lustre/myproject/nemo
# Set the Slurm partition to launch against
export SLURM_PARTITION="batch"
# Set the Slurm account to launch against
export SLURM_ACCOUNT="account_name"
# Set the number of GPUs per node according to Slurm's gres configuration; this is usually 8 or null - https://slurm.schedmd.com/gres.html
export SLURM_GPU_PER_NODE=null

# Run the setup
bash ./setup.sh
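
Optionally, you can sanity-check the staging area before submitting any jobs; the listing below assumes only the file names referenced elsewhere in this recipe.

# Optional check: the squash file and launch scripts should exist after setup.sh completes
ls -lh $STAGE_PATH/nvidia+nemo+24.05.sqsh
ls $STAGE_PATH/launch_nemotron4_15b.sh $STAGE_PATH/launch_nemotron4_340b.sh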

Run Training

Once the environment has been prepared, it is time to train a model. NeMo Framework contains many predefined configuration files for various models, including the 340 billion parameter Nemotron 4 model. This section demonstrates how to initiate training of the model. You can see the default configuration for Nemotron 340b in the NeMo-Framework-Launcher GitHub repository. We will modify some of these parameters with our launch command.

NeMo Launcher uses the Hydra framework to process command-line arguments and pass them down as hyperparameters to the multi-node job performing the training.

Below is a command template for launching Nemotron 4 15b and Nemotron 4 340b model training with BF16 precision on a specified number of nodes. The training runs for 100 steps and then stops.

Log files and results will be located under the $STAGE_PATH/results/bf16/15b/ folder for the 15b parameter model and $STAGE_PATH/results/bf16/340b/ folder for the 340b parameter model.

sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} $STAGE_PATH/launch_nemotron4_15b.sh
sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} $STAGE_PATH/launch_nemotron4_340b.sh

If you would like to run with FP8 precision, simply set the environment variable ENABLE_FP8=True before launching the script. For instance,

ENABLE_FP8=True sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} $STAGE_PATH/launch_nemotron4_15b.sh
ENABLE_FP8=True sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} $STAGE_PATH/launch_nemotron4_340b.sh

Note that on certain clusters it might be necessary to pass --gres gpu:8 to sbatch if you encounter errors such as GPU not found. See https://slurm.schedmd.com/gres.html
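
For example, the 15b launch with the GRES flag added would look like:

sbatch --gres=gpu:8 -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} $STAGE_PATH/launch_nemotron4_15b.sh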

It is important to maintain the following model parallelism settings in order to accurately assess performance results for completed jobs against the expected baselines. For the 15b parameter model:

  • training.model.tensor_model_parallel_size=4
  • training.model.pipeline_model_parallel_size=1
  • training.model.micro_batch_size=4

For the 340b parameter model:

  • training.model.tensor_model_parallel_size=8
  • training.model.pipeline_model_parallel_size=8
  • training.model.micro_batch_size=1
  • training.model.virtual_pipeline_model_parallel_size=12

The global batch size (training.model.global_batch_size) for the 15b parameter model should be set to <number of nodes> * 32, e.g. 8 * 32 = 256 in the example above. The global batch size for the 340b parameter model should be set to <number of nodes> * 2, e.g. 16 * 2 = 32 in the example above.
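
For illustration, this scaling rule can be written as shell arithmetic; the GBS_15B and GBS_340B variables below are hypothetical helpers and are not read by the launch scripts.

# Hypothetical helpers: derive the global batch size from the node count per the rule above
export NUM_NODES=16
GBS_15B=$(( NUM_NODES * 32 ))    # 15b parameter model: 16 * 32 = 512
GBS_340B=$(( NUM_NODES * 2 ))    # 340b parameter model: 16 * 2 = 32
echo "15b GBS=${GBS_15B}, 340b GBS=${GBS_340B}"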

Run Nsight Profiling for the 15b parameter model:

If not already installed, download Nsight Systems 2024.4.1 from either the PBSS bucket s3://nsight or online.

Download Nsight Systems to $STAGE_PATH/nsight-systems-2024.4.1 or set the variable NSIGHT_DIR to the path where Nsight Systems is installed. To enable profiling with Nsight Systems, set the variable ENABLE_PROFILE=true. The variable JOB_TYPE can optionally be set to save profiling runs to a separate folder.

NSIGHT_DIR=<path to Nsight Systems> ENABLE_PROFILE=true sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_nemotron4_15b.sh
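
For instance, to keep profiled results separate from regular benchmark results (the JOB_TYPE value shown is arbitrary):

JOB_TYPE=profiling NSIGHT_DIR=<path to Nsight Systems> ENABLE_PROFILE=true sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch_nemotron4_15b.sh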

Notes

The calculations for the 15b parameter model:

model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
    attention flops = (24 * (number of layers) * (hidden size)^2) + (12 * (number of layers) * (hidden size) * (sequence length))
    mlp flops = 48 * (number of layers) * (hidden size)^2
    embedding flops = 6 * (vocab size) * (hidden size)

Nemotron4 15b calculation:
    sequence length = 4096
    number of layers = 32
    hidden size = 6144
    vocab size = 256000 
    attention flops = 24 * 32 * 6144^2 + 12 * 32 * 6144 * 4096 = 38654705664
    mlp flops = 48 * 32 * 6144^2 = 57982058496
    embedding flops = 6 * 256000 * 6144 = 9437184000

    model flops = 4096 * (38654705664 + 57982058496 + 9437184000) = 434,478,891,663,360 = 434.5e12

The calculations for the 340b parameter model:

model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
    attention flops = (24 * (number of layers) * (hidden size)^2) + (12 * (number of layers) * (hidden size) * (sequence length))
    mlp flops = 48 * (number of layers) * (hidden size)^2
    embedding flops = 6 * (vocab size) * (hidden size)

Nemotron4 340b calculation:
    sequence length = 4096
    number of layers = 96
    hidden size = 18432
    vocab size = 256000 
    attention flops = 24 * 96 * 18432^2 + 12 * 96 * 18432 * 4096 = 869730877440
    mlp flops = 48 * 96 * 18432^2 = 1565515579392
    embedding flops = 6 * 256000 * 18432 = 28311552000

    model flops = 4096 * (869730877440 + 1565515579392 + 28311552000) = 1.01e16
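
For reference, the formula above can be evaluated directly with shell integer arithmetic; this is only a sketch of the calculation, not part of the recipe scripts.

# Sketch: model flops at GBS=1 from the formula above (15b configuration shown;
# use layers=96 and hidden=18432 to reproduce the 340b value)
seq_len=4096; layers=32; hidden=6144; vocab=256000
attention_flops=$(( 24*layers*hidden*hidden + 12*layers*hidden*seq_len ))
mlp_flops=$(( 48*layers*hidden*hidden ))
embedding_flops=$(( 6*vocab*hidden ))
echo "model flops: $(( seq_len * (attention_flops + mlp_flops + embedding_flops) ))"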