Grok1 314B 24.11.1 (DGXC Benchmarking)

Description: This recipe contains information and scripts to produce performance results for the Grok1 314B training workload.
Publisher: NVIDIA
Latest Version: 24.11.1
Modified: January 27, 2025
Compressed Size: 10 KB

Overview

This recipe contains information and scripts to produce performance results for the Grok 1 training workload. The scripts help perform environment setup and launch benchmark jobs. This variant of the workload is best suited for GPU clusters with:

  • At least 8 GPUs with at least 80 GB memory each. Training of this 314-billion parameter variant of the workload will not fit on fewer GPUs with less memory.
  • H100 or GH200 GPUs. This workload runs with FP8 and BF16 precision.

GPUs  SeqLen  Layers  TP  PP  CP  EP  DP  VP  MBS  GBS   GA
8     4096    2       4   1   1   2   1   NA  1    1024  128
16    4096    4       4   1   1   4   1   NA  1    1024  128
32    4096    4       4   1   1   8   1   NA  1    1024  128
64    8192    8       4   1   2   8   1   NA  1    1024  128
128   8192    16      4   2   2   8   1   8   1    1024  128
256   8192    32      4   4   2   8   1   8   1    1024  128
512   8192    64      4   8   2   8   1   8   1    1024  128
1024  8192    64      4   8   2   8   2   8   1    2048  128
2048  8192    64      4   8   2   8   4   8   1    4096  128

Expected Performance

Performance for Grok 1 training is measured in seconds per iteration, that is, seconds per training step. This metric is logged for every training step in a .out file generated in the $STAGE_PATH/results/$GSW_VERSION/$DTYPE/314b/$JOB_TOTAL_GPUS folder.

Because performance fluctuates significantly during the first steps, use the timing of the last training step to compute throughput.

grep train_step_timing *.out
Epoch 0: : 100%|██████████| 50/50 [23:22<00:00, reduced_train_loss=0.186, global_step=49.00, consumed_samples=51200.0, train_step_timing in s=24.00]
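
As a rough sketch, the last logged step time can also be extracted programmatically. The snippet assumes the progress-bar line format shown above; the function name `last_step_time` is hypothetical and not part of the benchmark scripts.

```python
import re

# Matches the "train_step_timing in s=<number>" field of the progress-bar line.
TIMING_RE = re.compile(r"train_step_timing in s=([0-9]*\.?[0-9]+)")


def last_step_time(log_text):
    """Return the last train_step_timing value (seconds) in log_text, or None."""
    matches = TIMING_RE.findall(log_text)
    return float(matches[-1]) if matches else None


sample = (
    "Epoch 0: : 100%|██████████| 50/50 [23:22<00:00, reduced_train_loss=0.186, "
    "global_step=49.00, consumed_samples=51200.0, train_step_timing in s=24.00]"
)
print(last_step_time(sample))  # 24.0
```

Taking the last match rather than the first follows the advice to ignore the noisy early steps.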

To obtain throughput as a tokens per second measurement, follow this formula:

(sequence length) * (global batch size) / (train_step_timing) = (throughput in tokens per second)

E.g. 8192 * 4096 / 22 = 1525201
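
The same arithmetic as a minimal Python sketch. The function name `tokens_per_second` is illustrative only; the inputs are the values from the example above.

```python
def tokens_per_second(seq_len, global_batch_size, step_time_s):
    """Throughput = tokens processed in one step / duration of that step."""
    return seq_len * global_batch_size / step_time_s


# Sequence length 8192, global batch size 4096, 22 seconds per step.
print(round(tokens_per_second(8192, 4096, 22)))  # 1525201
```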

To estimate the time to train on 1T tokens:

(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days) 

E.g. 1e12 / 1525201 / 86400 = 7.59 days
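
A small sketch of the time-to-train estimate, using the throughput value from the example above. The names `days_to_train` and `SECONDS_PER_DAY` are illustrative.

```python
SECONDS_PER_DAY = 86400


def days_to_train(total_tokens, tokens_per_s):
    """Days needed to process total_tokens at a constant throughput."""
    return total_tokens / tokens_per_s / SECONDS_PER_DAY


print(round(days_to_train(1e12, 1525201), 2))  # 7.59
```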

To calculate the model flops utilization (MFU):

MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.

The model flops for Grok 1 per sample (GBS=1) in the 2048-GPU configuration is 4.27E+15.

E.g. Grok 1 BF16 on 2048x H100 GPUs (GBS=4096)

peak FLOPS for H100 = 989 TFLOPS
training step time = 22
model flops = 4.27E+15

MFU = 4096 * 4.27E+15 / 22 / 2048 / 989E+12 = 39.3%
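
The MFU calculation can be sketched in Python as follows. The function and constant names are illustrative, and the inputs are the ones from the example above. Because those inputs are rounded, the computed value may differ from the quoted percentage in the last digit.

```python
# Peak theoretical H100 throughput in FLOPS, as stated above.
H100_PEAK_FLOPS = {"bf16": 989e12, "fp8": 1979e12}


def mfu(global_batch_size, model_flops, step_time_s, num_gpus, peak_flops):
    """Fraction of peak FLOPS achieved per GPU during one training step."""
    achieved_per_gpu = global_batch_size * model_flops / step_time_s / num_gpus
    return achieved_per_gpu / peak_flops


# Grok 1 BF16 on 2048 H100 GPUs, GBS=4096, 22 seconds per step.
value = mfu(4096, 4.27e15, 22, 2048, H100_PEAK_FLOPS["bf16"])
print(f"{value:.0%}")  # 39%
```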

Grok 1 314b Dev BF16                   8x H100     16x H100    32x H100    64x H100    128x H100   256x H100   512x H100   1024x H100  2048x H100
Training step time (seconds per step)  24.03       21.52       10.64       20.64       20.64       20.7        20.83       21          21.09
Throughput in tokens per second        174544      194903      394202      406425      406425      405247      402718      798915      1591011
Model flops utilization                45.4%       44.9%       45.4%       44.5%       43.0%       42.1%       41.5%       41.1%       41.0%
Time to train 1T tokens in days        NA          NA          NA          NA          NA          NA          28.74       14.49       7.27

Grok 1 314b Dev FP8                    8x H100     16x H100    32x H100    64x H100    128x H100   256x H100   512x H100   1024x H100  2048x H100
Training step time (seconds per step)  19.89       17.08       8.3         16.21       16.48       16.55       16.69       16.89       17.04
Throughput in tokens per second        210875      245568      505338      517496      509017      506865      502613      993322      1969157
Model flops utilization                27.4%       28.3%       29.1%       28.3%       26.9%       26.3%       25.9%       25.6%       25.3%
Time to train 1T tokens in days        NA          NA          NA          NA          NA          NA          23.03       11.65       5.88

For proxy configs (scales below 512 GPUs) we do not provide time-to-train estimates, to avoid misleading conclusions. Proxy configs are not realistic; they were created so that the Grok model fits on fewer GPUs than intended.

Download benchmark

Note that this step requires the NGC CLI to be installed on your Slurm login node. If it is not installed, install it before continuing. To download the benchmark:

  1. Click the top-right "Download" button and select "CLI". This copies the download command to the clipboard.
  2. In a terminal on the login node, paste the download command and run it.
  3. Extract the downloaded benchmark zip, which contains the required scripts and configuration files.

Prepare Environment

Create a staging area by running the attached setup.sh. The script converts the docker image from nvcr.io/nvidia/nemo:dev (sha256:9c876dce0621f954d1733063e2af642682b188d757fb280cc55755fb8e194400) to the nvidia+nemo+dev.sqsh file under the $STAGE_PATH folder and copies NeMo Launcher code from the container.

# Set the path where all artifacts will be downloaded (e.g. /lustre/myproject/nemo)
export STAGE_PATH=<path to your shared file system folder>
# Set the Slurm partition to launch against
export SLURM_PARTITION=<partition>
# Set the Slurm account to launch against
export SLURM_ACCOUNT=<account>
# Set the number of GPUs per node according to Slurm's gres; this is usually 8 or "". See https://slurm.schedmd.com/gres.html
export SLURM_GPUS_PER_NODE=""

# Run the setup
STAGE_PATH=<stage path> bash setup.sh

Note: Slurm parameters might not be applicable to all environments. Please consult with your system administrator and update or remove parameters as needed.

Dataset

Grok 1 uses synthetic data. A dataset does not need to be downloaded.

Run Training

Once the environment has been prepared, you can launch training. This section shows how to start a training run.

Training runs for 50 steps and then stops. Log files and results are located under the $STAGE_PATH/results/$GSW_VERSION/$DTYPE/314b/$JOB_TOTAL_GPUS folder.
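
A minimal sketch of how the results folder path is composed, and how the .out logs in it could be listed. The helper names are hypothetical. The example values are placeholders: in particular, using "24.11.1" for GSW_VERSION is an assumption, so check the value your launch script actually sets.

```python
from pathlib import Path


def results_dir(stage_path, gsw_version, dtype, total_gpus):
    """Build $STAGE_PATH/results/$GSW_VERSION/$DTYPE/314b/$JOB_TOTAL_GPUS."""
    return Path(stage_path) / "results" / gsw_version / dtype / "314b" / str(total_gpus)


def find_logs(directory):
    """Return the .out log files in directory, sorted by name."""
    return sorted(Path(directory).glob("*.out"))


# Placeholder values; the GSW_VERSION value here is an assumption.
run_dir = results_dir("/lustre/myproject/nemo", "24.11.1", "bf16", 256)
print(run_dir.as_posix())
```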

Below is a command template for launching Grok 1 model training.

DTYPE=<fp8/bf16> sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch.sh

Where:

  • DTYPE is a required environment variable.
    • DTYPE can be either fp8 or bf16.
  • NUM_NODES is N_GPUS / N_GPUS_PER_NODE. N_GPUS_PER_NODE is 8 for DGX H100, so for a 256-GPU run NUM_NODES should be 256 / 8 = 32.
  • -A and -p parameters are not applicable to all environments. Please consult with your system administrator and update or remove parameters as needed.
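
The node-count rule and the command template above can be sketched as a small Python helper that only builds the command string and does not submit anything. The function names and the example account and partition values are hypothetical.

```python
import shlex


def num_nodes(total_gpus, gpus_per_node=8):
    """NUM_NODES = N_GPUS / N_GPUS_PER_NODE; must divide evenly."""
    if total_gpus % gpus_per_node != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_node")
    return total_gpus // gpus_per_node


def launch_command(dtype, total_gpus, account, partition, gpus_per_node=8):
    """Fill in the sbatch command template for a given precision and scale."""
    if dtype not in ("fp8", "bf16"):
        raise ValueError("DTYPE must be fp8 or bf16")
    nodes = num_nodes(total_gpus, gpus_per_node)
    args = ["sbatch", "-A", account, "-p", partition, "-N", str(nodes), "./launch.sh"]
    return f"DTYPE={dtype} " + " ".join(shlex.quote(a) for a in args)


# "my_account" and "my_partition" are placeholder values.
print(launch_command("bf16", 256, "my_account", "my_partition"))
# DTYPE=bf16 sbatch -A my_account -p my_partition -N 32 ./launch.sh
```

If your cluster does not use -A or -p, drop those arguments from the list.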

Note: On some clusters it might be necessary to pass --gres=gpu:8 to sbatch if you encounter errors such as GPU not found. See https://slurm.schedmd.com/gres.html