Once the training job has finished successfully, its performance metric is training throughput, which is derived from the time taken to complete each training step.
The example below is taken from the end of the output log file (see $STAGE_PATH/results/fp8/175b/$NUM_NODES/*.out). With NUM_NODES=16, the training step time measured at step 300 was 5.9 seconds.
grep train_step_timing results/fp8/175b/16/*.out
Epoch 0: : 100%|██████████| 300/300 [31:51<00:00, reduced_train_loss=6.130, global_step=299.0, consumed_samples=76800.0, train_step_timing in s=5.900, val_loss=6.250]
Because performance fluctuates significantly at the beginning of a run, we use the timing of the last training step to compute the throughput value.
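As a minimal sketch, the last reported step time can be pulled out of the log files with standard tools. The helper name last_step_time is hypothetical, and the pattern assumes the progress-bar format shown in the sample log line above ("train_step_timing in s=5.900"); adjust it if your logs differ.

```shell
# Hypothetical helper: print the last train_step_timing value found in the given log files.
# Assumes the log format shown above, e.g. "train_step_timing in s=5.900".
last_step_time() {
  # -h: no file names, -o: print only the matching part of each line
  grep -ho 'train_step_timing in s=[0-9.]*' "$@" | tail -n 1 | sed 's/.*=//'
}

# Example usage (path as in the grep command above):
# last_step_time results/fp8/175b/16/*.out
```

Using grep -o rather than whole lines keeps this working when the progress bar rewrites the same line many times, because every occurrence is emitted separately and only the final one is kept.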
To obtain throughput in tokens per second, use this formula:
(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)
E.g. 2048 * 256 / 5.90 = 88862.37
To estimate the time to train:
(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days)
E.g. 1e12 / 88862.37 / 86400 ≈ 130.2 days
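The two formulas above can be sketched as small shell helpers that use awk for the floating-point arithmetic. The function names tokens_per_second and days_to_train are hypothetical; the inputs (sequence length 2048, global batch size 256, a 5.90 s step time from the log above, and 1e12 total tokens) are the values quoted in this section.

```shell
# Hypothetical helpers implementing the two formulas above.

# tokens_per_second <sequence length> <global batch size> <step time in seconds>
tokens_per_second() {
  awk -v seq="$1" -v gbs="$2" -v t="$3" 'BEGIN { printf "%.2f\n", seq * gbs / t }'
}

# days_to_train <total tokens> <throughput in tokens per second>
days_to_train() {
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", tokens / tps / 86400 }'
}

tps=$(tokens_per_second 2048 256 5.90)
echo "throughput: ${tps} tokens/s"
echo "time to train 1T tokens: $(days_to_train 1e12 "$tps") days"
```

The helpers round to two decimals, so the last digit of the time-to-train value can differ by one from the tables below, which appear to truncate rather than round.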
To calculate the model flops utilization (MFU):
MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)
The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.
The model flops for GPT 3 175b with GBS=1 are 2.20E+15. The calculation is shown under "GPT 3 175b calculation" below.
E.g. GPT 3 175b FP8 on 128x H100 GPUs (GBS=256)
peak FLOPS for H100 = 1979 TFLOPS
training step time = 5.90 s
model flops = 2.20E+15
MFU = 256 * 2.20E+15 / 5.90 / 128 / 1979E+12 = 37.68%
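The MFU formula can be checked the same way. This is a sketch only: the helper name mfu_percent is hypothetical, and the inputs are the figures quoted in this section (2.20E+15 model flops per sample, 1979 and 989 TFLOPS peak, and the 128-GPU step times of 5.90 s for FP8 and 9.07 s for BF16).

```shell
# Hypothetical helper implementing the MFU formula above, printed as a percentage.
# mfu_percent <GBS> <model flops per sample> <step time in s> <number of GPUs> <peak FLOPS per GPU>
mfu_percent() {
  awk -v gbs="$1" -v mf="$2" -v t="$3" -v gpus="$4" -v peak="$5" \
    'BEGIN { printf "%.2f\n", 100 * gbs * mf / t / gpus / peak }'
}

echo "FP8  MFU on 128 GPUs: $(mfu_percent 256 2.20e15 5.90 128 1979e12)%"
echo "BF16 MFU on 128 GPUs: $(mfu_percent 256 2.20e15 9.07 128 989e12)%"
```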
NeMo Megatron FP8 (TP=4,PP=8, MBS=1, VP=12, SEQ=2048) | 128x H100 GPUs (GBS=256) | 256x H100 GPUs (GBS=512) | 512x H100 GPUs (GBS=1024) | 1024x H100 GPUs (GBS=2048) | 2048x H100 GPUs (GBS=4096) |
---|---|---|---|---|---|
Training step time (seconds per step) | 5.90 | 5.95 | 6.01 | 6.09 | 6.21 |
Throughput in tokens per second | 88862.37 | 176231.26 | 348943.76 | 688719.86 | 1350822.54 |
Time to train 1T tokens in days | 130.24 | 65.67 | 33.16 | 16.80 | 8.56 |
NeMo Megatron BF16 (TP=4,PP=8, MBS=1, VP=12, SEQ=2048) | 128x H100 GPUs (GBS=256) | 256x H100 GPUs (GBS=512) | 512x H100 GPUs (GBS=1024) | 1024x H100 GPUs (GBS=2048) | 2048x H100 GPUs (GBS=4096) |
---|---|---|---|---|---|
Training step time (seconds per step) | 9.07 | 9.08 | 9.08 | 9.11 | 9.14 |
Throughput in tokens per second | 57804.63 | 115481.93 | 230963.87 | 460406.58 | 917790.80 |
Model flops utilization | 49.05% | 49.00% | 49.00% | 48.84% | 48.68% |
Time to train 1T tokens in days | 200.22 | 100.22 | 50.11 | 25.13 | 12.61 |
Create a staging area by running the setup.sh script. The script saves the container image from the registry in the $STAGE_PATH folder and copies the NeMo Launcher code from the container to the staging directory.
# Set the path where all artifacts will be downloaded
export STAGE_PATH=<path to your shared file system folder>  # e.g. /lustre/myproject/nemo
# Set the Slurm partition to launch against
export SLURM_PARTITION="batch"
# Set the Slurm account to launch against
export SLURM_ACCOUNT="account_name"
# Set the number of GPUs per node according to Slurm's gres, this is usually 8 or null - https://slurm.schedmd.com/gres.html
export SLURM_GPU_PER_NODE=null
# Run the setup
bash ./setup.sh
Pre-training a GPT-3 model requires a text-based dataset to be downloaded and pre-processed so that the NeMo Framework can ingest the data optimally. The Pile is often used as the dataset for pre-training models. The NeMo Framework contains helper scripts to download and pre-process the dataset. The following steps outline how to download and pre-process the dataset on DGX Cloud, followed by an explanation of key points.
Run the generate_dataset.sh script. The script launches several Slurm jobs that download the dataset from The Pile, pre-process it, and save it in a form suitable for subsequent training. The resulting dataset files are saved under the $STAGE_PATH/gpt3-dataset folder. Dataset creation may use up to 250 GB, so make sure you have sufficient disk space available.
bash ./generate_dataset.sh
If the dataset generation step was successful there should be 4 idx and 4 bin files in the $STAGE_PATH/gpt3-dataset folder.
my-gpt3_00_text_document.bin
my-gpt3_00_text_document.idx
my-gpt3_01_text_document.bin
my-gpt3_01_text_document.idx
my-gpt3_02_text_document.bin
my-gpt3_02_text_document.idx
my-gpt3_03_text_document.bin
my-gpt3_03_text_document.idx
If that is not the case, check the log files in: $STAGE_PATH/results.data_preparation
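The file check described above can be scripted. This is a minimal sketch: check_dataset is a hypothetical helper, and it assumes the files follow the *_text_document.bin / *_text_document.idx naming shown in the listing above.

```shell
# Hypothetical check: verify that the dataset folder holds 4 .bin and 4 .idx files.
check_dataset() {
  ds_dir="$1"
  # $(( ... )) strips any padding that wc -l adds on some platforms
  n_bin=$(( $(find "$ds_dir" -maxdepth 1 -name '*_text_document.bin' | wc -l) ))
  n_idx=$(( $(find "$ds_dir" -maxdepth 1 -name '*_text_document.idx' | wc -l) ))
  if [ "$n_bin" -eq 4 ] && [ "$n_idx" -eq 4 ]; then
    echo "OK"
  else
    echo "MISSING: found $n_bin bin and $n_idx idx files"
    return 1
  fi
}

# Example usage:
# check_dataset "$STAGE_PATH/gpt3-dataset"
```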
Once the environment has been prepared, it is time to train a model. The NeMo Framework contains many predefined configuration files for various models including the 175 billion parameter GPT-3 model. This section will demonstrate how to initiate training on the model.
NeMo uses the Hydra framework to process command line arguments and the base config in the gpt3_175b_hydra.yaml file and passes them down as hyper parameters to a multi-node job performing the training.
Run the launch_175b.sh script to start NeMo Megatron 175b model training. The minimum required number of nodes is 16 (128 GPUs). Training runs for the first 300 steps and then stops. Log files and results are located under the $STAGE_PATH/results/bf16/175b/$SLURM_JOB_NUM_NODES folder.
sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N $NUM_NODES ./launch_175b.sh
To run with FP8 precision, set the environment variable ENABLE_FP8=True before launching the script; otherwise, training uses BF16 precision. For instance:
ENABLE_FP8=True sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N $NUM_NODES ./launch_175b.sh
Note that on some clusters it might be necessary to pass --gres gpu:8 to sbatch if you encounter errors such as GPU not found. See https://slurm.schedmd.com/gres.html
To compare the performance of completed jobs accurately against the expected baseline, keep the following model parallelism settings, which can be found in gpt3_175b_hydra.yaml:
Global batch size (training.model.global_batch_size) should be set to <number of nodes> * 16, e.g. 16 * 16 = 256 in the example above.
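The global batch size rule can be derived from the node count, as in the sketch below. The helper name global_batch_size is hypothetical, and the snippet only prints the resulting setting; how the value is passed to the launch script is not shown here.

```shell
# Hypothetical helper: global batch size for a given node count (16 per node, as stated above).
global_batch_size() {
  echo $(( $1 * 16 ))
}

NUM_NODES=16   # example value; use the node count of your job
echo "training.model.global_batch_size=$(global_batch_size "$NUM_NODES")"
```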
There is a known sporadic bug that may cause training errors like this: FileNotFoundError: [Errno 2] No such file or directory: '/user/results/70b/128/results/nemo_log_globalrank-149_localrank-5.txt'
The fix is to pass an additional hyper parameter on the command line or set it in the gpt3_175b_hydra.yaml file:
model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))
model flops breakdown:
attention flops = (24 * (number of layers) * (hidden size)^2) + (12 * (number of layers) * (hidden size) * (sequence length))
mlp flops = 48 * (number of layers) * (hidden size)^2
embedding flops = 6 * (vocab size) * (hidden size)
GPT 3 175b calculation:
sequence length = 2048
attention flops = 24 * 96 * 12288^2 + 12 * 96 * 12288 * 2048 = 376,883,380,224
mlp flops = 48 * 96 * 12288^2 = 695,784,701,952
embedding flops = 6 * 51200 * 12288 = 3,774,873,600
model flops = 2048 * (376,883,380,224 + 695,784,701,952 + 3,774,873,600) = 2.20E+15
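The breakdown above can be reproduced with a short awk-based sketch. The helper name model_flops is hypothetical; the arguments are the GPT 3 175b values used in the calculation above (sequence length 2048, 96 layers, hidden size 12288, vocab size 51200).

```shell
# Hypothetical helper implementing the model flops breakdown above.
# model_flops <sequence length> <number of layers> <hidden size> <vocab size>
model_flops() {
  awk -v s="$1" -v l="$2" -v h="$3" -v v="$4" 'BEGIN {
    attention = 24 * l * h * h + 12 * l * h * s
    mlp       = 48 * l * h * h
    embedding = 6 * v * h
    printf "%.2e\n", s * (attention + mlp + embedding)
  }'
}

model_flops 2048 96 12288 51200
```

Changing the arguments gives the per-sample model flops for other model sizes under the same formula.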