Performance Recipes are ready-to-use templates for evaluating the performance of specific AI use cases across hardware and software combinations. These containerized recipes let users quickly set up and run a standardized benchmarking methodology in their own environment, ensuring consistent and comparable results across platforms.
These Performance Recipes support performance characterization of the training, fine-tuning, and inference workloads listed in the table below.
Each recipe maps to one workload and can be run at various cluster scales and precisions. These workloads are tested against the NVIDIA Reference Architecture, and those results are provided as a baseline for comparison. The performance metrics are collected from production environments and are subject to real-world variability.
At this time, Performance Recipes require Slurm as the cluster's job scheduler. Before you use the Performance Recipes, make sure you have installed the following packages on your cluster.
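For example (an illustrative check, not part of the recipes), you can confirm that Slurm is reachable from the login node before proceeding:

sinfo -V          # prints the installed Slurm version
squeue -u $USER   # verifies that you can query the scheduler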
Follow these steps to access and start using the LLM Benchmarking Collection:
Each workload resource includes:
The overview page for each workload highlights target performance metrics for the specified configuration, focusing on speed measurements such as the time per training step and the number of tokens processed per second.
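As an illustration of how these two metrics relate (a hedged sketch: the global batch size, sequence length, and step time below are assumed values, not recipe defaults):

# tokens/sec ≈ (global batch size × sequence length) / seconds per training step
# assumed values: 2048 sequences per global batch, 4096 tokens per sequence, 10 s per step
echo $(( 2048 * 4096 / 10 ))   # prints 838860, i.e. roughly 0.84M tokens/s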
The following table lists each benchmark used to evaluate model performance, along with its configuration.
| Benchmark | Framework | Container Version | Model | Model Size | Type | Max Scale (# of GPUs) | Precision | Model Access Required |
|---|---|---|---|---|---|---|---|---|
| Nemotron | NeMo | 24.12, 24.09 | Nemotron4 | 15B, 340B | Training | 2048 | FP8, BF16 | No |
| Megatron | NeMo | 24.12 | GPT3 | 175B | Training | 2048 | FP8, BF16 | No |
| Llama | NeMo | 24.12 | Llama 3.1 | 8B, 70B, 405B | Training | 2304 | FP8, BF16 | Yes |
| Maxtext | Maxtext | 25.01 | Llama3 | 70B | Training | 2048 | FP8, BF16 | No |
| Grok | NeMo | 24.12 | Grok1 | 314B | Training | 2048 | FP8, BF16 | No |
| SFT | NeMo | 24.12 | Llama 3 | 8B, 70B | Supervised Fine-Tuning | 32 | FP8, BF16 | Yes |
| LoRA | NeMo | 24.12 | Llama 3 | 8B, 70B | LoRA Fine-Tuning | 32 | FP8, BF16 | Yes |
| RAG Blueprint Pipeline | NIM | instruct:1.3.3, rerank:1.3, embed:1.3.1 | Llama 3.1 and 3.2 | 70B, 1B | Inference | n/a | n/a | Yes |
| NIM | NIM | 1.0.3 | Llama 3 | 70B | Inference | 4 | FP8 | Yes |
Baseline performance metrics were collected by running the workloads on the NVIDIA DGX H100 Reference Architecture. For more information, see DGX H100 Systems.
Note that the benchmarks are updated monthly. For older releases, use the Search feature in the NGC Resources section.
For example, here are the steps to locate benchmarks from release 24.11.1:
The LLM Benchmarking Collection published baseline benchmark results using the following infrastructure, CSP-specific configurations, and software.
AI platforms vary in implementation, such as differences in network fabric and virtualization, and therefore require different tuning. For optimal performance, use the correct implementation for your platform. The platform-specific tuning examples are provided as a starting point; further tuning may be necessary if your instance type differs from the Reference Architecture.
Enable Elastic Fabric Adapter (EFA) support by following the step-by-step guide. Use the reference NCCL tests Dockerfile with EFA support.
Additionally, CPU pinning has been found to improve performance. To enable it, set these three Slurm flags:
If you need to build a new Docker image, use the setup.sh script from the corresponding benchmark folder to determine the exact base image. The setup.sh script includes the command for importing the Docker image used by the workload. For example, nemotron/setup.sh includes "enroot import --output ${STAGE_PATH}/nvidia+nemo+24.12.sqsh docker://nvcr.io#nvidia/nemo:24.12", which translates to nvcr.io/nvidia/nemo:24.12 as the base image.
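As a minimal sketch (the grep pattern and the custom image tag below are illustrative assumptions, not part of the recipes), you could confirm the base image and extend it as follows:

# Identify the base image referenced by the workload's setup script
grep "enroot import" nemotron/setup.sh   # shows docker://nvcr.io#nvidia/nemo:24.12
# Pull the same image in registry notation and build a custom image on top of it
docker pull nvcr.io/nvidia/nemo:24.12
docker build -t my-nemo-custom:24.12 .   # Dockerfile begins with: FROM nvcr.io/nvidia/nemo:24.12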
Ensure that all required pre-conditions for GCP cluster deployment have been met.
Configure the compute fabric with TCP-X by ensuring the following environment variables are set for your environment.
NCCL_LIB_DIR='/var/lib/tcpxo/lib64' source /var/lib/tcpxo/lib64/nccl-env-profile.sh; \
export NCCL_FASTRAK_CTRL_DEV=enp0s12; \
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0; \
export NCCL_SOCKET_IFNAME=enp0s12; \
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices; \
export NCCL_NET=FasTrak; \
ls /var/lib/tcpxo/lib64;"
Four settings are required for optimal performance: two environment variables (NCCL_TOPO_FILE, set to <path to topo file>, and NCCL_P2P_NET_CHUNKSIZE) and two Slurm parameters.
Example configuration for the NeMo Megatron Launcher:
export NCCL_TOPO_FILE=/opt/microsoft/nvd5-topo.xml # Exact location varies by cluster
export NCCL_P2P_NET_CHUNKSIZE=2097152
srun --container-image ${IMAGE} \
--container-writable \
--container-mounts ${NCCL_TOPO_FILE},${DATA_DIR}:/datasets/,${RESULT_DIR},$INDEX_MAPPING_DIR,${STAGE_PATH}/cfg:/cfg/ \
--container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE \
--cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,fff000000000000,fff000000000000000,fff000000000000000000,fff000000000000000000000" \
--no-container-mount-home
<snip> ...
Adds support for the following workloads:
Starting with collection version 25.02, the following workloads will no longer be maintained:
Contains synopses and resolutions for known issues.
Large-scale pre-training run logs contain messages like the following:
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 18 [2]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 18 [4]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 19 [2]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 19 [4]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 22 [2]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 22 [4]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 23 [2]: expecting 1 got 0
[userbuffers.cu:userbuffers_fp16_sum_inplace_gpu_rr_rs_oop_fp8:797] [6] Reduce-scatter: SM 23 [4]: expecting 1 got 0
These messages usually mean that one of the GPUs is hanging. Possible resolutions:
A Slurm job failed during a benchmark run. For example, a Nemotron benchmark job with ID 2041792 failed:
sacct -j 2041792
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2041792 launch.sh batch test 224 FAILED 1:0
2041792.bat+ batch test 224 FAILED 1:0
2041792.ext+ extern test 224 COMPLETED 0:0
2041792.0 bash test 224 FAILED 1:0
You can find log files associated with this run under the $STAGE_PATH/results/$GSW_VERSION/$DTYPE/15b/$JOB_TOTAL_GPUS folder. The folder contains log-nemo_nemotron4_*.out and log-nemo_nemotron4_*.err files that hold the root-cause error message.
For example, for the job failure above, assuming the Nemotron 15B job ran on 16 GPUs with version 25.02 and precision bf16, the path will be $STAGE_PATH/results/25.02/bf16/15b/16/log-nemo_nemotron4_15b_16_2041792.*. Search for errors in the log-nemo_nemotron4_15b_16_2041792.err or log-nemo_nemotron4_15b_16_2041792.out files.
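A minimal sketch of such a search, using the example paths above (the grep pattern is illustrative only):

# Scan both log files for common failure markers
grep -iE "error|traceback|cuda" \
  $STAGE_PATH/results/25.02/bf16/15b/16/log-nemo_nemotron4_15b_16_2041792.{err,out}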
If a benchmark requires a Python virtual environment (venv) but the virtualenv executable isn't available on the login node, and login nodes cannot be updated by non-sudo users, you will see errors like the following when trying to set up the venv:
bash-5.2$ virtualenv
bash: virtualenv: command not found
Alternative virtual environment options, such as conda, are available. To install and activate a conda virtual environment:
# pick INSTALL_PATH with sufficient disk space
INSTALL_PATH=~
# download and run the Miniconda installer non-interactively
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $INSTALL_PATH/miniconda.sh
bash $INSTALL_PATH/miniconda.sh -b -p $INSTALL_PATH/miniconda3
# initialize conda for your shell and reload the shell configuration
$INSTALL_PATH/miniconda3/bin/conda init
source ~/.bashrc
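After installation, create and activate an environment for the benchmark, for example (the environment name and Python version are illustrative assumptions, not recipe requirements):

# create and activate a dedicated environment for the benchmark
conda create -y -n benchmark-venv python=3.10
conda activate benchmark-venv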
When you are finished running the benchmark, deactivate the environment by running:
conda deactivate
For questions or to provide feedback, please contact LLMBenchmarks@nvidia.com.