Evaluate performance and efficiency of deep learning models with the LLM Benchmarking Collection.
The LLM Benchmarking Collection provides a suite of tools to quantify the performance of large language models (LLMs) and fine-tuning workloads across GPU-based infrastructure, whether running on-premises or with cloud service providers (CSPs).
Before you use the LLM Benchmarking Collection, make sure you have installed the following packages on your cluster.
Follow these steps to access and start using the LLM Benchmarking Collection:
Each workload resource includes:
The overview page for each workload highlights target performance metrics for the specified configuration, focusing on speed measurements such as the time taken per training step and the number of tokens processed per second.
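These throughput metrics are related by the run configuration. As a minimal sketch (all numbers below are hypothetical and not taken from the published benchmark results), tokens per second can be derived from step time, global batch size, and sequence length:

```python
# Hypothetical illustration: derive token throughput from step time.
# None of these values come from the published benchmark results.

def tokens_per_second(global_batch_size: int, seq_len: int, step_time_s: float) -> float:
    """Tokens processed per second across the whole job."""
    return global_batch_size * seq_len / step_time_s

def tokens_per_second_per_gpu(global_batch_size: int, seq_len: int,
                              step_time_s: float, num_gpus: int) -> float:
    """Per-GPU throughput, useful for comparing runs at different scales."""
    return tokens_per_second(global_batch_size, seq_len, step_time_s) / num_gpus

# Example: 2048 sequences of 4096 tokens per step, 10 s per step, 512 GPUs.
total = tokens_per_second(2048, 4096, 10.0)                  # 838860.8 tokens/s
per_gpu = tokens_per_second_per_gpu(2048, 4096, 10.0, 512)   # 1638.4 tokens/s per GPU
```

Per-GPU throughput is often the more useful number when comparing runs at different cluster scales, since total throughput grows with GPU count.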
The following table lists each benchmark used to evaluate model performance, along with its specific configuration.
Framework | Container Version | Model | Model Size | Type | Max Scale (# of GPUs) | Precision |
---|---|---|---|---|---|---|
NeMo | 24.09 | Nemotron4 | 15B, 340B | Training | 2048 | FP8, BF16 |
NeMo | 24.05 | GPT3 | 175B | Training | 2048 | FP8, BF16 |
NeMo | 24.09 | Llama 3.1 | 8B, 70B, 405B | Training | 2304 | FP8, BF16 |
Maxtext | 2024.12.09 | Llama2 | 70B | Training | 2048 | FP8, BF16 |
NeMo | 24.09 | Grok1 | 314B | Training | 2048 | FP8, BF16 |
HuggingFace TRL | 0.8.2 based on 24.02-py3 | Llama2 | 70B | SFT | 512 | BF16 |
HuggingFace TRL | 0.8.2 based on 24.02-py3 | Mistral | 7B | SFT | 256 | BF16 |
Baseline performance metrics were collected by running the workloads on the NVIDIA DGX H100 Reference Architecture. For more information, see DGX H100 Systems.
Note: the benchmarks are updated monthly. For older releases, use the Search feature in the NGC Resources section. For example, here are the steps to locate benchmarks from release 24.11.1:
The LLM Benchmarking Collection published baseline benchmark results using the following infrastructure, CSP-specific configurations, and software.
The benchmarks were built on the NVIDIA Reference Architecture. To achieve optimal performance on each CSP, you must make the following changes.
```shell
NCCL_LIB_DIR='/var/lib/tcpxo/lib64' source /var/lib/tcpxo/lib64/nccl-env-profile.sh; \
export NCCL_FASTRAK_CTRL_DEV=enp0s12; \
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0; \
export NCCL_SOCKET_IFNAME=enp0s12; \
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices; \
export NCCL_NET=FasTrak; \
ls /var/lib/tcpxo/lib64;"
```
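Before launching a job, it can help to confirm that the FasTrak-related variables above actually made it into the job environment. A minimal sketch (the variable list mirrors the snippet above; the helper function itself is hypothetical, not part of the collection):

```python
import os

# NCCL variables taken from the GCP configuration snippet above.
REQUIRED_GCP_NCCL_VARS = [
    "NCCL_FASTRAK_CTRL_DEV",
    "NCCL_FASTRAK_IFNAME",
    "NCCL_SOCKET_IFNAME",
    "NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY",
    "NCCL_NET",
]

def missing_nccl_vars(env=os.environ) -> list:
    """Return the required NCCL variables that are unset or empty."""
    return [name for name in REQUIRED_GCP_NCCL_VARS if not env.get(name)]

# Example: check a partial environment in which only NCCL_NET is set.
missing = missing_nccl_vars({"NCCL_NET": "FasTrak"})
# `missing` now lists the four remaining variables that are still unset.
```

A check like this can run at the top of a launch script so a misconfigured node fails fast instead of silently falling back to a slower transport.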
<path to topo file>
Example configuration for the NeMo Megatron Launcher:

```shell
export NCCL_TOPO_FILE=/opt/microsoft/nvd5-topo.xml # Exact location varies by cluster
export NCCL_P2P_NET_CHUNKSIZE=2097152

srun --container-image ${IMAGE} \
     --container-writable \
     --container-mounts ${NCCL_TOPO_FILE},${DATA_DIR}:/datasets/,${RESULT_DIR},$INDEX_MAPPING_DIR,${STAGE_PATH}/cfg:/cfg/ \
     --container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE \
     --cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,fff000000000000,fff000000000000000,fff000000000000000000,fff000000000000000000000" \
     --no-container-mount-home
<snip> ...
```
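The `--cpu-bind=mask_cpu` argument above takes one hex bitmask per task; each `fff` group selects a contiguous block of 12 CPUs, so the eight masks pin eight tasks to CPUs 0-11, 12-23, and so on up to 84-95. A small sketch to decode such masks (illustrative only, not part of the collection):

```python
def decode_cpu_mask(mask_hex: str) -> list:
    """Return the CPU indices selected by a hex cpu-bind mask."""
    value = int(mask_hex, 16)
    return [cpu for cpu in range(value.bit_length()) if value >> cpu & 1]

# The eight per-task masks from the srun example above.
masks = ("fff,fff000,fff000000,fff000000000,fff000000000000,"
         "fff000000000000000,fff000000000000000000,"
         "fff000000000000000000000").split(",")

for task, mask in enumerate(masks):
    cpus = decode_cpu_mask(mask)
    print(f"task {task}: CPUs {cpus[0]}-{cpus[-1]}")  # 12 CPUs per task
```

Decoding the masks this way makes it easy to check that each task's CPU block lines up with the NUMA domain local to its GPU on a given node topology.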
For questions or feedback, contact LLMBenchmarks@nvidia.com.