Evaluate the performance and efficiency of deep learning models with the LLM Benchmarking Collection.
The LLM Benchmarking Collection provides a suite of tools to quantify the performance of large language model (LLM) training and fine-tuning workloads on GPU-based infrastructure, whether running on-premises or with cloud service providers (CSPs).
Before you use the LLM Benchmarking Collection, make sure you have installed the following packages on your cluster.
Follow these steps to access and start using the LLM Benchmarking Collection:
Each workload resource includes:
The overview page for each workload highlights target performance metrics for the specified configuration, focusing on speed measurements such as time per training step and tokens processed per second.
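These two metrics are directly related: throughput in tokens per second can be derived from the global batch size, sequence length, and per-step time reported in the training logs. The following sketch shows the arithmetic using hypothetical values (not from any published baseline):

```bash
# Hypothetical numbers for illustration only; real values come from the
# training logs of each workload.
GLOBAL_BATCH_SIZE=2048   # sequences per training step
SEQ_LEN=4096             # tokens per sequence
STEP_TIME_SEC=4.5        # seconds per training step

# tokens/sec = (sequences per step * tokens per sequence) / step time
TOKENS_PER_SEC=$(awk -v b="$GLOBAL_BATCH_SIZE" -v s="$SEQ_LEN" -v t="$STEP_TIME_SEC" \
  'BEGIN { printf "%.0f\n", (b * s) / t }')
echo "Throughput: ${TOKENS_PER_SEC} tokens/sec"
```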
The following table lists each benchmark used to evaluate model performance, along with its specific configuration.
| Workload | Type | Description | Container Version | Dataset | Max Scale (#GPUs) | DTYPE |
|---|---|---|---|---|---|---|
| Nemotron4 | Training | 15B and 340B benchmarks | 24.09 | Synthetic | 2048 | FP8, BF16 |
| NeMo Megatron | Training | 175B benchmarks | 24.05 | Pile | 2048 | FP8, BF16 |
| Llama 3.1 | Training | 8B, 70B, and 405B benchmarks | 24.09 | Pile | 2304 | FP8, BF16 |
| PaXML | Training | 5B and 175B benchmarks | 24.03.04 | Synthetic | 2048 | FP8, BF16 |
| MaxText | Training | Llama 2 70B benchmarks | 2024.12.09 | Synthetic | 2048 | FP8, BF16 |
| Grok1 | Training | Grok1 314B benchmarks | 24.09 | Synthetic | 2048 | FP8, BF16 |
| Llama 2 | Fine-tuning | Hugging Face 70B benchmarks | 24.02 | HF Llama2 | 512 | BF16 |
| Mistral | Fine-tuning | Hugging Face 7B benchmarks | 24.02 | HF Mistral | 256 | BF16 |
Baseline performance metrics were collected by running the workloads on the NVIDIA DGX H100 Reference Architecture. For more information, see DGX H100 Systems.
The LLM Benchmarking Collection published baseline benchmark results using the following infrastructure, CSP-specific configurations, and software.
The benchmarks were built on the NVIDIA Reference Architecture. To achieve optimal performance on each CSP, you must make the following changes.
For example, on Google Cloud clusters that use the TCPXO (FasTrak) NCCL plugin, source the NCCL environment profile and export the FasTrak interface settings:

```bash
# Load the TCPXO NCCL environment profile, then export the FasTrak settings.
NCCL_LIB_DIR='/var/lib/tcpxo/lib64' source /var/lib/tcpxo/lib64/nccl-env-profile.sh; \
export NCCL_FASTRAK_CTRL_DEV=enp0s12; \
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0; \
export NCCL_SOCKET_IFNAME=enp0s12; \
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices; \
export NCCL_NET=FasTrak; \
ls /var/lib/tcpxo/lib64  # sanity check: list the TCPXO plugin libraries
```
On Azure, NCCL requires the cluster's topology file (`<path to topo file>`).

Example configuration for the NeMo Megatron Launcher:
```bash
export NCCL_TOPO_FILE=/opt/microsoft/nvd5-topo.xml  # Exact location varies by cluster
export NCCL_P2P_NET_CHUNKSIZE=2097152

# Each "fff" mask below binds one of the 8 tasks per node (typically one per
# GPU) to its own block of 12 CPUs.
srun --container-image ${IMAGE} \
     --container-writable \
     --container-mounts ${NCCL_TOPO_FILE},${DATA_DIR}:/datasets/,${RESULT_DIR},$INDEX_MAPPING_DIR,${STAGE_PATH}/cfg:/cfg/ \
     --container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE \
     --cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,fff000000000000,fff000000000000000,fff000000000000000000,fff000000000000000000000" \
     --no-container-mount-home
<snip> ...
```
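When adapting these settings to your cluster, one way to confirm that NCCL picked up the topology file and environment overrides (a suggestion here, not part of the collection) is to enable NCCL's standard debug logging for a short run:

```bash
# Standard NCCL debug switches; INIT and NET print initialization details,
# including the network plugin and topology NCCL selected.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```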
For questions or to provide feedback, please contact LLMBenchmarks@nvidia.com.