NVIDIA HPC-Benchmarks
Description
The NVIDIA HPC-Benchmarks collection provides four accelerated HPC benchmarks: HPL-NVIDIA, HPL-MxP-NVIDIA, HPCG-NVIDIA, and STREAM.
Publisher
NVIDIA
Latest Tag
25.04
Modified
May 1, 2025
Compressed Size
7.03 GB
Multinode Support
Yes
Multi-Arch Support
Yes

NVIDIA HPC-Benchmarks 25.04

The NVIDIA HPC-Benchmarks collection provides four benchmarks widely used in the HPC community (HPL, HPL-MxP, HPCG, and STREAM), optimized for performance on NVIDIA accelerated HPC systems.

NVIDIA's HPL and HPL-MxP benchmarks are software packages that solve a (random) dense linear system on distributed-memory computers equipped with NVIDIA GPUs: HPL uses double precision (64-bit) arithmetic, while HPL-MxP uses mixed precision arithmetic with Tensor Cores. Both are based on the Netlib HPL benchmark and the HPL-MxP benchmark.

NVIDIA's HPCG benchmark accelerates the High Performance Conjugate Gradients (HPCG) Benchmark. HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64-bit) floating point values.

NVIDIA's STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth. The NVIDIA HPC-Benchmarks container includes STREAM benchmarks optimized for the NVIDIA Ampere GPU architecture (sm80), the NVIDIA Hopper GPU architecture (sm90), the NVIDIA Blackwell GPU architecture (sm100), and the NVIDIA Grace CPU.

Container packages

The NVIDIA HPC-Benchmarks collection provides a multiplatform (x86 and aarch64) container image, hpc-benchmarks:25.04, based on the NVIDIA Optimized Frameworks 25.01 container image.

In addition to the contents of the NVIDIA Optimized Frameworks 25.01 container image, the hpc-benchmarks:25.04 image embeds the following packages:

  • NVIDIA HPL 25.04
  • NVIDIA HPL-MxP 25.04
  • NVIDIA HPCG 25.04
  • NVIDIA STREAM 25.04
  • NVIDIA NVSHMEM 3.2.5
  • NVIDIA NVPL 25.1

Builds of NVIDIA HPC-Benchmarks for MPI libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH) and for OpenMPI are available on NVIDIA.DEVELOPER.

Prerequisites

Using the NVIDIA HPC-Benchmarks Container requires the host system to have the following installed:

  • Docker Engine
  • NVIDIA GPU Drivers
  • NVIDIA Container Toolkit or NVIDIA Pyxis/Enroot, or Singularity version 3.4.1 or later

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

The NVIDIA HPC-Benchmarks Container supports the NVIDIA Ampere GPU architecture (sm80), the NVIDIA Hopper GPU architecture (sm90), and the NVIDIA Blackwell GPU architecture (sm100). This version of the container supports clusters featuring DGX A100, DGX H100, DGX B200, NVIDIA Grace Hopper, NVIDIA Grace Blackwell, and NVIDIA Grace CPU nodes. Previous GPU generations are not expected to be compatible.

Containers folder structure

The hpc-benchmarks:25.04 container provides the NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM benchmarks in the following folder structure:

x86 container image:

  • hpl.sh script in the folder /workspace to invoke the xhpl executable.
  • hpl-mxp.sh script in the folder /workspace to invoke the xhpl_mxp executable.
  • hpcg.sh script in the folder /workspace to invoke the xhpcg executable.
  • stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA GPUs.

NVIDIA HPL in the folder /workspace/hpl-linux-x86_64 contains:

  • xhpl executable.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • Samples of input files in sample-dat directory.
  • README, RUNNING, and TUNING guides.

NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-x86_64 contains:

  • xhpl_mxp executable.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • README, RUNNING, and TUNING guides.

NVIDIA HPCG in the folder /workspace/hpcg-linux-x86_64 contains:

  • xhpcg executable.
  • Samples of Slurm batch-job scripts in sample-slurm directory
  • Sample input file in sample-dat directory.
  • README, RUNNING, and TUNING guides.

NVIDIA STREAM in the folder /workspace/stream-gpu-linux-x86_64 contains:

  • stream_test executable. GPU STREAM benchmark with double precision elements.
  • stream_test_fp32 executable. GPU STREAM benchmark with single precision elements.

aarch64 container image:

  • hpl-aarch64.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace CPU.
  • hpl.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • hpl-mxp-aarch64.sh script in the folder /workspace to invoke the xhpl_mxp executable for NVIDIA Grace CPU.
  • hpl-mxp.sh script in the folder /workspace to invoke the xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • hpcg-aarch64.sh script in the folder /workspace to invoke the xhpcg executables for NVIDIA Grace Hopper, NVIDIA Grace Blackwell, and NVIDIA Grace CPU.
  • stream-cpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace CPU.
  • stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA GPUs.

NVIDIA HPL in the folder /workspace/hpl-linux-aarch64 contains:

  • xhpl executable for NVIDIA Grace CPU.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • Samples of input files in sample-dat directory.
  • README, RUNNING, and TUNING guides.

NVIDIA HPL in the folder /workspace/hpl-linux-aarch64-gpu contains:

  • xhpl executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • Samples of input files in sample-dat directory.
  • README, RUNNING, and TUNING guides.

NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-aarch64 contains:

  • xhpl_mxp executable for NVIDIA Grace CPU.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • README and RUNNING guides.

NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-aarch64-gpu contains:

  • xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • Samples of Slurm batch-job scripts in sample-slurm directory.
  • README, RUNNING, and TUNING guides.

NVIDIA HPCG in the folder /workspace/hpcg-linux-aarch64 contains:

  • xhpcg executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • xhpcg-cpu executable for NVIDIA Grace CPU.
  • Samples of Slurm batch-job scripts in sample-slurm directory
  • Sample input file in sample-dat directory.
  • README, RUNNING, and TUNING guides.

NVIDIA STREAM in the folder /workspace/stream-gpu-linux-aarch64 contains:

  • stream_test executable. GPU STREAM benchmark with double precision elements.
  • stream_test_fp32 executable. GPU STREAM benchmark with single precision elements.

NVIDIA STREAM in the folder /workspace/stream-cpu-linux-aarch64 contains:

  • stream_test executable. NVIDIA Grace CPU STREAM benchmark with double precision elements.

Running the NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark to get started with the HPL software concepts and best practices.

The NVIDIA HPCG benchmark uses the same input format as the standard HPCG-Benchmark. Please see the HPCG-Benchmark to get started with the HPCG software concepts and best practices.

The NVIDIA HPL-MxP benchmark accepts a list of parameters to describe the input task and to set additional tuning options. The description of the parameters can be found in the README and TUNING files.

The NVIDIA HPL, NVIDIA HPL-MxP, and NVIDIA HPCG benchmarks with GPU support require one GPU per MPI process. Therefore, ensure that the number of MPI processes is set to match the number of available GPUs in the cluster.

NVIDIA HPL Out-of-core mode

Version 25.04 of the NVIDIA HPL benchmark supports an 'out-of-core' mode. This is an opt-in feature and the default mode remains the 'in-core' mode.

The NVIDIA HPL out-of-core mode enables the use of larger matrix sizes. Unlike the in-core mode, any matrix data that exceeds GPU memory capacity is automatically stored in the host CPU memory. To activate this feature, simply set the environment variable HPL_OOC_MODE=1 and specify a larger matrix size (e.g., using the N parameter in the input file).

Performance will depend on host-device transfer speeds. For best performance, try to keep the amount of host memory used for the matrix to around 6-16 GiB on platforms where the CPU and GPU are connected via PCIe (such as x86). For systems where there is a faster CPU-GPU interconnect (such as NVIDIA Grace Hopper and NVIDIA Grace Blackwell), sizes greater than 16 GiB may be beneficial. A method to estimate the matrix size for this feature is to take the largest per GPU memory size used with NVIDIA HPL in-core mode, add the target amount of host data, and then work out the new matrix size from this total size.

All the environment variables needed by the NVIDIA HPL out-of-core mode can be found in the provided /workspace/hpl-linux-x86_64/TUNING or /workspace/hpl-linux-aarch64-gpu/TUNING files.

If NVIDIA HPL out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.

If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (and not used by HPL OOC). This is controlled by the HPL_OOC_SAFE_SIZE environment variable; the default value is 3.0 (the buffer size in GiB). Depending on the GPU/driver, you may need to increase this further to resolve memory issues.
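
As an illustrative sketch (not an official recipe), an out-of-core run on a single x86 node with 4 GPUs could look like the following. The reserved driver memory and the affinity lists are placeholders to adapt to your system, the matrix size is taken from your HPL.dat, and Pyxis is assumed to propagate the exported job environment into the container (its default behavior):

# Illustrative values only
export HPL_OOC_MODE=1          # enable out-of-core mode
export HPL_OOC_SAFE_SIZE=4.0   # GiB of GPU memory left to the driver; raise if OOM persists

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpl.sh --dat /my-dat-files/HPL.dat \
     --gpu-affinity 0:1:2:3 --cpu-affinity 0-13:14-27:28-41:42-55 --mem-affinity 0:0:1:1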

NVIDIA HPL with FP64 emulation

Version 25.04 of the NVIDIA HPL benchmark supports an FP64 emulation mode [1] on the NVIDIA Blackwell GPU architecture, using the techniques described in [2]. This is an opt-in feature, and the default remains native FP64 computation.

Environment variables to set up and control the NVIDIA HPL Benchmark FP64 emulation mode:

HPL_EMULATE_DOUBLE_PRECISION: Enables/disables FP64 emulation mode
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)

HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT: The maximum number of mantissa bits to be used for FP64 emulation [2] (includes IEEE FP64 standard's implicit bit)
- Default Value: 53
- Possible Values: >0

Note:

  • The number of slices (INT8 data elements [2]) can be calculated as: nSlices = ceildiv((mantissaBitCount + 1), sizeofBits(INT8)), where the additional bit is used for the sign (+/-) of the value.
  • In the current iteration of the NVIDIA HPL benchmark, FP64 emulation utilizes INT8 data elements and compute resources. This may change in future releases.
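
With the default of 53 mantissa bits, for example, nSlices = ceildiv(53 + 1, 8) = 7 INT8 slices per value. A minimal sketch of enabling the emulation mode for a Pyxis-based run on a Blackwell node follows; the mantissa bit count is shown only to make the setting explicit, and Pyxis is assumed to propagate the exported environment into the container:

# Illustration only: enable FP64 emulation with the default mantissa width
export HPL_EMULATE_DOUBLE_PRECISION=1
export HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT=53

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpl.sh --dat /my-dat-files/HPL.dat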

x86 container image

The scripts hpl.sh and hpcg.sh can be invoked on a command line or through a Slurm batch-script to launch the NVIDIA HPL and NVIDIA HPCG benchmarks, respectively. The scripts hpl.sh and hpcg.sh accept the following parameters:

  • --dat path to HPL.dat

Optional parameters:

  • --gpu-affinity <string> colon separated list of GPU indices
  • --cpu-affinity <string> colon separated list of CPU index ranges
  • --mem-affinity <string> colon separated list of memory indices
  • --ucx-affinity <string> colon separated list of UCX devices
  • --ucx-tls <string> UCX transport to use
  • --exec-name <string> HPL executable file
  • --no-multinode enable flags for no-multinode (no-network) execution (HPL only)

In addition, instead of an input file, the script hpcg.sh accepts the following parameters:

  • --nx specifies the local (to an MPI process) X dimension of the problem
  • --ny specifies the local (to an MPI process) Y dimension of the problem
  • --nz specifies the local (to an MPI process) Z dimension of the problem
  • --rt specifies how long, in seconds, the timed portion of the benchmark should run
  • --b activates benchmarking mode to bypass CPU reference execution when set to one (--b 1)
  • --l2cmp activates compression in GPU L2 cache when set to one (--l2cmp 1)
  • --of activates writing the log to text files instead of the console (--of=1)
  • --gss specifies the slice size for the GPU rank (default is 2048)
  • --p2p specifies the p2p comm mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU
  • --npx specifies the process grid X dimension of the problem
  • --npy specifies the process grid Y dimension of the problem
  • --npz specifies the process grid Z dimension of the problem

The script hpl-mxp.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark. The script hpl-mxp.sh requires the following parameters:

  • --gpu-affinity <string> colon separated list of GPU indices
  • --nprow <int> number of rows in the processor grid
  • --npcol <int> number of columns in the processor grid
  • --nporder <string> "row" or "column" major layout of the processor grid
  • --n <int> size of N-by-N matrix
  • --nb <int> nb is the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files, or in the NVIDIA HPL-MxP Documentation on NVIDIA.DEVELOPER.

Note:

  • CPU and memory affinities can improve the performance of the NVIDIA HPCG and NVIDIA HPL-MxP benchmarks. Below are examples for DGX nodes:
    • DGX-H100 and DGX-B200: --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
    • DGX-A100: --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
  • Multi-Instance GPU (MIG) technology can help improve HPCG benchmark performance on NVIDIA Blackwell GPUs with a dual-die design. More details can be found in the NVIDIA HPCG Documentation on NVIDIA.DEVELOPER.

The script stream-gpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA STREAM benchmark. The script stream-gpu-test.sh accepts the following optional parameters:

  • --d <int> device number
  • --n <int> number of elements in the arrays
  • --dt fp32 enable fp32 stream test
  • --t <string> tests which will be executed; can be any combination of:
    • C - COPY test
    • S - SCALE test
    • A - ADD test
    • T - TRIAD test

For example, the value --t CST means that the COPY, SCALE, and TRIAD tests will be executed. The default value is CSAT.
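
For instance, a hypothetical single-GPU invocation through Pyxis might look like the following; the device index, element count, and test selection are illustrative only:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'

srun -N 1 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     ./stream-gpu-test.sh --d 0 --n 100000000 --t CSAT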

aarch64 container image

NVIDIA HPL, NVIDIA HPCG, NVIDIA HPL-MxP, and NVIDIA STREAM GPU benchmarks on aarch64 can be run similarly to those on x86_64 (see the x86 container image section for details).

This section provides sample runs of NVIDIA HPL, NVIDIA HPL-MxP, and NVIDIA HPCG benchmarks for NVIDIA Grace CPU.

The scripts hpl-aarch64.sh and hpcg-aarch64.sh can be invoked either from the command line or through a Slurm batch-script to launch the NVIDIA HPL and NVIDIA HPCG benchmarks for NVIDIA Grace CPU, respectively.

The scripts hpl-aarch64.sh and hpcg-aarch64.sh accept the following parameters:

  • --dat path to HPL.dat

Optional parameters:

  • --cpu-affinity <string> colon separated list of CPU index ranges
  • --mem-affinity <string> colon separated list of memory indices
  • --ucx-affinity <string> colon separated list of UCX devices
  • --ucx-tls <string> UCX transport to use
  • --exec-name <string> HPL executable file
  • --no-multinode enable flags for no-multinode (no-network) execution

Note: It is recommended to bind MPI processes to NUMA nodes on NVIDIA Grace CPU, for example:

./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

In addition, instead of an input file, the script hpcg-aarch64.sh accepts the following parameters:

  • --nx specifies the local (to an MPI process) X dimension of the problem
  • --ny specifies the local (to an MPI process) Y dimension of the problem
  • --nz specifies the local (to an MPI process) Z dimension of the problem
  • --rt specifies how long, in seconds, the timed portion of the benchmark should run
  • --b activates benchmarking mode to bypass CPU reference execution when set to one (--b=1)
  • --l2cmp activates compression in GPU L2 cache when set to one (--l2cmp=1)
  • --of activates writing the log to text files instead of the console (--of=1)
  • --gss specifies the slice size for the GPU rank (default is 2048)
  • --css specifies the slice size for the CPU rank (default is 8)

The following parameters control the NVIDIA HPCG benchmark on NVIDIA Grace Hopper and NVIDIA Grace Blackwell systems:

  • --exm specifies the execution mode. 0 is GPU-only, 1 is Grace-only, and 2 is GPU-Grace. Default is 0
  • --ddm specifies the dimension in which the GPU and Grace local problems differ. 0 is auto, 1 is X, 2 is Y, and 3 is Z. Default is 0. Note that the GPU and Grace local problems can differ in one dimension only
  • --lpm controls the meaning of the value provided for the --g2c parameter. Applicable when --exm is 2; its interpretation depends on the differing local dimension specified by --ddm. Possible values:
    • 0 means nx/ny/nz are the GPU local dims and the g2c value is the ratio of the GPU dim to the Grace dim. For example, --nx 128 --ny 128 --nz 128 --ddm 2 --g2c 8 means the differing Grace dim (Y in this example) is 1/8 of the corresponding GPU dim. The GPU local problem is 128x128x128 and the Grace local problem is 128x16x128.
    • 1 means nx/ny/nz are the GPU local dims and the g2c value is the absolute value of the differing dim for Grace. For example, --nx 128 --ny 128 --nz 128 --ddm 3 --g2c 64 means the differing Grace dim (Z in this example) is 64. The GPU local problem is 128x128x128 and the Grace local problem is 128x128x64.
    • 2 assumes a local problem formed by combining a GPU and a Grace problem: the specified dim is the sum of the differing GPU and Grace dims, and --g2c is the ratio between them. For example, with --ddm 1, --nx 1024, and --g2c 8, the GPU X dim is 896 and the Grace X dim is 128.
    • 3 assumes a local problem formed by combining a GPU and a Grace problem: the specified dim is the sum of the differing GPU and Grace dims, and --g2c is the absolute Grace value. For example, with --ddm 1, --nx 1024, and --g2c 96, the GPU X dim is 928 and the Grace X dim is 96.
  • --g2c specifies the value of the differing dimension of the GPU and Grace local problems; its meaning depends on the --ddm and --lpm values.

Optional parameters of hpcg-aarch64.sh script:

  • --p2p specifies the p2p comm mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU
  • --npx specifies the process grid X dimension of the problem
  • --npy specifies the process grid Y dimension of the problem
  • --npz specifies the process grid Z dimension of the problem
  • --gpu-affinity colon separated list of GPU indices
  • --cpu-affinity colon separated list of CPU index ranges
  • --mem-affinity colon separated list of memory indices
  • --ucx-affinity colon separated list of UCX devices
  • --ucx-tls UCX transport to use
  • --exec-name HPCG executable file
  • --cuda-compat manually enable CUDA forward compatibility

The script hpl-mxp-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark for NVIDIA Grace CPU. The script hpl-mxp-aarch64.sh requires the following parameters:

  • --nprow <int> number of rows in the processor grid
  • --npcol <int> number of columns in the processor grid
  • --nporder <string> "row" or "column" major layout of the processor grid
  • --n <int> size of N-by-N matrix
  • --nb <int> nb is the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

The script stream-cpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA STREAM benchmark. The script stream-cpu-test.sh accepts the following optional parameters:

  • --n <int> number of elements in the arrays
  • --t <int> number of threads
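
As an illustration, a hypothetical run on a single NVIDIA Grace CPU node through Pyxis might look like the following; the element count and thread count are placeholders to match your system:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'

srun -N 1 --ntasks-per-node=1 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     ./stream-cpu-test.sh --n 100000000 --t 72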

For a general guide on pulling and running containers, see Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide. For more information about using NGC, refer to the NGC Container User Guide.

NVIDIA HPL Benchmark Environment variables

NVIDIA HPL accepts several runtime environment variables to improve performance on different platforms.

HPL_P2P_AS_BCAST: Which communication library to use in the final solve step.
- Default Value: 1
- Possible Values: 0 (NCCL bcast), 1 (NCCL send/recv), 2 (CUDA-aware MPI), 3 (host MPI), 4 (NVSHMEM)

HPL_USE_NVSHMEM: Enables/disables NVSHMEM support in HPL.
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)

HPL_NVSHMEM_INIT: NVSHMEM initialization type.
- Default Value: 0
- Possible Values: 0 (using MPI) , 1 (using unique ID (UID))

HPL_FCT_COMM_POLICY: Which communication library to use in the panel factorization.
- Default Value: 1
- Possible Values: 0 (NVSHMEM), 1 (host MPI)

HPL_NVSHMEM_SWAP: Performs row swaps using NVSHMEM instead of NCCL (default).
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)

HPL_CHUNK_SIZE_NBS: Number of matrix blocks (size NB) to group for computations.
- Default Value: 16
- Possible Values: >0

HPL_DIST_TRSM_FLAG: Perform the solve step (TRSM) in parallel rather than on only the ranks that own that part of the matrix.
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)

HPL_CTA_PER_FCT: Sets the number of CTAs (thread blocks) for factorization.
- Default Value: 16
- Possible Values: >0

HPL_ALLOC_HUGEPAGES: Use 2MB hugepages for host-side allocations. Done through the madvise syscall and requires /sys/kernel/mm/transparent_hugepage/enabled to be set to madvise to have an effect.
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)

WARMUP_END_PROG: Runs the main loop once before the 'real' run. Stops the warmup loop at x%.
- Default Value: -1
- Possible Values: -1 to 100

TEST_LOOPS: Runs the main loop x many times.
- Default Value: 1
- Possible Values: >0

HPL_CUSOLVER_MP_TESTS: Runs several tests of individual components of HPL (GEMMs, comms, etc.).
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)

HPL_CUSOLVER_MP_TESTS_GEMM_ITERS: Number of repeat GEMM calls in tests.
- Default Value: 128
- Possible Values: >0
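
These are ordinary environment variables, so they can simply be exported before launching the benchmark. A minimal sketch for a Pyxis-based run is shown below; the chosen values are illustrative only, not tuned recommendations, and Pyxis is assumed to propagate the exported job environment into the container:

# Illustration only: use NCCL broadcast in the final solve step and
# group 32 matrix blocks per computation
export HPL_P2P_AS_BCAST=0
export HPL_CHUNK_SIZE_NBS=32

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpl.sh --dat /my-dat-files/HPL.dat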

Environment variables to set up and control the NVIDIA HPL Benchmark out-of-core mode:

HPL_OOC_MODE: Enables/disables out-of-core mode.
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)

HPL_OOC_MAX_GPU_MEM: Limits the amount of GPU memory used for OOC (measured in GiB).
- Default Value: -1
- Possible Values: >=-1

HPL_OOC_TILE_M: Row blocking factor.
- Default Value: 4096
- Possible Values: >0

HPL_OOC_TILE_N: Column blocking factor.
- Default Value: 4096
- Possible Values: >0

HPL_OOC_NUM_STREAMS: Number of streams used for OOC operations.
- Default Value: 3
- Possible Values: >0

HPL_OOC_SAFE_SIZE: GPU memory (in GiB) reserved for the driver; this amount of memory will not be used by HPL OOC.
- Default Value: 3.0
- Possible Values: >0

Running with Pyxis/Enroot

The examples below use NVIDIA Pyxis/Enroot to facilitate running the HPC-Benchmarks container. Note that an Enroot .credentials file is necessary to use these NGC containers.

To copy and customize the sample Slurm scripts and/or sample HPL.dat/hpcg.dat files from the containers, run the container in interactive mode, while mounting a folder outside the container, and copy the needed files, as follows:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="$PWD:/home_pwd"

srun -N 1 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     --pty bash

Once inside the container, copy the needed files to /home_pwd.

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks with GPU support

Examples of NVIDIA HPL run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.

To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpl.sh --dat /my-dat-files/HPL.dat

To run NVIDIA HPL on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'

srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

Examples of NVIDIA HPL-MxP run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.

To run NVIDIA HPL-MxP on a single node with 8 GPUs:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'

srun -N 1 --ntasks-per-node=8 \
     --container-image="${CONT}" \
     ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7

To run NVIDIA HPL-MxP on 4 nodes, each node with 4 GPUs:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'

srun -N 4 --ntasks-per-node=4 \
     --container-image="${CONT}" \
     ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3

Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.

Examples of NVIDIA HPCG run

Several sample Slurm scripts and sample input file are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64

To run NVIDIA HPCG on a single node with one GPU using your custom hpcg.dat file on x86:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpcg.sh --dat /my-dat-files/hpcg.dat

To run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpcg.sh --nx 512 --ny 512 --nz 256 --rt 2 --cpu-affinity 0-55:56-111:112-167:168-223 --mem-affinity 0:0:1:1

To run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on aarch64:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpcg-aarch64.sh --nx 512 --ny 512 --nz 256 --rt 2 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks for NVIDIA Grace CPU

Examples of NVIDIA HPL run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.

To run NVIDIA HPL on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:

CONT='nvcr.io#nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPL-MxP run

To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'

srun -N 1 --ntasks-per-node=16 \
     --container-image="${CONT}" \
     ./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
     --cpu-affinity 0-71:72-143:144-215:216-287 \
     --mem-affinity 0:1:2:3

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPCG run

Sample input file is available in the container at /workspace/hpcg-linux-aarch64

To run NVIDIA HPCG on two nodes of NVIDIA Grace CPU using your custom parameters:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 30 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1

To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

#GPU+Grace (Heterogeneous execution)
#GPU rank has 8 OpenMP threads and Grace rank has 64 OpenMP threads
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" \
     --container-mounts="${MOUNT}" \
     ./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
     --exm 2 --ddm 2 --lpm 1 --g2c 64 \
     --npx 4 --npy 4 --npz 1 \
     --cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
     --mem-affinity 0:0:1:1:2:2:3:3

Running with Singularity

The instructions below assume Singularity 3.4.1 or later.

Pull the image

Save the HPC-Benchmark container as a local Singularity image file:

$ singularity pull --docker-login hpc-benchmarks:25.04.sif docker://nvcr.io/nvidia/hpc-benchmarks:25.04

This command saves the container in the current directory as hpc-benchmarks:25.04.sif.

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks with GPU support

Examples of NVIDIA HPL run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.

To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpl.sh --dat /my-dat-files/HPL.dat

To run NVIDIA HPL on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:

CONT='/path/to/hpc-benchmarks:25.04.sif'

srun -N 16 --ntasks-per-node=4 singularity run --nv \
     "${CONT}" \
     ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

CONT='/path/to/hpc-benchmarks:25.04.sif'

srun -N 8 --ntasks-per-node=8 singularity run --nv \
     "${CONT}" \
     ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

Examples of NVIDIA HPL-MxP run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.

To run NVIDIA HPL-MxP on a single node with 8 GPUs:

CONT='/path/to/hpc-benchmarks:25.04.sif'

srun -N 1 --ntasks-per-node=8 singularity run --nv \
     "${CONT}" \
     ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7

To run NVIDIA HPL-MxP on 4 nodes, each node with 4 GPUs:

CONT='/path/to/hpc-benchmarks:25.04.sif'

srun -N 4 --ntasks-per-node=4 singularity run --nv \
     "${CONT}" \
     ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3

Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.

Examples of NVIDIA HPCG run

Several sample Slurm scripts and sample input files are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64.

To run NVIDIA HPCG on a single node with one GPU using your custom hpcg.dat file on x86:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg.sh --dat /my-dat-files/hpcg.dat

To run NVIDIA HPCG on 16 nodes with 4 GPUs (or 8 nodes with 8 GPUs) using script parameters on x86:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 16 --ntasks-per-node=4 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2

To run NVIDIA HPCG on a single node with 4 GPUs using your custom hpcg.dat file on aarch64:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg-aarch64.sh --dat /my-dat-files/hpcg.dat

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks for NVIDIA Grace CPU

Examples of NVIDIA HPL run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.

To run NVIDIA HPL on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 2 --ntasks-per-node=2 singularity run \
     -B "${MOUNT}" "${CONT}" \
     ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPL-MxP run

To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:

CONT='/path/to/hpc-benchmarks:25.04.sif'

srun -N 1 --ntasks-per-node=16 singularity run \
     "${CONT}" \
     ./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
     --cpu-affinity 0-71:72-143:144-215:216-287 \
     --mem-affinity 0:1:2:3

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPCG run

Sample input files are available in the container at /workspace/hpcg-linux-aarch64

To run NVIDIA HPCG on two nodes of NVIDIA Grace CPU using your custom hpcg.dat file:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 2 --ntasks-per-node=4 singularity run \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1

To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:

CONT='/path/to/hpc-benchmarks:25.04.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

#GPU+Grace (Heterogeneous execution)
srun -N 2 --ntasks-per-node=8 singularity run \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
     --exm 2 --ddm 2 --lpm 1 --g2c 64 \
     --npx 4 --npy 4 --npz 1 \
     --cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
     --mem-affinity 0:0:1:1:2:2:3:3

Running with Docker

The examples below are for single-node runs with Docker. Docker is not recommended for multi-node runs.

Pull the image

Download the HPC-Benchmarks container as a local Docker image:

$ docker pull nvcr.io/nvidia/hpc-benchmarks:25.04

NOTE: Adding the --privileged flag to the docker run command prevents the "set_mempolicy" error.
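
As a hedged illustration, the flag is added directly to docker run; this mirrors the HPL example below with --privileged inserted:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"

# Illustration only: --privileged avoids the set_mempolicy error when NUMA
# memory binding is requested inside the container
docker run --privileged --gpus all --shm-size=1g -v ${MOUNT} \
     ${CONT} \
     mpirun --bind-to none -np 4 \
     ./hpl.sh --dat /my-dat-files/HPL.dat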

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks with GPU support

Examples of NVIDIA HPL run

To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"

docker run --gpus all --shm-size=1g -v ${MOUNT} \
     ${CONT} \
     mpirun --bind-to none -np 4 \
     ./hpl.sh --dat /my-dat-files/HPL.dat

Examples of NVIDIA HPL-MxP run

To run NVIDIA HPL-MxP on a single node with 8 GPUs:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'

docker run --gpus all --shm-size=1g \
     ${CONT} \
     mpirun --bind-to none -np 8 \
     ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7

Examples of NVIDIA HPCG run

To run NVIDIA HPCG on a single node with one GPU using your custom hpcg.dat file on x86:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"

docker run --gpus all --shm-size=1g -v ${MOUNT} \
     ${CONT} \
     mpirun --bind-to none -np 8 \
     ./hpcg.sh --dat /my-dat-files/hpcg.dat

NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM Benchmarks for NVIDIA Grace CPU

Examples of NVIDIA HPL run

Several sample Docker run scripts are available in the container at /workspace/hpl-linux-aarch64.

To run NVIDIA HPL on a single NVIDIA Grace CPU node using your custom HPL.dat file:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

docker run -v ${MOUNT} \
     "${CONT}" \
     mpirun --bind-to none -np 2 \
     ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPL-MxP run

Several sample Docker run scripts are available in the container at /workspace/hpl-mxp-linux-aarch64.

To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'

docker run \
     "${CONT}" \
     mpirun --bind-to none -np 4 \
     ./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
     --cpu-affinity 0-71:72-143:144-215:216-287 --mem-affinity 0:1:2:3

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

Examples of NVIDIA HPCG run

Several sample Docker run scripts are available in the container at /workspace/hpcg-linux-aarch64.

To run NVIDIA HPCG on a single node of NVIDIA Grace CPU using your custom parameters file:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

docker run -v ${MOUNT} \
     "${CONT}" \
     mpirun --bind-to none -np 4 \
     ./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1

To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:

CONT='nvcr.io/nvidia/hpc-benchmarks:25.04'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

#GPU+Grace (Heterogeneous execution)
docker run -v ${MOUNT} \
     "${CONT}" \
     mpirun --bind-to none -np 16 \
     ./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
     --exm 2 --ddm 2 --lpm 1 --g2c 64 \
     --npx 4 --npy 4 --npz 1 \
     --cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
     --mem-affinity 0:0:1:1:2:2:3:3

Known issues

  • HPCX 2.21 is known to have a long startup time on Blackwell. Enabling the compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.
  • If NVSHMEM is used in the HPL Benchmark and is initialized using a unique ID (UID), the benchmark may hang during a multi-node run. To work around this issue, initialize NVSHMEM using MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).

Resources

  • World's Fastest Supercomputer Triples Performance Record | NVIDIA Blog
  • Netlib HPL benchmark
  • HPL Mixed-Precision Benchmark
  • HPCG Benchmark
  • Developer Forums
  • NGC Catalog
  • NVIDIA HPCG
  • Accelerating the HPCG Benchmark with NVIDIA Math Sparse Libraries

References

[1] Hardware Trends Impacting Floating-Point Computations In Scientific Applications

[2] Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit

Support

  • For questions or to provide feedback, please contact HPCBenchmarks@nvidia.com