The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, follow the instructions
in the Quick Start Guide. You can add the `--max_steps 1000` option to get a reliable
throughput measurement without running the entire training. If you have not downloaded
the dataset yet, you can run on synthetic data instead with the `--dataset_type synthetic` option.
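For illustration, a shortened throughput run on synthetic data could look roughly like the following. This is a sketch only: the base invocation is copied from the inference command further below, and only `--max_steps` and `--dataset_type synthetic` come from the instructions above; the authoritative training command is in the Quick Start Guide.

```bash
# Sketch: base invocation borrowed from the inference example below.
# --max_steps 1000 shortens the run; --dataset_type synthetic avoids needing the
# downloaded dataset (--dataset_path is omitted on the assumption it is not
# required when synthetic data is used).
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe \
    numactl --interleave=all -- python -u main.py \
    --amp --dataset_type synthetic --max_steps 1000
```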
To benchmark the inference performance on a specific batch size, run:
```bash
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe \
    numactl --interleave=all -- python -u main.py \
    --dataset_path /data/dlrm/ --amp \
    --restore_checkpoint_path <checkpoint_path> --mode inference
```
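The `--amp` flag above enables mixed precision. As a minimal sketch (an assumption, not a command from the guide), the full-precision (TF32/FP32) numbers in the inference tables below can presumably be reproduced by running the same command without `--amp`:

```bash
# Same invocation as above, with --amp omitted to run in TF32/FP32 (sketch).
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe \
    numactl --interleave=all -- python -u main.py \
    --dataset_path /data/dlrm/ \
    --restore_checkpoint_path <checkpoint_path> --mode inference
```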
The following sections provide details on how we achieved our performance and accuracy in training and inference.
We used three model size variants to show memory scalability in a multi-GPU setup:
Name | Dataset | Number of parameters | Model size |
---|---|---|---|
small | Criteo 1TB, FL=15 | 4.2B | 15.6 GiB |
large | Criteo 1TB, FL=3 | 22.8B | 84.9 GiB |
extra large | Criteo 1TB, FL=0 | 113B | 421 GiB |
Our TF32 and mixed-precision results were obtained by running the training scripts as described in the Quick Start Guide in the DLRM Docker container.
GPUs | Model size | Batch size / GPU | Accuracy (AUC) - TF32 | Accuracy (AUC) - mixed precision | Time to train - TF32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|---|
1 | small | 64k | 0.8025 | 0.8025 | 26.75 | 16.27 | 1.64 |
8 | large | 8k | 0.8027 | 0.8026 | 8.77 | 6.57 | 1.33 |
8 | extra large | 8k | 0.8026 | 0.8026 | 10.47 | 9.08 | 1.15 |
Our FP32 and mixed-precision results on up to eight GPUs were obtained by running the training scripts as described in the Quick Start Guide in the DLRM Docker container.
GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
1 | small | 64k | 0.8027 | 0.8025 | 109.63 | 34.83 | 3.15 |
8 | large | 8k | 0.8028 | 0.8026 | 26.01 | 13.73 | 1.89 |
Our FP32 and mixed-precision results on up to 16 GPUs were obtained by running the training scripts as described in the Quick Start Guide in the DLRM Docker container.
GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
1 | small | 64k | 0.8026 | 0.8026 | 105.13 | 33.37 | 3.15 |
8 | large | 8k | 0.8027 | 0.8027 | 21.21 | 11.43 | 1.86 |
16 | large | 4k | 0.8025 | 0.8026 | 15.52 | 10.88 | 1.43 |
The histograms below show the distribution of ROC AUC results achieved at the end of training for each precision and hardware platform tested. There are no statistically significant differences between precisions, numbers of GPUs, or hardware platforms. Using the larger dataset has a modest, positive impact on the final AUC score.
Figure 4. Results of stability tests for DLRM.
We used throughput in items processed per second as the performance metric.
Our results were obtained by following the commands from the Quick Start Guide in the DLRM Docker container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over 1000 training steps.
GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) |
---|---|---|---|---|---|
1 | small | 64k | 2.68M | 4.47M | 1.67 |
8 | large | 8k | 9.39M | 13.31M | 1.42 |
8 | extra large | 8k | 9.93M | 12.1M | 1.22 |
To achieve these results, follow the steps in the Quick Start Guide.
For the "extra large" model (113B parameters) we also obtained CPU results for comparison using the same source code
(using the --cpu
command line flag for the CPU-only experiments).
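A CPU-only run might then look roughly like the following (a sketch: only the `--cpu`, `--dataset_path`, `--restore_checkpoint_path`, and `--mode` flags appear in this document; the exact configuration for the extra large model is described in the Quick Start Guide).

```bash
# Sketch of a CPU-only inference run; the horovodrun/numactl wrapper used for
# the GPU runs is dropped here on the assumption it is not needed without GPUs.
python -u main.py --cpu --dataset_path /data/dlrm/ \
    --restore_checkpoint_path <checkpoint_path> --mode inference
```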
We compare three hardware setups:
Hardware | Throughput [samples / second] | Speedup over CPU |
---|---|---|
2xAMD EPYC 7742 | 17.7k | 1x |
1xA100-80GB + 2xAMD EPYC 7742 (large embeddings on CPU) | 768k | 43x |
DGX A100 (8xA100-80GB) (hybrid parallel) | 12.1M | 683x |
The following table shows FP32 and mixed-precision training throughput for the small and large models.
GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
1 | small | 64k | 0.648M | 2.06M | 3.18 |
8 | large | 8k | 2.9M | 5.89M | 2.03 |
To achieve the same results, follow the steps in the Quick Start Guide.
The next table shows FP32 and mixed-precision training throughput scaling up to 16 GPUs.
GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
1 | small | 64k | 0.675M | 2.16M | 3.2 |
8 | large | 8k | 3.75M | 7.72M | 2.06 |
16 | large | 4k | 5.74M | 9.39M | 1.64 |
To achieve the same results, follow the steps in the Quick Start Guide.
The tables below show inference performance: throughput and average latency for the small model at a batch size of 2048.
GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Average latency - TF32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to TF32) |
---|---|---|---|---|---|---|---|
1 | small | 2048 | 1.43M | 1.54M | 1.48 | 1.33 | 1.08 |
GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
---|---|---|---|---|---|---|---|
1 | small | 2048 | 0.765M | 1.05M | 2.90 | 1.95 | 1.37 |
GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
---|---|---|---|---|---|---|---|
1 | small | 2048 | 1.03M | 1.37M | 2.10 | 1.63 | 1.33 |