The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, please follow the instructions in the Quick Start Guide. You can also add the `--max_steps 1000 --benchmark_warmup_steps 500` flags if you want to get a reliable throughput measurement without running the entire training.

If you haven't yet downloaded the dataset, you can create a synthetic one by running `python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /tmp/dlrm_synthetic_data`.
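The `--benchmark_warmup_steps` flag keeps one-time startup costs (data loader spin-up, CUDA context creation, kernel autotuning) out of the measurement. Below is a minimal sketch of that measurement pattern, assuming hypothetical `train_step` and `data_loader` objects; it is not the repository's actual benchmarking code:

```python
import time

def measure_throughput(train_step, data_loader, batch_size,
                       max_steps=1000, warmup_steps=500):
    """Average training throughput in items/s, excluding the warmup steps."""
    start = None
    measured_steps = 0
    for step, batch in enumerate(data_loader):
        if step == warmup_steps:
            # On a GPU you would call torch.cuda.synchronize() here first,
            # so queued warmup work does not leak into the timed window.
            start = time.perf_counter()
        train_step(batch)
        if step >= warmup_steps:
            measured_steps += 1
        if step + 1 >= max_steps:
            break
    elapsed = time.perf_counter() - start
    return measured_steps * batch_size / elapsed
```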
To benchmark the inference performance on a specific batch size, run `python -m dlrm.scripts.main --mode inference_benchmark --dataset /data`.

If you haven't yet downloaded the dataset, you can create a synthetic one by running `python -m dlrm.scripts.prepare_synthetic_dataset --synthetic_dataset_dir /tmp/dlrm_synthetic_data`.
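If you measure per-batch inference latency on a GPU yourself, keep in mind that CUDA kernels launch asynchronously, so the device has to be synchronized before the timer is read. A minimal sketch of that pattern (the `model` and `batch` objects below are placeholders, not this repository's API):

```python
import time
import torch

@torch.no_grad()
def time_batch(model, batch, warmup=10, iters=100):
    """Average per-batch inference latency in seconds."""
    for _ in range(warmup):           # warm up kernels and caches
        model(batch)
    torch.cuda.synchronize()          # wait for all queued warmup work
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()          # make sure the timed work finished
    return (time.perf_counter() - start) / iters
```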
The following sections provide details on how we achieved our performance and accuracy in training and inference.
We used three model size variants to show memory scalability in a multi-GPU setup:
Model variant | Frequency threshold | Model size |
---|---|---|
small | 15 | 15 GB |
large | 3 | 82 GB |
xlarge | 2 | 142 GB |
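The frequency threshold is applied during dataset preprocessing: categorical values that occur fewer times than the threshold are collapsed into a single "rare" index, so a lower threshold keeps more distinct embedding rows and yields a larger model. A conceptual sketch of this mapping (a hypothetical helper, not the repository's preprocessing code):

```python
from collections import Counter

def build_category_mapping(values, frequency_threshold):
    """Map frequent categorical values to unique ids; rare values share id 0."""
    counts = Counter(values)
    frequent = [v for v, c in counts.items() if c >= frequency_threshold]
    # id 0 is reserved for every value that falls below the threshold
    return {v: i + 1 for i, v in enumerate(frequent)}

mapping = build_category_mapping(["a", "a", "b", "c", "c", "c"], frequency_threshold=2)
print([mapping.get(v, 0) for v in ["a", "b", "c", "d"]])  # [1, 0, 2, 0]
```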
Our results were obtained by running the `dlrm/scripts/main.py` script as described in the Quick Start Guide in the DLRM Docker container using NVIDIA A100 80GB GPUs.
GPUs | Model size | Batch size / GPU | Accuracy (AUC) - TF32 | Accuracy (AUC) - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|---|
8 | large | 8k | 0.802509 | 0.802528 | 0:06:27 | 0:04:36 | 1.40217 |
1 | small | 64k | 0.802537 | 0.802521 | 0:24:26 | 0:17:47 | 1.37395 |
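The speedup column is simply the ratio of the two training times; for example, for the 8-GPU large configuration above:

```python
def speedup(baseline_hms, amp_hms):
    """Ratio of two 'H:MM:SS' training times (baseline / mixed precision)."""
    to_seconds = lambda t: sum(int(part) * 60 ** i
                               for i, part in enumerate(reversed(t.split(":"))))
    return to_seconds(baseline_hms) / to_seconds(amp_hms)

print(round(speedup("0:06:27", "0:04:36"), 5))  # 1.40217
```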
Our results were obtained by running the `dlrm/scripts/main.py` script as described in the Quick Start Guide in the DLRM Docker container using NVIDIA V100 32GB GPUs.
GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
8 | large | 8k | 0.802568 | 0.802562 | 0:28:24 | 0:11:45 | 2.41702 |
1 | small | 64k | 0.802784 | 0.802723 | 1:58:10 | 0:38:17 | 3.08663 |
Models trained with FP32, TF32, and Automatic Mixed Precision (AMP) achieve similar accuracy.
The plot shows the ROC AUC metric as a function of training steps (one step is a single batch) for the default precision (FP32 for the Volta architecture (DGX-1) and TF32 for the Ampere architecture (DGX A100)) and for AMP, for all three datasets. All other training parameters are left at their defaults.
Figure 1. Training stability for the FL3 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.
Figure 2. Training stability for the FL15 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.
Training of the model is stable across multiple configurations, achieving a standard deviation of 10e-4. The model achieves similar ROC AUC scores across A100 and V100 GPUs and across training precisions. It was trained for one epoch (roughly 4 billion samples, 64014 batches), starting from 10 different initial random seeds for each setup. The training was performed in the pytorch:21.10-py3 NGC container with and without mixed precision enabled. The provided charts and numbers cover both single- and multi-GPU training. After training, the models were evaluated on the test set. The following plots compare the distributions of ROC AUC on the test set.
Figure 3. Training stability for the FL3 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.
Figure 4. Training stability for the FL15 dataset: distribution of ROC AUC across different configurations. 'All configurations' refers to the distribution of ROC AUC over the Cartesian product of architecture and training precision.
The training accuracy, measured as ROC AUC on the test set after the final epoch, was not impacted by enabling mixed precision. The obtained results were statistically similar. The similarity was measured according to the following procedure:
The model was trained 10 times with the default settings (FP32 for the Volta architecture and TF32 for the Ampere architecture) and 10 times with AMP. After the last epoch, the ROC AUC accuracy score was calculated on the test set.
Distributions for two hardware configurations (A100, V100) for 2 datasets are presented below.
Figure 5. Impact of AMP on the ROC AUC distribution for A100 and V100 GPUs for single- and multi-GPU training on a dataset with a frequency threshold of 3.
Figure 6. Impact of AMP on the ROC AUC distribution for A100 and V100 GPUs for single- and multi-GPU training on a dataset with a frequency threshold of 15.
The distributions of ROC AUC for single-precision training (TF32 for A100, FP32 for Volta) and AMP training were compared in terms of mean, variance, and the Kolmogorov–Smirnov test to assess the statistical difference between the single-precision and AMP results. Refer to the expandable table below.
Hardware | Dataset | GPUs | mean AUC ROC for full precision | std AUC ROC for full precision | mean AUC ROC for AMP | std AUC ROC for AMP | KS test value: statistic, p-value |
---|---|---|---|---|---|---|---|
DGX A100 | FL3 | 8 | 0.802681 | 0.000073 | 0.802646 | 0.000063 | ('0.400', '0.418') |
DGX-2 | FL3 | 16 | 0.802614 | 0.000073 | 0.802623 | 0.000122 | ('0.267', '0.787') |
The sample size was set to 10 experiments for each training setup.
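A sketch of how such a comparison can be reproduced from two sets of per-seed ROC AUC scores (the score lists below are placeholders, not the measured values):

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder AUC scores from 10 runs each; substitute your own measurements.
auc_full_precision = np.array([0.80268, 0.80261, 0.80259, 0.80272, 0.80265,
                               0.80270, 0.80263, 0.80267, 0.80260, 0.80266])
auc_amp = np.array([0.80264, 0.80259, 0.80266, 0.80262, 0.80268,
                    0.80261, 0.80265, 0.80258, 0.80267, 0.80263])

print("full precision: mean", auc_full_precision.mean(), "std", auc_full_precision.std())
print("AMP:            mean", auc_amp.mean(), "std", auc_amp.std())

# A large p-value means the test finds no significant difference
# between the two distributions.
statistic, p_value = ks_2samp(auc_full_precision, auc_amp)
print("KS statistic:", statistic, "p-value:", p_value)
```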
We used throughput in items processed per second as the performance metric.
Our results were obtained by running the following commands in the DLRM Docker container on NVIDIA DGX A100 (8x A100 80GB) GPUs:
- for single-GPU: `python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs`
- for multi-GPU: `python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 bash -c './bind.sh --cpu=dgxa100_ccx.sh --mem=dgxa100_ccx.sh python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs'`

Performance numbers (in records of data per second) were averaged over an entire training epoch.
GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) |
---|---|---|---|---|---|
8 | large | 8k | 11,400,000 | 16,500,000 | 1.447 |
1 | small | 64k | 2,880,000 | 4,020,000 | 1.396 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the following commands in the DLRM Docker container on NVIDIA DGX-1 with 8x V100 32GB GPUs:
- for single-GPU: `python -m dlrm.scripts.main --mode train --dataset /data --amp --cuda_graphs`
- for multi-GPU: `python -m torch.distributed.launch --no_python --use_env --nproc_per_node 8 bash -c './bind.sh --cpu=exclusive -- python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs'`

Performance numbers (in records of data per second) were averaged over an entire training epoch.
GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
8 | large | 8k | 2,880,000 | 6,920,000 | 2.403 |
1 | small | 64k | 672,000 | 2,090,000 | 3.110 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the following commands in the DLRM Docker container on NVIDIA DGX-2 with 16x V100 32GB GPUs:
- for single-GPU: `python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs`
- for multi-GPU: `python -m torch.distributed.launch --no_python --use_env --nproc_per_node [8/16] bash -c './bind.sh --cpu=exclusive -- python -m dlrm.scripts.main --dataset /data --amp --cuda_graphs'`

Performance numbers (in records of data per second) were averaged over an entire training epoch.

GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
16 | large | 4k | 4,740,000 | 10,800,000 | 2.278 |
8 | large | 8k | 3,330,000 | 7,930,000 | 2.381 |
1 | small | 64k | 717,000 | 2,250,000 | 3.138 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `--inference_benchmark` mode in the DLRM Docker container on the NVIDIA A100 (1x A100 80GB) GPU.
Batch size | Throughput Avg (items/s) - mixed precision, CUDA Graphs ON | Latency Avg (s) - mixed precision, CUDA Graphs ON | Throughput Avg (items/s) - mixed precision, CUDA Graphs OFF | Latency Avg (s) - mixed precision, CUDA Graphs OFF | Throughput Avg (items/s) - TF32, CUDA Graphs ON | Latency Avg (s) - TF32, CUDA Graphs ON | Throughput Avg (items/s) - TF32, CUDA Graphs OFF | Latency Avg (s) - TF32, CUDA Graphs OFF |
---|---|---|---|---|---|---|---|---|
32768 | 14,796,024 | 0.00221 | 14,369,047 | 0.00228 | 8,832,225 | 0.00371 | 8,637,000 | 0.00379 |
16384 | 14,217,340 | 0.00115 | 13,673,623 | 0.00120 | 8,540,191 | 0.00192 | 8,386,694 | 0.00195 |
8192 | 12,769,583 | 0.00064 | 11,336,204 | 0.00072 | 7,658,459 | 0.00107 | 7,463,740 | 0.00110 |
4096 | 10,556,140 | 0.00039 | 8,203,285 | 0.00050 | 6,777,965 | 0.00060 | 6,142,076 | 0.00067 |
2048 | 8,415,889 | 0.00024 | 4,785,479 | 0.00043 | 5,214,990 | 0.00039 | 4,365,954 | 0.00047 |
1024 | 5,045,754 | 0.00020 | 2,357,953 | 0.00043 | 3,854,504 | 0.00027 | 2,615,601 | 0.00039 |
512 | 3,168,261 | 0.00016 | 1,190,989 | 0.00043 | 2,441,310 | 0.00021 | 1,332,944 | 0.00038 |
256 | 1,711,749 | 0.00015 | 542,310 | 0.00047 | 1,365,320 | 0.00019 | 592,034 | 0.00043 |
128 | 889,777 | 0.00014 | 274,223 | 0.00047 | 790,984 | 0.00016 | 300,908 | 0.00043 |
64 | 459,728 | 0.00014 | 136,180 | 0.00047 | 416,463 | 0.00015 | 150,382 | 0.00043 |
32 | 222,386 | 0.00014 | 70,107 | 0.00046 | 174,163 | 0.00018 | 75,768 | 0.00042 |
16 | 117,386 | 0.00014 | 34,983 | 0.00046 | 108,992 | 0.00015 | 38,369 | 0.00042 |
8 | 59,200 | 0.00014 | 18,852 | 0.00042 | 55,661 | 0.00014 | 19,440 | 0.00041 |
4 | 29,609 | 0.00014 | 8,505 | 0.00047 | 27,957 | 0.00014 | 10,206 | 0.00039 |
2 | 14,066 | 0.00014 | 4,610 | 0.00043 | 13,010 | 0.00015 | 5,229 | 0.00038 |
To achieve these same results, follow the steps in the Quick Start Guide.
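Note that the reported average latency is approximately the batch size divided by the average throughput, which indicates the latency is measured per batch. A quick check against two rows of the table above:

```python
# (batch size, throughput avg in items/s, latency avg in s) for mixed precision, CUDA Graphs ON
rows = [(32768, 14_796_024, 0.00221), (2048, 8_415_889, 0.00024)]
for batch_size, throughput, latency in rows:
    print(f"{batch_size / throughput:.5f} s ~ reported {latency} s")
```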
Our results were obtained by running the `--inference_benchmark` mode in the DLRM Docker container on NVIDIA DGX-1 with 1x V100 32GB GPU.
Batch size | Throughput Avg (items/s) - mixed precision, CUDA Graphs ON | Latency Avg (s) - mixed precision, CUDA Graphs ON | Throughput Avg (items/s) - mixed precision, CUDA Graphs OFF | Latency Avg (s) - mixed precision, CUDA Graphs OFF | Throughput Avg (items/s) - FP32, CUDA Graphs ON | Latency Avg (s) - FP32, CUDA Graphs ON | Throughput Avg (items/s) - FP32, CUDA Graphs OFF | Latency Avg (s) - FP32, CUDA Graphs OFF |
---|---|---|---|---|---|---|---|---|
32768 | 6,716,240 | 0.00488 | 6,792,739 | 0.00482 | 1,809,345 | 0.01811 | 1,802,851 | 0.01818 |
16384 | 6,543,544 | 0.00250 | 6,520,519 | 0.00251 | 1,754,713 | 0.00934 | 1,745,214 | 0.00939 |
8192 | 6,215,194 | 0.00132 | 6,074,446 | 0.00135 | 1,669,188 | 0.00491 | 1,656,393 | 0.00495 |
4096 | 5,230,443 | 0.00078 | 4,901,451 | 0.00084 | 1,586,666 | 0.00258 | 1,574,068 | 0.00260 |
2048 | 4,261,124 | 0.00048 | 3,523,239 | 0.00058 | 1,462,006 | 0.00140 | 1,416,985 | 0.00145 |
1024 | 3,306,724 | 0.00031 | 2,047,274 | 0.00050 | 1,277,860 | 0.00080 | 1,161,032 | 0.00088 |
512 | 2,049,382 | 0.00025 | 1,005,919 | 0.00051 | 1,016,186 | 0.00050 | 841,732 | 0.00061 |
256 | 1,149,997 | 0.00022 | 511,102 | 0.00050 | 726,349 | 0.00035 | 485,162 | 0.00053 |
128 | 663,048 | 0.00019 | 264,015 | 0.00048 | 493,878 | 0.00026 | 238,936 | 0.00054 |
64 | 359,505 | 0.00018 | 132,913 | 0.00048 | 295,273 | 0.00022 | 124,120 | 0.00052 |
32 | 175,465 | 0.00018 | 64,287 | 0.00050 | 157,629 | 0.00020 | 63,919 | 0.00050 |
16 | 99,207 | 0.00016 | 31,062 | 0.00052 | 83,019 | 0.00019 | 34,660 | 0.00046 |
8 | 52,532 | 0.00015 | 16,492 | 0.00049 | 43,289 | 0.00018 | 17,893 | 0.00045 |
4 | 27,626 | 0.00014 | 8,391 | 0.00048 | 22,692 | 0.00018 | 8,923 | 0.00045 |
2 | 13,791 | 0.00015 | 4,146 | 0.00048 | 11,747 | 0.00017 | 4,487 | 0.00045 |
To achieve these same results, follow the steps in the Quick Start Guide.