The performance measurements in this document were conducted at the time of publication and may not reflect the performance achievable with NVIDIA's latest software release. For the most up-to-date performance measurements, see NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks that measure model performance in training mode.
We provide 8 scripts for benchmarking training performance (an example launch inside the NGC container is shown after the list):
```bash
bash scripts/DGXA100_benchmark_training_tf32_1gpu.sh
bash scripts/DGXA100_benchmark_training_amp_1gpu.sh
bash scripts/DGXA100_benchmark_training_tf32_8gpu.sh
bash scripts/DGXA100_benchmark_training_amp_8gpu.sh
bash scripts/DGX1_benchmark_training_fp32_1gpu.sh
bash scripts/DGX1_benchmark_training_amp_1gpu.sh
bash scripts/DGX1_benchmark_training_fp32_8gpu.sh
bash scripts/DGX1_benchmark_training_amp_8gpu.sh
```
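As a concrete illustration, one of these benchmarks could be launched inside the NGC container as sketched below. The container tag is taken from the stability-test setup later in this section; the mount points `/workspace/wd` and `/data`, and the dataset path, are illustrative assumptions rather than locations mandated by the repository:

```bash
# Launch the TensorFlow NGC container (20.10-tf1-py3 is the tag used for the
# stability tests below); the mount points are illustrative assumptions.
docker run --rm -it --gpus all \
  -v "$(pwd)":/workspace/wd \
  -v /path/to/preprocessed/dataset:/data \
  nvcr.io/nvidia/tensorflow:20.10-tf1-py3

# Inside the container, from the repository root:
bash scripts/DGXA100_benchmark_training_amp_1gpu.sh
```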
The following sections provide details on how we achieved our performance and accuracy in training.
Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs; an invocation sketch follows the results table.
GPUs | Batch size / GPU | Accuracy - TF32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - TF32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 131,072 | 0.67683 | 0.67632 | 341 | 359 | - |
8 | 16,384 | 0.67709 | 0.67721 | 93 | 107 | - |
To achieve the same results, follow the steps in the Quick Start Guide.
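For orientation, a single-GPU mixed-precision run of the training script might look like the sketch below. The flag names are hypothetical placeholders, not the confirmed CLI of `trainer/task.py`; check the script's `--help` output for the actual arguments.

```bash
# Hypothetical sketch only: --global_batch_size and --amp are placeholder flag
# names, not necessarily trainer/task.py's real CLI. The batch size mirrors the
# 1-GPU row in the table above.
python -m trainer.task --global_batch_size 131072 --amp
```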
Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
GPUs | Batch size / GPU | Accuracy - FP32 (MAP@12) | Accuracy - mixed precision (MAP@12) | Time to train - FP32 (minutes) | Time to train - mixed precision (minutes) | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|
1 | 131,072 | 0.67648 | 0.67744 | 654 | 440 | 1.49 |
8 | 16,384 | 0.67692 | 0.67725 | 190 | 185 | 1.03 |
To achieve the same results, follow the steps in the Quick Start Guide.
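The speedup column is simply the ratio of the two time-to-train figures. A quick check against the 1-GPU DGX-1 row:

```python
# Time-to-train speedup = FP32 minutes / mixed-precision minutes (DGX-1, 1 GPU row).
fp32_minutes = 654
amp_minutes = 440
print(f"speedup: {fp32_minutes / amp_minutes:.2f}")  # -> 1.49, matching the table
```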
Models trained with FP32, TF32, and Automatic Mixed Precision (AMP) achieve similar accuracy.
The Wide and Deep model was trained for 54,713 training steps, starting from 6 different initial random seeds for each setup. The training was performed in the 20.10-tf1-py3 NGC container on NVIDIA DGX A100 40GB and DGX-1 16GB machines with and without mixed precision enabled. After training, the models were evaluated on the validation set. The following table summarizes the final MAP@12 score on the validation set.
|   | Average MAP@12 | Standard deviation | Minimum | Maximum |
---|---|---|---|---|
DGX A100 TF32 | 0.67709 | 0.00094 | 0.67463 | 0.67813 |
DGX A100 mixed precision | 0.67721 | 0.00048 | 0.67643 | 0.67783 |
DGX-1 FP32 | 0.67692 | 0.00060 | 0.67587 | 0.67791 |
DGX-1 mixed precision | 0.67725 | 0.00064 | 0.67561 | 0.67803 |
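The per-row statistics above are plain summary statistics over the six per-seed validation scores. A minimal sketch of the computation; the scores list is a placeholder to fill in with the MAP@12 values from your own six runs:

```python
import statistics

# Placeholder: replace with the six per-seed MAP@12 scores from your own runs.
map12_scores = [0.0] * 6

print(f"mean   : {statistics.mean(map12_scores):.5f}")
print(f"stddev : {statistics.stdev(map12_scores):.5f}")  # sample standard deviation
print(f"min    : {min(map12_scores):.5f}")
print(f"max    : {max(map12_scores):.5f}")
```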
Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Improving multi-GPU scaling of the model is under development.
GPUs | Batch size / GPU | Throughput - TF32 (samples/s) | Throughput - mixed precision (samples/s) | Strong scaling - TF32 | Strong scaling - mixed precision |
---|---|---|---|---|---|
1 | 131,072 | 349,879 | 332,529 | 1.00 | 1.00 |
8 | 16,384 | 1,283,457 | 1,111,976 | 3.67 | 3.34 |
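Strong scaling here is the usual ratio of multi-GPU to single-GPU throughput at a fixed global batch size (8 × 16,384 = 1 × 131,072 samples). Checking the TF32 numbers from the table above:

```python
# Strong scaling = throughput on 8 GPUs / throughput on 1 GPU (TF32, DGX A100).
one_gpu = 349_879
eight_gpus = 1_283_457
print(f"strong scaling: {eight_gpus / one_gpu:.2f}")  # -> 3.67, matching the table
```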
Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Improving multi-GPU scaling of the model is under development.
GPUs | Batch size / GPU | Throughput - FP32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 131,072 | 182,510 | 271,366 | 1.49 | 1.00 | 1.00 |
8 | 16,384 | 626,301 | 643,334 | 1.03 | 3.43 | 2.37 |