The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks that measure model performance in training and inference modes.
To run training benchmarking on a selected number of GPUs with either AMP or TF32/FP32 precision, run the following script:
`python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]`
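For example, `python scripts/benchmark_training.py --gpus 8 --batch_size 4 --amp` benchmarks training on eight GPUs with a per-GPU batch size of 4 and mixed precision enabled; omit `--amp` to benchmark TF32/FP32.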
To run inference benchmarking on a single GPU with either AMP or TF32/FP32 precision, run the following script:
`python scripts/benchmark_inference.py --batch_size {2,4,8} [--amp]`
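For example, `python scripts/benchmark_inference.py --batch_size 8 --amp` benchmarks inference with a batch size of 8 and mixed precision enabled; omit `--amp` to benchmark TF32/FP32.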
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the `python scripts/train.py --gpus 8 --batch_size 4 [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
GPUs | Batch size / GPU | Precision | Final AP BBox | Final AP Segm | Time to train [h] | Time to train speedup |
---|---|---|---|---|---|---|
8 | 2 | TF32 | 0.3796 | 0.3444 | 4.81 | - |
8 | 2 | AMP | 0.3795 | 0.3443 | 3.77 | 1.27 |
Our results were obtained by running the `python scripts/train.py --gpus 8 --batch_size 2 [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
GPUs | Batch size / GPU | Precision | Final AP BBox | Final AP Segm | Time to train [h] | Time to train speedup |
---|---|---|---|---|---|---|
8 | 2 | FP32 | 0.3793 | 0.3442 | 11.37 | - |
8 | 2 | AMP | 0.3792 | 0.3444 | 9.01 | 1.26 |
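In both tables, the time-to-train speedup is the ratio of the TF32/FP32 time to train to the mixed-precision time to train, for example 4.81 / 3.77 ≈ 1.27 on DGX A100 and 11.37 / 9.01 ≈ 1.26 on DGX-1.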
Learning curves
The following image shows the training loss as a function of iteration for training using DGX A100 (TF32 and TF-AMP) and DGX-1 V100 (FP32 and TF-AMP).
Our results were obtained by running the `python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.
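As a rough illustration of this methodology (a minimal sketch with synthetic step times, not the actual benchmark script), the reported throughput can be derived from per-step wall-clock times like this:

```python
import numpy as np

# Hypothetical per-step wall-clock times in seconds (synthetic values for illustration).
rng = np.random.default_rng(0)
step_times = rng.normal(loc=0.28, scale=0.01, size=300)  # 300 recorded training steps

WARMUP_STEPS = 100      # warm-up steps discarded before averaging
MEASURE_STEPS = 200     # steps that contribute to the reported average
GLOBAL_BATCH = 8 * 4    # GPUs x per-GPU batch size, e.g. 8 GPUs with batch size 4

measured = step_times[WARMUP_STEPS:WARMUP_STEPS + MEASURE_STEPS]
throughput = GLOBAL_BATCH / measured.mean()  # images per second
print(f"average throughput: {throughput:.2f} img/s")
```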
GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (mixed precision / TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 2 | 13.44 | 18.26 | 1.35 | - | - |
1 | 4 | 18.41 | 28.58 | 1.55 | - | - |
8 | 2 | 84.29 | 87.31 | 1.03 | 6.27 | 4.78 |
8 | 4 | 103.80 | 114.45 | 1.10 | 5.63 | 4.04 |
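Assuming the derived columns are ratios of the raw throughputs (throughput speedup = mixed precision / TF32 for the same configuration; weak scaling = 8-GPU throughput / 1-GPU throughput at the same per-GPU batch size), they can be reproduced up to rounding with a short check. The values below are taken from the DGX A100 table above; the same relationships apply to the DGX-1 table that follows.

```python
# Raw throughputs from the DGX A100 table above, in img/s, keyed by (GPUs, batch size per GPU).
tf32 = {(1, 2): 13.44, (1, 4): 18.41, (8, 2): 84.29, (8, 4): 103.80}
amp = {(1, 2): 18.26, (1, 4): 28.58, (8, 2): 87.31, (8, 4): 114.45}

for batch in (2, 4):
    for gpus in (1, 8):
        speedup = amp[(gpus, batch)] / tf32[(gpus, batch)]  # mixed precision vs. TF32
        print(f"{gpus} GPU(s), batch {batch}: throughput speedup {speedup:.2f}")
    # Weak scaling: 8-GPU throughput relative to 1-GPU throughput at the same batch size.
    print(f"batch {batch}: weak scaling TF32 {tf32[(8, batch)] / tf32[(1, batch)]:.2f}, "
          f"mixed precision {amp[(8, batch)] / amp[(1, batch)]:.2f}")
```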
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark_training.py --gpus {1,8} --batch_size {2,4} [--amp]` training script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over 200 steps, omitting the first 100 warm-up steps.
GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (mixed precision / FP32) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|---|
1 | 2 | 7.57 | 14.47 | 1.91 | - | - |
1 | 4 | 8.51 | 19.35 | 2.27 | - | - |
8 | 2 | 44.55 | 53.40 | 1.37 | 5.26 | 3.69 |
8 | 4 | 50.56 | 58.33 | 1.15 | 6.67 | 4.03 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp]` benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
TF32
Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
---|---|---|---|---|---|
6 | 39.23 | 0.1530 | 0.1540 | 0.1542 | 0.1546 |
12 | 42.55 | 0.2654 | 0.2840 | 0.2875 | 0.2945 |
24 | 47.92 | 0.5007 | 0.5248 | 0.5294 | 0.5384 |
FP16
Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
---|---|---|---|---|---|
6 | 60.79 | 0.0987 | 0.0988 | 0.1000 | 0.1005 |
12 | 76.23 | 0.1574 | 0.1614 | 0.1621 | 0.1636 |
24 | 80.67 | 0.2975 | 0.3025 | 0.3035 | 0.3054 |
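The latency columns report per-batch inference latency statistics in seconds. A minimal sketch of how an average and the 90th/95th/99th percentiles might be computed from recorded per-batch times (synthetic values here; the actual benchmark script may differ):

```python
import numpy as np

# Hypothetical per-batch inference latencies in seconds (synthetic values for illustration).
rng = np.random.default_rng(0)
latencies = rng.normal(loc=0.155, scale=0.002, size=200)

print(f"Latency Avg [s]: {latencies.mean():.4f}")
for p in (90, 95, 99):
    print(f"Latency {p}% [s]: {np.percentile(latencies, p):.4f}")
```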
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark_inference.py --batch_size {6,12,24} [--amp]` benchmarking script in the TensorFlow 2.x 21.02-py3 NGC container on NVIDIA DGX-1 with (1x V100 16GB) GPU.
FP32
Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
---|---|---|---|---|---|
6 | 18.56 | 0.3234 | 0.3263 | 0.3269 | 0.3280 |
12 | 20.50 | 0.5854 | 0.5920 | 0.5933 | 0.5958 |
24 | OOM | - | - | - | - |
FP16
Batch size | Throughput Avg [img/s] | Latency Avg [s] | Latency 90% [s] | Latency 95% [s] | Latency 99% [s] |
---|---|---|---|---|---|
6 | 35.46 | 0.1692 | 0.1705 | 0.1707 | 0.1712 |
12 | 41.44 | 0.2896 | 0.2937 | 0.2945 | 0.2960 |
24 | 42.53 | 0.5643 | 0.5718 | 0.5733 | 0.5761 |
To achieve these same results, follow the steps in the Quick Start Guide.