The performance measurements in this document were conducted at the time of publication and may not reflect the performance achievable with NVIDIA's latest software release. For the most up-to-date performance measurements, see NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Training benchmark for EfficientNet-B0 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 16GB.
To benchmark training performance with other parameters, run:
bash ./scripts/B0/training/{AMP, FP32, TF32}/train_benchmark_8x{A100-80G, V100-16G}.sh
Training benchmark for EfficientNet-B4 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.
To benchmark training performance with other parameters, run:
bash ./scripts/B4/training/{AMP, FP32, TF32}/train_benchmark_8x{A100-80G, V100-16G}.sh
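The braces in the commands above denote a choice of one precision and one platform. As a minimal sketch, the pattern expands to a single concrete script path; the `PRECISION` and `PLATFORM` values below are example picks from the documented options, not the only valid ones:

```shell
# Expand the brace pattern from the commands above into one concrete
# script path. The chosen values are examples from the documented options.
PRECISION=AMP          # one of: AMP, FP32, TF32
PLATFORM=A100-80G      # one of: A100-80G, V100-16G
SCRIPT="./scripts/B0/training/${PRECISION}/train_benchmark_8x${PLATFORM}.sh"
echo "$SCRIPT"
# then run it: bash "$SCRIPT"
```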
Inference benchmark for EfficientNet-B0 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 16GB.
Inference benchmark for EfficientNet-B4 was run on NVIDIA DGX A100 80GB and NVIDIA DGX-1 V100 32GB.
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the training scripts in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
GPUs | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|
8 | 77.38 | 77.43 | 19 | 10.5 | 1.8 |
16 | 77.46 | 77.62 | 10 | 5.5 | 1.81 |
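The speedup column is simply the ratio of the two time-to-train figures in the same row. A quick sketch of that arithmetic, using the 8-GPU row above (small differences from the published column come from rounding of the underlying measurements):

```python
# Time-to-train speedup = TF32 time / mixed-precision time.
# Values taken from the 8-GPU row of the table above.
time_tf32 = 19.0
time_amp = 10.5

speedup = time_tf32 / time_amp
print(round(speedup, 2))  # ~1.81, reported as 1.8 in the table
```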
Our results were obtained by running the training scripts in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
GPUs | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
8 | 77.54 | 77.51 | 48 | 44 | 1.09 |
32 | 77.38 | 77.62 | 11.48 | 11.44 | 1.003 |
Our results were obtained by running the training scripts in the tensorflow:21.02-tf2-py3 NGC container on multi-node NVIDIA DGX A100 (8x A100 80GB) GPUs.
GPUs | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|
32 | 82.69 | 82.69 | 38 | 17.5 | 2.17 |
64 | 82.75 | 82.78 | 18 | 8.5 | 2.11 |
Our results were obtained by running the training scripts in the tensorflow:21.02-tf2-py3 NGC container on multi-node NVIDIA DGX-1 (8x V100 32GB) GPUs.
GPUs | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|
32 | 82.78 | 82.78 | 95 | 39.5 | 2.40 |
64 | 82.74 | 82.74 | 53 | 19 | 2.78 |
Our results were obtained by running the training benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in images per second) were averaged over 5 entire training epochs.
GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|
1 | 1206 | 2549 | 2.11 | 1 | 1 |
8 | 9365 | 16336 | 1.74 | 7.76 | 6.41 |
16 | 18361 | 33000 | 1.79 | 15.223 | 12.95 |
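The derived columns in the throughput tables follow directly from the raw numbers: speedup divides mixed-precision throughput by TF32 throughput at the same GPU count, and weak scaling divides N-GPU throughput by single-GPU throughput at the same precision. A sketch using the 1- and 8-GPU rows above (small deviations from the published columns come from rounding of the underlying measurements):

```python
# Throughput figures (images/s) from the 1- and 8-GPU rows above.
tf32_1, tf32_8 = 1206, 9365
amp_1, amp_8 = 2549, 16336

# Speedup: mixed precision vs. TF32 at the same scale.
speedup_8 = amp_8 / tf32_8
# Weak scaling: 8-GPU throughput relative to 1 GPU, per precision.
scaling_tf32 = tf32_8 / tf32_1
scaling_amp = amp_8 / amp_1

print(round(speedup_8, 2), round(scaling_tf32, 2), round(scaling_amp, 2))
```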
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the training benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|
1 | 629 | 712 | 1.13 | 1 | 1 |
8 | 4012 | 4065 | 1.01 | 6.38 | 5.71 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the training benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in images per second) were averaged over 5 entire training epochs.
GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
---|---|---|---|---|---|
1 | 167 | 394 | 2.34 | 1 | 1 |
8 | 1280 | 2984 | 2.33 | 7.66 | 7.57 |
32 | 5023 | 11034 | 2.19 | 30.07 | 28.01 |
64 | 9838 | 21844 | 2.22 | 58.91 | 55.44 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the training benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
---|---|---|---|---|---|
1 | 89 | 193 | 2.16 | 1 | 1 |
8 | 643 | 1298 | 2.00 | 7.28 | 6.73 |
32 | 2095 | 4892 | 2.33 | 23.54 | 25.35 |
64 | 4109 | 9666 | 2.35 | 46.17 | 50.08 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the inference benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
FP16 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 224x224 | 111 | 8.97 | 8.88 | 8.92 | 8.96 |
2 | 224x224 | 233 | 8.56 | 8.44 | 8.5 | 8.54 |
4 | 224x224 | 432 | 9.24 | 9.12 | 9.16 | 9.2 |
8 | 224x224 | 771 | 10.32 | 10.16 | 10.24 | 10.24 |
1024 | 224x224 | 10269 | 102.4 | 102.4 | 102.4 | 102.4 |
TF32 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 224x224 | 101 | 9.87 | 9.78 | 9.82 | 9.86 |
2 | 224x224 | 204 | 9.78 | 9.66 | 9.7 | 9.76 |
4 | 224x224 | 381 | 10.48 | 10.36 | 10.4 | 10.44 |
8 | 224x224 | 584 | 13.68 | 13.52 | 13.6 | 13.68 |
512 | 224x224 | 5480 | 92.16 | 92.16 | 92.16 | 92.16 |
To achieve these same results, follow the steps in the Quick Start Guide.
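For a single-stream benchmark like this, average throughput and average latency are two views of the same measurement: throughput is approximately batch size divided by average latency. A sketch checking that relationship against the batch-1 FP16 row above (small deviations are expected from measurement noise and rounding):

```python
# Implied throughput from average latency: throughput ~ batch / latency.
batch_size = 1
latency_avg_ms = 8.97  # batch-1 FP16 row of the table above

implied_throughput = batch_size / (latency_avg_ms / 1000.0)
print(round(implied_throughput))  # close to the reported 111 img/s
```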
Our results were obtained by running the inference-script-name.sh inference benchmark script in the TensorFlow NGC container on NVIDIA DGX-1 (1x V100 16GB) GPU.
FP16 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 224x224 | 98.8 | 10.12 | 10.03 | 10.06 | 10.10 |
2 | 224x224 | 199.3 | 10.02 | 9.9 | 9.94 | 10.0 |
4 | 224x224 | 382.5 | 10.44 | 10.28 | 10.36 | 10.4 |
8 | 224x224 | 681.2 | 11.68 | 11.52 | 11.6 | 11.68 |
256 | 224x224 | 5271 | 48.64 | 46.08 | 46.08 | 48.64 |
FP32 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 224x224 | 68.39 | 14.62 | 14.45 | 14.51 | 14.56 |
2 | 224x224 | 125.62 | 15.92 | 15.78 | 15.82 | 15.82 |
4 | 224x224 | 216.41 | 18.48 | 18.24 | 18.4 | 18.44 |
8 | 224x224 | 401.60 | 19.92 | 19.6 | 19.76 | 19.84 |
128 | 224x224 | 2713 | 47.36 | 46.08 | 46.08 | 47.36 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the inference benchmark script in the tensorflow:21.02-tf2-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
FP16 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 380x380 | 57.54 | 17.37 | 17.24 | 17.30 | 17.35 |
2 | 380x380 | 112.06 | 17.84 | 17.7 | 17.76 | 17.82 |
4 | 380x380 | 219.71 | 18.2 | 18.08 | 18.12 | 18.16 |
8 | 380x380 | 383.39 | 20.8 | 20.64 | 20.72 | 20.8 |
128 | 380x380 | 1470 | 87.04 | 85.76 | 85.76 | 87.04 |
TF32 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 380x380 | 52.68 | 18.98 | 18.86 | 18.91 | 18.96 |
2 | 380x380 | 95.32 | 20.98 | 20.84 | 20.9 | 20.96 |
4 | 380x380 | 182.14 | 21.96 | 21.84 | 21.88 | 21.92 |
8 | 380x380 | 325.72 | 24.56 | 24.4 | 24.4 | 24.48 |
64 | 380x380 | 694 | 91.52 | 90.88 | 91.52 | 91.52 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the inference-script-name.sh inference benchmark script in the TensorFlow NGC container on NVIDIA DGX-1 (1x V100 16GB) GPU.
FP16 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 380x380 | 54.27 | 18.35 | 18.20 | 18.25 | 18.32 |
2 | 380x380 | 104.27 | 19.18 | 19.02 | 19.08 | 19.16 |
4 | 380x380 | 182.61 | 21.88 | 21.64 | 21.72 | 21.84 |
8 | 380x380 | 234.06 | 34.16 | 33.92 | 34.0 | 34.08 |
64 | 380x380 | 782.47 | 81.92 | 80.0 | 80.64 | 81.28 |
FP32 Inference Latency
Batch size | Resolution | Throughput Avg | Latency Avg (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) |
---|---|---|---|---|---|---|
1 | 380x380 | 30.48 | 32.80 | 32.86 | 31.83 | 32.60 |
2 | 380x380 | 58.59 | 34.12 | 31.92 | 33.02 | 33.9 |
4 | 380x380 | 111.35 | 35.92 | 35.0 | 35.12 | 35.68 |
8 | 380x380 | 199.00 | 40.24 | 38.72 | 39.04 | 40.0 |
32 | 380x380 | 307.04 | 104.0 | 104.0 | 104.0 | 104.0 |
To achieve these same results, follow the steps in the Quick Start Guide.