The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, run:
For 1 GPU
FP32 / TF32
python ./main.py --mode=training_benchmark --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
AMP
python ./main.py --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
For multiple GPUs
FP32 / TF32
mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
AMP
mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
Each of these scripts runs 200 warm-up iterations and measures the first epoch.
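As a concrete illustration, an 8-GPU mixed precision run might look like the following (a sketch only; the data and results paths are placeholders, not fixed values from this repository):

```bash
# Hypothetical 8-GPU AMP training benchmark run.
# --allow-run-as-root is only needed when running as root inside the container.
mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python ./main.py \
    --mode=training_benchmark \
    --amp \
    --batch_size 256 \
    --data_dir=/data/tfrecords \
    --results_dir=/results/rn50_train_benchmark_8gpu
```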
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter`, and `--iter_unit` flags. Features like XLA or DALI can be controlled with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the value of `--warmup_steps`.
The suggested batch size per single V100 16 GB GPU is 256 for mixed precision training and 128 for single precision training.
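For example, a single-GPU mixed precision run that sets the warm-up and benchmark length explicitly and enables XLA might look like this (a sketch assuming `--xla` is a simple enable switch; paths and iteration counts are placeholders):

```bash
# Hypothetical single-GPU AMP training benchmark with explicit warm-up,
# benchmark length, and XLA enabled; num_iter > warmup_steps, as required.
python ./main.py \
  --mode=training_benchmark \
  --amp \
  --xla \
  --batch_size 256 \
  --warmup_steps 200 \
  --num_iter 500 \
  --iter_unit batch \
  --data_dir=/data/tfrecords \
  --results_dir=/results/rn50_train_benchmark_1gpu
```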
If the `--data_dir=<path to imagenet>` flag is not specified, the benchmark uses a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
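If ImageNet is not available, a synthetic-data run could look like the following (a sketch; 224 is assumed here as the standard ResNet-50 input resolution):

```bash
# Hypothetical training benchmark on synthetic 224x224 images; no dataset needed.
python ./main.py \
  --mode=training_benchmark \
  --amp \
  --batch_size 256 \
  --synthetic_data_size 224 \
  --results_dir=/results/rn50_train_benchmark_synthetic
```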
To benchmark the inference performance on a specific batch size, run:
FP32 / TF32
python ./main.py --mode=inference_benchmark --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
AMP
python ./main.py --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>
By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter`, and `--iter_unit` flags. If the `--data_dir=<path to imagenet>` flag is not specified, the benchmark uses a synthetic dataset.
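To collect a latency/throughput profile across batch sizes, as in the tables below, the command can be wrapped in a simple loop (a sketch; paths are placeholders):

```bash
# Hypothetical sweep over batch sizes for the AMP inference benchmark.
# Each batch size writes to its own results directory.
for bs in 1 2 4 8 16 32 64 128 256; do
  python ./main.py \
    --mode=inference_benchmark \
    --amp \
    --warmup_steps 20 \
    --num_iter 100 \
    --iter_unit batch \
    --batch_size "${bs}" \
    --data_dir=/data/tfrecords \
    --results_dir="/results/rn50_infer_bs${bs}"
done
```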
The benchmark can be automated with the `inference_benchmark.sh` script provided in `resnet50v1.5`, by simply running:
bash ./resnet50v1.5/inference_benchmark.sh <data dir> <data idx dir>
The `<data dir>` parameter refers to the input data directory (by default `/data/tfrecords` inside the container).
By default, the benchmark tests the following configurations: FP32, AMP, AMP + XLA with different batch sizes.
When the optional directory with the DALI index files `<data idx dir>` is specified, the benchmark also runs an additional DALI + AMP + XLA configuration. For proper throughput reporting, the value of `--num_iter` must be greater than the value of `--warmup_steps`.
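For example (assuming the default container mount point for the data; the DALI index directory shown here is an assumption, adjust it to wherever your index files live):

```bash
# Runs the FP32, AMP, and AMP + XLA configurations:
bash ./resnet50v1.5/inference_benchmark.sh /data/tfrecords

# Additionally runs the DALI + AMP + XLA configuration:
bash ./resnet50v1.5/inference_benchmark.sh /data/tfrecords /data/dali_idx
```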
To benchmark the raw model performance, a synthetic dataset can be used. To do so, specify the `--synthetic_data_size` flag (the input image resolution) instead of `--data_dir`.
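A raw-model inference benchmark on synthetic data might then look like this (a sketch; 224 is assumed as the input resolution, and the paths are placeholders):

```bash
# Hypothetical inference benchmark on synthetic 224x224 images; no dataset needed.
python ./main.py \
  --mode=inference_benchmark \
  --amp \
  --batch_size 128 \
  --warmup_steps 20 \
  --num_iter 100 \
  --iter_unit batch \
  --synthetic_data_size 224 \
  --results_dir=/results/rn50_infer_synthetic
```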
The following sections provide details on how we achieved our performance and accuracy in training and inference.
Our results were obtained by running the `/resnet50v1.5/training/DGXA100_RN50_{PRECISION}_90E.sh` training script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
Epochs | Batch Size / GPU | Accuracy - TF32 (top1) | Accuracy - mixed precision (top1) |
---|---|---|---|
90 | 256 | 77.01 | 76.93 |
Our results were obtained by running the `/resnet50v1.5/training/DGX1_RN50_{PRECISION}_{EPOCHS}E.sh` training script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
Epochs | Batch Size / GPU | Accuracy - FP32 | Accuracy - mixed precision |
---|---|---|---|
90 | 128 (FP32) / 256 (AMP) | 77.01 | 76.99 |
250 | 128 (FP32) / 256 (AMP) | 78.34 | 78.35 |
Example training loss plot
Our results were obtained by running the `resnet50v1.5/training/training_perf.sh` benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.
GPUs | Batch Size / GPU | Throughput - TF32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 + XLA | Weak scaling - mixed precision + XLA |
---|---|---|---|---|---|---|
1 | 256 | 909 img/s | 2375 img/s | 2.60x | 1.00x | 1.00x |
8 | 256 | 7000 img/s | 17400 img/s | 2.48x | 7.70x | 7.32x |
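Read as ratios of the reported throughputs (an interpretation consistent with the table, up to rounding), the derived columns follow directly from the raw numbers: the speedup column divides the mixed precision throughput by the TF32 throughput (2375 / 909 ≈ 2.6x on a single GPU), and the weak scaling columns divide the 8-GPU throughput by the corresponding single-GPU throughput (7000 / 909 ≈ 7.7x for TF32, 17400 / 2375 ≈ 7.3x for mixed precision).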
Our results were obtained by running the `resnet50v1.5/training/training_perf.sh` benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.
GPUs | Batch Size / GPU | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
---|---|---|---|---|---|---|
1 | 128 (FP32) / 256 (AMP) | 412 img/s | 1270 img/s | 3.08x | 1.00x | 1.00x |
8 | 128 (FP32) / 256 (AMP) | 3170 img/s | 9510 img/s | 3.00x | 7.69x | 7.48x |
Our results were obtained by running the `resnet50v1.5/training/training_perf.sh` benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-2 (16x V100 32GB) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.
GPUs | Batch Size / GPU | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
---|---|---|---|---|---|---|
1 | 128 (FP32) / 256 (AMP) | 432 img/s | 1300 img/s | 3.01x | 1.00x | 1.00x |
16 | 128 (FP32) / 256 (AMP) | 6500 img/s | 17250 img/s | 2.65x | 15.05x | 13.27x |
Our results were estimated based on the training performance results on NVIDIA DGX A100 (8x A100 40GB) GPUs.
GPUs | Time to train - mixed precision + XLA | Time to train - TF32 + XLA |
---|---|---|
1 | ~18h | ~40h |
8 | ~2h | ~5h |
Our results were estimated based on the training performance results on NVIDIA DGX-1 (8x V100 16GB) GPUs.
GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
---|---|---|
1 | ~25h | ~77h |
8 | ~3.5h | ~10h |
Our results were estimated based on the training performance results on NVIDIA DGX-2 (16x V100 32GB) GPUs.
GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
---|---|---|
1 | ~25h | ~74h |
16 | ~2h | ~5h |
Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
TF32 Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 191.23 img/s | 5.26 ms | 5.29 ms | 5.31 ms | 5.42 ms |
2 | 376.83 img/s | 5.34 ms | 5.36 ms | 5.39 ms | 5.56 ms |
4 | 601.12 img/s | 6.65 ms | 6.80 ms | 6.93 ms | 7.05 ms |
8 | 963.86 img/s | 8.31 ms | 8.63 ms | 8.80 ms | 9.17 ms |
16 | 1361.58 img/s | 11.82 ms | 12.04 ms | 12.15 ms | 12.44 ms |
32 | 1602.09 img/s | 19.99 ms | 20.48 ms | 20.74 ms | 21.36 ms |
64 | 1793.81 img/s | 35.82 ms | 37.22 ms | 37.43 ms | 37.84 ms |
128 | 1876.22 img/s | 68.23 ms | 69.60 ms | 70.08 ms | 70.70 ms |
256 | 1911.96 img/s | 133.90 ms | 135.16 ms | 135.59 ms | 136.49 ms |
TF32 Inference Latency + XLA
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 158.67 img/s | 6.34 ms | 6.39 ms | 6.46 ms | 7.16 ms |
2 | 321.83 img/s | 6.24 ms | 6.29 ms | 6.34 ms | 6.39 ms |
4 | 574.28 img/s | 7.01 ms | 7.03 ms | 7.06 ms | 7.14 ms |
8 | 1021.20 img/s | 7.84 ms | 8.00 ms | 8.08 ms | 8.28 ms |
16 | 1515.79 img/s | 10.56 ms | 10.88 ms | 10.98 ms | 11.22 ms |
32 | 1945.44 img/s | 16.46 ms | 16.78 ms | 16.96 ms | 17.49 ms |
64 | 2313.13 img/s | 27.81 ms | 28.68 ms | 29.10 ms | 30.33 ms |
128 | 2449.88 img/s | 52.27 ms | 54.00 ms | 54.43 ms | 56.85 ms |
256 | 2548.87 img/s | 100.45 ms | 102.34 ms | 103.04 ms | 104.81 ms |
Mixed Precision Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 223.35 img/s | 4.51 ms | 4.50 ms | 4.52 ms | 4.76 ms |
2 | 435.51 img/s | 4.63 ms | 4.62 ms | 4.64 ms | 4.76 ms |
4 | 882.00 img/s | 4.63 ms | 4.60 ms | 4.71 ms | 5.36 ms |
8 | 1503.24 img/s | 5.40 ms | 5.50 ms | 5.59 ms | 5.78 ms |
16 | 1903.58 img/s | 8.47 ms | 8.67 ms | 8.77 ms | 9.14 ms |
32 | 1974.01 img/s | 16.23 ms | 16.65 ms | 16.96 ms | 17.98 ms |
64 | 3570.46 img/s | 18.14 ms | 18.26 ms | 18.43 ms | 19.35 ms |
128 | 3474.94 img/s | 37.86 ms | 44.09 ms | 55.30 ms | 66.90 ms |
256 | 3229.32 img/s | 81.02 ms | 96.21 ms | 105.67 ms | 126.31 ms |
Mixed Precision Inference Latency + XLA
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 174.68 img/s | 5.76 ms | 5.81 ms | 5.95 ms | 6.13 ms |
2 | 323.90 img/s | 6.21 ms | 6.26 ms | 6.31 ms | 6.64 ms |
4 | 639.75 img/s | 6.25 ms | 6.45 ms | 6.55 ms | 6.79 ms |
8 | 1215.50 img/s | 6.59 ms | 6.94 ms | 7.03 ms | 7.25 ms |
16 | 2219.96 img/s | 7.29 ms | 7.45 ms | 7.57 ms | 8.09 ms |
32 | 2363.70 img/s | 13.70 ms | 13.91 ms | 14.08 ms | 14.64 ms |
64 | 3940.95 img/s | 18.76 ms | 26.58 ms | 35.41 ms | 59.06 ms |
128 | 3274.01 img/s | 41.70 ms | 52.19 ms | 61.14 ms | 78.68 ms |
256 | 3676.14 img/s | 71.67 ms | 82.36 ms | 88.53 ms | 108.18 ms |
Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (1x V100 16GB) GPU.
FP32 Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 173.35 img/s | 5.79 ms | 5.90 ms | 5.95 ms | 6.04 ms |
2 | 303.65 img/s | 6.61 ms | 6.80 ms | 6.87 ms | 7.01 ms |
4 | 562.35 img/s | 7.12 ms | 7.32 ms | 7.42 ms | 7.69 ms |
8 | 783.24 img/s | 10.22 ms | 10.37 ms | 10.44 ms | 10.60 ms |
16 | 1003.10 img/s | 15.99 ms | 16.07 ms | 16.12 ms | 16.29 ms |
32 | 1140.12 img/s | 28.19 ms | 28.27 ms | 28.38 ms | 28.54 ms |
64 | 1252.06 img/s | 51.12 ms | 51.82 ms | 52.75 ms | 53.45 ms |
128 | 1324.91 img/s | 96.61 ms | 97.02 ms | 97.25 ms | 99.08 ms |
256 | 1348.52 img/s | 189.85 ms | 191.16 ms | 191.77 ms | 192.47 ms |
Mixed Precision Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 237.35 img/s | 4.25 ms | 4.39 ms | 4.54 ms | 5.30 ms |
2 | 464.94 img/s | 4.32 ms | 4.63 ms | 4.83 ms | 5.52 ms |
4 | 942.44 img/s | 4.26 ms | 4.55 ms | 4.74 ms | 5.45 ms |
8 | 1454.93 img/s | 5.57 ms | 5.73 ms | 5.91 ms | 6.51 ms |
16 | 2003.75 img/s | 8.13 ms | 8.19 ms | 8.29 ms | 8.50 ms |
32 | 2356.17 img/s | 13.69 ms | 13.82 ms | 13.92 ms | 14.26 ms |
64 | 2706.11 img/s | 23.86 ms | 23.82 ms | 23.89 ms | 24.10 ms |
128 | 2770.61 img/s | 47.04 ms | 49.36 ms | 62.43 ms | 90.05 ms |
256 | 2742.14 img/s | 94.67 ms | 108.02 ms | 119.34 ms | 145.55 ms |
Mixed Precision Inference Latency + XLA
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 162.95 img/s | 6.16 ms | 6.28 ms | 6.34 ms | 6.50 ms |
2 | 335.63 img/s | 5.96 ms | 6.10 ms | 6.14 ms | 6.25 ms |
4 | 637.72 img/s | 6.30 ms | 6.53 ms | 7.17 ms | 8.10 ms |
8 | 1153.92 img/s | 7.03 ms | 7.97 ms | 8.22 ms | 9.00 ms |
16 | 1906.52 img/s | 8.64 ms | 9.51 ms | 9.88 ms | 10.47 ms |
32 | 2492.78 img/s | 12.84 ms | 13.06 ms | 13.13 ms | 13.24 ms |
64 | 2910.05 img/s | 22.66 ms | 21.82 ms | 24.71 ms | 48.61 ms |
128 | 2964.31 img/s | 45.25 ms | 59.30 ms | 71.42 ms | 98.72 ms |
256 | 2898.12 img/s | 90.53 ms | 106.12 ms | 118.12 ms | 150.78 ms |
Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-2 (1x V100 32GB) GPU.
FP32 Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 187.41 img/s | 5.374 ms | 5.61 ms | 5.70 ms | 6.33 ms |
2 | 339.52 img/s | 5.901 ms | 6.16 ms | 6.29 ms | 6.53 ms |
4 | 577.50 img/s | 6.940 ms | 7.07 ms | 7.24 ms | 7.99 ms |
8 | 821.15 img/s | 9.751 ms | 9.99 ms | 10.15 ms | 10.80 ms |
16 | 1055.64 img/s | 15.209 ms | 15.26 ms | 15.30 ms | 16.14 ms |
32 | 1195.74 img/s | 26.772 ms | 26.93 ms | 26.98 ms | 27.80 ms |
64 | 1313.83 img/s | 48.796 ms | 48.99 ms | 49.72 ms | 51.83 ms |
128 | 1372.58 img/s | 93.262 ms | 93.90 ms | 94.97 ms | 96.57 ms |
256 | 1414.99 img/s | 180.923 ms | 181.65 ms | 181.92 ms | 183.37 ms |
Mixed Precision Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 289.89 img/s | 3.50 ms | 3.81 ms | 3.90 ms | 4.19 ms |
2 | 606.27 img/s | 3.38 ms | 3.56 ms | 3.76 ms | 4.25 ms |
4 | 982.92 img/s | 4.09 ms | 4.42 ms | 4.53 ms | 4.81 ms |
8 | 1553.34 img/s | 5.22 ms | 5.31 ms | 5.50 ms | 6.74 ms |
16 | 2091.27 img/s | 7.82 ms | 7.77 ms | 7.82 ms | 8.77 ms |
32 | 2457.61 img/s | 13.14 ms | 13.15 ms | 13.21 ms | 13.37 ms |
64 | 2746.11 img/s | 23.31 ms | 23.50 ms | 23.56 ms | 24.31 ms |
128 | 2937.20 img/s | 43.58 ms | 43.76 ms | 43.82 ms | 44.37 ms |
256 | 3009.83 img/s | 85.06 ms | 86.23 ms | 87.37 ms | 88.67 ms |
Mixed Precision Inference Latency + XLA
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 240.66 img/s | 4.22 ms | 4.59 ms | 4.69 ms | 4.84 ms |
2 | 428.60 img/s | 4.70 ms | 5.11 ms | 5.44 ms | 6.01 ms |
4 | 945.38 img/s | 4.26 ms | 4.35 ms | 4.42 ms | 4.74 ms |
8 | 1518.66 img/s | 5.33 ms | 5.50 ms | 5.63 ms | 5.88 ms |
16 | 2091.66 img/s | 7.83 ms | 7.74 ms | 7.79 ms | 8.88 ms |
32 | 2604.17 img/s | 12.40 ms | 12.45 ms | 12.51 ms | 12.61 ms |
64 | 3101.15 img/s | 20.64 ms | 20.93 ms | 21.00 ms | 21.17 ms |
128 | 3408.72 img/s | 37.55 ms | 37.93 ms | 38.05 ms | 38.53 ms |
256 | 3633.85 img/s | 70.85 ms | 70.93 ms | 71.12 ms | 71.45 ms |
Our results were obtained by running the `inference_benchmark.sh` inference benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA T4 (1x T4 16GB) GPU.
FP32 Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 136.44 img/s | 7.34 ms | 7.43 ms | 7.47 ms | 7.54 ms |
2 | 215.38 img/s | 9.29 ms | 9.42 ms | 9.46 ms | 9.59 ms |
4 | 289.29 img/s | 13.83 ms | 14.08 ms | 14.16 ms | 14.40 ms |
8 | 341.77 img/s | 23.41 ms | 23.79 ms | 23.86 ms | 24.11 ms |
16 | 394.36 img/s | 40.58 ms | 40.87 ms | 40.98 ms | 41.41 ms |
32 | 414.66 img/s | 77.18 ms | 78.05 ms | 78.29 ms | 78.67 ms |
64 | 424.42 img/s | 150.82 ms | 152.99 ms | 153.44 ms | 154.34 ms |
128 | 429.83 img/s | 297.82 ms | 301.09 ms | 301.60 ms | 302.51 ms |
256 | 425.72 img/s | 601.37 ms | 605.74 ms | 606.47 ms | 608.74 ms |
Mixed Precision Inference Latency
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 211.04 img/s | 4.77 ms | 5.05 ms | 5.08 ms | 5.15 ms |
2 | 381.23 img/s | 5.27 ms | 5.40 ms | 5.45 ms | 5.52 ms |
4 | 593.13 img/s | 6.75 ms | 6.89 ms | 6.956 ms | 7.02 ms |
8 | 791.12 img/s | 10.16 ms | 10.35 ms | 10.43 ms | 10.68 ms |
16 | 914.26 img/s | 17.55 ms | 17.80 ms | 17.89 ms | 18.19 ms |
32 | 972.36 img/s | 32.92 ms | 33.33 ms | 33.46 ms | 33.61 ms |
64 | 991.39 img/s | 64.56 ms | 65.62 ms | 65.92 ms | 66.35 ms |
128 | 995.81 img/s | 128.55 ms | 130.03 ms | 130.37 ms | 131.08 ms |
256 | 993.39 img/s | 257.71 ms | 259.26 ms | 259.62 ms | 260.36 ms |
Mixed Precision Inference Latency + XLA
Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
---|---|---|---|---|---|
1 | 167.01 img/s | 6.01 ms | 6.12 ms | 6.14 ms | 6.18 ms |
2 | 333.67 img/s | 6.03 ms | 6.11 ms | 6.15 ms | 6.23 ms |
4 | 605.94 img/s | 6.63 ms | 6.79 ms | 6.86 ms | 7.02 ms |
8 | 802.13 img/s | 9.98 ms | 10.14 ms | 10.22 ms | 10.36 ms |
16 | 986.85 img/s | 16.27 ms | 16.36 ms | 16.42 ms | 16.52 ms |
32 | 1090.38 img/s | 29.35 ms | 29.68 ms | 29.79 ms | 30.07 ms |
64 | 1131.56 img/s | 56.63 ms | 57.22 ms | 57.41 ms | 57.76 ms |
128 | 1167.62 img/s | 109.77 ms | 111.06 ms | 111.27 ms | 111.85 ms |
256 | 1193.74 img/s | 214.46 ms | 216.28 ms | 216.86 ms | 217.80 ms |