
ResNet v1.5 for TensorFlow


Description

With a modified architecture and initialization, this ResNet-50 version achieves ~0.5% higher accuracy than the original: in v1.5, the stride-2 downsampling in the bottleneck blocks is performed by the 3x3 convolution rather than the 1x1 convolution.

Publisher: NVIDIA
Use Case: Classification
Framework: TensorFlow
Latest Version: 20.12.6
Modified: March 2, 2022
Compressed Size: 2.67 MB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance on a specific batch size, run:

  • For 1 GPU

    • FP32 / TF32

      python ./main.py --mode=training_benchmark --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

    • AMP

      python ./main.py --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

  • For multiple GPUs

    • FP32 / TF32

      mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

    • AMP

      mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

Each of these scripts runs 200 warm-up iterations and measures the first epoch.

To control the warm-up and benchmark length, use the --warmup_steps, --num_iter, and --iter_unit flags. Features like XLA or DALI can be controlled with the --xla and --dali flags. For proper throughput reporting, the value of --num_iter must be greater than the value of --warmup_steps. The suggested batch size per single V100 16GB is 256 for mixed-precision training and 128 for single-precision training.

If the --data_dir=<path to imagenet> flag is not specified, the benchmarks use a synthetic dataset. The resolution of the synthetic images can be controlled with the --synthetic_data_size flag; an example combining these options is shown below.
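For example, a single-GPU mixed-precision training benchmark with XLA enabled, running on synthetic data, could be launched as follows. This is a sketch composed from the flags documented above; the results path, the iteration counts, and the 224-pixel synthetic image size are illustrative choices, not repository defaults:

    python ./main.py --mode=training_benchmark --amp --xla --batch_size 256 --warmup_steps 200 --num_iter 500 --iter_unit batch --synthetic_data_size 224 --results_dir=/tmp/results

Note that --num_iter (500) exceeds --warmup_steps (200), as required for proper throughput reporting.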

Inference performance benchmark

To benchmark the inference performance on a specific batch size, run:

  • FP32 / TF32

python ./main.py --mode=inference_benchmark --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

  • AMP

python ./main.py --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>

By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations. To control the warm-up and benchmark length, use the --warmup_steps, --num_iter, and --iter_unit flags. If the --data_dir=<path to imagenet> flag is not specified, the benchmarks use a synthetic dataset.

The benchmark can be automated with the inference_benchmark.sh script provided in resnet50v1.5, by simply running:

    bash ./resnet50v1.5/inference_benchmark.sh <data dir> <data idx dir>

The <data dir> parameter refers to the input data directory (by default /data/tfrecords inside the container). By default, the benchmark tests the following configurations: FP32, AMP, and AMP + XLA, each with different batch sizes. When the optional directory with the DALI index files, <data idx dir>, is specified, the benchmark executes an additional DALI + AMP + XLA configuration. For proper throughput reporting, the value of --num_iter must be greater than the value of --warmup_steps.
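For example, with the default data location inside the container and a hypothetical DALI index directory (both paths are illustrative), the invocation could look like this:

    bash ./resnet50v1.5/inference_benchmark.sh /data/tfrecords /data/tfrecords_idx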

To benchmark the performance of the raw model, a synthetic dataset can be used: pass the --synthetic_data_size flag instead of --data_dir to specify the input image size.
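A minimal sketch of such a raw-model benchmark, assuming 224x224 synthetic images (the image size, batch size, and results path are illustrative, not defaults taken from the repository):

    python ./main.py --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size 128 --synthetic_data_size 224 --results_dir=/tmp/results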

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the /resnet50v1.5/training/DGXA100_RN50_{PRECISION}_90E.sh training script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.

| Epochs | Batch Size / GPU | Accuracy - TF32 (top1) | Accuracy - mixed precision (top1) |
|--------|------------------|------------------------|-----------------------------------|
| 90     | 256              | 77.01                  | 76.93                             |

Training accuracy: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the /resnet50v1.5/training/DGX1_RN50_{PRECISION}_{EPOCHS}E.sh training script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs.

| Epochs | Batch Size / GPU       | Accuracy - FP32 | Accuracy - mixed precision |
|--------|------------------------|-----------------|----------------------------|
| 90     | 128 (FP32) / 256 (AMP) | 77.01           | 76.99                      |
| 250    | 128 (FP32) / 256 (AMP) | 78.34           | 78.35                      |

Example training loss plot

[Figure: training loss plot]

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the resnet50v1.5/training/training_perf.sh benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.

| GPUs | Batch Size / GPU | Throughput - TF32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 + XLA | Weak scaling - mixed precision + XLA |
|------|------------------|-------------------------|------------------------------------|---------------------------------------------|---------------------------|--------------------------------------|
| 1    | 256              | 909 img/s               | 2375 img/s                         | 2.60x                                       | 1.00x                     | 1.00x                                |
| 8    | 256              | 7000 img/s              | 17400 img/s                        | 2.48x                                       | 7.70x                     | 7.32x                                |

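The derived columns in these tables follow directly from the measured throughputs. A quick illustrative check (not a repository script; the tables appear to be computed from unrounded measurements, so the last digit can differ):

    python -c "print(f'{2375/909:.2f}x {17400/7000:.2f}x')"   # AMP-vs-TF32 speedup: 1 GPU ~2.61x, 8 GPUs ~2.49x
    python -c "print(f'{7000/909:.2f}x {17400/2375:.2f}x')"   # 8-GPU weak scaling: TF32 7.70x, AMP ~7.33x
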
Training performance: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the resnet50v1.5/training/training_perf.sh benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.

| GPUs | Batch Size / GPU       | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
|------|------------------------|-------------------------|------------------------------------|---------------------------------------------|---------------------------|--------------------------------------|
| 1    | 128 (FP32) / 256 (AMP) | 412 img/s               | 1270 img/s                         | 3.08x                                       | 1.00x                     | 1.00x                                |
| 8    | 128 (FP32) / 256 (AMP) | 3170 img/s              | 9510 img/s                         | 3.00x                                       | 7.69x                     | 7.48x                                |

Training performance: NVIDIA DGX-2 (16x V100 32G)

Our results were obtained by running the resnet50v1.5/training/training_perf.sh benchmark script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-2 (16x V100 32G) GPUs. Performance numbers (in images per second) were averaged over an entire training epoch.

| GPUs | Batch Size / GPU       | Throughput - FP32 + XLA | Throughput - mixed precision + XLA | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 + XLA | Weak scaling - mixed precision + XLA |
|------|------------------------|-------------------------|------------------------------------|---------------------------------------------|---------------------------|--------------------------------------|
| 1    | 128 (FP32) / 256 (AMP) | 432 img/s               | 1300 img/s                         | 3.01x                                       | 1.00x                     | 1.00x                                |
| 16   | 128 (FP32) / 256 (AMP) | 6500 img/s              | 17250 img/s                        | 2.65x                                       | 15.05x                    | 13.27x                               |

Training Time for 90 Epochs

Training time: NVIDIA DGX A100 (8x A100 40GB)

Our results were estimated based on the training performance results on NVIDIA DGX A100 (8x A100 40GB) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - TF32 + XLA |
|------|---------------------------------------|----------------------------|
| 1    | ~18h                                  | ~40h                       |
| 8    | ~2h                                   | ~5h                        |

Training time: NVIDIA DGX-1 (8x V100 16G)

Our results were estimated based on the training performance results on NVIDIA DGX-1 (8x V100 16G) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
|------|---------------------------------------|----------------------------|
| 1    | ~25h                                  | ~77h                       |
| 8    | ~3.5h                                 | ~10h                       |

Training time: NVIDIA DGX-2 (16x V100 32G)

Our results were estimated based on the training performance results on NVIDIA DGX-2 (16x V100 32G) GPUs.

| GPUs | Time to train - mixed precision + XLA | Time to train - FP32 + XLA |
|------|---------------------------------------|----------------------------|
| 1    | ~25h                                  | ~74h                       |
| 16   | ~2h                                   | ~5h                        |
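These estimates are consistent with time ≈ epochs × dataset size / measured throughput. A quick sketch, assuming the standard ImageNet-1k training set of 1,281,167 images (an assumption; the dataset size is not stated in this document):

    python -c "print(90 * 1281167 / 412 / 3600)"   # single V100, FP32 + XLA at 412 img/s: ~77.7 hours, matching the ~77h above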

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 40GB)

Our results were obtained by running the inference_benchmark.sh benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.

TF32 Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 191.23 img/s   | 5.26 ms     | 5.29 ms     | 5.31 ms     | 5.42 ms     |
| 2          | 376.83 img/s   | 5.34 ms     | 5.36 ms     | 5.39 ms     | 5.56 ms     |
| 4          | 601.12 img/s   | 6.65 ms     | 6.80 ms     | 6.93 ms     | 7.05 ms     |
| 8          | 963.86 img/s   | 8.31 ms     | 8.63 ms     | 8.80 ms     | 9.17 ms     |
| 16         | 1361.58 img/s  | 11.82 ms    | 12.04 ms    | 12.15 ms    | 12.44 ms    |
| 32         | 1602.09 img/s  | 19.99 ms    | 20.48 ms    | 20.74 ms    | 21.36 ms    |
| 64         | 1793.81 img/s  | 35.82 ms    | 37.22 ms    | 37.43 ms    | 37.84 ms    |
| 128        | 1876.22 img/s  | 68.23 ms    | 69.60 ms    | 70.08 ms    | 70.70 ms    |
| 256        | 1911.96 img/s  | 133.90 ms   | 135.16 ms   | 135.59 ms   | 136.49 ms   |
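Throughput and average latency in these tables are consistent with throughput ≈ batch size / avg latency, a handy sanity check when reading them. For example (illustrative only):

    python -c "print(1/0.00526, 256/0.13390)"   # ≈190 and ≈1912 img/s vs. 191.23 and 1911.96 img/s in the table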

TF32 Inference Latency + XLA

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 158.67 img/s   | 6.34 ms     | 6.39 ms     | 6.46 ms     | 7.16 ms     |
| 2          | 321.83 img/s   | 6.24 ms     | 6.29 ms     | 6.34 ms     | 6.39 ms     |
| 4          | 574.28 img/s   | 7.01 ms     | 7.03 ms     | 7.06 ms     | 7.14 ms     |
| 8          | 1021.20 img/s  | 7.84 ms     | 8.00 ms     | 8.08 ms     | 8.28 ms     |
| 16         | 1515.79 img/s  | 10.56 ms    | 10.88 ms    | 10.98 ms    | 11.22 ms    |
| 32         | 1945.44 img/s  | 16.46 ms    | 16.78 ms    | 16.96 ms    | 17.49 ms    |
| 64         | 2313.13 img/s  | 27.81 ms    | 28.68 ms    | 29.10 ms    | 30.33 ms    |
| 128        | 2449.88 img/s  | 52.27 ms    | 54.00 ms    | 54.43 ms    | 56.85 ms    |
| 256        | 2548.87 img/s  | 100.45 ms   | 102.34 ms   | 103.04 ms   | 104.81 ms   |

Mixed Precision Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 223.35 img/s   | 4.51 ms     | 4.50 ms     | 4.52 ms     | 4.76 ms     |
| 2          | 435.51 img/s   | 4.63 ms     | 4.62 ms     | 4.64 ms     | 4.76 ms     |
| 4          | 882.00 img/s   | 4.63 ms     | 4.60 ms     | 4.71 ms     | 5.36 ms     |
| 8          | 1503.24 img/s  | 5.40 ms     | 5.50 ms     | 5.59 ms     | 5.78 ms     |
| 16         | 1903.58 img/s  | 8.47 ms     | 8.67 ms     | 8.77 ms     | 9.14 ms     |
| 32         | 1974.01 img/s  | 16.23 ms    | 16.65 ms    | 16.96 ms    | 17.98 ms    |
| 64         | 3570.46 img/s  | 18.14 ms    | 18.26 ms    | 18.43 ms    | 19.35 ms    |
| 128        | 3474.94 img/s  | 37.86 ms    | 44.09 ms    | 55.30 ms    | 66.90 ms    |
| 256        | 3229.32 img/s  | 81.02 ms    | 96.21 ms    | 105.67 ms   | 126.31 ms   |

Mixed Precision Inference Latency + XLA

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 174.68 img/s   | 5.76 ms     | 5.81 ms     | 5.95 ms     | 6.13 ms     |
| 2          | 323.90 img/s   | 6.21 ms     | 6.26 ms     | 6.31 ms     | 6.64 ms     |
| 4          | 639.75 img/s   | 6.25 ms     | 6.45 ms     | 6.55 ms     | 6.79 ms     |
| 8          | 1215.50 img/s  | 6.59 ms     | 6.94 ms     | 7.03 ms     | 7.25 ms     |
| 16         | 2219.96 img/s  | 7.29 ms     | 7.45 ms     | 7.57 ms     | 8.09 ms     |
| 32         | 2363.70 img/s  | 13.70 ms    | 13.91 ms    | 14.08 ms    | 14.64 ms    |
| 64         | 3940.95 img/s  | 18.76 ms    | 26.58 ms    | 35.41 ms    | 59.06 ms    |
| 128        | 3274.01 img/s  | 41.70 ms    | 52.19 ms    | 61.14 ms    | 78.68 ms    |
| 256        | 3676.14 img/s  | 71.67 ms    | 82.36 ms    | 88.53 ms    | 108.18 ms   |

Inference performance: NVIDIA DGX-1 (1x V100 16G)

Our results were obtained by running the inference_benchmark.sh benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-1 (1x V100 16G) GPU.

FP32 Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 173.35 img/s   | 5.79 ms     | 5.90 ms     | 5.95 ms     | 6.04 ms     |
| 2          | 303.65 img/s   | 6.61 ms     | 6.80 ms     | 6.87 ms     | 7.01 ms     |
| 4          | 562.35 img/s   | 7.12 ms     | 7.32 ms     | 7.42 ms     | 7.69 ms     |
| 8          | 783.24 img/s   | 10.22 ms    | 10.37 ms    | 10.44 ms    | 10.60 ms    |
| 16         | 1003.10 img/s  | 15.99 ms    | 16.07 ms    | 16.12 ms    | 16.29 ms    |
| 32         | 1140.12 img/s  | 28.19 ms    | 28.27 ms    | 28.38 ms    | 28.54 ms    |
| 64         | 1252.06 img/s  | 51.12 ms    | 51.82 ms    | 52.75 ms    | 53.45 ms    |
| 128        | 1324.91 img/s  | 96.61 ms    | 97.02 ms    | 97.25 ms    | 99.08 ms    |
| 256        | 1348.52 img/s  | 189.85 ms   | 191.16 ms   | 191.77 ms   | 192.47 ms   |

Mixed Precision Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 237.35 img/s   | 4.25 ms     | 4.39 ms     | 4.54 ms     | 5.30 ms     |
| 2          | 464.94 img/s   | 4.32 ms     | 4.63 ms     | 4.83 ms     | 5.52 ms     |
| 4          | 942.44 img/s   | 4.26 ms     | 4.55 ms     | 4.74 ms     | 5.45 ms     |
| 8          | 1454.93 img/s  | 5.57 ms     | 5.73 ms     | 5.91 ms     | 6.51 ms     |
| 16         | 2003.75 img/s  | 8.13 ms     | 8.19 ms     | 8.29 ms     | 8.50 ms     |
| 32         | 2356.17 img/s  | 13.69 ms    | 13.82 ms    | 13.92 ms    | 14.26 ms    |
| 64         | 2706.11 img/s  | 23.86 ms    | 23.82 ms    | 23.89 ms    | 24.10 ms    |
| 128        | 2770.61 img/s  | 47.04 ms    | 49.36 ms    | 62.43 ms    | 90.05 ms    |
| 256        | 2742.14 img/s  | 94.67 ms    | 108.02 ms   | 119.34 ms   | 145.55 ms   |

Mixed Precision Inference Latency + XLA

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 162.95 img/s   | 6.16 ms     | 6.28 ms     | 6.34 ms     | 6.50 ms     |
| 2          | 335.63 img/s   | 5.96 ms     | 6.10 ms     | 6.14 ms     | 6.25 ms     |
| 4          | 637.72 img/s   | 6.30 ms     | 6.53 ms     | 7.17 ms     | 8.10 ms     |
| 8          | 1153.92 img/s  | 7.03 ms     | 7.97 ms     | 8.22 ms     | 9.00 ms     |
| 16         | 1906.52 img/s  | 8.64 ms     | 9.51 ms     | 9.88 ms     | 10.47 ms    |
| 32         | 2492.78 img/s  | 12.84 ms    | 13.06 ms    | 13.13 ms    | 13.24 ms    |
| 64         | 2910.05 img/s  | 22.66 ms    | 21.82 ms    | 24.71 ms    | 48.61 ms    |
| 128        | 2964.31 img/s  | 45.25 ms    | 59.30 ms    | 71.42 ms    | 98.72 ms    |
| 256        | 2898.12 img/s  | 90.53 ms    | 106.12 ms   | 118.12 ms   | 150.78 ms   |

Inference performance: NVIDIA DGX-2 (1x V100 32G)

Our results were obtained by running the inference_benchmark.sh benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA DGX-2 (1x V100 32G) GPU.

FP32 Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 187.41 img/s   | 5.374 ms    | 5.61 ms     | 5.70 ms     | 6.33 ms     |
| 2          | 339.52 img/s   | 5.901 ms    | 6.16 ms     | 6.29 ms     | 6.53 ms     |
| 4          | 577.50 img/s   | 6.940 ms    | 7.07 ms     | 7.24 ms     | 7.99 ms     |
| 8          | 821.15 img/s   | 9.751 ms    | 9.99 ms     | 10.15 ms    | 10.80 ms    |
| 16         | 1055.64 img/s  | 15.209 ms   | 15.26 ms    | 15.30 ms    | 16.14 ms    |
| 32         | 1195.74 img/s  | 26.772 ms   | 26.93 ms    | 26.98 ms    | 27.80 ms    |
| 64         | 1313.83 img/s  | 48.796 ms   | 48.99 ms    | 49.72 ms    | 51.83 ms    |
| 128        | 1372.58 img/s  | 93.262 ms   | 93.90 ms    | 94.97 ms    | 96.57 ms    |
| 256        | 1414.99 img/s  | 180.923 ms  | 181.65 ms   | 181.92 ms   | 183.37 ms   |

Mixed Precision Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 289.89 img/s   | 3.50 ms     | 3.81 ms     | 3.90 ms     | 4.19 ms     |
| 2          | 606.27 img/s   | 3.38 ms     | 3.56 ms     | 3.76 ms     | 4.25 ms     |
| 4          | 982.92 img/s   | 4.09 ms     | 4.42 ms     | 4.53 ms     | 4.81 ms     |
| 8          | 1553.34 img/s  | 5.22 ms     | 5.31 ms     | 5.50 ms     | 6.74 ms     |
| 16         | 2091.27 img/s  | 7.82 ms     | 7.77 ms     | 7.82 ms     | 8.77 ms     |
| 32         | 2457.61 img/s  | 13.14 ms    | 13.15 ms    | 13.21 ms    | 13.37 ms    |
| 64         | 2746.11 img/s  | 23.31 ms    | 23.50 ms    | 23.56 ms    | 24.31 ms    |
| 128        | 2937.20 img/s  | 43.58 ms    | 43.76 ms    | 43.82 ms    | 44.37 ms    |
| 256        | 3009.83 img/s  | 85.06 ms    | 86.23 ms    | 87.37 ms    | 88.67 ms    |

Mixed Precision Inference Latency + XLA

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 240.66 img/s   | 4.22 ms     | 4.59 ms     | 4.69 ms     | 4.84 ms     |
| 2          | 428.60 img/s   | 4.70 ms     | 5.11 ms     | 5.44 ms     | 6.01 ms     |
| 4          | 945.38 img/s   | 4.26 ms     | 4.35 ms     | 4.42 ms     | 4.74 ms     |
| 8          | 1518.66 img/s  | 5.33 ms     | 5.50 ms     | 5.63 ms     | 5.88 ms     |
| 16         | 2091.66 img/s  | 7.83 ms     | 7.74 ms     | 7.79 ms     | 8.88 ms     |
| 32         | 2604.17 img/s  | 12.40 ms    | 12.45 ms    | 12.51 ms    | 12.61 ms    |
| 64         | 3101.15 img/s  | 20.64 ms    | 20.93 ms    | 21.00 ms    | 21.17 ms    |
| 128        | 3408.72 img/s  | 37.55 ms    | 37.93 ms    | 38.05 ms    | 38.53 ms    |
| 256        | 3633.85 img/s  | 70.85 ms    | 70.93 ms    | 71.12 ms    | 71.45 ms    |

Inference performance: NVIDIA T4 (1x T4 16G)

Our results were obtained by running the inference_benchmark.sh benchmarking script in the TensorFlow 20.06-tf1-py3 NGC container on NVIDIA T4 (1x T4 16G) GPU.

FP32 Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 136.44 img/s   | 7.34 ms     | 7.43 ms     | 7.47 ms     | 7.54 ms     |
| 2          | 215.38 img/s   | 9.29 ms     | 9.42 ms     | 9.46 ms     | 9.59 ms     |
| 4          | 289.29 img/s   | 13.83 ms    | 14.08 ms    | 14.16 ms    | 14.40 ms    |
| 8          | 341.77 img/s   | 23.41 ms    | 23.79 ms    | 23.86 ms    | 24.11 ms    |
| 16         | 394.36 img/s   | 40.58 ms    | 40.87 ms    | 40.98 ms    | 41.41 ms    |
| 32         | 414.66 img/s   | 77.18 ms    | 78.05 ms    | 78.29 ms    | 78.67 ms    |
| 64         | 424.42 img/s   | 150.82 ms   | 152.99 ms   | 153.44 ms   | 154.34 ms   |
| 128        | 429.83 img/s   | 297.82 ms   | 301.09 ms   | 301.60 ms   | 302.51 ms   |
| 256        | 425.72 img/s   | 601.37 ms   | 605.74 ms   | 606.47 ms   | 608.74 ms   |

Mixed Precision Inference Latency

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 211.04 img/s   | 4.77 ms     | 5.05 ms     | 5.08 ms     | 5.15 ms     |
| 2          | 381.23 img/s   | 5.27 ms     | 5.40 ms     | 5.45 ms     | 5.52 ms     |
| 4          | 593.13 img/s   | 6.75 ms     | 6.89 ms     | 6.956 ms    | 7.02 ms     |
| 8          | 791.12 img/s   | 10.16 ms    | 10.35 ms    | 10.43 ms    | 10.68 ms    |
| 16         | 914.26 img/s   | 17.55 ms    | 17.80 ms    | 17.89 ms    | 18.19 ms    |
| 32         | 972.36 img/s   | 32.92 ms    | 33.33 ms    | 33.46 ms    | 33.61 ms    |
| 64         | 991.39 img/s   | 64.56 ms    | 65.62 ms    | 65.92 ms    | 66.35 ms    |
| 128        | 995.81 img/s   | 128.55 ms   | 130.03 ms   | 130.37 ms   | 131.08 ms   |
| 256        | 993.39 img/s   | 257.71 ms   | 259.26 ms   | 259.62 ms   | 260.36 ms   |

Mixed Precision Inference Latency + XLA

| Batch Size | Avg throughput | Avg latency | 90% Latency | 95% Latency | 99% Latency |
|------------|----------------|-------------|-------------|-------------|-------------|
| 1          | 167.01 img/s   | 6.01 ms     | 6.12 ms     | 6.14 ms     | 6.18 ms     |
| 2          | 333.67 img/s   | 6.03 ms     | 6.11 ms     | 6.15 ms     | 6.23 ms     |
| 4          | 605.94 img/s   | 6.63 ms     | 6.79 ms     | 6.86 ms     | 7.02 ms     |
| 8          | 802.13 img/s   | 9.98 ms     | 10.14 ms    | 10.22 ms    | 10.36 ms    |
| 16         | 986.85 img/s   | 16.27 ms    | 16.36 ms    | 16.42 ms    | 16.52 ms    |
| 32         | 1090.38 img/s  | 29.35 ms    | 29.68 ms    | 29.79 ms    | 30.07 ms    |
| 64         | 1131.56 img/s  | 56.63 ms    | 57.22 ms    | 57.41 ms    | 57.76 ms    |
| 128        | 1167.62 img/s  | 109.77 ms   | 111.06 ms   | 111.27 ms   | 111.85 ms   |
| 256        | 1193.74 img/s  | 214.46 ms   | 216.28 ms   | 216.86 ms   | 217.80 ms   |