NVIDIA Deep Learning Examples
ResNet50 v1.5 for PyTorch

With a modified architecture and initialization, this ResNet50 version achieves ~0.5% better accuracy than the original.

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark training, run:

  • For 1 GPU
    • FP32 (V100 GPUs only): python ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • TF32 (A100 GPUs only): python ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • AMP: python ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
  • For multiple GPUs
    • FP32 (V100 GPUs only): python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • TF32 (A100 GPUs only): python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • AMP: python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnet50 --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.
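
The training commands above differ only in precision, platform, and whether ./multiproc.py wraps ./launch.py. As a minimal sketch, the helper below assembles the argument list from those choices using only the flags shown above. The name build_benchmark_command, the dataset path /data/imagenet, and the precision/platform pairing table are illustrative assumptions, not part of the repository.

```python
from typing import List

# Precision/platform pairings taken from the "V100 GPUs only" and
# "A100 GPUs only" notes in the command list (assumed to be exhaustive).
ALLOWED_PLATFORMS = {
    "FP32": {"DGX1V"},
    "TF32": {"DGXA100"},
    "AMP": {"DGX1V", "DGXA100"},
}


def build_benchmark_command(precision: str, platform: str, data_path: str,
                            num_gpus: int = 1,
                            mode: str = "benchmark_training",
                            report_file: str = "benchmark.json") -> List[str]:
    """Return the argv list for one benchmark run (hypothetical helper)."""
    if precision not in ALLOWED_PLATFORMS:
        raise ValueError(f"unknown precision: {precision}")
    if platform not in ALLOWED_PLATFORMS[precision]:
        raise ValueError(f"{precision} is not listed for platform {platform}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    cmd = ["python"]
    if num_gpus > 1:
        # Multi-GPU runs wrap launch.py with multiproc.py.
        cmd += ["./multiproc.py", "--nproc_per_node", str(num_gpus)]
    cmd += [
        "./launch.py", "--model", "resnet50",
        "--precision", precision,
        "--mode", mode,
        "--platform", platform,
        data_path,
        "--raport-file", report_file,  # flag spelled as in the commands above
        "--epochs", "1", "--prof", "100",
    ]
    return cmd


if __name__ == "__main__":
    # /data/imagenet is a placeholder for <path to imagenet>.
    print(" ".join(build_benchmark_command("AMP", "DGXA100", "/data/imagenet", num_gpus=8)))
```

The returned list can be passed to subprocess.run from the directory that contains launch.py; passing mode="benchmark_inference" yields the single-GPU inference variant.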

Inference performance benchmark

To benchmark inference, run:

  • FP32 (V100 GPUs only)

python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

  • TF32 (A100 GPUs only)

python ./launch.py --model resnet50 --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

  • AMP

python ./launch.py --model resnet50 --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.

Results

Training accuracy results

Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

| Epochs | Mixed Precision Top1 | TF32 Top1 |
|---|---|---|
| 90 | 77.12 +/- 0.11 | 76.95 +/- 0.18 |
| 250 | 78.43 +/- 0.11 | 78.38 +/- 0.17 |

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

| Epochs | Mixed Precision Top1 | FP32 Top1 |
|---|---|---|
| 90 | 76.88 +/- 0.16 | 77.01 +/- 0.16 |
| 250 | 78.25 +/- 0.12 | 78.30 +/- 0.16 |

Training accuracy: NVIDIA DGX-2 (16x V100 32GB)

| Epochs | Mixed Precision Top1 | FP32 Top1 |
|---|---|---|
| 50 | 75.81 +/- 0.08 | 76.04 +/- 0.05 |
| 90 | 77.10 +/- 0.06 | 77.23 +/- 0.04 |
| 250 | 78.59 +/- 0.13 | 78.46 +/- 0.03 |

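
One way to read the accuracy figures above is to compare the gap between the two precisions with their combined spread. The sketch below does that arithmetic for three of the 90-epoch and 250-epoch entries. It assumes the "+/-" values are standard deviations over independent runs, which the results do not state explicitly, and the function name gap_in_sigmas is purely illustrative.

```python
import math


def gap_in_sigmas(mean_a: float, spread_a: float,
                  mean_b: float, spread_b: float) -> float:
    """Absolute difference of two means in units of their combined spread.

    Assumes the spreads are standard deviations of independent measurements,
    so they combine in quadrature.
    """
    combined = math.sqrt(spread_a ** 2 + spread_b ** 2)
    return abs(mean_a - mean_b) / combined


# (mixed precision mean, spread), (TF32 or FP32 mean, spread), from the results above.
results = {
    "DGX A100, 90 epochs": ((77.12, 0.11), (76.95, 0.18)),
    "DGX-1, 90 epochs": ((76.88, 0.16), (77.01, 0.16)),
    "DGX-2, 250 epochs": ((78.59, 0.13), (78.46, 0.03)),
}

for name, ((mp_mean, mp_spread), (ref_mean, ref_spread)) in results.items():
    ratio = gap_in_sigmas(mp_mean, mp_spread, ref_mean, ref_spread)
    print(f"{name}: gap is {ratio:.2f} combined spreads")
```

A ratio below roughly 1 means the two precisions are hard to tell apart at the reported spread; this is a rough reading aid, not a statistical test.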
Example plots

The following images show a 250-epoch configuration on a DGX-1V.

[Figures: validation loss, validation top-1 accuracy, and validation top-5 accuracy]

Training performance results

Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX A100 (8x A100 80GB)

| GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | TF32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | TF32 Training Time (90E) |
|---|---|---|---|---|---|---|---|
| 1 | 938 img/s | 2470 img/s | 2.63 x | 1.0 x | 1.0 x | ~14 hours | ~36 hours |
| 8 | 7248 img/s | 16621 img/s | 2.29 x | 7.72 x | 6.72 x | ~3 hours | ~5 hours |

Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)

| GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
|---|---|---|---|---|---|---|---|
| 1 | 367 img/s | 1200 img/s | 3.26 x | 1.0 x | 1.0 x | ~29 hours | ~92 hours |
| 8 | 2855 img/s | 8322 img/s | 2.91 x | 7.76 x | 6.93 x | ~5 hours | ~12 hours |

Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)

| GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
|---|---|---|---|---|---|---|---|
| 1 | 356 img/s | 1156 img/s | 3.24 x | 1.0 x | 1.0 x | ~30 hours | ~95 hours |
| 8 | 2766 img/s | 8056 img/s | 2.91 x | 7.75 x | 6.96 x | ~5 hours | ~13 hours |
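
The derived columns in the training performance results follow from the throughput figures. As a rough sketch, the code below recomputes the speedup, strong scaling, and a 90-epoch time estimate for the DGX A100 rows. The epoch size of 1,281,167 images assumes the standard ImageNet-1k (ILSVRC-2012) training split, and the function names are illustrative. The time estimate counts training throughput only, so it comes out lower than the reported wall-clock times, which presumably also include validation and other overhead.

```python
# Assumption: <path to imagenet> holds the ILSVRC-2012 training split.
IMAGENET_TRAIN_IMAGES = 1_281_167


def speedup(baseline_img_s: float, mixed_img_s: float) -> float:
    """Mixed-precision throughput relative to the TF32/FP32 baseline."""
    return mixed_img_s / baseline_img_s


def strong_scaling(multi_gpu_img_s: float, single_gpu_img_s: float) -> float:
    """Multi-GPU throughput relative to the single-GPU throughput."""
    return multi_gpu_img_s / single_gpu_img_s


def training_hours(img_s: float, epochs: int = 90,
                   images_per_epoch: int = IMAGENET_TRAIN_IMAGES) -> float:
    """Lower-bound training time in hours from raw training throughput."""
    return epochs * images_per_epoch / img_s / 3600.0


# DGX A100 throughputs (img/s) from the results above.
tf32_1gpu, amp_1gpu = 938, 2470
tf32_8gpu, amp_8gpu = 7248, 16621

print(f"speedup, 1 GPU:        {speedup(tf32_1gpu, amp_1gpu):.2f}x")
print(f"TF32 scaling, 8 GPUs:  {strong_scaling(tf32_8gpu, tf32_1gpu):.2f}x")
print(f"AMP scaling, 8 GPUs:   {strong_scaling(amp_8gpu, amp_1gpu):.2f}x")
print(f"AMP 90 epochs, 1 GPU:  {training_hours(amp_1gpu):.1f} h")
print(f"TF32 90 epochs, 1 GPU: {training_hours(tf32_1gpu):.1f} h")
```

Small differences from the published scaling factors are expected, since the published throughputs are rounded.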

Inference performance results

Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32 Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|---|---|---|---|---|
| 1 | 96 img/s | 10.37 ms | 10.81 ms | 11.73 ms |
| 2 | 196 img/s | 10.24 ms | 11.18 ms | 12.89 ms |
| 4 | 386 img/s | 10.46 ms | 11.01 ms | 11.75 ms |
| 8 | 709 img/s | 11.5 ms | 12.36 ms | 13.12 ms |
| 16 | 1023 img/s | 16.07 ms | 15.69 ms | 15.97 ms |
| 32 | 1127 img/s | 29.37 ms | 28.53 ms | 28.67 ms |
| 64 | 1200 img/s | 55.4 ms | 53.5 ms | 53.71 ms |
| 128 | 1229 img/s | 109.26 ms | 104.04 ms | 104.34 ms |
| 256 | 1261 img/s | 214.48 ms | 202.51 ms | 202.88 ms |

Mixed Precision Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|---|---|---|---|---|
| 1 | 78 img/s | 12.78 ms | 13.27 ms | 14.36 ms |
| 2 | 154 img/s | 13.01 ms | 13.74 ms | 15.19 ms |
| 4 | 300 img/s | 13.41 ms | 14.25 ms | 15.68 ms |
| 8 | 595 img/s | 13.65 ms | 14.51 ms | 15.6 ms |
| 16 | 1178 img/s | 14.0 ms | 15.07 ms | 16.26 ms |
| 32 | 2146 img/s | 15.84 ms | 17.25 ms | 18.53 ms |
| 64 | 2984 img/s | 23.18 ms | 21.51 ms | 21.93 ms |
| 128 | 3249 img/s | 43.55 ms | 39.36 ms | 40.1 ms |
| 256 | 3382 img/s | 84.14 ms | 75.3 ms | 80.08 ms |

Inference performance: NVIDIA T4

FP32 Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|---|---|---|---|---|
| 1 | 98 img/s | 10.7 ms | 12.82 ms | 16.71 ms |
| 2 | 186 img/s | 11.26 ms | 13.79 ms | 16.99 ms |
| 4 | 325 img/s | 12.73 ms | 13.89 ms | 18.03 ms |
| 8 | 363 img/s | 22.41 ms | 22.57 ms | 22.9 ms |
| 16 | 409 img/s | 39.77 ms | 39.8 ms | 40.23 ms |
| 32 | 420 img/s | 77.62 ms | 76.92 ms | 77.28 ms |
| 64 | 428 img/s | 152.73 ms | 152.03 ms | 153.02 ms |
| 128 | 426 img/s | 309.26 ms | 303.38 ms | 305.13 ms |
| 256 | 415 img/s | 635.98 ms | 620.16 ms | 625.21 ms |

Mixed Precision Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|---|---|---|---|---|
| 1 | 79 img/s | 12.96 ms | 15.47 ms | 20.0 ms |
| 2 | 156 img/s | 13.18 ms | 14.9 ms | 18.73 ms |
| 4 | 317 img/s | 12.99 ms | 14.69 ms | 19.05 ms |
| 8 | 652 img/s | 12.82 ms | 16.04 ms | 19.43 ms |
| 16 | 1050 img/s | 15.8 ms | 16.57 ms | 20.62 ms |
| 32 | 1128 img/s | 29.54 ms | 28.79 ms | 28.97 ms |
| 64 | 1165 img/s | 57.41 ms | 55.67 ms | 56.11 ms |
| 128 | 1190 img/s | 114.24 ms | 109.17 ms | 110.41 ms |
| 256 | 1198 img/s | 225.95 ms | 215.28 ms | 222.94 ms |
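
Throughput and latency in the inference results are linked: a batch of N images that takes t seconds implies N / t images per second. The sketch below applies that relation to a few DGX-1 FP32 rows and computes the mixed-precision speedup at batch size 256. The helper name throughput_from_latency is illustrative, and the implied values need not match the reported averages exactly, since an average of per-batch rates differs from a rate computed from the average latency.

```python
def throughput_from_latency(batch_size: int, latency_ms: float) -> float:
    """Images per second implied by one batch taking latency_ms milliseconds."""
    return batch_size / (latency_ms / 1000.0)


# (batch size, reported throughput in img/s, reported average latency in ms),
# copied from the DGX-1 (1x V100 16GB) FP32 results above.
V100_FP32 = [(1, 96, 10.37), (16, 1023, 16.07), (256, 1261, 214.48)]

for batch_size, reported, latency_ms in V100_FP32:
    implied = throughput_from_latency(batch_size, latency_ms)
    print(f"batch {batch_size:>3}: reported {reported} img/s, implied {implied:.0f} img/s")

# Mixed-precision speedup over FP32 at batch size 256, from the reported throughputs.
print(f"V100: {3382 / 1261:.2f}x, T4: {1198 / 415:.2f}x")
```

On both GPUs the mixed-precision advantage only appears at larger batch sizes; at batch size 1 the FP32 rows report the higher throughput.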
