The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark training, run:
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
Each of these commands runs 100 iterations and saves the results in the benchmark.json file.
To benchmark inference, run:
python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
Each of these commands runs 100 iterations and saves the results in the benchmark.json file.
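The commands above differ only in precision, platform, mode, and whether they go through `multiproc.py`, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the repository: `benchmark_command` is a hypothetical helper and `/data/imagenet` is a placeholder for `<path to imagenet>`. The flags, including the `--raport-file` spelling, are copied from the commands above.

```python
import shlex


def benchmark_command(precision, platform, mode, data_path, gpus=1,
                      model="se-resnext101-32x4d", report="benchmark.json"):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    argv = ["python"]
    if gpus > 1:
        # Multi-GPU runs are wrapped in multiproc.py, as in the commands above.
        argv += ["./multiproc.py", "--nproc_per_node", str(gpus)]
    argv += [
        "./launch.py",
        "--model", model,
        "--precision", precision,
        "--mode", mode,
        "--platform", platform,
        data_path,
        "--raport-file", report,  # spelling taken from the commands above
        "--epochs", "1",
        "--prof", "100",
    ]
    return argv


if __name__ == "__main__":
    # Placeholder dataset path; replace with the real ImageNet location.
    data = "/data/imagenet"
    for precision, platform in [("FP32", "DGX1V"), ("TF32", "DGXA100"), ("AMP", "DGXA100")]:
        for mode in ("benchmark_training", "benchmark_inference"):
            print(shlex.join(benchmark_command(precision, platform, mode, data)))
```

Each argv list can be passed to `subprocess.run` as is. Printing the commands first makes it easy to check them against the ones listed above before starting a run.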
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the Quick Start Guide.
Epochs | Mixed Precision Top1 | TF32 Top1 |
---|---|---|
90 | 80.03 +/- 0.11 | 79.92 +/- 0.07 |
250 | 80.9 +/- 0.08 | 80.98 +/- 0.07 |
Epochs | Mixed Precision Top1 | FP32 Top1 |
---|---|---|
90 | 80.04 +/- 0.07 | 79.93 +/- 0.10 |
250 | 80.92 +/- 0.09 | 80.97 +/- 0.09 |
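
One way to read the accuracy tables is to check whether the mixed precision and TF32 results lie within each other's reported spread. The sketch below assumes that `+/-` denotes a symmetric spread around the mean; the tables do not say whether it is a standard deviation or some other measure. It uses the rows of the first table (mixed precision vs TF32). The helper names are illustrative only.

```python
def interval(mean, spread):
    """Return the (low, high) range implied by 'mean +/- spread'."""
    return (mean - spread, mean + spread)


def overlaps(a, b):
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# epochs -> ((mixed precision mean, spread), (TF32 mean, spread))
rows = {
    90: ((80.03, 0.11), (79.92, 0.07)),
    250: ((80.9, 0.08), (80.98, 0.07)),
}

for epochs, (amp, tf32) in rows.items():
    same_range = overlaps(interval(*amp), interval(*tf32))
    print(f"{epochs} epochs: ranges overlap = {same_range}")
```

Overlapping ranges are only a rough indication that the two precisions reach comparable accuracy; this is not a statistical test.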
The following images show the 250-epoch configuration on a DGX-1V.
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the Quick Start Guide.
GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | TF32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | TF32 Training Time (90E) |
---|---|---|---|---|---|---|---|
1 | 395 img/s | 855 img/s | 2.16 x | 1.0 x | 1.0 x | ~40 hours | ~86 hours |
8 | 2991 img/s | 5779 img/s | 1.93 x | 7.56 x | 6.75 x | ~6 hours | ~12 hours |
GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
---|---|---|---|---|---|---|---|
1 | 132 img/s | 443 img/s | 3.34 x | 1.0 x | 1.0 x | ~76 hours | ~254 hours |
8 | 1004 img/s | 2971 img/s | 2.95 x | 7.57 x | 6.7 x | ~12 hours | ~34 hours |
GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
---|---|---|---|---|---|---|---|
1 | 130 img/s | 427 img/s | 3.26 x | 1.0 x | 1.0 x | ~79 hours | ~257 hours |
8 | 992 img/s | 2925 img/s | 2.94 x | 7.58 x | 6.84 x | ~12 hours | ~34 hours |
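
The speedup and strong scaling columns are ratios of the throughput columns. The sketch below recomputes them for the first table above (TF32 vs mixed precision) from the published, rounded throughputs. The recomputed values can differ from the table in the last digit, presumably because the table was derived from unrounded measurements; that explanation is an assumption.

```python
# Published throughput in img/s, keyed by precision and GPU count.
throughput = {
    "TF32": {1: 395, 8: 2991},
    "AMP": {1: 855, 8: 5779},
}


def speedup(base, fast, gpus):
    """Throughput of `fast` relative to `base` at the same GPU count."""
    return throughput[fast][gpus] / throughput[base][gpus]


def strong_scaling(precision, gpus):
    """Throughput at `gpus` GPUs relative to the 1-GPU throughput."""
    return throughput[precision][gpus] / throughput[precision][1]


for gpus in (1, 8):
    print(f"{gpus} GPU(s): TF32 -> mixed precision speedup = "
          f"{speedup('TF32', 'AMP', gpus):.2f}x")
for precision in ("TF32", "AMP"):
    print(f"{precision}: 8-GPU strong scaling = "
          f"{strong_scaling(precision, 8):.2f}x")
```

The same two functions apply to the FP32 tables once the dictionary is filled with their throughput values.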
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the Quick Start Guide.
Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
---|---|---|---|---|
1 | 40 img/s | 24.92 ms | 26.78 ms | 31.12 ms |
2 | 80 img/s | 24.89 ms | 27.63 ms | 30.81 ms |
4 | 127 img/s | 31.58 ms | 35.92 ms | 39.64 ms |
8 | 250 img/s | 32.29 ms | 34.5 ms | 38.14 ms |
16 | 363 img/s | 44.5 ms | 44.16 ms | 44.37 ms |
32 | 423 img/s | 76.86 ms | 75.89 ms | 76.17 ms |
64 | 472 img/s | 138.36 ms | 135.85 ms | 136.52 ms |
128 | 501 img/s | 262.64 ms | 255.48 ms | 256.02 ms |
256 | 508 img/s | 519.84 ms | 500.71 ms | 501.5 ms |
Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
---|---|---|---|---|
1 | 29 img/s | 33.83 ms | 39.1 ms | 41.57 ms |
2 | 58 img/s | 34.35 ms | 36.92 ms | 41.66 ms |
4 | 117 img/s | 34.33 ms | 38.67 ms | 41.05 ms |
8 | 232 img/s | 34.66 ms | 39.51 ms | 42.16 ms |
16 | 459 img/s | 35.23 ms | 36.77 ms | 38.11 ms |
32 | 871 img/s | 37.62 ms | 39.36 ms | 41.26 ms |
64 | 1416 img/s | 46.95 ms | 45.26 ms | 47.48 ms |
128 | 1533 img/s | 87.49 ms | 83.54 ms | 83.75 ms |
256 | 1576 img/s | 170.79 ms | 161.97 ms | 162.93 ms |
Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
---|---|---|---|---|
1 | 40 img/s | 25.12 ms | 28.83 ms | 31.59 ms |
2 | 75 img/s | 26.82 ms | 30.54 ms | 33.13 ms |
4 | 136 img/s | 29.79 ms | 33.33 ms | 37.65 ms |
8 | 155 img/s | 51.74 ms | 52.57 ms | 53.12 ms |
16 | 164 img/s | 97.99 ms | 98.76 ms | 99.21 ms |
32 | 173 img/s | 186.31 ms | 186.43 ms | 187.4 ms |
64 | 171 img/s | 378.1 ms | 377.19 ms | 378.82 ms |
128 | 165 img/s | 785.83 ms | 778.23 ms | 782.64 ms |
256 | 158 img/s | 1641.96 ms | 1601.74 ms | 1614.52 ms |
Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
---|---|---|---|---|
1 | 31 img/s | 32.51 ms | 37.26 ms | 39.53 ms |
2 | 61 img/s | 32.76 ms | 37.61 ms | 39.62 ms |
4 | 123 img/s | 32.98 ms | 38.97 ms | 42.66 ms |
8 | 262 img/s | 31.01 ms | 36.3 ms | 39.11 ms |
16 | 482 img/s | 33.76 ms | 34.54 ms | 38.5 ms |
32 | 512 img/s | 63.68 ms | 63.29 ms | 63.73 ms |
64 | 527 img/s | 123.57 ms | 122.69 ms | 123.56 ms |
128 | 525 img/s | 248.97 ms | 245.39 ms | 246.66 ms |
256 | 527 img/s | 496.23 ms | 485.68 ms | 488.3 ms |
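
Throughput and average latency in these tables are related by roughly throughput = batch size / latency. The sketch below checks that relation against three rows of the last table above. The match is only approximate: the reported throughput is probably averaged per iteration rather than derived from the average latency, but that is an assumption about the benchmark script, not something stated here.

```python
def throughput_from_latency(batch_size, latency_ms):
    """Estimate img/s from the batch size and average latency in ms."""
    return batch_size / (latency_ms / 1000.0)


# (batch size, average latency in ms, reported throughput in img/s)
rows = [
    (1, 32.51, 31),
    (16, 33.76, 482),
    (256, 496.23, 527),
]

for batch_size, latency_ms, reported in rows:
    estimate = throughput_from_latency(batch_size, latency_ms)
    print(f"batch {batch_size}: estimated {estimate:.0f} img/s, "
          f"reported {reported} img/s")
```

The relation also explains the shape of the tables: once latency grows in proportion to batch size, throughput stops improving.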