ResNeXt101-32x4d for PyTorch

NVIDIA Deep Learning Examples

ResNet with bottleneck 3x3 Convolutions substituted by 3x3 Grouped Convolutions.
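To make the effect of the substitution concrete, the sketch below counts the weights of a 3x3 convolution with and without grouping. It assumes the usual reading of the model name, 32 groups with 4 channels each, which gives a width of 128 for the 3x3 layer in the first stage; the helper name `conv_weight_count` is hypothetical and is not part of this repository.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2D convolution.

    With grouping, each output channel only sees in_channels / groups
    input channels, so the weight count shrinks by a factor of `groups`.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


# Assumption: "32x4d" means 32 groups of 4 channels, i.e. a width of 128.
width = 32 * 4

dense = conv_weight_count(width, width, kernel_size=3)               # 147456
grouped = conv_weight_count(width, width, kernel_size=3, groups=32)  # 4608

print(f"dense 3x3:   {dense} weights")
print(f"grouped 3x3: {grouped} weights ({dense // grouped}x fewer)")
```

At equal width the grouped layer is 32 times cheaper in weights, which is the budget a ResNeXt-style block can spend on wider layers.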

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark training, run:

For 1 GPU:
- FP32 (V100 GPUs only): python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
- TF32 (A100 GPUs only): python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
- AMP: python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
For multiple GPUs:
- FP32 (V100 GPUs only): python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
- TF32 (A100 GPUs only): python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
- AMP: python ./multiproc.py --nproc_per_node 8 ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.
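The commands above differ only in precision, platform, and whether they go through ./multiproc.py, so they can be generated rather than copied. The following is a minimal sketch of that idea; the function `build_benchmark_command` and the path `/data/imagenet` are hypothetical, and only flags that appear in the commands above are used.

```python
import shlex


def build_benchmark_command(precision, platform, data_path, num_gpus=1,
                            mode="benchmark_training"):
    """Assemble a benchmark command as an argument list.

    Multi-GPU runs are wrapped in ./multiproc.py with one process per
    GPU, following the pattern of the multi-GPU commands above.
    """
    cmd = ["python"]
    if num_gpus > 1:
        cmd += ["./multiproc.py", "--nproc_per_node", str(num_gpus)]
    cmd += [
        "./launch.py",
        "--model", "resnext101-32x4d",
        "--precision", precision,
        "--mode", mode,
        "--platform", platform,
        data_path,
        "--raport-file", "benchmark.json",
        "--epochs", "1",
        "--prof", "100",
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical dataset location; replace it with your ImageNet path.
    cmd = build_benchmark_command("AMP", "DGXA100", "/data/imagenet", num_gpus=8)
    print(shlex.join(cmd))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting issues.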

Inference performance benchmark

To benchmark inference, run:

FP32 (V100 GPUs only)

python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

TF32 (A100 GPUs only)

python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

AMP

python ./launch.py --model resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.

Results

Training accuracy results

Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Epochs	Mixed Precision Top1	TF32 Top1
90	79.47 +/- 0.03	79.38 +/- 0.07
250	80.19 +/- 0.08	80.27 +/- 0.1

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Epochs	Mixed Precision Top1	FP32 Top1
90	79.49 +/- 0.05	79.40 +/- 0.10
250	80.26 +/- 0.11	80.06 +/- 0.06

Example plots

The following plots show a 250-epoch configuration on a DGX-1V.

[Figure: validation loss]

[Figure: validation top-1 accuracy]

[Figure: validation top-5 accuracy]

Training performance results

Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX A100 (8x A100 80GB)

GPUs	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	TF32 Strong Scaling	Mixed Precision Strong Scaling	Mixed Precision Training Time (90E)	TF32 Training Time (90E)
1	456 img/s	1211 img/s	2.65 x	1.0 x	1.0 x	~28 hours	~74 hours
8	3471 img/s	7925 img/s	2.28 x	7.6 x	6.54 x	~5 hours	~10 hours
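The speedup and strong-scaling columns in the DGX A100 table above can be recomputed from the raw throughputs. The sketch below does so with two hypothetical helpers; the last digit may differ from the table, presumably because the published ratios were derived from unrounded throughputs.

```python
def speedup(baseline_img_s, mixed_img_s):
    """Mixed-precision throughput relative to the baseline precision."""
    return mixed_img_s / baseline_img_s


def strong_scaling(single_gpu_img_s, multi_gpu_img_s):
    """Multi-GPU throughput relative to a single GPU."""
    return multi_gpu_img_s / single_gpu_img_s


# Throughputs in img/s from the DGX A100 table, keyed by GPU count.
tf32 = {1: 456, 8: 3471}
mixed = {1: 1211, 8: 7925}

for gpus in (1, 8):
    print(f"{gpus} GPU(s): speedup TF32 -> mixed = "
          f"{speedup(tf32[gpus], mixed[gpus]):.2f}x")

print(f"TF32 strong scaling at 8 GPUs:  {strong_scaling(tf32[1], tf32[8]):.2f}x")
print(f"Mixed strong scaling at 8 GPUs: {strong_scaling(mixed[1], mixed[8]):.2f}x")
```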

Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)

GPUs	Throughput - FP32	Throughput - mixed precision	Throughput speedup (FP32 to mixed precision)	FP32 Strong Scaling	Mixed Precision Strong Scaling	Mixed Precision Training Time (90E)	FP32 Training Time (90E)
1	147 img/s	587 img/s	3.97 x	1.0 x	1.0 x	~58 hours	~228 hours
8	1133 img/s	4065 img/s	3.58 x	7.65 x	6.91 x	~9 hours	~30 hours

Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)

GPUs	Throughput - FP32	Throughput - mixed precision	Throughput speedup (FP32 to mixed precision)	FP32 Strong Scaling	Mixed Precision Strong Scaling	Mixed Precision Training Time (90E)	FP32 Training Time (90E)
1	144 img/s	565 img/s	3.9 x	1.0 x	1.0 x	~60 hours	~233 hours
8	1108 img/s	3863 img/s	3.48 x	7.66 x	6.83 x	~9 hours	~31 hours
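The training-time columns can be sanity-checked against the throughput columns. The sketch below assumes the standard ImageNet-1k training set of 1,281,167 images, which this document does not state, and it ignores validation passes and other per-epoch overhead, so it gives a lower bound rather than the published figure.

```python
# Assumption: ImageNet-1k (ILSVRC-2012) training split size.
IMAGENET_TRAIN_IMAGES = 1_281_167


def estimate_training_hours(throughput_img_s, epochs=90,
                            images_per_epoch=IMAGENET_TRAIN_IMAGES):
    """Lower-bound training time in hours from sustained throughput."""
    seconds = epochs * images_per_epoch / throughput_img_s
    return seconds / 3600.0


# Throughputs in img/s from the DGX-1 32GB table: (GPUs, FP32, mixed).
rows = [(1, 144, 565), (8, 1108, 3863)]

for gpus, fp32_img_s, mixed_img_s in rows:
    print(f"{gpus} GPU(s): FP32 >= {estimate_training_hours(fp32_img_s):.1f} h, "
          f"mixed >= {estimate_training_hours(mixed_img_s):.1f} h")
```

Under these assumptions each estimate lands a few percent below the corresponding table entry, as expected for a bound that leaves out evaluation time.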

Inference performance results

Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32 Inference Latency

Batch Size	Throughput Avg	Latency Avg	Latency 95%	Latency 99%
1	55 img/s	17.95 ms	20.61 ms	22.0 ms
2	105 img/s	19.2 ms	20.74 ms	22.77 ms
4	170 img/s	23.65 ms	24.66 ms	28.0 ms
8	336 img/s	24.05 ms	24.92 ms	27.75 ms
16	397 img/s	40.77 ms	40.44 ms	40.65 ms
32	452 img/s	72.12 ms	71.1 ms	71.35 ms
64	500 img/s	130.9 ms	128.19 ms	128.64 ms
128	527 img/s	249.57 ms	242.77 ms	243.63 ms
256	533 img/s	496.76 ms	478.04 ms	480.42 ms

Mixed Precision Inference Latency

Batch Size	Throughput Avg	Latency Avg	Latency 95%	Latency 99%
1	43 img/s	23.08 ms	24.18 ms	27.82 ms
2	84 img/s	23.65 ms	24.64 ms	27.87 ms
4	164 img/s	24.38 ms	27.33 ms	27.95 ms
8	333 img/s	24.18 ms	25.92 ms	28.3 ms
16	640 img/s	25.4 ms	26.53 ms	29.47 ms
32	1195 img/s	27.72 ms	29.9 ms	32.19 ms
64	1595 img/s	41.89 ms	40.15 ms	41.08 ms
128	1699 img/s	79.45 ms	75.65 ms	76.08 ms
256	1746 img/s	154.68 ms	145.76 ms	146.52 ms
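The two V100 tables above are easier to compare as a ratio per batch size. The sketch below computes mixed-precision throughput relative to FP32 and finds the smallest batch size at which mixed precision is faster; the helper names are hypothetical.

```python
# Average throughput in img/s from the two V100 tables above.
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
fp32 = [55, 105, 170, 336, 397, 452, 500, 527, 533]
mixed = [43, 84, 164, 333, 640, 1195, 1595, 1699, 1746]


def throughput_ratios(baseline, candidate):
    """Element-wise candidate / baseline throughput."""
    return [c / b for b, c in zip(baseline, candidate)]


def first_batch_with_gain(sizes, ratios):
    """Smallest batch size whose ratio exceeds 1.0, or None."""
    for size, ratio in zip(sizes, ratios):
        if ratio > 1.0:
            return size
    return None


ratios = throughput_ratios(fp32, mixed)
for size, ratio in zip(batch_sizes, ratios):
    print(f"batch {size:>3}: mixed / FP32 = {ratio:.2f}x")

print("first batch size where mixed precision is faster:",
      first_batch_with_gain(batch_sizes, ratios))
```

On these numbers mixed precision is slower than FP32 up to batch size 8 and faster from batch size 16 onward, reaching roughly 3.3x at batch size 256.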

Inference performance: NVIDIA T4

FP32 Inference Latency

Batch Size	Throughput Avg	Latency Avg	Latency 95%	Latency 99%
1	56 img/s	18.18 ms	20.45 ms	24.58 ms
2	109 img/s	18.77 ms	21.53 ms	26.21 ms
4	151 img/s	26.89 ms	27.81 ms	30.94 ms
8	164 img/s	48.99 ms	49.44 ms	49.91 ms
16	172 img/s	93.51 ms	93.73 ms	94.16 ms
32	180 img/s	178.83 ms	178.41 ms	179.07 ms
64	178 img/s	361.95 ms	360.7 ms	362.32 ms
128	172 img/s	756.93 ms	750.21 ms	752.45 ms
256	161 img/s	1615.79 ms	1580.61 ms	1583.43 ms

Mixed Precision Inference Latency

Batch Size	Throughput Avg	Latency Avg	Latency 95%	Latency 99%
1	44 img/s	23.0 ms	25.77 ms	29.41 ms
2	87 img/s	23.14 ms	26.55 ms	30.97 ms
4	178 img/s	22.8 ms	24.2 ms	29.38 ms
8	371 img/s	21.98 ms	25.34 ms	29.61 ms
16	553 img/s	29.47 ms	29.52 ms	31.14 ms
32	578 img/s	56.56 ms	56.04 ms	56.37 ms
64	591 img/s	110.82 ms	109.37 ms	109.83 ms
128	597 img/s	220.44 ms	215.33 ms	216.3 ms
256	598 img/s	439.3 ms	428.2 ms	431.46 ms
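Throughput and average latency in these tables are roughly tied together by the batch size: one batch of N images completing in t seconds corresponds to N / t images per second. The sketch below checks a few rows of the T4 FP32 table against that relation. This is a consistency check only; how the benchmark averages the two quantities is not stated here, so an exact match should not be expected.

```python
# (batch size, reported throughput in img/s, reported average latency in ms)
# Selected rows from the T4 FP32 table.
T4_FP32_ROWS = [
    (1, 56, 18.18),
    (8, 164, 48.99),
    (64, 178, 361.95),
    (256, 161, 1615.79),
]


def implied_throughput(batch_size, latency_ms):
    """Images per second implied by one batch taking latency_ms."""
    return batch_size / (latency_ms / 1000.0)


for batch_size, reported, latency_ms in T4_FP32_ROWS:
    implied = implied_throughput(batch_size, latency_ms)
    gap = 100.0 * (implied - reported) / reported
    print(f"batch {batch_size:>3}: reported {reported} img/s, "
          f"implied {implied:.1f} img/s ({gap:+.1f}%)")
```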