SE-ResNeXt101-32x4d for PyTorch

Description: ResNet with the bottleneck 3x3 convolutions replaced by 3x3 grouped convolutions (the ResNeXt design), extended with Squeeze-and-Excitation modules.

Publisher: NVIDIA Deep Learning Examples
Use Case: Classification
Framework: Other
Latest Version: 21.03.1
Modified: November 4, 2022
Compressed Size: 35.43 KB
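
To make the description above concrete, the following is a minimal PyTorch sketch of the building block the model name refers to: a ResNeXt-style bottleneck whose 3x3 convolution is a grouped convolution (32 groups of width 4 in the 32x4d configuration), followed by a Squeeze-and-Excitation gate. This is an illustrative sketch only, not the repository's actual model code; the class name, channel sizes, and reduction ratio are chosen here for the example.

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """Illustrative SE-ResNeXt bottleneck: 1x1 -> grouped 3x3 -> 1x1, plus an SE gate.
    A simplified sketch, not the exact implementation from the repository."""

    def __init__(self, in_channels, channels, out_channels, groups=32, stride=1, se_reduction=16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        # The defining ResNeXt change: the 3x3 convolution is a grouped convolution.
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=stride,
                               padding=1, groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Squeeze-and-Excitation: global average pool, bottleneck MLP, sigmoid gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // se_reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // se_reduction, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Projection shortcut when the shape changes.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = out * self.se(out)          # channel-wise re-weighting
        return self.relu(out + identity)

# Example: a first-stage block of the 32x4d variant uses a 32-group x 4-channel = 128-wide grouped conv.
block = SEBottleneck(in_channels=64, channels=128, out_channels=256, groups=32)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```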

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following sections show how to run benchmarks that measure model performance in training and inference modes.

Training performance benchmark

To benchmark training, run:

  • For 1 GPU
    • FP32 (V100 GPUs only)
      python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • TF32 (A100 GPUs only)
      python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • AMP
      python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
  • For multiple GPUs
    • FP32 (V100 GPUs only)
      python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • TF32 (A100 GPUs only)
      python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100
    • AMP
      python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.
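
If you want to run the single-GPU training benchmark for more than one precision in a single pass, a small wrapper like the sketch below can assemble the commands listed above. The platform, ImageNet path, and per-precision report file names are placeholders to adapt; only flags that appear in the commands above are used.

```python
import subprocess

# Hypothetical convenience wrapper around the single-GPU training benchmark
# commands listed above. Adjust PLATFORM and DATA_DIR for your system; on a
# DGX A100 you would benchmark TF32 instead of FP32, as in the commands above.
PLATFORM = "DGX1V"              # or "DGXA100"
DATA_DIR = "/data/imagenet"     # <path to imagenet>, assumed location

for precision in ("FP32", "AMP"):   # use ("TF32", "AMP") on a DGX A100
    cmd = [
        "python", "./launch.py",
        "--model", "se-resnext101-32x4d",
        "--precision", precision,
        "--mode", "benchmark_training",
        "--platform", PLATFORM,
        DATA_DIR,
        "--raport-file", f"benchmark_{precision}.json",  # separate report per run
        "--epochs", "1",
        "--prof", "100",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```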

Inference performance benchmark

To benchmark inference, run:

  • FP32 (V100 GPUs only)

python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

  • TF32 (A100 GPUs only)

python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

  • AMP

python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100

Each of these scripts will run 100 iterations and save results in the benchmark.json file.
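
The layout of benchmark.json is not documented on this page, so the sketch below simply loads the report and prints whatever it contains; inspect your own file to see the exact metric names.

```python
import json
from pprint import pprint

# Inspect a benchmark report produced by one of the commands above.
# The schema is not specified here, so we only pretty-print what is present.
with open("benchmark.json") as f:
    report = json.load(f)

print("Top-level keys:", list(report.keys()))
pprint(report)
```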

Results

Training accuracy results

Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

| Epochs | Mixed Precision Top1 | TF32 Top1      |
|--------|----------------------|----------------|
| 90     | 80.03 +/- 0.11       | 79.92 +/- 0.07 |
| 250    | 80.9 +/- 0.08        | 80.98 +/- 0.07 |

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

| Epochs | Mixed Precision Top1 | FP32 Top1      |
|--------|----------------------|----------------|
| 90     | 80.04 +/- 0.07       | 79.93 +/- 0.10 |
| 250    | 80.92 +/- 0.09       | 80.97 +/- 0.09 |
Example plots

The following plots show validation loss, top-1 accuracy, and top-5 accuracy for the 250-epoch configuration on a DGX-1V.

[Plots: Validation Loss, Validation Top-1, Validation Top-5]

Training performance results

Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX A100 (8x A100 80GB)

| GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | TF32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | TF32 Training Time (90E) |
|------|-------------------|------------------------------|----------------------------------------------|---------------------|--------------------------------|-------------------------------------|--------------------------|
| 1    | 395 img/s         | 855 img/s                    | 2.16 x                                       | 1.0 x               | 1.0 x                          | ~40 hours                           | ~86 hours                |
| 8    | 2991 img/s        | 5779 img/s                   | 1.93 x                                       | 7.56 x              | 6.75 x                         | ~6 hours                            | ~12 hours                |

Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)

| GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
|------|-------------------|------------------------------|----------------------------------------------|---------------------|--------------------------------|-------------------------------------|--------------------------|
| 1    | 132 img/s         | 443 img/s                    | 3.34 x                                       | 1.0 x               | 1.0 x                          | ~76 hours                           | ~254 hours               |
| 8    | 1004 img/s        | 2971 img/s                   | 2.95 x                                       | 7.57 x              | 6.7 x                          | ~12 hours                           | ~34 hours                |

Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)

| GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | FP32 Strong Scaling | Mixed Precision Strong Scaling | Mixed Precision Training Time (90E) | FP32 Training Time (90E) |
|------|-------------------|------------------------------|----------------------------------------------|---------------------|--------------------------------|-------------------------------------|--------------------------|
| 1    | 130 img/s         | 427 img/s                    | 3.26 x                                       | 1.0 x               | 1.0 x                          | ~79 hours                           | ~257 hours               |
| 8    | 992 img/s         | 2925 img/s                   | 2.94 x                                       | 7.58 x              | 6.84 x                         | ~12 hours                           | ~34 hours                |
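
The speedup and strong-scaling columns above are ratios of the reported throughputs. The sketch below recomputes them for the DGX A100 rows as a sanity check; small differences from the published figures are rounding.

```python
# Recompute the derived columns of the DGX A100 training performance table
# from its raw throughputs (img/s). Differences vs. the table are rounding.
throughput = {
    ("TF32", 1): 395, ("TF32", 8): 2991,
    ("AMP", 1): 855, ("AMP", 8): 5779,   # "mixed precision" column
}

for gpus in (1, 8):
    speedup = throughput[("AMP", gpus)] / throughput[("TF32", gpus)]
    print(f"{gpus} GPU(s): TF32 -> mixed precision speedup = {speedup:.2f}x")

for precision in ("TF32", "AMP"):
    scaling = throughput[(precision, 8)] / throughput[(precision, 1)]
    efficiency = scaling / 8 * 100
    print(f"{precision}: strong scaling 1 -> 8 GPUs = {scaling:.2f}x "
          f"({efficiency:.0f}% efficiency)")
```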

Inference performance results

Our results were obtained by running the applicable inference script in the pytorch-21.03 NGC container.

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32 Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|------------|----------------|-------------|-------------|-------------|
| 1          | 40 img/s       | 24.92 ms    | 26.78 ms    | 31.12 ms    |
| 2          | 80 img/s       | 24.89 ms    | 27.63 ms    | 30.81 ms    |
| 4          | 127 img/s      | 31.58 ms    | 35.92 ms    | 39.64 ms    |
| 8          | 250 img/s      | 32.29 ms    | 34.5 ms     | 38.14 ms    |
| 16         | 363 img/s      | 44.5 ms     | 44.16 ms    | 44.37 ms    |
| 32         | 423 img/s      | 76.86 ms    | 75.89 ms    | 76.17 ms    |
| 64         | 472 img/s      | 138.36 ms   | 135.85 ms   | 136.52 ms   |
| 128        | 501 img/s      | 262.64 ms   | 255.48 ms   | 256.02 ms   |
| 256        | 508 img/s      | 519.84 ms   | 500.71 ms   | 501.5 ms    |

Mixed Precision Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|------------|----------------|-------------|-------------|-------------|
| 1          | 29 img/s       | 33.83 ms    | 39.1 ms     | 41.57 ms    |
| 2          | 58 img/s       | 34.35 ms    | 36.92 ms    | 41.66 ms    |
| 4          | 117 img/s      | 34.33 ms    | 38.67 ms    | 41.05 ms    |
| 8          | 232 img/s      | 34.66 ms    | 39.51 ms    | 42.16 ms    |
| 16         | 459 img/s      | 35.23 ms    | 36.77 ms    | 38.11 ms    |
| 32         | 871 img/s      | 37.62 ms    | 39.36 ms    | 41.26 ms    |
| 64         | 1416 img/s     | 46.95 ms    | 45.26 ms    | 47.48 ms    |
| 128        | 1533 img/s     | 87.49 ms    | 83.54 ms    | 83.75 ms    |
| 256        | 1576 img/s     | 170.79 ms   | 161.97 ms   | 162.93 ms   |
Inference performance: NVIDIA T4

FP32 Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|------------|----------------|-------------|-------------|-------------|
| 1          | 40 img/s       | 25.12 ms    | 28.83 ms    | 31.59 ms    |
| 2          | 75 img/s       | 26.82 ms    | 30.54 ms    | 33.13 ms    |
| 4          | 136 img/s      | 29.79 ms    | 33.33 ms    | 37.65 ms    |
| 8          | 155 img/s      | 51.74 ms    | 52.57 ms    | 53.12 ms    |
| 16         | 164 img/s      | 97.99 ms    | 98.76 ms    | 99.21 ms    |
| 32         | 173 img/s      | 186.31 ms   | 186.43 ms   | 187.4 ms    |
| 64         | 171 img/s      | 378.1 ms    | 377.19 ms   | 378.82 ms   |
| 128        | 165 img/s      | 785.83 ms   | 778.23 ms   | 782.64 ms   |
| 256        | 158 img/s      | 1641.96 ms  | 1601.74 ms  | 1614.52 ms  |

Mixed Precision Inference Latency

| Batch Size | Throughput Avg | Latency Avg | Latency 95% | Latency 99% |
|------------|----------------|-------------|-------------|-------------|
| 1          | 31 img/s       | 32.51 ms    | 37.26 ms    | 39.53 ms    |
| 2          | 61 img/s       | 32.76 ms    | 37.61 ms    | 39.62 ms    |
| 4          | 123 img/s      | 32.98 ms    | 38.97 ms    | 42.66 ms    |
| 8          | 262 img/s      | 31.01 ms    | 36.3 ms     | 39.11 ms    |
| 16         | 482 img/s      | 33.76 ms    | 34.54 ms    | 38.5 ms     |
| 32         | 512 img/s      | 63.68 ms    | 63.29 ms    | 63.73 ms    |
| 64         | 527 img/s      | 123.57 ms   | 122.69 ms   | 123.56 ms   |
| 128        | 525 img/s      | 248.97 ms   | 245.39 ms   | 246.66 ms   |
| 256        | 527 img/s      | 496.23 ms   | 485.68 ms   | 488.3 ms    |
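
A common way to use these inference tables is to pick the highest-throughput batch size whose tail latency still fits a latency budget. The sketch below does that for the NVIDIA T4 mixed-precision rows above, using the 99th-percentile latency column as the budget; the budget values themselves are arbitrary examples.

```python
# T4 mixed-precision inference results from the table above:
# (batch size, throughput avg in img/s, latency 99% in ms)
t4_amp = [
    (1, 31, 39.53), (2, 61, 39.62), (4, 123, 42.66), (8, 262, 39.11),
    (16, 482, 38.5), (32, 512, 63.73), (64, 527, 123.56),
    (128, 525, 246.66), (256, 527, 488.3),
]

def best_config(rows, p99_budget_ms):
    """Return the (batch size, throughput, p99) row with the highest throughput
    whose 99th-percentile latency stays within the budget, or None."""
    candidates = [row for row in rows if row[2] <= p99_budget_ms]
    return max(candidates, key=lambda row: row[1]) if candidates else None

for budget in (40, 70, 250):
    row = best_config(t4_amp, budget)
    if row:
        print(f"p99 budget {budget} ms -> batch {row[0]}, {row[1]} img/s (p99 {row[2]} ms)")
    else:
        print(f"p99 budget {budget} ms -> no configuration fits")
```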