nnU-Net for PyTorch

Description: An optimized, robust, and self-adapting framework for U-Net-based medical image segmentation

Publisher: NVIDIA

Use Case: Segmentation

Framework: PyTorch

Latest Version: 21.11.0

Modified: February 3, 2022

Compressed Size: 5.31 MB

Benchmarking

The following sections show how to run benchmarks that measure model performance in training and inference modes.

Training performance benchmark

To benchmark training, run the scripts/benchmark.py script with --mode train:

python scripts/benchmark.py --mode train --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp] 

For example, to benchmark 3D U-Net training using mixed-precision on 8 GPUs with batch size of 2, run:

python scripts/benchmark.py --mode train --gpus 8 --dim 3 --batch_size 2 --amp

By default, the script runs one warm-up epoch and starts performance benchmarking in the second epoch.

At the end of the run, it prints a line reporting the best training throughput and latency.
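
The warm-up-then-measure pattern described above can be sketched as follows (a minimal illustration only; the train_step callable and the epoch length are hypothetical stand-ins, not the actual scripts/benchmark.py internals):

```python
import time

def benchmark_training(train_step, steps_per_epoch, batch_size, warmup_epochs=1):
    """Run untimed warm-up epochs, then measure throughput over one epoch."""
    # Warm-up: lets cuDNN autotuning, memory allocators, etc. settle.
    for _ in range(warmup_epochs * steps_per_epoch):
        train_step()

    # Timed epoch.
    start = time.perf_counter()
    for _ in range(steps_per_epoch):
        train_step()
    elapsed = time.perf_counter() - start

    throughput = steps_per_epoch * batch_size / elapsed  # images (volumes) per second
    latency_ms = elapsed / steps_per_epoch * 1000        # milliseconds per batch
    return throughput, latency_ms

# Dummy step standing in for forward + backward + optimizer update.
tput, lat = benchmark_training(lambda: None, steps_per_epoch=100, batch_size=2)
```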

Inference performance benchmark

To benchmark inference, run the scripts/benchmark.py script with --mode predict:

python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]

For example, to benchmark inference using mixed-precision for 3D U-Net, with batch size of 4, run:

python scripts/benchmark.py --mode predict --dim 3 --amp --batch_size 4

By default, the script runs one warm-up pass over the data and starts inference benchmarking in the second pass.

At the end of the run, it prints a line reporting the inference throughput and latency.
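
The average and percentile latencies reported by the inference benchmark can be derived from per-batch timings; here is a sketch using only the standard library (the timing values below are illustrative, not measured):

```python
import statistics

# Hypothetical per-batch inference times in milliseconds.
batch_times_ms = [23.9, 24.1, 23.8, 25.0, 24.4, 23.7, 24.6, 24.0, 23.9, 24.2]

latency_avg = statistics.mean(batch_times_ms)
# quantiles(n=100) yields the 1st..99th percentile cut points.
pct = statistics.quantiles(batch_times_ms, n=100)
latency_p90, latency_p95, latency_p99 = pct[89], pct[94], pct[98]
```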

Results

The following sections provide details on how to achieve the same performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] training command and averaging results across folds, in the PyTorch 21.11 NGC container on an NVIDIA DGX A100 with 8x A100 80G GPUs.

| Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - TF32 | Time to train - mixed precision | Time to train - TF32 | Time to train speedup (TF32 to mixed precision) |
|-----------|------|------------------|----------------------------|-----------------|---------------------------------|----------------------|-------------------------------------------------|
| 2 | 1 | 2 | 73.21 | 73.11 | 33 min | 48 min | 1.46 |
| 2 | 8 | 2 | 73.15 | 73.16 | 9 min | 13 min | 1.44 |
| 3 | 1 | 2 | 74.35 | 74.34 | 104 min | 167 min | 1.61 |
| 3 | 8 | 2 | 74.30 | 74.32 | 23 min | 36 min | 1.57 |

Training accuracy: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] training command and averaging results across folds, in the PyTorch 21.11 NGC container on an NVIDIA DGX-1 with 8x V100 16G GPUs.

| Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision) |
|-----------|------|------------------|----------------------------|-----------------|---------------------------------|----------------------|-------------------------------------------------|
| 2 | 1 | 2 | 73.18 | 73.22 | 60 min | 114 min | 1.90 |
| 2 | 8 | 2 | 73.15 | 73.18 | 13 min | 19 min | 1.46 |
| 3 | 1 | 2 | 74.31 | 74.33 | 201 min | 680 min | 3.38 |
| 3 | 8 | 2 | 74.35 | 74.39 | 41 min | 153 min | 3.73 |
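
The time-to-train speedup column is simply the ratio of full-precision to mixed-precision training time. For example, for 3D U-Net on a single V100, taking the times as listed above:

```python
# Time to train in minutes, from the DGX-1 accuracy table above (dim 3, 1 GPU).
fp32_minutes = 680
amp_minutes = 201

speedup = fp32_minutes / amp_minutes
print(round(speedup, 2))  # 3.38, matching the reported speedup
```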

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp] training script in the PyTorch 21.11 NGC container on an NVIDIA DGX A100 with 8x A100 80G GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|-----------|------|------------------|--------------------------------------|---------------------------|----------------------------------------------|--------------------------------|---------------------|
| 2 | 1 | 64 | 1129.48 | 702.82 | 1.607 | N/A | N/A |
| 2 | 1 | 128 | 1234.69 | 741.01 | 1.666 | N/A | N/A |
| 2 | 8 | 64 | 7015.45 | 4613.27 | 1.521 | 6.211 | 6.564 |
| 2 | 8 | 128 | 8293.61 | 5498.78 | 1.508 | 6.717 | 7.421 |
| 3 | 1 | 1 | 13.92 | 9.22 | 1.509 | N/A | N/A |
| 3 | 1 | 2 | 17.68 | 10.72 | 1.649 | N/A | N/A |
| 3 | 1 | 4 | 20.56 | 11.5 | 1.787 | N/A | N/A |
| 3 | 8 | 1 | 92.97 | 61.68 | 1.416 | 6.679 | 7.119 |
| 3 | 8 | 2 | 114.47 | 72.23 | 1.475 | 6.475 | 7.242 |
| 3 | 8 | 4 | 140.55 | 85.53 | 1.643 | 6.836 | 7.437 |
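
Weak scaling here is the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size. For example, for 2D U-Net with mixed precision and batch size 64:

```python
# Throughput in img/s from the DGX A100 table above (dim 2, batch size 64/GPU).
throughput_1gpu = 1129.48  # mixed precision, 1 GPU
throughput_8gpu = 7015.45  # mixed precision, 8 GPUs

weak_scaling = throughput_8gpu / throughput_1gpu
print(round(weak_scaling, 3))  # 6.211, matching the table
```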

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp] training script in the PyTorch 21.11 NGC container on an NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|-----------|------|------------------|--------------------------------------|---------------------------|----------------------------------------------|--------------------------------|---------------------|
| 2 | 1 | 64 | 607.16 | 298.84 | 2.032 | N/A | N/A |
| 2 | 1 | 128 | 653.44 | 307.01 | 2.128 | N/A | N/A |
| 2 | 8 | 64 | 4058.79 | 2196.05 | 1.848 | 6.685 | 7.349 |
| 2 | 8 | 128 | 4649.37 | 2388.46 | 1.848 | 7.115 | 7.779 |
| 3 | 1 | 1 | 8.66 | 1.99 | 4.352 | N/A | N/A |
| 3 | 1 | 2 | 9.65 | 2.07 | 4.662 | N/A | N/A |
| 3 | 1 | 4 | 9.99 | OOM | N/A | N/A | N/A |
| 3 | 8 | 1 | 58.45 | 15.55 | 3.756 | 6.749 | 7.819 |
| 3 | 8 | 2 | 66.03 | 16.22 | 4.071 | 6.842 | 7.835 |
| 3 | 8 | 4 | 67.37 | OOM | N/A | 6.743 | N/A |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80G)

Our results were obtained by running the python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp] inference benchmarking script in the PyTorch 21.11 NGC container on an NVIDIA DGX A100 with 1x A100 80G GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 64 | 4x192x160 | 3211.23 | 19.93 | 20.24 | 20.38 | 20.84 |
| 2 | 128 | 4x192x160 | 3465.45 | 36.94 | 38.35 | 38.72 | 38.95 |
| 3 | 1 | 4x128x128x128 | 41.93 | 23.85 | 24.40 | 24.61 | 24.99 |
| 3 | 2 | 4x128x128x128 | 44.24 | 45.21 | 47.08 | 47.38 | 48.24 |
| 3 | 4 | 4x128x128x128 | 45.81 | 87.31 | 88.13 | 88.56 | 89.69 |

TF32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 64 | 4x192x160 | 2172.38 | 29.46 | 29.94 | 30.03 | 30.19 |
| 2 | 128 | 4x192x160 | 1769.56 | 72.34 | 72.84 | 73.04 | 74.79 |
| 3 | 1 | 4x128x128x128 | 23.83 | 41.97 | 42.71 | 42.76 | 42.87 |
| 3 | 2 | 4x128x128x128 | 26.75 | 74.77 | 75.79 | 76.06 | 77.04 |
| 3 | 4 | 4x128x128x128 | 27.10 | 147.62 | 147.81 | 149.14 | 190.08 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
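
Average latency and throughput are mutually consistent: throughput ≈ batch size / (average latency in seconds). Checking the 2D FP16 row with batch size 64 from the table above:

```python
batch_size = 64
latency_avg_ms = 19.93  # Latency Avg from the FP16 table above

throughput = batch_size / (latency_avg_ms / 1000.0)  # img/s
# throughput comes out close to the reported 3211.23 img/s
```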

Inference performance: NVIDIA DGX-1 (1x V100 16G)

Our results were obtained by running the python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp] inference benchmarking script in the PyTorch 21.11 NGC container on an NVIDIA DGX-1 with 1x V100 16G GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 64 | 4x192x160 | 1809.79 | 35.36 | 35.75 | 35.84 | 36.21 |
| 2 | 128 | 4x192x160 | 1987.91 | 64.39 | 64.79 | 64.87 | 65.01 |
| 3 | 1 | 4x128x128x128 | 26.75 | 37.38 | 37.66 | 37.74 | 38.17 |
| 3 | 2 | 4x128x128x128 | 23.28 | 85.91 | 86.77 | 87.39 | 89.54 |
| 3 | 4 | 4x128x128x128 | 23.83 | 167.83 | 169.41 | 170.30 | 173.47 |

FP32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 64 | 4x192x160 | 1007.91 | 63.50 | 63.93 | 64.03 | 64.19 |
| 2 | 128 | 4x192x160 | 812.08 | 157.62 | 159.02 | 159.72 | 161.24 |
| 3 | 1 | 4x128x128x128 | 8.23 | 121.45 | 122.84 | 123.93 | 124.69 |
| 3 | 2 | 4x128x128x128 | 8.42 | 237.65 | 239.90 | 240.60 | 242.85 |
| 3 | 4 | 4x128x128x128 | 8.37 | 478.01 | 482.70 | 483.43 | 484.84 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.