
nnU-Net for PyTorch


Description: An optimized, robust, and self-adapting framework for U-Net-based medical image segmentation

Publisher: NVIDIA Deep Learning Examples

Latest Version: 22.11.0

Modified: December 5, 2022

Compressed Size: 39.82 KB

Benchmarking

The following section shows how to run benchmarks to measure the model performance in training and inference modes.

Training performance benchmark

To benchmark training, run the scripts/benchmark.py script with --mode train:

python scripts/benchmark.py --mode train --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp] [--bind]

For example, to benchmark 3D U-Net training using mixed-precision on 8 GPUs with a batch size of 2, run:

python scripts/benchmark.py --mode train --gpus 8 --dim 3 --batch_size 2 --amp

By default, the script runs one warm-up epoch and begins performance benchmarking in the second epoch.

At the end of the run, a line reporting the best training throughput and latency is printed.
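The warm-up-then-measure pattern used by the benchmark script can be sketched as follows. This is a simplified illustration, not the repository's actual implementation; `benchmark_training`, `run_epoch`, and `samples_per_epoch` are hypothetical names:

```python
import time

def benchmark_training(run_epoch, num_epochs=3, warmup_epochs=1, samples_per_epoch=64):
    """Time training epochs, discarding warm-up epochs (mirrors the
    script's behavior of benchmarking from the second epoch onward)."""
    throughputs = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        run_epoch()
        elapsed = time.perf_counter() - start
        if epoch >= warmup_epochs:  # skip warm-up measurements
            throughputs.append(samples_per_epoch / elapsed)
    best = max(throughputs)
    print(f"best train throughput: {best:.2f} img/s")
    return best

# Stand-in for a real training epoch (hypothetical workload).
best = benchmark_training(lambda: time.sleep(0.01))
```

Discarding the warm-up epoch matters because the first epoch includes one-time costs (data-loader spin-up, CUDA kernel compilation) that would skew the reported throughput.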

Inference performance benchmark

To benchmark inference, run the scripts/benchmark.py script with --mode predict:

python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]

For example, to benchmark inference using mixed-precision for 3D U-Net, with a batch size of 4, run:

python scripts/benchmark.py --mode predict --dim 3 --amp --batch_size 4

By default, the script runs one warm-up pass over the data and begins inference benchmarking in the second pass.

At the end of the run, a line reporting the inference throughput and latency is printed.

Note that this benchmark reports performance numbers for iterations over samples with fixed patch sizes. The real inference process uses a sliding window over input images of arbitrary resolution, so performance may vary for images with different resolutions.
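The sliding-window idea mentioned above can be sketched for a single spatial axis: fixed-size patches are placed at a regular stride, with the final patch shifted back so it ends exactly at the image border. This is an illustrative simplification, not the repository's exact inference code:

```python
def sliding_window_starts(image_size, patch_size, overlap=0.5):
    """Start indices of fixed-size patches tiling one spatial axis.

    Windows are spaced by patch_size * (1 - overlap); the last window
    is shifted back so it ends exactly at the image border.
    """
    step = max(1, int(patch_size * (1 - overlap)))
    starts = list(range(0, max(image_size - patch_size, 0) + 1, step))
    if starts[-1] + patch_size < image_size:
        starts.append(image_size - patch_size)
    return starts

# A 192-voxel axis covered by 128-voxel patches with 50% overlap:
print(sliding_window_starts(192, 128))  # [0, 64]
```

Because the number of windows depends on the input resolution, per-image inference time varies even though each individual window has a fixed cost, which is why the fixed-patch benchmark above is not directly comparable to end-to-end inference.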

Results

The following sections provide details on how to achieve the same performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] [--bind] --learning_rate <lr> --seed <n> training script and averaging the results, in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs.

Note: We recommend using the --bind flag in multi-GPU settings to increase throughput. To launch a multi-GPU run with --bind, use the PyTorch distributed launcher, e.g., python -m torch.distributed.launch --use_env --nproc_per_node=8 scripts/benchmark.py --mode train --gpus 8 --dim 3 --amp --batch_size 2 --bind for an interactive session, or use the regular command when launching with SLURM's sbatch.

| Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - TF32 | Time to train - mixed precision | Time to train - TF32 | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 2 | 73.21 | 73.11 | 33 min | 48 min | 1.46 |
| 2 | 8 | 2 | 73.15 | 73.16 | 9 min | 13 min | 1.44 |
| 3 | 1 | 2 | 74.35 | 74.34 | 104 min | 167 min | 1.61 |
| 3 | 8 | 2 | 74.30 | 74.32 | 23 min | 36 min | 1.57 |

The reported dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
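The grid search described above enumerates every combination of learning rate, seed, and fold. A minimal sketch of generating those runs (the exact flags per run are assumptions based on the command shown earlier, not a script from the repository):

```python
from itertools import product

# Search space described in the text: learning rates 1e-4..9e-4,
# seeds {1, 3, 5}, and the 5 cross-validation folds.
learning_rates = [round(k * 1e-4, 4) for k in range(1, 10)]
seeds = [1, 3, 5]
folds = range(5)

commands = [
    f"python scripts/train.py --gpus 8 --fold {fold} --dim 3 --amp "
    f"--learning_rate {lr} --seed {seed}"
    for lr, seed, fold in product(learning_rates, seeds, folds)
]
print(len(commands))  # 9 learning rates x 3 seeds x 5 folds = 135 runs
```

The dice score for a given (learning rate, seed) pair is then the mean over its 5 fold runs, and the reported number is the best such mean.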

Training accuracy: NVIDIA DGX-1 (8x V100 32G)

Our results were obtained by running the python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] [--bind] --seed <n> training script and averaging the results, in the PyTorch 22.11 NGC container on NVIDIA DGX-1 (8x V100 32G) GPUs.

Note: We recommend using the --bind flag in multi-GPU settings to increase throughput. To launch a multi-GPU run with --bind, use the PyTorch distributed launcher, e.g., python -m torch.distributed.launch --use_env --nproc_per_node=8 scripts/benchmark.py --mode train --gpus 8 --dim 3 --amp --batch_size 2 --bind for an interactive session, or use the regular command when launching with SLURM's sbatch.

| Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 2 | 73.18 | 73.22 | 60 min | 114 min | 1.90 |
| 2 | 8 | 2 | 73.15 | 73.18 | 13 min | 19 min | 1.46 |
| 3 | 1 | 2 | 74.31 | 74.33 | 201 min | 680 min | 3.38 |
| 3 | 8 | 2 | 74.35 | 74.39 | 41 min | 153 min | 3.73 |

The reported dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp] training script in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

Note: We recommend using the --bind flag in multi-GPU settings to increase throughput. To launch a multi-GPU run with --bind, use python -m torch.distributed.launch --use_env --nproc_per_node=<ngpus> scripts/train.py --bind ... for an interactive session, or use the regular command when launching with SLURM's sbatch.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 32 | 1040.58 | 732.22 | 1.42 | - | - |
| 2 | 1 | 64 | 1238.68 | 797.37 | 1.55 | - | - |
| 2 | 1 | 128 | 1345.29 | 838.38 | 1.60 | - | - |
| 2 | 8 | 32 | 7747.27 | 5588.2 | 1.39 | 7.45 | 7.60 |
| 2 | 8 | 64 | 9417.27 | 6246.95 | 1.51 | 7.60 | 8.04 |
| 2 | 8 | 128 | 10694.1 | 6631.08 | 1.61 | 7.95 | 7.83 |
| 3 | 1 | 1 | 24.61 | 9.66 | 2.55 | - | - |
| 3 | 1 | 2 | 27.48 | 11.27 | 2.44 | - | - |
| 3 | 1 | 4 | 29.96 | 12.22 | 2.45 | - | - |
| 3 | 8 | 1 | 187.07 | 76.44 | 2.45 | 7.63 | 7.91 |
| 3 | 8 | 2 | 220.83 | 88.67 | 2.49 | 7.83 | 7.87 |
| 3 | 8 | 4 | 234.5 | 96.61 | 2.43 | 7.91 | 7.91 |

To achieve these same results, follow the steps in the Quick Start Guide.
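The weak scaling and speedup columns in these tables are plain throughput ratios. A minimal sketch, checked against two values from the 2D, batch-size-32 rows above:

```python
def weak_scaling(multi_gpu_throughput, single_gpu_throughput):
    """Weak scaling: throughput on N GPUs relative to 1 GPU,
    with the batch size per GPU held constant."""
    return multi_gpu_throughput / single_gpu_throughput

def speedup(mixed_throughput, baseline_throughput):
    """Mixed-precision speedup over the TF32/FP32 baseline."""
    return mixed_throughput / baseline_throughput

# 2D, batch size 32 per GPU, mixed precision: 8-GPU vs. 1-GPU throughput
print(round(weak_scaling(7747.27, 1040.58), 2))  # 7.45
# 2D, batch size 32, 1 GPU: mixed precision vs. TF32 throughput
print(round(speedup(1040.58, 732.22), 2))        # 1.42
```

An ideal weak-scaling value on 8 GPUs is 8.0; values below that reflect communication and synchronization overhead.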

Training performance: NVIDIA DGX-1 (8x V100 32G)

Our results were obtained by running the python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp] [--bind] training script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 (8x V100 32G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

Note: We recommend using the --bind flag in multi-GPU settings to increase throughput. To launch a multi-GPU run with --bind, use python -m torch.distributed.launch --use_env --nproc_per_node=<ngpus> scripts/train.py --bind ... for an interactive session, or use the regular command when launching with SLURM's sbatch.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|
| 2 | 1 | 32 | 561.6 | 310.21 | 1.81 | - | - |
| 2 | 1 | 64 | 657.91 | 326.02 | 2.02 | - | - |
| 2 | 1 | 128 | 706.92 | 332.81 | 2.12 | - | - |
| 2 | 8 | 32 | 3903.88 | 2396.88 | 1.63 | 6.95 | 7.73 |
| 2 | 8 | 64 | 4922.76 | 2590.66 | 1.90 | 7.48 | 7.95 |
| 2 | 8 | 128 | 5597.87 | 2667.56 | 2.10 | 7.92 | 8.02 |
| 3 | 1 | 1 | 11.38 | 2.07 | 5.50 | - | - |
| 3 | 1 | 2 | 12.34 | 2.51 | 4.92 | - | - |
| 3 | 8 | 1 | 84.38 | 16.55 | 5.10 | 7.41 | 8.00 |
| 3 | 8 | 2 | 98.17 | 20.15 | 4.87 | 7.96 | 8.03 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80G)

Our results were obtained by running the python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp] inference benchmarking script in the PyTorch 22.11 NGC container on an NVIDIA DGX A100 (1x A100 80G) GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|---|---|
| 2 | 32 | 192x160 | 1818.05 | 17.6 | 19.86 | 20.38 | 20.98 |
| 2 | 64 | 192x160 | 3645.16 | 17.56 | 19.86 | 20.82 | 23.66 |
| 2 | 128 | 192x160 | 3850.35 | 33.24 | 34.72 | 61.4 | 63.58 |
| 3 | 1 | 128x128x128 | 68.45 | 14.61 | 17.02 | 17.41 | 19.27 |
| 3 | 2 | 128x128x128 | 56.9 | 35.15 | 40.9 | 43.15 | 57.94 |
| 3 | 4 | 128x128x128 | 76.39 | 52.36 | 57.9 | 59.52 | 70.24 |

TF32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|---|---|
| 2 | 32 | 192x160 | 1868.56 | 17.13 | 51.75 | 53.07 | 54.92 |
| 2 | 64 | 192x160 | 2508.57 | 25.51 | 56.83 | 90.08 | 96.87 |
| 2 | 128 | 192x160 | 2609.6 | 49.05 | 191.48 | 201.8 | 205.29 |
| 3 | 1 | 128x128x128 | 35.02 | 28.55 | 51.75 | 53.07 | 54.92 |
| 3 | 2 | 128x128x128 | 39.88 | 50.15 | 56.83 | 90.08 | 96.87 |
| 3 | 4 | 128x128x128 | 41.32 | 96.8 | 191.48 | 201.8 | 205.29 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
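The tail latencies (90%, 95%, 99%) reported above can be computed from a list of per-batch timings with a nearest-rank percentile. This is a sketch with hypothetical timing data, not the benchmark script's actual implementation:

```python
import math

def latency_stats(latencies_ms):
    """Average and tail latencies (90th/95th/99th percentile, in ms)
    from per-batch timings, using the nearest-rank method."""
    data = sorted(latencies_ms)
    def pct(p):
        rank = max(math.ceil(p / 100 * len(data)), 1)
        return data[rank - 1]
    return {
        "avg": sum(data) / len(data),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }

# 100 hypothetical per-batch latencies: 1.0, 2.0, ..., 100.0 ms
stats = latency_stats([float(i) for i in range(1, 101)])
print(stats)  # {'avg': 50.5, 'p90': 90.0, 'p95': 95.0, 'p99': 99.0}
```

Tail percentiles matter for deployment because a 99th-percentile latency much higher than the average (as in several TF32/FP32 rows above) indicates occasional slow batches rather than uniformly slow inference.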

Inference performance: NVIDIA DGX-1 (1x V100 32G)

Our results were obtained by running the python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp] inference benchmarking script in the PyTorch 22.11 NGC container on an NVIDIA DGX-1 (1x V100 32G) GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|---|---|
| 2 | 32 | 192x160 | 1254.38 | 25.51 | 29.07 | 30.07 | 31.23 |
| 2 | 64 | 192x160 | 2024.13 | 31.62 | 71.51 | 71.78 | 72.44 |
| 2 | 128 | 192x160 | 2136.95 | 59.9 | 61.23 | 61.63 | 110.13 |
| 3 | 1 | 128x128x128 | 36.93 | 27.08 | 28.6 | 31.43 | 48.3 |
| 3 | 2 | 128x128x128 | 38.86 | 51.47 | 53.3 | 54.77 | 92.49 |
| 3 | 4 | 128x128x128 | 39.15 | 102.18 | 104.62 | 112.17 | 180.47 |

FP32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|---|---|---|---|---|---|---|---|
| 2 | 32 | 192x160 | 1019.97 | 31.37 | 32.93 | 55.58 | 69.14 |
| 2 | 64 | 192x160 | 1063.59 | 60.17 | 62.32 | 63.11 | 111.01 |
| 2 | 128 | 192x160 | 1069.81 | 119.65 | 123.48 | 123.83 | 225.46 |
| 3 | 1 | 128x128x128 | 9.92 | 100.78 | 103.2 | 103.62 | 111.97 |
| 3 | 2 | 128x128x128 | 10.14 | 197.33 | 201.05 | 201.4 | 201.79 |
| 3 | 4 | 128x128x128 | 10.25 | 390.33 | 398.21 | 399.34 | 401.05 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.