nnU-Net for TensorFlow2

Description: An optimized, robust, and self-adapting framework for U-Net-based medical image segmentation

Publisher: NVIDIA Deep Learning Examples

Latest Version: 22.11.0

Modified: December 5, 2022

Compressed Size: 43.72 KB

Benchmarking

The following section shows how to run benchmarks to measure the model performance in training and inference modes.

Training performance benchmark

To benchmark training, run the scripts/benchmark.py script with --mode train:

python scripts/benchmark.py --xla --mode train --gpus <ngpus> --dim {2,3} --batch-size <bsize> [--amp]

For example, to benchmark 3D U-Net training using mixed-precision on 8 GPUs with batch size of 2, run:

python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --batch-size 2 --amp

By default, the script runs a warm-up of 100 iterations and then benchmarks for another 100 steps. You can adjust these settings with the --warmup-steps and --bench-steps parameters.

At the end of the run, the script prints a line reporting the training throughput and latency.
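For orientation, the following is a minimal sketch of what such a benchmark loop measures, assuming a hypothetical step_fn that runs one training step; it is not the repository's actual implementation. The warm-up iterations are discarded so that one-time costs do not distort the timings.

```python
import time

def benchmark(step_fn, batch_size, warmup_steps=100, bench_steps=100):
    # Warm-up phase: run steps whose timings are discarded, so one-time
    # costs (XLA compilation, cuDNN autotuning) do not skew the results.
    for _ in range(warmup_steps):
        step_fn()
    # Timed phase: record the latency of each step.
    latencies = []
    for _ in range(bench_steps):
        start = time.perf_counter()
        step_fn()
        latencies.append(time.perf_counter() - start)
    avg_latency_ms = 1000.0 * sum(latencies) / len(latencies)
    throughput = batch_size * len(latencies) / sum(latencies)  # img/s
    print(f"throughput: {throughput:.2f} img/s, latency: {avg_latency_ms:.2f} ms")
```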

Inference performance benchmark

To benchmark inference, run the scripts/benchmark.py script with --mode predict:

python scripts/benchmark.py --xla --mode predict --gpus <ngpus> --dim {2,3} --batch-size <bsize> [--amp]

For example, to benchmark inference using mixed-precision for 3D U-Net on 1 GPU, with a batch size of 4, run:

python scripts/benchmark.py --xla --mode predict --gpus 1 --dim 3 --batch-size 4 --amp 

By default, the script runs a warm-up of 100 iterations and then benchmarks for another 100 steps. You can adjust these settings with the --warmup-steps and --bench-steps parameters.

At the end of the run, the script prints a line reporting the inference throughput and latency.

Note that this benchmark reports performance numbers for iterations over samples with a fixed patch size. The real inference process uses a sliding window over input images of arbitrary resolution, so performance may vary for images with different resolutions.
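To make the distinction concrete, the following is a minimal sliding-window sketch, assuming a hypothetical model callable with a single output channel and a channels-last (D, H, W, C) volume; it is not the repository's implementation, which additionally handles window clamping at volume borders and weighting of overlapping predictions.

```python
import numpy as np

def sliding_window_predict(model, volume, patch_size=(128, 128, 128)):
    # model: hypothetical callable mapping a (1, *patch_size, C) batch to a
    # (1, *patch_size, 1) prediction; volume: array of shape (D, H, W, C).
    out = np.zeros(volume.shape[:3] + (1,), dtype=np.float32)
    counts = np.zeros_like(out)
    stride = [p // 2 for p in patch_size]  # 50% overlap between windows
    for z in range(0, volume.shape[0] - patch_size[0] + 1, stride[0]):
        for y in range(0, volume.shape[1] - patch_size[1] + 1, stride[1]):
            for x in range(0, volume.shape[2] - patch_size[2] + 1, stride[2]):
                sl = (slice(z, z + patch_size[0]),
                      slice(y, y + patch_size[1]),
                      slice(x, x + patch_size[2]))
                out[sl] += model(volume[sl][None])[0]
                counts[sl] += 1.0
    # Average overlapping predictions. The number of windows, and hence the
    # runtime, grows with the input resolution.
    return out / np.maximum(counts, 1.0)
```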

Results

The following sections provide details on how to achieve the same performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8xA100 80G)

Our results were obtained by running the python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} --learning_rate <lr> [--amp] --seed <n> training script and averaging results in the TensorFlow 22.11 NGC container on NVIDIA DGX A100 with (8x A100 80G) GPUs.

| Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - TF32 | Time to train - mixed precision | Time to train - TF32 | Time to train speedup (TF32 to mixed precision) |
|-----------|------|------------------|------------------------|-------------|---------------------------------|----------------------|--------------------------------------------------|
| 2 | 1 | 64 | 0.7312 | 0.7302 | 29 min | 40 min | 1.38 |
| 2 | 8 | 64 | 0.7322 | 0.7310 | 8 min | 10 min | 1.22 |
| 3 | 1 | 2 | 0.7435 | 0.7441 | 85 min | 153 min | 1.79 |
| 3 | 8 | 2 | 0.7440 | 0.7438 | 19 min | 33 min | 1.69 |

The reported Dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
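As an illustration of this protocol, the loop below sketches the grid search and 5-fold averaging; run_and_parse_dice is a hypothetical placeholder standing in for launching a training run and extracting its final Dice score.

```python
import itertools

def run_and_parse_dice(fold, lr, seed):
    # Hypothetical helper: it would launch, e.g.,
    #   python scripts/train.py --xla --gpus 8 --fold <fold> --dim 3 --amp \
    #       --learning_rate <lr> --seed <seed>
    # and parse the final Dice score from the run's output. The log format
    # is not specified here, so this stays a placeholder.
    raise NotImplementedError

learning_rates = [i * 1e-4 for i in range(1, 10)]  # 1e-4, 2e-4, ..., 9e-4
seeds = [1, 3, 5]

best = None
for lr, seed in itertools.product(learning_rates, seeds):
    # Average the Dice score over the 5 cross-validation folds.
    avg_dice = sum(run_and_parse_dice(f, lr, seed) for f in range(5)) / 5
    if best is None or avg_dice > best[0]:
        best = (avg_dice, lr, seed)
print(f"best 5-fold Dice: {best[0]:.4f} (lr={best[1]:.0e}, seed={best[2]})")
```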

Training accuracy: NVIDIA DGX-1 (8xV100 32G)

Our results were obtained by running the python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] --seed <n> training script and averaging results in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs.

| Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision) |
|-----------|------|------------------|------------------------|-------------|---------------------------------|----------------------|--------------------------------------------------|
| 2 | 1 | 64 | 0.7315 | 0.7311 | 52 min | 102 min | 1.96 |
| 2 | 8 | 64 | 0.7312 | 0.7316 | 12 min | 17 min | 1.41 |
| 3 | 1 | 2 | 0.7435 | 0.7441 | 181 min | 580 min | 3.20 |
| 3 | 8 | 2 | 0.7434 | 0.7440 | 35 min | 131 min | 3.74 |

The reported Dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.

Training performance results

Training performance: NVIDIA DGX A100 (8xA100 80G)

Our results were obtained by running the python scripts/benchmark.py --xla --mode train --gpus {1,8} --dim {2,3} --batch-size <bsize> [--amp] training script in the TensorFlow 22.11 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

Note: We recommend using the --bind flag for multi-GPU runs to increase throughput. To launch multi-GPU training with --bind, you also have to add --horovod, e.g., python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --amp --batch-size 2 --bind --horovod for an interactive session; when launching with SLURM's sbatch, use the regular command.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
|-----------|------|------------------|--------------------------------------|---------------------------|----------------------------------------------|--------------------------------|---------------------|
| 2 | 1 | 32 | 1347.19 | 748.56 | 1.80 | - | - |
| 2 | 1 | 64 | 1662.8 | 804.23 | 2.07 | - | - |
| 2 | 1 | 128 | 1844.7 | 881.87 | 2.09 | - | - |
| 2 | 8 | 32 | 9056.45 | 5420.51 | 1.67 | 6.72 | 6.91 |
| 2 | 8 | 64 | 11687.11 | 6250.52 | 1.87 | 7.03 | 7.49 |
| 2 | 8 | 128 | 13679.76 | 6841.78 | 2.00 | 7.42 | 7.66 |
| 3 | 1 | 1 | 27.02 | 11.63 | 2.32 | - | - |
| 3 | 1 | 2 | 29.3 | 11.81 | 2.48 | - | - |
| 3 | 1 | 4 | 31.87 | 12.17 | 2.62 | - | - |
| 3 | 8 | 1 | 186.84 | 91.11 | 2.05 | 7.24 | 7.83 |
| 3 | 8 | 2 | 219.34 | 92.91 | 2.36 | 7.77 | 7.87 |
| 3 | 8 | 4 | 244.01 | 96.52 | 2.53 | 7.76 | 7.93 |
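Weak scaling is the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size, so the ideal value is 8. For example, for 2D U-Net with batch size 128 per GPU in mixed precision: 13679.76 / 1844.7 ≈ 7.42.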

To achieve these same results, follow the steps in the Quick Start Guide.

Training performance: NVIDIA DGX-1 (8xV100 32G)

Our results were obtained by running the python scripts/benchmark.py --xla --mode train --gpus {1,8} --dim {2,3} --batch-size <bsize> [--amp] training script in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.

Note: We recommend using the --bind flag for multi-GPU runs to increase throughput. To launch multi-GPU training with --bind, you also have to add --horovod, e.g., python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --amp --batch-size 2 --bind --horovod for an interactive session; when launching with SLURM's sbatch, use the regular command.

| Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|-----------|------|------------------|--------------------------------------|---------------------------|----------------------------------------------|--------------------------------|---------------------|
| 2 | 1 | 32 | 697.36 | 312.51 | 2.23 | - | - |
| 2 | 1 | 64 | 819.15 | 337.42 | 2.43 | - | - |
| 2 | 1 | 128 | 894.94 | 352.32 | 2.54 | - | - |
| 2 | 8 | 32 | 4355.65 | 2260.37 | 1.93 | 6.25 | 7.23 |
| 2 | 8 | 64 | 5696.41 | 2585.65 | 2.20 | 6.95 | 7.66 |
| 2 | 8 | 128 | 6714.96 | 2779.25 | 2.42 | 7.50 | 7.89 |
| 3 | 1 | 1 | 12.15 | 2.08 | 5.84 | - | - |
| 3 | 1 | 2 | 13.13 | 2.5 | 5.25 | - | - |
| 3 | 8 | 1 | 82.62 | 16.59 | 4.98 | 6.80 | 7.98 |
| 3 | 8 | 2 | 97.68 | 19.91 | 4.91 | 7.44 | 7.96 |

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1xA100 80G)

Our results were obtained by running the python scripts/benchmark.py --xla --mode predict --dim {2,3} --batch-size <bsize> [--amp] inference benchmarking script in the TensorFlow 22.11 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 32 | 192x160 | 1728.03 | 18.52 | 22.55 | 23.18 | 24.82 |
| 2 | 64 | 192x160 | 4160.91 | 15.38 | 17.49 | 18.53 | 19.88 |
| 2 | 128 | 192x160 | 4672.52 | 27.39 | 27.68 | 27.79 | 27.87 |
| 3 | 1 | 128x128x128 | 78.2 | 12.79 | 14.29 | 14.87 | 15.25 |
| 3 | 2 | 128x128x128 | 63.76 | 31.37 | 36.07 | 40.02 | 42.44 |
| 3 | 4 | 128x128x128 | 83.17 | 48.1 | 50.96 | 52.08 | 52.56 |

TF32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 32 | 192x160 | 2067.63 | 15.48 | 17.97 | 19.12 | 19.77 |
| 2 | 64 | 192x160 | 2447 | 26.15 | 26.43 | 26.48 | 26.62 |
| 2 | 128 | 192x160 | 2514.75 | 50.9 | 51.15 | 51.23 | 51.28 |
| 3 | 1 | 128x128x128 | 38.85 | 25.74 | 26.04 | 26.19 | 27.41 |
| 3 | 2 | 128x128x128 | 40.1 | 49.87 | 50.31 | 50.44 | 50.57 |
| 3 | 4 | 128x128x128 | 41.69 | 95.95 | 97.09 | 97.41 | 98.03 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
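For reference, the latency columns above can be derived from per-batch timings along these lines; this is a minimal sketch of the arithmetic, not the script's actual reporting code.

```python
import numpy as np

def summarize(latencies_s, batch_size):
    # latencies_s: per-batch inference times in seconds from the timed phase.
    lat_ms = np.asarray(latencies_s) * 1000.0
    print(f"Throughput Avg [img/s]: {1000.0 * batch_size / lat_ms.mean():.2f}")
    print(f"Latency Avg [ms]: {lat_ms.mean():.2f}")
    for q in (90, 95, 99):
        print(f"Latency {q}% [ms]: {np.percentile(lat_ms, q):.2f}")
```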

Inference performance: NVIDIA DGX-1 (1xV100 32G)

Our results were obtained by running the python scripts/benchmark.py --mode predict --dim {2,3} --batch-size <bsize> [--amp] inference benchmarking script in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 (1x V100 32G) GPU.

FP16

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 32 | 192x160 | 1166.83 | 27.42 | 28.76 | 28.91 | 29.16 |
| 2 | 64 | 192x160 | 2263.21 | 28.28 | 30.63 | 31.83 | 32.5 |
| 2 | 128 | 192x160 | 2387.06 | 53.62 | 53.97 | 54.07 | 54.3 |
| 3 | 1 | 128x128x128 | 36.87 | 27.12 | 27.32 | 27.37 | 27.42 |
| 3 | 2 | 128x128x128 | 37.65 | 53.12 | 53.49 | 53.59 | 53.71 |
| 3 | 4 | 128x128x128 | 38.8 | 103.11 | 104.16 | 104.3 | 104.75 |

FP32

| Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|-----------|------------|------------|------------------------|------------------|------------------|------------------|------------------|
| 2 | 32 | 192x160 | 990.61 | 32.3 | 32.46 | 32.51 | 32.78 |
| 2 | 64 | 192x160 | 1034.22 | 61.88 | 62.19 | 62.32 | 62.56 |
| 2 | 128 | 192x160 | 1084.21 | 118.06 | 118.45 | 118.6 | 118.95 |
| 3 | 1 | 128x128x128 | 9.65 | 103.62 | 104.46 | 104.52 | 104.63 |
| 3 | 2 | 128x128x128 | 9.96 | 200.75 | 202.51 | 202.74 | 202.86 |
| 3 | 4 | 128x128x128 | 10.13 | 394.74 | 396.74 | 397.0 | 397.82 |

Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.

Known issues

There are no known issues in this release.