The following section shows how to run benchmarks to measure the model performance in training and inference modes.
To benchmark training, run the `scripts/benchmark.py` script with `--mode train`:

python scripts/benchmark.py --mode train --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp] [--bind]
For example, to benchmark 3D U-Net training with mixed precision on 8 GPUs and a batch size of 2, run:

python scripts/benchmark.py --mode train --gpus 8 --dim 3 --batch_size 2 --amp
By default, the script runs one warm-up epoch and starts performance benchmarking during the second epoch. At the end of the run, it prints a line reporting the best training throughput and latency.
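If you want to sweep the benchmark over several configurations, a minimal driver sketch is shown below. Only the `scripts/benchmark.py` flags documented above are taken from this repository; the batch sizes chosen and the log-file layout are illustrative assumptions, and the exact format of the final throughput/latency line should be checked in the script's output.

```python
# Hypothetical sweep over 3D training batch sizes using the benchmark flags
# documented above. Batch sizes and log-file naming are illustrative only.
import subprocess
from pathlib import Path

LOG_DIR = Path("benchmark_logs")
LOG_DIR.mkdir(exist_ok=True)

for batch_size in (1, 2, 4):  # example 3D batch sizes
    cmd = [
        "python", "scripts/benchmark.py",
        "--mode", "train",
        "--gpus", "8",
        "--dim", "3",
        "--batch_size", str(batch_size),
        "--amp",
    ]
    log_file = LOG_DIR / f"train_dim3_bs{batch_size}.log"
    with log_file.open("w") as f:
        # The throughput/latency summary printed at the end of each run is
        # captured in the log file for later inspection.
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=True)
```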
To benchmark inference, run the `scripts/benchmark.py` script with `--mode predict`:

python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]
For example, to benchmark 3D U-Net inference with mixed precision and a batch size of 4, run:

python scripts/benchmark.py --mode predict --dim 3 --amp --batch_size 4
By default, the script runs a warm-up over one data pass and starts inference benchmarking during the second pass. At the end of the run, it prints a line reporting the inference throughput and latency.
Note that this benchmark reports performance numbers for iterations over samples with fixed patch sizes. The real inference process uses a sliding window over input images of arbitrary resolution, so performance may vary for images with different resolutions.
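To get a feel for how much extra work the sliding window adds, the sketch below counts the number of 128x128x128 patches needed to cover volumes of different shapes. The 0.5 window overlap and the example image shapes are assumptions made purely for illustration; only the patch size matches the benchmark above.

```python
# Back-of-the-envelope count of sliding-window patches per volume.
# The 0.5 overlap and the example shapes are assumed values for illustration.
import math

def num_windows(image_shape, patch_size=(128, 128, 128), overlap=0.5):
    """Count sliding-window patches needed to cover a volume."""
    windows = 1
    for dim, patch in zip(image_shape, patch_size):
        if dim <= patch:
            n = 1
        else:
            step = max(1, int(patch * (1 - overlap)))
            n = math.ceil((dim - patch) / step) + 1
        windows *= n
    return windows

for shape in [(128, 128, 128), (192, 192, 160), (320, 256, 256)]:
    print(shape, "->", num_windows(shape), "windows per volume")
```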
The following sections provide details on how to achieve the same performance and accuracy in training and inference.
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] [--bind] --learning_rate lr --seed n` training script and averaging results in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs.
Note: We recommend using the `--bind` flag for multi-GPU settings to increase throughput. To launch multi-GPU training with `--bind`, use the PyTorch distributed launcher, e.g., `python -m torch.distributed.launch --use_env --nproc_per_node=8 scripts/benchmark.py --mode train --gpus 8 --dim 3 --amp --batch_size 2 --bind` for an interactive session, or use the regular command when launching with SLURM's sbatch.
Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - TF32 | Time to train - mixed precision | Time to train - TF32 | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|---|
2 | 1 | 2 | 73.21 | 73.11 | 33 min | 48 min | 1.46 |
2 | 8 | 2 | 73.15 | 73.16 | 9 min | 13 min | 1.44 |
3 | 1 | 2 | 74.35 | 74.34 | 104 min | 167 min | 1.61 |
3 | 8 | 2 | 74.30 | 74.32 | 23 min | 36 min | 1.57 |
The reported Dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
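A minimal sketch of how that grid search could be driven is shown below: it launches `scripts/train.py` run by run with the flags from the command above. The sequential loop and the fixed `--gpus 8 --dim 3 --amp` setting are simplifications for illustration, and the `--bind`/distributed-launcher variant from the note above is omitted.

```python
# Hypothetical driver for the learning-rate/seed/fold grid search described
# above. Running the full grid sequentially like this is slow; in practice the
# runs would be spread across separate jobs.
import subprocess

learning_rates = [f"{i}e-4" for i in range(1, 10)]  # 1e-4, 2e-4, ..., 9e-4
seeds = [1, 3, 5]
folds = range(5)

for lr in learning_rates:
    for seed in seeds:
        for fold in folds:
            cmd = [
                "python", "scripts/train.py",
                "--gpus", "8",
                "--fold", str(fold),
                "--dim", "3",
                "--amp",
                "--learning_rate", lr,
                "--seed", str(seed),
            ]
            subprocess.run(cmd, check=True)
```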
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] [--bind] --seed n` training script and averaging results in the PyTorch 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs.
Note: We recommend using the `--bind` flag for multi-GPU settings to increase throughput. To launch multi-GPU training with `--bind`, use the PyTorch distributed launcher, e.g., `python -m torch.distributed.launch --use_env --nproc_per_node=8 scripts/benchmark.py --mode train --gpus 8 --dim 3 --amp --batch_size 2 --bind` for an interactive session, or use the regular command when launching with SLURM's sbatch.
Dimension | GPUs | Batch size / GPU | Dice - mixed precision | Dice - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
2 | 1 | 2 | 73.18 | 73.22 | 60 min | 114 min | 1.90 |
2 | 8 | 2 | 73.15 | 73.18 | 13 min | 19 min | 1.46 |
3 | 1 | 2 | 74.31 | 74.33 | 201 min | 680 min | 3.38 |
3 | 8 | 2 | 74.35 | 74.39 | 41 min | 153 min | 3.73 |
The reported Dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp]` training script in the NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
Note: We recommend using the `--bind` flag for multi-GPU settings to increase throughput. To launch multi-GPU training with `--bind`, use `python -m torch.distributed.launch --use_env --nproc_per_node=<ngpus> scripts/train.py --bind ...` for an interactive session, or use the regular command when launching with SLURM's sbatch.
Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|
2 | 1 | 32 | 1040.58 | 732.22 | 1.42 | - | - |
2 | 1 | 64 | 1238.68 | 797.37 | 1.55 | - | - |
2 | 1 | 128 | 1345.29 | 838.38 | 1.60 | - | - |
2 | 8 | 32 | 7747.27 | 5588.2 | 1.39 | 7.45 | 7.60 |
2 | 8 | 64 | 9417.27 | 6246.95 | 1.51 | 7.60 | 8.04 |
2 | 8 | 128 | 10694.1 | 6631.08 | 1.61 | 7.95 | 7.83 |
3 | 1 | 1 | 24.61 | 9.66 | 2.55 | - | - |
3 | 1 | 2 | 27.48 | 11.27 | 2.44 | - | - |
3 | 1 | 4 | 29.96 | 12.22 | 2.45 | - | - |
3 | 8 | 1 | 187.07 | 76.44 | 2.45 | 7.63 | 7.91 |
3 | 8 | 2 | 220.83 | 88.67 | 2.49 | 7.83 | 7.87 |
3 | 8 | 4 | 234.5 | 96.61 | 2.43 | 7.91 | 7.91 |
To achieve these same results, follow the steps in the Quick Start Guide.
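For reference, the weak-scaling column follows the usual definition: multi-GPU throughput divided by single-GPU throughput at the same per-GPU batch size. The sketch below reproduces that calculation for the mixed-precision 2D, batch-size-128 entries in the table above; the exact reported values depend on the individual benchmark runs.

```python
# Weak scaling computed from the mixed-precision 2D, batch-size-128 rows above.
single_gpu_imgs_per_s = 1345.29  # 1x A100, dim 2, batch size 128, AMP
eight_gpu_imgs_per_s = 10694.1   # 8x A100, dim 2, batch size 128, AMP

speedup = eight_gpu_imgs_per_s / single_gpu_imgs_per_s
efficiency = speedup / 8

print(f"weak scaling: {speedup:.2f}x, parallel efficiency: {efficiency:.1%}")
# -> weak scaling: 7.95x, parallel efficiency: 99.4%
```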
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp] [--bind]` training script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
Note: We recommend using the `--bind` flag for multi-GPU settings to increase throughput. To launch multi-GPU training with `--bind`, use `python -m torch.distributed.launch --use_env --nproc_per_node=<ngpus> scripts/train.py --bind ...` for an interactive session, or use the regular command when launching with SLURM's sbatch.
Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
2 | 1 | 32 | 561.6 | 310.21 | 1.81 | - | - |
2 | 1 | 64 | 657.91 | 326.02 | 2.02 | - | - |
2 | 1 | 128 | 706.92 | 332.81 | 2.12 | - | - |
2 | 8 | 32 | 3903.88 | 2396.88 | 1.63 | 6.95 | 7.73 |
2 | 8 | 64 | 4922.76 | 2590.66 | 1.90 | 7.48 | 7.95 |
2 | 8 | 128 | 5597.87 | 2667.56 | 2.10 | 7.92 | 8.02 |
3 | 1 | 1 | 11.38 | 2.07 | 5.50 | - | - |
3 | 1 | 2 | 12.34 | 2.51 | 4.92 | - | - |
3 | 8 | 1 | 84.38 | 16.55 | 5.10 | 7.41 | 8.00 |
3 | 8 | 2 | 98.17 | 20.15 | 4.87 | 7.96 | 8.03 |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmark script in the PyTorch 22.11 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU.
FP16
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 32 | 192x160 | 1818.05 | 17.6 | 19.86 | 20.38 | 20.98 |
2 | 64 | 192x160 | 3645.16 | 17.56 | 19.86 | 20.82 | 23.66 |
2 | 128 | 192x160 | 3850.35 | 33.24 | 34.72 | 61.4 | 63.58 |
3 | 1 | 128x128x128 | 68.45 | 14.61 | 17.02 | 17.41 | 19.27 |
3 | 2 | 128x128x128 | 56.9 | 35.15 | 40.9 | 43.15 | 57.94 |
3 | 4 | 128x128x128 | 76.39 | 52.36 | 57.9 | 59.52 | 70.24 |
TF32
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 32 | 192x160 | 1868.56 | 17.13 | 51.75 | 53.07 | 54.92 |
2 | 64 | 192x160 | 2508.57 | 25.51 | 56.83 | 90.08 | 96.87 |
2 | 128 | 192x160 | 2609.6 | 49.05 | 191.48 | 201.8 | 205.29 |
3 | 1 | 128x128x128 | 35.02 | 28.55 | 51.75 | 53.07 | 54.92 |
3 | 2 | 128x128x128 | 39.88 | 50.15 | 56.83 | 90.08 | 96.87 |
3 | 4 | 128x128x128 | 41.32 | 96.8 | 191.48 | 201.8 | 205.29 |
Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
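The latency columns above are summary statistics over per-batch timings. The sketch below shows how such a summary could be computed with NumPy; the synthetic timings are placeholder data, not output of the benchmark script.

```python
# Summarizing per-batch latencies the way the tables above report them.
import numpy as np

# Placeholder data standing in for one wall-clock measurement per batch.
rng = np.random.default_rng(0)
latencies_ms = rng.normal(loc=17.6, scale=1.2, size=500).clip(min=0)

print(f"Latency Avg [ms]: {latencies_ms.mean():.2f}")
for q in (90, 95, 99):
    print(f"Latency {q}% [ms]: {np.percentile(latencies_ms, q):.2f}")
```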
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmark script in the PyTorch 22.11 NGC container on NVIDIA DGX-1 with (1x V100 32G) GPU.
FP16
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 32 | 192x160 | 1254.38 | 25.51 | 29.07 | 30.07 | 31.23 |
2 | 64 | 192x160 | 2024.13 | 31.62 | 71.51 | 71.78 | 72.44 |
2 | 128 | 192x160 | 2136.95 | 59.9 | 61.23 | 61.63 | 110.13 |
3 | 1 | 128x128x128 | 36.93 | 27.08 | 28.6 | 31.43 | 48.3 |
3 | 2 | 128x128x128 | 38.86 | 51.47 | 53.3 | 54.77 | 92.49 |
3 | 4 | 128x128x128 | 39.15 | 102.18 | 104.62 | 112.17 | 180.47 |
FP32
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 32 | 192x160 | 1019.97 | 31.37 | 32.93 | 55.58 | 69.14 |
2 | 64 | 192x160 | 1063.59 | 60.17 | 62.32 | 63.11 | 111.01 |
2 | 128 | 192x160 | 1069.81 | 119.65 | 123.48 | 123.83 | 225.46 |
3 | 1 | 128x128x128 | 9.92 | 100.78 | 103.2 | 103.62 | 111.97 |
3 | 2 | 128x128x128 | 10.14 | 197.33 | 201.05 | 201.4 | 201.79 |
3 | 4 | 128x128x128 | 10.25 | 390.33 | 398.21 | 399.34 | 401.05 |
Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
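As a quick consistency check on these tables, average throughput is approximately the batch size divided by the average per-batch latency. The values below are copied from the V100 FP16 3D rows above.

```python
# Throughput implied by batch size and average latency, V100 FP16 3D rows.
rows = [
    # (batch size, reported throughput [img/s], average latency [ms])
    (1, 36.93, 27.08),
    (2, 38.86, 51.47),
    (4, 39.15, 102.18),
]

for batch_size, throughput, latency_ms in rows:
    implied = batch_size / (latency_ms / 1000.0)
    print(f"bs={batch_size}: reported {throughput:.2f} img/s, "
          f"implied {implied:.2f} img/s")
```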