The following section shows how to run benchmarks to measure model performance in training and inference modes.
To benchmark training, run the `scripts/benchmark.py` script with `--mode train`:

```
python scripts/benchmark.py --mode train --gpus <ngpus> --dim {2,3} --batch_size <bsize> [--amp]
```
For example, to benchmark 3D U-Net training with mixed precision on 8 GPUs and a batch size of 2, run:

```
python scripts/benchmark.py --mode train --gpus 8 --dim 3 --batch_size 2 --amp
```
By default, the script runs one warm-up epoch and starts performance benchmarking in the second epoch. At the end of the run, it prints a line reporting the best training throughput and latency.
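As a rough illustration of how the reported numbers relate: throughput counts volumes processed per second across all GPUs, and latency is the time per step. The sketch below is a hypothetical stand-in, not the actual logic of `scripts/benchmark.py`; all values in it are made up:

```python
# A minimal, illustrative sketch -- NOT the actual code of scripts/benchmark.py.
# Assumes per-step wall-clock times (seconds) collected after the warm-up epoch;
# step_times_s, ngpus, and batch_size are hypothetical values.
step_times_s = [0.112, 0.108, 0.110, 0.109]
ngpus = 8
batch_size = 2

best_step = min(step_times_s)
# Throughput = volumes processed across all GPUs per second.
best_throughput = ngpus * batch_size / best_step
print(f"Best train throughput: {best_throughput:.2f} img/s, latency: {best_step * 1e3:.2f} ms")
```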
To benchmark inference, run the `scripts/benchmark.py` script with `--mode predict`:

```
python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]
```
For example, to benchmark 3D U-Net inference with mixed precision and a batch size of 4, run:

```
python scripts/benchmark.py --mode predict --dim 3 --amp --batch_size 4
```
By default, the script runs one warm-up pass over the data and starts inference benchmarking in the second pass. At the end of the run, it prints a line reporting the inference throughput and latency.
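The inference results below also report tail (90th/95th/99th percentile) latencies. As a minimal sketch (not the script's actual code), such percentiles can be computed from per-batch timings with NumPy; the `latencies_ms` values here are made up:

```python
import numpy as np

# Hypothetical per-batch inference latencies (ms), collected after the warm-up pass.
latencies_ms = np.array([19.8, 20.1, 19.9, 20.4, 19.7, 20.8, 20.0, 19.9])

print(f"Latency Avg: {latencies_ms.mean():.2f} ms")
for q in (90, 95, 99):
    print(f"Latency {q}%: {np.percentile(latencies_ms, q):.2f} ms")
```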
The following sections provide details on how to reproduce the training and inference performance and accuracy results listed below.
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp]` training script and averaging the results over the five folds, in the PyTorch 21.11 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - TF32 | Time to train - mixed precision | Time to train - TF32 | Time to train speedup (TF32 to mixed precision) |
---|---|---|---|---|---|---|---|
2 | 1 | 2 | 73.21 | 73.11 | 33 min | 48 min | 1.46 |
2 | 8 | 2 | 73.15 | 73.16 | 9 min | 13 min | 1.44 |
3 | 1 | 2 | 74.35 | 74.34 | 104 min | 167 min | 1.61 |
3 | 8 | 2 | 74.30 | 74.32 | 23 min | 36 min | 1.57 |
Our results were obtained by running the `python scripts/train.py --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp]` training script and averaging the results over the five folds, in the PyTorch 21.11 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision | Accuracy - FP32 | Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision) |
---|---|---|---|---|---|---|---|
2 | 1 | 2 | 73.18 | 73.22 | 60 min | 114 min | 1.90 |
2 | 8 | 2 | 73.15 | 73.18 | 13 min | 19 min | 1.46 |
3 | 1 | 2 | 74.31 | 74.33 | 201 min | 680 min | 3.38 |
3 | 8 | 2 | 74.35 | 74.39 | 41 min | 153 min | 3.73 |
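The time-to-train speedup column is simply the ratio of FP32 (or TF32) training time to mixed-precision training time. For example, for the 3D model on a single V100, 680 min / 201 min ≈ 3.38:

```python
# Time-to-train speedup = FP32 time / mixed-precision time,
# using the 3D, 1-GPU row of the DGX-1 table above.
fp32_minutes = 680
amp_minutes = 201
print(f"speedup: {fp32_minutes / amp_minutes:.2f}")  # -> 3.38
```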
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp]` training script in the PyTorch 21.11 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
---|---|---|---|---|---|---|---|
2 | 1 | 64 | 1129.48 | 702.82 | 1.607 | N/A | N/A |
2 | 1 | 128 | 1234.69 | 741.01 | 1.666 | N/A | N/A |
2 | 8 | 64 | 7015.45 | 4613.27 | 1.521 | 6.211 | 6.564 |
2 | 8 | 128 | 8293.61 | 5498.78 | 1.508 | 6.717 | 7.421 |
3 | 1 | 1 | 13.92 | 9.22 | 1.509 | N/A | N/A |
3 | 1 | 2 | 17.68 | 10.72 | 1.649 | N/A | N/A |
3 | 1 | 4 | 20.56 | 11.50 | 1.787 | N/A | N/A |
3 | 8 | 1 | 92.97 | 61.68 | 1.416 | 6.679 | 7.119 |
3 | 8 | 2 | 114.47 | 72.23 | 1.475 | 6.475 | 7.242 |
3 | 8 | 4 | 140.55 | 85.53 | 1.643 | 6.836 | 7.437 |
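The throughput speedup and weak scaling columns are plain ratios of measured throughputs: speedup divides mixed-precision throughput by TF32 throughput for the same configuration, and weak scaling divides 8-GPU throughput by 1-GPU throughput at the same per-GPU batch size. For example, using the 2D, batch size 64 rows above:

```python
# Ratios behind the speedup and weak-scaling columns,
# using the 2D, batch-size-64 rows of the A100 table above.
amp_1gpu, tf32_1gpu = 1129.48, 702.82  # img/s on 1 GPU
amp_8gpu = 7015.45                     # img/s on 8 GPUs

print(f"speedup (TF32 to mixed precision): {amp_1gpu / tf32_1gpu:.3f}")  # -> 1.607
print(f"weak scaling (mixed precision): {amp_8gpu / amp_1gpu:.3f}")      # -> 6.211
```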
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark.py --mode train --gpus {1,8} --dim {2,3} --batch_size <bsize> [--amp]` training script in the PyTorch 21.11 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
---|---|---|---|---|---|---|---|
2 | 1 | 64 | 607.16 | 298.84 | 2.032 | N/A | N/A |
2 | 1 | 128 | 653.44 | 307.01 | 2.128 | N/A | N/A |
2 | 8 | 64 | 4058.79 | 2196.05 | 1.848 | 6.685 | 7.349 |
2 | 8 | 128 | 4649.37 | 2388.46 | 1.848 | 7.115 | 7.779 |
3 | 1 | 1 | 8.66 | 1.99 | 4.352 | N/A | N/A |
3 | 1 | 2 | 9.65 | 2.07 | 4.662 | N/A | N/A |
3 | 1 | 4 | 9.99 | OOM | N/A | N/A | N/A |
3 | 8 | 1 | 58.45 | 15.55 | 3.756 | 6.749 | 7.819 |
3 | 8 | 2 | 66.03 | 16.22 | 4.071 | 6.842 | 7.835 |
3 | 8 | 4 | 67.37 | OOM | N/A | 6.743 | N/A |
To achieve these same results, follow the steps in the Quick Start Guide.
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmarking script in the PyTorch 21.11 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
FP16
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 64 | 4x192x160 | 3211.23 | 19.93 | 20.24 | 20.38 | 20.84 |
2 | 128 | 4x192x160 | 3465.45 | 36.94 | 38.35 | 38.72 | 38.95 |
3 | 1 | 4x128x128x128 | 41.93 | 23.85 | 24.40 | 24.61 | 24.99 |
3 | 2 | 4x128x128x128 | 44.24 | 45.21 | 47.08 | 47.38 | 48.24 |
3 | 4 | 4x128x128x128 | 45.81 | 87.31 | 88.13 | 88.56 | 89.69 |
TF32
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 64 | 4x192x160 | 2172.38 | 29.46 | 29.94 | 30.03 | 30.19 |
2 | 128 | 4x192x160 | 1769.56 | 72.34 | 72.84 | 73.04 | 74.79 |
3 | 1 | 4x128x128x128 | 23.83 | 41.97 | 42.71 | 42.76 | 42.87 |
3 | 2 | 4x128x128x128 | 26.75 | 74.77 | 75.79 | 76.06 | 77.04 |
3 | 4 | 4x128x128x128 | 27.10 | 147.62 | 147.81 | 149.14 | 190.08 |
Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.
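Throughput and average latency in these tables are consistent with each other: throughput is approximately batch size divided by average latency. For the 2D, batch size 64 FP16 row, 64 / 19.93 ms ≈ 3211 img/s, matching the reported value up to rounding:

```python
# Sanity check: throughput ~= batch_size / average latency,
# using the 2D, batch-size-64 FP16 row of the A100 table above.
batch_size = 64
latency_avg_s = 19.93e-3  # 19.93 ms
print(f"implied throughput: {batch_size / latency_avg_s:.2f} img/s")  # ~3211, as reported
```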
Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch_size <bsize> [--amp]` inference benchmarking script in the PyTorch 21.11 NGC container on NVIDIA DGX-1 (1x V100 16GB) GPU.
FP16
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 64 | 4x192x160 | 1809.79 | 35.36 | 35.75 | 35.84 | 36.21 |
2 | 128 | 4x192x160 | 1987.91 | 64.39 | 64.79 | 64.87 | 65.01 |
3 | 1 | 4x128x128x128 | 26.75 | 37.38 | 37.66 | 37.74 | 38.17 |
3 | 2 | 4x128x128x128 | 23.28 | 85.91 | 86.77 | 87.39 | 89.54 |
3 | 4 | 4x128x128x128 | 23.83 | 167.83 | 169.41 | 170.30 | 173.47 |
FP32
Dimension | Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
---|---|---|---|---|---|---|---|
2 | 64 | 4x192x160 | 1007.91 | 63.50 | 63.93 | 64.03 | 64.19 |
2 | 128 | 4x192x160 | 812.08 | 157.62 | 159.02 | 159.72 | 161.24 |
3 | 1 | 4x128x128x128 | 8.23 | 121.45 | 122.84 | 123.93 | 124.69 |
3 | 2 | 4x128x128x128 | 8.42 | 237.65 | 239.90 | 240.60 | 242.85 |
3 | 4 | 4x128x128x128 | 8.37 | 478.01 | 482.70 | 483.43 | 484.84 |
Throughput is reported in images per second. Latency is reported in milliseconds per batch. To achieve these same results, follow the steps in the Quick Start Guide.