
3D-UNet Medical Image Segmentation for TensorFlow


Description: A convolutional neural network for 3D image segmentation.
Publisher: NVIDIA
Use Case: Segmentation
Framework: TensorFlow
Latest Version: 21.10.0
Modified: February 3, 2022
Compressed Size: 249.15 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following sections show how to run benchmarks that measure model performance in training and inference modes.

Training performance benchmark

To benchmark training, run one of the train_benchmark scripts in ./scripts/:

bash scripts/unet3d_train_benchmark{_TF-AMP}.sh <num/of/gpus> <path/to/dataset> <path/to/checkpoints> <batch/size>

For example, to benchmark training using mixed precision on 4 GPUs with a batch size of 2, use:

bash scripts/unet3d_train_benchmark_TF-AMP.sh 4 <path/to/dataset> <path/to/checkpoints> 2

By default, each of these scripts runs 40 warm-up iterations and then measures training performance over the next 40 iterations.

For more control, you can run main.py directly and provide all relevant run parameters. For example:

horovodrun -np <num/of/gpus> python main.py --exec_mode train --benchmark --augment --data_dir <path/to/dataset> --model_dir <path/to/checkpoints> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
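
For instance, a benchmark run on 8 GPUs with the step counts spelled out explicitly could look like the following. This is an illustrative invocation: the dataset and checkpoint paths are placeholders, and the step values assume --max_steps is the total step count, matching the default of 40 warm-up plus 40 measured steps.

horovodrun -np 8 python main.py --exec_mode train --benchmark --augment --data_dir /data/preprocessed --model_dir /results --batch_size 2 --warmup_steps 40 --max_steps 80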

At the end of the run, the script prints a line reporting the best training throughput.

Inference performance benchmark

To benchmark inference, run one of the scripts in ./scripts/:

bash scripts/unet3d_infer_benchmark{_TF-AMP}.sh <path/to/dataset> <path/to/checkpoints> <batch/size>

For example, to benchmark inference using mixed precision with a batch size of 4:

bash scripts/unet3d_infer_benchmark_TF-AMP.sh <path/to/dataset> <path/to/checkpoints> 4

By default, each of these scripts runs 20 warm-up iterations and then measures inference performance over the next 20 iterations.

For more control, you can run main.py directly and provide all relevant run parameters. For example:

python main.py --exec_mode predict --benchmark --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
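
For instance, a single-GPU inference benchmark could look like the following. This is an illustrative invocation: the paths are placeholders, and the step values assume --max_steps is the total step count, matching the default of 20 warm-up plus 20 measured steps.

python main.py --exec_mode predict --benchmark --data_dir /data/preprocessed --model_dir /results --batch_size 4 --warmup_steps 20 --max_steps 40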

At the end of the run, the script prints a line reporting the best inference throughput.

Results

The following sections provide details on how we achieved our training accuracy results and our training and inference performance results.

Training accuracy results

To reproduce these results, start the Docker container interactively and run one of the train scripts:

bash scripts/unet3d_train_full{_TF-AMP}.sh <num/of/gpus> <path/to/dataset> <path/to/checkpoint> <batch/size>

For example, to train using 8 GPUs and a batch size of 2:

bash scripts/unet3d_train_full_TF-AMP.sh 8 /data/preprocessed /results 2
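
If you have not already started the container, an interactive session in the NGC TensorFlow container could be launched along the following lines before running the script above. This is an illustrative sketch only: the image tag matches the container named in the results below, the mount paths are placeholders, and the repository's own setup instructions remain authoritative.

docker run --gpus all -it --rm -v /path/to/data:/data -v /path/to/results:/results nvcr.io/nvidia/tensorflow:21.10-tf1-py3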

The full training script runs 5-fold cross-validation training for 16,000 iterations on each fold and prints the following (the DICE metric is defined after this list):

  • the validation DICE scores for each class: Tumor Core (TC), Peritumoral Edema (ED), Enhancing Tumor (ET),
  • the mean DICE score,
  • the whole tumor (WT) DICE score, which represents a binary classification case (tumor vs. background).
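
For reference, the DICE score between a predicted region X and a ground-truth region Y is the standard overlap measure DICE(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|), which ranges from 0 (no overlap) to 1 (perfect overlap).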

The reported time is for a single fold, so training all 5 folds takes about 5 times longer. The default batch size is 2; however, if your GPU has less than 16 GB of memory and you encounter GPU memory issues, decrease the batch size. The logs of the runs can be found in the /results directory once the script has finished.
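
For example, to run the same mixed-precision training with the per-GPU batch size reduced to 1 on memory-constrained GPUs (an illustrative variation of the command above):

bash scripts/unet3d_train_full_TF-AMP.sh 8 /data/preprocessed /results 1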

Training accuracy: NVIDIA DGX A100 (8x A100 80G)

The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the scripts/unet3d_train_full{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs.

GPUs | Batch size / GPU | DICE - TF32 | DICE - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
8 | 2 | 0.8818 | 0.8819 | 8 min | 7 min | 1.14

Training accuracy: NVIDIA DGX-1 (8x V100 16G)

The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the scripts/unet3d_train_full{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs.

GPUs | Batch size / GPU | DICE - FP32 | DICE - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision)
8 | 2 | 0.8818 | 0.8819 | 33 min | 13 min | 2.54

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the scripts/unet3d_train_benchmark{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over 80 iterations, excluding the first 40 warm-up steps. Weak scaling is the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size.

GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
1 | 2 | 10.40 | 17.91 | 1.72 | N/A | N/A
1 | 4 | 10.66 | 19.88 | 1.86 | N/A | N/A
1 | 8 | 3.99 | 20.89 | 5.23 | N/A | N/A
8 | 2 | 81.71 | 100.24 | 1.23 | 7.85 | 5.60
8 | 4 | 80.65 | 140.44 | 1.74 | 7.56 | 7.06
8 | 8 | 29.79 | 137.61 | 4.62 | 7.47 | 6.59

Training performance: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the scripts/unet3d_train_benchmark{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs. Performance numbers (in volumes per second) were averaged over 80 iterations, excluding the first 40 warm-up steps.

GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
1 | 1 | 1.87 | 7.45 | 3.98 | N/A | N/A
1 | 2 | 2.32 | 8.79 | 3.79 | N/A | N/A
8 | 1 | 14.49 | 46.88 | 3.23 | 7.75 | 6.29
8 | 2 | 18.06 | 58.30 | 3.23 | 7.78 | 6.63

To achieve these same results, follow the steps in the Training performance benchmark section.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80G)

Our results were obtained by running the scripts/unet3d_infer_benchmark{_TF-AMP}.sh inference benchmarking script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU. Performance numbers (in volumes per second) were averaged over 40 iterations, excluding the first 20 warm-up steps.

FP16

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 15.58 | 67.32 | 68.63 | 78.00 | 109.42
2 | 224x224x160x4 | 15.81 | 129.06 | 129.93 | 135.31 | 166.62
4 | 224x224x160x4 | 8.34 | 479.47 | 482.55 | 487.68 | 494.80

TF32

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 9.42 | 106.22 | 106.68 | 107.67 | 122.73
2 | 224x224x160x4 | 4.69 | 427.13 | 428.33 | 428.76 | 429.19
4 | 224x224x160x4 | 2.32 | 1723.79 | 1725.77 | 1726.30 | 1728.23

To achieve these same results, follow the steps in the Inference performance benchmark section.

Inference performance: NVIDIA DGX-1 (1x V100 16G)

Our results were obtained by running the scripts/unet3d_infer_benchmark{_TF-AMP}.sh inference benchmarking script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (1x V100 16G) GPU. Performance numbers (in volumes per second) were averaged over 40 iterations, excluding the first 20 warm-up steps.

FP16

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 7.64 | 136.81 | 138.94 | 143.59 | 152.74
2 | 224x224x160x4 | 7.75 | 260.66 | 267.07 | 270.88 | 274.44
4 | 224x224x160x4 | 4.78 | 838.52 | 842.88 | 843.30 | 844.62

FP32

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 2.30 | 434.95 | 436.82 | 437.40 | 438.48
2 | 224x224x160x4 | 2.40 | 834.99 | 837.22 | 837.51 | 838.18
4 | 224x224x160x4 | OOM (out of memory) | - | - | - | -

To achieve these same results, follow the steps in the Inference performance benchmark section.