
3D-UNet Medical Image Segmentation for TensorFlow


Description: A convolutional neural network for 3D image segmentation.
Publisher: NVIDIA
Use Case: Segmentation
Framework: TensorFlow
Latest Version: 21.10.0
Modified: February 3, 2022
Compressed Size: 249.15 KB

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following sections show how to run benchmarks that measure model performance in training and inference modes.

Training performance benchmark

To benchmark training, run one of the train_benchmark scripts in ./scripts/:

bash scripts/unet3d_train_benchmark{_TF-AMP}.sh <num/of/gpus> <path/to/dataset> <path/to/checkpoints> <batch/size>

For example, to benchmark training using mixed precision on 4 GPUs with a batch size of 2, use:

bash scripts/unet3d_train_benchmark_TF-AMP.sh 4 <path/to/dataset> <path/to/checkpoints> 2

By default, each of these scripts runs 40 warm-up iterations and then measures training performance over the next 40 iterations.

For more control, you can run main.py directly and provide all relevant run parameters. For example:

horovodrun -np <num/of/gpus> python main.py --exec_mode train --benchmark --augment --data_dir <path/to/dataset> --model_dir <path/to/checkpoints> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
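
For instance, a benchmark run on 8 GPUs with the step counts spelled out explicitly could look like the following. This is an illustrative invocation: the dataset and checkpoint paths are placeholders, and the step values assume --max_steps is the total step count, matching the default of 40 warm-up plus 40 measured steps.

horovodrun -np 8 python main.py --exec_mode train --benchmark --augment --data_dir /data/preprocessed --model_dir /results --batch_size 2 --warmup_steps 40 --max_steps 80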

At the end of the run, the script prints a line reporting the best training throughput.

Inference performance benchmark

To benchmark inference, run one of the scripts in ./scripts/:

bash scripts/unet3d_infer_benchmark{_TF-AMP}.sh <path/to/dataset> <path/to/checkpoints> <batch/size>

For example, to benchmark inference using mixed precision with a batch size of 4:

bash scripts/unet3d_infer_benchmark_TF-AMP.sh <path/to/dataset> <path/to/checkpoints> 4

By default, each of these scripts runs 20 warm-up iterations and then measures inference performance over the next 20 iterations.

For more control, you can run main.py directly and provide all relevant run parameters. For example:

python main.py --exec_mode predict --benchmark --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
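
For instance, a single-GPU inference benchmark could look like the following. This is an illustrative invocation: the paths are placeholders, and the step values assume --max_steps is the total step count, matching the default of 20 warm-up plus 20 measured steps.

python main.py --exec_mode predict --benchmark --data_dir /data/preprocessed --model_dir /results --batch_size 4 --warmup_steps 20 --max_steps 40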

At the end of the run, the script prints a line reporting the best inference throughput.

Results

The following sections provide details on how we achieved our training accuracy results and our training and inference performance results.

Training accuracy results

To reproduce these results, start the Docker container interactively and run one of the train scripts:

bash scripts/unet3d_train_full{_TF-AMP}.sh <num/of/gpus> <path/to/dataset> <path/to/checkpoint> <batch/size>

For example, to train using 8 GPUs and a batch size of 2:

bash scripts/unet3d_train_full_TF-AMP.sh 8 /data/preprocessed /results 2
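
If you have not already started the container, an interactive session in the NGC TensorFlow container could be launched along the following lines before running the script above. This is an illustrative sketch only: the image tag matches the container named in the results below, the mount paths are placeholders, and the repository's own setup instructions remain authoritative.

docker run --gpus all -it --rm -v /path/to/data:/data -v /path/to/results:/results nvcr.io/nvidia/tensorflow:21.10-tf1-py3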

The full training script runs 5-fold cross-validation training for 16,000 iterations on each fold and prints the following (the DICE metric is defined after this list):

  • the validation DICE scores for each class: Tumor Core (TC), Peritumoral Edema (ED), Enhancing Tumor (ET),
  • the mean DICE score,
  • the whole tumor (WT) DICE score, which represents a binary classification case (tumor vs. background).
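
For reference, the DICE score between a predicted region X and a ground-truth region Y is the standard overlap measure DICE(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|), which ranges from 0 (no overlap) to 1 (perfect overlap).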

The reported time is for a single fold, so training all 5 folds takes about 5 times longer. The default batch size is 2; however, if your GPU has less than 16 GB of memory and you encounter GPU memory issues, decrease the batch size. The logs of the runs can be found in the /results directory once the script has finished.
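
For example, to run the same mixed-precision training with the per-GPU batch size reduced to 1 on memory-constrained GPUs (an illustrative variation of the command above):

bash scripts/unet3d_train_full_TF-AMP.sh 8 /data/preprocessed /results 1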

Training accuracy: NVIDIA DGX A100 (8x A100 80G)

The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the scripts/unet3d_train_full{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs.

GPUs | Batch size / GPU | DICE - TF32 | DICE - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
8 | 2 | 0.8818 | 0.8819 | 8 min | 7 min | 1.14

Training accuracy: NVIDIA DGX-1 (8x V100 16G)

The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the scripts/unet3d_train_full{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs.

GPUs | Batch size / GPU | DICE - FP32 | DICE - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision)
8 | 2 | 0.8818 | 0.8819 | 33 min | 13 min | 2.54

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80G)

Our results were obtained by running the scripts/unet3d_train_benchmark{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over 80 iterations, excluding the first 40 warm-up steps. Weak scaling is the ratio of 8-GPU throughput to single-GPU throughput at the same per-GPU batch size.

GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision
1 | 2 | 10.40 | 17.91 | 1.72 | N/A | N/A
1 | 4 | 10.66 | 19.88 | 1.86 | N/A | N/A
1 | 8 | 3.99 | 20.89 | 5.23 | N/A | N/A
8 | 2 | 81.71 | 100.24 | 1.23 | 7.85 | 5.60
8 | 4 | 80.65 | 140.44 | 1.74 | 7.56 | 7.06
8 | 8 | 29.79 | 137.61 | 4.62 | 7.47 | 6.59

Training performance: NVIDIA DGX-1 (8x V100 16G)

Our results were obtained by running the scripts/unet3d_train_benchmark{_TF-AMP}.sh training script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (8x V100 16G) GPUs. Performance numbers (in volumes per second) were averaged over 80 iterations, excluding the first 40 warm-up steps.

GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
1 | 1 | 1.87 | 7.45 | 3.98 | N/A | N/A
1 | 2 | 2.32 | 8.79 | 3.79 | N/A | N/A
8 | 1 | 14.49 | 46.88 | 3.23 | 7.75 | 6.29
8 | 2 | 18.06 | 58.30 | 3.23 | 7.78 | 6.63

To achieve these same results, follow the steps in the Training performance benchmark section.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80G)

Our results were obtained by running the scripts/unet3d_infer_benchmark{_TF-AMP}.sh inference benchmarking script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU. Performance numbers (in volumes per second) were averaged over 40 iterations, excluding the first 20 warm-up steps.

FP16

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 15.58 | 67.32 | 68.63 | 78.00 | 109.42
2 | 224x224x160x4 | 15.81 | 129.06 | 129.93 | 135.31 | 166.62
4 | 224x224x160x4 | 8.34 | 479.47 | 482.55 | 487.68 | 494.80

TF32

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 9.42 | 106.22 | 106.68 | 107.67 | 122.73
2 | 224x224x160x4 | 4.69 | 427.13 | 428.33 | 428.76 | 429.19
4 | 224x224x160x4 | 2.32 | 1723.79 | 1725.77 | 1726.30 | 1728.23

To achieve these same results, follow the steps in the Inference performance benchmark section.

Inference performance: NVIDIA DGX-1 (1x V100 16G)

Our results were obtained by running the scripts/unet3d_infer_benchmark{_TF-AMP}.sh inference benchmarking script in the tensorflow:21.10-tf1-py3 NGC container on NVIDIA DGX-1 (1x V100 16G) GPU. Performance numbers (in volumes per second) were averaged over 40 iterations, excluding the first 20 warm-up steps.

FP16

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 7.64 | 136.81 | 138.94 | 143.59 | 152.74
2 | 224x224x160x4 | 7.75 | 260.66 | 267.07 | 270.88 | 274.44
4 | 224x224x160x4 | 4.78 | 838.52 | 842.88 | 843.30 | 844.62

FP32

Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms]
1 | 224x224x160x4 | 2.30 | 434.95 | 436.82 | 437.40 | 438.48
2 | 224x224x160x4 | 2.40 | 834.99 | 837.22 | 837.51 | 838.18
4 | 224x224x160x4 | OOM (out of memory) | - | - | - | -

To achieve these same results, follow the steps in the Inference performance benchmark section.