
nnU-Net for TensorFlow2



An optimized, robust and self-adapting framework for U-Net based medical image segmentation


NVIDIA Deep Learning Examples

Use Case




Latest Version



November 4, 2022

Compressed Size

43.03 KB

This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and is available for download from NGC.

nnU-Net ("no-new-Net") is a robust and self-adapting framework for U-Net based medical image segmentation. This repository contains an nnU-Net implementation as described in the paper: nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation.

The differences between this nnU-Net and the original model are:

  • Dynamic selection of patch size is not supported; the patch size has to be set in data_preprocessing/ file.
  • Cascaded U-Net is not supported.
  • The following data augmentations are not used: rotation, simulation of low resolution, gamma augmentation.

This model is trained with mixed precision using Tensor Cores on the NVIDIA Volta, Turing, and Ampere GPU architectures. As a result, researchers can get results 2x faster than training without Tensor Cores while keeping the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

nnU-Net allows training two types of networks, a 2D U-Net and a 3D U-Net, to perform semantic segmentation of 2D or 3D images with high accuracy and performance.

The following figure shows the architecture of the 3D U-Net model and its components. U-Net is composed of a contracting path and an expanding path, and it builds a bottleneck in its centermost part through a combination of convolution, instance normalization, and leaky ReLU operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added to help the backward flow of gradients and thereby improve training.

Figure 1: The 3D U-Net architecture

Default configuration

All convolution blocks in the U-Net encoder and decoder use two convolution layers followed by instance normalization and a leaky ReLU nonlinearity. Downsampling uses strided convolution, and upsampling uses transposed convolution.
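The shape bookkeeping implied by this design can be sketched in plain Python. The sketch assumes stride-2 convolutions with "same" padding (output size ceil(n/2)) and stride-2 transposed convolutions (output size 2n). The function names, patch sizes, and depths are hypothetical and are not taken from the repository.

```python
import math

def encoder_shapes(patch_size, depth):
    """Spatial shape at each encoder level, assuming stride-2 'same' convolutions."""
    shapes = [tuple(patch_size)]
    for _ in range(depth):
        shapes.append(tuple(math.ceil(n / 2) for n in shapes[-1]))
    return shapes

def decoder_shapes(bottleneck_shape, depth):
    """Spatial shape at each decoder level, assuming stride-2 transposed convolutions."""
    shapes = [tuple(bottleneck_shape)]
    for _ in range(depth):
        shapes.append(tuple(2 * n for n in shapes[-1]))
    return shapes

def skip_connections_match(patch_size, depth):
    """True if each decoder level has the same shape as the encoder level it is joined with."""
    enc = encoder_shapes(patch_size, depth)
    dec = decoder_shapes(enc[-1], depth)
    return dec == enc[::-1]
```

Under these assumptions, a patch size divisible by 2 raised to the number of downsampling steps keeps encoder and decoder shapes aligned for the skip connections; other sizes would need cropping or padding.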

All models were trained with the Adam optimizer. The loss function is the average of the cross-entropy loss and the Dice loss.
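A minimal NumPy sketch of such a combined loss is given here for illustration. It assumes channels-last logits, a soft Dice term computed per class over all voxels and expressed as 1 - Dice, and equal weighting of the two terms. Details such as whether the background class is included are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_entropy(probs, onehot, eps=1e-7):
    """Mean voxel-wise cross-entropy."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=-1)))

def dice_loss(probs, onehot, eps=1e-7):
    """1 - soft Dice coefficient, averaged over classes."""
    spatial_axes = tuple(range(probs.ndim - 1))
    intersection = np.sum(probs * onehot, axis=spatial_axes)
    denominator = np.sum(probs, axis=spatial_axes) + np.sum(onehot, axis=spatial_axes)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice.mean())

def dice_ce_loss(logits, labels, num_classes):
    """Average of cross-entropy and Dice loss for integer `labels`."""
    probs = softmax(logits)
    onehot = np.eye(num_classes)[labels]
    return 0.5 * (cross_entropy(probs, onehot) + dice_loss(probs, onehot))
```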

Data augmentations used: cropping with oversampling of the foreground class, mirroring, zoom, Gaussian noise, Gaussian blur, and brightness.
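As an illustration of the first augmentation, the following sketch crops a patch and, with a given probability, centers it on a randomly chosen foreground voxel. It is a simplified NumPy version under stated assumptions: the image and label share the same spatial shape with no channel axis, foreground means label > 0, and the function name and the oversampling parameter are hypothetical.

```python
import numpy as np

def random_crop(image, label, patch_size, oversampling, rng):
    """Crop a patch; with probability `oversampling`, center it on a foreground voxel."""
    shape = np.array(label.shape)
    patch = np.array(patch_size)
    foreground = np.argwhere(label > 0)
    if len(foreground) > 0 and rng.random() < oversampling:
        center = foreground[rng.integers(len(foreground))]
        start = center - patch // 2
    else:
        start = rng.integers(0, shape - patch + 1)
    # Keep the window inside the volume; the chosen voxel stays inside the patch.
    start = np.clip(start, 0, shape - patch)
    window = tuple(slice(int(s), int(s + p)) for s, p in zip(start, patch))
    return image[window], label[window]
```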

Feature support matrix

The following features are supported by this model:

Feature                            nnUNet
Automatic mixed precision (AMP)    Yes
Horovod Multi-GPU (NCCL)           Yes



DALI

NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications. For details, refer to the example sources in this repository or to the DALI documentation.

Automatic Mixed Precision (AMP)

TensorFlow can modify computation graphs at runtime to support mixed precision training, which allows FP16 computation with FP32 master weights. A detailed explanation of mixed precision can be found in the next section.

Multi-GPU training with Horovod

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, refer to the Horovod: Official repository. Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, refer to the example scripts in this repository or to the TensorFlow tutorial.


XLA

XLA (Accelerated Linear Algebra) is a compiler that can accelerate TensorFlow networks through model-specific optimizations, such as fusing multiple GPU operations together. Operations fused into a single GPU kernel can keep intermediate values entirely in GPU registers instead of writing them to memory, which reduces memory operations and improves performance. For details, refer to the TensorFlow documentation.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.
  2. Adding loss scaling to preserve small gradient values.

This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full mixed precision methodology in your existing TensorFlow model code. AMP enables mixed precision training on NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere GPU architectures automatically. The TensorFlow framework code makes all necessary model changes internally.

In TF-AMP, the computational graph is optimized to use as few casts as necessary and to maximize the use of FP16, and loss scaling is applied automatically inside supported optimizers. TF-AMP accomplishes this by automatically rewriting all computation graphs with the operations needed for mixed precision training and automatic loss scaling. AMP can also be configured to work with the existing tf.contrib loss scaling manager: disabling the AMP scaling with a single environment variable leaves only the automatic mixed-precision optimization active.


Enabling mixed precision

Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
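The effect of static loss scaling can be illustrated with NumPy's float16 type. This is only a numerical sketch of the idea, not TF-AMP code; the gradient value and the loss scale of 2^15 are hypothetical.

```python
import numpy as np

grad_fp32 = np.float32(1e-8)        # a small gradient value held in FP32
loss_scale = np.float32(2.0 ** 15)  # hypothetical static loss scale

# Casting directly to FP16 flushes the value to zero (underflow).
unscaled_fp16 = np.float16(grad_fp32)

# Multiplying the loss by a constant multiplies every gradient by the same
# constant (chain rule), which moves the value into the FP16 range.
scaled_fp16 = np.float16(grad_fp32 * loss_scale)

# Gradients are unscaled in FP32 before the FP32 variables are updated.
recovered = np.float32(scaled_fp16) / loss_scale
```

Without scaling, the gradient is lost entirely in FP16; with scaling, it survives the FP16 round trip with a small relative error.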

The example nnU-Net scripts for training, inference, and benchmarking in the scripts/ directory enable mixed precision when the --amp command-line flag is used.

Internally, mixed precision is enabled by setting the keras.mixed_precision policy to mixed_float16. Additionally, our custom training loop uses a LossScaleOptimizer wrapper for the optimizer. For more information, see the Mixed precision guide.


Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
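The dynamic-range argument can be illustrated numerically. TF32 keeps the 8-bit exponent of FP32 and a 10-bit mantissa, so the sketch below approximates it by zeroing the 13 low mantissa bits of FP32 values. This is an assumption-laden simulation: the hardware rounds rather than truncates and applies TF32 only to the inputs of tensor operations, and the helper name is hypothetical.

```python
import numpy as np

def to_tf32(x):
    """Approximate TF32 by zeroing the 13 low mantissa bits of FP32 values (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

values = np.array([1.0e5, 1.0e-10, 3.14159265], dtype=np.float32)
as_tf32 = to_tf32(values)

# FP16 has only 5 exponent bits: 1e5 overflows to inf and 1e-10 underflows to 0.
with np.errstate(over="ignore"):
    as_fp16 = values.astype(np.float16)
```

In this simulation the large and small values survive in TF32 with reduced precision, while FP16 loses them entirely.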

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.


Deep supervision

Deep supervision is a technique that adds auxiliary loss outputs to the U-Net decoder layers. For nnU-Net, we add auxiliary losses to the last three decoder levels. The final loss is a weighted average of the obtained loss values. Deep supervision can be enabled by adding the --deep-supervision flag.
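The weighted average can be sketched as follows. The weighting scheme, which halves the weight at each lower-resolution level, is an assumption made for illustration, and the function name is hypothetical; the weights actually used by the scripts may differ.

```python
def deep_supervision_loss(level_losses, weights=None):
    """Weighted average of per-level losses.

    level_losses[0] is the loss at full resolution; later entries come from
    progressively lower-resolution decoder levels.
    """
    if weights is None:
        # Assumed weighting: halve the weight at each lower-resolution level.
        weights = [0.5 ** i for i in range(len(level_losses))]
    total = sum(w * loss for w, loss in zip(weights, level_losses))
    return total / sum(weights)
```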

Test time augmentation

Test time augmentation is an inference technique that averages the predictions for augmented copies of an image with the prediction for the original image. As a result, predictions are more accurate, at the cost of slower inference. For nnU-Net, we use all possible flip combinations to augment the image. Test time augmentation can be enabled by adding the --tta flag to the training or inference script invocation.
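A minimal NumPy sketch of flip-based test time augmentation is shown here. It assumes that the prediction has the same axis layout as the input, so each prediction can be flipped back before averaging; predict_fn stands in for the network and, like the other names, is hypothetical.

```python
import itertools
import numpy as np

def flip_combinations(spatial_axes):
    """All subsets of the spatial axes, including the empty subset (no flip)."""
    return [axes for r in range(len(spatial_axes) + 1)
            for axes in itertools.combinations(spatial_axes, r)]

def predict_with_tta(predict_fn, image, spatial_axes):
    """Average the predictions over every flip combination of `image`."""
    combos = flip_combinations(spatial_axes)
    total = None
    for axes in combos:
        flipped = np.flip(image, axis=axes) if axes else image
        pred = predict_fn(flipped)
        # Undo the flip so that all predictions are aligned with the original image.
        pred = np.flip(pred, axis=axes) if axes else pred
        total = pred if total is None else total + pred
    return total / len(combos)
```

For a 3D image this evaluates the network 8 times (the original plus 7 flipped copies), which is where the extra inference cost comes from.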

Sliding window inference

During inference, this method replaces an input image of arbitrary resolution with a batch of overlapping windows that cover the whole input. After this batch is passed through the network, a prediction with the original resolution is reassembled. Predicted values inside overlapping regions are obtained from a weighted average. The overlap ratio and the weights for the average (that is, the blending mode) can be adjusted with the --overlap and --blend-mode options, respectively.
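The procedure can be sketched in NumPy as follows. This is a simplified illustration rather than the repository's implementation: windows are processed one at a time instead of as a batch, the image is assumed to be at least as large as the window along every axis, and the function names, the Gaussian width, and the blend mode names are assumptions.

```python
import itertools
import numpy as np

def window_starts(size, window, overlap):
    """Start offsets along one axis so that the windows cover [0, size)."""
    stride = max(1, int(window * (1.0 - overlap)))
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # make sure the last window reaches the border
    return starts

def gaussian_weights(window_shape, sigma_scale=0.125):
    """Separable Gaussian weight map that favours the window centre."""
    weights = np.ones(window_shape, dtype=np.float64)
    for axis, n in enumerate(window_shape):
        coords = np.arange(n) - (n - 1) / 2.0
        profile = np.exp(-0.5 * (coords / (sigma_scale * n)) ** 2)
        shape = [1] * len(window_shape)
        shape[axis] = n
        weights = weights * profile.reshape(shape)
    return weights

def sliding_window_inference(predict_fn, image, window_shape,
                             overlap=0.5, blend_mode="gaussian"):
    """Predict overlapping windows and blend them with a weighted average."""
    if blend_mode == "gaussian":
        weights = gaussian_weights(window_shape)
    else:  # constant blending: every voxel of a window counts equally
        weights = np.ones(window_shape, dtype=np.float64)
    output = np.zeros(image.shape, dtype=np.float64)
    norm = np.zeros(image.shape, dtype=np.float64)
    axis_starts = [window_starts(s, w, overlap)
                   for s, w in zip(image.shape, window_shape)]
    for corner in itertools.product(*axis_starts):
        window = tuple(slice(c, c + w) for c, w in zip(corner, window_shape))
        output[window] += weights * predict_fn(image[window])
        norm[window] += weights
    return output / norm
```

With Gaussian blending, voxels near the centre of a window contribute more to the average than voxels near its border, where predictions tend to have less context.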