clara_pt_self_supervised_learning_segmentation

A self-supervised pipeline for 3D segmentation



  • Use Case: Clara Train
  • Latest Version: March 25, 2022
  • Size: 1.05 GB

Model Overview

A pre-trained model for volumetric (3D) segmentation of abdominal organs from CT images. The model was first pre-trained with a self-supervised learning technique; the resulting model was then fine-tuned on the fully supervised task of abdominal multi-organ segmentation.

Note: The 4.1 version of this model is only compatible with the 4.1 version of the Clara Train SDK container.

Model Architecture

This model is first trained with self-supervised learning. The self-supervised stage uses augmentations to mutate the image, creating a self-supervised image-reconstruction task, and additionally employs contrastive learning to strengthen the learned representations. Once self-supervised training is complete, the encoder weights are transferred to the full UNETR model, which uses the ViT encoder as its backbone. Fully supervised learning is then performed on the full model.

Diagram showing the flow of self-supervised learning and then the fully-supervised learning is performed on the pre-trained weights

Training Algorithm

Self-Supervised Learning

A 3D patch is selected from the CT volume. The 3D patch is then augmented using augmentations such as flips, outer cutout, inner cutout and local patch shuffling.
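The augmentations above can be sketched in plain numpy. This is an illustrative sketch only: the cutout sizes, block size, and swap count below are arbitrary assumptions, and the actual MMAR transforms (likely MONAI-based) differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(patch):
    """Flip the patch along one randomly chosen spatial axis."""
    return np.flip(patch, axis=rng.integers(0, 3)).copy()

def inner_cutout(patch, size=16):
    """Zero out a random inner sub-cube (size is an assumption)."""
    out = patch.copy()
    z, y, x = (rng.integers(0, s - size) for s in patch.shape)
    out[z:z+size, y:y+size, x:x+size] = 0.0
    return out

def outer_cutout(patch, size=48):
    """Keep only a random inner sub-cube; zero everything outside it."""
    keep = np.zeros_like(patch)
    z, y, x = (rng.integers(0, s - size) for s in patch.shape)
    keep[z:z+size, y:y+size, x:x+size] = patch[z:z+size, y:y+size, x:x+size]
    return keep

def local_shuffle(patch, block=24, n_swaps=4):
    """Swap the contents of random pairs of equally sized blocks."""
    out = patch.copy()
    for _ in range(n_swaps):
        a = [rng.integers(0, s - block) for s in patch.shape]
        b = [rng.integers(0, s - block) for s in patch.shape]
        sa = tuple(slice(c, c + block) for c in a)
        sb = tuple(slice(c, c + block) for c in b)
        out[sa], out[sb] = out[sb].copy(), out[sa].copy()
    return out

# Two differently augmented views of the same 96^3 patch.
patch = rng.standard_normal((96, 96, 96)).astype(np.float32)
view1 = inner_cutout(local_shuffle(random_flip(patch)))
view2 = outer_cutout(patch)
```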

Two augmented patches are generated via random combinations of the aforementioned augmentations. Both patches are reconstructed with a forward pass through the network, which is based on the ViT backbone. An L1 loss and a contrastive loss drive the learning process.

Once this model is trained to convergence, the ViT backbone weights are transferred to the ViT backbone of the full UNETR model [1], which is then trained for a 3D segmentation task.
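The combined objective can be sketched as follows. The contrastive term here is an NT-Xent-style formulation, which is a common choice; the exact contrastive variant, temperature, and loss weighting used in the MMAR are assumptions.

```python
import numpy as np

def l1_loss(recon, target):
    """Mean absolute error between reconstruction and original patch."""
    return np.abs(recon - target).mean()

def contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style loss over paired view embeddings z1, z2 of shape (B, D)."""
    z = np.concatenate([z1, z2], axis=0)               # (2B, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    b = len(z1)
    pos = np.concatenate([np.arange(b, 2 * b), np.arange(0, b)])  # paired view index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * b), pos]).mean()

rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 128))
z2 = z1 + 0.1 * rng.standard_normal((4, 128))          # embeddings of paired views
recon = rng.standard_normal((4, 96, 96, 96)).astype(np.float32)
target = rng.standard_normal((4, 96, 96, 96)).astype(np.float32)

lambda_c = 1.0  # relative weighting of the two losses (assumption)
total = l1_loss(recon, target) + lambda_c * contrastive_loss(z1, z2)
```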

Downstream 3D Segmentation Task

The segmentation of the abdominal region is formulated as voxel-wise 14-class classification (13 organs plus background). Each voxel is predicted as background or one of the following:

  • Spleen
  • Right Kidney
  • Left Kidney
  • Gallbladder
  • Esophagus
  • Liver
  • Stomach
  • Aorta
  • Inferior Vena Cava
  • Portal Vein & Splenic Vein
  • Pancreas
  • Right Adrenal Gland
  • Left Adrenal Gland

The model is optimized with the Adam optimizer, minimizing the sum of a soft Dice loss and a cross-entropy loss between the predicted mask and the ground-truth segmentation.
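A minimal numpy sketch of this combined objective is shown below (in practice the MMAR uses a PyTorch/MONAI implementation; the smoothing constants here are assumptions).

```python
import numpy as np

def soft_dice_loss(probs, onehot, eps=1e-5):
    """Soft Dice loss averaged over classes; probs/onehot shaped (C, *spatial)."""
    axes = tuple(range(1, probs.ndim))
    inter = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def cross_entropy_loss(probs, onehot, eps=1e-12):
    """Voxel-wise cross-entropy between softmax probabilities and one-hot labels."""
    return -(onehot * np.log(probs + eps)).sum(axis=0).mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((14, 8, 8, 8))                       # 14-class logits
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax
labels = rng.integers(0, 14, size=(8, 8, 8))
onehot = np.eye(14)[labels].transpose(3, 0, 1, 2)                 # (14, 8, 8, 8)

loss = soft_dice_loss(probs, onehot) + cross_entropy_loss(probs, onehot)
```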


Self-Supervised Learning Configuration

The self-supervised learning was performed with the following:

  • Script: (single GPU) or (multi GPU)
  • GPU: at least one GPU with 16 GB of memory
  • Actual Model Input: 96 x 96 x 96 for training, 96 x 96 x 96 for validation/testing
  • AMP: False
  • Optimizer: Adam
  • Learning Rate: 1e-4
  • Loss: L1 loss & Contrastive Loss
  • Validation Frequency: 5 epochs

Segmentation Training Configuration

The training was performed with the following:

  • Script: (single GPU) or (multi GPU)
  • GPU: at least one GPU with 16 GB of memory
  • Actual Model Input: 96 x 96 x 96 for training, 96 x 96 x 96 for validation/testing
  • AMP: False
  • Optimizer: Adam
  • (Initial) Learning Rate: 1e-4
  • Loss: DiceCELoss
  • Validation Frequency: 5 epochs

If an out-of-memory error or a crash occurs while caching the dataset, lower the cache_rate of CacheDataset to a value in the range (0, 1).
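In the MMAR's JSON training configuration this would look roughly like the fragment below; the surrounding key names are illustrative and the exact layout in this MMAR may differ.

```json
{
  "dataset": {
    "name": "CacheDataset",
    "args": {
      "cache_rate": 0.5,
      "num_workers": 4
    }
  }
}
```

With cache_rate 0.5, only half of the dataset is cached in memory and the rest is loaded on the fly, trading speed for memory.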

Self-Supervised Learning Dataset

The training data is from The Cancer Imaging Archive (TCIA) [2]. It contains a total of 771 3D CT volumes, of which 600 were used for training and 171 for validation. The dataset split is provided as a JSON file, 'tcia_dataset_split.json', in the 'config' directory of the MMAR.

Downstream 3D Segmentation Dataset

The training data is from the MICCAI 2015 Beyond the Cranial Vault (BTCV) abdominal segmentation challenge.

It contains 30 3D CT volumes with annotations of 13 organs, split into 24 training volumes and 6 validation volumes. The dataset split is provided as a JSON file, 'btcv_dataset_0.json', in the 'config' directory of the MMAR.

  • Target: Abdominal Organs
  • Task: Segmentation
  • Modality: CT
  • Size: 30 3D volumes (24 Training, 6 Validation)
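Dataset split files of this kind typically follow the MONAI/Decathlon list format, with separate "training" and "validation" entries. The fragment below is illustrative only; the file paths are hypothetical, and the actual keys in 'btcv_dataset_0.json' may differ.

```json
{
  "training": [
    { "image": "imagesTr/img0001.nii.gz", "label": "labelsTr/label0001.nii.gz" }
  ],
  "validation": [
    { "image": "imagesTr/img0035.nii.gz", "label": "labelsTr/label0035.nii.gz" }
  ]
}
```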


Performance

The Dice score is used to evaluate the final downstream 3D segmentation model. The trained model achieved a validation Dice score of 0.8100, averaged across the 13 abdominal organs. Note that the Dice score reported by the validation commands is 0.7887; this is because that evaluation is performed at the original resolution of the data.

For evaluating the self-supervised model, L1 error was used as the metric to select the best pre-trained model.


Self-Supervised Performance

Training and validation curves over 400 epochs.

Graph that shows training loss over 400 epochs

Graph that shows validation metric over 400 epochs

Segmentation Performance

Validation mean dice score over 3000 epochs. The highest validation Dice score that was achieved is 0.8100. One can observe the decreasing training loss curve.

Graph that shows decreasing training DiceCE Loss

How to Use this Model

The model was validated with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU with 16 GB or more of memory. For software, this model is usable only as part of the Transfer Learning & Annotation Tools in the Clara Train SDK container. Find out more about Clara Train in the Clara Train Collections on NGC.

Full instructions for the training and validation workflow can be found in our documentation.

Sliding-window Inference

Inference is performed in a sliding-window manner with a specified stride. Due to the large size of CT volumes, it is best to use GPUs with 16 GB or more of memory for inference and validation.
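The patch-positioning logic behind sliding-window inference can be sketched as below (MONAI provides a full implementation as `sliding_window_inference`; the 96^3 ROI matches the model input above, while the 0.5 overlap is an assumption).

```python
import numpy as np

def sliding_window_starts(length, window, overlap=0.5):
    """Start offsets covering an axis of `length` voxels with `window`-sized tiles."""
    stride = max(1, int(window * (1 - overlap)))
    starts = list(range(0, max(length - window, 0) + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # flush the last window to the border
    return starts

def grid(shape, roi=(96, 96, 96), overlap=0.5):
    """All 3D window origins; the model is run on each window and the
    overlapping predictions are blended back into the full volume."""
    axes = [sliding_window_starts(s, r, overlap) for s, r in zip(shape, roi)]
    return [(z, y, x) for z in axes[0] for y in axes[1] for x in axes[2]]

positions = grid((160, 200, 200))  # e.g. a 160 x 200 x 200 CT volume
```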


This training and inference pipeline was developed by NVIDIA and is based on a segmentation model developed by NVIDIA researchers. This research-use-only software has not been cleared or approved by the FDA or any regulatory agency. Clara pre-trained models are for developmental purposes only and cannot be used directly for clinical procedures.


References

[1] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., ... & Xu, D. (2022). UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 574-584).

[2] Harmon, S. A., Sanford, T. H., Xu, S., Turkbey, E. B., Roth, H., Xu, Z., Yang, D., Myronenko, A., Anderson, V., Amalou, A., Blain, M., Kassin, M., Long, D., Varble, N., Walker, S. M., Bagci, U., Ierardi, A. M., Stellato, E., Plensich, G. G., … Turkbey, B. (2020). Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nature Communications, 11(1).


License

End User License Agreement is included with the product. Licenses are also available along with the model application zip file. By pulling and using the Clara Train SDK container and downloading models, you accept the terms and conditions of these licenses.