
Mask R-CNN for PyTorch



Mask R-CNN is a convolution-based network for object instance segmentation. This implementation provides 1.3x faster training while maintaining target accuracy.



Use Case




Latest Version



November 18, 2021

Compressed Size

6.97 MB

Mask R-CNN is a convolution-based neural network for the task of object instance segmentation. The paper describing the model can be found here. NVIDIA's Mask R-CNN 19.2 is an optimized version of Facebook's implementation. This model is trained with mixed precision using Tensor Cores on the Volta, Turing, and NVIDIA Ampere GPU architectures. As a result, researchers can get results 1.3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

The repository also contains scripts to interactively launch training, benchmarking and inference routines in a Docker container.

The major differences between the official implementation of the paper and our version of Mask R-CNN are as follows:

  • Mixed precision support with PyTorch AMP.
  • Gradient accumulation to simulate larger batches.
  • Custom fused CUDA kernels for faster computations.

These techniques/optimizations improve model performance and reduce training time by a factor of 1.3x, allowing you to perform more efficient instance segmentation with no additional effort.
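
To illustrate the gradient accumulation item above, here is a minimal pure-Python sketch, not the repository's training loop. It assumes a toy 1-D linear model with a mean-squared-error loss; the function names `grad_mse` and `accumulated_grad` are hypothetical. It shows that averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch, which is why accumulation can simulate a larger batch.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients into one full-batch gradient."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # Scaling each micro-batch gradient by 1 / (number of steps) makes
        # the accumulated sum equal the full-batch mean gradient.
        # This identity assumes all chunks have the same size.
        total += grad_mse(w, chunk) / len(chunks)
    return total


# 16 samples, matching the global batch size listed in this document.
data = [(float(x), 3.0 * x) for x in range(1, 17)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=4)
```

Here `full` and `accum` agree, so a single optimizer step taken after four accumulated micro-batches of 4 samples behaves like one step on a batch of 16.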

Other publicly available implementations of Mask R-CNN include:

Model architecture

Mask R-CNN builds on top of Faster R-CNN, adding a mask head for the task of instance segmentation.

The architecture consists of the following components:

  • R-50 backbone with FPN
  • RPN head
  • RoI Align
  • Bounding box and classification head
  • Mask head

Default Configuration

The default configuration of this model can be found at pytorch/maskrcnn_benchmark/config/. The default hyper-parameters are as follows:

  • General:

    • Base Learning Rate set to 0.001
    • Global train batch size set to 16 images
    • Global test batch size set to 8 images
    • Steps set to 30000
    • Images resized with aspect ratio maintained and smaller side length between [800,1333]
  • Feature extractor:

    • Backend network set to Resnet50_conv4
    • First two blocks of backbone network weights are frozen
  • Region Proposal Network (RPN):

    • Anchor stride set to 16
    • Anchor sizes set to (32, 64, 128, 256, 512)
    • Foreground IOU Threshold set to 0.7, Background IOU Threshold set to 0.5
    • RPN target fraction of positive proposals set to 0.5
    • Train Pre-NMS Top proposals set to 12000
    • Train Post-NMS Top proposals set to 2000
    • Test Pre-NMS Top proposals set to 6000
    • Test Post-NMS Top proposals set to 1000
    • RPN NMS Threshold set to 0.7
  • RoI heads:

    • Foreground threshold set to 0.5
    • Batch size per image set to 512
    • Positive fraction of batch set to 0.25
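
As a worked example of how the RoI head settings above interact, the following sketch computes how many foreground and background RoIs would be sampled per image. It assumes a common balanced-sampling policy (cap positives at the positive fraction, fill the remainder with negatives); the function name `roi_sample_counts` is hypothetical and the repository's actual sampler may differ in detail.

```python
def roi_sample_counts(num_fg, num_bg,
                      batch_size_per_image=512, positive_fraction=0.25):
    """Return (foreground, background) RoI counts sampled for one image.

    Assumed policy: take at most batch_size_per_image * positive_fraction
    foreground RoIs, then fill the rest of the batch with background RoIs.
    """
    max_fg = int(batch_size_per_image * positive_fraction)
    fg = min(num_fg, max_fg)
    bg = min(num_bg, batch_size_per_image - fg)
    return fg, bg


# Plenty of candidates: positives are capped at 512 * 0.25 = 128.
many = roi_sample_counts(num_fg=300, num_bg=1700)
# Few positives: the unused positive slots are filled with negatives.
few = roi_sample_counts(num_fg=40, num_bg=1960)
```

With the defaults listed above, `many` is `(128, 384)` and `few` is `(40, 472)`.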

This repository implements multi-GPU training and gradient accumulation to support larger batches, as well as mixed precision training. This implementation also includes the following optimizations:

  • Target generation - Optimized GPU implementation for generating binary mask ground truths from the list of polygon coordinates that exist in the dataset.

  • Custom CUDA kernels for:

    • Box Intersection over Union (IoU) computation
    • Proposal matcher
    • Generate anchor boxes
    • Pre NMS box selection - Selection of RoIs based on objectness score before NMS is applied.

    The source files can be found under maskrcnn_benchmark/csrc/cuda.
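
For readers unfamiliar with these operations, the following plain-Python sketch shows what box IoU, proposal matching, and pre-NMS selection compute. It is a reference illustration only, not the repository's CUDA code. The function names, the `(x1, y1, x2, y2)` box layout with continuous coordinates, and the label convention (`-1` for background, `-2` for ignored) are assumptions; the thresholds are the RPN defaults listed earlier.

```python
import heapq


def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_proposals(proposals, gt_boxes, fg_thresh=0.7, bg_thresh=0.5):
    """Label each proposal: best ground-truth index, -1 (bg), or -2 (ignored)."""
    if not gt_boxes:
        return [-1] * len(proposals)
    labels = []
    for p in proposals:
        ious = [box_iou(p, g) for g in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__)
        if ious[best] >= fg_thresh:
            labels.append(best)      # foreground: matched to gt box `best`
        elif ious[best] < bg_thresh:
            labels.append(-1)        # background
        else:
            labels.append(-2)        # between thresholds: ignored
    return labels


def pre_nms_top_k(boxes, scores, k):
    """Keep the k boxes with the highest objectness score, best first."""
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [boxes[i] for i in top]


gt = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (0, 0, 10, 6), (20, 20, 30, 30)]
labels = match_proposals(proposals, gt)
kept = pre_nms_top_k(proposals, [0.2, 0.9, 0.5], k=2)
```

In this example the three proposals have IoU 1.0, 0.6, and 0.0 with the ground-truth box, so `labels` is `[0, -2, -1]`, and `kept` holds the two highest-scoring boxes.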

Feature support matrix

The following features are supported by this model.

Feature | Mask R-CNN
PyTorch AMP | Yes


APEX is a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training, whereas AMP is an abbreviation used for automatic mixed precision training.

DDP stands for DistributedDataParallel and is used for multi-GPU training.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and continuing with the Turing and Ampere architectures, switching to mixed precision has yielded significant training speedups: up to 3x overall on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

  1. Porting the model to use the FP16 data type where appropriate.

  2. Adding loss scaling to preserve small gradient values.
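
The second step can be illustrated with a small standard-library sketch. It uses Python's `struct` half-precision format to round values to FP16 and shows a small gradient underflowing to zero unless it is multiplied by a loss scale first. The helper name `to_fp16`, the gradient value, and the scale of 1024 are illustrative assumptions, not values taken from the repository.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8            # below the smallest positive FP16 value (about 6e-8)
loss_scale = 1024.0    # illustrative scale factor

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor, which moves
# this value into the range FP16 can represent.
scaled = to_fp16(grad * loss_scale)

# The gradient is unscaled in higher precision before the weight update.
recovered = scaled / loss_scale
```

Here `unscaled` is exactly `0.0`, while `recovered` is within about 1% of the original gradient, so the update is no longer lost.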

For information about:

Enabling mixed precision

In this repository, mixed precision training is enabled by using PyTorch's AMP.

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
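
The dynamic-range point can be made concrete with a small emulation. TF32 keeps the 8-bit exponent of FP32 but only 10 explicit mantissa bits, so the sketch below approximates it by clearing the low 13 mantissa bits of an FP32 value. Truncation is a simplification chosen for this sketch (hardware rounding may differ), and the helper names `to_tf32` and `fits_fp16` are hypothetical.

```python
import struct


def to_tf32(x):
    """Emulate TF32 by truncating an FP32 value to 10 explicit mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000   # clear the 13 low mantissa bits (23 - 10)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_fp16(x):
    """Return True if x is within the finite range of half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


big = 1.0e5
tf32_value = to_tf32(big)    # same exponent range as FP32, coarser mantissa
fp16_ok = fits_fp16(big)     # the largest finite FP16 value is 65504
```

In this sketch `tf32_value` is `99968.0`, a relative error of roughly 0.03%, whereas `fp16_ok` is `False` because the value exceeds the FP16 range.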

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.