This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and available for download from NGC.
Mask R-CNN is a convolution-based neural network for the task of object instance segmentation. The paper describing the model can be found here. NVIDIA's Mask R-CNN is an optimized version of Facebook's implementation. This model is trained with mixed precision using Tensor Cores on the Volta, Turing, and NVIDIA Ampere GPU architectures. As a result, researchers can get results 1.3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The repository also contains scripts to interactively launch training, benchmarking and inference routines in a Docker container.
The major differences between the official implementation of the paper and our version of Mask R-CNN are as follows:
- Mixed precision support with PyTorch AMP.
- Gradient accumulation to simulate larger batches.
- Custom fused CUDA kernels for faster computations.
These techniques/optimizations improve model performance and reduce training time by a factor of 1.3x, allowing you to perform more efficient instance segmentation with no additional effort.
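As an illustration of how gradient accumulation simulates a larger batch, here is a minimal framework-free sketch. It assumes a hypothetical one-parameter model `y = w * x` with a mean squared error loss; the function names and data are made up for this example and are not taken from the repository. It shows that summing micro-batch gradients, each divided by the number of accumulation steps, reproduces the full-batch gradient.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accum_steps):
    """Split the batch into equal micro-batches and accumulate scaled gradients."""
    assert len(batch) % accum_steps == 0, "batch must split evenly"
    micro_size = len(batch) // accum_steps
    grad = 0.0
    for i in range(accum_steps):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        # Dividing by accum_steps makes the running sum equal the full-batch mean.
        grad += mse_grad(w, micro) / accum_steps
    return grad


# Hypothetical data: four (x, y) samples.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
full = mse_grad(0.5, data)                           # one batch of 4
accum = accumulated_grad(0.5, data, accum_steps=2)   # two micro-batches of 2
print(full, accum)
```

In a real training loop the optimizer step would run only once per `accum_steps` micro-batches, so memory use follows the micro-batch size while the update follows the larger effective batch.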
Other publicly available implementations of Mask R-CNN include:
Mask R-CNN builds on top of Faster R-CNN, adding a mask head for the task of image segmentation.
The architecture consists of the following:
- R-50 backbone with FPN
- RPN head
- RoI Align
- Bounding box and classification head
- Mask head
The default configuration of this model can be found at pytorch/maskrcnn_benchmark/config/defaults.py. The default hyper-parameters are as follows:
- Base Learning Rate set to 0.001
- Global batch size set to 16 images
- Steps set to 30000
- Images re-sized with aspect ratio maintained and smaller side length between [800,1333]
- Global train batch size set to 16
- Global test batch size set to 8
- Backend network set to Resnet50_conv4
- First two blocks of backbone network weights are frozen
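The resizing rule in the list above can be sketched as follows. This is an illustration under an assumption: it reads "[800,1333]" as a target of 800 pixels for the shorter side with the longer side capped at 1333 pixels. The function name `resized_shape` is hypothetical and is not the repository's API.

```python
def resized_shape(width, height, min_size=800, max_size=1333):
    """Return (new_width, new_height) with the aspect ratio maintained.

    The shorter side is scaled to min_size unless that would push the
    longer side past max_size, in which case the target is reduced.
    """
    short_side, long_side = min(width, height), max(width, height)
    size = min_size
    if long_side / short_side * size > max_size:
        # Shrink the target so the longer side lands at about max_size.
        size = int(round(max_size * short_side / long_side))
    if width < height:
        return size, int(size * height / width)
    return int(size * width / height), size


print(resized_shape(640, 480))    # landscape image
print(resized_shape(1000, 300))   # very wide image, limited by max_size
```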
Region Proposal Network (RPN):
- Anchor stride set to 16
- Anchor sizes set to (32, 64, 128, 256, 512)
- Foreground IoU threshold set to 0.7, background IoU threshold set to 0.5
- RPN target fraction of positive proposals set to 0.5
- Train Pre-NMS Top proposals set to 12000
- Train Post-NMS Top proposals set to 2000
- Test Pre-NMS Top proposals set to 6000
- Test Post-NMS Top proposals set to 1000
- RPN NMS Threshold set to 0.7
RoI heads:
- Foreground IoU threshold set to 0.5
- Batch size per image set to 512
- Positive fraction of batch set to 0.25
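To make the threshold and sampling settings above concrete, here is a small sketch. The function names are hypothetical, and treating boxes whose IoU falls between the two thresholds as ignored is an assumption based on common practice, not something stated in this document. The numbers in the usage lines are the RPN thresholds (0.7 and 0.5) and the batch size of 512 with a positive fraction of 0.25 from the list.

```python
def label_from_iou(max_iou, fg_threshold, bg_threshold):
    """Return 1 (foreground), 0 (background), or -1 (ignored) for one box.

    max_iou is the highest IoU between the box and any ground-truth box.
    """
    if max_iou >= fg_threshold:
        return 1
    if max_iou < bg_threshold:
        return 0
    return -1  # assumption: in-between boxes do not contribute to the loss


def sample_counts(num_fg, num_bg, batch_size, positive_fraction):
    """How many foreground and background boxes to draw for one image."""
    fg = min(num_fg, int(batch_size * positive_fraction))
    bg = min(num_bg, batch_size - fg)  # background fills the remaining slots
    return fg, bg


print(label_from_iou(0.62, fg_threshold=0.7, bg_threshold=0.5))
print(sample_counts(300, 5000, batch_size=512, positive_fraction=0.25))
print(sample_counts(40, 5000, batch_size=512, positive_fraction=0.25))
```

When fewer foreground boxes are available than the positive fraction allows, the sketch fills the remaining slots with background boxes so the per-image batch stays at 512.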
This repository implements multi-GPU training and gradient accumulation to support larger batches, along with mixed precision training. This implementation also includes the following optimizations:
- Target generation: optimized GPU implementation for generating binary mask ground truths from the list of polygon coordinates that exist in the dataset.
- Custom CUDA kernels for:
  - Box Intersection over Union (IoU) computation
  - Proposal matcher
  - Generate anchor boxes
  - Pre-NMS box selection: selection of RoIs based on objectness score before NMS is applied.
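For readers unfamiliar with two of the operations named above, here is a plain-Python reference sketch of box IoU and pre-NMS box selection. It is not the repository's CUDA code. The function names are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous coordinates, which may differ from the pixel convention the kernels use.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def pre_nms_select(boxes, scores, top_n):
    """Keep the top_n boxes ranked by objectness score, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    order = order[:top_n]
    return [boxes[i] for i in order], [scores[i] for i in order]


boxes = [(0, 0, 1, 1), (0, 0, 2, 2), (0, 0, 3, 3)]
scores = [0.2, 0.9, 0.5]
kept_boxes, kept_scores = pre_nms_select(boxes, scores, top_n=2)
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)), kept_boxes)
```

Selecting by score before NMS bounds the number of boxes the quadratic IoU step has to compare, which is why the pre-NMS limits appear as separate settings.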
The source files can be found under
Feature support matrix
The following features are supported by this model.
AMP is an abbreviation used for automatic mixed precision training.
NHWC is the channels last memory format for tensors.
Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, followed by the Turing and Ampere architectures, switching to mixed precision has yielded significant training speedups: up to 3x overall on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
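The need for loss scaling can be seen with a short standard-library sketch. It rounds values to IEEE 754 half precision using the `struct` module's `"e"` format. The gradient value and the helper name `to_fp16` are chosen for illustration only; the scale of 8192 is a power of two, so unscaling is exact.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8       # a small gradient value, chosen for illustration
scale = 8192.0    # a power-of-two loss scale

unscaled = to_fp16(grad)           # too small for FP16, rounds to zero
scaled = to_fp16(grad * scale)     # large enough to be represented in FP16
recovered = scaled / scale         # unscale in higher precision before the update

print(unscaled, scaled, recovered)
```

Because the loss is multiplied by the scale before the backward pass, every gradient is scaled by the same factor, and dividing by it afterwards restores the original magnitude.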
For information about techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
Enabling mixed precision
In this repository, mixed precision training is enabled by the PyTorch native AMP library. PyTorch has an automatic mixed precision module that allows mixed precision to be enabled with minimal code changes.
Automatic mixed precision can be enabled with the following code changes:
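import torch
# Create gradient scaler
scaler = torch.cuda.amp.GradScaler(init_scale=8192.0)
# Wrap the forward pass in torch.cuda.amp.autocast
with torch.cuda.amp.autocast():
    loss_dict = model(images, targets)  # model, images, targets come from the training loop
# Gradient scaling: backward on the scaled loss, then step and update through the scaler
scaler.scale(sum(loss_dict.values())).backward()
scaler.step(optimizer)
scaler.update()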
AMP can be enabled by setting
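To show what a gradient scaler does between steps, here is a simplified framework-free sketch of dynamic loss scaling. It is not PyTorch's `GradScaler` implementation. The class name and its structure are hypothetical, and the growth and backoff values are assumptions chosen to resemble commonly documented defaults.

```python
import math


class DynamicLossScaler:
    """Simplified illustration of dynamic loss scaling (not PyTorch's GradScaler)."""

    def __init__(self, init_scale=8192.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            # Overflow: skip the update and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        grads = [g / self.scale for g in scaled_grads]
        if self._good_steps == self.growth_interval:
            # A run of finite gradients: try a larger scale next time.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return grads


scaler = DynamicLossScaler(init_scale=8192.0, growth_interval=2)
first = scaler.step([8192.0, float("inf")])   # overflow, so the step is skipped
second = scaler.step([4096.0, 2048.0])        # finite, unscaled with the reduced scale
print(first, second, scaler.scale)
```

The scale shrinks as soon as an overflow appears and grows again only after a run of clean steps, so it settles near the largest value that keeps gradients finite.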
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
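The precision and range trade-off described above can be approximated in plain Python. TF32 keeps the 8-bit exponent of FP32 but only 10 mantissa bits, so this sketch emulates it by zeroing the low 13 mantissa bits of an FP32 value. Truncation is an assumption made to keep the example short, and real hardware may round differently. The helper names are hypothetical.

```python
import struct


def to_tf32(x):
    """Approximate TF32 by zeroing the low 13 mantissa bits of an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign (1 bit), exponent (8 bits), top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_in_fp16(x):
    """True if x is within the finite range of IEEE 754 half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


print(to_tf32(1.0 + 2 ** -10))   # representable with 10 mantissa bits
print(to_tf32(1.0 + 2 ** -11))   # the extra bit is dropped
print(fits_in_fp16(1e30), to_tf32(1e30))
```

The last line illustrates the dynamic range point: a value such as 1e30 overflows FP16 but is kept by the TF32 emulation with a small relative error.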
MLPerf Training is an MLCommons benchmark that measures how fast systems can train models to a target quality metric. Mask R-CNN is one of the MLPerf Training benchmarks, which is improved every year. Some of the performance optimizations used in MLPerf can be introduced to this repository easily to gain significant training speedup. Here is NVIDIA's MLPerf v1.1 submission codebase.
Listed below are some of the performance optimization tricks applied to this repository:
- Prefetcher: PyTorch CUDA Streams are used to fetch the data required for the next iteration during the current iteration to reduce dataloading time before each iteration.
- pin_memory: Setting pin_memory can speed up host-to-device transfer of samples in the dataloader. More details can be found in this blog.
- Hybrid Dataloader: Some dataloading is done on the CPU and the rest is on the GPU.
- FusedSGD: Replace SGD with Apex FusedSGD for training speedup.
- Native DDP: Use PyTorch DistributedDataParallel.
- Native NHWC: Switching from channels first (NCHW) memory format to NHWC (channels last) gives better performance.
Increasing the local batch size and applying the above tricks gives ~2x speedup for end-to-end training time on 8 DGX A100s when compared to the old implementation.
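To clarify what the NCHW and NHWC memory formats mean, the following sketch computes the flat memory offset of a tensor element under each layout. The function names and the tensor shape are hypothetical. The example only illustrates the layouts and does not measure performance.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in channels-first layout."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same element in channels-last layout."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 4, 5  # hypothetical tensor shape (per image)

# Distance in memory between channel 0 and channel 1 of the same pixel.
stride_nchw = offset_nchw(0, 1, 2, 3, C, H, W) - offset_nchw(0, 0, 2, 3, C, H, W)
stride_nhwc = offset_nhwc(0, 1, 2, 3, C, H, W) - offset_nhwc(0, 0, 2, 3, C, H, W)
print(stride_nchw, stride_nhwc)
```

In channels-last layout all channels of a pixel sit next to each other in memory, whereas in channels-first layout they are a full image plane apart.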