This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and is available for download from NGC.
Mask R-CNN is a convolution-based neural network for the task of object instance segmentation. The paper describing the model can be found here. NVIDIA's Mask R-CNN is an optimized version of Facebook's implementation. This model is trained with mixed precision using Tensor Cores on the Volta, Turing, and NVIDIA Ampere GPU architectures. Therefore, researchers can get results 1.3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The repository also contains scripts to interactively launch training, benchmarking, and inference routines in a Docker container.
The major differences between the official implementation of the paper and our version of Mask R-CNN are as follows:
These techniques/optimizations improve model performance and reduce training time by a factor of 1.3x, allowing you to perform more efficient instance segmentation with no additional effort.
Other publicly available implementations of Mask R-CNN include:
Mask R-CNN builds on top of Faster R-CNN, adding an additional mask head for the task of image segmentation.
The architecture consists of the following:
The default configuration of this model can be found at pytorch/maskrcnn_benchmark/config/defaults.py. The default hyperparameters are as follows (a configuration sketch follows this list):
General:
Feature extractor:
Region Proposal Network (RPN):
RoI heads:
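As a sketch of how these defaults can be inspected and overridden (assuming the yacs-based config pattern of the upstream maskrcnn-benchmark project; the config path and override keys below are illustrative):

```python
# Illustrative sketch, assuming the upstream maskrcnn-benchmark config layout.
from maskrcnn_benchmark.config import cfg

# Hypothetical config path; merge a YAML config over the defaults.
cfg.merge_from_file("configs/e2e_mask_rcnn_R_50_FPN_1x.yaml")
# Override individual keys from a flat key/value list.
cfg.merge_from_list(["SOLVER.BASE_LR", 0.02])
cfg.freeze()

print(cfg.MODEL.RPN)  # inspect the RPN defaults
```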
This repository implements multi-GPU training and gradient accumulation to support larger batches, as well as mixed precision training. This implementation also includes the following optimizations:
Target generation - Optimized GPU implementation for generating binary mask ground truths from the list of polygon coordinates that exist in the dataset (a simplified sketch of this idea follows the list).
Custom CUDA kernels for:
The source files can be found under maskrcnn_benchmark/csrc/cuda.
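To illustrate the idea behind GPU target generation, here is a minimal, hedged sketch that rasterizes a single polygon into a binary mask using only tensor operations; the function polygon_to_mask and all shapes are illustrative, and the repository's actual implementation uses custom CUDA kernels instead:

```python
import torch

def polygon_to_mask(polygon, height, width):
    """Rasterize one polygon into a binary mask with the even-odd rule.

    Illustrative sketch only, not this repository's implementation.
    `polygon` is an (N, 2) tensor of (x, y) vertices.
    """
    xs, ys = polygon[:, 0], polygon[:, 1]            # edge start points
    xe, ye = torch.roll(xs, -1), torch.roll(ys, -1)  # edge end points

    # Pixel-center coordinates, flattened to (P, 1) for broadcasting.
    yy, xx = torch.meshgrid(
        torch.arange(height, device=polygon.device, dtype=polygon.dtype),
        torch.arange(width, device=polygon.device, dtype=polygon.dtype),
        indexing="ij",
    )
    px, py = xx.reshape(-1, 1), yy.reshape(-1, 1)

    # An edge crosses the horizontal ray from a pixel iff its endpoints
    # straddle the pixel's y coordinate.
    straddles = (ys <= py) != (ye <= py)             # (P, N)
    t = (py - ys) / (ye - ys + 1e-9)                 # intersection parameter
    x_int = xs + t * (xe - xs)                       # ray/edge intersection x
    crossings = (straddles & (px < x_int)).sum(dim=1)

    # An odd number of crossings means the pixel is inside the polygon.
    return (crossings % 2 == 1).reshape(height, width)

# Example: rasterize a triangle (runs on GPU when available).
device = "cuda" if torch.cuda.is_available() else "cpu"
tri = torch.tensor([[2.0, 2.0], [25.0, 4.0], [12.0, 26.0]], device=device)
mask = polygon_to_mask(tri, height=28, width=28)
```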
The following features are supported by this model.
Feature | Mask R-CNN |
---|---|
Native AMP | Yes |
Native DDP | Yes |
Native NHWC | Yes |
AMP is an abbreviation for automatic mixed precision training.
DDP stands for DistributedDataParallel and is used for multi-GPU training; the table above refers to PyTorch's native DDP, as opposed to Apex DDP. A combined sketch of native DDP and NHWC usage follows below.
NHWC is the channels-last memory format for tensors.
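A minimal combined sketch of native DDP together with channels-last (NHWC) tensors is shown below; the Conv2d model, tensor shapes, and the torchrun launch assumption are illustrative, not taken from this repository:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a launch via `torchrun`, which sets LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
model = model.to(memory_format=torch.channels_last)  # NHWC weights
model = DDP(model, device_ids=[local_rank])          # native multi-GPU wrapper

x = torch.randn(8, 3, 224, 224, device="cuda")
x = x.to(memory_format=torch.channels_last)          # NHWC activations
out = model(x)
```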
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups can be achieved by switching to mixed precision, up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
Porting the model to use the FP16 data type where appropriate.
Adding loss scaling to preserve small gradient values.
For information about:
How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
In this repository, mixed precision training is enabled by the PyTorch native AMP library. PyTorch has an automatic mixed precision module that allows mixed precision to be enabled with minimal code changes.
Automatic mixed precision can be enabled with the following code changes:
```python
# Create a gradient scaler
scaler = torch.cuda.amp.GradScaler(init_scale=8192.0)

# Wrap the forward pass in torch.cuda.amp.autocast
with torch.cuda.amp.autocast():
    loss_dict = model(images, targets)
    # Reduce the per-task losses to a single scalar
    losses = sum(loss for loss in loss_dict.values())

# Scale the loss for the backward pass, then step and update the scaler
scaler.scale(losses).backward()
scaler.step(optimizer)
scaler.update()
```
AMP can be enabled by setting DTYPE to float16.
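For example, assuming the yacs-style command-line overrides used by maskrcnn-benchmark derivatives (the script and config names here are illustrative), the override might look like:

```
python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml DTYPE "float16"
```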
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
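In PyTorch, TF32 can also be controlled explicitly through global backend flags. A minimal sketch follows; note that the default value of the matmul flag has varied across PyTorch releases, so setting both flags documents the intent:

```python
import torch

# Explicit TF32 controls; defaults have varied across PyTorch releases.
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```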
MLPerf Training is an MLCommons benchmark that measures how fast systems can train models to a target quality metric. Mask R-CNN is one of the MLPerf Training benchmarks, and it is improved every year. Some of the performance optimizations used in MLPerf can easily be introduced to this repository to gain a significant training speedup. Here is NVIDIA's MLPerf v1.1 submission codebase.
Listed below are some of the performance optimization tricks applied to this repository:
Increasing the local batch size and applying the above tricks gives a ~2x speedup in end-to-end training time on 8 DGX A100 systems when compared to the old implementation.