Masked-attention Mask Transformer (Mask2Former) is a segmentation architecture capable of addressing panoptic, instance, or semantic image segmentation tasks. Masked attention is essential for efficiently extracting localized features: it constrains cross-attention to the predicted mask regions.
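The masked-attention idea can be sketched in a few lines. The following NumPy snippet is a minimal illustration, not the TAO or Mask2Former implementation; the function name, shapes, and the `-1e9` masking constant are my own choices (the actual model also falls back to full cross-attention when a predicted mask is empty).

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    """Cross-attention restricted to predicted mask regions (illustrative sketch).

    queries: (Q, d) object-query embeddings
    keys, values: (N, d) per-pixel features
    mask: (Q, N) boolean, True where pixel n falls inside the mask
          predicted for query q
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)      # (Q, N) scaled dot-product scores
    logits = np.where(mask, logits, -1e9)       # suppress out-of-mask pixels
    # Row-wise softmax over the remaining (in-mask) pixels
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                     # (Q, d) attended features
```

Because out-of-mask logits are pushed to a large negative value, each query aggregates features only from the pixels its current mask prediction covers, which is what localizes the attention.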
This model is ready for non-commercial use.
The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.
Architecture Type: The model in this instance is an instance segmentor that takes color (RGB) images as input and generates segmentation masks and associated labels as output. Network Architecture: The backbone feature extractor of this model is a Swin-T model pretrained on the ImageNet dataset.
This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU with sufficient memory (greater than 12 GB). This model can only be used with the TAO Toolkit.
The primary use case for these models is instance segmentation.
It is intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's object detection dataset. High-fidelity models can be trained for new use cases. A Jupyter notebook is available as part of the TAO container and can be used to re-train.
To use these models as pretrained weights for transfer learning, use the following snippet as a template for the `model` and `train` components of the experiment spec file to train a Mask2Former model. For more information on the experiment spec file, see the TAO Toolkit User Guide.
```yaml
model:
  mode: "instance"
  backbone:
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 1
```
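As a quick sanity check before launching training, the spec can be parsed to confirm its nesting. This is my own suggestion, not part of the documented TAO workflow, and it assumes the third-party PyYAML package is available:

```python
import yaml  # PyYAML (third-party): pip install pyyaml

# The `model` component of the experiment spec, as shown in this card
spec_text = """
model:
  mode: "instance"
  backbone:
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 1
"""

cfg = yaml.safe_load(spec_text)
# Verify a few fields parse to the expected values
assert cfg["model"]["mode"] == "instance"
assert cfg["model"]["backbone"]["swin"]["window_size"] == 7
assert cfg["model"]["sem_seg_head"]["num_classes"] == 1
```

A malformed indent in the spec file would surface here as a parse error or a missing key, before any GPU time is spent.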
Runtime Engine:
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
Link: https://cocodataset.org/
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Human
Properties: The COCO dataset contains 118K training images and corresponding annotation files. The annotations include bounding boxes and segmentation masks for the 80 thing categories. The categories were mapped to a single category, "object", to train the binary instance segmentation model.
Link: https://cocodataset.org/
Data Collection Method by dataset: Unknown
Labeling Method by dataset: Human
Properties: The COCO dataset contains 5K validation images and corresponding annotation files. The annotations include bounding boxes and segmentation masks for the 80 thing categories. The categories were mapped to a single category, "object", to train the binary instance segmentation model.
We test the Mask2Former model on the modified COCO 2017 validation dataset.
KPIs for the evaluation data are reported below.
Model | Precision | mIoU |
---|---|---|
Mask2Former | FP16 | 0.96 |
Engine: TensorRT
Test Hardware:
Inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec
on Jetson AGX Xavier, Xavier NX, Orin, Orin NX, and on NVIDIA T4 and Ampere GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown are inference-only performance; end-to-end performance with streaming video data may vary depending on other bottlenecks in the hardware and software.
Platform | BS | FPS |
---|---|---|
AGX Orin 64GB | 8 | 17.53 |
Jetson Orin 16GB | 8 | 7.19 |
Jetson Nano 8GB | 8 | 2.13 |
T4 | 16 | 23.54 |
A30 | 16 | 73.94 |
A2 | 16 | 14.43 |
L4 | 16 | 35.27 |
L40 | 16 | 104.42 |
RTX4090 | 16 | 122.55 |
A100 | 16 | 147.11 |
H100 | 16 | 251.99 |
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.