OneFormer is a unified AI model for multiple image segmentation tasks, including semantic, instance, and panoptic segmentation.
Model Overview
Description:
OneFormer (One Transformer to Rule Universal Image Segmentation) is a universal segmentation architecture that handles panoptic, instance, and semantic image segmentation with a single model. A task token conditions the architecture on the desired segmentation type ("semantic," "instance," or "panoptic"), guiding the model to output task-specific predictions within one unified training and inference process.
This model is ready for non-commercial use.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by NVIDIA License. ADDITIONAL INFORMATION: MIT License.
Deployment Geography:
Global
Use Case:
Intended Users: This model is intended for use by computer vision engineers, robotics engineers, and researchers who require a comprehensive and detailed pixel-level understanding of an image.
Intended Use Cases: The model's ability to perform semantic, instance, and panoptic segmentation with a single set of weights makes it well suited for:
- Autonomous Systems: Providing full scene understanding for self-driving cars or drones by identifying both "stuff" (road, sky) and "things" (car 1, car 2, pedestrian 1).
- Robotic Perception: Enabling robots to identify, locate, and separate individual objects for "pick-and-place" tasks in cluttered environments.
- Medical Image Analysis: Segmenting and counting individual cells or tumors (instances) while also classifying surrounding anatomical regions or tissues (semantic).
- Geospatial Analysis: Analyzing satellite imagery to map land use (e.g., "forest," "water") while also detecting and counting individual objects (e.g., "buildings," "vehicles").
- Computational Photography: Powering features like AR effects or portrait mode by creating precise masks that separate subjects from their background.
Release Date:
NGC 11/25/2025 via [URL]
Reference(s):
J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, H. Shi: OneFormer: One Transformer to Rule Universal Image Segmentation
Model Architecture:
Architecture Type: The model is a unified segmentor that takes color (RGB) images as inputs and generates segmentation masks and associated labels as outputs.
Network Architecture:
- The backbone feature extractor of this model is the Swin-L model pretrained on the ImageNet dataset.
- The multi-scale features from the backbone are then fed into a Pixel Decoder (similar to an FPN) to generate high-resolution, multi-scale feature maps.
- The core of the architecture is a transformer decoder that takes two sets of inputs: the multi-scale feature maps and a set of learnable queries.
- A key innovation of OneFormer is the use of a task token. This single, learnable token is added to the queries to "prompt" the model, conditioning it to perform a specific task (semantic, instance, or panoptic segmentation) using the exact same weights.
- Finally, the refined query embeddings from the decoder are passed to two parallel prediction heads:
  - A classification head (a linear layer or small MLP) to predict the class label for each query.
  - A mask head (also an MLP) to dynamically generate the final mask for each query by combining the decoder outputs with the pixel decoder's feature maps.
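The mask head step in the list above can be illustrated with a small sketch. It assumes the formulation common to MaskFormer-style models, in which each query's mask embedding is combined with the pixel decoder's features through a per-pixel dot product followed by a sigmoid. The function name `predict_masks` is hypothetical, and plain Python lists stand in for tensors; this is not the TAO implementation.

```python
import math


def predict_masks(mask_embeddings, pixel_features):
    """Combine per-query embeddings with per-pixel features.

    mask_embeddings: Q lists of length C (one embedding per query).
    pixel_features:  C lists of H rows of W floats (pixel decoder output).
    Returns Q masks of shape H x W with values in (0, 1).
    """
    channels = len(pixel_features)
    height = len(pixel_features[0])
    width = len(pixel_features[0][0])
    masks = []
    for embedding in mask_embeddings:
        mask = [
            [
                # Dot product over the channel axis, then a sigmoid.
                1.0 / (1.0 + math.exp(-sum(
                    embedding[c] * pixel_features[c][y][x]
                    for c in range(channels)
                )))
                for x in range(width)
            ]
            for y in range(height)
        ]
        masks.append(mask)
    return masks


# Toy example: two queries, two channels, a 1 x 2 feature map.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
features = [[[4.0, -4.0]], [[-4.0, 4.0]]]
masks = predict_masks(embeddings, features)
binary = [[[int(p > 0.5) for p in row] for row in m] for m in masks]
print(binary)  # [[[1, 0]], [[0, 1]]]
```

Each query produces its own mask, so the number of output masks equals the number of queries regardless of which task the task token selects.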
Number of Model Parameters: 2.23*10^8 (223M)
Input(s):
Input Type(s): Image, Text
Input Format(s):
- Image: Red, Green, Blue (RGB)
- Text: String
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Other Properties Related to Input: The image height and width should each be divisible by 32. The text should be one of "This task is semantic", "This task is instance", or "This task is panoptic".
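The input constraints above can be checked before inference. The following is a minimal sketch: the helper names `build_task_text` and `padded_size` are hypothetical, the prompt wording follows the string quoted in this card, and rounding each dimension up to the next multiple of 32 is only one possible way to satisfy the size constraint (it is an assumption, not a description of what TAO does internally).

```python
from typing import Tuple

VALID_TASKS = ("semantic", "instance", "panoptic")


def build_task_text(task: str) -> str:
    """Return the task prompt in the form quoted in this card."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    return f"This task is {task}"


def padded_size(height: int, width: int, multiple: int = 32) -> Tuple[int, int]:
    """Round each dimension up to the nearest multiple (assumed strategy)."""
    def round_up(value: int) -> int:
        return ((value + multiple - 1) // multiple) * multiple
    return round_up(height), round_up(width)


print(build_task_text("panoptic"))  # This task is panoptic
print(padded_size(720, 1280))       # (736, 1280)
```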
Output(s):
Output Type(s): Label, Mask and Score for each detected object in the input image.
Output Format(s):
- Label: Integer
- Mask: Red, Green, Blue (RGB)
- Score: Float
Output Parameters:
- Label: One-Dimensional (1D)
- Mask: Two-Dimensional (2D)
- Score: One-Dimensional (1D)
Other Properties Related to Output:
- pred_classes: Batch size x Number of queries
- pred_masks: Batch size x Number of queries x Height x Width
- pred_scores: Batch size x Number of queries
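To show how these three outputs line up per query, here is a minimal sketch for a single image (the batch dimension is dropped). The function name `select_detections` and the 0.5 score threshold are hypothetical choices for illustration, and the values are toy data rather than real model output.

```python
def select_detections(pred_classes, pred_scores, pred_masks, threshold=0.5):
    """Keep the queries of one image whose score exceeds the threshold."""
    kept = []
    for label, score, mask in zip(pred_classes, pred_scores, pred_masks):
        if score > threshold:
            kept.append({"label": label, "score": score, "mask": mask})
    return kept


# One image, three queries, 2 x 2 masks (toy values).
pred_classes = [17, 0, 42]
pred_scores = [0.91, 0.12, 0.67]
pred_masks = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[0, 1], [1, 1]],
]
detections = select_detections(pred_classes, pred_scores, pred_masks)
print([d["label"] for d in detections])  # [17, 42]
```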
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- TAO v6.25.11
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
v1.0
Training, Testing, and Evaluation Datasets:
Dataset Overview
- Total Number of Datasets: 2 (COCO 2017 and ADE20K)
- Data Modality: Image
Images are scaled, typically by resizing the shorter edge to a fixed size (e.g., 800 pixels) while maintaining the aspect ratio. From these scaled images, large random crops (e.g., 1024x1024) are taken to create training patches. Standard augmentations, including random horizontal flipping (with 50% probability) and photometric jitter (adjusting brightness, contrast, and saturation), are applied to improve model robustness.
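The geometric part of this preprocessing can be sketched as follows. The function names are hypothetical, the sizes are the example values quoted above, and clipping the crop window to the image when the image is smaller than the crop is an assumption made to keep the sketch short (a real pipeline might pad instead). Photometric jitter is omitted.

```python
import random


def resize_shorter_edge(height, width, target=800):
    """Return (new_height, new_width) with the shorter edge scaled to target."""
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)


def random_crop_box(height, width, crop_h=1024, crop_w=1024, rng=random):
    """Pick a crop window (top, left, bottom, right) clipped to the image."""
    crop_h, crop_w = min(crop_h, height), min(crop_w, width)
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return top, left, top + crop_h, left + crop_w


def should_flip(rng=random, probability=0.5):
    """Decide whether to apply a horizontal flip."""
    return rng.random() < probability


new_h, new_w = resize_shorter_edge(480, 640)
print(new_h, new_w)  # 800 1067
```

Passing a seeded `random.Random` instance as `rng` makes the crop and flip decisions reproducible.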
COCO
Link: https://cocodataset.org/#home
Data Collection Method by dataset:
Automated
Labeling Method by dataset:
Human
Properties:
The COCO dataset (specifically the 2017 version) is a standard benchmark for panoptic segmentation. It is designed to test a model's ability to unify the tasks of instance segmentation (for "things") and semantic segmentation (for "stuff").
- Training Set: 118,000 images (train2017)
- Validation Set: 5,000 images (val2017)
- Total Classes: 133 total categories
  - 80 "Thing" Categories (for instance segmentation, e.g., 'person', 'dog', 'stop sign')
  - 53 "Stuff" Categories (for semantic segmentation, e.g., 'sky', 'grass', 'road')
- Total Annotations: The dataset is densely annotated, containing over 1.5 million annotated object instances across its training and validation splits.
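As a small illustration of how these numbers fit together, the sketch below checks the category arithmetic and decodes a segment ID from a COCO panoptic annotation pixel. COCO panoptic annotations are distributed as PNG files in which each segment ID is packed into the three color channels; the function names here are hypothetical and this is not part of the model's own code.

```python
NUM_THING_CLASSES = 80
NUM_STUFF_CLASSES = 53
NUM_PANOPTIC_CLASSES = NUM_THING_CLASSES + NUM_STUFF_CLASSES  # 133


def rgb_to_segment_id(red, green, blue):
    """Decode a COCO panoptic PNG pixel into its segment ID."""
    return red + 256 * green + 256 * 256 * blue


def segment_id_to_rgb(segment_id):
    """Inverse of rgb_to_segment_id."""
    red = segment_id % 256
    green = (segment_id // 256) % 256
    blue = segment_id // (256 * 256)
    return red, green, blue


print(NUM_PANOPTIC_CLASSES)           # 133
print(rgb_to_segment_id(10, 1, 0))    # 266
print(segment_id_to_rgb(266))         # (10, 1, 0)
```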
ADE20K
Link: https://ade20k.csail.mit.edu/
Data Collection Method by dataset:
Automated
Labeling Method by dataset:
Human
Properties: The ADE20K (MIT Scene Parsing Benchmark) dataset is a densely annotated benchmark designed for scene parsing. This task requires the model to label every pixel in an image with a semantic category, covering objects, parts of objects, and background "stuff."
- Training Set: 20,210 images
- Validation Set: 2,000 images
- Semantic Classes: The standard benchmark (used by OneFormer) consists of 150 semantic categories (e.g., 'person', 'car', 'building', 'road', 'tree').
- Total Annotations: The full dataset contains over 700,000 unique object instances.
Inference:
Acceleration Engine: TensorRT
Test Hardware:
- 1x NVIDIA A100 80GB
- 1x NVIDIA H100 80GB
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.