The SegIC (Segment-In-Context) model is designed for in-context segmentation, which aims to segment novel images using a few labeled example images (in-context examples). The underlying research is described in SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation.
This model is ready for commercial use.
SegIC shows powerful performance as a generalist model for segmenting everything in context. It achieves state-of-the-art results on COCO-20i, FSS-1000, and the recent LVIS-92i, and a deployable version is available. We also demonstrate its advantageous performance on retail object detection while training the model with only a modest amount of synthetic retail data.
Meng, Lingchen, et al. "SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation." arXiv preprint arXiv:2311.14671 (2023).
Nguyen, Khoi, and Sinisa Todorovic. "Feature weighting and boosting for few-shot segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Li, Xiang, et al. "Fss-1000: A 1000-class dataset for few-shot segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
Gupta, Agrim, Piotr Dollar, and Ross Girshick. "Lvis: A dataset for large vocabulary instance segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
Architecture Type: Vision Transformer
Network Architecture:
SegIC is built mainly upon a vision foundation model, a text encoder, and a lightweight mask decoder. For more details, please refer to the research paper listed in References.
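The sketch below illustrates how these three components might compose; the module names, signatures, and fusion scheme are assumptions for illustration only, not the released implementation (see the paper for the actual design).

```python
import torch.nn as nn

class SegICSketch(nn.Module):
    """Illustrative composition of the three components named above.
    Module names, signatures, and the fusion scheme are assumptions."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 mask_decoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # vision foundation model (e.g., a ViT)
        self.text_encoder = text_encoder      # encodes the class-name prompt
        self.mask_decoder = mask_decoder      # lightweight decoder producing masks

    def forward(self, target_img, example_img, example_mask, class_tokens):
        tgt_feats = self.vision_encoder(target_img)   # dense target features
        ex_feats = self.vision_encoder(example_img)   # dense example features
        txt_feats = self.text_encoder(class_tokens)   # text embedding
        # The decoder fuses target features with the in-context example
        # (features + mask) and the text embedding to predict the mask.
        return self.mask_decoder(tgt_feats, ex_feats, example_mask, txt_feats)
```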
Input Types:
Input Formats:
Input Parameters:
Other Properties Related to Input:
Output Types: Predicted segmentation masks.
Output Format: A tensor of shape batch_size x 1 x 896 x 896
Output Parameters: Four-Dimensional (4D)
Other Properties Related to Output:
Segmentation masks have resolution 896x896. Post-processing is needed to resize the predicted masks to the target images' original sizes.
Post-processing step:
```python
import torch.nn.functional as F

# Original image size: W x H; predicted mask size: 896 x 896.
# The target image is resized so its longer side is 896 and padded to
# 896 x 896, so the valid region is cropped before resizing back.
max_l = max(W, H)
h = int(896 * H / max_l)             # valid (non-padded) height
w = int(896 * W / max_l)             # valid (non-padded) width
pred_mask = pred_mask[:, :, :h, :w]  # crop away the padding
pred_mask_original_size = F.interpolate(pred_mask, size=(H, W))
```
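For convenience, the steps above can be wrapped in a small helper. The sketch below adds binarization of the prediction; the bilinear resize mode and the 0.5 sigmoid threshold are assumptions, since this card does not specify whether the raw output is logits or probabilities.

```python
import torch
import torch.nn.functional as F

def restore_mask(pred_mask: torch.Tensor, W: int, H: int,
                 threshold: float = 0.5) -> torch.Tensor:
    """Resize a batch_size x 1 x 896 x 896 prediction back to the target
    image's original W x H and binarize it. The bilinear mode and the
    0.5 sigmoid threshold are assumptions."""
    max_l = max(W, H)
    h = int(896 * H / max_l)             # valid (non-padded) height
    w = int(896 * W / max_l)             # valid (non-padded) width
    pred_mask = pred_mask[:, :, :h, :w]  # crop away the padding
    pred_mask = F.interpolate(pred_mask, size=(H, W),
                              mode="bilinear", align_corners=False)
    return (pred_mask.sigmoid() > threshold).to(torch.uint8)
```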
Runtime Engines:
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating Systems:
segic_v1.0_unpruned_deployable: Trained SegIC for general in-context segmentation tasks
prompt_feature_extract_v1.0_unpruned_deployable: SegIC feature extractor for example images
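A hypothetical usage flow for the two deployable models listed above: prompt features are extracted once per in-context example and then reused across target images. The function names and signatures below are illustrative, not the exported model I/O.

```python
def segment_in_context(prompt_feature_extract, segic,
                       example_img, example_mask, target_imgs):
    # Stage 1: encode the labeled in-context example once.
    prompt_feats = prompt_feature_extract(example_img, example_mask)
    # Stage 2: run SegIC on every target image, reusing the cached features.
    return [segic(target, prompt_feats) for target in target_imgs]
```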
Link:
Data Collection Method by Dataset
Labeling Method by Dataset
Properties:
The training dataset is a combination of three semantic segmentation datasets: an MSCOCO 2017 subset, an OpenImage subset, and synthetic retail data. Below is a summary of each dataset:
Dataset | Description | # Classes | # Images |
---|---|---|---|
MSCOCO 2017 subset | Semantic segmentation dataset containing 80 categories of common objects. The subset contains commercially viable images extracted from MSCOCO. | 80 | 48,000 |
OpenImage subset | The OpenImage images were auto-labeled with 80 MSCOCO categories. | 80 | 64,000 |
Synthetic Retail data | Scanned retail objects inserted into a mixture of synthetic scenes. | 315 | 48,000 |
Link:
Data Collection Method by Dataset
Labeling Method by Dataset
Properties:
Dataset | Description | # Classes | # Images |
---|---|---|---|
COCO-20i | The one-shot evaluation splits the MSCOCO 2014 validation set into 4 folds for cross-validation. Each fold contains 20 classes. | 80 | 5,000 |
Voyager Cafe KPI | A binary retail object detection dataset. Images of retail objects in 7 scenes, including checkout counters, shelves, and conveyor belts. The 9 categories are manually assigned based on the geometric shapes of the retail objects. | 1 | 6,042 |
One-shot semantic segmentation
Model | Training data | # shots / category | mIoU |
---|---|---|---|
NVIDIA SegIC | commercially viable MSCOCO 17, synthetic retail data, OpenImage | 1 | 69.57 |
Binary retail object detection
Model | Model Architecture | Training Data | # shots / category | mAP |
---|---|---|---|---|
NVIDIA SegIC | SegIC | commercially viable MSCOCO 17, synthetic retail data, OpenImage | 13 | 0.904 |
Retail Object Detection - v2.2.1.1 (supervised learning model) | DINO-FAN-small | Synthetic retail data | N/A | 0.834 |
All SegIC models use the same group of visual prompts for evaluation. The detection accuracy (mAP50) is estimated from the segmentation output.
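One plausible way to derive detection boxes, and hence mAP50, from segmentation output is to take the tight bounding box of each predicted binary mask, as sketched below; the actual evaluation protocol is not specified in this card.

```python
import numpy as np

def mask_to_bbox(binary_mask: np.ndarray):
    """Tight axis-aligned box (x_min, y_min, x_max, y_max) around a binary
    H x W mask; returns None for an empty mask (no detection)."""
    ys, xs = np.nonzero(binary_mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```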
Engine: TensorRT
Test Hardware:
Inference is run on the provided unpruned models at FP16 precision. Inference performance is measured using trtexec on the platforms listed below, including AGX Orin 64GB and Jetson Orin NX 16GB devices. The Jetson devices run in Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data may vary depending on other bottlenecks in the hardware and software.
Platform | TRT Version | Model | Batch Size | FPS |
---|---|---|---|---|
AGX Orin 64GB | 8.6.2.3 | SegIC | 1 | 4.13 |
AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 1 | 4.14 |
AGX Orin 64GB | 8.6.2.3 | SegIC | 8 | 27.49 |
AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 4 | 4.34 |
Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 1 | 1.26 |
Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 1 | 1.28 |
Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 8 | 9.03 |
Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 8 | 1.37 |
DGX H100 80GB | 8.6.2.3 | SegIC | 1 | 60 |
DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 1 | 61 |
DGX H100 80GB | 8.6.2.3 | SegIC | 4 | 55 |
DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 4 | 219 |
Tesla A30 | 8.6.2.3 | SegIC | 1 | 15 |
Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 1 | 15 |
Tesla A30 | 8.6.2.3 | SegIC | 4 | 14 |
Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 4 | 55 |
NVIDIA TAO Toolkit now provides a Gradio app for the deployable SegIC models.
To launch the Gradio demo for an interactive experience, follow the instructions at tao_pytorch_backend.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.