Description: In-context segmentation model trained on commercial data.
Publisher: NVIDIA
Latest Version: segic_deployable_v1.0
Modified: November 27, 2024
Size: 1.15 GB

SegIC: A Generalist Model for Segmenting Everything in Context

Model Overview

Description:

The SegIC (Segment-In-Context) model is designed for in-context segmentation, which aims to segment objects in novel images using a few labeled example images (in-context examples). The underlying research is described in SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation.

This model is ready for commercial use.

SegIC shows strong performance as a generalist model for segmenting everything in context, achieving state-of-the-art results on COCO-20i, FSS-1000, and the recent LVIS-92i benchmarks. A deployable version of the NVIDIA SegIC model is available. We also demonstrate its strong performance in retail object detection while training the model on only a modest amount of synthetic retail data.

License/Terms of Use

SegIC Model License

References:

Meng, Lingchen, et al. "SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation." arXiv preprint arXiv:2311.14671 (2023).

Nguyen, Khoi, and Sinisa Todorovic. "Feature weighting and boosting for few-shot segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

Li, Xiang, et al. "FSS-1000: A 1000-class dataset for few-shot segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

Gupta, Agrim, Piotr Dollar, and Ross Girshick. "LVIS: A dataset for large vocabulary instance segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Model Architecture:

Architecture Type: Vision Transformer
Network Architecture: SegIC is mainly built upon a vision foundation model, a text encoder, and a lightweight mask decoder.

  • Image encoder: For image feature extraction. DINOv2-L trained with proprietary images from NVIDIA.
  • Meta description encoder: For meta feature extraction. CLIP-B trained with proprietary images from NVIDIA.
  • Mask Decoder: For mask decoding. It is mainly composed of a few convolutional layers, projection layers, and a transformer decoder.

For more details, please refer to the research paper stated in References.
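
As an illustration only, the toy sketch below shows one way these components could compose at inference time. The module names and the pooling logic are stand-ins invented for this sketch, not the released model's implementation or API.

    # Runnable toy sketch of how the SegIC components compose (PyTorch).
    # The tiny stand-in modules are NOT the real encoders/decoder; they only
    # mimic the data flow: example images + masks (+ optional text) in,
    # a 1 x 1 x 896 x 896 mask prediction for the target image out.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToySegIC(nn.Module):
        def __init__(self, dim=32):
            super().__init__()
            self.image_encoder = nn.Conv2d(3, dim, 16, stride=16)  # stand-in for DINOv2-L
            self.text_encoder = nn.Embedding(1000, dim)            # stand-in for CLIP-B
            self.mask_decoder = nn.Conv2d(dim, 1, 1)               # stand-in mask decoder

        def forward(self, example_imgs, example_masks, target_img, meta_tokens=None):
            # 1. Dense features for the in-context examples and the target image.
            ex_feats = self.image_encoder(example_imgs)             # (N, dim, 56, 56)
            tgt_feats = self.image_encoder(target_img)              # (1, dim, 56, 56)

            # 2. In-context prompt: pool example features inside the labeled masks.
            masks = F.interpolate(example_masks, size=ex_feats.shape[-2:])
            prompt = (ex_feats * masks).sum((0, 2, 3)) / masks.sum().clamp(min=1)

            # 3. Optional meta-description features.
            if meta_tokens is not None:
                prompt = prompt + self.text_encoder(meta_tokens).mean((0, 1))

            # 4. Decode a mask for the target image conditioned on the prompt.
            logits = self.mask_decoder(tgt_feats + prompt[None, :, None, None])
            return F.interpolate(logits, size=(896, 896))

    # Toy usage with random tensors shaped like the documented inputs.
    pred = ToySegIC()(torch.rand(2, 3, 896, 896),                     # example images
                      torch.randint(0, 2, (2, 1, 896, 896)).float(),  # example masks
                      torch.rand(1, 3, 896, 896),                     # target image
                      torch.randint(0, 1000, (1, 64)))                # meta description tokens

This split roughly mirrors the two deployable artifacts listed under Model Versions below: one model extracts prompt features from the example images and masks, and the other consumes those prompts to segment the target image.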

Input:

Input Types:

  • Example images: Example images containing the target objects
  • Example image masks: Masks of the target objects in the example images. Each mask contains only one object.
  • (Optional) Meta description: A sentence or words describing the target objects
  • Target image: The test image in which to search for the target objects
  • Original size: The original size of the target image

Input Formats:

  • Example images: Red, Green, Blue (RGB). Any input resolution is supported and images do not need additional pre-processing (e.g., alpha channels or bit-depth changes), but all example images need to be the same size.
  • Example image masks: Grayscale. Resolution needs to be the same as the example images.
  • (Optional) Meta description: Tokenized string. Maximum length of 77 tokens.
  • Target images: Red, Green, Blue (RGB). Any input resolution is supported and images do not need additional pre-processing (e.g., alpha channels or bit-depth changes).
  • Original size: Height, width

Input Parameters:

  • Example images: Four dimensional (4D), batch_size x 3 x 896 x 896
  • Example image masks: 4D, batch_size x 1 x 896 x 896
  • Meta description: Two dimensional (2D), batch_size x 64
  • Target images: 4D, 1 x 3 x 896 x 896
  • Original sizes: 2D, 1 x 2

Other Properties Related to Input:

  • Example images: Resized to 896x896 at input
  • Example image masks: Resized to 896x896 at input
  • Target image: Resized to 896x896 at input. Currently only one target image is supported per inference; a pre-processing sketch follows this list.
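
The resize procedure is not spelled out in this card, but the post-processing in the Output section (which keeps only the top-left h x w region before interpolating back to H x W) implies an aspect-preserving resize of the long side to 896 followed by bottom/right zero-padding. Below is a minimal sketch under that assumption; preprocess_image and the use of PyTorch are illustrative, not the model's actual pre-processing code.

    # Pre-processing sketch (assumption): resize the long side to 896 while
    # keeping the aspect ratio, then zero-pad bottom/right to 896x896.
    import torch
    import torch.nn.functional as F

    def preprocess_image(image_chw, size=896):
        """image_chw: float tensor of shape (3, H, W) in RGB order."""
        _, H, W = image_chw.shape
        scale = size / max(H, W)
        h, w = int(H * scale), int(W * scale)
        resized = F.interpolate(image_chw[None], size=(h, w),
                                mode="bilinear", align_corners=False)
        padded = F.pad(resized, (0, size - w, 0, size - h))  # pad right and bottom
        return padded  # (1, 3, 896, 896), matching the documented target image input

    # The "original size" input (1 x 2) is then simply the pre-resize height and
    # width, e.g. torch.tensor([[H, W]]).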

Output:

Output Types: Predicted segmentation masks.
Output Format: A tensor with shape batch_size x 1 x 896 x 896
Output Parameters: Four Dimensional (4D)
Other Properties Related to Output:

  • Segmentation masks: resolution 896x896. Post-processing is needed to resize the predicted masks to the target image's original size.
    Post-processing step:

    # Original size: W x H; predicted mask size: 896 x 896.
    # The long side of the original image maps to 896, so only the top-left
    # h x w region of the prediction is valid; crop it, then resize to H x W.
    import torch.nn.functional as F

    max_l = max(W, H)
    h = int(896 * H / max_l)
    w = int(896 * W / max_l)
    pred_mask = pred_mask[:, :, :h, :w]
    pred_mask_original_size = F.interpolate(pred_mask, size=(H, W))
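
If a binary mask is needed rather than raw model scores, a common final step is to threshold the resized prediction. The sigmoid and the 0.5 threshold below are assumptions for illustration; this card does not specify the score scale or threshold of the deployable model's output.

    # Illustrative only (assumption): threshold the resized prediction.
    import torch
    binary_mask = (pred_mask_original_size.sigmoid() > 0.5).to(torch.uint8)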
    

Software Integration:

Runtime Engines:

  • TAO 5.5.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems:

  • Linux

Model Versions:

segic_v1.0_unpruned_deployable: Trained SegIC for general in-context segmentation tasks
prompt_feature_extract_v1.0_unpruned_deployable: SegIC feature extractor for example images

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link:

  • OpenImage subset
  • MSCOCO 2017 subset: a commercially viable filtered version
  • Synthetic Retail data: Generated by NVIDIA Omniverse Replicator Object. Details of the dataset are described in Retail Object Detection: Synthetic Data - v2.2.x.2 Fine-Tune.
  • nSpect: NSPECT-156S-4A48

Data Collection Method by Dataset

  • Hybrid: synthetic, human

Labeling Method by Dataset

  • Hybrid: synthetic, automated labeling

Properties:
The training dataset is a combination of three semantic segmentation datasets: the MSCOCO 2017 subset, the OpenImage subset, and synthetic retail data. Below is a summary of each dataset:

| Dataset | Description | # Classes | # Images |
| --- | --- | --- | --- |
| MSCOCO 2017 subset | Semantic segmentation dataset containing 80 categories of common objects. The subset extracted commercially viable images from MSCOCO. | 80 | 48,000 |
| OpenImage subset | OpenImage images auto-labeled with the 80 MSCOCO categories. | 80 | 64,000 |
| Synthetic Retail data | Scanned retail objects inserted into a mixture of synthetic scenes. | 315 | 48,000 |

Evaluation Dataset:

Link:

  • COCO-20i: a split of MSCOCO 2014 used as a one-shot segmentation benchmark. For details of the derivation of COCO-20i from MSCOCO, see here.
  • Voyager Cafe KPI: Retail object checkout scenes collected in the NVIDIA Voyager Cafeteria. Details of the dataset are described in Retail Object Detection: Performance - Evaluation Data.

Data Collection Method by Dataset

  • Human

Labeling Method by Dataset

  • Human

Properties:

| Dataset | Description | # Classes | # Images |
| --- | --- | --- | --- |
| COCO-20i | The one-shot evaluation splits the MSCOCO 2014 validation set into 4 sets for cross-validation. Each set contains 20 classes. | 80 | 5,000 |
| Voyager Cafe KPI | A binary retail object detection dataset. Images of retail objects in 7 scenes, including checkout counters, shelves, and conveyor belts. The 9 categories are manually assigned based on retail object geometry shapes. | 1 | 6,042 |

Evaluation Results

COCO-20i

One-shot semantic segmentation

| Model | Training data | # shots / category | mIoU |
| --- | --- | --- | --- |
| NVIDIA SegIC | Commercially viable MSCOCO 17, synthetic retail data, OpenImage | 1 | 69.57 |

Voyager Cafe KPI

Binary retail object detection

| Model | Model Architecture | Training Data | # shots / category | mAP |
| --- | --- | --- | --- | --- |
| NVIDIA SegIC | SegIC | Commercially viable MSCOCO 17, synthetic retail data, OpenImage | 13 | 0.904 |
| Retail Object Detection - v2.2.1.1 (supervised learning model) | DINO-FAN-small | Synthetic retail data | N/A | 0.834 |

All SegIC models use the same group of visual prompts for evaluation. The detection accuracy (mAP50) is estimated from the segmentation output, as sketched below.
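
For reference, estimating detection mAP from segmentation output implies converting each predicted mask to a bounding box before scoring. A minimal sketch of such a conversion is shown below; mask_to_bbox is a hypothetical helper, and the actual evaluation script is not published in this card.

    import numpy as np

    def mask_to_bbox(binary_mask):
        """Derive a box (x_min, y_min, x_max, y_max) from a 2D binary mask so
        that detection metrics such as mAP50 can be computed from masks."""
        ys, xs = np.nonzero(binary_mask)
        if len(xs) == 0:
            return None  # no object predicted
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())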

Inference:

Engine: TensorRT
Test Hardware:

  • A2
  • A30
  • DGX A100
  • DGX H100
  • JAO 64GB
  • Jetson AGX Xavier
  • L4
  • L40
  • NVIDIA T4
  • Orin
  • Orin Nano 8GB
  • Orin NX
  • Orin NX 16GB
  • Xavier NX

Inference is run on the provided unpruned model at FP16 precision. The inference performance is measured using trtexec on AGX Orin 64GB and Jetson Orin NX 16GB devices. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only performance; end-to-end performance with streaming video data may vary depending on other bottlenecks in the hardware and software.

| Platform | TRT Version | Model | Batch Size | FPS |
| --- | --- | --- | --- | --- |
| AGX Orin 64GB | 8.6.2.3 | SegIC | 1 | 4.13 |
| AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 1 | 4.14 |
| AGX Orin 64GB | 8.6.2.3 | SegIC | 8 | 27.49 |
| AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 4 | 4.34 |
| Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 1 | 1.26 |
| Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 1 | 1.28 |
| Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 8 | 9.03 |
| Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 8 | 1.37 |
| DGX H100 80GB | 8.6.2.3 | SegIC | 1 | 60 |
| DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 1 | 61 |
| DGX H100 80GB | 8.6.2.3 | SegIC | 4 | 55 |
| DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 4 | 219 |
| Tesla A30 | 8.6.2.3 | SegIC | 1 | 15 |
| Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 1 | 15 |
| Tesla A30 | 8.6.2.3 | SegIC | 4 | 14 |
| Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 4 | 55 |

Inference Method

NVIDIA TAO Toolkit now provides a Gradio app for the deployable SegIC models.

To launch the Gradio demo, follow the instructions at tao_pytorch_backend for an interactive experience.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.