Description: In-context segmentation model trained on commercial data.
Publisher: NVIDIA
Latest Version: segic_deployable_v1.0
Modified: November 27, 2024
Size: 1.15 GB

SegIC: A Generalist Model for Segmenting Everything in Context

Model Overview

Description:

The SegIC (Segment-In-Context) model is designed for in-context segmentation, which aims to segment objects in novel images using a few labeled example images (in-context examples). The underlying research is described in SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation.

This model is ready for commercial use.

SegIC shows strong performance as a generalist model for segmenting everything in context, achieving state-of-the-art results on COCO-20i, FSS-1000, and the recent LVIS-92i benchmarks. A deployable version of the NVIDIA SegIC model is available. We also demonstrate its strong performance in retail object detection while training the model on only a modest amount of synthetic retail data.

License/Terms of Use

SegIC Model License

References:

Meng, Lingchen, et al. "SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation." arXiv preprint arXiv:2311.14671 (2023).

Nguyen, Khoi, and Sinisa Todorovic. "Feature weighting and boosting for few-shot segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

Li, Xiang, et al. "FSS-1000: A 1000-class dataset for few-shot segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

Gupta, Agrim, Piotr Dollar, and Ross Girshick. "LVIS: A dataset for large vocabulary instance segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Model Architecture:

Architecture Type: Vision Transformer
Network Architecture: SegIC is mainly built upon a vision foundation model, a text encoder, and a lightweight mask decoder.

  • Image encoder: For image feature extraction. DINOv2-L trained with proprietary images from NVIDIA.
  • Meta description encoder: For meta feature extraction. CLIP-B trained with proprietary images from NVIDIA.
  • Mask Decoder: For mask decoding. It is mainly composed of a few convolutional layers, projection layers, and a transformer decoder.

For more details, please refer to the research paper stated in References.
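
As an illustration only, the toy sketch below shows one way these components could compose at inference time. The module names and the pooling logic are stand-ins invented for this sketch, not the released model's implementation or API.

    # Runnable toy sketch of how the SegIC components compose (PyTorch).
    # The tiny stand-in modules are NOT the real encoders/decoder; they only
    # mimic the data flow: example images + masks (+ optional text) in,
    # a 1 x 1 x 896 x 896 mask prediction for the target image out.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToySegIC(nn.Module):
        def __init__(self, dim=32):
            super().__init__()
            self.image_encoder = nn.Conv2d(3, dim, 16, stride=16)  # stand-in for DINOv2-L
            self.text_encoder = nn.Embedding(1000, dim)            # stand-in for CLIP-B
            self.mask_decoder = nn.Conv2d(dim, 1, 1)               # stand-in mask decoder

        def forward(self, example_imgs, example_masks, target_img, meta_tokens=None):
            # 1. Dense features for the in-context examples and the target image.
            ex_feats = self.image_encoder(example_imgs)             # (N, dim, 56, 56)
            tgt_feats = self.image_encoder(target_img)              # (1, dim, 56, 56)

            # 2. In-context prompt: pool example features inside the labeled masks.
            masks = F.interpolate(example_masks, size=ex_feats.shape[-2:])
            prompt = (ex_feats * masks).sum((0, 2, 3)) / masks.sum().clamp(min=1)

            # 3. Optional meta-description features.
            if meta_tokens is not None:
                prompt = prompt + self.text_encoder(meta_tokens).mean((0, 1))

            # 4. Decode a mask for the target image conditioned on the prompt.
            logits = self.mask_decoder(tgt_feats + prompt[None, :, None, None])
            return F.interpolate(logits, size=(896, 896))

    # Toy usage with random tensors shaped like the documented inputs.
    pred = ToySegIC()(torch.rand(2, 3, 896, 896),                     # example images
                      torch.randint(0, 2, (2, 1, 896, 896)).float(),  # example masks
                      torch.rand(1, 3, 896, 896),                     # target image
                      torch.randint(0, 1000, (1, 64)))                # meta description tokens

This split roughly mirrors the two deployable artifacts listed under Model Versions below: one model extracts prompt features from the example images and masks, and the other consumes those prompts to segment the target image.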

Input:

Input Types:

  • Example images: Example images containing the target objects
  • Example image masks: Masks of the target objects in the example images. Each mask contains only one object.
  • (Optional) Meta description: A sentence or words describing the target objects
  • Target image: The test image in which to search for the target objects
  • Original size: The original size of the target image

Input Formats:

  • Example images: Red, Green, Blue (RGB). Any input resolution is supported and images do not need additional pre-processing (e.g., alpha channels or bit-depth changes), but all example images need to be the same size.
  • Example image masks: Grayscale. Resolution needs to be the same as the example images.
  • (Optional) Meta description: Tokenized string. Maximum length of 77 tokens.
  • Target images: Red, Green, Blue (RGB). Any input resolution is supported and images do not need additional pre-processing (e.g., alpha channels or bit-depth changes).
  • Original size: Height, width

Input Parameters:

  • Example images: Four dimensional (4D), batch_size x 3 x 896 x 896
  • Example image masks: 4D, batch_size x 1 x 896 x 896
  • Meta description: Two dimensional (2D), batch_size x 64
  • Target images: 4D, 1 x 3 x 896 x 896
  • Original sizes: 2D, 1 x 2

Other Properties Related to Input:

  • Example images: Resized to 896x896 at input
  • Example image masks: Resized to 896x896 at input
  • Target image: Resized to 896x896 at input. Currently only one target image is supported per inference; a pre-processing sketch follows this list.
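
The resize procedure is not spelled out in this card, but the post-processing in the Output section (which keeps only the top-left h x w region before interpolating back to H x W) implies an aspect-preserving resize of the long side to 896 followed by bottom/right zero-padding. Below is a minimal sketch under that assumption; preprocess_image and the use of PyTorch are illustrative, not the model's actual pre-processing code.

    # Pre-processing sketch (assumption): resize the long side to 896 while
    # keeping the aspect ratio, then zero-pad bottom/right to 896x896.
    import torch
    import torch.nn.functional as F

    def preprocess_image(image_chw, size=896):
        """image_chw: float tensor of shape (3, H, W) in RGB order."""
        _, H, W = image_chw.shape
        scale = size / max(H, W)
        h, w = int(H * scale), int(W * scale)
        resized = F.interpolate(image_chw[None], size=(h, w),
                                mode="bilinear", align_corners=False)
        padded = F.pad(resized, (0, size - w, 0, size - h))  # pad right and bottom
        return padded  # (1, 3, 896, 896), matching the documented target image input

    # The "original size" input (1 x 2) is then simply the pre-resize height and
    # width, e.g. torch.tensor([[H, W]]).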

Output:

Output Types: Predicted segmentation masks.
Output Format: A tensor with shape batch_size x 1 x 896 x 896
Output Parameters: Four Dimensional (4D)
Other Properties Related to Output:

  • Segmentation masks: resolution 896x896. Post-processing is needed to resize the predicted masks to the target image's original size.
    Post-processing step:

    # Original size: W x H; predicted mask size: 896 x 896.
    # The long side of the original image maps to 896, so only the top-left
    # h x w region of the prediction is valid; crop it, then resize to H x W.
    import torch.nn.functional as F

    max_l = max(W, H)
    h = int(896 * H / max_l)
    w = int(896 * W / max_l)
    pred_mask = pred_mask[:, :, :h, :w]
    pred_mask_original_size = F.interpolate(pred_mask, size=(H, W))
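
If a binary mask is needed rather than raw model scores, a common final step is to threshold the resized prediction. The sigmoid and the 0.5 threshold below are assumptions for illustration; this card does not specify the score scale or threshold of the deployable model's output.

    # Illustrative only (assumption): threshold the resized prediction.
    import torch
    binary_mask = (pred_mask_original_size.sigmoid() > 0.5).to(torch.uint8)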
    

Software Integration:

Runtime Engines:

  • TAO 5.5.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems:

  • Linux

Model Versions:

segic_v1.0_unpruned_deployable: Trained SegIC for general in-context segmentation tasks
prompt_feature_extract_v1.0_unpruned_deployable: SegIC feature extractor for example images

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link:

  • OpenImage subset
  • MSCOCO 2017 subset: a commercially viable filtered version
  • Synthetic Retail data: Generated by NVIDIA Omniverse Replicator Object. Details of the dataset are described in Retail Object Detection: Synthetic Data - v2.2.x.2 Fine-Tune.
  • nSpect: NSPECT-156S-4A48

Data Collection Method by Dataset

  • Hybrid: synthetic, human

Labeling Method by Dataset

  • Hybrid: synthetic, automated labeling

Properties:
The training dataset is a combination of three semantic segmentation datasets: the MSCOCO 2017 subset, the OpenImage subset, and synthetic retail data. Below is a summary of each dataset:

| Dataset | Description | # Classes | # Images |
| --- | --- | --- | --- |
| MSCOCO 2017 subset | Semantic segmentation dataset containing 80 categories of common objects. The subset extracted commercially viable images from MSCOCO. | 80 | 48,000 |
| OpenImage subset | OpenImage images auto-labeled with the 80 MSCOCO categories. | 80 | 64,000 |
| Synthetic Retail data | Scanned retail objects inserted into a mixture of synthetic scenes. | 315 | 48,000 |

Evaluation Dataset:

Link:

  • COCO-20i: a split of MSCOCO 2014 used as a one-shot segmentation benchmark. For details of the derivation of COCO-20i from MSCOCO, see here.
  • Voyager Cafe KPI: Retail object checkout scenes collected in the NVIDIA Voyager Cafeteria. Details of the dataset are described in Retail Object Detection: Performance - Evaluation Data.

Data Collection Method by Dataset

  • Human

Labeling Method by Dataset

  • Human

Properties:

| Dataset | Description | # Classes | # Images |
| --- | --- | --- | --- |
| COCO-20i | The one-shot evaluation splits the MSCOCO 2014 validation set into 4 sets for cross-validation. Each set contains 20 classes. | 80 | 5,000 |
| Voyager Cafe KPI | A binary retail object detection dataset. Images of retail objects in 7 scenes, including checkout counters, shelves, and conveyor belts. The 9 categories are manually assigned based on retail object geometry shapes. | 1 | 6,042 |

Evaluation Results

COCO-20i

One-shot semantic segmentation

| Model | Training data | # shots / category | mIoU |
| --- | --- | --- | --- |
| NVIDIA SegIC | Commercially viable MSCOCO 17, synthetic retail data, OpenImage | 1 | 69.57 |

Voyager Cafe KPI

Binary retail object detection

| Model | Model Architecture | Training Data | # shots / category | mAP |
| --- | --- | --- | --- | --- |
| NVIDIA SegIC | SegIC | Commercially viable MSCOCO 17, synthetic retail data, OpenImage | 13 | 0.904 |
| Retail Object Detection - v2.2.1.1 (supervised learning model) | DINO-FAN-small | Synthetic retail data | N/A | 0.834 |

All SegIC models use the same group of visual prompts for evaluation. The detection accuracy (mAP50) is estimated from the segmentation output, as sketched below.
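
For reference, estimating detection mAP from segmentation output implies converting each predicted mask to a bounding box before scoring. A minimal sketch of such a conversion is shown below; mask_to_bbox is a hypothetical helper, and the actual evaluation script is not published in this card.

    import numpy as np

    def mask_to_bbox(binary_mask):
        """Derive a box (x_min, y_min, x_max, y_max) from a 2D binary mask so
        that detection metrics such as mAP50 can be computed from masks."""
        ys, xs = np.nonzero(binary_mask)
        if len(xs) == 0:
            return None  # no object predicted
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())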

Inference:

Engine: TensorRT
Test Hardware:

  • A2
  • A30
  • DGX A100
  • DGX H100
  • JAO 64GB
  • Jetson AGX Xavier
  • L4
  • L40
  • NVIDIA T4
  • Orin
  • Orin Nano 8GB
  • Orin NX
  • Orin NX 16GB
  • Xavier NX

Inference is run on the provided unpruned model at FP16 precision. The inference performance is measured using trtexec on AGX Orin 64GB and Jetson Orin NX 16GB devices. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only performance; end-to-end performance with streaming video data may vary depending on other bottlenecks in the hardware and software.

| Platform | TRT Version | Model | Batch Size | FPS |
| --- | --- | --- | --- | --- |
| AGX Orin 64GB | 8.6.2.3 | SegIC | 1 | 4.13 |
| AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 1 | 4.14 |
| AGX Orin 64GB | 8.6.2.3 | SegIC | 8 | 27.49 |
| AGX Orin 64GB | 8.6.2.3 | Prompt Feature Extract | 4 | 4.34 |
| Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 1 | 1.26 |
| Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 1 | 1.28 |
| Jetson Orin NX 16GB | 8.6.2.3 | SegIC | 8 | 9.03 |
| Jetson Orin NX 16GB | 8.6.2.3 | Prompt Feature Extract | 8 | 1.37 |
| DGX H100 80GB | 8.6.2.3 | SegIC | 1 | 60 |
| DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 1 | 61 |
| DGX H100 80GB | 8.6.2.3 | SegIC | 4 | 55 |
| DGX H100 80GB | 8.6.2.3 | Prompt Feature Extract | 4 | 219 |
| Tesla A30 | 8.6.2.3 | SegIC | 1 | 15 |
| Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 1 | 15 |
| Tesla A30 | 8.6.2.3 | SegIC | 4 | 14 |
| Tesla A30 | 8.6.2.3 | Prompt Feature Extract | 4 | 55 |

Inference Method

NVIDIA TAO Toolkit now provides a Gradio app for the deployable SegIC models.

To launch the Gradio demo, follow the instructions at tao_pytorch_backend for an interactive experience.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.