CitySemSegFormer

Description

Semantic segmentation of urban city classes in an image.

Publisher

NVIDIA

Use Case

Other

Framework

Transfer Learning Toolkit

Latest Version

deployable_v1.0

Modified

December 14, 2022

Size

331.08 MB

CitySemSegFormer Model Card

Model Overview

The model described in this card segments Cityscapes urban city classes within an image and returns a semantic segmentation mask.

Model Architecture

SegFormer is a real-time, state-of-the-art, transformer-based semantic segmentation model. It is a simple, efficient yet powerful semantic segmentation framework that unifies Transformers with lightweight multilayer perceptron (MLP) decoders and predicts a class label for every pixel in the input image. This model segments the 19 Cityscapes urban classes:

  1. road
  2. sidewalk
  3. building
  4. wall
  5. fence
  6. pole
  7. traffic light
  8. traffic sign
  9. vegetation
  10. terrain
  11. sky
  12. person
  13. rider
  14. car
  15. truck
  16. bus
  17. train
  18. motorcycle
  19. bicycle
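
The per-pixel prediction step can be illustrated with a minimal NumPy sketch. It assumes the network produces one score map per class, laid out as (classes, height, width), and that class indices follow the order of the list above; neither the tensor layout nor the index order is stated in this card, and `logits_to_label_map` is a hypothetical helper name.

```python
import numpy as np

# Class names in the order listed above.
# Assumption: index 0 = road ... index 18 = bicycle.
CLASS_NAMES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bicycle",
]


def logits_to_label_map(logits: np.ndarray) -> np.ndarray:
    """Turn per-class scores of shape (C, H, W) into an (H, W) map of class indices."""
    if logits.ndim != 3 or logits.shape[0] != len(CLASS_NAMES):
        raise ValueError(f"expected shape ({len(CLASS_NAMES)}, H, W), got {logits.shape}")
    # Each pixel gets the index of the class with the highest score.
    return np.argmax(logits, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=(len(CLASS_NAMES), 4, 6))  # stand-in for network output
    label_map = logits_to_label_map(fake_logits)
    print(label_map.shape)
    print(CLASS_NAMES[label_map[0, 0]])
```

If the deployed engine already emits class indices rather than scores, the argmax step is unnecessary and only the index-to-name lookup applies.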

Training Algorithm

The training algorithm optimizes the network to minimize the cross-entropy loss for every pixel of the mask.
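
As an illustration of a per-pixel cross-entropy loss, here is a minimal NumPy sketch. It assumes unweighted averaging over all pixels and a (classes, height, width) score layout; the card does not say whether class weighting or ignored labels were used during training, so this is not a description of the actual training code.

```python
import numpy as np


def pixelwise_cross_entropy(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean cross-entropy over all pixels.

    logits: float array of shape (C, H, W) with raw class scores.
    target: int array of shape (H, W) with ground-truth class indices.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    rows, cols = np.indices(target.shape)
    picked = log_probs[target, rows, cols]
    return float(-picked.mean())


if __name__ == "__main__":
    num_classes, height, width = 19, 4, 6
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(num_classes, height, width))
    target = rng.integers(0, num_classes, size=(height, width))
    print(pixelwise_cross_entropy(logits, target))
```

With all scores equal, every class gets probability 1/19, so the loss equals ln(19); a prediction that strongly favors the correct class at every pixel drives the loss toward zero.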

Training Data

The CitySemSegFormer model was trained on a proprietary dataset with more than 2 million objects for the car class. Most of the training dataset was collected in-house from images from a variety of dashcams and a small seed dataset containing images from traffic cameras in a city in the US. This content was chosen to auto-label urban city classes with segmentation masks. The approximate frequency distribution of the predominant classes in the dataset is as follows:

Object Distribution

Environment               Images     Cars    Persons    Road Signs    Two-Wheelers
Dashcam (5ft height)      128,000    1.7M    720,000    354,127       54,000
Traffic signal content    50,000     1.1M    53,500     184,000       11,000
Total                     178,000    2.8M    773,500    538,127       65,000

Training Data Ground-truth Labeling Guidelines

  • All objects were auto-labelled using the NVSeg[2] model from NVIDIA as a starting point.

Performance

Evaluation Data

The inference performance of the CitySemSegFormer model was measured against 300 proprietary images that were hand-labelled across a variety of environments. The frames are high-resolution 1920x1080-pixel images, resized to 1820x1024 pixels before being passed to the CitySemSegFormer model.

Methodology and KPI

The KPIs for the evaluation data are reported in the table below. The model is evaluated on Mean Intersection-Over-Union (MIOU), a common evaluation metric for semantic image segmentation that first computes the IOU for each semantic class and then averages over classes.

Model: CitySemSegFormer

Classes          MIOU
road             65.34
sidewalk         13.92
building         72.45
wall             51.33
fence            65.57
pole             43.45
traffic light    8.56
traffic sign     72.8
vegetation       79.72
terrain          38.52
sky              96.47
person           81.32
rider            21.24
car              90.13
truck            33.12
bus              93.93
train            3.2
motorcycle       33.44
bicycle          39.94
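
The metric described above can be sketched in NumPy as follows. This is a generic illustration of per-class IOU and its mean, computed from a confusion matrix; it is not the evaluation script used for the numbers in the table. The function names are hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumption.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows are ground-truth classes, columns are predicted classes."""
    flat = target.ravel().astype(np.int64) * num_classes + pred.ravel().astype(np.int64)
    counts = np.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def per_class_iou(cm: np.ndarray) -> np.ndarray:
    """IOU = TP / (TP + FP + FN) for each class; NaN where a class never occurs."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, tp / union, np.nan)


def mean_iou(cm: np.ndarray) -> float:
    """Average the per-class IOU, ignoring classes that never occur (assumption)."""
    return float(np.nanmean(per_class_iou(cm)))


if __name__ == "__main__":
    target = np.array([[0, 0], [1, 1]])
    pred = np.array([[0, 1], [1, 1]])
    cm = confusion_matrix(pred, target, num_classes=2)
    print(per_class_iou(cm))
    print(mean_iou(cm))
```

In the toy example, class 0 has one true positive and one missed pixel (IOU 1/2), and class 1 has two true positives and one false positive (IOU 2/3).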

Real-time Inference Performance

Inference is run on the provided unpruned models at INT8 precision; on the Jetson Nano, FP16 precision is used. Inference performance is measured using trtexec on Jetson Nano, AGX Xavier, Xavier NX, and NVIDIA T4 GPU. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The figures shown here are inference-only performance. End-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

BS - Batch Size

Model arch: CitySemSegFormer, Precision: FP16

Device          GPU    2*DLA
Xavier NX       0.4    --
AGX Xavier      0.7    --
Orin NX 16GB    0.6    --
Orin 64GB       1.5    --

Device    BS    FPS
T4        1     2
A100      4     13
A30       1     6.4
A10       1     4.5
A2        1     1.3

How to use this model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream SDK, or TensorRT.

The model is intended primarily for deployment and inference with DeepStream.

The primary use case for the model is segmenting urban city classes in a color (RGB) image. The model can be used to segment urban city transport settings in photos and videos, given appropriate video or image decoding and pre-processing. Note that this model performs semantic segmentation, not instance-based segmentation.

The model is encrypted and will only operate with the following key:

  • Model load key: tlt_encode

Input

Color (RGB) images of resolution 1820x1024x3 (width x height x channels)

Output

Category label (one of the 19 urban city classes) for every pixel in the input image. The model outputs a semantic segmentation mask of urban city classes for the input image.
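
For viewing such a per-pixel label output, a common approach is to map each class index to a color. The sketch below assumes the mask is an integer array of class indices in the range 0 to 18; the palette is generated arbitrarily for illustration and is not the palette used by DeepStream or by this model, and `make_palette` and `colorize` are hypothetical helper names.

```python
import numpy as np

NUM_CLASSES = 19


def make_palette(num_classes: int = NUM_CLASSES, seed: int = 0) -> np.ndarray:
    """Build an arbitrary (num_classes, 3) RGB palette; colors are illustrative only."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(num_classes, 3)).astype(np.uint8)


def colorize(mask: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class indices to an (H, W, 3) uint8 color image."""
    if not np.issubdtype(mask.dtype, np.integer):
        raise TypeError("mask must contain integer class indices")
    if mask.min() < 0 or mask.max() >= len(palette):
        raise ValueError("mask contains a class index outside the palette")
    # Fancy indexing looks up one RGB triple per pixel.
    return palette[mask]


if __name__ == "__main__":
    palette = make_palette()
    mask = np.array([[0, 18, 5], [3, 3, 11]])  # stand-in for a model output
    color_image = colorize(mask, palette)
    print(color_image.shape, color_image.dtype)
```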

Note: CitySemSegFormer can currently only be used as a deployable model in DeepStream. The current version does not support fine-tuning with the TAO Toolkit.

Instructions to deploy these models with DeepStream

To create an end-to-end video analytics application, deploy these models with the DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample app.

To deploy these models with DeepStream 6.1, please follow the instructions below:

Download and install the DeepStream SDK. The installation instructions are provided in the DeepStream development guide. The config files for the purpose-built models are located in:

/opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.

You will need one config file and one label file. These files are provided in NVIDIA-AI-IOT.

pgie_tlt_config_citysemsegformer.txt - File to configure inference settings for CitySemSegformer
labels.txt - Label file with 19 classes

Convert the .etlt file to a TensorRT (TRT) engine if you want to provide the model as an engine. Otherwise, you can provide the .etlt model directly to DeepStream. To convert to a TRT engine manually, follow the example command below:

TAO-Converter Commands

FP16

./tao-converter -k tlt_encode -p input,1x3x1024x1820,1x3x1024x1820,1x3x1024x1820 -t fp16 -e ./bs1_fp16.engine ./citySemSegFormer.etlt

Key Parameters in pgie_tlt_config_citysemsegformer.txt

# You can either provide the etlt model and key or trt engine obtained by using tao-converter
tlt-model-key=tlt_encode
# tlt-encoded-model=../../path/to/.etlt file
model-engine-file=../../path/to/trt_engine
net-scale-factor=0.01735207357279195
offsets=123.675;116.28;103.53
# Since the model input channel is 3, using RGB color format.
model-color-format=0
labelfile-path=./labels.txt
infer-dims=3;1024;1820
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
interval=0
gie-unique-id=1
cluster-mode=2
## 0=Detector, 1=Classifier, 2=Semantic Segmentation, 3=Instance Segmentation, 100=Other
network-type=100
output-tensor-meta=1
num-detected-classes=20
segmentation-output-order=1
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
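
To make the preprocessing parameters above concrete, the following NumPy sketch applies the per-pixel normalization that Gst-nvinfer documents, y = net-scale-factor * (x - offsets), and reorders the result to the channel-first layout given by infer-dims. The constants are copied from the config above; `normalize` is a hypothetical helper, and the sketch does not reproduce DeepStream's resizing or color conversion. The scale factor is numerically close to 1/57.63, which suggests a single averaged standard deviation is used for all three channels, though the card does not state this.

```python
import numpy as np

# Values copied from pgie_tlt_config_citysemsegformer.txt above.
NET_SCALE_FACTOR = 0.01735207357279195
OFFSETS = np.array([123.675, 116.28, 103.53], dtype=np.float32)  # per-channel R, G, B


def normalize(image_rgb_hwc: np.ndarray) -> np.ndarray:
    """Apply y = net-scale-factor * (x - offsets) and return a (3, H, W) float32 array.

    image_rgb_hwc: array of shape (H, W, 3) with RGB pixel values in [0, 255].
    """
    if image_rgb_hwc.ndim != 3 or image_rgb_hwc.shape[2] != 3:
        raise ValueError(f"expected shape (H, W, 3), got {image_rgb_hwc.shape}")
    x = image_rgb_hwc.astype(np.float32)
    y = (NET_SCALE_FACTOR * (x - OFFSETS)).astype(np.float32)
    # Channel-first layout, matching infer-dims=3;1024;1820.
    return np.transpose(y, (2, 0, 1))


if __name__ == "__main__":
    frame = np.zeros((1024, 1820, 3), dtype=np.uint8)  # stand-in for a decoded frame
    tensor = normalize(frame)
    print(tensor.shape, tensor.dtype)
    print(1.0 / NET_SCALE_FACTOR)
```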

Run ds-tao-segmentation:

CitySemSegformer
ds-tao-segmentation -c configs/segformer_tao/pgie_tlt_config_citysemsegformer.txt -i $DS_SRC_PATH/samples/streams/sample_720p.h264

Documentation for deploying with DeepStream is provided in the "Deploying to DeepStream" chapter of the TAO User Guide.

Limitations

Under-represented classes

The NVIDIA CitySemSegFormer model was trained to detect classes that are predominantly found in road transport settings. It performs relatively poorly on classes that are under-represented in our internal Intelligent Transport System dataset. These classes include rider, truck, train, and motorcycle.

Model versions:

CitySemSegformer:
  • deployable_citysemsegformer_v1.0 - citySemSegformer model deployable to deepstream.

References

Citations

Using TAO Pre-trained Models

Technical blogs

Suggested reading

License

The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The training and evaluation datasets consist mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.