Pre-trained Segformer - CityScapes
Description
Pre-trained SegFormer models trained on Cityscapes.
Publisher
-
Latest Version
deployable_fan_tiny_hybrid_v1.0
Modified
October 16, 2023
Size
39.54 MB

Cityscapes SegFormer Model Card

Model Overview

The model described in this card segments urban scene classes from the Cityscapes dataset within an image and returns a semantic segmentation mask.

Model Architecture

SegFormer is a real-time, state-of-the-art, transformer-based semantic segmentation model. It is a simple, efficient, yet powerful framework that unifies Transformers with lightweight multilayer perceptron (MLP) decoders and predicts a class label for every pixel in the input image. This model was trained on the Cityscapes dataset and segments the 19 Cityscapes urban classes (a sketch of the decoder follows the class list below):

  1. road
  2. sidewalk
  3. building
  4. wall
  5. fence
  6. pole
  7. traffic light
  8. traffic sign
  9. vegetation
  10. terrain
  11. sky
  12. person
  13. rider
  14. car
  15. truck
  16. bus
  17. train
  18. motorcycle
  19. bicycle
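
The all-MLP decoder mentioned above can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch, not the TAO implementation; the encoder channel widths are hypothetical, and the head simply projects each multi-scale feature map to a common width, upsamples, concatenates, fuses, and predicts per-pixel class logits.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecodeHead(nn.Module):
    # Sketch of a SegFormer-style lightweight all-MLP decode head.
    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256, num_classes=19):
        super().__init__()
        # One 1x1 projection ("linear layer") per encoder stage to a common width.
        self.proj = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])
        # Fuse the concatenated multi-scale features.
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, kernel_size=1)
        # Per-pixel logits over the 19 Cityscapes classes.
        self.classify = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, features):
        # features: one tensor per encoder stage, highest resolution first.
        target_size = features[0].shape[2:]
        projected = [
            F.interpolate(proj(feat), size=target_size, mode="bilinear", align_corners=False)
            for proj, feat in zip(self.proj, features)
        ]
        fused = self.fuse(torch.cat(projected, dim=1))
        return self.classify(fused)  # logits at the resolution of the finest feature map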

Training Algorithm

The training algorithm optimizes the network to minimize the cross-entropy loss for every pixel of the mask.
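
As a rough illustration of this objective, the per-pixel cross-entropy can be written as below. This is a minimal PyTorch sketch with random tensors; the actual TAO training loop (data loading, augmentation, optimizer, schedule) is not shown, and the ignore_index value for unlabeled pixels is an assumption.

import torch
import torch.nn.functional as F

# logits: (N, 19, H, W) raw class scores from the decoder; labels: (N, H, W) class indices per pixel.
logits = torch.randn(2, 19, 128, 128, requires_grad=True)
labels = torch.randint(0, 19, (2, 128, 128))

# Cross-entropy averaged over every labeled pixel of the mask.
loss = F.cross_entropy(logits, labels, ignore_index=255)
loss.backward()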

Training Data Ground-truth Labeling Guidelines

  • This model was trained on the Cityscapes dataset [3].

Performance

Evaluation Data

The performance of the Cityscapes model was measured against the Cityscapes validation dataset.

Methodology and KPI

The KPIs for the evaluation data are reported in the table below. The model is evaluated on Mean Intersection-Over-Union (mIoU), a common evaluation metric for semantic image segmentation that first computes the IoU for each semantic class and then averages over the classes.
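
For illustration, mIoU can be computed from predicted and ground-truth masks as in the following sketch (NumPy, with the common Cityscapes ignore label of 255 assumed; the reported numbers come from the standard Cityscapes evaluation protocol).

import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    # Compute the IoU for each class, then average over the classes that occur.
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        target_c = (target == c) & valid
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))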

Model            mIoU (Cityscapes validation)
fan_hybrid_base  83.5

Real-time Inference Performance

Inference is run on the provided unpruned models at INT8 precision; on Jetson Nano, FP16 precision is used. Inference performance is measured using trtexec on the platforms listed in the tables below; the Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only. End-to-end performance with streaming video data might vary slightly depending on other bottlenecks in the hardware and software.
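
For reference, a representative trtexec invocation looks like the following. The ONNX file name and the input tensor name are placeholders; the actual names depend on the exported model, and a run like this reports throughput rather than calibrated INT8 accuracy.

trtexec --onnx=segformer_fan_tiny_hybrid.onnx \
        --int8 \
        --shapes=input:16x3x1024x1024 \
        --saveEngine=segformer_fan_tiny_hybrid.engine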

BS - Batch Size

fan-tiny-hybrid

Platform BS FPS
Jetson Orin Nano - -
Orin NX 16GB - -
AGX Orin 64GB - -
A2 16 230
T4 16 374
A30 16 1098
L4 4 750
L40 8 1985
A100 64 2381
H100 64 4189

fan-small-hybrid

Platform BS FPS
Jetson Orin Nano - -
Orin NX 16GB - -
AGX Orin 64GB - -
A2 16 174
T4 8 272
A30 16 834
L4 4 581
L40 8 1510
A100 64 1808
H100 64 3143

fan-base-hybrid

Platform BS FPS
Jetson Orin Nano - -
Orin NX 16GB - -
AGX Orin 64GB - -
A2 16 129
T4 8 198
A30 16 606
L4 4 442
L40 8 1144
A100 64 1808
H100 64 2306

fan-large-hybrid

Platform BS FPS
Jetson Orin Nano - -
Orin NX 16GB - -
AGX Orin 64GB - -
A2 16 97.7
T4 8 156.7
A30 16 465
L4 4 341
L40 8 840
A100 64 1000
H100 64 1732

How to use this model

These models must be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, these models can only be used with the Train Adapt Optimize (TAO) Toolkit, the DeepStream SDK, or TensorRT.

The model is intended primarily for deployment and inference using DeepStream.

The primary use case intended for the model is segmenting urban scene classes in a color (RGB) image. The model can be used to segment urban transport scenes from photos and videos by applying appropriate video or image decoding and pre-processing. Note that this model performs semantic segmentation, not instance segmentation.

Input

Color (RGB) images. The model was trained at a resolution of 1024x1024x3; however, input of any resolution can be fed into the model.
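
For standalone (non-DeepStream) experimentation, pre-processing might look roughly like the sketch below. The mean and scale values mirror the offsets and net-scale-factor in the DeepStream config further down, and the resize to 1024x1024 matches the training resolution; the function and variable names are illustrative only.

import numpy as np
from PIL import Image

def preprocess(path, size=(1024, 1024)):
    # Decode, resize, and normalize an RGB image into an NCHW float32 batch of one.
    image = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
    x = np.asarray(image, dtype=np.float32)               # HWC, 0-255
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    x = (x - mean) * 0.01735207357279195                  # same offsets / net-scale-factor as the config
    return x.transpose(2, 0, 1)[None, ...]                # NCHW with a batch dimension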

Output

Outputs a semantic segmentation mask of urban scene classes for the input image.

[Figure: example output segmentation image]

Instructions to deploy these models with DeepStream

To create the entire end-to-end video analytics application, deploy these models with the DeepStream SDK. The DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample app.

To deploy these models with DeepStream 6.1, please follow the instructions below:

Download and install the DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located in:

/opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.

You will need one config file and one label file. These files are provided in [NVIDIA-AI-IOT](@todo : update).

nvinfer_config.txt - File to configure inference settings for Cityscapes
labels.txt - Label file with 19 classes
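
A label file is not reproduced in this card, but labels.txt is expected to contain the 19 class names, one per line, in the order of the training class indices. The ordering below follows the class list earlier in this card and is an assumption; verify it against the file shipped with the model.

road
sidewalk
building
wall
fence
pole
traffic light
traffic sign
vegetation
terrain
sky
person
rider
car
truck
bus
train
motorcycle
bicycle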

Key Parameters in nvinfer_config.txt

# You can either provide the onnx model and key or trt engine obtained by using tao-converter
# onnx-file=../../path/to/.onnx file
model-engine-file=../../path/to/trt_engine
net-scale-factor=0.01735207357279195
offsets=123.675;116.28;103.53
# Since the model input channel is 3, using RGB color format.
model-color-format=0
labelfile-path=./labels.txt
infer-dims=3;1024;1024
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
interval=0
gie-unique-id=1
cluster-mode=2
## 0=Detector, 1=Classifier, 2=Semantic Segmentation, 3=Instance Segmentation, 100=Other
network-type=100
output-tensor-meta=1
num-detected-classes=20
segmentation-output-order=1
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
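
The pre-processing values above correspond to the widely used ImageNet statistics expressed on a 0-255 pixel scale: nvinfer normalizes each pixel as y = net-scale-factor * (x - offsets). A quick check (the single shared standard deviation of 0.226 is an assumption implied by the scalar net-scale-factor):

# ImageNet channel means on a 0-1 scale, scaled back to 0-255.
mean = [0.485, 0.456, 0.406]
print([round(m * 255, 3) for m in mean])   # [123.675, 116.28, 103.53] -> matches offsets
print(1 / (0.226 * 255))                   # ~0.0173520... -> matches net-scale-factor up to rounding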

Run ds-tao-segmentation:

Cityscapes

ds-tao-segmentation -c nvinfer_config.txt -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4

Documentation on deploying with DeepStream is provided in the "Deploying to DeepStream" chapter of the TAO User Guide.

Limitations

Under-represented classes

The NVIDIA Cityscapes model was trained to detect classes that are predominantly found in road transport settings. It performs relatively poorly on classes that are under-represented in our internal Intelligent Transport System dataset, including rider, truck, train, and motorcycle.

Model versions:

FAN Cityscapes:

  • trainable_fan_base_hybrid_v1.0 - Cityscapes FAN Base Hybrid model, trainable.
  • trainable_fan_tiny_hybrid_v1.0 - Cityscapes FAN Tiny Hybrid model, trainable.
  • trainable_fan_small_hybrid_v1.0 - Cityscapes FAN Small Hybrid model, trainable.
  • deployable_fan_base_hybrid_v1.0 - Cityscapes FAN Base Hybrid model, deployable to DeepStream.
  • deployable_fan_tiny_hybrid_v1.0 - Cityscapes FAN Tiny Hybrid model, deployable to DeepStream.
  • deployable_fan_small_hybrid_v1.0 - Cityscapes FAN Small Hybrid model, deployable to DeepStream.

References

Citations

  • Xie, Enze, et al. "SegFormer: Simple and efficient design for semantic segmentation with transformers." Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
  • https://github.com/NVIDIA/semantic-segmentation
  • https://www.cityscapes-dataset.com/
  • The Cityscapes Dataset for Semantic Urban Scene Understanding: https://www.cityscapes-dataset.com/citation/

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other Purpose-built models from NGC model registry:
    • PeopleNet
    • TrafficCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet

Technical blogs

  • Read the two-part blog on training and optimizing a 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train a real-time license plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important; learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO.
  • Learn how to train an instance segmentation model using MaskRCNN with TAO.
  • Learn how to improve INT8 accuracy using Quantization-Aware Training (QAT) with TAO.
  • Read the technical tutorial on how the PeopleNet model can be trained with custom data using the Transfer Learning Toolkit.
  • Learn how to train and deploy real-time intelligent video analytics apps and services using the DeepStream SDK.

Suggested reading

  • More information about the TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about DeepStream SDK

License

This work is licensed under the Creative Commons Attribution NonCommercial ShareAlike 4.0 License (CC-BY-NC-SA-4.0). To view a copy of this license, please visit this link, or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Ethical Considerations

The training and evaluation dataset mostly consists of content captured in European cities. An ideal training and evaluation dataset would additionally include content from other geographies.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.