CitySemSegFormer

Description: Semantic segmentation of urban street scenes into 19 classes.
Publisher: NVIDIA
Latest Version: deployable_onnx_v1.0
Modified: November 27, 2024
Size: 331.08 MB

CitySemSegFormer Model Card

Description:

CitySemSegFormer performs semantic segmentation of urban street scenes. The model segments an image into the following 19 classes:

  • road
  • sidewalk
  • building
  • wall
  • fence
  • pole
  • traffic light
  • traffic sign
  • vegetation
  • terrain
  • sky
  • person
  • rider
  • car
  • truck
  • bus
  • train
  • motorcycle
  • bicycle

This model is ready for commercial use.

References:

  • Xie, Enze, et al. "SegFormer: Simple and efficient design for semantic segmentation with transformers." Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
  • https://github.com/NVIDIA/semantic-segmentation

Using TAO Pre-trained Models

  • Get TAO Container
  • Get other purpose-built models from the NGC model registry:
    • TrafficCamNet
    • PeopleNet
    • PeopleNet-Transformer
    • DashCamNet
    • FaceDetectIR
    • VehicleMakeNet
    • VehicleTypeNet
    • PeopleSegNet
    • PeopleSemSegNet
    • License Plate Detection
    • License Plate Recognition
    • PoseClassificationNet
    • Facial Landmark
    • FaceDetect
    • 2D Body Pose Estimation
    • ActionRecognitionNet
    • People ReIdentification
    • PointPillarNet
    • CitySegFormer
    • Retail Object Detection
    • Retail Object Embedding
    • Optical Inspection
    • Optical Character Detection
    • Optical Character Recognition
    • PCB Classification
    • PeopleSemSegFormer

Model Architecture:

Architecture Type: Transformer
Network Architecture: SegFormer

Input:

Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: 3D
Other Properties Related to Input: RGB fixed resolution: 1024 x 1024 x 3 (W x H x C); no minimum bit depth, alpha, or gamma.

Output:

Output Type(s): Semantic Segmentation Mask
Output Format: 2D segmentation mask (per-pixel class labels)
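
The exact output tensor layout depends on how the model is exported. As a rough sketch (assuming the network emits either a 2D class-index mask or a per-class score map of shape classes x height x width; the names below are illustrative, not part of this card), the output can be decoded into the 19 class names listed above:

import numpy as np

# The 19 classes in the order listed in this card.
CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bicycle",
]

def decode_mask(output: np.ndarray) -> np.ndarray:
    """Return an (H, W) array of class indices from the raw network output."""
    if output.ndim == 3:                 # (C, H, W) scores -> per-pixel argmax
        return output.argmax(axis=0)
    return output.astype(np.int64)       # already an (H, W) index mask

# Example: count the pixels labeled "person" in a decoded mask.
# mask = decode_mask(raw_output)
# person_pixels = int((mask == CLASSES.index("person")).sum())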

Software Integration:

Runtime Engine(s):

  • TAO - 5.2
  • DeepStream 6.1 or later

Supported Hardware Architecture(s):

  • Ampere
  • Jetson
  • Hopper
  • Lovelace
  • Pascal
  • Turing
  • Volta

Supported Operating System(s):

  • Linux
  • Linux 4 Tegra

Model Version(s):

  • trainable_mitb5_citysemsegformer_v1.0 - Trainable CitySemSegFormer model with MiT-B5 backbone.
  • deployable_mitb5_citysemsegformer_v2.0 - CitySemSegFormer model with MiT-B5 backbone, deployable to DeepStream.
  • trainable_fan_baseHybrid_citysemsegformer_v1.0 - Trainable CitySemSegFormer model with FAN-Base-Hybrid backbone.
  • deployable_fan_baseHybrid_citysemsegformer_v1.0 - CitySemSegFormer model with FAN-Base-Hybrid backbone, deployable to DeepStream.

Training & Evaluation:

Training Dataset:

Data Collection Method by dataset:

  • Automatic/Sensors

Labeling Method by dataset:

  • Human

Properties:
Proprietary dataset of 178,000 images. Most of the training data was collected in-house from a variety of dashcams, supplemented by a small seed dataset of images from traffic cameras in a city in the US. This content was chosen to auto-label urban city classes with segmentation masks.

Object Distribution

Environment               Images     Cars    Persons    Road Signs    Two-Wheelers
Dashcam (5 ft height)     128,000    1.7M    720,000    354,127       54,000
Traffic signal content     50,000    1.1M     53,500    184,000       11,000
Total                     178,000    2.8M    773,500    538,127       65,000

Evaluation Dataset:

Data Collection Method by dataset:

  • Automatic/Sensors

Labeling Method by dataset:

  • Human

Properties:
300 proprietary images that were hand labeled across a variety of environments.

Methodology and KPI

The KPIs on the evaluation data are reported in the table below. The model is evaluated using mean Intersection over Union (mIoU), a common metric for semantic image segmentation that first computes the IoU for each semantic class and then averages over classes.
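
For reference, a minimal sketch of how mIoU can be computed from predicted and ground-truth label masks (a generic illustration, not the exact evaluation script used to produce the numbers below):

import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int = 19) -> float:
    """Mean IoU over classes, ignoring classes absent from both pred and label."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(
        label.reshape(-1).astype(np.int64) * num_classes + pred.reshape(-1),
        minlength=num_classes * num_classes,
    ).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))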

CitySemSegFormer per-class results

Class            IoU (%)
road             73.6
sidewalk         46.88
building         69.47
wall             50.44
fence            65.57
pole             61.47
traffic light    45.87
traffic sign     72.8
vegetation       77.4
terrain          38.19
sky              96.47
person           81.78
rider            33.07
car              90.14
truck            33.43
bus              95.44
train            3.2
motorcycle       54.44
bicycle          49.52

Inference:

Engine: TensorRT
Test Hardware:

  • Jetson AGX Xavier
  • Xavier NX
  • Orin
  • Orin NX
  • NVIDIA T4
  • Ampere GPU
  • A2
  • A30
  • L4
  • DGX H100
  • DGX A100
  • L40
  • JAO 64GB
  • Orin NX 16GB
  • Orin Nano 8GB

Inference is run on the provided unpruned models at INT8 precision; on the Jetson Nano, FP16 precision is used. Inference performance is measured using trtexec on Jetson Nano, AGX Xavier, Xavier NX and NVIDIA T4 GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.
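
For example, once a TensorRT engine has been built for a given platform, throughput can be measured with trtexec roughly as follows (the engine filename is a placeholder; trtexec reports latency and throughput over a timed run):

trtexec --loadEngine=citysemsegformer.engine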

BS - Batch Size

MiT-B5 SegFormer

Platform            BS    FPS
Jetson Orin Nano    1     1.36
Orin NX 16GB        1     1.9
AGX Orin 64GB       1     4.8
A2                  1     5.8
T4                  1     9.4
A30                 4     29.3
L4                  1     17.8
L40                 1     47.3
A100                8     62.2
H100                8     108

FAN-Base-Hybrid SegFormer

Platform            BS    FPS
Jetson Orin Nano    1     1.2
Orin NX 16GB        1     1.78
AGX Orin 64GB       1     4.4
A2                  1     4.4
T4                  1     7.3
A30                 4     23.7
L4                  1     15.7
L40                 1     40.9
A100                8     50.4
H100                8     89.5

How to use this model

These models must be used with NVIDIA hardware and software. The models can run on any NVIDIA GPU, including NVIDIA Jetson devices, and can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream SDK or TensorRT.

The model is primarily intended for deployment and inference with DeepStream.

The primary use case for this model is segmenting urban city classes in a color (RGB) image. The model can be used to segment urban street scenes in photos and videos, given appropriate video or image decoding and pre-processing. Note that this model performs semantic segmentation, not instance segmentation.

Instructions to deploy these models with DeepStream

To create an end-to-end video analytics application, deploy these models with the DeepStream SDK. DeepStream SDK is a streaming analytics toolkit for building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample apps.

To deploy these models with DeepStream 6.1, follow the instructions below:

Download and install the DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located under the DeepStream installation directory; /opt/nvidia/deepstream is the default installation directory, and this path will be different if you installed DeepStream elsewhere.

You will need one config file and one label file. These files are provided in [NVIDIA-AI-IOT](@todo : update).

  • nvinfer_config.txt - Config file with the inference settings for CitySemSegFormer
  • labels.txt - Label file with the 19 classes

You can input the ONNX model directly to DeepStream. To manually convert it to a TensorRT engine instead, follow the example command below:
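
A minimal example using trtexec (the ONNX and engine filenames are placeholders; choose --fp16 or --int8 to match your target precision):

trtexec --onnx=citysemsegformer.onnx --saveEngine=citysemsegformer.engine --fp16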

Key Parameters in nvinfer_config.txt for DeepStream Inference.

# You can either provide the etlt model and key or trt engine obtained by using tao-converter
tlt-model-key=tlt_encode
# tlt-encoded-model=../../path/to/.etlt file
model-engine-file=../../path/to/trt_engine
net-scale-factor=0.01735207357279195
offsets=123.675;116.28;103.53
# Since the model input channel is 3, using RGB color format.
model-color-format=0
labelfile-path=./labels.txt
infer-dims=3;1024;1024
batch-size=1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
interval=0
gie-unique-id=1
cluster-mode=2
## 0=Detector, 1=Classifier, 2=Semantic Segmentation, 3=Instance Segmentation, 100=Other
network-type=100
output-tensor-meta=1
num-detected-classes=20
segmentation-output-order=1
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0
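
For reference, net-scale-factor and offsets implement DeepStream nvinfer's standard input normalization, y = net-scale-factor * (x - offsets), applied per channel; the values above appear to correspond to the common ImageNet mean and (averaged) standard deviation expressed on a 0-255 pixel scale. A rough numpy equivalent of this preprocessing (array and function names are illustrative):

import numpy as np

NET_SCALE_FACTOR = 0.01735207357279195          # ~ 1 / 57.63
OFFSETS = np.array([123.675, 116.28, 103.53])   # per-channel RGB means on a 0-255 scale

def preprocess(frame_rgb: np.ndarray) -> np.ndarray:
    """frame_rgb: (1024, 1024, 3) uint8 RGB frame already resized to the network input size."""
    x = frame_rgb.astype(np.float32)
    y = NET_SCALE_FACTOR * (x - OFFSETS)         # y = net-scale-factor * (x - mean)
    return y.transpose(2, 0, 1)                  # HWC -> CHW, matching infer-dims=3;1024;1024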

Run ds-tao-segmentation:

CitySemSegformer

ds-tao-segmentation -c configs/segformer_tao/nvinfer_config.txt -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4

Documentation to deploy with DeepStream is provided in "Deploying to DeepStream" chapter of TAO User Guide.

Technical blogs

  • Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
  • Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
  • Model accuracy is extremely important; learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO
  • Learn how to train an instance segmentation model using MaskRCNN with TAO
  • Learn how to improve INT8 accuracy using Quantization Aware Training (QAT) with TAO
  • Read the technical tutorial on how PeopleNet model can be trained with custom data using Transfer Learning Toolkit
  • Learn how to train and deploy real-time intelligent video analytics apps and services using DeepStream SDK

Suggested reading

  • More information about TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
  • Read the TAO Getting Started guide and release notes.
  • If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
  • Deploy your model on the edge using DeepStream. Learn more about DeepStream SDK

Ethical Considerations:

The training and evaluation datasets are sourced from North America. A more inclusive training and evaluation dataset would include content from other parts of the world.

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.