Model to segment persons in an image.
Latest Version
February 21, 2024
204.53 MB

PeopleSemSegFormer Model Card

Model Overview

The model described in this card segments one or more “person” objects within an image and returns a semantic segmentation mask for all people within the image.

Model Architecture

SegFormer is a real-time, state-of-the-art, transformer-based semantic segmentation model. It is a simple, efficient, yet powerful semantic segmentation framework that unifies Transformers with lightweight multilayer perceptron (MLP) decoders and predicts a class label for every pixel in the input image. This model segments each image into two classes: person and background.

Training Algorithm

The training algorithm optimizes the network to minimize the cross-entropy loss for every pixel of the mask.
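As an illustration of the loss described above, the following is a minimal NumPy sketch of per-pixel cross-entropy averaged over a mask. It is not the TAO training code; the function name `pixelwise_cross_entropy`, the class order (0 = background, 1 = person), and the toy inputs are assumptions made for the example.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: float array of shape (num_classes, H, W) with raw class scores.
    labels: int array of shape (H, W) with the ground-truth class per pixel.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    true_log_probs = np.take_along_axis(log_probs, labels[None, :, :], axis=0)[0]
    return float(-true_log_probs.mean())

# Toy 2x2 mask with an assumed class order: 0 = background, 1 = person.
labels = np.array([[0, 1], [1, 0]])
# Scores that strongly favour the correct class at every pixel.
confident = np.where(labels[None] == np.arange(2)[:, None, None], 5.0, -5.0)
# Scores that give both classes equal probability.
uniform = np.zeros((2, 2, 2))

print(pixelwise_cross_entropy(confident, labels))
print(pixelwise_cross_entropy(uniform, labels))
```

The confident prediction gives a loss close to zero, while the uniform prediction gives ln 2, the loss of guessing between two classes.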

Training Data

The PeopleSemSegFormer v1.0 model was trained on a proprietary dataset with more than 7.6 million images and more than 71 million objects for the person class. The training dataset consists of a mix of camera heights, crowd densities, and fields of view (FOV). Approximately half of the training data consisted of images captured in an indoor office environment. For these images, the camera is typically set up at a height of approximately 10 feet and a 45-degree angle, with a close field of view. This content was chosen to improve the accuracy of the models for the convenience-store retail analytics use case. Approximately 500 thousand images of low-density scenes with people extending their hands and feet were also added to improve performance for use cases where person object detection is followed by pose estimation. The dataset also includes about 200 thousand "Low Contrast" images, in which the people and their clothing blend into the background.

Training Dataset Object Distribution
Category           Number of Images  Number of Persons  Number of Bags  Number of Faces
Natural                     4804552           23085430         8061920         10786381
Rotated                     5746323           19930535         7234679         10094039
Broadcast                    566408             566408               0           358518
Broadcast Rotated            369895             369895               0            59104
Blended                       24841              26041           11225                0
Blended Rotated               24335              24523            9707                0
Simulation                    27417             368914               0            92916
Total                       7656570           41334979        12280764         18354191

Training Data Ground-truth Labeling Guidelines

  • All objects were auto-labelled using the TAO-MAL AI-Assisted Annotation model from NVIDIA.


Evaluation Data

The inference performance of the PeopleSemSegFormer model was measured against 300 proprietary images that were hand-labelled across a variety of environments. The frames are high-resolution 1920x1080-pixel images, resized to 1820x1024 pixels before being passed to the PeopleSemSegFormer model.

Methodology and KPI

The KPIs for the evaluation data are reported in the table below. The model is evaluated based on mean intersection-over-union (MIOU), a common evaluation metric for semantic image segmentation that first computes the IOU for each semantic class and then averages over the classes.

Model: FAN-Base-Hybrid-Segformer
Content          MIOU
5ft              91.86
10ft             91.4
20ft             89.7
Office use-case  97.01
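
The MIOU metric described above can be sketched in a few lines of NumPy. This is an illustrative single-image version, not the evaluation code used to produce the table; the function name `mean_iou`, the class order (0 = background, 1 = person), and the toy masks are assumptions. Evaluation tools commonly accumulate intersections and unions over the whole dataset before dividing, which this sketch does not do.

```python
import numpy as np

def mean_iou(pred, target, num_classes=2):
    """Average the per-class IOU between two integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both masks, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy label maps with an assumed class order: 0 = background, 1 = person.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

# Background IOU is 4/5 and person IOU is 3/4, so the mean is about 0.775.
print(mean_iou(pred, target))
```

The values in the table are expressed as percentages, so a result of 0.775 here would correspond to 77.5.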

Real-time Inference Performance

Inference is run on the provided unpruned models at INT8 precision; on the Jetson Nano, FP16 precision is used. Inference performance is measured using trtexec on Jetson Nano, AGX Xavier, Xavier NX and NVIDIA T4 GPU. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance. End-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

BS - Batch Size

Platform          BS    FPS
Jetson Orin Nano   1    6.6
Orin NX 16GB       1    9.7
AGX Orin 64GB      1   24.2
A2                 1   23.3
T4                 4   39.6
A30                8  116.8
L4                 1   83.4
L40                2    210
A100              32    254
H100              32    454
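
Because the batch size differs between platforms, the FPS figures are not directly comparable as latencies. The sketch below converts each row into an approximate time per batch, assuming that FPS counts the total frames processed per second across the whole batch (the table does not state this explicitly). The helper `batch_latency_ms` is a hypothetical name introduced for the example.

```python
# (platform, batch size, frames per second) copied from the table above.
results = [
    ("Jetson Orin Nano", 1, 6.6),
    ("Orin NX 16GB", 1, 9.7),
    ("AGX Orin 64GB", 1, 24.2),
    ("A2", 1, 23.3),
    ("T4", 4, 39.6),
    ("A30", 8, 116.8),
    ("L4", 1, 83.4),
    ("L40", 2, 210.0),
    ("A100", 32, 254.0),
    ("H100", 32, 454.0),
]

def batch_latency_ms(batch_size, fps):
    """Approximate time to process one batch, in milliseconds.

    Assumes fps is the total number of frames per second over the batch.
    """
    return 1000.0 * batch_size / fps

for platform, batch_size, fps in results:
    latency = batch_latency_ms(batch_size, fps)
    print(f"{platform:<17} BS={batch_size:<3} {latency:8.1f} ms per batch")
```

Under that assumption, a high-throughput entry measured at batch size 32 can still have a longer per-batch latency than a lower-throughput entry measured at batch size 1.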

How to use this model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream SDK, or TensorRT.

The model is intended either for training with TAO Toolkit on the user's own dataset or for use as is. Training can provide high-fidelity models that are adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train the model.

The primary use case intended for the model is segmenting people in a color (RGB) image. The model can be used to segment people in photos and videos by using appropriate video or image decoding and pre-processing. Note that this model performs semantic segmentation, not instance-based segmentation.


Input

Color images of resolution 512x512x3.

Output

Category label (person or background) for every pixel in the input image. Outputs a semantic segmentation mask of people for the input image.
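
To make the input and output shapes concrete, the following NumPy sketch lays out a 512x512 RGB image in channel-first order and turns per-class scores into a person mask. Several details are assumptions, since the card does not state them: that the network exposes per-class scores of shape (2, H, W) rather than a ready-made label map, that class index 1 is the person class, and that no further scaling or normalization is needed (any such step is omitted here). The function names are hypothetical.

```python
import numpy as np

BACKGROUND, PERSON = 0, 1  # assumed class order, not confirmed by the card

def to_model_input(image_hwc):
    """Convert a uint8 RGB image of shape (512, 512, 3) to float32 (3, 512, 512)."""
    if image_hwc.shape != (512, 512, 3):
        raise ValueError(f"expected shape (512, 512, 3), got {image_hwc.shape}")
    # Move the channel axis first; mean/scale normalization is deliberately omitted.
    return np.transpose(image_hwc, (2, 0, 1)).astype(np.float32)

def scores_to_person_mask(scores):
    """Turn per-class scores of shape (2, H, W) into a boolean person mask (H, W)."""
    label_map = scores.argmax(axis=0)  # class with the highest score per pixel
    return label_map == PERSON

# Toy data standing in for a decoded frame and for network output.
frame = np.zeros((512, 512, 3), dtype=np.uint8)
model_input = to_model_input(frame)

scores = np.zeros((2, 4, 4), dtype=np.float32)
scores[PERSON, :2, :] = 1.0  # pretend the top two rows are "person"
mask = scores_to_person_mask(scores)

print(model_input.shape, model_input.dtype)
print(int(mask.sum()), "person pixels")
```

Because the model performs semantic segmentation, the mask marks every person pixel without separating individual people.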

Output image

Instructions to deploy these models with DeepStream

To create an entire end-to-end video analytics application, deploy these models with DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample app.

To deploy these models with DeepStream 6.1, please follow the instructions below:

Download and install DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located in:

/opt/nvidia/deepstream is the default DeepStream installation directory. This path will be different if you are installing in a different directory.

You will need 1 config file and 1 label file. These files are provided in NVIDIA-AI-IOT.

nvinfer_config.txt - File to configure inference settings for PeopleSemSegFormer
labels.txt - Label file with 2 classes

Key Parameters in nvinfer_config.txt

# You can either provide the onnx model and key or trt engine obtained by using tao-converter
# model-engine-file=../../path/to/trt_engine
# Provide path to onnx model
onnx-file=/path/to/onnx-file
# Since the model input channel is 3, using RGB color format.
# Replace this with the input dimensions of your image
infer-dims=3;512;512
## 0=FP32, 1=INT8, 2=FP16 mode
## 0=Detector, 1=Classifier, 2=Semantic Segmentation, 3=Instance Segmentation, 100=Other
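
As a small illustration of how the `infer-dims` value relates to the model input, the sketch below reads `key=value` lines like the ones above and splits `infer-dims` into channels, height, and width. This is a toy reader written for the example, not the parser that DeepStream uses, and `parse_key_values` is a hypothetical helper.

```python
def parse_key_values(text):
    """Collect key=value pairs, skipping blank lines and '#' comment lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

# Example lines modelled on the key parameters shown above.
snippet = """
# model-engine-file=../../path/to/trt_engine
onnx-file=/path/to/onnx-file
infer-dims=3;512;512
"""

settings = parse_key_values(snippet)
# infer-dims is ordered as channels;height;width.
channels, height, width = (int(v) for v in settings["infer-dims"].split(";"))

print(settings["onnx-file"])
print(channels, height, width)
```

The resulting 3, 512, 512 matches the 512x512x3 color input described earlier, with the channel dimension listed first.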

Run ds-tao-segmentation:

ds-tao-segmentation -c configs/segformer_tao/nvinfer_config.txt -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.mp4

Documentation for deploying with DeepStream is provided in the "Deploying to DeepStream" chapter of the TAO User Guide.


Under-represented classes

The NVIDIA PeopleSemSegFormer model was trained to detect classes that are predominantly found in road transport settings. It performs relatively poorly on under-represented classes in our internal Intelligent Transport System dataset. Some of these classes include: rider, truck, train, motorcycle.

Model versions:

  • trainable_PeopleSemSegFormer_v1.0 - PeopleSemSegFormer model intended for training with TAO Toolkit.
  • deployable_PeopleSemSegFormer_v1.0 - PeopleSemSegFormer model deployable to DeepStream.




The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.