Re-Identification Transformer

NVIDIA

SWIN Transformer based Re-Identification network to generate embeddings for identifying persons in different scenes.

ReIdentificationNet Transformer Model Card

Model Overview

The model described in this card is a re-identification network based on the transformer architecture. It generates embeddings for identifying objects captured in different scenes. A pre-trained ReIdentificationNet Transformer model based on public datasets is delivered. The model was first pre-trained via self-supervised learning on ~3 million image crops and then fine-tuned on several supervised person re-identification datasets: Market-1501, a sample of the MTMC people tracking dataset of the 2023 AI City Challenge, and an NVIDIA proprietary dataset.

Model Architecture

The model backbone is a SWIN Transformer. It processes images in non-overlapping windows, using self-attention to learn both local and global features. By adjusting window sizes and positions across layers, the model captures hierarchical representations. The backbone is trained using the training tricks from SOLIDER (Chen et al., 2023, listed under Citations).

The ReIdentificationNet Transformer takes cropped images of objects as input and generates embeddings as output.

Training

The training algorithm optimizes the network to minimize the triplet loss, center loss and cross entropy loss.
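The three loss terms named above can be sketched in plain Python as follows. This card does not state the exact formulations, the loss weights or the margin used in training, so the hinge form of the triplet loss, the margin of 0.3 and all function names here are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumption, not taken from this card.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def center_loss(embedding, class_center):
    """Half the squared distance between an embedding and its class center."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(embedding, class_center))


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]
```

The triplet term pulls crops of the same identity together and pushes other identities away, the center term tightens each identity cluster, and the cross entropy term treats each training identity as a class.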

Training Data

The model is trained on the Market-1501 dataset with 751 annotated people, a sampled version of the MTMC people tracking dataset of the 2023 AI City Challenge with 106 annotated people, and an NVIDIA proprietary dataset with 2969 annotated people. The dataset statistics for the training set are as follows:

Class distribution:

subset	no. total identities	no. total images	no. total cameras	no. real identities	no. real images	no. real cameras	no. synthetic identities	no. synthetic images	no. synthetic cameras
Train	3826	54183	130	3720	44463	44	106	9720	86
Test	809	24324	56	759	20307	13	50	4017	43
Query	808	4122	56	758	3467	13	50	655	43
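As a quick sanity check on the table above, the real and synthetic columns should add up to the totals, and the three training sources should add up to the number of training identities. The short script below copies the numbers from this card; the dictionary layout and the helper name `is_consistent` are illustrative choices.

```python
# (identities, images, cameras) per subset, copied from the table above.
stats = {
    "Train": {"total": (3826, 54183, 130), "real": (3720, 44463, 44), "synthetic": (106, 9720, 86)},
    "Test": {"total": (809, 24324, 56), "real": (759, 20307, 13), "synthetic": (50, 4017, 43)},
    "Query": {"total": (808, 4122, 56), "real": (758, 3467, 13), "synthetic": (50, 655, 43)},
}

# Annotated people per training source, as stated in the Training Data text.
train_sources = {
    "Market-1501": 751,
    "AI City Challenge 2023 sample": 106,
    "NVIDIA proprietary": 2969,
}


def is_consistent(row):
    """True when real + synthetic equals total for every column."""
    return all(r + s == t for t, r, s in zip(row["total"], row["real"], row["synthetic"]))


for subset, row in stats.items():
    print(subset, "consistent:", is_consistent(row))
print("training identities:", sum(train_sources.values()))
```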

Data Format

The data must be organized in the following format.

/data
    /market1501
        /bounding_box_train
            0001_c1s1_01_00.jpg
            0001_c1s1_02_00.jpg
            0002_c1s1_03_00.jpg
            0002_c1s1_04_00.jpg
            0003_c1s1_05_00.jpg
            0003_c1s1_06_00.jpg
            ...
            ...
            ...
            N.jpg
        /bounding_box_test
            0001_c1s1_01_00.jpg
            0001_c1s1_02_00.jpg
            0002_c1s1_03_00.jpg
            0002_c1s1_04_00.jpg
            0003_c1s1_05_00.jpg
            0003_c1s1_06_00.jpg
            ...
            ...
            ...
            N.jpg
        /query
            0001_c1s1_01_00.jpg
            0001_c1s1_02_00.jpg
            0002_c1s1_03_00.jpg
            0002_c1s1_04_00.jpg
            0003_c1s1_05_00.jpg
            0003_c1s1_06_00.jpg
            ...
            ...
            ...
            N.jpg

The dataset should be split into train, test and query directories. Each of these directories contains image crops that follow the naming scheme above.

For example, in the image name 0001_c1s1_01_00.jpg, 0001 is the unique ID assigned to the object, c1 is the camera, s1 is the first sequence of camera c1, and 01 is the first frame in the sequence c1s1. Data after the third _ are ignored.
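The naming scheme can be parsed with a small helper such as the one below. This is a minimal sketch, not part of TAO: the names `CropName` and `parse_crop_name` are hypothetical, and the pattern assumes that all four fields are non-negative integers.

```python
import re
from typing import NamedTuple


class CropName(NamedTuple):
    object_id: int
    camera: int
    sequence: int
    frame: int


# <id>_c<camera>s<sequence>_<frame>, then an optional "_..." tail that is ignored.
_PATTERN = re.compile(r"^(\d+)_c(\d+)s(\d+)_(\d+)(?:_.*)?\.\w+$")


def parse_crop_name(filename: str) -> CropName:
    """Split a crop file name into object ID, camera, sequence and frame."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected crop file name: {filename!r}")
    return CropName(*(int(group) for group in match.groups()))


print(parse_crop_name("0001_c1s1_01_00.jpg"))
```

Grouping crops by `object_id` gives the identity labels, while `camera` tells which view a crop came from.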

Performance

Test Data

As shown in the class distribution table above, the test set contains the same identities as the query set. The goal is to retrieve, for each query, the test samples that share its identity.

Methodology and KPI

The key performance indicators are the ranked accuracy of re-identification and the mean average precision (mAP).

Rank-K accuracy: A method of computing accuracy in which the top-K highest-confidence labels are compared with the ground truth label. If the ground truth label is among these top-K labels, the prediction is counted as accurate. This gives an overall accuracy measurement that stays lenient when the classes are numerous and very similar. In our case, we compute rank-1, rank-5 and rank-10 accuracies. For rank-10, for example, a sample is counted as correct if any of its 10 highest-confidence predicted labels matches the ground truth label.
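In a retrieval setting, rank-K accuracy can be computed by ranking the test (gallery) embeddings by similarity to each query and checking whether the query ID appears among the top K. The sketch below assumes cosine similarity as the ranking score and uses hypothetical function names. It omits any filtering of gallery samples (for example by camera) that a full evaluation protocol may apply.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_k_accuracy(query_embs, query_ids, gallery_embs, gallery_ids, k):
    """Fraction of queries whose ID appears among the k most similar gallery items."""
    hits = 0
    for emb, query_id in zip(query_embs, query_ids):
        # Gallery indices sorted from most to least similar to this query.
        order = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine_similarity(emb, gallery_embs[i]),
            reverse=True,
        )
        top_ids = [gallery_ids[i] for i in order[:k]]
        hits += query_id in top_ids
    return hits / len(query_ids)


# Toy data: 2-D vectors stand in for the real embeddings.
gallery_embs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
gallery_ids = [1, 2, 2]
query_embs = [[1.0, 0.1], [1.0, 0.2]]
query_ids = [1, 2]

print(rank_k_accuracy(query_embs, query_ids, gallery_embs, gallery_ids, k=1))
print(rank_k_accuracy(query_embs, query_ids, gallery_embs, gallery_ids, k=2))
```

In this toy data the second query's closest gallery item has the wrong ID, so it counts as a miss at rank-1 but as a hit at rank-2.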

Mean average precision (mAP): Precision measures how accurate the predictions are, in our case the predicted IDs of objects. In other words, it is the percentage of predictions that are correct. mAP is the mean of the average precision (AP), where AP is computed for each class, in our case each ID.

The following scores are for models trained on the Market-1501 and AI City Challenge 2023 datasets. The evaluation and training sets are disjoint.
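A minimal sketch of AP and mAP for retrieval is shown below. It assumes the common re-identification convention of computing AP per query over a ranked list of gallery IDs and then averaging over queries. The function names are hypothetical, and the ranked lists are assumed to come from an earlier similarity-ranking step.

```python
def average_precision(ranked_gallery_ids, query_id):
    """Mean of precision@rank taken at every rank where a correct ID appears."""
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_gallery_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank
    # A query with no correct gallery item contributes an AP of 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """rankings: list of (ranked_gallery_ids, query_id) pairs, one per query."""
    return sum(average_precision(ids, qid) for ids, qid in rankings) / len(rankings)


rankings = [
    ([1, 2, 1, 3], 1),  # correct IDs at ranks 1 and 3
    ([2, 1], 1),        # correct ID at rank 2
]
print(mean_average_precision(rankings))
```

Unlike rank-K accuracy, AP rewards placing every correct gallery item near the top of the ranking, not just the first one.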

model	feature dimension	number of training identities	mAP (%)	rank-1 accuracy (%)	rank-5 accuracy (%)	rank-10 accuracy (%)
swin_tiny_market1501_aicity156_featuredim256	256	857	93.3	95.3	97.9	98.6
swin_base_market1501_aicity156_featuredim256	256	857	93.4	95.3	98.0	98.4
swin_base_market1501_aicity156_featuredim1024	1024	857	94.6	96.2	98.3	98.8
swin_tiny_v2*	256	3826	94.9	96.9	98.4	99.0

(*) indicates a model that was also trained on the NVIDIA proprietary dataset.

How to use this model

This model needs to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. This model can only be used with Train Adapt Optimize (TAO) Toolkit, DeepStream SDK or TensorRT.

The primary intended use case for this model is to generate embeddings for an object and then perform similarity matching.
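Similarity matching on the generated embeddings can be as simple as a nearest-neighbor lookup. The sketch below assumes cosine similarity on L2-normalized, non-zero embeddings. The function names, the gallery layout and the threshold of 0.5 are hypothetical and would need tuning on real data.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def best_match(query, gallery, threshold=0.5):
    """Return (name, score) of the most similar gallery entry, or None.

    gallery maps a name to an embedding. None is returned when no entry
    reaches the (hypothetical) similarity threshold.
    """
    unit_query = l2_normalize(query)
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        # Dot product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(unit_query, l2_normalize(embedding)))
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None
    return best_name, best_score


# Toy 2-D vectors stand in for the real embeddings.
gallery = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
print(best_match([0.9, 0.1], gallery))
print(best_match([-1.0, -1.0], gallery))
```

Returning None below the threshold lets a caller treat the crop as a new, previously unseen identity instead of forcing a match.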

The following pre-trained models are provided:

swin_tiny_market1501_aicity156
swin_base_market1501_aicity156
swin_tiny_v2

These models are intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's own re-identification dataset. High-fidelity models can be trained for new use cases. The Jupyter notebook available as part of the TAO container can be used to re-train.

The model is also intended for easy deployment to the edge using DeepStream SDK or TensorRT. DeepStream provides facilities to create efficient video analytics pipelines that capture, decode and pre-process the data before running inference.

Use the following key for all TAO commands that require a model load key.

Model load key: nvidia_tao

Input

B X 3 X 256 X 128 (B C H W)

Output

The feature embedding

Instructions to use the model with TAO toolkit

To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train a ReIdentificationNet Transformer. For more information on the experiment spec file, refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

model:
  backbone: swin_tiny_patch4_window7_224
  last_stride: 1
  pretrain_choice: self
  pretrained_model_path: /path/to/swintiny_pretrained.pth
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  stride_size: [16, 16]
  reduce_feat_dim: True
  feat_dim: 256
  no_margin: True
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: False
  pretrain_hw_ratio: 2

Limitations

NVIDIA ReIdentificationNet Transformer is trained on the Market-1501 dataset with 751 unique person classes and a sampled version of the MTMC people tracking dataset of the 2023 AI City Challenge with 106 unique person classes. The accuracy of the model on external images is not expected to reach the numbers reported in the Performance section.

In general, to get better accuracy, more labeled data are needed to fine-tune the pre-trained model through TAO Toolkit.

Model versions:

trainable_v1.0 - Pre-trained model for re-identification transformer.
deployable_v1.0 - Model for re-identification transformer deployable to DeepStream or TensorRT.

Reference

Citations

W. Chen, X. Xu, J. Jia, H. Luo, Y. Wang, F. Wang, R. Jin, and X. Sun, "Beyond Appearance: a Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, "TransReID: Transformer-Based Object Re-Identification," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 15013-15022.
H. Luo, Y. Gu, X. Liao, S. Lai and W. Jiang, "Bag of Tricks and a Strong Baseline for Deep Person Re-Identification," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1487-1495, doi: 10.1109/CVPRW.2019.00190.
L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang and Q. Tian, "Scalable Person Re-identification: A Benchmark," 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116-1124, doi: 10.1109/ICCV.2015.133.
M. Naphade, S. Wang, D. C. Anastasiu, Z. Tang, M.-C. Chang, Y. Yao, L. Zheng, M. S. Rahman, M. S. Arya, A. Sharma, Q. Feng, V. Ablavsky, S. Sclaroff, P. Chakraborty, S. Prajapati, A. Li, S. Li, K. Kunadharaju, S. Jiang and R. Chellappa, "The 7th AI City Challenge," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023.

Using TAO Pre-trained Models

Get TAO Container
Get other purpose-built models from the NGC model registry

License

License to use the model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Technical blogs

Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
Improve accuracy and robustness of Vision AI models with vision transformers and NVIDIA TAO
Train like a ‘pro’ without being an AI expert using TAO AutoML
Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
Customize Action Recognition with TAO and deploy with DeepStream
Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
Model accuracy is extremely important, learn how you can achieve state of the art accuracy for classification and object detection models using TAO

Ethical AI

NVIDIA ReIdentificationNet Transformer model creates embeddings for identifying objects captured in different scenes.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.

Publisher: NVIDIA

Latest Version: swin_base_1024

Updated: September 23, 2024 UTC

Compressed Size: 836.76 MB

Labels: Computer Vision, CV, Deep Learning, TAO, TAO Toolkit
