The model described in this card, TAO Sparse4D, is an advanced three-dimensional (3D) multi-camera detection and tracking network. It generates precise 3D bounding boxes and tracking IDs for a diverse set of objects across multiple camera views. The provided model is pre-trained on the Multi-Target Multi-Camera (MTMC) Tracking 2025 subset of the NVIDIA PhysicalAI-SmartSpaces dataset, which was also used for the 2025 AI City Challenge.
The model in this card was trained & evaluated on the following moving object classes: Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter.
This model is ready for commercial use.
License to use this model is covered by the NVIDIA Community Model License. By downloading the model, you accept the terms and conditions of this license.
Global
TAO Sparse4D is designed for 3D multi-camera object detection and tracking in indoor environments like warehouses and logistics facilities. The model detects & tracks objects across multiple camera views for applications including warehouse automation, safety and workflow optimization in industrial settings, providing spatial understanding of a scene.
NGC - 06/13/2025
Architecture Type: Convolutional Neural Network (CNN) backbone with Transformer-based decoder layers.
Network Architecture: ResNet101 backbone with Transformer-based decoder layers.
This work leverages the Sparse4D v3 model and is tailored for indoor environments such as warehouses with static camera setups. Sparse4D is a query-based technique that samples sparse features, giving better computational efficiency than dense 3D detection & tracking techniques. The architecture features a ResNet101 backbone and processes time-synchronized frames from multiple cameras. Key components include the backbone, a Feature Pyramid Network (FPN), and multiple decoder layers incorporating Multi-Scale Deformable Aggregation blocks (featuring key-point generation, feature sampling, a visibility net module & embedding generation) along with refinement & classification layers. The model is trained with regression, classification & ID losses. It also includes an Instance Bank module used for tracking ID assignment & management. Along with the 3D bounding box information & object ID, Sparse4D also outputs an instance feature containing high-dimensional semantics from the image encoder.
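For orientation only, the data flow described above can be sketched as follows; the module names and signatures are illustrative placeholders, not the TAO implementation or API:

```python
# Conceptual sketch of the Sparse4D data flow described above.
# Module names and signatures are illustrative placeholders, not the TAO API.
import torch
import torch.nn as nn

class Sparse4DFlow(nn.Module):
    def __init__(self, backbone: nn.Module, fpn: nn.Module,
                 decoder_layers: nn.ModuleList, instance_bank: nn.Module):
        super().__init__()
        self.backbone = backbone              # e.g. ResNet101 image encoder
        self.fpn = fpn                        # Feature Pyramid Network over backbone stages
        self.decoder_layers = decoder_layers  # deformable aggregation + refinement/classification
        self.instance_bank = instance_bank    # caches instances across frames for ID assignment

    def forward(self, imgs: torch.Tensor, projection_mat: torch.Tensor):
        # imgs: (B, N, C, H, W) time-synchronized frames from N cameras
        b, n = imgs.shape[:2]
        feats = self.fpn(self.backbone(imgs.flatten(0, 1)))  # multi-scale features per camera
        queries, anchors = self.instance_bank.get(b)          # sparse object queries + 3D anchors
        for layer in self.decoder_layers:
            # sample sparse key-point features from the camera views, then refine boxes
            queries, anchors = layer(queries, anchors, feats, projection_mat)
        # 3D boxes, scores, tracking IDs and instance features are derived from the refined queries
        return self.instance_bank.update(queries, anchors)
```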
Sparse4D is trained on RGB images, camera calibration files, ground truth 3D bounding boxes & object IDs, and optional depth maps. Since performing 3D multi-camera detection and tracking across large regions like warehouses with high-density camera setups is computationally expensive, we partition large regions into overlapping camera groups called Bird's Eye View (BEV) groups. Each BEV group contains multiple cameras and serves as the fundamental training unit for Sparse4D.
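Purely for illustration (the grouping is produced by the TAO Data Service, and the names below are borrowed from the example pickle contents later in this card), a BEV group is simply a named, overlapping subset of a scene's cameras:

```python
# Hypothetical BEV group: an overlapping subset of cameras from one scene
# that serves as a single training unit for Sparse4D.
bev_group = {
    "scene": "Warehouse_014",
    "group_name": "bev-sensor-training-1",
    "cameras": ["Camera", "Camera_01", "Camera_02", "Camera_03"],
}
```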
Input Type(s): For each BEV group, the cameras provide RGB images, a camera calibration file, ground truth with 3D bounding boxes & optional depth maps, available via HuggingFace or generated synthetically with NVIDIA Isaac Sim Replicator.
Input Format: Red, Green, Blue (RGB) images stored as raw png/jpg/hdf5 and depth maps stored as png/hdf5. Supports an input resolution of 3 x 1080 x 1920 for both RGB images & depth maps. Data preprocessing is required to group images & depth maps from multiple cameras into the appropriate BEV groups; this can be done via the TAO Data Service.
Input Parameters: Multiple dimensions. See below for detailed model input shapes.
Other Properties Related to Input: 3 x 1080 x 1920 (C x H x W) resolution images for both RGB images & depth maps. Data pre-processing is needed via the TAO Data Service. No alpha channel required.
Note that depth maps are optional and not required for evaluation or inference. The raw model inputs are as follows:
Dimension | Description |
---|---|
B | Batch size |
C | Number of channels |
N | Number of cameras |
H | Image height |
W | Image width |
Q | Number of queries |
M | Number of output boxes |
E | Number of instance features |
Input Name | Type | Shape | ResNet101 Shape | Description |
---|---|---|---|---|
img | List[Tensor] | (B, N, C, H, W) | (1, N, 3, 512, 1408) | Input image tensor |
projection_mat | List[Tensor] | (B, N, 4, 4) | (1, 10, 4, 4) | List of projection matrices |
image_wh | List[Tensor] | (B, N, 2) | (1, 10, 2) | List of image widths and heights |
input_cached_feature | List[Tensor] | (B, 600, 256) | (1, 600, 256) | List of cached features |
input_cached_anchor | List[Tensor] | (B, 600, 11) | (1, 600, 11) | List of cached anchors |
prev_exists | List[Tensor] | (B) | (1) | Indicates whether a previous frame exists |
interval_mask | List[Tensor] | (B, 1, 1) | (1, 1, 1) | Boolean interval mask |
Note: input_cached_feature, input_cached_anchor, prev_exists & interval_mask must be provided explicitly at the first frame; from the second frame onwards these tensors are updated automatically via the instance bank (see the sketch after the output table below).
Output Type(s): Tensors consisting of 3D bounding boxes, object confidence scores, classes, class confidence scores, tracking object IDs & instance features
Output Format: List of Tensors
Output Parameters: Multiple dimensions. See below for detailed model output shapes.
Other Properties Related to Output: Please see the details below.
The final output is a list with length of batch size after post-processing. Each element is a dictionary with the following keys:
Output Name | Type | Shape | ResNet Shape | Description |
---|---|---|---|---|
boxes_3d | List[Tensor] | (B, M, 10) | (1, 600, 10) | List of 3D boxes: (x, y, z, w, l, h, yaw, vx, vy, vz) in OV coordinates |
scores_3d | List[Tensor] | (B, M) | (1, 600) | List of object confidence scores (classification score × centerness score) |
labels_3d | List[Tensor] | (B, M) | (1, 600) | List of class labels |
cls_scores | List[Tensor] | (B, M) | (1, 600) | List of classification scores |
instance_ids | List[Tensor] | (B, M) | (1, 600) | List of instance IDs |
instance_feats | List[Tensor] | (B, M, E) | (1, 600, 256) | List of instance features |
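The shapes in the two tables above can be exercised with a minimal sketch like the following; the model handle and its call signature are hypothetical, so treat this only as an illustration of the expected tensor shapes and first-frame initialization:

```python
# Minimal sketch (shapes follow the tables above; the model call itself is hypothetical).
import torch

B, N, C, H, W = 1, 10, 3, 512, 1408   # batch, cameras, channels, height, width
Q, E = 600, 256                        # cached instances, instance-feature dimension

first_frame_inputs = {
    "img": torch.zeros(B, N, C, H, W),             # preprocessed RGB frames
    "projection_mat": torch.zeros(B, N, 4, 4),     # per-camera projection matrices
    "image_wh": torch.tensor([[[W, H]] * N], dtype=torch.float32),
    "input_cached_feature": torch.zeros(B, Q, E),  # empty instance-bank cache at frame 0
    "input_cached_anchor": torch.zeros(B, Q, 11),
    "prev_exists": torch.zeros(B),                 # 0 -> no previous frame (assumption)
    "interval_mask": torch.zeros(B, 1, 1, dtype=torch.bool),
}

# outputs = model(**first_frame_inputs)   # hypothetical call; see the TAO docs for the real API
# Each element of the post-processed output list is a dict keyed as in the table above, e.g.:
# boxes = outputs[0]["boxes_3d"]       # (M, 10): x, y, z, w, l, h, yaw, vx, vy, vz
# ids   = outputs[0]["instance_ids"]   # (M,)  : tracking IDs
```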
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating Systems:
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA DataCenter GPU. For software, the models are specifically designed for the TAO Toolkit.
The training algorithm optimizes the network to minimize the classification loss, regression loss & ID loss.
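Schematically, the training objective is a weighted sum of these three terms (the λ weights here are placeholders; the actual weights appear in the loss section of the experiment spec later in this card):

$$\mathcal{L}_{\text{total}} = \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{reg}} + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}$$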
The model is trained on the MTMC Tracking 2025 dataset available on HuggingFace. The dataset statistics used for pretraining the model are as follows:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Subset | No. of scenes | No. of BEV groups | No. of objects per scene | No. of cameras per scene | Duration of each camera sequence | FPS |
---|---|---|---|---|---|---|
Train | 111 | 122 | 15-120 | 4-32 | 5 mins | 30 FPS |
The object classes are Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter. This raw data includes .mp4 RGB video files, depth maps stored in HDF5 format, ground_truth.json, calibration.json & map.png. For more details on these files, including the ground truth format, please refer to the raw dataset format here.
The raw data may also be obtained from the NVIDIA Isaac Sim Replicator Agent (IRA). When collected via this route, RGB image data & depth maps can be in frame output or HDF5 output. Utilize the TAO Data Service to convert your raw data format (AICity) to the pickle format (OVPKL) accordingly.
The above raw data format needs to be converted to pickle files for model training via the TAO Data Service. The model will utilize these pickle files along with the above files for training. An example command can be found in the sparse4d finetuning notebook.
The process generates pickle files (`.pkl`) containing scene information. Each pickle file is a dictionary with the following top-level keys:
metadata: (dict) Contains metadata about the dataset.
infos: (list) A list of dictionaries, where each dictionary contains information about a specific frame in the scene.
metadata key
The metadata dictionary contains the following keys:
Key | Type | Description | Example Value |
---|---|---|---|
version | str | Version of the dataset or split. | 'trainval_split' |
split_type | str | Type of data split (e.g., 'all', 'train', 'validation'). | 'all' |
infos key
The infos key holds a list of dictionaries. Each dictionary in this list corresponds to a frame and contains the following keys:
Key | Type | Description | Example Value (Illustrative) |
---|---|---|---|
frame_idx | int | Index of the frame. | 0 |
cams | dict | Dictionary containing data for each camera in the frame. Keys are camera names (e.g., 'Camera', 'Camera_01'). | See cams Structure below. |
scene_name | str | Name of the scene. | 'Warehouse_014+bev-sensor-training-1' |
timestamp | float | Timestamp of the frame. | 0.0 |
token | str | Unique token for the frame. | 'Warehouse_014+bev-sensor-training-1__000000000' |
group_name | str | Name of the group this frame belongs to. | 'bev-sensor-training-1' |
instance_inds | np.ndarray | Array of instance indices present in the frame. | array([372, 171, 172, 631, 663]) |
asset_inds | np.ndarray | Array of asset indices corresponding to instances. | array([771, 772, 775, 2094, 1372]) |
gt_boxes | np.ndarray | Ground truth bounding boxes for objects in the frame. Shape: (N, 7) where N is the number of objects. Each row is [x, y, z, dx, dy, dz, heading]. | array([[-0.686, 1.073, ..., -0. ], ...]) |
gt_names | np.ndarray | Array of ground truth names for objects. | array(['agility_digit', 'gr1_t2', ...], dtype='<U18') |
gt_velocity | np.ndarray | Ground truth velocity for objects. Shape: (N, 3) for [vx, vy, vz]. | array([[0., 0., 0.], ...]) |
valid_flag | np.ndarray | Boolean array indicating if the ground truth data for each object is valid. | array([ True, True, ...]) |
gt_visibility | list | List of dictionaries, one for each ground truth object, indicating its visibility percentage in each camera. | [{'Camera': 1.0, 'Camera_01': 1.0, ...}, ...] |
cams Structure
The cams dictionary (within each element of the infos list) contains nested dictionaries, where each key is a camera identifier (e.g., 'Camera', 'Camera_01', etc.). Each of these camera-specific dictionaries has the following structure:
Key | Type | Description | Example Value (Illustrative) |
---|---|---|---|
data_path | tuple | Tuple containing paths to camera data: (h5_file_path, rgb_image_relative_path) | ('data/mtmc/Warehouse_014/Camera.h5', 'rgb/rgb_00000.jpg') |
depth_map_path | tuple | Tuple containing paths to depth map data: (h5_file_path, depth_image_relative_path) | ('data/mtmc/Warehouse_014/Camera.h5', 'distance_to_image_plane_png/distance_to_image_plane_00000.png') |
sample_data_token | str | Unique token for the sample data from this camera. | 'Warehouse_014+bev-sensor-training-1__000000000+Camera' |
cam_intrinsic | np.ndarray | 3x3 camera intrinsic matrix. | array([[916.249, 0., 960.], [0., 916.249, 540.], [0., 0., 1.]]) |
sensor2world_transform | np.ndarray | 4x4 transformation matrix from sensor coordinates to world coordinates. | array([[0.018, -0.999, ..., -4.731], ..., [0., 0., 0., 1.]]) |
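As a sanity check, the structure described above can be inspected directly in Python; the pickle file name below is hypothetical:

```python
# Minimal sketch (hypothetical .pkl path) for inspecting an OVPKL file produced
# by the TAO Data Service, following the structure described above.
import pickle

with open("Warehouse_014_train.pkl", "rb") as f:    # hypothetical file name
    data = pickle.load(f)

print(data["metadata"]["version"], data["metadata"]["split_type"])

frame = data["infos"][0]                             # first frame of the scene
print(frame["scene_name"], frame["frame_idx"], frame["gt_boxes"].shape)   # e.g. (N, 7)

cam = frame["cams"]["Camera"]                        # per-camera entry
h5_path, rgb_rel_path = cam["data_path"]             # (h5_file_path, rgb_image_relative_path)
print(h5_path, rgb_rel_path, cam["cam_intrinsic"].shape)                  # (3, 3) intrinsics
```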
Both validation & testing were conducted on a random scene from the MTMC Tracking 2025 subset.
Data Collection Method by dataset:
Labeling Method by dataset:
Properties (Quantity, Dataset Descriptions, Sensor(s)): We utilize a random scene from the test set of the MTMC Tracking 2025 subset.
The key performance indicators are the per-class average precision (AP) and the mean average precision (mAP) across all classes. We utilize the nuScenes-based 3D detection evaluation protocol to evaluate model accuracy.
Average Precision (AP) quantifies a detector's ability to trade off precision and recall for a single object category at a given center-distance threshold by computing the normalized area under its precision–recall curve.
Mean Average Precision (mAP) is derived from these AP values by averaging over all target object classes and a set of predefined center‐distance thresholds (0.5, 1, 2, and 4 m). This yields a single scalar that reflects both classification and geospatial localization accuracy. mAP thus serves as a rigorous, holistic metric for comparing 3D detection performance.
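For clarity, a minimal sketch of this averaging (the AP values below are made up; the real APs come from the nuScenes-style evaluation):

```python
# Illustrative sketch only: nuScenes-style mAP as the mean of per-class AP values
# over the center-distance matching thresholds listed above (0.5, 1, 2, 4 m).
import numpy as np

DIST_THRESHOLDS_M = (0.5, 1.0, 2.0, 4.0)

def mean_average_precision(ap: dict) -> float:
    """ap maps (class_name, distance_threshold) -> AP in [0, 1]."""
    values = [ap[(cls, thr)] for cls, thr in ap]   # all per-class, per-threshold APs
    return float(np.mean(values))

# Hypothetical usage with two classes and made-up AP values:
ap = {("Person", t): 0.98 for t in DIST_THRESHOLDS_M}
ap.update({("Nova_Carter", t): 0.90 for t in DIST_THRESHOLDS_M})
print(mean_average_precision(ap))   # -> approximately 0.94
```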
The following scores are for models trained on the MTMC Tracking 2025 subset. The evaluation set and training set are disjoint.
Object Class | AP |
---|---|
Person | 0.989 |
Fourier_GR1_T2_Humanoid | 0.944 |
Agility_Digit_Humanoid | 0.989 |
Nova_Carter is ignored due to its low object count.
Final mAP: 0.974
Model inference is performed using the Spatial AI DeepStream Pipeline, which uses NVIDIA TensorRT. The model runs at mixed precision (FP16+FP32). The measurements below ignore data transfer time from host to device (H2D) and device to host (D2H), as well as other components such as pre-/post-processing of images & tensors. trtexec measurements can be found below:
GPU | No. of cameras | Mean latency per batch | Mean FPS per batch |
---|---|---|---|
1 x A6000 Ampere - 48GB | 5 | 32.456 ms | 30.81 |
1 x L40S - 48GB | 8 | 28.9551 ms | 34.53 |
1 x H100 SXM HBM3 - 80GB | 19 | 32.2358 ms | 31.01 |
1 x 6000 Ada - 48GB | 7 | 29.5581 ms | 33.83 |
1 x L4 - 24GB | 2 | 27.30 ms | 36.64 |
To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train a Sparse4D model. For more information on the experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.
model:
type: "sparse4d"
use_grid_mask: true
use_deformable_func: true
use_temporal_align: true
input_shape: [1408, 512]
embed_dims: 256
neck:
type: "FPN"
num_outs: 4
start_level: 0
out_channels: 256
in_channels: [256, 512, 1024, 2048]
add_extra_convs: "on_output"
relu_before_extra_convs: true
depth_branch:
type: "dense_depth"
embed_dims: "${model.embed_dims}"
num_depth_layers: 3
loss_weight: 0.2
head:
type: "sparse4d"
num_output: 300
cls_threshold_to_reg: 0.05
decouple_attn: true
return_feature: true
use_reid_sampling: false
embed_dims: "${model.embed_dims}"
num_groups: 8
num_decoder: 6
num_single_frame_decoder: 1
drop_out: 0.1
temporal: true
with_quality_estimation: true
instance_bank:
num_anchor: 900
anchor: ???
num_temp_instances: 600
confidence_decay: 0.8
feat_grad: false
default_time_interval: 0.033333
embed_dims: "${model.embed_dims}"
use_temporal_align: "${model.use_temporal_align}"
anchor_encoder:
type: 'SparseBox3DEncoder'
vel_dims: 3
embed_dims: [128, 32, 32, 64]
mode: 'cat'
output_fc: false
in_loops: 1
out_loops: 4
operation_order: [
"deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
"deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
"deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
"deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
"deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
"deformable", "ffn", "norm", "refine"
]
temp_graph_model:
type: "MultiheadAttention"
embed_dims: 512
num_heads: 8
batch_first: true
dropout: 0.1
graph_model:
type: "MultiheadAttention"
embed_dims: "${model.head.temp_graph_model.embed_dims}"
num_heads: "${model.head.temp_graph_model.num_heads}"
batch_first: true
dropout: "${model.head.temp_graph_model.dropout}"
norm_layer:
type: "LN"
normalized_shape: "${model.embed_dims}"
ffn:
type: "AsymmetricFFN"
in_channels: 512
pre_norm:
type: "LN"
embed_dims: 256
feedforward_channels: 1024
num_fcs: 2
ffn_drop: 0.1
act_cfg:
type: "ReLU"
inplace: true
deformable_model:
embed_dims: "${model.embed_dims}"
num_groups: 8
num_levels: 4
attn_drop: 0.15
use_deformable_func: true
use_camera_embed: false
residual_mode: "cat"
kps_generator:
embed_dims: "${model.embed_dims}"
num_learnable_pts: 6
fix_scale:
- [0, 0, 0]
- [0.45, 0, 0]
- [-0.45, 0, 0]
- [0, 0.45, 0]
- [0, -0.45, 0]
- [0, 0, 0.45]
- [0, 0, -0.45]
refine_layer:
type: "SparseBox3DRefinementModule"
embed_dims: "${model.embed_dims}"
refine_yaw: true
with_quality_estimation: true
sampler:
num_dn_groups: 5
num_temp_dn_groups: 3
dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
max_dn_gt: 128
add_neg_dn: true
cls_weight: 2.0
box_weight: 0.25
reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
use_temporal_align: "${model.use_temporal_align}"
visibility_net:
type: "visibility_net"
embedding_dim: 256
hidden_channels: 32
loss:
reg:
type: "sparse_box_3d"
box_weight: 0.25
cls_allow_reverse: [5, 6, 7]
cls:
type: "focal"
use_sigmoid: true
gamma: 2.0
alpha: 0.25
loss_weight: 2.0
id:
type: "cross_entropy_label_smooth"
num_ids: "${dataset.num_ids}"
bnneck:
type: "bnneck"
feat_dim: 256
num_ids: "${dataset.num_ids}"
decoder:
type: "SparseBox3DDecoder"
score_threshold: 0.05
reg_weights: [2.0, 2.0, 2.0, 1 ,1, 1, 1, 1, 1, 1, 1]
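As a side note, the `${model.embed_dims}`-style entries above are interpolation references that resolve to other keys in the same spec, and `???` marks a value that must be supplied (here, the anchor file). Assuming the spec follows OmegaConf semantics (an inference from the syntax, not a documented guarantee) and is saved to a hypothetical `experiment.yaml`, the references resolve like this:

```python
# Sketch only: resolving the "${model.embed_dims}"-style references in the spec above,
# assuming OmegaConf-style interpolation and a hypothetical file path.
from omegaconf import OmegaConf

cfg = OmegaConf.load("experiment.yaml")   # hypothetical path holding the spec above
print(cfg.model.embed_dims)               # 256
print(cfg.model.head.embed_dims)          # 256, resolved lazily from ${model.embed_dims}
```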
For deployment of this model, please refer to our NVIDIA SpatialAI release documentation.
Acceleration Engine: TensorRT
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Sub-cards. Please report security vulnerabilities or NVIDIA AI Concerns here.