Description: TAO Pretrained Sparse4D with ResNet101 Backbone
Publisher: NVIDIA
Latest Version: trainable_v1.0
Modified: July 24, 2025
Size: 901.75 MB

TAO Sparse4D Model Card

Model Overview

Description

TAO Sparse4D is an advanced three-dimensional (3D) multi-camera detection and tracking network. It generates precise 3D bounding boxes and tracking IDs for a diverse set of objects across multiple camera views. The provided model is pre-trained on the Multi-Target Multi-Camera (MTMC) Tracking 2025 subset of the NVIDIA PhysicalAI-SmartSpaces dataset, which was also used for the 2025 AI City Challenge.

The model in this card was trained & evaluated on the following moving object classes: Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter.

This model is ready for commercial use.

License/Terms of Use

License to use these models is covered by the NVIDIA Community Model License. By downloading the model, you accept the terms and conditions of this license.

Deployment Geography:

Global

Use Case:

TAO Sparse4D is designed for 3D multi-camera object detection and tracking in indoor environments like warehouses and logistics facilities. The model detects & tracks objects across multiple camera views for applications including warehouse automation, safety and workflow optimization in industrial settings, providing spatial understanding of a scene.

Release Date:

NGC - 06/13/2025

References

  • Lin, X., Pei, Z., Lin, T., Huang, L., & Su, Z. (2023). Sparse4d v3: Advancing end-to-end 3d detection and tracking. arXiv preprint arXiv:2311.11722.
  • Lin, X., Lin, T., Pei, Z., Huang, L., & Su, Z. (2023). Sparse4d v2: Recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018.
  • Lin, X., Lin, T., Pei, Z., Huang, L., & Su, Z. (2022). Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581.

Model Architecture

Architecture Type: Convolutional Neural Network (CNN) backbone with Transformer-based decoder layers.

Network Architecture: ResNet101 backbone with Transformer-based decoder layers.

This work leverages the Sparse4D v3 model and is tailored for indoor environments like warehouses with static camera setups. Sparse4D is a query-based technique that samples sparse features for better computational efficiency than dense 3D detection and tracking techniques. The architecture features a ResNet101 backbone and processes time-synchronized frames from multiple cameras. Key components include the backbone, a Feature Pyramid Network (FPN), and multiple decoder layers incorporating Multi-Scale Deformable Aggregation blocks (featuring key-point generation, feature sampling, a visibility net module, and embedding generation) along with refinement and classification layers. The model is trained with regression, classification, and ID losses. It also includes an Instance Bank module used for tracking ID assignment and management. Along with the 3D bounding box information and object ID, Sparse4D outputs an instance feature containing high-dimensional semantics from the image encoder.

Input:

Sparse4D is trained on RGB images, camera calibration files, ground truth 3D bounding boxes and object IDs, and optional depth maps. Since performing 3D multi-camera detection and tracking across large regions like warehouses with high-density camera setups is computationally expensive, we partition large regions into overlapping groups called Bird's Eye View (BEV) groups. Each BEV group contains multiple cameras and serves as the fundamental training unit for Sparse4D.

Input Type(s): Each BEV group has cameras with RGB images, a camera calibration file, ground truth with 3D bounding boxes, and optional depth maps, available via HuggingFace or generated synthetically with NVIDIA Isaac Sim Replicator.
Input Format: Red, Green, Blue (RGB) images stored in raw png/jpg/hdf5 and depth maps stored in png/hdf5. Supports an input resolution of 3 x 1080 x 1920 for both RGB images and depth maps. Data preprocessing is required to group images and depth maps from multiple cameras into the appropriate BEV groups; this can be done via the TAO Data Service.
Input Parameters: Multiple dimensions. See below for detailed model input shapes.
Other Properties Related to Input: 3 x 1080 x 1920 (C x H x W) resolution images for both RGB images and depth maps. Data preprocessing is needed via TAO Data Services. No alpha channel required.

Note that depth maps are optional and not required for evaluation or inference. The raw model inputs are as follows:

| Dimension | Description |
|-----------|-------------|
| B | Batch size |
| C | Number of channels |
| N | Number of cameras |
| H | Image height |
| W | Image width |
| Q | Number of queries |
| M | Number of output boxes |
| E | Number of instance features |

| Input Name | Type | Shape | ResNet101 Shape | Description |
|------------|------|-------|-----------------|-------------|
| img | List[Tensor] | (B, N, C, H, W) | (1, N, 3, 512, 1408) | Input image tensor |
| projection_mat | List[Tensor] | (B, N, 4, 4) | (1, 10, 4, 4) | List of projection matrices |
| image_wh | List[Tensor] | (B, N, 2) | (1, 10, 2) | List of image width and height |
| input_cached_feature | List[Tensor] | (B, 600, 256) | (1, 600, 256) | List of cached features |
| input_cached_anchor | List[Tensor] | (B, 600, 11) | (1, 600, 11) | List of cached anchors |
| prev_exists | List[Tensor] | (B) | (1) | Indicates if previous frame exists or not |
| interval_mask | List[Tensor] | (B, 1, 1) | (1, 1, 1) | Boolean to describe the interval mask |
  • The number of cameras (N) is dynamic in the model.
  • The model is initialized with zero tensors for input_cached_feature, input_cached_anchor, prev_exists, and interval_mask at the first frame, as illustrated in the sketch below. These tensors are updated automatically via the instance bank from the second frame onwards.
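
A minimal sketch of the first-frame initialization described above, assuming PyTorch tensors. The TAO and DeepStream pipelines prepare these inputs automatically; the camera count of 10 and the dictionary layout here are only illustrative.

```python
import torch

# Example shapes for one BEV group: batch size 1, 10 cameras,
# 512 x 1408 network input resolution, 600 cached instances.
B, N, H, W = 1, 10, 512, 1408

first_frame_inputs = {
    "img": torch.zeros(B, N, 3, H, W),              # placeholder for preprocessed RGB frames
    "projection_mat": torch.zeros(B, N, 4, 4),      # placeholder for per-camera projection matrices
    "image_wh": torch.tensor([W, H], dtype=torch.float32).expand(B, N, 2),
    # Temporal state is all zeros on the first frame and is filled in by the
    # instance bank from the second frame onwards.
    "input_cached_feature": torch.zeros(B, 600, 256),
    "input_cached_anchor": torch.zeros(B, 600, 11),
    "prev_exists": torch.zeros(B),                  # 0 indicates no previous frame
    "interval_mask": torch.zeros(B, 1, 1, dtype=torch.bool),
}
```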

Output:

Output Type(s): Tensors consisting of 3D bounding boxes, object confidence scores, classes, class confidence scores, tracking object IDs, and instance features
Output Format: List of Tensors
Output Parameters: Multiple dimensions. See below for detailed model output shapes.
Other Properties Related to Output: Please see the details below.

After post-processing, the final output is a list whose length equals the batch size. Each element is a dictionary with the following keys:

| Output Name | Type | Shape | ResNet101 Shape | Description |
|-------------|------|-------|-----------------|-------------|
| boxes_3d | List[Tensor] | (B, M, 10) | (1, 600, 10) | List of 3D boxes: (x, y, z, w, l, h, yaw, vx, vy, vz) in OV coordinates |
| scores_3d | List[Tensor] | (B, M) | (1, 600) | List of object confidence scores (classification scores * centerness score) |
| labels_3d | List[Tensor] | (B, M) | (1, 600) | List of class labels |
| cls_scores | List[Tensor] | (B, M) | (1, 600) | List of classification scores |
| instance_ids | List[Tensor] | (B, M) | (1, 600) | List of instance IDs |
| instance_feats | List[Tensor] | (B, M, E) | (1, 600, 256) | List of instance features |
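
A minimal consumption sketch for one element of this output list. It assumes the keys documented above, with any leading batch dimension already squeezed out; the 0.05 threshold mirrors the decoder's score_threshold in the spec file shown later, and the function name is hypothetical.

```python
def keep_confident_detections(output: dict, score_threshold: float = 0.05) -> dict:
    """Filter one batch element of the Sparse4D output by object confidence.

    Works with torch tensors (or NumPy arrays) via boolean indexing.
    """
    keep = output["scores_3d"] > score_threshold             # boolean mask over M boxes
    return {
        "boxes_3d": output["boxes_3d"][keep],                # (K, 10) kept 3D boxes
        "scores_3d": output["scores_3d"][keep],              # (K,) confidence scores
        "labels_3d": output["labels_3d"][keep],              # (K,) class labels
        "instance_ids": output["instance_ids"][keep],        # (K,) tracking IDs
        "instance_feats": output["instance_feats"][keep],    # (K, 256) instance embeddings
    }
```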

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • DeepStream - 7.1
  • TAO - 6.1.0 EA

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating Systems:

  • Linux

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA DataCenter GPU. For software, the models are specifically designed for the TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained model for Sparse4D.
  • deployable_v1.0 - Model for Sparse4D deployable to DeepStream or TensorRT.

Training, Testing, and Evaluation Datasets

The training algorithm optimizes the network to minimize the classification loss, regression loss & ID loss.

Training Data

The model is trained on the MTMC Tracking 2025 dataset available on HuggingFace. The dataset statistics used for pretraining the model are as follows:

Data Collection Method by dataset:

  • Synthetic

Labeling Method by dataset:

  • Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)):

| Subset | No. of scenes | No. of BEV groups | No. of objects per scene | No. of cameras per scene | Duration of each camera sequence | FPS |
|--------|---------------|-------------------|--------------------------|--------------------------|----------------------------------|-----|
| Train | 111 | 122 | 15-120 | 4-32 | 5 mins | 30 |
  • Model was trained on the following moving object classes: Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter.
  • Both validation and testing were conducted on a random scene from the MTMC Tracking 2025 subset.

Data Format & Preprocessing

Raw Data (MTMC Tracking 2025/AI City Challenge Format)

This raw data includes .mp4 RGB video files, depth maps stored in HDF5 format, ground_truth.json, calibration.json, and map.png. For more details on these files, including the ground truth format, please refer to the raw dataset format here.

The raw data may also be obtained from the NVIDIA Isaac Sim Replicator Agent (IRA). When collected via this route, RGB image data and depth maps can be in frame output or HDF5 output. Use the TAO Data Service to convert your raw data format (AICity) to pickle format (OVPKL) accordingly.

Pickle File Structure (OVPKL format)

The raw data format above needs to be converted to pickle files for model training via the TAO Data Service. The model uses these pickle files along with the files above for training. An example command can be found in the Sparse4D finetuning notebook.

The process generates pickle files (.pkl) containing scene information. Each pickle file is a dictionary with the following top-level keys:

  • metadata: (dict) Contains metadata about the dataset.
  • infos: (list) A list of dictionaries, where each dictionary contains information about a specific frame in the scene.

metadata key

The metadata dictionary contains the following keys:

| Key | Type | Description | Example Value |
|-----|------|-------------|---------------|
| version | str | Version of the dataset or split. | 'trainval_split' |
| split_type | str | Type of data split (e.g., 'all', 'train', 'validation'). | 'all' |

infos key

The infos key holds a list of dictionaries. Each dictionary in this list corresponds to a frame and contains the following keys:

| Key | Type | Description | Example Value (Illustrative) |
|-----|------|-------------|------------------------------|
| frame_idx | int | Index of the frame. | 0 |
| cams | dict | Dictionary containing data for each camera in the frame. Keys are camera names (e.g., 'Camera', 'Camera_01'). | See cams Structure below. |
| scene_name | str | Name of the scene. | 'Warehouse_014+bev-sensor-training-1' |
| timestamp | float | Timestamp of the frame. | 0.0 |
| token | str | Unique token for the frame. | 'Warehouse_014+bev-sensor-training-1__000000000' |
| group_name | str | Name of the group this frame belongs to. | 'bev-sensor-training-1' |
| instance_inds | np.ndarray | Array of instance indices present in the frame. | array([372, 171, 172, 631, 663]) |
| asset_inds | np.ndarray | Array of asset indices corresponding to instances. | array([ 771, 772, 775, 2094, 1372]) |
| gt_boxes | np.ndarray | Ground truth bounding boxes for objects in the frame. Shape: (N, 7) where N is the number of objects. Each row is [x, y, z, dx, dy, dz, heading]. | array([[-0.686, 1.073, ..., -0. ], ...]) |
| gt_names | np.ndarray | Array of ground truth names for objects. | array(['agility_digit', 'gr1_t2', ...], dtype='<U18') |
| gt_velocity | np.ndarray | Ground truth velocity for objects. Shape: (N, 3) for [vx, vy, vz]. | array([[0., 0., 0.], ...]) |
| valid_flag | np.ndarray | Boolean array indicating if the ground truth data for each object is valid. | array([ True, True, ...]) |
| gt_visibility | list | List of dictionaries, one for each ground truth object, indicating its visibility percentage in each camera. | [{'Camera': 1.0, 'Camera_01': 1.0, ...}, ...] |

cams Structure

The cams dictionary (within each element of the infos list) contains nested dictionaries, where each key is a camera identifier (e.g., 'Camera', 'Camera_01', etc.). Each of these camera-specific dictionaries has the following structure:

| Key | Type | Description | Example Value (Illustrative) |
|-----|------|-------------|------------------------------|
| data_path | tuple | Tuple containing paths to camera data: (h5_file_path, rgb_image_relative_path). | ('data/mtmc/Warehouse_014/Camera.h5', 'rgb/rgb_00000.jpg') |
| depth_map_path | tuple | Tuple containing paths to depth map data: (h5_file_path, depth_image_relative_path). | ('data/mtmc/Warehouse_014/Camera.h5', 'distance_to_image_plane_png/distance_to_image_plane_00000.png') |
| sample_data_token | str | Unique token for the sample data from this camera. | 'Warehouse_014+bev-sensor-training-1__000000000+Camera' |
| cam_intrinsic | np.ndarray | 3x3 camera intrinsic matrix. | array([[916.249, 0., 960.], [0., 916.249, 540.], [0., 0., 1.]]) |
| sensor2world_transform | np.ndarray | 4x4 transformation matrix from sensor coordinates to world coordinates. | array([[0.018, -0.999, ..., -4.731], ..., [0., 0., 0., 1.]]) |
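
The sketch below inspects an OVPKL pickle and projects a ground-truth box center into one camera view using the cam_intrinsic and sensor2world_transform fields documented above. The file path is hypothetical, and the projection convention (intrinsics applied after the world-to-sensor transform) is an illustrative assumption; TAO's own preprocessing builds the 4x4 projection_mat input from these fields.

```python
import pickle
import numpy as np

# Hypothetical path; OVPKL files are produced by the TAO Data Service.
with open("data/mtmc/ovpkl/Warehouse_014_train.pkl", "rb") as f:
    scene = pickle.load(f)

print(scene["metadata"])            # e.g. {'version': 'trainval_split', 'split_type': 'all'}
frame = scene["infos"][0]           # first frame of the scene
cam = frame["cams"]["Camera"]       # one camera's calibration and data paths

# Project the first ground-truth box center (world coordinates) into this camera.
center_world = np.append(frame["gt_boxes"][0, :3], 1.0)        # homogeneous (x, y, z, 1)
world2sensor = np.linalg.inv(cam["sensor2world_transform"])    # invert the 4x4 sensor->world matrix
center_cam = (world2sensor @ center_world)[:3]                 # (x, y, z) in the camera frame
x_img, y_img, z_cam = cam["cam_intrinsic"] @ center_cam
u, v = x_img / z_cam, y_img / z_cam                            # pixel coordinates
print(u, v, z_cam)
```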

Testing and Evaluation Datasets

Both validation and testing were conducted on a random scene from the MTMC Tracking 2025 subset.

Data Collection Method by dataset:

  • Synthetic

Labeling Method by dataset:

  • Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)): We utilize a random scene from the test set of the MTMC Tracking 2025 subset.

Methodology and KPI

The key performance indicators are the average precision (AP) per class and the mean average precision (mAP) across all classes. We use the nuScenes-based 3D detection evaluation protocol to evaluate model accuracy.

Average Precision (AP) quantifies a detector's ability to trade off precision and recall for a single object category at a given center-distance threshold; it is computed as the normalized area under the precision-recall curve.

Mean Average Precision (mAP) is derived from these AP values by averaging over all target object classes and a set of predefined center-distance thresholds (0.5, 1, 2, and 4 m). This yields a single scalar that reflects both classification and localization accuracy, making mAP a rigorous, holistic metric for comparing 3D detection performance.

The following scores are for models trained on the MTMC Tracking 2025 subset. The evaluation set and training set are disjoint.

| Object Class | AP |
|--------------|-----|
| Person | 0.989 |
| Fourier_GR1_T2_Humanoid | 0.944 |
| Agility_Digit_Humanoid | 0.989 |

Nova_Carter is ignored due to its low object count.

Final mAP: 0.974
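
The reported mAP is consistent with the unweighted mean of the three per-class AP values: (0.989 + 0.944 + 0.989) / 3 ≈ 0.974.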

Real-time Inference Performance

Model inference is performed using the Spatial AI DeepStream pipeline, which uses NVIDIA TensorRT. The model runs at mixed precision (FP16+FP32). The measurements below ignore data transfer time from host to device (H2D) and device to host (D2H), as well as other components such as pre/post-processing of images and tensors. trtexec measurements can be found below:

| GPU | # of cameras | Mean Latency per batch | Mean FPS per batch |
|-----|--------------|------------------------|--------------------|
| 1 x A6000 Ampere - 48GB | 5 | 32.456 ms | 30.81 |
| 1 x L40S - 48GB | 8 | 28.9551 ms | 34.53 |
| 1 x H100 SXM HBM3 - 80GB | 19 | 32.2358 ms | 31.01 |
| 1 x 6000 Ada - 48GB | 7 | 29.5581 ms | 33.83 |
| 1 x L4 - 24GB | 2 | 27.30 ms | 36.64 |
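
Mean FPS per batch appears to be the reciprocal of the mean batch latency; for example, 1000 ms / 32.456 ms ≈ 30.81 FPS on the A6000.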

How to use this model

To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train a Sparse4D model. For more information on the experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
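
The "${...}" entries in the spec above are value interpolations that resolve against other keys when the configuration is loaded. A minimal standalone sketch using OmegaConf, which supports the same interpolation syntax (whether TAO resolves the spec with OmegaConf internally is not stated here):

```python
from omegaconf import OmegaConf

# Trimmed-down, hypothetical spec illustrating the interpolation mechanism.
spec = """
model:
  embed_dims: 256
  head:
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
"""
cfg = OmegaConf.create(spec)
print(cfg.model.head.norm_layer.normalized_shape)  # -> 256, resolved from model.embed_dims
```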

Inference & Deployment:

For deployment of this model, please refer to the NVIDIA SpatialAI release documentation.

Acceleration Engine: TensorRT
Test Hardware:

  • Nvidia Datacenter GPUs

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Sub-cards. Please report security vulnerabilities or NVIDIA AI Concerns here.