Description: TAO Pretrained Sparse4D with ResNet101 Backbone
Publisher: NVIDIA
Latest Version: trainable_v1.0
Modified: July 24, 2025
Size: 901.75 MB

TAO Sparse4D Model Card

Model Overview

Description

TAO Sparse4D is an advanced three-dimensional (3D) multi-camera detection and tracking network. It generates precise 3D bounding boxes and tracking IDs for a diverse set of objects across multiple camera views. The provided model is pre-trained on the Multi-Target Multi-Camera (MTMC) Tracking 2025 subset of the NVIDIA PhysicalAI-SmartSpaces dataset, which was also used for the 2025 AI City Challenge.

The model in this card was trained & evaluated on the following moving object classes: Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter.

This model is ready for commercial use.

License/Terms of Use

License to use these models is covered by the NVIDIA Community Model License. By downloading the model, you accept the terms and conditions of this license.

Deployment Geography:

Global

Use Case:

TAO Sparse4D is designed for 3D multi-camera object detection and tracking in indoor environments like warehouses and logistics facilities. The model detects & tracks objects across multiple camera views for applications including warehouse automation, safety and workflow optimization in industrial settings, providing spatial understanding of a scene.

Release Date:

NGC - 06/13/2025

References

  • Lin, X., Pei, Z., Lin, T., Huang, L., & Su, Z. (2023). Sparse4d v3: Advancing end-to-end 3d detection and tracking. arXiv preprint arXiv:2311.11722.
  • Lin, X., Lin, T., Pei, Z., Huang, L., & Su, Z. (2023). Sparse4d v2: Recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018.
  • Lin, X., Lin, T., Pei, Z., Huang, L., & Su, Z. (2022). Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581.

Model Architecture

Architecture Type: Convolutional Neural Network (CNN) backbone with Transformer-based decoder layers.

Network Architecture: ResNet101 backbone with Transformer-based decoder layers.

This work leverages the Sparse4D v3 model and is tailored for indoor environments like warehouses with static camera setups. Sparse4D is a query-based technique that samples sparse features for better computational efficiency than dense 3D detection and tracking techniques. The architecture features a ResNet101 backbone and processes time-synchronized frames from multiple cameras. Key components include the backbone, a Feature Pyramid Network (FPN), and multiple decoder layers incorporating Multi-Scale Deformable Aggregation blocks (featuring key-point generation, feature sampling, a visibility net module, and embedding generation) along with refinement and classification layers. The model is trained with regression, classification, and ID losses. It also includes an Instance Bank module used for tracking ID assignment and management. Along with the 3D bounding box information and object ID, Sparse4D outputs an instance feature containing high-dimensional semantics from the image encoder.

Input:

Sparse4D is trained on RGB images, camera calibration files, ground truth 3D bounding boxes and object IDs, and optional depth maps. Since performing 3D multi-camera detection and tracking across large regions like warehouses with high-density camera setups is computationally expensive, we partition large regions into overlapping groups called Bird's Eye View (BEV) groups. Each BEV group contains multiple cameras and serves as the fundamental training unit for Sparse4D.

Input Type(s): Each BEV group has cameras with RGB images, a camera calibration file, ground truth with 3D bounding boxes, and optional depth maps, available via HuggingFace or generated synthetically with NVIDIA Isaac Sim Replicator.
Input Format: Red, Green, Blue (RGB) images stored in raw png/jpg/hdf5 and depth maps stored in png/hdf5. Supports an input resolution of 3 x 1080 x 1920 for both RGB images and depth maps. Data preprocessing is required to group images and depth maps from multiple cameras into the appropriate BEV groups; this can be done via the TAO Data Service.
Input Parameters: Multiple dimensions. See below for detailed model input shapes.
Other Properties Related to Input: 3 x 1080 x 1920 (C x H x W) resolution images for both RGB images and depth maps. Data preprocessing is needed via TAO Data Services. No alpha channel required.

Note that depth maps are optional and not required for evaluation or inference. The raw model inputs are as follows:

| Dimension | Description |
|-----------|-------------|
| B | Batch size |
| C | Number of channels |
| N | Number of cameras |
| H | Image height |
| W | Image width |
| Q | Number of queries |
| M | Number of output boxes |
| E | Number of instance features |

| Input Name | Type | Shape | ResNet101 Shape | Description |
|------------|------|-------|-----------------|-------------|
| img | List[Tensor] | (B, N, C, H, W) | (1, N, 3, 512, 1408) | Input image tensor |
| projection_mat | List[Tensor] | (B, N, 4, 4) | (1, 10, 4, 4) | List of projection matrices |
| image_wh | List[Tensor] | (B, N, 2) | (1, 10, 2) | List of image width and height |
| input_cached_feature | List[Tensor] | (B, 600, 256) | (1, 600, 256) | List of cached features |
| input_cached_anchor | List[Tensor] | (B, 600, 11) | (1, 600, 11) | List of cached anchors |
| prev_exists | List[Tensor] | (B) | (1) | Indicates if previous frame exists or not |
| interval_mask | List[Tensor] | (B, 1, 1) | (1, 1, 1) | Boolean to describe the interval mask |
  • The number of cameras (N) is dynamic in the model.
  • The model is initialized with zero tensors for input_cached_feature, input_cached_anchor, prev_exists, and interval_mask at the first frame, as illustrated in the sketch below. These tensors are updated automatically via the instance bank from the second frame onwards.
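
A minimal sketch of the first-frame initialization described above, assuming PyTorch tensors. The TAO and DeepStream pipelines prepare these inputs automatically; the camera count of 10 and the dictionary layout here are only illustrative.

```python
import torch

# Example shapes for one BEV group: batch size 1, 10 cameras,
# 512 x 1408 network input resolution, 600 cached instances.
B, N, H, W = 1, 10, 512, 1408

first_frame_inputs = {
    "img": torch.zeros(B, N, 3, H, W),              # placeholder for preprocessed RGB frames
    "projection_mat": torch.zeros(B, N, 4, 4),      # placeholder for per-camera projection matrices
    "image_wh": torch.tensor([W, H], dtype=torch.float32).expand(B, N, 2),
    # Temporal state is all zeros on the first frame and is filled in by the
    # instance bank from the second frame onwards.
    "input_cached_feature": torch.zeros(B, 600, 256),
    "input_cached_anchor": torch.zeros(B, 600, 11),
    "prev_exists": torch.zeros(B),                  # 0 indicates no previous frame
    "interval_mask": torch.zeros(B, 1, 1, dtype=torch.bool),
}
```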

Output:

Output Type(s): Tensors consisting of 3D bounding boxes, object confidence scores, classes, class confidence scores, tracking object IDs, and instance features
Output Format: List of Tensors
Output Parameters: Multiple dimensions. See below for detailed model output shapes.
Other Properties Related to Output: Please see the details below.

After post-processing, the final output is a list whose length equals the batch size. Each element is a dictionary with the following keys:

| Output Name | Type | Shape | ResNet101 Shape | Description |
|-------------|------|-------|-----------------|-------------|
| boxes_3d | List[Tensor] | (B, M, 10) | (1, 600, 10) | List of 3D boxes: (x, y, z, w, l, h, yaw, vx, vy, vz) in OV coordinates |
| scores_3d | List[Tensor] | (B, M) | (1, 600) | List of object confidence scores (classification scores * centerness score) |
| labels_3d | List[Tensor] | (B, M) | (1, 600) | List of class labels |
| cls_scores | List[Tensor] | (B, M) | (1, 600) | List of classification scores |
| instance_ids | List[Tensor] | (B, M) | (1, 600) | List of instance IDs |
| instance_feats | List[Tensor] | (B, M, E) | (1, 600, 256) | List of instance features |
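
A minimal consumption sketch for one element of this output list. It assumes the keys documented above, with any leading batch dimension already squeezed out; the 0.05 threshold mirrors the decoder's score_threshold in the spec file shown later, and the function name is hypothetical.

```python
def keep_confident_detections(output: dict, score_threshold: float = 0.05) -> dict:
    """Filter one batch element of the Sparse4D output by object confidence.

    Works with torch tensors (or NumPy arrays) via boolean indexing.
    """
    keep = output["scores_3d"] > score_threshold             # boolean mask over M boxes
    return {
        "boxes_3d": output["boxes_3d"][keep],                # (K, 10) kept 3D boxes
        "scores_3d": output["scores_3d"][keep],              # (K,) confidence scores
        "labels_3d": output["labels_3d"][keep],              # (K,) class labels
        "instance_ids": output["instance_ids"][keep],        # (K,) tracking IDs
        "instance_feats": output["instance_feats"][keep],    # (K, 256) instance embeddings
    }
```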

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • DeepStream - 7.1
  • TAO - 6.1.0 EA

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace

Supported Operating Systems:

  • Linux

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA DataCenter GPU. For software, the models are specifically designed for the TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained model for Sparse4D.
  • deployable_v1.0 - Model for Sparse4D deployable to DeepStream or TensorRT.

Training, Testing, and Evaluation Datasets

The training algorithm optimizes the network to minimize the classification loss, regression loss & ID loss.

Training Data

The model is trained on the MTMC Tracking 2025 dataset available on HuggingFace. The dataset statistics used for pretraining the model are as follows:

Data Collection Method by dataset:

  • Synthetic

Labeling Method by dataset:

  • Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)):

| Subset | No. of scenes | No. of BEV groups | No. of objects per scene | No. of cameras per scene | Duration of each camera sequence | FPS |
|--------|---------------|-------------------|--------------------------|--------------------------|----------------------------------|-----|
| Train | 111 | 122 | 15-120 | 4-32 | 5 mins | 30 |
  • Model was trained on the following moving object classes: Person, Fourier_GR1_T2_Humanoid, Agility_Digit_Humanoid & Nova_Carter.
  • Both validation and testing were conducted on a random scene from the MTMC Tracking 2025 subset.

Data Format & Preprocessing

Raw Data (MTMC Tracking 2025/AI City Challenge Format)

This raw data includes .mp4 RGB video files, depth maps stored in HDF5 format, ground_truth.json, calibration.json, and map.png. For more details on these files, including the ground truth format, please refer to the raw dataset format here.

The raw data may also be obtained from the NVIDIA Isaac Sim Replicator Agent (IRA). When collected via this route, RGB image data and depth maps can be in frame output or HDF5 output. Use the TAO Data Service to convert your raw data format (AICity) to pickle format (OVPKL) accordingly.

Pickle File Structure (OVPKL format)

The raw data format above needs to be converted to pickle files for model training via the TAO Data Service. The model uses these pickle files along with the files above for training. An example command can be found in the Sparse4D finetuning notebook.

The process generates pickle files (.pkl) containing scene information. Each pickle file is a dictionary with the following top-level keys:

  • metadata: (dict) Contains metadata about the dataset.
  • infos: (list) A list of dictionaries, where each dictionary contains information about a specific frame in the scene.

metadata key

The metadata dictionary contains the following keys:

| Key | Type | Description | Example Value |
|-----|------|-------------|---------------|
| version | str | Version of the dataset or split. | 'trainval_split' |
| split_type | str | Type of data split (e.g., 'all', 'train', 'validation'). | 'all' |

infos key

The infos key holds a list of dictionaries. Each dictionary in this list corresponds to a frame and contains the following keys:

| Key | Type | Description | Example Value (Illustrative) |
|-----|------|-------------|------------------------------|
| frame_idx | int | Index of the frame. | 0 |
| cams | dict | Dictionary containing data for each camera in the frame. Keys are camera names (e.g., 'Camera', 'Camera_01'). | See cams Structure below. |
| scene_name | str | Name of the scene. | 'Warehouse_014+bev-sensor-training-1' |
| timestamp | float | Timestamp of the frame. | 0.0 |
| token | str | Unique token for the frame. | 'Warehouse_014+bev-sensor-training-1__000000000' |
| group_name | str | Name of the group this frame belongs to. | 'bev-sensor-training-1' |
| instance_inds | np.ndarray | Array of instance indices present in the frame. | array([372, 171, 172, 631, 663]) |
| asset_inds | np.ndarray | Array of asset indices corresponding to instances. | array([ 771, 772, 775, 2094, 1372]) |
| gt_boxes | np.ndarray | Ground truth bounding boxes for objects in the frame. Shape: (N, 7) where N is the number of objects. Each row is [x, y, z, dx, dy, dz, heading]. | array([[-0.686, 1.073, ..., -0. ], ...]) |
| gt_names | np.ndarray | Array of ground truth names for objects. | array(['agility_digit', 'gr1_t2', ...], dtype='<U18') |
| gt_velocity | np.ndarray | Ground truth velocity for objects. Shape: (N, 3) for [vx, vy, vz]. | array([[0., 0., 0.], ...]) |
| valid_flag | np.ndarray | Boolean array indicating if the ground truth data for each object is valid. | array([ True, True, ...]) |
| gt_visibility | list | List of dictionaries, one for each ground truth object, indicating its visibility percentage in each camera. | [{'Camera': 1.0, 'Camera_01': 1.0, ...}, ...] |

cams Structure

The cams dictionary (within each element of the infos list) contains nested dictionaries, where each key is a camera identifier (e.g., 'Camera', 'Camera_01', etc.). Each of these camera-specific dictionaries has the following structure:

| Key | Type | Description | Example Value (Illustrative) |
|-----|------|-------------|------------------------------|
| data_path | tuple | Tuple containing paths to camera data: (h5_file_path, rgb_image_relative_path). | ('data/mtmc/Warehouse_014/Camera.h5', 'rgb/rgb_00000.jpg') |
| depth_map_path | tuple | Tuple containing paths to depth map data: (h5_file_path, depth_image_relative_path). | ('data/mtmc/Warehouse_014/Camera.h5', 'distance_to_image_plane_png/distance_to_image_plane_00000.png') |
| sample_data_token | str | Unique token for the sample data from this camera. | 'Warehouse_014+bev-sensor-training-1__000000000+Camera' |
| cam_intrinsic | np.ndarray | 3x3 camera intrinsic matrix. | array([[916.249, 0., 960.], [0., 916.249, 540.], [0., 0., 1.]]) |
| sensor2world_transform | np.ndarray | 4x4 transformation matrix from sensor coordinates to world coordinates. | array([[0.018, -0.999, ..., -4.731], ..., [0., 0., 0., 1.]]) |
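
The sketch below inspects an OVPKL pickle and projects a ground-truth box center into one camera view using the cam_intrinsic and sensor2world_transform fields documented above. The file path is hypothetical, and the projection convention (intrinsics applied after the world-to-sensor transform) is an illustrative assumption; TAO's own preprocessing builds the 4x4 projection_mat input from these fields.

```python
import pickle
import numpy as np

# Hypothetical path; OVPKL files are produced by the TAO Data Service.
with open("data/mtmc/ovpkl/Warehouse_014_train.pkl", "rb") as f:
    scene = pickle.load(f)

print(scene["metadata"])            # e.g. {'version': 'trainval_split', 'split_type': 'all'}
frame = scene["infos"][0]           # first frame of the scene
cam = frame["cams"]["Camera"]       # one camera's calibration and data paths

# Project the first ground-truth box center (world coordinates) into this camera.
center_world = np.append(frame["gt_boxes"][0, :3], 1.0)        # homogeneous (x, y, z, 1)
world2sensor = np.linalg.inv(cam["sensor2world_transform"])    # invert the 4x4 sensor->world matrix
center_cam = (world2sensor @ center_world)[:3]                 # (x, y, z) in the camera frame
x_img, y_img, z_cam = cam["cam_intrinsic"] @ center_cam
u, v = x_img / z_cam, y_img / z_cam                            # pixel coordinates
print(u, v, z_cam)
```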

Testing and Evaluation Datasets

Both validation and testing were conducted on a random scene from the MTMC Tracking 2025 subset.

Data Collection Method by dataset:

  • Synthetic

Labeling Method by dataset:

  • Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)): We utilize a random scene from the test set of the MTMC Tracking 2025 subset.

Methodology and KPI

The key performance indicators are the average precision (AP) per class and the mean average precision (mAP) across all classes. We use the nuScenes-based 3D detection evaluation protocol to evaluate model accuracy.

Average Precision (AP) quantifies a detector's ability to trade off precision and recall for a single object category at a given center-distance threshold; it is computed as the normalized area under the precision-recall curve.

Mean Average Precision (mAP) is derived from these AP values by averaging over all target object classes and a set of predefined center-distance thresholds (0.5, 1, 2, and 4 m). This yields a single scalar that reflects both classification and localization accuracy, making mAP a rigorous, holistic metric for comparing 3D detection performance.

The following scores are for models trained on the MTMC Tracking 2025 subset. The evaluation set and training set are disjoint.

| Object Class | AP |
|--------------|-----|
| Person | 0.989 |
| Fourier_GR1_T2_Humanoid | 0.944 |
| Agility_Digit_Humanoid | 0.989 |

Nova_Carter is ignored due to its low object count.

Final mAP: 0.974
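
The reported mAP is consistent with the unweighted mean of the three per-class AP values: (0.989 + 0.944 + 0.989) / 3 ≈ 0.974.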

Real-time Inference Performance

Model inference is performed using the Spatial AI DeepStream pipeline, which uses NVIDIA TensorRT. The model runs at mixed precision (FP16+FP32). The measurements below ignore data transfer time from host to device (H2D) and device to host (D2H), as well as other components such as pre/post-processing of images and tensors. trtexec measurements can be found below:

| GPU | # of cameras | Mean Latency per batch | Mean FPS per batch |
|-----|--------------|------------------------|--------------------|
| 1 x A6000 Ampere - 48GB | 5 | 32.456 ms | 30.81 |
| 1 x L40S - 48GB | 8 | 28.9551 ms | 34.53 |
| 1 x H100 SXM HBM3 - 80GB | 19 | 32.2358 ms | 31.01 |
| 1 x 6000 Ada - 48GB | 7 | 29.5581 ms | 33.83 |
| 1 x L4 - 24GB | 2 | 27.30 ms | 36.64 |
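
Mean FPS per batch appears to be the reciprocal of the mean batch latency; for example, 1000 ms / 32.456 ms ≈ 30.81 FPS on the A6000.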

How to use this model

To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train a Sparse4D model. For more information on the experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm", 
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
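
The "${...}" entries in the spec above are value interpolations that resolve against other keys when the configuration is loaded. A minimal standalone sketch using OmegaConf, which supports the same interpolation syntax (whether TAO resolves the spec with OmegaConf internally is not stated here):

```python
from omegaconf import OmegaConf

# Trimmed-down, hypothetical spec illustrating the interpolation mechanism.
spec = """
model:
  embed_dims: 256
  head:
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
"""
cfg = OmegaConf.create(spec)
print(cfg.model.head.norm_layer.normalized_shape)  # -> 256, resolved from model.embed_dims
```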

Inference & Deployment:

For deployment of this model, please refer to the NVIDIA SpatialAI release documentation.

Acceleration Engine: TensorRT
Test Hardware:

  • Nvidia Datacenter GPUs

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Sub-cards. Please report security vulnerabilities or NVIDIA AI Concerns here.