
Transfusion for 3D Object Detection

Transfusion model to detect 3D objects from point cloud and RGB data.
Latest Version
December 12, 2023
425.05 MB

3D Fusion Model Card

Model Overview

The model described in this card detects people in a paired camera image and Lidar point cloud and localizes each person with a 3D bounding box. It also provides a category label for each object.

Model Architecture

This model is based on BEVFusion, which unifies the feature representation from different modalities (Lidar and image). This unified feature can be used for 3D Object Detection tasks.


The BEVFusion codebase from mmdet3d was used to train this model. The code was modified to accommodate three angles in 3D space (roll, pitch, yaw), whereas only yaw was supported in the original codebase. The training algorithm optimizes the network to minimize the Gaussian Focal Loss and L1 Loss.
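
The two loss terms named above can be sketched in NumPy. This is a minimal illustration that assumes the commonly used form of the Gaussian focal loss (a penalty-reduced focal loss on a center heatmap) and a plain mean absolute error for box regression. The function names, the exponents alpha=2 and gamma=4, and the normalization by the number of positive cells are assumptions for illustration, not values taken from the training code.

```python
import numpy as np

def gaussian_focal_loss(pred, target, alpha=2.0, gamma=4.0, eps=1e-12):
    """Gaussian focal loss over a heatmap (illustrative sketch).

    pred:   predicted heatmap with values in [0, 1]
    target: Gaussian-smoothed ground truth, exactly 1 at object centers
    alpha, gamma: assumed exponents, not confirmed by the model card
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    pos = (target == 1.0).astype(np.float64)
    # Positive cells: penalize low confidence at object centers.
    pos_loss = -np.log(pred + eps) * (1.0 - pred) ** alpha * pos
    # Negative cells: penalize high confidence, down-weighted near centers.
    neg_loss = -np.log(1.0 - pred + eps) * pred ** alpha * (1.0 - target) ** gamma
    # Assumed normalization: divide by the number of object centers.
    num_pos = max(int(pos.sum()), 1)
    return float((pos_loss + neg_loss).sum() / num_pos)

def l1_loss(pred_boxes, gt_boxes):
    """Mean absolute error between predicted and ground-truth box parameters."""
    diff = np.asarray(pred_boxes, dtype=np.float64) - np.asarray(gt_boxes, dtype=np.float64)
    return float(np.abs(diff).mean())
```

In a detector of this kind, the heatmap term typically supervises object centers while the L1 term supervises the regressed box parameters. How the two terms are weighted against each other is not stated in this card.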

Training Data

A proprietary synthetic dataset generated with Omniverse and Isaac Sim was used for training. The training dataset has 465,373 image and point cloud pairs and 1,510,704 objects of the person class. It consists of images, point cloud data from Lidar, and calibration matrices from nine different indoor scenes. The content was captured from a height of two feet with sensors positioned perpendicular to the surface. Each scene has three different lighting conditions: Normal, Dark, and Very Dark.

Training Dataset

Scene Category    Number of Images
Warehouse         41526
Retail Store      79283
Hospital          57182
Office            145940
Factory           141442
Total             465373


Methodology and KPI

The KPIs for the evaluation data are reported in the following table. The evaluation data was also generated with Omniverse. The model was evaluated with the KITTI 3D metric, which measures object detection performance using mean Average Precision (mAP) at an IoU threshold of 0.5.

Model: BEVFusion

Content           AP11      AP40
Evaluation set    77.71%    79.35%
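
AP11 and AP40 differ only in how many recall positions the precision-recall curve is sampled at. The sketch below assumes the usual KITTI-style definitions: AP11 averages interpolated precision at recall 0, 0.1, ..., 1.0, and AP40 at recall 1/40, 2/40, ..., 1.0. The function name and the input arrays are hypothetical, and the step that matches detections to ground truth at IoU 0.5 to produce the precision-recall pairs is not shown.

```python
import numpy as np

def interpolated_ap(recall, precision, num_points):
    """Interpolated average precision over evenly spaced recall positions.

    recall, precision: points on the precision-recall curve (same length)
    num_points: 11 for AP11 (includes recall 0), 40 for AP40 (excludes it)
    """
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64)
    if num_points == 11:
        positions = np.linspace(0.0, 1.0, 11)
    else:
        positions = np.linspace(1.0 / num_points, 1.0, num_points)
    total = 0.0
    for r in positions:
        # Interpolated precision: best precision at any recall >= r.
        reachable = recall >= r
        total += precision[reachable].max() if reachable.any() else 0.0
    return total / len(positions)

# Hypothetical curve with two operating points.
ap11 = interpolated_ap([0.26, 0.56], [1.0, 0.5], num_points=11)
ap40 = interpolated_ap([0.26, 0.56], [1.0, 0.5], num_points=40)
```

Because AP11 includes the recall position 0 and samples the curve more coarsely than AP40, the two values generally differ for the same detections, as they do in the table above.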


Input

RGB Image: 1920 x 1080 x 3
LiDAR Point Cloud: N x 4 (N: number of points; 4: x, y, z, and intensity)
Calibration matrix: Lidar-to-camera projection matrix, camera intrinsic matrix
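
To show how the calibration matrices relate the two sensors, here is a minimal sketch that projects N x 4 Lidar points into the 1920 x 1080 image. It assumes the Lidar-to-camera matrix is a 4 x 4 homogeneous transform, the intrinsic matrix is 3 x 3, and the camera looks along its +z axis. The card does not state these shapes or conventions, and the function name is hypothetical.

```python
import numpy as np

def project_lidar_to_image(points, lidar2cam, intrinsic, width=1920, height=1080):
    """Project Lidar points (N x 4: x, y, z, intensity) to pixel coordinates.

    lidar2cam: assumed 4 x 4 homogeneous Lidar-to-camera transform
    intrinsic: assumed 3 x 3 camera intrinsic matrix
    Returns (uv, visible): N x 2 pixel coordinates and an N-element boolean mask.
    """
    points = np.asarray(points, dtype=np.float64)
    lidar2cam = np.asarray(lidar2cam, dtype=np.float64)
    intrinsic = np.asarray(intrinsic, dtype=np.float64)
    # Homogeneous coordinates; the intensity channel is not used for projection.
    xyz1 = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])
    cam = (lidar2cam @ xyz1.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    # Use a dummy depth of 1 for points behind the camera to avoid dividing by zero.
    depth = np.where(in_front, cam[:, 2], 1.0)
    uv = (intrinsic @ cam.T).T[:, :2] / depth[:, None]
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return uv, in_front & inside
```

Points behind the camera or outside the image bounds are flagged as not visible rather than dropped, so the mask stays aligned with the input point order.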


Output

Category labels (people) and 3D bounding-box coordinates in a nine-dimensional representation (center_x, center_y, center_z, scale_x, scale_y, scale_z, rotation_x, rotation_y, rotation_z) for each person detected in the input.
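
The nine-value box can be expanded into its eight corner points for visualization or IoU computation. The sketch below assumes the rotation values are Euler angles in radians applied in the order R = Rz · Ry · Rx, and that the scale values are full edge lengths. The card does not specify the angle units, rotation order, or scale convention, so check them against the actual model output before relying on this.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix for assumed order R = Rz @ Ry @ Rx (angles in radians)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rot_z @ rot_y @ rot_x

def box_corners(box):
    """Return the 8 x 3 corner coordinates of a nine-value box.

    box: (center_x, center_y, center_z, scale_x, scale_y, scale_z,
          rotation_x, rotation_y, rotation_z)
    """
    center = np.asarray(box[0:3], dtype=np.float64)
    scale = np.asarray(box[3:6], dtype=np.float64)
    rx, ry, rz = box[6:9]
    # Corners of a unit cube centered at the origin, scaled to the box size.
    signs = np.array([[dx, dy, dz]
                      for dx in (-0.5, 0.5)
                      for dy in (-0.5, 0.5)
                      for dz in (-0.5, 0.5)])
    local = signs * scale
    # Rotate into the world frame, then translate to the box center.
    return local @ euler_to_matrix(rx, ry, rz).T + center
```

With all three rotations set to zero, the result is an axis-aligned box spanning center ± scale / 2 on each axis.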



  • Liu, Zhijian and Tang, Haotian and Amini, Alexander and Yang, Xingyu and Mao, Huizi and Rus, Daniela and Han, Song: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. In: ICRA. (2023)



The license to use this model is covered by the Model EULA. By downloading the model, you accept the terms and conditions of this license.

Ethical Considerations

The NVIDIA 3D Fusion model detects people.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.