3D Fusion Model Card

Model Overview

The model described in this card detects people in paired camera image and Lidar point cloud inputs and localizes each person with a 3D bounding box. It also provides a category label for each detected object.

Model Architecture

This model is based on BEVFusion, which unifies the feature representation from different modalities (Lidar and image). This unified feature can be used for 3D Object Detection tasks.

Training

The BEVFusion codebase from mmdet3d was used to train this model. The code was modified to accommodate three rotation angles in 3D space (roll, pitch, yaw), whereas the original codebase supported only yaw. The training algorithm optimizes the network to minimize the Gaussian Focal Loss and L1 Loss.
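
The two loss terms named above can be sketched in plain Python. This is a minimal illustration, assuming the common CenterNet-style form of Gaussian focal loss with exponents alpha = 2 and gamma = 4. The function names, the default exponents, and the mean reduction used for the L1 term are illustrative assumptions, not the training code used for this model.

```python
import math


def gaussian_focal_loss(pred, target, alpha=2.0, gamma=4.0, eps=1e-12):
    """Loss for one heatmap cell (assumed CenterNet-style formulation).

    pred:   predicted probability in (0, 1)
    target: Gaussian-smoothed ground-truth heatmap value in [0, 1]
    """
    if target == 1.0:
        # Positive cell (object center): penalize low confidence.
        return -math.log(pred + eps) * (1.0 - pred) ** alpha
    # Negative cell: penalize high confidence, down-weighted near a center.
    return -math.log(1.0 - pred + eps) * pred ** alpha * (1.0 - target) ** gamma


def l1_loss(pred_box, target_box):
    """Mean absolute error over the regressed box parameters."""
    return sum(abs(p - t) for p, t in zip(pred_box, target_box)) / len(pred_box)
```

In this form, a negative cell that lies close to an object center (target near 1) contributes far less loss than a negative cell far from any object, which is what keeps near-miss predictions from being punished as hard as clear false positives.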

Training Data

A proprietary synthetic dataset generated with Omniverse and Isaac Sim was used for training. The training dataset has 465,373 image and point cloud pairs and 1,510,704 objects for the person class. It consists of images, point cloud data from Lidar, and calibration matrices from nine different indoor scenes. The content was captured from a height of two feet with sensors positioned perpendicular to the surface. Each scene has three different lighting conditions: Normal, Dark, and Very Dark.

	Training Dataset
Scene Category	Number of Images
Warehouse	41526
Retail Store	79283
Hospital	57182
Office	145940
Factory	141442
Total	465373

Performance

Methodology and KPI

The KPIs for the evaluation data are reported in the following table. The evaluation data was also generated with Omniverse. The model was evaluated with the KITTI 3D metric, which measures object detection performance using mean Average Precision (mAP) at an IoU threshold of 0.5.

Model		BEVFusion
Content	AP11	AP40
Evaluation set	77.71%	79.35%
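
The AP11 and AP40 columns follow the KITTI convention of averaging interpolated precision over 11 recall positions (0, 0.1, ..., 1.0) or 40 recall positions (1/40, 2/40, ..., 1). The sketch below illustrates that difference under simplifying assumptions: the function names are hypothetical, and the inputs are assumed to be matching lists of recall and precision values already computed from detections matched at the chosen IoU threshold.

```python
def _interpolated_ap(recalls, precisions, sample_points):
    """Average the best precision reachable at or beyond each sampled recall."""
    total = 0.0
    for r in sample_points:
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable) if reachable else 0.0
    return total / len(sample_points)


def ap_r11(recalls, precisions):
    """11-point AP: recall sampled at 0, 0.1, ..., 1.0."""
    return _interpolated_ap(recalls, precisions, [i / 10 for i in range(11)])


def ap_r40(recalls, precisions):
    """40-point AP: recall sampled at 1/40, 2/40, ..., 1 (recall 0 is excluded)."""
    return _interpolated_ap(recalls, precisions, [i / 40 for i in range(1, 41)])
```

For example, a detector that holds precision 1.0 up to recall 0.5 and finds nothing beyond that scores 6/11 with `ap_r11` but 0.5 with `ap_r40`, because the 11-point variant also counts the sample at recall 0.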

Input

RGB Image: 1920 x 1080 x 3
Lidar Point Cloud: N x 4 (N: number of points, 4: xyz + intensity)
Calibration matrices: Lidar-to-camera projection matrix, camera intrinsic matrix
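
To show how the three inputs relate, the following sketch projects Lidar points into the image plane. It rests on assumptions that this card does not state: the Lidar-to-camera matrix is treated as a 4 x 4 homogeneous transform, the intrinsic matrix as a 3 x 3 pinhole matrix, and the camera frame is assumed to look along +z. The function name `project_lidar_to_image` is hypothetical.

```python
def project_lidar_to_image(points, lidar_to_camera, intrinsic,
                           width=1920, height=1080):
    """Return (u, v, depth) for each Lidar point that lands inside the image.

    points:          iterable of (x, y, z, intensity) rows (the N x 4 input)
    lidar_to_camera: 4 x 4 nested list, assumed homogeneous transform
    intrinsic:       3 x 3 nested list, assumed pinhole camera matrix
    """
    pixels = []
    for x, y, z, _intensity in points:
        # Move the point from the Lidar frame into the camera frame.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in lidar_to_camera[:3]
        )
        if zc <= 0:
            continue  # behind the camera under the +z-forward assumption
        # Apply the intrinsics and divide by depth to get pixel coordinates.
        u = (intrinsic[0][0] * xc + intrinsic[0][1] * yc + intrinsic[0][2] * zc) / zc
        v = (intrinsic[1][0] * xc + intrinsic[1][1] * yc + intrinsic[1][2] * zc) / zc
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v, zc))
    return pixels
```

Points behind the camera or outside the 1920 x 1080 frame are dropped, so the returned list can be shorter than the input point cloud.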

Output

Category labels (people) and 3D bounding-box coordinates in a nine-dimensional representation (center_x, center_y, center_z, scale_x, scale_y, scale_z, rotation_x, rotation_y, rotation_z) for each person detected in the input image.
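
As an illustration of the nine-dimensional representation, the sketch below expands one box into its eight corner points. Several conventions here are assumptions, since this card does not specify them: rotations are taken as radians, applied about x, then y, then z (R = Rz * Ry * Rx), and the scale values are treated as full edge lengths. The function names are hypothetical.

```python
import itertools
import math


def rotation_matrix(rx, ry, rz):
    """Rotation R = Rz * Ry * Rx (assumed order), angles in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def box_corners(box):
    """Return the 8 corners of a nine-dimensional box as (x, y, z) tuples."""
    center, scale, angles = box[0:3], box[3:6], box[6:9]
    rot = rotation_matrix(*angles)
    corners = []
    for signs in itertools.product((-1.0, 1.0), repeat=3):
        # Corner in the box's local frame: half an edge length along each axis.
        local = [s * extent / 2.0 for s, extent in zip(signs, scale)]
        # Rotate into the sensor frame, then translate to the box center.
        corners.append(tuple(
            sum(rot[i][j] * local[j] for j in range(3)) + center[i]
            for i in range(3)
        ))
    return corners
```

With all three angles set to zero, the box is axis-aligned and the corners are simply the center plus or minus half of each scale value.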

References

Citations

Liu, Zhijian and Tang, Haotian and Amini, Alexander and Yang, Xingyu and Mao, Huizi and Rus, Daniela and Han, Song: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. In: ICRA. (2023)

Technical Blogs

Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper

License

License to use these models is covered by the Model EULA. By downloading the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA 3D Fusion model detects people.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.
