BEVFusion for 3D Object Detection
Description: BEVFusion model to detect 3D objects from point cloud and RGB data.
Publisher: NVIDIA
Latest Version: bevfusion_1.0
Modified: November 27, 2024
Size: 464.53 MB

TAO BEVFusion Model Card

Model Overview

Description

The BEVFusion model detects people in an image and a LiDAR point cloud, producing a 3D bounding box, a category label, and a confidence score for each object.

This model is ready for commercial use.

License/Terms of Use

License to use this model is covered by the Model EULA. By downloading the model, you accept the terms and conditions of this license.

References

Liu, Zhijian and Tang, Haotian and Amini, Alexander and Yang, Xingyu and Mao, Huizi and Rus, Daniela and Han, Song: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. In: ICRA. (2023)

Model Architecture

Architecture Type: Vision Transformer
Network Architecture: A transformer-based network that fuses LiDAR and camera features in a shared Bird's-Eye View (BEV) feature space to perform 3D object detection. Swin-Tiny is used as the image backbone and SECOND as the LiDAR backbone.
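For illustration, the following is a minimal PyTorch sketch of the channel-concatenation fusion step over a shared BEV grid. The channel sizes, grid size, and layer layout are assumptions for illustration, not the exact TAO implementation.

```python
import torch
import torch.nn as nn

class ConvFuser(nn.Module):
    """Fuse camera and LiDAR BEV feature maps by channel concatenation
    followed by a 3x3 convolution (channel sizes are assumed)."""

    def __init__(self, cam_channels: int = 80, lidar_channels: int = 256,
                 out_channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # Both inputs share the same BEV grid (N, C, H, W), so they can be
        # concatenated along the channel dimension and fused.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# Example: a hypothetical 180 x 180 BEV grid.
fused = ConvFuser()(torch.randn(1, 80, 180, 180), torch.randn(1, 256, 180, 180))
print(fused.shape)  # torch.Size([1, 256, 180, 180])
```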

This model was trained using the BEVFusion entrypoint in TAO. The model uses three angles in 3D space (roll, pitch, yaw) to represent the 3D bounding boxes, whereas only a single angle (yaw) was supported in the original BEVFusion code base. The training algorithm optimizes the network to minimize the Gaussian Focal Loss and the L1 Loss.
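To make the objective concrete, here is a minimal sketch of the two loss terms. The heatmap term follows the common CenterPoint-style penalty-reduced focal loss; the exact head layout and loss weighting used in TAO are not shown.

```python
import torch
import torch.nn.functional as F

def gaussian_focal_loss(pred: torch.Tensor, target: torch.Tensor,
                        alpha: float = 2.0, gamma: float = 4.0) -> torch.Tensor:
    """Penalty-reduced focal loss on a Gaussian-smoothed heatmap.

    `pred` holds per-pixel scores in (0, 1); `target` is 1.0 at object
    centers and decays as a Gaussian around them.
    """
    eps = 1e-12
    pos_mask = target.eq(1).float()
    pos_loss = -(pred + eps).log() * (1 - pred).pow(alpha) * pos_mask
    neg_loss = -(1 - pred + eps).log() * pred.pow(alpha) \
               * (1 - target).pow(gamma) * (1 - pos_mask)
    num_pos = pos_mask.sum().clamp(min=1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

# Heatmap term: one peak at an assumed object center.
heat_pred = torch.rand(1, 1, 64, 64)
heat_tgt = torch.zeros(1, 1, 64, 64)
heat_tgt[0, 0, 32, 32] = 1.0
cls_loss = gaussian_focal_loss(heat_pred, heat_tgt)

# Regression term: plain L1 over the nine box parameters.
reg_loss = F.l1_loss(torch.randn(4, 9), torch.randn(4, 9))
```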

Input

Input Types:

  • Image, Point Cloud

Input Formats:

  • Images: Red, Green, Blue (RGB)
  • Point Cloud: Points

Input Parameters:

  • Images - 1920 X 1080 X 3 for TAO3DSynthetic: Two-Dimensional (2D)
  • Images - 370 X 1224 X 3 for KittiPerson: Two-Dimensional (2D)
  • Point Cloud - N X 4: Three-Dimensional (3D)

Other Properties Related to Input:

  • Calibration matrices: LiDAR-to-Camera Projection Matrix, Camera Intrinsic Matrix (see the projection sketch after this list)
  • No specific minimum or maximum resolution restriction, No Alpha Channel needed
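As a sketch of how the two calibration matrices are typically used together, the following projects LiDAR points into pixel coordinates. It assumes a 4 x 4 LiDAR-to-camera matrix and a 3 x 3 intrinsic matrix, which may differ from the exact format TAO expects.

```python
import numpy as np

def project_lidar_to_image(points: np.ndarray, lidar2cam: np.ndarray,
                           intrinsic: np.ndarray) -> np.ndarray:
    """Project N x 4 LiDAR points (x, y, z, intensity) to pixel coordinates."""
    n = points.shape[0]
    xyz1 = np.hstack([points[:, :3], np.ones((n, 1))])   # homogeneous coordinates
    cam = (lidar2cam @ xyz1.T).T[:, :3]                  # points in the camera frame
    cam = cam[cam[:, 2] > 0]                             # keep points in front of the camera
    uvw = (intrinsic @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]                      # perspective divide -> N x 2 pixels
```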

Output

Output Types: 3D Bounding Boxes, Category labels, Confidence scores
Output Format: Images with 3D bounding box visualizations
Output Parameters: 3D bounding boxes: Nine-Dimensional (9D); category labels: One-Dimensional (1D); confidence scores: One-Dimensional (1D)
Other Properties Related to Output:

  • 3D bounding-box coordinates: (Center_x, Center_y, Center_z, Scale_x, Scale_y, Scale_z, Rotation_x, Rotation_y, Rotation_z); see the corner-conversion sketch after this list.
  • No specific minimum or maximum resolution restriction, No Alpha Channel needed
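For visualization, the nine box parameters can be converted to eight corner points. The sketch below assumes an intrinsic x-y-z (roll, pitch, yaw) Euler convention, which may differ from the convention used internally.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def box9d_to_corners(box: np.ndarray) -> np.ndarray:
    """Convert one 9D box (Center_xyz, Scale_xyz, Rotation_xyz) to its
    eight corner points for drawing."""
    center, scale, angles = box[:3], box[3:6], box[6:9]
    # Eight corners of a unit cube centered at the origin.
    signs = np.array([[sx, sy, sz]
                      for sx in (-0.5, 0.5)
                      for sy in (-0.5, 0.5)
                      for sz in (-0.5, 0.5)])
    corners = signs * scale                       # scale to the box extents
    rot = Rotation.from_euler("xyz", angles)      # roll, pitch, yaw (assumed order)
    return corners @ rot.as_matrix().T + center   # rotate, then translate

corners = box9d_to_corners(np.array([2.0, 1.0, 0.9, 0.6, 0.6, 1.8, 0.0, 0.0, 0.3]))
print(corners.shape)  # (8, 3)
```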

Software Integration

Runtime Engines:

  • TAO 5.5.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems:

  • Linux

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for the TAO Toolkit.

The primary application of these models is to estimate an object's 3D bounding box from a single-camera, single-LiDAR sensor setup. They identify person objects given a single RGB image with an aligned LiDAR point cloud.

Model Versions

bevfusion_1.0: PyTorch .pth file, inferencable with the TAO Toolkit.

Training, Testing, and Evaluation Datasets

Training Dataset

A proprietary synthetic dataset (TAO3DSynthetic) that was generated with Omniverse and Isaac Sim was used for training and testing. The training dataset has 465,373 image and point cloud pairs and 1,510,704 objects of the person class.

  • nSpect: NSPECT-I6RW-ENUU

Data Collection Method by Dataset

  • Synthetic

Labeling Method by Dataset

  • Synthetic

Properties:
The training dataset consists of images, point cloud data from LiDAR, and calibration matrices from nine different indoor scenes. The content was captured from a height of two feet with sensors positioned perpendicular to the surface. Each scene has three different lighting conditions: Normal, Dark, and Very Dark.

Training Dataset

  Scene Category    Number of Images
  Warehouse              41526
  Retail Store           79283
  Hospital               57182
  Office                145940
  Factory               141442
  Total                 465373

Evaluation Dataset

An evaluation dataset for TAO3DSynthetic was generated along with the training dataset. It contains three scenes from the training data (Warehouse, Retail Store, and Hospital) with unseen people in them, plus one new scene that is not included in the training dataset.

Data Collection Method by Dataset

  • Synthetic

Labeling Method by Dataset

  • Synthetic

After pretraining on TAO3DSynthetic was complete, the BEVFusion model was fine-tuned on the public KITTI dataset with only the pedestrian class to demonstrate the use case. We do not release the model weights for the KITTI fine-tuning; the accuracy results below are shown for demonstration purposes only. The original KITTI labels provide only rotation_y for 3D bounding boxes, so we zero-padded rotation_x and rotation_z in the KittiPerson dataset for compatibility with TAO3DSynthetic-BEVFusion (a sketch of this padding follows). We used only the images from the original KITTI dataset that contain pedestrians. The KittiPerson dataset consists of 955 training images with 2207 person objects and 824 validation images with 2280 person objects.
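A minimal sketch of that padding step, assuming a (center x/y/z, scale x/y/z, rotation_y) 7D layout; the actual column order in your loader may differ.

```python
import numpy as np

def kitti_to_9d(boxes_7d: np.ndarray) -> np.ndarray:
    """Pad N x 7 KITTI boxes (center x/y/z, scale x/y/z, rotation_y)
    with rotation_x = rotation_z = 0 to obtain the 9D layout used above."""
    zeros = np.zeros((boxes_7d.shape[0], 1), dtype=boxes_7d.dtype)
    # center (3) + scale (3) + [rotation_x = 0, rotation_y, rotation_z = 0]
    return np.hstack([boxes_7d[:, :6], zeros, boxes_7d[:, 6:7], zeros])
```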

Methodology and KPI

The KPIs for the evaluation data are reported in the following table. For the TAO3DSynthetic-BEVFusion model, the evaluation data was also generated with Omniverse. For the KittiPerson-BEVFusion model, the validation set from KITTI was used for evaluation. Both training from scratch and fine-tuning from TAO3DSynthetic were evaluated for the KittiPerson model. Both models were evaluated with the KITTI 3D metric, which measures object detection performance using AP40 at an IoU of 0.5. Note that KITTI Pedestrian is more challenging data, as it was captured in real street scenes.

  Model                       Training                          AP40 @ IoU 0.5
  TAO3DSynthetic-BEVFusion    From scratch                      79.35%
  KittiPerson-BEVFusion       From scratch                      38.383%
  KittiPerson-BEVFusion       Fine-tuned from TAO3DSynthetic    53.8747%
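For reference, AP40 averages the interpolated precision at 40 equally spaced recall levels. A minimal sketch follows; prediction-to-ground-truth matching, IoU computation, and KITTI's difficulty buckets happen upstream and are not shown.

```python
import numpy as np

def ap40(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """KITTI-style AP40: mean interpolated precision over 40 equally
    spaced recall levels (1/40 .. 1)."""
    ap = 0.0
    for r in np.linspace(1 / 40, 1.0, 40):
        above = precisions[recalls >= r]   # interpolated precision at recall >= r
        ap += above.max() if above.size else 0.0
    return ap / 40
```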

Inference:

Engine: PyTorch

Inference Method

BEVFusion inference will be run through tao_pytorch_backend.
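Inference itself should be run through the TAO Toolkit entrypoint; as a quick sanity check, the released checkpoint can be inspected with plain PyTorch. The file name below is hypothetical; use the .pth file shipped with the NGC model version.

```python
import torch

# Hypothetical file name; the actual .pth file ships with the NGC model version.
ckpt = torch.load("bevfusion_model.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} tensors, e.g. {list(state)[:3]}")
```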

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.