The BEVFusion model detects people in a paired camera image and LiDAR point cloud, producing a 3D bounding box, a category label, and a confidence score for each detected object.
This model is ready for commercial use.
License to use this model is covered by the Model EULA. By downloading the model, you accept the terms and conditions of this license.
Liu, Zhijian; Tang, Haotian; Amini, Alexander; Yang, Xinyu; Mao, Huizi; Rus, Daniela; Han, Song: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation. In: ICRA (2023).
Architecture Type: Vision Transformer
Network Architecture:
This model is a transformer-based network architecture that fuses LiDAR and camera features in a bird's-eye-view (BEV) feature space to perform 3D object detection.
Swin-Tiny (image backbone) and SECOND (LiDAR backbone) were used.
This model was trained using the BEVFusion entrypoint in TAO. The model uses three angles in 3D space (roll, pitch, yaw) to represent the 3D bounding boxes, whereas only a single angle (yaw) was supported in the original BEVFusion code base. The training algorithm optimizes the network to minimize the Gaussian focal loss and L1 loss.
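For reference, a minimal sketch of the 9D box parameterization described above. The field ordering and rotation convention here are illustrative assumptions, not the exact TAO tensor schema:

```python
import numpy as np

def make_box9d(center, size, roll, pitch, yaw):
    """Pack one 3D box into a 9D vector: center (x, y, z),
    size (dx, dy, dz), rotation (roll, pitch, yaw).
    Layout is hypothetical; the actual TAO layout may differ."""
    x, y, z = center
    dx, dy, dz = size
    return np.array([x, y, z, dx, dy, dz, roll, pitch, yaw], dtype=np.float32)

def rotation_matrix(roll, pitch, yaw):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll) for a 9D box
    (one common convention; assumed here for illustration)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

# A yaw-only box, as in the original BEVFusion code base, is simply
# the special case roll = pitch = 0.
box = make_box9d(center=(1.0, 2.0, 0.5), size=(0.6, 0.6, 1.7),
                 roll=0.0, pitch=0.0, yaw=0.3)
```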
Input Types: RGB Image, LiDAR Point Cloud
Input Formats:
Input Parameters:
Other Properties Related to Input:
Output Types: 3D Bounding Boxes, Category labels, Confidence scores
Output Format: Images with 3D bounding box visualizations
Output Parameters: Nine-Dimensional (9D) bounding boxes, One-Dimensional (1D) category labels, One-Dimensional (1D) confidence scores
Other Properties Related to Output:
Runtime Engines: TAO Toolkit (PyTorch)
Supported Hardware Microarchitecture Compatibility: All NVIDIA GPUs, including NVIDIA Jetson devices
Supported Operating Systems:
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for the TAO Toolkit.
The primary application of these models is to estimate an object's 3D bounding box from a single-camera and single-LiDAR sensor setup. They can identify person objects given a single RGB image with an aligned LiDAR point cloud.
bevfusion1.0: PyTorch .pth file; inference can be run with the TAO Toolkit.
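As a quick sanity check, the released checkpoint can be inspected like any standard PyTorch .pth file. A minimal sketch, assuming the checkpoint is a dict; "bevfusion1.0.pth" is a placeholder for the downloaded filename, and actual inference should still go through the TAO Toolkit:

```python
import torch

# Load the checkpoint on CPU purely to inspect its contents.
ckpt = torch.load("bevfusion1.0.pth", map_location="cpu")

# TAO/PyTorch checkpoints are typically dicts; print the top-level keys
# and a few parameter names if a state_dict is present.
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state = ckpt.get("state_dict", ckpt)
    for name in list(state)[:5]:
        value = state[name]
        print(name, tuple(value.shape) if hasattr(value, "shape") else type(value))
```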
A proprietary synthetic dataset (TAO3DSynthetic), generated with Omniverse and Isaac Sim, was used during training and testing. The training dataset has 465,373 image and point cloud pairs and 1,510,704 objects for the person class.
Data Collection Method by Dataset: Synthetic (generated with Omniverse and Isaac Sim)
Labeling Method by Dataset: Automatic (labels produced by the synthetic data generation pipeline)
Properties:
The training dataset consists of images, LiDAR point cloud data, and calibration matrices from nine different indoor scenes; a sketch of how the calibration aligns the two modalities follows the table below.
The content was captured from a height of two feet, with the sensors positioned perpendicular to the surface. Each scene has three different lighting conditions: Normal, Dark, and Very Dark.
Training Dataset

Scene Category | Number of Images |
---|---|
Warehouse | 41,526 |
Retail Store | 79,283 |
Hospital | 57,182 |
Office | 145,940 |
Factory | 141,442 |
Total | 465,373 |
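A minimal sketch of aligning the LiDAR point cloud with the RGB image via a standard pinhole projection, assuming a 3x3 intrinsic matrix K and a 4x4 LiDAR-to-camera extrinsic T; the actual matrix names and conventions shipped with the dataset may differ:

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, T_lidar_to_cam):
    """Project (N, 3) LiDAR points into pixel coordinates.

    points_xyz: (N, 3) LiDAR points.
    K: (3, 3) camera intrinsic matrix.
    T_lidar_to_cam: (4, 4) extrinsic transform, LiDAR frame to camera frame.
    """
    n = points_xyz.shape[0]
    # Homogeneous coordinates, then transform into the camera frame.
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])
    cam = (T_lidar_to_cam @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    front = cam[:, 2] > 0
    cam = cam[front]
    # Pinhole projection: apply K, then divide by depth.
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return uv, front
```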
An evaluation dataset for TAO3DSynthetic was also generated along with the training dataset. It contains three scenes from the training data (Warehouse, Retail Store, and Hospital) with unseen people, plus one new scene that is not included in the training dataset.
Data Collection Method by Dataset: Synthetic (generated with Omniverse and Isaac Sim)
Labeling Method by Dataset: Automatic (labels produced by the synthetic data generation pipeline)
After pretraining on TAO3DSynthetic, the BEVFusion model was fine-tuned on the public KITTI dataset with only the pedestrian class to demonstrate the use case. We do not release the model weights for the KITTI fine-tuning; the accuracy results below are shown for demonstration purposes only. The original KITTI labels provide only rotation_y for 3D bounding boxes, so we zero-padded rotation_x and rotation_z in the KittiPerson dataset to be compatible with TAO3DSynthetic-BEVFusion. We used only the images from the original KITTI dataset that contain pedestrians. The KittiPerson dataset consists of 955 training images with 2,207 person objects and 824 validation images with 2,280 person objects.
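A minimal sketch of this zero-padding step, reusing the illustrative 9D layout from earlier; the actual TAO label schema and KITTI axis conventions may differ:

```python
import numpy as np

def kitti_to_box9d(x, y, z, dx, dy, dz, rotation_y):
    """Pad a yaw-only KITTI 3D label to a 9D box.

    KITTI provides only rotation_y, so the other two rotation
    angles are zero-padded, as described above. Mapping
    rotation_y directly to "yaw" assumes a particular frame
    convention and is illustrative only.
    """
    roll, pitch, yaw = 0.0, 0.0, rotation_y
    return np.array([x, y, z, dx, dy, dz, roll, pitch, yaw], dtype=np.float32)
```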
The KPIs for the evaluation data are reported in the following tables. For the TAO3DSynthetic-BEVFusion model, the evaluation data was also generated with Omniverse. For the KittiPerson-BEVFusion model, the KITTI validation set was used for evaluation. Both training from scratch and fine-tuning from TAO3DSynthetic were evaluated for the KittiPerson model. Both models were evaluated with the KITTI 3D metric, which measures object detection performance using AP40 at an IoU threshold of 0.5; a sketch of this metric follows the tables below. Note that KITTI pedestrian data is more challenging because it was captured in real street scenes.
Model | TAO3DSynthetic-BEVFusion |
---|---|
From Scratch | 79.35% |
Model | KittiPerson-BEVFusion |
---|---|
From Scratch | 38.383% |
Fine-tuned from TAO3DSynthetic | 53.8747% |
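For reference, AP40 averages the interpolated precision at 40 equally spaced recall points (r = 1/40, 2/40, ..., 1), following the KITTI 3D metric. A minimal sketch, given a detector's precision/recall curve:

```python
import numpy as np

def ap40(recalls, precisions):
    """KITTI-style AP40: mean interpolated precision at 40 recall points.

    recalls, precisions: arrays describing the PR curve,
    sorted by increasing recall.
    """
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(1.0 / 40, 1.0, 40):
        # Interpolated precision: max precision at recall >= r.
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 40.0
    return ap
```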
Engine: PyTorch
BEVFusion inference is run through tao_pytorch_backend.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.