The model described in this card detects people in a paired image and LiDAR point cloud, outputting a 3D bounding box and a category label for each detected object.
This model is based on BEVFusion, which unifies the feature representations from different modalities (LiDAR and image). This unified feature representation can be used for 3D object detection tasks.
The BEVFusion codebase from mmdet3d was used to train this model. The code was modified to accommodate three rotation angles in 3D space (roll, pitch, yaw), whereas only yaw was supported in the original codebase. The training algorithm optimizes the network to minimize the Gaussian Focal Loss and the L1 Loss.
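For intuition, the Gaussian Focal Loss named above is the penalty-reduced focal loss commonly used on Gaussian heatmaps in center-based detection heads. The sketch below is illustrative, not the model's actual implementation; the `alpha`/`gamma` defaults follow the common mmdet3d convention and are assumptions here.

```python
import numpy as np

def gaussian_focal_loss(pred, target, alpha=2.0, gamma=4.0, eps=1e-12):
    """Penalty-reduced focal loss on a Gaussian heatmap.

    `pred` and `target` are heatmaps in [0, 1]; cells equal to 1 in
    `target` are positives, all other cells are negatives whose penalty
    is reduced by the Gaussian bump `(1 - target) ** gamma`.
    """
    pos_mask = (target == 1.0)
    # Positives: standard focal term on the predicted score.
    pos_loss = -np.log(pred + eps) * (1.0 - pred) ** alpha * pos_mask
    # Negatives: down-weighted near ground-truth centers.
    neg_loss = (-np.log(1.0 - pred + eps) * pred ** alpha
                * (1.0 - target) ** gamma * (~pos_mask))
    num_pos = max(pos_mask.sum(), 1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```

A perfect heatmap yields (near) zero loss, while a uniform prediction is penalized on both the positive cell and the background.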
A proprietary synthetic dataset generated with Omniverse and Isaac Sim was used during training. The training dataset has 465,373 image/point-cloud pairs and 1,510,704 objects for the person class. It consists of images, LiDAR point cloud data, and calibration matrices from nine different indoor scenes. The content was captured from a height of two feet, with sensors mounted perpendicular to the surface. Each scene has three different lighting conditions: Normal, Dark, and Very Dark.
**Training Dataset**

| Scene Category | Number of Images |
|---|---|
| Warehouse | 41,526 |
| Retail Store | 79,283 |
| Hospital | 57,182 |
| Office | 145,940 |
| Factory | 141,442 |
| Total | 465,373 |
The KPIs for the evaluation data are reported in the following table. The evaluation data were also generated with Omniverse. The model was evaluated with the KITTI 3D metric, which measures object detection performance using mean Average Precision (mAP) at an IoU threshold of 0.5.
| Model | BEVFusion (AP11) | BEVFusion (AP40) |
|---|---|---|
| Evaluation set | 77.71% | 79.35% |
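AP11 and AP40 refer to the KITTI-style 11-point and 40-point interpolated Average Precision: precision is sampled at fixed recall levels, taking at each level the maximum precision achieved at any recall at least that large. A minimal sketch of this computation, assuming a precomputed precision-recall curve:

```python
import numpy as np

def interpolated_ap(recalls, precisions, num_points):
    """KITTI-style N-point interpolated AP.

    AP11 samples 11 recall levels (0.0, 0.1, ..., 1.0); AP40 samples
    40 levels (1/40, 2/40, ..., 1.0), skipping recall 0. At each level
    the interpolated precision is the max precision at recall >= level.
    """
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    if num_points == 11:
        levels = np.linspace(0.0, 1.0, 11)
    else:
        levels = np.linspace(1.0 / num_points, 1.0, num_points)
    ap = 0.0
    for r in levels:
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / len(levels)
    return ap
```

AP40 is generally considered the more stable metric, which is why both are reported.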
RGB image: 1920 x 1080 x 3. LiDAR point cloud: N x 4 (N: number of points; 4: xyz + intensity). Calibration matrices: LiDAR-to-camera projection matrix and camera intrinsic matrix.
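To show how these three inputs relate, the sketch below projects LiDAR points into the image plane using the two calibration matrices. Function and argument names are illustrative, not part of the model's API; a 4x4 extrinsic matrix and a 3x3 intrinsic matrix are assumed.

```python
import numpy as np

def project_lidar_to_image(points, lidar2cam, intrinsics):
    """Project LiDAR points (N x 4, xyz + intensity) to pixel coordinates.

    `lidar2cam`: 4x4 LiDAR-to-camera extrinsic matrix.
    `intrinsics`: 3x3 camera intrinsic matrix.
    Returns (uv, in_front): pixel coords for points in front of the camera,
    plus the boolean mask of which input points those are.
    """
    xyz = points[:, :3]
    homo = np.hstack([xyz, np.ones((xyz.shape[0], 1))])   # homogeneous N x 4
    cam = (lidar2cam @ homo.T).T[:, :3]                   # camera-frame coords
    in_front = cam[:, 2] > 0                              # drop points behind camera
    cam = cam[in_front]
    uv = (intrinsics @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective divide
    return uv, in_front
```

This is the standard pinhole projection that fusion models rely on to align point-cloud features with image features.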
Category labels (people) and 3D bounding-box coordinates in a nine-dimensional representation (center_x, center_y, center_z, scale_x, scale_y, scale_z, rotation_x, rotation_y, rotation_z) for each person detected in the input image.
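To illustrate the nine-dimensional box format, the sketch below converts one box into its eight corner points. The Euler-angle convention (intrinsic roll/pitch/yaw applied in Z-Y-X order) is an assumption for illustration; the released model may use a different convention.

```python
import numpy as np

def box9d_to_corners(box):
    """Convert a 9-dim box (center_x, center_y, center_z, scale_x, scale_y,
    scale_z, rotation_x, rotation_y, rotation_z) into its 8 corners (8 x 3).
    """
    cx, cy, cz, sx, sy, sz, rx, ry, rz = box
    # Unit-cube corners scaled to the box extents, centered at the origin.
    signs = np.array([[x, y, z] for x in (-0.5, 0.5)
                      for y in (-0.5, 0.5) for z in (-0.5, 0.5)])
    corners = signs * np.array([sx, sy, sz])
    cr, sr = np.cos(rx), np.sin(rx)       # roll
    cp, sp = np.cos(ry), np.sin(ry)       # pitch
    cw, sw = np.cos(rz), np.sin(rz)       # yaw
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cw, -sw, 0], [sw, cw, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx                      # assumed Z-Y-X composition
    return corners @ R.T + np.array([cx, cy, cz])
```

Because this model predicts all three rotation angles rather than yaw only, boxes can be tilted in any direction, which matters for people on ramps or uneven surfaces.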
License to use these models is covered by the Model EULA. By downloading the model, you accept the terms and conditions of these licenses.
The NVIDIA 3D Fusion model detects people.
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.