Bi3D is a network that performs binary depth classification from a stereo camera. Given a fronto-parallel plane at distance d from a stereo camera, Bi3D identifies all the objects that are closer than d. The idea behind Bi3D is that it is faster and easier to classify an object as being closer or farther than a certain distance, rather than to regress its actual distance accurately. Bi3D can be run for multiple fronto-parallel planes (e.g., at distances d1, d2, and d3) and the corresponding classifications can be aggregated (e.g., an object is between d1 and d2). Note that the ground is not an obstacle and is classified by Bi3D as being at infinity (beyond the farthest depth tested).
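The aggregation over several planes can be sketched as follows. This is a minimal NumPy illustration, not the Bi3D implementation: the function name `aggregate_zones`, the mask layout, and the zone numbering are assumptions made for this example.

```python
import numpy as np

def aggregate_zones(closer_masks):
    """Combine per-plane binary maps into one zone index per pixel.

    closer_masks: boolean array of shape (K, H, W), ordered from the nearest
    plane to the farthest. closer_masks[k] is True where a pixel was
    classified as closer than plane k (hypothetical layout).

    Returns an (H, W) integer map: 0 means closer than the nearest plane,
    k means between plane k-1 and plane k, K means beyond every plane.
    """
    closer_masks = np.asarray(closer_masks, dtype=bool)
    # A pixel closer than a near plane is also closer than every farther
    # plane; a running OR along the plane axis enforces that consistency.
    consistent = np.logical_or.accumulate(closer_masks, axis=0)
    # The zone index is the number of planes the pixel is NOT closer than.
    return (~consistent).sum(axis=0)
```

With three planes d1 < d2 < d3, a pixel classified as "not closer than d1" but "closer than d2 and d3" receives zone index 1, that is, between d1 and d2.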
Bi3D has two models: a feature extraction model based on spatial pyramid pooling, and a UNet-based segmentation model. Given a distance of interest d, the features extracted from the left and right images of the stereo pair are warped so that they align for objects that lie at distance d, and are offset to the left (or right) when the object is closer (or farther) than d. The segmentation model uses a ResNet18 encoder and four convolution layers as its decoder. This second model implicitly matches the warped features to classify the depth of each object as larger or smaller than d.
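To make the warping step concrete, the sketch below shifts right-view features horizontally so that they line up with the left view for objects on the query plane. It assumes rectified stereo, a plane expressed as an integer disparity in pixels at feature resolution, and zero padding at the border; the function name `shift_right_features` is hypothetical, and the actual network may warp differently.

```python
import numpy as np

def shift_right_features(right_feat, disparity):
    """Shift right-view features `disparity` pixels to the right.

    right_feat: array of shape (..., W), for example (C, H, W).
    In rectified stereo, a point at column x in the left view appears at
    column x - disparity in the right view, so after this shift the two
    views coincide for points whose disparity equals the query value.
    Columns with no source data are filled with zeros.
    """
    if disparity < 0:
        raise ValueError("disparity must be non-negative")
    shifted = np.zeros_like(right_feat)
    if disparity == 0:
        shifted[...] = right_feat
    else:
        shifted[..., disparity:] = right_feat[..., :-disparity]
    return shifted
```

After the shift, a point with a larger disparity than the query value (a closer object) lands to the left of its left-view position, and a point with a smaller disparity lands to the right, which is the offset the segmentation model can use.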
Bi3D can be used for collision avoidance applications, similar to those used in current industrial autonomous mobile robot (AMR) systems. As an example, Figure 2 shows a robot configured with four safety zones. Each zone corresponds to a specific depth from the stereo camera. Various safety responses could be assigned to the robot depending on its distance from each safety zone.
The training algorithm optimizes the network to minimize the binary cross-entropy (BCE) loss. The BCE loss is computed for the binary prediction from Bi3D and the binary maps computed from the ground-truth depth.
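A minimal sketch of this loss, assuming the ground truth is a disparity map and that "closer than the plane" means a strictly larger disparity. The function names, the strict comparison, and the clipping constant are assumptions for illustration, not details taken from the training code.

```python
import numpy as np

def binary_target(gt_disparity, plane_disparity):
    """Binary map from ground truth: 1 where the pixel is closer than the plane."""
    # A closer object has a larger disparity than the query plane.
    return (np.asarray(gt_disparity) > plane_disparity).astype(np.float64)

def bce_loss(pred_prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    # Clip to avoid log(0) for saturated predictions.
    p = np.clip(np.asarray(pred_prob, dtype=np.float64), eps, 1.0 - eps)
    per_pixel = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(per_pixel.mean())
```

An uninformative prediction of 0.5 everywhere gives a loss of ln 2 (about 0.693) regardless of the target, which is a convenient baseline when checking a training run.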
The Bi3D model was trained on over 1 million synthetically generated object examples in rendered scenes from Omniverse, using Replicator Composer. The vertical position of the camera was randomly varied from 0.3 to 1.5 m above ground level. The training dataset consists of a mix of categories selected to improve robustness. Some of the dataset categories feature flying objects, images with textureless regions, and realistic images. Although we trained on various synthetic datasets, our target environment for this version of Bi3D is real indoor scenes. For each scene, we generate a left image, a right image, and the ground-truth disparity.
The training dataset is created using Replicator Composer, a tool for creating parameterizable datasets with NVIDIA Omniverse Isaac Sim software. If you would like to create your own synthetic dataset, follow these instructions:
We annotate the ground plane in each input stereo pair. The ground plane annotation is a binary mask that is applied to remove the ground plane from each image. The ground plane is removed so that the robot can distinguish the ground from objects in its path.
We note again that the Bi3D model was trained solely on synthetic datasets and was not fine-tuned on the evaluation dataset below.
We use two KPIs to assess the performance of Bi3D on depth classification tasks on the publicly available Middlebury depth evaluation dataset: percentage pixel error (PE) and mean intersection over union (MIOU). We set the maximum disparity to 192 during training. All images were rescaled to 576x960x3 before being passed to the Bi3D model, both to keep the input image size within our training receptive field size and to reduce the disparity ranges of the original dataset to values near our training disparity range. We select four safety zones for evaluation. In each zone, we select the middle disparity plane and evaluate the predicted binary disparity map against the ground-truth disparity map. Both KPIs, MIOU and PE, are computed on the values predicted to be in front of the disparity planes. On both maps, we compute the PE and MIOU and report the average over all the evaluation images.
|Safety Zone Disparity||MIOU||PE (%)|
|28 - 45||0.824||10.615|
|12 - 28||0.894||8.984|
|3 - 12||0.920||7.450|
|All zones (9, 18, 39, 45)||0.861||9.749|
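The two KPIs can be illustrated with a short NumPy sketch. The exact definitions used for the table above are not spelled out in this card, so the following assumes that PE is the percentage of pixels whose binary label disagrees with the ground truth and that the IoU is taken over the "in front of the plane" class; the function names are hypothetical.

```python
import numpy as np

def pixel_error_percent(pred_mask, gt_mask):
    """Percentage of pixels whose binary label differs from the ground truth."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    return 100.0 * float(np.mean(pred_mask != gt_mask))

def in_front_iou(pred_mask, gt_mask):
    """Intersection over union of the pixels labeled as in front of the plane."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        # Neither map places anything in front of the plane: treat as a perfect match.
        return 1.0
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)
```

Averaging these per-image values over the evaluation images and over the selected disparity planes would yield dataset-level numbers comparable in form to the table.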
Below, we show predicted binary disparity maps and the corresponding ground truth for a scene with a chair.
Figure 4. (a) Predicted and (b) ground-truth disparity maps for disparities 6, 18, 39, and 45, respectively.
The Isaac ROS Proximity Segmentation package uses Bi3D to produce a segmented image of user-provided proximity zones. Given a list of disparity values, the package generates a disparity image with various annotations corresponding to various depth zones in the image. In order to use this model, users must download the pre-trained ONNX models of Bi3D and convert them to TensorRT engine plans using trtexec. The package also provides quick start instructions on how to run inference using Bi3D and visualize the outputs, as well as example applications of Bi3D inference. For a detailed step-by-step walk-through and requirements list, please see the Isaac ROS Proximity Segmentation GitHub page.
Figure 5. Inference on synthetic and real warehouse scenes.
Two stereo RGB images of resolution 576 x 960 x 3 and a set of disparity values indicating the depth planes to be queried, e.g., [9, 27, 45].
A color-coded map with colors specifying the various depth zones detected.
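Because the planes are queried as disparities, it can help to translate them into metric distances when choosing safety zones. The sketch below uses the standard rectified-stereo relation depth = focal length x baseline / disparity. The focal length and baseline values are hypothetical placeholders, not the calibration of any particular camera; the focal length must be expressed in pixels at the rescaled input resolution.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Metric depth of a fronto-parallel plane at the given disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical calibration, for illustration only.
FOCAL_LENGTH_PX = 576.0
BASELINE_M = 0.125

# Depth in metres for each queried plane: roughly 8.0, 2.67, and 1.6.
plane_depths_m = {
    d: disparity_to_depth(d, FOCAL_LENGTH_PX, BASELINE_M) for d in [9, 27, 45]
}
```

Larger disparities correspond to closer planes, so the list [9, 27, 45] runs from the farthest plane to the nearest.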
Bi3D is designed to run on both NVIDIA GPUs and the Deep Learning Accelerator (DLA) engines present on the Jetson Xavier and Orin system-on-a-chip (SoC). The requirements are summarized in Table 1.
|Jetson AGX Xavier NX 16GB||GeForce 10 series: GeForce GTX 1060 and recent Volta series|
|Jetson Xavier NX||GeForce 16 series|
|Jetson AGX Xavier 64 GB||GeForce 20 series|
|Jetson AGX Xavier||GeForce 30 series|
|Jetson AGX Xavier Industrial||Quadro FX series: Quadro FX 5800|
|Jetson Orin NX 16GB||Quadro x000 series: Quadro 6000, Quadro 7000, Quadro Plex 7000|
|Jetson AGX Orin 32GB||Quadro Kxxx series: Quadro K5000, Quadro K6000|
|Jetson AGX Orin 64GB||Quadro Pxxx series: > Quadro P1000|
|Quadro GVxxx series|
|all Quadro series with > 4096 MB|
|HW Engine||2x DLA v1 (Xavier)|
|HW Engine||2x DLA v2 (Orin)|
|Software||JetPack 5.0 and later||PyTorch|
Table 1. Bi3D Platform Requirements.
The performance of Bi3D depends on the number of input disparities used. Currently, 2 DLA engines are used to process alternating left/right image pairs. The following table shows targeted performance levels on ROS nodes. The reported numbers include preprocessing and post-processing across the ROS pipeline.
|Platform||Disparity levels||GPU clock (GHz)||CPU clock (GHz)||DLA clock (GHz)||Number of DLA Cores||FPS|
|Xavier (all supported SKUs)||3||1.377||2.265||1.3952||2||33|
|Orin (all supported SKUs)||3||1.377||2.265||1.3952||2||62|
Table 2. End-to-End Bi3D performance as measured in Isaac ROS.
|Model||Platform||Disparity Levels||Compute Hardware||FPS|
Table 3. Bi3D contains 2 models: Featnet and Segnet. This table shows individual model throughput of Bi3D models on RTX3060 and 1-core DLA on NVIDIA Jetson AGX.
The NVIDIA Bi3D model was trained to classify the depth of objects with respect to a fixed plane. It is not suitable for actual depth estimation in the 3D world. Although the depths of objects could be inferred from the binary segmentation maps of a depth plane, the accuracy of a continuous depth estimation model should not be expected, since Bi3D was not designed for that application.
 Badki, Abhishek, et al. "Bi3D: Stereo depth estimation via binary classifications." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.
Only synthetic datasets were used to train Bi3D; hence, no personal data was used in the development of this network. The network learns geometry and does not classify objects; hence, there are no ethical concerns in the use of our dataset.
We acknowledge the sources of assets used to render scenes and objects featured in our datasets. The sources are captured in the following attribution list.