
Bi3D Proximity Segmentation


Description

Bi3D is a binary depth classification network that determines whether objects lie closer or farther than a given distance from a stereo camera.

Publisher

NVIDIA

Use Case

Other

Framework

PyTorch

Latest Version

1.0.0

Modified

July 5, 2022

Size

62.46 MB

Model Overview

Bi3D is a network that performs binary depth classification from a stereo camera. Given a fronto-parallel plane at distance d from a stereo camera, Bi3D identifies all the objects that are closer than d. The idea behind Bi3D is that it is faster and easier to classify an object as being closer or farther than a certain distance, rather than to regress its actual distance accurately. Bi3D can be run for multiple fronto-parallel planes (e.g., at distances d1, d2, and d3) and the corresponding classifications can be aggregated (e.g., an object is between d1 and d2). Note that the ground is not an obstacle and is classified by Bi3D as being at infinity (beyond the farthest depth tested).

Figure 1. (Left) Left stereo image of robots deployed in a warehouse. (Center) Colored annotation of objects based on the classification of Bi3D from the approaching robot. Since the ground is not an obstacle, Bi3D classifies it as being beyond the farthest plane. (Right) Binary output of Bi3D for a single close depth plane.
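
The aggregation idea described above can be sketched in a few lines. This is an illustrative snippet, not part of the released model; the function name and zone convention are hypothetical:

```python
import numpy as np

def zones_from_binary_maps(binary_maps):
    """Aggregate per-plane binary classifications into zone indices.

    binary_maps: list of H x W boolean arrays, one per plane, ordered from
    nearest to farthest; True means "closer than this plane".
    Returns an H x W integer map: 0 = inside the nearest zone, ...,
    len(binary_maps) = beyond the farthest plane (e.g. the ground).
    """
    stacked = np.stack(binary_maps).astype(np.int64)  # (P, H, W)
    # A pixel inside zone k is True for exactly the (P - k) farthest planes,
    # so the zone index is the number of planes it is NOT closer than.
    return stacked.shape[0] - stacked.sum(axis=0)
```

For example, a pixel classified as farther than d1 but closer than d2 and d3 lands in zone 1 (between d1 and d2).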

Model Architecture

Bi3D consists of two models: a feature extraction model based on spatial pyramid pooling, and a UNet segmentation model. Given a distance of interest d, the features extracted from the left and right images of the stereo pair are warped so that they align for objects that lie at distance d, and are offset to the left (or right) when an object is closer (or farther) than d. The segmentation model is UNet-based, with a ResNet18 encoder and a four-layer convolutional decoder. This second module implicitly matches the warped features to classify the binary depth of objects as being larger or smaller than d.
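
The alignment step can be illustrated with a minimal sketch. The helper below is hypothetical and uses an integer pixel shift for simplicity; the actual network warps learned feature maps at the disparity corresponding to distance d, not raw pixels:

```python
import numpy as np

def warp_right_features(feat_right, disparity):
    """Shift right-image features by `disparity` pixels along the width axis
    so that scene points at the tested depth align with the left features.

    feat_right: (C, H, W) array; disparity: non-negative integer (pixels).
    After warping, features of objects exactly at the tested depth coincide
    with the left features; closer objects remain offset one way, farther
    objects the other, and the segmentation model classifies that residual.
    """
    C, H, W = feat_right.shape
    if disparity == 0:
        return feat_right.copy()
    warped = np.zeros_like(feat_right)
    # Columns shifted in from the border are left as zeros (no data there).
    warped[:, :, disparity:] = feat_right[:, :, :W - disparity]
    return warped
```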

Intended Use

Bi3D can be used for collision avoidance applications, similar to those in current industrial autonomous mobile robot (AMR) systems. As an example, Figure 2 shows a robot configured with four safety zones, each bounded by a specific depth plane from the stereo camera. A different safety response can be assigned to the robot depending on which zone an object occupies.

Figure 2. Safety zones defined in front of a robot. Three depth planes define four safety zones. Bi3D classifies objects based on the zone they occupy.

Training

Training Algorithm

The training algorithm optimizes the network to minimize the binary cross-entropy (BCE) loss, computed between Bi3D's binary prediction and the binary maps derived from the ground-truth depth.
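
In numpy terms, the loss can be sketched as follows. `bce_loss` is a hypothetical helper; the binary target is derived by thresholding the ground-truth depth at the tested plane:

```python
import numpy as np

def bce_loss(pred, depth_gt, plane_depth, eps=1e-7):
    """Binary cross-entropy between Bi3D's per-pixel confidence `pred`
    (probability that a pixel is closer than the plane) and the binary
    target derived from the ground-truth depth map."""
    target = (depth_gt < plane_depth).astype(np.float64)
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))
```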

Training Dataset

The Bi3D model was trained on over 800k synthetically generated object examples in scenes rendered with Omniverse, using Replicator Composer. The vertical position of the camera was randomly varied from 0.3 to 1.5 m above ground level. The training dataset consists of a mix of categories selected to improve robustness, including flying objects, images with textureless regions, and realistic images. Although we trained on various synthetic datasets, the target environment for this version of Bi3D is real indoor scenes. For each scene, we generate the left image, the right image, and the ground-truth disparity.

Figure 3. Sample synthetic stereo images and left disparity map used to train Bi3D.

Training Data Ground-Truth Labelling Guidelines

The training dataset is created using Replicator Composer, a tool for creating parameterizable datasets with NVIDIA Omniverse Isaac Sim. If you want to create your own synthetic dataset, follow these instructions:

  1. Install Omniverse Launcher.
  2. Follow the instructions for generating data with Replicator.

Ground-Plane Annotation

We annotate the ground plane in each input stereo pair. The ground-plane annotation is a binary mask that is applied to remove the ground plane from each image. Removing the ground plane enables the robot to distinguish the ground from objects in its path.

Depth Estimation and Key Performance Indicators (KPIs)

Note again that the Bi3D model was trained solely on synthetic datasets and was not fine-tuned on the evaluation dataset below.

We use two KPIs to assess Bi3D's performance on depth classification tasks on the publicly available Middlebury depth evaluation dataset: percentage pixel error (PE) and mean intersection over union (MIOU). We set the maximum disparity to 192 during training. All images were rescaled to 576x960x3 before being passed to the Bi3D model, both to keep the input size within our training receptive field and to bring the disparity ranges of the original dataset closer to our training disparity range. We select four safety zones for evaluation. In each zone, we select the middle disparity plane and evaluate the predicted binary disparity map against the ground-truth disparity map. Both KPIs are computed on the pixels predicted to be in front of the disparity planes; we report the average PE and MIOU over all evaluation images.
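
The two KPIs can be written down concretely. These are the standard definitions of IoU and pixel error for the binary "in front of the plane" class, shown here as an illustration; the exact masking and averaging follow the protocol described above:

```python
import numpy as np

def binary_iou(pred, gt):
    """IoU of the "in front of the plane" class between two binary maps."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # If neither map marks any pixel as "in front", the maps agree perfectly.
    return float(inter / union) if union else 1.0

def pixel_error(pred, gt):
    """Percentage of pixels whose binary label disagrees with ground truth."""
    return 100.0 * float(np.mean(pred != gt))
```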

Safety Zone Disparity     | MIOU  | PE (%)
> 45                      | 0.753 | 10.715
28 - 45                   | 0.824 | 10.615
12 - 28                   | 0.894 | 8.984
3 - 12                    | 0.920 | 7.450
All zones (9, 18, 39, 45) | 0.861 | 9.749

Qualitative Inference

Below, we show binary disparity predictions and the corresponding ground-truth disparity for a scene with a chair.


Figure 4. (a) Predicted and (b) ground-truth disparity maps for disparity planes 6, 18, 39, and 45, respectively.

How to Use this Model

The Isaac ROS Proximity Segmentation package uses Bi3D to produce a segmented image of user-provided proximity zones. Given a list of disparity values, the package generates a disparity image with various annotations corresponding to various depth zones in the image. In order to use this model, users must download the pre-trained ONNX models of Bi3D and convert them to TensorRT engine plans using trtexec. The package also provides quick start instructions on how to run inference using Bi3D and visualize the outputs, as well as example applications of Bi3D inference. For a detailed step-by-step walk-through and requirements list, please see the Isaac ROS Proximity Segmentation GitHub page.
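
As a sketch of the conversion step, the snippet below assembles a `trtexec` invocation. `--onnx` and `--saveEngine` are standard trtexec options; the file names and DLA flags are illustrative and may need adjusting for your platform:

```python
def trtexec_cmd(onnx_path, plan_path, use_dla=False):
    """Build a trtexec command line to convert a Bi3D ONNX model to a
    TensorRT engine plan. Returns the command as an argument list suitable
    for subprocess.run."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={plan_path}"]
    if use_dla:
        # Offload supported layers to a DLA core, falling back to the GPU
        # for unsupported ones (illustrative flags; tune per platform).
        cmd += ["--useDLACore=0", "--allowGPUFallback"]
    return cmd
```

To run the conversion, pass the result to `subprocess.run(cmd, check=True)` on a machine with TensorRT installed.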

Sample Predictions from Bi3D

Input Left Image - [Synthetic | Real] Warehouse

Output Image - [Synthetic | Real] Warehouse

Figure 5. Inference on synthetic and real warehouse scenes.

Input

A stereo pair of RGB images, each of resolution 576 x 960 x 3, and a set of disparity values indicating the depth planes to be queried, e.g., [9, 27, 45].

Output

A color-coded map whose colors specify the depth zones detected.
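
For illustration, mapping zone indices to colors is a simple palette lookup. The palette and function below are hypothetical, not the package's actual color scheme:

```python
import numpy as np

# Hypothetical palette: one RGB color per depth zone (nearest to farthest),
# plus a final color for "beyond the farthest queried plane".
PALETTE = np.array([
    [255, 0, 0],    # zone 0: nearest
    [255, 165, 0],  # zone 1
    [255, 255, 0],  # zone 2
    [0, 128, 0],    # beyond all queried planes
], dtype=np.uint8)

def colorize_zones(zone_map):
    """Turn an H x W integer zone map (e.g. aggregated from Bi3D's binary
    outputs) into an H x W x 3 color-coded image via fancy indexing."""
    return PALETTE[zone_map]
```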

Specifications and Performance

Bi3D is designed to run on both NVIDIA GPUs and the Deep Learning Accelerator (DLA) engines present on the Jetson Xavier and Orin systems-on-a-chip (SoC). The requirements are summarized in Table 1 below.

Requirement | Jetson                       | GPU
Hardware    | Jetson AGX Xavier NX 16GB    | GeForce 10 series: GeForce GTX 1060 and recent Volta series
            | Jetson Xavier NX             | GeForce 16 series
            | Jetson AGX Xavier 64 GB      | GeForce 20 series
            | Jetson AGX Xavier            | GeForce 30 series
            | Jetson AGX Xavier Industrial | Quadro FX series: Quadro FX 5800
            | Jetson Orin NX 16GB          | Quadro x000 series: Quadro 6000, Quadro 7000, Quadro Plex 7000
            | Jetson AGX Orin 32GB         | Quadro Kxxx series: Quadro K5000, Quadro K6000
            | Jetson AGX Orin 64GB         | Quadro Pxxx series: > Quadro P1000
            |                              | Quadro GVxxx series
            |                              | all Quadro series with > 4096 MB
HW Engine   | 2x DLA v1 (Xavier)           |
            | 2x DLA v1 (Orin)             |
Software    | JetPack 5.0 and later        | PyTorch

Table 1. Bi3D Platform Requirements

The performance of Bi3D depends on the number of input disparities used. Currently, 2 DLA engines are used to process alternating left/right image pairs. The following table shows targeted performance levels on ROS nodes. The reported numbers include preprocessing and post-processing across the ROS pipeline.

Platform                    | Disparity levels | GPU clock (GHz) | CPU clock (GHz) | DLA clock (GHz) | Number of DLA cores | FPS
Xavier (all supported SKUs) | 3                | 1.377           | 2.265           | 1.3952          | 2                   | 33
Orin (all supported SKUs)   | 3                | 1.377           | 2.265           | 1.3952          | 2                   | 62

Table 2. End-to-End Bi3D performance as measured in Isaac ROS

Model   | Platform | Disparity Levels | Compute Hardware | FPS
Featnet | Xavier   | 1                | GPU              | 392
Segnet  | Xavier   | 1                | GPU              | 250
Featnet | Xavier   | 1                | DLA              | 109
Segnet  | Xavier   | 1                | DLA              | 98
Featnet | Orin     | 1                | GPU              | 748
Segnet  | Orin     | 1                | GPU              | 640
Featnet | Orin     | 1                | DLA              | 235
Segnet  | Orin     | 1                | DLA              | 223

Table 3. Bi3D contains two models: Featnet and Segnet. This table shows the individual throughput of each model on an RTX 3060 GPU and on a single DLA core of NVIDIA Jetson AGX.

Limitations

Actual Depth Computation of Objects

The Bi3D model was trained as a binary depth classifier for objects with respect to a fixed plane; it is not suitable for metric depth estimation in the 3D world. Although object depths can be bounded using the binary segmentation maps of multiple depth planes, the accuracy of a continuous depth estimation model should not be expected, since Bi3D was not designed for that application.

References

[1] Badki, Abhishek, et al. "Bi3d: Stereo depth estimation via binary classifications." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

License

License to use this model is covered by the Model EULA. By downloading the public release version of the model, you accept the terms and conditions of this license.

Ethical Considerations

Only synthetic datasets were used in the training of Bi3D; hence, no personal data was used in the development of this network. The network learns geometry and does not classify objects, so there are no ethical concerns with the use of our dataset.

Acknowledgements

We acknowledge the sources of the assets used to render the scenes and objects featured in our datasets. The sources are captured in the following attribution list.