The Efficient Supervised Stereo (ESS) model described in this card estimates disparity for a stereo image pair, returning a continuous disparity map for the left image together with a confidence map for the predicted disparity. ESS is designed, trained, and evaluated for robotics applications such as robot navigation and manipulation. Its accurate, real-time stereo disparity provides the depth estimates needed to go from perception to planning to action, benefiting autonomous mobile robots, robot manipulators, and humanoids. This model is ready for commercial use.
The model was optimized from the original depth perception network [1]. It features a feature extractor with an enhanced receptive field obtained through atrous spatial pyramid pooling (ASPP) [2]. The feature extractor is connected to a UNet [3] for stereo matching. It learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right feature maps. The extracted features and correlations are then processed by the stereo matching module. Lastly, a stereo mixture density module [4] outputs a disparity value and a confidence value for every pixel in the left input image.
We appended a parallel confidence network branch to the disparity prediction network. The confidence values measure how certain the network is about the predicted disparities. They can be used as guidance for discarding unreliable pixels from the disparity output, trading density for higher disparity accuracy.
Pre-built TensorRT plugin binaries (ess_plugins.so) are provided with the model and loaded during TensorRT initialization. The estimated stereo disparity image from ESS can be further processed to produce a depth image or point cloud. The disparity output can be filtered by thresholding the confidence output to trade off between accuracy and density. The best threshold depends on the use case; 0.35 is a good starting point.
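As an illustration, the sketch below converts the ESS outputs to a metric depth image using the standard pinhole relation depth = focal_length_px * baseline / disparity, discarding low-confidence pixels. The focal length and baseline arguments are placeholders; use the calibration of your rectified stereo camera.

```python
import numpy as np

def disparity_to_depth(disparity, confidence, focal_px, baseline_m, threshold=0.35):
    """Convert an ESS disparity map (pixels) to a depth map (meters).

    focal_px and baseline_m are placeholders for the rectified camera's
    focal length (pixels) and stereo baseline (meters).
    """
    depth = np.zeros_like(disparity, dtype=np.float32)
    # Keep only confident, strictly positive disparities.
    valid = (confidence >= threshold) & (disparity > 0)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth  # invalid pixels remain 0
```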
ESS is designed and tested to run on NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 22.04 with CUDA 12.2, or JetPack 6.0 and later. This model can only be used with TensorRT.
To use this model, convert it to a TensorRT engine plan using TAO. The model is encrypted and can only be decrypted with the key ess. Provide this key to the TAO converter command when generating the TensorRT engine plan.
To improve inference performance, TensorRT custom plugin layers are added to the model. Pre-built plugins are provided with the model.
Convert the ESS model to an engine plan on x86_64 and aarch64 platforms, respectively:
LD_PRELOAD=plugins/x86_64/ess_plugins.so tao-converter -k ess -t fp16 -e ./ess.engine -o output_left,output_conf ess.etlt
LD_PRELOAD=plugins/aarch64/ess_plugins.so tao-converter -k ess -t fp16 -e ./ess.engine -o output_left,output_conf ess.etlt
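At inference time, the plugin library must be loaded before the engine plan is deserialized. A minimal sketch using the TensorRT Python API (the plugin and engine paths are illustrative):

```python
import ctypes
import tensorrt as trt

# Make the custom plugin layers available to TensorRT
# (same role as the LD_PRELOAD above; adjust the path per platform).
ctypes.CDLL("plugins/x86_64/ess_plugins.so")

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

# Deserialize the engine plan produced by tao-converter.
with open("ess.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
```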
Below we filter out pixels with low confidence by setting disparity[confidence < threshold] = -1, where threshold is 0.0 (fully dense) and 0.35, respectively.
Inputs for ESS are RGB stereo image pairs, each of size 576x960x3 (height x width x channels).
Inputs for Light ESS are RGB stereo image pairs, each of size 288x480x3 (height x width x channels).
Inputs must be normalized with mean 0.5 and standard deviation 0.5 (see the preprocessing sketch below).
The disparity map provides a disparity estimate for every pixel in the left image, with values ranging from 0 to infinity in FP32.
The confidence map provides a confidence estimate for every pixel in the left disparity map, with values between 0 and 1 in FP32.
Both outputs have a shape of 576x960 (height x width) for ESS, or 288x480 for Light ESS.
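A minimal preprocessing sketch for the inputs, assuming 8-bit RGB images (OpenCV is used here only as an example resizer; check your engine's binding layout, which may expect NCHW rather than HWC):

```python
import cv2
import numpy as np

MODEL_HW = (576, 960)  # (288, 480) for Light ESS

def preprocess(image_rgb, hw=MODEL_HW):
    """Resize an 8-bit RGB image to the model resolution and normalize."""
    h, w = hw
    resized = cv2.resize(image_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
    x = resized.astype(np.float32) / 255.0
    return (x - 0.5) / 0.5  # mean 0.5, std 0.5, shape (H, W, 3), FP32
```

Apply the same preprocessing to both the left and right images before running inference.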
The training algorithm optimizes the network to minimize a negative log-likelihood loss [4] for disparity estimation and a binary cross-entropy (BCE) loss for confidence estimation.
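For illustration only, a schematic of how the two loss terms might be combined; this is not the exact mixture-density formulation of [4], and the confidence target definition here is an assumption (e.g., whether a pixel's disparity error falls below a tolerance):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_disp, pred_scale, pred_conf, gt_disp, conf_target):
    """Schematic loss: Laplacian NLL term for disparity plus BCE for confidence."""
    # Unimodal Laplacian NLL, a simplification of the SMD-Nets mixture in [4];
    # pred_scale is assumed positive (e.g., produced through a softplus).
    nll = (torch.abs(pred_disp - gt_disp) / pred_scale + torch.log(2.0 * pred_scale)).mean()
    bce = F.binary_cross_entropy(pred_conf, conf_target)
    return nll + bce
```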
The ESS model is trained on 1,000,000 synthetically generated stereo images in rendered scenes from Omniverse using Replicator-Composer.
Overall, the synthetic dataset consists of a diverse collection of scenes with randomized object models (2500+), textures (300+), materials (200+), background scenes (12), skyboxes (10+), lighting, camera intrinsics, and camera poses. For instance, the virtual stereo camera is randomized per scene as follows: the pose is sampled from scenario-specific camera spawn regions and orientations, the vertical camera height ranges from 0.05-2 m, the stereo baseline ranges from 0.05-0.25 m, and the focal length varies from 5-16 mm. The attribution list for the synthetic dataset can be found here.
The synthetic dataset is composed of several categories, including more chaotic-style scenes with flying distractors and increased randomization, and more realistic-style scenes with dropped objects which target certain domains, like robot navigation and robot manipulation. Below are example scenes, showing left, right, and disparity images.
The inference accuracy of the ESS model is measured on the R2B Dataset 2023. Before resizing images to 576x960 resolution, the top 48 rows are cropped out to preserve the height-width aspect ratio. The KPIs are measured on a subset of images from all sequences and reported at a model input resolution of 576x960 for ESS and 288x480 for Light ESS.
bpx measures the percentage of bad pixels at threshold x (pixels whose disparity deviates from ground truth by more than the threshold value). rmse and mae measure the root mean squared error and mean absolute error, respectively (a computation sketch follows the table below).
KPI | ESS | Light ESS |
---|---|---|
bp2 (%) | 11.47 | 7.95 |
bp4 (%) | 5.71 | 9.13 |
rmse | 9.63 | 5.11 |
mae | 4.08 | 2.25 |
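For reference, a sketch of how these KPIs are typically computed from a predicted and ground-truth disparity pair; the exact evaluation protocol (e.g., which pixels are masked) is not specified here:

```python
import numpy as np

def disparity_kpis(pred, gt, valid=None):
    """Compute bp2, bp4 (%), rmse, and mae over valid ground-truth pixels."""
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)
    err = np.abs(pred[valid] - gt[valid])
    return {
        "bp2": 100.0 * np.mean(err > 2.0),
        "bp4": 100.0 * np.mean(err > 4.0),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(err)),
    }
```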
Inference is run on the provided models at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Orin, RTX 4060 Ti, and RTX 4090 (an illustrative command follows the table below). To improve inference performance, TensorRT custom plugin layers are added to the model. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The results shown here are inference-only performance at an input resolution of 576x960 for ESS and 288x480 for Light ESS.
Platform | ESS (fps) | Light ESS (fps) |
---|---|---|
Jetson AGX Orin | 108.86 | 312.87 |
RTX 4060 Ti | 234.30 | 777.30 |
RTX 4090 | 660.02 | 1125.87 |
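A command along the following lines can be used to benchmark an already-built engine; the flags and paths are illustrative, and the plugin library must be visible as in the conversion step:

LD_PRELOAD=plugins/x86_64/ess_plugins.so trtexec --loadEngine=./ess.engine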
Disparity estimation for highly reflective and textureless surfaces is not measured separately. In general, such surfaces pose an ill-posed problem for stereo estimation.
ESS is not robust to transparent or semi-transparent materials.
Very thin objects such as wires are not detected reliably.
ESS is not trained on any temporal data. Thus, temporal inconsistency in estimation may appear.
4.1.0 ESS models improved through a data-driven process. Training used a mix of real data (100k) and synthetic data (1 million). The real-data ground truth was obtained from a larger teacher model.
4.0.0 ESS models with improved accuracy and performance. Training data is improved for model generalization and stability. Custom plugins are provided for improved performance.
3.1.0 ESS models with improved accuracy on confidence estimation.
3.0.0 ESS model with confidence output, improved accuracy and performance. Light ESS model with confidence output.
2.0.0 ESS model with improved architecture and training data.
1.1.0 ESS model trained on synthetic and real data.
1.0.0 ESS model trained on synthetic data. Initial version.
[1] Nikolai Smolyanskiy, Alexey Kamenev and Stan Birchfield. "On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach." CVPR Workshop on Autonomous Driving, 2018.
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan Yuille. "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
[3] Olaf Ronneberger, Philipp Fischer and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[4] Fabio Tosi, Yiyi Liao, Carolin Schmitt and Andreas Geiger. "SMD-Nets: Stereo Mixture Density Networks." Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.