The Efficient Supervised Stereo (ESS) model described in this card estimates disparity for a stereo image pair, returning a continuous disparity map for the left image together with a confidence map for the predicted disparity. ESS is designed, trained, and evaluated for robotics applications such as robot navigation and manipulation. Its accurate, real-time stereo disparity provides the depth estimates needed to go from perception to planning to action, benefiting autonomous mobile robots, robot manipulators, and humanoids. This model is ready for commercial use.
The model was optimized from the original depth perception network [1]. It features a feature extractor with an enhanced receptive field obtained through atrous spatial pyramid pooling (ASPP) [2]. The feature extractor is connected to a UNet [3] for stereo matching. It learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right feature maps. The extracted features and correlations are then processed by the stereo matching module. Lastly, a stereo mixture density module [4] outputs a disparity value and a confidence value for every pixel in the left input image.
We appended a parallel confidence network branch to the disparity prediction network. The confidence values measure how certain the network is about the predicted disparities. They can be used as guidance for discarding unreliable pixels from the disparity output, trading density for higher disparity accuracy.
Pre-built TensorRT plugin binaries (ess_plugins.so) are provided with the model and loaded during TensorRT initialization. The estimated stereo disparity image from ESS can be further processed to produce a depth image or point cloud. The disparity output can be filtered by thresholding the confidence output to trade off between accuracy and density. The best threshold depends on the use case; 0.35 is a good starting point.
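As an illustration, the sketch below converts the ESS outputs to a metric depth image using the standard pinhole relation depth = focal_length_px * baseline / disparity, discarding low-confidence pixels. The focal length and baseline arguments are placeholders; use the calibration of your rectified stereo camera.

```python
import numpy as np

def disparity_to_depth(disparity, confidence, focal_px, baseline_m, threshold=0.35):
    """Convert an ESS disparity map (pixels) to a depth map (meters).

    focal_px and baseline_m are placeholders for the rectified camera's
    focal length (pixels) and stereo baseline (meters).
    """
    depth = np.zeros_like(disparity, dtype=np.float32)
    # Keep only confident, strictly positive disparities.
    valid = (confidence >= threshold) & (disparity > 0)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth  # invalid pixels remain 0
```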
ESS is designed and tested to run on NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 22.04 with CUDA 12.2, or JetPack 6.0 and later. This model can only be used with TensorRT.
To use this model, convert it to a TensorRT engine plan using TAO. The model is encrypted and can only be decrypted with the key ess. Provide this key to the TAO converter command when generating the TensorRT engine plan.
To improve inference performance, TensorRT custom plugin layers are added to the model. Pre-built plugins are provided with the model.
Convert the ESS model to an engine plan on x86_64 and aarch64 platforms, respectively:
LD_PRELOAD=plugins/x86_64/ess_plugins.so tao-converter -k ess -t fp16 -e ./ess.engine -o output_left,output_conf ess.etlt
LD_PRELOAD=plugins/aarch64/ess_plugins.so tao-converter -k ess -t fp16 -e ./ess.engine -o output_left,output_conf ess.etlt
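At inference time, the plugin library must be loaded before the engine plan is deserialized. A minimal sketch using the TensorRT Python API (the plugin and engine paths are illustrative):

```python
import ctypes
import tensorrt as trt

# Make the custom plugin layers available to TensorRT
# (same role as the LD_PRELOAD above; adjust the path per platform).
ctypes.CDLL("plugins/x86_64/ess_plugins.so")

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

# Deserialize the engine plan produced by tao-converter.
with open("ess.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
```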
Below we filter out pixels with low confidence by setting disparity[confidence < threshold] = -1, where threshold is 0.0 (fully dense) and 0.35, respectively.
Inputs for ESS are RGB stereo image pairs, each of size 576x960x3 (height x width x channels).
Inputs for Light ESS are RGB stereo image pairs, each of size 288x480x3 (height x width x channels).
Inputs must be normalized with mean 0.5 and standard deviation 0.5 (see the preprocessing sketch below).
The disparity map provides a disparity estimate for every pixel in the left image, with values ranging from 0 to infinity in FP32.
The confidence map provides a confidence estimate for every pixel in the left disparity map, with values between 0 and 1 in FP32.
Both outputs have a shape of 576x960 (height x width) for ESS, or 288x480 for Light ESS.
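A minimal preprocessing sketch for the inputs, assuming 8-bit RGB images (OpenCV is used here only as an example resizer; check your engine's binding layout, which may expect NCHW rather than HWC):

```python
import cv2
import numpy as np

MODEL_HW = (576, 960)  # (288, 480) for Light ESS

def preprocess(image_rgb, hw=MODEL_HW):
    """Resize an 8-bit RGB image to the model resolution and normalize."""
    h, w = hw
    resized = cv2.resize(image_rgb, (w, h), interpolation=cv2.INTER_LINEAR)
    x = resized.astype(np.float32) / 255.0
    return (x - 0.5) / 0.5  # mean 0.5, std 0.5, shape (H, W, 3), FP32
```

Apply the same preprocessing to both the left and right images before running inference.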
The training algorithm optimizes the network to minimize a negative log-likelihood loss [4] for disparity estimation and a binary cross-entropy (BCE) loss for confidence estimation.
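For illustration only, a schematic of how the two loss terms might be combined; this is not the exact mixture-density formulation of [4], and the confidence target definition here is an assumption (e.g., whether a pixel's disparity error falls below a tolerance):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_disp, pred_scale, pred_conf, gt_disp, conf_target):
    """Schematic loss: Laplacian NLL term for disparity plus BCE for confidence."""
    # Unimodal Laplacian NLL, a simplification of the SMD-Nets mixture in [4];
    # pred_scale is assumed positive (e.g., produced through a softplus).
    nll = (torch.abs(pred_disp - gt_disp) / pred_scale + torch.log(2.0 * pred_scale)).mean()
    bce = F.binary_cross_entropy(pred_conf, conf_target)
    return nll + bce
```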
The ESS model is trained on 1,000,000 synthetically generated stereo images in rendered scenes from Omniverse using Replicator-Composer.
Overall, the synthetic dataset consists of a diverse collection of scenes with randomized object models (2500+), textures (300+), materials (200+), background scenes (12), skyboxes (10+), lighting, camera intrinsics, and camera poses. For instance, the virtual stereo camera is randomized per scene as follows: the pose is sampled from scenario-specific camera spawn regions and orientations, the vertical camera height ranges from 0.05-2 m, the stereo baseline ranges from 0.05-0.25 m, and the focal length varies from 5-16 mm. The attribution list for the synthetic dataset can be found here.
The synthetic dataset is composed of several categories, including more chaotic-style scenes with flying distractors and increased randomization, and more realistic-style scenes with dropped objects which target certain domains, like robot navigation and robot manipulation. Below are example scenes, showing left, right, and disparity images.
The inference accuracy of the ESS model is measured on the R2B Dataset 2023. Before resizing images to 576x960 resolution, the top 48 rows are cropped out to preserve the height-width aspect ratio. The KPIs are measured on a subset of images from all sequences and reported at a model input resolution of 576x960 for ESS and 288x480 for Light ESS.
bpx measures the percentage of bad pixels at threshold x (pixels whose disparity deviates from ground truth by more than the threshold value). rmse and mae measure the root mean squared error and mean absolute error, respectively (a computation sketch follows the table below).
KPI | ESS | Light ESS |
---|---|---|
bp2 (%) | 11.47 | 7.95 |
bp4 (%) | 5.71 | 9.13 |
rmse | 9.63 | 5.11 |
mae | 4.08 | 2.25 |
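For reference, a sketch of how these KPIs are typically computed from a predicted and ground-truth disparity pair; the exact evaluation protocol (e.g., which pixels are masked) is not specified here:

```python
import numpy as np

def disparity_kpis(pred, gt, valid=None):
    """Compute bp2, bp4 (%), rmse, and mae over valid ground-truth pixels."""
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)
    err = np.abs(pred[valid] - gt[valid])
    return {
        "bp2": 100.0 * np.mean(err > 2.0),
        "bp4": 100.0 * np.mean(err > 4.0),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(err)),
    }
```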
Inference is run on the provided models at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Orin, RTX 4060 Ti, and RTX 4090 (an illustrative command follows the table below). To improve inference performance, TensorRT custom plugin layers are added to the model. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The results shown here are inference-only performance at an input resolution of 576x960 for ESS and 288x480 for Light ESS.
Platform | ESS (fps) | Light ESS (fps) |
---|---|---|
Jetson AGX Orin | 108.86 | 312.87 |
RTX 4060 Ti | 234.30 | 777.30 |
RTX 4090 | 660.02 | 1125.87 |
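A command along the following lines can be used to benchmark an already-built engine; the flags and paths are illustrative, and the plugin library must be visible as in the conversion step:

LD_PRELOAD=plugins/x86_64/ess_plugins.so trtexec --loadEngine=./ess.engine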
Disparity estimation for highly reflective and textureless surfaces is not measured separately. In general, such surfaces pose an ill-posed problem for stereo estimation.
ESS is not robust to transparent or semi-transparent materials.
Very thin objects such as wires are not detected reliably.
ESS is not trained on any temporal data. Thus, temporal inconsistency in estimation may appear.
4.1.0 ESS models improved through a data-driven process. Training used a mix of real data (100k) and synthetic data (1 million). The real-data ground truth was obtained from a larger teacher model.
4.0.0 ESS models with improved accuracy and performance. Training data is improved for model generalization and stability. Custom plugins are provided for improved performance.
3.1.0 ESS models with improved accuracy on confidence estimation.
3.0.0 ESS model with confidence output, improved accuracy and performance. Light ESS model with confidence output.
2.0.0 ESS model with improved architecture and training data.
1.1.0 ESS model trained on synthetic and real data.
1.0.0 ESS model trained on synthetic data. Initial version.
[1] Nikolai Smolyanskiy, Alexey Kamenev and Stan Birchfield. "On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach." CVPR Workshop on Autonomous Driving, 2018.
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan Yuille. "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
[3] Olaf Ronneberger, Philipp Fischer and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[4] Fabio Tosi, Yiyi Liao, Carolin Schmitt and Andreas Geiger. "SMD-Nets: Stereo Mixture Density Networks." Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.