
ESS DNN Stereo Disparity


Description

ESS is a DNN that estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.

Publisher

NVIDIA

Latest Version

2.0.0

Modified

September 25, 2023

Size

65.44 MB

Model Overview

The ESS model described in this card estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.

Model Architecture

The model architecture was derived from the correlation network [1], and it uses atrous spatial pyramid pooling [2] for image feature extraction together with a U-Net [3] for stereo matching. The feature extractor learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right feature maps. The stereo matching module then processes the extracted features and correlations and predicts a disparity value for every pixel in the left input image.
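
To make the cost-volume idea concrete, the following is a minimal illustrative sketch (not the ESS implementation): for each candidate disparity d, the left feature map is correlated with the right feature map shifted by d pixels, producing a per-disparity matching score.

import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    # feat_l, feat_r: (C, H, W) feature maps from the left/right images.
    # Returns a (max_disp, H, W) volume of correlation scores.
    C, H, W = feat_l.shape
    cost = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # A left pixel at column x matches the right pixel at column x - d.
        cost[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :W - d]).mean(axis=0)
    return cost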

How to Use this Model

With additional processing, the stereo disparity estimated by the ESS model yields a depth image or a point cloud of the scene, which can be used for robot navigation.
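
For a rectified stereo pair with known focal length f (in pixels) and baseline B (in meters), depth follows from disparity as depth = f * B / disparity. A minimal sketch of this conversion (the function name and the handling of zero disparity are illustrative):

import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    # Pixels with zero disparity correspond to points at infinity.
    depth = np.full_like(disparity, np.inf, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth  # depth in meters for every pixel of the left image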

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 20.04 with CUDA 11.6, or JetPack 5.0 DP and later; other software environments may also work. This model is typically used with TensorRT.

Instructions to Convert Model to TensorRT Engine Plan

To use this model, convert it to a TensorRT engine plan with the TAO converter. The model is encrypted and will only operate with the key ess; be sure to pass this key to the TAO converter command when generating the engine plan.

Convert the ESS model to an engine plan:

./tao-converter -k ess -t fp16 -e ./ess.engine -o output_left ess.etlt
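
Once the engine plan exists, it can be loaded and run with the TensorRT Python API. The sketch below assumes TensorRT 8.x with pycuda and a static-shape engine; the input binding names input_left and input_right are assumptions (only output_left is confirmed by the converter command above), so check the actual binding names on your engine.

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
with open("ess.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One pinned host buffer and one device buffer per binding.
buffers, bindings = {}, []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)
    device = cuda.mem_alloc(host.nbytes)
    buffers[engine.get_binding_name(i)] = (host, device, shape)
    bindings.append(int(device))

def infer(left_img, right_img):
    # left_img, right_img: normalized float32 arrays shaped to match the
    # engine's input bindings (576x960x3 per the Input section below).
    for name, img in (("input_left", left_img), ("input_right", right_img)):
        host, device, _ = buffers[name]
        np.copyto(host, img.ravel())
        cuda.memcpy_htod(device, host)
    context.execute_v2(bindings)
    host, device, shape = buffers["output_left"]
    cuda.memcpy_dtoh(host, device)
    return host.reshape(shape)  # continuous disparity map for the left image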

Example

Example disparity predictions on two scenes: NVIDIA Lounge and NVIDIA Storage.

Input

RGB stereo images of resolution 576x960x3 (height x width x channels), normalized by mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
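
A minimal preprocessing sketch matching this specification, assuming OpenCV for image I/O and 8-bit input images (the helper name and path handling are illustrative):

import cv2
import numpy as np

def preprocess(image_path):
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (960, 576))      # cv2 takes (width, height)
    img = img.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
    return (img - 0.5) / 0.5               # normalize: mean 0.5, std 0.5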

Output

A non-negative real-valued disparity estimate for every pixel of the input left image.

Training

The training algorithm optimizes the network to minimize the supervised L1 loss over every pixel of the ground-truth mask, along with unsupervised losses for photometric consistency and piecewise smoothness, as elaborated in [1].
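
Schematically, and with illustrative weights \lambda_i (the exact terms are defined in [1]), the objective has the form:

\mathcal{L} = \lambda_1 \frac{1}{|M|} \sum_{p \in M} \left| d(p) - d^{gt}(p) \right| + \lambda_2 \mathcal{L}_{\mathrm{photo}} + \lambda_3 \mathcal{L}_{\mathrm{smooth}}

where M is the set of pixels with valid ground truth, d(p) is the predicted disparity, and d^{gt}(p) the ground-truth disparity.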

Dataset

The ESS model is trained on 300,000 synthetic stereo image pairs rendered in Omniverse scenes using Replicator-Composer, as well as about 25,000 real sensor frames collected with HAWK stereo cameras.

The attribution list for the synthetic dataset can be found here. The vertical height of the camera is set to vary randomly from 0.3 m to 1.5 m above ground level. The training dataset consists of a mix of object sizes, textures, scales, camera positions and rotations, backgrounds, and lighting. Some of the dataset categories feature flying objects, objects with textureless regions, and realistic images. For each scene, left and right images and the ground-truth disparity map are generated.

The real dataset was collected in diverse scenes, including indoor and outdoor areas of NVIDIA's facilities and public spaces within the United States, covering a variety of distance ranges, lighting conditions, and real-world objects.

Performance

Dataset and KPI

The inference performance of the ESS model is measured on HAWK stereo camera sequences from the R2B Dataset 2023. Before the images are resized to 576x960, the top 48 rows are cropped out to preserve the height-to-width aspect ratio. The KPIs are averaged over a subset of all 7 sequences, computed and reported at 576x960 resolution.
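
For example, assuming 1200x1920 HAWK frames (the frame size is an assumption; the 48-row crop is from the text), dropping the top 48 rows yields 1152x1920, which matches the 576:960 aspect ratio before downscaling:

import cv2

frame = cv2.imread("hawk_left.png")        # illustrative path; e.g. 1200x1920x3
cropped = frame[48:, :, :]                 # drop top 48 rows -> 1152x1920x3
resized = cv2.resize(cropped, (960, 576))  # downscale to network resolution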

bpx measures the percentage of bad pixels, defined as pixels whose absolute disparity difference from ground truth exceeds the threshold x, for x = 1, 2, and 3. rmse and mae are the root mean squared error and the mean absolute error of the disparity, respectively.
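
A minimal sketch of these metrics, assuming pred and gt are disparity maps of identical shape with valid ground truth everywhere (names are illustrative):

import numpy as np

def disparity_metrics(pred, gt):
    err = np.abs(pred - gt)
    metrics = {f"bp{x}": 100.0 * float(np.mean(err > x)) for x in (1, 2, 3)}
    metrics["rmse"] = float(np.sqrt(np.mean(err ** 2)))  # root mean squared error
    metrics["mae"] = float(np.mean(err))                 # mean absolute error
    return metrics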

KPI      Value
bp1 (%)  25.4272
bp2 (%)  9.4941
bp3 (%)  5.2346
rmse     1.8629
mae      0.9520

Realtime Inference Performance

Inference is run on the provided model at FP16 precision, using trtexec on Jetson AGX Xavier, Jetson AGX Orin, and RTX 3060 Ti. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here are inference-only performance with an input resolution of 576x960x3.
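
A benchmark of this kind can be reproduced with a command along the following lines (exact flags vary across TensorRT versions):

trtexec --loadEngine=ess.engine --iterations=100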

Platform           Batch Size  FPS
Jetson AGX Xavier  1           25.8937
Jetson AGX Orin    1           54.2302
RTX 3060 Ti        1           144.884

Limitations

Reflective surfaces

Performance on highly reflective surfaces is unknown, since estimating disparity on such surfaces is an ill-posed problem.

Versions

2.0.0 ESS model with improved architecture and training data.

1.1.0 ESS model trained on synthetic and real data.

1.0.0 ESS model trained on synthetic data. Initial version.

References

[1] Smolyanskiy, N., Kamenev, A., & Birchfield, S. (2018). On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach. CVPR Workshop on Autonomous Driving.

[2] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848.

[3] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

License

License to use this model is covered by the Model EULA. By downloading the publicly released version of the model, you accept the terms and conditions of this license.

Ethical Considerations

This dataset consists of captures from NVIDIA's facilities and public spaces in the United States. Persons appearing in the dataset(s) were informed of the data recording at the time of capture.

This dataset is not sufficiently diverse to be representative of the general public; however, it does reflect the diversity of the population found in urban centers.