
ESS DNN Stereo Disparity



ESS is a DNN that estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.



Use Case




Latest Version



January 25, 2023


65.43 MB

Model Overview

The ESS model described in this card estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.

Model Architecture

The model is an optimized variant of the correlation network [1]. It uses atrous spatial pyramid pooling [2] for image feature extraction and a UNet [3] for stereo matching. The feature extractor learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right features. The stereo matching module takes the extracted features and correlations as input and predicts a disparity value for every pixel in the left input image.
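As an illustration of what a correlation cost volume computes, here is a minimal sketch for a single image row. It is a generic formulation and not the ESS implementation: the function name, the feature dimensions, the shift range, and the zero padding at the image border are all assumptions made for the example.

```python
def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Correlate left and right feature rows over horizontal shifts.

    left_feat and right_feat hold one feature vector per pixel column of a
    single image row. The result is indexed as volume[d][x] and stores the
    dot product of the left feature at column x with the right feature at
    column x - d. Positions where x - d is outside the image are set to 0.0
    (an assumption for this sketch).
    """
    width = len(left_feat)
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(0.0)
            else:
                left_vec, right_vec = left_feat[x], right_feat[x - d]
                row.append(sum(a * b for a, b in zip(left_vec, right_vec)))
        volume.append(row)
    return volume


# Toy input: the right row is the left row shifted by one column,
# i.e. every pixel has a true disparity of 1.
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
right = left[1:] + [[0.0, 0.0]]
volume = correlation_cost_volume(left, right, max_disp=2)
```

In a network of this kind, the correlations are not reduced with a simple argmax. They are passed, together with the features, to the learned stereo matching module.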

How to Use this Model

The estimated stereo disparity from the ESS model, along with additional processing, produces a depth image or point cloud of the scene that can be used for robot navigation.
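One common piece of that additional processing is converting disparity to metric depth with the standard rectified-stereo relation Z = f * B / d. The sketch below assumes a rectified camera pair; the function name and the camera parameters (480 px focal length, 0.15 m baseline) are hypothetical values chosen for illustration, not properties of any particular camera.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to depth in meters: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 480 px focal length, 0.15 m baseline.
depth_m = disparity_to_depth(disparity_px=12.0, focal_length_px=480.0, baseline_m=0.15)
```

The focal length must be expressed in pixels at the resolution the disparity map refers to. If the images were resized before inference, the focal length has to be scaled by the same horizontal factor.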

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 20.04 with CUDA 11.6, or JetPack 5.0 DP and later. Other software may apply. This model is typically used with TensorRT.

Instructions to Convert Model to TensorRT Engine Plan

To use this model, convert it to a TensorRT engine plan using TAO. The model is encrypted and will only operate with the key ess. Make sure to pass this key to the TAO command when converting the model into a TensorRT engine plan.

Convert the ESS model to an engine plan:

./tao-converter -k ess -t fp16 -e ./ess.engine -o output_left ess.etlt


[Figure panels: Synthetic Images; Real Images]


The model takes as input a pair of RGB stereo images of resolution 576x960x3 (height x width x channels), normalized with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
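The normalization can be sketched per pixel as follows. The model card does not state the value range expected before normalization; this sketch assumes 8-bit pixel values are first scaled to [0, 1], which maps the final values to [-1, 1]. The function name is hypothetical.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb_uint8):
    """Normalize one 8-bit RGB pixel.

    Assumption: values are scaled from [0, 255] to [0, 1] first, then
    normalized per channel as (value - mean) / std.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb_uint8, MEAN, STD))


normalized = normalize_pixel((0, 255, 128))
```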


The model outputs a disparity estimate for the input left image: a non-negative real value for every pixel.


The training algorithm optimizes the network to minimize the supervised L1 loss for every pixel of the mask along with unsupervised losses for photometric consistency and piecewise smoothness, as elaborated in [1].
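The supervised term can be illustrated with a masked L1 loss over a flattened disparity map. This is a simplified sketch, not the training code: the function name is hypothetical, the unsupervised photometric and smoothness terms are omitted, and the weighting between terms is not given in this card.

```python
def masked_l1_loss(pred, target, mask):
    """Mean absolute disparity error over pixels where mask is True."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    # Return 0.0 for an empty mask to avoid dividing by zero.
    return sum(errors) / len(errors) if errors else 0.0


# The third pixel has no valid ground truth and is excluded by the mask.
loss = masked_l1_loss(
    pred=[10.0, 4.0, 7.5],
    target=[12.0, 4.0, 0.0],
    mask=[True, True, False],
)
```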


The ESS model is trained on about 400k synthetically generated object examples in scenes rendered in Omniverse using Replicator-Composer. The attribution list for the dataset can be found here. The camera height varies randomly from 0.3 m to 1.5 m above ground level. The training dataset consists of a mix of object sizes, textures and scales, camera positions and rotations, backgrounds, and lighting. Some of the dataset categories feature flying objects, objects with textureless regions, and realistic images. For each scene, left and right images and the ground-truth disparity are generated.


Dataset and KPI

The accuracy of the ESS model was measured on the training dense evaluation dataset from Middlebury. The images were resized to 576x960 pixels (height x width) before being passed to the ESS model.

bpx is the percentage of bad pixels, defined as pixels whose absolute disparity difference from ground truth is larger than x. The thresholds x are 0.006, 3, and 9, respectively. rmse and rmse_percentage are the root mean squared error and the root mean squared percentage error for non-occluded pixels.
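These metrics can be sketched as below for flattened lists of predicted and ground-truth disparities. The function names are hypothetical, the inputs are assumed to contain only non-occluded pixels with positive ground truth, and the exact definition of rmse_percentage used for the reported KPIs (taken here as the root mean square of the relative error) is an assumption.

```python
import math


def bad_pixel_rate(pred, gt, threshold):
    """Percentage of pixels whose absolute disparity error exceeds threshold."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > threshold)
    return 100.0 * bad / len(gt)


def rmse(pred, gt):
    """Root mean squared disparity error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def rmse_percentage(pred, gt):
    """Root mean square of the relative error; assumes every gt value is > 0."""
    return math.sqrt(sum(((p - g) / g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Toy data with absolute errors 0, 2, 3, and 10 pixels.
pred = [10.0, 20.0, 33.0, 40.0]
gt = [10.0, 22.0, 30.0, 50.0]
bp3 = bad_pixel_rate(pred, gt, threshold=3)
bp9 = bad_pixel_rate(pred, gt, threshold=9)
error = rmse(pred, gt)
```

Note that the comparison is strict, so a pixel whose error equals the threshold exactly is not counted as bad.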

KPI                    Value
bp006 (%)              0.208
bp3 (%)                0.180
bp9 (%)                0.058
rmse                   4.921
rmse_percentage (%)    0.135

Realtime Inference Performance

Inference is run on the provided models at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Xavier, Jetson AGX Orin, and RTX 3060. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only performance with an input resolution of 576x960x3.

Platform             Batch Size    FPS
Jetson AGX Xavier    1             32.5189
Jetson AGX Orin      1             69.9954
RTX 3060 Ti          1             186.834


Far away objects

The ESS model was trained on objects closer than 30 m. Therefore, it may not be able to predict disparities for objects that are farther away.

Blurry inputs

The model was not trained on any blurry images and hence is not intended to estimate disparity for blurry images.

Reflective surfaces

The performance on highly reflective surfaces is unknown, since this is an ill-posed problem.


1.0.0 Initial version.


[1] Smolyanskiy, Nikolai & Kamenev, Alexey & Birchfield, Stan. (2018). On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach. CVPR Workshop on Autonomous Driving.

[2] Chen, Liang-Chieh & Papandreou, George & Kokkinos, Iasonas & Murphy, Kevin & Yuille, Alan. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848.

[3] Ronneberger, Olaf & Fischer, Philipp & Brox, Thomas. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).


License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

Only synthetic datasets were used to train the ESS model, so no personal data was used in the development of this network. The network learns geometry and does not classify objects; hence there are no ethical concerns in the use of our model.