ESS DNN Stereo Disparity

ESS is a DNN that estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.
Latest Version
December 11, 2023
135.02 MB

Model Overview

The ESS model described in this card estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image and a confidence map of the predicted disparity. This model is ready for commercial use.


[1] Smolyanskiy, Nikolai & Kamenev, Alexey & Birchfield, Stan. (2018). On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach. CVPR Workshop on Autonomous Driving.

[2] Chen, Liang-Chieh & Papandreou, George & Kokkinos, Iasonas & Murphy, Kevin & Yuille, Alan. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848.

[3] Ronneberger, Olaf & Fischer, Philipp & Brox, Thomas. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

Model Architecture

The model is an optimized version of the convolutional network in [1]. It uses atrous spatial pyramid pooling [2] for image feature extraction together with a UNet [3] for stereo matching. The feature extractor learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right features. The stereo matching module takes the extracted features and correlations as input and predicts a disparity value for every pixel in the left input image. A confidence module provides a confidence map for the prediction.
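To make the cost volume step concrete, the following is a minimal NumPy sketch of a generic correlation cost volume. It is not the ESS implementation: the function name, the (channels, height, width) layout, the mean over channels, and the max_disp parameter are all assumptions chosen for illustration.

```python
import numpy as np

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Generic correlation cost volume (illustrative, not the ESS code).

    left_feat, right_feat: float arrays of shape (C, H, W).
    Returns an array of shape (max_disp, H, W) where entry [d, y, x] is the
    channel-averaged product of left_feat[:, y, x] and right_feat[:, y, x - d].
    Positions with x < d have no match candidate and are left at zero.
    """
    _, h, w = left_feat.shape
    volume = np.zeros((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        # A left pixel at column x corresponds to column x - d in the right image.
        volume[d, :, d:] = (left_feat[:, :, d:] * right_feat[:, :, :w - d]).mean(axis=0)
    return volume
```

With unit-norm feature vectors, the correlation peaks at the disparity where the left and right features coincide, which is the signal a matching network can then refine.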

How to Use this Model

The estimated stereo disparity image from the ESS model can be additionally processed to produce a depth image or point cloud.
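As an illustration of this post-processing, the sketch below converts a disparity map to a depth image using the standard rectified-stereo relation depth = focal length x baseline / disparity. The function and parameter names are hypothetical, and the focal length (in pixels, at the resolution of the disparity map) and baseline (in meters) must come from your own camera calibration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert disparity (pixels) to depth (meters) for a rectified stereo pair.

    Pixels with non-positive disparity (for example, filtered pixels set to -1)
    have no finite depth and are returned as infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float32)
    depth = np.full(disparity.shape, np.inf, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```

If the camera images are resized before inference, the focal length has to be scaled by the same horizontal factor so that it matches the resolution of the predicted disparity.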

This model must be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 20.04 with CUDA 11.6, or JetPack 5.0 DP and later. This model can only be used with TensorRT.

Instructions to Convert Model to TensorRT Engine Plan

To use this model, convert it to a TensorRT engine plan using TAO. The model is encrypted and only operates with the key ess, so make sure to pass this key to the TAO command that converts it into a TensorRT engine plan.

Convert the ESS model to an engine plan:

./tao-converter -k ess -t fp16 -e ./ess.engine -o output_left,output_conf ess.etlt


The disparity map can be filtered by thresholding the confidence output, which trades off accuracy against density. The choice of threshold value depends on the use case. The examples below filter out pixels with low confidence by setting disparity[confidence < threshold] = -1, with threshold set to 0.8 and 0.0 (fully dense), respectively.

NVIDIA Lounge (threshold = 0.8)
NVIDIA Cafe (threshold = 0.0, fully dense)
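The filtering rule above can be sketched in NumPy as follows. The function name and the invalid_value parameter are hypothetical; the rule itself, disparity[confidence < threshold] = -1, is the one stated in this card.

```python
import numpy as np

def filter_disparity(disparity, confidence, threshold=0.8, invalid_value=-1.0):
    """Mark low-confidence pixels as invalid without modifying the input."""
    filtered = np.array(disparity, dtype=np.float32, copy=True)
    filtered[np.asarray(confidence) < threshold] = invalid_value
    return filtered

def density(filtered, invalid_value=-1.0):
    """Fraction of pixels that survive filtering."""
    return float(np.mean(filtered != invalid_value))
```

Because confidence values lie between 0 and 1, a threshold of 0.0 removes nothing and keeps the map fully dense, while higher thresholds keep fewer but more reliable pixels.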


Inputs for ESS are RGB stereo image pairs, each of size 576x960x3 (height x width x channels).

Inputs for Light ESS are RGB stereo image pairs, each of size 288x480x3 (height x width x channels).

Inputs need to be normalized by mean 0.5 and standard deviation 0.5.
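A minimal sketch of the normalization step is shown below. It assumes the input is an 8-bit RGB image that is first scaled to the range 0 to 1, which this card does not state explicitly; the function name is hypothetical. The tensor layout expected by the engine (for example, channels first or last) is also not specified here, so the sketch keeps the height x width x channels order given above.

```python
import numpy as np

def normalize_image(image_u8, mean=0.5, std=0.5):
    """Scale an 8-bit image to [0, 1] (assumption), then apply (x - mean) / std."""
    image = np.asarray(image_u8, dtype=np.float32) / 255.0
    return (image - mean) / std
```

With mean 0.5 and standard deviation 0.5, the normalized values fall in the range -1 to 1.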


The disparity map provides a disparity estimate for every pixel in the left image, as non-negative FP32 values (range 0 to infinity).

The confidence map provides a confidence estimate for every pixel in the left disparity map, as FP32 values between 0 and 1.

Both outputs have the shape of 576x960 (height x width) for ESS or 288x480 for Light ESS.


The training algorithm optimizes the network to minimize the supervised L1 loss for disparity estimation and BCE loss for confidence estimation.
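The two loss terms can be written out as in the NumPy sketch below. This is only an illustration of L1 and binary cross-entropy (BCE) losses in general: how the confidence target is defined and how the two terms are weighted during ESS training are not stated in this card, so target_conf and conf_weight are hypothetical inputs.

```python
import numpy as np

def l1_loss(pred_disp, gt_disp):
    """Mean absolute difference between predicted and ground-truth disparity."""
    return float(np.mean(np.abs(np.asarray(pred_disp) - np.asarray(gt_disp))))

def bce_loss(pred_conf, target_conf, eps=1e-7):
    """Binary cross-entropy between predicted confidence and a 0/1 target."""
    p = np.clip(np.asarray(pred_conf, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target_conf, dtype=np.float64)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

def total_loss(pred_disp, gt_disp, pred_conf, target_conf, conf_weight=1.0):
    """Weighted sum of both terms; conf_weight is a hypothetical parameter."""
    return l1_loss(pred_disp, gt_disp) + conf_weight * bce_loss(pred_conf, target_conf)
```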


The ESS model is trained on 600,000 synthetically generated stereo images in rendered scenes from Omniverse using Replicator-Composer, as well as about 25,000 real sensor frames collected using HAWK stereo cameras.

The attribution list for the synthetic dataset can be found here. The height of the camera varies randomly from 0.3 m to 1.5 m above ground level. The training dataset consists of a mix of object sizes, textures, and scales, camera positions and rotations, backgrounds, and lighting conditions. Some of the dataset categories feature flying objects, objects with textureless regions, and realistic images. For each scene, we generate left and right RGB images and the ground truth disparity.

The real dataset was collected in diverse scenes, including indoor and outdoor NVIDIA facilities and public spaces within the United States, with a variety of distances, lighting conditions, and real-world objects.


Dataset and KPI

The inference accuracy of the ESS model is measured on the R2B Dataset 2023. Before resizing images to 576x960 resolution, the top 48 rows are cropped out to preserve height-width aspect ratio. The KPIs are measured on a subset of images from all sequences, and reported on model input resolution 576x960.

bpx measures the percentage of bad pixels for threshold x, that is, pixels whose predicted disparity deviates from the ground truth by more than the threshold value. rmse and mae measure the root mean squared error and the mean absolute error, respectively.

bp2 (%) 14.5839 20.6584
bp4 (%) 7.6857 9.9691
rmse 9.4793 10.3727
mae 4.3440 4.6873
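The metric definitions above can be sketched in NumPy as follows. This is an illustrative implementation, not the evaluation script used for the reported KPIs: the function name and the optional valid mask are assumptions, and the card does not state how pixels without ground truth are handled.

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None):
    """Compute bp2, bp4 (percent), rmse, and mae over the valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # Assumption: every pixel has usable ground truth.
        valid = np.ones(gt.shape, dtype=bool)
    err = np.abs(pred - gt)[valid]
    return {
        "bp2": float(100.0 * np.mean(err > 2.0)),
        "bp4": float(100.0 * np.mean(err > 4.0)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(err)),
    }
```

For example, absolute errors of 0, 3, 5, and 0 give bp2 = 50, bp4 = 25, mae = 2, and rmse = sqrt(8.5).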

Realtime Inference Performance

Inference is run on the provided models at FP16 precision. Performance is measured using trtexec on Jetson AGX Xavier, Jetson AGX Orin, and RTX 3060. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The results shown here are inference-only performance with input resolutions of 576x960x3 and 288x480x3 for ESS and Light ESS, respectively.

Platform ESS (fps) Light ESS (fps)
Jetson AGX Xavier 28.2454 119.4450
Jetson AGX Orin 84.8102 304.0230
RTX 3060 Ti 163.8710 626.2390


Reflective & Textureless Surfaces

Disparity estimation for highly reflective and textureless surfaces is not measured separately. In general, such surfaces pose an ill-posed problem for stereo estimation.


3.0.0 ESS model with confidence output, improved accuracy and performance. Light ESS model with confidence output.

2.0.0 ESS model with improved architecture and training data.

1.1.0 ESS model trained on synthetic and real data.

1.0.0 ESS model trained on synthetic data. Initial version.


The license to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.