The ESS model described in this card estimates disparity for a stereo image pair and returns a continuous disparity map for the given left image.
The model was optimized from the correlation network [1]. It uses atrous spatial pyramid pooling [2] for image feature extraction together with a UNet [3] for stereo matching. The feature extractor learns features from the left and right stereo images and passes them to a cost volume, which computes correlations between positions in the left and right features. The stereo matching module then takes the extracted features and correlations and predicts a disparity value for every pixel in the left input image.
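To make the cost volume step concrete, here is a minimal sketch of a correlation along a single scanline. It is an illustration only, not the ESS implementation: the function name, the list-based feature layout, and the `max_disp` parameter are assumptions, and a real network would compute this on GPU tensors over whole feature maps.

```python
def correlation_cost_volume(left_feats, right_feats, max_disp):
    """Correlate left and right features along one scanline.

    left_feats, right_feats: lists of per-pixel feature vectors (lists of floats).
    Returns volume[d][x] = dot(left[x], right[x - d]), or 0.0 where x - d < 0.
    """
    width = len(left_feats)
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(0.0)  # candidate match falls outside the right image
            else:
                l, r = left_feats[x], right_feats[x - d]
                row.append(sum(a * b for a, b in zip(l, r)))
        volume.append(row)
    return volume


# Toy scanline (hypothetical values): the left features equal the right
# features shifted by one pixel, so the true disparity is 1.
left = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
volume = correlation_cost_volume(left, right, max_disp=2)

# Pick the disparity with the highest correlation at pixel x = 3.
best_disp = max(range(len(volume)), key=lambda d: volume[d][3])
```

In this toy case the correlation at x = 3 peaks at a disparity of 1. In the model, the correlations are not reduced with a simple argmax; they are passed on to the stereo matching module together with the features.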
The estimated stereo disparity from the ESS model, along with additional processing, produces a depth image or point cloud of the scene, which is used for robot navigation.
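As one example of such additional processing, disparity can be converted to depth with the standard rectified-stereo relation Z = f * B / d. The sketch below assumes a rectified camera pair; the focal length and baseline values are hypothetical and are not taken from this card.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to a depth in meters: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 480 px focal length, 0.125 m baseline.
depth_m = disparity_to_depth(disparity_px=20.0, focal_length_px=480.0, baseline_m=0.125)
```

Because depth is inversely proportional to disparity, small disparity errors on distant objects translate into large depth errors.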
This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model is intended for Ubuntu 20.04 with CUDA 11.6, or JetPack 5.0 DP and later. Other software may apply. This model is typically used with TensorRT.
To use this model, convert it to a TensorRT engine plan using TAO. The model is encrypted and will only operate with the key ess. Make sure to pass this key to the TAO command that converts the model into a TensorRT engine plan.
Convert the ESS model to an engine plan:
./tao-converter -k ess -t fp16 -e ./ess.engine -o output_left ess.etlt
RGB stereo images of resolution 576x960x3 (height x width x channels), normalized by mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
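The normalization can be sketched as follows. This is a minimal illustration, assuming that 8-bit pixel values are first scaled to [0, 1] before the mean and standard deviation are applied; that scaling step and the function names are assumptions, not something this card states.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb_u8):
    """Scale 8-bit RGB to [0, 1] (assumed), then apply (value - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb_u8, MEAN, STD))


def normalize_image(image_u8):
    """image_u8: nested lists shaped height x width x 3 with values 0..255."""
    return [[normalize_pixel(px) for px in row] for row in image_u8]


# Tiny 1 x 2 example image: one black pixel and one white pixel.
tiny = [[(0, 0, 0), (255, 255, 255)]]
normalized = normalize_image(tiny)
```

Under this assumption, the inputs end up in the range [-1, 1]. A real pipeline would also resize each image to 576x960 and would normally use an array library rather than nested lists.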
Non-negative real values for every pixel in the input left image. Outputs a disparity estimation for the input left image.
The training algorithm optimizes the network to minimize the supervised L1 loss for every pixel of the mask, along with unsupervised losses for photometric consistency and piecewise smoothness, as elaborated in [1].
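The supervised term can be illustrated with a small masked L1 sketch. It covers only the supervised part; the unsupervised photometric and smoothness terms are not shown. The function name, the flat-list layout, and the mean reduction are assumptions made for illustration.

```python
def masked_l1_loss(pred, target, mask):
    """Mean absolute disparity error over pixels where mask is truthy."""
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    # Return 0.0 when no pixel is selected by the mask.
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical values: the third pixel is excluded by the mask.
loss = masked_l1_loss(
    pred=[1.0, 2.0, 5.0],
    target=[1.5, 2.0, 0.0],
    mask=[1, 1, 0],
)
```

Here the large error on the third pixel does not contribute, because that pixel lies outside the mask.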
The ESS model is trained on about 400k synthetically generated object examples in rendered scenes from Omniverse using Replicator-Composer. The attribution list for the dataset can be found here. The height of the camera varies randomly from 0.3 to 1.5 m above ground level. The training dataset consists of a mix of object sizes, textures and scales, camera positions and rotations, backgrounds and lighting. Some of the dataset categories feature flying objects, objects with textureless regions, and realistic images. For each scene, we generate left and right images and the ground truth disparity.
The inference performance of the ESS model was measured on the train dense evaluation dataset from Middlebury. The images were resized to 576x960 pixels before passing to the ESS model.
bpx is the percentage of bad pixels, defined as pixels whose absolute disparity difference from ground truth is larger than x, where x is the absolute disparity difference threshold.
rmse and rmse_percentage measure the root mean squared error and the root mean squared percentage error for non-occluded pixels.
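These metrics can be sketched in a few lines. The code below is an illustration under stated assumptions: the inputs are flat lists holding only non-occluded pixels, the threshold of 2.0 is a hypothetical value, and the relative-error form used for `rmse_percentage` is one plausible reading rather than a confirmed definition.

```python
import math


def bad_pixel_percentage(pred, gt, threshold):
    """Percentage of pixels whose absolute disparity error exceeds threshold."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > threshold)
    return 100.0 * bad / len(pred)


def rmse(pred, gt):
    """Root mean squared disparity error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))


def rmse_percentage(pred, gt):
    """Root mean squared relative error in percent (assumes every gt value > 0)."""
    rel = [((p - g) / g) ** 2 for p, g in zip(pred, gt)]
    return 100.0 * math.sqrt(sum(rel) / len(rel))


# Hypothetical disparities in pixels; the absolute errors are 0, 0, 3 and 4.
pred = [10.0, 20.0, 33.0, 40.0]
gt = [10.0, 20.0, 30.0, 44.0]
bp2 = bad_pixel_percentage(pred, gt, threshold=2.0)
error_rmse = rmse(pred, gt)
error_rmse_pct = rmse_percentage(pred, gt)
```

Two of the four pixels exceed the 2.0 threshold, so `bp2` is 50 percent, and the squared errors of 9 and 16 give an RMSE of 2.5.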
Inference is run on the provided models at FP16 precision. Performance is measured using trtexec on Jetson AGX Xavier, Jetson AGX Orin, and RTX 3060. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The performance shown here is inference-only performance with an input resolution of 576x960x3.
| Platform | Batch size | FPS |
|---|---|---|
| Jetson AGX Xavier | 1 | 32.5189 |
| Jetson AGX Orin | 1 | 69.9954 |
| RTX 3060 Ti | 1 | 186.834 |
The ESS model was trained on objects closer than 30 m. Therefore, it may not be able to predict disparities for objects that are farther away.
The model was not trained on any blurred images and hence is not intended to estimate disparity for blurred images.
The performance for highly reflective surfaces is unknown, since this is an ill-posed problem.
1.0.0 Initial version.
[1] Smolyanskiy, Nikolai & Kamenev, Alexey & Birchfield, Stan. (2018). On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach. CVPR Workshop on Autonomous Driving.
[2] Chen, Liang-Chieh & Papandreou, George & Kokkinos, Iasonas & Murphy, Kevin & Yuille, Alan. (2017). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4): 834-848.
[3] Ronneberger, Olaf & Fischer, Philipp & Brox, Thomas. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
License to use this model is covered by the Model EULA. By downloading the public and release version of the model, you accept the terms and conditions of these licenses.
Only synthetic datasets were used in the training of the ESS model; hence, no personal data was used in the development of this network. The network learns geometry and does not provide a classification of objects; hence, there are no ethical concerns in the use of our model.