FoundationStereo is a foundation model developed by NVIDIA Research for stereo depth estimation. It is designed for strong zero-shot generalization across a wide range of scenarios. The model takes an RGB stereo image pair as input and outputs an accurate disparity map.
This model is ready for commercial use.
GOVERNING TERMS: Use of this model is governed by the NVIDIA Community Model License. ADDITIONAL INFORMATION: Apache 2.0.
Global
The FoundationStereo model is for developers who intend to apply accurate zero-shot depth to 3D perception use cases in industrial, robotics, and smart space applications using stereo images as input.
NGC 07/30/2025 via https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/foundationstereo
Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O., & Birchfield, S. (2025). FoundationStereo: Zero-Shot Stereo Matching. arXiv preprint arXiv:2501.09898.
Architecture Type: Mixed Transformer-CNN based Network Architecture
Network Architecture:
The network consists of several modules. The feature extractor is generically designed and built on:
The pretrained DepthAnythingV2, a foundation monocular depth estimation network, which is kept frozen during feature extraction. Its features are sandwiched with those of a pretrained CNN-based model, EdgeNeXt. Although EdgeNeXt is pretrained on NV-Imagenet, it is not frozen during training, which produces a side-tuning effect on its layer weights.
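Conceptually, the sandwiching works as in the following PyTorch sketch; the module names and the simple channel concatenation are illustrative assumptions, not the actual FoundationStereo code.

```python
import torch
import torch.nn as nn

class HybridFeatureExtractor(nn.Module):
    """Sketch of a frozen-ViT + trainable-CNN feature extractor.

    `vit_backbone` stands in for the frozen DepthAnythingV2 encoder and
    `cnn_backbone` for the trainable EdgeNeXt branch; both are
    placeholders, assumed to emit spatially aligned feature maps.
    """

    def __init__(self, vit_backbone: nn.Module, cnn_backbone: nn.Module):
        super().__init__()
        self.vit = vit_backbone
        # Freeze the monocular-depth ViT: it only supplies rich priors.
        for p in self.vit.parameters():
            p.requires_grad = False
        # The CNN branch stays trainable, so gradients flowing through it
        # adapt ("side-tune") the combined representation.
        self.cnn = cnn_backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            vit_feat = self.vit(image)   # frozen features
        cnn_feat = self.cnn(image)       # trainable features
        # "Sandwich" the two streams along the channel dimension.
        return torch.cat([vit_feat, cnn_feat], dim=1)
```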
The extracted stereo features are used to build a cost volume, which is filtered by an Attentive Hybrid Cost Filtering (AHCF) module; the cost volume encodes rich correlation mappings between the left and right features.
In addition, a disparity transformer computes attention over the cost features. The transformer output is combined with the filtered cost volume to estimate an initial disparity map.
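For intuition, the basic correlation step underlying such a cost volume can be sketched as follows; AHCF's hybrid filtering and the disparity transformer are omitted, so this is a minimal sketch rather than the FoundationStereo implementation.

```python
import torch

def correlation_cost_volume(feat_l: torch.Tensor,
                            feat_r: torch.Tensor,
                            max_disp: int) -> torch.Tensor:
    """Plain correlation volume of shape (B, D, H, W).

    For each candidate disparity d, left-image features at column x are
    correlated (dot product over channels) with right-image features at
    column x - d. Positions with no valid match stay zero.
    """
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume
```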
To obtain a highly accurate disparity output, the initial disparity estimate is refined iteratively using a convolutional GRU. Lastly, the mean absolute error (MAE) metric is used to assess model performance.
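The refinement loop can be pictured with the sketch below; the GRU cell layout, residual update head, and iteration count are illustrative assumptions rather than the exact FoundationStereo update block.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (a sketch, not the actual
    FoundationStereo update block)."""
    def __init__(self, hidden_dim: int, input_dim: int):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))   # update gate
        r = torch.sigmoid(self.convr(hx))   # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

def refine_disparity(disp, hidden, context, gru, delta_head, iters=16):
    """Iteratively nudge the coarse disparity with residual updates."""
    predictions = []
    for _ in range(iters):
        hidden = gru(hidden, torch.cat([disp, context], dim=1))
        disp = disp + delta_head(hidden)    # small residual step
        predictions.append(disp)
    return predictions

def mean_absolute_error(pred, gt):
    """MAE metric used to assess the final disparity."""
    return (pred - gt).abs().mean()

# Example wiring (channel sizes are arbitrary for this sketch):
# gru = ConvGRUCell(hidden_dim=64, input_dim=1 + 32)  # disp + context
# delta_head = nn.Conv2d(64, 1, 3, padding=1)
```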
Number of model parameters: 6.3 * 10^7
Cumulative Compute: 2.03 * 10^22
Estimated Energy and Emissions for Model Training: 5.77 * 10^4 kWh
Input Types: Two RGB Stereo Images.
Input Formats: Red, Green, Blue (RGB) or grayscale images.
Input Parameters: Two-Dimensional (2D).
Other Properties Related to Input: B x 3 x H x W (Batch Size x Channels x Height x Width)
Output Types: Image.
Output Format: Disparity Map.
Output Parameters: Two-Dimensional (2D).
Other Properties Related to Output: B x 1 x H x W (Batch Size x Channels x Height x Width)
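To make the layouts above concrete, here is a minimal sketch that loads a stereo pair into the expected B x 3 x H x W layout; the file-path interface and [0, 1] normalization are assumptions, not the model's documented preprocessing.

```python
import numpy as np
from PIL import Image

def load_stereo_pair(left_path: str, right_path: str):
    """Return (left, right) arrays of shape 1 x 3 x H x W.

    Grayscale inputs are replicated to three channels by convert("RGB").
    """
    def to_tensor(path):
        img = Image.open(path).convert("RGB")
        arr = np.asarray(img, dtype=np.float32) / 255.0  # H x W x 3
        return arr.transpose(2, 0, 1)[None]              # 1 x 3 x H x W
    return to_tensor(left_path), to_tensor(right_path)
```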
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware and software frameworks, FoundationStereo achieves faster training and inference times compared to CPU-only solutions.
Runtime:
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating Systems: Linux
The network is trained on a mixture of datasets: synthetically generated data and real-world collected datasets with pseudo-label ground truth.
During training, the model predicts a coarse initial disparity. A GRU module, driven by a context network, refines the initial disparity over a specified number of iterations. An L1 loss is applied to the refined disparities and the network weights are updated accordingly. The validation dataset is carefully selected from a mix of synthetic and real data.
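A common way to implement this supervision is an L1 loss over the whole refinement sequence, as in the sketch below; the exponential weighting and gamma value follow RAFT-style practice and are assumptions here, not confirmed FoundationStereo hyperparameters.

```python
import torch

def sequence_l1_loss(disp_preds, disp_gt, valid, gamma=0.9):
    """L1 loss over the GRU refinement sequence.

    Later (more refined) iterations receive exponentially larger
    weights; `valid` masks out pixels with no ground-truth disparity.
    """
    n = len(disp_preds)
    loss = disp_gt.new_zeros(())
    for i, pred in enumerate(disp_preds):
        weight = gamma ** (n - 1 - i)
        loss = loss + weight * ((pred - disp_gt).abs() * valid).mean()
    return loss
```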
The synthetic data consists of a 1M-sample synthetic data generation (SDG) dataset named the Foundation Stereo Dataset (FSD). The training dataset can be downloaded at the link below. The real dataset mix was obtained from internal dataset collection efforts as well as DrivingStereo, a large-scale driving stereo dataset.
The model is first pretrained on a synthetic dataset, then finetuned procedurally on various synthetic + real-world dataset mixtures.
Link
Data Collection Method by Dataset
Labeling Method by Dataset
The synthetic dataset was generated using NVIDIA Omniverse and NVIDIA native 3D assets. The real dataset collection was driven by NVIDIA using well-regulated, consent-driven approaches.
Properties
Link
Acceleration Engine
Test Hardware:
The inference performance of the FoundationStereo model is evaluated at FP16 precision with an input resolution of 3x320x736 pixels. The performance assessment was conducted using trtexec on a range of devices. The table below lists inference rates on these hardware platforms; "BS" stands for batch size and latency is reported in milliseconds.
The performance data presented pertains solely to model inference. The end-to-end performance, when integrated with streaming video data, pre-processing and post-processing, might differ due to potential bottlenecks in hardware and software.
| Models (FP16) | Devices | Latency in ms (BS=1) | Images per Second (BS=1) | Latency in ms (BS=2) | Images per Second (BS=2) | Latency in ms (BS=4) | Images per Second (BS=4) |
|---|---|---|---|---|---|---|---|
| FoundationStereo | Orin AGX | 563.76 | 1.77 | 1142.44 | 1.75 | 2322.65 | 1.72 |
| FoundationStereo | Thor AGX | 653.34 | 1.53 | 1397.48 | 1.43 | 2930.58 | 1.36 |
| FoundationStereo | A100 | - | - | - | - | - | - |
| FoundationStereo | H100 | - | - | - | - | - | - |
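The latency and throughput columns above are consistent with latencies measured in milliseconds, as this small check illustrates:

```python
def images_per_second(latency_ms: float, batch_size: int) -> float:
    """Throughput implied by a measured per-batch latency."""
    return batch_size * 1000.0 / latency_ms

assert round(images_per_second(563.76, 1), 2) == 1.77   # Orin AGX, BS=1
assert round(images_per_second(2322.65, 4), 2) == 1.72  # Orin AGX, BS=4
```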
FoundationStereo may produce unreliable depth estimates for transparent objects (e.g., glass and water), in high-saturation scenes, or in poorly lit areas.
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for TensorRT.
The primary application of this model is to estimate an object's depth from a stereo RGB pair.
The model is designed for deployment on edge devices using TensorRT. TAO Triton apps offer capabilities to construct efficient image analytic pipelines. These pipelines can capture, decode, and process data before executing inference.
To create the entire end-to-end inference application, deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server). NVIDIA Triton Inference Server is open-source inference serving software that helps standardize model deployment and execution, delivering fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.
To deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) and run end-to-end inference on images, please refer to the TAO Triton apps.
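As a starting point, a Triton client request might look like the following sketch; the model name and the tensor names ("left", "right", "disparity") are placeholders that must be matched to the deployed model's config.pbtxt, and the TAO Triton apps wrap this flow with pre- and post-processing.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy stereo pair at the benchmarked 3x320x736 resolution.
left = np.random.rand(1, 3, 320, 736).astype(np.float32)   # B x 3 x H x W
right = np.random.rand(1, 3, 320, 736).astype(np.float32)

inputs = []
for name, arr in (("left", left), ("right", right)):
    inp = httpclient.InferInput(name, list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer(model_name="foundationstereo", inputs=inputs)
disparity = result.as_numpy("disparity")  # B x 1 x H x W disparity map
```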
The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.