FoundationStereo is a foundation model developed by NVIDIA Research for stereo depth estimation. It is designed for strong zero-shot generalization across a wide range of scenarios. The model takes an RGB stereo image pair as input and outputs an accurate disparity map.
This model is ready for commercial use.
GOVERNING TERMS: Use of this model is governed by the NVIDIA Community Model License. ADDITIONAL INFORMATION: Apache 2.0.
Global
The FoundationStereo model is for developers who intend to apply accurate zero-shot depth to 3D perception use cases in industrial, robotics, and smart space applications using stereo images as input.
NGC 07/30/2025 via https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/foundationstereo
Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O., & Birchfield, S. (2025). FoundationStereo: Zero-Shot Stereo Matching. arXiv preprint arXiv:2501.09898.
Architecture Type: Mixed Transformer-CNN based Network Architecture
Network Architecture:
The network consists of several modules. The feature extractor is generically designed and built on:
The pretrained DepthAnythingV2, a foundation monocular depth estimation network, which is kept frozen during feature extraction. Its features are sandwiched with those of a pretrained CNN-based model, EdgeNeXt. Although EdgeNeXt is pretrained on NV-Imagenet, it is not frozen during training, which produces a side-tuning effect on its layer weights.
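Conceptually, the sandwiching works as in the following PyTorch sketch; the module names and the simple channel concatenation are illustrative assumptions, not the actual FoundationStereo code.

```python
import torch
import torch.nn as nn

class HybridFeatureExtractor(nn.Module):
    """Sketch of a frozen-ViT + trainable-CNN feature extractor.

    `vit_backbone` stands in for the frozen DepthAnythingV2 encoder and
    `cnn_backbone` for the trainable EdgeNeXt branch; both are
    placeholders, assumed to emit spatially aligned feature maps.
    """

    def __init__(self, vit_backbone: nn.Module, cnn_backbone: nn.Module):
        super().__init__()
        self.vit = vit_backbone
        # Freeze the monocular-depth ViT: it only supplies rich priors.
        for p in self.vit.parameters():
            p.requires_grad = False
        # The CNN branch stays trainable, so gradients flowing through it
        # adapt ("side-tune") the combined representation.
        self.cnn = cnn_backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            vit_feat = self.vit(image)   # frozen features
        cnn_feat = self.cnn(image)       # trainable features
        # "Sandwich" the two streams along the channel dimension.
        return torch.cat([vit_feat, cnn_feat], dim=1)
```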
The extracted stereo features are used to build a cost volume, which is filtered by an Attentive Hybrid Cost Filtering (AHCF) module; the cost volume encodes rich correlation mappings between the left and right features.
In addition, a disparity transformer computes attention over the cost features. The transformer output is combined with the filtered cost volume to estimate an initial disparity map.
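For intuition, the basic correlation step underlying such a cost volume can be sketched as follows; AHCF's hybrid filtering and the disparity transformer are omitted, so this is a minimal sketch rather than the FoundationStereo implementation.

```python
import torch

def correlation_cost_volume(feat_l: torch.Tensor,
                            feat_r: torch.Tensor,
                            max_disp: int) -> torch.Tensor:
    """Plain correlation volume of shape (B, D, H, W).

    For each candidate disparity d, left-image features at column x are
    correlated (dot product over channels) with right-image features at
    column x - d. Positions with no valid match stay zero.
    """
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume
```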
To obtain a highly accurate disparity output, the initial disparity estimate is refined iteratively using a convolutional GRU. Lastly, the mean absolute error (MAE) metric is used to assess model performance.
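The refinement loop can be pictured with the sketch below; the GRU cell layout, residual update head, and iteration count are illustrative assumptions rather than the exact FoundationStereo update block.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (a sketch, not the actual
    FoundationStereo update block)."""
    def __init__(self, hidden_dim: int, input_dim: int):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))   # update gate
        r = torch.sigmoid(self.convr(hx))   # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

def refine_disparity(disp, hidden, context, gru, delta_head, iters=16):
    """Iteratively nudge the coarse disparity with residual updates."""
    predictions = []
    for _ in range(iters):
        hidden = gru(hidden, torch.cat([disp, context], dim=1))
        disp = disp + delta_head(hidden)    # small residual step
        predictions.append(disp)
    return predictions

def mean_absolute_error(pred, gt):
    """MAE metric used to assess the final disparity."""
    return (pred - gt).abs().mean()

# Example wiring (channel sizes are arbitrary for this sketch):
# gru = ConvGRUCell(hidden_dim=64, input_dim=1 + 32)  # disp + context
# delta_head = nn.Conv2d(64, 1, 3, padding=1)
```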
Number of model parameters: 6.3 * 10^7
Cumulative Compute: 2.03 * 10^22
Estimated Energy and Emissions for Model Training: 5.77 * 10^4 kWh
Input Types: Two RGB Stereo Images.
Input Formats: Red, Green, Blue (RGB) or grayscale images.
Input Parameters: Two-Dimensional (2D).
Other Properties Related to Input: B x 3 x H x W (Batch Size x Channels x Height x Width)
Output Types: Image.
Output Format: Disparity Map.
Output Parameters: Two-Dimensional (2D).
Other Properties Related to Output: B x 1 x H x W (Batch Size x Channels x Height x Width)
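To make the layouts above concrete, here is a minimal sketch that loads a stereo pair into the expected B x 3 x H x W layout; the file-path interface and [0, 1] normalization are assumptions, not the model's documented preprocessing.

```python
import numpy as np
from PIL import Image

def load_stereo_pair(left_path: str, right_path: str):
    """Return (left, right) arrays of shape 1 x 3 x H x W.

    Grayscale inputs are replicated to three channels by convert("RGB").
    """
    def to_tensor(path):
        img = Image.open(path).convert("RGB")
        arr = np.asarray(img, dtype=np.float32) / 255.0  # H x W x 3
        return arr.transpose(2, 0, 1)[None]              # 1 x 3 x H x W
    return to_tensor(left_path), to_tensor(right_path)
```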
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware and software frameworks, FoundationStereo achieves faster training and inference times compared to CPU-only solutions.
Runtime:
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating Systems: Linux
The network is trained on a mixture of datasets: synthetically generated data and real-world collected datasets with pseudo-label ground truth.
During training, the model predicts a coarse initial disparity. A GRU module, driven by a context network, refines the initial disparity over a specified number of iterations. An L1 loss is applied to the refined disparities and the network weights are updated accordingly. The validation dataset is carefully selected from a mix of synthetic and real data.
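A common way to implement this supervision is an L1 loss over the whole refinement sequence, as in the sketch below; the exponential weighting and gamma value follow RAFT-style practice and are assumptions here, not confirmed FoundationStereo hyperparameters.

```python
import torch

def sequence_l1_loss(disp_preds, disp_gt, valid, gamma=0.9):
    """L1 loss over the GRU refinement sequence.

    Later (more refined) iterations receive exponentially larger
    weights; `valid` masks out pixels with no ground-truth disparity.
    """
    n = len(disp_preds)
    loss = disp_gt.new_zeros(())
    for i, pred in enumerate(disp_preds):
        weight = gamma ** (n - 1 - i)
        loss = loss + weight * ((pred - disp_gt).abs() * valid).mean()
    return loss
```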
The synthetic data consists of a 1M-sample synthetic data generation (SDG) dataset named the Foundation Stereo Dataset (FSD). The training dataset can be downloaded at the link below. The real dataset mix was obtained from internal dataset collection efforts as well as DrivingStereo, a large-scale driving stereo dataset.
The model is first pretrained on a synthetic dataset, then finetuned procedurally on various synthetic + real-world dataset mixtures.
Link
Data Collection Method by Dataset
Labeling Method by Dataset
The synthetic dataset was generated using NVIDIA Omniverse and NVIDIA native 3D assets. The real dataset collection was driven by NVIDIA using well-regulated, consent-driven approaches.
Properties
Link
Acceleration Engine
Test Hardware:
The inference performance of the FoundationStereo model is evaluated at FP16 precision with an input resolution of 3x320x736 pixels. The performance assessment was conducted using trtexec on a range of devices. The table below lists inference rates on these hardware platforms; "BS" stands for batch size and latency is reported in milliseconds.
The performance data presented pertains solely to model inference. The end-to-end performance, when integrated with streaming video data, pre-processing and post-processing, might differ due to potential bottlenecks in hardware and software.
| Models (FP16) | Devices | Latency in ms (BS=1) | Images per Second (BS=1) | Latency in ms (BS=2) | Images per Second (BS=2) | Latency in ms (BS=4) | Images per Second (BS=4) |
|---|---|---|---|---|---|---|---|
| FoundationStereo | Orin AGX | 563.76 | 1.77 | 1142.44 | 1.75 | 2322.65 | 1.72 |
| FoundationStereo | Thor AGX | 653.34 | 1.53 | 1397.48 | 1.43 | 2930.58 | 1.36 |
| FoundationStereo | A100 | - | - | - | - | - | - |
| FoundationStereo | H100 | - | - | - | - | - | - |
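The latency and throughput columns above are consistent with latencies measured in milliseconds, as this small check illustrates:

```python
def images_per_second(latency_ms: float, batch_size: int) -> float:
    """Throughput implied by a measured per-batch latency."""
    return batch_size * 1000.0 / latency_ms

assert round(images_per_second(563.76, 1), 2) == 1.77   # Orin AGX, BS=1
assert round(images_per_second(2322.65, 4), 2) == 1.72  # Orin AGX, BS=4
```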
FoundationStereo may produce unreliable depth estimates for transparent objects (e.g., glass and water), in high-saturation scenes, or in poorly lit areas.
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for TensorRT.
The primary application of this model is to estimate an object's depth from a stereo RGB pair.
The model is designed for deployment on edge devices using TensorRT. TAO Triton apps offer capabilities to construct efficient image analytic pipelines. These pipelines can capture, decode, and process data before executing inference.
To create the entire end-to-end inference application, deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server). NVIDIA Triton Inference Server is open-source inference serving software that helps standardize model deployment and execution, delivering fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.
To deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) and run end-to-end inference on images, please refer to the TAO Triton apps.
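As a starting point, a Triton client request might look like the following sketch; the model name and the tensor names ("left", "right", "disparity") are placeholders that must be matched to the deployed model's config.pbtxt, and the TAO Triton apps wrap this flow with pre- and post-processing.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy stereo pair at the benchmarked 3x320x736 resolution.
left = np.random.rand(1, 3, 320, 736).astype(np.float32)   # B x 3 x H x W
right = np.random.rand(1, 3, 320, 736).astype(np.float32)

inputs = []
for name, arr in (("left", left), ("right", right)):
    inp = httpclient.InferInput(name, list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer(model_name="foundationstereo", inputs=inputs)
disparity = result.as_numpy("disparity")  # B x 1 x H x W disparity map
```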
The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.