NvOF is a lightweight deep learning-based network trained for estimating motion between frames. It provides models for two motion estimation tasks:
Optical Flow: Estimates motion between consecutive frames in a video sequence. Optical flow is defined as the pattern of apparent motion of image objects caused by the movement of the object or camera.
Stereo Matching: Computes disparity maps from stereo image pairs, which can be used for depth estimation and 3D reconstruction.
Both models share similar architectural principles but are trained for their specific tasks.
Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: These models are based on the widely used coarse-to-fine residual structure. First, a pyramidal feature extraction based on convolution is performed to obtain features at different scales. Then, the cost volume layer is used to calculate feature correlations. Finally, a UNet-like network decodes the extracted features and correlations into residuals to refine the upsampled outputs from the previous scale.
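As a rough illustration of the cost volume step described above, a simplified correlation-based cost volume over a local search window can be sketched in NumPy. The function name, search radius, and normalization here are illustrative, not the models' exact implementation:

```python
import numpy as np

def cost_volume(f1, f2, max_disp=4):
    """Simplified correlation cost volume between two feature maps of
    shape (C, H, W): for each spatial offset within +/-max_disp, compute
    the channel-wise dot product between f1 and the shifted f2."""
    C, H, W = f1.shape
    offsets = range(-max_disp, max_disp + 1)
    vol = np.zeros((len(offsets) ** 2, H, W), dtype=f1.dtype)
    i = 0
    for dy in offsets:
        for dx in offsets:
            # Shift f2 by (dy, dx); real implementations typically zero-pad
            # instead of wrapping around as np.roll does.
            shifted = np.roll(np.roll(f2, dy, axis=1), dx, axis=2)
            vol[i] = (f1 * shifted).sum(axis=0) / C
            i += 1
    return vol
```

In the real networks this correlation is computed per pyramid scale, so a small search radius at a coarse scale covers large displacements at full resolution.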
The initial release version includes both optical flow estimation and stereo matching models, trained on synthetic and real-world datasets.
For optical flow estimation, we utilize several public datasets including TartanAir, Spring, AutoFlow, and Kubric, along with additional synthetic data generated using UE4 to create more virtual scenarios.
For stereo matching, we utilize several public datasets including TartanAir, FoundationStereo, and DrivingStereo. Additionally, we leverage the CARLA simulator to generate synthetic data specifically for driving scenarios.
We use MPI-Sintel and KITTI2015, which are publicly available benchmarks, as our evaluation datasets for both models.
AEPE: computes the average per-pixel end-point error between the estimated and ground-truth optical flow vectors.
Fl-all: computes the percentage of pixels whose optical flow error is larger than 3 pixels.
Dataset | AEPE | Fl-all |
---|---|---|
Sintel-final | 3.04 | 9.04% |
Sintel-clean | 2.22 | 5.90% |
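The two metrics above can be sketched as follows, assuming flow arrays of shape (H, W, 2). This follows the definitions given here; benchmark implementations may differ in details such as invalid-pixel masking:

```python
import numpy as np

def aepe(flow_pred, flow_gt):
    """Average end-point error: mean Euclidean distance between
    predicted and ground-truth flow vectors, arrays of shape (H, W, 2)."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def fl_all(flow_pred, flow_gt, thresh=3.0):
    """Percentage of pixels whose end-point error exceeds `thresh` pixels."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return 100.0 * (err > thresh).mean()
```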
D1-all: computes the percentage of stereo disparity outliers in the first frame.
Dataset | AEPE | D1-all |
---|---|---|
Sintel-final | 4.34 | 13.77% |
KITTI2015 | 1.04 | 4.50% |
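A sketch of D1-all, following the common KITTI convention in which a pixel counts as an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity (the exact benchmark implementation may differ):

```python
import numpy as np

def d1_all(disp_pred, disp_gt):
    """Percentage of disparity outliers: pixels whose error exceeds
    3 px AND 5% of the ground-truth disparity (KITTI convention)."""
    err = np.abs(disp_pred - disp_gt)
    outlier = (err > 3.0) & (err > 0.05 * np.abs(disp_gt))
    return 100.0 * outlier.mean()
```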
Both models need to be used with NVIDIA hardware and software. They can run on any NVIDIA GPU including NVIDIA Jetson devices and are optimized for use with TensorRT.
Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: 4D
Other Properties Related to Input:
For both models, the input consists of two RGB images with pixel values in the range of 0-255.
Channel ordering of the input should be NCHW, where N is the batch size, C is the number of channels (6: the two RGB images concatenated channel-wise), H is the height, and W is the width.
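A minimal sketch of assembling the input tensor from two frames; the attached demos define the authoritative preprocessing (e.g. any resizing), so treat this as illustrative:

```python
import numpy as np

def make_input(img0, img1):
    """Stack two RGB frames of shape (H, W, 3), uint8 in 0-255, into the
    NCHW tensor the engines expect: shape (1, 6, H, W), float32.
    Per the model card, pixel values stay in the 0-255 range."""
    pair = np.concatenate([img0, img1], axis=-1)               # (H, W, 6)
    tensor = pair.transpose(2, 0, 1)[None].astype(np.float32)  # (1, 6, H, W)
    return tensor
```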
Output Type(s):
Output Format: Float32
Output Parameters: 4D
Other Properties Related to Output:
Channel ordering of the output will be NCHW, where N is the batch size, C is the number of output channels, H is the height, and W is the width.
We provide pre-generated TensorRT engines for both models, built with `versionCompatible` and `hardwareCompatibilityLevel=ampere+`, that users can use directly for inference. The engines were built with `--minShapes=input:1x6x256x256 --optShapes=input:1x6x512x512 --maxShapes=input:1x6x1088x1920`.
These version/hardware/resolution compatibility settings make the provided engines 30%-40% slower than engines optimized for a specific environment. If you need latency optimization tailored to your environment, or quality improvements for your specific scenario, please contact us.
In the attachments, we also provide Python demos; please refer to them for detailed implementation.
The pre-generated TensorRT engines require:
Inference is run on the provided models at FP16 precision. Performance is measured using the Python demos listed above and does not include memory copy time between the device and the host. To optimize inference performance, we use custom plugins in the model engines. The Jetson devices run at the Max-N configuration for maximum GPU frequency. The results shown here are measured with an input resolution of 1088×1920.
Platform | NvOF Optical Flow Model (ms) | NvOF Stereo Matching Model (ms) |
---|---|---|
RTX4090 | 2.76 | 2.12 |
A100 | 4.54 | 3.47 |
Jetson Orin | 21.2 | 18.97 |
NVIDIA Open Model License Agreement
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure these models meet requirements for the relevant industry and use case and address unforeseen product misuse.
For more detailed information on ethical considerations for these models, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.