NvOF is a lightweight deep learning-based network trained for estimating motion between frames. It provides models for two motion estimation tasks:
Optical Flow: Estimates motion between consecutive frames in a video sequence. Optical flow is defined as the pattern of apparent motion of image objects caused by the movement of the object or camera.
Stereo Matching: Computes disparity maps from stereo image pairs, which can be used for depth estimation and 3D reconstruction.
Both models share similar architectural principles but are trained for their specific tasks.
Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: These models are based on the widely used coarse-to-fine residual structure. First, a pyramidal feature extraction based on convolution is performed to obtain features at different scales. Then, the cost volume layer is used to calculate feature correlations. Finally, a UNet-like network decodes the extracted features and correlations into residuals to refine the upsampled outputs from the previous scale.
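As a rough illustration of the cost volume step described above, a simplified correlation-based cost volume over a local search window can be sketched in NumPy. The function name, search radius, and normalization here are illustrative, not the models' exact implementation:

```python
import numpy as np

def cost_volume(f1, f2, max_disp=4):
    """Simplified correlation cost volume between two feature maps of
    shape (C, H, W): for each spatial offset within +/-max_disp, compute
    the channel-wise dot product between f1 and the shifted f2."""
    C, H, W = f1.shape
    offsets = range(-max_disp, max_disp + 1)
    vol = np.zeros((len(offsets) ** 2, H, W), dtype=f1.dtype)
    i = 0
    for dy in offsets:
        for dx in offsets:
            # Shift f2 by (dy, dx); real implementations typically zero-pad
            # instead of wrapping around as np.roll does.
            shifted = np.roll(np.roll(f2, dy, axis=1), dx, axis=2)
            vol[i] = (f1 * shifted).sum(axis=0) / C
            i += 1
    return vol
```

In the real networks this correlation is computed per pyramid scale, so a small search radius at a coarse scale covers large displacements at full resolution.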
The initial release version includes both optical flow estimation and stereo matching models, trained on synthetic and real-world datasets.
For optical flow estimation, we utilize several public datasets including TartanAir, Spring, AutoFlow, and Kubric, along with additional synthetic data generated using UE4 to create more virtual scenarios.
For stereo matching, we utilize several public datasets including TartanAir, FoundationStereo, and DrivingStereo. Additionally, we leverage the CARLA simulator to generate synthetic data specifically for driving scenarios.
We use MPI-Sintel and KITTI2015, which are publicly available benchmarks, as our evaluation datasets for both models.
AEPE: computes the average per-pixel end-point error between the estimated and ground-truth optical flow vectors.
Fl-all: computes the percentage of pixels whose optical flow error is larger than 3 pixels.
Dataset | AEPE | Fl-all |
---|---|---|
Sintel-final | 3.04 | 9.04% |
Sintel-clean | 2.22 | 5.90% |
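The two metrics above can be sketched as follows, assuming flow arrays of shape (H, W, 2). This follows the definitions given here; benchmark implementations may differ in details such as invalid-pixel masking:

```python
import numpy as np

def aepe(flow_pred, flow_gt):
    """Average end-point error: mean Euclidean distance between
    predicted and ground-truth flow vectors, arrays of shape (H, W, 2)."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def fl_all(flow_pred, flow_gt, thresh=3.0):
    """Percentage of pixels whose end-point error exceeds `thresh` pixels."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return 100.0 * (err > thresh).mean()
```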
D1-all: computes the percentage of stereo disparity outliers in the first frame.
Dataset | AEPE | D1-all |
---|---|---|
Sintel-final | 4.34 | 13.77% |
KITTI2015 | 1.04 | 4.50% |
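A sketch of D1-all, following the common KITTI convention in which a pixel counts as an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity (the exact benchmark implementation may differ):

```python
import numpy as np

def d1_all(disp_pred, disp_gt):
    """Percentage of disparity outliers: pixels whose error exceeds
    3 px AND 5% of the ground-truth disparity (KITTI convention)."""
    err = np.abs(disp_pred - disp_gt)
    outlier = (err > 3.0) & (err > 0.05 * np.abs(disp_gt))
    return 100.0 * outlier.mean()
```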
Both models need to be used with NVIDIA hardware and software. They can run on any NVIDIA GPU including NVIDIA Jetson devices and are optimized for use with TensorRT.
Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: 4D
Other Properties Related to Input:
For both models, the input consists of two RGB images with pixel values in the range of 0-255.
Channel ordering of the input should be NCHW, where N is the batch size, C is the number of channels (6: the two RGB images concatenated channel-wise), H is the height, and W is the width.
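A minimal sketch of assembling the input tensor from two frames; the attached demos define the authoritative preprocessing (e.g. any resizing), so treat this as illustrative:

```python
import numpy as np

def make_input(img0, img1):
    """Stack two RGB frames of shape (H, W, 3), uint8 in 0-255, into the
    NCHW tensor the engines expect: shape (1, 6, H, W), float32.
    Per the model card, pixel values stay in the 0-255 range."""
    pair = np.concatenate([img0, img1], axis=-1)               # (H, W, 6)
    tensor = pair.transpose(2, 0, 1)[None].astype(np.float32)  # (1, 6, H, W)
    return tensor
```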
Output Type(s):
Output Format: Float32
Output Parameters: 4D
Other Properties Related to Output:
Channel ordering of the output will be NCHW, where N is the batch size, C is the number of output channels, H is the height, and W is the width.
We provide pre-generated TensorRT engines for both models, built with `versionCompatible` and `hardwareCompatibilityLevel=ampere+`, that users can use directly for inference. The engines were built with `--minShapes=input:1x6x256x256 --optShapes=input:1x6x512x512 --maxShapes=input:1x6x1088x1920`.
These version/hardware/resolution compatibility settings make the provided engines 30%-40% slower than engines optimized for a specific environment. If you need latency optimization tailored to your environment, or quality improvements for your specific scenario, please contact us.
In the attachments, we also provide Python demos; please refer to them for detailed implementation.
The pre-generated TensorRT engines require:
Inference is run on the provided models at FP16 precision. Performance is measured using the Python demos listed above and does not include memory copy time between the device and the host. To optimize inference performance, we use custom plugins in the model engines. The Jetson devices run at the Max-N configuration for maximum GPU frequency. The results shown here are measured with an input resolution of 1088×1920.
Platform | NvOF Optical Flow Model (ms) | NvOF Stereo Matching Model (ms) |
---|---|---|
RTX4090 | 2.76 | 2.12 |
A100 | 4.54 | 3.47 |
Jetson Orin | 21.2 | 18.97 |
NVIDIA Open Model License Agreement
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure these models meet requirements for the relevant industry and use case and address unforeseen product misuse.
For more detailed information on ethical considerations for these models, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.