FoundationPose is a unified foundation model for 6-DoF (Degrees of Freedom) object pose estimation and tracking. This approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given. This model is ready for commercial use.
The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.
Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield. "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Architecture Type: Transformer-based Network Architecture
Network Architecture
More Details
This model features a novel design of transformer-based network architectures and a contrastive learning formulation, which leads to strong generalization ability. The FoundationPose model contains two separate networks: the refinement network (refine net) and the score network (score net).
The refinement network extracts feature maps from the two RGBD input branches with a single shared CNN encoder. The feature maps are concatenated, fed into CNN blocks with residual connection, and tokenized by dividing into patches with position embedding. Finally, the network predicts the translation update and rotation update, each individually processed by a transformer encoder and linearly projected to the output dimension.
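The following is a minimal PyTorch sketch of that layout. Channel counts, patch size, token dimension, and the 3-parameter rotation head are illustrative assumptions, not the released network's hyperparameters.

```python
# Schematic PyTorch sketch of the refinement-network layout described above.
# All layer sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class RefineNetSketch(nn.Module):
    def __init__(self, in_ch=6, feat=64, dim=256, patch=4):
        super().__init__()
        # Single shared CNN encoder applied to both RGBD branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        # CNN block with a residual connection over the concatenated features.
        self.res_block = nn.Sequential(
            nn.Conv2d(2 * feat, 2 * feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(2 * feat, 2 * feat, 3, padding=1),
        )
        # Patch tokenization with a learned position embedding.
        self.to_tokens = nn.Conv2d(2 * feat, dim, kernel_size=patch, stride=patch)
        self.pos_emb = nn.Parameter(torch.zeros(1, 100, dim))
        # Separate transformer encoders and linear heads for the two updates.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.trans_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.rot_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.trans_head = nn.Linear(dim, 3)   # translation update
        self.rot_head = nn.Linear(dim, 3)     # rotation update

    def forward(self, rendered, observed):
        f = torch.cat([self.encoder(rendered), self.encoder(observed)], dim=1)
        f = f + self.res_block(f)                              # residual connection
        tokens = self.to_tokens(f).flatten(2).transpose(1, 2)  # B x N x dim
        tokens = tokens + self.pos_emb[:, : tokens.shape[1]]
        delta_trans = self.trans_head(self.trans_encoder(tokens).mean(dim=1))
        delta_rot = self.rot_head(self.rot_encoder(tokens).mean(dim=1))
        return delta_trans, delta_rot
```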
The score network is a pose ranking encoder that uses the same backbone architecture for feature extraction as the refinement network. The extracted features are concatenated, tokenized, and forwarded to a multi-head self-attention module so as to leverage the global image context for comparison, and the highest-scoring hypothesis pose is selected.
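A minimal sketch of that ranking step is shown below; the layer sizes are illustrative assumptions, not the released model's hyperparameters, and the 252-hypothesis batch matches the output shapes listed later in this card.

```python
# Schematic PyTorch sketch of the pose-ranking (score) network described above
# and of selecting the best of the 252 hypothesis poses.
import torch
import torch.nn as nn

class ScoreNetSketch(nn.Module):
    def __init__(self, in_ch=6, feat=64, dim=256):
        super().__init__()
        # Same backbone idea as the refinement network: a shared CNN encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_tokens = nn.Conv2d(2 * feat, dim, kernel_size=4, stride=4)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(dim, 1)   # one scalar score per hypothesis

    def forward(self, rendered, observed):
        # Concatenate features from the rendered and observed branches,
        # tokenize, and apply global multi-head self-attention over the tokens.
        f = torch.cat([self.encoder(rendered), self.encoder(observed)], dim=1)
        tokens = self.to_tokens(f).flatten(2).transpose(1, 2)
        ctx, _ = self.attn(tokens, tokens, tokens)
        return self.score_head(ctx.mean(dim=1)).squeeze(-1)

# Rank 252 hypothesis poses and keep the highest-scoring one.
with torch.no_grad():
    scores = ScoreNetSketch()(torch.randn(252, 6, 160, 160),
                              torch.randn(252, 6, 160, 160))
best_hypothesis = scores.argmax().item()
```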
Input Types
Input Formats
xyxy format
Input Parameters: Multiple dimensions. See below for detailed input shapes.
Other Properties Related to Input:
Output Types: Testing image annotated with the 3D bounding box and the 6-DoF (Degrees of Freedom) axes
Output Format: Two Dimensional (2D) vectors
Other Properties Related to Output:
pred_trans: 252 x 3 (Number of Hypothesis Poses x Translation)
pred_rot: 252 x 3 (Number of Hypothesis Poses x Rotation)
pred_score: 252 x 1 (Number of Hypothesis Poses x Pose Score)
Runtime Engines
Supported Hardware Microarchitecture Compatibility
[Preferred/Supported] Operating Systems
This model uses two networks to make pose estimations. Training of the refinement network is supervised by an L2 loss, with both the updates and the input observation expressed in the camera coordinate frame. The pose ranking network is trained with a pose-conditioned triplet loss, where the positive samples are hypothesis poses that are close enough to the ground truth.
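As a hedged illustration of such a ranking objective, the sketch below applies a hinge over positive/negative hypothesis pairs; the distance threshold and margin are illustrative assumptions, not the values used in training.

```python
# Hedged sketch of a pose-conditioned triplet (hinge) ranking loss over the
# hypothesis scores; threshold and margin are illustrative assumptions.
import torch

def triplet_ranking_loss(scores, pose_errors, pos_thresh=0.01, margin=1.0):
    """scores:      (N,) predicted score for each hypothesis pose
       pose_errors: (N,) distance of each hypothesis pose to the ground truth
    """
    pos = pose_errors < pos_thresh        # positives: close enough to ground truth
    neg = ~pos
    if not pos.any() or not neg.any():
        return (scores * 0).sum()         # degenerate batch: no pairs to rank
    # Hinge on every (positive, negative) pair: a positive hypothesis should
    # out-score every negative hypothesis by at least the margin.
    gap = scores[neg].unsqueeze(0) - scores[pos].unsqueeze(1) + margin
    return torch.clamp(gap, min=0).mean()

# Example with 252 hypotheses: random scores and random pose errors.
scores = torch.randn(252, requires_grad=True)
loss = triplet_ranking_loss(scores, torch.rand(252) * 0.05)
loss.backward()
```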
Link
Data Collection Method by Dataset
Labeling Method by Dataset
Properties
The FoundationPose model was trained on two recent large-scale 3D databases, Objaverse and GSO (Google Scanned Objects). For Objaverse, we chose the objects from the Objaverse-LVIS subset, which consists of more than 40K objects belonging to 1,156 LVIS categories. This list contains the most relevant daily-life objects with reasonable quality and diversity of shapes and appearances.
The synthetic data generation is implemented in NVIDIA Isaac Sim, leveraging path tracing for high-fidelity photo-realistic rendering. We perform gravity and physics simulation to produce physically plausible scenes. In each scene, we randomly sample objects with the original texture. In addition, the object size, material, camera pose, and lighting are also randomized.
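The snippet below is a schematic, framework-agnostic illustration of this kind of per-scene randomization; the parameter names and ranges are assumptions for illustration only, not the actual Isaac Sim configuration.

```python
# Schematic illustration of per-scene domain randomization; parameter names
# and ranges are illustrative assumptions, not the actual Isaac Sim setup.
import random

def sample_scene_config(object_pool, max_objects=10):
    """Draw one randomized scene configuration."""
    return {
        "objects": random.sample(object_pool, k=random.randint(1, max_objects)),
        "object_scale": random.uniform(0.5, 2.0),               # randomized object size
        "material_roughness": random.uniform(0.0, 1.0),         # randomized material
        "light_intensity_lux": random.uniform(100.0, 5000.0),   # randomized lighting
        "camera_distance_m": random.uniform(0.3, 1.5),          # randomized camera pose
        "camera_elevation_deg": random.uniform(10.0, 80.0),
    }

print(sample_scene_config([f"obj_{i:05d}" for i in range(100)]))
```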
Link
Data Collection Method by Dataset
Labeling Method by Dataset
Dataset Licenses
Accuracy was determined using the following metrics:
In all evaluations, the same trained model and configuration are used for inference without any fine-tuning. The following table presents results among RGBD methods on three core datasets from BOP. These cover a variety of challenging scenarios (dense clutter, multi-instance, static or dynamic scenes, table-top or robotic manipulation) and objects with diverse properties (textureless, shiny, symmetric, varying sizes).
Method | Unseen objects | Occluded-LINEMOD | T-LESS | YCB-Video | Mean |
---|---|---|---|---|---|
FoundationPose | ✓ | 78.8 | 83.0 | 88.0 | 83.3 |
Engine
Test Hardware
The inference performance of the provided FoundationPose model is evaluated at FP16 precision. The model's input shape is 6 x 160 x 160. The performance assessment was conducted using trtexec on a range of devices. In the tables, "BS" stands for "batch size" and latency is reported in milliseconds.
The performance data presented pertains solely to model inference. End-to-end performance, when integrated with streaming video data, pre-processing, and post-processing, might differ due to potential bottlenecks in hardware and software.
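For reference, benchmark numbers of this kind can be collected with a trtexec invocation along the following lines (a sketch only: the ONNX file name and the input tensor name are assumptions and must match the exported model):

```
trtexec --onnx=refine_net.onnx --fp16 --shapes=input:252x6x160x160
```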
Models (FP16) | Devices | Latency in ms (BS=1) | Images per Second (BS=1) | Latency in ms (BS=252) | Images per Second (BS=252) |
---|---|---|---|---|---|
FoundationPose - Refine Network | Orin Nano 8GB | 4.12 | 242.59 | 1118.49 | 228.88 |
FoundationPose - Refine Network | Orin NX 16GB | 2.85 | 350.39 | 768.46 | 333.13 |
FoundationPose - Refine Network | Orin AGX 64GB | 1.10 | 908.91 | 300.26 | 852.59 |
FoundationPose - Refine Network | Tesla T4 | 5.62 | 182.74 | 1236.00 | 205.66 |
FoundationPose - Refine Network | A30 | 2.55 | 392.99 | 529.99 | 475.48 |
FoundationPose - Refine Network | A2 | 9.15 | 109.28 | 1985.32 | 126.93 |
FoundationPose - Refine Network | A100 | 1.57 | 638.29 | 266.14 | 949.55 |
FoundationPose - Refine Network | H100 | 1.16 | 878.56 | 123.09 | 2050.85 |
FoundationPose - Refine Network | L4 | 2.59 | 389.66 | 558.35 | 457.71 |
FoundationPose - Refine Network | L40 | 1.05 | 978.63 | 222.20 | 1145.97 |
Models (FP16) | Devices | Latency in ms (BS=252) | Images per Second (BS=252) |
---|---|---|---|
FoundationPose - Score Network | Orin NX 8GB | 816.39 | 308.68 |
FoundationPose - Score Network | Orin NX 16GB | 564.27 | 446.59 |
FoundationPose - Score Network | Orin AGX 64GB | 210.02 | 1199.89 |
FoundationPose - Score Network | Tesla T4 | 1122.12 | 224.57 |
FoundationPose - Score Network | A30 | 394.69 | 638.48 |
FoundationPose - Score Network | A2 | 1702.51 | 148.55 |
FoundationPose - Score Network | A100 | 195.37 | 1289.84 |
FoundationPose - Score Network | H100 | 109.66 | 2301.73 |
FoundationPose - Score Network | L4 | 470.54 | 539.91 |
FoundationPose - Score Network | L40 | 196.02 | 1313.98 |
Example: input image, input CAD model, and the corresponding output result (example images not included here).
FoundationPose may have difficulty detecting and tracking the pose of objects with reflective surfaces under varying lighting conditions.
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for TensorRT.
The primary application of these models is to estimate an object's pose from a single RGBD image or an RGBD video sequence. They can identify objects in images, given the right image pre-processing and post-processing procedures.
Furthermore, these models are designed for deployment to edge devices using TensorRT. The TAO Triton apps offer capabilities to construct efficient image analytics pipelines. These pipelines can capture, decode, and process data before executing inference.
To create the entire end-to-end inference application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.
To deploy this model with Triton Inference Server and end-to-end inference from images, please refer to the TAO Triton apps.
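As a hedged sketch of the client side (the model name, tensor names, and data type below are assumptions that must match the deployed model configuration), an inference request against Triton could look like this:

```python
# Hedged sketch of client-side inference with the tritonclient Python package.
# Model name, tensor names, and dtype are assumptions; align them with the
# deployed model's configuration.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Pre-processed input crops for 252 hypothesis poses (6 x 160 x 160 each).
batch = np.random.rand(252, 6, 160, 160).astype(np.float16)
infer_input = grpcclient.InferInput("input", list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name="foundationpose_refine",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("pred_trans"),
             grpcclient.InferRequestedOutput("pred_rot")],
)
pred_trans = result.as_numpy("pred_trans")   # 252 x 3 translation updates
pred_rot = result.as_numpy("pred_rot")       # 252 x 3 rotation updates
```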
The NVIDIA FoundationPose model estimates the object pose; no additional information, such as people or other distractors in the background, is inferred. The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.