FoundationPose is a unified foundation model for 6-DoF (Degrees of Freedom) object pose estimation and tracking. This approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given. This model is ready for commercial use.
The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.
Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield. "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Architecture Type: Transformer-based Network Architecture
Network Architecture
More Details
This model features a novel design of transformer-based network architectures and a contrastive learning formulation, which leads to strong generalization ability. The FoundationPose model contains two separate networks: the refinement network (refine net) and the score network (score net).
The refinement network extracts feature maps from the two RGBD input branches with a single shared CNN encoder. The feature maps are concatenated, fed into CNN blocks with residual connection, and tokenized by dividing into patches with position embedding. Finally, the network predicts the translation update and rotation update, each individually processed by a transformer encoder and linearly projected to the output dimension.
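The following is a minimal PyTorch sketch of that layout. Channel counts, patch size, token dimension, and the 3-parameter rotation head are illustrative assumptions, not the released network's hyperparameters.

```python
# Schematic PyTorch sketch of the refinement-network layout described above.
# All layer sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class RefineNetSketch(nn.Module):
    def __init__(self, in_ch=6, feat=64, dim=256, patch=4):
        super().__init__()
        # Single shared CNN encoder applied to both RGBD branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        # CNN block with a residual connection over the concatenated features.
        self.res_block = nn.Sequential(
            nn.Conv2d(2 * feat, 2 * feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(2 * feat, 2 * feat, 3, padding=1),
        )
        # Patch tokenization with a learned position embedding.
        self.to_tokens = nn.Conv2d(2 * feat, dim, kernel_size=patch, stride=patch)
        self.pos_emb = nn.Parameter(torch.zeros(1, 100, dim))
        # Separate transformer encoders and linear heads for the two updates.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.trans_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.rot_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.trans_head = nn.Linear(dim, 3)   # translation update
        self.rot_head = nn.Linear(dim, 3)     # rotation update

    def forward(self, rendered, observed):
        f = torch.cat([self.encoder(rendered), self.encoder(observed)], dim=1)
        f = f + self.res_block(f)                              # residual connection
        tokens = self.to_tokens(f).flatten(2).transpose(1, 2)  # B x N x dim
        tokens = tokens + self.pos_emb[:, : tokens.shape[1]]
        delta_trans = self.trans_head(self.trans_encoder(tokens).mean(dim=1))
        delta_rot = self.rot_head(self.rot_encoder(tokens).mean(dim=1))
        return delta_trans, delta_rot
```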
The score network is a pose ranking encoder that uses the same backbone architecture for feature extraction as the refinement network. The extracted features are concatenated, tokenized, and forwarded to a multi-head self-attention module so as to leverage the global image context for comparison, and the highest-scoring hypothesis pose is selected.
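A minimal sketch of that ranking step is shown below; the layer sizes are illustrative assumptions, not the released model's hyperparameters, and the 252-hypothesis batch matches the output shapes listed later in this card.

```python
# Schematic PyTorch sketch of the pose-ranking (score) network described above
# and of selecting the best of the 252 hypothesis poses.
import torch
import torch.nn as nn

class ScoreNetSketch(nn.Module):
    def __init__(self, in_ch=6, feat=64, dim=256):
        super().__init__()
        # Same backbone idea as the refinement network: a shared CNN encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_tokens = nn.Conv2d(2 * feat, dim, kernel_size=4, stride=4)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(dim, 1)   # one scalar score per hypothesis

    def forward(self, rendered, observed):
        # Concatenate features from the rendered and observed branches,
        # tokenize, and apply global multi-head self-attention over the tokens.
        f = torch.cat([self.encoder(rendered), self.encoder(observed)], dim=1)
        tokens = self.to_tokens(f).flatten(2).transpose(1, 2)
        ctx, _ = self.attn(tokens, tokens, tokens)
        return self.score_head(ctx.mean(dim=1)).squeeze(-1)

# Rank 252 hypothesis poses and keep the highest-scoring one.
with torch.no_grad():
    scores = ScoreNetSketch()(torch.randn(252, 6, 160, 160),
                              torch.randn(252, 6, 160, 160))
best_hypothesis = scores.argmax().item()
```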
Input Types
Input Formats
xyxy format
Input Parameters: Multiple dimensions. See below for detailed input shapes.
Other Properties Related to Input:
Output Types: Testing image annotated with the 3D bounding box and the 6-DoF (Degrees of Freedom) axes
Output Format: Two Dimensional (2D) vectors
Other Properties Related to Output:
pred_trans: 252 x 3 (Number of Hypothesis Poses x Translation)
pred_rot: 252 x 3 (Number of Hypothesis Poses x Rotation)
pred_score: 252 x 1 (Number of Hypothesis Poses x Pose Score)
Runtime Engines
Supported Hardware Microarchitecture Compatibility
[Preferred/Supported] Operating Systems
This model uses two networks to make pose estimations. Training of the refinement network is supervised by an L2 loss, with both the updates and the input observation expressed in the camera coordinate frame. The pose ranking network is trained with a pose-conditioned triplet loss, where the positive samples are hypothesis poses that are close enough to the ground truth.
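As a hedged illustration of such a ranking objective, the sketch below applies a hinge over positive/negative hypothesis pairs; the distance threshold and margin are illustrative assumptions, not the values used in training.

```python
# Hedged sketch of a pose-conditioned triplet (hinge) ranking loss over the
# hypothesis scores; threshold and margin are illustrative assumptions.
import torch

def triplet_ranking_loss(scores, pose_errors, pos_thresh=0.01, margin=1.0):
    """scores:      (N,) predicted score for each hypothesis pose
       pose_errors: (N,) distance of each hypothesis pose to the ground truth
    """
    pos = pose_errors < pos_thresh        # positives: close enough to ground truth
    neg = ~pos
    if not pos.any() or not neg.any():
        return (scores * 0).sum()         # degenerate batch: no pairs to rank
    # Hinge on every (positive, negative) pair: a positive hypothesis should
    # out-score every negative hypothesis by at least the margin.
    gap = scores[neg].unsqueeze(0) - scores[pos].unsqueeze(1) + margin
    return torch.clamp(gap, min=0).mean()

# Example with 252 hypotheses: random scores and random pose errors.
scores = torch.randn(252, requires_grad=True)
loss = triplet_ranking_loss(scores, torch.rand(252) * 0.05)
loss.backward()
```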
Link
Data Collection Method by Dataset
Labeling Method by Dataset
Properties
The FoundationPose model was trained on two recent large-scale 3D databases, Objaverse and GSO (Google Scanned Objects). For Objaverse, we chose the objects from the Objaverse-LVIS subset, which consists of more than 40K objects belonging to 1,156 LVIS categories. This list contains the most relevant daily-life objects with reasonable quality and diversity of shapes and appearances.
The synthetic data generation is implemented in NVIDIA Isaac Sim, leveraging path tracing for high-fidelity photo-realistic rendering. We perform gravity and physics simulation to produce physically plausible scenes. In each scene, we randomly sample objects with the original texture. In addition, the object size, material, camera pose, and lighting are also randomized.
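The snippet below is a schematic, framework-agnostic illustration of this kind of per-scene randomization; the parameter names and ranges are assumptions for illustration only, not the actual Isaac Sim configuration.

```python
# Schematic illustration of per-scene domain randomization; parameter names
# and ranges are illustrative assumptions, not the actual Isaac Sim setup.
import random

def sample_scene_config(object_pool, max_objects=10):
    """Draw one randomized scene configuration."""
    return {
        "objects": random.sample(object_pool, k=random.randint(1, max_objects)),
        "object_scale": random.uniform(0.5, 2.0),               # randomized object size
        "material_roughness": random.uniform(0.0, 1.0),         # randomized material
        "light_intensity_lux": random.uniform(100.0, 5000.0),   # randomized lighting
        "camera_distance_m": random.uniform(0.3, 1.5),          # randomized camera pose
        "camera_elevation_deg": random.uniform(10.0, 80.0),
    }

print(sample_scene_config([f"obj_{i:05d}" for i in range(100)]))
```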
Link
Data Collection Method by Dataset
Labeling Method by Dataset
Dataset Licenses
Accuracy was determined using the following metrics:
In all evaluations, the same trained model and configuration are used for inference without any fine-tuning. The following table presents results among RGBD methods on three core datasets from BOP. These cover a variety of challenging scenarios (dense clutter, multi-instance, static or dynamic scenes, table-top or robotic manipulation) and objects with diverse properties (textureless, shiny, symmetric, varying sizes).
Method | Unseen objects | Occluded-LINEMOD | T-LESS | YCB-Video | Mean |
---|---|---|---|---|---|
FoundationPose | ✓ | 78.8 | 83.0 | 88.0 | 83.3 |
Engine
Test Hardware
The inference performance of the provided FoundationPose model is evaluated at FP16 precision. The model's input shape is 6 x 160 x 160. The performance assessment was conducted using trtexec on a range of devices. In the tables, "BS" stands for "batch size" and latency is reported in milliseconds.
The performance data presented pertains solely to model inference. End-to-end performance, when integrated with streaming video data, pre-processing, and post-processing, might differ due to potential bottlenecks in hardware and software.
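For reference, benchmark numbers of this kind can be collected with a trtexec invocation along the following lines (a sketch only: the ONNX file name and the input tensor name are assumptions and must match the exported model):

```
trtexec --onnx=refine_net.onnx --fp16 --shapes=input:252x6x160x160
```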
Models (FP16) | Devices | Latency in ms (BS=1) | Images per Second (BS=1) | Latency in ms (BS=252) | Images per Second (BS=252) |
---|---|---|---|---|---|
FoundationPose - Refine Network | Orin Nano 8GB | 4.12 | 242.59 | 1118.49 | 228.88 |
FoundationPose - Refine Network | Orin NX 16GB | 2.85 | 350.39 | 768.46 | 333.13 |
FoundationPose - Refine Network | Orin AGX 64GB | 1.10 | 908.91 | 300.26 | 852.59 |
FoundationPose - Refine Network | Tesla T4 | 5.62 | 182.74 | 1236.00 | 205.66 |
FoundationPose - Refine Network | A30 | 2.55 | 392.99 | 529.99 | 475.48 |
FoundationPose - Refine Network | A2 | 9.15 | 109.28 | 1985.32 | 126.93 |
FoundationPose - Refine Network | A100 | 1.57 | 638.29 | 266.14 | 949.55 |
FoundationPose - Refine Network | H100 | 1.16 | 878.56 | 123.09 | 2050.85 |
FoundationPose - Refine Network | L4 | 2.59 | 389.66 | 558.35 | 457.71 |
FoundationPose - Refine Network | L40 | 1.05 | 978.63 | 222.20 | 1145.97 |
Models (FP16) | Devices | Latency in ms (BS=252) | Images per Second (BS=252) |
---|---|---|---|
FoundationPose - Score Network | Orin NX 8GB | 816.39 | 308.68 |
FoundationPose - Score Network | Orin NX 16GB | 564.27 | 446.59 |
FoundationPose - Score Network | Orin AGX 64GB | 210.02 | 1199.89 |
FoundationPose - Score Network | Tesla T4 | 1122.12 | 224.57 |
FoundationPose - Score Network | A30 | 394.69 | 638.48 |
FoundationPose - Score Network | A2 | 1702.51 | 148.55 |
FoundationPose - Score Network | A100 | 195.37 | 1289.84 |
FoundationPose - Score Network | H100 | 109.66 | 2301.73 |
FoundationPose - Score Network | L4 | 470.54 | 539.91 |
FoundationPose - Score Network | L40 | 196.02 | 1313.98 |
Example: input image, input CAD model, and the corresponding output result (example images not included here).
FoundationPose may have difficulty detecting and tracking the pose of objects with reflective surfaces under varying lighting conditions.
These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for TensorRT.
The primary application of these models is to estimate an object's pose from a single RGBD image or an RGBD video sequence. They can identify objects in images, given the right image pre-processing and post-processing procedures.
Furthermore, these models are designed for deployment to edge devices using TensorRT. The TAO Triton apps offer capabilities to construct efficient image analytics pipelines. These pipelines can capture, decode, and process data before executing inference.
To create the entire end-to-end inference application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.
To deploy this model with Triton Inference Server and end-to-end inference from images, please refer to the TAO Triton apps.
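As a hedged sketch of the client side (the model name, tensor names, and data type below are assumptions that must match the deployed model configuration), an inference request against Triton could look like this:

```python
# Hedged sketch of client-side inference with the tritonclient Python package.
# Model name, tensor names, and dtype are assumptions; align them with the
# deployed model's configuration.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Pre-processed input crops for 252 hypothesis poses (6 x 160 x 160 each).
batch = np.random.rand(252, 6, 160, 160).astype(np.float16)
infer_input = grpcclient.InferInput("input", list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name="foundationpose_refine",
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput("pred_trans"),
             grpcclient.InferRequestedOutput("pred_rot")],
)
pred_trans = result.as_numpy("pred_trans")   # 252 x 3 translation updates
pred_rot = result.as_numpy("pred_rot")       # 252 x 3 rotation updates
```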
The NVIDIA FoundationPose model estimates the object pose; no additional information, such as people or other distractors in the background, is inferred. The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.