FoundationPose
Description: 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box
Publisher: NVIDIA
Latest Version: deployable_v1.0
Modified: November 12, 2024
Size: 125.95 MB

FoundationPose Model Card

Model Overview

Description

FoundationPose is a unified foundation model for 6-DoF (Degrees of Freedom) object pose estimation and tracking. This approach can be instantly applied at test-time to a novel object without fine-tuning, as long as its CAD model is given. This model is ready for commercial use.

License

The license to use this model is covered by the Model EULA. By downloading the model, you accept the terms and conditions of this license.

References

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. "FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Model Architecture

Architecture Type: Transformer-based Network Architecture

Network Architecture

  • Network Encoder: CNN (Convolutional Neural Network) blocks with residual connections
  • Network Decoder: Multi-head self-attention module

More Details

  • This model uses a novel transformer-based network architecture and a contrastive learning formulation, which lead to strong generalization ability. FoundationPose consists of two separate networks: a refinement network (refine net) and a pose ranking network (score net).

  • The refinement network extracts feature maps from the two RGBD input branches with a single shared CNN encoder. The feature maps are concatenated, fed into CNN blocks with residual connections, and tokenized by dividing them into patches with position embedding. Finally, the network predicts the translation update and the rotation update, each processed by its own transformer encoder and linearly projected to the output dimension.

  • The score network acts as a pose ranking encoder, using the same backbone architecture for feature extraction as the refinement network. The extracted features are concatenated, tokenized, and forwarded to the multi-head self-attention module so that the global image context can be leveraged for comparison.
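A minimal, illustrative PyTorch sketch of the refinement branch described above is shown below. The layer sizes, patch size, the 6-channel layout of the two input branches (rendered hypothesis vs. observed crop), and the 3-DoF rotation parameterization are assumptions made here for illustration; they are not the released network's exact architecture.

```python
# Illustrative sketch only -- sizes, channel layout, and rotation
# parameterization are assumptions, not the released FoundationPose model.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))


class RefineNetSketch(nn.Module):
    def __init__(self, ch=64, patch=16, dim=256):
        super().__init__()
        # Shared CNN encoder applied to both 6-channel RGBD branches.
        self.encoder = nn.Sequential(nn.Conv2d(6, ch, 3, padding=1), nn.ReLU())
        # CNN blocks with residual connections on the concatenated features.
        self.mix = nn.Sequential(ResidualBlock(2 * ch), ResidualBlock(2 * ch))
        # Patch tokenization with a learned position embedding (160x160 input assumed).
        self.tokenize = nn.Conv2d(2 * ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (160 // patch) ** 2, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.trans_head = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.rot_head = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_trans = nn.Linear(dim, 3)  # translation update
        self.to_rot = nn.Linear(dim, 3)    # rotation update (axis-angle assumed)

    def forward(self, rendered_rgbd, observed_rgbd):
        f = torch.cat([self.encoder(rendered_rgbd), self.encoder(observed_rgbd)], dim=1)
        tokens = self.tokenize(self.mix(f)).flatten(2).transpose(1, 2) + self.pos
        delta_t = self.to_trans(self.trans_head(tokens).mean(dim=1))
        delta_r = self.to_rot(self.rot_head(tokens).mean(dim=1))
        return delta_t, delta_r
```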

Input

Input Types

  • RGB image
  • Depth image
  • 2D bounding box
  • CAD model
  • Intrinsic matrix

Input Formats

  • RGB image: Red, Green, Blue (RGB). Any input resolution is supported, and images do not need additional pre-processing (e.g., alpha channels or specific bit depths)
  • Depth image: Depth values. Any resolution is supported, and images do not need additional pre-processing (e.g., alpha channels or specific bit depths)
  • CAD model: The CAD model is in OBJ format, with the texture PNG image in the same folder
  • Intrinsic matrix: The input requires the correct camera calibration information, including the principal point and focal length, provided as a txt file
  • 2D bounding box: The coordinates of the target object in the first frame, given in xyxy format

Input Parameters: Multiple dimensions; see below for detailed input shapes.

Other Properties Related to Input:

  • RGB image: B X 3 X H X W (Batch Size x Channel x Height x Width)
  • Depth image: B X H X W (Batch Size x Height x Width)
  • 2D bounding box: B X 1 X 4 (Batch size X 1 X Bounding box coordinate)
  • CAD model: OBJ file
  • Intrinsic matrix: txt file
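As a rough illustration of assembling these inputs, the hedged Python sketch below loads the intrinsic matrix from a txt file, reads an RGB and depth image, and packs the first-frame bounding box into the shapes listed above. The file names, the txt layout (a 3x3 matrix), the depth scale factor, and the example bounding box coordinates are assumptions, not a specification.

```python
# Hedged example of assembling FoundationPose inputs; file names, the intrinsics
# txt layout, and the depth scale are assumptions, not a specification.
import numpy as np
import cv2

# Intrinsic matrix: assumed to be a whitespace-separated 3x3 matrix containing
# the focal lengths (fx, fy) and principal point (cx, cy).
K = np.loadtxt("cam_K.txt").reshape(3, 3)

# RGB image -> B x 3 x H x W, depth image -> B x H x W.
rgb = cv2.cvtColor(cv2.imread("frame_0000.png"), cv2.COLOR_BGR2RGB)
depth = cv2.imread("depth_0000.png", cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0  # mm -> m (assumed)

rgb_batch = np.transpose(rgb, (2, 0, 1))[None]   # 1 x 3 x H x W
depth_batch = depth[None]                        # 1 x H x W

# First-frame 2D bounding box of the target object in xyxy format -> B x 1 x 4.
bbox = np.array([[[100.0, 120.0, 260.0, 300.0]]], dtype=np.float32)  # example coordinates
```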

Output

Output Types: Testing image annotated with the 3D bounding box and the 6-DoF (Degrees of Freedom) axes

Output Format: Two-dimensional (2D) vectors

Other Properties Related to Output:

  • pred_trans: 252 x 3 (Number of Hypothesis Poses x Translation)
  • pred_rot: 252 x 3 (Number of Hypothesis Poses x Rotation)
  • pred_score: 252 x 1 (Number of Hypothesis Poses x Pose Score)
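To illustrate how these outputs might be consumed, the sketch below picks the hypothesis with the highest predicted score and assembles a 4x4 pose matrix from the corresponding rotation and translation. The axis-angle (Rodrigues) interpretation of the 3-vector rotation is an assumption made here for illustration, not a documented property of the released model.

```python
# Hedged post-processing sketch: select the best of the 252 hypothesis poses.
# The axis-angle interpretation of pred_rot is an assumption for illustration.
import numpy as np
from scipy.spatial.transform import Rotation


def select_best_pose(pred_trans, pred_rot, pred_score):
    """pred_trans: (252, 3), pred_rot: (252, 3), pred_score: (252, 1)."""
    best = int(np.argmax(pred_score[:, 0]))  # highest-scoring hypothesis
    pose = np.eye(4)
    pose[:3, :3] = Rotation.from_rotvec(pred_rot[best]).as_matrix()
    pose[:3, 3] = pred_trans[best]
    return pose, best
```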

Software Integration

Runtime Engines

  • TAO Triton Apps

Supported Hardware Microarchitecture Compatibility

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems

  • Linux

Model Versions

  • deployable_v1.0: decrypted ONNX files, which can be used for inference with the TAO Triton apps.
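Since the deployable package ships ONNX files, one quick way to confirm the exact tensor names and shapes before wiring up an inference pipeline is to open them with onnxruntime. The file name below is a placeholder; use the files included in the download.

```python
# Inspect a deployable ONNX file to confirm input/output tensor names and shapes.
# "refine_net.onnx" is a placeholder name, not the actual file name in the package.
import onnxruntime as ort

session = ort.InferenceSession("refine_net.onnx", providers=["CPUExecutionProvider"])
for tensor in session.get_inputs():
    print("input :", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)
```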

Training, Testing, and Evaluation Datasets

This model uses two networks to make pose estimations. The training of the refinement network is supervised by an L2 loss, with both the pose updates and the input observation expressed in the camera coordinate frame. The pose ranking network is trained with a pose-conditioned triplet loss, in which hypotheses that are close enough to the ground truth are treated as positive samples.
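As a hedged sketch (not necessarily the paper's exact notation), a pose-conditioned triplet loss of this kind can be written as a hinge on the predicted scores:

\mathcal{L}_{\mathrm{triplet}} = \max\bigl(0,\; S(p^{-}) - S(p^{+}) + \alpha\bigr)

where S(·) is the score predicted by the score network, p⁺ is a hypothesis whose error with respect to the ground-truth pose is below a threshold (positive sample), p⁻ is a hypothesis above the threshold (negative sample), and α is the margin.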

Training Dataset

Link

  • Synthetic Dataset: Generated by NVIDIA Omniverse.

Data Collection Method by Dataset

  • Purely Synthetic Data

Labeling Method by Dataset

  • Purely Synthetic Data

Properties

  • The FoundationPose model was trained on two recent large-scale 3D databases, Objaverse and GSO (Google Scanned Objects). For Objaverse, we chose the objects from the Objaverse-LVIS subset, which consists of more than 40K objects belonging to 1,156 LVIS categories. This subset contains the most relevant daily-life objects, with reasonable quality and diversity of shapes and appearances.

  • The synthetic data generation is implemented in NVIDIA Isaac Sim, leveraging path tracing for high-fidelity photo-realistic rendering. We perform gravity and physics simulation to produce physically plausible scenes. In each scene, we randomly sample objects with the original texture. In addition, the object size, material, camera pose, and lighting are also randomized.

Evaluation Dataset

Link

  • LINEMOD: Provides additional ground-truth annotations for all modeled objects in one of the test sets from LM, introducing challenging test cases with various levels of occlusion.
  • YCB-V (YCB-Video): 21 YCB objects captured in 92 videos.

Data Collection Method by Dataset

  • Human

Labeling Method by Dataset

  • Human

Dataset Licenses

  • LINEMOD: CC BY 4.0
  • YCB-V (YCB-Video): MIT

Evaluation Results

Accuracy was determined using the following metrics:

  • Area under the curve (AUC) of ADD and ADD-S.
  • Recall of ADD that is less than 0.1 of the object diameter (ADD-0.1d).
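For reference, the following is a minimal sketch of how ADD and ADD-S are commonly computed from a set of CAD model points and the predicted and ground-truth poses (the standard LINEMOD/BOP convention); thresholding ADD at 10% of the object diameter gives the ADD-0.1d recall used above.

```python
# Minimal sketch of the standard ADD / ADD-S metrics for a single prediction.
import numpy as np
from scipy.spatial import cKDTree


def transform(points, R, t):
    return points @ R.T + t


def add_metric(points, R_pred, t_pred, R_gt, t_gt):
    """Mean distance between corresponding transformed model points."""
    return np.linalg.norm(transform(points, R_pred, t_pred) -
                          transform(points, R_gt, t_gt), axis=1).mean()


def adds_metric(points, R_pred, t_pred, R_gt, t_gt):
    """Symmetric variant: mean distance to the closest transformed point."""
    pred = transform(points, R_pred, t_pred)
    gt = transform(points, R_gt, t_gt)
    dists, _ = cKDTree(pred).query(gt, k=1)
    return dists.mean()


def add_01d_hit(points, diameter, R_pred, t_pred, R_gt, t_gt):
    """True if ADD is below 10% of the object diameter (ADD-0.1d)."""
    return add_metric(points, R_pred, t_pred, R_gt, t_gt) < 0.1 * diameter
```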

All evaluations use the same trained model and configuration for inference, without any fine-tuning. The following table presents the results among RGBD methods on three core datasets from the BOP benchmark. These involve various challenging scenarios (dense clutter, multi-instance, static or dynamic scenes, table-top or robotic manipulation) and objects with diverse properties (textureless, shiny, symmetric, varying sizes).

| Method | Unseen objects | Occluded-LINEMOD | TLESS | YCB-Video | Mean |
|---|---|---|---|---|---|
| FoundationPose | ✓ | 78.8 | 83.0 | 88.0 | 83.3 |

Inference

Engine

  • TensorRT

Test Hardware

  • A2
  • A30
  • DGX A100
  • DGX H100
  • JAO 64GB
  • Jetson AGX Xavier
  • L4
  • L40
  • NVIDIA T4
  • Orin
  • Orin Nano 8GB
  • Orin NX
  • Orin NX 16GB
  • Xavier NX

The inference performance of the provided FoundationPose model is evaluated at FP16 precision. The model's input is 6 x 160 x 160 (channels x height x width). The performance assessment was conducted using trtexec on a range of devices. In the tables below, "BS" stands for "batch size."

The performance data presented pertains solely to model inference. The end-to-end performance, when integrated with streaming video data, pre-processing, and post-processing, may differ due to potential bottlenecks in hardware and software.

| Model (FP16) | Device | Latency (ms, BS=1) | Images per Second (BS=1) | Latency (ms, BS=252) | Images per Second (BS=252) |
|---|---|---|---|---|---|
| FoundationPose - Refine Network | Orin Nano 8GB | 4.12 | 242.59 | 1118.49 | 228.88 |
| FoundationPose - Refine Network | Orin NX 16GB | 2.85 | 350.39 | 768.46 | 333.13 |
| FoundationPose - Refine Network | Orin AGX 64GB | 1.10 | 908.91 | 300.26 | 852.59 |
| FoundationPose - Refine Network | Tesla T4 | 5.62 | 182.74 | 1236.00 | 205.66 |
| FoundationPose - Refine Network | A30 | 2.55 | 392.99 | 529.99 | 475.48 |
| FoundationPose - Refine Network | A2 | 9.15 | 109.28 | 1985.32 | 126.93 |
| FoundationPose - Refine Network | A100 | 1.57 | 638.29 | 266.14 | 949.55 |
| FoundationPose - Refine Network | H100 | 1.16 | 878.56 | 123.09 | 2050.85 |
| FoundationPose - Refine Network | L4 | 2.59 | 389.66 | 558.35 | 457.71 |
| FoundationPose - Refine Network | L40 | 1.05 | 978.63 | 222.20 | 1145.97 |

| Model (FP16) | Device | Latency (ms, BS=252) | Images per Second (BS=252) |
|---|---|---|---|
| FoundationPose - Score Network | Orin NX 8GB | 816.39 | 308.68 |
| FoundationPose - Score Network | Orin NX 16GB | 564.27 | 446.59 |
| FoundationPose - Score Network | Orin AGX 64GB | 210.02 | 1199.89 |
| FoundationPose - Score Network | Tesla T4 | 1122.12 | 224.57 |
| FoundationPose - Score Network | A30 | 394.69 | 638.48 |
| FoundationPose - Score Network | A2 | 1702.51 | 148.55 |
| FoundationPose - Score Network | A100 | 195.37 | 1289.84 |
| FoundationPose - Score Network | H100 | 109.66 | 2301.73 |
| FoundationPose - Score Network | L4 | 470.54 | 539.91 |
| FoundationPose - Score Network | L40 | 196.02 | 1313.98 |

Output Image

Example visualization (images omitted): input image, input CAD model, and output result.

Limitations

Reflective Surface Objects

FoundationPose might have difficulty detecting and tracking the pose of objects with reflective surfaces under different lighting conditions.

Inference Method

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for TensorRT.

The primary application of these models is to estimate an object's pose from a single RGBD image or a sequence of RGBD video frames. They can identify object poses in images, given the right pre-processing and post-processing procedures.

Furthermore, these models are designed for deployment to edge devices using TensorRT. The TAO Triton apps offer capabilities to construct efficient image analytics pipelines. These pipelines can capture, decode, and process data before executing inference.

Instructions to Deploy the Model with Triton Inference Server

To create the entire end-to-end inference application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.

To deploy this model with Triton Inference Server and end-to-end inference from images, please refer to the TAO Triton apps.
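As a hedged illustration of the client side of such a deployment, the sketch below uses the standard Triton Python HTTP client. The model name and the input/output tensor names are placeholders; they must match the model configuration used by the actual TAO Triton apps deployment.

```python
# Hedged Triton client sketch; model, input, and output names are placeholders
# that must match the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input: batch of 252 hypothesis crops, 6 x 160 x 160 each.
data = np.zeros((252, 6, 160, 160), dtype=np.float32)
inp = httpclient.InferInput("input", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="foundationpose_refine", inputs=[inp])
print(result.as_numpy("output").shape)
```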

Ethical Considerations

The NVIDIA FoundationPose model estimates the object pose. However, no additional information, such as people or other distractors in the background, is inferred. The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.