CenterPose is a single-stage, keypoint-based method for category-level object pose estimation. It processes unknown object instances within a recognized category using a single RGB image. The pretrained model detects the projections of 3D keypoints, estimates a 6-DoF pose, and regresses the relative 3D bounding cuboid dimensions.
This model is supported two different types of backbone network as the feature extractor, including DLA34 and FAN-Small-Hybrid. The DLA34 is a standard Convolutional Neural Network (CNN) backbone, while the FAN-Small-Hybrid is a transformer-based classification backbone.
The network architecture processes a re-scaled and padded RGB image. Using the DLA34/FAN-Small-Hybrid feature extractor combined with an upsampling module, the network outputs three distinct heads that predict the 2D bounding box, projections of 3D bounding box keypoints, and cuboid dimensions.
This model uses a single-stage network to make all predictions and is trained using the CenterPose entry point since TAO 5.2 in November 2023. The training algorithm optimizes the network to minimize both the focal loss and the l1 loss for all keypoints and cuboid dimensions.
The CenterPose model was trained on the Objectron dataset, a newly introduced benchmark for monocular RGB category-level 6-DoF object pose estimation. This dataset comprises 15k annotated video clips, totaling over 4M annotated frames. Every category is marked with a 3D bounding cuboid that indicates the object's position, orientation relative to the camera, and the cuboid's dimensions.
For training and evaluation purposes, we extracted frames by temporally downsampling the original videos to 15 fps.
For symmetric objects, such as bottles, we produced multiple ground truth labels during the training phase by rotating them N times around their symmetry axis.
Category | # of Training Videos | # of Training Images | # of Testing Videos | # of Testing Images |
---|---|---|---|---|
Cereal Box | 1,288 | 22,024 | 321 | 5,428 |
Bottle | 1,542 | 26,090 | 385 | 6,442 |
The performance of the CenterPose model during inference was evaluated using the test samples from each category in the official dataset release. These frames, originally high-resolution images of 600x800 pixels, were resized to 512x512 pixels before being processed by the CenterPose model.
Accuracy was determined using a 3D intersection-over-union (IoU) criterion with a threshold greater than 0.5. The 2D MPE (mean pixel projection error) metric calculates the average normalized distance between the projections of 3D bounding box keypoints from both the estimated and ground truth poses. For viewpoint estimation, we present the average precision (AP) for azimuth and elevation with thresholds of 15° and 10° degrees, respectively.
For symmetric object categories, like bottles, we rotated the estimated bounding box around the symmetry axis N times (where N = 100) and assessed the prediction in relation to each rotated instance. The results reflect the instance that either maximizes the 3D IoU or minimizes the 2D pixel projection error.
Evaluation data key performance indicators (KPIs) are provided in the table below. The evaluation of the pretrained models was based on FP32 precision.
Category | Backbone Architecture | 3D IoU ↑ | 2D MPE ↓ | AP @ 15° Azimuth Error ↑ | AP @ 10° Elevation Error ↑ |
---|---|---|---|---|---|
Cereal Box | DLA34 | 0.8131 | 0.039 | 0.9273 | 0.9350 |
Cereal Box | FAN-Small-Hybrid | 0.8290 | 0.036 | 0.9418 | 0.9514 |
Bottle | DLA34 | 0.7939 | 0.040 | 0.9703 | 0.8933 |
Bottle | FAN-Small-Hybrid | 0.8187 | 0.039 | 0.9820 | 0.9056 |
The inference performance of the provided CenterPose model is evaluated at both FP16 and FP32 precisions. The model's input resolution is 512x512 pixels. The performance assessment was conducted using trtexec on a range of devices: Orin Nano 8GB, Orin NX 16GB, Jetson AGX Orin 64GB, A2, A30, A100, H100, L4, L40, and Tesla T4. In the table, "BS" stands for "batch size."
The performance data presented pertains solely to model inference. The end-to-end performance, when integrated with streaming video data, may vary slightly due to potential bottlenecks in both hardware and software.
Models (FP16) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) |
---|---|---|---|
CenterPose - DLA34 | Orin Nano 8GB | 52.19 | 19.16 |
CenterPose - DLA34 | Orin NX 16GB | 36.05 | 27.74 |
CenterPose - DLA34 | AGX Orin 64GB | 17.53 | 57.04 |
CenterPose - DLA34 | A2 | 63.75 | 15.69 |
CenterPose - DLA34 | A30 | 17.40 | 57.46 |
CenterPose - DLA34 | A100 | 12.17 | 82.16 |
CenterPose - DLA34 | H100 | 9.45 | 105.84 |
CenterPose - DLA34 | L4 | 24.58 | 40.68 |
CenterPose - DLA34 | L40 | 9.37 | 106.70 |
CenterPose - DLA34 | Tesla T4 | 41.20 | 24.27 |
CenterPose - DLA34 | RTX 4060Ti | 21.70 | 46.10 |
CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 125.94 | 7.94 |
CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 88.12 | 11.35 |
CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 35.68 | 28.03 |
CenterPose - FAN-Small-Hybrid | A2 | 172.55 | 5.80 |
CenterPose - FAN-Small-Hybrid | A30 | 37.41 | 26.73 |
CenterPose - FAN-Small-Hybrid | A100 | 20.01 | 49.99 |
CenterPose - FAN-Small-Hybrid | H100 | 13.11 | 76.26 |
CenterPose - FAN-Small-Hybrid | L4 | 53.52 | 18.69 |
CenterPose - FAN-Small-Hybrid | L40 | 17.65 | 56.65 |
CenterPose - FAN-Small-Hybrid | Tesla T4 | 102.33 | 9.77 |
CenterPose - FAN-Small-Hybrid | RTX 4060Ti | 48.10 | 20.80 |
Models (FP32) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) |
---|---|---|---|
CenterPose - DLA34 | Orin Nano 8GB | 80.81 | 12.37 |
CenterPose - DLA34 | Orin NX 16GB | 55.67 | 17.96 |
CenterPose - DLA34 | AGX Orin 64GB | 25.36 | 39.44 |
CenterPose - DLA34 | A2 | 155.43 | 6.43 |
CenterPose - DLA34 | A30 | 40.83 | 24.49 |
CenterPose - DLA34 | A100 | 25.17 | 39.74 |
CenterPose - DLA34 | H100 | 16.28 | 61.42 |
CenterPose - DLA34 | L4 | 49.99 | 20.00 |
CenterPose - DLA34 | L40 | 18.64 | 53.63 |
CenterPose - DLA34 | Tesla T4 | 101.42 | 9.86 |
CenterPose - DLA34 | RTX 4060Ti | 43.40 | 23.10 |
CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 208.25 | 4.80 |
CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 144.80 | 6.91 |
CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 60.29 | 16.59 |
CenterPose - FAN-Small-Hybrid | A2 | 450.28 | 2.22 |
CenterPose - FAN-Small-Hybrid | A30 | 113.68 | 8.80 |
CenterPose - FAN-Small-Hybrid | A100 | 58.62 | 17.06 |
CenterPose - FAN-Small-Hybrid | H100 | 30.90 | 32.36 |
CenterPose - FAN-Small-Hybrid | L4 | 149.66 | 6.68 |
CenterPose - FAN-Small-Hybrid | L40 | 53.02 | 18.86 |
CenterPose - FAN-Small-Hybrid | Tesla T4 | 294.14 | 3.40 |
CenterPose - FAN-Small-Hybrid | RTX 4060Ti | 128.90 | 7.80 |
These models are designed for use with NVIDIA platforms, including Jetson and x86_64 with a dGPU. To use this model in an inference pipeline in ROS 2, please consult Isaac ROS Pose Estimation.
Cereal Box | Bottle |
---|---|
The CenterPose model was trained to identify dominant objects in the camera view. As a result, it might not detect objects who appear very small with respect to the camera view.
If objects are occluded or truncated to the extent that less than 40% of the object remains visible, the CenterPose model may not recognize them. The model can detect partially occluded objects as long as the majority of the object remains visible. Heavily occluded objects might compromise detection accuracy.
The CenterPose model was trained on RGB images taken under good lighting conditions and captured by a pinhole camera. Consequently, images shot in poor lighting or those exhibiting distortion or blur might not yield optimal detection results.
Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng and Jose M. Alvarez. "Understanding The Robustness in Vision Transformers". International Conference on Machine Learning (ICML). 2022.
Fisher Yu, Dequan Wang, Evan Shelhamer and Trevor Darrell. "Deep Layer Aggregation". Conference on Computer Vision and Pattern Recognition. 2018.
Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, Matthias Grundmann. "Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations". IEEE Conference on Computer Vision and Pattern Recognition. 2021.
License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.
NVIDIA CenterPose model estimates the object pose. However, no additional information such as people and other distractors in the background are inferred. Training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.