CenterPose

Description: 3D pose detection model for retail objects.
Publisher: -
Latest Version: trainable_fan_small
Modified: December 12, 2023
Size: 327.96 MB

CenterPose Model Card

Model Overview

CenterPose is a single-stage, keypoint-based method for category-level object pose estimation. It handles previously unseen object instances from a known category, using a single RGB image. The pretrained model detects the projections of 3D keypoints, estimates a 6-DoF pose, and regresses the relative 3D bounding cuboid dimensions.

Model Architecture

This model supports two different types of backbone networks as the feature extractor, including DLA34 and FAN-Small-Hybrid. The DLA34 is a standard Convolutional Neural Network (CNN) backbone, while the FAN-Small-Hybrid is a transformer-based classification backbone.

The network architecture processes a re-scaled and padded RGB image. Using the DLA34/FAN-Small-Hybrid feature extractor combined with an upsampling module, the network outputs distinct heads that predict the projections of the 3D bounding box keypoints and the relative cuboid dimensions. After detecting objects in image space, the estimated relative cuboid dimensions enable you to use robust, off-the-shelf PnP algorithms for pose estimation.
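As a rough illustration of this last step, the following sketch recovers a pose from the detected corner projections with OpenCV's solvePnP. The corner ordering, the EPnP flag, and the function names are assumptions made for illustration, not the TAO implementation; because the cuboid dimensions are only relative, the recovered translation is defined up to scale.

import cv2
import numpy as np

def cuboid_corners(rel_dims):
    # Eight cuboid corners in the object frame, built from the relative dimensions.
    dx, dy, dz = rel_dims
    return np.array([[sx * dx / 2, sy * dy / 2, sz * dz / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                    dtype=np.float32)

def estimate_pose(keypoints_2d, rel_dims, camera_matrix):
    # keypoints_2d: (8, 2) detected corner projections, ordered like cuboid_corners().
    object_points = cuboid_corners(rel_dims)
    ok, rvec, tvec = cv2.solvePnP(object_points, keypoints_2d.astype(np.float32),
                                  camera_matrix, distCoeffs=None,
                                  flags=cv2.SOLVEPNP_EPNP)
    return ok, rvec, tvec  # rotation vector and translation (up to scale)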

Training

This model uses a single-stage network to make all predictions and, as of TAO 5.2, is trained through the CenterPose entry point. The training algorithm optimizes the network to minimize both the focal loss and the L1 loss for all keypoints and cuboid dimensions.
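For illustration, the sketch below shows a CenterNet-style version of such an objective in PyTorch: a penalty-reduced focal loss on the keypoint heatmap plus L1 losses on the regressed keypoint and dimension outputs. The head/target names and the weighting are assumptions, not the exact TAO implementation.

import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    # pred, gt: (B, C, H, W); gt has a Gaussian peak equal to 1.0 at each keypoint.
    pred = pred.clamp(1e-4, 1.0 - 1e-4)
    pos = gt.eq(1).float()
    neg_weights = (1.0 - gt).pow(beta) * (1.0 - pos)
    pos_loss = (torch.log(pred) * (1.0 - pred).pow(alpha) * pos).sum()
    neg_loss = (torch.log(1.0 - pred) * pred.pow(alpha) * neg_weights).sum()
    return -(pos_loss + neg_loss) / pos.sum().clamp(min=1)

def total_loss(outputs, targets, dim_weight=1.0):
    # "heatmap", "keypoints", and "dimensions" are hypothetical head names.
    loss = heatmap_focal_loss(outputs["heatmap"], targets["heatmap"])
    loss = loss + F.l1_loss(outputs["keypoints"], targets["keypoints"])
    loss = loss + dim_weight * F.l1_loss(outputs["dimensions"], targets["dimensions"])
    return loss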

Training Data

The CenterPose model was trained on the Objectron dataset, a newly introduced benchmark for monocular RGB category-level 6-DoF object pose estimation. The dataset comprises 15k annotated video clips, totaling over 4M annotated frames. Each object is annotated with a 3D bounding cuboid that indicates its position and orientation relative to the camera, as well as the cuboid's dimensions.

Objects are from the following eight categories: bikes, books, bottles, cameras, cereal boxes, chairs, laptops, and shoes.

For training purposes, frames are extracted by temporally downsampling the original videos to 15 fps. For symmetric objects, such as bottles, multiple ground truth labels are produced during the training phase by rotating them N times around their symmetry axis.
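The sketch below illustrates that label augmentation, assuming the symmetry axis is the gravity-aligned +y axis and using an illustrative value of N; it simply rotates the ground-truth cuboid corners around that axis.

import numpy as np

def rotated_gt_corners(corners, n=12):
    # corners: (8, 3) ground-truth cuboid corners in the object frame.
    # Returns n copies rotated about the +y symmetry axis (n is illustrative).
    rotated = []
    for k in range(n):
        a = 2.0 * np.pi * k / n
        rot_y = np.array([[np.cos(a), 0.0, np.sin(a)],
                          [0.0,       1.0, 0.0      ],
                          [-np.sin(a), 0.0, np.cos(a)]])
        rotated.append(corners @ rot_y.T)
    return rotated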

| Category | # of Training Videos | # of Training Images | # of Testing Videos | # of Testing Images |
|---|---|---|---|---|
| Bike | 375 | 8,396 | 94 | 2,090 |
| Book | 1,618 | 31,609 | 404 | 7,885 |
| Bottle | 1,542 | 26,090 | 385 | 6,442 |
| Camera | 652 | 12,758 | 163 | 3,283 |
| Cereal Box | 1,288 | 22,024 | 321 | 5,428 |
| Chair | 1,555 | 27,608 | 388 | 6,695 |
| Laptop | 1,179 | 26,462 | 294 | 6,608 |
| Shoe | 1,693 | 30,515 | 423 | 7,859 |

Accuracy and Performance

Evaluation Data

The performance of the CenterPose model during inference was evaluated using the test samples from each category in the official dataset release. These frames, originally high-resolution images of 600x800 pixels, were resized to 512x512 pixels before being processed by the CenterPose model.

Methodology and KPI

Accuracy was determined using a 3D intersection-over-union (IoU) criterion with a threshold of 0.5. The 2D MPE (mean pixel projection error) metric calculates the average normalized distance between the projections of the 3D bounding box keypoints from the estimated and ground-truth poses. For viewpoint estimation, the average precision (AP) is reported for azimuth and elevation with thresholds of 15° and 10°, respectively.
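As a sketch of the 2D MPE metric, the following computes the mean distance between corresponding projected corners of the estimated and ground-truth boxes; normalizing by the image diagonal is an assumption made here for illustration.

import numpy as np

def mean_pixel_projection_error(pred_kpts, gt_kpts, img_w, img_h):
    # pred_kpts, gt_kpts: (8, 2) projected cuboid corners in pixels.
    diag = np.hypot(img_w, img_h)  # assumed normalization factor
    return float(np.linalg.norm(pred_kpts - gt_kpts, axis=1).mean() / diag)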

For symmetric object categories, such as bottles, the estimated bounding box is rotated around the symmetry axis N times (N = 100), and the prediction is assessed against each rotated instance. The reported results reflect the instance that either maximizes the 3D IoU or minimizes the 2D pixel projection error.
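A sketch of that selection is shown below, reusing the rotation helper from the training-data section; iou_3d and project are hypothetical helpers standing in for the cuboid 3D IoU computation and the camera projection.

def best_symmetric_scores(pred_corners, gt_corners, gt_kpts, img_w, img_h, n=100):
    # Keep the rotated candidate of the estimated box that is most favorable.
    best_iou, best_mpe = 0.0, float("inf")
    for candidate in rotated_gt_corners(pred_corners, n=n):
        best_iou = max(best_iou, iou_3d(candidate, gt_corners))  # iou_3d: hypothetical
        best_mpe = min(best_mpe, mean_pixel_projection_error(
            project(candidate), gt_kpts, img_w, img_h))          # project: hypothetical
    return best_iou, best_mpe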

For evaluation purposes, frames are extracted by temporally downsampling the original videos to 15 fps. Evaluation data key performance indicators (KPIs) are provided in the following table. The evaluation of the pretrained models was based on FP32 precision.

| Category | Backbone Architecture | 3D IoU ↑ | 2D MPE ↓ | AP @ 15° Azimuth Error ↑ | AP @ 10° Elevation Error ↑ |
|---|---|---|---|---|---|
| Bike | DLA34 | 0.6271 | 0.0941 | 0.8667 | 0.8990 |
| Book | DLA34 | 0.5678 | 0.0637 | 0.7380 | 0.8503 |
| Bottle | DLA34 | 0.7939 | 0.0402 | 0.9703 | 0.8933 |
| Camera | DLA34 | 0.7213 | 0.0574 | 0.8058 | 0.8402 |
| Cereal Box | DLA34 | 0.8131 | 0.0392 | 0.9273 | 0.9350 |
| Chair | DLA34 | 0.8495 | 0.0610 | 0.8787 | 0.8954 |
| Laptop | DLA34 | 0.7398 | 0.0475 | 0.8751 | 0.7871 |
| Shoe | DLA34 | 0.6711 | 0.0445 | 0.6838 | 0.7915 |
| Bike | FAN-Small-Hybrid | 0.6489 | 0.0885 | 0.8974 | 0.9418 |
| Book | FAN-Small-Hybrid | 0.6286 | 0.0597 | 0.7869 | 0.8835 |
| Bottle | FAN-Small-Hybrid | 0.8187 | 0.0396 | 0.9820 | 0.9056 |
| Camera | FAN-Small-Hybrid | 0.7384 | 0.0581 | 0.8360 | 0.8662 |
| Cereal Box | FAN-Small-Hybrid | 0.8290 | 0.0365 | 0.9418 | 0.9514 |
| Chair | FAN-Small-Hybrid | 0.8552 | 0.0587 | 0.8893 | 0.9153 |
| Laptop | FAN-Small-Hybrid | 0.7003 | 0.0477 | 0.8917 | 0.7714 |
| Shoe | FAN-Small-Hybrid | 0.6902 | 0.0442 | 0.7188 | 0.8276 |

Real-Time Inference Performance

The inference performance of the provided CenterPose model is evaluated at FP16 and FP32 precisions. The model's input resolution is 512x512 pixels. The performance assessment was conducted using trtexec on a range of devices, including the Orin Nano 8GB, Orin NX 16GB, Jetson AGX Orin 64GB, A2, A30, A100, H100, L4, L40, and Tesla T4. In the tables below, "BS" stands for "batch size."

The performance data presented pertains solely to model inference. End-to-end performance, when integrated with streaming video data and post-processing, might vary slightly due to potential bottlenecks in hardware and software.

| Models (FP16) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) | Latency ↓ (ms, BS=8) | Images per Second ↑ (BS=8) |
|---|---|---|---|---|---|
| CenterPose - DLA34 | Orin Nano 8GB | 52.19 | 19.16 | 167.85 (BS=4) | 23.83 (BS=4) |
| CenterPose - DLA34 | Orin NX 16GB | 36.05 | 27.74 | 115.91 (BS=4) | 34.51 (BS=4) |
| CenterPose - DLA34 | AGX Orin 64GB | 17.53 | 57.04 | 89.28 | 89.60 |
| CenterPose - DLA34 | A2 | 63.75 | 15.69 | 121.59 | 65.79 |
| CenterPose - DLA34 | A30 | 17.40 | 57.46 | 28.39 | 281.76 |
| CenterPose - DLA34 | A100 | 12.17 | 82.16 | 17.38 | 460.24 |
| CenterPose - DLA34 | H100 | 9.45 | 105.84 | 12.23 | 654.23 |
| CenterPose - DLA34 | L4 | 24.58 | 40.68 | 47.44 | 168.62 |
| CenterPose - DLA34 | L40 | 9.37 | 106.70 | 16.65 | 480.62 |
| CenterPose - DLA34 | Tesla T4 | 41.20 | 24.27 | 75.96 | 105.32 |
| CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 125.94 | 7.94 | 482.63 (BS=4) | 8.29 (BS=4) |
| CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 88.12 | 11.35 | 333.54 (BS=4) | 11.99 (BS=4) |
| CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 35.68 | 28.03 | 262.80 | 30.44 |
| CenterPose - FAN-Small-Hybrid | A2 | 172.55 | 5.80 | 315.35 | 25.37 |
| CenterPose - FAN-Small-Hybrid | A30 | 37.41 | 26.73 | 63.66 | 125.66 |
| CenterPose - FAN-Small-Hybrid | A100 | 20.01 | 49.99 | 32.30 | 247.64 |
| CenterPose - FAN-Small-Hybrid | H100 | 13.11 | 76.26 | 19.48 | 410.74 |
| CenterPose - FAN-Small-Hybrid | L4 | 53.52 | 18.69 | 111.70 | 71.62 |
| CenterPose - FAN-Small-Hybrid | L40 | 17.65 | 56.65 | 36.43 | 219.60 |
| CenterPose - FAN-Small-Hybrid | Tesla T4 | 102.33 | 9.77 | 187.63 | 42.64 |

| Models (FP32) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) | Latency ↓ (ms, BS=8) | Images per Second ↑ (BS=8) |
|---|---|---|---|---|---|
| CenterPose - DLA34 | Orin Nano 8GB | 80.81 | 12.37 | 277.50 (BS=4) | 14.41 (BS=4) |
| CenterPose - DLA34 | Orin NX 16GB | 55.67 | 17.96 | 192.77 (BS=4) | 20.75 (BS=4) |
| CenterPose - DLA34 | AGX Orin 64GB | 25.36 | 39.44 | 150.98 | 52.99 |
| CenterPose - DLA34 | A2 | 155.43 | 6.43 | 307.11 | 26.05 |
| CenterPose - DLA34 | A30 | 40.83 | 24.49 | 74.51 | 107.37 |
| CenterPose - DLA34 | A100 | 25.17 | 39.74 | 41.79 | 191.44 |
| CenterPose - DLA34 | H100 | 16.28 | 61.42 | 25.00 | 320.03 |
| CenterPose - DLA34 | L4 | 49.99 | 20.00 | 97.38 | 82.15 |
| CenterPose - DLA34 | L40 | 18.64 | 53.63 | 33.47 | 239.01 |
| CenterPose - DLA34 | Tesla T4 | 101.42 | 9.86 | 188.23 | 42.50 |
| CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 208.25 | 4.80 | 832.21 (BS=4) | 4.81 (BS=4) |
| CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 144.80 | 6.91 | 572.54 (BS=4) | 6.99 (BS=4) |
| CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 60.29 | 16.59 | 494.86 | 16.17 |
| CenterPose - FAN-Small-Hybrid | A2 | 450.28 | 2.22 | 872.87 | 9.17 |
| CenterPose - FAN-Small-Hybrid | A30 | 113.68 | 8.80 | 215.12 | 37.19 |
| CenterPose - FAN-Small-Hybrid | A100 | 58.62 | 17.06 | 109.13 | 73.31 |
| CenterPose - FAN-Small-Hybrid | H100 | 30.90 | 32.36 | 54.41 | 147.02 |
| CenterPose - FAN-Small-Hybrid | L4 | 149.66 | 6.68 | 309.32 | 25.86 |
| CenterPose - FAN-Small-Hybrid | L40 | 53.02 | 18.86 | 108.91 | 73.46 |
| CenterPose - FAN-Small-Hybrid | Tesla T4 | 294.14 | 3.40 | 579.57 | 13.80 |

How to Use This Model

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are designed for use with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The primary application of these models is to estimate an object's pose from a single RGB image. They can detect objects in images, given appropriate image decoding and pre-processing.

For training, the models are intended for use with the Train Adapt Optimize (TAO) Toolkit and your own dataset, which makes it possible to train high-fidelity models tailored to new use cases. The Jupyter notebook included in the TAO container can be used for re-training.

Furthermore, these models are designed for deployment to edge devices using TensorRT. The TAO Triton apps offer capabilities to construct efficient image analytics pipelines that capture, decode, and pre-process data before executing inference.

Input

RGB image of dimensions 512 x 512 x 3 (W x H x C). Channel ordering of the input: NCHW, where N = batch size, C = number of channels (3), H = image height (512), W = image width (512).
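A minimal preprocessing sketch that produces this layout is shown below. The plain square resize and the absence of mean/std normalization are simplifying assumptions (the actual pipeline re-scales and pads the image as described in the Model Architecture section).

import cv2
import numpy as np

def preprocess(image_path, size=512):
    # Load an image and return a 1 x 3 x 512 x 512 float32 array in NCHW order.
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (size, size)).astype(np.float32) / 255.0
    chw = np.transpose(rgb, (2, 0, 1))   # HWC -> CHW
    return np.expand_dims(chw, axis=0)   # add batch dimension -> NCHW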

Output

The model outputs three distinct results: the projections of the 3D bounding box keypoints, the relative cuboid dimensions, and the 6-DoF (degrees of freedom) pose. The +y axis points up (aligned with gravity, green line), the +x axis follows the right-hand rule (red line), and the +z axis points out of the object's front face (blue line).

Output Image

Sample output visualizations are shown for each category: Bike, Book, Bottle, Camera, Cereal Box, Chair, Laptop, and Shoe.

Instructions to Deploy the Model with TAO-Deploy

To run object pose estimation with these models, use the following YAML as a template for the dataset and inference sections of the experiment spec file when testing images.

  • Convert the pretrained ONNX model into a TensorRT engine.
  • Set up the engine file path and configure the correct intrinsic matrix in the experiment spec file.
  • Run the TAO-Deploy inference pipeline to get the visualization results.
  • Note: The CenterPose model with the FAN-Small-Hybrid backbone only supports FP16 when the TensorRT version is >= 8.6.

For more information on the experiment spec file and usage instructions, see the TAO Toolkit User Guide.

dataset:
  inference_data: /path/to/inference/images/folder   # folder of test RGB images
  num_classes: 1                                      # single object category per model
  batch_size: 1
  workers: 4                                          # data-loading workers

inference:
  trt_engine: /path/to/engine.trt                     # TensorRT engine built from the ONNX model
  visualization_threshold: 0.3                        # minimum detection confidence to visualize
  principle_point_x: 298.3                            # camera intrinsics: principal point (pixels)
  principle_point_y: 392.1
  focal_length_x: 651.2                               # camera intrinsics: focal lengths (pixels)
  focal_length_y: 651.2
  skew: 0.0                                           # camera intrinsics: axis skew
  axis_size: 0.5                                      # length of the drawn pose axes in the visualization
  use_pnp: True                                       # run a PnP solver to recover the 6-DoF pose
  save_json: True                                     # save detection results to JSON
  save_visualization: True                            # save rendered output images
  opencv: True

Instructions to Deploy the Model with Triton Inference Server

To create the entire end-to-end inference application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.

To deploy this model with Triton Inference Server and end-to-end inference from images, please refer to the TAO Triton apps.

Limitations

Very Small Objects

The CenterPose model was trained to identify dominant objects in the camera view. As a result, it might not detect objects that appear very small with respect to the camera view.

Occluded Objects

If objects are occluded or truncated to the extent that less than 40% of the object remains visible, the CenterPose model might not recognize them. The model can detect partially occluded objects as long as the majority of the object remains visible. Heavily occluded objects might compromise detection accuracy.

Dark Lighting, Distorted Images, Blurry Images

The CenterPose model was trained on RGB images taken under good lighting conditions and captured by a pinhole camera. Consequently, images shot in poor lighting or those exhibiting distortion or blur might not yield optimal detection results.

Model Versions

References

Citations

  • Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio Vela, and Stan Birchfield. "Single-Stage Keypoint-based Category-level Object Pose Estimation from an RGB Image." IEEE International Conference on Robotics and Automation (ICRA). 2022.

  • Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng and Jose M. Alvarez. "Understanding The Robustness in Vision Transformers". International Conference on Machine Learning (ICML). 2022.

  • Fisher Yu, Dequan Wang, Evan Shelhamer and Trevor Darrell. "Deep Layer Aggregation". Conference on Computer Vision and Pattern Recognition. 2018.

  • Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, Matthias Grundmann. "Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations". IEEE Conference on Computer Vision and Pattern Recognition. 2021.

Technical Blogs

Suggested Reading

License

The license to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA CenterPose model estimates object pose. However, no additional information, such as people or other distractors in the background, is inferred. The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.