# CenterPose Model Card - Multiple Categories

## Model Overview

## Description:

The Multi-category CenterPose model detects the projections of Three-Dimensional (3D) key points, estimates a 6-Degree of Freedom (DOF) pose, and creates a 3D bounding box. This model is ready for commercial use.

### License:

License to use these models is covered by the NVIDIA Open Model License. By downloading the model, you accept the terms and conditions of these [licenses](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## References:

### Citations

- Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio Vela, and Stan Birchfield. "Single-Stage Keypoint-based Category-level Object Pose Estimation from an RGB Image." *IEEE International Conference on Robotics and Automation (ICRA)*. 2022.
- Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng and Jose M. Alvarez. "Understanding The Robustness in Vision Transformers". *International Conference on Machine Learning (ICML).* 2022.
- Fisher Yu, Dequan Wang, Evan Shelhamer and Trevor Darrell. "Deep Layer Aggregation". *Conference on Computer Vision and Pattern Recognition.* 2018.
- Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, Matthias Grundmann. "Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations". *IEEE Conference on Computer Vision and Pattern Recognition*. 2021.

## Model Architecture:

**Architecture Type:** Convolutional Neural Network (CNN) and transformer-based
**Network Architecture:** DLA34 and FAN-Small-Hybrid
This model supports two different types of backbone networks as the feature extractor, including [DLA34](https://arxiv.org/pdf/1707.06484.pdf) and [FAN-Small-Hybrid](https://arxiv.org/abs/2204.12451). The DLA34 is a standard Convolutional Neural Network (CNN) backbone, while the FAN-Small-Hybrid is a transformer-based classification backbone.

The network architecture processes a re-scaled and padded RGB image. Using the DLA34/FAN-Small-Hybrid feature extractor combined with an upsampling module, the network outputs distinct heads that predict the projections of 3D bounding box keypoints and relative cuboid dimensions. After objects are detected in image space, the estimated relative cuboid dimensions enable you to use robust, off-the-shelf Perspective-n-Point (PnP) algorithms for pose estimation.

## Input:

**Input Type(s):** Images
**Input Format(s):** Red, Green, Blue (RGB)
**Input Parameters:** Three-Dimensional (3D)
**Other Properties Related to Input:** RGB Image of dimensions: 512 X 512 X 3 (W x H x C). Channel Ordering of the Input: NCHW, where N = Batch Size, C = number of channels (3), H = Height of images (512), W = Width of the images (512)
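
As a rough illustration of the "re-scaled and padded" 512 x 512 input, the following sketch computes the geometry of an aspect-preserving resize followed by centered padding, and the resulting NCHW tensor shape. The function names are hypothetical, and centering the padding is an assumption; the actual TAO pre-processing may place or transform the image differently.

```python
def letterbox_geometry(src_w, src_h, dst=512):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving
    resize into a dst x dst canvas. Centered padding is an assumption."""
    scale = min(dst / src_w, dst / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_left = (dst - new_w) // 2
    pad_top = (dst - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def nchw_shape(batch_size, channels=3, height=512, width=512):
    """Shape of the input tensor in NCHW order, as described above."""
    return (batch_size, channels, height, width)


if __name__ == "__main__":
    # A 600-wide, 800-tall frame scaled to fit a 512 x 512 canvas.
    print(letterbox_geometry(600, 800))
    print(nchw_shape(batch_size=1))
```

The padding offsets matter later: keypoints predicted in the 512 x 512 frame have to be shifted and rescaled by the same amounts to land on the original image.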
## Output:

**Output Type(s):** 3D Bounding-Box, Cuboid Dimensions and 6-DoF (Degrees of Freedom)
**Output Format:** 9 Three-Dimensional (3D) Bounding Box Vertices: (X-coordinate, Y-coordinate, Z-coordinate), Cuboid Dimensions (Relative to 3D Bounding Box): (X-scale, Y-scale, Z-scale), 6-DoF: Floating Points
**Other Properties Related to Output:** The +y axis points up (aligned with gravity, green line); the +x axis follows the right-hand rule (red line); the +z axis corresponds to the front face (blue line).
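
To make the relationship between the nine keypoints and the cuboid dimensions concrete, here is a minimal sketch that builds nine object-frame points from an (X-scale, Y-scale, Z-scale) triple. It assumes the nine points are the cuboid center followed by its eight corners, which is the Objectron-style convention; the function name and the corner ordering are illustrative and are not specified by this card.

```python
from itertools import product


def cuboid_keypoints(scale_x, scale_y, scale_z):
    """Return 9 object-frame points: the center, then the 8 corners.

    Assumption: center-first layout with corners enumerated by sign
    pattern; the model's actual vertex order may differ.
    """
    half = (scale_x / 2.0, scale_y / 2.0, scale_z / 2.0)
    points = [(0.0, 0.0, 0.0)]
    for sx, sy, sz in product((-1.0, 1.0), repeat=3):
        points.append((sx * half[0], sy * half[1], sz * half[2]))
    return points


if __name__ == "__main__":
    for point in cuboid_keypoints(1.0, 2.0, 3.0):
        print(point)
```

Pairing these 3D points with their detected 2D projections is what allows a PnP solver to recover the pose up to the scale fixed by the cuboid dimensions.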

[Figure: example output visualizations for the Bike, Book, Bottle, Camera, Cereal Box, Chair, Laptop, and Shoes categories]

## Software Integration:

**Runtime Engine(s):**
* TAO - 5.2
**Supported Hardware Architecture(s):**
* Ampere
* Jetson
* Hopper
* Lovelace
* Pascal
* Turing
* Volta
**Supported Operating System(s):**
* Linux
* Linux 4 Tegra

## Model Version(s):

- **Deployable_v1.0**: decrypted [ONNX](https://onnx.ai/) files that can be used for inference with the [Isaac ROS pose estimation](https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_pose_estimation) pipeline, [TAO Triton apps](https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps), and [TAO Toolkit](https://docs.nvidia.com/tao/tao-toolkit/index.html).

# Training & Evaluation:

## Training Dataset:

**Data Collection Method by dataset:**
* Automatic/Sensors
**Labeling Method by dataset:**
* Human
**Properties:**
- Trained on a dataset comprised of 15,000 annotated video clips, totaling over 4M annotated frames. Every category is marked with a 3D bounding cuboid that indicates the object's position, orientation relative to the camera, and the cuboid's dimensions.
- Objects are from the following eight categories: bikes, books, bottles, cameras, cereal boxes, chairs, laptops, and shoes.
- For training purposes, frames are extracted by temporally down-sampling the original videos to 15 fps. For symmetric objects, such as bottles, multiple ground truth labels are produced during the training phase by rotating them N times around their symmetry axis.

| Category | # of Training Videos | # of Training Images | # of Testing Videos | # of Testing Images |
| -- | -- | -- | -- | -- |
| Bike | 375 | 8,396 | 94 | 2,090 |
| Book | 1,618 | 31,609 | 404 | 7,885 |
| Bottle | 1,542 | 26,090 | 385 | 6,442 |
| Camera | 652 | 12,758 | 163 | 3,283 |
| Cereal Box | 1,288 | 22,024 | 321 | 5,428 |
| Chair | 1,555 | 27,608 | 388 | 6,695 |
| Laptop | 1,179 | 26,462 | 294 | 6,608 |
| Shoe | 1,693 | 30,515 | 423 | 7,859 |

## Evaluation Dataset:

**Data Collection Method by dataset:**
* Automatic/Sensors
**Labeling Method by dataset:**
* Human
**Properties:**
- Evaluated using approximately 46,000 test samples from the training dataset. These frames, originally high-resolution images of 600x800 pixels, were resized to 512x512 pixels before being processed by the CenterPose model.

### Methodology and KPI

Accuracy was determined using a 3D intersection-over-union (IoU) criterion with a threshold of 0.5. The 2D MPE (mean pixel projection error) metric calculates the average normalized distance between the projections of 3D bounding box keypoints from the estimated and ground truth poses. For viewpoint estimation, the average precision (AP) is presented for azimuth and elevation with thresholds of 15° and 10°, respectively. For symmetric object categories, like bottles, the estimated bounding box is rotated around the symmetry axis N times (where N = 100), and the prediction is assessed against each rotated instance. The results reflect the instance that either maximizes the 3D IoU or minimizes the 2D pixel projection error. For evaluation purposes, frames are extracted by temporally down-sampling the original videos to 15 fps.

Evaluation data key performance indicators (KPIs) are provided in the following table. The evaluation of the pretrained models was based on FP32 precision.

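
Before the results, the two rules described above (a mean keypoint distance, and taking the best score over N rotations about the symmetry axis) can be sketched as follows. This is a simplified illustration, not the evaluation code behind the table: the function names are hypothetical, keypoints are assumed to be already normalized, the symmetry axis is assumed to be +y, and a mean 3D vertex distance stands in for the projected 2D error in the rotation search.

```python
import math


def mean_pixel_error(pred_2d, gt_2d):
    """Mean Euclidean distance between matching 2D keypoints
    (coordinates assumed to be normalized by the image size)."""
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred_2d, gt_2d)]
    return sum(dists) / len(dists)


def rotate_about_y(points_3d, angle):
    """Rotate 3D points about the +y axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points_3d]


def mean_vertex_distance(pred_3d, gt_3d):
    """Stand-in error: mean distance between matching 3D vertices."""
    return sum(math.dist(p, g) for p, g in zip(pred_3d, gt_3d)) / len(gt_3d)


def best_error_over_symmetry(pred_3d, gt_3d, error_fn, n=100):
    """Evaluate n rotations of the prediction about the symmetry axis
    and keep the lowest error, as done for symmetric categories."""
    return min(
        error_fn(rotate_about_y(pred_3d, 2.0 * math.pi * k / n), gt_3d)
        for k in range(n)
    )


if __name__ == "__main__":
    pred = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    gt = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
    print(mean_vertex_distance(pred, gt))
    print(best_error_over_symmetry(pred, gt, mean_vertex_distance, n=4))
```

For a metric that should be maximized, such as 3D IoU, the same search would use `max` instead of `min`.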
| Category | Backbone Architecture | 3D IoU ↑ | 2D MPE ↓ | AP @ 15° Azimuth Error ↑ | AP @ 10° Elevation Error ↑ |
| ------- | ------- | ------- | ------- | ------- | ------- |
| Bike | DLA34 | 0.5420 | 0.0915 | 0.8192 | 0.8512 |
| Book | DLA34 | 0.4553 | 0.0755 | 0.6466 | 0.7831 |
| Bottle | DLA34 | 0.7306 | 0.0419 | 0.9510 | 0.8133 |
| Camera | DLA34 | 0.6814 | 0.0602 | 0.7395 | 0.7801 |
| Cereal Box | DLA34 | 0.7491 | 0.0455 | 0.8939 | 0.9065 |
| Chair | DLA34 | 0.8200 | 0.0604 | 0.8140 | 0.8579 |
| Laptop | DLA34 | 0.7194 | 0.0525 | 0.8510 | 0.7623 |
| Shoe | DLA34 | 0.6228 | 0.0467 | 0.6016 | 0.6881 |
| Bike | FAN-Small-Hybrid | 0.6505 | 0.0921 | 0.9033 | 0.9527 |
| Book | FAN-Small-Hybrid | 0.5768 | 0.0665 | 0.7584 | 0.8783 |
| Bottle | FAN-Small-Hybrid | 0.8198 | 0.0410 | 0.9796 | 0.9072 |
| Camera | FAN-Small-Hybrid | 0.7336 | 0.0558 | 0.8234 | 0.8604 |
| Cereal Box | FAN-Small-Hybrid | 0.8203 | 0.0393 | 0.9210 | 0.9420 |
| Chair | FAN-Small-Hybrid | 0.8603 | 0.0536 | 0.8610 | 0.9252 |
| Laptop | FAN-Small-Hybrid | 0.7525 | 0.0481 | 0.9057 | 0.8143 |
| Shoe | FAN-Small-Hybrid | 0.6718 | 0.0445 | 0.6968 | 0.8151 |

## Inference:

**Engine:** TensorRT
**Test Hardware:**
- Jetson AGX Xavier
- Xavier NX
- Orin
- Orin NX
- NVIDIA T4
- Ampere GPU
- A2
- A30
- L4
- T4
- DGX H100
- DGX A100
- L40
- Jetson AGX Orin 64GB
- Orin NX 16GB
- Orin Nano 8GB

The inference performance of the provided CenterPose model is evaluated at FP16 and FP32 precisions. The model's input resolution is 512x512 pixels. The performance assessment was conducted using **trtexec** on a range of devices including: Orin Nano 8GB, Orin NX 16GB, Jetson AGX Orin 64GB, A2, A30, A100, H100, L4, L40, and Tesla T4. In the tables, "BS" stands for "batch size"; entries marked (BS=4) were measured at a batch size of 4 instead of 8.

The performance data presented pertains solely to model inference. The end-to-end performance, when integrated with streaming video data and post-processing, might vary slightly due to potential bottlenecks in hardware and software.

| Models (FP16) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) | Latency ↓ (ms, BS=8) | Images per Second ↑ (BS=8) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| CenterPose - DLA34 | Orin Nano 8GB | 52.19 | 19.16 | 167.85 (BS=4) | 23.83 (BS=4) |
| CenterPose - DLA34 | Orin NX 16GB | 36.05 | 27.74 | 115.91 (BS=4) | 34.51 (BS=4) |
| CenterPose - DLA34 | AGX Orin 64GB | 17.53 | 57.04 | 89.28 | 89.60 |
| CenterPose - DLA34 | A2 | 63.75 | 15.69 | 121.59 | 65.79 |
| CenterPose - DLA34 | A30 | 17.40 | 57.46 | 28.39 | 281.76 |
| CenterPose - DLA34 | A100 | 12.17 | 82.16 | 17.38 | 460.24 |
| CenterPose - DLA34 | H100 | 9.45 | 105.84 | 12.23 | 654.23 |
| CenterPose - DLA34 | L4 | 24.58 | 40.68 | 47.44 | 168.62 |
| CenterPose - DLA34 | L40 | 9.37 | 106.70 | 16.65 | 480.62 |
| CenterPose - DLA34 | Tesla T4 | 41.20 | 24.27 | 75.96 | 105.32 |
| CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 125.94 | 7.94 | 482.63 (BS=4) | 8.29 (BS=4) |
| CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 88.12 | 11.35 | 333.54 (BS=4) | 11.99 (BS=4) |
| CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 35.68 | 28.03 | 262.80 | 30.44 |
| CenterPose - FAN-Small-Hybrid | A2 | 172.55 | 5.80 | 315.35 | 25.37 |
| CenterPose - FAN-Small-Hybrid | A30 | 37.41 | 26.73 | 63.66 | 125.66 |
| CenterPose - FAN-Small-Hybrid | A100 | 20.01 | 49.99 | 32.30 | 247.64 |
| CenterPose - FAN-Small-Hybrid | H100 | 13.11 | 76.26 | 19.48 | 410.74 |
| CenterPose - FAN-Small-Hybrid | L4 | 53.52 | 18.69 | 111.70 | 71.62 |
| CenterPose - FAN-Small-Hybrid | L40 | 17.65 | 56.65 | 36.43 | 219.60 |
| CenterPose - FAN-Small-Hybrid | Tesla T4 | 102.33 | 9.77 | 187.63 | 42.64 |

| Models (FP32) | Devices | Latency ↓ (ms, BS=1) | Images per Second ↑ (BS=1) | Latency ↓ (ms, BS=8) | Images per Second ↑ (BS=8) |
| ---- | ---- | ---- | ---- | ---- | ---- |
| CenterPose - DLA34 | Orin Nano 8GB | 80.81 | 12.37 | 277.50 (BS=4) | 14.41 (BS=4) |
| CenterPose - DLA34 | Orin NX 16GB | 55.67 | 17.96 | 192.77 (BS=4) | 20.75 (BS=4) |
| CenterPose - DLA34 | AGX Orin 64GB | 25.36 | 39.44 | 150.98 | 52.99 |
| CenterPose - DLA34 | A2 | 155.43 | 6.43 | 307.11 | 26.05 |
| CenterPose - DLA34 | A30 | 40.83 | 24.49 | 74.51 | 107.37 |
| CenterPose - DLA34 | A100 | 25.17 | 39.74 | 41.79 | 191.44 |
| CenterPose - DLA34 | H100 | 16.28 | 61.42 | 25.00 | 320.03 |
| CenterPose - DLA34 | L4 | 49.99 | 20.00 | 97.38 | 82.15 |
| CenterPose - DLA34 | L40 | 18.64 | 53.63 | 33.47 | 239.01 |
| CenterPose - DLA34 | Tesla T4 | 101.42 | 9.86 | 188.23 | 42.50 |
| CenterPose - FAN-Small-Hybrid | Orin Nano 8GB | 208.25 | 4.80 | 832.21 (BS=4) | 4.81 (BS=4) |
| CenterPose - FAN-Small-Hybrid | Orin NX 16GB | 144.80 | 6.91 | 572.54 (BS=4) | 6.99 (BS=4) |
| CenterPose - FAN-Small-Hybrid | AGX Orin 64GB | 60.29 | 16.59 | 494.86 | 16.17 |
| CenterPose - FAN-Small-Hybrid | A2 | 450.28 | 2.22 | 872.87 | 9.17 |
| CenterPose - FAN-Small-Hybrid | A30 | 113.68 | 8.80 | 215.12 | 37.19 |
| CenterPose - FAN-Small-Hybrid | A100 | 58.62 | 17.06 | 109.13 | 73.31 |
| CenterPose - FAN-Small-Hybrid | H100 | 30.90 | 32.36 | 54.41 | 147.02 |
| CenterPose - FAN-Small-Hybrid | L4 | 149.66 | 6.68 | 309.32 | 25.86 |
| CenterPose - FAN-Small-Hybrid | L40 | 53.02 | 18.86 | 108.91 | 73.46 |
| CenterPose - FAN-Small-Hybrid | Tesla T4 | 294.14 | 3.40 | 579.57 | 13.80 |

## How to Use This Model

These models are designed for use with NVIDIA hardware and software. For hardware, the models are compatible with any NVIDIA GPU, including NVIDIA Jetson devices. For software, the models are specifically designed for the [Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/tao-toolkit) or [TensorRT](https://developer.nvidia.com/tensorrt).

The primary application of these models is to estimate an object's pose from a single RGB image. They can identify the objects in photos, given the right image decoding and pre-processing procedures.

For training, the models are intended for use with the Train Adapt Optimize (TAO) Toolkit and your own dataset. It's possible to train high-fidelity models tailored to new use cases. The Jupyter Notebook, which is included in the [TAO Container](https://ngc.nvidia.com/catalog/containers/nvidia:tao:tao-toolkit), can be used for re-training.

Furthermore, these models are designed for deployment to edge devices using TensorRT. [TAO Triton apps](https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps) offer capabilities to construct efficient image analytic pipelines. These pipelines can capture, decode, and pre-process data before executing inference.

### Instructions to Deploy the Model with TAO-Deploy

To use these models for object pose estimation, see the following YAML as a template for the dataset and inference sections of the experiment spec file when testing the images.

* Convert the pretrained ONNX models into a TRT engine using TensorRT.
* Set up the engine file path and configure the correct intrinsic matrix in the experiment spec file.
* Run the TAO-Deploy inference pipeline to get the visualization results.
* Note: The CenterPose model with FAN-Small-Hybrid supports FP16 only when the TensorRT version is 8.6 or later.

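
The intrinsic matrix mentioned in the steps above is the standard pinhole camera matrix. The sketch below shows how the spec fields `focal_length_x`, `focal_length_y`, `principle_point_x`, `principle_point_y`, and `skew` would conventionally be arranged into it, using the sample values from the spec template in this section. The helper names are hypothetical, and the camera convention (looking down +z with positive depth, as in OpenCV) is an assumption rather than something this card states.

```python
def intrinsic_matrix(focal_length_x, focal_length_y,
                     principle_point_x, principle_point_y, skew=0.0):
    """Arrange the spec fields into a 3x3 pinhole intrinsic matrix K."""
    return [
        [focal_length_x, skew, principle_point_x],
        [0.0, focal_length_y, principle_point_y],
        [0.0, 0.0, 1.0],
    ]


def project_point(k, point_3d):
    """Project a camera-frame 3D point (z > 0 assumed) to pixel (u, v)."""
    x, y, z = point_3d
    u = (k[0][0] * x + k[0][1] * y + k[0][2] * z) / z
    v = (k[1][1] * y + k[1][2] * z) / z
    return u, v


if __name__ == "__main__":
    # Sample intrinsics taken from the spec template in this section.
    k = intrinsic_matrix(651.2, 651.2, 298.3, 392.1, skew=0.0)
    # A point on the optical axis projects onto the principal point.
    print(project_point(k, (0.0, 0.0, 2.0)))
```

The template values are placeholders for one particular camera; PnP results are only meaningful when these fields match the camera that captured the inference images.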
For more information on the experiment spec file and usage instructions, see the [TAO Toolkit User Guide](https://docs.nvidia.com/tao/tao-toolkit/index.html).

```yaml
dataset:
  inference_data: /path/to/inference/images/folder
  num_classes: 1
  batch_size: 1
  workers: 4
inference:
  trt_engine: /path/to/engine.trt
  visualization_threshold: 0.3
  principle_point_x: 298.3
  principle_point_y: 392.1
  focal_length_x: 651.2
  focal_length_y: 651.2
  skew: 0.0
  axis_size: 0.5
  use_pnp: True
  save_json: True
  save_visualization: True
  opencv: True
```

### Instructions to Deploy the Model with Triton Inference Server

To create the entire end-to-end inference application, deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server). NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.

To deploy this model with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) and run end-to-end inference from images, refer to the [TAO Triton apps](https://github.com/NVIDIA-AI-IOT/tao-toolkit-triton-apps).

## Technical Blogs

- Read the two-part blog on training and optimizing a 2D body pose estimation model with TAO - [Part 1](https://developer.nvidia.com/blog/training-optimizing-2d-pose-estimation-model-with-tao-toolkit-part-1) | [Part 2](https://developer.nvidia.com/blog/training-optimizing-2d-pose-estimation-model-with-tao-toolkit-part-2).
- Learn how to create a [real-time license plate detection and recognition app](https://developer.nvidia.com/blog/creating-a-real-time-license-plate-detection-and-recognition-app) with TAO and DeepStream.
- Model accuracy is extremely important. Learn how you can achieve [state of the art accuracy for classification and object detection models](https://developer.nvidia.com/blog/preparing-state-of-the-art-models-for-classification-and-object-detection-with-tao-toolkit/) using TAO.
- Learn how to train [an instance segmentation model using MaskRCNN with TAO](https://developer.nvidia.com/blog/training-instance-segmentation-models-using-maskrcnn-on-tao-toolkit/).
- Read the technical tutorial on how the [PeopleNet model can be trained with custom data using the Transfer Learning Toolkit](https://devblogs.nvidia.com/training-custom-pretrained-models-using-tlt/).
- Learn how to [train and deploy real-time intelligent video analytics apps and services using the DeepStream SDK](https://devblogs.nvidia.com/building-iva-apps-using-deepstream-5.0/).

## Suggested Reading

- More information about TAO Toolkit and pre-trained models can be found at the [NVIDIA Developer Zone](https://developer.nvidia.com/tao-toolkit).
- Read the [TAO Quick Start](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html) guide and [release notes](https://docs.nvidia.com/tao/tao-toolkit/text/release_notes.html).
- If you have any questions or feedback, see the discussions on the [TAO Toolkit Developer Forums](https://forums.developer.nvidia.com/c/accelerated-computing/intelligent-video-analytics/tao-toolkit/17).
- Deploy your model on the edge using DeepStream. Learn more about [DeepStream SDK](https://developer.nvidia.com/deepstream-sdk).

## Ethical Considerations:

The NVIDIA CenterPose model estimates the object pose. However, no additional information, such as people and other distractors in the background, is inferred. The training and evaluation dataset mostly consists of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.