Detect body pose from an image.
Latest version: March 13, 2024 (64.09 MB)

BodyPoseNet Model Card

Model Overview

The BodyPoseNet models described in this card perform multi-person human pose estimation: they predict a skeleton, consisting of keypoints and the connections between them, for every person in a given input image. The network follows a single-shot, bottom-up methodology, so no person detector is needed and compute does not scale linearly with the number of people in the scene. The pose/skeleton output is commonly used as input for applications such as activity/gesture recognition, fall detection, and posture analysis.

The default model predicts 18 keypoints: nose, neck, right_shoulder, right_elbow, right_wrist, left_shoulder, left_elbow, left_wrist, right_hip, right_knee, right_ankle, left_hip, left_knee, left_ankle, right_eye, left_eye, right_ear, left_ear.

Fig 1. Example illustration of BodyPoseNet output

Model Architecture

This is a fully convolutional model whose architecture consists of a backbone network (such as VGG), an initial estimation stage that makes a pixel-wise prediction of confidence maps (heatmaps) and part affinity fields, followed by multistage refinement (0 to N stages) of the initial predictions.

Training Algorithm

The training algorithm optimizes the network to minimize the loss on confidence maps (heatmaps) and part affinity fields for a given image and its ground-truth pose labels.
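The loss described above can be sketched as a per-stage L2 objective summed across all refinement stages (the intermediate-supervision scheme of the part-affinity-fields approach). This is a minimal illustration, not the TAO implementation: the function names are hypothetical, and the real training code may weight or mask the two loss terms differently.

```python
import numpy as np

def stage_loss(pred_cmap, gt_cmap, pred_paf, gt_paf):
    """L2 loss on confidence maps and part affinity fields for one stage.

    Illustrative sketch: a real pipeline may apply per-pixel masks for
    unlabeled regions and different weights per term.
    """
    cmap_loss = np.mean((pred_cmap - gt_cmap) ** 2)
    paf_loss = np.mean((pred_paf - gt_paf) ** 2)
    return cmap_loss + paf_loss

def total_loss(stage_preds, gt_cmap, gt_paf):
    """Sum the per-stage losses (intermediate supervision across stages)."""
    return sum(stage_loss(c, gt_cmap, p, gt_paf) for c, p in stage_preds)
```

Supervising every stage against the same ground truth is what lets later stages refine, rather than redo, the initial estimate.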

Training Data

The available pretrained model is trained on a subset of the Google OpenImages dataset.


Evaluation Dataset

The inference performance of the BodyPoseNet v1.0 model was measured against the COCO validation dataset.

Methodology and KPI

The KPIs for the evaluation data are reported in the table below.

Metric IoU Area Score
AP 0.50:0.95 all 56.2
AP 0.5 all 79.3
AP 0.50:0.95 medium 57.2
AP 0.50:0.95 large 54.9

Real-time Inference Performance

The inference performance is measured for INT8 precision and an input dimension of 288x384. Performance was measured with trtexec on Jetson Nano, AGX Xavier, Xavier NX, and an NVIDIA T4 GPU. The Jetson devices run at the Max-N configuration for maximum system performance. End-to-end performance with streaming video data might vary slightly depending on the application.

Device Precision Batch size FPS Latency
Nano INT8 8 5 200.0ms
NX INT8 8 93 10.71ms
Xavier INT8 8 160 6.25ms
T4 INT8 8 555 1.80ms

How to use this model

The models in this page can only be used with Train Adapt Optimize (TAO) Toolkit. TAO provides a simple command line interface to train a deep learning model for body pose estimation.

The primary use case for this model is detecting human poses in a given RGB image. BodyPoseNet is commonly used for activity/gesture recognition, fall detection, posture analysis, among others.

  1. Install the NGC CLI.

  2. Configure the NGC CLI using the following command:

ngc config set

  3. To view all the models that are supported in TAO:

ngc registry model list nvidia/tao/bodyposenet:*

  4. To download the model:

ngc registry model download-version nvidia/tao/bodyposenet:<template> --dest <path>


The network accepts an H x W x 3 input. Images are pre-processed to handle normalization and resizing while maintaining the aspect ratio.
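The pre-processing step above can be sketched as an aspect-ratio-preserving resize followed by padding and normalization. This is a minimal, dependency-free illustration: the 288x384 target, zero padding, and [-1, 1] normalization are assumptions, and the actual TAO pipeline may use a different resize kernel and value range.

```python
import numpy as np

def preprocess(image, target_h=288, target_w=384):
    """Resize an H x W x 3 image keeping its aspect ratio, pad, normalize.

    Assumed details (not from the model card): nearest-neighbor resize,
    zero padding at the bottom/right, and [-1, 1] normalization.
    Returns the padded canvas and the scale needed to map keypoints back.
    """
    h, w = image.shape[:2]
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    # Nearest-neighbor resize via index maps (no external dependencies).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    canvas = np.zeros((target_h, target_w, 3), dtype=np.float32)
    canvas[:new_h, :new_w] = resized
    return canvas / 127.5 - 1.0, scale
```

Returning the scale factor matters in practice: predicted keypoints live in network coordinates and must be divided by it to land back on the original image.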


The network outputs two tensors: confidence maps (H1' x W1' x C) and part affinity fields (H2' x W2' x P). After non-maximum suppression (NMS) and bipartite graph matching, the final result is an M x N x 3 tensor, where:


  • M is the number of humans detected in the image.
  • N is the number of keypoints.
  • C is the number of confidence map channels, corresponding to the number of keypoints + background.
  • P is the number of part affinity field channels, corresponding to 2 x the number of edges in the skeleton.
  • H1', W1' are the height and width of the output confidence maps, respectively.
  • H2', W2' are the height and width of the output part affinity fields, respectively.
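The NMS step mentioned above amounts to a per-channel local-maximum search on the confidence maps. The sketch below covers only that step; the threshold value is an illustrative assumption, and a real pipeline would follow it with part-affinity-field scoring and bipartite matching to group the peaks into the M x N x 3 per-person output.

```python
import numpy as np

def find_peaks(cmap, threshold=0.1):
    """Per-channel local-maximum search on float confidence maps (H x W x C).

    Returns one list per keypoint channel of (x, y, score) peaks. The 0.1
    threshold and 4-connected neighborhood are illustrative assumptions.
    """
    h, w, c = cmap.shape
    # Pad with -inf so border pixels can still qualify as local maxima.
    padded = np.pad(cmap, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = center >= threshold
    # Require the center to beat each 4-connected neighbor.
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        is_peak &= center > padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    peaks = []
    for ch in range(c):
        ys, xs = np.nonzero(is_peak[:, :, ch])
        peaks.append([(int(x), int(y), float(cmap[y, x, ch]))
                      for y, x in zip(ys, xs)])
    return peaks
```

Each surviving peak is a candidate keypoint; the matching stage then scores candidate limb connections by integrating the part affinity fields between peak pairs.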


Limitations

Crowded scenes

The BodyPoseNet model does not give good results in very crowded scenes, especially when detecting the pose of small-scale people in the image.


Occlusion

The network may have difficulty estimating the poses of people who are occluded by other objects or persons.


Low contrast with the background

The network may have difficulty estimating the poses of people when there is no distinction from the background (for example, estimation may fail for a person wearing a black sweater against a dark background).

Model versions:

  • trainable_v1.0 - this pretrained model is intended to be used for finetuning on custom datasets using TAO.
  • deployable_v1.0 - this deployable model is intended to run on the inference pipeline. INT8 calibration files are provided for three resolutions: 224x320, 288x384, and 320x448. These calibration files are generated for TensorRT 7.
  • deployable_v1.0.1 - this deployable model is intended to run on the inference pipeline. INT8 calibration files are provided for three resolutions: 224x320, 288x384, and 320x448. These calibration files are generated for TensorRT 8.

The trainable and deployable models are encrypted and will only operate with the following key:

  • Model load key: nvidia_tlt

Please make sure to use this as the key for all TAO commands that require a model load key.


References

  • Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, Yaser Sheikh (2017). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.



License

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, please visit this link, or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Ethical Considerations

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.