Description

Pose classification network to classify poses of people from their skeletons.

Publisher

NVIDIA

Latest Version

deployable_onnx_v1.0

Modified

November 27, 2024

Size

12.14 MB

PoseClassificationNet Model Card

Description:

PoseClassificationNet recognizes the pose of people:

getting up
jumping
sitting
sitting down
standing
walking

This model is ready for commercial use.

References:

Citations

Yan, S., Xiong, Y., Lin, D.: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In: AAAI (2018)
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The Kinetics Human Action Video Dataset. In: arXiv (2017)

Using TAO Pre-trained Models

Get TAO Container
Get other purpose-built models from the NGC model registry

Model Architecture:

Architecture Type: Graph Convolutional Network (GCN)
Network Architecture: Spatial-Temporal Graph Convolutional Network (ST-GCN)

Input:

Input Type(s): Video
Input Format(s): MP4
Input Parameters: 5D
Other Properties Related to Input:
The input data for training or inference are formatted as a NumPy array in five dimensions (N, C, T, V, M):

N is the number of sequences.
C is the number of input channels, set to 3 in this example.
T is the maximum sequence length in frames, 300 in this case (10 seconds at 30 FPS).
V is the number of joint points, 34 for the NVIDIA format.
M is the number of persons. The pre-trained model assumes a single person, but the network can also support multiple people.
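
As an illustration of the (N, C, T, V, M) layout described above, the following is a minimal sketch that packs variable-length skeleton sequences into a single NumPy array. The helper name pack_sequences, the per-sequence input shape (frames, V, C), and the choice to zero-pad short sequences and truncate long ones are assumptions made for this example; this card does not specify them.

```python
import numpy as np

# Values taken from the dimension list above: channels, max frames, joints, persons.
C, T, V, M = 3, 300, 34, 1


def pack_sequences(sequences):
    """Pack skeleton sequences into one (N, C, T, V, M) float32 array.

    Assumption: each input sequence has shape (frames, V, C). Sequences shorter
    than T frames are zero-padded; longer ones are truncated to T frames.
    """
    batch = np.zeros((len(sequences), C, T, V, M), dtype=np.float32)
    for n, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=np.float32)[:T]  # keep at most T frames
        frames = seq.shape[0]
        # Reorder (frames, V, C) -> (C, frames, V) and write into person slot 0.
        batch[n, :, :frames, :, 0] = seq.transpose(2, 0, 1)
    return batch


# Two hypothetical clips: 3 seconds and 15 seconds at 30 FPS.
rng = np.random.default_rng(0)
clips = [rng.standard_normal((90, V, C)), rng.standard_normal((450, V, C))]
batch = pack_sequences(clips)
print(batch.shape)  # (2, 3, 300, 34, 1)
```

The first clip fills frames 0 to 89 and leaves the rest as zeros; the second clip is cut to its first 300 frames.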

Output:

Output Type(s): Label(s)
Output Format: Label: Text String
Other Properties Related to Output: Category Label(s): sitting_down, getting_up, sitting, standing, walking and jumping

Software Integration:

Runtime Engine(s):

TAO - 5.2
DeepStream 6.1 or later

Supported Hardware Architecture(s):

Ampere
Jetson
Hopper
Lovelace
Pascal
Turing
Volta

Supported Operating System(s):

Linux
Linux 4 Tegra

Model Version(s):

trainable_v1.1 - Pre-trained model for 3D body pose in the NVIDIA format.
deployable_v1.1 - Model for 3D body pose in the NVIDIA format deployable to DeepStream or TensorRT.

Training & Evaluation:

Training Dataset:

Data Collection Method by dataset:

Automatic/Sensors

Labeling Method by dataset:

Automated

Properties:
Proprietary, internal datasets with 6 annotated action classes, i.e., sitting_down, getting_up, sitting, standing, walking and jumping. The skeletons are based on the 34-keypoint NVIDIA format generated by the deepstream-bodypose-3d app. The dataset statistics are as follows:

classes	no. train sequences	no. val sequences	no. test sequences
sitting_down	1923	53	94
getting_up	1884	56	109
sitting	909	55	101
standing	1391	54	99
walking	1894	45	99
jumping	1440	55	90

Data Format

The output of model inference is an array of N elements that gives the predicted action class for each sequence.
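
To make the output format concrete, here is a small sketch that turns an array of predicted class IDs into label strings. The ID-to-label order used below is an assumption: it simply follows the order in which the classes are listed in this card, and it should be checked against the actual training configuration. The helper name ids_to_labels is also hypothetical.

```python
import numpy as np

# ASSUMPTION: class IDs follow the listing order used in this card.
CLASS_NAMES = ["sitting_down", "getting_up", "sitting", "standing", "walking", "jumping"]


def ids_to_labels(class_ids):
    """Map an iterable of integer class IDs to their label strings."""
    return [CLASS_NAMES[int(i)] for i in class_ids]


predictions = np.array([0, 4, 5])  # hypothetical model output for N = 3 sequences
labels = ids_to_labels(predictions)
print(labels)  # ['sitting_down', 'walking', 'jumping']
```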

The labels used for training or evaluation are stored as a pickle file that contains a list of two lists with N elements each, e.g., [["xl6vmD0XBS0.json", "OkLnSMGCWSw.json", "IBopZFDKfYk.json", "HpoFylcrYT4.json", "mlAtn_zi0bY.json", ...], [235, 388, 326, 306, 105, ...]]. The first list contains the N sample names as strings. The second list contains the labeled action class ID of each sequence.
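
The label file layout described above can be sketched with Python's standard pickle module. The sample names, class IDs, file name, and the helpers save_labels and load_labels are hypothetical and only illustrate the "list of two lists" structure.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical sample names and class IDs, one entry per sequence.
sample_names = ["clip_0001.json", "clip_0002.json", "clip_0003.json"]
class_ids = [0, 4, 5]


def save_labels(path, names, ids):
    """Write labels as a list of two equally long lists: [names, ids]."""
    if len(names) != len(ids):
        raise ValueError("names and ids must have the same length")
    with open(path, "wb") as f:
        pickle.dump([list(names), list(ids)], f)


def load_labels(path):
    """Read a label file back and check that both lists have N elements."""
    with open(path, "rb") as f:
        names, ids = pickle.load(f)
    if len(names) != len(ids):
        raise ValueError("label file is inconsistent")
    return names, ids


with tempfile.TemporaryDirectory() as tmp:
    label_path = Path(tmp) / "labels.pkl"
    save_labels(label_path, sample_names, class_ids)
    loaded_names, loaded_ids = load_labels(label_path)

print(list(zip(loaded_names, loaded_ids)))
```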

The graph used to model skeletons is defined by two configuration parameters:

graph_layout (string): Must be one of the following candidates:
- nvidia consists of 34 joints.
- openpose consists of 18 joints.
- human3.6m consists of 17 joints.
- ntu-rgb+d consists of 25 joints.
- ntu_edge consists of 24 joints.
- coco consists of 17 joints.
graph_strategy (string): Must be one of the following candidates (for more information, refer to the section "Partition Strategies" in the ST-GCN paper by Yan et al.):
- uniform: Uniform Labeling
- distance: Distance Partitioning
- spatial: Spatial Configuration
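
The two parameters above can be checked before training with a short sketch like the one below. The joint counts and strategy names come from the lists above; the function check_graph_config and its error messages are hypothetical and are not part of TAO.

```python
# Joint counts per graph_layout, as listed above.
NUM_JOINTS = {
    "nvidia": 34,
    "openpose": 18,
    "human3.6m": 17,
    "ntu-rgb+d": 25,
    "ntu_edge": 24,
    "coco": 17,
}
GRAPH_STRATEGIES = ("uniform", "distance", "spatial")


def check_graph_config(graph_layout, graph_strategy, num_joints_in_data):
    """Validate a layout/strategy pair against the V dimension of the data.

    Returns the expected joint count, or raises ValueError on a mismatch.
    """
    if graph_layout not in NUM_JOINTS:
        raise ValueError(f"unknown graph_layout: {graph_layout!r}")
    if graph_strategy not in GRAPH_STRATEGIES:
        raise ValueError(f"unknown graph_strategy: {graph_strategy!r}")
    expected = NUM_JOINTS[graph_layout]
    if num_joints_in_data != expected:
        raise ValueError(
            f"layout {graph_layout!r} expects {expected} joints, "
            f"but the data has {num_joints_in_data}"
        )
    return expected


print(check_graph_config("nvidia", "spatial", 34))  # 34
```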

Evaluation Dataset:

Data Collection Method by dataset:

Automatic/Sensors

Labeling Method by dataset:

Automated

Properties: ~100 random sequences per class from the training dataset described above.

Methodology and KPI

The key performance indicator is the accuracy of action recognition, i.e., the ratio of correctly predicted samples to the total labeled samples.

Name	Score
Class accuracy: sitting_down	98.94
Class accuracy: getting_up	99.08
Class accuracy: sitting	87.13
Class accuracy: standing	80.81
Class accuracy: walking	92.93
Class accuracy: jumping	85.56
Total accuracy	90.88
Average class accuracy	90.74
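
The two aggregate scores can be reproduced from the per-class accuracies above and the test sequence counts in the dataset statistics table. The sketch below assumes that total accuracy is the test-count-weighted mean of the class accuracies (equivalent to correct samples over all samples) and that average class accuracy is their unweighted mean.

```python
# Test sequence counts per class, from the dataset statistics table.
test_counts = {
    "sitting_down": 94,
    "getting_up": 109,
    "sitting": 101,
    "standing": 99,
    "walking": 99,
    "jumping": 90,
}
# Per-class accuracy in percent, from the KPI table.
class_accuracy = {
    "sitting_down": 98.94,
    "getting_up": 99.08,
    "sitting": 87.13,
    "standing": 80.81,
    "walking": 92.93,
    "jumping": 85.56,
}

# Total accuracy: weight each class by its number of test sequences.
total_accuracy = sum(
    class_accuracy[name] * test_counts[name] for name in test_counts
) / sum(test_counts.values())

# Average class accuracy: every class counts equally.
average_class_accuracy = sum(class_accuracy.values()) / len(class_accuracy)

print(f"Total accuracy: {total_accuracy:.2f}")                  # 90.88
print(f"Average class accuracy: {average_class_accuracy:.2f}")  # 90.74
```

The two numbers differ because the classes have slightly different test set sizes.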

Inference:

Engine: TensorRT
Test Hardware:

Jetson AGX Xavier
Xavier NX
Orin
Orin NX
NVIDIA T4
Ampere GPU
A2
A30
L4
T4
DGX H100
DGX A100
L40
Jetson AGX Orin 64GB
Orin NX 16GB
Orin Nano 8GB

Inference performance was measured with trtexec on NVIDIA Ampere and Jetson GPUs. End-to-end performance with streaming video data may vary slightly depending on the application.

Model	Graph Layout	Device	Precision	Batch Size	Latency (ms)	Sequences per Second
ST-GCN	NVIDIA (34 keypoints)	A10	TF32	1	2.89	346.45
ST-GCN	NVIDIA (34 keypoints)	A10	TF32	4	9.86	101.38
ST-GCN	NVIDIA (34 keypoints)	A10	TF32	16	33.86	29.53
ST-GCN	NVIDIA (34 keypoints)	A10	Mixed	1	1.59	628.45
ST-GCN	NVIDIA (34 keypoints)	A10	Mixed	4	5.57	179.67
ST-GCN	NVIDIA (34 keypoints)	A10	Mixed	16	20.47	48.84
ST-GCN	NVIDIA (34 keypoints)	A30	TF32	1	2.14	336.12
ST-GCN	NVIDIA (34 keypoints)	A30	TF32	4	6.87	145.59
ST-GCN	NVIDIA (34 keypoints)	A30	TF32	16	23.92	41.80
ST-GCN	NVIDIA (34 keypoints)	A30	Mixed	1	1.28	780.07
ST-GCN	NVIDIA (34 keypoints)	A30	Mixed	4	4.10	244.08
ST-GCN	NVIDIA (34 keypoints)	A30	Mixed	16	14.85	67.33
ST-GCN	NVIDIA (34 keypoints)	Jetson AGX Orin	Best	1	4.58	218.14
ST-GCN	NVIDIA (34 keypoints)	Jetson AGX Orin	Best	4	16.28	61.41
ST-GCN	NVIDIA (34 keypoints)	Jetson AGX Orin	Best	16	61.61	16.23

How to use this model

This model must be used with NVIDIA hardware and software. It can run on any NVIDIA GPU, including NVIDIA Jetson devices, and can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream SDK, or TensorRT.

The primary use case for this model is recognizing an action from a sequence of skeletons. The maximum sequence length is 300 frames.

A pre-trained model is provided:

st-gcn_3dbp_nvidia

It is intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's own dataset for skeleton-based action recognition. High-fidelity models can be trained for new use cases. The Jupyter notebook available as part of the TAO container can be used for re-training.

The model is also intended for easy deployment to the edge using DeepStream SDK or TensorRT. DeepStream provides facilities for creating efficient video analytics pipelines that capture, decode, and pre-process the data before running inference.

The model is encrypted and can be decrypted with the following key:

Model load key: nvidia_tao

Please make sure to use this as the key for all TAO commands that require a model load key.

Instructions to use the model with TAO toolkit

To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training a PoseClassificationNet. For more information on the experiment spec file, refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/st-gcn_3dbp_nvidia.tlt"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True

Instructions to deploy the model with Triton Inference Server

To create the entire end-to-end video analytic application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.

To deploy this model with Triton Inference Server and end-to-end inference from video, please refer to the TAO Triton apps.

Technical blogs

Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
Train like a ‘pro’ without being an AI expert using TAO AutoML
Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
Customize Action Recognition with TAO and deploy with DeepStream
Read the two-part blog on training and optimizing a 2D body pose estimation model with TAO - Part 1 | Part 2
Learn how to train a real-time license plate detection and recognition app with TAO and DeepStream.
Model accuracy is extremely important. Learn how you can achieve state-of-the-art accuracy for classification and object detection models using TAO

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.