
Pose Classification

Description: Pose classification network to classify poses of people from their skeletons.
Publisher: NVIDIA
Latest Version: deployable_onnx_v1.0
Modified: March 13, 2024
Size: 12.14 MB

PoseClassificationNet Model Card

Model Overview

The model described in this card is a pose classification network, which recognizes what people are doing in videos based on their skeletons. It is a graph convolutional network (GCN). A pre-trained PoseClassificationNet model based on 3D body poses is delivered. The model is trained on an NVIDIA dataset with 6 annotated action classes.

Model Architecture

The model has an ST-GCN backbone. It takes a sequence of skeletons as input and predicts the action label for those frames.

Training

The training algorithm optimizes the network to minimize the cross-entropy loss for classification.

Training Data

The model is trained on an NVIDIA dataset with 6 annotated action classes: sitting_down, getting_up, sitting, standing, walking, and jumping. The skeletons follow the 34-keypoint NVIDIA format generated by the deepstream-bodypose-3d app. The dataset statistics are as follows:

  • Class distribution:

    Class          Train sequences   Val sequences   Test sequences
    sitting_down   1923              53              94
    getting_up     1884              56              109
    sitting        909               55              101
    standing       1391              54              99
    walking        1894              45              99
    jumping        1440              55              90

Data Format

The input data for training or inference are formatted as a NumPy array in five dimensions (N, C, T, V, M):

  1. N indicates the number of sequences.
  2. C stands for the number of input channels, which is set to 3 in this example.
  3. T represents the maximum sequence length in frames, which is 300 (10 seconds at 30 FPS) in our case.
  4. V defines the number of joint points, set to 34 for the NVIDIA format.
  5. M means the number of persons. Our pre-trained model assumes a single person, but it can also support multiple people.

The output of model inference is an array of N elements that gives the predicted action class for each sequence.
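
As a concrete illustration, the following sketch allocates a dummy training batch in this layout (the file name is hypothetical):

import numpy as np

# N sequences, C=3 channels, T=300 frames, V=34 joints, M=1 person
data = np.zeros((8, 3, 300, 34, 1), dtype=np.float32)

# In practice the channels hold 3D joint coordinates produced by
# the deepstream-bodypose-3d app.
np.save("train_data.npy", data)  # hypothetical file name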

The labels used for training or evaluation are stored as a pickle file that consists of a list of two lists, each with N elements, e.g., [["xl6vmD0XBS0.json", "OkLnSMGCWSw.json", "IBopZFDKfYk.json", "HpoFylcrYT4.json", "mlAtn_zi0bY.json", ...], [235, 388, 326, 306, 105, ...]]. The first list contains the N sample names as strings; the second lists the labeled action class ID of each sequence.
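
For illustration, a label file with this structure could be written and read back as follows (file and sample names are hypothetical):

import pickle

sample_names = ["seq_0001.json", "seq_0002.json"]  # hypothetical names
class_ids = [0, 3]  # labeled action class ID per sequence

with open("train_label.pkl", "wb") as f:
    pickle.dump([sample_names, class_ids], f)

with open("train_label.pkl", "rb") as f:
    names, labels = pickle.load(f)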

The graph used to model skeletons is defined by two configuration parameters:

  1. graph_layout (string): Must be one of the following candidates:

    • nvidia: 34 joints (the NVIDIA format generated by the deepstream-bodypose-3d app).
    • openpose: 18 joints.
    • human3.6m: 17 joints.
    • ntu-rgb+d: 25 joints.
    • ntu_edge: 24 joints.
    • coco: 17 joints.

  2. graph_strategy (string): Must be one of the following candidates (for more information, refer to the section "Partition Strategies" in the ST-GCN paper cited below); a simplified sketch of the first two strategies follows this list:

    • uniform: Uniform Labeling
    • distance: Distance Partitioning
    • spatial: Spatial Configuration
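
As a rough illustration (not the toolkit's implementation), the sketch below shows how the first two strategies partition a skeleton's adjacency matrix into the subsets used by the graph convolution. The spatial strategy is omitted because it additionally depends on a designated center joint:

import numpy as np

def normalize(adj):
    # Column-normalize the adjacency matrix by node degree (A @ D^-1).
    degree = adj.sum(axis=0)
    inv_degree = np.where(degree > 0, 1.0 / degree, 0.0)
    return adj * inv_degree  # scales column j by 1/degree_j

def partition(adj, strategy="uniform"):
    # Build a (K, V, V) stack of adjacency subsets for the graph convolution.
    num_joints = adj.shape[0]
    identity = np.eye(num_joints)
    norm = normalize(adj + identity)  # add self-loops, then normalize
    if strategy == "uniform":
        # Uniform Labeling: all neighbors share one subset (K = 1).
        return norm[None, :, :]
    if strategy == "distance":
        # Distance Partitioning: the root joint (hop 0) and its
        # 1-hop neighbors form separate subsets (K = 2).
        return np.stack([norm * identity, norm * (1.0 - identity)])
    raise ValueError(f"unsupported strategy: {strategy}")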

Performance

Test Data

As shown in the class distribution table above, the test dataset is obtained by randomly sampling ~100 sequences per class.

Methodology and KPI

The key performance indicator is the accuracy of action recognition, i.e., the ratio of correctly predicted samples to the total labeled samples.

Name                           Score (%)
Class accuracy: sitting_down   98.94
Class accuracy: getting_up     99.08
Class accuracy: sitting        87.13
Class accuracy: standing       80.81
Class accuracy: walking        92.93
Class accuracy: jumping        85.56
Total accuracy                 90.88
Average class accuracy         90.74
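
For clarity, total accuracy weights every test sequence equally, while average class accuracy is the unweighted mean of the per-class scores, which is why the two summary numbers differ. A minimal sketch of both metrics, assuming labels and predictions as integer arrays:

import numpy as np

def accuracy_metrics(labels, preds):
    # Total accuracy: correctly predicted samples / all labeled samples.
    total = float(np.mean(preds == labels))
    # Average class accuracy: mean of the per-class accuracies.
    per_class = [float(np.mean(preds[labels == c] == c))
                 for c in np.unique(labels)]
    return total, float(np.mean(per_class))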

Real-time Inference Performance

The inference performance below was measured with trtexec on NVIDIA Ampere and Jetson GPUs. End-to-end performance with streaming video data might vary slightly depending on the application use case.

Model   Graph Layout           Device           Precision  Batch Size  Latency (ms)  Sequences per Second
ST-GCN  NVIDIA (34 keypoints)  A10              TF32       1           2.89          346.45
ST-GCN  NVIDIA (34 keypoints)  A10              TF32       4           9.86          101.38
ST-GCN  NVIDIA (34 keypoints)  A10              TF32       16          33.86         29.53
ST-GCN  NVIDIA (34 keypoints)  A10              Mixed      1           1.59          628.45
ST-GCN  NVIDIA (34 keypoints)  A10              Mixed      4           5.57          179.67
ST-GCN  NVIDIA (34 keypoints)  A10              Mixed      16          20.47         48.84
ST-GCN  NVIDIA (34 keypoints)  A30              TF32       1           2.14          336.12
ST-GCN  NVIDIA (34 keypoints)  A30              TF32       4           6.87          145.59
ST-GCN  NVIDIA (34 keypoints)  A30              TF32       16          23.92         41.80
ST-GCN  NVIDIA (34 keypoints)  A30              Mixed      1           1.28          780.07
ST-GCN  NVIDIA (34 keypoints)  A30              Mixed      4           4.10          244.08
ST-GCN  NVIDIA (34 keypoints)  A30              Mixed      16          14.85         67.33
ST-GCN  NVIDIA (34 keypoints)  Jetson AGX Orin  Best       1           4.58          218.14
ST-GCN  NVIDIA (34 keypoints)  Jetson AGX Orin  Best       4           16.28         61.41
ST-GCN  NVIDIA (34 keypoints)  Jetson AGX Orin  Best       16          61.61         16.23
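
For reference, a comparable measurement can be reproduced with a trtexec invocation along the following lines; the ONNX file name and the input tensor name are assumptions, and --shapes is only needed if the export uses dynamic dimensions (--fp16 corresponds roughly to the mixed-precision rows, --best to the Jetson rows):

trtexec --onnx=st-gcn_3dbp_nvidia.onnx \
        --fp16 \
        --shapes=input:16x3x300x34x1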

How to use this model

This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. This model can only be used with the Train Adapt Optimize (TAO) Toolkit, the DeepStream SDK, or TensorRT.

The primary use case intended for this model is recognizing actions from sequences of skeletons. The maximum sequence length is 300 frames.

A pre-trained model is provided:

  • st-gcn_3dbp_nvidia

It is intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's own skeleton-based action recognition dataset. High-fidelity models can be trained for new use cases. The Jupyter notebook available as part of the TAO container can be used to re-train the model.

The model is also intended for easy deployment to the edge using the DeepStream SDK or TensorRT. DeepStream provides facilities to create efficient video analytics pipelines that capture, decode, and pre-process the data before running inference.

The model is encrypted and can be decrypted with the following key:

  • Model load key: nvidia_tao

Please make sure to use this as the key for all TAO commands that require a model load key.

Input

3 × 300 × 34 × 1 (C × T × V × M)

Output

The classification logits
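
As a minimal sketch of running the deployable ONNX model directly (outside DeepStream or Triton), assuming ONNX Runtime is installed and the export has a leading batch dimension (the file name is an assumption):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("st-gcn_3dbp_nvidia.onnx")
input_name = session.get_inputs()[0].name

# One dummy sequence in (N, C, T, V, M) layout.
x = np.zeros((1, 3, 300, 34, 1), dtype=np.float32)

# The output is the classification logits; argmax yields the class ID.
logits = session.run(None, {input_name: x})[0]
print(logits.argmax(axis=-1))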

Instructions to use the model with the TAO Toolkit

To use the model as pre-trained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file to train a PoseClassificationNet. For more information on the experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/st-gcn_3dbp_nvidia.tlt"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True
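
With the spec file prepared, training is launched through the TAO launcher; the exact syntax varies between TAO Toolkit versions, but the invocation typically looks like the following (paths are placeholders, and the key is the model load key given above):

tao pose_classification train \
    -e /path/to/experiment_spec.yaml \
    -r /path/to/results \
    -k nvidia_tao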

Instructions to deploy the model with Triton Inference Server

To create the entire end-to-end video analytic application, deploy this model with Triton Inference Server. NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production. Triton supports direct integration of this model into the server and inference from a client.

To deploy this model with Triton Inference Server and end-to-end inference from video, please refer to the TAO Triton apps.

Limitations

NVIDIA PoseClassificationNet is trained on an NVIDIA dataset with 6 annotated action classes; other action classes cannot be recognized correctly. The accuracy of the model on external videos is also expected to be lower than the numbers reported in the Performance section.

In general, more labeled data are needed to fine-tune the pre-trained model through the TAO Toolkit and obtain better accuracy.

Model versions:

  • trainable_v1.1 - Pre-trained model for 3D body pose in the NVIDIA format.
  • deployable_v1.1 - Model for 3D body pose in the NVIDIA format deployable to DeepStream or TensorRT.

Reference

Citations

  • Yan, S., Xiong, Y., Lin, D.: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In: AAAI (2018)
  • Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The Kinetics Human Action Video Dataset. In: arXiv (2017)

License

License to use the model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical AI

NVIDIA PoseClassificationNet model classifies the action in a sequence of skeletons.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.