5-class action recognition network that recognizes what people are doing in a sequence of video frames.
ActionRecognitionNet Model Card
Description:
ActionRecognitionNet recognizes actions of people in a sequence of video frames:
- walking,
- riding a bike,
- running,
- falling on the floor, and
- pushing
This model is ready for commercial use.
References:
Citations
- Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A Large Video Database for Human Motion Recognition. In: ICCV (2011)
- Simonyan, K., Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv preprint arXiv:1406.2199 (2014)
- Carreira, J., Zisserman, A.: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In: CVPR, pp. 6299-6308 (2017)
- He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2016)
Using TAO Pre-trained Models
- Get TAO Container
- Get other purpose-built models from the NGC model registry:
- TrafficCamNet
- PeopleNet
- PeopleNet-Transformer
- DashCamNet
- FaceDetectIR
- VehicleMakeNet
- VehicleTypeNet
- PeopleSegNet
- PeopleSemSegNet
- License Plate Detection
- License Plate Recognition
- PoseClassificationNet
- Facial Landmark
- FaceDetect
- 2D Body Pose Estimation
- ActionRecognitionNet
- People ReIdentification
- PointPillarNet
- CitySegFormer
- Retail Object Detection
- Retail Object Embedding
- Optical Inspection
- Optical Character Detection
- Optical Character Recognition
- PCB Classification
- PeopleSemSegFormer
Model Architecture:
Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: ResNet18
Input:
Input Type(s): Images from video
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: (3D, 4D)
Other Properties Related to Input:
- RGB model:
  - 3D model: 3 x 32 x 224 x 224 (C x D x H x W)
  - 2D model: 96 x 224 x 224 ((C x D) x H x W)
- Optical flow model:
  - 3D model: 2 x 32 x 224 x 224 (C x D x H x W)
  - 2D model: 64 x 224 x 224 ((C x D) x H x W)
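The four input shapes above follow one pattern: the 3D models keep the frame (depth) axis separate, while the 2D models fold it into the channel axis (3 x 32 = 96 for RGB, 2 x 32 = 64 for optical flow). The sketch below reproduces that arithmetic. The function name `input_shape` and its parameters are hypothetical helpers for illustration, not part of TAO.

```python
def input_shape(model_type, input_type, seq_length=32, height=224, width=224):
    """Return the expected input tensor shape (without the batch dimension).

    model_type: "rgb" (3 channels per frame) or "of" (2 channels per frame,
                the u and v optical flow components).
    input_type: "3d" keeps the frame axis, "2d" folds it into the channels.
    """
    channels_per_frame = {"rgb": 3, "of": 2}
    if model_type not in channels_per_frame:
        raise ValueError(f"unknown model_type: {model_type!r}")
    channels = channels_per_frame[model_type]

    if input_type == "3d":
        # (C, D, H, W)
        return (channels, seq_length, height, width)
    if input_type == "2d":
        # ((C x D), H, W): frames are stacked along the channel axis
        return (channels * seq_length, height, width)
    raise ValueError(f"unknown input_type: {input_type!r}")


if __name__ == "__main__":
    for model_type in ("rgb", "of"):
        for input_type in ("3d", "2d"):
            print(model_type, input_type, input_shape(model_type, input_type))
```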
Output:
Output Type(s): Label(s)
Output Format: Label: Text String
Other Properties Related to Output: Category Label(s): walk, ride_bike, run, fall_floor, and push
Software Integration:
Runtime Engine(s):
- TAO - 5.2
- DeepStream 6.1 or later
Supported Hardware Architecture(s):
- Ampere
- Jetson
- Hopper
- Lovelace
- Pascal
- Turing
- Volta
Supported Operating System(s):
- Linux
- Linux for Tegra (L4T)
Model Version(s):
- trainable_v1.0 - Pre-trained 2D/3D RGB-only and optical-flow-only (OF-only) models.
- deployable_v1.0 - 2D/3D RGB-only models deployable to DeepStream.
Training & Evaluation:
Training Dataset:
Data Collection Method by dataset:
- Unknown
Labeling Method by dataset:
- Unknown
Properties:
Trained on 1024 videos of people walking, riding a bike, running, falling on the floor, and pushing. The videos vary in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality.
- Class distribution:

| class | number of videos |
|---|---|
| walk | 494 |
| ride_bike | 93 |
| run | 209 |
| fall_floor | 123 |
| push | 105 |

- visible body parts: upper body, full body, lower body
- camera motion: motion, static
- camera viewpoint: front, back, left, right
- number of people involved in the action: single, two, three
- video quality: good, medium, bad
- video size: most of the videos are 320x240
Data Format
The dataset must follow this directory layout:
/Dataset_01
    /class_1
        /video_1
            /rgb
                0000.png
                0001.png
                0002.png
                ...
                N.png
            /u
                0000.jpg
                0001.jpg
                0002.jpg
                ...
                N.jpg
            /v
                0000.jpg
                0001.jpg
                0002.jpg
                ...
                N.jpg
TAO Toolkit supports training ActionRecognitionNet with RGB input or optical flow input. The dataset should be divided into one directory per class. Each class directory contains multiple video clip folders, and each clip folder contains the corresponding RGB frames (rgb), optical flow x-axis grayscale images (u), and optical flow y-axis grayscale images (v).
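Before training, it can help to confirm that a dataset actually follows this layout. The following is a minimal sketch, not a TAO utility: `list_clips` is a hypothetical helper that walks a dataset root and counts the files in the `rgb`, `u`, and `v` folders of every clip, so that missing folders or mismatched frame counts are easy to spot.

```python
from pathlib import Path


def list_clips(dataset_root):
    """Return a list of (class_name, clip_name, counts) tuples.

    counts maps each expected sub-folder ("rgb", "u", "v") to the number of
    files it contains, or 0 if the sub-folder is missing.
    """
    root = Path(dataset_root)
    clips = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for clip_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            counts = {}
            for sub in ("rgb", "u", "v"):
                sub_dir = clip_dir / sub
                if sub_dir.is_dir():
                    counts[sub] = sum(1 for f in sub_dir.iterdir() if f.is_file())
                else:
                    counts[sub] = 0
            clips.append((class_dir.name, clip_dir.name, counts))
    return clips


def report_short_clips(dataset_root, seq_length=32):
    """Print clips whose RGB folder has fewer frames than one input sequence."""
    for class_name, clip_name, counts in list_clips(dataset_root):
        if counts["rgb"] < seq_length:
            print(f"{class_name}/{clip_name}: only {counts['rgb']} RGB frames")
```

A clip with fewer than 32 RGB frames cannot fill one input sequence, which is why `report_short_clips` uses the sequence length as its threshold. How TAO itself handles such clips is not covered here.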
Evaluation Dataset:
Data Collection Method by dataset:
- Unknown
Labeling Method by dataset:
- Unknown
Properties:
The evaluation dataset is obtained by randomly selecting 10% of the videos in each class of HMDB5. The videos also vary in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality.
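A per-class random 10% selection like the one described above can be sketched as follows. This is an illustration only: the helper name `sample_eval_split`, the fixed seed, and the rounding rule (integer division, at least one video per class) are assumptions, and the exact procedure used to build the published evaluation set is not documented here.

```python
import random


def sample_eval_split(videos_by_class, percent=10, seed=0):
    """Randomly hold out `percent` of the videos in each class.

    videos_by_class: dict mapping class name -> list of video identifiers.
    Returns (remaining, held_out), two dicts with the same keys.
    """
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    remaining, held_out = {}, {}
    for label, videos in videos_by_class.items():
        shuffled = sorted(videos)
        rng.shuffle(shuffled)
        # assumption: round down, but keep at least one video per class
        n_eval = max(1, len(shuffled) * percent // 100)
        held_out[label] = shuffled[:n_eval]
        remaining[label] = shuffled[n_eval:]
    return remaining, held_out
```

Sampling within each class, rather than over the whole dataset, keeps the evaluation set's class proportions close to those of the source data, which matters here because the classes are imbalanced.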
Methodology and KPI
The key performance indicator (KPI) is action recognition accuracy, measured in two ways. Center evaluation runs inference on the middle frames of the video clip. For example, if the model requires 32 frames as input and a video clip has 128 frames, frames 48 through 79 are used for inference. Conv evaluation runs inference on 10 segments of a video clip: the clip is divided uniformly into 10 parts, the center of each part is taken as a start point, and 32 consecutive frames are picked from each start point to form an inference segment. The final label of the video is determined by the average score of those 10 segments.
| model | dataset | center accuracy | conv accuracy |
|---|---|---|---|
| resnet18_2d_rgb_hmdb5_32 | HMDB5 | 84.69% | 82.88% |
| resnet18_3d_rgb_hmdb5_32 | HMDB5 | 84.69% | 85.59% |
| resnet18_2d_of_hmdb5_32_a100 | HMDB5 | 78.38% | 81.08% |
| resnet18_2d_of_hmdb5_32_xavier | HMDB5 | 80.18% | 82.88% |
| resnet18_3d_of_hmdb5_32_a100 | HMDB5 | 91.89% | 92.79% |
| resnet18_3d_of_hmdb5_32_xavier | HMDB5 | 90.99% | 95.50% |
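The center and conv frame-selection schemes behind these two accuracy columns can be sketched in plain Python. The function names are hypothetical, and one detail is an assumption: when a window starting at a segment center would run past the end of the clip, this sketch clamps the start index so the window still fits. The source does not say how that case is handled.

```python
def center_window(num_frames, seq_len=32):
    """Frame indices for center evaluation: the middle seq_len frames."""
    if num_frames < seq_len:
        raise ValueError("clip is shorter than one input sequence")
    start = (num_frames - seq_len) // 2
    return list(range(start, start + seq_len))


def conv_windows(num_frames, seq_len=32, num_segments=10):
    """Frame indices for conv evaluation: one window per uniform segment."""
    if num_frames < seq_len:
        raise ValueError("clip is shorter than one input sequence")
    segment_len = num_frames / num_segments
    windows = []
    for i in range(num_segments):
        # start at the center of segment i
        start = int(i * segment_len + segment_len / 2)
        # assumption: clamp so the window does not run past the last frame
        start = min(start, num_frames - seq_len)
        windows.append(list(range(start, start + seq_len)))
    return windows


def average_scores(segment_scores):
    """Average per-class scores over segments; return (label_index, mean)."""
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    mean = [
        sum(scores[c] for scores in segment_scores) / num_segments
        for c in range(num_classes)
    ]
    label_index = max(range(num_classes), key=mean.__getitem__)
    return label_index, mean


if __name__ == "__main__":
    window = center_window(128)
    print(window[0], window[-1])  # 48 79, matching the example in the text
```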
Inference:
Engine: TensorRT
Test Hardware:
- Jetson AGX Xavier
- Xavier NX
- Orin
- Orin NX
- NVIDIA T4
- Ampere GPU
- A2
- A30
- L4
- DGX H100
- DGX A100
- L40
- Jetson AGX Orin 64GB
- Orin NX 16GB
- Orin Nano 8GB
Inference uses FP16 precision. Performance was measured with trtexec on Jetson Nano, Xavier NX, AGX Xavier, and NVIDIA T4 GPU. The Jetson devices were run in the Max-N configuration for maximum system performance. The numbers reflect inference-only performance; end-to-end performance with streaming video data may vary slightly depending on the application.
| Model Type | Device | precision | batch_size | FPS |
|---|---|---|---|---|
| 3D-RGB | Jetson Nano | FP16 | 1 | 0.57 |
| 3D-RGB | Jetson NX | FP16 | 4 | 4.9 |
| 3D-RGB | Jetson Xavier | FP16 | 4 | 33 |
| 3D-RGB | T4 | FP16 | 4 | 137 |
| 2D-RGB | Jetson Nano | FP16 | 1 | 30 |
| 2D-RGB | Jetson NX | FP16 | 16 | 250 |
| 2D-RGB | Jetson Xavier | FP16 | 16 | 490 |
| 2D-RGB | T4 | FP16 | 16 | 1818 |
How to use this model
This model needs to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. This model can only be used with Train Adapt Optimize (TAO) Toolkit, DeepStream SDK or TensorRT.
The primary use case for this model is recognizing actions from a sequence of RGB frames or optical flow grayscale images. The sequence length is 32 frames.
There are six models provided:
- resnet18_2d_rgb_hmdb5_32
- resnet18_3d_rgb_hmdb5_32
- resnet18_2d_of_hmdb5_32_a100
- resnet18_2d_of_hmdb5_32_xavier
- resnet18_3d_of_hmdb5_32_a100
- resnet18_3d_of_hmdb5_32_xavier
They are intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's own action recognition dataset. High-fidelity models can be trained for new use cases. The Jupyter notebook available as part of the TAO container can be used for re-training.
These models are also intended for easy deployment to the edge using DeepStream SDK or TensorRT. DeepStream provides the facilities to create efficient video analytics pipelines that capture, decode, and pre-process the data before running inference.
The models are encrypted and can be decrypted with the following key:
- Model load key: nvidia_tao
Please make sure to use this as the key for all TAO commands that require a model load key.
Instructions to use the model with TAO toolkit
In order to use these models as pretrained weights for transfer learning, please use the snippet below as a template for the model_config component of the experiment spec file to train a 2D/3D ActionRecognitionNet. For more information on experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.
model_config:
  model_type: rgb
  # model_type: of
  input_type: "2d"
  # input_type: "3d"
  backbone: resnet18
  rgb_seq_length: 32
  rgb_pretrained_model_path: /workspace/action_recognition/resnet18_2d_rgb_hmdb5_32.tlt
  # rgb_pretrained_model_path: /workspace/action_recognition/resnet18_3d_rgb_hmdb5_32.tlt
  rgb_pretrained_num_classes: 5
  # of_pretrained_model_path: /workspace/action_recognition/resnet18_2d_of_hmdb5_32_a100.tlt
  # of_pretrained_num_classes: 5
  sample_rate: 1
Instructions to deploy the model with DeepStream
To create an end-to-end video analytics application, deploy this model with DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample app.
To deploy this model with DeepStream 6.0, please refer to the sample code in sources/apps/sample_apps/deepstream-3d-action-recognition/ in the DeepStream SDK.
Technical blogs
- Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
- Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
- Train like a ‘pro’ without being an AI expert using TAO AutoML
- Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
- Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
- Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
- Customize Action Recognition with TAO and deploy with DeepStream
- Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
- Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
- Model accuracy is extremely important, learn how you can achieve state of the art accuracy for classification and object detection models using TAO
Suggested reading
- More information about TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
- TAO documentation
- Read the TAO Getting Started guide and release notes.
- If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
- Deploy your models for video analytics application using DeepStream. Learn more about DeepStream SDK
- Deploy your models in Riva for ConvAI use case.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.