
Action Recognition Net



5-class action recognition network that recognizes what people do in a video.



Use Case

Action Recognition


Transfer Learning Toolkit

Latest Version



May 17, 2022


296.42 MB

ActionRecognitionNet Model Card

Model Overview

The model described in this card is an action recognition network that aims to recognize what people do in videos. Six pretrained ActionRecognitionNet models are delivered: three 2D models, trained respectively on RGB frames, on optical flow generated on an A100 with the NVOF SDK, and on optical flow generated on a Jetson Xavier with VPI; and three 3D models trained on the same three input types. All six models are trained on a subset of HMDB51.

Model Architecture

Both the 2D and 3D models use a ResNet-style backbone. They take a sequence of RGB frames or optical flow grayscale images as input and predict the action label for those frames.


The training algorithm optimizes the network to minimize the cross entropy loss for classification.
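
As a worked illustration of that objective, the sketch below computes the cross-entropy loss for a single clip from raw classification logits. It is a minimal pure-Python example, not the TAO training code; the function name `cross_entropy` and the sample logits are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy loss for one sample, computed from raw logits."""
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    # Equivalent to -log(softmax(logits)[target_index]).
    return log_sum_exp - logits[target_index]

# Hypothetical logits for five classes; index 1 is treated as the true class.
logits = [0.2, 2.5, -0.3, 0.1, 0.0]
print(round(cross_entropy(logits, 1), 4))
```

The loss shrinks as the logit of the true class grows relative to the others, which is what training pushes toward.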

Training Data

The models are trained on a subset of HMDB51. We pick the videos of walk, ride_bike, run, fall_floor and push out of HMDB51 to form HMDB5. The training videos vary in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality. The dataset statistics are:

  • Class distribution:

    classes       number of videos
    walk          494
    ride_bike     93
    run           209
    fall_floor    123
    push          105
  • visible body parts: upper body, full body, lower body

  • camera motion: motion, static

  • camera view point: front, back, left, right

  • number of people involved in the action: single, two, three

  • video quality: good, medium, bad

  • video size: most videos are 320x240
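
The class distribution above is noticeably imbalanced. The short sketch below, which only uses the counts from the table, totals the videos and prints each class's share; the names `CLASS_COUNTS` and `class_fractions` are introduced here for illustration.

```python
# Video counts per class, copied from the class distribution table.
CLASS_COUNTS = {
    "walk": 494,
    "ride_bike": 93,
    "run": 209,
    "fall_floor": 123,
    "push": 105,
}

def class_fractions(counts):
    """Return each class's share of the dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

total_videos = sum(CLASS_COUNTS.values())
fractions = class_fractions(CLASS_COUNTS)
for name, frac in fractions.items():
    print(f"{name:<12}{CLASS_COUNTS[name]:>5}{frac:>8.1%}")
print(f"{'total':<12}{total_videos:>5}")
```

The counts sum to 1024 videos, with walk making up close to half of them, so a custom dataset with a different balance may behave differently after fine-tuning.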

Data Format

The dataset must be organized as follows.


TAO Toolkit supports training ActionRecognitionNet with RGB input or optical flow input. The dataset should be divided into one directory per class. Each class directory contains multiple video clip folders, and each clip folder contains the corresponding RGB frames (rgb), optical flow x-axis grayscale images (u), and optical flow y-axis grayscale images (v).
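
To sanity-check a dataset against this layout, a small script along the following lines can help. It is a sketch that assumes the class / clip / (rgb, u, v) nesting described above; the helper name `index_clips` and any clip or file names are hypothetical, and TAO itself does not provide this function.

```python
from pathlib import Path

MODALITIES = ("rgb", "u", "v")  # sub-folder names taken from the description above

def index_clips(dataset_root):
    """Map class name -> clip name -> number of files found per modality."""
    index = {}
    root = Path(dataset_root)
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        clips = {}
        for clip_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            counts = {}
            for modality in MODALITIES:
                folder = clip_dir / modality
                # A missing modality folder is reported as zero files.
                if folder.is_dir():
                    counts[modality] = sum(1 for f in folder.iterdir() if f.is_file())
                else:
                    counts[modality] = 0
            clips[clip_dir.name] = counts
        index[class_dir.name] = clips
    return index
```

Calling `index_clips` on the dataset root and looking for clips with zero `rgb`, `u`, or `v` files is a quick way to spot clips that are unusable for the chosen input type.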


Evaluation Data

The evaluation dataset is obtained by randomly selecting 10% of the videos per class from HMDB5. The videos are also diverse in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality.

Methodology and KPI

The key performance indicator is the accuracy of action recognition. Center evaluation runs inference on the middle frames of the video clip. For example, if the model requires 32 frames as input and a video clip has 128 frames, the frames from index 48 to index 79 are used for inference. Conv evaluation runs inference on 10 segments of a video clip: the clip is divided uniformly into 10 parts, the center of each part is taken as a start point, and 32 consecutive frames are picked from each start point to form an inference segment. The final label of the video is determined by the average score of those 10 segments.
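
The two sampling schemes can be sketched in a few lines of Python. This is an illustration of the description above, not the TAO implementation: the function names are hypothetical, clips are assumed to have at least 32 frames, and clamping a start point so the segment stays inside the clip is an assumption, since the text does not say how segments near the end of a short clip are handled.

```python
def center_indices(num_frames, seq_length=32):
    """Frame indices for center evaluation: the middle seq_length frames."""
    start = (num_frames - seq_length) // 2
    return list(range(start, start + seq_length))

def conv_start_points(num_frames, seq_length=32, num_segments=10):
    """Start index of each conv-evaluation segment.

    The clip is split into num_segments equal parts and the center of each
    part is used as a start point. Clamping to the last valid start is an
    assumption made for this sketch.
    """
    last_valid_start = max(num_frames - seq_length, 0)
    starts = []
    for i in range(num_segments):
        # Center of part i, in integer arithmetic to avoid float rounding.
        center = (2 * i + 1) * num_frames // (2 * num_segments)
        starts.append(min(center, last_valid_start))
    return starts

def average_scores(segment_scores):
    """Average per-class scores over segments; return (label index, mean scores)."""
    n = len(segment_scores)
    mean = [sum(col) / n for col in zip(*segment_scores)]
    return mean.index(max(mean)), mean

idx = center_indices(128)
print(idx[0], idx[-1])  # 48 79, matching the example in the text
print(conv_start_points(640))
```

For the 128-frame example, center evaluation yields indices 48 through 79. In conv evaluation, each start point would feed one 32-frame segment to the model, and `average_scores` combines the 10 resulting score vectors into a single label.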

model                             dataset   center accuracy   conv accuracy
resnet18_2d_rgb_hmdb5_32          HMDB5     84.69%            82.88%
resnet18_3d_rgb_hmdb5_32          HMDB5     84.69%            85.59%
resnet18_2d_of_hmdb5_32_a100      HMDB5     78.38%            81.08%
resnet18_2d_of_hmdb5_32_xavier    HMDB5     80.18%            82.88%
resnet18_3d_of_hmdb5_32_a100      HMDB5     91.89%            92.79%
resnet18_3d_of_hmdb5_32_xavier    HMDB5     90.99%            95.50%

Real-time Inference Performance

Inference uses FP16 precision. Inference performance is measured with trtexec on Jetson Nano, Xavier NX, AGX Xavier and an NVIDIA T4 GPU. The Jetson devices run in the Max-N configuration for maximum system performance. The numbers below are inference-only performance. End-to-end performance with streaming video data might vary slightly depending on the application's use case.

Model Type   Device          Precision   Batch Size   FPS
3D-RGB       Jetson Nano     FP16        1            0.57
3D-RGB       Jetson NX       FP16        4            4.9
3D-RGB       Jetson Xavier   FP16        4            33
3D-RGB       T4              FP16        4            137
2D-RGB       Jetson Nano     FP16        1            30
2D-RGB       Jetson NX       FP16        16           250
2D-RGB       Jetson Xavier   FP16        16           490
2D-RGB       T4              FP16        16           1818

How to use this model

This model needs to be used with NVIDIA hardware and software. For hardware, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. For software, the model can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream SDK or TensorRT.

The primary use case for this model is recognizing the action in a sequence of RGB frames or optical flow grayscale images. The sequence length is 32 frames.

There are six models provided:

  • resnet18_2d_rgb_hmdb5_32
  • resnet18_3d_rgb_hmdb5_32
  • resnet18_2d_of_hmdb5_32_a100
  • resnet18_2d_of_hmdb5_32_xavier
  • resnet18_3d_of_hmdb5_32_a100
  • resnet18_3d_of_hmdb5_32_xavier

They are intended for training and fine-tuning with the Train Adapt Optimize (TAO) Toolkit and the user's own action recognition dataset. High-fidelity models can be trained for new use cases. The Jupyter notebook available as part of the TAO container can be used for re-training.

These models are also intended for easy deployment to the edge using DeepStream SDK or TensorRT. DeepStream provides facilities to create efficient video analytics pipelines that capture, decode and pre-process the data before running inference.

The models are encrypted and can be decrypted with the following key:

  • Model load key: nvidia_tao

Please make sure to use this as the key for all TAO commands that require a model load key.


Input

  • RGB model:
    • 3D model: 3 x 32 x 224 x 224 (C D H W)
    • 2D model: 96 x 224 x 224 (C×D H W)
  • Optical flow model:
    • 3D model: 2 x 32 x 224 x 224 (C D H W)
    • 2D model: 64 x 224 x 224 (C×D H W)
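
The four input shapes follow one rule: RGB frames have 3 channels and optical flow has 2, and the 2D models fold the 32-frame sequence into the channel dimension. The sketch below encodes that rule; the function name `input_shape` is hypothetical, and the `model_type` and `input_type` values simply mirror the ones used in the experiment spec.

```python
def input_shape(model_type, input_type, seq_length=32, height=224, width=224):
    """Return the expected input shape, without the batch dimension."""
    # 3 channels for RGB frames, 2 for optical flow (x and y components).
    channels = {"rgb": 3, "of": 2}[model_type]
    if input_type == "3d":
        return (channels, seq_length, height, width)   # C D H W
    if input_type == "2d":
        return (channels * seq_length, height, width)  # C*D H W
    raise ValueError(f"unknown input_type: {input_type!r}")

for model_type in ("rgb", "of"):
    for input_type in ("3d", "2d"):
        print(model_type, input_type, input_shape(model_type, input_type))
```

How the frames are interleaved inside the folded C×D dimension of the 2D models is not stated here, so check the TAO or DeepStream pre-processing before building inputs by hand.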


Output

The classification logits.
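
Turning the logits into a label is a softmax followed by an argmax. The sketch below shows this in plain Python. The class order in `CLASSES` is an assumption made for illustration (it follows the order of the class distribution table) and should be checked against the label map of the model actually in use; the sample logits are made up.

```python
import math

# Assumed class order; verify against the label map used for training.
CLASSES = ["walk", "ride_bike", "run", "fall_floor", "push"]

def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, classes=CLASSES):
    """Return (class name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = probs.index(max(probs))
    return classes[best], probs[best]

# Hypothetical logits for one 32-frame sequence.
label, prob = predict_label([0.1, 2.0, -1.0, 0.3, 0.0])
print(label, round(prob, 3))
```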

Instructions to use the model with TAO toolkit

To use these models as pretrained weights for transfer learning, use the snippet below as a template for the model_config component of the experiment spec file to train a 2D/3D ActionRecognitionNet. For more information on the experiment spec file, please refer to the Train Adapt Optimize (TAO) Toolkit User Guide.

  # Input modality: "rgb" for RGB frames, "of" for optical flow.
  model_type: rgb
  # model_type: of
  # Network type: "2d" or "3d".
  input_type: "2d"
  # input_type: "3d"
  backbone: resnet18
  # Number of frames per input sequence.
  rgb_seq_length: 32
  # Pretrained weights; pick the file that matches model_type and input_type.
  rgb_pretrained_model_path: /workspace/action_recognition/resnet18_2d_rgb_hmdb5_32.tlt
  # rgb_pretrained_model_path: /workspace/action_recognition/resnet18_3d_rgb_hmdb5_32.tlt
  rgb_pretrained_num_classes: 5
  # of_pretrained_model_path: /workspace/action_recognition/resnet18_2d_of_hmdb5_32_a100.tlt
  # of_pretrained_num_classes: 5
  sample_rate: 1

Instructions to deploy the model with DeepStream

To create an end-to-end video analytics application, deploy this model with DeepStream SDK. DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of this model into the DeepStream sample app.

To deploy this model with DeepStream 6.0, please refer to the sample code in sources/apps/sample_apps/deepstream-3d-action-recognition/ in the DeepStream SDK.


Limitations

NVIDIA ActionRecognitionNet is trained on HMDB5, a subset of HMDB51 containing 1024 videos in total. Accuracy on videos other than those from HMDB5 is therefore not expected to match the numbers reported in the performance section.

In general, getting better accuracy requires fine-tuning the pretrained model on more data through TAO Toolkit.

Model versions:

  • trainable_v1.0 - Pre-trained models for 2D/3D RGB-Only and OF-Only models.
  • deployable_v1.0 - Models for 2D/3D RGB-Only models deployable to DeepStream.



References

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A Large Video Database for Human Motion Recognition. In: ICCV (2011)
  • Simonyan, K., Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv preprint arXiv:1406.2199 (2014)
  • Carreira, J., Zisserman, A.: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In: CVPR (2017), pp. 6299-6308
  • He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: CVPR (2016)

Using TAO Pre-trained Models


License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical AI

The NVIDIA ActionRecognitionNet model classifies the action in a sequence of frames.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.