Classify gestures from hand crop images.
Latest Version
April 4, 2023
43.99 MB

GestureNet Model Card

Model Overview

The model described in this card is a classification network that classifies hand crop images into the following 6 classes:

  • thumbs up
  • fist
  • stop
  • ok
  • two
  • random

GestureNet is cascaded with a hand detection or body pose network. For example, BodyPoseNet detects human body joints, which are used to create hand crops, and GestureNet then acts as a classifier that determines the gesture of the hand.
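The cropping step in such a cascade can be sketched as follows. This is a minimal illustration, not the method used by BodyPoseNet or TAO: the function name `hand_crop_box`, the use of a single hand keypoint as the crop centre, and the fixed `half_size` are all assumptions made for the example.

```python
from typing import Tuple


def hand_crop_box(cx: float, cy: float, half_size: int,
                  img_w: int, img_h: int) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) for a square box centred on a hand keypoint.

    Hypothetical helper: the box is clamped to the image bounds, so crops
    near the image border come out smaller than 2 * half_size.
    """
    x0 = max(0, int(round(cx - half_size)))
    y0 = max(0, int(round(cy - half_size)))
    x1 = min(img_w, int(round(cx + half_size)))
    y1 = min(img_h, int(round(cy + half_size)))
    return x0, y0, x1, y1


# Example: a keypoint close to the top-right corner of a 640 x 480 frame.
box = hand_crop_box(cx=620.0, cy=40.0, half_size=80, img_w=640, img_h=480)
print(box)
```

The resulting crop would still need to be resized to the input resolution the classifier expects before inference.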

Model Architecture

This is a classification model with a ResNet-18 backbone.

Training Algorithm

This model was trained using the GestureNet entrypoint in TAO. The training algorithm optimizes the network to minimize the categorical cross entropy loss for the classes.
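For reference, the categorical cross entropy for a single sample can be sketched in plain Python as below. This shows the standard textbook definition of the loss, not the TAO implementation; the function names and the example logits are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(logits: Sequence[float], true_index: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_index])


# With equal scores for 6 classes, every class gets probability 1/6,
# so the loss equals ln(6) regardless of which class is correct.
loss = categorical_cross_entropy([0.0] * 6, true_index=2)
print(loss)
```

Training drives this value down by raising the probability the network assigns to the labelled gesture.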

Training Data

GestureNet was trained on a proprietary dataset with more than 150K images. The training dataset contains hand crops from a user-facing camera placed at various heights, with variations in users, backgrounds and illumination.


Evaluation Dataset

The accuracy of the GestureNet model was measured against 10,000 proprietary images across a variety of environments, backgrounds and illumination.

Methodology and KPI

The KPIs for the evaluation data are reported in the table below. The model is evaluated on precision, recall and F1 score.

Model: GestureNet

Content          Precision   Recall   F1-Score
Evaluation set   85%         85%      85%
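As a reminder of how the three metrics relate, the sketch below computes them from true positive, false positive and false negative counts. The counts in the example are invented to reproduce the 85% figures and are not taken from the evaluation set; the card also does not state how per-class results were averaged, so this shows only the single-class definitions.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (assumption): 85 correct detections of a class,
# 15 false alarms and 15 misses.
p, r, f1 = precision_recall_f1(tp=85, fp=15, fn=15)
print(p, r, f1)
```

When precision and recall are equal, as in the table above, the F1 score takes the same value.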

Real-time Inference Performance

Inference uses FP16 precision. Inference performance was measured with trtexec on Jetson Nano, AGX Xavier, Xavier NX and an NVIDIA T4 GPU. The Jetson devices ran in the Max-N configuration for maximum system performance. End-to-end performance with streaming video data may vary slightly depending on the application.

Device   Precision   Batch size   FPS
Nano     FP16        1            79
NX       FP16        1            111
Xavier   FP16        1            481
T4       FP16        1            1227
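Because the figures above are for batch size 1, they can be converted into an approximate per-frame time budget for the classifier alone. The sketch below does this arithmetic; the dictionary name and helper are assumptions, and the result ignores pre-processing, the upstream detector and video decoding, so it should be treated as a lower bound on end-to-end latency.

```python
# FPS values copied from the table above (FP16, batch size 1).
FPS_BATCH1 = {"Nano": 79, "NX": 111, "Xavier": 481, "T4": 1227}


def ms_per_frame(fps: float) -> float:
    """Approximate time per frame in milliseconds at the given throughput."""
    return 1000.0 / fps


for device, fps in FPS_BATCH1.items():
    print(f"{device}: {ms_per_frame(fps):.2f} ms per frame")
```

A 30 FPS video stream leaves roughly 33 ms per frame, so on every listed device the classifier uses only part of that budget.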

How to use this model

This model needs to be used with NVIDIA hardware and software. On the hardware side, the model can run on any NVIDIA GPU, including NVIDIA Jetson devices. On the software side, it can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream 6.0 or TensorRT.

There are two flavors of the model:

  • trainable
  • deployable

The trainable model is intended for training with the TAO Toolkit and the user's own dataset. This can provide high-fidelity models that are adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train the model.

The deployable model is intended for efficient deployment on the edge using DeepStream or TensorRT.

The trainable and deployable models are encrypted and will only operate with the following key:

  • Model load key: nvidia_tlt

Please make sure to use this as the key for all TAO commands that require a model load key.

Input

RGB images of size 160 x 160 x 3.

Output

Gesture category labels.
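Mapping the classifier output to a gesture label can be sketched as below. The label order is an assumption that simply follows the list in the Model Overview; the actual class index order of the released model is not stated in this card and should be checked against the model's label file. The function name and example scores are also hypothetical.

```python
from typing import List, Sequence, Tuple

# ASSUMPTION: index order follows the Model Overview list, which may not
# match the class order of the released model.
GESTURE_LABELS: List[str] = ["thumbs up", "fist", "stop", "ok", "two", "random"]

# Input resolution stated in this card (160 x 160 RGB).
INPUT_SHAPE: Tuple[int, int, int] = (160, 160, 3)


def decode_prediction(scores: Sequence[float]) -> Tuple[str, float]:
    """Return the label with the highest score and that score."""
    if len(scores) != len(GESTURE_LABELS):
        raise ValueError(
            f"expected {len(GESTURE_LABELS)} scores, got {len(scores)}"
        )
    best = max(range(len(scores)), key=lambda i: scores[i])
    return GESTURE_LABELS[best], scores[best]


# Made-up scores for one hand crop.
label, score = decode_prediction([0.05, 0.10, 0.70, 0.05, 0.05, 0.05])
print(label, score)
```

An application might additionally apply a confidence threshold before acting on the predicted gesture.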

Instructions to deploy this model with DeepStream

To create the entire end-to-end video analytic application, deploy this model with DeepStream.


Limitations

Non-frontal view

The GestureNet model is designed to classify hand gestures from a camera facing the subject.

Complex background

The GestureNet model is designed to classify hand gestures of subjects inside rooms with a mostly monochromatic background behind the hand.

Dark-lighting, Monochrome or Infrared Camera Images

The GestureNet model was trained on RGB images in good lighting conditions. Therefore, images captured in dark lighting conditions, monochrome images and IR camera images may not produce good classification results.

Model versions:

  • trainable_v1.0 - Pre-trained model that is intended for training.
  • deployable_v1.0 - Deployment model that is intended to run on the inference pipeline.
  • deployable_v2.0 - Deployment model that is intended to run on the inference pipeline with INT8 calibration. The calibration file is generated for TensorRT 7.
  • deployable_v2.0.1 - Deployment model that is intended to run on the inference pipeline with INT8 calibration. The calibration file is generated for TensorRT 8.



References

  • He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)


License to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Consideration

The NVIDIA GestureNet model classifies the type of gesture in a given hand crop. The training and evaluation datasets mostly consist of users from South Asia. An ideal training and evaluation dataset would additionally include users from other ethnicities.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.