Network to classify emotions from face.
Latest Version
July 24, 2023
4.38 MB

EmotionNet Model Card

Model Overview

The model described in this card is a classification network that classifies human emotion into six categories:

- Neutral
- Happy
- Surprise
- Squint
- Disgust
- Scream

Model Architecture

This is a classification model with five fully connected layers.

Training Algorithm

This model was trained using the EmotionNet entrypoint in TAO. The training algorithm optimizes the network to minimize the categorical cross entropy loss for the emotion classes.
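As a rough illustration of the loss named above, the following minimal sketch computes categorical cross entropy for a single six-class prediction. It is not the TAO implementation: the function names, the example logits, and the class indices are assumptions made only for this example.

```python
import math

NUM_CLASSES = 6  # the card lists six emotion categories


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, target_index, eps=1e-12):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(max(probs[target_index], eps))


# Hypothetical logits for one face; index 1 has the highest score.
logits = [0.2, 2.5, 0.1, -0.3, -1.0, 0.4]
probs = softmax(logits)

# The loss is small when the true class is the one the network favors,
# and larger when the true class received a low probability.
loss_when_correct = categorical_cross_entropy(probs, target_index=1)
loss_when_wrong = categorical_cross_entropy(probs, target_index=4)
print(loss_when_correct, loss_when_wrong)
```

Training drives the average of this per-sample loss down over the dataset, which pushes probability mass toward the labeled emotion class.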

Training Data

The EmotionNet v1.0 model was trained on the Multi-PIE dataset, which has more than 750K images. The training dataset consists of images taken from cameras mounted at varied heights and angles. It contains 337 subjects, imaged under 15 viewpoints and 19 illumination conditions in up to four recording sessions. The facial landmark labels were provided by the NVIDIA data factory team and used for training. All data has been labeled, including subjects with both profile and non-profile faces.

Multi-PIE dataset: information on purchasing and licensing the Multi-PIE dataset can be found here.

Training Data Ground-truth Labeling Guidelines

The training dataset was created by having human labelers annotate ground-truth landmarks. Please refer to the FPENet Model Card for the landmark labeling instructions.

During data collection, subjects were asked to perform posed emotions. The emotion labels were recorded during collection.


Evaluation Data


The inference performance of the EmotionNet model was measured against 71,020 Multi-PIE images across a variety of subjects, illumination conditions, camera heights, and camera angles.

Methodology and KPI

The KPI results for the EmotionNet landmarks_v1 model are reported in the table below. The model is evaluated on precision, recall, and f_score.

================  ===========  ========  =========  ============
content             precision    recall    f_score    numsamples
================  ===========  ========  =========  ============
disgust                0.8156   0.73016    0.77052          6300
happy                 0.81961   0.93027    0.87144         13480
neutral               0.95134   0.94047    0.94587         33260
scream                0.98347   0.99443    0.98892          7180
squint                0.76419   0.64815     0.7014          5400
surprise              0.93284   0.92593    0.92937          5400
Average               0.87784   0.86157    0.86792         71020
Weighted_average      0.90191     0.902     0.9007         71020
================  ===========  ========  =========  ============
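The summary rows of the table can be reproduced from the per-class rows. The sketch below assumes that f_score is the harmonic mean of precision and recall, that "Average" is the unweighted mean over the six classes, and that "Weighted_average" weights each class by its numsamples; these readings are inferred from the numbers, not stated in the card. The variable and function names are illustrative.

```python
# Per-class (precision, recall, numsamples) copied from the table above.
rows = {
    "disgust": (0.8156, 0.73016, 6300),
    "happy": (0.81961, 0.93027, 13480),
    "neutral": (0.95134, 0.94047, 33260),
    "scream": (0.98347, 0.99443, 7180),
    "squint": (0.76419, 0.64815, 5400),
    "surprise": (0.93284, 0.92593, 5400),
}


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


total_samples = sum(n for _, _, n in rows.values())

# Unweighted ("macro") averages: every class counts equally.
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)
macro_f = sum(f_score(p, r) for p, r, _ in rows.values()) / len(rows)

# Support-weighted averages: each class counts in proportion to its samples.
weighted_precision = sum(p * n for p, _, n in rows.values()) / total_samples
weighted_recall = sum(r * n for _, r, n in rows.values()) / total_samples
weighted_f = sum(f_score(p, r) * n for p, r, n in rows.values()) / total_samples

print(f"macro:    {macro_precision:.5f} {macro_recall:.5f} {macro_f:.5f}")
print(f"weighted: {weighted_precision:.5f} {weighted_recall:.5f} {weighted_f:.5f}")
```

The weighted averages are higher than the unweighted ones because neutral, the largest class, is also one of the best-classified, while the weaker squint and disgust classes have fewer samples.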

Real-time Inference Performance

The inference uses FP16 precision. Inference performance was measured with trtexec on Jetson Nano, AGX Xavier, Xavier NX, and NVIDIA T4 GPU. The Jetson devices ran at the Max-N configuration for maximum system performance. End-to-end performance with streaming video data may vary slightly depending on the application's use case.

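========  =========  ==========  =====
Device    Precision  Batch_size    FPS
========  =========  ==========  =====
Nano      FP16                1   1190
NX        FP16                1   4016
Xavier    FP16                1   5988
T4        FP16                1  19644
========  =========  ==========  =====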
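For budgeting purposes, the throughput figures can be converted into an approximate time per inference. The sketch below simply inverts the FPS values at batch size 1; treating this as per-frame latency is an assumption, since measured throughput can differ from single-request latency and this figure excludes face detection, landmark estimation, and video decoding.

```python
# FPS at batch size 1, FP16, copied from the table above.
fps_by_device = {"Nano": 1190, "NX": 4016, "Xavier": 5988, "T4": 19644}

# Approximate milliseconds per inference implied by the throughput.
ms_per_inference = {device: 1000.0 / fps for device, fps in fps_by_device.items()}

for device, ms in ms_per_inference.items():
    print(f"{device}: {ms:.3f} ms per inference")
```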

How to use this model

This model needs to be used with NVIDIA Hardware and Software. For Hardware, the model can run on any NVIDIA GPU including NVIDIA Jetson devices. This model can only be used with Train Adapt Optimize (TAO) Toolkit, DeepStream 6.0 or TensorRT.

The primary use case for this model is detecting human emotion. The model can be used to detect human emotion from photos and videos when combined with appropriate video or image decoding and pre-processing. The model takes facial landmarks as input and provides emotion classes as output.

There are two flavors of the model:

  • trainable
  • deployable

The trainable model is intended for training using TAO Toolkit and the user's own dataset. This can provide high fidelity models that are adapted to the use case. The Jupyter notebook available as a part of TAO container can be used to re-train. The deployable model is intended for efficient deployment on the edge using DeepStream or TensorRT. The trainable and deployable models are encrypted and will only operate with the following key:

  • Model load key: nvidia_tlt

Please make sure to use this as the key for all TAO commands that require a model load key.


Input

68 (X, Y) points of human facial landmarks (1 x 136 x 1)

The training pipeline can accept more input points, but the pre-trained model was trained with 68 input points.
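To make the 1 x 136 x 1 shape concrete, here is a minimal sketch that packs 68 (x, y) landmark pairs into a nested list of that shape. The interleaved x, y ordering and the absence of any normalization are assumptions made for this example; the card does not specify the layout, and both helper names are hypothetical.

```python
NUM_LANDMARKS = 68  # the pre-trained model expects 68 points


def flatten_landmarks(points, num_points=NUM_LANDMARKS):
    """Flatten [(x, y), ...] into [x0, y0, x1, y1, ...] (assumed ordering)."""
    if len(points) != num_points:
        raise ValueError(f"expected {num_points} landmarks, got {len(points)}")
    flat = []
    for x, y in points:
        flat.extend((float(x), float(y)))
    return flat


def to_model_input(flat):
    """Wrap the flat vector as a nested list with shape 1 x N x 1."""
    return [[[value] for value in flat]]


# Hypothetical landmark coordinates, for illustration only.
landmarks = [(10.0 + i, 20.0 + 2 * i) for i in range(NUM_LANDMARKS)]
model_input = to_model_input(flatten_landmarks(landmarks))
print(len(model_input), len(model_input[0]), len(model_input[0][0]))
```

In practice the landmarks would come from a facial landmark estimator such as the FPENet model referenced in this card.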


Output

Category labels (emotion) for each subject in the input image.
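A minimal sketch of turning one set of six per-class scores into a category label follows. The index-to-label order shown here is an assumption that mirrors the category list in the Model Overview; the actual order used by the model should be confirmed before relying on it. The function name is hypothetical.

```python
# Assumed index order; not confirmed by the card.
EMOTION_LABELS = ["neutral", "happy", "surprise", "squint", "disgust", "scream"]


def top_emotion(scores, labels=EMOTION_LABELS):
    """Return the (label, score) pair with the highest score."""
    if len(scores) != len(labels):
        raise ValueError(f"expected {len(labels)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best]


# Hypothetical scores for one subject.
label, score = top_emotion([0.05, 0.70, 0.10, 0.05, 0.05, 0.05])
print(label, score)
```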

Instructions to use the model with TAO

To use this model as pretrained weights for transfer learning, use the template below for the model component of the experiment spec file when training an EmotionNet model.

  __class_name__: EmotionNetModel
    use_batch_norm: True
    data_format: channels_first
    regularization_type: l2
    regularization_factor: 0.0015
    bias_regularizer: null
    use_landmarks_input: True
    activation_type: 'relu'
    dropout_rate: 0.3
    num_class: 6

Instructions to deploy this model with DeepStream

To create the entire end-to-end video analytic application, deploy this model with DeepStream.


Limitations

Spontaneous Emotion

The NVIDIA EmotionNet model was trained on the Multi-PIE dataset, which has six emotion classes. The subjects in the dataset were performing posed emotions. The model therefore cannot detect spontaneous emotions or emotion classes outside these six.

Unstable landmarks

The NVIDIA EmotionNet model (landmarks_v1.0) does not give good results when detecting emotion if the landmarks are unstable, which generally happens when the face is occluded.

Model versions

  • trainable_v1.0 - Pre-trained model that is intended for training.
  • deployable_v1.0 - Deployment model that is intended to run in the inference pipeline.



References

  • Gross, Ralph, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. "Multi-PIE." Image and Vision Computing 28, no. 5 (2010): 807-813.
  • Lucey, Patrick, Jeffrey F. Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. "The Extended Cohn-Kanade Dataset (CK+): A Complete Dataset for Action Unit and Emotion-Specified Expression." In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 94-101. IEEE, 2010.



License

The license to use this model is covered by the Model EULA. By downloading the model, you accept the terms and conditions of this license.

Ethical Considerations

The NVIDIA EmotionNet model detects emotion categories. It does not infer any additional information about the faces, such as race, gender, or skin type.

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.