The FPENet model described in this card is a facial keypoints estimator network that predicts the (x, y) locations of keypoints for a given input face image. FPENet is generally used in conjunction with a face detector, and the output is commonly used for face alignment, head pose estimation, emotion detection, eye blink detection, and gaze estimation, among other tasks.
This model predicts 68, 80, or 104 keypoints for a given face: Chin: 1-17, Eyebrows: 18-27, Nose: 28-36, Eyes: 37-48, Mouth: 49-61, Inner Lips: 62-68, Pupils: 69-76, Ears: 77-80, additional eye landmarks: 81-104. It can also output a visible or occluded flag for each keypoint.
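For reference, these index ranges can be written down as a simple mapping. The following is a minimal sketch; the region names and the helper function are illustrative, using the 1-based inclusive ranges listed above:

```python
# 1-based, inclusive keypoint index ranges for the 104-point layout above.
# Region names are illustrative; the model outputs a flat keypoint list.
KEYPOINT_REGIONS = {
    "chin":       (1, 17),
    "eyebrows":   (18, 27),
    "nose":       (28, 36),
    "eyes":       (37, 48),
    "mouth":      (49, 61),
    "inner_lips": (62, 68),
    "pupils":     (69, 76),
    "ears":       (77, 80),
    "extra_eye":  (81, 104),  # additional eye landmarks
}

def region_of(index: int) -> str:
    """Return the face region for a 1-based keypoint index."""
    for name, (lo, hi) in KEYPOINT_REGIONS.items():
        if lo <= index <= hi:
            return name
    raise ValueError(f"index {index} out of range 1-104")
```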
This model uses a Recombinator network backbone. Recombinator networks are a family of CNN architectures suited to fine-grained pixel-level predictions (as opposed to image-level predictions such as classification). The network recombines layer inputs so that convolutional layers in the finer branches receive inputs from both coarse and fine layers.
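The recombination idea can be sketched in a few lines of PyTorch. This is a toy illustration of coarse-to-fine feature mixing under assumed layer sizes, not the actual FPENet architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecombinatorBlock(nn.Module):
    """Toy coarse-to-fine recombination: the fine branch convolves over a
    concatenation of its own features and upsampled coarse features."""
    def __init__(self, fine_ch: int, coarse_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(fine_ch + coarse_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse map to the fine resolution, then concatenate so
        # the convolution sees both coarse context and fine spatial detail.
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([fine, coarse_up], dim=1)))

# Example: 40x40 fine features recombined with 20x20 coarse features.
block = RecombinatorBlock(fine_ch=32, coarse_ch=64, out_ch=32)
out = block(torch.randn(1, 32, 40, 40), torch.randn(1, 64, 20, 20))
assert out.shape == (1, 32, 40, 40)
```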
This model was trained using the FPENet entrypoint in TAO. The training algorithm optimizes the network to minimize the Manhattan distance (L1), squared Euclidean distance (L2), or Wing loss over the keypoints. Individual face regions can be weighted separately: the eyes, the mouth, the pupils, and the rest of the face.
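As a rough sketch of the three loss options over predicted and ground-truth keypoint arrays (the Wing loss follows Feng et al., referenced below; the function names and default constants here are illustrative, and per-region weighting is omitted):

```python
import numpy as np

def l1_loss(pred, gt):
    """Manhattan (L1) distance over keypoints; pred, gt are (N, 2) arrays."""
    return np.abs(pred - gt).sum(axis=-1).mean()

def l2_loss(pred, gt):
    """Squared Euclidean (L2) distance over keypoints."""
    return ((pred - gt) ** 2).sum(axis=-1).mean()

def wing_loss(pred, gt, w=10.0, eps=2.0):
    """Wing loss (Feng et al., 2018): logarithmic near zero, linear beyond w."""
    x = np.abs(pred - gt)
    c = w - w * np.log(1.0 + w / eps)  # makes the two pieces meet at |x| = w
    return np.where(x < w, w * np.log(1.0 + x / eps), x - c).mean()
```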
A pre-trained (`trainable`) model is available, trained on a combination of an NVIDIA-internal dataset and the Multi-PIE dataset. The NVIDIA-internal dataset has approximately 500k images, and Multi-PIE has 750k images.
The ground-truth dataset was created by human labelers annotating facial keypoints.
If you are looking to re-train with your own dataset, please follow the guidelines below.
Face bounding-box labeling:
The Sloth and Label-Studio tools were used for labeling.
The evaluation is done on the Multi-PIE dataset. The user IDs used for the KPI are: 342, 079, 164, 250, 343, 080, 165, 251, 344, 081, 166, 252, 345, 082, 167, 253, 346, 083, 168, 254, 084, 169, 255.
The metric is the region keypoint pixel error: the mean Euclidean error, in pixels, between the predicted and ground-truth keypoint locations, bucketized and averaged per face region (eyes, mouth, chin, etc.).
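A minimal sketch of how such a metric can be computed, assuming predictions and ground truth as (N, 2) arrays of pixel coordinates (the helper and index bounds are illustrative):

```python
import numpy as np

def region_pixel_error(pred, gt, lo, hi):
    """Mean Euclidean pixel error over 1-based keypoint indices [lo, hi]."""
    p, g = pred[lo - 1:hi], gt[lo - 1:hi]  # convert to a 0-based slice
    return np.linalg.norm(p - g, axis=-1).mean()

# Example: error over the 'eyes' region (indices 37-48).
pred = np.random.rand(104, 2) * 80
gt = np.random.rand(104, 2) * 80
print(region_pixel_error(pred, gt, 37, 48))
```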
The inference performance was measured with `trtexec` on the specific hardware listed below, using FP16 precision, on Jetson Nano, AGX Xavier, Xavier NX, and an NVIDIA T4 GPU; a sample invocation is shown after the table. The Jetson devices run in the Max-N configuration for maximum system performance. The end-to-end performance with streaming video data may vary slightly depending on the application.
Device | Precision | Batch size | FPS |
---|---|---|---|
Nano | FP16 | 1 | 115 |
NX | FP16 | 1 | 483 |
Xavier | FP16 | 1 | 1015 |
T4 | FP16 | 1 | 2489 |
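For reference, a benchmarking run of this kind might look like the following; the engine file name is hypothetical, and the encrypted .etlt model must first be converted to a TensorRT engine (for example with tao-converter) before `trtexec` can load it:

```sh
# Benchmark a pre-built FP16 engine at batch size 1 (file name hypothetical).
trtexec --loadEngine=fpenet_b1_fp16.engine --iterations=100
```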
This model needs to be used with NVIDIA hardware and software. The model can run on any NVIDIA GPU, including NVIDIA Jetson devices. It can only be used with the Train Adapt Optimize (TAO) Toolkit, DeepStream 6.0, or TensorRT.
There are two flavors of the model:
- The `trainable` model is intended for training with the TAO Toolkit and the user's own dataset. This can provide high-fidelity models that are adapted to the use case. The Jupyter notebook available as part of the TAO container can be used to re-train.
- The `deployable` model is intended for efficient deployment on the edge using DeepStream or TensorRT.
The `trainable` and `deployable` models are encrypted and will only operate with the following key:
`nvidia_tlt`
Please make sure to use this as the key for all TAO commands that require a model load key, as in the example below.
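For example, a training invocation might pass the key via `-k` (the spec and result paths here are hypothetical):

```sh
tao fpenet train -e /workspace/specs/fpenet_train.yaml \
                 -r /workspace/results \
                 -k nvidia_tlt
```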
Input: grayscale images of 80 x 80 x 1.
Output: N x 2 keypoint locations and N x 1 keypoint confidence values, where N is the number of keypoints.
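A minimal preprocessing sketch matching this input contract, assuming an OpenCV-based pipeline (the exact normalization used in the TAO pipeline may differ):

```python
import cv2
import numpy as np

def preprocess_face(bgr_crop: np.ndarray) -> np.ndarray:
    """Convert a face crop to the 80 x 80 x 1 grayscale input tensor."""
    gray = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (80, 80), interpolation=cv2.INTER_LINEAR)
    return gray.astype(np.float32)[..., np.newaxis]  # shape (80, 80, 1)
```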
Besides predicting 68, 80, or 104 points, this model can be fine-tuned to predict a different number of facial points, or general-purpose keypoints, with TAO Toolkit versions above 22.04. The following is an example of enabling 10-keypoint estimation by changing `num_keypoints` in the training specification file:
num_keypoints: 10
dataloader:
  ...
  num_keypoints: 10
  ...
Known limitations include a relative increase in keypoint estimation error for extreme head poses (yaw > 60 degrees) and for occlusions.
Honari, S., Molchanov, P., Tyree, S., Vincent, P., Pal, C., & Kautz, J. (2018). Improving landmark localization with semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1546-1555).
Feng, Z. H., Kittler, J., Awais, M., Huber, P., & Wu, X. J. (2018). Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2235-2245).
License to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.
The training and evaluation dataset consists mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.