Clara Guardian

NVIDIA Clara™ Guardian is a collection of models and reference applications that simplifies the development and deployment of smart sensors with multimodal AI, anywhere in a healthcare facility.
April 4, 2023

Clara Guardian’s key components include healthcare pre-trained models for computer vision and speech, training tools, deployment SDKs, and NVIDIA Fleet Command. NVIDIA Fleet Command is a hybrid-cloud platform for securely managing and scaling AI deployments across millions of servers or edge devices at hospitals.

This makes it easy for ecosystem partners to add AI capabilities to common sensors that can monitor crowds for safe social distancing, measure body temperature, detect the absence of protective gear such as masks, or interact remotely with high-risk patients so that everyone in the healthcare facility stays safe and informed.

Applications and services can run on a wide range of hardware, from an NVIDIA Jetson Nano to a server with NVIDIA T4, NVIDIA A100, or NVIDIA A30 GPUs, allowing developers to securely deploy anywhere, from the edge to the cloud.

What’s in this Collection?

Vision Models

  1. 2D BodyPose

2D BodyPose is a fully convolutional model. Its architecture consists of a backbone network (such as VGG), an initial estimation stage that makes pixel-wise predictions of confidence maps (heatmaps) and part affinity fields, and a multistage refinement (0 to N stages) of those initial predictions.
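To make the role of the confidence maps concrete, here is a minimal sketch of reading one body part location out of a single heatmap. It assumes a single person in view and a toy 4x4 map with made-up values; the function name `find_peak` and the threshold are illustrative only. Real multi-person decoding is more involved (it typically applies non-maximum suppression and uses the part affinity fields to group parts into people), and none of that is shown here.

```python
def find_peak(heatmap, threshold=0.1):
    """Return (row, col, score) of the strongest cell, or None if it is below threshold.

    `heatmap` is a list of rows holding per-pixel confidence values for ONE body part.
    """
    best = None
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if best is None or score > best[2]:
                best = (r, c, score)
    if best is None or best[2] < threshold:
        return None
    return best


# Toy confidence map for a single body part (values are invented for illustration).
nose_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

print(find_peak(nose_heatmap))  # (1, 1, 0.9)
```

The returned row and column are in heatmap coordinates; mapping them back to image pixels depends on the model's output stride, which is not stated here.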

  1. Facial Landmarks Estimation

The Facial Landmarks Estimation model (FPENet) is a classification model with a Recombinator network backbone. It aims to predict the (x, y) location of keypoints for a given input face image. FPENet is generally used in conjunction with a face detector, and its output is commonly used for face alignment, head pose estimation, emotion detection, eye blink detection, and gaze estimation, among others.

  1. Gaze Estimation

The Gaze Estimation model is a multi-input, multi-branch network. Its input consists of a face crop, a left eye crop, a right eye crop, and a face grid. The face, left eye, and right eye branches use AlexNet as the feature extractor.

  1. GestureNet

GestureNet is a classification network with a ResNet-18 backbone that classifies hand crop images into the following classes (five gestures plus a random class):

  • thumbs up
  • fist
  • stop
  • ok
  • two
  • random
  1. HeartRateNet

HeartRateNet is a two-branch model with an attention mechanism. It takes a motion map and an appearance map, both derived from RGB face videos. The motion map is the difference between the face regions of interest (ROI) of two consecutive frames (current and previous). The appearance map is the current camera frame cropped to the same ROI as the motion map. The appearance map is primarily used by the attention mechanism, which lets the model focus on the more important features it extracts.

For more information on motion maps, appearance maps, and attention mechanisms, how they are applied in this model, and the benefits of using attention mechanisms, please see DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks.
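The two inputs described above can be sketched as follows. This is only an illustration of the frame-difference idea, under several assumptions: frames are single-channel lists of rows rather than RGB, the ROI is given as a hypothetical `(top, left, height, width)` tuple, and any normalization or resizing the actual model expects is omitted.

```python
def crop(frame, roi):
    """Crop a 2D frame (list of rows) to roi = (top, left, height, width)."""
    top, left, height, width = roi
    return [row[left:left + width] for row in frame[top:top + height]]


def motion_map(prev_frame, curr_frame, roi):
    """Per-pixel difference (current minus previous) inside the face ROI."""
    prev_roi = crop(prev_frame, roi)
    curr_roi = crop(curr_frame, roi)
    return [
        [c - p for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_roi, curr_roi)
    ]


def appearance_map(curr_frame, roi):
    """The current frame cropped to the same ROI as the motion map."""
    return crop(curr_frame, roi)


# Invented 3x3 frames and a 2x2 ROI starting at row 0, column 1.
previous = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
current = [[1, 3, 5], [4, 7, 9], [7, 8, 9]]
face_roi = (0, 1, 2, 2)

print(motion_map(previous, current, face_roi))  # [[1, 2], [2, 3]]
print(appearance_map(current, face_roi))        # [[3, 5], [7, 9]]
```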

Conversational AI Models

  1. MatchboxNet

The MatchboxNet 3x1x64 model has been trained on the Google Speech Commands Dataset (v2). Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as Key Word Spotting, in which a model constantly analyzes speech patterns to detect certain "command" classes. When a command is detected, the system can take a specific action. Command recognition models are often designed to be small and efficient so that they can be deployed on low-power sensors and remain active for long periods of time.

  1. QuartzNet

QuartzNet is an end-to-end architecture that is trained using CTC loss. QuartzNet models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.
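A CTC-trained model emits one score vector per audio frame over the output symbols plus a special blank symbol. One common way to turn those frames into text is greedy decoding: take the best symbol per frame, collapse consecutive repeats, and drop blanks. The sketch below illustrates that idea with a made-up three-letter vocabulary; it is not necessarily the decoder used in any particular deployment, which may rely on beam search or a language model instead.

```python
def ctc_greedy_decode(frame_scores, vocabulary, blank_index):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    # Index of the highest-scoring symbol in each frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)


# Hypothetical vocabulary; index 3 is reserved for the CTC blank.
vocabulary = ["a", "b", "c"]
BLANK = 3

# Invented scores whose per-frame argmax is: a, a, blank, a, b, b
frame_scores = [
    [0.9, 0.0, 0.0, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.0, 0.1],
    [0.0, 0.9, 0.0, 0.1],
]

print(ctc_greedy_decode(frame_scores, vocabulary, BLANK))  # aab
```

The blank between the second and third "a" frames is what allows the doubled letter to survive the collapse step.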

  1. FastPitch and HiFi-GAN models

FastPitch is a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. FastPitch is based on a fully parallel Transformer architecture, with a much higher real-time factor than Tacotron2 for mel-spectrogram synthesis of a typical utterance.

HiFi-GAN is a neural vocoder model for text-to-speech applications. It is intended as the second part of a two-stage speech synthesis pipeline, with a mel-spectrogram generator such as FastPitch as the first stage.

Vision Apps

  1. 2D BodyPose

The 2D BodyPose sample application uses the 2D BodyPose model to detect the coordinates of human body parts. The application can output the following 18 body parts:

  • nose
  • neck
  • right shoulder
  • right elbow
  • right hand
  • left shoulder
  • left elbow
  • left hand
  • right hip
  • right knee
  • right foot
  • left hip
  • left knee
  • left foot
  • right eye
  • left eye
  • right ear
  • left ear

For an overview of the application pipeline, please refer to this link.

Prerequisites: DeepStream SDK 6.0 GA and above

  1. Facial Landmarks Estimation

The facial landmarks estimation DeepStream sample application identifies landmarks in a human face using a face detection model and the facial landmarks estimation model. With the pretrained facial landmarks estimation model, the application can identify 80 landmarks in one human face.

The pretrained models used in this sample application:

  1. Gaze Estimation

The GazeNet DeepStream sample application recognizes a person's eye gaze point of regard (X, Y, Z) and gaze vector (theta and phi). The eye gaze vector can also be derived from the eye position and the eye gaze point of regard. The input to GazeNet should be the human face and the facial landmarks.
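As a rough illustration of deriving a gaze vector from the eye position and the point of regard, the sketch below normalizes the difference between the two 3D points and converts it to two angles. The angle convention used here (theta as elevation, phi as azimuth about the vertical axis) and the function names are assumptions made for this example; the convention GazeNet actually uses may differ.

```python
import math


def gaze_direction(eye_position, point_of_regard):
    """Unit vector pointing from the eye to the point of regard (same 3D frame)."""
    dx, dy, dz = (p - e for e, p in zip(eye_position, point_of_regard))
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        raise ValueError("eye position and point of regard coincide")
    return (dx / norm, dy / norm, dz / norm)


def direction_to_angles(direction):
    """Convert a unit vector to (theta, phi) in radians.

    ASSUMED convention: theta = elevation from the x-z plane, phi = azimuth in it.
    """
    x, y, z = direction
    theta = math.asin(max(-1.0, min(1.0, y)))  # clamp to guard against rounding
    phi = math.atan2(x, z)
    return theta, phi


# Invented coordinates: the eye at (1, 1, 1) looking at the point (4, 5, 1).
direction = gaze_direction((1.0, 1.0, 1.0), (4.0, 5.0, 1.0))
theta, phi = direction_to_angles(direction)
print(direction)
print(theta, phi)
```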

The pretrained models used in this sample application:

  1. GestureNet

The gesture sample application uses the GestureNet model to classify the gestures of hands identified by the 2D BodyPose model. For an overview of the application pipeline, please refer to this link.

Prerequisites: DeepStream SDK 6.0 GA and above

  1. HeartRateNet

The HeartRate sample application measures a person's heart rate from face information.

The pretrained models used in this sample application:

Prerequisites:

  • DeepStream SDK 6.0 GA and above
  • gst-nvdsvideotemplate plugin

Because HeartRateNet is a multi-input network, the gst-nvinfer plugin cannot be used for HeartRateNet inference; the application relies on the gst-nvdsvideotemplate plugin instead.

Conversational AI Apps

  1. MatchboxNet

The MatchboxNet model is trained using NVIDIA NeMo. For an overview of the model and training process, refer to the NeMo tutorial. Models trained in NVIDIA NeMo use the .nemo format. To use these models in Riva, convert the model checkpoints to the .riva format with the nemo2riva tool so that they can be built and deployed with Riva ServiceMaker. The nemo2riva tool is currently packaged and available via the Riva Quickstart here.

  2. QuartzNet

QuartzNet is the next generation of the Jasper speech recognition model. It improves on Jasper by replacing 1D convolutions with 1D time-channel separable convolutions. This effectively factorizes the convolution kernels, enabling deeper models while reducing the number of parameters by over an order of magnitude.

Details on how to build a speech-to-text pipeline based on QuartzNet can be found here.
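The parameter saving from time-channel separable convolutions can be checked with simple arithmetic. A standard 1D convolution with kernel width K, C_in input channels, and C_out output channels has K * C_in * C_out weights. The separable version splits this into a depthwise convolution over time (K * C_in weights) followed by a pointwise 1x1 convolution across channels (C_in * C_out weights). The layer sizes below are illustrative values chosen for this example, and biases are ignored.

```python
def standard_conv1d_params(kernel_width, in_channels, out_channels):
    """Weight count of a regular 1D convolution (bias ignored)."""
    return kernel_width * in_channels * out_channels


def separable_conv1d_params(kernel_width, in_channels, out_channels):
    """Weight count of a depthwise (time) plus pointwise (channel) convolution."""
    depthwise = kernel_width * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer: kernel width 33, 256 channels in and out.
K, C_IN, C_OUT = 33, 256, 256
standard = standard_conv1d_params(K, C_IN, C_OUT)
separable = separable_conv1d_params(K, C_IN, C_OUT)

print(standard)                         # 2162688
print(separable)                        # 73984
print(round(standard / separable, 1))   # 29.2
```

For this layer the separable form needs roughly 29 times fewer weights, which is consistent with the "over an order of magnitude" reduction described above.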

  3. FastPitch and HiFi-GAN models

Text-to-speech (TTS) is based on a two-stage pipeline. Riva first generates a mel spectrogram using the first model (FastPitch), and then generates speech using the second model (HiFi-GAN). This pipeline forms a text-to-speech system that lets you synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.

Details on how to build a text-to-speech pipeline based on the FastPitch and HiFi-GAN models can be found here.

Prerequisites:

  • DeepStream SDK 6.0 GA and above
  • Riva 1.8 and above

Technical Support

Use the NVIDIA Devtalk forum for questions regarding this Software.