Clara Guardian’s key components include healthcare pre-trained models for computer vision and speech, training tools, deployment SDKs, and NVIDIA Fleet Command, a hybrid-cloud platform for securely managing and scaling AI deployments across millions of servers or edge devices in hospitals.
This makes it easy for ecosystem partners to add AI capabilities to common sensors that monitor crowds for safe social distancing, measure body temperature, detect missing protective gear such as masks, or interact remotely with high-risk patients, so that everyone in the healthcare facility stays safe and informed.
Applications and services can run on a wide range of hardware, from an NVIDIA Jetson Nano to servers with NVIDIA T4, A30, or A100 GPUs, allowing developers to securely deploy anywhere from the edge to the cloud.
What’s in this Collection?
2D BodyPose is a fully convolutional model whose architecture consists of a backbone network (such as VGG), an initial estimation stage that performs pixel-wise prediction of confidence maps (heatmaps) and part affinity fields, followed by multistage refinement (0 to N stages) of the initial predictions.
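As a rough sketch of this topology (not the released network; the channel counts, stand-in backbone, and map counts of 19 heatmaps plus 38 part affinity fields are illustrative assumptions), a minimal PyTorch version might look like:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Predicts heatmaps + part affinity fields from input features."""
    def __init__(self, in_ch, n_heatmaps=19, n_pafs=38):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_heatmaps + n_pafs, 1),
        )

    def forward(self, x):
        return self.net(x)

class BodyPose2D(nn.Module):
    def __init__(self, n_refine_stages=2):
        super().__init__()
        # Stand-in backbone; the real model uses e.g. a VGG-style network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.initial = Stage(128)
        # Each refinement stage re-predicts the maps from the backbone
        # features concatenated with the previous stage's prediction.
        self.refine = nn.ModuleList(
            Stage(128 + 19 + 38) for _ in range(n_refine_stages)
        )

    def forward(self, img):
        feats = self.backbone(img)
        out = self.initial(feats)
        for stage in self.refine:
            out = stage(torch.cat([feats, out], dim=1))
        return out  # (N, 19 + 38, H/4, W/4) heatmaps and PAFs

maps = BodyPose2D()(torch.randn(1, 3, 256, 256))
```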
The Facial Landmarks Estimation model is a classification model with a Recombinator network backbone. It aims to predict the (x, y) locations of keypoints for a given input face image. FPENet is generally used in conjunction with a face detector, and the output is commonly used for face alignment, head pose estimation, emotion detection, eye blink detection, and gaze estimation, among other tasks.
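As one illustration of the face-alignment use case, the following sketch computes a similarity transform that levels the eye line from predicted landmarks; the eye-landmark indices are hypothetical and depend on the model's actual keypoint ordering.

```python
import numpy as np

def align_by_eyes(landmarks, left_eye_idx, right_eye_idx):
    """Return a 2x3 similarity transform that levels the eye line.

    landmarks: (N, 2) array of (x, y) keypoints from the landmark model.
    The index arguments pick out the eye-center keypoints; the actual
    indices depend on the model's keypoint ordering.
    """
    left = landmarks[left_eye_idx]
    right = landmarks[right_eye_idx]
    dx, dy = right - left
    angle = np.arctan2(dy, dx)        # roll angle of the face
    center = (left + right) / 2.0     # rotate about the eye midpoint
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    t = center - R @ center           # keep the midpoint fixed
    return np.hstack([R, t[:, None]]) # apply with cv2.warpAffine, etc.
```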
The Gaze Estimation model is a multi-input, multi-branch network. The model input consists of a face crop, left eye crop, right eye crop, and facegrid. The face, left eye, and right eye branches use AlexNet-based feature extractors.
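A minimal PyTorch sketch of this multi-input layout follows (small stand-in branches instead of AlexNet; input sizes, the facegrid resolution, and the 3-dimensional output are illustrative assumptions):

```python
import torch
import torch.nn as nn

def conv_branch():
    # Stand-in for the AlexNet-based feature extractors.
    return nn.Sequential(
        nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(32 * 16, 64), nn.ReLU(inplace=True),
    )

class GazeNetSketch(nn.Module):
    def __init__(self, grid_size=25):
        super().__init__()
        self.face, self.left, self.right = (
            conv_branch(), conv_branch(), conv_branch()
        )
        # Facegrid: a binary mask encoding the face location in the frame.
        self.grid = nn.Sequential(
            nn.Flatten(),
            nn.Linear(grid_size * grid_size, 64), nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(64 * 4, 3)  # e.g. a 3D gaze output

    def forward(self, face, left_eye, right_eye, facegrid):
        fused = torch.cat([
            self.face(face), self.left(left_eye),
            self.right(right_eye), self.grid(facegrid),
        ], dim=1)
        return self.head(fused)

model = GazeNetSketch()
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64),
            torch.randn(1, 3, 64, 64), torch.randn(1, 1, 25, 25))
```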
GestureNet is a classification network with a ResNet-18 backbone, which aims to classify hand-crop images into five gesture types:
For more information on the motion map, the appearance map, and attention mechanisms, how they are applied in the HeartRateNet model, and the benefits of using attention mechanisms, please see DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks.
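As a rough illustration of the attention idea described there, the sketch below derives a spatial mask from appearance features and uses it to gate motion features, following the L1-style normalization DeepPhys describes; the channel counts and shapes are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """DeepPhys-style gating: appearance features produce a spatial mask
    that re-weights the motion branch's features."""
    def __init__(self, channels):
        super().__init__()
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, motion_feats, appearance_feats):
        mask = torch.sigmoid(self.to_mask(appearance_feats))  # (N,1,H,W)
        n, _, h, w = mask.shape
        # L1-normalize so the mask's mean activation is 0.5 per frame.
        norm = mask.flatten(1).abs().sum(dim=1).view(n, 1, 1, 1)
        mask = (h * w) * mask / (2.0 * norm)
        return motion_feats * mask

gate = AttentionGate(channels=32)
out = gate(torch.randn(2, 32, 18, 18), torch.randn(2, 32, 18, 18))
```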
The MatchboxNet 3x1x64 model has been trained on the Google Speech Commands Dataset (v2). Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as Key Word Spotting, in which a model constantly analyzes speech patterns to detect certain "command" classes. Upon detection of these commands, a specific action can be taken by the system. Command recognition models are often designed to be small and efficient so that they can be deployed onto low-power sensors and remain active for long durations of time.
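A minimal NeMo sketch for running a speech-command classifier over audio clips might look like the following; the checkpoint name and file path are assumptions to verify against NGC, and the transcribe signature varies slightly across NeMo versions.

```python
# Minimal sketch, assuming the NeMo toolkit is installed
# (pip install "nemo_toolkit[asr]").
import nemo.collections.asr as nemo_asr

# Hypothetical checkpoint name; look up the exact MatchboxNet name on NGC.
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="commandrecognition_en_matchboxnet3x1x64_v2"
)

# Classify short clips into the speech-command classes (path is a placeholder).
predictions = model.transcribe(["command_clip.wav"])
print(predictions)
```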
QuartzNet is an end-to-end architecture that is trained using CTC loss. QuartzNet models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.
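For example, transcription with a pretrained QuartzNet checkpoint in NeMo can be sketched as follows; the checkpoint name is a commonly published one that should be verified on NGC, and the audio path is a placeholder.

```python
import nemo.collections.asr as nemo_asr

# Load a published English QuartzNet checkpoint (verify the name on NGC).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# Transcribe one or more 16 kHz mono WAV files.
transcripts = asr_model.transcribe(["speech_sample.wav"])
print(transcripts[0])
```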
FastPitch is a fully parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference; by altering these predictions, the generated speech can be made more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. FastPitch is based on a fully parallel Transformer architecture, with a much higher real-time factor than Tacotron2 for mel-spectrogram synthesis of a typical utterance.
HiFi-GAN is a neural vocoder model for text-to-speech applications. It is intended as the second part of a two-stage speech synthesis pipeline, with a mel-spectrogram generator such as FastPitch as the first stage.
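The two-stage pipeline can be sketched in NeMo as shown below; the checkpoint names are commonly published ones that should be verified on NGC, and the output handling is illustrative.

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Stage 1: text -> mel-spectrogram; Stage 2: mel-spectrogram -> waveform.
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_hifigan")

tokens = spec_generator.parse("The patient in room four needs assistance.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Write the waveform to disk (22.05 kHz is typical for these checkpoints).
sf.write("speech.wav", audio.to("cpu").detach().numpy()[0], 22050)
```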
The 2D body pose sample application uses the 2D BodyPose model to detect human body part coordinates. The application can output the following 18 body parts:
The pretrained models used in this sample application:
The GazeNet DeepStream sample application recognizes a person's eye gaze point of regard (X, Y, Z) and gaze vector (theta and phi). The eye gaze vector can also be derived from the eye position and the eye gaze point of regard. The inputs to GazeNet are the human face and the facial landmarks.
The pretrained models used in this sample application:
The gesture sample application uses the GestureNet model to classify the gesture of the hands identified by the 2D BodyPose model. For an overview of the application pipeline, please refer to this link.
Prerequisites: DeepStream SDK 6.0 GA and above
The pretrained models used in this sample application:
Prerequisites:
Because HeartRateNet is a multi-input network, the gst-nvinfer plugin cannot be used for HeartRateNet inference.
MatchboxNet
The MatchboxNet model is trained using NVIDIA NeMo. For an overview of the model and training process, refer to the NeMo tutorial.
Models trained in NVIDIA NeMo have the .nemo format. To use these models in Riva, users need to convert the model checkpoints to .riva format for building and deploying with Riva ServiceMaker, using the nemo2riva tool. The nemo2riva tool is currently packaged and available via the Riva Quickstart here.
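As a sketch of that conversion step, invoked here from Python (assuming nemo2riva is installed from the Riva Quickstart; the file names are placeholders):

```python
# Minimal sketch: run the nemo2riva CLI from Python. Assumes nemo2riva
# has been installed from the Riva Quickstart; file names are placeholders.
import subprocess

subprocess.run(
    ["nemo2riva", "--out", "matchboxnet.riva", "matchboxnet.nemo"],
    check=True,  # raise if the conversion fails
)
```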
QuartzNet
QuartzNet is the next generation of the Jasper speech recognition model. It improves on Jasper by replacing 1D convolutions with 1D time-channel separable convolutions. Doing this effectively factorizes the convolution kernels, enabling deeper models while reducing the number of parameters by over an order of magnitude.
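A minimal sketch of a 1D time-channel separable convolution in PyTorch (layer sizes are illustrative):

```python
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """A depthwise conv over time followed by a pointwise (1x1) conv
    over channels, as used in QuartzNet."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2,
            groups=channels,  # each channel convolved independently in time
        )
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))
```

For C channels and kernel width K, a standard 1D convolution holds roughly C²K weights, while the separable form holds CK + C², which is where the order-of-magnitude parameter saving comes from.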
Details on how to build a speech-to-text pipeline based on QuartzNet can be found here.
Use the NVIDIA Devtalk forum for questions regarding this software.