
RIVA Diarizer Neural VAD

Description: Neural VAD model used in Riva Speaker Diarization
Publisher: NVIDIA
Latest Version: deployable_v1.0
Modified: April 4, 2023
Size: 340.78 KB

Speaker Diarization: MarbleNet Model Card

Model Overview

This model can be used for Voice Activity Detection (VAD) and can serve as the first step of Speaker Diarization (SD).

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes a log-mel spectrogram with n_mels=80 as its input feature, so it can be integrated easily and efficiently with speaker diarization.
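To illustrate what an 80-bin log-mel spectrogram input looks like, here is a minimal NumPy sketch. It is not the preprocessing shipped with this model: the 25 ms window (win_length=400), 10 ms hop (hop_length=160), n_fft=512, Hann window, HTK-style mel scale, and natural-log compression are all assumptions chosen for illustration. Only n_mels=80 and the 16 kHz sample rate come from this model card.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Build n_mels triangular filters spaced evenly on the mel scale from 0 to sr/2."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr=16000, n_fft=512, win_length=400,
                        hop_length=160, n_mels=80):
    """Return a (n_mels, n_frames) log-mel spectrogram of a 1-D signal.

    Assumes len(signal) >= win_length. Framing parameters are illustrative.
    """
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each windowed frame, zero-padded to n_fft
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Small constant avoids log(0) on silent frames
    return np.log(mel + 1e-10).T
```

With these assumed settings, one second of 16 kHz audio yields a feature matrix of 80 mel bins by 98 frames.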

Training

The model was trained on multiple publicly available datasets. The NeMo toolkit [2] was used to train this model for 50 epochs on multiple GPUs.

How to Use this Model

To use this model, start with the Riva Skills Quick Start guide, which is the starting point for trying out Riva models. Information about the Quick Start guide can be found here. For using the Riva Speech ASR service with this model, the linked document has the necessary information.

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
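As a quick pre-flight check, a WAV file can be tested against this input format before it is sent for inference. The sketch below uses only Python's standard wave module and assumes the intended sample rate is 16000 Hz (16 kHz). The function name is_supported_wav is hypothetical and is not part of Riva.

```python
import wave

EXPECTED_RATE_HZ = 16000  # 16 kHz, per the Input section
EXPECTED_CHANNELS = 1     # mono


def is_supported_wav(path):
    """Return True if the WAV file at `path` is 16 kHz mono (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getnchannels() == EXPECTED_CHANNELS)
```

Files that fail the check would need to be resampled or downmixed first; how to do that is outside the scope of this card.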

Output

This model provides frame-level voice activity prediction.
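Frame-level predictions are usually turned into speech segments before later diarization steps. The sketch below shows one simple way to do that, assuming the predictions arrive as a sequence of per-frame speech probabilities. The 0.5 threshold and the 0.02 s frame shift are illustrative assumptions, not values documented for this model, and frames_to_segments is a hypothetical helper.

```python
def frames_to_segments(probs, frame_shift_s=0.02, threshold=0.5):
    """Merge consecutive speech frames into (start_s, end_s) segments.

    probs: per-frame speech probabilities (assumed output format).
    frame_shift_s: time step between frames in seconds (assumed value).
    threshold: probability at or above which a frame counts as speech (assumed value).
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        is_speech = p >= threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    # Close a segment that runs to the end of the audio
    if start is not None:
        segments.append((start * frame_shift_s, len(probs) * frame_shift_s))
    return segments
```

A production pipeline would typically add smoothing and minimum-duration rules on top of plain thresholding.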

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] NVIDIA NeMo Toolkit

License

By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.