
RIVA Diarizer Neural VAD

Neural VAD model used in Riva Speaker Diarization
Latest Version
April 4, 2023
340.78 KB

Speaker Diarization: MarbleNet Model Card

Model Overview

This model performs Voice Activity Detection (VAD) and serves as the first step of the Speaker Diarization (SD) pipeline.

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes a log-mel spectrogram with n_mels=80 as its input feature, so it can be integrated easily and efficiently with speaker diarization.
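To make the n_mels=80 setting concrete, the sketch below computes the center frequencies of 80 mel bands. It is only an illustration: the HTK-style mel formula, the 0 to 8000 Hz range (the Nyquist limit for 16 kHz audio), and the helper names are assumptions, not details taken from this model card or from the model's actual feature extractor.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int = 80,
                     f_min: float = 0.0,
                     f_max: float = 8000.0) -> list[float]:
    """Return the center frequency (Hz) of each triangular mel band.

    The bands are spaced evenly on the mel scale between f_min and f_max.
    f_min and f_max are illustrative defaults, not values from the model card.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_band_centers()
print(len(centers), round(centers[0], 1), round(centers[-1], 1))
```

Each spectrogram frame then holds one log-energy value per band, so a frame is an 80-dimensional vector. The bands are narrow at low frequencies and wide at high frequencies.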


The model was trained on multiple publicly available datasets. The NeMo toolkit [2] was used to train this model for 50 epochs on multiple GPUs.

How to Use this Model

To use this model, start with the Riva Skills Quick Start guide, which is the starting point for trying out Riva models. The Riva documentation describes the Quick Start guide and has the information needed to use the Riva Speech ASR service with this model.


This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
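Before sending audio to the model, it can help to confirm that a file matches this input format. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` and the constants are hypothetical and are not part of Riva or NeMo.

```python
import wave

# Input format stated in this model card.
EXPECTED_SAMPLE_RATE_HZ = 16000
EXPECTED_CHANNELS = 1


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE_HZ} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected {EXPECTED_CHANNELS} (mono)"
        )
    return problems
```

A file that fails either check would need to be resampled or downmixed first. That step is outside the scope of this sketch.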


This model provides frame-level voice activity prediction.
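Frame-level predictions are usually turned into speech segments before later diarization steps consume them. The sketch below shows one simple way to do that by thresholding per-frame speech probabilities. The function name, the 0.5 threshold, and the 10 ms frame shift are assumptions for illustration. This model card does not state the model's actual frame shift or the post-processing that Riva applies.

```python
def frames_to_segments(speech_probs, frame_shift_s=0.01, threshold=0.5):
    """Merge consecutive speech frames into (start_s, end_s) segments.

    speech_probs:  per-frame speech probabilities in [0, 1].
    frame_shift_s: time between frames in seconds (assumed value).
    threshold:     probability at or above which a frame counts as speech
                   (assumed value).
    """
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, prob in enumerate(speech_probs):
        is_speech = prob >= threshold
        if is_speech and start is None:
            start = i  # a speech segment begins
        elif not is_speech and start is not None:
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None  # the segment ended on the previous frame
    if start is not None:
        # Close a segment that runs to the end of the audio.
        segments.append((start * frame_shift_s, len(speech_probs) * frame_shift_s))
    return segments


# Example with made-up probabilities: two speech regions separated by silence.
print(frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```

Production pipelines often add smoothing, minimum segment durations, or padding around segments. Those refinements are omitted here to keep the core idea visible.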


[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
[2] NVIDIA NeMo Toolkit


By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.