This model can be used for Voice Activity Detection (VAD) and can serve as the first step for Speaker Diarization (SD).
The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes log-mel spectrogram input features with n_mels=80, so it can be integrated easily and efficiently with speaker diarization.
The model was trained on multiple publicly available datasets. The NeMo toolkit [2] was used to train this model for 50 epochs on multiple GPUs.
To use this model, start with the Riva Skills Quick Start guide, which is the starting point for trying out Riva models. Information regarding the Quick Start guide can be found here. To use the Riva Speech ASR service with this model, see the document, which has the necessary information.
This model accepts 16000 Hz (16 kHz) mono-channel audio (wav files) as input.
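A quick way to confirm that a file matches this input format before sending it to the model is to read its header. The following is a minimal sketch using only the Python standard library; the function name `check_wav_format` and the constants are illustrative and are not part of Riva or NeMo.

```python
import wave

# Expected input format, taken from the statement above.
EXPECTED_SAMPLE_RATE_HZ = 16000
EXPECTED_CHANNELS = 1  # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration only.
    """
    problems = []
    # Read only the header fields we need from the wav file.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE_HZ} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before use.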
This model provides frame-level voice activity prediction.
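Frame-level predictions are usually turned into speech segments before they are passed to a downstream step such as speaker diarization. The sketch below shows one simple way to do that, assuming the predictions are available as a list of per-frame speech probabilities. The function name `frames_to_segments`, the 0.5 threshold, and the 0.01 s frame shift are placeholder assumptions; this card does not state the values used by the model or by Riva.

```python
def frames_to_segments(speech_probs, threshold=0.5, frame_shift_s=0.01):
    """Merge consecutive speech frames into (start_s, end_s) segments.

    threshold and frame_shift_s are illustrative defaults, not values
    documented for this model.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, prob in enumerate(speech_probs):
        is_speech = prob >= threshold
        if is_speech and start is None:
            start = i  # a speech run begins
        elif not is_speech and start is not None:
            # the speech run ended at the previous frame
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    if start is not None:
        # audio ended while still inside a speech run
        segments.append((start * frame_shift_s, len(speech_probs) * frame_shift_s))
    return segments


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7]
    for start_s, end_s in frames_to_segments(probs):
        print(f"speech from {start_s:.2f}s to {end_s:.2f}s")
```

In practice, smoothing or minimum-duration filtering is often applied on top of plain thresholding to avoid very short segments.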
[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
[2] NVIDIA NeMo Toolkit
By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.