Speaker Diarization: MarbleNet Model Card
This model can be used for Voice Activity Detection (VAD) and can serve as the first step for Speaker Diarization (SD).
The model is based on the MarbleNet architecture presented in the MarbleNet paper (Jia et al., 2021). Unlike the paper, this model uses a log-mel spectrogram with n_mels=80 as its input feature, so it can be easily and efficiently integrated with speaker diarization.
The model was trained on multiple publicly available datasets. It was trained with the NeMo toolkit for 50 epochs on multiple GPUs.
How to Use this Model
To use this model, start with the Riva Skills Quick Start guide, which is a starting point for trying out Riva models. For details on using this model with the Riva Speech ASR service, refer to the Riva documentation.
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
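As a quick sanity check before sending audio to the model, the input requirement above can be verified with Python's standard `wave` module. This is a minimal sketch and not part of Riva or NeMo; the helper name `is_valid_vad_input` is hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE_HZ = 16000  # 16 kHz, as stated in this model card
EXPECTED_CHANNELS = 1            # mono


def is_valid_vad_input(path):
    """Return True if the WAV file at `path` is 16 kHz mono.

    Hypothetical helper for illustration; it only inspects the WAV header
    and does not validate the audio content itself.
    """
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav_file.getnchannels() == EXPECTED_CHANNELS
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before use.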
This model provides frame-level voice activity prediction.
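One common way to consume frame-level predictions is to threshold them and merge consecutive speech frames into time segments that a diarization pipeline can process. The sketch below assumes the output is available as a list of per-frame speech probabilities; the 0.5 threshold, the 0.01 s frame shift, and the function name `frames_to_segments` are illustrative assumptions, not values documented for this model.

```python
def frames_to_segments(speech_probs, threshold=0.5, frame_shift_s=0.01):
    """Merge consecutive speech frames into (start_s, end_s) segments.

    speech_probs:  per-frame speech probabilities (assumed output format).
    threshold:     frames with probability >= threshold count as speech (assumed).
    frame_shift_s: time between consecutive frames in seconds (assumed).
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, prob in enumerate(speech_probs):
        is_speech = prob >= threshold
        if is_speech and start is None:
            start = i  # a speech run begins
        elif not is_speech and start is not None:
            # the speech run ended at the previous frame
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    if start is not None:
        # close a run that extends to the end of the audio
        segments.append((start * frame_shift_s, len(speech_probs) * frame_shift_s))
    return segments
```

In practice, the threshold and any smoothing of short gaps would be tuned for the target audio.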
Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
NVIDIA NeMo Toolkit
By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.