Voice Activity Detection MarbleNet

Model Overview

This model can be used for Voice Activity Detection (VAD) and can serve as the first step for Automatic Speech Recognition (ASR) and Speaker Diarization (SD).

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes a log-mel spectrogram with n_mels=80 as its input feature, so it can be integrated easily and efficiently with ASR.
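To make the n_mels=80 input feature concrete, the following is a minimal sketch of how an 80-band log-mel feature vector could be computed from the power spectrum of a single frame. Only n_mels=80 comes from this card. The 16 kHz sample rate, the 512-point FFT, the HTK-style mel formula, and the helper names `mel_filterbank` and `log_mel_frame` are assumptions made for illustration, and the model's actual preprocessor may differ in all of them.

```python
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other mel variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=80, n_fft=512, sample_rate=16000):
    """Return n_mels triangular filters, each with n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 equally spaced points on the mel scale: band edges and centers.
    hz_points = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]

    filters = []
    for m in range(1, n_mels + 1):
        left, center, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_freqs:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            # Triangle: the lower of the two slopes, clipped at zero.
            row.append(max(0.0, min(rising, falling)))
        filters.append(row)
    return filters


def log_mel_frame(power_spectrum, filters, eps=1e-10):
    """Apply the filterbank to one frame's power spectrum and take the log."""
    return [
        math.log(sum(w * p for w, p in zip(row, power_spectrum)) + eps)
        for row in filters
    ]


if __name__ == "__main__":
    fb = mel_filterbank()
    flat_spectrum = [1.0] * (512 // 2 + 1)  # placeholder power spectrum
    features = log_mel_frame(flat_spectrum, fb)
    print(len(features))  # number of mel bands per frame
```

Sharing the same 80-band front end between VAD and ASR means the features can, in principle, be computed once and reused by both stages.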

Training

The model was trained on multiple publicly available datasets. It was trained with the NeMo toolkit for 50 epochs on multiple GPUs.

Datasets

The following datasets were used to train this model:

 - Subset of [Freesound background data](https://freesound.org/)
 - Subset of [MUSAN](https://www.openslr.org/17/)
 - Subset of [Fisher English Training Speech 2004](https://catalog.ldc.upenn.edu/LDC2004T19)
 - Subset of [Google Speech Commands v2](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html)
 - Subset of cleaned German (mcv7.0), Mandarin (aishell2), French (mls), Russian (mcv, ruls, sova), and Spanish (mls) data from the NeMo ASR set
 - Training subset of [AMI Meeting Corpus](https://groups.inf.ed.ac.uk/ami/corpus/)
 - [ICSI Meeting Corpus](https://groups.inf.ed.ac.uk/ami/icsi/)

Performance

The model achieves a TPR of 0.9093 at an FPR of 0.315 and an AUROC of 0.9112 on the ALL category of AVA-Speech [2]. Note that you might need to fine-tune the model and select optimal thresholds on your own data to improve performance.
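As an illustration of threshold selection, the sketch below computes TPR and FPR for frame-level scores at a given threshold and then picks the candidate threshold with the highest TPR whose FPR stays under a budget. The function names, the toy scores and labels, and the 0.01-step threshold grid are all hypothetical. They are not part of NeMo or Riva, and the numbers are not results from this model.

```python
def tpr_fpr(scores, labels, threshold):
    """Return (TPR, FPR) when frames with score >= threshold count as speech."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_speech = score >= threshold
        if predicted_speech and label:
            tp += 1
        elif predicted_speech and not label:
            fp += 1
        elif not predicted_speech and label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def pick_threshold(scores, labels, max_fpr, candidates=None):
    """Return (threshold, TPR, FPR) with the highest TPR subject to FPR <= max_fpr.

    Returns None if no candidate satisfies the FPR budget.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(101)]
    best = None
    for threshold in candidates:
        tpr, fpr = tpr_fpr(scores, labels, threshold)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (threshold, tpr, fpr)
    return best


if __name__ == "__main__":
    # Toy data: per-frame speech probabilities and ground-truth labels (1 = speech).
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [0, 0, 1, 1]
    print(tpr_fpr(scores, labels, 0.5))
    print(pick_threshold(scores, labels, max_fpr=0.5))
```

In practice the scores and labels would come from a held-out portion of your own data, and the FPR budget would depend on how costly false alarms are for the downstream ASR or diarization stage.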

How to Use this Model

To use this model, start with the Riva Skills Quick Start guide, which is a starting point for trying out Riva models. Information about the Quick Start guide can be found here. For using the Riva Speech ASR service with this model, the documentation has all the necessary information.

Input

Audio sample in which speech is to be detected, for example audio that is to be transcribed.

Output

This model provides frame-level voice activity prediction.
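Frame-level predictions usually need to be turned into speech segments before they are passed to ASR or diarization. The sketch below shows one simple way to do that by thresholding per-frame speech probabilities and merging consecutive speech frames. The function name `frames_to_segments`, the 0.5 threshold, and the 20 ms frame shift are assumptions for illustration. This card does not state the model's frame shift, and Riva's own post-processing may work differently.

```python
def frames_to_segments(probs, threshold=0.5, frame_shift=0.02):
    """Convert per-frame speech probabilities to (start_sec, end_sec) segments.

    frame_shift is the assumed time step between frames, in seconds.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((round(start * frame_shift, 3), round(i * frame_shift, 3)))
            start = None
    # Close a speech run that lasts until the end of the audio.
    if start is not None:
        segments.append(
            (round(start * frame_shift, 3), round(len(probs) * frame_shift, 3))
        )
    return segments


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7]
    print(frames_to_segments(probs))  # [(0.02, 0.06), (0.08, 0.1)]
```

Real pipelines often add smoothing, minimum segment durations, or padding around segments. Those steps are omitted here to keep the example short.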

Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that differs from the data it was trained on.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Chaudhuri, Sourish, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru et al. "AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies." arXiv preprint arXiv:1808.00606 (2018).

License

By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.