This model performs Voice Activity Detection (VAD), which can serve as the first step for Automatic Speech Recognition (ASR) and Speaker Diarization (SD). Unlike segment-based VAD, which predicts whether an input audio segment contains speech, this model is frame-based: it outputs a speech probability for each 20 ms frame of the input audio. The model is trained on a combination of synthetic and real-world data to make it more robust in very noisy conditions.
The model is based on the MarbleNet architecture presented in the MarbleNet paper. Unlike the paper, the first convolution uses a stride of 2, so the model subsamples its input by 2x. The input feature is an un-normalized log-mel spectrogram with n_mels=80, so the model can be integrated easily and efficiently with ASR. For an ASR+VAD pipeline, please refer to this example.
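The relationship between the 2x subsampling and the 20 ms output frames can be sketched with a little arithmetic. This is an illustration only: the 10 ms feature hop is an assumption inferred from the numbers above (it is not stated here), the helper name `approx_num_frames` is hypothetical, and the real model may differ by a frame or so because of padding.

```python
def approx_num_frames(duration_ms: int, hop_ms: int = 10, subsampling: int = 2) -> int:
    """Estimate how many VAD output frames an audio clip produces.

    hop_ms=10 is an ASSUMED spectrogram hop; subsampling=2 comes from the
    stride-2 first convolution described in the text.
    """
    feature_frames = duration_ms // hop_ms  # log-mel frames before the model
    return feature_frames // subsampling    # frames after 2x subsampling


# Each output frame then covers hop_ms * subsampling = 20 ms of audio.
frame_ms = 10 * 2
print(frame_ms, approx_num_frames(20_000))
```

Under these assumptions, a 20-second training segment maps to roughly 1000 output frames.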
The NeMo toolkit was used to train the models for 50 epochs with noise and gain augmentation. This model was trained with this example script and this base config.
The full config can be found inside the .nemo files.
While training this model, we used the following datasets:
We use the NeMo ASR data simulator to generate synthetic data. Each session is 3 minutes long, with the mean silence ratio set to 0.3 and the mean overlap set to 0.05. The variance of both silence and overlap is set to 0.005. The generated audio is split into 20-second segments for training.
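The segmentation step can be pictured with a short sketch. It only illustrates how a 3-minute session divides into 20-second windows; it is not the NeMo data simulator's own code, and the function name `split_session` is hypothetical.

```python
def split_session(session_s: float = 180.0, segment_s: float = 20.0):
    """Return (start, end) times in seconds that tile one session."""
    segments = []
    start = 0.0
    while start < session_s:
        end = min(start + segment_s, session_s)  # clip the last window
        segments.append((start, end))
        start = end
    return segments


segments = split_session()
print(len(segments), segments[0], segments[-1])
```

With the defaults above, each simulated session yields nine training segments.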
The synthetic dataset consists of the following:
A subset of cleaned German (mcv7.0), Mandarin (aishell2), French (mls), Russian (mcv, ruls, sova), and Spanish (mls) data from the NeMo ASR set, about 2.5K hours in total.
The AUROC performance is listed in the following table.
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="vad_multilingual_frame_marblenet")
python <NEMO_ROOT>/examples/asr/speech_classification/frame_vad_infer.py --config-path="../conf/vad" --config-name="frame_vad_infer_postprocess.yaml" dataset=<Path of manifest file of evaluation data, where audio files should have unique names>
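The command above expects a manifest file. As a rough sketch, the manifest can be written as one JSON object per line. The field names used here (`audio_filepath`, `offset`, `duration`, `label`) are assumptions based on common NeMo manifest conventions and should be checked against the NeMo documentation. The helper `write_manifest` is hypothetical, and it also enforces the unique-file-name requirement mentioned in the command.

```python
import json
from pathlib import Path


def write_manifest(audio_paths, manifest_path):
    """Write a JSON-lines manifest, rejecting duplicate file names."""
    names = [Path(p).name for p in audio_paths]
    if len(names) != len(set(names)):
        raise ValueError("audio files must have unique names")
    with open(manifest_path, "w", encoding="utf-8") as f:
        for p in audio_paths:
            # ASSUMED schema; verify the expected keys in the NeMo docs.
            entry = {
                "audio_filepath": str(p),
                "offset": 0,
                "duration": None,
                "label": "infer",
            }
            f.write(json.dumps(entry) + "\n")
```

The resulting file path is what would be passed as the `dataset=` argument.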
This model accepts 16 kHz mono-channel audio (WAV files) as input.
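Before running inference, it can help to confirm that a file matches this format. The following is a minimal sketch using only Python's standard `wave` module; the function name `is_valid_vad_input` is hypothetical and not part of NeMo.

```python
import wave


def is_valid_vad_input(path: str) -> bool:
    """Return True if the WAV file is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail this check would need to be resampled or downmixed first, for example with an external audio tool.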
This model outputs a sequence of speech probabilities, one for each 20 ms frame of the input audio.
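To show how such per-frame probabilities might be turned into speech segments, here is a minimal thresholding sketch. It is not NeMo's postprocessing: the 0.5 threshold and the function name `probs_to_segments` are assumptions for illustration, and a real pipeline would typically add smoothing and minimum-duration rules.

```python
def probs_to_segments(probs, threshold=0.5, frame_s=0.02):
    """Group consecutive frames with prob >= threshold into (start, end) seconds."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((round(start * frame_s, 2), round(i * frame_s, 2)))
            start = None
    if start is not None:  # speech run reaches the end of the audio
        segments.append((round(start * frame_s, 2), round(len(probs) * frame_s, 2)))
    return segments


print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```

Here frames 1 to 2 and frame 4 exceed the threshold, so the sketch reports two segments, one starting at 0.02 s and one at 0.08 s.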
Since this model was trained on publicly available datasets, its performance might degrade on custom data that differs from what it was trained on.
Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.