This model can be used for Voice Activity Detection (VAD) on telephone conversations such as those in CALLHOME.
The model is based on the MarbleNet architecture presented in the MarbleNet paper (Jia et al., 2021). It takes log-mel spectrogram features as input, whereas the vad_marblenet model uses MFCC features.
The model was trained on multiple publicly available datasets. It was trained with the NeMo toolkit for several hundred epochs on multiple GPUs.
While training this model, we used the following datasets:
The model achieves a false alarm (FA) rate of 3.4% and a miss (MISS) rate of 3.6% when used for speaker diarization on CH-109 (the 109 conversations from the CALLHOME American English Speech (LDC97S42) corpus that have only 2 speakers), with threshold t=0.7, collar=0.25, and overlapping speech skipped. The threshold was tuned on 11 multi-speaker sessions from CALLHOME. Note that you might need to fine-tune the model and select an optimal threshold on your own data to improve performance.
import nemo
import nemo.collections.asr as nemo_asr
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="MarbleNet-3x2x64-Telephony")
python NeMo/examples/asr/vad_infer.py --vad_model=vad_telephony_marblenet.nemo --dataset=/fullpath/to/manifest/ --out_dir='frame/demo' --time_length=0.15
You can use the posteriors and select an optimal threshold in NeMo to achieve better results.
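To illustrate what threshold selection on posteriors involves, here is a minimal sketch in plain Python. It is not NeMo's own tuning utility: the function names `fa_miss_rates` and `select_threshold`, the toy posteriors, and the candidate thresholds are all hypothetical. It also scores at the frame level, normalizing both error types by the number of reference speech frames, which is a simplification of collar-based diarization scoring.

```python
def fa_miss_rates(posteriors, reference, threshold):
    """Frame-level FA and MISS rates, both relative to reference speech frames.

    posteriors: per-frame speech probabilities (assumed to come from the VAD model).
    reference:  per-frame ground truth, 1 for speech and 0 for non-speech.
    """
    if len(posteriors) != len(reference):
        raise ValueError("posteriors and reference must have the same length")
    speech_frames = sum(1 for r in reference if r)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")
    decisions = [p >= threshold for p in posteriors]
    # False alarm: predicted speech where the reference is non-speech.
    false_alarms = sum(1 for d, r in zip(decisions, reference) if d and not r)
    # Miss: predicted non-speech where the reference is speech.
    misses = sum(1 for d, r in zip(decisions, reference) if r and not d)
    return false_alarms / speech_frames, misses / speech_frames


def select_threshold(posteriors, reference, candidates):
    """Return the candidate threshold with the smallest FA + MISS."""
    return min(candidates, key=lambda t: sum(fa_miss_rates(posteriors, reference, t)))


# Toy data, for illustration only.
posteriors = [0.05, 0.20, 0.65, 0.90, 0.95, 0.80, 0.75, 0.10]
reference = [0, 0, 0, 1, 1, 1, 1, 0]
best = select_threshold(posteriors, reference, [0.3, 0.5, 0.7, 0.9])
print(best, fa_miss_rates(posteriors, reference, best))
```

Minimizing the sum of FA and MISS is only one possible criterion; if one error type is more costly for your application, a weighted sum may be a better fit.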
This model accepts 16 kHz (16000 Hz) mono-channel audio (wav files) as input.
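Before running inference, it can help to verify that a file matches this input format. The sketch below uses only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of NeMo, and it only handles uncompressed PCM wav files that the `wave` module can open.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"file has {channels} channel(s), expected {expected_channels}")
    return problems
```

A file that reports problems would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.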
This model provides frame-level voice activity prediction.
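Frame-level predictions are often converted into speech segments for downstream use. The following is a minimal sketch of that step, assuming the predictions have already been thresholded into per-frame 0/1 decisions. The function name `frames_to_segments` and the 0.01 s frame shift in the example are hypothetical; use the shift that matches your inference settings.

```python
def frames_to_segments(decisions, frame_shift):
    """Merge runs of consecutive speech frames into (start, end) times in seconds."""
    segments = []
    start = None  # index of the first frame in the current speech run
    for index, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = index
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, index * frame_shift))
            start = None
    if start is not None:
        # The audio ended while still inside a speech run.
        segments.append((start * frame_shift, len(decisions) * frame_shift))
    return segments


# Toy posteriors thresholded at t=0.7, with an assumed 0.01 s frame shift.
posteriors = [0.1, 0.8, 0.9, 0.75, 0.2, 0.1, 0.95, 0.85]
decisions = [p >= 0.7 for p in posteriors]
for start, end in frames_to_segments(decisions, frame_shift=0.01):
    print(f"speech from {start:.2f} s to {end:.2f} s")
```

This sketch does no smoothing, so a single misclassified frame splits a segment in two; in practice some form of smoothing or minimum-duration filtering is commonly applied.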
Since this model was trained on publicly available datasets, its performance might degrade on custom data that the model has not been trained on.
Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.