VAD telephony Marblenet

Description: MarbleNet VAD model for telephony data
Publisher: NVIDIA
Latest Version: 1.0.0rc1
Modified: April 4, 2023
Size: 347.31 KB

Model Overview

This model can be used for Voice Activity Detection (VAD) on telephone conversations, such as those in the CALLHOME corpus.

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike vad_marblenet, which uses MFCC features, this model takes log-mel spectrograms as input.

Training

The model was trained on multiple publicly available datasets. The NeMo toolkit was used to train it for several hundred epochs on multiple GPUs.

Datasets

While training this model, we used the following datasets:

  1. Subset of Freesound background data
  2. Training subset of AMI Meeting Corpus
  3. Subset of Fisher English Training Speech
  4. Subset of Switchboard (Disk6 of 2000 NIST Speaker Recognition Evaluation)
  5. Subset of MUSAN

Performance

When used for speaker diarization on CH-109 (the 109 two-speaker conversations from the CALLHOME American English Speech corpus, LDC97S42), the model achieves a false alarm (FA) rate of 3.4% and a miss (MISS) rate of 3.6% at threshold t=0.7, scored with collar=0.25 and overlapping speech skipped. The threshold was tuned on 11 multi-speaker sessions from CALLHOME. You may need to fine-tune the model and select an optimal threshold on your own data to improve performance.

How to Use this Model

Automatically load the model from NGC

import nemo
import nemo.collections.asr as nemo_asr

# from_pretrained fetches the checkpoint from NGC and returns the restored model
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="MarbleNet-3x2x64-Telephony")

Perform VAD for your audio with this model

python NeMo/examples/asr/vad_infer.py --vad_model=vad_telephony_marblenet.nemo --dataset=/fullpath/to/manifest/ --out_dir='frame/demo' --time_length=0.15
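
The --dataset argument above points to a manifest. As a minimal sketch, assuming the usual NeMo manifest layout of one JSON object per line with audio_filepath, offset, duration and label fields, such a file could be generated as follows. The helper name write_vad_manifest, the "infer" placeholder label and the null duration are assumptions for illustration; check them against the manifest format expected by your NeMo version.

```python
import json


def write_vad_manifest(audio_paths, manifest_path, duration=None):
    """Write one JSON object per line, one line per audio file.

    Field names follow the common NeMo manifest layout (an assumption here).
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in audio_paths:
            entry = {
                "audio_filepath": str(path),
                "offset": 0,
                "duration": duration,  # assumed: None (null) means the whole file
                "label": "infer",      # assumed placeholder label for inference
            }
            f.write(json.dumps(entry) + "\n")
    return manifest_path
```

For example, write_vad_manifest(["/data/call_001.wav"], "manifest.json") writes a single-line manifest (the audio path is hypothetical).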

You can use the frame-level posteriors to select an optimal threshold in NeMo and achieve better results.
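
Threshold selection can be sketched as a simple sweep over candidate values on held-out data. The code below is a simplified illustration, not the NeMo implementation: it works on plain Python lists of frame posteriors and 0/1 reference labels, ignores collar and overlap handling, and normalizes both error counts by the number of reference speech frames (an assumption). The function names are hypothetical.

```python
def fa_miss_rates(posteriors, labels, threshold):
    """Frame-level false-alarm and miss rates relative to reference speech frames."""
    speech = sum(labels)
    if speech == 0:
        raise ValueError("reference labels contain no speech frames")
    # False alarm: predicted speech where the reference says non-speech.
    fa = sum(1 for p, y in zip(posteriors, labels) if p >= threshold and y == 0)
    # Miss: predicted non-speech where the reference says speech.
    miss = sum(1 for p, y in zip(posteriors, labels) if p < threshold and y == 1)
    return fa / speech, miss / speech


def best_threshold(posteriors, labels, candidates):
    """Return (threshold, fa, miss) for the candidate minimizing fa + miss."""
    best = None
    for t in candidates:
        fa, miss = fa_miss_rates(posteriors, labels, t)
        if best is None or fa + miss < best[1] + best[2]:
            best = (t, fa, miss)
    return best
```

Minimizing FA + MISS is only one possible criterion; depending on the downstream task, you may prefer to weight misses and false alarms differently.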

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
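
A quick pre-check of the input format can be sketched with Python's standard wave module, which reads uncompressed PCM WAV files. The helper name check_wav_format is hypothetical, and the check assumes the intended sample rate is 16000 Hz.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

Files that fail the check would need to be resampled or downmixed before inference.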

Output

This model provides frame-level voice activity prediction.
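
To illustrate how frame-level predictions can be turned into speech segments, the sketch below thresholds a list of per-frame speech probabilities and merges consecutive speech frames. It is not NeMo's post-processing: the function name is hypothetical, the default threshold of 0.7 is taken from the Performance section, and the 0.01 s frame shift is an assumed value that should match your inference settings.

```python
def frames_to_segments(posteriors, threshold=0.7, frame_shift=0.01):
    """Merge consecutive speech frames into (start_sec, end_sec) segments."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, p in enumerate(posteriors):
        is_speech = p >= threshold
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    # Close a speech run that lasts until the final frame.
    if start is not None:
        segments.append((start * frame_shift, len(posteriors) * frame_shift))
    return segments
```

In practice, smoothing or minimum-duration filtering is often applied on top of such raw segments to remove very short bursts.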

Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that differs from the data it was trained on.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.