VAD telephony Marblenet

Description

MarbleNet VAD model for telephony data

Publisher

NVIDIA

Use Case

Other

Framework

PyTorch with NeMo

Latest Version

1.0.0rc1

Modified

June 30, 2021

Size

347.31 KB

Model Overview

This model performs Voice Activity Detection (VAD) on telephone conversations, such as those in the CALLHOME corpus.

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. It takes log-mel spectrogram features as input, whereas the related vad_marblenet model uses MFCC features.

Training

The model was trained on multiple publicly available datasets. Training used the NeMo toolkit and ran for several hundred epochs on multiple GPUs.

Datasets

While training this model, we used the following datasets:

  1. Subset of Freesound background data
  2. Training subset of AMI Meeting Corpus
  3. Subset of Fisher English Training Speech
  4. Subset of Switchboard (Disk6 of 2000 NIST Speaker Recognition Evaluation)
  5. Subset of MUSAN

Performance

The model achieves a false alarm (FA) rate of 3.4% and a missed detection (MISS) rate of 3.6% when used for speaker diarization on CH-109 (the 109 conversations from the CALLHOME American English Speech (LDC97S42) corpus that have 2 speakers only), with threshold t=0.7, collar=0.25, and overlapping speech skipped. The threshold was tuned on 11 multi-speaker sessions from CALLHOME. Note that you might need to fine-tune the model and select an optimal threshold on your own data to improve performance.
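
The FA and MISS figures depend on the threshold applied to the frame-level speech posteriors. The following is a minimal sketch of that relationship, not the scoring procedure used for CH-109. It assumes posteriors and binary reference labels are available as plain lists, and that both rates are normalised by the number of reference speech frames (one common convention in diarization scoring). Collar and overlap handling are not modelled, and the function names `fa_miss` and `pick_threshold` are hypothetical, not part of NeMo.

```python
from typing import Sequence, Tuple


def fa_miss(posteriors: Sequence[float], reference: Sequence[int],
            threshold: float) -> Tuple[float, float]:
    """Return (false-alarm rate, miss rate) at one threshold.

    Assumption: both rates are normalised by the number of reference
    speech frames; no collar or overlap handling is applied.
    """
    if len(posteriors) != len(reference):
        raise ValueError("posteriors and reference must have equal length")
    speech_frames = sum(reference)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")
    # Frame predicted as speech although the reference says non-speech.
    false_alarm = sum(1 for p, r in zip(posteriors, reference)
                      if p >= threshold and r == 0)
    # Frame predicted as non-speech although the reference says speech.
    miss = sum(1 for p, r in zip(posteriors, reference)
               if p < threshold and r == 1)
    return false_alarm / speech_frames, miss / speech_frames


def pick_threshold(posteriors: Sequence[float], reference: Sequence[int],
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the lowest FA + MISS."""
    return min(candidates,
               key=lambda t: sum(fa_miss(posteriors, reference, t)))


# Toy development data, invented for illustration only.
posteriors = [0.1, 0.8, 0.9, 0.6, 0.2, 0.75]
reference = [0, 1, 1, 1, 0, 0]

fa, miss = fa_miss(posteriors, reference, threshold=0.7)
best = pick_threshold(posteriors, reference, [0.3, 0.5, 0.7, 0.9])
```

On real data, the sweep would run over held-out sessions, in the same spirit as the 11 CALLHOME sessions used for tuning here.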

How to Use this Model

Automatically load the model from NGC

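import nemo.collections.asr as nemo_asr

# Download the pretrained checkpoint from NGC and restore the model.
# The name below matches the vad_telephony_marblenet.nemo checkpoint used in the
# inference command; run EncDecClassificationModel.list_available_models() to
# confirm the names available in your NeMo version.
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="vad_telephony_marblenet")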

Perform VAD for your audio with this model

python NeMo/examples/asr/vad_infer.py  --vad_model=vad_telephony_marblenet.nemo --dataset=/fullpath/to/manifest/ --out_dir='frame/demo' --time_length=0.15
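
The `--dataset` argument points to a manifest describing the audio to process. The sketch below writes a JSON-lines manifest with one entry per audio file. The field names and placeholder values (`offset`, `duration`, `label`, `text`) are an assumption based on the usual NeMo manifest layout; check the docstring of the inference script in your NeMo version before relying on them. The helper name `write_manifest` is hypothetical.

```python
import json
from typing import Iterable


def write_manifest(audio_paths: Iterable[str], manifest_path: str) -> int:
    """Write one JSON object per line; return the number of entries written."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in audio_paths:
            entry = {
                "audio_filepath": str(path),
                "offset": 0,
                "duration": None,  # assumption: null means "use the whole file"
                "label": "infer",  # assumption: placeholder label for inference
                "text": "-",       # assumption: placeholder transcript
            }
            f.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

For example, `write_manifest(["/data/call_001.wav"], "manifest.json")` produces a one-line manifest; both paths are illustrative.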

You can use the output posteriors to select an optimal threshold in NeMo and achieve better results.

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
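
Before running inference, it can help to confirm that each file matches this input format. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper name `check_wav` is hypothetical and not part of NeMo.

```python
import wave
from typing import List

EXPECTED_RATE_HZ = 16000  # 16 kHz, per the input description


def check_wav(path: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks suitable."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    return problems
```

Files that fail the check need resampling or downmixing with an audio tool of your choice before they are listed in the manifest.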

Output

This model provides frame-level voice activity prediction.
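
Frame-level predictions are often easier to consume as speech segments. The sketch below merges consecutive speech frames into (start, end) times. It assumes the predictions have already been binarised (1 for speech, 0 for non-speech) and that frames are evenly spaced. The `frame_shift_s` parameter and the function name `frames_to_segments` are hypothetical; the actual frame spacing depends on how inference was run.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(decisions: Sequence[int],
                       frame_shift_s: float) -> List[Tuple[float, float]]:
    """Merge runs of speech frames into (start, end) segments in seconds."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_shift_s, i * frame_shift_s))
            start = None
    # Close a run that extends to the last frame.
    if start is not None:
        segments.append((start * frame_shift_s, len(decisions) * frame_shift_s))
    return segments


# Toy input: six frames spaced 0.5 s apart (illustrative value only).
segments = frames_to_segments([0, 1, 1, 0, 0, 1], frame_shift_s=0.5)
```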

Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that the model has not been trained on.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.