
VAD Marblenet

Description
MarbleNet VAD model
Publisher
NVIDIA
Latest Version
1.0.0rc1
Modified
April 4, 2023
Size
360.59 KB

Model Overview

This model was trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets and can be used for Voice Activity Detection (VAD).

Model Architecture

The model is based on the MarbleNet architecture and follows the exact setup presented in the MarbleNet paper [1]. The input feature of this model is MFCC, while vad_telephony_marblenet uses a log-mel spectrogram.

Training

The model was trained on the Google Speech Commands v2 (speech) and Freesound (background categories) datasets. The NeMo toolkit was used to train this model for several hundred epochs on multiple GPUs.

Datasets

While training this model, we used the following datasets:

  1. Google Speech Commands v2
  2. Subset of Freesound background data

Performance

The model achieves 0.858±0.016 TPR at FPR = 0.315 and 0.858±0.011 AUROC on the ALL category of AVA-Speech [2]. For more details about model performance, please refer to the MarbleNet paper. Note that you may need to fine-tune the model and select an optimal threshold on your own data to boost performance.
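
As an illustration of threshold selection, the sketch below picks an operating point from an ROC curve computed on your own labeled frames. The labels, scores, and scikit-learn usage are illustrative assumptions, not part of the released model; only the 0.315 target FPR is taken from the result above.

# Hedged sketch: choose a decision threshold on your own labeled data.
# y_true (1 = speech, 0 = background) and y_score (per-frame P(speech))
# are illustrative placeholders, not outputs of this model card.
from sklearn.metrics import roc_curve

y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_score = [0.20, 0.90, 0.70, 0.40, 0.80, 0.10, 0.65, 0.30]

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pick the threshold whose false positive rate is closest to a target,
# e.g. the FPR = 0.315 operating point quoted above.
target_fpr = 0.315
i = min(range(len(fpr)), key=lambda k: abs(fpr[k] - target_fpr))
print(f"threshold={thresholds[i]:.2f}, TPR={tpr[i]:.2f}, FPR={fpr[i]:.2f}")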

How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

# Load the pre-trained VAD checkpoint from NGC
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="MarbleNet-3x2x64")
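
The loaded checkpoint can also serve as a starting point for fine-tuning on another dataset, as noted above. The sketch below is a minimal illustration assuming NeMo's classification-model conventions; the manifest path, label names, and config keys are assumptions to verify against your NeMo version's documentation.

# Hedged fine-tuning sketch: the config keys, label names, and manifest
# path below are assumptions based on NeMo conventions, not a verified
# recipe for this exact model version.
import pytorch_lightning as pl
from omegaconf import OmegaConf

train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # your labeled data
    "sample_rate": 16000,
    "labels": ["background", "speech"],
    "batch_size": 64,
    "shuffle": True,
})
vad_model.setup_training_data(train_data_config=train_cfg)

trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
trainer.fit(vad_model)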

Perform VAD for your audio with this model

python NeMo/examples/asr/vad_infer.py --vad_model=vad_marblenet.nemo --dataset=/fullpath/to/manifest/ --out_dir='frame/demo' --time_length=0.63
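
The --dataset argument points to a manifest file describing the audio to process. Below is a minimal sketch of writing one entry per line; the field names follow NeMo's common manifest convention and should be checked against the vad_infer.py of your NeMo version.

# Hedged sketch: write a one-line NeMo-style manifest for VAD inference.
# The field names follow NeMo's common manifest convention and should be
# verified against the vad_infer.py of your NeMo version.
import json

entry = {
    "audio_filepath": "/path/to/audio.wav",  # 16 kHz mono WAV
    "offset": 0,
    "duration": None,  # None processes the full file
    "label": "infer",  # placeholder label for inference
}
with open("vad_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")  # one JSON object per line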

You can use the model's posteriors and select an optimal threshold in NeMo to achieve better results.
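
For example, the sketch below applies a custom threshold to frame-level speech posteriors and merges consecutive speech frames into segments; the posterior values, frame length, and threshold are illustrative assumptions.

# Hedged sketch: threshold frame-level speech posteriors and merge
# consecutive speech frames into (start, end) segments in seconds.
# The posteriors, frame length, and threshold here are illustrative.
speech_probs = [0.12, 0.80, 0.91, 0.40, 0.75, 0.88]  # per-frame P(speech)
threshold = 0.5   # tune on held-out data from your domain
frame_len = 0.01  # seconds per frame, illustrative

speech_frames = [p >= threshold for p in speech_probs]

segments, start = [], None
for i, is_speech in enumerate(speech_frames + [False]):  # sentinel frame
    if is_speech and start is None:
        start = i * frame_len
    elif not is_speech and start is not None:
        segments.append((start, i * frame_len))
        start = None
print(segments)  # e.g. [(0.01, 0.03), (0.04, 0.06)]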

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
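
If your audio is not already in this format, a minimal conversion sketch is below; it assumes the librosa and soundfile packages, which are not requirements of the model itself.

# Hedged sketch: convert arbitrary audio to the 16 kHz mono WAV format
# this model expects. Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # resample + downmix
sf.write("input_16k_mono.wav", audio, 16000)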

Output

This model provides frame-level voice activity prediction.

Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that it has not been trained on.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Chaudhuri, Sourish, Joseph Roth, Daniel P. W. Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru, et al. "AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies." arXiv preprint arXiv:1808.00606 (2018).

License

The license to use this model is covered by the NGC TERMS OF USE unless another License/Terms of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.