VAD multilingual Marblenet

MarbleNet VAD model with multilingual data

Model Overview

This model can be used for Voice Activity Detection (VAD) and can serve as the first step for Automatic Speech Recognition (ASR) and Speaker Diarization (SD).

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes a log-mel spectrogram with n_mels=80 as its input feature, so it can be integrated easily and efficiently with ASR.

Training

The model was trained on multiple publicly available datasets. The NeMo toolkit was used to train this model for 50 epochs on multiple GPUs.

Datasets

While training this model, we used the following datasets:

Performance

On the ALL category of AVA-Speech [2], the model achieves a TPR of 0.9093 at FPR = 0.315 and an AUROC of 0.9112. For more details about the model's performance and parameters, refer to the config YAML file in NeMo. Note that you may need to fine-tune the model and select optimal thresholds on your own data to improve performance.
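To make the TPR/FPR trade-off concrete, here is a minimal sketch of how a decision threshold could be chosen from frame-level scores on your own labeled data. It is not part of NeMo; the function names `tpr_fpr` and `best_threshold`, the candidate grid, and the toy data are assumptions made for illustration only.

```python
def tpr_fpr(scores, labels, threshold):
    """Return (TPR, FPR) for frame-level speech scores at one threshold.

    scores: per-frame speech probabilities in [0, 1]
    labels: per-frame ground truth, 1 = speech, 0 = non-speech
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_speech = score >= threshold
        if predicted_speech and label == 1:
            tp += 1
        elif predicted_speech and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def best_threshold(scores, labels, max_fpr, candidates=None):
    """Pick the candidate with the highest TPR whose FPR stays <= max_fpr.

    Returns (threshold, tpr, fpr), or None if no candidate qualifies.
    The default grid 0.01 .. 0.99 is an arbitrary illustrative choice.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best = None
    for threshold in candidates:
        tpr, fpr = tpr_fpr(scores, labels, threshold)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (threshold, tpr, fpr)
    return best


# Toy, made-up scores and labels (not results from this model).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(tpr_fpr(toy_scores, toy_labels, 0.5))
print(best_threshold(toy_scores, toy_labels, max_fpr=0.0))
```

Sweeping the threshold this way on a held-out set is one simple way to pick an operating point; it does not replace the postprocessing parameters exposed in the NeMo config.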

How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo
import nemo.collections.asr as nemo_asr
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="vad_multilingual_marblenet")

Perform VAD for your audio with this model

python vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset=<Path of json file of evaluation data. Audio files should have unique names>
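The `dataset` argument points to a JSON manifest of the audio to evaluate. As a rough sketch, such a file could be generated as below, with one JSON object per line. The field names (`audio_filepath`, `offset`, `duration`, `label`, `text`) and their values follow the usual NeMo manifest convention but are assumptions here, as is the helper name `write_vad_manifest`; check the NeMo documentation for the exact fields your version expects.

```python
import json
from pathlib import Path


def write_vad_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, one line per audio file."""
    # The inference script asks for unique audio file names, so check that first.
    names = [Path(p).name for p in audio_paths]
    if len(names) != len(set(names)):
        raise ValueError("audio files should have unique names")

    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in audio_paths:
            entry = {
                "audio_filepath": str(path),
                "offset": 0,
                "duration": None,  # assumption: null means "use the whole file"
                "label": "infer",  # assumption: placeholder label for inference
                "text": "-",       # assumption: placeholder transcript
            }
            f.write(json.dumps(entry) + "\n")
    return manifest_path
```

For example, `write_vad_manifest(["/data/a.wav", "/data/b.wav"], "manifest.json")` would produce a two-line file that can be passed as `dataset=manifest.json`; the paths are hypothetical.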

You can use the posteriors and select optimal postprocessing thresholds in NeMo to achieve better results.

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
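Before running inference, it can help to confirm that each file matches this format. The sketch below uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files; the helper name `check_vad_input` is hypothetical and not part of NeMo.

```python
import wave


def check_vad_input(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems for a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"found {channels} channel(s), expected {expected_channels}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being listed in the manifest.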

Output

This model provides frame-level voice activity prediction.
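Frame-level predictions are often converted into speech segments with start and end times. The following is a simplified sketch of that conversion using a single threshold; it is not the NeMo postprocessing pipeline. The function name `frames_to_segments` and the default `threshold` and `frame_shift` values are assumptions for illustration; use the frame shift and thresholds that match your inference config.

```python
def frames_to_segments(probs, threshold=0.5, frame_shift=0.02):
    """Turn per-frame speech probabilities into (start, end) segments in seconds.

    probs: one speech probability per frame, in time order
    threshold: frames with probability >= threshold count as speech
    frame_shift: time between consecutive frames, in seconds (assumed value)
    """
    segments = []
    start = None
    for i, prob in enumerate(probs):
        is_speech = prob >= threshold
        if is_speech and start is None:
            start = i  # a speech run begins at this frame
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        # Close a speech run that lasts until the end of the audio.
        segments.append((start * frame_shift, len(probs) * frame_shift))
    return segments
```

Calling `frames_to_segments(probs)` on the posteriors of one file returns a list of `(start, end)` tuples; merging short gaps or dropping very short segments would be further postprocessing steps not shown here.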

Limitations

Because this model was trained on publicly available datasets, its performance may degrade on custom data that differs from the training data.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Chaudhuri, Sourish, Joseph Roth, Daniel PW Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru et al. "Ava-speech: A densely labeled dataset of speech activity in movies." arXiv preprint arXiv:1808.00606 (2018).

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Publisher: NVIDIA
Latest Version: 1.10.0
Updated: April 4, 2023 UTC
Compressed Size: 490 KB
