VAD multilingual Marblenet

MarbleNet VAD model with multilingual data

Model Overview

This model can be used for Voice Activity Detection (VAD) and can serve as the first step for Automatic Speech Recognition (ASR) and Speaker Diarization (SD).

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, this model takes a log-mel spectrogram with n_mels=80 as its input feature, so it can be easily and efficiently integrated with ASR.

Training

The model was trained on multiple publicly available datasets. The NeMo toolkit was used to train this model for 50 epochs on multiple GPUs.

Datasets

While training this model, we used the following datasets:

Subset of Freesound background data
Subset of MUSAN
Subset of Fisher English Training Speech 2004
Subset of Google Speech Command v2
Subsets of cleaned German (mcv7.0), Mandarin (aishell2), French (mls), Russian (mcv, ruls, sova), and Spanish (mls) data from the NeMo ASR set
Training subset of AMI Meeting Corpus
ICSI Meeting Corpus

Performance

The model achieves a TPR of 0.9093 at FPR = 0.315 and an AUROC of 0.9112 on the ALL category of AVA-Speech [2]. For more details about the model performance and parameters, please refer to the config YAML file in NeMo. Note that you may need to fine-tune the model and select optimal thresholds on your own data to improve performance.
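For readers less familiar with these metrics, the sketch below shows how TPR and FPR at a fixed threshold, and AUROC, can be computed from frame-level speech probabilities and reference labels. It is a minimal pure-Python illustration; the function names and the toy data are hypothetical, and this is not the evaluation code used to produce the numbers above.

```python
def tpr_fpr(probs, labels, threshold):
    """Return (TPR, FPR) when frames with prob >= threshold are called speech."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def auroc(probs, labels):
    """AUROC as the probability that a random speech frame outscores a
    random non-speech frame (ties count as one half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical values): 1 = speech, 0 = non-speech.
probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
tpr, fpr = tpr_fpr(probs, labels, threshold=0.5)
area = auroc(probs, labels)
```

Sweeping the threshold and recording the resulting (FPR, TPR) pairs traces the ROC curve, which is one way to pick an operating point for your own data.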

How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo
import nemo.collections.asr as nemo_asr
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="vad_multilingual_marblenet")

Perform VAD for your audio with this model

python vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset=<Path of json file of evaluation data. Audio files should have unique names>
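The dataset argument points to a manifest file. A minimal sketch for building one is shown below, assuming the JSON-lines layout commonly used for NeMo manifests, with one JSON object per line containing audio_filepath, offset, and duration. The exact set of required fields is an assumption here, so check the documentation of vad_infer.py in your NeMo version. The helper name and its arguments are hypothetical; it also rejects files whose base names collide, following the unique-name requirement above.

```python
import json
from pathlib import Path


def write_vad_manifest(items, manifest_path):
    """Write a JSON-lines manifest from (audio_path, duration_seconds) pairs.

    Raises ValueError if two audio files share the same base name.
    Returns the number of entries written.
    """
    seen = set()
    lines = []
    for audio_path, duration in items:
        path = Path(audio_path)
        if path.stem in seen:
            raise ValueError(f"duplicate audio file name: {path.stem}")
        seen.add(path.stem)
        entry = {
            "audio_filepath": str(path),  # field names are assumed, see above
            "offset": 0,
            "duration": duration,
        }
        lines.append(json.dumps(entry))
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```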

You can use the posteriors and select optimal postprocessing thresholds in NeMo to achieve better results.
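One common way to turn frame posteriors into speech segments is hysteresis thresholding with separate onset and offset thresholds. The sketch below is a generic pure-Python illustration of that idea, not NeMo's postprocessing code; the function name, the threshold values, and the 0.02 s frame shift are all hypothetical and should be tuned on your own data.

```python
def posteriors_to_segments(probs, onset=0.7, offset=0.4, frame_shift=0.02):
    """Return (start_sec, end_sec) speech segments from per-frame probabilities.

    A segment opens when the probability reaches `onset` and closes when it
    drops below `offset`.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i
        elif start is not None and p < offset:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:  # speech runs to the end of the signal
        segments.append((start * frame_shift, len(probs) * frame_shift))
    return segments


# Toy posteriors (hypothetical values), one per frame.
probs = [0.1, 0.8, 0.5, 0.45, 0.3, 0.2, 0.9, 0.95]
segments = posteriors_to_segments(probs)
```

With these toy values the helper yields two segments, roughly 0.02 to 0.08 s and 0.12 to 0.16 s. Using a lower offset than onset keeps a segment open through brief dips in the posterior.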

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
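A quick way to check that a file matches this format is to read its header with Python's standard wave module. The helper name below is hypothetical, and the check only covers uncompressed PCM WAV files that the wave module can open.

```python
import wave


def is_16khz_mono(wav_path):
    """Return True if the WAV file is single-channel and sampled at 16000 Hz."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000
```

Files that fail the check would need to be resampled or downmixed before inference.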

Output

This model provides frame-level voice activity prediction.

Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that differs from the training data.

References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Chaudhuri, Sourish, Joseph Roth, Daniel PW Ellis, Andrew Gallagher, Liat Kaver, Radhika Marvin, Caroline Pantofaru et al. "AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies." arXiv preprint arXiv:1808.00606 (2018).

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Publisher

NVIDIA

Latest Version

1.10.0

Updated

April 4, 2023 UTC

Compressed Size

490 KB

Labels

AI, Automatic Speech Recognition, Conversational AI, de, DL, en, es, fr, PytorchLightning, ru, zh