
Frame-VAD Multilingual MarbleNet

Frame-based VAD model using MarbleNet model trained with real multilingual data and synthetic English data.
Latest Version
May 2, 2023
490 KB

Model Overview

This is a model for Voice Activity Detection (VAD), which can serve as the first step for Automatic Speech Recognition (ASR) and Speaker Diarization (SD). Unlike segment-based VAD, which predicts a single speech/non-speech label for an input audio segment, this model is a frame-based VAD that outputs a speech probability for each 20 ms frame of the input audio. The model is trained on a combination of synthetic and real-world data to achieve more robust performance in very noisy conditions.
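In practice, the per-frame probabilities are post-processed into speech segments. A minimal thresholding sketch (the 0.5 threshold and the simple onset/offset logic here are illustrative assumptions, not the model's tuned post-processing):

```python
# Convert per-frame speech probabilities into (start, end) segments in seconds.
# Sketch only: 0.5 is an arbitrary demo threshold, and each frame is assumed
# to cover 20 ms, matching the model description.

FRAME_DUR = 0.02  # each output frame covers 20 ms of audio

def probs_to_segments(probs, threshold=0.5, frame_dur=FRAME_DUR):
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * frame_dur                     # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * frame_dur))   # speech offset
            start = None
    if start is not None:                             # speech runs to end of audio
        segments.append((start, len(probs) * frame_dur))
    return segments

print(probs_to_segments([0.1, 0.8, 0.9, 0.2, 0.7, 0.7]))
```

Real deployments typically add smoothing (e.g. minimum segment duration, onset/offset padding); the inference config discussed below exposes such post-processing options.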

Model Architecture

The model is based on the MarbleNet architecture presented in the MarbleNet paper [1]. Unlike the paper, the first convolution uses a stride of 2, giving the model a 2x subsampling rate. Also, the input feature of this model is an un-normalized log-mel spectrogram with n_mels=80, so it can be easily and efficiently integrated with ASR. For an ASR+VAD pipeline, please refer to this example.
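The 2x subsampling fixes the output frame rate. Assuming the common 10 ms mel-spectrogram hop (an assumption; the exact value is set in the base config), the frame arithmetic works out as follows:

```python
# Rough frame-count arithmetic for the 2x-subsampled model.
# Assumes a 10 ms featurizer hop (window_stride=0.01, a common NeMo default);
# after the stride-2 first convolution, each output frame covers ~20 ms.

HOP_SECONDS = 0.01   # mel-spectrogram hop (assumed default)
SUBSAMPLING = 2      # stride-2 first conv halves the frame rate

def num_output_frames(duration_s):
    mel_frames = round(duration_s / HOP_SECONDS)  # ~100 mel frames per second
    return mel_frames // SUBSAMPLING              # ~50 VAD frames per second

print(num_output_frames(20.0))  # a 20 s segment -> ~1000 output frames
```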


Training

The NeMo toolkit [2] was used to train the model for 50 epochs, with noise and gain augmentation. This model was trained with this example script and this base config.

The full config can be found inside the .nemo file.


Datasets

While training this model, we used the following datasets:

Synthetic Data:

We use the NeMo ASR data simulator to generate synthetic data. Each session is 3 minutes long, with a mean silence ratio of 0.3 and a mean overlap ratio of 0.05; the variance of both silence and overlap is set to 0.005. The generated audio is split into 20-second segments for training.

The synthetic dataset consists of the following:

  • 500 hours of audio synthesized from the LibriSpeech training set.
  • 500 hours of audio synthesized from the Fisher corpus.
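As a rough sanity check on the numbers above (a back-of-the-envelope sketch, not an official dataset statistic):

```python
# Back-of-the-envelope counts implied by the synthetic-data description.
SESSION_S = 3 * 60        # each simulated session is 3 minutes
SEGMENT_S = 20            # sessions are split into 20-second training segments
TOTAL_HOURS = 500 + 500   # LibriSpeech-derived + Fisher-derived synthetic audio

segments_per_session = SESSION_S // SEGMENT_S        # 9 segments per session
num_sessions = TOTAL_HOURS * 3600 // SESSION_S       # sessions needed for 1000 h
print(segments_per_session, num_sessions)
```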

Real-world Data:

Subsets of cleaned German (mcv7.0), Mandarin (aishell2), French (mls), Russian (mcv, ruls, sova), and Spanish (mls) data from the NeMo ASR set, totaling about 2.5K hours.

Noise Augmentation:


Performance

The AUROC performance is listed in the following table.

Version    AMI      AVA      CallHome-109    VoxConv-test
1.20.0     95.83    93.77    92.43           96.45

How to Use this Model

The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="vad_multilingual_frame_marblenet")

Perform VAD Inference

python <NEMO_ROOT>/examples/asr/speech_classification/ --config-path="../conf/vad" --config-name="frame_vad_infer_postprocess.yaml" dataset=<Path of manifest file of evaluation data, where audio files should have unique names>
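The manifest passed via the `dataset` argument is a JSON-lines file with one entry per audio file. A minimal sketch of building one (the field names follow NeMo's usual manifest convention and the paths are placeholders; check the inference config for the exact keys it expects):

```python
# Write a minimal JSON-lines manifest for VAD inference.
# "audio_filepath", "offset", and "duration" follow NeMo's common manifest
# convention; the path below is a placeholder, not a real file.
import json

entries = [
    {"audio_filepath": "/data/audio/sample1.wav", "offset": 0, "duration": 30.0},
]

with open("vad_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```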


Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.


Output

This model outputs a sequence of speech probabilities, one for each 20 ms frame of the input audio.
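For a quick aggregate, the per-frame probabilities can be turned into a total speech-duration estimate (a sketch; the 0.5 threshold is an illustrative choice, not the model's tuned operating point):

```python
# Estimate total speech time from per-frame probabilities.
# Each probability covers one 20 ms frame; 0.5 is an arbitrary demo threshold.
def speech_seconds(probs, threshold=0.5, frame_dur=0.02):
    return frame_dur * sum(1 for p in probs if p >= threshold)

print(speech_seconds([0.9, 0.8, 0.1, 0.6]))  # 3 frames above threshold
```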


Limitations

Since this model was trained on publicly available datasets, its performance might degrade on custom data that the model has not been trained on.


References

[1] Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] NVIDIA NeMo Toolkit


License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.