VAD - MatchboxNet 3x1x1

Description

Checkpoint of MatchboxNet 3x1x1 trained on the Google Speech Commands v2 (speech) and Freesound (background) datasets.

Publisher

NVIDIA

Use Case

Automatic Speech Recognition

Framework

PyTorch with NeMo

Latest Version

1

Modified

September 24, 2020

Size

603.01 KB

Voice activity detection (VAD) is the task of distinguishing human speech segments from background noise in an audio stream.

VAD is an important pre-processing stage of an ASR system: it decides when to start ASR and when to close the microphone. VAD models need to be small and efficient so that they can be deployed on devices, and they must run with low latency.
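To illustrate how per-frame VAD decisions can drive "start ASR" and "close the microphone" events, here is a minimal sketch that turns a sequence of speech/non-speech flags into timed speech segments. The function name `speech_segments` and the `frame_sec` parameter are hypothetical and are not part of NeMo or of this checkpoint; a real system would likely also smooth the flags before acting on them.

```python
from typing import List, Tuple


def speech_segments(flags: List[bool], frame_sec: float) -> List[Tuple[float, float]]:
    """Group consecutive speech frames into (start_time, end_time) segments.

    flags: one boolean per frame, True when the frame was labeled speech.
    frame_sec: assumed duration of each frame in seconds (hypothetical parameter).
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i  # speech begins: this is where ASR would start
        elif not is_speech and start is not None:
            # speech ends: this is where the microphone could be closed
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # the stream ended while speech was still active
        segments.append((start * frame_sec, len(flags) * frame_sec))
    return segments
```

Each returned pair marks where ASR could be started and stopped; frames outside these spans would be treated as background.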

This VAD tutorial is based on the MatchboxNet model, with a decoder head modified to suit classification tasks. MatchboxNet performs well at classifying each segment (utterance-level / second unit) as speech or non-speech (99.78% F1 score).
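Segment-level classification can be pictured as cutting the audio into fixed-length windows and labeling each one. The sketch below shows only that windowing logic. The window and stride lengths are assumptions, and `energy_classifier` is a hypothetical stand-in for the trained MatchboxNet model, which is not loaded here.

```python
from typing import Callable, List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        window_sec: float, stride_sec: float) -> List[Sequence[float]]:
    """Cut a 1-D signal into fixed-length, possibly overlapping windows."""
    window = int(round(window_sec * sample_rate))
    stride = int(round(stride_sec * sample_rate))
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one sample")
    # Trailing samples that do not fill a whole window are dropped.
    return [samples[s:s + window]
            for s in range(0, len(samples) - window + 1, stride)]


def energy_classifier(segment: Sequence[float], threshold: float = 0.01) -> bool:
    """Hypothetical placeholder: True if mean energy exceeds a threshold.

    A trained model such as MatchboxNet would replace this function.
    """
    energy = sum(x * x for x in segment) / len(segment)
    return energy > threshold


def label_segments(samples: Sequence[float], sample_rate: int,
                   classifier: Callable[[Sequence[float]], bool],
                   window_sec: float = 1.0, stride_sec: float = 1.0) -> List[bool]:
    """Return one speech/non-speech label per window."""
    segments = split_into_segments(samples, sample_rate, window_sec, stride_sec)
    return [classifier(seg) for seg in segments]
```

For example, `label_segments(signal, 16000, energy_classifier)` would return one boolean per one-second window of a 16 kHz signal, assuming those window settings.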

A Jupyter Notebook containing all the steps to download the dataset, train a model, and evaluate its results is available at: VAD Using NeMo

Model Results

Accuracy: 0.9971

F1 score: 0.9975
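For reference, accuracy and F1 for a binary speech/non-speech task can be computed from predictions as sketched below. This is an illustration of the standard definitions, not the evaluation script behind the numbers above. It assumes label 1 means speech and treats speech as the positive class; the source does not state how its F1 score was averaged.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # speech found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # noise called speech
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # speech missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # noise rejected
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1
```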