Commandrecognition En Matchboxnet3x1x64 Subset Task

Description

MatchboxNet 3x1x64 trained on Google Speech Commands Dataset (v2) (subset classification)

Publisher

NVIDIA

Use Case

Other

Framework

PyTorch with NeMo

Latest Version

1.0.0rc1

Modified

June 30, 2021

Size

303.96 KB

Model Overview

MatchboxNet 3x1x64 model trained on the Google Speech Commands Dataset (v2). A subset of the dataset (10 specific classes) serves as the speech commands to be recognized, while all remaining classes are grouped under a single "other" class. An additional "silence" class is constructed from audio samples of background noise.
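The 10 + 2 class layout described above can be sketched as a small label-mapping function. The ten command words below are an assumption: they are the words commonly used for the 12-class Google Speech Commands benchmark, and this page does not enumerate them. The names `COMMANDS`, `BACKGROUND_DIR`, `TARGET_CLASSES`, and `map_label` are hypothetical and exist only for illustration.

```python
# Assumed command words (the usual 12-class benchmark subset); not stated on this page.
COMMANDS = ("yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go")

# Folder of background-noise recordings shipped with the dataset.
BACKGROUND_DIR = "_background_noise_"

# 10 command classes plus the two catch-all classes.
TARGET_CLASSES = list(COMMANDS) + ["other", "silence"]


def map_label(dataset_label: str) -> str:
    """Map a raw dataset label to one of the 10 + 2 target classes."""
    if dataset_label == BACKGROUND_DIR:
        return "silence"
    if dataset_label in COMMANDS:
        return dataset_label
    # Every remaining word in the dataset collapses into a single class.
    return "other"
```

Under these assumptions, `map_label("marvin")` falls into "other", while a clip cut from the background-noise folder is labelled "silence".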

Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as Key Word Spotting, in which a model is constantly analyzing speech patterns to detect certain "command" classes. Upon detection of these commands, a specific action can be taken by the system. It is often the objective of command recognition models to be small and efficient, so that they can be deployed onto low power sensors and remain active for long durations of time.

Model Architecture

This model is described in the MatchboxNet paper [1]. It is based on the QuartzNet model, with a modified decoder head to suit classification tasks. Instead of predicting a token for each time step of the input, the model predicts a single label for the entire duration of the audio signal. This is accomplished by a decoder head that performs global max or average pooling across all timesteps prior to classification. The model can then be trained with a standard categorical cross-entropy loss.
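To make the decoder head concrete, here is a minimal pure-Python sketch of global average pooling over time followed by a linear classifier and a cross-entropy loss. It is not the NeMo implementation: the function names, the toy encoder output, and the identity weights are all assumptions chosen so the arithmetic is easy to follow.

```python
import math


def global_average_pool(features):
    """Average a [time][channels] feature map over time -> [channels]."""
    num_steps = len(features)
    num_channels = len(features[0])
    return [sum(step[c] for step in features) / num_steps for c in range(num_channels)]


def linear(vector, weights, bias):
    """Dense layer: weights is [num_classes][channels], bias is [num_classes]."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def cross_entropy(logits, target_index):
    """Categorical cross-entropy of softmax(logits) against one target class."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


# Toy encoder output: 3 timesteps, 2 channels (hypothetical values).
encoder_output = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # 2 classes, identity weights for clarity
bias = [0.0, 0.0]

pooled = global_average_pool(encoder_output)  # one vector for the whole clip
logits = linear(pooled, weights, bias)        # one score per class
loss = cross_entropy(logits, target_index=0)
```

Whatever the clip length, pooling collapses the time axis to a single vector, so the classifier emits one set of class scores per audio clip rather than one per timestep.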

The overall approach consists of:

  • Audio preprocessing (feature extraction): signal normalization, windowing, and a (log) spectrogram (or mel-scale spectrogram, or MFCC).
  • Data augmentation using SpecAugment to increase the number of data samples.
  • A small neural classification model that can be trained efficiently.
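The preprocessing step in the list above (normalization, windowing, log spectrogram) can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses a naive DFT and a Hann window, omits the mel filterbank and MFCC variants, and the function name, frame sizes, and test tone are hypothetical rather than taken from the NeMo configuration.

```python
import cmath
import math


def log_power_spectrogram(signal, frame_len, hop, eps=1e-10):
    """Return one log-power spectrum (frame_len // 2 + 1 bins) per frame."""
    # Signal normalization: scale so the largest absolute sample is 1.
    peak = max(abs(s) for s in signal) or 1.0
    signal = [s / peak for s in signal]
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # naive DFT, non-negative frequencies
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            spectrum.append(math.log(abs(coeff) ** 2 + eps))
        frames.append(spectrum)
    return frames


# Toy input: a 2 kHz tone sampled at 16 kHz (values chosen for illustration).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(256)]
spec = log_power_spectrogram(tone, frame_len=64, hop=32)

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With a 64-sample frame at 16 kHz each bin spans 250 Hz, so the 2 kHz tone should concentrate its energy around bin 8. A practical pipeline would use an FFT and typically a mel filterbank on top of this.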

Training

The model was trained on the Google Speech Commands v2 dataset. The NeMo toolkit was used to train this model for several hundred epochs on multiple GPUs.

Datasets

The following dataset was used to train this model:

  • Google Speech Commands Dataset (v2)

Performance

The standard metric for speech command recognition is classification accuracy on the corresponding development and test sets. Note that because the "silence" / "background" class is built non-deterministically, the scores on that class will vary between different constructions of the train set under different environments.

On the Google Speech Commands v2 dataset (10 + 2 classes), which this model was trained on, it gets approximately:

  • 98.2 % on the test set
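For reference, accuracy here is simply the fraction of clips whose predicted class matches the reference label. The sketch below uses hypothetical labels and a hypothetical `accuracy` helper; it does not reproduce the reported score.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    if not references:
        raise ValueError("references must not be empty")
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels, for illustration only.
references = ["yes", "no", "other", "silence", "go"]
predictions = ["yes", "no", "other", "other", "go"]
score = accuracy(predictions, references)  # 4 of 5 predictions match
```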

How to Use this Model

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
# Download the pretrained checkpoint from NGC and restore it as a NeMo classification model.
asr_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="commandrecognition_en_matchboxnet3x1x64_v2_subset_task")

Usage

A Jupyter Notebook containing all the steps to download the dataset, train a model, and evaluate its results is available at: Speech Commands Using NeMo

To train the model using multiple GPUs and Automatic Mixed Precision, please use the training script provided in the examples/asr/ directory - MatchboxNet speech commands.

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
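Before passing a file to the model, it can help to confirm that it matches this input format. The following is a minimal sketch using Python's standard `wave` module; the helper names `is_16khz_mono` and `write_silence` and the demo files are hypothetical, and the check covers only channel count and sample rate.

```python
import os
import tempfile
import wave


def is_16khz_mono(path):
    """Return True if the WAV file at `path` is single-channel and sampled at 16000 Hz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000


def write_silence(path, sample_rate, channels, num_frames=160):
    """Write a short silent 16-bit PCM WAV file (helper for the demo below)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * channels * num_frames)


# Demo: one file in the expected format and one that is not.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "mono_16k.wav")
    bad_path = os.path.join(tmp_dir, "stereo_44k.wav")
    write_silence(good_path, sample_rate=16000, channels=1)
    write_silence(bad_path, sample_rate=44100, channels=2)
    good_ok = is_16khz_mono(good_path)
    bad_ok = is_16khz_mono(bad_path)
```

Files that fail the check would need to be resampled and downmixed before use.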

Output

This model provides the speech command prediction for the input audio snippet.

Limitations

Performance might degrade with extremely noisy input. Since this model was trained on publicly available datasets, the performance of this model might degrade for custom data that the model has not been trained on. The model can only detect the 10 English words as well as "other" and "silence", but can be finetuned to detect other words.

References

[1] Majumdar, Somshubra, and Boris Ginsburg. "MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition." Proc. Interspeech 2020 (2020): 3356-3360.

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.