This is a MatchboxNet 3x2x64 model trained on the Google Speech Commands Dataset (v2). A subset of the dataset (10 specific classes) serves as the speech commands to be recognized, while the remaining classes fall under the "other" class. An additional "silence" class is constructed from audio samples of background noise.
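The 10 + 2 class scheme described above can be sketched as a simple label mapping. This is an illustration only: the ten target words and the `_background_noise_` folder name are assumptions based on the common 12-class setup for this dataset, since this card does not list them, and `map_to_subset_label` is a hypothetical helper rather than part of NeMo.

```python
# Assumption: these are the ten target words commonly used for the 12-class
# Speech Commands v2 task; this model card does not list them explicitly.
TARGET_COMMANDS = (
    "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go",
)
OTHER_LABEL = "other"
SILENCE_LABEL = "silence"

# Assumption: name of the dataset folder that holds background noise clips.
BACKGROUND_NOISE_DIR = "_background_noise_"

# Full label set seen by the classifier: 10 commands plus "other" and "silence".
CLASS_NAMES = TARGET_COMMANDS + (OTHER_LABEL, SILENCE_LABEL)


def map_to_subset_label(original_label: str) -> str:
    """Map an original dataset label to one of the 12 subset-task labels."""
    if original_label == BACKGROUND_NOISE_DIR:
        return SILENCE_LABEL
    if original_label in TARGET_COMMANDS:
        return original_label
    # Every other spoken word in the dataset is folded into "other".
    return OTHER_LABEL


if __name__ == "__main__":
    for label in ("yes", "marvin", BACKGROUND_NOISE_DIR):
        print(label, "->", map_to_subset_label(label))
```

Folding all non-target words into a single class lets the model reject unrelated speech instead of forcing it onto one of the ten commands.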
Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as Key Word Spotting, in which a model is constantly analyzing speech patterns to detect certain "command" classes. Upon detection of these commands, a specific action can be taken by the system. It is often the objective of command recognition models to be small and efficient, so that they can be deployed onto low power sensors and remain active for long durations of time.
The model is described in the MatchboxNet paper. This speech command recognition model is based on the QuartzNet model, with a modified decoder head suited to classification tasks. Instead of predicting a token for each time step of the input, it predicts a single label for the entire duration of the audio signal. This is accomplished by a decoder head that performs global max or average pooling across all time steps prior to classification. The model can then be trained with a standard categorical cross-entropy loss.
The model was trained on the Google Speech Commands v2 dataset. The NeMo toolkit was used to train this model for several hundred epochs on multiple GPUs.
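The pooling decoder head described above can be illustrated with a minimal pure-Python sketch. It is not the NeMo implementation: the function names, the tiny channel count, and the random weights are all assumptions chosen for illustration. It shows only the idea that averaging over time produces one fixed-size vector, and therefore one set of class logits, regardless of clip length.

```python
import math
import random


def global_average_pool(features):
    """Average a [time][channels] feature sequence over the time axis."""
    num_steps = len(features)
    num_channels = len(features[0])
    return [sum(step[c] for step in features) / num_steps
            for c in range(num_channels)]


def linear(vector, weights, bias):
    """Project the pooled vector to one logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify_clip(features, weights, bias):
    """Pool across all time steps, then classify the whole clip at once."""
    return linear(global_average_pool(features), weights, bias)


def cross_entropy(logits, target_index):
    """Categorical cross-entropy of softmax(logits) against one target class."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]


# Hypothetical sizes: 12 classes (10 commands + "other" + "silence"), 4 channels.
rng = random.Random(0)
NUM_CLASSES, NUM_CHANNELS = 12, 4
weights = [[rng.uniform(-1, 1) for _ in range(NUM_CHANNELS)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

# Two clips of different lengths both yield exactly NUM_CLASSES logits.
short_clip = [[rng.uniform(-1, 1) for _ in range(NUM_CHANNELS)] for _ in range(50)]
long_clip = [[rng.uniform(-1, 1) for _ in range(NUM_CHANNELS)] for _ in range(128)]
logits_short = classify_clip(short_clip, weights, bias)
logits_long = classify_clip(long_clip, weights, bias)
loss = cross_entropy(logits_short, target_index=3)
```

Replacing the mean in `global_average_pool` with a per-channel `max` would give the max-pooling variant mentioned in the text.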
While training this model, we used the following dataset: Google Speech Commands Dataset (v2).
The general metric of speech command recognition is accuracy on the corresponding development and test set of the model. Note that as the "silence" / "background" class is built non-deterministically, the scores on that class will vary between different constructions of the train set under different environments.
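As a minimal sketch of the metric described above, the following computes overall accuracy together with a per-class breakdown. The function names and the toy labels are hypothetical and no real evaluation results are implied. The per-class view is useful here because it makes run-to-run variation in the "silence" class visible separately from the command classes.

```python
from collections import Counter


def accuracy(predictions, targets):
    """Fraction of clips whose predicted label equals the reference label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def per_class_accuracy(predictions, targets):
    """Accuracy computed separately for each reference label."""
    totals = Counter(targets)
    hits = Counter(t for p, t in zip(predictions, targets) if p == t)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only; these are not results from this model.
example_targets = ["yes", "no", "silence", "other"]
example_predictions = ["yes", "no", "other", "other"]

overall = accuracy(example_predictions, example_targets)
by_class = per_class_accuracy(example_predictions, example_targets)
```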
On the Google Speech Commands v2 dataset (10 + 2 classes), which this model was trained on, it gets approximately:
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="commandrecognition_en_matchboxnet3x2x64_v2_subset_task")
A Jupyter Notebook containing all the steps to download the dataset, train a model, and evaluate its results is available at: Speech Commands Using NeMo
To train the model using multiple GPUs and Automatic Mixed Precision, please use the training script provided in the examples/asr/ directory: MatchboxNet speech commands.
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
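Before passing a file to the model, it can help to confirm that it matches the expected input format of 16 kHz (16000 Hz), single channel. The sketch below uses only Python's standard `wave` module; `check_wav_format` is a hypothetical helper, not part of NeMo, and it only reports mismatches rather than resampling.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return a list of format problems for a WAV file (empty if it is fine)."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()

    problems = []
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"file has {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

A file that fails this check would need to be resampled or downmixed with an audio tool of your choice before inference.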
This model provides the speech command prediction for the input audio snippet.
Performance might degrade with extremely noisy input. Since this model was trained on publicly available datasets, its performance might also degrade on custom data that it has not been trained on. The model can only detect the 10 English command words as well as "other" and "silence", but it can be fine-tuned to detect other words.
 Majumdar, Somshubra, and Boris Ginsburg. "MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition." Proc. Interspeech 2020 (2020): 3356-3360.