Description

Checkpoint of MatchboxNet 3x1x1 trained on Google Speech Command v1 (30 classes) dataset

Publisher

NVIDIA

Latest Version

Modified

April 4, 2023

Size

631.17 KB

Speech Commands (v1 dataset)

Speech Command Recognition is the task of classifying an input audio pattern into a discrete set of classes. It is a subset of Automatic Speech Recognition, sometimes referred to as Key Word Spotting, in which a model continuously analyzes speech patterns to detect certain "command" classes. When a command is detected, the system can take a specific action. Command recognition models are usually designed to be small and efficient, so that they can be deployed onto low-power sensors and remain active for long periods.

This Speech Command recognition tutorial is based on the QuartzNet model with a modified decoder head to suit classification tasks. Instead of predicting a token for each time step of the input, the model predicts a single label for the entire duration of the audio signal. This is accomplished by a decoder head that performs global max or average pooling across all time steps prior to classification. The model can then be trained with a standard categorical cross-entropy loss.
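The pooling decoder head described above can be illustrated with a minimal, standard-library-only sketch. It is not the NeMo implementation: the function names (global_pool, linear, cross_entropy), the tiny feature matrix, and the weights are all hypothetical and chosen only to show how per-time-step features collapse into one vector that yields a single label and a single loss value.

```python
import math

def global_pool(features, mode="avg"):
    """Collapse a [time][channels] feature matrix into one vector of length channels."""
    per_channel = list(zip(*features))  # transpose to [channels][time]
    if mode == "avg":
        return [sum(c) / len(c) for c in per_channel]
    if mode == "max":
        return [max(c) for c in per_channel]
    raise ValueError("mode must be 'avg' or 'max'")

def linear(vec, weights, bias):
    """Map the pooled vector to one logit per class."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(logits, target):
    """Categorical cross-entropy for one example (numerically stable log-softmax)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical encoder output: 3 time steps, 2 channels.
features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = global_pool(features, "avg")          # one vector for the whole clip

# Hypothetical classifier with 3 classes.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
logits = linear(pooled, weights, bias)
loss = cross_entropy(logits, target=2)         # single loss for a single label
```

Because pooling removes the time axis, the classifier sees a fixed-size vector regardless of clip length, which is what allows one label per audio signal rather than one token per time step.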

The tutorial covers the following steps:

Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC).
Data augmentation using SpecAugment to increase the number of data samples.
Development of a small neural classification model that can be trained efficiently.
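The augmentation step can be sketched as SpecAugment-style masking: one random band of frequency bins and one random band of time steps are zeroed in a spectrogram. The sketch below is a simplified illustration in plain Python, not NeMo's augmentation module; the function name mask_spectrogram, its parameters, and the all-ones input are assumptions made for the example.

```python
import random

def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Return a copy of spec ([freq_bins][time_steps]) with one frequency
    band and one time band set to zero."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: zero a random run of consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero a random run of consecutive time steps in every bin.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out

# Hypothetical 8-bin x 10-step spectrogram filled with ones.
spec = [[1.0] * 10 for _ in range(8)]
augmented = mask_spectrogram(spec, max_freq_width=2, max_time_width=3,
                             rng=random.Random(0))
```

Since the masks are drawn at random on every call, the same recording yields a differently masked input each time it is seen during training, which is the sense in which the augmentation adds variety to the training data.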

A Jupyter Notebook containing all the steps to download the dataset, train a model, and evaluate its results is available at: Speech Commands Using NeMo

Model Results

MatchboxNet 3x1x1

Parameter Count: 77K
Accuracy: 97.3226%
