The QuartzNet15x5 encoder and decoder neural module checkpoints available here were trained with the Neural Modules (NeMo) toolkit. NVIDIA's Apex/Amp O1 optimization level was used for training on 8xV100 GPUs. The modules were trained on LibriSpeech (with ±10% speed perturbation) and the "validated" set of Mozilla's English Common Voice corpus. The model achieves 4.19% WER on LibriSpeech test-clean and 10.98% WER on test-other without any language model, using a greedy decoder.
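For reference, greedy CTC decoding simply takes the most likely symbol per frame, merges repeated symbols, and drops the CTC blank. A minimal illustrative sketch (not NeMo's implementation; the toy labels and probabilities below are made up):

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, labels: list, blank_id: int) -> str:
    """Collapse a (time, num_classes) matrix of per-frame log-probabilities
    into a transcript: argmax per frame, merge repeats, drop CTC blanks."""
    best_path = log_probs.argmax(axis=-1)      # most likely symbol per frame
    transcript, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank_id:    # skip repeats and the blank symbol
            transcript.append(labels[idx])
        prev = idx
    return "".join(transcript)

# Toy usage: 4 frames, vocabulary {a, b, <blank>}
labels, blank_id = ["a", "b"], 2
log_probs = np.log(np.array([
    [0.8, 0.1, 0.1],   # "a"
    [0.8, 0.1, 0.1],   # "a" again (merged with the previous frame)
    [0.1, 0.1, 0.8],   # blank
    [0.1, 0.8, 0.1],   # "b"
]))
print(ctc_greedy_decode(log_probs, labels, blank_id))  # -> "ab"
```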
Most state-of-the-art (SOTA) ASR models are extremely large: they tend to have on the order of a few hundred million parameters, which makes them hard to deploy at scale given the current limitations of edge devices. The QuartzNet model consists of 79 layers and has a total of 18.9 million parameters, with five blocks that repeat fifteen times plus four additional convolutional layers.
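The small parameter count comes largely from 1D time-channel separable convolutions, which factor a dense 1D convolution into a per-channel (depthwise) convolution over time followed by a pointwise 1x1 convolution across channels. A rough PyTorch sketch with illustrative layer sizes (not the actual QuartzNet configuration):

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Depthwise conv over time + pointwise 1x1 conv over channels,
    in place of a single dense Conv1d."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Parameter comparison for one layer (illustrative sizes, not QuartzNet's)
dense = nn.Conv1d(256, 256, kernel_size=33, padding=16)
separable = TimeChannelSeparableConv1d(256, 256, kernel_size=33)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(separable))  # dense: ~2.2M params, separable: ~74K
```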
The model is composed of multiple blocks with residual connections between them and is trained with CTC loss. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. The model achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having fewer parameters than all competing models. The Neural Modules (NeMo) toolkit makes it easy to use this model for transfer learning or fine-tuning: encoder and decoder checkpoints trained with NeMo can serve as the starting point for fine-tuning on new datasets.
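As a rough illustration of reusing the pretrained weights, the sketch below assumes a recent NeMo (1.x) release, where the same QuartzNet15x5 architecture is exposed as an EncDecCTCModel; the older Neural Modules encoder/decoder checkpoints are instead restored into the corresponding modules before fine-tuning. The model name and audio path here are illustrative.

```python
import nemo.collections.asr as nemo_asr

# Download and restore pretrained QuartzNet15x5 weights as a fine-tuning starting point
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# Greedy-decoded transcription of a local WAV file (16 kHz mono assumed)
print(asr_model.transcribe(["sample.wav"]))
```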
The dataset used here is the LibriSpeech training set, with two data augmentation techniques: speed perturbation and Cutout. Speed perturbation means additional training samples were created by slowing down or speeding up the original audio by 10%. Cutout refers to randomly masking out small rectangles of the spectrogram input as a regularization technique. NVIDIA's Apex/Amp O1 optimization level was used for training on V100 GPUs.
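The sketch below illustrates the two augmentations in plain PyTorch/torchaudio; it is not NeMo's implementation, which is configured through its data layers and spectrogram augmentation settings. The rectangle sizes and counts are illustrative.

```python
import random
import torch
import torchaudio

def speed_perturb(waveform: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    """Resample so the audio plays `factor` times faster (or slower) when
    interpreted at the original sample rate, e.g. factor in {0.9, 1.0, 1.1} for ±10%."""
    return torchaudio.functional.resample(
        waveform, orig_freq=sample_rate, new_freq=int(sample_rate / factor))

def cutout(spectrogram: torch.Tensor, num_rects: int = 5,
           max_freq: int = 15, max_time: int = 25) -> torch.Tensor:
    """Zero out a few random rectangles of a (freq, time) spectrogram."""
    spec = spectrogram.clone()
    n_freq, n_time = spec.shape
    for _ in range(num_rects):
        f0 = random.randint(0, max(0, n_freq - max_freq))
        t0 = random.randint(0, max(0, n_time - max_time))
        spec[f0:f0 + random.randint(1, max_freq),
             t0:t0 + random.randint(1, max_time)] = 0.0
    return spec
```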
Please refer to https://github.com/NVIDIA/NeMo for further documentation, and to the NeMo release notes at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html.