This repository provides an implementation of the QuartzNet model in PyTorch from the paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions. The QuartzNet model is an end-to-end neural acoustic model for automatic speech recognition (ASR), that provides high accuracy at a low memory footprint. The QuartzNet architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.

This repository is a PyTorch implementation of QuartzNet and provides scripts to train the QuartzNet 10x5 model from scratch on the LibriSpeech dataset to achieve the greedy decoding results improved upon the original paper. The repository is self-contained and includes data preparation scripts, training, and inference scripts. Both training and inference scripts offer the option to use Automatic Mixed Precision (AMP) to benefit from Tensor Cores for better performance.

In addition to providing the hyperparameters for training a model checkpoint, we publish a thorough inference analysis across different NVIDIA GPU platforms, for example, DGX-2, NVIDIA A100 GPU, and T4.

This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results [1.4]x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

### Model architecture

QuartzNet is an end-to-end neural acoustic model that is based on efficient, time-channel separable convolutions (Figure 1). In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame.

*Figure 1. Architecture of QuartzNet (source)
*

### Default configuration

The following features were implemented in this model:

- GPU-supported feature extraction with data augmentation options SpecAugment and Cutout using the DALI library
- offline and online Speed Perturbation using the DALI library
- data-parallel multi-GPU training and evaluation
- AMP with dynamic loss scaling for Tensor Core training
- FP16 inference

### Feature support matrix

Feature |
QuartzNet |
---|---|

Apex AMP | Yes |

DALI | Yes |

#### Features

**DALI**
NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications. For details, see example sources in this repository or see the DALI documentation.

**Automatic Mixed Precision (AMP)**
Computation graphs can be modified by PyTorch on runtime to support mixed precision training. A detailed explanation of mixed precision can be found in the next section.

### Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:

- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.

For information about:

- How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
- Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
- APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.

#### Enabling mixed precision

For training, mixed precision can be enabled by setting the flag: `train.py --amp`

. When using bash helper scripts, mixed precision can be enabled with the environment variable `AMP=true`

, for example, `AMP=true bash scripts/train.sh`

, `AMP=true bash scripts/inference.sh`

, etc.

#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

### Glossary

**Time-channel separable (TCS) convolution**
A module composed mainly of two convolutional layers: a 1D depthwise convolutional layer,
and a pointwise convolutional layer (Figure 2). The former operates across K time frames, and the latter across all channels. By decoupling time and channel axes, the separable module uses less parameters and calculates the result faster, than it would otherwise would.

*Figure 2. Time-channel separable (TCS) convolutional module: (a) basic design, (b) TCS with a group shuffle layer, added to increase cross-group interchange*

**Automatic Speech Recognition (ASR)**
Uses both an acoustic model and a language model to output the transcript of an input audio signal.

**Acoustic model**
Assigns a probability distribution over a vocabulary of characters given an audio frame. Typically, a large part of the entire ASR model.

**Language model**
Assigns a probability distribution over a sequence of words. Given a sequence of words, it assigns a probability to the whole sequence.

**Pre-training**
Training a model on vast amounts of data on the same (or different) task to build general understandings.