End-to-end neural acoustic model for automatic speech recognition providing high accuracy at a low memory footprint.
QuartzNet is an end-to-end neural acoustic model based on efficient time-channel separable convolutions (Figure 1). In the audio preprocessing stage, each audio frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input; for each frame, it outputs a probability distribution over the vocabulary.
Figure 1. Architecture of QuartzNet (source)
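The time-channel separable convolution at the heart of QuartzNet factors a standard 1-D convolution into a depthwise convolution over time (one filter per channel) followed by a pointwise 1x1 convolution that mixes channels, which is what keeps the parameter count and memory footprint low. A minimal numpy sketch of this factorization (shapes and naming are illustrative, not the model's actual implementation):

```python
import numpy as np

def time_channel_separable_conv(x, depthwise, pointwise):
    """Time-channel separable 1-D convolution (a sketch).

    x:         (channels, time) input feature map
    depthwise: (channels, kernel) one temporal filter per channel
    pointwise: (out_channels, channels) 1x1 channel-mixing weights
    Assumes an odd kernel size so "same" padding preserves length.
    """
    c, t = x.shape
    k = depthwise.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is convolved only with its own filter.
    dw = np.stack([np.convolve(xp[i], depthwise[i][::-1], mode="valid")
                   for i in range(c)])
    # Pointwise step: a 1x1 convolution mixes information across channels.
    return pointwise @ dw
```

For `c_in` input channels, `c_out` output channels, and kernel size `k`, this uses `c_in * k + c_out * c_in` weights instead of the `c_out * c_in * k` of a standard convolution, which is where the efficiency comes from.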
The following datasets were used to train this model:
- LibriSpeech - a corpus of approximately 1,000 hours of 16 kHz read English speech derived from LibriVox audiobooks, carefully segmented and aligned.
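At inference time, the per-frame probability distributions described above are typically collapsed into text with CTC greedy decoding: take the most likely symbol per frame, merge consecutive repeats, and drop the blank symbol. A minimal sketch (the toy vocabulary and blank index here are assumptions for illustration):

```python
import numpy as np

def ctc_greedy_decode(probs, vocab, blank=0):
    """Collapse per-frame distributions into a transcript.

    probs: (time, vocab_size) per-frame probabilities
    vocab: list mapping index -> symbol; index `blank` is the CTC blank
    """
    best = probs.argmax(axis=1)  # most likely symbol at each frame
    out = []
    prev = None
    for idx in best:
        # Merge repeated symbols, then remove blanks.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)
```

For example, the frame-wise argmax sequence `c c <blank> a a t` decodes to `cat`.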
Performance numbers for this model are available in NGC.