End-to-end neural acoustic model for automatic speech recognition providing high accuracy at a low memory footprint.
QuartzNet is an end-to-end neural acoustic model that is based on efficient, time-channel separable convolutions (Figure 1). In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame.
Figure 1. Architecture of QuartzNet (source)
This model was trained using script available on NGC and in GitHub repo.
The following datasets were used to train this model:
Performance numbers for this model are available in NGC.