These models are based on the QuartzNet architecture, a variant of Jasper that uses 1D time-channel separable convolutional layers in its convolutional residual blocks, which makes them smaller than Jasper models.
Jasper models use a character encoding. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.
Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. Here is a pre-trained QuartzNet speech-to-text (STT), also known as automatic speech recognition (ASR), Riva model.
The QuartzNet model is composed of multiple blocks with residual connections between them and is trained with CTC loss. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers.
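To illustrate why time-channel separable convolutions keep the model small, the sketch below compares the weight count of a regular 1D convolution with that of a separable one (a depthwise convolution over time followed by a 1x1 pointwise convolution over channels). The kernel size and channel counts are illustrative assumptions, not values taken from this model's configuration, and biases are ignored.

```python
def conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a regular 1D convolution (bias omitted)."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a time-channel separable 1D convolution (bias omitted).

    Depthwise part: one length-K filter per input channel, applied across time.
    Pointwise part: a 1x1 convolution that mixes channels.
    """
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative sizes only (assumed, not read from the model config).
K, C_IN, C_OUT = 33, 256, 256
regular = conv1d_params(K, C_IN, C_OUT)
separable = separable_conv1d_params(K, C_IN, C_OUT)
print(f"regular: {regular}, separable: {separable}, "
      f"ratio: {regular / separable:.1f}x")
```

For wide layers with long kernels the pointwise term dominates, so the separable layer costs roughly as much as a 1x1 convolution regardless of kernel length.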
The QuartzNet 15x5 model consists of 79 layers and has a total of 18.9 million parameters, with five blocks that repeat fifteen times plus four additional convolutional layers.
This QuartzNet model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to a minimum duration of 0.1 s and a maximum duration of 16.7 s. The model was trained for 300 epochs with Apex/Amp optimization level O1.
The datasets included in training are detailed in the table below. The "Duration" column indicates how many hours of audio are contained in that dataset before length filtering was performed.
| Dataset                        | Speed Perturbed | Duration (h) |
|--------------------------------|-----------------|--------------|
| LibriSpeech                    | Y               | 2,903        |
| Wall Street Journal            | Y               | 245          |
| Fisher English Training Speech | N               | 1,906        |
| Switchboard                    | N               | 316          |
| Mozilla Common Voice*          | N               | 1,090        |
| NSC Singapore English (Part 1) | N               | 1,857        |
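The length filtering mentioned above (0.1 s minimum, 16.7 s maximum) can be sketched as follows. This is a minimal illustration and not the training pipeline itself: it assumes a JSON-lines manifest in which each entry carries a `duration` field in seconds, as NeMo manifests typically do, and it treats both bounds as inclusive, which the text does not specify. The example entries are hypothetical.

```python
import json

MIN_DURATION_S = 0.1   # minimum sample length stated above
MAX_DURATION_S = 16.7  # maximum sample length stated above


def filter_by_duration(manifest_lines, min_s=MIN_DURATION_S, max_s=MAX_DURATION_S):
    """Yield manifest entries whose duration lies within [min_s, max_s]."""
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if min_s <= entry["duration"] <= max_s:
            yield entry


# Hypothetical manifest lines, for illustration only.
example_manifest = [
    '{"audio_filepath": "a.wav", "duration": 0.05, "text": "hi"}',
    '{"audio_filepath": "b.wav", "duration": 4.2, "text": "hello world"}',
    '{"audio_filepath": "c.wav", "duration": 20.0, "text": "a long utterance"}',
]
kept = list(filter_by_duration(example_manifest))
print([entry["audio_filepath"] for entry in kept])
```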
The performance of automatic speech recognition models is measured using Character Error Rate (CER).
The model obtains the following scores on the following evaluation datasets:
Note that these scores on LibriSpeech are not particularly indicative of the quality of transcriptions that models trained on ASR Set will achieve, but they are a useful proxy.
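For readers unfamiliar with the metric, Character Error Rate is commonly computed as the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for this model; the function names and example strings are made up for the illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses) -> float:
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Two characters are missing from the hypothesis, out of 11 reference characters.
cer = character_error_rate(["hello world"], ["helo wrld"])
print(f"CER = {cer:.3f}")
```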
The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5")
    python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
        pretrained_name="stt_en_quartznet15x5" \
        audio_dir=""
This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
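Before transcribing, it can be useful to confirm that a file matches this input format. The following is a minimal sketch using only Python's standard `wave` module; the helper names and the temporary demo files are hypothetical and are not part of NeMo.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE_HZ = 16000
EXPECTED_CHANNELS = 1


def is_supported_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz mono."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getframerate() == EXPECTED_SAMPLE_RATE_HZ
                and wav_file.getnchannels() == EXPECTED_CHANNELS)


def write_silence(path: str, sample_rate_hz: int, channels: int, seconds: float = 0.1):
    """Write a short silent 16-bit PCM WAV file (demo helper only)."""
    n_frames = int(sample_rate_hz * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(b"\x00\x00" * n_frames * channels)


# Demo: one file in the expected format and one 44.1 kHz stereo file.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.wav")
    bad_path = os.path.join(tmp_dir, "bad.wav")
    write_silence(good_path, 16000, 1)
    write_silence(bad_path, 44100, 2)
    results = (is_supported_wav(good_path), is_supported_wav(bad_path))
print(results)
```

Files that fail the check would need to be resampled or downmixed with an audio tool before being passed to the model.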
This model provides transcribed speech as a string for a given audio sample.
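Because the model is trained with CTC loss, its raw output is a per-frame sequence of character predictions that also includes a special blank symbol. One common way to turn that sequence into a string is greedy CTC decoding: collapse consecutive repeats, then drop blanks. The sketch below illustrates the idea under stated assumptions; the vocabulary, the position of the blank index, and the frame sequence are all hypothetical, and the toolkit's own decoding may differ.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, vocabulary, blank_id):
    """Collapse repeated frame predictions, then remove blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(vocabulary[t] for t in collapsed if t != blank_id)


# Hypothetical character vocabulary; blank assumed to be the index after the last character.
VOCAB = [" ", "a", "c", "t"]
BLANK = len(VOCAB)

# Hypothetical per-frame argmax predictions.
frames = [2, 2, BLANK, 1, 1, 1, BLANK, BLANK, 3, 3]
text = ctc_greedy_decode(frames, VOCAB, BLANK)
print(text)
```

The blank symbol is what allows genuinely doubled letters to survive the collapse step: two identical characters separated by a blank decode to two characters.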
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.