Jasper model is an end-to-end neural speech recognition models trained with CTC loss, which has been trained on the ASR Set dataset with over 7000 hours of English speech.
Jasper models utilizes a character encoding. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.
Trained or fine-tuned NeMo models (with the file extenstion .nemo
) can be converted to Riva models (with the file extension .riva
) and then deployed. Here is a pre-trained Jasper speech-to-text (STT) -- a.k.a. automatic speech recognition (ASR) -- Riva model.
The Jasper model is composed of multiple blocks with residual connections between them, trained with CTC loss. Each block consists of one or more modules with convolutional layers, batch normalization, and ReLU layers.
The Jasper 10x5-DR model consists of a stack of five blocks that repeat ten times plus four additional convolutional layers [1].
This Jasper model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to a minimum duration of 0.1s long, and a maximum duration of 16.7s long. The model was trained for 600 epochs with Apex/Amp optimization level O1. The NeMo toolkit [2] was used for training this model over several hundred epochs on multiple GPUs.
The model has been fine-tuned with Room Impulse Response (RIR) and noise augmentation to make it more robust to noise.
The datasets included in training are detailed in the table below. The "Duration" column indicates how many hours of audio are contained in that dataset before length filtering was performed.
| Dataset | Speed Perturbed | Duration (h) | |-------------------------------- |----------------- |-------------- | | LibriSpeech | Y | 2,903 | | Wall Street Journal | Y | 245 | | Fisher English Training Speech | N | 1,906 | | Switchboard | N | 316 | | Mozilla Common Voice* | N | 1,090 | | NSC Singapore English (Part 1) | N | 1,857 |
The performance of Automatic Speech Recognition models is measuring using Character Error Rate.
The model obtains the following scores on the following evaluation datasets -
LibriSpeech dev-clean
LibriSpeech dev-other
Note that these scores on Librispeech are not particularly indicative of the quality of transcriptions that models trained on ASR Set will achieve, but they are a useful proxy.
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_jasper10x5dr")
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
pretrained_name="stt_en_jasper10x5dr" \
audio_dir=""
This model accepts 16000 KHz Mono-channel Audio (wav files) as input.
This model provides transcribed speech as a string for a given audio sample.
Since this model was trained on publically available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
[1] Jasper: An End-to-End Convolutional Neural Acoustic Model
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.