STT En Jasper10x5dr | NVIDIA NGC

Jasper models are end-to-end neural automatic speech recognition (ASR) models that transcribe segments of audio to text.

Model Overview

The Jasper model is an end-to-end neural speech recognition model trained with CTC loss on the ASR Set dataset, which contains over 7,000 hours of English speech.

Jasper models use a character encoding. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.
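
To illustrate how the per-frame character output of a CTC-trained model becomes text, here is a minimal sketch of greedy CTC decoding: take the most likely symbol for each frame, merge consecutive repeats, then drop the blank symbol. The vocabulary, the blank index, and the function name below are assumptions made for illustration; they are not read from this checkpoint.

```python
# Hypothetical character vocabulary (space, a-z, apostrophe).
# The actual vocabulary is defined in the model's configuration.
VOCAB = list(" abcdefghijklmnopqrstuvwxyz'")
BLANK = len(VOCAB)  # assumption: the CTC blank is the index after the last character


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank=BLANK):
    """Collapse a sequence of per-frame symbol indices into a string."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical indices is what allows a doubled letter to survive the merge step, as in the "ll" of "hello".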

Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. Here is a pre-trained Jasper speech-to-text (STT), also known as automatic speech recognition (ASR), Riva model.

Model Architecture

The Jasper model is composed of multiple blocks with residual connections between them, trained with CTC loss. Each block consists of one or more modules with convolutional layers, batch normalization, and ReLU layers.

The Jasper 10x5-DR model consists of a stack of ten blocks, each containing five sub-blocks, plus four additional convolutional layers [1].
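
The layer count implied by this layout can be checked with simple arithmetic. The sketch below assumes the "BxR" naming from [1], in which the first number counts blocks and the second counts convolutional sub-blocks per block; the function name is hypothetical.

```python
def jasper_conv_layers(num_blocks, sub_blocks_per_block, extra_layers=4):
    """Count convolutional layers in a Jasper BxR stack.

    Assumes one convolution per sub-block, plus `extra_layers`
    convolutions outside the repeated blocks.
    """
    return num_blocks * sub_blocks_per_block + extra_layers


# Jasper 10x5: 10 * 5 + 4 convolutional layers
total_layers = jasper_conv_layers(10, 5)
```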

Training

This Jasper model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to a minimum duration of 0.1 s and a maximum duration of 16.7 s. The model was trained with the NeMo toolkit [2] for 600 epochs on multiple GPUs, using Apex/Amp optimization level O1.
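
As a rough illustration of the duration limits described above, the following sketch filters a JSON-lines manifest in which each entry carries a "duration" field in seconds. The manifest layout, the inclusive bounds, and the function name are assumptions; the actual preprocessing used for this model is not documented here.

```python
import json

MIN_DURATION_S = 0.1   # minimum sample length stated above
MAX_DURATION_S = 16.7  # maximum sample length stated above


def filter_manifest_lines(lines, min_s=MIN_DURATION_S, max_s=MAX_DURATION_S):
    """Return the parsed manifest entries whose duration is within bounds."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        entry = json.loads(line)
        # Assumption: both bounds are inclusive.
        if min_s <= entry["duration"] <= max_s:
            kept.append(entry)
    return kept
```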

The model has been fine-tuned with Room Impulse Response (RIR) and noise augmentation to make it more robust to noise.

Datasets

The datasets included in training are detailed in the table below. The "Duration" column indicates how many hours of audio are contained in that dataset before length filtering was performed.

Dataset	Speed Perturbed	Duration (h)
LibriSpeech	Y	2,903
Wall Street Journal	Y	245
Fisher English Training Speech	N	1,906
Switchboard	N	316
Mozilla Common Voice*	N	1,090
NSC Singapore English (Part 1)	N	1,857

* Only non-dev and non-test validated clips from Mozilla Common Voice version en_1488h_2019-12-10.

Performance

The performance of automatic speech recognition models is measured using character error rate (CER).

The model obtains the following scores on these evaluation datasets:

3.74 % on LibriSpeech dev-clean
10.21 % on LibriSpeech dev-other

Note that these scores on LibriSpeech are not particularly indicative of the quality of transcriptions that models trained on ASR Set will achieve, but they are a useful proxy.
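
For readers unfamiliar with the metric named in this section, character error rate is commonly defined as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below is a plain-Python illustration of that definition, not the evaluation code used to produce the scores above; text normalization is left out, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Passing lists of words instead of strings to edit_distance gives the word-level variant of the same computation.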

How to Use this Model

The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_jasper10x5dr")

Transcribing audio with this model

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="stt_en_jasper10x5dr" \
  audio_dir=""

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (wav files) as input.
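
Before passing files to the model, it can help to confirm that they match this input format. The following is a minimal sketch using Python's standard wave module, which reads PCM WAV files only; the function name is hypothetical, and files that fail the check would need to be resampled or downmixed with a separate tool.

```python
import wave

EXPECTED_RATE_HZ = 16000  # 16 kHz sample rate
EXPECTED_CHANNELS = 1     # mono


def is_supported_wav(path):
    """Return True if the WAV file at `path` is 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getnchannels() == EXPECTED_CHANNELS)
```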

Output

This model provides transcribed speech as a string for a given audio sample.

Limitations

Because this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

References

[1] Jasper: An End-to-End Convolutional Neural Acoustic Model

[2] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Publisher

NVIDIA

Latest Version: 1.0.0rc1

Updated: April 4, 2023 UTC

Compressed Size: 1.15 GB

Labels

Jasperdr PytorchLightning STT