STT En Quartznet15x5

Description

QuartzNet is a Jasper-like network that uses separable convolutions and larger filter sizes. Its accuracy is comparable to Jasper's with far fewer parameters. This particular model has 15 blocks, each containing a module repeated 5 times.

Publisher

NVIDIA

Use Case

Speech To Text

Framework

PyTorch with NeMo

Latest Version

1.0.0rc1

Modified

March 15, 2022

Size

67.7 MB

Model Overview

These models are based on the QuartzNet [1] architecture, a variant of Jasper [2] that uses 1D time-channel separable convolutional layers in its convolutional residual blocks, which makes them smaller than Jasper models.

Like Jasper, QuartzNet models use a character encoding. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. Here is a pre-trained QuartzNet speech-to-text (STT), also known as automatic speech recognition (ASR), Riva model.

Model Architecture

The QuartzNet model is composed of multiple blocks with residual connections between them and is trained with CTC loss. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers.
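
To see why time-channel separable convolutions shrink the model, the weight count of a regular 1D convolution can be compared with that of a depthwise convolution followed by a pointwise one. This is a rough sketch that ignores biases and batch normalization; the kernel size and channel counts are illustrative values chosen for the example, not figures from this model card.

```python
def regular_conv1d_weights(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 1D convolution (biases ignored)."""
    return kernel_size * c_in * c_out


def separable_conv1d_weights(kernel_size: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise conv (one filter per channel, over time)
    followed by a pointwise 1x1 conv (across channels), biases ignored."""
    depthwise = kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer shape (an assumption for this example).
K, C_IN, C_OUT = 33, 256, 256

regular = regular_conv1d_weights(K, C_IN, C_OUT)
separable = separable_conv1d_weights(K, C_IN, C_OUT)
print(regular, separable, round(regular / separable, 1))  # 2162688 73984 29.2
```

For this example shape the separable layer needs roughly 29 times fewer weights, which is the effect the architecture relies on.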

The QuartzNet 15x5 model consists of 79 layers and has a total of 18.9 million parameters: 15 blocks of five repeated modules each, plus four additional convolutional layers [1].
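
The 79-layer figure follows directly from the "15x5" in the model name plus the four additional convolutional layers. A minimal arithmetic check (the variable names are only for illustration):

```python
# "15x5": 15 and 5 are the block and repeat counts from the model name.
LAYERS_IN_BLOCKS = 15 * 5
# Additional convolutional layers outside the repeated blocks.
EXTRA_CONV_LAYERS = 4

total_layers = LAYERS_IN_BLOCKS + EXTRA_CONV_LAYERS
print(total_layers)  # 79
```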

Training

This QuartzNet model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to durations between 0.1 s and 16.7 s. The model was trained for 300 epochs with Apex/Amp optimization level O1.
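
The duration limits can be pictured as a simple filter over sample metadata. The sketch below is not the actual training pipeline: it assumes inclusive bounds and uses hypothetical entries with `audio_filepath` and `duration` fields.

```python
MIN_DURATION_S = 0.1
MAX_DURATION_S = 16.7


def keep_sample(duration_s: float) -> bool:
    """Return True if the sample falls inside the duration limits
    (bounds treated as inclusive, which is an assumption)."""
    return MIN_DURATION_S <= duration_s <= MAX_DURATION_S


# Hypothetical entries, for illustration only.
samples = [
    {"audio_filepath": "a.wav", "duration": 0.05},
    {"audio_filepath": "b.wav", "duration": 3.2},
    {"audio_filepath": "c.wav", "duration": 16.7},
    {"audio_filepath": "d.wav", "duration": 21.0},
]

kept = [s["audio_filepath"] for s in samples if keep_sample(s["duration"])]
print(kept)  # ['b.wav', 'c.wav']
```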

Datasets

The datasets included in training are detailed in the table below. The "Duration" column indicates how many hours of audio are contained in that dataset before length filtering was performed.

| Dataset                        | Speed Perturbed | Duration (h) |
|--------------------------------|-----------------|--------------|
| LibriSpeech                    | Y               | 2,903        |
| Wall Street Journal            | Y               | 245          |
| Fisher English Training Speech | N               | 1,906        |
| Switchboard                    | N               | 316          |
| Mozilla Common Voice*          | N               | 1,090        |
| NSC Singapore English (Part 1) | N               | 1,857        |

* Only non-dev and non-test validated clips from Mozilla Common Voice version en_1488h_2019-12-10 were used.
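
Summing the Duration column gives the total before length filtering. The result is larger than the 7,057 hours quoted under Training; the source does not say how much of that gap is due to length filtering, so the sketch below only reproduces the table arithmetic.

```python
# Hours per dataset, copied from the table above (before length filtering).
durations_h = {
    "LibriSpeech": 2903,
    "Wall Street Journal": 245,
    "Fisher English Training Speech": 1906,
    "Switchboard": 316,
    "Mozilla Common Voice": 1090,
    "NSC Singapore English (Part 1)": 1857,
}

total_h = sum(durations_h.values())
print(total_h)  # 8317
```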

Performance

The performance of automatic speech recognition models is measured using Word Error Rate (WER).

The model obtains the following scores on these evaluation datasets:

  • 4.4 % on LibriSpeech dev-clean
  • 11.3 % on LibriSpeech dev-other

Note that these LibriSpeech scores are not particularly indicative of the transcription quality that models trained on ASR Set will achieve, but they are a useful proxy.
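
Error rates for ASR are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below is a plain-Python illustration of that definition at both the word and character level. It is not the scoring code used to produce the numbers above, and the two sentences are made up for the example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"

wer = error_rate(reference.split(), hypothesis.split())  # word level
cer = error_rate(list(reference), list(hypothesis))      # character level
print(f"WER={wer:.3f} CER={cer:.3f}")  # WER=0.333 CER=0.227
```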

How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5")
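
Once loaded, the model can transcribe a list of audio files through its `transcribe` method. The helper below is a small sketch of how the results might be paired with their input paths; `transcribe_files` is a hypothetical name. It assumes that `transcribe` accepts the list of paths as its first positional argument and a `batch_size` keyword, and that each result is either a string or an object with a `.text` attribute, depending on the NeMo version.

```python
from typing import Any, Dict, List


def transcribe_files(model: Any, paths: List[str], batch_size: int = 4) -> Dict[str, str]:
    """Map each audio path to its transcript using `model.transcribe`."""
    results = model.transcribe(paths, batch_size=batch_size)
    # Assumption: results are plain strings in some NeMo versions and
    # hypothesis objects with a `.text` attribute in others.
    texts = [getattr(item, "text", item) for item in results]
    return dict(zip(paths, texts))
```

With the `asr_model` loaded above, a call such as `transcribe_files(asr_model, ["/path/to/sample.wav"])` would return a dictionary keyed by file path; the path shown is a placeholder.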

Transcribing audio with this model

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="stt_en_quartznet15x5" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
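
Before transcription it can help to confirm that a file matches this format, meaning 16,000 samples per second and a single channel. The following sketch uses Python's standard `wave` module, which reads uncompressed PCM WAV files; `is_supported_wav` is a hypothetical helper name, not part of NeMo.

```python
import wave

EXPECTED_RATE_HZ = 16000  # 16 kHz sample rate
EXPECTED_CHANNELS = 1     # mono


def is_supported_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getnchannels() == EXPECTED_CHANNELS)
```

Files that fail the check would need to be resampled or downmixed with an audio tool before being passed to the model.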

Output

This model provides transcribed speech as a string for a given audio sample.

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

References

[1] QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

[2] Jasper: An End-to-End Convolutional Neural Acoustic Model

[3] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.