QuartzNet is an end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers, and the model is trained with CTC loss. QuartzNet is a Jasper-like network that uses separable convolutions and larger filter sizes; it achieves comparable accuracy to Jasper with far fewer parameters.
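The sketch below illustrates the basic building block described above: a 1D time-channel separable convolution (a depthwise convolution over time followed by a pointwise 1x1 convolution), batch normalization, and ReLU. It is a minimal, illustrative PyTorch example, not the official implementation; the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Depthwise conv over time per channel, then a pointwise (1x1) conv
    that mixes channels -- far fewer weights than a full 1D convolution."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels, bias=False)
        self.pointwise = nn.Conv1d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        x = self.pointwise(self.depthwise(x))
        return self.relu(self.bn(x))

# Example: 64 mel-feature channels, 256 output channels, kernel size 33
block = TimeChannelSeparableConv1d(64, 256, 33)
features = torch.randn(8, 64, 400)   # (batch, features, frames)
print(block(features).shape)          # torch.Size([8, 256, 400])
```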
We provide a QuartzNet model pre-trained on WSJ, LibriSpeech, and Mozilla's Common Voice En. Specifically, we fine-tune the pre-trained QuartzNet model available in NGC on Wall Street Journal data (CSR-I (WSJ0) Complete and CSR-II (WSJ1) Complete).
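As a hedged sketch of the starting point for this fine-tuning, the snippet below pulls a pre-trained QuartzNet checkpoint from NGC via the NVIDIA NeMo toolkit and runs a quick inference check. The model name `QuartzNet15x5Base-En` and the exact `transcribe()` signature are assumptions that may vary across NeMo releases.

```python
import nemo.collections.asr as nemo_asr

# Download the pre-trained CTC model from NGC (cached locally afterwards).
# "QuartzNet15x5Base-En" is an assumed checkpoint name; check the NGC catalog.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En")

# Sanity-check transcription on a local WAV file before fine-tuning on WSJ.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```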