
QuartzNet15x5: WSJ, LibriSpeech & MCV


Description

QuartzNet15x5 model trained on WSJ, LibriSpeech and Mozilla's Common Voice En with NeMo

Publisher

NVIDIA

Use Case

Speech Recognition with NeMo

Framework

PyTorch with NeMo

Latest Version

1

Modified

August 4, 2021

Size

72.61 MB

Overview

QuartzNet is an end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers, and the model is trained with CTC loss. QuartzNet is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters.
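Because the model is trained with CTC loss, its per-frame outputs are decoded by collapsing repeated labels and removing blank symbols. A minimal sketch of that greedy-decoding step (the toy alphabet and `BLANK` symbol below are illustrative; QuartzNet's actual vocabulary and NeMo's decoder differ):

```python
BLANK = "_"  # hypothetical blank symbol for illustration

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeats, then drop CTC blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Frames "_cc_aa_t_" decode to "cat": repeats merge, blanks vanish.
print(ctc_greedy_decode(list("_cc_aa_t_")))  # -> cat
```

Note that a blank between identical labels is what allows genuine double letters (e.g. frames `hel_lo` decode to "hello", while `hello` would collapse to "helo").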

We provide a QuartzNet model pre-trained on WSJ, LibriSpeech and Mozilla's Common Voice En. Specifically, we fine-tune the pre-trained QuartzNet model available in NGC with Wall Street Journal data (CSR-I (WSJ0) Complete and CSR-II (WSJ1) Complete).

Datasets

  • Pre-trained on: LibriSpeech with ±10% speed perturbation and Mozilla's Common Voice En (Validated).
  • Fine-tuned on: Wall Street Journal with ±10% speed perturbation (CSR-I (WSJ0) Complete and CSR-II (WSJ1) Complete).

Word Error Rate

  • WSJ eval-92: (Greedy) 4.45%, (+ 6-Gram WSJ Language Model) 2.39%
  • WSJ dev-93: (Greedy) 6.59%, (+ 6-Gram WSJ Language Model) 3.76%
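The figures above use the standard word error rate: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch of that metric (a hypothetical helper, not NeMo's scoring code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # One-row dynamic-programming Levenshtein over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag = row[0]
        row[0] = i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,            # deletion
                         row[j - 1] + 1,        # insertion
                         prev_diag + (r != h))  # substitution / match
            prev_diag = cur
    return row[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

A reported WER of 4.45% on WSJ eval-92 means roughly 4.45 word errors per 100 reference words, averaged over the whole test set.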