Jasper checkpoint (PyTorch, AMP, LibriSpeech)

Jasper PyTorch checkpoint trained on LibriSpeech (test-other 9.66% WER)


Publisher: NVIDIA Deep Learning Examples
Use Case: Speech Recognition
Latest Version: October 29, 2021
Size: 1.24 GB

Model Overview

The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR).

Model Architecture

Details on the model architecture can be found in the paper Jasper: An End-to-End Convolutional Neural Acoustic Model.

Figure 1: Jasper BxR model. B: number of blocks; R: number of sub-blocks.
Figure 2: Jasper Dense Residual.

Jasper is an end-to-end neural acoustic model that is based on convolutions. In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame. The acoustic model has a modular block structure and can be parametrized accordingly: a Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
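The per-frame probability distributions are typically converted to text with a CTC-style decoder. As a minimal illustration (the labels and vocabulary below are toy values, and the repository's actual decoder may also use beam search with a language model), greedy decoding collapses repeated labels and drops blanks:

```python
# Minimal sketch of greedy CTC-style decoding of per-frame outputs.
# The vocabulary and frame labels are made up for illustration.
BLANK = "_"

def greedy_decode(frame_labels):
    """Collapse repeated labels, then drop CTC blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Per-frame argmax labels for a toy utterance:
frames = ["_", "c", "c", "_", "a", "a", "t", "t", "_"]
print(greedy_decode(frames))  # -> cat
```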

Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
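The sub-block sequence above can be sketched as a small PyTorch module. The channel counts, kernel size, and dropout rate here are illustrative placeholders, not the published Jasper configuration:

```python
import torch
import torch.nn as nn

# Sketch of one Jasper sub-block: 1D convolution -> batch norm -> ReLU
# -> dropout. All hyperparameters below are illustrative.
class JasperSubBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)  # "same"-style padding
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):               # x: (batch, channels, time)
        return self.drop(self.relu(self.bn(self.conv(x))))

x = torch.randn(4, 64, 100)             # batch of 4, 64 feature channels, 100 frames
y = JasperSubBlock(64, 128, kernel_size=11)(x)
print(y.shape)                          # torch.Size([4, 128, 100])
```

With stride 1 and symmetric padding, the time dimension is preserved, so sub-blocks can be stacked freely within a block.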

Each block input is connected directly to the last sub-block of all following blocks via a residual connection, referred to in the paper as a dense residual. Blocks differ in kernel size and number of filters, both increasing from the bottom to the top of the network. Irrespective of the exact block configuration parameters B and R, every Jasper model has four additional convolutional blocks: one immediately following the input layer (Prologue) and three at the end of the B blocks (Epilogue).

The Prologue decimates the audio signal in time, so that a shorter sequence is processed for efficiency. The Epilogue uses dilated convolutions to capture a larger context around each audio time step, which decreases the model's word error rate (WER). The paper achieves its best results with Jasper 10x5 with dense residual connections, which is also the focus of this repository and is referred to below as Jasper Large.
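The dense residual wiring can be sketched as follows. In this simplified version every block keeps the same channel count and each earlier block input is projected through a 1x1 convolution and batch norm before being added to the current block's output; the real model increases the filter count per block, and these toy sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Sketch of Jasper's dense residual wiring: the input of every block is
# routed, via a 1x1 conv + batch norm projection, into the output of
# each following block. Block sizes and counts here are toy values.
class DenseResidualStack(nn.Module):
    def __init__(self, num_blocks=3, channels=64, kernel_size=11):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.residuals = nn.ModuleList()   # per block: projections of earlier inputs
        for i in range(num_blocks):
            self.blocks.append(nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2),
                nn.BatchNorm1d(channels),
            ))
            # block i receives residuals from inputs of blocks 0..i
            self.residuals.append(nn.ModuleList(
                nn.Sequential(nn.Conv1d(channels, channels, 1),
                              nn.BatchNorm1d(channels))
                for _ in range(i + 1)
            ))
        self.act = nn.Sequential(nn.ReLU(), nn.Dropout(0.2))

    def forward(self, x):
        inputs = []                        # inputs of all blocks so far
        for block, projs in zip(self.blocks, self.residuals):
            inputs.append(x)
            out = block(x)
            for inp, proj in zip(inputs, projs):
                out = out + proj(inp)      # dense residual connections
            x = self.act(out)
        return x

out = DenseResidualStack()(torch.randn(2, 64, 50))
print(out.shape)                           # torch.Size([2, 64, 50])
```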


This model was trained using the training scripts available on NGC and in the GitHub repository.


The following datasets were used to train this model:

  • LibriSpeech - a corpus of approximately 1000 hours of 16 kHz read English speech derived from audiobooks in the LibriVox project, carefully segmented and aligned.


Performance numbers for this model are available on NGC.



This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the scripts and of the datasets from which the model was derived.