
RIVA Citrinet-256 ASR English - ASR set 2.0



English Citrinet-256 ASR model trained on ASR set 2.0 without weight decay



Use Case: NVIDIA Riva

Latest Version Published: February 7, 2023

Size: 39.23 MB

Speech Recognition: Citrinet-256 Model Card

Model Overview

Citrinet models are end-to-end neural automatic speech recognition (ASR) models that transcribe segments of audio to text.

Model Architecture

Citrinet is a version of QuartzNet that extends ContextNet, utilizing subword encoding (via Word Piece tokenization) and a Squeeze-and-Excitation (SE) mechanism, and is therefore smaller than QuartzNet models.
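The Squeeze-and-Excitation mechanism recalibrates channel activations using a gating vector computed from a global summary of the input. The following is a minimal NumPy sketch of the idea, with randomly initialized weights standing in for learned parameters; the function name and shapes are illustrative, not the model's actual implementation:

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Squeeze-and-Excitation over a (channels, time) feature map.

    x:  (C, T) activations
    w1: (C // r, C) reduction weights
    w2: (C, C // r) expansion weights
    """
    # Squeeze: global average pooling over the time axis -> (C,)
    z = x.mean(axis=1)
    # Excite: bottleneck MLP, ReLU then sigmoid gating -> per-channel gate in (0, 1)
    s = np.maximum(w1 @ z, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Re-scale each channel of the input by its gate
    return x * gates[:, None]

rng = np.random.default_rng(0)
C, T, r = 8, 20, 4
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (8, 20)
```

Because each gate lies in (0, 1), the block can only attenuate channels, letting the network emphasize informative ones relative to the rest.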

CitriNet models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. The pretrained models here can be used immediately for fine-tuning or dataset evaluation.

Intended Use

The primary use case intended for these models is automatic speech recognition.

Input: Single-channel audio files (WAV) with a 16 kHz sample rate

Output: Transcripts, which are sequences of valid vocabulary labels as given by the specification file
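The input constraints above (single-channel WAV at 16 kHz) can be verified before sending audio to the model using Python's standard `wave` module. The `check_wav` helper below is an illustrative sketch, not part of Riva:

```python
import wave

def check_wav(path):
    """Return (ok, details) for the single-channel, 16 kHz WAV requirement."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    ok = channels == 1 and rate == 16000
    return ok, {"channels": channels, "sample_rate": rate, "duration_s": duration}
```

For example, `check_wav("sample.wav")` returns `(True, {...})` only when the file is mono and sampled at 16 kHz.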

How to Use This Model

Citrinet is an end-to-end architecture trained using CTC loss. These model checkpoints are intended to be used with NVIDIA Riva.

The models are encrypted with the key tlt_encode.
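Because the network is trained with CTC loss, its per-frame label outputs are decoded by collapsing repeated labels and then removing the blank symbol. Riva performs this internally; the following greedy decoder, with a made-up toy vocabulary, just illustrates the rule:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Apply the CTC decoding rule: collapse repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Toy example: blank=0, vocabulary {1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
vocab = {1: "c", 2: "a", 3: "t"}
print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # cat
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents genuine double letters.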

Training Information

The model was trained on a variety of proprietary and open-source datasets, including domain-specific data, spontaneous speech, and dialogue, all of which contribute to the model's accuracy.

This model delivers a WER that is better than or comparable to popular alternative speech-to-text solutions across a range of domains and use cases.


Currently, Citrinet models only support training and inference on .wav audio files. All models included here were trained and evaluated on audio files with a 16 kHz sample rate, so for best performance you may need to upsample or downsample audio files to 16 kHz.
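Resampling to 16 kHz can be sketched with simple linear interpolation, as below; this is a quick-and-dirty illustration, and a dedicated resampler (e.g. in librosa, torchaudio, or sox) gives better quality in practice:

```python
import numpy as np

def resample_linear(samples, orig_rate, target_rate=16000):
    """Resample a 1-D signal to target_rate via linear interpolation."""
    n_out = int(round(len(samples) * target_rate / orig_rate))
    # Positions of the output samples on the input's sample-index axis
    x_out = np.linspace(0, len(samples) - 1, n_out)
    x_in = np.arange(len(samples))
    return np.interp(x_out, x_in, samples)

# 1 second of a 440 Hz tone at 8 kHz -> 16 kHz doubles the sample count
audio_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
audio_16k = resample_linear(audio_8k, 8000)
print(len(audio_16k))  # 16000
```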

In addition, the model performs best on audio samples longer than 0.1 seconds. For training and fine-tuning Citrinet models, it is recommended to cap samples at a maximum length of around 15 seconds, depending on the amount of memory available to you. No maximum length limit is needed for evaluation.
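When preparing training data under the limits above, clips outside the recommended range can be filtered out ahead of time. The manifest format here (a list of dicts with a `duration` field, in the style of NeMo-like ASR manifests) is assumed for illustration:

```python
def filter_by_duration(entries, min_s=0.1, max_s=15.0):
    """Keep only entries whose duration fits the recommended training range."""
    return [e for e in entries if min_s < e["duration"] <= max_s]

manifest = [
    {"audio_filepath": "a.wav", "duration": 0.05},   # too short
    {"audio_filepath": "b.wav", "duration": 4.2},    # kept
    {"audio_filepath": "c.wav", "duration": 22.0},   # too long for training
]
kept = filter_by_duration(manifest)
print([e["audio_filepath"] for e in kept])  # ['b.wav']
```

For evaluation, the same helper could be called with `max_s=float("inf")`, since only the lower bound applies there.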


By downloading and using the models and resources packaged with Riva, you accept the terms of the Riva license.

Suggested Reading

Deploy your model to production using Riva. Learn more about the Riva framework.

Ethical AI

NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.