
NeMo - Automatic Speech Recognition

This collection contains NeMo models for Automatic Speech Recognition (ASR): Speech to Text, Speech Classification, Speaker Diarization, Speaker Verification, Speaker Recognition, Command Recognition, and Voice Activity Detection.
July 24, 2023


The NVIDIA NeMo toolkit supports various Automatic Speech Recognition (ASR) models such as Jasper, QuartzNet, Citrinet and Conformer-CTC. It also supports multiple subtasks related to speech classification, speaker recognition and speaker diarization. For further information regarding NeMo's capabilities in the domain of speech recognition, visit the NeMo ASR documentation page.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. For more details, see the Riva documentation on Model Development with NeMo.


You can instantiate many pretrained models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.asr as nemo_asr

Then choose the type of model you would like to instantiate. See the tables below for the list of models that are available for each task. For example:

# simply use ASRModel to instantiate any ASR pretrained model
quartznet = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# or equivalently, use the exact class
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

Note that you can also list all available models through the API by calling the .list_available_models(...) method.
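As a rough sketch of how such a listing could be filtered: the snippet below assumes that each entry returned by list_available_models() exposes a pretrained_model_name attribute. The helper names available_names, names_with_prefix and list_ctc_models are made up for illustration and are not part of NeMo.

```python
def available_names(model_infos):
    """Collect the pretrained model names from a list of model-info entries."""
    # Assumption: every entry has a `pretrained_model_name` attribute.
    return sorted(info.pretrained_model_name for info in model_infos)


def names_with_prefix(model_infos, prefix):
    """Keep only the names starting with a prefix, e.g. "stt_en_" for English."""
    return [name for name in available_names(model_infos) if name.startswith(prefix)]


def list_ctc_models(prefix="stt_en_"):
    """Query the available CTC models (needs NeMo installed and network access)."""
    import nemo.collections.asr as nemo_asr  # imported lazily on purpose

    infos = nemo_asr.models.EncDecCTCModel.list_available_models()
    return names_with_prefix(infos, prefix)
```

Calling list_ctc_models("stt_fr_") would then be expected to return only the French CTC checkpoints, provided the attribute assumption above holds for your NeMo version.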

You can also download a model's ".nemo" file from the "File Browser" tab and then instantiate the model with the restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure that your NeMo version matches the version of the model.
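A minimal sketch of restoring a downloaded checkpoint might look like the following. The file check and the helper names check_nemo_checkpoint and load_local_model are illustrative additions, not part of NeMo; the only NeMo call used is restore_from, here with the restore_path keyword.

```python
from pathlib import Path


def check_nemo_checkpoint(path):
    """Return the path as a Path object if it is an existing .nemo file."""
    checkpoint = Path(path)
    if checkpoint.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {checkpoint.name}")
    if not checkpoint.is_file():
        raise FileNotFoundError(f"no such checkpoint: {checkpoint}")
    return checkpoint


def load_local_model(path):
    """Restore a downloaded checkpoint (requires NeMo to be installed)."""
    import nemo.collections.asr as nemo_asr  # imported lazily on purpose

    checkpoint = check_nemo_checkpoint(path)
    return nemo_asr.models.ASRModel.restore_from(restore_path=str(checkpoint))
```

Checking the path first gives a clearer error than a failed restore, but it does not detect a mismatch between the NeMo version and the model version.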

ASR with NeMo

NeMo's ASR collection supports a variety of sub-domains related to the field of speech and speaker recognition. Here, we list each task along with the pretrained models that are available for that task. Multiple examples are available under the examples/asr/ directory of NeMo, as well as several tutorial notebooks under tutorials/asr/ at NVIDIA NeMo.

Automatic Speech Recognition (ASR)

Automatic speech recognition (ASR) is the task of transcribing a given audio segment into text that can be read. NeMo supports a large collection of models such as Jasper, QuartzNet, Citrinet and Conformer-CTC in order to perform automatic speech recognition. Visit NeMo Automatic Speech Recognition for more information.
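To illustrate the transcription task, here is a hedged sketch that loads one of the pretrained models and transcribes a list of audio files. It assumes that the result of transcribe() may be a list of strings, a tuple whose first element holds the best hypotheses, or a list of objects carrying the transcript in a text attribute, depending on the model type and NeMo version. The helpers to_texts and transcribe_files are hypothetical names.

```python
def to_texts(result):
    """Normalize a transcribe() result into a plain list of strings."""
    # Assumption: some versions return a (best, all) tuple for transducer models.
    if isinstance(result, tuple):
        result = result[0]
    texts = []
    for item in result:
        # Assumption: hypothesis-like objects carry the transcript in `.text`.
        texts.append(item if isinstance(item, str) else item.text)
    return texts


def transcribe_files(audio_paths, model_name="stt_en_conformer_ctc_small"):
    """Download a pretrained model and transcribe audio files (requires NeMo)."""
    import nemo.collections.asr as nemo_asr  # imported lazily on purpose

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    return to_texts(model.transcribe(audio_paths))
```

Any model name from the tables in this section could be passed as model_name; the default here is only an example.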


Model Name Model Base Class Model Card
stt_en_jasper10x5dr EncDecCTCModel NGC Model Card
stt_en_citrinet_256 EncDecCTCModelBPE NGC Model Card
stt_en_citrinet_512 EncDecCTCModelBPE NGC Model Card
stt_en_citrinet_1024 EncDecCTCModelBPE NGC Model Card
stt_en_citrinet_256_gamma_0_25 EncDecCTCModelBPE NGC Model Card
stt_en_citrinet_512_gamma_0_25 EncDecCTCModelBPE NGC Model Card
stt_en_citrinet_1024_gamma_0_25 EncDecCTCModelBPE NGC Model Card
stt_en_contextnet_1024_mls EncDecRNNTBPEModel NGC Model Card
stt_en_contextnet_512_mls EncDecRNNTBPEModel NGC Model Card
stt_en_contextnet_256_mls EncDecRNNTBPEModel NGC Model Card
stt_en_contextnet_1024 EncDecRNNTBPEModel NGC Model Card
stt_en_contextnet_512 EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_ctc_small EncDecCTCModelBPE NGC Model Card
stt_en_conformer_ctc_medium EncDecCTCModelBPE NGC Model Card
stt_en_conformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_en_conformer_ctc_small_ls EncDecCTCModelBPE NGC Model Card
stt_en_conformer_ctc_medium_ls EncDecCTCModelBPE NGC Model Card
stt_en_conformer_ctc_large_ls EncDecCTCModelBPE NGC Model Card
stt_en_conformer_transducer_small EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_transducer_medium EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_transducer_large EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_transducer_large_ls EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_transducer_xlarge EncDecRNNTBPEModel NGC Model Card
stt_en_conformer_transducer_xxlarge EncDecRNNTBPEModel NGC Model Card
stt_en_squeezeformer_ctc_xsmall_ls EncDecCTCModelBPE NGC Model Card
stt_en_squeezeformer_ctc_small_ls EncDecCTCModelBPE NGC Model Card
stt_en_squeezeformer_ctc_small_medium_ls EncDecCTCModelBPE NGC Model Card
stt_en_squeezeformer_ctc_medium_ls EncDecCTCModelBPE NGC Model Card
stt_en_squeezeformer_ctc_medium_large_ls EncDecCTCModelBPE NGC Model Card
stt_en_squeezeformer_ctc_large_ls EncDecCTCModelBPE NGC Model Card
stt_en_fastconformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_en_fastconformer_ctc_xlarge EncDecCTCModelBPE NGC Model Card
stt_en_fastconformer_transducer_large EncDecRNNTBPEModel NGC Model Card
stt_en_fastconformer_transducer_xlarge EncDecRNNTBPEModel NGC Model Card
stt_en_fastconformer_transducer_xxlarge EncDecRNNTBPEModel NGC Model Card
stt_en_fastconformer_hybrid_large_pc EncDecHybridRNNTCTCBPEModel NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_480s EncDecHybridRNNTCTCBPEModel NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_1040s EncDecHybridRNNTCTCBPEModel NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_multi EncDecHybridRNNTCTCBPEModel NGC Model Card


Model Name Model Base Class Model Card
stt_fr_quartznet15x5 EncDecCTCModel NGC Model Card
stt_fr_citrinet_1024_gamma_0_25 EncDecCTCModelBPE NGC Model Card
stt_fr_conformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_fr_contextnet_1024 EncDecRNNTBPEModel NGC Model Card
stt_fr_conformer_transducer_large EncDecRNNTBPEModel NGC Model Card


Model Name Model Base Class Model Card
stt_ca_quartznet15x5 EncDecCTCModel NGC Model Card
stt_ca_conformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_ca_conformer_transducer_large EncDecRNNTBPEModel NGC Model Card


Model Name Model Base Class Model Card
stt_it_quartznet15x5 EncDecCTCModel NGC Model Card


Model Name Model Base Class Model Card
stt_es_quartznet15x5 EncDecCTCModel NGC Model Card
stt_es_citrinet_512 EncDecCTCModelBPE NGC Model Card


Model Name Model Base Class Model Card
stt_de_quartznet15x5 EncDecCTCModel NGC Model Card
stt_de_citrinet_1024 EncDecCTCModelBPE NGC Model Card
stt_de_conformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_de_contextnet_1024 EncDecRNNTBPEModel NGC Model Card
stt_de_conformer_transducer_large EncDecRNNTBPEModel NGC Model Card


Model Name Model Base Class Model Card
stt_pl_quartznet15x5 EncDecCTCModel NGC Model Card


Model Name Model Base Class Model Card
stt_ru_quartznet15x5 EncDecCTCModel NGC Model Card


Model Name Model Base Class Model Card
stt_zh_citrinet_512 EncDecCTCModel NGC Model Card
stt_zh_citrinet_1024_gamma_0_25 EncDecCTCModel NGC Model Card
stt_zh_conformer_transducer_large EncDecCTCModel NGC Model Card


Model Name Model Base Class Model Card
stt_rw_conformer_ctc_large EncDecCTCModelBPE NGC Model Card
stt_rw_conformer_transducer_large EncDecRNNTBPEModel NGC Model Card

Language Modeling for ASR

Language models can help to increase the accuracy of ASR models by incorporating language knowledge into their predictions. NeMo supports N-gram language models fused with beam search decoding, as well as neural rescoring. Visit NeMo Language Modeling for ASR for more information.
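The idea of fusing a language model with the acoustic model can be sketched with a commonly used scoring rule: the acoustic log-probability plus a weighted language model log-probability plus a length bonus. This is a toy illustration under that assumption, not NeMo's decoder; the weights alpha and beta, the function names and the lm_logprob_fn callback are all hypothetical.

```python
def fused_score(acoustic_logprob, lm_logprob, num_words, alpha=0.5, beta=1.0):
    """Combine acoustic and language model evidence for one candidate."""
    # alpha weights the language model; beta rewards longer transcripts.
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def rescore(candidates, lm_logprob_fn, alpha=0.5, beta=1.0):
    """Return the text of the best (text, acoustic_logprob) candidate."""

    def score(candidate):
        text, acoustic_logprob = candidate
        return fused_score(
            acoustic_logprob, lm_logprob_fn(text), len(text.split()), alpha, beta
        )

    return max(candidates, key=score)[0]
```

With alpha set to 0 the language model is ignored and the acoustically best candidate wins; raising alpha lets a fluent but acoustically weaker candidate overtake it.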

Model Name Model Base Class Model Card
asrlm_en_transformer_large_ls TransformerLMModel NGC Model Card

Speech Classification (SC)

Speech Classification (SC) refers to a set of tasks in which a program automatically classifies an input utterance or audio segment into categories. Examples include Speech Command Recognition (multi-class), Voice Activity Detection (binary or multi-class), and Audio Sentiment Classification (typically multi-class). NeMo provides MatchboxNet and MarbleNet models for speech classification tasks. Visit NeMo Speech Classification for more information.

Model Name Model Base Class Model Card
vad_marblenet EncDecClassificationModel NGC Model Card
vad_telephony_marblenet EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v1 EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v1 EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v2 EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v2 EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v2_subset_task EncDecClassificationModel NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v2_subset_task EncDecClassificationModel NGC Model Card

Speaker Recognition (SR)

Speaker Recognition (SR) is a broad research area that addresses two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). We focus on far-field, text-independent speaker recognition, where the identity of the speaker is determined by how the speech is spoken, not by what is being said. Typically, such SR systems operate on unconstrained speech utterances, which are converted into fixed-length vectors called speaker embeddings. Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis. Visit NeMo Speaker Recognition for more information.
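One common way to use fixed-length speaker embeddings for verification is to compare two of them with cosine similarity and accept the claim when the score exceeds a threshold. The sketch below assumes that approach; the threshold of 0.7 is an arbitrary placeholder rather than a recommended value, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def same_speaker(embedding_a, embedding_b, threshold=0.7):
    """Decide whether two embeddings likely belong to the same speaker."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold
```

In practice the threshold would be tuned on held-out trial pairs for the embedding model in use.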

Model Name Model Base Class Model Card
speakerverification_speakernet EncDecSpeakerLabelModel NGC Model Card
ecapa_tdnn EncDecSpeakerLabelModel NGC Model Card
titanet_large EncDecSpeakerLabelModel NGC Model Card

Speaker Diarization (SD)

Speaker Diarization (SD) is the task of segmenting audio recordings by speaker labels, that is, answering the question "who speaks when?". A diarization system consists of a Voice Activity Detection (VAD) model, which produces time stamps for the parts of the audio that contain speech while ignoring background noise, and a speaker embedding model, which extracts speaker embeddings from the speech segments given by the VAD time stamps. These speaker embeddings are then clustered according to the number of speakers present in the audio recording. Visit NeMo Speaker Diarization for more information.
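The first stage of such a pipeline, turning VAD output into speech time stamps, can be sketched as follows. This is a simplified illustration, not NeMo's implementation: it assumes the VAD model yields one speech probability per frame, and the parameter names frame_shift and threshold are hypothetical. Embedding extraction and clustering are left out.

```python
def speech_segments(frame_probs, frame_shift=0.01, threshold=0.5):
    """Turn per-frame speech probabilities into (start_sec, end_sec) segments."""
    segments = []
    start = None  # index of the first frame of the current speech run
    for index, prob in enumerate(frame_probs):
        if prob >= threshold and start is None:
            start = index
        elif prob < threshold and start is not None:
            segments.append((start * frame_shift, index * frame_shift))
            start = None
    # Close a speech run that lasts until the end of the recording.
    if start is not None:
        segments.append((start * frame_shift, len(frame_probs) * frame_shift))
    return segments
```

Each resulting segment would then be passed to the speaker embedding model, and the embeddings clustered to assign speaker labels.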

Speaker diarization is a composite task that uses multiple independent models together. For further details about the pretrained checkpoints used for this task, please visit the checkpoints page for NeMo Speaker Diarization.