NeMo - Automatic Speech Recognition

Description

This collection contains NeMo models for Automatic Speech Recognition (ASR): Speech to Text, Speech Classification, Speaker Diarization, Speaker Verification, Speaker Recognition, Command Recognition, and Voice Activity Detection.

Curator

NVIDIA

Modified

March 14, 2025

Overview

The NVIDIA NeMo toolkit supports various Automatic Speech Recognition (ASR) models such as Jasper, QuartzNet, Citrinet and Conformer-CTC. It also supports several related subtasks: speech classification, speaker recognition and speaker diarization. For further information about NeMo's speech recognition capabilities, visit the NeMo ASR documentation page.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. For more details, see the Riva documentation on Model Development with NeMo.

Usage

You can instantiate many pretrained models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.asr as nemo_asr

Then choose what type of model you would like to instantiate. See the tables below for the list of models that are available for each task. For example:

# simply use ASRModel to instantiate any ASR pretrained model
quartznet = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# or equivalently, use the exact class
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

Note that you can also list all available models through the API by calling the .list_available_models(...) method.
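
As a rough sketch of how the listing call might be used: the example below assumes that each record returned by list_available_models() exposes a pretrained_model_name attribute, and filters the names by prefix. The helper names names_with_prefix and list_english_bpe_ctc_models are hypothetical, and the stand-in records exist only so the filter can be tried without NeMo installed.

```python
from types import SimpleNamespace


def names_with_prefix(model_infos, prefix):
    """Return the sorted model names that start with `prefix`."""
    return sorted(
        info.pretrained_model_name
        for info in model_infos
        if info.pretrained_model_name.startswith(prefix)
    )


def list_english_bpe_ctc_models():
    """List English checkpoints for one model class (requires NeMo)."""
    import nemo.collections.asr as nemo_asr  # imported lazily on purpose

    infos = nemo_asr.models.EncDecCTCModelBPE.list_available_models()
    return names_with_prefix(infos, "stt_en_")


# Stand-in records (not real query results) to exercise the filter.
demo_infos = [
    SimpleNamespace(pretrained_model_name="stt_en_citrinet_256"),
    SimpleNamespace(pretrained_model_name="stt_fr_conformer_ctc_large"),
]
english_names = names_with_prefix(demo_infos, "stt_en_")
```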

You can also download a model's .nemo file from the "File Browser" tab and then instantiate the model with the restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure that your NeMo version matches the model's version.
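
The two loading routes described above could be wrapped in a single helper, sketched here under the assumption that a local checkpoint can be recognized by its .nemo suffix. The function names is_local_checkpoint and load_asr_model are hypothetical; only from_pretrained and restore_from come from NeMo.

```python
def is_local_checkpoint(name_or_path):
    """True when the argument looks like a path to a .nemo file."""
    return str(name_or_path).endswith(".nemo")


def load_asr_model(name_or_path):
    """Load from a local .nemo file, otherwise by pretrained model name."""
    import nemo.collections.asr as nemo_asr  # requires NeMo to be installed

    if is_local_checkpoint(name_or_path):
        # Local file downloaded from the "File Browser" tab.
        return nemo_asr.models.ASRModel.restore_from(restore_path=name_or_path)
    # Name of a pretrained model hosted on NGC.
    return nemo_asr.models.ASRModel.from_pretrained(model_name=name_or_path)
```

With NeMo installed, load_asr_model("stt_en_citrinet_256") would take the NGC route, while a path ending in .nemo would take the local route.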

ASR with NeMo

NeMo's ASR collection supports a variety of sub-domains related to the field of speech and speaker recognition. Here, we list each task along with the pretrained models that are available for that task. Multiple example scripts are available under the examples/asr/ directory of NeMo, as well as several tutorial notebooks under tutorials/asr/ at NVIDIA NeMo.

Automatic Speech Recognition (ASR)

Automatic speech recognition (ASR) is the task of transcribing a given audio segment into text that can be read. NeMo supports a large collection of models such as Jasper, QuartzNet, Citrinet and Conformer-CTC in order to perform automatic speech recognition. Visit NeMo Automatic Speech Recognition for more information.
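
To illustrate how a CTC model such as Conformer-CTC turns per-frame predictions into text, here is a minimal greedy decoding sketch. It is an illustration of the general CTC rule (collapse repeated labels, then drop blanks), not NeMo's decoder; the toy vocabulary and the position of the blank index are assumptions.

```python
def ctc_greedy_decode(frame_ids, vocabulary, blank_id):
    """Collapse consecutive repeats, then remove blank labels."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the label changes and is not the blank.
        if idx != previous and idx != blank_id:
            tokens.append(vocabulary[idx])
        previous = idx
    return "".join(tokens)


# Toy vocabulary; the blank is assumed to sit right after the last token.
vocab = ["a", "b", "c"]
BLANK = len(vocab)

# Most likely label per frame (toy values).
frames = [0, 0, BLANK, 0, 1, 1, BLANK, 2]
text = ctc_greedy_decode(frames, vocab, BLANK)
```

The blank between the first two runs of label 0 is what allows the output to contain the same character twice in a row.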

English

Model Name	Model Base Class	Model Card
stt_en_jasper10x5dr	EncDecCTCModel	NGC Model Card
stt_en_citrinet_256	EncDecCTCModelBPE	NGC Model Card
stt_en_citrinet_512	EncDecCTCModelBPE	NGC Model Card
stt_en_citrinet_1024	EncDecCTCModelBPE	NGC Model Card
stt_en_citrinet_256_gamma_0_25	EncDecCTCModelBPE	NGC Model Card
stt_en_citrinet_512_gamma_0_25	EncDecCTCModelBPE	NGC Model Card
stt_en_citrinet_1024_gamma_0_25	EncDecCTCModelBPE	NGC Model Card
stt_en_contextnet_1024_mls	EncDecRNNTBPEModel	NGC Model Card
stt_en_contextnet_512_mls	EncDecRNNTBPEModel	NGC Model Card
stt_en_contextnet_256_mls	EncDecRNNTBPEModel	NGC Model Card
stt_en_contextnet_1024	EncDecRNNTBPEModel	NGC Model Card
stt_en_contextnet_512	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_ctc_small	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_ctc_medium	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_ctc_small_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_ctc_medium_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_ctc_large_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_conformer_transducer_small	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_transducer_medium	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_transducer_large_ls	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_transducer_xlarge	EncDecRNNTBPEModel	NGC Model Card
stt_en_conformer_transducer_xxlarge	EncDecRNNTBPEModel	NGC Model Card
stt_en_squeezeformer_ctc_xsmall_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_squeezeformer_ctc_small_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_squeezeformer_ctc_small_medium_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_squeezeformer_ctc_medium_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_squeezeformer_ctc_medium_large_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_squeezeformer_ctc_large_ls	EncDecCTCModelBPE	NGC Model Card
stt_en_fastconformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_en_fastconformer_ctc_xlarge	EncDecCTCModelBPE	NGC Model Card
stt_en_fastconformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card
stt_en_fastconformer_transducer_xlarge	EncDecRNNTBPEModel	NGC Model Card
stt_en_fastconformer_transducer_xxlarge	EncDecRNNTBPEModel	NGC Model Card
stt_en_fastconformer_hybrid_large_pc	EncDecHybridRNNTCTCBPEModel	NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_480s	EncDecHybridRNNTCTCBPEModel	NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_1040s	EncDecHybridRNNTCTCBPEModel	NGC Model Card
stt_en_fastconformer_hybrid_large_streaming_multi	EncDecHybridRNNTCTCBPEModel	NGC Model Card

French

Model Name	Model Base Class	Model Card
stt_fr_quartznet15x5	EncDecCTCModel	NGC Model Card
stt_fr_citrinet_1024_gamma_0_25	EncDecCTCModelBPE	NGC Model Card
stt_fr_conformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_fr_contextnet_1024	EncDecRNNTBPEModel	NGC Model Card
stt_fr_conformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card

Catalan

Model Name	Model Base Class	Model Card
stt_ca_quartznet15x5	EncDecCTCModel	NGC Model Card
stt_ca_conformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_ca_conformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card

Italian

Model Name	Model Base Class	Model Card
stt_it_quartznet15x5	EncDecCTCModel	NGC Model Card

Spanish

Model Name	Model Base Class	Model Card
stt_es_quartznet15x5	EncDecCTCModel	NGC Model Card
stt_es_citrinet_512	EncDecCTCModelBPE	NGC Model Card

German

Model Name	Model Base Class	Model Card
stt_de_quartznet15x5	EncDecCTCModel	NGC Model Card
stt_de_citrinet_1024	EncDecCTCModelBPE	NGC Model Card
stt_de_conformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_de_contextnet_1024	EncDecRNNTBPEModel	NGC Model Card
stt_de_conformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card

Polish

Model Name	Model Base Class	Model Card
stt_pl_quartznet15x5	EncDecCTCModel	NGC Model Card

Russian

Model Name	Model Base Class	Model Card
stt_ru_quartznet15x5	EncDecCTCModel	NGC Model Card

Mandarin

Model Name	Model Base Class	Model Card
stt_zh_citrinet_512	EncDecCTCModel	NGC Model Card
stt_zh_citrinet_1024_gamma_0_25	EncDecCTCModel	NGC Model Card
stt_zh_conformer_transducer_large	EncDecRNNTModel	NGC Model Card

Kinyarwanda

Model Name	Model Base Class	Model Card
stt_rw_conformer_ctc_large	EncDecCTCModelBPE	NGC Model Card
stt_rw_conformer_transducer_large	EncDecRNNTBPEModel	NGC Model Card

Language Modeling for ASR

Language models can help increase the accuracy of ASR models by incorporating language knowledge into their predictions. NeMo supports N-gram LMs fused with beam search decoding, as well as neural rescoring. Visit NeMo Language Modeling for ASR for more information.
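
The idea of combining acoustic and language model scores can be sketched as a simple n-best rescoring step. This is an assumption-laden illustration rather than NeMo's implementation: the weights alpha and beta, the helper names, and the hypothesis scores are all made up for the example.

```python
def fused_score(acoustic_logprob, lm_logprob, num_words, alpha, beta):
    """Acoustic score plus weighted LM score plus a word-count bonus."""
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def pick_best(hypotheses, alpha=0.5, beta=0.0):
    """hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples."""
    best = max(
        hypotheses,
        key=lambda h: fused_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )
    return best[0]


# Illustrative scores only: the second hypothesis is slightly better
# acoustically but far less likely under the language model.
nbest = [
    ("recognize speech", -4.0, -3.0),
    ("wreck a nice beach", -3.8, -9.0),
]
with_lm = pick_best(nbest, alpha=0.5)
without_lm = pick_best(nbest, alpha=0.0)
```

Setting alpha to zero ignores the language model, so the acoustically preferred hypothesis wins; a positive alpha lets the language model overturn that choice.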

Model Name	Model Base Class	Model Card
asrlm_en_transformer_large_ls	TransformerLMModel	NGC Model Card

Speech Classification (SC)

Speech Classification (SC) refers to a set of tasks in which a program automatically classifies an input utterance or audio segment into categories. Examples include Speech Command Recognition (multi-class), Voice Activity Detection (binary or multi-class), and Audio Sentiment Classification (typically multi-class). NeMo provides MatchboxNet and MarbleNet models for speech classification tasks. Visit NeMo Speech Classification for more information.
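
For the binary Voice Activity Detection case, a common post-processing step is to turn per-frame speech probabilities into speech segments. The sketch below assumes a fixed decision threshold and returns half-open frame-index ranges; the function name, the threshold value, and the toy probabilities are illustrative assumptions, not output from a MarbleNet model.

```python
def frames_to_segments(speech_probs, threshold=0.5):
    """Return (start_frame, end_frame) pairs, end exclusive, for speech runs."""
    segments = []
    start = None
    for i, prob in enumerate(speech_probs):
        if prob >= threshold and start is None:
            start = i  # a speech run begins
        elif prob < threshold and start is not None:
            segments.append((start, i))  # the run ended at frame i
            start = None
    if start is not None:
        # Audio ended while speech was still active.
        segments.append((start, len(speech_probs)))
    return segments


probs = [0.1, 0.9, 0.8, 0.2, 0.7]
segments = frames_to_segments(probs)
```

Multiplying the frame indices by the frame shift of the model in use converts these ranges into time stamps.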

Model Name	Model Base Class	Model Card
vad_marblenet	EncDecClassificationModel	NGC Model Card
vad_telephony_marblenet	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v1	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v1	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v2	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v2	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x1x64_v2_subset_task	EncDecClassificationModel	NGC Model Card
commandrecognition_en_matchboxnet3x2x64_v2_subset_task	EncDecClassificationModel	NGC Model Card

Speaker Recognition (SR)

Speaker Recognition (SR) is a broad research area that addresses two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). We focus on far-field, text-independent speaker recognition, where the identity of the speaker is determined by how the speech is spoken, not necessarily by what is being said. Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings. Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis. Visit NeMo Speaker Recognition for more information.
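
Because utterances are mapped to fixed-length embeddings, speaker verification can be sketched as a similarity test between two vectors. The example below uses cosine similarity with an arbitrary threshold; the threshold value, the helper names, and the three-dimensional toy embeddings are assumptions for illustration (real embeddings have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the claim when the embeddings are similar enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold


enrolled = [1.0, 0.0, 1.0]        # embedding of the claimed speaker
close_attempt = [0.9, 0.1, 1.1]   # points in nearly the same direction
far_attempt = [-1.0, 1.0, 0.0]    # points in a different direction

accepted = same_speaker(enrolled, close_attempt)
rejected = not same_speaker(enrolled, far_attempt)
```

In practice the threshold would be tuned on held-out trials rather than fixed by hand.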

Model Name	Model Base Class	Model Card
speakerverification_speakernet	EncDecSpeakerLabelModel	NGC Model Card
ecapa_tdnn	EncDecSpeakerLabelModel	NGC Model Card
titanet_large	EncDecSpeakerLabelModel	NGC Model Card

Speaker Diarization (SD)

Speaker Diarization (SD) is the task of segmenting audio recordings by speaker labels, that is, answering the question "who speaks when?" A diarization system consists of a Voice Activity Detection (VAD) model, which finds the time stamps where speech is present while ignoring background noise, and a speaker embedding model, which extracts speaker embeddings from the speech segments identified by the VAD time stamps. These speaker embeddings are then grouped into clusters based on the number of speakers present in the audio recording. Visit NeMo Speaker Diarization for more information.

Speaker Diarization is a composite task that uses multiple independent models together. For further details about the pretrained checkpoints used for this task, please visit the checkpoints page for NeMo Speaker Diarization.
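
The clustering stage of the pipeline described above can be sketched as follows. Note that this sketch swaps in a simple greedy, threshold-based grouping instead of clustering to a known number of speakers, and it is not the method NeMo uses; the segment time stamps, two-dimensional embeddings, threshold, and helper names are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cluster_segments(embeddings, threshold=0.8):
    """Greedy grouping: join the first cluster whose founder is similar."""
    founders = []  # first embedding of each cluster
    labels = []
    for emb in embeddings:
        for label, founder in enumerate(founders):
            if cosine(emb, founder) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster is close enough: start a new speaker.
            founders.append(emb)
            labels.append(len(founders) - 1)
    return labels


# Speech segments (seconds) as they might come from VAD, with toy embeddings.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

labels = cluster_segments(embeddings)
diarization = [
    (start, end, f"speaker_{label}")
    for (start, end), label in zip(segments, labels)
]
```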