The NVIDIA NeMo toolkit supports various Automatic Speech Recognition (ASR) models such as Jasper, QuartzNet, Citrinet and Conformer-CTC. It also supports several related subtasks: speech classification, speaker recognition and speaker diarization. For further information regarding NeMo's speech recognition capabilities, visit the NeMo ASR documentation page.
Trained or fine-tuned NeMo models (with the file extension .nemo) can be converted to Riva models (with the file extension .riva) and then deployed. For more details, see the Riva documentation on Model Development with NeMo.
You can instantiate many pretrained models directly from NGC. To do so, start your script with:
import nemo
import nemo.collections.asr as nemo_asr
Then choose the type of model you would like to instantiate. See the tables below for the list of models that are available for each task. For example:
# simply use ASRModel to instantiate any ASR pretrained model
quartznet = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")
# or equivalently, use the exact class
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
Note that you can also list all available models through the API by calling the .list_available_models(...) method.
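As a minimal sketch of how the listing call can be used, the snippet below collects the pretrained model names and keeps only those for one language. It assumes each entry returned by `list_available_models()` exposes a `pretrained_model_name` attribute and that names follow the `stt_<language>_...` pattern seen in the tables on this page; `filter_by_language` and `available_asr_model_names` are hypothetical helper names, not part of NeMo.

```python
def filter_by_language(model_names, lang_code):
    """Keep names of the form stt_<lang_code>_..., the pattern used in the tables on this page."""
    prefix = f"stt_{lang_code}_"
    return sorted(name for name in model_names if name.startswith(prefix))


def available_asr_model_names():
    """Ask NeMo for the names of pretrained ASR models (requires NeMo with the ASR collection)."""
    # Imported lazily so that filter_by_language() can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Assumption: each returned entry carries a `pretrained_model_name` attribute.
    infos = nemo_asr.models.ASRModel.list_available_models()
    return [info.pretrained_model_name for info in infos]
```

With NeMo installed, `filter_by_language(available_asr_model_names(), "fr")` would be expected to return the French checkpoints known to the installed NeMo version.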
You can also download a model's ".nemo" file from the "File Browser" tab and then instantiate the model with the restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure your NeMo version matches the version of the model.
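The two loading routes can be wrapped in one helper. This is a sketch under stated assumptions: `load_asr_model` and `is_local_checkpoint` are hypothetical names, and the choice between `restore_from` and `from_pretrained` is made purely from the `.nemo` file extension.

```python
from pathlib import Path


def is_local_checkpoint(source):
    """Return True when `source` looks like a path to a downloaded .nemo file."""
    return Path(str(source)).suffix == ".nemo"


def load_asr_model(source):
    """Load from a local .nemo file, or from NGC by pretrained model name."""
    # Imported lazily so that is_local_checkpoint() can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    if is_local_checkpoint(source):
        # Local file: the installed NeMo version should match the model's version.
        return nemo_asr.models.ASRModel.restore_from(restore_path=str(source))
    # Otherwise treat the argument as a pretrained model name to fetch from NGC.
    return nemo_asr.models.ASRModel.from_pretrained(model_name=str(source))
```

For example, `load_asr_model("stt_en_citrinet_256")` would download the checkpoint by name, while a path ending in `.nemo` would be restored from disk.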
NeMo's ASR collection supports a variety of sub-domains related to speech and speaker recognition. Here, we list each task along with the pretrained models that are available for that task. Multiple example notebooks are available under the examples/asr/ directory of NeMo, as well as several tutorial notebooks under tutorials/asr/ at NVIDIA NeMo.
Automatic speech recognition (ASR) is the task of transcribing a given audio segment into text that can be read. NeMo supports a large collection of models such as Jasper, QuartzNet, Citrinet and Conformer-CTC in order to perform automatic speech recognition. Visit NeMo Automatic Speech Recognition for more information.
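The following sketch shows one way to transcribe audio files with a pretrained model. It assumes the model's `transcribe()` method accepts a list of audio file paths, and that its return value may differ between NeMo releases and model types (plain strings, hypothesis objects with a `.text` attribute, or a tuple whose first element holds the best hypotheses). `extract_texts` and `transcribe_files` are hypothetical helpers written for this illustration.

```python
def extract_texts(result):
    """Normalize a transcribe() return value to a list of plain strings."""
    # Assumption: some transducer models return a (best, all) tuple of hypothesis lists.
    if isinstance(result, tuple):
        result = result[0]
    # Items may be plain strings or hypothesis objects exposing a `.text` attribute.
    return [item if isinstance(item, str) else item.text for item in result]


def transcribe_files(audio_paths, model_name="stt_en_conformer_ctc_small"):
    """Transcribe a list of audio file paths with a pretrained NeMo ASR model."""
    # Imported lazily so that extract_texts() can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    return extract_texts(model.transcribe(audio_paths))
```

Any model name from the tables on this page can be passed as `model_name`; the audio paths are whatever recordings you want to transcribe.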
Model Name | Model Base Class | Model Card |
---|---|---|
stt_en_jasper10x5dr | EncDecCTCModel | NGC Model Card |
stt_en_citrinet_256 | EncDecCTCModelBPE | NGC Model Card |
stt_en_citrinet_512 | EncDecCTCModelBPE | NGC Model Card |
stt_en_citrinet_1024 | EncDecCTCModelBPE | NGC Model Card |
stt_en_citrinet_256_gamma_0_25 | EncDecCTCModelBPE | NGC Model Card |
stt_en_citrinet_512_gamma_0_25 | EncDecCTCModelBPE | NGC Model Card |
stt_en_citrinet_1024_gamma_0_25 | EncDecCTCModelBPE | NGC Model Card |
stt_en_contextnet_1024_mls | EncDecRNNTBPEModel | NGC Model Card |
stt_en_contextnet_512_mls | EncDecRNNTBPEModel | NGC Model Card |
stt_en_contextnet_256_mls | EncDecRNNTBPEModel | NGC Model Card |
stt_en_contextnet_1024 | EncDecRNNTBPEModel | NGC Model Card |
stt_en_contextnet_512 | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_ctc_small | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_ctc_medium | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_ctc_small_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_ctc_medium_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_ctc_large_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_conformer_transducer_small | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_transducer_medium | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_transducer_large_ls | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_transducer_xlarge | EncDecRNNTBPEModel | NGC Model Card |
stt_en_conformer_transducer_xxlarge | EncDecRNNTBPEModel | NGC Model Card |
stt_en_squeezeformer_ctc_xsmall_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_squeezeformer_ctc_small_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_squeezeformer_ctc_small_medium_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_squeezeformer_ctc_medium_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_squeezeformer_ctc_medium_large_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_squeezeformer_ctc_large_ls | EncDecCTCModelBPE | NGC Model Card |
stt_en_fastconformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_en_fastconformer_ctc_xlarge | EncDecCTCModelBPE | NGC Model Card |
stt_en_fastconformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
stt_en_fastconformer_transducer_xlarge | EncDecRNNTBPEModel | NGC Model Card |
stt_en_fastconformer_transducer_xxlarge | EncDecRNNTBPEModel | NGC Model Card |
stt_en_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | NGC Model Card |
stt_en_fastconformer_hybrid_large_streaming_480s | EncDecHybridRNNTCTCBPEModel | NGC Model Card |
stt_en_fastconformer_hybrid_large_streaming_1040s | EncDecHybridRNNTCTCBPEModel | NGC Model Card |
stt_en_fastconformer_hybrid_large_streaming_multi | EncDecHybridRNNTCTCBPEModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_fr_quartznet15x5 | EncDecCTCModel | NGC Model Card |
stt_fr_citrinet_1024_gamma_0_25 | EncDecCTCModelBPE | NGC Model Card |
stt_fr_conformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_fr_contextnet_1024 | EncDecRNNTBPEModel | NGC Model Card |
stt_fr_conformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_ca_quartznet15x5 | EncDecCTCModel | NGC Model Card |
stt_ca_conformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_ca_conformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_it_quartznet15x5 | EncDecCTCModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_es_quartznet15x5 | EncDecCTCModel | NGC Model Card |
stt_es_citrinet_512 | EncDecCTCModelBPE | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_de_quartznet15x5 | EncDecCTCModel | NGC Model Card |
stt_de_citrinet_1024 | EncDecCTCModelBPE | NGC Model Card |
stt_de_conformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_de_contextnet_1024 | EncDecRNNTBPEModel | NGC Model Card |
stt_de_conformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_pl_quartznet15x5 | EncDecCTCModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_ru_quartznet15x5 | EncDecCTCModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_zh_citrinet_512 | EncDecCTCModel | NGC Model Card |
stt_zh_citrinet_1024_gamma_0_25 | EncDecCTCModel | NGC Model Card |
stt_zh_conformer_transducer_large | EncDecRNNTModel | NGC Model Card |
Model Name | Model Base Class | Model Card |
---|---|---|
stt_rw_conformer_ctc_large | EncDecCTCModelBPE | NGC Model Card |
stt_rw_conformer_transducer_large | EncDecRNNTBPEModel | NGC Model Card |
Language models can help increase the accuracy of ASR models by incorporating language knowledge into their predictions. NeMo supports N-gram LMs fused with beam search decoding, as well as a neural rescorer. Visit NeMo Language Modeling for ASR for more information.
Model Name | Model Base Class | Model Card |
---|---|---|
asrlm_en_transformer_large_ls | TransformerLMModel | NGC Model Card |
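To illustrate how a language model can change the ranking of ASR candidates, here is a small, self-contained rescoring sketch. It is not NeMo's implementation: the scoring rule (acoustic score plus a weighted LM score plus a word-count bonus), the weights `alpha` and `beta`, and the candidate scores are all assumptions chosen for the example.

```python
def rescore(candidates, alpha=0.5, beta=0.0):
    """Re-rank n-best candidates given as (text, acoustic_log_prob, lm_log_prob) tuples.

    alpha weights the language model score; beta rewards longer hypotheses.
    Both are hypothetical tuning knobs for this sketch.
    """
    def combined_score(candidate):
        text, acoustic_score, lm_score = candidate
        return acoustic_score + alpha * lm_score + beta * len(text.split())

    # Highest combined score first.
    return sorted(candidates, key=combined_score, reverse=True)


# Made-up scores: the acoustic model slightly prefers the first candidate,
# while the language model strongly prefers the second.
n_best = [("i scream", -4.0, -9.0), ("ice cream", -4.5, -3.0)]
best_text = rescore(n_best, alpha=0.5)[0][0]
```

With `alpha=0.0` the ranking falls back to the acoustic scores alone, which makes the effect of the language model easy to compare.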
Speech Classification (SC) refers to a set of tasks in which a program automatically classifies an input utterance or audio segment into categories. Examples include Speech Command Recognition (multi-class), Voice Activity Detection (binary or multi-class) and Audio Sentiment Classification (typically multi-class). NeMo provides MatchboxNet and MarbleNet models for speech classification tasks. Visit NeMo Speech Classification for more information.
Model Name | Model Base Class | Model Card |
---|---|---|
vad_marblenet | EncDecClassificationModel | NGC Model Card |
vad_telephony_marblenet | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x1x64_v1 | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x2x64_v1 | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x1x64_v2 | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x2x64_v2 | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x1x64_v2_subset_task | EncDecClassificationModel | NGC Model Card |
commandrecognition_en_matchboxnet3x2x64_v2_subset_task | EncDecClassificationModel | NGC Model Card |
Speaker Recognition (SR) is a broad research area that addresses two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). We focus on far-field, text-independent speaker recognition, where the identity of the speaker is based on how the speech is spoken, not necessarily on what is being said. Typically, such SR systems operate on unconstrained speech utterances, which are converted into fixed-length vectors called speaker embeddings. Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis. Visit NeMo Speaker Recognition for more information.
Model Name | Model Base Class | Model Card |
---|---|---|
speakerverification_speakernet | EncDecSpeakerLabelModel | NGC Model Card |
ecapa_tdnn | EncDecSpeakerLabelModel | NGC Model Card |
titanet_large | EncDecSpeakerLabelModel | NGC Model Card |
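Because speaker embeddings are fixed-length vectors, a verification decision can be sketched as a similarity comparison between two of them. The example below uses cosine similarity with a threshold; the scoring choice, the `threshold` value and the toy two-dimensional vectors are assumptions for illustration, not values taken from NeMo or from any of the models listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def same_speaker(embedding_a, embedding_b, threshold=0.7):
    """Accept the claim when the embeddings are similar enough (hypothetical threshold)."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold
```

In practice the threshold would be tuned on held-out trials for the chosen embedding model.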
Speaker Diarization (SD) is the task of segmenting audio recordings by speaker labels, that is, answering the question "Who speaks when?". A diarization system consists of a Voice Activity Detection (VAD) model, which produces time stamps for the regions of audio that contain speech while ignoring background noise, and a speaker embedding model, which extracts speaker embeddings from the speech segments given by the VAD time stamps. These speaker embeddings are then grouped into clusters based on the number of speakers present in the audio recording. Visit NeMo Speaker Diarization for more information.
Speaker Diarization uses a composite model, which combines multiple independent models to perform diarization. For further details about the pretrained checkpoints used for this task, please visit the checkpoints page for NeMo Speaker Diarization.
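The clustering step of a diarization pipeline can be sketched without any model at all. The snippet below takes speech segments (as a VAD stage might produce) with one embedding per segment and groups them with a simple average-linkage agglomerative clustering over cosine distance. This is a stand-in chosen for brevity, not the clustering method NeMo uses; the function names, segment times and two-dimensional embeddings are made up for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, num_speakers):
    """Merge the two closest clusters until only num_speakers clusters remain."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > num_speakers:
        best = None  # (average distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distances = [cosine_distance(embeddings[p], embeddings[q])
                             for p in clusters[i] for q in clusters[j]]
                average = sum(distances) / len(distances)
                if best is None or average < best[0]:
                    best = (average, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Convert cluster membership into one label per embedding.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


def diarize(segments, embeddings, num_speakers):
    """Attach a speaker label to each (start, end) speech segment."""
    labels = cluster_embeddings(embeddings, num_speakers)
    return [(start, end, f"speaker_{label}")
            for (start, end), label in zip(segments, labels)]


# Toy input: four speech segments and their (made-up) embeddings.
segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.2), (4.2, 6.0)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
timeline = diarize(segments, embeddings, num_speakers=2)
```

Here the first and third segments end up with one label and the second and fourth with the other, which is the "who speaks when" output in miniature.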