The NVIDIA NeMo toolkit supports multiple Automatic Speech Recognition (ASR) models such as Jasper and QuartzNet. Pretrained checkpoints for these models, trained on standard datasets, can be used immediately via the speech_to_text.py script in the examples directory. Models for ASR sub-tasks such as speech classification are also provided; for example, MatchboxNet trained on the Google Speech Commands dataset is available through speech_to_label.py. The same script also supports Voice Activity Detection, simply by changing the config file passed to it. NeMo additionally supports training speech recognition models with Byte-Pair/Word-Piece encoding of the corpus via the speech_to_text_bpe.py example; these models are still under development. To evaluate these models on a dataset, use the speech_to_text_infer.py example, which shows how to compute WER over the dataset.
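For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. Below is a minimal, self-contained sketch of this metric; NeMo's own implementation in speech_to_text_infer.py may differ in details such as text normalization and batching.

```python
# Word Error Rate (WER) via word-level Levenshtein distance.
# WER = (substitutions + insertions + deletions) / number of reference words

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word out of a three-word reference gives a WER of 1/3; note that WER can exceed 1.0 when the hypothesis contains many insertions.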
You can instantiate all of these models directly from NGC. To do so, start your script with:
import nemo
import nemo.collections.asr as nemo_asr
Then choose the type of model you would like to instantiate. See the table below for the list of model base classes, and use the base_class.from_pretrained(...) method.
For example:
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
Note that you can also list all available models through the API by calling the base_class.list_available_models(...) method.
You can also download a model's ".nemo" file from the "File Browser" tab and then instantiate the model with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure your NeMo version matches the model's version.
Here is a list of currently available models together with their base classes and short descriptions.
Model name | Model Base Class | Description |
---|---|---|
QuartzNet15x5Base-En | EncDecCTCModel | QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean and 10.05% on dev-other. |
QuartzNet15x5Base-Zh | EncDecCTCModel | QuartzNet15x5 model trained on the AISHELL-2 Mandarin Chinese dataset. |
QuartzNet5x5LS-En | EncDecCTCModel | QuartzNet5x5 model trained on the LibriSpeech dataset only. The model achieves a WER of 5.37% on LibriSpeech dev-clean and 15.69% on dev-other. |
QuartzNet15x5NR-En | EncDecCTCModel | QuartzNet15x5 model trained for robustness to noise. The base model, QuartzNet15x5Base-En, was fine-tuned with RIR and noise augmentation; prefer this model for transcribing noisy speech. It achieves a WER of 3.96% on LibriSpeech dev-clean and 10.14% on dev-other. |
Jasper10x5Dr-En | EncDecCTCModel | Jasper10x5Dr model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1. The model achieves a WER of 3.37% on LibriSpeech dev-clean and 9.81% on dev-other. |
ContextNet-192-WPE-1024-8x-Stride | EncDecCTCModelBPE | Initial implementation of the ContextNet model, trained on the LibriSpeech corpus. It achieves a WER of 10.09% on test-other and 10.11% on dev-other. |
MatchboxNet-3x1x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.32% accuracy on the test set. |
MatchboxNet-3x2x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.68% accuracy on the test set. |
MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 35 classes); obtains 97.12% accuracy on the test set. |
MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 30 classes); obtains 97.29% accuracy on the test set. |
MatchboxNet-3x1x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.2% accuracy on the test set. |
MatchboxNet-3x2x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.4% accuracy on the test set. |
MatchboxNet-VAD-3x2 | EncDecClassificationModel | Voice Activity Detection MatchboxNet model trained on Google Speech Commands (v2) and Freesound background data. It obtains 0.992 accuracy on a test set from the same sources, and 0.852 TPR at FPR=0.315 on the test set (ALL) of AVA movie data. |
SpeakerNet_recognition | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker recognition with cross-entropy loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev datasets, augmented with MUSAN music and noise. The model achieves 2.65% EER on the VoxCeleb-O cleaned trial file. |
SpeakerNet_verification | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker verification with ArcFace angular softmax loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev datasets, augmented with MUSAN music and noise. The model achieves 2.12% EER on the VoxCeleb-O cleaned trial file. |
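The speaker models above report Equal Error Rate (EER): the operating point at which the false-acceptance rate (FAR) equals the false-rejection rate (FRR). As a rough illustration of what that number means, here is a minimal brute-force sketch over hypothetical similarity scores; NeMo computes EER internally from trial files, and production code would typically interpolate on a full ROC curve instead.

```python
# Equal Error Rate (EER) sketch: sweep a decision threshold over the scores
# and return the error rate where FAR and FRR are closest to equal.
# Assumes labels contain at least one positive (same-speaker) and one
# negative (different-speaker) trial.

def equal_error_rate(scores, labels):
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(scores):
        # Decision rule: accept the trial when score >= threshold.
        fa = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        fr = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
        far, frr = fa / negatives, fr / positives
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Perfectly separated score distributions give an EER of 0.0; a 2.65% EER means that at the balanced threshold, about 2.65% of same-speaker trials are rejected and 2.65% of different-speaker trials are accepted.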