This collection contains Self-Supervised Learning (SSL) checkpoints for the large version of the Conformer model (around 120M parameters). The models are trained on unlabeled English audio with a contrastive loss, similar to w2v-Conformer [3,4], and can be fine-tuned for Automatic Speech Recognition (ASR).
For details about the Conformer architecture, refer to the NeMo Conformer-CTC documentation.
All the models in this collection are trained on the LibriLight corpus (~56k hours of unlabeled English speech).
The pre-trained checkpoint is available in the NeMo toolkit and has to be fine-tuned on a labeled dataset for ASR. It can be loaded as follows:
```python
import nemo.collections.asr as nemo_asr

ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel.from_pretrained(
    model_name='ssl_en_conformer_large'
)
```
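If the checkpoint has already been downloaded as a local `.nemo` file, it can alternatively be loaded with `restore_from` (the file path below is a hypothetical example):

```python
# Alternative: load from a locally downloaded .nemo file (hypothetical path).
ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel.restore_from(
    restore_path='ssl_en_conformer_large.nemo'
)
```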
Briefly, the pre-trained checkpoint can be loaded into the fine-tuning model as shown below:
```python
# Define the fine-tuning model.
asr_model = nemo_asr.models.EncDecRNNTBPEModel(cfg=cfg.model, trainer=trainer)

# Load the SSL checkpoint weights.
asr_model.load_state_dict(ssl_model.state_dict(), strict=False)
del ssl_model
```
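Because `load_state_dict` is called with `strict=False`, only the weights whose names match between the two models (primarily the Conformer encoder) are copied; the decoder and joint network of the transducer remain randomly initialized and are trained during fine-tuning. The `cfg` and `trainer` objects above are assumed to come from a standard NeMo fine-tuning setup and are not defined by this snippet. A minimal sketch, assuming a hypothetical local Conformer-Transducer BPE config YAML and PyTorch Lightning:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf

# Hypothetical path to a Conformer-Transducer BPE config YAML (e.g. copied from
# the NeMo examples); adjust the dataset and tokenizer fields for your data.
cfg = OmegaConf.load('conformer_transducer_bpe.yaml')

# Basic PyTorch Lightning trainer; tune devices and epochs for your hardware.
trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=100)
```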
The available models in this collection are listed in the following table. Performance of the ASR models fine-tuned from these checkpoints is reported in terms of Word Error Rate (WER%) with greedy decoding on the LibriSpeech (LS) dev and test sets.
| Version | SSL Loss | Fine-tune Dataset | Fine-tune Model | Vocabulary Size | LS dev-clean | LS dev-other | LS test-clean | LS test-other |
|---|---|---|---|---|---|---|---|---|
Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.