This collection contains a Conformer-CTC large model (around 120M parameters) for multilingual and code-switched speech recognition of English-Spanish speech. It uses a Google SentencePiece tokenizer with a vocabulary size of 1024 and transcribes text in the lowercase English and Spanish alphabets, along with spaces, apostrophes and a few other characters.
It can transcribe audio samples into English or Spanish or even both English and Spanish used in the same sentence. The language is detected automatically.
The Conformer-CTC model is a non-autoregressive variant of the Conformer model for Automatic Speech Recognition which uses CTC loss/decoding instead of Transducer. You can find more details on this model here: Conformer-CTC Model.
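To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding, assuming the per-frame argmax token ids are already available. The function name, the example ids and the choice of 0 as the blank id are illustrative assumptions and do not reflect NeMo's internal implementation.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int) -> List[int]:
    """Collapse per-frame argmax ids into a token sequence using the CTC rule."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token for token, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol, which only separates repeated tokens.
    return [token for token in collapsed if token != blank_id]


# Hypothetical frame-level predictions; 0 is assumed to be the blank id here.
frames = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
decoded = ctc_greedy_decode(frames, blank_id=0)
print(decoded)
```

Because each frame is decoded independently of previously emitted tokens, this procedure is non-autoregressive. The blank between the two runs of 7 is what keeps them as two separate tokens.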
The NeMo toolkit was used to train the models for several hundred epochs. These models were trained with this example script and this base config. The model was initialized from the weights of the stt_enes_conformer_ctc_large checkpoint.
The tokenizers for these models were built using the text transcripts of the train set with this script. To create the tokenizer, the original text corpus from both English and Spanish sources was used instead of the synthetic code-switched text corpus.
The models in this collection were trained on a synthetic intra-sentential code-switching set which was constructed from the following English and Spanish datasets:
To create the synthetic code-switched set, samples were chosen randomly from the English and Spanish sources, appropriately normalized and then concatenated with natural pauses to produce utterances between 16 and 20 seconds long.
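The construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the pause range of 0.2 to 0.5 seconds, the use of digital silence as a stand-in for natural pauses, and the dummy clip pools are all hypothetical, and the normalization step is omitted.

```python
import random
from typing import List, Tuple

SAMPLE_RATE = 16000  # Hz, matching the model's expected input

Clip = Tuple[List[float], str]  # (audio samples, transcript)


def build_code_switched_sample(
    en_pool: List[Clip],
    es_pool: List[Clip],
    rng: random.Random,
    min_sec: float = 16.0,
    max_sec: float = 20.0,
    pause_range_sec: Tuple[float, float] = (0.2, 0.5),  # assumed pause length
    max_tries: int = 1000,
) -> Tuple[List[float], str]:
    """Concatenate random English/Spanish clips until the target length is reached."""
    min_len = int(min_sec * SAMPLE_RATE)
    max_len = int(max_sec * SAMPLE_RATE)
    audio: List[float] = []
    transcripts: List[str] = []
    tries = 0
    while len(audio) < min_len:
        if tries >= max_tries:
            raise RuntimeError("could not reach the target duration")
        tries += 1
        # Pick a language pool at random, then a clip from that pool.
        samples, transcript = rng.choice(rng.choice([en_pool, es_pool]))
        # Insert a pause before every clip except the first one.
        pause_len = int(rng.uniform(*pause_range_sec) * SAMPLE_RATE) if audio else 0
        if len(audio) + pause_len + len(samples) > max_len:
            continue  # this clip would overshoot the upper bound, draw again
        audio.extend([0.0] * pause_len)  # silence stands in for a natural pause
        audio.extend(samples)
        transcripts.append(transcript)
    return audio, " ".join(transcripts)


# Dummy pools: constant-valued "audio" of a few seconds each, for illustration.
rng = random.Random(0)
en_pool = [([0.1] * (SAMPLE_RATE * sec), f"english clip {sec}") for sec in (2, 3, 4)]
es_pool = [([0.1] * (SAMPLE_RATE * sec), f"clip en español {sec}") for sec in (2, 3, 5)]
audio, text = build_code_switched_sample(en_pool, es_pool, rng)
duration_sec = len(audio) / SAMPLE_RATE
print(f"{duration_sec:.2f} s: {text}")
```

Rejecting clips that would overshoot the upper bound, rather than truncating them, keeps every transcript aligned with complete audio.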
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER).
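For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, (substitutions + deletions + insertions), divided by the number of reference words. The following is a minimal sketch of that computation; the function name and the example sentences are illustrative, and this is not the evaluation code used to produce the scores reported here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched example: one substitution and one deletion.
wer = word_error_rate("i went to la tienda yesterday", "i went to the tienda")
print(f"WER = {wer:.2%}")
```

Here the reference has six words and the hypothesis needs two edits, so the WER is 2/6. Note that WER can exceed 100% when the hypothesis contains many insertions.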
The model obtains the following greedy WER scores on the following evaluation datasets:
5.52% on the synthetic en-es code-switched set (en-es)
2.22% on Librispeech Dev-Clean (en)
2.55% on Librispeech Test-Clean (en)
5.36% on Librispeech Dev-Other (en)
5.38% on Librispeech Test-Other (en)
5.00% on MCV Dev v7.0 (es)
3.46% on MLS Dev (es)
5.58% on Voxpopuli Dev (es)
16.51% on Fisher Dev (es)
5.51% on MCV Test v7.0 (es)
3.73% on MLS Test (es)
6.63% on Voxpopuli Test (es)
16.31% on Fisher Test (es)
The model was not trained on the above datasets.
The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_enes_conformer_ctc_large_codesw")
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="stt_enes_conformer_ctc_large_codesw" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
This model accepts 16000 Hz (16 kHz) mono-channel audio (wav files) as input.
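Before transcription, it can be useful to check that a file matches this input format. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM wav files; the helper names and the temporary demo files are hypothetical and not part of NeMo.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # Hz
EXPECTED_CHANNELS = 1  # mono


def is_16khz_mono(path: str) -> bool:
    """Return True if the PCM wav file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def write_silence(path: str, rate: int, channels: int, seconds: float = 1.0) -> None:
    """Write a silent 16-bit PCM wav file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)


# Demo with two temporary files: one in the expected format, one not.
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.wav")
bad_path = os.path.join(tmp_dir, "bad.wav")
write_silence(good_path, rate=16000, channels=1)
write_silence(bad_path, rate=44100, channels=2)
good_ok = is_16khz_mono(good_path)
bad_ok = is_16khz_mono(bad_path)
print(good_ok, bad_ok)
```

Files that fail the check would need to be resampled and downmixed with an audio tool of your choice before being passed to the model.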
This model provides transcribed speech as a string for a given audio sample. The output string may contain English or Spanish characters, depending on the languages used in the audio sample.
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech. Further, because the model was trained on a synthetic code-switched set, its performance might degrade on some out-of-domain code-switching cases.