This collection contains a FastConformer-Transducer large model (around 120M parameters) for multilingual and code-switched speech recognition of English-Mandarin speech. It uses a Google SentencePiece tokenizer with a vocabulary size of 1024 for English and a character-level vocabulary of 5000 characters for Mandarin.
It can transcribe audio samples into English or Mandarin or even both English and Mandarin used in the same sentence. The language is detected automatically.
Conformer-Transducer is a Conformer model that uses the RNNT/Transducer loss and decoder. You may find more information on the details here: Conformer Transducer. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. This model is trained with Transducer loss. You may find more information on the details of FastConformer here: Fast-Conformer Model.
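As a rough illustration of what 8x downsampling means for sequence length, the sketch below estimates how many encoder frames an utterance produces. The 10 ms feature hop is an assumption (it is not stated in this card), the function name is hypothetical, and the exact frame count in a real model depends on padding in the convolutional layers, so treat the result as approximate.

```python
import math


def approx_encoder_frames(duration_s: float,
                          hop_ms: float = 10.0,   # ASSUMPTION: feature hop, not given in this card
                          downsampling: int = 8   # from the text: 8x downsampling
                          ) -> int:
    """Approximate number of encoder output frames for an utterance."""
    # Number of acoustic feature frames before downsampling.
    feature_frames = round(duration_s * 1000.0 / hop_ms)
    # Each encoder frame covers `downsampling` feature frames.
    return math.ceil(feature_frames / downsampling)


if __name__ == "__main__":
    # Under the assumed 10 ms hop, each encoder frame spans 80 ms of audio.
    print("ms per encoder frame:", 10.0 * 8)
    print("frames for a 10 s clip:", approx_encoder_frames(10.0))
```

Fewer encoder frames mean fewer decoder steps per second of audio, which is where most of the speedup over a 4x-downsampled Conformer would be expected to come from.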
The NeMo toolkit was used to train the models for several hundred epochs. These models were trained with this example script and this base config. The SentencePiece tokenizers for these models were built from the text transcripts of the train set with this script.
The model is trained on a composite dataset (NeMo ASRSET) comprising several thousand hours of English speech.
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER) on English (en), Character Error Rate (CER) on Mandarin (zh), and Mix Error Rate (MER) on multilingual/code-switched en-zh data.
2.4% WER on LibriSpeech test clean (en)
5.5% WER on LibriSpeech test other (en)
6.7% CER on AISHELL2 iOS test (zh)
15.0% MER on SEAME dev set (en-zh)
14.7% MER on SEAME Mandarin test (en-zh)
21.7% MER on SEAME Singapore English test (en-zh)
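The three metrics above differ only in how text is split into tokens before computing edit distance. The following is a minimal sketch, not the evaluation code used to produce the numbers above. It assumes MER counts each English word and each Mandarin character as one token, which is a common convention for code-switched data but is not defined in this card. All function names are illustrative.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_tokens(text):
    """ASSUMPTION: one token per CJK character, one token per non-CJK word."""
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text)


def error_rate(ref_tokens, hyp_tokens):
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: every non-space character."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def mer(ref, hyp):
    """Mix error rate under the tokenization assumed above."""
    return error_rate(mixed_tokens(ref), mixed_tokens(hyp))


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))
    print(cer("你好吗", "你好"))
    print(mer("我想要 coffee", "我想要 tea"))
```

Published numbers usually also depend on text normalization (casing, punctuation, number formatting), which this sketch leaves out.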
The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="stt_enzh_fastconformer_transducer_large_codesw")
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="stt_enzh_fastconformer_transducer_large_codesw" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
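Before transcribing, it can help to confirm that a file matches this input format. The sketch below uses only Python's standard-library wave module and assumes uncompressed PCM WAV files. The helper names are illustrative and not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, from the input description above
EXPECTED_CHANNELS = 1         # mono


def wav_format(path: str):
    """Return (sample_rate_hz, num_channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_16khz_mono(path: str) -> bool:
    """True if the file is 16 kHz, single-channel audio."""
    return wav_format(path) == (EXPECTED_SAMPLE_RATE, EXPECTED_CHANNELS)
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.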
This model provides transcribed speech as a string for a given audio sample. The output string may contain English or Mandarin characters, depending on the languages used in the audio sample.
Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on. The model might also perform worse for accented speech.