
STT En Es Multilingual Code-Switched Conformer CTC Large

Description
English + Spanish Multilingual and Code-Switched Speech Recognition Conformer CTC Large Model
Publisher
NVIDIA
Latest Version
1.0.0
Modified
April 4, 2023
Size
465.94 MB

Model Overview

This collection contains a Conformer-CTC large model (around 120M parameters) for multilingual and code-switched speech recognition of English-Spanish speech. It uses a Google SentencePiece [1] tokenizer with a vocabulary size of 1024, and transcribes text in the lower-case English and Spanish alphabets along with spaces, apostrophes, and a few other characters.

It can transcribe audio samples into English, Spanish, or a mix of both languages within the same sentence. The language is detected automatically.
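
A quick way to verify the tokenizer and vocabulary size is to inspect the loaded checkpoint. This is a minimal sketch; the attribute path below follows the usual NeMo SentencePiece tokenizer wrapper and is an assumption, not taken from this card:

import nemo.collections.asr as nemo_asr

# Load the checkpoint from NGC (model name from the "How to Use this Model" section).
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_enes_conformer_ctc_large_codesw")

# The SentencePiece tokenizer should report a vocabulary of 1024 subword units.
print(asr_model.tokenizer.vocab_size)  # expected: 1024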

Model Architecture

The Conformer-CTC model is a non-autoregressive variant of the Conformer model [2] for Automatic Speech Recognition, which uses CTC loss/decoding instead of a Transducer. You may find more information on the details of this model here: Conformer-CTC Model.
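
To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC decoding (take the per-frame argmax, collapse consecutive repeats, then drop blanks); the frame values and blank index are illustrative assumptions:

# Greedy CTC decoding: collapse consecutive repeats, then remove blank tokens.
def ctc_greedy_decode(frame_token_ids, blank_id):
    decoded = []
    previous = None
    for token in frame_token_ids:
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded

# Illustrative per-frame argmax output; blank_id is assumed to be 0 here.
frames = [0, 5, 5, 0, 0, 7, 7, 7, 0, 2]
print(ctc_greedy_decode(frames, blank_id=0))  # -> [5, 7, 2]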

Training

The NeMo toolkit [3] was used to train the models for several hundred epochs. These models were trained with this example script and this base config. The model was initialized from the weights of the stt_enes_conformer_ctc_large checkpoint.
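
For readers who want to fine-tune from this checkpoint, a training invocation roughly follows the pattern below; the script path, config name, and manifest placeholders are assumptions based on the NeMo repository layout, not taken from this card:

python [NEMO_GIT_FOLDER]/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
  --config-path=../conf/conformer --config-name=conformer_ctc_bpe \
  model.train_ds.manifest_filepath=<TRAIN MANIFEST> \
  model.validation_ds.manifest_filepath=<VALIDATION MANIFEST> \
  model.tokenizer.dir=<TOKENIZER DIRECTORY> \
  +init_from_pretrained_model="stt_enes_conformer_ctc_large"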

The tokenizers for these models were built using the text transcripts of the train set with this script. For the creation of the tokenizer, the original text corpora from both the English and Spanish sources were used instead of the synthetic code-switched text corpus.
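
For reference, a SentencePiece tokenizer with these settings can be built along the following lines; the flags (including the spe_type) are assumptions based on the usual interface of NeMo's tokenizer script:

python [NEMO_GIT_FOLDER]/scripts/tokenizers/process_asr_text_tokenizer.py \
  --manifest=<COMMA-SEPARATED TRAIN MANIFESTS (EN + ES)> \
  --data_root=<OUTPUT TOKENIZER DIRECTORY> \
  --tokenizer=spe \
  --spe_type=bpe \
  --vocab_size=1024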

Datasets

The models in this collection were trained on a synthetic intra-sentential code-switching set which was constructed from the following English and Spanish datasets:

English:

  • LibriSpeech: 960 hours of training data [9]

Spanish:

  • Mozilla Common Voice 7.0: 289 hours of training data after data cleaning [4]
  • Multilingual LibriSpeech: 801 hours of training data after data cleaning [5]
  • Voxpopuli transcribed subset: 110 hours of training data after data cleaning [6]
  • Fisher dataset: 140 hours of training data after data cleaning [7,8]

For the creation of the synthetic code-switched set, samples were chosen randomly from the English and Spanish sources, appropriately normalized, and then concatenated with natural pauses to produce utterances between 16 and 20 seconds long.
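
The construction described above can be sketched roughly as follows; the pause length, sample rate, and helper names are illustrative assumptions, since the card does not publish the generation code:

import random
import numpy as np

SAMPLE_RATE = 16000              # matches the model's expected input rate
MIN_LEN, MAX_LEN = 16.0, 20.0    # target utterance length in seconds
PAUSE = np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)  # assumed 0.5 s pause

def make_code_switched_utterance(en_samples, es_samples):
    """Concatenate random English/Spanish (audio, text) clips with pauses until the target length."""
    target_seconds = random.uniform(MIN_LEN, MAX_LEN)
    pieces, transcript = [], []
    while sum(len(p) for p in pieces) / SAMPLE_RATE < target_seconds:
        audio, text = random.choice(en_samples + es_samples)
        pieces += [audio, PAUSE]
        transcript.append(text)
    return np.concatenate(pieces), " ".join(transcript)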

Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER).
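
WER counts word-level substitutions, deletions, and insertions against the number of reference words. As a minimal sketch using the third-party jiwer package (an assumption; any edit-distance implementation works):

import jiwer  # assumed installed, e.g. via `pip install jiwer`

reference = "hola how are you hoy"
hypothesis = "hola how are you"

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 1 deletion / 5 words = 0.2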

The model obtains the following greedy WER scores on these evaluation datasets:

  • 5.52 % on synthetic en-es code-switched set (en-es)

  • 2.22 % on Librispeech Dev-Clean (en)

  • 2.55 % on Librispeech Test-Clean (en)

  • 5.36 % on Librispeech Dev-Other (en)

  • 5.38 % on Librispeech Test-Other (en)

  • 5.00 % on MCV Dev v7.0 (es)

  • 3.46 % on MLS Dev (es)

  • 5.58 % on Voxpopuli Dev (es)

  • 16.51 % on Fisher Dev (es)

  • 5.51 % on MCV Test v7.0 (es)

  • 3.73 % on MLS Test (es)

  • 6.63 % on Voxpopuli Test (es)

  • 16.31 % on Fisher Test (es)

The model was not trained on any of the above evaluation sets.

How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

# Load the pre-trained checkpoint directly from NGC.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_enes_conformer_ctc_large_codesw")
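
Once loaded, the model can also transcribe audio directly from Python, complementing the script shown in the next section; the file name below is a placeholder and the call sketches the NeMo transcription API:

# Transcribe one or more 16 kHz mono WAV files; returns a list of strings.
transcriptions = asr_model.transcribe(["sample_enes.wav"])
print(transcriptions[0])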

Transcribing text with this model

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="stt_enes_conformer_ctc_large_codesw" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Input

This model accepts 16,000 Hz (16 kHz) mono-channel audio (WAV files) as input.
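
Audio at other sample rates or with multiple channels needs to be converted first; here is a minimal sketch using the third-party librosa and soundfile packages (an assumption; tools such as ffmpeg or sox work equally well):

import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write a WAV file the model accepts.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("input_audio_16k.wav", audio, sr)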

Output

This model provides transcribed speech as a string for a given audio sample. The output string may contain English or Spanish characters, depending on the languages used in the audio sample.

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech. Further, because the model was trained on a synthetic code-switched set, its performance might degrade on some out-of-domain code-switching cases.

References

[1] Google Sentencepiece Tokenizer

[2] Conformer: Convolution-augmented Transformer for Speech Recognition

[3] NVIDIA NeMo Toolkit

[4] Mozilla CommonVoice (MCV7.0)

[5] Multilingual LibriSpeech (MLS)

[6] VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

[7] Fisher Spanish - Transcripts

[8] Fisher Spanish Speech

[9] LibriSpeech ASR Corpus

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.