This model can be used for Spoken Language Identification (LangID / LID) and serves as the first step for Automatic Speech Recognition (ASR).
The model is based on the AmberNet architecture; the accompanying AmberNet paper will be published soon.
The model was trained for 40 epochs on multiple GPUs using the NeMo toolkit, on the publicly available VoxLingua107 dataset [1].
It achieves a 5.22% error rate on the official evaluation set, which contains 1,609 verified utterances covering 33 languages.
The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.asr as nemo_asr
# Load the pre-trained language-ID checkpoint from NGC
langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="langid_ambernet")
This model accepts 16,000 Hz (16 kHz) mono-channel audio (WAV files) as input.
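Before running inference, it can help to verify that input files match this expected format. The helper below is a minimal sketch using Python's standard wave module; the function name is illustrative and not part of NeMo:

```python
import wave

def is_16khz_mono(path: str) -> bool:
    """Return True if the WAV file is 16,000 Hz and single-channel,
    matching the input format the model expects."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == 16000 and wf.getnchannels() == 1
```

Files that fail this check would need to be resampled and/or downmixed (for example with ffmpeg or librosa) before being passed to the model.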
This model provides spoken language identification of the given utterance.
Since this model was trained on publicly available datasets, its performance might degrade on custom data it has not seen. The model was trained on 107 languages; for unseen languages, it must be fine-tuned.
[1] Valk, Jörgen, and Tanel Alumäe. "VoxLingua107: a dataset for spoken language recognition." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.