LangID PearlNet

Description: PearlNet LangID model for Spoken Language Identification
Publisher: NVIDIA
Latest Version: 1.18.0
Modified: May 31, 2023
Size: 229.06 MB

Model Overview

Language identification model as detailed in [1]. This model can be used for Spoken Language Identification (LangID / LID) and serves as the first step in an Automatic Speech Recognition (ASR) pipeline.

Model Architecture

This model uses the Conformer architecture [2]. For pretraining, it uses 18 Conformer layers with 8 attention heads and a hidden size of 512. For finetuning, only the bottom 9 layers (those closest to the audio input) are kept from the pretrained model, as sketched below.
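
A minimal sketch of that truncation step, assuming the pretrained encoder exposes its Conformer blocks as an nn.ModuleList attribute named layers (the attribute name is an assumption and will differ by framework):

import torch.nn as nn

def truncate_encoder(encoder: nn.Module, num_layers: int = 9) -> nn.Module:
    # Keep only the layers closest to the audio input; drop the rest.
    encoder.layers = nn.ModuleList(list(encoder.layers)[:num_layers])
    return encoder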

Training

Training occurs over separate pretraining and finetuning stages.

Pretraining

Pretraining solves the contrastive-loss task detailed in [3], using AdamW optimization over 400k updates with a batch size of 2048. Training uses Noam scheduling with a peak learning rate of 1.4e-3, reached after 25k warmup steps.
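
For reference, a minimal sketch of that schedule, written in terms of the peak learning rate rather than the usual model-dimension factor (an equivalent reparameterization, not code from the training recipe):

def noam_lr(step: int, peak_lr: float = 1.4e-3, warmup: int = 25_000) -> float:
    # Linear ramp to peak_lr over `warmup` steps, then step**-0.5 decay.
    step = max(step, 1)
    return peak_lr * (warmup ** 0.5) * min(step ** -0.5, step * warmup ** -1.5)

assert abs(noam_lr(25_000) - 1.4e-3) < 1e-9  # the peak is hit at the warmup step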

Finetuning

During finetuning, encoder outputs are statistics-pooled over time and projected through a bottleneck vector of size 256 before classification. The model is then trained with a cross-entropy loss for 20 epochs, using AdamW optimization with Noam scheduling and a warmup ratio of 0.2. Training data is augmented by masking 50% of time steps with SpecAugment patches of size 24 and applying 4 frequency masks of size 10. The data is further augmented with the Room Impulse Response (RIR) and noise corpora [4] and with speed perturbation at 0.95x and 1.05x.
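
A minimal PyTorch sketch of that classification head, assuming mean-and-standard-deviation statistics pooling (the pooling variant and layer sizes other than the 256-d bottleneck are assumptions; 107 classes matches VoxLingua107):

import torch
import torch.nn as nn

class LangIDHead(nn.Module):
    def __init__(self, enc_dim: int = 512, bottleneck: int = 256, num_langs: int = 107):
        super().__init__()
        self.bottleneck = nn.Linear(2 * enc_dim, bottleneck)  # mean + std concatenated
        self.classifier = nn.Linear(bottleneck, num_langs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, enc_dim) encoder outputs
        stats = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        return self.classifier(self.bottleneck(stats))  # logits for the CE loss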

The final model parameters are the average of the five best checkpoints with respect to validation loss.
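
A minimal sketch of that averaging in PyTorch; the checkpoint file names are hypothetical:

import torch

# Hypothetical file names for the five best checkpoints by validation loss.
paths = [f"checkpoint_best_{i}.pt" for i in range(5)]
states = [torch.load(p, map_location="cpu") for p in paths]

# Average every parameter tensor element-wise across the checkpoints.
# (Casting to float is a simplification for any integer buffers.)
avg = {k: torch.stack([s[k].float() for s in states]).mean(dim=0) for k in states[0]}
torch.save(avg, "checkpoint_averaged.pt")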

Datasets

Pretraining uses the 400k unlabeled hours of the VoxPopuli dataset [5], a multilingual corpus of European Parliament recordings covering 23 of the 24 official languages of the European Union (Maltese is excluded).

Finetuning and evaluation use the VoxLingua107 dataset [6], which contains YouTube data for 107 languages. The official training set totals 6628 hours of speech, an average of 62 hours per language. For training, we cut the training data into segments of 4, 5, and 6 seconds, as sketched below.
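
A minimal sketch of such segmentation at the model's 16 kHz sample rate; cycling through the three durations is an illustrative assumption, since the card does not specify how segment lengths are assigned:

import itertools

def segment(samples, sample_rate=16_000, durations=(4, 5, 6)):
    # Cut one utterance into consecutive 4 s, 5 s, 6 s, 4 s, ... chunks.
    segments, start = [], 0
    for dur in itertools.cycle(durations):
        end = start + dur * sample_rate
        if end > len(samples):
            break  # drop the trailing remainder shorter than the next chunk
        segments.append(samples[start:end])
        start = end
    return segments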

To preserve the integrity of the evaluation set, we hold out 10% of the training data with a stratified shuffle split and use it as a validation set during training.
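
A minimal sketch of such a split with scikit-learn's StratifiedShuffleSplit; the file list and labels below are toy stand-ins for the real VoxLingua107 manifest:

from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins: 100 utterances for each of the 107 languages.
utterances = [f"utt_{i}.wav" for i in range(10_700)]
labels = [i % 107 for i in range(10_700)]

# Hold out 10% as validation, keeping each language's proportion intact.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, val_idx = next(splitter.split(utterances, labels))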

Performance

The model achieves a 5.34% error rate on the official evaluation set, which contains 1609 verified utterances across 33 languages.

How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

AUTOMATICALLY LOAD THE MODEL FROM NGC

import nemo.collections.asr as nemo_asr

# Download the LangID checkpoint from NGC and restore the model.
langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="langid_pearlnet")
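
IDENTIFY THE LANGUAGE OF AN AUDIO FILE

A minimal inference sketch: recent NeMo releases expose a get_label helper on EncDecSpeakerLabelModel for single-file prediction; the audio path below is a placeholder for your own 16 kHz mono WAV file.

# Predict the spoken language of a single recording (path is a placeholder).
lang = langid_model.get_label("/path/to/audio.wav")
print(lang)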

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
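
If your recordings are not already in this format, a minimal conversion sketch with torchaudio follows (the choice of tool and the file names are assumptions, not part of this card):

import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("input.wav")                      # placeholder path
waveform = waveform.mean(dim=0, keepdim=True)                    # downmix to mono
waveform = F.resample(waveform, orig_freq=sr, new_freq=16_000)   # resample to 16 kHz
torchaudio.save("input_16k_mono.wav", waveform, 16_000)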

Output

This model provides spoken language identification of the given utterance.

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.

References

[1] Bartley, Travis M., et al. "Accidental learners: Spoken language identification in multilingual self-supervised models." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.

[2] Gulati, Anmol, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." Proc. Interspeech 2020, pp. 5036-5040. ISCA, 2020.

[3] Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of speech representations." Advances in Neural Information Processing Systems 33 (NeurIPS 2020): 12449-12460.

[4] Ko, Tom, et al. "A study on data augmentation of reverberant speech for robust speech recognition." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[5] Wang, Changhan, et al. "VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation." Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. ACL, 2021.

[6] Valk, Jörgen, and Tanel Alumäe. "VoxLingua107: a dataset for spoken language recognition." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.