Speaker recognition (SR) is a broad research area that covers two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). In this work, we focus on far-field, text-independent speaker recognition, where the identity of the speaker is based on how the speech is spoken, not necessarily on what is being said. Typically, such SR systems operate on unconstrained speech utterances, which are converted into fixed-length vectors called speaker embeddings. Speaker embeddings are also used in automatic speech recognition (ASR) and speech synthesis.
This model, which uses a modified ECAPA-based encoder, is trained end-to-end with angular softmax loss for speaker verification and diarization, and for extracting speaker embeddings.
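The text above does not spell out the exact form of the angular softmax loss. As an illustration only, one widely used angular-margin formulation is the additive angular margin softmax, shown here as an assumption rather than as the loss confirmed for this model. In it, \theta_j is the angle between the length-normalized embedding of utterance i and the normalized weight vector of speaker class j, y_i is the true speaker, and the scale s and margin m are hyperparameters.

```latex
L_i = -\log \frac{e^{s \cos(\theta_{y_i} + m)}}
                 {e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos\theta_j}}
```

With m = 0 this reduces to a softmax over scaled cosine similarities; a positive margin makes the target class harder to satisfy, which pushes embeddings of the same speaker closer together in angle.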
ECAPA models consist of time delay neural network (TDNN) blocks and squeeze-and-excitation (SE) layers combined with Res2Block layers. For faster training with similar performance on diarization tasks, we replaced the Res2Blocks with group convolution layers. The encoded information is then pooled with an attention mechanism to produce speaker embeddings.
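The attention-based pooling step can be illustrated with a minimal sketch. It assumes the per-frame attention scores have already been produced by some small scoring network, which is not shown and is not specified above. The function name `attention_pool` and its arguments are hypothetical, and the model's actual pooling layer may differ, for example by also pooling weighted standard deviations.

```python
import math


def attention_pool(frames, scores):
    """Pool T frame vectors into one vector using softmax attention weights.

    frames: list of T feature vectors (each a list of D floats)
    scores: list of T unnormalized attention scores, one per frame
            (hypothetical output of an attention scoring network)
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("need at least one frame and exactly one score per frame")

    # Softmax over time, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the frames along the time axis.
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
```

Whatever the number of input frames, the output has the same length as a single frame vector, which is how variable-length utterances end up as fixed-length embeddings.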
These models were trained on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit was used to train this model for a few hundred epochs on multiple GPUs.
The following datasets are used for training
This ECAPA model, which is built from TDNN and SE layers and has 22.3M parameters, achieves 0.92% EER on the VoxCeleb clean test trial file. It also achieves the following results on common evaluation datasets (without fine-tuning on any dev set):
For a single audio file, one can also extract embeddings inline using:

    import nemo.collections.asr as nemo_asr

    # Load the pretrained speaker model by name.
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name='ecapa_tdnn')
    # 'audio_path' is a placeholder for the path to the input audio file.
    embs = speaker_model.get_embedding('audio_path')
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
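Before passing a file to the model, it can help to confirm that it matches this input format. The following is a minimal sketch that uses only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and is not part of NeMo.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, message) describing whether a WAV file is 16 kHz mono."""
    # Read only the header fields; no audio samples are decoded here.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    if rate != expected_rate:
        return False, f"sample rate is {rate} Hz, expected {expected_rate} Hz"
    if channels != expected_channels:
        return False, f"file has {channels} channels, expected {expected_channels}"
    return True, "ok"
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before use.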
This model provides a speaker embedding of size 192 for a given audio sample.
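A common way to use two such embeddings for verification is to compare them with cosine similarity and accept the pair as the same speaker when the score clears a threshold. The sketch below assumes the embeddings have already been converted to plain lists of floats. The function names and the default threshold of 0.7 are hypothetical; a real threshold should be tuned on held-out trials.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the pair as one speaker if the similarity reaches the threshold.

    The default threshold is a placeholder, not a tuned value.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the decision.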
This model is trained on both telephonic and non-telephonic speech from the VoxCeleb datasets, Fisher, and Switchboard. If your data comes from a different domain than the training data, or the model does not perform well on it, consider fine-tuning for that speech domain.
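To judge whether performance on a new domain is adequate, one option is to compute the equal error rate (EER), the metric quoted earlier, on a set of labeled trials from that domain. The following is a minimal sketch that approximates the EER by trying every observed score as the accept threshold. The function name and the trial format (a score plus a label of 1 for same-speaker and 0 for different-speaker pairs) are assumptions for illustration, not the evaluation script used for the reported numbers.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from trial scores and labels (1 = same speaker, 0 = different)."""
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = float("inf"), None
    # Try each observed score as the threshold; a trial is accepted if score >= threshold.
    for threshold in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / n_impostor  # false acceptance rate
        frr = false_rejects / n_target    # false rejection rate
        # Keep the threshold where the two error rates are closest to each other.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

This loop is quadratic in the number of trials, which is fine for a quick check on a small trial list; larger evaluations would typically sort the scores once and sweep the threshold.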
The license to use this model is covered by the license of the NeMo Toolkit. By downloading the public and release version of the model, you accept the terms and conditions of this license.