TitaNet-L

Description: TitaNet model for Speaker Verification and Diarization tasks
Publisher: NVIDIA
Latest Version: v1
Modified: April 4, 2023
Size: 96.91 MB

Model Overview

Speaker recognition (SR) is a broad research area that covers two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). In this work, we focus on far-field, text-independent speaker recognition, where the identity of the speaker is determined by how the speech is spoken, not necessarily by what is being said. Such SR systems typically operate on unconstrained speech utterances, which are converted into fixed-length vectors called speaker embeddings. Speaker embeddings are also used in automatic speech recognition (ASR) and speech synthesis.

This model, which uses a ContextNet-based encoder [1], is trained end-to-end with angular softmax loss for speaker verification, speaker diarization, and speaker embedding extraction.

Model Architecture

TitaNet-L (large) models employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer that maps variable-length utterances to a fixed-length embedding (t-vector) [1].
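
The way a statistics pooling layer turns a variable number of frames into a fixed-length vector can be illustrated with a small sketch. This is a simplified illustration and not the TitaNet implementation: the function name attentive_stats_pooling is hypothetical, and the attention scores are passed in as plain numbers, whereas in the real model they come from a learned attention layer. The sketch only shows that the output length depends on the number of channels, not on the number of frames.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool a variable-length sequence of frames into one fixed-length vector.

    frames: list of T frames, each a list of C channel values.
    scores: list of T lists of C unnormalized attention scores
            (hypothetical stand-in for a learned attention layer).
    Returns a list of length 2 * C: weighted means, then weighted stds.
    """
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        # Softmax over time, computed independently for each channel.
        column = [s[c] for s in scores]
        peak = max(column)
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Attention-weighted mean and standard deviation of this channel.
        values = [f[c] for f in frames]
        mean = sum(w * v for w, v in zip(weights, values))
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
        means.append(mean)
        stds.append(math.sqrt(max(var, 0.0)))
    return means + stds


# Two utterances of different length (2 and 5 frames), both with 2 channels.
short = attentive_stats_pooling([[1.0, 2.0], [3.0, 4.0]],
                                [[0.0, 0.0], [0.0, 0.0]])
long = attentive_stats_pooling([[0.1 * t, 1.0] for t in range(5)],
                               [[float(t), 0.0] for t in range(5)])
print(len(short), len(long))  # both vectors have 2 * C entries
```

With all scores equal, the weights are uniform and the pooled vector reduces to the plain per-channel mean and standard deviation.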

Training

These models were trained on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit [2] was used to train this model for a few hundred epochs on multiple GPUs.

Datasets

The following datasets are used for training

Performance

TitaNet-L is built from 1D convolutions, residual connections, and SE blocks. With 25.3M parameters, it achieves an equal error rate (EER) of 0.68% on the VoxCeleb clean test trial file. It also achieves the following diarization results on common evaluation datasets (without fine-tuning on any dev set):

Evaluation type             NIST_SRE_2000   AMI (Lapel)   AMI (MixHeadset)   CH109
Oracle, known #speakers     6.73            2.03          1.73               1.19
Oracle, unknown #speakers   5.38            2.03          1.89               1.63
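
For readers unfamiliar with the EER metric quoted above for speaker verification, the following is a minimal sketch of how it can be approximated from a list of trial scores. It is not the evaluation script used for the reported numbers: the function name equal_error_rate and the toy scores are hypothetical, and the threshold sweep is a simple quadratic-time search that picks the point where the false accept rate and false reject rate are closest.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from similarity scores and 1/0 same-speaker labels."""
    num_target = sum(labels)
    num_nontarget = len(labels) - num_target
    if num_target == 0 or num_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    best_gap, eer = None, None
    # Try each observed score as the accept threshold (accept if score >= t).
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / num_nontarget
        frr = false_rejects / num_target
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials: three same-speaker pairs (label 1), three different-speaker pairs.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

In the toy example one same-speaker pair scores below one different-speaker pair, so no threshold separates the two groups perfectly and the EER is one third.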

How to use this model

A detailed, step-by-step procedure for training and for extracting embeddings is provided in the Speaker Verification notebook and in the embeddings extraction script.

Embedding Extraction

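For a single audio file, you can also extract embeddings inline:

import nemo.collections.asr as nemo_asr
# Download (or load from cache) the pretrained TitaNet-L checkpoint.
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name='titanet_large')
# 'audio_path' is a placeholder; replace it with the path to a 16 kHz mono wav file.
embs = speaker_model.get_embedding('audio_path')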

Speaker Verification

Speaker verification is the task of deciding whether two utterances come from the same speaker. We provide a helper function that compares two audio files and returns True if they are from the same speaker and False otherwise. The audio files should be 16 kHz mono-channel wav files.

import nemo.collections.asr as nemo_asr
# Download (or load from cache) the pretrained TitaNet-L checkpoint.
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name='titanet_large')
# Both paths are placeholders; decision is True when the two files are judged to share a speaker.
decision = speaker_model.verify_speakers('path/to/one/audio_file','path/to/other/audio_file')
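
A same-speaker decision of this kind is commonly made by comparing two embeddings with cosine similarity and thresholding the result. The sketch below illustrates that idea on plain Python lists. It is an assumption-laden illustration, not a description of what the helper above does internally: the function names cosine_similarity and same_speaker and the threshold value 0.7 are hypothetical, and a real threshold should be tuned on held-out trials.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must have non-zero norm")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Return True when the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 3-dimensional vectors standing in for 192-dimensional embeddings.
print(same_speaker([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))   # nearly parallel
print(same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal
```

Because cosine similarity ignores vector length, only the direction of each embedding matters for the decision.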

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (wav files) as input.
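
Before passing a file to the model, it can be useful to confirm that it meets this requirement. The following is a minimal sketch using Python's standard wave module, which reads uncompressed PCM wav files only. The function name check_wav_format and the file name in the usage line are hypothetical and are not part of NeMo.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the wav file has the expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return rate == expected_rate and channels == expected_channels


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py path/to/audio.wav
    for wav_path in sys.argv[1:]:
        print(wav_path, check_wav_format(wav_path))
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before use.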

Output

For a given audio sample, this model outputs a speaker embedding of size 192.

Limitations

This model is trained on both telephonic and non-telephonic speech from the VoxCeleb datasets, Fisher, and Switchboard. If your data domain differs from the training data, or if the model does not perform well on your data, consider fine-tuning it for that speech domain.

References

[1] TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
[2] NVIDIA NeMo Toolkit

License

Use of this model is covered by the license of the NeMo Toolkit [2]. By downloading the public release version of the model, you accept the terms and conditions of this license.