NGC | Catalog

SpeakerRecognition Speakernet



SpeakerNet-L model trained with NeMo for speaker recognition fine-tuning



Use Case



PyTorch with NeMo

Latest Version



June 30, 2021


30.9 MB

Model Overview

Speaker recognition (SR) is a broad research area that covers two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?). In this work, we focus on far-field, text-independent speaker recognition, where the identity of the speaker is based on how the speech is spoken, not necessarily on what is being said. Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings. Speaker embeddings are also used in automatic speech recognition (ASR) and speech synthesis.

This model is trained end-to-end with cross-entropy loss for speaker recognition, and is intended for fine-tuning and testing on a known set of speaker labels.
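As a rough illustration of the training objective, the sketch below computes the cross-entropy loss for a single utterance from a vector of per-speaker scores (logits). The function name, the three-speaker logits, and the target indices are made up for illustration; this is not NeMo's training code.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy loss for one sample: -log(softmax(logits)[target_index])."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical scores for three known speakers.
example_logits = [2.0, 0.5, -1.0]
loss_if_speaker_0 = cross_entropy(example_logits, 0)  # true speaker has the top score
loss_if_speaker_2 = cross_entropy(example_logits, 2)  # true speaker has the lowest score
```

The loss is small when the true speaker receives the highest score and grows as the true speaker's score falls behind the others, which is what pushes the network to separate the known speakers.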

Model Architecture

SpeakerNet models consist of 1D depth-wise separable convolutional layers. The encoded information is then pooled with a statistics pooling layer based on the mean and variance, as described in the paper [1].
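To make the pooling step concrete, here is a minimal sketch in plain Python of mean-and-variance pooling over the time axis. It assumes the encoder output is laid out as [channels][time] and uses the population variance; both choices, and the function name `stats_pooling`, are assumptions for illustration rather than the model's actual implementation.

```python
from statistics import fmean, pvariance

def stats_pooling(encoded):
    """Pool a [channels][time] feature map into one fixed-length vector.

    Returns the per-channel means followed by the per-channel variances,
    so the output length is 2 * channels regardless of the number of frames.
    """
    means = [fmean(channel) for channel in encoded]
    variances = [pvariance(channel) for channel in encoded]
    return means + variances

# Two toy "utterances" with different lengths (3 and 5 frames), 2 channels each.
short_utt = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
long_utt = [[1.0, 1.0, 1.0, 1.0, 1.0], [2.0, 4.0, 6.0, 8.0, 10.0]]

short_vec = stats_pooling(short_utt)
long_vec = stats_pooling(long_utt)
```

Both vectors have length 4 even though the inputs have different numbers of frames, which is how a variable-length utterance is reduced to a fixed-size representation.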


These models were trained on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit [2] was used to train this model for a few hundred epochs on multiple GPUs.


The following datasets are used for training


This SpeakerNet-L model, which is based on the QuartzNet encoder structure and has 8M parameters, achieved 96.23% accuracy on the training set described above.

How to use this model

A detailed step-by-step procedure for training and fine-tuning is provided in the Speaker Recognition notebook.

For inference on a fine-tuned model, use this script.

For speaker embedding extraction and verification, refer to the speaker verification model (speakernet-M).


This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
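Before running inference, it can help to confirm that a file matches this input format. The sketch below uses Python's standard `wave` module to read the WAV header; the function name `check_wav_format` and the constants are illustrative assumptions and are not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, i.e. 16 kHz
EXPECTED_CHANNELS = 1         # mono

def check_wav_format(path):
    """Return a list of format problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(f"found {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.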


This model outputs the index of the known speaker label for a given audio sample.
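Since the output is an index rather than a name, a lookup step is usually needed to turn it into a speaker identity. The following sketch assumes you have the per-speaker scores for one audio sample and the ordered list of labels used during fine-tuning; the label names, scores, and the helper `predict_label` are hypothetical.

```python
def predict_label(logits, speaker_labels):
    """Return (index, label) for the highest-scoring known speaker."""
    if len(logits) != len(speaker_labels):
        raise ValueError("expected one score per known speaker label")
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index, speaker_labels[best_index]

# Hypothetical label list (in training order) and scores for one audio sample.
speaker_labels = ["speaker_a", "speaker_b", "speaker_c"]
logits = [0.1, 2.3, -0.7]
index, label = predict_label(logits, speaker_labels)
```

The label list must be in the same order as the one used when the model was fine-tuned, otherwise the indices will map to the wrong speakers.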


This model is trained on non-telephonic speech from the VoxCeleb datasets, so it may not work as well on telephonic speech. If that is the case, consider fine-tuning the model on that speech domain.


[1] SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
[2] NVIDIA NeMo Toolkit


The license to use this model is covered by the license of the NeMo Toolkit [2]. By downloading the public and release version of the model, you accept the terms and conditions of this license.