SpeakerDiarization Speakernet



SpeakerNet-M model for Speaker Diarization inference



Use Case



PyTorch with NeMo

Latest Version



June 30, 2021


24.09 MB

Model Overview

Speaker Diarization (SD) is the task of segmenting audio recordings by speaker label, that is, answering the question "Who speaks when?"

A diarization system consists of a Voice Activity Detection (VAD) model, which finds the time stamps of the audio where speech is present while ignoring background noise, and a speaker embeddings model, which extracts speaker embeddings from the speech segments given by the VAD time stamps. These embeddings are then grouped into clusters according to the number of speakers present in the recording.
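The clustering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the segments and two-dimensional embeddings are toy values rather than real VAD or SpeakerNet output, the function names are hypothetical, and average-linkage agglomerative clustering on cosine similarity is one common choice, not necessarily the method NeMo uses.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, num_speakers):
    """Merge the most similar clusters until num_speakers clusters remain.

    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > num_speakers:
        best_pair, best_score = None, -2.0
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean similarity over all cross-cluster pairs.
            sims = [cosine_similarity(embeddings[p], embeddings[q])
                    for p in clusters[i] for q in clusters[j]]
            score = sum(sims) / len(sims)
            if score > best_score:
                best_pair, best_score = (i, j), score
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy stand-ins (assumptions): speech segments as (start, end) in seconds,
# and one made-up embedding per segment.
speech_segments = [(0.0, 1.5), (1.5, 3.0), (3.2, 4.8), (5.0, 6.5)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

labels = cluster_embeddings(embeddings, num_speakers=2)
diarization = [(start, end, f"speaker_{label}")
               for (start, end), label in zip(speech_segments, labels)]
```

With these toy values, the first and third segments end up in one cluster and the second and fourth in the other, so `diarization` alternates between two speaker labels.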

Model Architecture

Speaker diarization in NeMo [3] currently supports only inference, using pretrained SpeakerNet [1] and VAD [2] models. This model can be used for speaker diarization inference either combined with any VAD model or without VAD for ORACLE evaluation (where ground-truth speech time stamps replace the VAD output).


The MarbleNet and SpeakerNet models were trained separately on a composite dataset comprising several thousand hours of speech compiled from various publicly available sources. The NeMo toolkit [3] was used to train this model for a few hundred epochs on multiple GPUs.


The SpeakerNet model was trained on speech from the VoxCeleb, Fisher, and Switchboard datasets. Reported results:


  • This model (speakerdiarization_speakernet) achieves a Speaker Error Rate (SER) of 5.4% on the CH109 set.
  • The speakerverification_speakernet model achieves a Speaker Error Rate (SER) of 4.1% on the AMI Lapel test set.
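To make the metric concrete, here is a simplified, frame-level sketch of a speaker error rate. It is not the scoring procedure used to produce the numbers above: it assumes both label sequences cover the same speech frames, ignores overlapping speech and scoring collars, and finds the label mapping by brute force over permutations. The function name `speaker_error_rate` is hypothetical.

```python
from itertools import permutations


def speaker_error_rate(reference, hypothesis):
    """Fraction of frames with the wrong speaker under the best label mapping.

    Both arguments are equal-length lists of per-frame speaker labels.
    Hypothesis labels are arbitrary cluster names, so every one-to-one
    mapping onto reference speakers is tried and the best one is kept.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        raise ValueError("at least one frame is required")
    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    extra = max(0, len(hyp_speakers) - len(ref_speakers))
    targets = ref_speakers + [None] * extra
    best_correct = 0
    for perm in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(1 for r, h in zip(reference, hypothesis)
                      if mapping[h] == r)
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(reference)


# Toy example: ten frames, one of which is attributed to the wrong speaker.
reference = ["A"] * 4 + ["B"] * 6
hypothesis = ["x"] * 3 + ["y"] * 7
ser = speaker_error_rate(reference, hypothesis)
```

In the toy example one of ten frames is mislabeled, so `ser` comes out at roughly 0.1. Brute-force mapping is only practical for a handful of speakers; scoring tools generally solve the assignment more efficiently and weight errors by time.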

How to use this model

The steps for loading the NeMo model to extract speaker embeddings and perform oracle or non-oracle speaker diarization are explained in this Notebook.


This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
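Before running inference it can help to check that a file matches this input format. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `is_valid_input` and the constants are illustrative and not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, i.e. 16 kHz
EXPECTED_CHANNELS = 1         # mono


def is_valid_input(wav_path):
    """Return True if the WAV file is 16 kHz and single-channel."""
    with wave.open(wav_path, "rb") as wav_file:
        return (wav_file.getframerate() == EXPECTED_SAMPLE_RATE
                and wav_file.getnchannels() == EXPECTED_CHANNELS)
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.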


This model outputs an RTTM file with speaker labels and their time stamps.
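For readers unfamiliar with RTTM, each speaker turn is one space-separated line with ten fields: the record type `SPEAKER`, a file ID, a channel number, the onset and duration in seconds, and the speaker name, with `<NA>` filling the unused fields. The sketch below formats and parses such lines; the helper names, the file ID, and the three-decimal precision are assumptions made for illustration, and the exact formatting of the file this model writes may differ.

```python
def to_rttm_line(file_id, start, duration, speaker, channel=1):
    """Format one speaker turn as an RTTM SPEAKER record."""
    return (f"SPEAKER {file_id} {channel} {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")


def parse_rttm_line(line):
    """Return (file_id, start, end, speaker) from an RTTM SPEAKER record."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER record: {line!r}")
    start = float(fields[3])
    duration = float(fields[4])
    return fields[1], start, start + duration, fields[7]


# Hypothetical file ID and speaker label, for illustration only.
line = to_rttm_line("meeting1", 0.0, 1.5, "speaker_0")
turn = parse_rttm_line(line)
```

Here `line` is `SPEAKER meeting1 1 0.000 1.500 <NA> <NA> speaker_0 <NA> <NA>`, and parsing it recovers the file ID, the start and end times, and the speaker label.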


This model is trained on telephonic speech from the VoxCeleb, Fisher, and Switchboard datasets, and hence may not work as well for non-telephonic speech. For non-telephonic speech, consider fine-tuning on that speech domain or try the speakerverification model.


[1] SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
[2] MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
[3] NVIDIA NeMo Toolkit


The license to use this model is covered by the license of the NeMo Toolkit [3]. By downloading the public and release version of the model, you accept the terms and conditions of this license.