Speaker Diarization (SD) is the task of segmenting audio recordings by speaker label, that is, answering the question "Who spoke when?"
A diarization system consists of a Voice Activity Detection (VAD) model, which finds the time stamps of speech in the audio while ignoring background noise, and a speaker embedding model, which extracts speaker embeddings from the speech segments delimited by the VAD time stamps. These speaker embeddings are then grouped into clusters based on the number of speakers present in the audio recording.
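The clustering stage described above can be illustrated with a minimal sketch. It assumes the VAD and embedding stages have already run, and it uses made-up 2-dimensional embeddings and a simple average-linkage agglomerative clustering written in plain Python. The function names `cosine` and `cluster_embeddings` are hypothetical and are not part of NeMo; the clustering method actually used by the toolkit may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_embeddings(embeddings, num_speakers):
    """Merge the most similar pair of clusters (average linkage)
    until only num_speakers clusters remain; return one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > num_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

# Toy inputs (assumed for illustration): speech segments as a VAD might
# return them, and one fake embedding per segment.
segments = [(0.0, 1.5), (1.5, 3.0), (3.2, 4.7), (4.7, 6.0)]
embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

labels = cluster_embeddings(embeddings, num_speakers=2)
for (start, end), label in zip(segments, labels):
    print(f"{start:.1f}-{end:.1f}s speaker_{label}")
```

In this toy example the first and third segments end up in one cluster and the second and fourth in the other, which corresponds to two speakers taking alternate turns.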
Speaker diarization in NeMo currently supports only inference, using pretrained SpeakerNet models and VAD models. This model, combined with any VAD model or used without VAD for oracle evaluation, can be used for speaker diarization inference.
We trained the MarbleNet and SpeakerNet models separately on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit was used to train this model for a few hundred epochs on multiple GPUs.
The following datasets were used to train the SpeakerNet model.
The speakerdiarization_speakernet model achieves a Speaker Error Rate (SER) of 5.4% on the CH109 set.
The speakerverification_speakernet model achieves a Speaker Error Rate (SER) of 4.1% on the AMI Lapel test set.
The steps for loading a NeMo model for speaker embedding extraction, in order to perform oracle or non-oracle speaker diarization, are explained in this Notebook.
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
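Before running inference it can help to confirm that an input file matches this format. The following is a minimal sketch using only Python's standard `wave` module. The helper name `is_16k_mono` and the file name `demo.wav` are hypothetical, and the test tone is generated only so the example is self-contained.

```python
import math
import os
import struct
import tempfile
import wave

def is_16k_mono(path):
    """Return True if the WAV file is single-channel with a 16000 Hz sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000

# Write a one-second 440 Hz test tone as 16-bit PCM, mono, 16 kHz.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    samples = [int(12000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(16000)]
    wav.writeframes(struct.pack("<16000h", *samples))

print(is_16k_mono(demo_path))
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.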
This model outputs an RTTM file containing speaker labels and their time stamps.
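For readers unfamiliar with RTTM, each `SPEAKER` line records a file ID, a channel, an onset time and a duration in seconds, and a speaker label, with `<NA>` in the unused fields. The sketch below formats labelled segments into such lines. The helper name `to_rttm_lines`, the file ID `session1`, and the speaker names are hypothetical, and the exact number formatting written by NeMo may differ.

```python
def to_rttm_lines(file_id, segments):
    """Format (start, end, speaker) tuples as RTTM SPEAKER lines.

    Fields: type, file ID, channel, onset, duration, <NA>, <NA>,
    speaker name, <NA>, <NA>.
    """
    lines = []
    for start, end, speaker in segments:
        duration = end - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines

# Hypothetical diarization result for one recording.
demo = to_rttm_lines(
    "session1",
    [(0.0, 1.5, "speaker_0"), (1.5, 3.0, "speaker_1")],
)
print("\n".join(demo))
```

Note that RTTM stores the onset and the duration of each segment, not its end time, which is why the end time is converted before writing.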
This model is trained on telephonic speech from the VoxCeleb, Fisher, and Switchboard datasets, and hence may not work as well for non-telephonic speech. For non-telephonic speech, consider fine-tuning on that speech domain or try using the speakerverification model.
 SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
 MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
 NVIDIA NeMo Toolkit
Use of this model is covered by the license of the NeMo Toolkit. By downloading the public and release version of the model, you accept the terms and conditions of this license.