SpeakerDiarization Speakernet

Description

SpeakerNet-M model for Speaker Diarization inference

Publisher

NVIDIA

Latest Version

1.0.0rc1

Modified

April 4, 2023

Size

24.09 MB

Model Overview

Speaker Diarization (SD) is the task of segmenting an audio recording by speaker label, that is, answering the question "who speaks when?"

A diarization system consists of a Voice Activity Detection (VAD) model, which produces time stamps for the regions of audio that contain speech while ignoring background noise, and a speaker embedding model, which extracts speaker embeddings from the speech segments defined by those VAD time stamps. The embeddings are then grouped into clusters according to the number of speakers present in the recording.
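The clustering step described above can be illustrated with a small, dependency-free sketch. It is not the clustering code used by NeMo: the average-linkage agglomerative approach, the function names `cosine_similarity` and `cluster_embeddings`, and the toy segments and 3-dimensional embeddings are all assumptions chosen for illustration. It also assumes the number of speakers is known in advance and that no embedding is an all-zero vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, num_speakers):
    """Average-linkage agglomerative clustering (illustrative only).

    Returns one integer cluster label per embedding.
    """
    # Start with every segment in its own cluster (lists of segment indices).
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > num_speakers:
        best_pair, best_score = None, -math.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise similarity between the two clusters.
                sims = [
                    cosine_similarity(embeddings[a], embeddings[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                score = sum(sims) / len(sims)
                if score > best_score:
                    best_pair, best_score = (i, j), score
        # Merge the most similar pair of clusters.
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy data: (start_sec, end_sec) speech segments such as a VAD might produce,
# each paired with a made-up 3-dimensional embedding.
segments = [(0.0, 1.5), (1.8, 3.0), (3.2, 4.9), (5.1, 6.0)]
embeddings = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.9, 0.2, 0.0],
    [0.1, 0.9, 0.0],
]
labels = cluster_embeddings(embeddings, num_speakers=2)
for (start, end), label in zip(segments, labels):
    print(f"{start:.1f}-{end:.1f}s -> speaker_{label}")
```

In this toy input, the first and third segments end up in one cluster and the second and fourth in the other. Real speaker embeddings have far more dimensions, and a production system would typically use an optimized clustering library rather than this quadratic loop.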

Model Architecture

Speaker diarization in NeMo [3] currently supports only inference, using pretrained SpeakerNet [1] and VAD [2] models. This model can be used for speaker diarization inference either combined with any VAD model or without a VAD model for ORACLE evaluation, where the speech segments come from reference annotations.

Training

We train the MarbleNet and SpeakerNet models separately on a composite dataset comprising several thousand hours of speech compiled from various publicly available sources. The NeMo toolkit [3] was used to train this model for a few hundred epochs on multiple GPUs.

Datasets

The following datasets are used for training the SpeakerNet model:

VoxCeleb 1 Dev data (1211 speakers)
VoxCeleb 2 Dev data (5994 speakers)
Fisher
Switchboard

Performance

This model (speakerdiarization_speakernet) achieves a Speaker Error Rate (SER) of 5.4% on the CH109 set.
The speakerverification_speakernet model achieves an SER of 4.1% on the AMI Lapel test set.

How to use this model

The steps for loading the NeMo speaker embedding model and performing oracle or non-oracle speaker diarization are explained in this Notebook.

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
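Before running inference, it can help to confirm that a file matches this input format. The following is a minimal sketch using only Python's standard `wave` module. The helper names `check_wav_format` and `write_silence`, and the temporary demo files, are hypothetical and not part of NeMo.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz
EXPECTED_CHANNELS = 1  # mono


def check_wav_format(path):
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels found, expected {EXPECTED_CHANNELS}")
    return problems


def write_silence(path, sample_rate, channels, seconds=1.0):
    """Write a silent 16-bit PCM WAV file (used only for the demo below)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds) * channels)


# Demo: one file in the expected format and one that is not.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.wav")
    bad_path = os.path.join(tmp_dir, "bad.wav")
    write_silence(good_path, sample_rate=16000, channels=1)
    write_silence(bad_path, sample_rate=44100, channels=2)
    print("good.wav problems:", check_wav_format(good_path))
    print("bad.wav problems:", check_wav_format(bad_path))
```

Files that fail the check would need to be resampled or downmixed with an audio tool before use. That conversion step is outside the scope of this sketch.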

Output

This model outputs an RTTM file with speaker labels and their time stamps.
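As a rough illustration of how such output can be consumed, the sketch below parses RTTM text and totals the speaking time per speaker. It assumes the common ten-field `SPEAKER` line layout (file ID, channel, onset, duration, then the speaker name in the eighth field). The sample lines, the `Turn` type, and the `parse_rttm` helper are made up for this example and are not produced by or taken from this model.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float  # turn onset in seconds
    duration: float  # turn duration in seconds
    speaker: str


def parse_rttm(text):
    """Parse SPEAKER lines from RTTM text into a list of Turn records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any non-SPEAKER record types.
        if not fields or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(
                file_id=fields[1],
                start=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return turns


# Made-up sample in the assumed layout:
# SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
SAMPLE_RTTM = """\
SPEAKER call_01 1 0.50 2.50 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call_01 1 3.00 1.25 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER call_01 1 4.25 3.00 <NA> <NA> speaker_0 <NA> <NA>
"""

turns = parse_rttm(SAMPLE_RTTM)
totals = {}
for turn in turns:
    totals[turn.speaker] = totals.get(turn.speaker, 0.0) + turn.duration

for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.2f} s")
```

The same `Turn` records can be used to look up who is speaking at a given time by checking whether the time falls between `start` and `start + duration`.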

Limitations

This model is trained on telephonic speech from the VoxCeleb datasets, Fisher, and Switchboard, and hence may not work as well for non-telephonic speech. For non-telephonic speech, consider fine-tuning for that speech domain or try using the speakerverification_speakernet model.

References

[1] SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification
[2] MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
[3] NVIDIA NeMo Toolkit

License

The license to use this model is covered by the license of the NeMo Toolkit [3]. By downloading the public and release version of the model, you accept the terms and conditions of this license.