Diarization MSDD Telephonic

Multi-scale Diarization Decoder (MSDD) model for speaker diarization of telephone conversations

Model Overview

This collection contains a Multi-scale Diarization Decoder (MSDD) model trained on the Fisher Corpus. The checkpoint contains two different models: a pretrained TitaNet-L (around 25.3M parameters) and MSDD (around 5.8M parameters). MSDD models can be jointly trained and used for inference with pretrained speaker embedding extractors, such as TitaNet and ECAPA-TDNN.

Model Architecture

The MSDD [1] model is a sequence model that selectively weighs speaker embeddings extracted at different scales. More detail on this model is available here: MS Diarization with DSW.

This particular MSDD model is tuned for diarization of telephonic speech and is based on 5 scales: [1.5,1.25,1.0,0.75,0.5] seconds, with hop lengths of [0.75,0.625,0.5,0.375,0.25] seconds. The default temporal resolution is therefore 0.25s, and the hop lengths can be reduced to obtain finer temporal resolution.
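
The relationship between the scales, the hop lengths, and the 0.25s resolution can be checked with a few lines of Python. This is a minimal sketch built only from the numbers quoted above; `num_segments` is a hypothetical helper that counts full windows and is an assumption, not NeMo's segmentation code, which may treat the tail of a recording differently.

```python
# Scale (window) lengths and hop lengths quoted in the model card, in seconds.
scale_lengths = [1.5, 1.25, 1.0, 0.75, 0.5]
hop_lengths = [0.75, 0.625, 0.5, 0.375, 0.25]

# Each hop is half of its window, so consecutive segments overlap by 50%.
overlap_ratios = [1 - hop / scale for scale, hop in zip(scale_lengths, hop_lengths)]

# The smallest hop determines how finely speaker decisions are spaced in time.
temporal_resolution = min(hop_lengths)


def num_segments(duration, scale, hop):
    """Hypothetical helper: number of full `scale`-second windows,
    stepped by `hop` seconds, that fit in `duration` seconds."""
    if duration < scale:
        return 0
    return int((duration - scale) // hop) + 1


for scale, hop in zip(scale_lengths, hop_lengths):
    print(f"scale={scale}s hop={hop}s segments in 10s: {num_segments(10.0, scale, hop)}")
print(f"temporal resolution: {temporal_resolution}s")
```

Halving every hop length in this sketch would halve the resolution value and roughly double the number of segments per scale, which is the trade-off behind the remark about changing the hop length.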

Training

The NeMo toolkit [2] was used to train the models for over 10 epochs. These models were trained with this example training script and this MSDD telephonic config.

Datasets

This diarization model was trained on 10,000 sessions (approximately 1,500 hours) of the Fisher Corpus.

Performance

The performance of speaker diarization models is measured by Diarization Error Rate (DER). Because the MSDD model is initialized from clustering results, the final diarization accuracy depends heavily on clustering performance.

The model obtains the following DER scores on the evaluation datasets listed below.

Diarization evaluation condition:

The number of speakers is estimated:
-oracle_num_speakers=False
Oracle VAD is used to factor out the performance of VAD:
-oracle_VAD=True
For clustering, NMESC is used with the following parameters:
-max_num_speakers=8
-max_rp_threshold=0.15
-sparse_search_volume=30
-multiscale_weights=[1,1,1,1,1]
MSDD parameters:
-diar_window_length=50
-sigmoid_threshold=0.7
-overlap_infer_spk_limit=5

Diarization Performance:

CallHome American English Speech (CHAES, LDC97S42) 109 2-speaker subset: CH109

	Forgiving	Fair	Full
(collar, ignore_overlap)	(0.25, True)	(0.25, False)	(0.0, False)
False Alarm	-	0.62%	1.80%
Miss	-	2.47%	5.96%
Confusion	-	0.43%	2.10%
DER	0.58%	3.52%	9.86%

NIST-SRE-2000 (LDC2001S97) Disc8: CallHome

	Forgiving	Fair	Full
(collar, ignore_overlap)	(0.25, True)	(0.25, False)	(0.0, False)
False Alarm	-	1.05%	2.24%
Miss	-	7.62%	11.09%
Confusion	-	4.06%	6.03%
DER	4.15%	12.73%	19.37%
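
DER is the sum of the false alarm, miss, and confusion rates, each expressed as a percentage of the reference speech time. The sketch below re-adds the components from the "Fair" and "Full" columns of both tables as a consistency check. The dictionary layout and the function name are illustrative assumptions; the numbers are copied from the tables above.

```python
# Error components (percent) copied from the two tables above.
reported = {
    "CH109 / Fair": {"false_alarm": 0.62, "miss": 2.47, "confusion": 0.43, "der": 3.52},
    "CH109 / Full": {"false_alarm": 1.80, "miss": 5.96, "confusion": 2.10, "der": 9.86},
    "CallHome / Fair": {"false_alarm": 1.05, "miss": 7.62, "confusion": 4.06, "der": 12.73},
    "CallHome / Full": {"false_alarm": 2.24, "miss": 11.09, "confusion": 6.03, "der": 19.37},
}


def der_from_components(false_alarm, miss, confusion):
    """DER is the sum of its three error components (all in percent)."""
    return false_alarm + miss + confusion


# Difference between the recomputed sum and the reported DER for each condition.
differences = {}
for name, row in reported.items():
    total = der_from_components(row["false_alarm"], row["miss"], row["confusion"])
    differences[name] = abs(total - row["der"])
    print(f"{name}: components sum to {total:.2f}%, reported DER {row['der']:.2f}%")
```

The components are rounded to two decimals in the tables, so a recomputed sum can differ from the reported DER in the last digit.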

How to Use this Model

The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
diar_model = nemo_asr.models.EncDecDiarLabelModel.from_pretrained(model_name="diar_msdd_telephonic")

Diarize speech recordings with this model

python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/multiscale_diar_decoder_infer.py \
    diarizer.vad.model_path=<NeMo VAD model path> \
    diarizer.msdd_model.model_path=<NeMo MSDD model path> \
    diarizer.oracle_vad=False \
    diarizer.manifest_filepath=<test_manifest> \
    diarizer.out_dir=<test_temp_dir>
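
The `diarizer.manifest_filepath` argument points to a manifest with one JSON object per line. The sketch below writes such a file, assuming the field layout described in the NeMo speaker diarization documentation (`audio_filepath`, `offset`, `duration`, `label`, `text`, `num_speakers`, `rttm_filepath`, `uem_filepath`). The helper names and the `/data/...` paths are hypothetical, and the field list should be checked against the NeMo version in use.

```python
import json
import os
import tempfile


def build_manifest_entry(audio_filepath, rttm_filepath=None, num_speakers=None):
    """Hypothetical helper: one manifest line for one recording.
    The field names are an assumption based on the NeMo diarization docs."""
    return {
        "audio_filepath": audio_filepath,
        "offset": 0,
        "duration": None,              # None is written as JSON null
        "label": "infer",
        "text": "-",
        "num_speakers": num_speakers,  # None lets the clustering estimate it
        "rttm_filepath": rttm_filepath,  # reference RTTM, if one exists
        "uem_filepath": None,
    }


def write_manifest(audio_filepaths, manifest_path):
    """Write one JSON object per line (JSON Lines)."""
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for audio_filepath in audio_filepaths:
            manifest_file.write(json.dumps(build_manifest_entry(audio_filepath)) + "\n")


# Demo with placeholder paths; replace them with real recordings.
demo_manifest = os.path.join(tempfile.mkdtemp(), "test_manifest.json")
write_manifest(["/data/call_0001.wav", "/data/call_0002.wav"], demo_manifest)

with open(demo_manifest, encoding="utf-8") as manifest_file:
    demo_entries = [json.loads(line) for line in manifest_file]
print(demo_entries[0])
```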

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
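
A recording can be checked against this input requirement before diarization. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name and the demo file are illustrative assumptions and not part of NeMo.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the Input section
EXPECTED_CHANNELS = 1         # mono


def inspect_wav(path):
    """Return (sample_rate, channels, is_compatible) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    is_compatible = (
        sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
    )
    return sample_rate, channels, is_compatible


# Demo: write one second of 16-bit silence at 16 kHz mono, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_16k_mono.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

demo_result = inspect_wav(demo_path)
print(demo_result)
```

Files that fail the check (for example 8 kHz telephone recordings or stereo files) would need to be resampled or downmixed with an audio tool before use.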

Output

This model provides estimated speaker labels and corresponding timestamps in RTTM format.
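
RTTM is a space-separated text format in which each `SPEAKER` line carries a file ID, channel, onset, duration, and speaker label. The sketch below parses such lines into segments and sums talk time per speaker. The two sample lines and the `speaker_0`/`speaker_1` labels are made up for illustration and are not output from this model.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(text):
    """Parse SPEAKER lines of an RTTM document into Segment tuples."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Fields: type, file ID, channel, onset, duration, <NA>, <NA>, speaker, ...
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        segments.append(Segment(fields[1], onset, onset + duration, fields[7]))
    return segments


def talk_time(segments):
    """Total speaking time in seconds per speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment.speaker] += segment.end - segment.start
    return dict(totals)


# Illustrative RTTM content (not real model output).
example_rttm = """SPEAKER session1 1 0.000 2.500 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER session1 1 2.750 1.250 <NA> <NA> speaker_1 <NA> <NA>"""

example_segments = parse_rttm(example_rttm)
example_totals = talk_time(example_segments)
print(example_segments)
print(example_totals)
```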

Limitations

Since this model was trained only on telephonic speech, diarization performance may degrade under other acoustic conditions.

References

[1] Multi-scale speaker diarization with dynamic scale weighting
[2] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Publisher

NVIDIA

Latest Version: 1.0.1

Updated: April 4, 2023 UTC

Compressed Size: 102.62 MB

Labels

AI, Automatic Speech Recognition, Conversational AI, PytorchLightning, Speaker Diarization, Speaker Recognition