Diarization MSDD Telephonic

Multi-scale Diarization Decoder (MSDD) model for speaker diarization of telephone conversations
Latest Version
April 4, 2023
102.62 MB

Model Overview

This collection contains a Multi-scale Diarization Decoder (MSDD) model trained on the Fisher Corpus. The checkpoint contains two models: a pretrained TitaNet-L (around 25.3M parameters) and MSDD (around 5.8M parameters). MSDD models can be trained jointly with pretrained speaker embedding extractors, such as TitaNet and ECAPA-TDNN, and used with them for inference.

Model Architecture

The MSDD [1] model is a sequence model that selectively weighs speaker embeddings extracted at different scales. More detail on this model is available here: MS Diarization with DSW.

This particular MSDD model is optimized for diarization of telephonic speech and is based on 5 scales with segment lengths of [1.5,1.25,1.0,0.75,0.5] seconds and hop lengths of [0.75,0.625,0.5,0.375,0.25] seconds. The default temporal resolution is therefore 0.25s, the hop length of the finest scale; the hop lengths can be reduced to obtain a finer temporal resolution.
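As a rough illustration of how the scales relate to the temporal resolution, the sketch below lays out sliding windows for each scale over a short stretch of audio. The helper `segment_starts` and the 3-second duration are hypothetical, chosen only for illustration; the segmentation inside NeMo may treat the end of a recording differently.

```python
# Scale (window) lengths and hop lengths in seconds, as listed above.
WINDOW_LENGTHS = [1.5, 1.25, 1.0, 0.75, 0.5]
HOP_LENGTHS = [0.75, 0.625, 0.5, 0.375, 0.25]


def segment_starts(duration, window, hop):
    """Return start times of windows of length `window` stepped by `hop`.

    Hypothetical helper for illustration; only full windows are kept here.
    """
    starts = []
    index = 0
    while index * hop + window <= duration + 1e-9:
        starts.append(index * hop)
        index += 1
    return starts


# Each listed hop length is half of its window length (50% overlap).
overlap_ratios = [hop / win for win, hop in zip(WINDOW_LENGTHS, HOP_LENGTHS)]

# The hop of the finest scale sets the default temporal resolution.
temporal_resolution = min(HOP_LENGTHS)

# Number of segments each scale produces for a 3-second stretch of audio.
segments_per_scale = {
    win: len(segment_starts(3.0, win, hop))
    for win, hop in zip(WINDOW_LENGTHS, HOP_LENGTHS)
}

print(overlap_ratios)
print(temporal_resolution)
print(segments_per_scale)
```

Shorter scales yield more segments over the same audio, which is why the shortest scale determines how finely speaker changes can be located.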


The NeMo toolkit [2] was used to train the models for over 10 epochs. These models were trained with this example training script and this MSDD telephonic config.


This diarization model was trained on 10,000 sessions (approximately 1,500 hours) from the Fisher Corpus.


The performance of speaker diarization models is measured by Diarization Error Rate (DER). Since the MSDD model must be initialized with clustering results, the final diarization accuracy depends heavily on clustering performance.
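For readers unfamiliar with the metric, DER is the sum of false alarm, missed speech, and speaker confusion time divided by the total scored speech time. The sketch below is a minimal illustration of that arithmetic, not the scoring tool used to produce the reported numbers; the function name and the durations are made up for the example. The last lines check that the component rates reported for CH109 in the "Fair" condition add up to the reported DER.

```python
def diarization_error_rate(false_alarm, miss, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / scored speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + miss + confusion) / total_speech


# Made-up durations in seconds, for illustration only.
der = diarization_error_rate(
    false_alarm=6.0, miss=24.0, confusion=4.0, total_speech=1000.0
)
print(f"DER = {der:.2%}")

# Component rates (in percent) reported for CH109, "Fair" condition.
ch109_fair_components = [0.62, 2.47, 0.43]
ch109_fair_der = round(sum(ch109_fair_components), 2)
print(ch109_fair_der)
```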

The model obtains the following DER scores on the evaluation datasets below.

Diarization evaluation condition:

  • The number of speakers is estimated:

  • Oracle VAD is used to factor out the performance of VAD:

  • For clustering, NMESC is used with the following parameters:

  • MSDD parameters:

Diarization Performance:

CallHome American English (CHEAS, LDC97S42) 109 2-speaker subset: CH109

|  | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 0.62% | 1.80% |
| Miss | - | 2.47% | 5.96% |
| Confusion | - | 0.43% | 2.10% |
| DER | 0.58% | 3.52% | 9.86% |

NIST-SRE-2000 (LDC2001S97) Disc8: CallHome

|  | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 1.05% | 2.24% |
| Miss | - | 7.62% | 11.09% |
| Confusion | - | 4.06% | 6.03% |
| DER | 4.15% | 12.73% | 19.37% |

How to Use this Model

The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
diar_model = nemo_asr.models.EncDecDiarLabelModel.from_pretrained(model_name="diar_msdd_telephonic")

Diarize speech recordings with this model

python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/ \
    diarizer.vad.model_path=<NeMo VAD model path> \
    diarizer.msdd_model.model_path=<NeMo MSDD model path> \
    diarizer.oracle_vad=False \
    diarizer.manifest_filepath=<test_manifest> \
    diarizer.out_dir=<test_temp_dir>


This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
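Before running diarization, it can help to confirm that a recording matches the expected sample rate (16000 samples per second) and channel count. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `check_wav_format` is hypothetical and not part of NeMo.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    return problems
```

An empty list means the header matches; files that report problems would need to be resampled or downmixed with an audio tool before use.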


This model provides estimated speaker labels and corresponding timestamps in RTTM format.
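Each speaker turn in an RTTM file is a single whitespace-separated `SPEAKER` line carrying the file ID, channel, onset, duration, and speaker label. The sketch below shows one way to read such output into Python objects; the `SpeakerTurn` type, the `parse_rttm` function, and the two sample lines are made up for illustration and are not actual output of this model.

```python
from typing import List, NamedTuple


class SpeakerTurn(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(text: str) -> List[SpeakerTurn]:
    """Parse SPEAKER lines of RTTM text into speaker turns."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Fields: type, file ID, channel, onset, duration, <NA>, <NA>, speaker, ...
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset = float(fields[3])
        duration = float(fields[4])
        turns.append(SpeakerTurn(fields[1], onset, onset + duration, fields[7]))
    return turns


# Illustrative RTTM content (not real model output).
example = """\
SPEAKER session1 1 0.000 2.500 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER session1 1 2.500 1.250 <NA> <NA> speaker_1 <NA> <NA>
"""

for turn in parse_rttm(example):
    print(f"{turn.speaker}: {turn.start:.2f}s to {turn.end:.2f}s")
```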


Since this model is trained only on telephonic speech, diarization performance may degrade under other acoustic conditions.


[1] Multi-scale speaker diarization with dynamic scale weighting
[2] NVIDIA NeMo Toolkit


License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.