Multi-scale Diarization Decoder (MSDD) model for speaker diarization of telephone conversations
Model Overview
This collection contains a Multi-scale Diarization Decoder (MSDD) model trained on the Fisher Corpus. The checkpoint contains two models: a pretrained TitaNet-L (around 25.3M parameters) and MSDD (around 5.8M parameters). MSDD models can be jointly trained and used for inference with pretrained speaker embedding extractors, such as TitaNet and ECAPA-TDNN.
Model Architecture
The MSDD model [1] is a sequence model that selectively weighs speaker embeddings from different scales. More detail on this model is available here: MS Diarization with DSW.
This particular MSDD model is tuned for diarization performance on telephonic speech and is based on 5 scales: [1.5,1.25,1.0,0.75,0.5] with hop lengths of [0.75,0.625,0.5,0.375,0.25]. The default temporal resolution is therefore 0.25s, and the hop length can be changed to obtain a finer temporal resolution.
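Each hop length listed above is half of its scale, so neighboring segments at every scale overlap by 50%. The following is a minimal sketch of that relationship, assuming the values are in seconds. The `num_segments` helper is hypothetical and only counts full-length segments; it is not how NeMo itself segments audio.

```python
# Scale (segment) lengths in seconds, taken from the text above.
scales = [1.5, 1.25, 1.0, 0.75, 0.5]

# Observation from the listed values: each hop is half of its scale.
hops = [scale / 2 for scale in scales]

# The shortest hop sets how often a speaker decision can change.
temporal_resolution = min(hops)


def num_segments(duration: float, scale: float, hop: float) -> int:
    """Count full-length segments that fit in `duration` seconds.

    Hypothetical helper for illustration; trailing partial segments are ignored.
    """
    if duration < scale:
        return 0
    return int((duration - scale) // hop) + 1


print(hops)                 # [0.75, 0.625, 0.5, 0.375, 0.25]
print(temporal_resolution)  # 0.25
```

Under this simplified counting, a shorter hop yields more segments for the same audio, which is why a smaller hop length gives a finer temporal resolution at a higher compute cost.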
Training
The NeMo toolkit [2] was used to train the models for over 10 epochs. These models were trained with this example training script and this MSDD telephonic config.
Datasets
This diarization model was trained on 10,000 sessions (approx. 1,500 hours) of the Fisher Corpus.
Performance
The performance of speaker diarization models is measured by Diarization Error Rate (DER). Because the MSDD model is initialized from clustering results, the final diarization accuracy depends heavily on clustering performance.
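As a reminder of what DER measures, here is a minimal sketch of the standard definition: the sum of false alarm, missed speech, and speaker confusion time, divided by the total reference speech time. The function name and arguments are illustrative and are not part of the NeMo API.

```python
def diarization_error_rate(
    false_alarm: float, miss: float, confusion: float, total_speech: float
) -> float:
    """Return DER as a fraction, given error durations in seconds.

    `total_speech` is the total duration of reference speech being scored.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + miss + confusion) / total_speech


# Example with made-up durations: 4 s of errors over 100 s of speech.
der = diarization_error_rate(false_alarm=1.0, miss=2.0, confusion=1.0, total_speech=100.0)
print(f"{der:.2%}")
```

The component rates reported later in this card add up the same way; for example, 0.62% + 2.47% + 0.43% = 3.52% for CH109 with a 0.25s collar and overlap included.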
The model obtains the following DER scores on the evaluation datasets below.
Diarization evaluation conditions:
- The number of speakers is estimated: oracle_num_speakers=False
- Oracle VAD is used to factor out the performance of VAD: oracle_VAD=True
- For clustering, NMESC is used with the following parameters:
  - max_num_speakers=8
  - max_rp_threshold=0.15
  - sparse_search_volume=30
  - multiscale_weights=[1,1,1,1,1]
- MSDD parameters:
  - diar_window_length=50
  - sigmoid_threshold=0.7
  - overlap_infer_spk_limit=5
Diarization Performance:
CallHome American English (CHEAS, LDC97S42) 109 2-speaker subset: CH109
| | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 0.62% | 1.80% |
| Miss | - | 2.47% | 5.96% |
| Confusion | - | 0.43% | 2.10% |
| DER | 0.58% | 3.52% | 9.86% |
NIST-SRE-2000 (LDC2001S97) Disc8: CallHome
| | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 1.05% | 2.24% |
| Miss | - | 7.62% | 11.09% |
| Confusion | - | 4.06% | 6.03% |
| DER | 4.15% | 12.73% | 19.37% |
How to Use this Model
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically load the model from NGC
Diarize speech recordings with this model
Input
This model accepts 16 kHz (16,000 Hz) mono-channel audio (wav files) as input.
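Before running diarization, it can help to confirm that a file matches this input format. The following is a minimal sketch using only Python's standard `wave` module; the function name `is_16khz_mono` is hypothetical, and it assumes an uncompressed PCM wav file.

```python
import wave


def is_16khz_mono(path: str) -> bool:
    """Return True if the wav file at `path` is sampled at 16 kHz with one channel."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == 16000 and wav_file.getnchannels() == 1
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.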
Output
This model provides estimated speaker labels and corresponding timestamps in RTTM format.
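RTTM is a plain-text format with one speaker turn per line and space-separated fields, where the file ID, onset, duration, and speaker name sit in the 2nd, 4th, 5th, and 8th fields. The following is a minimal parsing sketch; the `Turn` type, the `parse_rttm` function, and the sample lines are illustrative and are not output from this model.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Parse SPEAKER lines of an RTTM document into speaker turns."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(fields[1], onset, onset + duration, fields[7]))
    return turns


# Made-up sample for illustration only.
example = """SPEAKER call_01 1 0.50 2.25 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call_01 1 2.75 1.00 <NA> <NA> speaker_1 <NA> <NA>"""

for turn in parse_rttm(example):
    print(turn)
```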
Limitations
Since this model is trained only on telephonic speech, diarization performance may degrade under other acoustic conditions.
References
[1] Multi-scale speaker diarization with dynamic scale weighting
[2] NVIDIA NeMo Toolkit
License
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.