This collection contains the Multiscale Diarization Decoder (MSDD) trained on the Fisher Corpus. The checkpoint contains two models: a pretrained TitaNet-L (around 25.3M parameters) and the MSDD (around 5.8M parameters). MSDD models can be jointly trained and used for inference with pretrained speaker embedding extractors such as TitaNet and ECAPA-TDNN.
The MSDD [1] model is a sequence model that selectively weighs speaker embeddings at different scales. You can find more details on this model here: MS Diarization with DSW.
This particular MSDD model is designed for optimal diarization performance on telephonic speech. It is based on 5 scales, [1.5, 1.25, 1.0, 0.75, 0.5] s, with hop lengths of [0.75, 0.625, 0.5, 0.375, 0.25] s. The default temporal resolution is therefore 0.25 s, although the hop lengths can be reduced for finer temporal resolution.
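To make the multiscale segmentation concrete, here is a minimal illustrative sketch (not NeMo code) that generates the segment boundaries produced by each (window, hop) scale pair; the 10-second audio duration is an arbitrary example value:

```python
# Illustrative sketch: segment start/end times for each (window, hop) scale.
def multiscale_segments(total_dur, window, hop):
    """Return (start, end) tuples covering `total_dur` seconds of audio."""
    segments = []
    start = 0.0
    while start + window <= total_dur:
        segments.append((round(start, 3), round(start + window, 3)))
        start += hop
    return segments

scales = [1.5, 1.25, 1.0, 0.75, 0.5]
hops = [0.75, 0.625, 0.5, 0.375, 0.25]
# The finest scale (0.5 s window, 0.25 s hop) sets the base temporal resolution.
for w, h in zip(scales, hops):
    segs = multiscale_segments(10.0, w, h)
    print(f"window={w}s hop={h}s -> {len(segs)} segments")
```

Each scale yields a different number of embeddings per unit time; MSDD's scale weighting decides how much each contributes to the final speaker labels.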
The NeMo toolkit [2] was used for training the models for over 10 epochs. These model are trained with this example training script and this MSDD telephonic config.
This diarization model is trained on 10,000 sessions (Approx. 1500h) in Fisher Corpus.
The performance of speaker diarization models is measured by Diarization Error Rate (DER). Since the MSDD model requires an initial clustering result, the final diarization accuracy is largely affected by clustering performance.
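As a quick sanity check on the numbers reported below, DER is the sum of three error components (false alarm, missed speech, and speaker confusion), each expressed as a fraction of scored speech time:

```python
# DER decomposes into its three error components (all in % of scored time).
def der(false_alarm, miss, confusion):
    return false_alarm + miss + confusion

# The CH109 "Fair" column in the table below: 0.62 + 2.47 + 0.43 = 3.52
print(der(0.62, 2.47, 0.43))
```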
The model obtains the following DER scores on the evaluation datasets below.
Diarization evaluation conditions:
* The number of speakers is estimated: `oracle_num_speakers=False`
* Oracle VAD is used to factor out the performance of VAD: `oracle_vad=True`
* For clustering, NMESC is used with the following parameters:
  * `max_num_speakers=8`
  * `max_rp_threshold=0.15`
  * `sparse_search_volume=30`
  * `multiscale_weights=[1,1,1,1,1]`
* MSDD parameters:
  * `diar_window_length=50`
  * `sigmoid_threshold=0.7`
  * `overlap_infer_spk_limit=5`
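The evaluation settings above can be collected into a single structure for reproducibility. The sketch below expresses them as a plain Python dict; in NeMo these values live in the diarizer inference YAML (e.g. the telephonic config referenced earlier), and the exact nesting shown here is an assumption for illustration only:

```python
# Hedged sketch: the evaluation settings above as a plain dict (the nesting
# is illustrative; consult the NeMo telephonic YAML for the real layout).
eval_config = {
    "oracle_num_speakers": False,   # speaker count is estimated
    "oracle_vad": True,             # ground-truth VAD factors out VAD errors
    "clustering": {                 # NMESC clustering parameters
        "max_num_speakers": 8,
        "max_rp_threshold": 0.15,
        "sparse_search_volume": 30,
        "multiscale_weights": [1, 1, 1, 1, 1],
    },
    "msdd": {
        "diar_window_length": 50,
        "sigmoid_threshold": 0.7,
        "overlap_infer_spk_limit": 5,
    },
}
print(eval_config["clustering"]["max_num_speakers"])
```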
Diarization Performance:
CallHome American English (CHAES, LDC97S42), 109-session 2-speaker subset: CH109

| | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 0.62% | 1.80% |
| Miss | - | 2.47% | 5.96% |
| Confusion | - | 0.43% | 2.10% |
| DER | 0.58% | 3.52% | 9.86% |
NIST SRE 2000 (LDC2001S97), Disc8: CallHome

| | Forgiving | Fair | Full |
|---|---|---|---|
| (collar, ignore_overlap) | (0.25, True) | (0.25, False) | (0.0, False) |
| False Alarm | - | 1.05% | 2.24% |
| Miss | - | 7.62% | 11.09% |
| Confusion | - | 4.06% | 6.03% |
| DER | 4.15% | 12.73% | 19.37% |
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
```python
import nemo.collections.asr as nemo_asr

diar_model = nemo_asr.models.EncDecDiarLabelModel.from_pretrained(model_name="diar_msdd_telephonic")
```
```shell
python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/multiscale_diar_decoder_infer.py \
    diarizer.vad.model_path=<NeMo VAD model path> \
    diarizer.msdd_model.model_path=<NeMo MSDD model path> \
    diarizer.oracle_vad=False \
    diarizer.manifest_filepath=<test_manifest> \
    diarizer.out_dir=<test_temp_dir>
```
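The `diarizer.manifest_filepath` argument points to a JSON-lines manifest describing the input audio. A minimal sketch for building one entry is shown below; field names follow NeMo's speaker diarization manifest convention, the audio path is hypothetical, and null values stand in for fields that are unknown at inference time:

```python
import json

# Hedged sketch: one entry of the JSON-lines diarization manifest.
entry = {
    "audio_filepath": "/data/audio/session1.wav",  # hypothetical path
    "offset": 0,
    "duration": None,      # null -> use the full file
    "label": "infer",
    "text": "-",
    "num_speakers": None,  # null -> estimate the speaker count
    "rttm_filepath": None, # reference RTTM, only needed for scoring
    "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```

One JSON object per line, one line per audio session.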
This model accepts 16 kHz mono-channel audio (.wav files) as input.
This model provides estimated speaker labels and corresponding timestamps in RTTM format.
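The output RTTM follows the standard NIST layout, where each `SPEAKER` line carries a file ID, channel, onset, duration, and speaker label. A small sketch for turning such lines into `(speaker, start, end)` tuples (the session name and times below are made-up examples):

```python
# Parse SPEAKER lines from an RTTM file. Standard NIST RTTM layout:
# SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
def parse_rttm_line(line):
    fields = line.split()
    if not fields or fields[0] != "SPEAKER":
        return None  # skip non-SPEAKER records
    onset, dur = float(fields[3]), float(fields[4])
    return (fields[7], onset, onset + dur)

line = "SPEAKER session1 1 0.25 3.40 <NA> <NA> speaker_0 <NA> <NA>"
print(parse_rttm_line(line))
```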
Since this model was trained only on telephonic speech, diarization performance may degrade in other acoustic conditions.
[1] Multi-scale speaker diarization with dynamic scale weighting
[2] NVIDIA NeMo Toolkit
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.