Sortformer Diarizer is a Transformer encoder-based end-to-end speaker diarization model that generates predicted speaker labels directly from input audio clips.
Sortformer consists of an L-size (18-layer) NeMo Encoder for Speech Tasks (NEST), which is based on the Fast Conformer encoder, followed by an 18-layer Transformer encoder with a hidden size of 192 and two feedforward layers that produce 4 sigmoid outputs for each frame at the top layer [1].
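To illustrate that layer stack, the minimal PyTorch sketch below replaces the NEST/Fast-Conformer front end with a plain linear projection and assumes the attention head count, feedforward width, and head nonlinearity (none of which are stated here); it only shows how frame features flow through an 18-layer, 192-dimensional Transformer encoder into two feedforward layers with 4 sigmoid outputs per frame.

```python
import torch
import torch.nn as nn

class SortformerStackSketch(nn.Module):
    """Illustrative stack only: frame features -> 18-layer Transformer encoder
    (hidden size 192) -> two feedforward layers -> 4 sigmoid outputs per frame.
    The NEST/Fast-Conformer encoder is replaced by a linear projection, and the
    head count, feedforward width, and ReLU nonlinearity are assumptions."""

    def __init__(self, feat_dim=80, d_model=192, n_layers=18, n_heads=8, num_speakers=4):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, d_model)  # stand-in for the NEST encoder
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_speakers),  # second feedforward layer -> 4 outputs per frame
        )

    def forward(self, feats):                  # feats: (batch, T, feat_dim)
        x = self.frontend(feats)               # (batch, T, 192)
        x = self.encoder(x)                    # (batch, T, 192)
        return torch.sigmoid(self.head(x))     # (batch, T, 4) speaker activity probabilities

probs = SortformerStackSketch()(torch.randn(1, 500, 80))
print(probs.shape)  # torch.Size([1, 500, 4])
```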
Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours of simulated audio mixtures generated by the NeMo speech data simulator. All of the datasets listed above use the same RTTM-based labeling format, and a subset of the RTTM files was further processed for speaker diarization training. Data collection methods vary across the individual datasets, which include phone calls, interviews, web videos, and audiobook recordings [2]. The NeMo toolkit [3] was used to train the models on 8 nodes of 8×NVIDIA Tesla V100 GPUs, with 90-second training samples and a batch size of 4.
The Riva Quick Start Guide is recommended as the starting point for trying out Riva models. For more information on using this model with Riva Speech Services, see the Riva User Guide.
This model accepts 16 kHz mono-channel audio (.wav files) as input.
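As a minimal sketch of input preparation (assuming librosa and soundfile are installed; the file names are placeholders), an arbitrary recording can be downmixed and resampled to 16 kHz mono before inference:

```python
import librosa
import soundfile as sf

# Load any audio file, downmixing to mono and resampling to 16 kHz.
audio, sr = librosa.load("meeting.wav", sr=16000, mono=True)

# Write the mono 16 kHz .wav input the diarizer expects.
sf.write("meeting_16k_mono.wav", audio, sr)
```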
The output of the model is a T × S matrix, where T is the total number of frames and S is the maximum number of speakers (S = 4 for this model). Each element represents a speaker activity probability in the [0, 1] range.
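To make the output concrete, the NumPy sketch below converts such a T × S probability matrix into per-speaker time segments; the 0.5 threshold and the 0.08 s per-frame duration are assumptions chosen for illustration, not values taken from this card.

```python
import numpy as np

def probs_to_segments(probs, frame_dur=0.08, threshold=0.5):
    """Convert a (T, S) speaker-activity matrix into (speaker, start, end) segments.
    frame_dur (seconds per output frame) and threshold are assumed values."""
    segments = []
    active = probs > threshold                      # (T, S) boolean activity per frame
    for spk in range(active.shape[1]):
        # Pad with zeros so every activity run has a rising and a falling edge.
        padded = np.concatenate(([0], active[:, spk].astype(int), [0]))
        edges = np.flatnonzero(np.diff(padded))     # alternating start/end frame indices
        for start, end in zip(edges[::2], edges[1::2]):
            segments.append((spk, start * frame_dur, end * frame_dur))
    return sorted(segments, key=lambda s: s[1])

# Example: random probabilities for a 4-speaker model over 1250 frames (~100 s).
demo = np.random.rand(1250, 4)
for spk, start, end in probs_to_segments(demo)[:5]:
    print(f"speaker {spk}: {start:6.2f}s - {end:6.2f}s")
```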
[1] Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
[2] Hugging Face: Sortformer Model
[3] NVIDIA NeMo Toolkit
By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.