Sortformer Speaker Diarization

Description
Sortformer based speaker diarization model
Publisher
NVIDIA
Latest Version
v1
Modified
March 19, 2025
Size
453.97 MB

Model Overview

Sortformer Diarizer is a Transformer encoder-based end-to-end speaker diarization model that generates predicted speaker labels directly from input audio clips.

Model Architecture

Sortformer consists of an L-size (18-layer) NeMo Encoder for Speech Tasks (NEST), which is based on the Fast Conformer encoder. It is followed by an 18-layer Transformer encoder with a hidden size of 192, topped by two feedforward layers that produce 4 sigmoid outputs for each input frame [1].
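To make the shapes concrete, the sketch below mimics the output head described above: per-frame Transformer encoder outputs of hidden size 192 pass through two feedforward layers and a sigmoid, yielding one activity probability per frame per speaker. The random weights and the ReLU between the layers are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

T, HIDDEN, MAX_SPK = 100, 192, 4  # frames, Transformer hidden size, max speakers

# Illustrative stand-ins for the two feedforward layers at the top of the
# model; real weights are learned, these are random.
W1 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.05
W2 = rng.standard_normal((HIDDEN, MAX_SPK)) * 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

frames = rng.standard_normal((T, HIDDEN))  # stand-in for encoder output
hidden = np.maximum(frames @ W1, 0.0)      # first feedforward layer (ReLU assumed)
probs = sigmoid(hidden @ W2)               # per-frame, per-speaker probabilities

print(probs.shape)  # (100, 4): one probability per frame per speaker
```

The key point is the output shape: T frames by 4 speakers, every entry squashed into [0, 1] by the sigmoid.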

Training

Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours of simulated audio mixtures generated by the NeMo speech data simulator. All of the datasets use the same labeling method via the RTTM format, and a subset of the RTTM files was processed specifically for speaker diarization training. Data collection methods vary across the individual datasets, which include phone calls, interviews, web videos, and audiobook recordings [2]. The NeMo toolkit [3] was used to train the model on 8 nodes of 8×NVIDIA Tesla V100 GPUs, with 90-second training samples and a batch size of 4.

How to Use this Model

The Riva Quick Start Guide is recommended as the starting point for trying out Riva models. For more information on using this model with Riva Speech Services, see the Riva User Guide.

Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
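A quick way to confirm a file matches this format before feeding it to the model is to inspect the WAV header. The helper below is a hypothetical convenience, written with Python's standard `wave` module; it is not part of the NeMo or Riva APIs.

```python
import io
import wave

def check_input_wav(data: bytes) -> None:
    """Raise ValueError unless `data` is a 16 kHz mono WAV, the input
    format this model expects. Hypothetical helper for illustration."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wf.getframerate()} Hz")
        if wf.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wf.getnchannels()} channels")

# Build a one-second silent 16 kHz mono 16-bit WAV in memory to exercise it.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

check_input_wav(buf.getvalue())  # passes silently for a valid file
```

Audio at other sample rates or with multiple channels should be resampled and downmixed before inference.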

Output

The output of the model is a T × S matrix, where T is the total number of frames and S is the maximum number of speakers (in this model, S = 4). Each element is a speaker-activity probability in the [0, 1] range.
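One common way to turn this matrix into human-readable diarization output is to threshold each speaker's probability track and merge consecutive active frames into segments. The sketch below illustrates that post-processing on a toy matrix; the 0.08 s frame duration and the 0.5 threshold are assumptions for illustration, not values prescribed by the model card, and this is not the toolkit's built-in decoding.

```python
import numpy as np

FRAME_SEC = 0.08  # assumed frame duration; the actual hop depends on the encoder

def probs_to_segments(probs, threshold=0.5, frame_sec=FRAME_SEC):
    """Convert a T x S speaker-activity matrix into (speaker, start_sec,
    end_sec) segments by thresholding each speaker's probability track."""
    segments = []
    T, S = probs.shape
    for spk in range(S):
        active = probs[:, spk] >= threshold
        start = None
        for t in range(T):
            if active[t] and start is None:
                start = t                      # segment opens
            elif not active[t] and start is not None:
                segments.append((spk, start * frame_sec, t * frame_sec))
                start = None                   # segment closes
        if start is not None:                  # segment runs to the last frame
            segments.append((spk, start * frame_sec, T * frame_sec))
    return segments

# Toy example: speaker 0 active for frames 0-1, speaker 1 for frames 2-3.
toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.95]])
print(probs_to_segments(toy))
```

Because the four sigmoid outputs are independent, overlapping speech simply shows up as two speakers exceeding the threshold in the same frame.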

References

[1] Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

[2] Hugging Face: Sortformer Model

[3] NVIDIA NeMo Toolkit

License

By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.