RIVA Diarizer Embedding Extractor

Description: Embedding Extractor model used in Riva Speaker Diarization
Publisher: NVIDIA
Latest Version: deployable_v1.0
Modified: October 6, 2023
Size: 35.74 MB

Speaker Diarization: TitaNet Model Card

Model Overview

TitaNet is a novel neural network architecture for extracting speaker representations.

Model Architecture

TitaNet employs 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer that maps variable-length utterances to a fixed-length embedding (t-vector) [1].
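To illustrate how attention-based statistics pooling turns a variable number of frames into a fixed-length vector, here is a minimal pure-Python sketch. It is not the TitaNet implementation: the function name `attentive_stats_pooling` is hypothetical, and the attention scores, which a trained network would produce, are simply passed in as an argument. Any later layers that produce the final embedding are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(features, scores):
    """Pool per-channel frame features into a fixed-length vector.

    features: list of C channels, each a list of T frame values.
    scores:   same shape; hypothetical attention scores (assumed given here,
              learned in a real model).
    Returns 2 * C values: the weighted means followed by the weighted
    standard deviations, regardless of T.
    """
    means, stds = [], []
    for chan_feats, chan_scores in zip(features, scores):
        weights = softmax(chan_scores)  # attention weights over time
        mean = sum(w * x for w, x in zip(weights, chan_feats))
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, chan_feats))
        means.append(mean)
        stds.append(math.sqrt(max(var, 1e-12)))  # clamp for stability
    return means + stds


# Two utterances of different lengths (3 and 7 frames), 2 channels each.
short = attentive_stats_pooling([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]],
                                [[0.0] * 3, [0.0] * 3])
long_ = attentive_stats_pooling([[1.0] * 7, [2.0] * 7],
                                [[0.5] * 7, [0.1] * 7])
print(len(short), len(long_))  # both outputs have 2 * C = 4 values
```

Because the statistics are computed over the time axis, the output length depends only on the number of channels, not on the utterance length.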

Training

This model was trained on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit [2] was used to train the model for a few hundred epochs on multiple GPUs.

How to Use this Model

To use this model, start with the Riva Skills Quick Start guide, which is the starting point for trying out Riva models. Information about the Quick Start guide can be found here. To use the Riva Speech ASR service with this model, refer to the documentation, which has the necessary information.

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
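Before sending audio to the model, it can help to confirm that a file matches this format. The sketch below uses Python's standard `wave` module and assumes the intended sample rate is 16,000 samples per second; the helper names `describe_wav` and `is_valid_input` are hypothetical and not part of Riva.

```python
import wave

EXPECTED_RATE_HZ = 16000  # assumed sample rate: 16,000 samples per second
EXPECTED_CHANNELS = 1     # mono


def describe_wav(path):
    """Return (sample_rate_hz, channel_count) read from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_valid_input(path):
    """True if the WAV file is mono and sampled at the expected rate."""
    rate, channels = describe_wav(path)
    return rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
```

Files that fail the check would need to be resampled or downmixed with an audio tool before use.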

Output

For a given audio sample, this model outputs a speaker embedding of size 192.
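One common way to compare two speaker embeddings is cosine similarity, where values closer to 1 suggest the same speaker. This page does not state how Riva compares embeddings internally, so the following is only an illustrative sketch; `cosine_similarity` is a hypothetical helper and the embeddings are assumed to be plain lists of 192 floats.

```python
import math

EMBEDDING_SIZE = 192  # embedding size stated for this model


def cosine_similarity(a, b):
    """Cosine similarity between two 192-dimensional embeddings."""
    if len(a) != EMBEDDING_SIZE or len(b) != EMBEDDING_SIZE:
        raise ValueError("expected embeddings of size %d" % EMBEDDING_SIZE)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Any decision threshold on the similarity score would have to be tuned on real data; none is given here.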

References

[1] TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
[2] NVIDIA NeMo Toolkit

License

By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.