TitaNet is a novel neural network architecture for extracting speaker representations.
TitaNet employs 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector) [1].
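The pooling step described above can be illustrated with a minimal NumPy sketch. This is a simplified approximation, not the TitaNet implementation: the function name `attentive_stats_pooling` and the single linear attention layer (`w`, `b`) are assumptions made for illustration, and the final projection to the fixed-size embedding is omitted. It only shows how per-channel attention weights over time turn a variable number of frames into a fixed-length vector of weighted means and standard deviations.

```python
import numpy as np


def attentive_stats_pooling(features, w, b):
    """Pool (channels, frames) features into a fixed (2 * channels,) vector.

    features: array of shape (C, T), where T varies per utterance.
    w, b:     hypothetical attention parameters of shape (C, C) and (C,).
    """
    # Per-channel attention scores for every frame (illustrative single layer).
    scores = np.tanh(w @ features + b[:, None])           # (C, T)

    # Softmax over the time axis, computed in a numerically stable way.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)          # each row sums to 1

    # Attention-weighted mean and standard deviation per channel.
    mean = (weights * features).sum(axis=1)                # (C,)
    var = (weights * features ** 2).sum(axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))                # (C,)

    return np.concatenate([mean, std])                     # (2C,)
```

Because the weighted sums run over the time axis, inputs with 50 frames and 400 frames both yield a vector of the same length.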
The model was trained on a composite dataset comprising several thousand hours of speech, compiled from various publicly available sources. The NeMo toolkit [2] was used to train it for a few hundred epochs on multiple GPUs.
To use this model, start with the Riva Skills Quick Start guide, which is the starting point for trying out Riva models. Information about the Quick Start guide can be found here. For using the Riva Speech ASR service with this model, the documentation has the necessary information.
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
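As a sketch of how an input file could be checked before it is sent to the model, the snippet below uses Python's standard `wave` module. It assumes the intended sample rate is 16000 Hz (16 kHz) and a single channel; the helper name `check_wav` is hypothetical and not part of Riva or NeMo.

```python
import wave

EXPECTED_RATE_HZ = 16000   # assumed: 16 kHz sample rate
EXPECTED_CHANNELS = 1      # assumed: mono audio


def check_wav(path):
    """Return a list of human-readable problems; empty if the file looks fine."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected mono")
    return problems
```

A file that fails either check would need to be resampled or downmixed with an audio tool of your choice before use.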
For a given audio sample, this model outputs a speaker embedding of size 192.
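One common way to compare two speaker embeddings is cosine similarity. The sketch below assumes two 192-dimensional vectors have already been obtained from the model; the function name `cosine_similarity` is illustrative, and any decision threshold applied to the score would have to be tuned on your own data.

```python
import numpy as np

EMBEDDING_DIM = 192  # embedding size stated for this model


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.shape != (EMBEDDING_DIM,) or b.shape != (EMBEDDING_DIM,):
        raise ValueError(f"expected two vectors of shape ({EMBEDDING_DIM},)")

    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        raise ValueError("embeddings must be non-zero")
    return float(a @ b / norm_product)
```

Higher scores suggest the two samples are more likely to come from the same speaker.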
[1] TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
[2] NVIDIA NeMo Toolkit
By downloading and using the models and resources packaged with Riva Conversational AI, you accept the terms of the Riva license.