Riva TTS Audio codec

Description
Riva AudioCodec model with a 21 Hz frame rate, trained on 22 kHz data.
Publisher
NVIDIA
Latest Version
deployable_2.0
Modified
March 17, 2025
Size
400.13 MB

Speech Synthesis: English-US T5TTS Model Overview

Description:

The NeMo Audio Codec model is a non-autoregressive convolutional encoder-quantizer-decoder model for coding or tokenizing raw audio signals or mel-spectrogram features. It supports both residual vector quantization (RVQ) and finite scalar quantization (FSQ) for quantizing the encoder output. The model is trained end-to-end with generative, discriminative, and reconstruction losses, similar to other neural audio codecs such as SoundStream and EnCodec.
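
To make the residual vector quantization idea concrete, the following is a minimal pure-Python sketch of RVQ on a single encoder output vector. It is an illustration only, not the NeMo or Riva implementation: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the two toy codebooks (`COARSE`, `FINE`) are hypothetical, and a real codec learns its codebooks during training and operates on batches of frames.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], v: Vector) -> List[int]:
    """Quantize v stage by stage; each stage encodes the previous stage's residual."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Sequence[Vector]], codes: Sequence[int]) -> Vector:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse grid followed by a finer one.
COARSE = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]

if __name__ == "__main__":
    codes = rvq_encode([COARSE, FINE], [1.2, 0.3])
    print(codes)                              # [1, 3]
    print(rvq_decode([COARSE, FINE], codes))  # [1.25, 0.25]
```

Each stage contributes one integer code per frame, so adding stages refines the reconstruction at the cost of more codes. FSQ, by contrast, rounds each dimension of the encoder output to a small fixed set of levels and needs no learned codebook.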

Model Architecture:

Architecture Type: Convolutional encoder-quantizer-decoder model

Input:

Audio codes

Output:

Audio of shape (batch x time) in wav format

Software Integration:

Runtime Engine(s): Riva 2.18.0 or greater

Supported Hardware Platform(s):

  • NVIDIA Volta V100
  • NVIDIA Turing T4
  • NVIDIA A100 GPU
  • NVIDIA A30 GPU
  • NVIDIA A10 GPU
  • NVIDIA H100 GPU
  • NVIDIA L4 GPU
  • NVIDIA L40 GPU

Supported Operating System(s):

  • Linux