NeMo - Text to Speech

Description

This collection contains NeMo models for Text to Speech (TTS).

Curator

NVIDIA

Modified

March 14, 2025

Overview

The NVIDIA NeMo toolkit supports numerous speech synthesis models that convert text to audio. NeMo comes with pretrained models that can be downloaded and used immediately to generate speech. For more information, refer to the NeMo TTS documentation.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be imported into Riva and then deployed. In general, one must convert NeMo models to Riva models (with the file extension .riva) before beginning the Riva Build phase. The tts_en_tacotron2 model is an exception: one can pass the associated .nemo file directly into the call to riva-build. For more details, see the Riva documentation on Model Development with NeMo.

Usage

You can instantiate many pretrained models directly from NGC. To do so, start your script with:

import soundfile as sf
import nemo
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

Then choose the type of model you would like to instantiate. See the Speech Synthesis Models section below for the models that are available for each task. For example:

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")

# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

To generate audio, we need three functions: two from SpectrogramGenerator and one from Vocoder.

SpectrogramGenerator.parse(): Accepts a raw Python string and returns a torch.Tensor that represents the tokenized text
SpectrogramGenerator.generate_spectrogram(): Accepts a batch of tokenized text and returns a torch.Tensor that represents a batch of spectrograms
Vocoder.convert_spectrogram_to_audio(): Accepts a batch of spectrograms and returns a torch.Tensor that represents a batch of raw audio

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# Then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

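# Save the audio to disk in a file called speech.wav
# The vocoder returns a batch, so take the first item; detach() drops any autograd history
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)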
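The three calls described above can be wrapped in a single helper. The following is only a sketch: the name synthesize is hypothetical and not part of NeMo, and the function assumes nothing about its arguments beyond the three methods listed earlier (parse, generate_spectrogram, convert_spectrogram_to_audio), so it never imports NeMo itself.

```python
from typing import Any


def synthesize(text: str, spec_generator: Any, vocoder: Any) -> Any:
    """Run the parse -> spectrogram -> audio chain for one sentence.

    `synthesize` is a hypothetical helper, not a NeMo API. The two model
    arguments are assumed to expose the three methods described above.
    """
    if not text.strip():
        raise ValueError("text must contain at least one non-whitespace character")
    # Step 1: raw string -> tokenized text
    tokens = spec_generator.parse(text)
    # Step 2: tokenized text -> batch of spectrograms
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # Step 3: batch of spectrograms -> batch of raw audio
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)
```

With the spec_generator and vocoder objects loaded earlier, synthesize("Hello there.", spec_generator, vocoder) should return the same batch of audio as the step-by-step version, ready to be written to disk with sf.write.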

To use the end-to-end models for speech synthesis, see the documentation or follow the TTS inference tutorial notebook. All tutorial links can be found here.

Speech Synthesis Models

Spectrogram Generators

Please refer to https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/checkpoints.html#mel-spectrogram-generators for the models that generate Mel-Spectrograms from text.

Vocoders

Please refer to https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/checkpoints.html#vocoders for the models that generate audio from Mel-Spectrograms.

End-to-end

Models that generate audio directly from text.

Model Name	Model Card
tts_en_e2e_fastpitchhifigan	NGC Model Card
tts_en_e2e_fastspeech2hifigan	NGC Model Card