
NeMo - Text to Speech



This collection contains NeMo models for Text to Speech (TTS).




March 18, 2022


The NVIDIA NeMo toolkit supports numerous speech synthesis models that convert text to audio. NeMo comes with pretrained models that can be downloaded and used immediately to generate speech. For more information, refer to the NeMo TTS documentation.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be imported into Riva and then deployed. In general, NeMo models must be converted to Riva models (with the file extension .riva) before the Riva Build phase begins. The tts_en_tacotron2 model is an exception: its .nemo file can be passed directly to riva-build. For more details, see the Riva documentation on Model Development with NeMo.


You can instantiate many pretrained models directly from NGC. To do so, start your script with:

import soundfile as sf
import nemo
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

Then choose the type of model you would like to instantiate. See the tables below for the models available for each task. For example:

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")

# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

Generating audio takes three functions: two from SpectrogramGenerator and one from Vocoder.

  1. SpectrogramGenerator.parse(): Accepts a raw Python string and returns a torch.Tensor that represents the tokenized text
  2. SpectrogramGenerator.generate_spectrogram(): Accepts a batch of tokenized text and returns a torch.Tensor that represents a batch of spectrograms
  3. Vocoder.convert_spectrogram_to_audio(): Accepts a batch of spectrograms and returns a torch.Tensor that represents a batch of raw audio

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# Then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

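# Save the audio to disk in a file called speech.wav.
# The vocoder returns a batch, so take the first item; 22050 is the sample rate in Hz.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)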
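
The three steps above can be wrapped in a single helper. The following is a minimal sketch, assuming spec_generator and vocoder are objects loaded with from_pretrained as shown earlier. The name synthesize is illustrative and not part of NeMo; the helper relies only on the three method names listed above, so it should work with any compatible spectrogram generator and vocoder pair.

```python
def synthesize(text, spec_generator, vocoder):
    """Run the three-step pipeline: parse -> spectrogram -> audio.

    `synthesize` is a hypothetical helper, not a NeMo API. It only assumes
    that the two objects expose the methods described in this collection.
    """
    # Step 1: raw string -> tokenized text
    tokens = spec_generator.parse(text)
    # Step 2: tokenized text -> batch of spectrograms
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # Step 3: batch of spectrograms -> batch of raw audio
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio
```

With real NeMo models the return value is a batched tensor, so it still needs to be moved to the CPU, detached, and indexed (for example audio.to('cpu').detach().numpy()[0]) before it is written with sf.write.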

To use the end-to-end models for speech synthesis, see the documentation or follow the TTS inference tutorial notebook. All tutorial links can be found here.

Speech Synthesis Models

Spectrogram Generators

Models that generate a mel spectrogram from text.

Model Name Model Card
tts_en_glowtts NGC Model Card
tts_en_tacotron2 NGC Model Card
tts_en_fastspeech2 NGC Model Card
tts_en_fastpitch NGC Model Card
tts_en_talknet NGC Model Card
tts_en_lj_mixertts NGC Model Card
tts_en_lj_mixerttsx NGC Model Card


Vocoders

Models that generate audio from a mel spectrogram.

Model Name Model Card
tts_hifigan NGC Model Card
tts_melgan NGC Model Card
tts_squeezewave NGC Model Card
tts_uniglow NGC Model Card
tts_waveglow_268m NGC Model Card
tts_waveglow_88m NGC Model Card
tts_en_lj_univnet NGC Model Card
tts_en_libritts_univnet NGC Model Card


End-to-End Models

Models that generate audio directly from text.

Model Name Model Card
tts_en_e2e_fastpitchhifigan NGC Model Card
tts_en_e2e_fastspeech2hifigan NGC Model Card
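
Because spectrogram generators and vocoders are loaded through different classes, it can help to check which table a model name belongs to before requesting a download. The sketch below copies the names from the three tables above into plain Python sets. The set names and the model_kind function are hypothetical helpers for illustration only and are not part of NeMo.

```python
# Model names copied from the three tables in this collection.
SPECTROGRAM_GENERATORS = {
    "tts_en_glowtts", "tts_en_tacotron2", "tts_en_fastspeech2",
    "tts_en_fastpitch", "tts_en_talknet",
    "tts_en_lj_mixertts", "tts_en_lj_mixerttsx",
}
VOCODERS = {
    "tts_hifigan", "tts_melgan", "tts_squeezewave", "tts_uniglow",
    "tts_waveglow_268m", "tts_waveglow_88m",
    "tts_en_lj_univnet", "tts_en_libritts_univnet",
}
END_TO_END = {
    "tts_en_e2e_fastpitchhifigan", "tts_en_e2e_fastspeech2hifigan",
}


def model_kind(name):
    """Return the table a model name appears in (hypothetical helper)."""
    if name in SPECTROGRAM_GENERATORS:
        return "spectrogram_generator"
    if name in VOCODERS:
        return "vocoder"
    if name in END_TO_END:
        return "end_to_end"
    raise ValueError(f"{name!r} is not listed in this collection")
```

A script could call model_kind on a user-supplied name and then pick SpectrogramGenerator.from_pretrained or Vocoder.from_pretrained accordingly, which surfaces a misspelled name before any download is attempted.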