
NeMo - Text to Speech

This collection contains NeMo models for Text to Speech (TTS)
April 4, 2023


The NVIDIA NeMo toolkit supports numerous speech synthesis models that convert text to audio. NeMo includes pretrained models that can be downloaded immediately and used to generate speech. For more information, refer to the NeMo TTS documentation.

Trained or fine-tuned NeMo models (with the file extension .nemo) can be imported into Riva and then deployed. In general, NeMo models must be converted to Riva models (with the file extension .riva) before the Riva Build phase begins. The tts_en_tacotron2 model is an exception: its .nemo file can be passed directly to riva-build. For more details, see the Riva documentation on Model Development with NeMo.
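As a sketch of the conversion step, NVIDIA distributes a nemo2riva tool for this purpose; the model filenames below are hypothetical placeholders, and the exact flags may vary between Riva releases:

```shell
# Install the converter, then produce a .riva artifact from a .nemo checkpoint.
# (Filenames are illustrative; substitute your own model files.)
pip install nemo2riva
nemo2riva --out tts_model.riva tts_model.nemo
```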


You can instantiate many pretrained models directly from NGC. To do so, start your script with:

import soundfile as sf
import nemo
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder

Then choose the type of model you would like to instantiate. See the table below for the list of models available for each task. For example:

# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")

# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")

Generating audio requires three functions: two from SpectrogramGenerator and one from Vocoder.

  1. SpectrogramGenerator.parse(): accepts a raw Python string and returns a torch.tensor representing the tokenized text
  2. SpectrogramGenerator.generate_spectrogram(): accepts a batch of tokenized text and returns a torch.tensor representing a batch of spectrograms
  3. Vocoder.convert_spectrogram_to_audio(): accepts a batch of spectrograms and returns a torch.tensor representing a batch of raw audio
# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# Then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
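After writing the file, a quick header check with Python's standard wave module confirms the format. The snippet below is self-contained: it writes a tiny placeholder WAV (silence) so it runs without NeMo installed; with the pipeline above, you would instead open the speech.wav it produced.

```python
import struct
import wave

# Write a tiny placeholder WAV: mono, 16-bit PCM, 22.05 kHz, 100 silent frames.
# (Stands in for the speech.wav produced by the pipeline above.)
with wave.open("speech.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(22050)   # Tacotron 2 / WaveGlow operate at 22.05 kHz
    wf.writeframes(struct.pack("<h", 0) * 100)

# Read the header back and verify the sample rate and frame count
with wave.open("speech.wav", "rb") as wf:
    rate = wf.getframerate()
    frames = wf.getnframes()

print(rate, frames)  # → 22050 100
```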

To use the end-to-end models for speech synthesis, see the documentation or follow the TTS inference tutorial notebook. All tutorial links can be found in the NeMo documentation.

Speech Synthesis Models

Spectrogram Generators

Models that generate Mel-Spectrograms from text.


Vocoders

Models that generate audio from Mel-Spectrograms.


End-to-End Models

Models that generate audio directly from text.

Model Name                    | Model Card
tts_en_e2e_fastpitchhifigan   | NGC Model Card
tts_en_e2e_fastspeech2hifigan | NGC Model Card
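As a hedged sketch of how these end-to-end models are used, assuming the TextToWaveform base class and its convert_text_to_waveform() method from the NeMo TTS collection (the exact return shape may vary across NeMo versions):

```python
import soundfile as sf
from nemo.collections.tts.models.base import TextToWaveform

# End-to-end models go straight from text to audio, with no separate vocoder.
model = TextToWaveform.from_pretrained("tts_en_e2e_fastpitchhifigan")

# Tokenize the input text, then synthesize a waveform from the tokens
parsed = model.parse("You can type your sentence here to get nemo to produce speech.")
audio = model.convert_text_to_waveform(tokens=parsed)

# Save the first audio in the batch to disk (22.05 kHz assumed)
sf.write("speech.wav", audio[0].to('cpu').detach().numpy(), 22050)
```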