The NVIDIA NeMo toolkit supports numerous speech synthesis models that can be used to convert text to audio. NeMo ships with pretrained models that can be downloaded and used to generate speech immediately. For more information, refer to the NeMo TTS documentation.
Trained or fine-tuned NeMo models (with the file extension .nemo) can be imported into Riva and then deployed. In general, NeMo models must be converted to Riva models (with the file extension .riva) before beginning the Riva build phase. The tts_en_tacotron2 model is an exception: its .nemo file can be passed directly into the call to riva-build. For more details, see the Riva documentation on Model Development with NeMo.
You can instantiate many pretrained models automatically directly from NGC. To do so, start your script with:
import soundfile as sf
import nemo
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder
Then choose the type of model you would like to instantiate. See the tables below for the list of models available for each task. For example:
# Download and load the pretrained tacotron2 model
spec_generator = SpectrogramGenerator.from_pretrained("tts_en_tacotron2")
# Download and load the pretrained waveglow model
vocoder = Vocoder.from_pretrained("tts_waveglow_88m")
Generating audio requires three functions: two from SpectrogramGenerator (parse and generate_spectrogram) and one from Vocoder (convert_spectrogram_to_audio).
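Chained together, the three calls form a single pipeline. A minimal wrapper is sketched below; the synthesize function is our own illustration, not a NeMo API, and it accepts any objects exposing the three methods named above (such as the spec_generator and vocoder instances loaded earlier).

```python
def synthesize(spec_generator, vocoder, text):
    """Text-to-speech pipeline: parse -> spectrogram -> audio.

    Illustrative wrapper (not a NeMo API); works with any objects
    that expose the three methods described above.
    """
    # Raw string -> tokenized representation
    tokens = spec_generator.parse(text)
    # Tokens -> mel-spectrogram
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # Spectrogram -> audio waveform
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)
```

The commented example that follows walks through each of these steps individually.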
# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
# Then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
# Save the audio to disk in a file called speech.wav
# The vocoder output has a leading batch dimension, so take the first item
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
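Since sf.write expects floating-point samples in the range [-1.0, 1.0], it can help to peak-normalize the waveform and check its duration before saving. The helper below is a small NumPy sketch of our own; prepare_audio and its default parameters are illustrative, not part of NeMo.

```python
import numpy as np

def prepare_audio(audio, target_peak=0.95, sample_rate=22050):
    """Peak-normalize a 1-D waveform and compute its duration in seconds.

    Illustrative helper (not a NeMo API). `audio` is a 1-D float array,
    e.g. the vocoder output converted to NumPy as shown above.
    """
    audio = np.asarray(audio, dtype=np.float32)
    peak = float(np.abs(audio).max())
    if peak > 0:
        # Scale so the loudest sample sits at target_peak
        audio = audio * (target_peak / peak)
    duration_s = audio.shape[0] / sample_rate  # samples / samples-per-second
    return audio, duration_s
```

The normalized array can then be passed to sf.write in place of the raw vocoder output.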
To use the end-to-end models for speech synthesis, see the documentation or follow the TTS inference tutorial notebook. All tutorial links can be found here
Refer to https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/checkpoints.html#mel-spectrogram-generators for the models that generate mel-spectrograms from text.
Refer to https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/checkpoints.html#vocoders for the models that generate audio from mel-spectrograms.
Models that generate audio directly from text:
Model Name | Model Card |
---|---|
tts_en_e2e_fastpitchhifigan | NGC Model Card |
tts_en_e2e_fastspeech2hifigan | NGC Model Card |