The NVIDIA NeMo toolkit supports text-to-speech (TTS), also referred to as speech synthesis, via a two-step procedure. First, a model generates a mel spectrogram from text. Second, another model generates audio from that mel spectrogram. This collection includes the mel spectrogram generators Tacotron 2 and Glow-TTS and, as an audio generator (vocoder), WaveGlow. Using the scripts in the TTS directory, you can train any of these models on domain-specific data. Note: Transfer learning is currently a research area in TTS.
You can instantiate all of these models directly from NGC. To do so, start your script with:
import nemo
import nemo.collections.tts as nemo_tts
Then choose the type of model you would like to instantiate. See the table below for the list of model base classes. Then use the base_class.from_pretrained(...) method.
For example:
# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.Tacotron2Model.from_pretrained(model_name="Tacotron2-22050Hz")
# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.WaveGlowModel.from_pretrained(model_name="WaveGlow-22050Hz")
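The two-step procedure described above can be sketched as a small helper. This is a minimal sketch, not the collection's official inference script: it assumes the spectrogram generator exposes `parse(...)` and `generate_spectrogram(tokens=...)` and the vocoder exposes `convert_spectrogram_to_audio(spec=...)`, as the NeMo TTS base classes did at the time of writing. The helper name `synthesize` is hypothetical.

```python
def synthesize(text, spectrogram_generator, vocoder):
    """Run the two-step TTS pipeline: text -> mel spectrogram -> audio.

    Both model arguments are duck-typed; the method names used here are
    assumed to match the NeMo TTS model interfaces.
    """
    # Convert the raw text into the token representation the generator expects.
    tokens = spectrogram_generator.parse(text)
    # Step 1: generate a mel spectrogram from the tokens.
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
    # Step 2: generate audio from the mel spectrogram.
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio
```

With the two models instantiated as shown above, calling `synthesize("Hello world.", spectrogram_generator, vocoder)` should return the generated waveform, which for these checkpoints is sampled at 22050 Hz.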
Note that you can also list all available models through the API by calling the base_class.list_available_models(...) method.
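As an illustration, the listing call can be wrapped to collect model names per base class. This is a sketch under assumptions: it presumes each entry returned by `list_available_models()` carries a `pretrained_model_name` attribute, and the helper name `available_model_names` is hypothetical.

```python
def available_model_names(base_classes):
    """Map each base class name to the pretrained model names it advertises."""
    names = {}
    for base_class in base_classes:
        # Some classes may return None when nothing is published; treat as empty.
        infos = base_class.list_available_models() or []
        # Assumption: each entry exposes a `pretrained_model_name` attribute.
        names[base_class.__name__] = [info.pretrained_model_name for info in infos]
    return names
```

For example, `available_model_names([nemo_tts.models.Tacotron2Model, nemo_tts.models.WaveGlowModel])` would return a dictionary keyed by class name, whose values can be passed as `model_name` to `from_pretrained(...)`.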
You can also download all models' ".nemo" files from the "File Browser" tab and then instantiate those models with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure that your NeMo version matches the version of the models.
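When restoring from a downloaded file, a small guard around the call can catch a wrong path before NeMo tries to load it. The sketch below is illustrative only: the helper name `restore_local_model` and its checks are assumptions, and it relies solely on `restore_from` accepting the checkpoint path as its first argument.

```python
from pathlib import Path


def restore_local_model(base_class, checkpoint_path):
    """Validate a local ".nemo" path, then restore the model from it."""
    path = Path(checkpoint_path)
    # Downloaded checkpoints are expected to carry the ".nemo" extension.
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")
    # Delegate the actual loading to the model base class.
    return base_class.restore_from(str(path))
```

This does not verify version compatibility; comparing the installed NeMo version against the version listed for the model remains a manual step.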
Here is a list of currently available models together with their base classes and short descriptions.
| Model name | Model Base Class | Description |
|---|---|---|
| Tacotron2-22050Hz | Tacotron2Model | This model is trained on LJSpeech sampled at 22050Hz, and can be used to generate female English voices with an American accent. |
| WaveGlow-22050Hz | WaveGlowModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used as a universal vocoder. |
| SqueezeWave-22050Hz | SqueezeWaveModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used as a universal vocoder. |
| GlowTTS-22050Hz | GlowTTSModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used to generate female English voices with an American accent. |