NeMo Speech Synthesis models

Description

NeMo Speech Synthesis (Text to Speech, or TTS) models comprise spectrogram generators, which produce a spectrogram from text, and vocoders, which produce audio from a spectrogram.

Publisher

NVIDIA

Latest Version

1.0.0a5

Modified

April 4, 2023

Size

1.23 GB

Overview

The NVIDIA NeMo toolkit supports Text To Speech (TTS), also referred to as Speech Synthesis, via a two-step procedure. First, a model generates a mel spectrogram from text. Second, a model generates audio from the mel spectrogram. This collection includes the mel spectrogram generators Tacotron 2 and Glow-TTS. Among the audio generators (vocoders), WaveGlow is included. The scripts in the TTS directory can be used to train any of these models on domain-specific data. Note: transfer learning is currently a research area in TTS.

Usage

You can instantiate any of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.tts as nemo_tts

Then choose the type of model you would like to instantiate (see the table below for the list of model base classes) and call the base_class.from_pretrained(...) method. For example:

# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.Tacotron2Model.from_pretrained(model_name="Tacotron2-22050Hz")
# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.WaveGlowModel.from_pretrained(model_name="WaveGlow-22050Hz")
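The two-step procedure can be chained as in the sketch below. It is a minimal illustration, not the toolkit's documented recipe: `synthesize` is a hypothetical helper, and the method names `parse`, `generate_spectrogram` and `convert_spectrogram_to_audio` are assumptions that do not appear on this page and may differ between NeMo versions.

```python
def synthesize(text, spectrogram_generator, vocoder):
    """Run the two-step TTS pipeline: text -> spectrogram -> audio.

    ASSUMPTION: the method names used below (parse, generate_spectrogram,
    convert_spectrogram_to_audio) are not stated on this page; verify them
    against the NeMo version you have installed.
    """
    # Step 1a: convert the raw string into the tokens the generator expects.
    tokens = spectrogram_generator.parse(text)
    # Step 1b: text tokens -> mel spectrogram.
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
    # Step 2: mel spectrogram -> audio waveform.
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)
```

With the two objects created above, a call would look like `audio = synthesize("Hello world.", spectrogram_generator, vocoder)`. Passing the models in as arguments keeps the helper independent of which generator and vocoder are paired.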

You can also list all available models through the API by calling the base_class.list_available_models(...) method.
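As a rough sketch of how that listing might be turned into plain model names: the helper `available_model_names` is hypothetical, and the `pretrained_model_name` attribute on each returned entry is an assumption about the NeMo API. The code falls back to the entry's string form if the attribute is absent.

```python
def available_model_names(base_class):
    """Collect model names reported by base_class.list_available_models().

    ASSUMPTION: each entry has a `pretrained_model_name` attribute. If it
    does not, the entry's string representation is used instead.
    """
    entries = base_class.list_available_models() or []
    return [getattr(entry, "pretrained_model_name", str(entry)) for entry in entries]
```

For example, `available_model_names(nemo_tts.models.Tacotron2Model)` would return the names that can be passed to from_pretrained(...) for that base class.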

Alternatively, you can download the models' ".nemo" files from the "File Browser" tab and instantiate them with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure your NeMo version matches the models' version.
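A small wrapper can catch a wrong path before the restore call is made. The sketch below is illustrative only: `restore_local_model` is a hypothetical helper, and it does nothing beyond checking the file name and existence before delegating to restore_from.

```python
from pathlib import Path


def restore_local_model(base_class, nemo_path):
    """Validate a downloaded .nemo path, then call base_class.restore_from.

    Hypothetical helper: it does not check that the NeMo and model
    versions match; that still has to be verified separately.
    """
    path = Path(nemo_path)
    if path.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    return base_class.restore_from(str(path))
```

A call might look like `restore_local_model(nemo_tts.models.WaveGlowModel, PATH_TO_DOTNEMO_FILE)`.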

Here is a list of currently available models together with their base classes and short descriptions.

| Model name | Model Base Class | Description |
| --- | --- | --- |
| Tacotron2-22050Hz | Tacotron2Model | This model is trained on LJSpeech sampled at 22050Hz, and can be used to generate female English voices with an American accent. |
| WaveGlow-22050Hz | WaveGlowModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used as a universal vocoder. |
| SqueezeWave-22050Hz | SqueezeWaveModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used as a universal vocoder. |
| GlowTTS-22050Hz | GlowTTSModel | This model is trained on LJSpeech sampled at 22050Hz, and can be used to generate female English voices with an American accent. |
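
The listing above can also be kept as a small lookup structure in a script, for example to enumerate which spectrogram generator and vocoder names might be paired. The sketch below is an assumption-laden illustration: the "role" labels and the helpers `sample_rate_hz` and `compatible_pairs` are hypothetical, and pairing by sampling rate is this sketch's own heuristic, not a guarantee made by this page.

```python
import re

# Rows copied from the model listing: name -> (base class name, role).
# The role labels are this sketch's own shorthand, not NeMo terminology.
AVAILABLE_MODELS = {
    "Tacotron2-22050Hz": ("Tacotron2Model", "spectrogram_generator"),
    "GlowTTS-22050Hz": ("GlowTTSModel", "spectrogram_generator"),
    "WaveGlow-22050Hz": ("WaveGlowModel", "vocoder"),
    "SqueezeWave-22050Hz": ("SqueezeWaveModel", "vocoder"),
}


def sample_rate_hz(model_name):
    """Parse the sampling rate from a name such as 'WaveGlow-22050Hz'."""
    match = re.search(r"-(\d+)Hz$", model_name)
    if match is None:
        raise ValueError(f"no sampling rate suffix in: {model_name}")
    return int(match.group(1))


def compatible_pairs(models=AVAILABLE_MODELS):
    """List (generator, vocoder) name pairs that share a sampling rate.

    ASSUMPTION: an equal sampling rate is used as the pairing criterion.
    """
    generators = [n for n, (_, role) in models.items() if role == "spectrogram_generator"]
    vocoders = [n for n, (_, role) in models.items() if role == "vocoder"]
    return [
        (g, v)
        for g in generators
        for v in vocoders
        if sample_rate_hz(g) == sample_rate_hz(v)
    ]
```

Each name in a pair can then be passed to the from_pretrained(...) method of the base class listed next to it.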