UnivNet is a generative adversarial network (GAN) model that generates audio from mel spectrograms. The generator uses transposed convolutions to upsample the mel spectrograms to audio.
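To make the upsampling step concrete, the sketch below computes how a stack of 1-D transposed convolutions stretches a sequence of mel frames into audio samples. The length formula is the standard one for transposed convolutions (as documented for PyTorch's `ConvTranspose1d`); the layer settings in `ASSUMED_LAYERS` (strides 8, 8, 4) are an illustrative assumption and are not taken from this model card.

```python
def conv_transpose1d_out_len(in_len, kernel_size, stride, padding=0,
                             output_padding=0, dilation=1):
    """Output length of a 1-D transposed convolution."""
    return ((in_len - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# ASSUMPTION: hypothetical upsampling stack, chosen only for illustration.
# With kernel_size = 2 * stride and padding = stride // 2 (even strides),
# each layer multiplies the length by exactly its stride.
ASSUMED_LAYERS = [
    {"kernel_size": 16, "stride": 8, "padding": 4},
    {"kernel_size": 16, "stride": 8, "padding": 4},
    {"kernel_size": 8, "stride": 4, "padding": 2},
]


def upsampled_length(num_frames, layers=ASSUMED_LAYERS):
    """Apply each transposed-convolution layer's length formula in turn."""
    length = num_frames
    for layer in layers:
        length = conv_transpose1d_out_len(length, **layer)
    return length


if __name__ == "__main__":
    # Number of audio samples produced from 100 mel frames under the assumed stack.
    print(upsampled_length(100))
```

Under these assumed settings every mel frame expands to 8 * 8 * 4 = 256 audio samples; a different stride configuration would give a different factor.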
This model is trained on all LibriTTS training data (train-clean-100, train-clean-360, and train-other-500) sampled at 22050Hz, and has been tested on generating English voices.
No performance information available at this time.
This model can be automatically loaded from NGC.
NOTE: In order to generate audio, you also need a spectrogram generator from NeMo. This example uses the FastPitch model.
# Load FastPitch
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
# Load UnivNet
from nemo.collections.tts.models import UnivNetModel
model = UnivNetModel.from_pretrained(model_name="tts_en_libritts_multispeaker_univnet")
# Generate audio
import soundfile as sf
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
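# Save the audio to disk in a file called speech.wav
# detach() drops autograd history; [0] selects the single utterance from the batch dimension
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)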
This model accepts batches of mel spectrograms.
This model outputs audio at 22050Hz.
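Because the output rate is fixed at 22050Hz, the duration of a generated clip follows directly from its sample count. A minimal sketch of that arithmetic; the helper names are hypothetical and not part of NeMo.

```python
SAMPLE_RATE_HZ = 22050  # output rate stated in this model card


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length in seconds of a waveform with num_samples samples."""
    return num_samples / sample_rate


def samples_for(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of samples needed to hold the given duration."""
    return round(seconds * sample_rate)


if __name__ == "__main__":
    print(duration_seconds(22050))
    print(samples_for(2.5))
```

The same rate must be passed to whatever writes the file (as in the `sf.write` call above); otherwise playback speed and pitch will be off.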
There are no known limitations at this time.
1.7.0: Add model (tts_en_libritts_multispeaker_univnet.nemo) which was released with NeMo 1.7.0.
UnivNet paper: https://arxiv.org/abs/2106.07889
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.