UnivNet is a generative adversarial network (GAN) model that generates audio from mel spectrograms. The generator uses transposed convolutions to upsample mel spectrograms to audio.
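To illustrate how a transposed convolution turns a short sequence of frames into a longer sequence of samples, here is a minimal pure-Python sketch for a single channel. It is not the UnivNet generator: the function name `conv_transpose1d`, the all-ones kernel, and the stride of 8 are hypothetical choices made only for illustration. The trimming follows the usual output-length rule `(n - 1) * stride - 2 * padding + kernel_size`.

```python
def conv_transpose1d(x, kernel, stride, padding=0):
    """Naive 1-D transposed convolution (single channel, no bias)."""
    full_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * full_len
    # Each input value "stamps" a scaled copy of the kernel into the output,
    # with consecutive stamps placed `stride` samples apart.
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    # Trim `padding` samples from each side of the full result.
    return out[padding:full_len - padding]


frames = [0.5, -1.0, 0.25, 0.75]   # stand-in for 4 spectrogram frames
stride = 8                         # hypothetical upsampling factor
kernel = [1.0] * (2 * stride)      # hypothetical kernel, twice the stride
upsampled = conv_transpose1d(frames, kernel, stride, padding=stride // 2)

print(len(frames), "->", len(upsampled))
```

With a kernel of twice the stride and a padding of half the stride, the output is exactly `stride` times longer than the input. Stacking several such layers multiplies their strides, which is how a generator of this kind can expand each mel frame into one hop of audio samples.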
This model is trained on all LibriTTS training data (train-clean-100, train-clean-360, and train-other-500) sampled at 22050Hz, and has been tested on generating English voices.
No performance information available at this time.
This model can be automatically loaded from NGC.
NOTE: In order to generate audio, you also need a spectrogram generator from NeMo. This example uses the FastPitch model.
# Load FastPitch
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")

# Load UnivNet
from nemo.collections.tts.models import UnivNetModel
model = UnivNetModel.from_pretrained(model_name="tts_en_libritts_multispeaker_univnet")

# Generate audio
import soundfile as sf
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# audio has shape (batch, samples); write the first item as a 1-D mono signal
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
This model accepts batches of mel spectrograms.
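Spectrograms in a batch must share the same number of frames, so shorter ones are typically padded. The sketch below shows one way to do this with plain nested lists. It assumes a (batch, mel bins, frames) layout, and both the helper name `pad_batch` and the pad value of 0.0 are hypothetical; a real pipeline would pad tensors with whatever value represents silence in its mel features.

```python
def pad_batch(spectrograms, pad_value=0.0):
    """Right-pad [n_mels][frames] spectrograms to a common frame count."""
    lengths = [len(spec[0]) for spec in spectrograms]
    max_frames = max(lengths)
    batch = [
        [row + [pad_value] * (max_frames - len(row)) for row in spec]
        for spec in spectrograms
    ]
    # Keep the original lengths so padded audio can be trimmed afterwards.
    return batch, lengths


shorter = [[0.1, 0.2], [0.3, 0.4]]             # 2 mel bins, 2 frames
longer = [[0.5, 0.6, 0.7], [0.8, 0.9, 1.0]]    # 2 mel bins, 3 frames
batch, lengths = pad_batch([shorter, longer])
```

Keeping `lengths` alongside the batch makes it possible to cut each generated waveform back to the portion that corresponds to real frames.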
This model outputs audio at 22050Hz.
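Because the sample rate is fixed, any code that saves or measures the output should use 22050 as well. As a rough sketch using only the Python standard library, the snippet below encodes a list of float samples as a mono 16-bit WAV at that rate. The helper `to_wav_bytes` and the sine tone standing in for model output are illustrative assumptions, not part of NeMo.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # output rate stated for this model


def to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode floats in [-1.0, 1.0] as mono 16-bit PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        clipped = [max(-1.0, min(1.0, s)) for s in samples]
        pcm = struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# Stand-in for model output: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_bytes = to_wav_bytes(tone)
duration_seconds = len(tone) / SAMPLE_RATE
```

Writing the file with a different rate would not resample the audio; it would only make playback faster or slower, so the duration is always the sample count divided by 22050.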
There are no known limitations at this time.
1.7.0: Add model (tts_en_libritts_multispeaker_univnet.nemo) which was released with NeMo 1.7.0.
UnivNet paper: https://arxiv.org/abs/2106.07889