UnivNet speech synthesis model trained on English speech (LibriTTS dataset)
Model Overview
Model Architecture
UnivNet is a generative adversarial network (GAN) vocoder that generates audio waveforms from mel spectrograms. Its generator uses transposed convolutions to upsample from the mel-spectrogram frame rate to the audio sample rate.
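To make the upsampling idea concrete, here is a minimal pure-Python sketch of a single-channel 1-D transposed convolution. It only illustrates how a transposed convolution stretches a short sequence into a longer one; it is not UnivNet's generator, and the kernel, stride, and input values are illustrative assumptions.

```python
def conv_transpose_1d(signal, kernel, stride):
    """Naive single-channel 1-D transposed convolution without padding."""
    # Output length follows the usual formula: (L_in - 1) * stride + kernel_size.
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, x in enumerate(signal):
        # Each input value is scaled by the kernel and added at offset i * stride.
        for k, w in enumerate(kernel):
            out[i * stride + k] += x * w
    return out


# Three "frames" upsampled by a factor of 4 (hypothetical values).
frames = [1.0, 2.0, 3.0]
kernel = [1.0, 1.0, 1.0, 1.0]
upsampled = conv_transpose_1d(frames, kernel, stride=4)
print(len(upsampled))  # 12
print(upsampled)
```

With a kernel of ones whose size equals the stride, each input value is simply repeated `stride` times. A trained generator learns its kernels instead and stacks several such layers to reach the audio sample rate.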
Training
Dataset
This model was trained on the full LibriTTS training set (train-clean-100, train-clean-360, and train-other-500), sampled at 22050 Hz, and has been tested on generating English speech.
Performance
No performance information available at this time.
How to Use this Model
This model can be automatically loaded from NGC.
NOTE: In order to generate audio, you also need a spectrogram generator from NeMo. This example uses the FastPitch model.
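A minimal sketch of the loading and inference flow follows. It assumes that NeMo and the `soundfile` package are installed, and that the pretrained names `tts_en_fastpitch` and `tts_en_libritts_univnet` resolve on NGC; both names are assumptions that should be checked against the models available to your NeMo version. The `synthesize` helper is a hypothetical wrapper introduced only for this example.

```python
def synthesize(text, out_path="speech.wav", sample_rate=22050):
    """Turn text into a WAV file using FastPitch followed by UnivNet (sketch)."""
    # Imports are deferred so the function can be defined without NeMo installed.
    import soundfile as sf
    from nemo.collections.tts.models import FastPitchModel, UnivNetModel

    # Spectrogram generator: text -> mel spectrogram (assumed checkpoint name).
    spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
    # Vocoder: mel spectrogram -> audio (assumed checkpoint name for this model).
    vocoder = UnivNetModel.from_pretrained(model_name="tts_en_libritts_univnet")

    parsed = spec_generator.parse(text)
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    # Save the first item of the batch at the model's output sample rate.
    sf.write(out_path, audio.to("cpu").detach().numpy()[0], sample_rate)
    return out_path


# Example call (downloads both checkpoints on first use):
# synthesize("You can type your sentence here to get NeMo to produce speech.")
```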
Input
This model accepts batches of mel spectrograms.
Output
This model outputs audio at 22050 Hz.
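As a rough illustration of handling the output, the sketch below computes the duration of a waveform at 22050 Hz and writes it as a 16-bit mono WAV file using only the Python standard library. It assumes the waveform is a flat sequence of floats in the range [-1, 1]; the sine tone stands in for real vocoder output, and the helper names are hypothetical.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 22050  # output sample rate stated for this model


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Length of a waveform in seconds."""
    return num_samples / sample_rate


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM."""
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))


# Stand-in for vocoder output: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "univnet_example_tone.wav")
write_wav(out_path, tone)
print(duration_seconds(len(tone)))  # 1.0
```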
Limitations
There are no known limitations at this time.
Versions
1.7.0: Added the model (tts_en_libritts_multispeaker_univnet.nemo), released with NeMo 1.7.0.
References
UnivNet paper: https://arxiv.org/abs/2106.07889
License
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.