VITS is a flow-based, parallel, end-to-end speech synthesis model. It consists of two encoders, a TextEncoder and a PosteriorEncoder (for spectrograms), a normalizing flow, a duration predictor, and a HiFiGAN vocoder.
During training, the TextEncoder produces representations of the text tokens and the PosteriorEncoder produces representations of the spectrograms; both are fed to the normalizing flow. The two representations are then trained to match via a KL-divergence term, as in a VAE. The spectrogram representations are also fed to a Monotonic Alignment Search (MAS) block, which produces alignment labels for training the Duration Predictor.
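The KL-divergence matching between the text-side prior and the spectrogram-side posterior reduces, for diagonal Gaussians, to a closed-form expression per dimension. Below is a minimal scalar sketch of that formula (illustrative only; the actual NeMo/VITS implementation operates on tensors of means and log-standard-deviations):

```python
import math

def kl_gauss(mu_p, logs_p, mu_q, logs_q):
    """KL( N(mu_q, e^{2*logs_q}) || N(mu_p, e^{2*logs_p}) ) for scalars.

    mu_*  : mean of each Gaussian
    logs_*: log standard deviation of each Gaussian
    """
    return (logs_p - logs_q - 0.5
            + 0.5 * (math.exp(2 * logs_q) + (mu_q - mu_p) ** 2)
            * math.exp(-2 * logs_p))

# identical distributions -> KL is zero
print(kl_gauss(0.0, 0.0, 0.0, 0.0))  # -> 0.0
# shifted mean, unit variance -> KL = 0.5 * (mu_q - mu_p)^2
print(kl_gauss(0.0, 0.0, 1.0, 0.0))  # -> 0.5
```

During training this term is averaged over latent dimensions and time steps, pushing the two representations toward each other.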
During inference, the Duration Predictor generates alignments from noise, conditioned on the TextEncoder representations. These are then passed through the inverted flow to the vocoder block, which generates the audio.
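The flow can be run in reverse at inference because each of its layers is invertible by construction. A toy affine coupling step makes this concrete (a hand-written scalar sketch, not NeMo's implementation; in the real model, `scale` and `shift` are produced by a neural network):

```python
import math

def couple(x1, x2, scale, shift):
    # forward pass: transform x2 conditioned on (scale, shift), leave x1 untouched
    return x1, x2 * math.exp(scale) + shift

def uncouple(y1, y2, scale, shift):
    # inverse pass: exactly undoes the forward transform
    return y1, (y2 - shift) * math.exp(-scale)

y1, y2 = couple(1.0, 2.0, scale=0.3, shift=0.5)
x1, x2 = uncouple(y1, y2, scale=0.3, shift=0.5)
print(x1, x2)  # -> 1.0 2.0 (round-trip recovers the input)
```

Stacking many such invertible steps lets the model map noise to latent spectrogram representations at inference time.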
This model was trained on LJSpeech sampled at 22050 Hz, and has been tested on generating female English voices with an American accent.
No performance information available at this time.
```python
# Load VITS
from nemo.collections.tts.models import VitsModel

audio_generator = VitsModel.from_pretrained("tts_en_lj_vits")

# Generate audio
import soundfile as sf
import torch

with torch.no_grad():
    parsed = audio_generator.parse("You can type your sentence here to get nemo to produce speech.")
    audio = audio_generator.convert_text_to_waveform(tokens=parsed)

# Save the audio to disk in a file called speech.wav
if isinstance(audio, torch.Tensor):
    audio = audio.to('cpu').numpy()
sf.write("speech.wav", audio.T, 22050, format="WAV")
```
This model accepts batches of text.
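Batching variable-length sentences requires padding the token sequences to a common length. The sketch below shows the general idea with plain Python lists (the token ids and the pad id of 0 are made-up placeholders, not the model's actual vocabulary):

```python
# hypothetical token id sequences for three sentences of different lengths
token_seqs = [[5, 12, 7], [3, 9], [8, 1, 4, 2]]

pad_id = 0  # assumption: 0 is the padding token
max_len = max(len(s) for s in token_seqs)

# right-pad every sequence to max_len and record the true lengths
batch = [s + [pad_id] * (max_len - len(s)) for s in token_seqs]
lengths = [len(s) for s in token_seqs]

print(batch)    # -> [[5, 12, 7, 0], [3, 9, 0, 0], [8, 1, 4, 2]]
print(lengths)  # -> [3, 2, 4]
```

The true lengths are kept alongside the padded batch so the model can mask out the padding positions.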
This model outputs audio at 22050Hz.
There are no known limitations at this time.
1.0.0 (current): The original version released with NeMo 1.0.0.