VITS is a flow-based, parallel, end-to-end speech synthesis model. It consists of two encoders, a TextEncoder for text tokens and a PosteriorEncoder for spectrograms, along with a normalizing flow, a stochastic duration predictor, and a HiFi-GAN vocoder.
During training, the TextEncoder produces representations of the text tokens and the PosteriorEncoder produces representations of the spectrograms; the spectrogram representations are passed through the normalizing flow. These two representations are then trained to match via a KL-divergence term, as in a VAE. The spectrogram representations are also fed to a monotonic alignment search (MAS) block, which produces the alignment labels used to train the duration predictor.
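For reference, this matching objective can be written as a KL divergence between the approximate posterior and the flow-defined prior, following the formulation in the VITS paper (x_lin is the linear spectrogram, c_text the input text, A the alignment):

    L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c_{text}, A), \qquad z \sim q_\phi(z \mid x_{lin}),

where the prior density is evaluated through the flow f_\theta by change of variables: p_\theta(z \mid c) = \mathcal{N}(f_\theta(z); \mu_\theta(c), \sigma_\theta(c)) \, \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|.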
During inference, the duration predictor generates alignments from noise, conditioned on the TextEncoder representations. The aligned latents are then passed through the inverted flow and on to the vocoder, which generates the audio.
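To make this dataflow concrete, here is a minimal sketch of the inference path described above. The module and helper names (text_encoder, duration_predictor, flow, vocoder, expand_by_durations) are illustrative assumptions, not NeMo's actual internal attributes, and noise_scale=0.667 is a typical default from the reference implementation:

import torch

def vits_inference_sketch(model, tokens, noise_scale=0.667):
    # Encode text tokens into hidden representations and prior statistics.
    h_text, mu_p, log_sigma_p = model.text_encoder(tokens)
    # Sample durations from noise, conditioned on the text representations.
    durations = model.duration_predictor(h_text, reverse=True)
    # Expand prior statistics to frame level per the durations (hypothetical helper).
    mu, log_sigma = expand_by_durations(mu_p, log_sigma_p, durations)
    # Sample a frame-level latent from the prior, then invert the normalizing flow.
    z_p = mu + torch.randn_like(mu) * torch.exp(log_sigma) * noise_scale
    z = model.flow(z_p, reverse=True)
    # Decode the latent into a waveform with the HiFi-GAN vocoder.
    return model.vocoder(z)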
This model was trained on LJSpeech sampled at 22050 Hz, and has been tested on generating female English voices with an American accent.
No performance information is available at this time.
import soundfile as sf
import torch

from nemo.collections.tts.models import VitsModel

# Load VITS
audio_generator = VitsModel.from_pretrained("tts_en_lj_vits")

# Generate audio (no_grad avoids building autograd graphs during inference)
with torch.no_grad():
    parsed = audio_generator.parse("You can type your sentence here to get nemo to produce speech.")
    audio = audio_generator.convert_text_to_waveform(tokens=parsed)

# Save the audio to disk in a file called speech.wav
if isinstance(audio, torch.Tensor):
    audio = audio.to('cpu').numpy()
sf.write("speech.wav", audio.T, 22050, format="WAV")
This model accepts batches of text.
This model outputs audio at 22050 Hz.
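Since the model accepts batched input, several sentences can be synthesized in one call by padding the parsed token tensors to a common length. A minimal sketch, continuing from the snippet above; the pad-id attribute audio_generator.tokenizer.pad is an assumption and should be checked against the loaded model's tokenizer:

import torch

sentences = ["First test sentence.", "A somewhat longer second test sentence."]
with torch.no_grad():
    parsed = [audio_generator.parse(s)[0] for s in sentences]  # each is a 1-D token tensor
    max_len = max(p.size(0) for p in parsed)
    pad_id = audio_generator.tokenizer.pad  # assumed attribute; verify for your tokenizer
    batch = torch.stack([
        torch.nn.functional.pad(p, (0, max_len - p.size(0)), value=pad_id) for p in parsed
    ])
    audios = audio_generator.convert_text_to_waveform(tokens=batch)  # shape: [batch, time]

Shorter items in the batch may carry trailing audio corresponding to the padding tokens, so trimming each waveform as needed is advisable.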
There are no known limitations at this time.
1.0.0 (current): The original version released with NeMo 1.0.0.
VITS paper: https://arxiv.org/abs/2106.06103
GitHub: https://github.com/jaywalnut310/vits
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.