
TTS En HiFiTTS VITS

Description: End-to-end parallel speech synthesis model
Publisher: NVIDIA
Latest Version: r1.15.0
Modified: April 4, 2023
Size: 346.3 MB

Model Overview

VITS is a flow-based, parallel, end-to-end speech synthesis model. It consists of two encoders, TextEncoder and PosteriorEncoder (for spectrograms), together with a normalizing flow, a duration predictor, and a HiFiGAN vocoder.

During training, TextEncoder produces a representation of the text tokens and PosteriorEncoder produces a representation of the spectrograms, which is then fed to the normalizing flow. These two representations are trained to match using a KL divergence loss, as in a VAE. The spectrogram representation is also fed to the MAS (Monotonic Alignment Search) block to produce labels for training the Duration Predictor.
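As an illustration of the KL matching term mentioned above, here is a minimal sketch of the closed-form KL divergence between two diagonal Gaussians. The function name `gaussian_kl` and its list-based inputs are hypothetical and chosen for readability; this is not the loss code used by NeMo or the VITS authors, which operates on tensors and on latents passed through the flow.

```python
import math


def gaussian_kl(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dimensions.

    Hypothetical helper for illustration only. Each argument is a list of
    floats; log_sigma_* holds the log of the standard deviation.
    """
    total = 0.0
    for mq, lsq, mp, lsp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lsq)  # variance of q in this dimension
        var_p = math.exp(2.0 * lsp)  # variance of p in this dimension
        # log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p)^2) / (2 var_p) - 1/2
        total += (lsp - lsq) + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total
```

The value is zero when the two distributions coincide and grows as their means or variances drift apart, which is what drives the text and spectrogram representations toward each other.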

During inference, the Duration Predictor generates alignments from noise, conditioned on the TextEncoder representations. These are then passed through the inverse flow to the Vocoder block, which generates audio.
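To make the role of predicted durations concrete, the sketch below shows one simple way per-token durations can define a monotonic alignment: each token's representation is repeated for as many frames as its duration. The helper `expand_by_duration` is hypothetical and uses plain Python lists; the actual model works on batched tensors and is assumed to build the alignment differently in detail.

```python
def expand_by_duration(token_features, durations):
    """Repeat each token's feature for its predicted number of frames.

    Hypothetical helper for illustration only. `token_features` is a list with
    one entry per text token and `durations` is a list of non-negative integer
    frame counts of the same length.
    """
    expanded = []
    for feature, duration in zip(token_features, durations):
        # A token with duration d occupies d consecutive output frames.
        expanded.extend([feature] * duration)
    return expanded


frames = expand_by_duration(["h", "i"], [2, 3])
```

Here `frames` has one entry per output frame, and its length equals the sum of the durations.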

Training

This model is trained on HiFiTTS sampled at 44100 Hz, and has been tested on generating male and female English voices with an American accent.

Performance

No performance information available at this time.

How to Use this Model

# Load VITS
from nemo.collections.tts.models import VitsModel
audio_generator = VitsModel.from_pretrained("tts_en_hifitts_vits")

# Generate audio
import soundfile as sf
import torch
with torch.no_grad():
    parsed = audio_generator.parse("You can type your sentence here to get nemo to produce speech.")
    audio = audio_generator.convert_text_to_waveform(tokens=parsed)

# Save the audio to disk in a file called speech.wav
if isinstance(audio, torch.Tensor):
    audio = audio.to('cpu').numpy()
sf.write("speech.wav", audio.T, 44100, format="WAV")

Input

This model accepts batches of text.

Output

This model outputs audio at 44100Hz.
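Since the output sample rate is fixed, the length of a generated clip follows directly from its number of samples. The sketch below shows that arithmetic; the constant `SAMPLE_RATE_HZ` and the helper `duration_seconds` are hypothetical names introduced here for illustration.

```python
SAMPLE_RATE_HZ = 44100  # output sample rate stated for this model


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Return the clip length in seconds for a given number of audio samples."""
    return num_samples / sample_rate
```

For example, a waveform of 88200 samples corresponds to 2 seconds of audio. Writing the file with a different sample rate than the one the model produces would change the playback speed and pitch.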

Limitations

There are no known limitations at this time.

Versions

1.15.0 (current): The original version released with NeMo 1.15.0.

References

VITS paper: https://arxiv.org/abs/2106.06103
GitHub: https://github.com/jaywalnut310/vits

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.