Flowtron is an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from Autoregressive Flows and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent).
Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training.
The model was published as "Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis" and is available at https://arxiv.org/abs/2005.05957.
Our text encoder modifies Tacotron's by replacing batch normalization with instance normalization. Our decoder removes the Prenet and Postnet layers from Tacotron, which were previously considered essential. We use a single speaker embedding that is channel-wise concatenated with the encoder outputs at every token, and a fixed dummy speaker embedding for models not conditioned on speaker id. Finally, we add a dense layer with a sigmoid output to the flow step closest to z. This provides the model with a gating mechanism as early as possible during inference to avoid extra computation.
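As a rough illustration of the two details above, the following minimal PyTorch sketch shows channel-wise concatenation of a speaker embedding with the encoder outputs at every token, and a dense layer with a sigmoid output acting as a gate. This is not the reference implementation; the module name, dimensions, and where the gate input comes from are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    # Illustrative dimensions; the real model's sizes may differ.
    def __init__(self, n_speakers=123, speaker_dim=128, encoder_dim=512, flow_hidden_dim=1024):
        super().__init__()
        self.speaker_embedding = nn.Embedding(n_speakers, speaker_dim)
        # Dense layer with a sigmoid output, attached to the flow step closest
        # to z, producing a per-frame stop signal during inference.
        self.gate_layer = nn.Linear(flow_hidden_dim, 1)

    def condition(self, encoder_outputs, speaker_ids):
        # encoder_outputs: (batch, n_tokens, encoder_dim); speaker_ids: (batch,)
        emb = self.speaker_embedding(speaker_ids)                       # (batch, speaker_dim)
        emb = emb.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)  # repeat per token
        # Channel-wise concatenation with the encoder outputs at every token.
        return torch.cat((encoder_outputs, emb), dim=-1)

    def gate(self, flow_hidden):
        # flow_hidden: (batch, n_frames, flow_hidden_dim) -> stop probabilities in [0, 1]
        return torch.sigmoid(self.gate_layer(flow_hidden))

# Usage with dummy tensors
module = SpeakerConditioning()
enc = torch.randn(2, 40, 512)          # two utterances, 40 text tokens each
spk = torch.tensor([0, 7])             # speaker ids (a fixed dummy id could be
                                       # used when not conditioning on speaker)
cond = module.condition(enc, spk)      # (2, 40, 640)
stop = module.gate(torch.randn(2, 100, 1024))  # (2, 100, 1)
```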
Flowtron was trained on a dataset that combines the LJSpeech (LJS) dataset [1] with two proprietary single-speaker datasets of 20 and 10 hours each. Flowtron was also trained on the train-clean-100 subset of LibriTTS [2], with 123 speakers and 25 minutes of audio per speaker on average. Speakers with less than 5 minutes of data and files longer than 10 seconds are filtered out. For each dataset, at least 180 randomly chosen samples were used for the validation set and the remainder for the training set. The models are trained on uniformly sampled normalized text and ARPAbet encodings obtained from the CMU Pronouncing Dictionary. Training uses a sampling rate of 22050 Hz and mel-spectrograms with 80 bins computed with librosa mel filter defaults. We apply the STFT with an FFT size of 1024, a window size of 1024 samples, and a hop size of 256 samples. We use the ADAM optimizer with default parameters, a 1e-4 learning rate, and 1e-6 weight decay. We anneal the learning rate once the generalization error starts to plateau and stop training once the generalization error stops decreasing significantly or starts increasing. The Flowtron models with 2 steps of flow were trained on the LSH dataset for approximately 1000 epochs and then fine-tuned on LibriTTS for 500 epochs.
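The sketch below illustrates the feature extraction and optimizer settings quoted above: 80-bin mel-spectrograms from a 1024-point STFT (1024-sample window, 256-sample hop) at 22050 Hz with librosa mel filter defaults, and Adam with a 1e-4 learning rate and 1e-6 weight decay. It is a minimal example assuming the librosa keyword API and a mono input file; the log compression step and the placeholder model are illustrative, not the authors' exact pipeline.

```python
import numpy as np
import librosa
import torch

SAMPLING_RATE = 22050
FILTER_LENGTH = 1024   # FFT size
WIN_LENGTH = 1024      # window size in samples
HOP_LENGTH = 256       # hop size in samples
N_MEL = 80             # mel bins

def mel_spectrogram(audio_path):
    # Load audio at 22050 Hz, compute magnitude STFT, then apply mel filters.
    audio, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
    stft = librosa.stft(audio, n_fft=FILTER_LENGTH,
                        hop_length=HOP_LENGTH, win_length=WIN_LENGTH)
    magnitude = np.abs(stft)
    # librosa mel filter defaults (fmin=0, fmax=sr/2)
    mel_basis = librosa.filters.mel(sr=SAMPLING_RATE, n_fft=FILTER_LENGTH,
                                    n_mels=N_MEL)
    mel = np.dot(mel_basis, magnitude)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression (illustrative)

# Optimizer settings described above; the model here is just a placeholder module.
model = torch.nn.Linear(N_MEL, N_MEL)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
```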
The model is trained on a single NVIDIA DGX-1 with 8 GPUs.
The model was trained on a dataset that combines the LJSpeech (LJS) dataset with two proprietary single-speaker datasets of 20 and 10 hours each (Sally and Helen). We refer to this combined dataset as LSH. LJS is a public-domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books; a transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The model was also trained on the train-clean-100 subset of LibriTTS, with 123 speakers and 25 minutes of audio per speaker on average. LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate. Speakers with less than 5 minutes of data and files longer than 10 seconds are filtered out. For each dataset we use at least 180 randomly chosen samples for the validation set and the remainder for the training set.
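The following is a hedged sketch of the filtering and split described above: clips longer than 10 seconds are dropped, speakers with less than 5 minutes of remaining audio are removed, and 180 randomly chosen samples per dataset are held out for validation. The "path|text|speaker_id" filelist format and the helper names are assumptions for illustration.

```python
import random
from collections import defaultdict

import soundfile as sf  # pip install soundfile

MIN_SPEAKER_SECONDS = 5 * 60   # drop speakers with < 5 minutes of audio
MAX_CLIP_SECONDS = 10          # drop files longer than 10 seconds
N_VAL = 180                    # validation samples per dataset

def duration_seconds(path):
    info = sf.info(path)
    return info.frames / info.samplerate

def filter_and_split(filelist_lines, seed=1234):
    # Parse "path|text|speaker_id" lines and drop overlong clips.
    entries, per_speaker = [], defaultdict(float)
    for line in filelist_lines:
        path, text, speaker = line.strip().split("|")
        dur = duration_seconds(path)
        if dur > MAX_CLIP_SECONDS:
            continue
        entries.append((path, text, speaker))
        per_speaker[speaker] += dur

    # Drop speakers with less than 5 minutes of remaining audio.
    kept = [e for e in entries if per_speaker[e[2]] >= MIN_SPEAKER_SECONDS]

    # Hold out 180 randomly chosen samples for validation.
    random.Random(seed).shuffle(kept)
    return kept[N_VAL:], kept[:N_VAL]  # train, validation
```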
We provide results comparing mean opinion scores for a Flowtron with 2 steps of flow and samples from our implementation of Tacotron, both trained on LSH. Although the models evaluated are multi-speaker, we only compute mean opinion scores on LJS. We crowd-sourced mean opinion score (MOS) tests on Amazon Mechanical Turk. Raters first had to pass a hearing test to be eligible; they then listened to an utterance and rated its pleasantness on a five-point scale. We used 30 volume-normalized utterances from all speakers, disjoint from the training set, for evaluation, and randomly chose the utterances for each subject.
The mean opinion scores are shown in Table 1 with 95% confidence intervals computed over approximately 250 scores per source. The results roughly match our subjective qualitative assessment. Flowtron's larger advantage lies in its control over the amount of speech variation and in the manipulation of the latent space.
Model | Flows | Mean Opinion Score (MOS) |
---|---|---|
Real | - | 4.274 ± 0.1340 |
Flowtron | 2 | 3.665 ± 0.1634 |
Tacotron | - | 3.521 ± 0.1721 |
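As a rough illustration of how the means and 95% confidence intervals in Table 1 could be computed from roughly 250 individual ratings per source, the sketch below uses a standard t-distribution confidence interval. It is a generic statistical example, not the authors' exact evaluation code; the simulated ratings are placeholders.

```python
import numpy as np
from scipy import stats

def mos_with_ci(scores, confidence=0.95):
    # Mean opinion score with a two-sided t-distribution confidence interval.
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mean, half_width

# Example with ~250 simulated five-point ratings
ratings = np.random.default_rng(0).integers(1, 6, size=250)
mean, ci = mos_with_ci(ratings)
print(f"MOS = {mean:.3f} ± {ci:.4f}")
```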
In order to run this model, your system needs an NVIDIA GPU with CUDA and cuDNN installed.
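A quick sanity check of these requirements, assuming PyTorch is installed, is to verify that CUDA and cuDNN are visible before attempting training or inference:

```python
import torch

print("CUDA available: ", torch.cuda.is_available())        # NVIDIA GPU + CUDA visible
print("cuDNN available:", torch.backends.cudnn.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```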
To test this model, see the "fine-tuning Flowtron model" resource in the NGC Catalog at https://ngc.nvidia.com/catalog/resources/nvidia:flowtron, which shows how to fine-tune and test the model.
During training, the model's input is a set of audio files and their transcripts. During inference, the input is the text we want to synthesize into audio.
The output of the inference script is an audio file, which you can find in the results directory. It contains the synthesized speech for the text you gave to the model, and its voice is similar to the audio files used for fine-tuning.
The audio files used for fine-tuning need to be of high quality.
[1] Keith Ito et al., “The LJ speech dataset,” 2017.
[2] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv preprint arXiv:1904.02882, 2019.
This implementation uses code from repositories by Keith Ito, Prem Seetharaman, and Liyuan Liu, as described in our code.