WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet to provide fast, efficient, and high-quality audio synthesis without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
The model is described in detail in the paper "WaveGlow: A Flow-based Generative Network for Speech Synthesis", available at https://arxiv.org/abs/1811.00002.
WaveGlow is a generative model that generates audio by sampling from a distribution. To use a neural network as a generative model, samples are taken from a simple distribution, in this case, a zero mean spherical Gaussian with the same number of dimensions as our desired output, and those samples are put through a series of layers that transforms the simple distribution to one which has the desired distribution. In this case, the distribution of audio samples conditioned on a mel-spectrogram is modeled.
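The sampling procedure described above can be written with the standard change-of-variables formulation for normalizing flows. The notation below (z for the Gaussian sample, x for the audio, f_i for the invertible layers, J for the Jacobian) is chosen for illustration, and the conditioning on the mel-spectrogram is omitted for brevity:

```latex
\begin{aligned}
z &\sim \mathcal{N}(z;\, 0,\, I), \\
x &= f_1 \circ f_2 \circ \cdots \circ f_k(z), \\
z &= f_k^{-1} \circ f_{k-1}^{-1} \circ \cdots \circ f_1^{-1}(x), \\
\log p_\theta(x) &= \log p_\theta(z) + \sum_{i=1}^{k} \log \left| \det J\!\left(f_i^{-1}(x)\right) \right|.
\end{aligned}
```

Because every layer is invertible, the last line can be evaluated exactly, which is what makes direct maximum-likelihood training possible.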
The WaveGlow network we use has 12 coupling layers and 12 invertible 1x1 convolutions. The coupling layer networks (WN) each have 8 layers of dilated convolutions, with 512 channels used as residual connections and 256 channels in the skip connections. We also output 2 of the channels after every 4 coupling layers.
The WaveGlow network was trained on 8 NVIDIA GV100 GPUs using randomly chosen clips of 16,000 samples for 580,000 iterations using weight normalization and the Adam optimizer, with a batch size of 24 and a step size of 0.0001. When training appeared to plateau, the learning rate was further reduced to 0.00005.
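To illustrate why a coupling layer is invertible, here is a minimal sketch of an affine coupling step in plain Python. It is not the repository's implementation: `toy_wn` is a hypothetical stand-in for the WN network (which in the real model is a stack of dilated convolutions conditioned on the mel-spectrogram), and the function names are invented for this example.

```python
import math

def toy_wn(x_a):
    """Hypothetical stand-in for WN: returns (log_s, t), one pair per element."""
    log_s = [0.5 * math.tanh(v) for v in x_a]
    t = [0.1 * v for v in x_a]
    return log_s, t

def coupling_forward(x):
    """Scale and shift the second half of x using values computed from the first half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_wn(x_a)
    y_b = [math.exp(ls) * b + tt for ls, b, tt in zip(log_s, x_b, t)]
    log_det = sum(log_s)  # log|det J| of the affine map
    return x_a + y_b, log_det

def coupling_inverse(y):
    """Undo coupling_forward; the first half is unchanged, so log_s and t can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_wn(y_a)
    x_b = [(b - tt) * math.exp(-ls) for ls, b, tt in zip(log_s, y_b, t)]
    return y_a + x_b

x = [0.3, -1.2, 0.8, 2.0]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)  # recovers x up to floating-point error
```

The network that produces the scale and shift never has to be inverted itself, which is why it can be arbitrarily complex.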
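Selecting a random 16,000-sample clip might look like the sketch below. The function name `random_clip` and the zero-padding of short inputs are assumptions made for this example, not details stated in the text.

```python
import random

SEGMENT_LENGTH = 16000  # samples per training clip, as stated above

def random_clip(audio, segment_length=SEGMENT_LENGTH, rng=random):
    """Return a randomly positioned window of `segment_length` samples.

    Inputs shorter than the window are zero-padded on the right
    (an assumption of this sketch).
    """
    if len(audio) >= segment_length:
        start = rng.randint(0, len(audio) - segment_length)  # bounds are inclusive
        return audio[start:start + segment_length]
    return audio + [0.0] * (segment_length - len(audio))

rng = random.Random(0)              # seeded only to make the demo repeatable
long_audio = [0.0] * 50000          # placeholder waveform
short_audio = [0.1] * 1000          # placeholder waveform shorter than one clip
clip_long = random_clip(long_audio, rng=rng)
clip_short = random_clip(short_audio, rng=rng)
```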
We trained the model on the LJ Speech dataset. This dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.
The data consists of roughly 24 hours of speech recorded on a MacBook Pro using its built-in microphone in a home environment. We use a sampling rate of 22,050 Hz.
We crowd-sourced Mean Opinion Score (MOS) tests on Amazon Mechanical Turk. Raters first had to pass a hearing test to be eligible. Then they listened to an utterance, after which they rated pleasantness on a five-point scale. We used 40 volume-normalized utterances disjoint from the training set for evaluation, and randomly chose the utterances for each subject. After completing the rating, each rater was excluded from further tests to avoid anchoring effects. The MOS scores are shown in Table 1 with 95% confidence intervals. Though MOS scores of synthesized samples are close on an absolute scale, none of the methods reach the MOS score of real audio. WaveGlow has the highest MOS, but all the methods have similar scores, with only weakly significant differences after collecting approximately 1,000 samples. This roughly matches our subjective qualitative assessment. The larger advantage of WaveGlow is in training simplicity and inference speed.
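As a rough illustration of how a MOS and its 95% confidence interval can be computed from raw ratings, the sketch below uses a normal approximation. The text does not say which interval method was used, so this is an assumption, and the ratings are made up for the example.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and the half-width of a
    normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, z * std_err

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 4, 4]  # made-up ratings on a 1-5 scale
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.3f} +/- {half_width:.3f}")
```

The interval narrows with the square root of the number of ratings, so small differences between systems need many ratings to become significant.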
|Model||Mean Opinion Score (MOS)|
Clone our repo and initialize submodule
git clone https://github.com/NVIDIA/waveglow.git
cd waveglow
git submodule init
git submodule update
pip3 install -r requirements.txt
Generate audio with our pre-existing model
python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6
Use convert_model.py to convert your older models to the current model with fused residual and skip connections.
Train your own model
Download LJ Speech Data. In this example it's in
Make a list of the file names to use for training/testing
ls data/*.wav | tail -n+11 > train_files.txt
ls data/*.wav | head -n10 > test_files.txt
Train your WaveGlow networks
mkdir checkpoints
python train.py -c config.json
For multi-GPU training replace
distributed.py. Only tested with single node and NCCL.
For mixed precision training set
"fp16_run": true on
Make test set mel-spectrograms
python mel2samp.py -f test_files.txt -o . -c config.json
Do inference with your network
ls *.pt > mel_files.txt
python3 inference.py -f mel_files.txt -w checkpoints/waveglow_10000 -o . --is_fp16 -s 0.6
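The train/test file lists from the training steps can also be built in Python. The following is a sketch only: the helper name `split_file_list` and the demo file names are hypothetical. It holds out the first 10 sorted paths for testing and keeps the two lists disjoint.

```python
def split_file_list(paths, n_test=10):
    """Hold out the first `n_test` sorted paths for testing; train on the rest."""
    ordered = sorted(paths)
    return ordered[n_test:], ordered[:n_test]

# Hypothetical file names, used only to demonstrate the split.
demo_paths = [f"data/clip_{i:04d}.wav" for i in range(25)]
train_files, test_files = split_file_list(demo_paths)
```

In practice the paths would come from something like `glob.glob("data/*.wav")`, with each list written one path per line to train_files.txt and test_files.txt.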
We use the mel-spectrogram of the original audio of the dataset as the input to the WaveNet and WaveGlow networks. For WaveGlow, we use mel-spectrograms with 80 bins using librosa mel filter defaults, i.e. each bin is normalized by the filter length and the scale is the same as HTK. The parameters of the mel-spectrograms are FFT size 1024, hop size 256, and window size 1024.
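To give a feel for what these parameters mean in time and frequency, the short calculation below converts them to milliseconds and bin counts. It assumes audio sampled at 22,050 samples per second; the constant names are chosen for this example.

```python
SAMPLING_RATE = 22050  # samples per second (assumed for this calculation)
FFT_SIZE = 1024
HOP_SIZE = 256
WIN_SIZE = 1024
N_MEL_BINS = 80

hop_ms = 1000 * HOP_SIZE / SAMPLING_RATE      # time step between consecutive frames
win_ms = 1000 * WIN_SIZE / SAMPLING_RATE      # duration covered by one analysis window
frames_per_second = SAMPLING_RATE / HOP_SIZE  # mel-spectrogram frames per second of audio
linear_bins = FFT_SIZE // 2 + 1               # one-sided FFT bins fed to the mel filter bank

print(f"hop: {hop_ms:.2f} ms, window: {win_ms:.2f} ms")
print(f"{frames_per_second:.2f} frames/s, {linear_bins} FFT bins -> {N_MEL_BINS} mel bins")
```

Each mel-spectrogram frame therefore conditions 256 audio samples, which is the upsampling factor the network has to bridge.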
The model produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU.
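For context, that synthesis rate can be compared with the playback rate of the audio. The sketch below assumes 22,050 samples per second for playback; the variable names are chosen for this example.

```python
SYNTHESIS_RATE_HZ = 1_200_000  # 1200 kHz, as stated above
SAMPLING_RATE_HZ = 22_050      # playback rate assumed for this comparison

real_time_factor = SYNTHESIS_RATE_HZ / SAMPLING_RATE_HZ  # seconds of audio per second of compute
ms_per_audio_second = 1000 / real_time_factor            # compute time for 1 s of audio

print(f"about {real_time_factor:.1f}x faster than real time")
print(f"about {ms_per_audio_second:.1f} ms to synthesize 1 s of audio")
```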
Diederik P Kingma and Prafulla Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” arXiv preprint arXiv:1807.03039, 2018.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
Laurent Dinh, David Krueger, and Yoshua Bengio, “NICE: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio, “Density estimation using Real NVP,” arXiv preprint arXiv:1605.08803, 2016.
Danilo Jimenez Rezende and Shakir Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
Keith Ito et al., “The LJ speech dataset,” 2017.