Tacotron2 and Waveglow 2.0 for PyTorch

Description

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts.

Publisher

NVIDIA

Latest Version

20.06.9

Modified

April 4, 2023

Compressed Size

10.29 MB

This text-to-speech (TTS) system is a combination of two neural network models:

a modified Tacotron 2 model from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper
a flow-based neural network model from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech.

Our implementation of the Tacotron 2 model differs from the model described in the paper in two ways. First, it uses Dropout instead of Zoneout to regularize the LSTM layers. Second, the original text-to-speech system proposed in the paper uses the WaveNet model to synthesize waveforms, whereas our implementation uses the WaveGlow model for this purpose.

Both models are based on the implementations in the NVIDIA GitHub repositories Tacotron 2 and WaveGlow, and are trained on the publicly available LJ Speech dataset.

The Tacotron 2 and WaveGlow models enable you to efficiently synthesize high-quality speech from text.

Both models are trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.0x faster for Tacotron 2 and 3.1x faster for WaveGlow than training without Tensor Cores, while experiencing the benefits of mixed precision training. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder (blue blocks in the figure below) transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder (orange blocks) that produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet (green block) is replaced by the flow-based generative WaveGlow.

Figure 1. Architecture of the Tacotron 2 model. Taken from the Tacotron 2 paper.
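
The autoregressive decoding described above can be illustrated with a minimal sketch. It is not the Tacotron 2 decoder: `toy_step` is a hypothetical stand-in for the attention and LSTM layers, and the all-zero initial frame, `n_mels`, and `max_steps` are assumptions made for illustration. The point is only the control flow, in which each predicted frame is fed back as the input for the next step until a stop signal is produced.

```python
from typing import Callable, List, Tuple

Frame = List[float]
StepFn = Callable[[List[float], Frame], Tuple[Frame, bool]]


def decode_autoregressively(
    encoder_outputs: List[float],
    step: StepFn,
    n_mels: int = 4,
    max_steps: int = 100,
) -> List[Frame]:
    """Generate spectrogram frames one at a time, feeding each frame back in."""
    frames: List[Frame] = []
    prev: Frame = [0.0] * n_mels  # all-zero initial frame (assumption)
    for _ in range(max_steps):
        frame, stop = step(encoder_outputs, prev)
        frames.append(frame)
        if stop:  # the step function signals the end of the utterance
            break
        prev = frame  # the next step is conditioned on this frame
    return frames


def toy_step(encoder_outputs: List[float], prev: Frame) -> Tuple[Frame, bool]:
    """Hypothetical stand-in for the attention + LSTM decoder step."""
    context = sum(encoder_outputs) / len(encoder_outputs)
    frame = [0.5 * p + context for p in prev]
    return frame, frame[0] > 1.9


mel = decode_autoregressively([1.0, 1.0, 1.0], toy_step)
print(len(mel), mel[-1])
```

Because every frame depends on the previous one, the loop cannot be parallelized across time steps at inference, which is the usual cost of autoregressive models.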

The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.

Figure 2. Architecture of the WaveGlow model. Taken from the WaveGlow paper.
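
To make the invertibility of an affine coupling layer concrete, here is a minimal pure-Python sketch. It is not the WaveGlow implementation: `toy_wn` is a hypothetical stand-in for the modified WaveNet, and the coefficients, vector sizes, and function names are assumptions. The sketch shows why the layer can be inverted exactly: the first half of the input passes through unchanged, so the same scale and shift can be recomputed from it in the inverse direction.

```python
import math
from typing import List, Sequence, Tuple


def toy_wn(x_a: Sequence[float], mel: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for the WaveNet-like network: returns (log_s, t)."""
    log_s = [0.1 * a + 0.05 * m for a, m in zip(x_a, mel)]
    t = [0.3 * a - 0.2 * m for a, m in zip(x_a, mel)]
    return log_s, t


def coupling_forward(x: Sequence[float], mel: Sequence[float]) -> Tuple[List[float], float]:
    """Training direction: transform x_b with a scale and shift computed from x_a."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_wn(x_a, mel)
    y_b = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, x_b, t)]
    # The log-determinant of the Jacobian is the sum of the log-scales.
    return list(x_a) + y_b, sum(log_s)


def coupling_inverse(y: Sequence[float], mel: Sequence[float]) -> List[float]:
    """Inference direction: recompute (log_s, t) from the untouched half and undo them."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_wn(y_a, mel)
    x_b = [(b - ti) * math.exp(-ls) for ls, b, ti in zip(log_s, y_b, t)]
    return list(y_a) + x_b


x = [0.2, -0.4, 0.7, 0.1]
mel_cond = [1.0, -1.0]
y, log_det = coupling_forward(x, mel_cond)
x_rec = coupling_inverse(y, mel_cond)
print(max(abs(a - b) for a, b in zip(x, x_rec)))  # round-trip error
```

Note that `toy_wn` itself never has to be inverted, which is why an arbitrarily complex network can be used inside the coupling layer.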

Default configuration

Both models support multi-GPU and mixed precision training with dynamic loss scaling (see the Apex code), as well as mixed precision inference. To speed up Tacotron 2 training, reference mel-spectrograms are generated once during a preprocessing step and read directly from disk during training, instead of being recomputed during training.

The following features were implemented in this model:

data-parallel multi-GPU training
dynamic loss scaling with backoff for Tensor Cores (mixed precision) training.
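
As an illustration of dynamic loss scaling with backoff, the following is a minimal pure-Python sketch of the general technique, not the Apex code. The class name `DynamicLossScaler` and the parameters `init_scale`, `backoff_factor`, `growth_factor`, and `growth_interval` are assumptions chosen for this example. When any gradient is infinite or NaN, the step is skipped and the scale is reduced; after a run of clean steps, the scale is increased again.

```python
import math
from typing import Iterable, List


class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(
        self,
        init_scale: float = 2.0 ** 16,   # assumed starting scale
        backoff_factor: float = 0.5,     # assumed multiplier on overflow
        growth_factor: float = 2.0,      # assumed multiplier after clean steps
        growth_interval: int = 2000,     # assumed number of clean steps before growth
    ) -> None:
        self.scale = init_scale
        self.backoff_factor = backoff_factor
        self.growth_factor = growth_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads: Iterable[float]) -> bool:
        """Return True if the optimizer step should be applied."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Skip this step and retry later with a smaller scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
applied: List[bool] = []
scales: List[float] = []
for grads in ([0.1], [float("inf")], [0.2], [0.3]):
    applied.append(scaler.update(grads))
    scales.append(scaler.scale)
print(applied)
print(scales)
```

The short `growth_interval` here only keeps the demonstration small; real training runs typically wait far longer before growing the scale.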

Feature support matrix

The following features are supported by this model.

Feature	Tacotron 2	WaveGlow
AMP	Yes	Yes
Apex DistributedDataParallel	Yes	Yes

Features

AMP - a tool that enables Tensor Core-accelerated training. For more information, refer to Enabling mixed precision.

Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to torch.nn.parallel.DistributedDataParallel. DistributedDataParallel is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during backward() and bucketing smaller gradient transfers to reduce the total number of transfers required.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

Porting the model to use the FP16 data type where appropriate.
Adding loss scaling to preserve small gradient values.
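
The need for the second step can be seen in a small numeric sketch. It uses only the Python standard library, whose `struct` module supports the IEEE half-precision format; the gradient value and the scale factor of 1024 are arbitrary choices for illustration, and Python's double-precision float stands in for FP32 when unscaling.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # a small gradient value (illustrative)
loss_scale = 1024.0  # illustrative scale factor

unscaled = to_fp16(grad)             # too small for FP16: flushes to zero
scaled = to_fp16(grad * loss_scale)  # representable after scaling
recovered = scaled / loss_scale      # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling, the gradient is lost entirely. With scaling, it survives the FP16 round trip and is recovered to within FP16 rounding error once it is divided by the same factor.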

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

For information about:

How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch blog.

Enabling mixed precision

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from APEX that casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the scale_loss() method provided by AMP. The scaling value to be used can be dynamic or fixed.

By default, the train_tacotron2.sh and train_waveglow.sh scripts launch mixed precision training with Tensor Cores. You can change this behavior by removing the --amp flag passed to the train.py script.

To enable mixed precision, the following steps were performed in the Tacotron 2 and WaveGlow models:

Import AMP from APEX:

```
from apex import amp
# Run softmax in FP16 instead of the default FP32
amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
```

Initialize AMP:

```
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```

If running on multi-GPU, wrap the model with DistributedDataParallel:

```
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
```

Scale the loss before backpropagation (assuming the loss is stored in a variable called losses):
- Default backpropagation for FP32:
```
losses.backward()
```
- Scale the loss and backpropagate with AMP:
```
with amp.scale_loss(losses, optimizer) as scaled_losses:
    scaled_losses.backward()
```

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
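
The robustness claim can be illustrated with a small standard-library sketch. It relies on a fact not stated above: TF32 keeps the 8-bit exponent of FP32 and shortens the mantissa to 10 bits. The helper names are hypothetical, and truncating the mantissa is a simplification that only demonstrates the bit layout; actual hardware rounding may differ.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 by zeroing the low 13 of FP32's 23 mantissa bits (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as IEEE half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


big = 1e10
tf32_value = to_tf32(big)
rel_err = abs(tf32_value - big) / big

print(fits_fp16(big))  # FP16 overflows: its largest finite value is 65504
print(rel_err)         # TF32 keeps the magnitude, with reduced precision
```

A value such as 1e10 is far outside the FP16 range but keeps its magnitude under the emulated TF32 format, at the cost of fewer significant bits.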

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.