This text-to-speech (TTS) system is a combination of two neural network models: Tacotron 2 and WaveGlow. Together they enable users to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech.
Our implementation of the Tacotron 2 model differs from the model described in the paper. Our implementation uses Dropout instead of Zoneout to regularize the LSTM layers. Also, the original text-to-speech system proposed in the paper uses the WaveNet model to synthesize waveforms; in our implementation, we use the WaveGlow model for this purpose.
Both models are based on implementations from the NVIDIA GitHub repositories Tacotron 2 and WaveGlow, and are trained on the publicly available LJ Speech dataset.
Together, the Tacotron 2 and WaveGlow models enable you to efficiently synthesize high-quality speech from text.
Both models are trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. As a result, researchers can get results 2.0x faster for Tacotron 2 and 3.1x faster for WaveGlow than training without Tensor Cores, while experiencing the benefits of mixed precision training. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder (blue blocks in the figure below) transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder (orange blocks) that produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet (green block) is replaced by the flow-based generative WaveGlow.
Figure 1. Architecture of the Tacotron 2 model. Taken from the Tacotron 2 paper.
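To make the two-stage pipeline concrete, the sketch below strings the models together at inference time, following the pattern of NVIDIA's Torch Hub example for these models. The hub entry-point names and the `infer()` signatures are assumptions that may differ between releases.

```python
import torch

# Load pretrained models via Torch Hub (entry-point names assumed from
# NVIDIA's published hub configuration for this repository).
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
tacotron2 = tacotron2.cuda().eval()
waveglow = waveglow.cuda().eval()

# Convert raw text to padded character-ID tensors.
sequences, lengths = utils.prepare_input_sequence(["The quick brown fox."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel-spectrogram
    audio = waveglow.infer(mel)                      # mel -> waveform samples
```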
The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.
Figure 2. Architecture of the WaveGlow model. Taken from the WaveGlow paper.
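To illustrate what one flow step computes, here is a minimal PyTorch sketch of an invertible 1x1 convolution followed by an affine coupling layer. It is illustrative only: the small conv stack stands in for the modified WaveNet, mel conditioning is omitted, and all names and sizes are assumptions rather than the repository's implementation.

```python
import torch
import torch.nn as nn

class AffineCouplingSketch(nn.Module):
    """One WaveGlow-style flow step, heavily simplified (illustrative only)."""

    def __init__(self, channels: int, hidden: int = 512):
        super().__init__()
        assert channels % 2 == 0, "coupling splits channels in half"
        # Invertible 1x1 convolution: initialized with an orthogonal matrix so
        # the transform is invertible (WaveGlow also tracks its log-determinant).
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        self.conv = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.conv.weight.data = w.unsqueeze(-1)
        # Stand-in for the WaveNet-like conditioning network: maps one half of
        # the channels to a log-scale and a shift for the other half.
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x = self.conv(x)                          # invertible channel mixing
        x_a, x_b = x.chunk(2, dim=1)              # split channels in half
        log_s, t = self.net(x_a).chunk(2, dim=1)  # predict scale and shift
        x_b = x_b * torch.exp(log_s) + t          # affine-transform second half
        # log_s contributes the log-determinant term to the flow's loss.
        return torch.cat([x_a, x_b], dim=1), log_s

# Example: a batch of 2 signals with 8 channels and 100 time steps.
flow = AffineCouplingSketch(channels=8)
z, log_s = flow(torch.randn(2, 8, 100))
```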
Both models support multi-GPU and mixed precision training with dynamic loss scaling (see Apex code here), as well as mixed precision inference. To speed up Tacotron 2 training, reference mel-spectrograms are generated during a preprocessing step and read directly from disk during training, instead of being generated during training.
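As an illustration of that preprocessing step, below is a minimal sketch of caching mel-spectrograms to disk. The extractor parameters (22,050 Hz sample rate, 1024-point FFT, hop of 256, 80 mel bins) are typical values for LJ Speech, not necessarily the repository's exact settings.

```python
import torch
import torchaudio

# Mel extractor with assumed LJ Speech-style parameters.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def cache_mel(wav_path: str, mel_path: str) -> None:
    """Compute a reference mel-spectrogram once and save it to disk."""
    audio, sample_rate = torchaudio.load(wav_path)
    mel = mel_extractor(audio)     # shape: (channels, n_mels, frames)
    torch.save(mel, mel_path)      # the training loader can torch.load() this
```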
The following features are supported by this model:
| Feature | Tacotron 2 | WaveGlow |
|---------|------------|----------|
| AMP | Yes | Yes |
| Apex DistributedDataParallel | Yes | Yes |
AMP - a tool that enables Tensor Core-accelerated training. For more information, refer to Enabling mixed precision.
Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`. `DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during `backward()` and bucketing smaller gradient transfers to reduce the total number of transfers required.
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from APEX, which casts variables to half-precision upon retrieval while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value to be used can be dynamic or fixed.
By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts launch mixed precision training with Tensor Cores. You can change this behaviour by removing the `--amp` flag from the `train.py` script.
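For example, a sketch using only the script and flag names mentioned above (everything else is whatever the scripts already define):

```bash
# Mixed precision with Tensor Cores (the default; the scripts pass --amp to train.py):
bash train_tacotron2.sh
bash train_waveglow.sh

# For FP32 training, edit the script and remove the --amp flag from its
# train.py command line.
```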
To enable mixed precision, the following steps were performed in the Tacotron 2 and WaveGlow models:
Import AMP from APEX:
```python
from apex import amp
amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
```
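These overrides move `softmax` from AMP's FP32 function list to its FP16 list, so softmax calls run in half precision on Tensor Cores instead of being automatically cast back to FP32.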
Initialize AMP:
```python
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```
If running on multi-GPU, wrap the model with `DistributedDataParallel`:

```python
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
```
Scale loss before backpropagation (assuming the loss is stored in a variable called `losses`):

Default backpropagation for FP32:

```python
losses.backward()
```

Scale the loss and backpropagate with AMP:

```python
with amp.scale_loss(losses, optimizer) as scaled_losses:
    scaled_losses.backward()
```
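Putting these steps together, here is a minimal, self-contained sketch of an AMP training loop. It uses a toy linear model and random data rather than the repository's training code, and it requires a CUDA GPU with APEX installed.

```python
import torch
import torch.nn as nn
from apex import amp

# Toy stand-ins for the real model, optimizer, loss, and data.
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step 1: port the model/optimizer to mixed precision (O1 inserts casts).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

criterion = nn.MSELoss()
inputs = torch.randn(16, 80, device="cuda")
targets = torch.randn(16, 80, device="cuda")

for step in range(10):
    optimizer.zero_grad()
    losses = criterion(model(inputs), targets)
    # Step 2: scale the loss so small gradients survive FP16 backprop.
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()
    optimizer.step()
```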
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
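For reference, PyTorch also exposes flags that control TF32 explicitly; a minimal sketch is below (the defaults of these flags vary across PyTorch releases):

```python
import torch

# Allow TF32 on Tensor Cores for matrix multiplications and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```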