
The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts.

NVIDIA Deep Learning Examples

Text To Speech

Other

20.06.0

November 4, 2022

44.18 KB

This resource uses open-source code maintained on GitHub (see the quick-start-guide section) and available for download from NGC.

This text-to-speech (TTS) system is a combination of two neural network models:

- a modified Tacotron 2 model from the Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions paper
- a flow-based neural network model from the WaveGlow: A Flow-based Generative Network for Speech Synthesis paper

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional information, such as patterns and/or rhythms of speech.

Our implementation of the Tacotron 2 model differs from the model described in the paper. It uses Dropout instead of Zoneout to regularize the LSTM layers. Also, the original text-to-speech system proposed in the paper uses the WaveNet model to synthesize waveforms, whereas our implementation uses the WaveGlow model for this purpose.

Both models are based on the implementations in the NVIDIA GitHub repositories Tacotron 2 and WaveGlow, and are trained on the publicly available LJ Speech dataset.

The Tacotron 2 and WaveGlow models enable you to efficiently synthesize high-quality speech from text.

Both models are trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.0x faster for Tacotron 2 and 3.1x faster for WaveGlow than training without Tensor Cores, while experiencing the benefits of mixed precision training. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder (blue blocks in the figure below) transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder (orange blocks) that produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet (green block) is replaced by the flow-based generative WaveGlow.

Figure 1. Architecture of the Tacotron 2 model. Taken from the Tacotron 2 paper.

The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.

Figure 2. Architecture of the WaveGlow model. Taken from the WaveGlow paper.
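The invertibility that lets the network run in one direction during training and in the other during inference can be illustrated with a toy affine coupling step. This is a minimal sketch in plain Python, not the WaveGlow code: `toy_net` is a hypothetical stand-in for the modified WaveNet, the input values are made up, and the invertible convolution is omitted.

```python
import math

def toy_net(x_a, mel):
    """Hypothetical stand-in for the WaveNet-like network: returns (log_s, t)."""
    log_s = [math.tanh(a + m) for a, m in zip(x_a, mel)]
    t = [0.5 * a - m for a, m in zip(x_a, mel)]
    return log_s, t

def coupling_forward(x_a, x_b, mel):
    """x_a passes through unchanged; x_b is scaled and shifted using x_a and mel."""
    log_s, t = toy_net(x_a, mel)
    y_b = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, x_b, t)]
    # The log-determinant of the Jacobian is the sum of the log scales.
    return x_a, y_b, sum(log_s)

def coupling_inverse(y_a, y_b, mel):
    """Undo the forward step: the same network output is recomputed from y_a."""
    log_s, t = toy_net(y_a, mel)
    x_b = [(b - ti) * math.exp(-ls) for ls, b, ti in zip(log_s, y_b, t)]
    return y_a, x_b

# Made-up values for the two halves of the input and the conditioning.
x_a, x_b, mel = [0.1, -0.4], [1.5, -2.0], [0.3, 0.7]
y_a, y_b, log_det = coupling_forward(x_a, x_b, mel)
_, x_b_rec = coupling_inverse(y_a, y_b, mel)
```

Because `x_a` is passed through unchanged, the inverse can recompute the same scale and shift without ever inverting `toy_net` itself, so `x_b_rec` matches `x_b` up to floating-point error.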

Both models support multi-GPU and mixed precision training with dynamic loss scaling (see Apex code here), as well as mixed precision inference. To speed up Tacotron 2 training, reference mel-spectrograms are generated during a preprocessing step and read directly from disk during training, instead of being generated during training.

The following features were implemented in this model:

- data-parallel multi-GPU training
- dynamic loss scaling with backoff for Tensor Cores (mixed precision) training.

The following features are supported by this model:

Feature | Tacotron 2 | WaveGlow |
---|---|---|
AMP | Yes | Yes |
Apex DistributedDataParallel | Yes | Yes |

AMP - a tool that enables Tensor Core-accelerated training. For more information, refer to Enabling mixed precision.

Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`. `DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during `backward()` and bucketing smaller gradient transfers to reduce the total number of transfers required.
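The idea of bucketing gradient transfers can be illustrated with a small, framework-free sketch. It is not how Apex implements bucketing internally: the function name `bucket_gradients`, the parameter names, and the sizes are hypothetical and only show why grouping small gradients reduces the number of transfers.

```python
def bucket_gradients(grad_sizes, bucket_cap):
    """Greedily group (name, size) pairs into buckets of at most bucket_cap.

    A gradient larger than bucket_cap gets a bucket of its own.
    """
    buckets, current, current_size = [], [], 0
    for name, size in grad_sizes:
        # Flush the current bucket if adding this gradient would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# Hypothetical gradient sizes, in arbitrary units.
grad_sizes = [("a", 4), ("b", 3), ("c", 5), ("d", 10), ("e", 1)]
buckets = bucket_gradients(grad_sizes, bucket_cap=8)
```

In this toy example, five individual transfers become four bucketed ones. With many small parameters the reduction is larger.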

*Mixed precision* is the combined use of different numerical precisions in a
computational method. Mixed precision
training offers significant computational speedup by performing operations in
half-precision format, while storing minimal information in single-precision
to retain as much information as possible in critical parts of the network.
Since the introduction of Tensor Cores
in Volta, and following with both the Turing and Ampere architectures,
significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on
the most arithmetically intense model architectures. Using mixed precision
training requires two steps:

- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
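The effect of the second step can be seen with the standard library alone. The sketch below uses the `struct` module's half-precision format to round values to FP16. The gradient value and the scale factor are hypothetical and chosen only to make the underflow visible.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

grad = 1e-8          # hypothetical small gradient value
loss_scale = 1024.0  # hypothetical loss scale factor

unscaled = to_fp16(grad)              # too small for FP16: rounds to 0.0
scaled = to_fp16(grad * loss_scale)   # large enough to survive in FP16
recovered = scaled / loss_scale       # unscale in higher precision
```

Without scaling, the gradient is lost entirely. With scaling, the recovered value is within about 1% of the original, because the scaled value still falls in the FP16 subnormal range and is rounded there.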

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.

For information about:

- How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
- Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.
- APEX tools for mixed precision training, see the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from APEX that casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value to be used can be dynamic or fixed.
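The difference between a dynamic and a fixed scaling value can be sketched in plain Python. The class below illustrates the general technique (back off when gradients overflow, grow again after a run of stable steps). It is not the AMP implementation, and the class name, parameter names, and default values are assumptions made for this example.

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after stable steps."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0,
                 growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, scaled_grads):
        """Return True if the optimizer step should be applied, False if skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in scaled_grads)
        if overflow:
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough stable steps in a row: try a larger scale again.
            self.scale *= self.growth
            self.good_steps = 0
        return True

# Small, made-up values so the behaviour is visible in a few steps.
scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
history = []
for grads in ([0.5], [float("inf")], [0.25], [0.125]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))
```

In this sketch, a fixed scale corresponds to never changing `scale`. The dynamic variant trades an occasional skipped step for not having to choose the scale by hand.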

By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts will launch mixed precision training with Tensor Cores. You can change this behaviour by removing the `--amp` flag from the `train.py` script.

To enable mixed precision, the following steps were performed in the Tacotron 2 and WaveGlow models:

- Import AMP from APEX and move `softmax` from the FP32 function list to the FP16 function list:

  `from apex import amp`

  `amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')`

  `amp.lists.functional_overrides.FP16_FUNCS.append('softmax')`

- Initialize AMP:

  `model, optimizer = amp.initialize(model, optimizer, opt_level="O1")`

- If running on multi-GPU, wrap the model with `DistributedDataParallel`:

  `from apex.parallel import DistributedDataParallel as DDP`

  `model = DDP(model)`

- Scale loss before backpropagation (assuming loss is stored in a variable called `losses`):

  - Default backpropagate for FP32:

    `losses.backward()`

  - Scale loss and backpropagate with AMP:

    `with amp.scale_loss(losses, optimizer) as scaled_losses: scaled_losses.backward()`

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. TF32 is more robust than FP16 for models that require a high dynamic range for weights or activations.
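The dynamic-range difference can be sketched with the standard library. TF32 keeps the 8-bit exponent of FP32 and the 10-bit mantissa of FP16, so the example below emulates it by clearing the low 13 mantissa bits of an FP32 value. This truncation is an assumption made for illustration. The rounding performed by the hardware may differ, and the helper names are hypothetical.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_tf32_truncated(value):
    """Emulate TF32 by keeping sign, 8 exponent bits and the top 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFFE000  # clear the 13 least significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

small = 1e-8  # made-up value, below the smallest positive FP16 number

fp16_value = to_fp16(small)            # underflows to 0.0
tf32_value = to_tf32_truncated(small)  # keeps the FP32 exponent range
```

The small value vanishes in FP16 but survives in the emulated TF32 format with a relative error below 0.1%, which is the kind of robustness the paragraph above refers to.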

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.