FastPitch is one of two major components in a neural, text-to-speech (TTS) system:

* a mel-spectrogram generator such as FastPitch, and
* a waveform synthesizer (vocoder) that converts the mel-spectrograms into audio.

Such a two-component TTS system is able to synthesize natural-sounding speech from raw transcripts.

The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text. In version 1.1, it does not need any pre-trained aligning model to bootstrap from. It allows the user to exert additional control over the synthesized utterances, such as:

* modifying the pitch contour to control the prosody,
* altering the pace of speech.
Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases like Tacotron 2 does. This is reflected in Mean Opinion Scores (details).
Model | Mean Opinion Score (MOS) |
---|---|
Tacotron 2 | 3.946 ± 0.134 |
FastPitch 1.0 | 4.080 ± 0.133 |
The current version of the model offers even higher quality, as reflected in the pairwise preference scores (details).
Model | Average preference |
---|---|
FastPitch 1.0 | 0.435 ± 0.068 |
FastPitch 1.1 | 0.565 ± 0.068 |
The FastPitch model is based on the FastSpeech model. The main differences between FastPitch and FastSpeech are that FastPitch:

* explicitly learns to predict the pitch contour,
* does not rely on distilling mel-spectrograms from a teacher model.
The FastPitch model is similar to FastSpeech2, which was developed concurrently. FastPitch averages pitch/energy values over input tokens and treats energy as optional.
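As a concrete illustration of that averaging step, the following is a minimal sketch (not the repository's code) of reducing frame-level pitch values to one value per input token, given per-token durations in frames:

```python
import torch

def average_pitch_per_token(pitch, durations):
    """Illustrative sketch only: average frame-level pitch over each input token.

    pitch:     (T_frames,) float tensor of per-frame pitch values
    durations: (T_tokens,) long tensor of per-token durations in frames,
               with durations.sum() == T_frames
    """
    # Map every frame to the index of the token it belongs to.
    token_ids = torch.repeat_interleave(torch.arange(len(durations)), durations)
    sums = torch.zeros(len(durations)).index_add_(0, token_ids, pitch)
    return sums / durations.clamp(min=1)  # guard against zero-duration tokens

# Toy example: 3 tokens spanning 2, 3, and 1 frames, respectively.
pitch = torch.tensor([110., 112., 220., 221., 219., 0.])
durations = torch.tensor([2, 3, 1])
print(average_pitch_per_token(pitch, durations))  # tensor([111., 220., 0.])
```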
FastPitch is trained on the publicly available LJ Speech dataset.
This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results from 2.0x to 2.7x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
FastPitch is a fully feedforward Transformer model that predicts mel-spectrograms from raw text (Figure 1). The entire process is parallel, which means that all input letters are processed simultaneously to produce a full mel-spectrogram in a single forward pass.
Figure 1. Architecture of FastPitch (source). The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks (the encoding), the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal of smoothing out the upsampled signal and constructing a mel-spectrogram.
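To make the data flow in Figure 1 concrete, below is a deliberately simplified sketch of the forward pass. The layer sizes, module choices, and rounding of predicted durations are placeholders for illustration, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class FastPitchSketch(nn.Module):
    """Simplified illustration of the Figure 1 data flow, not the actual FastPitch code."""

    def __init__(self, vocab_size=148, d_model=384, n_heads=4, n_layers=6, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)    # first N Transformer blocks
        self.decoder = nn.TransformerEncoder(block, n_layers)    # second N Transformer blocks
        self.pitch_predictor = nn.Linear(d_model, 1)             # one pitch value per token
        self.duration_predictor = nn.Linear(d_model, 1)          # frames per token
        self.pitch_embed = nn.Linear(1, d_model)                 # project pitch back into d_model
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, tokens):                                   # tokens: (1, T_text)
        h = self.encoder(self.embed(tokens))                     # (1, T_text, d_model)
        pitch = self.pitch_predictor(h)                          # (1, T_text, 1)
        h = h + self.pitch_embed(pitch)                          # augment with pitch information
        dur = self.duration_predictor(h).squeeze(-1)             # (1, T_text)
        repeats = dur.round().clamp(min=1).long()[0]
        # Discrete upsampling: repeat each token's state for its predicted duration.
        upsampled = torch.repeat_interleave(h, repeats, dim=1)
        return self.to_mel(self.decoder(upsampled))              # (1, T_frames, n_mels)

mel = FastPitchSketch()(torch.randint(0, 148, (1, 12)))          # toy input of 12 tokens
print(mel.shape)
```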
The FastPitch model supports multi-GPU and mixed precision training with dynamic loss scaling (see Apex code here), as well as mixed precision inference.
The following features were implemented in this model:

* data-parallel multi-GPU training,
* mixed precision training with dynamic loss scaling.
Pitch contours and mel-spectrograms can be generated on-line during training. To speed up training, they can instead be generated during a pre-processing step and read directly from disk during training. For more information on data pre-processing, refer to Dataset guidelines and the paper.
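As an illustration of that pre-processing step, the sketch below caches mel-spectrograms to disk with torchaudio; the extraction parameters here are placeholders rather than the values used by this repository, and the same idea applies to pitch contours:

```python
from pathlib import Path

import torch
import torchaudio

# Placeholder extraction parameters, not necessarily the repository's settings.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def precompute_mels(wav_paths, out_dir):
    """Compute mel-spectrograms once and cache them as .pt files for fast loading."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in wav_paths:
        waveform, sr = torchaudio.load(path)            # (channels, samples)
        mel = mel_transform(waveform.mean(dim=0))       # mono -> (n_mels, frames)
        torch.save(mel, out_dir / (Path(path).stem + ".pt"))

# During training, the dataset can then simply torch.load() the cached tensors.
```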
The following features are supported by this model.
Feature | FastPitch |
---|---|
Automatic mixed precision (AMP) | Yes |
Distributed data parallel (DDP) | Yes |
Automatic Mixed Precision (AMP) - This implementation uses the native PyTorch AMP implementation of mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
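A minimal sketch of a native PyTorch AMP training step is shown below; the model, optimizer, and data are placeholders, and a CUDA-capable GPU is assumed:

```python
import torch

model = torch.nn.Linear(80, 80).cuda()            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

for step in range(10):
    x = torch.randn(16, 80, device="cuda")        # placeholder batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run eligible ops in FP16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                 # scale the loss to preserve small gradients
    scaler.step(optimizer)                        # unscale gradients, update FP32 master weights
    scaler.update()                               # adjust the loss scale for the next step
```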
DistributedDataParallel (DDP) - The model uses PyTorch's implementation of distributed data parallelism at the module level, which can run across multiple machines.
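For illustration, the sketch below wraps a placeholder model in module-level distributed data parallelism using the plain torch.nn.parallel.DistributedDataParallel API; it assumes one process per GPU launched with torchrun and is not the repository's training script:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(80, 80).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])        # wrap at the module level

    optimizer = torch.optim.Adam(model.parameters())
    x = torch.randn(16, 80, device=local_rank)         # placeholder batch
    model(x).pow(2).mean().backward()                  # gradients are all-reduced here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```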
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about how to train using mixed precision, refer to the Mixed Precision Training paper and the Training With Mixed Precision documentation.
For training and inference, mixed precision can be enabled by adding the `--amp` flag. Mixed precision uses the native PyTorch implementation.
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
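Because TF32 is on by default, no code changes are required; for completeness, the flags below (available in recent PyTorch releases) show how it can be inspected or disabled:

```python
import torch

# TF32 state for matrix multiplications and cuDNN convolutions.
print(torch.backends.cuda.matmul.allow_tf32)
print(torch.backends.cudnn.allow_tf32)

# Disable TF32 to force full FP32 arithmetic, e.g. when debugging numerical issues.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```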
**Character duration** - The time during which a character is being articulated. It can be measured in milliseconds, mel-spectrogram frames, etc. Some characters are not pronounced and thus have zero duration.

**Fundamental frequency** - The lowest vibration frequency of a periodic soundwave, for example, produced by a vibrating instrument. It is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of the vocal cords. Abbreviated as f0.

**Pitch** - A perceived frequency of vibration of music or sound.

**Transformer** - The paper Attention Is All You Need introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.