This resource is using open-source code maintained in github (see the quick-start-guide section) and available for download from NGC
FastPitch is one of two major components in a neural, text-to-speech (TTS) system:
- a mel-spectrogram generator such as FastPitch or Tacotron 2, and
- a waveform synthesizer such as WaveGlow (refer to NVIDIA example code).
Such a two-component TTS system is able to synthesize natural-sounding speech from raw transcripts.
The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text. In version 1.1, it does not need any pre-trained aligning model to bootstrap from. It allows exerting additional control over the synthesized utterances, such as:
- modify the pitch contour to control the prosody,
- increase or decrease the fundamental frequency in a natural sounding way, that preserves the perceived identity of the speaker,
- alter the rate of speech,
- adjust the energy,
- specify input as graphemes or phonemes,
- switch speakers when the model has been trained with data from multiple speakers. Some of the capabilities of FastPitch are presented on the website with samples.
Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases as Tacotron 2 does. This is reflected in Mean Opinion Scores (details).
|Mean Opinion Score (MOS)
|3.946 ± 0.134
|4.080 ± 0.133
The current version of the model offers even higher quality, as reflected in the pairwise preference scores (details).
|0.435 ± 0.068
|0.565 ± 0.068
The FastPitch model is based on the FastSpeech model. The main differences between FastPitch and FastSpeech are that FastPitch:
- no dependence on external aligner (Transformer TTS, Tacotron 2); in version 1.1, FastPitch aligns audio to transcriptions by itself as in One TTS Alignment To Rule Them All,
- explicitly learns to predict the pitch contour,
- pitch conditioning removes harsh sounding artifacts and provides faster convergence,
- no need for distilling mel-spectrograms with a teacher model,
- capabilities to train a multi-speaker model.
The FastPitch model is similar to FastSpeech2, which has been developed concurrently. FastPitch averages pitch/energy values over input tokens, and treats energy as optional.
FastPitch is trained on a publicly available LJ Speech dataset.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta, NVIDIA Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results from 2.0x to 2.7x faster than training without Tensor Cores while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
FastPitch is a fully feedforward Transformer model that predicts mel-spectrograms from raw text (Figure 1). The entire process is parallel, which means that all input letters are processed simultaneously to produce a full mel-spectrogram in a single forward pass.
Figure 1. Architecture of FastPitch (source). The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks, encoding, the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal of smoothing out the upsampled signal and constructing a mel-spectrogram.
The FastPitch model supports multi-GPU and mixed precision training with dynamic loss scaling (refer to Apex code here), as well as mixed precision inference.
The following features were implemented in this model:
- data-parallel multi-GPU training,
- dynamic loss scaling with backoff for Tensor Cores (mixed precision) training,
- gradient accumulation for reproducible results regardless of the number of GPUs.
Pitch contours and mel-spectrograms can be generated online during training. To speed-up training, those could be generated during the pre-processing step and read directly from the disk during training. For more information on data pre-processing, refer to Dataset guidelines and the paper.
Feature support matrix
The following features are supported by this model.
|Automatic mixed precision (AMP)
|Distributed data parallel (DDP)
Automatic Mixed Precision (AMP) - This implementation uses native PyTorch AMP implementation of mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
DistributedDataParallel (DDP) - The model uses PyTorch Lightning implementation of distributed data parallelism at the module level, which can run across multiple machines.
Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in NVIDIA Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
- Porting the model to use the FP16 data type where appropriate.
- Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, refer to the Mixed Precision Training paper and Training With Mixed Precision documentation.
- Techniques used for mixed precision training, refer to the Mixed-Precision Training of Deep Neural Networks blog.
- APEX tools for mixed precision training, refer to the NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch.
Enabling mixed precision
For training and inference, mixed precision can be enabled by adding the
Mixed precision is using native PyTorch implementation.
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
Character duration The time during which a character is being articulated. It could be measured in milliseconds, mel-spectrogram frames, and so on. Some characters are not pronounced, and thus, have 0 duration.
Fundamental frequency The lowest vibration frequency of a periodic soundwave, for example, is produced by a vibrating instrument, and it is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of vocal cords. It is abbreviated as f0.
Pitch A perceived frequency of vibration of music or sound.
Transformer The paper Attention Is All You Need introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.