Tacotron2 PyTorch checkpoint (FP32)

Description

Tacotron2 PyTorch checkpoint trained with FP32

Publisher

NVIDIA Deep Learning Examples

Use Case

Speech Synthesis

Framework

PyTorch

Latest Version

19.09.0

Modified

October 29, 2021

Size

107.59 MB

Model Overview

Together, the Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesise natural-sounding speech from raw transcripts.

Model Architecture

The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder (blue blocks in the figure below) transforms the whole text into a hidden feature representation, with one fixed-size vector per input symbol. This feature representation is then consumed by the autoregressive decoder (orange blocks), which produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet (green block) is replaced by the flow-based generative WaveGlow model.
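The frame-by-frame decoding described above can be illustrated with a toy loop. This is a minimal sketch, not the model's actual code: the function names (`softmax`, `attend`, `decode`), the plain dot-product attention, the all-zero start frame, the averaging update rule, and the fixed frame count are all assumptions made for illustration. The real decoder uses learned recurrent layers and location-sensitive attention, and it predicts when to stop.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Toy dot-product attention (an assumption, not the model's attention)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

def decode(encoder_states, max_frames=5):
    """Autoregressive loop: each new frame depends on the previous frame."""
    dim = len(encoder_states[0])
    prev_frame = [0.0] * dim  # hypothetical start frame
    frames = []
    for _ in range(max_frames):  # fixed length here; no stop prediction
        context, _ = attend(prev_frame, encoder_states)
        # Hypothetical update rule standing in for the learned decoder layers.
        frame = [0.5 * p + 0.5 * c for p, c in zip(prev_frame, context)]
        frames.append(frame)
        prev_frame = frame
    return frames

# Three toy "encoder outputs", one fixed-size vector per input symbol.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = decode(encoder_states)
print(len(frames), "frames of size", len(frames[0]))
```

The point of the sketch is the data dependency: frame n cannot be computed before frame n-1, which is what makes the decoder autoregressive.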

Figure 1. Architecture of the Tacotron 2 model. Taken from the Tacotron 2 paper.

The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.
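To see why an affine coupling layer can be inverted exactly, consider the following minimal sketch in plain Python. It is not the WaveGlow implementation: `toy_conditioner` is a hypothetical stand-in for the WaveNet-like network, the vectors are tiny, and the invertible convolution is omitted. Half of the input passes through unchanged and determines the scale and shift applied to the other half, so the same scale and shift can be recomputed when running the layer backwards.

```python
import math

def toy_conditioner(x_a, mel):
    """Hypothetical stand-in for the WaveNet-like network: returns (log_s, t)."""
    log_s = [0.1 * a + 0.05 * m for a, m in zip(x_a, mel)]
    t = [0.5 * a - 0.2 * m for a, m in zip(x_a, mel)]
    return log_s, t

def coupling_forward(x, mel):
    """Training direction (data to latent). Returns output and log-determinant."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, mel)
    y_b = [b * math.exp(ls) + ti for b, ls, ti in zip(x_b, log_s, t)]
    # x_a is passed through unchanged, so the inverse can recompute log_s and t.
    return x_a + y_b, sum(log_s)

def coupling_inverse(y, mel):
    """Inference direction (latent to data): undo the shift, then the scale."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_conditioner(y_a, mel)
    x_b = [(b - ti) * math.exp(-ls) for b, ls, ti in zip(y_b, log_s, t)]
    return y_a + x_b

x = [0.3, -1.2, 0.8, 2.0]   # toy "audio" vector
mel = [0.5, -0.4]           # toy conditioning values
y, log_det = coupling_forward(x, mel)
x_rec = coupling_inverse(y, mel)
print(max(abs(a - b) for a, b in zip(x, x_rec)))  # reconstruction error
```

Note that the conditioner itself never has to be inverted, which is why an arbitrary network can be used in that role. The log-determinant of the transform is simply the sum of the log-scales.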

Figure 2. Architecture of the WaveGlow model. Taken from the WaveGlow paper.

Training

This model was trained using the script available on NGC and in the GitHub repo.

Dataset

The following datasets were used to train this model:

  • LJSpeech-1.1 - Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
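As a quick sanity check on the figures above, the average clip length implied by the clip count and total duration can be computed directly. The variable names are illustrative, and the result is only as precise as the "approximately 24 hours" figure.

```python
NUM_CLIPS = 13_100   # clip count stated in the dataset description
TOTAL_HOURS = 24     # approximate total duration stated above

mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS
print(f"{mean_clip_seconds:.1f} s per clip on average")  # 6.6 s per clip on average
```

The result falls inside the stated 1 to 10 second range of clip lengths.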

Performance

Performance numbers for this model are available on NGC.

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the licenses of the script and the datasets from which the model was derived.