HiFi-GAN PyT checkpoint (FastPitch ftune, 22kHz, AMP)

NVIDIA Deep Learning Examples


HiFi-GAN v1 PyTorch checkpoint trained on 8 GPUs with AMP on LJSpeech-1.1 (22 kHz) and fine-tuned on FastPitch outputs.

Model Overview

HiFi-GAN is a spectrogram inversion model that synthesizes speech waveforms from mel-spectrograms.
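To make the input/output relationship concrete, the sketch below estimates how many waveform samples a generator of this kind produces for a given number of mel-spectrogram frames. The sampling rate of 22050 Hz is the LJSpeech rate (the "22kHz" in this card); the hop length of 256 is an assumption, since this card does not state it, and the function names are hypothetical.

```python
SAMPLING_RATE = 22050  # LJSpeech sampling rate in Hz ("22kHz" in this card)
HOP_LENGTH = 256       # ASSUMPTION: hop length is not stated in this card


def waveform_length(num_mel_frames: int, hop_length: int = HOP_LENGTH) -> int:
    """Number of audio samples generated for a mel-spectrogram.

    Assumes the generator upsamples each mel frame to exactly
    `hop_length` waveform samples.
    """
    return num_mel_frames * hop_length


def duration_seconds(
    num_mel_frames: int,
    hop_length: int = HOP_LENGTH,
    sampling_rate: int = SAMPLING_RATE,
) -> float:
    """Duration in seconds of the synthesized audio."""
    return waveform_length(num_mel_frames, hop_length) / sampling_rate


if __name__ == "__main__":
    frames = 100
    print(waveform_length(frames), "samples")
    print(round(duration_seconds(frames), 3), "seconds")
```

Under these assumptions, a longer mel-spectrogram maps linearly to a longer waveform, so the expected output length can be checked before running synthesis.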

Model Architecture

The entire model is composed of a generator and two discriminators. Each discriminator can be further divided into smaller sub-networks that work at different resolutions. The loss functions take as inputs the intermediate feature maps and the outputs of those sub-networks. After training, the generator is used for synthesis, and the discriminators are discarded. All three components are convolutional networks with different architectures.
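The way the losses consume sub-network outputs and intermediate feature maps can be sketched in plain Python. This is a minimal illustration, not the training code from the repository: it assumes a least-squares adversarial term and an L1 feature-matching term, as described in the HiFi-GAN paper (this card does not name the losses), and it represents every tensor as a flat list of floats. All function names are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length flat feature maps."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_maps, fake_maps):
    """Sum of L1 distances over all feature maps of all sub-networks.

    real_maps[i][j] is feature map j of sub-network i for real audio;
    fake_maps[i][j] is the same feature map for generated audio.
    """
    total = 0.0
    for real_sub, fake_sub in zip(real_maps, fake_maps):
        for real_map, fake_map in zip(real_sub, fake_sub):
            total += l1_distance(real_map, fake_map)
    return total


def generator_adversarial_loss(fake_outputs):
    """Least-squares generator term (ASSUMED form).

    fake_outputs[i] holds the scores of sub-network i on generated audio;
    the generator is rewarded when those scores approach 1.
    """
    total = 0.0
    for scores in fake_outputs:
        total += sum((s - 1.0) ** 2 for s in scores) / len(scores)
    return total


if __name__ == "__main__":
    real = [[[1.0, 2.0]]]   # one sub-network, one feature map
    fake = [[[1.0, 4.0]]]
    print(feature_matching_loss(real, fake))
    print(generator_adversarial_loss([[0.0, 1.0]]))
```

Because the discriminators only supply these training signals, they are not needed at inference time, which matches the statement that they are discarded after training.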

Figure 1. The architecture of HiFi-GAN

Training

This model was trained using the script available in the GitHub repo.

Dataset

The following datasets were used to train this model:

LJSpeech-1.1 - Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
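As a quick sanity check on the dataset figures above, the average clip length follows from the clip count and the total duration. The sketch uses only the numbers quoted in this card; since the total is approximate, the result is approximate as well.

```python
NUM_CLIPS = 13_100   # number of clips, from the dataset description
TOTAL_HOURS = 24     # approximate total duration, from the dataset description

# Convert the total duration to seconds and divide by the clip count.
average_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS

if __name__ == "__main__":
    print(f"Average clip length: about {average_clip_seconds:.1f} s")
```

The result falls inside the stated 1 to 10 second range, which is consistent with the description.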

Performance

Performance numbers for this model are available in the performance section of the GitHub README.

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets the model was derived from.

Publisher

NVIDIA Deep Learning Examples

Latest Version: 21.08.0_amp

Updated: April 4, 2023 UTC

Compressed Size: 53.23 MB

Labels

Deep Learning Examples, Text to Speech, TTS