NGC | Catalog

HiFi-GAN PyT checkpoint (22kHz, AMP)

HiFi-GAN v1 PyTorch checkpoint trained on 8 GPUs with AMP (automatic mixed precision) on LJSpeech-1.1 (22kHz).
NVIDIA Deep Learning Examples
Latest Version
April 4, 2023
53.24 MB

Model Overview

The HiFi-GAN model implements spectrogram inversion: it synthesizes speech waveforms from mel-spectrograms.
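
To make the input/output relationship concrete, the sketch below estimates how many waveform samples a generator of this kind would produce for a mel-spectrogram of a given length. The upsampling factors (8, 8, 2, 2) and the 22,050 Hz sampling rate are assumptions based on the commonly published HiFi-GAN v1 configuration for LJSpeech; they are not stated on this page, and the function names are hypothetical.

```python
import math

# Assumed values (not stated on this page): settings commonly
# published for HiFi-GAN v1 trained on LJSpeech.
UPSAMPLE_RATES = (8, 8, 2, 2)  # per-stage upsampling factors of the generator
SAMPLING_RATE = 22050          # samples per second ("22kHz")


def samples_per_frame(upsample_rates=UPSAMPLE_RATES):
    """Waveform samples produced for one mel-spectrogram frame."""
    return math.prod(upsample_rates)


def waveform_length(num_frames, upsample_rates=UPSAMPLE_RATES):
    """Total waveform samples for a mel-spectrogram with num_frames frames."""
    return num_frames * samples_per_frame(upsample_rates)


def duration_seconds(num_frames, sampling_rate=SAMPLING_RATE):
    """Duration of the synthesized audio in seconds."""
    return waveform_length(num_frames) / sampling_rate


print(samples_per_frame())               # 256
print(waveform_length(430))              # 110080
print(round(duration_seconds(430), 2))   # 4.99
```

Under these assumptions, each mel-spectrogram frame expands to 256 samples, so a 430-frame input corresponds to roughly five seconds of audio.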

Model Architecture

The entire model is composed of a generator and two discriminators. Each discriminator can be further divided into smaller sub-networks that work at different resolutions. The loss functions take as inputs the intermediate feature maps and the outputs of those sub-networks. After training, only the generator is used for synthesis; the discriminators are discarded. All three components are convolutional networks with different architectures.

Figure 1. The architecture of HiFi-GAN
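
As a rough illustration of how losses can consume both the intermediate feature maps and the final outputs of a discriminator sub-network, here is a minimal sketch in plain Python. It assumes the least-squares adversarial objective and the L1 feature-matching term described in the HiFi-GAN paper; this page does not spell out the loss formulas, and the function names and toy numbers are hypothetical. Feature maps are represented as flat lists of floats for simplicity.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def feature_matching_loss(real_feature_maps, fake_feature_maps):
    """Sum over layers of the mean absolute difference between the
    feature maps computed on real audio and on generated audio."""
    total = 0.0
    for real, fake in zip(real_feature_maps, fake_feature_maps):
        total += mean(abs(r - f) for r, f in zip(real, fake))
    return total


def generator_adversarial_loss(fake_scores):
    """Least-squares generator loss: push scores on generated audio toward 1."""
    return mean((s - 1.0) ** 2 for s in fake_scores)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return (mean((s - 1.0) ** 2 for s in real_scores)
            + mean(s ** 2 for s in fake_scores))


# Toy values for one hypothetical sub-network with two layers.
real_maps = [[1.0, 2.0], [0.5, 0.5, 0.5]]
fake_maps = [[1.0, 1.0], [0.0, 0.5, 1.0]]
real_scores = [1.0, 0.5]
fake_scores = [0.5, 1.0]

print(generator_adversarial_loss(fake_scores))        # 0.125
print(discriminator_loss(real_scores, fake_scores))   # 0.75
print(round(feature_matching_loss(real_maps, fake_maps), 4))  # 0.8333
```

In a full training setup these terms would be summed over every sub-network of both discriminators; the weighting between terms is left out here because it is not given on this page.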


This model was trained using the script available in the GitHub repo.


The following datasets were used to train this model:

  • LJSpeech-1.1 - Dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
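
As a quick sanity check on the dataset figures above, the following sketch derives the average clip length from the clip count and the approximate total duration. The function name is hypothetical, and the result is only as precise as the "approximately 24 hours" figure.

```python
NUM_CLIPS = 13_100   # clips in LJSpeech-1.1, as stated above
TOTAL_HOURS = 24     # approximate total length, as stated above


def average_clip_seconds(num_clips, total_hours):
    """Average clip duration in seconds."""
    return total_hours * 3600 / num_clips


print(round(average_clip_seconds(NUM_CLIPS, TOTAL_HOURS), 1))  # 6.6
```

An average of about 6.6 seconds is consistent with clips ranging from 1 to 10 seconds.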


Performance numbers for this model are available in the performance section of the GitHub README.



This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets the model was derived from.