HiFi-GAN is a spectrogram inversion model that synthesizes speech waveforms from mel-spectrograms.
The model consists of a generator and two discriminators. Each discriminator is further divided into smaller sub-networks that operate at different resolutions. The loss functions take the intermediate feature maps and the outputs of those sub-networks as inputs. After training, only the generator is used for synthesis; the discriminators are discarded. All three components are convolutional networks with different architectures.
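To make the role of the sub-network outputs and feature maps more concrete, the following is a minimal sketch of how such losses could be computed. It is not the training script's code. It assumes a least-squares adversarial objective and an L1 feature-matching term, and it represents each feature map and each sub-network output as a flat list of floats. The function names and the nested-list layout are hypothetical, chosen only for illustration; a mel-spectrogram reconstruction term and any loss weights are left out.

```python
# Illustrative sketch only: plain-Python stand-ins for tensors.
# All function names and the nested-list layout are hypothetical.


def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between feature maps of real and generated audio.

    real_feats / fake_feats: one entry per sub-network, each a list of
    per-layer feature maps, each feature map a flat list of floats.
    """
    total = 0.0
    for sub_real, sub_fake in zip(real_feats, fake_feats):
        for layer_real, layer_fake in zip(sub_real, sub_fake):
            diffs = [abs(r - f) for r, f in zip(layer_real, layer_fake)]
            total += sum(diffs) / len(diffs)  # mean absolute error per layer
    return total


def generator_adversarial_loss(fake_outputs):
    """Least-squares term (assumed): the generator wants scores near 1."""
    total = 0.0
    for out in fake_outputs:  # one list of scores per sub-network
        total += sum((1.0 - o) ** 2 for o in out) / len(out)
    return total


def discriminator_adversarial_loss(real_outputs, fake_outputs):
    """Least-squares term (assumed): real scores near 1, generated near 0."""
    total = 0.0
    for real, fake in zip(real_outputs, fake_outputs):
        total += sum((1.0 - r) ** 2 for r in real) / len(real)
        total += sum(f ** 2 for f in fake) / len(fake)
    return total
```

In this sketch each loss is summed over all sub-networks, so every resolution contributes to the gradient signal. Only the generator-side terms matter once training is finished, since the discriminators are discarded.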
Figure 1. The architecture of HiFi-GAN
This model was trained using the script available in the GitHub repository.
The following datasets were used to train this model:
Performance numbers for this model are available in the performance section of the GitHub README.