TTS En FastPitch SpectrogramEnhancer For-ASR-Finetuning

Description
This collection contains FastPitch and Spectrogram Enhancer models. The main use case is domain fine-tuning of English ASR models. Direct TTS use is not advised.
Publisher
NVIDIA
Latest Version
1.20.0
Modified
July 10, 2023
Size
191.51 MB

Model Overview

This collection contains FastPitch together with a Spectrogram Enhancer, both trained on LibriTTS. The STFT parameters match those commonly used in ASR (25 ms window length, 10 ms hop).

The intended use is ASR adaptation using text-only data. FastPitch generates mel-spectrograms for texts in the target domain, and the enhancer makes these synthetic mel-spectrograms look more like spectrograms computed from real speech. The spectrograms are generated on the fly and used for ASR adaptation [2].

Model Architecture

FastPitch [1] is a parallel (non-autoregressive) text-to-speech model. It accepts text tokens and outputs a spectrogram.

Spectrogram Enhancer [2] is a generative network derived from StyleGAN2 [3]. It is trained to bridge the gap between FastPitch-generated synthetic spectrograms and spectrograms of real speech. Visually, the effect can be interpreted as "adding details". For details, refer to [2].

Training

FastPitch uses English text (letters) and ARPAbet (phonemes) as input. Training was done using the example script and config. The main difference from the default config is the spectrogram parameters: 8000 Hz fmax, 25 ms window, 10 ms hop. The model was trained for 292 epochs.
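
The window and hop durations above can be converted to sample counts once a sample rate is fixed. The sketch below assumes 16 kHz audio, which is not stated in this card; it is chosen only because the 8000 Hz fmax equals the Nyquist frequency at that rate. The power-of-two FFT size is a common convention and likewise an assumption, and the helper name is hypothetical.

```python
def stft_params_in_samples(sample_rate_hz, window_ms=25, hop_ms=10):
    """Convert STFT window/hop durations (milliseconds) to sample counts."""
    win_length = sample_rate_hz * window_ms // 1000
    hop_length = sample_rate_hz * hop_ms // 1000
    # Assumption: FFT size is the smallest power of two that fits the window.
    n_fft = 1
    while n_fft < win_length:
        n_fft *= 2
    return {"win_length": win_length, "hop_length": hop_length, "n_fft": n_fft}


SAMPLE_RATE_HZ = 16000  # assumption: the sample rate is not given in this card
params = stft_params_in_samples(SAMPLE_RATE_HZ)
nyquist_hz = SAMPLE_RATE_HZ // 2  # highest representable frequency at this rate

print(params)
print("Nyquist frequency (Hz):", nyquist_hz)
```

Under these assumptions the 25 ms window spans 400 samples and the 10 ms hop spans 160 samples; the actual values should be read from the model config.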

Spectrogram Enhancer was trained on pairs of real and synthetic data (script). For this, the LibriTTS train-960 subset was resynthesized with the FastPitch model described above (script). The resynthesis process uses ground-truth alignment and pitch. The model was trained for 20 epochs.

Datasets

Both models in this collection are trained on the LibriTTS train-960 subset (train-clean-100 + train-clean-360 + train-other-500).

How to Use this Model

The models are available for use in the NeMo toolkit [4]. The intended use case is fine-tuning English ASR models; please see the tutorials.

Automatically load the model from NGC

# Both model classes are part of the NeMo TTS collection.
import nemo
from nemo.collections.tts.models import FastPitchModel, SpectrogramEnhancerModel

# from_pretrained fetches each checkpoint from NGC by its model name.
fastpitch = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch_for_asr_finetuning")
enhancer = SpectrogramEnhancerModel.from_pretrained(model_name="tts_en_spectrogram_enhancer_for_asr_finetuning")

Input/Output

The first model (FastPitch) accepts a mixture of English text and ARPAbet phonemes and produces spectrograms as if they had been computed from an audio signal containing the spoken input text. The assumed STFT parameters are a 25 ms window and a 10 ms hop.

The second model (Spectrogram Enhancer) accepts an 80-band mel-spectrogram and outputs an 80-band mel-spectrogram of the same size and content.
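
As a rough sanity check on tensor sizes, the 10 ms hop implies about 100 spectrogram frames per second of speech, independent of sample rate. The sketch below estimates the (bands, frames) shape for a given utterance duration. The helper name is hypothetical, and the exact frame count depends on the padding convention of the STFT implementation, so treat the result as approximate.

```python
N_MELS = 80  # number of mel bands, from this card
HOP_MS = 10  # STFT hop in milliseconds, from this card


def approx_mel_shape(duration_ms, n_mels=N_MELS, hop_ms=HOP_MS):
    """Approximate (n_mels, n_frames) for speech lasting duration_ms."""
    # One frame per hop; padding may add or remove a frame in practice.
    n_frames = duration_ms // hop_ms
    return (n_mels, n_frames)


fastpitch_output_shape = approx_mel_shape(3500)  # a 3.5 s utterance
# The enhancer returns a spectrogram of the same size as its input.
enhancer_output_shape = fastpitch_output_shape

print(fastpitch_output_shape, enhancer_output_shape)
```

Comparing the enhancer output shape against its input in this way is a cheap check when wiring the two models together.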

Limitations

Models in this collection are not intended for regular text-to-speech use. This is because ASR spectrogram parameters are suboptimal for TTS tasks.

References

[1] FastPitch: Parallel Text-to-speech with Pitch Prediction

[2] Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

[3] Analyzing and Improving the Image Quality of StyleGAN

[4] NVIDIA NeMo Toolkit

License

The license to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.