This collection contains FastPitch together with a Spectrogram Enhancer, both trained on LibriTTS. The STFT parameters match those commonly used in ASR (25 ms window length, 10 ms hop).
The intended use is ASR adaptation using text-only data: FastPitch generates mel-spectrograms for texts in the target domain, and the enhancer makes these synthetic mel-spectrograms look more like spectrograms computed from real speech data. The spectrograms are generated on the fly and used for ASR adaptation [2].
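Conceptually, one step of such on-the-fly adaptation looks like the sketch below. The ASR model name and the tiny transcript list are placeholders chosen only for illustration, the enhancer keyword arguments are an assumption about its forward signature, and feeding enhanced mels through processed_signal assumes the ASR model accepts precomputed features; the supported recipe is the ASR-with-TTS training described in [2] and the linked tutorials.

import torch
from nemo.collections.asr.models import ASRModel
from nemo.collections.tts.models import FastPitchModel, SpectrogramEnhancerModel

# Placeholder ASR model and transcripts, used here only to illustrate the loop.
asr_model = ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_small")
fastpitch = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch_for_asr_finetuning")
enhancer = SpectrogramEnhancerModel.from_pretrained(model_name="tts_en_spectrogram_enhancer_for_asr_finetuning")
transcripts = ["an example sentence from the target domain"]

for text in transcripts:
    with torch.no_grad():
        tokens = fastpitch.parse(text)                        # text -> token ids
        mel = fastpitch.generate_spectrogram(tokens=tokens)   # [1, 80, T] synthetic mel
        # Enhancer keyword arguments are an assumption; check the model's docstring.
        mel = enhancer(input_spectrograms=mel, lengths=torch.tensor([mel.shape[-1]]))
    # Use the enhanced mel as if it were features computed from real audio
    # (assumes the ASR model accepts precomputed features via processed_signal).
    log_probs, encoded_len, _ = asr_model(
        processed_signal=mel,
        processed_signal_length=torch.tensor([mel.shape[-1]]),
    )
    # ...compute the usual ASR loss against `text` and take an optimizer step...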
FastPitch [1] is a parallel (non-autoregressive) text-to-speech model. It accepts text tokens and outputs a spectrogram.
Spectrogram Enhancer [2] is a generative network derived from StyleGAN2 [3]. It is trained to bridge the gap between synthetic spectrograms generated by FastPitch and spectrograms computed from real audio. Visually, the effect can be interpreted as "adding details"; see [2] for a full description.
FastPitch uses English text (letters) and ARPAbet (phonemes) as input. Training was done using the example script and config. The main difference from the default config is the spectrogram parameters: 8000 Hz fmax, 25 ms window, 10 ms hop. Training took 292 epochs.
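For reference, these parameters correspond roughly to the mel-spectrogram transform sketched below; the 16 kHz sample rate is an assumption (typical for ASR front ends), and the log scaling and normalization applied by NeMo's own preprocessor are omitted.

import torchaudio

sample_rate = 16000                     # assumed ASR sample rate
win_length = int(0.025 * sample_rate)   # 25 ms window -> 400 samples
hop_length = int(0.010 * sample_rate)   # 10 ms hop    -> 160 samples

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=512,               # power of two covering the 400-sample window
    win_length=win_length,
    hop_length=hop_length,
    n_mels=80,
    f_max=8000.0,
)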
The Spectrogram Enhancer was trained on pairs of real and synthetic data (script). For this, the LibriTTS train-960 subset was resynthesized with the FastPitch model described above (script). The resynthesis process uses ground-truth alignment and pitch. Training took 20 epochs.
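A rough sketch of how one such pair can be formed for a single utterance is shown below. The file path and transcript are placeholders, resampling and the log scaling/normalization done by NeMo's preprocessor are omitted, and conditioning FastPitch on ground-truth alignment and pitch (which the actual script does) is not shown.

import torch
import torchaudio
from nemo.collections.tts.models import FastPitchModel

fastpitch = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch_for_asr_finetuning")

# Placeholder utterance: one LibriTTS wav file and its transcript.
audio, sr = torchaudio.load("libritts_sample.wav")
transcript = "an example transcript of the same utterance"

# "Real" mel computed from the waveform with the ASR-style parameters above.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024,
    win_length=int(0.025 * sr), hop_length=int(0.010 * sr),
    n_mels=80, f_max=8000.0,
)
real_mel = to_mel(audio)

# "Synthetic" mel resynthesized by FastPitch from the transcript alone.
with torch.no_grad():
    tokens = fastpitch.parse(transcript)
    synthetic_mel = fastpitch.generate_spectrogram(tokens=tokens)

# (real_mel, synthetic_mel) form one training pair for the enhancer.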
Both models in this collection were trained on the LibriTTS train-960 subset (train-clean-100 + train-clean-360 + train-other-500).
The models are available for use in the NeMo toolkit [4]. The intended use case is fine-tuning English ASR models; please see the tutorials.
# Load the mel-spectrogram generator and the enhancer from NGC.
import nemo
from nemo.collections.tts.models import FastPitchModel, SpectrogramEnhancerModel

fastpitch = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch_for_asr_finetuning")
enhancer = SpectrogramEnhancerModel.from_pretrained(model_name="tts_en_spectrogram_enhancer_for_asr_finetuning")
The first model accepts a mixture of English text and ARPAbet and produces spectrograms as if they had been computed from an audio signal containing the spoken input text. The assumed STFT parameters are a 25 ms window and a 10 ms hop.
The second model accepts an 80-band mel-spectrogram and outputs an 80-band mel-spectrogram of the same size and content.
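Continuing from the loading snippet above, a minimal call chain looks like the sketch below; the keyword arguments in the enhancer call are an assumption about its forward signature, so check the SpectrogramEnhancerModel documentation in your NeMo version.

import torch

text = "An example sentence from the target domain."

with torch.no_grad():
    tokens = fastpitch.parse(text)                        # graphemes/ARPAbet -> token ids
    spec = fastpitch.generate_spectrogram(tokens=tokens)  # [1, 80, T] synthetic mel
    # Keyword arguments below are an assumption; see the SpectrogramEnhancerModel docstring.
    enhanced = enhancer(input_spectrograms=spec, lengths=torch.tensor([spec.shape[-1]]))

print(spec.shape, enhanced.shape)  # same [1, 80, T] shape before and after enhancement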
The models in this collection are not intended for regular text-to-speech use: the ASR-oriented spectrogram parameters are suboptimal for TTS tasks.
[1] FastPitch: Parallel Text-to-speech with Pitch Prediction
[2] Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator
[3] Analyzing and Improving the Image Quality of StyleGAN
[4] NVIDIA NeMo Toolkit, https://github.com/NVIDIA/NeMo
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.