The RAD-TTS Aligner is a model that aligns speech and text inputs. It generates both a soft and hard alignment, the latter of which can be used to calculate token durations in mel frames.
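As a minimal sketch of the duration calculation mentioned above, assume the hard alignment is a binary matrix with one row per mel frame and one column per text token, where each frame is assigned to exactly one token. The matrix layout and the function name `durations_from_hard_alignment` are illustrative assumptions, not part of the NeMo API. Summing each column gives the number of mel frames assigned to that token.

```python
def durations_from_hard_alignment(hard):
    """Count the mel frames assigned to each text token.

    `hard` is a list of rows, one per mel frame; each row is a one-hot
    list over text tokens. This layout is assumed for illustration only.
    """
    n_tokens = len(hard[0])
    durations = [0] * n_tokens
    for frame in hard:
        for token_idx, assigned in enumerate(frame):
            durations[token_idx] += assigned
    return durations


# Five mel frames aligned to three tokens.
hard = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
print(durations_from_hard_alignment(hard))  # [2, 1, 2]
```

Because every frame belongs to exactly one token, the durations always sum to the number of mel frames.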
The RAD-TTS Aligner is non-autoregressive, and uses 1D convolution layers to separately encode text and mel spectrogram inputs. The model calculates the pairwise L2 distances between tokens and mel frames, then uses the Viterbi algorithm to determine a hard alignment between the two. During inference, the model produces a soft alignment (L2 distance matrix) as well as the hard alignment of text tokens to audio frames.
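The distance-then-Viterbi step can be sketched in plain Python as follows. This is not the model's implementation: the encoders are replaced by hand-written vectors, the helper names `pairwise_l2` and `viterbi_monotonic` are hypothetical, and minimizing the summed distance stands in for whatever score the model actually maximizes. The sketch assumes a monotonic alignment that starts on the first token, ends on the last, and either stays or advances by one token at each frame, with at least as many frames as tokens.

```python
import math


def pairwise_l2(text_enc, mel_enc):
    """dist[t][n] = Euclidean distance between mel frame t and token n."""
    return [[math.dist(frame, token) for token in text_enc] for frame in mel_enc]


def viterbi_monotonic(dist):
    """Return the token index per frame along the lowest-cost monotonic path."""
    n_frames, n_tokens = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * n_tokens for _ in range(n_frames)]
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    cost[0][0] = dist[0][0]  # the path must start on the first token
    for t in range(1, n_frames):
        for n in range(n_tokens):
            stay = cost[t - 1][n]
            advance = cost[t - 1][n - 1] if n > 0 else inf
            advanced[t][n] = advance < stay
            cost[t][n] = min(stay, advance) + dist[t][n]
    # Backtrack from the last token at the last frame.
    path = [n_tokens - 1]
    for t in range(n_frames - 1, 0, -1):
        current = path[-1]
        path.append(current - 1 if advanced[t][current] else current)
    return path[::-1]


# Toy "encodings": three tokens and five mel frames in a 2-D space.
text_enc = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
mel_enc = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 0.9], [2.0, 0.1]]

dist = pairwise_l2(text_enc, mel_enc)
print(viterbi_monotonic(dist))  # [0, 0, 1, 1, 2]
```

The returned list assigns one token index to each mel frame, which is one way to represent a hard alignment.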
For more details, please refer to the RAD-TTS Aligner paper (https://arxiv.org/abs/2108.10447).
During training, a forward-sum loss is combined with a binarization ("bin") loss to form an unsupervised alignment learning objective.
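To illustrate the idea behind a forward-sum objective, the sketch below computes the negative log of the total probability of all monotonic alignments using a forward (dynamic programming) recursion. It is an assumption-laden illustration, not the loss used in NeMo: the function names are hypothetical, the input is assumed to be a per-frame log-probability matrix over tokens, alignments are assumed to start on the first token and end on the last, and the bin loss is not covered.

```python
import math


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_sum_loss(log_probs):
    """Negative log-probability summed over all monotonic alignments.

    log_probs[t][n] is the log-probability that mel frame t aligns to token n.
    """
    n_frames, n_tokens = len(log_probs), len(log_probs[0])
    alpha = [-math.inf] * n_tokens
    alpha[0] = log_probs[0][0]  # alignments start on the first token
    for t in range(1, n_frames):
        new_alpha = [-math.inf] * n_tokens
        for n in range(n_tokens):
            from_prev_token = alpha[n - 1] if n > 0 else -math.inf
            new_alpha[n] = logsumexp2(alpha[n], from_prev_token) + log_probs[t][n]
        alpha = new_alpha
    return -alpha[-1]  # alignments end on the last token


# Three frames, two tokens, uniform probabilities of 0.5 everywhere.
log_probs = [[math.log(0.5)] * 2 for _ in range(3)]
print(forward_sum_loss(log_probs))  # approximately 1.386, i.e. -log(0.25)
```

In this toy case there are two valid monotonic paths, each with probability 0.125, so the total is 0.25. Minimizing the loss pushes probability mass toward monotonic alignments without requiring any ground-truth durations.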
This model is trained on LJSpeech sampled at 22050 Hz, and has been tested on female English speech with an American accent.
No performance information available at this time.
This model can be automatically loaded from NGC.
# Load the RAD-TTS Aligner
import soundfile as sf
import torch
from nemo.collections.tts.models import AlignerModel
aligner = AlignerModel.from_pretrained("tts_en_radtts_aligner") # or "tts_en_radtts_aligner_ipa" for the IPA version
device = aligner.device  # keep the inputs on the same device as the model
# Load audio and text
audio_data, orig_sr = sf.read(<audio_path>)  # the model expects audio sampled at 22050 Hz
audio = torch.tensor(audio_data, dtype=torch.float, device=device).unsqueeze(0)
audio_len = torch.tensor(audio_data.shape[0], device=device).long().unsqueeze(0)  # length in samples
spec, spec_len = aligner.preprocessor(input_signal=audio, length=audio_len)
# Process text
text_raw = "<your_text_here>"
text_normalized = aligner.normalizer.normalize(text_raw, punct_post_process=True)
text_tokens = aligner.tokenizer(text_normalized)
text = torch.tensor(text_tokens, device=device).unsqueeze(0).long()
text_len = torch.tensor(len(text_tokens), device=device).unsqueeze(0).long()
# Run the Aligner
attn_soft_tensor, attn_logprob_tensor = aligner(spec=spec, spec_len=spec_len, text=text, text_len=text_len)
See the Aligner Inference Tutorial for more in-depth examples and use-cases.
The model can also be used to disambiguate heteronyms for G2P, as in this example script.
This model accepts batches of text and their corresponding audio.
This model predicts soft and hard alignments between the pairs of inputs.
This checkpoint works best when aligning samples spoken by the LJSpeech speaker and sampled at 22050 Hz. It will be less reliable for other speakers.
ARPABET_1.11.0: The original version released with NeMo 1.11.0.
IPA_1.13.0: An additional version that uses IPA rather than ARPABET.
RAD-TTS Aligner paper: https://arxiv.org/abs/2108.10447
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.