The RAD-TTS Aligner is a model that aligns speech and text inputs. It generates both a soft and a hard alignment; the hard alignment can be used to calculate token durations in mel frames.
The RAD-TTS Aligner is non-autoregressive, and uses 1D convolution layers to separately encode text and mel spectrogram inputs. The model calculates the pairwise L2 distances between tokens and mel frames, then uses the Viterbi algorithm to determine a hard alignment between the two. During inference, the model produces a soft alignment (L2 distance matrix) as well as the hard alignment of text tokens to audio frames.
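The distance and alignment steps described above can be sketched in plain Python. This is an illustrative toy, not the NeMo implementation: the function names (`pairwise_l2`, `log_soft_alignment`, `hard_alignment`) are hypothetical, the encodings are short hand-written vectors standing in for the convolutional encoder outputs, and the distances are used unscaled. The hard alignment is found with a Viterbi-style dynamic program that assumes the path is monotonic, starts at the first token, and ends at the last token.

```python
import math


def pairwise_l2(text_enc, mel_enc):
    """Return dist[t][n]: L2 distance between mel frame t and text token n."""
    return [[math.dist(frame, token) for token in text_enc] for frame in mel_enc]


def log_soft_alignment(dist):
    """Log-softmax of the negative distances over tokens, one row per frame."""
    log_soft = []
    for row in dist:
        shift = min(row)  # subtract the smallest distance for stability
        lse = -shift + math.log(sum(math.exp(-(d - shift)) for d in row))
        log_soft.append([-d - lse for d in row])
    return log_soft


def hard_alignment(log_soft):
    """Most likely monotonic path: returns the token index for each frame."""
    n_frames, n_tokens = len(log_soft), len(log_soft[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one mel frame per token")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    score[0][0] = log_soft[0][0]
    for t in range(1, n_frames):
        for n in range(min(t + 1, n_tokens)):
            stay = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else neg_inf
            advanced[t][n] = advance > stay
            score[t][n] = max(stay, advance) + log_soft[t][n]
    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    n = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = n
        if advanced[t][n]:
            n -= 1
    return path


# Toy encodings (made up): 2 tokens and 4 mel frames in a 2-D embedding space.
text_enc = [[0.0, 0.0], [1.0, 1.0]]
mel_enc = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 1.1]]
dist = pairwise_l2(text_enc, mel_enc)
print(hard_alignment(log_soft_alignment(dist)))  # → [0, 0, 1, 1]
```

The first two frames sit close to the first token and the last two close to the second, so the recovered path switches tokens once, halfway through.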
For more details, refer to the RAD-TTS Aligner paper linked at the end of this card.
During training, forward sum loss is used with bin loss to make up an unsupervised alignment learning objective.
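As a rough illustration of the two loss terms, the sketch below computes them for a single utterance in plain Python. It is an assumption-laden toy rather than the training code: the function names are hypothetical, the input is a (frames x tokens) table of log-probabilities, the forward-sum term is written as an explicit forward recursion over monotonic paths (production implementations commonly rely on a CTC loss routine instead), and the bin term is taken to be the mean negative log-probability of the soft alignment at the positions chosen by the hard alignment.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def forward_sum_loss(log_soft):
    """Negative log of the total probability of all monotonic alignments.

    A path starts at token 0, ends at the last token, and at each new frame
    either stays on its token or advances by exactly one.
    """
    n_frames, n_tokens = len(log_soft), len(log_soft[0])
    alpha = [NEG_INF] * n_tokens
    alpha[0] = log_soft[0][0]
    for t in range(1, n_frames):
        prev = alpha
        alpha = [NEG_INF] * n_tokens
        for n in range(min(t + 1, n_tokens)):
            advance = prev[n - 1] if n > 0 else NEG_INF
            alpha[n] = logaddexp(prev[n], advance) + log_soft[t][n]
    return -alpha[n_tokens - 1]


def bin_loss(log_soft, path):
    """Mean negative log-probability at the hard-aligned (frame, token) cells."""
    return -sum(log_soft[t][n] for t, n in enumerate(path)) / len(path)


# Toy input: 3 frames, 2 tokens, uniform soft alignment (probability 0.5 each).
log_soft = [[math.log(0.5)] * 2 for _ in range(3)]
print(forward_sum_loss(log_soft))     # two valid paths of probability 1/8 each
print(bin_loss(log_soft, [0, 0, 1]))  # hard path given as token index per frame
```

With the uniform toy input there are exactly two monotonic paths, so the forward-sum term equals -log(2/8) = log 4, and the bin term equals log 2.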
This model was trained on LJSpeech sampled at 22050 Hz, and has been tested on female English speech with an American accent.
No performance information is available at this time.
This model can be automatically loaded from NGC.
    import soundfile as sf
    import torch
    from nemo.collections.tts.models import AlignerModel

    # Load the RAD-TTS Aligner
    aligner = AlignerModel.from_pretrained("tts_en_radtts_aligner")
    device = aligner.device

    # Load audio and compute the mel spectrogram
    audio_data, orig_sr = sf.read("<audio_path>")
    audio = torch.tensor(audio_data, dtype=torch.float, device=device).unsqueeze(0)
    # Length of the audio in samples, as a batch of size 1
    audio_len = torch.tensor(audio_data.shape[0], device=device).long().unsqueeze(0)
    spec, spec_len = aligner.preprocessor(input_signal=audio, length=audio_len)

    # Process text
    text_raw = "<your_text_here>"
    text_normalized = aligner.normalizer.normalize(text_raw, punct_post_process=True)
    text_tokens = aligner.tokenizer(text_normalized)
    text = torch.tensor(text_tokens, device=device).unsqueeze(0).long()
    text_len = torch.tensor(len(text_tokens), device=device).unsqueeze(0).long()

    # Run the Aligner
    attn_soft_tensor, attn_logprob_tensor = aligner(spec=spec, spec_len=spec_len, text=text, text_len=text_len)
See the Aligner Inference Tutorial for more in-depth examples and use-cases.
The model can also be used to disambiguate heteronyms for G2P, as in this example script.
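One plausible way to use alignment distances for heteronym disambiguation is to align the audio against each candidate pronunciation and keep the candidate whose aligned tokens sit closest to the audio. The sketch below illustrates only that selection rule and is an assumption, not a copy of the example script: the function names are hypothetical, and the distance tables and paths are made-up toy numbers standing in for real Aligner output.

```python
def mean_aligned_distance(dist, path):
    """Average of dist[t][n] over the (frame, token) cells on the hard path."""
    return sum(dist[t][n] for t, n in enumerate(path)) / len(path)


def pick_pronunciation(candidates):
    """Return the candidate name with the lowest mean aligned distance.

    candidates maps a pronunciation string to a (dist, path) pair, where dist
    is a frames x tokens distance table and path gives the token per frame.
    """
    return min(candidates, key=lambda name: mean_aligned_distance(*candidates[name]))


# Toy data (made up): two candidate pronunciations of "read", 2 frames each.
candidates = {
    "R IY1 D": ([[0.2, 1.0], [0.9, 0.3]], [0, 1]),
    "R EH1 D": ([[0.8, 1.0], [0.9, 0.7]], [0, 1]),
}
print(pick_pronunciation(candidates))  # → R IY1 D
```

Here the first candidate averages 0.25 along its path and the second 0.75, so the first is selected.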
This model accepts batches of text and their corresponding audio.
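Batching variable-length inputs generally means padding each sequence to a common length and passing the true lengths alongside. The helper below is a minimal sketch of that idea for token ID sequences: `pad_batch` is a hypothetical name, and the pad value of 0 is an assumption, so substitute the pad ID of the tokenizer actually in use.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad sequences to the longest length and return (padded, lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Two token sequences of different lengths (made-up IDs).
padded, lengths = pad_batch([[5, 6, 7], [8]])
print(padded)   # → [[5, 6, 7], [8, 0, 0]]
print(lengths)  # → [3, 1]
```

The `padded` and `lengths` lists can then be converted to tensors and passed as the `text` and `text_len` arguments; audio samples can be padded the same way before the preprocessor.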
This model predicts soft and hard alignments between the pairs of inputs.
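Because a hard alignment assigns every mel frame to exactly one token, token durations follow by counting frames per token. The sketch below assumes the hard alignment for one utterance is available as a binary table laid out as frames x tokens; `token_durations` is a hypothetical helper name, and the layout of the real output tensor may differ.

```python
def token_durations(hard):
    """Sum a binary (frames x tokens) alignment over frames, per token."""
    return [sum(column) for column in zip(*hard)]


# Toy hard alignment: 6 mel frames, 3 tokens.
hard = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
print(token_durations(hard))  # → [2, 1, 3]
```

The durations always sum to the number of mel frames, which makes a convenient sanity check.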
This checkpoint works best when aligning samples spoken by the LJSpeech speaker at 22050 Hz. It will be less reliable for other speakers.
ARPABET_1.11.0: The original version released with NeMo 1.11.0.
RAD-TTS Aligner paper: https://arxiv.org/abs/2108.10447