
TTS En RAD-TTS Aligner

RAD-TTS Aligner model trained on female English speech.
Latest Version: April 4, 2023 (7.11 MB)

Model Overview

The RAD-TTS Aligner is a model that aligns speech and text inputs. It generates both a soft and hard alignment, the latter of which can be used to calculate token durations in mel frames.

Model Architecture

The RAD-TTS Aligner is non-autoregressive, and uses 1D convolution layers to separately encode text and mel spectrogram inputs. The model calculates the pairwise L2 distances between tokens and mel frames, then uses the Viterbi algorithm to determine a hard alignment between the two. During inference, the model produces a soft alignment (L2 distance matrix) as well as the hard alignment of text tokens to audio frames.
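The distance-based soft alignment described above can be sketched as follows. The shapes, the sign flip, and the softmax over tokens are illustrative assumptions, not the model's exact implementation:

```python
import torch

def soft_alignment(text_emb: torch.Tensor, mel_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of a distance-based soft alignment.

    text_emb: (B, T_text, D) encoded text tokens
    mel_emb:  (B, T_mel, D)  encoded mel frames
    Returns:  (B, T_mel, T_text), where each mel frame distributes
              probability over the text tokens.
    """
    # Pairwise squared L2 distances between every frame/token pair
    dist = torch.cdist(mel_emb, text_emb, p=2) ** 2
    # Smaller distance = better match, so negate before the softmax
    return torch.softmax(-dist, dim=-1)

attn = soft_alignment(torch.randn(2, 5, 8), torch.randn(2, 20, 8))
```

A monotonic hard alignment can then be extracted from this matrix, e.g. with the Viterbi algorithm as the model does.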

For more details, please refer to the paper.


Training

During training, the forward sum loss is combined with a binarization (bin) loss to form an unsupervised alignment learning objective.
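A forward sum objective can be implemented on top of PyTorch's CTC loss: the target for each sample is simply its token positions in order, so CTC's forward algorithm sums probability over every monotonic alignment that covers all tokens. This is a minimal sketch with assumed tensor shapes and a hypothetical blank score, not NeMo's exact implementation:

```python
import torch
import torch.nn.functional as F

def forward_sum_loss(attn_logprob, text_lens, mel_lens, blank_logprob=-1.0):
    """Sketch of a forward sum loss.

    attn_logprob: (B, T_mel, T_text) unnormalized alignment scores
    text_lens, mel_lens: (B,) lengths of each text / spectrogram
    """
    # Prepend a "blank" column at index 0, as CTC requires
    logp = F.pad(attn_logprob, (1, 0), value=blank_logprob)
    total = attn_logprob.new_zeros(())
    for b in range(attn_logprob.shape[0]):
        # Normalize over (blank + valid tokens) for this sample
        curr = logp[b, : mel_lens[b], : text_lens[b] + 1]
        curr = F.log_softmax(curr, dim=-1).unsqueeze(1)  # (T_mel, 1, C)
        # Target is every token position 1..T_text, in order
        target = torch.arange(1, text_lens[b] + 1).unsqueeze(0)
        total = total + F.ctc_loss(
            curr, target,
            input_lengths=mel_lens[b : b + 1],
            target_lengths=text_lens[b : b + 1],
        )
    return total / attn_logprob.shape[0]
```

The bin loss then encourages the soft alignment to match the binarized (hard) one, tightening the attention as training progresses.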


Training Datasets

This model is trained on LJSpeech sampled at 22050 Hz, and has been tested on female English speech with an American accent.


Performance

No performance information available at this time.

How to Use this Model

This model can be automatically loaded from NGC.

# Load the RAD-TTS Aligner
import torch
import librosa  # any loader that yields a float waveform works

from nemo.collections.tts.models import AlignerModel

device = "cuda" if torch.cuda.is_available() else "cpu"
aligner = AlignerModel.from_pretrained("tts_en_radtts_aligner").to(device)  # or "tts_en_radtts_aligner_ipa" for the IPA version
aligner.eval()

# Load audio, resampled to the 22050 Hz the checkpoint expects
audio_data, orig_sr = librosa.load("<audio_path>", sr=22050)
audio = torch.tensor(audio_data, dtype=torch.float, device=device).unsqueeze(0)
audio_len = torch.tensor(audio_data.shape[0], device=device).unsqueeze(0)
spec, spec_len = aligner.preprocessor(input_signal=audio, length=audio_len)

# Process text
text_raw = "<your_text_here>"
text_normalized = aligner.normalizer.normalize(text_raw, punct_post_process=True)
text_tokens = aligner.tokenizer(text_normalized)
text = torch.tensor(text_tokens, device=device).unsqueeze(0).long()
text_len = torch.tensor(len(text_tokens), device=device).unsqueeze(0).long()

# Run the Aligner
attn_soft_tensor, attn_logprob_tensor = aligner(spec=spec, spec_len=spec_len, text=text, text_len=text_len)
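Once you have an alignment, per-token durations are just the count of mel frames assigned to each token. The helper below is a hypothetical sketch using a per-frame argmax for simplicity; the Aligner itself derives the hard alignment with the Viterbi algorithm, which guarantees monotonicity:

```python
import torch

def durations_from_soft_attn(attn_soft: torch.Tensor, text_len: int) -> torch.Tensor:
    """Sketch: derive per-token durations from a soft alignment.

    attn_soft: (T_mel, T_text) soft alignment for one sample
    Returns:   (T_text,) number of mel frames assigned to each token
    """
    hard_idx = attn_soft.argmax(dim=-1)            # best token per mel frame
    return torch.bincount(hard_idx, minlength=text_len)
```

The resulting durations sum to the number of mel frames, which is the form duration-based TTS models typically consume.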

See the Aligner Inference Tutorial for more in-depth examples and use cases.

The model can also be used to disambiguate heteronyms for G2P, as in this example script.
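Conceptually, heteronym disambiguation works by scoring each candidate pronunciation against the same audio and keeping the one the Aligner fits best. The sketch below illustrates only that selection step; the pronunciations and costs are hypothetical, and in practice the costs would come from running the Aligner on each candidate:

```python
def pick_pronunciation(alignment_costs: dict) -> str:
    """Sketch: choose the candidate pronunciation with the lowest
    alignment cost, as scored against the audio by the Aligner."""
    return min(alignment_costs, key=alignment_costs.get)

# Hypothetical costs for the heteronym "read" (past vs. present tense)
best = pick_pronunciation({"R IY1 D": 1.7, "R EH1 D": 0.9})
```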


Input

This model accepts batches of text and their corresponding audio.


Output

This model predicts soft and hard alignments between the pairs of inputs.


Limitations

This checkpoint will work best at aligning samples spoken by the LJSpeech speaker, at 22050 Hz. It will be less reliable for other speakers.


Versions

ARPABET_1.11.0: The original version released with NeMo 1.11.0.

IPA_1.13.0: An additional version that uses IPA rather than ARPABET.


References

RAD-TTS Aligner paper:


License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.