An ASR-based text/audio aligner, built on the CTC loss, that was used to train TalkNet.
from nemo.collections.asr.models import EncDecCTCModel
model = EncDecCTCModel.from_pretrained("asr_talknet_aligner")
For an example of how to use this model to generate speech, refer to the TTS inference notebook.
This model is trained on LibriTTS sampled at 22050 Hz, with the input text converted to phonemes. It can be used to extract durations for an audio excerpt and its corresponding phoneme sequence.
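To illustrate what "extracting durations" means, here is a minimal sketch that turns a frame-level CTC path (one symbol id per output frame) into per-symbol durations by run-length encoding. It does not call the model: the path is hand-written, and the blank id `BLANK = 0`, the symbol ids, and the helper names are assumptions made for this example, not part of the NeMo API.

```python
from itertools import groupby

BLANK = 0  # assumed id of the CTC blank symbol (hypothetical for this sketch)


def path_to_segments(path):
    """Run-length encode a frame-level path into (symbol_id, n_frames) pairs."""
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(path)]


def phoneme_durations(segments, blank=BLANK):
    """Keep only the non-blank segments, i.e. the phoneme durations in frames."""
    return [(symbol, n) for symbol, n in segments if symbol != blank]


# Hand-written example path: blank, phoneme 5, blank, phoneme 7, blank.
path = [0, 0, 5, 5, 5, 0, 7, 7, 0, 0, 0]
segments = path_to_segments(path)
durations = phoneme_durations(segments)
```

The segment lengths always sum to the number of frames in the path, so the durations cover the whole excerpt. Converting frames to seconds depends on the model's output frame rate, which is not stated here.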
[2] TalkNet 2 Paper