This is an English text normalization model based on Albert-Base-v2 [1] and T5-small [2]. Text normalization is the task of converting written text into its spoken form. For example, "$123" should be verbalized as "one hundred twenty three dollars", while "123 King Ave" should be verbalized as "one twenty three King Avenue".
This model uses a two-stage process:
1. A transformer-based NER model identifies "semiotic" spans in the input (e.g., spans about times, dates, or monetary amounts).
2. A transformer-based seq2seq model decodes the identified semiotic spans into spoken form.
The tagger model first uses a Transformer encoder (e.g., Albert-Base-v2) to build a contextualized representation for each input token. It then uses a classification head to predict the tag for each token (e.g., if a token should stay the same, its tag should be SAME). The decoder model then takes the semiotic spans identified by the tagger and transforms them into the spoken form. The decoder model is essentially a Transformer-based encoder-decoder seq2seq model (e.g., the example training script uses the T5-base model by default). Overall, our design is partly inspired by the RNN-based sliding window model proposed in the paper Neural Models of Text Normalization for Speech Applications [3].
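To make the two-stage flow concrete, below is a minimal, self-contained sketch of the data flow. The rule-based stubs and the SAME/TRANSFORM tag names are illustrative assumptions, not the actual NeMo models or tag set:

# Illustrative sketch of the two-stage pipeline (rule-based stubs, not the real models).
def tag_tokens(tokens):
    # Stage 1 (tagger): label each token; the real model is an Albert-based token classifier.
    return ["TRANSFORM" if any(c.isdigit() for c in t) else "SAME" for t in tokens]

def decode_span(span):
    # Stage 2 (decoder): verbalize a semiotic span; the real model is a T5 seq2seq model.
    lookup = {"$123": "one hundred twenty three dollars"}  # toy lookup for illustration
    return lookup.get(span, span)

tokens = "I paid $123 yesterday".split()
tags = tag_tokens(tokens)  # ['SAME', 'SAME', 'TRANSFORM', 'SAME']
spoken = [decode_span(t) if g == "TRANSFORM" else t for t, g in zip(tokens, tags)]
print(" ".join(spoken))  # I paid one hundred twenty three dollars yesterday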
The model was trained from the pretrained Albert-Base-v2 [1] (tagger) and T5-small [2] (decoder) checkpoints.
The model is trained on a processed and upsampled version of the English Google Text Normalization dataset [4].
Sentence-level accuracy on the English Google Text Normalization dataset [4]:
Text normalization: 99.48%
The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.nlp as nemo_nlp

# Load the pretrained tagger (identifies semiotic spans) and decoder (verbalizes them).
tagger = nemo_nlp.models.duplex_text_normalization.DuplexTaggerModel.from_pretrained(model_name="neural_text_normalization_t5")
decoder = nemo_nlp.models.duplex_text_normalization.DuplexDecoderModel.from_pretrained(model_name="neural_text_normalization_t5")

# Combine both models into a single end-to-end normalizer.
normalizer = nemo_nlp.models.duplex_text_normalization.DuplexTextNormalizationModel(tagger, decoder, lang='en')
To run interactive inference with the example script:

python [NEMO_GIT_FOLDER]/examples/nlp/duplex_text_normalization/duplex_text_normalization_infer.py \
    lang=en \
    mode=tn \
    tagger_pretrained_model=neural_text_normalization_t5 \
    decoder_pretrained_model=neural_text_normalization_t5 \
    inference.interactive=True
To run inference on a data file instead, set inference.interactive=False and inference.from_file=[DATA_FILE].
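For example (the input path below is a hypothetical placeholder; point inference.from_file at your own file):

python [NEMO_GIT_FOLDER]/examples/nlp/duplex_text_normalization/duplex_text_normalization_infer.py \
    lang=en \
    mode=tn \
    tagger_pretrained_model=neural_text_normalization_t5 \
    decoder_pretrained_model=neural_text_normalization_t5 \
    inference.interactive=False \
    inference.from_file=/path/to/input.txt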
Both the DuplexTaggerModel and the DuplexDecoderModel use the same simple text format as the dataset. The data must be stored in tab-separated files (.tsv) with three columns: the first is the "semiotic class" (e.g., numbers, times, dates), the second is the token in written form, and the third is its spoken form. A complete dataset is expected to contain three files: train.tsv, dev.tsv, and test.tsv.
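For illustration, here is a made-up excerpt in this format. The rows follow the token-level conventions of the Google Text Normalization dataset [4], where <self> marks a token kept as-is and sil marks silence for punctuation; the specific rows are invented for this example:

PLAIN	The	<self>
PLAIN	show	<self>
DATE	2012	twenty twelve
PUNCT	.	sil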
The model outputs either the transformed sentence in interactive mode or the evaluation metrics in test mode.
The length of the input text is currently constrained by the maximum sequence length of the tagger and decoder models, which is 512 tokens after tokenization.
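As a rough pre-check of input length, one can count subword tokens with the Hugging Face tokenizer the tagger is based on [1]. This is a sketch under that assumption; the token count produced by the NeMo pipeline's own preprocessing may differ slightly:

from transformers import AutoTokenizer

# Count subword tokens with the tagger's base tokenizer (albert-base-v2 [1]).
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
text = "The meeting is on 5 Jan 2021 at 10:30am and costs $15."
n_tokens = len(tokenizer(text)["input_ids"])
print(n_tokens)  # inputs beyond 512 tokens exceed the models' maximum sequence length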
[1] https://huggingface.co/albert-base-v2
[2] https://huggingface.co/t5-small
[3] https://research.fb.com/publications/neural-models-of-text-normalization-for-speech-applications
[4] https://arxiv.org/abs/1611.00068
[5] https://github.com/NVIDIA/NeMo
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.