This is an English inverse text normalization model based on Albert Base v2 and T5-small. Inverse text normalization is the task of converting spoken-domain text into its written form. For example, "one hundred twenty three dollars" should be converted to "$123", while "one twenty three king avenue" should be converted to "123 King Ave".
This model uses a two-stage process:
1. a transformer-based NER model for identifying "semiotic" spans in the input (e.g., spans about times, dates, or monetary amounts)
2. a transformer-based seq2seq model for decoding the semiotic spans into written form
The tagger model first uses a Transformer encoder (e.g., Albert-Base-v2) to build a contextualized representation for each input token. It then uses a classification head to predict the tag for each token (e.g., if a token should stay the same, its tag should be SAME). The decoder model then takes the semiotic spans identified by the tagger and transforms them into the written form. The decoder model is essentially a Transformer-based encoder-decoder seq2seq model (e.g., the example training script uses the T5-small model by default). Overall, our design is partly inspired by the RNN-based sliding window model proposed in the paper Neural Models of Text Normalization for Speech Applications.
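The tag-then-decode data flow described above can be sketched with a toy example. Both stages below are mocked with simple lookups purely for illustration (the `tag_tokens` and `decode_span` helpers are hypothetical stand-ins; the real tagger and decoder are Transformer models):

```python
# Toy sketch of the two-stage ITN pipeline: tag tokens, then decode
# the tagged ("semiotic") spans into written form. The tagger and the
# decoder are mocked with lookups here for illustration only.

def tag_tokens(tokens):
    """Mock tagger: label each token SAME or TRANSFORM."""
    semiotic = {"one", "hundred", "twenty", "three", "dollars"}
    return ["TRANSFORM" if tok in semiotic else "SAME" for tok in tokens]

def decode_span(span_tokens):
    """Mock seq2seq decoder: map a spoken-form span to written form."""
    lookup = {"one hundred twenty three dollars": "$123"}
    return lookup.get(" ".join(span_tokens), " ".join(span_tokens))

def inverse_normalize(text):
    tokens = text.split()
    out, span = [], []
    for tok, tag in zip(tokens, tag_tokens(tokens)):
        if tag == "TRANSFORM":
            span.append(tok)          # accumulate the semiotic span
        else:
            if span:                  # span ended: decode it
                out.append(decode_span(span))
                span = []
            out.append(tok)           # SAME tokens pass through
    if span:
        out.append(decode_span(span))
    return " ".join(out)

print(inverse_normalize("it costs one hundred twenty three dollars today"))
# -> "it costs $123 today"
```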
The model was trained with pretrained Albert-Base-v2 and T5-small.
The model is trained on a processed and upsampled version of the English Google Text Normalization dataset.
Sentence-level accuracy on the English Google Text Normalization dataset:
The performance of ITN models can be measured using Word Error Rate (WER) and Sentence Accuracy. We measure Sentence Accuracy with respect to multi-variant references and subdivide the errors into "digit" and "other" errors.
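The multi-variant sentence accuracy and the digit/other error split can be sketched as follows. The exact error-classification rule used for the scores below is not stated here, so this snippet uses one plausible rule as an assumption: an error counts as a "digit" error when the digits in the prediction differ from those of every acceptable reference, and an "other" error otherwise:

```python
import re

def sentence_accuracy(preds, refs_multi):
    """Multi-variant sentence accuracy: a prediction is correct
    if it exactly matches ANY of its acceptable references."""
    correct = sum(p in refs for p, refs in zip(preds, refs_multi))
    return correct / len(preds)

def digit_string(s):
    """The sequence of digit characters in a string."""
    return re.findall(r"\d", s)

def split_errors(preds, refs_multi):
    """Hypothetical split: a 'digit' error has digits differing from
    every reference; anything else that mismatches is 'other'."""
    digit = other = 0
    for p, refs in zip(preds, refs_multi):
        if p in refs:
            continue  # correct sentence, no error
        if all(digit_string(p) != digit_string(r) for r in refs):
            digit += 1
        else:
            other += 1
    n = len(preds)
    return digit / n, other / n

preds = ["$123", "123 King Av.", "$124"]
refs  = [["$123"], ["123 King Ave", "123 King Avenue"], ["$123"]]
print(sentence_accuracy(preds, refs))  # 1 of 3 sentences correct
print(split_errors(preds, refs))       # one digit error, one other error
```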
The model obtains the following scores on the evaluation datasets:
| Test set | WER | Sentence accuracy | Digit errors | Other errors |
|----------|-----|-------------------|--------------|--------------|
| Default  | 2.9%  | 97.31% | 0.35% | 2.34%  |
| Hard     | 9.34% | 85.34% | 3.12% | 11.54% |

## How to use this model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically load the model from NGC

```python
import nemo
import nemo.collections.nlp as nemo_nlp

tagger = nemo_nlp.models.duplex_text_normalization.DuplexTaggerModel.from_pretrained(model_name="itn_en_t5")
decoder = nemo_nlp.models.duplex_text_normalization.DuplexDecoderModel.from_pretrained(model_name="itn_en_t5")
normalizer = nemo_nlp.models.duplex_text_normalization.DuplexTextNormalizationModel(tagger, decoder, lang='en')
```

Run interactive inference from the command line:

```
python [NEMO_GIT_FOLDER]/examples/nlp/duplex_text_normalization/duplex_text_normalization_infer.py \
    lang=en \
    mode=itn \
    tagger_pretrained_model=itn_en_t5 \
    decoder_pretrained_model=itn_en_t5 \
    inference.interactive=True
```
To run inference from a data file, set
Both the DuplexTaggerModel and the DuplexDecoderModel use the same simple text format as the dataset. The data must be stored in tab-separated files (.tsv) with three columns: the first is the "semiotic class" (e.g., numbers, times, dates), the second is the token in written form, and the third is the spoken form. A complete dataset is expected to contain three files: train.tsv, dev.tsv, and test.tsv.
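The three-column TSV layout can be written and read back with the standard library. The class labels and rows below are illustrative examples, not the dataset's exact label set:

```python
import csv
import io

# Illustrative rows in the three-column TSV format described above:
# (semiotic class, written form, spoken form). Labels are examples only.
rows = [
    ("MONEY", "$123", "one hundred twenty three dollars"),
    ("ADDRESS", "123 King Ave", "one twenty three king avenue"),
]

# Write the rows as tab-separated values (an in-memory buffer stands in
# for a file such as train.tsv).
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)

# Read them back and show the spoken -> written mapping per row.
buf.seek(0)
for cls, written, spoken in csv.reader(buf, delimiter="\t"):
    print(f"{cls}: '{spoken}' -> '{written}'")
```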
The model outputs either the transformed sentence in interactive mode or the evaluation metrics in test mode.
The length of the input text is currently constrained by the maximum sequence length of the tagger and decoder models, which is 512 tokens after tokenization.