English T5-based Inverse Text Normalization

Description: English inverse text normalization model based on an albert-base-v2 tagger and a t5-small decoder.
Publisher: NVIDIA
Latest Version: 1.11.0
Modified: April 4, 2023
Size: 273.25 MB

Model Overview

This is an English inverse text normalization model based on Albert Base v2 [1] and T5-small [2]. Inverse text normalization is the task of converting a spoken-domain text into its written form. For example, "one hundred twenty three dollars" should be converted to "$123", while "one twenty three king avenue" should be converted to "123 King Ave".

Model Architecture

This model uses a two-stage process:

1. A transformer-based NER-style tagger identifies "semiotic" spans in the input (e.g., spans about times, dates, or monetary amounts).
2. A transformer-based seq2seq decoder converts the identified semiotic spans into written form.

The tagger model first uses a Transformer encoder (e.g., Albert-Base-v2) to build a contextualized representation for each input token. It then uses a classification head to predict the tag for each token (e.g., if a token should stay the same, its tag should be SAME). The decoder model then takes the semiotic spans identified by the tagger and transforms them into the written form. The decoder model is essentially a Transformer-based encoder-decoder seq2seq model (e.g., the example training script uses the T5-small model by default). Overall, our design is partly inspired by the RNN-based sliding window model proposed in the paper Neural Models of Text Normalization for Speech Applications [3].
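To make the two-stage flow concrete, here is a minimal, self-contained sketch of the tag-then-decode idea. The tag set, span grouping, and lookup-based "decoder" are illustrative stand-ins, not the actual NeMo tagger or T5 decoder.

```python
# Conceptual sketch of the tag-then-decode pipeline described above.
# Everything here is a mock: a real system uses a Transformer tagger and a
# seq2seq (T5-small) decoder instead of word lists and lookup tables.
from typing import List, Tuple

def tag_tokens(tokens: List[str]) -> List[str]:
    """Stage 1 (mock): label each token SAME or TRANSFORM."""
    spoken_number_words = {"one", "hundred", "twenty", "three", "dollars"}
    return ["TRANSFORM" if t in spoken_number_words else "SAME" for t in tokens]

def group_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, bool]]:
    """Merge consecutive tokens with the same tag into (span, is_semiotic) pairs."""
    spans: List[Tuple[str, bool]] = []
    for tok, tag in zip(tokens, tags):
        is_semiotic = tag == "TRANSFORM"
        if spans and spans[-1][1] == is_semiotic:
            spans[-1] = (spans[-1][0] + " " + tok, is_semiotic)
        else:
            spans.append((tok, is_semiotic))
    return spans

def decode_span(span: str) -> str:
    """Stage 2 (mock): rewrite a semiotic span into written form."""
    lookup = {"one hundred twenty three dollars": "$123"}
    return lookup.get(span, span)

def inverse_normalize(sentence: str) -> str:
    tokens = sentence.split()
    spans = group_spans(tokens, tag_tokens(tokens))
    return " ".join(decode_span(s) if sem else s for s, sem in spans)

print(inverse_normalize("it costs one hundred twenty three dollars today"))
# -> it costs $123 today
```

Only the second stage sees the spans that actually need rewriting; tokens tagged SAME are copied through unchanged, which keeps the decoder's job small.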

Training

The tagger was initialized from a pretrained Albert-Base-v2 checkpoint and the decoder from a pretrained T5-small checkpoint; both were then fine-tuned on the dataset described below.

Dataset

The model is trained on a processed and upsampled version of the English Google Text Normalization dataset [4].

Performance

Performance is measured on the English Google Text Normalization dataset [4].

The performance of ITN models can be measured using Word Error Rate (WER) and sentence accuracy. We measure sentence accuracy with respect to a multi-variant reference and subdivide the errors into "digit" and "other" errors.
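For reference, WER here is the standard word error rate: the word-level edit distance between the model output and the reference, divided by the reference length. The model card does not include the scoring script, so the snippet below is only a plain-Python restatement of that textbook definition.

```python
# Textbook word error rate: word-level Levenshtein distance / reference length.
# For orientation only; this is not NVIDIA's exact evaluation code.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("it costs $123 today", "it costs $ 123 today"))  # 0.5
```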

The model obtains the following scores on two evaluation sets:

| Test set | WER   | Sentence accuracy | Digit errors | Other errors |
|----------|-------|-------------------|--------------|--------------|
| Default  | 2.9%  | 97.31%            | 0.35%        | 2.34%        |
| Hard     | 9.34% | 85.34%            | 3.12%        | 11.54%       |


## How to use this model

The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically load the model from NGC
```python
import nemo
import nemo.collections.nlp as nemo_nlp
tagger = nemo_nlp.models.duplex_text_normalization.DuplexTaggerModel.from_pretrained(model_name="itn_en_t5")
decoder = nemo_nlp.models.duplex_text_normalization.DuplexDecoderModel.from_pretrained(model_name="itn_en_t5")
normalizer = nemo_nlp.models.duplex_text_normalization.DuplexTextNormalizationModel(tagger, decoder, lang='en')
```

Inference

```
python [NEMO_GIT_FOLDER]/examples/nlp/duplex_text_normalization/duplex_text_normalization_infer.py \
    lang=en \
    mode=itn \
    tagger_pretrained_model=itn_en_t5 \
    decoder_pretrained_model=itn_en_t5 \
    inference.interactive=True
```

To run inference on a data file instead of the interactive prompt, set inference.interactive=False and inference.from_file=[DATA_FILE].
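For example, combining those options with the same script invocation shown above:

```
python [NEMO_GIT_FOLDER]/examples/nlp/duplex_text_normalization/duplex_text_normalization_infer.py \
    lang=en \
    mode=itn \
    tagger_pretrained_model=itn_en_t5 \
    decoder_pretrained_model=itn_en_t5 \
    inference.interactive=False \
    inference.from_file=[DATA_FILE]
```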

Input

Both the DuplexTaggerModel and the DuplexDecoderModel use the same simple text format as the dataset. The data needs to be stored in TAB-separated files (.tsv) with three columns: the first is the "semiotic class" (e.g., numbers, times, dates), the second is the token in written form, and the third is its spoken form. A complete dataset is expected to contain three files: train.tsv, dev.tsv, and test.tsv.
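For orientation, a few illustrative rows in that format are shown below. The values are made up, and the columns are aligned with spaces here for readability, whereas the actual files must use TAB characters. In the Google Text Normalization data the spoken column typically uses <self> for tokens left unchanged and sil for punctuation, with an <eos> line separating sentences; check the dataset documentation for the exact conventions.

```
PLAIN   on             <self>
DATE    22 july 2012   the twenty second of july twenty twelve
PUNCT   .              sil
<eos>   <eos>
```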

Output

The model outputs either the transformed sentence in interactive mode or the evaluation metrics in test mode.

Limitations

The length of the input text is currently constrained by the maximum sequence length of the tagger and decoder models, which is 512 tokens after tokenization.

References

[1] https://huggingface.co/albert-base-v2

[2] https://huggingface.co/t5-small

[3] https://research.fb.com/publications/neural-models-of-text-normalization-for-speech-applications

[4] https://arxiv.org/abs/1611.00068

[5] NVIDIA NeMo Toolkit, https://github.com/NVIDIA/NeMo

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.