This model is a single-pass tagger-based model for inverse text normalization (ITN) based on DeepPavlov/rubert-base-cased, trained on 2 million sentences from the Google Text Normalization Dataset.
It converts text from the spoken domain into its written form:

Input: "в тысяча девятьсот тринадцатом году в кузнецке было пятьдесят две версты мощеных дорог"

Output: "в 1913 году в кузнецке было 52 версты мощеных дорог"
Thutmose Tagger is a single-pass tagging model. It uses a backbone BERT encoder (DeepPavlov/rubert-base-cased) followed by two classification heads: one is trained to predict written fragments as replacement tags, the other to predict tags representing semiotic classes, such as DATE, CARDINAL, etc. The final tags have a one-to-one correspondence with the input words.
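For intuition, decoding the predicted replacement tags back into written text can be sketched as follows. This is an illustrative sketch only, not the model's actual decoding code; the tag values shown (`<SELF>`, `<DELETE>`, written fragments) are simplified, and the real tag vocabulary is defined by label_map.txt at training time.

```python
# Minimal sketch: each spoken word carries exactly one replacement tag.
# Tag values are simplified here; the real vocabulary comes from label_map.txt.
def apply_tags(spoken_words, replacement_tags):
    written = []
    for word, tag in zip(spoken_words, replacement_tags):
        if tag == "<SELF>":        # keep the spoken word unchanged
            written.append(word)
        elif tag == "<DELETE>":    # drop the word entirely
            continue
        else:                      # replace the word with a written fragment
            written.append(tag)
    return " ".join(written)

print(apply_tags(
    ["в", "тысяча", "девятьсот", "тринадцатом", "году"],
    ["<SELF>", "<DELETE>", "<DELETE>", "1913", "<SELF>"],
))  # -> "в 1913 году"
```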
The NeMo toolkit [1] was used for training the model. The training corpus consists of 2 million sentences.
The model was trained with this example script and this base config:
```bash
python [NEMO_GIT_FOLDER]/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_train.py \
lang=ru \
data.validation_ds.data_path=${DATA_PATH}/valid.tsv \
data.train_ds.data_path=${DATA_PATH}/train.tsv \
data.train_ds.batch_size=128 \
model.language_model.pretrained_model_name=DeepPavlov/rubert-base-cased \
model.label_map=${DATA_PATH}/label_map.txt \
model.semiotic_classes=${DATA_PATH}/semiotic_classes.txt \
trainer.max_epochs=5
```
The initial dataset is the Google Text Normalization Dataset for Russian. The dataset preparation is described in this example script and includes running the GIZA++ automatic alignment tool to find granular alignments between spoken words and written fragments.
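Conceptually, a granular alignment maps each spoken word to the written fragment it produces, roughly as sketched below. The fragment splits shown are hypothetical examples for illustration, not actual GIZA++ output.

```python
# Illustrative spoken-word -> written-fragment alignment pairs; the real
# pairs are produced by GIZA++ during dataset preparation.
alignment = [
    ("тысяча", "1"),
    ("девятьсот", "9"),
    ("тринадцатом", "13"),
    ("году", "году"),
]
```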
The performance of ITN models can be measured using Word Error Rate (WER) and Sentence Accuracy. We measure Sentence Accuracy w.r.t. a multi-variant reference and subdivide the errors into "digit" and "other" errors.
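A minimal sketch of Sentence Accuracy with a multi-variant reference is shown below. This is not NeMo's evaluation code; it only illustrates the metric: a hypothesis counts as correct if it matches any of the acceptable written forms for that sentence.

```python
# Sentence Accuracy against multi-variant references (illustrative only).
def sentence_accuracy(hypotheses, references):
    # references[i] is the set of acceptable written forms for sentence i
    correct = sum(hyp in refs for hyp, refs in zip(hypotheses, references))
    return 100.0 * correct / len(hypotheses)

acc = sentence_accuracy(
    ["в 1913 году"],
    [{"в 1913 году", "в 1913-м году"}],
)
print(f"{acc:.2f}%")  # -> 100.00%
```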
The model obtains the following scores on the evaluation datasets:

| Metric | Default test set | Hard test set |
|---|---|---|
| WER | 3.55% | 7.80% |
| Sentence accuracy | 92.96% | 83.83% |
| digit errors | 0.40% | 3.41% |
| other errors | 6.63% | 12.76% |
Note that the reference files were taken from the test part of the Google TN Dataset, which is not 100% correct due to its synthetic nature. These scores are therefore not fully indicative of the final inverse text normalization quality, but they serve as a useful proxy.
The model is available in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference.
```python
import nemo.collections.nlp as nemo_nlp
nlp_model = nemo_nlp.models.ThutmoseTaggerModel.from_pretrained(model_name="itn_ru_thutmose_bert")
```
An example of inference and evaluation is provided in this [script](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/text_normalization_as_tagging/run_infer.sh):
```bash
python [NEMO_GIT_FOLDER]/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
pretrained_model="itn_ru_thutmose_bert" \
inference.from_file="<INPUT_FILE>" \
inference.out_file="<OUTPUT_FILE>" \
model.max_sequence_len=1024 \
inference.batch_size=128
```
The model expects the input file to be plain text without punctuation, similar to ASR output. It produces an output file with exactly the same number of lines as the input; each line consists of 5 columns.
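As a sketch, the final written-form text can be extracted from the output file like this. The column layout is an assumption here (tab-separated columns with the normalized text in the first column); verify it against the NeMo inference script before relying on it.

```python
# Read the inference output and keep only the normalized text.
# Assumptions (not confirmed by this card): columns are tab-separated and
# the final written-form text is in the first column.
out_file = "<OUTPUT_FILE>"  # path passed as inference.out_file above
with open(out_file, encoding="utf-8") as f:
    for line in f:
        written_text = line.rstrip("\n").split("\t")[0]
        print(written_text)
```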
Since this model was trained on synthetic data, its performance might degrade on constructions that were systematically biased in the initial corpus.
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.