This model translates text from the source language, Hindi (hi), into the target language, English (en).
The model is based on the Transformer "Big" architecture originally presented in the "Attention Is All You Need" paper [1]. In this instance, the model has 12 layers in the encoder and 2 layers in the decoder. It uses the YouTokenToMe tokenizer [2].
This model was trained on a collection of publicly available datasets comprising millions of parallel sentences. Training was done with the NeMo toolkit [5] for roughly 200k steps.
While training this model, we used the following datasets:
We used the YouTokenToMe tokenizer [2] with separate encoder and decoder BPE tokenizers.
The accuracy of translation models is often measured using BLEU scores [3]. On the WMT14 test set, this model achieves a BLEU score of 24.2, measured with the SacreBLEU package [4]. The SacreBLEU signature and detailed output are:
BLEU+case.mixed+lang.hi-en+numrefs.1+smooth.exp+test.wmt14+tok.13a+version.1.5.1 = 24.2 59.1/30.5/17.8/10.8 (BP = 1.000 ratio = 1.012 hyp_len = 56254 ref_len = 55571)
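To make the SacreBLEU output above easier to read, the following minimal sketch recombines its components using the standard BLEU definition: the geometric mean of the four n-gram precisions multiplied by the brevity penalty. The function names are illustrative and are not part of SacreBLEU. Because the precisions are reported to only one decimal place, the recomputed score should be expected to match the reported 24.2 only to within a few tenths.

```python
import math

# Values copied from the SacreBLEU output line in this model card.
precisions = [59.1, 30.5, 17.8, 10.8]  # 1- to 4-gram precisions, in percent
hyp_len, ref_len = 56254, 55571


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 unless the hypothesis is shorter."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_precisions(precisions, bp: float) -> float:
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_mean)


ratio = hyp_len / ref_len
bp = brevity_penalty(hyp_len, ref_len)
bleu = bleu_from_precisions(precisions, bp)

print(f"ratio = {ratio:.3f}, BP = {bp:.3f}, BLEU = {bleu:.1f}")
```

The hypothesis is slightly longer than the reference (ratio above 1), so the brevity penalty is 1 and the score is determined entirely by the n-gram precisions.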
The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo
import nemo.collections.nlp as nemo_nlp
nmt_model = nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(model_name="nmt_hi_en_transformer12x2")
python [NEMO_GIT_FOLDER]/examples/nlp/machine_translation/nmt_transformer_infer.py --model=nmt_hi_en_transformer12x2.nemo --srctext=[TEXT_IN_SRC_LANGUAGE] --tgtout=[WHERE_TO_SAVE_TRANSLATION] --target_lang en --source_lang hi
The translate method of the NMT model accepts a list of de-tokenized strings.
The translate method outputs a list of de-tokenized strings in the target language.
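Since the translate method maps a list of strings to a list of strings, a long input can be split into smaller lists and translated piece by piece. The sketch below illustrates that pattern under stated assumptions: `translate_in_batches` and `fake_translate` are hypothetical names introduced here, not part of NeMo, and the stand-in function only tags its input so that the example runs without a model.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    translate_fn: Callable[[List[str]], List[str]],
    sentences: Sequence[str],
    batch_size: int = 32,
) -> List[str]:
    """Translate sentences in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    translations: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = list(sentences[start:start + batch_size])
        outputs = translate_fn(batch)
        # A list-in, list-out translator should return one string per input.
        if len(outputs) != len(batch):
            raise RuntimeError("translate_fn returned an unexpected number of strings")
        translations.extend(outputs)
    return translations


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in for a real model: tags each input instead of translating it."""
    return [f"<en> {text}" for text in batch]


demo = translate_in_batches(
    fake_translate, ["नमस्ते", "धन्यवाद", "आप कैसे हैं?"], batch_size=2
)
print(demo)
```

With NeMo installed and `nmt_model` loaded as shown earlier, passing `nmt_model.translate` in place of `fake_translate` should work the same way, assuming the method is called with a plain list of Hindi strings.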
[1] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[2] https://github.com/VKCOM/YouTokenToMe
[3] https://en.wikipedia.org/wiki/BLEU
[4] https://github.com/mjpost/sacreBLEU
[5] https://github.com/NVIDIA/NeMo
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.