Neural Machine Translation (NMT) model to translate from English to Russian
Model Overview
This model translates text from the source language (English, En) into the target language (Russian, Ru).
Model Architecture
The model is based on the Transformer "Big" architecture originally presented in the "Attention Is All You Need" paper [1]. This particular instance has 24 layers in the encoder and 6 layers in the decoder. It uses the YouTokenToMe tokenizer [2].
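To give a sense of scale for the 24x6 layout, the following minimal sketch estimates the number of weights in the encoder and decoder stacks. The layer counts come from the description above; the hidden size (1024), feed-forward size (4096) and head count (16) are the Transformer "Big" values from [1] and are assumed, not confirmed, for this checkpoint. The class and function names are hypothetical and not part of NeMo. Embeddings and the output projection are left out because the vocabulary size is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical config holder for illustration; not a NeMo class."""
    encoder_layers: int = 24   # from the model description
    decoder_layers: int = 6    # from the model description
    hidden_size: int = 1024    # assumed: Transformer "Big" value from [1]
    ffn_size: int = 4096       # assumed: Transformer "Big" value from [1]
    num_heads: int = 16        # assumed; does not change the weight count


def attention_params(cfg):
    # Q, K, V and output projections: each hidden x hidden plus a bias vector.
    return 4 * (cfg.hidden_size * cfg.hidden_size + cfg.hidden_size)


def ffn_params(cfg):
    # Two linear layers (hidden -> ffn -> hidden), each with a bias vector.
    return (cfg.hidden_size * cfg.ffn_size + cfg.ffn_size
            + cfg.ffn_size * cfg.hidden_size + cfg.hidden_size)


def layer_norm_params(cfg):
    # One scale and one shift vector.
    return 2 * cfg.hidden_size


def encoder_layer_params(cfg):
    # Self-attention + feed-forward + two layer norms.
    return attention_params(cfg) + ffn_params(cfg) + 2 * layer_norm_params(cfg)


def decoder_layer_params(cfg):
    # Self-attention + cross-attention + feed-forward + three layer norms.
    return 2 * attention_params(cfg) + ffn_params(cfg) + 3 * layer_norm_params(cfg)


def stack_params(cfg):
    # Totals for the encoder stack and the decoder stack, in that order.
    enc = cfg.encoder_layers * encoder_layer_params(cfg)
    dec = cfg.decoder_layers * decoder_layer_params(cfg)
    return enc, dec


cfg = TransformerConfig()
encoder_total, decoder_total = stack_params(cfg)
print(f"encoder stack: {encoder_total / 1e6:.1f}M parameters")
print(f"decoder stack: {decoder_total / 1e6:.1f}M parameters")
```

Under these assumptions the encoder stack holds roughly three times as many weights as the decoder stack.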
Training
This model was trained on a collection of publicly available datasets comprising roughly a hundred million parallel sentences. The NeMo toolkit [5] was used to train it for roughly 700k steps.
Datasets
While training this model, we used the following datasets:
- Parallel Commoncrawl - http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
- Paracrawl - https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ru.txt.gz
- News Commentary - http://data.statmt.org/news-commentary/v15/training/news-commentary-v15.en-ru.tsv.gz
- WikiTitles - http://data.statmt.org/wikititles/v2/wikititles-v2.ru-en.tsv.gz
- WikiMatrix - http://data.statmt.org/wmt20/translation-task/WikiMatrix/WikiMatrix.v1.en-ru.langid.tsv.gz
- UN Parallel Corpus - https://conferences.unite.un.org/UNCORPUS/en/DownloadOverview
- CC-Aligned - http://www.statmt.org/cc-aligned/sentence-aligned/en_XX-ru_RU.tsv.xz
- CC-Matrix - https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
Tokenizer Construction
We used the YouTokenToMe tokenizer [2] with separate encoder and decoder BPE tokenizers.
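A minimal sketch of what separate encoder and decoder tokenizers could look like with the YouTokenToMe Python package [2]: one BPE model is trained on the English side of the corpus and another on the Russian side. The file names, the vocab_size value and the helper functions are illustrative assumptions; the actual settings used for this model are not given here.

```python
from pathlib import Path


def bpe_model_paths(out_dir):
    """Return hypothetical file names for the two BPE models."""
    out_dir = Path(out_dir)
    return {
        "encoder": out_dir / "tokenizer.encoder.en.model",
        "decoder": out_dir / "tokenizer.decoder.ru.model",
    }


def train_separate_tokenizers(en_corpus, ru_corpus, out_dir, vocab_size=32000):
    """Train one BPE model per language from plain-text corpus files.

    vocab_size=32000 is a placeholder, not the value used for this model.
    """
    import youtokentome as yttm  # third-party: pip install youtokentome

    paths = bpe_model_paths(out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # The English side feeds the encoder tokenizer, the Russian side the decoder one.
    yttm.BPE.train(data=str(en_corpus), model=str(paths["encoder"]), vocab_size=vocab_size)
    yttm.BPE.train(data=str(ru_corpus), model=str(paths["decoder"]), vocab_size=vocab_size)
    return paths


def encode_source(text, encoder_model_path):
    """Split one English sentence into subword strings with the encoder tokenizer."""
    import youtokentome as yttm

    bpe = yttm.BPE(model=str(encoder_model_path))
    return bpe.encode([text], output_type=yttm.OutputType.SUBWORD)[0]
```

Keeping the two models separate means each vocabulary is fitted to a single script (Latin for the source, Cyrillic for the target) rather than shared between them.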
Performance
The accuracy of translation models is often measured using BLEU scores [3]. The model achieves the following sacreBLEU [4] scores on the WMT'13, WMT'14, WMT'18, WMT'19 and WMT'20 test sets:
WMT'13 - 30.5
WMT'14 - 44.4
WMT'18 - 35.1
WMT'19 - 35.8
WMT'20 - 25.3
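Scores of this kind can be computed with the sacreBLEU Python package [4]. The sketch below assumes translations and references are available as equal-length lists of de-tokenized sentences (for example, read one per line from plain-text files). The helper names are hypothetical, and the sacreBLEU settings behind the numbers above are not stated here, so the library defaults are used.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of sentences, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def corpus_bleu_score(hypotheses, references):
    """Return the corpus-level BLEU score for one set of references."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    import sacrebleu  # third-party: pip install sacrebleu

    # sacreBLEU expects a list of reference streams; here there is a single one.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```

A call would look like corpus_bleu_score(read_lines(hyp_path), read_lines(ref_path)), where hyp_path and ref_path are placeholders for the model output and the reference translations of a test set.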
How to Use this Model
The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically load the model from NGC
import nemo
import nemo.collections.nlp as nemo_nlp
nmt_model = nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(model_name="nmt_en_ru_transformer24x6")
Translating text with this model
python [NEMO_GIT_FOLDER]/examples/nlp/machine_translation/nmt_transformer_infer.py --model=nmt_en_ru_transformer24x6.nemo --srctext=[TEXT_IN_SRC_LANGUAGE] --tgtout=[WHERE_TO_SAVE_TRANSLATIONS] --target_lang ru --source_lang en
Input
The translate method of the NMT model accepts a list of de-tokenized strings in the source language.
Output
The translate method outputs a list of de-tokenized strings in the target language.
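A minimal sketch of calling translate from Python, following the input and output contract above. The from_pretrained call is the one shown earlier; the helper names and the input check are illustrative additions. Passing source_lang and target_lang to translate mirrors the command-line flags above and should be checked against the installed NeMo version.

```python
def check_inputs(sentences):
    """Validate that the input is a list of plain (de-tokenized) strings."""
    if not isinstance(sentences, list) or not all(isinstance(s, str) for s in sentences):
        raise TypeError("translate expects a list of strings")
    return sentences


def translate_en_ru(sentences, model_name="nmt_en_ru_transformer24x6"):
    """Translate a list of English sentences into a list of Russian sentences."""
    check_inputs(sentences)

    import nemo.collections.nlp as nemo_nlp  # third-party: NeMo toolkit [5]

    # Downloads the checkpoint from NGC on first use.
    model = nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(
        model_name=model_name
    )
    # One output string per input string, in the same order.
    return model.translate(sentences, source_lang="en", target_lang="ru")
```

For example, translate_en_ru(["Hello, how are you?"]) would return a list containing a single Russian sentence.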
Limitations
No known limitations at this time.
References
[1] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[2] https://github.com/VKCOM/YouTokenToMe
[3] https://en.wikipedia.org/wiki/BLEU
[4] https://github.com/mjpost/sacrebleu
[5] https://github.com/NVIDIA/NeMo
License
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.