Multilingual Neural Machine Translation model to translate from English to German/Spanish/French
Model Overview
This model translates text from the source language (En) into one of three target languages (De/Es/Fr).
Model Architecture
The model is based on the Transformer "Big" architecture originally presented in the "Attention Is All You Need" paper [1].
In this particular instance, the model has 24 layers in the encoder and 6 layers in the decoder.
It uses a SentencePiece tokenizer [2].
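As a rough illustration of what a 24-layer encoder and a 6-layer decoder imply for model size, the sketch below counts the weights in the attention and feed-forward blocks. The hidden size (1024) and feed-forward size (4096) are the Transformer "Big" values from [1] and are assumed here, since this card does not state them. Biases, layer norms, and embeddings are ignored, and the names `TransformerShape` and `approx_layer_params` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerShape:
    encoder_layers: int = 24   # stated in this model card
    decoder_layers: int = 6    # stated in this model card
    hidden_size: int = 1024    # assumption: Transformer "Big" value from [1]
    ffn_size: int = 4096       # assumption: Transformer "Big" value from [1]


def approx_layer_params(shape: TransformerShape) -> dict:
    """Approximate weight counts per stack (no biases, norms, or embeddings)."""
    d, f = shape.hidden_size, shape.ffn_size
    attention = 4 * d * d      # query, key, value, and output projections
    feed_forward = 2 * d * f   # two linear maps: d -> f and f -> d
    encoder_layer = attention + feed_forward
    # A decoder layer has self-attention plus cross-attention over the encoder.
    decoder_layer = 2 * attention + feed_forward
    return {
        "encoder": shape.encoder_layers * encoder_layer,
        "decoder": shape.decoder_layers * decoder_layer,
    }


counts = approx_layer_params(TransformerShape())
print(counts, sum(counts.values()))
```

Under these assumptions the encoder stack holds roughly three times as many weights as the decoder stack, which reflects the deep-encoder, shallow-decoder layout described above.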
Training
The model was trained on a collection of publicly available datasets comprising millions of parallel sentences. The NeMo toolkit [5] was used to train the model for roughly 800k steps.
Datasets
While training this model, we used the following datasets:
German
- Europarl de-en set from: http://www.statmt.org/europarl/v10/training/europarl-v10.de-en.tsv.gz
- De-En version of parallel common crawl from: http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
- De-En version of paracrawl from: https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-de.txt.gz
- De-En version of News Commentary from: http://data.statmt.org/news-commentary/v15/training/news-commentary-v15.de-en.tsv.gz
- De-En Wikipedia Parallel Titles from: http://data.statmt.org/wikititles/v2/wikititles-v2.de-en.tsv.gz
- A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.de-en.tmx.zip
- A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2019.de-en.tmx.zip
- A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.de-en.tmx.zip
- A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.de-en.tmx.zip
- De-En portion of WikiMatrix v1 from: http://data.statmt.org/wmt20/translation-task/WikiMatrix/WikiMatrix.v1.de-en.langid.tsv.gz
Spanish
- Europarl - https://www.statmt.org/europarl/v7/es-en.tgz
- Parallel Commoncrawl - http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
- Paracrawl - https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-es.txt.gz
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.en-es.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2016.en-es.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.en-es.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.en-es.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/worldbank.en-es.tmx.zip
- WikiMatrix - https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-es.tsv.gz
French
- Europarl - http://www.statmt.org/europarl/v10/training/europarl-v10.fr-en.tsv.gz
- Parallel Commoncrawl - http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
- Paracrawl - https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-fr.txt.gz
- Giga French - http://statmt.org/wmt10/training-giga-fren.tar
- News Commentary - http://data.statmt.org/news-commentary/v15/training/news-commentary-v15.en-fr.tsv.gz
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.en-fr.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2016.en-fr.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.en-fr.tmx.zip
- Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.en-fr.tmx.zip
- WikiMatrix - https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-fr.tsv.gz
Tokenizer Construction
We used SentencePiece [2] to build a BPE tokenizer that is shared between the encoder and the decoder.
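A shared BPE tokenizer of this kind could be trained with the SentencePiece Python package roughly as follows. This is a sketch under stated assumptions, not the recipe used for this model: the file names, the vocabulary size of 32000, and the helper names `shared_bpe_args` and `train_shared_tokenizer` are all hypothetical.

```python
from typing import Dict, List


def shared_bpe_args(corpus_files: List[str], model_prefix: str, vocab_size: int) -> Dict[str, object]:
    """Build keyword arguments for SentencePieceTrainer.train.

    Passing source- and target-language files together yields a single
    vocabulary that both the encoder and the decoder can use.
    """
    if not corpus_files:
        raise ValueError("at least one corpus file is required")
    return {
        "input": ",".join(corpus_files),  # comma-separated list of text files
        "model_prefix": model_prefix,     # output files: <prefix>.model, <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
    }


def train_shared_tokenizer(corpus_files: List[str],
                           model_prefix: str = "shared_bpe",  # hypothetical name
                           vocab_size: int = 32000):           # hypothetical size
    # Imported lazily; requires the third-party `sentencepiece` package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**shared_bpe_args(corpus_files, model_prefix, vocab_size))
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")
```

Calling `train_shared_tokenizer(["train.en", "train.de", "train.es", "train.fr"])` would train on all four languages at once, so a single vocabulary serves both sides of the model.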
Performance
The accuracy of translation models is often measured using BLEU scores [3].
The model achieves the following sacreBLEU [4] scores on WMT test sets
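For readers who want to score their own translations in the same way, a corpus-level sacreBLEU score can be computed with the `sacrebleu` Python package. The sketch below assumes one reference per sentence. The helper names `check_parallel` and `corpus_sacrebleu` are hypothetical, and the test sets and settings behind the scores reported for this model are not specified here.

```python
from typing import List


def check_parallel(hypotheses: List[str], references: List[str]) -> None:
    """Ensure there is exactly one reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError(
            f"got {len(hypotheses)} hypotheses but {len(references)} references"
        )


def corpus_sacrebleu(hypotheses: List[str], references: List[str]) -> float:
    # Imported lazily; requires the third-party `sacrebleu` package.
    import sacrebleu

    check_parallel(hypotheses, references)
    # corpus_bleu expects a list of reference streams; a single stream is used here.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```

Both arguments should contain de-tokenized text, since sacreBLEU applies its own tokenization before scoring.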
How to Use this Model
The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically load the model from NGC
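A minimal sketch of loading the checkpoint through NeMo's `from_pretrained` mechanism. The model name `mnmt_en_deesfr_transformer24x6` is an assumption (it is not stated in this card), and `load_pretrained_nmt` is a hypothetical helper name.

```python
# Assumption: the NGC model name is not given in this card; replace as needed.
MODEL_NAME = "mnmt_en_deesfr_transformer24x6"


def load_pretrained_nmt(model_name: str = MODEL_NAME):
    """Download (on first use) and restore a pre-trained NMT model."""
    # Imported lazily so this file can be imported without NeMo installed.
    import nemo.collections.nlp as nemo_nlp

    return nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(
        model_name=model_name
    )
```

Calling `nmt_model = load_pretrained_nmt()` requires the NeMo toolkit [5] and network access for the initial download.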
Translating text with this model
where [TARGET_LANGUAGE] can be 'de' or 'es' or 'fr'
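The same choice of target language can be made from Python. The sketch below assumes a loaded model object whose `translate` method accepts `text`, `source_lang`, and `target_lang` keyword arguments. The wrapper `translate_from_english` and the constant `SUPPORTED_TARGETS` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence

SUPPORTED_TARGETS = ("de", "es", "fr")  # target languages listed in this card


def translate_from_english(model, sentences: Sequence[str], target_lang: str) -> List[str]:
    """Translate English sentences into one of the supported target languages."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(
            f"target_lang must be one of {SUPPORTED_TARGETS}, got {target_lang!r}"
        )
    # Assumption: `model.translate` takes these keyword arguments.
    return list(model.translate(text=list(sentences), source_lang="en", target_lang=target_lang))
```

Validating the language code before calling the model gives a clear error for unsupported targets instead of an unpredictable translation.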
Input
The translate method of the NMT model accepts a list of de-tokenized strings.
Output
The translate method outputs a list of de-tokenized strings in the target language.
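Because both input and output are plain lists of strings, a long document can be translated in chunks and reassembled in order. The sketch below assumes the translation callable returns exactly one output string per input string, in the same order. The helper names and the default batch size of 32 are hypothetical.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(translate_fn: Callable[[List[str]], List[str]],
                         sentences: List[str],
                         batch_size: int = 32) -> List[str]:
    """Apply a list-in, list-out translation function batch by batch."""
    outputs: List[str] = []
    for batch in batched(sentences, batch_size):
        result = translate_fn(batch)
        if len(result) != len(batch):
            raise RuntimeError("translation returned a different number of strings")
        outputs.extend(result)
    return outputs
```

Here `translate_fn` could be a small lambda that wraps the model's translate method with a fixed target language.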
Limitations
No known limitations at this time.
References
[1] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[2] https://github.com/google/sentencepiece
[3] https://en.wikipedia.org/wiki/BLEU
[4] https://github.com/mjpost/sacreBLEU
[5] https://github.com/NVIDIA/NeMo
License
License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.