NMT Multilingual En De/Es/Fr Transformer24x6

Multilingual Neural Machine Translation model to translate from English to German/Spanish/French

Model Overview

This model translates text from the source language (En) into one of the target languages (De/Es/Fr).

Model Architecture

The model is based on the Transformer "Big" architecture originally presented in the "Attention Is All You Need" paper [1]. In this particular instance, the model has 24 layers in the encoder and 6 layers in the decoder. It uses a SentencePiece tokenizer [2].
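To give a feel for how asymmetric a 24x6 layout is, the following minimal sketch counts the weights in the encoder and decoder stacks. Only the layer counts come from this card. The hidden size (1024) and feed-forward size (4096) are the Transformer "Big" values from the paper [1] and are an assumption here, since the card does not state them for this checkpoint. TransformerShape is a hypothetical helper, not part of NeMo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerShape:
    # Layer counts come from the model card.
    encoder_layers: int = 24
    decoder_layers: int = 6
    # ASSUMPTION: Transformer "Big" sizes from the paper [1]; the card
    # does not confirm them for this checkpoint.
    d_model: int = 1024
    d_ff: int = 4096

    def encoder_layer_weights(self) -> int:
        # Self-attention (Q, K, V and output projections) plus the two
        # feed-forward matrices; biases and layer norms are ignored.
        return 4 * self.d_model**2 + 2 * self.d_model * self.d_ff

    def decoder_layer_weights(self) -> int:
        # A decoder layer adds a cross-attention block (four more projections).
        return 8 * self.d_model**2 + 2 * self.d_model * self.d_ff

    def stack_weights(self) -> int:
        # Embeddings and the output projection are left out because the
        # vocabulary size is not stated in the card.
        return (self.encoder_layers * self.encoder_layer_weights()
                + self.decoder_layers * self.decoder_layer_weights())


shape = TransformerShape()
encoder_share = (shape.encoder_layers * shape.encoder_layer_weights()
                 / shape.stack_weights())
print(f"layer weights (approx.): {shape.stack_weights():,}")
print(f"share held by the encoder: {encoder_share:.0%}")
```

Under these assumptions most of the weights sit in the encoder, which runs once per input sentence, while the shallow decoder is the part that runs once per generated token.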

Training

This model was trained on a collection of publicly available datasets comprising millions of parallel sentences. The NeMo toolkit [5] was used to train it for roughly 800k steps.

Datasets

While training this model, we used the following datasets:

German

Europarl de-en set from: http://www.statmt.org/europarl/v10/training/europarl-v10.de-en.tsv.gz
De-En version of parallel common crawl from: http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
De-En version of paracrawl from: https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-de.txt.gz
De-En News commentary version from: http://data.statmt.org/news-commentary/v15/training/news-commentary-v15.de-en.tsv.gz
De-En Wikipedia Parallel Titles from: http://data.statmt.org/wikititles/v2/wikititles-v2.de-en.tsv.gz
A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.de-en.tmx.zip
A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2019.de-en.tmx.zip
A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.de-en.tmx.zip
A subset of Tilde Corpus from: https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.de-en.tmx.zip
De-En portion of WikiMatrix v1 from: http://data.statmt.org/wmt20/translation-task/WikiMatrix/WikiMatrix.v1.de-en.langid.tsv.gz

Spanish

Europarl - https://www.statmt.org/europarl/v7/es-en.tgz
Parallel Commoncrawl - http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
Paracrawl - https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-es.txt.gz
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.en-es.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2016.en-es.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.en-es.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.en-es.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/worldbank.en-es.tmx.zip
WikiMatrix - https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-es.tsv.gz

French

Europarl - http://www.statmt.org/europarl/v10/training/europarl-v10.fr-en.tsv.gz
Parallel Commoncrawl - http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
Paracrawl - https://s3.amazonaws.com/web-language-models/paracrawl/release7.1/en-fr.txt.gz
Giga French - http://statmt.org/wmt10/training-giga-fren.tar
News Commentary - http://data.statmt.org/news-commentary/v15/training/news-commentary-v15.en-fr.tsv.gz
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EESC2017.en-fr.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/rapid2016.en-fr.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/ecb2017.en-fr.tmx.zip
Tilde Corpus - https://tilde-model.s3-eu-west-1.amazonaws.com/EMA2016.en-fr.tmx.zip
WikiMatrix - https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-fr.tsv.gz

Tokenizer Construction

We used a SentencePiece [2] BPE tokenizer that is shared between the encoder and the decoder.

Performance

The accuracy of translation models is often measured using BLEU scores [3]. The model achieves the following sacreBLEU [4] scores on WMT test sets:

De
WMT13 - 29.8
WMT14 - 31.9

Es
WMT12 - 41.0
WMT13 - 36.9

Fr
WMT13 - 35.7
WMT14 - 42.0
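For readers unfamiliar with the metric, here is a simplified sketch of corpus-level BLEU: clipped n-gram precisions up to 4-grams combined by a geometric mean and scaled by a brevity penalty. It is not the sacreBLEU implementation used for the scores above. It assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its numbers are not comparable to the reported ones. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (one reference per sentence)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection keeps the minimum count per n-gram,
            # which is exactly the "clipped" match count BLEU needs.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero precision at any order gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
print(f"BLEU: {score:.1f}")
```

To reproduce scores of the kind reported here, use sacreBLEU [4] itself, which fixes tokenization and test-set handling so that results are comparable across papers.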

How to Use this Model

The model is available for use in the NeMo toolkit [5], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically load the model from NGC

import nemo
import nemo.collections.nlp as nemo_nlp
nmt_model = nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(model_name="mnmt_en_deesfr_transformer24x6")

Translating text with this model

python [NEMO_GIT_FOLDER]/examples/nlp/machine_translation/nmt_transformer_infer.py --model=mnmt_en_deesfr_transformer24x6.nemo --srctext=[TEXT_IN_SRC_LANGUAGE] --tgtout=[WHERE_TO_SAVE_TRANSLATIONS] --target_lang [TARGET_LANGUAGE] --source_lang en

where [TARGET_LANGUAGE] can be 'de', 'es', or 'fr'.
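When translating the same file into all three languages, it can be convenient to assemble the command above programmatically. The sketch below only builds the argument list, using the flags shown in the command. The helper name build_infer_command and the paths "NeMo", "input.en" and "output.<lang>" are hypothetical placeholders.

```python
import shlex

SUPPORTED_TARGETS = ("de", "es", "fr")


def build_infer_command(nemo_dir, model_path, src_file, out_file, target_lang):
    """Return the inference command as an argument list for subprocess."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"target_lang must be one of {SUPPORTED_TARGETS}")
    script = f"{nemo_dir}/examples/nlp/machine_translation/nmt_transformer_infer.py"
    return [
        "python", script,
        f"--model={model_path}",
        f"--srctext={src_file}",
        f"--tgtout={out_file}",
        "--target_lang", target_lang,
        "--source_lang", "en",
    ]


for lang in SUPPORTED_TARGETS:
    # Placeholder paths; replace them with your NeMo checkout and data files.
    cmd = build_infer_command(
        "NeMo", "mnmt_en_deesfr_transformer24x6.nemo",
        "input.en", f"output.{lang}", lang,
    )
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.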

Input

The translate method of the NMT model accepts a list of de-tokenized strings.

Output

The translate method outputs a list of de-tokenized strings in the target language.
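The input and output contract above can be wrapped in a small helper that checks the target language before calling the model. This is a sketch, not NeMo's own API: translate_english and demo are hypothetical names, and the source_lang and target_lang keyword arguments are assumed to mirror the command-line flags, so check them against the translate signature of your installed NeMo version.

```python
SUPPORTED_TARGETS = ("de", "es", "fr")


def translate_english(nmt_model, sentences, target_lang):
    """Translate English sentences with a loaded multilingual NMT model.

    Accepts a single string or a list of de-tokenized strings and returns
    a list of de-tokenized strings in the target language.
    """
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"target_lang must be one of {SUPPORTED_TARGETS}")
    if isinstance(sentences, str):
        sentences = [sentences]
    # ASSUMPTION: translate() takes source_lang/target_lang keywords.
    return nmt_model.translate(
        list(sentences), source_lang="en", target_lang=target_lang
    )


def demo():
    # Imported lazily so the helper above can be used without NeMo installed.
    import nemo.collections.nlp as nemo_nlp

    nmt_model = nemo_nlp.models.machine_translation.MTEncDecModel.from_pretrained(
        model_name="mnmt_en_deesfr_transformer24x6"
    )
    for lang in SUPPORTED_TARGETS:
        print(lang, translate_english(nmt_model, ["Hello, how are you?"], lang))
```

Calling demo() downloads the checkpoint from NGC on first use, so it needs network access and enough disk space for the model.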

Limitations

No known limitations at this time.

References

[1] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).

[2] https://github.com/google/sentencepiece

[3] https://en.wikipedia.org/wiki/BLEU

[4] https://github.com/mjpost/sacreBLEU

[5] NVIDIA NeMo Toolkit

License

License to use this model is covered by the NGC TERMS OF USE unless another License/Terms Of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC TERMS OF USE.

Publisher

NVIDIA

Latest Version: 1.2.0

Updated: April 4, 2023 UTC

Compressed Size: 1.73 GB

Labels

AI De DL En Es Fr Multilingual NMT PyTorch PytorchLightning Transformer