This model can be used to translate text from English (En) into any of 32 target languages.
The model is based on the Transformer "Big" architecture originally presented in the "Attention Is All You Need" paper [1]. In this particular instance, the model has 12 layers in the encoder and 2 layers in the decoder. It uses a SentencePiece tokenizer [2].
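As a rough illustration of the depth asymmetry, the sketch below builds a model of the same shape with PyTorch's generic Transformer module. The width and head count follow the "Big" configuration from [1] and are assumptions about this checkpoint; the actual model uses NeMo's own Megatron-based implementation.

import torch.nn as nn

# Illustrative only: mirrors the 12-layer encoder / 2-layer decoder shape,
# not the actual NeMo/Megatron implementation of this checkpoint.
model = nn.Transformer(
    d_model=1024,           # "Big" width from [1] (assumed for this checkpoint)
    nhead=16,               # "Big" attention heads from [1] (assumed)
    num_encoder_layers=12,  # deep encoder, as described above
    num_decoder_layers=2,   # shallow decoder speeds up autoregressive decoding
)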
The model was trained on a collection of many publicly available datasets comprising millions of parallel sentences.
We used the SentencePiece tokenizer [2] with a single BPE model shared between the encoder and the decoder.
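For readers unfamiliar with this setup, the sketch below shows what training and applying a shared SentencePiece BPE model involves. The file names and vocabulary size are hypothetical, not the settings used for this model.

import sentencepiece as spm

# Hypothetical: train one BPE model on text from both sides of the parallel
# corpus so that the encoder and decoder share a single vocabulary.
spm.SentencePieceTrainer.train(
    input="corpus.both_sides.txt",  # hypothetical file with source and target text
    model_prefix="shared_bpe",
    vocab_size=32000,               # assumed; not this model's actual setting
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
print(sp.encode("Machine translation is useful.", out_type=str))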
The model is available for use in the NeMo toolkit [5] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. To run inference:
python [NEMO_GIT_FOLDER]/examples/nlp/machine_translation/nmt_transformer_infer_megatron.py model_file=megatronnmt_en_any_500m.nemo srctext=[TEXT_IN_SRC_LANGUAGE] tgtout=[WHERE_TO_SAVE_TRANSLATIONS] source_lang=en target_lang=[TARGET_LANGUAGE]
where [TARGET_LANGUAGE] can be 'cs', 'da', 'de', 'el', 'es', 'fi', 'fr', 'hu', 'it', 'lt', 'lv', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'zh', 'ja', 'hi', 'ko', 'et', 'sl', 'bg', 'uk', 'hr', 'ar', 'vi', 'tr', 'id'
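For example, to translate English text into German (the input and output paths here are hypothetical):

python [NEMO_GIT_FOLDER]/examples/nlp/machine_translation/nmt_transformer_infer_megatron.py model_file=megatronnmt_en_any_500m.nemo srctext=input_en.txt tgtout=output_de.txt source_lang=en target_lang=de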
The translate method of the NMT model accepts a list of de-tokenized strings and returns a list of de-tokenized strings in the target language.
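A minimal sketch of programmatic use, assuming the checkpoint restores through NeMo's MTEncDecModel interface (Megatron-based checkpoints may require a different model class and a Megatron-aware environment):

from nemo.collections.nlp.models import MTEncDecModel

# Restore the downloaded checkpoint (path is hypothetical).
model = MTEncDecModel.restore_from("megatronnmt_en_any_500m.nemo")

# translate() takes and returns lists of de-tokenized strings.
translations = model.translate(
    ["Machine translation is useful."],
    source_lang="en",
    target_lang="de",
)
print(translations[0])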
No known limitations at this time.
[1] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[2] Kudo, Taku, and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." arXiv preprint arXiv:1808.06226 (2018). https://github.com/google/sentencepiece
[3] BLEU: https://en.wikipedia.org/wiki/BLEU
[4] sacreBLEU: https://github.com/mjpost/sacreBLEU
[5] NVIDIA NeMo: https://github.com/NVIDIA/NeMo
This work is licensed under NSCLv1.