
Transformer-XL PyTorch checkpoint (Base, AMP)


Description

Transformer-XL Base PyTorch checkpoint trained with Automatic Mixed Precision (AMP)

Publisher

NVIDIA Deep Learning Examples

Latest Version

19.11.0_amp

Modified

April 4, 2023

Size

2.15 GB

Model Overview

Transformer-XL is a transformer-based language model with a segment-level recurrence and a novel relative positional encoding.

Model Architecture

The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:

| Hyperparameter | Description | Original setting for the base model | Our modification for the base model |
|---|---|---|---|
| d_model | hidden size | 410 | 512 |
| n_head | number of attention heads | 10 | 8 |
| d_head | size of each attention head | 41 | 64 |
| d_inner | hidden size in fully-connected layers | 2100 | 2048 |
| tgt_len | number of tokens to predict during training | 150 | 192 |
| mem_len | number of tokens cached from previous iterations during training | 150 | 192 |

The changes described above align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and therefore higher training throughput.

The Transformer-XL "large" model for the WikiText-103 dataset available in this repository uses the original hyperparameters from the reference implementation.

The following table lists the hyperparameters of the base and large Transformer-XL models for the WikiText-103 dataset available in this repository.

| Hyperparameter | Description | Base model | Large model |
|---|---|---|---|
| n_layer | number of layers | 16 | 18 |
| d_model | hidden size | 512 | 1024 |
| n_head | number of attention heads | 8 | 16 |
| d_head | size of each attention head | 64 | 64 |
| d_inner | inner hidden size in fully-connected layers | 2048 | 4096 |
| dropout | dropout | 0.1 | 0.2 |
| dropatt | dropout after softmax in the attention | 0.0 | 0.2 |
| lr | base learning rate | 0.01 | 0.01 |
| eta_min | minimum learning rate (for cosine decay) | 0.001 | 0.0001 |
| max_step | number of training steps | 40,000 | 100,000 |
| warmup_step | number of learning rate warmup steps | 1,000 | 16,000 |
| batch_size | training batch size | 256 | 128 |
| tgt_len | number of tokens to predict during training | 192 | 384 |
| mem_len | number of tokens cached from previous iterations during training | 192 | 384 |
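
To make the two columns easier to reuse, here is the same information written out as plain Python dictionaries. The key names follow the table; the actual training scripts in the Deep Learning Examples repository may name or group these options differently.

```python
# Illustration only: the hyperparameters from the table above as Python dicts.

BASE_CONFIG = {
    "n_layer": 16,        # number of layers
    "d_model": 512,       # hidden size
    "n_head": 8,          # number of attention heads
    "d_head": 64,         # size of each attention head
    "d_inner": 2048,      # inner hidden size in fully-connected layers
    "dropout": 0.1,
    "dropatt": 0.0,       # dropout after softmax in the attention
    "lr": 0.01,           # base learning rate
    "eta_min": 0.001,     # minimum learning rate for cosine decay
    "max_step": 40_000,   # number of training steps
    "warmup_step": 1_000, # learning rate warmup steps
    "batch_size": 256,
    "tgt_len": 192,       # tokens predicted per training step
    "mem_len": 192,       # tokens cached from previous iterations
}

LARGE_CONFIG = {
    **BASE_CONFIG,        # start from the base settings and override
    "n_layer": 18,
    "d_model": 1024,
    "n_head": 16,
    "d_inner": 4096,
    "dropout": 0.2,
    "dropatt": 0.2,
    "eta_min": 0.0001,
    "max_step": 100_000,
    "warmup_step": 16_000,
    "batch_size": 128,
    "tgt_len": 384,
    "mem_len": 384,
}
```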

The Transformer-XL model addresses a limitation of vanilla transformer-based language models, which can only use a relatively short context bounded by the segment length. Transformer-XL introduces a recurrence mechanism that reuses cached hidden states from previous segments. During training, the context is the concatenation of the current segment's hidden states and the cached states from previous iterations. Gradients are backpropagated only through the current segment, yet the model can still exploit the extra information stored in the cache and therefore model long-term dependencies.

An illustration of the recurrence mechanism can be found in the Transformer-XL paper.
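
The following is a minimal PyTorch sketch of this idea, not the code used in the repository: cached hidden states from the previous segment are detached from the autograd graph and concatenated with the current segment before attention, so the model attends over a longer context while gradients flow only through the current segment. The real model additionally uses relative positional encodings and keeps a per-layer cache truncated to mem_len.

```python
import torch
import torch.nn as nn

class RecurrentSegmentLayer(nn.Module):
    """Minimal sketch of segment-level recurrence (illustration only)."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, segment: torch.Tensor, memory: torch.Tensor):
        # Context = cached states from previous iterations + current segment.
        # The cache is detached, so gradients only flow through the segment.
        context = torch.cat([memory.detach(), segment], dim=1)
        out, _ = self.attn(query=segment, key=context, value=context)
        out = self.ff(out + segment)
        # New memory for the next iteration: the current hidden states
        # (the real model truncates the cache to mem_len tokens).
        return out, out

# Usage: process a stream of segments, carrying the memory across iterations.
layer = RecurrentSegmentLayer(d_model=512, n_head=8)
memory = torch.zeros(1, 192, 512)          # mem_len=192 cached tokens
for _ in range(3):
    segment = torch.randn(1, 192, 512)     # tgt_len=192 current tokens
    out, memory = layer(segment, memory)
```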

Training

This model was trained using the training script available on NGC and in the GitHub repository.
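
As a quick sanity check, the downloaded checkpoint can be opened with torch.load; the file name below is a placeholder for whatever this NGC page delivers.

```python
import torch

# Placeholder file name; substitute the checkpoint downloaded from this page.
# map_location="cpu" allows inspecting the checkpoint without a GPU.
checkpoint = torch.load("transformer-xl_base_amp.pt", map_location="cpu")

# PyTorch checkpoints are usually dictionaries; the keys show what was saved
# (model weights, optimizer state, training arguments, ...).
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
else:
    print(type(checkpoint))
```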

Dataset

The following datasets were used to train this model:

  • WikiText-103 - A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
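
For context, language-model training on WikiText-103 typically arranges the tokenized corpus into contiguous streams and slices them into segments of tgt_len tokens. The sketch below illustrates that layout; it is not necessarily the exact preprocessing used by the training script.

```python
import torch

def batchify(token_ids: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Arrange a flat token stream into contiguous rows, one per batch lane."""
    n_steps = token_ids.size(0) // batch_size
    token_ids = token_ids[: n_steps * batch_size]
    return token_ids.view(batch_size, -1)

def segments(data: torch.Tensor, tgt_len: int):
    """Yield (input, target) pairs of tgt_len tokens; targets are shifted by one."""
    for i in range(0, data.size(1) - 1, tgt_len):
        seq_len = min(tgt_len, data.size(1) - 1 - i)
        yield data[:, i : i + seq_len], data[:, i + 1 : i + 1 + seq_len]

# Dummy token stream standing in for the tokenized corpus
# (the WikiText-103 word-level vocabulary has roughly 268k entries).
stream = torch.randint(0, 267_735, (10_000,))
data = batchify(stream, batch_size=256)
for inputs, targets in segments(data, tgt_len=192):
    pass  # feed (inputs, targets) to the model, carrying the mem_len cache across steps
```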

Performance

Performance numbers for this model are available on NGC.

References

  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." ACL 2019, arXiv:1901.02860.

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the licenses of the training script and of the datasets from which the model was derived.