Transformer-XL is a transformer-based language model with a segment-level recurrence and a novel relative positional encoding.
The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:
Hyperparameter | Description | Original setting for the base model | Our modification to the base model |
---|---|---|---|
d_model | hidden size | 410 | 512 |
n_head | number of attention heads | 10 | 8 |
d_head | size of each attention head | 41 | 64 |
d_inner | hidden size in fully-connected layers | 2100 | 2048 |
tgt_len | number of tokens to predict during training | 150 | 192 |
mem_len | number of tokens cached from previous iterations during training | 150 | 192 |
The changes described above align these hyperparameters with powers of two (tgt_len and mem_len, at 192 = 3 × 64, are multiples of 64 rather than exact powers of two). With this modification, the model achieves better hardware utilization and therefore higher training throughput.
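To make the alignment concrete, the following minimal sketch compares the two columns of the table by computing, for each value, the largest power of two that divides it. The helper name `largest_pow2_divisor` and the dictionary layout are illustrative assumptions and are not part of the training script.

```python
def largest_pow2_divisor(n: int) -> int:
    """Return the largest power of two that divides the positive integer n."""
    # n & -n isolates the lowest set bit of n.
    return n & -n


# Values copied from the table above.
original = {"d_model": 410, "n_head": 10, "d_head": 41,
            "d_inner": 2100, "tgt_len": 150, "mem_len": 150}
modified = {"d_model": 512, "n_head": 8, "d_head": 64,
            "d_inner": 2048, "tgt_len": 192, "mem_len": 192}

# In both columns, n_head * d_head equals d_model.
for cfg in (original, modified):
    assert cfg["n_head"] * cfg["d_head"] == cfg["d_model"]

for name in original:
    old, new = original[name], modified[name]
    print(f"{name:8s} {old:5d} (divisible by {largest_pow2_divisor(old):4d})"
          f" -> {new:5d} (divisible by {largest_pow2_divisor(new):4d})")
```

Every original value is divisible by at most 4, while every modified value is divisible by at least 8 (at least 64 for all but n_head).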
The following table lists the hyperparameters for the base Transformer-XL model for the WikiText-103 dataset available in this repository.
Hyperparameter | Description | Base model |
---|---|---|
n_layer | number of layers | 16 |
d_model | hidden size | 512 |
n_head | number of attention heads | 8 |
d_head | size of each attention head | 64 |
d_inner | inner hidden size in fully-connected layers | 2048 |
dropout | dropout | 0.1 |
dropatt | dropout after softmax in the attention | 0.0 |
lr | base learning rate | 0.01 |
min_lr_ratio | minimum ratio learning rate (for cosine decay) | 0.1 |
max_step | number of training steps | 40,000 |
warmup_step | number of learning rate warmup steps | 1,000 |
batch_size | training batch size | 256 |
tgt_len | number of tokens to predict during training | 192 |
mem_len | number of tokens cached from previous iterations during training | 192 |
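As a rough illustration of how lr, min_lr_ratio, max_step, and warmup_step from the table could interact, here is a minimal sketch of a warmup-plus-cosine-decay schedule. It assumes a linear warmup from zero and a cosine decay from lr down to lr × min_lr_ratio over the remaining steps; the exact schedule used by the training script may differ, and the function name `learning_rate` is hypothetical.

```python
import math

# Values copied from the table above.
LR = 0.01
MIN_LR_RATIO = 0.1
MAX_STEP = 40_000
WARMUP_STEP = 1_000


def learning_rate(step: int) -> float:
    """Learning rate at a given training step (illustrative sketch only)."""
    if step < WARMUP_STEP:
        # Assumption: linear warmup from 0 to the base learning rate.
        return LR * step / WARMUP_STEP
    # Assumption: cosine decay over the steps that remain after warmup,
    # ending at LR * MIN_LR_RATIO.
    progress = min((step - WARMUP_STEP) / (MAX_STEP - WARMUP_STEP), 1.0)
    min_lr = LR * MIN_LR_RATIO
    return min_lr + 0.5 * (LR - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 20_500, 40_000):
    print(f"step {step:6d}: lr = {learning_rate(step):.5f}")
```

Under these assumptions the rate peaks at 0.01 when warmup ends and finishes at 0.001 on the last step.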
The Transformer-XL model addresses a limitation of vanilla transformer-based language models, which can only use a relatively short context bounded by the segment length. Transformer-XL introduces a recurrence mechanism that reuses cached hidden states from previous segments. During training, the context consists of the current segment's hidden states concatenated with the cached states from previous iterations. Gradients are backpropagated only through the current segment, but the model can still use the extra information stored in the cache and is therefore able to model long-term dependencies.
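The bookkeeping behind this recurrence can be sketched in a few lines of plain Python. This is a simplified illustration, not the model's implementation: integers stand in for per-position hidden states, a single cache is used instead of one per layer, and the function name `step` is hypothetical. In a real implementation the cached states would be detached from the autograd graph so that gradients flow only through the current segment.

```python
from typing import List, Tuple

TGT_LEN = 192  # tokens predicted per training iteration (from the table)
MEM_LEN = 192  # positions cached from previous iterations (from the table)


def step(memory: List[int], segment: List[int],
         mem_len: int = MEM_LEN) -> Tuple[List[int], List[int]]:
    """Return (context, new_memory) for one training segment."""
    # The context is the cached states followed by the current segment.
    context = memory + segment
    # Keep only the most recent mem_len positions for the next iteration.
    new_memory = context[-mem_len:] if mem_len > 0 else []
    return context, new_memory


# Placeholder "hidden states": one integer per token position.
tokens = list(range(4 * TGT_LEN))
memory: List[int] = []
context_lengths = []
for start in range(0, len(tokens), TGT_LEN):
    segment = tokens[start:start + TGT_LEN]
    context, memory = step(memory, segment)
    context_lengths.append(len(context))

print(context_lengths)  # [192, 384, 384, 384]
```

After the first segment, each iteration sees tgt_len + mem_len = 384 positions of context even though only the 192 positions of the current segment are being predicted.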
An illustration of the recurrence mechanism taken from the Transformer-XL paper is shown below.
This model was trained using the script available on NGC and in the GitHub repo.
The following dataset was used to train this model: WikiText-103.
Performance numbers for this model are available in NGC.
This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and the datasets the model was derived from.