Transformer-XL Base TensorFlow checkpoint (AMP, Base)
Description: Transformer-XL Base TensorFlow checkpoint trained with AMP
Publisher: NVIDIA Deep Learning Examples
Latest Version: 20.06.1
Modified: April 4, 2023
Size: 2.97 GB

Model Overview

Transformer-XL is a transformer-based language model with a segment-level recurrence and a novel relative positional encoding.

Model Architecture

The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:

| Hyperparameter | Description | Original setting for the base model | Our modification to the base model |
|----------------|-------------|-------------------------------------|------------------------------------|
| d_model | hidden size | 410 | 512 |
| n_head | number of attention heads | 10 | 8 |
| d_head | size of each attention head | 41 | 64 |
| d_inner | hidden size in fully-connected layers | 2100 | 2048 |
| tgt_len | number of tokens to predict during training | 150 | 192 |
| mem_len | number of tokens cached from previous iterations during training | 150 | 192 |

The changes described above were made to align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and, therefore, higher training throughput.

The following table lists the hyperparameters of the base Transformer-XL model for the WikiText-103 dataset available in this repository.

| Hyperparameter | Description | Base model |
|----------------|-------------|------------|
| n_layer | number of layers | 16 |
| d_model | hidden size | 512 |
| n_head | number of attention heads | 8 |
| d_head | size of each attention head | 64 |
| d_inner | inner hidden size in fully-connected layers | 2048 |
| dropout | dropout rate | 0.1 |
| dropatt | dropout rate after softmax in the attention | 0.0 |
| lr | base learning rate | 0.01 |
| min_lr_ratio | minimum learning-rate ratio (for cosine decay) | 0.1 |
| max_step | number of training steps | 40,000 |
| warmup_step | number of learning rate warmup steps | 1,000 |
| batch_size | training batch size | 256 |
| tgt_len | number of tokens to predict during training | 192 |
| mem_len | number of tokens cached from previous iterations during training | 192 |
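
For reference, the table above can be written out as a plain Python dictionary. This is only an illustrative sketch; `base_config` is a hypothetical name, and the actual training scripts in the Deep Learning Examples repository define these values through their own command-line flags and configuration files.

```python
# Hypothetical configuration dict mirroring the base-model hyperparameters above.
base_config = {
    "n_layer": 16,        # number of layers
    "d_model": 512,       # hidden size
    "n_head": 8,          # number of attention heads
    "d_head": 64,         # size of each attention head
    "d_inner": 2048,      # inner hidden size in fully-connected layers
    "dropout": 0.1,       # dropout rate
    "dropatt": 0.0,       # dropout rate after softmax in the attention
    "lr": 0.01,           # base learning rate
    "min_lr_ratio": 0.1,  # minimum learning-rate ratio (for cosine decay)
    "max_step": 40_000,   # number of training steps
    "warmup_step": 1_000, # number of learning rate warmup steps
    "batch_size": 256,    # training batch size
    "tgt_len": 192,       # tokens to predict during training
    "mem_len": 192,       # tokens cached from previous iterations
}
```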

The Transformer-XL model addresses the limitations of vanilla transformer-based language models, which are only able to use relatively short context, bounded by the segment length. The Transformer-XL introduces a recurrence mechanism, which is able to use a cached hidden state from previous segments. During training, the context consists of a concatenation of the current segment's hidden state and cached states from previous iterations. Gradients are backpropagated only through the current segment, although the model is able to take advantage of the extra information stored in the cache and therefore is able to model long-term dependencies.

Figure: an illustration of the recurrence mechanism, taken from the Transformer-XL paper.
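
As a rough illustration of the recurrence mechanism, the sketch below shows how cached states from the previous segment can be concatenated with the current segment and excluded from backpropagation. It is a minimal sketch built on the standard tf.keras.layers.MultiHeadAttention layer rather than the repository's implementation, it omits the relative positional encoding and causal masking, and the helper names attend_with_memory and update_memory are hypothetical.

```python
import tensorflow as tf

def attend_with_memory(hidden, memory, attention_layer):
    # Extend the attention context with cached states from the previous
    # segment; gradients do not flow into the cache.
    context = tf.concat([tf.stop_gradient(memory), hidden], axis=1)
    # Queries come from the current segment only, while keys and values see
    # the extended context, so attention can reach past the segment boundary.
    return attention_layer(query=hidden, value=context, key=context)

def update_memory(hidden, memory, mem_len=192):
    # Cache the most recent `mem_len` positions for the next iteration,
    # detached from the gradient tape.
    new_memory = tf.concat([memory, hidden], axis=1)[:, -mem_len:]
    return tf.stop_gradient(new_memory)

# Shapes follow the base model: [batch, tgt_len, d_model] for the segment
# and [batch, mem_len, d_model] for the cache.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
segment = tf.random.normal([1, 192, 512])
memory = tf.zeros([1, 192, 512])
output = attend_with_memory(segment, memory, mha)
memory = update_memory(segment, memory)
```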

Training

This model was trained using the scripts available on NGC and in the NVIDIA Deep Learning Examples GitHub repository.

Dataset

The following datasets were used to train this model:

  • WikiText-103 - A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
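
WikiText-103 is distributed as plain, pre-tokenized text, so a word-level vocabulary can be built by simple whitespace splitting. The sketch below is only illustrative: it assumes a downloaded training file such as wiki.train.tokens, the helper names build_vocab and encode are hypothetical, and the repository's own data-preparation scripts should be used for actual training.

```python
from collections import Counter

def build_vocab(path):
    # Count whitespace-separated tokens; WikiText-103 is already tokenized.
    counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split() + ["<eos>"])
    # Map each token to an integer id, most frequent first.
    return {tok: i for i, (tok, _) in enumerate(counter.most_common())}

def encode(path, vocab):
    # Turn the corpus into a flat list of token ids for language modeling.
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ids.extend(vocab[tok] for tok in line.split() + ["<eos>"])
    return ids
```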

Performance

Performance numbers for this model are available on NGC.

References

  • original paper
  • NVIDIA model implementation in NGC
  • NVIDIA model implementation on GitHub

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets the model was derived from.