
Transformer-XL Base TensorFlow checkpoint (AMP, Base)


Description

Transformer-XL Base TensorFlow checkpoint trained with AMP

Publisher

NVIDIA Deep Learning Examples

Latest Version

20.06.1_amp

Modified

April 4, 2023

Size

2.97 GB

Model Overview

Transformer-XL is a transformer-based language model with a segment-level recurrence and a novel relative positional encoding.

Model Architecture

The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:

| Hyperparameter | Description | Original setting for the base model | Our modification to the base model |
|---|---|---|---|
| d_model | hidden size | 410 | 512 |
| n_head | number of attention heads | 10 | 8 |
| d_head | size of each attention head | 41 | 64 |
| d_inner | hidden size in fully-connected layers | 2100 | 2048 |
| tgt_len | number of tokens to predict during training | 150 | 192 |
| mem_len | number of tokens cached from previous iterations during training | 150 | 192 |

The changes described above were made to align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and therefore higher training throughput.

The following table lists the hyperparameters of the base Transformer-XL model for the WikiText-103 dataset available in this repository.

| Hyperparameter | Description | Base model |
|---|---|---|
| n_layer | number of layers | 16 |
| d_model | hidden size | 512 |
| n_head | number of attention heads | 8 |
| d_head | size of each attention head | 64 |
| d_inner | inner hidden size in fully-connected layers | 2048 |
| dropout | dropout | 0.1 |
| dropatt | dropout after softmax in the attention | 0.0 |
| lr | base learning rate | 0.01 |
| min_lr_ratio | minimum learning rate ratio (for cosine decay) | 0.1 |
| max_step | number of training steps | 40,000 |
| warmup_step | number of learning rate warmup steps | 1,000 |
| batch_size | training batch size | 256 |
| tgt_len | number of tokens to predict during training | 192 |
| mem_len | number of tokens cached from previous iterations during training | 192 |
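
For convenience, the same values can be collected in a plain Python dictionary. This is only an illustrative sketch: the key names mirror the hyperparameter names in the table above and are not necessarily the exact flag names used by the training scripts.

```python
# Illustrative consolidation of the table above; key names are assumptions,
# not the literal flag names of the training scripts.
base_config = {
    "n_layer": 16,        # number of layers
    "d_model": 512,       # hidden size
    "n_head": 8,          # number of attention heads
    "d_head": 64,         # size of each attention head
    "d_inner": 2048,      # inner hidden size in fully-connected layers
    "dropout": 0.1,       # dropout
    "dropatt": 0.0,       # dropout after softmax in the attention
    "lr": 0.01,           # base learning rate
    "min_lr_ratio": 0.1,  # minimum learning rate ratio (for cosine decay)
    "max_step": 40_000,   # number of training steps
    "warmup_step": 1_000, # number of learning rate warmup steps
    "batch_size": 256,    # training batch size
    "tgt_len": 192,       # tokens to predict during training
    "mem_len": 192,       # tokens cached from previous iterations
}

# Consistency check for the power-of-two alignment discussed above:
# the hidden size equals the number of heads times the per-head size.
assert base_config["d_model"] == base_config["n_head"] * base_config["d_head"]
```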

The Transformer-XL model addresses the limitations of vanilla transformer-based language models, which are only able to use relatively short context, bounded by the segment length. The Transformer-XL introduces a recurrence mechanism, which is able to use a cached hidden state from previous segments. During training, the context consists of a concatenation of the current segment's hidden state and cached states from previous iterations. Gradients are backpropagated only through the current segment, although the model is able to take advantage of the extra information stored in the cache and therefore is able to model long-term dependencies.

An illustration of the recurrence mechanism, taken from the Transformer-XL paper, is shown below.
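
The caching behaviour can be sketched in a few lines of TensorFlow. The snippet below is a simplified, hypothetical illustration, not the code of the actual model: a standard Keras attention layer stands in for the relative multi-head attention used by Transformer-XL, and the function name is invented for this example.

```python
import tensorflow as tf

def attend_with_memory(hidden, memory, attention_layer, mem_len=192):
    """Simplified sketch of segment-level recurrence (illustrative only).

    hidden:  [batch, tgt_len, d_model] hidden states of the current segment.
    memory:  [batch, mem_len, d_model] states cached from previous segments.
    attention_layer: stand-in for one relative multi-head attention block.
    """
    # The cache extends the context but receives no gradients.
    context = tf.concat([tf.stop_gradient(memory), hidden], axis=1)

    # Queries come from the current segment only; keys and values see the
    # concatenated context, so the segment can attend to cached states.
    output = attention_layer(query=hidden, value=context, key=context)

    # The new cache keeps the last `mem_len` states of the extended context.
    new_memory = context[:, -mem_len:, :]
    return output, new_memory

# Example with a standard Keras layer as a placeholder (not the real model):
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
segment = tf.random.normal([4, 192, 512])  # tgt_len=192, d_model=512
cache = tf.zeros([4, 192, 512])            # empty cache before the first segment
out, cache = attend_with_memory(segment, cache, mha)
```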

Training

This model was trained using the training script available on NGC and in the NVIDIA Deep Learning Examples GitHub repository.
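
Once downloaded and extracted, the checkpoint can be inspected with standard TensorFlow utilities. The path below is a placeholder for the location of the extracted checkpoint prefix.

```python
import tensorflow as tf

# Placeholder path: point this at the checkpoint prefix extracted from the
# archive downloaded from NGC.
ckpt_path = "path/to/transformer-xl-base-checkpoint"

reader = tf.train.load_checkpoint(ckpt_path)
shapes = reader.get_variable_to_shape_map()

# List variable names and shapes, plus a rough total parameter count
# (counts every variable stored in the checkpoint, including optimizer state).
total = 0
for name, shape in sorted(shapes.items()):
    print(name, shape)
    count = 1
    for dim in shape:
        count *= dim
    total += count
print(f"total elements across checkpoint variables: {total:,}")
```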

Dataset

The following datasets were used to train this model:

  • WikiText-103 - A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

Performance

Performance numbers for this model are available on NGC.

References

  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." arXiv:1901.02860, 2019.

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets from which the model was derived.