Transformer-XL Base TensorFlow checkpoint (AMP, Base)
Description: Transformer-XL Base TensorFlow checkpoint trained with AMP
Publisher: NVIDIA Deep Learning Examples
Latest Version: 20.06.1
Modified: April 4, 2023
Size: 2.97 GB

Model Overview

Transformer-XL is a transformer-based language model with a segment-level recurrence and a novel relative positional encoding.

Model Architecture

The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:

| Hyperparameter | Description | Original setting for the base model | Our modification to the base model |
|----------------|-------------|-------------------------------------|------------------------------------|
| d_model | hidden size | 410 | 512 |
| n_head | number of attention heads | 10 | 8 |
| d_head | size of each attention head | 41 | 64 |
| d_inner | hidden size in fully-connected layers | 2100 | 2048 |
| tgt_len | number of tokens to predict during training | 150 | 192 |
| mem_len | number of tokens cached from previous iterations during training | 150 | 192 |

The changes described above were made to align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and, therefore, higher training throughput.

The following table lists the hyperparameters of the base Transformer-XL model for the WikiText-103 dataset available in this repository.

| Hyperparameter | Description | Base model |
|----------------|-------------|------------|
| n_layer | number of layers | 16 |
| d_model | hidden size | 512 |
| n_head | number of attention heads | 8 |
| d_head | size of each attention head | 64 |
| d_inner | inner hidden size in fully-connected layers | 2048 |
| dropout | dropout rate | 0.1 |
| dropatt | dropout rate after softmax in the attention | 0.0 |
| lr | base learning rate | 0.01 |
| min_lr_ratio | minimum learning-rate ratio (for cosine decay) | 0.1 |
| max_step | number of training steps | 40,000 |
| warmup_step | number of learning rate warmup steps | 1,000 |
| batch_size | training batch size | 256 |
| tgt_len | number of tokens to predict during training | 192 |
| mem_len | number of tokens cached from previous iterations during training | 192 |
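
For reference, the table above can be written out as a plain Python dictionary. This is only an illustrative sketch; `base_config` is a hypothetical name, and the actual training scripts in the Deep Learning Examples repository define these values through their own command-line flags and configuration files.

```python
# Hypothetical configuration dict mirroring the base-model hyperparameters above.
base_config = {
    "n_layer": 16,        # number of layers
    "d_model": 512,       # hidden size
    "n_head": 8,          # number of attention heads
    "d_head": 64,         # size of each attention head
    "d_inner": 2048,      # inner hidden size in fully-connected layers
    "dropout": 0.1,       # dropout rate
    "dropatt": 0.0,       # dropout rate after softmax in the attention
    "lr": 0.01,           # base learning rate
    "min_lr_ratio": 0.1,  # minimum learning-rate ratio (for cosine decay)
    "max_step": 40_000,   # number of training steps
    "warmup_step": 1_000, # number of learning rate warmup steps
    "batch_size": 256,    # training batch size
    "tgt_len": 192,       # tokens to predict during training
    "mem_len": 192,       # tokens cached from previous iterations
}
```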

The Transformer-XL model addresses the limitations of vanilla transformer-based language models, which are only able to use relatively short context, bounded by the segment length. The Transformer-XL introduces a recurrence mechanism, which is able to use a cached hidden state from previous segments. During training, the context consists of a concatenation of the current segment's hidden state and cached states from previous iterations. Gradients are backpropagated only through the current segment, although the model is able to take advantage of the extra information stored in the cache and therefore is able to model long-term dependencies.

Figure: an illustration of the recurrence mechanism, taken from the Transformer-XL paper.
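
As a rough illustration of the recurrence mechanism, the sketch below shows how cached states from the previous segment can be concatenated with the current segment and excluded from backpropagation. It is a minimal sketch built on the standard tf.keras.layers.MultiHeadAttention layer rather than the repository's implementation, it omits the relative positional encoding and causal masking, and the helper names attend_with_memory and update_memory are hypothetical.

```python
import tensorflow as tf

def attend_with_memory(hidden, memory, attention_layer):
    # Extend the attention context with cached states from the previous
    # segment; gradients do not flow into the cache.
    context = tf.concat([tf.stop_gradient(memory), hidden], axis=1)
    # Queries come from the current segment only, while keys and values see
    # the extended context, so attention can reach past the segment boundary.
    return attention_layer(query=hidden, value=context, key=context)

def update_memory(hidden, memory, mem_len=192):
    # Cache the most recent `mem_len` positions for the next iteration,
    # detached from the gradient tape.
    new_memory = tf.concat([memory, hidden], axis=1)[:, -mem_len:]
    return tf.stop_gradient(new_memory)

# Shapes follow the base model: [batch, tgt_len, d_model] for the segment
# and [batch, mem_len, d_model] for the cache.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
segment = tf.random.normal([1, 192, 512])
memory = tf.zeros([1, 192, 512])
output = attend_with_memory(segment, memory, mha)
memory = update_memory(segment, memory)
```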

Training

This model was trained using the scripts available on NGC and in the NVIDIA Deep Learning Examples GitHub repository.

Dataset

The following datasets were used to train this model:

  • WikiText-103 - A collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
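
WikiText-103 is distributed as plain, pre-tokenized text, so a word-level vocabulary can be built by simple whitespace splitting. The sketch below is only illustrative: it assumes a downloaded training file such as wiki.train.tokens, the helper names build_vocab and encode are hypothetical, and the repository's own data-preparation scripts should be used for actual training.

```python
from collections import Counter

def build_vocab(path):
    # Count whitespace-separated tokens; WikiText-103 is already tokenized.
    counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counter.update(line.split() + ["<eos>"])
    # Map each token to an integer id, most frequent first.
    return {tok: i for i, (tok, _) in enumerate(counter.most_common())}

def encode(path, vocab):
    # Turn the corpus into a flat list of token ids for language modeling.
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ids.extend(vocab[tok] for tok in line.split() + ["<eos>"])
    return ids
```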

Performance

Performance numbers for this model are available on NGC.

References

  • original paper
  • NVIDIA model implementation in NGC
  • NVIDIA model implementation on GitHub

License

This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets the model was derived from.