Transformer-XL is a transformer-based language model with a segment-level recurrence mechanism and a novel relative positional encoding scheme.
The Transformer-XL "base" model for the WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:
Hyperparameter | Description | Original setting for the base model | Our modification for the base model |
---|---|---|---|
d_model | hidden size | 410 | 512 |
n_head | number of attention heads | 10 | 8 |
d_head | size of each attention head | 41 | 64 |
d_inner | hidden size in fully-connected layers | 2100 | 2048 |
tgt_len | number of tokens to predict during training | 150 | 192 |
mem_len | number of tokens cached from previous iterations during training | 150 | 192 |
The changes described above align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and therefore higher training throughput.
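For quick reference, the original and modified base settings can be captured as plain Python dictionaries. This is only a sketch: the key names mirror the table above and are not necessarily the exact arguments accepted by the training script.

```python
# Original vs. modified hyperparameters of the "base" model (WikiText-103).
original = {"d_model": 410, "n_head": 10, "d_head": 41,
            "d_inner": 2100, "tgt_len": 150, "mem_len": 150}
modified = {"d_model": 512, "n_head": 8, "d_head": 64,
            "d_inner": 2048, "tgt_len": 192, "mem_len": 192}

# In both configurations the per-head sizes multiply back to the hidden size,
# but only the modified values are powers of two (or multiples of 64), which
# is what improves hardware utilization and training throughput.
for cfg in (original, modified):
    assert cfg["d_model"] == cfg["n_head"] * cfg["d_head"]
```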
The Transformer-XL "large" model for WikiText-103 dataset available in this repository uses the original hyperparameters from the reference implementation.
The following table lists the hyperparameters for the large and the base Transformer-XL models for WikiText-103 dataset available in this repository.
Hyperparameter | Description | Base model | Large model |
---|---|---|---|
n_layer | number of layers | 16 | 18 |
d_model | hidden size | 512 | 1024 |
n_head | number of attention heads | 8 | 16 |
d_head | size of each attention head | 64 | 64 |
d_inner | inner hidden size in fully-connected layers | 2048 | 4096 |
dropout | dropout | 0.1 | 0.2 |
dropatt | dropout after softmax in the attention | 0.0 | 0.2 |
lr | base learning rate | 0.01 | 0.01 |
eta_min | minimum learning rate (for cosine decay) | 0.001 | 0.0001 |
max_step | number of training steps | 40,000 | 100,000 |
warmup_step | number of learning rate warmup steps | 1,000 | 16,000 |
batch_size | training batch size | 256 | 128 |
tgt_len | number of tokens to predict during training | 192 | 384 |
mem_len | number of tokens cached from previous iterations during training | 192 | 384 |
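The two configurations can also be expressed side by side as a small Python sketch. The names follow the table above; they are not necessarily the exact command-line arguments of the training script.

```python
# Base and large Transformer-XL configurations for WikiText-103, as listed above.
CONFIGS = {
    "base": dict(n_layer=16, d_model=512, n_head=8, d_head=64, d_inner=2048,
                 dropout=0.1, dropatt=0.0, lr=0.01, eta_min=0.001,
                 max_step=40_000, warmup_step=1_000, batch_size=256,
                 tgt_len=192, mem_len=192),
    "large": dict(n_layer=18, d_model=1024, n_head=16, d_head=64, d_inner=4096,
                  dropout=0.2, dropatt=0.2, lr=0.01, eta_min=0.0001,
                  max_step=100_000, warmup_step=16_000, batch_size=128,
                  tgt_len=384, mem_len=384),
}

for name, cfg in CONFIGS.items():
    # Both models keep d_model == n_head * d_head and mem_len == tgt_len.
    assert cfg["d_model"] == cfg["n_head"] * cfg["d_head"], name
    assert cfg["mem_len"] == cfg["tgt_len"], name
```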
The Transformer-XL model addresses a limitation of vanilla transformer-based language models, which can only use a relatively short context bounded by the segment length. Transformer-XL introduces a recurrence mechanism that reuses cached hidden states from previous segments. During training, the context consists of a concatenation of the current segment's hidden states and the cached states from previous iterations. Gradients are backpropagated only through the current segment, yet the model can still take advantage of the extra information stored in the cache and is therefore able to model long-term dependencies.
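The following minimal PyTorch sketch illustrates the idea behind the caching logic. The module and function names are hypothetical and do not correspond to the actual implementation in this repository; the attention and feed-forward computations are replaced by a stand-in layer.

```python
import torch
import torch.nn as nn


class ToyLayer(nn.Module):
    """Stand-in for a Transformer-XL decoder layer (attention and FFN omitted)."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, query, context):
        # A real layer attends from `query` over `context`; here we only
        # illustrate the tensor shapes and the flow of the cached states.
        return self.proj(query) + context[-query.size(0):]


def forward_segment(layers, segment_input, mems, mem_len):
    """Process one training segment with segment-level recurrence."""
    hidden, new_mems = segment_input, []
    for layer, mem in zip(layers, mems):
        # Attention context = cached states from previous iterations
        # concatenated with the current segment's hidden states.
        context = torch.cat([mem, hidden], dim=0)
        # The cache is detached, so gradients are backpropagated only through
        # the current segment, yet the extra context remains visible to the model.
        new_mems.append(context[-mem_len:].detach())
        hidden = layer(hidden, context)
    return hidden, new_mems


# Toy usage with the base model's sequence settings (tgt_len = mem_len = 192).
tgt_len, mem_len, d_model, batch = 192, 192, 512, 2
layers = [ToyLayer(d_model) for _ in range(3)]
mems = [torch.zeros(mem_len, batch, d_model) for _ in layers]
segment = torch.randn(tgt_len, batch, d_model)
hidden, mems = forward_segment(layers, segment, mems, mem_len)
```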
An illustration of the recurrence mechanism taken from the Transformer-XL paper is shown below.
This model was trained using the script available on NGC and in the GitHub repository.
The following dataset was used to train this model: WikiText-103.
Performance numbers for this model are available in NGC.
This model was trained using open-source software available in the Deep Learning Examples repository. For terms of use, please refer to the license of the script and of the datasets the model was derived from.