This repository provides a PyTorch implementation of the Transformer-XL model from the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Transformer-XL is a transformer-based language model with segment-level recurrence and a novel relative positional encoding. These enhancements help the model capture longer-term dependencies by attending to tokens from multiple previous segments.
Our implementation is based on the codebase published by the authors of the Transformer-XL paper, but it uses a modified model architecture. The modifications were made to achieve better hardware utilization and to take advantage of Tensor Cores. Similar modifications were also proposed in the implementation available from github.com/cybertronai/transformer-xl. Refer to the Model architecture section for more details.
This model is trained with mixed precision using Tensor Cores on the NVIDIA Volta and NVIDIA Ampere GPU architectures and evaluated on the Volta, Turing, and NVIDIA Ampere GPU architectures. As a result, researchers can get results up to 2.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The Transformer-XL "base" model for WikiText-103 dataset available in this repository was modified to use the following hyperparameter values:
Hyperparameter | Description | Original setting for the base model | Our modification for the base model |
---|---|---|---|
d_model | hidden size | 410 | 512 |
n_head | number of attention heads | 10 | 8 |
d_head | size of each attention head | 41 | 64 |
d_inner | hidden size in fully-connected layers | 2100 | 2048 |
tgt_len | number of tokens to predict during training | 150 | 192 |
mem_len | number of tokens cached from previous iterations during training | 150 | 192 |
The changes described above align certain hyperparameters with powers of two. With this modification, the model achieves better hardware utilization and, therefore, higher training throughput.
The Transformer-XL "large" model for WikiText-103 dataset available in this repository uses the original hyperparameters from the reference implementation.
The following table lists the hyperparameters for the base and the large Transformer-XL models for the WikiText-103 dataset available in this repository.
Hyperparameter | Description | Base model | Large model |
---|---|---|---|
n_layer | number of layers | 16 | 18 |
d_model | hidden size | 512 | 1024 |
n_head | number of attention heads | 8 | 16 |
d_head | size of each attention head | 64 | 64 |
d_inner | inner hidden size in fully-connected layers | 2048 | 4096 |
dropout | dropout | 0.1 | 0.2 |
dropatt | dropout after softmax in the attention | 0.0 | 0.2 |
lr | base learning rate | 0.01 | 0.01 |
eta_min | minimum learning rate (for cosine decay) | 0.001 | 0.0001 |
max_step | number of training steps | 40,000 | 100,000 |
warmup_step | number of learning rate warmup steps | 1,000 | 16,000 |
batch_size | training batch size | 256 | 128 |
tgt_len | number of tokens to predict during training | 192 | 384 |
mem_len | number of tokens cached from previous iterations during training | 192 | 384 |
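For illustration only, the two configurations above can be written as Python dictionaries whose keys follow the hyperparameter names from the table; this is a hedged sketch, not the configuration format actually used by the training scripts in this repository.

```python
# Hyperparameter values from the table above, expressed as plain Python dictionaries.
# Illustration only; the training scripts take these values through their own mechanism.
base_config = dict(n_layer=16, d_model=512, n_head=8, d_head=64, d_inner=2048,
                   dropout=0.1, dropatt=0.0, lr=0.01, eta_min=0.001,
                   max_step=40000, warmup_step=1000, batch_size=256,
                   tgt_len=192, mem_len=192)

large_config = dict(n_layer=18, d_model=1024, n_head=16, d_head=64, d_inner=4096,
                    dropout=0.2, dropatt=0.2, lr=0.01, eta_min=0.0001,
                    max_step=100000, warmup_step=16000, batch_size=128,
                    tgt_len=384, mem_len=384)
```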
The Transformer-XL model addresses the limitations of vanilla transformer-based language models, which are only able to use a relatively short context, bounded by the segment length. Transformer-XL introduces a recurrence mechanism, which is able to use cached hidden states from previous segments. During training, the context consists of a concatenation of the current segment's hidden states and cached states from previous iterations. Gradients are backpropagated only through the current segment, although the model is able to take advantage of the extra information stored in the cache and is therefore able to model long-term dependencies.
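To make the recurrence mechanism concrete, below is a minimal, self-contained sketch built around a toy single-layer attention model. It is not the Transformer-XL implementation from this repository, which caches hidden states for every layer (up to mem_len tokens) and uses relative positional encodings; the sketch only illustrates how cached states extend the context while being excluded from backpropagation.

```python
import torch
import torch.nn as nn

# Toy model: hidden states of the previous segment are cached as "mems", re-used as
# extra attention context for the next segment, and detached so that gradients never
# flow into previous segments.
class ToySegmentModel(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, mems=None):
        h = self.embed(tokens)                              # (batch, tgt_len, d_model)
        context = h if mems is None else torch.cat([mems, h], dim=1)
        out, _ = self.attn(h, context, context)             # queries come only from the current segment
        return self.proj(out), h.detach()                   # detached hidden states become the new mems

model = ToySegmentModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
segments = torch.randint(0, 100, (4, 1, 8))                 # 4 segments, batch size 1, 8 tokens each

mems = None
for segment in segments:
    logits, new_mems = model(segment, mems)
    loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 100),
                                       segment[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # backpropagates only through the current segment
    optimizer.step()
    mems = new_mems                                          # cache for the next iteration
```

The key detail is the detach() call: cached states extend the attention context, but gradients never flow into previous segments.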
An illustration of the recurrence mechanism taken from the Transformer-XL paper is shown below.
The following features were implemented in this model: mixed precision training with Apex AMP, using the O2 optimization level and with a dynamic loss scaling.

The following features are supported by this model:
Feature | Transformer-XL |
---|---|
Apex AMP | Yes |
PyTorch DistributedDataParallel | Yes |
LAMB | Yes |
Inference with TorchScript | Yes |
Multi-node training | Yes |
Apex AMP - a tool that enables Tensor Core-accelerated training. Refer to the Enabling mixed precision section for more details.
PyTorch DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data-parallel training.
LAMB - stands for Layerwise Adaptive Moments Based optimizer; it is a large-batch optimization technique that helps accelerate the training of deep neural networks using large minibatches.
TorchScript - a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.
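As a hedged illustration of the DistributedDataParallel feature, the sketch below wraps a placeholder model for one worker process; it assumes the process was started by a launcher such as torchrun, which sets the usual environment variables, including LOCAL_RANK.

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel

# One worker process; a launcher such as torchrun is assumed to set MASTER_ADDR,
# MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK.
torch.distributed.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda()             # placeholder for the actual model
model = DistributedDataParallel(model, device_ids=[local_rank])
```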
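For LAMB, the sketch below uses the FusedLAMB class shipped with APEX purely as an illustration; the optimizer implementation actually used by this repository may differ.

```python
import torch
from apex.optimizers import FusedLAMB   # assumes APEX was built with its fused CUDA extensions

model = torch.nn.Linear(512, 512).cuda()             # placeholder for the actual model
optimizer = FusedLAMB(model.parameters(), lr=0.01, weight_decay=0.0)
```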
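For TorchScript, the sketch below exports a small placeholder language-model module, saves it to a serialized archive, and loads it back; the same archive can be loaded from a process without a Python dependency (for example via libtorch).

```python
import torch

# Placeholder module for illustration; not the Transformer-XL model from this repository.
class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(1000, 64)
        self.head = torch.nn.Linear(64, 1000)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(tokens))

scripted = torch.jit.script(TinyLM())                    # compile the module to TorchScript
scripted.save('tiny_lm.pt')                              # serialized archive, loadable without Python source
restored = torch.jit.load('tiny_lm.pt')
print(restored(torch.randint(0, 1000, (1, 8))).shape)    # torch.Size([1, 8, 1000])
```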
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta architecture, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
The pytorch/train.py training script launches mixed precision training with Tensor Cores if the flag --fp16 is set.

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from APEX, which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a loss scaling step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the scale_loss() method provided by AMP. The scaling value can be dynamic or fixed.
For an in-depth walkthrough on AMP, check out sample usage here. APEX is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Core performance.
The following steps were needed to enable mixed precision training in Transformer-XL:

1. Import AMP from APEX:

   ```python
   from apex import amp
   ```

2. Initialize AMP and wrap the model and the optimizer:

   ```python
   model, optimizer = amp.initialize(
       model,
       optimizer,
       opt_level='O2',
   )
   ```

3. Apply the scale_loss context manager:

   ```python
   with amp.scale_loss(loss, optimizer) as scaled_loss:
       scaled_loss.backward()
   ```

4. Apply gradient clipping on the single precision master weights maintained by AMP:

   ```python
   torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.clip)
   ```
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
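As an illustration, PyTorch exposes flags that control whether TF32 math is used; depending on the PyTorch version, the matrix-multiplication flag may already default to the desired value.

```python
import torch

# Allow (or forbid) TF32 Tensor Core math; the defaults depend on the PyTorch version.
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions
```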