The GNMT v2 model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper.
The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current time step.
The same attention mechanism is also implemented in the default GNMT-like models from TensorFlow Neural Machine Translation Tutorial and NVIDIA OpenSeq2Seq Toolkit.
The following features were implemented in this model:
When comparing the BLEU score, there are various tokenization approaches and BLEU calculation methodologies; therefore, ensure you align similar metrics.
Code from this repository can be used to train a larger, 8-layer GNMT v2 model.
Our experiments show that a 4-layer model is significantly faster to train and
yields comparable accuracy on the public WMT16
English-German dataset. The
number of LSTM layers is controlled by the --num-layers
parameter in the
train.py
training script.
The following features are supported by this model.
Feature | GNMT v2 |
---|---|
Apex AMP | Yes |
Apex DistributedDataParallel | Yes |
Apex AMP - a tool that enables Tensor Core-accelerated training. Refer to the Enabling mixed precision section for more details.
Apex
DistributedDataParallel -
a module wrapper that enables easy multiprocess distributed data parallel
training, similar to
torch.nn.parallel.DistributedDataParallel.
DistributedDataParallel
is optimized for use with
NCCL. It achieves high performance by
overlapping communication with computation during backward()
and bucketing
smaller gradient transfers to reduce the total number of transfers required.
Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
For information about:
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP), library from APEX that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a loss
scaling
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using scale_loss()
method provided by AMP. The scaling
value to be used can be
dynamic or fixed.
For an in-depth walk through on AMP, check out sample usage here. APEX is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Cores performance.
The following steps were needed to enable mixed precision training in GNMT:
seq2seq/train/trainer.py
):from apex import amp
seq2seq/train/trainer.py
, class: Seq2SeqTrainer
):self.model, self.optimizer = amp.initialize(
self.model,
self.optimizer,
cast_model_outputs=torch.float16,
keep_batchnorm_fp32=False,
opt_level='O2')
scale_loss
context manager (file: seq2seq/train/fp_optimizers.py
,
class: AMPOptimizer
):with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
seq2seq/train/fp_optimizers.py
, class: AMPOptimizer
):if self.grad_clip != float('inf'):
clip_grad_norm_(amp.master_params(optimizer), self.grad_clip)
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.