BERT for PyTorch | NVIDIA NGC

NVIDIA Deep Learning Examples

BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.

The following sections provide more detail on the dataset, on running training and inference, and on the training results.

Scripts and sample code

Descriptions of the key scripts and folders are provided below.

data/ - Contains scripts for downloading and preparing individual datasets and will contain downloaded and processed datasets.
scripts/ - Contains shell scripts to launch data download, pre-training, and fine-tuning.
run_squad.sh - Interface for launching question answering fine-tuning with run_squad.py.
run_glue.sh - Interface for launching paraphrase detection and sentiment analysis fine-tuning with run_glue.py.
run_pretraining.sh - Interface for launching BERT pre-training with run_pretraining.py.
create_pretraining_data.py - Creates .hdf5 files from sharded text files in the final step of dataset creation.
model.py - Implements the BERT pre-training and fine-tuning model architectures with PyTorch.
optimization.py - Implements the LAMB optimizer with PyTorch.
run_squad.py - Implements fine-tuning training and evaluation for question answering on the SQuAD dataset.
run_glue.py - Implements fine-tuning training and evaluation for GLUE tasks.
run_pretraining.py - Implements BERT pre-training.

Parameters

Pre-training parameters

BERT is designed to pre-train deep bidirectional networks for language representations. The following scripts replicate pre-training on Wikipedia as described in the BERT paper. These scripts are general and can be used for pre-training language representations on any corpus of choice.

The complete list of the available parameters for the run_pretraining.py script is:

  --input_dir INPUT_DIR       - The input data directory.
                                Should contain .hdf5 files for the task.
 
  --config_file CONFIG_FILE      - Path to a json file describing the BERT model
                                configuration. This file configures the model
                                architecture, such as the number of transformer
                                blocks, number of attention heads, etc.
 
  --bert_model BERT_MODEL        - Specifies the type of BERT model to use;
                                should be one of the following:
        bert-base-uncased
        bert-large-uncased
        bert-base-cased
        bert-base-multilingual
        bert-base-chinese
 
  --output_dir OUTPUT_DIR        - Path to the output directory where the model
                                checkpoints will be written.
 
  --init_checkpoint           - Initial checkpoint to start pre-training from
                                (usually a pre-trained BERT checkpoint).
 
  --max_seq_length MAX_SEQ_LENGTH
                              - The maximum total input sequence length after
                                WordPiece tokenization. Sequences longer than
                                this will be truncated, and sequences shorter
                                than this will be padded.
 
  --max_predictions_per_seq MAX_PREDICTIONS_PER_SEQ
                              - The maximum total of masked tokens per input
                                sequence for Masked LM.
 
  --train_batch_size TRAIN_BATCH_SIZE
                              - Batch size per GPU for training.
 
  --learning_rate LEARNING_RATE
                              - The initial learning rate for the LAMB optimizer.
 
  --max_steps MAX_STEPS       - Total number of training steps to perform.
 
  --warmup_proportion WARMUP_PROPORTION
                              - Proportion of training to perform linear learning
                                rate warmup for. For example, 0.1 = 10% of training.
 
  --seed SEED                 - Sets the seed to use for random number generation.
 
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                              - Number of update steps to accumulate before
                                performing a backward/update pass.
 
  --allreduce_post_accumulation - If set to true, performs allreduce only after the defined number of gradient accumulation steps.
  
  --allreduce_post_accumulation_fp16 - If set to true, performs allreduce after gradient accumulation steps in FP16.
 
  --amp or --fp16                      - If set, performs computations using
                                automatic mixed precision.
 
  --loss_scale LOSS_SCALE        - Sets the loss scaling value to use when
                                mixed precision is used. The default value (0)
                                tells the script to use dynamic loss scaling
                                instead of fixed loss scaling.
 
  --log_freq LOG_FREQ         - If set, the script outputs the training
                                loss every LOG_FREQ steps.
 
  --resume_from_checkpoint       - If set, training resumes from a checkpoint
                                that currently exists in OUTPUT_DIR.
 
  --num_steps_per_checkpoint NUM_STEPS_PER_CHECKPOINT
                              - Number of update steps until a model checkpoint
                                is saved to disk.

  --phase2                    - If set, trains phase 2 only. If not set,
                                pre-training defaults to phase 1.

  --phase1_end_step           - The number of steps phase 1 was trained for.
                                To resume phase 2 correctly, this value should
                                match the --max_steps value used for phase 1.
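The --loss_scale behavior described above (0 selects dynamic loss scaling, any other value a fixed scale) can be illustrated with a minimal sketch. The LossScaler class, its initial scale, its growth interval, and its halve/double policy are hypothetical illustrations of a common dynamic loss scaling scheme; they are assumptions, not the logic used by run_pretraining.py.

```python
class LossScaler:
    """Toy loss scaler: fixed when loss_scale > 0, dynamic when loss_scale == 0.

    All names and constants here are illustrative assumptions.
    """

    def __init__(self, loss_scale=0.0, init_scale=2.0 ** 16, growth_interval=1000):
        self.dynamic = loss_scale == 0
        self.scale = init_scale if self.dynamic else float(loss_scale)
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        """Adjust the scale after a step; return True if the step should be applied."""
        if not self.dynamic:
            # Fixed scaling: the scale never changes; overflowing steps are skipped.
            return not found_overflow
        if found_overflow:
            # Overflow in the scaled gradients: halve the scale and skip the step.
            self.scale = max(self.scale / 2.0, 1.0)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= 2.0
            self.good_steps = 0
        return True


fixed = LossScaler(loss_scale=128)
dynamic = LossScaler(loss_scale=0, growth_interval=2)
dynamic.update(found_overflow=True)    # overflow halves the scale
dynamic.update(found_overflow=False)
dynamic.update(found_overflow=False)   # two clean steps double it again
print(fixed.scale, dynamic.scale)
```

The point of the dynamic variant is that no scale has to be chosen up front: the value shrinks when FP16 gradients overflow and grows back once training is stable.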

Fine tuning parameters

SQuAD

Default arguments are listed below in the order scripts/run_squad.sh expects:

Initial checkpoint - The default is /workspace/checkpoints/bert_uncased.pt.
Number of training Epochs - The default is 2.
Batch size - The default is 3.
Learning rate - The default is 3e-5.
Precision (either fp16, tf32 or fp32) - The default is fp16.
Number of GPUs - The default is 8.
Seed - The default is 1.
SQuAD directory - The default is /workspace/bert/data/v1.1.
Vocabulary file (token to ID mapping) - The default is /workspace/bert/vocab/vocab.
Output directory for results - The default is /results/SQuAD.
Mode (train, eval, train eval, predict) - The default is train.
Config file for the BERT model (It should be the same as the pre-trained model) - The default is /workspace/bert/bert_config.json.

The script saves the final checkpoint to the /results/SQuAD/pytorch_model.bin file.
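Because scripts/run_squad.sh takes its arguments by position, changing a later argument means supplying every argument before it. The sketch below assembles the full command from the defaults listed above and overrides only selected values. The helper build_squad_command and the key names in SQUAD_DEFAULTS are hypothetical labels introduced for this illustration; only the default values and their order come from the list above.

```python
import shlex

# Defaults copied from the list above, in the order scripts/run_squad.sh expects.
# The key names are illustrative labels, not names used by the script.
SQUAD_DEFAULTS = [
    ("init_checkpoint", "/workspace/checkpoints/bert_uncased.pt"),
    ("epochs", "2"),
    ("batch_size", "3"),
    ("learning_rate", "3e-5"),
    ("precision", "fp16"),
    ("num_gpus", "8"),
    ("seed", "1"),
    ("squad_dir", "/workspace/bert/data/v1.1"),
    ("vocab_file", "/workspace/bert/vocab/vocab"),
    ("out_dir", "/results/SQuAD"),
    ("mode", "train"),
    ("config_file", "/workspace/bert/bert_config.json"),
]


def build_squad_command(**overrides):
    """Return the argv list for run_squad.sh with selected defaults overridden."""
    unknown = set(overrides) - {name for name, _ in SQUAD_DEFAULTS}
    if unknown:
        raise KeyError(f"unknown argument(s): {sorted(unknown)}")
    values = [str(overrides.get(name, default)) for name, default in SQUAD_DEFAULTS]
    return ["bash", "scripts/run_squad.sh", *values]


cmd = build_squad_command(num_gpus=4, mode="train eval")
print(shlex.join(cmd))  # shlex.join quotes the two-word mode argument
```

The resulting list can be passed to subprocess.run as is; quoting matters only when the command is pasted into a shell, because a mode such as train eval must reach the script as a single argument.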

GLUE

Default arguments are listed below in the order scripts/run_glue.sh expects:

Initial checkpoint - The default is /workspace/bert/checkpoints/bert_uncased.pt.
Data directory - The default is /workspace/bert/data/download/glue/MRPC/.
Vocabulary file (token to ID mapping) - The default is /workspace/bert/vocab/vocab.
Config file for the BERT model (It should be the same as the pre-trained model) - The default is /workspace/bert/bert_config.json.
Output directory for result - The default is /workspace/bert/results/MRPC.
The name of the GLUE task (mrpc or sst-2) - The default is mrpc.
Number of GPUs - The default is 8.
Batch size per GPU - The default is 16.
Number of update steps to accumulate before performing a backward/update pass (this option effectively normalizes the GPU memory footprint down by the same factor) - The default is 1.
Learning rate - The default is 2.4e-5.
The proportion of training samples used to warm up the learning rate - The default is 0.1.
Number of training Epochs - The default is 3.
Total number of training steps to perform - The default is -1.0, which means it is determined by the number of epochs.
Precision (either fp16, tf32 or fp32) - The default is fp16.
Seed - The default is 2.
Mode (train, eval, prediction, train eval, train prediction, eval prediction, train eval prediction) - The default is train eval.

Multi-node

Multi-node runs can be launched on a pyxis/enroot Slurm cluster (refer to Requirements) with the run.sub script. The following commands launch phase 1 and phase 2 on a 4-node DGX-1 cluster:

BATCHSIZE=2048 LR=6e-3 GRADIENT_STEPS=128 PHASE=1 sbatch -N4 --ntasks-per-node=8 run.sub
BATCHSIZE=1024 LR=4e-3 GRADIENT_STEPS=256 PHASE=2 sbatch -N4 --ntasks-per-node=8 run.sub

Checkpoints after phase 1 will be saved in the checkpointdir specified in run.sub. The checkpoint is automatically picked up to resume training in phase 2. Note that phase 2 should be run after phase 1.

The variables needed to reproduce the Training performance results are available in the configurations.yml file.

The batch variables BATCHSIZE, LR, GRADIENT_STEPS, and PHASE correspond to the Python arguments train_batch_size, learning_rate, gradient_accumulation_steps, and phase2, respectively.
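The mapping between batch variables and Python arguments can be sketched as follows. The function pretraining_args is hypothetical, and the handling of PHASE (emit the --phase2 switch only when PHASE is 2) is an assumption based on the description of --phase2 in the Parameters section; run.sub may translate these variables differently.

```python
# Mapping stated above: batch variable -> Python argument.
ENV_TO_ARG = {
    "BATCHSIZE": "--train_batch_size",
    "LR": "--learning_rate",
    "GRADIENT_STEPS": "--gradient_accumulation_steps",
}


def pretraining_args(env):
    """Translate sbatch environment variables into run_pretraining.py arguments."""
    args = []
    for var, flag in ENV_TO_ARG.items():
        args += [flag, env[var]]
    # Assumption: --phase2 is a switch, present for phase 2 and absent for phase 1.
    if env["PHASE"] == "2":
        args.append("--phase2")
    return args


phase1 = pretraining_args({"BATCHSIZE": "2048", "LR": "6e-3", "GRADIENT_STEPS": "128", "PHASE": "1"})
phase2 = pretraining_args({"BATCHSIZE": "1024", "LR": "4e-3", "GRADIENT_STEPS": "256", "PHASE": "2"})
print(phase1)
print(phase2)
```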

Note that the run.sub script is a starting point that has to be adapted depending on the environment. In particular, variables such as datadir handle the location of the files for each phase.

Refer to the file's contents to find the full list of variables to adjust for your system.

Command-line options

To view the full list of available options and their descriptions, use the -h or --help command-line option, for example:

python run_pretraining.py --help

python run_squad.py --help

python run_glue.py --help

Detailed descriptions of command-line options can be found in the Parameters section.

Getting the data

For pre-training BERT, we use the Wikipedia (2500M words) dataset. We extract only the text passages and ignore headers, lists, and tables. BERT requires that datasets be structured as a document-level corpus rather than a shuffled sentence-level corpus because it is critical to extract long contiguous sentences.

data/create_datasets_from_start.sh uses the LDDL downloader to download the Wikipedia dataset, and scripts/run_pretraining.sh uses the LDDL preprocessor and load balancer to preprocess the Wikipedia dataset into Parquet shards, which are then streamed during pre-training by the LDDL data loader. Refer to LDDL's README for more information on how to use LDDL. Depending on the speed of your internet connection, downloading and extracting the Wikipedia dataset takes a few hours, and running the LDDL preprocessor and load balancer takes half an hour on a single DGX A100 node.

For fine-tuning a pre-trained BERT model for specific tasks, by default, this repository prepares the following datasets:

SQuAD: for question answering.
MRPC: for paraphrase detection.
SST-2: for sentiment analysis.

Dataset guidelines

The procedure to prepare a text corpus for pre-training is described in the above section. This section provides additional insight into how exactly raw text is processed so that it is ready for pre-training.

First, raw text is tokenized using WordPiece tokenization. A [CLS] token is inserted at the start of every sequence, and the two sentences in the sequence are separated by a [SEP] token.

Note: BERT pre-training looks at pairs of sentences at a time. A sentence embedding token [A] is added to the first sentence and token [B] to the next.
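The sequence layout described above can be sketched in a few lines. The function build_input is hypothetical. It assumes the common BERT convention of a trailing [SEP] after the second sentence, represents the sentence A and B embeddings as segment IDs 0 and 1, and truncates from the end of the longer sentence, which is a simplification. Padding to max_seq_length follows the --max_seq_length description in the Parameters section.

```python
def build_input(tokens_a, tokens_b, max_seq_length):
    """Assemble [CLS] A [SEP] B [SEP], then pad to max_seq_length."""
    if max_seq_length < 3:
        raise ValueError("max_seq_length must leave room for [CLS] and two [SEP] tokens")
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve three positions for [CLS] and the two [SEP] tokens.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)  # 1 = real token, 0 = padding
    padding = max_seq_length - len(tokens)
    return (
        tokens + ["[PAD]"] * padding,
        segment_ids + [0] * padding,
        input_mask + [0] * padding,
    )


tokens, segment_ids, input_mask = build_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"], max_seq_length=12
)
print(tokens)
print(segment_ids)
print(input_mask)
```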

BERT pre-training optimizes for two unsupervised classification tasks. The first is Masked Language Modeling (Masked LM). One training instance of Masked LM is a single modified sentence. Each token in the sentence has a 15% chance of being selected for prediction. A selected token is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged the remaining 10% of the time. The task is then to predict the original token.
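The masking rule above can be sketched as follows, reading the 15% figure as the chance that a token is selected for prediction. The function mask_tokens and the toy vocabulary are hypothetical. The sketch selects tokens independently, never selects [CLS] or [SEP], and does not apply the --max_predictions_per_seq cap, so it is a simplified illustration rather than the repository's data pipeline.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked tokens, labels); labels hold the original token at selected positions."""
    output = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= mask_prob:
            continue  # not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return output, labels


rng = random.Random(0)
vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Keeping some selected tokens unchanged, or swapping in a random one, means the model cannot rely on [MASK] appearing at every position it is asked to predict.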

The second task is next sentence prediction. One training instance of BERT pre-training is two sentences (a sentence pair). A sentence pair is constructed, with equal probability, either by taking two adjacent sentences from a single document or by pairing up two random sentences. The goal of this task is to predict whether or not the second sentence followed the first in the original document.
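A minimal sketch of the sentence-pair construction described above. The function make_sentence_pair and the toy documents are hypothetical. Drawing the random second sentence from a different document, so that it cannot be the true next sentence, is an assumption of this sketch; it also assumes at least two documents with at least two sentences each.

```python
import random


def make_sentence_pair(documents, rng):
    """Return (first, second, is_next) for one next-sentence-prediction instance."""
    doc_index = rng.randrange(len(documents))
    document = documents[doc_index]
    i = rng.randrange(len(document) - 1)  # leave room for a following sentence
    first = document[i]
    if rng.random() < 0.5:
        # Positive example: the sentence that actually follows in the document.
        return first, document[i + 1], True
    # Negative example: a random sentence from another document.
    other_docs = [d for j, d in enumerate(documents) if j != doc_index]
    return first, rng.choice(rng.choice(other_docs)), False


documents = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3", "c4"],
]
rng = random.Random(0)
print(make_sentence_pair(documents, rng))
```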

Training process

The training process consists of two steps: pre-training and fine-tuning.

Pre-training

Pre-training is performed using the run_pretraining.py script along with parameters defined in the scripts/run_pretraining.sh.

The run_pretraining.sh script runs a single-node job that trains the BERT-large model from scratch on the Wikipedia dataset with the LAMB optimizer. By default, the training script runs two phases of training with a hyperparameter recipe specific to 8x V100 32G cards:

Phase 1: (Maximum sequence length of 128)

Runs on 8 GPUs with a training batch size of 64 per GPU
Uses a learning rate of 6e-3
Has FP16 precision enabled
Runs for 7038 steps, where the first 28.43% (2000) are warm-up steps
Saves a checkpoint every 200 iterations (keeps only the latest three checkpoints) and at the end of training. All checkpoints and training logs are saved to the /results directory (in the container which can be mounted to a local directory).
Creates a log file containing all the output
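To make the checkpoint schedule concrete, the sketch below lists which steps would still have a checkpoint on disk at the end of phase 1. The function retained_checkpoints is hypothetical, and it assumes the end-of-training checkpoint counts toward the same three-checkpoint limit as the periodic ones; the script's actual retention logic may differ.

```python
from collections import deque


def retained_checkpoints(total_steps, save_every, keep_last=3):
    """Return the steps whose checkpoints remain after training finishes."""
    kept = deque(maxlen=keep_last)  # older entries fall off automatically
    for step in range(save_every, total_steps + 1, save_every):
        kept.append(step)
    if not kept or kept[-1] != total_steps:
        kept.append(total_steps)  # checkpoint saved at the end of training
    return list(kept)


print(retained_checkpoints(7038, 200))  # [6800, 7000, 7038]
```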

Phase 2: (Maximum sequence length of 512)

Runs on 8 GPUs with a training batch size of 8 per GPU
Uses a learning rate of 4e-3
Has FP16 precision enabled
Runs for 1563 steps, where the first 12.8% are warm-up steps
Saves a checkpoint every 200 iterations (keeps only the latest three checkpoints) and at the end of training. All checkpoints and training logs are saved to the /results directory (in the container which can be mounted to a local directory).
Creates a log file containing all the output

These parameters will train on the Wikipedia dataset to state-of-the-art accuracy on a DGX-1 with 32GB V100 cards.
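The warm-up figures quoted for the two phases follow from the step counts and warm-up proportions. The sketch below reproduces that arithmetic and shows a linear ramp during warm-up. The helper names are hypothetical, and the schedule after warm-up is not modeled here: the sketch simply holds the base learning rate, which is an assumption and not the behavior of the training script.

```python
def warmup_steps(total_steps, warmup_proportion):
    """Number of warm-up steps, truncated to an integer."""
    return int(total_steps * warmup_proportion)


def warmup_lr(step, base_lr, n_warmup):
    """Learning rate during linear warm-up; held at base_lr afterwards (decay not modeled)."""
    if n_warmup <= 0:
        return base_lr
    return base_lr * min(1.0, step / n_warmup)


phase1 = warmup_steps(7038, 0.2843)  # 2000 warm-up steps
phase2 = warmup_steps(1563, 0.128)   # 200 warm-up steps
print(phase1, phase2)
print(warmup_lr(1000, 6e-3, phase1))  # halfway through phase 1 warm-up
```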

bash scripts/run_pretraining.sh <training_batch_size> <learning_rate> <precision> <num_gpus> <warmup_proportion> <training_steps> <save_checkpoint_steps> <resume_training> <create_logfile> <accumulate_gradients> <gradient_accumulation_steps> <seed> <job_name> <allreduce_post_accumulation> <allreduce_post_accumulation_fp16> <accumulate_into_fp16> <training_batch_size_phase2> <learning_rate_phase2> <warmup_proportion_phase2> <training_steps_phase2> <gradient_accumulation_steps_phase2>

Where:

<training_batch_size> is the per-GPU batch size used for training. Larger batch sizes run more efficiently but require more memory.
<learning_rate> is the base learning rate for training.
<precision> is the type of math in your model, which can be either fp32 or fp16. The options mean:
- FP32: 32-bit IEEE single precision floats.
- FP16: Mixed precision 16 and 32-bit floats.
<num_gpus> is the number of GPUs to use for training. Must be equal to or smaller than the number of GPUs attached to your node.
<warmup_proportion> is the proportion of training steps used for warm-up at the start of training.
<training_steps> is the total number of training steps.
<save_checkpoint_steps> controls how often checkpoints are saved.
<resume_training> if set to true, training resumes from the latest model in /results/checkpoints. The default is false.
<create_logfile> a flag indicating whether output should be written to a log file (acceptable values are true or false; true indicates that output should be saved to a log file).
<accumulate_gradients> a flag indicating whether a larger batch should be simulated with gradient accumulation.
<gradient_accumulation_steps> an integer indicating the number of steps to accumulate gradients over. Effective batch size = training_batch_size / gradient_accumulation_steps.
<seed> is the random seed for the run.
<allreduce_post_accumulation> if set to true, performs allreduce only after the defined number of gradient accumulation steps.
<allreduce_post_accumulation_fp16> if set to true, performs allreduce after gradient accumulation steps in FP16.

Note: The above two options need to be set to false when running in either TF32 or FP32.

<training_batch_size_phase2> is the per-GPU batch size used for training in phase 2. Larger batch sizes run more efficiently but require more memory.
<learning_rate_phase2> is the base learning rate for training in phase 2.
<warmup_proportion_phase2> is the proportion of training steps used for warm-up at the start of phase 2.
<training_steps_phase2> is the total number of training steps for phase 2, run in addition to phase 1.
<gradient_accumulation_steps_phase2> an integer indicating the number of steps to accumulate gradients over in phase 2. Effective batch size = training_batch_size_phase2 / gradient_accumulation_steps_phase2.
<init_checkpoint> a checkpoint to start the pre-training routine from (usually a pre-trained BERT checkpoint).

For example:

bash scripts/run_pretraining.sh

Trains BERT-large from scratch on a DGX-1 32G using FP16 arithmetic. 90% of the training steps are done with sequence length 128 (phase 1 of training), and 10% of the training steps are done with sequence length 512 (phase 2 of training).

To train on a DGX-1 16G, set gradient_accumulation_steps to 512 and gradient_accumulation_steps_phase2 to 1024 in scripts/run_pretraining.sh.

To train on a DGX-2 32G, set train_batch_size to 4096, train_batch_size_phase2 to 2048, num_gpus to 16, gradient_accumulation_steps to 64 and gradient_accumulation_steps_phase2 to 256 in scripts/run_pretraining.sh.
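The DGX-2 settings above can be checked against the formula Effective batch size = training_batch_size / gradient_accumulation_steps. The sketch reads the result as the per-GPU batch handled in each forward/backward pass, which is an interpretation rather than something the text states outright; the function per_pass_batch is hypothetical.

```python
def per_pass_batch(train_batch_size, gradient_accumulation_steps):
    """Batch size per forward/backward pass when gradients are accumulated."""
    if train_batch_size % gradient_accumulation_steps:
        raise ValueError("train_batch_size must be divisible by gradient_accumulation_steps")
    return train_batch_size // gradient_accumulation_steps


# DGX-2 32G settings quoted above: (train_batch_size, gradient_accumulation_steps).
dgx2 = {"phase 1": (4096, 64), "phase 2": (2048, 256)}
for phase, (batch, steps) in dgx2.items():
    print(phase, per_pass_batch(batch, steps))  # 64 for phase 1, 8 for phase 2
```

Under that reading, the results match the per-GPU batch sizes of 64 and 8 listed for phase 1 and phase 2. Raising gradient_accumulation_steps shrinks the batch held in memory at any one time while leaving the batch size per optimizer update unchanged.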

In order to run a pre-training routine on an initial checkpoint, perform the following in scripts/run_pretraining.sh:

Point the init_checkpoint variable to the location of the checkpoint.
Set resume_training to true.

Note: The parameter value assigned to BERT_CONFIG during training should remain unchanged. Also, to resume pre-training on your corpus of choice, the training dataset should be created using the same vocabulary file used in data/create_datasets_from_start.sh.

Fine-tuning

Fine-tuning is provided for a variety of tasks. The following tasks are included with this repository through the following scripts:

Question Answering (scripts/run_squad.sh)
Paraphrase Detection and Sentiment Analysis (scripts/run_glue.sh)

By default, each Python script implements fine-tuning a pre-trained BERT model for a specified number of training epochs as well as evaluation of the fine-tuned model. Each shell script invokes the associated Python script with the following default parameters:

Uses 8 GPUs
Has FP16 precision enabled
Saves a checkpoint at the end of training to the results/<dataset_name> folder

Fine-tuning Python scripts implement support for mixed precision and multi-GPU training through NVIDIA's APEX library. For a full list of parameters and associated explanations, refer to the Parameters section.

The fine-tuning shell scripts have positional arguments outlined below:

# For SQuAD.
bash scripts/run_squad.sh <checkpoint_to_load> <epochs> <batch_size per GPU> <learning rate> <precision (either `fp16` or `fp32`)> <number of GPUs to use> <seed> <SQuAD_DATA_DIR> <VOCAB_FILE> <OUTPUT_DIR> <mode (either `train`, `eval` or `train eval`)> <CONFIG_FILE>
# For GLUE
bash scripts/run_glue.sh <checkpoint_to_load> <data_directory> <vocab_file> <config_file> <out_dir> <task_name> <number of GPUs to use> <batch size per GPU> <gradient_accumulation steps> <learning_rate> <warmup_proportion> <epochs> <precision (either `fp16` or `fp32` or `tf32`)> <seed> <mode (either `train`, `eval`, `prediction`, `train eval`, `train prediction`, `eval prediction` or `train eval prediction`)>

By default, the mode positional argument is set to train eval. Refer to the Quick Start Guide for explanations of each positional argument.

Note: The first positional argument (the path to the checkpoint to load) is required.

Each fine-tuning script assumes that the corresponding dataset files exist in the data/ directory. Alternatively, a separate path can be passed to run_squad.sh as a command-line argument.

Inference process

Fine-tuning inference can be run in order to obtain predictions on fine-tuning tasks, for example, Q&A on SQuAD.

Fine-tuning inference

Fine-tuning evaluation is enabled by the same scripts as training:

Question Answering (scripts/run_squad.sh)
Paraphrase Detection and Sentiment Analysis (scripts/run_glue.sh)

The mode positional argument of the shell script is used to run in evaluation mode. The fine-tuned BERT model will be run on the evaluation dataset, and the evaluation loss and accuracy will be displayed.

Each inference shell script expects dataset files to exist in the same locations as the corresponding training scripts. The inference scripts can be run with default settings. By setting the mode variable in the script to either eval or prediction, you can choose between generating predictions and evaluating them on a given dataset, or only obtaining the model predictions.

bash scripts/run_squad.sh <path to fine-tuned model checkpoint>
bash scripts/run_glue.sh <path to fine-tuned model checkpoint>

For SQuAD, to run inference interactively on question-context pairs, use the script inference.py as follows:

python inference.py --bert_model "bert-large-uncased" --init_checkpoint=<fine_tuned_checkpoint> --config_file="bert_config.json" --vocab_file=<path to vocab file> --question="What food does Harry like?" --context="My name is Harry and I grew up in Canada. I love apples."

Deploying BERT using NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. More information on how to perform inference using NVIDIA Triton Inference Server can be found in triton/README.md.