BERT is a method of pre-training language representations that obtains state-of-the-art results on a wide array of NLP tasks.
To train your model using mixed or TF32 precision with Tensor Cores, or using FP32, perform the following steps using the default parameters of the BERT model. Training configurations for 8 x A100 80G, 8 x V100 16G, and 16 x V100 32G cards, along with usage examples, are provided at the end of this section. For the specifics concerning training and inference, refer to the Advanced section.
- Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/LanguageModeling/BERT
- Download the NVIDIA pre-trained checkpoint.
If you want to use a pre-trained checkpoint, visit NGC. This pre-trained checkpoint is used to fine-tune on SQuAD. Unzip the downloaded file and place the checkpoint in the checkpoints/ folder. For a checkpoint already fine-tuned for QA on SQuAD v1.1, visit NGC.
Find all trained and available checkpoints in the table below:
| Model | Description |
|---|---|
| bert-large-uncased-qa | Large model fine-tuned on SQuAD v1.1 |
| bert-large-uncased-sst2 | Large model fine-tuned on GLUE SST-2 |
| bert-large-uncased-pretrained | Large model pretrained checkpoint on generic corpora like Wikipedia |
| bert-base-uncased-qa | Base model fine-tuned on SQuAD v1.1 |
| bert-base-uncased-sst2 | Base model fine-tuned on GLUE SST-2 |
| bert-base-uncased-pretrained | Base model pretrained checkpoint on generic corpora like Wikipedia |
| bert-dist-4L-288D-uncased-qa | 4-layer distilled model fine-tuned on SQuAD v1.1 |
| bert-dist-4L-288D-uncased-sst2 | 4-layer distilled model fine-tuned on GLUE SST-2 |
| bert-dist-4L-288D-uncased-pretrained | 4-layer distilled model pretrained checkpoint on generic corpora like Wikipedia |
| bert-dist-6L-768D-uncased-qa | 6-layer distilled model fine-tuned on SQuAD v1.1 |
| bert-dist-6L-768D-uncased-sst2 | 6-layer distilled model fine-tuned on GLUE SST-2 |
| bert-dist-6L-768D-uncased-pretrained | 6-layer distilled model pretrained checkpoint on generic corpora like Wikipedia |
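As a quick sanity check before fine-tuning, the sketch below verifies that a checkpoint file is actually present in the checkpoints/ folder. The helper name `find_checkpoint` and the `.pt` extension are assumptions made for illustration and are not part of the repository; adjust the pattern to match the file you downloaded from NGC.

```shell
# Hypothetical helper: print the first *.pt file found in a checkpoint
# directory, or return non-zero if none exists.
# Assumption: checkpoints use the .pt extension; change the pattern if not.
find_checkpoint() {
    ckpt_dir=${1:-"checkpoints"}
    for candidate in "$ckpt_dir"/*.pt; do
        # If the glob matched nothing, the pattern itself is returned,
        # so test that the candidate really is a file.
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "No checkpoint found in $ckpt_dir" >&2
    return 1
}
```

Run `find_checkpoint checkpoints` from the repository root; a non-zero exit status suggests that the archive was not unzipped or was placed elsewhere.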
- Build BERT on top of the NGC container.
bash scripts/docker/build.sh
- Start an interactive session in the NGC container to run training/inference.
bash scripts/docker/launch.sh
Resultant logs and checkpoints of pre-training and fine-tuning routines are stored in the results/ folder.
The data and vocab.txt are downloaded to the data/ directory by default. Refer to the Getting the data section for more details on how to process a custom corpus as required for BERT pre-training.
- Download the dataset.
This repository provides scripts to download, verify, and extract the following datasets:
- SQuAD (fine-tuning for question answering)
- MRPC (fine-tuning for paraphrase detection)
- SST-2 (fine-tuning for sentiment analysis)
- Wikipedia (pre-training)
To download, verify, and extract the datasets, run:
/workspace/bert/data/create_datasets_from_start.sh
Note: For fine-tuning only, downloading the Wikipedia dataset can be skipped by commenting it out.
Note: Ensure that the Wikipedia download is complete. If the download fails in LDDL, remove the output directory data/wikipedia/ and start over.
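The cleanup described in the note above can be sketched as a small helper. The function name `reset_wikipedia_download` is hypothetical and not part of the repository; the default path is the data/wikipedia/ directory mentioned in the note, relative to the current working directory.

```shell
# Hypothetical helper: delete a partially downloaded Wikipedia directory so
# the download can be restarted from scratch.
reset_wikipedia_download() {
    wiki_dir=${1:-"data/wikipedia"}
    if [ -d "$wiki_dir" ]; then
        rm -rf -- "$wiki_dir"
        echo "Removed $wiki_dir"
    else
        echo "Nothing to remove at $wiki_dir"
    fi
}
```

After calling the helper, re-run /workspace/bert/data/create_datasets_from_start.sh to start the download again.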
- Start pre-training.
To run pre-training on a single node with 8 x V100 32G cards, use the following script from within the container.
bash scripts/run_pretraining.sh
The default hyperparameters are set to run on 8 x V100 32G cards.
To run on multiple nodes, refer to the Multi-node section.
- Start fine-tuning with the SQuAD dataset.
The above pre-trained BERT representations can be fine-tuned with just one additional output layer for a state-of-the-art question answering system. Running the following script launches fine-tuning for question answering with the SQuAD dataset.
bash scripts/run_squad.sh /workspace/bert/checkpoints/<pre-trained_checkpoint>
- Start fine-tuning with the GLUE tasks.
The above pre-trained BERT representations can be fine-tuned with just one additional output layer for GLUE tasks. Running the following script launches fine-tuning for paraphrase detection with the MRPC dataset:
bash scripts/run_glue.sh /workspace/bert/checkpoints/<pre-trained_checkpoint>
- Run Knowledge Distillation (Optional).
To set up distillation for BERT, follow the steps provided here.
- Start validation/evaluation.
For both SQuAD and GLUE, run validation with bash scripts/run_squad.sh /workspace/bert/checkpoints/<pre-trained_checkpoint> or bash scripts/run_glue.sh /workspace/bert/checkpoints/<pre-trained_checkpoint> after setting mode to eval in scripts/run_squad.sh or scripts/run_glue.sh as follows:
mode=${11:-"eval"}
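The `${11:-"eval"}` syntax is standard shell parameter expansion: the variable takes the value of the eleventh positional argument, or the quoted default when that argument is unset or empty. The minimal sketch below demonstrates the behavior with a hypothetical function, `show_mode`, which is not part of the repository.

```shell
# Hypothetical stand-in for the argument handling in a launch script.
show_mode() {
    # Use the 11th positional argument, or "eval" if it is unset or empty.
    mode=${11:-"eval"}
    echo "mode=$mode"
}

show_mode                                   # prints: mode=eval
show_mode a b c d e f g h i j prediction    # prints: mode=prediction
```

Editing the default inside the script therefore changes the mode only when no eleventh argument is passed on the command line.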
- Start inference/predictions.
Run inference with bash scripts/run_squad.sh /workspace/bert/checkpoints/<pre-trained_checkpoint> after setting mode to prediction in scripts/run_squad.sh or scripts/run_glue.sh as follows:
mode=${11:-"prediction"}
Inference predictions are saved to <OUT_DIR>/predictions.json, set in scripts/run_squad.sh or scripts/run_glue.sh as follows:
OUT_DIR=${10:-"/workspace/bert/results/SQuAD"} # For SQuAD.
# Or...
out_dir=${5:-"/workspace/bert/results/MRPC"} # For MRPC.
# Or...
out_dir=${5:-"/workspace/bert/results/SST-2"} # For SST-2.
This repository contains a number of predefined configurations to run SQuAD, GLUE, and pre-training on NVIDIA DGX-1, NVIDIA DGX-2H, or NVIDIA DGX A100 nodes in scripts/configs/squad_config.sh, scripts/configs/glue_config.sh, and scripts/configs/pretrain_config.sh. For example, to use the default DGX A100 8-GPU configuration, run:
bash scripts/run_squad.sh $(source scripts/configs/squad_config.sh && dgxa100-80g_8gpu_fp16) # For the SQuAD v1.1 dataset.
bash scripts/run_glue.sh $(source scripts/configs/glue_config.sh && mrpc_dgxa100-80g_8gpu_fp16) # For the MRPC dataset.
bash scripts/run_glue.sh $(source scripts/configs/glue_config.sh && sst-2_dgxa100-80g_8gpu_fp16) # For the SST-2 dataset.
bash scripts/run_pretraining.sh $(source scripts/configs/pretrain_config.sh && dgxa100-80g_8gpu_fp16) # For pre-training
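The commands above rely on command substitution: the config file is sourced, the named function prints a list of values, and the shell splits that output into positional arguments for the launch script. The sketch below illustrates the mechanism only. Both `demo_8gpu_fp16` and `launch` are hypothetical, and the values are placeholders rather than the repository's settings.

```shell
# Hypothetical config function; the real ones live in scripts/configs/*.sh.
# The values below are placeholders, not the repository's settings.
demo_8gpu_fp16() {
    echo "/path/to/checkpoint 2 4 fp16"
}

# Hypothetical stand-in for a launch script that reads positional arguments.
launch() {
    echo "checkpoint=$1 epochs=$2 batch_size=$3 precision=$4"
}

# The unquoted $(...) is split on whitespace, so each word becomes one argument.
launch $(demo_8gpu_fp16)
# prints: checkpoint=/path/to/checkpoint epochs=2 batch_size=4 precision=fp16
```

Because the substitution is unquoted, any value containing spaces would be split into several arguments, so paths used with this pattern should not contain whitespace.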