This is a checkpoint for the BERT Large model trained in NeMo on the uncased English Wikipedia and BookCorpus datasets with a sequence length of 512. It was trained with Apex/Amp optimization level O1.
The model is trained for 2285714 iterations on a DGX1 with 8 V100 GPUs.
The model achieves EM/F1 of 85.79/92.28 on SQuADv1.1 and 80.17/83.32 on SQuADv2.0. On the MRPC task of the GLUE benchmark, the model achieves accuracy/F1 of 88.73/91.96.
Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.
BERT, or Bidirectional Encoder Representations from Transformers, is a method for pre-training language representations that obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE Benchmark and the SQuAD Question Answering dataset. This model is based on the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" and on the Hugging Face implementation, and it leverages mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.
BERT uses the same architecture as the encoder half of the Transformer. Input sequences are projected into an embedding space before being fed into the encoder structure. Additionally, positional and segment encodings are added to the embeddings to preserve position and segment information. The encoder structure is simply a stack of 24 Transformer blocks, each of which consists of a multi-head attention layer followed by successive stages of feed-forward networks and layer normalization. The multi-head attention layer performs self-attention over multiple input representations. The total number of parameters is 330M.
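The parameter count can be sanity-checked with a short calculation. The sketch below is only illustrative: the 24 layers and hidden size of 1024 come from the uncased_L-24_H-1024_A-16 configuration named in this document, while the vocabulary size of 30522, the 2 segment types, the 4x feed-forward expansion and the pooler layer are assumptions based on the standard uncased BERT Large setup, not values read from bert-config.json. Under these assumptions the count comes to about 335M, in line with the rounded figure quoted above.

```python
def bert_param_count(num_layers=24, hidden=1024, vocab_size=30522,
                     max_positions=512, type_vocab=2, ffn_mult=4):
    """Count weights and biases of a BERT encoder plus pooler (no task heads).

    vocab_size, type_vocab, ffn_mult and the pooler are assumptions taken
    from the standard uncased BERT Large setup.
    """
    layer_norm = 2 * hidden  # scale and bias vectors
    # Word, position and segment embeddings, followed by one layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn_mult * hidden -> hidden.
    inner = ffn_mult * hidden
    ffn = (hidden * inner + inner) + (inner * hidden + hidden)
    # One layer norm after attention and one after the feed-forward network.
    per_layer = attention + ffn + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


total = bert_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Almost all of the weights sit in the 24 encoder blocks; the embedding tables account for roughly a tenth of the total.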
The model is trained with a masked language model loss, optionally combined with a next sentence prediction loss.
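As an illustration of the masked language model objective, the sketch below follows the masking scheme described in the BERT paper: about 15% of the tokens are selected as prediction targets, and of those 80% are replaced by [MASK], 10% by a random token and 10% are left unchanged. The function name mask_tokens, its parameters and the toy vocabulary are hypothetical; this is not the preprocessing code used to build this checkpoint's dataset, only a minimal sketch of the idea. The cap of 80 predictions mirrors the --max_predictions_per_seq value used in this document.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, max_predictions=80, seed=0):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = random.Random(seed)
    # Special tokens are never chosen as prediction targets.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_mask = min(max_predictions, max(1, round(len(candidates) * mask_prob)))

    masked = list(tokens)
    labels = {}
    for i in sorted(candidates[:num_to_mask]):
        labels[i] = tokens[i]  # the loss is computed only at these positions
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


tokens = ["[CLS]"] + [f"w{i}" for i in range(100)] + ["[SEP]"]
masked, labels = mask_tokens(tokens, vocab=[f"w{i}" for i in range(100)])
print(len(labels), "positions selected for prediction")
```

The model is then trained to recover the original token at each selected position, which is why the labels are kept separately from the masked input.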
Source code and developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html. Code to pretrain and reproduce this model checkpoint is available at https://github.com/NVIDIA/NeMo.
This model checkpoint can be used either for finetuning BERT on your custom dataset or for finetuning on downstream tasks, including GLUE benchmark tasks, question answering tasks (e.g. SQuAD), joint intent and slot detection, punctuation and capitalization, named entity recognition, and speech recognition postprocessing to correct mistakes. Scripts for all of these tasks can be found at https://github.com/NVIDIA/NeMo.
The following examples show how to pretrain BERT and how to finetune it on two downstream tasks, GLUE MRPC and SQuAD.
To prepare the pretraining data, modify https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh to run
python bertPrep.py with --max_seq_length 512 --max_predictions_per_seq 80 --vocab_file /path_to_vocab_dir/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1 --n_processes 24 --fraction_test_set 0.1 --n_test_shards 1472 --n_training_shards 1472 --input_dir shard_1472_test_split_10
Run BERT Large pretraining with sequence length 512 on a DGX1 with 8 V100 GPUs:
cd examples/nlp/language_modeling;
python -m torch.distributed.launch --nproc_per_node=8 bert_pretraining.py --config_file bert-config.json --train_data $train_data --eval_data $eval_data --num_gpus 8 --batch_size 8 --amp_opt_level "O1" --lr_policy SquareRootAnnealing --beta1 0.9 --beta2 0.999 --lr_warmup_proportion 0.01 --optimizer adam_w --weight_decay 0.01 --lr 0.4375e-4 [--only_mlm_loss] data_preprocessed --max_predictions_per_seq 80 --num_iters 2285714
Checkpoints will be stored in the folder given by args.work_dir.
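To get a feel for the scale of the pretraining run, the following sketch works out a few derived quantities from the flags in the command above. It assumes that --batch_size is the per-GPU batch size and that the warmup length is the warmup proportion times the total number of iterations; both are assumptions about how the script interprets these flags, so treat the numbers as estimates.

```python
# Values copied from the pretraining command in this document.
num_gpus = 8
batch_size_per_gpu = 8      # assumption: --batch_size is per GPU
num_iters = 2285714
warmup_proportion = 0.01
seq_len = 512

# Sequences processed per optimizer step across all GPUs.
global_batch = num_gpus * batch_size_per_gpu            # 64
# Total sequences seen over the whole run.
sequences_seen = global_batch * num_iters               # 146,285,696
# Upper bound on tokens seen, if every sequence were full length.
max_tokens_seen = sequences_seen * seq_len
# Assumption: warmup covers the first 1% of iterations.
warmup_steps = int(num_iters * warmup_proportion)       # 22857

print(f"global batch size: {global_batch}")
print(f"sequences seen:    {sequences_seen:,}")
print(f"max tokens seen:   {max_tokens_seen:,}")
print(f"warmup steps:      {warmup_steps}")
```

If you change the number of GPUs or the per-GPU batch size, the global batch changes with it, and the learning rate and iteration count may need to be adjusted to match.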
To finetune on GLUE MRPC, download BERT-STEP-2285714.pt and bert-config.json, then run:
cd examples/nlp/glue_benchmark;
python glue_benchmark_with_bert.py --data_dir $mrpc_dataset --task_name mrpc --pretrained_bert_model bert-large-uncased --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --lr 1e-5
To finetune on SQuAD, download BERT-STEP-2285714.pt and bert-config.json. For SQuAD v1.1, run:
cd examples/nlp/question_answering;
python -m torch.distributed.launch --nproc_per_node=8 question_answering_squad.py --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file=/path_to/squad/v1.1/train-v1.1.json --eval_file /path_to/squad/v1.1/dev-v1.1.json --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --pretrained_model_name bert-large-uncased --batch_size 3 --num_epochs 2 --lr_policy WarmupAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --no_data_cache
For SQuAD v2.0, run:
cd examples/nlp/question_answering;
python -m torch.distributed.launch --nproc_per_node=8 question_answering_squad.py --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file /path_to/squad/v2.0/train-v2.0.json --eval_file /path_to/squad/v2.0/dev-v2.0.json --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --pretrained_model_name=bert-large-uncased --batch_size 3 --num_epochs 2 --lr_policy WarmupAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --version_2_with_negative --no_data_cache
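The EM/F1 figures quoted for SQuAD are exact-match and token-level F1 scores. The sketch below shows how these two metrics are commonly computed for a single prediction, following the normalization used by the official SQuAD evaluation scripts (lowercasing, removing punctuation and English articles, collapsing whitespace). It is a simplified illustration, not the evaluation code in question_answering_squad.py, and it scores against one reference answer only, whereas the dataset-level metric takes the maximum over all references and averages over questions.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return int(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # SQuAD v2.0 convention: an empty answer only matches an empty answer.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))
print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
```

In the second call the prediction contains both reference tokens plus two extra ones, so recall is 1.0 and precision is 0.5, which is why F1 is usually higher than EM on the same predictions.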