NGC Catalog

Description

BERT Large model trained with NeMo on uncased Wikipedia and BookCorpus with a sequence length of 512.

Publisher

NVIDIA

Latest Version

Modified

April 4, 2023

Size

1.37 GB

Overview

This is a checkpoint for the BERT Large model trained in NeMo on the uncased English Wikipedia and BookCorpus datasets with a sequence length of 512. It was trained with Apex/Amp optimization level O1.

The model was trained for 2285714 iterations on a DGX1 with 8 V100 GPUs.

The model achieves EM/F1 of 85.79/92.28 on SQuADv1.1 and 80.17/83.32 on SQuADv2.0. On the GLUE benchmark MRPC task, the model achieves accuracy/F1 of 88.73/91.96.

Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.

- BERT-STEP-2285714.pt: pretrained BERT encoder weights
- BertTokenClassifier-STEP-2285714.pt: pretrained BERT masked language model head weights
- SequenceClassifier-STEP-2285714.pt: pretrained BERT next sentence prediction head weights. This is optional and not needed if you only use masked language model loss.
- bert-config.json: the config file used to initialize the BERT network architecture in NeMo
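
Before running any of the examples further down, it can help to confirm that the downloaded files are in place. The following is a minimal sketch, not part of NeMo: `check_checkpoint_dir` is a hypothetical helper, and treating the two head checkpoints as optional is an assumption of this sketch (the fine-tuning example on this page only asks for the encoder weights and the config).

```python
from pathlib import Path

# File names as listed on this page. The fine-tuning example only asks for the
# encoder weights and the config; treating the heads as optional is an assumption.
REQUIRED = ("BERT-STEP-2285714.pt", "bert-config.json")
OPTIONAL = (
    "BertTokenClassifier-STEP-2285714.pt",
    "SequenceClassifier-STEP-2285714.pt",
)


def check_checkpoint_dir(directory):
    """Return (missing_required, missing_optional) file names for `directory`."""
    root = Path(directory)
    missing_required = [name for name in REQUIRED if not (root / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (root / name).is_file()]
    return missing_required, missing_optional
```

For example, `check_checkpoint_dir("/path_to")` returns two lists; an empty first list means the files needed for fine-tuning were found.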

More Details

BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pre-training language representations that obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE Benchmark and the SQuAD Question Answering dataset. This model is based on the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" and the Hugging Face implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.

BERT uses the same architecture as the encoder half of the Transformer. Input sequences are projected into an embedding space before being fed into the encoder. Positional and segment encodings are added to the embeddings to preserve positional information and to distinguish the input segments. The encoder is a stack of 24 Transformer blocks, each consisting of a multi-head attention layer followed by successive stages of feed-forward networks and layer normalization. The multi-head attention layer performs self-attention over multiple projections of the input. The total number of parameters is 330M.
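
As a rough cross-check of the parameter count, the sketch below tallies weights and biases for an encoder of this shape. The 24 layers and hidden size 1024 follow from the `uncased_L-24_H-1024_A-16` name used on this page; the vocabulary size (30522), feed-forward size (4096), 512 positions, and 2 segment types are assumed to be the usual values for uncased BERT Large, since bert-config.json is not reproduced here. The pretraining heads are not counted.

```python
def bert_param_count(layers, hidden, intermediate, vocab, max_pos, type_vocab=2):
    """Count weights and biases of a BERT encoder, including the pooler layer."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by one layer norm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> intermediate -> hidden.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Assumed BERT Large (uncased) configuration values.
total = bert_param_count(layers=24, hidden=1024, intermediate=4096, vocab=30522, max_pos=512)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the tally comes to 335,141,888, which is close to the rounded 330M figure quoted above.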

The model is trained using masked language model loss, optionally with next sentence prediction loss.
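
To make the masked language model objective concrete, here is a minimal sketch of the masking step as described in the BERT paper: about 15% of the tokens are selected for prediction, and of those 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. This is an illustration only, not the preprocessing code used for this checkpoint; `mask_tokens` and its parameters are hypothetical names, and the cap of 80 predictions mirrors the `--max_predictions_per_seq 80` setting used in the pretraining example on this page.

```python
import random


def mask_tokens(tokens, vocab, max_predictions=80, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels is None at positions not predicted."""
    rng = random.Random(seed)
    # Never select the special tokens that delimit the segments.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = min(max_predictions, max(1, round(len(tokens) * mask_prob)))
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in candidates[:num_to_predict]:
        labels[i] = tokens[i]  # the loss is computed only at these positions
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels
```

With a full 512-token sequence, 15% corresponds to roughly 77 positions, which fits under the cap of 80.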

Documentation

Source code and the developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html. Code to pretrain and reproduce this model checkpoint is available at https://github.com/NVIDIA/NeMo.

This model checkpoint can be used either for finetuning BERT on your custom dataset or for finetuning on downstream tasks, including GLUE benchmark tasks, question answering tasks (e.g. SQuAD), joint intent and slot detection, punctuation and capitalization, named entity recognition, and speech recognition postprocessing to correct mistakes. Scripts for all of these tasks can be found at https://github.com/NVIDIA/NeMo.

In the following, we show examples of how to pretrain BERT and finetune it on two downstream tasks, GLUE MRPC and SQuAD.

Usage example 1: Pretraining BERT

Download and preprocess the uncased Wikipedia and BookCorpus datasets:
- Clone https://github.com/NVIDIA/DeepLearningExamples
- Modify https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh to run python bertPrep.py with the following arguments:

  --max_seq_length 512 --max_predictions_per_seq 80 --vocab_file /path_to_vocab_dir/uncased_L-24_H-1024_A-16/vocab.txt --do_lower_case 1 --n_processes 24 --fraction_test_set 0.1 --n_test_shards 1472 --n_training_shards 1472 --input_dir shard_1472_test_split_10

- Run the script and extract the preprocessed hdf5 files into $train_data and $eval_data

Run BERT Large pretraining with sequence length 512 on a DGX1 with 8 V100 GPUs:

cd examples/nlp/language_modeling;

python -m torch.distributed.launch --nproc_per_node=8 bert_pretraining.py --config_file bert-config.json --train_data $train_data --eval_data $eval_data --num_gpus 8 --batch_size 8 --amp_opt_level "O1" --lr_policy SquareRootAnnealing --beta1 0.9 --beta2 0.999 --lr_warmup_proportion 0.01 --optimizer adam_w --weight_decay 0.01 --lr 0.4375e-4 [--only_mlm_loss] data_preprocessed --max_predictions_per_seq 80 --num_iters 2285714

Checkpoints will be stored in the folder given by args.work_dir.
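
The flags in the pretraining command above imply a few derived quantities that are useful when adapting the run to different hardware. The arithmetic below is a sketch under two assumptions that this page does not state: that `--batch_size` is the per-GPU batch size, and that the warmup length is simply `--lr_warmup_proportion` times `--num_iters`.

```python
# Values taken from the pretraining command.
num_iters = 2285714
num_gpus = 8
batch_size = 8            # assumed to be per GPU
seq_length = 512
warmup_proportion = 0.01

# Derived quantities (valid only under the assumptions stated in the lead-in).
global_batch = num_gpus * batch_size          # sequences per optimizer step
sequences_seen = num_iters * global_batch     # sequences over the whole run
warmup_steps = int(num_iters * warmup_proportion)
max_tokens = sequences_seen * seq_length      # upper bound, padding included

print(f"global batch size: {global_batch}")
print(f"sequences seen:    {sequences_seen:,}")
print(f"warmup steps:      {warmup_steps:,}")
print(f"tokens (at most):  {max_tokens:,}")
```

If the number of GPUs changes, keeping `global_batch` constant (for example by adjusting the per-GPU batch size) is one way to stay close to the original setup.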

Usage example 2: Using the BERT checkpoint for a downstream task (GLUE benchmark task MRPC)

Download BERT-STEP-2285714.pt and bert-config.json.

cd examples/nlp/glue_benchmark;

python glue_benchmark_with_bert.py --data_dir $mrpc_dataset --task_name mrpc --pretrained_bert_model bert-large-uncased --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --lr 1e-5

Usage example 3: Using the BERT checkpoint for a downstream task (SQuAD question answering)