BERT Large model trained with NeMo on uncased Wikipedia and BookCorpus with a sequence length of 512.
Latest Version
April 4, 2023
1.37 GB


This is a checkpoint for the BERT Large model trained in NeMo on the uncased English Wikipedia and BookCorpus datasets with a sequence length of 512. It was trained with Apex/Amp optimization level O1.

The model was trained for 2,285,714 iterations on a DGX1 with 8 V100 GPUs.

The model achieves EM/F1 of 85.79/92.28 on SQuAD v1.1 and 80.17/83.32 on SQuAD v2.0. On the GLUE benchmark MRPC task, the model achieves accuracy/F1 of 88.73/91.96.
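For readers unfamiliar with the EM/F1 numbers quoted for SQuAD, the following is a minimal sketch of how exact match and token-level F1 are commonly computed for a single prediction. It follows the general logic of the public SQuAD evaluation script (lowercasing, stripping punctuation and English articles, then comparing tokens) but is not taken from NeMo; the function names are chosen here for illustration.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match (relevant for unanswerable questions).
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower!", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

Dataset-level scores are typically the average of these per-question values, taking the maximum over all reference answers when a question has more than one.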

Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.

  • - pretrained BERT encoder weights
  • - pretrained BERT masked language model head weights
  • - pretrained BERT next sentence prediction head weights. This is optional and not needed if you only use masked language model loss.
  • bert-config.json - the config file used to initialize BERT network architecture in NeMo

More Details

BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pre-train language representations which obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE Benchmark and SQuAD Question Answering dataset. This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper and Hugging Face implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.

BERT uses the same architecture as the encoder half of the Transformer. Input sequences are projected into an embedding space before being fed into the encoder. Positional and segment encodings are added to the embeddings to preserve positional information and to distinguish the two segments of an input pair. The encoder is a stack of 24 Transformer blocks, each consisting of a multi-head attention layer followed by successive stages of feed-forward networks and layer normalization. The multi-head attention layer performs self-attention on multiple input representations. The model has approximately 330M parameters.
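The parameter count can be sanity-checked with a short calculation. The sketch below assumes the standard BERT Large hyperparameters from the BERT paper (hidden size 1024, feed-forward size 4096, a 30522-token uncased vocabulary, 512 positions, 2 segment types); only the 24 layers and the sequence length of 512 are stated on this page, so the remaining values should be checked against bert-config.json. It counts the encoder and pooler only, not the masked language model or next sentence prediction heads.

```python
def bert_encoder_params(hidden=1024, layers=24, ffn=4096,
                        vocab=30522, max_pos=512, type_vocab=2):
    """Count weights and biases of a BERT encoder (assumed hyperparameters)."""
    layer_norm = 2 * hidden                       # scale + bias
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each hidden x hidden + bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    block = attention + layer_norm + feed_forward + layer_norm
    # Pooler: one dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden
    return embeddings + layers * block + pooler


total = bert_encoder_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes to roughly 335M, in the same range as the rounded figure quoted above. The number of attention heads does not appear because splitting the projections into heads does not change the number of weights.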

The model is trained using masked language model loss, optionally combined with next sentence prediction loss.
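As an illustration of how the two objectives fit together, here is a minimal pure-Python sketch. It assumes the losses are combined as an unweighted sum, as in the BERT paper, and that unmasked positions are marked with a hypothetical ignore value of -1; neither detail is confirmed by this page, and NeMo's actual implementation may differ.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def pretraining_loss(mlm_logits, mlm_labels, nsp_logits=None, nsp_label=None,
                     ignore_index=-1):
    """Masked LM loss averaged over masked positions, plus optional NSP loss."""
    # Only positions that were masked out contribute to the MLM loss.
    masked = [(l, y) for l, y in zip(mlm_logits, mlm_labels) if y != ignore_index]
    mlm_loss = sum(cross_entropy(l, y) for l, y in masked) / max(len(masked), 1)
    if nsp_logits is None:
        # Corresponds to training with masked language model loss only.
        return mlm_loss
    return mlm_loss + cross_entropy(nsp_logits, nsp_label)


# Toy example: 2 positions over a 4-token vocabulary, only the first is masked.
mlm_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
mlm_labels = [2, -1]
print(pretraining_loss(mlm_logits, mlm_labels))
print(pretraining_loss(mlm_logits, mlm_labels, nsp_logits=[0.0, 0.0], nsp_label=1))
```

When only the masked language model loss is used, the next sentence prediction head weights listed above are not needed.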


Source code and developer guide are available at
Refer to documentation at
Code to pretrain and reproduce this model checkpoint is available at

This model checkpoint can be used either for finetuning BERT on your custom dataset or for finetuning on downstream tasks, including GLUE benchmark tasks, question answering (e.g. SQuAD), joint intent and slot detection, punctuation and capitalization, named entity recognition, and a speech recognition postprocessing model that corrects mistakes. All of these tasks and scripts can be found at

The following examples show how to pretrain BERT and how to finetune it on two downstream tasks, GLUE MRPC and SQuAD.

Usage example 1: Pretraining BERT

  1. Download and preprocess the uncased Wikipedia and BookCorpus datasets:
  • Run the script and extract the preprocessed hdf5 files into $train_data and $eval_data
  2. Pretrain BERT Large with sequence length 512 on a DGX1 with 8 V100 GPUs:

    cd examples/nlp/language_modeling;

    python -m torch.distributed.launch --nproc_per_node=8 --config_file bert-config.json --train_data $train_data --eval_data $eval_data --num_gpus 8 --batch_size 8 --amp_opt_level "O1" --lr_policy SquareRootAnnealing --beta1 0.9 --beta2 0.999 --lr_warmup_proportion 0.01 --optimizer adam_w --weight_decay 0.01 --lr 0.4375e-4 [--only_mlm_loss] data_preprocessed --max_predictions_per_seq 80 --num_iters 2285714

Checkpoints will be stored in the args.work_dir folder.
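The flags in the pretraining command imply a few derived quantities that are easy to work out. The sketch below assumes that --batch_size is per GPU and that no gradient accumulation is used; neither is stated on this page. The schedule function is an assumed shape for a linear warmup followed by square-root decay, written for illustration only and not taken from NeMo's SquareRootAnnealing source.

```python
import math

num_gpus = 8
batch_size_per_gpu = 8        # assumption: --batch_size is per GPU
num_iters = 2_285_714
warmup_proportion = 0.01
peak_lr = 0.4375e-4

global_batch = num_gpus * batch_size_per_gpu     # sequences per iteration
sequences_seen = global_batch * num_iters        # over the whole run
warmup_steps = int(warmup_proportion * num_iters)


def sqrt_annealing_lr(step: int) -> float:
    """Illustrative schedule: linear warmup, then square-root decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / (warmup_steps + 1)
    return peak_lr * math.sqrt((num_iters - step) / num_iters)


print(f"global batch: {global_batch} sequences")
print(f"sequences seen: {sequences_seen:,}")
print(f"warmup steps: {warmup_steps:,}")
for step in (0, warmup_steps, num_iters // 2, num_iters):
    print(step, sqrt_annealing_lr(step))
```

If fewer GPUs are used, the global batch shrinks accordingly, so the learning rate and number of iterations would likely need to be retuned to reproduce the reported results.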

Usage example 2: Using the BERT checkpoint for a downstream task (GLUE benchmark task MRPC)

Download and bert-config.json.

    cd examples/nlp/glue_benchmark;

    python --data_dir $mrpc_dataset --task_name mrpc --pretrained_bert_model bert-large-uncased --bert_checkpoint /path_to/ --bert_config /path_to/bert-config.json --lr 1e-5

Usage example 3: Using the BERT checkpoint for the SQuAD question answering task

Download and bert-config.json.

SQuAD v1.1

    cd examples/nlp/question_answering;

    python -m torch.distributed.launch --nproc_per_node=8 --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file=/path_to/squad/v1.1/train-v1.1.json --eval_file /path_to/squad/v1.1/dev-v1.1.json --bert_checkpoint /path_to/ --bert_config /path_to/bert-config.json --pretrained_model_name bert-large-uncased --batch_size 3 --num_epochs 2 --lr_policy WarmupAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --no_data_cache

SQuAD v2.0

    cd examples/nlp/question_answering;

    python -m torch.distributed.launch --nproc_per_node=8 --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file /path_to/squad/v2.0/train-v2.0.json --eval_file /path_to/squad/v2.0/dev-v2.0.json --bert_checkpoint /path_to/ --bert_config /path_to/bert-config.json --pretrained_model_name=bert-large-uncased --batch_size 3 --num_epochs 2 --lr_policy WarmupAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --version_2_with_negative --no_data_cache
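The --version_2_with_negative flag presumably tells the script that the dataset contains unanswerable questions, which is the main difference between SQuAD v2.0 and v1.1. In the SQuAD v2.0 JSON format, such questions carry "is_impossible": true and an empty answers list. The sketch below counts both kinds of question in a SQuAD-style dictionary; the helper name and the toy data are illustrative and not part of NeMo.

```python
from typing import Tuple


def count_questions(squad: dict) -> Tuple[int, int]:
    """Return (answerable, unanswerable) question counts for SQuAD-style data."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # v1.1 files have no "is_impossible" key, so default to False.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Toy stand-in for the contents of a train/dev JSON file.
toy = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "BERT Large has 24 Transformer blocks.",
            "qas": [
                {"id": "q1", "question": "How many blocks does BERT Large have?",
                 "answers": [{"text": "24", "answer_start": 15}],
                 "is_impossible": False},
                {"id": "q2", "question": "Who trained the model?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(count_questions(toy))
```

To run this on a real file, load it with json.load and pass the resulting dictionary to count_questions.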