BERT Large TensorFlow checkpoint pretrained using AMP and LAMB optimizer
Model Overview
BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
Model Architecture
BERT's model architecture is a multi-layer bidirectional Transformer encoder. Based on the model size, we have the following two default configurations of BERT:
| Model | Hidden layers | Hidden unit size | Attention heads | Feedforward filter size | Max sequence length | Parameters |
|---|---|---|---|---|---|---|
| BERT Base | 12 encoder | 768 | 12 | 4 x 768 | 512 | 110M |
| BERT Large | 24 encoder | 1024 | 16 | 4 x 1024 | 512 | 330M |
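
As a rough sanity check on the Parameters column, the count can be estimated from the other columns. The sketch below is not taken from the training scripts: it assumes the 30,522-token WordPiece vocabulary and two segment types of the standard BERT release, a pooler layer on top of the encoder, and no task-specific heads. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_seq_len=512, type_vocab=2):
    """Estimate the number of trainable parameters in a BERT encoder.

    vocab_size and type_vocab are assumptions (standard BERT vocabulary),
    not values stated in this model card.
    """
    ffn = 4 * hidden          # feedforward filter size from the table
    layer_norm = 2 * hidden   # scale and bias

    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_seq_len + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections, each with a bias.
    # The number of attention heads does not change the count.
    attention = 4 * (hidden * hidden + hidden)

    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)

    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense pooler applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT Base:  {base / 1e6:.1f}M")
print(f"BERT Large: {large / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 109.5M for BERT Base and about 335.1M for BERT Large, in line with the rounded figures in the table.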
BERT training consists of two steps: pre-training the language model in an unsupervised fashion on large amounts of unannotated text, and then fine-tuning this pre-trained model for various NLP tasks, such as question answering, sentence classification, or sentiment analysis. Fine-tuning typically adds an extra layer or two for the specific task and further trains the model using a task-specific annotated dataset, starting from the pre-trained backbone weights. The end-to-end process is depicted in the following image:

Figure 1: BERT Pipeline
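
To illustrate the "extra layer" added during fine-tuning, the following minimal sketch shows a sentence-classification head: a single linear layer followed by a softmax, applied to the pooled encoder output. It is a pure-Python illustration under assumptions, not the code used by the training scripts. The function `classify`, the placeholder `pooled` vector, and the zero-initialized `weights` and `bias` are all hypothetical; in practice the head's weights are learned together with the backbone.

```python
import math


def classify(pooled, weights, bias):
    """Apply a linear layer and softmax to a pooled encoder output.

    pooled:  list of hidden_size floats (stand-in for the encoder output)
    weights: num_labels rows of hidden_size floats (task-specific, learned)
    bias:    list of num_labels floats
    """
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden_size = 1024  # hidden unit size of BERT Large
num_labels = 2      # assumed binary task, e.g. sentiment analysis

pooled = [0.01] * hidden_size                             # placeholder encoder output
weights = [[0.0] * hidden_size for _ in range(num_labels)]  # untrained head
bias = [0.0] * num_labels

probs = classify(pooled, weights, bias)
print(probs)
```

With an untrained, zero-initialized head the two classes receive equal probability; fine-tuning on the annotated dataset is what moves these weights away from that starting point.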
Training
This model was trained using the script available on NGC and in the GitHub repository.
Dataset
The following datasets were used to train this model:
- Wikipedia - Dataset containing a 170GB+ Wikipedia dump.
- BookCorpus - Large-scale text corpus for unsupervised learning of sentence encoders/decoders.
Performance
Performance numbers for this model are available in NGC.
License
This model was trained using open-source software available in the Deep Learning Examples repository.
For terms of use, please refer to the license of the script and the datasets the model was derived from.