BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
BERT's model architecture is a multi-layer bidirectional Transformer encoder. Two default configurations of BERT are provided, differing in model size:
| Model | Hidden unit size | Feedforward filter size | Max sequence length |
|-------|------------------|-------------------------|---------------------|
| BERT Base  | 768  | 4 x 768  | 512 |
| BERT Large | 1024 | 4 x 1024 | 512 |
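As a rough illustration, the two configurations can be summarized in code. The dictionary keys below are informal labels chosen for this sketch, not names taken from any configuration file in this model.

```python
# Illustrative summary of the two default BERT configurations.
# Key names are informal labels for this sketch only.
BERT_CONFIGS = {
    "bert-base": {
        "hidden_size": 768,
        "feedforward_size": 4 * 768,   # 3072
        "max_sequence_length": 512,
    },
    "bert-large": {
        "hidden_size": 1024,
        "feedforward_size": 4 * 1024,  # 4096
        "max_sequence_length": 512,
    },
}
```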
BERT training consists of two steps: pre-training the language model in an unsupervised fashion on vast amounts of unannotated text, and then fine-tuning the pre-trained model for specific NLP tasks, such as question answering, sentence classification, or sentiment analysis. Fine-tuning typically adds an extra layer or two for the specific task and further trains the model on a task-specific annotated dataset, starting from the pre-trained backbone weights. The end-to-end process is depicted in the following image:
Figure 1: BERT Pipeline
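As a minimal sketch of the fine-tuning step, the example below uses the Hugging Face `transformers` library (an assumption; it is not the training script that ships with this model) to attach a classification head to a pre-trained BERT backbone and compute a loss on one labeled example.

```python
# Minimal fine-tuning sketch using the Hugging Face `transformers` library;
# illustrative only, not the training code shipped with this model.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

# Start from pre-trained backbone weights; the classification head on top
# is newly initialized and trained on the task-specific annotated data.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One annotated example for a sentence-classification task (label 1 = positive).
batch = tokenizer(["BERT fine-tunes quickly on small datasets."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients flow through the classification head and the full backbone
```

In practice this forward/backward step would run inside a training loop with an optimizer (for example AdamW) over the task dataset for a few epochs.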
The following datasets were used to train this model:
- Wikipedia - a 170GB+ dump of Wikipedia articles.
- Bookcorpus - Large-scale text corpus for unsupervised learning of sentence encoders/decoders.
Performance numbers for this model are available in NGC.