BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
BERT's model architecture is a multi-layer bidirectional transformer encoder. Based on the model size, we have the following two default configurations of BERT:
|Model||Hidden layers||Hidden unit size||Attention heads||Feedforward filter size||Max sequence length||Parameters|
|BERTBASE||12 encoder||768||12||4 x 768||512||110M|
|BERTLARGE||24 encoder||1024||16||4 x 1024||512||330M|
BERT training consists of two steps, pre-training the language model in an unsupervised fashion on vast amounts of unannotated datasets, and then using this pre-trained model for fine-tuning for various NLP tasks, such as question and answer, sentence classification, or sentiment analysis. Fine-tuning typically adds an extra layer or two for the specific task and further trains the model using a task-specific annotated dataset, starting from the pre-trained backbone weights. The end-to-end process is depicted in the following image:
Figure 1: BERT Pipeline
The following datasets were used to train this model:
Performance numbers for this model are available in NGC.