BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
BERT's model architecture is a multi-layer bidirectional Transformer encoder. Based on the model size, we have the following two default configurations of BERT:
|Hidden unit size
|Feedforward filter size
|Max sequence length
|4 x 768
|4 x 1024
BERT training consists of two steps, pre-training the language model in an unsupervised fashion on vast amounts of unannotated datasets, and then using this pre-trained model for fine-tuning for various NLP tasks, such as question and answer, sentence classification, or sentiment analysis. Fine-tuning typically adds an extra layer or two for the specific task and further trains the model using a task-specific annotated dataset, starting from the pre-trained backbone weights. The end-to-end process in depicted in the following image:
Figure 1: BERT Pipeline
The following datasets were used to train this model:
- SQuAD 1.1 + 2.0 - Reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Performance numbers for this model are available in NGC