BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
The BERT model uses the same architecture as the encoder of the Transformer. Input tokens are projected into an embedding space before being fed into the encoder. Positional and segment embeddings are added to the token embeddings: the former preserve word-order information, and the latter mark which input segment (for example, the first or second sentence) each token belongs to. The encoder is a stack of Transformer blocks, each consisting of a multi-head self-attention layer followed by a feed-forward network, with a residual connection and layer normalization applied after each of the two. The multi-head attention layer runs several self-attention operations in parallel, each on a different learned projection of the input, and combines their outputs.
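To make the data flow concrete, here is a minimal NumPy sketch of the embedding sum and a single encoder block. It is an illustration under stated assumptions, not this model's implementation: the sizes are toy values, weights are random, bias terms and the learned layer-norm scale and shift are omitted, and ReLU stands in for the GELU activation that BERT uses. All function and parameter names (`embed`, `encoder_block`, `wq`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (BERT-base uses hidden=768, heads=12).
VOCAB, MAX_LEN, HIDDEN, HEADS, FFN = 100, 16, 8, 2, 32

def layer_norm(x, eps=1e-12):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def embed(token_ids, segment_ids, tok_emb, pos_emb, seg_emb):
    # Sum of token, position, and segment embeddings, then layer norm.
    positions = np.arange(len(token_ids))
    return layer_norm(tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids])

def multi_head_self_attention(x, wq, wk, wv, wo, heads):
    seq_len, hidden = x.shape
    head_dim = hidden // heads

    def split(t):
        # (seq, hidden) -> (heads, seq, head_dim)
        return t.reshape(seq_len, heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    context = softmax(scores) @ v                          # (heads, seq, head_dim)
    merged = context.transpose(1, 0, 2).reshape(seq_len, hidden)
    return merged @ wo

def encoder_block(x, p, heads):
    # Self-attention sub-layer with residual connection and layer norm.
    x = layer_norm(x + multi_head_self_attention(x, p["wq"], p["wk"], p["wv"], p["wo"], heads))
    # Feed-forward sub-layer (ReLU here as an assumption; BERT uses GELU).
    ffn = np.maximum(0.0, x @ p["w1"]) @ p["w2"]
    return layer_norm(x + ffn)

# Randomly initialized tables and weights, standing in for trained parameters.
tok_emb = rng.normal(size=(VOCAB, HIDDEN))
pos_emb = rng.normal(size=(MAX_LEN, HIDDEN))
seg_emb = rng.normal(size=(2, HIDDEN))
params = {
    "wq": rng.normal(size=(HIDDEN, HIDDEN)),
    "wk": rng.normal(size=(HIDDEN, HIDDEN)),
    "wv": rng.normal(size=(HIDDEN, HIDDEN)),
    "wo": rng.normal(size=(HIDDEN, HIDDEN)),
    "w1": rng.normal(size=(HIDDEN, FFN)),
    "w2": rng.normal(size=(FFN, HIDDEN)),
}

token_ids = np.array([1, 42, 7, 2, 9])     # made-up token ids
segment_ids = np.array([0, 0, 0, 1, 1])    # first three tokens in segment A, rest in B

x = embed(token_ids, segment_ids, tok_emb, pos_emb, seg_emb)
out = encoder_block(x, params, HEADS)
print(out.shape)  # (5, 8)
```

A full encoder would apply `encoder_block` repeatedly with separate parameters per layer; the output keeps the input shape of one vector per token, which is what allows the blocks to be stacked.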
An illustration of the architecture taken from the Transformer paper is shown below.
The following datasets were used to train this model:
- Wikipedia - Dataset containing a 170GB+ Wikipedia dump.
- Stanford Sentiment Treebank v2 (SST-2) - Dataset for predicting sentiment from longer movie reviews.
Performance numbers for this model are available in NGC.