ELECTRA is a method of pre-training language representations that outperforms existing techniques on a wide array of NLP tasks.
ELECTRA combines two Transformer models: a generator and a discriminator. The generator replaces tokens in a sequence and is therefore trained as a masked language model. The discriminator, which is the model we are interested in, tries to identify which tokens in the sequence were replaced by the generator. Both the generator and the discriminator use the same architecture as the Transformer encoder. The encoder is a stack of Transformer blocks, each consisting of a multi-head attention layer followed by successive stages of feed-forward networks and layer normalization. The multi-head attention layer performs self-attention on multiple input representations.
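The generator/discriminator setup described above can be illustrated with a minimal sketch of how a training example for the discriminator might be built. This is not the model's actual data pipeline: the function name `make_rtd_example`, the toy vocabulary, and the 15% masking rate are assumptions made for illustration, and a uniform random choice stands in for the generator, which in practice is a trained masked language model.

```python
import random


def make_rtd_example(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one replaced-token-detection example (hypothetical helper).

    mask_prob is an assumed masking rate, and rng.choice is only a
    stand-in for sampling from a masked-language-model generator.
    """
    rng = random.Random(seed)

    # Pick the positions the generator will fill in (at least one).
    n_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))

    # Replace each chosen position with a "generator" sample.
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = rng.choice(vocab)

    # Discriminator target per token: 1 = replaced, 0 = original.
    # If the sample happens to equal the original token, the position
    # is labeled as original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, masked_positions


tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "a", "dog"]
corrupted, labels, masked_positions = make_rtd_example(tokens, vocab)
print(corrupted, labels, masked_positions)
```

The discriminator receives `corrupted` as input and is trained to predict `labels` for every token, not only for the masked positions.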
The following datasets were used to train this model:
Performance numbers for this model are available in NGC.