This resource is a subproject of bert_for_tensorflow. Visit the parent project to download the code and get more information about the setup.
BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the paper *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed-precision arithmetic and Tensor Cores for faster inference while maintaining target accuracy.
Other publicly available implementations of BERT include:
BERT's model architecture is a multi-layer bidirectional Transformer encoder. Based on the model size, we have the following two default configurations of BERT:
Model | Hidden layers | Hidden unit size | Attention heads | Feed-forward filter size | Max sequence length | Parameters |
---|---|---|---|---|---|---|
BERT-Base | 12 encoder layers | 768 | 12 | 4 x 768 | 512 | 110M |
BERT-Large | 24 encoder layers | 1024 | 16 | 4 x 1024 | 512 | 330M |
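These hyperparameters correspond to the keys in the `bert_config.json` file distributed with the pre-trained checkpoints. As a rough illustration only (the authoritative values come from the downloaded checkpoint, not from this snippet), the BERT-Large configuration from the table maps onto the config keys like this:

```python
# Illustrative only: the authoritative values live in the checkpoint's
# bert_config.json. Shown here to connect the table above to the config keys.
bert_large_config = {
    "num_hidden_layers": 24,         # hidden (encoder) layers
    "hidden_size": 1024,             # hidden unit size
    "num_attention_heads": 16,       # attention heads
    "intermediate_size": 4 * 1024,   # feed-forward filter size
    "max_position_embeddings": 512,  # max sequence length
}
```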
Typically, the language model is followed by a few task-specific layers. The model used here includes layers for question answering.
BERT inference consists of three main stages: tokenization, the BERT model, and finally a projection of the token-level predictions back onto the original text. Since the tokenizer and the projection of the final predictions are not nearly as compute-heavy as the model itself, we run them on the host. The BERT model is GPU-accelerated via TensorRT.
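At a high level, and using hypothetical helper names rather than the project's actual functions, the flow looks like this:

```python
# A minimal sketch of the three-stage inference flow described above.
# The helper names (tokenize_question, run_bert_engine, project_to_text)
# are hypothetical; the project's inference scripts wire these steps
# together in their own way.

def answer_question(question, passage, engine_context):
    # Stage 1 (host): split the raw text into model-ready token IDs.
    features = tokenize_question(question, passage)

    # Stage 2 (GPU, TensorRT): run the BERT engine on the token IDs.
    start_logits, end_logits = run_bert_engine(engine_context, features)

    # Stage 3 (host): map the predicted token span back onto the original text.
    return project_to_text(start_logits, end_logits, features)
```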
The tokenizer splits the input text into tokens that can be consumed by the model. For details on this process, see this tutorial.
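As a rough illustration, assuming the `tokenization.py` module from the reference BERT code base is importable and the `vocab.txt` file from the downloaded checkpoint is available, tokenization boils down to something like the following:

```python
# A minimal sketch of WordPiece tokenization. `tokenization` is the module
# shipped with the reference BERT code; vocab.txt comes from the checkpoint.
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# Split raw text into subword tokens (whole words plus "##"-prefixed pieces),
# then map them to the integer IDs the model consumes, wrapped in the
# special [CLS]/[SEP] markers.
tokens = tokenizer.tokenize("TensorRT accelerates BERT inference.")
input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
```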
To run the BERT model in TensorRT, we construct the model using the TensorRT APIs and import the weights from a pre-trained TensorFlow checkpoint from NGC. Finally, a TensorRT engine is generated and serialized to disk. The various inference scripts then load this engine for inference.
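The following is a minimal sketch of deserializing a previously built engine and running it with the TensorRT Python API (plus pycuda for memory management). The engine file name is a placeholder, and the sketch assumes an engine built with static input shapes; the project's inference scripts additionally handle dynamic sequence lengths, batching, and CUDA streams.

```python
# A minimal sketch, not the project's full inference script: load a serialized
# engine (file name is a placeholder) and run one inference with static shapes.
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("bert.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

with engine.create_execution_context() as context:
    # One host/device buffer pair per binding (inputs such as token IDs,
    # segment IDs, and input mask, plus the output logits).
    inputs, outputs, bindings = [], [], []
    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host = np.zeros(trt.volume(engine.get_binding_shape(i)), dtype=dtype)
        dev = cuda.mem_alloc(host.nbytes)
        bindings.append(int(dev))
        (inputs if engine.binding_is_input(i) else outputs).append((host, dev))

    # ... fill the input host buffers with the tokenizer's output here ...

    for host, dev in inputs:                # host -> GPU
        cuda.memcpy_htod(dev, host)
    context.execute_v2(bindings=bindings)   # run the engine
    for host, dev in outputs:               # GPU -> host
        cuda.memcpy_dtoh(host, dev)
```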
Lastly, the tokens predicted by the model are projected back to the original text to get a final result.
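As a simplified sketch of this projection step for question answering: the names `token_to_orig_map` and `orig_words` below are illustrative bookkeeping built during tokenization, and the real scripts also consider n-best candidate spans and a maximum answer length.

```python
# A simplified sketch of the final projection step for question answering.
# `start_logits`/`end_logits` are the per-token outputs of the model, and
# `token_to_orig_map` records, for each model token, the index of the word
# in the original passage it came from (built during tokenization).
import numpy as np

def project_answer(start_logits, end_logits, token_to_orig_map, orig_words):
    start = int(np.argmax(start_logits))
    end = int(np.argmax(end_logits))
    if end < start:                      # fall back to an empty answer
        return ""
    first = token_to_orig_map[start]     # first original word of the span
    last = token_to_orig_map[end]        # last original word of the span
    return " ".join(orig_words[first:last + 1])
```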
The following software version configuration has been tested:
Software | Version |
---|---|
Python | 3.6.9 |
TensorFlow | 1.13.1 |
TensorRT | 7.0.0.1 |
CUDA | 10.2.89 |