
BERT Inference with TensorRT


Description: Scripts to perform high-performance BERT inference using NVIDIA TensorRT
Publisher: NVIDIA
Use Case: NLP
Framework: TensorFlow
Latest Version: -
Modified: September 24, 2020
Compressed Size: 0 B

This resource is a subproject of bert_for_tensorflow. Visit the parent project to download the code and get more information about the setup.

BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. This model is based on the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NVIDIA's BERT is an optimized version of Google's official implementation, leveraging mixed-precision arithmetic and Tensor Cores for faster inference while maintaining target accuracy.

Other publicly available implementations of BERT include:

  1. NVIDIA PyTorch
  2. Hugging Face
  3. codertimo
  4. gluon-nlp
  5. Google's official implementation

Model architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder. Based on model size, there are two default configurations of BERT:

| Model      | Hidden layers | Hidden unit size | Attention heads | Feed-forward filter size | Max sequence length | Parameters |
|------------|---------------|------------------|-----------------|--------------------------|---------------------|------------|
| BERT-Base  | 12 encoder    | 768              | 12              | 4 x 768                  | 512                 | 110M       |
| BERT-Large | 24 encoder    | 1024             | 16              | 4 x 1024                 | 512                 | 330M       |

Typically, the language model is followed by a few task-specific layers. The model used here includes layers for question answering.
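For intuition, the question-answering head is just a per-token linear projection of the encoder's final hidden states to two logits: one for the answer start and one for the answer end. A minimal numpy sketch (the weight names here are hypothetical, not taken from the scripts):

```python
import numpy as np

def qa_head(hidden_states, qa_weights, qa_bias):
    """Project encoder outputs to start/end logits for span prediction.

    hidden_states: (seq_len, hidden_size) final BERT encoder output
    qa_weights:    (hidden_size, 2) learned projection (hypothetical name)
    qa_bias:       (2,) learned bias (hypothetical name)
    Returns (start_logits, end_logits), each of shape (seq_len,).
    """
    logits = hidden_states @ qa_weights + qa_bias  # (seq_len, 2)
    return logits[:, 0], logits[:, 1]
```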

TensorRT Inference Pipeline

BERT inference consists of three main stages: tokenization, the BERT model, and finally a projection of the token-level predictions back onto the original text. Since the tokenizer and the projection of the final predictions are not nearly as compute-heavy as the model itself, they run on the host. The BERT model is GPU-accelerated via TensorRT.

The tokenizer splits the input text into tokens that can be consumed by the model. For details on this process, see this tutorial.
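For intuition, BERT's WordPiece tokenizer splits each word by greedy longest-match-first lookup against a fixed vocabulary, marking continuation pieces with a `##` prefix. A simplified sketch of that matching step (the real tokenizer also handles casing, punctuation, and Unicode normalization):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches the vocab
        if piece is None:
            return [unk_token]  # no valid segmentation for this word
        tokens.append(piece)
        start = end
    return tokens

# e.g. with vocab = {"answer", "##ing"}:
# wordpiece_tokenize("answering", vocab) -> ["answer", "##ing"]
```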

To run the BERT model in TensorRT, we construct the model using the TensorRT APIs and import the weights from a pre-trained TensorFlow checkpoint from NGC. Finally, a TensorRT engine is generated and serialized to disk. The various inference scripts then load this engine for inference.
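The build-then-deserialize flow looks roughly like the following sketch using the TensorRT 7 Python API. The network-population step is elided, since the actual scripts define the BERT layers and import the checkpoint weights there:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_and_save_engine(path="bert.engine"):
    """Build a TensorRT engine and serialize it to disk (one-time step)."""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(explicit_batch) as network, \
         builder.create_builder_config() as config:
        config.max_workspace_size = 1 << 30    # 1 GiB of build scratch space
        config.set_flag(trt.BuilderFlag.FP16)  # mixed precision on Tensor Cores
        # ... populate `network` layer by layer, importing weights from the
        # pre-trained TensorFlow checkpoint (done by the actual scripts) ...
        engine = builder.build_engine(network, config)
        with open(path, "wb") as f:
            f.write(engine.serialize())

def load_engine(path="bert.engine"):
    """Deserialize a previously built engine, as the inference scripts do."""
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```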

Lastly, the tokens predicted by the model are projected back to the original text to get a final result.
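Concretely, the projection scores every valid (start, end) token pair from the two logit vectors and maps the winning token span back to character offsets in the source passage. A simplified sketch, assuming the tokenizer recorded per-token character offsets (the `offsets` structure here is hypothetical):

```python
import numpy as np

def project_answer(start_logits, end_logits, offsets, text, max_answer_len=30):
    """Pick the best (start, end) span and map it back to the source text.

    offsets: list of (char_start, char_end) per token, recorded during
    tokenization (a hypothetical helper structure, not part of the scripts).
    """
    best_score, best_span = -np.inf, (0, 0)
    for i, s in enumerate(start_logits):
        # Only consider spans that start at i and are not overly long.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    i, j = best_span
    return text[offsets[i][0]:offsets[j][1]]
```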

Version Info

The following software version configuration has been tested:

| Software   | Version |
|------------|---------|
| Python     | 3.6.9   |
| TensorFlow | 1.13.1  |
| TensorRT   | 7.0.0.1 |
| CUDA       | 10.2.89 |