FasterTransformer is a highly optimized transformer implementation for inference, tested and maintained by NVIDIA.
In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and the decoder for inference. On Ampere, Volta, and Turing GPUs, the computing power of Tensor Cores is used automatically.
In FasterTransformer 3.0, we implemented INT8 quantization for the encoder (with support for the Effective Transformer). With INT8 quantization, we can take advantage of the powerful INT8 Tensor Cores on Turing and Ampere GPUs to achieve better inference performance. We also provide quantization tools for TensorFlow.
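As a rough illustration of what INT8 quantization does (a minimal NumPy sketch, not FasterTransformer's actual kernels), symmetric per-tensor quantization maps each float tensor to 8-bit integers with a single scale factor, here calibrated from the tensor's absolute maximum:

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from its INT8 representation."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
scale = np.abs(x).max() / 127.0  # amax calibration (an assumed, simple scheme)
x_hat = dequantize_int8(quantize_int8(x, scale), scale)
print(np.abs(x - x_hat).max())   # rounding error is bounded by scale / 2
```

The INT8 GEMMs run on the quantized tensors, which is what lets the INT8 Tensor Cores do the heavy lifting; the scale factors are folded back in afterward.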
Because INT8 inference loses some precision, quantization-aware training is needed to obtain good accuracy. This model checkpoint was trained with the bert-tf-quantization tool to demonstrate FasterTransformer 3.0 INT8 inference.
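Quantization-aware training of this kind generally works by inserting fake-quantization nodes into the training graph so the weights adapt to INT8 rounding error; the snippet below is a minimal sketch using TensorFlow's built-in fake-quant op, with illustrative (assumed) min/max calibration values, not the bert-tf-quantization tool's actual configuration:

```python
import tensorflow as tf

# Fake quantization: round the tensor as INT8 would, then dequantize, so the
# training loss sees the quantization error while gradients pass straight
# through (the straight-through estimator).
x = tf.random.normal([4, 8])
x_fq = tf.quantization.fake_quant_with_min_max_args(
    x, min=-4.0, max=4.0,        # assumed calibration range for illustration
    num_bits=8, narrow_range=True)
```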
Please refer to our GitHub repo; details are in the "Run FasterTransformer for SQuAD 1.1 dataset" section of the 3.0 version directory. Accuracy on SQuAD 1.1:
| Implementation | F1 | EM | Docker image |
|---|---|---|---|
| TensorFlow | 88.53% | 81.05% | - |
| TensorFlow with FT op | 88.33% | 80.65% | tensorflow:19.07-py2 |
| TensorFlow with FT op | 88.27% | 80.52% | tensorflow:20.03-tf1-py3 |