
BERT Triton deployment for TensorFlow


Description

Deploying high-performance inference for the BERT model using the NVIDIA Triton Inference Server.

Publisher: NVIDIA
Use Case: NLP
Framework: TensorFlow
Latest Version: -
Modified: November 12, 2021
Compressed Size: 0 B

This resource is a subproject of bert_for_tensorflow. Visit the parent project to download the code and get more information about the setup.

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP/REST or gRPC endpoint, or through its C API, allowing remote clients to request inference for any number of GPU or CPU models managed by the server.
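
As a point of reference, the HTTP/REST endpoint can be exercised from Python with the tritonclient library. The sketch below is only a minimal liveness/readiness check; the server URL and the model name "bert" are placeholders for your deployment, not values defined by this resource.

```python
# Minimal connectivity check against a running Triton server over HTTP.
# Assumes the tritonclient package is installed (pip install tritonclient[http]).
# The URL and model name "bert" are assumptions for illustration.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Liveness/readiness of the server itself, then of the specific model.
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("bert"))
```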

A typical Triton Inference Server pipeline can be broken down into the following steps (a client-side sketch of these steps follows the list):

  1. The client serializes the inference request into a message and sends it to the server (Client Send).

  2. The message travels over the network from the client to the server (Network).

  3. The message arrives at the server, and is deserialized (Server Receive).

  4. The request is placed in the queue (Server Queue).

  5. The request is removed from the queue and computed (Server Compute).

  6. The completed request is serialized into a message and sent back to the client (Server Send).

  7. The completed message then travels over the network from the server to the client (Network).

  8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
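
To make the list concrete, the following Python sketch issues one inference request through the HTTP client and annotates where each step occurs. Everything model-specific in it (model name, input and output tensor names, sequence length, datatypes) is an assumption for illustration; the actual values come from the config.pbtxt generated for your BERT export.

```python
# A minimal sketch of steps 1-8 from the client's point of view: serialize a
# request, send it over HTTP, and deserialize the response. The input names
# (input_ids, input_mask, segment_ids), the sequence length of 384, the INT32
# datatype, and the model name "bert" are assumptions -- check config.pbtxt
# for the actual names, shapes, and types of your exported BERT model.
import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "bert"
SEQ_LEN = 384
BATCH = 1

client = httpclient.InferenceServerClient(url="localhost:8000")

# Step 1: build and serialize the request (Client Send).
inputs = []
for name in ("input_ids", "input_mask", "segment_ids"):
    tensor = httpclient.InferInput(name, [BATCH, SEQ_LEN], "INT32")
    tensor.set_data_from_numpy(np.zeros((BATCH, SEQ_LEN), dtype=np.int32))
    inputs.append(tensor)

# Steps 2-7 happen inside this call: network transfer, server-side
# deserialization, queueing, compute, and the reply on its way back.
result = client.infer(MODEL_NAME, inputs)

# Step 8: deserialize the completed response (Client Receive). The output
# name "logits" is also an assumption; print the raw response otherwise.
logits = result.as_numpy("logits")
print(logits.shape if logits is not None else result.get_response())
```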

Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time compared to step 5. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead sit behind local front-end servers, for BERT we can assume that all clients are local.
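
If you want to verify this breakdown on your own deployment, Triton reports per-model timing statistics that separate queue time from compute time. A minimal sketch, again assuming the HTTP endpoint on localhost:8000 and a model named "bert":

```python
# A hedged sketch for checking how much of the end-to-end latency is spent in
# the server-side queue and compute stages (steps 4-5). Triton exposes
# per-model statistics through the HTTP API; the exact layout of the returned
# JSON can differ between Triton versions, so this simply pretty-prints it.
import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Model name "bert" is an assumption; pass your deployed model's name.
stats = client.get_inference_statistics(model_name="bert")
print(json.dumps(stats, indent=2))
# Compare the cumulative queue and compute times (reported in nanoseconds)
# against the overall request time to see where latency is actually spent.
```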

In this section, we will go over how to launch both the Triton Inference Server and the client, and how to find the configuration that delivers the best performance for your specific application needs.

More information on how to perform inference using NVIDIA Triton Inference Server can be found in triton/README.md.