
BERT Triton deployment for TensorFlow


Description

Deploying high-performance inference for the BERT model using the NVIDIA Triton Inference Server.

Publisher: NVIDIA
Use Case: NLP
Framework: TensorFlow
Latest Version: -
Modified: November 12, 2021
Compressed Size: 0 B

This resource is a subproject of bert_for_tensorflow. Visit the parent project to download the code and get more information about the setup.

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP/REST or gRPC endpoint, or through its C API, allowing remote clients to request inference for any number of GPU or CPU models managed by the server.
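
As a point of reference, the HTTP/REST endpoint can be exercised from Python with the tritonclient library. The sketch below is only a minimal liveness/readiness check; the server URL and the model name "bert" are placeholders for your deployment, not values defined by this resource.

```python
# Minimal connectivity check against a running Triton server over HTTP.
# Assumes the tritonclient package is installed (pip install tritonclient[http]).
# The URL and model name "bert" are assumptions for illustration.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Liveness/readiness of the server itself, then of the specific model.
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("bert"))
```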

A typical Triton Inference Server pipeline can be broken down into the following steps (a client-side sketch of these steps follows the list):

  1. The client serializes the inference request into a message and sends it to the server (Client Send).

  2. The message travels over the network from the client to the server (Network).

  3. The message arrives at the server, and is deserialized (Server Receive).

  4. The request is placed in the queue (Server Queue).

  5. The request is removed from the queue and computed (Server Compute).

  6. The completed request is serialized into a message and sent back to the client (Server Send).

  7. The completed message then travels over the network from the server to the client (Network).

  8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
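
To make the list concrete, the following Python sketch issues one inference request through the HTTP client and annotates where each step occurs. Everything model-specific in it (model name, input and output tensor names, sequence length, datatypes) is an assumption for illustration; the actual values come from the config.pbtxt generated for your BERT export.

```python
# A minimal sketch of steps 1-8 from the client's point of view: serialize a
# request, send it over HTTP, and deserialize the response. The input names
# (input_ids, input_mask, segment_ids), the sequence length of 384, the INT32
# datatype, and the model name "bert" are assumptions -- check config.pbtxt
# for the actual names, shapes, and types of your exported BERT model.
import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "bert"
SEQ_LEN = 384
BATCH = 1

client = httpclient.InferenceServerClient(url="localhost:8000")

# Step 1: build and serialize the request (Client Send).
inputs = []
for name in ("input_ids", "input_mask", "segment_ids"):
    tensor = httpclient.InferInput(name, [BATCH, SEQ_LEN], "INT32")
    tensor.set_data_from_numpy(np.zeros((BATCH, SEQ_LEN), dtype=np.int32))
    inputs.append(tensor)

# Steps 2-7 happen inside this call: network transfer, server-side
# deserialization, queueing, compute, and the reply on its way back.
result = client.infer(MODEL_NAME, inputs)

# Step 8: deserialize the completed response (Client Receive). The output
# name "logits" is also an assumption; print the raw response otherwise.
logits = result.as_numpy("logits")
print(logits.shape if logits is not None else result.get_response())
```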

Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time compared to step 5. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead sit behind local front-end servers, for BERT we can assume that all clients are local.
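
If you want to verify this breakdown on your own deployment, Triton reports per-model timing statistics that separate queue time from compute time. A minimal sketch, again assuming the HTTP endpoint on localhost:8000 and a model named "bert":

```python
# A hedged sketch for checking how much of the end-to-end latency is spent in
# the server-side queue and compute stages (steps 4-5). Triton exposes
# per-model statistics through the HTTP API; the exact layout of the returned
# JSON can differ between Triton versions, so this simply pretty-prints it.
import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Model name "bert" is an assumption; pass your deployed model's name.
stats = client.get_inference_statistics(model_name="bert")
print(json.dumps(stats, indent=2))
# Compare the cumulative queue and compute times (reported in nanoseconds)
# against the overall request time to see where latency is actually spent.
```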

In this section, we will go over how to launch both the Triton Inference Server and the client, and how to find the configuration that delivers the best performance for your specific application needs.

More information on how to perform inference using NVIDIA Triton Inference Server can be found in triton/README.md.