This resource is a subproject of fastpitch_for_pytorch. Visit the parent project to download the code and get more information about the setup.
The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
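As a small illustration of the HTTP endpoint mentioned above, the sketch below polls the server's readiness route, `/v2/health/ready`, using only the Python standard library. The host name, the port (8000 is Triton's default HTTP port), and the helper names `ready_url` and `is_server_ready` are assumptions made for this example; they are not part of the project's scripts.

```python
import urllib.request


def ready_url(host: str, port: int) -> str:
    """Build the URL of Triton's HTTP readiness endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def is_server_ready(host: str = "localhost", port: int = 8000,
                    timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200.

    Host and port are assumptions; adjust them to match your deployment.
    """
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


if __name__ == "__main__":
    print("Triton ready:", is_server_ready())
```

A check like this can be run from any client machine that can reach the exposed port before any inference requests are sent.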
This README provides step-by-step deployment instructions for models generated during training (as described in the model README). Additionally, this README provides the corresponding deployment scripts that ensure optimal GPU utilization during inferencing on Triton Inference Server.
The deployment process consists of two steps:
To run benchmarks measuring the model performance in inference, perform the following steps:
Start the Triton Inference Server.
The Triton Inference Server is started in one (possibly remote) container, and the ports for the gRPC or REST API are exposed.
Run accuracy tests.
Produce results that are checked against the given accuracy thresholds. Refer to step 8 in the Quick Start Guide.
Run performance tests.
Produce latency and throughput results for offline (static batching) and online (dynamic batching) scenarios. Refer to step 10 in the Quick Start Guide.
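The first step above (starting the server in a container with its ports exposed) might look roughly like the following sketch, which only assembles a `docker run` command line. It is not the project's own launch script: the helper name `build_triton_command`, the model repository path, and the image tag placeholder are assumptions. Ports 8000, 8001, and 8002 are Triton's defaults for HTTP, gRPC, and metrics.

```python
import shlex
from pathlib import Path


def build_triton_command(model_repository, image,
                         http_port=8000, grpc_port=8001, metrics_port=8002):
    """Assemble a `docker run` command that starts Triton with its ports exposed.

    `model_repository` and `image` are supplied by the caller; nothing here
    is taken from the project's own deployment scripts.
    """
    repo = Path(model_repository).resolve()
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", f"{http_port}:8000",      # HTTP/REST endpoint
        "-p", f"{grpc_port}:8001",      # gRPC endpoint
        "-p", f"{metrics_port}:8002",   # metrics endpoint
        "-v", f"{repo}:/models",        # mount the model repository
        image,
        "tritonserver", "--model-repository=/models",
    ]


if __name__ == "__main__":
    # The image tag is a placeholder; substitute the release you actually use.
    cmd = build_triton_command("./model_store",
                               "nvcr.io/nvidia/tritonserver:<xx.yy>-py3")
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the resulting list can be passed to `subprocess.run` once the image and paths have been filled in.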