This resource is a subproject of jasper_for_pytorch. Visit the parent project to download the code and get more information about the setup.
The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
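For illustration, a minimal client sketch is shown below. It submits a single request to the server's HTTP endpoint using the `tritonclient` Python package. The model name (`jasper-onnx`), the tensor names (`AUDIO_SIGNAL`, `NUM_SAMPLES`, `TRANSCRIPT`), and the shapes and data types are assumptions for illustration only; substitute the values from your deployed model's `config.pbtxt`.

```python
# Minimal sketch: send one inference request to Triton over HTTP.
# Model name, tensor names, shapes and dtypes below are placeholders;
# take the real values from the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

# One second of dummy 16 kHz audio; a real client would load a .wav file here.
audio = np.random.randn(1, 16000).astype(np.float32)
num_samples = np.array([[16000]], dtype=np.int32)

inputs = [
    triton_http.InferInput("AUDIO_SIGNAL", list(audio.shape), "FP32"),
    triton_http.InferInput("NUM_SAMPLES", list(num_samples.shape), "INT32"),
]
inputs[0].set_data_from_numpy(audio)
inputs[1].set_data_from_numpy(num_samples)

outputs = [triton_http.InferRequestedOutput("TRANSCRIPT")]

# The server queues the request, runs the model, and returns the result.
result = client.infer(model_name="jasper-onnx", inputs=inputs, outputs=outputs)
print(result.as_numpy("TRANSCRIPT"))
```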
This folder contains a detailed performance analysis as well as scripts for running Jasper inference with the Triton Inference Server.
A typical Triton Inference Server pipeline can be broken down into the following steps:

1. The client serializes the inference request into a message and sends it to the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
5. The request is removed from the queue and computed (Server Compute).
6. The completed request is serialized in a message and sent back to the client (Server Send).
7. The completed message travels over the network from the server to the client (Network).
8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
Generally, for local clients, steps 1-4 and 6-8 take up only a small fraction of the total time compared to step 5. Since backend deep learning systems like Jasper are rarely exposed directly to end users, but instead interface only with local front-end servers, we can assume for the purposes of this analysis that all clients are local.
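As a rough sanity check for a local client, the sketch below times the full client-side round trip (steps 1-8) over a number of requests and then queries Triton's statistics endpoint, whose per-stage counters (queue, compute input, compute infer, compute output) should show the compute stage (step 5) accounting for most of the latency. The model and tensor names are the same placeholders as in the previous example.

```python
# Rough sketch: compare client-side end-to-end latency (steps 1-8) with the
# server-side per-stage statistics Triton reports. Names are placeholders.
import time
import numpy as np
import tritonclient.http as triton_http

MODEL_NAME = "jasper-onnx"  # placeholder, use your deployed model's name
client = triton_http.InferenceServerClient(url="localhost:8000")

audio = np.random.randn(1, 16000).astype(np.float32)
inp = triton_http.InferInput("AUDIO_SIGNAL", list(audio.shape), "FP32")
inp.set_data_from_numpy(audio)
# A real Jasper deployment will typically also expect a length input,
# as in the previous example.

# Measure mean end-to-end latency over a few requests.
n_requests = 20
start = time.perf_counter()
for _ in range(n_requests):
    client.infer(model_name=MODEL_NAME, inputs=[inp])
elapsed = time.perf_counter() - start
print(f"mean end-to-end latency: {1000.0 * elapsed / n_requests:.1f} ms")

# Triton's statistics extension reports cumulative per-stage times in
# nanoseconds (queue, compute_input, compute_infer, compute_output); for a
# local client, compute_infer (step 5) should dominate the totals above.
stats = client.get_inference_statistics(model_name=MODEL_NAME)
print(stats)
```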
In this section, we will go over how to launch both the Triton Inference Server and the client, and how to find the deployment configuration that delivers the best performance for your specific application needs.
More information on how to perform inference using NVIDIA Triton Inference Server can be found in triton/README.md.