FastPitch Triton deployment for PyTorch


Description: Deploying high-performance inference for FastPitch using NVIDIA Triton Inference Server.

Publisher: NVIDIA
Use Case: Text To Speech
Framework: PyTorch
Latest Version: -
Modified: November 12, 2021
Compressed Size: 0 B

This resource is a subproject of fastpitch_for_pytorch. Visit the parent project to download the code and get more information about the setup.

Introduction

The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
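
As an illustration of the client side, the following sketch sends a single synchronous request to a running server using the tritonclient Python package. The model name ("fastpitch"), the input and output tensor names, and the token shape and dtype are assumptions made for this example; the actual names depend on how the model was exported and configured.

    # Minimal Triton HTTP client sketch (model/input/output names are assumed).
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Assumed: the deployed FastPitch model consumes a batch of token IDs.
    tokens = np.random.randint(0, 100, size=(1, 128), dtype=np.int64)
    infer_input = httpclient.InferInput("INPUT__0", list(tokens.shape), "INT64")
    infer_input.set_data_from_numpy(tokens)

    response = client.infer(model_name="fastpitch", inputs=[infer_input])
    mel = response.as_numpy("OUTPUT__0")  # assumed output tensor name
    print(mel.shape)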

This README provides step-by-step deployment instructions for models generated during training (as described in the model README), together with the corresponding deployment scripts that ensure optimal GPU utilization during inference on Triton Inference Server.

Deployment process

The deployment process consists of two steps:

  1. Conversion. The purpose of conversion is to find the best-performing model format supported by Triton Inference Server. Triton Inference Server uses a number of runtime backends, such as TensorRT, LibTorch, and ONNX Runtime, to support various model types. Refer to the Triton documentation for the list of available backends. An illustrative export sketch follows this list.
  2. Configuration. The model is configured for Triton Inference Server, which generates the necessary configuration files.
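
Purely as an illustration of the conversion step, the sketch below exports a PyTorch module to ONNX and places it in a Triton model repository layout. The stand-in module, tensor names, shapes, and the model_repository path are assumptions for this example; the actual conversion is performed by the scripts shipped with the parent project.

    # Hypothetical conversion sketch: export a PyTorch module to ONNX for Triton.
    import os
    import torch

    # Stand-in module; in practice this is the trained FastPitch checkpoint.
    class DummyAcousticModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.embedding = torch.nn.Embedding(100, 80)

        def forward(self, tokens):
            # Return a (batch, n_mel_channels, seq) tensor, FastPitch-like in shape only.
            return self.embedding(tokens).transpose(1, 2)

    model = DummyAcousticModel().eval()
    dummy_tokens = torch.randint(0, 100, (1, 128), dtype=torch.int64)

    os.makedirs("model_repository/fastpitch/1", exist_ok=True)
    torch.onnx.export(
        model,
        (dummy_tokens,),
        "model_repository/fastpitch/1/model.onnx",
        input_names=["INPUT__0"],
        output_names=["OUTPUT__0"],
        dynamic_axes={"INPUT__0": {0: "batch", 1: "seq"},
                      "OUTPUT__0": {0: "batch", 2: "seq"}},
        opset_version=13,
    )

The configuration step then writes a config.pbtxt next to the model version directory, describing the model's inputs, outputs, and batching behavior for Triton.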

To run benchmarks measuring model performance during inference, perform the following steps (an illustrative timing sketch follows the list):

  1. Start the Triton Inference Server.

    The Triton Inference Server runs in a separate (possibly remote) container, with its gRPC and HTTP/REST API ports exposed.

  2. Run accuracy tests.

    These tests produce results that are checked against the given accuracy thresholds. Refer to step 8 in the Quick Start Guide.

  3. Run performance tests.

    These tests produce latency and throughput results for offline (static batching) and online (dynamic batching) scenarios. Refer to step 10 in the Quick Start Guide.
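
The repository ships its own performance scripts; purely as an illustration of the offline (static batching) scenario, the loop below times repeated synchronous requests with the tritonclient package and derives rough latency and throughput figures. The model, input, and output names are the same assumptions as in the client sketch above, not values taken from this project.

    # Illustrative offline-latency loop (not the repository's benchmark scripts).
    import time
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    tokens = np.random.randint(0, 100, size=(1, 128), dtype=np.int64)
    infer_input = httpclient.InferInput("INPUT__0", list(tokens.shape), "INT64")
    infer_input.set_data_from_numpy(tokens)

    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        client.infer(model_name="fastpitch", inputs=[infer_input])
        latencies.append(time.perf_counter() - start)

    print(f"p50 latency: {1000 * np.percentile(latencies, 50):.1f} ms")
    print(f"throughput:  {len(latencies) / sum(latencies):.2f} inferences/s")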