
Triton Inference Server




Triton Inference Server Helm Chart

Compressed Size: 17.38 KB
April 4, 2023

> NOTE: Some versions of Google Kubernetes Engine (GKE) contain a
> regression in the handling of LD_LIBRARY_PATH that prevents the
> inference server container from running correctly. Use a GKE 1.13 or
> earlier version or a GKE 1.14.6 or later version to avoid this
> issue.

Simple helm chart for installing a single instance of the NVIDIA Triton Inference Server. This guide assumes you already have a functional Kubernetes cluster and helm installed (see below for instructions on installing helm). Your cluster must be configured with support for the NVIDIA driver and CUDA version required by the version of the inference server you are using.

The steps below describe how to set up a model repository, use helm to launch the inference server, and then send inference requests to the running server.

Model Repository

Triton Inference Server needs a repository of models that it will make available for inferencing. For this example you will place the model repository in a Google Cloud Storage bucket:

$ gsutil mb gs://triton-inference-server-repository

Follow the instructions to download the example model repository to your system, then copy it into the GCS bucket:

$ gsutil cp -r docs/examples/model_repository gs://triton-inference-server-repository/model_repository
GCS Permissions

Make sure the bucket permissions are set so that the inference server can access the model repository. If the bucket is public then no additional changes are needed and you can proceed to the "Running The Inference Server" section.

If bucket permissions need to be set with the GOOGLE_APPLICATION_CREDENTIALS environment variable then perform the following steps:

  • Generate a Google service account JSON key file with the proper permissions, named gcp-creds.json.

  • Create a Kubernetes configmap for the project ID and a secret from this file:

    $ kubectl create configmap gcpcreds --from-literal "project-id=myproject"
    $ kubectl create secret generic gcpcreds --from-file gcp-creds.json

  • Modify templates/deployment.yaml to include the GOOGLE_APPLICATION_CREDENTIALS environment variable:

        env:
          - name: GOOGLE_APPLICATION_CREDENTIALS
            value: /secret/gcp-creds.json

  • Modify templates/deployment.yaml to mount the secret in a volume at /secret:

        volumeMounts:
          - name: vsecret
            mountPath: "/secret"
            readOnly: true
        ...
        volumes:
          - name: vsecret
            secret:
              secretName: gcpcreds

Running The Inference Server

Once you have helm installed (see below if you need help installing helm) and your model repository ready, you can deploy the inference server using the default configuration with:

$ helm install .

You can use kubectl to wait until the inference server pod is running:

$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
wobbly-coral-triton-inference-server-5f74b55885-n6lt7   1/1     Running   0          2m21s

There are several ways of overriding the default configuration, as described in the helm documentation.

For example, you can edit the values.yaml file directly or you can use the --set option to override a single parameter with the CLI, for example:

$ helm install triton-inference-server --set image.imageName=""

You can also use a file by writing your own "values.yaml" file with the values you want to override and pass it to helm:

$ cat << EOF > config.yaml
namespace: MyCustomNamespace
image:
  modelRepositoryPath: gs://my_model_repository
EOF

$ helm install -f config.yaml triton-inference-server

Using the Triton Inference Server

Now that the inference server is running you can send HTTP or GRPC requests to it to perform inferencing. By default, the inferencing service is exposed with a LoadBalancer service type. Use the following to find the external IP for the inference server; it is reported in the EXTERNAL-IP column:

$ kubectl get services
NAME           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                                        AGE
inference-se   LoadBalancer   ...          ...           8000:31220/TCP,8001:32107/TCP,8002:31682/TCP   1m
kubernetes     ClusterIP      ...          ...           443/TCP                                        1h
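Once the load balancer is provisioned, the external IP can also be extracted programmatically from kubectl's JSON output. A minimal Python sketch; the sample document below is hypothetical, but `status.loadBalancer.ingress` is the field Kubernetes reports for a provisioned LoadBalancer service:

```python
import json
from typing import Optional

# Hypothetical sample shaped like `kubectl get service <name> -o json`;
# status.loadBalancer.ingress is populated once the load balancer exists.
SAMPLE = """
{
  "status": {
    "loadBalancer": {
      "ingress": [{"ip": "203.0.113.10"}]
    }
  }
}
"""

def external_ip(service_json: str) -> Optional[str]:
    """Return the first LoadBalancer ingress IP, or None if not yet assigned."""
    svc = json.loads(service_json)
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress", [])
    return ingress[0].get("ip") if ingress else None

print(external_ip(SAMPLE))  # 203.0.113.10
```

The same field can be pulled directly with `kubectl get service <name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`.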

The inference server exposes an HTTP endpoint on port 8000, a GRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. You can use curl to get the status of the inference server from the HTTP endpoint:

$ curl
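The same check can be scripted. A minimal Python sketch, assuming the server is reachable at the external IP on port 8000; the /v2/health/ready path applies to Triton's v2 HTTP API, while older releases exposed /api/status instead:

```python
import urllib.error
import urllib.request

def server_ready(host: str, port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if Triton's HTTP readiness endpoint answers with HTTP 200."""
    url = f"http://{host}:{port}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Replace "localhost" with the EXTERNAL-IP reported by `kubectl get services`.
print(server_ready("localhost"))
```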

Follow the instructions to get the example image classification client that can be used to perform inferencing using image classification models being served by the inference server. For example:

$ image_client -u -m resnet50_netdef -s INCEPTION -c3 mug.jpg
Output probabilities:
batch 0: 504 (COFFEE MUG) = 0.777365267277
batch 0: 968 (CUP) = 0.213909029961
batch 0: 967 (ESPRESSO) = 0.00294389552437
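The textual output above is easy to post-process, for example to pick the top prediction. A minimal Python sketch; the line format is taken from the example output shown, and image_client itself is unchanged:

```python
import re

# Example output copied from the image_client run above.
OUTPUT = """\
batch 0: 504 (COFFEE MUG) = 0.777365267277
batch 0: 968 (CUP) = 0.213909029961
batch 0: 967 (ESPRESSO) = 0.00294389552437
"""

LINE = re.compile(r"batch (\d+): (\d+) \((.+)\) = ([\d.eE+-]+)")

def parse_output(text: str):
    """Parse 'batch N: id (LABEL) = prob' lines into (class_id, label, prob) tuples."""
    return [
        (int(m.group(2)), m.group(3), float(m.group(4)))
        for m in LINE.finditer(text)
    ]

top_id, top_label, top_prob = parse_output(OUTPUT)[0]
print(top_label, round(top_prob, 3))  # COFFEE MUG 0.777
```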

Cleanup

Once you've finished using the inference server you should use helm to delete the deployment:

$ helm list
NAME            REVISION        UPDATED                         STATUS          CHART                           APP VERSION     NAMESPACE
wobbly-coral    1               Wed Feb 27 22:16:55 2019        DEPLOYED        triton-inference-server-1.0.0   1.0             default

$ helm delete wobbly-coral

You may also want to delete the GCS bucket you created to hold the model repository:

$ gsutil rm -r gs://triton-inference-server-repository

Installing Helm

The following steps from the official helm install guide will give you a quick setup. Note that these steps use Helm v2, which relies on the Tiller server-side component:

$ curl | bash
$ kubectl create serviceaccount -n kube-system tiller
serviceaccount/tiller created
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ helm init --service-account tiller --wait