
Kaldi ASR Integration With Triton Inference Server



Kaldi ASR custom backend for the NVIDIA Triton Inference Server.



Latest Version: November 12, 2021

Compressed Size: 1.21 MB

This repository provides a wrapper around the online GPU-accelerated ASR pipeline from the paper GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition. That work includes a high-performance implementation of a GPU HMM Decoder, a low-latency Neural Net driver, fast Feature Extraction for preprocessing, and new ASR pipelines tailored for GPUs. These different modules have been integrated into the Kaldi ASR framework.

This repository contains a Triton custom backend for the Kaldi ASR framework. This custom backend calls the high-performance online GPU pipeline from the Kaldi ASR framework. The Triton integration makes Kaldi ASR inference easy to use by providing a gRPC streaming server, dynamic sequence batching, and multi-instance support. A client connects to the gRPC server, streams audio by sending chunks to the server, and receives the inferred text as the answer (see Input/Output). More information about Triton Inference Server can be found here.
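The client-side streaming pattern described above can be sketched as follows. This is a minimal illustration of how a client might split an audio buffer into fixed-size chunks before sending them to the server; the chunk size (8000 samples, roughly 0.5 s at 16 kHz) and the last-chunk flag are illustrative assumptions, not part of the actual gRPC client API shipped with this repository.

```python
def stream_chunks(samples, chunk_size=8000):
    """Yield (chunk, is_last) pairs over an audio sample buffer.

    The final chunk may be shorter than chunk_size; a real client
    would mark it as the last chunk so the server can finalize the
    decoding lattice for that channel and return the full transcript.
    """
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        yield chunk, start + chunk_size >= len(samples)


# Example: a 20000-sample buffer is streamed as chunks of
# 8000, 8000, and 4000 samples, with only the final chunk flagged.
audio = [0.0] * 20000
chunks = list(stream_chunks(audio))
```

In the actual integration, each chunk would be sent over the gRPC stream together with a sequence (correlation) ID so that Triton's dynamic sequence batcher can route all chunks of one utterance to the same decoding channel.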

This Triton integration is meant to be used with the LibriSpeech model for demonstration purposes. We include a pre-trained version of this model to allow you to easily test this work (see Quick Start Guide). Both the Triton integration and the underlying Kaldi ASR online GPU pipeline are works in progress and will support more functionality in the future. Support for a custom Kaldi model is experimental (see Using a custom Kaldi model).

Reference model

A reference model is used by all test scripts and benchmarks presented in this repository to illustrate this solution. We are using the Kaldi ASR LibriSpeech recipe, available here. It was trained by NVIDIA and is delivered as a pre-trained model.

Default configuration

Details about parameters can be found in the Parameters section.

  • model path: Configured to use the pretrained LibriSpeech model.
  • use_tensor_cores: 1
  • main_q_capacity: 30000
  • aux_q_capacity: 400000
  • beam: 10
  • num_channels: 4000
  • lattice_beam: 7
  • max_active: 10000
  • frame_subsampling_factor: 3
  • acoustic_scale: 1.0
  • num_worker_threads: 40
  • max_batch_size: 400
  • instance_group.count: 1
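For orientation, the defaults above would typically be expressed in the model's Triton configuration file. The excerpt below is a hypothetical sketch following Triton's generic `config.pbtxt` conventions (custom parameters passed as string key/value pairs); the exact parameter names and layout in the repository's actual configuration file may differ.

```
# Hypothetical config.pbtxt excerpt illustrating the defaults above.
max_batch_size: 400
instance_group { count: 1 }
parameters { key: "beam" value { string_value: "10" } }
parameters { key: "lattice_beam" value { string_value: "7" } }
parameters { key: "max_active" value { string_value: "10000" } }
parameters { key: "frame_subsampling_factor" value { string_value: "3" } }
parameters { key: "acoustic_scale" value { string_value: "1.0" } }
parameters { key: "num_worker_threads" value { string_value: "40" } }
```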