This repository provides a wrapper around the online GPU-accelerated ASR pipeline from the paper GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition. That work includes a high-performance implementation of a GPU HMM Decoder, a low-latency Neural Net driver, fast Feature Extraction for preprocessing, and new ASR pipelines tailored for GPUs. These different modules have been integrated into the Kaldi ASR framework.
This repository contains a Triton custom backend for the Kaldi ASR framework. The custom backend calls the high-performance online GPU pipeline from the Kaldi ASR framework. This Triton integration makes Kaldi ASR inference easy to use by providing a gRPC streaming server, dynamic sequence batching, and multi-instance support. A client connects to the gRPC server, streams audio to the server in chunks, and receives the inferred text as an answer (see Input/Output). More information about Triton can be found here.
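The client-side streaming pattern described above can be sketched as follows. This is a minimal illustration, not the actual Triton client API: the chunk size is hypothetical, and the start/end flags stand in for the sequence-control signals a real streaming client would send so the server can batch and track each audio sequence.

```python
# Illustrative sketch of streaming audio to an ASR server in chunks.
# Chunk size and flag names are hypothetical, not taken from the
# actual Triton gRPC client library.

def chunk_audio(samples, chunk_size=8000):
    """Yield (chunk, is_first, is_last) tuples over a sequence of samples.

    is_first / is_last mimic the sequence start/end markers a streaming
    server needs to delimit one utterance within a batched stream.
    """
    n = len(samples)
    for start in range(0, n, chunk_size):
        chunk = samples[start:start + chunk_size]
        yield chunk, start == 0, start + chunk_size >= n

# Example: 3 seconds of 16 kHz audio split into 0.5 s chunks.
audio = [0.0] * 48000
chunks = list(chunk_audio(audio))
```

In a real client, each yielded chunk would be sent as one gRPC request on the open stream, and the final partial transcript would be read back once the last chunk is acknowledged.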
This Triton integration is meant to be used with the LibriSpeech model for demonstration purposes. We include a pre-trained version of this model to allow you to easily test this work (see Quick Start Guide). Both the Triton integration and the underlying Kaldi ASR online GPU pipeline are a work in progress and will support more functionalities in the future. Support for a custom Kaldi model is experimental (see Using a custom Kaldi model).
A reference model is used by all the test scripts and benchmarks presented in this repository to illustrate this solution. We use the Kaldi ASR LibriSpeech recipe, available here. It was trained by NVIDIA and is delivered as a pre-trained model.
Details about parameters can be found in the Parameters section.
model path: configured to use the pre-trained LibriSpeech model.
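As a sketch, a custom backend parameter such as the model path is typically passed through the model's Triton configuration. The key name and path below are illustrative assumptions; consult the configuration files shipped with this repository for the actual values.

```
# Hypothetical fragment of a Triton model config (config.pbtxt).
# Key name and path are illustrative, not the repository's actual settings.
parameters {
  key: "model_path"
  value {
    string_value: "/data/models/LibriSpeech/final.mdl"
  }
}
```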