Conformer-CTC (around 120M parameters) is trained on ASRSet, which contains over 16,500 hours of English (en-US) speech. The model transcribes speech into the lowercase English alphabet, along with spaces and apostrophes.
The Conformer-CTC model is a non-autoregressive variant of the Conformer model for automatic speech recognition (ASR) that uses CTC loss and decoding instead of a Transducer. For more details on this model, see: Conformer-CTC Model.
The primary intended use case for these models is automatic speech recognition.
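To make the CTC decoding step concrete, here is a minimal sketch of greedy (best-path) CTC decoding at the character level. It is an illustration, not the decoder shipped with the model: the label order in `VOCAB` and the position of the blank symbol (`BLANK_ID`, placed last here) are assumptions, and the real label set comes from the specification file.

```python
from itertools import groupby

# Assumed label set: lowercase letters, space, apostrophe.
# The blank index (last) is a hypothetical choice for this sketch.
VOCAB = list("abcdefghijklmnopqrstuvwxyz") + [" ", "'"]
BLANK_ID = len(VOCAB)


def ctc_greedy_decode(frame_ids):
    """Turn per-frame argmax label ids into text.

    Step 1: collapse runs of identical consecutive labels.
    Step 2: drop the blank symbol.
    """
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)
```

Because every frame is labelled independently and the output is obtained by a single collapse pass, no previous output token is fed back into the model, which is what makes this style of decoding non-autoregressive.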
Input: Single-channel audio files (WAV) with a 16kHz sample rate
Output: Transcripts, which are sequences of valid vocabulary labels as given by the specification file
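As a rough illustration of the input requirement, the following sketch checks whether a WAV file is single-channel with a 16 kHz sample rate, using only Python's standard `wave` module. The function name `matches_input_spec` is hypothetical and not part of the TAO Toolkit; the `wave` module reads uncompressed PCM WAV files only.

```python
import wave


def matches_input_spec(path, expected_rate=16000):
    """Return True if the WAV file at `path` is mono at `expected_rate` Hz."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == expected_rate
```

Files that fail this check would need to be downmixed or resampled with an audio tool of your choice before being passed to the model.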
Conformer is an end-to-end architecture that is trained using CTC loss. These model checkpoints are intended to be used with the Train Adapt Optimize (TAO) Toolkit. To use these checkpoints, you need a specification file (.yaml) that specifies the hyperparameters, the datasets for training and evaluation, and any other information needed for the experiment. For more information on the experiment spec files for each use case, please refer to the TAO Toolkit User Guide.
The models are encrypted with the key tlt_encode.
To fine-tune from a model checkpoint (.tlt), use the following command:
!tao speech_to_text_conformer finetune -e \ -m \ -g \ -r
The specification file passed to this command should be given as a valid path; the file specifies the fine-tuning hyperparameters, the dataset to fine-tune on, the dataset to evaluate on, and whether the vocabulary needs to change from the default (lowercase English letters, space, and apostrophe).
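If you keep the default vocabulary, the reference transcripts in your fine-tuning data need to use the same characters. The sketch below shows one simple way to map arbitrary text onto lowercase letters, spaces, and apostrophes. It is an assumption about a reasonable preprocessing step, not the normalization used to build ASRSet; the function name is hypothetical, and digits and punctuation are simply dropped rather than spelled out.

```python
import re


def normalize_transcript(text):
    """Map text onto the default vocabulary: a-z, space, apostrophe."""
    text = text.lower()
    # Treat the typographic apostrophe as a plain one (assumed convention).
    text = text.replace("\u2019", "'")
    # Replace every run of out-of-vocabulary characters with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse repeated spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()
```

In practice, numbers and abbreviations are usually expanded into words before a step like this, since dropping them changes what the model is asked to learn.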
To evaluate an existing dataset using a model checkpoint (.tlt), use the following command:
!tao speech_to_text_conformer evaluate -e \ -m \ -g \ -r
The specification file passed to this command should be given as a valid path; the file specifies the dataset being evaluated.
The model was trained on various proprietary and open-source datasets. These datasets include domain-specific data for various domains, as well as spontaneous speech and dialogue, all of which contribute to the model’s accuracy.
This model delivers a word error rate (WER) that is better than or comparable to popular alternative speech-to-text solutions across a range of domains and use cases.
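For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's hypothesis (substitutions, deletions, and insertions), divided by the number of reference words. The following is a minimal sketch of that definition; the function name is hypothetical, and it is not the evaluation code used to produce the results referred to here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference.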
Currently, TAO Conformer models only support training and inference on .wav audio files. All models included here were trained and evaluated on audio files with a sample rate of 16kHz, so for best performance you may need to upsample or downsample audio files to 16kHz.
In addition, the model performs best on audio samples longer than 0.1 seconds. For training and fine-tuning Conformer models, it is recommended that samples be capped at a maximum length of around 15 seconds, depending on the amount of memory available to you. You do not need to impose a maximum length for evaluation.
By downloading and using the models and resources packaged with TAO Conversational AI, you accept the terms of the Riva license.
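The duration guidance can be applied when assembling a training or fine-tuning set. The sketch below computes a WAV file's duration with Python's standard `wave` module and keeps only files within the suggested bounds. The function names are hypothetical, and the 15-second cap is the approximate figure quoted here, so treat it as a tunable default rather than a fixed limit.

```python
import wave

MIN_SECONDS = 0.1   # the model performs best above this length
MAX_SECONDS = 15.0  # approximate training cap; adjust to available memory


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def keep_for_training(path, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Return True if the clip is longer than min_s and at most max_s."""
    return min_s < wav_duration_seconds(path) <= max_s
```

For evaluation, the upper bound can be dropped by passing `max_s=float("inf")`.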
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.