Modular SpeechLLM [1] is a model that combines a pretrained audio encoder with a pretrained large language model (LLM), so that the LLM can perform speech-to-text tasks and answer questions about the input audio. The model is trained on several tasks, including ASR, AST, SpeechQA and AudioQA, using a total of about 32K hours of audio.
There are three main components of a modular SpeechLLM model:

- an audio encoder that turns the input audio into a sequence of audio embeddings;
- a modality adapter that maps the audio embeddings into the LLM embedding space;
- a pretrained LLM that consumes the adapted audio embeddings together with the text prompt and generates the text output.

Specifically, we use a 17-layer FastConformer [2] as the audio encoder, a 2-layer FastConformer as the modality adapter, and Llama-2-7b-chat [3] as the pretrained LLM, with LoRA [4] added to it. We freeze the original LLM parameters and tune everything else. The total number of parameters is around 7B, of which about 122M are trainable.
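To make the data flow concrete, here is a minimal PyTorch sketch of this modular structure. It is illustrative only and not the NeMo implementation: the `ModularSpeechLLMSketch` class, the generic Transformer stand-ins, and all layer sizes are placeholder assumptions; only the overall wiring (trainable encoder and adapter, frozen LLM, audio embeddings concatenated with the text prompt) follows the description above.

```python
# Minimal, illustrative sketch of the modular SpeechLLM wiring in plain PyTorch.
# All module types and sizes are placeholders; the real model uses FastConformer
# modules, Llama-2-7b-chat, and LoRA, and is implemented in NeMo.
import torch
import torch.nn as nn


class ModularSpeechLLMSketch(nn.Module):
    def __init__(self, d_audio=256, d_llm=512, vocab_size=32000):
        super().__init__()
        # Stand-in for the 17-layer FastConformer audio encoder (trainable).
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_audio, nhead=4, batch_first=True), num_layers=2
        )
        # Stand-in for the 2-layer FastConformer modality adapter (trainable):
        # maps audio features into the LLM embedding space.
        self.modality_adapter = nn.Sequential(
            nn.Linear(d_audio, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )
        # Stand-in for the pretrained LLM; frozen, as in the real model
        # (where only the added LoRA weights inside the LLM are trained).
        self.llm_embed = nn.Embedding(vocab_size, d_llm)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(d_llm, vocab_size)
        for module in (self.llm_embed, self.llm, self.lm_head):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, audio_feats, prompt_ids):
        # audio_feats: (batch, T_audio, d_audio); prompt_ids: (batch, T_text)
        audio_emb = self.modality_adapter(self.audio_encoder(audio_feats))
        text_emb = self.llm_embed(prompt_ids)
        # Audio embeddings are concatenated with the text prompt embeddings
        # and fed to the LLM, which generates the text answer.
        return self.lm_head(self.llm(torch.cat([audio_emb, text_emb], dim=1)))


model = ModularSpeechLLMSketch()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M / total: {total / 1e6:.1f}M")
```

In the real model, the frozen Llama-2 weights are augmented with trainable LoRA adapters, which, together with the encoder and adapter, accounts for the roughly 122M trainable parameters out of about 7B total.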
The model is implemented with NVIDIA NeMo toolkit [5], and can be trained with this example script and this base config.
The model is trained on the following datasets:
All results are obtained with greedy decoding.
ASR performance is evaluated by word error rate (WER, %):
Version | MCV-7.1-test | Librispeech-test-other | WSJ-eval |
---|---|---|---|
1.23.1 | 8.53 | 4.65 | 2.07 |
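For reference, WER between hypotheses and references can be computed with, for example, the `jiwer` package. This is an illustrative snippet with made-up sentences, not the evaluation code used to produce the table above.

```python
# Illustrative WER computation with jiwer (pip install jiwer).
import jiwer

references = ["the cat sat on the mat", "speech recognition is fun"]
hypotheses = ["the cat sat on a mat", "speech recognition is fun"]

# jiwer.wer accepts lists of reference/hypothesis strings and returns a fraction.
print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
```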
AST performance is evaluated by BLEU score on the FLEURS dataset. Note that the model was not trained on paired En->Es or En->Fr data, yet it is still able to perform zero-shot AST with decent quality.
Version | En->De | En->Es | En->Fr |
---|---|---|---|
1.23.1 | 27.41 | 16.97 | 25.79 |
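Corpus-level BLEU can be computed with, for example, `sacrebleu`; the exact scoring setup behind the numbers above is not specified here, so treat the following as an illustrative sketch with made-up sentences.

```python
# Illustrative BLEU computation with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["Das ist ein Test.", "Wie geht es dir?"]
# One reference stream, aligned item-by-item with the hypotheses.
references = [["Dies ist ein Test.", "Wie geht es dir?"]]

print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.2f}")
```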
SpeechQA performance is evaluated with ROUGE scores on the MS MARCO test set.
Version | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
1.23.1 | 64.79 | 50.41 | 63.14 |
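ROUGE-1/2/L can be computed with, for example, the `rouge_score` package; the scorer configuration used for the numbers above is an assumption, so this is only an illustrative sketch.

```python
# Illustrative ROUGE computation with rouge_score (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the quick brown fox jumps over the lazy dog",  # reference answer
    "a quick brown fox jumped over the lazy dog",   # model prediction
)
print({name: round(score.fmeasure, 3) for name, score in scores.items()})
```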
We evaluate on six representative tasks from the Dynamic-SUPERB leaderboard, using accuracy (%) as the metric.
Version | Audio | Content | Degradation | Paralinguistics | Semantics | Speaker |
---|---|---|---|---|---|---|
1.23.1 | 9.0 | 92.50 | 79.50 | 28.00 | 66.00 | 65.50 |
You'll need to prepare data in the NeMo manifest format, where each line is a Python dictionary with the following keys:
```python
{
    "audio_filepath": "path/to/audio.wav",
    "offset": 0.0,  # offset of the audio in seconds, this is an optional field
    "duration": 10.0,  # duration of the audio in seconds, can be set to `None` to load the whole audio
    "context": "what is the transcription of the audio?",  # text prompt for the audio, see below for more details
    "answer": "the transcription of the audio",  # optional for inference, defaults to "na" in the dataloader
}
```
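On disk the manifest is a JSON-lines file, so it can be produced with a few lines of Python; the file name, path, and prompt below are placeholders for illustration.

```python
# Illustrative helper for writing a NeMo-style manifest (one JSON object per line).
import json

records = [
    {
        "audio_filepath": "/data/audio/sample_0001.wav",  # placeholder path
        "offset": 0.0,
        "duration": 10.0,
        "context": "what is the transcription of the audio?",
        "answer": "na",  # "na" is fine for inference-only manifests
    },
]

with open("test_1.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```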
The script for running inference is modular_audio_gpt_eval.py, and the corresponding config file is modular_audio_gpt_config_eval.yaml.
If you want to load a pretrained SpeechLLM from the cloud, you can use the following script:
```bash
TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]"
TEST_NAMES="[test-1,test-2]"

CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \
    model.from_pretrained="speechllm_fc_llama2_7b" \
    model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \
    model.data.test_ds.names=$TEST_NAMES \
    model.data.test_ds.global_batch_size=8 \
    model.data.test_ds.micro_batch_size=8 \
    model.data.test_ds.tokens_to_generate=256 \
    ++inference.greedy=False \
    ++inference.top_k=50 \
    ++inference.top_p=0.95 \
    ++inference.temperature=0.4 \
    ++inference.repetition_penalty=1.2 \
    ++model.data.test_ds.output_dir="./test_outputs"
```
If you have a local `.nemo` file, you can replace the line `model.from_pretrained="speechllm_fc_llama2_7b"` in the example above with `model.restore_from_path=/path/to/model.nemo`.
The model takes single-channel audio sampled at 16 kHz, along with a text prompt, as input.
The model produces natural language text as output.
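If your audio is not already 16 kHz mono, one way to convert it is with `librosa` and `soundfile`; neither package is required by NeMo, and the file names below are placeholders, so treat this as an illustrative sketch.

```python
# Illustrative conversion of arbitrary audio to 16 kHz, single-channel WAV.
import librosa
import soundfile as sf


def to_16k_mono(src_path: str, dst_path: str) -> None:
    # librosa.load resamples to sr=16000 and downmixes to mono when mono=True.
    audio, sr = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, sr)


to_16k_mono("meeting_48k_stereo.wav", "meeting_16k_mono.wav")  # placeholder file names
```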
Although the model has some zero-shot generalization capabilities, it works best on the languages and tasks it was trained on, and may not work well on unseen languages or tasks.
[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[3] Llama-2-7b-chat
[4] LoRA: Low-Rank Adaptation of Large Language Models
[5] NVIDIA NeMo Toolkit, https://github.com/NVIDIA/NeMo