SpeechLLM FastConformer Llama2-7B

Description: A 7B SpeechLLM model trained on speech-to-text recognition (ASR), speech-to-text translation (AST), and audio/speech question answering (SpeechQA, AudioQA) data.
Publisher: NVIDIA
Latest Version: 1.23.1
Modified: April 26, 2024
Size: 25.81 GB

Model Overview

Modular SpeechLLM [1] is a model that combines a pretrained audio encoder with a pretrained large language model (LLM), so that the LLM can perform speech-to-text tasks and answer questions based on the input audio. The model is trained on several tasks, including ASR, AST, SpeechQA, and AudioQA, with a total of about 32K hours of audio.

Model Architecture

There are three main components of a modular SpeechLLM model:

  • An audio encoder that processes the input audio and produces a sequence of audio embeddings.
  • A modality adapter that processes the audio embeddings and produces a sequence of embeddings in the same latent space as the token embeddings of a pretrained large language model (LLM).
  • A pretrained large language model (LLM) that processes the embeddings from the modality adapter together with the token embeddings of the input prompt, and produces the text output. The audio embeddings and text token embeddings are concatenated along the time dimension before being fed into the LLM.

Specifically, we use a 17-layer FastConformer [2] as the audio encoder, a 2-layer FastConformer as the modality adapter, and Llama-2-7b-chat [3] as the pretrained LLM, to which we add LoRA [4]. We freeze the original LLM parameters and tune everything else. The total number of parameters is around 7B, of which about 122M are trainable.
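
The forward pass described above can be summarized with the following schematic PyTorch sketch. The module interfaces, tensor shapes, and the llm(inputs_embeds=...) call signature are illustrative assumptions, not the actual NeMo implementation.

import torch
import torch.nn as nn


def speechllm_forward(audio_encoder: nn.Module,
                      modality_adapter: nn.Module,
                      llm: nn.Module,
                      llm_embed: nn.Embedding,
                      audio_signal: torch.Tensor,       # (B, T_audio) raw waveform
                      prompt_token_ids: torch.Tensor):  # (B, T_text) tokenized prompt
    # 1) Audio encoder: waveform -> sequence of audio embeddings
    audio_emb = audio_encoder(audio_signal)              # (B, T_enc, D_audio)

    # 2) Modality adapter: project audio embeddings into the LLM embedding space
    adapted_emb = modality_adapter(audio_emb)             # (B, T_enc, D_llm)

    # 3) Concatenate audio embeddings and prompt token embeddings along the
    #    time dimension, then feed the combined sequence to the (frozen,
    #    LoRA-augmented) LLM to produce text output.
    text_emb = llm_embed(prompt_token_ids)                 # (B, T_text, D_llm)
    inputs_embeds = torch.cat([adapted_emb, text_emb], dim=1)
    return llm(inputs_embeds=inputs_embeds)                # logits over output tokens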

Training

The model is implemented with the NVIDIA NeMo toolkit [5] and can be trained with the example SpeechLLM training script and base config provided in NeMo.

Datasets

The model is trained on a mixture of ASR, AST, SpeechQA, and AudioQA datasets, totaling about 32K hours of audio.

Performance

All results are obtained with greedy decoding.

Speech-to-Text Recognition (ASR)

The ASR performance is evaluated by word error rate (WER %):

Version    MCV-7.1-test    Librispeech-test-other    WSJ-eval
1.23.1     8.53            4.65                      2.07
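
As an aside (not part of the model card's evaluation pipeline), WER for a reference/hypothesis pair can be computed with the jiwer package; the strings below are made-up examples.

# Minimal WER sketch using the jiwer package (pip install jiwer).
# The reference/hypothesis strings are made-up examples.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.2f}%")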

Speech-to-Text Translation (AST)

AST performance is evaluated by BLEU score on the FLEURS dataset. Note that the model was not trained on paired En->Es or En->Fr data, yet it still performs zero-shot AST on these directions with decent quality.

Version    En->De    En->Es    En->Fr
1.23.1     27.41     16.97     25.79
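
Similarly, corpus-level BLEU can be computed with the sacrebleu package; this is only an illustration of the metric, not the evaluation pipeline used for the table above, and the sentences are placeholders.

# Minimal corpus BLEU sketch using sacrebleu (pip install sacrebleu).
# The hypothesis/reference sentences are made-up examples.
import sacrebleu

hypotheses = ["der schnelle braune Fuchs springt ueber den faulen Hund"]
references = [["der schnelle braune Fuchs springt ueber den faulen Hund"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")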

SpeechQA

SpeechQA performance is evaluated with ROUGE scores on the MS MARCO test set.

Version    ROUGE-1    ROUGE-2    ROUGE-L
1.23.1     64.79      50.41      63.14
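
For reference (again, not the model card's own evaluation code), ROUGE F1 scores of this kind can be computed with the rouge-score package; the strings are placeholders.

# Minimal ROUGE sketch using the rouge-score package (pip install rouge-score).
# The reference/hypothesis strings are made-up examples.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "a reference answer to the question",   # reference (target)
    "the model's generated answer",         # hypothesis (prediction)
)
for name, s in scores.items():
    print(f"{name}: F1 = {100 * s.fmeasure:.2f}")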

Multi-task Audio Understanding

We evaluate on six representative tasks from the DynamicSUPERB leaderboard, using accuracy (%) as the metric.

Version    Audio    Content    Degradation    Paralinguistics    Semantics    Speaker
1.23.1     9.0      92.50      79.50          28.00              66.00        65.50

How to Use this Model

Input Format

You'll need to prepare data in the NeMo manifest format, where each line is a JSON dictionary. The example below uses Python-style comments to explain each key; a real manifest line must be valid JSON without comments:

{
    "audio_filepath": "path/to/audio.wav",
    "offset": 0.0,  # offset of the audio in seconds; this is an optional field
    "duration": 10.0,  # duration of the audio in seconds; can be set to null (None) to load the whole audio
    "context": "what is the transcription of the audio?",  # text prompt for the audio, see below for more details
    "answer": "the transcription of the audio",  # optional for inference, defaults to "na" in the dataloader
}
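
A manifest like the one above can also be generated programmatically. The sketch below is a hypothetical helper; the file paths, prompt, and answer are placeholders.

# Hypothetical helper that writes a NeMo-style manifest: one JSON object per line.
import json

records = [
    {
        "audio_filepath": "path/to/audio.wav",
        "offset": 0.0,
        "duration": 10.0,
        "context": "what is the transcription of the audio?",
        "answer": "na",  # optional at inference time
    },
]

with open("test_manifest.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")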

Inference with SpeechLLM

The script you need to perform inference is modular_audio_gpt_eval.py, and the corresponding config file is modular_audio_gpt_config_eval.yaml.

If you want to load a pretrained SpeechLLM from the cloud, you can use the following script:

TEST_MANIFESTS="[/data/test_1.json,/data/test_2.json]"
TEST_NAMES="[test-1,test-2]"
CUDA_VISIBLE_DEVICES=0 python modular_audio_gpt_eval.py \
    model.from_pretrained="speechllm_fc_llama2_7b" \
    model.data.test_ds.manifest_filepath=$TEST_MANIFESTS \
    model.data.test_ds.names=$TEST_NAMES \
    model.data.test_ds.global_batch_size=8 \
    model.data.test_ds.micro_batch_size=8 \
    model.data.test_ds.tokens_to_generate=256 \
    ++inference.greedy=False \
    ++inference.top_k=50 \
    ++inference.top_p=0.95 \
    ++inference.temperature=0.4 \
    ++inference.repetition_penalty=1.2 \
    ++model.data.test_ds.output_dir="./test_outputs"

If you have a local .nemo file, you can set model.restore_from_path=/path/to/model.nemo in place of model.from_pretrained="speechllm_fc_llama2_7b" in the example above.

Input

The model takes single-channel (mono) audio sampled at 16000 Hz, together with text prompts, as input.
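
If your audio is not already 16 kHz mono, a common preprocessing step (not prescribed by the model card) is to resample and downmix it first, for example with librosa and soundfile; the file names below are placeholders.

# Convert an arbitrary audio file to 16 kHz mono WAV before building the manifest.
# Requires: pip install librosa soundfile. File names are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)  # resample + downmix
sf.write("audio_16k_mono.wav", audio, sr)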

Output

The model produces natural language text output.

Limitations

Although the model has some zero-shot generalization capability, it works best on the languages and tasks it was trained on, and might not work well on unseen languages or tasks.

References

[1] SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[3] Llama-2-7b-chat

[4] LoRA: Low-Rank Adaptation of Large Language Models

[5] NVIDIA NeMo Toolkit