This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is a flattened string representation of the semantics annotation. The model is trained on the SLURP dataset.
The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model and the decoder is a three-layer Transformer decoder. We use a Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token triggers the generation process.
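The teacher-forcing objective above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the NeMo implementation: at each decoder step the gold previous token (starting with BOS) is fed in, and the loss is the mean negative log-probability of the gold next token, ignoring padding.

```python
import numpy as np

def teacher_forcing_nll(logits, targets, pad_id=0):
    """Negative log-likelihood with teacher forcing.

    logits:  (T, V) decoder outputs, one step per target position;
             at step t the decoder was fed the gold token t-1 (BOS first).
    targets: (T,) gold token IDs, ending in EOS, padded with pad_id.
    """
    # log-softmax over the vocabulary axis (numerically stable)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    mask = targets != pad_id                      # ignore padding positions
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * mask).sum() / mask.sum()
```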
The tokenizers for these models were built using the semantics annotations of the train set with this script. We use a vocabulary size of 58, including the BOS, EOS and padding tokens.
The model is trained on the combined real and synthetic training sets of the SLURP dataset.
| Version | Model | Params (M) | Pretrained Encoder Data | Intent Acc. (Scenario_Action) | Entity Prec. | Entity Recall | Entity F1 | SLURP Prec. | SLURP Recall | SLURP F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
Note: during inference, we use a beam size of 32 and a temperature of 1.25.
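The temperature parameter rescales the decoder's logits before they are turned into log-probabilities for scoring beam hypotheses. A minimal sketch of that rescaling (an illustration of the general technique, not NeMo's sequence-generator code):

```python
import numpy as np

def temperature_log_probs(logits, temperature=1.25):
    """Temperature-scaled log-probabilities for scoring beam hypotheses.

    temperature > 1 flattens the distribution, keeping more diverse
    hypotheses alive in the beam; temperature < 1 sharpens it toward
    greedy behaviour.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                 # numerical stability
    return scaled - np.log(np.exp(scaled).sum())
```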
The model is available for use in the NeMo toolkit, and can be used on another dataset with the same annotation format.
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(
    model_name="slu_conformer_transformer_large_slurp"
)
```
```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```
This model accepts 16000 Hz Mono-channel Audio (wav files) as input.
This model provides the intent and slot annotations as a string for a given audio sample.
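Assuming the output string follows the SLURP-style dict layout with `scenario`, `action`, and `entities` fields, it can be parsed back into structured form with a sketch like this (verify against your model's actual output before relying on it):

```python
import ast

def parse_slu_output(flat):
    """Parse a flattened semantics string back into (intent, slots).

    Assumed example format:
    "{'scenario': 'alarm', 'action': 'set',
      'entities': [{'type': 'time', 'filler': 'five am'}]}"
    """
    semantics = ast.literal_eval(flat)
    # SLURP intents are scenario_action pairs
    intent = f"{semantics['scenario']}_{semantics['action']}"
    slots = {e["type"]: e["filler"] for e in semantics.get("entities", [])}
    return intent, slots
```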
Since this model was trained only on the SLURP dataset, its performance may degrade on other datasets.