This model performs joint intent classification and slot filling directly from audio input. It treats the task as an audio-to-text problem, where the output text is the flattened string representation of the semantic annotation. The model is trained on the SLURP dataset.
The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model and the decoder is a three-layer Transformer decoder. We use the Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
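The teacher-forcing setup described above can be illustrated with a minimal sketch. It is not the NeMo implementation: the token strings, the toy per-step probabilities, and the helper names `teacher_forcing_pair` and `negative_log_likelihood` are all assumptions made for illustration. The decoder input is the target sequence shifted right by a BOS token, and the loss is the mean negative log-probability assigned to each gold next token.

```python
import math

BOS, EOS = "<s>", "</s>"  # hypothetical token strings, not the model's actual symbols


def teacher_forcing_pair(tokens):
    """Build (decoder_input, target) for one annotation token sequence."""
    decoder_input = [BOS] + tokens  # gold history fed to the decoder
    target = tokens + [EOS]         # gold next token at every position
    return decoder_input, target


def negative_log_likelihood(step_probs, target):
    """Mean NLL of the target; step_probs[t] maps token -> probability at step t."""
    total = -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(target))
    return total / len(target)


tokens = ["alarm", "set"]
dec_in, tgt = teacher_forcing_pair(tokens)

# Toy decoder outputs (made up): each step gives the gold token probability 0.5.
step_probs = [
    {"alarm": 0.5, "set": 0.25, EOS: 0.25},
    {"alarm": 0.25, "set": 0.5, EOS: 0.25},
    {"alarm": 0.25, "set": 0.25, EOS: 0.5},
]
loss = negative_log_likelihood(step_probs, tgt)
print(dec_in, tgt, round(loss, 4))
```

At inference time there is no gold history, which is why generation starts from the BOS token alone and each predicted token is fed back as the next input.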
The NeMo toolkit was used to train the models for around 100 epochs. These models are trained with this example script and this base config.
The tokenizers for these models were built using the semantics annotations of the train set with this script. We use a vocabulary size of 58, including the BOS, EOS and padding tokens.
The model is trained on the combined real and synthetic training sets of the SLURP dataset.
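|Version|Model|Params (M)|Pretrained|Intent (Scenario_Action) Accuracy|Entity Precision|Entity Recall|Entity F1|SLURP Metrics Precision|SLURP Metrics Recall|SLURP Metrics F1|
|---|---|---|---|---|---|---|---|---|---|---|
|1.13.0|Conformer-Transformer-Large|127|NeMo ASR-Set 3.0|90.14|78.95|74.93|76.89|84.31|80.33|82.27|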
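As a quick consistency check on the results row, the sketch below assumes that the three numbers under each of Entity and SLURP Metrics are precision, recall, and F1, in that order. Under that reading, the harmonic mean of the first two numbers should reproduce the third. The function name `f1_score` is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


entity_f1 = f1_score(78.95, 74.93)  # Entity precision and recall from the table
slurp_f1 = f1_score(84.31, 80.33)   # SLURP Metrics precision and recall from the table
print(f"{entity_f1:.2f} {slurp_f1:.2f}")  # prints "76.89 82.27"
```

Both values agree with the reported F1 figures, which supports the assumed column order.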
Note: during inference, we use a beam size of 32 and a temperature of 1.25.
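To make the roles of beam size and temperature concrete, here is a minimal, library-free sketch of beam search with temperature-scaled log-softmax. It assumes the common convention of dividing the logits by the temperature before normalizing. Whether NeMo applies the temperature in exactly this way is not stated here, and the callback `step_logits`, the toy vocabulary, and the absence of length normalization are simplifications for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax; temperatures above 1 flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def beam_search(step_logits, bos, eos, beam_size=32, temperature=1.25, max_len=20):
    """step_logits(prefix) -> logits over the vocabulary (hypothetical callback)."""
    beams = [([bos], 0.0)]  # (token ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = log_softmax(step_logits(prefix), temperature)
            for tok, lp in enumerate(log_probs):
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))  # hypothesis is complete
            else:
                beams.append((seq, score))     # keep expanding
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


def toy_step_logits(prefix):
    # Toy vocabulary: 0=BOS, 1=EOS, 2="a", 3="b". Prefers "a", then "b", then EOS.
    table = {1: [-9.0, -9.0, 3.0, 0.0], 2: [-9.0, -9.0, 0.0, 3.0]}
    return table.get(len(prefix), [-9.0, 3.0, 0.0, 0.0])


best = beam_search(toy_step_logits, bos=0, eos=1, beam_size=2)
print(best)
```

A larger beam keeps more partial hypotheses alive at each step, and a temperature above 1 narrows the score gap between them, so lower-ranked continuations are less likely to be pruned early.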
The model is available for use in the NeMo toolkit and can be used on another dataset with the same annotation format.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
This model accepts 16000 Hz mono-channel audio (WAV files) as input.
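Before running inference, it can help to confirm that a file matches this input format. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files. The helper name `check_wav` is an assumption for illustration and is not part of NeMo.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

For example, `check_wav("sample.wav")` would return `(True, 16000, 1)` for a conforming file. Files that fail the check would need to be resampled or downmixed with an audio tool before being passed to the model.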
This model provides the intent and slot annotations as a string for a given audio sample.
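Because the output is a flat string, downstream code usually needs to turn it back into structured data. The sketch below assumes the string is a Python-style dictionary literal with `scenario`, `action`, and `entities` keys, where each entity has a `type` and a `filler`. That layout is an assumption based on the SLURP annotation scheme and should be checked against actual model output. The helper `parse_semantics` is hypothetical. Since a generated string can be truncated or malformed, the parser falls back to an empty result instead of raising.

```python
import ast


def parse_semantics(prediction):
    """Parse a flattened annotation string into (intent, slots).

    Returns (None, []) when the string is not a well-formed dictionary literal.
    """
    try:
        parsed = ast.literal_eval(prediction)
    except (ValueError, SyntaxError, TypeError):
        return None, []
    if not isinstance(parsed, dict):
        return None, []
    # Intent is reported as Scenario_Action.
    intent = f"{parsed.get('scenario')}_{parsed.get('action')}"
    entities = parsed.get("entities")
    if not isinstance(entities, list):
        entities = []
    slots = [(e.get("type"), e.get("filler")) for e in entities if isinstance(e, dict)]
    return intent, slots


# Assumed example of a model output string (not taken from a real prediction).
example = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'seven am'}]}"
print(parse_semantics(example))
```

`ast.literal_eval` is used rather than `eval` so that only literals are accepted and the model's output string is never executed as code.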
Since this model was trained only on the SLURP dataset, its performance might degrade on other datasets.
 SLURP: A Spoken Language Understanding Resource Package
 Conformer: Convolution-augmented Transformer for Speech Recognition