Intent Recognition in Speech-to-Text Processing in the Context of Natural Interaction with Cognitive Assistive Systems
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This study investigates efficient German-language speech-to-intent recognition for human–robot interaction in elderly-care environments, targeting deployment on resource-constrained platforms such as the NVIDIA Jetson AGX Orin. To benchmark performance, we created a domain-specific German dataset with two sub-datasets (PaSID and PaSynTex) that simulate specific nursing-home communication scenarios. Two alternative speech-to-intent pipelines were developed and evaluated: a two-stage system combining automatic speech recognition (ASR) with a large language model (LLM), and an end-to-end large audio–language model (LALM) architecture. Whisper-based ASR systems were evaluated in combination with a wide variety of LLMs and compared against several LALMs in terms of intent-classification accuracy, latency, and resource efficiency. The results indicate that optimized ASR + LLM configurations, particularly Whisper Turbo coupled with Phi-3.5-mini or Qwen 2.5-7B, outperform unified LALM approaches while incurring substantially lower memory and inference costs. The analysis also shows that a unified LALM outperforms the two-step ASR + LLM integration when the same base model is used, but at the cost of higher resource utilization, likely due to limited optimization for edge deployment. Overall, the findings provide initial evidence that modular ASR + LLM pipelines are the more practical solution for real-time, on-device German intent recognition in assistive robotics, offering an effective trade-off between performance and deployability on resource-constrained platforms.
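The two-stage pipeline described in the abstract can be sketched as a simple composition of an ASR stage and an intent-classification stage. The sketch below is purely illustrative: `transcribe` and `classify_intent` are hypothetical stubs standing in for Whisper Turbo and an instruction-tuned LLM (e.g. Phi-3.5-mini), and the intent labels are invented examples, not the paper's actual label set.

```python
# Minimal sketch of a modular ASR + LLM speech-to-intent pipeline.
# All names and intents below are illustrative assumptions; the real
# system would call Whisper (ASR) and an LLM (intent classification).

INTENTS = ["bring_water", "call_nurse", "adjust_bed", "unknown"]

def transcribe(audio: bytes) -> str:
    """Stage 1 (ASR): map raw audio to a German transcript.
    Stub: returns a fixed utterance for illustration."""
    return "Bitte rufen Sie die Pflegekraft."

def classify_intent(transcript: str) -> str:
    """Stage 2 (intent classification): map the transcript to a known intent.
    Stub: keyword matching stands in for LLM prompting."""
    keywords = {
        "wasser": "bring_water",
        "pflegekraft": "call_nurse",
        "bett": "adjust_bed",
    }
    text = transcript.lower()
    for kw, intent in keywords.items():
        if kw in text:
            return intent
    return "unknown"

def speech_to_intent(audio: bytes) -> str:
    """Modular pipeline: the ASR transcript feeds the intent classifier."""
    return classify_intent(transcribe(audio))

print(speech_to_intent(b""))  # -> call_nurse
```

A unified LALM, by contrast, would replace both stages with a single model mapping audio directly to an intent, trading the modular pipeline's lower memory footprint for end-to-end processing.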