More than "Oh": Grounding Observable Events with Grunts in Multimodal Dialogue
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Conversational grunts (minimal vocalizations like oh, mm-hm, and uh-huh) ground information and coordinate understanding in human dialogue, yet computational systems typically treat them as noise rather than as meaningful communicative acts. We present a systematic annotation and analysis of 497 grunts across three hours of multimodal collaborative tasks, introducing an annotation scheme that captures grunt tokens, their antecedents, and their dialogue act functions. Our analysis reveals that grunts respond to speech and to observable events at nearly equal rates, demonstrating that non-verbal events function as conversational contributions that require acknowledgment. Tokens exhibit functional specialization: mm-hm predominantly acknowledges speech, while oh preferentially acknowledges events. Prosodic analysis shows that speakers systematically modulate duration and pitch according to antecedent type, with event responses typically longer and exhibiting greater pitch range. These findings have implications for dialogue state tracking, multimodal grounding, and turn-taking in conversational AI systems.
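To make the three annotation dimensions named above concrete, here is a minimal sketch of how a single annotated grunt could be represented. All names (GruntAnnotation, Antecedent, DialogueActFunction) and the example label inventories and numeric values are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical representation of one record in a grunt annotation scheme
# covering the three dimensions the abstract names: the grunt token, its
# antecedent, and its dialogue act function. Names and values are
# illustrative assumptions, not the authors' released schema.
from dataclasses import dataclass
from enum import Enum


class Antecedent(Enum):
    SPEECH = "speech"            # grunt responds to a prior utterance
    EVENT = "observable_event"   # grunt responds to a non-verbal event


class DialogueActFunction(Enum):
    ACKNOWLEDGE = "acknowledge"
    CONTINUER = "continuer"                  # backchannel signaling "go on"
    CHANGE_OF_STATE = "change_of_state"      # e.g., "oh" marking new information


@dataclass
class GruntAnnotation:
    token: str                   # surface form: "oh", "mm-hm", "uh-huh", ...
    antecedent: Antecedent       # what the grunt responds to
    function: DialogueActFunction
    duration_ms: float           # prosodic features the abstract reports as
    pitch_range_hz: float        # varying systematically by antecedent type


# Two example records reflecting the reported specialization: "mm-hm"
# predominantly acknowledges speech, while "oh" preferentially acknowledges
# events (with longer duration and greater pitch range).
examples = [
    GruntAnnotation("mm-hm", Antecedent.SPEECH,
                    DialogueActFunction.ACKNOWLEDGE, 310.0, 25.0),
    GruntAnnotation("oh", Antecedent.EVENT,
                    DialogueActFunction.CHANGE_OF_STATE, 420.0, 60.0),
]
```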