Back to Main Conference 2024
LREC-COLING 2024main

Multimodal Language Models Show Evidence of Embodied Simulation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5c776qzdo44r

Abstract

Multimodal large language models (MLLMs) are gaining popularity as partial solutions to the “symbol grounding problem” faced by language models trained on text alone. However, little is known about whether and how these multiple modalities are integrated. We draw inspiration from analogous work in human psycholinguistics on embodied simulation, i.e., the hypothesis that language comprehension is grounded in sensorimotor representations. We show that MLLMs are sensitive to implicit visual features like object shape (e.g., “The egg was in the skillet” implies a frying egg rather than one in a shell). This suggests that MLLMs activate implicit information about object shape when it is implied by a verbal description of an event. We find mixed results for color and orientation, and rule out the possibility that this is due to models’ insensitivity to those features in our dataset overall. We suggest that both human psycholinguistics and computational models of language could benefit from cross-pollination, e.g., with the potential to establish whether grounded representations play a functional role in language processing.

Details

Paper ID
lrec2024-main-1041
Pages
pp. 11928-11933
BibKey
jones-trott-2024-multimodal
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • CJ

    Cameron R. Jones

  • ST

    Sean Trott

Links