LREC-COLING 2024 Workshop

Towards Multi-Modal Co-Reference Resolution in Conversational Shopping Agents

Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024

DOI: 10.63317/36jn56p2d355

Abstract

The context of modern smart voice assistants is often multi-modal: users consume images, audio, and video content simultaneously. In such a setup, co-reference resolution is especially challenging, as references run across modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human in a shopping use case. We propose a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach achieves a 4.9% absolute F1 improvement over a cross-attention baseline while reducing the number of trained parameters by 4×.
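For intuition, the sketch below illustrates one common way to make multi-modal embedding alignment parameter-efficient: a small trainable adapter maps pre-computed image embeddings into the token-embedding space of a frozen LLM, so far fewer parameters are trained than in a full cross-attention stack. This is a minimal PyTorch illustration of the general pattern, not the paper's Parameter Augmented LLM; all module names and dimensions (ImageToTokenAdapter, d_image, d_model, n_tokens) are illustrative assumptions.

```python
# Minimal sketch of parameter-efficient multi-modal embedding alignment.
# NOTE: this is NOT the paper's Parameter Augmented LLM architecture; it
# only illustrates training a small adapter that projects image embeddings
# into a frozen LLM's token-embedding space.
import torch
import torch.nn as nn

class ImageToTokenAdapter(nn.Module):
    """Projects a pre-computed image embedding into a short sequence of
    pseudo-token embeddings in the LLM's hidden space.

    Only this adapter is trained; the LLM stays frozen, which keeps the
    trained-parameter count small compared to a cross-attention stack.
    """
    def __init__(self, d_image: int, d_model: int, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.d_model = d_model
        self.proj = nn.Sequential(
            nn.Linear(d_image, d_model * n_tokens),
            nn.GELU(),
            nn.Linear(d_model * n_tokens, d_model * n_tokens),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (batch, d_image) -> (batch, n_tokens, d_model)
        out = self.proj(image_emb)
        return out.view(image_emb.size(0), self.n_tokens, self.d_model)

def interleave(text_emb: torch.Tensor, image_tokens: torch.Tensor,
               image_pos: int) -> torch.Tensor:
    """Splice image pseudo-tokens into the text embedding sequence at the
    position where the image appeared in the dialogue turn."""
    before, after = text_emb[:, :image_pos], text_emb[:, image_pos:]
    return torch.cat([before, image_tokens, after], dim=1)

if __name__ == "__main__":
    d_image, d_model = 512, 768            # assumed CLIP-sized image emb / LLM hidden size
    adapter = ImageToTokenAdapter(d_image, d_model)
    image_emb = torch.randn(2, d_image)    # batch of 2 image embeddings
    text_emb = torch.randn(2, 10, d_model) # 10 text-token embeddings per dialogue
    fused = interleave(text_emb, adapter(image_emb), image_pos=5)
    print(fused.shape)  # torch.Size([2, 14, 768]) -> input for a frozen LLM
```

Under this pattern, only the adapter's two linear layers receive gradients, which is the usual route to the kind of trained-parameter reduction the abstract reports relative to a cross-attention baseline.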

Details

Paper ID
lrec2024-ws-ecnlp-02
Pages
pp. 8-18
BibKey
osebe-etal-2024-towards
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Seventh Workshop on e-Commerce and NLP @ LREC-COLING 2024
Location
Torino, Italia
Date
20–25 May 2024

Authors

  • Samuel Osebe
  • Prashan Wanigasekara
  • Thomas Gueudre
  • Thanh Tran
  • Rahul Sharma
  • Fan Yang
  • Qian Hu
  • Weitong Ruan
  • Emre Barut
  • Chengwei Su
