LREC-COLING 2024, main conference

High-Order Semantic Alignment for Unsupervised Fine-Grained Image-Text Retrieval

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5pf2824yp2u4

Abstract

Cross-modal retrieval is an important yet challenging task because of the semantic discrepancy between visual content and language. To measure the correlation between images and text, most existing research focuses on learning either global or local correspondence and fails to explore fine-grained local-global alignment. To infer more accurate similarity scores, we introduce a novel High-Order Semantic Alignment (HOSA) model that provides complementary and comprehensive semantic clues. Specifically, to jointly learn global and local alignment and emphasize local-global interaction, we employ the tensor-product (t-product) operation to reconstruct one modality's representation from another modality's information in a common semantic space. Such a cross-modal reconstruction strategy can significantly enhance inter-modal correlation learning in a fine-grained manner. Extensive experiments on two benchmark datasets validate that our model significantly outperforms several state-of-the-art baselines, especially in retrieving the most relevant results.
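
The abstract names the t-product but does not define it. As a rough illustration only, the sketch below follows the standard tensor-algebra definition, in which the t-product of two third-order tensors is a circular convolution over their frontal slices with matrix multiplication in place of scalar multiplication. It is not the HOSA model: how image and text features are arranged into tensors and how the reconstruction is trained are not stated here, and the function names (`t_product`, `matmul`, `matadd`) and the list-of-slices layout are assumptions made for this example.

```python
def matmul(X, Y):
    """Plain matrix product of two matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def matadd(X, Y):
    """Element-wise sum of two equally shaped nested-list matrices."""
    return [[x + y for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(X, Y)]


def t_product(A, B):
    """t-product of two third-order tensors stored as lists of frontal slices.

    A holds n3 slices of shape (n1, n2); B holds n3 slices of shape (n2, n4).
    Slice k of the result is the circular convolution over the slice index:
        C[k] = sum over j of A[(k - j) mod n3] @ B[j]
    """
    n3 = len(A)
    if len(B) != n3:
        raise ValueError("A and B need the same number of frontal slices")
    n1, n4 = len(A[0]), len(B[0][0])
    C = []
    for k in range(n3):
        acc = [[0] * n4 for _ in range(n1)]  # running sum for slice k
        for j in range(n3):
            acc = matadd(acc, matmul(A[(k - j) % n3], B[j]))
        C.append(acc)
    return C


# Two slices of 1x1 matrices reduce to circular convolution of [1, 2] and [3, 4].
print(t_product([[[1]], [[2]]], [[[3]], [[4]]]))  # [[[11]], [[10]]]
```

With a single frontal slice the operation reduces to an ordinary matrix product. Practical implementations usually compute the same result with a Fourier transform along the third mode, which this sketch omits for readability.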

Details

Paper ID
lrec2024-main-0714
Pages
pp. 8155-8165
BibKey
gao-etal-2024-high
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Rui Gao

  • Miaomiao Cheng

  • Xu Han

  • Wei Song
