
Entity Image and Mixed-Modal Image Retrieval Datasets

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2fnaa4f79qa5

Abstract

Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark’s utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations.
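The two MMIR query types described above (an entity image plus descriptive text, or several entity images plus relational text) can be illustrated with a toy late-fusion retriever. This is only a sketch: mean-pooling and cosine scoring are illustrative choices, not the paper's method, and all embeddings here are random placeholders rather than real image or text features.

```python
import numpy as np

def fuse_query(entity_embs, text_emb):
    """Fuse one or more entity-image embeddings with a text embedding.

    Mean-pools the entity images, averages with the text vector, and
    L2-normalizes the result. (Illustrative fusion, not the paper's.)
    """
    img = np.mean(entity_embs, axis=0)
    q = (img + text_emb) / 2.0
    return q / np.linalg.norm(q)

def retrieve(query, candidates, k=5):
    """Rank candidate image embeddings by cosine similarity to the query."""
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = cand @ query
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
dim = 8
# Single entity-image query: one entity image + descriptive text.
single_q = fuse_query([rng.normal(size=dim)], rng.normal(size=dim))
# Multi-entity-image query: several entity images + relational text.
multi_q = fuse_query([rng.normal(size=dim) for _ in range(3)],
                     rng.normal(size=dim))
pool = rng.normal(size=(100, dim))  # placeholder candidate image pool
print(retrieve(multi_q, pool, k=5))
```

A real system would replace the random vectors with embeddings from a trained multimodal encoder; the point here is only the shape of the task, in which a single query mixes visual entities with text.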

Details

Paper ID
lrec2026-main-734
Pages
pp. 9349-9357
BibKey
blaga-etal-2026-entity
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Cristian-Ioan Blaga
  • Paul Suganthan G C
  • Sahil Dua
  • Krishna Srinivasan
  • Enrique Alfonseca
  • Peter Dornbach
  • Tom Duerig
  • Imed Zitouni
  • Zhe Dong
