Back to Main Conference 2026
LREC 2026main

Early Fusion with Contrastive Learning: A Lightweight Alternative for Multi-modal Classification

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/3ioe7epfs78m

Abstract

With the emergence of numerous modalities, such as text, image, audio, etc., the use of effective multimodal systems has increased significantly. However, one of the significant challenges faced by such multimodal systems is effectively aligning and integrating diverse modalities. Several models have been proposed to address these issues; however, state-of-the-art performance is achieved by complex, heavyweight models (complexity measured in terms of trainable parameters) alone. Hence, we propose a simple yet effective lightweight framework explicitly designed for multimodal classification tasks, utilising the early fusion method combined with a contrastive learning approach. The early fusion method focuses on fusing different modalities at the input level, whereas contrastive learning allows a single modality to capture intra-modality relationships. Experiments on three different genres of multimodal classification datasets demonstrate that the proposed lightweight framework achieves performance comparable to the most competitive heavyweight state-of-the-art models and, in some cases, even outperforms them.

Details

Paper ID
lrec2026-main-717
Pages
pp. 9129-9138
BibKey
wernlein-etal-2026-early
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 16 May 2026

Authors

  • FW

    Felix Wernlein

  • AJ

    Abhik Jana

  • SS

    Sandipan Sikdar

Links