MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Automatic speaking assessment (ASA) aims to quantify the language competence of second language (L2) learners by assigning a proficiency score to their spoken responses. Existing efforts typically pair a neural grader with a set of handcrafted features to gauge the language competence of L2 learners from multiple facets. Despite their decent efficacy, these methods are limited by a laborious feature-engineering process and largely overlook the scoring rubrics that are presented to human raters in speaking assessment. In light of this, we put forward MASA, a novel multimodal foundation model for ASA in picture-description scenarios. Our approach streamlines the feature-engineering process by leveraging the pre-trained encoders of a multimodal foundation model, and emulates the nuanced scoring behaviors of human raters by incorporating scoring rubrics directly into the modeling process. Furthermore, we introduce a simple, training-free method that alleviates scoring bias in MASA by contrasting the output distributions derived from multimodal and single-modal inputs. A series of experiments conducted on a picture-description task from the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method in comparison to several cutting-edge baselines.
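The abstract does not give the exact formulation of the training-free debiasing step, but the stated idea of contrasting multimodal and single-modal output distributions can be illustrated with a minimal sketch. The function name, the contrast weight `alpha`, and the five-level score scale below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def debiased_score_distribution(multimodal_logits: np.ndarray,
                                single_modal_logits: np.ndarray,
                                alpha: float = 1.0) -> np.ndarray:
    """Hypothetical sketch of training-free bias mitigation by contrast.

    The single-modal logits (e.g., from the rubric/prompt alone, without the
    learner's response) capture score levels the grader favors regardless of
    the actual input. Subtracting them from the multimodal logits suppresses
    that prior; renormalizing with a softmax yields a debiased distribution.
    """
    contrasted = multimodal_logits - alpha * single_modal_logits
    exp = np.exp(contrasted - contrasted.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy usage: logits over five assumed proficiency levels (1..5).
multimodal = np.array([0.2, 1.5, 2.8, 1.9, 0.4])
single_modal = np.array([0.1, 1.2, 2.5, 0.8, 0.1])  # response-free bias
probs = debiased_score_distribution(multimodal, single_modal)
expected_score = np.dot(np.arange(1, 6), probs)
print(f"debiased distribution: {probs.round(3)}")
print(f"expected score: {expected_score:.2f}")
```

Under this reading, levels that receive high probability even without the spoken response contribute less after the contrast, which is one plausible way a training-free correction of rater-style scoring bias could operate.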