Design and Methodological Architecture of a Multilingual Corpus of Interpreter-mediated Public Service Telephone Interactions
Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026
Abstract
Multimodality in Social Sciences and Humanities (SSH) research is often associated with the integration of text and visual data. However, interpreter-mediated telephone interaction presents a different configuration of complexity, in which acoustic, temporal, discursive, and pragmatic dimensions converge. This paper presents the design and methodological architecture of PRAGMACOR (Corpus Pragmatics and Telephone Interpreting: Analysis of Face-Threatening Acts, Ref. PID2021-127196NA-I00), a multilingual corpus of interpreter-mediated public service telephone interactions (Chinese–Spanish, English–Spanish, French–Spanish, German–Spanish), as a case study in multimodal and plurilingual SSH infrastructure. The corpus integrates aligned audio recordings, orthographic transcriptions enriched with speech phenomena, temporal segmentation into speech acts, and multilayer pragmatic annotation of Face-Threatening Acts (FTAs), validated through a structured double-annotation and expert review process. Beyond textual data, the infrastructure captures prosodic overlap, turn-taking dynamics, and pragmatic mediation, enabling the study of cross-linguistic transfer and relational negotiation in asymmetrical institutional contexts. Datasets such as PRAGMACOR have proved essential for training LLMs for speech-to-speech translation (Sakai et al., 2024). Particular attention is given to the ethical and technical design of the corpus, including local automatic transcription, systematic removal of personally identifiable information, and irreversible voice anonymization through spectral and temporal signal transformation. These procedures ensure both research usability and compliance with responsible data governance principles.
By conceptualising interpreter-mediated interaction as both an acoustic-discursive multimodal object and a plurilingual pragmatic process, this paper argues that PRAGMACOR provides a replicable model for developing SSH-oriented infrastructures capable of supporting advanced research in multilingual communication and discourse analysis, as well as the future evaluation of language technologies.