Euskorpora: A Strategic Framework for Digital Sovereignty and Linguistic Inclusion of Basque in the Era of AI
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Euskorpora is an initiative to establish a comprehensive digital infrastructure for the development of speech and language technologies in Basque. Building on European, Spanish, and Basque strategies, it addresses the scarcity of linguistic data, foundation models, and technological resources for this non-Indo-European, low-resource language. The project integrates large-scale data collection from public institutions and private organisations to create extensive multimodal corpora that cover the linguistic, dialectal, and domain diversity of Basque. These resources support the training of open language models for speech, translation, and language understanding, as well as the establishment of an interoperable infrastructure aligned with European initiatives such as the European Language Data Space (LDS). By combining linguistic research, artificial intelligence, and data governance, Euskorpora aims to secure the digital sovereignty and inclusion of the Basque language within the global AI ecosystem. Beyond its regional focus, it offers a transferable model for advancing linguistic diversity, technological innovation, and equitable digital transformation in multilingual Europe.