From Corpus to Community: New NLP Tools for Welsh Language Research and Learning

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

Abstract

Launched in 2020, CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first large-scale corpus of the Welsh language to integrate spoken, written, and electronically mediated data, offering a comprehensive snapshot of contemporary Welsh use. With contributions from over 2,000 speakers, the 11.2-million-word corpus represents the diversity of Wales’s linguistic landscape. As a national resource, CorCenCC enables users to explore real-world Welsh. Several tools and resources were developed through the CorCenCC project to enable the grammatical and semantic categorisation of the dataset, including the CyTag POS tagger and CySemTag (adapted from Lancaster University’s USAS semantic system). The team also built the pedagogic toolkit Y Tiwtiadur, which gives learners and teachers access to corpus-based examples and tasks. In addition, Yr Amliadur provides curated frequency-based wordlists across modes and parts of speech, supporting linguistic analysis and vocabulary development. Since completing the corpus, the team has focused on extending its impact and reach and on ensuring that the resources are maintained and sustained for future use, a challenge often faced when large-scale projects end. This poster profiles the tools and resources created from and inspired by CorCenCC, as a means of supporting the democratisation of linguistic resources in minoritised language contexts.