BRAGD: Constrained Multi-Label POS Tagging for Faroese
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present the first multi-label part-of-speech (POS) tagger for Faroese that uses linguistically informed constraints, addressing the data sparsity problem inherent in compound-tag approaches. We propose the BRAGD tagset, which decomposes compound morphological tags into independent features (word class, gender, number, case, etc.). BRAGD is the third iteration of a tagset previously released for Faroese, with substantial modifications that align it more closely with Faroese grammar. We re-annotate the previously released Sosialurin corpus with the new tagset and annotate a new out-of-domain test corpus of 500 sentences drawn from more varied and contemporary texts. To train the tagger, we use a constrained loss function that dynamically masks morphologically invalid features based on the word class (noun, verb, adjective, etc.). We fine-tune a Scandinavian transformer language model with this constrained multi-label loss, achieving an overall accuracy of 97.5%. Models trained with the multi-label loss perform better, converge faster, and show significantly lower error rates on out-of-domain data than single-label approaches or previously reported methods for Faroese POS tagging. This suggests that the multi-label approach learns robust morphological patterns rather than memorizing domain-specific tag distributions. We release the models, the code, the systematically revised Sosialurin-BRAGD corpus annotated with the new tagset, and the out-of-domain evaluation corpus.
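The constrained loss described above can be illustrated with a minimal sketch. This is not the released implementation: the label inventory, the validity table, and the function names (`feature_mask`, `masked_bce`) are hypothetical, and the sketch assumes one binary label per feature value, a word class that is already known for the token, and a loss averaged over the unmasked labels only.

```python
import math

# Hypothetical label inventory: one binary label per feature value.
# Illustrative only; this is not the BRAGD tagset definition.
LABELS = ["gender=masc", "gender=fem", "number=sg", "case=nom", "tense=past"]

# Hypothetical table of which feature groups are valid for each word class.
VALID_FEATURES = {
    "noun": {"gender", "number", "case"},
    "verb": {"number", "tense"},
    "adverb": set(),
}


def feature_mask(word_class):
    """Return 1.0 for labels whose feature group is valid for the word class."""
    valid = VALID_FEATURES[word_class]
    return [1.0 if label.split("=")[0] in valid else 0.0 for label in LABELS]


def masked_bce(logits, targets, mask):
    """Binary cross-entropy over logits, averaged over unmasked labels only."""
    total = 0.0
    for z, y, m in zip(logits, targets, mask):
        if m == 0.0:
            continue  # invalid feature for this word class: no loss, no gradient
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    active = sum(mask)
    return total / active if active else 0.0


noun_mask = feature_mask("noun")
loss = masked_bce([2.0, -1.0, 0.5, 1.5, 3.0], [1, 0, 1, 1, 0], noun_mask)
```

In this sketch the `tense=past` logit of a noun contributes nothing to the loss, so the model is never penalized for whatever it outputs on features that cannot apply to the token's word class.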