Transcription Accuracy in the Icelandic Gigaword Corpus: Evaluating Automatic and Manual Annotation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper compares automatically generated and manually corrected annotation data in the Icelandic Gigaword Corpus. We focus on the variable use of Stylistic Fronting (SF) in Icelandic, an optional movement of words or phrases that indicates a more formal style. Examining SF rates over time, we find that manual coding yields slightly lower SF rates than automatic coding. This difference can be explained by the different sources used in the coding process: automatic coding draws on written transcripts compiled by parliament employees, whereas manual correction relies on audio files of the parliamentary speeches. Importantly, both types of coding are well suited to tracing changing patterns of SF over a span of 16 years, suggesting that the automatic feature extraction reliably reflects the speeches that have been transcribed.