SouDeC: Source Detection and Classification in Czech
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present a method for detecting and classifying attribution sources in Czech. A plain text (typically a newspaper article) enters the SouDeC system, is parsed by the external tool UDPipe into a Universal Dependencies-style sentence representation, and is then analyzed for occurrences of attribution signals and sources. The list of attribution signals was extracted from a corpus of Czech newspaper articles annotated with interlinked attribution signals and sources, and was complemented with contextual and syntactic information that helps distinguish relevant occurrences of the signals. SouDeC further classifies the attribution sources into one of five classes (anonymous, partially anonymous, unofficial, official non-political, and official political), using information from another external tool, NameTag 3, a recognizer and classifier of named entities. While our source detection method achieves results comparable to existing systems for other languages, further improvements could be achieved by incorporating full-fledged automatic coreference resolution into the classification method. In a focused case study, we test a possible use of SouDeC for distinguishing domain-specific texts of less vs. more reputable origin.
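To make the pipeline described above more concrete, the following is a minimal sketch of signal-based source detection and five-way classification over a dependency-parsed sentence. It is an illustration under stated assumptions, not the SouDeC implementation: the `Token` structure, the toy lexicons, the `ne_type` label, the restriction to verbal signals with `nsubj` sources, and all classification rules are hypothetical placeholders, and the example parse is written by hand rather than produced by UDPipe or NameTag 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    id: int                        # 1-based position in the sentence
    form: str
    lemma: str
    upos: str                      # Universal Dependencies POS tag
    head: int                      # id of the syntactic head, 0 for the root
    deprel: str                    # dependency relation to the head
    ne_type: Optional[str] = None  # hypothetical named-entity label


# Hypothetical toy lexicons; the real signal list is derived from a corpus.
SIGNAL_LEMMAS = {"říci", "uvést", "sdělit"}
POLITICAL_ROLES = {"ministr", "poslanec", "premiér"}
OTHER_OFFICIAL_ROLES = {"mluvčí", "ředitel"}


def detect_sources(sentence: List[Token]) -> List[Tuple[Token, Token]]:
    """Pair each verbal attribution signal with its nominal subject."""
    pairs = []
    for tok in sentence:
        if tok.upos == "VERB" and tok.lemma in SIGNAL_LEMMAS:
            for dep in sentence:
                if dep.head == tok.id and dep.deprel == "nsubj":
                    pairs.append((tok, dep))
    return pairs


def classify_source(source: Token, sentence: List[Token]) -> str:
    """Assign one of the five classes using placeholder rules (assumed)."""
    # The source head plus its direct dependents stand in for the full phrase.
    phrase = [source] + [t for t in sentence if t.head == source.id]
    lemmas = {t.lemma for t in phrase}
    named = any(t.ne_type == "person" for t in phrase)
    political = bool(lemmas & POLITICAL_ROLES)
    official = bool(lemmas & OTHER_OFFICIAL_ROLES)
    if (political or official) and not named:
        return "partially anonymous"
    if political:
        return "official political"
    if official:
        return "official non-political"
    if named:
        return "unofficial"
    return "anonymous"


# Hand-built toy parse of "Ministr Novák řekl" (not actual UDPipe output).
sentence = [
    Token(1, "Ministr", "ministr", "NOUN", 3, "nsubj"),
    Token(2, "Novák", "Novák", "PROPN", 1, "nmod", ne_type="person"),
    Token(3, "řekl", "říci", "VERB", 0, "root"),
]

for signal, source in detect_sources(sentence):
    print(signal.form, source.form, classify_source(source, sentence))
```

In this sketch the syntactic constraint (the source must be the subject of the signal verb) plays the role of the context and syntax information that filters relevant signal occurrences. A source mentioned only by a pronoun would fall through to "anonymous" here, which illustrates why coreference resolution could improve classification.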