Presenting the Prague Discourse Treebank 4.0
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
The Prague Discourse Treebank 4.0 is a large, genre-diversified language resource annotated with discourse relations marked by explicit connectives in Czech texts. It consists of 175 thousand sentences with 82 thousand discourse relations. We present the treebank together with the methods used to annotate its individual parts: some parts were annotated fully manually, others with cost-effective, partially automatic methods that achieve comparable quality. The discourse annotation is available in two formats and theoretical frameworks: the Prague discourse annotation on top of deep-syntax dependency trees, and the Penn Discourse Treebank style on top of plain texts, with both discourse type/sense taxonomies provided in each format. The corpus is publicly and freely available, offering a valuable resource for linguistic research and natural language processing tasks.