The Corpus of Contemporary Polish — a New Reference Corpus with Rich Syntactic Annotations
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
In this paper, we describe the Corpus of Contemporary Polish (KWJP) and its rich syntactic annotation. The corpus covers a wide range of texts originally published between 2011 and 2020. Although it continues the idea, initiated by the National Corpus of Polish (NKJP) project, of providing up-to-date reference corpora of Polish, the principles underlying its development differ. We outline the differing choices that affect the content of the corpus and explain the reasons behind them. The paper focuses mainly on the annotation layers of KWJP, which are generated with a neural-network-based tool developed specifically for this purpose. We describe in detail the annotation of syntactic structure, which is represented by hybrid trees combining information typical of constituency and dependency trees. Finally, we provide several examples showing how annotation with hybrid trees facilitates querying the corpus and searching it effectively.