Back to Main Conference 2026
LREC 2026main

SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2ewzmb73r3h8

Abstract

We present a project focused on linguistic description, annotation and automatic classification of the so-called epistemic markers in Czech. These expressions, such as pravděpodobně ‘probably’, zřejmě ‘apparently’ and určitě ‘certainly’, typically operate within the pragmatic domain of language. We introduce a dataset containing manual annotations of the 40 most frequent epistemic markers in Czech, totalling almost 4,000 uses. This annotation was created using parallel InterCorp data (in Czech and English) and the TEITOK tool. We describe the annotation scheme used, the annotation process and data handling. The dataset forms the core of the emerging lexical database of these expressions (SEEMLex). Thanks to the comprehensive manual annotation, the dataset can also serve as a source of further pragmatic information and can be used as a basis for further linguistic research. The proposed annotation scheme can also be used for other languages. To demonstrate the dataset’s utility for automatic classification, we trained XLM-RoBERTa classifiers using 10-fold cross-validation, achieving 72.6% accuracy for type of use classification (6 classes) and 54.2% accuracy for degree of certainty classification (4 classes).

Details

Paper ID
lrec2026-main-545
Pages
pp. 6856-6869
BibKey
tpnkov-etal-2026-seem
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 16 May 2026

Authors

  • Barbora Štěpánková

  • MN

    Michal Novák

  • TM

    Tomáš Musil

  • LP

    Lucie Polakova

Links