Detecting Potentially Under-annotated Explicit Discourse Connectives in the Penn Discourse Treebank (PDTB-3) with LLMs
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Accurate identification of explicit discourse connectives is crucial for analysing discourse relations, which supports NLP tasks such as summarisation and question answering. However, annotation inconsistencies remain a challenge, particularly for ambiguous prepositions with both discourse and non-discourse usages. This paper presents a pipeline that leverages large language model (LLM) prompting, cross-model agreement, and syntactic pattern analysis to detect likely under-annotated connectives. Evaluated on four prepositions (by, with, without, and for), the approach effectively identifies likely under-annotations for some, but not all, prepositions. Results show that while the method is promising, its generalisability depends on improved prompt design, model choice, and syntactic analysis tools. The findings highlight both the potential and limitations of LLM-based approaches for corpus error detection and demonstrate how improved discourse annotation can contribute to more reliable data for downstream NLP tasks.
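As a rough illustration of the cross-model agreement step mentioned in the abstract, the sketch below flags a token as potentially under-annotated when several independent LLMs judge it to be a discourse connective even though the corpus does not annotate it as one. The prompt wording, the query callables, and the agreement threshold are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch only: the prompt, the model callables, and the
# agreement threshold are hypothetical stand-ins, not the authors' pipeline.
from collections import Counter
from typing import Callable

PROMPT = (
    "In the sentence below, does the word '{word}' function as an explicit "
    "discourse connective (i.e. it links two discourse arguments)? "
    "Answer YES or NO.\n\nSentence: {sentence}"
)

def flag_under_annotation(sentence: str,
                          word: str,
                          annotated_as_connective: bool,
                          models: dict[str, Callable[[str], str]],
                          min_agree: int = 2) -> bool:
    """Return True if the token looks under-annotated: the corpus does not
    mark it as a connective, yet at least `min_agree` LLMs judge that it is."""
    if annotated_as_connective:
        # Already annotated in the corpus, so there is nothing to flag.
        return False
    prompt = PROMPT.format(word=word, sentence=sentence)
    # Each model callable takes a prompt string and returns its raw answer.
    votes = Counter(
        answer(prompt).strip().upper().startswith("YES")
        for answer in models.values()
    )
    return votes[True] >= min_agree
```

In practice the candidates flagged this way would still be filtered by the syntactic pattern analysis and, ideally, checked by a human annotator before being treated as annotation errors.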