LREC 2026 Workshop

Semi-automatic Approach for Tamil Discourse Relation Annotation

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/2imw3didyv2n

Abstract

Discourse relations (DRs) specify the logical relations between text spans and are essential for modeling extended discourse. Resources annotated with DRs can help train large language models (LLMs) to recognize and generate these relations more naturally. However, there is currently no open-source DR-annotated resource for Tamil. Annotation is particularly challenging because many Tamil discourse connectives are realized as morphologically complex suffixes rather than standalone tokens, often involving phonological alternations. In this work, we present a DR-annotated dataset for Tamil based on the PDTB framework. We adopt a semi-automatic pipeline: 1) projection of automatic English discourse annotations onto Tamil in a parallel corpus; 2) lexical normalization using a morphological analyzer; and 3) manual verification of each instance. The resulting resource contains approximately 7,200 explicit DR annotations and a lexicon of 450 Tamil discourse connectives. The annotated data is available for download at https://anonymous.4open.science/r/Tamil-Semi-Automatic-Discourse-Relation-Dataset/.
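Step 1 of the pipeline (annotation projection) can be illustrated with a minimal sketch. It assumes token-level word alignments are already available for each sentence pair; the function name `project_span`, the alignment format, and the transliterated Tamil sentence are hypothetical illustrations, not the authors' implementation or data.

```python
from typing import Iterable

def project_span(src_span: tuple[int, int],
                 alignment: Iterable[tuple[int, int]]) -> list[int]:
    """Project a source token span onto target token indices.

    src_span is (start, end) with end exclusive; alignment is a collection
    of (source_index, target_index) pairs. Both formats are assumptions
    made for this sketch.
    """
    start, end = src_span
    # Collect every target token aligned to a source token inside the span.
    return sorted({t for s, t in alignment if start <= s < end})

# Hypothetical toy sentence pair (Tamil transliterated for readability).
english = ["I", "stayed", "home", "because", "it", "rained"]
tamil = ["mazhai", "peythathaal", "naan", "veettil", "irunthen"]
alignment = [(0, 2), (1, 4), (2, 3), (3, 1), (4, 0), (5, 1)]

# The English connective "because" occupies token 3.
projected = project_span((3, 4), alignment)
print([tamil[i] for i in projected])
```

In this toy case the connective projects onto a single Tamil token that also carries the verb, since the causal meaning is expressed by a suffix. This is the situation in which the morphological normalization of step 2 and the manual check of step 3 would be needed to isolate the connective itself.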

Details

Paper ID
lrec2026-ws-wildre-02
Pages
pp. 14-24
BibKey
yung-etal-2026-semi
Editors
Girish Nath Jha, Kalika Bali, Sobha L, Devendr Kumar
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Frances Yung

  • Enosh Peter Ponraj

  • Vera Demberg
