Semi-automatic Approach for Tamil Discourse Relation Annotation

Proceedings of the 8th Workshop on Indian Language Data: Resources and Evaluation

Abstract

Discourse relations (DRs) specify the logical relations between text spans and are essential for modeling extended discourse. Resources annotated with DRs can help train large language models (LLMs) to recognize and generate these relations more naturally. However, there is currently no open-source DR-annotated resource for Tamil. Annotation is particularly challenging because many Tamil discourse connectives are realized as morphologically complex suffixes rather than standalone tokens, often involving phonological alternations. In this work, we present a DR-annotated dataset for Tamil based on the PDTB framework. We adopt a semi-automatic pipeline: (1) projection of automatic English discourse annotations onto Tamil in a parallel corpus; (2) lexical normalization using a morphological analyzer; and (3) manual verification of each instance. The resulting resource contains approximately 7,200 explicit DR annotations and a lexicon of 450 Tamil discourse connectives. The annotated data is available for download at https://anonymous.4open.science/r/Tamil-Semi-Automatic-Discourse-Relation-Dataset/.
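To make the projection step of the pipeline concrete, the following is a minimal sketch of how an English connective span might be mapped onto Tamil tokens through word alignments. It is an illustration under stated assumptions, not the implementation used in this work: the function name `project_span`, the alignment format (pairs of English and Tamil token indices), and the placeholder tokens `t0` to `t3` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical alignment format: (english_index, tamil_index) pairs,
# as a word aligner might produce for one sentence pair.
Alignment = List[Tuple[int, int]]


def project_span(en_span: Tuple[int, int], alignment: Alignment) -> List[int]:
    """Return the sorted Tamil token indices aligned to an English span.

    en_span is a half-open range (start, end) over English token indices.
    """
    start, end = en_span
    return sorted({ta for en, ta in alignment if start <= en < end})


# Toy sentence pair; the Tamil side uses placeholder tokens, not real data.
en_tokens = ["he", "came", "because", "it", "rained"]
ta_tokens = ["t0", "t1", "t2", "t3"]
alignment = [(0, 2), (1, 3), (2, 1), (3, 0), (4, 1)]

# Project the English connective "because" (token index 2).
projected = project_span((2, 3), alignment)
print([ta_tokens[i] for i in projected])
```

In this toy alignment the connective and the verb "rained" both map to the same Tamil token, which mirrors the case where the connective is realized as a suffix on an inflected word. That is why the projected token would still need the normalization and manual verification steps before it can be treated as a connective.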