LREC-COLING 2024 workshop

Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language

Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

DOI:10.63317/3ewubkqtrad2

Abstract

Tamil is a relatively low-resource language in the field of Natural Language Processing (NLP). Recent years have seen growth in Tamil NLP datasets for Natural Language Understanding (NLU) or Natural Language Generation (NLG) tasks, but high-quality linguistic resources remain scarce. To help close this gap, this paper introduces Aalamaram, a treebank with rich linguistic annotations for the Tamil language. It is, to date, the largest publicly available Tamil treebank, with almost 10,000 sentences from diverse sources, and is annotated for Part-of-Speech (POS) tagging, Named Entity Recognition (NER), morphological parsing, and dependency parsing. Close attention has also been paid to multi-word segmentation, especially in the context of Tamil clitics. Although the treebank is based largely on the Universal Dependencies (UD) specifications, significant effort has been made to adapt the annotation rules to the idiosyncrasies and complexities of the Tamil language, thereby providing a valuable resource for linguistic research and NLP development.

Details

Paper ID
lrec2024-ws-wildre-11
Pages
pp. 73-83
BibKey
abirami-etal-2024-aalamaram
Publisher
European Language Resources Association (ELRA) and ICCL
Workshop
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Date
20-25 May 2024

Authors

  • A M Abirami
  • Wei Qi Leong
  • Hamsawardhini Rengarajan
  • D Anitha
  • R Suganya
  • Himanshu Singh
  • Kengatharaiyer Sarveswaran
  • William Chandra Tjhi
  • Rajiv Ratn Shah
