
Comparative Analysis of Tokenizers in Tamil Text Classification in Low Resource Settings

Proceedings of the Second workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)

DOI:10.63317/5p78kf96x2jw

Abstract

Tokenization is crucial in NLP and strongly influences performance for morphologically rich, low-resource languages like Tamil. This study comprehensively analyzes WordPiece, SentencePiece, and Byte-Level Byte Pair Encoding (BBPE) for Tamil text classification. We assess tokenization efficiency using four metrics: token count, fragmentation, out-of-vocabulary (OOV) rate, and compression ratio. We also analyze downstream impact through Tamil news title classification using a custom lightweight BERT-based Transformer architecture. Tokenizers were pretrained on a 5.45 GB Tamil corpus and evaluated on a Kaggle Tamil News Dataset. Results indicate that WordPiece and SentencePiece outperform BBPE in both efficiency and accuracy. While BBPE eliminates OOV words, its excessive fragmentation hinders model learning. Increasing vocabulary size improves WordPiece and SentencePiece but not BBPE. Misclassification analysis highlights the challenges caused by over-fragmentation. This study contributes to Tamil NLP by comparing tokenizers, helping researchers select appropriate tokenization strategies for agglutinative languages.
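As an illustration only, the four efficiency metrics named in the abstract could be computed along the following lines. The abstract does not define them, so the definitions here are assumptions: fragmentation as tokens per whitespace-separated word, OOV rate as the share of tokens equal to an unknown-token marker, and compression ratio as characters per token. The helper names `tokenizer_metrics` and `make_greedy_tokenizer`, and the toy greedy longest-match tokenizer itself, are hypothetical and are not the paper's implementation.

```python
from typing import Callable, Dict, List, Set


def make_greedy_tokenizer(vocab: Set[str], unk_token: str = "[UNK]") -> Callable[[str], List[str]]:
    """Build a toy greedy longest-match subword tokenizer (hypothetical stand-in)."""

    def tokenize(word: str) -> List[str]:
        pieces: List[str] = []
        i = 0
        while i < len(word):
            # Try the longest remaining substring first, then shrink it.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry matches here: the whole word is unknown.
                return [unk_token]
        return pieces

    return tokenize


def tokenizer_metrics(
    texts: List[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> Dict[str, float]:
    """Compute assumed definitions of the four efficiency metrics."""
    n_words = n_tokens = n_unk = n_chars = 0
    for text in texts:
        for word in text.split():
            pieces = tokenize(word)
            n_words += 1
            n_tokens += len(pieces)
            n_unk += sum(1 for p in pieces if p == unk_token)
            n_chars += len(word)
    return {
        "token_count": n_tokens,
        # Assumed: average number of tokens per whitespace-separated word.
        "fragmentation": n_tokens / n_words if n_words else 0.0,
        # Assumed: fraction of emitted tokens that are the unknown marker.
        "oov_rate": n_unk / n_tokens if n_tokens else 0.0,
        # Assumed: characters (code points) of input per emitted token.
        "compression_ratio": n_chars / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    # Toy vocabulary and sentence, chosen purely for illustration.
    toy_vocab = {"தமிழ்", "செய்தி", "கள்"}
    toy_tokenize = make_greedy_tokenizer(toy_vocab)
    print(tokenizer_metrics(["தமிழ் செய்திகள்"], toy_tokenize))
```

Under these assumed definitions, a tokenizer with no unknown tokens but very high fragmentation would show a low OOV rate and a low compression ratio at once, which is the trade-off the abstract reports for BBPE.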

Details

Paper ID
lrec2026-ws-chipsal-19
Pages
pp. 198-208
BibKey
sivakumaran-etal-2026-comparative
Editors
Kengatharaiyer Sarveswaran, Ashwini Vaidya
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Second workshop on Challenges in Processing South Asian Languages (CHiPSAL2026)
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Gokulan Sivakumaran

  • Randil Pushpananda

  • Bandara
