Back to Main Conference 2024
LREC-COLING 2024main

Enhancing Hindi Feature Representation through Fusion of Dual-Script Word Embeddings

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/2fsu5tdcp8ho

Abstract

Pretrained language models excel in various natural language processing tasks but often neglect the integration of different scripts within a language, constraining their ability to capture richer semantic information, such as in Hindi. In this work, we present a dual-script enhanced feature representation method for Hindi. We combine single-script features from Devanagari and Romanized Hindi Roberta using concatenation, addition, cross-attention, and convolutional networks. The experiment results show that using a dual-script approach significantly improves model performance across various tasks. The addition fusion technique excels in sequence generation tasks, while for text classification, the CNN-based dual-script enhanced representation performs best with longer sentences, and the addition fusion technique is more effective for shorter sequences. Our approach shows significant advantages in multiple natural language processing tasks, providing a new perspective on feature representation for Hindi. Our code has been released on https://github.com/JohnnyChanV/Hindi-Fusion.

Details

Paper ID
lrec2024-main-0528
Pages
pp. 5966-5976
BibKey
wang-etal-2024-enhancing-hindi
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • LW

    Lianxi Wang

  • YT

    Yujia Tian

  • ZC

    Zhuowei Chen

Links