
JLBert: Japanese Light BERT for Cross-Domain Short Text Classification

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI: 10.63317/22n24f3j4465

Abstract

Pre-trained language models such as BERT have made a significant breakthrough in Natural Language Processing (NLP), achieving state-of-the-art results on more than eleven tasks. This is accomplished by training on large-scale unlabelled text and leveraging the Transformer architecture, making BERT the "Jack of all NLP trades". However, one popular and challenging sequence classification task is Short Text Classification (STC): short texts are brief, ambiguous, and often non-standard. In this paper, we address two major problems: (1) improving STC performance in Japanese, a language with many varieties and dialects, and (2) building a light-weight Japanese BERT model with cross-domain functionality and accuracy comparable to state-of-the-art (SOTA) BERT models. To this end, we propose JLBert, a novel cross-domain, scalable model pre-trained on a rich, diverse, and relatively unexplored Japanese e-commerce corpus. We present results from extensive experiments showing that JLBert outperforms SOTA multilingual and Japanese-specialized BERT models on three short text datasets by approximately 1.5% across various domains.
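As a hedged illustration only (this is not the authors' released code, and no JLBert checkpoint name is given in the abstract), the Python sketch below shows how a BERT-style Japanese encoder could be applied to a short text classification task with the Hugging Face Transformers API. The checkpoint name "jlbert-base", the label count, and the example sentences are placeholder assumptions.

    # Minimal sketch (not the authors' code): applying a BERT-style encoder to
    # Japanese short text classification with Hugging Face Transformers.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "jlbert-base"  # hypothetical checkpoint name; substitute any Japanese BERT checkpoint
    NUM_LABELS = 3              # dataset-specific; chosen here only for illustration

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

    # Short texts are padded/truncated to a small maximum length, since STC
    # inputs are typically only a few tokens long.
    texts = ["配送がとても早かったです", "サイズが合いませんでした"]
    batch = tokenizer(texts, padding=True, truncation=True, max_length=32, return_tensors="pt")

    with torch.no_grad():
        logits = model(**batch).logits
    predictions = logits.argmax(dim=-1)
    print(predictions.tolist())

In a full fine-tuning setup, the same model and tokenizer would be wrapped in a standard classification training loop over the labelled short text dataset; the snippet above only demonstrates tokenization and inference.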

Details

Paper ID
lrec2024-main-0833
Pages
pp. 9536-9542
BibKey
kayal-etal-2024-jlbert
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20–25 May 2024

Authors

  • Chandrai Kayal
  • Sayantan Chattopadhyay
  • Aryan Gupta
  • Satyen Abrol
  • Archie Gugol
