LREC-COLING 2024 main

Cross-type French Multiword Expression Identification with Pre-trained Masked Language Models

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3kwcha3br48e

Abstract

Multiword expressions (MWEs) pose difficulties for natural language processing (NLP) due to their linguistic features, such as syntactic and semantic properties, which distinguish them from regular word groupings. This paper describes a combination of two systems: one that learns verbal multiword expressions (VMWEs) and another that learns non-verbal MWEs (nVMWEs). Together, these systems leverage training data from both types of MWEs to enhance performance on a cross-type dataset containing both VMWEs and nVMWEs. Such scenarios emerge when datasets are developed using differing annotation schemes. We explore the fine-tuning of several state-of-the-art neural transformers for each MWE type. Our experiments demonstrate the advantages of the combined system over multi-task approaches or single-task models, addressing the challenges posed by diverse tagsets within the training data. Specifically, we evaluated the combined system on a French treebank named Sequoia, which features an annotation layer encompassing all syntactic types of French MWEs. With this combined approach, we improved the F1-score by approximately 3% on the Sequoia dataset.

Details

Paper ID
lrec2024-main-0374
Pages
pp. 4198-4204
BibKey
bui-savary-2024-cross
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20-25 May 2024

Authors

  • Van-Tuan Bui

  • Agata Savary

Links