Back to Main Conference 2024
LREC-COLING 2024main

Towards a Corpus of Spoken Maltese: Korpus tal-Malti Mitkellem, KMM

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/5k7ktu5qumgk

Abstract

This paper presents the rationale for a “dedicated” corpus of spoken Maltese, Korpus tal-Malti Mitkellem, KMM, ‘Corpus of Spoken Maltese’, based on the concept of a gold-standard Core collection. The Core collection is designed to cater to as wide a variety of user needs as possible whilst respecting basic principles governing corpus design, such as representativeness and balance, and delivering high quality in terms of both audio quality and annotations. An overview is provided of the composition of the current Core corpus of around 20 hours of data and of the human annotation effort involved. We also carry out a small qualitative analysis of the output of a Maltese ASR system and compare it to the human annotators’ output. Initial results are promising, showing that the ASR is robust enough to generate first-pass texts for annotators to work on, thus reducing the human effort, and consequently, the cost involved.

Details

Paper ID
lrec2024-main-1420
Pages
pp. 16343-16352
BibKey
vella-etal-2024-towards
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • AV

    Alexandra (Sandra) Vella

  • SA

    Sarah Agius

  • AW

    Aiden Williams

  • CB

    Claudia Borg

Links