Radio Haiti-Inter: A Large-Scale Annotated Corpus of Spoken Haitian Creole

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/5kk3h4p3mp5d

Abstract

We present Radio Haiti-Inter, the first large-scale corpus of spoken Haitian Creole (Kreyòl). The corpus was constructed using automatic speech recognition (ASR) with a state-of-the-art model dedicated to Kreyòl. In addition to transcriptions, we provide part-of-speech (POS) tags, time-aligned transcripts, and confidence scores, enabling users to select the most reliable segments for their research. We manually evaluate both transcription quality and POS tagging accuracy to assess the reliability of the resource. To enable high-quality research, we are releasing 50 hours drawn from the highest-quality segments, comprising both the audio and the associated annotations. This corpus represents an invaluable resource for advancing the study of Kreyòl, with potential applications in phonetics, phonology, morphology, and syntax, as well as in the study of code-switching and code-mixing. Because the recordings span many years, the corpus is also suited to micro-diachronic studies of Kreyòl.

Details

Paper ID
lrec2026-main-241
Pages
pp. 3083-3093
BibKey
havard-etal-2026-radio
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 - 16 May 2026

Authors

  • William N. Havard
  • Rayan Ziane
  • Mélissa Menclé
  • Maximin Coavoux
  • Benjamin Lecouteux
  • Emmanuel Schang
