Radio Haiti-Inter: A Large-Scale Annotated Corpus of Spoken Haitian Creole
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present Radio Haiti-Inter, the first large-scale corpus of spoken Haitian Creole (Kreyòl). The corpus was constructed using automatic speech recognition (ASR) with a state-of-the-art model dedicated to Kreyòl. In addition to time-aligned transcripts, we provide part-of-speech (POS) tags and confidence scores, so that users can select the most reliable segments for their research. We manually evaluate both transcription quality and POS tagging accuracy to assess the reliability of the resource. To enable high-quality research, we are releasing 50 hours of audio with the accompanying annotations, drawn from the highest-quality segments. The corpus is a valuable resource for advancing the study of Kreyòl, with potential applications in phonetics, phonology, morphology, and syntax, as well as in the study of code-switching and code-mixing. Because the recordings span many years, the corpus is also suited to micro-diachronic studies of Kreyòl.