Medispeech: A French Reading and Spontaneous Speech Corpus for Sleepiness Estimation
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Excessive Daytime Sleepiness (EDS) is associated with several diseases and negatively affects the daily life of affected people. Its diagnosis and follow-up are difficult because they require a full day of testing at the hospital. Speech analysis could enable regular monitoring of patients in ecological conditions. Although several corpora containing speech from sleepy subjects exist, they do not meet ecological requirements with respect to either the recording device or the speech elicitation tasks. In this paper, we introduce the Medispeech corpus, which contains reading, daily-life semi-spontaneous, and medically-oriented spontaneous tasks. Fifty-nine French subjects were recorded with both a professional-quality microphone and a smartphone using a dedicated application, resulting in 1,729 recordings for a total duration of 21 hours. Their EDS was assessed with both an objective physiological measurement (mean sleep latency measured during a clinical test) and a subjective questionnaire (Karolinska Sleepiness Scale). Subject phenotyping is ensured by collecting socio-demographic and medical data related to diverse dimensions of sleepiness, comorbidities, and addictions. Finally, we analyse the validity of our data collection protocol by measuring the effective duration of speech (after discarding pauses) and assessing its links with the collected subjects’ characteristics.