
A Synthetic Conversational Dataset for Type 2 Diabetes Management

Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026

DOI:10.63317/3cbhekpxj33y

Abstract

Access to real patient-doctor conversations is often restricted due to privacy concerns, making it difficult to build robust conversational AI systems for healthcare. To address this, we present a novel methodology for generating a high-quality synthetic dataset designed for conversational triple extraction in Type 2 Diabetes management. Using structured prompting with GPT-4, we generated 16 demographically and medically diverse diabetic personas and 256 multi-turn conversations between these personas and a caretaker agent, simulating realistic, context-rich interactions. The conversations incorporate critical properties such as personalization, empathy, contextual awareness, and medically grounded advice, as validated through both LLM-based and human expert evaluations. The conversations are further annotated with Subject-Predicate-Object (SPO) labels at the token level, combining manual and LLM-automated methods, and these annotations form the foundation for downstream tasks such as triple extraction. Our work demonstrates the feasibility of using generative AI to simulate healthcare conversations at scale, offering a solution for data-scarce domains.
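To make the idea of token-level SPO annotation concrete, the following is a minimal Python sketch, not the authors' annotation format. The label names (`S`, `P`, `O`, `-`), the function `extract_triple`, and the example utterance are all hypothetical and chosen only for illustration; the abstract does not specify the actual tag scheme.

```python
from typing import List, Tuple


def extract_triple(tokens: List[str], labels: List[str]) -> Tuple[str, str, str]:
    """Recover one (subject, predicate, object) triple from per-token labels.

    Assumed (hypothetical) scheme: "S" = subject, "P" = predicate,
    "O" = object, anything else = not part of the triple.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    parts = {"S": [], "P": [], "O": []}
    for token, label in zip(tokens, labels):
        if label in parts:
            parts[label].append(token)

    # Join the tokens collected for each role into a single string.
    return (" ".join(parts["S"]), " ".join(parts["P"]), " ".join(parts["O"]))


# Invented example utterance from a patient persona.
tokens = ["I", "take", "metformin", "every", "morning"]
labels = ["S", "P", "O", "-", "-"]

print(extract_triple(tokens, labels))  # ('I', 'take', 'metformin')
```

A real annotation scheme would likely need to handle several triples per utterance and multi-token spans with explicit boundaries; this sketch assumes a single triple per utterance for simplicity.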

Details

Paper ID
lrec2026-ws-cl4health-16
Pages
pp. 171-181
BibKey
ntanavaras-etal-2026-synthetic
Editors
Deepak Gupta, Paul Thompson, Sophia Ananiadou, Dina Demner-Fushman
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Stergios Ntanavaras

  • Maaike de Boer

  • Piek T.J.M. Vossen
