
Addressing Domain Shift in Health Coaching Note Analysis through Factorized Synthetic Data Generation

Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026

DOI:10.63317/3a5koirj2a6s

Abstract

Automatic extraction of behavioral goals from health coaching notes is essential for scalable monitoring of coaching programs, yet training data is scarce and exhibits substantial domain shift across programs. We collect and annotate 157 notes from a coaching program and show that models trained on the only existing public corpus, SMARTSpan (173 notes), suffer a drop of up to 30 points in exact-match F1 when transferred to our data. To address this, we propose a factorized synthetic data generation pipeline that decomposes note variation into three largely independent axes: health coach documentation structure, patient goal content, and patient persona. The pipeline extracts empirical priors from a small in-domain seed set and samples from them to produce diverse synthetic notes with embedded goal-span labels, which are validated via cycle-consistency filtering. In low-resource experiments with only 57 in-domain training notes, our approach outperforms rephrasing and backtranslation baselines on both exact-match and partial-match F1. Ablation analysis demonstrates that augmentation must target the in-domain distribution to be effective, and a human evaluation confirms that synthetic notes are structurally faithful, with detection driven by surface artifacts rather than content or organizational flaws. All code and generated data will be published on GitHub: https://github.com/Michael-Tanzer/cl4health-factorized-augmentation.
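The factorized sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the seed records, the axis values, and the function names `empirical_priors` and `sample_spec` are hypothetical, and the sketch assumes each seed note has already been tagged with one value per axis. It estimates a per-axis empirical prior from the seed set and then draws each axis independently to form a specification for one synthetic note. Note generation and cycle-consistency filtering are not shown.

```python
import random
from collections import Counter

# Hypothetical seed annotations: one value per axis for each in-domain note.
SEED = [
    {"structure": "bulleted", "goal": "exercise", "persona": "motivated"},
    {"structure": "narrative", "goal": "diet", "persona": "hesitant"},
    {"structure": "bulleted", "goal": "diet", "persona": "motivated"},
    {"structure": "bulleted", "goal": "sleep", "persona": "hesitant"},
]
AXES = ("structure", "goal", "persona")


def empirical_priors(seed):
    """Estimate per-axis relative frequencies from the seed set."""
    priors = {}
    for axis in AXES:
        counts = Counter(note[axis] for note in seed)
        total = sum(counts.values())
        priors[axis] = {value: n / total for value, n in counts.items()}
    return priors


def sample_spec(priors, rng):
    """Draw each axis independently (the factorization assumption)."""
    spec = {}
    for axis in AXES:
        values = list(priors[axis])
        weights = [priors[axis][v] for v in values]
        spec[axis] = rng.choices(values, weights=weights, k=1)[0]
    return spec


priors = empirical_priors(SEED)
rng = random.Random(0)  # fixed seed so the draws are reproducible
specs = [sample_spec(priors, rng) for _ in range(5)]
```

Because the axes are sampled independently, a specification can combine values that never co-occurred in a single seed note, which is one way a small seed set could yield more diverse synthetic notes.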

Details

Paper ID
lrec2026-ws-cl4health-03
Pages
pp. 26-40
BibKey
tnzer-etal-2026-addressing
Editors
Deepak Gupta, Paul Thompson, Sophia Ananiadou, Dina Demner-Fushman
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Michael Tänzer

  • Iva Bojic

  • Ashwini Yuvraj Lawate

  • Andy Hau Yan Ho

  • Andy Khong

Links