Addressing Domain Shift in Health Coaching Note Analysis through Factorized Synthetic Data Generation

Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC 2026

Abstract

Automatic extraction of behavioral goals from health coaching notes is essential for scalable monitoring of coaching programs, yet training data is scarce and exhibits substantial domain shift across programs. We collect and annotate 157 notes from a coaching program and show that models trained on the only existing public corpus, SMARTSpan (173 notes), suffer a drop of up to 30 points in exact-match F1 when transferred to our data. To address this, we propose a factorized synthetic data generation pipeline that decomposes note variation into three largely independent axes (health coach documentation structure, patient goal content, and patient persona), extracts empirical priors over these axes from a small in-domain seed set, and samples from the priors to produce diverse synthetic notes with embedded goal-span labels, which are validated via cycle-consistency filtering. In low-resource experiments with only 57 in-domain training notes, our approach outperforms rephrasing and backtranslation baselines on both exact-match and partial-match F1. Ablation analysis demonstrates that augmentation must target the in-domain distribution to be effective, and a human evaluation confirms that synthetic notes are structurally faithful, with detection driven by surface artifacts rather than content or organizational flaws. All code and generated data will be published on GitHub: https://github.com/Michael-Tanzer/cl4health-factorized-augmentation.
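
As an illustration of the factorized sampling idea described in the abstract, the following minimal Python sketch estimates an empirical prior for each of the three axes from a toy seed set and samples each axis independently. It is not the authors' implementation: the seed records, the axis values, and the names `SEED_NOTES`, `AXES`, `empirical_priors`, and `sample_spec` are all hypothetical, and the note generation and cycle-consistency filtering steps are omitted.

```python
import random
from collections import Counter

# Hypothetical seed annotations: one value per axis for each in-domain note.
SEED_NOTES = [
    {"structure": "bulleted", "goal": "exercise", "persona": "motivated"},
    {"structure": "narrative", "goal": "nutrition", "persona": "hesitant"},
    {"structure": "bulleted", "goal": "sleep", "persona": "motivated"},
    {"structure": "bulleted", "goal": "exercise", "persona": "hesitant"},
]
AXES = ("structure", "goal", "persona")


def empirical_priors(notes):
    """Estimate a categorical prior per axis from relative frequencies."""
    priors = {}
    for axis in AXES:
        counts = Counter(note[axis] for note in notes)
        total = sum(counts.values())
        priors[axis] = {value: n / total for value, n in counts.items()}
    return priors


def sample_spec(priors, rng):
    """Draw one value per axis, treating the axes as independent."""
    spec = {}
    for axis in AXES:
        values = list(priors[axis])
        weights = [priors[axis][v] for v in values]
        spec[axis] = rng.choices(values, weights=weights, k=1)[0]
    return spec


priors = empirical_priors(SEED_NOTES)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
specs = [sample_spec(priors, rng) for _ in range(5)]
for spec in specs:
    print(spec)
```

Because the axes are sampled independently, a sampled specification can combine values that never co-occur in the seed set, which is one way such a scheme could yield more diverse notes than the seed set alone. In a full pipeline, each specification would presumably condition a text generator that writes the note.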