MaritimEmails: A Synthetic Dataset for Maritime Chartering Correspondence
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We introduce MaritimEmails, a large-scale synthetic corpus of 19,817 English-language email threads simulating maritime chartering negotiations between brokers and charterers. Email remains a dominant medium for business communication, yet confidentiality constraints mean that no public corpus exists for this highly specialized domain. To address this gap, we generate domain-plausible negotiation exchanges using five contemporary language models under multiple prompting strategies, including Attribute Prompting and Base–Refine (BARE) approaches. Each thread includes structured annotations for vessels, ports, commodities, and Incoterms, enabling supervised training for information extraction and related tasks. Our comparative evaluation, covering lexical and semantic diversity, sentiment balance, and verbosity, shows that BARE generation increases linguistic variation while maintaining coherence. However, all models exhibit a systematic positivity bias: they produce less negative sentiment than is observed in the Enron reference corpus, and likely less than occurs in many real negotiation settings. Baseline information extraction experiments with GLiNER and generative Qwen models reach up to 0.86 macro F1 on entity extraction, indicating that the annotations can support supervised extraction. MaritimEmails is released for research use, together with prompts, scripts, and documentation.