NOVELSUM: Evaluating Long-Form Summary Generation for Historical Scandinavian Novels

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

Abstract

We study long-form summarization of late-19th-century Danish and Norwegian novels and propose NOVELSUM, an evaluation resource and protocol tailored to literary narrative. Using a curated set of historical novels paired with professional reference summaries, we establish baselines with long-document encoder–decoder models and prompted long-context large language models (LLMs). We evaluate with automatic metrics, expert human judgments, and LLM-as-judge scoring. Our human study identifies evaluation dimensions and literary facets that achieve substantial inter-annotator agreement and align with scholarly expectations. We further analyze reference-free evaluation, showing where it tracks expert trends and where it fails (notably for factual and setting-related criteria), thereby clarifying its utility when gold references or expert readers are unavailable. Our results benchmark long-context and prompted LLM approaches on historical literary prose and offer a practical path for human-grounded and reference-free assessment.