InstructSum: A Benchmark to Evaluate Instruction-Following Capability of Large Language Models in Summarization
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Pre-trained large language models (LLMs) align their outputs with user intent through natural language instructions. Summarization inherently requires concise output, which makes the instruction-following capability of LLMs particularly important: supplementary information beyond what the instruction requests can be undesirable. In this study, we introduce InstructSum, a novel benchmark consisting of 3,309 instructions for evaluating the instruction-following capability of LLMs in summarization. InstructSum pairs each source text with multiple instructions, enabling evaluation of how LLMs adjust the content of a summary according to the instruction. Our experiments with six LLM families reveal the challenges LLMs face in this task: for example, LLMs produce polite and helpful-sounding responses padded with irrelevant information, going beyond the instructions and failing to respond with a concise summary.