Construction of a Japanese RAG Benchmark Using Synthetic Documents on Non-existent Entities and Events
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Retrieval-augmented generation (RAG) is a technique in which a large language model (LLM) generates answers based on relevant documents retrieved from an external document collection. Existing RAG evaluation benchmarks often use public data, such as Wikipedia articles and news articles, as the external document collection. However, such data are very likely to already be included in the LLM’s pre-training corpus, which can prevent an accurate evaluation of the model’s ability to generate answers grounded in the retrieved documents. In this study, we construct a Japanese RAG benchmark by having an LLM synthesize documents about non-existent entities and events and using this collection of synthetic documents as the retrieval target. Because these synthetic documents are not included in the LLM’s training data, the ability to generate answers based on retrieved documents can be evaluated more accurately. In addition to the synthetic documents, the benchmark comprises questions and correct answers, created through a combination of LLMs and human effort. Using the constructed benchmark, we evaluate and analyze the RAG performance of existing LLMs.