DEJIMA: A Novel Large-Scale Japanese Dataset for Image Captioning and Visual Question Answering
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Vision-and-Language (V&L) models depend on large-scale, high-quality datasets, yet most resources are English-centric, and existing Japanese V&L datasets face a fundamental trade-off: manually annotated corpora offer quality but limited scale, translated datasets introduce unnatural phrasing and cultural bias, and web-crawled collections achieve scale but suffer from noise and poor grounding. To resolve this trade-off, we propose DEJIMA, a novel pipeline whose key idea is detection-guided LLM refinement: object detection first extracts visually verifiable evidence (labels and bounding boxes), and an LLM then generates or refines Japanese text conditioned on this evidence, ensuring both factual grounding and linguistic naturalness without costly human annotation. Using this pipeline, we build two resources, an image–caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing approximately 3.88M image–text pairs, more than 20 times the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than translation- or annotation-based baselines, while maintaining factual correctness comparable to that of human-annotated corpora. Models trained on DEJIMA show consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All pipeline components are licensed for commercial use, and we publicly release the dataset and metadata to support further research and applications. Our project page is available at https://mil-tokyo.github.io/DEJIMA-dataset/.
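To make the detection-guided refinement idea concrete, the sketch below shows how detector output might be serialized into evidence that constrains an LLM's rewrite of a caption. It is a minimal illustration, not the paper's implementation: the Detection record, format_evidence, and build_refinement_prompt are hypothetical names, the dummy detections stand in for a real object detector, and the actual LLM call is omitted.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical record for one detector hit (not from the paper)."""
    label: str                          # object class name, e.g. "torii gate"
    box: tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixel coordinates
    score: float                        # detector confidence in [0, 1]

def format_evidence(detections: list[Detection], min_score: float = 0.5) -> str:
    """Serialize confident detections into a textual evidence block."""
    lines = [
        f"- {d.label} at box {d.box}"
        for d in detections
        if d.score >= min_score
    ]
    return "\n".join(lines)

def build_refinement_prompt(draft_caption: str, detections: list[Detection]) -> str:
    """Condition the LLM on visually verified evidence so the refined
    Japanese caption mentions only objects the detector actually saw."""
    evidence = format_evidence(detections)
    return (
        "Detected objects (visual evidence):\n"
        f"{evidence}\n\n"
        f"Draft caption: {draft_caption}\n\n"
        "Rewrite the caption as one natural Japanese sentence. "
        "Mention only objects listed in the evidence; do not invent new ones."
    )

# Example usage with dummy detections standing in for a real detector's output.
dets = [
    Detection("torii gate", (120, 40, 480, 600), 0.93),
    Detection("lantern", (510, 300, 580, 420), 0.71),
]
print(build_refinement_prompt("A red gate near some lights.", dets))
```

The key design point this sketch illustrates is that the LLM never sees the image directly; it sees only evidence that has already been visually verified by the detector, which is what lets the pipeline combine LLM fluency with factual grounding.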