ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Recent advances in Large Language Models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists annotated with metadata such as genre, debut year, and topic. Together, these resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy: open-source models gain up to +56.8 percentage points (pp; Qwen3 8B: 35.0→91.8), approaching proprietary performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, yielding strong improvements on in-domain and out-of-domain benchmarks alike. MusWikiDB also yields +6 pp higher accuracy and 67% faster retrieval than a general Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific QA, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.
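The RAG pipeline evaluated above can be illustrated with a minimal sketch: retrieve the top-k passages most similar to a question from a passage store, then build a context-grounded prompt for the LLM. The corpus, question, and bag-of-words similarity below are hypothetical stand-ins for illustration only; MusWikiDB itself is a dense vector database, and its actual retrieval stack is not shown here.

```python
# Hypothetical sketch of retrieval-augmented MQA: toy lexical retrieval
# standing in for dense vector search over MusWikiDB-style passages.
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = Counter(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages as grounding context for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented passages in the style of music-Wikipedia text (not from MusWikiDB).
corpus = [
    "Nina Simone debuted in 1958 with the album Little Girl Blue.",
    "The theremin is an electronic instrument played without physical contact.",
    "Miles Davis released Kind of Blue in 1959.",
]
top = retrieve("When did Nina Simone debut?", corpus, k=1)
prompt = build_prompt("When did Nina Simone debut?", top)
```

In the full setup, the token-count vectors would be replaced by learned embeddings and an approximate-nearest-neighbor index, but the retrieve-then-prompt structure is the same.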