JobResQA: Semi-Automatic Multilingual Benchmark Creation for LLM Machine Reading Comprehension on Résumés and Job Descriptions
The Fourth Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2026)
Abstract
We present a methodology for building privacy-preserving multilingual QA benchmarks in low-resource and sensitive domains, demonstrated through JobResQA, a multilingual machine reading comprehension (MRC) benchmark over synthetic HR documents. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning four types defined by document source (intra- vs. cross-document) and reasoning complexity (single-hop vs. multi-hop). We propose a privacy-preserving synthetic data pipeline applicable to other sensitive domains, in which demographic attributes are controlled via placeholders to enable future bias studies. Our cost-effective, human-in-the-loop translation pipeline, based on the TEaR methodology, incorporates MQM error annotations and selective post-editing. Baseline evaluations of multiple open-weight LLM families, scored with an LLM-as-judge, show higher performance on English and Spanish and substantial degradation on the other languages, highlighting critical cross-lingual MRC gaps. Our pipeline, in which LLMs act as synthesizers, translators, and evaluators under human oversight, constitutes a reusable methodology for resource creation and a case study in the evaluation-integrity challenges of LLM-era benchmark construction.
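As an illustration of the question taxonomy described above, the following minimal sketch enumerates the four question types as the cross product of the two axes (document source and reasoning complexity). All class, field, and constant names here are hypothetical and chosen for illustration only; they are not taken from the JobResQA release, whose actual schema may differ.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Source(Enum):
    """Where the evidence for an answer lives (hypothetical naming)."""
    INTRA = "intra-document"   # a single document: the résumé or the job description
    CROSS = "cross-document"   # both the résumé and the job description


class Reasoning(Enum):
    """How many inference steps the question requires (hypothetical naming)."""
    SINGLE_HOP = "single-hop"
    MULTI_HOP = "multi-hop"


@dataclass(frozen=True)
class QuestionType:
    source: Source
    reasoning: Reasoning

    @property
    def label(self) -> str:
        return f"{self.source.value}, {self.reasoning.value}"


# The four question types are the cross product of the two axes.
QUESTION_TYPES = [QuestionType(s, r) for s, r in product(Source, Reasoning)]

# The five benchmark languages listed in the abstract.
LANGUAGES = ["English", "Spanish", "Italian", "German", "Chinese"]

if __name__ == "__main__":
    for qt in QUESTION_TYPES:
        print(qt.label)
```

Modelling the two axes as separate enumerations, rather than as four flat labels, would make it straightforward to aggregate results along either axis, for example to compare single-hop against multi-hop accuracy within each language.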