DR-RAG: Addressing Retrieval Misalignment in Low-Resource Urdu Question Answering
Proceedings of the Second Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026)
Abstract
Retrieval-Augmented Generation (RAG) performs well on English QA benchmarks but degrades considerably in morphologically rich, low-resource languages. Urdu presents a particularly challenging case: heavy inflectional morphology, Nastaliq script inconsistencies, and limited training data produce a systematic mismatch between query representations and indexed document content that standard retrieval architectures cannot bridge. We propose DR-RAG (Dual-Representation Retrieval-Augmented Generation), which addresses this mismatch through dual indexing: each document is represented both as overlapping text chunks and as automatically generated question-answer pairs. Queries are first matched against the QA index, which aligns more reliably with natural query phrasing than declarative document chunks do. When retrieval confidence falls below τ = 0.80, the system falls back to chunk-based retrieval, maintaining coverage without sacrificing precision. Evaluated on Urdu UQA and English SQuAD 2.0, DR-RAG improves Urdu METEOR by 38× and ROUGE-1 by 140%, and reduces generation latency by 43%. LLM-as-a-judge scores show higher faithfulness (3.03 vs. 1.93) and overall quality (2.99 vs. 2.21) than MultiVector. English performance remains competitive throughout. These results indicate that representation-level alignment between queries and indexed content, rather than increased model complexity, is the critical factor for reliable retrieval in underserved South Asian languages.
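The QA-first retrieval with a confidence-gated fallback described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for whatever encoder the system actually uses, cosine similarity is assumed as the confidence score, and the names `embed`, `cosine`, `best_match`, and `retrieve` are hypothetical. Only the threshold τ = 0.80 and the order of the two lookups come from the text.

```python
import math
from collections import Counter

TAU = 0.80  # confidence threshold stated in the abstract


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(query_vec, texts):
    """Return (score, index) of the most similar text, or (0.0, None) if texts is empty."""
    best_score, best_idx = 0.0, None
    for i, text in enumerate(texts):
        score = cosine(query_vec, embed(text))
        if best_idx is None or score > best_score:
            best_score, best_idx = score, i
    return best_score, best_idx


def retrieve(query, qa_pairs, chunks, tau=TAU):
    """Query the QA index first; fall back to the chunk index when confidence < tau."""
    q = embed(query)
    # Stage 1: match the query against the generated questions.
    score, idx = best_match(q, [question for question, _ in qa_pairs])
    if idx is not None and score >= tau:
        return {"source": "qa", "score": score, "context": qa_pairs[idx][1]}
    # Stage 2: confidence too low, so fall back to chunk-based retrieval.
    score, idx = best_match(q, chunks)
    context = chunks[idx] if idx is not None else ""
    return {"source": "chunk", "score": score, "context": context}
```

Given `qa_pairs` as a list of `(question, answer)` tuples and `chunks` as a list of strings, `retrieve(query, qa_pairs, chunks)` returns the answer of the best-matching question when its similarity is at least τ, and the best-matching chunk otherwise. The returned `source` field records which index served the query, which makes it easy to measure how often the fallback fires.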