Infox-QC: A Quebec-Focused French Corpus for Misinformation Detection and AI Robustness Assessment
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Online misinformation spreads widely through social media and political campaigns, making the detection of false claims crucial for mitigating societal risks. Most fake news datasets are developed in English, leaving a critical gap for languages that are low-resource for this task, such as French. To address this, we introduce Infox-QC, a French-language corpus focused on misinformation relevant to the Quebec region. In addition to real true and fake news, Infox-QC includes two subsets of AI-generated fake news: one created by prompting an AI to paraphrase existing fake news, and another created by prompting an AI to fabricate fake news from real true reports. These subsets make it possible to test the robustness of detection systems against fabricated content, which modern Large Language Models (LLMs) can generate convincingly. We establish baselines using traditional machine learning methods, BERT-based models, and LLMs, both with and without Retrieval-Augmented Generation (RAG). Our results show that RAG-augmented LLMs offer the strongest contextual understanding, while traditional models provide valuable interpretable baselines. We further provide an exploratory human–LLM thematic agreement analysis to assess annotation consistency. Infox-QC fills a critical gap in French-language NLP research and supports future work on the regional and cultural dimensions of misinformation through cross-linguistic comparison.