Evaluation of Two Leading Polish Language Models in a Real-world RAG Scenario
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
This paper presents a comparative evaluation of two leading Polish instruction-tuned language models, Bielik-11B-v2.3-Instruct and PLLuM-12B-nc-chat, within a real-world Retrieval-Augmented Generation (RAG) system built for the technical documentation of a low-code platform. The study aims to identify the optimal configuration of retrieval and generation components for Polish-language applications. The evaluation was conducted in two stages. First, several embedding models and retrieval methods were tested using standard information retrieval metrics, including Normalized Discounted Cumulative Gain (NDCG). The OrlikB/KartonBERT-USE-base-v1 model combined with vector-based retrieval achieved the highest performance and was adopted for the second stage. In the generation stage, both models were evaluated using quantitative scoring and pairwise A/B testing with multiple evaluators to ensure robust judgments. The results show that Bielik-11B-v2.3-Instruct consistently outperformed PLLuM-12B-nc-chat in producing accurate and contextually relevant answers. The study highlights the importance of constructing a reliable golden set, employing a two-stage evaluation pipeline, and selecting appropriate metrics for objective and reproducible assessment of RAG systems in real-world Polish-language settings.