Merimënga: A Manifest-First Pipeline for Reproducible Albanian Web Corpus Construction

Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora

DOI:10.63317/4aesmyqqveeo

Abstract

We present Merimënga, a pipeline for reproducible Albanian web-corpus construction from Common Crawl. Rather than distributing a static text dump, we publish versioned manifests and append-only JSONL ledgers that make every retrieval and filtering decision replayable at record level. Records are addressed by (WARC filename, byte offset, byte length) and retrieved via HTTP range requests with checksum validation, enabling selective download, resumability, and exact re-materialization. On top of deterministic cleaning and deduplication, Merimënga supports teacher–student filtering: a large LLM labels a stratified sample; the resulting policy is distilled into a faster student model applied at corpus scale. The paper contributes (i) a reproducibility specification for web-corpus construction based on coordinate-addressed retrieval and decision ledgers, (ii) a concrete instantiation for Albanian with language-specific filtering, and (iii) an evaluation protocol for rerun equivalence and filter-stack ablation. Large-scale download and full-corpus filtering are ongoing; this submission focuses on methodology and auditable artifacts rather than final corpus statistics.

Keywords: Common Crawl, Reproducibility, Corpus Construction, Learned Filtering, Albanian
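
The coordinate-addressed retrieval described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `RecordCoordinate` class, the `sha256` manifest field, the choice of SHA-256 as the checksum, and the ledger field names (`decision`, `reason`) are all assumptions made for illustration. The sketch shows how a (WARC filename, byte offset, byte length) triple maps to an HTTP `Range` header, how a fetched payload could be validated, and how one decision could be serialized as a JSONL ledger line. No network access is performed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecordCoordinate:
    """One manifest entry (hypothetical schema, not taken from the paper)."""
    warc_filename: str
    offset: int   # byte offset of the record inside the WARC file
    length: int   # byte length of the record
    sha256: str   # assumed checksum field; the paper does not name the hash


def range_header(coord: RecordCoordinate) -> dict:
    """Build the HTTP Range header for this record.

    HTTP byte ranges are inclusive at both ends, so the last byte
    is offset + length - 1.
    """
    last_byte = coord.offset + coord.length - 1
    return {"Range": f"bytes={coord.offset}-{last_byte}"}


def validate(coord: RecordCoordinate, payload: bytes) -> bool:
    """Check that the fetched bytes match the manifest's length and checksum."""
    if len(payload) != coord.length:
        return False
    return hashlib.sha256(payload).hexdigest() == coord.sha256


def ledger_line(coord: RecordCoordinate, decision: str, reason: str) -> str:
    """Serialize one decision as a single JSONL line (assumed field names)."""
    entry = {**asdict(coord), "decision": decision, "reason": reason}
    # sort_keys gives a stable byte-level serialization across reruns.
    return json.dumps(entry, sort_keys=True, ensure_ascii=False)


# Usage with an in-memory payload standing in for a fetched record.
payload = b"example record bytes"
coord = RecordCoordinate(
    warc_filename="example.warc.gz",  # placeholder name
    offset=1024,
    length=len(payload),
    sha256=hashlib.sha256(payload).hexdigest(),
)
headers = range_header(coord)
ok = validate(coord, payload)
line = ledger_line(coord, "keep", "checksum_ok")
```

In a real run, `headers` would accompany a GET request for the WARC file, and `line` would be appended to the ledger file opened in append mode, so that a later rerun can replay each decision from the same coordinates.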

Details

Paper ID
lrec2026-ws-cmlc-03
Pages
pp. 25-31
BibKey
kabashi-etal-2026-merimënga
Editors
Piotr Bański, Dawn Knight, Marc Kupietz, Andreas Witt, Alina Wróblewska
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the 12th Workshop on Challenges in the Management of Large Corpora
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Besim Kabashi
  • Michael Ruppert
