
FeedFetcher: A Resilient Web Feed Downloader for Corpus Construction

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/49txz3zreas2

Abstract

Building large-scale, timestamped monitor corpora requires robust and efficient tools for continuous web data acquisition. We present FeedFetcher, an open-source, lightweight yet resilient downloader designed to collect linguistic data from RSS/Atom web feeds. The tool enables continuous corpus updates by harvesting newly published web content with minimal downtime and high data integrity. Implemented in Rust for performance, memory safety, and scalable concurrency, FeedFetcher supports thousands of simultaneous connections while maintaining server politeness. The software is available under the GPL-3.0 license at https://github.com/ondra/feed_fetcher. In our setup, the workflow integrates FeedFetcher with downstream text-processing pipelines for tokenization, lemmatization, corpus compilation, and deployment. The system is currently used to update monitor corpora in 64 languages, producing approximately two billion tokens per month. These corpora are available in Sketch Engine. We also describe methods for discovering new web feeds, combining manual exploration with automated extraction from large-scale web crawls to expand linguistic coverage. We demonstrate the system’s applicability through a time-based analysis of word-frequency change, showing how long-term accumulation of timestamped data supports the study of lexical dynamics and language evolution.

Details

Paper ID
lrec2026-main-558
Pages
pp. 7014-7022
BibKey
herman-etal-2026-feedfetcher
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Ondřej Herman
  • Jan Kraus
  • Vit Suchomel
