TextLens & LeTTuce: Automated Corpus Annotation and Multilingual Tagging as a Service
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
We present TextLens, a web-based platform for automated linguistic annotation designed to lower technical barriers for researchers in digital humanities, linguistics, and translation studies. Hosted by the Dutch Language Institute (INT), TextLens lets users upload corpora in a variety of formats (.txt, .tsv, CoNLL-U, FoLiA, TEI, and NAF) and annotate them with state-of-the-art NLP tools, without local installation or local computational resources. The platform supports multilingual data processing and provides a persistent dashboard for managing, monitoring, and sharing annotation projects. Alongside this service, we introduce the LeTTuce-PoS Dataset, a new multilingual, manually annotated dataset for part-of-speech tagging in English, French, Dutch, and German that covers multiple genres. We also report benchmark results for four PoS taggers (LeTs Preprocess, LeTTuce, spaCy, and Stanza) on this dataset. Together, TextLens and the LeTTuce-PoS Dataset offer an accessible, scalable platform for high-quality annotation and a robust multilingual dataset, both of which support comparable and reproducible research in multilingual contexts.