Revisiting Masking After Fifteen Years: Early Approaches to Non-Reconstructable Linguistic Data in the Current Context
Proceedings of Leveraging Derived Text Formats to Unlock Copyrighted Collections for Open Science (DTF) @ LREC 2026
Abstract
This paper revisits the masking approaches introduced in 2007 for enabling the distribution of linguistically annotated corpora without exposing copyrighted or sensitive source texts and situates them within the contemporary framework of Derived Text Formats (DTF). While the original work demonstrated how syntactic and morphological information could be preserved through parameterised masking, today’s landscape, which is shaped by large language models, FAIR requirements, and emerging standardisation efforts, demands more formalised, robust and reproducible methods. We outline how DTF extend early masking concepts by introducing explicit abstraction levels, reversibility classes, and machine-actionable provenance, supported by standards such as TEI, ISO linguistic annotation models, CMDI metadata, and the draft DIN DTF specification. Building on these foundations, we present a modern workflow for DTF generation, including enrichment pipelines, structural abstractions, statistical and embedding-based representations, and non-reversible transformation layers, illustrated through the MONA-pipe framework. We conclude that DTF constitute a sustainable and infrastructure-ready solution for open, reproducible and legally secure text-based research in the decades to come.