Why Reconstructing Scrambled Texts Fails
Proceedings of Leveraging Derived Text Formats to Unlock Copyrighted Collections for Open Science (DTF) @ LREC 2026
Abstract
This paper explores the limitations of reconstructing scrambled text within the context of Derived Text Formats (DTFs). While previous research has treated reconstruction as a technical challenge, this study shifts the focus to the causes of reconstruction failure. A detailed analysis of outputs generated by language models on non-literary (IMDb reviews) and literary (Gutenberg texts) datasets reveals several systematic patterns. First, reconstructed texts are generally shorter than the originals, indicating that the generated results are often incomplete. Second, models simplify expressions by omitting specific modifiers, thereby producing more general outputs. Third, high similarity at the string level does not guarantee semantic equivalence, which exposes fidelity problems in text reconstruction. In literary texts, chunk-based segmentation poses an additional challenge: it disrupts syntactic and contextual coherence, leading to sentences that are structurally correct but semantically distorted. These findings suggest that reconstruction difficulty is not merely a matter of model performance but also reflects the importance of higher-level textual organization. The study highlights fundamental limitations of current language models and reframes reconstruction failure as an analytical perspective for understanding how meaning is constructed in text.
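The patterns named in the abstract can be made concrete with a minimal sketch. It rests on assumptions not stated above: the scrambling scheme (shuffling tokens within fixed-size chunks), the chunk size, the helper names `scramble_in_chunks` and `string_similarity`, and both example sentences are hypothetical illustrations, not the paper's data or method. The sketch shows how a shortened reconstruction that drops modifiers can still score high on a character-level similarity measure while changing the meaning.

```python
import random
from difflib import SequenceMatcher


def scramble_in_chunks(text, chunk_size=5, seed=0):
    """Shuffle whitespace tokens inside consecutive fixed-size chunks.

    Hypothetical stand-in for a scrambled DTF; the real format may differ.
    """
    rng = random.Random(seed)
    tokens = text.split()
    scrambled = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        rng.shuffle(chunk)  # token order is lost, chunk membership is kept
        scrambled.extend(chunk)
    return " ".join(scrambled)


def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Invented example pair (not taken from IMDb or Gutenberg).
original = "The film was not good , but the acting was remarkably subtle ."
# A plausible-looking reconstruction that omits two modifiers.
reconstruction = "The film was good , but the acting was subtle ."

scrambled = scramble_in_chunks(original)
similarity = string_similarity(original, reconstruction)

print("scrambled:     ", scrambled)
print("reconstruction:", reconstruction)
print("shorter than original:", len(reconstruction) < len(original))
print("string similarity:", round(similarity, 3))
```

In this toy pair, dropping "not" reverses the evaluation of the film and dropping "remarkably" generalizes the description, yet most characters still align. A string-level score alone would therefore not flag the loss of fidelity.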