A Multi-dimensional Constrained Framework for Derived Text Formats
Proceedings of Leveraging Derived Text Formats to Unlock Copyrighted Collections for Open Science (DTF) @ LREC 2026
Abstract
Derived Text Formats (DTFs) have been proposed as a way to enable text and data mining without infringing copyright. Building on a review of recent empirical studies that evaluate DTFs for topic modeling, authorship classification, and sentiment analysis, this paper argues that DTFs should be treated not as static formats but as variable, task-dependent representations shaped by multiple interacting factors. We therefore propose a multi-dimensional framework that conceptualizes DTFs as configurations within a structured space defined by both internal representation parameters and external constraints. The framework comprises four internal representation dimensions (feature level, degree of reduction, transformation strategy, and aggregation level) and two external constraining forces: legal requirements and task-specific information needs. By emphasizing the interdependence of these dimensions, the framework provides a systematic way to describe, compare, and design DTFs across different analytical contexts. The paper thereby contributes to a more theoretically grounded understanding of DTFs and offers guidance for their responsible and effective use in text and data mining in the Digital Humanities.
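To make the idea of a DTF as a configuration in a constrained space more concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the formalization used in this paper: the class name `DTFConfiguration`, the function `is_admissible`, the example values for each dimension, the 0-to-1 scale for the degree of reduction, and both example constraints are hypothetical. In particular, the threshold in `legal_requirement` is invented for the example and carries no legal meaning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DTFConfiguration:
    """One point in the configuration space: the four internal dimensions."""
    feature_level: str            # hypothetical values, e.g. "token", "lemma"
    degree_of_reduction: float    # assumed scale: 0.0 = nothing removed, 1.0 = everything removed
    transformation_strategy: str  # hypothetical values, e.g. "count", "shuffle", "none"
    aggregation_level: str        # hypothetical values, e.g. "sentence", "document"


# An external constraint is modeled here as a predicate over a configuration.
Constraint = Callable[[DTFConfiguration], bool]


def is_admissible(config: DTFConfiguration, legal: Constraint, task: Constraint) -> bool:
    """A configuration is admissible only if both external constraints hold."""
    return legal(config) and task(config)


def legal_requirement(config: DTFConfiguration) -> bool:
    # Hypothetical rule: require at least half of the information to be removed.
    return config.degree_of_reduction >= 0.5


def task_need(config: DTFConfiguration) -> bool:
    # Hypothetical rule: the task needs lexical features (tokens or lemmas).
    return config.feature_level in {"token", "lemma"}


bag_of_words = DTFConfiguration("token", 0.6, "count", "document")
near_full_text = DTFConfiguration("token", 0.0, "none", "document")

bag_of_words_ok = is_admissible(bag_of_words, legal_requirement, task_need)
near_full_text_ok = is_admissible(near_full_text, legal_requirement, task_need)
```

Modeling the external forces as predicates over the whole configuration, rather than as per-dimension limits, reflects the claim that the dimensions are interdependent: a constraint may accept or reject a combination of values that no single dimension determines on its own.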