From One-Hot to Semantic Encoding: Entity Embedding for Small and Heterogeneous Digital Humanities Datasets
Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026
Abstract
This paper investigates the use of semantic encoding for the analysis of heterogeneous digital literature metadata. Drawing on two databases of Latin American digital literature, the Archivo de Literatura Digital en América Latina and the Atlas da Literatura Digital Brasileira, we compare traditional one-hot encoding with a semantically enriched representation derived from feature-value descriptions embedded in a continuous vector space. In contrast to one-hot encoding, which treats categorical values as mutually orthogonal, semantic encoding accounts for similarity between features, thereby mitigating vocabulary mismatch across databases. We evaluate both approaches using between-group centroid distances and normalized centrality measures. Our results show that semantic encoding clarifies structural differentiation across genres and may smooth arbitrary differences introduced by differing vocabularies across databases. The findings suggest that semantic representations provide a more interpretable embedding space for small and taxonomically heterogeneous datasets. Beyond technical performance, the study indicates that embedding-based methods can support critical inquiry in digital humanities, enabling the examination of database bias, categorical patterns, and diachronic evolution within a unified semantic framework. Code is available at https://github.com/isag91/semantic-encoding-DH.
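The contrast between the two encodings can be illustrated with a minimal sketch. It is not the implementation from the repository: the category labels, the hand-made three-dimensional vectors standing in for embeddings, and the helper names (`one_hot`, `cosine`, `centroid`, `euclidean`) are all assumptions chosen for illustration, and Euclidean distance is assumed for the between-group centroid comparison.

```python
import math

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over the vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical labels: two databases naming a similar genre differently,
# plus one unrelated genre.
vocabulary = ["hypertext", "hipertexto", "generative poetry"]

# Toy vectors standing in for embeddings of feature-value descriptions.
# The numbers are invented; a real pipeline would obtain them from an
# embedding model.
toy_embedding = {
    "hypertext": [0.9, 0.1, 0.0],
    "hipertexto": [0.8, 0.2, 0.0],
    "generative poetry": [0.1, 0.9, 0.2],
}

# One-hot: distinct labels are always orthogonal, however close in meaning.
one_hot_similarity = cosine(one_hot("hypertext", vocabulary),
                            one_hot("hipertexto", vocabulary))

# Semantic: near-synonymous labels end up close together.
semantic_similarity = cosine(toy_embedding["hypertext"],
                             toy_embedding["hipertexto"])

# Between-group centroid distance in the toy semantic space.
hypertext_group = [toy_embedding["hypertext"], toy_embedding["hipertexto"]]
poetry_group = [toy_embedding["generative poetry"]]
centroid_distance = euclidean(centroid(hypertext_group), centroid(poetry_group))

print(one_hot_similarity, semantic_similarity, centroid_distance)
```

In this toy setting the one-hot similarity between the two hypertext labels is exactly zero, while the toy semantic vectors place them close together, which is the vocabulary-mismatch effect described above.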