
From One-Hot to Semantic Encoding: Entity Embedding for Small and Heterogeneous Digital Humanities Datasets

Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026

DOI:10.63317/4qrbgxhxkpzs

Abstract

This paper investigates the use of semantic encoding for the analysis of heterogeneous digital literature metadata. Drawing on two databases of Latin American digital literature, the Archivo de Literatura Digital en América Latina and the Atlas da Literatura Digital Brasileira, we compare traditional one-hot encoding with a semantically enriched representation derived from feature-value descriptions embedded in a continuous vector space. In contrast to one-hot encoding, which treats categorical values as orthogonal, semantic encoding accounts for similarity between features, thereby mitigating vocabulary mismatch across databases. We evaluate both approaches using between-group centroid distances and normalized centrality measures. Our results show that semantic encoding clarifies structural differentiation across genres and may smooth arbitrary differences introduced by differing vocabularies across databases. The findings suggest that semantic representations provide a more interpretable embedding space for small and taxonomically heterogeneous datasets. Beyond technical performance, the study suggests that embedding-based methods can support critical inquiry in digital humanities, enabling the examination of database bias, categorical patterns, and diachronic evolution within a unified semantic framework. Code is available at https://github.com/isag91/semantic-encoding-DH.
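
To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's implementation (see the repository linked above for that). It assumes three hypothetical genre labels and hand-written 3-dimensional toy vectors standing in for the output of a text-embedding model. All names (`vocabulary`, `toy_embedding`, `one_hot`, `centroid`) and all numeric values are illustrative assumptions. It shows that one-hot vectors for two differently named but related labels are orthogonal, whereas embedded labels can be close, and how a between-group centroid distance could be computed in either space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def one_hot(value, vocabulary):
    """One-hot vector: 1.0 at the position of `value`, 0.0 elsewhere."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

# Hypothetical labels: two databases naming a similar genre differently,
# plus one unrelated genre. These are NOT taken from the actual datasets.
vocabulary = ["hypertext fiction", "hiperficción", "generative poetry"]

# Made-up toy vectors standing in for a real text-embedding model's output.
toy_embedding = {
    "hypertext fiction": [0.9, 0.1, 0.2],
    "hiperficción": [0.8, 0.2, 0.3],
    "generative poetry": [0.1, 0.9, 0.4],
}

# One-hot: distinct labels are always orthogonal (similarity 0).
one_hot_sim = cosine(one_hot("hypertext fiction", vocabulary),
                     one_hot("hiperficción", vocabulary))

# Semantic: related labels can end up close in the embedding space.
semantic_sim = cosine(toy_embedding["hypertext fiction"],
                      toy_embedding["hiperficción"])

# Between-group centroid distance in the toy semantic space.
group_a = [toy_embedding["hypertext fiction"], toy_embedding["hiperficción"]]
group_b = [toy_embedding["generative poetry"]]
centroid_distance = euclidean(centroid(group_a), centroid(group_b))

print(f"one-hot similarity:  {one_hot_sim:.3f}")
print(f"semantic similarity: {semantic_sim:.3f}")
print(f"centroid distance:   {centroid_distance:.3f}")
```

With a real embedding model, the toy dictionary would be replaced by vectors computed from the feature-value descriptions. Which model and which distance measure to use are choices this sketch leaves open.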

Details

Paper ID
lrec2026-ws-llms4ssh-15
Pages
pp. 147-152
BibKey
gribomont-2026-one
Editors
Arturo Montejo-Raez, Cristina Grisot, Joanna Blochowiak, Nikola Ljubešić, Elena Battaner, German Rigau
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of Shaping Multilingual, Multimodal AI for the Social Sciences and Humanities (LLMs4SSH) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

    Isabelle Gribomont

Links