
Integrating TEI, NER/NEL, Textometry, and Linked Data for a Semantically Enriched Interview Corpus

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/4zjef9ycsupk

Abstract

This paper presents a pipeline that converts unstructured interview transcripts into a semantically enriched, queryable knowledge resource. The texts from the Digitalne Ikone 20+ interview collection were first encoded in TEI (Text Encoding Initiative) XML, marking interview boundaries, paragraph breaks, speaker turns with identifiers, dates, and topics. This structural encoding underpins downstream NLP and enables structured querying (e.g., by speaker). In the second stage, we applied Named Entity Recognition (NER) to identify persons, places, organizations, and events, and embedded the results directly in the TEI. In the third stage, Named Entity Linking (NEL) mapped entity mentions to canonical Wikidata identifiers via context-aware disambiguation; missing entries were added to Wikidata when necessary. The resulting TEI+NER/NEL corpus is serialized as linked data following the NLP Interchange Framework (NIF). The pipeline also supports retrieval-augmented summarization, which retrieves evidence passages and prompts LLMs (implemented with DSPy) to produce faithful interview summaries. We discuss design choices (TXM for textometry with JeRTeh resources; TESLA models for NER/NEL), report qualitative gains in interpretability through semantic links, and outline future work on domain-adapted NER/NEL, graph-based completion, and more expressive RAG architectures. The approach is replicable for other oral-history or media corpora and advances practical, evidence-grounded access to cultural archives.
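To make the idea of speaker-level querying over entity-annotated TEI concrete, here is a minimal sketch, not the authors' implementation. It assumes speaker turns are encoded as `<u who="...">` elements and that linked entities carry a `ref` attribute pointing to a Wikidata URI; the abstract does not specify the actual element choices. The sample text, the speaker identifiers, the function name `entities_by_speaker`, and the Wikidata URIs are all illustrative.

```python
import xml.etree.ElementTree as ET

# TEI namespace, used to address elements in the sample document.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment: element choices (<u>, <persName>, <placeName>)
# and the @ref values are assumptions made for this illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="interview" xml:id="int1">
      <u who="#host">Where did you first read about <persName ref="http://www.wikidata.org/entity/Q9036">Nikola Tesla</persName>?</u>
      <u who="#guest">In a library in <placeName ref="http://www.wikidata.org/entity/Q3711">Belgrade</placeName>.</u>
    </div>
  </body></text>
</TEI>"""


def entities_by_speaker(tei_xml, speaker):
    """Return (element name, mention text, linked URI) for one speaker's turns."""
    root = ET.fromstring(tei_xml)
    hits = []
    for turn in root.iterfind(".//tei:u", NS):
        if turn.get("who") != speaker:
            continue
        # Any descendant carrying @ref is treated as a linked entity mention.
        for el in turn.iter():
            ref = el.get("ref")
            if ref is not None:
                local_name = el.tag.split("}", 1)[1]  # strip the namespace
                hits.append((local_name, "".join(el.itertext()), ref))
    return hits


if __name__ == "__main__":
    for name, mention, uri in entities_by_speaker(SAMPLE, "#guest"):
        print(name, mention, uri)
```

Keeping the entity links inline in the TEI, as the abstract describes, means a single pass over the document can answer questions that combine structure (who is speaking) with semantics (which entity is mentioned).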

Details

Paper ID
lrec2026-main-114
Pages
pp. 1474-1484
BibKey
stankovic-etal-2026-integrating
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11 May 2026 to 16 May 2026

Authors

  • Ranka Stankovic

  • Tamara Vučenović

  • Biljana Rujević

  • Milica Ikonić Nešić

  • Mihailo Škorić

Links