
Structured Entity Extraction from Hawaiian Television Chyrons Using Vision-Language Models

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI:10.63317/4vjma77xc4kh

Abstract

Hawaiian (ʻŌlelo Hawaiʻi) is an endangered Polynesian language whose broadcast archives represent a critical yet underutilized resource for language documentation. We present the first evaluation of vision-language models (VLMs) for structured entity extraction from television chyrons, investigating the performance gap between Hawaiian-language content and mainland U.S. comparisons. Using our new HiChy dataset of 3,925 manually annotated images, we demonstrate that Hawaiian content remains significantly more challenging for current VLMs: for the best-performing model (Qwen2.5-VL-7B), character error rates roughly double from 0.064 on mainland data to 0.130 on Hawaiian content. We extend the task to key information extraction (KIE), finding that while models can perform structured parsing, they struggle specifically with names of Hawaiian linguistic origin, a difficulty that persists even when controlling for geographic source. Across five evaluated models spanning local quantized inference and commercial APIs, we find that OCR accuracy and structured extraction capability do not necessarily correlate: the best OCR model (Gemini 3 Flash) underperforms locally deployed alternatives on KIE, while even a 2.2B-parameter model (SmolVLM2) achieves functional extraction. Our results provide a baseline for AI-assisted archival processing of underrepresented language media and highlight the need for models that better account for the orthographic and cultural specificities of Hawaiian.

Details

Paper ID
lrec2026-ws-sigul-17
Pages
pp. 160-168
BibKey
lynch-etal-2026-structured
Editors
Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Kelley Lynch
  • Owen King
  • Kyeongmin Rim
  • Gabrielle Keen
  • Yangyang Chen
  • James Pustejovsky
