Structured Entity Extraction from Hawaiian Television Chyrons Using Vision-Language Models

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

DOI: 10.63317/4vjma77xc4kh

Abstract

Hawaiian (ʻŌlelo Hawaiʻi) is an endangered Polynesian language whose broadcast archives represent a critical yet underutilized resource for language documentation. We present the first evaluation of vision-language models (VLMs) for structured entity extraction from television chyrons, investigating the performance gap between Hawaiian-language content and mainland U.S. comparisons. Using our new HiChy dataset of 3,925 manually annotated images, we demonstrate that Hawaiian content remains significantly more challenging for current VLMs: for the best-performing model (Qwen2.5-VL-7B), character error rates roughly double from 0.064 on mainland data to 0.130 on Hawaiian content. We extend the task to key information extraction (KIE), finding that while models can perform structured parsing, they struggle specifically with names of Hawaiian linguistic origin, a difficulty that persists even when controlling for geographic source. Across five evaluated models spanning local quantized inference and commercial APIs, we find that OCR accuracy and structured extraction capability do not necessarily correlate: the best OCR model (Gemini 3 Flash) underperforms locally deployed alternatives on KIE, while even a 2.2B-parameter model (SmolVLM2) achieves functional extraction. Our results provide a baseline for AI-assisted archival processing of underrepresented language media and highlight the need for models that better account for the orthographic and cultural specificities of Hawaiian.
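For readers unfamiliar with the metric, character error rate (CER) is commonly defined as the character-level edit distance between the reference text and the model output, divided by the reference length. The sketch below illustrates that common definition only. The abstract does not state the exact normalization used in the paper, so the NFC normalization step and the function names here are assumptions made for illustration.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, in Unicode code points.

    Assumption: both strings are NFC-normalized first, so that a vowel
    with a kahakō (macron) counts as one character whether it was typed
    precomposed or as a base letter plus a combining mark.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# Dropping the ʻokina (U+02BB) from a seven-character reference
# costs one deletion out of seven reference characters.
print(cer("Hawai\u02bbi", "Hawaii"))
```

Under this definition, a single missed ʻokina or kahakō in a short chyron string contributes a comparatively large share of the error, which is one reason orthography-specific mistakes are visible in CER.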

Details

Paper ID

lrec2026-ws-sigul-17

Pages

pp. 160-168

DOI

10.63317/4vjma77xc4kh

BibKey

lynch-etal-2026-structured

Editors

Atul Kr. Ojha, Sakriani Sakti, Claudia Soria, Maite Melero, John P. McCrae, Constantine Lignos, Chao-Hong Liu, German Rigau Claramunt, Georg Rehm

Publisher

European Language Resources Association (ELRA)

ISSN

N/A

ISBN

N/A

Workshop

Proceedings of the SIGUL 2026 Joint Workshop with ELE, EURALI, and DCLRL "Towards Inclusivity and Equality: Language Resources and Technologies for Under-Resourced and Endangered Languages"

Location

Palma, Mallorca, Spain

Date

11 - 16 May 2026

Authors

Kelley Lynch
Owen King
Kyeongmin Rim
Gabrielle Keen
Yangyang Chen
James Pustejovsky
