A Lightweight N-gram Approach to Abbreviation Expansion in Large Corpora

Proceedings of the Third Workshop on Computation and Written Language (CAWL 2026) @ LREC 2026

DOI:10.63317/3t4xjiz2vvw2

Abstract

We present a lightweight, corpus-based approach to abbreviation expansion that relies solely on contextual N-gram statistics. The method models local context using two-sided and one-sided bigram and trigram counts extracted from a large domain-specific corpus. Candidate expansions are selected through linear interpolation of context-specific evidence, enhanced with reliability-based scaling to mitigate sparse data effects. The approach does not require external linguistic resources, pretrained language models, or explicit morphosyntactic analysis, making it suitable for domain-specific and resource-constrained settings. Experiments conducted on a large Slovene medical corpus demonstrate that interpolation generally outperforms strict backoff strategies, with notable improvements for medium- and low-frequency abbreviations. Despite its simplicity, the proposed framework achieves robust performance while remaining computationally efficient and scalable.
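
The scoring idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the context names (`two_sided`, `left_only`, `right_only`), the interpolation weights, the reliability formula `n / (n + k)`, the constant `k`, and the toy counts are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter


def relative_freqs(counts, candidates):
    """Return (distribution over candidates, total evidence count)."""
    total = sum(counts.get(c, 0) for c in candidates)
    if total == 0:
        return {c: 0.0 for c in candidates}, 0
    return {c: counts.get(c, 0) / total for c in candidates}, total


def score_candidates(candidates, context_counts, weights, k=5.0):
    """Linearly interpolate per-context distributions.

    Each context's contribution is scaled by n / (n + k), a hypothetical
    reliability factor that shrinks evidence backed by few observations.
    """
    scores = {c: 0.0 for c in candidates}
    for name, counts in context_counts.items():
        probs, n = relative_freqs(counts, candidates)
        reliability = n / (n + k)
        for c in candidates:
            scores[c] += weights[name] * reliability * probs[c]
    return scores


# Toy counts, invented for illustration: how often each candidate expansion
# was seen with the same neighbouring words as the abbreviation.
candidates = ["expansion_a", "expansion_b"]
context_counts = {
    "two_sided": Counter({"expansion_a": 1}),                      # left and right word match
    "left_only": Counter({"expansion_a": 30, "expansion_b": 10}),  # left word matches
    "right_only": Counter({"expansion_a": 5, "expansion_b": 15}),  # right word matches
}
weights = {"two_sided": 0.5, "left_only": 0.25, "right_only": 0.25}

scores = score_candidates(candidates, context_counts, weights)
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch the single two-sided observation is heavily discounted rather than trusted outright, and the one-sided contexts still contribute. A strict backoff scheme would instead consult only the most specific context that has any evidence, which is the contrast the abstract draws.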

Details

Paper ID
lrec2026-ws-cawl-10
Pages
pp. 95-100
BibKey
oltes-etal-2026-lightweight
Editors
Kyle Gorman
Publisher
European Language Resources Association (ELRA)
ISSN
N/A
ISBN
N/A
Workshop
Proceedings of the Third Workshop on Computation and Written Language (CAWL 2026) @ LREC 2026
Location
Palma, Mallorca, Spain
Date
11 - 16 May 2026

Authors

  • Tjaša Šoltes

  • Marko Bajec
