From CHAT to Coded CoNLL-U: A Reproducible Pipeline for the Syntactic Annotation and Querying of Child Language Data

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/498zo5heasd5

Abstract

The CHILDES database is a core resource for language acquisition research, yet its CHAT format poses significant challenges for modern computational analysis. To address this, we present a reproducible, open-source pipeline that transforms CHAT transcripts into annotated tabular (CSV) and CoNLL-U formats. Its core script, childes.py, automates the conversion and integrates part-of-speech tagging and dependency parsing. A key innovation is dql.py, a tool that uses the Grew dependency query language to systematically add user-defined linguistic codings to the parsed data. Although the script is parametrised for various languages, we demonstrate the pipeline’s utility by applying it to the French CHILDES corpus in a large-scale analysis of object clitic production. The resulting structured data reveal clear developmental trajectories, such as the gradual convergence of children’s dative clitic usage towards the adult input. The workflow and the resources it generates facilitate reproducible, data-driven research in language acquisition.
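
To make the first conversion step concrete, here is a minimal Python sketch of turning CHAT main-tier lines into a CoNLL-U skeleton. It is not the paper's childes.py: the function name chat_to_conllu, the regular expression, the whitespace tokenisation, and the "# speaker" comment line are all assumptions made for illustration. Tagging and parsing are not performed (those columns stay as "_"), and CHAT-specific codes such as retracing or pause markers are not cleaned.

```python
import re

# Assumption: a CHAT main tier looks like "*CHI:<TAB>utterance".
# Lines starting with "@" (headers) or "%" (dependent tiers) are ignored here.
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\t(.*)$")


def chat_to_conllu(chat_text):
    """Convert CHAT main-tier lines to unannotated CoNLL-U sentences.

    Hypothetical helper for illustration only; a real pipeline would also
    normalise CHAT codes and run a tagger and parser to fill the columns.
    """
    blocks = []
    sent_no = 0
    for line in chat_text.splitlines():
        match = MAIN_TIER.match(line)
        if not match:
            continue  # skip headers, dependent tiers, and anything else
        speaker, utterance = match.groups()
        tokens = utterance.split()  # naive whitespace tokenisation
        if not tokens:
            continue
        sent_no += 1
        rows = [
            f"# sent_id = {sent_no}",
            f"# speaker = {speaker}",
            f"# text = {utterance}",
        ]
        for index, form in enumerate(tokens, start=1):
            # CoNLL-U has 10 tab-separated columns: ID, FORM, LEMMA, UPOS,
            # XPOS, FEATS, HEAD, DEPREL, DEPS, MISC. Only ID and FORM are set.
            rows.append("\t".join([str(index), form] + ["_"] * 8))
        blocks.append("\n".join(rows))
    # CoNLL-U sentences are separated (and terminated) by a blank line.
    return "\n\n".join(blocks) + "\n\n" if blocks else ""


if __name__ == "__main__":
    sample = "@Begin\n*CHI:\tje le veux .\n%mor:\tpro|je\n*MOT:\ttu le veux ?\n@End\n"
    print(chat_to_conllu(sample), end="")
```

Keeping the speaker code as a sentence-level comment is one possible design choice: it keeps the token table standard while still allowing child and adult utterances to be separated in later queries.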

Details

Paper ID
lrec2026-main-901
Pages
pp. 11516-11523
BibKey
stein-2026-chat
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11-16 May 2026

Authors

  • Achim Stein

Links