Back to Main Conference 2024
LREC-COLING 2024main

The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3f2yamzokm9i

Abstract

We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: Half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure a representation of the contemporary discourse. Each entry is labeled with its grammatical status (“well-formedness” according to syntactic formalisms) extracted from the literature, as well as its acceptability status (“intuitive goodness” as determined by native speakers) obtained through crowdsourcing, with highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: We observe that grammaticality and acceptability judgments converge in about 83% of the cases and that “in-betweenness” occurs frequently. This corroborates existing research. We also find that while machine learning models struggle with predicting grammaticality, they perform considerably better in predicting acceptability. This is a novel finding. Future work will focus on expanding the dataset.

Details

Paper ID
lrec2024-main-1401
Pages
pp. 16113-16120
BibKey
juzek-2024-syntactic
Editor
N/A
Publisher
European Language Resources Association (ELRA) and ICCL
ISSN
2522-2686
ISBN
979-10-95546-34-4
Conference
Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Location
Turin, Italy
Date
20 May 2024 25 May 2024

Authors

  • TJ

    Tom S Juzek

Links