SpreadNaLa: A Naturalistic Code Generation Evaluation Dataset of Spreadsheet Formulas

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

DOI:10.63317/3g58v686p99j

Abstract

Automatic generation of code from natural language descriptions has emerged as one of the main use cases of large language models (LLMs). This has also led to a proliferation of datasets to track progress in the reliability of code generation models, including domains such as programming challenges and common data science tasks. However, existing datasets primarily target the use of code generation models to aid expert programmers in writing code. In this work, we consider a domain of code generation which is more frequently used by users without sophisticated programming skills: translating English descriptions to spreadsheet formulas that can be used to do everyday data processing tasks. We extract naturalistic instructions from StackOverflow posts and manually verify and standardize the corresponding spreadsheet formulas. We use this dataset to evaluate an off-the-shelf code generation model (GPT 3.5 text-davinci-003) as well as recently proposed pragmatic code generation procedures and find that Code Reviewer reranking (Zhang et al., 2022) performs best among the evaluated methods but still frequently generates formulas that differ from human-generated ones.