Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar, Ondřej Bojar
Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn, Miryam de Lhoneux
Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley, Diane Litman
A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić, Filip Miletić
Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun, Michael White
Extrinsic Evaluation of Question Generation Methods with User Journey Logs
Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien, Arnaud Laborderie
Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer, Andreas Säuberli, Martin Kappus, Sarah Ebling
Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet, David Elsweiler
The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz, Craig Thomson
Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
Andra-Maria Florescu, Marius Micluta-Campeanu, Liviu P. Dinu
Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair, Tanvi Dinkar, Gavin Abercrombie
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi, Anya Belz
ReproHum #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws, Frédéric Tomas
ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar, Gavin Abercrombie, Verena Rieser
ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral, Alberto Bugarín-Diz
ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička, John D. Kelleher
ReproHum #0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen, Mei-Shin Wu-Urbanek, Steffen Eger
ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan, Natalie Parde
ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson, Dimitra Gkatzia
ReproHum #0043-4: Evaluating Summarization Models: Investigating the Impact of Education and Language Proficiency on Reproducibility
Mateusz Lango, Patricia Schmidtova, Simone Balloccu, Ondrej Dusek