Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar, Ondřej Bojar
Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn, Miryam de Lhoneux
Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley, Diane Litman
A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić, Filip Miletić
Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun, Michael White
Extrinsic Evaluation of Question Generation Methods with User Journey Logs
Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien, Arnaud Laborderie
Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer, Andreas Säuberli, Martin Kappus, Sarah Ebling
Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet, David Elsweiler
The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz, Craig Thomson
Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
Andra-Maria Florescu, Marius Micluta-Campeanu, Liviu P. Dinu
Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair, Tanvi Dinkar, Gavin Abercrombie
Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi, Anya Belz
ReproHum #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws, Frédéric Tomas
ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar, Gavin Abercrombie, Verena Rieser
ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral, Alberto Bugarín-Diz
ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička, John D. Kelleher
ReproHum #0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen, Mei-Shin Wu-Urbanek, Steffen Eger
ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan, Natalie Parde
ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson, Dimitra Gkatzia
ReproHum #0043-4: Evaluating Summarization Models: Investigating the Impact of Education and Language Proficiency on Reproducibility
Mateusz Lango, Patricia Schmidtova, Simone Balloccu, Ondrej Dusek