Publication | Closed Access
Rethinking the Agreement in Human Evaluation Tasks
Citations: 34 | References: 14 | Year: 2018
Engineering, Cognition, Communication, Semantics, Human Evaluation Tasks, Corpus Linguistics, Social Sciences, Text Mining, Program Evaluation, Natural Language Processing, Computational Linguistics, Language Engineering, Conversation Analysis, Evaluation Methodology, Machine Translation, Cognitive Science, Question Answering, Language Annotation Tasks, NLP Task, Human Evaluations, Agreement Metrics, Experimental Psychology, Retrieval Augmented Generation, Evaluation Measure, Evaluation Technique, Linguistics, Language Generation
Human evaluations are broadly thought to be more valuable the higher the inter-annotator agreement. In this paper we examine this assumption. We describe our experiments and analysis in the area of Automatic Question Generation. Our experiments show how annotators diverge in language annotation tasks because of a range of ineliminable factors. For this reason, we believe that annotation schemes for natural language generation tasks that aim to evaluate language quality need to be treated with great care. In particular, an unchecked focus on reducing disagreement among annotators risks creating generation goals that reward output that is more distant from, rather than closer to, natural human-like language. We conclude by suggesting a new approach to the use of agreement metrics in natural language generation evaluation tasks.