Rome was built in 1776: A Case Study on Factual Correctness in Knowledge-Grounded Response Generation

Abstract

Recently, neural response generation models have leveraged large pre-trained transformer models and knowledge snippets to generate relevant and informative responses. However, this does not guarantee that the generated responses are factually correct. In this paper, we examine factual correctness in knowledge-grounded neural response generation models. We present a human annotation setup to identify three different response types: responses that are factually consistent with respect to the input knowledge, responses that contain hallucinated knowledge, and non-verifiable chitchat-style responses. We use this setup to annotate responses generated using different state-of-the-art models, knowledge snippets, and decoding strategies. In addition, to facilitate the development of a factual consistency detector, we automatically create a new corpus called Conv-FEVER that is adapted from the Wizard of Wikipedia dataset and includes factually consistent and inconsistent responses. We demonstrate the benefit of our Conv-FEVER dataset through evaluation on our human-annotated data, showing that models trained on Conv-FEVER perform reasonably well at detecting responses that are factually inconsistent with the provided knowledge. We will release the Conv-FEVER dataset and the human-annotated responses.