CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense

Abstract

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores.

We introduce the CODAH dataset, an adversarially constructed evaluation dataset for testing common sense. CODAH forms a challenging extension to the recently proposed SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video. To produce a more difficult dataset, we introduce a novel procedure for question acquisition in which workers author questions designed to target weaknesses of state-of-the-art neural question answering systems. Workers are rewarded for submissions that models fail to answer correctly both before and after fine-tuning (in cross-validation). We create 2.8k questions via this procedure and evaluate the performance of multiple state-of-the-art question answering systems on our dataset. We observe a significant gap between human performance, which is 95.3%, and the best baseline accuracy of 67.5%, achieved by the BERT-Large model.