Publication | Open Access
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Citations: 310 · References: 32 · Year: 2018
Topics: Engineering · Semantic Web · Corpus Linguistics · Rich World Knowledge · Text Mining · Natural Language Processing · Source Concept · Information Retrieval · Data Science · Computational Linguistics · Commonsense Knowledge · Language Studies · Commonsense Question · Machine Translation · Question Answering · NLP Task · Knowledge Discovery · Commonsense Reasoning · Semantic Parsing · Retrieval Augmented Generation · Domain Knowledge Modeling · Linguistics
Question answering typically relies on contextual clues, but humans also draw on extensive world knowledge. The authors introduce CommonsenseQA, a dataset designed to test question answering that requires prior commonsense knowledge. They generate questions by selecting a source concept and multiple target concepts sharing a semantic relation in ConceptNet, then having crowd-workers craft multiple-choice questions that distinguish among those targets, yielding complex semantics that demand prior knowledge. The dataset contains 12,247 questions, and baseline models achieve at most 56% accuracy, far below the 89% human performance.
When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.
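The extraction step described above — finding multiple target concepts that share the same semantic relation to a single source concept — can be sketched as a grouping pass over relation triples. This is a minimal illustration with toy data; the triples, the helper name, and the threshold are assumptions for demonstration, not the authors' actual pipeline or the ConceptNet API.

```python
from collections import defaultdict

# Toy stand-in for ConceptNet edges as (source, relation, target) triples.
# The real procedure queries ConceptNet (Speer et al., 2017) itself.
EDGES = [
    ("river", "AtLocation", "valley"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "waterfall"),
    ("river", "UsedFor", "fishing"),
    ("bird", "CapableOf", "fly"),
]

def target_sets(edges, min_targets=3):
    """Group targets by (source, relation) and keep groups with enough
    targets to seed one multiple-choice question per target concept."""
    groups = defaultdict(list)
    for src, rel, tgt in edges:
        groups[(src, rel)].append(tgt)
    return {key: tgts for key, tgts in groups.items()
            if len(tgts) >= min_targets}

# ("river", "AtLocation") yields three targets; crowd-workers would then
# author one question per target that rules out the other two.
question_seeds = target_sets(EDGES)
```

Each surviving (source, relation) group becomes a seed: a worker writes one question mentioning the source concept ("river") for which exactly one target ("waterfall", say) is correct and the sibling targets serve as hard distractors.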