Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence
Citations: 67
Year: 2021
Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Contemporary neural topic models surpass classical ones according to these metrics. At the same time, topic model evaluation suffers from a validation gap: automated coherence, developed for classical models, has not been validated using human experimentation for neural models. In addition, a meta-analysis of topic modeling literature reveals a substantial standardization gap in automated topic modeling benchmarks. To address the validation gap, we compare automated coherence with the two most widely accepted human judgment tasks: topic rating and word intrusion. To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets. Automated evaluations declare a winning model when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments.
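
For reference, the co-occurrence-based coherence the abstract refers to is typically computed as average pairwise NPMI over a topic's top words. The sketch below is a minimal illustration under simplifying assumptions, not the paper's implementation: it estimates probabilities from document-level co-occurrence, whereas practical toolkits usually use sliding windows over a large reference corpus; the function name and toy corpus are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average pairwise NPMI of a topic's top words, with probabilities
    estimated from document-level co-occurrence in a reference corpus."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    # Document frequency of each topic word.
    df = {w: sum(w in d for d in doc_sets) for w in topic_words}
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        co = sum(w1 in d and w2 in d for d in doc_sets)
        if co == 0 or df[w1] == 0 or df[w2] == 0:
            scores.append(-1.0)  # conventional NPMI floor: pair never co-occurs
            continue
        if co == n_docs:
            scores.append(1.0)   # degenerate case: pair co-occurs in every document
            continue
        p1, p2, p12 = df[w1] / n_docs, df[w2] / n_docs, co / n_docs
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))  # normalize PMI into [-1, 1]
    return sum(scores) / len(scores)

# Usage: score a topic's top words against a tokenized reference corpus.
corpus = [["topic", "model", "word", "corpus"],
          ["neural", "topic", "model"],
          ["word", "intrusion", "task"]]
print(npmi_coherence(["topic", "model", "word"], corpus))
```

Higher averages indicate that a topic's words tend to appear together in the reference corpus; the paper's contention is that such scores, however computed, may rank models differently than human topic rating and word intrusion judgments do.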