‘What is this corpus about?’: using topic modelling to explore a specialised corpus

TLDR

The paper shows how topic modelling can be applied to a corpus of academic English to identify the prominent topics in different parts of papers, trace chronological change in a journal, and distinguish different types of papers. It gives an intuitive explanation of the underlying mechanism, describes the procedure for building a topic model, including the decisions involved, and then explores the resulting model. Topics, each characterised by a set of co-occurring words, offer rich insights into the nature of the corpus. The paper also compares topic modelling with semantic annotation and keywords analysis, and the authors suggest that it is particularly useful for the initial exploration of a corpus.

Abstract

This paper introduces topic modelling, a machine learning technique that automatically identifies ‘topics’ in a given corpus, and illustrates its use in the exploration of a corpus of academic English. It first offers an intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in a topic model is characterised by a set of co-occurring words, and we demonstrate that such topics bring rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates chronological change in a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.
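To make the idea of a topic as a set of co-occurring words more concrete, the following is a minimal sketch of one common kind of topic model, latent Dirichlet allocation fitted by collapsed Gibbs sampling. It is an illustration only and is not taken from the paper: the toy documents, the function names (`fit_lda`, `top_words`, `doc_topic_shares`), the number of topics, and the values of `alpha` and `beta` are all assumptions chosen for readability, and the abstract does not state which software or settings the authors used.

```python
import random


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling.

    docs is a list of token lists. Returns the vocabulary, the topic-by-word
    count table, and the document-by-topic count table.
    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token from the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight = (topic's share in this document)
                #          * (word's share in this topic), both smoothed.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    return vocab, topic_word, doc_topic


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: -row[i])
        result.append([vocab[i] for i in ranked[:n]])
    return result


def doc_topic_shares(doc_topic, alpha=0.1):
    """Convert document-by-topic counts into smoothed proportions."""
    shares = []
    for row in doc_topic:
        total = sum(row) + alpha * len(row)
        shares.append([(c + alpha) / total for c in row])
    return shares


# Invented toy documents, used only to exercise the sketch.
docs = [
    "learner error feedback writing learner feedback".split(),
    "writing feedback learner error classroom".split(),
    "corpus frequency collocation keyword corpus".split(),
    "keyword collocation frequency corpus concordance".split(),
]

vocab, topic_word, doc_topic = fit_lda(docs)
shares = doc_topic_shares(doc_topic)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(f"topic {k}: {words}")
for d, s in enumerate(shares):
    print(f"doc {d}: {[round(x, 2) for x in s]}")
```

The per-document proportions are what would support tasks like those listed in the abstract, for example averaging topic shares by paper section or by publication year. On a real corpus an established library would normally be used instead of a hand-written sampler.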
