Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries

TLDR

The chi-squared test is applied to seven ICAME corpora to identify the vocabulary most characteristic of each country's English, building on Leech and Fallon's (1992) comparison of the Brown Corpus of American English and the Lancaster–Oslo–Bergen Corpus of British English. Whether the vocabulary differences are cultural or linguistic in origin, they can be used to automatically classify texts of unknown provenance by variety of English. How far the differences generalize to each variety as a whole depends on how well the corpora cover the topics typical of their cultures, so the authors call for corpora designed to represent those topics, which will require methods to determine each culture's topical range and to sample adequately from each domain.

Abstract

The chi-squared test is used to find the vocabulary most typical of seven different ICAME corpora, each representing the English used in a particular country. In a closely related study, Leech and Fallon (1992, Computer corpora – what do they tell us about culture? ICAME Journal, 16: 29–50) found differences in the vocabulary used in the Brown Corpus of American English and the Lancaster–Oslo–Bergen Corpus of British English. They were mainly interested in those vocabulary differences which they assumed to be due to cultural differences between the United States and Britain, but we are equally interested in vocabulary differences which reveal linguistic preferences in the various countries in which English is spoken. Whether vocabulary differences are cultural or linguistic in nature, they can be used to classify texts of unknown provenance automatically according to variety of English. The extent to which the vocabulary differences between the corpora represent vocabulary differences between the varieties of English as a whole depends on the extent to which the corpora represent the full range of topics typical of their associated cultures, and thus there is a need for corpora designed to represent the topics and vocabulary of cultures or dialects, rather than stratified across a set range of topics and genres. This will require methods to determine the range of topics addressed in each culture, then methods to sample adequately from each topical domain.
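
As an illustration of the kind of comparison described above, the following minimal sketch computes the standard Pearson chi-squared statistic for one word across several corpora, using a 2 x k table (the word versus all other tokens, in each of k corpora), and ranks a vocabulary by that statistic. The abstract does not specify the authors' exact procedure (thresholds, corrections, or preprocessing), so this should be read as an assumption-laden example rather than their method; the function names `chi_squared` and `rank_vocabulary` and the toy data are hypothetical.

```python
from collections import Counter


def chi_squared(word_counts, corpus_sizes):
    """Pearson chi-squared statistic for a 2 x k contingency table.

    Row 1 holds the occurrences of one word in each of k corpora;
    row 2 holds all remaining tokens in each corpus.
    """
    total_word = sum(word_counts)
    total_tokens = sum(corpus_sizes)
    total_other = total_tokens - total_word
    stat = 0.0
    for observed, size in zip(word_counts, corpus_sizes):
        # Expected counts if the word were spread evenly by corpus size.
        expected_word = size * total_word / total_tokens
        expected_other = size * total_other / total_tokens
        if expected_word > 0:
            stat += (observed - expected_word) ** 2 / expected_word
        if expected_other > 0:
            stat += ((size - observed) - expected_other) ** 2 / expected_other
    return stat


def rank_vocabulary(corpora):
    """Rank every word by its chi-squared statistic, highest first.

    `corpora` maps a corpus name to its list of tokens (hypothetical input
    format; real corpora would need tokenization first).
    """
    counts = {name: Counter(tokens) for name, tokens in corpora.items()}
    sizes = [len(tokens) for tokens in corpora.values()]
    vocabulary = set().union(*counts.values())
    scores = {
        word: chi_squared([counts[name][word] for name in corpora], sizes)
        for word in vocabulary
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
toy_corpora = {
    "US": ["color"] * 3 + ["the"] * 7,
    "UK": ["colour"] * 3 + ["the"] * 7,
}
ranked = rank_vocabulary(toy_corpora)
```

In this toy example the spelling variants score highest and the shared word "the" scores zero, since it is equally frequent in both corpora. A high statistic only says that a word's distribution is uneven; which corpus it is typical of still has to be read from the observed and expected counts.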
