VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora

Abstract

Analysis of English historical texts poses a number of obstacles for standard corpus analysis and annotation techniques. In addition to nonstandard spellings and contractions, there are difficulties at the morphological, phonetic and syntactic levels. Our response has been to develop a VARiant Detector (VARD). We trained VARD on 16th-19th century data: specifically, the Nameless Shakespeare and a selection of texts taken from Chadwyck-Healey’s Eighteenth and Nineteenth Century Fiction collection. We chose these centuries because, although variant usage remains an issue up to the present day (owing to dialectal forms and ongoing standardisation), it declines substantially during the 18th and 19th centuries. This paper reports on experiments to test the utility of VARD. The experiments compared VARD’s performance on unseen data with that of two spell checkers for modern English (MS-Word and Aspell). Our hypothesis is that, as these spell checkers are not intended to work on historical data, VARD will be superior both at recognising variants and at suggesting modern forms. VARD records the modern equivalent in an XML tag rather than removing the original variant.
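
The idea of recording a modern equivalent in an XML tag while preserving the original variant can be illustrated with a minimal sketch. This is not VARD's implementation or its actual tag format: the tag name `reg`, the attribute `orig`, the three-entry variant list and the function `tag_variants` are all hypothetical, chosen only to show the principle of a lookup-based detector that annotates rather than overwrites.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical variant list (historical spelling -> modern form).
# A real detector would use a far larger, corpus-derived list.
VARIANTS = {"loue": "love", "vertue": "virtue", "doe": "do"}

# Split text into alternating word and non-word runs so that
# punctuation and spacing survive unchanged.
TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z']+")


def tag_variants(text, variants=VARIANTS):
    """Wrap each known variant in a (hypothetical) <reg> element.

    The modern form becomes the element content and the original
    spelling is kept in the `orig` attribute, so nothing is lost.
    """
    out = []
    for tok in TOKEN.findall(text):
        modern = variants.get(tok.lower())
        if modern is None:
            # Not a known variant: emit as-is, XML-escaped.
            out.append(escape(tok))
            continue
        if tok[0].isupper():
            # Carry an initial capital over to the modern form.
            modern = modern.capitalize()
        out.append(f"<reg orig={quoteattr(tok)}>{escape(modern)}</reg>")
    return "".join(out)


print(tag_variants("I doe loue vertue."))
```

Because the original spelling stays in the markup, the text can be searched or annotated in either its historical or its modernised form, which a spell checker that simply replaces words would not allow.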
