Publication | Open Access
VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora
Citations: 50 | References: 9 | Year: 2005
Engineering, Corpus Linguistics, Natural Language Processing, Applied Linguistics, Language Documentation, Variant Detector, Vard Versus Word, Modern English, Computational Linguistics, Lexicography, Historical Linguistics, Grammar, Corpus Analysis, Language Studies, Lexicon, Machine Translation, Computational Lexicology, Semantic Change, Distributional Semantics, English Historical Corpora, Lexical Resource, Language Corpus, English Historical Texts, Ucrel Variant Detector, Linguistics
Analysis of English historical texts poses a number of obstacles for standard corpus analysis and annotation techniques. In addition to nonstandard spellings and contractions, there are difficulties at the morphological, phonetic and syntactic levels. Our response has been to develop a VARiant Detector (VARD). We trained VARD on 16th-19th century data, specifically the Nameless Shakespeare and a selection of texts taken from Chadwyck-Healey’s Eighteenth and Nineteenth Century Fiction collection. We have chosen to explore data from these centuries because, even though variant usage remains an issue up to the present day (owing to dialectal forms and ongoing standardisation), it declines substantially in the 18th-19th centuries. This paper reports on experiments to test the utility of VARD. The experiments compared VARD’s performance on unseen data with that of two spell checkers for modern English (MS-Word and Aspell). Our hypothesis is that, as these spell checkers are not intended to work on historical data, VARD will be superior at both recognising variants and suggesting modern forms. VARD records each modern equivalent in an XML tag rather than removing the original variant, so the historical spelling is preserved in the text.
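The idea of keeping the original variant while recording its modern equivalent in an XML tag can be illustrated with a minimal sketch. This is not VARD's implementation or output format: the `<variant>` element, its `modern` attribute, the `annotate` function and the three-entry lookup table are all hypothetical, chosen only to show the annotate-rather-than-replace approach.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical variant-to-modern lookup; the entries are illustrative only
# and are not taken from VARD's actual variant list.
VARIANTS = {"loue": "love", "vertue": "virtue", "haue": "have"}

# Split text into alternating runs of letters and non-letters so that
# joining the tokens reproduces the input exactly.
TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def annotate(text, variants=VARIANTS):
    """Wrap known variants in a (hypothetical) <variant> tag.

    The original spelling stays as the element content; the modern
    equivalent is stored in the `modern` attribute.
    """
    out = []
    for token in TOKEN.findall(text):
        modern = variants.get(token.lower())
        if modern is None:
            # Not a known variant: keep the token, escaping XML specials.
            out.append(escape(token))
        else:
            out.append(
                f"<variant modern={quoteattr(modern)}>{escape(token)}</variant>"
            )
    return "".join(out)


if __name__ == "__main__":
    print(annotate("I haue loue."))
```

Because the original token is kept as the element content, later analysis can choose either the historical form or the modern one without losing information.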