Publication | Closed Access
Robust, Lexicalized Native Language Identification
Citations: 40
References: 29
Year: 2012
Venue: Unknown
Engineering, Speech Corpus, Multilingual Pretraining, Corpus Linguistics, Text Mining, Speech Recognition, Natural Language Processing, Language Documentation, Data Science, Native Language Identification, Computational Linguistics, Language Testing, Standard Dataset, 7-Language Task, Language Studies, Machine Translation, NLP Task, Cross-language Retrieval, Language Recognition, Language Corpus, Linguistics
Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to within-corpus topic confounds, and provide a detailed evaluation of various options, including a simple bias adaptation technique and a number of classifier algorithms. Using a new web corpus as a training set, we reach high classification accuracy for a 7-language task, and this performance is robust across two independent test sets. Although we show that even higher accuracy is possible using cross-validation, we present strong evidence calling into question the validity of cross-validation evaluation using the standard dataset.
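The distinction between cross-corpus and within-corpus evaluation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy corpora and the labels `L1_A` and `L1_B` are invented, the features are plain lowercase word counts, and a small multinomial Naive Bayes classifier stands in for the classifiers the paper compares. It is not the authors' setup. The one point it shows is that the model is fit on one corpus and scored only on a separate one.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lexical features in their simplest form: lowercase whitespace tokens.
    return text.lower().split()


def train_naive_bayes(docs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest add-one-smoothed log score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    correct = sum(predict(model, text) == label for text, label in docs)
    return correct / len(docs)


# Hypothetical toy data: the training and test corpora are kept separate,
# so topic words shared within one corpus cannot inflate the score.
TRAIN_CORPUS = [
    ("i am agree with this opinion", "L1_A"),
    ("i am agree that travel is important", "L1_A"),
    ("he explained me the problem", "L1_B"),
    ("she explained me how to cook", "L1_B"),
]
TEST_CORPUS = [
    ("i am agree about the exam", "L1_A"),
    ("they explained me the rules", "L1_B"),
]

model = train_naive_bayes(TRAIN_CORPUS)
print("cross-corpus accuracy:", accuracy(model, TEST_CORPUS))
```

Cross-validation would instead split a single corpus into folds, so training and test texts share a source and possibly topics, which is the confound the abstract warns about.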