Concepedia

Publication | Closed Access

Robust, Lexicalized Native Language Identification

40

Citations

29

References

2012

Year

Abstract

Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because these are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided due to the within-corpus topic confounds, and provide a detailed evaluation of various options, including a simple bias adaptation technique and a number of classifier algorithms. Using a new web corpus as a training set, we reach high classification accuracy for a 7-language task, performance which is robust across two independent test sets. Although we show that even higher accuracy is possible using crossvalidation, we present strong evidence calling into question the validity of cross-validation evaluation using the standard dataset.

References

YearCitations

Page 1