Publication | Closed Access

Native language detection with 'cheap' learner corpora

Year: 2013
Citations: 48
References: 9

Abstract

We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has serious issues when used for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that results in misleading cross-validated performance for all the feature types that are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which had previously been avoided. We also investigate ways to do NLD that don’t involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.
