Native language detection with 'cheap' learner corpora

Abstract

We begin by showing that the best publicly available multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has serious issues when used for the task of native language detection (NLD). Topic biases in the corpus are a confounding factor that makes cross-validated performance misleading for all the feature types traditionally used. Our approach here is to look for other, cheap ways to obtain training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus and show that it is useful for the task, particularly when large quantities of data are used. This also seems to facilitate the use of lexical features, which had previously been avoided. We also investigate ways to do NLD that do not involve learner corpora at all, including double-translation and extracting information directly from L1 corpora. All of these avenues are shown to be promising.
