Publication | Closed Access
Native language detection with 'cheap' learner corpora
48
Citations
9
References
2013
Year
Unknown Venue
We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has serious issues when used for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that result in crossvalidated performance that is misleading, for all the feature types which are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which had been previously avoided. We also investigate ways to do NLD that don’t involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.
| Year | Citations | |
|---|---|---|
Page 1
Page 1