Publication | Closed Access
Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification
Citations: 66
References: 22
Year: 2012
Keywords: Engineering, Cross-lingual Representation, Multilingualism, Cross-language Perspective, Phonology, Corpus Linguistics, Text Mining, Speech Recognition, Natural Language Processing, Indigenous Language, Data Science, Native Language Identification, Computational Linguistics, Language Engineering, Ensemble Classifier, Language Studies, Indigenous Languages, Endangered Language, Machine Translation, Heritage Language, Empirical Evaluations, NLP Task, Native Tongues, Heritage Language Acquisition, Extinct Language, Language Recognition, Language Corpus, ICLE Corpus, Linguistics
In this paper we present work on the task of Native Language Identification (NLI). We present an alternative to the ICLE corpus, which has been used in most work to date. We believe that our corpus, TOEFL11, is more suitable for the task of NLI and will allow researchers to better compare systems and results. We show that many of the features commonly used in this task generalize to new and larger corpora. In addition, we examine possible ways of increasing current system performance (e.g., additional features and feature combination methods), and achieve overall state-of-the-art results (accuracy of 90.1%) on the ICLE corpus using an ensemble classifier that includes previously examined features and a novel feature (n-gram language models). We also show that training on a large corpus and testing on a smaller one works well, but not vice versa. Finally, we show that system performance varies across proficiency scores.
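To make the "n-gram language model" feature more concrete, here is a minimal sketch of how such a model could be used to guess a writer's native language: one language model is trained per native language, and a new text is assigned to the language whose model gives it the highest log-likelihood. The abstract does not specify the n-gram unit, the order, the smoothing, or how the scores enter the ensemble, so everything below is an assumption made for illustration: character trigrams, add-one smoothing, and the hypothetical names `char_ngrams`, `NgramLM`, `train`, and `predict`. It is not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLM:
    """Character n-gram language model with add-one smoothing (assumed setup)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character contexts
        self.vocab = set()               # characters seen in predicted position

    def fit(self, texts):
        for text in texts:
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_prob(self, text, vocab_size):
        """Smoothed log-likelihood of text under this model."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def train(labelled_texts, n=3):
    """Fit one model per label from (native_language, essay_text) pairs."""
    by_label = {}
    for label, text in labelled_texts:
        by_label.setdefault(label, []).append(text)
    return {label: NgramLM(n).fit(texts) for label, texts in by_label.items()}


def predict(models, text):
    """Return the label whose model assigns text the highest log-likelihood."""
    # Use one shared vocabulary size so scores are comparable across models.
    vocab = set(text.lower())
    for model in models.values():
        vocab |= model.vocab
    size = len(vocab)
    return max(models, key=lambda label: models[label].log_prob(text, size))
```

In an ensemble of the kind the abstract describes, the per-language log-likelihoods would more plausibly serve as input features to a downstream classifier alongside the other feature types, rather than being used for a direct argmax as in this sketch.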