Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners

TLDR

The study extracts a large Japanese learner corpus from the revision log of a language-learning SNS and shows that it is suitable as training data for SMT-based error correction. To mitigate word segmentation errors caused by erroneous learner input, the authors compare character-wise and word-wise tokenization. The resulting corpus is large, covers a wide variety of topics and styles, and is useful to both learners and instructors; the character-wise SMT model outperforms the word-wise model.

Abstract

We present an attempt to extract a large-scale Japanese learners’ corpus from the revision log of a language learning SNS. This corpus is easy to obtain at large scale, covers a wide variety of topics and styles, and can be a great source of knowledge for both language learners and instructors. We also demonstrate that the extracted learners’ corpus of Japanese as a second language can be used as training data for learners’ error correction using an SMT approach. We evaluate different granularities of tokenization to alleviate the problem of word segmentation errors caused by erroneous input from language learners. Experimental results show that the character-wise model outperforms the word-wise model.
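
To make the tokenization granularity concrete, the following is a minimal sketch of character-wise tokenization applied to a learner sentence and its correction. The function names and the example sentence pair are hypothetical and are not taken from the paper or its corpus; the space-separated output is the line format SMT toolkits commonly expect, which is an assumption about the setup rather than a description of the authors' pipeline.

```python
def char_tokenize(sentence: str) -> list[str]:
    """Split a sentence into single-character tokens, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def to_smt_line(sentence: str) -> str:
    """Join character tokens with spaces, one sentence per line."""
    return " ".join(char_tokenize(sentence))


# Hypothetical learner sentence and correction (illustrative only).
learner = "私は学生がです"
corrected = "私は学生です"

print(to_smt_line(learner))
print(to_smt_line(corrected))
```

Because every character becomes its own token, this representation needs no word segmenter, so segmentation mistakes triggered by erroneous learner input cannot propagate into the training data. A word-wise model would instead rely on a morphological analyzer at this step.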
