Publication | Closed Access
A Comparison of Techniques to Automatically Identify Complex Words.
Citations: 83
References: 13
Year: 2013
Venue: Unknown
Identifying complex words is a crucial but often overlooked component of lexical simplification: over-identification can cause erroneous substitutions that lose meaning, while under-identification can leave confusing words in the text. This paper evaluates different methods for complex-word (CW) identification. The authors mined a corpus of sentences with annotated CWs from Simple Wikipedia edit histories, validated it with human judges, and then tested three CW-identification techniques: simplifying everything, frequency thresholding, and a support vector machine. Thresholding does not perform significantly differently from the naive simplify-everything baseline, while the support vector machine yields a slight precision gain at the cost of a dramatic loss in recall.
Identifying complex words (CWs) is an important, yet often overlooked, task within lexical simplification (the process of automatically replacing CWs with simpler alternatives). If too many words are identified then substitutions may be made erroneously, leading to a loss of meaning. If too few words are identified then those which impede a user’s understanding may be missed, resulting in a complex final text. This paper addresses the task of evaluating different methods for CW identification. A corpus of sentences with annotated CWs is mined from Simple Wikipedia edit histories and then used as the basis for several experiments. First, the corpus design is explained and the results of the validation experiments using human judges are reported. Experiments are then carried out on three CW identification techniques: simplifying everything, frequency thresholding and training a support vector machine. These are based upon previous approaches to the task and show that thresholding does not perform significantly differently to the more naive technique of simplifying everything. The support vector machine achieves a slight increase in precision over the other two methods, but at the cost of a dramatic trade-off in recall.
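To make the comparison in the abstract above concrete, the following is a minimal sketch of two of the three techniques (simplifying everything and frequency thresholding), scored with precision and recall. It is not the authors' implementation: the frequency table `TOY_FREQ`, the threshold value, the example sentence and the gold annotations are all hypothetical, and the support vector machine is omitted because the abstract does not state its features. The toy numbers illustrate only how the metrics are computed, not the paper's results.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy frequency counts (assumption); a real system would use
# counts from a large reference corpus.
TOY_FREQ: Dict[str, int] = {
    "the": 1000, "on": 900, "cat": 120, "mat": 40,
    "ubiquitous": 3, "perambulated": 1,
}


def simplify_everything(tokens: Iterable[str]) -> Set[str]:
    """Naive baseline: treat every word in the sentence as complex."""
    return set(tokens)


def frequency_threshold(tokens: Iterable[str], freq: Dict[str, int],
                        threshold: int) -> Set[str]:
    """Flag a word as complex when its frequency is below the threshold.

    Words missing from the table are given frequency 0 (assumption).
    """
    return {t for t in tokens if freq.get(t, 0) < threshold}


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted complex words against gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sentence and gold annotations, for illustration only.
tokens = ["the", "ubiquitous", "cat", "perambulated", "on", "the", "mat"]
gold = {"ubiquitous", "perambulated"}

everything = simplify_everything(tokens)
thresholded = frequency_threshold(tokens, TOY_FREQ, threshold=10)

print("simplify everything:", precision_recall(everything, gold))
print("frequency threshold:", precision_recall(thresholded, gold))
```

The simplify-everything baseline always reaches full recall by construction, so any alternative is judged mainly on how much precision it gains and how much recall it gives up in exchange, which is the trade-off the abstract reports for the support vector machine.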