Using FACETS to model rater training effects

TLDR

The study investigates differences in severity and consistency between inexperienced and experienced raters before and after training. Sixteen raters (eight experienced, eight inexperienced) rated overlapping subsets of 60 essays before and after training on a three-part scale (content, rhetorical control, language); the ratings were analysed with FACETS to estimate rater severity and consistency. Before training, inexperienced raters tended to be more severe and less consistent than experienced raters. After training, the differences between the two groups were less pronounced, but significant differences in severity remained among raters, while consistency improved for most raters. This suggests that training is more successful at improving intra-rater reliability than inter-rater reliability.

Abstract

This article describes a study conducted to explore differences in rater severity and consistency among inexperienced and experienced raters both before and after rater training. Sixteen raters (eight experienced and eight inexperienced) rated overlapping subsets of essays from a total sample of 60 essays before and after rater training in the context of an operational administration of UCLA’s English as a Second Language Placement Examination (ESLPE). A three-part scale was used, comprising content, rhetorical control, and language. Ratings were analysed using FACETS, a multi-faceted Rasch analysis program that provides estimates of rater severity on a linear scale as well as fit statistics, which are indicators of rater consistency. The analysis showed that the inexperienced raters tended to be both more severe and less consistent in their ratings than the experienced raters before training. After training, the differences between the two groups of raters were less pronounced; however, significant differences in severity were still found among raters, although consistency had improved for most raters. These results provide support for the notion that rater training is more successful in helping raters give more predictable scores (i.e., intra-rater reliability) than in getting them to give identical scores (i.e., inter-rater reliability).
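To make the idea of rater severity "on a linear scale" concrete, the following is a minimal sketch of the rating-scale form of the many-facet Rasch model that FACETS is built on, in which the log-odds of receiving category k rather than k-1 equal examinee ability minus item difficulty minus rater severity minus the step threshold, all in logits. It is an illustration only, not the study's analysis: the function names, the four-category scale, and every numeric value are assumptions chosen for the example, and the code computes model probabilities from given parameters rather than estimating parameters from data as FACETS does.

```python
import math


def category_probabilities(ability, difficulty, severity, thresholds):
    """Return model probabilities for rating categories 0..K.

    thresholds[k-1] is the step difficulty of moving from category k-1
    to category k. All arguments are in logits (hypothetical values).
    """
    # Cumulative sums of (ability - difficulty - severity - step) give
    # the unnormalised log-probability of each category.
    log_numerators = [0.0]
    for step in thresholds:
        log_numerators.append(
            log_numerators[-1] + ability - difficulty - severity - step
        )
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(ability, difficulty, severity, thresholds):
    """Return the model-expected rating for one examinee, item, and rater."""
    probabilities = category_probabilities(ability, difficulty, severity, thresholds)
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical four-category scale (0-3) and two hypothetical raters.
thresholds = [-1.0, 0.0, 1.0]
lenient = expected_score(0.5, 0.0, -0.5, thresholds)  # severity -0.5 logits
severe = expected_score(0.5, 0.0, 0.5, thresholds)    # severity +0.5 logits
print(f"lenient rater: {lenient:.2f}, severe rater: {severe:.2f}")
```

Under these assumed values, the same essay receives a lower expected rating from the more severe rater, which is the sense in which severity differences among raters translate into score differences. Fit statistics, by contrast, summarise how far a rater's observed ratings depart from expectations of this kind.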
