
Abstract

Automated scoring models for the e-rater® scoring engine were built and evaluated for the GRE® argument and issue-writing tasks. Three model types were built: prompt-specific, generic, and generic with a prompt-specific intercept. Evaluation statistics such as weighted kappas, Pearson correlations, standardized differences in mean scores, and correlations with external measures were examined to compare e-rater model performance against human scores. Performance was also evaluated across demographic subgroups. Additional analyses were performed to establish appropriate agreement thresholds between human and e-rater scores for unusual essays and to assess the impact of using e-rater on operational scores. The generic e-rater scoring model with an operational prompt-specific intercept was recommended for operational use on the issue-writing task, and the prompt-specific e-rater scoring model on the argument-writing task. The two automated scoring models were implemented to produce check scores at a discrepancy threshold of 0.5 relative to human scores.
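As a minimal sketch (not the authors' code), the snippet below computes the agreement statistics the abstract names: quadratic-weighted kappa, Pearson correlation, and the standardized difference in mean scores between human and e-rater scores, plus the 0.5 discrepancy check. The score arrays are invented for illustration, and the quadratic weighting for kappa is an assumption; the report does not specify the weighting scheme here.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical human and e-rater scores on the half-point GRE essay scale.
human = np.array([4.0, 3.5, 5.0, 2.5, 4.5, 3.0])
erater = np.array([4.0, 4.0, 4.5, 2.5, 4.0, 3.5])

# Weighted kappa needs integer category labels, so map the half-point
# scale onto integer bins (x2) before scoring.
to_bins = lambda s: (s * 2).astype(int)
qwk = cohen_kappa_score(to_bins(human), to_bins(erater), weights="quadratic")

# Pearson correlation between the two raters' scores.
r, _ = pearsonr(human, erater)

# Standardized difference in mean scores (machine minus human, pooled SD).
pooled_sd = np.sqrt((human.var(ddof=1) + erater.var(ddof=1)) / 2)
std_diff = (erater.mean() - human.mean()) / pooled_sd

# Check-score rule from the abstract: flag essays where the human and
# e-rater scores differ by more than 0.5 for additional human review.
flagged = np.abs(human - erater) > 0.5

print(f"QWK={qwk:.3f}  r={r:.3f}  std_diff={std_diff:.3f}  flagged={flagged.sum()}")
```

In operational use, a flagged essay would typically be routed to a second human rater rather than scored by the machine alone.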
