Evaluation of recommendations

TLDR

Recommender accuracy is usually evaluated either as rating prediction (RMSE) or as ranking (precision, recall). The study compares the two and shows that their main difference is the data used: rating prediction relies only on observed ratings, while ranking considers all items. The authors find that, because of selection bias and data sparsity, predicting observed ratings addresses only part of the problem, and that framing the task as a ranking or classification problem (predicting who rated what) offers a more comprehensive solution.

Abstract

The literature on recommender systems typically distinguishes between two broad categories of measuring recommendation accuracy: rating prediction, often quantified in terms of the root mean square error (RMSE), and ranking, measured in terms of metrics like precision and recall, among others. In this paper, we examine both approaches in detail, and find that the dominating difference lies instead in the training and test data considered: rating prediction is concerned with only the observed ratings, while ranking typically accounts for all items in the collection, whether the user has rated them or not. Furthermore, we show that predicting observed ratings, while popular in the literature, solves only a (small) part of the task of predicting the rating of any item in the collection, which is a common real-world problem. The reasons are selection bias in the data, combined with data sparsity. We show that this latter rating-prediction task involves the prediction task 'Who rated What' as a sub-problem, which can be cast as a classification or ranking problem. This suggests that solving the ranking problem is not only valuable by itself, but also for predicting the rating value of any item.
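
The contrast between the two evaluation styles can be made concrete with a minimal sketch. It is an illustration only, not the authors' experimental setup: the toy scores, the item names, the relevance threshold of 4.0, and the helper names `rmse_on_observed` and `precision_recall_at_k` are all assumptions introduced here. RMSE is computed over the observed ratings alone, while precision and recall at k rank every item in the collection, rated or not.

```python
import math

# Hypothetical predicted scores for one user over ALL items in the collection.
predicted = {"a": 4.5, "b": 3.0, "c": 4.8, "d": 2.0, "e": 4.0}
# Hypothetical test ratings: only the items this user actually rated.
observed = {"a": 5.0, "b": 2.0}


def rmse_on_observed(predicted, observed):
    """Rating-prediction view: error is measured on observed ratings only."""
    squared_errors = [(predicted[item] - rating) ** 2
                      for item, rating in observed.items()]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_k(predicted, relevant, k):
    """Ranking view: all items compete for the top-k slots, rated or not."""
    top_k = sorted(predicted, key=predicted.get, reverse=True)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k, hits / len(relevant)


# Assumption: an item counts as relevant if its observed rating is >= 4.0.
relevant = {item for item, rating in observed.items() if rating >= 4.0}

rmse = rmse_on_observed(predicted, observed)
precision, recall = precision_recall_at_k(predicted, relevant, k=2)
print(f"RMSE on observed ratings: {rmse:.3f}")
print(f"precision@2 = {precision:.2f}, recall@2 = {recall:.2f}")
```

In this toy case the unrated item "c" takes a top-k slot and lowers precision, yet it has no effect on RMSE, which never looks at unrated items.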
