Publication | Closed Access
Pessimistic Reward Models for Off-Policy Learning in Recommendation
Citations: 35
References: 64
Year: 2021
Unknown Venue
Artificial Intelligence, Pessimistic Reward Models, Deployed Recommendation Policy, Machine Learning, Behavioral Decision Making, Effective Reward Modelling, Engineering, Education, Reinforcement Learning (Educational Psychology), Lifelong Reinforcement Learning, Reinforcement Learning (Computer Engineering), Data Science, Preference Learning, Bandit Learning, Preference Modeling, Imitation Learning, Predictive Analytics, Knowledge Discovery, Action Model Learning, Sequential Decision Making, Computer Science, Exploration V Exploitation, Contextual Bandit, Deep Reinforcement Learning, Preference Elicitation, Decision Science
Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield, for example the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself. Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly at random, this leads to a selection bias that can impede effective reward modelling. This, in turn, makes off-policy learning (the typical setup in industry) particularly challenging.
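
The selection bias described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method: the function names (`simulate_logs`, `empirical_ctr`), the three-action setup, and all propensity and click-probability values are hypothetical and chosen only for illustration. A non-uniform logging policy is simulated, and the resulting log contains very few observations for the rarely shown action, so a reward estimate for that action rests on far less evidence.

```python
import random


def simulate_logs(propensities, click_probs, n, seed=0):
    """Simulate n logged interactions under a fixed (hypothetical) logging policy.

    propensities: probability with which the logging policy shows each action.
    click_probs:  true click probability of each action (unknown in practice).
    Returns per-action impression counts and click counts.
    """
    rng = random.Random(seed)
    actions = list(range(len(propensities)))
    counts = [0] * len(actions)
    clicks = [0] * len(actions)
    for _ in range(n):
        # The logging policy, not a uniform draw, decides which action is shown.
        a = rng.choices(actions, weights=propensities, k=1)[0]
        counts[a] += 1
        if rng.random() < click_probs[a]:
            clicks[a] += 1
    return counts, clicks


def empirical_ctr(counts, clicks):
    """Naive per-action reward estimate: clicks divided by impressions."""
    return [c / n if n > 0 else None for n, c in zip(counts, clicks)]


# Illustrative values only: a heavily skewed logging policy over three actions
# that all share the same true click probability.
propensities = [0.90, 0.09, 0.01]
click_probs = [0.10, 0.10, 0.10]

counts, clicks = simulate_logs(propensities, click_probs, n=10_000)
ctr = empirical_ctr(counts, clicks)

for a, (n_a, est) in enumerate(zip(counts, ctr)):
    print(f"action {a}: impressions={n_a}, estimated CTR={est}")
```

Although every action has the same true click probability in this toy setup, the estimate for the rarely shown action is based on roughly one percent of the log, which is the kind of skew that makes reward modelling from logged data difficult.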