Publication | Closed Access
Exploration scavenging
107
Citations
12
References
2008
Year
Unknown Venue
Contextual BanditSide InformationEngineeringData ScienceOnline AlgorithmGame TheoryTargeted AdvertisingExperimental EconomicsBusinessOnline AdvertisingSequential Decision MakingDecision TheoryComputer ScienceExploration PolicyBusiness AnalyticsMarket DesignMechanism DesignExploration V Exploitation
The study investigates policy evaluation in contextual bandits using data from a different policy and proposes a principled, provably correct method that works even with deterministic exploration as long as each action is sufficiently explored. The authors develop this method and apply it to offline evaluation of internet advertising policies. They demonstrate that evaluation is impossible when the exploration policy uses side information, yet the theory still yields realistic applications and is empirically validated on real ad placement data.
We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide non-trivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
| Year | Citations | |
|---|---|---|
Page 1
Page 1