Concepedia

Publication | Open Access

Stochastic Linear Optimization Under Bandit Feedback

Citations: 618 · References: 12 · Year: 2008

Abstract

In the classical stochastic k-armed bandit problem, in each of a sequence of rounds, a decision maker chooses one of k arms and incurs a cost drawn from an unknown distribution associated with that arm. In the linear optimization analog of this problem, rather than finitely many arms, the decision set is a compact subset of R^n, and the cost of each decision is the evaluation of a randomly chosen linear cost function at that point. As before, the cost functions are assumed to be sampled independently from an unknown but time-invariant distribution. The goal is to minimize the total cost incurred over some number of rounds T, and success is measured by low regret. Auer [2003] was the first to study this problem and provided an algorithm with a regret bound of O(poly(n, log |D|)√T), where |D| is the cardinality of the decision set (which he assumed to be finite). We present a near-complete characterization of this problem in terms of both upper and lower bounds. We consider a deterministic algorithm based on upper confidence bounds, which was described by Auer and conjectured to have small regret. (Auer analyzed a more complicated master algorithm, which called this simpler algorithm as a subroutine.) In certain natural cases,
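The upper-confidence-bound idea the abstract describes can be sketched for a finite decision set: maintain a ridge-regression estimate of the unknown cost vector, and in each round play the decision whose optimistic (lower) cost bound is smallest. This is a minimal illustrative sketch, not the paper's algorithm or analysis; the decision set `D`, the true cost vector `theta`, the regularizer `lam`, and the confidence width `beta` are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite decision set D ⊂ R^2 and unknown true cost vector θ
D = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta = np.array([0.2, 0.9])   # assumed for the demo; the learner never sees it

n = D.shape[1]
lam, beta = 1.0, 1.0           # regularizer and confidence width (assumptions)
A = lam * np.eye(n)            # design matrix λI + Σ x xᵀ
b = np.zeros(n)                # Σ (observed cost) · x

T = 2000
total_cost = 0.0
for t in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b      # ridge-regression estimate of θ
    # lower confidence bound on each decision's expected cost (we minimize)
    widths = np.sqrt(np.einsum('ij,jk,ik->i', D, A_inv, D))
    lcb = D @ theta_hat - beta * widths
    x = D[np.argmin(lcb)]      # optimism in the face of uncertainty
    cost = x @ theta + 0.1 * rng.standard_normal()   # noisy linear cost
    A += np.outer(x, x)
    b += cost * x
    total_cost += cost

best = (D @ theta).min()
regret = total_cost - T * best
print(round(float(regret), 1))
```

After enough rounds the confidence widths of the frequently played decisions shrink, the estimate concentrates around the true cost vector, and the cumulative regret grows sublinearly in T.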
