Publication | Open Access
Bandit Online Learning with Unknown Delays
Citations: 11 | References: 23 | Year: 2018
Mathematical Programming, Engineering, Machine Learning, Online Problem, Stochastic Optimization, Multi-armed Bandit, Game Theory, Bandit Online Learning, Online Algorithm, Unknown Delay, Bandit Online, Computer Science, Exploration V Exploitation, Operations Research
This paper deals with bandit online learning problems involving feedback of unknown delay, which can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function, which become available through feedback and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations, are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that the regrets of DEXP3 and DBGD are $\mathcal{O}\big( \sqrt{K\bar{d}(T+D)} \big)$ and $\mathcal{O}\big( \sqrt{K(T+D)} \big)$, respectively, where $\bar{d}$ is the maximum delay and $D$ denotes the delay accumulated over $T$ slots. Numerical tests using both synthetic and real data validate the performance of DEXP3 and DBGD.
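To make the setting concrete, the following is a minimal sketch of a generic EXP3-style learner that receives bandit feedback after a delay. It is not the paper's DEXP3 iteration. The function name `delayed_exp3`, the parameters `eta` and `gamma`, and the choice to divide a delayed loss by the arm's current probability (since the slot in which the arm was played is treated as unknown) are all assumptions made for illustration.

```python
import math
import random


def delayed_exp3(losses, delays, eta=0.1, gamma=0.01, seed=0):
    """Illustrative EXP3-style learner with delayed bandit feedback.

    losses[t][k]: loss in [0, 1] of arm k at slot t (only the played arm is revealed).
    delays[t]:    non-negative delay of the feedback generated at slot t.
    Hypothetical sketch; not the DEXP3 algorithm from the paper.
    """
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K   # cumulative importance-weighted loss estimates
    pending = {}          # arrival slot -> list of (arm, observed loss)
    total_loss = 0.0
    p = [1.0 / K] * K

    for t in range(T):
        # Exponential weights, shifted by the minimum for numerical stability.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        s = sum(w)
        # Uniform mixing keeps every probability at least gamma / K.
        p = [(1.0 - gamma) * x / s + gamma / K for x in w]

        arm = rng.choices(range(K), weights=p)[0]
        total_loss += losses[t][arm]
        pending.setdefault(t + delays[t], []).append((arm, losses[t][arm]))

        # Feedback arriving in this slot carries no timestamp, so the sketch
        # weights it by the current probability of the arm (an assumption).
        for fb_arm, fb_loss in pending.pop(t, []):
            cum_est[fb_arm] += fb_loss / p[fb_arm]

    return total_loss, p


# Usage: two arms, arm 0 always has loss 0 and arm 1 always has loss 1,
# with every feedback delayed by 3 slots.
T = 500
total, final_p = delayed_exp3([[0.0, 1.0]] * T, [3] * T)
```

With zero delays the update reduces to the standard importance-weighted EXP3 estimate, because the feedback is then weighted by the same probability with which the arm was drawn. Feedback scheduled to arrive after the last slot is simply never used.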