A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual\n Bandit Problem

Abstract

Bandit learning is characterized by the tension between long-term exploration\nand short-term exploitation. However, as has recently been noted, in settings\nin which the choices of the learning algorithm correspond to important\ndecisions about individual people (such as criminal recidivism prediction,\nlending, and sequential drug trials), exploration corresponds to explicitly\nsacrificing the well-being of one individual for the potential future benefit\nof others. This raises a fairness concern. In such settings, one might like to\nrun a "greedy" algorithm, which always makes the (myopically) optimal decision\nfor the individuals at hand - but doing this can result in a catastrophic\nfailure to learn. In this paper, we consider the linear contextual bandit\nproblem and revisit the performance of the greedy algorithm. We give a smoothed\nanalysis, showing that even when contexts may be chosen by an adversary, small\nperturbations of the adversary's choices suffice for the algorithm to achieve\n"no regret", perhaps (depending on the specifics of the setting) with a\nconstant amount of initial training data. This suggests that "generically"\n(i.e. in slightly perturbed environments), exploration and exploitation need\nnot be in conflict in the linear setting.\n