Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality

Abstract

The interplay between exploration and exploitation in competitive multi-agent learning is still far from being well understood. Motivated by this, we study smooth Q-learning, a prototypical learning model that explicitly captures the balance between game rewards and exploration costs. We show that Q-learning always converges to the unique quantal-response equilibrium (QRE), the standard solution concept for games under bounded rationality, in weighted zero-sum polymatrix games with heterogeneous learning agents using positive exploration rates. Complementing recent results about convergence in weighted potential games, we show that fast convergence of Q-learning in competitive settings is obtained regardless of the number of agents and without any need for parameter fine-tuning. As showcased by our experiments in network zero-sum games, these theoretical results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.
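As an informal illustration of the convergence claim, the following minimal sketch runs a discretized Boltzmann (softmax) Q-learning update in a two-player zero-sum matrix game. It is not the paper's formulation or experimental code: the function name `smooth_q_learning`, the expected-reward update rule, the step size `alpha`, the temperatures `temps` (standing in for the exploration rates), and the choice of Matching Pennies are all assumptions made for this example. A fixed point of this update satisfies x_i = softmax(r_i(x_-i) / T_i), which is the logit form of a QRE.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def smooth_q_learning(A, temps=(1.0, 1.0), alpha=0.1, steps=2000):
    """Hypothetical discretization of smooth Q-learning in a zero-sum game.

    A      : payoff matrix of player 1; player 2 receives -A^T (zero-sum).
    temps  : positive exploration rates (softmax temperatures), one per player.
    alpha  : Q-value step size (an assumption, not taken from the paper).
    Returns the two mixed strategies after `steps` updates.
    """
    A = np.asarray(A, dtype=float)
    B = -A.T
    T1, T2 = temps

    # Asymmetric initial Q-values, so the agents do not start at equilibrium.
    Q1 = np.zeros(A.shape[0])
    Q2 = np.zeros(A.shape[1])
    Q1[0] = 1.0
    Q2[-1] = 1.0

    for _ in range(steps):
        # Boltzmann policies induced by the current Q-values.
        x1 = softmax(Q1 / T1)
        x2 = softmax(Q2 / T2)
        # Move each Q-value toward the expected reward against the opponent.
        Q1 += alpha * (A @ x2 - Q1)
        Q2 += alpha * (B @ x1 - Q2)

    return softmax(Q1 / T1), softmax(Q2 / T2)


# Matching Pennies: the QRE is the uniform strategy for any positive temperature.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
x1, x2 = smooth_q_learning(matching_pennies)
print(x1, x2)
```

With the default settings both strategies approach the uniform distribution. The sketch uses exact expected rewards rather than sampled ones, so it only mirrors the deterministic dynamics and makes no claim about the stochastic algorithm or about convergence speed in larger networks.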