Multi-Armed Bandits
14 Apr 2019
Balancing exploration and exploitation in sequential decision making
The Problem
Imagine you're in a casino with multiple slot machines (bandits), each with an unknown probability of payout. How do you maximize your total reward over time? You need to balance:
- Exploration: Trying different machines to learn their payouts
- Exploitation: Playing the machine you believe has the highest payout
Epsilon-Greedy
The simplest approach: with probability \(\epsilon\), explore randomly; otherwise, exploit the best known action.
- Simple to implement
- Wastes exploration by sampling uniformly at random, even among clearly suboptimal actions
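A minimal sketch of the selection and update rules, assuming we track a running-mean estimate per arm (the names `q_estimates` and `counts` are illustrative, not from the text above):

```python
import random

def epsilon_greedy_action(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))  # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

def update_estimate(q_estimates, counts, action, reward):
    """Incremental running-mean update of the value estimate for the played arm."""
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]
```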
Upper Confidence Bounds (UCB)
UCB measures an action's potential by an upper confidence bound \(\hat{U}_{t}(a)\) on its estimated reward \(\hat{Q}_{t}(a)\), chosen so that the true value \(Q(a)\) lies below the bound with high probability:
\[ Q(a) \le \hat{Q}_{t}(a) + \hat{U}_{t}(a) \]
The UCB algorithm always selects the action that maximizes this upper confidence bound:
\[ a_{t}^{UCB} = \arg\max_{a \in A} \left( \hat{Q}_{t}(a) + \hat{U}_{t}(a) \right) \]
The confidence bound \(\hat{U}_{t}(a)\) typically decreases as we observe more samples of action \(a\), following the principle of "optimism in the face of uncertainty."
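The text does not fix a particular form for \(\hat{U}_{t}(a)\); a common choice is the UCB1 bound \(\sqrt{2 \ln t / N_{t}(a)}\), where \(N_{t}(a)\) is the number of times action \(a\) has been played. A sketch under that assumption:

```python
import math

def ucb_action(q_estimates, counts, t, c=2.0):
    """Select the arm maximizing Q_hat(a) + U_hat(a).

    Assumes the UCB1-style bound U_hat(a) = sqrt(c * ln(t) / N_t(a)).
    """
    # Play each arm once before applying the bound (avoids division by zero).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(
        range(len(q_estimates)),
        key=lambda a: q_estimates[a] + math.sqrt(c * math.log(t) / counts[a]),
    )
```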
Thompson Sampling
A Bayesian approach that maintains a probability distribution over the expected reward of each action. At each time step:
- Sample a reward estimate from each action's posterior distribution
- Select the action with the highest sampled value
- Update the posterior based on the observed reward
Thompson Sampling often achieves near-optimal regret bounds while being simple to implement.
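As a concrete sketch, assume Bernoulli rewards with a Beta(1, 1) prior on each arm's success probability (a standard choice for illustration, not the only one); the posterior then stays a Beta distribution and the update is a count increment:

```python
import random

class BernoulliThompson:
    """Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors."""

    def __init__(self, n_arms):
        self.alpha = [1] * n_arms  # 1 + observed successes per arm
        self.beta = [1] * n_arms   # 1 + observed failures per arm

    def select_action(self):
        # Sample one value per arm from its posterior, then play the argmax.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, action, reward):
        # Conjugate Beta-Bernoulli update; reward is 0 or 1.
        self.alpha[action] += reward
        self.beta[action] += 1 - reward
```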
Regret
We measure performance by regret: the difference between the reward we would have obtained by always playing the optimal action and what we actually received:
\[ R_T = T \cdot \mu^* - \sum_{t=1}^{T} r_t \]
where \(\mu^*\) is the expected reward of the best action.
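A direct reading of the formula, with \(\mu^*\) assumed known only for evaluation purposes:

```python
def regret(rewards, mu_star):
    """Cumulative regret after T rounds: T * mu_star minus the rewards received."""
    return len(rewards) * mu_star - sum(rewards)

# e.g. the best arm pays 0.9 on average and 4 rounds yielded rewards [1, 0, 1, 1]:
# regret([1, 0, 1, 1], mu_star=0.9) -> 4 * 0.9 - 3 = 0.6 (up to float rounding)
```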