Multi-Armed Bandits

Balancing exploration and exploitation in sequential decision making

The Problem

Imagine you're in a casino with multiple slot machines (bandits), each with an unknown probability of payout. How do you maximize your total reward over time? You need to balance:

  • Exploration: Trying different machines to learn their payouts
  • Exploitation: Playing the machine you believe has the highest payout
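
To make the setup concrete, here is a minimal sketch of a Bernoulli bandit environment, where each arm pays 1 with some hidden probability; the class name and the probabilities below are arbitrary, illustrative choices, not part of the original text.

```python
import random

class BernoulliBandit:
    """K arms; arm a pays 1 with (hidden) probability probs[a], else 0."""

    def __init__(self, probs):
        self.probs = probs            # true payout probability of each arm (unknown to the player)
        self.best_prob = max(probs)   # expected reward of the best arm, used later for regret

    def pull(self, arm):
        """Pull one arm and return its 0/1 reward."""
        return 1 if random.random() < self.probs[arm] else 0


# illustrative 3-armed bandit; arm 2 is the best one
bandit = BernoulliBandit([0.1, 0.5, 0.8])
```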

Epsilon-Greedy

The simplest approach: with probability \(\epsilon\), explore randomly; otherwise, exploit the best known action.

  • Simple to implement
  • Explores uniformly at random, so it keeps spending pulls on clearly suboptimal actions
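
A minimal sketch of epsilon-greedy, assuming a bandit object with a `pull(arm)` method like the environment above:

```python
import random

def epsilon_greedy(bandit, n_arms, n_steps, epsilon=0.1):
    """Explore a random arm with probability epsilon; otherwise exploit
    the arm with the highest estimated value."""
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit
        reward = bandit.pull(arm)
        counts[arm] += 1
        # incremental update of the sample mean for this arm
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return total_reward, values
```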

Upper Confidence Bounds (UCB)

UCB measures the potential of each action by an upper confidence bound \(\hat{U}_{t}(a)\) on its estimated reward \(\hat{Q}_{t}(a)\), chosen so that the true value lies below the bound with high probability:

\[ Q(a) \le \hat{Q}_{t}(a) + \hat{U}_{t}(a) \]

The UCB algorithm always selects the action that maximizes this upper confidence bound:

\[ a_{t}^{UCB} = \arg\max_{a \in A} \left( \hat{Q}_{t}(a) + \hat{U}_{t}(a) \right) \]

The confidence bound \(\hat{U}_{t}(a)\) shrinks as we observe more samples of action \(a\), so rarely tried actions keep a large bonus and still get selected occasionally; this is the principle of "optimism in the face of uncertainty."
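
A common concrete choice is UCB1, which uses Hoeffding's inequality to set \(\hat{U}_{t}(a) = \sqrt{2 \log t / N_{t}(a)}\), where \(N_{t}(a)\) is the number of times action \(a\) has been pulled so far. A minimal sketch, again assuming the `pull(arm)` interface from the environment above:

```python
import math

def ucb1(bandit, n_arms, n_steps):
    """UCB1: pull the arm maximizing Q_hat(a) + sqrt(2 * log(t) / N(a))."""
    counts = [0] * n_arms
    values = [0.0] * n_arms
    total_reward = 0.0

    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1    # pull each arm once so every count is nonzero
        else:
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return total_reward, values
```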

Thompson Sampling

A Bayesian approach that maintains a probability distribution over the expected reward of each action:

  1. Sample a reward estimate from each action's posterior distribution
  2. Select the action with the highest sampled value
  3. Update the posterior based on the observed reward

Thompson Sampling often achieves near-optimal regret bounds while being simple to implement.
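
For Bernoulli rewards, the conjugate Beta prior makes the posterior update a simple counter increment. A minimal sketch under that assumption, using the same `pull(arm)` interface as above:

```python
import random

def thompson_sampling(bandit, n_arms, n_steps):
    """Beta-Bernoulli Thompson sampling: keep a Beta(alpha, beta) posterior per arm."""
    alpha = [1] * n_arms   # 1 + observed successes
    beta = [1] * n_arms    # 1 + observed failures
    total_reward = 0.0

    for _ in range(n_steps):
        # 1. sample a reward estimate from each arm's posterior
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # 2. select the arm with the highest sampled value
        arm = max(range(n_arms), key=lambda a: samples[a])
        # 3. update that arm's posterior with the observed reward
        reward = bandit.pull(arm)
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, (alpha, beta)
```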

Regret

We measure performance by regret: the expected reward we would have obtained by always playing the optimal action, minus the total reward we actually received:

\[ R_T = T \cdot \mu^* - \sum_{t=1}^{T} r_t \]

where \(\mu^*\) is the expected reward of the best action and \(r_t\) is the reward received at step \(t\).
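
A small, self-contained sketch of this computation, with purely illustrative numbers:

```python
def regret(mu_star, rewards):
    """R_T = T * mu_star - sum of the rewards actually received."""
    return len(rewards) * mu_star - sum(rewards)

# illustrative: the best arm pays 0.8 on average, and 5 pulls yielded these rewards
print(regret(0.8, [1, 0, 1, 1, 0]))   # 5 * 0.8 - 3 = 1.0
```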