Multi-Armed Bandits

Balancing exploration and exploitation in sequential decision making

The Problem

Imagine you're in a casino with multiple slot machines (bandits), each with an unknown probability of payout. How do you maximize your total reward over time? You need to balance:

  • Exploration: Trying different machines to learn their payouts
  • Exploitation: Playing the machine you believe has the highest payout
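
To make the setup concrete, here is a minimal sketch of a Bernoulli bandit environment, where each arm pays 1 with some hidden probability; the class name and the probabilities below are arbitrary, illustrative choices, not part of the original text.

```python
import random

class BernoulliBandit:
    """K arms; arm a pays 1 with (hidden) probability probs[a], else 0."""

    def __init__(self, probs):
        self.probs = probs            # true payout probability of each arm (unknown to the player)
        self.best_prob = max(probs)   # expected reward of the best arm, used later for regret

    def pull(self, arm):
        """Pull one arm and return its 0/1 reward."""
        return 1 if random.random() < self.probs[arm] else 0


# illustrative 3-armed bandit; arm 2 is the best one
bandit = BernoulliBandit([0.1, 0.5, 0.8])
```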

Epsilon-Greedy

The simplest approach: with probability \(\epsilon\), explore randomly; otherwise, exploit the best known action.

  • Simple to implement
  • Explores uniformly at random, so it keeps spending pulls on clearly suboptimal actions
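
A minimal sketch of epsilon-greedy, assuming a bandit object with a `pull(arm)` method like the environment above:

```python
import random

def epsilon_greedy(bandit, n_arms, n_steps, epsilon=0.1):
    """Explore a random arm with probability epsilon; otherwise exploit
    the arm with the highest estimated value."""
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit
        reward = bandit.pull(arm)
        counts[arm] += 1
        # incremental update of the sample mean for this arm
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return total_reward, values
```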

Upper Confidence Bounds (UCB)

UCB measures the potential of each action by an upper confidence bound \(\hat{U}_{t}(a)\) on its estimated reward \(\hat{Q}_{t}(a)\), chosen so that the true value lies below the bound with high probability:

\[ Q(a) \le \hat{Q}_{t}(a) + \hat{U}_{t}(a) \]

The UCB algorithm always selects the action that maximizes this upper confidence bound:

\[ a_{t}^{UCB} = \arg\max_{a \in A} \left( \hat{Q}_{t}(a) + \hat{U}_{t}(a) \right) \]

The confidence bound \(\hat{U}_{t}(a)\) shrinks as we observe more samples of action \(a\), so rarely tried actions keep a large bonus and still get selected occasionally; this is the principle of "optimism in the face of uncertainty."
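
A common concrete choice is UCB1, which uses Hoeffding's inequality to set \(\hat{U}_{t}(a) = \sqrt{2 \log t / N_{t}(a)}\), where \(N_{t}(a)\) is the number of times action \(a\) has been pulled so far. A minimal sketch, again assuming the `pull(arm)` interface from the environment above:

```python
import math

def ucb1(bandit, n_arms, n_steps):
    """UCB1: pull the arm maximizing Q_hat(a) + sqrt(2 * log(t) / N(a))."""
    counts = [0] * n_arms
    values = [0.0] * n_arms
    total_reward = 0.0

    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1    # pull each arm once so every count is nonzero
        else:
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return total_reward, values
```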

Thompson Sampling

A Bayesian approach that maintains a probability distribution over the expected reward of each action:

  1. Sample a reward estimate from each action's posterior distribution
  2. Select the action with the highest sampled value
  3. Update the posterior based on the observed reward

Thompson Sampling often achieves near-optimal regret bounds while being simple to implement.
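
For Bernoulli rewards, the conjugate Beta prior makes the posterior update a simple counter increment. A minimal sketch under that assumption, using the same `pull(arm)` interface as above:

```python
import random

def thompson_sampling(bandit, n_arms, n_steps):
    """Beta-Bernoulli Thompson sampling: keep a Beta(alpha, beta) posterior per arm."""
    alpha = [1] * n_arms   # 1 + observed successes
    beta = [1] * n_arms    # 1 + observed failures
    total_reward = 0.0

    for _ in range(n_steps):
        # 1. sample a reward estimate from each arm's posterior
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # 2. select the arm with the highest sampled value
        arm = max(range(n_arms), key=lambda a: samples[a])
        # 3. update that arm's posterior with the observed reward
        reward = bandit.pull(arm)
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, (alpha, beta)
```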

Regret

We measure performance by regret: the expected reward we would have obtained by always playing the optimal action, minus the total reward we actually received:

\[ R_T = T \cdot \mu^* - \sum_{t=1}^{T} r_t \]

where \(\mu^*\) is the expected reward of the best action and \(r_t\) is the reward received at step \(t\).
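
A small, self-contained sketch of this computation, with purely illustrative numbers:

```python
def regret(mu_star, rewards):
    """R_T = T * mu_star - sum of the rewards actually received."""
    return len(rewards) * mu_star - sum(rewards)

# illustrative: the best arm pays 0.8 on average, and 5 pulls yielded these rewards
print(regret(0.8, [1, 0, 1, 1, 0]))   # 5 * 0.8 - 3 = 1.0
```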