Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Machine Learning

Trade-Offs in Bandit Problems: A Review of Lower Bounds and Achieved Regret


In this article, we investigate the behavior of upper confidence bound (UCB) and Thompson Sampling (TS) algorithms in a two-armed bandit problem with non-exponential rewards. We show that both algorithms achieve an expected regret of order $O(\sqrt{T})$, so the interesting differences between them lie not in the order of expected regret but in their one-shot behavior. We also generalize these results to most index policies in the literature, providing a set of conditions under which an index policy has linear sliding regret.
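To make the setting concrete, here is a minimal sketch (in Python, with hypothetical arm means and helper names, not code from the paper) of a two-armed stochastic bandit and of the cumulative pseudo-regret that the $O(\sqrt{T})$ bounds refer to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: each arm pays a Bernoulli reward with a fixed mean.
ARM_MEANS = [0.5, 0.6]          # arm 1 is the optimal arm in this toy instance
BEST_MEAN = max(ARM_MEANS)

def pull(arm):
    """Draw a reward for the chosen arm (Bernoulli here, as a stand-in for any reward law)."""
    return float(rng.random() < ARM_MEANS[arm])

def pseudo_regret(chosen_arms):
    """Cumulative pseudo-regret after T pulls: the sum of the mean gaps of the arms played."""
    return sum(BEST_MEAN - ARM_MEANS[a] for a in chosen_arms)
```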
The analysis begins by extending Theorem 8 and Theorem 13 to general index policies. The authors provide a set of nine conditions (A1-A9) that are satisfied by the classical indexes in the literature, including the standard UCB and TS algorithms, and show that under these conditions an index policy has linear sliding regret (Theorem 18).
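Loosely speaking, an index policy assigns each arm a score computed from that arm's past observations and the current time, and then plays the arm with the highest score; conditions such as A1-A9 constrain how those scores behave. Below is a generic sketch of this loop (reusing the `pull` helper from the sketch above; the names are illustrative, not the paper's).

```python
def run_index_policy(index_fn, horizon, n_arms=2):
    """Generic index-policy loop: pull each arm once, then repeatedly pull the arm
    whose index (a function of its observed rewards and the time step) is largest."""
    rewards = [[] for _ in range(n_arms)]
    chosen = []
    for t in range(horizon):
        if t < n_arms:                      # initialization: one pull per arm
            arm = t
        else:
            arm = max(range(n_arms), key=lambda a: index_fn(rewards[a], t))
        rewards[arm].append(pull(arm))
        chosen.append(arm)
    return chosen
```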
We then analyze the behavior of UCB and TS algorithms in more detail. For UCB, a regret bound of $O(\sqrt{T})$ is proved, showing that it is optimal in the sense of minimizing expected regret among all policies that rely only on past observations. For TS, the bound is of the same order, $O(\sqrt{T})$; TS is faster in terms of one-shot regret, but does not match UCB's optimality in expected regret.
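For illustration only, here are the textbook UCB1 index and a Bernoulli Thompson Sampling rule written so they fit the loop above; these are the standard formulations, not necessarily the exact variants analyzed in the paper. For TS, the "index" is a random draw from the Beta posterior rather than a deterministic score.

```python
import math

def ucb1_index(arm_rewards, t):
    """Classical UCB1 index: empirical mean plus an exploration bonus that shrinks
    as the arm is pulled more often."""
    n = len(arm_rewards)
    return float(np.mean(arm_rewards)) + math.sqrt(2.0 * math.log(t + 1) / n)

def thompson_index(arm_rewards, t):
    """Thompson Sampling for Bernoulli rewards: sample from the Beta(1 + successes,
    1 + failures) posterior; the arm with the largest sample is played."""
    successes = sum(arm_rewards)
    failures = len(arm_rewards) - successes
    return rng.beta(1.0 + successes, 1.0 + failures)
```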
To illustrate the difference between these algorithms, we provide a figure showing their typical one-shot pseudo-regret. We also compare their behavior to that of MOSS, KL-UCB, and IMED, which are other popular index policies.
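A small simulation along these lines can produce that kind of comparison for UCB and TS; MOSS, KL-UCB, and IMED would simply be additional index functions plugged into the same loop. This is only a sketch of such an experiment, not the paper's figure.

```python
def average_pseudo_regret(index_fn, horizon=10_000, runs=20):
    """Average cumulative pseudo-regret of one policy over independent runs."""
    total = 0.0
    for _ in range(runs):
        total += pseudo_regret(run_index_policy(index_fn, horizon))
    return total / runs

if __name__ == "__main__":
    for name, fn in [("UCB1", ucb1_index), ("Thompson Sampling", thompson_index)]:
        print(f"{name}: average pseudo-regret = {average_pseudo_regret(fn):.1f}")
```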
Overall, these results demonstrate that UCB and TS have different strengths and weaknesses: UCB performs better in terms of expected regret but is slower in terms of one-shot regret, while TS is faster in one-shot regret but does not attain the same expected-regret optimality.
In conclusion, this article provides a comprehensive analysis of UCB and TS algorithms in a two-armed bandit problem with non-exponential rewards, and generalizes these results to most index policies in the literature. The findings shed light on the relative strengths and weaknesses of these algorithms and offer insight into their performance in different scenarios.