In this paper, the authors explore the problem of choosing a sequence of actions under uncertainty using a technique called double Thompson sampling (D-TS). They apply this approach to the dueling bandit problem, in which the learner repeatedly selects a pair of arms (actions) to compare and observes only relative feedback, that is, which of the two arms won the comparison, rather than a numeric reward for each arm.
The authors show that by using double Thompson sampling they can improve convergence compared to a traditional single-sample Thompson sampling approach. They prove that the proposed algorithm achieves a regret bound that is logarithmic in the number of rounds, which means the cumulative loss from not always playing the optimal arm grows only slowly, so the algorithm plays near-optimal arms with high probability after enough rounds.
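To make the regret claim concrete, here is one standard convention from the dueling bandit literature (the notation is ours, and the paper's exact definition may differ in details): write $p_{ij}$ for the probability that arm $i$ beats arm $j$, let $i^*$ denote the best arm, and set $\epsilon_i = p_{i^* i} - \tfrac{1}{2}$. If the pair $(a_t, b_t)$ is compared at round $t$, the cumulative regret after $T$ rounds is

$$ R(T) = \sum_{t=1}^{T} \frac{\epsilon_{a_t} + \epsilon_{b_t}}{2}, $$

and a logarithmic bound means $R(T) = O(\log T)$, so the average per-round regret vanishes as $T$ grows.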
To build intuition, consider a gambler in a casino who must choose which slot machine (arm) to play each round. The gambler has a limited budget and wants to maximize winnings over time, but the payoff of each machine is uncertain and can only be learned from the outcomes of previous rounds. In the dueling version of the problem the feedback is even weaker: the gambler only learns which of two chosen machines did better in a head-to-head comparison, not how much either one paid out.
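A minimal sketch of ordinary Thompson sampling for this gambler, assuming Bernoulli payoffs and win probabilities we made up for illustration: keep a Beta posterior per machine, draw one plausible payoff rate per machine each round, and play the machine with the highest draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 slot machines with unknown win probabilities.
true_probs = [0.3, 0.5, 0.6]
K, T = len(true_probs), 2000

# Beta(1, 1) priors: pseudo-counts of wins and losses for each machine.
wins = np.ones(K)
losses = np.ones(K)

total_reward = 0
for t in range(T):
    # Thompson sampling: draw one sample per machine from its posterior
    # and play the machine whose sampled payoff rate is largest.
    samples = rng.beta(wins, losses)
    arm = int(np.argmax(samples))

    # Observe a Bernoulli reward and update that machine's posterior.
    reward = rng.random() < true_probs[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward
    total_reward += reward

print(f"total reward over {T} rounds: {total_reward}")
```

Over time the posteriors of the worse machines concentrate below the best one, so the gambler plays the best machine in most rounds. Double Thompson sampling adapts this idea to the dueling setting, where only pairwise comparisons are observed.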
Double Thompson sampling gets its name from the way it selects the two arms of each duel. The algorithm maintains a Beta posterior over each pairwise preference probability (how likely one arm is to beat another). Each round it draws one posterior sample to pick the first arm of the duel, and then draws a second, independent posterior sample of the comparisons against that arm to pick its opponent. Because both sides of the duel are chosen by posterior sampling, the algorithm exploits what it has already learned while still exploring comparisons that remain uncertain.
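Here is a simplified sketch of the double-sampling idea, assuming a made-up Bernoulli preference matrix `P` for illustration; it omits the confidence-interval pruning of candidate arms that the actual D-TS algorithm uses, so it is a sketch of the core mechanism rather than a faithful implementation of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical preference matrix: P[i, j] = probability that arm i beats arm j.
P = np.array([
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
])
K, T = P.shape[0], 2000

# wins[i, j] = number of duels in which arm i has beaten arm j.
wins = np.zeros((K, K))
first_choice_counts = np.zeros(K, dtype=int)

for t in range(T):
    # --- First Thompson sample: pick the first arm of the duel. ---
    # Sample one plausible preference matrix from the Beta posteriors
    # (one draw per unordered pair, mirrored so theta[i, j] + theta[j, i] = 1)
    # and pick the arm that beats the most others under that sample.
    theta1 = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(i + 1, K):
            s = rng.beta(wins[i, j] + 1, wins[j, i] + 1)
            theta1[i, j], theta1[j, i] = s, 1 - s
    a1 = int(np.argmax((theta1 > 0.5).sum(axis=1)))
    first_choice_counts[a1] += 1

    # --- Second, independent Thompson sample: pick the opponent. ---
    # Re-sample only the comparisons against a1 and choose the arm most
    # likely to beat a1 under this fresh sample.
    theta2 = rng.beta(wins[:, a1] + 1, wins[a1, :] + 1)
    theta2[a1] = 0.5  # a self-comparison is allowed but uninformative
    a2 = int(np.argmax(theta2))

    # Run the duel, observe only which arm won, and update the counts.
    if rng.random() < P[a1, a2]:
        wins[a1, a2] += 1
    else:
        wins[a2, a1] += 1

print("times each arm was chosen as the first candidate:", first_choice_counts)
```

In this toy run, arm 0 beats every other arm, so it ends up being selected as the first candidate in the vast majority of rounds while the second sample keeps probing its closest competitors.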
The authors demonstrate the effectiveness of double Thompson sampling through both theoretical analysis and simulations, showing that it achieves lower regret than existing dueling bandit methods.
In summary, double Thompson sampling is an approach to choosing actions under uncertainty that selects both arms of each duel by independent posterior sampling, letting the algorithm make better comparisons over time. The authors prove that this approach achieves a logarithmic regret bound and demonstrate its effectiveness through simulations and theoretical analysis.
Computer Science, Machine Learning