In this article, we explore the relationship between the Fisher-non-degeneracy assumption and common policy parameterizations in reinforcement learning (RL). We examine the limitations of widely used algorithms that rely on this assumption, such as policy gradient methods, and highlight that they cannot guarantee an exactly optimal solution whenever the function approximation error is not negligible.
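For reference, the Fisher-non-degeneracy assumption is usually stated in terms of the Fisher information matrix induced by the policy. One standard formulation (the notation below follows the policy gradient literature and is not necessarily the article's own) requires that, for some constant \mu_F > 0,

\[
F_\rho(\theta) \;=\; \mathbb{E}_{(s,a)\sim \nu_\rho^{\pi_\theta}}\!\left[\nabla_\theta \log \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s)^{\top}\right] \;\succeq\; \mu_F\, I_d \quad \text{for all } \theta,
\]

where \nu_\rho^{\pi_\theta} denotes the state-action visitation distribution of \pi_\theta under the initial distribution \rho.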
Section 1 introduces the Fisher-non-degeneracy assumption and its significance in RL. We then discuss how common policy parameterizations such as softmax, log-linear, and neural softmax are defined mathematically. Algorithms built on these parameterizations, however, can only find an optimal solution up to the function approximation error they incur.
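For concreteness, these parameterizations are typically written as follows (\theta denotes the policy parameters, \phi(s,a) a fixed feature map, and f_\theta(s,a) a neural network; the notation is standard but may differ slightly from the article's):

\[
\text{softmax (tabular):}\quad \pi_\theta(a\mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'}\exp(\theta_{s,a'})},
\qquad
\text{log-linear:}\quad \pi_\theta(a\mid s) = \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{a'}\exp\!\big(\theta^\top \phi(s,a')\big)},
\]
\[
\text{neural softmax:}\quad \pi_\theta(a\mid s) = \frac{\exp\!\big(f_\theta(s,a)\big)}{\sum_{a'}\exp\!\big(f_\theta(s,a')\big)}.
\]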
Section 2 examines the connection between the Fisher-non-degeneracy assumption and the optimality guarantees of algorithms such as NPG-PD and CRPO. These algorithms are designed to find an (ε + √ε_bias)-optimal policy, where ε_bias denotes the function approximation error. When they are instantiated with the softmax parameterization, ε_bias = 0 and the guarantee reduces to an ε-optimal policy; with richer parameterizations, however, ε_bias is generally nonzero, and additional techniques are required to achieve optimality.
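To make the (ε + √ε_bias) guarantee concrete, results of this type are usually stated in roughly the following form (constants and the exact definition of ε_bias vary from paper to paper; the version below, based on the compatible function approximation error, is one common choice and is given here only for orientation):

\[
V^{*}(\rho) - V^{\pi_{\mathrm{out}}}(\rho) \;\le\; \varepsilon + O\!\left(\sqrt{\varepsilon_{\mathrm{bias}}}\right),
\qquad
\varepsilon_{\mathrm{bias}} \;=\; \max_{\theta}\,\min_{w}\; \mathbb{E}_{(s,a)\sim \nu^{*}}\!\left[\Big(A^{\pi_\theta}(s,a) - w^{\top}\nabla_\theta \log \pi_\theta(a\mid s)\Big)^{2}\right],
\]

so ε_bias vanishes whenever the policy class is expressive enough to represent the advantage function through its score function, as is the case for the tabular softmax parameterization.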
Section 3 focuses on the CRPO algorithm, which uses a two-layer neural network parameterization to control the function approximation error. While this approach gives finer control over √ε_bias, it requires careful tuning of the network width to ensure optimality.
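The article does not reproduce the architecture here, so the following is only a minimal sketch of what such a two-layer neural softmax policy might look like; the class name, feature construction, and shapes are illustrative assumptions rather than the CRPO implementation, with the hidden width exposed as the tuning knob discussed above.

```python
# A minimal, hypothetical sketch (not the paper's implementation): a two-layer
# neural softmax policy whose hidden width is the knob that controls the
# function approximation error eps_bias.
import numpy as np

class TwoLayerSoftmaxPolicy:
    def __init__(self, state_dim: int, num_actions: int, width: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_actions = num_actions
        # First layer: random features of the (state, one-hot action) pair.
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(state_dim + num_actions),
                             size=(width, state_dim + num_actions))
        # Second layer: trainable output weights (NTK-style analyses often keep
        # the first layer near its initialization and train only the top layer).
        self.w2 = np.zeros(width)

    def _features(self, state: np.ndarray, action: int) -> np.ndarray:
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        x = np.concatenate([state, one_hot])
        return np.maximum(self.W1 @ x, 0.0)  # ReLU hidden layer

    def logits(self, state: np.ndarray) -> np.ndarray:
        return np.array([self.w2 @ self._features(state, a)
                         for a in range(self.num_actions)])

    def action_probs(self, state: np.ndarray) -> np.ndarray:
        z = self.logits(state)
        z -= z.max()              # numerical stability
        p = np.exp(z)
        return p / p.sum()

# Usage: a wider hidden layer enlarges the representable function class,
# which is how the width trades computation for a smaller approximation error.
policy = TwoLayerSoftmaxPolicy(state_dim=4, num_actions=3, width=256)
probs = policy.action_probs(np.ones(4))
```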
In summary, by examining the limitations of commonly used algorithms that rely on the Fisher-non-degeneracy assumption, this article highlights the importance of accounting for the function approximation error introduced by the chosen policy parameterization. These observations should be useful to researchers and practitioners seeking to develop more efficient and effective RL algorithms.