Bridging the gap between complex scientific research and the curious minds eager to explore it.

Electrical Engineering and Systems Science, Systems and Control

Safety Function Learning for Interacting Systems: A Thorough Approach


In this article, we explore how to safely terminate episodes in a Markov decision process (MDP). An MDP is a mathematical framework for modelling sequential decision-making problems in which the outcome of an action depends on the current state of the system. We focus on a specific class of MDPs, called p-safe MDPs, in which the probability of reaching a forbidden (unsafe) state must be at most p, for some given tolerance p strictly less than 1.
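To make the p-safety condition concrete, here is a minimal sketch that checks it empirically on a hypothetical toy MDP. The transition probabilities, the choice of forbidden state, and the tolerance p are all illustrative assumptions, not taken from the paper.

```python
import random

# Hypothetical toy MDP: states 0..3, where state 2 is a safe terminal
# (target) and state 3 is a forbidden terminal. Transition probabilities
# below are invented for illustration.
TRANSITIONS = {
    0: [(0.7, 1), (0.3, 0)],   # from 0: move to 1 w.p. 0.7, stay w.p. 0.3
    1: [(0.6, 2), (0.4, 3)],   # from 1: target w.p. 0.6, forbidden w.p. 0.4
}
TERMINAL = {2, 3}
FORBIDDEN = {3}

def step(state):
    """Sample the next state from the toy transition kernel."""
    r, acc = random.random(), 0.0
    for prob, nxt in TRANSITIONS[state]:
        acc += prob
        if r < acc:
            return nxt
    return TRANSITIONS[state][-1][1]

def estimate_unsafe_prob(start=0, episodes=20000, max_steps=100):
    """Monte Carlo estimate of the probability that an episode
    started in `start` ends in a forbidden state."""
    unsafe = 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if s in TERMINAL:
                break
            s = step(s)
        unsafe += s in FORBIDDEN
    return unsafe / episodes

# The MDP is p-safe from this start state if the unsafe probability
# is at most the tolerance p (here p = 0.5, an arbitrary choice).
p = 0.5
est = estimate_unsafe_prob()
print(f"estimated unsafe probability: {est:.3f}  (p-safe for p={p}: {est <= p})")
```

In this toy kernel every trajectory eventually reaches state 1, where the unsafe branch has probability 0.4, so the estimate should settle near 0.4 and the p-safety check passes for p = 0.5.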
Our goal is to design an algorithm that terminates episodes in a p-safe MDP without compromising safety. An episode begins in a random initial state and ends when the system reaches a terminal state or when a predetermined maximum number of steps is exceeded. The algorithm uses a proxy set of states to distinguish safe states from unsafe ones, and then runs a behavior policy to generate data for learning. An off-policy temporal-difference learning method with importance sampling is used to learn the safety function of the given target policy.
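The learning step just described can be sketched as follows: off-policy TD(0) with per-step importance sampling, where the "value" being learned is the probability of reaching the target without hitting a forbidden state. The four-state chain, the target and behavior policies, and the step size are all assumptions made for illustration; they are not the paper's actual setup.

```python
import random

# Hypothetical 4-state chain: state 0 is forbidden, state 3 is the target,
# "right" moves toward the target and "left" toward the forbidden state.
N_STATES, TARGET, FORBIDDEN = 4, 3, 0
ACTIONS = ("left", "right")

def transition(s, a):
    """Deterministic chain dynamics (an illustrative assumption)."""
    return s + 1 if a == "right" else s - 1

def pi(a, s):    # target policy: mostly moves right (assumed)
    return 0.9 if a == "right" else 0.1

def b(a, s):     # behavior policy: uniformly random over the two actions
    return 0.5

def td_safety(episodes=20000, alpha=0.1, rng=random.Random(0)):
    """Learn V(s) ~ P(reach TARGET before FORBIDDEN under pi) from data
    generated by the behavior policy b, via importance-sampled TD(0)."""
    V = [0.0] * N_STATES
    V[TARGET] = 1.0          # safe terminal: safety value 1
    V[FORBIDDEN] = 0.0       # forbidden terminal: safety value 0
    for _ in range(episodes):
        s = rng.choice([1, 2])               # random non-terminal start
        while s not in (TARGET, FORBIDDEN):
            a = rng.choice(ACTIONS)          # act with behavior policy b
            rho = pi(a, s) / b(a, s)         # importance-sampling ratio
            s2 = transition(s, a)
            # TD(0) update with zero reward; terminal values carry the signal
            V[s] += alpha * rho * (V[s2] - V[s])
            s = s2
    return V

V = td_safety()
```

The importance ratio rho reweights each transition so that, in expectation, the update matches what the target policy pi would have produced; for this chain the learned values should approach the gambler's-ruin probabilities of reaching the target under pi (about 0.89 from state 1 and 0.99 from state 2).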
To illustrate the approach, we consider an example MDP with 12 states, two target states, and two forbidden states. The proxy set is defined as the set of states reachable from the initial state. We show how the algorithm learns the safety function of a uniformly random policy at every state.
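The paper's exact 12-state topology is not reproduced here, so the sketch below invents one: a hypothetical 3x4 grid, with target states {3, 11} and forbidden states {4, 7} chosen arbitrarily. It estimates the safety function of the uniformly random policy by plain Monte Carlo rather than the TD method, simply to show what quantity is being learned for each state.

```python
import random

# Hypothetical instantiation of the 12-state example: a 3x4 grid whose
# layout, target states, and forbidden states are illustrative assumptions.
ROWS, COLS = 3, 4
TARGETS, FORBIDDEN = {3, 11}, {4, 7}

def neighbors(s):
    """Grid moves that stay on the board (no wrap-around across rows)."""
    r, c = divmod(s, COLS)
    out = []
    if r > 0:        out.append(s - COLS)
    if r < ROWS - 1: out.append(s + COLS)
    if c > 0:        out.append(s - 1)
    if c < COLS - 1: out.append(s + 1)
    return out

def safety_estimate(start, episodes=5000, max_steps=200, rng=random.Random(1)):
    """Monte Carlo estimate of the safety function for the uniform policy:
    the probability of reaching a target before a forbidden state."""
    safe = 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if s in TARGETS or s in FORBIDDEN:
                break
            s = rng.choice(neighbors(s))   # uniformly random policy
        safe += s in TARGETS
    return safe / episodes

# Safety values for every non-terminal state on the grid.
values = {s: safety_estimate(s) for s in range(ROWS * COLS)
          if s not in TARGETS | FORBIDDEN}
```

States adjacent to a forbidden cell (e.g. state 0, which sits directly above forbidden state 4 in this layout) receive noticeably lower safety values than states adjacent to a target, which is exactly the structure a learned safety function is meant to expose.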
Our numerical results demonstrate that the algorithm effectively terminates episodes in a p-safe MDP while maintaining the safety of the system throughout learning.
In summary, this article presents an algorithm for terminating episodes in a p-safe Markov decision process. The approach uses a proxy set of states to identify safe and unsafe states and applies off-policy temporal-difference learning with importance sampling to learn the safety function of a given policy, and its effectiveness is demonstrated through numerical examples.