Safe Reinforcement Learning for Single Train Trajectory Optimization via Shield SARSA

Abstract

Single train trajectory optimization, also known as speed profile optimization (SPO), is a classical problem that aims to minimize the traction energy consumption of trains. Reinforcement learning (RL) has been used as an optimization method to solve the SPO problem. In the learning process of a common RL algorithm, a soft constraint (a punishment term in the reward) is typically used to keep the agent away from unsafe states. However, a soft constraint can neither guarantee nor explain the safety of the result. For the SPO problem, this means that the optimized speed profile obtained by a plain RL algorithm may exceed the speed limit, which is unacceptable in practice. This paper proposes a protection mechanism called Shield and constructs a Shield SARSA ($S$-SARSA) algorithm to protect the learning process of the high-speed train. Four different reward functions are used to compare the protective efficacy of the proposed algorithm with that of the soft constraint. Numerical experiments based on line data from Wuxi East to Suzhou North verify the protective efficacy and effectiveness of the proposed algorithm.
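To illustrate the difference between a punishment term and a hard protection mechanism, the following is a minimal sketch of SARSA with an action shield on a toy track. It is not the paper's model: the segment speed limits, the three-action set, the integer speed levels, the reward terms, and the backward "safe envelope" construction are all assumptions made for illustration. The shield removes any action whose successor speed would leave the envelope before the agent chooses, so the speed limit is never exceeded during learning, regardless of the reward function.

```python
import random

# All values below are hypothetical; they are not taken from the paper.
LIMITS = [3, 3, 4, 4, 2, 2, 4, 4, 3, 1]  # speed limit per track segment
ACTIONS = (-1, 0, 1)                      # brake, coast, traction (speed change)
V_MIN, V_MAX = 1, 4                       # integer speed levels


def safe_speed_envelope(limits):
    """Backward pass: highest speed per segment from which braking
    by one level per segment still satisfies every later limit."""
    env = list(limits)
    for i in range(len(env) - 2, -1, -1):
        env[i] = min(env[i], env[i + 1] + 1)
    return env


ENVELOPE = safe_speed_envelope(LIMITS)


def next_speed(v, a):
    """Apply the speed change and clip to the allowed speed levels."""
    return max(V_MIN, min(V_MAX, v + a))


def shield(pos, v):
    """Keep only actions whose successor speed stays inside the envelope."""
    return [a for a in ACTIONS if next_speed(v, a) <= ENVELOPE[pos + 1]]


def choose(q, pos, v, eps, rng):
    """Epsilon-greedy selection restricted to the shielded action set."""
    allowed = shield(pos, v)
    if rng.random() < eps:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q.get((pos, v, a), 0.0))


def reward(v_next, a):
    """Toy reward: traction energy plus a running-time penalty, no punishment term."""
    energy = v_next if a == 1 else 0.0
    running_time = 1.0 / v_next
    return -(energy + 2.0 * running_time)


def train(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """On-policy SARSA; returns the Q-table and the number of limit violations."""
    rng = random.Random(seed)
    q, violations = {}, 0
    last = len(LIMITS) - 1
    for _ in range(episodes):
        pos, v = 0, 1
        a = choose(q, pos, v, eps, rng)
        while pos < last:
            v2, pos2 = next_speed(v, a), pos + 1
            violations += int(v2 > LIMITS[pos2])
            r = reward(v2, a)
            if pos2 < last:
                a2 = choose(q, pos2, v2, eps, rng)
                target = r + gamma * q.get((pos2, v2, a2), 0.0)
            else:
                a2, target = None, r  # terminal segment: no bootstrap
            key = (pos, v, a)
            q[key] = q.get(key, 0.0) + alpha * (target - q.get(key, 0.0))
            pos, v, a = pos2, v2, a2
    return q, violations
```

Calling `q, violations = train()` should leave `violations` at zero by construction, because every explored action has already passed the shield. A soft-constraint variant would instead allow all three actions and subtract a penalty in `reward` when the limit is exceeded, which discourages violations but does not rule them out during exploration.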
