IEEE Transactions on Automatic Control, Vol. 63, No. 9, pp. 2787-2802, 2018
Infinite Time Horizon Maximum Causal Entropy Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) attempts to use demonstrations of "expert" decision making in a Markov decision process to infer a policy that shares the "structured, purposeful" qualities of the expert's actions. In this paper, we extend the maximum causal entropy framework, a notable paradigm in IRL, to the infinite time horizon setting. We consider two formulations appropriate for the infinite horizon case, maximum discounted causal entropy and maximum average causal entropy, and show that both result in optimization programs that can be reformulated as convex optimization problems and therefore admit efficient computation. We then develop a gradient-based algorithm for the maximum discounted causal entropy formulation that has the desirable property of being model agnostic, a property absent in many previous IRL algorithms. We propose the stationary soft Bellman policy, a key building block of the gradient-based algorithm, and study it in depth; this analysis yields theoretical insight into its analytical properties and motivates a broad toolkit of methods for implementing the algorithm. Finally, we select three algorithms of this type and apply them to two problem instances involving demonstration data from a simple controlled queuing network model inspired by problems in air traffic management.
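
As a rough illustration of the stationary soft Bellman policy mentioned above, the following is a minimal sketch (not the authors' code) of soft value iteration for a tabular MDP with known transition probabilities; the function and variable names (soft_bellman_policy, P, r, gamma) are hypothetical, and the known-model setting is used only to show the policy's form, not the model-agnostic algorithm developed in the paper.

import numpy as np

def soft_bellman_policy(P, r, gamma, n_iters=2000, tol=1e-10):
    # P: (S, A, S) transition probabilities, r: (S, A) rewards,
    # gamma: discount factor in (0, 1).
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Soft Bellman backup: Q(s, a) = r(s, a) + gamma * E[V(s') | s, a]
        Q = r + gamma * (P @ V)                     # shape (S, A)
        # Log-sum-exp replaces the hard max of ordinary value iteration
        m = Q.max(axis=1)
        V_new = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    Q = r + gamma * (P @ V)
    # Stationary stochastic policy: pi(a | s) = exp(Q(s, a) - V(s))
    pi = np.exp(Q - V[:, None])
    return pi / pi.sum(axis=1, keepdims=True)

In this sketch the (learned) reward enters only through r, and the resulting policy chooses actions with probability proportional to exp(Q); gradient-based, model-agnostic variants of this idea would instead estimate such a policy from sampled transitions rather than from an explicit P.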