Automatica, Vol. 40, No. 10, pp. 1749–1759, 2004
A unified approach to Markov decision problems and performance sensitivity analysis with discounted and average criteria: multichain cases
We propose a unified framework for Markov decision problems and performance sensitivity analysis for multichain Markov processes under both discounted- and average-reward performance criteria. Building on the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play a central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established from the performance-difference formulas in a simple and intuitive way, and the performance-gradient formulas, combined with stochastic approximation, may lead to new optimization schemes. This sensitivity-based view of performance optimization offers insights that link perturbation analysis, Markov decision processes, and reinforcement learning. The research extends previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
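For reference, the ergodic (single-chain) case treated in the cited earlier work admits compact versions of both formulas; the following is an illustrative sketch of what the paper generalizes to multichain processes, written in a standard potential-based notation ($\pi$, $P$, $f$, $g$, $\eta$) of our own labeling. For two policies $(P, f)$ and $(P', f')$ with steady-state distributions $\pi$, $\pi'$ and average rewards $\eta = \pi f$, $\eta' = \pi' f'$, let the potential vector $g$ solve the Poisson equation $(I - P)g + \eta e = f$. Then the performance-difference formula reads

\[ \eta' - \eta = \pi'\left[(f' - f) + (P' - P)g\right], \]

and along the perturbed path $P_\delta = P + \delta(P' - P)$, $f_\delta = f + \delta(f' - f)$, the performance-gradient formula reads

\[ \left.\frac{d\eta_\delta}{d\delta}\right|_{\delta=0} = \pi\left[(f' - f) + (P' - P)g\right]. \]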
Keywords: policy iteration; potentials; perturbation analysis; performance sensitivity; reinforcement learning
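The policy iteration scheme implied by the difference formula can be sketched in a few lines: since the weighting distribution in the difference formula is strictly positive for an ergodic chain, any componentwise improvement of f(i,a) + sum_j P(j|i,a) g(j) raises the average reward. The code below is a minimal illustration for the ergodic average-reward case only (the multichain case treated in the paper requires a more elaborate construction); the function names, data layout, and tie-breaking rule are our assumptions, not the paper's.

import numpy as np

def solve_potentials(P, f):
    # Potentials for an ergodic chain: solve (I - P + e*pi) g = f,
    # which yields (I - P) g + eta*e = f with the normalization pi.g = eta.
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # pi P = pi, pi e = 1
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]      # steady-state distribution
    eta = pi @ f                                   # average reward
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f)
    return g, eta

def policy_iteration(P_act, f_act):
    # P_act[i][a]: transition row for action a in state i; f_act[i][a]: reward.
    n = len(P_act)
    policy = [0] * n
    while True:
        P = np.array([P_act[i][policy[i]] for i in range(n)])
        f = np.array([f_act[i][policy[i]] for i in range(n)])
        g, eta = solve_potentials(P, f)
        # Improvement step: maximize f(i,a) + P(.|i,a).g state by state;
        # by the difference formula this cannot decrease the average reward.
        new = []
        for i in range(n):
            vals = [f_act[i][a] + np.dot(P_act[i][a], g)
                    for a in range(len(P_act[i]))]
            best = int(np.argmax(vals))
            if vals[policy[i]] >= vals[best] - 1e-12:
                best = policy[i]                   # keep current action on ties
            new.append(best)
        if new == policy:
            return policy, eta
        policy = new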