IEEE Transactions on Automatic Control, Vol. 60, No. 7, pp. 1841-1856, 2015
Optimization of Average Rewards of Time Nonhomogeneous Markov Chains
We study the optimization of average rewards of discrete-time nonhomogeneous Markov chains, in which the state spaces, transition probabilities, and reward functions depend on time. The analysis encounters a few major difficulties: 1) Notions crucial to homogeneous Markov chains, such as ergodicity, stationarity, periodicity, and connectivity, no longer apply; 2) The average reward criterion is under-selective, i.e., it does not depend on the decisions in any finite period, and thus the problem is not amenable to dynamic programming; and 3) Because of the under-selectivity, an optimal average-reward policy may not be the best in any finite period. These issues are resolved as follows: 1) We discover that a new notion, called "confluencity", is the basis for the optimization of average rewards of Markov chains. Confluencity refers to the property that two independent sample paths of a Markov chain starting from any two different initial states will eventually meet; 2) We apply the direct-comparison based approach [3] to the average reward optimization and obtain necessary and sufficient conditions for optimal policies; and 3) We study bias optimality, with the bias measuring the transient reward; we show that for the transient reward to be optimal, one additional condition based on bias potentials is required.
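To make the under-selectivity concrete, consider the standard long-run average reward criterion written in common notation; the symbols below (the average reward \(\eta^{\pi}\), time-dependent reward function \(r_t\), state \(X_t\), and action \(A_t\) under a policy \(\pi\)) are generic and may differ from the paper's own notation:
\[
\eta^{\pi} \;=\; \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} r_t\bigl(X_t, A_t\bigr)\right].
\]
Since any finite number of summands contributes nothing to this Cesàro-type limit, two policies that differ only over a finite horizon attain the same value of \(\eta^{\pi}\); this is the under-selectivity referred to above.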
Keywords: Bias optimality; bias potential; confluencity; direct-comparison based optimization; HJB equation; performance potential; weak ergodicity; weak recurrence