
Annals of Operations Research 29 (1991) 429-438

A REMARK ON CONTROL OF PARTIALLY OBSERVED MARKOV CHAINS

Vivek S. BORKAR

Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India

A new state variable is introduced for the problem of controlling a Markov chain under partial observations, which, under a suitably altered probability measure, has a simple evolution.

Keywords: Controlled Markov chains, unnormalized law, dynamic programming, change of measure, partial observations.

1. Introduction

The aim of this note is to introduce a new state variable for the control of partially observed Markov chains, taking values in the space of finite positive measures on the state space. On normalization, it yields the usual state variable, viz. the conditional law of the state given past observations. Hence we call it the unnormalized conditional law. The idea of using an unnormalized conditional law as a state variable is not new. See for example lemma 6.5, p. 83, chapter 6 of [2], where such a variable is used to achieve linear dynamics (as opposed to the nonlinear evolution of the conditional law). Our definition of the unnormalized conditional law differs from the one of [2]. A more significant difference in our approach is that we consider its evolution under a new probability measure under which the observation process has much simpler statistics. This, along with a simple evolution for the state process, makes the new state variable an attractive alternative. For one of the control problems we shall consider, viz. the problem of control up to an exit time from a bounded set, there is the additional advantage of a considerably reduced dimensionality of the state vector. (The definition of the unnormalized law has to be altered a little to suit the needs of this problem.) All these remarks will become clearer as we proceed.

The paper is organized as follows: The next section introduces the formalism of the problem. Section 3 introduces the unnormalized conditional law and the new probability measure. Section 4 restates the dynamic programming equations for two specific control problems in terms of the new state variable.

© J.C. Baltzer A.G. Scientific Publishing Company


2. Control under partial observations

Let S = {1, 2, ...}, the state space, D a compact metric control space and H = {1, 2, ...} a finite or countably infinite observation space. We shall consider an S-valued controlled Markov chain X_n, n ≥ 0, controlled by a D-valued control process Z_n, n ≥ 0, and an associated H-valued observation process Y_n, n ≥ 0. The exact mechanism of their evolution is described next. Let p: S × D × S × H → [0, 1] be a family of continuous maps satisfying

$$\sum_{j,k} p(i, u, j, k) = 1, \qquad i \in S,\; u \in D.$$

Let $\mathcal{F}_n = \sigma(X_m, Z_m, Y_m,\ m \le n)$ and $\mathcal{G}_n = \sigma(Z_m, Y_m,\ m \le n)$, for $n \ge 0$, with $X_0$, $Y_0$ being prescribed random variables. The evolution of $X_n$, $Y_n$, $n \ge 0$, is described by

$$P(X_{n+1} = j,\ Y_{n+1} = k \mid \mathcal{F}_n) = p(X_n, Z_n, j, k), \qquad n \ge 0,\; j \in S,\; k \in H. \tag{1.1}$$

The problem of control under partial observations is to choose {Z_n} to minimize some prescribed cost functional under the constraint: Z_n and {X_m, m ≤ n, Z_m, m < n} are conditionally independent given {Y_m, m ≤ n}, for n ≥ 0.
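To fix ideas, here is a minimal simulation sketch of the mechanism (1.1) for a small finite example. It is only an illustration: the sizes of S, D, H, the randomly generated kernel `p` and the feedback rule are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite example: S = {0, 1}, D = {0, 1}, H = {0, 1}.
# p[i, u, j, k] plays the role of P(X_{n+1}=j, Y_{n+1}=k | X_n=i, Z_n=u);
# for each (i, u) the entries sum to 1 over (j, k), as required above.
p = rng.random((2, 2, 2, 2))
p /= p.sum(axis=(2, 3), keepdims=True)

def step(i, u):
    """Sample (X_{n+1}, Y_{n+1}) jointly from p(i, u, ., .), as in (1.1)."""
    flat = p[i, u].ravel()
    idx = rng.choice(flat.size, p=flat)
    return divmod(idx, p.shape[3])           # (next state j, observation k)

x, y = 0, 0
for n in range(5):
    z = y                                     # an arbitrary admissible feedback on past observations
    x, y = step(x, z)
    print(n, x, y, z)
```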

Remark

Our admissible class of {Z_n} is a little more general than the traditional requirement that {Z_n} be adapted to the σ-fields generated by {Y_m, m ≤ n}. This has the advantage that it allows for the possibility of randomization by the controller.

Let β ∈ (0, 1), A ⊂ S finite, T = min{n ≥ 0 | X_n ∉ A} (= ∞ if X_n ∈ A for all n), ∂A = {j ∈ S\A | Σ_{i∈A, k} p(i, u, j, k) > 0 for some u ∈ D}, and let c: S × D → R₊, l: A × D → R₊ and h: ∂A → R₊ be bounded continuous maps. The two control problems we consider here are:

(C1) Infinite horizon discounted cost: Minimize over all admissible {Z_n}

$$E\left[\sum_{m=0}^{\infty} \beta^m c(X_m, Z_m)\right].$$

(C2) Control up to a first exit time: Minimize over all admissible {Z_n}

$$E\left[\sum_{m=0}^{T-1} l(X_m, Z_m) + h(X_T)\, I\{T < \infty\}\right]$$

for X_0 ∈ A. We introduce next some notation for later use. Let P(k, u), k ∈ H, u ∈ D,

denote the matrix $[[p(i, u, j, k)]]_{i,j \in S}$ and P'(k, u) its submatrix obtained by taking the rows and columns corresponding to the elements of A, correctly ordered. Let 1_c be the column vector [1, 1, ...]. For a Polish space X, 𝒫(X) and M(X) will denote respectively the space of probability measures and the space of finite nonnegative measures on X, each with the coarsest topology that renders continuous the maps μ → ∫ f dμ for f ∈ C_b(X). We shall often write μ ∈ 𝒫(X) or M(X), X = {1, 2, ...}, as a row vector [μ(1), μ(2), ...], μ(i) = μ({i}).

The 𝒫(S)-valued process {π_n} of conditional laws of X_n given 𝒢_n, n ≥ 0, is defined a.s. uniquely by

$$\int f\, d\pi_n = E\big[f(X_n) \mid \mathcal{G}_n\big], \qquad n \ge 0,$$

for f belonging to a countable subset of C_b(S) which separates points of 𝒫(S). Also define an M(A)-valued process {π'_n} by

$$\int f\, d\pi'_n = E\big[f(X_n)\, I\{T > n\} \mid \mathcal{G}_n\big], \qquad n \ge 0.$$

A standard Bayes' rule argument shows that {π_n} evolves according to

$$\pi_{n+1} = R\big(\pi_n P(Y_{n+1}, Z_n)\big), \qquad n \ge 0, \tag{2.1}$$

where R: ℓ¹ → ℓ¹ is the map that maps x = [x₁, x₂, ...] into [x₁/‖x‖₁, x₂/‖x‖₁, x₃/‖x‖₁, ...] when ‖x‖₁ = Σᵢ |xᵢ| > 0, and into [0, 0, ...] otherwise. Also,

$$P(Y_{n+1} = k \mid \mathcal{G}_n) = \pi_n P(k, Z_n)\,\mathbf{1}_c. \tag{2.2}$$

See [2], chapter 6, for details. (2.1), (2.2) together imply that {π_n} is a 𝒫(S)-valued Markov chain controlled by {Z_n}. The cost (C1) equals (in view of our condition on {Z_n})

$$E\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\pi_m\right]. \tag{2.3}$$

The original control problem with cost (C1) is then equivalent to the problem of controlling {π_n} governed by (2.1), (2.2), with cost (2.3).
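For a finite state space, the recursion (2.1) and the conditional law (2.2) can be written in matrix form. The sketch below is ours, not the paper's: `p[i, u, j, k]` is the kernel of this section stored as an array, `pi` is a row vector over S, and the function names are illustrative.

```python
import numpy as np

def P_mat(p, k, u):
    """The matrix P(k, u) = [[ p(i, u, j, k) ]]_{i,j in S}."""
    return p[:, u, :, k]

def filter_step(pi, p, y_next, u):
    """(2.1): pi_{n+1} = R(pi_n P(Y_{n+1}, Z_n)), with R the normalization map."""
    v = pi @ P_mat(p, y_next, u)
    s = v.sum()
    return v / s if s > 0 else np.zeros_like(v)

def obs_law(pi, p, u):
    """(2.2): P(Y_{n+1} = k | G_n) = pi_n P(k, Z_n) 1_c, for each k in H."""
    return np.array([(pi @ P_mat(p, k, u)).sum() for k in range(p.shape[3])])
```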

Simple Bayes' rule arguments analogously lead to the following evolution for {π'_n}:

$$\pi'_{n+1} = \big(\pi_n P(Y_{n+1}, Z_n)\mathbf{1}_c\big)^{-1}\, \pi'_n P'(Y_{n+1}, Z_n), \qquad n \ge 0. \tag{2.4}$$

Here P'(Y_{n+1}, Z_n) is the submatrix of P(Y_{n+1}, Z_n) obtained by taking rows and columns corresponding to A, appropriately ordered. The cost (C2) is equivalent to

$$E\left[\sum_{m=0}^{\infty} \int \bar{k}(\cdot, Z_m)\, d\pi'_m\right], \tag{2.5}$$

where

$$\bar{k}(i, u) = l(i, u) + \sum_{j,k} p(i, u, j, k)\, h(j) - h(i), \qquad i \in S,\; u \in D,$$


for h, l arbitrarily extended to S, S × D respectively. This follows on observing that, for p'(i, j, u) = Σ_k p(i, u, j, k),

$$\sum_{m=1}^{n \wedge T}\Big[h(X_m) - \sum_j p'(X_{m-1}, j, Z_{m-1})\, h(j)\Big], \qquad n \ge 1,$$

is an {ℱ_n}-martingale. Thus by the optional sampling theorem,

$$E\left[\sum_{m=1}^{n \wedge T} h(X_m)\right] = E\left[\sum_{m=0}^{(n \wedge T)-1} \sum_j p'(X_m, j, Z_m)\, h(j)\right].$$

Letting n → ∞ and combining this identity with the definition of k̄, (2.5) is seen to equal (C2) minus the constant E[h(X_0)], which may be set to zero by choosing h to be zero on A.
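For a finite state space, the definition of k̄ above transcribes directly; the array shapes below (`l` of size |S| × |D| after the arbitrary extension, `h` of length |S|) are our illustrative conventions.

```python
import numpy as np

def k_bar(p, l, h):
    """k_bar(i, u) = l(i, u) + sum_{j,k} p(i, u, j, k) h(j) - h(i),
    with l and h already extended to all of S (and h chosen zero on A, as in the text)."""
    return l + np.einsum('iujk,j->iu', p, h) - h[:, None]
```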

The evolution (2.4), however, indicates that {π'_n} is not a controlled Markov chain. But, in view of (2.1), (2.2), π̃_n = (π'_n, π_n), n ≥ 0, is an M(A) × 𝒫(S)-valued controlled Markov chain. Thus our second control problem is equivalent to controlling {π̃_n} with cost (2.5).

The evolutions of {π_n}, {π̃_n} are not simple, being highly nonlinear and requiring the additional computation (2.2) for the conditional law of the driving process {Y_n}. This motivates the definition, in the next section, of a new state variable.

3. The unnormalized conditional law

For any nonempty B ⊂ H, define π_B ∈ 𝒫(B) as follows: If B is finite with n distinct elements, π_B is the uniform distribution on B, with π_B(i) = n⁻¹, i ∈ B. If B is infinite, write it as B = {i₁, i₂, ...} with i_n < i_{n+1} for all n, and set π_B(i_k) = 2⁻ᵏ for k ≥ 1. (As will become clear later, these choices of π_B are not unique; one only needs to ensure that support(π_B) = B.)
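In code, for a finite B this choice is just the uniform law (a trivial sketch; any law whose support is B would serve equally well):

```python
def pi_B(B):
    """The reference law pi_B on a nonempty finite observation set B: uniform on B."""
    B = sorted(B)
    return {i: 1.0 / len(B) for i in B}
```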

For n ≥ 1, let B(n) = {i ∈ H | P(Y_n = i | 𝒢_{n-1}) > 0} ⊂ H where, by convention, 𝒢_{-1} is the trivial σ-field. Define an M(S)-valued process {ν_n}, called the process of unnormalized conditional laws, as follows: ν₀ = π₀ and ν_n, n ≥ 1, is given recursively by

$$\nu_{n+1} = \nu_n \tilde{P}(Y_{n+1}, Z_n), \qquad n \ge 0, \tag{3.1}$$

where

$$\tilde{P}(Y_{n+1}, Z_n) = \big(\pi_{B(n+1)}(Y_{n+1})\big)^{-1} P(Y_{n+1}, Z_n), \qquad n \ge 0.$$

Then π_n = R(ν_n), n ≥ 0, justifying the term unnormalized conditional law. Let Λ_n = (ν_n 1_c)⁻¹ for n ≥ 0. Then Λ₀ = 1 and {Λ_n} is a {𝒢_n}-adapted positive process.
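Continuing the finite-state sketch of section 2 (and reusing `P_mat` and `obs_law` from it), one step of (3.1) with the uniform choice of π_B looks roughly as follows; the names are illustrative only.

```python
def B_next(pi, p, u):
    """B(n+1): the observations with positive conditional probability given G_n."""
    probs = obs_law(pi, p, u)
    return [k for k in range(len(probs)) if probs[k] > 0]

def unnormalized_step(nu, p, y_next, u):
    """(3.1): nu_{n+1} = pi_{B(n+1)}(Y_{n+1})^{-1} nu_n P(Y_{n+1}, Z_n),
    with pi_{B(n+1)} uniform on the finite set B(n+1)."""
    B = B_next(nu / nu.sum(), p, u)              # nu_n has the same support as pi_n
    return (nu @ P_mat(p, y_next, u)) * len(B)   # dividing by pi_B(y) = 1/|B|

# Lambda_n = 1.0 / nu.sum(), and pi_n is recovered as R(nu_n) = nu / nu.sum().
```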


LEMMA 3.1

(Λ_n, 𝒢_n), n ≥ 0, is a martingale.

Proof

$$\begin{aligned}
E[\Lambda_{n+1} \mid \mathcal{G}_n] &= E\big[\pi_{B(n+1)}(Y_{n+1})\,\big(\nu_n P(Y_{n+1}, Z_n)\mathbf{1}_c\big)^{-1} \mid \mathcal{G}_n\big]\\
&= \sum_{i,j,k} \pi_n(i)\, p(i, Z_n, j, k)\, \pi_{B(n+1)}(k)\, \big(\nu_n P(k, Z_n)\mathbf{1}_c\big)^{-1}\\
&= \sum_{i,j,k} \nu_n(i)\,(\nu_n \mathbf{1}_c)^{-1}\, p(i, Z_n, j, k)\, \pi_{B(n+1)}(k)\, \big(\nu_n P(k, Z_n)\mathbf{1}_c\big)^{-1}\\
&= \sum_{k} (\nu_n \mathbf{1}_c)^{-1}\, \pi_{B(n+1)}(k)\, \big(\nu_n P(k, Z_n)\mathbf{1}_c\big)\big(\nu_n P(k, Z_n)\mathbf{1}_c\big)^{-1}\\
&= (\nu_n \mathbf{1}_c)^{-1} \sum_{k} \pi_{B(n+1)}(k)\\
&= (\nu_n \mathbf{1}_c)^{-1} = \Lambda_n. \qquad \square
\end{aligned}$$

Remark

The above would remain true even if ν₀ were not in 𝒫(S). The only difference would be that Λ₀, and therefore E[Λ_n], n ≥ 0, need no longer be 1.

Let P denote the underlying probability measure. The above lemma allows us to define a new probability measure P̃ on ⋁_n 𝒢_n by

$$\frac{d\tilde{P}_n}{dP_n} = \Lambda_n, \qquad n \ge 0,$$

where P̃_n, P_n are the restrictions of P̃, P respectively to 𝒢_n, n ≥ 0. By the above lemma, this defines P̃ in a consistent manner. Let Ẽ[·] denote the expectation under P̃.

LEMMA 3.2
For any f ∈ C_b(S × D),

$$E\left[\int f(\cdot, Z_n)\, d\pi_n\right] = \tilde{E}\left[\int f(\cdot, Z_n)\, d\nu_n\right], \qquad n \ge 0.$$

Proof

Both sides are seen to equal

$$E\left[(\nu_n \mathbf{1}_c)^{-1} \int f(\cdot, Z_n)\, d\nu_n\right]. \qquad \square$$


LEMMA 3.3
Under P̃, the conditional law of Y_{n+1} given 𝒢_n is π_{B(n+1)}, for each n ≥ 0.

Proof

For k ∈ B(n + 1),

$$\begin{aligned}
\tilde{P}(Y_{n+1} = k \mid \mathcal{G}_n) &= E\big[I\{Y_{n+1} = k\}\,\Lambda_{n+1} \mid \mathcal{G}_n\big]\,/\,E\big[\Lambda_{n+1} \mid \mathcal{G}_n\big]\\
&= (\nu_n \mathbf{1}_c)\, E\big[I\{Y_{n+1} = k\}\big(\nu_n \tilde{P}(Y_{n+1}, Z_n)\mathbf{1}_c\big)^{-1} \mid \mathcal{G}_n\big]\\
&= (\nu_n \mathbf{1}_c)\big(\nu_n \tilde{P}(k, Z_n)\mathbf{1}_c\big)^{-1} P(Y_{n+1} = k \mid \mathcal{G}_n)\\
&= (\nu_n \mathbf{1}_c)\big(\nu_n \tilde{P}(k, Z_n)\mathbf{1}_c\big)^{-1} \sum_{i,j} \pi_n(i)\, p(i, Z_n, j, k)\\
&= \big(\nu_n \tilde{P}(k, Z_n)\mathbf{1}_c\big)^{-1} \sum_{i,j} \nu_n(i)\, p(i, Z_n, j, k)\\
&= \big(\nu_n \tilde{P}(k, Z_n)\mathbf{1}_c\big)^{-1}\big(\nu_n P(k, Z_n)\mathbf{1}_c\big) = \pi_{B(n+1)}(k). \qquad \square
\end{aligned}$$
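As a sanity check on this change of measure, the conditional law of Y_{n+1} under P̃ can be computed exactly in the finite sketch above (reusing `P_mat`, `obs_law` and `B_next` from the earlier snippets; all names are ours).

```python
def check_lemma_3_3(nu, p, u):
    """Exact (no sampling) computation of the law of Y_{n+1} given G_n under the
    new measure; by lemma 3.3 it should come out uniform on B(n+1)."""
    pi = nu / nu.sum()
    B = B_next(pi, p, u)
    w = 1.0 / len(B)                                  # pi_{B(n+1)}(k) for k in B(n+1)
    probs = obs_law(pi, p, u)                         # P(Y_{n+1} = k | G_n) under P
    num = {}
    for k in B:
        lam_next = w / (nu @ P_mat(p, k, u)).sum()    # Lambda_{n+1} on {Y_{n+1} = k}
        num[k] = probs[k] * lam_next                  # E[1{Y_{n+1}=k} Lambda_{n+1} | G_n]
    total = sum(num.values())                         # = E[Lambda_{n+1} | G_n] = Lambda_n
    return {k: num[k] / total for k in B}             # each value should equal w
```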

In view of lemma 3.2, (C1) equals

$$\tilde{E}\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\nu_m\right]. \tag{3.2}$$

Thus we may consider {ν_n} as an M(S)-valued controlled Markov chain with cost (3.2) under the probability measure P̃. The evolution of {ν_n} is still nonlinear because of the presence of the factor π_{B(n+1)}(Y_{n+1})⁻¹ on the right hand side of (3.1). Nevertheless, it is often simpler than the evolution of {π_n} because B(n + 1) is usually easy to find, often by inspection. In fact, if support(p(i, u, j, ·)) is independent of i, j and u, then B(n) is independent of n and (3.1) is linear. Another interesting case occurs when Y_n = f(X_n), n ≥ 0, for some f: S → H; in this case B(n) is the image of support(π_n) = support(ν_n) under f.

It is also instructive to compare this with the continuous observation space case, where the observation process is often given by

$$Y_n = f(X_n) + v_n, \qquad n \ge 0,$$

where f: S → H is measurable and {v_n} is a continuous-valued i.i.d. noise sequence independent of {X_n}. Under suitable conditions, one can introduce a new probability measure, absolutely continuous w.r.t. the old one over finite time intervals, such that {Y_n} becomes independent of {X_n} with the same law as {v_n} under the old measure [1]. This scheme does not work in general in the case of a discrete observation space, as the following trivial example indicates: Let S = {1} and thus X_n = 1 for all n. Also, let the i.i.d. sequence {v_n} be deterministic with


v_n = 1 for all n. Let f = identity. Then Y_n = 1 + 1 = 2 for all n and, for each n, the laws of Y_n and v_n, which are the Dirac measures at 2 and 1 respectively, are mutually singular. Clearly, the above strategy will not work here.

Now consider the case of (C2). By analogy with {ν_n}, we define an M(A)-valued process {ν'_n} by ν'₀ = π'₀ and

$$\nu'_{n+1} = \nu'_n \tilde{P}'(Y_{n+1}, Z_n), \qquad n \ge 0, \tag{3.3}$$

where

$$\tilde{P}'(Y_{n+1}, Z_n) = \big(\pi_{B(n+1)}(Y_{n+1})\big)^{-1} P'(Y_{n+1}, Z_n), \qquad n \ge 0.$$

Then π'_n = Λ_n ν'_n for n ≥ 0, and thus (2.5) equals

$$\tilde{E}\left[\sum_{m=0}^{\infty} \int \bar{k}(\cdot, Z_m)\, d\nu'_m\right]. \tag{3.4}$$

Thus we may consider {ν'_n} as an M(A)-valued controlled Markov chain with cost (3.4), with P̃ as the underlying probability measure. Note that the dimension of ν'_n is less than half that of π̃_n.

4. The dynamic programming equations

Consider (C1). Define V: 𝒫(S) → R₊ by

$$V(\mu) = \inf E\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\pi_m \,\Big/\, \pi_0 = \mu\right],$$

where the infimum is over all admissible control sequences. V is called the value function. The following result is classical (see, e.g., [3], chapter 39, pp. 232-233).

THEOREM 4.1

V satisfies the dynamic programming equations

$$V(\mu) = \inf_{u \in D}\left(\int c(\cdot, u)\, d\mu + \beta \int p(\mu, u, d\nu)\, V(\nu)\right), \tag{4.1}$$

where p(μ, u, dν) is the transition probability kernel of the controlled Markov chain {π_n}. Furthermore, the control policy Z_n = φ(π_n), n ≥ 0, for a measurable φ: 𝒫(S) → D is optimal for any initial condition if and only if φ(μ) attains the infimum in (4.1) for each μ.

Remark

Using a standard selection argument, one can establish the existence of at least one such φ.

We shall now convert the above known results into equivalent results for our new state process {ν_n}. For this purpose, extend the definition of V to M(S) by


setting V(μ) = μ(S) V(μ(S)⁻¹ μ) for μ ∈ M(S) with μ(S) > 0, and V(μ) = 0 when μ(S) = 0. Write

$$\mu_1 = \begin{cases} \mu(S)^{-1}\mu & \text{for } \mu(S) > 0,\\[2pt] \text{an arbitrary element of } \mathcal{P}(S) & \text{for } \mu(S) = 0.\end{cases}$$

Then by (4.1),

$$\begin{aligned}
V(\mu) &= \mu(S)\, V(\mu_1)\\
&= \mu(S) \min E\left[\int c(\cdot, Z_0)\, d\pi_0 + \beta\, E\big[V(\pi_1) \mid \mathcal{G}_0\big] \,\Big/\, \pi_0 = \mu_1\right]\\
&= \mu(S) \min E\left[\int c(\cdot, Z_0)\, d\pi_0 + \beta\, V(\pi_1) \,\Big/\, \pi_0 = \mu_1\right]\\
&= \mu(S) \min E\left[\int c(\cdot, Z_0)\, d\nu_0 + \beta\, (\nu_1 \mathbf{1}_c)^{-1} V(\nu_1) \,\Big/\, \nu_0 = \mu_1\right]\\
&= \min \tilde{E}\left[\int c(\cdot, Z_0)\, d\nu_0 + \beta\, V(\nu_1) \,\Big/\, \nu_0 = \mu\right], \qquad (4.2)
\end{aligned}$$

where we have used the fact that the evolution of {ν_n} is linear in its initial condition as long as its support is kept fixed.

Thus we have:

THEOREM 4.2

V satisfies the dynamic programming equation

$$V(\mu) = \min_{u \in D}\left(\int c(\cdot, u)\, d\mu + \beta \int q(\mu, u, d\nu)\, V(\nu)\right), \tag{4.3}$$

where q(μ, u, dν) is the transition probability kernel of the controlled Markov chain {ν_n}. Furthermore, the control policy Z_n = ψ(ν_n), n ≥ 0, for a measurable ψ: M(S) → D is optimal for any initial condition if and only if ψ(μ) attains the minimum in (4.3) for every μ.

The first claim is a restatement of (4.2). The second follows on observing that, in the course of the derivation of (4.2), the u that attains the minimum in (4.3), and hence in (4.2), will also attain the minimum in (4.1) for μ replaced by μ₁, and vice versa. A remark analogous to the one following theorem 4.1 also applies.

A simple computation using the scaling property of V (i.e., V(aμ) = aV(μ) for a > 0) reduces (4.3) to

$$V(\mu) = \min_{u \in D}\left(\int c(\cdot, u)\, d\mu + \beta \sum_{k} V\big(\mu P(k, u)\big)\right).$$

Compare this with (1), p. 234 of [3].
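For a finite state space, one evaluation of the right-hand side above can be sketched as follows; `V_hat` stands for some user-supplied approximation of V (equal to 0 at the zero measure, by the scaling convention), `c` is the running cost stored as an array c[i, u], and `P_mat` is the helper from the section 2 sketch. All names are illustrative.

```python
import numpy as np

def bellman_backup(mu, p, c, beta, V_hat):
    """min over u of ( integral of c(., u) d mu  +  beta * sum_k V(mu P(k, u)) )."""
    best = np.inf
    for u in range(p.shape[1]):
        running = float(mu @ c[:, u])                 # integral of c(., u) against mu
        future = sum(V_hat(mu @ P_mat(p, k, u)) for k in range(p.shape[3]))
        best = min(best, running + beta * future)
    return best
```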


Finally, we also have

$$\begin{aligned}
V(\mu) &= \mu(S)\, V(\mu_1)\\
&= \mu(S) \min E\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\pi_m \,\Big/\, \pi_0 = \mu_1\right]\\
&= \mu(S) \min \tilde{E}\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\nu_m \,\Big/\, \nu_0 = \mu_1\right]\\
&= \min \tilde{E}\left[\sum_{m=0}^{\infty} \beta^m \int c(\cdot, Z_m)\, d\nu_m \,\Big/\, \nu_0 = \mu\right], \qquad (4.4)
\end{aligned}$$

as expected. In particular, we could use the right hand side to give a direct definition of V.

The situation for (C2) is analogous. One defines the value function V: {ν ∈ M(A) | ν(A) ≤ 1} × 𝒫(S) → R₊ by

$$V(\mu) = \inf E\left[\sum_{m=0}^{\infty} \int \bar{k}(\cdot, Z_m)\, d\pi'_m \,\Big/\, \tilde{\pi}_0 = \mu\right],$$

where (π'_m, π_m), m ≥ 0, evolves as described in the preceding section and the infimum is over all control sequences. (We are allowing for slightly more general initial data, but that causes no problems.) The next theorem follows by standard dynamic programming arguments analogous to those used for proving theorem 4.1. We omit the details.

THEOREM 4.3
V satisfies the dynamic programming equation

$$V(\mu) = \inf_{u \in D}\left(\int \bar{k}(\cdot, u)\, d\mu^{1} + \int p'(\mu, u, d\nu)\, V(\nu)\right), \qquad \mu = (\mu^{1}, \mu^{2}) \in M(A) \times \mathcal{P}(S), \tag{4.5}$$

where p'(μ, u, dν) is the transition probability kernel of the controlled Markov chain {π̃_n}. A control policy Z_n = φ(π̃_n), n ≥ 0, is optimal for any initial condition if and only if φ(μ) attains the infimum in (4.5) for each μ. At least one such φ exists.

For μ ∈ M(A), write μ₁ = μ(A)⁻¹μ ∈ 𝒫(A) (= an arbitrary element of 𝒫(A) when μ(A) = 0) and let μ₂ ∈ 𝒫(S) be the element that restricts to μ₁ on A, assigning zero mass to S\A. Define V': M(A) → R₊ by V'(μ) = μ(A) V(μ₁, μ₂). As in the case of (4.4), one can show that

$$V'(\mu) = \inf \tilde{E}\left[\sum_{m=0}^{\infty} \int \bar{k}(\cdot, Z_m)\, d\nu'_m \,\Big/\, \nu'_0 = \mu\right],$$

the infimum being over all controls. One also has:


THEOREM 4.4
V' satisfies the dynamic programming equation

$$V'(\mu) = \min_{u \in D}\left(\int \bar{k}(\cdot, u)\, d\mu + \int q'(\mu, u, d\nu)\, V'(\nu)\right), \tag{4.6}$$

where q'(μ, u, dν) is the transition probability kernel for the controlled Markov chain {ν'_n}. A control policy Z_n = ψ(ν'_n), n ≥ 0, for a measurable map ψ: M(A) → D is optimal for any initial condition if and only if ψ(μ) attains the minimum in (4.6) for each μ. At least one such ψ exists.

This is deduced from theorem 4.3 using arguments similar to those we used to deduce theorem 4.2 from theorem 4.1. As in the case of the discounted cost problem, one can reduce (4.6) to

$$V'(\mu) = \min_{u \in D}\left(\sum_{i} \mu(i)\, \bar{k}(i, u) + \sum_{k} V'\big(\mu P'(k, u)\big)\right).$$
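An analogous sketch for the exit-time problem, with mu a row vector indexed by the finite set A and P'(k, u) the A × A submatrix; `k_bar_arr` is the array version of k̄ from the earlier snippet and `V_hat` an approximation of V' vanishing at the zero measure. Again, all names are illustrative.

```python
import numpy as np

def bellman_backup_exit(mu, p, k_bar_arr, A, V_hat):
    """min over u of ( sum_i mu(i) k_bar(i, u) + sum_k V'(mu P'(k, u)) )."""
    idx = np.array(sorted(A))
    best = np.inf
    for u in range(p.shape[1]):
        running = float(mu @ k_bar_arr[idx, u])
        future = 0.0
        for k in range(p.shape[3]):
            P_sub = p[np.ix_(idx, [u], idx, [k])].reshape(len(idx), len(idx))
            future += V_hat(mu @ P_sub)               # P'(k, u): rows/columns in A
        best = min(best, running + future)
    return best
```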

Finally, note that the finite horizon control problem whose cost functional is

$$E\left[\sum_{m=0}^{N-1} c(X_m, Z_m) + h(X_N)\right]$$

for c ∈ C_b(S × D), h ∈ C_b(S), N ≥ 1, can be handled along the lines of (C2) with {ν_n} as the state process. We do not consider it separately since the details are routine.

References

[1] A. Bensoussan and W. Runggaldier, An approximation method for stochastic control problems with partial observation of the state - a method for constructing ε-optimal controls, Acta Appl. Math. 10 (1987) 145-170.

[2] P.R. Kumar and P.P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control (Prentice-Hall, Englewood Cliffs, NJ, 1986).

[3] P. Whittle, Optimization over Time: Dynamic Programming and Stochastic Control, vol. 2 (Wiley, Chichester, 1983).