
REPORT DOCUMENTATION PAGE (Form Approved OMB No. 0704-0188)

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.

1. AGENCY USE ONLY (Leave Blank)
2. REPORT DATE: August 1993
3. REPORT TYPE AND DATES COVERED: memorandum
4. TITLE AND SUBTITLE: On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
5. FUNDING NUMBERS: NSF ASC-9217041, N00014-90-J-1942, NSF ECS-9216531, IRI-9013991
6. AUTHOR(S): Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 545 Technology Square, Cambridge, Massachusetts 02139
8. PERFORMING ORGANIZATION REPORT NUMBER: AIM 1441, CBCL 84
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Office of Naval Research, Information Systems, Arlington, Virginia 22217
10. SPONSORING/MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
12a. DISTRIBUTION/AVAILABILITY STATEMENT: DISTRIBUTION UNLIMITED. Approved for public release.
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words): Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.
14. SUBJECT TERMS: reinforcement learning, stochastic approximation, convergence, dynamic programming
15. NUMBER OF PAGES: 15
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: UNCLASSIFIED
18. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
19. SECURITY CLASSIFICATION OF ABSTRACT: UNCLASSIFIED
20. LIMITATION OF ABSTRACT: UNCLASSIFIED

NSN 7540-01-280-5500    Standard Form 298 (Rev. 2-89)


MASSACHUSETTS INSTITUTE OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE LABORATORY

and

CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING

DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1441
C.B.C.L. Memo No. 84
August 6, 1993

On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh

Abstract

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.

Copyright © Massachusetts Institute of Technology, 1993

This report describes research done at the Dept. of Brain and Cognitive Sciences, the Center for Biological and Computational Learning, and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for CBCL is provided in part by a grant from the NSF (ASC-9217041). Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Dept. of Defense. The authors were supported by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by grant IRI-9013991 from the National Science Foundation, by grant N00014-90-J-1942 from the Office of Naval Research, and by NSF grant ECS-9216531 to support an Initiative in Intelligent Control at MIT. Michael I. Jordan is an NSF Presidential Young Investigator.


An important component of many real world learning problems is the temporal credit assignment problem: the problem of assigning credit or blame to individual components of a temporally-extended plan of action, based on the success or failure of the plan as a whole. To solve such a problem, the learner must be equipped with the ability to assess the long-term consequences of particular choices of action and must be willing to forego an immediate payoff for the prospect of a longer term gain. Moreover, because most real world problems involving prediction of the future consequences of actions involve substantial uncertainty, the learner must be prepared to make use of a probability calculus for assessing and comparing actions.

There has been increasing interest in the temporal credit assignment problem, due principally to the development of learning algorithms based on the theory of dynamic programming (DP) (Barto, Sutton, & Watkins, 1990; Werbos, 1992). Sutton's (1988) TD(λ) algorithm addressed the problem of learning to predict in a Markov environment, utilizing a temporal difference operator to update the predictions. Watkins' (1989) Q-learning algorithm extended Sutton's work to control problems, and also clarified the ties to dynamic programming.

In the current paper, our concern is with the stochastic convergence of DP-based learning algorithms. Although Watkins (1989) and Watkins and Dayan (1992) proved that Q-learning converges with probability one, and Dayan (1992) observed that TD(0) is a special case of Q-learning and therefore also converges with probability one, these proofs rely on a construction that is particular to Q-learning and fail to reveal the ties of Q-learning to the broad theory of stochastic approximation (e.g., Wasan, 1969). Our goal here is to provide a simpler proof of convergence for Q-learning by making direct use of stochastic approximation theory. We also show that our proof extends to TD(λ) for arbitrary λ. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski (1993) for TD(λ), Peng and Williams (1993) for TD(λ), and Tsitsiklis (1993) for Q-learning. Our results appear to be closest to those of Tsitsiklis (1993).

We begin with a general overview of Markovian decision problems and DP. We introduce the Q-learning algorithm as a stochastic form of DP. We then present a proof of convergence for a general class of stochastic processes of which Q-learning is a special case. We then discuss TD(λ) and show that it is also a special case of our theorem.

Markovian decision problems

A useful mathematical model of temporal credit assignment problems, studied in stochastic control theory (Aoki, 1967) and operations research (Ross, 1970), is the Markovian decision problem. Markovian decision problems are built on the formalism of controlled Markov chains. Let S = {1, 2, ..., N} be a discrete state space and let U(i) be the discrete set of actions available to the learner when the chain is in state i. The probability of making a transition from state i to state j is given by p_ij(u), where u ∈ U(i). The learner defines a policy μ, which is a function from states to actions. Associated with every policy μ is a Markov chain defined by the state transition probabilities p_ij(μ(i)).

There is an instantaneous cost c_i(u) associated with each state i and action u, where c_i(u) is a random variable with expected value c̄_i(u). We also define a value function V^μ(i), which is the expected sum of discounted future costs given that the system begins in state i and follows policy μ:

V^{\mu}(i) = \lim_{N \to \infty} E\Big\{ \sum_{t=0}^{N-1} \gamma^t c_{s_t}(\mu(s_t)) \;\Big|\; s_0 = i \Big\}, \qquad (1)

where s_t ∈ S is the state of the Markov chain at time t. Future costs are discounted by a factor γ^t, where γ ∈ (0, 1).


We wish to find a policy that minimizes the value function:

V^*(i) = \min_{\mu} V^{\mu}(i). \qquad (2)

Such a policy is referred to as an optimal policy and the corresponding value function is referred to as the optimal value function. Note that the optimal value function is unique, but an optimal policy need not be unique.

Markovian decision problems can be solved by dynamic programming (Bertsekas, 1987). The basis of the DP approach is an equation that characterizes the optimal value function. This equation, known as Bellman's equation, characterizes the optimal value of the state in terms of the optimal values of possible successor states:

V^*(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j) \Big\}. \qquad (3)

To motivate Bellman's equation, suppose that the system is in state i at time t and consider how V*(i) should be characterized in terms of possible transitions out of state i. Suppose that action u is selected and the system transitions to state j. The expression c_i(u) + γV*(j) is the cost of making a transition out of state i plus the discounted cost of following an optimal policy thereafter. The minimum of the expected value of this expression, over possible choices of actions, seems a plausible measure of the optimal cost at i and by Bellman's equation is indeed equal to V*(i).

There are a variety of computational techniques available for solving Bellman's equation. The technique that we focus on in the current paper is an iterative algorithm known as value iteration. Value iteration solves for V*(i) by setting up a recurrence relation for which Bellman's equation is a fixed point. Denoting the estimate of V*(i) at the kth iteration as V^(k)(i), we have:

V^{(k+1)}(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j) \Big\} \qquad (4)

This iteration can be shown to converge to V*(i) for arbitrary initial V^(0)(i) (Bertsekas, 1987). The proof is based on showing that the iteration from V^(k)(i) to V^(k+1)(i) is a contraction mapping. That is, it can be shown that:

\max_i |V^{(k+1)}(i) - V^*(i)| \le \gamma \max_i |V^{(k)}(i) - V^*(i)|, \qquad (5)

which implies that V^(k)(i) converges to V*(i) and also places an upper bound on the convergence rate.
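To make the recurrence concrete, the following sketch implements synchronous value iteration for a finite problem. The data layout (per-action transition matrices p and a table c of expected costs) and the stopping tolerance are our own assumptions for illustration, not part of the paper.

import numpy as np

def value_iteration(p, c, gamma, tol=1e-8):
    # p[u] : (n_states, n_states) matrix of transition probabilities p_ij(u)
    # c    : (n_states, n_actions) table of expected immediate costs c_bar_i(u)
    # gamma: discount factor in (0, 1)
    n_states, n_actions = c.shape
    V = np.zeros(n_states)                       # arbitrary initial V^(0)
    while True:
        # one-step lookahead: expected cost plus discounted expected successor value
        Q = np.stack([c[:, u] + gamma * p[u] @ V for u in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)                    # Bellman backup of Equation 4
        if np.max(np.abs(V_new - V)) < tol:      # the contraction in Equation 5 guarantees convergence
            return V_new, Q.argmin(axis=1)       # converged values and a greedy policy
        V = V_new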

Watkins (1989) utilized an alternative notation for expressing Bellman's equation that is particularly convenient for deriving learning algorithms. Define the function Q*(i, u) to be the expression appearing inside the "min" operator of Bellman's equation:

Q^*(i, u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j) \qquad (6)

Using this notation Bellman's equation can be written as follows:

V^*(i) = \min_{u \in U(i)} Q^*(i, u). \qquad (7)

Moreover, value iteration can be expressed in terms of Q functions:

Q^{(k+1)}(i, u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j), \qquad (8)


where V^(k)(i) is defined in terms of Q^(k)(i, u) as follows:

V^{(k)}(i) = \min_{u \in U(i)} Q^{(k)}(i, u). \qquad (9)

The mathematical convenience obtained from using Q's rather than V's derives from the fact that the minimization operator appears inside the expectation in Equation 8, whereas it appears outside the expectation in Equation 4. This fact plays an important role in the convergence proof presented in this paper.

The value iteration algorithm in Equation 4 or Equation 8 can also be executed asynchronously (Bertsekas & Tsitsiklis, 1989). In an asynchronous implementation, the update of the value of a particular state proceeds in parallel with the updates of the values of other states. Bertsekas and Tsitsiklis (1989) show that as long as each state is updated infinitely often and each action is tried an infinite number of times in each state, then the asynchronous algorithm eventually converges to the optimal value function. Moreover, asynchronous execution has the advantage that it is directly applicable to real-time Markovian decision problems (RTDP; Barto, Bradtke, & Singh, 1993). In a real-time setting, the system uses its evolving value function to choose control actions for an actual process and updates the values of the states along the trajectory followed by the process.

Dynamic programming serves as a starting point for deriving a variety of learning algorithms for systems that interact with Markovian environments (Barto, Bradtke, & Singh, 1993; Sutton, 1988; Watkins, 1989). Indeed, real-time dynamic programming is arguably a form of learning algorithm as it stands. Although RTDP requires that the system possess a complete model of the environment (i.e., the probabilities p_ij(u) and the expected costs c̄_i(u) are assumed known), the performance of a system using RTDP improves over time, and its improvement is focused on the states that are actually visited. The system "learns" by transforming knowledge in one format (the model) into another format (the value function).

A more difficult learning problem arises when the probabilistic structure of the environment is unknown. There are two approaches to dealing with this situation (cf. Barto, Bradtke, & Singh, 1993). An indirect approach acquires a model of the environment incrementally, by estimating the costs and the transition probabilities, and then uses this model in an ongoing DP computation. A direct method dispenses with constructing a model and attempts to estimate the optimal value function (or the optimal Q-values) directly. In the remainder of this paper, we focus on direct methods, in particular the Q-learning algorithm of Watkins (1989) and the TD(λ) algorithm of Sutton (1988).

The Q-learning algorithm is a stochastic form of value iteration. Consider Equation 8, which expresses the update of the Q values in terms of the Q values of successor states. To perform a step of value iteration requires knowing the expected costs and the transition probabilities. Although such a step cannot be performed without a model, it is nonetheless possible to estimate the appropriate update. For an arbitrary V function, the quantity Σ_{j∈S} p_ij(u)V(j) can be estimated by the quantity V(j), if successor state j is chosen with probability p_ij(u). But this is assured by simply following the transitions of the actual Markovian environment, which makes a transition from state i to state j with probability p_ij(u). Thus the sample value of V at the successor state is an unbiased estimate of the sum. Moreover, c_i(u) is an unbiased estimate of c̄_i(u). This reasoning leads to the following relaxation algorithm, where we use Q_t(i, u) and V_t(i) to denote the learner's estimates of the Q function and V function at time t, respectively:

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t))\, Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_{s_t}(u_t) + \gamma V_t(s_{t+1}) \big] \qquad (10)


where

V_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1}, u). \qquad (11)

The variables α_t(s, u) are zero except for the state-action pair being updated at time t.
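For illustration, here is a minimal tabular sketch of one step of this relaxation (Equations 10 and 11). The function signature, including how the sampled cost and successor state are supplied, is an assumption of ours rather than an interface defined in the paper.

import numpy as np

def q_learning_step(Q, s, u, cost, s_next, alpha, gamma):
    # Q      : (n_states, n_actions) table holding the current estimates Q_t
    # s, u   : the state visited and the action taken at time t
    # cost   : sampled instantaneous cost c_s(u)
    # s_next : successor state, drawn by the environment with probability p_sj(u)
    # alpha  : step size alpha_t(s, u) for this state-action pair
    V_next = Q[s_next].min()                                            # Equation 11
    Q[s, u] = (1 - alpha) * Q[s, u] + alpha * (cost + gamma * V_next)   # Equation 10
    return Q

Note that only the visited pair (s, u) changes, which corresponds to the step size being zero elsewhere.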

The fact that Q-learning is a stochastic form of value iteration immediately suggests the use of stochastic approximation theory, in particular the classical framework of Robbins and Monro (1951). Robbins-Monro theory treats the stochastic convergence of a sequence of unbiased estimates of a regression function, providing conditions under which the sequence converges to a root of the function. Although the stochastic convergence of Q-learning is not an immediate consequence of Robbins-Monro theory, the theory does provide results that can be adapted to studying the convergence of DP-based learning algorithms. In this paper we utilize a result from Dvoretzky's (1956) formulation of Robbins-Monro theory to prove the convergence of both Q-learning and TD(λ).

Convergence proof for Q-learning

Our proof is based on the observation that the Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation are generally applicable. Due to the lack of a formulation of stochastic approximation for the maximum norm, however, we need to slightly extend the standard results. This is accomplished by the following theorem, the proof of which is given in Appendix A.

Theorem 1 A random iterative process Δ_{n+1}(x) = (1 - α_n(x))Δ_n(x) + β_n(x)F_n(x) converges to zero w.p.1 under the following assumptions:

1) The state space is finite.

2) Σ_n α_n(x) = ∞, Σ_n α_n²(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n²(x) < ∞, and E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} uniformly w.p.1.

3) || E{F_n(x) | P_n} ||_W ≤ γ || Δ_n ||_W, where γ ∈ (0, 1).

4) Var{F_n(x) | P_n} ≤ C(1 + || Δ_n ||_W)², where C is some constant.

Here P_n = {Δ_n, Δ_{n-1}, ..., F_{n-1}, ..., α_{n-1}, ..., β_{n-1}, ...} stands for the past at step n. F_n(x), α_n(x) and β_n(x) are allowed to depend on the past insofar as the above conditions remain valid. The notation || · ||_W refers to some weighted maximum norm.
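As a concrete illustration of condition 2 (our example, not part of the original statement), the classical step-size schedule α_n(x) = β_n(x) = 1/n satisfies the required summability conditions,

\sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,

and the requirement E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} holds trivially because the two sequences coincide.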

In applying the theorem, the Δ_n process will generally represent the difference between a stochastic process of interest and some optimal value (e.g., the optimal value function). The formulation of the theorem therefore requires knowledge to be available about the optimal solution to the learning problem before it can be applied to any algorithm whose convergence is to be verified. In the case of Q-learning the required knowledge is available through the theory of DP and Bellman's equation in particular.

The convergence of the Q-learning algorithm now follows easily by relating the algorithm to the converging stochastic process defined by Theorem 1.¹ In the form of the theorem we have:

¹ We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its generality, however, allows it to be applied to other algorithms as well (see the following section on TD(λ)).


Theorem 2 The Q-learning algorithm given by

Q_{t+1}(s_t, u_t) = (1 - \alpha_t(s_t, u_t))\, Q_t(s_t, u_t) + \alpha_t(s_t, u_t) \big[ c_{s_t}(u_t) + \gamma V_t(s_{t+1}) \big]

converges to the optimal Q*(s, u) values if

1) The state and action spaces are finite.

2) Σ_t α_t(s, u) = ∞ and Σ_t α_t²(s, u) < ∞ uniformly w.p.1.

3) Var{c_s(u)} is bounded.

4) If γ = 1, all policies lead to a cost-free terminal state w.p.1.

Proof. By subtracting Q*(s, u) from both sides of the learning rule and by defining Δ_t(s, u) = Q_t(s, u) - Q*(s, u) together with

F_t(s, u) = c_s(u) + \gamma V_t(s_{t+1}) - Q^*(s, u) \qquad (12)

the Q-learning algorithm can be seen to have the form of the process in Theorem 1 with β_t(s, u) = α_t(s, u). To verify that F_t(s, u) has the required properties we begin by showing that it is a contraction mapping with respect to some maximum norm. This is done by relating F_t to the DP value iteration operator for the same Markov chain. More specifically,

\max_{u} |E\{F_t(i, u)\}| = \gamma \max_{u} \Big| \sum_{j} p_{ij}(u)\, [V_t(j) - V^*(j)] \Big|

\le \gamma \max_{u} \sum_{j} p_{ij}(u) \max_{v} |Q_t(j, v) - Q^*(j, v)|

= \gamma \max_{u} \sum_{j} p_{ij}(u)\, V^{\Delta}_t(j) = T(V^{\Delta}_t)(i),

where V^Δ_t(j) denotes max_v |Q_t(j, v) - Q*(j, v)|

and T is the DP value iteration operator for the case where the costs associated with each state are zero. If γ < 1, the contraction property of T, and thus of F_t, can be seen directly from the above formulas. When the future costs are not discounted (γ = 1) but the chain is absorbing and all policies lead to the terminal state w.p.1, there still exists a weighted maximum norm with respect to which T is a contraction mapping (see, e.g., Bertsekas & Tsitsiklis, 1989).

The variance of F_t(s, u) given the past is within the bounds of Theorem 1 as it depends on Q_t(s, u) at most linearly and the variance of c_s(u) is bounded.

Note that the proof covers both the on-line and batch versions. □

The TD(λ) algorithm

TD(λ) (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in a Markovian environment. Unlike Q-learning, however, TD does not involve decision-making tasks but rather predictions about the future costs of an evolving system. TD(λ) converges to the same predictions as a version of Q-learning in which there is only one action available at each state, but the algorithms are derived from slightly different grounds and their behavioral differences are not well understood.


In this section we introduce the algorithm and its derivation. The proof of convergence is given in the following section.

Let us define V_t(i) to be the current estimate of the expected cost incurred during the evolution of the system starting from state i, and let c_i denote the instantaneous random cost at state i. As in the case of Q-learning we assume that the future costs are discounted at each state by a factor γ. If no discounting takes place (γ = 1) we need to assume that the Markov chain is absorbing, that is, there exists a cost-free terminal state to which the system converges with probability one.

We are concerned with estimating the future costs that the learner has to incur. One way to achieve these predictions is to simply observe n consecutive random costs weighted by the discount factor and to add the best estimate of the costs thereafter. This gives us the estimate

V_t^{(n)}(i_t) = c_{i_t} + \gamma c_{i_{t+1}} + \gamma^2 c_{i_{t+2}} + \cdots + \gamma^{n-1} c_{i_{t+n-1}} + \gamma^n V_t(i_{t+n}) \qquad (13)

The expected value of this can be shown to be a strictly better estimate than the current estimate is (Watkins, 1989). In the undiscounted case this holds only when n is larger than some chain-dependent constant. To demonstrate this, let us replace V_t with V* in the above formula, giving E{V_t^{*(n)}(i_t)} = V*(i_t), which implies

\max_i |E\{V_t^{(n)}(i)\} - V^*(i)| \le \gamma^n \max_i P\{m_i \ge n\} \max_i |V_t(i) - V^*(i)| \qquad (14)

where m_i is the number of steps in a sequence that begins in state i (infinite in the non-absorbing case). This implies that if either γ < 1 or n is large enough so that the chain can terminate before n steps starting from an arbitrary initial state, then the estimate V^{(n)} is strictly better than V_t. In general, the larger n the more unbiased the estimate is, as the effect of incorrect V_t vanishes. However, larger n increases the variance of the estimate as there are more (independent) terms in the sum.

Despite the error reduction property of the truncated estimate it is difficult to calculate in practice, as one would have to wait n steps before the predictions could be updated. In addition it clearly has a huge variance. A remedy to these problems is obtained by constructing a new estimate by averaging over the truncated predictions. TD(λ) is based on taking the geometric average:

V_t^{\lambda}(i) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_t^{(n)}(i) \qquad (15)

As a weighted average it is still a strictly better estimate than V_t(i), with the additional benefit of being better in the undiscounted case as well (as the summation extends to infinity). Furthermore, we have introduced a new parameter λ which affects the trade-off between the bias and variance of the estimate (Watkins, 1989). An increase in λ puts more weight on less biased estimates with higher variances, and thus the bias in V_t^λ decreases at the expense of a higher variance.
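To make Equations 13 and 15 concrete, the sketch below computes the truncated n-step estimates from one observed trajectory and combines them with the geometric weights (1 - λ)λ^{n-1}. The trajectory format and the truncation of the infinite sum at n_max are our own simplifying assumptions (they presume the recorded trajectory extends far enough beyond step t).

def n_step_estimate(costs, states, V, t, n, gamma):
    # Truncated estimate V_t^(n)(i_t) of Equation 13:
    # n observed discounted costs plus the bootstrapped tail gamma^n V(i_{t+n}).
    observed = sum(gamma**k * costs[t + k] for k in range(n))
    return observed + gamma**n * V[states[t + n]]

def lambda_return(costs, states, V, t, gamma, lam, n_max):
    # Geometrically weighted average of Equation 15, truncated after n_max terms.
    total = 0.0
    for n in range(1, n_max + 1):
        weight = (1.0 - lam) * lam**(n - 1)
        total += weight * n_step_estimate(costs, states, V, t, n, gamma)
    return total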

The mathematical convenience of using the geometric average can be seen as follows. Given the estimates V_t^λ(i), the obvious way to use them in a learning rule is

V_{t+1}(i_t) = V_t(i_t) + \alpha \big[ V_t^{\lambda}(i_t) - V_t(i_t) \big] \qquad (16)

In terms of prediction differences, that is

\Delta_t(i_t) = c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t) \qquad (17)


the geometric weighting allows us to write the correction term in the learning rule as

V_t^{\lambda}(i_t) - V_t(i_t) = \Delta_t(i_t) + (\gamma\lambda)\, \Delta_t(i_{t+1}) + (\gamma\lambda)^2\, \Delta_t(i_{t+2}) + \cdots \qquad (18)

Note that up to now the prediction differences that need to be calculated in the future depend on the current V_t(i). If the chain is nonabsorbing, this computational implausibility can, however, be overcome by updating the predictions at each step with the prediction differences calculated by using the current predictions. This procedure gives the on-line version of TD(λ):

V_{t+1}(i) = V_t(i) + \alpha_t\, \Delta_t(i_t) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \qquad (19)

where χ_i(k) is the indicator variable of whether state i was visited at the kth step (of a sequence). Note that the sum contains the effect of the modifications or activity traces initiated at past time steps. Moreover, it is important to note that in this case the theoretically desirable properties of the estimates derived earlier may hold only asymptotically (see the convergence proof in the next section).
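The decayed sum over indicator variables in Equation 19 can be maintained incrementally as an eligibility trace per state, as in the sketch below. The environment interface env.step, the single constant step size alpha, and the convention that the cost-free terminal state has value zero are our own assumptions for illustration; the convergence results in the next section use state-dependent, decreasing step sizes.

import numpy as np

def online_td_lambda_episode(env, V, alpha, gamma, lam, start_state):
    # One episode of on-line TD(lambda) (Equation 19).
    # e(i) accumulates sum_k (gamma*lam)^(t-k) chi_i(k), the decayed visit indicators.
    e = np.zeros_like(V)
    state = start_state
    done = False
    while not done:
        cost, next_state, done = env.step(state)          # assumed interface: sampled cost, successor, termination flag
        delta = cost + gamma * V[next_state] - V[state]   # prediction difference, Equation 17
        e *= gamma * lam                                  # decay every trace by gamma*lambda
        e[state] += 1.0                                   # chi_i(k) = 1 for the state just visited
        V += alpha * delta * e                            # update all states in proportion to their traces
        state = next_state
    return V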

In the absorbing case the estimates V_t(i) can also be updated off-line, that is, after a complete sequence has been observed. The learning rule for this case is derived simply from collecting the correction traces initiated at each step of the sequence. More concisely, the total correction is the sum of the individual correction traces illustrated in Equation 18. This results in the batch learning rule

V_{n+1}(i) = V_n(i) + \alpha_n \sum_{t=1}^{m} \Delta_t(i_t) \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \qquad (20)

where the (m + 1)th step is the termination state.

We note that the above derivation of the TD(λ) algorithm corresponds to the specific choice of a linear representation for the predictors V_t(i) (see, e.g., Dayan, 1992). Learning rules for other representations can be obtained using gradient descent, but these are not considered here. In practice TD(λ) is usually applied to an absorbing chain, thus allowing the use of either the batch or the on-line version, but the latter is usually preferred.
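For comparison with the on-line sketch above, here is a corresponding sketch of the batch rule in Equation 20: the corrections for one complete absorbing sequence are accumulated against the fixed pre-sequence values V_n and applied only at the end. The trajectory format and the convention that the terminal state has value zero are assumptions of ours.

import numpy as np

def batch_td_lambda_update(V, states, costs, alpha, gamma, lam):
    # One off-line (batch) TD(lambda) update (Equation 20) for a single sequence.
    # states : visited states i_1, ..., i_m, i_{m+1}, with i_{m+1} the terminal state
    # costs  : sampled costs c_{i_1}, ..., c_{i_m}
    correction = np.zeros_like(V)
    e = np.zeros_like(V)                       # decayed indicator sums, as in Equation 19
    for t in range(len(costs)):
        e *= gamma * lam
        e[states[t]] += 1.0
        delta = costs[t] + gamma * V[states[t + 1]] - V[states[t]]   # uses the fixed V_n throughout
        correction += delta * e
    return V + alpha * correction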

Convergence of TD(λ)

As we are interested in strong forms of convergence we need to modify the algorithm slightly. The learning rate parameters α_n are replaced by α_n(i), which satisfy Σ_n α_n(i) = ∞ and Σ_n α_n²(i) < ∞ uniformly w.p.1. These parameters allow asynchronous updating and they can, in general, be random variables. The convergence of the algorithm is guaranteed by the following theorem, which is an application of Theorem 1.

Theorem 3 For any finite absorbing Markov chain, for any distribution of starting states with no inaccessible states, and for any distributions of the costs with finite variances, the TD(λ) algorithm given by

1)

V_{n+1}(i) = V_n(i) + \alpha_n(i) \sum_{t=1}^{m} \big[ c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t) \big] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)


2)

V_{t+1}(i) = V_t(i) + \alpha_t(i) \big[ c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t) \big] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)

converges to the optimal predictions w.p.1 provided Σ_n α_n(i) = ∞ and Σ_n α_n²(i) < ∞ uniformly w.p.1 and γ, λ ∈ [0, 1] with γλ < 1.

Proof for (1): Using the ideas described in the previous section the learning rule can be written as

V_{n+1}(i) = V_n(i) + \alpha_n(i) \Big[ G_n(i) - \frac{m(i)}{E\{m(i)\}}\, V_n(i) \Big]

G_n(i) = \frac{1}{E\{m(i)\}} \sum_{k=1}^{m(i)} V_n^{\lambda}(i; k)

where V_n^λ(i; k) is an estimate calculated at the kth occurrence of state i in a sequence and for mathematical convenience we have made the transformation α_n(i) → E{m(i)}α_n(i), where m(i) is the number of times state i was visited during the sequence.

To apply Theorem 1 we subtract V*(i), the optimal predictions, from both sides of the learning equation. By identifying the α_n(i) of Theorem 1 with α_n(i)m(i)/E{m(i)}, β_n(i) with α_n(i), and F_n(i) with G_n(i) - V*(i)m(i)/E{m(i)}, we need to show that these satisfy the conditions of Theorem 1. For α_n(i) and β_n(i) this is obvious. We begin here by showing that F_n(i) indeed is a contraction mapping. To this end,

\max_i |E\{F_n(i) \mid V_n\}| = \max_i \frac{1}{E\{m(i)\}} \Big| E\big\{ (V_n^{\lambda}(i; 1) - V^*(i)) + (V_n^{\lambda}(i; 2) - V^*(i)) + \cdots \mid V_n \big\} \Big|

which can be bounded above by using the relation

|E\{V_n^{\lambda}(i; k) - V^*(i) \mid V_n\}|
\le E\big\{ |E\{V_n^{\lambda}(i; k) - V^*(i) \mid m(i) \ge k, V_n\}|\, \theta(m(i) - k) \mid V_n \big\}
\le P\{m(i) \ge k\}\, |E\{V_n^{\lambda}(i) - V^*(i) \mid V_n\}| \le \gamma\, P\{m(i) \ge k\} \max_i |V_n(i) - V^*(i)|

where θ(x) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V_n^λ(i) is a contraction mapping independent of possible discounting. As Σ_k P{m(i) ≥ k} = E{m(i)}, we finally get

\max_i |E\{F_n(i) \mid V_n\}| \le \gamma \max_i |V_n(i) - V^*(i)|

The variance of F_n(i) can be seen to be bounded by

E\{m^4\}\, \max_i |V_n(i)|^2

For any absorbing Markov chain the convergence to the terminal state is geometric, and thus for every finite k, E{m^k} ≤ C(k), implying that the variance of F_n(i) is within the bounds of Theorem 1.


As Theorem 1 is now applicable, we can conclude that the batch version of TD(λ) converges to the optimal predictions w.p.1. □

Proof for (2): The proof for the on-line version is achieved by showing that the effect of the on-line updating vanishes in the limit, thereby forcing the two versions to be equal asymptotically. We view the on-line version as a batch algorithm in which the updates are made after each complete sequence, but are made in such a manner so as to be equal to those made on-line.

Define G'_n(i) = G_n(i) + R_n(i) to be the new batch estimate, where R_n(i) is the difference between the on-line and batch estimates. We define the new batch learning parameters to be the maxima over a sequence, that is, ᾱ_n(i) = max_{t∈s} α_t(i). Now R_n(i) consists of terms proportional to

\big[ c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t) \big]

the expected value of which can be bounded by A = 2 || V_n - V* ||. Assuming that γλ < 1 (which implies that the multipliers of the above terms are bounded), we can get an upper bound for the expected value of the correction R_n(i). Let us define R̄_{n,t}(i) to be the expected difference between the on-line estimate after t steps and the first t terms of the batch estimate. We can bound R̄_{n,t}(i) readily by the update rule, resulting in the iteration

\| \bar{R}_{n,t+1} \| \le \| \bar{R}_{n,t} \| + \| \bar{\alpha}_n \|\, C_1 \big( A + \| \bar{R}_{n,t} \| \big)

where R̄_{n,t}(i) = E{R_{n,t}(i) | V_n}, R̄_{n,0}(i) = 0, and C_1 is some constant. Since || ᾱ_n || goes to zero w.p.1, the above iteration implies that || R̄_{n,t} || → 0 w.p.1, giving

\max_i |E\{R_n(i) \mid V_n\}| \le C_n \max_i |V_n(i) - V^*(i)|

where C_n → 0 w.p.1. Therefore, using the results for the batch algorithm, F'_n(i) = G'_n(i) - V*(i)m(i)/E{m(i)} satisfies

\max_i |E\{F'_n(i)\}| \le (\gamma + C_n) \max_i |V_n(i) - V^*(i)|

where for large n, (γ + C_n) ≤ γ' < 1 w.p.1. The variance of R_n(i), and thereby that of F'_n(i), is within the bounds of Theorem 1 by linearity. This completes the proof. □

Conclusions

In this paper we have extended results from stochastic approximation theory to cover asynchronous relaxation processes which have a contraction property with respect to some maximum norm (Theorem 1). This new class of converging iterative processes is shown to include both the Q-learning and TD(λ) algorithms in either their on-line or batch versions. We note that the convergence of the on-line version of TD(λ) has not been shown previously. We also wish to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD(λ) utilize only high-level statistical properties of the estimates used in these algorithms and do not rely on constructions specific to the algorithms. Our approach also sheds additional light on the similarities between Q-learning and TD(λ).

Although Theorem 1 is readily applicable to DP-based learning schemes, the theory of dynamic programming is important only for its characterization of the optimal solution and for a contraction property needed in applying the theorem. The theorem can be applied to iterative algorithms of different types as well.


Finally, we note that Theorem 1 can be extended to cover processes that do not show the usual contraction property, thereby increasing its applicability to algorithms of possibly more practical importance.

Proof of Theorem 1

In this section we provide a detailed proof of the theorem on which the convergence proofs for Q-learning and TD(λ) were based. We introduce and prove three essential lemmas, which will also help to clarify ties to the literature and the ideas behind the theorem, followed by the proof of Theorem 1. The notation || · ||_W = max_x | · /W(x)| will be used in what follows.

Lemma 1 A random process

W_{n+1}(x) = (1 - \alpha_n(x))\, W_n(x) + \beta_n(x)\, r_n(x)

converges to zero with probability one if the following conditions are satisfied:

1) Σ_n α_n(x) = ∞, Σ_n α_n²(x) < ∞, Σ_n β_n(x) = ∞, and Σ_n β_n²(x) < ∞ uniformly w.p.1.

2) E{r_n(x) | P_n} = 0 and E{r_n²(x) | P_n} ≤ C w.p.1, where

P_n = {W_n, W_{n-1}, ..., r_{n-1}, r_{n-2}, ..., α_{n-1}, α_{n-2}, ..., β_{n-1}, β_{n-2}, ...}

All the random variables are allowed to depend on the past P_n.

Proof. Except for the appearance of β_n(x) this is a standard result. With the above definitions convergence follows directly from Dvoretzky's extended theorem (Dvoretzky, 1956).

Lemma 2 Consider a random process X_{n+1}(x) = G_n(X_n, x), where

G_n(\beta X_n, x) = \beta\, G_n(X_n, x)

Let us suppose that if we kept || X_n || bounded by scaling, then X_n would converge to zero w.p.1. This assumption is sufficient to guarantee that the original process converges to zero w.p.1.

Proof. Note that the scaling of X_n at any point of the iteration corresponds to having started the process with a scaled X_0. Fix some constant C. If during the iteration || X_n || increases above C, then X_n is scaled so that || X_n || = C. By the assumption this process must then converge w.p.1. To show that the net effect of the corrections must stay finite w.p.1, we note that if || X_n || converges then for any ε > 0 there exists M_ε such that || X_n || < ε < C for all n > M_ε with probability at least 1 - ε. But this implies that the iteration stays below C after M_ε and converges to zero without any further corrections. □

Lemma 3 A stochastic process X_{n+1}(x) = (1 - α_n(x))X_n(x) + γβ_n(x) || X_n || converges to zero w.p.1 provided

1) x ∈ S, where S is a finite set.

2) Σ_n α_n(x) = ∞, Σ_n α_n²(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n²(x) < ∞, and E{β_n(x)} ≤ E{α_n(x)} uniformly w.p.1.


Proof. Essentially the proof is an application of Lemma 2. To this end, assume that we keep || X_n || ≤ C_1 by scaling, which allows the iterative process to be bounded by

|X_{n+1}(x)| \le (1 - \alpha_n(x))\, |X_n(x)| + \gamma \beta_n(x)\, C_1

This is linear in |X_n(x)| and can be easily shown to converge w.p.1 to some X*(x), where || X* || ≤ γC_1. Hence, for small enough ε, there exists M_1(ε) such that || X_n || ≤ C_1/(1 + ε) for all n > M_1(ε) with probability at least p_1(ε). With probability p_1(ε) the procedure can be repeated for C_2 = C_1/(1 + ε). Continuing in this manner and choosing p_k(ε) so that Π_k p_k(ε) goes to one as ε → 0, we obtain the w.p.1 convergence of the bounded iteration, and Lemma 2 can be applied. □

Theorem 1 A random iterative process Δ_{n+1}(x) = (1 - α_n(x))Δ_n(x) + β_n(x)F_n(x) converges to zero w.p.1 under the following assumptions:

1) The state space is finite.

2) Σ_n α_n(x) = ∞, Σ_n α_n²(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n²(x) < ∞, and E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} uniformly w.p.1.

3) || E{F_n(x) | P_n} ||_W ≤ γ || Δ_n ||_W, where γ ∈ (0, 1).

4) Var{F_n(x) | P_n} ≤ C(1 + || Δ_n ||_W)², where C is some constant.

Here P_n = {Δ_n, Δ_{n-1}, ..., F_{n-1}, ..., α_{n-1}, ..., β_{n-1}, ...} stands for the past at step n. F_n(x), α_n(x) and β_n(x) are allowed to depend on the past insofar as the above conditions remain valid. The notation || · ||_W refers to some weighted maximum norm.

Proof. By defining r_n(x) = F_n(x) - E{F_n(x) | P_n} we can decompose the iterative process into two parallel processes given by

\delta_{n+1}(x) = (1 - \alpha_n(x))\, \delta_n(x) + \beta_n(x)\, E\{F_n(x) \mid P_n\}
w_{n+1}(x) = (1 - \alpha_n(x))\, w_n(x) + \beta_n(x)\, r_n(x) \qquad (21)

where Δ_n(x) = δ_n(x) + w_n(x). Dividing the equations by W(x) for each x and denoting δ'_n(x) = δ_n(x)/W(x), w'_n(x) = w_n(x)/W(x), and r'_n(x) = r_n(x)/W(x), we can bound the δ_n process by assumption 3) and rewrite the equation pair as

|\delta'_{n+1}(x)| \le (1 - \alpha_n(x))\, |\delta'_n(x)| + \gamma \beta_n(x)\, \| \delta'_n + w'_n \|

w'_{n+1}(x) = (1 - \alpha_n(x))\, w'_n(x) + \beta_n(x)\, r'_n(x)

Assume for a moment that the Δ_n process stays bounded. Then the variance of r'_n(x) is bounded by some constant C and thereby w'_n converges to zero w.p.1 according to Lemma 1. Hence, there exists M such that for all n > M, || w'_n || < ε with probability at least 1 - ε. This implies that the δ_n process can be further bounded by

|\delta'_{n+1}(x)| \le (1 - \alpha_n(x))\, |\delta'_n(x)| + \gamma \beta_n(x) \big( \| \delta'_n \| + \epsilon \big)

with probability at least 1 - ε. If we choose C such that γ(C + 1)/C < 1, then for || δ'_n || ≥ Cε,

\gamma \big( \| \delta'_n \| + \epsilon \big) \le \gamma (C + 1)/C\, \| \delta'_n \|


and the process defined by this upper bound converges to zero w.p.1 by Lemma 3. Thus || δ'_n || converges w.p.1 to some value bounded by Cε, which guarantees the w.p.1 convergence of the original process under the boundedness assumption.

By assumption (4), r_n(x) can be written as (1 + || δ_n + w_n ||)s_n(x), where E{s_n²(x) | P_n} ≤ C. Let us now decompose w_n as u_n + v_n with

u_{n+1}(x) = (1 - \alpha_n(x))\, u_n(x) + \gamma \beta_n(x)\, \| \delta_n + u_n + v_n \|

and v_n converges to zero w.p.1 by Lemma 1. Again, by choosing C such that γ(C + 1)/C < 1, we can bound the δ_n and u_n processes for || δ_n + u_n || ≥ Cε. The pair (δ_n, u_n) is then a scale invariant process whose bounded version was proven earlier to converge to zero w.p.1, and therefore by Lemma 2 it too converges to zero w.p.1. This proves the w.p.1 convergence of the triple δ_n, u_n, and v_n bounding the original process. □

References

Aoki, M. (1967). Optimization of Stochastic Systems. New York: Academic Press.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1993). Learning to act using real-time dynamic programming. Submitted to: AI Journal.

Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990). Sequential decision problems and neural networks. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems, 2, pp. 686-693. San Mateo, CA: Morgan Kaufmann.

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall.

Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall.

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362.

Dayan, P., & Sejnowski, T. J. (1993). TD(λ) converges with probability 1. CNL, The Salk Institute, San Diego, CA.

Dvoretzky, A. (1956). On stochastic approximation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.

Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407.

Ross, S. M. (1970). Applied Probability Models with Optimization Applications. San Francisco: Holden-Day.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England.


Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.

Werbos, P. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White & D. A. Sofge (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 493-525. New York: Van Nostrand Reinhold.
