
STOCHASTIC APPROXIMATION: A DYNAMICAL SYSTEMS VIEWPOINT

Vivek S. Borkar

Tata Institute of Fundamental Research, Mumbai.


Contents

Preface

1 Introduction

2 Basic Convergence Analysis
   2.1 The o.d.e. limit
   2.2 Extensions and variations

3 Stability Criteria
   3.1 Introduction
   3.2 Stability criterion
   3.3 Another stability criterion

4 Lock-in Probability
   4.1 Estimating the lock-in probability
   4.2 Sample complexity
   4.3 Avoidance of traps

5 Stochastic Recursive Inclusions
   5.1 Preliminaries
   5.2 The differential inclusion limit
   5.3 Applications
   5.4 Projected stochastic approximation

6 Multiple Timescales
   6.1 Two timescales
   6.2 Averaging the natural timescale: preliminaries
   6.3 Averaging the natural timescale: main results
   6.4 Concluding remarks

7 Asynchronous Schemes
   7.1 Introduction
   7.2 Asymptotic behaviour
   7.3 Effect of delays
   7.4 Convergence

8 A Limit Theorem for Fluctuations
   8.1 Introduction
   8.2 A tightness result
   8.3 The functional central limit theorem
   8.4 The convergent case

9 Constant Stepsize Algorithms
   9.1 Introduction
   9.2 Asymptotic behaviour
   9.3 Refinements

10 Applications
   10.1 Introduction
   10.2 Stochastic gradient schemes
   10.3 Stochastic fixed point iterations
   10.4 Collective phenomena
   10.5 Miscellaneous applications

11 Appendices
   11.1 Appendix A: Topics in analysis
      11.1.1 Continuous functions
      11.1.2 Square-integrable functions
      11.1.3 Lebesgue's theorem
   11.2 Appendix B: Ordinary differential equations
      11.2.1 Basic theory
      11.2.2 Linear systems
      11.2.3 Asymptotic behaviour
   11.3 Appendix C: Topics in probability
      11.3.1 Martingales
      11.3.2 Spaces of probability measures
      11.3.3 Stochastic differential equations

References
Index


Preface

Stochastic approximation was introduced in a 1951 article in the Annals of Mathematical Statistics by Robbins and Monro. Originally conceived as a tool for statistical computation, an area in which it retains a place of pride, it has come to thrive in a totally different discipline, viz., that of electrical engineering. The entire area of 'adaptive signal processing' in communication engineering has been dominated by stochastic approximation algorithms and variants, as is evident from even a cursory look at any standard text on the subject. Then there are the more recent applications to adaptive resource allocation problems in communication networks. In control engineering too, stochastic approximation is the main paradigm for on-line algorithms for system identification and adaptive control.

This is not accidental. The key word in most of these applications is adaptive. Stochastic approximation has several intrinsic traits that make it an attractive framework for adaptive schemes. It is designed for uncertain (read 'stochastic') environments, where it allows one to track the 'average' or 'typical' behaviour of such an environment. It is incremental, i.e., it makes small changes in each step, which ensures a graceful behaviour of the algorithm. This is a highly desirable feature of any adaptive scheme. Furthermore, it usually has low computational and memory requirements per iterate, another desirable feature of adaptive systems. Finally, it conforms to our anthropomorphic notion of adaptation: it makes small adjustments so as to improve a certain performance criterion based on feedback received from the environment.

For these very reasons, there has been a resurgence of interest in this class of algorithms in several new areas of engineering. One of these, viz., communication networks, is already mentioned above. Yet another major application domain has been artificial intelligence, where stochastic approximation has provided the basis for many learning or 'parameter tuning' algorithms in soft computing. Notable among these are the algorithms for training neural networks and the algorithms for reinforcement learning, a popular learning paradigm for autonomous software agents with applications in e-commerce, robotics, etc.

Yet another fertile terrain for stochastic approximation has been in the area of economic theory, for reasons not entirely dissimilar to those mentioned above. On one hand, they provide a good model for collective phenomena, where micromotives (to borrow a phrase from Thomas Schelling) of individual agents aggregate to produce interesting macrobehaviour. The 'nonlinear urn' scheme analyzed by Arthur and others to model increasing returns in economics is a case in point. On the other hand, their incrementality and low per iterate computational and memory requirements make them an ideal model of a boundedly rational economic agent, a theme which has dominated their application to learning models in economics, notably to learning in evolutionary games.

This flurry of activity, while expanding the application domain of stochastic approximation, has also thrown up interesting new issues, some of them dictated by technological imperatives. Consequently, it has spawned interesting new theoretical developments as well. The time thus seemed right for a book pulling together old and new developments in the subject with an eye on the aforementioned applications. There are, indeed, several excellent texts already in existence, many of which will be referenced later in this book. But they tend to be comprehensive texts: excellent for the already initiated but rather intimidating for someone who wants to make quick inroads. Hence a need for a 'bite-sized' text. The present book is an attempt at one.

Having decided to write a book, there was still a methodological choice. Stochastic approximation theory has two somewhat distinct strands of research. One, popular with statisticians, uses the techniques of martingale theory and associated convergence theorems for analysis. The second, popular more with engineers, treats the algorithm as a noisy discretization of an ordinary differential equation (o.d.e.) and analyzes it as such. We have opted for the latter approach, because the kind of intuition that it offers is an added advantage in many of the engineering applications.

Of course, this is not the first book expounding this approach. There are several predecessors such as the excellent texts by Benveniste–Metivier–Priouret, Duflo, and Kushner–Yin referenced later in the book. These are, however, what we have called comprehensive texts above, with a wealth of information. This book is not comprehensive, but is more of a compact account of the highlights to enable an interested, mathematically literate reader to run through the basic ideas and issues in a relatively short time span. The other 'novelties' of the book would be a certain streamlining and fine-tuning of proofs using that eternal source of wisdom – hindsight. There are occasional new variations on proofs sometimes leading to improved results (e.g., in Chapter 6) or just shorter proofs, inclusion of some newer themes in theory and applications, and so on. Given the nature of the subject, a certain mathematical sophistication was unavoidable. For the benefit of those not quite geared for it, we have collected the more advanced mathematical requirements in a few appendices. These should serve as a source for quick reference and pointers to the literature, but not as a replacement for a firm grounding in the respective areas. Such grounding is a must for anyone wishing to contribute to the theory of stochastic approximation. Those interested more in applying the results to their respective specialties may not feel the need to go much further than this little book.

Let us conclude this long preface with the pleasant task of acknowledging all the help received in this venture. The author forayed into stochastic approximation around 1993–1994, departing significantly from his dominant activity till then, which was controlled Markov processes. This move was helped by a project on adaptive systems supported by a Homi Bhabha Fellowship. More than the material help, the morale boost was a great help and he is immensely grateful for it. His own subsequent research in this area has been supported by grants from the Department of Science and Technology, Government of India, and was conducted in the two 'Tata' Institutes: Indian Institute of Science at Bangalore and the Tata Institute of Fundamental Research in Mumbai. Dr. V. V. Phansalkar went through the early drafts of a large part of the book and, with his fine eye for detail, caught many errors. Prof. Shalabh Bhatnagar, Dr. Arzad Alam Kherani and Dr. Huizhen (Janey) Yu also read the drafts and pointed out corrections and improvements (Janey shares with Dr. Phansalkar the rare trait of having a great eye for detail and contributed a lot to the final clean-up). Dr. Sameer Jalnapurkar did a major overhaul of chapters 1–3 and a part of chapter 4, which in addition to fixing errors, greatly contributed to their readability. Ms. Diana Gillooly of Cambridge University Press did an extremely meticulous job of editorial corrections on the final manuscript. The author takes full blame for whatever errors remain. His wife Shubhangi and son Aseem have been extremely supportive as always. This book is dedicated to them.

Vivek S. Borkar
Mumbai, February 2008


1 Introduction

Consider an initially empty urn to which balls, either red or black, are added one at a time. Let $y_n$ denote the number of red balls at time $n$ and $x_n \stackrel{\text{def}}{=} y_n/n$ the fraction of red balls at time $n$. We shall suppose that the conditional probability that the next, i.e., the $(n+1)$st ball is red given the past up to time $n$ is a function of $x_n$ alone. Specifically, suppose that it is given by $p(x_n)$ for a prescribed $p : [0,1] \to [0,1]$. It is easy to describe $\{x_n, n \geq 1\}$ recursively as follows. For $y_n$, we have the simple recursion

$$y_{n+1} = y_n + \xi_{n+1},$$

where

$$\xi_{n+1} = \begin{cases} 1 & \text{if the } (n+1)\text{st ball is red}, \\ 0 & \text{if the } (n+1)\text{st ball is black}. \end{cases}$$

Some simple algebra then leads to the following recursion for $x_n$:

$$x_{n+1} = x_n + \frac{1}{n+1}(\xi_{n+1} - x_n),$$

with $x_0 = 0$. This can be rewritten as

$$x_{n+1} = x_n + \frac{1}{n+1}(p(x_n) - x_n) + \frac{1}{n+1}(\xi_{n+1} - p(x_n)).$$

Note that $M_n \stackrel{\text{def}}{=} \xi_n - p(x_{n-1})$, $n \geq 1$ (with $p(x_0) \stackrel{\text{def}}{=}$ the probability of the first ball being red) is a sequence of zero mean random variables satisfying $E[M_{n+1} \mid \xi_m, m \leq n] = 0$ for $n \geq 0$. This means that $\{M_n\}$ is a martingale difference sequence (see Appendix C), i.e., uncorrelated with the 'past', and thus can be thought of as 'noise'. The above equation then can be thought of as a noisy discretization (or Euler scheme in numerical analysis parlance) for the ordinary differential equation (o.d.e. for short)

$$\dot{x}(t) = p(x(t)) - x(t),$$

for $t \geq 0$, with nonuniform stepsizes $a(n) \stackrel{\text{def}}{=} 1/(n+1)$ and 'noise' $M_n$. (Compare with the standard Euler scheme $x_{n+1} = x_n + a(p(x_n) - x_n)$ for a small $a > 0$.) If we assume $p(\cdot)$ to be Lipschitz continuous, o.d.e. theory guarantees that this o.d.e. is well-posed, i.e., it has a unique solution for any initial condition $x(0)$ that in turn depends continuously on $x(0)$ (see Appendix B). Note also that the right-hand side of the o.d.e. is nonnegative at $x(t) = 0$ and nonpositive at $x(t) = 1$, implying that any trajectory starting in $[0,1]$ will remain in $[0,1]$ forever. As this is a scalar o.d.e., any bounded trajectory must converge. To see this, note that it cannot move in any particular direction ('right' or 'left') forever without converging, because it is bounded. At the same time, it cannot change direction from 'right' to 'left' or vice versa without passing through an equilibrium point: this would require that the right-hand side of the o.d.e. change sign and hence by continuity pass through a point where it vanishes, i.e., an equilibrium point. The trajectory must then converge to this equilibrium, a contradiction. (For that matter, the o.d.e. couldn't have been going both right and left at any given $x$ because this direction is uniquely prescribed by the sign of $p(x) - x$.) Thus we have proved that $x(\cdot)$ must converge to an equilibrium. The set of equilibria of the o.d.e. is given by the points where the right-hand side vanishes, i.e., the set $H = \{x : p(x) = x\}$. This is precisely the set of fixed points of $p(\cdot)$. Once again, as the right-hand side is continuous, is $\leq 0$ at $1$, and is $\geq 0$ at $0$, it must pass through $0$ by the intermediate value theorem and hence $H$ is nonempty. (One could also invoke the Brouwer fixed point theorem (Appendix A) to say this, as $p : [0,1] \to [0,1]$ is a continuous map from a convex compact set to itself.)

Our interest, however, is in $\{x_n\}$. The theory we develop later in this book will tell us that the $x_n$ 'track' the o.d.e. with probability one in a certain sense to be made precise later, implying in particular that they converge a.s. to $H$. The key factors that ensure this are the fact that the stepsize $a(n)$ tends to zero as $n \to \infty$, and the fact that the series $\sum_n a(n) M_{n+1}$ converges a.s., a consequence of the martingale convergence theorem. The first observation means in particular that the 'pure' discretization error becomes asymptotically negligible. The second observation implies that the 'tail' of the above convergent series given by $\sum_{m=n}^{\infty} a(m) M_{m+1}$, which is the 'total noise added to the system from time $n$ on', goes to zero a.s. This in turn ensures that the error due to noise is also asymptotically negligible. We note here that the fact $\sum_n a(n)^2 = \sum_n (1/(n+1)^2) < \infty$ plays a crucial role in facilitating the application of the martingale convergence theorem in the analysis of the urn scheme above. This is because it ensures the following sufficient condition for martingale convergence (see Appendix C):

$$\sum_n E[(a(n) M_{n+1})^2 \mid \xi_m, m \leq n] \leq \sum_n a(n)^2 < \infty, \quad \text{a.s.}$$

One also needs the fact that $\sum_n a(n) = \infty$, because in view of our interpretation of $a(n)$ as a time step, this ensures that the discretization does cover the entire time axis. As we are interested in tracking the asymptotic behaviour of the o.d.e., this is clearly necessary.

Let's consider now the simple case when $H$ is a finite set. Then one can say more, viz., that the $x_n$ converge a.s. to some point in $H$. The exact point to which they converge will be random, though we shall later narrow down the choice somewhat (e.g., the 'unstable' equilibria will be avoided with probability one under suitable conditions). For the time being, we shall stop with this conclusion and discuss the raison d'être for looking at such 'nonlinear urns'.

This simple set-up was proposed by W. Brian Arthur (1994) to model the phenomenon of increasing returns in economics. The reader will have heard of the 'law of diminishing returns' from classical economics, which can be described as follows. Any production enterprise such as a farm or a factory requires both fixed and variable resources. When one increases the amount of variable resources, each additional unit thereof will get a correspondingly smaller fraction of fixed resources to draw upon, and therefore the additional returns due to it will correspondingly diminish.

While quite accurate in describing the traditional agricultural or manufacturing sectors, this law seems to be contradicted in some other sectors, particularly in case of the modern 'information goods'. One finds that larger investments in a brand actually fetch larger returns because of standardization and compatibility of goods, brand loyalty of customers, and so on. This is the so-called 'increasing returns' phenomenon modelled by the urn above, where each new red ball is an additional unit of investment in a particular product. If the predominance of one colour tends to fetch more balls of the same, then after some initial randomness the process will get 'locked into' one colour which will dominate overwhelmingly. (This corresponds to $p(x) > x$ for $x \in (x_0, 1)$ for some $x_0 \in (0,1)$, and $p(x) < x$ for $x \in (0, x_0)$. Then the stable equilibria are $0$ and $1$, with $x_0$ being an unstable equilibrium. Recall that in this set-up the equilibrium $x$ is stable if $p'(x) < 1$, unstable if $p'(x) > 1$.) When we are modelling a pair of competing technologies or conventions, this means that one of them, not necessarily the better one, will come to dominate overwhelmingly. Arthur (1994) gives several interesting examples of this phenomenon. To mention a few, he describes how the VHS technology came to dominate over Sony Betamax for video recording, why the present arrangement of letters and symbols on typewriters and keyboards (QWERTY) could not be displaced by a superior arrangement called DVORAK, why 'clockwise' clocks eventually displaced 'counterclockwise' clocks, and so on.
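To make this concrete, here is a minimal simulation sketch of the urn recursion in Python. The choice $p(x) = 0.1 + 0.8(3x^2 - 2x^3)$ is an illustrative assumption, not from the text: its fixed points are approximately $0.146$, $0.5$ and $0.854$, with $p'(0.5) = 1.2 > 1$, so $0.5$ plays the role of the unstable equilibrium $x_0$ above.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # illustrative p: [0,1] -> [0,1], fixed points ~0.146, 0.5, ~0.854;
    # p'(0.5) = 1.2 > 1, so 0.5 is the unstable equilibrium
    return 0.1 + 0.8 * (3 * x**2 - 2 * x**3)

def run_urn(n_balls=50_000):
    y, x = 0, 0.0                  # y_0 = 0 red balls, x_0 = 0
    for n in range(1, n_balls + 1):
        y += rng.random() < p(x)   # n-th ball is red w.p. p(x_{n-1})
        x = y / n                  # x_n = y_n / n
    return x

print([round(run_urn(), 3) for _ in range(5)])
# typical output: values near 0.146 or 0.854, never near the unstable
# point 0.5 -- the 'lock-in' phenomenon described above
```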

Keeping economics aside, our interest here will be in the recursion for $x_n$ and its analysis sketched above using an o.d.e. The former constitutes a special (and a rather simple one at that) case of a much broader class of stochastic recursions called 'stochastic approximation' which form the main theme of this book. What's more, the analysis based on a limiting o.d.e. is an instance of the 'o.d.e. approach' to stochastic approximation which is our main focus here. Before spelling out further details of these, here's another example, this time from statistics.

Consider a repeated experiment which gives a string of input–output pairs $(X_n, Y_n)$, $n \geq 1$, with $X_n \in \mathbb{R}^m$, $Y_n \in \mathbb{R}^k$ resp. We assume that $\{(X_n, Y_n)\}$ are i.i.d. Our objective will be to find the 'best fit' $Y_n = f_w(X_n) + \epsilon_n$, $n \geq 1$, from a given parametrized family of functions $\{f_w : \mathbb{R}^m \to \mathbb{R}^k : w \in \mathbb{R}^d\}$, $\epsilon_n$ being the 'error'. What constitutes the 'best fit', however, depends on the choice of our error criterion and we shall choose this to be the popular 'mean square error' given by $g(w) \stackrel{\text{def}}{=} \frac{1}{2} E[\|\epsilon_n\|^2] = \frac{1}{2} E[\|Y_n - f_w(X_n)\|^2]$. That is, we aim to find a $w^*$ that minimizes this over all $w \in \mathbb{R}^d$. This is the standard problem of nonlinear regression. Typical parametrized families of functions are polynomials, splines, linear combinations of sines and cosines, or more recently, wavelets and neural networks. The catch here is that the above expectation cannot be evaluated because the underlying probability law is not known. Also, we do not suppose that the entire string $\{(X_n, Y_n)\}$ is available as in classical regression, but that it is being delivered one at a time in 'real time'. The aim then is to come up with a recursive scheme which tries to 'learn' $w^*$ in real time by adaptively updating a running guess as new observations come in.

To arrive at such a scheme, let's pretend to begin with that we do know the underlying law. Assume also that $f_w$ is continuously differentiable in $w$ and let $\nabla_w f_w(\cdot)$ denote its gradient w.r.t. $w$. The obvious thing to try then is to differentiate the mean square error w.r.t. $w$ and set the derivative equal to zero. Assuming that the interchange of expectation and differentiation is justified, we then have

$$\nabla_w g(w) = -E[\langle Y_n - f_w(X_n), \nabla_w f_w(X_n)\rangle] = 0$$

at the minimum point. We may then seek to minimize the mean square error by gradient descent, given by:

$$w_{n+1} = w_n - \nabla_w g(w_n) = w_n + E[\langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle \mid w_n].$$

This, of course, is not feasible for reasons already mentioned, viz., that the expectation above cannot be evaluated. As a first approximation, we may then consider replacing the expectation by the 'empirical gradient', i.e., the argument of the expectation evaluated at the current guess $w_n$ for $w^*$:

$$w_{n+1} = w_n + \langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle.$$

This, however, will lead to a different kind of problem. The term added to $w_n$ on the right is the $n$th in a sequence of 'i.i.d. functions' of $w$, evaluated at $w_n$. Thus we expect the above scheme to be (and it is) a correlated random walk, zigzagging its way to glory. We may therefore want to smooth it by making only a small, incremental move in the direction suggested by the right-hand side instead of making the full move. This can be achieved by replacing the right-hand side by a convex combination of it and the previous guess $w_n$, with only a small weight $1 > a(n) > 0$ for the former. That is, we replace the above by

$$w_{n+1} = (1 - a(n)) w_n + a(n)\big(w_n + \langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle\big).$$

Equivalently,

$$w_{n+1} = w_n + a(n) \langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle.$$

Once again, if we do not want the scheme to zigzag drastically, we should make $a(n)$ small, the smaller the better. At the same time, a small $a(n)$ leads to a very small correction to $w_n$ at each iterate, so the scheme will work very slowly, if at all. This suggests starting the iteration with a relatively high $a(n)$ and letting $a(n) \to 0$. (In fact, $a(n) < 1$ as above is not needed, as that can be taken care of by scaling the empirical gradient.) Now let's add and subtract the exact error gradient at the 'known guess' $w_n$ from the empirical gradient on the right-hand side and rewrite the above scheme as

$$\begin{aligned} w_{n+1} = w_n &+ a(n)\big(E[\langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle \mid w_n]\big) \\ &+ a(n)\big(\langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle - E[\langle Y_n - f_{w_n}(X_n), \nabla_w f_{w_n}(X_n)\rangle \mid w_n]\big). \end{aligned}$$

This is of the form

$$w_{n+1} = w_n + a(n)(-\nabla_w g(w_n) + M_{n+1}),$$

with $\{M_n\}$ a martingale difference sequence as in the previous example. One may then view this scheme as a noisy discretization of the o.d.e.

$$\dot{w}(t) = -\nabla_w g(w(t)).$$

This is a particularly well studied o.d.e. We know that it will converge to $H \stackrel{\text{def}}{=} \{w : \nabla_w g(w) = 0\}$ in general, and if this set is discrete, in fact to one of the local minima of $g$ for typical (i.e., generic: belonging to an open dense set) initial conditions. As before, we are interested in tracking the asymptotic behaviour of this o.d.e. Hence we must ensure that the discrete time steps $a(n)$ used in the 'noisy discretization' above do cover the entire time axis, i.e.,

$$\sum_n a(n) = \infty, \qquad (1.0.1)$$

while retaining $a(n) \to 0$. (Recall from the previous example that $a(n) \to 0$ is needed for asymptotic negligibility of discretization errors.) At the same time, we also want the error due to noise to be asymptotically negligible a.s. The urn example above then suggests that we also impose

$$\sum_n a(n)^2 < \infty, \qquad (1.0.2)$$

which asymptotically suppresses the noise variance. One can show that with (1.0.1) and (1.0.2) in place, for reasonable $g$ (e.g., with $\lim_{\|w\| \to \infty} g(w) = \infty$ and finite $H$, among other possibilities) the 'stochastic gradient scheme' above will converge a.s. to a local minimum of $g$.
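As an illustration, here is a minimal sketch of this stochastic gradient scheme for a hypothetical scalar linear family $f_w(x) = wx$ (so that $\nabla_w f_w(x) = x$); the data-generating law, the noise level, and the stepsize choice $a(n) = 1/(n+1)$ are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = 2.0   # hypothetical parameter generating the data

def observe():
    # one (X_n, Y_n) pair from the unknown law: Y = f_{w*}(X) + noise
    x = rng.normal()
    return x, w_star * x + 0.5 * rng.normal()

w = 0.0                              # initial guess w_0
for n in range(100_000):
    x, y = observe()
    a_n = 1.0 / (n + 1)              # satisfies (1.0.1) and (1.0.2)
    w += a_n * (y - w * x) * x       # a(n) <Y_n - f_w(X_n), grad_w f_w(X_n)>
print(f"w after 100000 steps: {w:.3f} (target w* = {w_star})")
```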

Once again, what we have here is a special case – perhaps the most important one – of stochastic approximation, analyzed by invoking the 'o.d.e. method'.

What, after all, is stochastic approximation? Historically, stochastic approximation started as a scheme for solving a nonlinear equation $h(x) = 0$ given 'noisy measurements' of the function $h$. That is, we are given a black box which, on input $x$, gives as its output $h(x) + \xi$, where $\xi$ is a zero mean random variable representing noise. The stochastic approximation scheme proposed by Robbins and Monro (1951)† was to run the iteration

$$x_{n+1} = x_n + a(n)[h(x_n) + M_{n+1}], \qquad (1.0.3)$$

where $\{M_n\}$ is the noise sequence and $\{a(n)\}$ are positive scalars satisfying (1.0.1) and (1.0.2) above. The expression in the square brackets on the right is the noisy measurement. That is, $h(x_n)$ and $M_{n+1}$ are not separately available, only their sum is. We shall assume $\{M_n\}$ to be a martingale difference sequence, i.e., a sequence of integrable random variables satisfying

$$E[M_{n+1} \mid x_m, M_m, m \leq n] = 0.$$

This is more general than it appears. For example, an important special case is the $d$-dimensional iteration

$$x_{n+1} = x_n + a(n) f(x_n, \xi_{n+1}), \quad n \geq 0, \qquad (1.0.4)$$

for an $f : \mathbb{R}^d \times \mathbb{R}^k \to \mathbb{R}^d$ with i.i.d. noise $\{\xi_n\}$. This can be put in the format of (1.0.3) by defining $h(x) = E[f(x, \xi_1)]$ and $M_{n+1} = f(x_n, \xi_{n+1}) - h(x_n)$ for $n \geq 0$.

† See Lai (2003) for an interesting historical perspective.
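For concreteness, a minimal sketch of the Robbins–Monro iteration in the form (1.0.4), with the illustrative choice $f(x, \xi) = 1 - x + \xi$, so that $h(x) = E[f(x, \xi_1)] = 1 - x$ with root $x = 1$ (this $f$ and the stepsizes are assumptions for the example only):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, xi):
    # illustrative f: h(x) = E[f(x, xi)] = 1 - x, with root at x = 1
    return 1.0 - x + xi

x = 5.0                              # arbitrary initial condition x_0
for n in range(100_000):
    a_n = 1.0 / (n + 1)              # sum a(n) = inf, sum a(n)^2 < inf
    x += a_n * f(x, rng.normal())    # (1.0.4); here M_{n+1} = xi_{n+1},
                                     # since f(x, xi) - h(x) = xi
print(f"x after 100000 steps: {x:.3f} (root of h at 1.0)")
```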

Since its inception, the scheme (1.0.3) has been a cornerstone in scientific computation. This has been so largely because of the following advantages, already apparent in the above examples:

• It is designed to handle noisy situations, e.g., the stochastic gradient scheme above. One may say that it captures the average behaviour in the long run. The noise in practice may not only be from measurement errors or approximations, but may also be added deliberately as a probing device or a randomized action, as, e.g., in certain dynamic game situations.

• It is incremental, i.e., it makes small moves at each step. This typically leads to more graceful behaviour of the algorithm at the expense of its speed. We shall say more on this later in the book.

• In typical applications, the computation per iterate is low, making its implementation easy.

These features make the scheme ideal for applications where the key word is 'adaptive'. Thus the stochastic approximation paradigm dominates the fields of adaptive signal processing, adaptive control, and certain subdisciplines of soft computing / artificial intelligence such as neural networks and reinforcement learning – see, e.g., Bertsekas and Tsitsiklis (1997), Haykin (1991) and Haykin (1998). Not surprisingly, it is also emerging as a popular framework for modelling boundedly rational macroeconomic agents – see, e.g., Sargent (1993). The two examples above are representative of these two strands. We shall be seeing many more instances later in this book.

As noted in the preface, there are broadly two approaches to the theoretical analysis of such algorithms. The first, popular with statisticians, is the probabilistic approach based on the theory of martingales and associated objects such as 'almost supermartingales'. The second approach, while still using a considerable amount of martingale theory, views the iteration as a noisy discretization of a limiting o.d.e. Recall that the standard 'Euler scheme' for numerically approximating a trajectory of the o.d.e.

$$\dot{x}(t) = h(x(t))$$

would be

$$x_{n+1} = x_n + a h(x_n),$$

with $x_0 = x(0)$ and $a > 0$ a small time step. The stochastic approximation iteration differs from this in two aspects: replacement of the constant time step '$a$' by a time-varying '$a(n)$', and the presence of 'noise' $M_{n+1}$. This qualifies it as a noisy discretization of the o.d.e. Our aim is to seek $x$ for which $h(x) = 0$,


i.e., the equilibrium point(s) of this o.d.e. The o.d.e. would converge (if it does) to these only asymptotically unless it happens to start exactly there. Hence to capture this asymptotic behaviour, we need to track the o.d.e. over the infinite time interval. This calls for the condition $\sum_n a(n) = \infty$. The condition $\sum_n a(n)^2 < \infty$ will on the other hand ensure that the errors due to discretization of the o.d.e. and those due to the noise $M_n$ both become negligible asymptotically with probability one. (To motivate this, let $M_n$ be i.i.d. zero mean with a finite variance $\sigma^2$. Then by a theorem of Kolmogorov, $\sum_n a(n) M_n$ converges a.s. if and only if $\sum_n a(n)^2$ converges.) Together these conditions try to ensure that the iterates do indeed capture the asymptotic behaviour of the o.d.e. We have already seen instances of this above.

Pioneered by Derevitskii and Fradkov (1974), this 'o.d.e. approach' was further extended and introduced to the engineering community by Ljung (1977). It is already the basis of several excellent texts such as Benveniste, Metivier and Priouret (1990), Duflo (1996), and Kushner and Yin (2003), among others†. The rendition here is a slight variation of the traditional one, with an eye on pedagogy so that the highlights of the approach can be introduced quickly and relatively simply. The lecture notes of Benaim (1999) are perhaps the closest in spirit to the treatment here, though at a much more advanced level. (Benaim's notes in particular give an overview of the contributions of Benaim and Hirsch, which introduced important notions from dynamical systems theory, such as internal chain recurrence, to stochastic approximation. These represent a major development in this field in recent years.)

While it is ultimately a matter of personal taste, the o.d.e. approach does indeed appeal to engineers because of the 'dynamical systems' view it takes, which is close to their hearts. Also, as we shall see at the end of this book, it can serve as a useful recipe for concocting new algorithms: any convergent o.d.e. is a potential source of a stochastic approximation algorithm that converges with probability one.

The organization of the book is as follows. Chapter 2 gives the basic convergence analysis for the stochastic approximation algorithm with decreasing stepsizes. This is the core material for the rest of the book. Chapter 3 gives some 'stability tests' that ensure the boundedness of iterates with probability one. Chapter 4 gives some refinements of the results of Chapter 2, viz., an estimate for the probability of convergence to a specific attractor if the iterates fall in its domain of attraction. It also gives a result about avoidance with probability one of unstable equilibria. Chapter 5 gives the counterparts of the basic results of Chapter 2 for a more general iteration, which has a differential inclusion as a limit rather than an o.d.e. This is useful in many practical instances, which are also described in this chapter. Chapter 6 analyzes the cases when more than one timescale is used. This chapter, notably the sections on 'averaging the natural timescale', is technically a little more difficult than the rest and the reader may skip the details of the proofs on a first reading. Chapter 7 describes the distributed asynchronous implementations of the algorithm. Chapter 8 describes the functional central limit theorem for fluctuations associated with the basic scheme of Chapter 2. All the above chapters use decreasing stepsizes. Chapter 9 briefly describes the corresponding theory for constant stepsizes, which are popular in some applications.

† Wasan (1969) and Nevelson and Khasminskii (1976) are two early texts on stochastic approximation, though with a different flavour. See also Ljung et al. (1992).

Chapter 10 of the book has a different flavour: it collects together several examples from engineering, economics, etc., where the stochastic approximation formalism has paid rich dividends. Thus the general techniques of the first part of the book are specialized to each case of interest and the additional structure available in the specific problem under consideration is exploited to say more, depending on the context. It is a mixed bag, the idea being to give the reader a flavour of the various 'tricks of the trade' that may come in handy in future applications. Broadly speaking, one may classify these applications into three strands. The first is the stochastic gradient scheme and its variants wherein $h$ above is either the negative gradient of some function or something close to the negative gradient. This scheme is the underlying paradigm for many adaptive filtering, parameter estimation and stochastic optimization schemes in general. The second is the o.d.e. version of fixed point iterations, i.e., successive application of a map from a space to itself so that it may converge to a point that remains invariant under it (i.e., a fixed point). These are important in a class of applications arising from dynamic programming. The third is the general collection of o.d.e.s modelling collective phenomena in economics etc., such as the urn example above. This classification is, of course, not exhaustive and some instances of stochastic approximation in practice do fall outside of it. Also, we do not consider the continuous time analog of stochastic approximation (see, e.g., Mel'nikov, 1996).

The background required for this book is a good first course on measure theoretic probability, particularly the theory of discrete parameter martingales, at the level of Breiman (1968) or Williams (1991) (though we shall generally refer to Borkar (1995), more out of familiarity than anything), and a first course on ordinary differential equations at the level of Hirsch, Smale and Devaney (2003). There are a few spots where something more than this is required, viz., the theory of weak (Prohorov) convergence of probability measures. The three appendices in Chapter 11 collect together the key aspects of these topics that are needed here.


2 Basic Convergence Analysis

2.1 The o.d.e. limit

In this chapter we begin our formal analysis of the stochastic approximation scheme in $\mathbb{R}^d$ given by

$$x_{n+1} = x_n + a(n)[h(x_n) + M_{n+1}], \quad n \geq 0, \qquad (2.1.1)$$

with prescribed $x_0$ and with the following assumptions, which we recall from the last chapter:

(A1) The map $h : \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz: $\|h(x) - h(y)\| \leq L \|x - y\|$ for some $0 < L < \infty$.

(A2) Stepsizes $\{a(n)\}$ are positive scalars satisfying

$$\sum_n a(n) = \infty, \quad \sum_n a(n)^2 < \infty. \qquad (2.1.2)$$

(A3) $\{M_n\}$ is a martingale difference sequence with respect to the increasing family of $\sigma$-fields

$$\mathcal{F}_n \stackrel{\text{def}}{=} \sigma(x_m, M_m, m \leq n) = \sigma(x_0, M_1, \ldots, M_n), \quad n \geq 0.$$

That is,

$$E[M_{n+1} \mid \mathcal{F}_n] = 0 \quad \text{a.s.}, \quad n \geq 0.$$

Furthermore, $\{M_n\}$ are square-integrable with

$$E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \leq K(1 + \|x_n\|^2) \quad \text{a.s.}, \quad n \geq 0, \qquad (2.1.3)$$

for some constant $K > 0$.


Assumption (A1) implies in particular the linear growth condition for $h(\cdot)$: for a fixed $x_0$, $\|h(x)\| \leq \|h(x_0)\| + L\|x - x_0\| \leq K'(1 + \|x\|)$ for a suitable constant $K' > 0$ and all $x \in \mathbb{R}^d$. Thus

$$E[\|x_{n+1}\|^2]^{\frac{1}{2}} \leq E[\|x_n\|^2]^{\frac{1}{2}} + a(n) K' (1 + E[\|x_n\|^2]^{\frac{1}{2}}) + a(n) \sqrt{K} (1 + E[\|x_n\|^2]^{\frac{1}{2}}).$$

We have used here the following fact: $\sqrt{1 + z^2} \leq 1 + z$ for $z \geq 0$. Along with (2.1.3) and the condition $E[\|x_0\|^2] < \infty$, this implies inductively that $E[\|x_n\|^2]$, $E[\|M_n\|^2]$ remain bounded for each $n$.

We shall carry out our analysis under the further assumption:

(A4) The iterates of (2.1.1) remain bounded a.s., i.e.,

$$\sup_n \|x_n\| < \infty, \quad \text{a.s.} \qquad (2.1.4)$$

This result is far from automatic and usually not very easy to establish. Some techniques for establishing this will be discussed in the next chapter.

The limiting o.d.e. which (2.1.1) might be expected to track asymptotically can be written by inspection as

$$\dot{x}(t) = h(x(t)), \quad t \geq 0. \qquad (2.1.5)$$

Assumption (A1) ensures that (2.1.5) is well-posed, i.e., has a unique solution for any $x(0)$ that depends continuously on $x(0)$. The basic idea of the o.d.e. approach to the analysis of (2.1.1) is to construct a suitable continuous interpolated trajectory $\bar{x}(t)$, $t \geq 0$, and show that it asymptotically almost surely approaches the solution set of (2.1.5). This is done as follows: define time instants $t(0) = 0$, $t(n) = \sum_{m=0}^{n-1} a(m)$, $n \geq 1$. By (2.1.2), $t(n) \uparrow \infty$. Let $I_n \stackrel{\text{def}}{=} [t(n), t(n+1)]$, $n \geq 0$. Define a continuous, piecewise linear $\bar{x}(t)$, $t \geq 0$, by $\bar{x}(t(n)) = x_n$, $n \geq 0$, with linear interpolation on each interval $I_n$. That is,

$$\bar{x}(t) = x_n + (x_{n+1} - x_n)\,\frac{t - t(n)}{t(n+1) - t(n)}, \quad t \in I_n.$$

Note that $\sup_{t \geq 0} \|\bar{x}(t)\| = \sup_n \|x_n\| < \infty$ a.s. Let $x^s(t)$, $t \geq s$, denote the unique solution to (2.1.5) 'starting at $s$':

$$\dot{x}^s(t) = h(x^s(t)), \quad t \geq s,$$

with $x^s(s) = \bar{x}(s)$, $s \in \mathbb{R}$. Likewise, let $x_s(t)$, $t \leq s$, denote the unique solution to (2.1.5) 'ending at $s$':

$$\dot{x}_s(t) = h(x_s(t)), \quad t \leq s,$$

with $x_s(s) = \bar{x}(s)$, $s \in \mathbb{R}$. Define also

$$\zeta_n = \sum_{m=0}^{n-1} a(m) M_{m+1}, \quad n \geq 1.$$

By (A3) and the remarks that follow, $(\zeta_n, \mathcal{F}_n)$, $n \geq 1$, is a zero mean, square-integrable martingale. Furthermore, by (A2), (A3) and (A4),

$$\sum_{n \geq 0} E[\|\zeta_{n+1} - \zeta_n\|^2 \mid \mathcal{F}_n] = \sum_{n \geq 0} a(n)^2 E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] < \infty, \quad \text{a.s.}$$

It follows from the martingale convergence theorem (Appendix C) that $\zeta_n$ converges a.s. as $n \to \infty$.
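As a numerical illustration of this construction (all choices illustrative: $h(x) = -x$, i.i.d. Gaussian noise, $a(n) = 1/(n+1)$), the following sketch builds $\bar{x}(\cdot)$ on the timescale $\{t(n)\}$ and compares it with the o.d.e. solution $x^s(\cdot)$ restarted at a late time $s$, anticipating Lemma 1 below:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000
a = 1.0 / (np.arange(N) + 1.0)               # stepsizes a(n)
t = np.concatenate(([0.0], np.cumsum(a)))    # t(n) = sum_{m<n} a(m)

x = np.empty(N + 1)                          # iterates of (2.1.1) with
x[0] = 5.0                                   # illustrative h(x) = -x
for n in range(N):
    x[n + 1] = x[n] + a[n] * (-x[n] + rng.normal())

def xbar(s):
    # piecewise linear interpolation of the iterates on the t(n) grid
    return np.interp(s, t, x)

# for h(x) = -x the restarted solution is available in closed form:
# x^s(s + dt) = xbar(s) * exp(-dt), so no o.d.e. solver is needed
s = t[5_000]
for dt in (0.5, 1.0, 2.0):
    gap = abs(xbar(s + dt) - xbar(s) * np.exp(-dt))
    print(f"dt = {dt}: |xbar(s+dt) - x^s(s+dt)| = {gap:.4f}")
# the gaps shrink further as the restart time s increases (Lemma 1)
```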

Lemma 1. For any $T > 0$,

$$\lim_{s \to \infty} \sup_{t \in [s, s+T]} \|\bar{x}(t) - x^s(t)\| = 0, \quad \text{a.s.},$$

$$\lim_{s \to \infty} \sup_{t \in [s-T, s]} \|\bar{x}(t) - x_s(t)\| = 0, \quad \text{a.s.}$$

Proof. We shall only prove the first claim, as the arguments for proving the second claim are completely analogous. Let $t(n+m)$ be in $[t(n), t(n)+T]$. Let $[t] \stackrel{\text{def}}{=} \max\{t(k) : t(k) \leq t\}$. Then by construction,

$$\bar{x}(t(n+m)) = \bar{x}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(\bar{x}(t(n+k))) + \delta_{n, n+m}, \qquad (2.1.6)$$

where $\delta_{n,n+m} \stackrel{\text{def}}{=} \zeta_{n+m} - \zeta_n$. Compare this with

$$\begin{aligned} x^{t(n)}(t(n+m)) &= \bar{x}(t(n)) + \int_{t(n)}^{t(n+m)} h(x^{t(n)}(t))\,dt \\ &= \bar{x}(t(n)) + \sum_{k=0}^{m-1} a(n+k)\, h(x^{t(n)}(t(n+k))) + \int_{t(n)}^{t(n+m)} \big(h(x^{t(n)}(y)) - h(x^{t(n)}([y]))\big)\,dy. \end{aligned} \qquad (2.1.7)$$

We shall now bound the integral on the right-hand side. Let $C_0 \stackrel{\text{def}}{=} \sup_n \|x_n\| < \infty$ a.s., let $L > 0$ denote the Lipschitz constant of $h$ as before, and let $s \leq t \leq s+T$. Note that $\|h(x) - h(0)\| \leq L\|x\|$, and so $\|h(x)\| \leq \|h(0)\| + L\|x\|$. Since

$$x^s(t) = \bar{x}(s) + \int_s^t h(x^s(\tau))\,d\tau,$$

we have

$$\|x^s(t)\| \leq \|\bar{x}(s)\| + \int_s^t [\|h(0)\| + L \|x^s(\tau)\|]\,d\tau \leq (C_0 + \|h(0)\| T) + L \int_s^t \|x^s(\tau)\|\,d\tau.$$

By Gronwall's inequality (see Appendix B), it follows that

$$\|x^s(t)\| \leq (C_0 + \|h(0)\| T) e^{LT}, \quad s \leq t \leq s+T.$$

Thus, for all $s \leq t \leq s+T$,

$$\|h(x^s(t))\| \leq C_T \stackrel{\text{def}}{=} \|h(0)\| + L (C_0 + \|h(0)\| T) e^{LT} < \infty, \quad \text{a.s.}$$

Now, if $0 \leq k \leq (m-1)$ and $t \in (t(n+k), t(n+k+1)]$,

$$\|x^{t(n)}(t) - x^{t(n)}(t(n+k))\| \leq \Big\|\int_{t(n+k)}^t h(x^{t(n)}(s))\,ds\Big\| \leq C_T (t - t(n+k)) \leq C_T\, a(n+k).$$

Thus,

$$\begin{aligned} \Big\|\int_{t(n)}^{t(n+m)} \big(h(x^{t(n)}(t)) - h(x^{t(n)}([t]))\big)\,dt\Big\| &\leq \int_{t(n)}^{t(n+m)} L \|x^{t(n)}(t) - x^{t(n)}([t])\|\,dt \\ &= L \sum_{k=0}^{m-1} \int_{t(n+k)}^{t(n+k+1)} \|x^{t(n)}(t) - x^{t(n)}(t(n+k))\|\,dt \\ &\leq C_T L \sum_{k=0}^{m-1} a(n+k)^2 \\ &\leq C_T L \sum_{k=0}^{\infty} a(n+k)^2 \xrightarrow{n \uparrow \infty} 0, \quad \text{a.s.} \end{aligned} \qquad (2.1.8)$$

Also, since the martingale $(\zeta_n, \mathcal{F}_n)$ converges a.s., we have

$$\sup_{k \geq 0} \|\delta_{n, n+k}\| \xrightarrow{n \uparrow \infty} 0, \quad \text{a.s.} \qquad (2.1.9)$$


Subtracting (2.1.7) from (2.1.6) and taking norms, we have

$$\|\bar{x}(t(n+m)) - x^{t(n)}(t(n+m))\| \leq L \sum_{i=0}^{m-1} a(n+i) \|\bar{x}(t(n+i)) - x^{t(n)}(t(n+i))\| + C_T L \sum_{k \geq 0} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n, n+k}\|, \quad \text{a.s.}$$

Define $K_{T,n} \stackrel{\text{def}}{=} C_T L \sum_{k \geq 0} a(n+k)^2 + \sup_{k \geq 0} \|\delta_{n,n+k}\|$. Note that $K_{T,n} \to 0$ a.s. as $n \to \infty$. Also, let $z_i = \|\bar{x}(t(n+i)) - x^{t(n)}(t(n+i))\|$ and $b_i \stackrel{\text{def}}{=} a(n+i)$. Thus, the above inequality becomes

$$z_m \leq K_{T,n} + L \sum_{i=0}^{m-1} b_i z_i.$$

Note that $z_0 = 0$ and $\sum_{i=0}^{m-1} b_i \leq T$. The discrete Gronwall lemma (see Appendix B) tells us that

$$\sup_{0 \leq i \leq m} z_i \leq K_{T,n}\, e^{LT}.$$

One then has that for $t(n+m) \leq t(n) + T$,

$$\|\bar{x}(t(n+m)) - x^{t(n)}(t(n+m))\| \leq K_{T,n}\, e^{LT}, \quad \text{a.s.}$$

If $t(n+k) \leq t \leq t(n+k+1)$, we have that

$$\bar{x}(t) = \lambda \bar{x}(t(n+k)) + (1 - \lambda) \bar{x}(t(n+k+1))$$

for some $\lambda \in [0,1]$. Thus,

$$\begin{aligned} \|x^{t(n)}(t) - \bar{x}(t)\| &= \|\lambda (x^{t(n)}(t) - \bar{x}(t(n+k))) + (1-\lambda)(x^{t(n)}(t) - \bar{x}(t(n+k+1)))\| \\ &\leq \lambda \Big\|x^{t(n)}(t(n+k)) - \bar{x}(t(n+k)) + \int_{t(n+k)}^t h(x^{t(n)}(s))\,ds\Big\| \\ &\quad + (1-\lambda) \Big\|x^{t(n)}(t(n+k+1)) - \bar{x}(t(n+k+1)) - \int_t^{t(n+k+1)} h(x^{t(n)}(s))\,ds\Big\| \\ &\leq \lambda \|x^{t(n)}(t(n+k)) - \bar{x}(t(n+k))\| + (1-\lambda) \|x^{t(n)}(t(n+k+1)) - \bar{x}(t(n+k+1))\| \\ &\quad + \max(\lambda, 1-\lambda) \int_{t(n+k)}^{t(n+k+1)} \|h(x^{t(n)}(s))\|\,ds. \end{aligned}$$

Since $\|h(x^s(t))\| \leq C_T$ for all $s \leq t \leq s+T$, it follows that

$$\sup_{t \in [t(n), t(n)+T]} \|\bar{x}(t) - x^{t(n)}(t)\| \leq K_{T,n}\, e^{LT} + C_T \sup_{k \geq 0} a(n+k), \quad \text{a.s.}$$


The claim now follows for the special case of $s \to \infty$ along $\{t(n)\}$. The general claim follows easily from this special case. □

Recall that a closed set $A \subset \mathbb{R}^d$ is said to be an invariant set (resp. a positively / negatively invariant set) for the o.d.e. (2.1.5) if any trajectory $x(t)$, $-\infty < t < \infty$ (resp. $0 \leq t < \infty$ / $-\infty < t \leq 0$) of (2.1.5) with $x(0) \in A$ satisfies $x(t) \in A$ $\forall t \in \mathbb{R}$ (resp. $\forall t \geq 0$ / $\forall t \leq 0$). It is said to be internally chain transitive in addition if for any $x, y \in A$ and any $\epsilon > 0$, $T > 0$, there exist $n \geq 1$ and points $x_0 = x, x_1, \ldots, x_{n-1}, x_n = y$ in $A$ such that the trajectory of (2.1.5) initiated at $x_i$ meets with the $\epsilon$-neighbourhood of $x_{i+1}$ for $0 \leq i < n$ after a time $\geq T$. (If we restrict to $y = x$ in the above, the set is said to be internally chain recurrent.) Let $\Phi_t : \mathbb{R}^d \to \mathbb{R}^d$ denote the map that takes $x(0)$ to $x(t)$ via (2.1.5). Under our conditions on $h$, this map will be continuous (in fact Lipschitz) for each $t > 0$ (see Appendix B). From the uniqueness of solutions to (2.1.5) in both forward and backward time, it follows that $\Phi_t$ is invertible. In fact it turns out to be a homeomorphism, i.e., a continuous bijection with a continuous inverse (see Appendix B). Thus we can define $\Phi_{-t}(x) = \Phi_t^{-1}(x)$, the point at which the trajectory starting at time $0$ at $x$ and running backward in time for a duration $t$ would end up. Along with $\Phi_0 \equiv$ the identity map on $\mathbb{R}^d$, $\{\Phi_t, t \in \mathbb{R}\}$ defines a group of homeomorphisms on $\mathbb{R}^d$, which is referred to as the flow associated with (2.1.5). Thus the definition of an invariant set can be recast as follows: $A$ is invariant if

$$\Phi_t(A) = A \quad \forall t \in \mathbb{R}.$$

A corresponding statement applies to positively or negatively invariant sets with $t \geq 0$, resp. $t \leq 0$. Our general convergence theorem for stochastic approximation, due to Benaim (1996), is the following.

Theorem 2. Almost surely, the sequence $\{x_n\}$ generated by (2.1.1) converges to a (possibly sample path dependent) compact connected internally chain transitive invariant set of (2.1.5).

Proof. Consider a sample point where (2.1.4) and the conclusions of Lemma 1 hold. Let $A$ denote the set $\bigcap_{t \geq 0} \overline{\{\bar{x}(s) : s \geq t\}}$. Since $\bar{x}(\cdot)$ is continuous and bounded, $\overline{\{\bar{x}(s) : s \geq t\}}$, $t \geq 0$, is a nested family of nonempty compact and connected sets. $A$, being the intersection thereof, will also be nonempty, compact and connected. Then $\bar{x}(t) \to A$ and therefore $x_n \to A$. In fact, for any $\epsilon > 0$, let $A^\epsilon \stackrel{\text{def}}{=} \{x : \min_{y \in A} \|x - y\| < \epsilon\}$. Then $(A^\epsilon)^c \cap (\bigcap_{t \geq 0} \overline{\{\bar{x}(s) : s \geq t\}}) = \emptyset$. Hence by the finite intersection property of families of compact sets, $(A^\epsilon)^c \cap \overline{\{\bar{x}(s) : s \geq t'\}} = \emptyset$ for some $t' > 0$. That is, $\bar{x}(t' + \cdot) \in A^\epsilon$. Conversely, if $x \in A$, there exist $s_n \uparrow \infty$ in $[0, \infty)$ such that $\bar{x}(s_n) \to x$. This is immediate from the definition of $A$. In fact, we have

$$\max_{s \in [t(n), t(n+1)]} \|\bar{x}(s) - \bar{x}(t(n))\| = O(a(n)) \to 0$$

as $n \to \infty$. Thus we may take $s_n = t(m(n))$ for suitable $m(n)$ without any loss of generality. Let $\tilde{x}(\cdot)$ denote the trajectory of (2.1.5) with $\tilde{x}(0) = x$. Then by the first part of Lemma 1 and the continuity of the map $\Phi_t$ defined above, it follows that $x^{s_n}(s_n + t) = \Phi_t(\bar{x}(s_n)) \to \Phi_t(x) = \tilde{x}(t)$ for all $t > 0$. By Lemma 1, $\bar{x}(s_n + t) \to \tilde{x}(t)$, implying that $\tilde{x}(t) \in A$ as well. A similar argument works for $t < 0$, using the second part of Lemma 1. Thus $A$ is invariant under (2.1.5).

Let $x_1, x_2 \in A$ and fix $\epsilon > 0$, $T > 0$. Pick $\epsilon/4 > \delta > 0$ such that: if $\|z - y\| < \delta$ and $x^z(\cdot), x^y(\cdot)$ are solutions to (2.1.5) with initial conditions $z, y$ resp., then $\max_{t \in [0, 2T]} \|x^z(t) - x^y(t)\| < \epsilon/4$. Also pick $n_0 > 1$ such that $s \geq t(n_0)$ implies that $\bar{x}(s + \cdot) \in A^\delta$ and $\sup_{t \in [s, s+2T]} \|\bar{x}(t) - x^s(t)\| < \delta$. Pick $n_2 > n_1 \geq n_0$ such that $\|\bar{x}(t(n_i)) - x_i\| < \delta$, $i = 1, 2$. Let $kT \leq t(n_2) - t(n_1) < (k+1)T$ for some integer $k \geq 0$ and let $s(0) = t(n_1)$, $s(i) = s(0) + iT$ for $1 \leq i < k$, and $s(k) = t(n_2)$. Then for $0 \leq i < k$, $\sup_{t \in [s(i), s(i+1)]} \|\bar{x}(t) - x^{s(i)}(t)\| < \delta$. Pick $\tilde{x}_i$, $0 \leq i \leq k$, in $A$ such that $\tilde{x}_0 = x_1$, $\tilde{x}_k = x_2$, and for $0 < i < k$, $\tilde{x}_i$ are in the $\delta$-neighbourhood of $\bar{x}(s(i))$. The sequence $(s(i), \tilde{x}_i)$, $0 \leq i \leq k$, satisfies the definition of internal chain transitivity: if $x^*_i(\cdot)$ denotes the trajectory of (2.1.5) initiated at $\tilde{x}_i$ for each $i$, we have

$$\begin{aligned} \|x^*_i(s(i+1) - s(i)) - \tilde{x}_{i+1}\| &\leq \|x^*_i(s(i+1) - s(i)) - x^{s(i)}(s(i+1))\| \\ &\quad + \|x^{s(i)}(s(i+1)) - \bar{x}(s(i+1))\| + \|\bar{x}(s(i+1)) - \tilde{x}_{i+1}\| \\ &\leq \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{4} < \epsilon. \end{aligned}$$

This completes the proof. □

2.2 Extensions and variations

Some important extensions of the foregoing are immediate:

• When the set $\{\sup_n \|x_n\| < \infty\}$ has a positive probability not necessarily equal to one, we still have

$$\sum_n a(n)^2 E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] < \infty$$

a.s. on this set. The martingale convergence theorem from Appendix C cited in the proof of Lemma 1 above then tells us that $\zeta_n$ converges a.s. on this set. Thus by the same arguments as before (which are pathwise), Theorem 2 continues to hold 'a.s. on the set $\{\sup_n \|x_n\| < \infty\}$'.


• While we took $\{a(n)\}$ to be deterministic in section 2.1, the arguments would also go through if $\{a(n)\}$ are random and bounded, satisfy (A2) with probability one, and (A3) holds, with $\mathcal{F}_n$ redefined as

$$\mathcal{F}_n = \sigma(x_m, M_m, a(m), m \leq n)$$

for $n \geq 0$. In fact, the boundedness condition for random $\{a(n)\}$ could be relaxed by imposing appropriate moment conditions. We shall not get into the details of this at any point, but it is worth keeping in mind throughout as there are applications (e.g., in system identification) where $\{a(n)\}$ are random.

• The arguments above go through even if we replace (2.1.1) by

$$x_{n+1} = x_n + a(n)[h(x_n) + M_{n+1} + \epsilon(n)], \quad n \geq 0,$$

where $\{\epsilon(n)\}$ is a deterministic or random bounded sequence which is $o(1)$. This is because $\epsilon(n)$ then contributes an additional error term in the proof of Lemma 1 which is also asymptotically negligible and therefore does not affect the conclusions. This important observation will be recalled often in what follows.

These observations apply throughout the book wherever the arguments are pathwise, i.e., except in Chapter 9. The next corollary is often useful in narrowing down the potential candidates for $A$.

Suppose there exists a continuously differentiable $V : \mathbb{R}^d \to [0, \infty)$ such that $\lim_{\|x\| \to \infty} V(x) = \infty$, $H \stackrel{\text{def}}{=} \{x \in \mathbb{R}^d : V(x) = 0\} \neq \emptyset$, and $\langle h(x), \nabla V(x)\rangle \leq 0$ with equality if and only if $x \in H$. (Thus $V$ is a 'Liapunov function'.)

Corollary 3. Almost surely, $\{x_n\}$ converges to an internally chain transitive invariant set contained in $H$.

Proof. The argument is sample pathwise, for a sample path in the probability one set where assumption (A4) and Lemma 1 hold. Fix one such sample path and let $C' = \sup_n \|x_n\|$ and $C = \sup_{\|x\| \leq C'} V(x)$. For any $0 < a \leq C$, let $H^a \stackrel{\text{def}}{=} \{x \in \mathbb{R}^d : V(x) < a\}$, and let $\overline{H^a}$ denote the closure of $H^a$. Fix an $\eta$ such that $0 < \eta < C/2$. Let

$$\Delta \stackrel{\text{def}}{=} \min_{x \in \overline{H^C} \setminus H^\eta} |\langle h(x), \nabla V(x)\rangle| > 0.$$

Let $T$ be an upper bound for the time required for a solution of $\dot{x} = h(x)$ to reach $H^\eta$, starting from a point in $\overline{H^C}$. We may choose $T > C/\Delta$. Let $\delta > 0$ be such that for $x \in \overline{H^C}$ and $\|x - y\| < \delta$, we have $|V(x) - V(y)| < \eta$. Such a choice of $\delta$ is possible by the uniform continuity of $V$ on compact sets. By Lemma 1, there is a $t_0$ such that for all $t \geq t_0$, $\sup_{s \in [t, t+T]} \|\bar{x}(s) - x^t(s)\| < \delta$. Note that $\bar{x}(\cdot) \in \overline{H^C}$, and so for all $t \geq t_0$, $|V(\bar{x}(t+T)) - V(x^t(t+T))| < \eta$. But $x^t(t+T) \in H^\eta$ and therefore $\bar{x}(t+T) \in H^{2\eta}$. Thus for all $t \geq t_0 + T$, $\bar{x}(t) \in H^{2\eta}$. Since $\eta$ can be taken to be arbitrarily small, it follows that $\bar{x}(t) \to H$ as $t \to \infty$. □

Alternatively, we can invoke the 'LaSalle invariance principle' (see Appendix B) in conjunction with Theorem 2. The following corollary is immediate:

Corollary 4. If the only internally chain transitive invariant sets for (2.1.5) are isolated equilibrium points, then $\{x_n\}$ a.s. converges to a possibly sample path dependent equilibrium point.

More generally, a similar statement could be made for isolated internally chain transitive invariant sets, i.e., internally chain transitive invariant sets each of which is at a strictly positive distance from the rest. We shall refine Corollary 4 in Chapter 4. The next corollary is a variation of the so-called 'Kushner–Clark lemma' (Kushner and Clark, 1978) and also follows from the above discussion. Recall that 'i.o.' in probability theory stands for 'infinitely often'.

Corollary 5. Let $G$ be an open set containing a bounded internally chain transitive invariant set $D$ for (2.1.5), and suppose that $G$ does not intersect any other bounded internally chain transitive invariant set (except possibly subsets of $D$). Then under (A4), $x_n \to D$ a.s. on the set $\{x_n \in G \text{ i.o.}\} \stackrel{\text{def}}{=} \bigcap_n \bigcup_{m \geq n} \{x_m \in G\}$.

Proof. We know that a.s., $\{x_n\}$ converges to a compact, connected internally chain transitive invariant set. Let this set be $D'$. If $D'$ does not intersect $G$, then by compactness of $D'$, there is an $\epsilon$-neighbourhood $N_\epsilon(D')$ of $D'$ which does not intersect $G$. But since $x_n \to D'$, $x_n \in N_\epsilon(D')$ for $n$ large. This, however, leads to a contradiction if $x_n \in G$ i.o. Thus, if $x_n \in G$ i.o., $D'$ has to intersect $G$. It follows that $D'$ equals $D$ or a subset thereof, and so $x_n \to D$ a.s. on the set $\{x_n \in G \text{ i.o.}\}$. □

In the more general set-up of Theorem 2, the next theorem is sometimes useful. (The statement and proof require some familiarity with weak (Prohorov) convergence of probability measures. See Appendix C for a brief account.)

Let $\mathcal{P}(\mathbb{R}^d)$ denote the space of probability measures on $\mathbb{R}^d$ with the Prohorov topology (also known as the topology of weak convergence; see, e.g., Borkar, 1995, Chapter 2). Let $C_0(\mathbb{R}^d)$ denote the space of continuous functions on $\mathbb{R}^d$ that vanish at infinity. Then the space $\mathcal{M}$ of complex Borel measures on $\mathbb{R}^d$ is isomorphic to the dual space $C_0^*(\mathbb{R}^d)$. The isomorphism is given by $\mu \mapsto \int (\cdot)\,d\mu$. (See, e.g., Rudin, 1986, Chapter 6.) It is easy to show that $\mathcal{P}(\mathbb{R}^d)$ consists of the real measures $\mu$ which correspond to those elements $\tilde{\mu}$ of $C_0^*(\mathbb{R}^d)$ that are nonnegative on nonnegative functions in $C_0(\mathbb{R}^d)$ (i.e., $f \geq 0$ for $f \in C_0(\mathbb{R}^d)$ implies that $\tilde{\mu}(f) \geq 0$), and for a constant function $f(\cdot) \equiv C$, $\tilde{\mu}(f) = C$.

Define (random) measures $\nu(t)$, $t > 0$, on $\mathbb{R}^d$ by

$$\int f\,d\nu(t) = \frac{1}{t} \int_0^t f(\bar{x}(s))\,ds$$

for $f \in C_0(\mathbb{R}^d)$. These are called empirical measures. Since this integral is nonnegative for nonnegative $f$ and furthermore, $\nu(t)(\mathbb{R}^d) = \frac{1}{t}\int_0^t 1\,ds = 1$, $\nu(t)$ is a probability measure on $\mathbb{R}^d$. By (A4), the $\nu(t)$ are supported in a common compact subset of $\mathbb{R}^d$ independent of $t$. By Prohorov's theorem (see Appendix C), they form a relatively compact subset of $\mathcal{P}(\mathbb{R}^d)$.
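As an illustration of these empirical measures, a minimal sketch (again with the illustrative $h(x) = -x$ and the choice $f(x) = e^{-x^2} \in C_0(\mathbb{R})$) approximates $\int f\,d\nu(t(m))$ by a Riemann sum on the interpolation grid:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50_000
a = 1.0 / (np.arange(N) + 1.0)
t = np.concatenate(([0.0], np.cumsum(a)))
x = np.empty(N + 1)
x[0] = 5.0
for n in range(N):                      # illustrative h(x) = -x again
    x[n + 1] = x[n] + a[n] * (-x[n] + rng.normal())

f = lambda z: np.exp(-z**2)             # a function in C_0(R)
for m in (N // 10, N // 2, N):
    # int f dnu(t(m)) = (1/t(m)) int_0^{t(m)} f(xbar(s)) ds, approximated
    # by the Riemann sum sum_{n<m} f(x_n) a(n) on the grid
    val = np.sum(f(x[:m]) * a[:m]) / t[m]
    print(f"t = {t[m]:6.2f}: int f dnu(t) = {val:.3f}")
# the averages approach f(0) = 1: nu(t) concentrates on the equilibrium
# {0}, an invariant set of the o.d.e., consistent with Theorem 6 below
```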

Theorem 6. Almost surely, every limit point $\nu^*$ of $\{\nu(t)\}$ in $\mathcal{P}(\mathbb{R}^d)$ as $t \to \infty$ is invariant under (2.1.5).

Proof. Fix some $s > 0$. Consider a sample path for which Lemma 1 applies, i.e., for which $\|\bar{x}(y+s) - x^y(y+s)\| \to 0$ as $y \to \infty$. Let $f \in C_0(\mathbb{R}^d)$. Note that

$$\Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_s^{t+s} f(\bar{x}(y))\,dy\Big| \to 0 \quad \text{as } t \to \infty.$$

Note also that the quantity on the left above is the same as

$$\Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_0^t f(\bar{x}(y+s))\,dy\Big|.$$

Let $\epsilon > 0$. By uniform continuity of $f$, there is a $T$ such that for $y \geq T$, $|f(\bar{x}(y+s)) - f(x^y(y+s))| < \epsilon$. Now, if $t \geq T$,

$$\begin{aligned} \Big|\frac{1}{t}\int_0^t f(\bar{x}(y+s))\,dy - \frac{1}{t}\int_0^t f(x^y(y+s))\,dy\Big| &\leq \frac{1}{t}\int_0^T |f(\bar{x}(y+s)) - f(x^y(y+s))|\,dy \\ &\quad + \frac{1}{t}\int_T^t |f(\bar{x}(y+s)) - f(x^y(y+s))|\,dy \\ &\leq \frac{T}{t}\,2B + \frac{(t-T)}{t}\,\epsilon \leq 2\epsilon \end{aligned}$$

for $t$ large enough. Here $B$ is a bound on the magnitude of $f \in C_0(\mathbb{R}^d)$. Thus

$$\Big|\frac{1}{t}\int_0^t f(\bar{x}(y+s))\,dy - \frac{1}{t}\int_0^t f(x^y(y+s))\,dy\Big| \to 0 \quad \text{as } t \to \infty.$$

But since

$$\Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_0^t f(\bar{x}(y+s))\,dy\Big| \to 0$$


as $t \to \infty$, it follows that

$$\Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_0^t f(x^y(y+s))\,dy\Big| \to 0$$

as $t \to \infty$. But this implies that

$$\begin{aligned} \Big|\int f\,d\nu(t) - \int f \circ \Phi_s\,d\nu(t)\Big| &= \Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_0^t f \circ \Phi_s(\bar{x}(y))\,dy\Big| \\ &= \Big|\frac{1}{t}\int_0^t f(\bar{x}(y))\,dy - \frac{1}{t}\int_0^t f(x^y(y+s))\,dy\Big| \\ &\to 0 \quad \text{as } t \to \infty. \end{aligned}$$

If $\nu^*$ is a limit point of $\{\nu(t)\}$, there is a sequence $t_n \uparrow \infty$ such that $\nu(t_n) \to \nu^*$ weakly. Thus $\int f\,d\nu(t_n) \to \int f\,d\nu^*$ and $\int f \circ \Phi_s\,d\nu(t_n) \to \int f \circ \Phi_s\,d\nu^*$. But $|\int f\,d\nu(t_n) - \int f \circ \Phi_s\,d\nu(t_n)| \to 0$ as $n \to \infty$. This tells us that $\int f\,d\nu^* = \int f \circ \Phi_s\,d\nu^*$. This holds for all $f \in C_0(\mathbb{R}^d)$. Hence $\nu^*$ is invariant under $\Phi_s$. As $s > 0$ was arbitrary, the claim follows. □

See Benaim and Schreiber (2001) for further results in this vein.

We conclude this section with some comments regarding stepsize selection. Our view of $a(n)$ as discrete time steps in the o.d.e. approximation already gives some intuition about their role. Thus large stepsizes will mean faster simulation of the o.d.e., but also larger errors due to discretization and noise (the latter is so because the stepsize $a(n)$ also multiplies the 'noise' $M_{n+1}$ in the algorithm). Reducing the stepsizes would mean lower discretization errors and noise-induced errors and therefore a more graceful behaviour of the algorithm, but at the expense of a slower speed of convergence. This is because one is taking a larger number of iterations to simulate any given time interval in 'o.d.e. time'. In the parlance of artificial intelligence, larger stepsizes aid better exploration of the solution space, while smaller stepsizes aid better exploitation of the local information available. The trade-off between them is a well-known rule of thumb in AI. Starting with a relatively large $a(n)$ and decreasing it slowly tries to strike a balance between the two. See Goldstein (1988) for some results on stepsize selection.
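The following minimal sketch illustrates this trade-off by running the same noisy iteration (illustrative $h(x) = -x$, as in earlier sketches) under two schedules satisfying (A2), one decaying faster than the other:

```python
import numpy as np

rng = np.random.default_rng(4)

def run(schedule, n_steps=50_000):
    # illustrative scheme: h(x) = -x, Gaussian noise, x_0 = 5
    x = 5.0
    for n in range(n_steps):
        x += schedule(n) * (-x + rng.normal())
    return x

# both schedules satisfy (A2); the second decays more slowly, so it
# covers 'o.d.e. time' faster (better exploration) at the cost of
# noisier iterates near the equilibrium (worse exploitation)
for name, sched in [("1/(n+1)", lambda n: 1.0 / (n + 1)),
                    ("(n+1)^-0.6", lambda n: (n + 1) ** -0.6)]:
    print(f"a(n) = {name}: final x = {run(sched):+.4f} (equilibrium is 0)")
```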


3 Stability Criteria

3.1 Introduction

In this chapter we discuss a couple of schemes for establishing the a.s. boundedness of iterates assumed above. The convergence analysis of the preceding chapter has some universal applicability, but the situation is different for stability criteria. There are several variations of stability criteria applicable under specific restrictions and sometimes motivated by specific applications for which they are tailor-made. (The second test we see below is one such.) We describe only a couple of these variations. The first one is quite broadly applicable. The second is a bit more specialized, but has been included because it has a distinct flavour and shows how one may tweak known techniques such as stochastic Liapunov functions to obtain new and useful criteria.

3.2 Stability criterion

The first scheme is adapted from Borkar and Meyn (2000). The idea of this test is as follows: we consider the piecewise linear interpolated trajectory $\bar{x}(\cdot)$ at times $T_n \uparrow \infty$, which are spaced approximately $T > 0$ apart and divide the time axis into concatenated time segments of length approximately $T$. If at any $T_n$ the iterate has gone out of the unit ball in $\mathbb{R}^d$, we rescale it over the segment $[T_n, T_{n+1})$ by dividing it by the norm of its value at $T_n$. If the original trajectory drifts towards infinity, then there is a corresponding sequence of rescaled segments as above that asymptotically track a limiting o.d.e. obtained as a scaling limit of our 'basic o.d.e.'

$$\dot{x}(t) = h(x(t)). \qquad (3.2.1)$$


If this scaling limit is globally asymptotically stable to the origin, these segments, and therefore the original iterations which differ from them only by a scaling factor, should start drifting towards the origin, implying stability.

Formally, assume the following:

(A5) The functions $h_c(x) \stackrel{\text{def}}{=} h(cx)/c$, $c \geq 1$, $x \in \mathbb{R}^d$, satisfy $h_c(x) \to h_\infty(x)$ as $c \to \infty$, uniformly on compacts for some $h_\infty \in C(\mathbb{R}^d)$. Furthermore, the o.d.e.

$$\dot{x}(t) = h_\infty(x(t)) \qquad (3.2.2)$$

has the origin as its unique globally asymptotically stable equilibrium.

The o.d.e. (3.2.2) is the aforementioned 'scaling limit'. It is worth noting here that:

(i) $h_c$, $h_\infty$ will be Lipschitz with the same Lipschitz constant as $h$, implying in particular the well-posedness of (3.2.2) above and also of the o.d.e.

$$\dot{x}(t) = h_c(x(t)). \qquad (3.2.3)$$

In particular, they are equicontinuous. Thus pointwise convergence of $h_c$ to $h_\infty$ as $c \to \infty$ will automatically imply uniform convergence on compacts.

(ii) $h_\infty$ satisfies $h_\infty(ax) = a h_\infty(x)$ for $a > 0$, and hence if (3.2.2) has an isolated equilibrium, it must be at the origin.

(iii) $\|h_c(x) - h_c(0)\| \leq L\|x\|$, and so $\|h_c(x)\| \leq \|h_c(0)\| + L\|x\| \leq \|h(0)\| + L\|x\| \leq K_0(1 + \|x\|)$ (for a suitable constant $K_0$).

Let $\phi_\infty(t, x)$ denote the solution of the o.d.e. $\dot{x} = h_\infty(x)$ with initial condition $x$.
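As an illustration of (A5), consider the illustrative choice $h(x) = -x + \sin x + 2$ (an assumption for this example): then $h_c(x) = -x + (\sin(cx) + 2)/c \to h_\infty(x) = -x$ uniformly on compacts, and (3.2.2) is globally asymptotically stable to the origin. A minimal numerical check:

```python
import numpy as np

def h(x):
    # illustrative h: globally Lipschitz, with scaling limit h_inf(x) = -x
    return -x + np.sin(x) + 2.0

def h_c(x, c):
    return h(c * x) / c   # the rescaled functions of (A5)

xs = np.linspace(-2, 2, 5)
for c in (1, 10, 100, 1000):
    err = np.max(np.abs(h_c(xs, c) - (-xs)))  # distance to h_inf on [-2, 2]
    print(f"c = {c:5d}: max |h_c - h_inf| = {err:.4f}")
# the error decays like O(1/c), confirming uniform convergence on compacts
```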

Lemma 1. There exists a $T > 0$ such that for all initial conditions $x$ on the unit sphere, $\|\phi_\infty(t, x)\| < \frac{1}{8}$ for all $t > T$.

Proof. Since asymptotic stability implies Liapunov stability (see Appendix B), there is a $\delta > 0$ such that any trajectory starting within distance $\delta$ of the origin stays within distance $\frac{1}{8}$ thereof. For an initial condition $x$ on the unit sphere, let $T_x$ be a time at which the solution is within distance $\delta/2$ of the origin. Let $y$ be any other initial condition on the unit sphere. Note that

$$\phi_\infty(t, x) = x + \int_0^t h_\infty(\phi_\infty(s, x))\,ds, \quad \text{and} \quad \phi_\infty(t, y) = y + \int_0^t h_\infty(\phi_\infty(s, y))\,ds.$$


Subtracting the above equations and using the Lipschitz property, we get

$$\|\phi_\infty(t, x) - \phi_\infty(t, y)\| \leq \|x - y\| + L \int_0^t \|\phi_\infty(s, x) - \phi_\infty(s, y)\|\,ds.$$

Then by Gronwall's inequality we find that for $t \leq T_x$,

$$\|\phi_\infty(t, x) - \phi_\infty(t, y)\| \leq \|x - y\|\, e^{L T_x}.$$

So there is a neighbourhood Ux of x such that for all y ∈ Ux, φ∞(Tx, y)is within distance δ of the origin. By Liapunov stability, this implies thatφ∞(t, y) remains within distance 1

8 of the origin for all t ≥ Tx. Since the unitsphere is compact, it can be covered by a finite number of such neighbourhoodsUx1 , . . . , Uxn

with corresponding times Tx1 , . . . , Txn. Then the statement of the

lemma holds if T is the maximum of Tx1 , . . . , Txn. ¥

The following lemma shows that the solutions of the o.d.e.s x = hc(x) andx = h∞(x) are close for c large enough.

Lemma 2. Let K ⊂ Rd be compact, and let [0, T ] be a given time interval.Then for t ∈ [0, T ] and x0 ∈ K,

‖φc(t, x)− φ∞(t, x0)‖ ≤ [‖x− x0‖+ ε(c)T ]eLT ,

where ε(c) is independent of x0 ∈ K and ε(c) → 0 as c →∞. In particular, ifx = x0, then

‖φc(t, x0)− φ∞(t, x0)‖ ≤ ε(c)TeLT . (3.2.4)

Proof. Note that

φc(t, x) = x +∫ t

0

hc(φc(s, x))ds, and

φ∞(t, x0) = x0 +∫ t

0

h∞(φ∞(s, x0))ds.

This gives

‖φc(t, x)− φ∞(t, x0)‖ ≤ ‖x− x0‖+∫ t

0

‖hc(φc(s, x))− h∞(φ∞(s, x0))‖ds.

Now, using the facts that φ∞([0, T ],K) is compact, hc → h∞ uniformly oncompact sets, and hc has the Lipschitz property, we get

‖hc(φc(s, x))− h∞(φ∞(s, x0))‖≤ ‖hc(φc(s, x))− hc(φ∞(s, x0))‖

+ ‖hc(φ∞(s, x0))− h∞(φ∞(s, x0))‖≤ L‖φc(s, x)− φ∞(s, x0)‖+ ε(c),

Page 33: Stochastic Approximation: A Dynamical Systems Viewpoint

24 Stability Criteria

where ε(c) is independent of x0 ∈ K and ε(c) → 0 as c →∞. Thus for t ≤ T ,we get

‖φc(t, x)− φ∞(t, x0)‖ ≤ ‖x− x0‖+ ε(c)T + L

∫ t

0

‖φc(s, x)− φ∞(s, x0)‖ds.

The conclusion follows from Gronwall’s inequality. ¥

The previous two lemmas give us:

Corollary 3. There exist c0 > 0 and T > 0 such that for all initial conditionsx on the unit sphere, ‖φc(t, x)‖ < 1

4 for t ∈ [T, T + 1] and c > c0.

Proof. Choose T as in Lemma 1. Now, using equation (3.2.4) with K taken tobe the closed unit ball, conclude that ‖φc(t, x)‖ < 1

4 for t ∈ [T, T + 1] and c

such that ε(c)(T + 1)eL(T+1) < 18 . ¥

Let T0 = 0 and Tn+1 = mint(m) : t(m) ≥ Tn + T for n ≥ 0. Then Tn+1 ∈[Tn +T, Tn +T + a] ∀n, where a = supn a(n), Tn = t(m(n)) for suitable m(n) ↑∞, and Tn ↑ ∞. For notational simplicity, let a = 1 without loss of generality.Define the piecewise continuous trajectory x(t), t ≥ 0, by x(t) = x(t)/r(n) fort ∈ [Tn, Tn+1], where r(n) def= ||x(Tn)|| ∨ 1, n ≥ 0. That is, we obtain x(·) fromx(·) by observing the latter at times Tn that are spaced approximately T

apart. In case the observed value falls outside the unit ball of Rd, it is resetto a value on the unit sphere of Rd by normalization. Not surprisingly, thisprevents any possible blow-up of the trajectory, as reflected in the followinglemma. For later use, we also define x(T−n+1)

def= x(Tn+1)/r(n). This is thesame as x(Tn+1) if there is no jump at Tn+1, and equal to limt↑Tn+1 x(t) ifthere is a jump.

Lemma 4. supt E[||x(t)||2] < ∞.

Proof. It suffices to show that

supm(n)≤k<m(n+1)

E[||x(t(k))||2] < M

for some M > 0 independent of n.Fix n. Then for m(n) ≤ k < m(n + 1),

x(t(k + 1)) = x(t(k)) + a(k)(hr(n)(x(t(k))) + Mk+1),

where Mk+1def= Mk+1/r(n). Since r(n) ≥ 1, it follows from (A3) that Mk+1

satisfies

E[||Mk+1||2|Fk] ≤ K(1 + ||x(t(k))||2). (3.2.5)

Page 34: Stochastic Approximation: A Dynamical Systems Viewpoint

3.2 Stability criterion 25

Thus, E[||Mk+1||2] ≤ K(1 + E[||x(t(k))||2]), which gives us the bound

E[||Mk+1||2]1/2 ≤√

K(1 + E[||x(t(k))||2]1/2).

(Note that for a ≥ 0,√

1 + a2 ≤ 1 + a.) Using this and the bound ||hc(x)|| ≤K0(1 + ||x||) mentioned above, we have

E[||x(t(k + 1))||2] 12 ≤ E[||x(t(k))||2] 1

2 (1 + a(k)K1) + a(k)K2,

for suitable constants K1,K2 > 0. Keeping in mind that

m(n+1)−1∑

k=m(n)

a(k) ≤ T + 1, ||x(t(m(n))|| ≤ 1,

a straightforward recursion leads to

E[||x(t(k + 1))||2] 12 ≤ eK1(T+1)(1 + K2(T + 1)),

where we also use the inequality 1 + x ≤ ex. This is the desired bound. ¥

Lemma 5. The sequence ζndef=

∑n−1k=0 akMk+1, n ≥ 1, is a.s. convergent.

Proof. By the convergence theorem for square-integrable martingales (see Ap-pendix C), it is enough to show that

∑k E[‖a(k)Mk+1‖2|Fk] < ∞ a.s. Thus it

is enough to show that E[∑

k E[‖a(k)Mk+1‖2|Fk]] < ∞. Since, as in the proofof Lemma 4, E[||Mk+1||2|Fk] ≤ K(1 + ||x(t(k))||2), we get

E[∑

k

E[‖a(k)Mk+1‖2|Fk]] =∑

k

E[E[‖a(k)Mk+1‖2|Fk]]

≤∑

k

a(k)2K(1 + E[||x(t(k))||2]).

This is finite, by property (A2) and by Lemma 4. ¥

For n ≥ 0, let xn(t), t ∈ [Tn, Tn+1], denote the trajectory of (3.2.3) withc = r(n) and xn(Tn) = x(Tn).

Lemma 6. limn→∞ supt∈[Tn,Tn+1] ||x(t)− xn(t)|| = 0, a.s.

Proof. For simplicity, we assume L > 1, a(n) < 1 ∀n. Note that for m(n) ≤k < m(n + 1),

x(t(k + 1)) = x(t(k)) + a(k)(hr(n)(x(t(k))) + Mk+1).

Page 35: Stochastic Approximation: A Dynamical Systems Viewpoint

26 Stability Criteria

This yields, for 0 < k ≤ m(n + 1)−m(n),

x(t(m(n) + k)) = x(t(m(n))) +k−1∑

i=0

a(m(n) + i)hr(n)(x(t(m(n) + i)))

+ (ζm(n)+k − ζm(n)).

By Lemma 5, there is a (random) bound B on supi ‖ζi‖. Also, as mentionedat the beginning of this section, we have

‖hr(n)(x(t(m(n) + i)))‖ ≤ ‖h(0)‖+ L‖x(t(m(n) + i))‖.Furthermore,

∑0≤i<m(n+1)−m(n) a(m(n) + i) ≤ (T + 1). Therefore,

‖x(t(m(n) + k))‖

≤ ‖x(t(m(n)))‖+k−1∑

i=0

a(m(n) + i)(‖h(0)‖

+ L‖x(t(m(n) + i))‖) + 2B

≤ L

k−1∑

i=0

a(m(n) + i)‖x(t(m(n) + i))‖+ ‖h(0)‖(T + 1)

+ 2B + 1,

where we use ‖x(t(m(n)))‖ ≤ 1. We can now apply the discrete Gronwallinequality (see Appendix B) to obtain

‖x(t(m(n) + k))‖ ≤ (‖h(0)‖(T + 1) + 2B + 1)eL(T+1) def= K∗ (3.2.6)

for 0 < k ≤ m(n + 1)−m(n). It follows that x remains bounded on [Tn, Tn+1]by some K∗ > 0 and this bound is independent of n. We can now mimic theargument of Lemma 1, Chapter 2, to show that

limn→∞

supt∈[Tn,Tn+1]

‖x(t)− xn(t)‖ = 0, a.s.

¥

This leads to our main result:

Theorem 7. Under (A1)–(A3) and (A5), supn ‖xn‖ < ∞ a.s.

Proof. Fix a sample point where the claims of Lemmas 5 and 6 hold. Wewill first show that supn ‖x(Tn)‖ < ∞. If this does not hold, there will exista sequence Tn1 , Tn2 , . . . such that ‖x(Tnk

)‖ ∞, i.e., rnk ∞. We saw

(Corollary 3) that there exists a scaling factor c0 > 0 and a T > 0 such thatfor all initial conditions x on the unit sphere, ‖φc(t, x)‖ < 1

4 for t ∈ [T, T + 1]

Page 36: Stochastic Approximation: A Dynamical Systems Viewpoint

3.3 Another stability criterion 27

and c > c0 (≥ 1 by assumption). If rn > c0, ‖x(Tn)‖ = ‖xn(Tn)‖ = 1, and‖xn(Tn+1)‖ < 1

4 . But then by Lemma 6, ‖x(T−n+1)‖ < 12 if n is large. Thus,

for rn > c0 and n sufficiently large,

‖x(Tn+1)‖‖x(Tn)‖ =

‖x(T−n+1)‖‖x(Tn)‖ <

12.

We conclude that if ‖x(Tn)‖ > c0, x(Tk), k ≥ n falls back to the ball of radius c0

at an exponential rate. Thus if ‖x(Tn)‖ > c0, ‖x(Tn−1)‖ is either even greaterthan ‖x(Tn)‖ or is inside the ball of radius c0. Then there must be an instanceprior to n when x(·) jumps from inside this ball to outside the ball of radius0.9rn. Thus, corresponding to the sequence rnk

∞, we will have a sequenceof jumps of x(Tn) from inside the ball of radius c0 to points increasingly faraway from the origin. But, by a discrete Gronwall argument analogous to theone used in Lemma 6, it follows that there is a bound on the amount by which‖x(·)‖ can increase over an interval of length T + 1 when it is inside the ball ofradius c0 at the beginning of the interval. This leads to a contradiction. ThusC

def= supn ‖x(Tn)‖ < ∞. This implies that supn ‖xn‖ ≤ CK∗ < ∞ for K∗ asin (3.2.6). ¥

Consider as an illustrative example the scalar case with h(x) = −x + g(x)for some bounded Lipschitz g. Then h∞(x) = −x, indicating that the scalinglimit c →∞ above basically picks the dominant term −x of h that essentiallycontrols the behaviour far away from the origin.

3.3 Another stability criterion

The second stability test we discuss is adapted from Abounady, Bertsekas andBorkar (2002). This applies to the case when stability for one initial conditionimplies stability for all initial conditions. Also, the associated o.d.e. (3.2.1) isassumed to converge to a bounded invariant set for all initial conditions. Theidea then is to consider a related recursion ‘with resets’, i.e., a recursion whichis reset to a bounded set whenever it exits from a larger prescribed bounded setcontaining the previous one. By a suitable choice of these sets (which explicitlydepends on the dynamics), one then argues that there are at most finitely manyresets. Hence the results of the preceding section apply thereafter. But thisimplies stability for some initial condition, hence for all initial conditions.

In this situation, it is more convenient to work with (1.0.4) of Chapter 1rather than (1.0.3) there, i.e., with

xn+1 = xn + a(n)f(xn, ξn+1), n ≥ 0, (3.3.1)

where ξn are i.i.d. random variables taking values in Rm, say (though more

Page 37: Stochastic Approximation: A Dynamical Systems Viewpoint

28 Stability Criteria

general spaces can be admitted), and f : Rd ×Rm → Rd satisfies

||f(x, z)− f(y, z)|| ≤ L||x− y||. (3.3.2)

Thus h(x) def= E[f(x, ξ1)] is Lipschitz with Lipschitz constant L and Mn+1def=

f(xn, ξn+1)− h(xn), n ≥ 0, is a martingale difference sequence satisfying (A3).This reduces (3.3.1) to the familiar form

xn+1 = xn + a(n)[h(xn) + Mn+1]. (3.3.3)

We also assume that

‖f(x, y)‖ ≤ K(1 + ‖x‖) (3.3.4)

for a suitable K > 0. (The assumption simplifies matters, but could be replacedby other conditions that lead to similar conclusions.) Then

‖Mn+1|| ≤ 2K(1 + ‖xn‖), (3.3.5)

which implies (A3) of Chapter 2. The main assumption we shall be making isthe following:

(*) Let x′n, x′′n be two sequences of random variables generated by theiteration (3.3.1) on a common probability space with the same ‘driving noise’ξn, but different initial conditions. Then supn ||x′n − x′′n|| < ∞, a.s.

Typically in applications, supn ||x′n − x′′n|| will be bounded by a function of||x′0−x′′0 ||. We further assume that (3.2.1) has an associated Liapunov functionV : Rd →R that is continuously differentiable and satisfies:

(i) lim||x||→∞ V (x) = ∞, and

(ii) for Vdef= 〈∇V (x), h(x)〉, V ≤ 0, and in addition, V < 0 outside a

bounded set B ⊂ Rd.

It is worth noting here that we shall need only the existence and not the explicitform of V . The existence in turn is often ensured by smooth versions of theconverse Liapunov theorem as in Wilson (1969).

Consider an initial condition x0 (assumed to be deterministic for simplicity),and let G be a closed ball that contains in turn both B and x0 in its interior.For a ∈ R, let Ca

def= x ∈ Rd : V (x) ≤ a. V is bounded on G, so there existb and c, with b < c such that G ⊂ Cb ⊂ Cc. Choose δ < (c− b)/2.

We define a modified iteration x∗n as follows: Let x∗0 = x0 and x∗n+1 begenerated by (3.3.1) except when x∗n /∈ Cc. In the latter case we first replacex∗n by its projection to the boundary ∂G of G and then continue. This isthe ‘reset’ operation. Let τn, n ≥ 1, denote the successive reset times, with+∞ being a possible value thereof. Let t(n) be defined as before. Define

Page 38: Stochastic Approximation: A Dynamical Systems Viewpoint

3.3 Another stability criterion 29

x(s) for s ∈ [t(n), t(n + 1)] as the linear interpolation of x(t(n)) = x∗n andx(t(n + 1)) = x∗n+1, using the post-reset value of x∗n at s = t(n) and the pre-reset value of x∗n+1 at s = t(n+1) in case there is a reset at either time instant.Note that by (3.3.4), ‖x(·)‖ remains bounded.

Now, we can choose a ∆ > 0 such that V ≤ −∆ on the compact set x :b ≤ V (x) ≤ c. Choose T such that ∆T > 2δ. Let T0 = 0, and let Tn+1 =mint(m) : t(m) ≥ Tn + T. For simplicity, suppose that a(m) ≤ 1 for all m.Then Tn + T ≤ Tn+1 ≤ Tn + T + 1.

For t ∈ [Tn, Tn+1], define xn(t) to be the piecewise solution of x = h(x) suchthat xn(t) is set to x(t) at t = Tn and also at t = τk for all τk ∈ [Tn, Tn+1].That is:

• it satisfies the o.d.e. on every subinterval [t1, t2] where t1 is either some Tn

or a reset time in [Tn, Tn+1) and t2 is either its immediate successor fromthe set of reset times in (Tn, Tn+1) or Tn+1, whichever comes first, and

• at t1 it equals x(t1).

The following lemma is proved in a manner analogous to Lemma 1 of Chapter2.

Lemma 8. For T > 0,

limn→∞

supt∈[Tn,Tn+1]

||x(t)− xn(t)|| = 0, a.s.

Theorem 9. supn ‖xn‖ < ∞, a.s.

Proof. Let x∗k ∈ Cc. Then by (3.3.4), there is a bounded set B1 such that

x∗,−k+1def= x∗k + a(k)[h(x∗k) + Mk+1] ∈ B1.

Let η1 > 0. By Lemma 8, there is an N1 such that for n ≥ N1, one hassupt∈[Tn,Tn+1] ||x(t) − xn(t)|| < η1. Let D be the η1-neighbourhood of B1.Thus, for n ≥ N1, t ∈ [Tn, Tn+1], both x(t) and xn(t) remain inside D. D hascompact closure, and therefore V is uniformly continuous on D. Thus there isan η2 > 0 such that for x, y ∈ D, ‖x− y‖ < η2 implies |V (x)− V (y)| < δ. Letη = minη1, η2. Let N ≥ N1 be such that for n ≥ N , supt∈[Tn,Tn+1] ||x(t) −xn(t)|| < η. Then for n ≥ N, t ∈ [Tn, Tn+1], xn(t) remains inside D andfurthermore, |V (x(t))− V (xn(t))| < δ.

Now fix n ≥ N . Let τk ∈ [Tn, Tn+1] be a reset time. Let t ∈ [τk, Tn+1].Since xn(τk) is on G, then V (xn(τk)) ≤ b, and V (xn(t)), t ≥ τk, decreases witht until Tn+1 or until the next reset, if that occurs before Tn+1. At such areset, however, V (xn(t)) again has value at most b and the argument repeats.Thus in any case, V (xn(t)) ≤ b on [τk, Tn+1]. Thus V (x(t)) ≤ b + δ < c on[τk, Tn+1]. Hence x(·) does not exit Cc, and there can in fact be no furtherresets in [τk, Tn+1]. Note also that V (x(Tn+1)) ≤ b + δ.

Page 39: Stochastic Approximation: A Dynamical Systems Viewpoint

30 Stability Criteria

Now consider the interval [Tn+1, Tn+2]. Since xn+1(Tn+1) = x(Tn+1), wehave V (xn+1(Tn+1)) ≤ b+δ. We argue as above to conclude that V (xn+1(t)) ≤b+ δ and V (x(t)) < b+2δ < c on [Tn+1, Tn+2]. This implies that x(·) does notleave Cc on [Tn+1, Tn+2] and hence there are no resets on this interval. Further,if xn+1(·) remained in the set x : b ≤ V (x) ≤ c, the rate of decrease ofV (xn+1(t)) would be at least ∆. Recall that ∆T > 2δ. Thus it must be thatV (xn+1(Tn+2)) ≤ b, which means that as before, V (x(Tn+2)) ≤ b + δ.

Repeating the argument for successive ns, it follows that beyond a certainpoint in time, there can be no further resets. Thus for n large enough, x∗nremains bounded and furthermore, x∗n+1 = x∗n + a(n)f(x∗n, ξn+1). But xn+1 =xn + a(n)f(xn, ξn+1), and the noise sequence is the same for both recursions.By the assumption (*), it follows that the sequence xn remains bounded, a.s.

If there is no reset after N , x∗n is anyway bounded and the same logicapplies. ¥

This test of stability is useful in some reinforcement learning applications.Further stability criteria can be found, e.g., in Kushner and Yin (2003) andTsitsiklis (1994).

Page 40: Stochastic Approximation: A Dynamical Systems Viewpoint

4

Lock-in Probability

4.1 Estimating the lock-in probability

Recall the urn model of Chapter 1. When there are multiple isolated stableequilibria, it turns out that there can be a positive probability of convergenceto one of these equilibria which is not, however, necessarily among the desiredones. This, we recall, was the explanation for several instances of adoption ofone particular convention or technology as opposed to another. The idea is thatafter some initial randomness, the process becomes essentially ‘locked into’ thedomain of attraction of a particular equilibrium, i.e., locked into a particularchoice of technology or convention. With this picture in mind, we define thelock-in probability as the probability of convergence to an asymptotically sta-ble attractor, given that the iterate is in a neighbourhood thereof. Our aimhere will be to get a lower bound on this probability and explore some of itsconsequences. Our treatment follows the approach of Borkar (2002, 2003).

Our setting is as follows. We consider the stochastic approximation on Rd:

xn+1 = xn + a(n)[h(xn) + Mn+1], (4.1.1)

under assumptions (A1), (A2) and (A3) of Chapter 2. We add to (A2) therequirement that a(n) ≤ ca(m) ∀ n ≥ m for some c > 0. We shall not apriori make assumption (A4) of Chapter 2, which says that the sequence xngenerated by (4.1.1) is a.s. bounded. The a.s. boundedness of this sequencewith high probability will be proved as a consequence of our main result. Wehave seen that recursion (4.1.1) can be considered to be a noisy discretizationof the ordinary differential equation

x(t) = h(x(t)). (4.1.2)

Let G ⊂ Rd be open, and let V : G → [0,∞) be such that Vdef= ∇V ·h : G → R

31

Page 41: Stochastic Approximation: A Dynamical Systems Viewpoint

32 Lock-in Probability

is non-positive. We shall assume that Hdef= x : V (x) = 0 is equal to the set

x : V (x) = 0 and is a compact subset of G. Thus the function V is a Liapunovfunction. Then H is an asymptotically stable invariant set of the differentialequation (4.1.2). Conversely, (local) asymptotic stability implies the existenceof such a V by the converse Liapunov theorem – see, e.g., Krasovskii (1963).Let there be an open set B with compact closure such that H ⊂ B ⊂ B ⊂ G.It follows from the LaSalle invariance principle (see Appendix B) that anyinternally chain transitive invariant subsets of B will be subsets of H.

In this setting we shall derive an estimate for the probability that the se-quence xn is convergent to H, conditioned on the event that xn0 ∈ B forsome n0 sufficiently large. In the next section, we shall also derive a samplecomplexity estimate for the probability of the sequence being within a certainneighbourhood of H after a certain length of time, again conditioned on theevent that xn0 ∈ B.

Let Ha def= x : V (x) < a have compact closure Ha = x : V (x) ≤ a. ForA ⊂ Rd, δ > 0, let Nδ(A) denote the δ-neighbourhood of A, i.e., Nδ(A) def= x :infy∈A ||x − y|| < δ. Fix some 0 < ε1 < ε and δ > 0 such that Nδ(Hε1) ⊂Hε ⊂ Nδ(Hε) ⊂ B.

As was argued in the first extension of Theorem 2, Chapter 2, in section 2.2,if the sequence xn generated by recursion (4.1.1) remains a.s. bounded on aprescribed set of sample points, then it converges almost surely on this set to a(possibly sample path dependent) compact internally chain transitive invariantset of the o.d.e. (4.1.2). Therefore, if we can show that with high probabilityxn remains inside the compact set B, then it follows that xn converges toH with high probability. We shall in fact show that x(·), the piecewise linearand continuous curve obtained by linearly interpolating the points xn as inChapter 2, lies inside Nδ(Hε) ⊂ B with high probability from some time on.Let us define

T =[maxx∈B V (x)]− ε1

minx∈B\Hε1 |∇V (x) · h(x)| .

Then T is an upper bound for the time required for a solution of the o.d.e.(4.1.2) to reach the set Hε1 , starting from an initial condition in B. Fix ann0 ≥ 1 ‘sufficiently large’. (We shall be more specific later about how n0 isto be chosen.) For m ≥ 1, let nm = minn : t(n) ≥ t(nm−1) + T. Define asequence of times T0, T1, . . . by Tm = t(nm). For m ≥ 0, let Im be the interval[Tm, Tm+1], and let

ρmdef= sup

t∈Im

||x(t)− xTm(t)||,

where xTm(·) is, as in Chapter 2, the solution of the o.d.e. (4.1.2) on In with

Page 42: Stochastic Approximation: A Dynamical Systems Viewpoint

4.1 Estimating the lock-in probability 33

initial condition xTm(Tm) = x(Tm). We shall assume that a(n) ≤ 1 ∀n, whichimplies that the length of Im is between T and T + 1.

Let us assume for the moment that xn0 ∈ B, and that ρm < δ for all m ≥ 0.Because of the way we defined T , it follows that xT0(T1) ∈ Hε1 . Since ρ0 < δ

and Nδ(Hε1) ⊂ Hε, x(T1) ∈ Hε. Since Hε is a positively invariant subsetof B, it follows that xT1(·) lies in Hε on I1, and that xT1(T2) ∈ Hε1 . Hencex(T2) ∈ Hε. Continuing in this way it follows that for all m ≥ 1, xTm liesinside Hε on Im. Now using the fact that ρm < δ for all m ≥ 0, it follows thatx(t) is in Nδ(Hε) ⊂ B for all t ≥ T1. As mentioned above, it now follows thatx(t) → H as t →∞. Therefore we have:

Lemma 1.

P (x(t) → H|xn0 ∈ B) ≥ P (ρm < δ ∀m ≥ 0|xn0 ∈ B). (4.1.3)

We let Bm denote the event that xn0 ∈ B and ρk < δ for k = 0, 1, . . . ,m.Recall that Fn

def= σ(x0,M1, . . . , Mn), n ≥ 1. We then have that Bm ∈ Fnm+1 .Note that

P (ρm < δ ∀m ≥ 0|xn0 ∈ B) = 1− P (ρm ≥ δ for some m ≥ 0|xn0 ∈ B)

We have the following disjoint union:

ρm ≥ δ for some m ≥ 0 =

ρ0 ≥ δ ∪ ρ1 ≥ δ; ρ0 < δ ∪ ρ2 ≥ δ; ρ0, ρ1 < δ ∪ · · ·Therefore,

P (ρm ≥ δ for some m ≥ 0|xn0 ∈ B)

= P (ρ0 ≥ δ|xn0 ∈ B)

+ P (ρ1 ≥ δ; ρ0 < δ|xn0 ∈ B)

+ P (ρ2 ≥ δ; ρ0, ρ1 < δ|xn0 ∈ B) + · · ·= P (ρ0 ≥ δ|xn0 ∈ B)

+ P (ρ1 ≥ δ|ρ0 < δ, xn0 ∈ B)P (ρ0 < δ|xn0 ∈ B)

+ P (ρ2 ≥ δ|ρ0, ρ1 < δ, xn0 ∈ B)P (ρ0, ρ1 < δ|xn0 ∈ B) + · · ·≤ P (ρ0 ≥ δ|xn0 ∈ B) + P (ρ1 ≥ δ|B0) + P (ρ2 ≥ δ|B1) + · · ·

Thus, with B−1def= x0 ∈ B, we have:

Lemma 2.

P (ρm < δ ∀m ≥ 0|xn0 ∈ B) ≥ 1−∞∑

m=0

P (ρm ≥ δ|Bm−1).

Page 43: Stochastic Approximation: A Dynamical Systems Viewpoint

34 Lock-in Probability

The bound derived in Lemma 2 involves the term P (ρm ≥ δ|Bm−1). We shallnow derive an upper bound on this term. Recall that Bm−1 denotes the eventthat xn0 ∈ B and ρk < δ for k = 0, 1, . . . ,m− 1. This implies that x(Tm) ∈ B.Let C be a bound on ||h(Φt(x))||, where Φt is the time-t flow map for the o.d.e.(4.1.2), 0 ≤ t ≤ T + 1, and x ∈ B. By the arguments of Lemma 1 of Chapter2, it follows that

ρm ≤ Ca(nm) + KT (CLb(nm) + maxnm≤j≤nm+1

||ζj − ζnm||),

where KT is a constant that depends only on T , L is the Lipschitz constant forh, b(n) def=

∑k≥n a(k)2, and ζk =

∑k−1i=0 a(i)Mi+1. Since a(nm) ≤ ca(n0) and

b(nm) ≤ c2b(n0), it follows that

ρm ≤ (Ca(n0) + cKT CLb(n0))c + KT maxnm≤j≤nm+1

||ζj − ζnm ||.

This implies that if n0 is chosen so that (Ca(n0) + cKT CLb(n0))c < δ/2, thenρm > δ implies that

maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT.

We state this as a lemma:

Lemma 3. If n0 is chosen so that

(Ca(n0) + cKT CLb(n0))c < δ/2, (4.1.4)

then

P (ρm ≥ δ|Bm−1) ≤ P ( maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT|Bm−1)

We shall now find a bound for the expression displayed on the right-hand sideof the above inequality. We shall give two methods for bounding this quantity.The first one uses Burkholder’s inequality, and the second uses a concentrationinequality for martingales. As we shall see, the second method gives a betterbound, but under a stronger assumption.

Burkholder’s inequality (see Appendix C) implies that if (Xn,Fn), n ≥ 1is a (real-valued) zero mean martingale, and if Mn

def= Xn − Xn−1 is the cor-responding martingale difference sequence, then there is a constant C1 suchthat

E[( max0≤j≤n

||Xj ||)2] ≤ C21E[

n∑

i=1

M2i ].

We shall use the conditional version of this inequality: If G ⊂ F0 is a sub-σ-field,then

E[( max0≤j≤n

||Xj ||)2|G] ≤ C21E[

n∑

i=1

M2i |G]

Page 44: Stochastic Approximation: A Dynamical Systems Viewpoint

4.1 Estimating the lock-in probability 35

almost surely. In our context, ζj − ζnm, j ≥ nm is an Rd-valued martingale

with respect to the filtration Fj. We shall apply Burkholder’s inequality toeach component.

We shall first prove some useful lemmas. Let K be the constant in assumption(A3) of Chapter 2. For an Rd-valued random variable x, let ‖x‖∗ denoteE[||x||2|Bm−1]1/2(ω), where ω ∈ Bm−1. Note that ‖ · ‖∗ satisfies the propertiesof a norm ‘almost surely’.

Lemma 4. For nm < j ≤ nm+1 and a.s. ω ∈ Bm−1,

(‖Mj‖∗)2 ≤ K(1 + (‖xj−1‖∗)2).Proof. Note that Bm−1 ∈ Fnm

⊂ Fj−1 for j > nm. Then a.s.,

(‖Mj‖∗)2 = E[||Mj ||2|Bm−1](ω)

= E[E[||Mj ||2|Fj−1]|Bm−1](ω)

≤ E[K(1 + ||xj−1||2)|Bm−1](ω)

(by Assumption (A3) of Chapter 2)

= K(1 + E[||xj−1||2|Bm−1](ω))

= K(1 + (‖xj−1‖∗)2).¥

Lemma 5. There is a constant KT such that for nm ≤ j ≤ nm+1,

‖xj‖∗ ≤ KT a.s.

Proof. Consider the recursion

xj+1 = xj + a(j)[h(xj) + Mj+1], j ≥ nm.

As we saw in Chapter 2, the Lipschitz property of h implies a linear growthcondition on h, i.e., ||h(x)|| ≤ K ′(1 + ||x||). Taking the Euclidean norm onboth sides of the above equation and using the triangle inequality and thislinear growth condition leads to

||xj+1|| ≤ ||xj ||(1 + a(j)K ′) + a(j)K ′ + a(j)||Mj+1||.Therefore a.s. for j ≥ nm,

‖xj+1‖∗ ≤ ‖xj‖∗(1 + a(j)K ′) + a(j)K ′ + a(j)‖Mj+1‖∗

≤ ‖xj‖∗(1 + a(j)K ′) + a(j)K ′ + a(j)√

K(1 + (‖xj‖∗)2)1/2

(by the previous lemma)

≤ ‖xj‖∗(1 + a(j)K ′) + a(j)K ′ + a(j)√

K(1 + ‖xj‖∗)= ‖xj‖∗(1 + a(j)K1) + a(j)K1,

where K1 = K ′ +√

K.

Page 45: Stochastic Approximation: A Dynamical Systems Viewpoint

36 Lock-in Probability

Applying this inequality repeatedly and using the fact that 1+aK1 ≤ eaK1 andthe fact that if j < nm+1, a(nm)+ a(nm +1)+ · · ·+ a(j) ≤ t(nm+1)− t(nm) ≤T + 1, we get

‖xj+1‖∗ ≤ ‖xnm‖∗eK1(T+1) + K1(T + 1)eK1(T+1) for nm ≤ j < nm+1.

For ω ∈ Bm−1, xnm(ω) ∈ B. Since B is compact, there is a constant K2

such that ||xnm || < K2. Thus ‖xnm‖∗ ≤ K2. This implies that for nm − 1 ≤j < nm+1, ‖xj+1‖∗ ≤ KT

def= eK1(T+1)[K2 + K1(T + 1)]. In other words, fornm ≤ j ≤ nm+1, ‖xj‖∗ ≤ KT . ¥

Lemma 6. For nm < j ≤ nm+1,

(‖Mj‖∗)2 ≤ K(1 + K2T ).

Proof. This follows by combining the results of Lemmas 4 and 5. ¥

For a.s. ω ∈ Bm−1, we have

P ( maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT|Bm−1)

= P (( maxnm≤j≤nm+1

||ζj − ζnm ||)2 >δ2

4K2T

|Bm−1)

≤ E[( maxnm≤j≤nm+1

||ζj − ζnm ||)2|Bm−1] · 4K2T

δ2,

where the inequality follows from the conditional Chebyshev inequality. Weshall now use Burkholder’s inequality to get the desired bound. Let ζi

n denote

Page 46: Stochastic Approximation: A Dynamical Systems Viewpoint

4.1 Estimating the lock-in probability 37

the ith component of ζn. For a.s. ω ∈ Bm−1,

E[( maxnm≤j≤nm+1

||ζj − ζnm||)2|Bm−1]

= E[ maxnm≤j≤nm+1

d∑

i=1

(ζij − ζi

nm)2|Bm−1]

≤d∑

i=1

E[( maxnm≤j≤nm+1

|ζij − ζi

nm|)2|Bm−1]

≤d∑

i=1

C21E[

nm+1∑

j=nm+1

a(j − 1)2(M ij)

2|Bm−1]

= C21E[

nm+1∑

j=nm+1

a(j − 1)2||Mj ||2|Bm−1]

= C21

nm+1∑

j=nm+1

a(j − 1)2(‖Mj‖∗)2

≤ C21 [a2

nm+ · · ·+ a2

nm+1−1]K(1 + K2T )

= C21 (b(nm)− b(nm+1))K(1 + K2

T ).

Combining the foregoing, we obtain the following lemma:

Lemma 7. For Kdef= 4C2

1K(1 + K2T )K2

T > 0,

P ( maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT|Bm−1) ≤ K

δ2(b(nm)− b(nm+1)).

Thus we have:

Theorem 8. For some constant K > 0,

P (x(t) → H|xn0 ∈ B) ≥ 1− K

δ2b(n0).

Page 47: Stochastic Approximation: A Dynamical Systems Viewpoint

38 Lock-in Probability

Proof. Note that

P (ρm < δ ∀m ≥ 0|xn0 ∈ B)

≥ 1−∞∑

m=0

P (ρm ≥ δ|Bm−1) by Lemma 2

≥ 1−∞∑

m=0

P ( maxnm≤j≤nm+1

||ζj − ζnm|| > δ

2KT|Bm−1) by Lemma 3

≥ 1−∞∑

m=0

K

δ2(b(nm)− b(nm+1)) by Lemma 7

= 1− K

δ2b(n0).

The claim now follows from Lemma 1. ¥

A key assumption for the bound derived in Theorem 8 was assumption (A3)of Chapter 2, by which E[||Mj ||2|Fj−1] ≤ K(1 + ||xj−1||2), for j ≥ 1. Nowwe shall make the more restrictive assumption that ||Mj || ≤ K0(1 + ||xj−1||).This condition holds, e.g., in the reinforcement learning applications discussedin Chapter 10. Under this assumption, it will be possible for us to derive asharper bound.

We first prove a lemma about the boundedness of the stochastic approxima-tion iterates:

Lemma 9. There is a constant K3 such that for t ∈ Im, ||x(t)|| ≤ K3(1 +||x(Tm)||).

Proof. The stochastic approximation recursion is

xn+1 = xn + a(n)[h(xn) + Mn+1].

Therefore, assuming that ||Mj || ≤ K0(1+ ||xj−1||) and using the linear growthproperty of h,

||xn+1|| ≤ ||xn||+ a(n)K(1 + ||xn||) + a(n)K0(1 + ||xn||)= ||xn||(1 + a(n)K4) + a(n)K4, where K4

def= K + K0.

Arguing as in Lemma 5, we conclude that for nm ≤ j < nm+1,

||xj+1|| ≤ [||xnm ||+ K4(T + 1)]eK4(T+1).

Thus for nm ≤ j ≤ nm+1, ||xj || ≤ eK4(T+1)[||xnm || + K4(T + 1)] ≤ K3(1 +||xnm ||) for some constant K3. The lemma follows. ¥

Page 48: Stochastic Approximation: A Dynamical Systems Viewpoint

4.1 Estimating the lock-in probability 39

Note that for nm ≤ k < nm+1,

||ζk+1 − ζk|| = ||a(k)Mk+1||≤ a(k)K0(1 + ||xk||)≤ a(k)K0(1 + K3(1 + ||xnm

||))≤ a(k)K0[1 + K3(K2 + 1)],

where K2def= max

x∈B||x||

= a(k)K, (4.1.5)

where Kdef= K0[1 + K3(K2 + 1)].

Thus one may use the following ‘concentration inequality for martingales’ (cf.Appendix C): Consider the filtration F1 ⊂ F2 ⊂ · · · ⊂ F . Let S1, . . . , Sn

be a (scalar) martingale with respect to this filtration, with Y1 = S1, Yk =Sk − Sk−1 (k ≥ 2) the corresponding martingale difference sequence. Letck ≤ Yk ≤ bk. Then

P ( max1≤k≤n

|Sk| ≥ t) ≤ 2e−2t2∑

k≤n(bk−ck)2 .

If B ∈ F1, we can state a conditional version of this inequality as follows:

P ( max1≤k≤n

|Sk| ≥ t|B) ≤ 2e−2t2∑

k≤n(bk−ck)2 .

Let ||·||∞ denote the max-norm on Rd, i.e., ||x||∞ def= maxi |xi|. Note that forv ∈ Rd, ||v||∞ ≤ ||v|| ≤

√d||v||∞. Thus ||v|| ≥ c implies that ||v||∞ ≥ c/

√d.

Since ζj − ζnmnm≤j≤nm+1 is a martingale with respect to Fjnm≤j≤nm+1

and Bm−1 ∈ Fnm , we have, by the inequality (4.1.5),

Page 49: Stochastic Approximation: A Dynamical Systems Viewpoint

40 Lock-in Probability

P ( maxnm≤j≤nm+1

||ζj − ζnm|| > δ

2KT|Bm−1)

≤ P ( maxnm≤j≤nm+1

||ζj − ζnm ||∞ >δ

2KT

√d|Bm−1)

= P ( maxnm≤j≤nm+1

max1≤i≤d

|ζij − ζi

nm| > δ

2KT

√d|Bm−1)

= P ( max1≤i≤d

maxnm≤j≤nm+1

|ζij − ζi

nm| > δ

2KT

√d|Bm−1)

≤d∑

i=1

P ( maxnm≤j≤nm+1

|ζij − ζi

nm| > δ

2KT

√d|Bm−1)

≤d∑

i=1

2 exp −2δ2/(4K2

T d)4[a(nm)2 + · · ·+ a(nm+1 − 1)2]K2

≤ 2d exp − δ2

8K2T K2d[b(nm)− b(nm+1)]

= 2de−Cδ2/(d[b(nm)−b(nm+1)]), where C = 1/(8K2T K2).

This gives us:

Lemma 10. There is a constant C > 0 such that

P ( maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT|Bm−1) ≤ 2de−Cδ2/(d[b(nm)−b(nm+1)]).

For n0 sufficiently large that (4.1.4) holds and

b(n0) < Cδ2/d, (4.1.6)

the following bound holds:

Theorem 11.

P (ρm < δ ∀m ≥ 0|xn0 ∈ B) ≥ 1− 2de− Cδ2

db(n0) = 1− o(b(n0)).

Proof. Note that Lemmas 2 and 3 continue to apply. Then

P (ρm < δ ∀m ≥ 0|xn0 ∈ B)

≥ 1−∞∑

m=0

P (ρm ≥ δ|Bm−1) by Lemma 2

≥ 1−∞∑

m=0

P ( maxnm≤j≤nm+1

||ζj − ζnm || >δ

2KT|Bm−1) by Lemma 3

≥ 1−∞∑

m=0

2de−Cδ2/(d[b(nm)−b(nm+1)]) by Lemma 10

Page 50: Stochastic Approximation: A Dynamical Systems Viewpoint

4.1 Estimating the lock-in probability 41

Note that for C ′ > 0, e−C′/x/x → 0 as x → 0 and increases with x for0 < x < C ′. Therefore for sufficiently large n0,

e−Cδ2/(d[b(nm)−b(nm+1)])

[b(nm)− b(nm+1)]≤ e−Cδ2/(db(nm))

b(nm)≤ e−Cδ2/(db(n0))

b(n0).

Hence

e−Cδ2/(d[b(nm)−b(nm+1)]) = [b(nm)− b(nm+1)] · e−Cδ2/(d[b(nm)−b(nm+1)])

[b(nm)− b(nm+1)]

≤ [b(nm)− b(nm+1)] · e−Cδ2/(db(n0))

b(n0).

So,

∞∑m=0

e−Cδ2/(d[b(nm)−b(nm+1)]) ≤∞∑

m=0

[b(nm)− b(nm+1)] · e−Cδ2/(db(n0))

b(n0)

=e−Cδ2/(db(n0))

b(n0)

∞∑m=0

[b(nm)− b(nm+1)]

= e−Cδ2/(db(n0))

The claim follows. ¥

Lemma 1 coupled with Theorems 8 and 11 now enables us to derive boundson the lock-in probability. We state these bounds in the following corollary:

Corollary 12. In the setting described at the beginning of this section, if n0

is chosen so that (4.1.4) holds, then there is a constant K such that

P (x(t) → H|xn0 ∈ B) ≥ 1− K

δ2b(n0).

If we make the additional assumption that ||Mj || ≤ K0(1 + ||xj−1||) for j ≥ 1and (4.1.6) holds, then the following tighter bound for the lock-in probabilityholds:

P (x(t) → H|xn0 ∈ B) ≥ 1− 2de− Cδ2

db(n0)

= 1− o(b(n0)).

In conclusion, we observe that the stronger assumption on the martingaledifference sequence Mn was made necessary by the fact that we use McDi-armid’s concentration inequality for martingales, which requires the associatedmartingale difference sequence to be bounded by deterministic bounds. Morerecent work on concentration inequalities for martingales may be used to relaxthis condition. See, e.g., Li (2003).

Page 51: Stochastic Approximation: A Dynamical Systems Viewpoint

42 Lock-in Probability

4.2 Sample complexity

We continue in the setting described at the beginning of the previous section.Our goal is to derive a sample complexity estimate, by which we mean anestimate of the probability that x(t) is within a certain neighbourhood of H

after the lapse of a certain amount of time, conditioned on the event thatxn0 ∈ B for some fixed n0 sufficiently large.

We begin by fixing some ε > 0 such that Hε (= x : V (x) ≤ ε) ⊂ B. Fixsome T > 0, and let

∆ def= minx∈B\Hε

[V (x)− V (ΦT (x))],

where ΦT is the time-T flow map of the o.d.e. (4.1.2) (see Appendix B). Notethat ∆ > 0. We remark that the arguments that follow do not require V tobe differentiable, as was assumed earlier. It is enough to assume that V iscontinuous and that V (x(t)) monotonically decreases with t along any trajec-tory of (4.1.2) in B \H. One such situation with non-differentiable V will beencountered in Chapter 10, in the context of reinforcement learning.

We fix an n0 ≥ 1 sufficiently large. We shall specify later how large n0 needsto be. Let nm, Tm, Im, ρm and Bm be defined as in the previous section. Fixa δ > 0 such that Nδ(Hε) ⊂ B and such that for all x, y ∈ B with ||x− y|| < δ,||V (x)− V (y)|| < ∆/2.

Let us assume that xn0 ∈ B, and that ρm < δ for all m ≥ 0. If xn0 ∈ B \Hε,we have that V (xT0(T1)) ≤ V (x(T0)) − ∆. Since ||xT0(T1) − x(T1)|| < δ, itfollows that V (x(T1)) ≤ V (x(T0))−∆/2. If x(T1) ∈ B\Hε, the same argumentcan be repeated to give V (x(T2)) ≤ V (x(T1)) −∆/2. Since V (x(Tm)) cannotdecrease at this rate indefinitely, it follows that x(Tm0) ∈ Hε for some m0. Infact, if

τdef=

[maxx∈B V (x)]− ε

∆/2· (T + 1),

then Tm0 ≤ T0 + τ .Thus xTm0 (t) ∈ Hε on Im0 = [Tm0 , Tm0+1]. Therefore x(Tm0+1) ∈ Hε+∆/2.

This gives rise to two possibilities: either x(Tm0+1) ∈ Hε or x(Tm0+1) ∈Hε+∆/2 \Hε. In the former case, xTm0+1(t) ∈ Hε on Im0+1, and x(Tm0+2) ∈Hε+∆/2. In the latter case, xTm0+1(t) ∈ Hε+∆/2 on Im0+1, xTm0+1(Tm0+2) ∈Hε−∆/2 ⊂ Hε, and again x(Tm0+2) ∈ Hε+∆/2. In any case, x(Tm0+1) ∈Hε+∆/2 implies that xTm0+1(t) ∈ Hε+∆/2 on Im0+1 and x(Tm0+2) ∈ Hε+∆/2.This argument can be repeated. We have thus shown that if x(Tm0) ∈ Hε,then xTm0+k(t) ∈ Hε+∆/2 on Im0+k for all k ≥ 0, which in turn implies thatx(t) ∈ Nδ(Hε+∆/2) for all t ≥ Tm0 . We thus conclude that if xn0 ∈ B andρm < δ for all m ≥ 0, then x(t) ∈ Nδ(Hε+∆/2) for all t ≥ Tm0 , and thus for allt ≥ t(n0) + τ . This gives us:

Page 52: Stochastic Approximation: A Dynamical Systems Viewpoint

4.2 Sample complexity 43

Lemma 13.

P (x(t) ∈ Nδ(Hε+∆/2) ∀t ≥ t(n0) + τ |xn0 ∈ B)

≥ P (ρm < δ ∀m ≥ 0|xn0 ∈ B)

Lemma 13 coupled with Theorems 8 and 11 now allows us to derive thefollowing sample complexity estimate.

Corollary 14. In the setting described at the beginning of section 4.1, if n0 ischosen so that (4.1.4) holds, then there is a constant K such that

P (x(t) ∈ Nδ(Hε+∆/2) ∀t ≥ t(n0) + τ |xn0 ∈ B) ≥ 1− K

δ2b(n0).

If we make the additional assumption that ||Mj || ≤ K0(1 + ||xj−1||) for j ≥ 1and (4.1.6) holds, then there is a constant C such that the following tighterbound holds:

P (x(t) ∈ Nδ(Hε+∆/2) ∀t ≥ t(n0) + τ |xn0 ∈ B) ≥ 1− 2de− Cδ2

db(n0)

= 1− o(b(n0)).

The corollary clearly gives a sample complexity type result in the sense de-scribed at the beginning of this section. There is, however, a subtle differencefrom the traditional sample complexity results: What we present here is notthe number of samples needed to get within a prescribed accuracy with a pre-scribed probability starting from time zero, but starting from time n0, and thebound depends crucially on the position at time n0 via its dependence on theset B that we are able to choose.

As an example, consider the situation when h(x) = g(x)− x, where g(·) is acontraction, so that ||g(x) − g(y)|| < α||x − y|| for some α ∈ (0, 1). Let x∗ bethe unique fixed point of g(·), guaranteed by the contraction mapping theorem(see Appendix A). Straightforward calculation shows that V (x) = ||x − x∗||satisfies our requirements: Let

X(x, t) = h(X(x, t)), X(x, 0) = x.

We have

(X(x, t)− x∗) = (x− x∗) +∫ t

0

(g(X(x, s))− x∗)ds−∫ t

0

(X(x, s)− x∗)ds,

leading to

(X(x, t)− x∗) = e−t(x− x∗) +∫ t

0

e−(t−s)(g(X(x, s))− x∗)ds.

Page 53: Stochastic Approximation: A Dynamical Systems Viewpoint

44 Lock-in Probability

Taking norms and using the contraction property,

||X(x, t)− x∗|| ≤ e−t||x− x∗||+∫ t

0

e−(t−s)||g(X(x, s))− x∗||ds

≤ e−t||x− x∗||+ α

∫ t

0

e−(t−s)||X(x, s)− x∗||ds.

That is,

et||X(x, t)− x∗|| ≤ ||x− x∗||+ α

∫ t

0

es||X(x, s)− x∗||ds.

By the Gronwall inequality,

||X(x, t)− x∗|| ≤ e−(1−α)t||x− x∗||.Let the iteration be at a point x at time n0 large enough that (4.1.4) and(4.1.6) hold. Let ||x − x∗|| = b (say). For ε, T > 0 as above, one may choose∆ = ε(1− e−(1−α)T ). We may take δ = ∆

2 ≤ ε2 . One then needs

N0def= min

n :

n∑

i=n0+1

a(i) ≥ 2(T + 1)bε(1− e−(1−α)T )

more iterates to get within 2ε of x∗, with a probability exceeding

1− 2de− cε2

b(m(n0)) = 1− o(b(n0))

for a suitable constant c that depends on T , among other things. Note thatT > 0 is a free parameter affecting both the expression for N0 and that for theprobability.

4.3 Avoidance of traps

As a second application of the estimates for the lock-in probability, we shallprove the avoidance of traps under suitable conditions. This term refers to thefact that under suitable additional hypotheses, the stochastic approximationiterations asymptotically avoid with probability one the attractors which areunstable in some direction. As one might guess, the additional hypothesesrequired concern the behaviour of h in the immediate neighbourhood of theseattractors and a ‘richness’ condition on the noise. Intuitively, there should bean unstable direction at all length scales and the noise should be rich enoughthat it pushes the iterates in such a direction sufficiently often. This in turnensures that they are eventually pushed away for good.

The importance of these results stems from the following considerations:We know from Chapter 2 that invariant sets of the o.d.e. are candidate limit

Page 54: Stochastic Approximation: A Dynamical Systems Viewpoint

4.3 Avoidance of traps 45

sets for the algorithm. In many applications, the unstable invariant sets areprecisely the spurious or undesirable limit sets one wants to avoid. The resultsof this section then give conditions when that avoidance will be achieved. Moregenerally, these results allow us to narrow down the search for possible limitsets. As before, we work with the hypothesis (A4) of Chapter 2:

supn||xn|| < ∞ a.s.

Consider a scenario where there exists an invariant set of (4.1.2) which isa disjoint union of N compact attractors Ai, 1 ≤ i ≤ N, with domains ofattraction Gi, 1 ≤ i ≤ N, resp., such that G =

⋃i Gi is open dense in Rd. Let

W = Gc. We shall impose further conditions on W as follows. Let Dα denotethe truncated (open) cone

x = [x1, . . . , xd] ∈ Rd : 1 < x1 < 2, |d∑

i=2

x2i |

12 < αx1

for some α > 0. For a d× d orthogonal matrix O, x ∈ Rd and a > 0, let ODα,x+Dα and aDα denote resp. the rotation of Dα by O, translation of Dα by x,and scaling of Dα by a. Finally, for ε > 0, let Wε denote the ε-neighbourhoodof W in Rd. We shall be making some additional assumptions regarding (4.1.1)over and above those already in place. Our main additional assumption willbe:

(A5) There exists an α > 0 such that for any a > 0 and x ∈ Rd, there existsan orthogonal matrix Ox,a such that x + aOx,aDα ⊂ W c

a .

What this means is that for any a > 0, we can plant a version of the truncatedcone scaled down by a near any point in Rd by means of suitable translationand rotation, in such a manner that it lies entirely in W c

a . Intuitively, thisensures that any point in Rd cannot have points in W arbitrarily close to it inall directions. We shall later show that this implies that a sequence of iteratesapproaching W will get pushed out to a shrinking family of such truncatedcones sufficiently often. In turn, we also show that this is enough to ensurethat the iterates move away from W to one of the Ai, whence they cannotconverge to W .

We shall keep α fixed henceforth. Thus we may denote the set x + aOx,aDα

as Dx,a. Let Id denote the d-dimensional identity matrix and let M1 ≥ M2 fora pair of d× d positive definite matrices stand for xTM1x ≥ xTM2x ∀x ∈ Rd.The main consequence of (A5) that we shall need is the following:

Lemma 15. For any c > b > 0, there exists a β = β(b, c) > 0 such that forany a > 0 sufficiently small, x ∈ Rd and any d-dimensional Gaussian measure

Page 55: Stochastic Approximation: A Dynamical Systems Viewpoint

46 Lock-in Probability

µ with mean x and covariance matrix Σ satisfying a2cId ≥ Σ ≥ a2bId, one hasµ(Dx,a) ≥ β(b, c).

Proof. By the scaling properties of the Gaussian, µ(Dx,a) = µ(D0,1) where µ

denotes the Gaussian measure with zero mean and covariance matrix Σ satis-fying

cId ≥ Σ ≥ bId.

The claim follows. ¥

We also assume:

(A6) There exists a positive definite matrix-valued continuous map Q : Rd →Rd×d such that for all n ≥ 0, E[Mn+1M

Tn+1|Fn] = Q(xn) and for some 0 <

Λ− < Λ+ < ∞, Λ+Id ≥ Q(x) ≥ Λ−Id.

(A7) supnb(n)a(n) < ∞.

(A8) h(·) is continuously differentiable and the Jacobian matrix∇h(·) is locallyLipschitz.

Assumption (A6) intuitively means that the noise is ‘rich’ enough in alldirections. Assumption (A7) is satisfied, e.g., by a(n) = 1/n, a(n) = 1/(1 +n`n(n)), etc., but not by, say, a(n) = 1/n

23 . Thus it requires a(n) to decrease

‘sufficiently fast’. (We shall mention a possible relaxation of this conditionlater.) Let s, T > 0. Consider a trajectory x(·) of (4.1.2) with x(s) ∈ U , whereU is the closure of a bounded open set containing A

def=⋃

i Ai. For t > s in[s, s + T + 1], let Φ(t, s) denote the Rd×d-valued solution of the linear system

Φ(t, s) = ∇h(x(t))Φ(t, s), Φ(s, s) = Id. (4.3.1)

For a positive definite matrix Q, let λmin(Q), λmax(Q) denote the least andthe highest eigenvalue of Q. Let

c∗ def= sup λmax(Φ(t, s)ΦT(t, s)),

b∗ def= inf λmin(Φ(t, s)ΦT(t, s)),

where the superscript ‘T’ denotes matrix transpose and the supremum andinfimum are over all x(·) as above and all s + T + 1 ≥ t ≥ s ≥ 0. Then∞ > c∗ ≥ b∗ > 0. The leftmost inequality follows from the fact that ∇h(x) isuniformly bounded because of the Lipschitz condition on h, whence a standardargument using the Gronwall inequality implies a uniform upper bound on||Φ(t, s)|| for t, s in the above range. The rightmost inequality, on the other

Page 56: Stochastic Approximation: A Dynamical Systems Viewpoint

4.3 Avoidance of traps 47

hand, is a consequence of the fact that Φ(t, s) is nonsingular for all t > s in theabove range. Also, the time dependence of its dynamics is via the continuousdependence of its coefficients on x(·), which lies in a compact set. Hence thesmallest eigenvalue of Φ(t, s)ΦT(t, s), being a continuous function of its entries,is bounded away from zero.

For j ≥ m(n), n ≥ 0, let yjdef= xj − xn(t(j)), where xn(·) is the solution of

the o.d.e. (4.1.2) on [t(n),∞) with xn(t(n)) = x(t(n)). Recall that

xn(t(j + 1)) = xn(t(j)) + a(j)h(xn(t(j))) + O(a(j)2).

Subtracting this from (4.1.1) and using Taylor expansion, one has

yj+1 = yj + a(j)(∇h(xn(t(j)))yj + κj) + a(j)Mj+1 + O(a(j)2),

where κj = o(||yj ||). In particular, iterating the expression above leads to

ym(n)+i = Πm(n)+i−1j=m(n) (1 + a(j)∇h(xn(t(j))))ym(n)

+m(n)+i−1∑

j=m(n)

a(j)Πm(n)+i−1k=j+1 (1 + a(k)∇h(xn(t(k))))κj

+m(n)+i−1∑

j=m(n)

a(j)Πm(n)+i−1k=j+1 (1 + a(k)∇h(xn(t(k))))Mj+1

+ O(a(m(n))).

Since ym(n) = 0, the first term drops out. The second term tends to zero asn ↑ ∞ because ||yn|| does. The last term clearly tends to zero as n ↑ ∞. LetΨn denote the third term on the right when i = m(n + 1) − m(n), and letΨn

def= Ψn/ϕ(n), where

ϕ(n) def= (b(m(n))− b(m(n + 1)))1/2.

Let P(Rd) denote the space of probability measures on Rd with Prohorovtopology (see Appendix C). The next lemma is a technical result which weneed later. Let φn denote the regular conditional law of Ψn given Fm(n), n ≥ 0,viewed as a P(Rd)-valued random variable.

Lemma 16. Almost surely on x(t(m(n))) ∈ U ∀n, every limit point of φnas n ↑ ∞ is zero mean Gaussian with the spectrum of its covariance matrixcontained in [b∗Λ−, c∗Λ+].

Proof. For n ≥ 0 define the martingale array ξni , 0 ≤ i ≤ kn

def= m(n + 1) −m(n) by ξn

0 = 0 and

ξni =

1ϕ(n)

m(n)+i−1∑

j=m(n)

a(j)Πm(n)+i−1k=j+1 (1 + a(k)∇h(xn(t(k))))Mj+1.

Page 57: Stochastic Approximation: A Dynamical Systems Viewpoint

48 Lock-in Probability

Then Ψn = ξnkn

and if 〈ξn〉i, 0 ≤ i ≤ kn, denotes the corresponding matrix-valued quadratic covariation process, i.e.,

〈ξn〉m def=m∑

i=0

E[(ξni+1 − ξn

i )(ξni+1 − ξn

i )T|Fi],

then

〈ξn〉m =1

ϕ(n)2

m(n)+i−1∑

j=m(n)

a(j)2(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))

× Q(xm(n)+j)(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))T

.

As n ↑ ∞,

1ϕ(n)2

m(n)+i−1∑

j=m(n)

a(j)2(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))

× Q(xm(n)+j)(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))T

− 1ϕ(n)2

m(n)+i−1∑

j=m(n)

a(j)2(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))

× Q(xn(t(m(n) + j)))(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))T

→ 0, a.s.

by Lemma 1 and Theorem 2 of Chapter 2. Note that xn(·) is Fm(n)-measurable.Fix a sample point in the probability one set where the conclusions of Lemma1 and Theorem 2 of Chapter 2 hold. Pick a subsequence n(`) ⊂ n suchthat xm(n(`)) → x∗ (say). Then xn(`)(·) → x(·) uniformly on compact intervals,where x(·) is the unique solution to the o.d.e. ˙x(t) = h(x(t)) with x(0) = x∗.Along n(`), any limit point of the matrices

(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))

Q(xn(t(m(n) + j)))

×(Πm(n)+i−1

k=j+1 (1 + a(k)∇h(xn(t(k)))))T

, m(n) ≤ j < m(n) + i, i ≥ 0,

is of the form Φ(t, s)Q(x)Φ(t, s)T for some t, s, x and therefore has its spectrumin [b∗Λ−, c∗Λ+]. Hence the same is true for any convex combinations or limitsof convex combinations thereof. In view of this, the claim follows on applyingthe central limit theorem for martingale arrays (Chow and Teicher, 2003, p.351; see also Hall and Heyde, 1980) to φn(j). ¥

Remark: The central limit theorem for martingale arrays referred to above is

Page 58: Stochastic Approximation: A Dynamical Systems Viewpoint

4.3 Avoidance of traps 49

stated in Chow and Teicher (2003) for the scalar case, but the vector case iseasily deducible from it by applying the scalar case to arbitrary one-dimensionalprojections thereof.

Clearly xn → W ∪ (∪iAi), because any internally chain recurrent invariantset must be contained in W ∪ (∪iAi). But Theorem 2 of Chapter 2 implies theconnectedness of the a.s. limit set of xn, and W and ∪iAi have disjoint openneighbourhoods. Thus it follows that the sets xn → W and xn → W alonga subsequence are a.s. identical. We also need:

Lemma 17. If Fn,Hn are events in Fn, n ≥ 0, such that P (Fn+1|Fn) ≥ κ > 0on Hn for all n ≥ 0, then P (Fn i.o.c

⋂Hn i.o.) = 0.

Proof. Since

Zndef=

n−1∑m=0

IFm+1 −n−1∑m=0

P (Fm+1|Fm), n ≥ 0, (4.3.2)

is a zero mean martingale with bounded increments, almost surely it eitherconverges or satisfies

lim supn→∞

Zn = − lim infn→∞

Zn = ∞.

(See Theorem 12 of Appendix C.) Thus the two sums on the right-hand sideof (4.3.2) converge or diverge together, a.s. Since the second sum is largerthan κ

∑n−1m=0 IHm , it follows that

∑nm=0 IFn , n ≥ 0, diverges a.s. whenever∑n

m=0 IHm , n ≥ 0, does. The claim follows. ¥

Recall that U is the closure of a bounded open neighbourhood of ∪iAi.

Corollary 18. For any r > 0, xm(n+1) ∈ W crϕ(n) i.o. a.s. on the set xn →

W.

Proof. By Lemmas 15 and 16, it follows that almost surely for n sufficientlylarge, the conditional probability

P (xm(n+1) ∈ W crϕ(n)|Fn)

satisfies

P (xm(n+1) ∈ W crϕ(n)|Fn) > η > 0

on the set xm(n) ∈ U, for some η independent of n. It follows from Lemma 17that xm(n+1) ∈ W c

rϕ(n) i.o. a.s. on the set xn ∈ U from some n on and xn →W. The claim follows by applying this to a countable increasing family of setsU that covers Rd. ¥

Page 59: Stochastic Approximation: A Dynamical Systems Viewpoint

50 Lock-in Probability

We now return to the framework of the preceding section with B = U ∩W c.Recall our choice of n0 such that (4.1.4) holds. Let K be a prescribed positiveconstant. By (A7),

supn

b(m(n))b(m(n))− b(m(n + 1))

= supn

b(m(n))ϕ(n)2

≤ 1 + supn

b(m(n + 1))ca(m(n + 1))T

< ∞,

where we have used the fact that (b(m(n))− b(m(n + 1)) ≥ Ta(m(n+1))c . Thus

we haveb(m(n))

ϕ(n)→ 0 as n ↑ ∞.

Thus for δ = Kϕ(n), we do have

Kb(i) <δ

2, ∀i ≥ m(n),

for K as in Lemma 7 and Theorem 8 and for n sufficiently large, say n ≥ n0.(This can be ensured by increasing n0 if necessary.) With this choice of δ andfor xm(n) ∈ B, the probability that xk → A exceeds

1− Kb(m(n))K2ϕ(n)2

. (4.3.3)

As noted above, supnb(m(n))ϕ(n)2 < ∞. Thus we may choose K large enough that

the right-hand side of (4.3.3) exceeds 12 , with n sufficiently large. Then we have

our main result:

Theorem 19. xn → A a.s.

Proof. Take r = K in Corollary 18. By the foregoing,

P (xm(n)+kk↑∞→ A|Fm(n)+1) ≥

12,

on the set xm(n+1) ∈ W crϕ(n) ∩ U for n ≥ 0 sufficiently large. It follows from

Lemma 17 that xn → A a.s. on xm(n+1) ∈ W crϕ(n) ∩ U i.o., therefore a.s.

on xm(n+1) ∈ W crϕ(n) i.o. (by considering countably many sets U that cover

Rd), and finally, a.s. on xn → W by the above corollary. That is, xn → A

a.s. on xn → W, a contradiction unless P (xn → W ) = P (xn → W along asubsequence) = 0. ¥

Two important generalizations are worth noting:

(i) We have used (A6) only to prove Lemma 16. So the conclusions of thelemma, which stipulate a condition on cumulative noise rather than periterate noise as in (A6), will suffice.

Page 60: Stochastic Approximation: A Dynamical Systems Viewpoint

4.3 Avoidance of traps 51

(ii) The only important consequence of (A7) used was the fact that the ratiob(n)/ϕ(n)2, n ≥ 1, remains bounded. This is actually a much weakerrequirement than (A7) itself.

To conclude, Theorem 19 is just one example of an ‘avoidance of traps’ result.There are several other formulations, notably Ljung (1978), Pemantle (1990),Brandiere and Duflo (1996), Brandiere (1998) and Benaim (1999). See alsoFang and Chen (2000) for some related results.

Page 61: Stochastic Approximation: A Dynamical Systems Viewpoint

5

Stochastic Recursive Inclusions

5.1 Preliminaries

This chapter considers an important generalization of the basic stochastic ap-proximation scheme of Chapter 2, which we call ‘stochastic recursive inclusions’.The idea is to replace the map h : Rd → Rd in the recursion (2.1.1) of Chap-ter 2 by a set-valued map h : Rd → subsets of Rd, satisfying the followingconditions:

(i) For each x ∈ Rd, h(x) is convex and compact.(ii) For all x ∈ Rd,

supy∈h(x)

||y|| < K(1 + ‖x‖) (5.1.1)

for some K > 0.(iii) h is upper semicontinuous in the sense that if xn → x and yn → y with

yn ∈ h(xn) for n ≥ 1, then y ∈ h(x). (In other words, the graph of h,defined as (x, y) : y ∈ h(x), is closed.)

See Aubin and Frankowska (1990) for general background on set-valued mapsand their calculus. Stochastic recursive inclusion refers to the scheme

xn+1 = xn + a(n)[yn + Mn+1], (5.1.2)

where a(n) are as before, Mn is a martingale difference sequence w.r.t.the increasing σ-fields Fn = σ(xm, ym,Mm,m ≤ n), n ≥ 0, satisfying (A3)of Chapter 2, and finally, yn ∈ h(xn) ∀n. The requirement that yn be inh(xn) is the reason for the terminology ‘stochastic recursive inclusions’. Weshall give several interesting applications of stochastic recursive inclusions insection 5.3, following the convergence analysis of (5.1.2) in the next section.

52

Page 62: Stochastic Approximation: A Dynamical Systems Viewpoint

5.2 The differential inclusion limit 53

5.2 The differential inclusion limit

As might be expected, in this chapter the o.d.e. limit (2.1.5) of Chapter 2 getsreplaced by a differential inclusion limit

x(t) ∈ h(x(t)). (5.2.1)

To prove that (5.2.1) is indeed the desired limiting differential inclusion, weproceed as in Chapter 2 to define t(0) = 0, t(n) =

∑n−1m=0 a(m), n ≥ 1. Define

x(·) as before, i.e., set x(t(n)) = xn, n ≥ 0, with linear interpolation on eachinterval [t(n), t(n + 1)]. Define the piecewise constant function y(t), t ≥ 0, byy(t) = yn, t ∈ [t(n), t(n + 1)), n ≥ 0. Define ζn as before. As in Chapter 2,we shall analyze (5.1.2) under the stability assumption

supn||xn|| < ∞, a.s. (5.2.2)

For s ≥ 0, let xs(t), t ≥ s, denote the solution to

xs(t) = y(t), xs(s) = x(s).

Tests for whether (5.2.2) holds can be stated, e.g., along the lines of section3.2. We omit the details. The following can now be proved exactly along thelines of Lemma 1 of Chapter 2.

Lemma 1. For any T > 0, lims→∞ supt∈[s,s+T ] ||x(t)− xs(t)|| = 0 a.s.

By (5.2.2) and condition (ii) on the set-valued map h,

sup||y|| : y ∈⋃n

h(xn) < ∞, a.s. (5.2.3)

Thus almost surely, xs(·), s ≥ 0 is an equicontinuous, pointwise boundedfamily. By the Arzela–Ascoli theorem, it is therefore relatively compact inC([0,∞);Rd). By Lemma 1, the same then holds true also for x(s+·) : s ≥ 0,because if not, there exist sn ↑ ∞ such that x(sn + ·) does not have any limitpoint in C([0,∞);Rd). Then nor does xsn(·) by the lemma, a contradiction tothe relative compactness of the latter.

Theorem 2. Almost surely, every limit point x(·) of x(s + ·), s ≥ 0 inC([0,∞);Rd) as s → ∞ satisfies (5.2.1). That is, it satisfies x(t) = x(0) +∫ t

0y(s)ds, t ≥ 0, for some measurable y(·) satisfying y(t) ∈ h(x(t)) ∀t.

Proof. Fix T > 0. Viewing y(s + t), t ∈ [0, T ], s ≥ 0, as a subset ofL2([0, T ];Rd), it follows from (5.2.3) that it is bounded and hence weakly rel-atively sequentially compact. (See Appendix A.) Let s(n) →∞ be a sequencesuch that x(s(n) + ·) → x(·) in C([0,∞);Rd) and y(s(n) + ·) → y(·) weakly

Page 63: Stochastic Approximation: A Dynamical Systems Viewpoint

54 Stochastic Recursive Inclusions

in L2([0, T ];Rd). Then by Lemma 1, xs(n)(s(n) + ·) → x(·) in C([0,∞);Rd).Letting n →∞ in the equation

xs(n)(t) = xs(n)(0) +∫ t

0

y(s(n) + z)dz, t ≥ 0,

we have x(t) = x(0) +∫ t

0y(z)dz, t ≥ 0. Since y(s(n) + ·) → y(·) weakly in

L2([0, T ];Rd), there exist n(k) ⊂ n such that n(k) ↑ ∞ and

1N

N∑

k=1

y(s(n(k)) + ·) → y(·)

strongly in L2([0, T ];Rd) (see Appendix A). In turn, there exist N(m) ⊂ Nsuch that N(m) ↑ ∞ and

1N(m)

N(m)∑

k=1

y(s(n(k)) + ·) → y(·) (5.2.4)

a.e. in [0, T ]. Fix t ∈ [0, T ] where this holds. Define [s] def= maxt(n) : t(n) ≤ s.Then y(s(n(k)) + t) ∈ h(x([s(n(k)) + t])) ∀k. Since we have x(s(n) + ·) → x(·)in C([0,∞);Rd) and t(n + 1)− t(n) → 0, it follows that

x([s(n(k)) + t]) = (x([s(n(k)) + t])− x(s(n(k)) + t))

+ (x(s(n(k)) + t)− x(t)) + x(t)

→ x(t).

The upper semicontinuity of the set-valued map h then implies that

y(s(n(k)) + t) → h(x(t)).

Since h(x(t)) is convex compact, it follows from (5.2.4) that y(t) ∈ h(x(t)).Thus y(t) ∈ h(x(t)) a.e., where the qualification ‘a.e.’ may be dropped bymodifying y(·) suitably on a Lebesgue-null set. Since T > 0 was arbitrary, theclaim follows. ¥

Before we proceed, here’s a technical lemma about (5.2.1):

Lemma 3. The set-valued map x ∈ Rd → Qx ⊂ C([0,∞);Rd), where Qxdef=

the set of solutions to (5.2.1) with initial condition x, is nonempty compactvalued and upper semicontinuous.

Proof. From (5.1.1), we have that for a solution x(·) of (5.2.1) with a prescribedinitial condition x0,

||x(t)|| ≤ ||x0||+ K ′∫ t

0

(1 + ||x(s)||)ds, t ≥ 0,

for a suitable constant K ′ > 0. By the Gronwall inequality, it follows that any

Page 64: Stochastic Approximation: A Dynamical Systems Viewpoint

5.2 The differential inclusion limit 55

solution x(·) of (5.2.1) with a prescribed initial condition x0 (more generally,initial conditions belonging to a bounded set) remains bounded on [0, T ] foreach T > 0 by a bound that depends only on T . From (5.1.1) and (5.2.1) itthen follows that the corresponding ‖x(t)‖ remains bounded on [0, T ] with abound that depends only on T . By the Arzela–Ascoli theorem, it follows thatQx0 is relatively compact in C([0,∞);Rd). Set y(·) = x(·) and write

x(t) = x0 +∫ t

0

y(s)ds, t ≥ 0.

Now argue as in the proof of Theorem 2 to show that any limit point (x(·), y(·))in C([0,∞);Rd) of Qx0 will also satisfy this equation with y(t) ∈ h(x(t)) a.e.,proving that Qx0 is closed. Next consider xn → x∞ and xn(·) ∈ Qxn

forn ≥ 1. Since xn, n ≥ 1 is in particular a bounded set, argue as aboveto conclude that xn(·), n ≥ 1 is relatively compact in C([0,∞);Rd). Anargument similar to that used to prove Theorem 2 can be used once more, toshow that any limit point is in Qx∞ . This proves the upper semicontinuity ofthe map x ∈ Rd → Qx ⊂ C([0,∞);Rd). ¥

The next result uses the obvious generalizations of the notions of invariantset and chain transitivity to differential inclusions. We shall say that a set B

is invariant (resp. positively / negatively invariant) under (5.2.1) if for x ∈ B,there is some trajectory x(t), t ∈ (−∞,∞) (resp. [0,∞) / (−∞, 0]), that liesentirely in B. Note that we do not require this of all trajectories of (5.2.1)passing through x at time 0. That requirement would define a stronger notionof invariance that we shall not be using here. See Benaim, Hofbauer and Sorin(2003) for various notions of invariance for differential inclusions.

Corollary 4. Under (5.2.2), xn generated by (5.1.2) converge a.s. to a closedconnected internally chain transitive invariant set of (5.2.1).

Proof. From the foregoing, we know that xn will a.s. converge to Adef=

∩t≥0x(t + s) : s ≥ 0. The proof that A is invariant and that for any δ > 0,x(t + ·) is the the open δ-neighbourhood Aδ of A for t sufficiently large issimilar to that of Theorem 2 of Chapter 2 where similar claims are established.It is therefore omitted. To prove internal chain transitivity of A, let x1, x2 ∈ A

and ε, T > 0. Pick ε/4 > δ > 0. Pick n0 > 1 such that n ≥ n0 implies that fors ≥ t(n), x(s + ·) ∈ Aδ and furthermore,

supt∈[s,s+2T ]

||x(t)− xs(t)|| < δ

for some solution xs(·) of (5.2.1) in A. Pick n2 > n1 ≥ n0 such that ||x(t(ni))−xi|| < δ, i = 1, 2. Let kT ≤ t(n2) − t(n1) < (k + 1)T for some integer k ≥ 0.Let s(0) = t(n1), s(i) = s(0) + iT for 1 ≤ i < k, and s(k) = t(n2). Then for0 ≤ i < k, supt∈[s(i),s(i+1)] ||x(t) − xs(i)(t)|| < δ. Pick xi, 0 ≤ i ≤ k, in G such

Page 65: Stochastic Approximation: A Dynamical Systems Viewpoint

56 Stochastic Recursive Inclusions

that x1 = x1, xk = x2, and for 0 < i < k, xi are in the δ-neighbourhood ofx(s(i)). The sequence (s(i), xi), 0 ≤ i ≤ k, satisfies the definition of internalchain transitivity. ¥

The invariance of A is in fact implied by its internal chain transitivity asshown in Benaim, Hofbauer and Sorin (2003), so it need not be separatelyestablished.

5.3 Applications

In this section we consider four applications of the foregoing. A fifth importantone is separately dealt with in the next section. We start with a useful technicallemma. Let co(· · · ) stand for ‘the closed convex hull of · · · ’.Lemma 5. Let f : x ∈ Rd → f(x) ⊂ Rd be an upper semicontinuous set-valued map such that f(x) is compact for all x and sup||f(x)|| : ||x|| < M isbounded for all M > 0. Then the set-valued map x → co(f(x)) is also uppersemicontinuous.

Proof. Let xn → x, yn → y, in Rd such that yn ∈ co(f(xn)) for n ≥ 1. Thenby Caratheodory’s theorem (Theorem 17.1, p. 155, of Rockafellar, 1970), thereexist an

0 , . . . and ∈ [0, 1] with

∑i an

i = 1, and yn0 , . . . , yn

d ∈ f(xn), not necessarilydistinct, so that ||yn −

∑di=0 an

i yni || < 1

n for n ≥ 1. By dropping to a suitablesubsequence, we may suppose that an

i → ai ∈ [0, 1] ∀i and yni → yi ∈ f(x) ∀i.

(Here we use the hypotheses that f is upper semicontinuous and bounded oncompacts.) Then

∑i ai = 1 and y =

∑i aiyi ∈ co(f(x)), which proves the

claim. ¥

We now list four instances of stochastic recursive inclusions.

(i) Controlled stochastic approximation: Consider the iteration

xn+1 = xn + a(n)[g(xn, un) + Mn+1],

where un is a random sequence taking values in a compact metricspace U , a(n) are as before, Mn satisfies the usual conditions w.r.t.the σ-fields Fn

def= σ(xm, um,Mm,m ≤ n), n ≥ 0, and g : Rd × U → Rd

is continuous and Lipschitz in the first argument uniformly w.r.t. thesecond. We view un as a control process. That is, un is chosen bythe agent running the algorithm at time n ≥ 0 based on the observedhistory and possibly extraneous independent randomization, as in theusual stochastic control problems. It could, however, also be an unknownrandom process in addition to Mn that affects the measurements. Theidea then is to analyze the asymptotic behaviour of the iterations forarbitrary un that fit the above description. It then makes sense to

Page 66: Stochastic Approximation: A Dynamical Systems Viewpoint

5.3 Applications 57

define h(x) = co(g(x, u) : u ∈ U). The above iteration then becomesa special case of (1). The three conditions stipulated for the set-valuedmap are easily verified in this particular case by using Lemma 5.

(ii) Stochastic subgradient descent: Consider a continuously differentiableconvex function f : Rd →R which one aims to minimize based on noisymeasurements of its gradients. That is, at any point x ∈ Rd, one canmeasure ∇f(x) + an independent copy of a zero mean random variable.Then the natural scheme to explore would be

xn+1 = xn + a(n)[−∇f(xn) + Mn+1],

where Mn is the i.i.d. (more generally, a martingale difference) mea-surement noise and the expression in square brackets on the right repre-sents the noisy measurement of the gradient. This is a special case of the‘stochastic gradient scheme’ we shall discuss in much more detail laterin the book. Here we are interested in an extension of the scheme thatone encounters when f is not continuously differentiable everywhere, sothat ∇f is not defined at all points. It turns out that a natural gener-alization of ∇f to this ‘non-smooth’ case is the subdifferential ∂f . Thesubdifferential ∂f(x) at x is the set of all y satisfying

f(z) ≥ f(x) + 〈y, z − x〉

for all z ∈ Rd. It is clear that it will be a closed convex set. It canalso be shown to be compact nonempty (Theorem 23.4, p. 217, of Rock-afellar, 1970) and upper semicontinuous as a function of x. Assume thelinear growth property stipulated in (5.1.1) for h = −∂f . Thus we mayreplace the above stochastic gradient scheme by equation (5.1.2) withthis specific choice of h, which yields the stochastic subgradient descent.

Note that the closed convex set of minimizers of f , if nonempty, iscontained in M = x ∈ Rd : θ ∈ ∂f, where θ denotes the zero vectorin Rd. It is then easy to see that at any point on the trajectory of (5.2.1)lying outsideM, f(x(t)) must be strictly decreasing. Thus any invariantset for (5.2.1) must be contained in M. Corollary 4 then implies thatxn →M a.s.

(iii) Approximate drift: We may refer to the function h on the right-handside of (2.1.1), Chapter 2, as the ‘drift’. In many cases, there is adesired drift h that we wish to implement, but we have at hand onlyan approximation of it. One common situation is when the negativegradient in the stochastic gradient scheme, i.e., h = −∇f for somecontinuously differentiable f : Rd →R, is not explicitly available but isknown only approximately (e.g., as a finite difference approximation).

Page 67: Stochastic Approximation: A Dynamical Systems Viewpoint

58 Stochastic Recursive Inclusions

In such cases, the actual iteration being implemented is

xn+1 = xn + a(n)[h(xn) + ηn + Mn+1],

where ηn is an error term. Suppose the only available information aboutηn is that ||ηn|| ≤ ε ∀n for a known ε > 0. In this case, one mayanalyze the iteration as a stochastic recursive inclusion (5.1.2) with

yn ∈ h(xn) def= h(xn) + B(ε)

= y : ‖h(xn)− y‖ ≤ ε.Here B(ε) is the closed ε-ball centered at the origin. An importantspecial case when one can analyze its asymptotic behaviour to someextent is the case when there exists a globally asymptotically stableattractor H for the associated o.d.e. (2.1.5) of Chapter 2.

Theorem 6. Under (5.2.2), given any δ > 0, there exists an ε0 > 0 suchthat for all ε ∈ (0, ε0), xn above converge a.s. to the δ-neighbourhoodof H.

Proof. For γ > 0, define

Hγ def= x ∈ Rd : infy∈H

||x− y|| < γ.

Fix a sample path where (5.2.2) and Lemma 1 hold. Pick T > 0 largeenough that for any solution x(·) of the o.d.e.

x(t) = h(x(t))

for which ||x(0)|| ≤ Cdef= supn ‖xn‖, we have x(t) ∈ Hδ/3 for t ≥ T . Let

xs(·) be as before and let xs(·) denote the solution of the above o.d.e.for t ≥ s with xs(s) = xs(s). Then a simple application of the Gronwallinequality shows that for ε0 > 0 sufficiently small, ε ≤ ε0 implies

supt∈[s,s+T ]

||xs(t)− xs(t)|| < δ/3.

In particular, xs(T ) ∈ H2δ/3. Hence, since s > 0 was arbitrary, itfollows by Lemma 1 that for sufficiently large s, x(s + T ) ∈ Hδ, i.e., forsufficiently large s, x(s) ∈ Hδ. ¥

(iv) Discontinuous dynamics: Consider

xn+1 = xn + a(n)[g(xn) + Mn+1],

where g : Rd → Rd is merely measurable but satisfies a linear growthcondition: ||g(x)|| ≤ K(1 + ||x||) for some K > 0. Define h(x) def=⋂

ε>0 co(g(y) : ||y− x|| < ε). Then the above iteration may be viewed

Page 68: Stochastic Approximation: A Dynamical Systems Viewpoint

5.4 Projected stochastic approximation 59

as a special case of (5.1.2). The three properties stipulated for h abovecan be verified in a straightforward manner. Note that the differentialinclusion limit in this case is one of the standard solution concepts fordifferential equations with discontinuous right hand sides – see, e.g., p.50, Filippov (1988).

In fact, continuous g which is not Lipschitz can be viewed as a specialcase of stochastic recursive inclusions. Here h(x) is always the singletong(x) and (5.2.1) reduces to an o.d.e. x(t) = g(x(t)). The catch is thatin absence of the Lipschitz condition, the existence of a solution to thiso.d.e. is guaranteed, but not its uniqueness, which may fail at all points(see, e.g., Chapter II of Hartman, 1982). Thus the weaker claims abovewith the suitably weakened notion of invariant sets apply in place of theresults of Chapter 2.

As for the invariant sets of (5.2.1), it is often possible to characterize them us-ing a counterpart for differential inclusions of the Liapunov function approach,as in Corollary 3 of Chapter 2. See Chapter 6 of Aubin and Cellina (1980) fordetails.

5.4 Projected stochastic approximation

These are stochastic approximation iterations that are forced to remain in somebounded set G by being projected back to G whenever they go out of G. Thisavoids the stability issue altogether, now that the iterates are forced to remainin a bounded set. But it can lead to other complications as we see below, sosome care is needed in using this scheme.

Thus iteration (2.1.1) of Chapter 2 is replaced by

xn+1 = Γ(xn + a(n)[h(xn) + Mn+1]), (5.4.1)

where Γ(·) is a projection to a prescribed compact set G. That is, Γ = theidentity map for points in the interior of G, and maps a point outside G tothe point in G closest to it w.r.t. the Euclidean distance. (Sometimes someother equivalent metric may be more convenient, e.g., the max-norm ||x||∞ def=maxi |xi|.) The map Γ need not be single-valued in general, but it is whenG is convex. This is usually the case in practice. Even otherwise, as long asthe boundary ∂G of G is reasonably well-behaved, Γ will be single-valued forpoints outside G that are sufficiently close to ∂G. Again, this is indeed usuallythe case for our algorithm because our stepsizes a(n) are small, at least forlarge n, and thus the iteration cannot move from a point inside G to a pointfar from G in a single step. Hence assuming that Γ is single-valued is not aserious restriction.

Page 69: Stochastic Approximation: A Dynamical Systems Viewpoint

60 Stochastic Recursive Inclusions

First consider the simple case when ∂G is smooth and Γ is Frechet differen-tiable, i.e., there exists a linear map Γx(·) such that the limit

γ(x; y) def= limδ↓0

Γ(x + δy)− x

δ

exists and equals Γx(y). (This will be the identity map for x in the interior ofG.) In this case (5.4.1) may be rewritten as

xn+1 = xn + a(n)Γ(xn + a(n)[h(xn) + Mn+1])− xn

a(n)= xn + a(n)[Γxn(h(xn)) + Γxn(Mn+1) + o(a(n))].

This iteration is similar to the original stochastic approximation scheme (2.1.1)with h(xn) and Mn+1 replaced resp. by Γxn

(h(xn)) and Γxn(Mn+1), with an

additional error term o(a(n)). Suppose the map x → Γx(h(x)) is Lipschitz.If we mimic the proofs of Lemma 1 and Theorem 2 of Chapter 2, the o(a(n))term will be seen to contribute an additional error term of order o(a(n)T ) tothe bound on supt∈[s,s+T ] ||x(t)− xs(t)||, where n is such that t(n) = [s]. Thiserror term tends to zero as s →∞. Thus the same proof as before establishesthat the conclusions of Theorem 2 of Chapter 2 continue to hold, but with theo.d.e. (2.1.5) of Chapter 2 replaced by the o.d.e.

x(t) = Γx(t)(h(x(t))). (5.4.2)

If x → Γx(h(x)) is merely continuous, then the o.d.e. (5.4.2) will have possiblynon-unique solutions for any initial condition and the set of solutions as afunction of initial condition will be a compact-valued upper semicontinous set-valued map. An analog of Theorem 2 of Chapter 2 can still be established withthe weaker notion of invariance introduced just before Corollary 4 above. Ifthe map is merely measurable, we are reduced to the ‘discontinuous dynamics’scenario discussed above.

Unlike the convexity of G, the requirement that ∂G be smooth is, however,a serious restriction. This is because it fails in many simple cases such as whenG is a polytope, which is a very common situation. Thus there is a need toextend the foregoing to cover such cases. This is where the developments ofsection 5.3 come into the picture.

This analysis may fail in two ways. The first is that the limit γ(x; y) abovemay be undefined. In most cases arising in applications, however, this is notthe problem. The difficulty usually is that the limit does exist but does notcorrespond to the evaluation of a linear map Γx as stipulated above. That is,γ(x; y) exists as a directional derivative of Γ at x in the direction y, but Γ isnot Frechet differentiable. Thus we have the iteration

xn+1 = xn + a(n)[γ(xn; h(xn) + Mn+1) + o(a(n))]. (5.4.3)

Page 70: Stochastic Approximation: A Dynamical Systems Viewpoint

5.4 Projected stochastic approximation 61

There is still some hope of an o.d.e. limit if Mn+1 is conditionally independentof Fn given xn. In this case, let its regular conditional law given xn be denotedby µ(x, dy). Then the above may be rewritten as

xn+1 = xn + a(n)[h(xn) + Mn+1 + o(a(n))],

where

h(xn) def=∫

µ(xn, dy)γ(xn;h(xn) + y),

and

Mn+1def= γ(xn; h(xn) + Mn+1)− h(xn).

To obtain this, we have simply added and subtracted on the right-hand sideof (5.4.3) the one-step conditional expectation of γ(xn; h(xn) + Mn+1). Theadvantage is that the present expression is in the same format as (2.2.1) modulothe o(a(n)) term whose contribution is asymptotically negligible. Thus in thespecial case when h turns out to be Lipschitz, one has the o.d.e. limit

x(t) = h(x(t)),

with the associated counterpart of Theorem 2 of Chapter 2.This situation, however, is very special. Usually h is only measurable. Then

this reduces to the case of ‘discontinuous dynamics’ studied in the precedingsection and can be analyzed in that framework by means of an appropriatelimiting differential inclusion.

In the case when one is saddled with only (5.4.3) and nothing more, let

yndef= E[γ(xn; h(xn) + Mn+1)|Fn]

and

Mn+1def= (γ(xn; h(xn) + Mn+1)− yn)

for n ≥ 0. We then recover (5.1.2) above with Mn+1 replacing Mn+1. Supposethere exists a set-valued map, denoted by x → Γx(h(x)) to suggest a kinshipwith Γ above, such that

yn ∈ Γxn(h(xn)) a.s.,

and suppose that it satisfies the three conditions stipulated in section 5.1, viz.,it is compact convex valued and upper semicontinuous with bounded range oncompacts such that (5.1.1) holds. Then the analysis of section 5.1 applies withthe limiting differential inclusion

x(t) ∈ Γx(t)(h(x(t))). (5.4.4)

Page 71: Stochastic Approximation: A Dynamical Systems Viewpoint

62 Stochastic Recursive Inclusions

For example, if the support of the conditional distribution of Mn+1 given Fn

is a closed bounded set A(xn) depending on xn, one may consider

Γx(h(x)) def=⋂ε>0

co(⋃

||z−x||<ε

γ(z;h(z) + y) : y ∈ A(z)).

In many situations, ∂G is smooth except along a ‘thin’ set consisting of aunion of surfaces (submanifolds) one or more dimensions lower than ∂G itself.(This is the case, e.g., when G is a polytope, when ∂G is a union of its faces andis nonsmooth at the boundaries of these faces.) If h and the noise Mn are suchthat these parts of the boundary ‘repel’ the iterates the way unstable equilibriawere seen to do in Chapter 4, then one can still work with the limiting o.d.e.(5.4.2), ignoring the region where it is not justified. Another scenario whenthings simplify is when the offending part of ∂G is ‘thin’ in the above senseand everywhere along this set the possible directions stipulated by the right-hand side of (5.4.4) are such that any solution of (5.4.4) spends zero net timein this set. In this case, (5.4.4) becomes the same as (5.4.2) interpreted in theCaratheodory sense (see pp. 3–4 of Filippov, 1988).

The second major concern in projected algorithms is the possibility of spuri-ous equilibria or other invariant sets on ∂G, i.e., equilibria or invariant sets for(5.4.2) or (5.4.4) that are not equilibria or invariant sets for the o.d.e. (2.1.5).For example, if h(x) is directed along the outward normal at some x ∈ ∂G

and ∂G is smooth in a neighbourhood of x, then x can be a spurious stableequilibrium for the limiting projected o.d.e. These spurious equilibria will bepotential asymptotic limit sets for the projected scheme in view of Corollary4. Thus their presence can lead to convergence of (5.1.2) to undesired pointsor sets. This has to be avoided where possible by using any prior knowledgeavailable to choose G properly. Another possibility is the following: Supposewe consider a parametrized family of candidate G, say closed balls of radius r

centered at the origin. Suppose such problems arise only for r belonging to aLebesgue-null set. Then we may choose G randomly at each iterate accordingto some Lebesgue-continuous density for r in a neighbourhood of a nominalvalue r = r0 fixed beforehand. A further possibility, due to Chen (1994, 1998),is to start with a specific G and slowly increase it to the whole of Rd. We shallrevisit this scheme in the next chapter.

It is also possible that the new equilibria or invariant sets on ∂G thus intro-duced correspond in fact to desired equilibria or invariant sets lying outside G

or at ∞. In this case, the former may be viewed as approximations of the latterand thus may in fact be the desired limit sets for the projected algorithm.

In case the limit in the definition of γ(·; ·) above is not even well-defined, onecan consider the set of all limit points therein and build a stochastic recursiveinclusion around that. We ignore this possibility as it does not seem very usefulin applications.

Page 72: Stochastic Approximation: A Dynamical Systems Viewpoint

5.4 Projected stochastic approximation 63

On the flip side, there are situations where the projected dynamics is in factwell-posed, see, e.g., Dupuis and Nagurney (1993).

Page 73: Stochastic Approximation: A Dynamical Systems Viewpoint

6

Multiple Timescales

6.1 Two timescales

In the preceding chapters we have used a fixed stepsize schedule a(n) forall components of the iterations in stochastic approximation. In the ‘o.d.e.approach’ to the analysis of stochastic approximation, these are viewed as dis-crete nonuniform time steps. Thus one can conceive of the possibility of usingdifferent stepsize schedules for different components of the iteration, which willthen induce different timescales into the algorithm. We shall consider the caseof two timescales first, following Borkar (1996). Thus we are interested in theiterations

xn+1 = xn + a(n)[h(xn, yn) + M(1)n+1], (6.1.1)

yn+1 = yn + b(n)[g(xn, yn) + M(2)n+1], (6.1.2)

where h : Rd+k → Rd, g : Rd+k → Rk are Lipschitz and M (1)n , M (2)

n aremartingale difference sequences w.r.t. the increasing σ-fields

Fndef= σ(xm, ym,M1

m,M2m,m ≤ n), n ≥ 0,

satisfying

E[||M in+1||2|Fn] ≤ K(1 + ||xn||2 + ||yn||2), i = 1, 2,

for n ≥ 0. Stepsizes a(n), b(n) are positive scalars satisfying∑

n

a(n) =∑

n

b(n) = ∞,∑

n

(a(n)2 + b(n)2) < ∞,b(n)a(n)

→ 0.

The last condition implies that b(n) → 0 at a faster rate than a(n), imply-ing that (6.1.2) moves on a slower timescale than (6.1.1). Examples of suchstepsizes are a(n) = 1

n , b(n) = 11+n log n , or a(n) = 1

n2/3 , b(n) = 1n , and so on.

64

Page 74: Stochastic Approximation: A Dynamical Systems Viewpoint

6.1 Two timescales 65

It is instructive to compare this coupled iteration to the singularly perturbedo.d.e.

x(t) =1εh(x(t), y(t)), (6.1.3)

y(t) = g(x(t), y(t)), (6.1.4)

in the limit ε ↓ 0. Thus x(·) is a fast transient and y(·) the slow component.It then makes sense to think of y(·) as quasi-static (i.e., ‘almost a constant’)while analyzing the behaviour of x(·). This suggests looking at the o.d.e.

x(t) = h(x(t), y), (6.1.5)

where y is held fixed as a constant parameter. Suppose that:

(A1) (6.1.5) has a globally asymptotically stable equilibrium λ(y) (uniformlyin y), where λ : Rk →Rd is a Lipschitz map.

Then for sufficiently small values of ε we expect x(t) to closely track λ(y(t))for t > 0. In turn this suggests looking at the o.d.e.

y(t) = g(λ(y(t)), y(t)), (6.1.6)

which should capture the behaviour of y(·) in (6.1.4) to a good approximation.Suppose that:

(A2) The o.d.e. (6.1.6) has a globally asymptotically stable equilibrium y∗.

Then we expect (x(t), y(t)) in (6.1.3)–(6.1.4) to approximately converge to(i.e., converge to a small neighbourhood of) the point (λ(y∗), y∗).

This intuition indeed carries over to the iterations (6.1.1)–(6.1.2). Thus(6.1.1) views (6.1.2) as quasi-static while (6.1.2) views (6.1.1) as almost equi-librated. The motivation for studying this set-up comes from the followingconsiderations. Suppose that an iterative algorithm calls for a particular sub-routine in each iteration. Suppose also that this subroutine itself is anotheriterative algorithm. The traditional method would be to use the output of thesubroutine after running it ‘long enough’ (i.e., until near-convergence) duringeach iterate of the outer loop. But the foregoing suggests that we could getthe same effect by running both the inner and the outer loops (i.e., the cor-responding iterations) concurrently, albeit on different timescales. Then theinner ‘fast’ loop sees the outer ‘slow’ loop as quasi-static while the latter seesthe former as nearly equilibrated. We shall see applications of this later in thebook.

We now take up the formal convergence analysis of the two-timescale scheme(6.1.1)–(6.1.2) under the stability assumption:

Page 75: Stochastic Approximation: A Dynamical Systems Viewpoint

66 Multiple Timescales

(A3) supn(||xn||+ ||yn||) < ∞, a.s.

Assume (A1)–(A3) above.

Lemma 1. (xn, yn) → (λ(y), y) : y ∈ Rk a.s.

Proof. Rewrite (6.1.2) as

yn+1 = yn + a(n)[εn + M(3)n+1], (6.1.7)

where εndef= b(n)

a(n)g(xn, yn) and M(3)n+1

def= b(n)a(n)M

(2)n+1 for n ≥ 0. Consider the

pair (6.1.1), (6.1.7) in the framework of the third ‘extension’ listed at thestart of section 2.2. By the observations made there, it then follows that(xn, yn) converges to the internally chain transitive invariant sets of the o.d.e.x(t) = h(x(t), y(t)), y(t) = 0. The claim follows. ¥

In other words, ‖xn − λ(yn)‖ → 0 a.s., that is, xn asymptotically ‘track’λ(yn), a.s.

Theorem 2. (xn, yn) → (λ(y∗), y∗) a.s.

Proof. Let s(0) = 0 and s(n) =∑n−1

i=0 b(i) for n ≥ 1. Define the piecewise linearcontinuous function y(t), t ≥ 0, by y(s(n)) = yn, with linear interpolation oneach interval [s(n), s(n + 1)], n ≥ 0. Let ψn

def=∑n−1

m=0 b(m)M (2)m+1, n ≥ 1. Then

arguing as for ζn in Chapter 2, ψn is an a.s. convergent square-integrablemartingale. Let [t]′ def= maxs(n) : s(n) ≤ t, t ≥ 0. Then for n,m ≥ 0,

y(s(n + m)) = y(s(n)) +∫ s(n+m)

s(n)

g(λ(y(t)), y(t))dt

+∫ s(n+m)

s(n)

(g(λ(y([t]′)), y([t]′))− g(λ(y(t)), y(t)))dt

+m−1∑

k=1

b(n + k)(g(xn+k, yn+k)− g(λ(yn+k), yn+k))

+ (ψn+m+1 − ψn).

For s ≥ 0, let ys(t), t ≥ s, denote the trajectory of (6.1.6) with ys(s) = y(s).Using the Gronwall inequality as in the proof of Lemma 1 of Chapter 2, weobtain, for T > 0,

supt∈[s,s+T ]

||y(t)− ys(t)|| ≤ KT (I + II + III),

where KT > 0 is a constant depending on T and

(i) ‘I’ is the ‘discretization error’ contributed by the third term on theright-hand side above, which is O(

∑k≥n b(k)2) a.s.,

Page 76: Stochastic Approximation: A Dynamical Systems Viewpoint

6.2 Averaging the natural timescale: preliminaries 67

(ii) ‘II’ is the ‘error due to noise’ contributed by the fifth term on the right-hand side above, which is O(supk≥n ||ψk − ψn||) a.s., and

(iii) ‘III’ is the ‘tracking error’ contributed by the fourth term on the right-hand side above, which is O(supk≥n ||xk − λ(yk)||) a.s.

Since all three errors tend to zero a.s. as s →∞,

supt∈[s,s+T ]

||y(t)− ys(t)|| → 0, a.s.

Arguing as in the proof of Theorem 2 of Chapter 2, we get yn → y∗ a.s. ByLemma 5.1, xn → λ(y∗) a.s. This completes the proof. ¥

The same general scheme can be extended to three or more timescales. Thisextension, however, is not as useful as it may seem, because the convergenceanalysis above captures only the asymptotic ‘mean drift’ for (6.1.1)–(6.1.2),not the fluctuations about the mean drift. Unless the timescales are reason-ably separated, the behaviour of the coupled scheme (6.1.1)–(6.1.2) will notbe very graceful. At the same time, if the timescales are greatly separated,that separation may render either the fast timescale too fast (increasing bothdiscretization error and noise-induced error because of larger stepsizes), or theslow timescale too slow (slowing down the convergence because of smaller step-sizes), or both. This difficulty becomes more pronounced the larger the numberof timescales involved.

Another, less elegant way of achieving the two-timescale effect would be torun (6.1.2) also with stepsizes a(n), but along a subsample n(k) of timeinstants that become increasingly rare (i.e., n(k +1)−n(k) →∞) and keepingits values constant between these instants. That is,

yn(k)+1 = yn(k) + a(n(k))[g(xn(k), yn(k)) + M(2)n(k)+1],

with yn+1 = yn ∀n /∈ n(k). In practice it has been found that a good policy isto run (6.1.2) with a slower stepsize schedule b(n) as above and also updateit along a subsequence nN, n ≥ 0 for a suitable integer N > 1, keeping itsvalues constant in between (S. Bhatnagar, personal communication).

6.2 Averaging the natural timescale: preliminaries

Next we consider a situation wherein the stochastic approximation iterationsare also affected by another process Yn running in the background on thetrue or ‘natural’ timescale which corresponds to the time index ‘n’ itself thattags the iterations. Given our viewpoint of a(n) as time steps, since a(n) → 0the algorithm runs on a slower timescale than Yn and thus should see the‘averaged’ effects of the latter. We make this intuition precise in what follows.This section, which, along with the next section, is based on Borkar (2006),

Page 77: Stochastic Approximation: A Dynamical Systems Viewpoint

68 Multiple Timescales

builds up the technical infrastructure for the main results to be presented in thenext section. This development requires the background material summarizedin Appendix C on spaces of probability measures on metric spaces.

Specifically, we consider the iteration

xn+1 = xn + a(n)[h(xn, Yn) + Mn+1], (6.2.1)

where Yn is a random process taking values in a complete separable met-ric space S with dynamics we shall soon specify, and h : Rd × S → Rd isjointly continuous in its arguments and Lipschitz in its first argument uni-formly w.r.t. the second. Mn is a martingale difference sequence w.r.t. theσ-fields Fn

def= σ(xm, Ym,Mm,m ≤ n), n ≥ 0. Stepsizes a(n) are as before,with the additional condition that they be eventually nonincreasing.

We shall assume that Yn is an S-valued controlled Markov process withtwo control processes: xn above and another random process Zn takingvalues in a compact metric space U . Thus

P (Yn+1 ∈ A|Ym, Zm, xm,m ≤ n) =∫

A

p(dy|Yn, Zn, xn), n ≥ 0, (6.2.2)

for A Borel in S, where (y, z, x) ∈ S×U×Rd → p(dw|y, z, x) ∈ P(S) is a contin-uous map specifying the controlled transition probability kernel. (Here and inwhat follows, P(· · · ) will denote the space of probability measures on the com-plete separable metric space ‘· · · ’ with Prohorov topology – see, e.g., AppendixC.) We assume that the continuity in the x variable is uniform on compactsw.r.t. the other variables. We shall say that Zn is a stationary control ifZn = v(Yn) ∀n for some measurable v : S → U , and a stationary randomizedcontrol if for each n the conditional law of Zn given (Ym, xm, Zm−1,m ≤ n) isϕ(Yn) for a fixed measurable map ϕ : y ∈ S → ϕ(y) = ϕ(y, dz) ∈ P(U) inde-pendent of n. Thus, in particular, Zn will then be conditionally independentof (Ym−1, xm, Zm−1,m ≤ n) given Yn for n ≥ 0. By abuse of terminology, weidentify the stationary (resp. stationary randomized) control above with themap v(·) (resp. ϕ(·)). Note that the former is a special case of the latter forϕ(·) = δv(·), where δx denotes the Dirac measure at x.

If xn = x ∀n for a fixed deterministic x ∈ Rd, then Yn will be a time-homogeneous Markov process under any stationary randomized control ϕ. Itstransition kernel will be

px,ϕ(dw|y) =∫

p(dw|y, z, x)ϕ(y, dz).

Suppose that this Markov process has a (possibly nonunique) invariant prob-ability measure ηx,ϕ(dy) ∈ P(S). Correspondingly we define the ergodic occu-pation measure

Ψx,ϕ(dy, dz) def= ηx,ϕ(dy)ϕ(y, dz) ∈ P(S × U).

Page 78: Stochastic Approximation: A Dynamical Systems Viewpoint

6.2 Averaging the natural timescale: preliminaries 69

This is the stationary law of the state-control pair when the stationary ran-domized control ϕ is used and the initial distribution is ηx,ϕ. It clearly satisfiesthe equation

S

f(y)dΨx,ϕ(dy, U) =∫

S

U

f(w)p(dw|y, z, x)dΨx,ϕ(dy, dz) (6.2.3)

for bounded continuous f : S →R. Conversely, if some Ψ ∈ P(S×U) satisfies(6.2.3) for f belonging to any set of bounded continuous functions S →R thatseparates points of P(S), then it must be of the form Ψx,ϕ for some stationaryrandomized control ϕ. (In particular, countable subsets of Cb(S) that separatepoints of P(S) are known to exist – see Appendix C.) This is because we canalways decompose Ψ as

Ψ(dy, dz) = η(dy)ϕ(y, dz)

with η and ϕ denoting resp. the marginal on S and the regular conditional lawon U . Since ϕ(·) is a measurable map S → P(U), it can be identified witha stationary randomized control. (6.2.3) then implies that η is an invariantprobability measure under the ‘stationary randomized control’ ϕ.

We denote by D(x) the set of all such ergodic occupation measures for theprescribed x. Since (6.2.3) is preserved under convex combinations and con-vergence in P(S × U), D(x) is closed and convex. We also assume that it iscompact. Once again, using the fact that (6.2.3) is preserved under conver-gence in P(S×U), it follows that if x(n) → x in Rd and Ψn → Ψ in P(S×U)with Ψn ∈ D(x(n)) ∀n, then Ψ ∈ D(x), implying upper semicontinuity of theset-valued map x → D(x).

Define t(n), x(·) as before. We define a P(S × U)-valued random processµ(t) = µ(t, dydz), t ≥ 0, by

µ(t) def= δ(Yn,Zn), t ∈ [t(n), t(n + 1)),

for n ≥ 0. This process will play an important role in our analysis of (6.2.1).Also define for t > s ≥ 0, µt

s ∈ P(S × U × [s, t]) by

µts(A×B) def=

1t− s

B

µ(y, A)dy

for A,B Borel in S × U, [s, t] resp. Similar notation will be followed for otherP(S × U)-valued processes. Recall that S being a complete separable metricspace, it can be homeomorphically embedded as a dense subset of a compactmetric space S. (See Theorem 1.1.1, p. 2 in Borkar, 1995.) As any probabilitymeasure on S × U can be identified with a probability measure on S × U

that assigns zero probability to (S − S) × U , we may view µ(·) as a randomvariable taking values in U def= the space of measurable functions ν(·) = ν(·, dy)from [0,∞) to P(S ×U). This space is topologized with the coarsest topology

Page 79: Stochastic Approximation: A Dynamical Systems Viewpoint

70 Multiple Timescales

that renders continuous the maps ν(·) ∈ U → ∫ T

0g(t)

∫fdν(t)dt ∈ R for all

f ∈ C(S), T > 0 and g ∈ L2[0, T ]. We shall assume that:

(*) For f ∈ C(S), the function

(y, z, x) ∈ S × U ×Rd →∫

f(w)p(dw|y, z, x)

extends continuously to S × U ×Rd.

Later on we see a specific instance of how this might come to be, viz., inthe Euclidean case. With a minor abuse of notation, we retain the originalnotation f to denote this extension. Finally, we denote by U0 ⊂ U the subsetµ(·) ∈ U :

∫S×U

µ(t, dydz) = 1 ∀t with the relative topology.

Lemma 3. U is compact metrizable.

Proof. For N ≥ 1, let eNi (·), i ≥ 1 denote a complete orthonormal basis for

L2[0, N ]. Let fj be countable dense in the unit ball of C(S). Then it is aconvergence determining class for P(S) (cf. Appendix C). It can then be easilyverified that

d(ν1(·), ν2(·)) def=∑

N≥1

i≥1

j≥1

2−(N+i+j) ||∫ N

0

eNi (t)

∫fjdν1(t)dt

−∫ N

0

eNi (t)

∫fjdν2(t)dt|| ∧ 1

defines a metric on U consistent with its topology. To show sequential compact-ness, take νn(·) ⊂ U . Recall that

∫fjdνn(·)|[0,N ], j, n,N ≥ 1, are bounded

and therefore relatively sequentially compact in L2[0, N ] endowed with theweak topology. Thus we may use a diagonal argument to pick a subsequenceof n, denoted by n again by abuse of terminology, such that for each j

and N ,∫

fjdνn(·)|[0,N ] → αj(·)|[0,N ] weakly in L2[0, N ] for some real-valuedmeasurable functions αj(·), j ≥ 1 on [0,∞) satisfying αj(·)|[0,N ] ∈ L2[0, N ]∀ N ≥ 1. Fix j, N . Mimicking the proof of the Banach-Saks theorem (Theo-rem 1.8.4 of Balakrishnan (1976)), let n(1) = 1 and pick n(k) inductively tosatisfy

∞∑

j=1

2−j max1≤m<k

|∫ N

0

(∫

fjdνn(k)(t)− αj(t))(∫

fjdνn(m)(t)− αj(t))dt)| < 1k

.

This choice is possible because∫

fjdνn(·)|[0,N ] → αj(·)|[0,N ] weakly in L2[0, N ].Denote by || · ||2 and 〈·, ·〉2 the norm and inner product in L2[0, N ]. Then for

Page 80: Stochastic Approximation: A Dynamical Systems Viewpoint

6.2 Averaging the natural timescale: preliminaries 71

j ≥ 1,

|| 1m

m∑

k=1

∫fjdνn(k)(·)− αj(·)||22

≤ 1m2

(2mN2 + 2m∑

i=2

i−1∑

`=1

|〈∫

fjdνn(i)(·)− αj(·),∫

fjdνn(`)(·)− αj(·)〉|)

≤ 2m2

[mN2 + 2j(m− 1)] → 0,

as m →∞. Thus1m

m∑

k=1

∫fjdνn(k)(·) → αj(·)

strongly in L2[0, N ] and hence a.e. along a subsequence m(`) of m. Fix at ≥ 0 for which this is true. P(S×U) is a compact space by Prohorov’s theorem– see Appendix C. Let ν′(t) be a limit point in P(S × U) of the sequence

1m(`)

m(`)∑

k=1

νn(k)(t),m ≥ 1.

Then αj(t) =∫

fjdν′(t) ∀j, implying that

[α1(t), α2(t), . . .] ∈ [∫

f1dν,

∫f2dν, . . .] : ν ∈ P(S × U)

a.e., where the ‘a.e.’ may be dropped by the choice of a suitable modification ofthe αj . By a standard measurable selection theorem (see, e.g., Wagner, 1977),it then follows that there exists a ν∗(·) ∈ U such that αj(t) =

∫fjdν∗(t) ∀t, j.

That is, d(νn(·), ν∗(·)) → 0. This completes the proof. ¥

We assume as usual the stability condition for xn: supn ||xn|| < ∞ a.s. Inaddition, we shall need the following ‘stability’ condition for Yn:

(†) Almost surely, for any t > 0, the set µs+ts , s ≥ 0 remains tight.

Note that while this statement involves both Yn and Zn via the definitionof µ(·), it is essentially a restriction only on Yn. This is because ∀n, Zn ∈ U ,which is compact. A sufficient condition for (†) when S = Rk will be discussedlater.

Define h(x, ν) def=∫

h(x, y)ν(dy, U) for ν ∈ P(S × U). For µ(·) as above,consider the non-autonomous o.d.e.

x(t) = h(x(t), µ(t)). (6.2.4)

Let xs(t), t ≥ s, denote the solution to (6.2.4) with xs(s) = x(s), for s ≥ 0.The following can then be proved along the lines of Lemma 1 of Chapter 2.

Page 81: Stochastic Approximation: A Dynamical Systems Viewpoint

72 Multiple Timescales

Lemma 4. For any T > 0, supt∈[s,s+T ] ||x(t)− xs(t)|| → 0, a.s.

We shall also need the following lemma. Let µn(·) → µ∞(·) in U0.

Lemma 5. Let xn(·), n = 1, 2, . . . ,∞, denote solutions to (6.2.4) correspondingto µ(·) replaced by µn(·), for n = 1, 2, . . . ,∞. Suppose xn(0) → x∞(0). Thenlimn→∞ supt∈[t0,t0+T ] ||xn(t)− x∞(t)|| → 0 for every t0, T > 0.

Proof. Take t0 = 0 for simplicity. By our choice of the topology for U0,∫ t

0

g(t)∫

fdµn(s)ds−∫ t

0

g(t)∫

fdµ∞(s)ds → 0

for bounded continuous g : [0, t] →R, f : S →R. Hence∫ t

0

∫f(s, ·)dµn(s)ds−

∫ t

0

∫f(s, ·)dµ∞(s)ds → 0

for all bounded continuous f : [0, t]× S → R of the form

f(s, w) =N∑

m=1

amgm(s)fm(w)

for some N ≥ 1, scalars ai and bounded continuous real-valued functions gi, fi

on [0, t], S resp., for 1 ≤ i ≤ N . By the Stone-Weierstrass theorem, such func-tions can uniformly approximate any f ∈ C([0, T ] × S). Thus the above con-vergence holds true for all such f , implying that t−1dµn(s)ds → t−1dµ∞(s)ds

in P(S × [0, t]) and hence in P(S × [0, t]). Thus in particular

||∫ t

0

(h(x∞(s), µn(s))− h(x∞(s), µ∞(s)))ds|| → 0.

As a function of t, the integral on the left is equicontinuous and pointwisebounded. By the Arzela-Ascoli theorem, this convergence must in fact beuniform for t in a compact set. Now for t > 0,

||xn(t)− x∞(t)|| ≤ ||xn(0)− x∞(0)||

+∫ t

0

||h(xn(s), µn(s))− h(x∞(s), µn(s))||ds

+ ||∫ t

0

(h(x∞(s), µn(s))− h(x∞(s), µ∞(s)))ds||

≤ ||xn(0)− x∞(0)||+ L

∫ t

0

||xn(s)− x∞(s)||ds

+ ||∫ t

0

(h(x∞(s), µn(s))− h(x∞(s), µ∞(s)))ds||.

Page 82: Stochastic Approximation: A Dynamical Systems Viewpoint

6.3 Averaging the natural timescale: main results 73

By the Gronwall inequality, there exists KT > 0 such that

supt∈[0,T ]

||xn(t)− x∞(t)||

≤ KT

(||xn(0)− x∞(0)||

+ supt∈[0,T ]

||∫ t

0

(h(x∞(s), µn(s))− h(x∞(s), µ∞(s)))ds||).

In view of the foregoing, this leads to the desired conclusion. ¥

6.3 Averaging the natural timescale: main results

The key consequence of (†) that we require is the following:

Lemma 6. Almost surely, every limit point of (µs+ts , xs(·)) for t > 0 as s →∞

is of the form (µt0, x(·)), where

• µ(·) satisfies µ(t) ∈ D(x(t)), and• x(·) satisfies (6.2.4) with µ(·) replaced by µ(·).Proof. Let fi be a countable set of bounded continuous functions S →R thatis a convergence determining class for P(S). By replacing each fi by aifi + bi

for suitable ai, bi > 0, we may suppose that 0 ≤ fi(·) ≤ 1 for all i. For each i,

ξin

def=n−1∑m=1

a(m)(fi(Ym+1)−∫

fi(w)p(dw|Ym, Zm, xm)),

is a zero mean martingale with supn E[||ξin||2] ≤

∑n a(n)2 < ∞. By the mar-

tingale convergence theorem (cf. Appendix C), it converges a.s. Let τ(n, s) def=minm ≥ n : t(m) ≥ t(n) + s for s ≥ 0, n ≥ 0. Then as n →∞,

τ(n,t)∑m=n

a(m)(fi(Ym+1)−∫

fi(w)p(dw|Ym, Zm, xm)) → 0, a.s.

for t > 0. By our choice of fi and the fact that a(n) are eventuallynonincreasing (this is the only time the latter property is used),

τ(n,t)∑m=n

(a(m)− a(m + 1))fi(Ym+1) → 0, a.s.

Thusτ(n,t)∑m=n

a(m)(fi(Ym)−∫

fi(w)p(dw|Ym, Zm, xm)) → 0, a.s.

Page 83: Stochastic Approximation: A Dynamical Systems Viewpoint

74 Multiple Timescales

Dividing by∑τ(n,t)

m=n a(m) ≥ t and using (∗) and the uniform continuity ofp(dw|y, z, x) in x on compacts, we obtain

∫ t(n)+t

t(n)

∫(fi(y)−

∫fi(w)p(dw|y, z, x(s))µ(s, dydz))ds → 0, a.s.

Fix a sample point in the probability one set on which the convergence aboveholds for all i. Let (µ(·), x(·)) be a limit point of (µ(s + ·), xs(·)) in U ×C([0,∞);Rd) as s →∞. Then the convergence above leads to

∫ t

0

∫(fi(y)−

∫fi(w)p(dw|y, z, x(s)))µ(s, dydz)ds = 0 ∀i. (6.3.1)

By (†), µt0(S×U×[0, t]) = 1 ∀t and thus it follows that µt

s(S×U×[s, t]) = 1 ∀t >

s ≥ 0. By Lebesgue’s theorem (see Appendix A), one then has µ(t)(S×U) = 1for a.e. t. A similar application of Lebesgue’s theorem in conjunction with(6.3.1) shows that

∫(fi(y)−

∫fi(w)p(dw|y, z, x(t)))µ(t, dydz) = 0 ∀i,

for a.e. t. The qualification ‘a.e. t’ here may be dropped throughout by choosinga suitable modification of µ(·). By our choice of fj, this leads to

µ(t, dw × U) =∫

p(dw|y, z, x(t))µ(t, dydz).

The claim follows from this and Lemma 5. ¥

Combining Lemmas 3 – 6 immediately leads to our main result:

Theorem 7. Almost surely, x(s + ·), s ≥ 0 converge to an internally chaintransitive invariant set of the differential inclusion

x(t) ∈ h(x(t)), (6.3.2)

as s → ∞, where h(x)def= h(x, ν) : ν ∈ D(x). In particular xn converge

a.s. to such a set.

For special cases, more can be said, e.g., in the following:

Corollary 8. Suppose there is no additional control process Zn in (6.2.2)and for each x ∈ Rd and xn ≡ x ∀n, Yn is an ergodic Markov process with aunique invariant probability measure ν(x) = ν(x, dy). Then (6.3.2) above maybe replaced by the o.d.e.

x(t) = h(x(t), ν(x(t))). (6.3.3)

Page 84: Stochastic Approximation: A Dynamical Systems Viewpoint

6.3 Averaging the natural timescale: main results 75

If p(dw|y, x) denotes the transition kernel of this ergodic Markov process,then ν(x) is characterized by

∫(f(y)−

∫f(w)p(dw|y, x))ν(x, dy) = 0

for bounded f ∈ C(S). Since this equation is preserved under convergencein P(S), it follows that x → ν(x) is a continuous map. This guarantees theexistence of solutions to (6.3.3) by standard o.d.e. theory, though not theiruniqueness. In general, the solution set for a fixed initial condition will be anonempty compact subset of C([0,∞);Rd). For uniqueness, we need h(·, ν(·))to be Lipschitz, which requires additional information about ν and the transi-tion kernel p.

Many of the developments of the previous chapters have their natural coun-terparts for (6.2.1). For example, the first stability criterion of Chapter 3(Theorem 7) has the following natural extension, stated here for the simplercase when assumptions of Corollary 8 hold. The notation is as above.

Theorem 9. Suppose the limit

h(x)def= lim

a↑∞h(x(t)

a , ν(x(t)a ))

a

exists uniformly on compacts, and furthermore, the o.d.e.

x(t) = h(x(t))

is well posed and has the origin as the unique globally asymptotically stableequilibrium. Then supn ‖xn‖ < ∞ a.s.

As an interesting ‘extension’, suppose Yn,−∞ < n < ∞ is a not necessarilyMarkov process, with the conditional law of Yn given by

Y −n def= [Yn, Yn−1, Yn−2, . . .]

being a continuous map S∞ → P(S) independent of n. Then Y −n is a time-homogeneous Markov process. Let γ : S∞ → S denote the map that takes[s1, s2, . . .] ∈ S∞ to s1 ∈ S. Replacing S by S∞, Yn by Y −n, and h(x, ·)by h(x, γ(·)), we can reduce this case to the one studied above. The resultsabove then apply as long as the technical assumptions made from time to timecan be verified. The case of stationary Yn (or something that is nearly thesame, viz., the case when the appropriate time averages exist) is in fact themost extensively studied case in the literature (see, e.g., Kushner and Yin,2003).

Page 85: Stochastic Approximation: A Dynamical Systems Viewpoint

76 Multiple Timescales

6.4 Concluding remarks

We conclude this chapter with a sufficient condition for (†) when S = Rm

for some m ≥ 1. The condition is that there exists a V ∈ C(Rm) such thatlim||x||→∞ V (x) = ∞ and furthermore,

supn

E[V (Yn)2] < ∞, (6.4.1)

and for some compact B ⊂ Rm and scalar ε0 > 0,

E[V (Yn+1)|Fn] ≤ V (Yn)− ε0, (6.4.2)

a.s. on Yn /∈ B.In the framework of sections 6.2 and 6.3, we now replace S by Rm and S

by Rm def= the one-point compactification of Rm with the additional ‘point atinfinity’ denoted simply by ‘∞’. We assume that p(dω|y, z, x) → δ∞ in P(Rm)as ||y|| → ∞ uniformly in x, z.

Lemma 10. Any limit point (µ∗(·), x∗(·)) of (µ(s + ·), x(s + ·)) as s → ∞ inU × C([0,∞);Rd) is of the form

µ∗(t) = a(t)µ(t) + (1− a(t))δ∞, t ≥ 0,

where a(·) is a measurable function [0,∞) → [0, 1] and µ(t) ∈ D(x∗(t)) ∀t.Proof. Let fi denote a countable convergence determining class of functionsfor Rm satisfying lim||x||→∞ |fi(x)| = 0 for all i. Thus they extend continu-ously to Rm with value zero at ∞. Also, note that by our assumption above,lim||y||→∞

∫fi(w)p(dw|y, z, x) → 0 uniformly in z, x, which verifies (*). Argue

as in the proof of Lemma 6 to conclude that∫

(fi(y)−∫

fi(w)p(dw|y, z, x∗(t)))µ∗(t, dydz) = 0 ∀i,

for all t, a.s. Write µ∗(t) = a(t)µ(t) + (1 − a(t))δ∞ with a(·) : [0,∞) → [0, 1]a measurable map. This is always possible (the decomposition being in factunique for those t for which a(t) > 0). Then when a(t) > 0, the above reducesto ∫

(fi(y)−∫

fi(w)p(dw|y, z, x∗(t)))µ(t, dydz) = 0 ∀i,

for all t. Thus µ(t) ∈ D(x∗(t)) when a(t) > 0. When a(t) = 0, the choiceof µ(t) is arbitrary and it may be chosen so that it is in D(x∗(t)). The claimfollows. ¥

Corollary 11. Condition ( † ) holds. That is, almost surely, for any t > 0,the set µs+t

s , s ≥ 0 remains tight.

Page 86: Stochastic Approximation: A Dynamical Systems Viewpoint

6.4 Concluding remarks 77

Proof. Replacing fi by V in the proof of Lemma 6 and using (6.4.1) to justifythe use of the martingale convergence theorem therein, we have

lims→∞

∫ t

0

∫ ∫(V (w)p(dw|y, z, x(s + r))− V (y))µ(s + r, dydz)dr = 0,

a.s. Fix a sample point where this and Lemma 10 hold. Extend the map

ψ : (x, y, z) ∈ Rd × Rm × U →∫

V (ω)p(dω|y, z, x)− V (y)

toRd×Rm×U by setting ψ(x,∞, z) = −ε0, whence it is upper semicontinuous.Thus taking the above limit along an appropriate subsequence along which(µ(s + ·), x(s + ·)) → (µ∗(·), x∗(·)) (say), we get

0 ≤ −ε0

∫ t

0

(1− a(s))ds

+∫ t

0

a(s)(∫ ∫

(V (w)p(dw|y, z, x∗(s))− V (y))µ∗(s, dydz))

ds

= −ε0

∫ t

0

(1− a(s))ds,

by Lemma 10. Thus a(s) = 1 a.e., where the ‘a.e.’ may be dropped by taking asuitable modification of µ∗(·). This implies that the convergence of µ(s(n) + ·)to µ∗(·) is in fact in U0. This establishes (†). ¥

Page 87: Stochastic Approximation: A Dynamical Systems Viewpoint

7

Asynchronous Schemes

7.1 Introduction

Until now we have been considering the case where all components of xn areupdated simultaneously at time n and the outcome is immediately available forthe next iteration. There may, however, be situations when different compo-nents are updated by possibly different processors (these could be in differentlocations, e.g., in remote sensing applications). Furthermore, each of these com-ponents may be running on its own ‘clock’ and exchanging information withthe others with some communication delays. This is the distributed, asyn-chronous implementation of the algorithm. The theory we have developed sofar does not apply automatically any more and some work is needed to figureout when it does and when it doesn’t. Another important class of problemswhich lands us into a similar predicament consists of the multiagent learning oroptimization schemes when each component actually corresponds to a differentautonomous agent and the aforementioned complications arise naturally. Yetanother situation involves the ‘on-line’ algorithms for control or estimation ofa Markov chain in which we have a one-to-one correspondence between thecomponents of xn and the state space of the chain (i.e., the ith component ofxn is a quantity associated with state i of the chain), and the ith componentgets updated only when state i is visited. We shall see examples of this lateron.

A mathematical model that captures the aspects above is as follows: Lettingxn = [xn(1), . . . , xn(d)], the ith (for 1 ≤ i ≤ d) component is updated in ouroriginal scheme according to

xn+1(i) = xn(i) + a(n)[hi(xn) + Mn+1(i)], n ≥ 1, (7.1.1)

where hi,Mn(i) are the ith components of h,Mn resp., for n ≥ 1. We replace

78

Page 88: Stochastic Approximation: A Dynamical Systems Viewpoint

7.1 Introduction 79

this by

xn+1(i) = xn(i)+a(ν(i, n))Ii ∈ Yn× [hi(xn−τ1i(n)(1), . . . , xn−τdi(n)(d)) + Mn+1(i)], (7.1.2)

for n ≥ 0. Here:

(i) Yn is a random subset of the index set 1, . . . , d, indicating the subsetof components which are updated at time n,

(ii) 0 ≤ τij(n) ≤ n is the delay faced by ‘processor’ j in receiving the outputof processor i at time n. In other words, at time n, processor j knowsxn−τij(n)(i), but not xm(i) for m > n − τij(n) (or does know some ofthem but does not realize they are more recent!).

(iii) ν(i, n) def=∑n

m=0 Ii ∈ Ym, i.e., the number of times the ith componentwas updated up until time n.

Note that the ith processor needs to know only its local clock ν(i, n) and notthe global clock n. In fact the global clock can be a complete artifice as longas causal relationships are respected. One usually has

lim infn→∞

ν(i, n)n

> 0. (7.1.3)

This means that all components are being updated comparably often. A simplesufficient condition for (7.1.3) would be that Yn is an irreducible and hencepositive recurrent Markov chain on the power set of 1, . . . , d. (More generally,it could be a controlled Markov chain on this state space with the property thatany stationary policy leads to an irreducible chain with a stationary distributionthat assigns a probability ≥ δ to each state, for some δ > 0. More on this later.)Note in particular that this condition ensures that ν(i, n) ↑ ∞ for all i, i.e., eachcomponent is updated infinitely often. For the purposes of the next section,this is all we need.

Define

Fn = σ(xm, Mm, Ym, τij(m), 1 ≤ i, j ≤ d,m ≤ n), n ≥ 0.

We assume that

E[Mn+1(i)|Fn] = 0,

E[|Mn+1(i)|2|Fn] ≤ K(1 + supm≤n

||xm||2), (7.1.4)

where 1 ≤ i ≤ d, n ≥ 0, and K > 0 is a suitable constant.Usually it makes sense to assume τii(m) = 0 for all i and m ≤ n, and we

shall do so (implying that a processor has its own past outputs immediatelyavailable). This, however, is not essential for the analysis that follows.

Page 89: Stochastic Approximation: A Dynamical Systems Viewpoint

80 Asynchronous Schemes

The main result here is that under suitable conditions, the interpolated iter-ates track a time-dependent o.d.e. of the form

x(t) = Λ(t)h(x(t)), (7.1.5)

where Λ(·) is a matrix-valued measurable process such that Λ(t) for each t

is a diagonal matrix with nonnegative diagonal entries. These in some sensereflect the relative ‘instantaneous’ rates with which the different componentsget updated. Our treatment follows Borkar (1998). See also Kushner and Yin(1987a, 1987b).

7.2 Asymptotic behavior

As usual, we shall start by assuming

supn||xn|| < ∞, a.s. (7.2.1)

We shall also simplify the situation by assuming that there are no delays, i.e.,τij(n) ≡ 0 ∀i, j, n. The effect of delays will be considered separately later on.Thus (7.1.2) becomes

xn+1(i) = xn(i)+a(ν(i, n))Ii ∈ Yn× [hi(xn(1), . . . , xn(d)) + Mn+1(i)], (7.2.2)

for n ≥ 0. Let a(n) def= maxi∈Yn a(ν(i, n)) > 0, n ≥ 0. Then it is easy to verifythat

∑n a(n) = ∞,

∑n a(n)2 < ∞ a.s.: we have, for any fixed i, 1 ≤ i ≤ d,

∑n

a(n) ≥∑

n

a(ν(i, n))Ii ∈ Yn

=∑

n

a(n) = ∞, and

∑n

a(n)2 ≤∑

n

i

a(ν(i, n))2Ii ∈ Yn

≤ d∑

n

a(n)2 < ∞. (7.2.3)

This implies in particular that a(n) is a legitimate stepsize schedule, albeitrandom. (See the comments following Theorem 2 of Chapter 2.) Rewrite(7.2.2) as

xn+1(i) = xn(i)+a(n)q(i, n)

× [hi(xn(1), . . . , xn(d)) + Mn+1(i)], (7.2.4)

where q(i, n) def= (a(ν(i, n))/a(n))Ii ∈ Yn ∈ (0, 1] ∀n. As before, definet(0) = 0, t(n) =

∑nm=0 a(m), n ≥ 1. Define x(t), t ≥ 0, by x(t(n)) = xn, n ≥ 0,

Page 90: Stochastic Approximation: A Dynamical Systems Viewpoint

7.2 Asymptotic behavior 81

with linear interpolation on each interval Indef= [t(n), t(n + 1)]. For 1 ≤ i ≤ d,

define ui(t), t ≥ 0, by ui(t) = q(i, n) for t ∈ [t(n), t(n + 1)), n ≥ 0. Letλ(t) = diag(u1(t), . . . , ud(t)), t ≥ 0, and xs(t), t ≥ s, the unique solution to thenon-autonomous o.d.e.

xs(t) = λ(t)h(xs(t)), t ≥ s.

The following lemma then holds by familiar arguments.

Lemma 1. For any T > 0,

lims→∞

supt∈[s,s+T ]

||x(t)− xs(t)|| = 0, a.s.

This immediately leads to:

Theorem 2. Almost surely, any limit point of x(s + ·) in C([0,∞);Rd) ass ↑ ∞ is a solution of a non-autonomous o.d.e.

x(t) = Λ(t)h(x(t)), (7.2.5)

where Λ(·) is a d × d-dimensional diagonal matrix-valued measurable functionwith entries in [0, 1] on the diagonal.

Proof. View u(·) def= [u1(·), . . . , ud(·)] as an element of V def= the space of mea-surable maps y(·) : [0,∞) → [0, 1]d with the coarsest topology that renderscontinuous the maps

y(·) →∫ t

0

〈g(s), y(s)〉ds,

for all t > 0, g(·) ∈ L2([0, t];Rd). A standard application of the Banach–Alaoglu theorem (see Appendix A) shows that this is a compact space, metriz-able by the metric

ρ(y1(·), y2(·)) def=∞∑

n=1

∞∑m=1

2−(n+m)

×min(

1, |∫ n

0

〈y1(t), enm(t)〉dt−

∫ n

0

〈y2(t), enm(t)〉dt|

),

where enm(·),m ≥ 1 is a complete orthonormal basis for L2([0, n];Rd). Rel-

ative compactness of x(t + ·), t ≥ 0, in C([0,∞);Rd) is established as before.Consider tn →∞ such that x(tn + ·) → x∗(·) (say) in C([0,∞);Rd). By drop-ping to a subsequence if necessary, assume that u(tn + ·) → u∗(·) in V. Let Λ(·)denote the diagonal matrix with ith diagonal entry = u∗(i). By Lemma 1,

x(tn + s)− x(tn + r) =∫ s

r

λ(tn + z)h(x(tn + z))dz + o(1), s > r ≥ 0.

Page 91: Stochastic Approximation: A Dynamical Systems Viewpoint

82 Asynchronous Schemes

Letting n →∞ in this equation, familiar arguments from Chapter 6 yield

x∗(s)− x∗(r) =∫ s

r

Λ(z)h(x∗(z))dz, s > r ≥ 0.

This completes the proof. ¥

7.3 Effect of delays

Next we shall consider the effect of delays. Specifically, we look for conditionsunder which Theorem 2 will continue to hold for xn given by (7.1.2) insteadof (7.2.2). We shall assume that each output xn(j) of the jth processor is trans-mitted to the ith processor for any pair (i, j) almost surely, though we allowfor some outputs to be ‘lost’ in transit. The situation where not all outputs aretransmitted can also be accommodated by equating unsent outputs with lostones. In this case our requirement boils down to infinitely many outputs beingtransmitted. At the receiver end, we assume that the ith processor receivesinfinitely many outputs sent by j almost surely, though not necessarily in theorder sent. This leaves two possibilities at the receiver: Either the messagesare ‘time-stamped’ and the receiver can re-order them and use at each iterationthe one sent most recently, or they are not and the receiver uses the one re-ceived most recently. Our analysis allows for both possibilities, subject to theadditional condition that a(m + n) ≤ κa(n) for all m,n ≥ 0 and some κ > 0.This is a very mild restriction.

Comparing (7.1.2) with (7.2.2), one notes that the delays introduce in the(n + 1)st iteration of the ith component an additional error of

a(ν(i, n))Ii ∈ Yn× (hi(xn−τ1i(n)(1), · · · , xn−τdi(n)(d))− hi(xn(1), · · · , xn(d))).

Our aim will be to find conditions under which this error is o(a(n)). If so,one can argue as in the extension at the start of section 2.2 and conclude thatTheorem 2 continues to hold with (7.1.2) in place of (7.2.2). Since h(·) isLipschitz, the above error is bounded by a constant times

a(ν(i, n))∑

j

|xn(j)− xn−τji(n)(j)|.

We shall consider each summand (say, the jth) separately. This is bounded by

|n−1∑

m=n−τji(n)

a(ν(j,m))Ij ∈ Ymhj(xm−τ1j(m)(1), · · · , xm−τdj(m)(d))|

+ |n−1∑

m=n−τji(n)

a(ν(j, m))Ij ∈ YmMm+1(j)|. (7.3.1)

Page 92: Stochastic Approximation: A Dynamical Systems Viewpoint

7.3 Effect of delays 83

We shall impose the mild assumption:

n− τk`(n) ↑ ∞ a.s. (7.3.2)

for all k, `. As before, (7.1.4), (7.2.1) and (7.2.3) together imply that the sum∑nm=0 a(ν(j,m))Ij ∈ YmMm+1(j) converges a.s. for all j. In view of (7.3.2),

we then have

|n−1∑

m=n−τji(n)

a(ν(j,m))Ij ∈ YmMm+1(j)| = o(1),

implying that

a(ν(i, n))|n−1∑

m=n−τji(n)

a(ν(j,m))Ij ∈ YmMm+1(j)| = o(a(ν(i, n))).

Under (7.2.1), the first term of (7.3.1) can be almost surely bounded from aboveby a (sample path dependent) constant times

n−1∑

m=n−τji(n)

a(ν(j,m)).

(See, e.g., Chapter 2.) Since a(n) ≤ κa(m) for m ≤ n, this in turn is boundedby κa(ν(j, n − τji(n)))τji(n) for large n. Thus we are done if this quantity iso(1). Note that by (7.3.2), this is certainly so if the delays are bounded. Moregenerally, suppose that

τji(n)n

→ 0 a.s.

This is a perfectly reasonable condition and can be recast as

n− τji(n)n

→ 1 a.s. (7.3.3)

Note that this implies (7.3.2). We further assume that

lim supn→∞

supy∈[x,1]

a(bync)a(n)

< ∞ ∀i, (7.3.4)

for 0 < x ≤ 1. This is also quite reasonable as it is seen to hold for moststandard examples of a(n). Furthermore, this implies that whenever (7.1.3)and (7.3.3) hold,

lim supn→∞

a(ν(j, n− τk`(n)))a(n)

< ∞

for all k, `. Thus our task reduces to showing a(n)τk`(n) = o(1) for all k, `.Assume the following:

Page 93: Stochastic Approximation: A Dynamical Systems Viewpoint

84 Asynchronous Schemes

(†) There exists η > 0 and a nonnegative integer valued random variable τ

such that:

• a(n) = o(n−η) and• τ stochastically dominates all τk`(n) and satisfies

E[τ1η ] < ∞.

All standard examples of a(n) satisfy the first condition with a naturalchoice of η, e.g., for a(n) = n−1, take η = 1 − ε for any ε ∈ (0, 1). Thesecond condition is easily verified, e.g., if the tails of the delay distributionsshow uniform exponential decay. Under (†),

P (τk`(n) ≥ nη) ≤ P (τ ≥ nη)

= P (τ1η ≥ n),

leading to∑

n

P (τk`(n) ≥ nη) ≤∑

n

P (τ1η ≥ n)

= E[τ1η ]

< ∞.

By the Borel–Cantelli lemma, one then has

P (τk`(n) ≥ nη i.o.) = 0.

Coupled with the first part of (†), this implies a(n)τk`(n) = o(1). We haveproved:

Theorem 3. Under assumptions (7.3.3), (7.3.4) and (†), the conclusions ofTheorem 2 also hold when xn are generated by (7.1.2).

The following discussion provides some intuition as to why the delays are‘asymptotically negligible’ as long as they are not ‘arbitrarily large’, in thesense of (7.3.3). Recall our definition of t(n). Note that the passage from theoriginal discrete time count n to t(n) implies a time scaling. In fact this isa ‘compression’ of the time axis because the successive differences t(n+1)−t(n)tend to zero. An interval [n, n + 1, . . . , n + N ] on the original time axis getsmapped to [t(n), t(n+1), . . . , t(n+N)] under this scaling. As n →∞, the widthof the former remains constant at N , whereas that of the latter, t(n+N)−t(n),tends to zero. That is, intervals of a fixed length get ‘squeezed out’ in the limitas n →∞. Since the approximating o.d.e. we are looking at is operating on thetransformed timescale, the net variation of its trajectories over these intervalsis less and less as n →∞, hence so is the case of interpolated iterates x(·). In

Page 94: Stochastic Approximation: A Dynamical Systems Viewpoint

7.4 Convergence 85

other words, the error between the most recent iterate from a processor and onereceived with a bounded delay is asymptotically negligible. The same intuitioncarries over for possibly unbounded delays that satisfy (7.3.3).

7.4 Convergence

We now consider several instances where Theorems 2 and 3 can be strength-ened.

(i) The first and perhaps the most important case is when convergence isobtained just by a judicious choice of stepsize schedule. Let Λ(t) abovebe written as diag(η1(t), . . . , ηd(t)), i.e., the diagonal matrix with theith diagonal entry equal to ηi(t). For n ≥ 0, s > 0, let

N(n, s) def= minm > n : t(m) ≥ t(n) + s > n.

From the manner in which Λ(·) was obtained, it is clear that there existn(k) ⊂ n such that

∫ t+s

t

ηi(y)dy = limk→∞

N(n(k),s)∑

m=n(k)

a(ν(i,m))Ii ∈ Yma(m)

a(m)

= limk→∞

ν(i,N(n(k),s))∑

m=ν(i,n(k))

a(m) ∀i.

Thus∫ t+s

tηi(y)dy∫ t+s

tηj(y)dy

= limk→∞

∑ν(i,N(n(k),s))m=ν(i,n(k)) a(m)

∑ν(j,N(n(k),s))m=ν(j,n(k)) a(m)

∀i, j. (7.4.1)

Suppose we establish that under (7.1.3), the right-hand side of (7.4.1)is always 1. Then (7.2.5) is of the form

x(t) = αh(x(t)),

for a scalar α > 0. This is simply a time-scaled version of the o.d.e.

x(t) = h(x(t)) (7.4.2)

and hence has exactly the same trajectories. Thus the results of Chapter2 apply. See Borkar (1998) for one such situation.

(ii) The second important situation is when the o.d.e. (7.1.5) has the sameasymptotic behaviour as (7.4.2) purely because of the specific structureof h(·). Consider the following special case of the scenario of Corollary 3

Page 95: Stochastic Approximation: A Dynamical Systems Viewpoint

86 Asynchronous Schemes

of Chapter 2, with a continuously differentiable Liapunov function V (·)satisfying

lim||x||→∞

V (x) = ∞, 〈h(x),∇V (x)〉 < 0 whenever h(x) 6= 0.

Suppose

lim inft→∞

ηi(t) ≥ ε > 0 ∀i, (7.4.3)

and

〈h(x), Γ∇V (x)〉 < 0 whenever h(x) 6= 0,

for all d × d diagonal matrices Γ satisfying Γ ≥ εId, Id being the d × d

identity matrix. By (7.4.3), Λ(t) ≥ εId ∀t. Then exactly the sameargument as for Corollary 3 of Chapter 2 applies to (7.1.5), leading tothe conclusion that xn → x : h(x) = 0 a.s. An important instanceof this lucky situation is the case when h(x) = −∇F (x) for some F (·),whence for V (·) ≡ F (·) and Γ as above,

〈h(x), Γ∇F (x)〉 ≤ −ε||∇F (x)||2 < 0

outside x : ∇F (x) = 0.Another example is the case when h(x) = F (x) − x for some F (·)

satisfying

||F (x)− F (y)||∞ ≤ β||x− y||∞ ∀x, y, (7.4.4)

with ||x||∞ def= maxi |xi| for x = [x1, . . . , xd] and β ∈ (0, 1). That is,F is a contraction w.r.t. the max-norm. In this case, it is known fromthe contraction mapping theorem that there is a unique x∗ such thatF (x∗) = x∗, i.e., a unique equilibrium point for (7.4.2). Furthermore, adirect calculation shows that V (x) def= ||x − x∗||∞ serves as a Liapunovfunction, albeit a non-smooth one. In fact,

||x(t)− x∗||∞ ↓ 0. (7.4.5)

See Theorem 2 of Chapter 10 for details. Now,

Γ(F (x)− x) = FΓ(x)− x

for FΓ(·) def= (I − Γ)x + ΓF (x). Note that the diagonal terms of Γ arebounded by 1. Then

||FΓ(x)− FΓ(y)||∞ ≤ β||x− y||∞ ∀x, y,

where βdef= 1− ε(1− β) ∈ (0, 1). In particular, this is true for Γ = Λ(t)

for any t ≥ 0. Thus once again a direct calculation shows that (7.4.5)holds and therefore x(t) → x∗. (See the remark following Theorem 2 of

Page 96: Stochastic Approximation: A Dynamical Systems Viewpoint

7.4 Convergence 87

Chapter 10.) In fact, these observations extend to the situation whenβ = 1 as well. For the case when β = 1, existence of equilibrium isan assumption on F (·) and uniqueness need not hold. One can alsoconsider ‘weighted norms’, such as ||x||∞,w

def= supi wi|xi| for prescribedwi > 0, 1 ≤ i ≤ d. We omit the details here as these will be self-evidentafter the developments in Chapter 10 where such ‘fixed point solvers’are analyzed in greater detail.

Finally, note that replacing a(ν(i, n)) in (7.1.2) by a(n) would amount todistributed but synchronous iterations, as they presuppose a common clock.These can be analyzed along exactly the same lines, with the o.d.e. limit (7.1.5).In the case when Yn can be viewed as a controlled Markov chain on the powerset Q of 1, 2, . . . , d, the analysis of sections 6.2 and 6.3 of Chapter 6 shows thatthe ith diagonal element of Λ(t) will in fact be of the form

∑i∈A∈Q πt(A), where

πt is the vector of stationary probabilities for this chain under some stationarypolicy. Note that in principle, Yn can always be cast as a controlled Markovchain on Q: Let the control space U be the set of probability measures on Q

and let the controlled transition probability function be p(j|i, u) = u(j) fori, j ∈ Q,u ∈ U . The control sequence is then the process of regular conditionallaws of Yn+1 given Yk, k ≤ n, for n ≥ 0. This gives a recipe for verifying (7.1.3)in many cases.

One may be able to ‘rig’ the stepsizes here so as to get the desired limitingo.d.e. For example, suppose the Yn above takes values in i : 1 ≤ i ≤ d,i.e., singletons alone. Suppose further that it is an ergodic Markov chain on thisset. Suppose the chain has a stationary probability vector [π1, . . . , πd]. Then byCorollary 8 of Chapter 6, Λ(t) ≡ diag(π1, . . . , πd). Thus if we use the stepsizesa(n)/πi for the ith component, we get the limiting o.d.e. x(t) = h(x(t))as desired. In practice, one may use a(n)/ξn(i) instead, where ξn(i) is anempirical estimate of πi obtained by suitable averaging on a faster timescale sothat it tracks πi. (One could, for example, have ξn(i) = ν(i, n)/n if na(n) → 0as n → ∞, i.e., the stepsizes a(n) decrease slower than 1

n , and thereforethe analysis of section 6.1 applies.) This latter arrangement also extends in anatural manner to the more general case when Yn is ‘controlled Markov’.

Page 97: Stochastic Approximation: A Dynamical Systems Viewpoint

8

A Limit Theorem for Fluctuations

8.1 Introduction

To motivate the results of this chapter, consider the classical strong law of largenumbers: Let Xn be i.i.d. random variables with E[Xn] = µ, E[X2

n] < ∞.Let

S0 = 0, Sndef=

∑ni=1 Xi

n, n ≥ 1.

The strong law of large numbers (see, e.g., Section 4.2 of Borkar, 1995) statesthat

Sn

n→ µ, a.s.

To cast this as a ‘stochastic approximation’ result, note that some simple alge-braic manipulation leads to

Sn+1 = Sn +1

n + 1(Xn+1 − Sn)

= Sn +1

n + 1([µ− Sn] + [Xn+1 − µ])

= Sn + a(n)(h(Sn) + Mn+1)

for

a(n) def=1

n + 1, h(x) def= µ− x ∀x, Mn+1

def= Xn+1 − µ.

In particular, a(n) and Mn+1 are easily seen to satisfy the conditionsstipulated for the stepsizes and martingale difference noise resp. in Chapter 2.Thus this is a valid stochastic approximation iteration. Its o.d.e. limit then is

x(t) = µ− x(t), t ≥ 0,

88

Page 98: Stochastic Approximation: A Dynamical Systems Viewpoint

8.2 A tightness result 89

which has µ as the unique globally asymptotically stable equilibrium. Its ‘scaledlimit’ as in assumption (A5) of Chapter 3 is

x(t) = −x(t), t ≥ 0,

which has the origin as the unique globally asymptotically stable equilibrium.Thus by the theory developed in Chapter 2 and Chapter 3,

(i) the ‘iterates’ Sn remain a.s. bounded, and(ii) they a.s. converge to µ.

We have recovered the strong law of large numbers from stochastic approxima-tion theory. Put differently, the a.s. convergence results for stochastic approx-imation iterations are nothing but a generalization of the strong law of largenumbers for a class of dependent and not necessarily identically distributedrandom variables.

The classical strong law of large numbers, which states a.s. convergence ofempirical averages to the mean, is accompanied by other limit theorems thatquantify fluctuations around the mean, such as the central limit theorem, thelaw of iterated logarithms, the functional central limit theorem (Donsker’s theo-rem), etc. It is then reasonable to expect similar developments for the stochasticapproximation iterates. The aim of this chapter is to state a functional cen-tral limit theorem in this vein. This is proved in section 8.3, following somepreliminaries in the next section. Section 8.4 specializes these results to thecase when the iterates a.s. converge to a single deterministic limit and recoversthe central limit theorem for stochastic approximation (see, e.g., Chung (1954),Fabian (1968)).

8.2 A tightness result

We shall follow the notation of section 4.3, which we briefly recall below. Thusour basic iteration in Rd is

xn+1 = xn + a(n)(h(xn) + Mn+1) (8.2.1)

for n ≥ 0, with the usual assumptions on a(n), Mn+1, and the additionalassumptions:

(A1) h(·) is continuously differentiable and both h(·) and the Jacobian matrix∇h(·) are uniformly Lipschitz.

(A2) a(n)a(n+1)

n↑∞→ 1.

(A3) supn ||xn|| < ∞ a.s., supn E[||xn||4] < ∞.

Page 99: Stochastic Approximation: A Dynamical Systems Viewpoint

90 A Limit Theorem for Fluctuations

(A4) Mn satisfy

E[Mn+1MTn+1|Mi, xi, i ≤ n] = Q(xn),

E[||Mn+1||4|Mi, xi, i ≤ n] ≤ K ′(1 + ||xn||4),where K ′ > 0 is a suitable constant and Q : Rd → Rd×d is a positive defi-nite matrix-valued Lipschitz function such that the least eigenvalue of Q(x) isbounded away from zero uniformly in x.

For the sake of simplicity, we also assume a(n) ≤ 1 ∀n. As before, fix T > 0and define t(0) = 0,

t(n) def=n−1∑m=0

a(m),

m(n) def= minm ≥ n : t(m) ≥ t(n) + T, n ≥ 1.

Thus t(m(n)) ∈ [t(n) + T, t(n) + T + 1]. For n ≥ 0, let xn(t), t ≥ t(n), denotethe solution to

xn(t) = h(xn(t)), t ≥ t(n), xn(t(n)) = xn. (8.2.2)

Then

xn(t(j + 1)) = xn(t(j)) + a(j)(h(xn(t(j)))− δj

), (8.2.3)

where δj is the ‘discretization error’ as in Chapter 2, which is O(a(j)). Let

yjdef= xj − xn(t(j)),

zjdef=

yj√a(j)

,

for j ≥ n, n ≥ 0. Subtracting (8.2.3) from (8.2.1) and using Taylor expansion,we have

yj+1 = yj + a(j)(∇h(xn(t(j)))yj + κj + δj) + a(j)Mj+1.

Here κj = o(‖yj‖) is the error in the Taylor expansion, which is also o(1) inview of Theorem 2 of Chapter 2. Iterating, we have, for 0 ≤ i ≤ m(n)− n,

yn+i

= Πn+i−1j=n (1 + a(j)∇h(xn(t(j))))yn

+n+i−1∑

j=n

a(j)Πn+i−1k=j+1(1 + a(k)∇h(xn(t(k))))(κj + δj + Mj+1)

=n+i−1∑

j=n

a(j)Πn+i−1k=j+1(1 + a(k)∇h(xn(t(k))))(κj + δj + Mj+1),

Page 100: Stochastic Approximation: A Dynamical Systems Viewpoint

8.2 A tightness result 91

because yn = 0. Thus for i as above,

zn+i =n+i−1∑

j=n

√a(j)Πn+i−1

k=j+1(1 + a(k)∇h(xn(t(k))))

×√

a(k)a(k + 1)

Mj+1

√a(j)

a(j + 1)

+n+i−1∑

j=n

√a(j)Πn+i−1

k=j+1(1 + a(k)∇h(xn(t(k))))

×√

a(k)a(k + 1)

(κj + δj)

√a(j)

a(j + 1). (8.2.4)

Define zn(t), t ∈ Indef= [t(n), t(n) + T ], by zn(t(j)) = zj , with linear interpola-

tion on each [t(j), t(j+1)] for n ≤ j ≤ m(n). Let zn(t) = zn(t(n)+t), t ∈ [0, T ].We view zn(·) as C([0, T ];Rd)-valued random variables. Our first step willbe to prove the tightness of their laws. For this purpose, we need the followingtechnical lemma.

Let Xn be a zero mean martingale w.r.t. the increasing σ-fields Fn withX0 = 0 (say) and supn≤N E[|Xn|4] < ∞. Let Yn

def= Xn −Xn−1, n ≥ 1.

Lemma 1. For a suitable constant K > 0,

E[ supn≤N

|Xn|4] ≤ K(N∑

m=1

E[Y 4m] + E[(

N∑m=1

E[Y 2m|Fm−1])2]).

Page 101: Stochastic Approximation: A Dynamical Systems Viewpoint

92 A Limit Theorem for Fluctuations

Proof. For suitable constants K1,K2,K3,K4 > 0,

E[ supn≤N

|Xn|4] ≤ K1E[(N∑

m=1

Y 2m)2]

≤ K2(E[(N∑

m=1

(Y 2m − E[Y 2

m|Fm−1]))2]

+ E[(N∑

m=1

E[Y 2m|Fm−1])2])

≤ K3(E[N∑

m=1

(Y 2m − E[Y 2

m|Fm−1])2]

+ E[(N∑

m=1

E[Y 2m|Fm−1])2])

≤ K4(E[N∑

m=1

Y 4m]

+ E[(N∑

m=1

E[Y 2m|Fm−1])2]),

where the first and the third inequalities follow from Burkholder’s inequality(see Appendix C). ¥

Applying Lemma 1 to (8.2.4), we have:

Lemma 2. For m(n) ≥ ` > k ≥ n, n ≥ 0,

E[||zn(t(`))− zn(t(k))||4] = O((∑

j=k

a(j))2)

= O(|t(`)− t(k)|2).

Proof. Let `, k be as above. By (A2),√

a(j)/a(j + 1) is uniformly boundedin j. Since h is uniformly Lipschitz, ∇h is uniformly bounded and thus forn ≤ k < m(n),

‖Π`r=k+1(1 + a(r)∇h(xn(t(r))))‖ ≤ eK1

∑`r=k+1 a(r)

≤ e(T+1)K1 ,

for a suitable bound K1 > 0 on ||∇h(·)||. Also, for any η > 0, ‖κj‖ ≤η‖yj‖ =

√a(j)η‖zj‖ for sufficiently large j. Hence we have for large j and

Page 102: Stochastic Approximation: A Dynamical Systems Viewpoint

8.2 A tightness result 93

η′ = η(supn

√a(n)

a(n+1) )2,

‖∑`j=k+1

√a(j)Π`

r=j(1 + a(r)∇h(xn(t(r))))√

a(r)a(r+1)κj

√a(j)

a(j+1)‖≤ η′eK1(T+1)

∑`j=k+1 a(j)‖zj‖.

Similarly, since ||δj || ≤ K ′a(j) for some K ′ > 0,

‖∑`j=k+1

√a(j)Π`

r=j(1 + a(r)∇h(xn(t(r))))√

a(r)a(r+1)δj

√a(j)

a(j+1)‖≤ KeK1(T+1)

∑`j=k+1 a(j)

32

for some constant K > 0. Applying the above lemma to Ψkdef= the first term

on the right-hand side of (8.2.4) with n + i = k, E[supn≤m≤k ‖Ψm‖4] is seento be bounded by

K1((k∑

j=n

a(j))2 +k∑

j=n

a(j)2)

for some K1 > 0, where we have used the latter parts of (A3) and (A4).Combining this with the foregoing, we have

E[‖zn(t(k))‖4] ≤ K2

((

k∑

j=n

a(j))2 +k∑

j=n

a(j)2

+ (k∑

j=n

a(j)32 )4 + E[(

k∑

j=n

a(j)‖zj‖)4])

for a suitable K2 > 0. Since zj = zn(t(j)), a(j) ≤ 1 and∑m(n)

j=n a(j) ≤ T + 1,we have

E[(k∑

j=n

a(j)‖zj‖)4] ≤ (k∑

j=n

a(j))4E[( 1∑k

j=n a(j)

k∑

j=n

a(j)‖zj‖)4

]

≤ (T + 1)3k∑

j=n

a(j)E[‖zj‖4] (8.2.5)

by the Jensen inequality. Also,∑k

j=n a(j)2 ≤ (∑k

j=n a(j))2 ≤ (T + 1)2,∑kj=n a(j)

32 ≤ (T + 1) supj

√a(j). Thus, for a suitable K3 > 0,

E[‖zn(t(k))‖4] ≤ K3(1 +k∑

j=n

a(j)E[‖zn(t(j))‖4]).

By the discrete Gronwall inequality, it follows that

supn≤k≤m(n)

E[‖zn(t(k))‖4] ≤ K4 < ∞ (8.2.6)

Page 103: Stochastic Approximation: A Dynamical Systems Viewpoint

94 A Limit Theorem for Fluctuations

for a suitable K4 > 0. Arguments analogous to (8.2.5) then also lead to

E[(∑

j=k

a(j)‖zj‖)4] ≤ (∑

j=k

a(j))3K4.

Thus

E[‖zn(t(`))− zn(t(k))‖4] ≤ K2

((∑

j=k

a(j))2 +∑

j=k

a(j)2

+ (∑

j=k

a(j)32 )4 + (

j=k

a(j))3K4

)

≤ K5(∑

j=k

a(j))2

= O(|t(`)− t(k)|2).¥

A small variation of the argument used to prove (8.2.6) shows that

E[ supn≤k≤`

||zn(t(k))||4] ≤ K ′(1 +∑m=n

E[ supn≤k≤m

||zn(t(k))||4]),

which, by the discrete Gronwall inequality, improves (8.2.6) to

E[ supn≤k≤m(n)

‖zn(t(k))‖4] ≤ K5 < ∞. (8.2.7)

We shall use this bound later. A claim analogous to Lemma 2 holds for xn(·):Lemma 3. For t(n) ≤ s < t ≤ t(n) + T ,

E[||xn(t)− xn(s)||4] ≤ K(T )|t− s|2

for a suitable constant K(T ) > 0 depending on T .

Proof. Since h(·) is Lipschitz,

||h(x)|| ≤ K(1 + ||x||) (8.2.8)

for some constant K > 0. Since

supn

E[||xn(t(n))||4] = supn

E[||xn||4] < ∞

by (A3), a straightforward application of the Gronwall inequality in view of(8.2.8) leads to

supt∈[0,T ]

E[||xn(t(n) + t)||4] < ∞.

Page 104: Stochastic Approximation: A Dynamical Systems Viewpoint

8.2 A tightness result 95

Thus for some K ′, K(T ) > 0,

E[||xn(t)− xn(s)||4] ≤ E[||∫ t

s

h(xn(y))dy||4]

≤ (t− s)4E[|| 1t− s

∫ t

s

h(xn(y))dy||4]

≤ (t− s)3E[∫ t

s

||h(xn(y))||4dy]

≤ (t− s)3E[∫ t

s

K ′(1 + ||xn(y)||)4dy]

≤ K(T )|t− s|2,

which is the desired bound. ¥

We shall need the following well-known criterion for tightness of probabilitymeasures on C([0, T ];Rd):

Lemma 4. Let ξα(·), for α belonging to some prescribed index set J , be afamily of C([0, T ];Rd)-valued random variables such that the laws of ξα(0)are tight in P(Rd), and for some constants a, b, c > 0

E[||ξα(t)− ξα(s)||a] ≤ b|t− s|1+c ∀ α ∈ J, t, s ∈ [0, T ]. (8.2.9)

Then the laws of ξα(·) are tight in P(C([0, T ];Rd)).

See Billingsley (1968, p. 95) for a proof.

Let xn(t) = xn(t(n) + t), t ∈ [0, T ]. Then we have:

Lemma 5. The laws of the processes (zn(·), xn(·)), n ≥ 0 are relatively com-pact in P(C([0, T ];Rd))2.

Proof. Note that zn(0) = 0 ∀n and hence have trivially tight laws. Tightnessof the laws of zn(·), n ≥ 0 then follows by combining Lemmas 2 and 4 above.Tightness of the laws of x(0) follows from the second half of (A3), as

P (||xn(t)|| > a) ≤ E||xn(t)||4]a4

≤ K

a4,

for a suitable constant K > 0. Tightness of the laws of xn(·) then followsby Lemmas 3 and 4. Since tightness of marginals implies tightness of jointlaws, tightness of the joint laws of (zn(·), xn(·)) follows. The claim is nowimmediate from Prohorov’s theorem (see Appendix C). ¥

Page 105: Stochastic Approximation: A Dynamical Systems Viewpoint

96 A Limit Theorem for Fluctuations

In view of this lemma, we may take a subsequence of (zn(·), xn(·)) thatconverges in law to a limit (say) (z∗(·), x∗(·)). Denote this subsequence againby (zn(·), xn(·)) by abuse of notation. In the next section we characterizethis limit.

8.3 The functional central limit theorem

To begin with, we shall invoke Skorohod’s theorem (see Appendix C) to supposethat

(zn(·), xn(·)) → (z∗(·), x∗(·)), a.s. (8.3.1)

in C([0, T ];Rd)2. Since the trajectories of the o.d.e. x(t) = h(x(t)) form aclosed set in C([0, T ];Rd) (see Appendix A), it follows that x∗(·) satisfies thiso.d.e. In fact, we know separately from the developments of Chapter 2 that itwould be in an internally chain transitive invariant set thereof. To characterizez∗(·), it is convenient to work with

zj+1 =

√a(j)

a(j + 1)zj + a(j)∇h(xn(j))

√a(j)

a(j + 1)zj

+√

a(j)

√a(j)

a(j + 1)Mj+1 + o(a(j)),

for n ≤ j ≤ m(n), which lead to

zj+1 = zn +j∑

k=n

(√a(k)

a(k + 1)− 1

)zk

+j∑

k=n

a(k)

√a(k)

a(k + 1)∇h(xn(t(k))zk

+j∑

k=n

√a(k)

√a(k)

a(k + 1)Mk+1 + o(1),

for j in the above range. Thus

zn(t(j + 1)) = zn(t(n)) + (ζj − ζn)

+∫ t(j+1)

t(n)

∇h(xn(y))zn(y)b(y)dy

+j∑

k=n

√a(k)

√a(k)

a(k + 1)Mk+1 + o(1),

where:

• ζm =∑m

i=n

(√a(i)

a(i+1) − 1)

zi, and

Page 106: Stochastic Approximation: A Dynamical Systems Viewpoint

8.3 The functional central limit theorem 97

• b(t(j)) def=√

a(j)/a(j + 1), j ≥ 0, with linear interpolation on each interval[t(j), t(j + 1)].

Note that

b(t)t↑∞→ 1, (8.3.2)

and, for t(`) = mint(k) : t(k) ≥ t(n) + T,max

n≤j≤`‖ζj − ζn‖ ≤ sup

i≥n|√

a(i)/a(i + 1)− 1| supt∈[t(n),t(`)]

‖zn(t)‖(T + 1). (8.3.3)

By (8.2.7) and (A2), the right-hand side tends to zero in fourth moment andtherefore in law as n → ∞ for any T > 0. Fix t > s in [0, T ] and let g ∈Cb(C([0, s];Rd)2). Then, since Mk is a martingale difference sequence, wehave

||E[(zn(t)− zn(s)−∫ t

s

∇h(xn(y))b(t(n) + y)zn(y)dy)

× g(zn([0, s]), xn([0, s]))]||= o(1).

Here we use the notation f([0, s]) to denote the trajectory segment f(y), 0 ≤y ≤ s. Letting n →∞, we then have, in view of (8.3.2),

E[(z∗(t)− z∗(s)−∫ t

s

∇h(x∗(y))z∗(y)dy)g(z∗([0, s]), x∗([0, s]))] = 0.

Letting Gtdef= the completion of ∩s≥tσ(z∗(y), x∗(y), y ≤ s) for t ≥ 0, it then

follows by a standard monotone class argument that

z∗(t)−∫ t

0

∇h(x∗(s))z∗(s)ds, t ∈ [0, T ],

is a martingale.For t ∈ [0, T ], define Σn(t) by

Σn(t(j)− t(n)) =j∑

k=n

a(k)b(t(k))2Q(xn(t(k))),

for n ≤ j ≤ m(n), with linear interpolation on each [t(j)− t(n), t(j +1)− t(n)].Then

j∑

k=n

a(k)b(k)2Mj+1MTj+1 − Σn(t(j)− t(n)), n ≤ j ≤ m(n),

is a martingale by (A4). Therefore for t, s as above and

qn(t) def= zn(t)−∫ t

0

∇h(xn(y))b(t(n) + y)zn(y)dy, t ∈ [0, T ],

Page 107: Stochastic Approximation: A Dynamical Systems Viewpoint

98 A Limit Theorem for Fluctuations

we have

||E[(qn(t)qn(t)T − qn(s)qn(s)T − (Σn(t)− Σn(s)))

×g(zn([0, s]), xn([0, s])]||= o(1).

(The multiplication by g(· · · ) and the expectation are componentwise.) Passingto the limit as n →∞, one concludes as before that

(z∗(t)−

∫ t

0

∇h(x∗(s))z∗(s)ds)(

z∗(t)−∫ t

0

∇h(x∗(s))z∗(s)ds)T

−∫ t

0

Q(x∗(s))ds

is a Gt-martingale for t ∈ [0, T ]. From the results of Wong (1971), itthen follows that on a possibly augmented probability space, there exists ad-dimensional Brownian motion B(t), t ≥ 0, such that

z∗(t) =∫ t

0

∇h(x∗(s))z∗(s)ds +∫ t

0

D(x∗(s))dB(s), (8.3.4)

where D(x) ∈ Rd×d for x ∈ Rd is a positive semidefinite, Lipschitz (in x),square-root of the matrix Q(x).

Remarks: (1) A square-root as above always exists under our hypotheses, asshown in Theorem 5.2.2 of Stroock and Varadhan (1979).(2) Equation (8.3.4) specifies z∗(·) as a solution of a linear stochastic differentialequation. A ‘variation of constants’ argument leads to the explicit expression

z∗(t) =∫ t

0

Φ(t, s)Q(x∗(s))dB(s), (8.3.5)

where Φ(t, s), t ≥ s ≥ 0, satisfies the linear matrix differential equation

d

dtΦ(t, s) = ∇h(x∗(t))Φ(t, s), t ≥ s; Φ(s, s) = Id. (8.3.6)

Here Id denotes the d× d identity matrix. In particular, if x∗(·) were a deter-ministic trajectory, then by (8.3.6), Φ(·, ·) would be deterministic too and by(8.3.5), z∗(·) would be the solution of a linear stochastic differential equationwith deterministic coefficients and zero initial condition. In particular, (8.3.5)would then imply that it is a zero mean Gaussian process.

Summarizing, we have:

Theorem 6. The limits in law (z∗(·), x∗(·)) of (zn(·), xn(·)) are such thatx∗(·) is a solution of the o.d.e. (9.1.2) belonging to an internally chain transitiveinvariant set thereof, and z∗(·) satisfies (8.3.4).

Page 108: Stochastic Approximation: A Dynamical Systems Viewpoint

8.4 The convergent case 99

8.4 The convergent case

We now consider the special case when the o.d.e. x(t) = h(x(t)) has a uniqueglobally asymptotically stable equilibrium x. Then under our conditions, xn →x a.s. as n ↑ ∞ by Theorem 2 of Chapter 2. We may also suppose that alleigenvalues of ∇h(x) have strictly negative real parts. Then x∗(·) ≡ x and thus(8.3.4) reduces to the constant coefficient linear stochastic differential equation

z∗(t) =∫ t

0

∇h(x)z∗(s)ds +∫ t

0

D(x)dB(s),

leading to

z∗(t) =∫ t

0

e∇h(x)(t−s)D(x)dB(s).

In particular, z∗(t) is a zero mean Gaussian random variable with covariancematrix given by

Γ(t) def=∫ t

0

e∇h(x)(t−s)Q(x)e∇h(x)T(t−s)ds

=∫ t

0

e∇h(x)uQ(x)e∇h(x)Tudu,

after a change of variable u = t−s. Thus, as t →∞, the law of z∗(t) convergesto the stationary distribution of this Gauss–Markov process, which is zero meanGaussian with covariance matrix

Γ∗ def= limt→∞

Γ(t)

=∫ ∞

0

e∇h(x)sQ(x)e∇h(x)Tsds.

Note that

∇h(x)Γ∗ + Γ∗∇h(x)T = limt→∞

(∇h(x)Γ(t) + Γ(t)∇h(x)T)

= limt→∞

∫ t

0

d

du

(e∇h(x)uQ(x)e∇h(x)Tu

)du

= limt→∞

(e∇h(x)tQ(x)e∇h(x)Tt −Q(x))

= −Q(x),

in view of our assumption on ∇h(x) above. Thus Γ∗ satisfies the matrix equa-tion

∇h(x)Γ∗ + Γ∗∇h(x)T + Q(x) = 0. (8.4.1)

From the theory of linear systems of differential equations, it is well-known thatΓ∗ is the unique positive definite solution to the ‘Liapunov equation’ (8.4.1)(see, e.g., Kailath, 1980, p. 179).

Page 109: Stochastic Approximation: A Dynamical Systems Viewpoint

100 A Limit Theorem for Fluctuations

Since the Gaussian density with zero mean and covariance matrix Γ(t) con-verges pointwise to the Gaussian density with zero mean and covariance matrixΓ∗, it follows from Scheffe’s theorem (see Appendix C) that the law of z∗(t)tends to the stationary distribution in total variation and hence in P(Rd). Letε > 0 and pick T above large enough that for t = T , the two are at most ε apartwith respect to a suitable metric ρ compatible with the topology of P(Rd). Itthen follows that the law of zn(T ) converges to the ε-neighbourhood (w.r.t. ρ)of the stationary distribution as n ↑ ∞. Since ε > 0 was arbitrary, it followsthat it converges in fact to this distribution. We have thus proved the ‘CentralLimit Theorem’ for stochastic approximation:

Theorem 7. The law of zn converges to the Gaussian distribution with zeromean and covariance matrix Γ∗ given by the unique positive definite solutionto (8.4.1).

One important implication of this result is the following: it suggests that ina certain sense, the convergence rate of xn to x is O(

√a(n)). Also, one can

read off ‘confidence intervals’ for finite runs based on Gaussian approximation,see Hsieh and Glynn (2000).

We have presented a simple case of the functional central limit theoremfor stochastic approximation. A much more general statement that allowsfor a ‘Markov noise’ on the natural timescale as in Chapter 6 is available inBenveniste, Metivier and Priouret (1990). In addition, there are also other limittheorems available for stochastic approximation iterations, such as convergencerates for moments (Gerencser, 1992), a ‘pathwise central limit theorem’ forcertain scaled empirical measures (Pelletier, 1999), a law of iterated logarithms(Pelletier, 1998), Strassen-type strong invariance principles (Lai and Robbins,1978; Pezeshki-Esfahani and Heunis, 1997), and Freidlin – Wentzell type ‘largedeviations’ bounds (Dupuis, 1988; Dupuis and Kushner, 1989).

Page 110: Stochastic Approximation: A Dynamical Systems Viewpoint

9

Constant Stepsize Algorithms

9.1 Introduction

In many practical circumstances, it is more convenient to use a small constantstepsize a(n) ≡ a ∈ (0, 1) rather than the decreasing stepsize considered thusfar. One such situation is when the algorithm is ‘hard-wired’ and decreasingstepsize may mean additional overheads. Another important scenario is whenthe algorithm is expected to operate in a slowly varying environment (e.g.,in tracking applications) where it is important that the timescale of the algo-rithm remain reasonably faster than the timescale on which the environmentis changing, for otherwise it would never adapt.

Naturally, for constant stepsize one has to forgo the strong convergence state-ments we have been able to make for decreasing stepsizes until now. A rule ofthumb, to be used with great caution, is that in the passage from decreasingto small positive stepsize, one replaces a ‘converges a.s. to’ statement with a‘concentrates with a high probability in a neighbourhood of ’ statement. We shallmake this more precise in what follows, but the reason for thus relaxing theclaims is not hard to guess. Consider for example the iteration

xn+1 = xn + a[h(xn) + Mn+1], (9.1.1)

where Mn are i.i.d. with Gaussian densities. Suppose the o.d.e.

x(t) = h(x(t)) (9.1.2)

has a unique globally asymptotically stable equilibrium x∗. Observe that xnis then a Markov process and if it is stable (i.e., the laws of xn remaintight – see Appendix C) then the best one can hope for is that it will havea stationary distribution which assigns a high probability to a neighbourhoodof x∗. On the other hand, because of additive Gaussian noise, the stationarydistribution will have full support. Using the well-known recurrence properties

101

Page 111: Stochastic Approximation: A Dynamical Systems Viewpoint

102 Constant Stepsize Algorithms

of such Markov processes, it is not hard to see that both ‘supn ||xn|| < ∞ a.s.’and ‘xn → x∗ a.s.’ are untenable, because xn will visit any given open setinfinitely often with probability one.

In this chapter, we present many counterparts of the results thus far forconstant stepsize. The treatment will be rather sketchy, emphasizing mainlythe points of departures from the diminishing stepsizes. What follows alsoextends more generally to bounded stepsizes.

9.2 Asymptotic behaviour

We shall assume (A1), (A3) of Chapter 2 and replace (A4) there by

Cdef= sup

nE[||xn||2] 1

2 < ∞ (9.2.1)

and

supn

E[G(||xn||2)] < ∞ (9.2.2)

for some G : [0,∞) → [0,∞) satisfying G(t)/tt↑∞→ ∞. Condition (9.2.2)

is equivalent to the statement that ||xn||2 are uniformly integrable – see,e.g., Theorem 1.3.4, p. 10, of Borkar (1995). Let L > 0 denote the Lipschitzconstant of h as before. As observed earlier, its Lipschitz continuity implies atmost linear growth. Thus we have the equivalent statements

||h(x)|| ≤ K1(1 + ||x||) or K2

√1 + ||x||2,

for suitable K1,K2 > 0. We may use either according to convenience. Imitatingthe developments of Chapter 2 for decreasing stepsizes, let t(n) = na, n ≥ 0.Define x(·) by x(t(n)) = xn ∀n with x(t) defined on [t(n), t(n + 1)] by linearinterpolation for all n, so that it is a piecewise linear, continuous function. Asbefore, let xs(t), t ≥ s, denote the trajectory of (9.1.2) with xs(s) = x(s). Inparticular,

xs(t) = x(s) +∫ t

s

h(xs(y))dy

for t ≥ s implies that

||xs(t)|| ≤ ||x(s)||+∫ t

s

K1(1 + ||xs(y)||)dy.

By the Gronwall inequality, we then have

||xs(t)|| ≤ Kτ (1 + ||x(s)||), t ∈ [s, s + τ ], τ > 0.

for a Kτ > 0. In turn this implies

||h(xs(t))|| ≤ K1(1 + Kτ (1 + ||x(s)||)) ≤ ∆(τ)(1 + ‖x(s)‖), t ∈ [s, s + τ ],

Page 112: Stochastic Approximation: A Dynamical Systems Viewpoint

9.2 Asymptotic behaviour 103

for τ > 0, ∆(τ) def= K1(1 + Kτ ). Thus for t > t′ in [s, s + τ ],

||xs(t)− xs(t′)|| ≤∫ t

t′||h(xs(y))− h(xs(s))||dy

≤ ∆(τ)(1 + ‖x(s)‖)(t− t′). (9.2.3)

The key estimate for our analysis is:

Lemma 1. For any T > 0,

E[ supt∈[0,T ]

||x(s + t)− xs(s + t)||2] = O(a). (9.2.4)

Proof. Consider T = Na for some N > 0. For t ≥ 0, let [t] def= maxna : n ≥0, na ≤ t. Let ζn

def= a∑n

m=1 Mm, n ≥ 1. Then for n ≥ 0 and 1 ≤ m ≤ N , wehave

x(t(n + m)) = x(t(n)) +∫ t(n+m)

t(n)

h(x([t]))dt + (ζm+n − ζn).

Also,

xt(n)(t(n + m)) = x(t(n)) +∫ t(n+m)

t(n)

h(xt(n)([t]))dt

+∫ t(n+m)

t(n)

(h(xt(n)(t))− h(xt(n)([t])))dt. (9.2.5)

Recall that L is a Lipschitz constant for h(·). Clearly,

||∫ t(n+m)

t(n)

(h(x([t]))− h(xt(n)([t])))dt||

= a||m−1∑

k=0

(h(x(t(n + k)))− h(xt(n)(t(n + k)))

)||

≤ aL

m−1∑

k=0

||x(t(n + k))− xt(n)(t(n + k))||

≤ aL

m−1∑

k=0

supj≤k

||x(t(n + j))− xt(n)(t(n + j))||. (9.2.6)

Page 113: Stochastic Approximation: A Dynamical Systems Viewpoint

104 Constant Stepsize Algorithms

By (9.2.3), we have

||∫ t(n+m+1)

t(n+m)

(h(x(t))− h(x([t])))dt||

≤∫ t(n+m+1)

t(n+m)

||h(x(t))− h(x([t]))||dt

≤ L

∫ t(n+m+1)

t(n+m)

||x(t)− x([t])||dt

≤ 12a2L∆(Na)(1 + ‖x(t(n))‖)

def=12a2K(1 + ‖x(t(n))‖). (9.2.7)

Subtracting (9.2.5) from (9.2), we have, by (9.2.6) and (9.2.7),

sup0≤k≤m

||x(t(n + k))− xt(n)(t(n + k)))||

≤ aL

m−1∑

k=0

sup0≤j≤k

||x(t(n + j))− xt(n)(t(n + j))||

+ aTK(1 + ‖x(t(n))‖) + sup1≤j≤m

||ζn+j − ζn||.

By Burkholder’s inequality (see Appendix C) and assumption (A3) of Chapter2,

E[ sup1≤j≤m

||ζn+j − ζn||2] ≤ a2KE[∑

0≤j<m

‖Mn+j‖2]

≤ a2K∑

0≤j<m

(1 + E[||x(t(n + j))||2])

≤ a2KN(1 + C2)

= aKT (1 + C2)

for m ≤ N , C as in (9.2.1) and suitable K, K > 0. Hence

E[ supk≤m

||x(t(n + k))− xt(n)(t(n + k)))||2] 12

≤ aL

m−1∑

k=0

E[supj≤k

||x(t(n + j))− xt(n)(t(n + j))||2] 12

+ aTKE[(1 + ‖x(t(n))‖)2] 12 + E[ sup

1≤j≤m||ζn+m − ζn||2] 1

2

≤ aL

m−1∑

k=0

E[supj≤k

||x(t(n + j))− xt(n)(t(n + j))||2] 12

+ aTK√

1 + C2 +√

aKT (1 + C2)

Page 114: Stochastic Approximation: A Dynamical Systems Viewpoint

9.2 Asymptotic behaviour 105

for a suitable K > 0. By the discrete Gronwall inequality (Appendix B), itfollows that

E[ supn≤j≤n+N

||x(t(j))− xt(n)(t(j))||2] 12 ≤ √

aK

for a suitable K > 0 that depends on T . Since

E[ supt∈[t(k),t(k+1)]

‖x(t)− x(t(k))‖2] 12

and

E[ supt∈[t(k),t(k+1)]

‖xt(n)(t)− xt(n)(t(k))‖2] 12

are also O(√

a), it is easy to deduce from the above that

E[ supt∈[0,T ]

||x(t(n) + t)− xt(n)(t(n) + t)||2] 12 ≤ √

aK

for a suitable K > 0. This completes the proof for T of the form T = Na. Thegeneral case can be proved easily from this. ¥

Let (9.1.2) have a globally asymptotically stable compact attractor A and letρ(x,A) def= miny∈A ||x−y|| denote the distance of x ∈ Rd from A. For purposesof the proof of the next result, we introduce R > 0 such that

(i) Aa def= x ∈ Rd : ρ(x,A) ≤ a ⊂ B(R) def= x ∈ Rd : ||x|| < R, and(ii) for a > 0 as above,

supn

P (||xn|| ≥ R) < a and supn

E[||xn||2I||xn|| ≥ R] < a. (9.2.8)

(The second part of (9.2.8) is possible by the uniform integrability con-dition (9.2.2).)

By global asymptotic stability of (9.1.2), we may pick T = Na > 0 largeenough such that for any solution x(·) thereof with x(0) ∈ B(R)− A, one hasx(T ) ∈ Aa and

ρ(x(T ), A) ≤ 12ρ(x(0), A). (9.2.9)

We also need the following lemma:

Lemma 2. There exists a constant K∗ > 0 depending on T above, such thatfor t ≥ 0,

E[ρ(x(t + T ), A)2Ix(t) ∈ Aa] 12 ≤ K∗√a,

E[ρ(x(t + T ), A)2Ix(t) ∈ B(R)c] 12 ≤ K∗√a.

Page 115: Stochastic Approximation: A Dynamical Systems Viewpoint

106 Constant Stepsize Algorithms

Proof. In what follows, K∗ > 0 denotes a suitable constant possibly dependingon T , not necessarily the same each time. For xt(·) as above, we have

E[ρ(x(t + T ), A)2Ix(t) ∈ Aa] 12

≤ E[ρ(xt(t + T ), A)2Ix(t) ∈ Aa] 12 + K∗√a,

= K∗√a.

Here the inequality follows by Lemma 1 and the equality by our choice of T ,which implies in particular that the expectation on the right-hand side of theinequality is O(a). Also, by comparing xt(·) with a trajectory x′(·) in A andusing the Gronwall inequality, we get

||xt(t + T )− x′(T )|| ≤ K∗||xt(t)− x′(0)|| = K∗||x(t)− x′(0)||.

Hence ρ(xt(t + T ), A) ≤ K∗ρ(x(t), A). Since A is bounded, we also haveρ(y, A)2 ≤ K∗(1 + ‖y‖2) ∀y. Then

E[ρ(x(t + T ), A)2Ix(t) ∈ B(R)c] 12

≤ K∗√a + E[ρ(xt(t + T ), A)2Ix(t) ∈ B(R)c] 12

≤ K∗√a + K∗E[ρ(x(t), A)2Ix(t) ∈ B(R)c] 12

≤ K∗√a + K∗E[(1 + ‖x(t)‖2)Ix(t) ∈ B(R)c] 12

≤ K∗√a.

Here the first inequality follows from Lemma 1, the second and the third fromthe preceding observations, and the last from (9.2.8). This completes the proof.

¥

Theorem 3. For a suitable constant K > 0,

lim supn→∞

E[ρ(xn, A)2]12 ≤ K

√a. (9.2.10)

Proof. Take t = ma > 0. By Lemma 1,

E[||x(t + T )− xt(t + T )||2] 12 ≤ K ′√a

Page 116: Stochastic Approximation: A Dynamical Systems Viewpoint

9.3 Refinements 107

for a suitable K ′ > 0. Then using (9.2.8), (9.2.9) and Lemma 2, one has:

E[ρ(xm+N , A)2]12

= E[ρ(x(t + T ), A)2]12

≤ E[ρ(x(t + T ), A)2Ix(t) /∈ B(R)] 12

+ E[ρ(x(t + T ), A)2Ix(t) ∈ B(R)−Aa] 12

+ E[ρ(x(t + T ), A)2Ix(t)) ∈ Aa] 12

≤ 2K∗√a + E[ρ(x(t + T ), A)2Ix(t) ∈ B(R)−Aa] 12

≤ 2K∗√a + E[ρ(xt(t + T ), A)2Ix(t) ∈ B(R)−Aa] 12

+ E[||x(t + T )− xt(t + T )||2] 12

≤ (2K∗ + K ′)√

a + E[ρ(xt(t + T ), A)2Ix(t) ∈ B(R)−Aa] 12

≤ (2K∗ + K ′)√

a +12E[ρ(xt(t), A)2Ix(t) ∈ B(R)−Aa] 1

2

= (2K∗ + K ′)√

a +12E[ρ(xm, A)2]

12 .

Here the second inequality follows from Lemma 2 and the last one by (9.2.9).Iterating, one has

lim supk→∞

E[ρ(xm+kN , A)2]12 ≤ 2(2K∗ + K ′)

√a.

Repeating this for m + 1, . . . , m + N − 1, in place of m, the claim follows. ¥

Let ε > 0. By the foregoing and the Chebyshev inequality, we have

lim supn→∞

P (ρ(xn, A) > ε) = O(a),

which captures the intuitive statement that ‘xn concentrate around A witha high probability as n →∞’.

9.3 Refinements

This section collects together the extensions to the constant stepsize scenarioof various other results developed earlier for the decreasing stepsize case. Inmost cases, we only sketch the idea, as the basic philosophy is roughly similarto that for the decreasing stepsize case.

(i) Stochastic recursive inclusions: Consider

xn+1 = xn + a[yn + Mn+1], n ≥ 0,

with yn ∈ h(xn) ∀n for a set-valued map h satisfying the conditions stip-ulated at the beginning of Chapter 5. Let x(·) denote the interpolated

Page 117: Stochastic Approximation: A Dynamical Systems Viewpoint

108 Constant Stepsize Algorithms

trajectory as in Chapter 5. For T > 0, let ST denote the solution set ofthe differential inclusion

x(t) ∈ h(x(t)), t ∈ [0, T ]. (9.3.1)

Under the ‘linear growth’ condition on h(·) stipulated in Chapter 5,viz., supy∈h(x) ‖h(y)‖ ≤ K(1+‖x‖), a straightforward application of theGronwall inequality shows that the solutions to (9.3.1) remain uniformlybounded on finite time intervals for uniformly bounded initial conditions.For z(·) ∈ C([0, T ];Rd), let

d(z(·),ST ) def= infy(·)∈ST

supt∈[0,T ]

||z(t)− y(t)||.

For t ≥ 0, suppose that:

(†)E[||x(t)||2] remains bounded as a ↓ 0. (This is a ‘stability ’ condition.)

Then the main result here is:

Theorem 4. d(x(·)|[t′,t′+T ],ST )a↓0→ 0 in law, uniformly in t′ ≥ 0.

Proof. Fix t′ ∈ [n0a, (n0 + 1)a), n0 ≥ 0. Define x(·) by

˙x(t) = yn0+m, t ∈ [(n0 + m)a,(n0 + m + 1)a) ∩ [t′,∞),

0 ≤ m < N, (9.3.2)

with x(t′) = xn0 . Let T = Na for simplicity. Then by familiar argu-ments,

E[ sups∈[t′,t′+T ]

||x(s)− x(s)||2] 12 = O(

√a). (9.3.3)

By the ‘stability’ condition (†) mentioned above, the law of x(t′) remainstight as a ↓ 0. That is, x(t′), which coincides with x(t′), remains tightin law. Also, for t1 < t2 in [t′, t′ + T ],

E[‖x(t2)− x(t1)‖2] ≤ |t2 − t1|2 supn0≤m≤n0+N

E[‖ym‖2]

≤ |t2 − t1|2Kfor a suitable K > 0 independent of a. For the second inequality above,we have used the linear growth condition on h(·) along with the ‘stability’condition (†) above. By the tightness criterion of Billingsley (1968), p.95, it then follows that the laws of x(t′ + s), s ∈ [0, T ], remain tightin P(C([0, T ];Rd)) as a → 0 and t′ varies over [0,∞). Thus alongany sequences a ≈ a(k) ↓ 0, t′(k) ⊂ [0,∞), we can take a further

Page 118: Stochastic Approximation: A Dynamical Systems Viewpoint

9.3 Refinements 109

subsequence, denoted a(k), t′(k) again by abuse of notation, so thatthe x(t′(k) + ·) converge in law. Denote x(t′(k) + ·) by xk(·) in order tomake their dependence on k, t′(k) explicit. By Skorohod’s theorem (seeAppendix C), there exist C([0, T ];Rd)-valued random variables xk(·)such that xk(·), xk(·) agree in law separately for each k and xk(·)converge a.s. Now argue as in the proof of Theorem 1 of Chapter 5 toconclude that a.s., the limit thereof in C([0, T ];Rd) is in ST , i.e.,

d(xk(·),ST ) → 0 a.s.

Thus

d(xk(·),ST ) → 0 in law.

In view of (9.3.3), we then have

d(xk(t′(k) + ·),ST ) → 0 in law,

where the superscript k renders explicit the dependence on the stepsizea(k). The claim follows. ¥

If the differential inclusion (9.3.1) has a globally asymptotically stablecompact attractor A, we may use this in place of Lemma 1 to derive avariation of Theorem 3 for the present set-up.

(ii) Avoidance of traps: This is rather easy in the constant stepsize set-up. Suppose (9.1.2) has compact attractors A1, . . . , AM with respectivedomains of attraction D1, . . . , DM . Suppose that the set U

def= the com-plement of ∪M

i=1Di, has the following property: For any ε > 0, thereexists an open neighbourhood U ε of U and a constant C > 0 such thatfor A = ∪iAi,

lim supn→∞

E[ρ(xn, Aa)2Ixn ∈ U ε] ≤ Ca. (9.3.4)

Intuitively, this says that the ‘bad’ part of the state space has low proba-bility in the long run. One way (9.3.4) can arise is if U has zero Lebesguemeasure and the laws of the xn have uniformly bounded densities w.r.t.the Lebesgue measure on Rd. One then chooses ε small enough that(9.3.4) is met. Under (9.3.4), we can modify the calculation in the proofof Theorem 3 as

E[ρ(xm+N , A)2]12

≤ E[ρ(x(t + T ), A)2Ix(t) /∈ B(R)] 12

+ E[ρ(x(t + T ), A)2Ix(t) ∈ U ε] 12

+ E[ρ(x(t + T ), A)2Ix(t) ∈ B(R) ∩ (U ε ∪Aa)c] 12

+ E[ρ(x(t + T ), A)Ix(t)) ∈ Aa] 12

Page 119: Stochastic Approximation: A Dynamical Systems Viewpoint

110 Constant Stepsize Algorithms

Choosing T appropriately as before, argue as before to obtain (9.2.10),using (9.3.4) in addition to take care of the second term on the right.

(iii) Stability: We now sketch how the first stability criterion described insection 3.2 can be extended to the constant stepsize framework. Theclaim will be that supn E[||xn||2] remains bounded for a sufficiently smallstepsize a.

As in section 3.2, let hc(x) def= h(cx)/c ∀x, 1 ≤ c < ∞, and h∞(·) def=limit in C(Rd) of hc(·) as c ↑ ∞, assumed to exist. Let assumption (A5)there hold, i.e., the o.d.e.

x∞(t) = h∞(x∞(t))

have the origin as the unique globally asymptotically stable equilibriumpoint. Let xc(·) denote a solution to the o.d.e.

xc(t) = hc(xc(t)), (9.3.5)

for c ≥ 1. Then by Corollary 3 of Chapter 3, there exists c0 > 0, T > 0,

such that whenever ||xc(0)|| = 1, one has

||xc(t)|| < 14

for t ∈ [T, T + 1] ∀ c ≥ c0. (9.3.6)

We take T = Na for some N ≥ 1 without loss of generality, whichspecifies N as a function of T and a. Let Tn = nNa, n ≥ 0. By analogywith what we did for the decreasing stepsize case in Chapter 3, definex(·) as follows. On [Tn, Tn+1], define

xn((nN + k)a) def=xnN+k

||xnN || ∨ 1, 0 ≤ k ≤ N,

with linear interpolation. Define

x(t) def= lims↓t

xn(s), t ∈ [Tn, Tn+1).

Let x(Tn+1−) def= limt↑Tn+1 x(t) for n ≥ 0. Then x(·) is piecewise linearand continuous except possibly at the Tn, where it will be right continu-ous with its left limit well defined. Let xn(·) denote a solution to (9.3.5)on [Tn, Tn+1) with c = ||xnN || ∨ 1 and xn(Tn) = x(Tn) for n ≥ 0. Bythe arguments of section 9.2,

E[ supt∈[Tn,Tn+1)

||x(t)− xn(t)||2] 12 ≤ C1

√a, ∀n, (9.3.7)

for a suitable constant C1 > 0 independent of a. As before, let Fn =σ(xi,Mi, i ≤ n), n ≥ 0.

Page 120: Stochastic Approximation: A Dynamical Systems Viewpoint

9.3 Refinements 111

Lemma 5. For n ≥ 0 and a suitable constant C2 > 0 depending on T ,

sup0≤k≤N

E[||xnN+k||2|FnN ]12 ≤ C2(1 + ||xnN ||) a.s. ∀ n ≥ 0.

Proof. Recall that

E[||Mn+1||2|Fn] ≤ K3(1 + ||xn||2) ∀n, (9.3.8)

for some K3 > 0 (cf. assumption (A3) of Chapter 2). By (9.1.1), forn ≥ 0, 0 ≤ k ≤ N,

E[||xnN+k+1||2|FnN ]12

≤ E[||xnN+k||2|FnN ]12 + aK ′(1 + E[||xnN+k||2|FnN ]

12 ),

for a suitable K ′ > 0, where we use (9.3.8) and the linear growth con-dition on h. The claim follows by iterating this inequality. ¥

Theorem 6. For sufficiently small a > 0, supn E[||xn||2] < ∞.

Proof. Let c0, T be such that (9.3.6) holds and pick√

a ≤ (2C1)−1 forC1 as in (9.3.7). For n ≥ 0, 0 ≤ k ≤ N, we have

E[||x(n+1)N ||2|FnN ]12

= E[||x(n+1)N ||2|FnN ]12 (||xnN || ∨ 1)

≤ E[||x(n+1)N − xn(Tn+1)||2|FnN ]12 (||xnN || ∨ 1)

+ E[||xn(Tn+1)||2|FnN ]12 (||xnN || ∨ 1)

≤ 12||xnN || ∨ 1

+ E[||xn(Tn+1)||2I||xnN || ≥ c0|FnN ]12 (||xnN || ∨ 1)

+ E[||xn(Tn+1)||2I||xnN || < c0|FnN ]12 (||xnN || ∨ 1)

≤ 12||xnN ||+ 1

4||xnN ||+ C

=34||xnN ||+ C,

where the second inequality follows by (9.3.7) and our choice of a, andthe third inequality follows from (9.3.6), (9.3.7) and Lemma 5 above,with C

def= C2(1 + c0) + 1. Thus

E[‖x(n+1)N‖2]12 ≤ 3

4E[‖xnN‖2] 1

2 + C.

By iterating this inequality, we have supn E[||xnN ||2] < ∞, whence theclaim follows by Lemma 5. ¥

Page 121: Stochastic Approximation: A Dynamical Systems Viewpoint

112 Constant Stepsize Algorithms

(iv) Two timescales: By analogy with section 6.1, consider the coupled iter-ations

xn+1 = xn + a[h(xn, yn) + M1n+1], (9.3.9)

yn+1 = yn + b[g(xn, yn) + M2n+1], (9.3.10)

for 0 < b << a, where h, g are Lipschitz and M in+1, i = 1, 2, are

martingale difference sequences satisfying

E[||M in+1||2|M j

m, xm, ym,m ≤ n, j = 1, 2]

≤ C(1 + ||xn||2 + ||yn||2),for i = 1, 2. Assume (9.2.1) and (9.2.2) for xn along with their coun-terparts for yn. We also assume that the o.d.e.

x(t) = h(x(t), y) (9.3.11)

has a unique globally asymptotically stable equilibrium λ(y) for each y

and a Lipschitz function λ(·), and that the o.d.e.

y(t) = h(λ(y(t)), y(t)) (9.3.12)

has a unique globally asymptotically stable equilibrium y∗. Then thearguments of section 6.1 may be combined with the arguments of section9.2 to conclude that

lim supn→∞

E[||xn − λ(y∗)||2 + ||yn − y∗||2] = O(a) + O(b

a). (9.3.13)

Specifically, consider first the timescale corresponding to the stepsize a

and consider the interpolated trajectory x(·) on [na, na + T ] for Tdef=

Na > 0, for some N ≥ 1 and n ≥ 0. Let xn(·) denote the solution ofthe o.d.e. (9.3.11) on [na, na + T ] for y = yn and xn(na) = xn. Thenarguing as for Lemma 1 above,

E[ supt∈[na,na+T ]

||x(t)− xn(t)||2] = O(a) + O(b

a).

Here we use the easily established fact that

sup0≤k≤N

E[||yn+k − yn||2] = O(b

a),

and thus the approximation yn+k ≈ yn, 0 ≤ k ≤ N, contributes onlyanother O( b

a ) error. Given our hypotheses on the asymptotic behaviourof (9.3.11), it follows that

lim supn→∞

E[||xn − λ(yn)||2] = O(a) + O(b

a).

Page 122: Stochastic Approximation: A Dynamical Systems Viewpoint

9.3 Refinements 113

Next, consider the timescale corresponding to b. Let T ′ def= Mb > 0 forsome M ≥ 1. Consider the interpolated trajectory y(·) on [nb, nb + T ′]defined by y(mb) def= ym ∀m, with linear interpolation. Let yn(·) denotethe solution to (9.3.12) on [nb, nb + T ′] with yn(nb) = yn for n ≥ 0.Then argue as in the proof of Lemma 1 to conclude that

lim supn→∞

E[ supt∈[na,na+T ′]

||y(t)− yn(t)||2] = O(a) + O(b

a).

The only difference from the argument leading to Lemma 1 is an ad-ditional error term due to the approximation xn ≈ λ(yn), which isO(a)+O( b

a ) as observed above. (This, in fact, gives the O(a)+O( ba ) on

the right-hand side instead of O(b).) Given our hypotheses on (9.3.12),this implies that

lim supn→∞

E[||yn − y∗||2] = O(a) + O(b

a),

which in turn will also yield (in view of the Lipschitz continuity of λ(·))

lim supn→∞

E[||xn − λ(y∗)||2] = O(a) + O(b

a).

(v) Averaging the ‘natural’ timescale: Now consider

xn+1 = xn + a[h(xn, Yn) + Mn+1], n ≥ 0, (9.3.14)

where Yn is as in sections 6.2 and 6.3. That is, it is a process takingvalues in a complete separable metric space S such that for any Borelset A ⊂ S,

P (Yn+1 ∈ A|Ym, Zm, xm,m ≤ n) =∫

A

p(dy|Yn, Zn, xn), n ≥ 0.

Here Zn takes values in a compact metric space U and p(dy|·, ·, ·) isa continuous ‘controlled’ transition probability kernel. Mimicking thedevelopments of Chapter 6, define

µ(t) def= δ(Yn,Zn), t ∈ [na, (n + 1)a), n ≥ 0,

where δ(y,z) is the Dirac measure at (y, z). For s ≥ 0, let xs(t), t ≥ s, bethe solution to the o.d.e.

xs(t) = h(xs(t), µ(t)) def=∫

h(xs(t), ·)dµ(t). (9.3.15)

Define x(·) as at the start of section 9.2. Then by familiar arguments,

E[ supt∈[s,s+T ]

||x(t)− xs(t)||2] = O(a). (9.3.16)

Assume the ‘stability condition’ (†) above. It then follows by familiar

Page 123: Stochastic Approximation: A Dynamical Systems Viewpoint

114 Constant Stepsize Algorithms

arguments that, as a ↓ 0, the laws of xs(·)|[s,s+T ], s ≥ 0, remain tightas probability measures on C([0, T ];Rd). Suppose the laws of µ(s + ·)remain tight as well. Then every sequence a(n) ↓ 0, s = s(a(n)) ∈ [0,∞),has a further subsequence, denoted by a(n), s(a(n)) again by abuseof terminology, such that the corresponding processes (xs(·), µ(s + ·))converge in law. Invoking Skorohod’s theorem, we may suppose thatthis convergence is a.s. We shall now need a counterpart of Lemma 6of Chapter 6. Let fi be as in the proof of Lemma 6, Chapter 6, anddefine ξi

n, τ(n, t) as therein, with a(m) ≡ a. It is then easily verifiedthat almost surely as a ↓ 0 along a(`), we have

E[(τ(n,t)∑m=n

a(f(Ym+1)−∫

f(y)p(dy|Ym, Zm, xm))2]

= E[τ(n,t)∑m=n

a2(f(Ym+1)−∫

f(y)p(dy|Ym, Zm, xm))2]

= O(a)a↓0→ 0.

As for Lemma 6, Chapter 6, this leads to the following: almost surely,for any limit point (x(·), µ(·)) of (xs(·), µ(s + ·)) as a ↓ 0 along a(`),

∫ t

0

(fi(y)−∫

fi(w)p(dw|y, z, x(s))µ(s)(dydz))ds = 0

∀ i ≥ 1, t ≥ 0. Argue as in the proof of the lemma to conclude thatµ(t) ∈ D(x(t)) ∀t. It then follows that a.s., xs(·)|[s,s+T ] converges to

GTdef= the set of trajectories of the differential inclusion

x(t) ∈ h(x(t)), s ≤ t ≤ s + T, (9.3.17)

for a set-valued map h defined as in section 6.3. Set

d(z(·),GT ) def= infy(·)∈GT

supt∈[0,T ]

||z(t)− y(t)||, z(·) ∈ C([0, T ];Rd).

Then we can argue as in extension (i) above to conclude:

Theorem 7. For any T > 0, d( x(·)|[s,s+T ] , GT )a↓0→ 0 in law uniformly

in s ≥ 0.

The asymptotic behaviour of the algorithm as n → ∞ may thenbe inferred from the asymptotic behaviour of trajectories in GT as inChapter 5. As in Chapter 6, consider the special case when there isno ‘external control process’ Zn in the picture and in addition, for

Page 124: Stochastic Approximation: A Dynamical Systems Viewpoint

9.3 Refinements 115

xn ≡ x ∈ Rd, the process Yn is an ergodic Markov process with theunique stationary distribution ν(x). Then (9.3.17) reduces to

x(t) = h(x(t), ν(x(t))).

(vi) Asynchronous implementations: This extension proceeds along linessimilar to that for decreasing stepsize, with the corresponding claimsadapted as in section 9.2. Thus, for example, for the case with no inter-processor delays, we conclude that for t ≥ 0,

E[ sups∈[t,t+T ]

||x(s)− xt(s)||2] = O(a),

where x(s), s ∈ [t, t + T ], is a trajectory of

˙x(s) = Λ(s)h(x(s)), x(t) = x(t),

for a Λ(·) as in Theorem 2 of Chapter 7. The delays simply contributeanother O(a) error term and thus do not affect the conclusions.

One can use this information to ‘rig’ the stepsizes so as to get thedesired limiting o.d.e. when a common clock is available. This is alongthe lines of the concluding remarks of section 7.4. Thus, for example,suppose the components are updated one at a time according to anergodic Markov chain Yn on their index set. That is, at time n, theYnth component is being updated. Suppose the chain has a stationaryprobability vector [π1, . . . , πd]. Then by Corollary 8 of Chapter 6, Λ(t) ≡diag(π1, . . . , πd). Thus if we use the stepsize a/πi for the ith component,we get the limiting o.d.e. x(t) = h(x(t)) as desired. In practice, we mayuse a/ηn(i) instead, where ηn(i) is an empirical estimate of πi obtainedby suitable averaging on a faster timescale, so that it tracks πi. As inChapter 7, this latter arrangement also extends in a natural manner tothe more general case when Yn is ‘controlled Markov’ as in Chapter6.

(vii) Limit theorems: For T = Na > 0 as above, let xs(·), s = na (say),denote the solution of (9.1.2) on [s, s+T ] with xs(s) = xn for n ≥ 0. Wefix T and vary a, with N = dT

a e, s = na. Define zn(t), t ∈ [na, na + T ],by

zn((n + k)a) def=1√a

(xn+k − xs((n + k)a)) , 0 ≤ k ≤ N,

with linear interpolation. Then arguing as in sections 8.2 and 8.3 (withthe additional hypotheses therein), we conclude that the limits in lawas n →∞ of the laws of zn(·), viewed as C([0, T ];Rd)-valued randomvariables, are the laws of a random process on [0, T ] of the form

z∗(t) =∫ t

0

∇h(x∗s(s))z∗(s)ds +∫ t

0

D(x∗s(s))dB(s), t ∈ [0, T ],

Page 125: Stochastic Approximation: A Dynamical Systems Viewpoint

116 Constant Stepsize Algorithms

where D(·) is as in Chapter 8, x∗s(·) is a solution of (9.1.2), and B(·)is a standard Brownian motion in Rd. If we let s ↑ ∞ as well and(9.1.2) has a globally asymptotically stable compact attractor A, x∗s(·)will concentrate with high probability in a neighbourhood of the setx(·) ∈ C([0, T ];Rd) : x(·) satisfies (9.1.2) with x(0) ∈ A. Furthermore,in the special case when (9.1.2) has a unique globally asymptoticallystable equilibrium xeq, x∗(·) ≡ xeq, z∗(·) is a Gauss–Markov process,and we recover the central limit theorem for xn.

Page 126: Stochastic Approximation: A Dynamical Systems Viewpoint

10

Applications

10.1 Introduction

This chapter is an overview of several applications of stochastic approximationin broad strokes. These examples are far from exhaustive and are meant to giveonly a flavour of the immense potentialities of the basic scheme. In each case,only the general ideas are sketched. The details are relatively routine in manycases. In some cases where they are not, pointers to the relevant literature areprovided.

The applications have been broadly classified into the following three cate-gories:

(i) Stochastic gradient schemes: These are stochastic approximation ver-sions of classical gradient ascent or descent for optimizing some perfor-mance measure.

(ii) Stochastic fixed point iterations: These are stochastic approximationversions of classical fixed point iterations xn+1 = f(xn) for solving thefixed point equation x = f(x).

(iii) Collective phenomena: This is the broad category consisting of diversemodels of interacting autonomous agents arising in engineering or eco-nomics.

In addition we have some miscellaneous instances that don’t quite fit any ofthe categories above. In several cases, we shall only look at the limiting o.d.e.This is because the asymptotic behaviour of the actual stochastic iteration canbe easily read off from this o.d.e. in view of the theory developed so far in thisbook.

117

Page 127: Stochastic Approximation: A Dynamical Systems Viewpoint

118 Applications

10.2 Stochastic gradient schemes

Stochastic gradient schemes are iterations of the type

xn+1 = xn + a(n)[−∇f(xn) + Mn+1],

where f(·) is the continuously differentiable function we are seeking to minimizeand the expression in square brackets represents a noisy measurement of thegradient. (We drop the minus sign on the right-hand side when the goal ismaximization.) Typically lim||x||→∞ f(x) = ∞, ensuring the existence of aglobal minimum for f . The limiting o.d.e. then is

x(t) = −∇f(x(t)), (10.2.1)

for which f itself serves as a ‘Liapunov function’:

d

dtf(x(t)) = −||∇f(x(t))||2 ≤ 0,

with a strict inequality when ∇f(x(t)) 6= 0. Let Hdef= x : ∇f(x) = 0 denote

the set of equilibrium points for this o.d.e. Recall the definition of an ω-limitset from Appendix B.

Lemma 1. The only possible invariant sets that can occur as ω-limit sets for(10.2.1) are the subsets of H.

Proof. If the statement is not true, there exists a trajectory x(·) of (10.2.1) suchthat its ω-limit set contains a non-constant trajectory x(·). By the foregoingobservations, f(x(t)) must be monotonically decreasing. Let t > s, implyingf(x(t)) < f(x(s)). But by the definition of an ω-limit set, we can find t1 <

s1 < t2 < s2 < · · · such that x(tn) → x(t) and x(sn) → x(s). It follows that forsufficiently large n, f(x(tn)) < f(x(sn)). This contradicts the fact that f(x(·))is monotonically decreasing, proving the claim. ¥

Suppose H is a discrete set. Assume f to be twice continuously differentiable.Then∇f is continuously differentiable and its Jacobian matrix, i.e., the Hessianmatrix of f , is positive definite at x ∈ H if and only if x is a local minimum.Thus linearizing the o.d.e. around any x ∈ H, we see that the local minima arethe stable equilibria of the o.d.e. and the ‘avoidance of traps’ results in Chapter3 tell us that xn will converge a.s. to a local minimum under reasonableconditions. (Strictly speaking, one should allow the situation when both thefirst and the second derivatives vanish at a point in H. We ignore this scenarioas it is non-generic.)

The assumption that H is discrete also seems reasonable in view of the resultfrom Morse theory that f with isolated critical points are dense in C(Rd) (see,e.g., Chapter 2 of Matsumoto, 2002). But this has to be taken with a pinch

Page 128: Stochastic Approximation: A Dynamical Systems Viewpoint

10.2 Stochastic gradient schemes 119

of salt. In many stochastic-approximation-based parameter tuning schemes inengineering, non-isolated equilibria can arise due to overparametrization.

There are several variations on the basic scheme, mostly due to the unavail-ability of even a noisy measurement of the gradient assumed in the foregoing.In many cases, one needs to approximately evaluate the gradient. Thus we mayreplace the scheme above by

xn+1 = xn + a(n)[−∇f(xn) + Mn+1 + η(n)],

where η(n) is the additional ‘error’ in gradient estimation. Suppose one has

supn||η(n)|| < ε0

for some small ε0 > 0. Then by Theorem 6 of Chapter 5, the iterates convergea.s. to a small neighbourhood of some point in H. (The smaller the ε0 weare able to pick, the better the prospects.) This result may be further refinedby adapting the ‘avoidance of traps’ argument of Chapter 4 to argue that theconvergence is in fact to a neighbourhood of some local minimum.

The simplest such scheme, going back to Kiefer and Wolfowitz (1952), usesa finite difference approximation. Let xn

def= [xn(1), . . . , xn(d)]T and similarly,Mn

def= [Mn(1), . . . ,Mn(d)]T. Let eidef= denote the unit vector in the ith coor-

dinate direction for 1 ≤ i ≤ d and δ > 0 a small positive scalar. The algorithmis

xn+1(i) = xn(i) + a(n)[−(

f(xn + δei)− f(xn − δei)2δ

)+ Mn+1(i)]

for 1 ≤ i ≤ d, n ≥ 0. (Mn+1 here collects together the net ‘noise’ in all thefunction evaluations involved.) By Taylor’s theorem, the error in replacing thegradient with its finite difference approximation as above is O(δ||∇2f(xn)||),where ∇2f denotes the Hessian matrix of f . If this error is small, the foregoinganalysis applies. (A further possibility is to slowly reduce δ to zero, whencethe accuracy of the approximation improves. But usually the division by δ

would also feature in the martingale difference term Mn+1 above and there is aclear trade-off between improvement of the mean error due to finite differenceapproximation alone and increased fluctuation and numerical problems causedby the small denominator.)

Note that this scheme requires 2d function evaluations. If one uses ‘one-sideddifferences’ to replace the algorithm above by

xn+1(i) = xn(i) + a(n)[−(

f(xn + δei)− f(xn)δ

)+ Mn+1(i)],

the number of function evaluations is reduced to d+1, which may still be high

Page 129: Stochastic Approximation: A Dynamical Systems Viewpoint

120 Applications

for many applications. A remarkable development in this context is the simul-taneous perturbation stochastic approximation (SPSA) due to Spall (1992). Let∆n(i), 1 ≤ i ≤ d, n ≥ 0 be i.i.d. random variables such that

(i) ∆ndef= [∆n(1), . . . , ∆n(d)]T is independent of Mi+1, xi, i ≤ n; ∆j , j < n,

for each n ≥ 0, and(ii) P (∆m(i) = 1) = P (∆m(i) = −1) = 1

2 .

Considering the one-sided scheme for simplicity, we replace the algorithm aboveby

xn+1(i) = xn(i) + a(n)[−(

f(xn + δ∆n)− f(xn)δ∆n(i)

)+ Mn+1(i)],

for n ≥ 0. Note that by Taylor’s theorem, for each i,(

f(xn + δ∆n)− f(xn)δ∆n(i)

)≈ ∂f

∂xi(xn) +

j 6=i

∂f

∂xj(xn)

∆n(j)∆n(i)

.

The expected value of the second term on the right is zero. Hence it acts asjust another noise term like Mn+1, averaging out to zero in the limit. Thusthis is a valid approximate gradient scheme which requires only two functionevaluations. A two-sided counterpart can be formulated similarly. Yet anothervariation which requires a single function evaluation is

xn+1(i) = xn(i) + a(n)[−(

f(xn + δ∆n)δ∆n(i)

)+ Mn+1(i)],

which uses the fact that(

f(xn + δ∆n)δ∆n(i)

)≈ f(xn)

δ∆n(i)+

∂f

∂xi(xn) +

j 6=i

∂f

∂xj(xn)

∆n(j)∆n(i)

.

Both the first and the third term on the right average out to zero in the limit,though the small δ in the denominator of the former degrades the performance.See Chapter 7 of Spall (2003) for a comparison of the two alternatives from apractical perspective. In general, one can use more general ∆n(i) as long as theyare i.i.d. zero mean and ∆n(i), ∆n(i)−1 satisfy suitable moment conditions – seeSpall (2003). Bhatnagar et al. (2003) instead use cleverly chosen deterministicsequences to achieve the same effect, with some computational advantages.

Another scheme which works with a single function evaluation at a time isthat of Katkovnik and Kulchitsky (1972). The idea is as follows: Suppose wereplace ∇f by its approximation

Dfσ(x) def=∫

Gσ(x− y)∇f(y)dy,

where Gσ(·) is the Gaussian density with mean zero and variance σ2 and the

Page 130: Stochastic Approximation: A Dynamical Systems Viewpoint

10.2 Stochastic gradient schemes 121

integral is componentwise. This is a good approximation to ∇f for small valuesof σ2. Integrating by parts, we have

Dfσ(x) =∫∇Gσ(x− y)f(y)dy,

where the right-hand side can be cast as another (scaled) Gaussian expecta-tion. Thus it can be approximated by a Monte Carlo technique which may bedone either separately in a batch mode, or on a faster timescale as suggestedin section 6.1. This scheme has the problem of numerical instability due to thepresence of a small term σ2 that appears in the denominator of the actual com-putation and may need smoothing and/or truncation to improve its behaviour.See section 7.6 of Rubinstein (1981) for the general theoretical framework andvariations on this idea.

Note that the scheme

xn+1 = xn + a(n)[h(xn) + Mn+1], n ≥ 0,

will achieve the original objective of minimizing f for any h(·) that satisfies

〈∇f(x), h(x)〉 < 0 ∀ x /∈ H.

We shall call such schemes gradient-like. One important instance of these isa scheme due to Fabian (1960). In this, h(x) = −sgn(∇f(x)), where the ithcomponent of the right-hand side is simply +1 or −1 depending on whether theith component of ∇f(x) is < 0 or > 0, and is zero if the latter is zero. Thus

〈h(x),∇f(x)〉 = −∑

i

| ∂f

∂xi(x)| < 0 ∀x /∈ H.

This scheme typically has more graceful, but slower behaviour away from H.Since the sgn(·) function defined above is discontinuous, one has to invoke thetheory of stochastic recursive inclusions to analyze it as described in section5.3, under the heading Discontinuous dynamics. That is, one considers thelimiting differential inclusion

x(t) ∈ h(x(t)),

where the ith component of the set-valued map h(x) is +1 or −1 dependingon whether the ith component of h(x) is > 0 or < 0, and is [−1, 1] if it is zero.In practical terms, the discontinuity leads to some oscillatory behaviour whena particular component is near zero, which can be ‘smoothed’ out by taking asmooth approximation to sgn(·) near its discontinuity.

In the optimization literature, there are improvements on basic gradient de-scent such as the conjugate gradient and Newton/quasi-Newton methods. Thestochastic approximation variants of these have also been investigated, see, e.g.,Anbar (1978), Ruppert (1985) and Ruszczynski and Syski (1983).

Page 131: Stochastic Approximation: A Dynamical Systems Viewpoint

122 Applications

A related situation arises when one is seeking a saddle point of a functionf(·, ·) : A×B ⊂ Rn ×Rn → R, i.e., a point (x∗, y∗) such that

minx

maxy

f(x, y) = maxy

minx

f(x, y) = f(x∗, y∗).

This is known to exist, e.g., when A,B are compact convex and f(·, y) (resp.f(x, ·)) is convex (resp. concave) for each fixed y (resp. x). Given noisymeasurements of the corresponding partial derivatives, one may then per-form (say) stochastic gradient descent xn w.r.t. the x-variable on the fasttimescale and stochastic gradient ascent yn w.r.t. the y-variable on the slowtimescale. By our arguments of Chapter 6, xn will asymptotically trackg(yn) def= argmin(f(·, yn)), implying that yn tracks the o.d.e.

y(t) = ∇yf(z, y)|z=g(y).

Here ∇y is the gradient in the y-variable. Assuming the uniqueness of thesaddle point, y(·) and therefore yn will then converge (a.s. in the latter case)to y∗ under reasonable conditions if we are able to rewrite the o.d.e. above as

y(t) = ∇y minx

f(x, y),

i.e., to claim that ∇y minx f(x, y) = ∇yf(z, y)|z=g(y). This is true under verystringent conditions, and is called the ‘envelope theorem’ in mathematical eco-nomics. Some recent extensions thereof (Milgrom and Segal, 2002; see alsoBardi and Capuzzo-Dolcetta, 1997, pp. 42–46) sometimes allow one to extendthis reasoning to more general circumstances. See Borkar (2005) for one suchsituation.

An important domain of related activity is that of simulation-based opti-mization, wherein one seeks to maximize a performance measure and eitherthis measure or its gradient is to be estimated from a simulation. There areseveral important strands of research in this area and we shall very briefly de-scribe a few. To start with, consider the problem of maximizing over a realparameter θ a performance measure J(θ) def= Eθ[f(X)], where f is a nice (say,continuous) function R → R and Eθ[ · ] denotes the expectation of the realrandom variable X whose distribution function is Fθ. The idea is to update theguesses θ(n) for the optimal θ based on simulated values of pseudo-randomvariables Xn generated such that the distribution function of Xn is Fθ(n) foreach n. Typically one generates a random variable X with a prescribed con-tinuous and strictly increasing distribution function F by taking X = F−1(U),where U is uniformly distributed on [0, 1]. A slightly messier expression workswhen F is either discontinuous or not strictly increasing. More generally, onehas X = Ψ(U) for U as above and a suitable Ψ. Thus we may suppose thatf(Xn) = Φ(Un, θ(n)) for Un i.i.d. uniform on [0, 1] and some suitable Φ. Sup-pose Φ(u, ·) is continuously differentiable and the interchange of expectation

Page 132: Stochastic Approximation: A Dynamical Systems Viewpoint

10.2 Stochastic gradient schemes 123

and differentiation ind

dθE[Φ(U, θ)] = E[

∂θΦ(U, θ)]

is justified. Then a natural scheme would be the stochastic approximation

θ(n + 1) = θ(n) + a(n)[∂

∂θΦ(Un+1, θ(n))],

which will track the o.d.e.

θ(t) =d

dθJ(θ).

This is the desired gradient ascent. This computation is the basic idea behindinfinitesimal perturbation analysis (IPA) and its variants.

Another variation is the likelihood ratio method which assumes that thelaw µθ corresponding to Fθ is absolutely continuous with respect to a ‘baseprobability measure’ µ and the likelihood ratio (or Radon–Nikodym derivative)Λθ(·) def= dµθ

dµ (·) is continuously differentiable in θ. Then

J(θ) =∫

fdµθ =∫

fΛθdµ.

Suppose the interchange of expectation and differentiation

d

dθJ(θ) =

∫f

d

dθΛθdµ

is justified. Then the stochastic approximation

θ(n + 1) = θ(n) + a(n)[f(Xn+1)d

dθΛθ(Xn+1)|θ=θ(n)],

where Xn are i.i.d. with law µ, will track the same o.d.e. as above.It is also possible to conceive of a combination of the two schemes, see section

15.4 of Spall (2003). The methods get complicated if the Xn are not inde-pendent, e.g., in case of a Markov chain. See Ho and Cao (1991), Glasserman(1991) and Fu and Hu (1997) for extensive accounts of IPA and its variants.

An alternative approach in case of a scalar parameter is to have two simu-lations Xn and X ′

n corresponding to θ(n) and its ‘small perturbation’θ(n)+δ for some small δ > 0 respectively. That is, the conditional law of Xn

(resp. X ′n), given Xm, X ′

m,m < n, and θ(m),m ≤ n, is µθ(n) (resp. µθ(n)+δ).The iteration scheme is

θ(n + 1) = θ(n) + a(n)(

f(X ′n+1)− f(Xn+1)

δ

).

By the results of Chapter 5, this tracks the o.d.e.

θ(t) =J(θ + δ)− J(θ)

δ,

Page 133: Stochastic Approximation: A Dynamical Systems Viewpoint

124 Applications

the approximate gradient ascent. Bhatnagar and Borkar (1998) take thisviewpoint with an additional stochastic approximation iteration on a fastertimescale for explicit averaging of f(Xn) and f(X ′

n). This leads to moregraceful behaviour. Bhatnagar and Borkar (1997) do the same with the two-timescale effect achieved, not through the choice of different stepsize schedules,but by performing the slow iteration along an appropriately chosen subsampleof the time instants at which the fast iteration is updated. As already men-tioned in Chapter 6, in actual experiments a judicious combination of the twowas found to work better than either by itself. The advantage of these schemesis that they are no more difficult to implement and analyze for controlledMarkov Xn (resp. X ′

n), wherein the conditional law of Xn+1 (resp. X ′n+1)

given Xm, X ′m, θ(m),m ≤ n, depends on Xn, θ(n) (resp. X ′

n, θ(n)), than theyare for Xn (resp. X ′

n) as above wherein the latter depends on θ(n) alone.The disadvantage on the other hand is that for d-dimensional parameters, oneneeds (d + 1) simulations, one corresponding to θ(n) and d correspondingto a δ-perturbation of each of its d components. Bhatnagar et al. (2003) workaround this by combining the ideas above with SPSA.

Most applications of stochastic gradient methods tend to be for minimizationof an appropriately defined measure of mean ‘error’ or ‘discrepancy’. Themean square error and the relative (Kullback–Leibler) entropy are the twomost popular instances. We have seen one example of the former in Chapter 1,where we discussed the problem of finding the optimal parameter β to minimizeE[||Yn − fβ(Xn)||2], where (Yi, Xi), i ≥ 1, are i.i.d. pairs of observations andfβ(·) is a parametrized family of functions. Most parameter tuning algorithmsin the neural network literature are of this form, although some use the other,i.e., ‘entropic’ discrepancy measure. See Haykin (1999) for an overview. Inparticular, the celebrated backpropagation algorithm is a stochastic gradientscheme involving the gradient of a ‘layered’ composition of sums of nonlinearmaps. The computation of the gradient is split into simple local computationsusing the chain rule of calculus.

An even older application is to adaptive signal processing (Haykin, 1991).More sophisticated variants appear in system identification where one tries tolearn the dynamics of a stochastic dynamic system based on observed outputs(Ljung, 1999).

See also Fort and Pages (1995) and Kosmatopoulos and Christodoulou (1996)for an analysis of the Kohonen algorithm for learning vector quantization (LVQ)that seeks to find points x1, . . . , xk (say) so as to minimize

E[ min1≤i≤k

||xk − Y ||2],

given i.i.d. samples of Y . These help us to identify k ‘clusters’ in the obser-

Page 134: Stochastic Approximation: A Dynamical Systems Viewpoint

10.3 Stochastic fixed point iterations 125

vations, each identified with one of the xi, with the understanding that everyobservation is associated with the nearest xi, 1 ≤ i ≤ k.

Gradient and gradient-like schemes are guaranteed to converge to a localminimum at best. A scheme that ensures convergence to a global minimum inprobability is simulated annealing , which we do not deal with here. It involvesadding a slowly decreasing extraneous noise to ‘push’ the iterates out of localminima that are not global minima often enough that their eventual conver-gence to the set of global minima (in probability) is assured. See, e.g., Gelfandand Mitter (1991).

An alternative ‘correlation-based’ scheme that asymptotically behaves likea gradient scheme, but does not involve explicit gradient estimation, is the‘Alopex’ algorithm analyzed in Sastry, Magesh and Unnikrishnan (2002).

10.3 Stochastic fixed point iterations

In this section we consider iterations of the type

xn+1 = xn + a(n)[F (xn)− xn + Mn+1]. (10.3.1)

That is, h(x) = F (x) − x in our earlier notation. The idea is that xnshould converge to a solution x∗ of the equation F (x∗) = x∗, i.e., to a fixedpoint of F (·). We shall be interested in the following specific situation: Letwi > 0, 1 ≤ i ≤ d, be prescribed ‘weights’ and define norms on Rd equivalentto the usual Euclidean norm: for x = [x1, . . . , xd] ∈ Rd and w = [w1, . . . , wd]as above,

||x||w,pdef= (

d∑

i=1

wi|xi|p)1p , 1 ≤ p < ∞,

and

||x||w,∞def= max

iwi|xi|.

We assume that

||F (x)− F (y)||w,p ≤ α||x− y||w,p ∀ x, y ∈ Rd, (10.3.2)

for some w as above, 1 ≤ p ≤ ∞, and α ∈ [0, 1]. We shall say that F is acontraction w.r.t. the norm || · ||w,p if (10.3.2) holds with α ∈ [0, 1) and a non-expansive map w.r.t. this norm if it holds with α = 1. As the names suggest,in the former case an application of the map contracts distances by a factor ofat least α < 1, in the latter case it does not increase or ‘expand’ them. By thecontraction mapping theorem (see Appendix A), a contraction has a uniquefixed point whereas a non-expansive map may have none (e.g., F (x) = x + 1),one (e.g., F (x) = −x), or many, possibly infinitely many (e.g., F (x) = x).

Page 135: Stochastic Approximation: A Dynamical Systems Viewpoint

126 Applications

The limiting o.d.e. is

x(t) = F (x(t))− x(t). (10.3.3)

We analyze this equation below for the case α ∈ [0, 1), whence there is a uniquex∗ such that x∗ = F (x∗).

Let Fi(·) denote the ith component of F (·) for 1 ≤ i ≤ d and define V (x) def=||x− x∗||w,p, x ∈ Rd.

Theorem 2. Under (10.3.2), V (x(t)) is a strictly decreasing function of t forany non-constant trajectory of (10.3.3).

Proof. Note that the only constant trajectory of (10.3.3) is x(·) ≡ x∗. Let1 < p < ∞. Let sgn(x) = +1,−1, or 0, depending on whether x > 0, < 0, or= 0. For x(t) 6= x∗, we obtain using the Holder inequality,

d

dtV (x(t))

=p

p(∑

i

wi|xi(t)− x∗i |p)1−p

p (∑

i

wisgn(xi(t)− x∗i )

×|xi(t)− x∗i |p−1xi(t))

= ||x(t)− x∗||1−pw,p (

i

wisgn(xi(t)− x∗i )|xi(t)− x∗i |p−1

×(Fi(x(t))− xi(t)))

= ||x(t)− x∗||1−pw,p (

i

wisgn(xi(t)− x∗i )|xi(t)− x∗i |p−1

×(Fi(x(t))− Fi(x∗)))

− ||x(t)− x∗||1−pw,p (

i

wisgn(xi(t)− x∗i )|xi(t)− x∗i |p−1

×(xi(t)− x∗i ))

≤ ||x(t)− x∗||1−pw,p ||x(t)− x∗||p−1

w,p ||F (x(t))− F (x∗)||w,p

− ||x(t)− x∗||1−pw,p ||x(t)− x∗||p−1

w,p ||x(t)− x∗||w,p

= ||F (x(t))− F (x∗)||w,p − ||x(t)− x∗||w,p

≤ −(1− α)||x(t)− x∗||w,p = −(1− α)V (x(t)),

which is < 0 for x(t) 6= x∗. Here the first term on the right-hand side of the firstinequality comes from the first term on its left-hand side by Holder’s inequality,and the second term on the right exactly equals the second term on the left.This proves the claim for 1 < p < ∞. The inequality above can be written, fort > s ≥ 0, as

||x(t)− x∗||w,p ≤ ||x(s)− x∗||w,p − (1− α)∫ t

s

||x(y)− x∗||w,pdy.

Page 136: Stochastic Approximation: A Dynamical Systems Viewpoint

10.3 Stochastic fixed point iterations 127

Letting p ↓ 1 (resp. p ↑ ∞), the claims for p = 1 (resp. p = ∞) follow bycontinuity of p → ||x||w,p on [1,∞]. ¥

Corollary 3. x∗ is the unique globally asymptotically stable equilibrium of(10.3.3).

Remark: This corollary extends to the non-autonomous case

x(t) = Ft(x(t))− x(t)

in a straightforward manner when the maps Ft, t ≥ 0, satisfy

‖Ft(y)− Ft(z)‖p ≤ α‖y − z‖p ∀ t ≥ 0,

for α ∈ (0, 1), and have a common (unique) fixed point x∗.The non-expansive case (α = 1) for p = ∞ is sometimes useful in dynamic-

programming-related applications. We state the result without proof here,referring the reader to Borkar and Soumyanath (1997) for a proof.

Theorem 4. Let F (·) be ‖ · ‖w,∞-non-expansive. If Hdef= x : F (x) = x 6= φ,

then ||x(t)− x||w,∞ is nonincreasing for any x ∈ H and x(t) → a single pointin H (depending on x(0)).

In particular, the corresponding stochastic approximation iterates convergeto H a.s. by familiar arguments.

The most important application of this set-up is to the reinforcement learningalgorithms for Markov decision processes. We shall illustrate a simple case here,that of the infinite-horizon discounted-cost Markov decision process. Thus wehave a controlled Markov chain Xn on a finite state space S, controlled by acontrol process Zn taking values in a finite ‘action space’ A. Its evolution isgoverned by

P (Xn+1 = i|Xm, Zm,m ≤ n) = p(Xn, Zn, i), n ≥ 0, i ∈ S. (10.3.4)

Here p : S × A × S → [0, 1] is the controlled transition probability functionsatisfying ∑

j

p(i, a, j) = 1 ∀i ∈ S, a ∈ A.

Thus p(i, a, j) is the probability of moving from i to j when the action chosenin state i is a, regardless of the past. Let k : S × A → R be a prescribed‘running cost’ function and β ∈ (0, 1) a prescribed ‘discount factor’. Theclassical discounted cost problem is to minimize

J(i, Zn) def= E[∞∑

m=0

βmk(Xm, Zm)|X0 = i] (10.3.5)

Page 137: Stochastic Approximation: A Dynamical Systems Viewpoint

128 Applications

over admissible Zn (i.e., those that are consistent with (10.3.4)) for each i.The classical approach to this problem is through dynamic programming (see,e.g., Puterman, 1994). Define the ‘value function’

V (i) def= infZn

J(i, Zn), i ∈ S.

It then satisfies the dynamic programming equation

V (i) = mina

[k(i, a) + β∑

j

p(i, a, j)V (j)], i ∈ S. (10.3.6)

In words, this says that ‘the minimum expected cost to go from state i is theminimum of the expected sum of the immediate cost at i and the minimumcost to go from the next state on’. Furthermore, for a v : S → A, the controlchoice Zn = v(Xn) ∀n is optimal for all choices of the initial law for X0 ifv(i) attains the minimum on the right-hand side of (10.3.6) for all i. Thus theproblem is ‘solved’ if we know V (·). Note that (10.3.6) is of the form V = F (V )for F (·) = [F1(·), . . . , F|S|(·)] defined as follows: for x = [x1, . . . , x|S|] ∈ R|S|,

Fi(x) def= mina

[k(i, a) + β∑

j

p(i, a, j)xj ], 1 ≤ i ≤ |S|.

It is easy to verify that

||F (x)− F (y)||∞ ≤ β||x− y||∞, x, y ∈ R|S|.Thus by the contraction mapping theorem (see Appendix A), equation (10.3.6)has a unique solution V and the ‘fixed point iterations’ Vn+1 = F (Vn), n ≥ 0,

converge exponentially to V for any choice of V0. These iterations constitutethe well-known ‘value iteration’ algorithm, one of the classical computationalschemes of Markov decision theory (see Puterman, 1994, for an extensive ac-count).

The problem arises if the function p(·), i.e., the system model, is unknown.Thus the conditional averaging with respect to p(·) implicit in the evaluationof F cannot be performed. Suppose, however, that a simulation device isavailable which can generate a transition according to the desired conditionallaw p(i, a, ·) (say). This situation occurs typically with large complex systemsconstructed by interconnecting several relatively simpler systems, so that whilecomplete analytical modelling and analysis is unreasonably hard, a simulationbased on ‘local’ rules at individual components is not (e.g., a communicationnetwork). One can then use the simulated transitions coupled with stochasticapproximation to average their effects in order to mimic the value iteration.The ‘simulation’ can also be an actual run of the real system in the ‘on-line’version of the algorithm.

This is the basis of the Q-learning algorithm of Watkins (1989), a cornerstone

Page 138: Stochastic Approximation: A Dynamical Systems Viewpoint

10.3 Stochastic fixed point iterations 129

of reinforcement learning. Before we delve into its details, a word on the notionof reinforcement learning: In the classical learning paradigms for autonomousagents in artificial intelligence, one has at one end of the spectrum supervisedlearning in which instructive feedback such as the value of an error measure orits gradient is provided continuously to the agent as the basis on which to learn.This is the case, e.g., in the ‘perceptron training algorithm’ in neural networks– see Haykin (2000). (Think of a teacher.) At the other extreme are unsu-pervised learning schemes such as the learning vector quantization scheme ofKohonen (see, e.g., Kohonen, 2002) which ‘self-organize’ data without any ex-ternal feedback. In between the two extremes lies the domain of reinforcementlearning where the agent gets evaluative feedback, i.e., an observation related tothe performance and therefore carrying useful information about it which, how-ever, falls short of what would constitute exact instructive feedback. (Thinkof a critic.) In the context of Markov decision processes described above, theobserved payoffs (≈ negative of costs) constitute the reinforcement signal.

Q-learning derives its name from the fact that it works with the so-calledQ-factors rather than with the value function. The Q-factors are nothing butthe entities being minimized on the right-hand side of (10.3.6). Specifically, let

Q(i, a) def= k(i, a) + β∑

j

p(i, a, j)V (j), i ∈ S, a ∈ A.

Thus in particular V (i) = mina Q(i, a) ∀i and Q(·) satisfies

Q(i, a) = k(i, a) + β∑

j

p(i, a, j)minb

Q(j, b), i ∈ S, a ∈ A. (10.3.7)

Like (10.3.6), this is also of the form Q = G(Q) for a G : R|S|×|A| → Rsatisfying

||G(Q)−G(Q′)||∞ ≤ β||Q−Q′||∞.

In particular, (10.3.7) has a unique solution Q∗ and the iteration Qn+1 =G(Qn) for n ≥ 0, i.e.,

Qn+1(i, a) = k(i, a) + β∑

j

p(i, a, j)minb

Qn(j, b), n ≥ 0,

converges to Q∗ at an exponential rate. What the passage from (10.3.6) to(10.3.7) has earned us is the fact that now the minimization is inside the con-ditional expectation and not outside, which makes a stochastic approximationversion possible. The stochastic approximation version based on a simulation

Page 139: Stochastic Approximation: A Dynamical Systems Viewpoint

130 Applications

run (Xn, Zn) governed by (10.3.4) is

Qn+1(i, a)

= Qn(i, a) + a(n)IXn = i, Zn = a×[k(i, a) + β min

bQn(Xn+1, b)−Qn(i, a)] (10.3.8)

(= (1− a(n)IXn = i, Zn = a)Qn(i, a)

+ a(n)IXn = i, Zn = a×[k(i, a) + β min

bQn(Xn+1, b)]

)

for n ≥ 0. Thus we replace the conditional expectation in the previous iterationby an actual evaluation at a random variable realized by the simulation deviceaccording to the desired transition probability in question, and then make anincremental move in that direction with a small weight a(n), giving a largeweight (1− a(n)) to the previous guess. This makes it a stochastic approxima-tion, albeit asynchronous, because only one component is being updated at atime. Note that the computation is still done by a single processor. Only onecomponent is updated at a time purely because new information is availableonly for one component at a time, corresponding to the transition that justtook place. Thus there is no reason to use a(ν(i, n)) in place of a(n) as inChapter 7. The limiting o.d.e. then is

q(t) = Λ(t)(G(q(t))− q(t)),

where Λ(t) for each t is a diagonal matrix with a probability vector along itsdiagonal. Assume that its diagonal elements remain uniformly bounded awayfrom zero. Sufficient conditions that ensure this can be stated along the linesof section 7.4, viz., irreducibility of Xn under all control policies of the typeZn = v(Xn), n ≥ 0, for some v : S → A, and a requirement that at each visitto any state i, there is a minimum positive probability of choosing any controla in A. Then this is a special case of the situation discussed in example (ii) insection 7.4. As discussed there, the foregoing ensures convergence of the o.d.e.to Q∗ and therefore the a.s. convergence of Qn to Q∗.

There are other reinforcement learning algorithms differing in philosophy(e.g., the actor-critic algorithm (Barto et al., 1983) which mimics the ‘policyiteration’ scheme of Markov decision theory), or differing in the cost crite-rion, e.g., the ‘average cost’ (Abounadi et al., 2001), or different because of anexplicit approximation architecture incorporated to beat down the ‘curse of di-mensionality’ (see, e.g., Tsitsiklis and Van Roy, 1997). They are all stochasticapproximations.

Contractions and non-expansive maps are, however, not the only ones forwhich convergence to a unique fixed point may be proved. One other case, for

Page 140: Stochastic Approximation: A Dynamical Systems Viewpoint

10.4 Collective phenomena 131

example, is when −F is monotone, i.e.,

〈x− y, F (x)− F (y)〉 < 0 ∀ x 6= y.

In the affine case, i.e., F (x) = Ax + b for a d × d matrix A and b ∈ Rd, thiswould mean that the symmetric part of A, i.e., 1

2 (A + AT), would have to benegative definite. Suppose F has a fixed point x∗. Then

d

dt‖x(t)− x∗‖2 = 2〈x(t)− x∗, F (x(t))− x(t)〉

= 2〈x(t)− x∗, F (x(t))− F (x∗)〉 − 2‖x(t)− x∗‖2< 0

for x(t) 6= x∗. Thus ‖x(t) − x∗‖2 serves as a Liapunov function, leading tox(t) → x∗. In particular, if x′ were another fixed point of F , x(t) ≡ x′ wouldsatisfy (10.3.3), forcing x′ = x∗. Hence x∗ is the unique fixed point of F and aglobally asymptotically stable equilibrium for the o.d.e.

10.4 Collective phenomena

The models we have considered so far are concerned with adaptation by asingle agent. An exciting area of current research is the scenario when severalinteracting agents are each trying to adapt to an environment which in turn isaffected by the other agents in the pool. A simple case is that of two agentsin the ‘nonlinear urn’ scenario discussed in Chapter 1. Suppose we have twoagents and an initially empty urn, with the first agent adding either zero or onered ball to the urn at each (discrete) time instant and the other doing likewisewith black balls. Let xn (resp. yn) denote the fraction of times up to time nthat a red (resp. black) ball is added. That is, if ξn

def= I a red ball is added attime n and ζn

def= I a black ball is added at time n for n ≥ 1, then

xndef=

∑nm=1 ξm

n, yn

def=∑n

m=1 ζm

n, n ≥ 1.

Suppose the conditional probability that a red ball is added at time n giventhe past up to time n is p(yn) and the corresponding conditional probabilitythat a black ball is added at time n is q(xn), for prescribed Lipschitz functionsp(·), q(·) : [0, 1] → [0, 1]. That is, the probability with which an agent adds aball at any given time depends on the empirical frequency with which the otheragent has been doing so until then. Arguing as in Chapter 1, we then have

xn+1 = xn +1

n + 1[p(yn+1)− xn + Mn+1],

yn+1 = yn +1

n + 1[q(xn+1)− yn + M ′

n+1],

Page 141: Stochastic Approximation: A Dynamical Systems Viewpoint

132 Applications

for suitably defined martingale differences Mn+1, M ′n+1. This leads to the

o.d.e.

x(t) = p(y(t))− x(t), y(t) = q(x(t))− y(t), t ≥ 0.

Note that the ‘driving vector field’ on the right-hand side,

h(x, y) = [h1(x, y), h2(x, y)]T def= [p(y)− x, q(x)− y]T,

satisfies

Div(h) def=∂h1(x, y)

∂x+

∂h2(x, y)∂y

= −2.

From o.d.e. theory, one knows then that the maps (x, y) → (x(t), y(t)) fort > 0 ‘shrink’ the volume in R2. In fact one can show that the flow becomesasymptotically one dimensional and therefore, as argued in Chapter 1, mustconverge (Benaim and Hirsch, 1997). That is, the fractions of red and blackballs stabilize, leading to a fixed asymptotic probability of picking for either.

While such nonlinear urns have been studied as economic models, this analy-sis extends to more general problems, viz., two-person repeated bimatrix games(Kaniovski and Young, 1995). Here two agents, say agent 1 and agent 2, re-peatedly play a game in which there are the same two strategy choices availableto each of them at each time, say a1, b1 for agent 1 and a2, b2 for agent 2.Based on the strategy pair (ξ1

n, ξ2n) ∈ a1, b1×a2, b2 chosen at time n, agent

i gets a payoff of hi(ξ1n, ξ2

n) for i = 1, 2. Let

νi(n) def=∑n

m=1 Iξim = ai

n, i = 1, 2; n ≥ 0,

specify their respective ‘empirical strategies’. That is, at time n agent i appearsto agent j 6= i as though she is choosing ai with probability νi(n) and bi withprobability 1 − νi(n). We assume that agent j 6= i plays at time n + 1 her‘best response’ to agent i’s empirical strategy given by the probability vector[νi(n), 1−νi(n)]. Suppose that this best response is given in turn by a Lipschitzfunction fj : [0, 1] → [0, 1]. That is, she plays ξj

n+1 = aj with probabilityfj(νi(n)). Modulo the assumed regularity of the best response maps f1, f2,this is precisely the ‘fictitious play’ model of Brown (1951), perhaps the firstlearning model in game theory which has been extensively analyzed. A similarrule applies when i and j are interchanged. Then the above analysis leads tothe limiting o.d.e.

ν1(t) = F1(ν1(t), ν2(t))def= f1(ν2(t))− ν1(t),

ν2(t) = F2(ν1(t), ν2(t))def= f2(ν1(t))− ν2(t).

Again, div([F1, F2])def= ∂F1

∂ν1+ ∂F2

∂ν2= −2, leading to the same conclusion as

before. That is, their strategies converge a.s. This limit, say [ν∗1 , ν∗2 ], forms a

Page 142: Stochastic Approximation: A Dynamical Systems Viewpoint

10.4 Collective phenomena 133

Nash equilibrium: neither agent can improve her lot by moving away from thechosen strategy if the other one doesn’t. This is immediate from the fact thatν∗i is the best response to ν∗j for i 6= j.

There are, however, some drastic oversimplifications in the foregoing. Oneimportant issue is that the ‘best response’ is often non-unique and thus onehas to replace this o.d.e. by a suitable differential inclusion. Also, the situationin dimensions higher than two is no longer as easy. See Chapters 2 and 3 ofFudenberg and Levine (1998) for more on fictitious play.

Another model of interacting agents is the o.d.e.

x(t) = h(x(t)) (10.4.1)

with∂hi

∂xj> 0, j 6= i. (10.4.2)

These are called cooperative o.d.e.s, the idea being that increase in the jthcomponent, corresponding to some desirable quantity for the jth agent, willlead to an increase in the ith component as well for i 6= j. (The strict inequalityin (10.4.2) can be weakened to ‘≥’ as long as the Jacobian matrix of h isirreducible, i.e., for any partition I ∪ J of the row/column indices, there issome i ∈ I, j ∈ J such that the (i, j)th element is nonzero. See section 4.1of Smith, 1995, for details.) Suppose the trajectories remain bounded. Thena theorem of Hirsch states that for all initial conditions belonging to an opendense set, x(·) converges to the set of equilibria (see Hirsch, 1985; also Smith,1995).

As an application, we consider the problem of dynamic pricing in a systemof parallel queues from Borkar and Manjunath (2004). There are K parallelqueues and an entry charge pi(n) is charged for the ith queue, reflecting (say)its quality of service. Let yi(n) denote the queue length in the ith queue attime n. There is an ‘ideal profile’ y∗ = [y∗1 , . . . , y∗K ] of queue lengths whichwe want to stay close to, and the objective is to manage this by modulatingthe respective prices dynamically. Let Γi denote the projection to [εi, Bi] for0 ≤ i ≤ K, where εi > 0 is a small number and Bi is a convenient a priori upperbound. Let a > 0 be a small constant stepsize. The scheme is, for 1 ≤ i ≤ K,

pi(n + 1) = Γi (pi(n) + api(n)[yi(n)− y∗i ]) , n ≥ 0.

The idea is to increase the price if the current queue length is above the ideal(so as to discourage new entrants) and decrease it if the opposite is true (toencourage more entrants). The scalar εi is the minimum price which alsoensures that the iteration does not get stuck at zero. We assume that if theprice vector is frozen at some p = [p1, . . . , pK ], the process of queue lengths is

Page 143: Stochastic Approximation: A Dynamical Systems Viewpoint

134 Applications

ergodic. Ignoring the boundary of the box B def= Πi[εi, Bi], the limiting o.d.e. is

pi(t) = pi(t)[fi(p(t))− y∗i ], 1 ≤ i ≤ K,

where p(t) = [p1(t), . . . , pK(t)] and fi(p) is the stationary average of the queuelength in the ith queue when the price vector is frozen at p. It is reasonableto assume that if the εi are sufficiently low and the Bi are sufficiently high,then p(t) is eventually pushed inwards from the boundary of B, so that we mayignore the boundary effects and the above is indeed the valid o.d.e. limit forthe price adjustment mechanism. It is also reasonable to assume that

∂fi

∂pj> 0,

as an increase in the price of one queue keeping all else constant will force itspotential customers to other queues. (As mentioned above, this condition canbe relaxed.) Thus this is a cooperative o.d.e. and the foregoing applies. Onecan say more: Letting f(·) = [f1(·), . . . , fK(·)]T, it follows by Sard’s theorem(see, e.g., Ortega and Rheinboldt, 1970, p. 130) that for almost all choices of y∗,the Jacobian matrix of f is nonsingular on the inverse image of y∗. Hence bythe inverse function theorem (see, e.g., ibid., p. 125), this set, which is also theset of equilibria for the above o.d.e. in B, is discrete. Thus the o.d.e. convergesto a point for almost all initial conditions. One can argue as in Chapter 9to conclude that the stationary distribution of p(n) concentrates near theequilibria of the above o.d.e. Note that all equilibria in B are equivalent as faras our aim of keeping the vector of queue lengths near y∗ is concerned. Thuswe conclude that the dynamically adjusted prices asymptotically achieve thedesired queue length profile, giving it the right ‘pushes’ if it deviates.

Yet another popular dynamic model of interaction is the celebrated replicatordynamics, studied extensively by mathematical biologists – see, e.g., Hofbauerand Siegmund (1998). This dynamics resides in the d-dimensional probabilitysimplex S

def= x = [x1, . . . , xd] : xi ≥ 0 ∀i, ∑i xi = 1, and is given by

xi(t) = xi(t)[Di(x(t))−∑

j

xj(t)Dj(x(t))], 1 ≤ i ≤ d, (10.4.3)

where x(t) = [x1(t), . . . , xd(t)] and D(·) = [D1(·), . . . , Dd(·)] is a Lipschitz map.The interpretation is that for each i, xi(t) is the fraction of the population attime t that belongs to species i. Di(x) is the payoff received by the ith specieswhen the population profile is x (i.e., when the fraction of the populationoccupied by the jth species is xj for all j). Equation (10.4.3) then says inparticular that the fraction of a given species in the population increases ifthe payoff it is receiving is higher than the population average, and decreasesif it is lower. If

∑i xi(t) = 1, then summed over i, the right-hand side of

(10.4.3) vanishes, implying that∑

i xi(t) = 1 ∀t if it is so for t = 0. Thus

Page 144: Stochastic Approximation: A Dynamical Systems Viewpoint

10.4 Collective phenomena 135

the simplex S is invariant under this o.d.e. The faces of S correspond to oneor more components being zero, i.e., the population of that particular speciesis ‘extinct’. These are also individually invariant, because the correspondingcomponents of the right-hand side of (10.4.3) vanish.

This equation has been studied extensively (Hofbauer and Siegmund, 1998).We shall consider it under the additional condition that −D(·) be monotone,i.e.,

〈x− y, D(x)−D(y)〉 < 0 ∀ x 6= y ∈ S.

In the affine case, i.e., D(x) = Ax + b for a d × d matrix A and b ∈ Rd, thiswould mean that the symmetric part of A, i.e., 1

2 (A + AT), would have to benegative definite.

Lemma 5. There exists a unique x∗ ∈ S such that x∗ maximizes x → xTD(x∗).

Proof. The set-valued map that maps x ∈ S to the set of maximizers of thefunction y ∈ S → yTD(x) is nonempty compact convex and upper semicontin-uous, as can be easily verified. Thus by the Kakutani fixed point theorem (see,e.g., Border, 1985), it follows that there exists an x∗ such that x∗ maximizesthe function y → yTD(x∗) over S. Suppose x′ 6= x∗ is another point in S suchthat x′ maximizes the function y → yTD(x′) on S. Then

〈x∗ − x′,D(x∗)−D(x′)〉= (x∗)TD(x∗)− (x∗)TD(x′)− (x′)TD(x∗) + (x′)TD(x′)

≥ 0.

This contradicts the ‘monotonicity’ condition above unless x∗ = x′. ¥

Theorem 6. For x(0) in the interior of S, x(t) → x∗.

Proof. Define

V (x) def=∑

i

x∗i `n(

x∗ixi

), x = [x1, · · · , xd] ∈ S.

Then an application of Jensen’s inequality shows that V (x) ≥ 0 and is 0 if andonly if x = x∗. Also,

d

dtV (x(t)) = −

i

x∗i

(xi(t)xi(t)

)

= (x(t)− x∗)TD(x(t))

≤ (x(t)− x∗)TD(x(t))− (x(t)− x∗)TD(x∗)

= (x(t)− x∗)T(D(x(t))−D(x∗))

< 0,

Page 145: Stochastic Approximation: A Dynamical Systems Viewpoint

136 Applications

for x(t) 6= x∗. Here the second equality follows from the o.d.e. (10.4.3), the firstinequality follows from our choice of x∗, and the last inequality follows fromthe monotonicity assumption on −D. The claim follows easily from this. ¥

Thus the stochastic approximation counterpart will converge a.s. to x∗. Note,however, that one requires a projection to keep the iterates in S, and also‘sufficiently rich’ noise, possibly achieved by adding some extraneous noise, inorder to escape getting trapped in an undesirable face of S.

Another ‘convergent’ scenario is when Dj(x) = ∂F∂xj

∀j for some continuouslydifferentiable function F (·). In that case,

d

dtF (x(t)) =

i

xi(t)(

∂F

∂xi

)2

−(∑

i

xi∂F

∂xi

)2

≥ 0,

with strict inequality when ∇F 6= 0, so that −F (·) serves as a Liapunov func-tion. See Sandholm (1998) for an interesting application of this to transporta-tion science. A special case arises when Dj(x) =

∑k R(i, j)xj ∀j, where R is a

positive definite matrix. A time-scaled version of this occurs in the analysis ofthe asymptotic behaviour of vertex-reinforced random walks (Benaim, 1997).Pemantle (2006) surveys this and related dynamics.

The monotonicity or gradient conditions above, however, are not natural inmany applications. Without monotonicity, (10.4.3) can show highly complexbehaviour (Hofbauer and Siegmund, 1998). In a specific application, Borkarand Kumar (2003) get a partial characterization of the asymptotic behaviour.

The reinforcement learning paradigm introduced in section 10.3 also has amultiagent counterpart in which several agents control the transition probabil-ities of the Markov chain, each with its own objective in mind. The specialcase of this when a fixed static game is played repeatedly is called a ‘repeatedgame’. We have already seen one instance of this situation in our discussion offictitious play. This particular scenario has been extensively studied. Neverthe-less, the results are limited except for special cases. The case of a two-personzero-sum game, wherein one agent tries to maximize a performance measurewhile the other agent tries to minimize it, is one situation where the dynamicprogramming ideas can be extended in a straightforward way. For example,suppose this performance measure is of the form (10.3.5) with Zn standing fora pair of controls (Z1

n, Z2n) chosen respectively by the two agents independently

of each other. Then (10.3.6) simply extends to the Shapley equation

V (i) = mina

maxb

[k(i, (a, b)) + β∑

j

p(i, (a, b), j)V (j)]

= maxb

mina

[k(i, (a, b)) + β∑

j

p(i, (a, b), j)V (j)], i ∈ S.

The corresponding Q-learning scheme may then be written accordingly. See

Page 146: Stochastic Approximation: A Dynamical Systems Viewpoint

10.5 Miscellaneous applications 137

Hu and Wellman (1998), Jafari et al. (2001), Littman (2001), Sastry et al.(1994), Singh et al. (2000), for some important contributions to this area, whichalso highlight the difficulties. Leslie and Collins (2002) provide an interestingexample of multiple timescales in this context. Despite all this and much else,the area of multiagent learning remains wide open with several unsolved issues.See Fudenberg and Levine (1998), Vega Redondo (1996), Young (1998), Young(2004) for a flavour of some of the economics-motivated work in this directionand Sargent (1993) for some more applications of stochastic approximation ineconomics, albeit with a different flavour.

10.5 Miscellaneous applications

This section discusses a couple of instances which don’t fall into the abovecategorization, just to emphasize that the rich possibilities extend well beyondthe paradigms described above.

(i) A network example: Consider the o.d.e.

xi(t) = ai(x(t))[bi(xi(t))−

j

cijfj(∑

k

cjkgk(xk(t)))].

Assume:

• ai(x), bi(y), fi(y), gi(y), g′i(y) are strictly positive and are Lipschitz,where g′i = dgi

dy .• cij = cji.

Then V (x) =∑

i

( ∫ ∑j cijgj(xj)

0 fi(y)dy − ∫ xi

0bi(y)g′i(y)dy

)serves as a

Liapunov function, because

dV (x(t))/dt

=∑

i

[∑

j

fj(∑

k

cjkgk(xk(t)))cjig′i(xi(t))

− bi(xi(t))g′i(xi(t))]xi(t)

= −∑

i

ai(x(t))g′i(xi(t))[bi(xi(t))

−∑

j

cijfj(∑

k

cjkgk(xk(t)))]2

≤ 0.

One can have ‘neighbourhood structure’ by having N(i) def= the set ofneighbours of i, with the requirement that

cij = cji > 0 ⇐⇒ i ∈ N(j) ⇐⇒ j ∈ N(i).

One can allow i ∈ N(i). One ‘network’ interpretation is:

Page 147: Stochastic Approximation: A Dynamical Systems Viewpoint

138 Applications

• j ∈ N(i) if j’s transmission can be ‘heard’ by i.• xj(t) = the traffic originating from j at time t.• cij capture the distance effects.• ∑

k cjkgk(xk(t)) = the net traffic ‘heard’ by j at time t.• Each node reports the volume of the net traffic it has heard to its

neighbours and updates its own traffic so that the higher the nettraffic it hears, the more it decreases its own flow correspondingly.

An equilibrium of this o.d.e. will be characterized by

bi(xi) =∑

j

cijfj(∑

k

cjkgk(xk)) ∀i.

While we have used a network interpretation here, one can also re-cover the Kelly–Maulloo–Tan (1998) model of the tatonnement processin network pricing context and the Cohen–Grossberg model (see Chap-ter 14 of Haykin, 2000) which covers several neural network modelsincluding the celebrated Hopfield model. An important precursor to theCohen–Grossberg model is Grossberg (1978).

(ii) Principal component analysis: This is a problem from statistics, whereinone has n-dimensional data for large n > 1 and the idea is to findm << n directions along which the data can be said to be concentrated.Standard theoretical considerations then suggest that this should be theeigenspace of the empirical covariance matrix corresponding to its m

largest eigenvalues. The neural network methodology for finding thissubspace is based on adaptively learning the weight matrix of an appro-priately designed network using stochastic approximation. There areseveral variations on this theme, see, e.g., section 8.3 of Hertz, Kroghand Palmer (1991). One celebrated instance due to Oja (1982) leads tothe limiting matrix o.d.e. in Rn×m given by

W (t) = QW (t)−W (t)(W (t)TQW (t)),

where Q ∈ Rn×n is a positive definite matrix. (Assuming zero meandata, this will in fact be its covariance matrix obtained from empiricaldata by the averaging property of stochastic approximation.) Leizarowitz(1997) analyzes this equation by control theoretic methods and estab-lishes that W (t) does indeed converge to a matrix whose column vectorsspan the eigenspace corresponding to the m largest eigenvalues of Q

(see also Yan, Helmke and Moore, 1994). See Oja and Karhunen (1985)for a precursor and Yoshizawa, Helmke and Moore (2001), Helmke andMoore (1993) for yet another related dynamics with several importantapplications.

Page 148: Stochastic Approximation: A Dynamical Systems Viewpoint

10.5 Miscellaneous applications 139

The account above is far from exhaustive. (Just to give one example, wehave not covered the very interesting dynamics for consensus / coordinationbeing studied in the robotics and sensor networks literature – see, e.g., Cuckerand Smale (2007) and the references therein.) The message, if any, of theforegoing is that any convergent o.d.e. is a potential paradigm for a stochastic-approximation-based algorithm.

Page 149: Stochastic Approximation: A Dynamical Systems Viewpoint

11

Appendices

11.1 Appendix A: Topics in analysis

11.1.1 Continuous functions

We briefly recall here the two key theorems about continuous functions usedin the book. Recall that a subset of a topological space is relatively compact(resp. relatively sequentially compact) if its closure is compact (resp. sequen-tially compact). Also, compactness and sequential compactness are equiva-lent notions for metric spaces. The first theorem concerns the relative com-pactness in the space C([0, T ];Rd) of continuous functions [0, T ] → Rd fora prescribed T > 0. C([0, T ];Rd) is a Banach space under the ‘sup-norm’||f || def= supt∈[0,T ] ||f(t)||. That is,

(i) it is a vector space over the reals,(ii) || · || : C([0, T ];Rd) → [0,∞) satisfies

(a) ||f || ≥ 0, with equality if and only if f ≡ 0,(b) ||αf || = |α|||f || for α ∈ R,(c) ||f + g|| ≤ ||f ||+ ||g||,

and(iii) || · || is complete, i.e., fk ⊂ C([0, T ];Rd), ||fm−fn|| → 0 as m,n →∞,

imply that there exists an f ∈ C([0, T ];Rd) such that ||fn − f || → 0(the uniqueness of this f is obvious).

A set B ⊂ C([0, T ];Rd) is said to be equicontinuous at t ∈ [0, T ] if forany ε > 0, one can find a δ > 0 such that |t − s| < δ, s ∈ [0, T ] impliessupf∈B ||f(s) − f(t)|| < ε. It is simply equicontinuous if it equicontinuousat all t ∈ [0, T ]. It is said to be pointwise bounded if for any t ∈ [0, T ],supf∈B ||f(t)|| < ∞. It can be verified that if the set is equicontinuous, it willbe pointwise bounded at all points in [0, T ] if it is so at one point in [0, T ]. Theresult we are interested in is the Arzela–Ascoli theorem which characterizesrelative compactness in C([0, T ];Rd):

140

Page 150: Stochastic Approximation: A Dynamical Systems Viewpoint

11.1 Appendix A: Topics in analysis 141

Theorem 1. B ⊂ C([0, T ];Rd) is relatively compact if and only if it is equicon-tinuous and pointwise bounded.

See Appendix A of Rudin (1991) for a proof and further developments.The space C([0,∞);Rd) of continuous functions [0,∞) → Rd is given the

coarsest topology such that the map that takes f ∈ C([0,∞);Rd) to its restric-tion to [0, T ], viewed as an element of the space C([0, T ];Rd), is continuous forall T > 0. In other words, fn → f in this space if and only if fn|[0,T ] → f |[0,T ]

in C([0, T ];Rd) for all T > 0. This is not a Banach space, but a Frechet space,i.e., it has a complete translation-invariant metric and the corresponding openballs are convex. This metric can be, e.g.,

ρ(f, g) def=∞∑

T=1

2−T ||f − g||T ∧ 1,

where we denote by || · ||T the sup-norm on C([0, T ];Rd) to make explicit itsdependence on T . By our choice of the topology on C([0,∞);Rd), Theorem 1holds for this space as well.

The next result concerns contractions, i.e., maps f : S → S on a metric spaceS endowed with a metric ρ, that satisfy

ρ(f(x), f(y)) ≤ αρ(x, y)

for some α ∈ [0, 1). We say that x∗ is a fixed point of f if x∗ = f(x∗). Assumethat ρ is complete, i.e., limm,n→∞ ρ(xn, xm) = 0 implies xn → x∗ for somex∗ ∈ S. The next theorem is called the contraction mapping theorem.

Theorem 2. There exists a unique fixed point x∗ of f and for any x0 ∈ S, theiteration

xn+1 = f(xn), n ≥ 0,

satisfies

ρ(xn, x∗) ≤ αnρ(x0, x∗), n ≥ 0,

i.e., xn converges to x∗ at an exponential rate.

This is an example of a fixed point theorem. Another example is Brouwer’sfixed point theorem, which says that every continuous map f : C → C for acompact convex C ⊂ Rd has a fixed point.

11.1.2 Square-integrable functions

Consider now the space L2([0, T ];Rd) of measurable functions f : [0, T ] → Rd

satisfying∫ T

0

||f(t)||2dt < ∞.

Page 151: Stochastic Approximation: A Dynamical Systems Viewpoint

142 Appendices

Letting 〈·, ·〉 denote the inner product in Rd, we can define an inner product〈·, ·〉T on L2([0, T ];Rd) by

〈f, g〉T def=∫ T

0

〈f(t), g(t)〉dt, f, g ∈ L2([0, T ];Rd).

This is easily seen to be a valid inner product, i.e., a symmetric continuous mapL2([0, T ];Rd)2 → R that is separately linear in each argument and satisfies:〈f, f〉T ≥ 0, with equality if and only if f ≡ 0 a.e. It thus defines a norm

||f || def=√〈f, f〉T =

(∫ T

0

||f(t)||2dt

) 12

,

which turns out to be complete. L2([0, T ];Rd) is then a Hilbert space with theabove inner product and norm.

The open balls w.r.t. this norm define what is called the strong topology onL2([0, T ];Rd). One can also define the weak topology as the coarsest topologyw.r.t. which the functions f → 〈f, g〉T are continuous for all g ∈ L2([0, T ];Rd).The corresponding convergence concept is: fn → f weakly in L2([0, T ];Rd) ifand only if 〈fn, g〉T → 〈f, g〉T for all g ∈ L2([0, T ];Rd). The results we needare the following:

Theorem 3. A || · ||-bounded set B ⊂ L2([0, T ];Rd) is relatively compact andrelatively sequentially compact in the weak topology.

Theorem 4. If fn → f weakly in L2([0, T ];Rd), then there exists a subsequencefn(k) such that

|| 1m

m∑

k=1

fn(k) − f || → 0.

Theorem 3 is a special instance of the Banach–Alaoglu theorem. See Rudin(1991) for this and related developments. Likewise, Theorem 4 is an instanceof the Banach–Saks theorem, see Balakrishnan (1976, p. 29) for a proof.

Let X denote the space of measurable maps f : [0,∞) → Rd with theproperty

∫ T

0

||f(t)||2dt < ∞ ∀T > 0.

Topologize X with the coarsest topology that renders continuous the maps

f →∫ T

0

〈f(t)g(t)〉dt

for all g ∈ L2([0, T ];Rd), for all T > 0. Then by our very choice of topology,Theorems 3 and 4 apply to X after the following modification: A set B ⊂ Xwith the property that f |[0,T ] : f ∈ B is || · ||T -bounded in L2([0, T ];Rd) for

Page 152: Stochastic Approximation: A Dynamical Systems Viewpoint

11.2 Appendix B: Ordinary differential equations 143

all T > 0 will be relatively compact and relatively sequentially compact in X .Furthermore, if fn → f in X , then for any T > 0, there exists a subsequencefn(k) such that

|| 1m

m∑

k=1

fn(k)|[0,T ] − f |[0,T ]||T → 0.

11.1.3 Lebesgue’s theorem

Let f : R → R be a measurable and locally integrable function and for t > s

in R, let g(t) =∫ t

sf(y)dy. Then Lebesgue’s theorem states that for a.e. t ≥ s,

ddtg(t) exists and equals f(t).

11.2 Appendix B: Ordinary differential equations

11.2.1 Basic theory

This chapter briefly summarizes some key facts about ordinary differentialequations of relevance to us. The reader may refer to standard texts such asHirsch, Smale and Devaney (2003) for further details. Consider the differentialequation in Rd given by

x(t) = h(x(t)), x(0) = x. (11.2.1)

This is an autonomous o.d.e. because the the driving vector field h does nothave an explicit time-dependence. It would be non-autonomous if we replaceh(x(t)) on the right by h(x(t), t). We shall say that (11.2.1) is well-posed iffor any choice of the initial condition x ∈ Rd, it has a unique solution x(·)defined for all t ≥ 0 and the map x → the corresponding x(·) ∈ C([0,∞);Rd)is continuous. One sufficient condition for this is the Lipschitz condition onh(·): there exists L > 0 such that

||h(x)− h(y)|| ≤ L||x− y||, ∀x, y ∈ Rd.

Theorem 5. For h satisfying the Lipschitz condition, (11.2.1) is well-posed.

We shall sketch a proof of this to illustrate the application of the Gronwallinequality stated below:

Lemma 6. (Gronwall inequality) For continuous u(·), v(·) ≥ 0 and scalarsC, K, T ≥ 0,

u(t) ≤ C + K

∫ t

0

u(s)v(s)ds ∀t ∈ [0, T ], (11.2.2)

implies

u(t) ≤ CeK∫ T0 v(s)ds, t ∈ [0, T ].

Page 153: Stochastic Approximation: A Dynamical Systems Viewpoint

144 Appendices

Proof. Let s(t) def=∫ t

0u(s)v(s)ds, t ∈ [0, T ]. Multiplying (11.2.2) on both sides

by v(t), it translates into

s(t) ≤ Cv(t) + Ks(t)v(t).

This leads to

e−K∫ t0 v(s)ds(s(t)−Kv(t)s(t)) =

d

dt

(e−K

∫ t0 v(s)dss(t)

)

≤ Ce−K∫ t0 v(s)dsv(t).

Integrating from 0 to t and using the fact that s(0) = 0, we have

e−K∫ t0 v(s)dss(t) ≤ C

K(1− e−K

∫ t0 v(s)ds).

Thus

s(t) ≤ C

K(eK

∫ t0 v(s)ds − 1).

Hence

u(t) ≤ C + Ks(t)

≤ C + K

(C

K(eK

∫ t0 v(s)ds − 1)

)

= CeK∫ t0 v(s)ds.

The claim follows for t ∈ [0, T ]. ¥

The most commonly used situation is v(·) ≡ 1, when this inequality reducesto

u(t) ≤ CeKt.

We return to the proof of Theorem 5.

Proof. Define the map F : y(·) ∈ C([0, T ];Rd) → z(·) ∈ C([0, T ];Rd) by

z(t) = x +∫ t

0

h(y(s))ds, t ∈ [0, T ].

Clearly, x(·) is a solution of (11.2.1) on [0, T ] if and only if it is a fixed pointof F . Let zi(·) = F (yi(·)) for i = 1, 2. Denoting by || · ||T the sup-norm onC([0, T ];Rd), we have

||z1(·)− z2(·)||T ≤∫ T

0

||h(y1(s))− h(y2(s))||ds

≤ L

∫ T

0

||y1(s)− y2(s)||ds

≤ LT ||y1(·)− y2(·)||T .

Page 154: Stochastic Approximation: A Dynamical Systems Viewpoint

11.2 Appendix B: Ordinary differential equations 145

Taking T < 1/L, it follows that F is a contraction and thus has a uniquefixed point by the contraction mapping theorem of Appendix A. Existence anduniqueness of a solution to (11.2.1) on [0, T ] follows. The argument may berepeated for [T, 2T ], [2T, 3T ] and so forth in order to extend the claim to ageneral T > 0. Next, let xi(·), i = 1, 2, be solutions to (11.2.1) correspondingto x = x1, x2, respectively. Then

||x1(t)− x2(t)|| ≤ ||x1 − x2||+ L

∫ t

0

||x1(s)− x2(s)||ds

for t ∈ [0, T ]. By Lemma 6,

||x1(·)− x2(·)||T ≤ eLT ||x1 − x2||T ,

implying that the map x ∈ Rd → x(·)|[0,T ] ∈ C([0, T ];Rd) is Lipschitz, inparticular, continuous. Since T > 0 was arbitrary, it follows that the mapx ∈ Rd → C([0,∞);Rd) is continuous. ¥

Since the continuous image of a compact set is compact, we have:

Corollary 7. The solution set of (11.2.1) as x varies over a compact subsetof Rd is compact in C([0,∞);Rd).

A similar argument works if we consider (11.2.1) with t ≤ 0: one simply hasto work with intervals [−T, 0] in place of [0, T ]. Thus for each t ∈ R, there is acontinuous map Ψt : Rd →Rd that takes x to x(t) via (11.2.1). It follows fromthe uniqueness claim above that Ψt, Ψ−t are inverses of each other and thuseach Ψt is a homeomorphism, i.e., a continuous bijection with a continuousinverse. The family Ψt, t ∈ R, defines a flow of homeomorphisms Rd → Rd,i.e., it satisfies:

(i) Ψ0 = the identity map,(ii) Ψs Ψt = Ψt Ψs = Ψs+t,

where ‘’ stands for composition of functions.More generally, we may assume h to be only locally Lipschitz, i.e.,

||h(x)− h(y)|| ≤ LR||x− y||, ∀x, y ∈ BRdef= z ∈ Rd : ||z|| ≤ R,

for some LR > 0 that may tend to ∞ as R →∞. Then the claims of Theorem5 can be shown to hold locally in space and time, i.e., in a neighbourhood ofx and for t ∈ [0, T ] for T > 0 sufficiently small. Suppose in addition one canshow separately that the trajectory is well-defined for all t ≥ 0, i.e., there isno ‘finite time blow-up’ (meaning that limt↑t ||x(t)|| = ∞ for some t < ∞) forany initial condition. Then the full statement of Theorem 5 may be recovered.One way of ensuring no finite time blow-up is by demonstrating a convenient‘Liapunov function’ – see section 11.3.

Page 155: Stochastic Approximation: A Dynamical Systems Viewpoint

146 Appendices

We close this section with a discrete counterpart of the Gronwall inequality.While it is not used much in the o.d.e. context, it is extremely useful otherwiseand has been used extensively in this book itself.

Lemma 8. (Discrete Gronwall inequality) Let xn, n ≥ 0 (resp. an, n ≥0) be nonnegative (resp. positive) sequences and C,L ≥ 0 scalars such that forall n,

xn+1 ≤ C + L(n∑

m=0

amxm). (11.2.3)

Then for Tn =∑n

m=0 am,

xn+1 ≤ CeLTn .

Proof. Let sn =∑n

m=0 aixi. Multiplying (11.2.3) on both sides by an+1 leadsto

sn+1 − sn ≤ Can+1 + Lsnan+1.

That is,

sn+1 ≤ Can+1 + sn(1 + Lan+1).

Iterating (with x0 ≤ C by convention: replace C by x0∨C otherwise) we obtain

sn ≤ C

n∑

k=0

Πnm=k+1(1 + Lam)ak

≤ C

∫ Tn

0

eL(Tn−s)ds

=C

L

(eLTn − 1

).

(Here by convention, Πnm=n+1(1 + Lam) = 1.) Thus

xn+1 ≤ C + Lsn ≤ C + L× C

L

(eLTn − 1

)= CeLTn .

¥

11.2.2 Linear systems

A special and important class of differential equations is that of linear systems,i.e., the equations

x(t) = A(t)x(t), t ≥ t0, x(t0) = x, (11.2.4)

where A(·) is an Rd×d-valued continuous function of time. Although the right-hand side is now time-dependent, similar arguments to those of the precedingsection establish its existence, uniqueness and continuous dependence on initial

Page 156: Stochastic Approximation: A Dynamical Systems Viewpoint

11.2 Appendix B: Ordinary differential equations 147

condition x and initial time t0. It is easily seen that a linear combination ofsolutions of (11.2.4) will also be a solution for the corresponding linear combi-nation of the initial conditions. Thus for given t0, all solutions can be specifiedas linear combinations of solutions corresponding to x = ej , 1 ≤ j ≤ d, where

ejdef= the unit vector in the jth coordinate direction. Let Φ(t, t0) denote the

d×d matrix whose ith column is x(t) corresponding to x(t0) = ei for 1 ≤ i ≤ d.Then we have:

(i) Φ(t, t) = Id, the d-dimensional identity matrix,(ii) Φ(t, s)Φ(s, u) = Φ(t, u) for s, t, u ∈ R,(iii) for x(t0) = x in (11.2.4), x(t) = Φ(t, t0)x.

For constant A(·) ≡ A, Φ(t, t0) = exp(A(t− t0)), where the matrix exponentialis defined by

eAt def=∞∑

m=0

(A)mtm

m!.

An important instance of a linear system that we shall encounter is thefollowing: Suppose h(·) in (11.2.1) is continuously differentiable with Dh(x)denoting its Jacobian matrix evaluated at x. Then it can be shown that themap x → x(t) for each t is continuously differentiable. If we denote by Dx(t)its Jacobian matrix, then Dx(·) can be shown to satisfy the (matrix) linearsystem

d

dtDx(t) = Dh(x(t))Dx(t), Dx(0) = Id.

Formally, this may be derived simply by differentiating (11.2.1) on both sidesw.r.t. the components of x. In this case, Ψt defined above is a flow of C1-diffeomorphisms, i.e., continuously differentiable bijections with continuouslydifferentiable inverses. This can be repeated for higher derivatives if sufficientregularity of h is available.

11.2.3 Asymptotic behaviour

Given a trajectory $x(\cdot)$ of (11.2.1), the set $\Omega \stackrel{\text{def}}{=} \bigcap_{t>0} \overline{\{x(s) : s > t\}}$, i.e., the set of its limit points as $t \to \infty$, is called its ω-limit set. (A similar definition for $t \to -\infty$ defines the 'α-limit set'.) In general this set depends upon the initial condition $x$. Recall that a set $A$ is positively (resp. negatively) invariant for (11.2.1) if $x \in A$ implies that the corresponding $x(t)$ given by (11.2.1) is also in $A$ for $t > 0$ (resp. $t < 0$), and is invariant if it is both positively and negatively invariant. It is easy to verify that $\Omega$ is invariant. If $\Omega = \{x^*\}$, then $x(t) \equiv x^*$ must be a trajectory of the o.d.e., whence $h(x^*) = 0$. Conversely, $h(x^*) = 0$ implies that $x(t) \equiv x^*$ defines a trajectory of the o.d.e., corresponding to $x = x^*$ in (11.2.1). Such $x^*$ are called equilibrium points of the o.d.e.

A compact (more generally, closed) invariant set $M$ will be called an attractor if it has an open neighbourhood $O$ such that every trajectory in $O$ remains in $O$ and converges to $M$. The largest such $O$ is called the domain of attraction of $M$. A compact invariant set $M$ will be said to be Liapunov stable if for any $\varepsilon > 0$ there exists a $\delta > 0$ such that every trajectory initiated in the δ-neighbourhood of $M$ remains in its ε-neighbourhood. A compact invariant set $M$ is said to be asymptotically stable if it is both Liapunov stable and an attractor. If $M = \{x^*\}$, the equilibrium point $x^*$ is said to be asymptotically stable. One criterion for verifying asymptotic stability of $x^*$ is 'Liapunov's second method': Suppose one can find a continuously differentiable function $V(\cdot)$ defined on a neighbourhood $O$ of $x^*$ such that $\langle \nabla V(x), h(x) \rangle < 0$ for $x^* \neq x \in O$, $= 0$ for $x = x^*$, and $V(x) \to \infty$ as $x \to \partial O$ ($\stackrel{\text{def}}{=}$ the boundary of $O$). Then asymptotic stability of $x^*$ follows from the observation that for any trajectory $x(\cdot)$ in $O$, $\frac{d}{dt} V(x(t)) \leq 0$, with equality only for $x(t) = x^*$. Conversely, asymptotic stability of $x^*$ implies the existence of such a function (see, e.g., Krasovskii, 1963). This also generalizes to compact invariant sets $M$ that are asymptotically stable.

If $x^*$ is asymptotically stable and all trajectories of the o.d.e. converge to it, it is said to be globally asymptotically stable. In this case, $O$ above may be taken to be the whole space. More generally, if one has a continuously differentiable $V : \mathbb{R}^d \to \mathbb{R}$ with $V(x) \to \infty$ as $\|x\| \to \infty$ and $\langle \nabla V(x), h(x) \rangle \leq 0$ for all $x$, then any trajectory $x(\cdot)$ must converge to the largest invariant set contained in $\{x : \langle \nabla V(x), h(x) \rangle = 0\}$. This is known as the LaSalle invariance principle.
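As an illustration of Liapunov's second method, one can test the sign of $\langle \nabla V(x), h(x) \rangle$ numerically. The sketch below uses an example vector field $h$ and candidate $V$ chosen purely for illustration; such a test suggests, but of course does not prove, stability:

```python
# Liapunov-style check: for h(x) = -x + 0.5*sin(x) (componentwise) and
# V(x) = ||x||^2 / 2 one has <grad V(x), h(x)> < 0 away from the origin;
# here we verify this on random test points. All choices are examples.
import numpy as np

h = lambda x: -x + 0.5 * np.sin(x)          # example vector field
gradV = lambda x: x                         # gradient of V(x) = ||x||^2 / 2

rng = np.random.default_rng(1)
X = rng.uniform(-10, 10, size=(10000, 3))   # random test points in R^3
vals = np.einsum('ij,ij->i', gradV(X), h(X))
print(vals.max())                           # negative on all test points
```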

Not every equilibrium point need be asymptotically stable. This fact is best illustrated by the constant coefficient linear system
$$\dot{x}(t) = A x(t), \tag{11.2.5}$$
where $A$ is a $d \times d$ matrix. We shall consider the case where all eigenvalues of $A$ have nonzero real parts. This situation is 'structurally stable', i.e., invariant under small perturbations of $A$. In particular, $A$ is nonsingular and thus the origin is the only equilibrium point. One can explicitly solve (11.2.5) as $x(t) = \exp(At)x(0)$. If all eigenvalues of $A$ have strictly negative real parts, then $x(t) \to 0$ exponentially fast. If not, it will do so only for those $x(0)$ that lie in the 'stable subspace', i.e., the eigenspace of those eigenvalues (if any) which have strictly negative real parts. For any other initial condition, i.e., 'generically' (meaning 'for initial conditions belonging to an open dense set'), the trajectory eventually moves away from the origin. This is because the stable subspace has codimension at least one by hypothesis, and hence its complement is dense.

More generally, if $h$ in (11.2.1) is continuously differentiable with Jacobian matrix $Dh(x^*)$ at an equilibrium point $x^*$, we may compare it with the linear system
$$\frac{d}{dt}\left(y(t) - x^*\right) = Dh(x^*)\left(y(t) - x^*\right), \tag{11.2.6}$$
in a neighbourhood of $x^*$. If the eigenvalues of $Dh(x^*)$ have nonzero real parts, $x^*$ is said to be a hyperbolic equilibrium point. Note that $x^*$ is the unique equilibrium point for (11.2.6) in this case. It is known that in a small neighbourhood of a hyperbolic $x^*$, there exists a homeomorphism that maps trajectories of (11.2.1) and those of (11.2.6) into each other, preserving orientation. Thus their qualitative behaviour is the same. (This is the Hartman–Grobman theorem.) Thus $x^*$ is asymptotically stable for (11.2.1) if it is so for (11.2.6), i.e., if all eigenvalues of $Dh(x^*)$ have negative real parts. If $x^*$ is not asymptotically stable for (11.2.1), then in a small neighbourhood of $x^*$ there exists a 'stable manifold' of dimension equal to the number of eigenvalues (if any) with negative real parts, such that $x(t) \to x^*$ if $x(0)$ lies on this manifold, and not otherwise.
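The linearization test is straightforward to carry out numerically; in the following sketch both the vector field $h$ and the finite-difference Jacobian are illustrative choices:

```python
# Sketch of the linearization test at a hyperbolic equilibrium: for an
# example h with equilibrium x* = 0, check that all eigenvalues of Dh(x*)
# have strictly negative real parts (Jacobian via finite differences).
import numpy as np

h = lambda x: np.array([-x[0] + x[1]**2, -2.0 * x[1] + x[0] * x[1]])
x_star = np.zeros(2)                        # equilibrium: h(x*) = 0

eps = 1e-6
Dh = np.column_stack([(h(x_star + eps * e) - h(x_star - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
eigs = np.linalg.eigvals(Dh)
print(eigs)                                  # approximately [-1, -2]
print(np.all(eigs.real < 0))                 # True: x* asymptotically stable
```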

Finally, we shall say that a probability measure $\mu$ on $\mathbb{R}^d$ is invariant under the flow $\Psi_t$ defined above if
$$\int f\, d\mu = \int f \circ \Psi_t\, d\mu \quad \forall\, f \in C_b(\mathbb{R}^d) \text{ and } t \geq 0.$$
Define empirical measures $\nu(t)$, $t \geq 0$, by
$$\int f\, d\nu(t) \stackrel{\text{def}}{=} \frac{1}{t} \int_0^t f(x(s))\, ds, \quad f \in C_b(\mathbb{R}^d),\ t \geq 0,$$
for $x(\cdot)$ as in (11.2.1). If $x(t)$ remains bounded as $t \uparrow \infty$, then the $\nu(t)$ are supported on a compact set and hence are relatively compact in the space $\mathcal{P}(\mathbb{R}^d)$ of probability measures on $\mathbb{R}^d$ introduced in Appendix C. (This is a consequence of Prohorov's theorem mentioned in Appendix C.) Then every limit point of $\nu(t)$ as $t \to \infty$ is invariant under $\Psi_t$, as can easily be verified.
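For a concrete illustration (the o.d.e. and test function below are arbitrary choices), the empirical measures of a trajectory of $\dot{x} = -x$ converge to the Dirac measure at the globally asymptotically stable equilibrium $0$:

```python
# Empirical measures nu(t) for the example o.d.e. xdot = -x: time
# averages of a bounded continuous f converge to f(0), i.e. nu(t)
# converges to the (invariant) Dirac measure at the equilibrium 0.
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: np.cos(x)                      # bounded continuous test f
sol = solve_ivp(lambda t, x: -x, (0, 200.0), [3.0], dense_output=True)

for t in (1.0, 10.0, 100.0, 200.0):
    s = np.linspace(0, t, 20000)
    avg = f(sol.sol(s)[0]).mean()            # approximates int f d nu(t)
    print(t, avg)                            # tends to f(0) = 1
```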

11.3 Appendix C: Topics in probability

11.3.1 Martingales

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{\mathcal{F}_n\}$ an increasing family of sub-σ-fields of $\mathcal{F}$. A real-valued random process $\{X_n\}$ defined on this probability space is said to be a martingale w.r.t. the family $\{\mathcal{F}_n\}$ (or an $\mathcal{F}_n$-martingale) if it is integrable and

(i) $X_n$ is $\mathcal{F}_n$-measurable for all $n$, and
(ii) $E[X_{n+1} \mid \mathcal{F}_n] = X_n$ a.s. for all $n$.


Alternatively, one says that $(X_n, \mathcal{F}_n)_{n \geq 0}$ is a martingale. The sequence $M_n = X_n - X_{n-1}$ is then called a martingale difference sequence. The reference to $\{\mathcal{F}_n\}$ is often dropped when it is clear from the context. A sequence of $\mathbb{R}^d$-valued random variables is said to be a (vector) martingale if each of its component processes is one. There is a very rich theory of martingales and related processes such as submartingales (in which '=' is replaced by '≥' in (ii) above), supermartingales (in which '=' is replaced by '≤' in (ii) above), 'almost supermartingales', and so on. (In fact, the purely probabilistic approach to stochastic approximation relies heavily on these.) We shall confine ourselves to listing a few key facts that have been used in this book. For more, the reader may refer to Borkar (1995), Breiman (1968), Neveu (1975) or Williams (1991). The results presented here for which no specific reference is given will be found in particular in Borkar (1995). Throughout what follows, $\{X_n\}$, $\{\mathcal{F}_n\}$ are as above.

(i) A decomposition theorem

Theorem 9. Let $\{M_n\}$ be a $d$-dimensional $\mathcal{F}_n$-martingale such that $E[\|M_n\|^2] < \infty$ for all $n$. Then there exists an $\mathbb{R}^{d \times d}$-valued process $\{\Gamma_n\}$ such that $\Gamma_n$ is $\mathcal{F}_{n-1}$-measurable for all $n$ and $\{M_n M_n^T - \Gamma_n\}$ is an $\mathbb{R}^{d \times d}$-valued $\mathcal{F}_n$-martingale.

This theorem is just a special case of the Doob decomposition. In fact, it is easy to see that for $M_n = [M_n(1), \ldots, M_n(d)]^T$, one has $\Gamma_n = [[\Gamma_n(i, j)]]_{1 \leq i,j \leq d}$, with
$$\Gamma_n(i, j) = \sum_{m=1}^{n} E\left[M_m(i) M_m(j) - M_{m-1}(i) M_{m-1}(j) \mid \mathcal{F}_{m-1}\right]$$
for $1 \leq i, j \leq d$.
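A small simulation (with illustrative choices throughout) makes Theorem 9 concrete in the scalar case:

```python
# Theorem 9 in d = 1: for a martingale M_n with i.i.d. Rademacher
# increments, Gamma_n = n, and M_n^2 - Gamma_n is again a martingale,
# so its sample mean across paths stays small for every n.
import numpy as np

rng = np.random.default_rng(2)
paths, N = 200000, 50
xi = rng.choice([-1.0, 1.0], size=(paths, N))   # martingale differences
M = np.cumsum(xi, axis=1)                       # M_n, n = 1, ..., N
Gamma = np.arange(1, N + 1)                     # Gamma_n = n here
print(np.abs((M**2 - Gamma).mean(axis=0)).max())  # small relative to Gamma_n
```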

(ii) Convergence theorems

Theorem 10. If $\sup_n E[X_n^+] < \infty$, then $\{X_n\}$ converges a.s.

Theorem 11. If $E[X_n^2] < \infty$ for all $n$, then $\{X_n\}$ converges a.s. on the set $\{\sum_n E[(X_{n+1} - X_n)^2 \mid \mathcal{F}_n] < \infty\}$, and is $o\left(\sum_{m=1}^{n-1} E[(X_{m+1} - X_m)^2 \mid \mathcal{F}_m]\right)$ a.s. on the set $\{\sum_n E[(X_{n+1} - X_n)^2 \mid \mathcal{F}_n] = \infty\}$.

The following result is sometimes useful:

Theorem 12. If $E[\sup_n |X_{n+1} - X_n|] < \infty$, then
$$P\left(\{X_n \text{ converges}\} \cup \left\{\limsup_{n \to \infty} X_n = -\liminf_{n \to \infty} X_n = \infty\right\}\right) = 1.$$


(iii) Inequalities: Let $E[|X_n|^p] < \infty$ for all $n$, for some $p \in (1, \infty)$. Suppose $X_0 = 0$, implying in particular that $E[X_n] = 0$ for all $n$.

Theorem 13. (Burkholder inequality) There exist constants $c, C > 0$ depending on $p$ alone such that for all $n = 1, 2, \ldots, \infty$,
$$c\, E\left[\left(\sum_{m=1}^{n} (X_m - X_{m-1})^2\right)^{\frac{p}{2}}\right] \leq E\left[\sup_{m \leq n} |X_m|^p\right] \leq C\, E\left[\left(\sum_{m=1}^{n} (X_m - X_{m-1})^2\right)^{\frac{p}{2}}\right].$$

Theorem 14. (Concentration inequality) Suppose that
$$|X_n - X_{n-1}| \leq k_n < \infty$$
for some deterministic constants $\{k_n\}$. Then for $\lambda > 0$,
$$P\left(\sup_{m \leq n} |X_m| > \lambda\right) \leq 2\, e^{-\frac{\lambda^2}{2 \sum_{m \leq n} k_m^2}}.$$
See McDiarmid (1998, p. 227) for a proof of this result.
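The following Monte Carlo sketch (sample sizes and $\lambda$ are arbitrary choices) compares the empirical tail of a Rademacher martingale with the bound of Theorem 14:

```python
# Concentration check for a Rademacher martingale: |X_m - X_{m-1}| <= 1,
# so k_m = 1 and the sum of k_m^2 equals n.
import numpy as np

rng = np.random.default_rng(3)
paths, n, lam = 100000, 100, 25.0
X = np.cumsum(rng.choice([-1.0, 1.0], size=(paths, n)), axis=1)
tail = (np.abs(X).max(axis=1) > lam).mean()     # P(sup_m |X_m| > lam)
bound = 2 * np.exp(-lam**2 / (2 * n))           # Theorem 14 bound
print(tail, bound)                              # empirical tail <= bound
```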

(iv) Central limit theorem: We shall state a more general 'central limit theorem for vector martingale arrays'. For $n \geq 1$, let $(M_m^n, \mathcal{F}_m^n)$, $m \geq 0$, be $\mathbb{R}^d$-valued vector martingales with $E[\|M_m^n\|^2] < \infty$ for all $m, n$. Define $\Gamma_m^n$ as above (i.e., $\Gamma_m^n$ is $\mathcal{F}_{m-1}^n$-measurable for all $n, m$, and $M_m^n (M_m^n)^T - \Gamma_m^n$, $m \geq 1$, is an $\mathcal{F}_m^n$-martingale for each $n$). Recall that a stopping time $\tau$ with respect to an increasing family of σ-fields $\{\mathcal{F}_n\}$ is a random variable taking values in $\{0, 1, \ldots, \infty\}$ such that for all $n$ in this set, $\{\tau \leq n\}$ is $\mathcal{F}_n$-measurable (with $\mathcal{F}_\infty \stackrel{\text{def}}{=} \vee_n \mathcal{F}_n$).

Theorem 15. (Central limit theorem) Suppose that there exist $\mathcal{F}_m^n$-stopping times $\tau_n$ for $n \geq 1$ such that $\tau_n \uparrow \infty$ a.s. and:

(a) for some symmetric positive definite $\Gamma \in \mathbb{R}^{d \times d}$, $\Gamma_{\tau_n}^n \to \Gamma$ in probability, and
(b) for any $\varepsilon > 0$,
$$\sum_{m=1}^{\tau_n} E\left[\|M_m^n - M_{m-1}^n\|^2\, I\{\|M_m^n - M_{m-1}^n\| > \varepsilon\} \mid \mathcal{F}_{m-1}^n\right] \to 0$$
in probability.

Then $M_{\tau_n}^n$ converges in law to the $d$-dimensional Gaussian measure with zero mean and covariance matrix $\Gamma$.

See Hall and Heyde (1980) for this and related results. (The vector case stated here follows from the scalar case by considering arbitrary one-dimensional projections.)


11.3.2 Spaces of probability measures

Let $S$ be a metric space with a complete metric $d(\cdot, \cdot)$. Endow $S$ with its Borel σ-field, i.e., the σ-field generated by the open $d$-balls. Assume also that $S$ is separable, i.e., it has a countable dense subset $\{s_n\}$. Let $\mathcal{P}(S)$ denote the space of probability measures on $S$. $\mathcal{P}(S)$ may be metrized with the metric
$$\rho(\mu, \nu) \stackrel{\text{def}}{=} \inf E[d(X, Y) \wedge 1],$$
where the infimum is over all pairs of $S$-valued random variables $X, Y$ such that the law of $X$ is $\mu$ and the law of $Y$ is $\nu$. This metric can be shown to be complete. Let $\delta_x$ denote the Dirac measure at $x \in S$, i.e., $\delta_x(A) = 0$ or $1$ depending on whether $x \notin A$ or $x \in A$, for $A$ Borel in $S$. Also, let $\{s_i, i \geq 1\}$ denote a prescribed countable dense subset of $S$. Then the measures $\mu \in \mathcal{P}(S)$ of the form
$$\mu = \sum_{k=1}^{m} a_k \delta_{x_k},$$
for some $m \geq 1$, $a_1, \ldots, a_m$ rational in $[0, 1]$ with $\sum_i a_i = 1$, and $\{x_i, 1 \leq i \leq m\} \subset \{s_i, i \geq 1\}$, are countable dense in $\mathcal{P}(S)$. Hence $\mathcal{P}(S)$ is separable. The following theorem gives several equivalent formulations of convergence in $\mathcal{P}(S)$:

Theorem 16. The following are equivalent:

(i) $\rho(\mu_n, \mu) \to 0$.
(ii) For all $f \in C_b(S)$, $\int f\, d\mu_n \to \int f\, d\mu$.
(iii) For all $f \in C_b(S)$ that are uniformly continuous w.r.t. some compatible metric on $S$, $\int f\, d\mu_n \to \int f\, d\mu$.
(iv) For all open $G \subset S$, $\liminf_{n \to \infty} \mu_n(G) \geq \mu(G)$.
(v) For all closed $F \subset S$, $\limsup_{n \to \infty} \mu_n(F) \leq \mu(F)$.
(vi) For all $A \subset S$ satisfying $\mu(\partial A) = 0$, $\mu_n(A) \to \mu(A)$.
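For a concrete instance of (ii) (the measures and test function below are illustrative choices), Binomial($n$, $\lambda/n$) converges weakly to Poisson($\lambda$):

```python
# Weak convergence check: expectations of a bounded continuous f under
# Binomial(n, lambda/n) approach the expectation under Poisson(lambda).
import numpy as np
from scipy import stats

lam = 3.0
f = lambda k: np.cos(k)                      # bounded continuous test f
k = np.arange(0, 200)                        # truncation; tail negligible
target = np.sum(f(k) * stats.poisson.pmf(k, lam))
for n in (10, 100, 1000):
    approx = np.sum(f(k) * stats.binom.pmf(k, n, lam / n))
    print(n, approx, target)                 # approx -> target
```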

In fact, there exists a countable set $\{f_i, i \geq 1\} \subset C_b(S)$ such that $\rho(\mu_n, \mu) \to 0$ if and only if $\int f_i\, d\mu_n \to \int f_i\, d\mu$ for all $i$. Such a set is known as a convergence determining class. If $S$ is compact, $C(S)$ is separable and any countable dense set in its unit ball will do. For non-compact $S$, embed it densely and homeomorphically into a compact subset $\bar{S}$ of $[0, 1]^\infty$, consider a countable subset of $C(\bar{S})$, and restrict it to $S$ (see Borkar, 1995, Chapter 2).

Relative compactness in $\mathcal{P}(S)$ is characterized by the following theorem. Say that $A \subset \mathcal{P}(S)$ is a tight set if for any $\varepsilon > 0$, there exists a compact $K_\varepsilon \subset S$ such that
$$\mu(K_\varepsilon) > 1 - \varepsilon \quad \forall\, \mu \in A.$$
By a result of Oxtoby and Ulam, every singleton in $\mathcal{P}(S)$ is tight (see, e.g., Borkar (1995), p. 4).


Theorem 17. (Prohorov) $A \subset \mathcal{P}(S)$ is relatively compact if and only if it is tight.

The following theorem is extremely important:

Theorem 18. (Skorohod) If $\mu_n \to \mu_\infty$ in $\mathcal{P}(S)$, then on some probability space there exist random variables $X_n$, $n = 1, 2, \ldots, \infty$, such that the law of $X_n$ is $\mu_n$ for each $n$, $1 \leq n \leq \infty$, and $X_n \to X_\infty$ a.s.

A stronger convergence notion than convergence in $\mathcal{P}(S)$ is that of convergence in total variation. We say that $\mu_n \to \mu$ in total variation if
$$\sup \left| \int f\, d\mu_n - \int f\, d\mu \right| \to 0,$$
where the supremum is over all $f \in C_b(S)$ with $\sup_x |f(x)| \leq 1$. This in turn allows us to write $\int f\, d\mu_n \to \int f\, d\mu$ for bounded measurable $f : S \to \mathbb{R}$. The next theorem gives a useful test for convergence in total variation. We shall say that $\mu \in \mathcal{P}(S)$ is absolutely continuous with respect to a positive, not necessarily finite, measure $\lambda$ on $S$ if $\lambda(B) = 0$ implies $\mu(B) = 0$ for any Borel $B \subset S$. It is known that in this case there exists a measurable $\Lambda : S \to \mathbb{R}$ such that $\int f\, d\mu = \int f \Lambda\, d\lambda$ for all $\mu$-integrable $f : S \to \mathbb{R}$. This is the Radon–Nikodym theorem of measure theory, and $\Lambda(\cdot)$ is called the Radon–Nikodym derivative of $\mu$ w.r.t. $\lambda$. For example, the familiar probability density is the Radon–Nikodym derivative of the corresponding probability measure w.r.t. the Lebesgue measure. The likelihood ratio in statistics is another example of a Radon–Nikodym derivative.

Theorem 19. (Scheffe) Suppose $\mu_n$, $n = 1, 2, \ldots, \infty$, are absolutely continuous w.r.t. a positive measure $\lambda$ on $S$, with $\Lambda_n$ the corresponding Radon–Nikodym derivatives. If $\Lambda_n \to \Lambda_\infty$ $\lambda$-a.s., then $\mu_n \to \mu_\infty$ in total variation.
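As an illustration of Scheffe's theorem (the Gaussian family below is an arbitrary choice), the total variation distance, which for densities equals the $L^1$ distance of the Radon–Nikodym derivatives, can be computed numerically:

```python
# The N(0, 1 + 1/n) densities converge pointwise to the N(0, 1) density,
# so the L1 distance between the densities (the total variation distance
# in the sense above) tends to 0.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

x = np.linspace(-12, 12, 100001)
p_inf = stats.norm.pdf(x)                           # limit density
for n in (1, 10, 100, 1000):
    p_n = stats.norm.pdf(x, scale=np.sqrt(1 + 1 / n))
    print(n, trapezoid(np.abs(p_n - p_inf), x))     # tends to 0
```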

11.3.3 Stochastic differential equations

As this topic has been used only nominally, and that too in only one place, viz., Chapter 8, we shall give only the barest facts. The interested reader can find much more in standard texts such as Oksendal (2005). Consider a probability space $(\Omega, \mathcal{F}, P)$ with a family $\{\mathcal{F}_t, t \geq 0\}$ of sub-σ-fields of $\mathcal{F}$ satisfying:

(i) it is increasing, i.e., $\mathcal{F}_s \subset \mathcal{F}_t$ for all $t > s$,
(ii) it is right continuous, i.e., $\mathcal{F}_t = \cap_{s > t} \mathcal{F}_s$ for all $t$,
(iii) it is complete, i.e., each $\mathcal{F}_t$ contains all zero probability sets in $\mathcal{F}$ and their subsets.

A measurable stochastic process $\{Z_t\}$ on this probability space is said to be adapted to $\{\mathcal{F}_t\}$ if $Z_t$ is $\mathcal{F}_t$-measurable for all $t \geq 0$. A $d$-dimensional Brownian motion $W(\cdot)$ defined on $(\Omega, \mathcal{F}, P)$ is said to be an $\mathcal{F}_t$-Wiener process in $\mathbb{R}^d$ if it is adapted to $\{\mathcal{F}_t\}$ and for each $t \geq 0$, $W(t + \cdot) - W(t)$ is independent of $\mathcal{F}_t$. Let $\{\xi_t\}$ be an $\mathbb{R}^d$-valued process satisfying:

(i) it is adapted to $\{\mathcal{F}_t\}$,
(ii) $E\left[\int_0^t \|\xi_s\|^2\, ds\right] < \infty$ for all $t > 0$,
(iii) there exist $0 = t_0 < t_1 < t_2 < \cdots$ with $t_i \uparrow \infty$ such that $\xi_t = \zeta_i$ for $t \in [t_i, t_{i+1})$, where $\zeta_i$ is some $\mathcal{F}_{t_i}$-measurable random variable (i.e., $\{\xi_t\}$ is a piecewise constant adapted process).

Define the stochastic integral of $\{\xi_t\}$ with respect to the $\mathcal{F}_t$-Wiener process $W(\cdot)$ by
$$\int_0^t \langle \xi_s, dW(s) \rangle \stackrel{\text{def}}{=} \sum_{0 < i \leq i^*(t)} \langle \zeta_{i-1}, W(t_i) - W(t_{i-1}) \rangle + \langle \zeta_{i^*(t)}, W(t) - W(t_{i^*(t)}) \rangle,$$
where $i^*(t)$ is the unique integer $i \geq 0$ such that $t \in [t_{i^*(t)}, t_{i^*(t)+1})$. From the 'independent increments' property of Brownian motion, one can verify that
$$E\left[\left|\int_0^t \langle \xi_s, dW(s) \rangle\right|^2\right] = E\left[\int_0^t \|\xi_s\|^2\, ds\right] \quad \forall\, t > 0.$$
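This isometry is easy to check by Monte Carlo for a piecewise constant adapted integrand; everything in the sketch below (integrand, grid, sample sizes) is an illustrative choice:

```python
# Monte Carlo sketch of the isometry in d = 1: take the adapted,
# piecewise constant integrand xi_t = W(t_i) on [t_i, t_{i+1}) and
# estimate both sides of the isometry by sample averages.
import numpy as np

rng = np.random.default_rng(4)
paths, n, T = 200000, 100, 1.0
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.cumsum(dW, axis=1)
xi = np.concatenate([np.zeros((paths, 1)), W[:, :-1]], axis=1)  # adapted

integral = np.sum(xi * dW, axis=1)          # int_0^T <xi_s, dW(s)>
lhs = np.mean(integral**2)                  # E[ |integral|^2 ]
rhs = np.mean(np.sum(xi**2, axis=1) * dt)   # E[ int ||xi_s||^2 ds ]
print(lhs, rhs)                             # approximately equal (~ T^2/2)
```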

More generally, let $\{\xi_t\}$ be an $\mathbb{R}^d$-valued process satisfying (i)–(ii) above, and let $\{\xi_t^n\}$, $n \geq 1$, be a family of processes satisfying (i)–(iii) above such that
$$E\left[\int_0^t \|\xi_s - \xi_s^n\|^2\, ds\right] \to 0 \quad \forall\, t > 0. \tag{11.3.1}$$
Then, once again using the independent increments property of the Wiener process $W(\cdot)$, we have, in view of (11.3.1),
$$E\left[\left|\int_0^t \langle \xi_s^n, dW(s) \rangle - \int_0^t \langle \xi_s^m, dW(s) \rangle\right|^2\right] = E\left[\int_0^t \|\xi_s^n - \xi_s^m\|^2\, ds\right] \to 0 \quad \forall\, t > 0.$$
That is, the sequence of random variables $\int_0^t \langle \xi_s^n, dW(s) \rangle$, $n \geq 1$, is Cauchy in $L^2(\Omega, \mathcal{F}, P)$ and hence has a unique limit therein, which we denote by $\int_0^t \langle \xi_s, dW(s) \rangle$.

Next we argue that it is always possible to find such $\{\xi_t^n\}$, $n \geq 1$, for any $\{\xi_t\}$ satisfying (i)–(ii). Here is a sketch of what is involved: For a small $a > 0$, define $\xi_t^{(a)}$ by
$$\xi_t^{(a)} = \frac{1}{a \wedge t} \int_{(t-a) \vee 0}^{t} \xi_s\, ds, \quad t \geq 0,$$
which is adapted (i.e., $\mathcal{F}_t$-measurable for each $t$), has continuous paths, and approximates $\{\xi_s\}$ on $[0, t]$ in mean square to any desired accuracy for $a$ small enough. Now pick a 'grid' $\{t_i\}$ as above and define
$$\hat{\xi}_s = \xi_{t_i}^{(a)}, \quad s \in [t_i, t_{i+1}),\ i \geq 0.$$
This can approximate $\xi_s^{(a)}$ arbitrarily closely in mean square if the grid is taken to be sufficiently fine.

Although this construction was for a fixed $t > 0$, one can show that it is possible to construct this process so that $t \to \int_0^t \langle \xi_s, dW(s) \rangle$ is continuous in $t$ a.s. We call this the stochastic integral of $\{\xi_t\}$ w.r.t. $W(\cdot)$.

Let $m : \mathbb{R}^d \times [0, \infty) \to \mathbb{R}^d$ and $\sigma : \mathbb{R}^d \times [0, \infty) \to \mathbb{R}^{d \times d}$ be Lipschitz maps.

Consider the stochastic integral equation
$$X(t) = X_0 + \int_0^t m(X(s), s)\, ds + \int_0^t \langle \sigma(X(s), s), dW(s) \rangle, \quad t \geq 0, \tag{11.3.2}$$
where $X_0$ is an $\mathcal{F}_0$-measurable random variable. It is standard practice to call (11.3.2) a stochastic differential equation and write it as
$$dX(t) = m(X(t), t)\, dt + \sigma(X(t), t)\, dW(t), \quad X(0) = X_0.$$
It is possible to show that this has an a.s. unique solution $X(\cdot)$ on $(\Omega, \mathcal{F}, P)$ with continuous paths. (This is the so-called strong solution. There is also a notion of a weak solution, which we shall not concern ourselves with.) Clearly, the case of linear or constant $m(\cdot)$ and $\sigma(\cdot)$ is covered by this.

The equation
$$dX(t) = A(t)X(t)\, dt + D(t)\, dW(t)$$
with Gaussian $X(0) = X_0$ is a special case of the above. $X(\cdot)$ is then a Gaussian and Markov process. This equation can be explicitly 'integrated' as follows: Let $\Phi(t, t_0)$, $t \geq t_0$, be the unique solution to the linear matrix differential equation
$$\frac{d}{dt} \Phi(t, t_0) = A(t) \Phi(t, t_0), \quad t \geq t_0; \quad \Phi(t_0, t_0) = I_d,$$
where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. (See Appendix B. Recall in particular that for $A(\cdot) \equiv$ a constant matrix $A$, $\Phi(t, t_0) = \exp(A(t - t_0))$.) Then
$$X(t) = \Phi(t, 0) X_0 + \int_0^t \langle \Phi(t, s), D(s)\, dW(s) \rangle, \quad t \geq 0.$$
Both the Gaussian property (when $X_0$ is Gaussian) and the Markov property of $X(\cdot)$ can easily be deduced from this, using the Gaussian and independent increments properties of $W(\cdot)$.
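A minimal Euler–Maruyama sketch for the scalar constant-coefficient case (all parameters below are arbitrary illustrative choices) recovers the Gaussian statistics implied by the explicit formula:

```python
# Euler-Maruyama for dX = A X dt + D dW in d = 1 (an Ornstein-Uhlenbeck
# process). Sample mean/variance at time T are compared with the exact
# values mean = e^{AT} X0 and var = D^2 (1 - e^{2AT}) / (-2A).
import numpy as np

rng = np.random.default_rng(5)
A, D, X0, T, n, paths = -1.0, 0.5, 2.0, 1.0, 1000, 100000
dt = T / n
X = np.full(paths, X0)
for _ in range(n):
    X = X + A * X * dt + D * rng.normal(0.0, np.sqrt(dt), size=paths)

print(X.mean(), np.exp(A * T) * X0)                     # ~ 0.7358
print(X.var(), D**2 * (1 - np.exp(2 * A * T)) / (-2 * A))
```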


References

[1] ABOUNADI, J.; BERTSEKAS, D. P.; BORKAR, V. S. (2001) 'Learning algorithms for Markov decision processes with average cost', SIAM Journal on Control and Optimization 40, 681–698.
[2] ABOUNADI, J.; BERTSEKAS, D. P.; BORKAR, V. S. (2002) 'Stochastic approximation for nonexpansive maps: applications to Q-learning algorithms', SIAM Journal on Control and Optimization 41, 1–22.
[3] ANBAR, D. (1978) 'A stochastic Newton-Raphson method', Journal of Statistical Planning and Inference 2, 153–163.
[4] ARTHUR, W. B. (1994) Increasing Returns and Path Dependence in the Economy, Univ. of Michigan Press, Ann Arbor, Mich.
[5] ARTHUR, W. B.; ERMOLIEV, Y.; KANIOVSKI, Y. (1983) 'A generalized urn problem and its applications', Cybernetics 19, 61–71.
[6] AUBIN, J. P.; CELLINA, A. (1984) Differential Inclusions, Springer Verlag, Berlin.
[7] AUBIN, J. P.; FRANKOWSKA, H. (1990) Set-Valued Analysis, Birkhauser, Boston.
[8] BALAKRISHNAN, A. V. (1976) Applied Functional Analysis, Springer Verlag, New York.
[9] BARDI, M.; CAPUZZO-DOLCETTA, I. (1997) Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations, Birkhauser, Boston.
[10] BARTO, A.; SUTTON, R.; ANDERSON, C. (1983) 'Neuron-like elements that can solve difficult learning control problems', IEEE Transactions on Systems, Man and Cybernetics 13, 835–846.
[11] BENAIM, M. (1996) 'A dynamical system approach to stochastic approximation', SIAM Journal on Control and Optimization 34, 437–472.
[12] BENAIM, M. (1997) 'Vertex-reinforced random walks and a conjecture of Pemantle', Annals of Probability 25, 361–392.
[13] BENAIM, M. (1999) 'Dynamics of stochastic approximation algorithms', in Le Seminaire de Probabilites, J. Azema, M. Emery, M. Ledoux and M. Yor (eds.), Springer Lecture Notes in Mathematics No. 1709, Springer Verlag, Berlin-Heidelberg, 1–68.
[14] BENAIM, M.; HIRSCH, M. (1997) 'Stochastic adaptive behaviour for prisoner's dilemma', preprint.
[15] BENAIM, M.; HOFBAUER, J.; SORIN, S. (2005) 'Stochastic approximation and differential inclusions', SIAM Journal on Control and Optimization 44, 328–348.
[16] BENAIM, M.; SCHREIBER, S. (2000) 'Ergodic properties of weak asymptotic pseudotrajectories for semiflows', Journal of Dynamics and Differential Equations 12, 579–598.
[17] BENVENISTE, A.; METIVIER, M.; PRIOURET, P. (1990) Adaptive Algorithms and Stochastic Approximation, Springer Verlag, Berlin - New York.
[18] BERTSEKAS, D. P.; TSITSIKLIS, J. N. (1996) Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass.
[19] BHATNAGAR, S.; BORKAR, V. S. (1997) 'Multiscale stochastic approximation for parametric optimization of hidden Markov models', Probability in the Engineering and Informational Sciences 11, 509–522.
[20] BHATNAGAR, S.; BORKAR, V. S. (1998) 'A two time-scale stochastic approximation scheme for simulation-based parametric optimization', Probability in the Engineering and Informational Sciences 12, 519–531.
[21] BHATNAGAR, S.; FU, M. C.; MARCUS, S. I.; WANG, I.-J. (2003) 'Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences', ACM Transactions on Modelling and Computer Simulation 13, 180–209.
[22] BORDER, K. C. (1989) Fixed Point Theorems with Applications to Economics and Game Theory, Cambridge Univ. Press, Cambridge, UK.
[23] BORKAR, V. S. (1995) Probability Theory: An Advanced Course, Springer Verlag, New York.
[24] BORKAR, V. S. (1997) 'Stochastic approximation with two time scales', Systems and Control Letters 29, 291–294.
[25] BORKAR, V. S. (1998) 'Asynchronous stochastic approximation', SIAM Journal on Control and Optimization 36, 840–851 (Correction note in ibid., 38, 662–663).
[26] BORKAR, V. S. (2002) 'On the lock-in probability of stochastic approximation', Combinatorics, Probability and Computing 11, 11–20.
[27] BORKAR, V. S. (2003) 'Avoidance of traps in stochastic approximation', Systems and Control Letters 50, 1–9 (Correction note in ibid. (2006) 55, 174–175).
[28] BORKAR, V. S. (2005) 'An actor-critic algorithm for constrained Markov decision processes', Systems and Control Letters 54, 207–213.
[29] BORKAR, V. S. (2006) 'Stochastic approximation with "controlled Markov" noise', Systems and Control Letters 55, 139–145.
[30] BORKAR, V. S.; KUMAR, P. R. (2003) 'Dynamic Cesaro-Wardrop equilibration in networks', IEEE Transactions on Automatic Control 48, 382–396.
[31] BORKAR, V. S.; MANJUNATH, D. (2004) 'Charge based control of diffserve-like queues', Automatica 40, 2040–2057.
[32] BORKAR, V. S.; MEYN, S. P. (2000) 'The O.D.E. method for convergence of stochastic approximation and reinforcement learning', SIAM Journal on Control and Optimization 38, 447–469.
[33] BORKAR, V. S.; SOUMYANATH, K. (1997) 'A new analog parallel scheme for fixed point computation, Part 1: Theory', IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 44, 351–355.
[34] BRANDIERE, O. (1998) 'Some pathological traps for stochastic approximation', SIAM Journal on Control and Optimization 36, 1293–1314.
[35] BRANDIERE, O.; DUFLO, M. (1996) 'Les algorithmes stochastiques contournent-ils les pieges?' [Do stochastic algorithms avoid traps?], Annales de l'Institut Henri Poincare 32, 395–427.
[36] BREIMAN, L. (1968) Probability, Addison-Wesley, Reading, Mass.
[37] BROWN, G. (1951) 'Iterative solutions of games with fictitious play', in Activity Analysis of Production and Allocation, T. Koopmans (ed.), John Wiley, New York.
[38] CHEN, H.-F. (1994) 'Stochastic approximation and its new applications', in Proc. 1994 Hong Kong International Workshop on New Directions in Control and Manufacturing, 2–12.
[39] CHEN, H.-F. (2002) Stochastic Approximation and Its Applications, Kluwer Academic, Dordrecht, The Netherlands.
[40] CHOW, Y. S.; TEICHER, H. (2003) Probability Theory: Independence, Interchangeability, Martingales (3rd ed.), Springer Verlag, New York.
[41] CHUNG, K. L. (1954) 'On a stochastic approximation method', Annals of Mathematical Statistics 25, 463–483.
[42] CUCKER, F.; SMALE, S. (2007) 'Emergent behavior in flocks', IEEE Transactions on Automatic Control 52, 852–862.
[43] DEREVITSKII, D. P.; FRADKOV, A. L. (1974) 'Two models for analyzing the dynamics of adaptation algorithms', Automation and Remote Control 35, 59–67.
[44] DUFLO, M. (1996) Algorithmes Stochastiques, Springer Verlag, Berlin-Heidelberg.
[45] DUFLO, M. (1997) Random Iterative Models, Springer Verlag, Berlin-Heidelberg.
[46] DUPUIS, P. (1988) 'Large deviations analysis of some recursive algorithms with state-dependent noise', Annals of Probability 16, 1509–1536.
[47] DUPUIS, P.; KUSHNER, H. J. (1989) 'Stochastic approximation and large deviations: upper bounds and w.p. 1 convergence', SIAM Journal on Control and Optimization 27, 1108–1135.
[48] DUPUIS, P.; NAGURNEY, A. (1993) 'Dynamical systems and variational inequalities', Annals of Operations Research 44, 7–42.
[49] FABIAN, V. (1960) 'Stochastic approximation methods', Czechoslovak Mathematical Journal 10, 125–159.
[50] FABIAN, V. (1968) 'On asymptotic normality in stochastic approximation', Annals of Mathematical Statistics 39, 1327–1332.
[51] FANG, H.-T.; CHEN, H.-F. (2000) 'Stability and instability of limit points for stochastic approximation algorithms', IEEE Transactions on Automatic Control 45, 413–420.
[52] FILIPPOV, A. F. (1988) Differential Equations with Discontinuous Righthand Sides, Kluwer Academic, Dordrecht.
[53] FORT, J.-C.; PAGES, G. (1995) 'On the a.s. convergence of the Kohonen algorithm with a general neighborhood function', Annals of Applied Probability 5, 1177–1216.
[54] FU, M. C.; HU, J.-Q. (1997) Conditional Monte Carlo: Gradient Estimation and Optimization Applications, Kluwer Academic, Boston.
[55] FUDENBERG, D.; LEVINE, D. (1998) Theory of Learning in Games, MIT Press, Cambridge, Mass.
[56] GELFAND, S. B.; MITTER, S. K. (1991) 'Recursive stochastic algorithms for global optimization in R^d', SIAM Journal on Control and Optimization 29, 999–1018.
[57] GERENCSER, L. (1992) 'Rate of convergence of recursive estimators', SIAM Journal on Control and Optimization 30, 1200–1227.
[58] GOLDSTEIN, L. (1988) 'On the choice of step-size in the Robbins-Monro procedure', Statistics and Probability Letters 6, 299–303.
[59] GLASSERMAN, P. (1991) Gradient Estimation via Perturbation Analysis, Kluwer Academic, Boston.
[60] GROSSBERG, S. (1978) 'Competition, decision and consensus', Journal of Mathematical Analysis and Applications 66, 470–493.
[61] HALL, P.; HEYDE, C. C. (1980) Martingale Limit Theory and Its Applications, Academic Press, New York.
[62] HARTMAN, P. (1982) Ordinary Differential Equations (2nd ed.), Birkhauser, Boston.
[63] HAYKIN, S. (2001) Adaptive Filter Theory (4th ed.), Prentice Hall, Englewood Cliffs, N.J.
[64] HAYKIN, S. (1998) Neural Networks: A Comprehensive Foundation (2nd ed.), Macmillan Publ. Co., New York.
[65] HELMKE, U.; MOORE, J. B. (1994) Optimization and Dynamical Systems, Springer Verlag, London.
[66] HERTZ, J.; KROGH, A.; PALMER, R. (1991) An Introduction to the Theory of Neural Computation, Addison Wesley, Redwood City, Calif.
[67] HIRSCH, M. W. (1985) 'Systems of differential equations that are competitive or cooperative II: Convergence almost everywhere', SIAM Journal on Mathematical Analysis 16, 423–439.
[68] HIRSCH, M. W.; SMALE, S.; DEVANEY, R. (2003) Differential Equations, Dynamical Systems and an Introduction to Chaos, Academic Press, New York.
[69] HSIEH, M.-H.; GLYNN, P. W. (2002) 'Confidence regions for stochastic approximation algorithms', Proc. of the Winter Simulation Conference 1, 370–376.
[70] HO, Y. C.; CAO, X. (1991) Perturbation Analysis of Discrete Event Dynamical Systems, Birkhauser, Boston.
[71] HOFBAUER, J.; SIGMUND, K. (1998) Evolutionary Games and Population Dynamics, Cambridge Univ. Press, Cambridge, UK.
[72] HU, J.; WELLMAN, M. P. (1998) 'Multiagent reinforcement learning: theoretical framework and an algorithm', Proc. of the 15th International Conference on Machine Learning, Madison, Wisc., 242–250.
[73] JAFARI, A.; GREENWALD, A.; GONDEK, D.; ERCAL, G. (2001) 'On no-regret learning, fictitious play and Nash equilibrium', Proc. of the 18th International Conference on Machine Learning, Williams College, Williamstown, Mass., 226–233.
[74] KAILATH, T. (1980) Linear Systems, Prentice Hall, Englewood Cliffs, N.J.
[75] KANIOVSKI, Y. M.; YOUNG, H. P. (1995) 'Learning dynamics in games with stochastic perturbations', Games and Economic Behavior 11, 330–363.
[76] KATKOVNIK, V.; KULCHITSKY, Y. (1972) 'Convergence of a class of random search algorithms', Automation and Remote Control 8, 1321–1326.
[77] KIEFER, J.; WOLFOWITZ, J. (1952) 'Stochastic estimation of the maximum of a regression function', Annals of Mathematical Statistics 23, 462–466.
[78] KELLY, F. P.; MAULLOO, A.; TAN, D. (1998) 'Rate control in communication networks: shadow prices, proportional fairness and stability', Journal of the Operational Research Society 49, 237–252.
[79] KOHONEN, T. (2002) 'Learning vector quantization', in The Handbook of Brain Theory and Neural Networks (2nd ed.), M. A. Arbib (ed.), MIT Press, Cambridge, Mass., 537–540.
[80] KOSMATOPOULOS, E. B.; CHRISTODOULOU, M. A. (1996) 'Convergence properties of a class of learning vector quantization algorithms', IEEE Transactions on Image Processing 5, 361–368.
[81] KRASOVSKII, N. N. (1963) Stability of Motion, Stanford Univ. Press, Stanford, Calif.
[82] KUSHNER, H. J.; CLARK, D. (1978) Stochastic Approximation Algorithms for Constrained and Unconstrained Systems, Springer Verlag, New York.
[83] KUSHNER, H. J.; YIN, G. (1987a) 'Asymptotic properties for distributed and communicating stochastic approximation algorithms', SIAM Journal on Control and Optimization 25, 1266–1290.
[84] KUSHNER, H. J.; YIN, G. (1987b) 'Stochastic approximation algorithms for parallel and distributed processing', Stochastics and Stochastics Reports 22, 219–250.
[85] KUSHNER, H. J.; YIN, G. (2003) Stochastic Approximation and Recursive Algorithms and Applications (2nd ed.), Springer Verlag, New York.
[86] LAI, T. L. (2003) 'Stochastic approximation', Annals of Statistics 31, 391–406.
[87] LAI, T. L.; ROBBINS, H. (1978) 'Limit theorems for weighted sums and stochastic approximation processes', Proc. National Academy of Sciences USA 75, 1068–1070.
[88] LEIZAROWITZ, A. (1997) 'Convergence of solutions to equations arising in neural networks', Journal of Optimization Theory and Applications 94, 533–560.
[89] LI, Y. (2003) 'A martingale inequality and large deviations', Statistics and Probability Letters 62, 317–321.
[90] LITTMAN, M. L. (2001) 'Value-function reinforcement learning in Markov games', Cognitive Systems Research 2, 55–66.
[91] LJUNG, L. (1977) 'Analysis of recursive stochastic algorithms', IEEE Transactions on Automatic Control 22, 551–575.
[92] LJUNG, L. (1978) 'Strong convergence of a stochastic approximation algorithm', Annals of Statistics 6, 680–696.
[93] LJUNG, L. (1999) System Identification: Theory for the User (2nd ed.), Prentice Hall, Englewood Cliffs, N.J.
[94] LJUNG, L.; PFLUG, G. C.; WALK, H. (1992) Stochastic Approximation and Optimization of Random Systems, Birkhauser, Basel.
[95] MATSUMOTO, Y. (2002) An Introduction to Morse Theory, Trans. of Mathematical Monographs No. 208, American Math. Society, Providence, R.I.
[96] McDIARMID, C. (1998) 'Concentration', in Probabilistic Methods for Algorithmic Discrete Mathematics, M. Habib, C. McDiarmid, J. Ramirez-Alfonsin and B. Reed (eds.), Springer Verlag, Berlin-Heidelberg.
[97] MEL'NIKOV, A. (1996) 'Stochastic differential equations: singularity of coefficients, regression models and stochastic approximation', Russian Mathematical Surveys 52, 819–909.
[98] MILGROM, P.; SEGAL, I. (2002) 'Envelope theorems for arbitrary choice sets', Econometrica 70, 583–601.
[99] NEVELSON, M.; KHASMINSKII, R. (1976) Stochastic Approximation and Recursive Estimation, Trans. of Mathematical Monographs No. 47, American Math. Society, Providence, R.I.
[100] NEVEU, J. (1975) Discrete Parameter Martingales, North Holland, Amsterdam.
[101] OJA, E. (1982) 'Simplified neuron model as a principal component analyzer', Journal of Mathematical Biology 15, 267–273.
[102] OJA, E.; KARHUNEN, J. (1985) 'On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix', Journal of Mathematical Analysis and Applications 106, 69–84.
[103] OKSENDAL, B. (2005) Stochastic Differential Equations (6th ed.), Springer Verlag, Berlin-Heidelberg.
[104] ORTEGA, J. M.; RHEINBOLDT, W. C. (2000) Iterative Solutions of Nonlinear Equations in Several Variables, Society for Industrial and Applied Math., Philadelphia.
[105] PELLETIER, M. (1998) 'On the almost sure asymptotic behaviour of stochastic algorithms', Stochastic Processes and Their Applications 78, 217–244.
[106] PELLETIER, M. (1999) 'An almost sure central limit theorem for stochastic approximation algorithms', Journal of Multivariate Analysis 71, 76–93.
[107] PEMANTLE, R. (1990) 'Nonconvergence to unstable points in urn models and stochastic approximations', Annals of Probability 18, 698–712.
[108] PEMANTLE, R. (2007) 'A survey of random processes with reinforcement', Probability Surveys 4, 1–79.
[109] PEZESHKI-ESFAHANI, H.; HEUNIS, A. J. (1997) 'Strong diffusion approximations for recursive stochastic algorithms', IEEE Transactions on Information Theory 43, 512–523.
[110] PUTERMAN, M. (1994) Markov Decision Processes, John Wiley, New York.
[111] ROBBINS, H.; MONRO, S. (1951) 'A stochastic approximation method', Annals of Mathematical Statistics 22, 400–407.
[112] ROCKAFELLAR, R. T. (1970) Convex Analysis, Princeton Univ. Press, Princeton, N.J.
[113] RUBINSTEIN, R. (1981) Simulation and the Monte Carlo Method, John Wiley, New York.
[114] RUDIN, W. (1986) Real and Complex Analysis (3rd ed.), McGraw-Hill, New York.
[115] RUDIN, W. (1991) Functional Analysis (2nd ed.), McGraw-Hill, New York.
[116] RUPPERT, D. (1988) 'A Newton-Raphson version of the multivariate Robbins-Monro procedure', Annals of Statistics 13, 236–245.
[117] RUSZCZYNSKI, A.; SYSKI, W. (1983) 'Stochastic approximation method with gradient averaging for unconstrained problems', IEEE Transactions on Automatic Control 28, 1097–1105.
[118] SANDHOLM, W. (1998) 'An evolutionary approach to congestion', discussion paper available at: http://www.kellogg.northwestern.edu/research/math/papers/1198.pdf.
[119] SARGENT, T. (1993) Bounded Rationality in Macroeconomics, Clarendon Press, Oxford, UK.
[120] SASTRY, P. S.; MAGESH, M.; UNNIKRISHNAN, K. P. (2002) 'Two timescale analysis of the Alopex algorithm for optimization', Neural Computation 14, 2729–2750.
[121] SASTRY, P. S.; PHANSALKAR, V. V.; THATHACHAR, M. A. L. (1994) 'Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information', IEEE Transactions on Systems, Man and Cybernetics 24, 769–777.
[122] SHAMMA, J. S.; ARSLAN, G. (2005) 'Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria', IEEE Transactions on Automatic Control 50, 312–327.
[123] SINGH, S. P.; KEARNS, M.; MANSOUR, Y. (2000) 'Nash convergence of gradient dynamics in general-sum games', Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, Stanford, Calif., 541–548.
[124] SMITH, H. (1995) Monotone Dynamical Systems, American Math. Society, Providence, R.I.
[125] SPALL, J. C. (1992) 'Multivariate stochastic approximation using a simultaneous perturbation gradient approximation', IEEE Transactions on Automatic Control 37, 332–341.
[126] SPALL, J. C. (2003) Introduction to Stochastic Search and Optimization, John Wiley, Hoboken, N.J.
[127] STROOCK, D. W.; VARADHAN, S. R. S. (1979) Multidimensional Diffusion Processes, Springer Verlag, New York.
[128] TSITSIKLIS, J. N. (1994) 'Asynchronous stochastic approximation and Q-learning', Machine Learning 16, 185–202.
[129] TSITSIKLIS, J. N.; VAN ROY, B. (1997) 'An analysis of temporal-difference learning with function approximation', IEEE Transactions on Automatic Control 42, 674–690.
[130] VEGA-REDONDO, F. (1995) Evolution, Games and Economic Behaviour, Oxford Univ. Press, Oxford, UK.
[131] WAGNER, D. H. (1977) 'Survey of measurable selection theorems', SIAM Journal on Control and Optimization 15, 859–903.
[132] WASAN, M. (1969) Stochastic Approximation, Cambridge Univ. Press, Cambridge, UK.
[133] WATKINS, C. J. C. H. (1988) 'Learning from Delayed Rewards', Ph.D. thesis, Cambridge Univ., Cambridge, UK.
[134] WILSON, F. W. (1969) 'Smoothing derivatives of functions and applications', Transactions of the American Math. Society 139, 413–428.
[135] WILLIAMS, D. (1991) Probability with Martingales, Cambridge Univ. Press, Cambridge, UK.
[136] WONG, E. (1971) 'Representation of martingales, quadratic variation and applications', SIAM Journal on Control and Optimization 9, 621–633.
[137] YAN, W.-Y.; HELMKE, U.; MOORE, J. B. (1994) 'Global analysis of Oja's flow for neural networks', IEEE Transactions on Neural Networks 5, 674–683.
[138] YOSHIZAWA, S.; HELMKE, U.; STARKOV, K. (2001) 'Convergence analysis for principal component flows', International Journal of Applied Mathematics and Computer Science 11, 223–236.
[139] YOUNG, H. P. (1998) Individual Strategy and Social Structure, Princeton Univ. Press, Princeton, N.J.
[140] YOUNG, H. P. (2004) Strategic Learning and Its Limits, Oxford Univ. Press, Oxford, UK.


Index

α-limit set, 147
ω-limit set, 147
absolutely continuous, 153
actor-critic algorithm, 130
almost equilibrated, 65
Alopex, 125
approximate drift, 57
Arzela–Ascoli theorem, 140
attractor, 148
autonomous o.d.e., 143
avoidance of traps, 44
backpropagation, 124
Banach–Alaoglu theorem, 142
Banach–Saks theorem, 142
best response, 132
Brouwer's fixed point theorem, 141
Burkholder inequality, 151
central limit theorem
  for martingales, 151
  for stochastic approximation, 100
  functional, 96
clock
  global, 79
  local, 79
Cohen–Grossberg model, 138
collective phenomena, 131
complete norm, 140
concentration inequality, 151
contraction, 125, 141
contraction mapping theorem, 141
convergence in total variation, 153
cooperative o.d.e., 133
delay, 79
  effect of, 82
differential inclusion, 53
discontinuous dynamics, 58
discounted cost, 127
discrete Gronwall inequality, 146
distributed implementation
  asynchronous, 78
  synchronous, 87
domain of attraction, 148
Doob decomposition, 150
dynamic pricing, 133
dynamic programming, 128
empirical gradient, 5
empirical measures, 19
empirical strategies, 132
envelope theorem, 122
equicontinuous, 140
equilibrium point, 147
ergodic occupation measure, 68
Euler scheme, 2
fictitious play, 132
fixed point, 2
flow, 15, 145
game
  bimatrix, 132
  repeated, 136
  two-person zero-sum, 136
Gauss–Markov process, 99
generic, 6
gradient-like, 121
Gronwall inequality, 143
Hartman–Grobman theorem, 149
hyperbolic equilibrium, 149
i.o., 18
increasing returns, 3
infinitesimal perturbation analysis, 123
invariant set, 15, 55, 147
internally chain recurrent, 15
internally chain transitive, 15
Kushner–Clark lemma, 18
LaSalle invariance principle, 148
learning
  reinforcement, 129
  supervised, 129
  unsupervised, 129
learning vector quantization, 124
Lebesgue's theorem, 143
Liapunov equation, 99
Liapunov's second method, 148
likelihood ratio method, 123
linear system, 146
Lipschitz condition, 143
lock-in probability, 31
Markov decision process, 127
martingale, 149
  convergence, 150
  difference, 1, 6
  inequalities, 150
mean square error, 4
Nash equilibrium, 133
non-autonomous o.d.e., 143
non-expansive, 125
nonlinear regression, 4
nonlinear urn, 3
principal component analysis, 138
Prohorov's theorem, 153
Q-learning, 128
quasi-static, 65
Radon–Nikodym derivative, 153
Radon–Nikodym theorem, 153
replicator dynamics, 134
saddle point, 122
sample complexity, 42
scaling limit, 22
Scheffe's theorem, 153
set-valued map, 52
  upper semicontinuous, 52
Shapley equation, 136
simulated annealing, 125
simulation-based optimization, 122
singularly perturbed o.d.e., 65
Skorohod's theorem, 153
stability
  asymptotic, 148
  global asymptotic, 148
  Liapunov, 148
  structural, 148
stable manifold, 149
stable subspace, 148
stationary control, 68
stationary randomized control, 68
stepsize
  constant, 101
  decreasing, 10
stochastic approximation, 6
  controlled, 56
  projected, 59
  simultaneous perturbation, 120
stochastic differential equation, 155
stochastic fixed point iterations, 125
stochastic gradient schemes, 118
stochastic integral, 155
stochastic recursive inclusion, 52
stochastic subgradient descent, 57
stopping time, 151
strong solution, 155
subdifferential, 57
sup-norm, 140
tatonnement, 138
tight set, 152
timescales
  natural, 67
  two, 64
value function, 128
value iteration, 128
well-posed, 143