Local Convergence Theorems for Adaptive Stochastic Approximation Schemes
Author(s): T. L. Lai and Herbert Robbins
Source: Proceedings of the National Academy of Sciences of the United States of America, Vol. 76, No. 7 (Jul., 1979), pp. 3065-3067
Published by: National Academy of Sciences
Stable URL: http://www.jstor.org/stable/69929


Proc. Natl. Acad. Sci. USA Vol. 76, No. 7, pp. 3065-3067, July 1979 Statistics

Local convergence theorems for adaptive stochastic approximation schemes

(regression/weighted sums/martingale transform)


T. L. Lai and Herbert Robbins

Department of Mathematical Statistics, Columbia University, New York, New York 10027; and State University of New York, Stony Brook, New York 11790

Contributed by Herbert Robbins, April 27, 1979

ABSTRACT For the regression model $y = M(x) + \epsilon$, adaptive stochastic approximation schemes of the form $x_{n+1} = x_n - y_n/(n b_n)$ for choosing the levels $x_1, x_2, \ldots$ at which $y_1, y_2, \ldots$ are observed converge with probability 1 to the unknown root $\theta$ of the regression function $M(x)$. Certain local convergence theorems that relate the convergence rate of $x_n - \theta$ to the limiting behavior of the random variables $b_n$ are established.

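Let $\epsilon, \epsilon_1, \epsilon_2, \ldots$ be independent, identically distributed random variables with $E\epsilon = 0$ and $E\epsilon^2 = \sigma^2 < \infty$. Let $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ be an increasing sequence of $\sigma$-fields such that $\epsilon_i$ is $\mathcal{F}_i$-measurable and is independent of $\mathcal{F}_{i-1}$ for all $i \ge 1$. Consider the general regression model

$$y_i = M(x_i) + \epsilon_i, \quad i = 1, 2, \ldots \qquad [1]$$

in which the regression function $M(x)$ is a Borel function such that

$$M(\theta) = 0, \quad M'(\theta) = \beta > 0, \qquad [2]$$

$$\inf_{\delta \le |x - \theta| \le 1/\delta} \{M(x)(x - \theta)\} > 0 \quad \text{for all } 0 < \delta < 1, \qquad [3]$$

$$|M(x)| \le c|x| + d \quad \text{for some } c, d > 0 \text{ and all } x. \qquad [4]$$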

The Robbins-Monro stochastic approximation scheme for choosing the levels $x_i$ so that they will eventually approach the unknown root $\theta$ of the regression function is defined by

$$x_1 = \mathcal{F}_0\text{-measurable random variable}, \quad x_{i+1} = x_i - y_i/(ib), \quad i = 1, 2, \ldots, \qquad [5]$$

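where $b$ is some positive constant. It is well known that for $b < 2\beta$,

$$\sqrt{n}\,(x_n - \theta) \to N(0, \sigma^2/\{b(2\beta - b)\}) \quad \text{in distribution, and} \qquad [6]$$

$$\limsup_{n \to \infty} \left(\frac{n}{2 \log\log n}\right)^{1/2} |x_n - \theta| = \frac{\sigma}{\{b(2\beta - b)\}^{1/2}} \quad \text{a.s. (almost surely).} \qquad [7]$$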
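To make [5] and [6] concrete, here is a minimal Python sketch. It assumes Gaussian errors and a linear regression function $M(x) = \beta(x - \theta)$, which are illustration choices rather than requirements of the text, and the helper names `asymptotic_variance`, `robbins_monro`, and `scaled_mean_square_error` are hypothetical.

```python
import random


def asymptotic_variance(b, beta, sigma):
    """Limiting variance sigma^2 / {b (2 beta - b)} in [6]; needs 0 < b < 2 beta."""
    if not 0.0 < b < 2.0 * beta:
        raise ValueError("[6] is stated only for 0 < b < 2*beta")
    return sigma ** 2 / (b * (2.0 * beta - b))


def robbins_monro(M, x1, b, n_steps, sigma, rng):
    """Iterate [5]: x_{i+1} = x_i - y_i / (i b), where y_i = M(x_i) + eps_i.

    Assumption for illustration: eps_i ~ N(0, sigma^2). Returns x_{n_steps + 1}.
    """
    x = x1
    for i in range(1, n_steps + 1):
        y = M(x) + rng.gauss(0.0, sigma)  # noisy response observed at level x_i
        x -= y / (i * b)
    return x


def scaled_mean_square_error(beta, theta, b, sigma, n_steps, n_runs, seed=0):
    """Monte Carlo estimate of n E(x_{n+1} - theta)^2 for the linear model."""
    rng = random.Random(seed)
    M = lambda x: beta * (x - theta)
    total = 0.0
    for _ in range(n_runs):
        x_final = robbins_monro(M, 0.0, b, n_steps, sigma, rng)
        total += (x_final - theta) ** 2
    return n_steps * total / n_runs
```

For instance, `scaled_mean_square_error(2.0, 1.0, 2.0, 1.0, 500, 400)` can be compared with `asymptotic_variance(2.0, 2.0, 1.0)`, which equals 0.25. Evaluating `asymptotic_variance` over a grid of $b$ in $(0, 2\beta)$ shows that the limiting variance is smallest at $b = \beta$.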

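Moreover, if for some $\eta > 0$

$$M(x) = \beta(x - \theta) + O(|x - \theta|^{1+\eta}) \quad \text{as } x \to \theta, \qquad [8]$$

then for $b > 2\beta$, there exists a random variable $z$ such that

$$n^{\beta/b}(x_n - \theta) \to z \quad \text{a.s.,} \qquad [9]$$

and for $b = 2\beta$,

$$\left(\frac{n}{\log n}\right)^{1/2} (x_n - \theta) \to N(0, \sigma^2/b^2) \quad \text{in distribution} \qquad [10]$$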

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

(cf. ref. 1). Thus, the asymptotically optimal choice of $b$ is $\beta$. In practice, the value $M'(\theta) = \beta$ will rarely be known. It is therefore natural to try estimating $\beta$ from the data being observed and to use instead of [5] an adaptive stochastic approximation scheme of the form

$$x_1 = \mathcal{F}_0\text{-measurable random variable}, \quad x_{i+1} = x_i - y_i/(i b_i), \quad i = 1, 2, \ldots, \qquad [11]$$

where $b_i > 0$ is some estimate of $\beta$ based on $x_1, y_1, \ldots, x_i, y_i$. The following theorem, which follows from lemma 5 of ref. 1, deals with the a.s. convergence of the stochastic approximation scheme [11].

THEOREM 1. Suppose that in the regression model [1], $M(x)$ satisfies conditions [2]-[4]. Define inductively $y_i$ by [1] and $x_{i+1}$ by [11], where $\{b_i\}$ is a sequence of positive random variables such that

$$b_i \text{ is } \mathcal{F}_{i-1}\text{-measurable for all } i \ge 1, \qquad [12]$$

$$\inf_i b_i > 0 \quad \text{and} \quad \sup_i b_i < \infty \quad \text{a.s.} \qquad [13]$$

Then $\lim_{n \to \infty} x_n = \theta$ a.s.

In view of [7], it is natural to ask whether $x_n$ converges to $\theta$ at the rate

$$x_n - \theta = O[n^{-1/2}(\log\log n)^{1/2}] \qquad [14]$$

on the event $\{\limsup b_n < 2\beta\}$. An affirmative answer is given by the following.

THEOREM 2. With the same notations and assumptions as in Theorem 1, [14] holds a.s. on the event $\{\limsup b_n < 2\beta\}$.
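Scheme [11] under conditions [12] and [13] can be sketched as follows. This is an illustrative simulation only: the Gaussian errors, the particular regression function, and the alternating rule for $b_i$ are assumptions chosen for the demo, and `adaptive_scheme`, `alternating_b`, and `M_demo` are hypothetical names.

```python
import math
import random


def adaptive_scheme(M, x1, choose_b, n_steps, sigma, rng):
    """Iterate [11]: x_{i+1} = x_i - y_i / (i b_i).

    choose_b(i, xs, ys) is called before y_i is observed, so it sees only
    x_1..x_i and y_1..y_{i-1}; this mirrors the measurability condition [12].
    Returns the list [x_1, ..., x_{n_steps + 1}].
    """
    xs, ys = [x1], []
    for i in range(1, n_steps + 1):
        b_i = choose_b(i, xs, ys)
        y_i = M(xs[-1]) + rng.gauss(0.0, sigma)  # assumed N(0, sigma^2) errors
        ys.append(y_i)
        xs.append(xs[-1] - y_i / (i * b_i))
    return xs


def alternating_b(i, xs, ys):
    """A deliberately crude bounded rule: satisfies [13] with 1 <= b_i <= 3."""
    return 1.0 if i % 2 else 3.0


THETA = 1.0


def M_demo(x):
    """Demo regression function with root THETA and slope beta = 3 there."""
    return 2.0 * (x - THETA) + math.atan(x - THETA)


path = adaptive_scheme(M_demo, 0.0, alternating_b, 5000, 1.0, random.Random(1))
final_error = abs(path[-1] - THETA)
```

In this toy setup $\limsup b_n = 3 < 2\beta = 6$, so the hypotheses of both Theorem 1 and Theorem 2 are met even though $b_i$ makes no attempt to estimate $\beta$.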

Theorem 2 is an immediate corollary of the following deeper result which says that if the assumptions of Theorem 1 are satisfied, then, with probability 1, a sufficiently long string of $b_n$ not exceeding $(2 - \eta)\beta$ leads to a corresponding string of $x_n$ differing from $\theta$ by no more than a constant times $n^{-1/2}(\log\log n)^{1/2}$.

THEOREM 3. With the same notations and assumptions as in Theorem 1, there exists an event $\Omega_0$ such that $P(\Omega_0) = 1$ and all sample points $\omega \in \Omega_0$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,

$$\max_{m \le j \le n} b_j \le (2 - \eta)\beta \implies |x_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2} \quad \text{for all } m^k \le j \le n. \qquad [15]$$
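The quantity that [15] bounds can be computed along a sample path. The sketch below is a hypothetical helper (`rate_ratio` is not from the text) that returns the largest value of $|x_j - \theta|\, j^{1/2} (\log\log j)^{-1/2}$ over a range of $j$. Theorem 3 asserts that such a ratio stays below some finite $C$ on the stated range, but it does not say how large $C$ is.

```python
import math


def rate_ratio(xs, theta, j_min, j_max):
    """Max over j_min <= j <= j_max of |x_j - theta| / (j^{-1/2} (log log j)^{1/2}).

    xs[0] is x_1, so x_j is xs[j - 1]. Requires j_min >= 3 so that log log j > 0.
    """
    if j_min < 3 or j_max > len(xs) or j_min > j_max:
        raise ValueError("need 3 <= j_min <= j_max <= len(xs)")
    worst = 0.0
    for j in range(j_min, j_max + 1):
        scale = math.sqrt(math.log(math.log(j)) / j)
        worst = max(worst, abs(xs[j - 1] - theta) / scale)
    return worst
```

Applied to a simulated path of scheme [11], a ratio that remains moderate as `j_max` grows is consistent with the rate in [14], although a finite simulation cannot confirm an almost sure statement.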

We preface the proof of Theorem 3 with the following.

LEMMA 1. Suppose that $\{b_n\}$ is a sequence of random variables such that [12] holds and

$$\inf_n |b_n| > 0 \quad \text{a.s.}, \qquad \sup_n |b_n| < \infty \quad \text{a.s.}$$

Let $V_n = \sigma^2 \sum_{i=1}^{n} b_i^{-2}$. Let $S(\cdot)$ be the random function defined by $S(0) = 0$ and $S(t) = \sum_{i=1}^{n} (\epsilon_i/b_i)$ for $V_n \le t < V_{n+1}$, $n = 1, 2, \ldots$. Assume that $\sigma \ne 0$. Then, redefining the random variables on a new probability space if necessary, there exists a standard Wiener process $w(t)$, $t \ge 0$, such that

$$\lim_{t \to \infty} \frac{|S(t) - w(t)|}{(t \log\log t)^{1/2}} = 0 \quad \text{a.s.}$$

Consequently,

$$\limsup_{n \to \infty} \frac{\sum_{i=1}^{n} (\epsilon_i/b_i)}{(2 V_n \log\log n)^{1/2}} = 1 \quad \text{a.s.}$$

Lemma 1 is a generalization of theorem 2 of ref. 2 and can be proved by using essentially the same argument, applying theorem 3.1 of ref. 3, and noting that for $p > 0$ and $\delta > 0$,

Etl I j I PI[b > j]j>,il = SjXi2?2= J IxIPdG(x),

where $G$ is the distribution function of $\epsilon$.

Proof of Theorem 3: Without loss of generality, assume that $\theta = 0$. By Theorem 1, with probability 1,

$$\lim_{n \to \infty} x_n = 0. \qquad [16]$$

Let $S_n = \sum_{i=1}^{n} (\epsilon_i/b_i)$. By Lemma 1, with probability 1,

$$S_n = O[(n \log\log n)^{1/2}]. \qquad [17]$$
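The normalization used in Lemma 1 and [17] can be evaluated numerically. The sketch below, with the hypothetical helper `normalized_weighted_sums`, computes $V_n = \sigma^2 \sum_{i \le n} b_i^{-2}$ and the ratio $S_n / (2 V_n \log\log n)^{1/2}$ for given sequences. It only evaluates the definitions and does not verify the almost sure limit.

```python
import math


def normalized_weighted_sums(eps, b, sigma, n_min=3):
    """Return [(n, S_n / sqrt(2 V_n log log n))] for n_min <= n <= len(eps).

    S_n = sum_{i<=n} eps_i / b_i and V_n = sigma^2 sum_{i<=n} b_i^{-2}.
    n_min must be at least 3 so that log log n is positive.
    """
    if len(eps) != len(b):
        raise ValueError("eps and b must have the same length")
    if n_min < 3:
        raise ValueError("n_min must be >= 3")
    out = []
    s = 0.0
    v = 0.0
    for n, (e, bn) in enumerate(zip(eps, b), start=1):
        s += e / bn
        v += sigma ** 2 / bn ** 2
        if n >= n_min:
            out.append((n, s / math.sqrt(2.0 * v * math.log(math.log(n)))))
    return out
```

With $b_i \equiv 1$ and $\sigma = 1$ this reduces to the classical normalization $S_n / (2 n \log\log n)^{1/2}$ of the law of the iterated logarithm.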

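Let $\Omega_0$ be the event on which [16] and [17] both hold and

$$\inf_i b_i > 0, \qquad \sup_i b_i < \infty. \qquad [18]$$

Then $P(\Omega_0) = 1$. From [2] and [16], it follows that on $\Omega_0$

$$M(x_n) = (\beta + \xi_n)x_n, \quad \text{where } \lim_{n \to \infty} \xi_n = 0. \qquad [19]$$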

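Let $d_j = (\beta + \xi_j)/b_j$. Then, as shown in the proof of theorem 3 of ref. 1,

$$x_{n+1} = \beta_{m-1,n} x_m - \sum_{i=m}^{n} \beta_{in} \epsilon_i/(i b_i), \qquad [20]$$

where for $i = 0, 1, \ldots, n - 1$,

$$\beta_{in} = \prod_{j=i+1}^{n} (1 - j^{-1} d_j), \qquad \beta_{nn} = 1. \qquad [21]$$

By partial summation,

$$\sum_{i=m}^{n} \beta_{in} \epsilon_i/(i b_i) = \sum_{i=m}^{n-1} \left\{\frac{\beta_{in}}{i} - \frac{\beta_{i+1,n}}{i+1}\right\} S_i + \frac{S_n}{n} - \frac{\beta_{mn}}{m} S_{m-1}. \qquad [22]$$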
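Both [20] and the partial summation step [22] are algebraic identities, so they can be checked numerically. The sketch below is a hypothetical check, not part of the original argument. It iterates the one-step recursion $x_{j+1} = (1 - d_j/j)x_j - \epsilon_j/(j b_j)$, which is what [11] becomes under [19] with $\theta = 0$, and compares it with the closed form [20]. It then compares the noise sum with its Abel-summed form $\sum_{i=m}^{n-1} \{\beta_{in}/i - \beta_{i+1,n}/(i+1)\} S_i + S_n/n - (\beta_{mn}/m) S_{m-1}$. The inputs are arbitrary random numbers.

```python
import random


def beta_coeff(d, i, n):
    """beta_{in} of [21]: product over j = i+1..n of (1 - d_j / j); empty product is 1."""
    prod = 1.0
    for j in range(i + 1, n + 1):
        prod *= 1.0 - d[j] / j
    return prod


def check_identities(m, n, seed=0):
    """Return the absolute discrepancies in [20] and in [22] for random test inputs."""
    rng = random.Random(seed)
    # 1-based lists; index 0 is unused padding. d and b are drawn independently
    # here because the identities hold for arbitrary sequences.
    b = [0.0] + [rng.uniform(0.5, 2.0) for _ in range(n)]
    d = [0.0] + [rng.uniform(0.5, 2.0) for _ in range(n)]
    eps = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Left side of [20]: run the one-step recursion from x_m up to x_{n+1}.
    x_m = rng.gauss(0.0, 1.0)
    x = x_m
    for j in range(m, n + 1):
        x = (1.0 - d[j] / j) * x - eps[j] / (j * b[j])

    # Right side of [20].
    noise = sum(beta_coeff(d, i, n) * eps[i] / (i * b[i]) for i in range(m, n + 1))
    closed_form = beta_coeff(d, m - 1, n) * x_m - noise

    # Right side of [22], with partial sums S_i = sum_{j<=i} eps_j / b_j.
    S = [0.0] * (n + 1)
    for i in range(1, n + 1):
        S[i] = S[i - 1] + eps[i] / b[i]
    summed = sum(
        (beta_coeff(d, i, n) / i - beta_coeff(d, i + 1, n) / (i + 1)) * S[i]
        for i in range(m, n)
    )
    summed += S[n] / n - beta_coeff(d, m, n) / m * S[m - 1]

    return abs(x - closed_form), abs(noise - summed)
```

Calling `check_identities(3, 40)` returns two discrepancies that should both be at the level of floating-point rounding.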

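Let $\omega \in \Omega_0$ and let $0 < \eta < 1$. Choose $1 > \rho > 0$ such that

$$(1 - \rho)/(2 - \eta) = p > \tfrac{1}{2}. \qquad [23]$$

Because $d_j = [\beta + o(1)]/b_j$ at $\omega$ by [19], there exists $j_0$ (depending on $\omega$) such that at $\omega$, for all $j \ge j_0$,

$$j^{-1} d_j < \tfrac{1}{2}, \qquad (1 - \rho)\beta/b_j \le d_j \le 2\beta/b_*, \qquad [24]$$

where $b_* = \inf_i b_i$. By [21] and [24], for $n > i \ge j_0$,

$$\left|\frac{\beta_{in}}{i} - \frac{\beta_{i+1,n}}{i+1}\right| \le i^{-2} \beta_{i+1,n}(1 + 2\beta/b_*) \quad \text{at } \omega. \qquad [25]$$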

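From [21], [23], and [24], it follows that at $\omega$

$$n > m \ge j_0 \text{ and } \max_{m \le j \le n} b_j \le (2 - \eta)\beta \implies 0 < \beta_{in} \le \prod_{j=i+1}^{n} \left(1 - \frac{p}{j}\right) = \gamma_n/\gamma_i \quad \text{for } n \ge i \ge m - 1, \qquad [26]$$

where

$$\gamma_n = \prod_{j=j_0}^{n} \left(1 - \frac{p}{j}\right) \sim D n^{-p} \qquad [27]$$

for some $D > 0$. Because $p > \tfrac{1}{2}$ we can choose a positive integer $k$ such that

$$p(1 - k^{-1}) > \tfrac{1}{2}. \qquad [28]$$

From [16]-[18], [20], [22], and [25]-[28], it then follows that, by choosing $M$ ($\ge j_0$) and $C$ sufficiently large, the desired implication [15] holds at $\omega$ for all $m \ge M$ and $n \ge m^k$.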
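For readers who want the intermediate step behind the bound [25] used in the proof above, the following is a short derivation sketch. It uses only the definition [21], which gives $\beta_{in} = (1 - d_{i+1}/(i+1))\,\beta_{i+1,n}$, and it assumes that $\beta_{i+1,n} \ge 0$ and $|d_{i+1}| \le 2\beta/b_*$ on the range of indices considered.

```latex
\begin{aligned}
\frac{\beta_{in}}{i} - \frac{\beta_{i+1,n}}{i+1}
  &= \beta_{i+1,n}\left\{\frac{1}{i}\left(1 - \frac{d_{i+1}}{i+1}\right) - \frac{1}{i+1}\right\}
   = \beta_{i+1,n}\,\frac{1 - d_{i+1}}{i(i+1)}, \\
\left|\frac{\beta_{in}}{i} - \frac{\beta_{i+1,n}}{i+1}\right|
  &\le \beta_{i+1,n}\,\frac{1 + |d_{i+1}|}{i^{2}}
   \le i^{-2}\,\beta_{i+1,n}\left(1 + \frac{2\beta}{b_*}\right).
\end{aligned}
```

The first line is exact. The second uses $1/\{i(i+1)\} \le i^{-2}$ together with the assumed bound on $d_{i+1}$.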

COROLLARY 1. With the same notations and assumptions as in Theorem 1, let $\bar{x}_n = n^{-1} \sum_{i=1}^{n} x_i$. Then almost all sample points $\omega$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,

$$\max_{m \le j \le n} b_j \le (2 - \eta)\beta \implies |\bar{x}_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2} \quad \text{for all } m^k \le j \le n + \sqrt{n}. \qquad [29]$$

Proof: With probability 1, because $x_n \to \theta$ by Theorem 1,

$$\sum_{i=1}^{m} (x_i - \theta) = o(m), \qquad \sum_{n < i \le n + \sqrt{n}} (x_i - \theta) = o(\sqrt{n}),$$

and therefore the desired conclusion follows from Theorem 3 by choosing $k > 2$.

An obvious choice for $b_n$ in the stochastic approximation scheme [11] is the usual least squares estimate

$$\hat{\beta}_n = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n) y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \qquad [30]$$

at least in the linear case $M(x) = \beta(x - \theta)$. Note that $\hat{\beta}_n$ is $\mathcal{F}_n$-measurable but not $\mathcal{F}_{n-1}$-measurable. Hence, to satisfy [12] we should take $b_n = \hat{\beta}_{n-1}$ instead. It seems artificial to have to drop the last observation $(x_n, y_n)$ in estimating $\beta$ at stage $n$, and it is therefore desirable to remove the condition [12]. In the sequel we shall replace [12] by the following

Condition C: There exist random variables $b_n'$ and $u_n$ such that (i) $\sum_{n=1}^{\infty} u_n^2 < \infty$ a.s., (ii) $b_n'$ and $u_n$ are both $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$, and

$$\text{(iii)} \quad P[\,|b_n - b_n'| = O(|u_n \epsilon_n|) \text{ as } n \to \infty\,] = 1.$$

THEOREM 4. Suppose that in Theorems 1, 2, and 3 we replace the condition [12] by Condition C. Then the conclusions of Theorems 1, 2, and 3 still hold.
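The least squares estimate [30] and its lagged version $b_n = \hat{\beta}_{n-1}$ can be sketched in Python as follows. The function names are hypothetical, and the fallback value used while the estimate is undefined is an assumption of this sketch, since the text does not specify one.

```python
def least_squares_slope(xs, ys):
    """Estimate [30]: sum (x_i - xbar) y_i / sum (x_i - xbar)^2 over the pairs given."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all levels coincide; the slope is undefined")
    return sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx


def lagged_slope(xs, ys, n, fallback):
    """b_n = beta_hat_{n-1}: uses only (x_1, y_1), ..., (x_{n-1}, y_{n-1}), as [12] requires."""
    try:
        return least_squares_slope(xs[: n - 1], ys[: n - 1])
    except ValueError:
        return fallback  # assumed default while the estimate is undefined
```

Neither function enforces [13]: a raw least squares slope can be negative or arbitrarily large, so some form of truncation is still needed before it is used as $b_n$.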

To prove Theorem 4, we shall need

LEMMA 2. Let $\{u_n\}$ be a sequence of random variables such that $\sum_{n=1}^{\infty} u_n^2 < \infty$ a.s. and $u_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. Then

(a) $\sum_{n=1}^{\infty} u_n \epsilon_n$ converges a.s.,

(b) $\sum_{n=1}^{\infty} u_n^2 \epsilon_n^2 < \infty$ a.s., and therefore

(c) $\sum_{i=1}^{n} |u_i| \epsilon_i^2 = O(\sqrt{n})$ a.s.

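Proof of Theorem 4: By Lemma 2(a), $u_n \epsilon_n \to 0$ a.s., and therefore Condition C(iii) implies that $b_n - b_n' \to 0$ a.s. Hence, by [13],

$$\inf_i b_i' > 0 \quad \text{and} \quad \sup_i b_i' < \infty \quad \text{a.s.} \qquad [31]$$

We note that

$$\frac{\epsilon_i}{i b_i} = \frac{\epsilon_i}{i b_i'} + \frac{\epsilon_i}{i} \left\{\frac{1}{b_i} - \frac{1}{b_i'}\right\}. \qquad [32]$$

Because $b_i'$ is $\mathcal{F}_{i-1}$-measurable for all $i \ge 1$ and because [31] holds,

$$\sum_{i=1}^{\infty} \epsilon_i/(i b_i') \quad \text{converges a.s.} \qquad [33]$$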

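Because $\sum_{i=1}^{\infty} \epsilon_i^2/i^2 < \infty$ and $\sum_{i=1}^{\infty} u_i^2 \epsilon_i^2 < \infty$ a.s. by Lemma 2(b), it then follows from [13], [31], Condition C(iii), and the Schwarz inequality that

$$\sum_{i=1}^{\infty} \left|\frac{\epsilon_i}{i} \left\{\frac{1}{b_i} - \frac{1}{b_i'}\right\}\right| = \sum_{i=1}^{\infty} O(i^{-1} |u_i| \epsilon_i^2) < \infty \quad \text{a.s.} \qquad [34]$$

By [32], [33], and [34], $\sum_{i=1}^{\infty} \epsilon_i/(i b_i)$ converges a.s., and therefore by lemma 5 of ref. 1,

$$x_n \to \theta \quad \text{a.s.} \qquad [35]$$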

Let $S_n = \sum_{i=1}^{n} \epsilon_i/b_i$, $S_n' = \sum_{i=1}^{n} \epsilon_i/b_i'$. By [13], [31], Condition C(iii), and Lemma 2(c),

$$|S_n - S_n'| = \sum_{i=1}^{n} O(|u_i| \epsilon_i^2) = O(\sqrt{n}) \quad \text{a.s.} \qquad [36]$$

Because $S_n' = O[(n \log\log n)^{1/2}]$ by Lemma 1, it then follows from [36] that

$$S_n = O[(n \log\log n)^{1/2}] \quad \text{a.s.} \qquad [37]$$

In view of [35] and [37], the same argument as that used to prove Theorem 3 shows that the conclusion of Theorem 3 still holds in the present more general case, and therefore so does the conclusion of Theorem 2.

As a corollary of Theorem 4, we obtain

COROLLARY 2. Suppose that in Corollary 1 we replace the condition [12] by Condition C; then the conclusion of Corollary 1 still holds.

In practice, although $\beta$ is unknown, it is usually possible to give prior bounds $B > B' > 0$ such that $B' < \beta < B$. In this case, a natural choice for $b_n$ is $\hat{\beta}_n$, truncated below by $B'$ and above by $B$, where $\hat{\beta}_n$ is the least squares estimate of $\beta$ defined by [30]. Obviously the $b_n$ so chosen satisfies [13], and it can be shown that Condition C also holds. Therefore Theorem 4 and Corollary 2 are applicable, and they provide very useful tools for showing that such $b_n$ indeed converge to $\beta$ a.s. and for analyzing the asymptotic properties of the adaptive stochastic approximation scheme [11]. The details will be given elsewhere.
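The truncated least squares choice of $b_n$ described above can be sketched as a simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: errors are taken to be Gaussian, the regression function in the demo is linear, $b_i$ is set to the upper bound until the slope estimate is defined, and `truncated_ls_scheme` is a hypothetical name. The parameters `lower` and `upper` play the roles of $B'$ and $B$.

```python
import random


def truncated_ls_scheme(M, x1, lower, upper, n_steps, sigma, seed=0):
    """Scheme [11] with b_i = least squares slope [30] clipped to [lower, upper].

    Running sums give the slope in O(1) work per step. Assumptions of this
    sketch: N(0, sigma^2) errors, and b_i = upper until the slope is defined.
    Returns (xs, bs) with xs = [x_1, ..., x_{n_steps+1}], bs = [b_1, ..., b_{n_steps}].
    """
    rng = random.Random(seed)
    x = x1
    sx = sy = sxx = sxy = 0.0
    xs, bs = [x1], []
    for i in range(1, n_steps + 1):
        y = M(x) + rng.gauss(0.0, sigma)
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
        denom = sxx - sx * sx / i  # equals sum_j (x_j - xbar_i)^2
        if i >= 2 and denom > 1e-12:
            slope = (sxy - sx * sy / i) / denom  # estimate [30] from the first i pairs
            b = min(upper, max(lower, slope))
        else:
            b = upper
        bs.append(b)
        x -= y / (i * b)
        xs.append(x)
    return xs, bs


# Demo: beta = 2, theta = 1, prior bounds B' = 1 and B = 3.5 (so B < 2 * beta).
xs, bs = truncated_ls_scheme(lambda x: 2.0 * (x - 1.0), 0.0, 1.0, 3.5, 3000, 1.0)
```

Here $b_i$ uses the current pair $(x_i, y_i)$, so it is the non-lagged choice covered by Condition C rather than by [12]. Inspecting the tail of `bs` gives an informal view of how the truncated estimate behaves, but the simulation itself proves nothing about its limit.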

This research was supported by the National Science Foundation.

1. Lai, T. L. & Robbins, H. (1979) Ann. Stat. 7, in press.
2. Lai, T. L. & Robbins, H. (1978) Proc. Natl. Acad. Sci. USA 75, 1068-1070.
3. Jain, N. C., Jogdeo, K. & Stout, W. F. (1975) Ann. Probab. 3,