Local Convergence Theorems for Adaptive Stochastic Approximation Schemes
Author(s): T. L. Lai and Herbert Robbins
Source: Proceedings of the National Academy of Sciences of the United States of America, Vol. 76, No. 7 (Jul., 1979), pp. 3065-3067
Published by: National Academy of Sciences
Stable URL: http://www.jstor.org/stable/69929


Proc. Natl. Acad. Sci. USA Vol. 76, No. 7, pp. 3065-3067, July 1979
Statistics

Local convergence theorems for adaptive stochastic approximation schemes

(regression/weighted sums/martingale transform)

T. L. LAI AND HERBERT ROBBINS

Department of Mathematical Statistics, Columbia University, New York, New York 10027; and State University of New York, Stony Brook, New York 11790

Contributed by Herbert Robbins, April 27, 1979

ABSTRACT For the regression model y = M(x) + ε, adaptive stochastic approximation schemes of the form x_{n+1} = x_n − y_n/(nb_n) for choosing the levels x_1, x_2, ... at which y_1, y_2, ... are observed converge with probability 1 to the unknown root θ of the regression function M(x). Certain local convergence theorems that relate the convergence rate of x_n − θ to the limiting behavior of the random variables b_n are established.

Let ε, ε_1, ε_2, ... be independent, identically distributed random variables with Eε = 0 and Eε² = σ² < ∞. Let F_0 ⊂ F_1 ⊂ ... be an increasing sequence of σ-fields such that ε_i is F_i-measurable and is independent of F_{i−1} for all i ≥ 1. Consider the general regression model

y_i = M(x_i) + ε_i,  i = 1, 2, ...,  [1]

in which the regression function M(x) is a Borel function such that

M(θ) = 0,  M′(θ) = β > 0,  [2]

inf_{δ ≤ |x−θ| ≤ 1/δ} M(x)(x − θ) > 0  for all 0 < δ < 1,  [3]

|M(x)| ≤ c|x| + d  for some c, d > 0 and all x.  [4]
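Conditions [2]-[4] are easy to check for specific regression functions. As a small numerical illustration (not from the paper; the function and all values are hypothetical), the following sketch verifies them for M(x) = arctan(x − θ) with θ = 1, for which M(θ) = 0 and β = M′(θ) = 1:

```python
import math

# Hypothetical example: M(x) = arctan(x - theta) with theta = 1.
theta = 1.0
def M(x):
    return math.atan(x - theta)

# [2]: M(theta) = 0 and beta = M'(theta) = 1 > 0.
assert M(theta) == 0.0
beta = 1.0  # derivative of arctan at 0

# [3]: M(x)(x - theta) is bounded away from 0 on delta <= |x - theta| <= 1/delta.
for delta in (0.5, 0.1, 0.01):
    ts = [delta + k * (1.0 / delta - delta) / 1000 for k in range(1001)]
    lo = min(M(theta + s * t) * (s * t) for t in ts for s in (-1.0, 1.0))
    assert lo > 0

# [4]: |M(x)| <= c|x| + d holds with c = 0, d = pi/2, since |arctan| < pi/2.
assert all(abs(M(x)) <= math.pi / 2 for x in (-1e6, -3.2, 0.0, 7.5, 1e6))
```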

The Robbins-Monro stochastic approximation scheme for choosing the levels xi so that they will eventually approach the unknown root 0 of the regression function is defined by

x_1 = F_0-measurable random variable,

x_{i+1} = x_i − y_i/(ib),  i = 1, 2, ...,  [5]

where b is some positive constant. It is well known that for b < 2β,

√n (x_n − θ) → N(0, σ²/{b(2β − b)}) in distribution, and  [6]

lim sup_{n→∞} {n/(2 log log n)}^{1/2} |x_n − θ| = σ/{b(2β − b)}^{1/2}  a.s. (almost surely).  [7]

Moreover, if for some λ > 0

M(x) = β(x − θ) + O(|x − θ|^{1+λ})  as x → θ,  [8]

then for b > 2β, there exists a random variable z such that

n^{β/b}(x_n − θ) → z  a.s.,  [9]

and for b = 2β,

(n/log n)^{1/2} (x_n − θ) → N(0, σ²/b²) in distribution.  [10]

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.
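A quick Monte Carlo sanity check of the normal limit [6] (an illustration with hypothetical parameter values, not part of the paper): in the linear case M(x) = β(x − θ) with θ = 0, β = σ = b = 1 (so b < 2β), the sample variance of √n(x_n − θ) should be near σ²/{b(2β − b)} = 1.

```python
import math
import random

random.seed(0)
theta, beta, sigma, b = 0.0, 1.0, 1.0, 1.0   # hypothetical values; b < 2*beta
n, reps = 2000, 400

def final_iterate():
    """Run the Robbins-Monro scheme [5] for n steps and return the last x."""
    x = 1.0
    for i in range(1, n + 1):
        y = beta * (x - theta) + random.gauss(0.0, sigma)  # model [1]
        x -= y / (i * b)                                   # scheme [5]
    return x

vals = [math.sqrt(n) * (final_iterate() - theta) for _ in range(reps)]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
# Limit variance from [6] is sigma^2 / (b * (2*beta - b)) = 1 here.
assert abs(mean) < 0.3 and 0.6 < var < 1.4
```

With b = β in this linear example the recursion reduces, after the first step, to averaging the noise, so √n(x_n − θ) is in fact exactly standard normal.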

(cf. ref. 1). Thus, the asymptotically optimal choice of b is β. In practice, the value M′(θ) = β will rarely be known. It is therefore natural to try estimating β from the data being observed and to use instead of [5] an adaptive stochastic approximation scheme of the form

x_1 = F_0-measurable random variable,

x_{i+1} = x_i − y_i/(ib_i),  i = 1, 2, ...,  [11]

where b_i > 0 is some estimate of β based on x_1, y_1, ..., x_i, y_i. The following theorem, which follows from lemma 5 of ref. 1, deals with the a.s. convergence of the stochastic approximation scheme [11].
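As a concrete, purely illustrative sketch of the two schemes, the following simulation runs [5] with the known constant b = β and [11] with b_i taken to be a least-squares slope estimate of the kind discussed later in the paper, truncated so that [13] holds; all numerical values are hypothetical:

```python
import random

random.seed(0)
theta, beta, sigma = 2.0, 1.5, 1.0
def M(x):
    return beta * (x - theta)          # linear case of model [1]

def robbins_monro(n, b, x1=0.0):
    """Classical scheme [5] with a known positive constant b."""
    x = x1
    for i in range(1, n + 1):
        y = M(x) + random.gauss(0.0, sigma)
        x -= y / (i * b)
    return x

def adaptive(n, x1=0.0, b_lo=0.5, b_hi=2.5):
    """Adaptive scheme [11]: b_i is a truncated least-squares slope
    estimate based on (x_1, y_1), ..., (x_i, y_i); note b_hi < 2*beta."""
    x, b = x1, 1.0
    k = sx = sy = sxx = sxy = 0.0
    for i in range(1, n + 1):
        y = M(x) + random.gauss(0.0, sigma)
        x_next = x - y / (i * b)       # the recursion [11]
        k += 1.0; sx += x; sy += y; sxx += x * x; sxy += x * y
        den = k * sxx - sx * sx
        if den > 1e-12:                # update the slope estimate, then truncate
            b = min(b_hi, max(b_lo, (k * sxy - sx * sy) / den))
        x = x_next
    return x

xa = robbins_monro(20000, b=beta)
xb = adaptive(20000)
assert abs(xa - theta) < 0.2 and abs(xb - theta) < 0.2  # both near the root
```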

THEOREM 1. Suppose that in the regression model [1], M(x) satisfies conditions [2]-[4]. Define inductively y_i by [1] and x_{i+1} by [11], where {b_i} is a sequence of positive random variables such that

b_i is F_{i−1}-measurable for all i ≥ 1,  [12]

inf b_i > 0 and sup b_i < ∞  a.s.  [13]

Then lim x_n = θ a.s. In view of [7], it is natural to ask whether x_n converges to θ

at the rate

x_n − θ = O[n^{−1/2}(log log n)^{1/2}]  [14]

on the event {lim sup b_n < 2β}. An affirmative answer is given by the following.

THEOREM 2. With the same notations and assumptions as in Theorem 1, [14] holds a.s. on the event {lim sup b_n < 2β}.

Theorem 2 is an immediate corollary of the following deeper result, which says that if the assumptions of Theorem 1 are satisfied then, with probability 1, a sufficiently long string of b_n not exceeding (2 − η)β leads to a corresponding string of x_n differing from θ by no more than a constant times n^{−1/2}(log log n)^{1/2}.

THEOREM 3. With the same notations and assumptions as in Theorem 1, there exists an event Ω_0 such that P(Ω_0) = 1 and all sample points ω ∈ Ω_0 have the following property: For every given 0 < η < 2, there exist C > 0 and positive integers M, k (depending on ω and η) such that at ω, for all m ≥ M and n ≥ m^k,

max_{m ≤ j ≤ n} b_j < (2 − η)β  ⟹  |x_j − θ| ≤ C j^{−1/2}(log log j)^{1/2}  for all m^k ≤ j ≤ n.  [15]

We preface the proof of Theorem 3 with the following.

LEMMA 1. Suppose that {b_n} is a sequence of random variables such that [12] holds and


inf |b_n| > 0  a.s.,  sup |b_n| < ∞  a.s.

Let V_n = σ² Σ_{i=1}^n b_i^{−2}. Let S(·) be the random function defined by S(0) = 0 and S(t) = Σ_{i=1}^n (ε_i/b_i) for V_n ≤ t < V_{n+1}, n = 1, 2, .... Assume that σ ≠ 0. Then, redefining the random variables on a new probability space if necessary, there exists a standard Wiener process w(t), t ≥ 0, such that

lim_{t→∞} |S(t) − w(t)|/(t log log t)^{1/2} = 0  a.s.

Consequently,

lim sup_{n→∞} |Σ_{i=1}^n (ε_i/b_i)|/(2V_n log log n)^{1/2} = 1  a.s.

Lemma 1 is a generalization of theorem 2 of ref. 2 and can

be proved by using essentially the same argument, applying theorem 3.1 of ref. 3, and noting that for p > 0 and δ > 0,

E{|ε_j|^p I(|ε_j| > δj^{1/2}) | F_{j−1}} = ∫_{|x| > δj^{1/2}} |x|^p dG(x),

where G is the distribution function of ε.

Proof of Theorem 3: Without loss of generality, assume that θ = 0. By Theorem 1, with probability 1,

lim x_n = 0.  [16]

Let S_n = Σ_{i=1}^n (ε_i/b_i). By Lemma 1, with probability 1,

S_n = O[(n log log n)^{1/2}].  [17]

Let Ω_0 be the event on which [16] and [17] both hold and

inf b_i > 0,  sup b_i < ∞.  [18]

Then P(Ω_0) = 1. From [2] and [16], it follows that on Ω_0

M(x_n) = (β + ξ_n)x_n,  where lim ξ_n = 0.  [19]

Let d_j = (β + ξ_j)/b_j. Then, as shown in the proof of theorem 3 of ref. 1,

x_{n+1} = β_{m−1,n} x_m − Σ_{i=m}^n β_{i,n} ε_i/(ib_i),  [20]

where for i = 0, 1, ..., n − 1,

β_{i,n} = Π_{j=i+1}^n (1 − j^{−1}d_j),  β_{n,n} = 1.  [21]

By partial summation,

Σ_{i=m}^n β_{i,n} ε_i/(ib_i) = Σ_{i=m}^{n−1} {i^{−1}β_{i,n} − (i+1)^{−1}β_{i+1,n}} S_i + n^{−1}β_{n,n} S_n − m^{−1}β_{m,n} S_{m−1}.  [22]

Let ω ∈ Ω_0 and let 0 < η < 2. Choose 0 < ρ < 1 such that

(1 − ρ)/(2 − η) = p > 1/2.  [23]

Because d_j = [β + o(1)]/b_j at ω by [19], there exists j_0 (depending on ω) such that at ω, for all j ≥ j_0,

(1 − ρ)β/b_j ≤ d_j ≤ 2β/b_j ≤ 2β/b*,  [24]

where b* = inf b_i. By [21] and [24], for n ≥ i ≥ j_0,

|i^{−1}β_{i,n} − (i+1)^{−1}β_{i+1,n}| ≤ i^{−2}β_{i+1,n}(1 + 2β/b*)  at ω.  [25]

From [21], [23], and [24], it follows that at ω,

n ≥ m > j_0 and max_{m ≤ j ≤ n} b_j < (2 − η)β  ⟹  0 ≤ β_{i,n} ≤ Π_{j=i+1}^n (1 − p/j) = γ_n/γ_i  for n ≥ i ≥ m − 1,  [26]

where

γ_i = Π_{j=j_0}^i (1 − p/j) ~ D i^{−p}  for some D > 0.  [27]

Because p > 1/2, we can choose a positive integer k such that

p(1 − k^{−1}) > 1/2.  [28]

From [16]-[18], [20], [22], and [25]-[28], it then follows that, by choosing M(> j_0) and C sufficiently large, the desired implication [15] holds at ω for all m ≥ M and n ≥ m^k.

COROLLARY 1. With the same notations and assumptions as in Theorem 1, let x̄_n = n^{−1} Σ_{i=1}^n x_i. Then almost all sample points ω have the following property: For every given 0 < η < 2, there exist C > 0 and positive integers M, k (depending on ω and η) such that at ω, for all m ≥ M and n ≥ m^k,

max_{m ≤ j ≤ n} b_j < (2 − η)β  ⟹  |x̄_j − θ| ≤ C j^{−1/2}(log log j)^{1/2}  for all m^k ≤ j ≤ n.  [29]

Proof: With probability 1, because x_n − θ → 0 by Theorem 1,

Σ_{i=1}^m (x_i − θ) = o(m),  Σ_{i=1}^n (x_i − θ) = o(n),

and therefore the desired conclusion follows from Theorem 3 by choosing k ≥ 2.

An obvious choice for b_n in the stochastic approximation scheme [11] is the usual least squares estimate

β̂_n = Σ_{i=1}^n (x_i − x̄_n)y_i / Σ_{i=1}^n (x_i − x̄_n)²,  [30]

at least in the linear case M(x) = β(x − θ). Note that β̂_n is F_n-measurable but not F_{n−1}-measurable. Hence, to satisfy [12] we should take b_n = β̂_{n−1} instead. It seems artificial to have to drop the last observation (x_n, y_n) in estimating β at stage n, and it is therefore desirable to remove the condition [12]. In the sequel we shall replace [12] by the following

Condition C: There exist random variables b′_n and u_n such that (i) Σ_{n=1}^∞ u_n² < ∞ a.s., (ii) b′_n and u_n are both F_{n−1}-measurable for all n ≥ 1, and

(iii) P{|b_n − b′_n| = O(|u_n ε_n|) as n → ∞} = 1.

THEOREM 4. Suppose that in Theorems 1, 2, and 3 we replace the condition [12] by Condition C. Then the conclusions of Theorems 1, 2, and 3 still hold.

To prove Theorem 4, we shall need

LEMMA 2. Let {u_n} be a sequence of random variables such that Σ_{n=1}^∞ u_n² < ∞ a.s. and u_n is F_{n−1}-measurable for all n ≥ 1. Then

(a) Σ_{n=1}^∞ u_n ε_n converges a.s.,

(b) Σ_{n=1}^∞ u_n² ε_n² < ∞ a.s., and therefore

(c) Σ_{i=1}^n |u_i| ε_i² = O(n^{1/2}) a.s.
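Lemma 2 is the martingale ingredient behind Theorem 4. A minimal numerical illustration (with hypothetical weights, not from the paper): take the square-summable predictable weights u_n = 1/n and standard normal ε_n, and observe that the partial sums of u_n ε_n settle down, as in (a), while the weighted sum in (c) stays far below n^{1/2}:

```python
import math
import random

random.seed(0)
n = 200000
s_half = s_full = 0.0      # partial sums of u_i * eps_i at n/2 and at n
abs_sum = 0.0              # running sum of |u_i| * eps_i^2, cf. Lemma 2(c)
for i in range(1, n + 1):
    u = 1.0 / i            # predictable weights with sum of u_i^2 finite
    e = random.gauss(0.0, 1.0)
    s_full += u * e
    if i == n // 2:
        s_half = s_full
    abs_sum += abs(u) * e * e

# (a): the series converges a.s., so late partial sums are nearly constant.
assert abs(s_full - s_half) < 0.1
# (c): sum_{i<=n} |u_i| eps_i^2 = O(n^{1/2}); for these weights it is only O(log n).
assert abs_sum < math.sqrt(n)
```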

Proof of Theorem 4: By Lemma 2(a), u_n ε_n → 0 a.s., and therefore Condition C(iii) implies that b_n − b′_n → 0 a.s. Hence, by [13],

inf b′_n > 0 and sup b′_n < ∞  a.s.  [31]

We note that

ε_i/(ib_i) = ε_i/(ib′_i) + i^{−1}ε_i(b_i^{−1} − b′_i^{−1}).  [32]

Because b′_i is F_{i−1}-measurable for all i ≥ 1 and because [31] holds,

Σ_{i=1}^∞ ε_i/(ib′_i) converges a.s.  [33]

Because Σ_{i=1}^∞ ε_i²/i² < ∞ and Σ_{i=1}^∞ u_i²ε_i² < ∞ a.s. by Lemma 2(b), it then follows from [13], [31], Condition C(iii), and the Schwarz inequality that

Σ_{i=1}^∞ i^{−1}|ε_i| |b_i^{−1} − b′_i^{−1}| = Σ_{i=1}^∞ i^{−1} O(|u_i|ε_i²) < ∞  a.s.  [34]

By [32], [33], and [34], Σ_{i=1}^∞ ε_i/(ib_i) converges a.s., and therefore by lemma 5 of ref. 1,

x_n → θ  a.s.  [35]

Let S_n = Σ_{i=1}^n ε_i/b_i and S′_n = Σ_{i=1}^n ε_i/b′_i. By [13], [31], Condition C(iii), and Lemma 2(c),

|S_n − S′_n| = Σ_{i=1}^n O(|u_i|ε_i²) = O(n^{1/2})  a.s.  [36]

Because S′_n = O[(n log log n)^{1/2}] a.s. by Lemma 1, it then follows from [36] that

S_n = O[(n log log n)^{1/2}]  a.s.  [37]

In view of [35] and [37], the same argument as that used to prove Theorem 3 shows that the conclusion of Theorem 3 still holds in the present more general case, and therefore so does the conclusion of Theorem 2.

As a corollary of Theorem 4, we obtain

COROLLARY 2. Suppose that in Corollary 1 we replace the condition [12] by Condition C; then the conclusion of Corollary 1 still holds.

In practice, although β is unknown, it is usually possible to give prior bounds B > B′ > 0 such that B′ < β < B. In this case, a natural choice for b_n is β̂_n truncated below by B′ and above by B, where β̂_n is the least squares estimate of β defined by [30]. Obviously the b_n so chosen satisfy [13], and it can be shown that Condition C also holds. Therefore Theorem 4 and Corollary 2 are applicable, and they provide very useful tools for showing that such b_n indeed converge to β a.s. and for analyzing the asymptotic properties of the adaptive stochastic approximation scheme [11]. The details will be given elsewhere.
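The truncated least-squares choice just described can be sketched in a few lines (an illustration under hypothetical values B′ = 0.5, B = 3 around an assumed true β = 2, not the authors' implementation): b_n is the estimate [30] computed from all data so far and truncated to [B′, B], and both x_n → θ and the convergence of b_n toward β are visible in simulation.

```python
import random

random.seed(1)
theta, beta, sigma = 0.0, 2.0, 1.0
B_lo, B_hi = 0.5, 3.0                 # prior bounds B' < beta < B, with B < 2*beta

x, b = 5.0, 1.0
k = sx = sy = sxx = sxy = 0.0
for i in range(1, 50001):
    y = beta * (x - theta) + random.gauss(0.0, sigma)   # model [1]
    # least-squares slope [30] on (x_1,y_1),...,(x_i,y_i), truncated to [B', B]
    k += 1.0; sx += x; sy += y; sxx += x * x; sxy += x * y
    den = k * sxx - sx * sx
    if den > 1e-12:
        b = min(B_hi, max(B_lo, (k * sxy - sx * sy) / den))
    x -= y / (i * b)                                    # scheme [11]

assert abs(x - theta) < 0.1    # x_n -> theta (Theorems 1 and 4)
assert abs(b - beta) < 0.5     # truncated LS estimate approaches beta
```

Note that b here uses the current observation (x_i, y_i), so it is not F_{i−1}-measurable; it is exactly the situation that Condition C and Theorem 4 are designed to cover.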

This research was supported by the National Science Foundation.

1. Lai, T. L. & Robbins, H. (1979) Ann. Stat. 7, in press.
2. Lai, T. L. & Robbins, H. (1978) Proc. Natl. Acad. Sci. USA 75, 1068-1070.
3. Jain, N. C., Jogdeo, K. & Stout, W. F. (1975) Ann. Probab. 3, 119-145.
