Local Convergence Theorems for Adaptive Stochastic Approximation Schemes
Author(s): T. L. Lai and Herbert Robbins
Source: Proceedings of the National Academy of Sciences of the United States of America, Vol. 76, No. 7 (Jul., 1979), pp. 3065-3067
Published by: National Academy of Sciences
Stable URL: http://www.jstor.org/stable/69929
Accessed: 07/05/2014 20:15
Proc. Natl. Acad. Sci. USA Vol. 76, No. 7, pp. 3065-3067, July 1979 Statistics
Local convergence theorems for adaptive stochastic approximation schemes
(regression/weighted sums/martingale transform)
T. L. LAI AND HERBERT ROBBINS
Department of Mathematical Statistics, Columbia University, New York, New York 10027; and State University of New York, Stony Brook, New York 11790
Contributed by Herbert Robbins, April 27, 1979
ABSTRACT For the regression model $y = M(x) + \varepsilon$, adaptive stochastic approximation schemes of the form $x_{n+1} = x_n - y_n/(nb_n)$ for choosing the levels $x_1, x_2, \ldots$ at which $y_1, y_2, \ldots$ are observed converge with probability 1 to the unknown root $\theta$ of the regression function $M(x)$. Certain local convergence theorems that relate the convergence rate of $x_n - \theta$ to the limiting behavior of the random variables $b_n$ are established.
Let $\varepsilon, \varepsilon_1, \varepsilon_2, \ldots$ be independent, identically distributed random variables with $E\varepsilon = 0$ and $E\varepsilon^2 = \sigma^2 < \infty$. Let $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ be an increasing sequence of $\sigma$-fields such that $\varepsilon_i$ is $\mathcal{F}_i$-measurable and is independent of $\mathcal{F}_{i-1}$ for all $i \ge 1$. Consider the general regression model
$y_i = M(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots$ [1]
in which the regression function $M(x)$ is a Borel function such that
$M(\theta) = 0, \quad M'(\theta) = \beta > 0$, [2]
$\inf_{\delta \le |x - \theta| \le 1/\delta} M(x)(x - \theta) > 0$ for all $0 < \delta < 1$, [3]
$|M(x)| \le c|x| + d$ for some $c, d > 0$ and all $x$. [4]
The Robbins-Monro stochastic approximation scheme for choosing the levels $x_i$ so that they will eventually approach the unknown root $\theta$ of the regression function is defined by
$x_1 = \mathcal{F}_0$-measurable random variable,
$x_{i+1} = x_i - y_i/(ib), \quad i = 1, 2, \ldots$ [5]
where $b$ is some positive constant. It is well known that for $b < 2\beta$,
$\sqrt{n}\,(x_n - \theta) \to N(0, \sigma^2/\{b(2\beta - b)\})$ in distribution, and [6]
$\limsup_n \dfrac{\sqrt{n}\,|x_n - \theta|}{(2\log\log n)^{1/2}} = \dfrac{\sigma}{\{b(2\beta - b)\}^{1/2}}$ a.s. (almost surely). [7]
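As a numerical aside (not from the paper), the recursion [5] is easy to simulate; the linear regression function, $\theta = 1$, $\beta = 2$, and Gaussian noise used below are illustrative choices, a sketch rather than anything prescribed by the text.

```python
import random

def robbins_monro(theta=1.0, beta=2.0, b=2.0, sigma=1.0, n=20000, x1=5.0, seed=0):
    """Simulate the Robbins-Monro recursion [5], x_{i+1} = x_i - y_i/(i*b),
    for the illustrative linear regression function M(x) = beta*(x - theta)."""
    rng = random.Random(seed)
    x = x1
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # y_i = M(x_i) + eps_i, model [1]
        x = x - y / (i * b)                             # recursion [5]
    return x

print(abs(robbins_monro() - 1.0))  # small: x_n approaches the root theta = 1
```

With $b = \beta$ the error after $n$ steps is of the order $n^{-1/2}$ promised by [6]; rerunning with $b$ much larger than $2\beta$ exhibits instead the slower $n^{-\beta/b}$ behavior of [9].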
Moreover, if for some $\lambda > 0$
$M(x) = \beta(x - \theta) + O(|x - \theta|^{1+\lambda})$ as $x \to \theta$, [8]
then for $b > 2\beta$, there exists a random variable $z$ such that
$n^{\beta/b}(x_n - \theta) \to z$ a.s., [9]
and for $b = 2\beta$,
$\sqrt{n/\log n}\,(x_n - \theta) \to N(0, \sigma^2/b^2)$ in distribution [10]
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.
(cf. ref. 1). Thus, the asymptotically optimal choice of $b$ is $\beta$. In practice, the value $M'(\theta) = \beta$ will rarely be known. It is therefore natural to try estimating $\beta$ from the data being observed and to use instead of [5] an adaptive stochastic approximation scheme of the form
$x_1 = \mathcal{F}_0$-measurable random variable,
$x_{i+1} = x_i - y_i/(ib_i), \quad i = 1, 2, \ldots$ [11]
where $b_i > 0$ is some estimate of $\beta$ based on $x_1, y_1, \ldots, x_i, y_i$. The following theorem, which follows from lemma 5 of ref. 1, deals with the a.s. convergence of the stochastic approximation scheme [11].
THEOREM 1. Suppose that in the regression model [1], $M(x)$ satisfies conditions [2]-[4]. Define inductively $y_i$ by [1] and $x_{i+1}$ by [11], where $\{b_i\}$ is a sequence of positive random variables such that
$b_i$ is $\mathcal{F}_{i-1}$-measurable for all $i \ge 1$, [12]
$\inf b_i > 0$ and $\sup b_i < \infty$ a.s. [13]
Then $\lim x_n = \theta$ a.s.
In view of [7], it is natural to ask whether $x_n$ converges to $\theta$
at the rate
$x_n - \theta = O[n^{-1/2}(\log\log n)^{1/2}]$ [14]
on the event $\{\limsup b_n < 2\beta\}$. An affirmative answer is given by the following.
THEOREM 2. With the same notations and assumptions as in Theorem 1, [14] holds a.s. on the event $\{\limsup b_n < 2\beta\}$.
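Theorem 1 asks of $\{b_i\}$ only [12] and [13], not that $b_i$ approximate $\beta$. A small simulation sketch (illustrative parameters, and a deliberately crude deterministic $b_i$ cycling through 1, 2, 3) shows the scheme [11] still converging:

```python
import random

def adaptive_scheme(theta=1.0, beta=2.0, sigma=1.0, n=20000, x1=5.0, seed=1):
    """Simulate [11] with a previsible sequence b_i taking values in {1, 2, 3}:
    such a {b_i} satisfies [12] and [13], so Theorem 1 gives x_n -> theta a.s."""
    rng = random.Random(seed)
    x = x1
    b = 1.0                                             # b_1 is F_0-measurable
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # observe y_i at level x_i
        x = x - y / (i * b)                             # recursion [11]
        b = 1.0 + (i % 3)                               # b_{i+1}: fixed before y_{i+1}
    return x

print(abs(adaptive_scheme() - 1.0))  # small: convergence despite crude b_i
```

Because every $b_i$ here stays below $2\beta = 4$, Theorem 2 moreover gives the rate [14] along this trajectory.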
Theorem 2 is an immediate corollary of the following deeper result, which says that if the assumptions of Theorem 1 are satisfied then, with probability 1, a sufficiently long string of $b_n$ not exceeding $(2 - \eta)\beta$ leads to a corresponding string of $x_n$ differing from $\theta$ by no more than a constant times $n^{-1/2}(\log\log n)^{1/2}$.
THEOREM 3. With the same notations and assumptions as in Theorem 1, there exists an event $\Omega_0$ such that $P(\Omega_0) = 1$ and all sample points $\omega \in \Omega_0$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,
$\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies |x_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2}$ for all $m^k \le j \le n$. [15]
We preface the proof of Theorem 3 with the following.
LEMMA 1. Suppose that $\{b_n\}$ is a sequence of random variables such that [12] holds and
$\inf |b_n| > 0$ a.s., $\sup |b_n| < \infty$ a.s.
Let $V_n = \sigma^2 \sum_1^n b_i^{-2}$. Let $S(\cdot)$ be the random function defined by $S(0) = 0$ and $S(t) = \sum_1^n (\varepsilon_i/b_i)$ for $V_n \le t < V_{n+1}$, $n = 1, 2, \ldots$. Assume that $\sigma \ne 0$. Then, redefining the random variables on a new probability space if necessary, there exists a standard Wiener process $w(t)$, $t \ge 0$, such that
$\lim_{t \to \infty} \dfrac{|S(t) - w(t)|}{(t \log\log t)^{1/2}} = 0$ a.s.
Consequently,
$\limsup_n \dfrac{\sum_1^n (\varepsilon_i/b_i)}{(2V_n \log\log n)^{1/2}} = 1$ a.s.
Lemma 1 is a generalization of theorem 2 of ref. 2 and can
be proved by using essentially the same argument, applying theorem 3.1 of ref. 3, and noting that for $p > 0$ and $\delta > 0$,
$E\{|\varepsilon_j|^p I_{[|\varepsilon_j| > \delta j^{1/2}]}\} = \int_{|x| > \delta j^{1/2}} |x|^p \, dG(x)$,
where $G$ is the distribution function of $\varepsilon$.
Proof of Theorem 3: Without loss of generality, assume that
$\theta = 0$. By Theorem 1, with probability 1,
$\lim x_n = 0$. [16]
Let $S_n = \sum_1^n (\varepsilon_i/b_i)$. By Lemma 1, with probability 1,
$S_n = O[(n \log\log n)^{1/2}]$. [17]
Let $\Omega_0$ be the event on which [16] and [17] both hold and
$\inf b_i > 0, \quad \sup b_i < \infty$. [18]
Then $P(\Omega_0) = 1$. From [2] and [16], it follows that on $\Omega_0$
$M(x_n) = (\beta + \zeta_n)x_n$, where $\lim \zeta_n = 0$. [19]
Let $d_j = (\beta + \zeta_j)/b_j$. Then, as shown in the proof of theorem 3 of ref. 1,
$x_{n+1} = \beta_{m-1,n}\,x_m - \sum_{i=m}^{n} \beta_{in}\,\varepsilon_i/(ib_i)$, [20]
where for $i = 0, 1, \ldots, n - 1$,
$\beta_{in} = \prod_{j=i+1}^{n} (1 - j^{-1}d_j), \quad \beta_{nn} = 1$. [21]
By partial summation,
$\sum_{i=m}^{n} \dfrac{\beta_{in}\,\varepsilon_i}{ib_i} = \sum_{i=m}^{n-1}\left(\dfrac{\beta_{in}}{i} - \dfrac{\beta_{i+1,n}}{i+1}\right)S_i + \dfrac{\beta_{nn}}{n}S_n - \dfrac{\beta_{mn}}{m}S_{m-1}$. [22]
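The partial-summation step is purely algebraic and can be checked numerically. The sketch below (with arbitrary illustrative sequences standing in for $d_j$, $b_j$, $\varepsilon_j$) verifies the Abel-summation identity $\sum_{i=m}^{n} \beta_{in}\varepsilon_i/(ib_i) = \sum_{i=m}^{n-1}(\beta_{in}/i - \beta_{i+1,n}/(i+1))S_i + n^{-1}\beta_{nn}S_n - m^{-1}\beta_{mn}S_{m-1}$, with $\beta_{in}$ as in [21]:

```python
import math

# Illustrative deterministic inputs; any values work, the identity is algebraic.
m, n = 5, 40
d = {j: 1.5 + 0.3 * math.sin(j) for j in range(1, n + 1)}
b = {j: 2.0 + math.cos(j) ** 2 for j in range(1, n + 1)}
eps = {j: math.sin(3 * j) for j in range(1, n + 1)}

def beta_in(i, n):
    """beta_{in} = prod_{j=i+1}^{n} (1 - d_j/j), with beta_{nn} = 1, as in [21]."""
    prod = 1.0
    for j in range(i + 1, n + 1):
        prod *= 1.0 - d[j] / j
    return prod

S = {0: 0.0}
for j in range(1, n + 1):
    S[j] = S[j - 1] + eps[j] / b[j]          # S_j = sum_{i <= j} eps_i/b_i

lhs = sum(beta_in(i, n) * eps[i] / (i * b[i]) for i in range(m, n + 1))
rhs = (sum((beta_in(i, n) / i - beta_in(i + 1, n) / (i + 1)) * S[i]
           for i in range(m, n))
       + beta_in(n, n) / n * S[n]
       - beta_in(m, n) / m * S[m - 1])
print(abs(lhs - rhs))  # ~0 up to floating point: the two sides agree
```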
Let $\omega \in \Omega_0$ and let $0 < \eta < 2$. Choose $1 > \rho > 0$ such that
$\dfrac{1 - \rho}{2 - \eta} = p > \dfrac{1}{2}$. [23]
Because $d_j = [\beta + o(1)]/b_j$ at $\omega$ by [19], there exists $j_0$ (depending on $\omega$) such that at $\omega$, for all $j \ge j_0$,
$(1 - \rho)\beta/b_j \le d_j \le 2\beta/b_j \le 2\beta/b^*$, [24]
where $b^* = \inf b_j$. By [21] and [24], for $n > i \ge j_0$,
$\left|\dfrac{\beta_{in}}{i} - \dfrac{\beta_{i+1,n}}{i+1}\right| \le i^{-2}\beta_{i+1,n}(1 + 2\beta/b^*)$ at $\omega$. [25]
From [21], [23], and [24], it follows that at $\omega$,
$n \ge m > j_0$ and $\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies$
$\beta_{in} \le \prod_{j=i+1}^{n} (1 - pj^{-1}) = \gamma_n/\gamma_i$ for $n \ge i \ge m - 1$, [26]
where
$\gamma_i = \prod_{j=j_0}^{i} (1 - pj^{-1})$, so that $\gamma_n/\gamma_i \le D(i/n)^p$ for some $D > 0$. [27]
Because $p > 1/2$, we can choose a positive integer $k$ such that
$p(1 - k^{-1}) > \dfrac{1}{2}$. [28]
From [16]-[18], [20], [22], and [25]-[28], it then follows that, by choosing $M (> j_0)$ and $C$ sufficiently large, the desired implication [15] holds at $\omega$ for all $m \ge M$ and $n \ge m^k$.
COROLLARY 1. With the same notations and assumptions as in Theorem 1, let $\bar{x}_n = n^{-1}\sum_1^n x_i$. Then almost all sample points $\omega$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,
$\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies |\bar{x}_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2}$ for all $m^k \le j \le n$. [29]
Proof: With probability 1, because $x_n - \theta \to 0$ by Theorem 1,
$\sum_{i=1}^{m} (x_i - \theta) = o(m)$,
and therefore the desired conclusion follows from Theorem 3 by choosing $k \ge 2$.
An obvious choice for $b_n$ in the stochastic approximation scheme [11] is the usual least squares estimate
$\hat{\beta}_n = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$, [30]
at least in the linear case $M(x) = \beta(x - \theta)$. Note that $\hat{\beta}_n$ is $\mathcal{F}_n$-measurable but not $\mathcal{F}_{n-1}$-measurable. Hence, to satisfy [12] we should take $b_n = \hat{\beta}_{n-1}$ instead. It seems artificial to have to drop the last observation $(x_n, y_n)$ in estimating $\beta$ at stage $n$, and it is therefore desirable to remove the condition [12]. In the sequel we shall replace [12] by the following
Condition C: There exist random variables $b'_n$ and $u_n$ such that (i) $\sum_1^\infty u_n^2 < \infty$ a.s., (ii) $b'_n$ and $u_n$ are both $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$, and
(iii) $P\{|b_n - b'_n| = O(|u_n\varepsilon_n|) \text{ as } n \to \infty\} = 1$.
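As a concrete aside (not code from the paper), the estimate [30] is simply the ordinary least squares slope of $y$ on $x$; a minimal sketch, with hypothetical data:

```python
def ls_slope(xs, ys):
    """Least squares estimate [30]: sum (x_i - xbar) y_i / sum (x_i - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    num = sum((x - xbar) * y for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

# With exact linear responses y = beta*(x - theta), the slope beta is recovered:
xs = [0.0, 1.0, 2.0, 4.0]
ys = [2.0 * (x - 1.0) for x in xs]   # illustrative beta = 2, theta = 1, no noise
print(ls_slope(xs, ys))  # ~2.0
```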
THEOREM 4. Suppose that in Theorems 1, 2, and 3 we replace the condition [12] by Condition C. Then the conclusions of Theorems 1, 2, and 3 still hold.
To prove Theorem 4, we shall need
LEMMA 2. Let $\{u_n\}$ be a sequence of random variables such
that $\sum_1^\infty u_n^2 < \infty$ a.s. and $u_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. Then
(a) $\sum_1^\infty u_n\varepsilon_n$ converges a.s.,
(b) $\sum_1^\infty u_n^2\varepsilon_n^2 < \infty$ a.s., and therefore
(c) $\sum_1^n |u_i|\varepsilon_i^2 = O(\sqrt{n})$ a.s.
Proof of Theorem 4: By Lemma 2(a), $u_n\varepsilon_n \to 0$ a.s., and therefore Condition C(iii) implies that $b_n - b'_n \to 0$ a.s. Hence, by [13],
$\inf b'_n > 0$ and $\sup b'_n < \infty$ a.s. [31]
We note that
$\dfrac{\varepsilon_i}{ib_i} = \dfrac{\varepsilon_i}{ib'_i} + \dfrac{\varepsilon_i}{i}\left(\dfrac{1}{b_i} - \dfrac{1}{b'_i}\right)$. [32]
Because $b'_i$ is $\mathcal{F}_{i-1}$-measurable for all $i \ge 1$ and because [31] holds,
$\sum_1^\infty \varepsilon_i/(ib'_i)$ converges a.s. [33]
Because $\sum_1^\infty \varepsilon_i^2/i^2 < \infty$ and $\sum_1^\infty u_i^2\varepsilon_i^2 < \infty$ a.s. by Lemma 2(b), it then follows from [13], [31], Condition C(iii), and the Schwarz inequality that
$\sum_1^\infty \dfrac{|\varepsilon_i|\,|b_i - b'_i|}{i\,b_i b'_i} = \sum_1^\infty O\!\left(\dfrac{|u_i|\varepsilon_i^2}{i}\right) < \infty$ a.s. [34]
By [32], [33], and [34], $\sum_1^\infty \varepsilon_i/(ib_i)$ converges a.s., and therefore by lemma 5 of ref. 1,
$x_n \to \theta$ a.s. [35]
Let $S_n = \sum_1^n \varepsilon_i/b_i$, $S'_n = \sum_1^n \varepsilon_i/b'_i$. By [13], [31], Condition C(iii), and Lemma 2(c),
$|S_n - S'_n| = \sum_1^n O(|u_i|\varepsilon_i^2) = O(\sqrt{n})$ a.s. [36]
Because $S'_n = O[(n \log\log n)^{1/2}]$ by Lemma 1, it then follows from [36] that
$S_n = O[(n \log\log n)^{1/2}]$ a.s. [37]
In view of [35] and [37], the same argument as that used to prove Theorem 3 shows that the conclusion of Theorem 3 still holds in the present more general case, and therefore so does the conclusion of Theorem 2.
As a corollary of Theorem 4, we obtain
COROLLARY 2. Suppose that in Corollary 1 we replace the condition [12] by Condition C; then the conclusion of Corollary 1 still holds.
In practice, although $\beta$ is unknown, it is usually possible to give prior bounds $B > B' > 0$ such that $B' < \beta < B$. In this case, a natural choice for $b_n$ is $\hat{\beta}_n$ truncated below by $B'$ and above by $B$, where $\hat{\beta}_n$ is the least squares estimate of $\beta$ defined by [30]. Obviously the $b_n$ so chosen satisfies [13], and it can be shown that Condition C also holds. Therefore Theorem 4 and Corollary 2 are applicable, and they provide very useful tools for showing that such $b_n$ indeed converge to $\beta$ a.s. and for analyzing the asymptotic properties of the adaptive stochastic approximation scheme [11]. The details will be given elsewhere.
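A closing sketch of this fully adaptive recipe, under illustrative assumptions ($\beta = 2$, $\theta = 1$, prior bounds $B' = 0.5$ and $B = 5$, Gaussian noise): the least squares slope [30] is maintained via running sums and truncated to $[B', B]$ before it is used as $b_n$.

```python
import random

def adaptive_ls_scheme(theta=1.0, beta=2.0, sigma=1.0, b_lo=0.5, b_hi=5.0,
                       n=20000, seed=2):
    """Scheme [11] with b_n = the least squares slope [30] of the data so far,
    truncated below by b_lo and above by b_hi; running sums keep each step O(1)."""
    rng = random.Random(seed)
    x, b = 5.0, 1.0                       # initial level; arbitrary in-range b_1
    sx = sy = sxy = sxx = 0.0
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # y_i = M(x_i) + eps_i
        sx, sy, sxy, sxx = sx + x, sy + y, sxy + x * y, sxx + x * x
        x = x - y / (i * b)                             # recursion [11]
        den = sxx - sx * sx / i                         # sum of (x_j - xbar_i)^2
        if den > 1e-12:
            slope = (sxy - sx * sy / i) / den           # least squares slope [30]
            b = min(max(slope, b_lo), b_hi)             # truncate to [b_lo, b_hi]
    return x, b

x_n, b_n = adaptive_ls_scheme()
print(x_n, b_n)  # x_n should settle near theta; b_n stays within the prior bounds
```

On a typical run $x_n$ settles near $\theta$ while the truncated slope remains in $[B', B]$ and tends toward $\beta$, in line with what Theorem 4 and Corollary 2 are used to establish.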
This research was supported by the National Science Foundation.
1. Lai, T. L. & Robbins, H. (1979) Ann. Stat. 7, in press.
2. Lai, T. L. & Robbins, H. (1978) Proc. Natl. Acad. Sci. USA 75, 1068-1070.
3. Jain, N. C., Jogdeo, K. & Stout, W. F. (1975) Ann. Probab. 3, 119-145.