Local Convergence Theorems for Adaptive Stochastic Approximation Schemes
Author(s): T. L. Lai and Herbert Robbins
Source: Proceedings of the National Academy of Sciences of the United States of America, Vol. 76, No. 7 (Jul., 1979), pp. 3065-3067
Published by: National Academy of Sciences
Stable URL: http://www.jstor.org/stable/69929
Accessed: 07/05/2014 20:15
Proc. Natl. Acad. Sci. USA Vol. 76, No. 7, pp. 3065-3067, July 1979 Statistics
Local convergence theorems for adaptive stochastic approximation schemes
(regression/weighted sums/martingale transform)
T. L. LAI AND HERBERT ROBBINS
Department of Mathematical Statistics, Columbia University, New York, New York 10027; and State University of New York, Stony Brook, New York 11790
Contributed by Herbert Robbins, April 27, 1979
ABSTRACT For the regression model $y = M(x) + \varepsilon$, adaptive stochastic approximation schemes of the form $x_{n+1} = x_n - y_n/(nb_n)$ for choosing the levels $x_1, x_2, \ldots$ at which $y_1, y_2, \ldots$ are observed converge with probability 1 to the unknown root $\theta$ of the regression function $M(x)$. Certain local convergence theorems that relate the convergence rate of $x_n - \theta$ to the limiting behavior of the random variables $b_n$ are established.
Let $\varepsilon, \varepsilon_1, \varepsilon_2, \ldots$ be independent, identically distributed random variables with $E\varepsilon = 0$ and $E\varepsilon^2 = \sigma^2 < \infty$. Let $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ be an increasing sequence of $\sigma$-fields such that $\varepsilon_i$ is $\mathcal{F}_i$-measurable and is independent of $\mathcal{F}_{i-1}$ for all $i \ge 1$. Consider the general regression model
$y_i = M(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots$ [1]
in which the regression function $M(x)$ is a Borel function such that
$M(\theta) = 0, \quad M'(\theta) = \beta > 0$, [2]
$\inf_{\delta \le |x - \theta| \le 1/\delta} M(x)(x - \theta) > 0$ for all $0 < \delta < 1$, [3]
$|M(x)| \le c|x| + d$ for some $c, d > 0$ and all $x$. [4]
The Robbins-Monro stochastic approximation scheme for choosing the levels $x_i$ so that they will eventually approach the unknown root $\theta$ of the regression function is defined by
$x_1 = \mathcal{F}_0$-measurable random variable,
$x_{i+1} = x_i - y_i/(ib), \quad i = 1, 2, \ldots$ [5]
where $b$ is some positive constant. It is well known that for $b < 2\beta$,
$\sqrt{n}\,(x_n - \theta) \to N(0, \sigma^2/\{b(2\beta - b)\})$ in distribution, and [6]
$\limsup_n \dfrac{\sqrt{n}\,|x_n - \theta|}{(2\log\log n)^{1/2}} = \dfrac{\sigma}{\{b(2\beta - b)\}^{1/2}}$ a.s. (almost surely). [7]
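As a numerical aside (not from the paper), the recursion [5] is easy to simulate; the linear regression function, $\theta = 1$, $\beta = 2$, and Gaussian noise used below are illustrative choices, a sketch rather than anything prescribed by the text.

```python
import random

def robbins_monro(theta=1.0, beta=2.0, b=2.0, sigma=1.0, n=20000, x1=5.0, seed=0):
    """Simulate the Robbins-Monro recursion [5], x_{i+1} = x_i - y_i/(i*b),
    for the illustrative linear regression function M(x) = beta*(x - theta)."""
    rng = random.Random(seed)
    x = x1
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # y_i = M(x_i) + eps_i, model [1]
        x = x - y / (i * b)                             # recursion [5]
    return x

print(abs(robbins_monro() - 1.0))  # small: x_n approaches the root theta = 1
```

With $b = \beta$ the error after $n$ steps is of the order $n^{-1/2}$ promised by [6]; rerunning with $b$ much larger than $2\beta$ exhibits instead the slower $n^{-\beta/b}$ behavior of [9].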
Moreover, if for some $\lambda > 0$
$M(x) = \beta(x - \theta) + O(|x - \theta|^{1+\lambda})$ as $x \to \theta$, [8]
then for $b > 2\beta$, there exists a random variable $z$ such that
$n^{\beta/b}(x_n - \theta) \to z$ a.s., [9]
and for $b = 2\beta$,
$\sqrt{n/\log n}\,(x_n - \theta) \to N(0, \sigma^2/b^2)$ in distribution [10]
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.
(cf. ref. 1). Thus, the asymptotically optimal choice of $b$ is $\beta$. In practice, the value $M'(\theta) = \beta$ will rarely be known. It is therefore natural to try estimating $\beta$ from the data being observed and to use instead of [5] an adaptive stochastic approximation scheme of the form
$x_1 = \mathcal{F}_0$-measurable random variable,
$x_{i+1} = x_i - y_i/(ib_i), \quad i = 1, 2, \ldots$ [11]
where $b_i > 0$ is some estimate of $\beta$ based on $x_1, y_1, \ldots, x_i, y_i$. The following theorem, which follows from lemma 5 of ref. 1, deals with the a.s. convergence of the stochastic approximation scheme [11].
THEOREM 1. Suppose that in the regression model [1], $M(x)$ satisfies conditions [2]-[4]. Define inductively $y_i$ by [1] and $x_{i+1}$ by [11], where $\{b_i\}$ is a sequence of positive random variables such that
$b_i$ is $\mathcal{F}_{i-1}$-measurable for all $i \ge 1$, [12]
$\inf b_i > 0$ and $\sup b_i < \infty$ a.s. [13]
Then $\lim x_n = \theta$ a.s.
In view of [7], it is natural to ask whether $x_n$ converges to $\theta$
at the rate
$x_n - \theta = O[n^{-1/2}(\log\log n)^{1/2}]$ [14]
on the event $\{\limsup b_n < 2\beta\}$. An affirmative answer is given by the following.
THEOREM 2. With the same notations and assumptions as in Theorem 1, [14] holds a.s. on the event $\{\limsup b_n < 2\beta\}$.
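Theorem 1 asks of $\{b_i\}$ only [12] and [13], not that $b_i$ approximate $\beta$. A small simulation sketch (illustrative parameters, and a deliberately crude deterministic $b_i$ cycling through 1, 2, 3) shows the scheme [11] still converging:

```python
import random

def adaptive_scheme(theta=1.0, beta=2.0, sigma=1.0, n=20000, x1=5.0, seed=1):
    """Simulate [11] with a previsible sequence b_i taking values in {1, 2, 3}:
    such a {b_i} satisfies [12] and [13], so Theorem 1 gives x_n -> theta a.s."""
    rng = random.Random(seed)
    x = x1
    b = 1.0                                             # b_1 is F_0-measurable
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # observe y_i at level x_i
        x = x - y / (i * b)                             # recursion [11]
        b = 1.0 + (i % 3)                               # b_{i+1}: fixed before y_{i+1}
    return x

print(abs(adaptive_scheme() - 1.0))  # small: convergence despite crude b_i
```

Because every $b_i$ here stays below $2\beta = 4$, Theorem 2 moreover gives the rate [14] along this trajectory.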
Theorem 2 is an immediate corollary of the following deeper result, which says that if the assumptions of Theorem 1 are satisfied then, with probability 1, a sufficiently long string of $b_n$ not exceeding $(2 - \eta)\beta$ leads to a corresponding string of $x_n$ differing from $\theta$ by no more than a constant times $n^{-1/2}(\log\log n)^{1/2}$.
THEOREM 3. With the same notations and assumptions as in Theorem 1, there exists an event $\Omega_0$ such that $P(\Omega_0) = 1$ and all sample points $\omega \in \Omega_0$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,
$\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies |x_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2}$ for all $m^k \le j \le n$. [15]
We preface the proof of Theorem 3 with the following.
LEMMA 1. Suppose that $\{b_n\}$ is a sequence of random variables such that [12] holds and
$\inf |b_n| > 0$ a.s., $\sup |b_n| < \infty$ a.s.
Let $V_n = \sigma^2 \sum_1^n b_i^{-2}$. Let $S(\cdot)$ be the random function defined by $S(0) = 0$ and $S(t) = \sum_1^n (\varepsilon_i/b_i)$ for $V_n \le t < V_{n+1}$, $n = 1, 2, \ldots$. Assume that $\sigma \ne 0$. Then, redefining the random variables on a new probability space if necessary, there exists a standard Wiener process $w(t)$, $t \ge 0$, such that
$\lim_{t \to \infty} \dfrac{|S(t) - w(t)|}{(t \log\log t)^{1/2}} = 0$ a.s.
Consequently,
$\limsup_n \dfrac{\sum_1^n (\varepsilon_i/b_i)}{(2V_n \log\log n)^{1/2}} = 1$ a.s.
Lemma 1 is a generalization of theorem 2 of ref. 2 and can
be proved by using essentially the same argument, applying theorem 3.1 of ref. 3, and noting that for $p > 0$ and $\delta > 0$,
$E\{|\varepsilon_j|^p I_{[|\varepsilon_j| > \delta j^{1/2}]}\} = \int_{|x| > \delta j^{1/2}} |x|^p \, dG(x)$,
where $G$ is the distribution function of $\varepsilon$.
Proof of Theorem 3: Without loss of generality, assume that
$\theta = 0$. By Theorem 1, with probability 1,
$\lim x_n = 0$. [16]
Let $S_n = \sum_1^n (\varepsilon_i/b_i)$. By Lemma 1, with probability 1,
$S_n = O[(n \log\log n)^{1/2}]$. [17]
Let $\Omega_0$ be the event on which [16] and [17] both hold and
$\inf b_i > 0, \quad \sup b_i < \infty$. [18]
Then $P(\Omega_0) = 1$. From [2] and [16], it follows that on $\Omega_0$
$M(x_n) = (\beta + \zeta_n)x_n$, where $\lim \zeta_n = 0$. [19]
Let $d_j = (\beta + \zeta_j)/b_j$. Then, as shown in the proof of theorem 3 of ref. 1,
$x_{n+1} = \beta_{m-1,n}\,x_m - \sum_{i=m}^{n} \beta_{in}\,\varepsilon_i/(ib_i)$, [20]
where for $i = 0, 1, \ldots, n - 1$,
$\beta_{in} = \prod_{j=i+1}^{n} (1 - j^{-1}d_j), \quad \beta_{nn} = 1$. [21]
By partial summation,
$\sum_{i=m}^{n} \dfrac{\beta_{in}\,\varepsilon_i}{ib_i} = \sum_{i=m}^{n-1}\left(\dfrac{\beta_{in}}{i} - \dfrac{\beta_{i+1,n}}{i+1}\right)S_i + \dfrac{\beta_{nn}}{n}S_n - \dfrac{\beta_{mn}}{m}S_{m-1}$. [22]
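The partial-summation step is purely algebraic and can be checked numerically. The sketch below (with arbitrary illustrative sequences standing in for $d_j$, $b_j$, $\varepsilon_j$) verifies the Abel-summation identity $\sum_{i=m}^{n} \beta_{in}\varepsilon_i/(ib_i) = \sum_{i=m}^{n-1}(\beta_{in}/i - \beta_{i+1,n}/(i+1))S_i + n^{-1}\beta_{nn}S_n - m^{-1}\beta_{mn}S_{m-1}$, with $\beta_{in}$ as in [21]:

```python
import math

# Illustrative deterministic inputs; any values work, the identity is algebraic.
m, n = 5, 40
d = {j: 1.5 + 0.3 * math.sin(j) for j in range(1, n + 1)}
b = {j: 2.0 + math.cos(j) ** 2 for j in range(1, n + 1)}
eps = {j: math.sin(3 * j) for j in range(1, n + 1)}

def beta_in(i, n):
    """beta_{in} = prod_{j=i+1}^{n} (1 - d_j/j), with beta_{nn} = 1, as in [21]."""
    prod = 1.0
    for j in range(i + 1, n + 1):
        prod *= 1.0 - d[j] / j
    return prod

S = {0: 0.0}
for j in range(1, n + 1):
    S[j] = S[j - 1] + eps[j] / b[j]          # S_j = sum_{i <= j} eps_i/b_i

lhs = sum(beta_in(i, n) * eps[i] / (i * b[i]) for i in range(m, n + 1))
rhs = (sum((beta_in(i, n) / i - beta_in(i + 1, n) / (i + 1)) * S[i]
           for i in range(m, n))
       + beta_in(n, n) / n * S[n]
       - beta_in(m, n) / m * S[m - 1])
print(abs(lhs - rhs))  # ~0 up to floating point: the two sides agree
```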
Let $\omega \in \Omega_0$ and let $0 < \eta < 2$. Choose $1 > \rho > 0$ such that
$\dfrac{1 - \rho}{2 - \eta} = p > \dfrac{1}{2}$. [23]
Because $d_j = [\beta + o(1)]/b_j$ at $\omega$ by [19], there exists $j_0$ (depending on $\omega$) such that at $\omega$, for all $j \ge j_0$,
$(1 - \rho)\beta/b_j \le d_j \le 2\beta/b_j \le 2\beta/b^*$, [24]
where $b^* = \inf b_j$. By [21] and [24], for $n > i \ge j_0$,
$\left|\dfrac{\beta_{in}}{i} - \dfrac{\beta_{i+1,n}}{i+1}\right| \le i^{-2}\beta_{i+1,n}(1 + 2\beta/b^*)$ at $\omega$. [25]
From [21], [23], and [24], it follows that at $\omega$,
$n \ge m > j_0$ and $\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies$
$\beta_{in} \le \prod_{j=i+1}^{n} (1 - pj^{-1}) = \gamma_n/\gamma_i$ for $n \ge i \ge m - 1$, [26]
where
$\gamma_i = \prod_{j=j_0}^{i} (1 - pj^{-1})$, so that $\gamma_n/\gamma_i \le D(i/n)^p$ for some $D > 0$. [27]
Because $p > 1/2$, we can choose a positive integer $k$ such that
$p(1 - k^{-1}) > \dfrac{1}{2}$. [28]
From [16]-[18], [20], [22], and [25]-[28], it then follows that, by choosing $M (> j_0)$ and $C$ sufficiently large, the desired implication [15] holds at $\omega$ for all $m \ge M$ and $n \ge m^k$.
COROLLARY 1. With the same notations and assumptions as in Theorem 1, let $\bar{x}_n = n^{-1}\sum_1^n x_i$. Then almost all sample points $\omega$ have the following property: For every given $0 < \eta < 2$, there exist $C > 0$ and positive integers $M, k$ (depending on $\omega$ and $\eta$) such that at $\omega$, for all $m \ge M$ and $n \ge m^k$,
$\max_{m \le j \le n} b_j < (2 - \eta)\beta \implies |\bar{x}_j - \theta| \le C j^{-1/2}(\log\log j)^{1/2}$ for all $m^k \le j \le n$. [29]
Proof: With probability 1, because $x_n - \theta \to 0$ by Theorem 1,
$\sum_{i=1}^{m} (x_i - \theta) = o(m)$,
and therefore the desired conclusion follows from Theorem 3 by choosing $k \ge 2$.
An obvious choice for $b_n$ in the stochastic approximation scheme [11] is the usual least squares estimate
$\hat{\beta}_n = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$, [30]
at least in the linear case $M(x) = \beta(x - \theta)$. Note that $\hat{\beta}_n$ is $\mathcal{F}_n$-measurable but not $\mathcal{F}_{n-1}$-measurable. Hence, to satisfy [12] we should take $b_n = \hat{\beta}_{n-1}$ instead. It seems artificial to have to drop the last observation $(x_n, y_n)$ in estimating $\beta$ at stage $n$, and it is therefore desirable to remove the condition [12]. In the sequel we shall replace [12] by the following
Condition C: There exist random variables $b'_n$ and $u_n$ such that (i) $\sum_1^\infty u_n^2 < \infty$ a.s., (ii) $b'_n$ and $u_n$ are both $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$, and
(iii) $P\{|b_n - b'_n| = O(|u_n\varepsilon_n|) \text{ as } n \to \infty\} = 1$.
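As a concrete aside (not code from the paper), the estimate [30] is simply the ordinary least squares slope of $y$ on $x$; a minimal sketch, with hypothetical data:

```python
def ls_slope(xs, ys):
    """Least squares estimate [30]: sum (x_i - xbar) y_i / sum (x_i - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    num = sum((x - xbar) * y for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

# With exact linear responses y = beta*(x - theta), the slope beta is recovered:
xs = [0.0, 1.0, 2.0, 4.0]
ys = [2.0 * (x - 1.0) for x in xs]   # illustrative beta = 2, theta = 1, no noise
print(ls_slope(xs, ys))  # ~2.0
```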
THEOREM 4. Suppose that in Theorems 1, 2, and 3 we replace the condition [12] by Condition C. Then the conclusions of Theorems 1, 2, and 3 still hold.
To prove Theorem 4, we shall need
LEMMA 2. Let $\{u_n\}$ be a sequence of random variables such
that $\sum_1^\infty u_n^2 < \infty$ a.s. and $u_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. Then
(a) $\sum_1^\infty u_n\varepsilon_n$ converges a.s.,
(b) $\sum_1^\infty u_n^2\varepsilon_n^2 < \infty$ a.s., and therefore
(c) $\sum_1^n |u_i|\varepsilon_i^2 = O(\sqrt{n})$ a.s.
Proof of Theorem 4: By Lemma 2(a), $u_n\varepsilon_n \to 0$ a.s., and therefore Condition C(iii) implies that $b_n - b'_n \to 0$ a.s. Hence, by [13],
$\inf b'_n > 0$ and $\sup b'_n < \infty$ a.s. [31]
We note that
$\dfrac{\varepsilon_i}{ib_i} = \dfrac{\varepsilon_i}{ib'_i} + \dfrac{\varepsilon_i}{i}\left(\dfrac{1}{b_i} - \dfrac{1}{b'_i}\right)$. [32]
Because $b'_i$ is $\mathcal{F}_{i-1}$-measurable for all $i \ge 1$ and because [31] holds,
$\sum_1^\infty \varepsilon_i/(ib'_i)$ converges a.s. [33]
Because $\sum_1^\infty \varepsilon_i^2/i^2 < \infty$ and $\sum_1^\infty u_i^2\varepsilon_i^2 < \infty$ a.s. by Lemma 2(b), it then follows from [13], [31], Condition C(iii), and the Schwarz inequality that
$\sum_1^\infty \dfrac{|\varepsilon_i|\,|b_i - b'_i|}{i\,b_i b'_i} = \sum_1^\infty O\!\left(\dfrac{|u_i|\varepsilon_i^2}{i}\right) < \infty$ a.s. [34]
By [32], [33], and [34], $\sum_1^\infty \varepsilon_i/(ib_i)$ converges a.s., and therefore by lemma 5 of ref. 1,
$x_n \to \theta$ a.s. [35]
Let $S_n = \sum_1^n \varepsilon_i/b_i$, $S'_n = \sum_1^n \varepsilon_i/b'_i$. By [13], [31], Condition C(iii), and Lemma 2(c),
$|S_n - S'_n| = \sum_1^n O(|u_i|\varepsilon_i^2) = O(\sqrt{n})$ a.s. [36]
Because $S'_n = O[(n \log\log n)^{1/2}]$ by Lemma 1, it then follows from [36] that
$S_n = O[(n \log\log n)^{1/2}]$ a.s. [37]
In view of [35] and [37], the same argument as that used to prove Theorem 3 shows that the conclusion of Theorem 3 still holds in the present more general case, and therefore so does the conclusion of Theorem 2.
As a corollary of Theorem 4, we obtain
COROLLARY 2. Suppose that in Corollary 1 we replace the condition [12] by Condition C; then the conclusion of Corollary 1 still holds.
In practice, although $\beta$ is unknown, it is usually possible to give prior bounds $B > B' > 0$ such that $B' < \beta < B$. In this case, a natural choice for $b_n$ is $\hat{\beta}_n$ truncated below by $B'$ and above by $B$, where $\hat{\beta}_n$ is the least squares estimate of $\beta$ defined by [30]. Obviously the $b_n$ so chosen satisfies [13], and it can be shown that Condition C also holds. Therefore Theorem 4 and Corollary 2 are applicable, and they provide very useful tools for showing that such $b_n$ indeed converge to $\beta$ a.s. and for analyzing the asymptotic properties of the adaptive stochastic approximation scheme [11]. The details will be given elsewhere.
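A closing sketch of this fully adaptive recipe, under illustrative assumptions ($\beta = 2$, $\theta = 1$, prior bounds $B' = 0.5$ and $B = 5$, Gaussian noise): the least squares slope [30] is maintained via running sums and truncated to $[B', B]$ before it is used as $b_n$.

```python
import random

def adaptive_ls_scheme(theta=1.0, beta=2.0, sigma=1.0, b_lo=0.5, b_hi=5.0,
                       n=20000, seed=2):
    """Scheme [11] with b_n = the least squares slope [30] of the data so far,
    truncated below by b_lo and above by b_hi; running sums keep each step O(1)."""
    rng = random.Random(seed)
    x, b = 5.0, 1.0                       # initial level; arbitrary in-range b_1
    sx = sy = sxy = sxx = 0.0
    for i in range(1, n + 1):
        y = beta * (x - theta) + rng.gauss(0.0, sigma)  # y_i = M(x_i) + eps_i
        sx, sy, sxy, sxx = sx + x, sy + y, sxy + x * y, sxx + x * x
        x = x - y / (i * b)                             # recursion [11]
        den = sxx - sx * sx / i                         # sum of (x_j - xbar_i)^2
        if den > 1e-12:
            slope = (sxy - sx * sy / i) / den           # least squares slope [30]
            b = min(max(slope, b_lo), b_hi)             # truncate to [b_lo, b_hi]
    return x, b

x_n, b_n = adaptive_ls_scheme()
print(x_n, b_n)  # x_n should settle near theta; b_n stays within the prior bounds
```

On a typical run $x_n$ settles near $\theta$ while the truncated slope remains in $[B', B]$ and tends toward $\beta$, in line with what Theorem 4 and Corollary 2 are used to establish.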
This research was supported by the National Science Foundation.
1. Lai, T. L. & Robbins, H. (1979) Ann. Stat. 7, in press.
2. Lai, T. L. & Robbins, H. (1978) Proc. Natl. Acad. Sci. USA 75, 1068-1070.
3. Jain, N. C., Jogdeo, K. & Stout, W. F. (1975) Ann. Probab. 3, 119-145.