
Multidimensional Stochastic Approximation Methods



BY JULIUS R. BLUM

University of California, Berkeley and Indiana University

Received 10/12/53; revised 5/29/54. This paper was prepared with the partial support of the Office of Ordnance Research, U.S. Army.

1. Summary. Multidimensional stochastic approximation schemes are presented, and conditions are given for these schemes to converge a.s. (almost surely) to the solutions of $k$ stochastic equations in $k$ unknowns and to the point where a regression function in $k$ variables achieves its maximum.

2. Introduction. Let $H(y \mid x)$ be a family of distribution functions depending upon a real parameter $x$, and let $M(x) = \int y \, dH(y \mid x)$ be the regression function corresponding to the family $H(y \mid x)$. Robbins and Monro [1] define a stochastic approximation method to solve the equation $M(x) = \alpha$, where $\alpha$ is a specified constant. Their method is such that the approximating random variables converge in probability to $\theta$, where $\theta$ is a root of the equation $M(x) = \alpha$. These results are generalized by Wolfowitz [2]. Kiefer and Wolfowitz [3] define a stochastic approximation scheme which converges in probability to $\theta$, where $\theta$ is the point at which $M(x)$ achieves a maximum. Finally, it is shown in [4] that in fact, in both of the situations mentioned above, the approximating sequence of random variables converges a.s. to $\theta$.

The object of this paper is to extend these results to several dimensions. More precisely we consider the following two problems.

(A) Let $\{Y^{(1)}_{x_1,\ldots,x_k}\}, \ldots, \{Y^{(k)}_{x_1,\ldots,x_k}\}$ be $k$ families of random variables with corresponding families of distribution functions $\{F^{(1)}_{x_1,\ldots,x_k}\}, \ldots, \{F^{(k)}_{x_1,\ldots,x_k}\}$, each depending on $k$ real variables $(x_1, \ldots, x_k)$. Let

$$M^{(i)}(x_1, \ldots, x_k) = \int_{-\infty}^{\infty} y \, dF^{(i)}_{x_1,\ldots,x_k}, \qquad i = 1, \ldots, k,$$

be the corresponding regression functions. Then, if $\alpha_1, \ldots, \alpha_k$ are $k$ specified numbers, it is desired to find a stochastic approximation method such that the sequence of approximating random vectors converges a.s. to a solution of the system of equations

$$M^{(i)}(x_1, \ldots, x_k) = \alpha_i, \qquad i = 1, \ldots, k.$$

Here it is assumed that the distributions $F^{(i)}$ and the regression functions $M^{(i)}$ are unknown; however, it is possible to make an observation on the random variable $Y^{(i)}_{x_1,\ldots,x_k}$ for $i = 1, \ldots, k$ and any choice of real numbers $(x_1, \ldots, x_k)$.

(B) Let $\{Y_{x_1,\ldots,x_k}\}$ be a family of random variables, $\{F_{x_1,\ldots,x_k}\}$ the corresponding distribution functions, and $M(x_1, \ldots, x_k)$ the corresponding regression function. Subject to the assumptions of (A), it is desired to estimate that set of numbers $(\theta_1, \ldots, \theta_k)$ for which the function $M$ achieves its maximum.

The approximating sequences defined in this paper are straightforward generalizations of the sequences defined in [1] and [3]. The methods of proof used here were strongly motivated by the methods used in [2] and [3].

3. A theorem on almost sure convergence. The following theorem is an immediate consequence of the martingale convergence theorem of Doob [5].

THEOREM. Let $\{X_n\}$ be a sequence of random variables satisfying

(i) $\sup_n E\{|X_n|\} < \infty$;

(ii) $\sum_{n=1}^{\infty} E\{[E\{X_{n+1} - X_n \mid X_1, \ldots, X_n\}]^+\} < \infty$.

Then $X_n$ converges a.s. to a random variable. As usual, we define $X^+$ by $X^+ = \tfrac{1}{2}[X + |X|]$. We immediately obtain the following

COROLLARY. Let $\{X_n\}$ be a sequence of integrable random variables which satisfy condition (ii) of the theorem and are bounded below uniformly in $n$. Then $X_n$ converges a.s. to a random variable.

PROOF. Let $Y_n = X_n - \alpha$, where $\alpha$ is chosen so that $Y_n \ge 0$ for all $n$. Then

$$E\{|Y_n|\} = E\{Y_n\} = E\{Y_1\} + \sum_{j=1}^{n-1} E\{Y_{j+1} - Y_j\} \le E\{Y_1\} + \sum_{j=1}^{n-1} E\{[E\{X_{j+1} - X_j \mid X_1, \ldots, X_j\}]^+\}.$$

Hence the theorem applies to the sequence $Y_n$ and consequently to the sequence $X_n$.
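To make the corollary concrete, here is a small numerical illustration (ours, not the paper's; the sequence and all names are hypothetical). For $X_n = \sum_{j \le n} 2^{-j}|\xi_j|$ with independent standard normal $\xi_j$, the conditional increments are positive but summable, and $X_n \ge 0$, so the corollary guarantees a.s. convergence; the simulation simply shows sample paths settling down.

```python
import numpy as np

rng = np.random.default_rng(0)

def corollary_demo(n_steps=60, n_paths=5):
    # X_n = sum_{j<=n} 2^{-j} |xi_j|: integrable, bounded below by 0, and
    # E{X_{n+1} - X_n | X_1, ..., X_n} = 2^{-(n+1)} E|xi| > 0 is summable,
    # so condition (ii) of the theorem holds and X_n converges a.s.
    for p in range(n_paths):
        steps = 2.0 ** -np.arange(1, n_steps + 1) * np.abs(rng.standard_normal(n_steps))
        x = np.cumsum(steps)
        print(f"path {p}: X_10 = {x[9]:.4f}, X_{n_steps} = {x[-1]:.4f}")

corollary_demo()
```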

4. Convergent sequences of random vectors. Let $E_k$ be a real $k$-dimensional vector space spanned by the orthogonal unit vectors $u_1, \ldots, u_k$. If $x$ and $y$ are two vectors in $E_k$, we denote their inner product by $(x, y)$ and their norms by $\|x\|$ and $\|y\|$, respectively. Suppose that to each $x \in E_k$ corresponds a random vector $Y_x \in E_k$. Denote by $M(x)$ the vector representing the conditional expectation of $Y_x$ when $x$ is fixed.

Let now $f(x)$ be a real-valued function defined on $E_k$ and possessing continuous partial derivatives of the first and second order. The vector of first partial derivatives will be denoted by $D(x)$ and the matrix of second partial derivatives by $A(x)$. That is,

$$D(x) = \left[\frac{\partial f}{\partial x_i}\right]_x, \qquad A(x) = \left[\frac{\partial^2 f}{\partial x_i\,\partial x_j}\right]_x.$$

Then, for any real number $a$, we have by Taylor's theorem

$$f(x + aY_x) = f(x) + a(D(x), Y_x) + \tfrac{1}{2}a^2\,(Y_x, A(x + \theta a Y_x)Y_x),$$

where $\theta$ is a real number with $0 \le \theta \le 1$. Consequently we may take expectations on both sides to obtain

$$(4.1) \qquad E\{f(x + aY_x)\} = f(x) + a(D(x), M(x)) + \tfrac{1}{2}a^2 E\{(Y_x, A(x + \theta a Y_x)Y_x)\}.$$

Let now $\{a_n\}$ be a sequence of positive numbers and consider the following sequence of recursively defined random vectors

$$(4.2) \qquad X_{n+1} = X_n + a_n Y_n,$$

where $X_1$ is chosen arbitrarily and where $Y_n$ has the distribution of $Y_x$ when $X_n$ yields the observation $x$. The object of this section is to set down conditions under which $X_n$ converges a.s. to zero.
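In code, one pass of (4.2) is a single update inside a loop. The following is a minimal sketch (ours, not the paper's): `observe` stands for the unknown mechanism that returns one draw of the random vector $Y_x$ at the point $x$, and the gains $a_n = 1/n$ are one choice satisfying condition A$_1$ introduced below.

```python
import numpy as np

def stochastic_approximation(observe, x1, n_steps):
    """Iterate X_{n+1} = X_n + a_n Y_n (scheme (4.2)) with gains a_n = 1/n.

    observe(x) must return one draw of Y_x; under the conditions A set
    down below, the iterates converge a.s. to the root x = 0.
    """
    x = np.asarray(x1, dtype=float)
    for n in range(1, n_steps + 1):
        x = x + (1.0 / n) * observe(x)
    return x
```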

To simplify writing we shall employ the following notation throughout:

$$Z_x = f(x), \qquad U(x) = (D(x), M(x)), \qquad V_a(x) = E\{(Y_x, A(x + \theta a Y_x)Y_x)\}.$$

When we substitute the random variables $X_n$ for $x$ and the numbers $a_n$ for $a$, the corresponding random variables will be denoted by $Z_n$, $U_n$, and $V_n$. We shall assume throughout that $M(0) = 0$.

Consider now the following set A of conditions:

A$_1$: $\sum_{n=1}^{\infty} a_n = \infty$, $\sum_{n=1}^{\infty} a_n^2 < \infty$;

A$_2$: $Z_x \ge 0$ for all $x$;

A$_3$: $\sup_{\varepsilon \le \|x\| \le 1/\varepsilon} U(x) < 0$ for every $\varepsilon > 0$;

A$_4$: $\inf_{\varepsilon \le \|x\|} |Z_x - Z_0| > 0$ for every $\varepsilon > 0$;

A$_5$: $V_a(x) \le V < \infty$ for every $x$ and every number $a$.

Then we have

THEOREM 1. If the sequence $\{a_n\}$ satisfies A$_1$ and if there exists a real-valued function $f(x)$ with continuous first and second partial derivatives satisfying A$_2, \ldots,$ A$_5$, then the sequence $\{X_n\}$ defined by (4.2) converges a.s. to zero.

PROOF. From (4.1) we obtain

$$(4.3) \qquad E\{Z_{n+1} \mid Z_1, \ldots, Z_n\} = Z_n + a_n E\{U_n \mid Z_1, \ldots, Z_n\} + \tfrac{1}{2}a_n^2 E\{V_n \mid Z_1, \ldots, Z_n\} \quad \text{a.s.}$$

Since $M(0) = 0$, we have, by virtue of conditions A,

$$E\{U_n \mid Z_1, \ldots, Z_n\} \le 0 \ \text{a.s.}, \qquad E\{V_n \mid Z_1, \ldots, Z_n\} \le V \ \text{a.s.},$$

both for all $n$. Hence

$$(4.4) \qquad E\{Z_{n+1} - Z_n \mid Z_1, \ldots, Z_n\} \le \tfrac{1}{2}a_n^2 V \quad \text{a.s.}$$


We may assume $V$ to be nonnegative. Using this fact together with conditions A$_1$ and A$_2$, we may apply the corollary of Section 3 to obtain

$$(4.5) \qquad P\{Z_n \text{ converges}\} = 1.$$

Taking expectations on both sides of (4.3) and iterating, we have

$$E\{Z_{n+1}\} = Z_1 + \sum_{j=1}^{n} a_j E\{U_j\} + \tfrac{1}{2}\sum_{j=1}^{n} a_j^2 E\{V_j\}.$$

From what has been said above it follows that

$$E\{Z_n\} \ge 0, \qquad E\{U_n\} \le 0, \qquad E\{V_n\} \le V, \qquad n = 1, 2, \ldots.$$

Since $V$ is nonnegative and the series $\sum a_n^2$ converges, the nonpositive-term series $\sum a_n E\{U_n\}$ also converges. By virtue of the fact that $\sum a_n = \infty$ we have

$$\limsup_{n\to\infty} E\{U_n\} = 0, \qquad \liminf_{n\to\infty} E\{|U_n|\} = 0.$$

Let $\{n_k\}$ be an infinite sequence of integers such that $\lim_{k\to\infty} E\{|U_{n_k}|\} = 0$. Then $U_{n_k}$ converges to zero in probability and there exists a further subsequence, say $\{U_{m_k}\}$, such that

$$P\{\lim_{k\to\infty} U_{m_k} = 0\} = 1.$$

From condition A$_3$ it follows that $P\{\lim_{k\to\infty} X_{m_k} = 0\} = 1$. Since $Z_n$ is a continuous function of $X_n$, it follows from (4.5) that

$$(4.6) \qquad P\{\lim_{n\to\infty} Z_n = Z_0\} = 1.$$

Now consider a sample sequence $\{X_n\}$ such that for the corresponding sequence $\{Z_n\}$ we have $\lim_{n\to\infty} Z_n = Z_0$. From condition A$_4$ it is clear that for such a sequence we must have $\lim_{n\to\infty} X_n = 0$. Hence (4.6) gives the desired result.

We may obtain the same result by assuming a slightly different set of conditions A$'$, changing A$_3$ and A$_5$ to:

A$_3'$: there exists $\lambda > 0$, with $\lambda \ge \tfrac{1}{2}a_n$ for each $n$, such that $\sup_{\delta \le \|x\|} \left[U(x) + \lambda V_a^+(x)\right] < 0$ for every $\delta > 0$ and every number $a$;

A$_5'$: there exists $\varepsilon > 0$ such that $\sup_{0 \le \|x\| \le \varepsilon} V_a(x) \le V < \infty$ for every number $a$.

Then we have

THEOREM 2. If the sequence $\{a_n\}$ satisfies condition A$_1$ and if there exists a real-valued function $f(x)$ with continuous first and second partial derivatives satisfying A$_2$, A$_3'$, A$_4$, and A$_5'$, then the sequence $\{X_n\}$ defined by (4.2) converges a.s. to zero.

The proof of this theorem follows very closely that of Theorem 1, and so is omitted.


5. Examples. In this section we illustrate the results of the previous section by a few simple examples. Assume that the problem is as described in (A) of Section 2. Then to each $x \in E_k$ corresponds a random vector $Y_x \in E_k$ with coordinates $Y_x^{(i)}$ for $i = 1, \ldots, k$. Let $M(x)$ be the vector of conditional expectations, when $x$ is given. Without loss of generality we assume that $\alpha_i = \theta_i = 0$ for $i = 1, \ldots, k$.

EXAMPLE I. Let $B$ be a negative definite $k \times k$ matrix and assume

(i) for some $\rho > 0$, $\|x\| \le \rho$ implies $M(x) = Bx$;

(ii) $\|x\| > \rho$ implies $M(x) = M([\rho/\|x\|]x)$;

(iii) $\sigma_x^{(i)2} \le \sigma^2 < \infty$ for each $x \in E_k$ and each $i = 1, \ldots, k$, where $\sigma_x^{(i)2}$ is the variance of the $i$th component of $Y_x$.

Under these conditions it is clear that both $\|M(x)\|$ and $E\{\|Y_x\|^2\}$ are bounded uniformly in $x$. Now define $f(x)$ by $f(x) = \|x\|^2$. If we choose the sequence $\{a_n\}$ to satisfy condition A$_1$, we can easily verify that the remainder of condition A is satisfied. To do this we note that A$_2$ and A$_4$ are obviously satisfied from the choice of $f(x)$. Further we have

$$U(x) = \begin{cases} 2(x, Bx), & \|x\| \le \rho; \\ 2[\rho/\|x\|](x, Bx), & \|x\| > \rho; \end{cases}$$

$$V_a(x) = 2E\{\|Y_x\|^2\} \quad \text{for every number } a.$$

From the boundedness of $E\{\|Y_x\|^2\}$ it is clear that A$_5$ is also satisfied. It remains to check A$_3$. To do this we recall that for every negative definite matrix $B$ there exists a positive number $b$ such that $(x, Bx) \le -b\|x\|^2$. Thus if $\varepsilon$ is any positive number with $0 < \varepsilon < \rho$, we have

$$(x, Bx) \le -b\varepsilon^2 \quad \text{if } \varepsilon \le \|x\| \le \rho, \qquad [\rho/\|x\|](x, Bx) \le -b\rho^2 \quad \text{if } \|x\| > \rho.$$

Hence A$_3$ is also satisfied and Theorem 1 applies.

EXAMPLE II. Consider a negative definite matrix $B$ and assume

(i) $M(x) = Bx$;

(ii) there exist $\varepsilon > 0$ and $C > 0$ such that $\|x\| \le \varepsilon$ implies $E\{\|Y_x\|^2\} \le C$;

(iii) there exists $\rho > 0$ such that $\|x\| > \varepsilon$ implies

$$(x, Bx) + \rho E\{\|Y_x\|^2\} \le 0.$$

With $f(x)$ again defined by $f(x) = \|x\|^2$, we have

$$U(x) = 2(x, Bx), \qquad V_a(x) = 2E\{\|Y_x\|^2\} \quad \text{for all } a.$$

Hence it is clear that if we choose the sequence $\{a_n\}$ to satisfy condition A$_1$, we need only verify A$_3'$, since the other conditions follow immediately. To do this, assume first that $\|x\| \le \varepsilon$, with $\varepsilon$ as determined by assumption (ii) of this example, and let $\lambda$ be any positive number. Let $b > 0$ be such that $(x, Bx) \le -b\|x\|^2$.


Then we have

$$U(x) + \lambda V_a^+(x) = 2\left[(x, Bx) + \lambda E\{\|Y_x\|^2\}\right] \le 2[(x, Bx) + \lambda C] \le 2[-b\|x\|^2 + \lambda C].$$

Hence it is clear that if $0 < \delta \le \|x\| \le \varepsilon$, we can choose $\lambda_1$ such that

$$U(x) + \lambda_1 V_a^+(x) \le 2[-b\delta^2 + \lambda_1 C] < 0,$$

and if $\|x\| > \varepsilon$, choose $0 < \lambda_2 < \rho$, where $\rho$ is determined by assumption (iii) of the example. Then

$$\tfrac{1}{2}\left[U(x) + \lambda_2 V_a^+(x)\right] = \frac{\rho - \lambda_2}{\rho}\,(x, Bx) + \frac{\lambda_2}{\rho}\left[(x, Bx) + \rho E\{\|Y_x\|^2\}\right] \le \frac{\rho - \lambda_2}{\rho}\,(x, Bx) \le -\frac{\rho - \lambda_2}{\rho}\,b\varepsilon^2 < 0.$$

Hence by choosing $\lambda = \min(\lambda_1, \lambda_2)$ we satisfy condition A$_3'$ and Theorem 2 applies.
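As a numerical sanity check (our illustration, not part of the paper), the sketch below runs scheme (4.2) in the linear setting of Example II, $M(x) = Bx$ with a negative definite $B$ and additive Gaussian noise; the iterates should end up near the root $\theta = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.array([[-2.0, 0.5],
              [0.5, -1.0]])  # symmetric, both eigenvalues negative

def observe(x):
    # One draw of Y_x with regression function M(x) = Bx.
    return B @ x + rng.standard_normal(2)

x = np.array([4.0, -3.0])  # arbitrary starting point X_1
for n in range(1, 20001):
    x = x + (1.0 / n) * observe(x)  # scheme (4.2) with a_n = 1/n
print(x)  # close to the root (0, 0)
```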

6. The maximum of a regression function in several variables. In this section we turn to problem (B) of Section 2. Assume once more that $x$ is a variable point in $E_k$ and that to each $x$ corresponds a random variable $Y_x$ with corresponding regression function $M(x)$. Assume, without loss of generality, that $M(x)$ has a unique maximum at $x = 0$. The problem becomes one of constructing a sequence $\{X_n\}$ of random vectors with the property

$$P\{\lim_{n\to\infty} X_n = 0\} = 1.$$

Let $\{a_n\}$ and $\{c_n\}$ be two infinite sequences of positive numbers satisfying conditions B:

B$_1$: $\lim_{n\to\infty} c_n = 0$;  B$_2$: $\sum_{n=1}^{\infty} a_n = \infty$;  B$_3$: $\sum_{n=1}^{\infty} a_nc_n < \infty$;  B$_4$: $\sum_{n=1}^{\infty} a_n^2/c_n^2 < \infty$.
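For orientation (our remark, not the paper's): the standard choice $a_n = n^{-1}$, $c_n = n^{-1/3}$ satisfies all of B$_1$–B$_4$, since $c_n \to 0$, $\sum n^{-1} = \infty$, and $\sum a_nc_n = \sum a_n^2/c_n^2 = \sum n^{-4/3} < \infty$.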

Suppose now $x \in E_k$ and let $c$ be a positive number. Let $u_1, \ldots, u_k$ be the orthonormal set spanning $E_k$. We construct a random vector $Y_{x,c}$ by taking $k + 1$ independent observations on the random variables $Y_x, Y_{x+cu_1}, \ldots, Y_{x+cu_k}$ and defining

$$Y_{x,c} = \left[(Y_{x+cu_1} - Y_x), \ldots, (Y_{x+cu_k} - Y_x)\right].$$

We proceed to construct a recursive sequence of random vectors by choosing $X_1$ arbitrarily and defining

$$(6.1) \qquad X_{n+1} = X_n + \frac{a_n}{c_n}\,Y_n,$$

where $Y_n$ has the distribution of $Y_{x,c_n}$ when $X_n$ yields the observation $x$. The intuitive reason for (6.1) is fairly clear, since $Y_n/c_n$ is the vector in the direction of the maximum slope of the plane determined by the $k + 1$ points

$$(X_n, Y_{X_n}),\ (X_n + c_nu_1, Y_{X_n+c_nu_1}),\ \ldots,\ (X_n + c_nu_k, Y_{X_n+c_nu_k}).$$
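A minimal sketch of scheme (6.1) (ours, not the paper's): `observe` is the unknown mechanism returning one draw of the scalar $Y_x$, the gains are the $a_n = n^{-1}$, $c_n = n^{-1/3}$ noted above, and each iteration takes the $k + 1$ independent observations that make up $Y_{x,c_n}$.

```python
import numpy as np

def kiefer_wolfowitz(observe, x1, n_steps, k):
    """Iterate X_{n+1} = X_n + (a_n / c_n) Y_n (scheme (6.1)).

    Y_n has ith coordinate Y_{x + c_n u_i} - Y_x, built from k + 1
    independent observations; under conditions B and the hypotheses of
    Theorem 3 below, X_n converges a.s. to the maximizer of M.
    """
    x = np.asarray(x1, dtype=float)
    ident = np.eye(k)  # rows are the unit vectors u_1, ..., u_k
    for n in range(1, n_steps + 1):
        a_n, c_n = 1.0 / n, n ** (-1.0 / 3.0)
        y_base = observe(x)  # the single observation at the base point
        y = np.array([observe(x + c_n * ident[i]) - y_base for i in range(k)])
        x = x + (a_n / c_n) * y
    return x

# Example use: maximize M(x) = -||x||^2 (unique maximum at 0) from noisy values.
rng = np.random.default_rng(2)
print(kiefer_wolfowitz(lambda x: -x @ x + rng.standard_normal(),
                       [2.0, -1.0], 20000, k=2))  # prints a point near (0, 0)
```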

We denote the vector of first partial derivatives and the matrix of second partial derivatives of $M(x)$ by $D(x)$ and $A(x)$, respectively. We write $D_n$ for $D(X_n)$ and $A_n$ for $A(X_n)$, and denote by $\bar{A}_n$ the vector whose coordinates are the diagonal entries of $A_n$, by $\Delta_n$ the vector $E\{Y_n \mid X_n\}$, and by $\sigma_x^2$ the variance of $Y_x$. Without loss of generality we assume that $M(0) = 0$, so that $M(x) \le 0$ for all $x$. Then we have

THEOREM 3. Suppose the sequences $\{a_n\}$ and $\{c_n\}$ satisfy conditions B, and further that

(i) $M(x)$ is continuous with continuous first and second partial derivatives;

(ii) $\sigma_x^2 \le \sigma^2 < \infty$ for all $x$;

(iii) for every positive number $\varepsilon$ there exists a positive number $\rho(\varepsilon)$ such that $\|x\| \ge \varepsilon$ implies $M(x) < -\rho(\varepsilon)$ and $\|D(x)\| \ge \rho(\varepsilon)$;

(iv) the second partial derivatives $\partial^2 M(x)/\partial x_i\,\partial x_j$ are bounded for $i, j = 1, \ldots, k$.

Then the sequence $\{X_n\}$ defined by (6.1) converges a.s. to zero.

PROOF. Expanding $-M(X_{n+1})$ we obtain, with $0 \le \theta \le 1$,

$$-M(X_{n+1}) = -M(X_n) - \frac{a_n}{c_n}\,(D_n, Y_n) - \frac{a_n^2}{2c_n^2}\left(Y_n,\ A\!\left(X_n + \theta\,\frac{a_n}{c_n}\,Y_n\right)Y_n\right).$$

Taking conditional expectations for given $X_n$ we have

$$E\{-M(X_{n+1}) \mid X_n\} = -M(X_n) - \frac{a_n}{c_n}\,(D_n, \Delta_n) - \frac{a_n^2}{2c_n^2}\,E\left\{\left(Y_n,\ A\!\left(X_n + \theta\,\frac{a_n}{c_n}\,Y_n\right)Y_n\right) \Bigm| X_n\right\} \quad \text{a.s.}$$

Since $A(x)$ is a bounded matrix and $\sigma_x^2$ is bounded, we have

$$\left|E\left\{\left(Y_n,\ A\!\left(X_n + \theta\,\frac{a_n}{c_n}\,Y_n\right)Y_n\right) \Bigm| X_n\right\}\right| \le K_1\|\Delta_n\|^2 + K_2,$$

where $K_1$ and $K_2$ are suitably chosen positive constants. By virtue of the hypotheses we obtain

$$\Delta_n^{(i)} = c_n(D_n, u_i) + \tfrac{1}{2}c_n^2\left(u_i,\ A(X_n + \theta^{(i)}c_nu_i)\,u_i\right), \qquad i = 1, \ldots, k,$$

where $\Delta_n^{(i)}$ is the $i$th component of $\Delta_n$ and $0 \le \theta^{(i)} \le 1$ for $i = 1, \ldots, k$. Hence

$$(D_n, \Delta_n) = c_n\|D_n\|^2 + \tfrac{1}{2}c_n^2(D_n, \bar{A}_n), \qquad \|\Delta_n\|^2 = c_n^2\|D_n\|^2 + c_n^3(D_n, \bar{A}_n) + \tfrac{1}{4}c_n^4\|\bar{A}_n\|^2.$$

Now by hypothesis $\|\bar{A}_n\|$ is bounded, say $\|\bar{A}_n\|^2 \le K_3$. Then

$$|(D_n, \bar{A}_n)|^2 \le K_3\,\|D_n\|^2.$$


After some computation we find

$$E\{-M(X_{n+1}) \mid X_n\} \le -M(X_n) - a_n\left\{\|D_n\|^2\left[1 - \tfrac{1}{2}K_1a_n\right] - \|D_n\|K_3^{1/2}\left[\tfrac{1}{2}c_n + \tfrac{1}{2}K_1a_nc_n\right]\right\} + \tfrac{1}{8}K_1K_3a_n^2c_n^2 + \tfrac{1}{2}K_2a_n^2/c_n^2 \quad \text{a.s.},$$

where $n$ is chosen so large that $[1 - \tfrac{1}{2}K_1a_n]$ is nonnegative.

Let $\lambda_n$ be a sequence of random variables defined by

$$\lambda_n = \begin{cases} 1 & \text{if } \|D_n\| \ge 1, \\ 0 & \text{otherwise.} \end{cases}$$

We note that for $n$ sufficiently large we have

$$(6.2) \qquad a_n\left\{\|D_n\|^2\left[1 - \tfrac{1}{2}K_1a_n\right] - \lambda_n\|D_n\|K_3^{1/2}\left[\tfrac{1}{2}c_n + \tfrac{1}{2}K_1a_nc_n\right]\right\} \ge 0.$$

Hence, for such $n$ we obtain

$$E\{-M(X_{n+1}) \mid X_n\} \le -M(X_n) + a_nc_n(1 - \lambda_n)\|D_n\|K_3^{1/2}\left[\tfrac{1}{2} + \tfrac{1}{2}K_1a_n\right] + \tfrac{1}{8}K_1K_3a_n^2c_n^2 + \tfrac{1}{2}K_2a_n^2/c_n^2 \quad \text{a.s.}$$

This inequality clearly is still preserved if we take conditional expectations with respect to $M(X_n)$ on both sides. But now we note that

$$\sum_{n=1}^{\infty} a_nc_nK_3^{1/2}\left[\tfrac{1}{2} + \tfrac{1}{2}K_1a_n\right]E\{(1 - \lambda_n)\|D_n\| \mid M(X_n)\} \quad \text{converges a.s.;}$$

$$\sum_{n=1}^{\infty} \tfrac{1}{8}K_1K_3a_n^2c_n^2 \quad \text{and} \quad \sum_{n=1}^{\infty} \tfrac{1}{2}K_2a_n^2/c_n^2 \quad \text{both converge.}$$

These follow from conditions B and the definition of $\lambda_n$. Hence, we may again apply the corollary of Section 3 to obtain that $M(X_n)$ converges a.s. to a random variable. Now we note that $\sum_j a_j$ diverges to $+\infty$ and that $M(X_n) \le 0$. Hence the series

$$\sum_{j=1}^{\infty} a_j\,E\left\{\|D_j\|^2\left[1 - \tfrac{1}{2}K_1a_j\right] - \lambda_j\|D_j\|K_3^{1/2}\left[\tfrac{1}{2}c_j + \tfrac{1}{2}K_1a_jc_j\right]\right\}$$

converges. This, together with (6.2), insures the existence of a subsequence $\{D_{n_k}\}$ with the property $P\{\lim_{k\to\infty} D_{n_k} = 0\} = 1$. Hence $X_{n_k}$ converges a.s. to zero. Since $M(x)$ is continuous and $M(0) = 0$, we have $P\{\lim_{k\to\infty} M(X_{n_k}) = 0\} = 1$, which implies the desired result.

REFERENCES

[1] H. ROBBINS AND S. MONRO, "A stochastic approximation method," Ann. Math. Stat., Vol. 22 (1951), pp. 400-407.

[2] J. WOLFOWITZ, "On the stochastic approximation method of Robbins and Monro," Ann. Math. Stat., Vol. 23 (1952), pp. 457-461.

[3] J. KIEFER AND J. WOLFOWITZ, "Stochastic estimation of the maximum of a regression function," Ann. Math. Stat., Vol. 23 (1952), pp. 462-466.

[4] J. R. BLUM, "Approximation methods which converge with probability one," Ann. Math. Stat., Vol. 25 (1954), pp. 382-386.

[5] J. L. DOOB, Stochastic Processes, John Wiley and Sons, New York, 1953.
