
Journal of Statistical Planning and Inference 41 (1994) 327-347

North-Holland


Stochastic approximation of global minimum points

Jürgen Dippon*

Mathematisches Institut A, Universität Stuttgart, 7000 Stuttgart, Germany

Václav Fabian**

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA

Received 11 August 1992; revised manuscript received 28 June 1993

Abstract

A method is proposed for approximating a point of a global minimum of a function f defined on a subset D of R^k, when values of f can be estimated at points determined by the method. The method combines a stochastic approximation method with a nonparametric regression estimate. If f has only one global minimum point, then the new estimate has the same asymptotic behavior as the component stochastic approximation method, except that the new method does not require f to have only one stationary point. Similar properties are obtained if there are multiple global minimum points. The best behavior is obtained if f attains its global minimum on a set with a nonempty interior.

AMS Subject Classification: Primary 62L20; secondary 62J02, 65K10

Key words: Stochastic approximation; global minimum; optimal rate of convergence; asymptotic distribution; regression estimates; simulated annealing.

1. Introduction

1.1. The results

We consider here the problem of estimation of a global minimum point of a function f defined on a subset D of the k-dimensional Euclidean space R^k. We assume that we may choose points in D and then estimate the values of f at these points. Known stochastic approximation methods may be used, but may fail to converge to a global

Correspondence to: V. Fabian, Department of Statistics and Probability, Michigan State University, East

Lansing, MI 48824, USA.

*Research supported by a grant from the Deutsche Forschungsgemeinschaft. ** Research partly supported by NSF grants DMS-9005552 and DMS-9101396.

0378-3758/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0378-3758(93)E0080-Z


minimum point if f has multiple stationary points (under additional conditions, points of sharp local maxima are excepted; see Nevel'son and Has'minskij, 1976, Ch. 5; Ljung, 1978; Lazarev, 1992).

The estimate X_n proposed and studied here has useful properties even if f has multiple stationary points. It is a combination of a stochastic approximation method U_n and a nonparametric regression estimate f_n. From the regression estimate we obtain an estimate Z_n of a point of global minimum. The combined method uses Z_n to restart U_n if it wanders too far away from Z_n and then uses U_n or Z_n, depending on which one is better, as the combined estimate X_n.
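In outline, one step of such a restart-and-select scheme might look as follows. This is a minimal illustrative sketch, not the paper's exact construction: `sa_step`, `f_hat`, and the restart `radius` are hypothetical stand-ins for the component stochastic approximation update, the current regression estimate, and the restart rule.

```python
import numpy as np

def combined_step(u, z, f_hat, sa_step, radius):
    """One iteration of a combined scheme in the spirit of Section 1.1.

    u       -- current stochastic-approximation iterate U_n
    z       -- current regression-based estimate Z_n of a global minimizer
    f_hat   -- current regression estimate of f (callable)
    sa_step -- one update of the component SA method (callable)
    radius  -- restart threshold: how far U_n may wander from Z_n
    """
    # Restart U_n at Z_n if it has wandered too far away.
    if np.linalg.norm(u - z) > radius:
        u = z.copy()
    u = sa_step(u)
    # Report as X_n whichever of U_n, Z_n looks better under f_hat.
    x = u if f_hat(u) <= f_hat(z) else z
    return u, x
```

The selection in the last step is what lets the combined estimate inherit the local rate of U_n while the regression component guards against convergence to a non-global stationary point.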

If the function f has one global minimum point θ, then the new method has the same properties as the component stochastic approximation method under weaker conditions on f that allow multiple stationary points. In addition, similar results are obtained when the function attains its global minimum at several points. If f attains its global minimum on a set A with a nonempty interior, then, with probability one, X_n is in the interior of A for all n large enough (in this case Z_n is better than U_n).

A price paid for the improvement is in the difficulty of practical implementation of

any regression estimate unless the dimension k is small; this difficulty disappears in

asymptotic considerations but is present in actual applications (see Remark 2.9).

Three conditions on f are considered, the basic Assumption 3.2, and two stronger conditions, Conditions 3.3(i), (ii). The estimates are described by conditions easily interpreted as instructions on how to construct the estimates. Concerning Z_n, these are given in the basic Assumption 3.7 and two stronger conditions, Conditions 3.8(i), (ii). Remark 3.9 shows that there is no conflict among these conditions and thus Z_n can be constructed to satisfy all three. However, for some of the properties of Z_n, it is not necessary to assume all the conditions on Z_n. Condition 4.1 specifies properties of the component stochastic approximation method and Assumption 5.1 specifies the properties of the combined estimate X_n.

The fact that we discuss the regression estimate f_n, the estimate Z_n based on f_n, the stochastic approximation method, and the combination of these, makes for multiple and somewhat complex conditions.

The main results are in Theorem 5.4. Properties of Z_n are of interest and are summarized in Theorem 3.14. The regression estimate f_n is characterized only by the properties required in Assumption 3.2, and Section 2 describes a possible choice for f_n. The component stochastic approximation method is characterized only by its properties in Condition 4.1.

A possible choice for the component stochastic approximation is the method described in Theorem 2.7 of Fabian (1971); we refer to it as SAF. The method is a second-order variant of the method proposed in Fabian (1967), with the rate of convergence n^{-β} and β = s/(2(1+s)), and its asymptotic distribution is known. The method needs that, in a neighborhood of the global minimum point θ, f have continuous unidirectional partial derivatives of an odd order s + 1 near θ. By results of Chen (1988), the rate of convergence is optimal. However, SAF depends on s.


Although its behavior is known when a wrong s is used, it would be useful to make the method adaptive in the sense that it would select the proper s itself. If the function behaves similarly to an analytic function, then a method with convergence rate n^{-1/2}(log n)^{3/2} is available (see Fabian, 1969). Polyak and Tsibakov (1990) and Renz (1991) extended the result on convergence rate to s odd. An interesting new gradient estimate, proposed and studied by Spall (1988, 1992), can be used to improve the asymptotic distribution in case s = 2, k > 1 for some functions. An optimal choice of the design within the class of gradient estimates used by SAF is described in Erickson et al. (1993). However, there are no optimality results available in general for the gradient estimate, and for the asymptotic distribution.
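For concreteness, the simplest member of this class of gradient estimates is the two-sided finite-difference estimate built from 2k noisy function values per step. The sketch below is generic; it is neither SAF's optimal design nor Spall's simultaneous-perturbation estimate, and the names and interface are illustrative.

```python
import numpy as np

def fd_gradient(f_obs, x, c):
    """Two-sided finite-difference gradient estimate of the kind used in
    Kiefer-Wolfowitz-type methods (2k function evaluations; sketch only).

    f_obs -- noisy observation of f at a point (callable)
    x     -- current iterate, shape (k,)
    c     -- span of the finite difference
    """
    k = len(x)
    g = np.empty(k)
    for i in range(k):
        e = np.zeros(k)
        e[i] = c  # perturb one coordinate at a time
        g[i] = (f_obs(x + e) - f_obs(x - e)) / (2.0 * c)
    return g
```

Spall's simultaneous-perturbation idea replaces the k coordinate perturbations by a single random perturbation of all coordinates, reducing the cost per step from 2k to 2 observations at the price of extra estimation noise.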

In some situations, the gradient of the function is estimable directly; the combined

method can again be used, but with the SAF method replaced by a suitable modification.

Problems of function minimization of the type considered here abound. They include problems in experimental determination of optimal conditions, in artificial neural network models (see White, 1989 and Fabian, 1992c). In many applications, it is not known that the function considered has only one stationary point or it is known that it does not (cf. McInerney et al., 1989). In deterministic problems of function minimization, simulated annealing methods have been proposed to deal with difficulties presented by multiple stationary points (see more in Section 1.2).

Actual applications of stochastic approximation methods have been hindered by disappointing actual performance in some cases. It is likely that the new method will have an improved nonasymptotic behavior (see more in Section 1.3).

The new method preserves most of the useful properties of stochastic approximation methods:

(i) allows for updating of the estimate when new observations are available,

(ii) uses observations of function values mostly at points close to a global minimum point and

(iii) performs asymptotically as well as the previously known stochastic approximation methods under substantially weaker conditions on f.

Property (i) is important in applications where observations are obtained in a time

sequence and updated estimates are needed. Property (ii) is of importance in on-line

applications where an estimate of the function value at an x is obtained by running an

actual process (e.g. chemical reactor, auto pilot) at a setting x. Property (iii) is of

obvious importance.

A reader may prefer a solution not compromising the local character of stochastic

approximation by the use of a regression estimate. However, it may be impossible to

solve a global problem by a local method.

1.2. Related results

Stochastic approximation methods have their origin in the results by Robbins and Monro (1951), who described a method for approximating a zero point of a function,


and by Kiefer and Wolfowitz (1952), who described a method for approximating

a point of a maximum (or minimum) of a function, the problem considered

here.

Chen (1984) described an estimate of a point of global minimum with the optimal rate of convergence n^{-1/3} for the case s = 2 of functions f on [0, 1] with bounded third derivatives, one global minimum point θ and with the second derivative of f positive at θ. The estimate, based on n observations of function values, uses a two-step design. In the first step, n/2 observations are used to form a regression estimate of f. From this estimate, an estimate S is obtained of a subset that contains θ. Then the method takes additional n/2 observations of function values at points in S and fits a quadratic polynomial to f on S. The estimate is then set equal to the point at which the quadratic polynomial has its global minimum.
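The two-step design can be sketched as follows. The grid, the localization rule, and the window width are simplified illustrative choices, not Chen's exact specification.

```python
import numpy as np

def two_step_estimate(f_obs, n, rng):
    """Illustrative sketch of a Chen (1984)-style two-step design on [0, 1].

    f_obs -- noisy observation of f at a point (callable)
    n     -- total observation budget
    rng   -- numpy random generator
    """
    # Step 1: n/2 observations on a grid give a crude localization.
    m = n // 2
    grid = (np.arange(m) + 0.5) / m
    y1 = np.array([f_obs(t) for t in grid])
    center = grid[np.argmin(y1)]
    half = m ** (-1.0 / 3.0)              # shrinking window (assumed rate)
    lo, hi = max(0.0, center - half), min(1.0, center + half)
    # Step 2: n/2 observations in the window S, quadratic fit, minimize it.
    xs = rng.uniform(lo, hi, size=n - m)
    ys = np.array([f_obs(t) for t in xs])
    a, b, _ = np.polyfit(xs, ys, 2)
    return float(np.clip(-b / (2.0 * a), lo, hi)) if a > 0 else float(center)
```

Because all observations are committed in two fixed batches, an estimate of this type cannot be updated observation by observation, which is property (i) of Section 1.1 that it lacks.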

Müller (1989) constructed an estimate with the rate n^{-β₀} with β₀ = s/(2s + 3) < β in case k = 1. The estimate is based on a regression estimate obtained from observations of function values at equidistantly chosen points.

Simulated annealing methods for minimization of functions on R^k (see Has'minskij, 1965; Cerny, 1985; Geman and Hwang, 1986; Kushner, 1987; Gelfand and Mitter, 1991; Pflug, 1992) are stochastic approximation methods that converge in probability to a global minimum point. The methods use larger steps than usual or add additional errors to the observations, or both. They have a much lower rate of convergence than the method presented here. The properties of the simulated annealing methods might appear better when judged by other criteria, in particular, by the probability of entering a fixed neighborhood of the set A where f attains its minimum. However, the number of observations required for these methods to leave a neighborhood of a local minimum and approach a global minimum might be extremely large even in the simplest situations (see Fabian, 1992b). These methods, seemingly local in character, perform, in fact, a global search by randomly wandering around the domain of f.
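In caricature, an annealing-type recursion injects extra noise with a slowly decaying scale into an otherwise standard stochastic-approximation update, which is what produces the global wandering described above. All schedules in this sketch are illustrative choices, not those of the cited papers.

```python
import numpy as np

def annealing_sa(grad_obs, x0, n_steps, rng, a=1.0, b=1.0):
    """Annealing-type recursion X_{n+1} = X_n - a_n grad_obs(X_n) + sigma_n W_n
    with injected Gaussian noise W_n whose scale sigma_n decays slowly
    (step and noise schedules below are illustrative assumptions).
    """
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_steps + 1):
        step = a / n                              # usual SA step size
        sigma = b / np.sqrt(n * np.log(n + 1))    # slowly decaying injection
        x = x - step * grad_obs(x) + sigma * rng.standard_normal(x.shape)
    return x
```

The injected noise is what lets the iterate escape neighborhoods of non-global local minima, and its slow decay is also what depresses the rate of convergence relative to the method of this paper.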

Simulated annealing methods appear to have an advantage that the number of

observations per step is linear in k, the dimension of the domain. In contrast, unless

k is small, the regression estimate described in Section 2 is difficult to apply (see

Remark 2.9). However, the advantage of the simulated annealing methods might be

only in appearance, because these methods also search the whole domain.

It should be noted that if k increases, the problem of finding a global minimum

point becomes more difficult, but a different problem of finding an x for which f(x) is

small may be easier.

Chen’s estimate does not have properties (i) and (iii) (cf. Section 1.1) of our method

and has property (ii) in a weaker form; the simulated annealing methods do not have

property (iii).

A paper by Yakowitz (1993) appeared during the preparation of the final revision of

the present study, in which the main theorem states a result similar to part (iv) of our

Theorem 5.4, but the result is incorrect as stated.


1.3. Nonasymptotic behavior

The asymptotic distribution results are approximations to finite sample behavior. Not only are they easier to derive, they are also simpler to interpret because they depend much less on the function f and on details in the specification of the estimate. Important difficult questions concern the proximity of the nonasymptotic and asymptotic properties and the choice of modifications and specifications to improve the nonasymptotic behavior. Some tentative conclusions may be obtained from the proofs of asymptotic properties.

For the stand-alone stochastic approximation procedure, the proof of asymptotic properties relies on the result that the method converges to the desired point θ. It is possible, for the asymptotic considerations, to approximate the function by a quadratic function with any desired accuracy. The central limit theorem then gives the asymptotic behavior. The difficulty is that the speed by which the method approaches the desired point can be arbitrarily slow. This difficulty remains even if the function values are observed without error and is related to the local character of the procedure.

A regression estimate may be chosen such that its asymptotic and nonasymptotic properties are quite close. Thus, for the regression estimate described in Section 2, if the Lipschitz constant is known (and similarly, if it is estimated), nonasymptotic properties are not difficult to establish. Note that the distribution of the maximum W_m (see (2.6.4)) of nearly normal random variables can be well approximated.

In the new method, it is likely that the regression estimate may be used to speed up

the initial part of the approximation and that the nonasymptotic properties of the new

method - even if there is only one stationary point - will be better and closer to the

asymptotic properties. For additional comments on the nonasymptotic behavior of

the regression estimate component, see Remark 2.9 and Example 2.10.

1.4. Notation

The notation introduced here will be used throughout the paper. For a function f on a set D, and a subset S of D, f[S] is the image of S under f and |f| = sup |f|[D] denotes the supremum norm of f.

For number sequences (a_n), (b_n), we write a_n = O(b_n) if lim sup |a_n|/|b_n| < ∞, a_n ~ b_n if both a_n = O(b_n) and b_n = O(a_n) hold, and a_n ≫ b_n if a_n/b_n → ∞.

Probabilistic concepts refer to a basic probability space (Ω, Σ, P). If X and X_n are random variables or vectors, we write X_n → X and say that (X_n) converges to X to mean convergence with probability 1. We write X_n = o(b_n) if (b_n) is a positive number sequence and (X_n/b_n) converges to 0.

If X_n are functions on Ω, we say that X_n has a given property eventually if, for every ω in a set of probability 1, X_n(ω) has the property for all but finitely many n. If Ω_n are subsets of Ω, we say that Ω_n occurs eventually if the indicator function χ_{Ω_n} of Ω_n equals 1 eventually.


The following conventions will be used: a condition summarizes properties for the

sake of later reference; an assumption is a condition assumed throughout after its

introduction. Three assumptions are used in the paper: Assumptions 3.2, 3.7, and 5.1.

Relations are numbered separately in each subsection; a reference to (i.j.k) is to relation (k) in Section i.j.

2. A regression estimate

2.1. Introduction

A regression estimate with the supremum norm error converging to 0 with probability one at a known rate can be used as a component of the combined method. In fact, less is required (see (3.2.2) and (3.2.3)). Regression estimates with the error given by the supremum norm have been proposed and studied, e.g., by Ibragimov and Has'minskij (1980, 1982), Stone (1982), Fabian (1988a, 1990), but the results were on convergence in probability rather than with probability one. For the results in the last two papers quoted, convergence with probability one was proven in Fabian (1992a), but the result is limited to the one-dimensional case only. However, it is likely that also the other results, not limited to the one-dimensional case, can be extended and convergence with probability one at a known rate can be proven. To do this here would prolong the present paper too much and shift the emphasis away from the main goal.

As a compromise, we describe here a simple regression estimate that can be used in the combined method. The method applies under very weak conditions on the function and its use in the combined method (determination of the sets A_n in (3.6.2) below) is computationally easy. It has a lower rate of convergence than the methods referred to above have for classes of smoother functions. For the asymptotic behavior of the combined method, the lower rate of the component regression estimate is irrelevant. It is relevant for nonasymptotic behavior, and regression estimates with a better rate of convergence may be preferable even if they may complicate the computations.

Described below is a sequence (φ_m) of regression estimates. The subscript m can be interpreted as a nominal number of observations. For the construction of φ_m, the domain of f is divided into I_m subsets K_mi and φ_m is based on observations of values of f at at least J_m points in each K_mi. One possible choice is to use observations at exactly J_m points in each of the subsets and choose J_m equal to the largest integer in m/I_m, making the total number of observations needed for φ_m close to m and at most m. Relation (2.4.3) requires a substantially weakened version of the above choice.

The conditions under which we describe the properties of (φ_m) are stated precisely below.

For simplicity, we assume (cf. Condition 2.5) that the errors of observations of the function values are identically distributed. This assumption is used only in the proof of


Theorem 2.6 when applying the large deviation theorem of Feller to establish (2.6.9).

The assumption can be weakened to (somewhat complicated) conditions that allow

nonidentically distributed errors, using large deviation results in Petrov (1975) or

Theorem 2.1 in Book (1976).

2.2. Notation

For D a nonempty subset of R^k, denote by Li(D, C) the family of all Lipschitz functions on D with the Lipschitz constant at most C. By a u-cover of D we mean a finite sequence (S_1, ..., S_l) of disjoint subsets of D with the diameters diam(S_i) ≤ u, where diam(S_i) = sup{|x − y|_max; {x, y} ⊂ S_i} is the diameter with respect to the max norm on R^k, defined by |z|_max = max{|z_s|; s = 1, ..., k}.

The phrase for all (each) m will mean for all (each) m ∈ {2, 3, ...}.

Remark 2.3. If T and u are positive numbers, then the set [−T, T]^k has a u-cover of length (2T/u + 1)^k, because we can take the smallest integer l such that l ≥ 2T/u and divide the interval [−T, T] into l subintervals of length at most u. [−T, T]^k is equal to the union of l^k cubes with edges of length 2T/l ≤ u. If then T_m/u_m → ∞, there are u_m-covers of length I_m such that I_m = O((T_m/u_m)^k). This justifies the first part of (2.4.3) below.
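The construction in this remark can be sketched directly; representing each cube by its lower corner and edge length is an illustrative choice.

```python
import math
from itertools import product

def u_cover(T, u, k):
    """Enumerate a u-cover of [-T, T]^k as in Remark 2.3: split [-T, T]
    into l = ceil(2T/u) subintervals of length 2T/l <= u and take all
    Cartesian products. Returns the l**k cubes as (lower-corner, edge)
    pairs together with the cover length l**k.
    """
    l = math.ceil(2.0 * T / u)
    edge = 2.0 * T / l                      # edge length 2T/l <= u
    corners = [-T + i * edge for i in range(l)]
    cubes = [(tuple(c), edge) for c in product(corners, repeat=k)]
    return cubes, l ** k
```

Since l ≤ 2T/u + 1, the returned length l^k never exceeds the bound (2T/u + 1)^k stated in the remark.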

Condition 2.4. f is a function on a subset D of R^k, T_m is a positive number for each m. If D is bounded, then T_m = T for all m with a T such that [−T, T]^k ⊃ D. If D is not bounded, then, for a positive T,

(1) T ≤ T_m → ∞, T_m = O(m^t) for a t < 1/k,

u_m is a positive number for each m and

(2) u_m ~ (T_m^k (log m)/m)^{1/(2+k)}.

For each m, (K_m1, ..., K_mI_m) is a u_m-cover of D_m = [−T_m, T_m]^k ∩ D. I_m, J_m are integers satisfying

(3) I_m = O((T_m/u_m)^k) and J_m ~ m/I_m.

For each m and each i = 1, ..., I_m, J_mi is an integer, J_mi ≥ J_m, x_mij for j = 1, ..., J_mi are points in K_mi and

(4) Y_mij = f(x_mij) + η_mij,

with η_mij random variables.

The regression estimate φ_m is defined on D_m and equals, on K_mi for i = 1, ..., I_m, the arithmetic mean Ȳ_mi of Y_mi1, ..., Y_miJ_mi.
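The estimate just defined is piecewise constant: one cell average per element of the cover. A minimal sketch follows; representing the cells by membership predicates is an illustrative interface, not the paper's formulation.

```python
import numpy as np

def bin_average_estimate(cells, samples):
    """Piecewise-constant regression estimate phi_m of Condition 2.4
    (sketch): on each cell K_mi of the cover, the estimate equals the
    arithmetic mean of the noisy values Y_mij observed in that cell.

    cells   -- list of predicates x -> bool, one per cell of the cover
    samples -- list of (x, y) pairs with y = f(x) + noise
    """
    means = []
    for inside in cells:
        ys = [y for (x, y) in samples if inside(x)]
        means.append(np.mean(ys))          # Y-bar_mi; cells assumed nonempty
    def phi(x):
        for inside, m in zip(cells, means):
            if inside(x):
                return m
        raise ValueError("x outside the covered domain D_m")
    return phi
```

For f Lipschitz with constant C and cells of diameter at most u_m, the bias of each cell average is at most C u_m, which is the source of the O(u_m) term in the proof of Theorem 2.6.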


Condition 2.5. The random variables η_mij in Condition 2.4 have the same distribution as a random variable η with zero expectation and a finite moment generating function on a neighborhood of 0. For each m and i, the random variables η_mi1, ..., η_miJ_mi are independent.

Theorem 2.6. Assume that Conditions 2.4 and 2.5 hold and let e_m ≫ u_m. Then, for every positive C,

(1) sup |χ_{D_m}(φ_m − f)| = o(e_m),

with the supremum extending over all f in Li(D, C); in addition, for a positive number c,

(2) u_m = O(m^{−c}), c log m ≤ J_m u_m².

Proof. The first part of (2) follows from (2.4.1) and (2.4.2) for a suitable c = c₁. Next, from (2.4.3), J_m ≥ c₀ m(u_m/T_m)^k = c₀ m u_m^{2+k} T_m^{−k} u_m^{−2} for a positive c₀, and the second part of (2) follows from (2.4.2) for a positive c = c₂. Thus (2) holds with c = min{c₁, c₂}.

Define h_m on D_m by setting it equal, on K_mi, to the arithmetic mean of all f(x_mij) with j = 1, ..., J_mi. We obtain that

(3) sup |χ_{D_m}(h_m − f)| = O(u_m) = o(e_m).

Denote by η̄_mi the arithmetic mean of η_mij with j = 1, ..., J_mi and set

(4) W_m = max{|η̄_mi|; i = 1, ..., I_m},

so that

(5) sup |χ_{D_m}(φ_m − h_m)| = W_m

and the assertion is true if

(6) W_m = o(e_m).

Set ξ_m = √(6 log m) and b_m = √((log m)/J_m). It follows from (2) that

(7) ξ_m = o(J_m^{1/6}) and b_m = O(u_m).

Without loss of generality, we may assume that Var(η) = 1. Note that e_m ≫ u_m by assumption and thus e_m ≫ b_m by (7). Thus, if c is a positive number, then, eventually, {W_m > c e_m} ⊂ {√J_m W_m > √6 √J_m b_m} = A_m, where A_m = {√J_m W_m > ξ_m}. Consequently, it is enough to prove that, with probability one, only finitely many of the events A_m occur. By the Borel-Cantelli lemma, it is sufficient to prove that

(8) Σ_m P(A_m) < ∞.

(If the W_m are independent, (8) is also necessary.)


Consider an m and an i in {1, ..., I_m}. Since J_mi ≥ J_m and ξ_m = o(J_m^{1/6}) by (7), Theorem XVI.7.1 in Feller (1968) on large deviations and Lemma VII.1.2 in Feller (1957) on the tail behavior of the standard normal distribution yield that, for a constant C,

(9) P(√J_mi |η̄_mi| > ξ_m) ≤ C ξ_m^{−1} exp(−ξ_m²/2).

P(A_m) is bounded by I_m times the left-hand side in (9). From (2.4.3) and since J_m → ∞ by (2), we obtain that m ≫ I_m. Thus

log P(A_m) ≤ log m + log C − log ξ_m − ξ_m²/2.

Eventually, log C − log ξ_m < 0 and log P(A_m) < −2 log m. This implies (8) and proves the assertion.

Remark 2.7. If D is bounded, then T_m = T and T_m can be replaced by 1 in (2.4.2). In this case, by results in Stone (1982), under the hypothesis of the preceding theorem, the optimal rate of convergence, among all possible sequences of estimates, under our conditions on f and under slightly stronger assumptions on the available observations, is given by the sequence (u_m).

The result was stated for m = 2, 3, ..., but, of course, applies even if φ_m is observed only for m in a subset of {2, 3, ...}.

Remark 2.8. (The use of the regression estimate as the component of the combined method). Assume that (φ_m) is a sequence of estimates satisfying (2.6.1) with (e_m) a known number sequence converging to 0 and with μ_m observations required for the construction of φ_m. Thus the estimates described in Theorem 2.6 can be used with J_mi = J_m; for these, μ_m ~ m by the second part of (2.4.3). For the asymptotic considerations, it is enough and easy to note that it is possible to choose two increasing sequences of integers m_r and n_r such that μ_{m_1} + ... + μ_{m_r} = o(n_r) and then construct φ_{m_r} at stage n_r. Setting f_n = φ_{m_r}, ε_n = e_{m_r} and H_n = D_{m_r} for n such that n_r ≤ n < n_{r+1} we obtain

|χ_{H_n}(f_n − f)| = o(ε_n)

with the total number of observations to construct f_1, ..., f_n of order o(n) and (ε_n) a known sequence converging to 0. This satisfies the requirements on the component regression estimates in Assumption 3.2 (the sets H_n are denoted by D_n there) and in Remark 5.5.
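The bookkeeping in this remark can be checked numerically. The schedule below uses power-law choices of the kind made concrete in Example 2.10 (m_r = r^a nominal observations for φ_{m_r}, built at time n_r = r^{a/b}); the values a = 2, b = 1/2 are illustrative, not a recommendation.

```python
def schedule(r_max, a=2.0, b=0.5):
    """Staging schedule in the spirit of Remark 2.8: stage r builds the
    regression estimate phi_{m_r} from about m_r = r**a observations at
    time n_r = r**(a/b). Returns (r, n_r, m_r, cumulative observations)
    per stage; with b < a/(a+1) the cumulative count is o(n_r).
    """
    rows, total = [], 0
    for r in range(1, r_max + 1):
        m_r = int(r ** a)
        n_r = int(r ** (a / b))
        total += m_r                 # observations spent on phi_{m_r}
        rows.append((r, n_r, m_r, total))
    return rows
```

With a = 2 and b = 1/2 the cumulative observation count grows like r³ while n_r = r⁴, so the fraction of the budget consumed by the regression component tends to zero, as required by Assumption 3.2.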

Remark 2.9. (Nonasymptotic aspects of the use of the regression estimate). An unpleasant aspect of regression estimation, or any estimation of a global property of a regression function, is the sharp increase of the difficulty as the dimension k of the domain increases. This is related to the entropy of R^k (see, e.g., Kolmogorov and Tihomirov, 1959); in a simplified way, we need approximately u^{−k} cubes of diameter at most u to cover a unit cube in R^k and estimating a global property means investigating


the property in all cubes of such a cover with small u. This is compounded by the fact that an easy cover is obtained by dividing the interval [0, 1] into l subintervals of length 1/l and taking all Cartesian products of the subintervals. This is a 1/l-cover of length l^k, and for k not small, l^k is, of course, very large already for l = 2. This can be avoided, but not easily, by taking other covers. Regression estimates that are based on estimates of function values at randomly selected points are basically of this type, except that it is doubtful that the random selection of points leads to good covers.

Asymptotically, a low speed of the component regression estimate does not matter. It does matter nonasymptotically, and it may be better to use regression estimates other than those described in Theorem 2.6.

Some savings on the number of observations are possible, which we shall discuss for the case of the method in Theorem 2.6 with D bounded.

It is possible to apply the method such that the total number of observations to construct f_1, ..., f_n for n = n_r can be made approximately equal to m_r rather than to m_1 + ... + m_r. Indeed, the conditions concerning the errors η_mij of function value estimates (cf. Condition 2.5) do not prevent the use, for the construction of φ_{m_r}, of the observations used previously for m = m_1, ..., m_{r−1}. So all the previous observations can be used unless there are more of them than required in some of the new subsets K_mi. This is unlikely to occur or will occur only to a negligible degree if, for each new m_r, the additional observations are taken at points approximately equally spaced over D. Also, if we have some superfluous observations at a step, these observations can still be used efficiently at later steps. Of course, it is not necessary to take the additional observations needed to construct φ_{m_r} at the step n_r; it is possible to take them at this step or earlier.

If the storage of all the points x_mij and the corresponding estimates of the function values is too expensive, it is possible to decrease the storage requirements by choosing all x_mij equal to a point x_mi. This will, however, make it more difficult to use, at a current step, all the observations obtained before. It is then still possible to keep the number of observations needed for constructing f_1, ..., f_n negligible with respect to n; see Example 2.10.

The combined method does not need the full force of the regression estimate. Assumption 3.2(ii) states formally the requirements on the regression estimates. We need the error to be o(ε_n) over the set {f − min f ≤ 3ε_n}. On the set {f − min f > 3ε_n}, much less is required, namely that f_n > min f + 2ε_n. In nonasymptotic considerations it may be possible to stop upgrading estimates f_n on subsets of D where we have good evidence that f > min f + 3ε_n. An asymptotic justification could be given by considering, for each n, two regression estimates, with different errors, using the less expensive method to estimate a subset D_0 of D where f > min f + 3ε_n, and evaluating the more precise and more expensive method only on the set D − D_0.

Example 2.10. We shall show a possible choice of the sequences (m_r) and (n_r) and the resulting (ε_n). This should not be interpreted as a recommended


choice; the asymptotic results in Theorem 5.4 do not imply any choice preferable to another.

For simplicity, assume that D is [0, 1]^k, T_m = 1. In the construction of φ_m, use u_m = [(m/log m)^{1/(2+k)} − 1]^{−1} and note that (2.4.2) holds. By Remark 2.3, it is possible to choose the covers such that I_m is the largest integer in (1/u_m + 1)^k = (m/log m)^{k/(2+k)}. Set J_m as the largest integer in m/I_m. Now (2.4.3) holds and the number of observations required for φ_m is at most m and close to m. Choose positive numbers a, b and c such that 0 < c < b < 1. Set m_r = r^a, n_r = r^{a/b} and ε_n = n^{−c/(2+k)} and construct (f_n) from (φ_m) as described in Remark 2.8. Since (ε_n) is a decreasing sequence, we obtain, by Theorem 2.6, |f_n − f| = o(ε_n), if ε_{n_r} ≫ u_{m_{r−1}}. But u_{m_{r−1}} ~ log^{1/(2+k)}(r − 1)/(r − 1)^{a/(2+k)} ~ log^{1/(2+k)}(r)/r^{a/(2+k)} and ε_{n_r} = r^{−ac/(b(2+k))} ≫ u_{m_{r−1}}. Note that the ε_n converge to 0 not much slower than n^{−1/(2+k)} if c is selected close to 1.

Of course, we obtain J_{m_r} close to r^{2a/(2+k)}(a log r)^{k/(2+k)}.

As explained in Remark 2.9, it is possible to reuse the observations taken previously and, in such a case, the total number of observations at n_r is m_r = o(n_r). If we do not reuse the previous observations, the total number of observations at step n_r is at most r^{a+1} and, with a proper choice of a, this is again o(n_r).

If f is assumed to have continuous derivatives of order p, it is likely that the estimates described in Stone (1982) can be shown to have property (2.6.1) for any (e_m) such that e_m ≫ ((log m)/m)^{p/(2p+k)} and, for p large, we would obtain, similarly as before, ε_n = n^{−d} with d less than but close to 1/2.

3. The auxiliary estimate

3.1. Notation

We add some notation to Notation 1.4. ‖·‖ is the Euclidean norm. For subsets M of R^k and points x, y in R^k, d(M, y) is the Euclidean distance of a subset M from a point y, B(x, r) the open sphere with the center x and radius r. We shall consider below a subset D of R^k. We denote by S(x, r) the sphere B(x, r) ∩ D in the metric subspace D of R^k, S(M, r) the set {x; d(M, x) < r} ∩ D and ρ(M) = sup{r; S(x, r) ⊂ M, x ∈ M}. An interior of a subset M of D with respect to the topology of D will be called the D-interior of M; however, unless specifically indicated otherwise, topological concepts for subsets of R^k refer to the metric space R^k.

The matrix of the second-order partial derivatives of a function f at x is denoted by

H(x), if these derivatives exist.

We shall consider functions on Ω with values subsets of R^k (see, e.g., A_n defined in (3.6.2)). We use standard notation for functions with abstract ranges. Thus the assertion A ⊂ A_n means that A(ω) ⊂ A_n(ω) for every ω, {A ⊂ A_n} is the set of all ω for which A(ω) ⊂ A_n(ω), and the assertion A ⊂ A_n eventually means that {A ⊂ A_n} occurs eventually.

A reference e.g. to Condition 3.3(ii) means a reference to part (ii) of Condition 3.3.


Assumption 3.2. (i) f is a function on a subset D of R^k attaining a minimal value min f; A = f^{−1}{min f} is the inverse image of {min f} under f. A is a compact set. For every positive number r,

(1) inf f[D − S(A, r)] > min f.

(ii) (D_n) is a nondecreasing sequence of subsets of D such that, for every bounded subset D_0 of D, D_0 ⊂ D_n eventually. For each n, f_n is a regression estimate defined on D_n and, for a positive number sequence (ε_n) converging to zero,

(2) |χ_{D_n ∩ {f ≤ min f + 3ε_n}}(f_n − f)| = o(ε_n)

and, eventually,

(3) f_n > min f + 2ε_n on {f > min f + 3ε_n} ∩ D_n.

Condition 3.3. (Additional requirements on f). (i) For a positive r, a positive number τ and for all x in S(A, r),

(1) f(x) − min f ≥ r d(x, A)^τ.

The set A has a nonempty D-interior or satisfies

(2) ρ(S(A, r_n)) = O(r_n)

for every positive number sequence (r_n) converging to 0.

(ii) θ is an interior point of D and is in A and H(θ) exists and is positive definite.

Remark 3.4. (On the required properties of f). Part (i) of Assumption 3.2 has an obvious meaning. Note that (3.2.1) holds if D is compact and f continuous. Part (ii) requires the existence of a suitable regression estimate and is an additional implicit condition on f. The regression estimate in Theorem 2.6 requires that f be Lipschitz continuous, and Remarks 2.8 and 2.9 discuss the use of the regression estimate so that the number of observations needed is negligible with respect to n. Much less is assumed in Assumption 3.2 about the regression estimates than the property |χ_{D_n}(f_n − f)| = o(ε_n) and, as pointed out in Remark 2.9, this can be used to save on the number of observations.

In nonasymptotic considerations it would be useful to select ε_n such that, with large probability, the left-hand side in (3.2.2) is small relative to ε_n and (3.2.3) holds. This is relatively easy for the regression estimate described in Section 2 if the Lipschitz constant for f is known, and it may be possible if the constant is estimated.

Relation (3.3.1) is satisfied with τ = 2 and a suitable r if A consists of a finite number of points and, at each such point θ, H(θ) exists and is positive definite. Condition (3.3.2) is satisfied for finite and other reasonable sets A, and fails for the set described in Example 3.5. Note that the requirement of a nonempty D-interior is a weaker property than that of a nonempty interior with respect to the topology of ℝ^k.


Condition 3.3(ii) is a part of conditions under which the component stochastic

approximation method has been studied.

Example 3.5. Consider D = ℝ^k and any nonincreasing sequences (r_n) and (d_n) such that r_n → 0, d_n → 0 and d_n ≫ r_n. Consider a subset A of B(0, d_1) such that, for every n ≥ 1, A ∩ [S(0, d_n) − S(0, d_{n+1})] has a finite number of points spaced at most r_n apart. Let 0 ∈ A. Notice that A is closed. Then S(A, r_n) ⊃ S(0, d_n) for every n and A does not satisfy (3.3.2).

Notation 3.6. The following notation will be used. Set

(1) h = f − min f,  h_n = f_n − inf f_n[D_n]

and

(2) A_n = {x; x ∈ D_n, h_n(x) ≤ ε_n}.
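On a finite grid D_n, the set A_n of (3.6.2) can be computed directly from the values of the regression estimate. The following NumPy sketch (function and variable names are ours, not the paper's) illustrates the definition on a toy example with two global minimum points.

```python
import numpy as np

def estimated_minimum_set(grid, f_hat, eps_n):
    """Return A_n = {x in D_n : h_n(x) <= eps_n} for a finite grid D_n.

    grid  : (m, k) array of points of D_n
    f_hat : (m,) array of regression-estimate values f_n at the grid points
    eps_n : the tolerance epsilon_n of Assumption 3.2
    """
    h_n = f_hat - f_hat.min()        # h_n = f_n - inf f_n[D_n], cf. (3.6.1)
    return grid[h_n <= eps_n]        # cf. (3.6.2)

# toy example: f(x) = (x^2 - 1)^2 attains its global minimum at x = -1 and x = 1
grid = np.linspace(-2.0, 2.0, 401).reshape(-1, 1)
f_hat = (grid[:, 0] ** 2 - 1.0) ** 2   # noiseless "estimate", for illustration only
A_n = estimated_minimum_set(grid, f_hat, eps_n=0.01)
```

Every point of A_n then lies close to one of the two minimum points, in line with (3.10.2).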

Assumption 3.7. (The estimate Z_n). C_0 is a number larger than 1 and (δ_n) is a sequence of positive numbers converging to 0. With

(1) Â_n = {x; S(x, r) ⊂ A_n} for r = (1/C_0) ρ(A_n),

ι_n are random variables with values 0 and 1 such that

(2) ι_n = 1 if ρ(A_n) > C_0 δ_n and ι_n = 0 if ρ(A_n) < (1/C_0) δ_n.

M_n satisfy

(3) M_n = Â_n if ι_n = 1 and M_n = A_n if ι_n = 0.

For each n = 1, 2, …, Z_n is a random vector with values in M_n.

Condition 3.8. (Additional properties of Z_n). (i) Condition 3.3(i) holds for a τ,

(1) δ_n ≫ ε_n^{1/τ}

and

(2) ‖Z_n − Z_{n−1}‖ ≤ C_0 d(M_n, Z_{n−1}) for all n.

(ii) Condition 3.3(ii) holds, q is a positive number and

(3) δ_n ≫ max{n^{−q}, ε_n^{1/2}}.

Remark 3.9. (On the construction of Z_n). Assumption 3.7 and Condition 3.8 describe the construction of the estimates Z_n by specifying conditions on these random variables.

The sets A_n in (3.6.2) are estimates of the set A at which f attains its global minimum. In Assumption 3.7, the estimates Z_n are chosen in A_n, or in the smaller sets Â_n if ρ(A_n) is large enough. The random variables ι_n keep track of which of the two possible definitions of M_n was used.

Properties (3.8.1) and (3.3.1) make it possible for the method to determine that a point in A is an isolated point of A. As noted in Remark 3.4, (3.3.1) often holds with τ = 2. Property (3.8.2) prevents Z_n from fluctuating unnecessarily.

The constant C_0 is introduced to allow for approximate calculations of Â_n and ρ(A_n). The number q in Condition 3.8(ii) will later be assumed to be less than β, where n^{−β} is the rate of convergence of the component stochastic approximation method (see Condition 5.3). It is easy to choose (δ_n) to satisfy the convergence requirement of Assumption 3.7 and the requirements of Condition 3.8 if τ is known, because it suffices to choose δ_n converging to 0 with δ_n ≫ max{n^{−q}, ε_n^{1/τ}, ε_n^{1/2}}. Often, we may have τ = 2 and ε_n^{1/2} ≫ n^{−q}, and then the requirement is δ_n ≫ ε_n^{1/2}.

It is now easy to see that Z_n can be determined to satisfy both Assumption 3.7 and Condition 3.8, provided (3.3.1) holds with τ = 2 (or with another known τ).
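One concrete way to satisfy Assumption 3.7 together with (3.8.2) — a hedged sketch under our own naming, not a prescription of the paper — is to pick Z_n as a point of M_n nearest to Z_{n−1}; then ‖Z_n − Z_{n−1}‖ = d(M_n, Z_{n−1}), so (3.8.2) holds for any C_0 ≥ 1.

```python
import numpy as np

def choose_Z(A_n, A_hat_n, rho_n, delta_n, Z_prev, C0=2.0):
    """Select (iota_n, Z_n) following Assumption 3.7 and rule (3.8.2).

    A_n, A_hat_n : (m, k) arrays listing the finite sets A_n and \\hat A_n
    rho_n        : an (approximate) value of rho(A_n)
    delta_n      : the tuning sequence of Assumption 3.7
    Z_prev       : the previous estimate Z_{n-1}
    """
    # (3.7.2) forces iota_n = 1 when rho_n > C0*delta_n and iota_n = 0 when
    # rho_n < delta_n/C0; in the intermediate range either value is
    # admissible, and this sketch takes 1 there.
    iota = 1 if rho_n > delta_n / C0 else 0
    # (3.7.3), with a safeguard (our addition) for an empty \hat A_n on a grid
    M_n = A_hat_n if (iota == 1 and len(A_hat_n) > 0) else A_n
    # nearest point of M_n: ||Z_n - Z_{n-1}|| = d(M_n, Z_{n-1}), so (3.8.2) holds
    i = np.argmin(np.linalg.norm(M_n - Z_prev, axis=1))
    return iota, M_n[i]

# demo: A_n has two well-separated points and rho(A_n) is small, so iota_n = 0
A_pts = np.array([[0.0, 0.0], [5.0, 5.0]])
iota_n, Z_n = choose_Z(A_pts, A_pts, rho_n=0.0, delta_n=1.0,
                       Z_prev=np.array([0.2, -0.1]))
```

With the nearest-point rule, Z_n stays with the cluster of A_n containing Z_{n−1}, which is exactly the stability used in the second assertion of Theorem 3.14.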

Lemma 3.10. The following properties hold:

(1) A_n ⊂ {h < 2ε_n} eventually

and, for a sequence (r_n) of positive numbers converging to 0,

(2) A ⊂ A_n ⊂ S(A, r_n) eventually.

Proof. By Assumption 3.2, D_n ⊃ A eventually. From (3.2.2) and (3.2.3) it follows that f_n − min f = o(ε_n) on A and f_n > min f − o(ε_n) on D_n. It follows that inf f_n[D_n] = min f + o(ε_n). This shows that (3.2.2) holds with f_n − f replaced by h_n − h. This and (3.2.3) imply

(3) D_n ∩ {h ≥ 2ε_n} ⊂ {h_n > ε_n} eventually.

Thus, eventually, A_n ⊂ {h < 2ε_n}, which is (1), and the first part of (2) follows from (3.2.2).

Define a function Δ on (0, ∞) by Δ(ε) = inf{r; {h ≤ 2ε} ⊂ S(A, r)}. Δ is a nondecreasing extended-valued function with a limit c at 0, c ≥ 0. In fact, c = 0, because, if c > 0, then {h ≤ 2ε} has common points with D − S(A, c/2) for all ε > 0, a contradiction to (3.2.1). Set r_n = (1 ∧ Δ(ε_n)) + 1/n. Then r_n → 0 and the second inclusion in (2) holds because of (1).

Lemma 3.11. Let (r_n) be a sequence of positive numbers converging to 0, S_n = S(A, r_n), and let S(z_n, c) ⊂ S_n for a positive c, points z_n in S_n and all n. Let r < c. Then S(z_n, r) ⊂ A eventually.

Proof. Assume that the assertion does not hold. Changing (z_n) to a suitable subsequence, we obtain that z_n → z for a z and that S(z_n, r) ⊂ A fails for all n. Consider a c_0 in (r, c). Eventually, S(z, c_0) ⊂ S_n and, since A is the intersection of all the S_n, S(z, c_0) ⊂ A. It follows that S(z_n, r) ⊂ A eventually, a contradiction.


Lemma 3.12. If A_0 is a subset of A such that, for some positive r and τ and all x in S(A_0, r),

(1) h(x) ≥ τ d(x, A_0)^τ,

then, eventually,

(2) S(A_0, r) ∩ A_n ⊂ S(A_0, r_n)

with r_n = (3ε_n/τ)^{1/τ}.

Proof. Set C_n = S(A_0, r) ∩ A_n − S(A_0, r_n). By (1), we have h ≥ 3ε_n on C_n. By (3.10.1), we have h < 2ε_n on C_n eventually. It follows that, eventually, C_n is empty and (2) holds.

Lemma 3.13. If Condition 3.8(i) holds, then (3.10.2) holds with r_n = o(δ_n).

If Condition 3.8(ii) holds, then: (a) θ is an isolated element of A; (b) there is a positive r and a sequence (r_n) such that r_n = o(δ_n) and, eventually, (1) and (2) hold:

(1) S(θ, r) ∩ A_n ⊂ S(θ, r_n),

(2) if ι_n = 1 then M_n ∩ S(θ, r) = ∅.

Proof. We shall apply Lemma 3.12 and use the r_n defined there.

Assume first Condition 3.8(i). Apply Lemma 3.12 with A_0 = A. Property (3.12.1) for A_0 = A holds because Condition 3.8(i) implies Condition 3.3(i). From (3.8.1) we obtain that r_n = o(δ_n). By (3.10.2), S(A, r) ∩ A_n = A_n eventually. The desired assertion now follows from (3.12.2).

Secondly, assume Condition 3.8(ii). This condition implies Condition 3.3(ii) and thus assertion (a). Also, for a positive ε, we have B(θ, ε) ⊂ D.

Next, we shall apply Lemma 3.12 with A_0 = {θ}. Note that (3.12.1) follows for an r < ε and for τ = 2 from the properties of H(θ) and that, by (3.8.3),

(3) r_n = o(δ_n).

From Lemma 3.12, we obtain (1).

It is enough to prove (2) with r changed to r/2. Because B(θ, ε) ⊂ D, we have S(θ, r) = B(θ, r). Consider the subsets Q_n of Ω defined by Q_n = {ι_n = 1, M_n ∩ S(θ, r/2) ≠ ∅}. On Q_n, we have M_n = Â_n and ρ(A_n) > (1/C_0) δ_n by (3.7.2) and (3.7.3), and, if Q_n is nonempty, it is possible to choose a function v_n on Q_n with values in Â_n ∩ B(θ, r/2). On Q_n, we further obtain by (3.7.1) that S(v_n, (1/C_0)² δ_n) ⊂ A_n. Denote by R_n the subset of Ω on which (1) holds. For large enough n, we obtain, on Q_n ∩ R_n,

S(v_n, (1/C_0)² δ_n) ⊂ S(θ, r) ∩ A_n ⊂ S(θ, r_n).

But, for large enough n, this is impossible, because of (3). It follows that, for large n, Q_n ∩ R_n is empty and Q_n^c ⊃ R_n. Since R_n occurs eventually, Q_n^c also occurs eventually, and (2) holds eventually.


Theorem 3.14. We have

(1) d(A, Z_n) → 0 and f(Z_n) < min f + 2ε_n eventually,

and, if the D-interior of A is nonempty, then, for a positive r, eventually, ι_n = 1 and S(Z_n, r) ⊂ A.

If Condition 3.8(i) holds, A has an empty D-interior and is the union of a finite number m of disjoint closed sets C_1, …, C_m, then there is a measurable decomposition {F_1, …, F_m} of the sure event such that, on each F_i, d(C_i, Z_n) = o(δ_n).

If Condition 3.8(ii) holds, then the following two properties hold on {Z_n → θ}: ι_n = 0 eventually and ‖Z_n − θ‖ = o(δ_n).

Proof. The first assertion: By Assumption 3.7, the range of Z_n is a subset of A_n, and (1) follows from (3.10.2) and (3.10.1) in Lemma 3.10. Suppose A has a nonempty D-interior. Then, eventually, ρ(A_n) ≥ ρ(A) > C_0 δ_n, ι_n = 1 by (3.10.2) and (3.7.2), and Z_n ∈ Â_n. The last relation, (3.7.1) and (3.10.2) imply S(Z_n, c) ⊂ S(A, r_n) with c = ρ(A)/C_0 and r_n → 0. The second part of the assertion now follows by Lemma 3.11 for every r < c.

The second assertion: We may assume that all C_i are nonempty. We have, eventually, A ⊂ A_n ⊂ S(A, r_n) with r_n = o(δ_n), by Lemma 3.13. It follows that d(A, Z_n) ≤ r_n. Also, by (3.3.2), ρ(A_n) = o(δ_n) and thus, by (3.7.2) and (3.7.3), we have M_n = A_n eventually. Thus the subset of Ω defined by

(2) R_n = {d(A, Z_n) ≤ r_n, A ⊂ M_n ⊂ S(A, r_n)}

occurs eventually. On R_n, M_n is the union of sets C_{n,i} such that C_i ⊂ C_{n,i} ⊂ S(C_i, r_n). Let i ≠ j and set

(3) Q_n = R_{n−1} ∩ R_n ∩ {Z_{n−1} ∈ C_{n−1,i}, Z_n ∈ C_{n,j}}.

Set d = d(C_i, C_j). Then, on Q_n, ‖Z_n − Z_{n−1}‖ ≥ d − (r_n + r_{n−1}) and d(M_n, Z_{n−1}) ≤ r_{n−1}, which contradicts (3.8.2) provided n is large enough. This proves that Q_n is eventually empty, and it follows that {Z_{n−1} ∈ C_{n−1,i}, Z_n ∈ C_{n,j}} is eventually empty. The second assertion now follows, since the measurability of the sets F_i is obvious.

The third assertion: Assume that F = {Z_n → θ} has positive probability since, if not, the assertion is trivially true. Let r and r_n be as in Lemma 3.13. On F, Z_n is eventually in S(θ, r) and thus in S(θ, r_n) by (3.13.1). Since Z_n is in M_n, we obtain by (3.13.2) that ι_n = 0 eventually on F. Since r_n = o(δ_n), we obtain ‖Z_n − θ‖ = o(δ_n) on F. This proves the third assertion.

4. A stochastic approximation method

The SAF method described in Theorem 2.7 of Fabian (1971) can be used to satisfy Condition 4.1 under a strengthening of Condition 3.8(ii) (see Remark 4.2). Other methods may be used under possibly other conditions on f.


Condition 4.1. (Requirements on the component stochastic approximation method). (F_0, F_1, …) is a nondecreasing sequence of sub-σ-algebras. For each n = 1, 2, …, X_n and W_n are F_n-measurable k-dimensional random vectors and

(1) U_n = X_{n−1} − W_n.

If Condition 3.3(ii) holds and the event F = {X_n → θ} satisfies P(F) = 1, or F ∈ F_0 and P(F) > 0, then the following two implications hold:

If

(2) χ_F ‖X_n − θ‖ ≤ ‖U_n − θ‖ eventually,

then

(3) χ_F n^q ‖X_n − θ‖ → 0 for a q in (0, β).

If

(4) χ_F X_n = χ_F U_n eventually,

then, conditionally on F, n^β (X_n − θ) converges in distribution to a normal (μ, Σ) random vector.

Remark 4.2. Assume Condition 3.3(ii). Also assume that, for a positive even integer s and a positive ε, H and Δ_{s+1} exist and are continuous on B(θ, ε), where Δ_p(x) denotes the vector whose j-th component is the p-th derivative of f with respect to the j-th coordinate, and β = s/(2(s + 1)).

Then SAF, described in Theorem 2.7 of Fabian (1971), satisfies Condition 4.1. In SAF, the basic recurrence relation is (4.1.1) but with U_n replaced by X_n, and the domain D is assumed to be the whole space ℝ^k.

Consider first the case P(F) = 1. The proof that (4.1.2) implies (4.1.3) is a slight modification of Theorem 5.3 in Fabian (1967); the differences are that we assume (4.1.2) eventually instead of (4.1.4) for every n, and that the estimate of the gradient in Fabian (1967) is replaced by the multiple of this estimate by an estimate of the inverse of H(θ), but these differences do not affect the proof. The second assertion, that (4.1.4) implies the asserted asymptotic distribution for X_n, is the assertion of Theorem 2.7 in Fabian (1971), with the difference that here we assume (4.1.4) holds eventually rather than for every n. The only change required in the proof is to change T_n in the application of Theorem 2.2 in Fabian (1968) without affecting the convergence properties of T_n. Similar considerations, in more detail, can be found in Fabian (1978, 1988b). If X_n is outside the interior of D, it may be impossible to form the gradient estimate as required by the SAF method (choose, in such a case, U_n = X_{n−1}). However, by Condition 3.3(ii), on F, X_n will eventually be in the neighborhood B(θ, ε), a subset of D, and then this difficulty does not occur. A formal proof is obtained by another change of T_n mentioned above.


If 0 < P(F) < 1 and F ∈ F_0, then the conditional properties of the SAF method are preserved when the sure event is changed to F and the probability measure P is changed to the conditional probability P(·|F). On the new probability space, X_n → θ and the asymptotic distribution is as asserted. On the original probability space, this becomes the conditional asymptotic distribution.

The limiting distribution is as described in Theorem 2.7 of Fabian (1971).
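SAF itself uses higher-order difference schemes and an estimated inverse Hessian; as a stand-in, the sketch below implements only a plain first-order Kiefer–Wolfowitz-type recursion of the form (4.1.1), U_n = X_{n−1} − W_n, with W_n a gain a_n times a central-difference gradient estimate. The gain and span choices (a_n = 1/n, c_n = n^{−1/4}), the test function and the noise level are illustrative assumptions of ours, not those of Fabian (1971).

```python
import numpy as np

def kw_step(x, f_obs, a_n, c_n, rng):
    """One Kiefer-Wolfowitz-type update U_n = X_{n-1} - W_n.

    f_obs(x, rng) returns a noisy observation of f at x; W_n is the gain
    a_n times a central-difference gradient estimate (2k observations).
    """
    k = len(x)
    g = np.empty(k)
    for j in range(k):
        e = np.zeros(k)
        e[j] = c_n
        g[j] = (f_obs(x + e, rng) - f_obs(x - e, rng)) / (2.0 * c_n)
    return x - a_n * g

def f_obs(x, rng):
    # noisy quadratic with global minimum point theta = (1, -2)
    theta = np.array([1.0, -2.0])
    return float(np.sum((x - theta) ** 2)) + 0.01 * rng.standard_normal()

rng = np.random.default_rng(0)
x = np.zeros(2)
for n in range(1, 2001):
    # illustrative gain and span sequences, not the SAF choices
    x = kw_step(x, f_obs, a_n=1.0 / n, c_n=n ** -0.25, rng=rng)
```

Such a recursion converges only locally; the point of Sections 3 and 5 is precisely to supply the globalization that a component method of this kind lacks.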

5. The combined method

Assumption 5.1. U_n and X_n are k-dimensional random vectors. With Q_n = S(Z_n, 3δ_n) − S(Z_n, 2δ_n), the following properties hold:

If ι_n = 1 or Q_n = ∅, then

(1) X_n = Z_n;

if ι_n = 0 and Q_n ≠ ∅, then

(2) X_n ∈ Q_n if ‖U_n − Z_n‖ > 6δ_n

and

(3) X_n = U_n if ‖U_n − Z_n‖ ≤ 6δ_n.

Remark 5.2. Assumption (5.1.1) makes X_n eventually equal to Z_n if A has a nonempty D-interior; Assumption (5.1.2) prevents X_n from getting too far from Z_n; in the case (5.1.3), the value of X_n is determined by the component stochastic approximation method. The following additional condition will be used in part (iv) of Theorem 5.4.
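The case analysis of Assumption 5.1 translates directly into code. Where (5.1.2) leaves X_n free within the annulus Q_n, the sketch below makes an illustrative choice of ours (the point on the ray from Z_n towards U_n at radius 2.5δ_n); the paper only requires X_n ∈ Q_n.

```python
import numpy as np

def combined_step(Z_n, U_n, iota_n, delta_n):
    """Choose X_n from Z_n and the stochastic-approximation value U_n
    following Assumption 5.1; here Q_n is nonempty since we take D = R^k."""
    if iota_n == 1:
        return Z_n                                   # (5.1.1)
    dist = np.linalg.norm(U_n - Z_n)
    if dist <= 6.0 * delta_n:
        return U_n                                   # (5.1.3): follow the SA method
    # (5.1.2): any point of the annulus Q_n is admissible; we take the point
    # at radius 2.5*delta_n on the ray from Z_n towards U_n (our choice)
    return Z_n + 2.5 * delta_n * (U_n - Z_n) / dist

# demo: U_n far from Z_n is pulled back into the annulus Q_n
Z = np.zeros(2)
delta = 0.1
X_far = combined_step(Z, np.array([10.0, 0.0]), iota_n=0, delta_n=delta)
```

The pull-back case yields ‖X_n − Z_n‖ between 2δ_n and 3δ_n, which is what the proof of Theorem 5.4 uses to rule out (5.1.2) occurring infinitely often on F.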

Condition 5.3. Condition 3.8(ii) and Condition 4.1 hold with a q in (0, β), and the event F = {X_n → θ} satisfies P(F) = 1, or F ∈ F_0 and P(F) > 0.

Theorem 5.4. The following statements are true:

(i) d(A, X_n) → 0.

(ii) If A has a nonempty D-interior, then, eventually, X_n = Z_n and, for a positive r, S(X_n, r) ⊂ A.

(iii) If Condition 3.8(i) holds and A has an empty D-interior and is the union of a finite number of disjoint closed sets C_1, …, C_m, then there is a measurable decomposition {F_1, …, F_m} of the sure event such that, on each F_i, d(X_n, C_i) ≤ 7δ_n eventually.

(iv) If Condition 5.3 holds, then (X_n) satisfies (4.1.3) and (4.1.4) and, conditionally on F, n^β (X_n − θ) is asymptotically normal (μ, Σ).


Proof. Note that, in all cases, because of Assumption 5.1,

(1) d(X_n, Z_n) ≤ 6δ_n.

Assertions (i) and (iii) follow from (1) and Theorem 3.14. If A has a nonempty D-interior, then, by Theorem 3.14 and (5.1.1), eventually ι_n = 1 and X_n = Z_n, and assertion (ii) follows from Theorem 3.14.

Consider assertion (iv): On F, eventually, Q_n ≠ ∅ and, by the last assertion in Theorem 3.14, ι_n = 0, ‖Z_n − θ‖ = o(δ_n), and X_n is determined as in (5.1.2) or (5.1.3).

Next we shall prove that (4.1.2) holds. The following properties hold on F eventually. If X_n is determined by (5.1.3), then, of course, the inequality in (4.1.2) holds. Consider the case (5.1.2). Since ‖U_n − Z_n‖ > 6δ_n, it follows that ‖U_n − θ‖ ≥ 5δ_n. Since ‖X_n − Z_n‖ ≤ 3δ_n, it follows that ‖X_n − θ‖ ≤ 4δ_n. Again, the inequality in (4.1.2) holds.

We have shown that (4.1.2) holds and, by Condition 4.1, (4.1.3) holds. If, for an ω in F, the case (5.1.2) applies infinitely often, then, infinitely often, ‖X_n − Z_n‖ ≥ 2δ_n and ‖X_n − θ‖ ≥ δ_n. By (3.8.3), δ_n ≫ n^{−q} for a q < β and, by (4.1.3), the set of such ω has probability 0. Consequently, (4.1.4) is satisfied, and the assertion follows from Condition 4.1.

Remark 5.5. (The number of observations). If X_n is determined by (5.1.1), it is not necessary to compute U_n. The computation of U_{n+1} − X_n requires ks observations of function values and an asymptotically negligible number of additional observations. Similarly (cf. Remark 2.8), the estimates f_n require an asymptotically negligible number of observations. The total number of observations needed to obtain X_n is then ν_n with ν_n/n → ks (and smaller in the case (5.1.1)). Then the convergence in distribution of n^β (X_n − θ) to a normal random vector U in part (iv) of Theorem 5.4 immediately yields the convergence of ν_n^β (X_n − θ) to (ks)^β U.
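The rate conversion in Remark 5.5 is elementary: if ν_n/n → ks, then ν_n^β/n^β → (ks)^β, so a limit in distribution for n^β(X_n − θ) carries over to ν_n^β(X_n − θ) with the extra factor (ks)^β. A quick numeric check with illustrative values of k, s and β (our choices, including a small sublinear overhead in ν_n):

```python
k, s, beta = 3, 2, 1.0 / 3.0        # dimensions, scheme order, rate exponent
n = 10 ** 6                          # iteration count
nu_n = k * s * n + int(n ** 0.5)     # observations: k*s per step plus overhead
ratio = nu_n ** beta / n ** beta     # approaches (k*s) ** beta as n grows
```
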

Acknowledgment

The authors wish to thank the editor and two anonymous referees for comments and criticism that led to significant improvements of the paper.

References

Book, S.A. (1976). The Cramér–Feller–Petrov large deviation theorem for triangular arrays, manuscript.

Černý, V. (1985). A thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theory Appl. 45, 41–51.

Chen, H. (1984). Optimal rates of convergence for locating the global maximum of a regression function. Thesis, Dept. of Statistics, Univ. of California, Berkeley.

Chen, H. (1988). Lower rate of convergence for locating a maximum of a function. Ann. Statist. 16, 1330–1334.

Erickson, R.V., V. Fabian and J. Mařík (1993). An optimum design for estimating the first derivative. Preliminary Report, RM 531, Dept. of Statistics and Probability, Michigan State Univ., East Lansing, MI.

Fabian, V. (1967). Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Statist. 38, 191–200.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39, 1327–1332.

Fabian, V. (1969). Stochastic approximation for smooth functions. Ann. Math. Statist. 40, 299–302.

Fabian, V. (1971). Stochastic approximation. In: J.S. Rustagi, Ed., Optimizing Methods in Statistics. Proc. Symp., Ohio State University, June 14–16, 1971. Academic Press, New York, 439–470.

Fabian, V. (1978). On asymptotically efficient recursive estimation. Ann. Statist. 6, 854–866.

Fabian, V. (1988a). Polynomial estimation of regression functions with the supremum norm error. Ann. Statist. 16, 1345–1368.

Fabian, V. (1988b). The local asymptotic minimax property of a recursive estimate. Statist. Probab. Lett. 6, 383–388.

Fabian, V. (1990). Complete cubic spline estimation of non-parametric regression functions. Probab. Theory Related Fields 85, 57–64.

Fabian, V. (1992a). Convergence properties of the supremum norm error of nonparametric regression estimates. Trans. 11th Prague Conf. on Information Theory, Statistical Decision Functions, Random Processes, Academia, Prague, 35–48.

Fabian, V. (1992b). Simulated annealing simulated. Preliminary Report, RM 524, Dept. of Statistics and Probability, Michigan State Univ., East Lansing, MI.

Fabian, V. (1992c). On neural network models and stochastic approximation. Submitted for publication. Preliminary Report, RM 530, Dept. of Statistics and Probability, Michigan State Univ., East Lansing, MI.

Feller, W. (1957). An Introduction to Probability Theory and Its Applications. Wiley, New York.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. Wiley, New York.

Gelfand, S.B. and S.K. Mitter (1991). Recursive stochastic algorithms for global optimization in R^d. SIAM J. Control Optim. 29, 999–1018.

Geman, S. and C. Hwang (1986). Diffusions for global optimization. SIAM J. Control Optim. 24, 1031–1043.

Has'minskij, R.Z. (1965). Use of random noise in problems of optimization and learning. Problemy Peredači Informacii 1, 113–117 (in Russian).

Ibragimov, I.A. and R.Z. Has'minskij (1980). On nonparametric estimation of regression. Soviet Math. Dokl. 21, 810–814.

Ibragimov, I.A. and R.Z. Has'minskij (1982). Bounds for the risks of nonparametric estimates of the regression. Teorija Verojatn. i Prim. 27, 81–94 (in Russian); translation: Theory Probab. Appl. 27, 84–99.

Kiefer, J. and J. Wolfowitz (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23, 462–466.

Kolmogorov, A.N. and V.M. Tihomirov (1959). ε-entropy and ε-capacity of sets in functional spaces. Uspehi Mat. Nauk 14, 3–86 (in Russian); translation: Amer. Math. Soc. Transl. Ser. 2, 17 (1961), 277–364.

Kushner, H.J. (1987). Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169–185.

Lazarev, V.A. (1992). On the convergence of stochastic approximation procedures in the case of multiple roots of the regression function. Probl. Peredači Inf. 18, 75–88 (in Russian).

Ljung, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6, 680–696.

McInerney, J.M., K.G. Haines, S. Biafore and R. Hecht-Nielsen (1989). Back propagation error surfaces can have local minima. Tech. Report No. CS89-157, Univ. of California at San Diego, La Jolla, CA.

Müller, H.-G. (1989). Adaptive nonparametric peak estimation. Ann. Statist. 17, 1053–1069.

Nevel'son, M.B. and R.Z. Has'minskij (1976). Stochastic Approximation and Recursive Estimation. Transl. Math. Monographs, Vol. 47, Amer. Math. Society, Providence, RI.

Petrov, V.V. (1975). Sums of Independent Random Variables. Springer, Berlin.

Pflug, G. (1992). Application aspects of stochastic approximation. Part II of Ljung, L., G. Pflug and H. Walk: Stochastic Approximation and Optimization of Random Systems. Birkhäuser, Basel.

Polyak, B.T. and A.B. Tsybakov (1990). Optimal rates of search algorithms of stochastic optimization. Probl. Peredači Inf. 26, 45–53 (in Russian).

Renz, J. (1991). Konvergenzgeschwindigkeit und asymptotische Konfidenzintervalle in der stochastischen Approximation. Dr. rer. nat. Thesis, Universität Stuttgart.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400–407.

Spall, J.C. (1988). A stochastic approximation algorithm for large-dimensional systems in the Kiefer–Wolfowitz setting. Proc. IEEE Conf. Decision Control, 1544–1548.

Spall, J.C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Automat. Control 37, 332–341.

Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040–1053.

White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models. J. Amer. Statist. Assoc. 84, 1003–1013.

Yakowitz, S. (1993). A globally convergent stochastic approximation method. SIAM J. Control Optim. 31, 30–40.