Journal of Statistical Planning and Inference 41 (1994) 327-347
North-Holland
Stochastic approximation of global minimum points
Jürgen Dippon*
Mathematisches Institut A, Universität Stuttgart, 7000 Stuttgart, Germany
Václav Fabian**
Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
Received 11 August 1992; revised manuscript received 28 June 1993
Abstract
A method is proposed for approximating a point of a global minimum of a function f defined on a subset
D of ℝ^k, when values of f can be estimated at points determined by the method. The method combines
a stochastic approximation method with a nonparametric regression estimate. If f has only one global
minimum point, then the new estimate has the same asymptotic behavior as the component stochastic
approximation method, except that the new method does not require f to have only one stationary point.
Similar properties are obtained if there are multiple global minimum points. The best behavior is obtained if
f attains its global minimum on a set with a nonempty interior.
AMS Subject Classification: Primary 62L20; secondary 62J02, 65K10
Key words: Stochastic approximation; global minimum; optimal rate of convergence; asymptotic distribution; regression estimates; simulated annealing.
1. Introduction
1.1. The results
We consider here the problem of estimation of a global minimum point of a function f defined on a subset D of the k-dimensional Euclidean space ℝ^k. We assume that
we may choose points in D and then estimate the values of f at these points. Known
stochastic approximation methods may be used, but may fail to converge to a global
Correspondence to: V. Fabian, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA.
*Research supported by a grant from the Deutsche Forschungsgemeinschaft.
**Research partly supported by NSF grants DMS-9005552 and DMS-9101396.
0378-3758/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0378-3758(93)E0080-Z
minimum point if f has multiple stationary points (under additional conditions,
points of sharp local maxima are excepted; see Nevel'son and Has'minskij, 1976,
Ch. 5; Ljung, 1978; Lazarev, 1992).
The estimate X_n proposed and studied here has useful properties even if f has
multiple stationary points. It is a combination of a stochastic approximation method
U_n and a nonparametric regression estimate f_n. From the regression estimate we
obtain an estimate Z_n of a point of global minimum. The combined method uses Z_n to
restart U_n if it wanders too far away from Z_n, and then uses U_n or Z_n, depending on
which one is better, as the combined estimate X_n.
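The restart-and-select logic just described can be sketched as follows. This is a hypothetical illustration only; the function names, the fixed restart radius, and the use of the regression estimate f_n to decide which of U_n, Z_n is better are our simplifications, not the precise rules of Assumption 5.1.

```python
import numpy as np

def combined_step(u_n, z_n, f_est, radius):
    """One step of the combined estimate X_n (illustrative sketch).

    u_n: current stochastic-approximation iterate U_n
    z_n: current regression-based estimate Z_n
    f_est: current regression estimate f_n of the function f
    radius: threshold beyond which U_n counts as having wandered off
    """
    # Restart U_n at Z_n if it has wandered too far away from Z_n.
    if np.linalg.norm(u_n - z_n) > radius:
        u_n = z_n.copy()
    # Use whichever of U_n, Z_n looks better under the regression estimate.
    x_n = u_n if f_est(u_n) <= f_est(z_n) else z_n
    return u_n, x_n
```

For example, with u_n far from z_n the step restarts the stochastic approximation iterate at z_n and reports it as the combined estimate.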
If the function f has one global minimum point θ, then the new method has the
same properties as the component stochastic approximation method under weaker
conditions on f that allow multiple stationary points. In addition, similar results are
obtained when the function attains its global minimum at several points. If f attains
its global minimum on a set A with a nonempty interior, then, with probability one, X_n is in
the interior of A for all n large enough (in this case Z_n is better than U_n).
A price paid for the improvement is in the difficulty of practical implementation of
any regression estimate unless the dimension k is small; this difficulty disappears in
asymptotic considerations but is present in actual applications (see Remark 2.9).
Three conditions on f are considered, the basic Assumption 3.2, and two stronger
conditions, Conditions 3.3(i), (ii). The estimates are described by conditions easily
interpreted as instructions on how to construct the estimates. Concerning Z_n,
these are given in the basic Assumption 3.7 and two stronger conditions,
Conditions 3.8(i), (ii). Remark 3.9 shows that there is no conflict among these conditions and thus Z_n can be constructed to satisfy all three. However, for some of the
properties of Z_n, it is not necessary to assume all the conditions on Z_n. Condition 4.1
specifies properties of the component stochastic approximation method and Assumption 5.1 specifies the properties of the combined estimate X_n.
The fact that we discuss the regression estimate f_n, the estimate Z_n based on f_n, the
stochastic approximation method, and the combination of these, makes for multiple
and somewhat complex conditions.
The main results are in Theorem 5.4. Properties of Z_n are of interest and are
summarized in Theorem 3.14. The regression estimate f_n is characterized only by the
properties required in Assumption 3.2, and Section 2 describes a possible choice for f_n.
The component stochastic approximation method is characterized only by its properties in Condition 4.1.
A possible choice for the component stochastic approximation is the method
described in Theorem 2.7 of Fabian (1971); we refer to it by SAF. The method is
a second-order variant of the method proposed in Fabian (1967), with the rate of
convergence n^{-β} and β = s/(2(s + 1)), and its asymptotic distribution is known. The
method needs that, in a neighborhood of the global minimum point θ, f have
continuous unidirectional partial derivatives of an odd order s + 1. By results of
Chen (1988), the rate of convergence is optimal. However, SAF depends on s.
Although its behavior is known when a wrong s is used, it would be useful to make the
method adaptive in the sense that it would select the proper s itself. If the function
behaves similarly to an analytic function, then a method with convergence rate
n^{-1/2}(log n)^{3/2} is available (see Fabian, 1969). Polyak and Tsybakov (1990) and Renz
(1991) extended the result on convergence rate to s odd. An interesting new gradient
estimate, proposed and studied by Spall (1988, 1992), can be used to improve the
asymptotic distribution in case s = 2, k > 1 for some functions. An optimal choice of
the design within the class of gradient estimates used by SAF is described in Erickson
et al. (1993). However, there are no optimality results available in general for the
gradient estimate, and for the asymptotic distribution.
In some situations, the gradient of the function is estimable directly; the combined
method can again be used, but with the SAF method replaced by a suitable modification.
Problems of function minimization of the type considered here abound. They
include problems in experimental determination of optimal conditions, in artificial
neural network models (see White, 1989 and Fabian, 1992c). In many applications, it
is not known that the function considered has only one stationary point or it is known
that it does not (cf. McInerney et al., 1989). In deterministic problems of function
minimization, simulated annealing methods have been proposed to deal with difficul-
ties presented by multiple stationary points (see more in Section 1.2).
Actual applications of stochastic approximation methods have been hindered by
disappointing actual performance in some cases. It is likely that the new method will
have an improved nonasymptotic behavior (see more in Section 1.3).
The new method preserves most of the useful properties of stochastic approxima-
tion methods:
(i) it allows for updating of the estimate when new observations are available,
(ii) it uses observations of function values mostly at points close to a global minimum point, and
(iii) it performs asymptotically as well as the previously known stochastic approximation methods under substantially weaker conditions on f.
Property (i) is important in applications where observations are obtained in a time
sequence and updated estimates are needed. Property (ii) is of importance in on-line
applications where an estimate of the function value at an x is obtained by running an
actual process (e.g. chemical reactor, auto pilot) at a setting x. Property (iii) is of
obvious importance.
A reader may prefer a solution not compromising the local character of stochastic
approximation by the use of a regression estimate. However, it may be impossible to
solve a global problem by a local method.
1.2. Related results
Stochastic approximation methods have their origin in the results by Robbins and
Monro (1951), who described a method for approximating a zero point of a function,
and by Kiefer and Wolfowitz (1952), who described a method for approximating
a point of a maximum (or minimum) of a function, the problem considered
here.
Chen (1984) described an estimate of a point of global minimum with the optimal
rate of convergence n^{-1/3} for the case s = 2 of functions f on [0, 1] with bounded third
derivatives, one global minimum point θ and with the second derivative of f positive
at θ. The estimate, based on n observations of function values, uses a two-step design.
In the first step, n/2 observations are used to form a regression estimate of f. From this
estimate, an estimate S is obtained of a subset that contains θ. Then the method takes
additional n/2 observations of function values at points in S and fits a quadratic
polynomial to f on S. The estimate is then set equal to the point at which the
quadratic polynomial has its global minimum.
Müller (1989) constructed an estimate with the rate n^{-β₀}, where β₀ = s/(2s + 3) < β, in
case k = 1. The estimate is based on a regression estimate obtained from observations
of function values at equidistantly chosen points.
Simulated annealing methods for minimization of functions on ℝ^k (see Has'minskij,
1965; Černý, 1985; Geman and Hwang, 1986; Kushner, 1987; Gelfand and Mitter,
1991; Pflug, 1992) are stochastic approximation methods that converge in probability
to a global minimum point. The methods use larger steps than usual or add additional
errors to the observations, or both. They have a much lower rate of convergence than
the method presented here. The properties of the simulated annealing methods
might appear better when judged by other criteria, in particular, by the probability
of entering a fixed neighborhood of the set A where f attains its minimum. However,
the number of observations required for these methods to leave a neighborhood
of a local minimum and approach a global minimum might be extremely large even
in the simplest situations (see Fabian, 1992b). These methods, seemingly local in
character, perform, in fact, a global search by randomly wandering around the
domain of f.
Simulated annealing methods appear to have an advantage that the number of
observations per step is linear in k, the dimension of the domain. In contrast, unless
k is small, the regression estimate described in Section 2 is difficult to apply (see
Remark 2.9). However, the advantage of the simulated annealing methods might be
only in appearance, because these methods also search the whole domain.
It should be noted that if k increases, the problem of finding a global minimum
point becomes more difficult, but a different problem of finding an x for which f(x) is
small may be easier.
Chen’s estimate does not have properties (i) and (iii) (cf. Section 1.1) of our method
and has property (ii) in a weaker form; the simulated annealing methods do not have
property (iii).
A paper by Yakowitz (1993) appeared during the preparation of the final revision of
the present study, in which the main theorem states a result similar to part (iv) of our
Theorem 5.4, but the result is incorrect as stated.
1.3. Nonasymptotic behavior
The asymptotic distribution results are approximations to finite sample behavior.
Not only are they easier to derive, but they are also simpler to interpret because they
depend much less on the function f and on details in the specification of the estimate.
Important difficult questions concern the proximity of the nonasymptotic and asym-
ptotic properties and choice of modifications and specifications to improve the
nonasymptotic behavior. Some tentative conclusions may be obtained from the
proofs of asymptotic properties.
For the stand-alone stochastic approximation procedure, the proof of asymptotic
properties relies on the result that the method converges to the desired point 8. It is
possible, for the asymptotic considerations, to approximate the function by a quadratic
function with any desired accuracy. The central limit theorem gives then the asymp-
totic behavior. The difficulty is that the speed by which the method approaches the
desired point can be arbitrarily slow. This difficulty remains even if the function values
are observed without error and is related to the local character of the procedure.
A regression estimate may be chosen such that its asymptotic and nonasymptotic
properties are quite close. Thus, for the regression estimate described in Section 2, if
the Lipschitz constant is known (and similarly, if it is estimated) nonasymptotic
properties are not difficult to establish. Note that the distribution of the maximum W_m
(see (2.6.4)) of nearly normal random variables can be well approximated.
In the new method, it is likely that the regression estimate may be used to speed up
the initial part of the approximation and that the nonasymptotic properties of the new
method - even if there is only one stationary point - will be better and closer to the
asymptotic properties. For additional comments on the nonasymptotic behavior of
the regression estimate component, see Remark 2.9 and Example 2.10.
1.4. Notation
The notation introduced here will be used throughout the paper. For a function
f on a set D, and a subset S of D, f[S] is the image of S under f, and |f| = sup |f|[D]
denotes the supremum norm of f.
For number sequences (a_n), (b_n), we write a_n = O(b_n) if lim sup |a_n|/|b_n| < ∞,
a_n ~ b_n if both a_n = O(b_n) and b_n = O(a_n) hold, and a_n ≫ b_n if a_n/b_n → ∞.
Probabilistic concepts refer to a basic probability space (Ω, 𝒜, P). If X and X_n are
random variables or vectors, we write X_n → X and say that (X_n) converges to X to
mean convergence with probability 1. We write X_n = o(b_n) if (b_n) is a positive number
sequence and (X_n/b_n) converges to 0.
If X_n are functions on Ω, we say that X_n has a given property eventually if, for every ω in
a set of probability 1, X_n(ω) has the property for all but finitely many n. If Ω_n are subsets of Ω, we
say that Ω_n occurs eventually if the indicator function χ_{Ω_n} of Ω_n equals 1 eventually.
The following conventions will be used: a condition summarizes properties for the
sake of later reference; an assumption is a condition assumed throughout after its
introduction. Three assumptions are used in the paper: Assumptions 3.2, 3.7, and 5.1.
Relations are numbered separately in each subsection; a reference to (i.j.k) is to
relation (k) in Section i.j.
2. A regression estimate
2.1. Introduction
A regression estimate with the supremum norm error converging to 0 with probability one at a known rate can be used as a component of the combined method. In
fact, less is required (see (3.2.2) and (3.2.3)). Regression estimates with the error given
by the supremum norm have been proposed and studied, e.g., by Ibragimov and
Has'minskij (1980, 1982), Stone (1982), Fabian (1988a, 1990), but the results were on
convergence in probability rather than with probability one. For the results in the last
two papers quoted, convergence with probability one was proven in Fabian (1992a), but
the result is limited to the one-dimensional case only. However, it is likely that also the
other results, not limited to the one-dimensional case, can be extended and convergence with probability one at a known rate can be proven. To do this here would
prolong the present paper too much and shift the emphasis away from the main goal.
As a compromise, we describe here a simple regression estimate that can be used in
the combined method. The method applies under very weak conditions on the
function and its use in the combined method (determination of the sets A_n in (3.6.2)
below) is computationally easy. It has a lower rate of convergence than the methods
referred to above have for classes of smoother functions. For the asymptotic behavior of
the combined method, the lower rate of the component regression estimate is irrel-
evant. It is relevant for nonasymptotic behavior, and regression estimates with
a better rate of convergence may be preferable even if they may complicate the
computations.
Described below is a sequence (φ_m) of regression estimates. The subscript m can be
interpreted as a nominal number of observations. For the construction of φ_m, the
domain of f is divided into l_m subsets K_{mi} and φ_m is based on observations of values of
f at at least J_m points in each K_{mi}. One possible choice is to use observations at exactly
J_m points in each of the subsets and choose J_m equal to the largest integer in m/l_m,
making the total number of observations needed for φ_m close to m and at most m.
Relation (2.4.3) requires a substantially weakened version of the above choice.
The conditions under which we describe the properties of (φ_m) are stated precisely
below.
For simplicity, we assume (cf. Condition 2.5) that the errors of observations of the
function values are identically distributed. This assumption is used only in the proof of
Theorem 2.6 when applying the large deviation theorem of Feller to establish (2.6.9).
The assumption can be weakened to (somewhat complicated) conditions that allow
nonidentically distributed errors, using large deviation results in Petrov (1975) or
Theorem 2.1 in Book (1976).
2.2. Notation
For D a nonempty subset of ℝ^k, denote by Li(D, C) the family of all Lipschitz
functions on D with the Lipschitz constant at most C. By a u-cover of D
we mean a finite sequence (S_1, …, S_l) of disjoint subsets of D with the diameters
diam(S_i) ≤ u, where diam(S_i) = sup{|x − y|_max; {x, y} ⊂ S_i} is the diameter with respect
to the max norm on ℝ^k, defined by |z|_max = max{|z_s|; s = 1, …, k}.
The phrase for all (each) m will mean for all (each) m ∈ {2, 3, …}.
Remark 2.3. If T and u are positive numbers, then the set [−T, T]^k has a u-cover of
length at most (2T/u + 1)^k, because we can take the smallest integer l such that l ≥ 2T/u and
divide the interval [−T, T] into l subintervals of length at most u. [−T, T]^k is then equal
to the union of l^k cubes with edges of length 2T/l ≤ u. If then T_m/u_m → ∞, there are
u_m-covers of length l_m such that l_m = O((T_m/u_m)^k). This justifies the first part of (2.4.3)
below.
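The counting in Remark 2.3 can be checked directly; the small function below is an illustration only (the name is ours).

```python
import math

def cover_length(T, u, k):
    """Length of the u-cover of [-T, T]^k from Remark 2.3:
    split [-T, T] into l subintervals, with l the smallest integer
    satisfying l >= 2T/u, so each subinterval has length 2T/l <= u;
    the Cartesian products give l**k cubes of diameter at most u
    in the max norm."""
    l = math.ceil(2 * T / u)
    return l ** k
```

As T/u grows, the length behaves like (2T/u)^k, which is the O((T_m/u_m)^k) bound used for l_m in (2.4.3).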
Condition 2.4. f is a function on a subset D of ℝ^k, and T_m is a positive number for each m. If
D is bounded, then T_m = T for all m with a T such that [−T, T]^k ⊃ D. If D is not
bounded, then, for a positive T,
(1) T ≤ T_m ↑ ∞, T_m = O(m^t) for a t < 1/k,
u_m is a positive number for each m and
(2) u_m ~ (T_m^k (log m)/m)^{1/(2+k)}.
For each m, (K_{m1}, …, K_{m l_m}) is a u_m-cover of D_m = [−T_m, T_m]^k ∩ D. l_m, J_m are integers
satisfying
(3) l_m = O((T_m/u_m)^k) and J_m ~ m/l_m.
For each m and each i = 1, …, l_m, J_{mi} is an integer with J_{mi} ≥ J_m, x_{mij} for j = 1, …, J_{mi} are
points in K_{mi} and
(4) Y_{mij} = f(x_{mij}) + η_{mij},
with η_{mij} random variables.
The regression estimate φ_m is defined on D_m and equals the arithmetic mean Ȳ_{mi} of Y_{mi1}, …, Y_{mi J_{mi}} on K_{mi} for i = 1, …, l_m.
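A minimal one-dimensional sketch of this construction follows. All names are hypothetical; we take D = [0, 1], T_m = 1, Gaussian observation errors, and uniformly placed design points, none of which is required by Condition 2.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_m(f, m, k=1, noise=0.1):
    """Sketch of the cell-average estimate phi_m on D = [0, 1] (k = 1):
    split D into l_m cells of diameter u_m, observe f with error at
    J_m points per cell, and let phi_m be the cell-wise average."""
    u_m = (np.log(m) / m) ** (1.0 / (2 + k))   # cf. (2.4.2) with T_m = 1
    l_m = max(int(np.ceil(1.0 / u_m)), 1)      # number of cells K_mi
    J_m = max(m // l_m, 1)                     # observations per cell
    edges = np.linspace(0.0, 1.0, l_m + 1)
    means = []
    for i in range(l_m):
        x = rng.uniform(edges[i], edges[i + 1], J_m)  # points x_mij in K_mi
        y = f(x) + noise * rng.standard_normal(J_m)   # Y_mij = f(x_mij) + eta_mij
        means.append(y.mean())                        # bar Y_mi
    means = np.array(means)
    # phi_m(x) equals bar Y_mi on the i-th cell
    return lambda x: means[np.minimum((x * l_m).astype(int), l_m - 1)]
```

For a Lipschitz f the sup error over the cells is of the order of u_m plus the size of the averaged noise, which is the mechanism exploited in Theorem 2.6.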
Condition 2.5. The random variables η_{mij} in Condition 2.4 have the same distribution
as a random variable η with zero expectation and a finite moment generating function
on a neighborhood of 0. For each m and i, the random variables η_{mi1}, …, η_{mi J_{mi}} are
independent.
Theorem 2.6. Assume that Conditions 2.4 and 2.5 hold and let e_m ≫ u_m. Then, for every
positive C,
(1) sup_f |χ_{D_m}(φ_m − f)| = o(e_m),
with the supremum extending over all f in Li(D, C); in addition, for a positive number c,
(2) u_m = O(m^{-c}), 1/J_m = O(m^{-c}), √((log m)/J_m) = O(u_m).
Proof. The first part of (2) follows from (2.4.1) and (2.4.2) for a suitable c = c₁. Next,
from (2.4.3), J_m ≥ c₀ m (u_m/T_m)^k = c₀ m u_m^{2+k} T_m^{-k} u_m^{-2} for a positive c₀, and the remaining parts
of (2) follow from (2.4.2) for a positive c = c₂. Thus (2) holds with c = min{c₁, c₂}.
Define h_m on D_m by setting it equal, on K_{mi}, to the arithmetic mean of all f(x_{mij})
with j = 1, …, J_{mi}. We obtain that
(3) sup |χ_{D_m}(h_m − f)| = O(u_m) = o(e_m).
Denote by η̄_{mi} the arithmetic mean of η_{mij} with j = 1, …, J_{mi} and set
(4) W_m = max{|η̄_{mi}|; i = 1, …, l_m},
so that
(5) sup |χ_{D_m}(φ_m − h_m)| = W_m
and the assertion is true if
(6) W_m = o(e_m).
Set ξ_m = √(6 log m) and b_m = √((log m)/J_m). It follows from (2) that
(7) ξ_m = o(J_m^{1/6}) and b_m = O(u_m).
Without loss of generality, we may assume that Var(η) = 1. Note that e_m ≫ u_m by
assumption and thus e_m ≫ b_m by (7). Thus, if c is a positive number, then, eventually,
{W_m > c e_m} ⊂ {W_m > √6 b_m} = A_m, where A_m = {√J_m W_m > ξ_m}. Consequently, it
is enough to prove that, with probability one, only finitely many of the events
A_m occur. By the Borel-Cantelli lemma, it is sufficient to prove that
(8) Σ_m P(A_m) < ∞.
(If the W_m are independent, (8) is also necessary.)
Consider an m and an i in {1, …, l_m}. Since J_{mi} ≥ J_m and ξ_m = o(J_m^{1/6}) by (7),
Theorem XVI.7.1 in Feller (1968) on large deviations and Lemma VII.1.2 in Feller
(1957) on the tail behavior of the standard normal distribution yield that, for
a constant C,
(9) P(√J_{mi} |η̄_{mi}| > ξ_m) ≤ C ξ_m^{-1} exp(−ξ_m²/2).
P(A_m) is bounded by l_m times the left-hand side in (9). From (2.4.3) and since J_m → ∞ by (2), we obtain that m ≫ l_m. Thus
log P(A_m) ≤ log m + log C − log ξ_m − ½ξ_m².
Eventually, log C − log ξ_m < 0 and log P(A_m) ≤ −2 log m. This implies (8) and proves
the assertion.
Remark 2.7. If D is bounded, then T_m = T and T_m can be replaced by 1 in (2.4.2). In this
case, by results in Stone (1982), under the hypothesis of the preceding theorem, the
optimal rate of convergence, among all possible sequences of estimates, under our
conditions on f and under slightly stronger assumptions on the available observations, is given by the sequence (u_m).
The result was stated for m = 2, 3, …, but, of course, applies even if φ_m is observed
only for m in a subset of {2, 3, …}.
Remark 2.8. (The use of the regression estimate as the component of the combined method). Assume that (φ_m) is a sequence of estimates satisfying (2.6.1) with (e_m)
a known number sequence converging to 0 and with μ_m observations required for the
construction of φ_m. Thus the estimates described in Theorem 2.6 can be used with
J_{mi} = J_m; for these, μ_m ~ m by the second part of (2.4.3). For the asymptotic considerations, it is enough and easy to note that it is possible to choose two increasing
sequences of integers m_r and n_r such that μ_{m_1} + ⋯ + μ_{m_r} = o(n_r) and then construct φ_{m_r} at
stage n_r. Setting f_n = φ_{m_r}, ε_n = e_{m_r} and H_n = D_{m_r} for n such that n_r ≤ n < n_{r+1}, we obtain
|χ_{H_n}(f_n − f)| = o(ε_n)
with the total number of observations to construct f_1, …, f_n of order o(n) and (ε_n)
a known sequence converging to 0. This satisfies the requirements on the component
regression estimates in Assumption 3.2 (the sets H_n are denoted by D_n there) and in
Remark 5.5.
Remark 2.9. (Nonasymptotic aspects of the use of the regression estimate). An unpleasant aspect of regression estimation, or any estimation of a global property of
a regression function, is the sharp increase of the difficulty as the dimension k of the
domain increases. This is related to the entropy of ℝ^k (see, e.g., Kolmogorov and
Tihomirov, 1959); in a simplified way, we need approximately u^{-k} cubes of diameter at
most u to cover a unit cube in ℝ^k, and estimating a global property means investigating
the property in all cubes of such a cover with small u. This is compounded by the fact
that an easy cover is obtained by dividing the interval [0, 1] into l subintervals of
length 1/l and taking all Cartesian products of the subintervals. This is a 1/l-cover of
length l^k, and for k not small, l^k is, of course, very large already for l = 2. This can be
avoided, but not easily, by taking other covers. Regression estimates that are based on
estimates of function values at randomly selected points are basically of this type,
except that it is doubtful that the random selection of points leads to good covers.
Asymptotically, a low speed of the component regression estimate does not matter.
It does matter nonasymptotically, and it may be better to use regression estimates
other than those described in Theorem 2.6.
There are some savings on the number of observations possible that we shall discuss
for the case of the method in Theorem 2.6 with D bounded.
It is possible to apply the method such that the total number of observations to
construct f_1, …, f_n for n = n_r can be made equal approximately to m_r, rather than to
m_1 + ⋯ + m_r. Indeed, the conditions concerning the errors η_{mij} of function value
estimates (cf. Condition 2.5) do not prevent the use, for the construction of φ_{m_r}, of the
observations used previously for m = m_1, …, m_{r−1}. So all the previous observations
can be used unless there are more of them than required in some of the new subsets
K_{mi}. This is unlikely to occur, or will occur only to a negligible degree, if, for each new
m_r, the additional observations are taken at points approximately equally spaced over
D. Also, if we have some superfluous observations at a step, these observations can
still be used efficiently at later steps. Of course, it is not necessary to take all the
additional observations needed to construct φ_{m_r} at the step n_r; it is possible to take
them at this step or earlier.
If the storage of all the points x_{mij} and the corresponding estimates of the function
values is too expensive, it is possible to decrease the storage requirements by choosing
all x_{mij} equal to a point x_{mi}. This will, however, make it more difficult to use, at a current
step, all the observations obtained before. It is then still possible to keep the number of
observations needed for constructing f_1, …, f_n negligible with respect to n; see Example 2.10.
The combined method does not need the full force of the regression estimate.
Assumption 3.2(ii) states formally the requirements on the regression estimates. We
need the error to be o(ε_n) over the set {f − min f ≤ 3ε_n}. On the set {f − min f > 3ε_n},
much less is required, namely that f_n > min f + 2ε_n. In nonasymptotic considerations it
may be possible to stop upgrading the estimates f_n on subsets of D where we have good
evidence that f > min f + 3ε_n. An asymptotic justification could be given by considering, for each n, two regression estimates, with different errors, using the less expensive
method to estimate a subset D₀ of D where f > min f + 3ε_n, and evaluating the more
precise and more expensive method only on the set D − D₀.
Example 2.10. We shall show a possible choice of the sequences (u_m), (m_r) and
(n_r) and the resulting (ε_n). This should not be interpreted as a recommended
choice; the asymptotic results in Theorem 5.4 do not imply that any choice is preferable to
another.
For simplicity, assume that D is [0, 1]^k, T_m = 1. In the construction of φ_m, use
u_m = [(m/log m)^{1/(2+k)} − 1]^{-1} and note that (2.4.2) holds. By Remark 2.3, it is possible to
choose the covers such that l_m is the largest integer in (1/u_m + 1)^k = (m/log m)^{k/(2+k)}. Set
J_m as the largest integer in m/l_m. Now (2.4.3) holds and the number of observations
required for φ_m is at most m and close to m. Choose positive numbers a, b and c such
that 0 < c < b < 1. Set m_r = r^a, n_r = r^{a/b} and ε_n = n^{-bc/(2+k)} and construct (f_n) from (φ_m)
as described in Remark 2.8. Since (ε_n) is a decreasing sequence, we obtain, by
Theorem 2.6, |f_n − f| = o(ε_n) if ε_{n_r} ≫ u_{m_{r−1}}. But u_{m_{r−1}} ~ (log(r − 1))^{1/(2+k)}/(r − 1)^{a/(2+k)} ~
(log r)^{1/(2+k)}/r^{a/(2+k)} and ε_{n_r} = r^{-ac/(2+k)} ≫ u_{m_{r−1}}. Note that the ε_n converge to 0 not
much slower than n^{-1/(2+k)} if b and c are selected close to 1.
Of course, we obtain J_{m_r} close to r^{2a/(2+k)}(a log r)^{k/(2+k)}.
As explained in Remark 2.9, it is possible to reuse the observations taken previously
and, in such a case, the total number of observations at n_r is m_r = o(n_r). If we do not
reuse the previous observations, the total number of observations at step n_r is at most
r^{a+1} and, with a proper choice of a, this is again o(n_r).
If f is assumed to have continuous derivatives of order p, it is likely that the
estimates described in Stone (1982) can be shown to have property (2.6.1) for any
(e_m) such that e_m ≫ ((log m)/m)^{p/(2p+k)} and, for p large, we would obtain, similarly as
before, ε_n = n^{-d} with d less than but close to ½.
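The relations among these sequences can be illustrated numerically. In the sketch below the constants a, b, c, k are arbitrary sample values satisfying 0 < c < b < 1, chosen by us for illustration, not recommendations.

```python
import math

# Sample constants for Example 2.10 (D = [0,1]^k, T_m = 1); 0 < c < b < 1.
k, a, b, c = 2, 8.0, 0.8, 0.7

def u(m):
    """u_m = [(m/log m)^{1/(2+k)} - 1]^{-1} as in Example 2.10."""
    return 1.0 / ((m / math.log(m)) ** (1.0 / (2 + k)) - 1.0)

for r in (10, 20, 40):
    m_r = r ** a                        # nominal observation counts
    n_r = r ** (a / b)                  # stages at which phi_{m_r} is built
    eps_nr = n_r ** (-b * c / (2 + k))  # equals r^{-ac/(2+k)}
    # since c < 1, eps_nr dominates u_{m_{r-1}}, as the example requires
    print(r, round(eps_nr, 5), round(u(m_r), 5))
```

Running this shows ε_{n_r} staying above u_{m_{r−1}} while both decrease, which is the domination ε_{n_r} ≫ u_{m_{r−1}} used to invoke Theorem 2.6.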
3. The auxiliary estimate
3.1. Notation
We add some notation to Notation 1.4. ‖·‖ is the Euclidean norm. For subsets M of
ℝ^k and points x, y in ℝ^k, d(M, y) is the Euclidean distance of a subset M from a point
y, and B(x, r) the open sphere with the center x and radius r. We shall consider below
a subset D of ℝ^k. We denote by S(x, r) the sphere B(x, r) ∩ D in the metric subspace D of
ℝ^k, S(M, r) the set {x; d(M, x) < r} ∩ D and ρ(M) = sup{r; S(x, r) ⊂ M, x ∈ M}. An interior of a subset M of D with respect to the topology of D will be called the D-interior
of M; however, unless specifically indicated otherwise, topological concepts for
subsets of ℝ^k refer to the metric space ℝ^k.
The matrix of the second-order partial derivatives of a function f at x is denoted by
H(x), if these derivatives exist.
We shall consider functions on Ω with values subsets of ℝ^k (see, e.g., A_n defined in
(3.6.2)). We use standard notation for functions with abstract ranges. Thus the assertion
A ⊂ A_n means that A(ω) ⊂ A_n(ω) for every ω, {A ⊂ A_n} is the set of all ω for which
A(ω) ⊂ A_n(ω), and the assertion A ⊂ A_n eventually means that {A ⊂ A_n} occurs eventually.
A reference, e.g., to Condition 3.3(ii) means a reference to part (ii) of Condition 3.3.
Assumption 3.2. (i) f is a function on a subset D of ℝ^k attaining a minimal value
min f; A = f^{-1}{min f} is the inverse image of {min f} under f. A is a compact set.
For every positive number r,
(1) inf f[D − S(A, r)] > min f.
(ii) (D_n) is a nondecreasing sequence of subsets of D such that, for every bounded
subset D₀ of D, D₀ ⊂ D_n eventually. For each n, f_n is a regression estimate defined on D_n and, for a positive number sequence (ε_n) converging to zero,
(2) |χ_{D_n ∩ {f ≤ min f + 3ε_n}}(f_n − f)| = o(ε_n)
and, eventually,
(3) f_n > min f + 2ε_n on {f > min f + 3ε_n} ∩ D_n.
Condition 3.3. (Additional requirements on f). (i) For a positive r, positive numbers
τ and z, and for all x in S(A, r),
(1) f(x) − min f ≥ τ d(x, A)^z.
The set A has a nonempty D-interior or satisfies
(2) ρ(S(A, r_n)) = O(r_n)
for every positive number sequence (r_n) converging to 0.
(ii) θ is an interior point of D and is in A, and H(θ) exists and is positive definite.
Remark 3.4. (On the required properties of f). Part (i) in Assumption 3.2 has an
obvious meaning. Note that (3.2.1) holds if D is compact and f continuous. Part (ii)
requires an existence of a suitable regression estimate and is an additional implicit
condition on f: The regression estimate in Theorem 2.6 requires that f be Lipschitz
continuous and Remarks 2.8 and 2.9 discuss the use of the regression estimate so that
the number of observations needed in negligible with respect to n. Much less is
assumed in Assumption 3.2 about the regression estimates than the property
Ixo,(fn -f)l = o(E,) and as pointed out in Remark 2.9, this can be used to save on the
number of observations.
In nonasymptotic considerations it would be useful to select ε_n such that, with large
probability, the left-hand side in (3.2.2) is small compared with ε_n and (3.2.3) holds. This is
relatively easy for the regression estimate described in Section 2 if the Lipschitz
constant for f is known, and it may be possible if the constant is estimated.
Relation (3.3.1) is satisfied with τ = 2 and some positive r, if A consists of a finite number of
points and at each such point θ, H(θ) exists and is positive definite. Condition (3.3.2) is
satisfied for finite and other reasonable sets A and fails for the set described in
Example 3.5. Note that the requirement of a nonempty D-interior is a weaker property
than that of a nonempty interior with respect to the topology of ℝ^k.
Condition 3.3(ii) is part of the conditions under which the component stochastic
approximation method has been studied.
Example 3.5. Consider D = ℝ^k and any nonincreasing sequences (r_n) and (d_n) such
that r_n → 0, d_n → 0 and d_n ≥ r_n. Consider a subset A of [−d_1, d_1] such that, for every
n ≥ 1, A ∩ [S(0, d_n) − S(0, d_{n+1})] has a finite number of points spaced at most
r_n apart. Let 0 ∈ A. Notice that A is closed. Then S(A, r_n) ⊃ S(0, d_n) for every n and
A does not satisfy (3.3.2).
Notation 3.6. The following notation will be used. Set

(1) h = f − min f, h_n = f_n − inf f_n[D_n]

and

(2) A_n = {x : x ∈ D_n, h_n(x) ≤ ε_n}.
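For intuition, the set estimate (3.6.2) can be sketched on a finite grid standing in for D_n; the toy objective, the grid, and the small perturbation playing the role of the estimation error f_n − f are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def level_set_estimate(f_hat, grid, eps_n):
    """Grid version of A_n = {x in D_n : h_n(x) <= eps_n}, h_n = f_n - inf f_n[D_n]."""
    values = f_hat(grid)
    h_n = values - values.min()        # h_n = f_n - inf f_n[D_n]
    return grid[h_n <= eps_n]

# Toy setting: f(x) = (x^2 - 1)^2 has the two global minimum points -1 and 1;
# the small sine term stands in for the regression-estimation error.
f_hat = lambda x: (x**2 - 1.0)**2 + 0.001 * np.sin(50.0 * x)
grid = np.linspace(-2.0, 2.0, 2001)
A_n = level_set_estimate(f_hat, grid, eps_n=0.05)
# A_n consists of two short intervals, one around each global minimum point.
```

With a smaller ε_n the two intervals shrink toward the minimum points, in line with (3.10.2).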
Assumption 3.7. (The estimate Z_n). C_0 is a number larger than 1, (δ_n) is a sequence of
positive numbers converging to 0. With

(1) Â_n = {x : S(x, r) ⊂ A_n} for r = (1/C_0)μ(A_n),

ι_n are random variables with values 0 and 1 and such that

(2) ι_n = 1 if μ(A_n) > C_0 δ_n and ι_n = 0 if μ(A_n) < (1/C_0)δ_n.

M_n satisfy

(3) M_n = Â_n if ι_n = 1 and M_n = A_n if ι_n = 0.

For each n = 1, 2, ..., Z_n is a random vector with values in M_n.
Condition 3.8. (Additional properties of Z_n). (i) Condition 3.3(i) holds for a τ,

(1) δ_n ≫ ε_n^{1/τ}

and

(2) ‖Z_n − Z_{n−1}‖ ≤ C_0 d(M_n, Z_{n−1}) for all n.

(ii) Condition 3.3(ii) holds, q is a positive number and

(3) δ_n ≫ max{n^{−q}, ε_n^{1/2}}.
Remark 3.9. (On the construction of Z_n). Assumption 3.7 and Condition 3.8 describe
the construction of the estimates Z_n by specifying conditions on these random
variables.

The sets A_n in (3.6.2) are estimates of the set A on which f attains its global
minimum. In Assumption 3.7, the estimates Z_n are chosen in A_n or in the smaller sets
Â_n, if μ(A_n) is large enough. The random variables ι_n keep track of which of the
two possible definitions of M_n was used.

Properties (3.8.1) and (3.3.1) make it possible for the method to determine that
a point in A is an isolated point of A. As noted in Remark 3.4, (3.3.1) often holds with
τ = 2. Property (3.8.2) prevents Z_n from fluctuating unnecessarily.

The constant C_0 is introduced to allow for approximate calculations of A_n
and μ(A_n). The number q in Condition 3.8(ii) will later be assumed to be less than β, where n^{−β}
is the rate of convergence of the component stochastic approximation method (see
Condition 5.3). It is easy to choose (δ_n) to satisfy the convergence requirement of
Assumption 3.7 and the requirements of Condition 3.8 if τ is known, because it is
possible to choose δ_n ≫ max{n^{−q}, ε_n^{1/τ}, ε_n^{1/2}}. Often, we may have τ = 2 and ε_n^{1/2} ≫ n^{−q},
and then the requirement is δ_n ≫ ε_n^{1/2}.

It is now easy to see that Z_n can be determined to satisfy both Assumption 3.7 and
Condition 3.8, provided Condition 3.3(i) holds with τ = 2 (or with another known τ).
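The selection rule of Assumption 3.7 together with (3.8.2) can be sketched numerically. The one-dimensional grid discretization, the fallback when the discretized Â_n is empty, and the choice C_0 = 2.5 below are our illustrative assumptions; in the middle zone of (3.7.2), where ι_n is unconstrained, the sketch sets ι_n = 0.

```python
import numpy as np

def choose_Z(grid, in_A_n, delta_n, Z_prev, C0=2.5):
    """One selection step in the spirit of Assumption 3.7 and (3.8.2).

    grid    -- one-dimensional grid standing in for D_n
    in_A_n  -- boolean mask of the estimated set A_n on the grid
    mu(A_n) is approximated by (grid spacing) * (number of points in A_n).
    """
    h = grid[1] - grid[0]
    mu = h * in_A_n.sum()
    iota = 1 if mu > C0 * delta_n else 0
    if iota == 1:
        # hat A_n of (3.7.1): points whose r-neighbourhood lies inside A_n
        r = mu / C0
        w = int(np.ceil(r / h))
        inner = np.zeros(grid.size, dtype=bool)
        for i in range(w, grid.size - w):
            inner[i] = in_A_n[i - w: i + w + 1].all()
        # an empty discretized hat A_n is a grid artifact; fall back to A_n
        mask = inner if inner.any() else in_A_n
    else:
        mask = in_A_n
    M = grid[mask]
    # the point of M_n nearest to Z_{n-1} satisfies (3.8.2), since C0 > 1
    Z = M[np.argmin(np.abs(M - Z_prev))]
    return iota, Z

grid = np.linspace(0.0, 1.0, 101)
iota1, Z1 = choose_Z(grid, (grid > 0.295) & (grid < 0.705), 0.05, Z_prev=0.9)
iota2, Z2 = choose_Z(grid, (grid > 0.295) & (grid < 0.325), 0.05, Z_prev=0.9)
# iota1 = 1: A_n is large, so Z1 is taken well inside it;
# iota2 = 0: A_n is small, so Z2 is its point nearest to Z_prev.
```

Taking the nearest point of M_n makes ‖Z_n − Z_{n−1}‖ = d(M_n, Z_{n−1}), which is one simple way to enforce (3.8.2).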
Lemma 3.10. The following properties hold:

(1) A_n ⊂ {h < 2ε_n} eventually,

and, for a sequence (r_n) of positive numbers converging to 0,

(2) A ⊂ A_n ⊂ S(A, r_n) eventually.
Proof. By Assumption 3.2, D_n ⊃ A eventually. From (3.2.2) and (3.2.3) it follows that
f_n − min f = o(ε_n) on A and f_n > min f − o(ε_n). It follows that min f_n = min f + o(ε_n). This
shows that (3.2.2) holds with f_n − f replaced by h_n − h. This and (3.2.3) imply

(3) D_n ∩ {h ≥ 2ε_n} ⊂ {h_n > ε_n} eventually.

Thus, eventually, A_n ⊂ {h < 2ε_n}, and (1) and the first part of (2) follow from (3.2.2).

Define a function d on (0, ∞) by d(ε) = inf{r : {h ≤ 2ε} ⊂ S(A, r)}. d is a nondecreasing
extended-valued function with a limit c at 0, c ≥ 0. In fact, c = 0, because, if c > 0,
then {h ≤ 2ε} has common points with D − S(A, c/2) for all ε > 0, a contradiction to
(3.2.1). Set r_n = (1 ∧ d(ε_n)) + 1/n. Then r_n → 0 and the second inclusion in (2) holds
because of (1).
Lemma 3.11. Let (r_n) be a sequence of positive numbers converging to 0, S_n = S(A, r_n),
and let S(z_n, c) ⊂ S_n for a positive c, points z_n in S_n and all n. Let r < c. Then S(z_n, r) ⊂ A eventually.

Proof. Assume that the assertion does not hold. Changing (z_n) to a suitable subsequence, we obtain that z_n → z for a z, and that S(z_n, r) ⊂ A fails for all n. Consider a
c_0 in (r, c). Eventually, S(z, c_0) ⊂ S_n, and, since A is the intersection of all such
S_n, S(z, c_0) ⊂ A. It follows that S(z_n, r) ⊂ A eventually, a contradiction.
Lemma 3.12. If A_0 is a subset of A such that, for some positive r, c and τ and all x in
S(A_0, r),

(1) h(x) ≥ c d(x, A_0)^τ,

then, eventually,

(2) S(A_0, r) ∩ A_n ⊂ S(A_0, r_n)

with r_n = (3ε_n/c)^{1/τ}.

Proof. Set C_n = S(A_0, r) ∩ A_n − S(A_0, r_n). By (1), we have h ≥ 3ε_n on C_n. By (3.10.1), we
have h < 2ε_n on C_n eventually. It follows that, eventually, C_n is empty and (2) holds.
Lemma 3.13. If Condition 3.8(i) holds, then (3.10.2) holds with r_n = o(δ_n).

If Condition 3.8(ii) holds, then: (a) θ is an isolated element of A, (b) there is a positive
r and a sequence (r_n) such that r_n = o(δ_n) and, eventually, (1) and (2) hold:

(1) S(θ, r) ∩ A_n ⊂ S(θ, r_n),

(2) if ι_n = 1 then M_n ∩ S(θ, r) = ∅.
Proof. We shall apply Lemma 3.12 and use r_n defined there.

Assume first Condition 3.8(i). Apply Lemma 3.12 with A_0 = A. Property (3.12.1) for
A holds because Condition 3.8(i) implies Condition 3.3(i). From (3.8.1) we obtain that
r_n = o(δ_n). By (3.10.2), S(A, r) ∩ A_n = A_n eventually. The desired assertion now follows
from (3.12.2).

Secondly, assume Condition 3.8(ii). This condition implies Condition 3.3(ii) and
thus assertion (a). Also, for a positive ε, we have B(θ, ε) ⊂ D.

Next, we shall apply Lemma 3.12 with A_0 = {θ}. Note that (3.12.1) follows for an
r < ε and for τ = 2 from the properties of H(θ) and that, by (3.8.3),

(3) r_n = o(δ_n).

From Lemma 3.12, we obtain (1).
It is enough to prove (2) with r changed to ½r. Because B(θ, ε) ⊂ D, we have
S(θ, r) = B(θ, r). Consider the subsets Q_n of Ω defined by Q_n = {ι_n = 1, M_n ∩ S(θ, ½r) ≠ ∅}.
On Q_n, we have M_n = Â_n and μ(A_n) > (1/C_0)δ_n by (3.7.2) and (3.7.3), and, if Q_n is
nonempty, it is possible to choose a function v_n on Q_n with values in Â_n ∩ B(θ, ½r). On
Q_n, we further obtain by (3.7.1) that S(v_n, (1/C_0)²δ_n) ⊂ A_n. Denote by R_n the subset of
Ω on which (1) holds. For large enough n, we obtain, on Q_n ∩ R_n,

B(v_n, (1/C_0)²δ_n) ⊂ B(θ, r) ∩ A_n ⊂ B(θ, r_n).

But, for large enough n, this is impossible, because of (3). It follows that, for large
n, Q_n ∩ R_n is empty, so that R_n is contained in the complement of Q_n. Since R_n occurs
eventually, also the complement of Q_n occurs eventually, and (2) holds eventually.
Theorem 3.14. We have

(1) d(A, Z_n) → 0 and f(Z_n) < min f + 2ε_n eventually,

and, if the D-interior of A is nonempty, then, for a positive r, eventually, ι_n = 1 and S(Z_n, r) ⊂ A.

If Condition 3.8(i) holds, A has an empty D-interior and is the union of a finite number m of disjoint closed sets C_1, ..., C_m, then there is a measurable decomposition
{F_1, ..., F_m} of the sure event such that, on each F_i, d(C_i, Z_n) = o(δ_n).

If Condition 3.8(ii) holds, then the following two properties hold on {Z_n → θ}: ι_n = 0 eventually and ‖Z_n − θ‖ = o(δ_n).
Proof. The first assertion: By Assumption 3.7, the range of Z_n is a subset of A_n, and (1)
follows from (3.10.2) and (3.10.1) in Lemma 3.10. Suppose A has a nonempty D-interior.
Then, eventually, μ(A_n) ≥ μ(A) > C_0 δ_n, ι_n = 1 by (3.10.2) and (3.7.2), and Z_n ∈ Â_n.
The last relation, (3.7.1) and (3.10.2) imply S(Z_n, c) ⊂ S(A, r_n) with c = μ(A)/C_0 and
r_n → 0. The second part of the assertion now follows by Lemma 3.11 for every r < c.

The second assertion: We may assume that all C_i are nonempty. We have,
eventually, A ⊂ A_n ⊂ S(A, r_n), with r_n = o(δ_n), by Lemma 3.13. It follows that
d(A, Z_n) ≤ r_n. Also, by (3.3.2), μ(A_n) = o(δ_n) and thus, by (3.7.2) and (3.7.3), we have
M_n = A_n eventually. Thus the subset of Ω defined by

(2) R_n = {d(A, Z_n) ≤ r_n, A ⊂ M_n ⊂ S(A, r_n)}

occurs eventually. On R_n, M_n is the union of sets C_{n,i} such that C_i ⊂ C_{n,i} ⊂ S(C_i, r_n). Let
i ≠ j and set

(3) Q_n = R_{n−1} ∩ R_n ∩ {Z_{n−1} ∈ C_{n−1,i}, Z_n ∈ C_{n,j}}.

Set d = d(C_i, C_j). Then, on Q_n, ‖Z_n − Z_{n−1}‖ ≥ d − (r_n + r_{n−1}) and d(M_n, Z_{n−1}) ≤ r_{n−1},
which contradicts (3.8.2) provided n is large enough. This proves that Q_n is eventually
empty and it follows that {Z_{n−1} ∈ C_{n−1,i}, Z_n ∈ C_{n,j}} is eventually empty. The second
assertion now follows, since the measurability of the sets F_i is obvious.
The third assertion: Assume that F = {Z_n → θ} has a positive probability since, if not,
the assertion is trivially true. Let r and r_n be as in Lemma 3.13. On F, Z_n is eventually
in S(θ, r) and thus in S(θ, r_n) by (3.13.1). Since Z_n is in M_n, we obtain by (3.13.2) that
ι_n = 0 eventually on F. Since r_n = o(δ_n), we obtain ‖Z_n − θ‖ = o(δ_n) on F. This proves
the third assertion.
4. A stochastic approximation method
The SAF method described in Theorem 2.7 of Fabian (1971) can be used to satisfy
Condition 4.1 under a strengthening of Condition 3.8(ii) (see Remark 4.2). Other
methods may be used, possibly under other conditions on f.
Condition 4.1. (Requirements on the component stochastic approximation method).
(F_0, F_1, ...) is a nondecreasing sequence of sub-σ-algebras of Ω. For each
n = 1, 2, ..., X_n, W_n are F_n-measurable k-dimensional random vectors, X_0 is
F_0-measurable and

(1) U_n = X_{n−1} − W_n.

If Condition 3.3(ii) holds and the event F = {X_n → θ} satisfies P(F) = 1, or F ∈ F_0 and
P(F) > 0, then the following two implications hold:

If

(2) χ_F ‖X_n − θ‖ ≤ ‖U_n − θ‖ eventually,

then

(3) χ_F n^q ‖X_n − θ‖ → 0 for a q in (0, β).

If

(4) χ_F X_n = χ_F U_n eventually,

then, conditionally on F, n^β(X_n − θ) converges in distribution to a normal (μ, Σ)
random vector.
Remark 4.2. Assume Condition 3.3(ii). Also assume that, for a positive even integer
s and a positive ε, H and Λ_{s+1} exist and are continuous on B(θ, ε), where Λ_p(x) denotes
the vector with j-th component the p-th derivative of f with respect to the j-th
coordinate. β = s/(2(s + 1)).

Then SAF, described in Theorem 2.7 of Fabian (1971), satisfies Condition 4.1.

In SAF, the basic recurrence relation is (4.1.1) but with U_n replaced by X_n, and the
domain D is assumed to be the whole space ℝ^k.
Consider first the case P(F) = 1. The proof that (4.1.2) implies (4.1.3) is a slight
modification of Theorem 5.3 in Fabian (1967); the differences are that we assume
(4.1.2) eventually instead of (4.1.4) for every n and that the estimate of the gradient in
Fabian (1967) is replaced by the multiple of this estimate by an estimate of the inverse
of H(θ), but these differences do not affect the proof. The second assertion, that (4.1.4)
implies the asymptotic distribution for X_n as asserted, is the assertion of Theorem 2.7
in Fabian (1971) with the difference that here we assume (4.1.4) holds eventually rather
than for every n. The only change required in the proof is to change T_n in the
application of Theorem 2.2 in Fabian (1968) without affecting the convergence
properties of T_n. Similar considerations, in more detail, can be found in Fabian (1978,
1988b). If X_n is outside the interior of D, it may be impossible to form the gradient
estimate as required by the SAF method (choose, in such a case, U_n = X_{n−1}). However,
by Condition 3.3(ii), on F, X_n will eventually be in the neighborhood B(θ, ε), a subset
of D, and then this difficulty does not occur. A formal proof is obtained by another
change of T_n mentioned above.
If 0 < P(F) < 1 and F ∈ F_0, then the conditional properties of the SAF method will be
preserved when the sure event is changed to F and the probability measure P is
changed to the conditional probability P(· | F). On the new probability space, X_n → θ
and the asymptotic distribution is as asserted. On the original probability space, this
becomes the conditional asymptotic distribution.

The limiting distribution is as described in Theorem 2.7 of Fabian (1971).
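As an illustration of the component recursion U_n = X_{n−1} − W_n of (4.1.1), here is a minimal first-order Kiefer-Wolfowitz sketch. SAF itself uses higher-order difference schemes and an estimated inverse Hessian, which are omitted here; the toy objective, the noise level, and the gain and span sequences a_n, c_n chosen below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_f(x, noise=0.01):
    """Toy objective f(x) = (x - 2)^2 observed with additive noise."""
    return (x - 2.0)**2 + noise * rng.standard_normal()

def kw_recursion(x0, n_steps, a=1.0, c=1.0, alpha=1.0, gamma=1.0/6.0):
    """Kiefer-Wolfowitz: X_n = X_{n-1} - a_n * D_n, where D_n is a central
    finite-difference estimate of f'(X_{n-1}) from two noisy observations."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = a / n**alpha
        c_n = c / n**gamma
        d_n = (noisy_f(x + c_n) - noisy_f(x - c_n)) / (2.0 * c_n)
        x = x - a_n * d_n     # this is U_n = X_{n-1} - W_n with W_n = a_n * D_n
    return x

x_final = kw_recursion(0.0, 3000)   # approaches the minimum point theta = 2
```

Such a method converges to a stationary point; the point of the paper is that the combined method of Section 5 keeps this local behavior while excluding nonglobal stationary points.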
5. The combined method
Assumption 5.1. V_n and X_n are k-dimensional random vectors. With
Q_n = S(Z_n, 3δ_n) − S(Z_n, 2δ_n), the following properties hold:

If ι_n = 1 or Q_n = ∅, then

(1) X_n = Z_n;

if ι_n = 0 and Q_n ≠ ∅, then

(2) X_n ∈ Q_n if ‖V_n − Z_n‖ > 6δ_n

and

(3) X_n = V_n if ‖V_n − Z_n‖ ≤ 6δ_n.
Remark 5.2. Assumption (5.1.1) makes X_n eventually equal to Z_n if A has a nonempty
D-interior; Assumption (5.1.2) prevents X_n from getting too far from Z_n; in the (5.1.3)
case, the value of X_n is determined by the component stochastic approximation
method. The following additional condition will be used in part (iv) of
Theorem 5.4.
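The three-way switch of Assumption 5.1 can be written down directly. The sketch below is one-dimensional with D = ℝ (so the annulus Q_n is never empty), and placing the (5.1.2) point at radius 2.5δ_n inside Q_n is our arbitrary choice of a point of Q_n.

```python
def combined_step(Z_n, iota_n, V_n, delta_n):
    """One step of the combined method (Assumption 5.1), one-dimensional.

    V_n stands for the point proposed by the component stochastic
    approximation method (the U_n of Condition 4.1);
    Q_n = S(Z_n, 3*delta_n) - S(Z_n, 2*delta_n) is the annulus around Z_n.
    """
    if iota_n == 1:
        return Z_n                           # (5.1.1): trust the set estimate
    if abs(V_n - Z_n) > 6.0 * delta_n:
        # (5.1.2): the proposal strayed too far; return a point of Q_n
        side = 1.0 if V_n >= Z_n else -1.0
        return Z_n + 2.5 * delta_n * side
    return V_n                               # (5.1.3): follow the SA proposal

# In every case the new point stays within 6*delta_n of Z_n, which is the
# bound used at the start of the proof of Theorem 5.4.
```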
Condition 5.3. Condition 3.8(ii) and Condition 4.1 hold with a q in (0, β), and the event
F = {X_n → θ} satisfies P(F) = 1, or F ∈ F_0 and P(F) > 0.
Theorem 5.4. The following statements are true:

(i) d(A, X_n) → 0.

(ii) If A has a nonempty D-interior, then, eventually, X_n = Z_n and, for a positive r,
S(X_n, r) ⊂ A.

(iii) If Condition 3.8(i) holds and A has an empty D-interior and is the union of a finite
number of disjoint closed sets C_1, ..., C_m, then there is a measurable decomposition
{F_1, ..., F_m} of the sure event such that, on each F_i, d(X_n, C_i) ≤ 7δ_n eventually.

(iv) If Condition 5.3 holds, then (X_n) satisfies (4.1.3) and (4.1.4), and, conditionally on
F, n^β(X_n − θ) is asymptotically normal (μ, Σ).
Proof. Note that, in all cases, because of Assumption 5.1,

(1) d(X_n, Z_n) ≤ 6δ_n.

(i) and (iii) follow from (1) and Theorem 3.14. If A has a nonempty D-interior, then, by
Theorem 3.14 and (5.1.1), eventually ι_n = 1 and X_n = Z_n, and assertion (ii) follows from
Theorem 3.14.

Consider assertion (iv): On F, eventually, Q_n ≠ ∅ and, by the last assertion in
Theorem 3.14, ι_n = 0, ‖Z_n − θ‖ = o(δ_n), and X_n is determined as in (5.1.2) or (5.1.3).

Next we shall prove that (4.1.2) holds. The following properties hold on F eventually.
If X_n is determined by (5.1.3), then, of course, the inequality in (4.1.2) holds.
Consider the (5.1.2) case. Since ‖U_n − Z_n‖ > 6δ_n, it follows that ‖U_n − θ‖ ≥ 5δ_n. Since
‖X_n − Z_n‖ ≤ 3δ_n, it follows that ‖X_n − θ‖ ≤ 4δ_n. Again, the inequality in (4.1.2) holds.

We have shown that (4.1.2) holds and, by Condition 4.1, (4.1.3) holds. If, for an ω in
F, case (5.1.2) applies infinitely often, then, infinitely often, ‖X_n − Z_n‖ ≥ 2δ_n and
‖X_n − θ‖ ≥ δ_n. By (3.8.3), δ_n ≫ n^{−q} for a q < β, and, by (4.1.3), the set of such ω has
probability 0. Consequently, (4.1.4) is satisfied, and the assertion follows from
Condition 4.1.
Remark 5.5. (The number of observations). If X_n is determined by (5.1.1), it is not
necessary to compute U_n. The computation of U_{n+1} − X_n requires ks observations of
function values and an asymptotically negligible number of additional observations.
Similarly (cf. Remark 2.8), the estimates f_n require an asymptotically negligible number of
observations. The total number of observations needed to obtain X_n is then ν_n with ν_n/n → ks
(and smaller in case (5.1.1)). Then the convergence in distribution of n^β(X_n − θ) to
a normal random vector U in part (iv) of Theorem 5.4 yields immediately the
convergence in distribution of ν_n^β(X_n − θ) to (ks)^β U.
Acknowledgment
The authors wish to thank the editor and two anonymous referees for comments
and criticism that led to significant improvements of the paper.
References
Book, S.A. (1976). The Cramér-Feller-Petrov large deviation theorem for triangular arrays, manuscript.
Černý, V. (1985). A thermodynamical approach to the traveling salesman problem: an efficient simulation
algorithm. J. Optim. Theory Appl. 45, 41-51.
Chen, H. (1984). Optimal rates of convergence for locating the global maximum of a regression function,
Thesis, Dept. of Statistics, Univ. of California, Berkeley.
Chen, H. (1988). Lower rate of convergence for locating a maximum of a function. Ann. Statist. 16, 1330-1334.
Erickson, R.V., V. Fabian and J. Mařík (1993). An optimum design for estimating the first
derivative. Preliminary Report, RM 531, Dept. of Statistics and Probability, Michigan State Univ.,
East Lansing, MI.
Fabian, V. (1967). Stochastic approximation of minima with improved asymptotic speed. Ann. Math.
Statist. 38, 191-200.
Fabian, V. (1968). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39,
1327-1332.
Fabian, V. (1969). Stochastic approximation for smooth functions. Ann. Math. Statist. 40, 299-302. Fabian, V. (1971). Stochastic approximation. In: J.S. Rustagi, Ed., Optimizing Methods in Statistics. Proc.
Symp., Ohio State University, June 14-16, 1971. Academic Press, New York, 439-470.
Fabian, V. (1978). On asymptotically efficient recursive estimation. Ann. Statist. 6, 854-866. Fabian, V. (1988a). Polynomial estimation of regression functions with the supremum norm error. Ann.
Statist. 16, 1345-1368.
Fabian, V. (1988b). The local asymptotic minimax property of a recursive estimate. Statist. Probab. Lett. 6, 383-388.
Fabian, V. (1990). Complete cubic spline estimation of non-parametric regression functions. Probab. Theory Related Fields 85, 57-64.
Fabian, V. (1992a). Convergence properties of supremum norm error of nonparametric regression
estimates. Trans. 11th Prague Con& on Information Theory, Statistical Decision Functions, Random Process, Academia, Prague, 35-48.
Fabian, V. (1992b). Simulated annealing simulated. Preliminary Report, RM 524, Dept. Statistics and
Probability, Michigan State Univ., East Lansing, MI.
Fabian, V. (1992c). On neural network models and stochastic approximation. Submitted for publication.
Preliminary Report, RM 530, Dept. Statistics and Probability, Michigan State Univ., East Lansing, MI.
Feller, W. (1957). An Introduction to Probability Theory and Its Applications. Wiley, New York.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. II, 2nd ed. Wiley,
New York.
Gelfand, S.B. and S.K. Mitter (1991). Recursive stochastic algorithms for global optimization in ℝ^d. SIAM J. Control Optim. 29, 999-1018.
Geman, S. and C. Hwang (1986). Diffusions for global optimization. SIAM J. Control Optim. 24, 1031-1043.
Has’minskij, R.Z. (1965). Use of random noise in problems of optimization and learning. Problemy Peredaci Informacii 1, 113-117 (in Russian).
Ibragimov, I.A., and R.Z. Has’minskij (1980). On nonparametric estimation of regression. Soviet Math. Dokl. 21, 810-814.
Ibragimov, I.A., and R.Z. Has'minskij (1982). Bounds for the risks of nonparametric estimates of the
regression. Teorija Verojatn. i Prim. 27, 81-94 (in Russian); translation: Theory Probab. Appl. 27, 84-99.
Kiefer, J. and J. Wolfowitz (1952). Stochastic estimation of the maximum of a regression function. Ann.
Math. Statist. 23, 462-466.
Kolmogorov, A.N. and V.M. Tihomirov (1959). ε-entropy and ε-capacity of sets in functional spaces.
Uspehi Mat. Nauk 14, 3-86 (in Russian); translation: Amer. Math. Soc. Transl. Ser. 2 17 (1961), 277-364.
Kushner, H.J. (1987). Asymptotic global behavior for stochastic approximation and diffusions with slowly
decreasing noise effects: global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185. Lazarev, V.A. (1992). On the convergence of the stochastic approximation procedures in the case of multiple
roots of the regression function. Probl. Peredaci Inf. 18, 75-88 (in Russian).
Ljung, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6, 680-696. McInerney, J.M., K.G. Haines, S. Biafore and R. Hecht-Nielsen (1989). Back propagation error surfaces can
have local minima. Tech. Report No. CS89-157, Univ. of California at San Diego, La Jolla, CA.
Müller, H.-G. (1989). Adaptive nonparametric peak estimation. Ann. Statist. 17, 1053-1069.
Nevel’son, M.B. and R.Z. Has’minskij (1976). Stochastic Approximation and Recursive Estimation. Transl.
Math. Monographs Vol. 47, Amer. Math. Society, Providence, RI. Petrov, V.V. (1975). Sums of Independent Random Variables. Springer, Berlin.
Pflug, G. (1992). Application aspects of stochastic approximation, Part II of Ljung, L., G. Pflug and
H. Walk: Stochastic Approximation and Optimization of Random Systems. Birkhäuser, Basel.
Polyak, B.T. and A.B. Tsibakov (1990). Optimal rates of search algorithms of stochastic optimization.
Probl. Peredaci Inf. 26, 45-53 (in Russian).
Renz, J. (1991). Konvergenzgeschwindigkeit und asymptotische Konfidenzintervalle in der stochastischen
Approximation. Dr. rer. nat. Thesis, Universität Stuttgart.
Robbins, H. and S. Monro (1951). A stochastic approximation method. Ann. Math. Statist. 22, 400-407.
Spall, J.C. (1988). A stochastic approximation algorithm for large-dimensional systems in the
Kiefer-Wolfowitz setting. Proc. IEEE Conf. Decision Control, 1544-1548.
Spall, J.C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient
approximation. IEEE Trans. Automat. Control 37, 332-341. Stone, C.J. (1982). Optimal global rates of convergence for non-parametric regression. Ann. Statist. 10,
1040-1053. Yakowitz, S. (1993). A globally convergent stochastic approximation method. SIAM J. Control Optim. 31,
30-40. White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models.
J. Amer. Statist. Assoc. 84, 1003-1013.