Advanced Review

Stochastic approximation: a survey

Harold Kushner∗

Stochastic recursive algorithms, also known as stochastic approximation, take many forms and have numerous applications. It is the asymptotic properties that are of interest. The early history, starting with the work of Robbins and Monro, is discussed. An approach to proofs of convergence with probability one is illustrated by a stability-type argument. For general noise processes and algorithms, the most powerful current approach is what is called the ordinary differential equations (ODE) method. The algorithm is interpolated into a continuous-time process, which is shown to converge to the solution of an ODE, whose asymptotic properties are those of the algorithm. There are probability one and weak convergence methods, the latter being the easiest to use and the most powerful. After discussing the basic ideas and giving some standard proofs, extensions are outlined. These include multiple time scales, tracking of time-changing systems, state-dependent noise, rate of convergence, and random direction methods for high-dimensional problems. © 2009 John Wiley & Sons, Inc. WIREs Comp Stat 2010 2 87–96

Stochastic approximation (SA) deals with asymptotic properties of recursive stochastic algorithms of the types

θ_{n+1} = θ_n + ε_n Y_n, (1)

θ^ε_{n+1} = θ^ε_n + ε Y^ε_n, (2)

θ_{n+1} = Π_H[θ_n + ε_n Y_n] = θ_n + ε_n(Y_n + z_n), (3)

θ^ε_{n+1} = Π_H[θ^ε_n + ε Y^ε_n] = θ^ε_n + ε(Y^ε_n + z^ε_n), (4)

where θ_n ∈ R^r, Euclidean r-space, ε_n > 0, ε > 0, Y_n and Y^ε_n are functions of θ_n and random noise, and E|Y_n| < ∞ for each n. The Π_H denotes projection onto some compact nonempty constraint set H, and z_n, z^ε_n are the projection terms. The case of random ε_n is important, but unless otherwise noted, ε_n is real-valued and tends to zero. It is always assumed that

∑_n ε_n = ∞. (5)

The references to be given illustrate the large number of applications. Only a few key methods and some brief comments on proofs will be given, to give the flavor of the subject. The next section briefly surveys some of the original work and gives a 'stability' argument for the convergence of a basic algorithm. The so-called ordinary differential equations (ODE) method is perhaps the most powerful in use now, and is introduced in Section 'Introduction to the (Ordinary Differential Equation) ODE Approach.' Section 'Weak Convergence' deals with the weak convergence form of this method, the easiest to use. Various important extensions are described in Section 'Extensions.'

∗Correspondence to: [email protected]

Applied Mathematics, Brown University, Providence, RI, USA

DOI: 10.1002/wics.57

BRIEF COMMENTS ON EARLY WORK

The original work was that of Robbins and Monro1 on Eq. (1), which sought the unique zero θ̄ of a continuous real-valued and unknown function ḡ(·) on the real line. Only noise-corrupted observations of the form Y_n = ḡ(θ_n) + δM_n were available. The observation noise δM_n is a martingale-difference sequence (with respect to the filtration induced by {θ_0, Y_n, n < ∞}) with uniformly bounded variances, ḡ(θ)(θ − θ̄) < 0 for θ ≠ θ̄, ḡ(·) has at most linear growth, and outside of each neighborhood of θ̄, ḡ(θ) is bounded away from zero. Also

∑_n ε²_n < ∞. (6)

(For the complete set of assumptions for the early work see Ref 2.) It was shown that θ_n → θ̄ in mean square. This seminal paper was the start of an enormous literature that remains vibrant to this day. Condition (6) assures that the martingale sequence ∑ ε_n δM_n converges w.p.1, and Eq. (5) is necessary if θ_n is not to eventually get stuck at some point that is not θ̄. Generalizations to w.p.1 convergence for more general and vector-valued processes were given by Blum,3 Dvoretsky,4 and Schmetterer.5 Other extensions are in Dupac,6,7 Gladyshev,8 Robbins and Siegmund9 and in the references in these articles and the surveys cited below. Sakrison10 (see also its references to the Prague Conferences) considered a modification with continuous observations. The observations had multiplicative ergodic noise instead of additive martingale-difference noise.
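The Robbins–Monro recursion (1) discussed above is easy to sketch. The following toy example is our own illustration, not from the survey: it assumes a regression function ḡ(θ) = 2 − θ with unique zero θ̄ = 2, Gaussian martingale-difference noise, and ε_n = 1/n, which satisfies both (5) and (6).

```python
import random

def robbins_monro(g, theta0, n_iters, seed=0):
    """Scalar Robbins-Monro recursion, Eq. (1): theta_{n+1} = theta_n + eps_n*Y_n,
    with eps_n = 1/n, so that sum eps_n = infinity (Eq. (5)) and
    sum eps_n^2 < infinity (Eq. (6))."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iters + 1):
        y = g(theta) + rng.gauss(0.0, 1.0)  # noisy observation g(theta_n) + dM_n
        theta += (1.0 / n) * y
    return theta

# Hypothetical unknown regression function with unique zero at theta_bar = 2;
# note g(theta)*(theta - theta_bar) < 0 for theta != theta_bar.
est = robbins_monro(lambda th: 2.0 - th, theta0=10.0, n_iters=20000)
```

The iterate settles near θ̄ = 2 despite the unit-variance noise, because the decreasing steps average the noise out while (5) supplies enough total motion to reach the root.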

Sacks11 (see also Ref 12, Section 10.2) gave the rate of convergence for the one-dimensional Robbins–Monro model. Let ε_n = K/n, K > 0. Suppose that ḡ(·) is continuously differentiable in a neighborhood of θ̄ with ḡ_θ(θ̄) < 0, K|ḡ_θ(θ̄)| > 1/2, and that for each ρ > 0, E(δM_n)² I_{|θ_n−θ̄|≤ρ} → σ² in mean and {(δM_n)² I_{|θ_n−θ̄|≤ρ}} is uniformly integrable. Then √n(θ_n − θ̄) converges in distribution to a normally distributed random variable with mean zero and variance K²σ²/(2K|ḡ_θ(θ̄)| − 1). In this sense, the optimal value of K is 1/|ḡ_θ(θ̄)|. The rate ε_n = O(1/n) is often too fast for applications in that, in a practical sense, θ_n can 'get stuck' far from θ̄.

Kiefer and Wolfowitz13 used a finite-difference form to locate the minimum of an unknown differentiable real-valued function ḡ(·), with a uniformly continuous derivative and a unique stationary point that is a minimum, where the only available data are noise-corrupted observations Y^±_n. Their algorithm is

θ_{n+1} = θ_n + ε_n[Y^+_n − Y^−_n]/2c_n, (7)

where Y^±_n = ḡ(θ_n ± c_n) + δM^±_n, the δM^±_n are martingale differences, and 0 < c_n → 0 is a finite-difference interval. Also

∑_n ε_n c_n < ∞,  ∑_n ε²_n/c²_n < ∞,

and sup_n E(δM^±_n)² < ∞. The first condition assures that the biases in the finite-difference approximation to the derivative are asymptotically negligible. The last two conditions assure the convergence of the martingale ∑(ε_n/c_n)(δM^+_n + δM^−_n). The method is usually used with a constant difference interval, as it reduces the noise effects and yields a more robust algorithm, even at the expense of a small bias. Further references to the early work, as well as precise conditions and extensions, can be found in the surveys by Wasan,2 Nevelson and Khasminskii,14 Tsypkin,15 Lai,16 Schmetterer,17 Ruppert,18 Fabian19 and Dupac.7
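A Kiefer–Wolfowitz style iteration can be sketched as follows. This is our own toy illustration, written in descent form for a minimum; the step and difference-interval sequences ε_n = 1/n, c_n = n^{−1/4} are chosen so that the summability conditions above hold (ε_n c_n = n^{−5/4} and ε²_n/c²_n = n^{−3/2} are both summable).

```python
import random

def kiefer_wolfowitz(f, theta0, n_iters, seed=1):
    """Finite-difference SA in the spirit of Eq. (7), written in descent
    form to locate a minimum.  Two noisy evaluations of f at theta +- c_n
    estimate the derivative; c_n -> 0 slowly enough that the bias and
    noise conditions in the text hold."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iters + 1):
        eps, c = 1.0 / n, n ** -0.25
        y_plus = f(theta + c) + rng.gauss(0.0, 0.1)   # noisy Y_n^+
        y_minus = f(theta - c) + rng.gauss(0.0, 0.1)  # noisy Y_n^-
        theta -= eps * (y_plus - y_minus) / (2.0 * c)
    return theta

# Hypothetical unknown function with a unique minimum at theta = 1.
est = kiefer_wolfowitz(lambda th: (th - 1.0) ** 2, theta0=3.0, n_iters=50000)
```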

A Liapunov Function Proof of Convergence of the Robbins–Monro (RM) Algorithm

A proof of convergence of the vector-valued form of Eq. (1) will now be given by a perturbed Liapunov function argument. W.l.o.g., let θ̄ = 0. Let E_n denote the expectation conditioned on F_n = {θ_0, Y_i, i < n}. The K_i > 0 are constants.

Theorem 1. Let 0 ≤ V(·) be real-valued, continuous, twice continuously differentiable with bounded mixed second partial derivatives and V(0) = 0, EV(θ_0) < ∞. For each ε > 0, let there be δ > 0 such that V(θ) ≥ δ for |θ| ≥ ε, where δ does not decrease as ε increases. Write E_n Y_n = ḡ(θ_n). For each ε > 0 let there be δ_1 > 0 such that V′_θ(θ)ḡ(θ) ≡ −k(θ) ≤ −δ_1 for |θ| ≥ ε. Suppose that E_n|Y_n|² ≤ K_2 k(θ_n) when |θ_n| ≥ K_0. Let

E ∑_{i=1}^∞ ε²_i |Y_i|² I_{|θ_i|≤K_0} < ∞. (8)

Then θ_n → 0 with probability one.

Proof. A truncated Taylor series expansion and the hypotheses yield

E_n V(θ_{n+1}) − V(θ_n) ≤ ε_n V′_θ(θ_n) E_n Y_n + ε²_n K_1 E_n|Y_n|² = −ε_n k(θ_n) + ε²_n K_1 E_n|Y_n|². (9)

The hypotheses imply that EV(θ_n) < ∞ for each n. By shifting the time origin we can suppose that K_1 K_2 ε²_n < ε_n/2. Define

δV_n = K_1 E_n ∑_{i=n}^∞ ε²_i |Y_i|² I_{|θ_i|≤K_0}, (10)

and the perturbed Liapunov function V_n(θ_n) = V(θ_n) + δV_n ≥ 0. Note that

E_n δV_{n+1} − δV_n = −K_1 ε²_n E_n|Y_n|² I_{|θ_n|≤K_0}. (11)

This, together with the hypotheses and Eq. (9), yields

E_n V_{n+1}(θ_{n+1}) − V_n(θ_n) ≤ −ε_n k(θ_n)/2, (12)

which implies that {V_n(θ_n)} is an F_n-supermartingale. By the supermartingale convergence theorem, there is a V̄ ≥ 0 such that V_n(θ_n) → V̄ w.p.1. Since δV_n → 0 w.p.1, V(θ_n) → V̄ w.p.1.

For integers N and m,

−V_N(θ_N) ≤ E_N V_{N+m}(θ_{N+m}) − V_N(θ_N) ≤ −∑_{i=N}^{N+m−1} E_N ε_i k(θ_i)/2. (13)

If P{V̄ > 0} > 0, then by the properties of V(·), θ_n is asymptotically outside of some neighborhood of the origin with a positive probability. This, the fact that ∑ ε_i = ∞, and the properties of k(·) imply that the sum on the right side of Eq. (13) goes to infinity as m → ∞ with a positive probability, leading to a contradiction. Thus V̄ = 0 w.p.1.

Extensions of the perturbed Liapunov function method used in the above proof are a powerful tool for more general problems.12,20–22 Combinations of the stability and ODE methods that do not require constraints are discussed in Ref 12 [Sections 5.4, 6.7, 8.5] and Ref 23.
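The supermartingale drift behind the proof can be observed numerically. The sketch below is our own illustration, not from the survey: with V(θ) = θ² and the scalar recursion θ_{n+1} = θ_n + ε_n(−θ_n + noise), the −ε_n k(θ_n) drift of Eq. (9) dominates the ε²_n noise term, so a Monte Carlo estimate of E V(θ_n) decreases along the run.

```python
import random

def mean_liapunov_path(n_runs=2000, n_iters=200, seed=2):
    """Monte Carlo estimate of E V(theta_n), V(theta) = theta^2, for
    theta_{n+1} = theta_n + eps_n*(-theta_n + noise).  The negative drift
    term of Eq. (9) dominates the eps_n^2 noise term, so the averaged
    Liapunov value decreases."""
    rng = random.Random(seed)
    totals = [0.0] * (n_iters + 1)
    for _ in range(n_runs):
        theta = 5.0
        totals[0] += theta ** 2
        for n in range(1, n_iters + 1):
            eps = 1.0 / (n + 10)   # shifted time origin, as in the proof
            theta += eps * (-theta + rng.gauss(0.0, 0.5))
            totals[n] += theta ** 2
    return [t / n_runs for t in totals]

v = mean_liapunov_path()  # v[0] = 25.0; v[n] shrinks toward a small noise floor
```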

INTRODUCTION TO THE (ORDINARY DIFFERENTIAL EQUATION) ODE APPROACH

In recent decades, the complexity of potential applications of recursive algorithms has increased considerably, and the classical methods of proof are not adequate. There are more complicated stochastic dependencies of the Y_n observation processes, including state dependence of the noise, more flexible requirements on the step sizes, constraints on the iterates, multiple time scales, decentralized algorithms, regression functions that are an average cost over an infinite time interval, tracking of time-varying systems, etc.

The algorithms (1)–(4) can be viewed as finite-difference equations with step sizes ε_n or ε. Hence, there is a natural connection with differential equations. The basic idea of what is now called the ODE method was introduced by Ljung24 and was extensively developed by Kushner and coworkers (see Refs 12, 25, 26 and references therein, and Ref 20), and will be the subject of the rest of this article. One shows by a 'local' analysis that the noise effects average out, so that the asymptotic behavior is determined by that of a 'mean' ODE. Essentially, the dynamical term at a value θ is obtained by averaging the Y_n as though the parameter were fixed at θ. The ODE might be replaced by a constrained form or a differential inclusion. The ODE method is perhaps the most versatile and powerful approach for the analysis of stochastic algorithms.

To get the limit mean ODE one works with continuous-time interpolations. Define t_0 = 0 and t_n = ∑_{i=0}^{n−1} ε_i. Define θ^0(·) on (−∞, ∞) by θ^0(t) = θ_0 for t ≤ 0. For 0 ≤ t_n ≤ t < t_{n+1}, set θ^0(t) = θ_n. Define the sequence of shifted processes θ^n(·) by θ^n(t) = θ^0(t_n + t). For t ≥ 0, let m(t) denote the unique value of n such that t_n ≤ t < t_{n+1}. For t < 0, set m(t) = 0.
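These definitions translate directly into code. A minimal sketch (helper names are ours) of the interpolation times t_n, the index map m(t), and the piecewise-constant interpolation θ^0(·):

```python
def make_interpolation(thetas, eps):
    """Builds t_n = eps_0 + ... + eps_{n-1} (t_0 = 0), the index map m(t)
    (the unique n with t_n <= t < t_{n+1}; m(t) = 0 for t < 0), and the
    piecewise-constant interpolation theta^0(t) = theta_{m(t)}."""
    t = [0.0]
    for e in eps:
        t.append(t[-1] + e)

    def m(s):
        if s < 0:
            return 0
        n = 0
        while n + 1 < len(thetas) and t[n + 1] <= s:
            n += 1
        return n

    def theta0(s):
        return thetas[0] if s <= 0 else thetas[m(s)]

    return t, m, theta0

eps = [1.0 / n for n in range(1, 6)]          # 1, 1/2, 1/3, 1/4, 1/5
t, m, theta0 = make_interpolation([10, 20, 30, 40, 50], eps)
```

With these steps, t = [0, 1, 1.5, ...], so e.g. m(1.2) = 1 and θ^0(1.2) is the second iterate; the interpolation interval shrinks as ε_n → 0, which is what makes the continuous-time limit an ODE.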

Constrained Algorithms

In practice, in one fashion or another, the iterates are usually constrained to lie in some compact set. The constraint might be because of physical limits on the θ_n. In general, it is unreasonable to suppose that the user would allow the θ_n to go to infinity without some appropriate intervention. Let H be a compact, connected, and nonempty constraint set, and suppose that if the iterate leaves H, it is immediately returned to the closest point in H. If the constraint is added artificially for practical reasons, then it should be large enough so that the limits are inside. In an application, this might require some experimentation. For originally unconstrained problems, Chen and Zhu27,28 show convergence using gradually increasing bounds.

For simplicity we will suppose that H is a hyperrectangle. Alternatives are: H = {x : q_i(x) ≤ 0, i ≤ p}, where the q_i(·) are continuously differentiable real-valued functions with gradients q_{i,x}(·) and q_{i,x}(x) ≠ 0 if q_i(x) = 0; or H is an R^{r−1}-dimensional connected compact surface with a continuously differentiable outer normal. These latter sets are dealt with in Ref 12. Define the set C(x) as follows. For x ∈ H^0, the interior of H, C(x) = {0}; for x ∈ ∂H, the boundary of H, C(x) is the infinite convex cone generated by the outer normals at x of the faces on which x lies.

If the iterates are not constrained, then a stabilitymethod such as [Ref 12, Sections 5.4, 10.5] can beused to assure that they do not explode.
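For the hyperrectangle case, the projection Π_H and the resulting term z_n of Eq. (3) can be written down explicitly. A minimal sketch (helper names are ours) of one constrained step:

```python
def project_box(x, lo, hi):
    """Pi_H: projection onto the hyperrectangle H, coordinate by coordinate."""
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

def constrained_step(theta, eps, y, lo, hi):
    """One step of Eq. (3): theta_{n+1} = Pi_H[theta_n + eps_n * Y_n].
    The projection term z_n = (Pi_H[theta + eps*Y] - (theta + eps*Y))/eps
    is 0 when the free step stays in the interior H^0, and lies in
    -C(theta_{n+1}) when the iterate is returned to the boundary."""
    free = [th + eps * yi for th, yi in zip(theta, y)]
    new = project_box(free, lo, hi)
    z = [(a - b) / eps for a, b in zip(new, free)]
    return new, z

# A step that pushes the first coordinate past the boundary of H = [-1,1]^2:
theta, z = constrained_step([0.9, 0.0], 0.5, [1.0, 0.2],
                            lo=[-1.0, -1.0], hi=[1.0, 1.0])
```

In the example, the free step lands at (1.4, 0.1); the first coordinate is returned to the face at 1, producing z = (−0.8, 0), a vector in the cone −C(θ_{n+1}), while the interior coordinate has a zero projection term.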

Compactness and Continuous-time Interpolations. w.p.1 Convergence for the Constrained Algorithm

The following extension of the Arzelà–Ascoli Lemma will be used. Suppose that for each n, f_n(·) is an R^r-valued measurable function on (−∞, ∞) and {f_n(0)} is bounded. Also suppose that for each T and ε > 0, there is a δ > 0 such that

lim sup_n  sup_{0≤t−s≤δ, |t|≤T} |f_n(t) − f_n(s)| ≤ ε. (14)

Then we say that {f_n(·)} is equicontinuous in the extended sense, and there is then a subsequence that converges to some continuous limit, uniformly on each bounded interval.

The Kushner–Clark Approach.

To illustrate an approach to w.p.1 convergence via the ODE method, consider a simple case of Eq. (3).12,25 Let Y_n = g_n(θ_n) + ψ_n + β_n, where β_n → 0 w.p.1. For n ≥ 0, define

Ψ^n(t) = ∑_{i=n}^{m(t_n+t)−1} ε_i ψ_i,  B^n(t) = ∑_{i=n}^{m(t_n+t)−1} ε_i β_i,  Z^n(t) = ∑_{i=n}^{m(t_n+t)−1} ε_i z_i.

Suppose that for some T > 0,

lim_n sup_{j≥n} max_{0≤t≤T} |Ψ^0(jT + t) − Ψ^0(jT)| = 0 w.p.1. (15)

Equation (15) is also a necessary condition for convergence.29 It is assured by

lim_n P{ sup_{j≥n} max_{0≤t≤T} |∑_{i=m(jT)}^{m(jT+t)−1} ε_i ψ_i| ≥ µ } = 0, for each µ > 0. (16)

Suppose that the g_n(·) are continuous uniformly in n, and that there is a continuous function ḡ(·) such that for each θ ∈ H and t > 0,

|∑_{i=n}^{m(t_n+t)} ε_i [g_i(θ) − ḡ(θ)]| → 0 as n → ∞. (17)

Theorem 2. Under the above conditions {θ^n(·)} is equicontinuous in the extended sense. With probability one, the pathwise limits θ(·) satisfy the constrained ODE θ̇ = ḡ(θ) + z, z(t) ∈ −C(θ(t)), and θ_n converges to a limit set of the ODE.

Proof. We have

θ^n(t) = θ_n + ∑_{i=n}^{m(t+t_n)−1} ε_i g_i(θ_i) + ∑_{i=n}^{m(t+t_n)−1} ε_i(ψ_i + β_i) + ∑_{i=n}^{m(t+t_n)−1} ε_i z_i. (18)

The functions defined by the middle sum go to zero w.p.1, and the set defined by the first sum is equicontinuous in the extended sense. Suppose that {Z^n(·)} is not equicontinuous in the extended sense. Then there is a subsequence that has an asymptotic jump, in the sense that there are integers µ_k → ∞, uniformly bounded times s_k, 0 < δ_k → 0 and ρ > 0 (all depending on ω) such that |Z^{µ_k}(s_k + δ_k) − Z^{µ_k}(s_k)| ≥ ρ. The facts that z_n = 0 if θ_{n+1} ∈ H^0, the interior of H, and that z_n ∈ −C(θ_{n+1}) can be used to complete the proof of equicontinuity and to show that any limit must satisfy the constrained ODE θ̇ = ḡ(θ) + z, z(t) ∈ −C(θ(t)). The limits of θ_n must lie in a limit set of the ODE.

Assuring Eq. (15) for the martingale-difference noise case [Ref 12, Section 5.3]. Let ψ_n = δM_n, a martingale difference. Define M_n = ∑_{i=0}^{n−1} δM_i. For some even integer p, suppose that

∑_n ε_n^{p/2+1} < ∞,  sup_n E|δM_n|^p < ∞. (19)

Burkholder's inequality [Ref 12, Section 4.1] and the martingale bound

P_{F_n}{ sup_{n≤m≤N} |M_m − M_n| ≥ λ } ≤ E_{F_n} q(M_N − M_n)/q(λ), each λ > 0, (20)

where 0 ≤ q(·) is a nondecreasing convex function, yield

∑_j P{ max_{0≤t≤T} |∑_{i=m(jT)}^{m(jT+t)−1} ε_i δM_i| ≥ µ } < ∞, each µ > 0, (21)

which, via the Borel–Cantelli Lemma, implies Eq. (16).

Alternatively, suppose that ε_n ≤ γ_n/log n, γ_n → 0, and that the moments of δM_n grow no faster than Gaussian, in that there is K < ∞ such that for each component δM_{n,j} of δM_n, E_n e^{γ δM_{n,j}} ≤ e^{γ²K/2}. Suppose also the innocuous condition that for T < ∞, there is c_1(T) < ∞ such that for all n, sup_{n≤i≤m(t_n+T)} ε_i/ε_n ≤ c_1(T). Then Eq. (16) holds [Ref 12, Section 5.3]. Criteria assuring Eq. (15) for more general noise processes are discussed in Ref 25. If g(θ_n, ψ_n) replaces g_n(θ_n) + ψ_n, then an analysis can still be done,20,25 but becomes more difficult. The weak convergence approach below will be simpler.

Chain Recurrence

The above theorem states that the limits of θ_n are in a limit or invariant set of the ODE. It was shown by Benaïm12,30–32 (assuming uniqueness of the solution for each initial condition) that the limits are in the possibly smaller set of chain recurrent points within the limit set. The chain recurrent points are defined as follows. Let x(t|y) denote the solution to either ẋ = ḡ(x) or the constrained form ẋ = ḡ(x) + z, z ∈ −C(x), at time t, given x(0) = y. A point x is said to be chain recurrent30,31,33 if for each δ > 0 and T > 0 there is an integer k and points u_i, T_i, 0 ≤ i ≤ k, with T_i ≥ T, such that

|x − u_0| ≤ δ, |y_1 − u_1| ≤ δ, . . . , |y_k − u_k| ≤ δ, |y_{k+1} − x| ≤ δ, (22)

where y_i = x(T_{i−1}|u_{i−1}) for i = 1, . . . , k + 1. Not all points in the limit or invariant set are chain recurrent.


Weak Convergence

The classical theory of SA was concerned with w.p.1 convergence to a unique limit. Henceforth we work with weak convergence. It is much easier to use, in that convergence can be proved under weaker and more easily verifiable conditions and generally with less effort. The approach yields almost the same information on the asymptotic behavior. The weak convergence methods have advantages when dealing with problems involving correlated noise and state-dependent noise, decentralized or asynchronous algorithms, and discontinuities in the algorithm.

In applications, there is always a rule that tells us when to stop and accept some function of the recent iterates as the 'final value.' Then the information that we have about the final value is distributional, and there is no difference in the conclusions provided by the probability one and the weak convergence methods. If the application is of concern over a long time interval, then the actual physical model might drift, in which case the step size is not allowed to go to zero, and there is no general alternative to the weak convergence methods. In practical on-line applications, the step size ε_n is usually not allowed to decrease to zero, because of considerations concerning robustness and to allow some tracking of the desired parameter as the system changes slowly over time. Then probability one convergence does not apply. In signal processing applications, it is usual to keep the step size bounded away from zero. The proofs of probability one results are more technical. They can be hard for multiscale, state-dependent-noise cases or decentralized/asynchronous algorithms.

There is always value in knowing convergence w.p.1, when appropriate. Although no iteration continues indefinitely, it is still encouraging to know that if it were to, then it would assuredly converge. However, the concerns that have been raised suggest that methods with slightly more limited convergence goals can be just as useful if they are technically easier, require weaker conditions, and are no less informative when dealing with the practical conditions in applications.

With the ODE approach, where the ODE is obtained by locally averaging the dynamics, the conditions required for the averaging with weak convergence are weaker than those needed for w.p.1 convergence, and the proofs are simpler. Furthermore, the {θ_n} spends 'nearly all' of its time in an arbitrarily small neighborhood of the limit point or set. Once we know that {θ_n} spends 'nearly all' of its time (asymptotically) in some small neighborhood of the limit point, a local analysis near this point can often be used to get convergence w.p.1 [see, for example, Refs 73, 35, Sections 6.9 and 6.10]. Even when only weak convergence can be proved, if θ_n is close to a stable limit point at iterate n, then under broad conditions the mean escape time from a small neighborhood of that limit point is at least of the order of e^{c/ε_n} for some c > 0.

Definitions

With probability 1 convergence required a type of (w.p.1) compactness of the paths of the SA sequence. Given this compactness, one then showed that the limits satisfy the ODE. With weak convergence, we can no longer work with the individual paths, and the compactness is of the set of probability measures. This can be translated into information on the asymptotic properties of the paths. Let D^r[0, ∞) denote the space of R^r-valued functions on [0, ∞) that are right-continuous with left-hand limits. The Skorokhod topology is used. The exact definition is unimportant here,36,37 except to note that the limits of a sequence that is compact in this topology, and whose discontinuities go to zero, are continuous. This topology is common in weak convergence analysis, and is convenient as we are working with piecewise-constant interpolations θ^n(·). One could use piecewise-linear interpolations, but the criteria for compactness and the details of proof are simpler with our choice.

A set of probability measures {P_n(·)} on D^r[0, ∞) is said to be tight, or weakly sequentially compact, if for each δ > 0 there is a compact set S_δ such that P_n(S_δ) ≥ 1 − δ for all n. P_n is said to converge to P weakly if ∫ f(x)P_n(dx) → ∫ f(x)P(dx) for all bounded and continuous real-valued functions f(·). A tight sequence always contains a weakly convergent subsequence. The P_n, P are the measures of processes with paths in D^r[0, ∞), and it is these processes that are of concern. If the set of measures is tight or converges weakly, then, abusing notation, we use the same terminology for the processes. More detail on the use in SA theory is given in Ref 12.

Consider Eq. (3) and, henceforth, let {Y_n} be uniformly integrable. Then {θ^n(·)} is tight. Also, henceforth, suppose that {Y^ε_n} in Eq. (4) is uniformly integrable. Then {θ^ε(T + ·), T < ∞, ε > 0} is tight. In both cases, the limit processes, as n → ∞, or ε → 0, T → ∞, have Lipschitz continuous paths w.p.1. For the analog of Theorem 2 with martingale-difference noise, the sequence {Ψ^n(·)} is tight with zero weak-sense limits if only ε_n → 0. Then all weak-sense limits satisfy the constrained ODE. The analog of Eq. (16) is the much simpler requirement: for any T > 0 and all δ > 0,

lim_n P{ max_{0≤t≤T} |Ψ^n(t)| ≥ δ } = 0. (23)


Correlated Noise: A Convergence Theorem

Constant ε will be used to simplify the notation. The results are analogous if ε_n → 0. Rewrite Eq. (4) as

θ^ε_{n+1} = θ^ε_n + ε E_n Y^ε_n + ε δM^ε_n + ε z^ε_n, (24)

where δM^ε_n = Y^ε_n − E_n Y^ε_n and E_n Y^ε_n = g(θ^ε_n, ξ^ε_{n+1}), where ξ^ε_n is a random sequence. Define the martingale M^ε_n = ∑_{i=0}^n ε δM^ε_i, with M^ε(·) being the continuous-time interpolation. Analogously, define Z^ε(·) from the {z^ε_n}. Let g(θ, ξ^ε_{n+1}) be continuous in θ, in the mean, uniformly in n. Suppose that there is a function ḡ(·) such that for each θ,

lim_{n_ε→∞, εn_ε→0} (1/n_ε) ∑_{i=m}^{m+n_ε−1} E_m g(θ, ξ^ε_{i+1}) = ḡ(θ) (25)

in mean, and uniformly in m. Owing to the use of the conditional expectation, this is a weak form of the law of large numbers. (Discontinuous and θ-dependent noise can also be handled, but is more complicated.)

Theorem 3. Under the above conditions, {θ^ε(T + ·)} is tight and any weak-sense limit as ε → 0, T → ∞, is in the limit or invariant set of the constrained ODE. If the solution is unique for each initial condition, then the limit points are chain recurrent.

Proof. Write

θ^ε(T + t) − θ^ε(T) = ∑_{i=T/ε}^{(T+t)/ε−1} ε E_i g(θ^ε_i, ξ^ε_{i+1}) + [M^ε(T + t) − M^ε(T)] + [Z^ε(T + t) − Z^ε(T)]. (26)

If sup_{ε,n} E|Y^ε_n|² < ∞, then the martingale property implies that

E sup_{s≤t} |M^ε(T + s) − M^ε(T)|² = E ∑_{i=T/ε}^{(T+t)/ε−1} [ε δM^ε_i]² = O(tε), (27)

uniformly in T, implying that M^ε(T_ε + ·) − M^ε(T_ε) converges weakly to zero, no matter what the sequence T_ε. This latter property also holds under only uniform integrability. Let T_ε → ∞. The sequence {θ^ε(T_ε + ·)} is tight and any weak-sense limit has Lipschitz continuous paths w.p.1 (by the uniform integrability assumption). As the weak-sense limit of the martingale terms is zero and those of the sum in Eq. (26) have Lipschitz continuous paths, the weak-sense limits of Z^ε(T_ε + ·) − Z^ε(T_ε) are Lipschitz continuous. Fix a weakly convergent subsequence, index it by ε for notational simplicity, and denote the limit processes by θ(·) and Z(·), with z(t) = Ż(t).

To get the limit ODE we need to consider the first sum in Eq. (26). It will be seen that for continuous f(·), any integer q, any t > 0, and any s_i ≤ t, i ≤ q, τ > 0,

E f(θ(s_i), i ≤ q)[θ(t + τ) − θ(t) − ∫_t^{t+τ} ḡ(θ(u))du − [Z(t + τ) − Z(t)]] = 0, (28)

where Ż(t) ∈ −C(θ(t)). This implies that θ(t) − θ(0) − ∫_0^t ḡ(θ(u))du − Z(t) is a martingale. The (absolute) mean value is zero, hence it is zero. Thus θ(·) satisfies the constrained ODE.

The proof proceeds by local averaging, using the fact that θ^ε_n varies slowly. We can write

E f(θ^ε(s_i), i ≤ q)[θ^ε(t + τ) − θ^ε(t) − ε ∑_{j=t/ε}^{(t+τ)/ε−1} g(θ^ε_j, ξ^ε_{j+1}) − [Z^ε(t + τ) − Z^ε(t)]] = 0. (29)

Without loss of generality we can let T_ε = 0. Consider the term with the sum, collect terms in groups of size n_ε → ∞, with δ_ε = εn_ε → 0, and rewrite:

E f(θ^ε(s_i), i ≤ q){ θ^ε(t + τ) − θ^ε(t) − ∑_{l: lδ_ε=t}^{t+τ−δ_ε} δ_ε [ (1/n_ε) ∑_{j=ln_ε}^{ln_ε+n_ε−1} E_{ln_ε} g(θ^ε_j, ξ^ε_{j+1}) ] }. (30)

By the continuity and uniform integrability, modulo negligible terms, this equals

E f(θ^ε(s_i), i ≤ q){ θ^ε(t + τ) − θ^ε(t) − ∑_{l: lδ_ε=t}^{t+τ−δ_ε} δ_ε [ (1/n_ε) ∑_{j=ln_ε}^{ln_ε+n_ε−1} E_{ln_ε} g(θ^ε_{ln_ε}, ξ^ε_{j+1}) ] }. (31)


As ε → 0, this expression is approximated by

E f(θ^ε(s_i), i ≤ q){ θ^ε(t + τ) − θ^ε(t) − ∑_{l: lδ_ε=t}^{t+τ−δ_ε} δ_ε ḡ(θ^ε_{ln_ε}) }, (32)

where the sum is asymptotically replaceable by an integral. The reflection term is treated similarly to what was done in Theorem 2.
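The averaging condition (25) can be illustrated with a simple correlated-noise example of our own (not from the survey): exogenous AR(1) noise ξ_n and g(θ, ξ) = −θ + ξ. The conditional averages of g(θ, ξ_i) converge to ḡ(θ) = −θ, so the mean ODE is θ̇ = −θ with limit point 0, and the constant-step iterate settles near that point.

```python
import random

def sa_with_ar1_noise(eps=0.01, n_iters=5000, seed=3):
    """Constant-step SA as in Eq. (24), with exogenous AR(1) noise:
    xi_{n+1} = 0.9*xi_n + w_n and E_n Y^eps_n = g(theta, xi) = -theta + xi.
    The conditional averages of g converge to g_bar(theta) = -theta
    (Eq. (25)), so the mean ODE is theta_dot = -theta."""
    rng = random.Random(seed)
    theta, xi = 5.0, 0.0
    for _ in range(n_iters):
        xi = 0.9 * xi + rng.gauss(0.0, 0.1)   # correlated, zero-mean noise
        theta += eps * (-theta + xi)
    return theta

est = sa_with_ar1_noise()
```

Even though consecutive noise terms are strongly correlated, the local averages over windows of length n_ε with εn_ε small behave like their mean, which is exactly the mechanism Theorem 3 formalizes.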

EXTENSIONS

Multiple Time Scales

In multiple-time-scale systems, the components of the iterate are divided into groups, with each group having its own step-size sequence: say, for two groups, ε_n and µ_n, which are of different orders in that ε_n/µ_n → 0. The methods follow the same lines as above, with the usual modifications used in singular perturbations: essentially, one deals with the fast system first, getting its limits as a function of the fixed slow variable, and then shows that the slow system can be analyzed with the limiting fast variables substituted in [Ref 12, Section 8.6; Refs 38, 39]. The rate of convergence can be very slow.
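A minimal two-time-scale sketch (our own toy example, not from the survey): the fast iterate x_n, with step µ_n = n^{−1/2}, tracks the current slow iterate, while the slow iterate θ_n, with step ε_n = 1/n (so ε_n/µ_n → 0), is driven by the quasi-equilibrated fast variable, giving it the mean ODE θ̇ = −θ.

```python
import random

def two_time_scale(n_iters=20000, seed=4):
    """Two-time-scale SA sketch: the fast iterate x_n (step mu_n = n**-0.5)
    tracks the slow iterate, while the slow iterate theta_n (step
    eps_n = 1/n, eps_n/mu_n -> 0) sees the quasi-equilibrated fast
    variable, so its effective mean ODE is theta_dot = -theta."""
    rng = random.Random(seed)
    theta, x = 4.0, 0.0
    for n in range(1, n_iters + 1):
        mu, eps = n ** -0.5, 1.0 / n
        x += mu * (theta - x + rng.gauss(0.0, 0.1))   # fast: x tracks theta
        theta += eps * (-x + rng.gauss(0.0, 0.1))     # slow: driven by x ~ theta
    return theta, x

theta, x = two_time_scale()
```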

Random Directions

For high-dimensional Kiefer–Wolfowitz type problems, the original algorithms repeat Eq. (7) for each component. The rate of convergence can be very slow, and one can waste much time iterating on unimportant parameters. This suggests the possible value of iterating in random directions, using an algorithm such as

θ_{n+1} = θ_n − ε_n d_n[Y^+_n − Y^−_n]/2c_n, (33)

where Y^±_n are observations taken at parameter values θ_n ± c_n d_n, and the d_n are mutually independent and identically distributed random direction vectors. Such algorithms were discussed in Ref 25, where the d_n were uniformly distributed on the unit sphere. Spall40–42 showed that there can be a considerable advantage if the directions d_n = (d_{n,1}, . . . , d_{n,r}) are selected so that the set of components {d_{n,i}; n, i} are mutually independent, with d_{n,i} taking values ±1 each with probability 1/2; that is, the direction vectors are the corners of the unit cube in R^r. In Ref 12 [Section 10.7], it was shown that the result is the same if they are uniformly distributed on the sphere of radius √r. An analysis of the rate of convergence suggests that the improvement is best if there are many components of θ that are relatively unimportant.
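A sketch of the random-directions iteration (33) with the ±1 directions described above, on a hypothetical three-dimensional quadratic of our own choosing (the offset in ε_n = 1/(n+20) is purely to tame the transient). Note that only two noisy function evaluations are needed per step, regardless of the dimension r.

```python
import random

def random_directions_kw(f, theta0, n_iters, seed=5):
    """Random-directions algorithm, Eq. (33): perturb all r coordinates at
    once along d_n, whose components are independent +-1 with probability
    1/2, so each step uses two noisy evaluations of f whatever r is."""
    rng = random.Random(seed)
    theta = list(theta0)
    r = len(theta)
    for n in range(1, n_iters + 1):
        eps, c = 1.0 / (n + 20), (n + 20) ** -0.25
        d = [rng.choice((-1.0, 1.0)) for _ in range(r)]
        y_plus = f([t + c * di for t, di in zip(theta, d)]) + rng.gauss(0.0, 0.1)
        y_minus = f([t - c * di for t, di in zip(theta, d)]) + rng.gauss(0.0, 0.1)
        step = eps * (y_plus - y_minus) / (2.0 * c)
        theta = [t - step * di for t, di in zip(theta, d)]
    return theta

# Hypothetical toy objective with minimum at (1, 1, 1).
f = lambda v: sum((vi - 1.0) ** 2 for vi in v)
est = random_directions_kw(f, [3.0, -2.0, 0.0], n_iters=60000)
```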

Linear System Identification and Tracking

The problem of estimating or tracking the value of a time-varying parameter in linear systems has applications in adaptive control theory, and in communications theory for adaptive equalizers and noise cancellation, etc., where the signal, noise, and channel properties change with time. Consider the observation model y_n = φ′_n θ̄ + ν_n, where θ̄ is the value of the unknown physical parameter, ν_n is observation noise, and φ_n is a regression vector. The values of y_n, φ_n are observed. Owing to the importance of this problem there has been a vast literature.20,43–51 A basic algorithm, based on least squares approximation, has the form θ^ε_{n+1} = θ^ε_n + ε φ_n[y_n − φ′_n θ^ε_n], where θ^ε_n is the current estimate of θ̄. Now, replace θ̄ by θ̄_n, and let the distributions vary slowly. Much attention has been devoted to getting the optimal value of ε.20,47 A successful stochastic approximation scheme for tracking in the time-varying case was given in Ref 52, further developed in Ref 53, and discussed in Ref 12 [Section 3.2]. Then ε_n replaces ε, but it does not converge to zero. The updating algorithm for ε_n is the actual SA algorithm. An application to an adaptive antenna system is given in Ref 54. If the φ_n are not observable, then we have 'blind optimization,' which is substantially harder.55
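The basic least-squares (LMS-type) algorithm above can be sketched in a scalar toy version of our own, with an artificial linear drift in θ̄_n and a fixed step ε kept bounded away from zero so that the estimate can follow the drift.

```python
import random

def lms_track(n_iters=4000, eps=0.05, seed=6):
    """Constant-step least-squares form
    theta <- theta + eps*phi_n*(y_n - phi_n*theta), tracking a slowly
    drifting scalar parameter theta_bar_n.  A decreasing step would
    freeze the estimate and lose the drifting target."""
    rng = random.Random(seed)
    theta, theta_bar = 0.0, 1.0
    for _ in range(n_iters):
        theta_bar += 0.0001                 # slow drift of the true parameter
        phi = rng.gauss(0.0, 1.0)           # scalar 'regression vector'
        y = phi * theta_bar + rng.gauss(0.0, 0.1)   # observation y_n
        theta += eps * phi * (y - phi * theta)
    return theta, theta_bar

theta, theta_bar = lms_track()
```

The fixed ε trades a small steady-state fluctuation for the ability to stay within a lag of roughly (drift rate)/ε of the moving parameter.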

Iterate Averaging Methods

Consider the multivariate RM algorithm. If ε_n goes to zero slower than O(1/n), in that ε_n/ε_{n+1} = 1 + o(ε_n), then the rate of convergence of the average Θ_n = Σ_{i=n−N_n}^{n} θ_i/N_n, where n − N_n → ∞, is nearly optimal under broad conditions, in that it is the rate that would be achieved if ε_n were replaced by ε_n K for the asymptotically optimal matrix K. This surprising result was shown by Polyak and Juditsky56,57 and Ruppert,18 and analyzed under more general conditions in Refs 12 [Section 11.1] and 58–65. In particular, Kushner60 showed that √n(Θ_n − θ̄) converges in distribution to a normally distributed random variable with mean zero and covariance V, where V is the smallest possible [Ref 12, Section 11.1].
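A toy scalar illustration of iterate averaging (hypothetical problem and constants): with ε_n = n^{-2/3}, which goes to zero slower than O(1/n), the trailing-window average of the iterates is far less noisy than the raw iterate.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_bar = 2.0            # root of g(theta) = theta_bar - theta (illustrative)
theta = 0.0
iterates = []
N = 200000
for n in range(1, N + 1):
    eps_n = n ** (-2.0 / 3.0)                        # slower than O(1/n)
    Y = (theta_bar - theta) + rng.standard_normal()  # noisy observation of g
    theta = theta + eps_n * Y                        # Robbins-Monro step
    iterates.append(theta)

# average over a trailing window [n - N_n, n] with n - N_n -> infinity
theta_avg = float(np.mean(iterates[N // 2:]))
```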

Rate of Convergence

Classical rate of convergence results for the RM process concern the asymptotic distribution of U_n = (θ_n − θ̄)/√ε_n and U^ε_n = (θ^ε_n − θ̄)/√ε, as n → ∞ or ε → 0, nε → ∞. Define the interpolations U^n(·) and U^ε(·) with intervals ε_n and ε, respectively. Suppose that θ^n(·) (or θ^ε(T + ·), as ε → 0, T → ∞) converges weakly to θ̄ ∈ H_0. Refs 25, 66, 67 introduced the idea of studying the rate by showing that, under appropriate conditions, U^n(·) and U^ε(T + ·) converge to stationary mean-zero linear diffusions as n → ∞ or ε → 0, T → ∞, and the covariance of these limits serves as a measure of the rate of convergence. A more complete development is given in Refs 12 [Chapter 10] and 20.

Consider the constant-ε case, where E_n Y^ε_n = g(θ^ε_n, ξ^ε_{n+1}) and g(·, ξ) is continuously differentiable near θ̄. Let there be a Hurwitz matrix A such that

(1/m) Σ_{i=n}^{n+m−1} [E^ε_n g′_θ(θ̄, ξ^ε_i) − A] → 0    (34)

in probability as ε → 0 and n, m → ∞. Suppose that the sequence defined by √ε Σ_{i=T/ε}^{(t+T)/ε} (g(θ̄, ξ^ε_i) + δM^ε_i) converges weakly to a Wiener process with covariance Σ. Then, under some other mild conditions, U^ε(T + ·) converges weakly to the stationary process U(·) defined by dU = AU dt + dW, where Σ is the covariance of the Wiener process W.

For ε_n = K/n, where K is a positive definite matrix, the limit is dU = (I/2 + KA)U dt + K dW. In the sense of minimizing the trace of the stationary covariance matrix, the optimal value of K is −A⁻¹, and the corresponding covariance is A⁻¹Σ(A⁻¹)′. If ε_n → 0 slower than O(1/n), in that (ε_n/ε_{n+1})^{1/2} = 1 + o(ε_n), then the I/2 term is dropped. See Ref 12 [Chapter 10]. If the limit θ̄ is on the boundary of H, then there might be a reflection term in the diffusion.68 The rate of convergence of moments for the random directions algorithm is discussed in Ref 69.
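The claim about the optimal K can be checked numerically. The sketch below (with illustrative matrices, not from the survey) computes the stationary covariance V of the limit dU = (I/2 + KA)U dt + K dW from the Lyapunov equation (I/2 + KA)V + V(I/2 + KA)′ + KΣK′ = 0, and confirms that K = −A⁻¹ gives V = A⁻¹Σ(A⁻¹)′, with a smaller trace than a perturbed gain.

```python
import numpy as np

def stationary_cov(M, Q):
    """Solve the Lyapunov equation M V + V M' + Q = 0 for V (M must be stable)."""
    n = M.shape[0]
    I = np.eye(n)
    # column-major vec: vec(M V) = (I kron M) vec(V), vec(V M') = (M kron I) vec(V)
    L = np.kron(I, M) + np.kron(M, I)
    v = np.linalg.solve(L, -Q.flatten(order="F"))
    return v.reshape((n, n), order="F")

A = np.array([[-2.0, 0.5], [0.0, -1.0]])      # Hurwitz matrix (illustrative)
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])    # Wiener covariance (illustrative)
I2 = np.eye(2)

def V_of(K):
    """Stationary covariance of dU = (I/2 + KA)U dt + K dW."""
    M = 0.5 * I2 + K @ A
    return stationary_cov(M, K @ Sigma @ K.T)

K_opt = -np.linalg.inv(A)
V_opt = V_of(K_opt)   # should equal A^{-1} Sigma (A^{-1})'
```

With K = −A⁻¹ the drift reduces to −I/2, so the Lyapunov equation gives V = A⁻¹Σ(A⁻¹)′ directly, matching the covariance stated above.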

An alternative point of view uses the so-called strong diffusion approximation and estimates the asymptotic difference between the normalized error process and a Wiener process in terms of an 'iterated logarithm'.70,71

State-dependent Noise

So far, it has been assumed that E_n Y_n = g(θ_n, ξ_{n+1}) (or the analogous form for ε_n = ε), and it was implicitly supposed that the evolution of the noise did not depend on the values of the iterates, in that P{ξ_{n+1} ∈ · | ξ_i, θ_i, i ≤ n} = P{ξ_{n+1} ∈ · | ξ_i, i ≤ n}. In numerous applications, there is a dependence that must be accounted for. The usual form is what is called Markov state dependence. It takes the following form. P(·, ·|θ) is a Markov transition function parameterized by θ such that P{ξ_{n+1} ∈ · | ξ_i, θ_i, i ≤ n} = P(ξ_n, · | θ_n). The analysis follows lines that are analogous to the simpler case, but is technically more complicated. The approach was initiated in Refs 21, 26 and developed further in Ref 72 for weak convergence. See Ref 20 for a complete development of the w.p.1 convergence case. Some comments on the non-Markov case are given in Ref 12 [Section 8.4].
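A toy instance of Markov state dependence (all dynamics illustrative, not from the survey): the noise is an AR(1) sequence whose conditional mean depends on the current iterate, so P{ξ_{n+1} ∈ · | ξ_i, θ_i, i ≤ n} = P(ξ_n, · | θ_n). Freezing θ, the stationary mean of ξ is 1 + 0.5θ, so the mean ODE is θ̇ = (1 + 0.5θ) − θ, with limit point θ̄ = 2.

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 1e-3
theta, xi = 0.0, 0.0
for n in range(200000):
    # transition of xi depends on both xi_n and theta_n (Markov state dependence)
    xi = 0.9 * xi + 0.1 * (1.0 + 0.5 * theta) + 0.05 * rng.standard_normal()
    theta = theta + eps * (xi - theta)   # SA update with Y_n = xi_{n+1} - theta_n
```

The iterate does not track the instantaneous noise; it settles at the root of the averaged dynamics, here θ̄ = 2, computed from the frozen-θ invariant measure.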

Minimizing an Ergodic Cost Function

The Kiefer–Wolfowitz and related procedures are concerned with the minimization of a function whose values at the chosen parameters are observed subject to noise. In many applications, one wishes to minimize functions of the type lim_{T→∞} E ∫_0^T c(θ, x(t, θ)) dt/T, where x(θ, ·) is a random process parameterized by θ, and only samples of the integrand are observed. Effective algorithms are discussed in Refs 12 [Section 9.1] and 72.
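One way such problems can be attacked is sketched below (an illustration, not the specific algorithms of Refs 12, 72): the ergodic cost is replaced by a sample average over a finite window, and a Kiefer–Wolfowitz-type finite-difference step is taken. The AR(1) process, cost function, window length, and gain sequences are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def avg_cost(theta, T, x0, rng):
    """Sample average over a window of length T of c(theta, x_t) = (x_t - 2)^2,
    where x is an AR(1) process whose level depends on theta."""
    x, s = x0, 0.0
    for _ in range(T):
        x = 0.8 * x + theta + 0.1 * rng.standard_normal()
        s += (x - 2.0) ** 2
    return s / T, x

# here the ergodic cost is approximately (5 theta - 2)^2 + const, minimized at theta = 0.4
theta, x = 0.0, 0.0
for n in range(1, 401):
    c_n = 0.5 / n ** 0.25             # difference interval, c_n -> 0
    eps_n = 0.02 / n                  # step size
    f_plus, x = avg_cost(theta + c_n, 200, x, rng)
    f_minus, x = avg_cost(theta - c_n, 200, x, rng)
    theta = theta - eps_n * (f_plus - f_minus) / (2.0 * c_n)
```

Carrying the process state x across windows, rather than restarting it, is what lets the finite-window averages stand in for the ergodic limit.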

Large Deviations

In Refs 12 [Section 6.9] and 35, 37, 39, large deviations methods are used to show that escape from a small neighborhood of a weak-sense limit point is impossible, under weak conditions.

ACKNOWLEDGEMENT

The work was partially supported by NSF grant DMS 0804822.

REFERENCES

1. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951, 22:400–407.

2. Wasan MT. Stochastic Approximation. Cambridge: Cambridge University Press; 1969.

3. Blum JR. Multidimensional stochastic approximation. Ann Math Stat 1954, 9:737–744.

4. Dvoretzky A. On stochastic approximation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1956, 17:39–55.

5. Schmetterer L. Stochastic approximation. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California; 1960, 587–609.


6. Dupac V. On the Kiefer–Wolfowitz approximation method. Casopis Pest Mat 1957, 82:47–75.

7. Dupac V. Stochastic approximation. Kybernetica (Prague) Suppl 1981, 17.

8. Gladyshev EG. On stochastic approximation. Theory Probab Appl 1965, 10:275–278.

9. Robbins H, Siegmund D. A convergence theorem for some nonnegative almost supermartingales and some applications. In: Rustagi JS, ed. Optimizing Methods in Statistics. New York: Academic Press; 1971, 233–257.

10. Sakrison DJ. A continuous Kiefer–Wolfowitz procedure for random processes. Ann Math Stat 1964, 35:590–599.

11. Sacks J. Asymptotic distribution of stochastic approximation processes. Ann Math Stat 1958, 29:373–405.

12. Kushner HJ, Yin G. Stochastic Approximation and Recursive Algorithms and Applications. Berlin and New York: Springer-Verlag; 2003.

13. Kiefer J, Wolfowitz J. Stochastic estimation of the maximum of a regression function. Ann Math Stat 1952, 23:462–466.

14. Nevelson MB, Khasminskii RZ. Stochastic Approximation and Recursive Estimation, vol. 47, Translations of Mathematical Monographs. Providence: American Mathematical Society; 1976.

15. Tsypkin YaZ. Adaptation and Learning in Automatic Systems. New York: Academic Press; 1971.

16. Lai TL. Stochastic approximation. Ann Stat 2003, 31:391–406.

17. Schmetterer L. Multidimensional stochastic approximation. In: Krishnaiah PR, ed. Multivariate Analysis II. New York: Academic Press; 1969, 443–460.

18. Ruppert D. Stochastic approximation. In: Ghosh BK, Sen PK, eds. Handbook in Sequential Analysis. New York: Marcel Dekker; 1991, 503–529.

19. Fabian V. Stochastic approximation. In: Rustagi JS, ed. Optimizing Methods in Statistics. New York: Academic Press; 1971.

20. Benveniste A, Metivier M, Priouret P. Adaptive Algorithms and Stochastic Approximation. Berlin and New York: Springer-Verlag; 1990.

21. Kushner HJ. Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory. Cambridge: MIT Press; 1984.

22. Solo V, Kong X. Adaptive Signal Processing Algorithms. Englewood Cliffs: Prentice-Hall; 1995.

23. Borkar VS, Meyn SP. The o.d.e. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 2000, 38:447–469.

24. Ljung L. Analysis of recursive stochastic algorithms. IEEE Trans Automat Control 1977, 22:551–575.

25. Kushner HJ, Clark DS. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Berlin and New York: Springer-Verlag; 1978.

26. Kushner HJ, Shwartz A. An invariant measure approach to the convergence of stochastic approximations with state dependent noise. SIAM J Control Optim 1984, 22:13–27.

27. Chen H-F. Stochastic Approximation and its Applications. Boston: Kluwer Academic Publishers; 2002.

28. Chen H-F, Zhu YM. Stochastic approximation procedure with randomly varying truncations. Sci Sin Ser A 1986, 29:914–926.

29. Wang IJ, Chong EKP, Kulkarni SR. Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms. Adv Appl Probab 1996, 28:784–801.

30. Benaïm M. A dynamical systems approach to stochastic approximations. SIAM J Control Optim 1996, 34:437–472.

31. Benaïm M. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités, Lecture Notes in Mathematics 1769. Berlin and New York: Springer-Verlag; 1999, 1–69.

32. Benaïm M. Dynamics of stochastic approximation algorithms. Séminaire de Probabilités, vol. XXXIII, Lecture Notes in Mathematics 1709. Berlin and New York: Springer; 1999, 1–68.

33. Guckenheimer J, Holmes P. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Berlin and New York: Springer-Verlag; 1983.

34. Kushner HJ, Dupuis P. Numerical Methods for Stochastic Control Problems in Continuous Time. Berlin and New York: Springer-Verlag; 1992 (2nd edition, 2001).

35. Dupuis P, Kushner HJ. Stochastic approximation and large deviations: upper bounds and w.p.1 convergence. SIAM J Control Optim 1989, 27:1108–1135.

36. Billingsley P. Convergence of Probability Measures, 2nd edn. New York: John Wiley & Sons; 1999.

37. Ethier SN, Kurtz TG. Markov Processes: Characterization and Convergence. New York: John Wiley & Sons; 1986.

38. Bhatnagar S, Borkar VS. A two-time-scale stochastic approximation scheme for simulation based parametric optimization. Probab Eng Inform Sci 1998, 12:519–531.

39. Kushner HJ. Diffusion approximations to output processes of nonlinear systems with wide-band inputs, with applications. IEEE Trans Inf Theory 1980, 26:715–725.

40. Sadegh P, Spall JC. Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Autom Control 1998, 43:1480–1484.

41. Spall JC. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Autom Control 1992, 37:331–341.


42. Spall JC. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans Autom Control 2000, 45:1839–1853.

43. Delyon B, Juditsky A. Asymptotical study of parameter tracking algorithms. SIAM J Control Optim 1995, 33:323–345.

44. Gunnarsson S, Ljung L. Frequency domain tracking characteristics of adaptive algorithms. IEEE Trans Acoust Speech Signal Process 1989, 37:1072–1089.

45. Guo L. Stability of recursive tracking algorithms. SIAM J Control Optim 1994, 32:1195–1225.

46. Guo L, Ljung L. Performance analysis of general tracking algorithms. IEEE Trans Autom Control 1995, AC-40:1388–1402.

47. Guo L, Ljung L, Priouret P. Tracking performance analyses of the forgetting factor RLS algorithm. In: Proceedings of the 31st Conference on Decision and Control, Tucson, Arizona. New York: IEEE; 1992, 688–693.

48. Ljung L. System Identification: Theory for the User. Englewood Cliffs: Prentice-Hall; 1986.

49. Ljung L, Soderstrom T. Theory and Practice of Recursive Identification. Cambridge: MIT Press; 1983.

50. Solo V. The limit behavior of LMS. IEEE Trans Acoust Speech Signal Process 1989, 37:1909–1922.

51. Widrow B, Stearns SD. Adaptive Signal Processing. Englewood Cliffs: Prentice-Hall; 1985.

52. Brossier J-M. Egalization Adaptive et Estimation de Phase: Application aux Communications Sous-Marines. PhD thesis, Institut National Polytechnique de Grenoble, 1992.

53. Kushner HJ, Yang J. Analysis of adaptive step size SA algorithms for parameter tracking. IEEE Trans Autom Control 1995, 40:1403–1410.

54. Buche R, Kushner HJ. Adaptive optimization of least squares tracking algorithms: with applications to adaptive antenna arrays for randomly time-varying mobile communications systems. IEEE Trans Autom Control 2005, 50:1749–1760.

55. Benveniste A, Goursat M, Ruget G. Robust identification of a nonminimum phase system: blind adjustment of a linear equalizer in data communications. IEEE Trans Autom Control 1980, AC-25:385–399.

56. Juditsky A. A stochastic estimation algorithm with observation averaging. IEEE Trans Autom Control 1993, 38:794–798.

57. Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM J Control Optim 1992, 30:838–855.

58. Chen H-F. Asymptotically efficient stochastic approximation. Stochastics Stochastic Rep 1993, 45:1–16.

59. Delyon B, Juditsky A. Stochastic optimization with averaging of trajectories. Stochastics Stochastic Rep 1992, 39:107–118.

60. Kushner HJ, Yang J. Stochastic approximation with averaging of the iterates: optimal asymptotic rates of convergence for general processes. SIAM J Control Optim 1993, 31:1045–1062.

61. Kushner HJ, Yang J. Stochastic approximation with averaging and feedback: faster convergence. In: Goodwin GC, Astrom K, Kumar PR, eds. IMA Volumes in Mathematics and Applications: Adaptive Control, Filtering and Signal Processing, vol. 74, The IMA Series. Berlin and New York: Springer-Verlag; 1995, 205–228.

62. Kushner HJ, Yang J. Stochastic approximation with averaging and feedback: rapidly convergent 'on line' algorithms. IEEE Trans Autom Control 1995, 40:24–34.

63. Le Breton A. Averaging with feedback in Gaussian schemes in stochastic approximation. Math Methods Stat 1997, 6:313–331.

64. Yin G. On extensions of Polyak's averaging approach to stochastic approximation. Stochastics Stochastic Rep 1991, 36:245–264.

65. Yin G. Stochastic approximation via averaging: Polyak's approach revisited. In: Pflug G, Dieter U, eds. Lecture Notes in Economics and Mathematical Systems 374. Berlin and New York: Springer-Verlag; 1992, 119–134.

66. Kushner HJ. Rates of convergence for sequential Monte Carlo optimization methods. SIAM J Control Optim 1978, 16:150–168.

67. Kushner HJ, Huang H. Rates of convergence for stochastic approximation type algorithms. SIAM J Control Optim 1979, 17:607–617.

68. Buche R, Kushner HJ. Rate of convergence for constrained stochastic approximation algorithms. SIAM J Control Optim 2001, 40:1011–1041.

69. Gerencser L. Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Trans Autom Control 1999, 44:894–905.

70. Joslin JA, Heunis AJ. Law of the iterated logarithm for constant-gain linear stochastic gradient algorithm. SIAM J Control Optim 2000, 39:533–570.

71. Pezeshki-Esfanahani H, Heunis AJ. Strong diffusion approximations for recursive stochastic algorithms. IEEE Trans Inform Theory 1997, 43:312–323.

72. Kushner HJ, Vazquez-Abad FJ. Stochastic approximation algorithms for systems over an infinite horizon. SIAM J Control Optim 1996, 34:712–756.

73. Dupuis P, Kushner HJ. Stochastic approximation via large deviations: asymptotic properties. SIAM J Control Optim 1985, 23:675–696.

74. Dupuis P, Kushner HJ. Asymptotic behavior of constrained stochastic approximations via the theory of large deviations. Probab Theory Related Fields 1987, 75:223–244.
