
CONVERGENCE RATE OF STOCHASTIC APPROXIMATION ALGORITHMS IN THE DEGENERATE CASE∗

HAN-FU CHEN†

SIAM J. CONTROL OPTIM., © 1998 Society for Industrial and Applied Mathematics, Vol. 36, No. 1, pp. 100–114, January 1998

Abstract. Let $f(\cdot)$ be an unknown function whose root $x_0$ is sought by stochastic approximation (SA). Convergence rate and asymptotic normality are usually established for the nondegenerate case $f'(x_0) \ne 0$. This paper demonstrates the convergence rate of SA algorithms for the degenerate case $f'(x_0) = 0$. In comparison with previous work, no growth rate restriction is imposed on $f(\cdot)$, no statistical property is required of the measurement noise, a general step size is considered, and the result is obtained for the multidimensional case, which is not a straightforward extension of the one-dimensional result. Although the observation noise may be either deterministic or random, the analysis is purely deterministic and elementary.

Key words. stochastic approximation, convergence rate

AMS subject classification. 62L20

PII. S0363012995281730

1. Introduction. The topic of SA is to search for the roots or extremes of an unknown function $f(\cdot): \mathbf{R}^l \to \mathbf{R}^l$ which can be observed with noise. Since the pioneering work by Robbins and Monro [1], SA has received much attention from researchers [2, 3] and is applied in various areas, such as parameter identification, adaptive control, optimization, pattern recognition, and others [4].

In many applications not only the convergence but also the convergence rate of the algorithm is of interest. Intuitively, the rate of convergence depends on the derivative $f'(x_0)$ of the function at its root $x_0$; the rate in the nondegenerate case ($f'(x_0) \ne 0$) should be faster than in the degenerate case ($f'(x_0) = 0$). To be precise, the Robbins–Monro algorithm is defined by

$$x_{n+1} = x_n + a_n y_{n+1}, \tag{1.1}$$

$$y_{n+1} = f(x_n) + \varepsilon_{n+1}, \tag{1.2}$$

where $y_{n+1}$ is the observation and $\varepsilon_{n+1}$ is the noise. $\{a_n\}$ is the step size and is selected to have the following properties:

$$a_n > 0, \quad a_n \xrightarrow[n\to\infty]{} 0, \quad \text{and} \quad \sum_{i=1}^{\infty} a_i = \infty.$$

Under certain conditions [1–4, 7] imposed on $f(\cdot)$ and $\varepsilon_n$, $x_n$ defined by (1.1), (1.2) converges to the root $x_0$ of $f(\cdot)$, i.e.,

$$x_n \xrightarrow[n\to\infty]{} x_0, \qquad f(x_0) = 0.$$

Further, in the nondegenerate multidimensional case assume

$$f(x) = H(x - x_0) + \Delta(x), \quad H < 0, \tag{1.3}$$

∗Received by the editors February 17, 1995; accepted for publication (in revised form) October 9, 1996. This research was supported by the National Natural Science Foundation of China. http://www.siam.org/journals/sicon/36-1/28173.html

†Laboratory of Systems and Control, Institute of Systems Science, Chinese Academy of Sciences, Beijing 100080, People's Republic of China ([email protected]).


$$\Delta(x) = O(\|x - x_0\|^2) \quad \text{as } x \to x_0. \tag{1.4}$$

Then under some conditions on the observation noise

$$\|x_n - x_0\| = o(a_n^{\delta}) \quad \forall \delta \in \left(0, \tfrac{1}{2}\right), \tag{1.5}$$

provided $H + q\delta I$ is a stable matrix, where $q$ by assumption is defined by

$$a_{n+1}^{-1} - a_n^{-1} \xrightarrow[n\to\infty]{} q \ge 0.$$

The convergence rate in the case $f'(x_0) = 0$ was addressed in [5] for the special case where (i) $f(\cdot)$ is a scalar function, i.e., $l = 1$, and $f(x)$ grows not faster than linearly as $|x| \to \infty$; (ii) $(x - x_0) f(x) < 0$ $\forall x \ne x_0$; (iii) $f(x) = f_0 |x - x_0|^{1+\gamma} \operatorname{sign}(x - x_0) \cdot (1 + o(1))$ as $x \to x_0$, $\gamma > 0$; (iv) the conditional variance of $\varepsilon_{n+1}$ given $x_n$ is bounded, i.e., $\operatorname{Var}(\varepsilon_{n+1} \mid x_n) \le \sigma^2$; (v) $\varepsilon_{n+1}$ is conditionally independent of $x_0, \ldots, x_{n-1}$ given $x_n$; and (vi) the step size is special: $a_n = \frac{1}{n}$. In comparison with [5], this paper derives the convergence rate for the general case. To be precise, we do not impose any growth rate restriction on $f(\cdot)$; we do not require any statistical property of the noise, which is allowed to be stochastic or deterministic; we consider a general step size $a_n$; and, finally, we give the convergence rate for both the multidimensional and one-dimensional cases. The approach used here is completely different from that used in [5] and is purely deterministic. A purely deterministic approach in a discrete setting was used in [9, 10] as an alternative means for obtaining convergence results, and the approach used here is similar in flavor. We further show the power of an elementary deterministic analysis by obtaining convergence rates. It is worth noting that the extension from the one-dimensional result to the multidimensional case is not straightforward. As will be seen in section 2, in the multidimensional case only the upper bound is obtained, while in the one-dimensional case it is shown that the upper bound is attainable.

2. Main results. Before describing the main results of the paper we present a convergence result, proved in [4, 6]. The algorithm considered in this paper is a modified version of (1.1), (1.2) and is defined as follows.

Let $\{M_k\}$ be a sequence of real numbers, $M_i > 0$, $M_i \uparrow \infty$, and let $x^*$ be a fixed point in $\mathbf{R}^l$. The estimate $x_n$ is recursively given by

$$\bar{x}_{k+1} = x_k + a_k y_{k+1}, \quad x_0 \ \text{arbitrary}, \tag{2.1}$$

$$x_{k+1} = \bar{x}_{k+1} I_{[\|\bar{x}_{k+1}\| \le M_{\sigma_k}]} + x^* I_{[\|\bar{x}_{k+1}\| > M_{\sigma_k}]}, \tag{2.2}$$

$$\sigma_k = \sum_{i=0}^{k-1} I_{[\|\bar{x}_{i+1}\| > M_{\sigma_i}]}, \tag{2.3}$$

$$y_{k+1} = f(x_k) + \varepsilon_{k+1}. \tag{2.4}$$

Since $M_i$ diverges, algorithm (2.1)–(2.4) coincides with the Robbins–Monro algorithm (1.1), (1.2) starting from some time, if we can prove that $\{x_k\}$ defined by (2.1)–(2.4) is bounded.
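As an illustration only (not part of the original paper), the expanding-truncation scheme (2.1)–(2.4) is easy to state in code. The sketch below is one-dimensional; the function names and the toy problem ($f(x) = -x$ with a decaying deterministic noise) are our own choices.

```python
def truncated_rm(f, noise, a, M, x_star=0.0, x0=0.0, n_steps=1000):
    """Expanding-truncation Robbins-Monro scheme, cf. (2.1)-(2.4).

    f     : function whose root is sought (unknown in practice)
    noise : noise(k) -> observation noise eps_{k+1}
    a     : a(k)     -> step size a_k
    M     : M(i)     -> truncation bound M_i, increasing to infinity
    """
    x, sigma = x0, 0                      # sigma counts truncations, (2.3)
    for k in range(n_steps):
        y = f(x) + noise(k)               # observation y_{k+1}, (2.4)
        x_bar = x + a(k) * y              # tentative update, (2.1)
        if abs(x_bar) <= M(sigma):        # keep it if inside the bound, (2.2)
            x = x_bar
        else:                             # otherwise reset to x* and
            x, sigma = x_star, sigma + 1  # enlarge the bound
    return x

# Toy run: f(x) = -x (nondegenerate) with noise eps_k = (-1)^k/(k+1),
# whose weighted partial sums converge, cf. Remark 1 below.
root = truncated_rm(f=lambda x: -x,
                    noise=lambda k: (-1) ** k / (k + 1),
                    a=lambda k: 1.0 / (k + 1),
                    M=lambda i: 10.0 * 2.0 ** i,
                    x0=5.0, n_steps=5000)
print(abs(root))  # small: the iterate is close to the root x0 = 0
```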

Let us list the conditions which will be used later on.

A1. $f(\cdot)$ is an $\mathbf{R}^l \to \mathbf{R}^l$ measurable and locally bounded function, and $f(x) = 0$ $\forall x \in J$; i.e., $J$ is the root set of $f(\cdot)$.


A2. $a_k > 0$, $a_k \xrightarrow[k\to\infty]{} 0$, $\sum_{i=1}^{\infty} a_i = \infty$.

A3. There is a differentiable function $v(\cdot): \mathbf{R}^l \to \mathbf{R}$ such that

$$d(v(x), v(J)) > 0 \quad \text{if } d(x, J) > 0$$

and

$$\sup_{\delta \le d(x,J) \le \Delta} f^{\tau}(x) v_x(x) < 0 \quad \forall\, 0 < \delta < \Delta,$$

where $v_x(x)$ denotes the gradient of $v(x)$,

$$d(x, J) = \inf\{\|x - y\| : y \in J\}, \quad \text{and} \quad v(J) = \{v(x) : x \in J\}.$$

A4. As $x \to x_0$ the function $f(x)$ is expressed as

$$f(x) = H(x - x_0)\|x - x_0\|^{\gamma} + r(x), \quad \gamma > 0, \tag{2.5}$$

where $H$ is a stable matrix (i.e., all its eigenvalues have negative real parts) and

$$r(x) \in \mathbf{R}^l, \quad r(x)/\|x - x_0\|^{1+\gamma} \to 0 \ \text{as } x \to x_0. \tag{2.6}$$

A5.

$$q_n \triangleq a_{n+1}^{-1} - a_n^{-1}, \quad 0 \le q_n, \quad \limsup_{n\to\infty} q_n = q, \quad 0 \le q < \infty, \tag{2.7}$$

$$\sum_{i=1}^{\infty} b_i = \infty, \quad \text{where } b_i = \frac{a_i}{\log a_i^{-1}}. \tag{2.8}$$
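As a concrete (and entirely standard) instance of A2 and A5, our own numerical check below uses $a_n = 1/n$: then $q_n = (n+1) - n = 1$, so $q = 1$ in (2.7), while $b_i = 1/(i \log i)$ in (2.8) gives a series that diverges, but only at a $\log\log$ rate.

```python
import math

a = lambda n: 1.0 / n          # the classical Robbins-Monro step size

# (2.7): q_n = a_{n+1}^{-1} - a_n^{-1} = (n+1) - n = 1, so q = 1
q_n = [1.0 / a(n + 1) - 1.0 / a(n) for n in range(1, 10)]
print(q_n)

# (2.8): b_i = a_i / log(1/a_i) = 1/(i log i); partial sums grow like
# log log N, so the series diverges, but extremely slowly
partial = 0.0
for i in range(2, 10**6):
    partial += a(i) / math.log(1.0 / a(i))
print(partial)  # still only a few units after a million terms
```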

PROPOSITION. Assume A1–A3 hold. If there is a constant $c_0$ such that $\|x^*\| < c_0$, $v(x^*) < \inf_{\|x\| = c_0} v(x)$, and if $v(J)$ is not dense in any interval, then $\{x_k\}$ defined by (2.1)–(2.4) converges to $J$,

$$\lim_{k\to\infty} d(x_k, J) = 0,$$

whenever $\{\varepsilon_i\}$ satisfies the following condition:

$$\lim_{T\to 0} \limsup_{k\to\infty} \frac{1}{T} \left\| \sum_{i=k}^{m(k,t)} a_i \varepsilon_{i+1} \right\| = 0 \quad \forall t \in [0, T], \tag{2.9}$$

where

$$m(k, t) = \max\left\{ m : \sum_{i=k}^{m} a_i \le t \right\}.$$

Remark 1. An obvious condition which guarantees (2.9) is the convergence of the series

$$\sum_{i=1}^{\infty} a_i \varepsilon_{i+1}.$$

Condition (2.9) is also necessary for convergence of $x_n$ to the root of $f(x)$. This is discussed in the recent paper [8], which also shows that (2.9) is equivalent to the


standard Kushner–Clark condition [3]. However, when $\varepsilon_k$ depends on $\{x_0, \ldots, x_{k-1}\}$, it is difficult to verify (2.9) directly. In [4, 6] it is shown that it suffices to verify (2.9) not along the whole sequence $\{k\}$ but along a subsequence $\{n_k\}$ whenever $\{x_{n_k}\}$ converges. In [4, 6] it is also demonstrated that this verification can be done in many practically important problems.

Remark 2. If $\{x_k\}$ given by (1.1), (1.2) is a priori known to be bounded, then under conditions A1–A3 and (2.9)

$$\lim_{k\to\infty} d(x_k, J) = 0; \tag{2.10}$$

i.e., in this case the truncations introduced in (2.1)–(2.4) are not necessary.

In A4 the matrix $H$ is stable. By the Lyapunov equality there is a positive definite matrix $P > 0$ such that

$$PH + H^{\tau} P = -I. \tag{2.11}$$

Denote by $\lambda_{\max}$ and $\lambda_{\min}$ the maximum and minimum eigenvalues of $P$, respectively, and by $K$ the condition number $\lambda_{\max}/\lambda_{\min}$.
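For completeness, $P$ in (2.11) and the constant $K$ can be computed numerically. The sketch below (our own illustration; the particular stable $H$ is arbitrary) solves the Lyapunov equation as a small Kronecker-product linear system; dedicated routines such as `scipy.linalg.solve_continuous_lyapunov` can be used instead.

```python
import numpy as np

# Illustrative stable (nonsymmetric) H; any matrix whose eigenvalues
# have negative real parts will do.
H = np.array([[-1.0, 0.5],
              [0.0, -2.0]])
n = H.shape[0]

# Solve P H + H^T P = -I.  With column-major vec(),
# vec(P H) = (H^T kron I) vec(P) and vec(H^T P) = (I kron H^T) vec(P).
M = np.kron(H.T, np.eye(n)) + np.kron(np.eye(n), H.T)
p = np.linalg.solve(M, (-np.eye(n)).flatten('F'))
P = p.reshape(n, n, order='F')

eigs = np.linalg.eigvalsh(P)     # P is symmetric positive definite
lam_min, lam_max = eigs[0], eigs[-1]
K = lam_max / lam_min            # condition number entering (2.12)
print(np.allclose(P @ H + H.T @ P, -np.eye(n)))  # True
```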

THEOREM. (i) If conditions A1–A5 are satisfied and $x_0$ is the unique root of $f(\cdot)$, then for $\{x_n\}$ defined by (2.1)–(2.4)

$$\limsup_{n\to\infty} (\log a_n^{-1})^{\frac{1}{\gamma}} \|x_n - x_0\| \le \sqrt{K} \left( \frac{2 q \lambda_{\max}}{\gamma} \right)^{\frac{1}{\gamma}} \tag{2.12}$$

if $\{\varepsilon_i\}$ satisfies the following condition:

$$\sum_{i=1}^{\infty} a_i (\log a_{i+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{i+1} < \infty, \tag{2.13}$$

where $\gamma$ and $q$ are given by (2.5), (2.7), respectively.

(ii) If, in addition, $H$ is symmetric, then

$$\limsup_{n\to\infty} (\log a_n^{-1})^{\frac{1}{\gamma}} \|x_n - x_0\| \le \left( \frac{q}{\lambda_l \gamma} \right)^{\frac{1}{\gamma}}, \tag{2.14}$$

where $\lambda_l$ is the smallest eigenvalue of $-H$ and $\gamma$ and $q$ are given by (2.5), (2.7), respectively.

(iii) Further, in the one-dimensional case, i.e., $l = 1$, under the conditions stated in (i) except A3, the upper bound in (2.14) is attainable if $q_n \to q > 0$.
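A quick numerical illustration of (2.14) (our own check, not from the paper): take $l = 2$, noise-free observations, $f(x) = Hx\|x\|$ with symmetric $H = \operatorname{diag}(-1, -2)$, so $\gamma = 1$ and $\lambda_l = 1$, and $a_n = 1/n$, so $q = 1$; the bound then reads $\limsup_n (\log n)\|x_n\| \le 1$.

```python
import math

# x_{n+1} = x_n + (1/n) H x_n ||x_n||,  H = diag(-1, -2),  x_0 = (0.5, 0.5)
x1, x2 = 0.5, 0.5
N = 1_000_000
for n in range(1, N + 1):
    norm = math.hypot(x1, x2)
    x1, x2 = x1 - (1.0 / n) * x1 * norm, x2 - (2.0 / n) * x2 * norm

val = math.hypot(x1, x2) * math.log(N)
print(val)  # stays below the bound (q / (lambda_l * gamma))^(1/gamma) = 1
```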

The proof of the theorem is given in section 3.

Remark 3. From the theorem it is seen that the convergence rate of $(x_n - x_0)$ depends upon the decreasing rate of $a_n$. However, it is interesting to note that this dependence in the degenerate case is completely different from that in the nondegenerate case.

From (1.5) it is seen that for the nondegenerate case, if $a_n = \frac{1}{n^{\alpha}}$, $0 < \alpha \le 1$, then the convergence rate of $(x_n - x_0)$ improves as $\alpha$ increases from 0 to 1. However, in the degenerate case the picture is different. By the theorem, $\lim_{n\to\infty} (\alpha \log n)^{\frac{1}{\gamma}} |x_n - x_0|$ equals 0 for all $\alpha \in (0, 1)$, while it may attain $\left( \frac{1}{|H|\gamma} \right)^{\frac{1}{\gamma}}$ if $\alpha = 1$. This means that, in contrast to the nondegenerate case, the convergence rate of $|x_n - x_0|$ for $\alpha \in (0, 1)$ is


better than that for $\alpha = 1$. This fact is verified by simulation of the following simple example:

$$f(x) = -x|x|, \quad x_0 = 0, \quad f'(x_0) = 0, \quad \gamma = 1, \quad H = -1, \quad \varepsilon_i \equiv 0,$$

$$x^{(1)}_{n+1} = x^{(1)}_n - \frac{x^{(1)}_n |x^{(1)}_n|}{n}, \quad x^{(1)}_0 = 0.5,$$

$$x^{(2)}_{n+1} = x^{(2)}_n - \frac{x^{(2)}_n |x^{(2)}_n|}{\sqrt{n}}, \quad x^{(2)}_0 = 0.5.$$

The simulation shows that

$$x^{(1)}_n \log n \xrightarrow[n\to\infty]{} 1, \quad \text{while} \quad x^{(2)}_n \log n \xrightarrow[n\to\infty]{} 0,$$

which is reconciled with the results stated in the theorem.

It is also worth noting that the right-hand side of (2.14) depends upon the smallest eigenvalue $\lambda_l$ of $-H$ when $H$ is symmetric. As $\lambda_l$ decreases, the upper bound in (2.14) increases. In other words, the faster $f(x)$ leaves the abscissa, the faster $x_n$ converges to $x_0$. This phenomenon is consistent with the convergence rate change from (1.5) for the nondegenerate case to (2.12) for the degenerate case. This is also verified by computation: if in the example considered above "$H = -1$" is replaced by "$H = -\frac{1}{2}$", i.e., if $f(x) = -\frac{1}{2} x|x|$, then the recursion with $a_n = \frac{1}{n}$ becomes $x_{n+1} = x_n - \frac{x_n |x_n|}{2n}$, $x_0 = 0.5$. The computation shows

$$x_n \log n \xrightarrow[n\to\infty]{} 2,$$

which is larger than the limit of $x^{(1)}_n \log n$.
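The computations reported in this remark are easy to reproduce; the sketch below (our own code) runs all three recursions for $10^6$ steps and prints $x_n \log n$. Note how slow the convergence is: the first value is still well short of its limit 1, and the third of its limit 2.

```python
import math

N = 1_000_000

def run(step_scale, power, x0=0.5):
    # x_{n+1} = x_n - (step_scale / n^power) * x_n |x_n|
    x = x0
    for n in range(1, N + 1):
        x -= (step_scale / n ** power) * x * abs(x)
    return x * math.log(N)

v1 = run(1.0, 1.0)   # a_n = 1/n       : x_n log n -> 1
v2 = run(1.0, 0.5)   # a_n = 1/sqrt(n) : x_n log n -> 0
v3 = run(0.5, 1.0)   # f(x) = -x|x|/2  : x_n log n -> 2
print(v1, v2, v3)
```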

Remark 4. In the case $q > 0$, the convergence rate given in the theorem cannot be improved. However, when $q = 0$, i.e., when a slowly decreasing gain is applied, we have only established $\|x_n - x_0\| = o\big((\log a_n^{-1})^{-\frac{1}{\gamma}}\big)$. This estimate may not be sharp, but the computation shows that $x^{(2)}_n \log n$ in Remark 3 converges to zero very slowly. This means that we should not expect a much faster rate than this.

3. Order of estimation error. In this section we establish the order of the estimation error when the estimation algorithm (2.1)–(2.4) is applied. As a matter of fact, we intend to show that $\|z_n\| \triangleq \|(\log a_n^{-1})^{\frac{1}{\gamma}} (x_n - x_0)\|$ is bounded. This is an intermediate step toward proving the theorem, which gives either an upper bound or an exact limit of $\|z_n\|$.

LEMMA 1. If A5 holds, then (2.13) implies (2.9).

Proof. Suppose (2.13) holds. Setting

$$s_n = \sum_{i=1}^{n} a_i (\log a_{i+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{i+1}, \quad s_0 = 0,$$

we have, by summation by parts,

$$\sum_{i=m}^{n} a_i \varepsilon_{i+1} = \sum_{i=m}^{n} (s_i - s_{i-1}) (\log a_{i+1}^{-1})^{-\frac{1}{\gamma}} = s_n (\log a_{n+1}^{-1})^{-\frac{1}{\gamma}} - s_{m-1} (\log a_{m+1}^{-1})^{-\frac{1}{\gamma}} + \sum_{i=m}^{n-1} s_i \Big[ (\log a_{i+1}^{-1})^{-\frac{1}{\gamma}} - (\log a_{i+2}^{-1})^{-\frac{1}{\gamma}} \Big]. \tag{3.1}$$


Since $s_n$ converges, the first two terms on the right-hand side of (3.1) tend to zero as $m, n \to \infty$, while the last term is dominated by

$$\sup_{m \le i \le n} |s_i| \sum_{i=m}^{n-1} \Big| (\log a_{i+1}^{-1})^{-\frac{1}{\gamma}} - (\log a_{i+2}^{-1})^{-\frac{1}{\gamma}} \Big| = \sup_{m \le i \le n} |s_i| \Big[ (\log a_{m+1}^{-1})^{-\frac{1}{\gamma}} - (\log a_{n+1}^{-1})^{-\frac{1}{\gamma}} \Big],$$

which tends to zero as $n \to \infty$ and $m \to \infty$.

Hence $\sum_{i=1}^{\infty} a_i \varepsilon_{i+1}$ converges, and (2.9) holds by Remark 1.

LEMMA 2. Under the conditions stated in (i) or in (iii) of the theorem, $x_k$ defined by (2.1)–(2.4) converges to $x_0$ as $k \to \infty$.

Proof. By Lemma 1 and the Proposition presented in section 2, under the conditions stated in (i) we see $x_k \xrightarrow[k\to\infty]{} x_0$. For the one-dimensional case stated in (iii) we also have $x_k \xrightarrow[k\to\infty]{} x_0$ if we can verify A3.

Since $x_0$ is the unique root of $f(\cdot)$ by A1, we have by A4

$$(x - x_0) f(x) < 0 \quad \forall x \ne x_0.$$

Then the function $v(x) = (x - x_0)^2$ satisfies A3.

Define

$$z_n = (\log a_n^{-1})^{\frac{1}{\gamma}} (x_n - x_0), \tag{3.2}$$

$$h(z) = Hz\|z\|^{\gamma} + \frac{q + \Delta}{\gamma} z, \quad z \in \mathbf{R}^l, \quad \Delta > 0. \tag{3.3}$$

LEMMA 3 (key lemma). Under the conditions stated in (i) of the theorem, $\{z_n\}$ is bounded if (2.13) holds.

Proof. To prove boundedness of $\{z_n\}$ we first express $z_n$ in recursive form. For any $\Delta > 0$ and sufficiently large $n$, by (2.7) we have $q_n \le q + \Delta$ and

$$\left( \frac{\log a_{n+1}^{-1}}{\log a_n^{-1}} \right)^{\frac{1}{\gamma}} = \left( \frac{\log a_n^{-1} + \log \frac{a_{n+1}^{-1}}{a_n^{-1}}}{\log a_n^{-1}} \right)^{\frac{1}{\gamma}} = \left( 1 + \frac{\log(1 + a_n q_n)}{\log a_n^{-1}} \right)^{\frac{1}{\gamma}} = \left( 1 + \frac{a_n q_n + O(a_n^2)}{\log a_n^{-1}} \right)^{\frac{1}{\gamma}} = 1 + \frac{a_n (q + \Delta + o(1))}{\gamma \log a_n^{-1}}. \tag{3.4}$$

By Lemma 2 $\{x_n\}$ is bounded, and hence $x_k$ is defined by the Robbins–Monro algorithm starting from some $n_0$. Consequently, by (3.4), for $n \ge n_0$ we derive the recursive formula for $\{z_n\}$:


$$\begin{aligned}
z_{n+1} &= \left( 1 + \frac{a_n}{\gamma \log a_n^{-1}} (q + \Delta + o(1)) \right) z_n \\
&\quad + \frac{a_n}{\log a_n^{-1}} \left( 1 + \frac{a_n}{\gamma \log a_n^{-1}} (q + \Delta + o(1)) \right) (\log a_n^{-1})^{1 + \frac{1}{\gamma}} \cdot \big[ H(x_n - x_0)\|x_n - x_0\|^{\gamma} + r(x_n) \big] + a_n (\log a_{n+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{n+1} \\
&= \left( 1 + \frac{a_n}{\gamma \log a_n^{-1}} (q + \Delta + o(1)) \right) z_n \\
&\quad + \frac{a_n}{\log a_n^{-1}} \left( 1 + \frac{a_n}{\gamma \log a_n^{-1}} (q + \Delta + o(1)) \right) \left[ H z_n \|z_n\|^{\gamma} + \frac{\|z_n\|^{1+\gamma} r(x_n)}{\|x_n - x_0\|^{1+\gamma}} \right] + a_n (\log a_{n+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{n+1} \\
&= z_n + b_n h_n(z_n) + a_n (\log a_{n+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{n+1} \qquad (3.5) \\
&= z_n + b_n H_n z_n + a_n (\log a_{n+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{n+1}, \qquad (3.6)
\end{aligned}$$

where

$$h_n(z) = \left( Hz\|z\|^{\gamma} + \frac{\|z\|^{1+\gamma} r(x_n)}{\|x_n - x_0\|^{1+\gamma}} \right) \left( 1 + \frac{a_n}{\gamma \log a_n^{-1}} (q + \Delta + o(1)) \right) + \frac{q + \Delta + o(1)}{\gamma} z = H_n z \tag{3.7}$$

and

$$H_n = \left[ \left( H + \frac{r(x_n)}{\|x_n - x_0\|^{1+\gamma}} \cdot \frac{z_n^{\tau}}{\|z_n\|} \right) (1 + o(1)) + \frac{q + \Delta + o(1)}{\gamma \|z_n\|^{\gamma}} \cdot I \right] \|z_n\|^{\gamma}. \tag{3.8}$$

Assume the converse is true, i.e., assume $\{\|z_n\|\}$ is unbounded.

Let us fix a large enough constant $c > 1$ such that

$$\frac{q + \Delta}{\gamma c^{\gamma}} \lambda_{\max} < \frac{1}{5}. \tag{3.9}$$

Denote by $\{z_{l_i}\}$, $i = 1, 2, \ldots, n_c$, those elements of $\{z_n, n \ge n_0\}$ for which $\|z_{l_i}\| \le c$, so that $\|z_i\| > c$ for all $i \notin \{l_1, \ldots, l_{n_c}\}$, where $n_c$ may be infinite. For both cases $n_c < \infty$ and $n_c = \infty$, from the unboundedness of $\{\|z_n\|\}$ we will obtain a contradiction. This implies the conclusion of the lemma.

Case 1. If $n_c < \infty$, then $\|z_i\| > c$ $\forall i \ge n_c$.

We now show that, with $c$ selected by (3.9), the difference equation (3.5) in Case 1 is asymptotically stable and $z_n \xrightarrow[n\to\infty]{} 0$. This implies the impossibility of $\|z_i\| > c$ $\forall i \ge n_c$. Define

$$\Phi_{n,j} = (I + b_n H_n)(I + b_{n-1} H_{n-1}) \cdots (I + b_j H_j), \quad \Phi_{j,j+1} \triangleq I, \tag{3.10}$$

$$\Phi_{n,j}^{\tau} P \Phi_{n,j} = \Phi_{n-1,j}^{\tau} \big( P + b_n (H_n^{\tau} P + P H_n) + b_n^2 H_n^{\tau} P H_n \big) \Phi_{n-1,j}, \tag{3.11}$$

where $H_n$ is defined by (3.8).


From A4 and Lemma 2, notice that $r(x_n)/\|x_n - x_0\|^{1+\gamma} \xrightarrow[n\to\infty]{} 0$ and, for $n \ge n_c$,

$$b_n \|H_n^{\tau} P H_n\| \le c_1 b_n \|z_n\|^{2\gamma} = c_1 \frac{a_n}{\log a_n^{-1}} (\log a_n^{-1})^2 \|x_n - x_0\|^{2\gamma} \xrightarrow[n\to\infty]{} 0, \tag{3.12}$$

where $c_1$ is a constant. Then by (2.11), (3.9), (3.12), for sufficiently large $n$ we have

$$(H_n^{\tau} P + P H_n) + b_n H_n^{\tau} P H_n < -\frac{1}{2} \|z_n\|^{\gamma} I. \tag{3.13}$$

Without loss of generality we may assume that $n_0$ is large enough so that (3.13) is valid for $n \ge n_0$. By (3.11), (3.13), for $j \ge n_c$ we see that

$$\Phi_{n,j}^{\tau} P \Phi_{n,j} \le \Phi_{n-1,j}^{\tau} \left( P - \frac{1}{2} b_n \|z_n\|^{\gamma} I \right) \Phi_{n-1,j} \le \left( 1 - \frac{b_n \|z_n\|^{\gamma}}{2 \lambda_{\max}} \right) \Phi_{n-1,j}^{\tau} P \Phi_{n-1,j},$$

where, as defined in section 2, $\lambda_{\max}$ is the maximum eigenvalue of $P$. This implies that

$$\Phi_{n,n_c}^{\tau} P \Phi_{n,n_c} \le (1 - \mu b_n \|z_n\|^{\gamma}) \Phi_{n-1,n_c}^{\tau} P \Phi_{n-1,n_c} < e^{-\mu b_n \|z_n\|^{\gamma}} \Phi_{n-1,n_c}^{\tau} P \Phi_{n-1,n_c} < \lambda_{\max} e^{-\mu \sum_{i=n_c}^{n} b_i \|z_i\|^{\gamma}} I,$$

where $\mu = \frac{1}{2\lambda_{\max}}$. Consequently, we have

$$\|\Phi_{n,n_c}\| < \sqrt{K}\, e^{-\frac{\mu}{2} \sum_{i=n_c}^{n} b_i \|z_i\|^{\gamma}}. \tag{3.14}$$

We remind the reader that $K = \lambda_{\max}/\lambda_{\min}$ and $\lambda_{\min}$ is the minimum eigenvalue of $P$.

From (3.6) it follows that

$$z_{n+1} = \Phi_{n,n_c} z_{n_c} + \sum_{j=n_c}^{n} \Phi_{n,j+1} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1}. \tag{3.15}$$

Since $\|z_i\| > c$ $\forall i \ge n_c$ and $\sum_{i=n_c}^{\infty} b_i = \infty$, by (3.14) the first term on the right-hand side of (3.15) tends to zero as $n \to \infty$. Let us now estimate the last term of (3.15). Set

$$\xi_n = \sum_{j=n_c}^{n} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1}.$$

By (2.13) it follows that $\xi_n \xrightarrow[n\to\infty]{} \xi < \infty$. We now have

$$\begin{aligned}
\sum_{j=n_c}^{n} \Phi_{n,j+1} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1} &= \sum_{j=n_c}^{n} \Phi_{n,j+1} (\xi_j - \xi_{j-1}) \\
&= \xi_n - \sum_{j=n_c+1}^{n} (\Phi_{n,j+1} - \Phi_{n,j}) \xi_{j-1} - \Phi_{n,n_c+1} \xi_{n_c-1}
\end{aligned}$$


$$\begin{aligned}
&= \xi_n - \sum_{j=n_c+1}^{n} (\Phi_{n,j+1} - \Phi_{n,j})\, \xi - \sum_{j=n_c+1}^{n} (\Phi_{n,j+1} - \Phi_{n,j})(\xi_{j-1} - \xi) - \Phi_{n,n_c+1} \xi_{n_c-1} \\
&= (\xi_n - \xi) + \Phi_{n,n_c+1} \xi - \sum_{j=n_c+1}^{n_1} (\Phi_{n,j+1} - \Phi_{n,j})(\xi_{j-1} - \xi) + \sum_{j=n_1+1}^{n} \Phi_{n,j+1} b_j H_j (\xi_{j-1} - \xi) - \Phi_{n,n_c+1} \xi_{n_c-1}. \qquad (3.16)
\end{aligned}$$

By (3.14) it is clear that, on the right-hand side of (3.16), all terms except the second-to-last one tend to zero as $n \to \infty$ for any fixed $n_1$. We now show that the second-to-last term of (3.16) can be made arbitrarily small by choosing $n_1$ sufficiently large. For any $\varepsilon > 0$, take $n_1$ sufficiently large that

$$\|\xi_j - \xi\| < \varepsilon \quad \text{and} \quad \frac{\mu b_j \|z_j\|^{\gamma}}{2} \le 1 \quad \forall j \ge n_1,$$

which is possible because

$$b_j \|z_j\|^{\gamma} = \frac{a_j}{\log a_j^{-1}} \cdot \big[ (\log a_j^{-1})^{\frac{1}{\gamma}} \|x_j - x_0\| \big]^{\gamma} = a_j \|x_j - x_0\|^{\gamma} \to 0. \tag{3.17}$$

Using (3.14), (3.17), and noticing that $\|H_n\| \le c_2 \|z_n\|^{\gamma}$ $\forall n \ge n_c$ for some constant $c_2 > 0$, we derive

$$\left\| \sum_{j=n_1+1}^{n} \Phi_{n,j+1} b_j H_j (\xi_{j-1} - \xi) \right\| \le \varepsilon c_2 \sqrt{K} \sum_{j=n_1+1}^{n} e^{-\frac{\mu}{2} \sum_{i=j+1}^{n} b_i \|z_i\|^{\gamma}} b_j \|z_j\|^{\gamma} \le \frac{4 \varepsilon c_2 \sqrt{K}}{\mu} \sum_{j=n_1+1}^{n} e^{-\frac{\mu}{2} \sum_{i=j+1}^{n} b_i \|z_i\|^{\gamma}} \left( 1 - e^{-\frac{\mu}{2} b_j \|z_j\|^{\gamma}} \right) \le \frac{4 \varepsilon c_2 \sqrt{K}}{\mu},$$

where we use the fact that $\frac{x}{2} \le 1 - e^{-x}$ for $x \in [0, 1]$.

Consequently, the left-hand side of (3.16) tends to zero as $n \to \infty$, and hence $z_n \xrightarrow[n\to\infty]{} 0$. This contradicts $\|z_i\| > c$ $\forall i \ge n_c$. Therefore, $n_c$ must be $\infty$.

Case 2. Assume $n_c = \infty$. In this case $\{z_i\}$ returns to the ball $\{\|z\| \le c\}$ infinitely many times and at the same time $\{z_i\}$ is unbounded. From this we can conclude that $\{\|z_i\|\}$ crosses a nonempty interval infinitely often. To be precise, let $z_{l_i}^{\tau} P z_{l_i} \le \lambda_{\max} c^2$, $i = 1, 2, \ldots$, where $P$ is given in (2.11). Starting from any $z_{l_i}$, there exists an $m_i > l_i$ such that $z_{m_i}^{\tau} P z_{m_i} > 4 c^2 \lambda_{\max}^2 / \lambda_{\min}$, since $\{\|z_n\|\}$ is unbounded. Further, noticing $n_c = \infty$, we can find an index $n_{i+1}$ in the set $\{l_j, j = 1, 2, \ldots\}$ such that $n_{i+1} > m_i$. This procedure can be continued infinitely many times. Without loss of generality, we may assume

$$z_{n_i}^{\tau} P z_{n_i} \le \lambda_{\max} c^2, \quad z_{m_i}^{\tau} P z_{m_i} \ge 4 c^2 \lambda_{\max}^2 / \lambda_{\min},$$

$$\lambda_{\max} c^2 < z_j^{\tau} P z_j < 4 c^2 \lambda_{\max}^2 / \lambda_{\min}, \quad n_i < j < m_i, \quad i = 1, 2, \ldots. \tag{3.18}$$


This implies the crossing property of $\{\|z_i\|\}$:

$$\|z_{n_i}\| \le \sqrt{K} c, \quad \|z_{m_i}\| \ge 2c\sqrt{K}, \quad c < \|z_j\| < 2cK, \quad n_i < j < m_i, \quad i = 1, 2, \ldots. \tag{3.19}$$

We now show that $\sum_{j=n_i}^{m_i-1} b_j \ge T > 0$ and that $\|z_s - z_{n_i}\| = O(T)$ as $T \to 0$ for all large $i$ and all $s$ with $\sum_{j=n_i}^{s} b_j \le T$. This implies a contradiction to (3.19). We now prove this in detail.

bn‖Hnzn‖≤an

log a−1n

[c3‖zn‖1+γ + c4‖zn‖](3.20)

≤ an

log a−1n

[c3(log a−1n )

γ+1γ ‖xn − x0‖1+γ

+c4(log a−1n )

1γ ‖xn − x0‖]−−−−→

n→∞0,

by (2.13) from (3.6) we see that

zn+1 − zn−−−−→n→∞

0.(3.21)

Summing up both sides of (3.6) from $n_i$ to $m_i$ we derive

$$z_{m_i} = z_{n_i} + \sum_{j=n_i}^{m_i-1} b_j H_j z_j + \sum_{j=n_i}^{m_i-1} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1}. \tag{3.22}$$

From (3.22), using (3.18), (3.19), (3.20) we obtain

$$2c\sqrt{K} \le \sqrt{K} c + \sum_{j=n_i}^{m_i-1} b_j \big( c_3 \|z_j\|^{1+\gamma} + c_4 \|z_j\| \big) + \left\| \sum_{j=n_i}^{m_i-1} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1} \right\| \le \sqrt{K} c + \sum_{j=n_i}^{m_i-1} b_j \big( c_3 (2cK)^{1+\gamma} + 2 c_4 cK \big) + \left\| \sum_{j=n_i}^{m_i-1} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1} \right\|. \tag{3.23}$$

By (2.13) the last term of (3.23) can be made arbitrarily small, say, less than $\varepsilon$ $(< \sqrt{K} c)$ if $i$ is sufficiently large. Then from (3.23) it follows that

$$\sum_{j=n_i}^{m_i-1} b_j \ge \frac{\sqrt{K} c - \varepsilon}{c_3 (2cK)^{1+\gamma} + 2 c c_4 K} \triangleq T > 0$$

for all large enough $i$. This means that $l(n_i, T) \le m_i - 1$, where

$$l(n, t) = \max\left\{ l : \sum_{i=n}^{l} b_i \le t \right\}, \quad b_i = \frac{a_i}{\log a_i^{-1}}. \tag{3.24}$$


Consequently, we have

$$\|z_j - z_{n_i}\| \le \left\| \sum_{s=n_i}^{j-1} b_s H_s z_s + \sum_{s=n_i}^{j-1} a_s (\log a_{s+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{s+1} \right\| \le \big( c_3 (2cK)^{1+\gamma} + 2 c_4 cK \big) T + o(1) \le \alpha T, \quad \alpha > 0, \quad \forall j \in [n_i, \ldots, l(n_i, T)], \tag{3.25}$$

where $\alpha$ is a constant. Therefore, by Taylor's formula there exists $z \in \mathbf{R}^l$ such that

$$\|z - z_{n_i}\| \le \alpha T \tag{3.26}$$

and

$$\begin{aligned}
z_{l(n_i,T)}^{\tau} P z_{l(n_i,T)} - z_{n_i}^{\tau} P z_{n_i} &= z^{\tau} P \left( \sum_{j=n_i}^{l(n_i,T)} b_j H_j z_j + \sum_{j=n_i}^{l(n_i,T)} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1} \right) \\
&= \sum_{j=n_i}^{l(n_i,T)} b_j z_j^{\tau} P H_j z_j + \sum_{j=n_i}^{l(n_i,T)} b_j (z - z_j)^{\tau} P H_j z_j + z^{\tau} P \sum_{j=n_i}^{l(n_i,T)} a_j (\log a_{j+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{j+1}. \qquad (3.27)
\end{aligned}$$

Using (3.20), (3.25), and (3.26) we see that

$$\left\| \sum_{j=n_i}^{l(n_i,T)} b_j (z - z_j)^{\tau} P H_j z_j \right\| \le 2 \alpha T^2 \lambda_{\max} \big( c_3 (2cK)^{1+\gamma} + 2 c_4 cK \big).$$

By (2.13), the last term of (3.27) can be made arbitrarily small. Hence, by (3.12), (3.13), (3.19), (3.21) we have

$$z_j^{\tau} P H_j z_j = \frac{1}{2} z_j^{\tau} (P H_j + H_j^{\tau} P) z_j < -\frac{1}{4} \|z_j\|^{2+\gamma} \le -\frac{1}{4} c^{2+\gamma} \quad \forall j = n_i, \ldots, l(n_i, T).$$

From (3.27) we can conclude that

$$z_{l(n_i,T)}^{\tau} P z_{l(n_i,T)} - z_{n_i}^{\tau} P z_{n_i} \le -\frac{1}{4} c^{2+\gamma} T + 2 \alpha T^2 \big( c_3 (2cK)^{1+\gamma} + 2 c c_4 K \big) + o(1) \le -\frac{1}{5} c^{2+\gamma} T \tag{3.28}$$

if $i$ is large enough and $T$ is sufficiently small. By (3.18), inequality (3.28) implies that

$$\lambda_{\max} c^2 < z_{l(n_i,T)}^{\tau} P z_{l(n_i,T)} \le z_{n_i}^{\tau} P z_{n_i} - \frac{1}{5} c^{2+\gamma} T \le \lambda_{\max} c^2 - \frac{c^{2+\gamma} T}{5},$$

which is impossible. The obtained contradiction shows that $\{z_n\}$ is bounded.


4. Proof of the theorem. We are now in a position to prove our theorem.

Proof of the theorem. (i) For assertion (2.12) of the theorem it suffices to show

$$\limsup_{n\to\infty} z_n^{\tau} P z_n \le \lambda_{\max} \left( \frac{2(q + \Delta)\lambda_{\max}}{\gamma} \right)^{\frac{2}{\gamma}} \triangleq a \tag{4.1}$$

for arbitrarily small $\Delta > 0$. Let us fix $\Delta > 0$.

The idea of the proof of (4.1) is to show that $z_n^{\tau} P z_n$ crosses a nonempty interval infinitely often if (4.1) is not true while, at the same time, $z_n^{\tau} P z_n$ is decreasing in a certain sense. This contains a contradiction.

By Lemma 3 $\{z_n\}$ is bounded; i.e.,

$$\|z_n\| \le \zeta < \infty \quad \forall n. \tag{4.2}$$

Hence, from (3.8) we see that

$$H_n = H \|z_n\|^{\gamma} + \frac{q + \Delta}{\gamma} I + o(1). \tag{4.3}$$

From (3.3), (3.6), and (4.3) it follows that

$$z_{n+1} = z_n + b_n h(z_n) + b_n o(1) + a_n (\log a_{n+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{n+1}. \tag{4.4}$$

Fix any small $\varepsilon > 0$ and consider $z \in \mathbf{R}^l$ for which

$$z^{\tau} P z \ge \lambda_{\max} \left( \frac{2(q + \Delta)\lambda_{\max} + \varepsilon}{\gamma} \right)^{\frac{2}{\gamma}} \triangleq b. \tag{4.5}$$

This implies that

$$\|z\| \ge \left( \frac{2(q + \Delta)\lambda_{\max} + \varepsilon}{\gamma} \right)^{\frac{1}{\gamma}}.$$

Then by (2.11) and (4.4) we have

$$z^{\tau} \left( (PH + H^{\tau}P)\|z\|^{\gamma} + \frac{2(q + \Delta)}{\gamma} P \right) z \le z^{\tau} \left( -\|z\|^{\gamma} I + \frac{2(q + \Delta)\lambda_{\max}}{\gamma} I \right) z \le -\frac{\varepsilon}{\gamma} \|z\|^2. \tag{4.6}$$

Assume (4.1) is not true. Then there is a small $\delta > 0$ such that

$$\limsup_{n\to\infty} z_n^{\tau} P z_n > a + \delta. \tag{4.7}$$

Therefore, there is a subsequence such that

$$z_{n_k}^{\tau} P z_{n_k} > a + \delta, \quad k = 1, 2, \ldots. \tag{4.8}$$

Let $\varepsilon > 0$ be small enough that

$$a + \delta > b, \tag{4.9}$$

where $b$ is given by (4.5).


We now show that, from any $n_k$, $k = 1, 2, \ldots$, $\{z_n\}$ will enter the ellipsoid $\{z : z^{\tau} P z \le b\}$. Assume the converse, i.e.,

$$z_i^{\tau} P z_i \ge b \quad \forall i \ge n_k \tag{4.10}$$

for some $n_k$.

By the boundedness of $\{z_n\}$ we have

$$\|z_j - z_n\| \le c_5 T, \quad j = n, \ldots, l(n, T) \quad \forall n, \tag{4.11}$$

and hence

$$|z_j^{\tau} P z_j - z_n^{\tau} P z_n| \le c_6 T, \quad j = n, \ldots, l(n, T) \quad \forall n, \tag{4.12}$$

where $c_5$ and $c_6$ are constants. Similar to (3.27), by (4.2), (4.11) we obtain

$$z_{l(n_k,T)}^{\tau} P z_{l(n_k,T)} - z_{n_k}^{\tau} P z_{n_k} \le \sum_{j=n_k}^{l(n_k,T)} b_j z_j^{\tau} P H_j z_j + c_7 T^2 + o(1) = \sum_{j=n_k}^{l(n_k,T)} b_j z_j^{\tau} \left( PH \|z_j\|^{\gamma} + \frac{q + \Delta}{\gamma} P \right) z_j + c_7 T^2 + o(1). \tag{4.13}$$

Using (2.11) leads to

$$z_{l(n_k,T)}^{\tau} P z_{l(n_k,T)} - z_{n_k}^{\tau} P z_{n_k} \le \frac{1}{2} \sum_{j=n_k}^{l(n_k,T)} b_j z_j^{\tau} \left( -\|z_j\|^{\gamma} I + \frac{2(q + \Delta)}{\gamma} P \right) z_j + c_7 T^2 + o(1).$$

From this, by (4.6), (4.10) we obtain

$$z_{l(n_k,T)}^{\tau} P z_{l(n_k,T)} - z_{n_k}^{\tau} P z_{n_k} < -\frac{1}{2} \sum_{j=n_k}^{l(n_k,T)} b_j \frac{\varepsilon}{\gamma} \|z_j\|^2 + c_7 T^2 + o(1) < -\frac{\varepsilon}{2\gamma} \left( \frac{2(q + \Delta)\lambda_{\max} + \varepsilon}{\gamma} \right)^{\frac{2}{\gamma}} T + c_7 T^2 + o(1) \le -\frac{\varepsilon}{4\gamma} \left( \frac{2(q + \Delta)\lambda_{\max} + \varepsilon}{\gamma} \right)^{\frac{2}{\gamma}} T \tag{4.14}$$

for sufficiently small $T$ and large enough $k$. This means that after a finite number of steps (4.10) will no longer be satisfied; i.e., $z_n$ will enter the ellipsoid $\{z : z^{\tau} P z \le b\}$. This together with (4.8) implies that $\{z_n^{\tau} P z_n\}$ will cross the interval $[b, a + \delta]$ infinitely often; i.e., there are two subsequences $\{z_{l_k}\}$ and $\{z_{m_k}\}$ such that

$$z_{l_k}^{\tau} P z_{l_k} \le b, \quad z_{m_k}^{\tau} P z_{m_k} \ge a + \delta, \quad b < z_i^{\tau} P z_i < a + \delta \quad \forall i : l_k < i < m_k.$$

Take $T$ sufficiently small that $c_6 T < a + \delta - b$. Then by (4.12) we see $l(l_k, T) < m_k$ $\forall k$ and

$$z_{l(l_k,T)}^{\tau} P z_{l(l_k,T)} \in (b, a + \delta),$$


which combined with (4.14) leads to a contradiction:

$$0 < z_{l(l_k,T)}^{\tau} P z_{l(l_k,T)} - z_{l_k}^{\tau} P z_{l_k} \le -\frac{\varepsilon}{4\gamma} \left( \frac{2(q + \Delta)\lambda_{\max} + \varepsilon}{\gamma} \right)^{\frac{2}{\gamma}} T < 0.$$

Therefore, (4.8) is impossible, and hence (4.7) is impossible. Since $\delta$ may be arbitrarily small, the impossibility of (4.7) implies (4.1). Letting $\Delta$ tend to zero, from (4.1) we derive (2.12).

(ii) Now let $H$ be symmetric. We simply consider $\|z\|^2$ instead of $z^{\tau} P z$ and set in (4.1) and (4.5)

$$a = \left( \frac{q + \Delta}{\lambda_l \gamma} \right)^{\frac{2}{\gamma}}, \quad b = \left( \frac{q + \Delta + \varepsilon}{\lambda_l \gamma} \right)^{\frac{2}{\gamma}}.$$

The proof can be carried out along the lines of that given for the general case. For example, corresponding to (4.5), (4.6) we now have

$$\|z\|^2 \ge b \quad \text{and} \quad z^{\tau} \left( H \|z\|^{\gamma} + \frac{q + \Delta}{\gamma} I \right) z \le z^{\tau} \left( -\lambda_l \frac{q + \Delta + \varepsilon}{\lambda_l \gamma} + \frac{q + \Delta}{\gamma} \right) z = -\frac{\varepsilon}{\gamma} \|z\|^2,$$

respectively, while (4.13) becomes

respectively, while (4.13) becomes

‖zl(nk,T )‖2 − ‖znk‖2 ≤ −l(nk,T )∑j=nk

bjzτj

(H

(‖zj‖γ +

q + ∆γ

I

)zj + c7T

2 + o(1)).

(iii) Since $q > 0$, we may set $\Delta = 0$ in (3.3) and in the proofs of Lemma 3 and part (i) of the theorem.

In the one-dimensional case $H$ in (2.5) is a negative number, and $\lambda_l$ in (2.14) equals $|H|$. The root set of $h(z)$ defined by (3.3) with $\Delta = 0$ is $J = \left\{ 0, \pm\left( \frac{q}{|H|\gamma} \right)^{\frac{1}{\gamma}} \right\}$.

It is easy to define a twice differentiable function $v(z)$ such that

$$v(z) = v(-z), \quad 0 < v(z) < v(0) \quad \forall z : 0 < |z| \le \zeta,$$

$$v'(z) h(z) < 0 \quad \forall z \notin \left\{ 0, \pm\left( \frac{q}{|H|\gamma} \right)^{\frac{1}{\gamma}} \right\},$$

where $\zeta$ is given in (4.2). For all $t \le T$ we find that

$$\lim_{T\to 0} \limsup_{k\to\infty} \frac{1}{T} \left\| \sum_{i=k}^{l(k,t)} \big( b_i o(1) + a_i (\log a_{i+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{i+1} \big) \right\| \le \lim_{T\to 0} \limsup_{k\to\infty} \frac{1}{T} \{ t \cdot o(1) \} + \lim_{T\to 0} \limsup_{k\to\infty} \frac{1}{T} \left\| \sum_{i=k}^{l(k,t)} a_i (\log a_{i+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{i+1} \right\| = 0. \tag{4.15}$$

Applying Remark 2 in section 2 to (4.4) leads to

$$\lim_{k\to\infty} d(z_k, J) = 0.$$


This is valid for any $\{\varepsilon_i\}$ satisfying (4.15). In particular, if $\{\varepsilon_i\}$ is such that $b_i o(1) + a_i (\log a_{i+1}^{-1})^{\frac{1}{\gamma}} \varepsilon_{i+1} = 0$ $\forall i \ge 1$, then (4.4) becomes the following recursion:

$$z_{n+1} = z_n + b_n h(z_n). \tag{4.16}$$

Note that $h'(0) = \frac{q}{\gamma} > 0$, and hence 0 is not stable for the equation

$$\dot{z}_t = h(z_t).$$

It is clear that 0 cannot be the limit point of (4.16). Therefore, in this case $z_n$ can converge either to $\left( \frac{q}{|H|\gamma} \right)^{\frac{1}{\gamma}}$ or to $-\left( \frac{q}{|H|\gamma} \right)^{\frac{1}{\gamma}}$. This verifies the attainability of the upper bound in (2.14).
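This attainability argument can be illustrated numerically (our own sketch): iterate (4.16) with $\gamma = 1$, $H = -1$, $q = 1$, i.e., $h(z) = -z|z| + z$ with stable roots $\pm 1$, and $b_n = 1/(n \log n)$. Starting between the unstable root 0 and the stable root $+1$, the iterate drifts toward $+1$, though very slowly, since $\sum b_n$ grows only like $\log \log n$.

```python
import math

# Recursion (4.16): z_{n+1} = z_n + b_n h(z_n) with h(z) = -z|z| + z
# (gamma = 1, H = -1, q = 1, Delta = 0), b_n = a_n / log(1/a_n), a_n = 1/n.
# h has roots {0, +1, -1}; h'(0) = q/gamma = 1 > 0, so 0 is unstable.
z = 0.3  # start between the unstable root 0 and the stable root +1
for n in range(2, 2_000_000):
    b = 1.0 / (n * math.log(n))
    z += b * (-z * abs(z) + z)
print(z)  # well past the start, approaching the stable root +1
```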

5. Concluding remarks. By using a deterministic analysis we have shown the pathwise convergence rate of SA when $f(x_0) = 0$ and $f'(x_0) = 0$. Some problems are still open and call for further research. First, it might be possible to obtain more precise results. For example, as a conjecture, the limit of the left-hand side of (2.14) is one of $\left( \frac{q}{\lambda_i \gamma} \right)^{\frac{1}{\gamma}}$, $i = 1, \ldots, l$, depending upon the initial value, where $\lambda_i$, $i = 1, \ldots, l$, are the eigenvalues of $-H$. Second, it is not clear what happens if $f(\cdot)$ has more complicated behavior as $x \to x_0$.

REFERENCES

[1] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.

[2] M. B. Nevelson and R. Z. Hasminskii, Stochastic Approximation and Recursive Estimation, Amer. Math. Soc. Transl. Math. Monographs 47, 1976.

[3] H. J. Kushner and D. S. Clark, Stochastic Approximation for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.

[4] H.-F. Chen, Stochastic approximation and its new applications, in Proc. 1994 Hong Kong International Workshop on New Directions of Control and Manufacturing, 1994, pp. 2–12.

[5] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation of Random Systems, Birkhäuser, Basel, 1992, pp. 71–76.

[6] H.-F. Chen, T. Duncan, and B. Pasik-Duncan, On Ljung's approach to system parameter identification, in 10th IFAC Symposium on System Identification, Vol. 2, preprint, M. Blanke and T. Soderstrom, eds., Copenhagen, 1994, pp. 667–671.

[7] H.-F. Chen, Recursive Estimation and Control for Stochastic Systems, John Wiley, New York, 1985.

[8] I. J. Wang, E. K. P. Chong, and S. R. Kulkarni, Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms, Adv. in Appl. Probab., accepted for publication.

[9] S. R. Kulkarni and C. Horn, Convergence of the Robbins–Monro algorithm under arbitrary disturbances, in Proc. 32nd Conf. on Decision and Control, 1993, pp. 537–538.

[10] S. R. Kulkarni and C. Horn, Alternative approach and conditions for convergence of stochastic approximation algorithms, in Proc. 34th Conf. on Decision and Control, New Orleans, IEEE Control Systems Society, 1995.
