On Accelerated Stochastic Approximation


ON ACCELERATED STOCHASTIC APPROXIMATION

S. V. ZHULENEV

(Translated by Durri-Hamdani)

Introduction

In [1], Kesten proposed a non-Markovian modification of a one-dimensional process of stochastic approximation in which the value of each successive "iteration" depends on the number of changes in the direction of the motion of the preceding "iterations". In the present work his conditions for convergence of these processes with probability 1 are relaxed.

The method of proof of our main result differs from Kesten’s only in a more thorough analysis of certain trajectories of the process. Nevertheless, it has turned out to be possible not only to weaken the condition of monotonicity of the sequence $a_n$ normalizing the iteration of the process and to require only the boundedness of the $p$-th moment of the "interference", $1 < p \le 2$, but even to extend the class of convergent processes. It must be said that the convergence of one such process not considered by Kesten (the Kiefer-Wolfowitz procedure) has already been investigated by Kushner and Gavin in [2]. But their more general result, although it indeed establishes convergence in the multi-dimensional case under the assumption of non-uniqueness of the desired point and without "additional" restrictions on the sequences $a_n$ and $c_n$, relies nevertheless on greater smoothness of the regression function and the equality $p = 2$.

The class of processes of stochastic approximation on which Kesten’s modification is considered is defined differently. In order to simplify the presentation, the rule of variation of the "iteration" of the process is also chosen differently. Note however that the class of procedures of Kesten is not more general and that the proof of the derived results for $p = 2$ remains in force for it also.

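In this work we consider the class of one-dimensional discrete processes defined by the recurrence relation

$$X_{n+1} = X_n + a(n)\, g(n, t(n), X_n, \omega), \qquad n \ge 1, \tag{1}$$

by means of the scalar sequence $a_m$, the random variable $X_1$ defined on some probability space $(\Omega, \Sigma, P)$, the family of random variables $g(n, m, x, \omega)$, $m \ge 1$, $n \ge m$, $x \in R$, also defined on $(\Omega, \Sigma, P)$ and satisfying the conditions

A.1: the real function $g(n, m, x, \omega)$ is measurable in $R \times \Omega$ for all possible $n$ and $m$,

A.2: for any $n \ge 2$, the family $g(n, m, x, \omega)$, $2 \le m \le n$, $x \in R$, does not depend on $X_1, \dots, X_n$,

A.3: there exists the expectation $\bar g(n, m, x) = E\, g(n, m, x, \omega)$,

and by means of the integer-valued random variable $t(n)$ defined as follows ($a(n) = a_{t(n)}$): $t(1) = 1$, $t(2) = 2$, while for $n > 2$,

$$t(n) = \begin{cases} 2 & \text{if } \rho_n \le 1, \\ t(n-1) & \text{if } \rho_n > 1 \text{ and } \Delta x_n = 0, \\ t(n-1) + \nu(\Delta x_{k_n} \Delta x_n) & \text{if } \rho_n > 1 \text{ and } \Delta x_n \ne 0, \end{cases}$$

where

$$\Delta x_i = x_i - x_{i-1}, \qquad \nu(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases} \qquad k_n = \max(i : \Delta x_i \ne 0,\ 1 < i < n), \qquad \rho_n = \sum_{i=2}^{n} |\operatorname{sign} \Delta x_i|.$$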

We assume below that this is the sample probability space of the process.
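The rule defining $t(n)$ amounts to counting the reversals of direction along the trajectory, with steps of zero increment ignored. The following Python sketch is an illustration only: the function name `direction_change_counter` is hypothetical, and the code assumes that the count of nonzero increments runs up to and including the current index $n$.

```python
from typing import List, Sequence


def direction_change_counter(xs: Sequence[float]) -> List[int]:
    """Return [t(1), ..., t(N)] for a trajectory xs = [x_1, ..., x_N]."""
    t: List[int] = []
    last_nonzero = 0.0   # most recent nonzero increment before the current step
    nonzero_count = 0    # number of nonzero increments up to the current step
    for n in range(1, len(xs) + 1):
        if n == 1:
            t.append(1)                      # t(1) = 1
            continue
        dx = xs[n - 1] - xs[n - 2]           # increment x_n - x_{n-1}
        if dx != 0:
            nonzero_count += 1
        if n == 2 or nonzero_count <= 1:
            t.append(2)                      # t(2) = 2, and t(n) = 2 while rho_n <= 1
        elif dx == 0:
            t.append(t[-1])                  # no motion: counter unchanged
        else:
            # nu(.) = 1 exactly when the two nonzero increments have opposite signs
            reversed_direction = (dx < 0) != (last_nonzero < 0)
            t.append(t[-1] + (1 if reversed_direction else 0))
        if dx != 0:
            last_nonzero = dx
    return t
```

For the trajectory 0, 1, 2, 1, 1, 2, 3, 2 the sketch returns 1, 2, 2, 3, 3, 4, 4, 5: the counter increases at each of the three reversals and stays constant on the step with zero increment.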

Downloaded 11/23/14 to 129.22.67.107. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php


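The problem of the convergence of these processes to some point $\theta$ of the real axis $R$ is investigated under the general conditions

C.1: $\inf |\bar g(n, m, x)| \ge \rho(\delta) > 0$, the infimum being taken over $|x - \theta| \ge \delta$, $n \ge m \ge m(\delta)$,

C.2: $(x - \theta)\, \bar g(n, m, x) < 0$ for $|x - \theta| > d_m$, $n \ge m \ge 2$, where $\lim d_m = 0$,

C.3: $\sup_{n \ge m \ge m(K)} |\bar g(n, m, x)| \le K(1 + |x|)$,

C.4: $\sup_{x \in R,\ n \ge m \ge m(\sigma)} E\{|c_m (g(n, m, x, \omega) - \bar g(n, m, x))|^p\} \le \sigma^p$ for $p \in (1, 2]$ and some scalar sequence $c_m$, $m \ge 1$,

C.5: the sequences $a_m$ and $c_m$ are such that

lTlm =(X), m c2mP2/(P-1)m=l

p--1 am+s( Cm )p/(p--1)sup /(p-l) < m, sup

$< \infty$, $a_m > 0$, $0 < c_m \le L$,

C.6: $\lim_{\delta \to 0}\ \liminf_{m \to \infty}\ \inf_{n \ge m,\ |x - \theta| \le \delta} P\{g(n, m, x, \omega) > 0\} > 0$ and $\lim_{\delta \to 0}\ \liminf_{m \to \infty}\ \inf_{n \ge m,\ |x - \theta| \le \delta} P\{g(n, m, x, \omega) < 0\} > 0$.

The main result is

Theorem 1. If conditions C.1-C.6 hold, then

$$P\{\lim t(n) = \infty\} = P\{\lim X_n = \theta,\ \lim t(n) = \infty\}.$$

Here and below we agree to omit the term "with probability 1".

Two other theorems give simple additional conditions under which

$$P\{\lim X_n \ne \theta,\ \lim t(n) < \infty\} = 0 \tag{2}$$

or

$$P\{\lim t(n) < \infty\} = 0, \tag{3}$$

i.e., convergence to the desired point $\theta$ is guaranteed with probability 1.

In this work we often use the notation

$$E_n\{\cdot\} = E\{\cdot \mid X_1, \dots, X_n\}, \qquad P_n\{\cdot\} = P\{\cdot \mid X_1, \dots, X_n\},$$

$$\xi_n = c(n)\bigl(g(n, t(n), X_n, \omega) - \bar g(n, t(n), X_n)\bigr), \qquad c(n) = c_{t(n)},$$

$$q_n(x) = P_n\{\xi_n < x\}, \qquad g_n = \bar g(n, t(n), X_n),$$

the relations

$$E_n\{\xi_n\} = 0, \qquad E_n\{|\xi_n|^p\} \le \sigma^p \tag{4}$$

following from A.2 and C.4, and the (implicitly) different form of denoting the process (1), obtained by substituting $g(n, t(n), X_n, \omega) = g_n + \xi_n / c(n)$:

$$X_{n+1} = X_n + a(n)\, g_n + \frac{a(n)}{c(n)}\, \xi_n.$$

Throughout below $\theta$ is assumed equal to 0.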



Auxiliary Assertions

In proving the main result the following two inequalities are used:

k=l

i=m k=l

which are valid for random variables $\xi_k$ with $E\{\xi_k \mid \xi_1, \dots, \xi_{k-1}\} = 0$, and some properties of the processes under consideration. Four of these, formulated as lemmas, are analogues of corresponding propositions of Kesten. The proof of Lemma 1 (an analogue of Lemmas 1 and 2 in [1]) is carried out, however, because it is essentially different from Kesten’s.
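Inequality (5) is attributed to [3]. The classical bound of von Bahr and Esseen for sums of martingale differences has the form $E|\sum_{k=1}^{n} \xi_k|^p \le 2 \sum_{k=1}^{n} E|\xi_k|^p$, $1 \le p \le 2$, and (5) is presumably of this type. The sketch below is an illustration, not part of the original argument: it checks this form by Monte Carlo for independent symmetric heavy-tailed terms, a special case of martingale differences. The distribution, the sample sizes and the function names are assumptions chosen for the example.

```python
import random
from typing import Tuple


def symmetric_heavy_tail(rng: random.Random, alpha: float) -> float:
    """Symmetric zero-mean draw whose absolute moments are finite only below order alpha."""
    magnitude = rng.paretovariate(alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def moment_bound_check(n_terms: int, p: float, alpha: float,
                       trials: int, seed: int = 0) -> Tuple[float, float]:
    """Estimate E|sum xi_k|^p and 2 * sum E|xi_k|^p from the same simulated samples."""
    rng = random.Random(seed)
    lhs_total = 0.0
    rhs_total = 0.0
    for _ in range(trials):
        draws = [symmetric_heavy_tail(rng, alpha) for _ in range(n_terms)]
        lhs_total += abs(sum(draws)) ** p
        rhs_total += 2.0 * sum(abs(d) ** p for d in draws)
    return lhs_total / trials, rhs_total / trials
```

With `p = 1.5` and `alpha = 1.8` the terms have a finite $p$-th moment but infinite variance, which is the regime $1 < p < 2$ admitted by condition C.4.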

Lemma 1. If conditions C.1, 2, 4, 5 hold, then there exist an integer $M_1$ and a $\nu > 0$, not depending on $n$, such that

P{X+-X.-a(n)}l-u(c(n))p/-’ ift(n)M andX,

P{X+-X, Na(n)}NI-(c(n))/’-’ fft(n)MandXN-&

PROOF. It suffices to show that, for all $n \ge 1$,

q()min(,(p-1)((1-)r)/(-)p

for any $\alpha$, $0 \le \alpha \le 1$, if $\tau_n$ is measurable with respect to $X_1, \dots, X_n$ and $0 < \tau_n < \infty$. In fact, the first inequality then follows from the measurability and boundedness of $c(n)$ and the existence of $\beta < 1$ satisfying the inequality

(p 1)((1-fl)O()L)p/-’2p

since by C.1 and C.2 there is an $M_1$ such that, for $t(n) \ge M_1$,

{ ’P()] ( "P()’x.+,-x. -( -q. (), x..

The second inequality is completely symmetric to the first.

We shall assume that $M_1 \ge m(\sigma)$ and set $\alpha \in [0, 1]$ and $\eta_n = \alpha \tau_n / q_n(\tau_n)$. Then, by (4) and the inequality

we have

o"p >= ]xlp dq,, >= rlP-’ x dq, >= "OP-’ x dq,,

>- .OP.- x dq,,-a.r,, >-\q,,(’r,,)/

Hence, if q,(’,)< , the inequality

(5) is proved in [3]; (6) follows from (5).


holds. It remains to remark that the maximum of the right-hand side of this inequality with respect to $\alpha$, $0 \le \alpha \le 1 - \beta$, is taken on for

Lemma 2. If conditions C.2, 3, 5 hold, then there is an integer $M_2$ such that, for any $s \ge 1$, we have the inclusion

Ix.l-<- 3& sup c--r[ _-< 8, t(n) -> M2n<--_j<i<n+s--1 r=j

Lemma 3. If conditions C.2, 5 hold, then there is an integer $M_3$ such that

E.{IX.+II}=<IXII- a(n)- if t(n) >--M3 and Ixl--> .Note also that for any n -> 1 the following inequalities hold if IX.l_->t (8 _-< 1) and

t(n) ->max [re(k), m(cr)]:

E{(IX+II-IXI)x{X+I sign X. Ixl}}(7) Io { ctn)

"a(n)l > } <<-- P,-a(n)lgl/--7-r,&l-x dx=

(8)

trp a(n)

(p- 1)p(()p-1 cP(n)

E.{(Ix.+,l-Ix.I)x{X.+l sign X, <- -Ix, I}}

<-_Ka(n)(1 +IX, I)X ga(n)>- + (p_l)p_\c(n)]

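where $\chi(A)$ is the characteristic function of the set $A$.

Lemma 4. If condition C.4 holds, then, for some $K_1 < \infty$ not depending on $n$,

$$E_n\bigl\{\bigl|\,|X_{n+1}| - E_n\{|X_{n+1}|\}\bigr|^p\bigr\} \le K_1 \left(\frac{a(n)}{c(n)}\right)^p \quad \text{if } t(n) \ge m(\sigma).$$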

Proof of Theorem 1

One can see that

$$P\{0 < \liminf |X_n| < \limsup |X_n| \le \infty,\ \lim t(n) = \infty\} = 0,$$

$$P\{\lim |X_n| = \infty,\ \lim t(n) = \infty\} = 0$$

by means of Lemma 3 and the method used in Lemmas 5 and 6 in [1]. Slightly modifying the arguments in [1] and using Lemma 2, condition C.6, and inequality (6), one can show that

$$P\{\liminf |X_n| = 0 < \limsup |X_n|,\ \lim t(n) = \infty\} = 0.$$

Thus it remains to establish the equality

$$P\{0 < \lim |X_n| = x < \infty,\ \lim t(n) = \infty\} = 0.$$

Suppose the contrary. Then there are $\delta$ and $\delta'$, $0 < \delta < \delta' < \infty$, and, for arbitrarily large $m$, an integer $N$ such that

$$P\{\delta \le |X_n| \le \delta',\ n \ge N;\ t(N) \ge m\} > 0. \tag{9}$$

Let us show that this cannot be if we consider $m$ to be chosen as large as will be needed below.



Define an integer-valued singular random variable z and an auxiliary process Z,,n ->_ 1, by setting (0 <

N, t(N)<m,r

min (k k >-_ N, IXk] [8, 8’]),

z l[x"l’ --< ’ p)(Z,_,- a t(’r--1)+n--,r 2

t(N) >- m,

Also set A 6’ and

E.{Z./,-Z}=.(X,,..., X.) .,E,,{IZ.+, Z. g.lP} "y.(X, X.)= %.

As in [1], (9) implies the inequality

P{ l (Zk+, Zk tXk)[ >-- l txkl A, n >-- N; t(N) >= m} > O.

Hence to obtain the desired contradiction it suffices to show that as n oo the randomvariable v, lYI converges to co uniformly (in to) while the expectation L, of thevariable

(Zk+l- Zk tXk) X{t(g) >= m}

is bounded. The first is almost obvious. Let us prove the second.

By Lemma 4, the definition of $Z_n$, and inequality (5) we have

[a(k)’Tk <-- gl\-] Xl’r> k},

.i’1 [a(k),p a(k) p

Hence the boundedness of $L_n$ will follow from the inequality

.i’ [a(k)’p

oO}ka(k)

t/ 2. +X{i=N+l

or

(10) sup E{ -l a(k) p }The method used below permits one to show only the inequality

<

since the fact that it is possible to estimate from above the other expectations in (10) by the same constant will be evident from the calculations. Let us agree, moreover, not to indicate the condition $\{\tau = \infty\}$, which is assumed throughout, in the notation of the conditional expectations and probabilities encountered below.

We decompose our series into parts as follows:

(11)


where

n2k+l min {s n2k <= S <-00, Zsr/2k+2 min {s /’[2k+l +2<-s--<, Zs

n min{r" r=(-1)S,r>ns_a},--1h sup {s r/k<r/2k+2},s r/Xk+l r/2k+2

and make the estimate successively in three stages.

The first of these is obvious since the estimate

k>-_O, no=N,

sign (Zr-Zr+l),/’1ok= F[2 s 1,

(clearly, n= nzk+l),

a(n2+2-1)P<2 Y. a, p

o->_ c(n2+2--1)1 c,,/

follows immediately from the inequality t(n2,4-4-1) > t(n2k 1). In order to consider theother terms we introduce the additional notation:

w=t(n+l-1)-t(n),n min{i’t(i)>t(n(r-1)), > ns(r-)},

k k kJsr’- ns(r+l)- nsr,

OSAk;l_-<r_-< w; no n;

O<--r<---- wk, n(w,+l 1,

""+"-’ [ 0(6) ] ,,1=Jr--Jr,Js, E v a,(.)+,----+(- 1) (Z.+, Z.) J=nr

nd ssum, without loss of gnrlity, that 11+ 1.

Let us prove that the conditional expectation of the triple sum in (11) for the givenvalues N, m, i, u and h (with the exception of u2, u,... U2h), 0 <= s<2hk= + 1, ofthevariables n, t(n), w, Z, and lk, respectively, may be estimated from above by a quantitynot depending on these values.

Analyzing the behavior of the corresponding parts of the trajectories of the process $X_n$, we first estimate the conditional expectation of the sum

under the additional assumption that the Z.a. take on the values u,, 0<r i. ederivation of this estimate makes use of Lemma and the relations

a=+,p(6)

8’’’Uo <u =u =.’-=u =u <’’’

which follow from the definitions. We have

E/o < E E E Rk+ 1 +S](1--VCp/co-1)’srn+r

k0 r=0 \Cm+r./ s=l

i ap2A a- V--2 0 p2p/(p--1)’mo+rsup Z/(p-1)+vp(). o+,

Finally, we finish the estimation by considering the $k$-th term of the remaining sum in (11) and decomposing it into three parts; here we omit the index $k$. We have

{’=-’ (a==,_,+, P ’(a )P }m2s J(2s)O I1 + I2 + g.2s-1)r +r (J(2s)r+ 2

s=l r=O Cm2s__l+r/ Cm2s+


The desired conditional expectations of $I_1$ and $I_3$ are estimated identically by means of Lemma 1. For example,

and similarly

EI3 < p-2.p+2p/(p--1)"

s=l r=O m2s+

The greatest contribution to the total estimate is supplied by the remaining term $I_2$. Its conditional expectation will be estimated under the additional assumption

(12) u u2h+ U2h-" U3 ’.We have

m2s+r

r=O C m2s+r/

a sup (u,_- u+,) +sup (u-u) dF(u-u}

+sup (u- u_) dF(u- u_)

where

h

ll(j)= VI Ilk(j), ll(j)={J(2_l),=j(2_l), 0-<r-<i2_1};

Ts {r m2s <= r <= m2s + i2s}, 1 =< s =< h;F(u) P{Z,- u-i < u; Z,,+, >= Z,,, N2s_ <= tl < NIZ_, u_,, fl (D}.

By (7) and (8) we have, under the condition that a, <= 8/2k, n >- rn >= max Ira(K), m(cr)],cre -, a,,_,+,.

udF(u)<=(p-1)p(8)p-1 =o cp

Hence using Lemma 1 and noting that, according to C.5, the inequalityp--I

Ctpaq

sup < K2q-->_m. C ap-l-"

holds for some K2 < oo and for all =< rn2s, we get

2crP - a"2s-’+’ (t + 1)(1 --..p/(p-1),m2s_l+r]<(p- 1)pP()=1 ,>-,,2. ,=o cpm,-,+l

,2p+2p/(p--1)"(p- 1)vEpP() =1 =o

It is easy to see that in any other possible case the estimate for $EI_2$ will be the same as in the case (12).


Combining these estimates and noting that, in the sums obtained, terms with $c_m$ of degree $p + 2p/(p-1)$ can be repeated at most four times and of degree $2p^2/(p-1)$ at most twice, we finally obtain

Eo< 2 a p 2A aP-1+ sup

On Convergence of the Processes (1)

As simple examples show, under conditions C.1-C.6 it is impossible to guarantee the convergence of the processes under consideration to the point $\theta$ with probability 1. It is easy to see, however, that by a certain strengthening of these conditions one can ensure that one of the properties (2) or (3) holds and thus obtain the desired convergence. Let us give the following two sufficient criteria.

Theorem 2. If conditions C.1-C.6 hold wherein $m(\delta) = m(\sigma) = 2$ and $d_m \equiv 0$, then the process (1) converges to $\theta$ with probability 1.

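PROOF. If $\lim t(n) < \infty$, then $\lim X_n$ exists (it may be equal to $\pm\infty$). Let us show that in this case property (2) is valid. Due to the symmetry it suffices for this to establish, for example, that

$$P\{\lim X_n > 0,\ \lim t(n) < \infty\} = 0.$$

Suppose the contrary. Then there exist $\delta > 0$ and integers $m \ge 2$ and $N = N(m, \varepsilon)$ such that, for any $\varepsilon$, $0 < \varepsilon \le a_m \rho(\delta)/2$,

$$p = P\{X_{n+1} - X_n \ge -\varepsilon,\ X_n \ge \delta,\ t(n) = m,\ n \ge N\} > 0.$$

But by Lemma 1, which the additional conditions permit us to apply for $m \ge 2$,

$$p_n = P\Bigl\{A_n \Bigm| B_n \cap \bigcap_{k=N}^{n-1} (A_k \cap B_k)\Bigr\} \le 1 - \nu c_m^{p/(p-1)},$$

$$A_n = \{X_{n+1} - X_n \ge -\varepsilon\}, \qquad B_n = \{X_n \ge \delta,\ t(n) = m\},$$

and hence

$$p = P\Bigl\{\bigcap_{k=N}^{\infty} (A_k \cap B_k)\Bigr\} \le \lim_{n \to \infty} \prod_{k=N}^{n} p_k = 0.$$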

Let us also give a modification of the general conditions ensuring property (3). It is related to conditions C.1, 2, 4, 6 and assumes the existence of a scalar sequence $\theta_m$ converging to $\theta$.

Theorem 3. If conditions C.3-C.5 hold along with the conditions

B.1: $\inf_{|x - \theta_m| \ge \delta,\ n \ge m \ge 2} |\bar g(n, m, x)| \ge \rho(\delta) > 0$,

B.2: $(x - \theta_m)\, \bar g(n, m, x) < 0$, $x \ne \theta_m$, $n \ge m \ge 2$,

B.6: $\lim_{\delta \to 0}\ \inf_{n \ge m \ge 2}\ \inf_{|x - \theta_m| \le \delta} P\{g(n, m, x, \omega) > 0\} > 0$ and $\lim_{\delta \to 0}\ \inf_{n \ge m \ge 2}\ \inf_{|x - \theta_m| \le \delta} P\{g(n, m, x, \omega) < 0\} > 0$,

and $m(\sigma) = 2$, then the process (1) converges to $\theta$ with probability 1.


PROOF. Conditions B are a special case of conditions C. Hence it suffices to prove that $P\{\lim t(n) = \infty\} = 1$. But this is obvious since from the proof of Theorem 2 it follows that the trajectories with $\lim t(n) < \infty$ can converge with positive probability only to the points $\theta_m$, while by B.6 any trajectory lying in a sufficiently small neighborhood of any $\theta_m$ changes the direction of motion with probability 1 after a finite number of iterations.

From the derived results it is not hard to deduce conditions of convergence of the known procedures of Robbins-Monro and Kiefer-Wolfowitz modified according to Kesten in terms of the corresponding regression functions. Thus, in the case of a Robbins-Monro process one can use Theorem 2 with $c_n \equiv 1$. In the case of a Kiefer-Wolfowitz process one can apply Theorem 3 with the sequence $c_n$ converging to 0. This follows from Lemma 5 in [4], according to which $|\theta_n - \theta| \le c_n$ for regression functions varying monotonically to the left and to the right of the point $\theta$, the unique minimum point.
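The two modified procedures mentioned above can be illustrated by a short simulation of recursion (1). This is a sketch under stated assumptions rather than the author's construction: the names `kesten_process`, `robbins_monro_g` and `kiefer_wolfowitz_g` are hypothetical, the noise is taken to be standard Gaussian, and the sequences used in the usage note ($a_m = 1/m$, $c_m = m^{-1/4}$) are illustrative choices that have not been checked against condition C.5.

```python
import random
from typing import Callable, Tuple

# g(m, x, rng): one noisy observation of the correction term for counter
# value m at the point x (a stand-in for g(n, m, x, omega) in recursion (1)).
Observation = Callable[[int, float, random.Random], float]


def kesten_process(g: Observation, a: Callable[[int], float],
                   x1: float, steps: int, seed: int = 0) -> Tuple[float, int]:
    """Run X_{n+1} = X_n + a_{t(n)} g(t(n), X_n); return (X_{steps+1}, t(steps+1))."""
    rng = random.Random(seed)
    x, t = x1, 1                 # X_1 and t(1) = 1
    last_nonzero = 0.0           # latest nonzero increment seen so far
    nonzero_count = 0            # number of nonzero increments seen so far
    for n in range(1, steps + 1):
        x_next = x + a(t) * g(t, x, rng)
        dx = x_next - x          # increment X_{n+1} - X_n
        if dx != 0:
            nonzero_count += 1
        if n == 1 or nonzero_count <= 1:
            t_next = 2           # fewer than two nonzero increments so far
        elif dx == 0:
            t_next = t           # no motion: counter unchanged
        else:
            # the counter grows only when the direction of motion reverses
            t_next = t + (1 if (dx < 0) != (last_nonzero < 0) else 0)
        if dx != 0:
            last_nonzero = dx
        x, t = x_next, t_next
    return x, t


def robbins_monro_g(theta: float) -> Observation:
    """Noisy root-finding observation (c_m = 1): -(x - theta) + noise."""
    return lambda m, x, rng: -(x - theta) + rng.gauss(0.0, 1.0)


def kiefer_wolfowitz_g(f: Callable[[float], float],
                       c: Callable[[int], float]) -> Observation:
    """Noisy finite-difference slope with span c_m, signed so as to descend f."""
    def g(m: int, x: float, rng: random.Random) -> float:
        y_plus = f(x + c(m)) + rng.gauss(0.0, 1.0)
        y_minus = f(x - c(m)) + rng.gauss(0.0, 1.0)
        return -(y_plus - y_minus) / (2.0 * c(m))
    return g
```

For example, `kesten_process(robbins_monro_g(1.0), lambda m: 1.0 / m, 5.0, 20000)` runs the modified Robbins-Monro procedure toward the root 1, and `kesten_process(kiefer_wolfowitz_g(lambda x: (x - 1.0) ** 2, lambda m: m ** -0.25), lambda m: 1.0 / m, 5.0, 20000)` runs the modified Kiefer-Wolfowitz procedure toward the minimum of a quadratic. In both cases the step $a_{t(n)}$ shrinks only when the trajectory reverses direction, which is the acceleration mechanism of [1].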

Received by the editors December 12, 1973

REFERENCES

[1] H. KESTEN, Accelerated stochastic approximation, Ann. Math. Statist., 29 (1958), pp. 41-59.

[2] H. J. KUSHNER and T. GAVIN, Extensions of Kesten’s adaptive stochastic approximation method, Ann. Statist., 1, 5 (1973), pp. 851-861.

[3] B. VON BAHR and C.-G. ESSEEN, Inequalities for the r-th absolute moment of a sum of random variables, $1 \le r \le 2$, Ann. Math. Statist., 36 (1965), pp. 299-303.

[4] D. L. BURKHOLDER, On a class of stochastic approximation processes, Ann. Math. Statist., 27, 4 (1956), pp. 1044-1059.

ON THE NUMBER OF MAXIMAL VERTICES OF A RANDOM ACYCLIC DIGRAPH

V. A. LISKOVETS

(Translated by S. M. Rudolfer)

Introduction

In the last few years, a large number of articles dedicated to the investigation of various numerical characteristics of random graphs have appeared. An especially intensive study was made of random trees and objects related to them (in the first place, mappings of sets into themselves). Their direct analogue when orientation is present are acyclic digraphs. This is what we call oriented graphs without dicycles.¹ Although acyclic digraphs are also very often found in theoretical investigations (for example, in the form of Hertz’ graphs of arbitrary digraphs (cf. [1], p. 253)) and are important for application to network planning, programming, and other areas, not enough attention has been given to their study until recently. In this article we investigate the behavior of one of their fairly simple and natural characteristics, namely, the number of maximal vertices, i.e., vertices to which no arcs lead. It is well known that every finite non-empty acyclic digraph has maximal vertices. This property is a defining one: a finite digraph is acyclic if and only if each of its non-empty subgraphs is a digraph with maximal vertices. The basic result of this article gives a description of the limiting distribution of their number in a random $n$-vertex acyclic digraph as $n \to \infty$.

¹ See [1], Ch. 4; with respect to the basic concepts of graph theory, we use the terminology of this book.
