Noise conditions for prespecified convergence rates of stochastic approximation algorithms


IEEE Transactions on Information Theory, vol. 45, no. 2, March 1999

Edwin K. P. Chong, Senior Member, IEEE, I-Jeng Wang, Member, IEEE, and Sanjeev R. Kulkarni

Manuscript received September 5, 1997; revised April 9, 1998. This work was supported in part by the National Science Foundation under Grants ECS-9501652 and IRI-9457645. E. K. P. Chong is with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285 USA. I-J. Wang is with the Research and Technology Development Center, The Johns Hopkins University Applied Physics Laboratory, Laurel, MD 20723-6099 USA (e-mail: I-Jeng.Wang@jhuapl.edu). S. R. Kulkarni is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA. Communicated by P. Moulin, Associate Editor for Nonparametric Estimation, Classification, and Neural Networks. Publisher Item Identifier S 0018-9448(99)01417-0.

Abstract—We develop deterministic necessary and sufficient conditions on individual noise sequences of a stochastic approximation algorithm for the error of the iterates to converge at a given rate. Specifically, suppose $\{\delta_n\}$ is a given positive sequence converging monotonically to zero. Consider a stochastic approximation algorithm $x_{n+1} = x_n - a_n(A_n x_n - b_n) + a_n e_n$, where $\{x_n\}$ is the iterate sequence, $\{a_n\}$ is the step size sequence, $\{e_n\}$ is the noise sequence, and $x^*$ is the desired zero of the function $f(x) = Ax - b$. Then, under appropriate assumptions, we show that $x_n - x^* = o(\delta_n)$ if and only if the sequence $\{e_n\}$ satisfies one of five equivalent conditions. These conditions are based on well-known formulas for noise sequences: Kushner and Clark's condition, Chen's condition, Kulkarni and Horn's condition, a decomposition condition, and a weighted averaging condition. Our necessary and sufficient condition on $\{e_n\}$ to achieve a convergence rate of $\{\delta_n\}$ is basically that the sequence $\{e_n/\delta_n\}$ satisfies any one of the above five well-known conditions. We provide examples to illustrate our result. In particular, we easily recover the familiar result that if $a_n = a/n$ and $\{e_n\}$ is a martingale difference process with bounded variance, then $x_n - x^* = o(n^{-1/2}(\log n)^{\beta})$ for any $\beta > 1/2$.

Index Terms—Convergence rate, Kiefer–Wolfowitz, necessary and sufficient conditions, noise sequences, Robbins–Monro, stochastic approximation.

I. INTRODUCTION

We consider a general stochastic approximation algorithm for finding the zero of a function $f: H \to H$, where $H$ is a general Hilbert space (over the reals $\mathbb{R}$):

$$x_{n+1} = x_n - a_n f_n \qquad (1)$$

where $x_n \in H$ is the estimate of the zero of the function $f$, $a_n$ is a positive real scalar called the step size, and $f_n \in H$ represents an estimate of $f(x_n)$. The estimate $f_n$ is often written as $f_n = f(x_n) - e_n$, where $e_n \in H$ represents the noise in the estimate. Typical convergence results for (1) specify sufficient conditions on the sequence $\{e_n\}$ (see [15] and references therein).

In this paper we are concerned with the rate of convergence of $\{x_n\}$ to $x^*$, the zero of the function $f$. To characterize the convergence rate, we consider an arbitrary positive real scalar sequence $\{\delta_n\}$ converging monotonically to zero. We are interested in conditions on the noise for which $x_n$ converges to $x^*$ at rate $\delta_n$; specifically, $x_n - x^* = o(\delta_n)$. By writing $x_n - x^* = o(\delta_n)$ we mean that $\delta_n^{-1}(x_n - x^*) \to 0$. As in some previous work on stochastic approximation, instead of making probabilistic assumptions on the noise, we take a deterministic approach and treat the noise as an individual sequence. This approach is in line with recent trends in information theory and statistics where individual sequences have been considered for problems such as source coding, prediction, and regression.

In our main result (Theorem 2), we give necessary and sufficient conditions on the sequence $\{e_n\}$ for $x_n - x^* = o(\delta_n)$ to hold. To illustrate our result, consider the special case where $a_n = 1/n$. Then, under appropriate assumptions, $x_n - x^* = o(\delta_n)$ if and only if $\left(\sum_{k=1}^{n} e_k/\delta_k\right)/n \to 0$ (i.e., the long-term average of $\{e_n/\delta_n\}$ is zero). In fact, our result provides a set of five equivalent necessary and sufficient conditions on $\{e_n\}$, related to certain familiar conditions found in the literature: Kushner and Clark's condition, Chen's condition, Kulkarni and Horn's condition, a decomposition condition, and a weighted averaging condition (see [20] for a study of these conditions and their relationship to convergence of stochastic approximation algorithms).
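To make the special case above concrete, the following minimal simulation sketch (ours, not the paper's; the constants and the target rate $\delta_n = n^{-0.4}$ are illustrative assumptions) runs the affine recursion with $a_n = 1/n$ and tracks both the scaled error and the long-term average of $e_n/\delta_n$:

```python
import numpy as np

# Illustrative sketch (not from the paper): run the scalar recursion
# x_{n+1} = x_n - a_n*(A*x_n - b) + a_n*e_n with a_n = 1/n, and track the
# long-term average of e_n/delta_n for the target rate delta_n = n**(-0.4).
rng = np.random.default_rng(0)
N = 200_000
A, b = 1.0, 2.0                     # f(x) = A*x - b, zero at x* = b/A
x, x_star = 0.0, b / A
e = rng.standard_normal(N)          # martingale difference noise
n_idx = np.arange(1, N + 1)
delta = n_idx ** (-0.4)

for n in range(1, N + 1):
    x = x - (1.0 / n) * (A * x - b) + (1.0 / n) * e[n - 1]

avg = np.cumsum(e / delta) / n_idx  # long-term average of e_n/delta_n
print("scaled error |x_N - x*|/delta_N:", abs(x - x_star) / delta[-1])
print("average of e_n/delta_n at n = N:", avg[-1])
```

Both printed quantities drift toward zero together, in line with the equivalence asserted by our main result for this step size.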

Classical results on convergence rates are obtained by calculating the asymptotic distribution of the process $\{\delta_n^{-1}(x_n - x^*)\}$, where some appropriate probabilistic assumption on $\{e_n\}$ is imposed. Such an approach was first taken by Chung [4], who studied the asymptotic normality of the sequence $\{\delta_n^{-1}(x_n - x^*)\}$ (for appropriate choice of $\delta_n$). Sacks [18] later provided an alternative proof of Chung's result. These results provide a means of characterizing the rate of convergence of stochastic approximation algorithms; they basically assert that if $a_n = a/n$, $a > 0$, then the rate of convergence for the stochastic approximation algorithm is essentially $O(n^{-1/2})$. In [6], Fabian provides a more general version of such an asymptotic distribution approach. Using weak convergence methods, Kushner and Huang [12] provide even stronger results along these lines.

Recently, Chen has presented results on convergence rates that do not involve calculating asymptotic distributions (see [1] for a summary of results). In particular, he provides rate results of the form $x_n - x^* = o(a_n^{\gamma})$, where $0 < \gamma \le 1$. Chen's characterization of rates is similar to ours, with $\delta_n = a_n^{\gamma}$.

Our results are based on a stochastic approximation algorithm, on a general Hilbert space, of the form (1) with $f_n = A_n x_n - b_n - e_n$ and $f(x) = Ax - b$. Here, $A_n$ and $b_n$ are estimates of $A$ and $b$, respectively, while $e_n$ is the noise. Note that although $f$ is affine, the form of the stochastic approximation algorithm is not a special case of the usually considered case with $f_n = f(x_n) - e_n$. Indeed, the form of the algorithm we adopt has been widely studied (e.g., [8], [9], [19]; see also [10] for a survey of references on this form of stochastic approximation algorithm). In our main result (Theorem 2), we give necessary and sufficient conditions on $\{e_n\}$ for the algorithm to converge with a prespecified rate $\delta_n$ [i.e., $x_n - x^* = o(\delta_n)$]. Our result therefore provides the tightest possible characterization of the noise for a convergence rate of the form $x_n - x^* = o(\delta_n)$ to be achievable, a characterization that has so far not been available. Moreover, ours is the first to provide rate results for the general class of stochastic approximation algorithms with $f_n = A_n x_n - b_n - e_n$. We believe that our approach may be useful for obtaining similarly tight rate results for the nonlinear case with $f_n = f(x_n) - e_n$ (see Section V).

II. BACKGROUND: A CONVERGENCE THEOREM

To prove our main result on convergence rates, we first need a convergence theorem for the stochastic approximation algorithm. Consider the algorithm

$$x_{n+1} = x_n - a_n A_n x_n + a_n b_n + a_n e_n \qquad (2)$$

where $x_n$ is the iterate, $a_n$ is the step size, $e_n$ is the noise, $A_n: H \to H$ is a bounded linear operator, and $b_n \in H$ with $b_n \to b$. Note that the above is a stochastic approximation algorithm of the form (1) with $f_n = A_n x_n - b_n - e_n$ (see also [8], [10], [19] for a treatment of convergence for a similar linear case).
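As a concrete instance, the following minimal sketch (ours; the matrix $A$, vector $b$, and the $1/n$ decay of the estimation errors are assumptions chosen for illustration, not taken from the paper) runs recursion (2) on $H = \mathbb{R}^3$ with noisy estimates of $A$ and $b$:

```python
import numpy as np

# Illustrative sketch of recursion (2) on H = R^d:
# x_{n+1} = x_n - a_n*A_n@x_n + a_n*b_n + a_n*e_n,
# with estimates A_n -> A, b_n -> b and zero-mean noise e_n.
rng = np.random.default_rng(1)
d, N = 3, 100_000
A = np.diag([1.0, 2.0, 3.0])      # spectrum in the right half-plane
b = np.array([1.0, -1.0, 0.5])
x_star = np.linalg.solve(A, b)    # zero of f(x) = A@x - b
x = np.zeros(d)

for n in range(1, N + 1):
    a_n = 1.0 / n
    A_n = A + rng.standard_normal((d, d)) / n   # estimate of A, error -> 0
    b_n = b + rng.standard_normal(d) / n        # estimate of b, b_n -> b
    e_n = rng.standard_normal(d)                # zero-mean noise
    x = x - a_n * (A_n @ x) + a_n * b_n + a_n * e_n

print("||x_N - x*|| =", np.linalg.norm(x - x_star))
```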

We will need some assumptions for our convergence theorem. First, we assume throughout that the step size sequence $\{a_n\}$ satisfies:

A1) $a_n > 0$, $a_n \to 0$, and $\sum_{n=1}^{\infty} a_n = \infty$.

The above are standard requirements in all stochastic approximation algorithms. Next, we consider conditions on the noise sequence $\{e_n\}$.

A. Noise Conditions

A central issue in studying stochastic approximation algorithms is characterizing the noise sequence $\{e_n\}$ for convergence. Here, we give five conditions on the sequence $\{e_n\}$. Each condition depends not only on $\{e_n\}$ but also on the step size sequence $\{a_n\}$. As we shall point out later, these five conditions are all equivalent and are both necessary and sufficient for convergence of (2). In the statement of the following conditions, we use $\|\cdot\|$ to denote the natural norm on $H$, and convergence is with respect to this norm.

N1) For some $T > 0$,
$$\lim_{n\to\infty} \sup_{n \le p \le m(n,T)} \left\| \sum_{i=n}^{p} a_i e_i \right\| = 0$$
where $m(n,T) = \max\{k : a_n + \cdots + a_k \le T\}$.

N2)
$$\lim_{T\to 0} \frac{1}{T} \limsup_{n\to\infty} \sup_{n \le p \le m(n,T)} \left\| \sum_{i=n}^{p} a_i e_i \right\| = 0.$$

N3) For any $\eta, \epsilon > 0$ and any infinite sequence of nonoverlapping intervals $\{I_k\}$ on $\mathbb{N} = \{1, 2, \ldots\}$, there exists $K \in \mathbb{N}$ such that for all $k \ge K$,
$$\left\| \sum_{n \in I_k} a_n e_n \right\| < \eta \sum_{n \in I_k} a_n + \epsilon.$$

N4) There exist sequences $\{\epsilon_n\}$ and $\{\nu_n\}$ such that $e_n = \epsilon_n + \nu_n$ for all $n$, $\nu_n \to 0$, and $\sum_{k=1}^{n} a_k \epsilon_k$ converges.

N5)
$$\lim_{n\to\infty} \frac{\sum_{k=1}^{n} w_k e_k}{\sum_{k=1}^{n} w_k} = 0$$
where
$$w_n = \begin{cases} a_1, & \text{if } n = 1 \\ a_n \displaystyle\prod_{k=2}^{n} \frac{1}{1 - a_k}, & \text{otherwise.} \end{cases}$$

The noise conditions above have been widely studied in the literature. The first four are proven to be equivalent in [20], while the fifth is shown to be equivalent to the first four in [21]. Condition N1) is the well-known condition of Kushner and Clark [13], while N2) is a version of a related condition by Chen [2]. Condition N3) was studied by Kulkarni and Horn in [11], related to a "persistently disturbing" noise condition. Condition N4) was suggested in [13, p. 29] (see also [2], [3], [15, p. 11], and [14] for applications of the condition). Condition N5) is a weighted averaging condition (see [19] and [21]). For convenience and without loss of generality, we assume in N5) that $a_n < 1$ for all $n$. A special case of this condition (with $a_n = 1/n$) is considered in [5] and [10], where the weighted averaging condition reduces to regular arithmetic averaging (i.e., $w_k = 1$ for all $k$).

For convenience, we define the following terminology.

Definition 1: Let $\{a_n\}$ be a step size sequence. We say that a noise sequence $\{e_n\}$ is small with respect to $\{a_n\}$ if $\{e_n\}$ satisfies any of the conditions N1)–N5) with associated step size sequence $\{a_n\}$.
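Smallness in the sense of Definition 1 can be probed numerically through the weighted averaging condition N5), which is the easiest of the five to compute. The sketch below (ours; the function name, step sizes, and test noises are assumptions for the illustration) builds the weights recursively via $w_n/w_{n-1} = (a_n/a_{n-1})/(1 - a_n)$ and evaluates the weighted averages:

```python
import numpy as np

# Sketch (ours): test the weighted averaging condition N5).
# w_1 = a_1, and w_n = a_n * prod_{k=2..n} 1/(1 - a_k), built recursively.
def n5_weighted_average(a, e):
    w = np.empty_like(a)
    w[0] = a[0]
    for n in range(1, len(a)):
        w[n] = w[n - 1] * (a[n] / a[n - 1]) / (1.0 - a[n])
    return np.cumsum(w * e) / np.cumsum(w)

rng = np.random.default_rng(2)
k = np.arange(1, 100_001)
a = 1.0 / (k + 1)                            # step sizes with a_n < 1
zero_mean = rng.standard_normal(k.size)      # small: average -> 0
biased = 0.5 + rng.standard_normal(k.size)   # not small: average -> 0.5
print(n5_weighted_average(a, zero_mean)[-1])
print(n5_weighted_average(a, biased)[-1])
```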

Next, we assume that the sequence $\{A_n\}$ satisfies:

B1) $\{A_n - A\}$ is small with respect to $\{a_n\}$, where $A: H \to H$ is a bounded linear operator with $\inf\{\mathrm{Re}\,\lambda : \lambda \in \sigma(A)\} > 0$, where $\sigma(A)$ denotes the spectrum of $A$.

B2)
$$\limsup_{n\to\infty} \frac{\sum_{k=1}^{n} w_k \|A_k\|}{\sum_{k=1}^{n} w_k} < \infty.$$

With these assumptions we have the following convergence theorem.

Theorem 1: Let $\{x_n\}$ be generated by algorithm (2), and suppose that A1), B1), and B2) hold. Then, $x_n \to x^* = A^{-1}b$ if and only if $\{e_n\}$ is small with respect to $\{a_n\}$.

III. RATE OF CONVERGENCE

We now consider the stochastic approximation algorithm

$$x_{n+1} = x_n - a_n(A_n x_n - b_n) + a_n e_n \qquad (3)$$

which is algorithm (2) written in terms of the estimates $A_n$ and $b_n$ of $A$ and $b$ in $f(x) = Ax - b$. As in Section I, let $\{\delta_n\}$ be a given positive scalar sequence converging monotonically to zero. We impose the following assumptions relating $\{\delta_n\}$ to the algorithm:

G1) $\delta_n^{-1}(b_n - b) \to 0$.

G2) The limit $c = \lim_{n\to\infty} (\delta_n - \delta_{n+1})/(a_n \delta_n)$ exists and is finite.

For the common choices $a_n = a n^{-\alpha}$, $a > 0$, $0 < \alpha \le 1$, and $\delta_n = n^{-\beta}$, $\beta > 0$, we have $c = 0$ if $\alpha < 1$, and $c = \beta/a$ if $\alpha = 1$. To see this, write

$$\frac{\delta_n - \delta_{n+1}}{a_n \delta_n} = \frac{n^{-\beta} - (n+1)^{-\beta}}{a n^{-\alpha}\, n^{-\beta}} = \frac{n^{\alpha}}{a}\left(1 - \left(1 + \frac{1}{n}\right)^{-\beta}\right) = \frac{n^{\alpha}}{a}\left(\frac{\beta}{n} + o\left(\frac{1}{n}\right)\right) = \frac{\beta}{a}\, n^{\alpha-1} + o(n^{\alpha-1})$$

which converges to $0$ when $\alpha < 1$ and to $\beta/a$ when $\alpha = 1$.
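This limit is easy to check numerically; the snippet below (ours; the parameter values are arbitrary) evaluates the ratio at increasing $n$:

```python
import numpy as np

# Numerical check (ours) of c = lim (delta_n - delta_{n+1})/(a_n*delta_n)
# for a_n = a*n**(-alpha) and delta_n = n**(-beta).
def c_ratio(n, a, alpha, beta):
    delta, delta_next = n ** (-beta), (n + 1.0) ** (-beta)
    return (delta - delta_next) / (a * n ** (-alpha) * delta)

n = np.array([1e3, 1e5, 1e7])
print(c_ratio(n, a=2.0, alpha=1.0, beta=0.5))  # -> beta/a = 0.25
print(c_ratio(n, a=2.0, alpha=0.7, beta=0.5))  # -> 0
```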

We need one more assumption on $\{\delta_n\}$. We use the following definition.

Definition 2: A scalar sequence $\{a_n\}$ is said to have bounded variation if $\sum_{n=1}^{\infty} |a_{n+1} - a_n| < \infty$.

The additional assumption on $\{\delta_n\}$ is as follows.

G3) The sequences $\{\delta_{n+1}/\delta_n\}$ and $\{\delta_n/\delta_{n+1}\}$ have bounded variation.

Assumption G3) is technical. In fact, the assumption cannot be relaxed (see the remark following Lemma 1 below). Note that G3) holds for any sequence $\{\delta_n\}$ of the form $\delta_n = n^{-\beta}$, $\beta > 0$.
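G3) can be sanity-checked numerically for this family (our sketch; the exponent $\beta = 0.5$ and the truncation points are arbitrary choices): the partial sums of $|s_{n+1} - s_n|$ for $s_n = \delta_{n+1}/\delta_n$ level off, as bounded variation requires.

```python
import numpy as np

# Check numerically that s_n = delta_{n+1}/delta_n has bounded variation
# when delta_n = n**(-beta): partial sums of |s_{n+1} - s_n| level off.
beta = 0.5
n = np.arange(1.0, 1_000_001.0)
s = ((n + 1.0) / n) ** (-beta)              # s_n = delta_{n+1}/delta_n
variation = np.cumsum(np.abs(np.diff(s)))   # partial sums of the variation
print(variation[[9_999, 99_999, 999_998]])  # nearly identical values
```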

Next, we introduce an additional assumption on the operator $A$.

B3) $\inf\{\mathrm{Re}\,\lambda : \lambda \in \sigma(A - cI)\} > 0$, where $I$ is the identity operator and $\sigma(A - cI)$ denotes the spectrum of $A - cI$.

To satisfy B3) in the case where $A \in \mathbb{R}^{d \times d}$, we need the real parts of the eigenvalues of $A$ to all exceed $c$. In the case where $c = 0$, assumption B3) reduces to the condition on $\sigma(A)$ in B1).

With the above assumptions, we are ready to state our main result.

Theorem 2: Let $\{x_n\}$ be generated by the stochastic approximation algorithm (3). Assume that assumptions A1), B1)–B3), and G1)–G3) hold. Then, $x_n - x^* = o(\delta_n)$ if and only if $\{e_n/\delta_n\}$ is small with respect to $\{a_n\}$.

The basic idea of the proof of Theorem 2 is to express the sequence $\{\delta_n^{-1}(x_n - x^*)\}$ using a recursion that is essentially (3) with noise sequence $\{e_n/\delta_n\}$. The desired result then follows from applying Theorem 1.

To prove Theorem 2, we need the following technical lemma.

Lemma 1: Consider a step size sequence $\{a_n\}$. Let $\{s_n\}$ be a positive sequence such that $\{s_n\}$ and $\{1/s_n\}$ have bounded variation. Define $\hat{a}_n = s_n a_n$. Then, the following hold:

1) If $\{a_n\}$ satisfies A1), then so does $\{\hat{a}_n\}$.

2) A sequence $\{\hat{e}_n\}$ is small with respect to $\{a_n\}$ if and only if $\{\hat{e}_n\}$ is small with respect to $\{\hat{a}_n\}$.

Proof: Note that $\{s_n\}$ and $\{1/s_n\}$ having bounded variation implies that $s_n$ converges to some positive real number. Thus, part 1) is trivial. Part 2) follows from [21, Lemma 3].

Remark: Part 2) can in fact be strengthened as follows. Let $L_a$ be the set of sequences that are small with respect to $\{a_n\}$, and $L_{\hat{a}}$ be the set of sequences that are small with respect to $\{\hat{a}_n\}$. Then, we have $L_a = L_{\hat{a}}$ if and only if $\{s_n\}$ and $\{1/s_n\}$ have bounded variation.

Using the above lemma, we can now prove Theorem 2.

Proof of Theorem 2: Substituting $\hat{x}_n = \delta_n^{-1}(x_n - x^*)$ into (3), we obtain

$$\hat{x}_{n+1} = \frac{\delta_n}{\delta_{n+1}}\,\hat{x}_n - \frac{a_n}{\delta_{n+1}}\, A_n(\delta_n \hat{x}_n + x^*) + \frac{a_n b_n}{\delta_{n+1}} + \frac{a_n e_n}{\delta_{n+1}}$$

$$= \hat{x}_n - a_n\frac{\delta_n}{\delta_{n+1}}\left[A_n - \frac{1}{a_n(\delta_n/\delta_{n+1})}\left(\frac{\delta_n}{\delta_{n+1}} - 1\right)I\right]\hat{x}_n + a_n\frac{\delta_n}{\delta_{n+1}}\left[(A - A_n)x^* + \frac{e_n}{\delta_n}\right] + a_n\frac{\delta_n}{\delta_{n+1}}\cdot\frac{1}{\delta_n}(b_n - b).$$

Write $\hat{a}_n = a_n(\delta_n/\delta_{n+1})$, $\hat{e}_n = (A - A_n)x^* + e_n/\delta_n$, $\hat{b}_n = \delta_n^{-1}(b_n - b)$, and

$$\hat{A}_n = A_n - \frac{\delta_n - \delta_{n+1}}{a_n \delta_n}\, I.$$

Then, we have

$$\hat{x}_{n+1} = \hat{x}_n - \hat{a}_n \hat{A}_n \hat{x}_n + \hat{a}_n \hat{b}_n + \hat{a}_n \hat{e}_n$$

where $\hat{b}_n \to 0$ by assumption G1). Note that by Lemma 1-1), $\{\hat{a}_n\}$ satisfies A1). Also, note that by assumption G2), the sequence $\{\hat{A}_n - (A - cI)\}$ is small with respect to $\{a_n\}$ (using [20, Lemma 2]), and hence also with respect to $\{\hat{a}_n\}$, by virtue of assumption G3) and Lemma 1-2). Thus, by assumption B3), $\{\hat{A}_n\}$ satisfies condition B1) with respect to $\{\hat{a}_n\}$ [where the operator $A$ in B1) is taken to mean $A - cI$ here]. Moreover, by writing

$$\|\hat{A}_n\| \le \|A_n\| + \frac{\delta_n - \delta_{n+1}}{a_n \delta_n}$$

we see that $\{\hat{A}_n\}$ satisfies condition B2) with respect to $\{\hat{a}_n\}$. Hence, by Theorem 1, we conclude that $\hat{x}_n \to 0$ if and only if $\{\hat{e}_n\}$ is small with respect to $\{\hat{a}_n\}$, which holds if and only if $\{e_n/\delta_n\}$ is small with respect to $\{\hat{a}_n\}$, by the definition of $\hat{e}_n$ and assumption B1) (we use the fact that the difference between two small sequences is also a small sequence). By assumption G3) and Lemma 1-2), we obtain the desired result.

Remark: Assumption G1) can be relaxed to the requirement that $\{\delta_n^{-1}(b_n - b)\}$ be small with respect to $\{a_n\}$. The proof above can be easily modified to accommodate this relaxed assumption by absorbing $\delta_n^{-1}(b_n - b)$ into $\hat{e}_n$; the same line of argument as above can then be used. Similarly, assumption G2) can be relaxed to the requirement that $\{(\delta_n - \delta_{n+1})/(a_n \delta_n) - c\}$ be small with respect to $\{a_n\}$. Note that the condition

$$\limsup_{n\to\infty} \frac{\sum_{k=1}^{n} w_k (\delta_k - \delta_{k+1})/(a_k \delta_k)}{\sum_{k=1}^{n} w_k} < \infty$$

is then automatically implied. Essentially the same proof as above goes through in this case.
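The change of variables in the proof can also be checked numerically. The sketch below (ours; a scalar case with exact estimates $A_n = A$ and $b_n = b$, so that $\hat{e}_n = e_n/\delta_n$ and $\hat{b}_n = 0$) iterates the hat recursion alongside (3) and confirms that it reproduces $\delta_n^{-1}(x_n - x^*)$:

```python
import numpy as np

# Numerical check (ours) of the change of variables: iterating the hat
# recursion with A_n = A and b_n = b (so e_hat = e_n/delta_n, b_hat = 0)
# reproduces x_hat_n = (x_n - x*)/delta_n from recursion (3).
rng = np.random.default_rng(3)
N, A, b = 1000, 1.0, 2.0
x_star = b / A
idx = np.arange(1, N + 2)
a = 1.0 / idx                        # a_n = 1/n
delta = idx ** (-0.4)                # delta_n = n**(-0.4)
e = rng.standard_normal(N)

x = 5.0
xhat = (x - x_star) / delta[0]
for n in range(N):
    x = x - a[n] * (A * x - b) + a[n] * e[n]
    a_hat = a[n] * delta[n] / delta[n + 1]
    c_n = (delta[n] - delta[n + 1]) / (a[n] * delta[n])
    xhat = xhat - a_hat * (A - c_n) * xhat + a_hat * (e[n] / delta[n])

print(xhat, (x - x_star) / delta[N])  # identical up to rounding error
```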

IV. EXAMPLES

In this section, we provide examples to illustrate our main result, Theorem 2. We assume throughout this section that B1)–B3) and G1) hold.

A. Random Noise Sequence

We give an example where $\{e_n\}$ is a random process.

Proposition 1: Suppose $\{e_n\}$ is a martingale difference process with $E(e_n^2) \le \sigma^2$, $\sigma^2 \in \mathbb{R}$. Let $a_n = n^{-\alpha}$ with $\alpha > 1/2$. Then, $x_n - x^* = o(n^{-\alpha + 1/2 + \epsilon})$ a.s. for any $\epsilon > 0$.

Proof: We apply Theorem 2 with $\delta_n = n^{-\alpha + 1/2 + \epsilon}$, $\epsilon > 0$.
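Proposition 1 can be illustrated empirically. The following sketch (ours; the choices $\alpha = 0.8$, $\epsilon = 0.1$, and the scalar test problem are assumptions for the demonstration, not part of the proposition) prints the scaled error $(x_n - x^*)/n^{-\alpha + 1/2 + \epsilon}$ at successive checkpoints; it should drift toward zero, if slowly:

```python
import numpy as np

# Empirical sketch (ours) of Proposition 1: with a_n = n**(-alpha) and
# martingale difference noise, (x_n - x*) / n**(-alpha + 1/2 + eps) -> 0.
# alpha = 0.8, eps = 0.1, and the scalar test problem are arbitrary choices.
rng = np.random.default_rng(4)
alpha, eps = 0.8, 0.1
A, b = 1.0, 2.0
x, x_star, N = 0.0, b / A, 10**6
checkpoints = {10**k for k in range(2, 7)}

for n in range(1, N + 1):
    a_n = n ** (-alpha)
    x = x - a_n * (A * x - b) + a_n * rng.standard_normal()
    if n in checkpoints:
        scaled = (x - x_star) / n ** (-alpha + 0.5 + eps)
        print(f"n={n:>8}  (x_n - x*)/delta_n = {scaled:+.4f}")
```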
