

Comments on "A Stochastic Approximation Method"

LENNART LJUNG

Abstract-The results stated in the above paper¹ concerning an improved stochastic approximation method are considered. Formulas for the variances of the estimates are derived, and it is found that, in fact, the new algorithm is inferior to previously suggested ones.

In the above paper¹ and [1] a new stochastic approximation method is proposed. With $r_k$, $k = 1, 2, \ldots$, as the sequence of observations, it is suggested that $x = Er_k$ ($E$ denotes mathematical expectation) be estimated as follows:

$$\hat{x}_n = \hat{x}_{n-1} + \gamma_n\{f_n(r_n) - \hat{x}_{n-1}\} \tag{1}$$

with

$$f_n(r_n) = \sqrt{|\hat{R}_n(l)|}$$

where

$$\hat{R}_n(l) = \frac{1}{n-l}\sum_{k=1}^{n-l} r_k r_{k+l}. \tag{2}$$

The sequence $\gamma_n$ is chosen as $1/\sqrt{n}$ or $1/n$. The algorithms thus obtained will be referred to as algorithms I and II, respectively. Sinha and Griscik¹ [1] state that these algorithms converge in the mean-square sense and that the variance of the estimate $\hat{x}_n$ is given by

$$E(\hat{x}_n - x)^2 = (\hat{x}_0 - x)^2\left\{\prod_{k=1}^{n}(1 - \gamma_k)\right\}^2 \tag{3}$$

where $\hat{x}_0$ is the initial estimate. Algorithms I and II would thus converge faster than two previously suggested algorithms (see, e.g., Tsypkin [2]). These are given by (1) with

$$f_n(r_n) = r_n; \qquad \gamma_n = 1/n$$

and

$$f_n(r_n) = \frac{1}{n}\sum_{k=1}^{n} r_k; \qquad \gamma_n = 1/n.$$

These algorithms will be referred to as algorithms III and IV, respectively.
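The recursions are straightforward to simulate. The following is a minimal Python sketch of algorithms I-IV for lag $l = 1$; the function name, the choice $\hat{x}_0 = 0$, and the i.i.d. Gaussian noise model are illustrative assumptions, not part of the original papers.

```python
import numpy as np

def run_algorithms(r):
    """Algorithms I-IV of the text, applied to one observation sequence r.

    I, II : (1)-(2) with f_n = sqrt(|R_n(1)|); gamma_n = 1/sqrt(n) (I), 1/n (II).
    III   : f_n(r_n) = r_n, gamma_n = 1/n (this is just the sample mean).
    IV    : f_n(r_n) = (1/n) * sum(r_1..r_n), gamma_n = 1/n.
    """
    xI = xII = xIII = xIV = 0.0   # initial estimates (assumption: x_0 = 0)
    S = 0.0                       # running sum of r_k * r_{k+1}
    cum = 0.0                     # running sum of r_k
    for n, rn in enumerate(r, start=1):
        cum += rn
        if n >= 2:
            S += r[n - 2] * r[n - 1]          # adds the term r_{n-1} * r_n
            f = np.sqrt(abs(S / (n - 1)))     # f_n = sqrt(|R_n(1)|), eq. (2)
            xI += (f - xI) / np.sqrt(n)       # algorithm I
            xII += (f - xII) / n              # algorithm II
        xIII += (rn - xIII) / n               # algorithm III
        xIV += (cum / n - xIV) / n            # algorithm IV
    return xI, xII, xIII, xIV

rng = np.random.default_rng(0)
r = 1.0 + rng.standard_normal(5000)           # r_k = x + e_k with x = 1
print(run_algorithms(r))                      # all four close to 1
```

Note that algorithm III is simply the recursive form of the sample mean, which is why its variance $\sigma^2/n$ below is elementary.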

In Sinha and Griscik¹ simulations of algorithms I-IV are compared. It is also stated¹ that algorithms I and II converge with probability one, and a proof to this effect is given. It will be shown here that both proofs of convergence, as well as the stated variance (3) of the estimate, are erroneous. However, it is true that algorithms I and II converge to $|x|$ (and not to $x$) both in the mean-square sense and with probability one. Asymptotic expressions for the variances of the estimates are derived and, in contrast to the aforementioned statements¹ [1], it is shown that algorithms I and II are inferior to algorithm III as regards the rate of convergence.

In the proof,¹ using Dvoretzky's theorem, the decomposition of $\hat{x}_{n+1}$ into a transformation of old estimates plus an innovation does not fulfill the elementary requirement that the innovation be independent (or virtually independent) of the old estimates. The errors under discussion¹ [1] seem to stem from the fact that the estimates $\hat{R}_n(l)$ have not been sufficiently carefully distinguished from the mathematical expectation $Er_kr_{k+l}$.

By iterating (1) it follows that

$$\hat{x}_n - |x| = (\hat{x}_l - |x|)\prod_{i=l+1}^{n}(1 - \gamma_i) + \sum_{i=l+1}^{n}\gamma_i\psi_i\prod_{j=i+1}^{n}(1 - \gamma_j) \tag{4}$$

where

$$\psi_i = \sqrt{|\hat{R}_i(l)|} - |x|.$$

Let $r_i = x + e_i$, where $e_i$, $i = 1, \ldots$, is a sequence of random variables with zero mean and finite second and third moments, such that $e_i$ and $e_{i+k}$, $k \geq l$, are independent for each $i$. Hence $Ee_ie_{i+l} = 0$.

It is actually argued¹ [1] that $Ee_ie_{i+l} = 0$ implies that $\psi_i = 0$ and that (3) thus follows from (4). This is evidently false. By estimating the rate at which $\psi_i$ tends to zero as $i \to \infty$, we will, however, be able to show convergence.
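The rate at which $\psi_i$ decays can be checked numerically before it is established analytically. A small sketch under the same illustrative assumptions as above (i.i.d. Gaussian $e_i$, $l = 1$):

```python
import numpy as np

def psi_along_path(x, N=100_000, seed=1):
    """psi_k = sqrt(|R_k(1)|) - |x| along one realization, for k = 2..N."""
    rng = np.random.default_rng(seed)
    r = x + rng.standard_normal(N)
    S = np.cumsum(r[:-1] * r[1:])            # partial sums of r_k * r_{k+1}
    k = np.arange(2, N + 1)
    return k, np.sqrt(np.abs(S / (k - 1))) - abs(x)

for x, rate in ((1.0, 0.5), (0.0, 0.25)):
    k, psi = psi_along_path(x)
    # if psi_k = O(k^{-rate}), the rescaled tail stays bounded
    print(f"x = {x}: max of |psi_k| * k^{rate} over the tail:",
          np.abs(psi[-1000:] * k[-1000:] ** rate).max())
```

The printed quantities stay of moderate size, consistent with the rates $k^{-1/2+\varepsilon}$ ($x \neq 0$) and $k^{-1/4+\varepsilon}$ ($x = 0$) derived below.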

From (2) we obtain

$$\hat{R}_k(l) = \frac{1}{k-l}\left\{(k-l)x^2 + x\sum_{i=1}^{k-l}e_i + x\sum_{j=l+1}^{k}e_j + \sum_{i=1}^{k-l}e_ie_{i+l}\right\}.$$

Applying the law of the iterated logarithm [3], we see that with probability one there exists a constant $M = M(\omega)$ such that

$$\left|\frac{1}{k}\sum_{i=1}^{k}e_i\right| < Mk^{-(1/2)+\varepsilon} \tag{5}$$

$$\left|\frac{1}{k}\sum_{i=1}^{k-l}e_ie_{i+l}\right| < Mk^{-(1/2)+\varepsilon} \tag{6}$$

for every $\varepsilon > 0$.

The constant $M$, however, depends on the realization of the process $e_i$. Hence, if $x \neq 0$,

$$\psi_k = \frac{1}{k}\sum_{j=1}^{k}e_j + \frac{1}{2xk}\sum_{j=1}^{k-l}e_je_{j+l} + O(k^{-1+\varepsilon}), \quad k \to \infty \tag{7}$$

and $\psi_k = O(k^{-1/2+\varepsilon})$, $k \to \infty$, $\varepsilon > 0$, with probability one. If $x = 0$, $\psi_k = O(k^{-1/4+\varepsilon})$, $k \to \infty$, with probability one. In both cases it is readily shown that the right member of (4) tends to zero as $\psi_n$ does, with probability one, in the cases $\gamma_k = 1/k$ and $\gamma_k = 1/\sqrt{k}$. Hence algorithms I and II converge with probability one to $|x|$.
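That the limit is $|x|$ rather than $x$ is easy to observe: for negative $x$ the algorithms settle at $-x$. A quick illustration, again under the assumed i.i.d. Gaussian noise model:

```python
import numpy as np

rng = np.random.default_rng(2)
x = -2.0
r = x + rng.standard_normal(200_000)     # observations with negative mean

xhat, S = 0.0, 0.0
for n in range(2, len(r) + 1):
    S += r[n - 2] * r[n - 1]
    f = np.sqrt(abs(S / (n - 1)))        # sqrt(|R_n(1)|) estimates |x|, not x
    xhat += (f - xhat) / n               # algorithm II (gamma_n = 1/n)

print(xhat)                              # close to |x| = 2, far from x = -2
```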

Using (4), it is also possible to obtain an expression for the asymptotic variance of the estimates. For the sake of definiteness, assume that $Ee_i^2 = \sigma^2$ and $l = 1$. By truncating the random variables $\psi_k$, $\psi_j$ at $\pm|x|/2$ and expanding into series, we obtain

$$E\psi_k\psi_j = \frac{\sigma^2}{\max(k,j)}\left(1 + \frac{\sigma^2}{4x^2}\right) + O(\max(1/k^2, 1/j^2)).$$

With $\gamma_k = 1/\sqrt{k}$ and $x \neq 0$,

$$E(\hat{x}_n - |x|)^2 = \frac{\sigma^2}{n}\left(1 + \frac{\sigma^2}{4x^2}\right) + O(1/n^2), \quad n \to \infty. \tag{8}$$

When $x = 0$,

$$E\hat{x}_n^2 = \frac{2\sigma^2}{\sqrt{2\pi n}}(1 + o(1)), \quad n \to \infty. \tag{9}$$

In the case $\gamma_k = 1/k$, we obtain from (4)

$$\hat{x}_n - |x| \approx \frac{1}{n}\sum_{k=1}^{n}\psi_k.$$

Hence

$$E(\hat{x}_n - |x|)^2 = E\left(\frac{1}{n}\sum_{k=1}^{n}\psi_k\right)^2 = \frac{1}{n^2}\sum_{k=1}^{n}\sum_{j=1}^{n}E\psi_k\psi_j$$

$$= \sigma^2\left(1 + \frac{\sigma^2}{4x^2}\right)\frac{1}{n^2}\sum_{k=1}^{n}\sum_{j=1}^{n}\left(\frac{1}{\max(k,j)} + O(\max(1/k^2, 1/j^2))\right)$$

$$= \frac{2\sigma^2}{n}\left(1 + \frac{\sigma^2}{4x^2}\right) + O((\log n)/n^2), \quad n \to \infty. \tag{10}$$
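Expressions (8) and (10) are easy to confirm by Monte Carlo. The sketch below estimates $nE(\hat{x}_n - |x|)^2$ for algorithms I, II, and III under the illustrative Gaussian model; by (8), (10), and the elementary result for algorithm III, the printed values should approach $\sigma^2(1 + \sigma^2/4x^2)$, $2\sigma^2(1 + \sigma^2/4x^2)$, and $\sigma^2$, respectively (1.25, 2.5, and 1.0 for $\sigma = x = 1$).

```python
import numpy as np

def one_run(rng, x=1.0, sigma=1.0, N=2000):
    r = x + sigma * rng.standard_normal(N)
    xI = xII = xIII = 0.0
    S = 0.0
    for n in range(1, N + 1):
        if n >= 2:
            S += r[n - 2] * r[n - 1]
            f = np.sqrt(abs(S / (n - 1)))
            xI += (f - xI) / np.sqrt(n)    # algorithm I
            xII += (f - xII) / n           # algorithm II
        xIII += (r[n - 1] - xIII) / n      # algorithm III
    return xI, xII, xIII

rng = np.random.default_rng(3)
N, M = 2000, 500
est = np.array([one_run(rng, N=N) for _ in range(M)])
print(N * ((est - 1.0) ** 2).mean(axis=0))   # expect roughly [1.25, 2.5, 1.0]
```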

It is also possible to obtain formulas which are not asymptotic in $n$. In the case when the random variable $r_kr_{k+1}$ has a normal distribution and $\gamma_n = 1$, it can be shown that, for $x = 0$,

$$E\hat{x}_n^2 = \frac{2\sigma^2}{\sqrt{2\pi(n-1)}}$$

and, for $x \neq 0$,

$$E(\hat{x}_n - |x|)^2 = \frac{(2n-3)\sigma^2}{2(n-1)^2} + \frac{\sigma^4}{4x^2(n-1)} + O(1/x^4), \quad x \to \infty.$$

These variances should be compared to the one that algorithm III yields ($\sigma^2/n$) and the one that algorithm IV yields ($2\sigma^2/n$). The first variance is elementary and the second is calculated in the same way as (10).²

Thus algorithms I and II yield estimates with higher variances than algorithm III. In fact, algorithm III provides an estimate whose variance equals the lowest possible one, given by the Cramér-Rao inequality. Any search for better algorithms in this respect will therefore be fruitless.

A plausible account for the results of the simulations given by Sinha and Griscik¹ is that $\hat{R}_n(l)$ has been replaced by

$$x^2 + \frac{1}{n-l}\sum_{k=1}^{n-l}e_ke_{k+l}.$$

Such a formula is actually given¹ for the estimates, but it is true only for the expected values of the estimates. The variances of the estimates obtained by such an erroneous simulation of the algorithms are given by the second term in the right-hand members of (8) and (10), respectively. The results shown in Sinha and Griscik¹ are consistent with these expressions.
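The hypothesis about the simulations can itself be tested. The sketch below runs algorithm II twice: once with $\hat{R}_n(1)$ computed from the data, and once with the replacement above, in which the cross term $x\sum e_i$ is missing. Under the illustrative model with $\sigma = x = 1$, the correct simulation should give $nE(\hat{x}_n - |x|)^2 \approx 2.5$ by (10), while the erroneous one gives only the second term, $2\sigma^4/4x^2 = 0.5$.

```python
import numpy as np

def algo_II(rng, erroneous, x=1.0, N=2000):
    e = rng.standard_normal(N)            # noise e_k with sigma = 1
    r = x + e
    xh = 0.0
    S_true = S_err = 0.0
    for n in range(2, N + 1):
        S_true += r[n - 2] * r[n - 1]     # data-based sum for R_n(1)
        S_err += e[n - 2] * e[n - 1]      # noise-only sum (cross term dropped)
        R = x**2 + S_err / (n - 1) if erroneous else S_true / (n - 1)
        xh += (np.sqrt(abs(R)) - xh) / n  # algorithm II update
    return xh

rng = np.random.default_rng(4)
N, M = 2000, 500
for erroneous in (False, True):
    est = np.array([algo_II(rng, erroneous, N=N) for _ in range(M)])
    print(erroneous, N * ((est - 1.0) ** 2).mean())
# expect: False -> about 2.5 (eq. (10));  True -> about 0.5 (second term only)
```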

Manuscript received June 1, 1972. The author is with the Division of Automatic Control, Lund Institute of Technology, Lund, Sweden.

¹ N. K. Sinha and M. P. Griscik, IEEE Trans. Systems, Man, Cybern., vol. SMC-1, pp. 338-344, Oct. 1971.
² The expression for the variance of the estimates from algorithm IV used in Sinha and Griscik¹ [1] is erroneous. This has been pointed out in [4].

REFERENCES

[1] N. K. Sinha and M. P. Griscik, "New algorithm for stochastic approximation," IEEE Trans. Inform. Theory (Corresp.), vol. IT-17, pp. 494-495, July 1971.
[2] Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems. New York: Academic Press, 1971.
[3] K. L. Chung, A Course in Probability Theory. New York: Harcourt, Brace & World, 1968.
[4] Ya. Z. Tsypkin, "On algorithms of distribution density estimation and moments by observations," Automat. Telemech., vol. 7, pp. 101-106, 1967.
