
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 43, NO. 12, DECEMBER 1998 1745

“hysteresis,” caused by the subcritical bifurcation. The bifurcation diagrams for the backstepping controller (57) are shown in Fig. 4 for c₀ = 6 and β²k = 0.5. This controller “softens” the bifurcation from subcritical to supercritical and eliminates the hysteresis. In addition, it stabilizes all stall equilibria and prevents surge for all values of β.

While all three designs in Table I soften the bifurcation, the global design achieved with backstepping is due to a methodological difference. The bifurcation designs in [14] and [7] are based on local stability properties established by the center manifold theorem, because the maximum of the compressor characteristic is a bifurcation point that is not linearly controllable. Hence stabilization is inherently nonlinear and results in asymptotic but not exponential stability. Our Lyapunov-based design incorporates the good features of a bifurcation-based design. For γ₀ = 1, the term V_r(r) in the Lyapunov function (49) becomes V_r(R) = R, so that the Lyapunov function

V₂ = (c₀/(2c₁) + 3/8 + 1/(c₀β²)) ψ² + (5/2) R + (9/8) φ² + (1/4) φ⁴ + 3Rφ² + (1/2)(ψ − c₀φ)²    (59)

is (locally) quadratic in φ and ψ but only linear in R. Since the derivative of V₂ is quadratic in all three variables, V̇₂ ≤ −a₁φ² − a₂R² − a₃(ψ − c₀φ)² [see (56)], this clearly indicates that the achieved type of stability is asymptotic but not exponential. However, to satisfy the requirements not only for local but also for global stability, our analysis is considerably more complicated.

V. CONCLUSION

Experimental validation of the controller presented here is planned but is beyond the scope of this paper. Measurement of φ represents a challenge, but it is not expected to be insurmountable considering that a controller that employs the derivative of φ has been successfully implemented [7].

ACKNOWLEDGMENT

The authors would like to thank C. Nett for introducing them to the problem of control of compressor stall and surge. They would also like to thank M. Myers and K. Eveker (Pratt & Whitney) and D. Gysling (United Technologies Research Center) for continuous valuable interaction on this problem.

REFERENCES

[1] E. H. Abed and J.-H. Fu, “Local feedback stabilization and bifurcation control—I: Hopf bifurcation,” Syst. Contr. Lett., vol. 7, pp. 11–17, 1986.

[2] ——, “Local feedback stabilization and bifurcation control—II: Stationary bifurcation,” Syst. Contr. Lett., vol. 8, pp. 467–473, 1987.

[3] O. O. Badmus, S. Chowdhury, K. M. Eveker, C. N. Nett, and C. J. Rivera, “A simplified approach for control of rotating stall—Parts I and II,” in Proc. 29th Joint Propulsion Conf., Monterey, CA, AIAA papers 93-2229 and 93-2234, June 1993.

[4] O. O. Badmus, C. N. Nett, and F. J. Schork, “An integrated, full-range surge control/rotating stall avoidance compressor control system,” in Proc. 1991 ACC, pp. 3173–3180.

[5] J. Baillieul, S. Dahlgren, and B. Lehman, “Nonlinear control designs for systems with bifurcations with applications to stabilization and control of compressors,” in Proc. 34th IEEE Conf. Decision and Control, 1995, pp. 3062–3067.

[6] R. L. Behnken, R. D’Andrea, and R. M. Murray, “Control of rotating stall in a low-speed axial flow compressor using pulsed air injection: Modeling, simulations, and experimental validation,” in Proc. 34th IEEE Conf. Decision and Control, 1995, pp. 3056–3061.

[7] K. M. Eveker, D. L. Gysling, C. N. Nett, and O. P. Sharma, “Integrated control of rotating stall and surge in aeroengines,” in SPIE Conf. Sensing, Actuation, and Control in Aeropropulsion, Orlando, FL, Apr. 1995.

[8] R. A. Freeman and P. V. Kokotović, Robust Nonlinear Control Design. Boston, MA: Birkhäuser, 1996.

[9] E. M. Greitzer, “Surge and rotating stall in axial flow compressors—Parts I and II,” J. Engineering for Power, pp. 190–217, 1976.

[10] ——, “The stability of pumping systems—The 1980 Freeman Scholar lecture,” ASME J. Fluids Engineering, pp. 193–242, 1981.

[11] Z. P. Jiang, A. R. Teel, and L. Praly, “Small-gain theorem for ISS systems and applications,” Math. Contr., Signals, and Syst., vol. 7, pp. 95–120, 1995.

[12] M. Krstić, I. Kanellakopoulos, and P. V. Kokotović, Nonlinear and Adaptive Control Design. New York: Wiley, 1995.

[13] M. Krstić, J. M. Protz, J. D. Paduano, and P. V. Kokotović, “Backstepping designs for jet engine stall and surge control,” in Proc. 34th IEEE Conf. Decision and Control, 1995.

[14] D. C. Liaw and E. H. Abed, “Active control of compressor stall inception: A bifurcation-theoretic approach,” Automatica, vol. 32, pp. 109–115, 1996; also in Proc. IFAC Nonlinear Contr. Syst. Design Symp., Bordeaux, France, June 1992.

[15] F. E. McCaughan, “Bifurcation analysis of axial flow compressor stability,” SIAM J. Appl. Math., vol. 50, pp. 1232–1253, 1990.

[16] F. K. Moore and E. M. Greitzer, “A theory of post-stall transients in axial compression systems—Part I: Development of equations,” J. Eng. Gas Turbines and Power, vol. 108, pp. 68–76, 1986.

[17] J. D. Paduano, L. Valavani, A. H. Epstein, E. M. Greitzer, and G. R. Guenette, “Modeling for control of rotating stall,” Automatica, vol. 30, pp. 1357–1373, 1994.

[18] R. Sepulchre, M. Janković, and P. V. Kokotović, Constructive Nonlinear Control. New York: Springer-Verlag, 1997.

A Deterministic Analysis of Stochastic Approximation with Randomized Directions

I-Jeng Wang and Edwin K. P. Chong

Abstract—We study the convergence of two stochastic approximation algorithms with randomized directions: the simultaneous perturbation stochastic approximation algorithm and the random direction Kiefer–Wolfowitz algorithm. We establish deterministic necessary and sufficient conditions on the random directions and noise sequences for both algorithms, and these conditions demonstrate the effect of the “random” directions on the “sample-path” behavior of the studied algorithms. We discuss ideas for further research in analysis and design of these algorithms.

Index Terms—Deterministic analysis, random directions, simultaneous perturbation, stochastic approximation.

I. INTRODUCTION

One of the most important applications of stochastic approximation algorithms is in solving local optimization problems. If an estimator of the gradient of the criterion function is available, the Robbins–Monro algorithm [9] can be directly applied. In [7], Kiefer

Manuscript received September 12, 1996. This work was supported in part by the National Science Foundation under Grants ECS-9410313 and ECS-9501652.

I-J. Wang is with the Applied Physics Lab., Johns Hopkins University, Laurel, MD 20723 USA (e-mail: [email protected]).

E. K. P. Chong is with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285 USA.

Publisher Item Identifier S 0018-9286(98)08435-9.

0018–9286/98$10.00 © 1998 IEEE


and Wolfowitz present a modification of the standard Robbins–Monro algorithm to recursively estimate the extremum of a function in the scalar case. The algorithm is based on a finite-difference estimate of the gradient and does not require knowledge of the structure of the criterion function. In [1], Blum presents a multivariate version of Kiefer and Wolfowitz’s algorithm for higher dimensional optimization. We will refer to this algorithm as the Kiefer–Wolfowitz (KW) algorithm. For an objective function with dimension p, the finite-difference estimation generally requires 2p observations at each iteration. This requirement usually results in unrealistic computational complexity when the dimension of the problem is high. To circumvent this problem, several authors have studied variants of the KW algorithm based on finite-difference estimates of the directional derivatives along a sequence of randomized directions; see, for example, [6], [8], and [12]. We will refer to this type of algorithm as the random direction Kiefer–Wolfowitz (RDKW) algorithm. The number of observations needed by the RDKW algorithm is two per iteration, regardless of the dimension of the problem. However, the question arises as to whether the increase in the number of iterations (due to the randomization of the directions) may offset the reduction in the amount of data per iteration, resulting in worse overall performance. Different distributions for the randomized direction sequence have been considered: uniform distribution in [6], spherically uniform distribution in [8], and normal and Cauchy distributions in [12]. None of these results theoretically establishes the superiority of the RDKW algorithm with the respective direction distribution over the standard KW algorithm.

In [11], Spall presents a KW-type algorithm based on a “simultaneous perturbation” gradient approximation that requires only two observations at each iteration. The algorithm also moves along a sequence of randomized directions, as the RDKW algorithm does. By analyzing the asymptotic distribution, Spall [11] shows that the proposed algorithm can be significantly more efficient than the standard KW procedure. Following the terminology of [11], we refer to Spall’s algorithm as the simultaneous perturbation stochastic approximation (SPSA) algorithm. Chin presents in [4] both theoretical and numerical comparisons of the performance of the KW, RDKW (the name RDSA is used therein), and SPSA algorithms. A more general class of distributions is considered for the RDKW algorithm there, and the SPSA algorithm is shown to exhibit the smallest asymptotic mean squared error. Chin makes the assumption that the components of each direction have unit variance, which is not necessary for convergence of the RDKW algorithm, as illustrated by Proposition 2. In fact, as explained in Section IV, we can show that the SPSA and RDKW algorithms achieve the same level of performance asymptotically under optimized conditions. In [2], Chen et al. study a modification of the SPSA algorithm and prove its convergence under weaker conditions.

In this paper, we focus on the sample-path analysis of the SPSA and the RDKW algorithms. We develop a deterministic analysis of the algorithms and present deterministic necessary and sufficient conditions on both the randomized directions and noise sequence for convergence of these algorithms. Different from the results in [2], [4], and [11], we treat the “randomized” direction sequence as an arbitrary deterministic sequence and derive the conditions on each individual sequence for convergence of these algorithms. The resulting condition displays the sample-path effect of each random direction sequence on the convergence of both algorithms.

Throughout the paper, we consider the problem of recursively estimating the minimum of an objective function L: ℝᵖ → ℝ based on noisy measurements of L. We assume that L satisfies the following conditions.

A1) The gradient of L, denoted by f = ∇L, exists and is uniformly continuous.

A2) There exists x* ∈ ℝᵖ such that

• f(x*) = 0;
• for all ε > 0, there exists h_ε > 0 such that ‖x − x*‖ ≥ ε implies f(x)ᵀ(x − x*) ≥ h_ε ‖x − x*‖.

Note that Assumptions A1) and A2) are not the weakest possible assumptions on the function L for convergence of the SPSA or RDKW algorithms; for example, a weaker Lyapunov-type condition is considered by Chen et al. in [2] for convergence of the SPSA algorithm. Since our main objective is not to obtain convergence results under weaker conditions on L, we adopt the more restrictive assumptions A1) and A2) to avoid unnecessary complications that may arise from considering the more general L. Throughout this paper, {c_n} is a positive scalar sequence with lim_{n→∞} c_n = 0.

II. CONVERGENCE OF ROBBINS–MONRO ALGORITHMS

We rely mainly on the following convergence theorem from [14] and [15] to derive conditions on the perturbations and noise.

Theorem 1: Consider the stochastic approximation algorithm

x_{n+1} = x_n − a_n f(x_n) + a_n e_n + a_n b_n    (1)

where {x_n}, {e_n}, and {b_n} are sequences on ℝᵖ; f: ℝᵖ → ℝᵖ satisfies Assumption A2); and {a_n} is a sequence of positive real numbers satisfying lim_{n→∞} a_n = 0, Σ_{n=1}^∞ a_n = ∞, and lim_{n→∞} b_n = 0. Suppose that the sequence {f(x_n)} is bounded. Then, for any x₁ in ℝᵖ, {x_n} converges to x* if and only if {e_n} satisfies any of the following conditions.

C1) lim_{n→∞} sup_{n≤k≤m(n,T)} ‖Σ_{i=n}^k a_i e_i‖ = 0 for some T > 0, where m(n, T) = max{k : a_n + ··· + a_k ≤ T}.

C2) lim_{T→0} (1/T) lim sup_{n→∞} sup_{n≤k≤m(n,T)} ‖Σ_{i=n}^k a_i e_i‖ = 0.

C3) For any ε, δ > 0 and any infinite sequence of nonoverlapping intervals {I_k} on ℕ, there exists K ∈ ℕ such that for all k ≥ K,

‖Σ_{n∈I_k} a_n e_n‖ < ε Σ_{n∈I_k} a_n + δ.

C4) There exist sequences {f_n} and {g_n} with e_n = f_n + g_n for all n such that Σ_{k=1}^n a_k f_k converges and lim_{n→∞} g_n = 0.

C5) The weighted average {ē_n} of the sequence {e_n}, defined by

ē_n = (1/β_n) Σ_{k=1}^n γ_k e_k

converges to zero, where β_n = 1 for n = 1, β_n = Π_{k=2}^n 1/(1 − a_k) otherwise, and γ_n = a_n β_n.

Proof: See [14] for a proof for conditions C1)–C4) and [15] for a proof for condition C5).
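To make condition C5) concrete, here is a small numerical sketch (ours, not from the paper): with the common choice a_n = 1/(n+1), the weights work out to β_n = (n+1)/2 and γ_n = 1/2, so the weighted average ē_n reduces to the plain running average of {e_n}. An oscillating, mean-zero noise then satisfies C5).

```python
# Illustration (ours) of condition C5) with a_n = 1/(n+1):
# beta_1 = 1 and, for n >= 2, beta_n = prod_{k=2}^n 1/(1 - a_k) = (n+1)/2,
# so gamma_n = a_n * beta_n = 1/2 and ebar_n is the plain average of e_1..e_n.

def weighted_average(e, a):
    beta, s, out = 1.0, 0.0, []
    for n, (en, an) in enumerate(zip(e, a), start=1):
        if n > 1:
            beta = beta / (1.0 - an)   # running product defining beta_n
        gamma = an * beta              # gamma_n = a_n * beta_n
        s += gamma * en
        out.append(s / beta)           # ebar_n = (1/beta_n) * sum gamma_k e_k
    return out

N = 10000
a = [1.0 / (n + 1) for n in range(1, N + 1)]
e = [(-1.0) ** n for n in range(1, N + 1)]   # oscillating mean-zero noise
ebar = weighted_average(e, a)
print(abs(ebar[-1]))  # near 0: this {e_n} satisfies C5)
```

Since the averaged sequence vanishes, this noise is admissible for convergence in Theorem 1 even though e_n itself does not converge to zero.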

Theorem 1 provides five equivalent necessary and sufficient noise conditions, conditions C1)–C5), for convergence of the standard Robbins–Monro algorithm (1). Note that the assumption that {f(x_n)}


is bounded can be relaxed by incorporating a projection scheme into the algorithm as in [3]. Since our objective is to investigate the effect of the “random” directions on the algorithm, rather than to weaken the convergence conditions, the form of the convergence result in Theorem 1 suffices for our purpose. In the next section, we will use Theorem 1 to establish the convergence of two stochastic approximation algorithms with randomized directions by writing them into the form of (1).
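As a quick illustration of Theorem 1 (a hypothetical sketch, not from the paper): take p = 1, f(x) = x (which satisfies A2) with x* = 0), gains a_n = 1/(n+1), zero bias, and a vanishing noise e_n = n^(-1/2), which satisfies condition C4) with f_n = 0 and g_n = e_n. The recursion (1) then converges to x*.

```python
# Hypothetical illustration of recursion (1); choices of f, a_n, e_n are ours.
# f(x) = x satisfies A2) with x* = 0; e_n -> 0 satisfies C4) (f_n = 0, g_n = e_n).

def robbins_monro(x1, n_iter):
    x = x1
    for n in range(1, n_iter + 1):
        a_n = 1.0 / (n + 1)       # a_n -> 0 and sum a_n = infinity
        e_n = 1.0 / n ** 0.5      # vanishing noise
        b_n = 0.0                 # no bias in this sketch
        x = x - a_n * x + a_n * e_n + a_n * b_n   # recursion (1) with f(x) = x
    return x

x_final = robbins_monro(x1=5.0, n_iter=200000)
print(abs(x_final))  # close to x* = 0
```

The transient from x₁ = 5 decays like the product of (1 − a_n), which vanishes because Σ a_n = ∞, while the noise contribution tracks e_n downward.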

III. ALGORITHMS WITH RANDOMIZED DIRECTIONS

In this section, we study two variants of the KW algorithm with randomized directions: the SPSA and the RDKW algorithms. In contrast to the standard KW algorithm, which moves along an approximation of the gradient at each iteration, these algorithms move along a sequence of randomized directions. Moreover, these algorithms use only two measurements per iteration to estimate the associated directional derivative, as in the case of the RDKW algorithm, or some related quantity, as in the case of the SPSA algorithm.

We define the randomized directions as a sequence of vectors {d_n} on ℝᵖ. We denote the ith component of d_n by d_{ni}. Except for Propositions 1 and 2, the sequence {d_n} is assumed to be an arbitrary deterministic sequence. The main goal of this section is to establish a deterministic characterization of {d_n} that guarantees convergence of the algorithms under reasonable assumptions. Note that we use the same notation d_n to represent the random directions for both the SPSA and RDKW algorithms to elucidate the similarity between them. This does not imply that the same direction sequence should be applied to both algorithms. In general, different requirements on {d_n} are needed for convergence of these two algorithms, as illustrated in Theorems 2 and 3.

A. SPSA Algorithm

Although the convergence of Spall’s algorithm has been established in [11], it is not clear (at least intuitively) why the random perturbations used in the algorithm would result in faster convergence. In both Spall’s [11] and Chen’s [2] results, conditions on the random perturbations for convergence are stated in probabilistic settings. These stochastic conditions provide little insight into the essential properties of the perturbations that contribute to the convergence and efficiency of the SPSA algorithm. In this section, we develop a deterministic framework for the analysis of the SPSA algorithm. We present five equivalent deterministic necessary and sufficient conditions on both the perturbation and noise for convergence of the SPSA algorithm, based on Theorem 1. We believe that our sample-path characterization sheds some light on what makes the SPSA algorithm effective.

We now describe a version of the SPSA algorithm. We define a sequence of vectors {r_n}, related to {d_n}, by

r_n = (1/d_{n1}, ···, 1/d_{np}).

The SPSA algorithm is described by

x_{n+1} = x_n − a_n [(y_n⁺ − y_n⁻)/(2c_n)] d_n    (2)

where y_n⁺ and y_n⁻ are noisy measurements of the function L at perturbed points, defined by

y_n⁺ = L(x_n + c_n r_n) + e_n⁺
y_n⁻ = L(x_n − c_n r_n) + e_n⁻

with additive noise e_n⁺ and e_n⁻, respectively. For convenience, we write

f^r(x_n) = [L(x_n + c_n r_n) − L(x_n − c_n r_n)]/(2c_n)    (3)

as an approximation to the directional derivative along the direction r_n, r_nᵀ f(x_n). To analyze the algorithm, we rewrite (2) into the standard form of the Robbins–Monro algorithm (1):

x_{n+1} = x_n − a_n f^r(x_n) d_n + a_n [e_n/(2c_n)] d_n    (4)
        = x_n − a_n [r_nᵀ f(x_n) − b_n] d_n + a_n [e_n/(2c_n)] d_n
        = x_n − a_n f(x_n) + a_n b_n d_n + a_n [e_n/(2c_n)] d_n − a_n (d_n r_nᵀ − I) f(x_n)    (5)

by defining

b_n = r_nᵀ f(x_n) − f^r(x_n)
e_n = e_n⁻ − e_n⁺.    (6)

The sequence {b_n} represents the bias in the directional derivative approximation. The effective noise for the algorithm is the scaled difference between the two measurement noise values, [e_n/(2c_n)] d_n. We can apply the result in Theorem 1 to establish the convergence of the SPSA algorithm (2). We first prove that the bias sequence {b_n} converges to zero.
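For concreteness, recursion (2) with Bernoulli ±1 perturbations can be sketched in code as follows. This is our minimal, noise-free illustration (the objective, step-size choices, and function names are ours, not the paper's); for Bernoulli components, r_n equals d_n componentwise.

```python
import random

# Minimal sketch (ours) of the SPSA recursion (2)-(3), noise-free for clarity.
# d_n has Bernoulli +/-1 components, so r_n = (1/d_n1, ..., 1/d_np) equals d_n.

def spsa(L, x, n_iter, seed=0):
    rng = random.Random(seed)
    p = len(x)
    for n in range(1, n_iter + 1):
        a_n = 1.0 / (n + 10)             # gains: a_n -> 0, sum a_n = infinity
        c_n = 1.0 / (n + 10) ** 0.25     # perturbation sizes: c_n -> 0
        d = [rng.choice((-1.0, 1.0)) for _ in range(p)]
        r = [1.0 / di for di in d]       # componentwise reciprocals of d_n
        x_plus = [xi + c_n * ri for xi, ri in zip(x, r)]
        x_minus = [xi - c_n * ri for xi, ri in zip(x, r)]
        g = (L(x_plus) - L(x_minus)) / (2.0 * c_n)  # two measurements per step
        x = [xi - a_n * g * di for xi, di in zip(x, d)]  # move along d_n
    return x

# Quadratic test objective (ours) with minimum at (1, -2).
L = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x_hat = spsa(L, [0.0, 0.0], n_iter=20000)
```

For a quadratic objective the central difference along r_n is exact, so the sketch isolates the effect of the randomized directions themselves.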

Lemma 1: Suppose that L: ℝᵖ → ℝ satisfies Assumption A1), {c_n} converges to zero, and {r_n} is bounded. Then the sequence {b_n} defined by (6) converges to zero.

Proof: By the mean value theorem,

b_n = r_nᵀ f(x_n) − [L(x_n + c_n r_n) − L(x_n − c_n r_n)]/(2c_n)
    = r_nᵀ [f(x_n) − f(x_n + (2λ_n − 1) c_n r_n)]

where 0 ≤ λ_n ≤ 1 for all n ∈ ℕ. Let ε > 0 be given. Since f is uniformly continuous, there exists δ > 0 such that ‖x − y‖ < δ implies ‖f(x) − f(y)‖ < ε / sup_n ‖r_n‖. Furthermore, by the convergence of {c_n} there exists N ∈ ℕ such that ‖(2λ_n − 1) c_n r_n‖ < δ for all n ≥ N. Hence, for all n ≥ N,

|b_n| ≤ ‖r_n‖ ‖f(x_n) − f(x_n + (2λ_n − 1) c_n r_n)‖ < sup_n ‖r_n‖ · ε / (sup_n ‖r_n‖) = ε.

Therefore {b_n} converges to zero.

Using Theorem 1 and Lemma 1, we establish a necessary and sufficient condition for convergence of the SPSA algorithm in the following theorem.

Theorem 2: Suppose that Assumptions A1) and A2) hold, and {r_n}, {d_n}, and {f(x_n)} are bounded. Then, {x_n} defined by (2) converges to x* if and only if the sequences {(d_n r_nᵀ − I) f(x_n)} and {[e_n/(2c_n)] d_n} satisfy conditions C1)–C5).

Proof (⇒): Suppose that {x_n} converges to x*. Then {f(x_n)} converges to f(x*) = 0 by the continuity of f. Since {r_n} and {d_n} are bounded, ‖d_n r_nᵀ − I‖ is bounded and {(d_n r_nᵀ − I) f(x_n)} converges to zero. Thus {(d_n r_nᵀ − I) f(x_n)} satisfies conditions C1)–C5). By Theorem 1, {(d_n r_nᵀ − I) f(x_n) − [e_n/(2c_n)] d_n} satisfies C1)–C5). Therefore {[e_n/(2c_n)] d_n} satisfies conditions C1)–C5).

(⇐): This follows directly from Theorem 1 and Lemma 1.

Theorem 2 provides the tightest possible conditions on both the randomized direction sequence {d_n} and the noise sequence {e_n}. Note that the condition on d_n is “coupled” with the function values f(x_n). This special form of coupling suggests an adaptive design scheme


for d_n based on the estimate of f(x_n). However, this idea may be difficult to carry out due to the special structure of the matrix

d_n r_nᵀ − I =
[ 0                d_{n1}/d_{n2}   ···   d_{n1}/d_{np} ]
[ d_{n2}/d_{n1}    0               ···   d_{n2}/d_{np} ]
[ ⋮                ⋮               ⋱     ⋮             ]
[ d_{np}/d_{n1}    d_{np}/d_{n2}   ···   0             ]    (7)

We can see that it is difficult to scale the elements in the matrix according to {f(x_n)}. One solution to this is to establish probabilistic sufficient conditions on the perturbation to guarantee that the deterministic condition in Theorem 1 holds almost surely, as in [2] and [11]. We present a general sufficient condition based on the martingale convergence theorem. In the following proposition, we assume that {d_n} and {e_n} are random sequences.

Proposition 1: Let F_n be the σ-algebra generated by {d_k}_{k=1,···,n} and {e_k}_{k=1,···,n}. Assume that Σ_{n=1}^∞ a_n^q < ∞ for some q > 1, and E(d_{ni}/d_{nj} | F_{n−1}) = 0 for i ≠ j. Suppose that {d_n}, {r_n}, and {f(x_n)} are bounded. Then, {(d_n r_nᵀ − I) f(x_n)} satisfies noise conditions C1)–C5) almost surely.

Proof: Let z_n = a_n (d_n r_nᵀ − I) f(x_n), z_n = [z_{n1}, ···, z_{np}]ᵀ. Since

E(z_n | F_{n−1}) = E(a_n (d_n r_nᵀ − I) f(x_n) | F_{n−1}) = 0

{Σ_{k=1}^n z_k} is a martingale. Furthermore,

E(|z_{ni}|^q) < ∞

for all i ≤ p by the boundedness of {d_n}, {r_n}, {f(x_n)}, and Σ_{n=1}^∞ a_n^q. Hence, by the L^q convergence theorem for martingales [5, eq. (4.4), p. 217], the sequence {Σ_{k=1}^n z_k} converges almost surely. Therefore, {(d_n r_nᵀ − I) f(x_n)} satisfies condition C4) and hence the noise conditions C1)–C5).

In [2] and [11], d_n is assumed to be a vector of p mutually independent random variables independent of F_{n−1}. Under this assumption, it is clear that the condition in Proposition 1 can be satisfied by assuming either E(d_{ni}) = 0, as in [2], or E(1/d_{ni}) = 0, as in [11].
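As a quick sanity check of the Bernoulli case (our illustration, not from the paper): for independent ±1 components, d_{ni}/d_{nj} is itself a ±1 random variable with zero mean for i ≠ j, so the conditional-expectation hypothesis of Proposition 1 holds. A Monte Carlo estimate of the unconditional mean reflects this.

```python
import random

# Our sketch: for independent Bernoulli(+/-1) components d1, d2,
# the ratio d1/d2 is +/-1 with equal probability, so E(d1/d2) = 0,
# matching the hypothesis E(d_ni/d_nj | F_{n-1}) = 0 of Proposition 1.
rng = random.Random(1)
n_samples = 100000
acc = 0.0
for _ in range(n_samples):
    d1 = rng.choice((-1.0, 1.0))
    d2 = rng.choice((-1.0, 1.0))
    acc += d1 / d2
mean = acc / n_samples
print(mean)  # near 0
```

The same distribution also satisfies E(d_{ni}) = 0 and E(1/d_{ni}) = 0, so both probabilistic conditions cited above hold simultaneously.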

B. RDKW Algorithms

The RDKW algorithm with random directions of different distributions has been studied by several authors; see, for example, [6], [8], and [12]. In this section, we study the convergence of the RDKW algorithm for a general direction sequence {d_n}. We establish deterministic necessary and sufficient conditions on both the direction sequence {d_n} and the noise sequence for the convergence of the RDKW algorithm (Theorem 3). We compare and contrast these conditions to those of Theorem 2.

The RDKW algorithm can be described by

x_{n+1} = x_n − a_n [(y_n⁺ − y_n⁻)/(2c_n)] d_n    (8)

where

y_n⁺ = L(x_n + c_n d_n) + e_n⁺
y_n⁻ = L(x_n − c_n d_n) + e_n⁻.

As in the case of the SPSA algorithm, we can rewrite the algorithm (8) into a Robbins–Monro algorithm:

x_{n+1} = x_n − a_n f^d(x_n) d_n + a_n [e_n/(2c_n)] d_n
        = x_n − a_n σ² f(x_n) + a_n b_n d_n + a_n [e_n/(2c_n)] d_n − a_n (d_n d_nᵀ − σ² I) f(x_n)    (9)

by defining

b_n = d_nᵀ f(x_n) − f^d(x_n)
e_n = e_n⁻ − e_n⁺    (10)

where σ > 0 is an arbitrary constant real number and f^d(x_n) = [L(x_n + c_n d_n) − L(x_n − c_n d_n)]/(2c_n). Comparing (9) with (4), we can see that the RDKW algorithm differs from the SPSA algorithm only in the direction along which the directional derivative is estimated.
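The difference can be seen directly in code. Below is our sketch of recursion (8) (names and parameter choices are ours): it is identical to an SPSA sketch except that the measurements are taken at x_n ± c_n d_n rather than x_n ± c_n r_n. With Bernoulli ±1 components (so σ² = 1), the two updates coincide, as Section IV notes.

```python
import random

# Minimal sketch (ours) of the RDKW recursion (8), noise-free for clarity.
# The only change from an SPSA sketch: perturb along c_n * d_n, not c_n * r_n.

def rdkw(L, x, n_iter, seed=0):
    rng = random.Random(seed)
    p = len(x)
    for n in range(1, n_iter + 1):
        a_n = 1.0 / (n + 10)
        c_n = 1.0 / (n + 10) ** 0.25
        d = [rng.choice((-1.0, 1.0)) for _ in range(p)]  # here sigma^2 = 1
        x_plus = [xi + c_n * di for xi, di in zip(x, d)]  # perturb along d_n
        x_minus = [xi - c_n * di for xi, di in zip(x, d)]
        g = (L(x_plus) - L(x_minus)) / (2.0 * c_n)        # still two measurements
        x = [xi - a_n * g * di for xi, di in zip(x, d)]
    return x

# Quadratic test objective (ours) with minimum at (1, -2).
L = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x_hat = rdkw(L, [0.0, 0.0], n_iter=20000)
```

Because 1/(±1) = ±1, swapping d_n for r_n here produces exactly the SPSA iterates for the same seed.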

Following the same arguments as in the proofs of Lemma 1 and Theorem 2, we can show that the sequence {b_n} defined by (10) converges to zero and that the following theorem holds.

Theorem 3: Suppose that Assumptions A1) and A2) hold, and {d_n} and {f(x_n)} are bounded. Then, {x_n} defined by (8) converges to x* if and only if the sequences {(d_n d_nᵀ − σ² I) f(x_n)} and {[e_n/(2c_n)] d_n} satisfy noise conditions C1)–C5).

Comparing the above conditions with those of Theorem 2, we again notice a coupling between {d_n} and {f(x_n)}. Similar to the case of the SPSA algorithm, it may be difficult to design {d_n} based on {f(x_n)} to satisfy the condition for {(d_n d_nᵀ − σ² I) f(x_n)} above, due to the structure of the matrix

d_n d_nᵀ − σ² I =
[ d_{n1}² − σ²     d_{n1} d_{n2}   ···   d_{n1} d_{np} ]
[ d_{n2} d_{n1}    d_{n2}² − σ²    ···   d_{n2} d_{np} ]
[ ⋮                ⋮               ⋱     ⋮             ]
[ d_{np} d_{n1}    d_{np} d_{n2}   ···   d_{np}² − σ²  ]    (11)

Although we are allowed to choose smaller d_{ni} (the same is not true for SPSA, since 1/d_{ni} needs to be bounded), the diagonal terms always give a “weight” around the quantity σ². Furthermore, if we try to choose σ² = σ_n² such that σ_n² ≥ (d_{ni})² and let σ_n → 0, the resulting sequence of functions {σ_n² f(x_n)} in (9) may not satisfy Assumptions A1) and A2) and the algorithm may not converge. However, similar to the case for the SPSA algorithm, we can derive the following sufficient probabilistic condition on d_n. As before, we assume below that {d_n} and {e_n} are random sequences.

Proposition 2: Let F_n be the σ-algebra generated by {d_k}_{k=1,···,n} and {e_k}_{k=1,···,n}. Assume that Σ_{n=1}^∞ a_n^q < ∞ for some q > 1, and E(d_n d_nᵀ | F_{n−1}) = σ² I. Suppose that {d_n} and {f(x_n)} are bounded. Then {(d_n d_nᵀ − σ² I) f(x_n)} satisfies conditions C1)–C5) almost surely.

IV. CONCLUSION

As pointed out in Section III, the SPSA and RDKW algorithms are very similar in form. In fact, under a probabilistic framework, we can show that these two algorithms have the same asymptotic performance under optimized conditions. In [10], Sadegh and Spall show that the random direction sequence {d_n} with Bernoulli distribution for each component is asymptotically optimal for the SPSA algorithm. Following the same approach, we can show that the Bernoulli distribution is also optimal for the RDKW algorithm, based on the asymptotic distribution established by Chin in [4] for the RDKW algorithm. These two algorithms are clearly identical in this case and hence exhibit the same performance (see [13, Sec. 5.4] for a proof).

Although the probabilistic approach used in [4] and [11] is useful in assessing the asymptotic performance of stochastic approximation algorithms, it provides little insight into the essential properties of randomized directions that contribute to the convergence and efficiency of the SPSA and RDKW algorithms. Furthermore, simulation results suggest that the SPSA algorithm (or the RDKW algorithm) with Bernoulli distributed randomized directions outperforms the standard KW algorithm not only in the average but also along each sample


path. This phenomenon certainly is not reflected in the probabilistic analysis of [11] or [4]. Our deterministic analysis does shed some light on what makes the SPSA and the RDKW algorithms effective. Perhaps a complete answer can be obtained by further analyzing the sequence {(d_n r_nᵀ − I) f(x_n)} (or {(d_n d_nᵀ − σ² I) f(x_n)} for the RDKW), which is the difference between the SPSA (or the RDKW) algorithm and the standard KW algorithm. In the case where d_{ni} = ±1, as in the case of the Bernoulli distribution, the matrix d_n r_nᵀ − I (or d_n d_nᵀ − I) has a unique structure

[ 0                d_{n1} d_{n2}   ···   d_{n1} d_{np} ]
[ d_{n2} d_{n1}    0               ···   d_{n2} d_{np} ]
[ ⋮                ⋮               ⋱     ⋮             ]
[ d_{np} d_{n1}    d_{np} d_{n2}   ···   0             ].

We conjecture that the effect of the elements in the sequence {(d_n r_nᵀ − I) f(x_n)} will be “averaged out” [in the sense of condition C5)] across the iterations along each sample path, if the directions {d_n} are chosen appropriately.

REFERENCES

[1] J. R. Blum, “Multidimensional stochastic approximation methods,” Ann. Math. Stats., vol. 25, pp. 737–744, 1954.

[2] H.-F. Chen, T. E. Duncan, and B. Pasik-Duncan, “A Kiefer–Wolfowitz algorithm with randomized differences,” in Proc. 13th Triennial IFAC World Congr., San Francisco, CA, 1996, pp. 493–496.

[3] H.-F. Chen and Y.-M. Zhu, “Stochastic approximation procedures with randomly varying truncations,” Scientia Sinica, Series A, vol. 29, no. 9, pp. 914–926, 1986.

[4] D. C. Chin, “Comparative study of stochastic algorithms for system optimization based on gradient approximations,” IEEE Trans. Syst., Man, Cybern., vol. 27, 1997.

[5] R. Durrett, Probability: Theory and Examples. Pacific Grove, CA: Wadsworth & Brooks/Cole, 1991.

[6] Y. Ermoliev, “On the method of generalized stochastic gradients and quasi-Fejér sequences,” Cybernetics, vol. 5, pp. 208–220, 1969.

[7] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Ann. Math. Statist., pp. 462–466, 1952.

[8] H. K. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer, 1978.

[9] H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Stats., vol. 22, pp. 400–407, 1951.

[10] P. Sadegh and J. Spall, “Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Trans. Automat. Contr., vol. 43, pp. 1480–1484, Oct. 1998.

[11] J. C. Spall, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation,” IEEE Trans. Automat. Contr., vol. 37, Mar. 1992.

[12] M. A. Styblinski and T. S. Tang, “Experiments in nonconvex optimization: Stochastic approximation with function smoothing and simulated annealing,” Neural Networks, vol. 3, pp. 467–483, 1990.

[13] I.-J. Wang, “Analysis of stochastic approximation and related algorithms,” Ph.D. dissertation, Purdue Univ., 1996.

[14] I.-J. Wang, E. K. P. Chong, and S. R. Kulkarni, “Equivalent necessary and sufficient conditions on noise sequences for stochastic approximation algorithms,” Advances Appl. Probability, vol. 28, pp. 784–801, 1996.

[15] ——, “Weighted averaging and stochastic approximation,” Math. Contr., Signals, Syst., vol. 10, pp. 41–60, 1997.
[15] , “Weighted averaging and stochastic approximation,”Math.Contr., Signals, Syst., vol. 10, pp. 41–60, 1997.