
Journal of Econometrics 38 (1988) 91-102. North-Holland

A STATIONARY STOCHASTIC APPROXIMATION METHOD

Roger F. LIDDLE

Becton Dickinson Research Center, Research Triangle Park, NC 27709, USA

John F. MONAHAN

North Carolina State University, Raleigh, NC 27695, USA

A problem occurring in statistics and econometrics is the minimization (or root-finding) of a function of several parameters expressed as an integral. If the integral cannot be evaluated analytically and numerical integration is required, evaluation can be prohibitively costly when the dimension of integration exceeds two. Since Monte Carlo techniques are natural in such a case, stochastic approximation can be viewed as combining integration and optimization. While these methods require minimal conditions for convergence, implementation as a practical problem solver faces difficulties in verification and stopping. A stationary stochastic approximation method is proposed, which is less sophisticated mathematically. This method is applied to two types of problems: a Bayesian decision problem and the computation of minimum Hellinger distance estimators.

1. Introduction

This work has been motivated by the general class of problems of finding the optimum or root of a function of p variables denoted by θ. This function of θ is defined by an integral over another set of r variables x, which cannot be evaluated analytically. The integrand is a function of both sets of variables (x, θ) which can be evaluated; the measure is a weight function w(x), independent of θ.

min_θ ∫ g(x; θ) w(x) dx    or    solve    0 = ∫ h(x; θ) w(x) dx.    (1)

Neither integration nor optimization is particularly novel when considered separately. Davis and Rabinowitz (1984) present an excellent survey of the field of numerical integration; Dennis and Schnabel (1983) give an excellent guide to methods of unconstrained optimization. However, taken together, the computational task is formidable, if not forbidding.

In some cases, where the integrand is a smooth function and the dimension of x is small, the numerical integration can be imbedded within the optimization. However, optimization methods can seldom tolerate errors in the evaluation of the function, which in this case dictates accurate numerical integration.



Since the accuracy is commonly O(n^{-k/d}) for n evaluations in d dimensions, where k is 2 or 4, improvements in accuracy in high dimensions are costly. Consequently the O(n^{-1/2}) convergence of Monte Carlo becomes appealing.

When the integral is evaluated by Monte Carlo by sampling from w(x), the optimization/root-finding problem can be fit into the framework of stochastic approximation. Although stochastic approximation has been in the literature since 1951 (Robbins and Monro), its applications are few. Ruppert et al. (1984) is a notable exception. The problems that have motivated this work demand better accuracy than stochastic approximation commonly provides; three digits are considered necessary. While this will surely entail substantial computation, this cost is acceptable, as long as confidence can be established that the problem has indeed been solved. Problems with the practical implementation of stochastic approximation have led to the consideration of a mathematically simpler stationary method.

The motivation for the methods and their potential applications follows in section 2. The stationary stochastic approximation method is presented in section 4, after the discussion of stochastic approximation and its drawbacks in section 3. The solutions of the motivating problems, minimum Hellinger distance estimation and a Bayesian decision problem, are discussed in section 5.

2. Motivation

Stochastic approximation methods are designed to find the root of a regression function E{y | θ} = m(θ), where any number of random variables y can be generated for an arbitrary θ. For the multivariate case, both y and θ are p-dimensional vectors, and m(θ) is a vector-valued function in p dimensions. To convert this formulation to (1), for integration by Monte Carlo, let X be a random variable with density w(x), and since the distribution of y changes with θ, let y = h(X, θ). Hence

E(y | θ) = E[y(X; θ) | θ] = E[h(X; θ) | θ] = ∫ h(x; θ) w(x) dx.    (2)

Many problems can be reformulated in this way with an appropriate choice of h and w.
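As an illustration (ours, not the paper's) of this reformulation, the following Python sketch estimates m(θ) = E[h(X, θ)] by averaging h over draws from w; the functions sample_w and h are hypothetical stand-ins for a particular problem.

```python
import numpy as np

def monte_carlo_m(theta, h, sample_w, n_draws=10_000):
    """Estimate m(theta) = E[h(X, theta)] with X drawn from the weight density w.

    h(x, theta)  -- integrand of (1)/(2), evaluable pointwise (hypothetical user-supplied function)
    sample_w(n)  -- returns n draws from w(x)                  (hypothetical user-supplied sampler)
    """
    x = sample_w(n_draws)
    values = np.array([h(xi, theta) for xi in x])
    # The sample mean is an unbiased estimate of the integral; its standard error
    # decreases as O(n^{-1/2}) regardless of the dimension of x.
    return values.mean(axis=0), values.std(axis=0, ddof=1) / np.sqrt(n_draws)
```

A single noisy evaluation (n_draws = 1) is exactly the random variable Y used by the stochastic approximation recursions discussed below.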

The original problem that motivated the authors' interest in stochastic approximation methods was the general Bayesian decision problem. A decision maker, or agent, has a continuum of actions θ available to him; an action incurs a loss L(θ; φ) when the true state of nature is φ. The agent desires to minimize the expected loss of his action. In the case of an economic decision, φ represents parameters of an econometric model [see Zellner (1971)]. The expectation is taken with respect to the posterior distribution on the parameters, p(φ | data), which combines both prior information on the parameters and the likelihood of the data for given values of the parameters.


Now in many problems, the only tractable method of analyzing the posterior distribution is by Monte Carlo integration with respect to an importance distribution I(φ). Expressed as a problem in the format (1), this becomes

E[L(θ; φ) | data] = ∫ L(θ; φ) [p(φ | data) / I(φ)] I(φ) dφ,    (3)

where w(φ) = I(φ), the importance function. This importance function is chosen so that variate generation from it is relatively easy and the integration is well-behaved [Geweke (1986)].
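As a hedged sketch of this importance-sampling step (ours, not the paper's), the expected posterior loss of an action θ can be estimated as follows; loss, log_post, sample_I, and log_I are hypothetical stand-ins, and self-normalized weights are used so that only an unnormalized posterior is needed.

```python
import numpy as np

def expected_posterior_loss(theta, loss, log_post, sample_I, log_I, n_draws=5_000):
    """Estimate E[L(theta; phi) | data] as in (3) by importance sampling from I(phi).

    loss(theta, phi) -- loss of action theta at parameter phi          (hypothetical)
    log_post(phi)    -- unnormalized log posterior log p(phi | data)   (hypothetical)
    sample_I(n)      -- n draws from the importance density I(phi)     (hypothetical)
    log_I(phi)       -- log importance density at phi                  (hypothetical)
    """
    phi = sample_I(n_draws)
    log_w = np.array([log_post(p) - log_I(p) for p in phi])
    w = np.exp(log_w - log_w.max())              # guard against overflow
    losses = np.array([loss(theta, p) for p in phi])
    # Self-normalized weights absorb the unknown normalizing constant of the posterior.
    return np.sum(w * losses) / np.sum(w)
```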

The immediate motivating problem for this work was the computation of minimum Hellinger distance (MHD) estimates for the multivariate normal distribution parameters, location vector μ and dispersion matrix Σ. The MHD estimators have excellent properties including asymptotic efficiency [Beran (1977)]. In the multivariate normal case, Tamura and Boos (1986) show MHD estimators to have strong robustness properties. The big drawback is that they are difficult, if not impossible, to compute using straightforward techniques. For observations in r dimensions, let θ denote the p = r(r + 3)/2 parameters to be estimated [r for μ and r(r + 1)/2 for Σ]. Let f_θ(x) denote the r-dimensional multivariate normal density for these parameters θ. Let f̂ be the kernel density estimate obtained from the data. Then the MHD estimators minimize the distance

d(f_θ, f̂) = ∫ [f_θ^{1/2}(x) − f̂^{1/2}(x)]² dx.    (4)

To reformulate this into the minimization form (1), let w(x) = f̂(x); then

d(f_θ, f̂) = ∫ [f_θ^{1/2}(x)/f̂^{1/2}(x) − 1]² f̂(x) dx,    (5)

so that

g(x; θ) = [f_θ^{1/2}(x)/f̂^{1/2}(x) − 1]².

To rewrite the problem as a root-finding problem, invoke the Dominated Convergence Theorem and take derivatives with respect to θ inside the integral, yielding

0 = ∫ D(x; θ) f̂(x) dx,    (6)


where D(x; θ) denotes the derivative of g(x; θ) with respect to θ. Determining D(x; θ) presents a reasonably difficult exercise in calculus. This expression of the MHD problem requires generation of random variables from the kernel density distribution, which is quite easy.
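To make the sampling step concrete, here is a minimal sketch (ours, not the paper's) of drawing from a kernel density estimate and evaluating the integrand g(x; θ) of (5); a Gaussian kernel and a scalar bandwidth are assumed for simplicity, whereas the kernel used in section 5 has finite support and a data-dependent shape.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_kernel_density(data, bandwidth, n_draws, rng):
    """Draw from a Gaussian-kernel density estimate: pick a data point at random, add kernel noise."""
    n, r = data.shape
    idx = rng.integers(n, size=n_draws)
    return data[idx] + bandwidth * rng.standard_normal((n_draws, r))

def g_mhd(x, mu, sigma, data, bandwidth):
    """Integrand g(x; theta) = (sqrt(f_theta(x) / f_hat(x)) - 1)^2 from (5), theta = (mu, sigma)."""
    f_theta = multivariate_normal(mean=mu, cov=sigma).pdf(x)
    # Gaussian-kernel density estimate f_hat evaluated at x.
    diffs = x - data
    kern = np.exp(-0.5 * np.sum(diffs ** 2, axis=1) / bandwidth ** 2)
    f_hat = kern.sum() / (len(data) * (bandwidth * np.sqrt(2 * np.pi)) ** data.shape[1])
    return (np.sqrt(f_theta / f_hat) - 1.0) ** 2
```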

For this paper, we have concentrated on the root-finding problem. Since most problems originate as optimization problems, converting them to root-finding problems requires the user to compute first partial derivatives analytically. However, the root-finding problem is regarded as a much easier problem to solve, so much so that whenever the derivatives can be computed without undue human or computational hardship, this effort is well worthwhile.

Stochastic approximation methods have been adapted for optimization problems by Kiefer and Wolfowitz (1952) and extended to the multivariate case by Blum (1954). More recently, Ruppert et al. (1984) combined numerical differences with Monte Carlo integration variance-reduction tools for optimization. We have not yet investigated the application of stationary stochastic approximation to the optimization problem.

3. Stochastic approximation

Consider the problem of finding the root of a regression function E(y | θ) = m(θ), where θ can be chosen at will and any number of y's can be obtained. When the (unknown) slope of the function is positive, then, if the observed y is positive, m(θ) is more likely to be positive, and the root should be smaller than the current value of θ. Similarly, negative values of y suggest moving right, leading to the sequence of design points t_k for values of θ,

t_{k+1} = t_k − a_k Y_k,    (7)

so that for a regression function with positive slope, the correction should be in the opposite direction, and, after many steps, the size of the correction should be reduced systematically to zero. The rate of decrease should be slow enough, however, so that the corrections could cover any distance from the starting point to the root. This leads to a rate a_k = O(k^{-1}). Taking a_k = 1/(kb), where b is the (unknown) slope of the regression function at the root, gives the best efficiency [Chung (1954)].
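A minimal univariate sketch (ours) of the recursion (7) with the O(k^{-1}) gain; noisy_m is a hypothetical stand-in that returns one noisy observation y with E[y | t] = m(t).

```python
import numpy as np

def robbins_monro(noisy_m, t0, a=1.0, n_steps=10_000, rng=None):
    """Univariate Robbins-Monro: t_{k+1} = t_k - a_k * y_k with a_k = a / k.

    Ideally a is close to 1/b, the reciprocal of the slope of m at the root (Chung, 1954).
    """
    rng = np.random.default_rng() if rng is None else rng
    t = float(t0)
    for k in range(1, n_steps + 1):
        y = noisy_m(t, rng)
        t -= (a / k) * y          # gain decreases at the O(k^{-1}) rate
    return t
```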

Stochastic approximation methods can be extended to the multivariate problem [Blum (1954)], that is, solving a system of p non-linear equations M(θ) = 0, where M: R^p → R^p. Now both θ and t_k are p-dimensional vectors, Y_k is a p-dimensional random variable, and the correction step of stochastic


approximation is rewritten

t_{k+1} = t_k − a_k A Y_k,    (8)

where A is a p by p matrix. The scalar a_k takes the form k_1/(k_2 + k). Blum took A to be the identity, leading to a set of weakly coupled univariate stochastic approximation problems. Ruppert (1985) chose the matrix A in a manner analogous to a Gauss-Newton approach to minimizing ||M(t_k)||² and used averages of several Y observations in place of Y_k. Other differencing techniques have been proposed to construct A close to the inverse of the Jacobian matrix of M near θ.

The conditions for the success of stochastic approximation (convergence of t_k) are smoothness, with M(t) approximately linear near θ, bounded variance of Y_k, and a_k = O(k^{-1}). If additional conditions are satisfied, then an asymptotic distribution obtains,

√n (t_n − θ) → N(0, A²σ²/(2Ab − 1)).    (9)

Blum’s method requires additional assumptions for the multivariate case; others have weakened these considerably, at the expense of more complicated procedures. The difficulties we see in stochastic approximation are evident in the univariate case, where we will remain for the following discussion.

The implementation of stochastic approximation as a practical problem solver faces the problem of stopping, or, if stopped, evaluating the accuracy of the resultant solution t_n. The asymptotic normal distribution must be converted into a practical confidence interval. Both b and σ² must be estimated, and the condition 2bA > 1 must be verified. While the estimate of the root converges at O(n^{-1/2}), the least squares estimate of b converges much more slowly, at O((log n)^{-1/2}) [Lai and Robbins (1979)]. Confidence intervals constructed using least squares results [Wei (1985)] or differencing schemes [Stroup and Braun (1982)] can be shown to be consistent. The Monte Carlo results in the latter work are not enlightening, since the experiments are not close to the demanding levels of accuracy required by the problems that have motivated this work.
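The following sketch (ours, not the paper's) shows how the asymptotic distribution (9) would be turned into a nominal confidence interval once estimates b̂ and σ̂² are in hand; as the discussion above stresses, the interval is only trustworthy when 2b̂A > 1 actually holds, and that check itself rests on a slowly converging slope estimate.

```python
import numpy as np

def sa_confidence_interval(t_n, n, A, b_hat, sigma2_hat, z=1.96):
    """Nominal 95% interval for the root based on (9): variance ~ A^2 sigma^2 / ((2Ab - 1) n)."""
    if 2.0 * A * b_hat <= 1.0:
        raise ValueError("condition 2bA > 1 not satisfied; (9) does not apply")
    half_width = z * np.sqrt(A ** 2 * sigma2_hat / ((2.0 * A * b_hat - 1.0) * n))
    return t_n - half_width, t_n + half_width
```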

Frees and Ruppert (1986) suggest an alternative approach, using the stochastic approximation sequence (8) to drive the process, and use the root of the regression surface estimated by least squares to estimate θ. They find (Theorem 3.2) that, while the convergence of the regression estimates is slower, the root of the fitted surface converges at the same O(n^{-1/2}) rate. Consistent confidence ellipsoids appear to converge at the slow rate.

Verification of the condition 2bA > 1 is particularly troublesome, since both the asymptotic normality and the rate of convergence of t_n depend upon it. The fear is that in small samples the estimates of the slope b can be highly biased in the positive direction when the starting value is near the root θ.


An overestimated slope can cause a difficult problem to appear easy and can cause the condition 2bA > 1 to appear to be satisfied when it is not. Experiments [Liddle (1988)] with a linear function and normal errors have shown that, when A is too small to satisfy the condition, a test will still (incorrectly) verify it as much as 80% of the time after 400 steps of the process. Starting values near the root are worst; starting further away leads to more accurate least squares slope estimates.

Since the confidence statements in stochastic approximation depend on a condition whose verification is undependable, a simpler method was sought whose performance was more easily understood and whose conditions could be more easily verified.

Improvement in the estimation of the slope is one path toward correcting the problem. In the univariate case, Venter (1967) modified the process by replacing the usual design points by a pair offset to either side and estimated the slope by an average of central first differences. Nevel'son and Khas'minskii (1973) generalized this for the multivariate case. Instead of following this path and making adjustments to stochastic approximation, a new direction was pursued.

4. Stationary stochastic approximation

Another view of the convergence difficulties in stochastic approximation is that the convergence of t_k is a two-edged sword. The convergence of t_k, which is central to the procedure, causes the small variation in the design and the corresponding slow convergence of the regression estimates. The stationary stochastic approximation follows from the desire to keep enough variation in the design to produce more accurate regression estimates. The change is a simple one: keep a_k constant most of the time, which yields a correction formula of the form

t_{k+1} = t_k − aA Y_k,    (10)

where a is a constant scalar. When the integrand function is linear, Y_k follows a regression model,

Y_k = b + B t_k + e_k,    (11)

where e_k has zero mean, so the root is θ = −B^{-1}b. Moreover, the pair (Y_k, t_k) follows a vector-valued first-order autoregressive process in 2p dimensions,

( Y_{k+1} )   ( −aBA   B ) ( Y_k )   ( b + e_{k+1} )
( t_{k+1} ) = (  −aA   I ) ( t_k ) + (      0      ),    (12)


which is a stationary stochastic process with mean vectors 0 for Y_k and θ for t_k, when the matrix above has all its eigenvalues less than one in absolute value. The process {t_k} is stationary if the eigenvalues γ_i of (I − aAB) also are less than one. These are related to the eigenvalues λ_i of the former matrix by

γ = λ² − λ + 1.

For both the joint and subset processes to be stationary, conditions on both γ and λ need to be satisfied. The intersection includes the interval for γ on the real line (3/4, 1) and the region of the complex plane that surrounds it, roughly 1/4 on each side of the axis.

Now both a and A can be chosen to suit; neither affects the accuracy of the proposed estimate of θ (Theorem 2). Taking A = B^{-1} is obvious and leads to eigenvalues γ = (1 − a). The constant a should then be chosen small enough to ensure approximate linearity. Taking a too small leads to undercorrection; at zero, t_k is constant and the Y_k are iid. The estimation of B by regression must be sufficiently accurate to ensure that I − aB̂^{-1}B has its eigenvalues in the intersection region, to give a stationary process.
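A small numerical check of this stationarity requirement might look as follows (our sketch); it uses the regression estimate B̂ in place of the unknown B and screens only against the unit circle, whereas the admissible region described above is stricter.

```python
import numpy as np

def eigenvalues_ok(a, A, B_hat):
    """Necessary check for stationarity of the linearized process (12).

    The eigenvalues of I - a*A*B_hat must lie strictly inside the unit circle
    (the paper's intersection region is a smaller subset of the circle).
    """
    p = B_hat.shape[0]
    gammas = np.linalg.eigvals(np.eye(p) - a * A @ B_hat)
    return np.all(np.abs(gammas) < 1.0)
```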

In practice, of course, the integrand function will not be linear, and the process (12) can never really be stationary. The constant a governs the adequacy of the linear approximation and the validity of the multivariate autoregressive model. However, this condition, that the regression function is locally (for the value of a) linear, can be tested. Multiple y's are taken for the same design point t, and the lack of fit of a linear model is tested using a pure error sum of squares. Rejection dictates reducing a until linearity is accepted. In practice, taking two y's at each design point works well. The use of many design points ensures power, and fortunately linearity is achieved quickly.
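A sketch (ours) of the pure-error lack-of-fit test for one coordinate of Y, assuming r replicate responses were taken at each of m distinct design points; rejection would trigger the reduction of a described above.

```python
import numpy as np
from scipy.stats import f as f_dist

def lack_of_fit_test(T, Y_rep, alpha=0.05):
    """Lack-of-fit F test for the linear model y = b + B't + e, one response coordinate.

    T     -- (m, p) array of distinct design points
    Y_rep -- (m, r) array of r >= 2 replicate responses at each design point
    Returns the F statistic and a boolean indicating rejection of linearity.
    """
    m, r = Y_rep.shape
    p = T.shape[1]
    X = np.column_stack([np.ones(m * r), np.repeat(T, r, axis=0)])
    y = Y_rep.ravel()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)                                 # residual SS of the linear fit
    pure_error = np.sum((Y_rep - Y_rep.mean(axis=1, keepdims=True)) ** 2)
    lack_of_fit = sse - pure_error
    df_lof, df_pe = m - (p + 1), m * (r - 1)
    F = (lack_of_fit / df_lof) / (pure_error / df_pe)
    return F, F > f_dist.ppf(1 - alpha, df_lof, df_pe)
```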

If a linear process is a satisfactory approximation, then the next concern is what to use as the estimate of the root, and when to decide that the root has indeed been reached. This latter concern is never questioned in stochastic approximation; in the stationary context, the condition can be easily expressed as the hypothesis H: E[M(t)] = 0. In stochastic approximation, t_n is the estimate of the root; in the stationary case, t_k does not converge at all. Two other candidates are available: θ̂ = −B̂^{-1}b̂, the root of the regression surface, and t̄_n, the mean of the design points. The former is unwieldy: since the value of the fitted surface at θ̂ is identically zero, testing whether the root has been reached is vacuous. Moreover, establishing confidence intervals for θ̂ requires a complicated application of Fieller's theorem. The strong results for testing for closeness to the root solidify the choice of the third candidate, t̄_n.

Theorem 1. Let Y_k and t_k follow the stationary stochastic process (12), where the error vectors e_k are iid with zero mean and covariance matrix Σ_e. Let b̂ and B̂ be the regression estimates of b and B, respectively. Then


cov(b̂ + B̂ t̄_n) = O(n^{-2}).

Proof. Since the regression surface must go through (t̄_n, Ȳ_n), then b̂ + B̂ t̄_n = Ȳ_n. Now Y_k can be written

Y_k = Φ^k (Y_0 − e_0) + e_k + (Φ − I) Σ_{i=1}^{k-1} Φ^{k-i} e_i,

where Φ = (I − aBA). Then

n Ȳ_n = Σ_{k=1}^{n} Y_k = Φ(I − Φ)^{-1}(I − Φ^n)[Y_0 − e_0] + Σ_{k=1}^{n} Φ^{n-k} e_k.

Ignoring the first term, which is negligible, we find

cov(n Ȳ_n) = Σ_{k=1}^{n} Φ^{n-k} Σ_e (Φ^T)^{n-k}.

Now let ||Φ||² ≤ c < 1, since the process (12) is stationary. Then

||cov(n Ȳ_n)|| ≤ Σ_{k=1}^{n} ||Φ^{n-k}||² ||Σ_e|| ≤ ||Σ_e|| / (1 − c).

The consequence of this theorem is that the test for the estimate of the root θ, t̄_n, is a very powerful one, in both the usual and the jargon sense. Since the variance of its fitted value is an order of magnitude smaller than one would expect from a fitted value in an ordinary regression, at alternatives that are close to the hypothesis the test is very likely to reject. Failing to reject the hypothesis that t̄_n is a root of the regression function is thus strong evidence indeed. The concern whether the model is correct has already been addressed by using replicates and testing lack of fit versus pure error in the usual regression format. The accuracy of t̄_n can be assessed using Theorem 2.

Theorem 2. Under the conditions given above,

cov(t̄_n) = n^{-1} B^{-1} Σ_e B^{-T} + O(n^{-2}).

Proof. From (12) we find, for Ψ = (I − aAB),

(t_{k+1} − θ) = Ψ(t_k − θ) − aA e_{k+1}.


Neglecting t_1 and t_{n+1}, for

t̄_n = n^{-1} Σ_{k=1}^{n} t_k,    ē_n = n^{-1} Σ_{k=1}^{n} e_k,

(t̄_n − θ) = Ψ(t̄_n − θ) − aA ē_n,

(t̄_n − θ) = −(I − Ψ)^{-1} aA ē_n = −B^{-1} ē_n.

Note that neither a nor A appears in the covariance matrix for t̄_n. The implementation of the stationary stochastic approximation method can now be described:

(1) Begin with moderately small a and A = I.
(2) Run the process (10) for a while.
(3) Stop, estimate, and test:
    i) Estimate b̂ and B̂.
    ii) Replace A by B̂^{-1}.
    iii) Test whether a root has been reached (if not, return to step 2); option: move t to the root of the regression surface.
(4) Test for linearity (if not linear, reduce a and return to step 2).
(5) Run the process (10) until sufficient accuracy is achieved.
(6) Compute the covariance matrix of t̄_n.

The option in step 3 above can accelerate a slowly converging process by a Newton-like step. In application, this has not yet been found necessary.
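The loop below compresses steps (1)-(6) into a single sketch (ours, not a reproduction of the authors' code); M_sim is a hypothetical simulator returning one noisy p-vector Y with E[Y | t] = M(t), and the block length, root-test threshold, and omission of the lack-of-fit test are ad hoc simplifications.

```python
import numpy as np

def stationary_sa(M_sim, t0, a=0.1, n_blocks=20, block=200, rng=None):
    """Compressed sketch of stationary stochastic approximation, steps (1)-(6)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(t0, dtype=float)
    p = t.size
    A = np.eye(p)                                        # step (1): A = I, moderately small a
    T_hist, Y_hist = [], []
    for _ in range(n_blocks):
        for _ in range(block):                           # steps (2)/(5): run the process (10)
            y = 0.5 * (M_sim(t, rng) + M_sim(t, rng))    # two y's per design point, averaged
            T_hist.append(t.copy())
            Y_hist.append(y)
            t = t - a * A @ y
        T_arr, Y_arr = np.array(T_hist), np.array(Y_hist)
        X = np.column_stack([np.ones(len(T_arr)), T_arr])
        coef, *_ = np.linalg.lstsq(X, Y_arr, rcond=None) # step (3i): regression estimates
        b_hat, B_hat = coef[0], coef[1:].T
        A = np.linalg.inv(B_hat)                         # step (3ii): replace A by B_hat^{-1}
        t_bar = T_arr.mean(axis=0)
        resid = Y_arr - X @ coef
        Sigma_e = resid.T @ resid / (len(Y_arr) - p - 1)
        # step (3iii): crude root check on the fitted value at t_bar (cf. Theorem 1);
        # step (4), the lack-of-fit test, is omitted here -- see the earlier sketch.
        if np.all(np.abs(b_hat + B_hat @ t_bar) < 1e-3):
            break
    n = len(T_arr)
    B_inv = np.linalg.inv(B_hat)
    cov_t_bar = B_inv @ Sigma_e @ B_inv.T / n            # step (6): Theorem 2
    return t_bar, cov_t_bar
```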

5. Applications

As stated in section 2, the immediate motivating problem was the computation of minimum Hellinger distance estimates for the multivariate normal distribution. Beran (1977) computed estimates of location and scale in a univariate problem with a sample of size 40, but not very accurately. Tamura and Boos (1986) attacked the two-dimensional problem. The numerical difficulties they encountered motivated this work.

Beran's example did not serve well as a testing ground for the methods presented here. Beran attacked the equivalent maximization problem

max_θ ∫ f_θ^{1/2}(x) f̂^{1/2}(x) dx

by the trapezoidal rule and a Newton iteration for the optimization, claiming three-digit accuracy.


Using many more points with the midpoint rule, higher accuracy could be claimed using the usual perturbation results. However, different estimates were obtained for the max and min problems, both different from Beran's. Close examination of the function surfaces, and those of the gradients, showed the existence of local optima and roots that shifted with the integration rule more than could be predicted analytically. We concluded that the problem of computing accurate MHD estimates was fatally ill-conditioned. Estimates accurate to two digits are not a problem; stationary stochastic approximation faced no difficulties there.

Two three-dimensional problems presented a challenge, as something not attempted previously. For a three-dimensional normal distribution, we have nine parameters, three for location and six for the lower triangular matrix L, the Cholesky factor of Σ = LLᵀ. The numerical integration in three dimensions is not tractable by a fixed integration rule; Monte Carlo is the only route. One difficulty here was to keep the kernel density evaluation from bogging down the entire effort. Since the kernel had finite support, by binning the data, only identifiably nearby points are considered in computing the estimate, cutting the cost substantially. The affine invariance property of the MHD estimates of Tamura and Boos permitted the following stabilization trick. The kernel they proposed used the usual covariance estimate S in defining its shape. Factor this matrix S = MMᵀ by Cholesky and center and standardize the data to Z^(i) by

Z^(i) = M^{-1}(X^(i) − X̄).

Then the MHD estimates can be computed from the Z^(i),

μ(X) = X̄ + M μ(Z),    D(X) = M D(Z) Mᵀ,

where μ(·) and D(·) are the MHD location and scale parameters, respectively. Since μ(Z) will be near the zero vector and D(Z) near the identity matrix, this translation keeps the estimation computationally stable.
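A small sketch (ours) of this stabilization trick: standardize with the Cholesky factor of the sample covariance, estimate on the Z scale, and map back; mhd_estimate is a hypothetical stand-in for the stationary-approximation fit itself.

```python
import numpy as np

def mhd_via_standardization(X, mhd_estimate):
    """Affine-invariance trick: fit on Z^(i) = M^{-1}(X^(i) - X_bar), then map back.

    mhd_estimate(Z) -- returns (mu_Z, D_Z), the MHD location and dispersion of Z (hypothetical)
    """
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # usual covariance estimate, as used by the kernel
    M = np.linalg.cholesky(S)              # S = M M^T
    Z = (X - x_bar) @ np.linalg.inv(M).T   # rows Z^(i) = M^{-1} (X^(i) - x_bar)
    mu_Z, D_Z = mhd_estimate(Z)
    # mu_Z is near zero and D_Z near the identity, which keeps the fit stable;
    # map the estimates back to the original scale.
    mu_X = x_bar + M @ mu_Z
    D_X = M @ D_Z @ M.T
    return mu_X, D_X
```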

Johnson and Wichern (1982, table 8.1) give 100 weekly rates of return on five stocks, three of which were chosen to be estimated here. The change in the location estimate was found to be

μ̂(Z) = (−0.0404, −0.0531, −0.0227),

and the Cholesky factor of the change in the dispersion matrix

(only part of L(Z) is legible in the source: two diagonal entries, 1.1551 and 1.1992, and one off-diagonal entry, −0.0410).


Notice the MHD dispersion estimators are slightly higher, reflected in L(Z) diagonals greater than the identity. This computation took 108 seconds on the IBM 3081 at the Triangle Universities Computation Center, giving a reported accuracy of 0.003. This small-sample problem also exhibited some instabilities. Another problem used 800 generated normal deviates with zero mean and identity covariance matrix. The MHD estimate of the Cholesky factor L of D(Z) is

 1.079     0        0
 0.003    1.095     0
−0.017    0.003    1.083

with standard errors of the estimates of 0.002, taking only one minute. Our conclusion is that the more stable problem was easier, even though an evaluation of the kernel density cost eight times as much.

Lastly, a Bayesian decision problem was attacked. The spirit followed an example by Cox (1970, table 5.3) of binomial regression; the data were doctored to produce a significant effect. Observed were y_i ingots out of m_i not ready for rolling after heating time θ_{1i} and soaking time θ_{2i}. The logit of the probability function π(θ, ψ) = Pr(ingot not ready for rolling after heating θ_1 and soaking θ_2) took the form

logit π(θ, ψ) = ψ_0 + ψ_1 θ_1 + ψ_2 θ_2.

The cost function involved unit costs c_1 and c_2 on the heating and soaking times and a penalty cost c_3 on an ingot not ready for rolling. The expected cost (with respect to the posterior) was minimized by the values θ_1 = 1.184 and θ_2 = 1.005, where c_1 = c_2 = 1, c_3 = 100. The integration in three dimensions, over the parameters ψ_0, ψ_1, ψ_2, was improved by sampling from the asymptotic normal approximation to the posterior as the importance function I(ψ). Accuracy to three digits required 48 seconds.
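A hedged sketch (ours) of how the expected cost in this example could be estimated by importance sampling from the normal approximation to the posterior. The cost form c_1 θ_1 + c_2 θ_2 + c_3 π(θ, ψ), the linear logit, and all function names are our assumptions, since the section does not reproduce the exact formulas.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import multivariate_normal

def expected_cost(theta, log_post, psi_mode, psi_cov,
                  c=(1.0, 1.0, 100.0), n_draws=5_000, rng=None):
    """Posterior expected cost of the action theta = (theta1, theta2) by importance sampling.

    log_post(psi)     -- unnormalized log posterior of psi = (psi0, psi1, psi2)  (hypothetical)
    psi_mode, psi_cov -- mean and covariance of the asymptotic normal approximation,
                         used as the importance density I(psi)
    Assumed cost: c1*theta1 + c2*theta2 + c3*pi(theta, psi), with
    logit pi = psi0 + psi1*theta1 + psi2*theta2 (our reading of the example).
    """
    rng = np.random.default_rng() if rng is None else rng
    I_density = multivariate_normal(mean=psi_mode, cov=psi_cov)
    psi = I_density.rvs(size=n_draws, random_state=rng)
    log_w = np.array([log_post(p) for p in psi]) - I_density.logpdf(psi)
    w = np.exp(log_w - log_w.max())
    prob_not_ready = expit(psi[:, 0] + psi[:, 1] * theta[0] + psi[:, 2] * theta[1])
    cost = c[0] * theta[0] + c[1] * theta[1] + c[2] * prob_not_ready
    return np.sum(w * cost) / np.sum(w)
```

Minimizing such an estimate over (θ_1, θ_2), for instance with the stationary method sketched in section 4, would reproduce the decision step described in the text.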

6. Conclusions

The stationary stochastic approximation method has shown its potential as a black box problem solver. The class of problems is general; only the integral must be rewritten as an expected value. The strengths of stationary stochastic approximation lie in the ability to handle high-dimensional problems and in the verification of the assumptions on which it is based.


References

Beran, R., 1977, Minimum Hellinger distance estimates for parametric models, Annals of Statistics 5, 445-463.

Blum, J., 1954, Multidimensional stochastic approximation methods, Annals of Mathematical Statistics 25, 737-744.

Cox, D.R., 1970, The analysis of binary data (Methuen, London).

Chung, K.L., 1954, On a stochastic approximation method, Annals of Mathematical Statistics 25, 463-483.

Davis, P.J. and P. Rabinowitz, 1984, Methods of numerical integration, 2nd ed. (Academic Press, New York).

Dennis, J.E. and R. Schnabel, 1983, Numerical methods for unconstrained optimization and nonlinear equations (Prentice-Hall, Englewood Cliffs, NJ).

Frees, E.W. and D. Ruppert, 1986, Estimation following a Robbins-Monro designed experiment, Paper in review.

Geweke, J., 1986, Bayesian inference in econometric models using Monte Carlo integration, Paper in review.

Johnson, R.A. and D.W. Wichern, 1982, Applied multivariate statistical analysis (Prentice-Hall, Englewood Cliffs, NJ).

Kiefer, J. and J. Wolfowitz, 1952, Stochastic estimation of the maximum of a regression function, Annals of Mathematical Statistics 23, 462-466.

Lai, T.L. and H. Robbins, 1979, Adaptive design and stochastic approximation, Annals of Statistics 7, 1196-1221.

Liddle, R.F., 1988, Stochastic approximation for the optimization of integrals, Ph.D. dissertation (North Carolina State University, Raleigh, NC).

Nevel'son, M.B. and R.Z. Khas'minskii, 1973, An adaptive Robbins-Monro procedure, Automation and Remote Control 34, 1594-1607.

Robbins, H. and S. Monro, 1951, A stochastic approximation method, Annals of Mathematical Statistics 22, 400-407.

Ruppert, D., 1985, A Newton-Raphson version of the multivariate Robbins-Monro procedure, Annals of Statistics 13, 236-245.

Ruppert, D., R.L. Reish, R.B. Deriso and R.J. Carroll, 1984, Optimization using stochastic approximation and Monte Carlo simulation (with application to harvesting of Atlantic menhaden), Biometrics 40, 535-545.

Stroup, D.F. and H.I. Braun, 1982, On a new stopping rule for stochastic approximation, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 60, 535-554.

Tamura, R.N. and D.D. Boos, 1986, Minimum Hellinger distance estimation for multivariate location and covariance, Journal of the American Statistical Association 81, 223-229.

Venter, J., 1967, An extension of the Robbins-Monro procedure, Annals of Mathematical Statistics 38, 181-190.

Wei, C.Z., 1985, Asymptotic properties of least-squares estimates in stochastic regression models, Annals of Statistics 13, 1495-1508.

Zellner, A., 1971, An introduction to Bayesian inference in econometrics (Wiley, New York).