
Convergence Probability Bounds for Stochastic Approximation

LEE D. DAVISSON, SENIOR MEMBER, IEEE

Abstract - In certain stochastic-approximation applications, sufficient conditions for mean-square and probability-one convergence are satisfied within some unknown bounded convex set, referred to as a convergence region. Globally, the conditions are not satisfied. Important examples are found in decision-directed procedures. If a convergence region were known, a reflecting barrier at the boundary would solve the problem. Then the estimate would converge in mean square and with probability one. Since a convergence region may not be known in practice, the possibility of nonconvergence must be accepted. Let A be the event where the estimation sequence never crosses a particular convergence-region boundary. The sequence of estimates conditioned on A converges in mean square and with probability one, because the sequence of estimates is the same as if there were a reflecting barrier at the boundary. Therefore, the unconditional probability of convergence exceeds the probability of the event A. Starting from this principle, a lower bound on the convergence probability is derived in this paper. The results can also be used when the convergence conditions are satisfied globally to bound the maximum-error probability distribution. Specific examples are presented.

Manuscript received November 24, 1969; revised April 17, 1970. This research was supported by the National Science Foundation under Grant GK-14190.

The author is with the University of Southern California, Los Angeles, Calif. 90007.

I. INTRODUCTION

STOCHASTIC approximation is an important recursive method of system optimization, applied when system parameters such as transfer functions and probability distributions are unknown. After each of a sequence of observations, the vector of parameter estimates is updated by making an incremental change that is a function of the current observation and current estimates only. The increment contains a factor that goes to zero with time so that convergence of the estimate vector to the desired value occurs in some sense under appropriate conditions.

In applications of stochastic approximation, sufficient conditions for mean-square and probability-one convergence are known [1]. In many cases, the most difficult of the conditions to satisfy is that the conditional expected change of the estimate vector decrease the error norm uniformly at every point of the estimate vector space for sufficiently large time. One way to avoid the difficulty, as given by Sakrison [1], is to place a reflecting barrier at the boundary of a convex set where the sufficient conditions are known to be satisfied. Then convergence occurs in mean square and with probability one. A convex set where the conditions are satisfied is referred to subsequently as a convergence region. Unfortunately, a convergence region may not be known. Then the possibility of nonconvergence must be accepted. It is of obvious interest to lower bound the probability of convergence in applications under representative conditions. If the probability of convergence is very close to 1, for all practical purposes the possibility of nonconvergence can be ignored.

Let A be the event where the estimate sequence remains within some given convergence region. Convergence, conditioned on A, occurs in mean square and with probability one, because the sequence of estimates is the same as if a reflecting barrier were placed at the boundary. Therefore, the unconditional probability of convergence is bounded below by the probability of the event A. In this paper a lower bound is derived on the convergence probability for an initial estimate inside a convergence region¹ by lower bounding the probability of the event A. The bound is optimized through the choice of convergence region.

¹ In practice the initial estimate is placed inside a convergence region by a training sequence, through minimax values, etc.

Important applications of the results in this paper can be found in the class of decision-directed receivers. A decision-directed receiver uses previous outputs (decisions) to estimate unknown parameters and, on the basis of these estimates, modifies the detector structure for subsequent decisions. This type of receiver provides a means of unsupervised ("untaught") adaptation without the complexity associated with other unsupervised learning techniques such as the mixture [8] or empirical Bayes [10] approach. Inherent in the decision-directed approach, however, is the possibility of nonconvergence. This occurs when the detector commits a sequence of decision errors resulting in a degradation of parameter estimates, which in turn results in a further deterioration of detector performance. Unfortunately, the decision-directed technique is difficult to analyze because of the dependence introduced by the learning process.

The results found here represent a generalization of earlier work [2] on decision-directed receivers. The earlier work is concerned with the estimation of signal a priori probabilities by relative decision frequencies. The probability estimates are used in the Bayes receiver as if they were the actual values. The initial estimates are taken to be the minimax values. It is found that the probability of nonconvergence is infinitesimal at moderate signal-to-noise ratios. At a signal-to-noise power ratio of 4, the probability of nonconvergence is less than $10^{-4000}$ for binary detection of equally likely alternatives. At low signal-to-noise ratios, however, convergence never occurs.

Other decision-directed techniques have been reported. Scudder [3] considered the binary detection problem of unknown signal versus null signal in noise. The estimate of the signal is taken as a sample-mean estimate in which observations, previously classified as containing the signal, are used. A heuristic argument is presented to show that the estimate, though asymptotically biased, will converge. It is also noted in the experimental results that the estimate never failed to converge. In [7] the scheme was extended for estimating two unknown signals. The asymptotic probability of error was obtained, although the possibility of nonconvergence was not discussed.

A decision-directed receiver for synchronous detection was investigated in [9]. An important conclusion of this study was that the decision-directed receiver performed better, at all signal-to-noise ratios, than the corresponding nondecision-directed technique for phase measurement. Nonconvergence, which in this case occurs when the receiver loses the phase reference and never regains it, though acknowledged as a possibility, was not observed in the simulations.

Decision-directed techniques have also been investigated in connection with pattern classification problems with mislabeled samples [5]. In this case, the possibility of nonconvergence does not exist, since there are only a finite number of mislabeled samples, and hence, only a finite number of incorrect decisions can be made.

A particularly important practical application of this work is when the decision-directed receiver is used for adaptive channel equalization [6]. The tap weights of a transversal filter are adjusted adaptively to maximize the "eye" opening by minimizing intersymbol interference under the assumption that past decisions are correct. As long as the tap weights are reasonably close to their optimum values, few decision errors are made. However, if a sequence of decision errors is made, the tap weights are adjusted the wrong way, making it likely that still more decision errors will be made, until finally the eye opening is closed; the detector performs no better than a random decision device, and the tap weights wander aimlessly.

In other applications the sufficient conditions for convergence may be satisfied globally, but one might like to know the probability of large excursions from the final value. The techniques presented in this paper can also be applied there.

II. STOCHASTIC APPROXIMATION

Suppose that it is desired to estimate the $M$-vector $\theta = \{\theta_i;\ i = 1, \ldots, M\}$ by $\alpha_n = \{\alpha_n^i;\ i = 1, \ldots, M\}$ at the $n$th step through the algorithm

$$\alpha_{n+1} = \alpha_n - (n+1)^{-1} D(n)\, h(z_n, \alpha_n), \qquad (1)$$

where $D(n)$ is an $M \times M$ diagonal matrix with bounded, positive main-diagonal components $d_i(n)$, $\lim_{n\to\infty} d_i(n) \neq 0$, and $\{z_n\}$ is a statistically independent, identically distributed sequence of vector random variables, whose dimensionality is not necessarily $M$. A function of the observed vector and the estimate is $h(z_n, \alpha_n)$. The conditional expectation of $h(z_n, \alpha_n)$ given $\alpha_n$ is called the regression function,

$$m(\alpha) = E\{h(z_n, \alpha_n) \mid \alpha_n = \alpha\}. \qquad (2)$$
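To make the recursion concrete, the following is a minimal sketch of (1); it is not taken from the paper. It assumes a scalar instance in which $h(z, \alpha) = \alpha - z$, so the regression function has its root at the mean of the observations, and a constant gain $d$; all function and variable names are illustrative only.

```python
import random

def stochastic_approximation(h, d, alpha0, observations):
    """Run the scalar recursion alpha_{n+1} = alpha_n - (n+1)^{-1} * d * h(z_n, alpha_n)."""
    alpha = alpha0
    for n, z in enumerate(observations):
        alpha = alpha - (n + 1) ** -1 * d * h(z, alpha)
    return alpha

# Hypothetical instance: h(z, alpha) = alpha - z, so the regression function
# m(alpha) = alpha - E[z] has its root theta at the mean of the observations.
random.seed(0)
obs = [random.gauss(2.0, 1.0) for _ in range(10000)]
print(stochastic_approximation(lambda z, a: a - z, d=1.0, alpha0=0.0, observations=obs))
```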


The estimated vector is a root of the regression function, i.e.,

$$m(\theta) = 0. \qquad (3)$$

The following are sufficient conditions for the mean-square and probability-one convergence of $\alpha_n$ to $\theta$ [1].

1) For all values of $\alpha$, there is a constant $K > 0$ such that

$$E\{\|h(z_n, \alpha)\|^2 \mid \alpha\} \le K < \infty. \qquad (4)$$

2) There exist positive finite constants $k_1$, $k_2$ such that for all values of $\alpha$

$$k_1\|\alpha - \theta\|^2 \le [\alpha - \theta]^T[D(n)\, m(\alpha)] \le k_2\|\alpha - \theta\|^2.$$

In some applications these conditions may only be satisfied locally. In particular, it is assumed that the sufficient conditions are satisfied within some finite region $\Omega$, defined by a set of inequalities in the coordinates

$$\Omega = \{\alpha_n \colon \theta_i^- < \alpha_n^i < \theta_i^+,\ i = 1, \ldots, M\}, \qquad (5)$$

that is, $\Omega$ is bounded by planes perpendicular to the coordinate axes in $M$-space. Unknown in any application where this work applies is $\theta$. Outside the range given by (5), convergence may not occur for one reason or another; e.g., there may be other roots of (3) to which the estimate may converge. Thus, if the estimate is left free of restrictions, one can only find a probability of convergence to $\theta$, which will not be 1 in general. In the next section, a convergence-probability bound is found.
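As an aside on the geometry of (5), the sketch below illustrates a box-shaped region $\Omega$ and the "nearest boundary point" restriction that is used for $\tilde\alpha_n$ in (7) below. The clipping rule and the numerical values are assumptions for illustration, not part of the paper.

```python
def in_region(alpha, lower, upper):
    """True when every coordinate of alpha lies strictly inside the box Omega of (5)."""
    return all(lo < a < hi for a, lo, hi in zip(alpha, lower, upper))

def restrict(alpha, lower, upper):
    """Return alpha if it is in Omega, otherwise the nearest point of the closed box.

    For a region bounded by coordinate planes, the nearest boundary point of a point
    outside the box is obtained by clipping each coordinate to [lower_i, upper_i].
    """
    if in_region(alpha, lower, upper):
        return list(alpha)
    return [min(max(a, lo), hi) for a, lo, hi in zip(alpha, lower, upper)]

print(restrict([0.4, 1.7], lower=[0.0, 0.0], upper=[1.0, 1.0]))  # -> [0.4, 1.0]
```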

III. PROBABILITY BOUND

Conditioned on the event that the estimate never leaves $\Omega$, convergence is in mean square and with probability one. Hence, a lower bound on the probability $P_c$ of convergence is given by the probability that the estimate never leaves $\Omega$ for any $n$. That is,

$$P_c \ge \Pr\left[\bigcap_{n=1}^{\infty} (\alpha_n \in \Omega)\right]. \qquad (6)$$

In general, it is not possible to evaluate this probability. From (1) it is seen that $\alpha_n$ executes a multidimensional random walk with nonstationary dependent increments. Further bounding will be done by reducing the $M$-dimensional problem to $2M$ one-dimensional random walks with stationary independent steps, one for each boundary plane of $\Omega$. Finally, a bound is found on the probability for each of these problems, and the resulting $2M$ probabilities are combined through a union bound to get the main result of the paper.
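Although the paper bounds the probability of this event analytically, the event itself is easy to examine by simulation: run (1) repeatedly over a finite horizon and record the fraction of trajectories that never leave $\Omega$. The sketch below does this for a hypothetical scalar model (mean estimation with $h(z, \alpha) = \alpha - z$ and an interval $\Omega$); the finite horizon only approximates the event A, and none of the model choices come from the paper.

```python
import random

def stays_in_region(h, d, alpha0, lower, upper, steps, sample):
    """Run (1) for a finite horizon; report whether the estimate ever left (lower, upper)."""
    alpha = alpha0
    for n in range(steps):
        alpha = alpha - (n + 1) ** -1 * d * h(sample(), alpha)
        if not (lower < alpha < upper):
            return False
    return True

def event_a_frequency(trials, **kwargs):
    """Relative frequency of A = {estimate never crosses the boundary of Omega}."""
    return sum(stays_in_region(**kwargs) for _ in range(trials)) / trials

# Hypothetical scalar model: Gaussian observations with mean 2, h(z, a) = a - z,
# Omega = (0, 4), initial estimate 1.  The resulting frequency is a Monte-Carlo
# stand-in for Pr[A], which lower-bounds the probability of convergence.
random.seed(1)
print(event_a_frequency(trials=500, h=lambda z, a: a - z, d=1.0, alpha0=1.0,
                        lower=0.0, upper=4.0, steps=2000,
                        sample=lambda: random.gauss(2.0, 1.0)))
```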

First, to decouple the components of $\alpha_n$, define a random walk $\alpha_n'$ that satisfies

$$\alpha_{n+1}' = \alpha_n' - (n+1)^{-1} D(n)\, h(z_n, \tilde\alpha_n), \qquad (7)$$

where $\tilde\alpha_n$ is a restricted value of $\alpha_n'$, i.e.,

$$\tilde\alpha_n = \begin{cases} \alpha_n', & \alpha_n' \in \Omega \\ \text{nearest boundary point of } \Omega, & \text{otherwise.} \end{cases}$$

It can be seen that the sequence $\{\alpha_n\}$ never leaves $\Omega$ if and only if $\{\alpha_n'\}$ never leaves $\Omega$. That is,

$$\Pr\left[\bigcap_{n=1}^{\infty}(\alpha_n \in \Omega)\right] = \Pr\left[\bigcap_{n=1}^{\infty}(\alpha_n' \in \Omega)\right]. \qquad (8)$$

The importance of using the sequence $\{\alpha_n'\}$, rather than $\{\alpha_n\}$, will be apparent in the subsequent bounding of the probabilities for the $2M$ component random-walk problems. Now

$$1 - \Pr\left[\bigcap_{n=1}^{\infty}(\alpha_n' \in \Omega)\right] = \Pr\left[\bigcup_{n=1}^{\infty}(\alpha_n' \notin \Omega)\right] \le \sum_{i=1}^{M}\left\{\Pr\left[\inf_n \alpha_n^{i\prime} \le \theta_i^-\right] + \Pr\left[\sup_n \alpha_n^{i\prime} \ge \theta_i^+\right]\right\}. \qquad (9)$$

Each of the $2M$ probabilities in (9) can be regarded as the probability of one-sided absorption for a random walk with absorbing barrier at $\theta_i^-$ or $\theta_i^+$. Each is evaluated by the same technique. Each component equation is of the form

$$\alpha_{n+1}^{i\prime} = \alpha_n^{i\prime} - (n+1)^{-1} d_i(n)\, h_i(z_n, \tilde\alpha_n). \qquad (10)$$

The probability that the supremum exceeds the upper boundary in (9) will be considered in Sections IV and V. The infimum is handled by a simple sign reversal. Let $\beta_n = n(\theta_i^+ - \alpha_n^{i\prime})$ be a scalar random walk. Obviously

$$\Pr\left[\sup_n \alpha_n^{i\prime} \ge \theta_i^+\right] = \Pr\left[\inf_n \beta_n \le 0\right], \qquad (11)$$

where from (10)

$$\beta_{n+1} = \beta_n + d_i(n)\, h_i(z_n, \tilde\alpha_n) - \alpha_n^{i\prime} + \theta_i^+. \qquad (12)$$

The probability of (11) is the probability of absorption of the random walk $\beta_n$ at zero. Unfortunately, the random walk still has dependent nonstationary steps.

IV. FIRST APPROACH

Let $\{\gamma_n\}$ be the random walk that satisfies

$$\gamma_{n+1} = \gamma_n + y_n, \qquad (13)$$

where the sequence of independent, identically distributed increments $\{y_n\}$ has distribution function $F(x)$, where $F(x)$ is the "worst case" conditional distribution function of the random variable

$$d_i(n)\, h_i(z_n, \tilde\alpha_n) - \alpha_n^{i\prime} + \theta_i^+ \qquad (14)$$

as a function of $n$, with $\tilde\alpha_n \in \Omega$ and $\alpha_n^{i\prime} < \theta_i^+$ for every $n$. The random variable of (14) is the step change of $\beta_n$ in (12), and the worst-case conditional distribution is the supremum of the distribution function of (14) conditioned on $n$, $\tilde\alpha_n \in \Omega$, $\alpha_n^{i\prime} < \theta_i^+$. Here the definition of the sequence $\{\alpha_n'\}$ in (7) has its importance, since the supremum must be taken for $\tilde\alpha_n \in \Omega$, rather than for all $\alpha_n$. Clearly $\beta_n$ is stochastically larger than $\gamma_n$, so that

$$\Pr\left[\inf_n \gamma_n \le 0\right] \ge \Pr\left[\inf_n \beta_n \le 0\right]. \qquad (15)$$

The probability of absorption for $\{\gamma_n\}$ can be bounded by a technique of Feller's ([4], pp. 366, 566-569). Let $u(w)$ be the probability of absorption for $\{\gamma_n\}$ starting with the initial value $w$. It can be seen that the probability of absorption starting from $w$ equals the probability of absorption from $w + t$ weighted by the probability of a step size $t$. Hence, the following equation results:

$$u(w) = \int_{-\infty}^{\infty} u(w + t)\, dF(t), \qquad (16)$$

subject to the constraint $u(w) = 1$ for $w \le 0$. Let $\Psi(\lambda)$ be the moment-generating function of $F(x)$,

$$\Psi(\lambda) = \int_{-\infty}^{\infty} \exp(\lambda x)\, dF(x),$$

and suppose²

$$\Psi(\lambda_1) = 1, \qquad \lambda_1 < 0. \qquad (17)$$

This root is unique if it exists, since $\Psi$ is convex. It is seen that $\exp(\lambda_1 w) = u(w)$ satisfies the integral equation of (16), but not the condition that $u(w) = 1$ for $w \le 0$. However, since for $w \le 0$, $\exp(\lambda_1 w) - u(w) = \exp(\lambda_1 w) - 1 \ge 0$, it follows that

$$\Pr\left[\inf_n \gamma_n \le 0\right] \le \exp(\lambda_1 \Gamma), \qquad (18)$$

where $\Gamma > 0$ is the initial value of $\{\gamma_n\}$ starting at some step $m$, or in terms of the initial value $\alpha_m^i$, from the definition of $\{\gamma_n\}$ and $\{\beta_n\}$, $\Gamma = m(\theta_i^+ - \alpha_m^i)$. Using (15) and (11), the following results:

$$\Pr\left[\sup_n \alpha_n^{i\prime} \ge \theta_i^+\right] \le \exp[m\lambda_1(\theta_i^+ - \alpha_m^i)]. \qquad (19)$$

² A simple necessary condition for this is that $E[y_n] = \Psi'(0) > 0$.

Upon combining (19) and (9) in (6), with $\lambda_0$ the corresponding root for the infimum in (9), and adding obvious superscripts, the following theorem results.

Theorem: The probability of convergence is bounded by

$$1 - P_c \le \inf_{\Omega} \sum_{i=1}^{M}\left\{\exp[m\lambda_1^i(\theta_i^+ - \alpha_m^i)] + \exp[m\lambda_0^i(\alpha_m^i - \theta_i^-)]\right\}, \qquad (20)$$

where the values $\{\lambda_1^i, \lambda_0^i\}$ as well as $\{\theta_i^+, \theta_i^-\}$ depend on the choice of $\Omega$.
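The theorem reduces each coordinate to a one-dimensional computation: find the negative root of $\Psi(\lambda) = 1$ for the worst-case step distribution and evaluate $\exp[m\lambda_1(\theta_i^+ - \alpha_m^i)]$. The sketch below shows one way this might be done numerically; the two-point step distribution, the bisection routine, and all numerical values are assumptions for illustration and are not taken from the paper.

```python
import math

def psi(lam, steps):
    """Moment-generating function of a discrete step distribution.

    `steps` is a list of (value, probability) pairs standing in for the worst-case F.
    """
    return sum(p * math.exp(lam * x) for x, p in steps)

def negative_root(steps, tol=1e-12):
    """Bisection for the root lambda_1 < 0 of Psi(lambda) = 1.

    A necessary condition for existence is E[y] = Psi'(0) > 0 (footnote 2), together
    with some probability mass on negative step values.
    """
    lo = -1.0
    while psi(lo, steps) <= 1.0:
        lo *= 2.0
        if lo < -1e6:
            raise ValueError("no negative root found; is E[y] > 0?")
    hi = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if psi(mid, steps) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Hypothetical two-point step distribution with positive mean: +0.2 w.p. 0.9, -0.8 w.p. 0.1.
steps = [(0.2, 0.9), (-0.8, 0.1)]
lam1 = negative_root(steps)
m, theta_plus, alpha_m = 10, 0.9, 0.5
print(lam1, math.exp(m * lam1 * (theta_plus - alpha_m)))  # one term of the bound (20)
```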

The bound can be minimized over $\{\theta_i^{\pm}\}$ to get the best possible bound. In some cases this bound may not be helpful, but it can be sharpened by considering more carefully the characteristics of a given problem. In other situations these results can be quite dramatic. In [2], [3], the bound of (19) goes doubly exponentially to zero as a function of the signal-to-noise ratio. The following example is one of the cases considered in [2].

Example 1: In [2] the a priori signal probabilities are estimated by the relative decision frequencies. In the simplest one-dimensional case the observed sequence $\{x_n\}$ is a binary signal sequence $\pm\mu$ of independent values plus an additive uncorrelated Gaussian noise sequence with variance $\sigma^2$. If $\alpha_n$ is associated with the a priori probability of a $+\mu$, then

$$\alpha_{n+1} = \alpha_n + (n+1)^{-1} h(x_n, \alpha_n), \qquad (21)$$

where $\{x_n\}$ is the sequence of observations and

$$h(x_n, \alpha_n) = \begin{cases} 1 - \alpha_n, & x_n > \sigma^2(2\mu)^{-1}\ln[(1 - \alpha_n)/\alpha_n] \\ -\alpha_n, & \text{otherwise,} \end{cases} \qquad (22)$$

so that in (12) $d_i(n) = -1$ and

$$\beta_{n+1} = \beta_n - h(x_n, \tilde\alpha_n) - \alpha_n' + \theta_1.$$

The worst-case distribution $F(x)$ is attained for $\tilde\alpha_n = \theta_1$ for all $n$. The generating function of $F(x)$ is

$$\Psi(\lambda) = e^{\lambda(\theta_1 - 1)}\Pr[h(x_n, \theta_1) = 1 - \theta_1] + e^{\lambda\theta_1}\Pr[h(x_n, \theta_1) = -\theta_1].$$

The real root of $\Psi(\lambda) = 1$ can then be found, and the bound of (20) minimized over $\theta_1$, $\theta_0$. See [2] for numerical results.

V. SECOND APPROACH

In some cases, for all choices of $\Omega$, the root of (17) may not exist or the bound may be unacceptable. Thus, a different approach must be taken. Define

$$g_i(z_n, \tilde\alpha_n) = h_i(z_n, \tilde\alpha_n) - C_i\alpha_n^{i\prime},$$

so that (12) becomes

$$\beta_{n+1} = \beta_n + d_i(n)[g_i(z_n, \tilde\alpha_n) + C_i\alpha_n^{i\prime}] - \alpha_n^{i\prime} + \theta_i^+$$
$$\phantom{\beta_{n+1}} = \left(1 + \frac{1 - C_i d_i(n)}{n}\right)\beta_n + d_i(n)\, g_i(z_n, \tilde\alpha_n) + C_i d_i(n)\,\theta_i^+. \qquad (23)$$

The constant $C_i$ is one of the parameters available in minimizing the upper bound in the sequel. Let $F(x)$ be the supremum of the conditional distribution functions of the random variable $d_i(n)\, g_i(z_n, \tilde\alpha_n) + C_i d_i(n)\,\theta_i^+$, in an analogous way to Section IV. Let $\{y_n\}$ have the distribution $F(x)$ and define

$$\gamma_{n+1} = \left(1 + \frac{1 - C_i d_i(n)}{n}\right)\gamma_n + y_n = f_n\gamma_n + y_n, \qquad \gamma_m = \beta_m = \Gamma. \qquad (24)$$

Now using the union bound, we find

$$\Pr\left[\inf_n \gamma_n \le 0\right] \le \sum_{n=m+1}^{\infty}\Pr[\gamma_n \le 0]. \qquad (25)$$

A Chernoff bound is used on these summands,

$$\Pr[\gamma_{n+1} \le 0] \le \exp\left(\lambda_n\Gamma\prod_{j=m}^{n} f_j\right)\prod_{j=m}^{n}\Psi\left(\lambda_n\prod_{k=j+1}^{n} f_k\right), \qquad (26)$$

where $\Psi$ is the generating function of $\{y_n\}$ as in Section IV and $\lambda_n \le 0$ is the minimizing value for the upper bound. Combining (26) into (25) and substituting in (9) results in the desired bound.
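The second approach is directly computable once $f_n$, $\Psi$, and a value of $\lambda$ are fixed: evaluate (26) for each $n$ and accumulate the partial sum of (25). The sketch below does exactly that; the particular $f_n$, the two-point generating function, and the unoptimized choice of $\lambda$ are placeholders, not values from the paper.

```python
import math

def chernoff_term(n, m, gamma0, f, psi, lam):
    """Upper bound (26) on Pr[gamma_{n+1} <= 0] for a fixed lambda <= 0."""
    bound = 1.0
    tail = 1.0                    # running product f(j+1)...f(n), built from the right
    for j in range(n, m - 1, -1):
        bound *= psi(lam * tail)
        tail *= f(j)
    return bound * math.exp(lam * gamma0 * tail)   # tail is now f(m)...f(n)

def second_approach_bound(m, gamma0, f, psi, lam, horizon):
    """Partial sum of (25), each summand bounded by (26); lambda is not optimized here."""
    return sum(chernoff_term(n, m, gamma0, f, psi, lam) for n in range(m, m + horizon))

# Placeholder ingredients: f_n = 1 - d/n as in Example 2 (C_i = -1), a two-point
# generating function with positive-mean increments, and one unoptimized lambda.
d = 1.0
f = lambda n: 1.0 - d / n
psi = lambda t: 0.95 * math.exp(0.1 * t) + 0.05 * math.exp(-0.4 * t)
print(second_approach_bound(m=20, gamma0=2.0, f=f, psi=psi, lam=-2.0, horizon=400))
```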

Example 2: Suppose (21) of Example 1 is modified by a gain term such that

$$\alpha_{n+1} = \alpha_n + (n+1)^{-1}(d + 1)\, h(x_n, \alpha_n).$$


Thus, in (12), $d_i(n) = -(d + 1)$, and $h(x_n, \alpha_n)$ is the same as in (22). The bound of Section IV cannot be applied here for large $d$, because the supremum of the conditional distribution function results in a generating function that does not satisfy (17). That is, the supremum is no longer attained for $\tilde\alpha_n = \theta_1$. However, if the equation is written as in (23) with $C_i = -1$,

$$\beta_{n+1} = (1 - d/n)\beta_n - (d + 1)[g(x_n, \tilde\alpha_n) - \theta_1],$$

then the distribution $F(x)$ is as in Example 1, where the supremum is again attained for $\tilde\alpha_n = \theta_1$. Let

$$p = \Pr[g(x_n, \theta_1) = 1], \qquad q = 1 - p.$$

Note that $p$ and $q$ depend on $\theta_1$. Then

$$\Psi(\lambda) = p\exp[\lambda(\theta_1 - 1)] + q\exp(\lambda\theta_1),$$

and

$$r_{nk} = \prod_{j=k}^{n-1}(1 - d/j).$$

Thus, in (26),

$$\Pr(\gamma_{n+1} \le 0) \le \exp(\lambda r_{nm}\Gamma)\prod_{k=m}^{n}\left[p\exp(r_{nk}\lambda(\theta_1 - 1)) + q\exp(r_{nk}\lambda\theta_1)\right], \qquad \lambda \le 0.$$

Now, taking the initial time $m > d$ and using an integral bound,

$$[j/(n-1)]^{-j\ln(1 - d/j)} \le r_{nj} \le [(j+1)/n]^{d}.$$

Using the above and doing some further bounding results in

$$\Pr(\gamma_{n+1} \le 0) \le \exp\left\{\lambda\Gamma(m/n)^{\chi} + n\int_{m/n}^{1}\left[\lambda\theta_1 y^{\chi} + \log\left(q + \exp(-\lambda y^{\chi})\right)\right]dy\right\}, \qquad (27)$$

where

$$\chi = -m\log(1 - d/m). \qquad (28)$$

The first few terms of (25) can be calculated from this. For large $n$, the bound of (27) becomes approximately

$$\exp\left\{n\left[\frac{\lambda\theta_1}{\chi + 1} + \int_{0}^{1}\log\left(q + \exp(-\lambda y^{\chi})\right)dy\right]\right\}.$$

Using the approximation, the remaining terms of (25) are summed as a geometric series. The bound is then optimized over $\lambda$ and $\theta_1$, $\theta_0$.

VI. USE OF TRANSFORMATIONS

In some cases, a better bound results if the coordinates are rotated by an orthogonal transformation on (1) before the region of (5) is defined. Let $w_n = W\alpha_n$, so that

$$w_{n+1} = w_n - (n+1)^{-1} W D(n)\, h(z_n, W^T w_n),$$

where $W^T = W^{-1}$.

Example 3: Suppose that it is desired to obtain the optimum linear predictor of $\{s_n\}$ from the independent sequence of vectors $\{S_n\}$. The prediction is then of the form $\alpha_n^T S_n$, with prediction-error sequence

$$\epsilon_n = s_n - \alpha_n^T S_n.$$

The weight vector satisfies

$$\alpha_{n+1} = \alpha_n + (n+1)^{-1}\epsilon_n S_n,$$

so that in (1) $D(n) = -I$ and $h(z_n, \alpha_n) = \epsilon_n S_n$. In this example the conditions for mean-square and probability-one convergence are satisfied. However, it might still be of interest to bound the probability that the coefficient sequence differs by some maximum amount from the optimum.

Now, if the optimum vector contains large values, the supremum of the conditional distribution functions is a distribution function whose mean value is outside $\Omega$. Hence, (17) has no negative root. On the other hand, if $R = E\{S_n S_n^T\}$ is the covariance matrix and $W$ is the orthogonal transformation such that

$$W R W^T = D$$

is a diagonal matrix, then

$$w_{n+1} = w_n + (n+1)^{-1} W S_n[s_n - S_n^T(W^T w_n)].$$

Then,

$$E\{W S_n[s_n - S_n^T(W^T w_n)]\} = WG - Dw_n,$$

where $G = E\{S_n s_n\}$. Hence, the mean value of the worst-case distribution in each coordinate is not affected by the other components of $w_n$, and (17) can be satisfied.
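As an illustration of the transformation idea (assuming numpy is available), the sketch below builds the orthogonal $W$ from the eigenvectors of a sample covariance $R$ and runs the recursion of Example 3 in the rotated coordinates $w_n = W\alpha_n$. The synthetic data model and all numerical values are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model for Example 3: s_n = w_true . S_n + noise, with S_n having
# independent components of unequal scale so that rotation is meaningful.
w_true = np.array([1.5, -0.5])

def draw():
    S = rng.normal(0.0, [2.0, 1.0])
    s = w_true @ S + rng.normal(0.0, 0.1)
    return S, s

# Covariance R = E[S S^T] and the orthogonal W whose rows are eigenvectors of R,
# so that W R W^T is diagonal, as required in Section VI.
samples = np.array([draw()[0] for _ in range(5000)])
R = samples.T @ samples / len(samples)
_, eigvecs = np.linalg.eigh(R)
W = eigvecs.T

# Recursion of Example 3 run in the rotated coordinates w_n = W alpha_n.
w = np.zeros(2)
for n in range(20000):
    S, s = draw()
    eps = s - (W.T @ w) @ S            # prediction error with alpha_n = W^T w_n
    w = w + (n + 1) ** -1 * W @ S * eps
print(W.T @ w, w_true)                 # estimate rotated back; should be near w_true
```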

VII. CONCLUSIONS

This paper presents methods for lower bounding the probability of convergence of stochastic approximation when the sufficient conditions for mean-square and probability-one convergence are not satisfied globally.

REFERENCES

[1] D. J. Sakrison, "Stochastic approximation: a recursive method for solving regression problems," in Advances in Communication Systems, vol. 2, A. V. Balakrishnan, Ed. New York: Academic Press, 1966, pp. 51-106.

[2] L. D. Davisson and F. C. Schwartz, "Analysis of a decision-directed receiver with unknown priors," IEEE Trans. Information Theory, vol. IT-16, pp. 270-275, May 1970.

[3] H. J. Scudder, III, "Probability of error of some adaptive pattern-recognition machines," IEEE Trans. Information Theory, vol. IT-11, pp. 363-371, July 1965.

[4] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 1, 3rd ed. New York: Wiley, 1968; vol. 2, 1966.

[5] R. L. Kashyap, "Recursive algorithms for classification using pattern misclassified samples," in Proc. IEEE 7th Symp. on Adaptive Processes, paper 3.f., December 1968.

[6] R. W. Lucky, "Techniques for adaptive equalization of digital communications," Bell Sys. Tech. J., vol. 45, pp. 255-286, February 1966.

[7] E. A. Patrick and J. P. Costello, "Asymptotic probability of error using two decision-directed estimators for two unknown mean vectors," IEEE Trans. Information Theory (Correspondence), vol. IT-14, pp. 160-162, January 1968.

[8] E. A. Patrick, "On a class of unsupervised estimation problems," IEEE Trans. Information Theory, vol. IT-14, pp. 407-415, May 1968.

[9] J. G. Proakis, P. R. Drouilhet, and R. Price, "Performance of coherent detection systems using decision-directed channel measurement," IEEE Trans. Communication Technology, vol. COM-12, pp. 54-63, March 1964.

[10] H. Robbins, "The empirical Bayes approach to statistical decision problems," Ann. Math. Stat., vol. 35, pp. 1-20, 1964.
