
Robust estimation via stochastic approximation


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-21, NO. 3, MAY 1975 263

Robust Estimation via Stochastic Approximation

R. DOUGLAS MARTIN, MEMBER, IEEE, AND C. J. MASRELIEZ, MEMBER, IEEE

Abstract—It has been found that robust estimation of parameters may be obtained via recursive Robbins–Monro-type stochastic approximation (SA) algorithms. For the simple problem of estimating location, appropriate choices for the nonlinear transformation and gain constant of the algorithm lead to an asymptotically min-max robust estimator with respect to a family 𝓕(γ_p, p) of symmetrical distributions having the same mass p outside [−γ_p, γ_p], 0 < p < 1. This estimator, referred to as the p-point estimator (PPE), has the additional striking property that the asymptotic variance is constant over the family 𝓕(γ_p, p). The PPE is also efficiency robust in large samples. Monte Carlo results indicate that small-sample robustness may be obtained using both one-stage and two-stage procedures. The good small-sample results are obtained in the one-stage procedure by using an adaptive gain sequence, which is intuitively appealing as well as theoretically justifiable. Some possible extensions of the SA approach are given for the problem of estimating a vector parameter. In addition, some aspects of the relationship between SA-type estimators and Huber's M-estimators are given.

I. BACKGROUND AND INTRODUCTION

THE BASIC problem of estimating a set of parameter values from a set of measurements corrupted by random errors was one of the earliest considered by the pioneers of statistics. One approach, which was used as early as the 18th century and still enjoys great popularity, is the method of least squares (LS). This method of estimation has a strong intuitive appeal and was used in the beginning without any attempt to justify it from a statistical point of view. It is rather interesting to observe the roundabout way in which Gauss introduced the normal density as the one density for which an LS-type estimator (the arithmetic mean) is efficient [4]. The work by Gauss, together with the Gauss–Markov theorem (the best linear unbiased estimator of the expected value is the arithmetic mean) and the central limit theorem (the sum of many small errors is approximately normal), has over the years provided motivation for using the arithmetic mean as an estimator of location.

Manuscript received February 12, 1973; revised October 21, 1974. This work was supported by the National Science Foundation under Grant GK-5338.

R. D. Martin is with the Department of Electrical Engineering, University of Washington, Seattle, Wash. 98195.

C. J. Masreliez is with the Boeing Corporation, Seattle, Wash. 98195.

What then, if anything, is wrong with using the arithmetic mean? Nothing, if the underlying density is known to be Gaussian! However, in practice we will never know the underlying density exactly. In particular, we shall seldom know the detailed tail structure of the density function. Unfortunately, the arithmetic mean, which assigns the same weight to all observations, is sensitive to the shape of the tails, so that a rather mild-looking deviation from the Gaussian shape may lead to a considerable loss in efficiency. This is particularly disturbing since in many cases in practice the true underlying distribution is more heavy-tailed than the Gaussian [5], [12], [13].

This situation was recognized early by statisticians, but not much was done to find a robust alternative to the arithmetic mean until the mid-1940's, when Tukey and the statistical research group at Princeton began to study the problem [12]. They found that there exist several simple alternatives to the arithmetic mean which display higher efficiency for many heavy-tailed distributions without losing much if the true distribution happens to be Gaussian. One example of such an estimator is the α-trimmed mean, which is formed by removing a small fraction α of the extreme observations on either side and computing the mean of the remaining observations.
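As an illustration of the idea (our minimal Python sketch, not from the paper; the trimming fraction and data are arbitrary illustrative values):

```python
import numpy as np

def trimmed_mean(x, alpha):
    """alpha-trimmed mean: drop the fraction alpha of the smallest and of the
    largest observations, then average what remains."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))             # number removed from each side
    return x[k:len(x) - k].mean()

# A single wild observation moves the mean but barely affects the trimmed mean.
sample = np.array([0.2, -0.4, 0.1, 0.3, -0.1, 25.0])
print(np.mean(sample))                  # about 4.18
print(trimmed_mean(sample, 0.2))        # 0.125
```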

Since Tukey's important early work on the robust estimation problem, many contributions involving several different estimator schemes have been made, most of them treating the one-dimensional location estimation problem. One of the most important of these is the work of Huber [5]. He considered a maximum-likelihood-type estimator (M-estimator) and approached the robustness problem by looking for a min-max solution, i.e., an estimator that minimizes the maximum asymptotic variance over some prescribed convex family of densities.

Two other classes of estimates that have received considerable attention recently are L-estimates (based on linear combinations of order statistics) and R-estimates (derived from rank tests). Since these estimators require ordering operations, they may become computationally unattractive for large sample sizes. Also, the ordering requirement will often preclude real-time processing of data, as well as estimation for the continuous-time problems that often arise in communication and control theory.

While M-estimates are solutions of nonlinear equations, their performance appears to be reasonably well approximated by one-step Newton–Raphson versions, which are computationally attractive [1]. On the other hand, they are not suited for obtaining estimates in real time or in the continuous-time situation.

Here we consider a fourth approach to the robust estima- tion problem that, in addition to being of interest in its own right, does not suffer from the aforementioned draw- backs. Furthermore, this estimator has natural extensions to the multiple regression problem (estimation of a vector parameter in a linear model).

The estimator proposed here is of the recursive stochastic approximation (SA) type. A particular robust estimator of this type was presented by Martin in [7]. The estimator studied here is somewhat different, having the striking property that its asymptotic variance is constant over a broad class of noise distributions. We shall first describe a


one-dimensional estimator of this type and discuss some aspects of its relationship to the M-estimates of Huber [5]. Then we discuss one possible modification of the classical multidimensional LS-estimator, which leads to robustness; i.e., we will show how to desensitize the LS-estimator to heavy-tailed deviations from the Gaussian shape.

II. STOCHASTIC APPROXIMATION APPROACH

Consider the following familiar point estimation problem: the location parameter α of a distribution function is to be estimated from independent identically distributed observations with common distribution F(· − α), which for simplicity we will assume to be symmetric with respect to α; i.e., F(x) = 1 − F(−x − 0).

The LS-estimator solution for this one-dimensional problem is simply the arithmetic mean, which may be written in recursive form as follows:

$$\alpha_{k+1} = \alpha_k + \frac{1}{k}(X_k - \alpha_k), \qquad \alpha_1 = X_1.$$

Here α_k is the LS-estimate at stage k, based on the observations X_1, X_2, ···, X_{k−1}. We know, of course, that this estimator is consistent for the mean of the distribution, E[X] = α. We would, therefore, expect α_k to approach the true value α when k becomes large, so that X_k − α_k ≈ X_k − α. It therefore seems reasonable that the sensitivity of the estimator to disturbances in the tail structure of the distribution F(x − α) could be decreased by introducing a weighting function that de-emphasizes the effect of observations falling far from the mean α. This leads us to consider the following recursive estimator:

$$\alpha_{k+1} = \alpha_k + \frac{A}{k}\, g(X_k - \alpha_k) \qquad (1)$$

where A is a positive constant and g(·) is some odd continuous weighting function. The objective is to choose A and g(·) to obtain robustness. The starting point α_1 may be chosen arbitrarily, but we will see that a "good" starting point is preferable since it yields improved small-sample behavior. The algorithm (1) fits the general form of the Robbins–Monro SA algorithm

$$\alpha_{n+1} = \alpha_n - a_n[h(X_n; \alpha_n) - m_0] \qquad (2)$$

with

$$a_n = \frac{A}{n}, \qquad m_0 = 0, \qquad h(X_n; \alpha') = -g(X_n - \alpha'). \qquad (3)$$

The corresponding regression function $m_F(\alpha')$, with $m_F(\alpha') \triangleq E_F\{h(X_1; \alpha')\}$, is given by

$$m_F(\alpha') = E_F\{-g(X_1 - \alpha')\} \qquad (4)$$

where the argument α' is used to distinguish it from the true parameter value α. The SA algorithm estimates α' such that m_F(α') = m_0 = 0.

With g(·) odd and the distribution F(·) even, we get

$$m_F(\alpha') = -\int_{-\infty}^{\infty} g(x - \alpha')\, dF(x - \alpha) \qquad (5)$$

and

$$m_F(\alpha) = -\int_{-\infty}^{\infty} g(x)\, dF(x) = 0. \qquad (6)$$

If, in addition, g(·) is nondecreasing and g'(x) is not zero on all sets of positive mass, then m_F(α') > 0, for α' > α, and m_F(α') < 0, for α' < α, so that α is the unique zero of m_F(α'). This is one of the sufficient conditions for the SA algorithm (1) to converge to α. One complete set of sufficient conditions is given in Theorem 1' of Sacks [10]. This theorem yields both convergence with probability one and asymptotic normality of α_n. Specifically, the theorem yields: i) α_n → α with probability one; ii) if A m_F'(α) > 1/2, then n^{1/2}(α_n − α) is asymptotically normal with mean zero and variance

$$V = \frac{A^2 \sigma_g^2}{2 A m_F'(\alpha) - 1} \qquad (7)$$

where

$$\sigma_g^2 \triangleq E_F\, g^2(X). \qquad (8)$$

III. PRIOR RESULT

It has been shown by Martin [7] that if g(·) = l_K(·), with

$$l_K(t) \triangleq \begin{cases} K, & t > K \\ t, & |t| \le K \\ -K, & t < -K \end{cases} \qquad (9)$$

and if A and K are appropriately chosen, then the corresponding algorithm (1) has the following robustness property: it minimizes, with respect to all unbiased translation-invariant estimators, the maximum asymptotic variance over the class of symmetric ε-contaminated normal densities, f(t) = (1 − ε)φ(t) + εh(t), where φ(t) is the unit normal density and h(t) is an arbitrary symmetrical density. In Section VI, particularly Theorem 5, we show that this estimator is asymptotically equivalent to Huber's M-estimator and hence, by the work of Bickel [17] or Jaeckel [6], is asymptotically equivalent to a trimmed mean.

The emphasis in this paper is on another choice for g(·) that leads to robustness.

IV. p-POINT ESTIMATOR

There exists a certain choice of the g-function and the constant A in (1) that yields a rather striking robustness property: the asymptotic variance as given by (7) will be identical for all members of the family of distributions

$$\mathscr{F}(\gamma_p, p) \triangleq \{F(\cdot)\colon F(-\gamma_p) = p/2,\ \gamma_p > 0,\ 0 < p < 1,\ F(\cdot)\ \text{symmetric and continuous at } \pm\gamma_p\}. \qquad (10)$$

This family of distributions will be called the "p-point family" in the following, since it is defined by the symmetrically located percentage points (quantiles) ±γ_p. Throughout this paper the observation random variables X_n, n = 1,2,···, in (1) are assumed to be independent identically distributed with X_n ~ F(· − α), for some distribution F(·); F(·) is in 𝓕(γ_p, p), for some (γ_p, p), in Theorems 1–4.
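To make the recursion (1) concrete, here is a minimal Python sketch (ours, not the paper's) using the limiter l_K of (9); the values A = 1.5 and K = 1.5 are illustrative, not the optimal choices discussed in Section III.

```python
import numpy as np

def l_K(t, K):
    """The limiter nonlinearity (9): identity on [-K, K], clipped outside."""
    return np.clip(t, -K, K)

def sa_location(x, A, g):
    """The recursive estimator (1): alpha_{k+1} = alpha_k + (A/k) g(X_k - alpha_k),
    started at the first observation."""
    alpha = x[0]
    for k, xk in enumerate(x, start=1):
        alpha += (A / k) * g(xk - alpha)
    return alpha

rng = np.random.default_rng(0)
# Heavy-tailed data: unit normal with 10-percent contamination by N(0, 9).
w = rng.standard_normal(5000)
w[rng.random(5000) < 0.1] *= 3.0
print(sa_location(w + 1.0, A=1.5, g=lambda t: l_K(t, 1.5)))  # true location is 1.0
```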


The following lemma will be needed.

Lemma 1: Let g be a continuous function on R¹ having a derivative g' at all but a finite number of points. Let g' be uniformly bounded. Also let F be a distribution function that is continuous at all points where g' is discontinuous. Then it follows that

$$\frac{d}{d\alpha}\, E_F[g(X - \alpha)] = -E_F[g'(X - \alpha)]$$

where we define g' = 0 at discontinuities of g'.

Proof: Cover the discontinuity points of g' by a system of small intervals of total length ε and call the union of these intervals I. On the complement of I we may differentiate under the integral sign since g' is uniformly bounded there. When ε → 0 the contribution from I goes to zero, since the probability measure on I goes to zero by the assumption of a continuous distribution function.

For small sample sizes the performance of the estimator will be sensitive to the choice of the initial estimate α_1. In this connection the following lemma is of interest.

Lemma 2: If g(·) is odd, F(·) is symmetrical, i.e., F(x) = 1 − F(−x − 0), and α_1 = X_1, then α_n is unbiased, for n = 1,2,···, where {α_n} is the sequence generated by (1).

For a proof see [7]. It turns out that other choices for α_1 yield better small-sample performance than the choice α_1 = X_1 of Lemma 2. While the dependencies introduced by other choices of α_1 make it difficult to extend the lemma to cover these situations, we conjecture that the lemma is valid when α_1 is unbiased and has a symmetrical density.

Now let g(·) = g_p(·) with

$$g_p(t) \triangleq \begin{cases} -(1/(s\gamma_p))\tan[1/(2s)], & t < -\gamma_p \\ (1/(s\gamma_p))\tan[t/(2s\gamma_p)], & |t| \le \gamma_p \\ (1/(s\gamma_p))\tan[1/(2s)], & t > \gamma_p \end{cases} \qquad (11)$$

where s is a positive constant. If F ∈ 𝓕(γ_p, p) and g(·) = g_p(·), then (5) yields

$$m_F'(\alpha) = \frac{d}{d\alpha'}\, m_F(\alpha')\Big|_{\alpha'=\alpha} = \frac{1}{2}\int_{-\gamma_p}^{\gamma_p} (s\gamma_p)^{-2}\left(1 + \tan^2\left[\frac{x}{2s\gamma_p}\right]\right) dF(x) \qquad (12)$$

where we have differentiated under the integral sign, which is permitted by Lemma 1. Also, from (8),

$$\sigma_g^2 = \int_{-\gamma_p}^{\gamma_p} (s\gamma_p)^{-2}\tan^2\left[\frac{x}{2s\gamma_p}\right] dF(x) + p(s\gamma_p)^{-2}\tan^2\left[\frac{1}{2s}\right]. \qquad (13)$$

Substitution of (12) and (13) into the asymptotic variance expression (7) yields

$$V = A\,\frac{\displaystyle\int_{-\gamma_p}^{\gamma_p}\tan^2\left[\frac{x}{2s\gamma_p}\right] dF(x) + p\tan^2\left(\frac{1}{2s}\right)}{\displaystyle\int_{-\gamma_p}^{\gamma_p}\tan^2\left[\frac{x}{2s\gamma_p}\right] dF(x) + (1 - p) - \frac{(s\gamma_p)^2}{A}}. \qquad (14)$$

The expression in the denominator follows from the fact that F(·) has the mass (1 − p) in [−γ_p, γ_p]. If now A is chosen as

$$A = (s\gamma_p)^2\left[1 - p\left(1 + \tan^2\left(\frac{1}{2s}\right)\right)\right]^{-1} \qquad (15)$$

then it follows that

$$V = A. \qquad (16)$$

Thus the asymptotic variance V is the same for all distributions F ∈ 𝓕(γ_p, p), γ_p, p fixed. We have the following.

Theorem 1: Let X_n, n = 1,2,···, be independent identically distributed with common F(· − α), F ∈ 𝓕(γ_p, p), p, γ_p fixed, with 0 < p < 1 and γ_p > 0, and let

$$\alpha_{n+1} = \alpha_n + \frac{A}{n}\, g_p(X_n - \alpha_n), \qquad n = 1,2,\cdots,$$

with α_1 arbitrary and with A given by (15). Then the sequence of estimates {α_n} converges to α with probability one and n^{1/2}(α_n − α) is asymptotically normal with zero mean and variance

$$V = A = (s\gamma_p)^2\left[1 - p\left(1 + \tan^2\left(\frac{1}{2s}\right)\right)\right]^{-1}. \qquad (17)$$

Proof: The proof follows from the preceding and verification that the hypotheses of [10, Theorem 1'] are satisfied.

Theorem 1 thus implies that the asymptotic variances are equal for all symmetrical densities putting the same mass in a given symmetrical interval [−γ_p, γ_p]. The constant s may be selected freely, the only requirement being that g_p be finite and nondecreasing; i.e., s has to be in the interval (π⁻¹, ∞). However, for each p there exists an s = s_m that minimizes the asymptotic variance, and s_m does not depend upon γ_p. The minimizing value s_m satisfies the equation

$$2s_m - p\left[1 + \tan^2\left(\frac{1}{2s_m}\right)\right]\left[2s_m + \tan\left(\frac{1}{2s_m}\right)\right] = 0. \qquad (18)$$

In Table I, s_m is listed together with tan(1/(2s_m)) and the normalized minimum variance

$$V_N \triangleq A(p)/\gamma_p^2 \qquad (19)$$

where

$$A(p) = (s_m\gamma_p)^2\left[1 - p\left(1 + \tan^2\left(\frac{1}{2s_m}\right)\right)\right]^{-1} \qquad (20)$$

is just the minimum asymptotic variance. We shall refer to the algorithm (1) with g(·) = g_p(·)|_{s=s_m} and A = A(p) as the p-point estimator (PPE).

TABLE I
OPTIMAL PPE PARAMETERS

      p     s_m    tan(1/(2s_m))     V_N
    .05    .438       2.178         .270
    .10    .488       1.648         .378
    .15    .530       1.378         .497
    .20    .571       1.199         .636
    .25    .612       1.066         .803
    .30    .654        .960        1.009
    .35    .698        .871        1.266
    .40    .745        .794        1.596
    .45    .796        .726        2.027
    .50    .854        .663        2.603

It is not difficult to show that the performance of the PPE is rather insensitive to errors in specifying γ_p when p is in the vicinity of p = 0.3 and the central shape of the density is Gaussian. Details may be found in [8].
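The following Python sketch (ours, not the paper's) assembles the PPE: it solves (18) for s_m with a standard root finder, forms A(p) from (20), and runs the recursion (1) with g = g_p of (11). The printed values can be checked against the p = .30 row of Table I.

```python
import numpy as np
from scipy.optimize import brentq

def s_m(p):
    """Solve (18) for the variance-minimizing constant s_m (independent of gamma_p)."""
    def f(s):
        t = np.tan(1.0 / (2.0 * s))
        return 2*s - p * (1 + t**2) * (2*s + t)
    return brentq(f, 1.0/np.pi + 1e-6, 10.0)    # s must lie in (1/pi, inf)

def A_p(p, gamma_p):
    """The gain constant and minimum asymptotic variance A(p) of (20)."""
    s = s_m(p)
    return (s * gamma_p)**2 / (1.0 - p * (1.0 + np.tan(1.0/(2.0*s))**2))

def g_p(t, s, gamma_p):
    """The nonlinearity (11): a tangent curve, held constant beyond +/- gamma_p."""
    tc = np.clip(t, -gamma_p, gamma_p)
    return np.tan(tc / (2.0 * s * gamma_p)) / (s * gamma_p)

def ppe(x, p, gamma_p):
    """p-point estimator: recursion (1) with g = g_p|_{s=s_m} and A = A(p)."""
    s, A = s_m(p), A_p(p, gamma_p)
    alpha = x[0]
    for k, xk in enumerate(x, start=1):
        alpha += (A / k) * g_p(xk - alpha, s, gamma_p)
    return alpha

print(s_m(0.30), A_p(0.30, 1.0))   # ~0.654 and ~1.009, cf. Table I
```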


V. p-POINT ESTIMATOR AS MIN-MAX SOLUTION

In Section III it was mentioned that the choice g(·) = l_K(·) given by (9), along with the appropriate A and K, for the algorithm (1) leads to an estimator that is min-max over the family of ε-contaminated Gaussian densities. An analogous property holds for the PPE with respect to the family 𝓕(γ_p, p). Let 𝒯 be the family of all regular translation-invariant estimators. Denote the asymptotic variance by V(T, F), T ∈ 𝒯, F ∈ 𝓕(γ_p, p).

Theorem 2: The asymptotic variance V(T, F) has a saddle point; i.e., there exists an F_0 ∈ 𝓕(γ_p, p) and a T_0 ∈ 𝒯 such that

$$\sup_F V(T_0, F) = V(T_0, F_0) = \inf_T V(T, F_0)$$

where the supremum is over all F ∈ 𝓕(γ_p, p) and the infimum is over all T ∈ 𝒯. T_0 is the SA algorithm defined in Theorem 1, with s and A equal to the optimal values s_m and A(p) given by (18) and (20); i.e., T_0 is the PPE. The least favorable distribution F_0 is explicitly given in terms of its density function by

$$f_0(y) = \begin{cases} K\cos^2(1/(2s_m))\exp[2Kp^{-1}\cos^2(1/(2s_m))(y + \gamma_p)], & y < -\gamma_p \\ K\cos^2(y/(2s_m\gamma_p)), & |y| \le \gamma_p \\ K\cos^2(1/(2s_m))\exp[-2Kp^{-1}\cos^2(1/(2s_m))(y - \gamma_p)], & y > \gamma_p \end{cases} \qquad (21)$$

where K is defined by

$$\int_{-\gamma_p}^{\gamma_p} K\cos^2(y/(2s_m\gamma_p))\, dy = 1 - p. \qquad (22)$$

Proof: It is easy to check that f_0(·) is a symmetric density, that F_0(−γ_p) = p/2, and that F_0(·) is continuous at ±γ_p; thus F_0 ∈ 𝓕(γ_p, p). Furthermore, f_0(y) is continuous at ±γ_p and the same is true for f_0'(y) at ±γ_p. To prove the left-hand inequality, we note that, by Theorem 1, V(T_0, F) = V(T_0, F_0) = A(p). For the right-hand inequality it suffices to show that the PPE variance A(p) attains the Cramér–Rao lower bound σ_CR² for f_0. From (21) we have

$$-f_0'(y)/f_0(y) = \begin{cases} -2Kp^{-1}\cos^2(1/(2s_m)), & y < -\gamma_p \\ (1/(s_m\gamma_p))\tan(y/(2s_m\gamma_p)), & |y| < \gamma_p \\ 2Kp^{-1}\cos^2(1/(2s_m)), & y > \gamma_p. \end{cases}$$

Solving (22) for K yields

$$K = \frac{1 - p}{\gamma_p[1 + s_m\sin(1/s_m)]}. \qquad (23)$$

Using this value for K yields, after some algebra,

$$-f_0'(y)/f_0(y) = \begin{cases} -(1/(s_m\gamma_p))\tan(1/(2s_m)), & y < -\gamma_p \\ (1/(s_m\gamma_p))\tan(y/(2s_m\gamma_p)), & |y| \le \gamma_p \\ (1/(s_m\gamma_p))\tan(1/(2s_m)), & y > \gamma_p \end{cases} = g_p(y).$$

Using the preceding, we find after more tedious algebra (see [8] for details) that the Cramér–Rao lower bound for f_0, defined by

$$\sigma_{CR}^2 = \left[\int_{-\infty}^{\infty}\left(\frac{f_0'(y)}{f_0(y)}\right)^2 f_0(y)\, dy\right]^{-1}$$

is given by

$$\sigma_{CR}^2 = (s_m\gamma_p)^2[1 - p - p\tan^2(1/(2s_m))]^{-1} = A(p).$$

Corollary: The min-max property of the PPE also holds over the enlarged family

$$\mathscr{F}^*(\gamma_p, p) = \{F\colon F(-\gamma_p) \le p/2,\ \gamma_p > 0,\ 0 < p < 1,\ F(\cdot)\ \text{symmetric and continuous at } \pm\gamma_p\}.$$

It is interesting to note that the least favorable density f_0 is of the form K cos²(y/(2s_mγ_p)) in the middle while having exponential tails. The least favorable density of the solution using g(·) = l_K(·) mentioned in Section III is Gaussian in the middle and exponential in the tails. In view of the difference between the ε-contaminated Gaussian family and the family 𝓕(γ_p, p), the similarity between the two least favorable densities is perhaps surprising.

VI. M-ESTIMATES AND STOCHASTIC APPROXIMATION ESTIMATES

Recall that Huber's M-estimators are solutions a_n of equations of the type

$$\sum_{i=1}^{n} g(X_i - a_n) = 0 \qquad (24)$$

where the X_i are independent observations and g(·) is a nondecreasing function of the argument. Huber showed, under suitable conditions, almost sure convergence of M-estimators to a point a such that E_F[g(X − a)] = 0, and further that √n(a_n − a) is asymptotically normal with mean zero and variance

$$V = \frac{E_F[g^2(X - a)]}{\left[\dfrac{d}{da'}\, E_F[g(X - a')]\Big|_{a'=a}\right]^2} \qquad (25)$$

with the expectations under F(· − a). In any reference to an M-estimator it will be assumed that conditions are such that the asymptotic variance expression (25) holds and a_n → a almost surely (see Huber [5]).

Huber showed, as a special case of a general theorem, that for the class of symmetric ε-contaminated Gaussian distributions the saddle point is given by the previously mentioned least favorable distribution (i.e., Gaussian in the middle and doubly exponential in the tails), and the M-estimator is the corresponding maximum-likelihood estimator. The associated g(·) function is of the form g(·) = l_K(·) given by (9). Since the ε-contaminated family has a min-max solution in terms of both M-estimates and SA-estimates, it is not surprising to find that there is a min-max M-estimate for the p-point family 𝓕(γ_p, p).

Theorem 3: Let 𝒯_M be the class of M-estimators. Then there exists a T_0 ∈ 𝒯_M and an F_0 ∈ 𝓕(γ_p, p) such that

$$\sup_F V(T_0, F) = V(T_0, F_0) = \inf_T V(T, F_0), \qquad T \in \mathscr{T}_M,\ F \in \mathscr{F}(\gamma_p, p)$$


with F_0 defined by (21) and (22) of Theorem 2, and T_0 is the M-estimator with g(·) = g_p(·), where s = s_m.

Proof: Without loss of generality we take a = 0. Then by Lemma 1, noting that g_p = −f_0'/f_0,

$$m'(0) = -\frac{d}{da'}\, E_{F_0}[g_p(X - a')]\Big|_{a'=0} = E_{F_0}[g_p'(X)].$$

Since

$$E_{F_0}\{g_p^2(X)\} = \int_{-\infty}^{\infty}\left(\frac{f_0'(x)}{f_0(x)}\right)^2 f_0(x)\, dx$$

and, integrating by parts, $E_{F_0}[g_p'(X)]$ equals this same integral, we have from (25)

$$V(T_0, F_0) = \left[\int_{-\infty}^{\infty}\left(\frac{f_0'(x)}{f_0(x)}\right)^2 f_0(x)\, dx\right]^{-1}$$

which is the Cramér–Rao bound for f_0(·), so the right-hand equality is established. For the left-hand equality, we note that for g(·) = g_p(·) the variance (25) becomes

$$V(T_0, F) = 4(s_m\gamma_p)^2\,\frac{\displaystyle\int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF(y) + p\tan^2(1/(2s_m))}{\left[\displaystyle\int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF(y) + (1 - p)\right]^2}.$$

Set

$$\int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF(y) = x;$$

then

$$V(T_0, F) = 4(s_m\gamma_p)^2\,\frac{x + p\tan^2(1/(2s_m))}{[x + (1 - p)]^2}.$$

This expression has a unique maximum for x > 0, given by

$$\sup_F V(T_0, F) = (s_m\gamma_p)^2[1 - p(1 + \tan^2(1/(2s_m)))]^{-1}$$

which occurs when

$$x = x_0 \triangleq 1 - p - 2p\tan^2(1/(2s_m)).$$

Since for F = F_0 the quantity x becomes

$$x = \int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF_0(y) = 1 - p - 2p\tan^2(1/(2s_m)) = x_0$$

it follows that

$$\sup_F V(T_0, F) = V(T_0, F_0).$$

We refer to the M-estimator of this theorem as the M-PPE. This result has also been found independently by Sacks and Ylvisaker [14].

Comment 1: Huber has pointed out that the min-max property still holds if the class of estimators is widened to include all (regular) translation-invariant estimators. Since both the SA-estimator and the M-estimator belong to this class, we find that the 𝓕(γ_p, p) min-max problem has (at least) two estimator solutions. Note also that the maximum asymptotic variance may be obtained for several different densities, i.e., all densities for which

$$\int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF(y) = x_0 = 1 - p - 2p\tan^2(1/(2s_m)).$$

M-p-Point Estimator Versus p-Point Estimator

Let T_SA be the T_0 of Theorem 2 and let T_M be the T_0 of Theorem 3.

Theorem 4: V(T_M, F) ≤ V(T_SA, F), F ∈ 𝓕(γ_p, p), with equality if and only if

$$\int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF(y) = \int_{-\gamma_p}^{\gamma_p}\tan^2(y/(2s_m\gamma_p))\, dF_0(y).$$

Proof: The proof follows immediately from the proof of Theorem 3 upon noting that V(T_SA, F) = V(T_SA, F_0) = V(T_M, F_0) and that x_0 is the point of a strict maximum for all allowable p. The latter is true since (18) implies that p tan²(1/(2s_m)) ≠ 1 − p, for s_m ∈ (π⁻¹, ∞).

In spite of the fact that the inequality of Theorem 4 will be strict for many F ∈ 𝓕(γ_p, p), direct computation shows that the differences are negligible from a practical point of view [8].
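For comparison, the M-PPE (or any M-estimator) must be computed by solving the nonlinear equation (24). A minimal Newton iteration in Python (our sketch; the Huber-type limiter, tolerance, and median starting point are our own illustrative choices) might look as follows:

```python
import numpy as np

def m_estimate(x, g, g_prime, tol=1e-10, max_iter=100):
    """Solve sum_i g(X_i - a) = 0, cf. (24), by Newton's method from the median."""
    a = np.median(x)
    for _ in range(max_iter):
        num = np.sum(g(x - a))
        den = np.sum(g_prime(x - a))
        if den == 0.0:
            break
        step = num / den        # Newton step for h(a) = sum g(X_i - a)
        a += step
        if abs(step) < tol:
            break
    return a

K = 1.5
x = np.random.default_rng(1).standard_cauchy(200)
print(m_estimate(x,
                 g=lambda t: np.clip(t, -K, K),
                 g_prime=lambda t: (np.abs(t) <= K).astype(float)))
```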

Stochastic Approximation Estimators That Are Asymptotically Equivalent to M-Estimators

The fact that there is a strict inequality in the statement of Theorem 4 for "most" F ∈ 𝓕(γ_p, p) leads naturally to the question, given a particular F, of whether or not there is an SA-estimator that is as good as the M-PPE. The answer is in the affirmative: to every M-estimator there corresponds, for each F, an SA-estimator that is equivalent in the sense that the estimates converge to the same point almost surely and have asymptotically normal distributions with identical means and variances. This result is stated in the following theorem.

Theorem 5: Let F(·) be fixed and let g(·) be an M-estimator function such that the sequence of M-estimates {a_n'} satisfies

$$\sum_{i=1}^{n} g(X_i - a_n') = 0, \qquad n = 1,2,3,\cdots.$$

Let a_n'' be the SA-estimates generated by

$$a_{n+1}'' = a_n'' + \frac{A}{n}\, g(X_n - a_n''), \qquad A = \left[-\frac{\partial}{\partial a'}\, E_F[g(X - a')]\Big|_{a'=a}\right]^{-1}$$


where a is the unique root of E_F[g(X − a)] = 0. Then if g(·) is such that the SA algorithm converges, we have: 1) a_n' → a and a_n'' → a almost surely; 2) n^{1/2}(a_n' − a) and n^{1/2}(a_n'' − a) are asymptotically normal with mean zero and variance

$$V_{SA} = V_M = E[g^2(X - a)]\Big/\left[\frac{d}{d\beta}\, E[g(X - \beta)]\Big|_{\beta=a}\right]^2. \qquad (26)$$

Proof: The first statement follows directly from the results of Huber and Sacks. The second statement may easily be proved for the SA-estimator by letting A = −[∂/∂a' E[g(X − a')]|_{a'=a}]⁻¹ = (m_F'(a))⁻¹ in the asymptotic variance expression (7). This choice of A yields the minimum asymptotic variance for the SA-estimator and assures that A m_F'(a) > 1/2, a condition that has to be fulfilled for asymptotic normality. Furthermore, the resulting SA-estimator variance expression is identical to the M-estimator variance (25).

As an application of the preceding theorem, choose g(·) = l_K(·) as defined by (9). Then for any given F the SA-estimator using the gain sequence of this theorem is asymptotically equivalent to Huber's M-estimator using g(·) = l_K(·). Furthermore, as noted in Section III, the choice g(·) = l_K(·) leads to a min-max solution equivalent to that of Huber. This fact, when coupled with Theorem 5, places the l_K(·)-determined SA-estimator in a stronger correspondence with the l_K(·)-determined M-estimator than is implied by Theorem 5 alone.

The PPE and the M-PPE provide another example of the correspondence of Theorem 5, with F = F_0. Again there is the stronger correspondence provided by Theorems 2 and 3. This correspondence is not as strong as one might like in view of Theorem 4; i.e., the PPE and M-PPE have different variances over 𝓕(γ_p, p). This is also the case for the l_K(·)-based SA- and M-estimators discussed previously.

Since the PPE has a constant variance over 𝓕(γ_p, p), the question arises as to whether or not there is an M-estimator that has this property. The answer is no! The M-PPE is asymptotically efficient for f_0, and since it is the maximum-likelihood estimator for f_0, the M-PPE is unique. Since the M-PPE and the PPE coincide for f_0 and have different variances for many F ∈ 𝓕(γ_p, p), there is no M-estimator that is equivalent to the PPE in the sense of constant asymptotic variance over 𝓕(γ_p, p).

Let us consider the implication of Theorem 5 with g(·) = g_p(·)|_{s=s_m}; i.e., the particular M-estimator is the M-PPE. Specification of g_p(·) implies knowledge of γ_p and p; in practice we presume that some estimate of these quantities is at our disposal. Should we wish to implement the M-PPE-equivalent SA-estimator, we would require, in addition to p and γ_p, knowledge of m_F'(a) = E_F g_p'(X − a). The latter depends upon the shape of the distribution on [−γ_p, γ_p], which is considerably more detailed knowledge than is required to implement the M-PPE. One way around this difficulty is to estimate E g_p'(X − a) "while we go" and use this estimate in the SA-estimator. This approach will yield a significant beneficial effect upon the small-sample performance using α_1 = X_1 as the initial estimate (relative to the PPE).

Instead of the usual gain sequence a_n = A/n, we propose a gain sequence {a_n} of the form

$$a_n = \left[\sum_{i=1}^{n} g'(X_i - \alpha_i)\right]^{-1} \qquad (27)$$

with g(·) as in Theorem 5. This gain sequence satisfies the computationally convenient recursion relation

$$a_n^{-1} = a_{n-1}^{-1} + g'(X_n - \alpha_n). \qquad (28)$$

In the following theorem we will assume that the initial value α_1 is such that g'(X_1 − α_1) > 0, so that a_n < ∞ for all n; e.g., α_1 = X_1 will do. When using "good" initial estimates such as the sample mean for α_1, this assumption may not be satisfied for g(·) of the form g_p(·) or l_K(·). It is easy, however, in this case to redefine the {a_n} sequence in a way that does not affect the results to follow. For example, with g(·) = g_p(·), define a_n = 0, n = 1,2,···,m − 1, and a_n⁻¹ = ∑_{i=m}^{n} g_p'(X_i − α_i), n ≥ m, where m is the smallest n such that g_p'(X_n − α_n) ≠ 0. Another possibility is to let a_n⁻¹ = max{1, ∑_{i=1}^{n} g_p'(X_i − α_i)}. These two choices turn out to be asymptotically equivalent. Which is better for small-sample performance is an open question.

Consider the SA-estimator¹

$$\alpha_{n+1} = \alpha_n + a_n\, g(X_n - \alpha_n), \qquad \alpha_1 = \alpha_1(X_1,\cdots,X_l), \qquad (29)$$

with {a_n} defined as before and with g(·) and F(·) such that the algorithm converges when a_n = A/n. We wish to show that with this version of the SA-estimate the result of Theorem 5 will hold uniformly in F(·).
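A minimal Python sketch of (29) with the adaptive gain computed via (28), for g = g_p (our rendering; the constant s is taken from Table I, s_m = .654 for p = .30, and the gain is seeded with c = g_p'(0), one of the startup choices mentioned above):

```python
import numpy as np

def appe(x, gamma_p, s=0.654):
    """Adaptive gain SA-estimator (29): the inverse gain accumulates g_p'
    as in (28), seeded with c = g_p'(0) so the first gains are finite;
    alpha_1 = X_1."""
    def gp(t):
        tc = np.clip(t, -gamma_p, gamma_p)
        return np.tan(tc / (2*s*gamma_p)) / (s*gamma_p)
    def gp_prime(t):
        tc = np.clip(t, -gamma_p, gamma_p)
        inside = float(abs(t) <= gamma_p)       # g_p' vanishes beyond +/- gamma_p
        return inside * (1 + np.tan(tc/(2*s*gamma_p))**2) / (2*(s*gamma_p)**2)
    alpha = x[0]
    inv_gain = gp_prime(0.0)                    # the seed constant c
    for xn in x:
        inv_gain += gp_prime(xn - alpha)        # recursion (28)
        alpha += gp(xn - alpha) / inv_gain      # recursion (29)
    return alpha
```

Note that an observation falling outside [−γ_p, γ_p] leaves the gain unchanged, which is the small-sample advantage discussed below.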

In order to do this, we must show first that (29) converges and then that the right asymptotic variance is obtained. This result can be read off from Theorem 3.2 (with a slight modification) of Fabian [16]. He does not actually provide a proof of the almost sure convergence of α_n to a, but refers to several other sources. Hence we provide a proof appropriate to the particular setup.

Theorem 6: Define an interval I by

$$I = \{\alpha\colon 0 < \gamma \le m'(\alpha) \le B\}.$$

Let {α_n} be generated by (29), using the gain sequence {a_n} specified by (27), and with the modification that each α_n is truncated so that α_n ∈ I, n = 1,2,···. If the variance of g'(X − α) is uniformly bounded in α, then α_n → a almost surely.

Proof: The random variables g'(X_i − α_i) − m'(α_i) and g'(X_j − α_j) − m'(α_j), j ≠ i, are orthogonal. Then since var g'(X − α) is uniformly bounded, the strong law of large numbers for orthogonal random variables (Doob, pp. 158–159) yields

$$(na_n)^{-1} - \frac{1}{n}\sum_{i=1}^{n} m'(\alpha_i) \to 0 \qquad \text{almost surely.}$$

¹ We choose to call (29) an SA-estimator even though the gain sequence is random. The term "randomized SA," or SA with "randomized gain," might be appropriate. Certain SA algorithms of this type have also been called "second-order" algorithms; see, for example, Kashyap and Blaydon, IEEE Trans. Inform. Theory, vol. IT-14, pp. 549–556, 1968.


Furthermore, for α_i ∈ I, i = 1,2,3,···,

$$\gamma \le \inf_i m'(\alpha_i) \le \frac{1}{n}\sum_{i=1}^{n} m'(\alpha_i) \le \sup_i m'(\alpha_i) \le B$$

so

$$\gamma \le \liminf_n\,(na_n)^{-1} \le \limsup_n\,(na_n)^{-1} \le B \qquad \text{almost surely.}$$

Let Ω_N be the set of all sequences {α_n} such that

$$\gamma/2 < (na_n)^{-1} < B + \gamma/2, \qquad \forall n > N.$$

Then for any δ > 0 there exists an N(δ) such that

$$P(\Omega_{N(\delta)}) > 1 - \delta/2.$$

Now if {α_n} ∈ Ω_{N(δ)}, it follows that

$$\sum_{n=N(\delta)}^{\infty} a_n = \infty \qquad \text{and} \qquad \sum_{n=N(\delta)}^{\infty} a_n^2 < \infty$$

so that the conditions on the gain sequence for the SA algorithm (29) to converge almost surely to a, conditioned on Ω_{N(δ)}, are fulfilled. Thus for every ε > 0 there exists an N(ε) = N(ε,δ) such that

$$P[\,|\alpha_n - a| < \varepsilon,\ \forall n \ge N(\varepsilon) \mid \Omega_{N(\delta)}\,] > 1 - \delta/2$$

and so

$$P[\,|\alpha_n - a| < \varepsilon,\ \forall n \ge N(\varepsilon)\,] \ge P[\,|\alpha_n - a| < \varepsilon,\ \forall n \ge N(\varepsilon) \mid \Omega_{N(\delta)}\,]\, P(\Omega_{N(\delta)}) > (1 - \delta/2)^2 > 1 - \delta.$$

Since there is an N(ε), for every ε, δ > 0, such that the preceding inequality holds, we have α_n → a almost surely.

Corollary: If in addition m'(·) is continuous in a neighborhood of a, then

$$(na_n)^{-1} \to m'(a) = E\, g'(X - a).$$

Proof: Since α_n → a almost surely, m'(α_n) → m'(a) almost surely. As noted in the proof of the preceding theorem,

$$(na_n)^{-1} - \frac{1}{n}\sum_{i=1}^{n} m'(\alpha_i) \to 0 \qquad \text{almost surely.}$$

Then since

$$\lim_n \frac{1}{n}\sum_{i=1}^{n} m'(\alpha_i) = m'(a) \qquad \text{almost surely}$$

we get

$$(na_n)^{-1} \to m'(a) = E\, g'(X - a) \qquad \text{almost surely.}$$

It may be noted that for g(·) = g_p(·) the truncation interval I may be taken arbitrarily large for distributions having the whole line for support, and hence the truncation assumption is not a practical limitation.

Now Fabian's result gives that √n(α_n − a) is asymptotically normal with mean zero and variance given by the expression (26) for V_SA in Theorem 5. Hence we have the following.

Theorem 7: Theorem 5, with the gain sequence a_n = A/n replaced by the {a_n} of (27), and with the additional truncation assumption of Theorem 6, holds uniformly in F.

In words, for any M-estimator we have an SA-estimator that corresponds in the strong sense that asymptotic variances are the same over all suitably regular F(·) and g(·). The comments following Theorem 3 show that the converse is not true. Thus the class of SA-estimators is larger than the class of M-estimators. From the recent work of Jaeckel [6], this statement applies with "M-estimators" replaced by "L-estimators," i.e., linear systematic statistics.

Now the interesting implication of the gain sequence (27), when used in (29) with g(·) = g_p(·), is that the gain is unchanged whenever the absolute value of the argument of g_p(·) exceeds γ_p, since g_p'(·) = 0 there. This indicates that the small-sample sensitivity to large values in the initial observations is less than that for gain sequences of the form a_n = A/n used in the PPE. Monte Carlo simulations seem to support this conjecture; see Table II. The performance using the first sample as the starting point and gain sequence a_n⁻¹ = a_{n-1}⁻¹ + g_p'(X_n − α_n) is markedly better than for the PPE, which uses a_n = A/n.

TABLE II
APPE MONTE CARLO RESULTS

Gain: a_n⁻¹ = ∑_{i=1}^{n} g_p'(X_i − α_i) + c, c = g_p'(0). Monte Carlo size = 500. Starting point: first observation.

    Distribution         Sample Size    Normalized Variance
    Uniform                   25               .87
                              50               .88
    Gaussian Mixture          25               .99
                              50              1.09
    Double Exp.               25               .98
                              50               .99
    Cauchy                    25                *
                              50                *

    * The normalized variance exceeded 50.

In the sequel we shall call (29), with g(·) = g_p(·), the adaptive p-point estimator (APPE).
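A small Monte Carlo harness in the spirit of Table II (our sketch, reusing `appe` from the earlier code; the table's "normalized variance" is relative to the asymptotic variance, so the raw scaled variances below are only comparable up to that normalization):

```python
import numpy as np

def mc_scaled_variance(estimator, sampler, n, reps=500, seed=0):
    """Estimate n * var(estimate) over `reps` replications (Table II uses 500)."""
    rng = np.random.default_rng(seed)
    est = np.array([estimator(sampler(rng, n)) for _ in range(reps)])
    return n * est.var()

# Two of the Table II noise distributions, at unit scale and zero location.
double_exp = lambda rng, n: rng.laplace(0.0, 1.0, n)
cauchy     = lambda rng, n: rng.standard_cauchy(n)
# e.g., mc_scaled_variance(lambda x: appe(x, gamma_p=1.0), double_exp, 25)
```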

Comment: Perhaps a significant difference between SA-type estimates and other previously studied estimates, i.e., M-estimates, L-estimates, and R-estimates, is that the latter are invariant under reordering of the data whereas the former are not. A more detailed Monte Carlo study utilizing the techniques of [1] is currently under way. Preliminary results indicate that the loss relative to one-step M-estimators due to lack of invariance is rather small. The results of this study, when complete, will be presented elsewhere.


VII. MULTIDIMENSIONAL ROBUST STOCHASTIC APPROXIMATION ESTIMATES

Let us briefly recall the essence of LS point estimation. Suppose that the constant n-dimensional vector a is to be estimated from q-dimensional observations X_k (q ≥ n), with k = 1,2,3,···,N:

$$X_k = Ha + W_k \qquad (30)$$

where the W_k are q-dimensional zero-mean noise disturbance vectors, which are independent and identically distributed with distribution F(·), and H is some constant q × n matrix of rank r = n. The LS approach to this problem consists of estimating a by a_{N+1}, where a' = a_{N+1} minimizes

$$\sum_{k=1}^{N} (X_k - Ha')^T R^{-1} (X_k - Ha') \qquad (31)$$

where R⁻¹ is some positive definite matrix. The solution is given by

$$a_{N+1} = \left(\sum_{k=1}^{N} H^T R^{-1} H\right)^{-1} \sum_{k=1}^{N} H^T R^{-1} X_k. \qquad (32)$$

Let the estimate at stage k be a_k. Then we have the following recursive form of (32):

$$a_{k+1} = a_k + k^{-1} M (X_k - H a_k) \qquad (33)$$

where

$$M \triangleq (H^T R^{-1} H)^{-1} H^T R^{-1}. \qquad (34)$$

Now two possibilities for making the recursive algorithm (33) robust are as follows.

L-N Approach

In this version we first perform linear transformations on the residuals X_k − Ha_k and then operate on the result with a nonlinear robustizing transformation. If we define

$$Y_k \triangleq (H^T R^{-1} H)^{-1} H^T R^{-1} X_k = M X_k \qquad (35)$$

then (33) has the form

$$a_{k+1} = a_k + k^{-1}(Y_k - a_k) \qquad (36)$$

which is the recursive form of $a_{k+1} = k^{-1}\sum_{i=1}^{k} Y_i$, i.e., the arithmetic mean of the transformed variables Y_k. Now let the components a_{ki} of a_k be recursively computed by (compare (1))

$$a_{k+1,i} = a_{ki} + \frac{A_i}{k}\, g_i(Y_{ki} - a_{ki}) \qquad (37)$$

where Y_{ki} denotes the ith component of Y_k, i = 1,2,···,n.² This implies that we have n one-dimensional SA-estimators working in parallel.

The assumption was made that the noise disturbances W_k are independent random vectors. Then, since the ith component of Y_k is independent of the ith component of Y_l, k ≠ l, the distributions of Y_{ki} − a_i = (MW_k)_i, k = 1,2,3,···, are stationary, and the one-dimensional SA results may be applied coordinate-wise. Thus (if Sacks' conditions are satisfied)

$$k^{1/2}(a_{ki} - a_i) \sim N(0, \sigma_i^2) \qquad (38)$$

with

$$\sigma_i^2 = \frac{A_i^2\, E\{g_i^2((Y - a)_i)\}}{2 A_i m_i'(a) - 1} \qquad (39)$$

where

$$m_i'(a) = -\frac{\partial}{\partial \beta_i}\, E[g_i((Y - \beta)_i)]\Big|_{\beta = a}.$$

In words, this result tells us that the marginal distributions of a_k, i.e., the distributions of the components of a_k, tend to normal distributions with certain variances σ_i².

If we now let g_i(·) = g_{p_i}(·), where g_{p_i}(·) = g_p(·) as given by (11) with γ_p = γ_{p_i} = the p_i-point of the ith component of Z = M·W, and if we further choose A_i = A(p_i) as in (20), then (37) becomes

$$a_{k+1,i} = a_{ki} + \frac{A(p_i)}{k}\, g_{p_i}((Y_k - a_k)_i), \qquad i = 1,2,\cdots,n,\ k = 1,2,\cdots. \qquad (40)$$

From Theorem 1 we now immediately have the following.

Theorem 8: Assume the estimation algorithm (40). Then the asymptotic (auto)variance for the ith component of a_k, i.e., the asymptotic variance for $\sqrt{k}\,(a_k - a)_i$, is given by

$$\sigma_i^2 = A(p_i) = (s_m \gamma_{p_i})^2 [1 - p_i(1 + \tan^2(1/(2s_m)))]^{-1}, \qquad i = 1,2,\cdots,n. \qquad (41)$$

Corollary: If in addition p_i = p, i = 1,2,···,n, 0 < p < 1, then the trace of the asymptotic covariance matrix C_a for a_k becomes

$$\operatorname{tr} C_a = \sum_{i=1}^{n} \sigma_i^2 = s_m^2[1 - p(1 + \tan^2(1/(2s_m)))]^{-1} \sum_{i=1}^{n} \gamma_{p_i}^2 = \text{constant} \cdot \sum_{i=1}^{n} \gamma_{p_i}^2. \qquad (42)$$

Thus the estimation procedure results in asymptotically robust autovariances that depend only on the p-percentage points γ_{p_i} of the corresponding marginal distributions of the transformed noise vector Z = M·W. When, in addition, the components of Z = Y − a = M·W have distributions that are identical except for scale factors, the specification of the best choice for R is given by the following.

Theorem 9: Let the marginal distributions for the components of the random vector Z = Y − a = M·W be functions of the argument t_i = z_i/s_i, where s_i² = E{(Y_i − a_i)²}, so that F_i(z_i) = F(t_i) = F(z_i/s_i), for some distribution F(·). Choose p_i = p, i = 1,2,···,n. Then the choice R = constant·C, where C is the covariance matrix for W, minimizes the autovariances σ_i².

The proof may be found in [8]. The report [8] also contains some small-sample Monte Carlo results that indicate the efficiency robustness of the SA-estimate (40) relative to LS for estimating the amplitude and phase of a signal in heavy-tailed non-Gaussian noise.

² For simplicity we will denote the ith component of the true parameter vector a by a_i.
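A sketch of the L-N approach in Python (ours; it assumes, for simplicity, a common gain A_i = A and a single scalar nonlinearity g applied coordinate-wise, e.g., g_p from the earlier sketch). With g the identity and A = 1 it reduces to the recursive LS form (36).

```python
import numpy as np

def ln_estimator(X, H, R, g, A):
    """L-N approach: linear transformation M of (34)-(35) first, then the
    scalar robustizing nonlinearity g applied to each coordinate, as in (37).
    X is an iterable of q-vectors; H is q x n; R is q x q positive definite."""
    Rinv = np.linalg.inv(R)
    M = np.linalg.inv(H.T @ Rinv @ H) @ H.T @ Rinv   # (34)
    a = np.zeros(H.shape[1])                          # starting estimate (our choice)
    for k, xk in enumerate(X, start=1):
        y = M @ xk                                    # Y_k of (35)
        a = a + (A / k) * g(y - a)                    # (37), coordinate-wise
    return a
```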

N-L Approach

While the L-N approach described previously is expected to yield efficiency robustness when the number of parameters to be estimated is small (as indicated by the Monte Carlo results cited previously), this may not be true when the dimension of the parameter vector is large. The reason is that the convolution implied by (35) will result in residuals that tend to be more Gaussian than W, with the variance inflated by the heavy tails of the density for W. In this situation the application of limiting-type nonlinearities no longer protects as well against the influence of outliers.

Thus we may consider an algorithm in which the nonlinearity is applied directly to the residuals X_k − H_k a_k, and the linear transformation is applied subsequently. The general form of this version is

$$a_{k+1} = a_k + \left(\sum_{i=1}^{k} H_i^T R_i^{-1} H_i\right)^{-1} H_k^T R_k^{-1} A_k\, g_k(X_k - H_k a_k) \qquad (43)$$

where A_k and R_k are positive definite symmetric matrices and g_k(·) is a vector operator mapping R^q → R^q. Typically, g_k(·) will be a coordinate-wise transformation. The asymptotic and small-sample efficiency robustness of such SA regression estimates has been recently investigated; the results will be presented elsewhere.
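A corresponding minimal rendering of (43) (again our own sketch, with time-varying H_k, R_k, A_k supplied per observation and a zero starting estimate):

```python
import numpy as np

def nl_estimator(steps, g):
    """N-L approach (43): apply g directly to the residual X_k - H_k a_k,
    then the linear transformation. `steps` yields tuples (X_k, H_k, R_k, A_k),
    with R_k and A_k positive definite."""
    steps = list(steps)
    n = steps[0][1].shape[1]
    a = np.zeros(n)
    S = np.zeros((n, n))                     # running sum of H_i^T R_i^{-1} H_i
    for Xk, Hk, Rk, Ak in steps:
        Rinv = np.linalg.inv(Rk)
        S = S + Hk.T @ Rinv @ Hk
        a = a + np.linalg.solve(S, Hk.T @ Rinv @ (Ak @ g(Xk - Hk @ a)))
    return a
```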


REFERENCES

[1] D. F. Andrews et al., Robust Estimates of Location. Princeton, N.J.: Princeton Univ. Press, 1972.
[2] E. L. Crow and M. M. Siddiqui, "Robust estimation of location," J. Amer. Statist. Assoc., p. 353, June 1967.
[3] J. L. Gastwirth and M. L. Cohen, "Small sample behavior of some robust linear estimators of location," J. Amer. Statist. Assoc., p. 946, June 1970.
[4] C. F. Gauss, "Göttingische gelehrte Anzeigen," reprinted in Werke, vol. 4, p. 98.
[5] P. J. Huber, "Robust estimation of a location parameter," Ann. Math. Statist., vol. 35, pp. 73–101, 1964.
[6] L. A. Jaeckel, "Robust estimates of location: symmetry and asymmetric contamination," Ann. Math. Statist., vol. 42, pp. 1020–1034, 1971.
[7] R. D. Martin, "Robust estimation of signal amplitude," IEEE Trans. Inform. Theory, vol. IT-18, pp. 596–606, Sept. 1972.
[8] R. D. Martin and C. J. Masreliez, "Robust estimation via stochastic approximation," Dep. Elec. Eng., Univ. Washington, Seattle, Tech. Rep. 174, Aug. 1, 1973.
[9] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, pp. 400–407, 1951.
[10] J. Sacks, "Asymptotic distribution of stochastic approximation procedures," Ann. Math. Statist., vol. 29, p. 373, 1958.
[11] D. J. Sakrison, "Stochastic approximation," in Advances in Communication Systems, vol. 2. New York: Academic, 1966, p. 51.
[12] J. W. Tukey and T. E. Harris, Statistical Res. Group, Princeton Univ., Princeton, N.J., Memo. Rep. 31, 1949.
[13] J. W. Tukey, "A survey of sampling from contaminated distributions," in Contributions to Probability and Statistics (Harold Hotelling Volume). Stanford, Calif.: Stanford Univ. Press, 1960, p. 448.
[14] J. Sacks and D. Ylvisaker, "A note on Huber's robust estimation of a location parameter," Ann. Math. Statist., vol. 43, Aug. 1972.
[15] A. P. Sage and J. L. Melsa, Estimation Theory with Applications to Identification and Control. New York: McGraw-Hill, 1971.
[16] V. Fabian, "On asymptotic normality in stochastic approximation," Ann. Math. Statist., vol. 39, Aug. 1968.
[17] P. J. Bickel, "On robust estimates of location," Ann. Math. Statist.
