
Herbert Robbins, "The Empirical Bayes Approach to Statistical Decision Problems." The Annals of Mathematical Statistics, Vol. 35, No. 1 (Mar., 1964), pp. 1-20. Published by the Institute of Mathematical Statistics. Stable URL: http://www.jstor.org/stable/2238017


THE EMPIRICAL BAYES APPROACH TO STATISTICAL DECISION PROBLEMS¹

BY HERBERT ROBBINS

Columbia University

¹ Received 22 October 1963. Presented as the Rietz Lecture at the 26th Annual Meeting of the Institute of Mathematical Statistics, August 29, 1963, at Ottawa, Canada. Research sponsored by National Science Foundation Grant NSF-GP-316 at Columbia University.

1. Introduction. The empirical Bayes approach to statistical decision problems is applicable when the same decision problem presents itself repeatedly and independently with a fixed but unknown a priori distribution of the parameter. Not all decision problems in practice come to us imbedded in such a sequence, but when they do the empirical Bayes approach offers certain advantages over any approach which ignores the fact that the parameter is itself a random variable, as well as over any approach which assumes a personal or a conventional distribution of the parameter not subject to change with experience. My own interest in the empirical Bayes approach was renewed by recent work of E. Samuel [10], [11] and J. Neyman [6], to both of whom I am very much indebted. In keeping with the purpose of the Rietz Lecture I shall not confine myself to presenting new results and shall try to make the argument explicit at the risk of being tedious. In the current controversy between the Bayesian school and their opponents it is obvious that any theory of statistical inference will find itself in and out of fashion as the winds of doctrine blow. Here, then, are some remarks and references for further reading which I hope will interest my audience in thinking the matter through for themselves. Considerations of space have confined mention of the non-parametric case, and of the closely related "compound" approach in which no a priori distribution of the parameter is assumed, to the references at the end of the article.

2. The empirical Bayes decision problem. We begin by stating the kind of statistical decision problem with which we shall be concerned. This comprises

(a) A parameter space $\Lambda$ with generic element $\lambda$. $\lambda$ is the "state of nature" which is unknown to us.

(b) An action space $A$ with generic element $a$.

(c) A loss function $L(a, \lambda) \ge 0$ representing the loss we incur in taking action $a$ when the parameter is $\lambda$.

(d) An a priori distribution $G$ of $\lambda$ on $\Lambda$. $G$ may or may not be known to us.

(e) An observable random variable $x$ belonging to a space $X$ on which a $\sigma$-finite measure $\mu$ is defined. When the parameter is $\lambda$, $x$ has a specified probability density $f_\lambda$ with respect to $\mu$.

The problem is to choose a decision function $t$, defined on $X$ and with values in $A$, such that when we observe $x$ we shall take the action $t(x)$ and thereby incur the loss $L(t(x), \lambda)$.

For any $t$ the expected loss when $\lambda$ is the parameter is

(1) $R(t, \lambda) = \int_X L(t(x), \lambda) f_\lambda(x)\, d\mu(x)$,

and hence the overall expected loss when the a priori distribution of $\lambda$ is $G$ is

(2) $R(t, G) = \int_\Lambda R(t, \lambda)\, dG(\lambda)$,

called the Bayes risk of t relative to G. We can write

(3) $R(t, G) = \int_X \phi_G(t(x), x)\, d\mu(x)$,

where we have set

(4) $\phi_G(a, x) = \int_\Lambda L(a, \lambda) f_\lambda(x)\, dG(\lambda)$.

To avoid needless complication we shall assume that there exists a decision function (d.f.) $t_G$ such that for a.e. ($\mu$) $x$,

(5) $\phi_G(t_G(x), x) = \min_a \phi_G(a, x)$.

Then for any d.f. t

(6) $R(t_G, G) = \int_X \min_a \phi_G(a, x)\, d\mu(x) \le R(t, G)$,

so that, defining

(7) $R(G) = R(t_G, G) = \int_X \phi_G(t_G(x), x)\, d\mu(x)$,

we have

(8) $R(G) = \min_t R(t, G)$.

Any d.f. $t_G$ satisfying (5) minimizes the Bayes risk relative to $G$, and is called a Bayes d.f. relative to $G$. The functional $R$ defined by (7) is called the Bayes envelope functional of $G$. When $G$ is known we can use $t_G$ and thereby incur the minimum possible Bayes risk $R(G)$.
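As a concrete illustration of (4) and (5) (an addition to the text, not part of the lecture), the following sketch computes a Bayes d.f. $t_G$ by direct minimization of $\phi_G(a, x)$ when $G$ is a two-point prior on a Poisson parameter. The prior, the threshold, and the loss structure (borrowed from Section 4 below) are all illustrative assumptions.

```python
import math

# Assumed ingredients for the demonstration: a Poisson density f_lambda,
# a two-point prior G, and the one-sided loss structure of Section 4.
def f(lam, x):
    # f_lambda(x) = e^{-lam} lam^x / x!  (density w.r.t. counting measure)
    return math.exp(-lam) * lam ** x / math.factorial(x)

support, weights = (1.0, 3.0), (0.5, 0.5)   # G puts mass 1/2 at lambda = 1 and 3
lam_star = 2.0                              # threshold lambda* of the test
actions = ("a0", "a1")                      # accept / reject "lambda <= lambda*"

def loss(a, lam):
    return max(lam - lam_star, 0.0) if a == "a0" else max(lam_star - lam, 0.0)

def phi_G(a, x):
    # (4): phi_G(a, x) = sum_i L(a, lambda_i) f_{lambda_i}(x) g_i
    return sum(loss(a, lam) * f(lam, x) * g for lam, g in zip(support, weights))

# (5): the Bayes rule t_G(x) minimizes phi_G(a, x) over a, separately in each x.
for x in range(8):
    t_G = min(actions, key=lambda a: phi_G(a, x))
    print(x, t_G, round(phi_G("a0", x), 4), round(phi_G("a1", x), 4))
```

Running the loop shows the rule switching from $a_0$ to $a_1$ as $x$ grows, the monotone behavior one expects as the posterior mean of $\lambda$ crosses $\lambda^*$.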

There remains the problem of what to do when $G$ is not known. To extreme Bayesians and extreme non-Bayesians this question will be empty, since to the former $G$ will always be known, by introspection or otherwise, and to the latter $G$ will not even exist. We shall, however, take the position that $G$ exists, so that $R(t, G)$ is an appropriate criterion for the performance of any d.f. $t$, but that $G$ is not known, and therefore $t_G$ is not directly available to us.

Suppose now that the decision problem just described occurs repeatedly and independently, with the same unknown $G$ throughout (for examples see [6]). Thus let

(9) $(\lambda_1, x_1), (\lambda_2, x_2), \ldots$


be a sequence of pairs of random variables, each pair being independent of all the other pairs, the $\lambda_n$ having the common a priori distribution $G$ on $\Lambda$, and the conditional distribution of $x_n$ given that $\lambda_n = \lambda$ being specified by the probability density $f_\lambda$. At the time when the decision about $\lambda_{n+1}$ is to be made we have observed $x_1, \ldots, x_{n+1}$ (the values $\lambda_1, \lambda_2, \ldots$ remaining always unknown). We can therefore use for the decision about $\lambda_{n+1}$ a function of $x_{n+1}$ whose form depends upon $x_1, \ldots, x_n$; i.e. a function

(10) $t_n(\cdot) = t_n(x_1, \ldots, x_n; \cdot)$,

so that we shall take the action $t_n(x_{n+1}) \in A$ and thereby incur the loss $L(t_n(x_{n+1}), \lambda_{n+1})$. Our reason for doing this instead of using a fixed d.f. $t(\cdot)$ for each $n$ (as might seem reasonable, since the successive problems are independent and have the same structure) is that we hope for large $n$ to be able to extract some information about $G$ from the values $x_1, \ldots, x_n$ which have been observed, hopefully in such a way that $t_n(\cdot)$ will be close to the optimal but unknown $t_G(\cdot)$ which we would use throughout if we knew $G$.

We therefore define an "empirical" or "adaptive" decision procedure to be a sequence $T = \{t_n\}$ of functions of the form (10) with values in $A$. For a given $T$, the expected loss on the decision about $\lambda_{n+1}$, given $x_1, \ldots, x_n$, will be (cf. (3))

(11) $\int_X \phi_G(t_n(x), x)\, d\mu(x)$,

and hence the overall expected loss will be

(12) $R_n(T, G) = \int_X E\,\phi_G(t_n(x), x)\, d\mu(x)$,

where $E$ denotes expectation with respect to the $n$ independent random variables $x_1, \ldots, x_n$ which have the common density with respect to $\mu$ on $X$ given by

(13) $f_G(x) = \int_\Lambda f_\lambda(x)\, dG(\lambda)$,

the symbol $x$ in (12) playing the role of a dummy variable of integration and not a random variable. From (5) and (12) it follows that always

(14) $R_n(T, G) \ge R(G)$.

DEFINITION. If

(15) $\lim_{n\to\infty} R_n(T, G) = R(G)$,

we say that $T$ is asymptotically optimal (a.o.) relative to $G$.

We now ask whether we can find a $T$ which is in some sense "as good as possible" for large $n$ relative to some class $\mathcal{G}$ of a priori distributions which we are willing to assume contains the true $G$. In particular, can we find a $T$ which is a.o. relative to every $G$ in $\mathcal{G}$? ($\mathcal{G}$ may be the class of all possible distributions on $\Lambda$.)


3. Some generalities on asymptotic optimality. Comparing (2.7) and (2.12) we see by Lebesgue's theorem on dominated convergence that for $T = \{t_n\}$ to be a.o. relative to a $G$ it suffices that

(A) $\lim_{n\to\infty} E\,\phi_G(t_n(x), x) = \phi_G(t_G(x), x)$ (a.e. ($\mu$) $x$),

and

(B) $E\,\phi_G(t_n(x), x) \le H(x)$ (all $n$), where $\int_X H(x)\, d\mu(x) < \infty$.

The main problem is (A); we shall summarily dispose of (B) by assuming that

(C) $\int_\Lambda L(\lambda)\, dG(\lambda) < \infty$,

where we have set

(1) $0 \le L(\lambda) = \sup_a L(a, \lambda) \le \infty$.

Then setting

(2) $H(x) = \int_\Lambda L(\lambda) f_\lambda(x)\, dG(\lambda) \ge 0$,

we have by (2.4) for any T,

(3) $\phi_G(t_n(x), x) \le H(x)$ (all $n$),

and by (C),

(4) $\int_X H(x)\, d\mu(x) = \int_\Lambda L(\lambda) \left[ \int_X f_\lambda(x)\, d\mu(x) \right] dG(\lambda) = \int_\Lambda L(\lambda)\, dG(\lambda) < \infty$,

and (3) and (4) imply that (B) holds. Moreover, from (4) it follows that

(5) $H(x) < \infty$ (a.e. ($\mu$) $x$),

and hence to prove that (A) holds it will suffice to prove that

(D) $\mathrm{p\,lim}_{n\to\infty}\, \phi_G(t_n(x), x) = \phi_G(t_G(x), x)$ (a.e. ($\mu$) $x$),

where by $\mathrm{p\,lim}$ we mean limit in probability. Hence (C) and (D) suffice to ensure that $T$ is a.o. relative to $G$.

Let $a_0$ be an arbitrary fixed element of $A$ and define

(6) $\Delta_G(a, x) = \int_\Lambda [L(a, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG(\lambda)$

and

(7) $L_0(x) = \int_\Lambda L(a_0, \lambda) f_\lambda(x)\, dG(\lambda)$,


so that under (C) we have, for a.e. ($\mu$) $x$,

(8) $\phi_G(a, x) = L_0(x) + \Delta_G(a, x)$.

Suppose we can find a sequence of functions

(9) $\Delta_n(a, x) = \Delta_n(x_1, \ldots, x_n; a, x)$

such that for a.e. ($\mu$) $x$,

(10) $\mathrm{p\,lim}_{n\to\infty}\, \sup_a |\Delta_n(a, x) - \Delta_G(a, x)| = 0$.

Let $\epsilon_n$ be any sequence of positive constants tending to 0 and set (subject to measurability conditions)

(11) $t_n(x) = t_n(x_1, \ldots, x_n; x) =$ any element $a \in A$ such that $\Delta_n(a, x) \le \inf_a \Delta_n(a, x) + \epsilon_n$.

Then by (2.5) and (8),

(12) $0 \le \Delta_G(t_n(x), x) - \Delta_G(t_G(x), x) = [\Delta_G(t_n(x), x) - \Delta_n(t_n(x), x)] + [\Delta_n(t_n(x), x) - \Delta_n(t_G(x), x)] + [\Delta_n(t_G(x), x) - \Delta_G(t_G(x), x)]$.

Given any $\epsilon > 0$ we have by (10) that for large $n$, with probability as near 1 as we please, the right-hand side of (12) will be $\le \epsilon + \epsilon_n + \epsilon$; thus

(13) $\mathrm{p\,lim}_{n\to\infty}\, \Delta_G(t_n(x), x) = \Delta_G(t_G(x), x)$ (a.e. ($\mu$) $x$),

which by (8) implies (D). We have therefore proved

THEOREM 1. Let $G$ be such that (C) holds, let $\Delta_n(a, x)$ be a sequence of functions of the form (9) and such that (10) holds, and define $T = \{t_n\}$ by (11). Then $T$ is a.o. relative to $G$.

When $A$ is finite this yields

COROLLARY 1. Let $A = \{a_0, \ldots, a_m\}$ be a finite set, let $G$ be such that

(14) $\int_\Lambda L(a_i, \lambda)\, dG(\lambda) < \infty \quad (i = 0, \ldots, m)$,

and let $\Delta_{i,n}(x) = \Delta_{i,n}(x_1, \ldots, x_n; x)$ for $i = 1, \ldots, m$ and $n = 1, 2, \ldots$ be such that for a.e. ($\mu$) $x$,

(15) $\mathrm{p\,lim}_{n\to\infty}\, \Delta_{i,n}(x) = \int_\Lambda [L(a_i, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG(\lambda)$.

Set $\Delta_{0,n}(x) = 0$ and define

(16) $t_n(x) = a_k$, where $k$ is any integer $0 \le k \le m$ such that $\Delta_{k,n}(x) = \min[0, \Delta_{1,n}(x), \ldots, \Delta_{m,n}(x)]$.

Then T = {tn} is a.o. relative to G. In the important case m = 1 (hypothesis testing) this becomes


COROLLARY 2. Let $A = \{a_0, a_1\}$, let $G$ be such that

(17) $\int_\Lambda L(a_i, \lambda)\, dG(\lambda) < \infty \quad (i = 0, 1)$,

and let $\Delta_n(x) = \Delta_n(x_1, \ldots, x_n; x)$ be such that for a.e. ($\mu$) $x$,

(18) $\mathrm{p\,lim}_{n\to\infty}\, \Delta_n(x) = \Delta_G(x) = \int_\Lambda [L(a_1, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG(\lambda)$.

Define

$t_n(x) = a_0$ if $\Delta_n(x) \ge 0$; $\; = a_1$ if $\Delta_n(x) < 0$.

Then $T = \{t_n\}$ is a.o. relative to $G$.

We proceed to give an example in which a sequence $\Delta_n(x)$ satisfying (18) can be constructed.

4. The Poisson case. We consider the problem of testing a one-sided null hypothesis $H_0: \lambda \le \lambda^*$ concerning the value of a Poisson parameter $\lambda$. Thus let $\Lambda = \{0 < \lambda < \infty\}$, $A = \{a_0, a_1\}$, where $a_0$ = "accept $H_0$", $a_1$ = "reject $H_0$", and let $X = \{0, 1, 2, \ldots\}$, $\mu$ = counting measure on $X$,

(1) $f_\lambda(x) = e^{-\lambda} \lambda^x / x!$.

It remains to specify the loss functions $L(a_i, \lambda)$; we shall take them to be

(2) $L(a_0, \lambda) = 0$ if $\lambda \le \lambda^*$, $\; = \lambda - \lambda^*$ if $\lambda > \lambda^*$; $\quad L(a_1, \lambda) = \lambda^* - \lambda$ if $\lambda \le \lambda^*$, $\; = 0$ if $\lambda > \lambda^*$.

Thus (very conveniently)

(3) $L(a_1, \lambda) - L(a_0, \lambda) = \lambda^* - \lambda \quad (0 < \lambda < \infty)$,

and

(4) $\Delta_G(x) = \int_\Lambda [L(a_1, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG(\lambda) = \int_0^\infty (\lambda^* - \lambda) \frac{e^{-\lambda} \lambda^x}{x!}\, dG(\lambda)$.

Now by (2.13),

(5) $f_G(x) = P(x_i = x) = \int_0^\infty \frac{e^{-\lambda} \lambda^x}{x!}\, dG(\lambda)$,

so that we can write

(6) $\Delta_G(x) = \lambda^* f_G(x) - (x + 1) f_G(x + 1)$.


Define

(7) $\delta(x, y) = 1$ if $x = y$, $\; = 0$ if $x \ne y$,

and consider the expression

(8) $u_n(x) = u_n(x_1, \ldots, x_n; x) = \frac{1}{n} \sum_{j=1}^n \delta(x, x_j)$.

Noting that

(9) $E\,\delta(x, x_j) = P(x_j = x) = f_G(x)$,

it follows from the law of large numbers that

(10) $\mathrm{p\,lim}_{n\to\infty}\, u_n(x) = f_G(x) \quad (x = 0, 1, \ldots)$,

and hence that, setting

(11) $\Delta_n(x) = \lambda^* u_n(x) - (x + 1) u_n(x + 1)$,

we have for $x = 0, 1, \ldots$

(12) $\mathrm{p\,lim}_{n\to\infty}\, \Delta_n(x) = \lambda^* f_G(x) - (x + 1) f_G(x + 1) = \Delta_G(x)$.

Setting

(13) $t_n(x) = a_0$ if $\lambda^* u_n(x) - (x + 1) u_n(x + 1) \ge 0$, $\; = a_1$ otherwise,

it follows from Corollary 2 of Section 3 that $T$ is a.o. relative to every $G$ such that

(14) $\int_0^\infty \lambda\, dG(\lambda) < \infty$.

We remark that we could equally well have defined $u_n(x)$ to be

(15) $u_n(x) = \frac{1}{n+1} \sum_{j=1}^{n+1} \delta(x, x_j)$,

since (10) holds as well for (15) as for (8). Using (15), the corresponding $T$ would require us to take action $a_0$ on $\lambda_{n+1}$ if and only if

(16) $\lambda^* \ge \dfrac{(x_{n+1} + 1) \cdot (\text{number of terms } x_1, \ldots, x_{n+1} \text{ equal to } x_{n+1} + 1)}{\text{number of terms } x_1, \ldots, x_{n+1} \text{ equal to } x_{n+1}}$.

For the problem of point estimation with squared error loss it was shown by M. V. Johns, Jr. [2] that the right-hand side of (16) is in fact an a.o. point estimator of $\lambda_{n+1}$ for every $G$ such that

(17) $\int_0^\infty \lambda^2\, dG(\lambda) < \infty$.
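As a computational aside (an addition, not in the original text), the test (13), in its equivalent form (16), needs nothing more than frequency counts of the past observations. The sketch below simulates the sequence with an assumed exponential prior $G$ (so that (14) and (17) both hold) and applies the rule; the seed, sample size, and prior are illustrative assumptions.

```python
import math
import random
from collections import Counter

random.seed(0)
lam_star = 1.0                     # threshold in H0: lambda <= lambda*
n = 20000

def poisson(lam):
    # Inverse-transform Poisson sampler; adequate for small lam.
    u, k = random.random(), 0
    p = c = math.exp(-lam)
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

# Pairs (lambda_j, x_j): lambda_j ~ G (assumed exponential, mean 1),
# x_j | lambda_j ~ Poisson(lambda_j).  Only the x_j are available to the rule.
xs = [poisson(random.expovariate(1.0)) for _ in range(n + 1)]

counts = Counter(xs)               # (n+1) u_n(x) of (15) equals counts[x]
x_new = xs[-1]                     # the current observation x_{n+1}

# Rule (16): take a0 iff lambda* >= (x+1) #{x_j = x+1} / #{x_j = x} at x = x_{n+1}.
ratio = (x_new + 1) * counts[x_new + 1] / counts[x_new]
action = "accept H0" if lam_star >= ratio else "reject H0"

# Johns [2]: under (17) the same ratio is also an a.o. point estimate of
# lambda_{n+1} for squared error loss.
print(f"x_(n+1) = {x_new}:  ratio = {ratio:.3f}  ->  {action}")
```

The ratio is computed entirely from empirical frequencies, which is the essential point: nothing about $G$ beyond the observations themselves enters the procedure.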


The much easier result of the present section on hypothesis testing uses a loss structure (2) suggested by Johns in [4].

The relation (6) was basic to our construction (11) of a sequence $\Delta_n(x)$ satisfying (3.18); now, (6) is a special property of the Poisson distribution (1) and the loss structure (2), and therefore it might seem that the application of Corollary 2 to empirical Bayes hypothesis testing would be very limited. Such is not the case. The application of Corollary 2 to more general loss structures, and to many of the most common discrete and continuous parametric distributions of statistics, is discussed in [4], [9], [11]. Instead of giving a review of these results here, we shall consider a case in which no asymptotically optimal $T$ exists but the empirical Bayes approach is still useful.

5. An example in which asymptotic optimality does not exist but all is not lost. Let $x$ be a random variable with only two values, 0 and 1, with respective probabilities $1 - \lambda$ and $\lambda$, the unknown parameter $\lambda$ lying in the interval $\Lambda = \{0 \le \lambda \le 1\}$. On the basis of a single observation of $x$ we want to estimate $\lambda$; if our estimate is $a \in A = \{0 \le a \le 1\}$ the loss will be taken to be $L(a, \lambda) = (\lambda - a)^2$. A d.f. $t$ is determined by the two constants $t(0)$, $t(1)$, which are at our disposal on the unit interval $A$; the expected loss in using $t$ for a given $\lambda$ is

(1) $R(t, \lambda) = (1 - \lambda)(\lambda - t(0))^2 + \lambda(\lambda - t(1))^2 = t^2(0) + [t^2(1) - 2t(0) - t^2(0)]\lambda + [1 - 2t(1) + 2t(0)]\lambda^2$.

Consider the particular family of d.f.'s $t_a$ defined for $0 \le a \le 1$ by setting

(2) $t_a(0) = \frac{1}{2}a$, $\quad t_a(1) = \frac{1}{2}(1 + a)$.

It is easily seen from (1) that

(3) $R(t_a, \lambda) = \frac{1}{4}[a^2 + (1 - 2a)\lambda]$.

For $a = \frac{1}{2}$ we shall denote $t_a$ by $t^*$, so that

(4) $t^*(0) = \frac{1}{4}$, $\quad t^*(1) = \frac{3}{4}$, $\quad R(t^*, \lambda) \equiv \frac{1}{16}$.

For any a priori distribution $G$ of $\lambda$ let

(5) $v_i = v_i(G) = \int_0^1 \lambda^i\, dG(\lambda) \quad (i = 1, 2)$.

Then from (1) it follows that for any d.f. $t$,

(6) $R(t, G) = \int_0^1 R(t, \lambda)\, dG(\lambda) = t^2(0) + v_1[t^2(1) - 2t(0) - t^2(0)] + v_2[1 - 2t(1) + 2t(0)]$.

Apart from the trivial cases in which $v_1 = 0$ or 1, we have after some simple algebra the formula

(7) $R(t, G) = (v_1 - v_2)(v_2 - v_1^2)/v_1(1 - v_1) + (1 - v_1)[t(0) - (v_1 - v_2)/(1 - v_1)]^2 + v_1[t(1) - v_2/v_1]^2$,


from which it follows that for a given $G$, $R(t, G)$ is minimized uniquely by the Bayes d.f. $t_G$ for which

(8) $t_G(0) = (v_1 - v_2)/(1 - v_1)$, $\quad t_G(1) = v_2/v_1$,

with

(9) $R(G) = R(t_G, G) = (v_1 - v_2)(v_2 - v_1^2)/v_1(1 - v_1)$.

Each $t_a$ (and in particular $t^*$) is a Bayes d.f.; it suffices to find a distribution of $\lambda$ such that

(10) $(v_1 - v_2)/(1 - v_1) = \frac{1}{2}a$, $\quad v_2/v_1 = \frac{1}{2}(1 + a)$,

and this is provided e.g. by the distribution $G_a$ with the density

(11) $[B(a, 1 - a)]^{-1} \lambda^{a-1} (1 - \lambda)^{(1-a)-1}$,

for which

(12) $v_1 = a$, $\quad v_2 = \frac{1}{2}a(1 + a)$, $\quad R(G_a) = \frac{1}{4}a(1 - a)$.

The fact that $t^*$ is the Bayes d.f. relative to $G_{1/2}$, and that for any $G$,

(13) $R(t^*, G) = \int_0^1 R(t^*, \lambda)\, dG(\lambda) = \frac{1}{16}$,

has the important consequence that

(14) $\sup_G R(t, G) > \frac{1}{16}$ for every $t \ne t^*$.

For if for some $t'$, $\sup_G R(t', G) \le \frac{1}{16}$, then in particular

(15) $\frac{1}{16} = R(t^*, G_{1/2}) \le R(t', G_{1/2}) \le \frac{1}{16}$,

so that

(16) $R(t', G_{1/2}) = R(t^*, G_{1/2}) = R(G_{1/2}) = \frac{1}{16}$,

and therefore $t' = t^*$. Thus $t^*$ is the unique "minimax" d.f. in the sense that it minimizes the maximum Bayes risk relative to the class of all a priori distributions $G$. When nothing is known about $G$ it is therefore not unreasonable (the avoidance of a direct endorsement of any decision function is customary in the literature) to use $t^*$; the Bayes risk will then be $\frac{1}{16}$ irrespective of $G$, while for any $t \ne t^*$ the Bayes risk will be $> \frac{1}{16}$ for some $G$ (in particular for any $G$ with $v_1 = \frac{1}{2}$, $v_2 = \frac{3}{8}$; e.g. $G_{1/2}$).

For any $0 < a < 1$ let $\mathcal{G}_a$ denote the class of all $G$ such that $v_1(G) = a$. For any $G$ in $\mathcal{G}_a$ (in particular for $G_a$) we see from (2) and (7), after a little algebra, that

(17) $R(t_a, G) = \frac{1}{4}a(1 - a) \quad (G \in \mathcal{G}_a)$,

irrespective of the value of $v_2(G)$. It therefore follows as above that

(18) $\sup_{G \in \mathcal{G}_a} R(t, G) > \frac{1}{4}a(1 - a)$ for every $t \ne t_a$,


so that $t_a$ is the unique minimax d.f. relative to the class $\mathcal{G}_a$, in the sense that it minimizes the maximum Bayes risk relative to $\mathcal{G}_a$. If nothing is known about $G$ except that $v_1(G) = a$ it is therefore not unreasonable to use $t_a$; the Bayes risk will then be $\frac{1}{4}a(1 - a)$, while for any other d.f. the Bayes risk will be $> \frac{1}{4}a(1 - a)$ for some $G$ in $\mathcal{G}_a$ (in particular for any $G$ with $v_1 = a$, $v_2 = \frac{1}{2}a(1 + a)$; e.g. $G_a$).

It follows from the above (or can be verified directly) that

(19) $R(G) = (v_1 - v_2)(v_2 - v_1^2)/v_1(1 - v_1) \le \frac{1}{4}v_1(1 - v_1) \le \frac{1}{16}$,

equality holding respectively when $v_2 = \frac{1}{2}v_1(1 + v_1)$ and when $v_1 = \frac{1}{2}$.

Suppose now that we confront this decision problem repeatedly with an unknown $G$. The sequence $x_1, x_2, \ldots$ is an independent and identically distributed sequence of 0's and 1's with

(20) $P(x_i = 1) = \int_0^1 \lambda\, dG(\lambda) = v_1(G)$, $\quad P(x_i = 0) = 1 - v_1(G)$;

thus the distribution of the $x_i$ depends only on $v_1(G)$. Since $t_G$, defined by (8), involves $v_2(G)$ as well, it follows that no $T$ can be a.o. relative to every $G$ in a class $\mathcal{G}$ unless $v_2$ is a function of $v_1$ in $\mathcal{G}$, which is not likely to be the case in practice.

On the other hand, let

(21) $u_n = \frac{1}{n} \sum_{i=1}^n x_i$,

and consider the decision procedure $T = \{t_n\}$ with

(22) $t_n(0) = \frac{1}{2}u_n$, $\quad t_n(1) = \frac{1}{2}(1 + u_n)$.

For any $G$ in $\mathcal{G}_a$, by the law of large numbers, $u_n \to a$ and hence $t_n \to t_a$ with probability 1 as $n \to \infty$. In fact, since

(23) $E x_i = a = E x_i^2$, $\quad \operatorname{Var} x_i = a(1 - a)$,

it follows that

(24) $E u_n = a$, $\quad E u_n^2 = \operatorname{Var} u_n + a^2 = a(1 - a)/n + a^2$,

and hence from (6) that

(25) $R_n(T, G) = E\{\tfrac{1}{4}u_n^2 + a[\tfrac{1}{4}(1 + 2u_n + u_n^2) - u_n - \tfrac{1}{4}u_n^2] + v_2[1 - (1 + u_n) + u_n]\} = \tfrac{1}{4}E[u_n^2 - 2a u_n + a] = \tfrac{1}{4}a(1 - a)[1 + 1/n] = R(t_a, G)[1 + 1/n]$.

Thus for large $n$ we will do almost as well by using $T$ as we could do if we knew $v_1(G) = a$ and used $t_a$. We have in fact, for $G \in \mathcal{G}_a$,

(26) $R_n(T, G) - R(t_a, G) = a(1 - a)/4n \le 1/16n$,


while

(27) $R_n(T, G) - R(t^*, G) = \tfrac{1}{4}a(1 - a)(n + 1)/n - \tfrac{1}{16} = -[\tfrac{1}{4}(1 - 2a)]^2 + a(1 - a)/4n$.
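A brief Monte Carlo check of (25) (an illustrative addition; the choice $G = G_a$, the seed, and the sample sizes are assumptions):

```python
import random

random.seed(1)
a, n, reps = 0.3, 50, 100000
R_ta = a * (1 - a) / 4                 # (17): R(t_a, G) for any G with v_1 = a
R_n_exact = R_ta * (1 + 1 / n)         # (25)

total = 0.0
for _ in range(reps):
    # Past data: marginally each x_i is an independent Bernoulli(v_1) = Bernoulli(a)
    # draw, by (20), since each pair (lambda_i, x_i) is generated independently.
    u_n = sum(random.random() < a for _ in range(n)) / n
    # Current problem: lambda ~ G_a = Beta(a, 1 - a), then x | lambda ~ Bernoulli(lambda).
    lam = random.betavariate(a, 1 - a)
    x = random.random() < lam
    t = (1 + u_n) / 2 if x else u_n / 2        # the procedure (22)
    total += (lam - t) ** 2                    # squared error loss
print(f"simulated R_n = {total / reps:.5f};  exact (25) = {R_n_exact:.5f}")
```

The simulated value should land within Monte Carlo error of $\frac{1}{4}a(1 - a)(1 + 1/n)$, confirming that the excess risk over $R(t_a, G)$ decays at the rate $a(1 - a)/4n$ of (26).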

This example illustrates the fact that even when an a.o. $T$ does not exist, or when it does exist but $R_n(T, G)$ converges too slowly to $R(G)$, it may be worthwhile to use a $T$ which is at least "asymptotically subminimax." R. Cogburn has work in progress in this direction for the case in which $x$ has a general binomial distribution (for which see also [9] and [11]). The general problem of finding "good" $T$'s has hardly been touched, and efforts along this line should yield interesting and useful results.

6. Estimating the a priori distribution: the general case. Returning to the general formulation of Sections 2 and 3, and confining ourselves for simplicity to the case $A = \{a_0, a_1\}$ and $\Lambda = \{-\infty < \lambda < \infty\}$, we recall that an a.o. $T$ exists relative to the class $\mathcal{G}$ defined by (3.17) whenever we can find a sequence $\Delta_n(x) = \Delta_n(x_1, \ldots, x_n; x)$ such that for a.e. ($\mu$) $x$,

(1) $\mathrm{p\,lim}_{n\to\infty}\, \Delta_n(x) = \Delta_G(x) = \int_{-\infty}^{\infty} [L(a_1, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG(\lambda)$

for every $G$ in $\mathcal{G}$. One way to construct such a sequence (not the one used in Section 4) is to find a sequence $G_n(\lambda) = G_n(x_1, \ldots, x_n; \lambda)$ of random distribution functions in $\lambda$ such that

(2) $P[\lim_{n\to\infty} G_n(\lambda) = G(\lambda) \text{ at every continuity point } \lambda \text{ of } G] = 1$.

If we have such a sequence $G_n$ of random estimators of $G$ then we can set

(3) $\Delta_n(x) = \int_{-\infty}^{\infty} [L(a_1, \lambda) - L(a_0, \lambda)] f_\lambda(x)\, dG_n(\lambda)$;

if for a.e. ($\mu$) fixed $x$ the function

(4) $[L(a_1, \lambda) - L(a_0, \lambda)] f_\lambda(x)$

is continuous and bounded in $\lambda$, then the Helly-Bray theorem guarantees (1).

We shall now describe one method, a special case of the "minimum distance" method of J. Wolfowitz, for constructing a particular sequence $G_n(\lambda)$ of random estimators of an unknown $G$, and then prove a theorem which ensures that under appropriate conditions on the family $F_\lambda(x)$ the relation (2) will hold for any $G$ whatever.

In doing this we shall relax the condition that the distribution of $x$ for given $\lambda$ is given in terms of a density $f_\lambda$ with respect to a measure $\mu$, and instead assume only that for every $\lambda \in \Lambda = \{-\infty < \lambda < \infty\}$, $F_\lambda(x)$ is a specified distribution function in $x$, and that for every fixed $x \in X = \{-\infty < x < \infty\}$, $F_\lambda(x)$ is a Borel measurable function of $\lambda$. (A function $F(x)$ defined for $-\infty < x < \infty$ will be called a distribution function if $F$ is non-decreasing, continuous on the right, and $\lim_{x\to-\infty} F(x) = 0$, $\lim_{x\to\infty} F(x) = 1$.)

For any distribution function $G$ of $\lambda$ we define

(5) $F_G(x) = \int_{-\infty}^{\infty} F_\lambda(x)\, dG(\lambda)$;

then $F_G$ is a distribution function in $x$. Let $x_1, x_2, \ldots$ be a sequence of independent random variables with $F_G$ as their common distribution function, and define

(6) $B_n(x) = B_n(x_1, \ldots, x_n; x) = (\text{number of terms } x_1, \ldots, x_n \text{ which are } \le x)/n$.

For any two distribution functions $F_1$, $F_2$ define the distance

(7) $\rho(F_1, F_2) = \sup_x |F_1(x) - F_2(x)|$,

and let $\epsilon_n$ be any sequence of positive constants tending to 0. Let $\mathcal{G}$ be any class of distribution functions in $\lambda$ which contains $G$, and define

(8) $d_n = \inf_{\bar{G} \in \mathcal{G}}\, \rho(B_n, F_{\bar{G}})$.

Let $G_n(\lambda) = G_n(x_1, \ldots, x_n; \lambda)$ be any element of $\mathcal{G}$ such that

(9) $\rho(B_n, F_{G_n}) \le d_n + \epsilon_n$.
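Numerically, the minimum-distance step (8) and (9) is straightforward once $\mathcal{G}$ is replaced by a finite family of candidates. The sketch below (an illustrative addition) fits two-point Poisson mixtures by minimizing the distance (7) between $F_{\bar{G}}$ and the empirical d.f. $B_n$; the true $G$, the candidate grids, and the sample size are assumptions made for the demonstration.

```python
import math
import random

random.seed(2)

def pois_cdf(lam, x):
    # F_lambda(x) for the Poisson family (cf. (23) in Example 2 below)
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(x + 1))

def draw(lam):
    # inverse-transform Poisson draw
    u, k = random.random(), 0
    p = c = math.exp(-lam)
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

# Data x_i ~ F_G with true G = (1/2) delta_1 + (1/2) delta_4 (an assumed prior).
n = 2000
xs = [draw(1.0 if random.random() < 0.5 else 4.0) for _ in range(n)]
pts = range(max(xs) + 2)             # both d.f.'s are constant between integers
B_n = {x: sum(v <= x for v in xs) / n for x in pts}    # empirical d.f. (6)

# Candidate class: g delta_{l1} + (1 - g) delta_{l2} over coarse grids.  G_n is
# the candidate minimizing the sup-distance to B_n, i.e. (8)-(9) with epsilon_n = 0.
best, best_d = None, float("inf")
lam_grid = [0.5 * j for j in range(1, 11)]
for l1 in lam_grid:
    for l2 in lam_grid:
        for g in [0.1 * j for j in range(11)]:
            d = max(abs(B_n[x] - (g * pois_cdf(l1, x) + (1 - g) * pois_cdf(l2, x)))
                    for x in pts)
            if d < best_d:
                best, best_d = (g, l1, l2), d

g, l1, l2 = best
print(f"G_n = {g:.1f} delta({l1}) + {1 - g:.1f} delta({l2}),  rho = {best_d:.4f}")
```

With the true $G$ inside the candidate grid, the selected $G_n$ is typically the true mixture or a near neighbor of it, in line with the effectiveness property defined next.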

We say that the sequence $G_n$ so defined is effective for $\mathcal{G}$ if (2) holds for every $G$ in $\mathcal{G}$. We now state

THEOREM 2. Assume that

(A) For every fixed $x$, $F_\lambda(x)$ is a continuous function of $\lambda$.

(B) The limits $F_{-\infty}(x) = \lim_{\lambda\to-\infty} F_\lambda(x)$, $F_\infty(x) = \lim_{\lambda\to\infty} F_\lambda(x)$ exist for every $x$.

(C) Neither $F_{-\infty}$ nor $F_\infty$ is a distribution function.

(D) If $G_1$, $G_2$ are any two distribution functions in $\lambda$ such that $F_{G_1} = F_{G_2}$, then $G_1 = G_2$.

Then the sequence $G_n$ defined by (9) is effective for the class $\mathcal{G}$ of all distribution functions in $\lambda$.

PROOF. By the Glivenko-Cantelli theorem,

(10) $P[\lim_{n\to\infty} \rho(B_n, F_G) = 0] = 1$.

Since

(11) $\rho(F_{G_n}, F_G) \le \rho(F_{G_n}, B_n) + \rho(B_n, F_G) \le d_n + \epsilon_n + \rho(B_n, F_G) \le 2\rho(B_n, F_G) + \epsilon_n$,

it follows from (10) that with probability 1 the sequence $x_1, x_2, \ldots$ is such that

(12) $\lim_{n\to\infty} \int_{-\infty}^{\infty} F_\lambda(x)\, dG_n(\lambda) = \int_{-\infty}^{\infty} F_\lambda(x)\, dG(\lambda)$ (uniformly in $x$).


Now consider any fixed sequence $x_1, x_2, \ldots$ such that (12) holds, and let $G_{k_n}$ be any subsequence of $G_n$ such that $G_{k_n}(\lambda) \to G^*(\lambda)$ at every continuity point $\lambda$ of $G^*$, where $G^*$ is a distribution function in the weak sense, $0 \le G^*(-\infty)$, $G^*(\infty) \le 1$. By a simple extension of the Helly-Bray theorem it follows from (A) and (B) that for every $x$,

(13) $\lim_{n\to\infty} \int_{-\infty}^{\infty} F_\lambda(x)\, dG_{k_n}(\lambda) = \int_{-\infty}^{\infty} F_\lambda(x)\, dG^*(\lambda) + G^*(-\infty) F_{-\infty}(x) + [1 - G^*(\infty)] F_\infty(x)$,

and hence from (12) that for every $x$,

(14) $\int_{-\infty}^{\infty} F_\lambda(x)\, dG(\lambda) = \int_{-\infty}^{\infty} F_\lambda(x)\, dG^*(\lambda) + G^*(-\infty) F_{-\infty}(x) + [1 - G^*(\infty)] F_\infty(x)$.

If we can show that $G^*(-\infty) = 0$ and $G^*(\infty) = 1$ then it will follow from (D) that $G = G^*$, and hence, since $G^*$ denoted the weak limit of an arbitrary convergent subsequence of $G_n$, that (2) holds. We shall complete the proof of the theorem by showing that (C) implies that $G^*(-\infty) = 0$ and $G^*(\infty) = 1$.

Since $F_{-\infty}$ is the limit as $\lambda \to -\infty$ of $F_\lambda$, it is a non-decreasing function of $x$ such that

(15) $0 \le F_{-\infty}(-\infty)$, $\quad F_{-\infty}(\infty) \le 1$,

and similarly for $F_\infty$. Letting $x \to -\infty$ in (14), we have by Lebesgue's theorem of bounded convergence,

(16) $0 = G^*(-\infty) F_{-\infty}(-\infty) + [1 - G^*(\infty)] F_\infty(-\infty)$.

Hence if $G^*(-\infty) \ne 0$ then $F_{-\infty}(-\infty) = 0$, and if $G^*(\infty) \ne 1$ then $F_\infty(-\infty) = 0$. Similarly, by letting $x \to \infty$ in (14) we see that if $G^*(-\infty) \ne 0$ then $F_{-\infty}(\infty) = 1$, and if $G^*(\infty) \ne 1$ then $F_\infty(\infty) = 1$. Suppose now that $a_n$ is any sequence of constants converging to a limit $a$ from the right. Then from (14), putting $x = a_n$, letting $n \to \infty$, and subtracting (14) for $x = a$, we see that

(17) $G^*(-\infty)[F_{-\infty}(a + 0) - F_{-\infty}(a)] + [1 - G^*(\infty)][F_\infty(a + 0) - F_\infty(a)] = 0$.

Hence if $G^*(-\infty) \ne 0$ then $F_{-\infty}(a + 0) = F_{-\infty}(a)$, and if $G^*(\infty) \ne 1$ then $F_\infty(a + 0) = F_\infty(a)$. It follows that if $G^*(-\infty) \ne 0$ then $F_{-\infty}$ is a distribution function, and if $G^*(\infty) \ne 1$ then $F_\infty$ is a distribution function. Hence by (C), $G^*(-\infty) = 0$ and $G^*(\infty) = 1$. This completes the proof.

EXAMPLE 1. (Location parameter.) Let $F$ be a continuous distribution function (e.g., the normal distribution function) with a characteristic function which never vanishes,

(18) $\phi_F(t) = \int_{-\infty}^{\infty} e^{ixt}\, dF(x) \ne 0$ (all $t$).


Set $F_\lambda(x) = F(x - \lambda)$; then (A), (B), (C) hold. If $G_1$, $G_2$ are any two distribution functions such that $F_{G_1} = F_{G_2}$, i.e., such that

(19) $\int_{-\infty}^{\infty} F(x - \lambda)\, dG_1(\lambda) = \int_{-\infty}^{\infty} F(x - \lambda)\, dG_2(\lambda)$ (all $x$),

then

(20) $\phi_F(t)\,\phi_{G_1}(t) = \phi_F(t)\,\phi_{G_2}(t)$ (all $t$),

and hence

(21) $\phi_{G_1}(t) = \phi_{G_2}(t)$ (all $t$),

so that $G_1 = G_2$. Hence (D) holds, and by Theorem 2 the sequence $G_n$ defined by (9) is effective for the class $\mathcal{G}$ of all distributions $G$ of $\lambda$.

When the parameter space $\Lambda$ is not the whole line, the statement and proof of Theorem 2 can be appropriately modified. As an example, suppose $\Lambda = \{0 \le \lambda < \infty\}$. Then we can prove in exactly the same way as for Theorem 2

THEOREM 3. Assume that

(A) As in Theorem 2.

(B) The limit $F_\infty(x) = \lim_{\lambda\to\infty} F_\lambda(x)$ exists for every $x$.

(C) $F_\infty$ is not a distribution function.

(D) If $G_1$, $G_2$ are any two distribution functions which assign unit probability to $\Lambda = \{0 \le \lambda < \infty\}$ and are such that

$\int_\Lambda F_\lambda(x)\, dG_1(\lambda) = \int_\Lambda F_\lambda(x)\, dG_2(\lambda)$ (all $x$),

then $G_1 = G_2$.

Then the sequence $G_n$ defined by (9) is effective for the class $\mathcal{G}$ of all distributions which assign unit probability to $\Lambda$.

EXAMPLE 2. (Poisson parameter.) Let

(22) $F_0(x) = 0$ for $x < 0$, $\; = 1$ for $x \ge 0$,

and for $0 < \lambda < \infty$ let

(23) $F_\lambda(x) = \sum_{0 \le i \le x} e^{-\lambda} \lambda^i / i!$.

Then (A), (B), and (C) hold. Let $G \in \mathcal{G}$; then

(24) $F_G(x) = \int_\Lambda F_\lambda(x)\, dG(\lambda) = \sum_{0 \le i \le x} \int_\Lambda f_\lambda(i)\, dG(\lambda)$,

where

(25) $f_0(i) = 1$ for $i = 0$, $\; = 0$ for $i = 1, 2, \ldots$; and $f_\lambda(i) = e^{-\lambda} \lambda^i / i!$ for $i = 0, 1, \ldots$ and $0 < \lambda < \infty$.


Now

(26) $F_G(0) = \int_\Lambda f_\lambda(0)\, dG(\lambda)$, $\quad F_G(n) - F_G(n - 1) = \int_\Lambda f_\lambda(n)\, dG(\lambda) \quad (n = 1, 2, \ldots)$,

so if $F_{G_1} = F_{G_2}$ then

(27) $\int_\Lambda f_\lambda(n)\, dG_1(\lambda) = \int_\Lambda f_\lambda(n)\, dG_2(\lambda) \quad (n = 0, 1, \ldots)$.

Define the set functions

(28) $H_j(B) = \int_B e^{-\lambda}\, dG_j(\lambda) \Big/ \int_\Lambda e^{-\lambda}\, dG_j(\lambda) \quad (j = 1, 2)$;

then $H_j$ is a probability measure on the Borel sets. Since by (27),

(29) $c = \int_\Lambda e^{-\lambda}\, dG_1(\lambda) = \int_\Lambda f_\lambda(0)\, dG_1(\lambda) = \int_\Lambda f_\lambda(0)\, dG_2(\lambda) = \int_\Lambda e^{-\lambda}\, dG_2(\lambda)$,

we can write

(30) $H_j(B) = \frac{1}{c} \int_B e^{-\lambda}\, dG_j(\lambda) \quad (j = 1, 2)$,

where $0 < c < \infty$. Since

(31) $dH_j/dG_j = e^{-\lambda}/c$,

we have for $n = 1, 2, \ldots$ and $j = 1, 2$,

(32) $\int_\Lambda \lambda^n\, dH_j(\lambda) = \frac{1}{c} \int_\Lambda e^{-\lambda} \lambda^n\, dG_j(\lambda) = \frac{n!}{c} \int_\Lambda f_\lambda(n)\, dG_j(\lambda)$,

so that by (27),

(33) $\alpha_n = \int_\Lambda \lambda^n\, dH_1(\lambda) = \int_\Lambda \lambda^n\, dH_2(\lambda) \quad (n = 1, 2, \ldots)$;

moreover, since $0 \le e^{-\lambda} \lambda^n \le n!$ for $0 \le \lambda < \infty$, we have

(34) $0 \le \alpha_n = \frac{1}{c} \int_\Lambda e^{-\lambda} \lambda^n\, dG_j(\lambda) \le \frac{n!}{c} \quad (n = 1, 2, \ldots)$,

so that the series $\sum_{n=1}^\infty (\alpha_n/n!) (\frac{1}{2})^n < \infty$. From a theorem of H. Cramér (Mathematical Methods of Statistics, p. 176) it follows that $H_1 = H_2$. Since

(35) $G_j(B) = \int_B \frac{dG_j}{dH_j}\, dH_j(\lambda) = c \int_B e^{\lambda}\, dH_j(\lambda) \quad (j = 1, 2)$,

it follows that $G_1 = G_2$, so that (D) holds.


We conclude with an example in which $\Lambda = \{0 < \lambda < \infty\}$, 0 here playing the role of $-\infty$ in Theorem 2.

EXAMPLE 3. (Uniform distribution.) Define for $\lambda \in \Lambda = \{0 < \lambda < \infty\}$

(36) $F_\lambda(x) = 0$ for $x \le 0$, $\; = x/\lambda$ for $0 < x < \lambda$, $\; = 1$ for $x \ge \lambda$.

Then

(37) $\lim_{\lambda\to 0} F_\lambda(x) = 0$ for $x \le 0$, $\; = 1$ for $x > 0$,

and

(38) $\lim_{\lambda\to\infty} F_\lambda(x) \equiv 0$

are not distribution functions, and (A), (B), (C) hold. For any $G$ which assigns unit probability to $\Lambda$ we have for $x > 0$,

(39) $F_G(x) = \int_\Lambda F_\lambda(x)\, dG(\lambda) = \int_{\{0 < \lambda \le x\}} dG(\lambda) + x \int_{\{\lambda > x\}} \frac{1}{\lambda}\, dG(\lambda)$.

Hence if $F_{G_1} = F_{G_2}$ then

(40) $G_1(x) + x \int_{\{\lambda > x\}} \frac{dG_1(\lambda)}{\lambda} = G_2(x) + x \int_{\{\lambda > x\}} \frac{dG_2(\lambda)}{\lambda}$.

If $x$ is any common continuity point of $G_1$ and $G_2$ then, integrating by parts,

(41) $\int_{\{\lambda > x\}} \frac{dG_j(\lambda)}{\lambda} = -\frac{G_j(x)}{x} + \int_{\{\lambda > x\}} \frac{G_j(\lambda)}{\lambda^2}\, d\lambda$,

so that

(42) $G_j(x) + x \int_{\{\lambda > x\}} \frac{dG_j(\lambda)}{\lambda} = x \int_{\{\lambda > x\}} \frac{G_j(\lambda)}{\lambda^2}\, d\lambda$,

and hence from (40),

(43) $\int_{\{\lambda > x\}} \frac{G_1(\lambda)}{\lambda^2}\, d\lambda = \int_{\{\lambda > x\}} \frac{G_2(\lambda)}{\lambda^2}\, d\lambda$

at every continuity point $x > 0$ of $G_1$ and $G_2$. Differentiating with respect to $x$ gives $G_1(x) = G_2(x)$ for every such $x$, and hence $G_1 = G_2$. (Cf. [13] and references for the question of when the "identifiability" assumption (D) holds.)

The defect of the method of Theorem 2 for estimating G is that the choice of Gn satisfying (9) is non-constructive. In contrast, the method of Section 4, which bypasses the estimation of G and gives tG directly, is quite explicit for the given parametric family and loss structure.

In the next section we shall indicate a method for estimating $G$ which works in the case of a parameter space $\Lambda$ consisting of a finite number of elements, and which may possibly be capable of generalization.

7. Estimating the a priori distribution: the finite case. Consider the case in which the observable random variable $x \in X$ is known to have one of a finite number of specified probability distributions $P_1, \ldots, P_r$, which one depending on the value of a random parameter $\lambda \in \Lambda = \{1, \ldots, r\}$ which has an unknown a priori probability vector $G = \{g_1, \ldots, g_r\}$, $g_i \ge 0$, $\sum_{i=1}^r g_i = 1$, such that $P(\lambda = i) = g_i$. Observing a sequence $x_1, x_2, \ldots$ of independent random variables with the common distribution

(1) $P_G(x \in B) = \sum_{i=1}^r g_i P_i(B)$,

our problem is to construct functions

(2) $g_{i,n} = g_{i,n}(x_1, \ldots, x_n)$

such that $g_{i,n} \ge 0$, $\sum_{i=1}^r g_{i,n} = 1$, and whatever be $G$,

(3) $P[\lim_{n\to\infty} g_{i,n} = g_i] = 1 \quad (i = 1, \ldots, r)$.

A necessary condition for the existence of such a sequence gi,n is clearly that

(A) If $G = \{g_1, \ldots, g_r\}$ and $\bar{G} = \{\bar{g}_1, \ldots, \bar{g}_r\}$ are any two probability vectors such that for every $B$,

(4) $\sum_{i=1}^r g_i P_i(B) = \sum_{i=1}^r \bar{g}_i P_i(B)$,

then $G = \bar{G}$. We shall now show that (A) is also sufficient.

Denote by $\mu$ any $\sigma$-finite measure on $X$ with respect to which all the $P_i$ are absolutely continuous and such that their densities $f_i = dP_i/d\mu$ are square integrable:

(5) $\int_X f_i^2(x)\, d\mu(x) < \infty \quad (i = 1, \ldots, r)$.

(For example, we can always take $\mu = P_1 + \cdots + P_r$, since then $0 \le f_i(x) \le 1$ and hence

(6) $\int_X f_i^2(x)\, d\mu(x) \le \int_X f_i(x)\, d\mu(x) = 1$.)

The functions $f_i$ are elements of the Hilbert space $H$ over the measure space $(X, \mu)$. From (A) they are linearly independent. For if $c_1 f_1 + \cdots + c_r f_r = 0$ for some constants $c_i$ not all 0, then by renumbering the $f_i$ we can write

(7) $c_1 f_1 + \cdots + c_k f_k = c_{k+1} f_{k+1} + \cdots + c_q f_q$

with $c_1, \ldots, c_q$ all positive and $1 \le k < q \le r$. Integrating over $X$ we obtain

(8) $c_1 + \cdots + c_k = c_{k+1} + \cdots + c_q = c > 0$,


and hence

(9) $G = \{c_1/c, \ldots, c_k/c, 0, \ldots, 0\}$, $\quad \bar{G} = \{0, \ldots, 0, c_{k+1}/c, \ldots, c_q/c, 0, \ldots, 0\}$

are such that (4) holds, contradicting (A).

Now let $H_j$ denote the linear manifold spanned by the $r - 1$ functions $f_1, \ldots, f_{j-1}, f_{j+1}, \ldots, f_r$. We can then write uniquely

(10) $f_j = f_j' + f_j'' \quad (j = 1, \ldots, r)$

with

(11) $f_j' \in H_j$, $\quad f_j'' \perp H_j$, $\quad f_j'' \ne 0$.

Hence, setting

(12) $\phi_j(x) = f_j''(x) \Big/ \int_X [f_j''(x)]^2\, d\mu(x)$,

we have

(13) $\int_X \phi_j(x) f_k(x)\, d\mu(x) = 1$ if $j = k$, $\; = 0$ if $j \ne k$.

Now define

(14) $\bar{g}_{i,n} = \frac{1}{n} \sum_{v=1}^n \phi_i(x_v)$, $\quad g_{i,n} = [\bar{g}_{i,n}]^+ \Big/ \sum_{j=1}^r [\bar{g}_{j,n}]^+$,

where $[a]^+$ denotes $\max(a, 0)$. If $x_1, x_2, \ldots$ are independent random variables with the common distribution (1), their common density with respect to $\mu$ is

(15) $\sum_{j=1}^r g_j f_j(x)$,

so that by (13),

(16) $E\,\phi_i(x_v) = \int_X \phi_i(x) \sum_{j=1}^r g_j f_j(x)\, d\mu(x) = \sum_{j=1}^r g_j \int_X \phi_i(x) f_j(x)\, d\mu(x) = g_i$.

The strong law of large numbers then implies (3).
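On a finite sample space the orthogonal decomposition (10)-(12) is plain linear algebra: the $\phi_j$ of (13) are the dual basis of $\{f_i\}$ in $L^2(\mu)$, obtained by inverting the Gram matrix of the $f_i$. A minimal sketch (an illustrative addition; the two distributions, the prior, and the sample size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two assumed distributions P_1, P_2 on X = {0, 1, 2}.
P = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.3, 0.6]])
mu = P.sum(axis=0)                 # mu = P_1 + P_2, as suggested after (5)
F = P / mu                         # densities f_i = dP_i/dmu, so 0 <= f_i <= 1

# Dual basis: rows phi_j with sum_x phi_j(x) f_k(x) mu(x) = delta_{jk}, i.e. (13).
gram = (F * mu) @ F.T              # Gram matrix of the f_i in L^2(mu)
Phi = np.linalg.solve(gram, F)     # Phi = gram^{-1} F; rows are phi_1, phi_2

g_true = np.array([0.7, 0.3])      # the unknown prior to be recovered
n = 50000
xs = rng.choice(3, size=n, p=g_true @ P)    # x_v ~ sum_i g_i P_i, cf. (1)

g_bar = Phi[:, xs].mean(axis=1)    # (14); unbiased for g_i by (16)
g_hat = np.maximum(g_bar, 0.0)
g_hat /= g_hat.sum()
print(g_hat)                       # close to (0.7, 0.3), by the strong law
```

The only statistical step is the sample mean in (14); everything else is the fixed geometry of the $f_i$, which is why the estimate converges at the usual $n^{-1/2}$ rate.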

Returning now to the problem of Section 2, in which we have an action space $A$ and a loss structure $L(a, \lambda)$, here specified by functions $L(a, i)$ $(i = 1, \ldots, r)$, let us assume for simplicity that

(17) $0 \le L(a, i) \le L < \infty$ (all $a \in A$ and $i = 1, \ldots, r$).

Then (3.6) becomes

(18) $\Delta_G(a, x) = \sum_{i=1}^r [L(a, i) - L(a_0, i)] f_i(x) g_i$,


and we can set

(19) $\Delta_n(a, x) = \sum_{i=1}^r [L(a, i) - L(a_0, i)] f_i(x) g_{i,n}$,

so that

(20) $\sup_a |\Delta_n(a, x) - \Delta_G(a, x)| \le L \sum_{i=1}^r f_i(x)\, |g_i - g_{i,n}|$.

Since $f_i(x) < \infty$ for a.e. ($\mu$) $x$, it follows from (3) that with probability 1, (3.10) holds, so that $T = \{t_n\}$ defined by (3.11) is a.o. relative to every $G = \{g_1, \ldots, g_r\}$.

It would be interesting to try to extend this method of estimating $G$ to the case of a continuous parameter space, say $\Lambda = \{-\infty < \lambda < \infty\}$. One possible way is the following. Suppose for definiteness that $\lambda$ is the location parameter of a normal distribution with unit variance, so that $x_1, x_2, \ldots$ have the common density function

(21) $f_G(x) = \int_{-\infty}^{\infty} f(x - \lambda)\, dG(\lambda)$; $\quad f(x) = (2\pi)^{-1/2} e^{-x^2/2}$,

with respect to Lebesgue measure on $X = \{-\infty < x < \infty\}$. For any $n \ge 1$ let

(22) $\lambda_1^{(n)} < \lambda_2^{(n)} < \cdots < \lambda_{k_n}^{(n)}$

be constants, and let $g_{i,n}$ $(i = 1, \ldots, k_n)$ be defined by (14), where the $f_i(x)$ of (5) are replaced by $f(x - \lambda_i^{(n)})$. Consider the random distribution function

(23) $G_n(\lambda) = \sum g_{i,n}$ (sum over all $i$ such that $\lambda_i^{(n)} \le \lambda$).

Can we choose the values $k_n$ and the constants (22) for each $n$ so that, whatever be $G$,

(24) $P[G_n \to G] = 1$?
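Whether (24) can be guaranteed is left open here; purely as an illustration (every tuning choice below is an assumption), the construction (21)-(23) can at least be carried out numerically. For the unit-variance normal family the Gram integrals have the closed form $\int f(x - a) f(x - b)\, dx = e^{-(a-b)^2/4} / (2\sqrt{\pi})$, so the analogue of (14) is a single linear solve. Note that the Gram matrix becomes badly conditioned as the grid is refined, which is one concrete sign of the delicacy of the question.

```python
import numpy as np

rng = np.random.default_rng(4)

grid = np.linspace(-3.0, 3.0, 7)       # an assumed grid lambda_1 < ... < lambda_k
def f(z):
    # unit-variance normal density, as in (21)
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

# Gram matrix of the shifted densities, from the closed-form integral above.
D = grid[:, None] - grid[None, :]
gram = np.exp(-D ** 2 / 4.0) / (2.0 * np.sqrt(np.pi))

# Data from an assumed true G = (1/2) delta_{-1} + (1/2) delta_{+1} (grid points).
n = 100000
xs = rng.choice([-1.0, 1.0], size=n) + rng.standard_normal(n)

# Analogue of (14): solve gram g_bar = mean over v of the vector (f(x_v - lambda_i))_i.
Fx = f(xs[None, :] - grid[:, None])    # k x n matrix of f(x_v - lambda_i)
g_bar = np.linalg.solve(gram, Fx.mean(axis=1))
g_hat = np.maximum(g_bar, 0.0)
g_hat /= g_hat.sum()
for lam, g in zip(grid, g_hat):
    print(f"lambda = {lam:+.0f}:  mass {g:.3f}")   # mass piles up near -1 and +1
```

Because the assumed $G$ is supported on grid points, the solve recovers it exactly in expectation; off-grid priors would only be approximated, which is precisely the issue raised by (24).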

REFERENCES

[1] HANNAN, J. F. and ROBBINS, H. (1955). Asymptotic solutions of the compound decision problem for two completely specified distributions. Ann. Math. Statist. 26 37-51.

[2] JOHNS, M. V., JR. (1956). Contributions to the theory of non-parametric empirical Bayes procedures in statistics. Columbia Univ. Dissertation.

[3] JOHNS, M. V., JR. (1957). Non-parametric empirical Bayes procedures. Ann. Math. Statist. 28 649-669.

[4] JOHNS, M. V., JR. (1961). An empirical Bayes approach to non-parametric two-way classification. Studies in Item Analysis and Prediction (ed. by H. Solomon), pp. 221-232. Stanford Univ. Press.

[5] MIYASAWA, K. (1961). An empirical Bayes estimator of the mean of a normal population. Bull. Inst. Internat. Statist. 38 181-188.

[6] NEYMAN, J. (1962). Two breakthroughs in the theory of statistical decision making. Rev. Inst. Internat. Statist. 30 11-27.

[7] ROBBINS, H. (1950). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Prob. 131-148.


[8] ROBBINS, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Prob. 1 157-164.

[9] ROBBINS, H. (1963). The empirical Bayes approach to testing statistical hypotheses. To appear in Rev. Inst. Internat. Statist.

[10] SAMUEL, E. (1963). Asymptotic solutions of the sequential compound decision problem. Ann. Math. Statist. 34 1079-1094.

[11] SAMUEL, E. (1963). An empirical Bayes approach to the testing of certain parametric hypotheses. Ann. Math. Statist. 34 1370-1385.

[12] SAMUEL, E. Strong convergence of the losses of certain decision rules for the compound decision problem. Unpublished.

[13] TEICHER, H. (1961). Identifiability of mixtures. Ann. Math. Statist. 32 244-248.
