Bayes and Minimax Solutions of Sequential Decision Problems
Author(s): K. J. Arrow, D. Blackwell and M. A. Girshick
Source: Econometrica, Vol. 17, No. 3/4 (Jul.-Oct., 1949), pp. 213-244
Published by: The Econometric Society
Stable URL: http://www.jstor.org/stable/1905525
Accessed: 22/09/2014 13:46

Bayes and Minimax Solutions of Sequential Decision Problemsstatic.stevereads.com/papers_to_read/bayes_and_minimax_solutions_of... · BAYES AND MINIMAX SOLUTIONS OF SEQUENTIAL DECISION

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bayes and Minimax Solutions of Sequential Decision Problemsstatic.stevereads.com/papers_to_read/bayes_and_minimax_solutions_of... · BAYES AND MINIMAX SOLUTIONS OF SEQUENTIAL DECISION

Bayes and Minimax Solutions of Sequential Decision ProblemsAuthor(s): K. J. Arrow, D. Blackwell and M. A. GirshickSource: Econometrica, Vol. 17, No. 3/4 (Jul. - Oct., 1949), pp. 213-244Published by: The Econometric SocietyStable URL: http://www.jstor.org/stable/1905525 .

Accessed: 22/09/2014 13:46

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp

.JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

.

The Econometric Society is collaborating with JSTOR to digitize, preserve and extend access to Econometrica.

http://www.jstor.org

This content downloaded from 192.80.65.116 on Mon, 22 Sep 2014 13:46:53 PMAll use subject to JSTOR Terms and Conditions

Page 2: Bayes and Minimax Solutions of Sequential Decision Problemsstatic.stevereads.com/papers_to_read/bayes_and_minimax_solutions_of... · BAYES AND MINIMAX SOLUTIONS OF SEQUENTIAL DECISION

BAYES AND MINIMAX SOLUTIONS OF SEQUENTIAL DECISION PROBLEMS¹

BY K. J. ARROW, D. BLACKWELL, M. A. GIRSHICK

The present paper deals with the general problem of sequential choice among several actions, where at each stage the options available are to stop and take a definite action or to continue sampling for more information. There are costs attached to taking inappropriate action and to sampling. A characterization of the optimum solution is obtained first under very general assumptions as to the distribution of the successive observations and the costs of sampling; then more detailed results are given for the case where the alternative actions are finite in number, the observations are drawn under conditions of random sampling, and the cost depends only on the number of observations. Explicit solutions are given for the case of two actions, random sampling, and linear cost functions.

CONTENTS

PAGE
Summary ........................................................... 213
1. Construction of Bayes Solutions ................................ 216
   The Decision Function ......................................... 216
   The Best Truncated Procedure .................................. 217
   The Best Sequential Procedure ................................. 219
2. Bayes Solutions for Finite Multi-Valued Decision Problems ...... 220
   Statement of the Problem ...................................... 221
   Structure of the Optimum Sequential Procedure ................. 221
3. Optimum Sequential Procedure for a Dichotomy When the Cost Function Is Linear ... 224
   A Method for Determining $\underline{g}$ and $\bar{g}$ ........ 226
   Exact Values of $\underline{g}$ and $\bar{g}$ for a Special Class of Double Dichotomies ... 228
4. Multi-Valued Decisions and the Theory of Games ................. 230
   Examples of Dichotomies ....................................... 230
   Examples of Trichotomies ...................................... 234
5. Another Optimum Property of the Sequential Probability Ratio Test 240
6. Continuity of the Risk Function of the Optimum Test ............ 242
References ........................................................ 244

SUMMARY

THE PROBLEM of statistical decisions has been formulated by Wald [1] as follows: the statistician is required to choose some action a from a class A of possible actions. He incurs a loss L(u, a), a known bounded function of his action a and an unknown state u of Nature. What is the best action for the statistician to take?

If $u$ is a chance variable, not necessarily numerical, with a known a priori distribution, then $E\,L(u, a) = R(a)$ is the expected loss from action $a$, and any action, or randomized mixture of actions, which minimizes $R(a)$ has been called by Wald a Bayes solution of the decision problem, corresponding to the given a priori distribution of $u$.

Now suppose there is a sequence $x$ of chance variables $x_1, x_2, \dots$, whose joint distribution is determined by $u$. Instead of choosing an action immediately, the statistician may decide to select a sample of $x$'s, as this will yield partial information about $u$, enabling him to make a wiser selection of $a$. There will be a cost $c_N(x)$ of obtaining the sample $x_1, \dots, x_N$ and, in choosing a sampling procedure, the statistician must balance the expected cost against the expected amount of information to be obtained.

Formally, the possibility of making observations leaves the situation unchanged, except that the class $A$ of possible actions for the statistician has been extended. His action now consists of choosing a sampling procedure $T$ and a decision function $D$ specifying what action $a$ will be taken for each possible result of the experiment. The expected loss is now $R(T, D) = l(T, D) + c(T)$, where $l(T, D)$ is the expected value of $L(u, a)$ for the specified sampling procedure and decision rule, and $c(T)$ is the expected cost of the sampling procedure. A Bayes solution is now a pair $(T, D)$, or randomized mixture of pairs $(T, D)$, for which $R(T, D)$ assumes its minimum value.

The minimizing $T = T^*$ has been implicitly characterized by Wald, and may be described by the rule: at each stage, take another observation if and only if there is some sequential continuation which reduces the expected risk below its present level. The main difficulty here is that various quantities which arise are not obviously measurable: for instance if the first observation is $x_1$, we must compare our present risk level, say $w_1(x_1)$, with $z(x_1) = \inf w(x_1, T, D)$, where $w(x_1, T, D)$ is the expected risk for any possible continuation $(T, D)$; we take another observation if and only if $w_1 > z$. It is not a priori clear that $z$ will be a measurable function of $x_1$, so that the set of points $x_1$ for which we stop may not be measurable.² Actually, $z$ always is measurable, as we shall show.

A characterization of the minimizing $T = T^*$ is obtained for hypotheses involving a finite number of alternatives under the condition of random sampling. It consists of the following: We are given $k$ hypotheses $H_i$ $(i = 1, 2, \dots, k)$ which have an a priori probability $g_i$ of occurring, a risk matrix $W = (w_{ij})$ where $w_{ij}$ represents the loss incurred in choosing $H_j$ when $H_i$ is true, and a function $c(n)$ which represents the cost of taking $n$ observations. It is shown that for each sample size $N$, there exist $k$ convex regions $S^*_j$ in the $(k - 1)$-dimensional simplex spanned by the unit vectors in Euclidean $k$-space whose boundaries depend on the hypotheses $H_i$, the risk matrix $W$, and the cost function $c_N(n) = c(N + n) - c(N)$. These regions have the property that if the vector $\bar g^{(N)}$, whose components represent the a posteriori probability distribution of the $k$ hypotheses, lies in $S^*_j$, the best procedure is to accept $H_j$ without further experimentation. However, if $\bar g^{(N)}$ lies in the complement of $\sum_{j=1}^{k} S^*_j$, the best procedure is to continue taking observations. At any stage, the decision whether to continue or terminate sampling is uniquely determined by this sequence of $k$ regions and moreover this sequence of regions completely characterizes $T^*$.

¹ This article will be reprinted as Cowles Commission Paper, New Series, No. 40. The research for this paper was carried on at The RAND Corporation, a nonprofit research organization under contract with the United States Air Force. It was presented at a joint meeting of the Econometric Society and the Institute of Mathematical Statistics at Madison, Wisconsin, September 9, 1948, under the title, "Statistics and the Theory of Games."

Many of the results in this paper overlap with those obtained previously by A. Wald and J. Wolfowitz [4], and also with some prior unpublished results of Wald and Wolfowitz, announced by Wald at the meeting of the Institute of Mathematical Statistics at Berkeley, California, June 22, 1948. Sections 3 and 6 of the present paper contain analogues of Lemmas 1-4 of [4], though both the statements and the proofs differ because of the generally different approach. The proof that the sequential probability ratio test of a dichotomy minimizes the expected number of observations under either hypothesis, in Section 5 of the present paper, follows from Section 3 in the same way that the proof of the same theorem follows from Lemmas 1-8 in [4]. The previously mentioned unpublished results of Wald and Wolfowitz include the main result of Section 2 (structure of the optimum sequential procedure for the finite multi-decision problem) in the special case of linear cost functions.
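The a posteriori vector described above evolves by an ordinary Bayes update after each observation. A minimal sketch (our illustration, not code from the paper; the function name and the use of density values as inputs are assumptions):

```python
def posterior(g, f):
    """One Bayes update of the a priori vector g = (g_1, ..., g_k):
    each g_i is multiplied by the density value f_i(x) of the new
    observation under hypothesis H_i, then the vector is renormalized."""
    unnorm = [gi * fi for gi, fi in zip(g, f)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# e.g. two hypotheses with equal priors; the observation is twice as
# likely under H_2, so the posterior shifts to (1/3, 2/3)
g1 = posterior([0.5, 0.5], [0.2, 0.4])
```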

A method for determining the boundaries of these convex regions is given for k = 2 (dichotomy) when the cost function is linear. It is shown that in this special case, T* coincides with Wald's sequential probability ratio test.

The minimax solution to multi-valued decision problems is considered and methods are given for obtaining it for dichotomies. It is shown that in general, the minimax strategy for the statistician is pure, except when the hypotheses involve discrete variates. In the latter case, mixed strategies will be the rule rather than the exception.

Examples of double dichotomies, binomial dichotomies, and trichotomies are given to illustrate the construction of T* and the notion of minimax solutions.

It may be remarked that the problem of optimum sequential choice among several actions is closely allied to the economic problem of the rational behavior of an entrepreneur under conditions of uncertainty. At each point in time, the entrepreneur has the choice between entering into some imperfectly liquid commitment and holding part or all of his funds in cash pending the acquisition of additional information, the latter being costly because of the foregone profits.

² The possibility of nonmeasurability is not considered in [1] or [4].


1. CONSTRUCTION OF BAYES SOLUTIONS

The Decision Function

We have seen that the statistician must choose a pair (T, D). It turns out that the choice of D is independent of that of T:

LEMMA: There is a fixed sequence of decision functions $D_m$ such that
$$(1.1)\quad R(T, D_m) \to \inf_D R(T, D) = w(T) \quad\text{for all } T.$$

This will be the main result of this section. It follows that the expected loss from a procedure $T$ may be taken as $w(T)$, since this loss may be approximated to arbitrary accuracy by appropriate choice of $D_m$, and a best sequential procedure $T^*$ of a given class will be one for which $w(T^*) = \inf w(T)$, where the inf is taken over all procedures $T$ of the class under consideration.

We are considering, then, a chance variable $u$ and a sequence $x$ of chance variables $x_1, x_2, \dots$. A sequential procedure $T$ is a sequence of disjoint sets $S_0, S_1, \dots, S_N, \dots$, where $S_N$ depends only on $x_1, \dots, x_N$ and is the event that the sampling procedure terminates with the sample $x_1, \dots, x_N$; we require that $\sum_{N=0}^{\infty} P(S_N) = 1$. $S_0$ is the event that we do not sample at all, but take some action immediately; it will have probability either 0 or 1.

A decision function $D$ is a sequence of functions $d_0, d_1(x_1), \dots, d_N(x_1, \dots, x_N), \dots$, where each $d_N$ assumes values in $A$, and specifies the action taken when sampling terminates with $x_1, \dots, x_N$. We admit only decision functions $D$ such that $L[u, d_N(x)]$ is for each $N$ a measurable function.

PROOF OF LEMMA: The loss from $(T, D)$ is $G(u, x; T, D) = L[u, d_N(x)] + c_N(x)$ for $x \in S_N$, and $E\,G = R(T, D)$. Here, $c_N(x)$ depends only on $x_1, \dots, x_N$. Then, denoting by $E_N$ the conditional expectation given $x_1, \dots, x_N$, we have $E_N G = E_N L[u, d_N(x)] + c_N(x)$ for $x \in S_N$, and
$$(1.2)\quad R(T, D) = \sum_{N=0}^{\infty} \int_{S_N} E_N L(u, d_N)\,dP + c(T).$$

Now fix $N$; we shall show that we can choose a sequence of functions $d_{Nm}(x)$, $m = 1, 2, \dots$, such that

(a) $E_N L(u, d_{Nm}) \ge E_N L(u, d_{N,m+1})$ for all $x$;

(b) $E_N L(u, d_N) \ge r_N$ for all $d_N$ and all $x$, where
$$r_N(x) = \lim_{m\to\infty} E_N L(u, d_{Nm});$$

(c) $r_N \ge E_N r_n$ if $n \ge N$.


First choose a sequence $d'_{Nm}$ such that
$$E\,L(u, d'_{Nm}) \to \inf_{d_N} E\,L(u, d_N) = \bar r_N.$$
Now define $d_{Nm}$ inductively as follows: $d_{N1} = d'_{N1}$; $d_{Nm} = d'_{Nm}$ for those values of $x$ such that $E_N L(u, d'_{Nm}) < E_N L(u, d_{N,m-1})$, otherwise $d_{Nm} = d_{N,m-1}$. Then certainly (a) holds, so that $\lim_{m\to\infty} E_N L(u, d_{Nm}) = r_N(x)$ exists. Also $E_N L(u, d_{Nm}) \le E_N L(u, d'_{Nm})$, so that $E\,r_N = \bar r_N$. Choose any $d_N$ and any $\delta > 0$, and let $S$ be the event $\{E_N L(u, d_N) < r_N(x) - \delta\}$. Then, defining $d^*_m = d_N$ on $S$, $d^*_m = d_{Nm}$ elsewhere, we have
$$E\,L(u, d^*_m) \le \int_S r_N(x)\,dP + \int_{CS} E_N L(u, d_{Nm})\,dP - \delta P(S),$$
so that $\lim_m E\,L(u, d^*_m) \le \bar r_N - \delta P(S)$, and $P(S) = 0$. This establishes

(b). Finally, (c) follows from the fact that every $d_N(x)$ is also a possible $d_n(x)$ if $n > N$. This means that, defining $d_n = d_N$, we have $E_N L(u, d_N) = E_N[E_n L(u, d_n)] \ge E_N r_n$ for all $d_N$, and consequently (c) holds.

Now define $D_m = \{d_{Nm}\}$. Since $E_N L(u, d_{Nm})$ decreases with $m$, (1.2) yields that
$$R(T, D_m) \to \sum_{N=0}^{\infty} \int_{S_N} r_N(x)\,dP + c(T) = w(T),$$
and, using (b), that $R(T, D) \ge w(T)$ for all $D$. Thus we have reduced the problem of finding Bayes solutions to the following: we are given a sequence $x$ of chance variables $x_1, x_2, \dots$ and a sequence of nonnegative expected loss functions $w_0, w_1, \dots$, where $w_N = r_N(x_1, \dots, x_N) + c_N(x_1, \dots, x_N)$. $c_N$ is the cost of the first $N$ observations, and $r_N$ is the loss due to incomplete information. With each sequential procedure $T = \{S_N\}$ there is associated a risk
$$w(T) = \sum_N \int_{S_N} w_N(x)\,dP.$$

How can T be chosen so that w(T) is as small as possible?

The Best Truncated Procedure

Among all sequential procedures not requiring more than $N$ observations, there turns out to be a best, i.e., one whose expected risk does not exceed that of any other. Moreover, the procedure can be explicitly described, by induction backwards, in such a way that its measurability is clear. After $N - 1$ observations $x_1, \dots, x_{N-1}$, we compare the present risk $w_{N-1}$ with the conditional expected risk $E_{N-1} w_N$ if we take the final observation. Thus, by choosing the better course, we can limit our loss to $\alpha_{N-1} = \min(w_{N-1}, E_{N-1} w_N)$, which may be considered as the attainable risk with the observations $x_1, \dots, x_{N-1}$. We can then decide, on the basis of $N - 2$ observations, whether the $(N-1)$st is worth taking by comparing the present risk, $w_{N-2}$, with $E_{N-2}\,\alpha_{N-1}$, the attainable risk if $x_{N-1}$ is observed. Continuing backwards, we obtain at each stage an expected attainable risk $\alpha_k$ for the observations $x_1, \dots, x_k$, and a description of how to attain this risk, i.e., of when to take another observation. This is formalized in the following:

THEOREM: Let $x_1, \dots, x_N$; $w_0, \dots, w_N$ be any chance variables, $w_i = w_i(x_1, \dots, x_i)$. Define $\alpha_N = w_N$, $\alpha_j = \min(w_j, E_j\,\alpha_{j+1})$ for $j < N$, $S_j = \{w_i > \alpha_i$ for $i < j$, $w_j = \alpha_j\}$. Then for any disjoint events $B_0, \dots, B_N$, $B_i$ depending only on $x_1, \dots, x_i$, $\sum_{i=0}^{N} P(B_i) = 1$, we have
$$\sum_{j=0}^{N} \int_{S_j} w_j\,dP \le \sum_{i=0}^{N} \int_{B_i} w_i\,dP.$$

PROOF: We shall show that, for fixed $i$ and any $(x_1, \dots, x_i)$-set $A$,
$$(1.3)\quad \sum_{j \ge i} \int_{AS_j} \alpha_i\,dP = \sum_{j \ge i} \int_{AS_j} \alpha_j\,dP,$$
and that, for fixed $j$, and any disjoint sets $A_j, \dots, A_N$ with $A_i$ depending only on $x_1, \dots, x_i$ and $A_{j+1} + \cdots + A_N$ depending only on $x_1, \dots, x_j$,
$$(1.4)\quad \int_{A_j + \cdots + A_N} \alpha_j\,dP \le \sum_{i \ge j} \int_{A_i} \alpha_i\,dP.$$
Choosing $A = B_i$ in (1.3) and summing over $i$, choosing $A_i = B_i S_j$ in (1.4) and summing over $j$, and adding the results yields
$$(1.5)\quad \sum_{i,j=0}^{N} \int_{B_iS_j} \alpha_j\,dP \le \sum_{i,j=0}^{N} \int_{B_iS_j} \alpha_i\,dP.$$
Now on $S_j$, $\alpha_j = w_j$, and always $\alpha_i \le w_i$. Making these replacements in (1.5) yields the theorem.

We now prove (1.3) and (1.4). The relationship (1.3) is clear for $i = N$; for $i < N$,
$$\sum_{j \ge i} \int_{AS_j} \alpha_i\,dP = \int_{AS_i} \alpha_i\,dP + \int_{A(S_{i+1}+\cdots+S_N)} \alpha_i\,dP.$$
But on $S_{i+1} + \cdots + S_N$, $\alpha_i = E_i\,\alpha_{i+1}$; making this replacement in the final integral and using induction backward on $i$ completes the proof. The relationship (1.4) is clear for $j = N$; for $j < N$,
$$\sum_{i \ge j} \int_{A_i} \alpha_i\,dP = \int_{A_j} \alpha_j\,dP + \sum_{i \ge j+1} \int_{A_i} \alpha_i\,dP \ge \int_{A_j} \alpha_j\,dP + \int_{A_{j+1}+\cdots+A_N} \alpha_{j+1}\,dP \ge \int_{A_j+\cdots+A_N} \alpha_j\,dP,$$
where the last inequality is obtained from the fact that $A_{j+1} + \cdots + A_N$ depends only on $x_1, \dots, x_j$ and always $\alpha_j \le E_j\,\alpha_{j+1}$. An induction backward on $j$ now completes the proof of (1.4).
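On a finite probability space the recursion $\alpha_N = w_N$, $\alpha_j = \min(w_j, E_j\,\alpha_{j+1})$ can be carried out mechanically. A numerical sketch of this backward induction (our own construction, not from the paper; it assumes i.i.d. equally likely outcomes so that $E_j$ is an average over the next coordinate, and the toy risk functions are hypothetical):

```python
from itertools import product

def backward_risks(w, N, outcomes=(0, 1)):
    """w[j] maps a j-tuple of outcomes to the risk w_j after j observations.
    Returns alpha[j]: a dict from j-tuples to the attainable risk, computed
    by alpha_N = w_N and alpha_j = min(w_j, E_j alpha_{j+1})."""
    alpha = {N: {x: w[N](x) for x in product(outcomes, repeat=N)}}
    for j in range(N - 1, -1, -1):
        alpha[j] = {}
        for x in product(outcomes, repeat=j):
            # E_j: average of alpha_{j+1} over the equally likely next outcome
            cont = sum(alpha[j + 1][x + (o,)] for o in outcomes) / len(outcomes)
            alpha[j][x] = min(w[j](x), cont)
    return alpha

# toy instance: stopping now costs 0.6; after one observation the risk is
# 0.2 (x_1 = 0) or 0.8 (x_1 = 1), so alpha_0 = min(0.6, 0.5) = 0.5:
# the observation is worth taking
w = {0: lambda x: 0.6, 1: lambda x: 0.2 if x == (0,) else 0.8}
alpha = backward_risks(w, 1)
```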

The Best Sequential Procedure

We are given now a sequence of functions $w_0, w_1, \dots, w_N, \dots$, where $w_N = r_N(x_1, \dots, x_N) + c_N(x_1, \dots, x_N)$. The sequence $r_N(x)$ is uniformly bounded, since we supposed the original loss function $L(u, a)$ to be bounded, and we have shown that $r_N \ge E_N r_n$ for $n > N$. We shall suppose that $c_N(x)$ is a nondecreasing sequence, $c_N(x) \to \infty$ as $N \to \infty$ for all $x$. We now construct a best sequential procedure.³

The best sequential procedure is obtained as a limit of the best trun- cated procedures given in the preceding section.

We first define $\alpha_{NN} = w_N$, $\alpha_{jN} = \min(w_j, E_j\,\alpha_{j+1,N})$, $S_{jN} = \{w_i > \alpha_{iN}$ for $i < j$, $w_j = \alpha_{jN}\}$. For fixed $j$, $\alpha_{jN}$ is a decreasing sequence of functions; say $\alpha_{jN} \to \alpha_j$ as $N \to \infty$. Then $\alpha_j = \min(w_j, E_j\,\alpha_{j+1})$. Define $S_j = \{w_i > \alpha_i$ for $i < j$, $w_j = \alpha_j\}$. We shall prove that $T^* = \{S_j\}$ is a best sequential procedure, i.e., $T^*$ is a sequential procedure, and for any sequential procedure $T = \{B_j\}$,
$$w(T^*) = \sum_{j=0}^{\infty} \int_{S_j} w_j\,dP \le \sum_{j=0}^{\infty} \int_{B_j} w_j\,dP = w(T).$$

Now
$$\int_{B_N+B_{N+1}+\cdots} w_N\,dP \le \sum_{i \ge N} \int_{B_i} w_i\,dP + M P(B_{N+1} + \cdots),$$
where $M$ is the uniform upper bound of $r_1(x), r_2(x), \dots$. Thus
$$w(T_N) = \sum_{i=0}^{N-1} \int_{B_i} w_i\,dP + \int_{B_N+B_{N+1}+\cdots} w_N\,dP \le w(T) + M P(B_{N+1} + \cdots),$$
so that $w(T_N) \to w(T)$, where $T_N$ is the truncated test $B_0, \dots, B_{N-1}, B_N + B_{N+1} + \cdots$. From the preceding section,
$$\sum_{j=0}^{K} \int_{S_{jN}} w_j\,dP \le w(T_N)$$
for all $K$. Then

³ The assumption made here is somewhat weaker than Condition 6 in [1], p. 297. The only other assumption made, that $L(u, a)$ is bounded, is Condition 1 in [1], p. 297.


$$\sum_{j=0}^{K} \int_{S_j} w_j\,dP \le w(T),$$
letting $N \to \infty$, and using Lebesgue's convergence theorem and the easily verified fact that the characteristic function of $S_{jN}$ approaches that of $S_j$; and $w(T^*) \le w(T)$.

It remains to prove that $T^*$ really is a sequential test, i.e.,
$$\sum_{N=0}^{\infty} P(S_N) = 1.$$
Write $A_N = C(S_0 + \cdots + S_N)$, $\prod_{N=1}^{\infty} A_N = A$; we show that $P(A) = 0$.

It is easily verified by induction that $\alpha_{jN} \ge c_j$ for all $N$. The relation (1.3), with $i = 0$, $A$ = sample space, yields that
$$\alpha_{0N} \ge \int_{C(S_{0N}+\cdots+S_{mN})} c_m\,dP$$
for all $m < N$. Then
$$\alpha_0 \ge \int_{A_m} c_m\,dP \ge \int_A c_m\,dP$$
for all $m$. Since $c_m \to \infty$, $P(A) = 0$.

We now prove that $w(T^*) = \alpha_0$. If $T^*_N$ denotes the truncated test $S_0, S_1, \dots, S_N + S_{N+1} + \cdots$, the proof that $w(T_N) \to w(T)$ shows that $w(T^*_N) \to w(T^*)$. Also (1.3), with $i = 0$, $A$ = sample space, shows that
$$\alpha_{0N} = \sum_{j=0}^{N} \int_{S_{jN}} w_j\,dP.$$
Since $\{S_{0N}, \dots, S_{NN}\}$ is the best of all procedures truncated at $N$, and $T^*$ is the best of all sequential procedures, $w(T^*) \le \alpha_{0N} \le w(T^*_N)$. Letting $N \to \infty$ yields $w(T^*) = \alpha_0$.

Now $S_0 = \{w_0 = \alpha_0\}$; i.e., the best procedure $T^*$ is to take no observations if and only if there is no sequential procedure which reduces the risk below its present level. This remark, which identifies our procedure with that characterized by Wald, at least at the initial stage, will be useful in the next section.

2. BAYES SOLUTIONS FOR FINITE MULTI-VALUED DECISION PROBLEMS

In this section we shall seek a characterization of the optimum sequential procedure developed in Section 1 in cases where the number of alternative hypotheses is finite. It will be shown that the optimum sequential test for a $k$-valued decision problem is completely defined by $k$ (or a sequence of $k$) convex regions in a $(k - 1)$-dimensional simplex spanned by the unit vectors. No procedure has yet been developed for determining the boundaries of these regions in the general case. However, for $k = 2$ (dichotomy) and for a linear cost function, a method for determining the two boundaries has been found and the optimum test is shown to be the sequential probability ratio test developed by Wald [2].

Statement of the Problem

We are given $k$ hypotheses $H_1, H_2, \dots, H_k$ where each $H_i$ is characterized by a probability measure $u_i$ defined over an $R$-dimensional sample space $E_R$ and has an a priori probability $g_i$ of occurring. We are also given a risk matrix $W = (w_{ij})$ $(i, j = 1, 2, \dots, k)$, where $w_{ij}$ is a nonnegative real number and represents the loss incurred in accepting the hypothesis $H_j$ when in fact $H_i$ is true. (We shall assume that $w_{ii} = 0$ for all $i$. This is based on the supposition, which appears reasonable, that the decision maker is not to be penalized for selecting the correct alternative, no matter how unpleasant its consequences may be.) In addition to the risk matrix $(w_{ij})$ we shall assume that the cost of experimentation depends only on the number $(n)$ of observations taken and is given by a function $c(n)$ which approaches infinity as $n$ approaches infinity. The problem is to characterize the procedure for deciding on one of the $k$ alternative hypotheses which results in a minimum average risk. This risk is defined as the average cost of taking observations plus the average loss resulting from erroneous decisions.

Structure of the Optimum Sequential Procedure

Let $G_k$ stand for the convex set in the $k$-dimensional space defined by the vectors $\bar g = (g_1, g_2, \dots, g_k)$ with components $g_i \ge 0$ and $\sum_{i=1}^{k} g_i = 1$; and let $H = (H_1, H_2, \dots, H_k)$ represent the $k$ hypotheses under consideration. Then every vector $\bar g$ in $G_k$ may be considered as a possible a priori probability distribution of $H$.

For any $\bar g$ in $G_k$ and for any sequential procedure $T$ (see definition in Section 1), let $R(\bar g \mid T)$ represent the average risk entailed in using the test $T$ when the a priori distribution of $H$ is $\bar g$. Then
$$(2.1)\quad R(\bar g \mid T) = \sum_{i=1}^{k} g_i\,E_i[c(n) \mid T] + \sum_{i=1}^{k} \sum_{j=1}^{k} g_i w_{ij} P_{ij}(T),$$
where $E_i[c(n) \mid T]$ is the average cost of observations when the sequential test $T$ is used and $H_i$ is true, and $P_{ij}(T)$ is the probability that the sequential test $T$ will result in the acceptance of $H_j$ when $H_i$ is true. The


risk involved in accepting the hypothesis $H_j$ prior to taking any observations will be designated by $R_j$ $(j = 1, 2, \dots, k)$ and is given by
$$(2.2)\quad R_j = \sum_{i=1}^{k} g_i w_{ij}.$$
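Equation (2.2) is directly computable. A small sketch (our illustration; here `W` is indexed so that `W[i][j]` is $w_{ij}$, the loss from accepting $H_j$ when $H_i$ is true):

```python
def stopping_risks(g, W):
    """R_j = sum_i g_i * w_ij: the average risk of accepting H_j outright
    when the a priori vector is g = (g_1, ..., g_k)."""
    k = len(g)
    return [sum(g[i] * W[i][j] for i in range(k)) for j in range(k)]

# two hypotheses: accepting H_2 under H_1 costs 1, accepting H_1 under H_2
# costs 2; with equal priors, accepting H_2 (risk 0.5) is the cheaper stop
R = stopping_risks([0.5, 0.5], [[0, 1], [2, 0]])  # R_1 = 1.0, R_2 = 0.5
```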

We now define $k$ subsets $S^*_1, S^*_2, \dots, S^*_k$ of $G_k$ as follows: A vector $\bar g$ of $G_k$ will be said to belong to $S^*_j$ if (a) $\min(R_1, R_2, \dots, R_k) = R_j$ and (b) $R(\bar g \mid T) \ge R_j$ for all $T$. We observe that since the unit vector with 1 in the $j$th component belongs to $S^*_j$, the subsets $S^*_j$ are nonempty. We now prove the following

THEOREM: The sets $S^*_j$ are convex. That is, if $\bar g_1$ and $\bar g_2$ belong to $S^*_j$ so does $a\bar g_1 + (1 - a)\bar g_2$ for all $a$, $0 \le a \le 1$.

PROOF: Assume the contrary. Then there exists a sequential procedure $T$ such that, writing $\bar g = a\bar g_1 + (1 - a)\bar g_2$,
$$(2.3)\quad R(\bar g \mid T) = \sum_{i=1}^{k} \bar g_i\,E_i[c(n) \mid T] + \sum_{i=1}^{k} \sum_{l=1}^{k} \bar g_i w_{il} P_{il}(T) < \sum_{i=1}^{k} \bar g_i w_{ij}.$$
But by definition, if either $\bar g_1$ or $\bar g_2$ represents the a priori distribution of the hypotheses $H$, we must have for all sequential procedures and hence for $T$,
$$(2.4)\quad R(\bar g_1 \mid T) = \sum_{i=1}^{k} g_{1i}\,E_i[c(n) \mid T] + \sum_{i=1}^{k} \sum_{l=1}^{k} g_{1i} w_{il} P_{il}(T) \ge \sum_{i=1}^{k} g_{1i} w_{ij},$$
and
$$(2.5)\quad R(\bar g_2 \mid T) = \sum_{i=1}^{k} g_{2i}\,E_i[c(n) \mid T] + \sum_{i=1}^{k} \sum_{l=1}^{k} g_{2i} w_{il} P_{il}(T) \ge \sum_{i=1}^{k} g_{2i} w_{ij}.$$
If we now multiply (2.4) by $a$ and (2.5) by $(1 - a)$ and add, we see that the resulting expression contradicts (2.3). This proves the theorem.⁴

It is easily seen that for given hypotheses $H$ the shape of the convex regions $S^*_j$ will depend on the cost function $c(n)$ and the risk matrix $W$. Thus if the cost of taking a single observation were prohibitive, the region $S^*_j$ in $G_k$ would simply consist of all vectors $\bar g$ for which $\min(R_1, R_2, \dots, R_k) = R_j$. On the other hand, if the cost of taking observations were negligible and the risk of making an erroneous decision large, the regions $S^*_j$ would shrink to the vertices of the polyhedron $G_k$. To exhibit the dependence of the regions $S^*_j$ on $H$, $c(n)$, and $W$, we shall use the symbol $S^*_j[H, c(n), W]$. We shall also use the symbol $\bar S^*[H, c(n), W]$ to represent the region consisting of all vectors $\bar g$ in $G_k$ which belong to the complement of $\sum_{j=1}^{k} S^*_j[H, c(n), W]$.

⁴ A similar proof shows the convexity of the corresponding regions in cases where the number of alternatives is infinite.

We now define $c_N(n) = c(N + n) - c(N)$ for all $N = 0, 1, 2, \dots$. Thus $c_N(n)$ represents the cost of taking $n$ observations when $N$ observations have already been taken.

We shall now show that, for random sampling, the problem of characterizing the optimum sequential procedure $T^*$ for a given $H$, $c(n)$, and $W$ reduces itself to the problem of finding the boundaries of the regions $S^*_j[H, c_N(n), W]$ for all $N$. The truth of this can be seen from the following considerations:

We are initially given a vector $\bar g$ in $G_k$ as the a priori distribution of the hypotheses $H$. Initially we are also given a matrix $W$ and a function $c(n) = c_0(n)$. Now assume we have taken $N$ independent observations $(N = 0, 1, 2, \dots)$. These $N$ observations transform the initial state into one in which (a) the vector $\bar g$ goes into a vector $\bar g^{(N)}$ in $G_k$ where each component $g_i^{(N)}$ of $\bar g^{(N)}$ represents the new a priori probability of the hypothesis $H_i$ (i.e., the a posteriori probability of $H_i$ given the values of the $N$ observations), (b) the risk matrix $W$ remains unchanged, and (c) the cost function $c(n)$ goes into the function $c_N(n)$.

Assume now that the boundaries of the regions $S^*_j[H, c_N(n), W]$ are known for each $j$ and $N$. Then, if we take the observations in sequence, we can determine at each stage $N$ $(N = 0, 1, 2, \dots)$ in which of the $k + 1$ regions the vector $\bar g^{(N)}$ lies. If $\bar g^{(N)}$ lies in $\bar S^*[H, c_N(n), W]$, then, by definition of this region, there exists a sequential test $T$ which, if performed from this stage on, would result in a smaller average risk than the risk of stopping at this stage and accepting the hypothesis corresponding to the smallest of the quantities $R_j^{(N)} = \sum_{i=1}^{k} g_i^{(N)} w_{ij}$ $(j = 1, 2, \dots, k)$. But it has been shown in Section 1 that if any sequential test $T$ is worth performing, the optimum test $T^*$ is also worth performing. Now $T$ will coincide with $T^*$ for at least one additional observation. But when that observation is taken $\bar g^{(N)}$ will become $\bar g^{(N+1)}$ and $c_N(n)$ will become $c_{N+1}(n)$. Again if $\bar g^{(N+1)}$ lies in $\bar S^*[H, c_{N+1}(n), W]$, the same argument will show that it is worth taking another observation. However, if $\bar g^{(N+1)}$ lies in $S^*_j[H, c_{N+1}(n), W]$ for some $j$, it implies that there exists no sequential test $T$ which is worth performing and hence the optimum procedure is to stop sampling and accept $H_j$.

Thus we see that the optimum sequential test $T^*$ is identical with the following procedure: Let $N = 0, 1, 2, \dots$, represent the number of observations taken in sequence. For each value of $N$ we compute the vector $\bar g^{(N)}$ representing the a posteriori probabilities of the hypotheses $H$. As long as $\bar g^{(N)}$ lies in $\bar S^*[H, c_N(n), W]$ we take another observation. We stop sampling and accept $H_j$ $(j = 1, 2, \dots, k)$ as soon as $\bar g^{(N)}$ falls in the region $S^*_j[H, c_N(n), W]$.
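This procedure can be sketched as a loop, with the regions supplied as a predicate. A hedged sketch (our code, not the paper's; the region test `classify` stands in for the generally unknown boundaries of the stopping and continuation regions, and the threshold values in the example are arbitrary):

```python
def optimal_sequential_test(g, xs, densities, classify):
    """g: a priori vector over the k hypotheses; xs: the observations in
    sequence; densities(x) -> (f_1(x), ..., f_k(x)); classify(g, N) -> j
    (1-based: accept H_j) or None (g lies in the continuation region)."""
    N = 0
    for x in xs:
        j = classify(g, N)
        if j is not None:
            return j, N                  # stop: accept H_j after N observations
        u = [gi * fi for gi, fi in zip(g, densities(x))]  # Bayes update
        s = sum(u)
        g = [v / s for v in u]
        N += 1
    return classify(g, N), N

# illustrative dichotomy: stop when the posterior of H_1 leaves (0.1, 0.9);
# observations are Bernoulli(0.8) under H_1, Bernoulli(0.2) under H_2
classify = lambda g, N: 1 if g[0] >= 0.9 else (2 if g[0] <= 0.1 else None)
densities = lambda x: (0.8 if x == 1 else 0.2, 0.2 if x == 1 else 0.8)
result = optimal_sequential_test([0.5, 0.5], [1, 1, 1], densities, classify)
```

With three successes in a row the posterior of $H_1$ passes 0.9 after two observations, so the loop accepts $H_1$ at $N = 2$.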


We have as yet no general method for determining the boundaries of $S^*_j[H, c_N(n), W]$ for arbitrary $H$, $c_N(n)$, and $W$. However, in the case of a dichotomy $(k = 2)$ and a linear cost function, such a method has been found and will be discussed here in detail. We shall also give below some illustrative examples of the optimum sequential test for trichotomies $(k = 3)$.

3. OPTIMUM SEQUENTIAL PROCEDURE FOR A DICHOTOMY WHEN THE COST FUNCTION IS LINEAR

We are given two alternative hypotheses H1 and H2 which, for the sake of simplicity, we assume are characterized respectively by two probability densities f1(x) and f2(x) of a random vector X in an R-dimensional Euclidean space. (If X is discrete, f1(x) and f2(x) will represent the probability under the respective hypotheses that X = x.) We assume that the a priori probability of H1 is g and that of H2 is 1 - g, where g is known. (Later we shall show how to construct the minimax sequential procedure whose average risk is independent of g.) We are also given two nonnegative numbers w12 and w21, where wij (i ≠ j; i, j = 1, 2) represents the loss incurred in accepting Hj when Hi is true. In addition we shall assume that the cost per observation is a constant c which, by a suitable change in scale, can be taken as unity. We also assume that the observations taken during the course of the experiment are independent. We define P1n = Π_{i=1}^{n} f1(xi) and P2n = Π_{i=1}^{n} f2(xi), where x1, x2, ..., xn represent the first, second, etc., observation.

If we apply the discussion of Section 2 to the dichotomy under consideration, we see that the convex regions Si* (i = 1, 2) reduce themselves to two intervals, I1 and I2, where I1 consists of points g such that 0 ≤ g ≤ g̲ and I2 consists of points g such that ḡ ≤ g ≤ 1, where g̲ < ḡ. Moreover, in view of the assumption of constant cost of observations, the boundaries g̲ and ḡ of these two intervals are independent of the number of observations taken but depend only on w12 and w21 (and, of course, c, which is taken as unity).5

The intervals I1 and I2 have the following properties: If the a priori probability g for H1 belongs to I1, then there exists no sequential procedure which will result in a smaller average risk than the risk R1 = w12 g of accepting H2 without further experimentation. If the a priori probability g for H1 belongs to I2, there exists no sequential procedure which will result in a smaller average risk than the risk R2 = w21(1 - g) of accepting H1 without any further experimentation. However, in case

5 It is assumed here that the intervals are closed; this assumption has not yet been justified. It will be shown below (Section 3) that it is a matter of indifference whether the endpoints are included or not.


g̲ < g < ḡ, then there exists a sequential test T whose average risk will be less than the minimum of R1 and R2.

Using the argument of Section 2, we see that if in the initial stage g ≤ g̲, the optimum procedure is to accept H2 without taking any observations. Similarly, if in the initial stage g ≥ ḡ, the optimum procedure is to accept H1 without taking any observations. However, if g̲ < g < ḡ, then there exists a sequential test T worth performing, and this test will coincide with the optimum test T* for at least the first observation. Now suppose the first observation x1 is taken. We then compute the a posteriori probability g1 that H1 is true, where g1 is given by

(3.1) g1 = g f1(x1) / [g f1(x1) + (1 - g) f2(x1)].

We are now in the same position as we were initially. If g1 ≤ g̲, the best procedure is to stop sampling and accept H2. If g1 ≥ ḡ, the best procedure is to stop sampling and accept H1. However, if g̲ < g1 < ḡ, then there exists a sequential test T', and hence T*, which is worth performing, and we take another observation.

We thus see that the optimum sequential test T* for a dichotomy must coincide with the following procedure. For any w12 and w21 we determine g̲ and ḡ by a method to be described later. Let n = 0, 1, 2, ..., represent the number of observations taken in sequence. At each stage we compute gn, where gn is given by

(3.2) gn = g P1n / [g P1n + (1 - g) P2n].

We continue taking observations as long as g̲ < gn < ḡ. We stop as soon as, for some n, either gn ≤ g̲ or gn ≥ ḡ. In the former case we accept H2, in the latter case we accept H1.
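A minimal sketch of this recursion and stopping rule follows; the names f1, f2, g_lo, and g_hi (for the boundaries g̲ and ḡ, which still have to be computed by the method described later) are ours.

```python
def dichotomy_posterior(g, x, f1, f2):
    """(3.1): the a posteriori probability of H1 after observing x,
    starting from a priori probability g."""
    num = g * f1(x)
    return num / (num + (1 - g) * f2(x))

def optimum_dichotomy_test(g, xs, f1, f2, g_lo, g_hi):
    """T*: continue while g_lo < g_n < g_hi; accept H2 at or below
    g_lo and H1 at or above g_hi."""
    n = 0
    for x in xs:
        if g <= g_lo:
            return "H2", n
        if g >= g_hi:
            return "H1", n
        g = dichotomy_posterior(g, x, f1, f2)
        n += 1
    # sample exhausted without reaching a boundary: truncate and take
    # the currently more probable hypothesis
    return ("H1" if g >= 0.5 else "H2"), n
```

Because (3.2) can be computed recursively from (3.1), the test needs only the current posterior, not the whole sample.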

The optimum test T* described above is identical with the sequential probability ratio test developed by Wald [2]. Wald's test is defined as follows: Let Ln = P2n/P1n and let A and B be two positive numbers with B < 1 and A > 1. Observations are taken in sequence and sampling continues as long as B < Ln < A. Sampling terminates as soon as, for some sample size n, either Ln ≥ A or Ln ≤ B. In the former case, H2 is accepted, and in the latter case H1 is accepted. This procedure, however, is the same as T* provided T* requires at least one observation and provided we set

(3.3) B = [(1 - ḡ)/ḡ][g/(1 - g)] and A = [(1 - g̲)/g̲][g/(1 - g)].


We now define

(3.4) zi = log [f2(xi)/f1(xi)],

(3.5) a = log A, -b = log B.

In terms of these quantities, the sequential procedure T* can be defined as follows: Continue sampling as long as -b < Σ_{i=1}^{n} zi < a. Terminate sampling and accept the appropriate hypothesis as soon as, for some n, Σ_{i=1}^{n} zi ≥ a or Σ_{i=1}^{n} zi ≤ -b.
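In this additive form the test is a random walk between two absorbing barriers, which a few lines make concrete (the function name and the no-decision fallback are ours):

```python
def sprt(zs, a, b):
    """The test in the additive form just given: continue while
    -b < z1 + ... + zn < a; accept H2 on reaching a and H1 on
    reaching -b, where zi = log f2(xi)/f1(xi)."""
    s = 0.0
    for n, z in enumerate(zs, start=1):
        s += z
        if s >= a:
            return "H2", n
        if s <= -b:
            return "H1", n
    return None, len(zs)  # sample exhausted before either boundary
```

This is exactly Wald's procedure with a = log A and -b = log B.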

A Method for Determining g̲ and ḡ

From the above considerations we see that the optimum sequential test for a dichotomy with a constant cost function is completely determined by g̲ and ḡ. We shall therefore turn our attention to the problem of determining these quantities for any given w12 and w21. However, before we consider this problem we state the following theorem, which will be proved in Section 6: Let C be any class of sequential tests T for a dichotomy H1, H2. Let R(g | T) be the average risk of test T when the a priori probability of H1 is g. Then inf_{T in C} R(g | T) is a continuous function of g in the open interval 0 < g < 1.

The above theorem implies that the risk of the sequential test T* which is best among the class of tests involving taking at least one observation is a continuous function of g. Hence it follows that at the boundaries g = g̲ as well as g = ḡ, the risk incurred when the appropriate decision is made with no observations must equal the average risk when observations are taken and the optimum procedure is used thereafter. Thus if we equate these two risks at g = g̲ and also at g = ḡ, we get two equations from which we can obtain these quantities, provided, of course, we are able to compute the operating characteristics of the sequential probability ratio test. Since the two risks are equal at g = g̲ and at g = ḡ, it follows, as previously noted, that it makes no difference whether or not those points are included in the intervals I1 and I2, respectively, where no further observations are taken.

When g = g̲, the risk of accepting H2 with no observations is given by

(3.6) R1 = w12 g̲.

On the other hand, if we set g = g̲ in (3.5) we obtain a sequential probability ratio test T* defined by the boundaries

(3.7) a = log A = 0 and -b = log B = log [g̲(1 - ḡ) / ḡ(1 - g̲)].


We are therefore led to the problem of determining the average risk R(g̲ | T*), where T* is defined by (3.7).

It is clear that for this test, the first observation will either terminate the sampling or result in a new sequential test with boundaries a' = -z1 and -b' = -(b + z1), where z1 is given by (3.4) with i = 1 and -b < z1 < 0. Let Pi(z) be the probability distribution of z = log [f2(x)/f1(x)] when Hi is true (i = 1, 2). Moreover, for any sequential probability ratio test defined by boundaries -b' and a', let L(a', b' | Hi) be the probability that the test ends with the acceptance of H1 when Hi is true; then 1 - L(a', b' | Hi) is the probability that it ends with the acceptance of H2 when Hi is true. We also define E(n | a', b'; Hi) as the average number of observations required to reach a decision with this test when Hi is true.

Under the hypothesis H1, the average risk when T* is used is given by

(3.8) R1(T*) = 1 + w12 ∫_{0}^{∞} dP1(z) + ∫_{-b}^{0} E(n | -z, b + z; H1) dP1(z) + w12 ∫_{-b}^{0} [1 - L(-z, b + z | H1)] dP1(z).

Under the hypothesis H2, the average risk with T* is given by

(3.9) R2(T*) = 1 + w21 ∫_{-∞}^{-b} dP2(z) + ∫_{-b}^{0} E(n | -z, b + z; H2) dP2(z) + w21 ∫_{-b}^{0} L(-z, b + z | H2) dP2(z).

Hence the total risk using T* when g = g̲ is

(3.10) R(g̲ | T*) = g̲ R1(T*) + (1 - g̲) R2(T*).

Equating (3.10) to (3.6) we get

(3.11) g̲ R1(T*) + (1 - g̲) R2(T*) = w12 g̲.

By symmetry, equating the corresponding risks at g = ḡ (where the test T* is the one obtained by setting g = ḡ in (3.3)), we get

(3.12) ḡ R1(T*) + (1 - ḡ) R2(T*) = w21(1 - ḡ).

Equations (3.11) and (3.12) determine g̲ and ḡ uniquely. In general, it will be very difficult to compute the operating characteristics involved in these equations. However, it is possible to employ the approximations developed by Wald [2], which will usually result in values of g̲ and ḡ close to the true values, especially if the hypotheses H1 and H2 do not differ much from each other.


Exact Values of g̲ and ḡ for a Special Class of Double Dichotomies

We are given two binomial populations π1 and π2 defined by two parameters p1 and p2, where pi = P(x = 1) and qi = 1 - pi = P(x = 0) for i = 1, 2, and p1 < p2. Let H1 stand for the hypothesis that p1 is associated with π1 and p2 is associated with π2, and let H2 stand for the hypothesis that p2 is associated with π1 and p1 is associated with π2. Let w12 be the loss of accepting H2 when H1 is true and w21 be the loss of accepting H1 when H2 is true. We assume that the cost per observation is constant and, without loss of generality, is taken as unity. Let g be the a priori probability that H1 is true, and hence 1 - g is the a priori probability that H2 is true. The problem is to determine g̲ and ḡ which define the optimum procedure T* for testing these hypotheses.

It is easily shown that if g̲ and ḡ are known, T* is defined as follows: We set

(3.13) a = { log [ḡ(1 - g) / (1 - ḡ)g] / log (p2q1/p1q2) },
       b = { log [(1 - g̲)g / g̲(1 - g)] / log (p2q1/p1q2) },

where the symbol {y} stands for the smallest integer greater than or equal to y. Let x11, x12, ..., be a sequence of observations obtained from π1 and x21, x22, ..., a sequence of observations obtained from π2. We continue sampling as long as -b < Σ_{i=1}^{n} (x2i - x1i) < a. We terminate sampling as soon as for some sample size n either Σ_{i=1}^{n} (x2i - x1i) = a or Σ_{i=1}^{n} (x2i - x1i) = -b. In the former case we accept H1, in the latter case we accept H2.

Let L(a, b | Hi) = P[Σ_{i=1}^{n} (x2i - x1i) = a for some n] when Hi is true, and let E(n | a, b; Hi) be the expected number of observations required to reach a decision when Hi is true. Then, without any approximation (see [3]), we have

(3.14) L(a, b | H1) = (u^(a+b) - u^a) / (u^(a+b) - 1),

(3.15) L(a, b | H2) = (u^b - 1) / (u^(a+b) - 1),

(3.16) E(n | a, b; H1) = [(a + b) L(a, b | H1) - b] / (p2q1 - p1q2),

(3.17) E(n | a, b; H2) = [b - (a + b) L(a, b | H2)] / (p2q1 - p1q2),


(3.18) where u = p2q1/p1q2.
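As a sketch, formulas (3.14)-(3.18) as reconstructed above translate directly into code (the function name oc_asn is ours):

```python
def oc_asn(a, b, p1, p2):
    """Operating characteristic and average sample number from the
    exact formulas (3.14)-(3.18), for the random walk sum(x2i - x1i)
    absorbed at +a (accept H1) or -b (accept H2)."""
    q1, q2 = 1 - p1, 1 - p2
    u = (p2 * q1) / (p1 * q2)                     # (3.18)
    L1 = (u**(a + b) - u**a) / (u**(a + b) - 1)   # (3.14): P(accept H1 | H1)
    L2 = (u**b - 1) / (u**(a + b) - 1)            # (3.15): P(accept H1 | H2)
    d = p2 * q1 - p1 * q2                         # mean step under H1
    En1 = ((a + b) * L1 - b) / d                  # (3.16)
    En2 = (b - (a + b) * L2) / d                  # (3.17)
    return L1, L2, En1, En2
```

For the symmetric case p1 = 1/3, p2 = 2/3 (so u = 4) with a = b = 1, this gives L(a, b | H1) = .8 and an expected sample size of 1.8 under either hypothesis; one can also check the identity L(a, b | H2) = u^(-a) L(a, b | H1) implied by the probability ratio of a path absorbed at +a.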

Now let g = g̲. Then T* is defined by the boundaries a = ā and b = 0, where ā is obtained from (3.13) with g = g̲.

Using the same considerations as in the derivation of (3.8), we find the average risk of T* to be

(3.19) R(g̲ | T*) = 1 + g̲ w12(1 - p2q1) + g̲ p2q1 {E(n | ā - 1, 1; H1) + w12[1 - L(ā - 1, 1 | H1)]} + (1 - g̲) p1q2 {E(n | ā - 1, 1; H2) + w21 L(ā - 1, 1 | H2)}.

If we equate (3.19) to w12 g̲, the risk of accepting H2 with no observations, and simplify, we get

(3.20) 1 + p1q2 E(n | ā - 1, 1; H2) + p1q2 w21 L(ā - 1, 1 | H2) + g̲ {p2q1 [E(n | ā - 1, 1; H1) - w12 L(ā - 1, 1 | H1)] - p1q2 [E(n | ā - 1, 1; H2) + w21 L(ā - 1, 1 | H2)]} = 0.

If we now let g = ḡ, then T* is defined by the boundaries a = 0 and b = ā. Hence the average risk of going on with the optimum procedure when g = ḡ is given by

(3.21) R(ḡ | T*) = 1 + ḡ p1q2 {E(n | 1, ā - 1; H1) + w12[1 - L(1, ā - 1 | H1)]} + (1 - ḡ) w21(1 - p2q1) + (1 - ḡ) p2q1 {E(n | 1, ā - 1; H2) + w21 L(1, ā - 1 | H2)}.

Equating (3.21) to w21(1 - ḡ) and simplifying, we get

(3.22) 1 - p2q1 w21 + p2q1 E(n | 1, ā - 1; H2) + p2q1 w21 L(1, ā - 1 | H2) + ḡ {p1q2 E(n | 1, ā - 1; H1) + w12 p1q2 - p1q2 w12 L(1, ā - 1 | H1) + w21 p2q1 - p2q1 E(n | 1, ā - 1; H2) - w21 p2q1 L(1, ā - 1 | H2)} = 0.

The following procedure may be used for computing g̲ and ḡ from equations (3.20) and (3.22). Guess an integer ā and compute g̲ from (3.20) and ḡ from (3.22). Substitute these values in the formula

(3.23) ā = { log [ḡ(1 - g̲) / (1 - ḡ)g̲] / log (p2q1/p1q2) }.


If the resulting quantity has a value equal to the guessed ā, the computed g̲ and ḡ are correct. If not, repeat the process.

Equations (3.20) and (3.22) can also be used to compute w12 and w21 for given values of g̲ and ḡ. This can be accomplished as follows. For the given g̲ and ḡ, compute ā from (3.23). Substitute this ā in (3.20) and (3.22) and solve for w12 and w21.

The average risk of the optimum sequential procedure under consideration can be computed as a function of g as soon as g̲ and ḡ are determined, and is given by

(3.24) R(g | T*) = g E(n | a, b; H1) + (1 - g) E(n | a, b; H2) + g w12[1 - L(a, b | H1)] + (1 - g) w21 L(a, b | H2),

where for the given g, a and b are determined from (3.13). Since a and b are integers, the curve obtained by plotting R(g | T*) against g will consist of connected line segments.

4. MULTI-VALUED DECISIONS AND THE THEORY OF GAMES

The finite multi-valued decision problem can be considered as a game with Nature playing against the statistician. Nature selects a hypothesis Hi (i = 1, 2, ..., k) and the statistician selects a test procedure. The pay-off function involved in this game is the risk function, which consists of the average cost of experimentation plus the average loss incurred in making erroneous decisions under the test procedure selected by the statistician.

From the point of view of the theory of games, the existence of the optimum sequential procedure T* for multi-valued decisions means this: For every mixed strategy (a priori distribution) of Nature, the statistician has a pure strategy which is optimum against it. Thus, if Nature's mixed strategy were known, the statistician's problem would be solved, except for the problem of characterizing T*. But often Nature's mixed strategy is completely unknown. In that case Wald suggests that the statistician play a minimax strategy: that is, the statistician should select that decision procedure which minimizes the maximum risk. This procedure has the property that, if consistently employed, the resulting average risk will be independent of Nature's a priori distribution, provided Nature's best strategy is completely mixed.

Examples of Dichotomies

In terms of the multi-valued decision problem, say the dichotomy with a constant cost function, the minimax strategy of the statistician consists of the following: For given H1, H2, w12, w21, and cost per observation c (which we take as unity), the statistician computes g̲ and ḡ and then finds that g = g* for which the risk function R(g* | T*) is a


maximum. He then proceeds as if Nature always selects g* for its a priori distribution. If the hypotheses H1 and H2 involve continuous random variables, this procedure will always result in an average risk R(g | T**) which is independent of g. However, if H1 and H2 involve discrete random variables, this is no longer generally true and the statistician may have to resort to mixed strategies.

As an illustration, consider the double dichotomy discussed in the preceding section. Suppose we have obtained g̲ and ḡ and found that R(g | T*) given in (3.24) has a maximum at g = g*. Let the a and b corresponding to g* be designated by a* and b*. Then, in order that the sequential test T** (defined by the boundaries a* and -b*) be independent of g, we must have (see 3.24),

(4.1) E(n | a*, b*; H1) + w12[1 - L(a*, b* | H1)] = E(n | a*, b*; H2) + w21 L(a*, b* | H2).

But since a* and b* are necessarily integers, this equation will usually not be satisfied. This implies that Nature's strategy g* must have the property that when we set g = g* in (3.13) (removing the braces) either a is exactly an integer, or b is exactly an integer, or both are integers. Suppose for example that a = a* is an integer but not b = b*. This means that when Σ_{i=1}^{n} (x2i - x1i) = a* we have two courses of action to follow. We can either stop and accept H1, or go on experimenting with the best sequential test defined by the boundaries a = a* + 1 and b = b*. But these two procedures will not always have the same risk for g ≠ g*. Thus in order to make the average risk independent of g we shall have to employ a mixed strategy. That is, we shall have to employ one procedure some specified fraction α of the times and the other procedure 1 - α of the times, where 0 < α < 1.

To illustrate these concepts, we have computed four examples of binomial dichotomies, all with the same hypotheses H1: p1 = 1/3, H2: p2 = 2/3, but varying w12 and w21. The risk function R(g | T*) as well as the minimax strategies are given for each of these examples6 in Figures 1 to 4.

It can be shown that for all symmetric binomial dichotomies there always exists a pure minimax strategy for the statistician if w12 = w21. However, this is no longer true if w12 ≠ w21. This phenomenon is illustrated in the charts following.

6 The optimum sequential test T* for any symmetric binomial dichotomy (i.e., p1 + p2 = 1) becomes identical with the test described in Section 5 when the following substitutions are made: write q1 for p1q2, q2 for p2q1, and Σ_{i=1}^{N} xi for Σ_{i=1}^{N} (x2i - x1i), where each xi takes on the value 1 with probability pi and -1 with probability qi = 1 - pi (i = 1, 2).


FIGURE 1. The average minimal risk R(g) as a function of the a priori distribution g for a dichotomy with H1: p = 1/3, H2: p = 2/3, c = 1, w12 = w21 = 10. Average minimax risk = 4.3. Minimax strategy for the statistician: a sequential probability ratio test.

FIGURE 2. The average minimal risk R(g) as a function of the a priori distribution g for a dichotomy with H1: p = 1/3, H2: p = 2/3, c = 1, w12 = 5, w21 = 10. Average minimax risk = 3. Minimax strategy for the statistician: a mixture of a sequential probability ratio test and accepting H2 with no observations.


FIGURE 3. The average minimal risk R(g) as a function of g.

FIGURE 4. The average minimal risk R(g) as a function of the a priori distribution g for a dichotomy with H1: p = 1/3, H2: p = 2/3, c = 1, w12 = 50, w21 = 100. Average minimax risk = 14.642. Minimax strategies: for Nature, g* = .62175; for the statistician, sequential probability ratio tests (a, b) = (3, 4) and (4, 4) with frequencies .58682 and .41318, respectively.


Examples of Trichotomies

EXAMPLE 1. Assume that the random variables x1, x2, ..., are independently distributed and all have the same distribution. Each xn takes on only the values 1, 2, 3, with probabilities specified by one of the following alternative hypotheses:

Hypothesis    Event 1    Event 2    Event 3
H1               0         1/2        1/2
H2              1/2         0         1/2
H3              1/2        1/2         0

Let wij be the loss if Hj is accepted when Hi is true; the values are given by the following table:

State of        Hypothesis Accepted
Nature          H1    H2    H3
H1               0     4     6
H2               6     0     4
H3               4     6     0

Note that both of these matrices are invariant under a cyclic permuta- tion of the hypotheses and events. Finally, assume that the cost of each observation is 1.

Let gi be the a priori probability of Hi. An a priori distribution g = (g1, g2, g3), with g1 + g2 + g3 = 1, may be represented by a point in an equilateral triangle with unit altitude; the distances from the point to the three sides are the values of g1, g2, and g3. Pi is the point where gi = 1 (i = 1, 2, 3).

Let R(g | T) be the average risk under sequential procedure T when the a priori probabilities are g1, g2, g3. Let T0 be the best sequential procedure under which no observations are taken; let Si be the region in g-space where Hi is accepted under T0. Let Lj(g) be the loss in accepting Hj when the a priori distribution is g. Then

(4.2) Lj(g) = Σ_{ν=1}^{3} w_{νj} g_ν.

S1 is defined by the inequalities

L1 ≤ L2, L1 ≤ L3,

or

6g2 + 4g3 ≤ 4g1 + 6g3, 6g2 + 4g3 ≤ 6g1 + 4g2.


That is,

(4.3) g1 ≥ max ((3/2)g2 - (1/2)g3, (1/3)g2 + (2/3)g3).

At P1, g2 = g3 = 0 and g1 = 1, so that P1 belongs to S1. When g3 = 0, (4.3) becomes g1 ≥ (3/2)g2, while g1 + g2 = 1, so that g1 ≥ 3/5; when g2 = 0, g1 ≥ (2/3)g3 and g1 + g3 = 1, so that g1 ≥ 2/5. Also, the two lower bounding lines for g1 are equal when (3/2)g2 - (1/2)g3 = (1/3)g2 + (2/3)g3 = g1, or g1 = g2 = g3 = 1/3. S1 contains all points above the boundary defined by the polygon with vertices (3/5, 2/5, 0), (1/3, 1/3, 1/3), and (2/5, 0, 3/5). S2 and S3 can be obtained by successive cyclic permutations of these coordinates.

FIGURES 5 and 6. The regions Si and Si* in the probability triangle with vertices P1, P2, P3.

Let T*(g) be the optimum sequential procedure for a given a priori distribution. Let gij be the a posteriori probability of Hi given that x1 = j;

(4.4) gij = gi pij / Σ_{k=1}^{3} gk pkj,

where pij is the probability that x1 = j under Hi. Let T1*(g) be the sequential procedure defined as taking one observation and then using procedure T*(g1j, g2j, g3j) when x1 = j. T1*(g) is the best sequential procedure which involves taking at least one observation. In the present case, pjj = 0, so that gjj = 0 by (4.4); therefore, T*(g1j, g2j, g3j) is the

optimum sequential test for a dichotomy.7

7 It is to be pointed out that, for any trichotomy, the intersection of the regions Si* with the sides of the triangle may be determined by computing g̲ and ḡ [see Section 3] for the appropriate dichotomy.

Let Si* be the region in g-space for which Hi is accepted without any observation under T*(g). As has been shown, the regions Si* are essentially all that is needed to determine all the tests T*(g). Further, the regions Si* are convex sets whose boundaries are characterized by the relations

(4.5) R[g | T1*(g)] = Li(g), Li(g) = min_j Lj(g),

so that

(4.6) Si* ⊂ Si.

It is first necessary to find the optimum tests for each of the dichotomies formed by taking pairs from the trichotomy H1, H2, H3. Consider the dichotomy H1, H2. Then g1 + g2 = 1. Suppose g1 is such that it pays to take at least one observation. From (4.4), since pi3 = 1/2 (i = 1, 2),

(4.7) gi3 = gi/(g1 + g2) = gi.

Hence, if x1 = 3, the a posteriori probabilities are unchanged, and, as shown previously, the optimum test calls for taking another observation. On the other hand, p11 = 0, so that g21 = 1. Therefore, if x1 = 1, the process should be terminated and H2 accepted. Similarly, if x1 = 2, the process should be terminated and H1 accepted. It follows that if g1 is such that the best sequential procedure calls for taking at least one observation, then the best procedure is to sample indefinitely until xn = 1 or 2; in the former case, accept H2, in the latter, H1. The probability of accepting the wrong hypothesis is zero under either hypothesis; the risk is then the expected number of observations, which is 2 under either hypothesis.

The boundaries g̲12 and ḡ12 of the interval of g1's in which it pays to go on are then determined by the equations

(4.8) w12 g̲12 = 2, or g̲12 = 1/2,
      w21(1 - ḡ12) = 2, or ḡ12 = 2/3.

Returning to the specification of T1*(g), we note that if x1 = 3,

gi3 = gi pi3 / Σ_{k=1}^{3} gk pk3.

As p33 = 0 and pi3 = 1/2 for i = 1, 2,

gi3 = gi/(g1 + g2), (i = 1, 2).


Therefore, T1*(g) can be described as follows: If x1 = 3, H3 is rejected entirely; if g1/(g1 + g2) ≤ 1/2, stop and accept H2; if g1/(g1 + g2) ≥ 2/3, stop and accept H1; if 1/2 < g1/(g1 + g2) < 2/3, continue sampling until xn = 1 or 2, at which point stop and accept H2 or H1, respectively. The three conditions on g1/(g1 + g2) can be written in the simpler form,

(4.9) g1 ≤ g2, g1 ≥ 2g2, g2 < g1 < 2g2,

respectively. The cases where x1 = 1 or 2 can be obtained from the preceding case by cyclic permutation of the numbers 1, 2, 3.

Let Rj*(g) be the conditional expected risk associated with T1*(g) when the a priori probability distribution is g, given that x1 = j; and let p̄j be the a priori probability that x1 = j. Then

(4.10) R[g | T1*(g)] = Σ_{j=1}^{3} p̄j Rj*(g),

(4.11) p̄j = Σ_{i=1}^{3} pij gi,

(4.12) R3*(g) = 1 + w12 g1/(g1 + g2) = 1 + 4g1/(g1 + g2), if g1 ≤ g2;
       R3*(g) = 3, if g2 < g1 < 2g2;
       R3*(g) = 1 + w21 g2/(g1 + g2) = 1 + 6g2/(g1 + g2), if g1 ≥ 2g2.

R1*(g) and R2*(g) can be obtained from (4.12) by cyclic permutation of the subscripts.

The region S1* can now be determined by the relations (4.5-6) and (4.10-12). First note that when g3 = 0, the problem reduces to the dichotomy between H1 and H2 already discussed, so that the interval from g1 = 1 to g1 = 2/3 on the line g3 = 0 belongs to S1*. Hence the point (2/3, 1/3, 0) lies on the boundary of S1*. This point satisfies the conditions,

(4.13) g2 ≤ g1 ≤ 2g2, g2 ≥ 2g3, g3 ≤ g1.

Consider the intersection, if any, of the boundary of S1* with the region R1 defined by (4.13). Using (4.10-11) and (4.5),

(4.14) [(g2 + g3)/2] R1*(g) + [(g1 + g3)/2] R2*(g) + [(g1 + g2)/2] R3*(g) = L1(g).

From (4.13-14), (4.12), and (4.2),


[(g1 + g2)/2] 3 + [(g2 + g3)/2][1 + 6g3/(g2 + g3)] + [(g1 + g3)/2][1 + 4g3/(g1 + g3)] = 6g2 + 4g3,

or

(4.15) g2 = 1/3.

The intersection of (4.15) with the line g2 = 2g3 occurs at the point (1/2, 1/3, 1/6), which satisfies the conditions (4.13) and so lies on the boundary of the region R1. As R1 is convex, it follows that the boundary of S1* actually does intersect R1 and there coincides with the line segment joining (2/3, 1/3, 0) and (1/2, 1/3, 1/6). The latter point satisfies also the conditions

(4.16) g2 ≤ g1 ≤ 2g2, g3 ≤ g2 ≤ 2g3, g3 ≤ g1.

Let R2 be the region defined by (4.16). Then we can find as before the intersection of the boundary of S1* with R2; the boundary hits the line g2 = g3 at the point (3/7, 2/7, 2/7), which lies in R2. Hence the boundary of S1* actually does intersect R2 and there coincides with the segment joining (1/2, 1/3, 1/6) to (3/7, 2/7, 2/7). If we continue this method, it can be shown that S1* is bounded by the polygon with vertices (2/3, 1/3, 0), (1/2, 1/3, 1/6), (3/7, 2/7, 2/7), (2/5, 1/5, 2/5), (1/2, 0, 1/2), and (1, 0, 0). It is easily verified that S1* is actually a subset of S1, as demanded by (4.6). The vertices of the polygons bounding S2* and S3* can be obtained by cyclic permutation of the coordinates.

For any given g, the regions S1*, S2*, and S3* completely define the optimal procedure. It remains to find the minimax procedure.

As shown by (4.12), the maximum conditional expected risk given that x1 = 1 is 3, and this occurs when g3 ≤ g2 ≤ 2g3. Similarly, the maximum conditional expected risks given that x1 = 2 and 3, respectively, are both equal to 3, and they occur when g1 ≤ g3 ≤ 2g1 and g2 ≤ g1 ≤ 2g2, respectively. Any g* satisfying these three conditions will be a least favorable a priori distribution; clearly, the only set of values is g1 = g2 = g3 = 1/3. If x1 = 1, the corresponding a posteriori distribution is (0, 1/2, 1/2). This is on the boundary of S3*, so that the optimum procedure is to stop after one observation and choose H3. In general, then, the minimax procedure is to take one observation, stop, and accept H3 if x1 = 1, H1 if x1 = 2, and H2 if x1 = 3. The risk associated with this test is 3, of which the cost of observation is 1, and the expected loss due to incorrect decision is 2, independent of the true a priori distribution.
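That the risk of this one-observation minimax test is 3 whichever hypothesis is true (and hence whatever the prior) can be verified directly from the two tables; the variable names below are ours, and exact rational arithmetic keeps the check free of rounding.

```python
from fractions import Fraction as F

# loss[i][j]: loss of accepting H_{j+1} when H_{i+1} is true
loss = [[0, 4, 6], [6, 0, 4], [4, 6, 0]]
# p[i][x-1]: probability of the event x under H_{i+1}
p = [[F(0), F(1, 2), F(1, 2)],
     [F(1, 2), F(0), F(1, 2)],
     [F(1, 2), F(1, 2), F(0)]]
# the minimax rule: x = 1 -> accept H3, x = 2 -> H1, x = 3 -> H2
accept = {1: 2, 2: 0, 3: 1}          # 0-based index of the accepted H

def risk(i):
    """Risk of the one-observation minimax test when H_{i+1} is true:
    cost 1 of the observation plus the expected loss of the decision."""
    return 1 + sum(p[i][x - 1] * loss[i][accept[x]] for x in (1, 2, 3))
```

risk(i) equals 3 for every i, so the average risk under any a priori distribution g is 3, as stated.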

It may be of interest to note that the minimax test is not unique, the


lack of uniqueness corresponding exactly to the inclusion or exclusion of the boundaries of S* , S , and S3 in those sets. If we exclude the bounda- ries, then, when x1 = 1, we continue. So long as x, = 1, the a posteriori probabilities remain at (0, 1/2, 1/2); when Xn = 2 (3), the a posteriori probability of H3(H2) becomes 1. Therefore, a second minimax test is to stop the first time x,n xn_l, and then accept that hypothesis whose subscript equals neither xn nor x,_ . The maximum risk is again 3; all of this is represented by the expected number of observations which is the same for all a priori distributions.
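That the whole risk of 3 consists of observation cost can be checked directly. A minimal sketch, assuming (as in the example) that under each hypothesis the observations are i.i.d. uniform on the two remaining values, so that each observation after the first differs from its predecessor with probability 1/2:

```python
# Sketch: expected number of observations for the second minimax test
# ("stop the first time x_n != x_{n-1}").  With i.i.d. uniform observations
# on two symbols, P(N = n) = (1/2)^(n-1) for n >= 2: the first n-2
# comparisons tie and the last one differs.
from fractions import Fraction

expected_n = sum(n * Fraction(1, 2) ** (n - 1) for n in range(2, 200))
print(float(expected_n))  # ≈ 3: the whole risk of 3 is observation cost
```

Since the test always identifies the true hypothesis exactly (the posterior jumps to 1 at stopping), the decision loss is 0 and the risk equals the expected sample size of 3.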

FIGURE 7. BOUNDARIES OF REGIONS DETERMINING THE OPTIMUM TEST FOR A TRICHOTOMY; NATURE'S STRATEGY LEAST FAVORABLE TO THE STATISTICIAN

EXAMPLE 2. The boundaries of the regions Si* might also be straight lines, as shown by the following example: Let x1, x2, ..., be independently distributed with the same distribution, each xn taking on the values 1, 2, or 3. Let Hi be the hypothesis that xn = i with probability 1 (i = 1, 2, 3), and let wij = 3 for i ≠ j, wii = 0. Assume the cost of each observation is 1. Then the best test involving at least one observation is clearly to take exactly one observation and accept Hi if x1 = i. The expected loss due to incorrect decision is 0, so that the risk of this test is 1. Hence the boundary of S1* is characterized by the relation w21 g2 + w31 g3 = 3(g2 + g3) = 1, or g1 = 2/3. Similarly, S2*, S3* are defined by the inequalities g2 ≥ 2/3, g3 ≥ 2/3, respectively.
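The threshold g1 = 2/3 follows from a one-line comparison; a minimal sketch with the example's values (wij = 3, unit observation cost, one-observation risk 1):

```python
# Sketch of Example 2's comparison: accepting H1 with no observations has
# risk 3*(g2 + g3) = 3*(1 - g1); the best test taking an observation has risk 1.
from fractions import Fraction

w = Fraction(3)

def risk_accept_H1_now(g1):
    """Risk of accepting H1 immediately under prior (g1, g2, g3)."""
    return w * (1 - g1)

assert risk_accept_H1_now(Fraction(2, 3)) == 1   # boundary of S1*
assert risk_accept_H1_now(Fraction(3, 4)) < 1    # inside S1*: stop at once
assert risk_accept_H1_now(Fraction(1, 2)) > 1    # outside: observing is better
```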


If no observations are taken, the region S1 in which H1 is accepted is characterized by w21 g2 + w31 g3 ≤ min (w12 g1 + w32 g3, w13 g1 + w23 g2), or

1 − g1 ≤ min (g1 + g3, g1 + g2),  i.e.,  2g1 ≥ max (1 − g2, 1 − g3).

The boundary is the polygonal line with vertices (1/2, 1/2, 0), (1/3, 1/3, 1/3), and (1/2, 0, 1/2). The boundaries of S2 and S3 are found similarly. The regions S1*, S2*, S3* clearly lie inside S1, S2, S3, respectively.

FIGURES 8 AND 9

EXAMPLE 3. In both previous examples, inner boundaries of the regions Si* were found by equating the risk of accepting Hi if no observations are taken with the risk under the best procedure calling for taking at least one observation. The regions so found were in both cases subsets of Si. However, this relation need not hold in general, as shown by the following example:

Let all conditions be the same as in Example 2 except that wij = 5/3 for i ≠ j. Then the risk of accepting H1 without observations is equal to the risk of the best test taking at least one observation when g1 = 2/5. But the region g1 ≥ 2/5 is not a subset of S1 (which is the same here as in Example 2). S1* is the intersection of S1 and the region g1 ≥ 2/5, and is bounded from below by the polygonal line with vertices (1/2, 1/2, 0), (2/5, 2/5, 1/5), (2/5, 1/5, 2/5), and (1/2, 0, 1/2).
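The arithmetic behind this example can be sketched as follows. The loss value w = 5/3 is an assumed reading of the damaged text (the stated threshold of 2/5 forces it, since the no-observation acceptance threshold is g1 = 1 − 1/w); any loss between 1 and 2 would produce the same phenomenon.

```python
# Sketch (assumed loss w = 5/3): the no-observation threshold g1 = 1 - 1/w
# falls below 1/2, so the region g1 >= 2/5 is not contained in
# S1 = {g1 >= max(g2, g3)}.
from fractions import Fraction

w = Fraction(5, 3)                      # assumed reading of w_ij, i != j
g1_threshold = 1 - 1 / w
assert g1_threshold == Fraction(2, 5)   # risk w*(1 - g1) equals 1 here

g = (Fraction(2, 5), Fraction(1, 2), Fraction(1, 10))   # g1 >= 2/5 but g2 > g1
in_S1 = g[0] >= max(g[1], g[2])
print(in_S1)   # False: the region g1 >= 2/5 pokes outside S1
```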

5. ANOTHER OPTIMUM PROPERTY OF THE SEQUENTIAL PROBABILITY RATIO TEST

In Section 3 we have shown that the sequential probability ratio test is optimum in the sense that for a given a priori distribution g it minimizes the average risk. We shall now prove that for fixed probabilities of making erroneous decisions, this test minimizes the average number of observations when H1 is true as well as when H2 is true.


PROOF: For fixed A > 1 and B < 1, let α be the probability of accepting H2 when H1 is true if the sequential probability ratio test T* with these boundaries is used. Similarly, let β be the probability of accepting H1 when H2 is true. The quantities α and β are uniquely determined by A and B.

Choose any g such that 0 < g < 1, solve for g̲ and ḡ from (3.3), and then compute w12 = w12(g̲, ḡ) and w21 = w21(g̲, ḡ) from (3.11) and (3.12). (The quantity b entering in (3.11) and (3.12) is given by log (A/B).)

The three quantities w12(g̲, ḡ), w21(g̲, ḡ), and g have the property that if the a priori distribution is g, if the risk of accepting H2 when H1 is true is w12(g̲, ḡ), and the risk of accepting H1 when H2 is true is w21(g̲, ḡ), then the sequential test T* has minimum average risk. The average risk under T* is given by

(5.1)  R(g | T*) = g E(n | T*; H1) + (1 − g) E(n | T*; H2) + g w12(g̲, ḡ) α + (1 − g) w21(g̲, ḡ) β.

Now let T be any other test procedure which results in probabilities α' ≤ α and β' ≤ β of making erroneous decisions. Then for the same triplet g, w12(g̲, ḡ), and w21(g̲, ḡ), the average risk under T is given by

(5.2)  R(g | T) = g E(n | T; H1) + (1 − g) E(n | T; H2) + g w12(g̲, ḡ) α' + (1 − g) w21(g̲, ḡ) β'.

Now since R(g | T*) ≤ R(g | T) we must have

(5.3)  g E(n | T*; H1) + (1 − g) E(n | T*; H2) ≤ g E(n | T; H1) + (1 − g) E(n | T; H2).

But the inequality (5.3) must hold for all values of g, 0 < g < 1. Hence from continuity considerations we must have

(5.4)  E(n | T*; H1) ≤ E(n | T; H1),

and

(5.5)  E(n | T*; H2) ≤ E(n | T; H2).

This proves the theorem.
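The content of the theorem can be illustrated numerically. The following simulation is not from the paper: it assumes concrete hypotheses H1: P(x = 1) = 0.3 and H2: P(x = 1) = 0.7, uses a symmetric probability-ratio test whose log-likelihood ratio is a ±1 random walk (in step units) absorbed at ±3, and compares its error rate and average sample size against a fixed-sample majority-vote test.

```python
# Illustrative simulation (not from the paper): the sequential probability
# ratio test reaches the error rate of a fixed-sample test with fewer
# observations on average.  Assumed setup: H1: P(x=1)=0.3, H2: P(x=1)=0.7,
# log-likelihood-ratio barriers at +-3 steps of size log(0.7/0.3).
import math
import random

P1, P2 = 0.3, 0.7
BARRIER = 3          # stop when the count of (x=1) minus (x=2) hits +-3

def sprt_run(p, rng):
    """Run one SPRT under true success probability p; return (accepted H2?, n)."""
    walk, n = 0, 0
    while abs(walk) < BARRIER:
        walk += 1 if rng.random() < p else -1
        n += 1
    return walk > 0, n

rng = random.Random(0)
runs = [sprt_run(P1, rng) for _ in range(20000)]
alpha = sum(accepted for accepted, _ in runs) / len(runs)   # P(accept H2 | H1)
mean_n = sum(n for _, n in runs) / len(runs)                # average sample size

# A fixed-sample majority test needs n = 11 for a comparable error rate:
fixed_err = sum(math.comb(11, k) * P1**k * P2**(11 - k) for k in range(6, 12))
print(alpha, mean_n, fixed_err)
```

With these assumed parameters the SPRT attains roughly the same error probability as the 11-observation fixed test while averaging well under 11 observations, which is the theorem's claim in miniature.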


6. CONTINUITY OF THE RISK FUNCTION OF THE OPTIMUM TEST⁸

THEOREM: Let Σ be any class of sequential tests of a multiple decision problem involving a finite number of alternative hypotheses, and let R(g | T) be the risk of test T when the a priori distribution of the hypotheses is g. Then inf_{T∈Σ} R(g | T) is a continuous function of g in the region for which gi > 0 for all i, provided that for some T0 in Σ, R(g | T0) is everywhere finite.

PROOF: For each hypothesis Hi and each test procedure T there is a nonnegative risk which is the sum of the expected cost of observations and the expected loss due to failure to make the best decision; call this risk Ai(T). Then

(6.1)  R(g | T) = Σ_i gi Ai(T).

Let g0 be any a priori distribution for which g0i > 0 for all i, and choose 0 < δ0 < min_i g0i. Let G be the region in g-space for which |gi − g0i| ≤ δ0; then gi > 0 for all i and all g in the compact set G. Let T0 be the test referred to in the hypothesis. Then

(6.2)  sup_{g∈G} inf_{T∈Σ} R(g | T) ≤ sup_{g∈G} R(g | T0) = K < +∞,

the last inequality following since R(g | T0) is linear and hence continuous on the compact set G.

Let Σ' be the subclass of Σ for which inf_{g∈G} R(g | T) ≤ K + 1. As Σ' is a subset of Σ, inf_{T∈Σ'} R(g | T) ≥ inf_{T∈Σ} R(g | T). Suppose for some g' in G,

inf_{T∈Σ'} R(g' | T) > inf_{T∈Σ} R(g' | T).

Then

(6.3)  inf_{T∈Σ} R(g' | T) = inf_{T∈Σ−Σ'} R(g' | T).

But if T belongs to Σ − Σ',

(6.4)  R(g' | T) ≥ inf_{g∈G} R(g | T) > K + 1.

From (6.4) and (6.2),

inf_{T∈Σ−Σ'} R(g' | T) ≥ K + 1 > sup_{g∈G} inf_{T∈Σ} R(g | T) ≥ inf_{T∈Σ} R(g' | T),

contradicting (6.3). Hence,

⁸ The essential features of the proof of this theorem are due to George Brown.


(6.5)  inf_{T∈Σ} R(g | T) = inf_{T∈Σ'} R(g | T)

everywhere in G, and it suffices to consider only tests in Σ'. R(g | T) is continuous and assumes its minimum in G at some point g''. Hence,

inf_{g∈G} R(g | T) = R(g'' | T) = Σ_i g''i Ai(T).

As g''i > 0 and Ai(T) ≥ 0 for all i,

(6.6)  g''i Ai(T) ≤ R(g'' | T) = inf_{g∈G} R(g | T) ≤ K + 1,

since T belongs to Σ'. Although g'' may vary with T, g''i must have a positive lower bound because of the compactness of G and the fact that gi > 0 for all g in G. As i takes on only a finite number of values, g''i has a positive uniform lower bound. Then (6.6) implies that Ai(T) is bounded from above uniformly in i and T. Let C be this upper bound.

Choose any δ ≤ δ0, and any g such that Σ_i |gi − g0i| ≤ δ. Then, for any T in Σ',

|R(g | T) − R(g0 | T)| = |Σ_i (gi − g0i) Ai(T)| ≤ Cδ,

so that R(g | T) ≤ R(g0 | T) + Cδ and

inf_{T∈Σ'} R(g | T) ≤ inf_{T∈Σ'} R(g0 | T) + Cδ.

Similarly, inf_{T∈Σ'} R(g0 | T) ≤ inf_{T∈Σ'} R(g | T) + Cδ, so that inf_{T∈Σ'} R(g | T), and therefore, by (6.5), inf_{T∈Σ} R(g | T), is continuous at g0.

The continuity of inf_{T∈Σ} R(g | T) does not extend in general to the boundary of g-space where gi = 0 for some i, as is shown by the following example:

Let x1, x2, ..., be independently distributed variates with a common distribution, each xn taking on only the values 1, 2, 3. Let H1 be the hypothesis P(xn = 1) = p1, P(xn = 2) = 1 − p1, P(xn = 3) = 0; H2 the hypothesis P(xn = 1) = p2, P(xn = 2) = 1 − p2, P(xn = 3) = 0; and H3 the hypothesis P(xn = 1) = P(xn = 2) = 0, P(xn = 3) = 1. Let the cost of each observation be 1.

Let T0 be the minimax test of the dichotomy H1, H2. Let c = Σ_{n=1}^∞ 1/n².

Let qn be defined inductively as follows:

q1 = 1/c,   qn = 1 / (c n² Π_{τ=1}^{n−1} (1 − q_τ)).

Then test T1 is defined as follows: if xn = 1 or 2, and the process has not stopped before n, form the sequence y1, ..., ym consisting of all those elements of the sequence x1, ..., xn for which xk ≠ 3; then either go on or stop and make a decision in accordance with T0 applied to the sequence y1, ..., ym. If xn = 3, stop and accept H3 with probability qn; go on with probability 1 − qn.

T0 has a certain risk R for all g such that g1 + g2 = 1. Choose N > R. Let T2 be the test consisting of taking N observations and then making the best decision.

Clearly, under H1 or H2, xn is never equal to 3, so that y1, ..., ym is the same as x1, ..., xn, and T1 coincides with T0. The risk for T1 under H1 or H2 is thus R. Under H3, xn = 3 for all n; hence the probability of stopping at m is 1/(c m²), so that the probability of stopping is 1 but the expected cost, and therefore the expected risk, is infinite. Therefore, if g3 = 0, R(g | T1) = R; but if g3 > 0, R(g | T1) = +∞. On the other hand, R < N ≤ R(g | T2) < +∞ everywhere.
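The behavior of T1 under H3 can be checked numerically. The sketch below assumes the reading of the recursion given above (q1 = 1/c, qn = 1/(c n² Π(1 − q_τ)), with c = Σ 1/n² = π²/6); by construction the stopping probability at stage m is exactly 1/(c m²), so the stopping probabilities sum to 1 while the expected cost has the divergent partial sums of a harmonic series.

```python
# Sketch (assumed reading of the recursion): under H3 every x_n = 3, so the
# probability of stopping exactly at stage m is q_m * prod_{tau<m}(1 - q_tau)
# = 1/(c m^2): total stopping probability 1, expected cost infinite.
import math

c = math.pi ** 2 / 6          # c = sum_{n>=1} 1/n^2
prod = 1.0                    # running value of prod_{tau<n}(1 - q_tau)
stop_probs = []
for n in range(1, 2001):
    qn = 1.0 / (c * n * n * prod)
    stop_probs.append(qn * prod)          # equals 1/(c n^2) exactly
    prod *= 1 - qn

partial_prob = sum(stop_probs)                                  # -> 1
partial_cost = sum(m * p for m, p in enumerate(stop_probs, 1))  # ~ (log m)/c
print(round(partial_prob, 4), round(partial_cost, 2))
```

The first 2000 stages already carry almost all the stopping probability, while the cost partial sum keeps growing like (log m)/c, which is the source of the infinite risk for g3 > 0.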

If Σ consists of the two tests T1 and T2, inf_{T∈Σ} R(g | T) = R(g | T1) = R for g3 = 0, but inf_{T∈Σ} R(g | T) = R(g | T2) ≥ N > R for g3 ≠ 0. Hence inf_{T∈Σ} R(g | T) is not continuous at any point for which g3 = 0.

Cowles Commission for Research in Economics
Howard University
Stanford University

REFERENCES

[1] ABRAHAM WALD, "Foundations of a General Theory of Sequential Decision Functions," Econometrica, Vol. 15, 1947, pp. 279-313.

[2] ABRAHAM WALD, Sequential Analysis, John Wiley and Sons, Inc., New York, 1947.

[3] M. A. GIRSHICK, "Contributions to the Theory of Sequential Analysis. I," Annals of Mathematical Statistics, Vol. 17, 1946, pp. 123-143.

[4] A. WALD AND J. WOLFOWITZ, "Optimum Character of the Sequential Probability Ratio Test," Annals of Mathematical Statistics, Vol. 19, 1948, pp. 326-339.
