
Research Collection

Doctoral Thesis

Statistical decisions based directly on the likelihood function

Author(s): Cattaneo, Marco E.G.V.

Publication Date: 2007

Permanent Link: https://doi.org/10.3929/ethz-a-005463829

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

Diss. ETH No. 17122

Statistical Decisions Based Directly

on the Likelihood Function

A dissertation submitted to

ETH Zurich

for the degree of

Doctor of Mathematics

presented by

Marco E. G. V. Cattaneo

Dipl. Math. ETH

born September 19, 1976

citizen of Capriasca, Ticino

accepted on the recommendation of

Prof. em. Dr. Frank Hampel, examiner

Prof. Dr. Sara van de Geer, co-examiner

Prof. Dr. Thomas Augustin, co-examiner

2007

Contents

Abstract

Sunto

Notation

1 Introduction

1.1 Statistical Decisions

1.1.1 Classical Decision Theory

1.1.2 Bayesian Decision Theory

1.2 Likelihood Function

1.2.1 Likelihood-Based Inference

1.3 Likelihood-Based Statistical Decisions

1.3.1 MPL Criterion

1.3.2 Other Likelihood-Based Decision Criteria

2 Nonadditive Measures and Integrals

2.1 Nonadditive Measures

2.1.1 Types of Measures

2.1.2 Derived Measures

2.2 Nonadditive Integrals

2.2.1 Shilkret Integral

2.2.2 Invariance Properties

2.2.3 Regular Integrals

2.2.4 Subadditivity

2.2.5 Continuity

2.2.6 Choquet Integral

3 Relative Plausibility

3.1 Description of Uncertain Knowledge

3.1.1 Representation Theorem

3.1.2 Updating

3.2 Hierarchical Model

3.2.1 Classical and Bayesian Models

3.2.2 Imprecise Probability Models

4 Likelihood-Based Decision Criteria

4.1 Decision Theoretic Properties

4.1.1 Attitudes Toward Ignorance

4.1.2 Sure-Thing Principle

4.2 Statistical Properties

4.2.1 Invariance Properties

4.2.2 Asymptotic Optimality

5 Point Estimation

5.1 Likelihood-Based Estimates

5.1.1 Asymptotic Efficiency

5.2 Shilkret Integral and Quasiconvexity

5.2.1 Impossible Estimates

References

Index

Curriculum Vitae

Abstract

This thesis studies the possibility of basing statistical decisions directly on the likelihood function. By describing the problems of statistical inference as particular decision problems, the usual likelihood-based inference methods can be obtained as special cases. These are among the most appreciated methods of statistical inference, although in general they are not optimal from the repeated sampling point of view. In fact, their principal advantages are that, being based on the likelihood function, they are intuitive, generally applicable, and asymptotically optimal. These properties are shared by all the methods that base decisions on the likelihood function in some reasonable way.

The decision criteria based on the likelihood function are analyzed, and some of their decision theoretic properties are related to their representations by means of nonadditive integrals; in this connection, the theory of integration with respect to nonadditive measures is reviewed and extended. Some statistical properties of the likelihood-based decision criteria are studied as well, and a paradigmatic example of a statistical decision problem is examined in more detail: the problem of point estimation in Euclidean spaces. A simple, intuitive decision criterion emerges as particularly interesting: the MPL criterion, which consists in using the likelihood of the statistical models to weight the respective losses, before applying the minimax criterion.

The likelihood function can be interpreted as a measure of the relative plausibility of the statistical models considered, and thus as the second level of a hierarchical description of uncertain knowledge, of which the statistical models form the first level. There is a similarity with the Bayesian model, in which the second level consists of a probability measure, but the measure of relative plausibility considered in this thesis can also describe ignorance; in particular, in the case of complete ignorance we obtain the classical model, which in fact consists only of the first level. The measure of relative plausibility and the MPL criterion are linked by a simple representation theorem for preferences between randomized decisions.

Sunto

This thesis studies the possibility of basing statistical decisions directly on the likelihood function. By describing the problems of statistical inference as particular decision problems, the usual likelihood-based inference methods can be obtained as special cases. They are among the most appreciated methods of statistical inference, although in general they are not optimal according to the repeated sampling principle. In fact, their principal advantages, implied by their being based on the likelihood function, are intuitiveness, general applicability, and asymptotic optimality. These properties are shared by every method that bases decisions on the likelihood function in some reasonable way.

The decision criteria based on the likelihood function are analyzed, and some of their decision theoretic properties are related to their representations by means of nonadditive integrals; in this connection, the theory of integration with respect to nonadditive measures is reviewed and extended. Some statistical properties of the likelihood-based decision criteria are studied as well, and a paradigmatic example of a decision problem is examined in more detail: the problem of point estimation in Euclidean spaces. A simple and intuitive decision criterion turns out to be particularly interesting: the MPL criterion, which consists in using the likelihood of the statistical models to weight the respective losses before applying the minimax criterion.

The likelihood function can be interpreted as a measure of the relative plausibility of the statistical models considered, and thus as the second level of a hierarchical representation of uncertain knowledge, of which the statistical models constitute the first level. There is a clear analogy with the Bayesian model, in which the second level consists of a probability measure, but the measure of relative plausibility considered in this thesis can also describe ignorance; in particular, in the case of total ignorance we obtain the classical model, which in fact consists only of the first level. The measure of relative plausibility is linked to the MPL criterion by a simple representation theorem for preferences between randomized decisions.

Notation

Let A and B be two sets. The inclusion of A in B is denoted by A ⊆ B, while A ⊂ B denotes the strict inclusion; the set B \ A is the difference of B and A. The power set of A is denoted by 2^A, and |A| denotes the cardinality of A. If 𝒜 is a set of sets, then ⋃𝒜 is the union of the elements of 𝒜, with the convention that ⋃∅ = ∅.

The set of all functions f : A → B is denoted by B^A. If f ∈ B^A, then f⁻¹ : 2^B → 2^A is the mapping assigning to each subset of B its inverse image under f; but when f is bijective, f⁻¹ can also denote its inverse function (the meaning of f⁻¹ should be clear from the context). The image of a ∈ A under f ∈ B^A is denoted by f(a) or f[a], but the brackets can be omitted when there is no ambiguity; for instance, the inverse image of {b} ⊆ B can be expressed as f⁻¹{b}. If f ∈ B^A, and C ⊆ A, then f|_C is the restriction of f to C; and if D is a set, and g ∈ D^B, then g ∘ f is the composition of f and g. The identity function on A is denoted by id_A, while I_A denotes the indicator function of A (the domain of I_A should be clear from the context).

The set of real numbers is denoted by ℝ, while ℝ̄ denotes the set of (affinely) extended real numbers: ℝ̄ = ℝ ∪ {∞, −∞}. In the interval notation, a round bracket indicates the exclusion of the corresponding endpoint, while a square bracket indicates its inclusion; for instance, if x, y ∈ ℝ̄, then

(x, y] = {z ∈ ℝ̄ : x < z ≤ y}.

The set of positive real numbers is denoted by P, while P̄ denotes the set of nonnegative extended real numbers: P = (0, ∞) and P̄ = [0, ∞].

The supremum and infimum of S ⊆ P̄ are denoted respectively by sup S and inf S, with the convention that sup ∅ = 0 and inf ∅ = ∞. If f ∈ P̄^A, and C ⊆ A, then sup f is the supremum of f, and sup_C f is the supremum of f|_C:

sup_C f = sup_{a ∈ C} f(a) = sup {f(a) : a ∈ C} = sup f|_C;

and analogously for the infimum. When the supremum is attained, max can be substituted for sup; and when the infimum is attained, min can be substituted for inf.

Expressions involving functions are to be interpreted pointwise, when they do not make sense otherwise; for instance, if f, g ∈ P̄^A, then f ≤ g means that f(a) ≤ g(a) for all a ∈ A. Accordingly, if b ∈ B, then b can also denote the function in B^A with constant value b (the domain A should be clear from the context); for instance, if C ⊆ A, and I_C has domain A, then 1 − I_C is the indicator function of A \ C. A pair of curly brackets enclosing an expression involving functions denotes the subset of the domain of the functions consisting of all arguments for which the expression is satisfied; for instance, if f ∈ P̄^A, and x ∈ P̄, then

{f > x} = {a ∈ A : f(a) > x}.

As usual in measure theory, we define 0 · ∞ = ∞ · 0 = 0; but otherwise we adopt the standard definitions about extended real numbers. In particular, ∞ − ∞ is undefined, and thus in general the difference of two functions f, g ∈ P̄^A is undefined: with a slight abuse of notation we define

‖f − g‖ = sup_{a ∈ A} |f(a) − g(a)|.

The notation is reasonable, because when f and g are bounded, ‖f − g‖ is the supremum norm of their difference; hence, the function (f, g) ↦ ‖f − g‖ on P̄^A × P̄^A extends the metric induced by the supremum norm. The notion of uniform convergence can be extended accordingly: the sequence f₁, f₂, … ∈ P̄^A converges uniformly to f ∈ P̄^A when lim_{n→∞} ‖f_n − f‖ = 0.

If f ∈ ℝ̄^ℝ, and y ∈ ℝ, then lim_{x↑y} f(x) and lim_{x↓y} f(x) denote the limits of f at y from the left and from the right, respectively (when the limits exist). If x ∈ P̄, then log x is the (extended) natural logarithm of x. If k is a positive integer, and δ ∈ P, then B_δ = {x ∈ ℝ^k : |x| ≤ δ} is the closed ball in ℝ^k with center 0 and radius δ, and B_δ^c = ℝ^k \ B_δ is its complement (the value of k should be clear from the context). If k is a positive integer, and y ∈ ℝ^k, then the function t_y : x ↦ x − y on ℝ^k is the translation by −y (the value of k should be clear from the context); for instance, if δ ∈ P, then t_y⁻¹(B_δ) is the closed ball with center y and radius δ.

Let (Ω, C, P) be a probability space. A random object is a measurable function X : Ω → 𝒳, where 𝒳 is a set equipped with a σ-algebra containing all singletons of 𝒳; the random object X is continuous if P{X = x} = 0 for all x ∈ 𝒳, and it is discrete if there is a finite or countable 𝒴 ⊆ 𝒳 such that P{X ∈ 𝒴} = 1. A random variable is a random object X : Ω → ℝ, where ℝ is equipped with the Borel σ-algebra.

The symbol □ marks the end of a proof, while the symbol ◇ marks the end of an example.

1

Introduction

In the present chapter the basic concepts of statistical decision problem and

of likelihood function are introduced. The usual approaches to statistical

decision problems and the principal likelihood-based inference methods

are briefly described (avoiding foundational issues). Finally, the likelihood-

based decision criteria appearing in the literature are presented.

1.1 Statistical Decisions

A statistical decision problem is described by a loss function

L : 𝒫 × 𝒟 → P̄,

where 𝒫 is a set of probability measures on a measurable space (Ω, 𝒜), and 𝒟 is a set.

𝒫 is the set of statistical models considered: these are mathematical representations of some aspects of the reality under consideration (in particular, the observed data can be represented by elements of 𝒜). It is important to note that we use the term "model" to indicate a single probability measure (without unknown parameters), while in the statistical literature it is also used to indicate a whole family of probability measures (such a family can be a subset of 𝒫). It is not assumed that one of the models in 𝒫 is in some sense "true", but our conclusions will be based on 𝒫 and can thus not be expected to be satisfactory when all models in 𝒫 poorly represent the reality. No structure is imposed on 𝒫; in particular, it is possible to enlarge 𝒫 in order to improve the robustness of the conclusions.


𝒟 is the set of possible decisions: a solution of the statistical decision problem is a subset of 𝒟 consisting of the decisions that are optimal according to some criterion. No structure is imposed on 𝒟; in particular, it is possible to restrict 𝒟 by excluding the dominated decisions.

The loss function L summarizes all important aspects of the possible decisions: L(P, d) is the loss we would incur, according to the model P ∈ 𝒫, by making the decision d ∈ 𝒟. The loss function is usually finite, but infinite values pose no problems. The decision d is correct for the model P if L(P, d) = 0 (it is not necessary that for each model there is a correct decision, or that each decision is correct for some model), and the unit in which the loss is expressed is of no importance. The loss is thus a relative quantity: that is, for all c ∈ P, the loss functions L and cL are equivalent, while the loss functions L and L + c are not (because of the peculiar meaning of the value 0). For each d ∈ 𝒟, let l_d be the function on 𝒫 defined by

l_d(P) = L(P, d) for all P ∈ 𝒫.

If l_d ≤ l_{d′}, then d′ cannot be preferred to d; and if l_d = l_{d′}, then d and d′ are equivalent (in the sense that neither can be preferred to the other). A decision d is said to dominate a decision d′ if l_d ≤ l_{d′} and l_d ≠ l_{d′}; in this case, if we have to choose between d and d′, then we must choose d (there is no reason for choosing d′). A decision criterion usually corresponds to a monotonic functional V : P̄^𝒫 → P̄ (where monotonic means that l ≤ l′ implies V(l) ≤ V(l′)), in the sense that it can be expressed as follows: minimize V(l_d). In this case, an optimal decision (that is, a decision d minimizing V(l_d)) can in general be dominated, but only by another optimal decision; if necessary, the criterion can thus be refined by discarding from the set of optimal decisions those that are dominated (by other optimal decisions).

Wald (1939) introduced the statistical decision problem as a generalization of the problems of statistical estimation and testing hypotheses. For example, if 𝒢 is a set, and g : 𝒫 → 𝒢 is a mapping, then the problem of estimating g(P) can be described by a loss function L on 𝒫 × 𝒢 expressing the estimation error: in particular, g(P) is a correct decision for P. Inference problems (such as estimation problems) can also be stated without reference to a particular loss function, but some expression of the loss incurred is actually needed in order to compare different solutions. Often some standard loss function is used, as in the following simple example, which will be considered throughout the present chapter.


Example 1.1. Let X be the number of successes in n independent binary experiments (Bernoulli trials) with common success probability p. Assume that we know n, and we want to estimate p with squared error loss. Let 𝒫 = {P_p : p ∈ [0, 1]} be a family of probability measures (the choice of the measurable space (Ω, 𝒜) is irrelevant) such that under each model P_p the random variable X has the binomial distribution with parameters n and p. The estimation problem is then described by the loss function L : (P_p, d) ↦ (p − d)² on 𝒫 × [0, 1]. ◇
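The setup of Example 1.1 can be sketched numerically (an illustrative sketch, not code from the thesis; the function names `pmf` and `loss` are my own):

```python
from math import comb

def pmf(x, n, p):
    """P_p{X = x} for X with the binomial distribution with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def loss(p, d):
    """Squared error loss L(P_p, d) = (p - d)^2 of the estimate d under P_p."""
    return (p - d) ** 2

# The estimate d = p is correct for the model P_p: its loss is 0.
zero = loss(0.3, 0.3)
```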

1.1.1 Classical Decision Theory

Assume first that we have no information at all about which models are more or less plausible. In this case we only know that, according to the models in 𝒫, the loss we would incur by making the decision d lies in the range of l_d, and thus in particular between inf l_d and sup l_d. The functional V = sup gives us a useful upper bound for the loss incurred, and it corresponds to the most important general decision criterion in the case of complete ignorance about the models in 𝒫: the minimax criterion:

minimize sup l_d.

By contrast, the functional V = inf gives us a less interesting lower bound for the loss incurred, and in particular all d ∈ 𝒟 that are correct for some P_d ∈ 𝒫 are optimal according to the corresponding decision criterion (called minimin criterion), which is therefore often useless. A compromise between the minimax and the minimin criteria is given by the Hurwicz criterion with optimism parameter λ ∈ (0, 1):

minimize λ inf l_d + (1 − λ) sup l_d.

This criterion was introduced by Hurwicz (1951); it reduces to the minimax criterion when each d ∈ 𝒟 is correct for some P_d ∈ 𝒫.

Example 1.2. The minimax criterion applied to the estimation problem of Example 1.1 (without considering the realization of X) leads to the estimate d = 1/2. Since each possible estimate d ∈ [0, 1] is correct for the model P_d ∈ 𝒫, the minimin criterion is useless, and the Hurwicz criterion reduces to the minimax criterion. ◇
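The conclusions of Example 1.2 can be checked by a small grid search (a sketch of mine, under the assumption that a fine grid over [0, 1] approximates the sets of models and estimates well enough):

```python
def sup_loss(d, grid):
    """Worst-case loss sup_p (p - d)^2 of the estimate d over the models P_p."""
    return max((p - d) ** 2 for p in grid)

grid = [i / 1000 for i in range(1001)]  # models P_p with p in [0, 1]

# Minimax criterion: minimize the worst-case loss over candidate estimates.
# Since sup_p (p - d)^2 = max(d, 1 - d)^2, the minimum is at d = 1/2.
minimax_d = min(grid, key=lambda d: sup_loss(d, grid))

# Minimin criterion: every d in [0, 1] is correct for P_d, so the best-case
# loss inf_p (p - d)^2 is 0 for every candidate, and the criterion is useless.
min_loss = min((p - 0.3) ** 2 for p in grid)
```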


Consider now that an event A ∈ 𝒜 has been observed: this observation gives us some information about the relative plausibility of the models in 𝒫, and to be reasonable a decision criterion must be able to use this information. The classical approach consists in considering the observation as a particular realization A = {X = x_A} of a random object X : Ω → 𝒳, in selecting a whole decision function δ : 𝒳 → 𝒟 instead of a single decision d = δ(x_A), and in evaluating the different decision functions without considering that X = x_A has already been observed. In this sense, the selection of the decision function is based on a pre-data evaluation. Let Δ be a set of decision functions δ such that for each P ∈ 𝒫 the function x ↦ L[P, δ(x)] on 𝒳 is measurable: under each model P the loss we would incur by using the decision function δ ∈ Δ is the random variable L[P, δ(X)]. In order to compare different decision functions, the random loss is reduced to a single, representative value: usually the expected value, but other quantities (such as quantiles) can be used instead. That is, for each model P ∈ 𝒫 and each decision function δ ∈ Δ we have a representative value L′(P, δ) of the random loss: the loss function L′ on 𝒫 × Δ describes the problem of selecting a decision function δ ∈ Δ. If L′(P, δ) is defined as the expected value of L[P, δ(X)] with respect to P, then L′ is called risk function, and the minimax criterion applied to the decision problem described by L′ is the minimax risk criterion introduced by Wald (1939).

Example 1.3. The minimax risk criterion applied to the estimation problem of Example 1.1 (with 𝒳 = {0, …, n}, and Δ defined as the set of all decision functions on 𝒳) leads to the estimate

λ_n (X/n) + (1 − λ_n) (1/2), with λ_n = √n / (√n + 1)

(see Hodges and Lehmann, 1950). Note that the case with n = 0 corresponds to the case without observations (1/2 is the estimate obtained in Example 1.2), while for large n the estimate is approximately X/n. ◇
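The estimate of Example 1.3 can be checked numerically: its characteristic property is that its risk (expected squared error loss) is constant in p, equal to 1/(4(√n + 1)²). This is an illustrative sketch of mine, not code from the thesis:

```python
from math import comb, sqrt

def minimax_risk_estimate(x, n):
    """lam*(x/n) + (1 - lam)*(1/2) with lam = sqrt(n)/(sqrt(n) + 1);
    equivalently (x + sqrt(n)/2) / (n + sqrt(n))."""
    lam = sqrt(n) / (sqrt(n) + 1)
    return lam * (x / n) + (1 - lam) * 0.5

def risk(delta, n, p):
    """Expected loss E[(p - delta(X, n))^2] under P_p, with X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (p - delta(x, n)) ** 2
               for x in range(n + 1))

n = 20
risks = [risk(minimax_risk_estimate, n, p) for p in (0.1, 0.3, 0.5, 0.9)]
constant = 1 / (4 * (sqrt(n) + 1) ** 2)  # the constant value of the risk
```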

The classical approach requires the definition of the random object X; that is, it requires the consideration of what could have been observed instead of A. In general this is not a problem, since the alternatives to A have already been considered when defining 𝒫. More problematic can be the assumption that the same decision problem would have been faced for all possible realizations of X. Even if this assumption is reasonable, the classical approach has two main drawbacks with respect to conditional methods (that is, methods selecting a single decision in the light of the observation X = x_A, without considering the other possible realizations of X). The first one is that selecting a decision function is in general much more complicated than selecting a single decision. The second one is that the selection of the decision function is based on a pre-data evaluation, whose meaning can be questionable once the data have been observed (see for instance Goutis and Casella, 1995).

1.1.2 Bayesian Decision Theory

When no data are observed, the Bayesian approach consists in averaging over the models in 𝒫 by means of a probability measure π on (𝒫, 𝒞), where 𝒞 is a σ-algebra of subsets of 𝒫 such that l_d is measurable for all d ∈ 𝒟: we obtain the Bayesian criterion:

minimize E_π(l_d).

If for each A ∈ 𝒜 the function P ↦ P(A) on 𝒫 is measurable, then π can be combined with the models in 𝒫 (interpreted as a stochastic kernel) to obtain a probability measure P_π on (𝒫 × Ω, ℰ), where ℰ is the σ-algebra of subsets of 𝒫 × Ω generated by the sets C × A with C ∈ 𝒞 and A ∈ 𝒜. If we observe an event A ∈ 𝒜, then we can condition the probability measure P_π on the event 𝒫 × A ∈ ℰ, obtaining in particular an updated marginal probability measure on (𝒫, 𝒞), which can then be used in the Bayesian criterion. The Bayesian approach consists thus in combining the models in 𝒫 to obtain a single model P_π, in updating it by conditioning it on the observed data, and in reducing the decision problem described by the loss function L on 𝒫 × 𝒟 to the one described by the loss function L″ on Π × 𝒟, where Π = {P_π}, and L″(P_π, d) is defined as the expected value of L(P, d) with respect to the updated version of P_π. Since Π is a singleton, there is no concern about the decision criterion to be applied to the decision problem described by L″.

Example 1.4. For the estimation problem of Example 1.1, let π be a probability measure on 𝒫 corresponding to a beta distribution for the parameter p ∈ [0, 1]. When observing X = x, the updating of π corresponds to increasing the two parameters of the beta distribution by x and n − x, respectively. The estimate of p obtained by applying the Bayesian criterion is the mean of the resulting beta distribution: if a, b ∈ P are the two parameters of the initial beta distribution corresponding to π, then the estimate is

(X + a) / (n + a + b) = λ_n (X/n) + (1 − λ_n) a/(a + b), with λ_n = n / (n + a + b).

This estimate depends strongly on the choice of a, b ∈ P; if a + b is large with respect to n, then the estimate is approximately a/(a + b), while if a + b is small with respect to n, then the estimate is approximately X/n.

When the only available information about the relative plausibility of the models in 𝒫 is provided by the observation X = x, the two most popular choices of π are the probability measures on 𝒫 corresponding to the beta distributions with parameters a = b = 1 (that is, the uniform distribution on [0, 1], proposed by Bayes, 1763) and a = b = 1/2 (proposed by Jeffreys, 1946), respectively. ◇

The Bayesian approach is conditional (the selection of a decision is based on a post-data evaluation), and it possesses the following important property of "temporal coherence" between pre-data and post-data evaluations. Let X : Ω → 𝒳 be a random object, and define the decision function δ′ : 𝒳 → 𝒟 as follows: let δ′(x) be the decision that we would select after having observed X = x. If in the pre-data decision problem considered in Subsection 1.1.1 we define L′ as the risk function, some regularity conditions are satisfied (see for example Brown and Purves, 1973), and Δ is sufficiently wide to contain δ′, then δ′ is an optimal decision, according to the Bayesian criterion, for the decision problem described by L′. The temporal coherence is useful because it allows us to construct an optimal decision function without having to compare decision functions, but only single decisions. Thanks to the additivity of E_π, the Bayesian approach possesses also the following important property of "coherence with respect to additive loss". Let L₁ and L₂ be two loss functions on 𝒫 × 𝒟₁ and 𝒫 × 𝒟₂, respectively, let 𝒟 = 𝒟₁ × 𝒟₂, and let L be the loss function on 𝒫 × 𝒟 defined by L[P, (d₁, d₂)] = L₁(P, d₁) + L₂(P, d₂) (for all P ∈ 𝒫 and all (d₁, d₂) ∈ 𝒟). If the Bayesian criterion can be applied to the three decision problems (in the sense that the measurability condition stated at the beginning of the present subsection is satisfied), and d₁ and d₂ are optimal for the decision problems described by L₁ and L₂, respectively, then (d₁, d₂) is optimal for the decision problem described by L. The coherence with respect to additive loss is useful because for some decision problems it allows us to construct an optimal decision by considering only simpler problems.


These coherence properties are possible because in the (strict) Bayesian approach there is no uncertainty about the model: Π is a singleton. In fact, the central problem of the Bayesian approach is the choice of the averaging probability measure π, which is usually interpreted either as a statistical model, or as a description of subjective uncertain knowledge. Independently of the interpretation of π, it is reasonable to allow some uncertainty about the model P_π by assuming only that π is an element of some set Γ of probability measures on (𝒫, 𝒞); the set Γ can also be interpreted as an imprecise probability measure on (𝒫, 𝒞): see for example Walley (1991) and Weichselberger (2001). Each element of the set Π = {P_π : π ∈ Γ} can be conditioned on the observed data, and the decision problem described by the loss function L on 𝒫 × 𝒟 can be reduced to the one described by the loss function L″ on Π × 𝒟, where L″ is defined as above. The minimax criterion applied to the (post-data) decision problem described by L″ is the Γ-minimax criterion, which was introduced by Hurwicz (1951) for the pre-data decision problem under the name of "generalized Bayes-minimax principle"; but the pre-data and the post-data applications of the Γ-minimax criterion do not coincide, since the temporal coherence of the strict Bayesian approach is lost (see for example Augustin, 2003). A decision criterion similar in spirit to the Γ-minimax is the restricted Bayesian criterion, which was introduced by Hodges and Lehmann (1952) for the pre-data decision problem. It consists in applying the Bayesian criterion (with respect to an averaging probability measure π) to the decision problem described by the risk function L′, but only after having restricted the set Δ to those decision functions for which the expected loss is bounded above by a particular constant. The intent of the restriction of Δ is to limit the importance of the particular choice of π.

Example 1.5. The application of the restricted Bayesian criterion is difficult even for the simple estimation problem of Example 1.1: some limited results are reported by Hodges and Lehmann (1952). The application of the Γ-minimax criterion is much simpler, but Γ must be chosen carefully. For instance, if for the estimation problem of Example 1.1 we choose as Γ the family of probability measures on 𝒫 corresponding to the family of beta distributions for the parameter p ∈ [0, 1], then the resulting estimate is 1/2, independently of n and of the observation X = x. The family of beta distributions is a natural choice for Γ (it is the so-called conjugate family for this estimation problem, and for instance the two popular choices of π considered at the end of Example 1.4 belong to it): this example simply shows that the Γ-minimax criterion is useless when Γ is too wide. ◇
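The Γ-minimax computation of Example 1.5 can be sketched numerically. For a Beta(a, b) prior, the posterior expected loss of an estimate d is the posterior variance plus the squared distance of d from the posterior mean; taking the supremum over a wide grid of beta priors (whose posterior means can be pushed arbitrarily close to 0 and 1 by heavy priors) and then minimizing over d recovers the useless estimate 1/2. The particular grid of priors is my own illustrative choice, not from the thesis:

```python
def posterior_expected_loss(d, x, n, a, b):
    """E[(p - d)^2 | X = x] for a Beta(a, b) prior: the posterior is
    Beta(a + x, b + n - x), with mean m and variance m(1 - m)/(n + a + b + 1)."""
    m = (x + a) / (n + a + b)
    v = m * (1 - m) / (n + a + b + 1)
    return v + (m - d) ** 2

x, n = 33, 50
# A very wide Gamma: beta priors whose weight a + b can swamp the n observations,
# pushing the posterior mean anywhere in (0, 1).
priors = [(a, b) for a in (0.5, 1.0, 10.0, 1e4, 1e6)
                 for b in (0.5, 1.0, 10.0, 1e4, 1e6)]
candidates = [i / 100 for i in range(101)]
gamma_minimax_d = min(
    candidates,
    key=lambda d: max(posterior_expected_loss(d, x, n, a, b) for a, b in priors))
```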

1.2 Likelihood Function

Let 𝒫 be a set of statistical models, and let A represent the observed data (𝒫 is a set of probability measures on a measurable space (Ω, 𝒜), and A ∈ 𝒜). The likelihood function lik on 𝒫 induced by the observation of A is defined by

lik(P) = P(A) for all P ∈ 𝒫.

for different models have meaning: proportional likelihood functions are

equivalent. In particular, if the statistic s : X —> S is sufficient for the

random object X : Q —> X, then the likelihood function induced by the

observation X = xa is equivalent to the one induced by the observation

s(X) = s(xa)- That is, (the equivalence class of) the likelihood function

depends only on sufficient statistics.

If X : Ω → 𝒳 is a random object that is continuous under each model in 𝒫, then the observation X = x_A would induce a likelihood function with constant value 0. But in reality, because of the finite precision of all measurements, it is only possible to observe X ∈ N_A, for a neighborhood N_A of x_A. If f_P is a density of X under the model P, then it can be useful to consider the approximation P{X ∈ N_A} ≈ δ f_P(x_A), where δ ∈ P. If this holds for all P ∈ 𝒫, then the function P ↦ f_P(x_A) on 𝒫 is approximately equivalent to the likelihood function induced by the observation X ∈ N_A. When all f_P are densities of X with respect to the same measure on 𝒳, the function P ↦ f_P(x_A) on 𝒫 is often considered as the likelihood function induced by the observation X = x_A, independently of the quality of the above approximation. This alternative definition of likelihood function has some minor problems (such as nonuniqueness), but it leads to an elegant mathematical theory, and all the results that we shall obtain for discrete random objects would be valid also for continuous ones if this alternative definition were used.

Under each model P ∈ 𝒫, the likelihood ratio lik(P)/lik(P′) of P against a different model P′ ∈ 𝒫 almost surely increases without bound when more and more data are observed, if some weak conditions are satisfied. Consequently, when 𝒫 is finite, the likelihood function tends to concentrate around P; and the same holds also when 𝒫 is infinite, if some regularity conditions are satisfied.
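This concentration can be illustrated by simulation (my own sketch, not from the thesis: Bernoulli data are generated under p = 0.6 and compared against a competing model p′ = 0.3; the log likelihood ratio grows roughly linearly, with slope given by the Kullback-Leibler divergence between the two models):

```python
import random
from math import log

rng = random.Random(0)  # fixed seed for reproducibility
p_true, p_alt = 0.6, 0.3  # data-generating model and a competing model

log_ratio = 0.0
trace = []
for _ in range(2000):
    success = rng.random() < p_true
    # contribution of one observation to log(lik(P) / lik(P'))
    if success:
        log_ratio += log(p_true) - log(p_alt)
    else:
        log_ratio += log(1 - p_true) - log(1 - p_alt)
    trace.append(log_ratio)
```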

A parametrization of 𝒫 is a bijection t : 𝒫 → Θ, where the parameter space Θ can be any set, but usually Θ ⊆ ℝ^m. The set 𝒫 can be identified with Θ, and therefore a likelihood function lik on 𝒫 can be identified with lik ∘ t⁻¹, which is called likelihood function on Θ (with respect to t). A pseudo likelihood function on a set 𝒢 with respect to a mapping g : 𝒫 → 𝒢 is a function that, to some extent at least, can be used as if it were a likelihood function on 𝒢. For instance, the function P ↦ f_P(x_A) on 𝒫 considered above as an approximation of the likelihood function induced by the observation X ∈ N_A is a pseudo likelihood function on 𝒫 with respect to id_𝒫. Other pseudo likelihood functions on 𝒫 with respect to id_𝒫 can be obtained for example by modifying the likelihood function in order to take into account the complexity of the models in 𝒫. If lik is a likelihood function on 𝒫, and g : 𝒫 → 𝒢 is a mapping, then the profile likelihood function lik_g on 𝒢 with respect to g is defined by

lik_g(γ) = sup_{g = γ} lik for all γ ∈ 𝒢.

In particular, if t : 𝒫 → Θ is a parametrization of 𝒫, then lik_t is the likelihood function on Θ. The profile likelihood function is a pseudo likelihood function that is defined for all mappings g on 𝒫, and that usually leads to reasonable results; but better results can sometimes be achieved by using other pseudo likelihood functions. In the literature on likelihood-based statistical inference, many alternative pseudo likelihood functions have been proposed for different situations: see for instance Barndorff-Nielsen (1991) and Severini (2000, Chapters 8 and 9), and the references therein.

Example 1.6. Consider the estimation problem of Example 1.1: the mapping $P_p \mapsto p$ is a parametrization of $\mathcal{P}$ with parameter space $[0,1]$. The likelihood function $lik$ on $[0,1]$ induced by the observation $X = x$ satisfies

$lik(p) = \binom{n}{x} p^x (1-p)^{n-x}$  for all $p \in [0,1]$.

By observing the realization of each one of the $n$ binary experiments, we would have obtained the likelihood function $lik'$ on $[0,1]$ defined by

$lik'(p) = p^x (1-p)^{n-x}$  for all $p \in [0,1]$,


Figure 1.1. Profile likelihood functions from Example 1.6.

where $x$ is the number of successes. The likelihood functions $lik$ and $lik'$ are equivalent (that is, $lik \propto lik'$), and in fact the number of successes is a sufficient statistic.
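The equivalence $lik \propto lik'$ can be checked numerically; the following sketch (Python, with the hypothetical data $n = 10$, $x = 8$, not from the thesis) shows that the two likelihood functions differ only by the constant factor $\binom{n}{x}$.

```python
from math import comb

# Hypothetical data (not from the thesis): n = 10 binary experiments, x = 8 successes.
n, x = 10, 8

def lik(p):        # likelihood from observing only the number of successes
    return comb(n, x) * p**x * (1 - p)**(n - x)

def lik_prime(p):  # likelihood from observing the individual outcomes
    return p**x * (1 - p)**(n - x)

# The two functions differ only by the constant factor comb(n, x) = 45, so they
# are equivalent (lik is proportional to lik'): the number of successes is a
# sufficient statistic.
ratios = [lik(p) / lik_prime(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(ratios)  # every ratio equals comb(10, 8) = 45
```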

Let $g$ be the function on $\mathcal{P}$ assigning to each model $P_p$ the probability of observing at least 3 successes in 5 future independent binary experiments with the same success probability $p$:

$g(P_p) = \sum_{k=3}^{5} \binom{5}{k} p^k (1-p)^{5-k} = 6p^5 - 15p^4 + 10p^3$  for all $p \in [0,1]$.
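The closed form of $g$ follows by expanding the binomial sum; a quick numerical check (Python sketch, not from the thesis):

```python
from math import comb

def g_sum(p):   # probability of at least 3 successes in 5 future experiments
    return sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(3, 6))

def g_poly(p):  # the expanded polynomial form from the text
    return 6*p**5 - 15*p**4 + 10*p**3

# The two expressions agree on [0, 1], and g(P_0.7) is approximately 0.837.
print(max(abs(g_sum(p) - g_poly(p)) for p in [i / 100 for i in range(101)]))
print(round(g_poly(0.7), 3))  # 0.837
```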

Figure 1.1 shows the graphs of the profile likelihood functions $lik_g$ on $[0,1]$ (actually, $g$ is a parametrization of $\mathcal{P}$) induced by the three observations $X = 8$ with $n = 10$ (dotted line), $X = 33$ with $n = 50$ (dashed line), and $X = 178$ with $n = 250$ (solid line), respectively (the three functions have been scaled to have the same maximum). The data have been generated using the model $P_{0.7}$, and in fact the profile likelihood function tends to concentrate around $g(P_{0.7}) \approx 0.837$. ◊

1.2.1 Likelihood-Based Inference

Let $lik$ be a likelihood function (induced by some observation $A$) on a set $\mathcal{P}$ of statistical models, let $\mathcal{G}$ be a set, and let $g : \mathcal{P} \to \mathcal{G}$ be a mapping. The maximum likelihood estimate $\gamma_{ML}$ of $g(P)$ is the $\gamma \in \mathcal{G}$ maximizing


$X$     $n$     $\gamma_{ML}$   $LR(\mathcal{H})$   $\{lik_g \geq \beta \sup lik\}$
8       10      0.942           0.146               (0.321, 1.000)
33      50      0.780           0.074               (0.461, 0.952)
178     250     0.852           UT10                (0.741, 0.927)

Table 1.1. Likelihood-based inferences from Example 1.7.

$lik_g$, when such a $\gamma$ exists and is unique. The method of maximum likelihood is by far the most popular general method of statistical estimation; it was introduced (in a general form, but limited to the case of parametrizations $g$) by Fisher (1922). The likelihood ratio of a set $\mathcal{H} \subseteq \mathcal{P}$ is the value

$LR(\mathcal{H}) = \frac{\sup_{\mathcal{H}} lik}{\sup lik}.$

The likelihood ratio test of the null hypothesis $H_0 : P \in \mathcal{H}$ versus the alternative $H_1 : P \in \mathcal{P} \setminus \mathcal{H}$ rejects the null hypothesis when $LR(\mathcal{H}) < \beta$, where the critical value $\beta \in (0,1)$ is chosen in order to obtain a test of a particular level (that is, $\beta$ is chosen on the basis of a pre-data evaluation). The likelihood ratio tests were introduced by Neyman and Pearson (1928), and have a central place in both the hypothesis testing literature and practice; they are closely related to the following confidence regions. The likelihood-based confidence region for $g(P)$ with cutoff point $\beta \in (0,1)$ is the set

$\{\gamma \in \mathcal{G} : LR\{g = \gamma\} \geq \beta\} = \{lik_g \geq \beta \sup lik\}.$

The (pre-data) coverage probability of a likelihood-based confidence region with cutoff point $\beta$ depends on the set $\mathcal{P}$ of statistical models considered (and on the mapping $g$): this is the so-called calibration problem of likelihood-based inference.
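These definitions can be illustrated numerically; the following grid-based sketch (Python, not from the thesis) reproduces the first row of Table 1.1, using the mapping $g$ of Example 1.6, the hypothesis $\mathcal{H} = \{g \leq \frac{1}{2}\}$, and the cutoff point $\beta = e^{-q/2}$, where $q = 6.6349$ is the 99%-quantile of the $\chi^2$ distribution with one degree of freedom.

```python
from math import exp

# Grid-based sketch reproducing the first row of Table 1.1:
# X = 8 successes in n = 10 experiments.
n, x = 10, 8
beta = exp(-6.6349 / 2)           # -2 log(beta) = 99%-quantile of chi-squared(1)

def lik(p): return p**x * (1 - p)**(n - x)       # constant factor omitted
def g(p):   return 6*p**5 - 15*p**4 + 10*p**3    # parametrization from Example 1.6

grid = [i / 100000 for i in range(100001)]
sup_lik = max(lik(p) for p in grid)              # attained at p_ML = x/n = 0.8

gamma_ml = g(x / n)                                          # ML estimate of g(P)
lr_H = max(lik(p) for p in grid if g(p) <= 0.5) / sup_lik    # LR of H = {g <= 1/2}
region = [g(p) for p in grid if lik(p) >= beta * sup_lik]    # region on the g-scale
print(round(gamma_ml, 3), round(lr_H, 3),
      round(min(region), 3), round(max(region), 3))  # 0.942 0.146 0.321 1.0
```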

Example 1.7. The method of maximum likelihood applied to the estimation problem of Example 1.1 leads to the estimate $p_{ML} = \frac{x}{n}$ (when $n \geq 1$; while for $n = 0$ the maximum likelihood estimate is undefined). Table 1.1 gives some likelihood-based inferences for the mapping $g$ and the data considered in Example 1.6 (the corresponding profile likelihood functions $lik_g$ are plotted in Figure 1.1). The maximum likelihood estimate $\gamma_{ML}$ tends to

the value $g(P_{0.7}) \approx 0.837$ of the function $g$ for the model $P_{0.7}$ used to generate the data. The likelihood ratio $LR(\mathcal{H})$ of the hypothesis $\mathcal{H} = \{g \leq \frac{1}{2}\}$ (which states that the probability $g(P_p)$ is at most $\frac{1}{2}$) tends to 0. The likelihood-based confidence region for $g(P)$ with cutoff point $\beta \approx 0.036$ (so that $-2 \log \beta$ is the 99%-quantile of the $\chi^2$ distribution with one degree of freedom) is an interval tending to concentrate around $g(P_{0.7}) \approx 0.837$. For large $n$ the coverage probability of this likelihood-based confidence region is approximately 99%; the exact coverage probability for $n = 10$ is plotted in Figure 1.2 as a function of $p \in [0,1]$. ◊

Figure 1.2. Coverage probability of the likelihood-based confidence region from Example 1.7, when n = 10.

The method of maximum likelihood is conditional, while the tests and confidence regions based on the likelihood ratio are fully conditional only if the choice of the critical value or cutoff point $\beta$ is not based directly on pre-data evaluations (for example, it can be based on experience and on analogies with simple situations). Although they are (at least partially) conditional, in general the likelihood-based inference methods perform well also from the pre-data point of view. Under suitable regularity conditions, they are consistent and even asymptotically efficient. But they also have good pre-data performances in a surprisingly large number of important small sample problems, even though examples with bad performances can be easily constructed.

If we consider a set $\Gamma$ of averaging probability measures $\pi$ on $\mathcal{P}$, as done at the end of Subsection 1.1.2, then we obtain a set $\Pi$ of probability measures $P_\pi$ on $\mathcal{P} \times \Omega$. The observation of an event $A \in \mathcal{A}$ induces a likelihood function $lik$ on $\Gamma$, in the sense that the observation $\mathcal{P} \times A$ induces a likelihood function on $\Pi$, and the mapping $P_\pi \mapsto \pi$ is a parametrization of $\Pi$. The likelihood $lik(\pi)$ of $\pi$ is the expected value of $P(A)$ with respect to $\pi$: that is, it is the average with respect to $\pi$ of the likelihood function on $\mathcal{P}$ induced by the observation of $A$. The likelihood-based inference methods can be applied to $lik$ in order to obtain inferences about the averaging probability measures $\pi$ in $\Gamma$, as proposed for example by Good (1965).

1.3 Likelihood-Based Statistical Decisions

Since the statistical decision problems were introduced as a generalizationof the problems of statistical estimation and testing hypotheses, and the

most appreciated general methods for these inference problems are based

directly on the likelihood function, it is natural to consider general deci¬

sion criteria based directly on the likelihood function. Of particular interest

are decision criteria leading to maximum likelihood estimates and likeli¬

hood ratio tests when applied to (some standard form of) estimation and

hypothesis testing problems, respectively.

1.3.1 MPL Criterion

Consider a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$: let $A$ represent the observed data ($\mathcal{P}$ is a set of probability measures on a measurable space $(\Omega, \mathcal{A})$, and $A \in \mathcal{A}$), and let $lik$ be the likelihood function on $\mathcal{P}$ induced by the observation of $A$. If we use the likelihood of the statistical models to weight the respective losses, before applying the minimax criterion, then we obtain the MPL criterion:

minimize $\sup lik\, l_d$,

where $l_d$ denotes the function $P \mapsto L(P, d)$ on $\mathcal{P}$.

The MPL criterion was introduced by Cattaneo (2005): "MPL" means "Minimax Plausibility-weighted Loss", and for the present situation "plausibility" can simply be read as "likelihood". A likelihood function with constant value $c > 0$ is equivalent to the one induced by the observation of $\Omega$, which corresponds to having observed no data; with such a likelihood function, the MPL criterion reduces to the minimax criterion. In general, the likelihood function measures the relative plausibility of the models (in the light of the observed data alone), and the MPL criterion is a simple, intuitive modification of the minimax criterion, making use of the information provided by the likelihood function: the more plausible a model, the larger its influence on the evaluation of the different decisions. In the extreme case of a likelihood function proportional to $I_{\{P\}}$ for a model $P \in \mathcal{P}$ (that is, $P$ is infinitely more plausible than every other model in $\mathcal{P}$), a decision $d \in \mathcal{D}$ is optimal according to the MPL criterion when it minimizes $L(P, d)$: in particular, if $d$ is correct for $P$, then $d$ is optimal.

Example 1.8. The MPL criterion applied to the estimation problem of Example 1.1 leads to the (unique) estimate

$\arg\min_{d \in [0,1]} \max_{p \in [0,1]} p^x (1-p)^{n-x} (p - d)^2.$

Except for the simplest cases (with very small $n$, or with $x = \frac{n}{2}$), it is not possible to express this estimate in a simple form, but its numerical evaluation poses no problems. When $n = 0$ or $n = 1$, this estimate coincides with the one obtained by applying the minimax risk criterion (considered in Example 1.3); but as $n$ increases, the present estimate tends to the maximum likelihood estimate $p_{ML} = \frac{x}{n}$ (considered in Example 1.7) faster than the one obtained by applying the minimax risk criterion. Both estimates obtained by applying the Bayesian criterion with as initial averaging probability measures $\pi$ on $\mathcal{P}$ the two popular choices considered at the end of Example 1.4 have a behavior (for increasing $n$) similar to the one of the present estimate (which coincides with both of them when $n = 0$, and with the one based on the proposal of Jeffreys when $n = 1$). Figure 1.3 shows, for various $n$, the graphs of the expected losses (as functions of $p \in [0,1]$) for the estimates obtained by applying the MPL criterion (solid line), the minimax risk criterion (dotted horizontal line), the method of maximum likelihood (dotted curved line), and the Bayesian criterion for the two choices of $\pi$ considered above (dashed lines: when $n \geq 1$ and $p = \frac{1}{2}$, the resulting expected loss is larger for the proposal of Jeffreys than for the one of Bayes). ◊
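The numerical evaluation mentioned above can be sketched as follows (Python grid search, assuming the quadratic estimation error of Example 1.1; not the thesis's code):

```python
# Grid-based sketch of the MPL estimate of Example 1.8:
# minimize over d the maximum over p of lik(p) * (p - d)^2.
def mpl_estimate(n, x, steps=400):
    ps = [i / steps for i in range(steps + 1)]
    liks = [p**x * (1 - p)**(n - x) for p in ps]
    best_d, best_val = None, float("inf")
    for j in range(steps + 1):
        d = j / steps
        val = max(l * (p - d)**2 for p, l in zip(ps, liks))
        if val < best_val:
            best_d, best_val = d, val
    return best_d

print(mpl_estimate(0, 0))   # 0.5: with no data, MPL reduces to the minimax criterion
print(mpl_estimate(10, 8))  # lies strictly between 1/2 and the ML estimate 0.8
```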

Let $\mathcal{G}$ be a set, and let $g : \mathcal{P} \to \mathcal{G}$ be a mapping. If the loss function $L$ depends on the models $P$ only through $g(P)$, in the sense that there is a function $L'$ on $\mathcal{G} \times \mathcal{D}$ such that $L(P, d) = L'[g(P), d]$ for all $P \in \mathcal{P}$ and all $d \in \mathcal{D}$, then the MPL criterion can be expressed as follows:

minimize $\sup lik_g\, l'_d$,

Figure 1.3. Expected losses from Example 1.8 (four panels: n = 1, n = 10, n = 100, and n = 1000).

where for each $d \in \mathcal{D}$ the function $l'_d$ on $\mathcal{G}$ is defined by $l'_d(\gamma) = L'(\gamma, d)$ (for all $\gamma \in \mathcal{G}$). That is, the MPL criterion automatically employs the profile likelihood function $lik_g$ on $\mathcal{G}$; but other pseudo likelihood functions on $\mathcal{G}$ (with respect to $g$) can be used instead of $lik_g$ in the above version of the MPL criterion, if they are expected to give better results.

When $\mathcal{G}$ is finite and we want to estimate $g(P)$, a simple way to define an estimation error is to assign a constant error to estimates $\gamma \neq g(P)$, while assigning no error to the correct estimates $\gamma = g(P)$. That is, the estimation problem can be described by the loss function $I_W$ on $\mathcal{P} \times \mathcal{G}$, where $W = \{(P, \gamma) \in \mathcal{P} \times \mathcal{G} : g(P) \neq \gamma\}$. If $lik_g$ has a unique maximum at $\gamma = \gamma_{ML}$, then the maximum likelihood estimate $\gamma_{ML}$ of $g(P)$ is the unique optimal decision, according to the MPL criterion, for the decision problem described by $I_W$. This is in general not true when $\mathcal{G}$ is infinite, and in fact in this case the loss function $I_W$ is usually unreasonable. When the (finite or infinite) set $\mathcal{G}$ possesses a metric $\rho$, we can generalize the finite case by considering the loss function $I_{W_\varepsilon}$ on $\mathcal{P} \times \mathcal{G}$, where $\varepsilon > 0$, and $W_\varepsilon = \{(P, \gamma) \in \mathcal{P} \times \mathcal{G} : \rho[g(P), \gamma] > \varepsilon\}$. When a maximum likelihood estimate $\gamma_{ML}$ of $g(P)$ exists, it can be considered really unique only if $LR\{\rho(g, \gamma_{ML}) > \delta\} < 1$ for all $\delta > 0$. In this case, if $\gamma$ is an optimal decision, according to the MPL criterion, for the decision problem described by $I_{W_\varepsilon}$, then $\rho(\gamma, \gamma_{ML}) \leq \varepsilon$. Thus by selecting a sufficiently small $\varepsilon > 0$, we obtain a situation which in practice is analogous to the finite case: if a unique maximum likelihood estimate exists, then it corresponds practically to the unique optimal decision (according to the MPL criterion) for the estimation problem.

Consider now the problem of testing the null hypothesis $H_0 : P \in \mathcal{H}$ versus the alternative $H_1 : P \in \mathcal{P} \setminus \mathcal{H}$, for a subset $\mathcal{H}$ of $\mathcal{P}$: a simple way to define a loss function is to assign constant losses $c_1, c_2 > 0$ to errors of the first and of the second kind, respectively, where $c_1 > c_2$. That is, the hypothesis testing problem can be described by the loss function $L$ on $\mathcal{P} \times \{r, n\}$ such that $l_r = c_1 I_{\mathcal{H}}$ and $l_n = c_2 I_{\mathcal{P} \setminus \mathcal{H}}$, where $r$ and $n$ are respectively the decisions of rejecting and of not rejecting the null hypothesis. The MPL criterion applied to the decision problem described by $L$ leads to the rejection of the null hypothesis when $LR(\mathcal{H}) < \frac{c_2}{c_1}$ (when $LR(\mathcal{H}) = \frac{c_2}{c_1}$, both decisions $r$ and $n$ are optimal); that is, practically it leads to the likelihood ratio test with critical value $\frac{c_2}{c_1}$. The level of this test depends on the sets $\mathcal{H}$ and $\mathcal{P}$ (this is the calibration problem of likelihood-based inference): it can be argued that those sets should have been considered when selecting the constants $c_1$ and $c_2$; another way to address the calibration problem is by using some pseudo likelihood function depending on the sets $\mathcal{H}$ and $\mathcal{P}$. The choice of a loss function to describe the problem of selecting a confidence region for $g(P)$ is complicated by the fact that also the extent of the region plays a role; but a simple loss function can be chosen if we consider only a limited set $\mathcal{D} \subseteq 2^{\mathcal{G}}$ of possible confidence regions (for example all the intervals whose length does not exceed a particular constant). In this case we can assign a constant error to regions $d \not\ni g(P)$, while assigning no error to regions $d \ni g(P)$; that is, the problem of selecting a confidence region can be described by the loss function $L$ on $\mathcal{P} \times \mathcal{D}$ such that $l_d = I_{\{g \notin d\}}$ for all $d \in \mathcal{D}$. The MPL criterion applied to the decision problem described by $L$ consists in minimizing $LR\{g \notin d\}$; and for each $\beta \in (0,1)$ the smallest region $r \subseteq \mathcal{G}$ satisfying $LR\{g \notin r\} \leq \beta$ is the likelihood-based confidence region for $g(P)$ with cutoff point $\beta$. That is, if $\mathcal{D}$ is sufficiently wide, then the MPL criterion leads to the selection of a likelihood-based confidence region for $g(P)$, in the sense that it is the smallest region among the optimal ones.

The MPL criterion thus leads to the usual likelihood-based inference methods, when applied to some standard form of the corresponding decision problems. Like the likelihood-based inferences, under suitable regularity conditions the decisions selected on the basis of the MPL criterion are asymptotically optimal and even asymptotically efficient (see Subsections 4.2.2 and 5.1.1). However, asymptotic properties have no practical meaning if they are not supported by some experience with finite sample problems, telling us how many data are necessary to approximate the asymptotic results. In fact, the appreciation of the likelihood-based inference methods rests on the positive experience with a myriad of finite sample problems, and on the property of being general, simple, and intuitive. This property is shared by the MPL criterion: it implies a wide applicability of the methods, and is related with the property of being based directly on the likelihood function (which implies in particular conditionality, parametrization invariance, and dependence only on sufficient statistics).

Besides the analogy with the usual likelihood-based inference methods, there are other interesting analogies concerning the MPL criterion. Consider a decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$, and a random object $X : \Omega \to \mathcal{X}$; and assume for simplicity that the sets $\mathcal{P}$, $\mathcal{D}$, and $\mathcal{X}$ are finite. After having observed $X = x$, the decision $\delta(x) \in \mathcal{D}$ is optimal according to the MPL criterion if it minimizes

$\max_{P \in \mathcal{P}} P\{X = x\}\, L[P, \delta(x)].$

If we start with the uniform distribution on $\mathcal{P}$ as the averaging probability measure of the Bayesian approach, then after having observed $X = x$, the decision $\delta(x) \in \mathcal{D}$ is optimal according to the Bayesian criterion if it minimizes

$\sum_{P \in \mathcal{P}} P\{X = x\}\, L[P, \delta(x)].$


The decision function $\delta \in \mathcal{D}^{\mathcal{X}}$ is optimal according to the minimax risk criterion if it minimizes

$\max_{P \in \mathcal{P}} \sum_{x \in \mathcal{X}} P\{X = x\}\, L[P, \delta(x)].$

There is a high similarity between these three expressions, although the term $P\{X = x\}$ appears with different meanings. It can be noted that the MPL criterion corresponds to the minimax risk criterion by assuming that for all the alternative, unobserved realizations of $X$ we would have selected a correct decision. If we apply the Bayesian approach (with the uniform distribution on $\mathcal{P}$ as the averaging probability measure) to the pre-data decision problem described by the risk function, then the decision function $\delta \in \mathcal{D}^{\mathcal{X}}$ is optimal according to the Bayesian criterion if it minimizes

$\sum_{P \in \mathcal{P}} \sum_{x \in \mathcal{X}} P\{X = x\}\, L[P, \delta(x)].$

Since the summation operators commute, we obtain the temporal coherence of the Bayesian approach. The maximization and summation operators do not commute, and in fact the MPL approach does not in general satisfy this temporal coherence: the MPL criterion for the pre-data decision problem described by the risk function corresponds to the minimax risk criterion (since the likelihood function induced by observing no data is constant). Since the maximization operators commute, a property analogous to the temporal coherence of the Bayesian approach is obtained for the MPL approach if

$\max_{x \in \mathcal{X}} P\{X = x\}\, L[P, \delta(x)]$

is used (instead of the expected value) as the representative value of the random loss of $\delta \in \mathcal{D}^{\mathcal{X}}$, when defining the loss function $L'$ on $\mathcal{P} \times \mathcal{D}^{\mathcal{X}}$ for the pre-data decision problem considered in Subsection 1.1.1. This representative value has the drawback of depending on spurious modifications of the random object $X$ (for example when a possible observation $x \in \mathcal{X}$ is "subdivided" into two equiprobable new observations by considering also the result of flipping a fair coin), but the decision functions obtained by conditional application of the MPL criterion are the only decision functions that are optimal for all such modifications (according to the MPL criterion applied to the decision problem described by $L'$).
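The structural similarity of the criteria, and the temporal coherence of the Bayesian approach, can be illustrated on a toy finite problem (Python sketch; all model names and numbers are hypothetical, not from the thesis):

```python
from itertools import product

# Toy finite problem: two models, two decisions, two possible observations.
P = {"P1": {"x1": 0.8, "x2": 0.2}, "P2": {"x1": 0.3, "x2": 0.7}}   # P{X = x}
L = {("P1", "d1"): 0.0, ("P1", "d2"): 1.0,
     ("P2", "d1"): 1.0, ("P2", "d2"): 0.2}
D, X = ["d1", "d2"], ["x1", "x2"]

def mpl(x):    # post-data MPL: minimize max_P P{X=x} L(P, d)
    return min(D, key=lambda d: max(P[m][x] * L[(m, d)] for m in P))

def bayes(x):  # post-data Bayes, uniform prior: minimize sum_P P{X=x} L(P, d)
    return min(D, key=lambda d: sum(P[m][x] * L[(m, d)] for m in P))

# Pre-data Bayes: minimize sum_P sum_x P{X=x} L(P, delta(x)) over delta in D^X.
pre_bayes = min(product(D, repeat=len(X)),
                key=lambda delta: sum(P[m][x] * L[(m, delta[i])]
                                      for m in P for i, x in enumerate(X)))

print(mpl("x1"), mpl("x2"), bayes("x1"), bayes("x2"))
# Temporal coherence: the pre-data Bayes optimum decomposes, per observation x,
# into the post-data Bayes decisions (the double sum commutes).
print(all(pre_bayes[i] == bayes(x) for i, x in enumerate(X)))  # True
```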


Besides temporal coherence, the Bayesian approach possesses also the property of coherence with respect to additive loss: the MPL approach possesses the analogous, important property of "coherence with respect to maxitive loss". Let $L_1$ and $L_2$ be two loss functions on $\mathcal{P} \times \mathcal{D}_1$ and $\mathcal{P} \times \mathcal{D}_2$, respectively, let $\mathcal{D} = \mathcal{D}_1 \times \mathcal{D}_2$, and let $L$ be the loss function on $\mathcal{P} \times \mathcal{D}$ defined by $L[P, (d_1, d_2)] = \max\{L_1(P, d_1), L_2(P, d_2)\}$ (for all $P \in \mathcal{P}$ and all $(d_1, d_2) \in \mathcal{D}$). If $d_1$ and $d_2$ are optimal for the decision problems described by $L_1$ and $L_2$, respectively, then $(d_1, d_2)$ is optimal for the decision problem described by $L$. The coherence with respect to maxitive loss is useful because for some decision problems it allows us to construct an optimal decision by considering only simpler problems.
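This coherence rests on the maxitivity of the weighted supremum, $\sup lik \max\{l_1, l_2\} = \max\{\sup lik\, l_1, \sup lik\, l_2\}$; a randomized check (Python sketch, on a hypothetical finite instance):

```python
import random

# Componentwise MPL-optimal decisions are MPL-optimal for the max-combined loss.
random.seed(0)
models = range(6)
D1, D2 = range(4), range(5)
lik = [random.random() for _ in models]
L1 = [[random.random() for _ in D1] for _ in models]
L2 = [[random.random() for _ in D2] for _ in models]

def mpl_value(loss, d):   # sup_P lik(P) * loss(P, d)
    return max(lik[m] * loss[m][d] for m in models)

d1 = min(D1, key=lambda d: mpl_value(L1, d))   # optimal for L1 alone
d2 = min(D2, key=lambda d: mpl_value(L2, d))   # optimal for L2 alone

def combined(pair):       # sup_P lik(P) * max{L1(P, d1), L2(P, d2)}
    return max(lik[m] * max(L1[m][pair[0]], L2[m][pair[1]]) for m in models)

best = min(combined((a, b)) for a in D1 for b in D2)
print(combined((d1, d2)) == best)  # True: (d1, d2) is optimal for the combined problem
```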

Consider a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$: let $A$ represent the observed data, and let $lik$ be the likelihood function on $\mathcal{P}$ induced by the observation of $A$. If we consider a set $\Gamma$ of averaging probability measures $\pi$ on $\mathcal{P}$, then the MPL criterion for the decision problem described by the loss function $L''$ considered at the end of Subsection 1.1.2 can be expressed as follows:

minimize $\sup_{\pi \in \Gamma} E_\pi(lik\, l_d)$.

In fact, it can be easily proved that for all $\pi \in \Gamma$ and all $d \in \mathcal{D}$

$P_\pi(\mathcal{P} \times A)\, E_{P_\pi}[L(P, d) \mid \mathcal{P} \times A] = E_\pi(lik\, l_d).$

Since $E_\pi(lik\, l_d) \leq \sup lik\, l_d$, if $\Gamma$ can be interpreted as an extension of $\mathcal{P}$, in the sense that $\Gamma$ contains, at least as limits, the Dirac measures $\delta_P$ on $\mathcal{P}$, for all $P \in \mathcal{P}$, then the application of the MPL criterion to the decision problem described by $L''$ leads to the same results as its application to the decision problem described by $L$. More precisely, if for each $P \in \mathcal{P}$ and each $d \in \mathcal{D}$ there is a sequence $\pi_1, \pi_2, \ldots \in \Gamma$ such that

$\lim_{i \to \infty} E_{\pi_i}(lik\, l_d) = lik(P)\, l_d(P),$

then according to the MPL criterion, a decision $d \in \mathcal{D}$ is optimal for the decision problem described by $L''$ if and only if it is optimal for the decision problem described by $L$. This property too can be considered as a sort of coherence; its premise is satisfied in particular when $\Gamma$ is the set of all probability measures on $(\mathcal{P}, \mathcal{C})$, and $\mathcal{C}$ contains all singletons of $\mathcal{P}$.


Example 1.9. If for the estimation problem of Example 1.1 we consider the family $\Gamma$ of averaging probability measures on $\mathcal{P}$ corresponding to the family of beta distributions for the parameter $p \in [0,1]$, as done in Example 1.5, then the MPL criterion applied to the decision problem described by $L''$ leads to the estimate obtained in Example 1.8 without considering the family $\Gamma$ of averaging probability measures, since $\Gamma$ contains as limits the Dirac measures on $\mathcal{P}$. Hence, unlike the $\Gamma$-minimax criterion, the MPL criterion can be useful even when $\Gamma$ is wide enough to be interpreted as an extension of $\mathcal{P}$ (see also Subsection 3.2.2). ◊

1.3.2 Other Likelihood-Based Decision Criteria

Consider a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$, and let $lik$ be the likelihood function on $\mathcal{P}$ induced by the observed data. When a (unique) maximum likelihood estimate $P_{ML}$ of $P$ exists, a likelihood-based decision criterion that is often used in practice (even though perhaps it was never stated in its full generality) is the following one: simply replace $P$ with its maximum likelihood estimate $P_{ML}$ in the loss function $L$, and thus minimize $L(P_{ML}, d)$. The asymptotic properties of this criterion (and of all reasonable likelihood-based decision criteria) are similar to those of the MPL one; and since it reduces the set $\mathcal{P}$ to a single model $P_{ML}$, the present criterion possesses important coherence properties (such as the coherence with respect to additive or maxitive loss), but not the temporal coherence, because the model $P_{ML}$ depends on the data. The above definition of this criterion is not general (since it depends on the existence of a maximum likelihood estimate of $P$): we will consider the general definition of an almost equivalent criterion in a moment.

The likelihood function measures the relative plausibility of the models (in the light of the observed data alone), and a simple way to use it in a decision criterion is to reduce the set $\mathcal{P}$ of considered statistical models by excluding the less plausible ones. If we reduce $\mathcal{P}$ to the likelihood-based confidence region for $P$ with cutoff point $\beta \in (0,1)$, before applying the minimax criterion, then we obtain the LRM$_\beta$ criterion:

minimize $\sup_{\{lik \geq \beta \sup lik\}} l_d$.

It is possible that this criterion was never stated in such generality before: "LRM" means "Likelihood-based Region Minimax". The LRM$_\beta$ criterion is thus simply the minimax criterion applied after having discarded the models in $\mathcal{P}$ that are too implausible (when compared to other models in $\mathcal{P}$) in the light of the observed data: the larger is $\beta$, the more plausible are the discarded models. By letting $\beta$ tend to 1, we obtain the MLD criterion:

minimize $\lim_{\beta \uparrow 1} \sup_{\{lik \geq \beta \sup lik\}} l_d$.

This criterion is almost equivalent to the one considered above, based on the maximum likelihood estimate $P_{ML}$ of $P$: "MLD" means "Maximum Likelihood Decision". In fact, if there is a topology on $\mathcal{P}$ such that $l_d$ is continuous, and $LR(\mathcal{P} \setminus \mathcal{N}) < 1$ for all neighborhoods $\mathcal{N}$ of $P_{ML}$, then

$\lim_{\beta \uparrow 1} \sup_{\{lik \geq \beta \sup lik\}} l_d = L(P_{ML}, d).$

If such a topology does not exist, then the use of $L(P_{ML}, d)$ to evaluate the decision $d$ does not seem very reasonable: we can say that the MLD criterion corresponds to the one considered at the beginning of the present subsection, in all the cases in which this is well-defined and reasonable.

When applied to an estimation problem, the MLD criterion leads to the maximum likelihood estimate if some weak conditions (such as the above existence of a topology) are satisfied, while the LRM$_\beta$ criterion does not in general lead to the maximum likelihood estimate even in the simple problem described by the loss function $I_W$ considered in Subsection 1.3.1. When applied to the hypothesis testing problem considered in Subsection 1.3.1 (with $c_1 > c_2$), the MLD criterion leads to the rejection of the null hypothesis $H_0 : P \in \mathcal{H}$ if and only if $LR(\mathcal{H}) < 1$, while the LRM$_\beta$ criterion leads to the rejection of $H_0$ if and only if $LR(\mathcal{H}) < \beta$. That is, the MLD criterion leads to an unreasonable test, while the LRM$_\beta$ criterion leads to the likelihood ratio test with critical value $\beta$ (independently of $\frac{c_2}{c_1}$). Analogously, for the problem of selecting a confidence region considered in Subsection 1.3.1, a region $d \in \mathcal{D}$ is optimal according to the MLD criterion if $LR\{g \notin d\} < 1$, while it is optimal according to the LRM$_\beta$ criterion if $LR\{g \notin d\} < \beta$. In conclusion, both these criteria use the likelihood in a very limited way: the MLD criterion divides the values of the likelihood ratio into two categories (equal to 1 or not), while the LRM$_\beta$ criterion divides them into the two categories: larger than $\beta$ or not. The LRM$_\beta$ criterion has the advantage of allowing some control through the choice of $\beta$, but on the other hand this choice is complicated by the calibration problem of likelihood-based inference.


Example 1.10. Consider the estimation problem of Example 1.1: the estimate obtained by applying the LRM$_\beta$ criterion is the midpoint $\hat{p}_\beta$ of the interval corresponding to the likelihood-based confidence region for $p$ with cutoff point $\beta$. The case with $n = 0$ corresponds to the case without observations (the midpoint $\frac{1}{2}$ of the interval $[0,1]$ is the estimate obtained in Example 1.2), while as $n$ increases, the estimate $\hat{p}_\beta$ tends to the maximum likelihood estimate $p_{ML} = \frac{x}{n}$ (considered in Example 1.7), if $\beta$ is held constant. But note that $\lim_{\beta \downarrow 0} \hat{p}_\beta = \frac{1}{2}$, independently of $n$ and of the observation $X = x$. The MLD criterion applied to this estimation problem leads to the estimate $\lim_{\beta \uparrow 1} \hat{p}_\beta$. When $n \geq 1$, this is the maximum likelihood estimate $p_{ML} = \frac{x}{n}$, while for $n = 0$ the maximum likelihood estimate is undefined, and the MLD criterion leads to the estimate $\hat{p}_\beta = \frac{1}{2}$. It is important to note that the estimates obtained by applying the LRM$_\beta$ and MLD criteria would be the same for all symmetric, strictly increasing estimation errors; that is, for all loss functions $L : (P_p, d) \mapsto f(|p - d|)$ on $\mathcal{P} \times [0,1]$, where $f : [0,1] \to \mathbb{R}_{\geq 0}$ is strictly increasing. This is not true (when $n \geq 1$) for the estimates obtained by applying the minimax risk, Bayesian, and MPL criteria considered in Examples 1.3, 1.4, and 1.8, respectively. ◊
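The behavior of $\hat{p}_\beta$ for extreme values of $\beta$ can be checked on a grid (Python sketch with the data $X = 8$, $n = 10$ of Example 1.6; not the thesis's code):

```python
# Grid sketch of the LRM_beta estimate of Example 1.10: the midpoint of the
# likelihood-based confidence interval for p.
def p_hat(beta, n=10, x=8, steps=200000):
    ps = [i / steps for i in range(steps + 1)]
    liks = [p**x * (1 - p)**(n - x) for p in ps]
    sup = max(liks)
    inside = [p for p, l in zip(ps, liks) if l >= beta * sup]
    return (min(inside) + max(inside)) / 2

print(round(p_hat(1e-30), 3))   # 0.5: as beta -> 0 the interval tends to [0, 1]
print(round(p_hat(0.9999), 3))  # 0.8: as beta -> 1 it shrinks to p_ML = x/n
```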

If we consider a set $\Gamma$ of averaging probability measures $\pi$ on $\mathcal{P}$, then applying the MLD criterion to the decision problem described by the loss function $L''$ considered at the end of Subsection 1.1.2 usually means applying the Bayesian criterion (to the decision problem described by $L$) with as averaging probability measure on $\mathcal{P}$ the maximum likelihood estimate $\pi_{ML}$ (obtained from the likelihood function on $\Gamma$). As stated at the end of Subsection 1.2.1, this way of selecting $\pi$ was proposed for example by Good (1965); but also most instances of the empirical Bayes approach introduced by Robbins (1951) can be considered as applications of the MLD criterion (to the decision problem described by $L''$, after having selected a suitable set $\Gamma$ of averaging probability measures). DasGupta and Studden (1989) applied a criterion similar to the LRM$_\beta$ one to the decision problem (described by $L''$) obtained from a particular estimation problem (described by $L$) by using a particular set $\Gamma$ of averaging probability measures.

The choice of the way of using the likelihood function $lik$ to weight the function $l_d$ in the MPL criterion is intuitive, but arbitrary. In particular, reasonable alternative criteria can be easily obtained by transforming $lik$ before multiplying it with $l_d$, or by changing the way in which (the transformed version of) $lik$ and $l_d$ are combined. For example, the LRM$_\beta$ criterion can be obtained by transforming $lik$ into $I_{[\beta \sup lik,\, \infty)} \circ lik$ before multiplying it with $l_d$ (the MLD criterion can also be obtained, but in a slightly more complicated way), while an alternative way of combining $lik$ and $l_d$ was proposed by Cattaneo (2005) under the name of "MPL* criterion". Such modifications of the MPL criterion are special cases of the general definition of likelihood-based decision criterion that will be used in Chapters 4 and 5. The main advantage of the way in which the MPL criterion combines $lik$ and $l_d$ is its simplicity (and thus its applicability in difficult problems; see also Subsection 4.1.2 and Section 5.2), and the best reasons for not transforming $lik$ before multiplying it with $l_d$ are the analogies and the coherence properties considered at the end of Subsection 1.3.1 (some transformed versions of $lik$ can simply be interpreted as pseudo likelihood functions).

Lehmann (1959, Section 1.7, which remained substantially unchanged in the following two editions of the book) proposed a likelihood-based decision criterion for decision problems that involve at most countably many possible decisions, and that are formulated as follows: for each model $P \in \mathcal{P}$ there is exactly one correct decision $\psi(P) \in \mathcal{D}$, which would result in a positive gain $G(P)$, while every other decision in $\mathcal{D}$ would yield no gain ($\psi$ and $G$ are functions on $\mathcal{P}$). The proposed criterion consists in selecting the decision $\psi(P')$, where $P'$ is the $P \in \mathcal{P}$ maximizing $lik\, G$, when such a $P$ exists and is unique. That is, this criterion modifies the one based on the maximum likelihood estimate $P_{ML}$ of $P$ (considered at the beginning of the present subsection) by using the possible gain $G$ to weight the likelihood function $lik$ before maximizing it (these two criteria correspond if $G$ is constant). The above decision problem can be described by the loss function $L$ on $\mathcal{P} \times \mathcal{D}$ such that $l_d = I_{\{\psi \neq d\}}\, G$ for all $d \in \mathcal{D}$: then $L(P, d)$ is the additional gain that would have resulted from the selection of the correct decision $\psi(P)$ instead of $d$ (in particular, $L[P, \psi(P)] = 0$ for all $P \in \mathcal{P}$). When a $P'$ maximizing $lik\, G$ exists, it can be considered really unique, and the decision $d' = \psi(P')$ clearly determined, only if $\sup_{\{\psi \neq d'\}} lik\, G < lik(P')\, G(P')$ (this condition could also be formulated as the existence of a particular topology or metric, in analogy with other conditions considered above). In this case, $d'$ is the unique optimal decision, according to the MPL criterion, for the decision problem described by $L$. That is, the criterion based on $P'$ can be considered as a special case of the MPL criterion, applicable to the decision problems that involve at most countably many possible decisions, and that can be formulated as above in terms of the functions $\psi$ and $G$ on $\mathcal{P}$.


Giang and Shenoy (2002) proposed a likelihood-based decision criterion for decision problems that are described by a binary utility function $U : \mathcal{P} \times \mathcal{D} \to \mathcal{U}$, where $\mathcal{U} = \{(a, b) \in [0,1]^2 : \max\{a, b\} = 1\}$ is the set of possible values for the binary utility: if $c > c'$, then $(c, 1)$ is better than $(c', 1)$, while $(1, c)$ is worse than $(1, c')$. That is, $(0, 1)$ is the worst value for the binary utility, $(1, 0)$ is the best one, while $(1, 1)$ is an intermediate value. Each decision $d \in \mathcal{D}$ is evaluated by the binary utility $(\sup lik\, u_1, \sup lik\, u_2)$, where $u_1$ and $u_2$ are the functions on $\mathcal{P}$ such that $U(P, d) = (u_1(P) \sup lik, u_2(P) \sup lik)$ for all $P \in \mathcal{P}$: the criterion consists in selecting the decision with the best evaluation in terms of binary utility. If a maximum likelihood estimate $P_{ML}$ of $P$ exists, then the evaluation of a decision $d$ lies somewhere (on the binary utility scale) between $U(P_{ML}, d)$ and $(1, 1)$; that is, the intermediate value $(1, 1)$ plays a peculiar role in the evaluation of decisions (see also the remarks at the end of Subsection 4.1.2). This criterion is based on a modification of the axiomatic approach to qualitative decision making with possibility theory studied in Giang and Shenoy (2001); a direct application of the criterion to a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$ is not possible, because the values of $L$ must first be translated into (subjective) binary utility values. If the loss function $L$ is bounded above by $c > 0$, then by defining $U = (1, \frac{L}{c})$, we obtain that the present criterion corresponds to the MPL one (independently of the choice of the upper bound $c$). However, this definition of $U$ does not really conform with the above interpretation of the binary utility, since $\sup L$ should in fact be translated into $(0, 1)$, but then we would be faced with the problematic choice of a loss value to be translated into $(1, 1)$.

2

Nonadditive Measures and Integrals

This chapter is an introduction to nonadditive measures and integrals, which will be used in the following chapters. The literature on this topic is

extensive and heterogeneous, ranging from pure mathematics to decision

theory and artificial intelligence (a partial survey is given by Grabisch,

Murofushi, and Sugeno, 2000). In particular, many concepts appear in the

literature under different names: the names used in this chapter have been

selected in order to avoid possible confusion. The present introduction

focuses on maxitive measures and regular integrals, because these topics

are particularly important for the following chapters. Many definitions are

new (for example the one of regular integral), and as a consequence many

results are new too, but in general they are rather simple.

2.1 Nonadditive Measures

Usually, a measure is a nonnegative, extended real-valued set function satisfying countable additivity; while for nonadditive measures the requirement of additivity is dropped. Thus, in general a nonadditive measure is not a measure, and an additive measure is a nonadditive measure. To avoid such contradictions, we will use the term "measure" to denote the general concept of nonnegative, extended real-valued set function, which can then be specialized by attributes such as "countably additive".


2.1.1 Types of Measures

A measure on a set $\Omega$ is a function
$$\mu : 2^\Omega \to [0,\infty] \quad\text{such that}\quad \mu(\emptyset) = 0.$$
This definition could be generalized by allowing $\mu$ to be defined only on some subset of $2^\Omega$, but this generality will not be really needed in the following. The requirement $\mu(\emptyset) = 0$ is not very restrictive, since we will consider monotonic measures only. A measure $\mu$ on $\Omega$ is said to be monotonic if
$$A \subseteq B \;\Rightarrow\; \mu(A) \le \mu(B) \quad\text{for all } A, B \subseteq \Omega.$$
In this case, if $\mu(\emptyset) = 0$ were not required, we would have $\mu(\emptyset) = \min \mu$, and thus $\mu' = \mu - \mu(\emptyset)$ would be a monotonic measure satisfying $\mu'(\emptyset) = 0$ (except for the uninteresting case with $\mu \equiv \infty$). That is, the requirement $\mu(\emptyset) = 0$ simply spares us the subtraction of $\mu(\emptyset)$. If $\mu$ is monotonic, then $\mu(\Omega) = \max \mu$, and therefore $\mu$ is bounded when it is finite. A monotonic measure $\mu$ on $\Omega$ is said to be nonzero if $\mu(\Omega) > 0$; otherwise it is the zero measure, which has constant value 0. A monotonic measure $\mu$ on $\Omega$ is said to be normalized if $\mu(\Omega) = 1$.

A measure $\mu$ on $\Omega$ is said to be subadditive if
$$\mu(A \cup B) \le \mu(A) + \mu(B) \quad\text{for all disjoint } A, B \subseteq \Omega.$$
A monotonic measure is not necessarily subadditive, and a subadditive measure is not necessarily monotonic. When combined, monotonicity and subadditivity have important consequences: in particular, it can be easily proved that if $\mu$ is a monotonic, subadditive measure on $\Omega$, and $A \subseteq \Omega$ is a set such that $\mu(A) = 0$, then
$$\mu(B) = \mu(B \setminus A) \quad\text{for all } B \subseteq \Omega.$$
Thanks to this property, it makes sense to consider a statement whose truth depends on the elements of $\Omega$ as true almost everywhere with respect to $\mu$ (written a.e. $[\mu]$) when it is false only on a set $A \subseteq \Omega$ such that $\mu(A) = 0$.

A measure $\mu$ on $\Omega$ is said to be 2-alternating if it is monotonic and
$$\mu(A \cup B) + \mu(A \cap B) \le \mu(A) + \mu(B) \quad\text{for all } A, B \subseteq \Omega.$$


A 2-alternating measure is thus monotonic and subadditive, and has other important properties: see for instance Choquet (1954) and Huber and Strassen (1973).

If a measure $\mu$ on $\Omega$ is monotonic and subadditive, then
$$\max\{\mu(A), \mu(B)\} \le \mu(A \cup B) \le \mu(A) + \mu(B) \quad\text{for all disjoint } A, B \subseteq \Omega.$$
In this sense, at one extreme of the class of monotonic, subadditive measures we have the finitely additive ones, satisfying
$$\mu(A \cup B) = \mu(A) + \mu(B) \quad\text{for all disjoint } A, B \subseteq \Omega;$$
while at the other extreme we have the finitely maxitive ones. A measure $\mu$ on $\Omega$ is said to be finitely maxitive if
$$\mu(A \cup B) = \max\{\mu(A), \mu(B)\} \quad\text{for all disjoint } A, B \subseteq \Omega.$$
It can be easily proved that a finitely maxitive measure $\mu$ on $\Omega$ is 2-alternating and satisfies
$$\mu\Bigl(\bigcup \mathcal{A}\Bigr) = \max_{A \in \mathcal{A}} \mu(A) \quad\text{for all finite, nonempty } \mathcal{A} \subseteq 2^\Omega.$$
Unlike additivity, this property can be extended without problems to all sets $\mathcal{A}$ (not only to the countable ones). A measure $\mu$ on $\Omega$ is said to be completely maxitive if
$$\mu\Bigl(\bigcup \mathcal{A}\Bigr) = \sup_{A \in \mathcal{A}} \mu(A) \quad\text{for all } \mathcal{A} \subseteq 2^\Omega.$$
In this case, $\mu$ is uniquely determined by its density function $\mu^\downarrow$ on $\Omega$, defined by $\mu^\downarrow(q) = \mu\{q\}$ (for all $q \in \Omega$); in fact,
$$\mu(A) = \sup_A \mu^\downarrow \quad\text{for all } A \subseteq \Omega.$$
Density functions do not satisfy particular requirements: each nonnegative, extended real-valued function on $\Omega$ is the density function of a completely maxitive measure on $\Omega$. Maxitive measures were introduced by Shilkret (1971), who focused on countable maxitivity in analogy with additive measures.

Example 2.1. The measure $\mu$ on $[0,1]$ defined by $\mu(A) = \sup A$ (for all $A \subseteq [0,1]$) is normalized and completely maxitive, with density function $\mu^\downarrow = \mathrm{id}_{[0,1]}$. ◊
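The correspondence between a completely maxitive measure and its density function can be sketched numerically (a finite grid stands in for $[0,1]$, and the helper name is mine, not the thesis'):

```python
# Sketch: a completely maxitive measure is recovered from its density
# via mu(A) = sup over A of the density.  For the measure of Example 2.1,
# the density is the identity on [0,1], so mu(A) = sup A.

def maxitive_measure(density):
    """Return mu with mu(A) = sup_A density (and 0 on the empty set)."""
    def mu(A):
        return max((density(q) for q in A), default=0.0)
    return mu

mu = maxitive_measure(lambda q: q)   # Example 2.1: mu(A) = sup A
grid = [k / 100 for k in range(101)]

assert mu(grid) == 1.0                                 # normalized
assert mu([q for q in grid if q <= 0.4]) == 0.4        # mu([0, 0.4]) = 0.4
# finite maxitivity: mu(A u B) = max(mu(A), mu(B)) for disjoint A, B
A = [q for q in grid if q < 0.3]
B = [q for q in grid if q >= 0.7]
assert mu(A + B) == max(mu(A), mu(B))
```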


A possibility measure on a set $\Omega$ is a completely maxitive measure $\mu$ on $\Omega$ such that $\mu(\Omega) \le 1$. The density function of a possibility measure is often called "possibility distribution function", but we will use the expression "distribution function" with another meaning. Possibility measures were introduced by Zadeh (1978) in the context of the theory of fuzzy sets. Other authors use different definitions: for example, for Dubois and Prade (1988) a possibility measure is a normalized, finitely maxitive measure.

A monotonic measure $\mu$ on $\Omega$ is said to be continuous from below if
$$\mu\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \lim_{n \to \infty} \mu(A_n) \quad\text{for all sequences } A_1 \subseteq A_2 \subseteq \dots \subseteq \Omega.$$
A completely maxitive measure is continuous from below, while this is in general not true if the maxitivity is only finite. It can be easily proved that, as in the case of additivity, a finitely maxitive measure is continuous from below if and only if it is countably maxitive. But even complete maxitivity is not enough to assure continuity from above.

2.1.2 Derived Measures

The dual of a finite, monotonic measure $\mu$ on $\Omega$ is the finite, monotonic measure $\bar{\mu}$ on $\Omega$ defined by
$$\bar{\mu}(A) = \mu(\Omega) - \mu(\Omega \setminus A) \quad\text{for all } A \subseteq \Omega.$$
The equality $\bar{\bar{\mu}} = \mu$ justifies the use of the term "dual". If $\mu$ is subadditive, then $\bar{\mu} \le \mu$, and therefore $\bar{\mu}$ is also subadditive only if $\bar{\mu} = \mu$; in this case $\mu$ is 2-alternating only if it is finitely additive.
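On a finite set, the definition of the dual and its involution property can be checked directly (a small sketch; the particular measure below is an arbitrarily chosen finitely maxitive one, not an example from the thesis):

```python
# Sketch of the dual on a two-point Omega: the dual of mu is
# mu_bar(A) = mu(Omega) - mu(Omega \ A), and the dual of the dual is mu.
from itertools import chain, combinations

def subsets(omega):
    return [frozenset(s) for s in chain.from_iterable(
        combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def dual(mu, omega):
    return {A: mu[omega] - mu[omega - A] for A in subsets(omega)}

omega = frozenset({1, 2})
# a monotonic, finitely maxitive measure with density 1 -> 0.5, 2 -> 1.0
mu = {frozenset(): 0.0, frozenset({1}): 0.5, frozenset({2}): 1.0, omega: 1.0}
mu_bar = dual(mu, omega)

assert mu_bar[frozenset({1})] == 0.0   # dual lies below mu (mu subadditive)
assert dual(mu_bar, omega) == mu       # the dual of the dual is mu
```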

It can be easily proved that if $\Omega$ and $T$ are sets, $\mu$ is a measure on $\Omega$, and $t : \Omega \to T$ is a function, then $\mu \circ t^{-1}$ is a measure on $T$; and if $\mu$ satisfies one of the properties considered in Subsection 2.1.1, then so does $\mu \circ t^{-1}$. A class $\mathcal{M}$ of measures (that is, each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$) is said to be closed under transformations if $\mu \circ t^{-1} \in \mathcal{M}$ for all $\mu \in \mathcal{M}$, all sets $T$, and all functions $t : \Omega_\mu \to T$. For instance, the class of all normalized, completely maxitive measures is closed under transformations.


If $\mu$ is a measure on $\Omega$, and $\delta : [0,\infty] \to [0,\infty]$ is a nondecreasing function such that $\delta(0) = 0$, then $\delta \circ \mu$ is a measure on $\Omega$. Among the possible properties of $\mu$ considered in Subsection 2.1.1, the only ones that are certainly maintained by $\delta \circ \mu$ are monotonicity and finite maxitivity; but if $\delta$ is left-continuous, then also complete maxitivity and continuity from below are maintained. If $c \in (0,\infty)$, and $\delta$ is the function $x \mapsto c\,x$ on $[0,\infty]$, then $\delta \circ \mu = c\,\mu$ maintains all the properties of $\mu$, except for the properties of being normalized and of being a possibility measure. A class $\mathcal{M}$ of measures is said to be regular if $c\,\mu \in \mathcal{M}$ for all $\mu \in \mathcal{M}$ and all $c \in (0,\infty)$, and $\mathcal{M}$ is closed under transformations. For instance, the class of all finite, nonzero, completely maxitive measures is regular.

Example 2.2. Let $\mu$ be the completely maxitive measure on $[0,1]$ considered in Example 2.1, and let $\delta$ be the function on $[0,\infty]$ defined by
$$\delta(x) = \begin{cases} \frac{x}{2} & \text{if } 0 \le x < 1, \\ x & \text{if } 1 \le x \le \infty. \end{cases}$$
The function $\delta$ is not left-continuous at 1, and in fact the measure $\nu = \delta \circ \mu$ on $[0,1]$ is finitely maxitive, but not continuous from below (and thus not completely maxitive), since for example
$$\lim_{x \uparrow 1} \nu[0, x) = \lim_{x \uparrow 1} \delta(x) = \tfrac{1}{2} < 1 = \delta(1) = \nu[0, 1). \qquad\Diamond$$

The restriction $\mu|_A$ to $A \subseteq \Omega$ of the measure $\mu$ on $\Omega$ is the measure on $A$ defined by $\mu|_A(B) = \mu(B)$ (for all $B \subseteq A$). Since the function $\mu$ on $2^\Omega$ is restricted to $2^A$, we should write $\mu|_{2^A}$ instead of $\mu|_A$, but the abuse of notation in writing $\mu|_A$ is coherent with the abuse of terminology committed by saying that $\mu$ is a measure on $\Omega$ when in fact it is a function on $2^\Omega$. If $\mu$ satisfies one of the properties considered in Subsection 2.1.1, then so does $\mu|_A$, except for the properties of being normalized and of being nonzero. A class $\mathcal{M}$ of measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$) is said to be closed under restrictions if $\mu|_A \in \mathcal{M}$ for all $\mu \in \mathcal{M}$ and all $A \subseteq \Omega_\mu$. For instance, the class of all finite, completely maxitive measures is closed under restrictions.


2.2 Nonadditive Integrals

As with nonadditive measures, to avoid contradictions, we will use the term "integral" to denote a very general concept, which can then be specialized by attributes such as "additive".

Let $\mathcal{M}$ be a class of measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$). An integral on $\mathcal{M}$ is a function that associates a value $\int f \,\mathrm{d}\mu \in [0,\infty]$ (the integral of $f$ with respect to $\mu$) to each pair consisting of a measure $\mu \in \mathcal{M}$ and of a function $f : \Omega_\mu \to [0,\infty]$, and that satisfies the following indicator property:
$$\int I_A \,\mathrm{d}\mu = \mu(A) \quad\text{for all } \mu \in \mathcal{M} \text{ and all } A \subseteq \Omega_\mu.$$
That is, integrals can be seen as extensions of measures: from the indicator functions to all nonnegative, extended real-valued functions. The definition of integrals for functions taking also negative values would pose some annoying problems (since $\infty - \infty$ is undefined) and will not be really needed in the following.

Usually, a basic property of integrals is linearity, which consists of (finite) additivity:
$$\int (f + g) \,\mathrm{d}\mu = \int f \,\mathrm{d}\mu + \int g \,\mathrm{d}\mu \quad\text{for all } \mu \in \mathcal{M} \text{ and all } f, g \in [0,\infty]^{\Omega_\mu},$$
and (positive) homogeneity. An integral on $\mathcal{M}$ is said to be (positively) homogeneous if
$$\int c\,f \,\mathrm{d}\mu = c \int f \,\mathrm{d}\mu \quad\text{for all } \mu \in \mathcal{M}, \text{ all } c \in [0,\infty], \text{ and all } f \in [0,\infty]^{\Omega_\mu}.$$
Note that the case with $c = 0$ is already implied by the indicator property: $\int 0 \,\mathrm{d}\mu = 0$, since $\mu(\emptyset) = 0$ by definition. When combined with the indicator property, homogeneity places no restrictions on the measures $\mu \in \mathcal{M}$. By contrast, additivity implies that they are finitely additive, since $I_{A \cup B} = I_A + I_B$ for all disjoint $A, B \subseteq \Omega_\mu$. Thus, additivity cannot be required for an integral on nonadditive measures. But if the measures are at least monotonic, then the following fundamental consequence of additivity can still be required. An integral on $\mathcal{M}$ is said to be monotonic if
$$f \le g \;\Rightarrow\; \int f \,\mathrm{d}\mu \le \int g \,\mathrm{d}\mu \quad\text{for all } \mu \in \mathcal{M} \text{ and all } f, g \in [0,\infty]^{\Omega_\mu}.$$
When combined with the indicator property, monotonicity implies that the measures $\mu \in \mathcal{M}$ are monotonic, since $I_A \le I_B$ for all $A \subseteq B \subseteq \Omega_\mu$.

2.2.1 Shilkret Integral

The Shilkret integral associates the value
$$\int^S f \,\mathrm{d}\mu = \sup_{x \in [0,\infty]} x\,\mu\{f > x\}$$
to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f : \Omega \to [0,\infty]$. It was introduced by Shilkret (1971) as the unique homogeneous, countably maxitive integral on countably maxitive measures. Before examining this aspect, we consider two other characterizations of the Shilkret integral.

Theorem 2.3. The Shilkret integral is a homogeneous, monotonic integral on the class of all monotonic measures.

If $\mu$ is a monotonic measure on a set $\Omega$, and $f : \Omega \to [0,\infty]$ is a function, then
$$\int^S f \,\mathrm{d}\mu \le \int f \,\mathrm{d}\mu$$
for all homogeneous, monotonic integrals on classes $\mathcal{M} \ni \mu$. That is, the Shilkret integral is the minimum of all homogeneous, monotonic integrals.

Proof. The indicator property for the Shilkret integral follows from
$$\mu\{I_A > x\} = \begin{cases} \mu(A) & \text{if } 0 \le x < 1, \\ 0 & \text{if } 1 \le x \le \infty. \end{cases} \tag{2.1}$$
Since for $c \in (0,\infty)$ we have $x\,\mu\{c\,f > x\} = c\,x'\,\mu\{f > x'\}$ with $x' = \frac{x}{c}$, the Shilkret integral is homogeneous. It is also monotonic, since $f \le g$ implies $\{f > x\} \subseteq \{g > x\}$, and $\mu$ is monotonic.

For all $x \in [0,\infty]$ we have $f \ge x\,I_{\{f > x\}}$. Hence, a homogeneous, monotonic integral satisfies
$$\int f \,\mathrm{d}\mu \ge \int x\,I_{\{f > x\}} \,\mathrm{d}\mu = x \int I_{\{f > x\}} \,\mathrm{d}\mu = x\,\mu\{f > x\}.$$
The desired result is obtained by taking the supremum (over all $x \in [0,\infty]$) of the right-hand side of the inequality. □

A well-known nonadditive integral was introduced by Sugeno (1974): the following generalization was considered for instance by Weber (1984), Mesiar (1995), and de Cooman (2000). A generalized Sugeno integral associates the value
$$\int f \,\mathrm{d}\mu = \sup_{x \in [0,\infty]} m(x, \mu\{f > x\})$$
to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f : \Omega \to [0,\infty]$, where $m$ is a nonnegative, extended real-valued function on $[0,\infty] \times [0,\infty]$ that is nondecreasing in both arguments. In the Sugeno integral, the function $m$ returns the minimum of its two arguments; if it returns the product, we obtain the Shilkret integral.

Theorem 2.4. A generalized Sugeno integral satisfies the indicator property if and only if the function $m$ fulfills the following two conditions:
$$m(x, 0) = 0 \quad\text{for all } x \in [0,\infty], \tag{2.2}$$
$$m(1, y) = y \quad\text{for all } y \in [0,\infty]. \tag{2.3}$$
In this case it is a monotonic integral on the class of all monotonic measures.

It is also homogeneous if and only if the function $m$ returns the product of its two arguments. That is, the Shilkret integral is the homogeneous version of the generalized Sugeno integral.

Proof. The indicator property implies $\int 0 \,\mathrm{d}\mu = 0$, and thus also condition (2.2). If condition (2.2) holds, then from the equality (2.1) we obtain $\int I_A \,\mathrm{d}\mu = m[1, \mu(A)]$ (since $m$ is nondecreasing in the first argument), and therefore the indicator property is equivalent to condition (2.3). The monotonicity of the integral can be proved in the same way as for the Shilkret integral, since $m$ is nondecreasing in the second argument.

Reasoning as above, we obtain $\int c\,I_A \,\mathrm{d}\mu = m[c, \mu(A)]$ for all $c \in [0,\infty]$. Hence, the homogeneity of the integral implies that $m$ returns the product of its two arguments. The converse was proved in Theorem 2.3. □

Let $\mathcal{M}$ be a class of measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$). An integral on $\mathcal{M}$ is said to be (finitely) maxitive if
$$\int \max\{f, g\} \,\mathrm{d}\mu = \max\Bigl\{\int f \,\mathrm{d}\mu,\; \int g \,\mathrm{d}\mu\Bigr\} \quad\text{for all } \mu \in \mathcal{M} \text{ and all } f, g \in [0,\infty]^{\Omega_\mu}.$$
When combined with the indicator property, maxitivity implies that the measures $\mu \in \mathcal{M}$ are finitely maxitive, since $I_{A \cup B} = \max\{I_A, I_B\}$ for all $A, B \subseteq \Omega_\mu$. It can be easily proved that for an integral on a class of finitely maxitive measures, maxitivity implies monotonicity, while the converse is not true.

Theorem 2.5. The Shilkret integral is maxitive on the class of all finitely maxitive measures.

Proof. The maxitivity of the Shilkret integral follows from
$$\mu\{\max\{f, g\} > x\} = \mu(\{f > x\} \cup \{g > x\}) = \max\{\mu\{f > x\}, \mu\{g > x\}\},$$
since sup and max commute. □
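Theorem 2.5 can be checked numerically on a small finite space (the density and function values are illustrative; the sup-measure used is completely, hence finitely, maxitive):

```python
# Sketch: for a finitely maxitive mu, the Shilkret integral of max(f, g)
# equals the maximum of the two Shilkret integrals (Theorem 2.5).

def shilkret(f, mu_density, xs):
    # mu{f > x} = sup of the density on {f > x} (mu completely maxitive)
    return max(x * max((mu_density[q] for q in f if f[q] > x), default=0.0)
               for x in xs)

density = {"a": 0.4, "b": 1.0, "c": 0.7}    # illustrative density
f = {"a": 2.0, "b": 0.5, "c": 1.0}
g = {"a": 0.3, "b": 1.5, "c": 0.8}
fg = {q: max(f[q], g[q]) for q in f}
xs = [k / 500 for k in range(1501)]         # threshold grid on [0, 3]

lhs = shilkret(fg, density, xs)
rhs = max(shilkret(f, density, xs), shilkret(g, density, xs))
assert abs(lhs - rhs) < 1e-9
```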

Shilkret (1971) noted that homogeneity and maxitivity do not characterize his integral: maxitivity must be at least countable. We are not interested in countable maxitivity, so we proceed directly to the strongest version of maxitivity. An integral on $\mathcal{M}$ is said to be completely maxitive if
$$\int \sup_{f \in \mathcal{F}} f \,\mathrm{d}\mu = \sup_{f \in \mathcal{F}} \int f \,\mathrm{d}\mu \quad\text{for all } \mu \in \mathcal{M} \text{ and all } \mathcal{F} \subseteq [0,\infty]^{\Omega_\mu}.$$
When combined with the indicator property, complete maxitivity implies that the measures $\mu \in \mathcal{M}$ are completely maxitive, since $I_{\bigcup \mathcal{A}} = \sup_{A \in \mathcal{A}} I_A$ for all $\mathcal{A} \subseteq 2^{\Omega_\mu}$.

Theorem 2.6. The Shilkret integral is completely maxitive on the class of all completely maxitive measures.

If $\mu$ is a completely maxitive measure on a set $\Omega$, and $f : \Omega \to [0,\infty]$ is a function, then
$$\int f \,\mathrm{d}\mu = \int^S f \,\mathrm{d}\mu = \sup f\,\mu^\downarrow$$
for all homogeneous, completely maxitive integrals on classes $\mathcal{M} \ni \mu$. That is, the Shilkret integral on completely maxitive measures is characterized by homogeneity and complete maxitivity.

Proof. By definition of $\mu^\downarrow$, we have $\mu\{f > x\} = \sup_{\{f > x\}} \mu^\downarrow$, and therefore
$$\int^S f \,\mathrm{d}\mu = \sup_{x \in [0,\infty]} \sup_{q \in \{f > x\}} x\,\mu^\downarrow(q) = \sup_{q \in \{f > 0\}} \sup_{x \in [0, f(q))} x\,\mu^\downarrow(q) = \sup_{q \in \{f > 0\}} f(q)\,\mu^\downarrow(q) = \sup f\,\mu^\downarrow.$$
This expression implies the complete maxitivity of the Shilkret integral, since $\sup_{q \in \Omega}$ and $\sup_{f \in \mathcal{F}}$ commute.

If an integral is completely maxitive and homogeneous, then it satisfies
$$\int f \,\mathrm{d}\mu = \int \sup_{q \in \{f > 0\}} \sup_{c \in (0, f(q))} c\,I_{\{q\}} \,\mathrm{d}\mu = \sup_{q \in \{f > 0\}} \sup_{c \in (0, f(q))} \int c\,I_{\{q\}} \,\mathrm{d}\mu = \sup_{q \in \{f > 0\}} \sup_{c \in (0, f(q))} c\,\mu\{q\} = \sup_{q \in \{f > 0\}} f(q)\,\mu^\downarrow(q) = \sup f\,\mu^\downarrow. \qquad\Box$$

Example 2.7. Let $\mu$ be the completely maxitive measure on $[0,1]$ considered in Example 2.1, and for each $d \in [0,1]$ let $l_d$ be the function $x \mapsto (x - d)^2$ on $[0,1]$. Theorem 2.6 implies that
$$\int^S l_d \,\mathrm{d}\mu = \sup_{x \in [0,1]} x\,(x - d)^2 = \begin{cases} (1 - d)^2 & \text{if } 0 \le d \le \frac{3}{4}, \\[2pt] \frac{4}{27}\,d^3 & \text{if } \frac{3}{4} \le d \le 1; \end{cases}$$
this integral is plotted (as a function of $d \in [0,1]$) in the first diagram of Figure 2.1 (dashed line).

A comparison of the above expression with the results of Example 1.8 shows that the value $d = \frac{3}{4}$ minimizing $\int^S l_d \,\mathrm{d}\mu$ is the estimate obtained by applying the MPL criterion to the estimation problem of Example 1.1, when $n = 1$ and $X = 1$; in fact, in this case $\mu^\downarrow$ is equivalent to the likelihood function $\mathrm{lik}$ on $[0,1]$ considered in Example 1.6 (see Section 3.1). ◊
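The closed-form expression of Example 2.7 can be checked against a direct numerical maximization (a finite grid only approximates the supremum, hence the tolerance):

```python
# Sketch: the Shilkret integral of l_d(x) = (x - d)^2 with respect to the
# sup-measure on [0,1] equals sup_x x*(x - d)^2, which is (1 - d)^2 for
# d <= 3/4 and 4*d^3/27 for d >= 3/4 (the two branches agree at d = 3/4,
# where both equal 1/16).

def shilkret_ld(d, n=2000):
    return max(x * (x - d) ** 2 for x in (k / n for k in range(n + 1)))

def closed_form(d):
    return (1 - d) ** 2 if d <= 0.75 else 4 * d ** 3 / 27

for d in [0.0, 0.25, 0.5, 0.75, 0.9, 1.0]:
    assert abs(shilkret_ld(d) - closed_form(d)) < 1e-2
```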


Figure 2.1. Integrals and distribution function from Examples 2.7, 2.10, 2.11, and 2.16.

2.2.2 Invariance Properties

The properties considered so far for integrals on classes of measures apply to each measure separately, while the following fundamental invariance involves integrals with respect to different measures. Let $\mathcal{M}$ be a class of measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$) that is closed under transformations. An integral on $\mathcal{M}$ is said to be transformation invariant if
$$\int (f \circ t) \,\mathrm{d}\mu = \int f \,\mathrm{d}(\mu \circ t^{-1}) \quad\text{for all } \mu \in \mathcal{M}, \text{ all sets } T, \text{ all } f \in [0,\infty]^T, \text{ and all } t \in T^{\Omega_\mu}.$$
When combined with the indicator property, transformation invariance places no restrictions on the measures in $\mathcal{M}$. But it is a powerful property, since the integral of $f$ with respect to $\mu$ can be written as
$$\int f \,\mathrm{d}\mu = \int (\mathrm{id}_{[0,\infty]} \circ f) \,\mathrm{d}\mu = \int \mathrm{id}_{[0,\infty]} \,\mathrm{d}(\mu \circ f^{-1}).$$
That is, a transformation invariant integral on $\mathcal{M}$ can be seen as a function on the subset of $\mathcal{M}$ consisting of the measures on $[0,\infty]$.

Theorem 2.8. All transformation invariant integrals on classes $\mathcal{M}$ of measures ($\mathcal{M}$ closed under transformations) have the following property: if $\mu \in \mathcal{M}$ is a monotonic, subadditive measure on a set $\Omega$, the nonempty set $A \subseteq \Omega$ satisfies $\mu(\Omega \setminus A) = 0$, and $f, g : \Omega \to [0,\infty]$ are functions, then
$$\mu|_A \in \mathcal{M}, \qquad \int f \,\mathrm{d}\mu = \int f|_A \,\mathrm{d}\mu|_A,$$
and in particular
$$f = g \text{ a.e. } [\mu] \;\Rightarrow\; \int f \,\mathrm{d}\mu = \int g \,\mathrm{d}\mu.$$

Proof. As noted in Subsection 2.1.1, we have $\mu(B) = \mu(B \cap A)$ for all $B \subseteq \Omega$ (since $\mu$ is monotonic and subadditive, and $\mu(\Omega \setminus A) = 0$). Hence, if $T$ is a set, and $t : \Omega \to T$ is a function, then $\mu \circ t^{-1} = \mu|_A \circ (t|_A)^{-1}$, since for all $C \subseteq T$ we have
$$\mu[t^{-1}(C)] = \mu[t^{-1}(C) \cap A] = \mu[(t|_A)^{-1}(C)] = \mu|_A[(t|_A)^{-1}(C)].$$
This result implies the first two statements of the theorem. To conclude that $\mu|_A \in \mathcal{M}$, it suffices to choose a function $t : \Omega \to A$ such that $t|_A = \mathrm{id}_A$: we obtain $\mu \circ t^{-1} = \mu|_A$, and $\mathcal{M}$ is closed under transformations. To conclude that $\int f \,\mathrm{d}\mu = \int f|_A \,\mathrm{d}\mu|_A$, it suffices to choose $t = f$: we obtain $\mu \circ f^{-1} = \mu|_A \circ (f|_A)^{-1}$, and the integral is transformation invariant.

The last statement of the theorem follows from the first two, except for the case with $\{f = g\} = \emptyset$. In this case $\mu$ is the zero measure on $\Omega$, and therefore $\int f \,\mathrm{d}\mu = 0$ for all $f$, since $\mu \circ f^{-1}$ is then the zero measure on $[0,\infty]$, independently of $f$. □

Let $\mu$ be a monotonic measure on a set $\Omega$, and let $f : \Omega \to [0,\infty]$ be a function. The essential supremum of $f$ with respect to $\mu$ is the value
$$\mathrm{ess}_\mu \sup f = \inf\{x \in [0,\infty] : \mu\{f > x\} = 0\}.$$
When $\mu$ is also subadditive, we can say that $\mathrm{ess}_\mu \sup f$ is the infimum of all $x \in [0,\infty]$ such that $f \le x$ a.e. $[\mu]$. In particular, it can be easily proved that if $\mu$ is completely maxitive, then $\mathrm{ess}_\mu \sup f = \sup_{\{\mu^\downarrow > 0\}} f$.

The rectangular integral associates the value
$$\int^R f \,\mathrm{d}\mu = (\mathrm{ess}_\mu \sup f)\;\mu\{f > 0\}$$
to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f : \Omega \to [0,\infty]$. When $\mu$ is completely maxitive, the rectangular integral of $f$ with respect to $\mu$ can be written as $\int^R f \,\mathrm{d}\mu = \bigl(\sup_{\{\mu^\downarrow > 0\}} f\bigr)\bigl(\sup_{\{f > 0\}} \mu^\downarrow\bigr)$.
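For the sup-measure of Example 2.1 and the functions $l_d$ of Example 2.7, the rectangular integral and the sandwich between the Shilkret integral (Theorem 2.3) and the rectangular integral (Theorem 2.9 below) can be checked numerically (finite grid; the grid maximum over the support stands in for the essential supremum, which is legitimate here because the density is positive on $(0,1]$):

```python
# Sketch: the rectangular integral of l_d(x) = (x - d)^2 with respect to
# the sup-measure on [0,1] is (ess sup l_d) * mu{l_d > 0}
# = max(d^2, (1 - d)^2), and it dominates the Shilkret integral.

def rectangular_ld(d, n=2000):
    grid = [k / n for k in range(n + 1)]
    support = [x for x in grid if (x - d) ** 2 > 0]
    ess_sup = max((x - d) ** 2 for x in support)  # approximates ess sup
    return ess_sup * max(support)                 # mu{l_d > 0} = sup support

for d in [0.0, 0.3, 0.5, 0.8, 1.0]:
    r = rectangular_ld(d)
    assert abs(r - max(d ** 2, (1 - d) ** 2)) < 1e-2
    s = max(x * (x - d) ** 2 for x in (k / 2000 for k in range(2001)))
    assert s <= r + 1e-9                          # Shilkret <= rectangular
```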


Theorem 2.9. The rectangular integral is a transformation invariant, homogeneous, monotonic integral on the class of all monotonic measures.

If $\mu$ is a monotonic, subadditive measure on a set $\Omega$, and $f : \Omega \to [0,\infty]$ is a function, then
$$\int f \,\mathrm{d}\mu \le \int^R f \,\mathrm{d}\mu$$
for all transformation invariant, homogeneous, monotonic integrals on classes $\mathcal{M} \ni \mu$. That is, the rectangular integral on monotonic, subadditive measures is the maximum of all transformation invariant, homogeneous, monotonic integrals.

Proof. The rectangular integral on monotonic measures satisfies the indicator property, monotonicity, and homogeneity, since the essential supremum satisfies
$$\mathrm{ess}_\mu \sup I_A = \begin{cases} 0 & \text{if } \mu(A) = 0, \\ 1 & \text{if } \mu(A) > 0, \end{cases}$$
$$f \le g \;\Rightarrow\; \mathrm{ess}_\mu \sup f \le \mathrm{ess}_\mu \sup g,$$
$$\mathrm{ess}_\mu \sup c\,f = c\,\mathrm{ess}_\mu \sup f \quad\text{for all } c \in (0,\infty).$$
The rectangular integral is transformation invariant, since for all $A \subseteq [0,\infty]$
$$\mu\{f \circ t \in A\} = [\mu \circ (f \circ t)^{-1}](A) = [(\mu \circ t^{-1}) \circ f^{-1}](A) = (\mu \circ t^{-1})\{f \in A\}.$$
Let $\mu$ be a monotonic, subadditive measure. If $\mathrm{ess}_\mu \sup f$ is infinite, then $\int^R f \,\mathrm{d}\mu = \infty$. If $\mathrm{ess}_\mu \sup f$ is finite, then for all $x \in (\mathrm{ess}_\mu \sup f, \infty)$ we have $f = \min\{f, x\}$ a.e. $[\mu]$. Hence, a transformation invariant, homogeneous, monotonic integral satisfies
$$\int f \,\mathrm{d}\mu = \int \min\{f, x\} \,\mathrm{d}\mu \le \int x\,I_{\{f > 0\}} \,\mathrm{d}\mu = x\,\mu\{f > 0\}.$$
The desired result is obtained by letting $x$ tend to $\mathrm{ess}_\mu \sup f$. □

Example 2.10. Let $\mu$ and $l_d$ be the measure and the functions on $[0,1]$ defined in Examples 2.1 and 2.7, respectively. Since $\mu$ is completely maxitive,
$$\int^R l_d \,\mathrm{d}\mu = \bigl(\sup_{(0,1]} l_d\bigr)\bigl(\sup_{[0,1] \setminus \{d\}} \mathrm{id}_{[0,1]}\bigr) = \begin{cases} (1 - d)^2 & \text{if } 0 \le d \le \frac{1}{2}, \\ d^2 & \text{if } \frac{1}{2} \le d \le 1; \end{cases}$$
this integral is plotted (as a function of $d \in [0,1]$) in the first diagram of Figure 2.1 (dotted line). Since $\int^S l_d \,\mathrm{d}\mu = \int^R l_d \,\mathrm{d}\mu$ for all $d \in [0, \frac{1}{2}]$ (see Example 2.7), Theorems 2.3 and 2.9 imply that $\int l_d \,\mathrm{d}\mu = (1 - d)^2$ for all $d \in [0, \frac{1}{2}]$ and all transformation invariant, homogeneous, monotonic integrals on classes $\mathcal{M} \ni \mu$. ◊

Theorem 2.8 states that for a transformation invariant integral on $\mathcal{M}$, if $\mu \in \mathcal{M}$ is a monotonic, subadditive measure on a set $\Omega$, and $f : \Omega \to [0,\infty]$ is a function, then $\int f \,\mathrm{d}\mu = \int f|_A \,\mathrm{d}\mu|_A$ when $A \subseteq \Omega$ is a nonempty set such that $\mu|_{\Omega \setminus A} = 0$. The reciprocal invariance property is that $\int f \,\mathrm{d}\mu = \int f|_A \,\mathrm{d}\mu|_A$ when $A \subseteq \Omega$ is a nonempty set such that $f|_{\Omega \setminus A} = 0$. But there is a difference: if $\mathcal{A} = \{A \subseteq \Omega : \mu|_{\Omega \setminus A} = 0\}$, then in general $\mu|_{\Omega \setminus \bigcap \mathcal{A}} \neq 0$ (we are sure that $\mu|_{\Omega \setminus \bigcap \mathcal{A}} = 0$ if $\mu$ is also continuous from below); while if $\mathcal{A} = \{A \subseteq \Omega : f|_{\Omega \setminus A} = 0\}$, then $f|_{\Omega \setminus \bigcap \mathcal{A}} = 0$, since $\bigcap \mathcal{A} = \{f > 0\}$, the support of $f$. That is, the reciprocal invariance property can simply be stated as follows.

Let $\mathcal{M}$ be a class of measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$) that is closed under restrictions. An integral on $\mathcal{M}$ is said to be support-based if
$$\int f \,\mathrm{d}\mu = \int f|_{\{f > 0\}} \,\mathrm{d}\mu|_{\{f > 0\}} \quad\text{for all } \mu \in \mathcal{M} \text{ and all } f \in [0,\infty]^{\Omega_\mu}.$$
This property places no restrictions on the measures in $\mathcal{M}$, since for indicator functions $f = I_A$ it is implied by the indicator property. For instance, the rectangular integral is support-based on the class of all monotonic measures, since it depends only on $\mu\{f > 0\}$ and on $\mu\{f > x\}$ for $x \in [0,\infty]$.

2.2.3 Regular Integrals

Let $X$ and $Y$ be nonnegative random variables on a probability space with probability measure $P$. Usually, $Y$ is said to (first-order) stochastically dominate $X$ (with respect to $P$) if
$$P\{X \le x\} \ge P\{Y \le x\} \quad\text{for all } x \in [0,\infty].$$
Since a probability measure is finitely additive, finite, and continuous (from both above and below), this condition is equivalent to the following one:
$$P\{X > x\} \le P\{Y > x\} \quad\text{for all } x \in [0,\infty].$$
Let $\mu$ be a monotonic measure on a set $\Omega$, and let $f : \Omega \to [0,\infty]$ be a function. The (decreasing) distribution function of $f$ with respect to $\mu$ is the function $x \mapsto \mu\{f > x\}$ on $[0,\infty]$; it shall be denoted by $\mu\{f > \cdot\}$. Let $\mathcal{M}$ be a class of monotonic measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$). An integral on $\mathcal{M}$ is said to respect distributional dominance if
$$\mu\{f > \cdot\} \le \nu\{g > \cdot\} \;\Rightarrow\; \int f \,\mathrm{d}\mu \le \int g \,\mathrm{d}\nu \quad\text{for all } \mu, \nu \in \mathcal{M}, \text{ all } f \in [0,\infty]^{\Omega_\mu}, \text{ and all } g \in [0,\infty]^{\Omega_\nu}.$$
If $\mu$ and $\nu$ are probability measures, then this property corresponds to respecting (first-order) stochastic dominance with respect to the product probability measure $\mu \times \nu$.

Example 2.11. Let $\mu$ and $l_d$ be the measure and the functions on $[0,1]$ defined in Examples 2.1 and 2.7, respectively. For each $d \in [0,1]$, the distribution function of $l_d$ with respect to $\mu$ satisfies
$$\mu\{l_d > x\} = \begin{cases} 1 & \text{if } 0 \le x < (1 - d)^2, \\ d - \sqrt{x} & \text{if } (1 - d)^2 \le x < d^2, \\ 0 & \text{if } \max\{d^2, (1 - d)^2\} \le x \le \infty; \end{cases}$$
the function $x \mapsto \mu\{l_d > x\}$ is plotted in the second diagram of Figure 2.1 for $x \in [0,1]$ (solid line).
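The piecewise formula above can be checked numerically (a finite grid approximates $\mu(A) = \sup A$, hence the tolerance):

```python
# Sketch: the distribution function mu{l_d > x} of l_d(x') = (x' - d)^2
# with respect to the sup-measure on [0,1] is 1 for x < (1-d)^2,
# d - sqrt(x) for (1-d)^2 <= x < d^2, and 0 afterwards.
from math import sqrt

def dist(d, x, n=4000):
    # mu is the sup-measure on [0,1]: mu(A) = sup A
    return max((q / n for q in range(n + 1) if (q / n - d) ** 2 > x),
               default=0.0)

def closed_form(d, x):
    if x < (1 - d) ** 2:
        return 1.0
    if x < d ** 2:
        return d - sqrt(x)
    return 0.0

for d in [0.7, 0.9, 1.0]:
    for x in [0.0, 0.05, 0.2, 0.5, 0.9]:
        assert abs(dist(d, x) - closed_form(d, x)) < 1e-2
```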

Let $\nu = \delta \circ \mu$ be the finitely maxitive measure on $[0,1]$ considered in Example 2.2. Since
$$\nu\{l_1 > \cdot\} = \delta \circ \mu\{l_1 > \cdot\} \le I_{\{0\}} + \tfrac{1}{2}\,I_{(0,1]} = \mu\{\tfrac{1}{2}\,I_{\{1\}} > \cdot\},$$
all integrals respecting distributional dominance on classes $\mathcal{M} \ni \mu, \nu$ satisfy $\int l_1 \,\mathrm{d}\nu \le \int \frac{1}{2}\,I_{\{1\}} \,\mathrm{d}\mu = \frac{1}{2}$. This is not the case for the rectangular integral: in fact, for all $d \in [0,1]$,
$$\int^R l_d \,\mathrm{d}\nu = (\mathrm{ess}_\nu \sup l_d)\;\delta(\mu\{l_d > 0\}) = \int^R l_d \,\mathrm{d}\mu = \max\{d^2, (1 - d)^2\},$$
since $l_d^{-1}\{0\} = \{d\}$ and $\delta(1) = 1$ (see Example 2.10). ◊

The property of respecting distributional dominance strengthens monotonicity, implying in particular also monotonicity with respect to measures; but the next theorem states that at least for integrals on the class of all completely maxitive measures, it is equivalent to monotonicity, provided that the invariance properties considered in Subsection 2.2.2 are satisfied. The term "bimonotonicity" shall designate monotonicity with respect to both functions and measures, and analogously, the term "bihomogeneity" shall designate the double homogeneity. An integral on $\mathcal{M}$ is said to be bimonotonic if it is monotonic and
$$\mu \le \nu \;\Rightarrow\; \int f \,\mathrm{d}\mu \le \int f \,\mathrm{d}\nu \quad\text{for all } \mu, \nu \in \mathcal{M} \text{ such that } \Omega_\mu = \Omega_\nu, \text{ and all } f \in [0,\infty]^{\Omega_\mu}.$$

Theorem 2.12. Let $\mathcal{M}$ be a class of monotonic measures. All integrals on $\mathcal{M}$ respecting distributional dominance are bimonotonic, and they are also transformation invariant if $\mathcal{M}$ is closed under transformations.

All transformation invariant, bimonotonic integrals on the class of all completely maxitive measures respect distributional dominance.

All support-based, transformation invariant, monotonic integrals on the class of all completely maxitive measures respect distributional dominance.

Proof. An integral respecting distributional dominance is bimonotonic and transformation invariant, since for all $x \in [0,\infty]$ we have
$$f \le g \;\Rightarrow\; \mu\{f > x\} \le \mu\{g > x\},$$
$$\mu \le \nu \;\Rightarrow\; \mu\{f > x\} \le \nu\{f > x\},$$
$$\mu\{f \circ t > x\} = (\mu \circ t^{-1})\{f > x\}.$$
If $\mu$ is a completely maxitive measure and the integral is transformation invariant, then $\int f \,\mathrm{d}\mu$ depends only on the function $(\mu \circ f^{-1})^\downarrow$ on $[0,\infty]$. Let $f'$ and $\mu'$ be respectively the function and the completely maxitive measure on $\Omega = [0,\infty]^2 \times \{0,1\}$ defined by
$$f'(x, y, z) = x \qquad\text{and}\qquad (\mu')^\downarrow(x, y, z) = \begin{cases} y & \text{if } y \le (\mu \circ f^{-1})^\downarrow(x) \text{ and } z = 1, \\ 0 & \text{otherwise.} \end{cases}$$
Then $\int f \,\mathrm{d}\mu = \int f' \,\mathrm{d}\mu'$, because for all $x \in [0,\infty]$
$$[\mu' \circ (f')^{-1}]^\downarrow(x) = \sup_{(f')^{-1}\{x\}} (\mu')^\downarrow = (\mu \circ f^{-1})^\downarrow(x).$$
If $\nu$ is a completely maxitive measure, then for all $x \in [0,\infty]$
$$\mu\{f > x\} \le \nu\{g > x\} \;\Leftrightarrow\; \sup_{(x,\infty]} (\mu \circ f^{-1})^\downarrow \le \sup_{(x,\infty]} (\nu \circ g^{-1})^\downarrow.$$
In this case, for all $(x, y) \in [0,\infty]^2$, if $y < (\mu \circ f^{-1})^\downarrow(x)$, then there is an $x' > x$ such that $y < (\nu \circ g^{-1})^\downarrow(x')$: define $g'(x, y, 1) = x'$ and $g'(x, y, 0) = x$. We obtain a function $g'$ on $\Omega$ such that $f' \le g'$, and therefore, if the integral is monotonic, then $\int f' \,\mathrm{d}\mu' \le \int g' \,\mathrm{d}\mu'$. Now, let $\nu'$ be the completely maxitive measure on $\Omega$ defined by
$$(\nu')^\downarrow(x, y, z) = \begin{cases} y & \text{if } y \le (\mu \circ f^{-1})^\downarrow(x) \text{ and } z = 1, \\ y & \text{if } y \le (\nu \circ g^{-1})^\downarrow(x) \text{ and } z = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Since $\mu' \le \nu'$, if the integral is bimonotonic, then $\int g' \,\mathrm{d}\mu' \le \int g' \,\mathrm{d}\nu'$.

To obtain the second statement of the theorem, it suffices to show that $\int g' \,\mathrm{d}\nu' = \int g \,\mathrm{d}\nu$; that is, it suffices to show that for all $x' \in [0,\infty]$
$$[\nu' \circ (g')^{-1}]^\downarrow(x') = \sup_{(g')^{-1}\{x'\}} (\nu')^\downarrow = (\nu \circ g^{-1})^\downarrow(x').$$
The second equality holds, since if $g'(x, y, 1) = x'$ and $y < (\mu \circ f^{-1})^\downarrow(x)$, then $y < (\nu \circ g^{-1})^\downarrow(x')$.

To obtain the third statement of the theorem, it suffices to show that if the integral is support-based, then the second equality of
$$\int f \,\mathrm{d}\mu = \int f' \,\mathrm{d}\mu' = \int f' \,\mathrm{d}\nu' \le \int g' \,\mathrm{d}\nu' = \int g \,\mathrm{d}\nu$$
holds, since the other two equalities have already been proved, and the inequality is implied by monotonicity. But if the integral is support-based, then the second equality is implied by $\mu'|_{\{f' > 0\}} = \nu'|_{\{f' > 0\}}$, which holds since it is equivalent to $[(\mu')^\downarrow]|_{\{f' > 0\}} = [(\nu')^\downarrow]|_{\{f' > 0\}}$. □

Let $\mathcal{M}$ be a regular class of monotonic measures (each $\mu \in \mathcal{M}$ is a measure on some set $\Omega_\mu$). An integral on $\mathcal{M}$ is said to be bihomogeneous if it is homogeneous and
$$\int f \,\mathrm{d}(c\,\mu) = c \int f \,\mathrm{d}\mu \quad\text{for all } \mu \in \mathcal{M}, \text{ all } c \in (0,\infty), \text{ and all } f \in [0,\infty]^{\Omega_\mu}.$$
An integral on $\mathcal{M}$ is said to be regular if it is bihomogeneous and it respects distributional dominance.

The next theorem states that a regular integral on $\mathcal{M}$ corresponds to a functional on the set
$$\mathcal{DF}(\mathcal{M}) = \{\mu\{f > \cdot\} : \mu \in \mathcal{M},\; f \in [0,\infty]^{\Omega_\mu}\}$$
of all distribution functions generated by the measures in $\mathcal{M}$. Since these measures are monotonic, $\mathcal{DF}(\mathcal{M})$ is a subset of
$$\mathcal{NI} = \{\varphi \in [0,\infty]^{[0,\infty]} : \varphi \text{ is nonincreasing}\}.$$

In general $\mathcal{DF}(\mathcal{M})$ is a proper subset of $\mathcal{NI}$: the class $\mathcal{M}$ is said to be exhaustive if the two sets are equal; for example, the class of all completely maxitive measures is exhaustive. In order to state the theorem, we need some additional definitions.

Let $\Omega$ be a set, and let $\mathcal{F}$ be a subset of $[0,\infty]^\Omega$. A functional $V : \mathcal{F} \to [0,\infty]$ is said to be monotonic if
$$f \le g \;\Rightarrow\; V(f) \le V(g) \quad\text{for all } f, g \in \mathcal{F}.$$
A functional $V : \mathcal{F} \to [0,\infty]$ is said to be homogeneous if
$$V(c\,f) = c\,V(f) \quad\text{for all } f \in \mathcal{F} \text{ and all } c \in (0,\infty) \text{ such that } c\,f \in \mathcal{F}.$$
For example, if an integral on $\mathcal{M}$ is monotonic and homogeneous, then for each measure $\mu \in \mathcal{M}$ the functional $f \mapsto \int f \,\mathrm{d}\mu$ on $[0,\infty]^{\Omega_\mu}$ is monotonic and homogeneous. If $\varphi : [0,\infty] \to [0,\infty]$ is a function, and $c \in (0,\infty)$, then $\varphi(\frac{\cdot}{c})$ denotes the function $x \mapsto \varphi(\frac{x}{c})$ on $[0,\infty]$. Let $\Psi$ be a subset of $[0,\infty]^{[0,\infty]}$. A functional $F : \Psi \to [0,\infty]$ is said to be bihomogeneous if it is homogeneous and
$$F[\varphi(\tfrac{\cdot}{c})] = c\,F(\varphi) \quad\text{for all } \varphi \in \Psi \text{ and all } c \in (0,\infty) \text{ such that } \varphi(\tfrac{\cdot}{c}) \in \Psi.$$
Note that if $\varphi \in \mathcal{DF}(\mathcal{M})$ and $c \in (0,\infty)$, then $c\,\varphi, \varphi(\frac{\cdot}{c}) \in \mathcal{DF}(\mathcal{M})$. A functional $F : \Psi \to [0,\infty]$ is said to be $x$-independent, for an $x \in [0,\infty]$, if
$$\varphi|_{[0,\infty] \setminus \{x\}} = \psi|_{[0,\infty] \setminus \{x\}} \;\Rightarrow\; F(\varphi) = F(\psi) \quad\text{for all } \varphi, \psi \in \Psi.$$
A functional $F : \mathcal{NI} \to [0,\infty]$ is said to be calibrated if
$$F(I_{[0,1)} + y\,I_{\{0\}}) = 1 \quad\text{for all } y \in [0,\infty].$$
Note that a 0-independent functional $F$ on $\mathcal{NI}$ is calibrated if and only if $F(I_{[0,1)}) = 1$. The left-continuous version of a function $\varphi \in \mathcal{NI}$ is the function $\varphi^- \in \mathcal{NI}$ defined by
$$\varphi^-(x) = \inf_{[0,x)} \varphi \quad\text{for all } x \in [0,\infty].$$
It can be easily proved that $\varphi^-$ is left-continuous, and that $(\varphi^-)^- = \varphi^-$. Note that $\varphi^-(0) = \infty$.

Theorem 2.13. Let $\mathcal M$ be a regular class of monotonic measures. An integral on $\mathcal M$ is regular if and only if the functional

$$F:\ \mu\{f>\cdot\}\ \mapsto\ \int f\,\mathrm d\mu$$

on $\mathcal{DF}(\mathcal M)$ is well-defined, monotonic, and bihomogeneous. In this case, the functional $F$ is $x$-independent for all $x\in(0,\infty]$; if it is also 0-independent and $\mathcal M$ is closed under restrictions, then the integral is support-based.

Conversely, if $F:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ is a calibrated, bihomogeneous, monotonic functional, then the equality

$$\int f\,\mathrm d\mu=F(\mu\{f>\cdot\})$$

defines a regular integral on the class of all monotonic measures. The integral is support-based if and only if $F$ is 0-independent. In this case, $F(\varphi)=F(\varphi^-)$ for all $\varphi\in\mathcal{NI}_{\overline{\mathbb P}}$, and in particular, denoting by $\mu\{f\ge\cdot\}$ the function $x\mapsto\mu\{f\ge x\}$ on $\overline{\mathbb P}$, the integral can be written also as

$$\int f\,\mathrm d\mu=F(\mu\{f\ge\cdot\}).$$

Proof. An integral on $\mathcal M$ respects distributional dominance if and only if the functional $F$ is well-defined and monotonic. In this case, the integral is bihomogeneous if and only if $F$ is bihomogeneous. If $\varphi|_{\overline{\mathbb P}\setminus\{x\}}=\psi|_{\overline{\mathbb P}\setminus\{x\}}$ for an $x\in\mathbb P$, then for all $x'\in\overline{\mathbb P}\setminus\{x\}$ and all $c\in(1,\infty)$ we have

$$\psi(x)\le\psi(x/c)=\varphi(x/c)\qquad\text{and}\qquad\psi(x')=\varphi(x')\le\varphi(x'/c).$$

Hence $F(\psi)\le F[\varphi(\cdot/c)]=c\,F(\varphi)$ holds when $F$ is monotonic and bihomogeneous: by letting $c$ tend to 1 we obtain the inequality $F(\psi)\le F(\varphi)$, and thus, by symmetry, the $x$-independence. To prove the $\infty$-independence, consider the following consequence of Theorem 2.3:

$$\lim_{x\uparrow\infty}\mu\{f>x\}>0\ \Rightarrow\ \int f\,\mathrm d\mu\ \ge\ \int^{\mathrm S}f\,\mathrm d\mu=\infty.$$

If $\varphi|_{[0,\infty)}=\psi|_{[0,\infty)}$, then $\lim_{x\uparrow\infty}\varphi=\lim_{x\uparrow\infty}\psi$. If this limit is 0, then $\varphi=\psi$; otherwise $F(\varphi)=F(\psi)=\infty$. If $F$ is 0-independent, then the integral is support-based, since

$$\bigl(\mu|_{\{f>0\}}\{f|_{\{f>0\}}>\cdot\}\bigr)\big|_{(0,\infty]}=\bigl(\mu\{f>\cdot\}\bigr)\big|_{(0,\infty]}.$$

The integral defined by a calibrated, bihomogeneous, monotonic functional $F:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ satisfies the indicator property, since if $\mu(A)\in\mathbb P$, then using bihomogeneity and calibration we obtain

$$\int I_A\,\mathrm d\mu=F(\mu\{I_A>\cdot\})=F(\mu(A)\,I_{[0,1)})=\mu(A),$$

and using limits and monotonicity we obtain the same result also for $\mu(A)\in\{0,\infty\}$. We have already proved that the integral is regular, that $F$ is $\infty$-independent, and that the integral is support-based when $F$ is 0-independent. Now assume that the integral is support-based, and consider a function $\varphi\in\mathcal{NI}_{\overline{\mathbb P}}$. To prove the 0-independence of $F$, it suffices to show that defining $\varphi'=\varphi\,I_{(0,\infty]}+(\sup_{(0,\infty]}\varphi)\,I_{\{0\}}$ we obtain $F(\varphi)=F(\varphi')$. Let $\mu$ be the completely maxitive measure on $\overline{\mathbb P}$ defined by $\mu^{\wedge}=\varphi$. Since $\varphi$ is nonincreasing, we have $\mu\{\mathrm{id}_{\overline{\mathbb P}}>\cdot\}=\varphi$, and therefore, since the integral is support-based and $\{\mathrm{id}_{\overline{\mathbb P}}>0\}=(0,\infty]$, to obtain the desired result it suffices to show that $\mu|_{(0,\infty]}\{\mathrm{id}_{\overline{\mathbb P}}|_{(0,\infty]}>x\}=\varphi'(x)$ for all $x\in\overline{\mathbb P}$. This equality clearly holds for positive $x$, and for $x=0$ we have

$$\mu|_{(0,\infty]}\{\mathrm{id}_{\overline{\mathbb P}}|_{(0,\infty]}>0\}=\mu(0,\infty]=\sup\nolimits_{(0,\infty]}\varphi=\varphi'(0).$$

Defining $\varphi''=\varphi\,I_{(0,\infty)}+\varphi^-\,I_{\{0,\infty\}}$, we obtain $\varphi^-(x)\le\varphi''(x/c)$ for all $x\in\overline{\mathbb P}$ and all $c\in(1,\infty)$, and therefore

$$F(\varphi'')=F(\varphi)\le F(\varphi^-)\le c\,F(\varphi'').$$

The equality $F(\varphi)=F(\varphi^-)$ is obtained by letting $c$ tend to 1. The last statement of the theorem follows from this result, because the equality $(\mu\{f>\cdot\})^-=(\mu\{f\ge\cdot\})^-$ is obtained from the inequalities

$$\mu\{f>x\}\ \le\ \mu\{f\ge x\}\ \le\ \inf\nolimits_{[0,x)}\mu\{f>\cdot\}$$

by considering the left-continuous versions of the corresponding functions of $x\in\overline{\mathbb P}$:

$$(\mu\{f>\cdot\})^-\ \le\ (\mu\{f\ge\cdot\})^-\ \le\ [(\mu\{f>\cdot\})^-]^-=(\mu\{f>\cdot\})^-.\qquad\square$$

Many authors (for example Denneberg, 1994) define the distribution function as $\mu\{f\ge\cdot\}$. Theorem 2.13 assures us that if a support-based, regular integral on an exhaustive class of monotonic measures is defined by a functional $F$ on $\mathcal{NI}_{\overline{\mathbb P}}$, then it is not affected by the choice between $\mu\{f>\cdot\}$ and $\mu\{f\ge\cdot\}$.

Theorem 2.13 implies that the Shilkret integral is regular and support-based on the class of all monotonic measures, since it corresponds to the 0-independent, bihomogeneous, monotonic functional $F_{\mathrm S}:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ defined by

$$F_{\mathrm S}(\varphi)=\sup_{x\in\overline{\mathbb P}}x\,\varphi(x)\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

By contrast, the rectangular integral is not regular on the class of all monotonic measures, because the functional $F$ is not well-defined, since in general $\mu\{f>0\}$ is not determined by $\mu\{f\ge\cdot\}$. But if $\mu$ is continuous from below, then $\mu\{f>0\}=\sup_{\mathbb P}\mu\{f>\cdot\}$, and therefore the rectangular integral is regular and support-based on the class of all monotonic measures that are continuous from below, since it corresponds to the 0-independent, bihomogeneous, monotonic functional $F_{\mathrm r}:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ defined by

$$F_{\mathrm r}(\varphi)=\inf\{\varphi=0\}\ \sup\nolimits_{\mathbb P}\varphi\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

The regularized rectangular integral is the integral defined by $F_{\mathrm r}$ on the class of all monotonic measures: it associates the value

$$\int^{\bar{\mathrm r}}f\,\mathrm d\mu=\bigl(\mathrm{ess}_{\mu}\sup f\bigr)\ \sup\nolimits_{\mathbb P}\mu\{f>\cdot\}$$

to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f:\Omega\to\overline{\mathbb P}$. The regularized rectangular integral is thus regular and support-based on the class of all monotonic measures, and it corresponds to the rectangular integral on the class of all monotonic measures that are continuous from below.
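On a finite grid the two functionals $F_{\mathrm S}$ and $F_{\mathrm r}$ are easy to evaluate directly from their definitions. The following sketch is not from the thesis: the step function and the grid are invented for illustration, and the suprema are only approximated numerically.

```python
def F_S(phi, grid):
    """Shilkret functional: sup over x of x * phi(x), approximated on a grid."""
    return max(x * phi(x) for x in grid)

def F_r(phi, grid):
    """Regularized rectangular functional: inf{phi = 0} * sup_{x>0} phi(x)."""
    zeros = [x for x in grid if phi(x) == 0.0]
    inf_zero = min(zeros) if zeros else float("inf")
    return inf_zero * max(phi(x) for x in grid if x > 0)

# A nonincreasing step function (made up): 0.8 on [0, 1), 0.3 on [1, 2), 0 beyond.
phi = lambda x: 0.8 if x < 1 else (0.3 if x < 2 else 0.0)
grid = [k / 1000 for k in range(4001)]

print(F_S(phi, grid))  # ~ 0.8 (the sup is approached as x -> 1-)
print(F_r(phi, grid))  # ~ 2 * 0.8 = 1.6

# Calibration: both functionals map the indicator of [0, 1] to 1.
ind = lambda x: 1.0 if x <= 1 else 0.0
print(F_S(ind, grid), F_r(ind, grid))
```

Note how $F_{\mathrm S}$ picks the largest rectangle under the graph while $F_{\mathrm r}$ takes the smallest rectangle covering the positive part of the graph, matching the geometric reading given later in this section.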

Example 2.14. Let $\mu$, $\nu$, $\delta$, and $l_d$ be the measures and the functions defined in Examples 2.1, 2.2, and 2.7. For all $d\in[0,1)$

$$\int^{\mathrm r}l_d\,\mathrm d\mu=\int^{\bar{\mathrm r}}l_d\,\mathrm d\mu=\max\{d^2,(1-d)^2\}=\int^{\mathrm r}l_d\,\mathrm d\nu=\int^{\bar{\mathrm r}}l_d\,\mathrm d\nu;$$

the first equality holds because $\mu$ is completely maxitive (and thus continuous from below), the second and third ones were proved in Examples 2.10 and 2.11, and the last one is implied by $\sup_{\mathbb P}\nu\{l_d>\cdot\}=1=\nu\{l_d>0\}$. When $d=1$, the first three equalities still hold, but the last one does not:

$$\int^{\bar{\mathrm r}}l_1\,\mathrm d\nu=\tfrac12,$$

since $\sup_{\mathbb P}\nu\{l_1>\cdot\}=\tfrac12$. In fact, Theorem 2.13 implies that all support-based, regular integrals on exhaustive classes $\mathcal M\ni\mu,\nu$ satisfy $\int l_1\,\mathrm d\nu=\tfrac12\int l_1\,\mathrm d\mu$, because $\nu\{l_1>\cdot\}|_{(0,\infty]}=\tfrac12\,\mu\{l_1>\cdot\}|_{(0,\infty]}$. $\Diamond$

Theorem 2.15. If $\mu$ is a monotonic measure on a set $\Omega$, and $f:\Omega\to\overline{\mathbb P}$ is a function, then

$$\int^{\mathrm S}f\,\mathrm d\mu\ \le\ \int f\,\mathrm d\mu\ \le\ \int^{\bar{\mathrm r}}f\,\mathrm d\mu$$

for all regular integrals on classes $\mathcal M\ni\mu$. That is, the Shilkret integral and the regularized rectangular integral are respectively the minimum and the maximum of all regular integrals.

In particular, if $A\subseteq\Omega$ is a nonempty set, and $\mu$ is the completely maxitive measure on $\Omega$ defined by $\mu^{\wedge}=I_A$, then

$$\int f\,\mathrm d\mu=\sup\nolimits_A f\qquad\text{and}\qquad\int f\,\mathrm d\bar\mu=\inf\nolimits_A f$$

for all regular integrals on classes $\mathcal M\ni\mu,\bar\mu$.

Proof. Since a regular integral is monotonic and homogeneous, the first inequality was proved in Theorem 2.3. If $y=\sup_{\mathbb P}\mu\{f>\cdot\}\in\{0,\infty\}$, then the second inequality clearly holds. If $y\in(0,\infty)$, then there is an $x\in\mathbb P$ such that $y'=\mu\{f>x\}\in(0,y]$. Let $g=(\mathrm{ess}_{\mu}\sup f)\,I_{\{f>x\}}$ and $\nu=\tfrac{y}{y'}\,\mu$. Since $\mu\{f>\cdot\}\le\mu(\Omega)\,I_{\{0\}}+y\,I_{(0,\,\mathrm{ess}_{\mu}\sup f]}\le\nu\{g\ge\cdot\}$, we have

$$\int f\,\mathrm d\mu\ \le\ \int g\,\mathrm d\nu=(\mathrm{ess}_{\mu}\sup f)\,\nu\{f>x\}=\int^{\bar{\mathrm r}}f\,\mathrm d\mu.$$

Consider now the case with $\mu^{\wedge}=I_A$. If $x<\sup_A f$, then $\mu\{f>x\}=1$; while if $x\ge\sup_A f$, then $\mu\{f>x\}=0$. Hence we have

$$\int^{\mathrm S}f\,\mathrm d\mu=\int^{\bar{\mathrm r}}f\,\mathrm d\mu=\sup\nolimits_A f.$$

The last equality of the theorem can be obtained in the same way: if $x<\inf_A f$, then $\bar\mu\{f>x\}=1-\mu\{f\le x\}=1$; while if $x\ge\inf_A f$, then $\bar\mu\{f>x\}=1-\mu\{f\le x\}=0$. $\square$
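The second part can be illustrated numerically. The finite example below is hypothetical (not from the thesis); the Shilkret and regularized rectangular integrals are the two extremes, so by the theorem every regular integral must agree with them here.

```python
def shilkret(mu, f):
    # Shilkret integral on a finite space: max over values v of v * mu{f >= v}.
    return max((v * mu(frozenset(q for q, w in f.items() if w >= v))
                for v in f.values()), default=0.0)

def reg_rectangular(mu, f):
    # (ess_mu sup f) * sup_{x>0} mu{f > x}; on a finite space the sup is mu{f > 0}.
    ess = max((v for v in f.values()
               if mu(frozenset(q for q, w in f.items() if w >= v)) > 0), default=0.0)
    return ess * mu(frozenset(q for q, v in f.items() if v > 0))

A = {2, 4}
mu = lambda B: 1.0 if A & set(B) else 0.0      # completely maxitive, density I_A
conj = lambda B: 1.0 if A <= set(B) else 0.0   # its conjugate

f = {1: 5.0, 2: 2.0, 3: 9.0, 4: 3.0, 5: 1.0}
print(shilkret(mu, f), reg_rectangular(mu, f))      # both equal sup_A f = 3
print(shilkret(conj, f), reg_rectangular(conj, f))  # both equal inf_A f = 2
```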

The second part of Theorem 2.15 justifies the use of the expression "density function" for $\mu^{\wedge}$ (where $\mu$ is a completely maxitive measure on a set $\Omega$) when regular integrals are considered, since $\mu^{\wedge}$ is the "Radon-Nikodym derivative" of $\mu$ with respect to the completely maxitive measure $\nu$ on $\Omega$ defined by $\nu^{\wedge}=1$. In fact, for all regular integrals on classes $\mathcal M\ni\nu$ we have

$$\int_A\mu^{\wedge}\,\mathrm d\nu=\int\mu^{\wedge}\,I_A\,\mathrm d\nu=\sup\bigl(\mu^{\wedge}\,I_A\bigr)=\mu(A)\quad\text{for all }A\subseteq\Omega,$$

where the left-hand side of the first equality is the usual abbreviation for the right-hand side.

Geometrically, if we consider the rectangles lying in the first quadrant of the (extended) Cartesian plane and having two sides on the coordinate axes, then $F_{\mathrm S}(\varphi)$ is the area of the largest rectangle lying under the graph of $\varphi$ (if such a rectangle exists), while $F_{\mathrm r}(\varphi)$ is the area of the smallest rectangle containing the portion of the graph of $\varphi$ not lying on the coordinate axes. From this geometrical standpoint, the most natural calibrated, bihomogeneous, monotonic functional on $\mathcal{NI}_{\overline{\mathbb P}}$ is probably the one assigning to each function the area under its graph: that is, the functional $F_{\mathrm C}:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ defined by

$$F_{\mathrm C}(\varphi)=\int_0^{\infty}\varphi(x)\,\mathrm dx\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}},$$

where the right-hand side of the equality is the Lebesgue integral of $\varphi|_{\mathbb P}$ (with respect to the Lebesgue measure on $\mathbb P$: since $\varphi$ is nonincreasing, it is Borel measurable), which corresponds to the improper Riemann integral when $\varphi|_{\mathbb P}$ is finite (in this case, $\varphi$ is Riemann integrable on every bounded, closed interval $I\subset\mathbb P$, since $\varphi|_I$ is bounded and nonincreasing), while the integral is infinite when $\varphi|_{\mathbb P}$ is not finite.

The Choquet integral is the integral defined by $F_{\mathrm C}$ on the class of all monotonic measures: it associates the value

$$\int^{\mathrm C}f\,\mathrm d\mu=\int_0^{\infty}\mu\{f>x\}\,\mathrm dx$$

to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f:\Omega\to\overline{\mathbb P}$. The Choquet integral is a well-known nonadditive integral, which was introduced by Choquet (1954) and then developed by many authors (see for example Denneberg, 1994); it is regular and support-based on the class of all monotonic measures, since $F_{\mathrm C}$ is 0-independent, bihomogeneous, and monotonic.
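On a finite space the defining integral is a finite sum of rectangle areas, and for an additive $\mu$ it reduces to the Lebesgue integral (as recalled in Subsection 2.2.6 below). A minimal sketch with invented weights:

```python
def choquet(mu, f):
    """Choquet integral on a finite space: sum of (hi - lo) * mu{f > lo}
    over consecutive levels, which evaluates the defining integral exactly."""
    levels = sorted(set(f.values()) | {0.0})
    return sum((hi - lo) * mu(frozenset(q for q, v in f.items() if v > lo))
               for lo, hi in zip(levels, levels[1:]))

w = {"a": 0.5, "b": 0.3, "c": 0.2}
additive = lambda A: sum(w[q] for q in A)
f = {"a": 1.0, "b": 2.0, "c": 4.0}

lebesgue = sum(w[q] * f[q] for q in w)
print(choquet(additive, f), lebesgue)  # equal: for an additive mu the Choquet
                                       # integral reduces to the Lebesgue integral
```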

Example 2.16. Let $\mu$ and $l_d$ be the measure and the functions on $[0,1]$ defined in Examples 2.1 and 2.7, respectively. The results of Example 2.11 imply that

$$\int^{\mathrm C}l_d\,\mathrm d\mu=\int_0^{\infty}\mu\{l_d>x\}\,\mathrm dx=\begin{cases}(1-d)^2 & \text{if }0\le d\le\tfrac12,\\[2pt] \tfrac13\bigl[d^3+5\,(1-d)^3\bigr] & \text{if }\tfrac12\le d\le1;\end{cases}$$

this integral is plotted (as a function of $d\in[0,1]$) in the first diagram of Figure 2.1 (solid line). The value $d\approx0.691$ minimizing $\int^{\mathrm C}l_d\,\mathrm d\mu$ is the estimate obtained by applying the "MPL* criterion" of Cattaneo (2005) to the estimation problem of Example 1.1, when $n=1$ and $X=1$ (compare with Example 2.7).

In the second diagram of Figure 2.1, the area under the solid line (that is, under the graph of $\mu\{l_d>\cdot\}$) corresponds to the Choquet integral of $l_d$ with respect to $\mu$, while the areas of the dashed and dotted rectangles correspond to the Shilkret integral and to the (regularized) rectangular integral, respectively. $\Diamond$
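A numerical check of this computation. The code assumes the reading of Examples 2.1 and 2.7 used above — $\mu$ completely maxitive on $[0,1]$ with density $q\mapsto q$ and $l_d(q)=(q-d)^2$; since those examples are not reproduced in this excerpt, both are assumptions.

```python
import math

def dist(d, x):
    """mu{l_d > x}: sup of the density q over {q in [0,1] : (q - d)**2 > x}."""
    r = math.sqrt(x)
    if d + r < 1.0:   # the piece (d + r, 1] is nonempty, where the density sup is 1
        return 1.0
    if r < d:         # only the piece [0, d - r) remains, with density sup d - r
        return d - r
    return 0.0

def choquet(d, n=20000):
    """Midpoint Riemann sum for the integral of mu{l_d > x} over x."""
    top = max(d, 1.0 - d) ** 2   # ess sup of l_d; the distribution vanishes beyond it
    h = top / n
    return sum(dist(d, (k + 0.5) * h) for k in range(n)) * h

def closed_form(d):
    return (1 - d) ** 2 if d <= 0.5 else (d ** 3 + 5 * (1 - d) ** 3) / 3

best = min((closed_form(k / 1000), k / 1000) for k in range(1001))[1]
print(best)  # ~ 0.691, i.e. sqrt(5) / (1 + sqrt(5))
```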

All the regular integrals (on regular classes $\mathcal M$ of monotonic measures) that we have considered so far are support-based (when $\mathcal M$ is closed under restrictions). The next example shows that there are regular integrals (on classes $\mathcal M$ closed under restrictions) that are not support-based.

Example 2.17. Consider the functional $F:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ defined by

$$F(\varphi)=\max\Bigl\{F_{\mathrm S}(\varphi),\ \inf\{\varphi=0\}\,\min\bigl\{\tfrac12\,\varphi(0),\ \sup\nolimits_{\mathbb P}\varphi\bigr\}\Bigr\}\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

It can be easily proved that $F$ is calibrated, bihomogeneous, and monotonic, but not 0-independent. Therefore the integral defined by $F$ on the class of all monotonic measures is regular, but not support-based. $\Diamond$

The hypograph of a function $\varphi:\overline{\mathbb P}\to\overline{\mathbb P}$ is the set

$$H(\varphi)=\{(x,y)\in\overline{\mathbb P}^2:y\le\varphi(x)\}.$$

The functional $F_{\mathrm C}$ assigns to each $\varphi\in\mathcal{NI}_{\overline{\mathbb P}}$ the value $\nu_{\mathrm C}[H(\varphi)]$, where $\nu_{\mathrm C}$ is the countably additive measure on (the Borel sets of) $\overline{\mathbb P}^2$ such that $\nu_{\mathrm C}(\overline{\mathbb P}^2\setminus\mathbb P^2)=0$, and $\nu_{\mathrm C}|_{\mathbb P^2}$ is the Lebesgue measure on $\mathbb P^2$. It can be easily proved that, in analogy with the Choquet integral, each integral respecting distributional dominance on some class of monotonic measures can be represented by some (not unique) monotonic measure $\nu$ on $\overline{\mathbb P}^2$, in the sense that it can be written as

$$\int f\,\mathrm d\mu=\nu[H(\mu\{f>\cdot\})].\qquad(2.4)$$

For example, the Shilkret integral and the regularized rectangular integral are respectively represented by the measures $\nu_{\mathrm S}$ and $\nu_{\mathrm r}$ on $\overline{\mathbb P}^2$ defined by

$$\nu_{\mathrm S}(A)=\sup_{(x,y)\in A}x\,y\qquad\text{and}\qquad\nu_{\mathrm r}(A)=\Bigl(\sup_{(x,y)\in A\cap\mathbb P^2}x\Bigr)\Bigl(\sup_{(x,y)\in A\cap\mathbb P^2}y\Bigr)$$

for all $A\subseteq\overline{\mathbb P}^2$.

Conversely, if a monotonic measure $\nu$ on (the Borel sets of) $\overline{\mathbb P}^2$ satisfies

$$\nu\bigl[H(y\,I_{[0,1]}+y'\,I_{\{0\}})\bigr]=y\quad\text{for all }y,y'\in\overline{\mathbb P},\qquad(2.5)$$

then the equality (2.4) defines an integral respecting distributional dominance on the class of all monotonic measures, since the indicator property is implied by the condition (2.5). For instance, with the measure $\nu$ on $\overline{\mathbb P}^2$ defined by

$$\nu(A)=\sup\nolimits_{A\cap(\mathbb P\times\overline{\mathbb P})}\omega\quad\text{for all }A\subseteq\overline{\mathbb P}^2,$$

where $\omega$ is a nonnegative, extended real-valued function on $\mathbb P\times\overline{\mathbb P}$ that is nondecreasing in both arguments and that fulfills the conditions (2.2) and (2.3), we obtain the generalized Sugeno integral corresponding to $\omega$. Imaoka (1997) introduced a class of integrals obtained in this way using a particular kind of countably additive measures. Thanks to the Caratheodory extension theorem, it can be proved that the Choquet integral is the only homogeneous integral that can be represented by a countably additive measure on (the Borel sets of) $\overline{\mathbb P}^2$, although this measure is not unique. That is, if we call "generalized Imaoka integrals" the integrals representable by countably additive measures on (the Borel sets of) $\overline{\mathbb P}^2$, then what distinguishes the Choquet integral among the generalized Imaoka integrals is homogeneity, which is also the distinguishing feature of the Shilkret integral among the generalized Sugeno integrals.

The left-continuous pseudo-inverse of a function $\varphi\in\mathcal{NI}_{\overline{\mathbb P}}$ is the function $\varphi^{\vee}\in\mathcal{NI}_{\overline{\mathbb P}}$ defined by

$$\varphi^{\vee}(x)=\sup\{\varphi>x\}=\inf\{\varphi\le x\}\quad\text{for all }x\in\overline{\mathbb P}.$$

It can be easily proved that $(\varphi^{\vee})^-=\varphi^{\vee}$, and that $H(\varphi^{\vee})$ is the mirror image of $H(\varphi^-)$ with respect to the diagonal $y=x$ of $\overline{\mathbb P}^2$. Hence, if a support-based, regular integral on the class of all monotonic measures is represented by a measure $\nu$ on (the Borel sets of) $\overline{\mathbb P}^2$ that is invariant with respect to the reflection about the diagonal $y=x$, then the corresponding functional $F$ on $\mathcal{NI}_{\overline{\mathbb P}}$ satisfies

$$F(\varphi)=F(\varphi^-)=\nu[H(\varphi^-)]=\nu[H(\varphi^{\vee})]=F(\varphi^{\vee})\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

A regular integral on an exhaustive, regular class of monotonic measures is said to be symmetric if the corresponding functional $F:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ satisfies

$$F(\varphi)=F(\varphi^{\vee})\quad\text{for all }\varphi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

The Choquet integral, the Shilkret integral, and the regularized rectangular integral are symmetric, since the measures $\nu_{\mathrm C}$, $\nu_{\mathrm S}$, and $\nu_{\mathrm r}$ on (the Borel sets of) $\overline{\mathbb P}^2$ are invariant with respect to the reflection about the diagonal $y=x$. The next theorem illustrates an interesting consequence of the symmetry of an integral.

Theorem 2.18. All symmetric, regular integrals on exhaustive, regular classes $\mathcal M$ of monotonic measures are support-based (if $\mathcal M$ is closed under restrictions) and have the following property: if $\mu\in\mathcal M$ is a completely maxitive measure on a set $\Omega$, and $f:\Omega\to\overline{\mathbb P}$ is a function, then, denoting by $F$ the functional on $\mathcal{NI}_{\overline{\mathbb P}}$ corresponding to the integral, and by $\sup_{\{\mu^{\wedge}>\,\cdot\}}f$ the function $x\mapsto\sup_{\{\mu^{\wedge}>x\}}f$ on $\overline{\mathbb P}$, we have

$$\int f\,\mathrm d\mu=F\bigl(\sup\nolimits_{\{\mu^{\wedge}>\,\cdot\}}f\bigr);$$

moreover, if $\mu$ is finite, then denoting by $\inf_{\{\mu^{\wedge}>\mu(\Omega)-\,\cdot\}}f$ the function $x\mapsto\inf_{\{\mu^{\wedge}>\mu(\Omega)-x\}}f$ on $\overline{\mathbb P}$, we have also

$$\int f\,\mathrm d\bar\mu=F\bigl(I_{[0,\mu(\Omega)]}\,\inf\nolimits_{\{\mu^{\wedge}>\mu(\Omega)-\,\cdot\}}f\bigr).$$

Proof. The integral is support-based, because $\varphi|_{(0,\infty]}=\psi|_{(0,\infty]}$ implies $\varphi^{\vee}=\psi^{\vee}$, and therefore $F$ is 0-independent. If we define $\varphi=\mu\{f>\cdot\}$ and $\psi=\sup_{\{\mu^{\wedge}>\,\cdot\}}f$, and we prove that $\varphi^{\vee}=\psi^-$, then we can use Theorem 2.13 and the symmetry of the integral to obtain the desired result

$$\int f\,\mathrm d\mu=F(\varphi)=F(\varphi^{\vee})=F(\psi^-)=F(\psi).$$

The equality $\varphi^{\vee}=\psi^-$ can be proved as follows: for all $x,y\in\overline{\mathbb P}$ we have

$$\varphi^{\vee}(x)\le y\ \iff\ \forall\,y'\in(y,\infty]\quad \mu\{f>y'\}\le x$$
$$\iff\ \forall\,y'\in(y,\infty]\ \ \exists\,x'\in[0,x)\quad \{f>y'\}\cap\{\mu^{\wedge}>x'\}=\emptyset$$
$$\iff\ \forall\,y'\in(y,\infty]\ \ \exists\,x'\in[0,x)\quad \sup\nolimits_{\{\mu^{\wedge}>x'\}}f\le y'$$
$$\iff\ \psi^-(x)\le y.$$

The last statement of the theorem can be proved in a similar way: defining $\varphi=\bar\mu\{f>\cdot\}$ and $\psi=I_{[0,\mu(\Omega)]}\,\inf_{\{\mu^{\wedge}>\mu(\Omega)-\,\cdot\}}f$, it suffices to show that $\varphi^{\vee}=\psi^-$. This can be done as follows: for all $x,y\in\overline{\mathbb P}$ we have

$$\varphi^{\vee}(x)>y\ \iff\ \exists\,y'\in(y,\infty]\quad \bar\mu\{f>y'\}>x$$
$$\iff\ \exists\,y'\in(y,\infty]\quad \mu\{f\le y'\}<\mu(\Omega)-x$$
$$\iff\ \exists\,y'\in(y,\infty]\ \ \forall\,x'\in[0,x)\quad \{f\le y'\}\cap\{\mu^{\wedge}>\mu(\Omega)-x'\}=\emptyset\ \text{ and }\ x<\mu(\Omega)$$
$$\iff\ \exists\,y'\in(y,\infty]\ \ \forall\,x'\in[0,x)\quad I_{[0,\mu(\Omega)]}(x')\,\inf\nolimits_{\{\mu^{\wedge}>\mu(\Omega)-x'\}}f>y'$$
$$\iff\ \psi^-(x)>y.\qquad\square$$
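For the Shilkret integral the first equality can be checked directly on a grid. The sketch below uses invented data; the simple form $\int^{\mathrm S}f\,\mathrm d\mu=\sup_q\mu^{\wedge}(q)\,f(q)$ on completely maxitive measures is the one referred to later as Theorem 2.6.

```python
density = {"a": 0.3, "b": 0.8, "c": 1.0}   # density of a completely maxitive measure
f = {"a": 4.0, "b": 2.0, "c": 1.0}

# Shilkret integral with respect to the completely maxitive measure.
direct = max(density[q] * f[q] for q in density)

def psi(x):
    # The function x -> sup over {density > x} of f appearing in the theorem.
    vals = [f[q] for q in density if density[q] > x]
    return max(vals) if vals else 0.0

grid = [k / 10000 for k in range(10001)]
f_s = max(x * psi(x) for x in grid)   # Shilkret functional F_S applied to psi

print(direct, f_s)   # both ~ 1.6
```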

2.2.4 Subadditivity

Let $\mathcal M$ be a class of measures (each $\mu\in\mathcal M$ is a measure on some set $\Omega_\mu$). We have seen that if some measures in $\mathcal M$ are not finitely additive, then an integral on $\mathcal M$ cannot be additive; but if the measures in $\mathcal M$ are subadditive, then we have at least

$$\int(I_A+I_B)\,\mathrm d\mu\ \le\ \int I_A\,\mathrm d\mu+\int I_B\,\mathrm d\mu\quad\text{for all }\mu\in\mathcal M\text{ and all disjoint }A,B\subseteq\Omega_\mu.$$

A quasi-subadditive integral extends this property to all pairs of functions $f,g\in\overline{\mathbb P}^{\Omega_\mu}$ with disjoint supports (that is, $\{f>0\}\cap\{g>0\}=\emptyset$). An integral on $\mathcal M$ is said to be quasi-subadditive if

$$\int(f+g)\,\mathrm d\mu\ \le\ \int f\,\mathrm d\mu+\int g\,\mathrm d\mu\quad\text{for all }\mu\in\mathcal M\text{ and all }f,g\in\overline{\mathbb P}^{\Omega_\mu}\text{ with disjoint supports.}$$

When combined with the indicator property, quasi-subadditivity implies that the measures $\mu\in\mathcal M$ are subadditive, since $I_{A\cup B}=I_A+I_B$ for all disjoint $A,B\subseteq\Omega_\mu$. A functional $F:\mathcal{NI}_{\overline{\mathbb P}}\to\overline{\mathbb P}$ is said to be subadditive if

$$F(\varphi+\psi)\le F(\varphi)+F(\psi)\quad\text{for all }\varphi,\psi\in\mathcal{NI}_{\overline{\mathbb P}}.$$

Theorem 2.19. All regular integrals on exhaustive, regular classes $\mathcal M$ of subadditive, monotonic measures have the following property: if the corresponding functional on $\mathcal{NI}_{\overline{\mathbb P}}$ is subadditive, then the integral is support-based (if $\mathcal M$ is closed under restrictions) and quasi-subadditive.

Proof. The integral is support-based, since the corresponding functional $F$ on $\mathcal{NI}_{\overline{\mathbb P}}$ is 0-independent. In fact, if $\varphi|_{(0,\infty]}=\psi|_{(0,\infty]}$ and $\varphi(0)\le\psi(0)$, then $(\psi-\varphi)\in\mathcal{NI}_{\overline{\mathbb P}}$, and so there are a measure $\mu\in\mathcal M$ and a function $f$ on $\Omega_\mu$ such that $(\psi-\varphi)=\mu\{f>\cdot\}$; but $(\psi-\varphi)$ vanishes on $(0,\infty]$, so that Theorem 2.15 implies $\int f\,\mathrm d\mu\le\int^{\bar{\mathrm r}}f\,\mathrm d\mu=0$, and therefore

$$F(\varphi)\le F(\psi)\le F(\varphi)+F(\psi-\varphi)=F(\varphi)+\int f\,\mathrm d\mu=F(\varphi).$$

The integral is quasi-subadditive, because if $f$ and $g$ have disjoint supports, then $\{f+g>x\}=\{f>x\}\cup\{g>x\}$, and therefore

$$\int(f+g)\,\mathrm d\mu\ \le\ F\bigl(\mu\{f>\cdot\}+\mu\{g>\cdot\}\bigr)\ \le\ \int f\,\mathrm d\mu+\int g\,\mathrm d\mu,$$

since $\mu$ is subadditive. $\square$

Theorem 2.19 implies that the Shilkret integral and the Choquet integral are quasi-subadditive on the class of all monotonic, subadditive measures, since the functionals $F_{\mathrm S}$ and $F_{\mathrm C}$ are subadditive. By contrast, the next example shows that the regularized rectangular integral can be quasi-subadditive only on very limited classes of monotonic, subadditive measures.

Example 2.20. Let $\mu$ be a monotonic, subadditive measure on a set $\Omega$. If there are two disjoint sets $A,B\subseteq\Omega$ such that $0<\mu(A)<\tfrac12\,\mu(B)<\infty$, then the two functions $f=\mu(A)\,I_B$ and $g=\mu(B)\,I_A$ on $\Omega$ have disjoint supports, but

$$\int^{\bar{\mathrm r}}(f+g)\,\mathrm d\mu=\mu(A\cup B)\,\mu(B)>2\,\mu(A)\,\mu(B)=\int^{\bar{\mathrm r}}f\,\mathrm d\mu+\int^{\bar{\mathrm r}}g\,\mathrm d\mu.\qquad\Diamond$$
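A concrete instance with invented numbers ($\mu$ additive, hence subadditive, with $\mu(A)=0.1<\tfrac12\,\mu(B)=0.2$):

```python
def dist(mu, f, x):
    return mu(frozenset(q for q, v in f.items() if v > x))

def reg_rectangular(mu, f):
    # (ess_mu sup f) * sup_{x>0} mu{f > x}; on a finite space the sup is mu{f > 0}.
    ess = max((v for v in f.values()
               if mu(frozenset(q for q, u in f.items() if u >= v)) > 0), default=0.0)
    return ess * dist(mu, f, 0.0)

w = {"a": 0.1, "b": 0.4}
mu = lambda A: sum(w[q] for q in A)   # additive, hence subadditive; A = {a}, B = {b}

f = {"a": 0.0, "b": 0.1}              # mu(A) I_B
g = {"a": 0.4, "b": 0.0}              # mu(B) I_A
s = {q: f[q] + g[q] for q in f}

lhs = reg_rectangular(mu, s)
rhs = reg_rectangular(mu, f) + reg_rectangular(mu, g)
print(lhs, rhs)   # 0.2 > 0.08: quasi-subadditivity fails
```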

In fact, as stated by the next two theorems, the Shilkret integral and the Choquet integral satisfy the following stronger property on suitable classes of monotonic, subadditive measures. An integral on $\mathcal M$ is said to be subadditive if

$$\int(f+g)\,\mathrm d\mu\ \le\ \int f\,\mathrm d\mu+\int g\,\mathrm d\mu\quad\text{for all }\mu\in\mathcal M\text{ and all }f,g\in\overline{\mathbb P}^{\Omega_\mu}.$$

Theorem 2.21. The Shilkret integral is subadditive on a class of monotonic measures if and only if these are finitely maxitive.

Proof. For the "if" part, we can adapt the proof of Shilkret (1971) to the finitely maxitive measures. It suffices to show

$$x\,\mu\{f+g>x\}\ \le\ \int^{\mathrm S}f\,\mathrm d\mu+\int^{\mathrm S}g\,\mathrm d\mu\quad\text{for all }x\in\overline{\mathbb P},$$

since then the desired result is obtained by taking the supremum (over all $x\in\overline{\mathbb P}$) of the left-hand side of the inequality. If one of the two integrals of the right-hand side is infinite, the inequality clearly holds. If both are finite, observe that

$$\{f+g>x\}\ \subseteq\ \{f>\lambda\,x\}\cup\{g>(1-\lambda)\,x\}\quad\text{for all }\lambda\in(0,1).$$

If one of the two integrals is 0, say $\int^{\mathrm S}g\,\mathrm d\mu=0$, then $\mu\{g>x'\}=0$ for all $x'\in\mathbb P$, and therefore

$$x\,\mu\{f+g>x\}=\sup_{\lambda\in(0,1)}\lambda\,x\,\mu\{f+g>x\}\le\sup_{\lambda\in(0,1)}\lambda\,x\,\mu\{f>\lambda\,x\}\le\int^{\mathrm S}f\,\mathrm d\mu=\int^{\mathrm S}f\,\mathrm d\mu+\int^{\mathrm S}g\,\mathrm d\mu.$$

If both the integrals are positive and finite, then setting

$$\lambda=\frac{\int^{\mathrm S}f\,\mathrm d\mu}{\int^{\mathrm S}f\,\mathrm d\mu+\int^{\mathrm S}g\,\mathrm d\mu}$$

we obtain

$$x\,\mu\{f+g>x\}\le\max\bigl\{x\,\mu\{f>\lambda\,x\},\ x\,\mu\{g>(1-\lambda)\,x\}\bigr\}\le\max\Bigl\{\tfrac1\lambda\int^{\mathrm S}f\,\mathrm d\mu,\ \tfrac1{1-\lambda}\int^{\mathrm S}g\,\mathrm d\mu\Bigr\}=\int^{\mathrm S}f\,\mathrm d\mu+\int^{\mathrm S}g\,\mathrm d\mu.$$

For the "only if" part, assume that the Shilkret integral is subadditive on a class $\mathcal M$ of monotonic measures, let $\mu\in\mathcal M$ be a measure on $\Omega$, and let $A,B\subseteq\Omega$ be two disjoint sets. We have

$$\mu\bigl\{\mu(A\cup B)\,I_A+\mu(A)\,I_B>x\bigr\}=\begin{cases}\mu(A\cup B) & \text{if }0\le x<\mu(A),\\ \mu(A) & \text{if }\mu(A)\le x<\mu(A\cup B),\\ 0 & \text{if }\mu(A\cup B)\le x\le\infty;\end{cases}$$

and therefore

$$\int^{\mathrm S}\bigl[\mu(A\cup B)\,I_A+\mu(A)\,I_B\bigr]\,\mathrm d\mu=\mu(A\cup B)\,\mu(A).$$

From this result and the equality

$$\mu(A\cup B)\,I_{A\cup B}=\bigl[\mu(A\cup B)\,I_A+\mu(A)\,I_B\bigr]+\bigl[\mu(A\cup B)\,I_B-\mu(A)\,I_B\bigr],$$

using the subadditivity of the Shilkret integral and the property

$$\int^{\mathrm S}c\,I_C\,\mathrm d\mu=c\,\mu(C)\quad\text{for all }c\in\overline{\mathbb P}\text{ and all }C\subseteq\Omega,$$

we obtain the inequality

$$\mu(A\cup B)^2\ \le\ \mu(A\cup B)\,\mu(A)+\bigl[\mu(A\cup B)-\mu(A)\bigr]\,\mu(B),$$

which is equivalent to

$$\bigl[\mu(A\cup B)-\mu(A)\bigr]\,\bigl[\mu(A\cup B)-\mu(B)\bigr]\ \le\ 0.$$

Since both factors of the left-hand side of this inequality are nonnegative, one of them must vanish; that is, $\mu(A\cup B)=\max\{\mu(A),\mu(B)\}$. $\square$
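Both directions can be replayed numerically (hypothetical data): the "only if" construction exhibits a violation for an additive, non-maxitive measure, while random pairs never violate subadditivity for a finitely maxitive one.

```python
import random

def shilkret(mu, f):
    # Shilkret integral on a finite space: max over values v of v * mu{f >= v}.
    return max((v * mu(frozenset(q for q, w_ in f.items() if w_ >= v))
                for v in f.values()), default=0.0)

w = {"a": 0.4, "b": 0.6}
additive = lambda A: sum(w[q] for q in A)             # additive, not maxitive
maxitive = lambda A: max((w[q] for q in A), default=0.0)

# The construction from the "only if" part, for the additive measure:
uAB, uA = additive(frozenset("ab")), additive(frozenset("a"))
h1 = {"a": uAB, "b": uA}          # mu(AuB) I_A + mu(A) I_B
h2 = {"a": 0.0, "b": uAB - uA}    # [mu(AuB) - mu(A)] I_B
tot = {"a": uAB, "b": uAB}        # their sum, mu(AuB) I_{AuB}

lhs = shilkret(additive, tot)
rhs = shilkret(additive, h1) + shilkret(additive, h2)
print(lhs, rhs)                   # 1.0 > 0.76: subadditivity fails

# For the finitely maxitive measure, random pairs never violate subadditivity:
random.seed(0)
for _ in range(200):
    f = {q: random.uniform(0, 2) for q in w}
    g = {q: random.uniform(0, 2) for q in w}
    s = {q: f[q] + g[q] for q in w}
    assert shilkret(maxitive, s) <= shilkret(maxitive, f) + shilkret(maxitive, g) + 1e-9
```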

The next theorem is proved for example in Denneberg (1994, Chapter 6).

Theorem 2.22. The Choquet integral is subadditive on a class of monotonic measures if and only if these are 2-alternating.

2.2.5 Continuity

There is a correspondence between convergence theorems for regular integrals on regular classes $\mathcal M$ of monotonic measures, and continuity properties of the corresponding functionals on $\mathcal{DF}(\mathcal M)$. The next one is the convergence theorem corresponding to the continuity property implied by monotonicity and bihomogeneity.

Theorem 2.23. All regular integrals on regular classes $\mathcal M$ of monotonic measures have the following property: if the sequence $\mu_1,\mu_2,\ldots\in\mathcal M$ of measures on a set $\Omega$ converges uniformly to the measure $\mu\in\mathcal M$ on $\Omega$, the sequence $f_1,f_2,\ldots:\Omega\to\overline{\mathbb P}$ of functions converges uniformly to the function $f:\Omega\to\overline{\mathbb P}$, and one of the following two conditions is fulfilled:

$$\lim_{n\to\infty}\sup\nolimits_{\mathbb P}\mu_n\{f_n>\cdot\}=\sup\nolimits_{\mathbb P}\mu\{f>\cdot\}<\infty\quad\text{and}\quad\lim_{n\to\infty}\mathrm{ess}_{\mu_n}\sup f_n=\mathrm{ess}_{\mu}\sup f<\infty,\qquad(2.6)$$

$$\mu(\Omega)\ \text{and}\ \sup f\ \text{are finite, and the integral is quasi-subadditive},\qquad(2.7)$$

then

$$\lim_{n\to\infty}\int f_n\,\mathrm d\mu_n=\int f\,\mathrm d\mu.$$

Proof. Define

$$\varphi=\mu\{f>\cdot\},\quad\varphi_n=\mu_n\{f_n>\cdot\},\quad a=\sup\nolimits_{\mathbb P}\varphi,\quad a_n=\sup\nolimits_{\mathbb P}\varphi_n,$$
$$b=\mathrm{ess}_{\mu}\sup f=\sup\nolimits_{\mathbb P}\varphi^{\vee},\quad\text{and}\quad b_n=\mathrm{ess}_{\mu_n}\sup f_n=\sup\nolimits_{\mathbb P}(\varphi_n)^{\vee}.$$

First assume that condition (2.6) is fulfilled: then

$$\varepsilon_n=\max\bigl\{\|f_n-f\\|,\ \|\mu_n-\mu\|,\ |a_n-a|,\ |b_n-b|,\ \tfrac1n\bigr\}$$

is positive, and $\lim_{n\to\infty}\varepsilon_n=0$. If $a=0$ or $b=0$, then $\int f\,\mathrm d\mu=0$, and the desired result follows from

$$\int f_n\,\mathrm d\mu_n\ \le\ \int^{\bar{\mathrm r}}f_n\,\mathrm d\mu_n=a_n\,b_n.$$

If $a$ and $b$ are positive, and $n$ is sufficiently large, then

$$c_n=\min\left\{1-\sqrt{\varepsilon_n},\ \frac{\varphi_n(\sqrt{\varepsilon_n})}{a+\varepsilon_n},\ \frac{\varphi(\sqrt{\varepsilon_n})}{a_n+\varepsilon_n},\ \frac{(\varphi_n)^{\vee}(\sqrt{\varepsilon_n})}{b+\varepsilon_n},\ \frac{\varphi^{\vee}(\sqrt{\varepsilon_n})}{b_n+\varepsilon_n}\right\}$$

is well-defined and $c_n\in(0,1)$. We have $\lim_{n\to\infty}c_n=1$, since

$$\liminf_{n\to\infty}\varphi_n(\sqrt{\varepsilon_n})\ \ge\ \lim_{n\to\infty}\bigl(\mu\{f-\varepsilon_n>\sqrt{\varepsilon_n}\}-\varepsilon_n\bigr)=a,$$
$$\liminf_{n\to\infty}(\varphi_n)^{\vee}(\sqrt{\varepsilon_n})\ \ge\ \lim_{n\to\infty}\sup\{x\in\mathbb P:\mu\{f-\varepsilon_n>x\}-\varepsilon_n\ge\sqrt{\varepsilon_n}\}=b.$$

If we show that for $n$ sufficiently large and all $x\in\overline{\mathbb P}$

$$c_n\,\varphi(x/c_n)\ \le\ \varphi_n(x)\ \le\ \tfrac1{c_n}\,\varphi(c_n\,x),\qquad(2.8)$$

then the desired result is obtained by using Theorem 2.13 and letting $n$ tend to infinity. If $x=0$, then (2.8) reduces to $c_n\,\mu(\Omega)\le\mu_n(\Omega)\le\tfrac1{c_n}\,\mu(\Omega)$, which is valid when $\varepsilon_n<\mu(\Omega)^2$, since then

$$c_n\le1-\sqrt{\varepsilon_n}\le1-\tfrac{\varepsilon_n}{\mu(\Omega)}\qquad\text{and}\qquad\tfrac1{c_n}\ge\tfrac1{1-\sqrt{\varepsilon_n}}\ge1+\sqrt{\varepsilon_n}\ge1+\tfrac{\varepsilon_n}{\mu(\Omega)}.$$

If $x\in(0,\sqrt{\varepsilon_n}]$, then (2.8) can be obtained as follows:

$$c_n\,\varphi(x/c_n)\le c_n\,a\le\varphi_n(\sqrt{\varepsilon_n})\le\varphi_n(x)\le a_n\le\tfrac1{c_n}\,\varphi(\sqrt{\varepsilon_n})\le\tfrac1{c_n}\,\varphi(c_n\,x).$$

If $x\in(\sqrt{\varepsilon_n},(\varphi_n)^{\vee}(\sqrt{\varepsilon_n}))$ and $\varepsilon_n\le1$, then $\varphi_n(x)\ge\sqrt{\varepsilon_n}$, and (2.8) follows from

$$\varphi(x+\varepsilon_n)-\varepsilon_n\ \le\ \varphi_n(x)\ \le\ \varphi(x-\varepsilon_n)+\varepsilon_n,$$

because for $y\ge\sqrt{\varepsilon_n}$ the inequalities $c_n\,y\le y-\varepsilon_n$ and $\tfrac1{c_n}\,y\ge y+\varepsilon_n$ are a consequence of $c_n\le1-\sqrt{\varepsilon_n}\le\tfrac1{1+\sqrt{\varepsilon_n}}$. In fact, we have

$$c_n\,\varphi(x/c_n)\le c_n\,\varphi(x+\varepsilon_n)\le c_n\,[\varphi_n(x)+\varepsilon_n]\le\varphi_n(x),$$
$$\tfrac1{c_n}\,\varphi(c_n\,x)\ge\tfrac1{c_n}\,\varphi(x-\varepsilon_n)\ge\tfrac1{c_n}\,[\varphi_n(x)-\varepsilon_n]\ge\varphi_n(x).$$

If $x\in[(\varphi_n)^{\vee}(\sqrt{\varepsilon_n}),\infty]$, then the first inequality of (2.8) is a direct consequence of $x/c_n>b$, while the second one is obvious only if $x>b_n$, but it follows from $\varphi(c_n\,b_n)\ge\sqrt{\varepsilon_n}\ge\varphi_n(x)$ if $x\le b_n$.

Now assume that condition (2.7) is fulfilled: then

$$\varepsilon_n=\max\bigl\{\|f_n-f\|,\ \|\mu_n-\mu\|,\ \tfrac1n\bigr\}$$

is positive, and $\lim_{n\to\infty}\varepsilon_n=0$. Define

$$f'=\min\{f,b\}\qquad\text{and}\qquad f'_n=\min\{f_n,\ b+2\,\varepsilon_n\}\,I_{\{f>\varepsilon_n\}}.$$

Since $\mu\{f'>\cdot\}=\varphi$, the part of the theorem already proved implies

$$\lim_{n\to\infty}\int f'_n\,\mathrm d\mu_n=\int f\,\mathrm d\mu,$$

because $\|f'_n-f'\|\le2\,\varepsilon_n$,

$$\sup\nolimits_{\mathbb P}\mu_n\{f'_n>\cdot\}\le\mu_n\{f'_n>0\}\le\mu_n\{f>\varepsilon_n\}\le\varphi(\varepsilon_n)+\varepsilon_n,$$
$$\sup\nolimits_{\mathbb P}\mu_n\{f'_n>\cdot\}\ge\mu_n\{f'_n>\varepsilon_n\}\ge\mu_n\{f>2\,\varepsilon_n\}\ge\varphi(2\,\varepsilon_n)-\varepsilon_n,$$
$$\mathrm{ess}_{\mu_n}\sup f'_n\le b+2\,\varepsilon_n,$$
$$\mathrm{ess}_{\mu_n}\sup f'_n\ge\sup\{x\in\mathbb P:\mu_n\{f'_n>x\}\ge\varepsilon_n\}\ge\sup\{x\in\mathbb P:\varphi(x+\varepsilon_n)\ge2\,\varepsilon_n\}\ge\varphi^{\vee}(2\,\varepsilon_n)-\varepsilon_n.$$

So, defining $A=\{f>\varepsilon_n\}$ and $B=\{f_n\le b+2\,\varepsilon_n\}$, the desired result follows from

$$\int f'_n\,\mathrm d\mu_n\ \le\ \int f_n\,\mathrm d\mu_n\ \le\ \int f_n\,I_{A\cap B}\,\mathrm d\mu_n+\int f_n\,I_{\Omega\setminus A}\,\mathrm d\mu_n+\int f_n\,I_{A\setminus B}\,\mathrm d\mu_n\ \le$$
$$\le\ \int f_n\,I_{A\cap B}\,\mathrm d\mu_n+\mu_n(\Omega)\,\sup\nolimits_{\Omega\setminus A}f_n+\mu_n(A\setminus B)\,\sup f_n\ \le$$
$$\le\ \int f'_n\,\mathrm d\mu_n+\bigl[\mu(\Omega)+\varepsilon_n\bigr]\,2\,\varepsilon_n+\varepsilon_n\,(\sup f+\varepsilon_n)$$

by letting $n$ tend to infinity. $\square$

Example 2.24. Let $\mu$, $\nu$, and $l_d$ be the measures and the functions on $[0,1]$ defined in Examples 2.1, 2.2, and 2.7, respectively. Since the mapping $d\mapsto l_d$ on $[0,1]$ is continuous (with respect to the supremum norm), and $\sup_{\mathbb P}\mu\{l_d>\cdot\}=1$ and $\mathrm{ess}_{\mu}\sup l_d=\max\{d^2,(1-d)^2\}$ for all $d\in[0,1]$, Theorem 2.23 implies that the function $d\mapsto\int l_d\,\mathrm d\mu$ on $[0,1]$ is continuous for all regular integrals on classes $\mathcal M\ni\mu$. It also implies that the function $d\mapsto\int l_d\,\mathrm d\nu$ on $[0,1]$ is continuous for all quasi-subadditive, regular integrals on classes $\mathcal M\ni\nu$. But since $\lim_{d\uparrow1}\sup_{\mathbb P}\nu\{l_d>\cdot\}\ne\sup_{\mathbb P}\nu\{l_1>\cdot\}$, Theorem 2.23 does not imply that $d\mapsto\int l_d\,\mathrm d\nu$ is continuous at 1 for all regular integrals on classes $\mathcal M\ni\nu$; in fact, the results of Example 2.14 show that $d\mapsto\int^{\bar{\mathrm r}}l_d\,\mathrm d\nu$ is not continuous at 1. $\Diamond$

2.2.6 Choquet Integral

An important property of the Choquet integral is the additivity of the corresponding functional $F_{\mathrm C}$ on $\mathcal{NI}_{\overline{\mathbb P}}$, implying in particular the additivity of the integral with respect to measures:

$$\int^{\mathrm C}f\,\mathrm d(\mu+\nu)=\int^{\mathrm C}f\,\mathrm d\mu+\int^{\mathrm C}f\,\mathrm d\nu,$$

for all pairs of monotonic measures $\mu,\nu$ on a set $\Omega$, and all functions $f:\Omega\to\overline{\mathbb P}$. The next theorem states another important property of the Choquet integral, but first we need some definitions.

Two functions $f,g:\Omega\to\overline{\mathbb P}$ on a set $\Omega$ are said to be comonotonic if

$$f(q)<f(q')\ \Rightarrow\ g(q)\le g(q')\quad\text{for all }q,q'\in\Omega.$$

Let $\mathcal M$ be a class of measures (each $\mu\in\mathcal M$ is a measure on some set $\Omega_\mu$). An integral on $\mathcal M$ is said to be comonotonic additive if

$$\int(f+g)\,\mathrm d\mu=\int f\,\mathrm d\mu+\int g\,\mathrm d\mu\quad\text{for all }\mu\in\mathcal M\text{ and all comonotonic }f,g\in\overline{\mathbb P}^{\Omega_\mu}.$$

Theorem 2.25. The Choquet integral is comonotonic additive on the class of all monotonic measures.

Proof. The additivity of the Choquet integral for comonotonic, finite functions is proved for instance in Denneberg (1994, Proposition 5.1); we can extend this result as follows. If $f$ and $g$ are comonotonic, then there are no $q,q'$ such that $q\in\{f<\infty\}\cap\{g=\infty\}$ and $q'\in\{f=\infty\}\cap\{g<\infty\}$, and therefore the inclusion $A=\{f<\infty\}\subseteq\{g<\infty\}$ can be assumed without loss of generality. If $\mu(\Omega_\mu\setminus A)>0$, then the desired result is obvious, since both the integrals of $f$ and of $(f+g)$ are infinite. If $\mu(\Omega_\mu\setminus A)=0$, then define $b=\sup_A(f+g)$, and note that $\sup_A g\le\inf_{\Omega_\mu\setminus A}g$, since $f$ and $g$ are comonotonic. Proposition 4.5 of Denneberg (1994) implies that there are two continuous, nondecreasing functions $u,v:[0,b]\to\overline{\mathbb P}$ such that

$$u+v=\mathrm{id}_{[0,b]},\qquad f|_A=u\circ(f|_A+g|_A),\qquad\text{and}\qquad g|_A=v\circ(f|_A+g|_A).$$

Define $h=\min\{f+g,\,b\}$: Proposition 4.1 of the same book implies that

$$\int^{\mathrm C}(u\circ h)\,\mathrm d\mu+\int^{\mathrm C}(v\circ h)\,\mathrm d\mu=\int^{\mathrm C}h\,\mathrm d\mu.$$

The desired result follows from this equality, since $u\circ h=\min\{f,\sup_A f\}$ and $v\circ h=\min\{g,\sup_A g\}$, and for all functions $f':\Omega_\mu\to\overline{\mathbb P}$

$$\int^{\mathrm C}\min\{f',\sup\nolimits_A f'\}\,\mathrm d\mu=\int^{\mathrm C}f'\,\mathrm d\mu,$$

because $\mu(\Omega_\mu\setminus A)=0$, and therefore the distribution functions of $f'$ and $\min\{f',\sup_A f'\}$ are equal. $\square$
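A numerical illustration on a finite space (the monotonic measure and the comonotonic pair below are invented):

```python
def choquet(mu, f):
    levels = sorted(set(f.values()) | {0.0})
    return sum((hi - lo) * mu(frozenset(q for q, v in f.items() if v > lo))
               for lo, hi in zip(levels, levels[1:]))

# A monotonic, non-additive measure on {a, b, c} (made up, checked for monotonicity):
vals = {frozenset(): 0.0, frozenset("a"): 0.2, frozenset("b"): 0.1,
        frozenset("c"): 0.3, frozenset("ab"): 0.5, frozenset("ac"): 0.6,
        frozenset("bc"): 0.4, frozenset("abc"): 1.0}
mu = lambda A: vals[frozenset(A)]

# f and g are comonotonic: both are nondecreasing along a -> b -> c.
f = {"a": 0.0, "b": 1.0, "c": 3.0}
g = {"a": 2.0, "b": 2.0, "c": 5.0}
s = {q: f[q] + g[q] for q in f}

lhs = choquet(mu, s)
rhs = choquet(mu, f) + choquet(mu, g)
print(lhs, rhs)   # equal, as Theorem 2.25 asserts
```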

Schmeidler (1986) proved that for finite, monotonic measures and bounded functions, the Choquet integral is characterized by monotonicity and comonotonic additivity. But the next example shows that this characterization fails when the measures or the functions are unbounded (see also Wakker, 1993).

Example 2.26. Consider the integral that associates the value

$$\int f\,\mathrm d\mu=\begin{cases}\int^{\mathrm C}f\,\mathrm d\mu & \text{if }\int^{\bar{\mathrm r}}f\,\mathrm d\mu<\infty,\\ \infty & \text{otherwise}\end{cases}\qquad(2.9)$$

to each pair consisting of a monotonic measure $\mu$ on a set $\Omega$ and of a function $f:\Omega\to\overline{\mathbb P}$. It can be easily proved that this integral is regular and symmetric on the class of all monotonic measures, and subadditive on the class of all 2-alternating measures. If the functions $f,g:\Omega\to\overline{\mathbb P}$ are comonotonic, then for all $x\in\overline{\mathbb P}$ one of the two sets $\{f>x\}$, $\{g>x\}$ is a subset of the other, and therefore

$$\mu\{f+g>x\}\ \le\ \mu\bigl(\{f>\tfrac x2\}\cup\{g>\tfrac x2\}\bigr)=\max\bigl\{\mu\{f>\tfrac x2\},\ \mu\{g>\tfrac x2\}\bigr\}.$$

Hence, if $\int^{\bar{\mathrm r}}f\,\mathrm d\mu$ and $\int^{\bar{\mathrm r}}g\,\mathrm d\mu$ are finite, then so is $\int^{\bar{\mathrm r}}(f+g)\,\mathrm d\mu$; and thus the integral defined by (2.9) is comonotonic additive on the class of all monotonic measures.

It is interesting to note that this integral can be represented (in the sense considered at the end of Subsection 2.2.3) by a finitely additive measure on (the Borel sets of) $\overline{\mathbb P}^2$. That is, homogeneity does not suffice to characterize the Choquet integral among the ones representable by finitely additive measures on (the Borel sets of) $\overline{\mathbb P}^2$; and the countable additivity of the measure is not necessary to obtain the comonotonic additivity of the integral (see also Klement, Mesiar, and Pap, 2004). $\Diamond$

An integral on $\mathcal M$ is said to be translation equivariant if

$$\int(f+c)\,\mathrm d\mu=\int f\,\mathrm d\mu+c\,\mu(\Omega_\mu)\quad\text{for all }\mu\in\mathcal M,\text{ all }c\in\mathbb P,\text{ and all }f\in\overline{\mathbb P}^{\Omega_\mu}.$$

Since two functions are certainly comonotonic when one is constant, all comonotonic additive, homogeneous integrals are translation equivariant. By contrast, the next example shows that a maxitive, homogeneous integral can be translation equivariant only on limited classes of finitely maxitive measures.

Example 2.27. Let $\mu$ be a finitely maxitive, finite measure on a set $\Omega$. If there is a set $A\subseteq\Omega$ such that $0<\mu(A)<\mu(\Omega)$, then all maxitive, homogeneous integrals on classes $\mathcal M\ni\mu$ satisfy

$$\int(I_A+1)\,\mathrm d\mu=\max\{2\,\mu(A),\ \mu(\Omega\setminus A)\}<\mu(A)+\mu(\Omega)=\int I_A\,\mathrm d\mu+\mu(\Omega).\qquad\Diamond$$

The equality satisfied by a translation equivariant integral on $\mathcal M$ can also be considered as a way to define the integral of functions $f:\Omega_\mu\to\overline{\mathbb R}$ with respect to measures $\mu\in\mathcal M$, when $\inf f$ and $\mu(\Omega_\mu)$ are finite. We obtain

$$\int f\,\mathrm d\mu=\int(f-\inf f)\,\mathrm d\mu+\mu(\Omega_\mu)\,\inf f;\qquad(2.10)$$

and if the translation equivariant integral on $\mathcal M$ is regular, subadditive, or comonotonic additive, then so is the extended integral (when the definition of distribution function is extended from $\overline{\mathbb P}$ to $\overline{\mathbb R}$). It is important to note that in general the resulting extended integral does not satisfy the equality $\int(-f)\,\mathrm d\mu=-\int f\,\mathrm d\mu$: for example, if $A\subseteq\Omega_\mu$ is a set, then we have

$$\int(-I_A)\,\mathrm d\mu=\int(1-I_A)\,\mathrm d\mu-\mu(\Omega_\mu)=\int I_{\Omega_\mu\setminus A}\,\mathrm d\mu-\mu(\Omega_\mu)=-\int I_A\,\mathrm d\bar\mu.$$

The next theorem (proved for instance in Denneberg, 1994, Proposition 5.1) states that in the case of the Choquet integral this relation holds for all bounded functions (not only for the indicator functions).

Theorem 2.28. If $\mu$ is a finite, monotonic measure on a set $\Omega$, and the function $f:\Omega\to\mathbb R$ is bounded, then the extension of the Choquet integral by means of equality (2.10) satisfies

$$\int^{\mathrm C}(-f)\,\mathrm d\mu=-\int^{\mathrm C}f\,\mathrm d\bar\mu.$$
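A numerical check of Theorem 2.28 (sketch with an invented finite measure; the conjugate is computed as $\bar\mu(A)=\mu(\Omega)-\mu(\Omega\setminus A)$, the definition used in the proofs above):

```python
def choquet(mu, f):
    """Choquet integral of a nonnegative f on a finite space."""
    levels = sorted(set(f.values()) | {0.0})
    return sum((hi - lo) * mu(frozenset(q for q, v in f.items() if v > lo))
               for lo, hi in zip(levels, levels[1:]))

def choquet_ext(mu, omega, f):
    """Extension (2.10): integral of (f - inf f) plus mu(Omega) * inf f."""
    m = min(f.values())
    shifted = {q: v - m for q, v in f.items()}
    return choquet(mu, shifted) + mu(frozenset(omega)) * m

omega = "abc"
vals = {frozenset(): 0.0, frozenset("a"): 0.2, frozenset("b"): 0.1,
        frozenset("c"): 0.3, frozenset("ab"): 0.5, frozenset("ac"): 0.6,
        frozenset("bc"): 0.4, frozenset("abc"): 1.0}
mu = lambda A: vals[frozenset(A)]
conj = lambda A: mu(frozenset(omega)) - mu(frozenset(omega) - frozenset(A))

f = {"a": -1.0, "b": 2.0, "c": 0.5}
neg = {q: -v for q, v in f.items()}

lhs = choquet_ext(mu, omega, neg)
rhs = -choquet_ext(conj, omega, f)
print(lhs, rhs)   # equal, as the theorem asserts
```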

There is another way to extend a monotonic, translation equivariant (or homogeneous) integral on $\mathcal M$ to functions $f:\Omega_\mu\to\overline{\mathbb R}$, when $\inf f$ and $\mu(\Omega_\mu)$ are finite: it is to define

$$\int f\,\mathrm d\mu=\int\max\{f,0\}\,\mathrm d\mu-\int\max\{-f,0\}\,\mathrm d\mu.\qquad(2.11)$$

The right-hand side is certainly well-defined, and the resulting extended integral satisfies the equality $\int(-f)\,\mathrm d\mu=-\int f\,\mathrm d\mu$, but in general it does not maintain the translation equivariance and other possible properties of the original integral, such as subadditivity or maxitivity.

The extension of the Choquet integral by means of equality (2.10) usually keeps the name "Choquet integral", while the extension by means of equality (2.11) is often called "Šipoš integral" (it was introduced by Šipoš, 1979). Thanks to Fubini's theorem, it can be proved that if µ is countably additive, then both extended versions of the Choquet integral reduce to the Lebesgue integral (when either f ≥ 0, or both inf f and µ(Q_µ) are finite). So the Choquet integral generalizes the Lebesgue integral to nonadditive measures, and has strong properties (subadditivity and comonotonic additivity) on a broad class of measures (the class of all 2-alternating measures), containing all finitely additive and all finitely maxitive measures. By contrast, the Shilkret integral does not generalize the Lebesgue integral, and has strong properties (subadditivity and maxitivity) only on the class of all finitely maxitive measures. The Choquet integral is thus a very general and powerful integral, while the Shilkret integral is specialized on maxitive measures: in particular, on completely maxitive measures it reduces to the very simple form stated in Theorem 2.6. The main drawback of the Shilkret integral is the lack of translation equivariance; this implies in particular that it cannot be extended in a satisfactory way to functions taking also negative values.
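The two extensions (2.10) and (2.11) are easy to compare numerically on a finite set. The following sketch (in Python; the measure, the function, and all names are illustrative, not from the thesis) computes the Choquet integral of a nonnegative function by the usual level-set formula, then the asymmetric extension (2.10), the symmetric Šipoš extension (2.11), and the conjugate measure µ̄(A) = µ(Q) − µ(Q\A) appearing in Theorem 2.28:

```python
from itertools import combinations

def choquet_nonneg(f, mu):
    """Choquet integral of a nonnegative function f (dict: point -> value)
    with respect to a monotone set function mu (dict: frozenset -> value),
    via the level-set formula: sum of (v - v_prev) * mu({f >= v})."""
    total, prev = 0.0, 0.0
    for v in sorted({x for x in f.values() if x > 0}):
        level = frozenset(q for q, x in f.items() if x >= v)
        total += (v - prev) * mu[level]
        prev = v
    return total

def choquet_2_10(f, mu, Q):
    """Asymmetric (Choquet) extension to real-valued f, equality (2.10)."""
    m = min(f.values())
    return choquet_nonneg({q: x - m for q, x in f.items()}, mu) + mu[frozenset(Q)] * m

def sipos(f, mu):
    """Symmetric (Sipos) extension, equality (2.11)."""
    return (choquet_nonneg({q: max(x, 0.0) for q, x in f.items()}, mu)
            - choquet_nonneg({q: max(-x, 0.0) for q, x in f.items()}, mu))

def conjugate(mu, Q):
    """Conjugate measure: mubar(A) = mu(Q) - mu(Q \\ A)."""
    Qf = frozenset(Q)
    return {frozenset(A): mu[Qf] - mu[Qf - frozenset(A)]
            for k in range(len(Qf) + 1) for A in combinations(sorted(Qf), k)}

# Illustrative monotone, nonadditive measure on a two-point set:
Q = {'a', 'b'}
mu = {frozenset(): 0.0, frozenset({'a'}): 0.2,
      frozenset({'b'}): 0.7, frozenset({'a', 'b'}): 1.0}
f = {'a': 1.0, 'b': -1.0}

print(round(choquet_2_10(f, mu, Q), 9))   # -0.6
print(round(sipos(f, mu), 9))             # -0.5  (the two extensions differ)

# Theorem 2.28: the extension (2.10) satisfies int(-f) dmu = -int f dmubar.
mubar = conjugate(mu, Q)
neg_f = {q: -x for q, x in f.items()}
print(abs(choquet_2_10(neg_f, mu, Q) + choquet_2_10(f, mubar, Q)) < 1e-9)  # True
```

On this nonadditive µ the two extensions disagree (−0.6 versus −0.5), while the asymmetric extension satisfies ∫(−f) dµ = −∫ f dµ̄, as stated by Theorem 2.28.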

3 Relative Plausibility

The likelihood function can be considered as a description of uncertain knowledge about the statistical models: in the present chapter this kind of description is studied, under the name of "relative plausibility", from both the static and the dynamic points of view. In particular, it is compared with the Bayesian, probabilistic description of uncertain knowledge about the models, and it is connected to the MPL criterion by a representation theorem for preferences between randomized decisions. The statistical models and their relative plausibility build a hierarchical model with two levels describing two different kinds of uncertain knowledge: this hierarchical model is briefly compared with the classical, Bayesian, and imprecise probability models.

3.1 Description of Uncertain Knowledge

A relative plausibility measure rp on a set Q is an equivalence class of completely maxitive measures on Q with respect to the equivalence relation ∝. Thus, to be precise, rp is not a measure (in the sense introduced in the preceding chapter) but a class of proportional measures; however, this slight abuse of terminology is coherent with the abuse of notation we commit by using rp to indicate one of its representatives µ, when the choice of µ ∈ rp is irrelevant. For example, we can say that for each function f : Q → P there is a unique relative plausibility measure f^∨ on Q such that f^∨ ∝ f (on the singletons of Q). On the other hand, when there is no possibility of confusion, we use rp ∘ t⁻¹ and rp|_A as abbreviations for [rp ∘ t⁻¹]_∝ and [rp|_A]_∝, respectively (where t is a function on Q, and A is a subset of Q). A relative plausibility measure rp on Q is said to be nondegenerate if rp(Q) ∈ P (that is, if rp is finite and nonzero); in this case, r̃p denotes the (unique) normalized representative of rp:

r̃p(A) = rp(A) / rp(Q)   for all A ⊆ Q.

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and let lik be the likelihood function on 𝒫 induced by the observed data; Theorem 2.6 implies that the MPL criterion can be expressed as follows:

minimize ∫^S L_d d lik^∨   (over the decisions d ∈ 𝒟, where L_d : P ↦ L(P, d)).

The likelihood function lik measures the relative plausibility of the models in 𝒫 in the light of the observed data alone, and lik^∨ extends this relative measure to the subsets of 𝒫: it is the relative plausibility measure on 𝒫 induced by the observed data. This extension of lik to the subsets of 𝒫 by means of the supremum agrees with the profile likelihood: if Q is a set, and g : 𝒫 → Q is a mapping, then lik^∨ ∘ g⁻¹ = (lik_g)^∨; the extension is thus in accordance with the likelihood-based inference methods: in particular, if lik^∨ is nondegenerate, then l̃ik^∨(ℋ) = LR(ℋ) for all ℋ ⊆ 𝒫.
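On a finite set of models the MPL criterion is directly computable: by the sup-form of the Shilkret integral with respect to completely maxitive measures, the integral of the loss reduces to a supremum of products. A minimal sketch, with hypothetical binomial models and squared-error loss (all names and numbers are illustrative):

```python
def mpl_decision(lik, loss, decisions):
    """MPL criterion: by the sup-form of the Shilkret integral with
    respect to the completely maxitive measure lik^sup, the criterion
    reduces to:  minimize over d the value  sup_P lik(P) * L(P, d)."""
    return min(decisions, key=lambda d: max(lik[P] * loss(P, d) for P in lik))

# Hypothetical example: binomial models theta in {0.2, 0.5, 0.8},
# likelihood of 4 successes in 5 trials, squared-error loss over
# the point estimates {0.2, 0.5, 0.8}.
lik = {th: th**4 * (1 - th) for th in (0.2, 0.5, 0.8)}
loss = lambda th, d: (th - d) ** 2
best = mpl_decision(lik, loss, [0.2, 0.5, 0.8])
print(best)  # 0.8

# The criterion depends on lik only up to proportionality:
lik_scaled = {th: 1000.0 * v for th, v in lik.items()}
print(mpl_decision(lik_scaled, loss, [0.2, 0.5, 0.8]) == best)  # True
```

Since lik enters only through a supremum of products, multiplying it by a positive constant changes nothing: the criterion uses the likelihood only as a relative plausibility.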

Hence a relative plausibility measure rp on 𝒫 can be interpreted as a description of uncertain knowledge about the models in 𝒫, based on the description given by rp of the relative plausibility of the elements of 𝒫. It is strictly related to other descriptions of uncertain knowledge appearing in the literature: "degrees of potential surprise" (Shackle, 1949), "support by eliminative induction" (Cohen, 1966), "consonant plausibility functions" (Shafer, 1976), and possibility measures (if rp is nondegenerate, then r̃p is a normalized possibility measure on 𝒫). The models in 𝒫 are mathematical representations of some aspects of the reality under consideration; no model is true, but some models represent the observed reality better than others. The relative plausibility of a model is its relative ability to represent the observed reality; the ability of a set of models is interpreted as the ability of its best element. In the Bayesian approach, the uncertain knowledge about the models in 𝒫 is described by an averaging probability measure π on (𝒫, 𝒞) (or by a set Γ of such probability measures), where 𝒞 is a σ-algebra of subsets of 𝒫. The interpretation of a probability measure on (𝒫, 𝒞) is based on the assumption that exactly one of the models in 𝒫 is "true": 𝒫 is an exhaustive set of mutually exclusive possible "states of the world" (at least when 𝒞 contains all singletons of 𝒫). The exhaustivity of 𝒫 is not implied by the interpretation of a relative plausibility


measure rp on 𝒫 (because rp contains no evaluation of the "absolute plausibility" of 𝒫), although some sort of "actual exhaustivity" is implied by the fact that we do not consider models that are not in 𝒫. The same situation can be obtained for the Bayesian approach by considering a "relative probability measure" on 𝒫, in analogy with rp: the Bayesian criterion for statistical decision problems is unaffected by the "absolute probability" of 𝒫 (Hartigan, 1983, developed a Bayesian approach with non-normalized probability measures). The basic difference between the interpretations of relative plausibility and probability is that besides the exhaustivity, neither is the mutual exclusivity of the models in 𝒫 implied by the interpretation of a relative plausibility measure on 𝒫, in the following sense. The probability of a model is the probability that it is the "true" one, and truth is exclusive; but the ability to represent the observed reality is not exclusive: different models can coexist as satisfactory representations of the reality. Models are mental constructions, not "states of the world", even though in particular situations (such as drawing from an urn containing colored balls) a model can be so strongly associated with a state of the world that they can almost be identified. The fundamental qualitative difference between relative plausibility and probability (see for example Dubois, 1988) is that if ℋ₁, ℋ₂, and ℋ are subsets of 𝒫 (in the case of probability they must be elements of 𝒞) such that rp(ℋ₁) > rp(ℋ₂), and ℋ is disjoint from both ℋ₁ and ℋ₂, then the inequality rp(ℋ₁ ∪ ℋ) > rp(ℋ₂ ∪ ℋ) holds if and only if rp(ℋ) < rp(ℋ₁) (while in the case of probability it always holds). In fact, the ability of a set ℋ of models to represent the observed reality does not increase when it is enlarged by models that are not more satisfactory than the ones in ℋ: if rp(ℋ₁) ≤ rp(ℋ), then rp(ℋ₁ ∪ ℋ) = rp(ℋ).

It is important to note that the identification of the set-theoretic operations on subsets of 𝒫 with the connectives of propositional logic is based on the assumption that exactly one of the models in 𝒫 is "true". Without this assumption the identification fails; for example, rp{P, P′} is not the (relative) plausibility of "P or P′" (in fact, "P or P′" means "P is the true one or P′ is the true one"): it is simply the (relative) plausibility of the set {P, P′}. The relative plausibility measure lik^∨ on 𝒫 is a convenient extension of the (relative) likelihood lik from the elements of 𝒫 to the subsets of 𝒫 (in accordance with the likelihood-based inference methods), and since we do not assume that one of the models in 𝒫 is "true", the use of lik^∨ does not contradict the assertion of Fisher (1930) that in general "the likelihood of P or P′" has no meaning ("you do not know what it is until you know which is meant").


3.1.1 Representation Theorem

Let ≼ be a binary relation on a set S. The binary relations ~ and ≺ on S are defined as follows on the basis of ≼:

a ~ b ⟺ a ≼ b and b ≼ a,   for all a, b ∈ S,
a ≺ b ⟺ a ≼ b and not b ≼ a,   for all a, b ∈ S.

The relation ≼ is said to be a total preorder on S if it fulfills the following two conditions:

a ≼ b or b ≼ a,   for all a, b ∈ S,
a ≼ b and b ≼ c ⟹ a ≼ c,   for all a, b, c ∈ S.

If ≼ is a total preorder on S, then ~ is an equivalence relation on S, and for all a, b ∈ S exactly one of the following three cases applies: a ≺ b, b ≺ a, or a ~ b. A function f : S → ℝ is said to represent the relation ≼ on S if

a ≼ b ⟺ f(a) ≤ f(b),   for all a, b ∈ S.

If ≼ is represented by some function, then it is a total preorder.
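These definitions can be checked mechanically on a finite set; a small sketch (with illustrative values) verifies that a relation represented by a function is a total preorder, while the induced strict relation is not:

```python
from itertools import product

def is_total_preorder(S, rel):
    """rel(a, b) is read as 'a <= b'; check totality and transitivity."""
    total = all(rel(a, b) or rel(b, a) for a, b in product(S, S))
    trans = all(rel(a, c) for a, b, c in product(S, S, S)
                if rel(a, b) and rel(b, c))
    return total and trans

S = ['a', 'b', 'c']
f = {'a': 1.0, 'b': 2.0, 'c': 2.0}

# A relation represented by a function is always a total preorder:
print(is_total_preorder(S, lambda a, b: f[a] <= f[b]))  # True

# The induced strict relation is not total (it is irreflexive, and here
# b and c are incomparable under <), hence not a total preorder:
print(is_total_preorder(S, lambda a, b: f[a] < f[b]))   # False
```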

In the Bayesian approach, the use of probability measures as a description of uncertain knowledge is usually justified in terms of representation theorems for qualitative preference relations. The most famous representation theorem is the one of Savage (1954): it gives necessary and sufficient conditions for a preference relation ≼ on C^Q (where Q is a set of "states of the world" and C is a set of consequences) to be represented by the mapping l ↦ E_π(v ∘ l) for a normalized, finitely additive measure π on Q (the countable additivity is given up, and so measurability problems are avoided) and a bounded function v : C → ℝ (a quantitative evaluation of the consequences). The elements of C^Q are functions l : Q → C associating a consequence to each state of the world; that is, each l can be considered as the uncertain consequence of a particular decision. By assuming that the uncertain consequence summarizes all important aspects of a decision, we identify the decisions with their uncertain consequences, interpreting the elements of C^Q as decisions, and the relation ≼ as a description of preferences between them.

Consider a preference relation ≼ on 𝒟 ⊆ C^Q: a representation theorem gives necessary and sufficient conditions for the preferences between


decisions to be represented by a particular kind of functional on 𝒟. We shall obtain a representation theorem where the functional is of the form l ↦ ∫^S (v ∘ l) d rp for a nondegenerate relative plausibility measure rp on Q and a function v : C → [0, ∞). We interpret v ∘ l as a quantitative evaluation of the uncertain loss incurred by making the decision l ∈ 𝒟, and accordingly we interpret the expression l ≼ l′ (where l, l′ ∈ 𝒟) as "l′ is not preferred to l". For the sake of generality, we formulate the representation theorem with Q as a generic set of alternatives determining the consequences of the decisions; but when interpreting the conditions, Q can be regarded as a set of statistical models. The theorem describes the kind of preference relation on 𝒟 induced by the MPL criterion, when applied (with respect to some nondegenerate relative plausibility measure rp on Q) to a decision problem described (for some function v : C → [0, ∞)) by the loss function L_v : (q, l) ↦ v[l(q)] on Q × 𝒟.

Representation theorems are often considered as justifications for particular approaches (such as the Bayesian one), but even if the postulated conditions for the preferences between decisions are reasonable, the conclusion that the only reasonable preference relations are represented by a particular kind of functional is based on the wrong assumption that reasonable conditions are noncontradictory (compare with Chernoff, 1954). Moreover, the postulated conditions are usually motivated only by abstract consideration of simple examples; hence their application beyond those simple examples is in fact unmotivated, and can have unexpected consequences (see for example Ellsberg, 1961). For these reasons, the representation theorem that we shall obtain should be interpreted as a mathematical theorem showing the distinguishing properties of the preferences based on the MPL approach, not as a justification for this approach. The same considerations apply to the axiomatic justifications for the use of the likelihood function as a description of the information content of statistical data: see Birnbaum (1962, 1969) and Evans, Fraser, and Monette (1986).

The first condition postulated by Savage for the preferences between decisions is that the preference relation is a total preorder on C^Q. This is a necessary condition for the preferences to be represented by a functional, but it is certainly too strong for the real-world preferences of an individual, also because it is assumed that 𝒟 = C^Q (that is, 𝒟 is sufficiently wide to encompass all possible uncertain consequences). Since we do not consider the representation theorem as a justification for the MPL approach, we can for simplicity assume that 𝒟 = C^Q, but we will later consider on


which subsets of C^Q the preference relation suffices to determine its representation by the mapping l ↦ ∫^S (v ∘ l) d rp for a nondegenerate relative plausibility measure rp on Q and a function v : C → [0, ∞).

If we assume that the preferences between the consequences in C are not affected by the particular case q ∈ Q considered, and correspond to the preferences between the constant (uncertain) consequences in C^Q, then the following condition of monotonicity is hardly debatable. A binary relation ≼ on C^Q is said to be monotone if

l(q) ≼ l′(q) for all q ∈ Q  ⟹  l ≼ l′,   for all l, l′ ∈ C^Q.

If the preference relation ≼ on C^Q is induced by the MPL criterion (applied to the decision problem described by the loss function L_v on Q × C^Q), then the preference relation on C is represented by v, and ≼ is monotone, since the Shilkret integral with respect to completely maxitive measures is homogeneous and monotone (Theorem 2.3). If l and l′ satisfy the left-hand side of the above condition of monotonicity, and {l ≺ l′} is not empty, then we can say that l dominates l′; the monotonicity of the preference relation assures that if l′ is dominated by l, then l′ is not preferred to l. The monotonicity is part of Savage's "sure-thing principle" (at least when l and l′ have finite range; see Subsection 4.1.2); but this principle states also that if l dominates l′, then l must be preferred to l′, at least when {l ≺ l′} is not impossible (that is, when it has positive "probability"). This condition is satisfied by the preferences induced by the MPL criterion if we discard the dominated decisions: the resulting preference relation ≼ on C^Q is such that l ≺ l′ if and only if either v ∘ l ≤ v ∘ l′ and v ∘ l ≠ v ∘ l′, or l is preferred to l′ on the basis of the MPL criterion applied to the decision problem described by L_v. It can easily be proved that in general this preference relation is not a total preorder on C^Q, since it is not transitive (but it is important to note that the strict preference relation ≺ is transitive); this shows that to be reasonable and useful, a preference relation does not need to be a total preorder.

The problem of the "sure-thing principle" of Savage is that the premise "when {l ≺ l′} is not impossible" cannot be expressed directly in terms of preferences: therefore the principle was replaced with other conditions. In particular, the second condition postulated by Savage for the preferences between decisions states that the preference between two functions l, l′ ∈ C^Q does not depend on the values taken by the functions on the set {l = l′}, in the sense that the preference is left unchanged when the two


functions are modified in the same way on {l = l′}. This condition is in general not satisfied by the preferences induced by the MPL criterion, even if we discard the dominated decisions: it is possible that by modifying the functions on {l = l′} a strict preference becomes an equivalence, or vice versa (but it is impossible that a strict preference is reversed). This discord between the preferences induced by the MPL criterion and the second condition postulated by Savage is strictly related to the fundamental qualitative difference between relative plausibility and probability, considered at the beginning of the present section: for a relative plausibility measure rp, the additional consideration of ℋ can neutralize the preference between ℋ₁ and ℋ₂.

The preferences induced by the MPL criterion present the same kind of discord with the fourth condition postulated by Savage as with the second one, while they fulfill the third and fifth ones (the first five postulated conditions imply the existence of a qualitative probability on Q). To obtain preferences satisfying the second and fourth conditions postulated by Savage, it suffices to restrict attention to the subset {l ≠ l′} of Q when applying the MPL criterion for discriminating between the decisions l, l′ ∈ C^Q, but it can easily be proved that in general the resulting preference relation is not a total preorder on C^Q, since it is not transitive (however, in this case too the strict preference relation is transitive). It is interesting to note that on the end pages of Savage (1954) the author restated in a slightly different form the seven postulated conditions that are scattered through the book: even though he asserted that "the logical content of each postulate is left unaltered", the preferences induced by the MPL criterion satisfy all the conditions appearing on the end pages, apart from the (highly debatable) sixth one. This shows that minimal changes in the postulated conditions can have important consequences on the implied representations, and thus points out how unsuitable the representation theorems are as justifications for particular approaches (in the second edition of the book, published in 1972, the second condition appearing on the end pages has been corrected, but the fourth one is still flawed).

Ellsberg (1961) showed that the second condition postulated by Savage is at variance with the propensity toward hedging bets, which reveals strict uncertainty aversion. Schmeidler (1989) proved a representation theorem where the functional is of the form l ↦ ∫ (v ∘ l) dµ for a normalized, monotone measure µ on Q (a "nonadditive probability" measure) and a bounded function v : C → ℝ (a quantitative evaluation of the consequences): the preferences reveal uncertainty aversion if and only if µ is 2-alternating (they reveal uncertainty neutrality when µ is finitely additive). A strictly related representation theorem for the preferences induced by the Γ-minimax criterion was proved by Gilboa and Schmeidler (1989). Both these theorems use the framework of the representation theorem for the preferences induced by the Bayesian criterion proved by Anscombe and Aumann (1963) and generalized by Fishburn (1970), in which the set C is enlarged by including randomized consequences (that is, the existence of objective probabilities is assumed). We will use this framework too, but with an important difference: these representation theorems are based on the one of von Neumann and Morgenstern (1944), which considers the consequences of economic decisions, while we are interested in statistical decisions. The main difference is that statistical decision problems possess an additional element: the concept of correct decision; we thus assume the existence of the consequence c₀ ∈ C of a correct decision.

For a set C ∋ c₀ of consequences, we define

R(C, c₀) = {r ∈ [0, 1]^C : |{r > 0}| ≤ 1, r(c₀) = 0},

and we interpret each element r = p I_{c} of R(C, c₀) as the following randomized consequence: c with probability p, and c₀ with probability 1 − p. The set C of non-randomized consequences can thus be identified with a subset of R(C, c₀): each c ∈ C \ {c₀} can be identified with c = I_{c}, and c₀ can be identified with c₀ = 0. In our representation theorem, a total preorder ≼ on the set R(C, c₀)^Q is considered; that is, the set C is enlarged by including the randomized consequences of the above form (for all p ∈ (0, 1) and all c ∈ C). The first condition of our representation theorem states the peculiar role of the consequence c₀ of a correct decision (that is, no consequence can be preferred to c₀):

c₀ ≼ c   for all c ∈ C.   (3.1)

In the framework of Anscombe and Aumann, preferences between decisions in R(C)^Q are considered, where

R(C) = {r ∈ [0, 1]^C : |{r > 0}| < ∞, Σ_{c∈C} r(c) = 1}

is the set of all the randomized consequences that are mixtures of a finite number of consequences. The use of R(C) instead of R(C, c₀) in our theorem would pose no problem (a condition should be slightly modified), but


thanks to the peculiar role of c₀, it suffices to consider the preferences between decisions in the set R(C, c₀)^Q, which can be identified with a subset of R(C)^Q. Moreover, C^Q can be identified with a subset of R(C, c₀)^Q, and if l″ ∈ C^Q is identified with l′ ∈ R(C, c₀)^Q, and p ∈ [0, 1], then p l′ can be interpreted as the following randomized decision: l″ with probability p, and c₀ with probability 1 − p. We shall see that the preference relation on the set of randomized decisions of the form p l′ (where l′ can be identified with an element of C^Q) suffices to determine its representation by the mapping l ↦ ∫^S (v ∘ l) d rp for a nondegenerate relative plausibility measure rp on Q and a function v′ : C → [0, ∞), where v : R(C, c₀) → [0, ∞) is the homogeneous extension of v′, in the sense that v(p c) = p v′(c) for all p ∈ (0, 1] and all c ∈ C.

Roughly speaking, each one of the three representation theorems of Anscombe and Aumann, Schmeidler, and Gilboa and Schmeidler imposes three fundamental conditions on the total preorder ≼: independence, continuity, and monotonicity. The independence condition postulated by Anscombe and Aumann can be formulated as follows: l ≼ l′ if and only if p l + (1 − p) l″ ≼ p l′ + (1 − p) l″, where p ∈ (0, 1); it can be interpreted as stating that the preference between two decisions l and l′ does not change if these are randomized by mixing them in the same way with a third decision l″: the influence of l″ on the evaluation of the mixed decisions is independent of l or l′. This is contradicted by the propensity toward hedging bets (when l ~ l′, the preference p l + (1 − p) l′ ≺ l can be explained by strict uncertainty aversion: there is a preference for risk differentiation); therefore Schmeidler relaxed the independence condition by assuming that l, l′, and l″ are pairwise comonotonic (l and l′ are comonotonic if there are no q, q′ ∈ Q such that l(q) ≺ l(q′) and l′(q′) ≺ l′(q)). Gilboa and Schmeidler relaxed the independence condition even more by assuming instead that l″ is constant; they called this condition "certainty-independence", and needed to complement it with the one of "uncertainty aversion", which states that l ~ l′ implies p l + (1 − p) l′ ≼ l. The second condition of our representation theorem corresponds to the certainty-independence for the only case in which it is defined in our framework (that is, for l″ = c₀):

l ≼ l′  ⟺  p l ≼ p l′,   for all p ∈ (0, 1) and all l, l′ ∈ R(C, c₀)^Q.   (3.2)

This condition can be interpreted as stating that if we shall have to decide with probability p (while with probability 1 − p we shall know the correct decision), then our preferences must be the same as if we certainly had to decide; the condition can also be considered as expressing the "scale invariance" of the preferences (that is, the equivalence of the loss functions L and αL, for all α ∈ P). It is important to note that in general the preferences induced by the MPL criterion would not satisfy the certainty-independence if R(C) were used instead of R(C, c₀) (because the Shilkret integral is not translation equivariant); on the other hand, they would satisfy the uncertainty aversion (since the Shilkret integral with respect to completely maxitive measures is subadditive and homogeneous).

The continuity condition is the same for the three representation theorems cited above, and can be formulated as follows: if l ≺ l′ ≺ l″, then there are p, p′ ∈ (0, 1) such that p l + (1 − p) l″ ≺ l′ ≺ p′ l + (1 − p′) l″. This condition can be interpreted as stating that there are randomized decisions obtained by mixing l and l″ that are as close as we want (on the preference scale) to l or l″; it is reasonable only if the uncertain consequences are in some sense "bounded" (see also Wakker, 1993). In our framework, when the monotonicity and the conditions (3.1) and (3.2) are satisfied, the uncertain consequences cannot be preferred to c₀, and the only case in which the above continuity condition is defined and not vacuous is thus for l = c₀. In this case, the second implied preference of this condition has no problems related to unbounded uncertain consequences, and the implication is equivalent to the following one:

p l ≼ l′ for all p ∈ (0, 1)  ⟹  l ≼ l′,   for all l, l′ ∈ R(C, c₀)^Q.   (3.3)

In our representation theorem, we do not need to assume that the uncertain consequences are bounded above, but for simplicity we can assume that there are no "infinitely unfavorable" consequences (otherwise we should impose an additional condition stating their equivalence, to avoid "transfinite evaluations"): the uncertain consequences can be unbounded only if C is infinite. The following condition is strictly related to the first implied preference of the above continuity condition when l″ = c (and l = c₀); it assures that the consequences c ∈ C are bounded and that there are no "infinitesimal evaluations":

l ≼ p c for all p ∈ S, and inf S = 0  ⟹  l ~ c₀,
for all c ∈ C, all S ⊆ (0, 1), and all l ∈ R(C, c₀)^Q.   (3.4)

In the three representation theorems cited above, the monotonicity of the preference relation is assumed; in our theorem it is replaced by the following stronger condition (the monotonicity corresponds to the special case with l_q = l′ for all q ∈ Q):

l(q) ≼ l_q(q) and l_q ≼ l′ for all q ∈ Q  ⟹  l ≼ l′,
for all {l, l′} ∪ {l_q : q ∈ Q} ⊆ R(C, c₀)^Q.   (3.5)

This condition can be interpreted as stating that the preferences are based on a "worst-case evaluation": if for each case q ∈ Q there is a decision l_q whose consequence according to q is not more favorable than the one of l, and such that l′ is not preferred to l_q, then l′ is not preferred to l either. The part of this condition that is additional to the monotonicity can be stated as follows: if l′ is preferred to l, then there is a q ∈ Q such that l′ is preferred to l I_{q} (the decision that has the same consequence as l according to q, and is correct according to all q′ ≠ q). The worst-case evaluation explains why the preferences considered in our representation theorem would not satisfy the certainty-independence if R(C) were used instead of R(C, c₀): the importance of a particular consequence c depends on the plausibility of the case q implying it, and if p l + (1 − p) c is evaluated on the basis of the worst case q, then the influence of c on this evaluation depends on l, since it depends on which q is the worst case.

With these five conditions, we can finally state our representation theorem. A function v : R(C, c₀) → [0, ∞) is homogeneous if v(p r) = p v(r) for all p ∈ (0, 1) and all r ∈ R(C, c₀); in this case v(c₀) = 0, and v is uniquely determined by the function c ↦ v(c) on C \ {c₀}. Hence the quantitative evaluation v (of the randomized consequences in R(C, c₀)) considered in our representation theorem is the expected value of a quantitative evaluation (of the non-randomized consequences) v′ : C → [0, ∞) with v′(c₀) = 0.

Theorem 3.1. Let C and Q be two nonempty sets, and let c₀ ∈ C. A total preorder ≼ on R(C, c₀)^Q fulfills the conditions (3.1), (3.2), (3.3), (3.4), and (3.5) if and only if there are a nondegenerate relative plausibility measure rp on Q and a homogeneous function v : R(C, c₀) → [0, ∞) such that ≼ is represented by the mapping l ↦ ∫^S (v ∘ l) d rp; the function v is unique up to a (positive) multiplicative constant, and rp is unique when v is not constant.

Proof. The "if" part is simple: since v is finite and homogeneous, and the Shilkret integral is homogeneous (Theorem 2.3), we have

∫^S [v ∘ (p l)] d rp = p ∫^S (v ∘ l) d rp   and   ∫^S (v ∘ c) d rp = v(c) rp(Q).

The first four conditions follow easily from these equalities, and for the fifth one we obtain v ∘ l ≤ sup_{q∈Q} v ∘ l_q: now it suffices to exploit the monotonicity and the complete maxitivity of the Shilkret integral with respect to completely maxitive measures (Theorems 2.3 and 2.6).

For the "only if" part, consider first the trivial case with c ~ c₀ for all c ∈ C: condition (3.2) and monotonicity imply that l ~ c₀ for all l ∈ R(C, c₀)^Q, and therefore the desired representation is valid if and only if v = 0. When c₁ ∈ C does not satisfy c₁ ~ c₀, condition (3.1) implies that c₀ ≺ c₁: in this case, let V be the functional on R(C, c₀)^Q defined by

V(l) = sup{p ∈ [0, 1] : p c₁ ≼ l}   if l ≼ c₁,
V(l) = inf{x ∈ (1, ∞) : (1/x) l ≼ c₁}   if c₁ ≼ l.

This definition implies that if l ≼ l′, then V(l) ≤ V(l′); to prove the converse we need first the following simple result: if p′ < p, then p′ l ≼ p l, and p′ l ~ p l only if p l ~ l for all p ∈ (0, 1]. Thanks to monotonicity, it suffices to show p′ l ≼ p l for l = c; if p c ≺ p′ c = γ p c were true (where γ = p′/p), then by induction and condition (3.2) we would obtain p c ≺ γⁿ p c for all positive integers n, and therefore p c ~ c₀ by condition (3.4), but this would contradict p c ≺ p′ c, since p′ c = γ p c ~ γ c₀ = c₀. If p′ l ~ p l, then by the same technique we obtain p l ~ γⁿ p l, and thus p′ l ~ p l for all p′ ∈ (0, p]; but p² l ~ p l and condition (3.2) imply p l ~ l, and thus p′ l ~ l for all p′ ∈ (0, 1]. In particular, p′ < p implies p′ c₁ ≺ p c₁, because otherwise c₁ ~ c₀ would follow from condition (3.4). Using these results and conditions (3.2) and (3.3), the following equivalences can be easily proved:

V(l) = p ∈ [0, 1]  ⟺  l ~ p c₁,
V(l) = 1/p ∈ [1, ∞)  ⟺  p l ~ c₁,
V(l) = ∞  ⟺  c₁ ≺ p l for all p ∈ (0, 1].

These expressions imply that V is homogeneous (that is, V(p l) = p V(l)) and that it represents ≼: the only difficulty is to show that if V(l) = V(l′) = ∞, then l ~ l′. To prove it, consider first that V(c) < ∞ (for all c ∈ C) follows from condition (3.4); hence l(q) ≼ l′ for all q ∈ Q, but then condition (3.5) implies l ≼ l′ (with l_q = l(q)). Let µ be the completely maxitive measure on Q defined by µ({q}) = V(c₁ I_{q}); we prove now that µ is normalized, and V(l) = ∫^S (v ∘ l) dµ, where v is the homogeneous function on R(C, c₀) defined by v(r) = V(r). Let s_l = sup_{q∈Q} V[l I_{q}]; the monotonicity of ≼ implies that V(l) ≥ s_l. If V(l) > s_l were true, then there would be


a c G C such that V(c) > s; (otherwise ^(Z) < s; would follow from

monotonicity), and thus also an r G R(C, cq) such that V(l) > V(r) > s;,

but this is contradicted by the relation / ^ r implied by condition (3.5)with lq = II{q}. Hence V(l) = si, and in particular V(c\) = sup G q ^ (q) ;

thus /i is normalized, since l^(ci) = 1. The above results about V imply

V(r I{q}) = V(r) V{c\ I{q}), and therefore (using Theorem 2.6) we obtain

,s

V(l) = sup V[l(q) I(q\] = s\xpv[l(q)] fJ, (q) = sup(f o /) fj,* = / (vol) drp,

where rp is the nondegenerate relative plausibility measure on Q defined

by rp = /i. The results obtained about V imply that if V : R(C\ co)s —> P

is a homogeneous functional representing ^, then V = V'(c\)V with

V'(ci) G P; this proves the uniqueness results. D
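On a finite state set, the representation just derived, $V(l) = \sup_{q} v[l(q)]\, \mu(q)$, is easy to illustrate numerically. The following Python sketch is not part of the thesis; all names and numbers are hypothetical illustrations of the sup-product (Shilkret-type) evaluation and of its homogeneity:

```python
# Shilkret-type evaluation V(l) = sup_q v(l(q)) * mu(q) on a finite state set.
# All names and numbers are hypothetical illustrations.
def shilkret_eval(l, v, mu):
    """Evaluate decision l (state -> numeric consequence) under maxitive measure mu."""
    return max(v(l[q]) * mu[q] for q in l)

mu = {"q1": 1.0, "q2": 0.5}    # completely maxitive and normalized: sup mu = 1
v = lambda r: r                # homogeneous evaluation of consequences (loss scale)
l = {"q1": 0.2, "q2": 0.8}     # a decision with numeric loss consequences

print(shilkret_eval(l, v, mu))            # max(0.2*1.0, 0.8*0.5) = 0.4
half = {q: 0.5 * x for q, x in l.items()}
print(shilkret_eval(half, v, mu))         # homogeneity: half the value, 0.2
```

The homogeneity check mirrors the property $V(p\, l) = p\, V(l)$ established above.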

Let $\mathcal{D}$ be a subset of $R(C, c_0)^Q$ that contains $c$ and $c_1 I_{\{q\}}$ for all $c \in C$ and all $q \in Q$, and that is closed with respect to randomization, in the sense that if $l \in \mathcal{D}$, then $p\, l \in \mathcal{D}$ for all $p \in (0, 1)$. It can be easily proved that Theorem 3.1 is still valid if we substitute $\mathcal{D}$ for $R(C, c_0)^Q$ in its statement (and in the five conditions); that is, the preference relation on $\mathcal{D}$ suffices to determine its representation by the mapping $l \mapsto \int^S (v \circ l)\, d\,rp$ (up to the slight nonuniqueness of $v$ and $rp$). In particular, as noted above, $\mathcal{D}$ can be a set of randomized decisions of the form $p\, l$, where $l$ can be identified with an element of $C^Q$, and $p \in [0, 1]$. The smallest set $\mathcal{D}$ satisfying the above requirements consists of the decisions $r$ and $r\, I_{\{q\}}$ for all $r \in R(C, c_0)$ and all $q \in Q$; by slightly modifying the postulated conditions, the preferences between decisions of the form $r\, I_{\{q\}}$ would suffice to determine the representation.

76 3 Relative Plausibility

Apart from the fact that the evaluations are not necessarily bounded above, the main differences of our representation theorem with respect to the one of Anscombe and Aumann are the peculiar role of $c_0$, the weakening of the independence condition, and the strengthening of the monotonicity condition. The monotonicity is strengthened through worst-case evaluation: this is strictly related to the description of uncertain knowledge by means of relative plausibility; in fact, this description is based on a "best-case evaluation", in the sense that the ability of a set of models to represent the observed reality is interpreted as the ability of its best element. The apparent conflict between "worst-case" and "best-case" in the previous sentence is due to the fact that we consider an evaluation of the decisions in terms of loss; that is, the negativity of the uncertain consequences is evaluated. This is the reason for interpreting the expression $l \preceq l'$ as "$l'$ is not preferred to $l$"; in the cited literature about representation theorems, however, this expression is interpreted as "$l$ is not preferred to $l'$", because the decisions are evaluated in terms of utility. To avoid confusion, all the cited conditions and results have been modified (if necessary) in order to comply with our interpretation of the preference relation (in fact, the only ones needing changes are those concerning uncertainty aversion). The peculiar role of $c_0$ and the weakening of the independence condition in our representation theorem with respect to the one of Anscombe and Aumann are strictly related to the differences between the evaluations of decisions in terms of loss and in terms of utility.

Usually, a statistical decision problem is described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$, and there is a clear idea of what constitutes a correct decision for a model $P \in \mathcal{P}$, also when this decision is not an element of $\mathcal{D}$. The loss functions $L$ and $\alpha L$ are considered equivalent (for all $\alpha \in P$), and therefore, in order to be representable by an evaluation in terms of loss, the preferences between decisions must fulfill the scale invariance expressed by the weak independence condition (3.2), which can be stated in the framework of Anscombe and Aumann only if the existence of the consequence $c_0$ of a correct decision has been assumed. A general decision problem is often described by a bounded utility function $U : Q \times \mathcal{D} \to \mathbb{R}$, and in general the concept of correct decision does not make sense; in this regard it is important to distinguish between the concept of optimal decision ($d \in \mathcal{D}$ is optimal for $q$ if it maximizes $U(q, d)$ among the elements of $\mathcal{D}$: the optimality thus depends strongly on $\mathcal{D}$) and the one of correct decision (a correct decision for $q$ is not necessarily an element of $\mathcal{D}$, and its correctness does not depend on the other elements of $\mathcal{D}$). The utility functions $U$ and $\alpha U + \beta$ are considered equivalent (for all $\alpha \in P$ and all $\beta \in \mathbb{R}$), and therefore, in order to be representable by an evaluation in terms of utility, the preferences between decisions must fulfill the certainty-independence condition postulated by Gilboa and Schmeidler. If the loss function $L$ and the utility function $U$ on $Q \times \mathcal{D}$ describe the same decision problem and are expressed in the same scale, then $u_0(q) = U(q, d) + L(q, d)$ is the utility of a correct decision for $q \in Q$ (and thus in particular it does not depend on $d$). Hence, when a decision problem is described by a utility function $U$ on $Q \times \mathcal{D}$, to obtain a description of the problem by a loss function we must determine the function $u_0$ on $Q$; but this determination is problematic for general decision problems, since there is usually no clear idea of what constitutes a correct decision. In order to be representable by an evaluation in terms of utility or loss avoiding the problems related to the determination of $u_0$, the preferences between decisions must fulfill the strong independence condition postulated by Anscombe and Aumann: in fact, in this case the utility functions $U$ and $\alpha U + u$ are equivalent for all $\alpha \in P$ and all bounded functions $u : Q \to \mathbb{R}$ (interpreted as functions on $Q \times \mathcal{D}$ such that $u(q, d)$ does not depend on $d$). This points out a fundamental difference between the MPL approach to statistical decision problems and the Bayesian one: the former is specialized on problems described by loss functions, while the latter is able to cope indifferently with loss and utility functions; but this ability comes at a price: it implies uncertainty neutrality, while uncertainty aversion can be considered as the basic attitude in statistics.

Hence, if the evaluation method allows strict uncertainty aversion, then the choice of $u_0$ matters when translating $U$ into a loss function: two alternative assumptions about $u_0$ are often made when its determination is problematic. One of them is that for each $q \in Q$ a correct decision is contained in $\mathcal{D}$ (at least as a limit): we obtain the loss function $L$ on $Q \times \mathcal{D}$ defined by $L(q, d) = \sup_{d' \in \mathcal{D}} U(q, d') - U(q, d)$ for all $(q, d) \in Q \times \mathcal{D}$. That is, the optimality is substituted for the correctness of a decision; this kind of loss function was introduced explicitly by Savage (1951) and is often called "regret function": it gives the amount of utility lost for not having selected the best available decision (that is, the optimal decision). If a method consists in evaluating the decisions on the basis of the regret function obtained from a utility function, then the induced preferences satisfy the strong independence condition postulated by Anscombe and Aumann (and thus reveal uncertainty neutrality), but the preference between two decisions can depend on the whole set $\mathcal{D}$ (since the optimality depends strongly on $\mathcal{D}$). The other frequent assumption about $u_0$ is that the utility of a correct decision does not depend on $q$ (that is, $u_0$ is constant): we obtain the loss function $c - U$ on $Q \times \mathcal{D}$, where $c > \sup U$. To be reasonable, an evaluation of decisions on the basis of $c - U$ must be independent of $c$; that is, it must be based on an evaluation method such that the induced preferences fulfill the certainty-independence condition postulated by Gilboa and Schmeidler. Hence, the application of the MPL criterion to the decision problem described by the loss function $c - U$ is not reasonable: to obtain preferences independent of $c$ we can replace the Shilkret integral with a translation equivariant integral, and the natural choice is the Choquet integral.
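The regret construction $L(q, d) = \sup_{d'} U(q, d') - U(q, d)$ can be made concrete with a small utility table; the values below are hypothetical and not from the thesis:

```python
# Regret function in the sense of Savage (1951): utility lost for not having
# selected the optimal decision. Utilities below are hypothetical.
decisions = ["d1", "d2"]
U = {("q1", "d1"): 3.0, ("q1", "d2"): 1.0,
     ("q2", "d1"): 0.0, ("q2", "d2"): 2.0}

def regret(q, d):
    best = max(U[(q, dd)] for dd in decisions)  # utility of an optimal decision for q
    return best - U[(q, d)]

print(regret("q1", "d2"))  # 3.0 - 1.0 = 2.0
```

Note that removing d1 from the set of decisions would change the regret of d2 under q1 to 0, illustrating that this loss function depends on the whole set of available decisions.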


A representation theorem analogous to Theorem 3.1, but with the Choquet integral instead of the Shilkret integral (that is, a representation theorem for the preferences induced by the "MPL* criterion" of Cattaneo, 2005), can be easily obtained by imposing an additional condition on the preferences considered in the representation theorem of Schmeidler. The additional condition must force the complete maxitivity of the "nonadditive probability" measure, as does for example the following one: if $l|_A$ is constant, and $l' \prec l$, then there is a $q \in A$ such that $l' \prec l\, I_{(Q \setminus A) \cup \{q\}}$ (for all nonempty $A \subseteq Q$ and all $l, l' \in R(C)^Q$). This condition can be interpreted as stating that the preferences are based on a worst-case evaluation limited to cases implying the same consequence; it could also be expressed in a form analogous to the one of condition (3.5), and in this case it would imply the monotonicity of the preference relation on the set of functions $l \in R(C)^Q$ having finite range (the preference relation on this set suffices to determine the representation); but in any case this enforcement of complete maxitivity is rather artificial.

3.1.2 Updating

Let $\mathcal{P}$ be a set of statistical models ($\mathcal{P}$ is a set of probability measures on a measurable space $(\Omega, \mathcal{A})$): a relative plausibility measure on $\mathcal{P}$ is a description of uncertain knowledge about those models. In the previous subsection we considered the static aspect of this description, in relation to the preferences induced by the MPL criterion; in the present subsection we consider the dynamic aspect: the evolution of the description when new information about the models is acquired.

When we observe an event $A \in \mathcal{A}$, we obtain a likelihood function $lik$ on $\mathcal{P}$, and we must condition on $A$ each model in $\mathcal{P}$; if subsequently we observe a second event $B \in \mathcal{A}$, then we obtain the likelihood function $lik'$ on $\mathcal{P}$ defined by $lik'(P) = P(B|A)$ (for all $P \in \mathcal{P}$). To be precise, when we observe $B$, we obtain a likelihood function $lik''$ on the set $\mathcal{P}'$ of the models in $\mathcal{P}$ conditioned on $A$, but it is convenient to avoid explicit reference to $\mathcal{P}'$ by considering $lik' = lik'' \circ c$ instead of $lik''$, where $c : \mathcal{P} \to \mathcal{P}'$ is the surjection representing the conditioning on $A$ of the models in $\mathcal{P}$. Since for each $P \in \mathcal{P}$ we have $P(A \cap B) = P(A)\, P(B|A)$ (independently of the definition of $P(B|A)$ when $P(A) = 0$), the likelihood function on $\mathcal{P}$ induced by the observation of both events $A$ and $B$ is $lik\, lik'$; that is, the likelihood function $lik$ is updated through multiplication with the likelihood function $lik'$ induced by the new observation. We will use the symbol $\odot$ to denote the binary operation on relative plausibility measures corresponding to the multiplication of their density functions: $rp \odot rp' = (rp^{\downarrow}\, rp'^{\downarrow})^{\uparrow}$ for all pairs of relative plausibility measures $rp, rp'$ on the same set $Q$. Hence the relative plausibility measure $lik^{\uparrow}$ is updated through combination with $(lik')^{\uparrow}$ by means of the operation $\odot$.
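On a finite set of models, the operation $\odot$ reduces to pointwise multiplication of density functions, renormalized so that the supremum is 1. A minimal sketch, with hypothetical names and numbers:

```python
# Combination of relative plausibility measures via multiplication of densities.
def combine(rp1, rp2):
    """rp1 (.) rp2 on a finite set of models: pointwise product, sup rescaled to 1."""
    prod = {P: rp1[P] * rp2[P] for P in rp1}
    m = max(prod.values())
    return {P: x / m for P, x in prod.items()}

ignorance = {"P1": 1.0, "P2": 1.0}   # constant density: neutral element of the operation
lik = {"P1": 0.3, "P2": 0.6}         # likelihood induced by an observation

print(combine(ignorance, lik))       # {'P1': 0.5, 'P2': 1.0}: proportional to lik
```

The operation is commutative and associative, so likelihood functions induced by independent observations can be combined in any order.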

The dynamic aspect of the description of uncertain knowledge about statistical models by means of relative plausibility can be incorporated in our representation theorem (Theorem 3.1, with $Q = \mathcal{P}$) by imposing an additional condition on the preferences between decisions. This condition can be stated as follows (in a slightly informal way): if the relation $\preceq$ on $R(C, c_0)^{\mathcal{P}}$ expresses our preferences before observing the event $A$ (the models in $\mathcal{P}$ are assumed to be conditional on past observations), then our preferences after observing $A$ shall be expressed by the relation $\preceq_A$ on $R(C, c_0)^{\mathcal{P}}$ defined by: $l \preceq_A l'$ if and only if $l_A \preceq l'_A$, where $l_A$ is the decision whose uncertain consequence will be $l$ when $A$ occurs, and $c_0$ otherwise ($l'_A$ is defined analogously, for all $l, l' \in R(C, c_0)^{\mathcal{P}}$). This condition can be interpreted similarly to condition (3.2), as stating that if we shall have to decide only when $A$ occurs (while otherwise we shall know the correct decision), then our preferences must be the same as if $A$ had already occurred and we certainly had to decide. Under each model $P \in \mathcal{P}$, the decision $l_A$ has the following randomized consequence: $l(P)$ with probability $P(A)$, and $c_0$ with probability $1 - P(A)$; that is, $l_A = lik\, l$. Therefore, if the preferences before observing $A$ are represented by the mapping $l \mapsto \int^S (v \circ l)\, d\,rp$, as stated in Theorem 3.1, then the preferences after observing $A$ will be represented by the same mapping after the updating of the relative plausibility measure on $\mathcal{P}$:

$$\int^S [v \circ (lik\, l)]\, d\,rp \propto \int^S (v \circ l)\, d(rp \odot lik^{\uparrow}).$$

The relative plausibility measure $1^{\uparrow}$ on $\mathcal{P}$ describes the complete absence of information for discrimination between the models in $\mathcal{P}$ (in the sense of Kullback and Leibler, 1951): it is the neutral element of the operation $\odot$, and it is induced for example by the observation of $\Omega$, which corresponds to having observed no data. Basing decisions directly on the likelihood function $lik$ implies in particular using no information about $\mathcal{P}$ that was available prior to the observation of $A$: this means using $1^{\uparrow}$ as a description of the uncertain knowledge about the models in $\mathcal{P}$ prior to the observation of $A$, and in fact $lik^{\uparrow} = 1^{\uparrow} \odot lik^{\uparrow}$ is then the updated description. On the basis of these considerations, it is natural to allow the possibility of using information about $\mathcal{P}$ that was available prior to the observation of events in $\mathcal{A}$, when this is described by a (prior) relative plausibility measure $rp$ on $\mathcal{P}$ (or equivalently, by a prior likelihood function $rp^{\downarrow}$ on $\mathcal{P}$), and the models in $\mathcal{P}$ are assumed to be conditional on the observations that have induced $rp$: in this case, after observing $A$ we obtain the updated relative plausibility measure $rp \odot lik^{\uparrow}$, which can be used for decision making. The choice of the prior $rp$ can be based on analogies with the relative plausibility measures encountered in past experience, or with those induced by hypothetical data in the situation at hand. It is also possible to interpret the prior $rp$ operationally: if we observed no data, then we would base our decisions on $rp$, and thus the preferences between decisions can shed light on the form of $rp$. In particular, if we decide according to the MPL criterion, then we can choose the prior $rp$ by considering the "ideal" form of the uncertain (relative) loss $l$: the prior likelihood function can simply be defined as proportional to $\frac{1}{l}$.

Hence in general the description of uncertain knowledge about the models in $\mathcal{P}$ by means of a relative plausibility measure on $\mathcal{P}$ evolves from a prior $rp$ (which can also describe the complete absence of information for discrimination between the models) through combination (by means of the operation $\odot$) with the relative plausibility measures on $\mathcal{P}$ induced by the observations of new data, and the description can be used at any moment for decision making. There is a strong similarity with the Bayesian approach, in which the prior probability measure $\pi$ on $\mathcal{P}$ evolves through conditioning of the resulting probability measure $P_{\pi}$ on $\mathcal{P} \times \Omega$, and can be used at any moment for decision making: in particular, after observing $A$, the MPL and Bayesian criteria applied to a decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$ consist in minimizing

$$\int^S L(\cdot, d)\, lik\; d\,rp \quad \text{and} \quad \int L(\cdot, d)\, lik\; d\pi,$$

respectively (the second expression is a Lebesgue integral). If $\pi$ possesses a density (with respect to some measure on $\mathcal{P}$), then this is updated through multiplication with the likelihood functions induced by the observations of new data (the normalization of the resulting density is not necessary for decision making), exactly in the same way as the density function of $rp$ is updated. But there is a difference: the updating of the density function of $rp$ is symmetric, in the sense that two likelihood functions are multiplied, while the updating of the density of $\pi$ is asymmetric, in the sense that a likelihood function is multiplied with the density of a probability measure. A consequence of the symmetry is that when several prior relative plausibility measures on $\mathcal{P}$ are assumed to be induced by independent observations, they can be combined without problems by means of the operation $\odot$ (which is commutative and associative), while the combination of several prior probability measures is problematic.
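For a finite set of models the two minimizations can be compared directly. The sketch below uses hypothetical losses and priors (not from the thesis) and shows that the worst-case MPL evaluation and the averaging Bayesian evaluation can select different decisions:

```python
# MPL: minimize sup_P L(P,d)*lik(P)*rp(P);  Bayes: minimize sum_P L(P,d)*lik(P)*pi(P).
# All numbers are hypothetical.
models, decisions = ["P1", "P2"], ["d1", "d2"]
L = {("P1", "d1"): 0.0, ("P1", "d2"): 1.9,
     ("P2", "d1"): 4.0, ("P2", "d2"): 2.0}
lik = {"P1": 1.0, "P2": 0.5}   # likelihood induced by the observations
rp = {"P1": 1.0, "P2": 1.0}    # prior relative plausibility: ignorance
pi = {"P1": 0.5, "P2": 0.5}    # Bayesian prior probability

mpl = min(decisions, key=lambda d: max(L[(P, d)] * lik[P] * rp[P] for P in models))
bayes = min(decisions, key=lambda d: sum(L[(P, d)] * lik[P] * pi[P] for P in models))
print(mpl, bayes)  # d2 d1: the worst case and the average disagree here
```

Decision d2 hedges against the worst case, while d1 is better on average; the disagreement reflects the uncertainty aversion of the MPL criterion against the uncertainty neutrality of the Bayesian one.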

However, the fundamental problem with prior probability measures is the impossibility of describing ignorance (in the sense of absence of information for discrimination between the models): the intuition that a (nearly) constant density function describes a situation of (near) ignorance is wrong for probability measures (see for example Shafer, 1982, and in particular the comment by DeGroot), while it is correct for relative plausibility measures. The choice of a prior likelihood function on $\mathcal{P}$ thus seems to be better supported by intuition than the choice of a prior probability measure on $\mathcal{P}$, also because of the possibility of analogies with likelihood functions encountered in past experience (we can only observe relative likelihoods of models, not probabilities of models), or with those induced by hypothetical data in the situation at hand (see also Dahl, 2005). The relative plausibility measure $1^{\uparrow}$ on $\mathcal{P}$ describes complete ignorance (in the above sense), but partial ignorance can also be represented: for example, if $G$ is a set, $g : \mathcal{P} \to G$ is a mapping, and $f : G \to P$ describes the relative plausibility of the elements of $G$, then $(f \circ g)^{\uparrow}$ contains the same information as $f$, in the sense that the relative plausibility of the elements of $\mathcal{P}$ is the one (described by $f$) of their images under $g$ (in this way we can in particular obtain a relative plausibility measure on $\mathcal{P}$ from a pseudo likelihood function $f$ on $G$).
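The construction $(f \circ g)^{\uparrow}$ is straightforward to sketch for finite sets; all names below are hypothetical and not from the thesis:

```python
# Partial ignorance: the relative plausibility of a model is that of its image under g.
# Hypothetical finite example.
f = {"g1": 1.0, "g2": 0.2}                 # relative plausibility on a set G
g = {"P1": "g1", "P2": "g1", "P3": "g2"}   # mapping from models to G
rp_density = {P: f[g[P]] for P in g}       # density of (f o g) lifted to the models

print(rp_density)  # {'P1': 1.0, 'P2': 1.0, 'P3': 0.2}
```

Models with the same image under g receive the same plausibility, so the ignorance within each fiber of g is preserved.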

3.2 Hierarchical Model

Altogether, our mathematical representation of reality can be considered as a hierarchical model with two levels: the first one consists of a set $\mathcal{P}$ of statistical models, and the second one consists of a relative plausibility measure on $\mathcal{P}$. The two levels describe different kinds of uncertain knowledge: in the first level the uncertainty is stochastic, while in the second one it is about which of the statistical models in $\mathcal{P}$ is the best representation of the reality. When new data are observed, the hierarchical model can be updated by conditioning each element of $\mathcal{P}$, and by updating the relative plausibility measure on $\mathcal{P}$ as considered in Subsection 3.1.2. The relative plausibility measure on $\mathcal{P}$ is simply a convenient extension (from the elements of $\mathcal{P}$ to the subsets of $\mathcal{P}$ by means of the supremum, and thus in accordance with the likelihood-based inference methods) of the likelihood function on $\mathcal{P}$ induced by the observed data, possibly after multiplication with a prior likelihood function. Therefore, under each model $P \in \mathcal{P}$ the relative plausibility measure on $\mathcal{P}$ tends to concentrate around (the conditioned version of) $P$ when more and more data are observed, if some regularity conditions are satisfied.

When we face a decision situation depending on some aspects of the reality described by a hierarchical model consisting of a relative plausibility measure $rp$ on a set $\mathcal{P}$ of statistical models, if we can describe the decision problem by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$ (under each model $P \in \mathcal{P}$ the loss we would incur by making the decision $d \in \mathcal{D}$ can be reduced to a representative value, such as the expected value), then we can apply a likelihood-based decision criterion, with respect to the (pseudo) likelihood function $rp^{\downarrow}$ on $\mathcal{P}$. Statistical inferences can be obtained by applying the likelihood-based inference methods to $rp^{\downarrow}$ (in Subsection 1.3.1 we have seen that the MPL criterion leads to these methods, when applied to some standard form of the corresponding decision problems). In particular, if $G$ is a set, and $g : \mathcal{P} \to G$ is a mapping, then $(rp \circ g^{-1})^{\downarrow}$ is a pseudo likelihood function on $G$ with respect to $g$, and we can obtain the maximum likelihood estimate and the likelihood-based confidence regions for $g(P)$ by considering $(rp \circ g^{-1})^{\downarrow}$ as equivalent to the profile likelihood function on $G$ (they are in fact equivalent if $rp$ is the relative plausibility measure on $\mathcal{P}$ induced by the observed data).

Some authors have noted that if $lik$ is a likelihood function on $\mathcal{P}$, then $lik^{\uparrow}$ is a normalized possibility measure on $\mathcal{P}$ assigning to each set $\mathcal{H} \subseteq \mathcal{P}$ its likelihood ratio $LR(\mathcal{H})$, and they have thus used the apparatus of possibility theory to obtain inferences and decisions: see for example Smets (1982), Moral and de Campos (1991), Chen (1993), and Giang and Shenoy (2002). The connection with possibility theory is in fact tempting, but it can be misleading, since in this theory possibility measures are often used in ways incompatible with the meaning of likelihood functions.
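For a finite set of models, the possibility measure $lik^{\uparrow}$ is immediate to compute: the value assigned to a hypothesis $\mathcal{H}$ is its likelihood ratio, the supremum of the likelihood over $\mathcal{H}$ divided by the overall supremum. A sketch with hypothetical numbers:

```python
# lik lifted to a normalized possibility measure:
# LR(H) = sup_{P in H} lik(P) / sup_P lik(P).  Numbers are hypothetical.
lik = {"P1": 0.2, "P2": 0.8, "P3": 0.4}

def LR(H):
    return max(lik[P] for P in H) / max(lik.values())

print(LR({"P1", "P3"}))  # 0.4 / 0.8 = 0.5
```

The measure is completely maxitive: the value of a union of hypotheses is the maximum of their values, which is the defining property of a possibility measure.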


3.2.1 Classical and Bayesian Models

From an abstract point of view, the classical and Bayesian models can be considered as special cases of our hierarchical model, but the approaches to statistical decision problems are different. The classical model consists of a set $\mathcal{P}$ of statistical models, with complete ignorance about which of the elements of $\mathcal{P}$ is the best representation of the reality; in the classical approach to statistical decision problems the model is never updated, since only pre-data decision problems are considered. Hence the classical model corresponds to the hierarchical model with prior relative plausibility measure $1^{\uparrow}$ on the set $\mathcal{P}$, as long as only pre-data decision problems are considered. In particular, with this prior the MPL criterion applied to the pre-data decision problem described by the risk function corresponds to the minimax risk criterion. With a general prior relative plausibility measure on $\mathcal{P}$, the MPL criterion applied to this decision problem consists in using the prior likelihood of the models to weight the respective losses, before applying the minimax risk criterion. The use of a nondegenerate prior relative plausibility measure $rp$ on $\mathcal{P}$ when applying the MPL criterion to the pre-data decision problem described by the risk function is thus a simple way to address the idea behind the restricted Bayesian criterion: we can use some prior information, but at the same time keep the maximum expected loss under control. In fact, if $\lambda = \inf rp^{\downarrow} > 0$, then the maximum expected loss of an optimal decision function according to the MPL criterion is at most $\frac{1}{\lambda}$ times the one of the optimal decision functions according to the minimax risk criterion.
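The weighting of risks by a prior likelihood, and the bound involving $\frac{1}{\lambda}$, can be checked on a toy risk table; the numbers are hypothetical and not from the thesis:

```python
# MPL on the pre-data problem: weight the risks by the prior likelihood rp,
# then minimize the worst case; compare with the plain minimax risk criterion.
# All numbers are hypothetical.
models, decisions = ["P1", "P2"], ["d1", "d2", "d3"]
risk = {("P1", "d1"): 2.0, ("P2", "d1"): 2.0,
        ("P1", "d2"): 1.0, ("P2", "d2"): 3.0,
        ("P1", "d3"): 4.0, ("P2", "d3"): 0.5}
rp = {"P1": 1.0, "P2": 0.5}   # prior relative plausibility with lam = inf rp > 0
lam = min(rp.values())

def worst(d):
    return max(risk[(P, d)] for P in models)

minimax = min(decisions, key=worst)
mpl = min(decisions, key=lambda d: max(rp[P] * risk[(P, d)] for P in models))
print(minimax, mpl)                         # d1 d2
print(worst(mpl) <= worst(minimax) / lam)   # True: the 1/lam bound holds
```

Here the prior makes d2 preferable because its large risk occurs under the less plausible model, while its maximum risk stays within the stated factor of the minimax value.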

In Subsection 1.1.2 we have seen that the (strict) Bayesian model consists of a single probability measure $P_{\pi}$ on $\mathcal{P} \times \Omega$, and it is updated by conditioning $P_{\pi}$ on the observed data. Hence the Bayesian model corresponds to the hierarchical model with as first level the single statistical model $P_{\pi}$, since there is a unique nondegenerate relative plausibility measure on the singleton $\{P_{\pi}\}$. Thus only stochastic uncertainty is considered in the Bayesian model, and this allows the coherence properties stated in Subsection 1.1.2. In fact, in the Bayesian approach to statistical decision problems the uncertainty about which of the statistical models in $\mathcal{P}$ is the best representation of the reality is eliminated by assuming that exactly one of them is "true" and mixing them by means of a prior probability measure $\pi$ on $\mathcal{P}$. In our approach a prior relative plausibility measure $rp$ on $\mathcal{P}$ is used instead of $\pi$: the main differences between relative plausibility and probability measures have been considered in the previous section. It is interesting to note that the Bayesian method of maximum a posteriori for the estimation of $P \in \mathcal{P}$ corresponds to the method of maximum likelihood, when $rp^{\downarrow}$ is proportional to the density of $\pi$ with respect to the considered dominating measure on $\mathcal{P}$; but the method of maximum likelihood with prior $rp$ can be applied also when the choice of a dominating measure on $\mathcal{P}$ is problematic (see for example Leonard, 1978). In fact, thanks to their simplicity, relative plausibility measures can be handled without problems also on sets $\mathcal{P}$ that are so wide that using probability measures on them would be difficult, as in the following example.

Example 3.2. Consider the problem of nonparametric estimation of the mean of a probability distribution on $[0, 1]$, with squared error loss. Let $\mathcal{P}$ be a family of probability measures such that under each of them the random variables $X_1, \ldots, X_n$ are independent and identically distributed on $[0, 1]$, and such that each probability distribution on $[0, 1]$ for these random variables corresponds to an element of $\mathcal{P}$. Let $g : P \mapsto E_P(X_1)$ be the function on $\mathcal{P}$ assigning to each model the mean of the corresponding probability distribution on $[0, 1]$, and let $L : (P, d) \mapsto [g(P) - d]^2$ be the loss function on $\mathcal{P} \times [0, 1]$ describing the estimation problem. As noted at the beginning of Section 1.2, for a continuous distribution the observation $X_i = x_i$ is impossible: we can only observe $X_i \in N_i$, for a neighborhood $N_i$ of $x_i$. However, for the decision problem at hand, if the neighborhoods $N_i$ are sufficiently small, then the results are practically equivalent to those obtained by considering only the discrete distributions on $[0, 1]$ and the exact observations $X_i = x_i$.

For instance, in the case with $n = 1$ we obtain the following (approximate) profile likelihood function for the mean $g(P)$ of the distribution of $X_1$: the function $lik_g$ on $[0, 1]$ defined by

$$lik_g(\gamma) = \begin{cases} \frac{\gamma}{x_1} & \text{if } 0 \leq \gamma < x_1, \\ 1 & \text{if } \gamma = x_1, \\ \frac{1-\gamma}{1-x_1} & \text{if } x_1 < \gamma \leq 1. \end{cases}$$

If the function $f : [0, 1] \to P$ describes the prior relative plausibility of the possible values of the mean $g(P)$, we can use this prior information simply by multiplying $lik_g$ with $f$: this amounts to using the prior relative plausibility measure $(f \circ g)^{\uparrow}$ on $\mathcal{P}$. If we do not use prior information, then the maximum likelihood estimate of $g(P)$ is $X_1$, and the estimate obtained by applying the MPL criterion is $m(X_1)$, where $m : [0, 1] \to [\frac{1}{4}, \frac{3}{4}]$ is an increasing bijection, plotted in the first diagram of Figure 3.1 (solid line) together with the bijection $id_{[0,1]}$ corresponding to the maximum likelihood estimate (dashed line).

Figure 3.1. Estimators and profile likelihood functions from Example 3.2.

It can be proved that the estimate $m(X_1)$ obtained by applying the MPL criterion is optimal also with respect to the minimax risk criterion, since the least favorable distribution on $[0, 1]$ with mean $\gamma$ is the binomial distribution with parameters $n = 1$ and $p = \gamma$ (considered in Example 1.8); therefore the maximum expected losses (as functions of $p = \gamma \in [0, 1]$) for the estimates obtained by applying the MPL criterion and the method of maximum likelihood (MLD criterion) are plotted in the first diagram of Figure 1.3 (solid and dotted curved lines, respectively).

In the cases with $n > 1$ (without prior information), the maximum likelihood estimate is simply the arithmetic mean $\bar{X}$ of the observations $X_1, \ldots, X_n$, while the estimate obtained by applying the MPL criterion is not a function of $\bar{X}$, because sets of observations with the same mean can induce rather different profile likelihood functions $lik_g$ on $[0, 1]$ (but as $n$ increases, the estimate obtained by applying the MPL criterion tends rapidly to $\bar{X}$). For example, in the case with $n = 2$ the pairs of observations $\{X_1 = 0.4, X_2 = 0.4\}$ and $\{X_1 = 0, X_2 = 0.8\}$ induce the profile likelihood functions plotted in the second diagram of Figure 3.1 (solid and dashed lines, respectively; the two functions have been scaled to have the same maximum): the estimates obtained by applying the MPL criterion are 0.449 and 0.401, respectively. The first estimate is larger than the second one because the corresponding profile likelihood function is more skewed; in fact, the estimates obtained by applying the LRM$_\beta$ criterion with $\beta \approx 0.259$ (so that $-2 \log \beta$ is the 90%-quantile of the $\chi^2$ distribution with one degree of freedom) are 0.449 and 0.414, respectively. These estimates are the midpoints of the intervals corresponding to the likelihood-based confidence regions for $g(P)$ with cutoff point $\beta$: Owen (1988) proved that these regions have 90% asymptotic coverage probability. ◊
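The case $n = 1$ of the example can be reproduced numerically. The sketch below is not part of the thesis (the grid resolution is an arbitrary choice): it evaluates the profile likelihood of the mean after a single observation and minimizes the worst weighted squared error, recovering the value $m(0) = \frac{1}{4}$ at the lower endpoint:

```python
# MPL estimate for n = 1: minimize sup_gamma lik_g(gamma) * (gamma - d)^2 over d.
def lik_g(gamma, x1):
    """Profile likelihood of the mean gamma after the single observation X1 = x1."""
    if gamma == x1:
        return 1.0
    return gamma / x1 if gamma < x1 else (1.0 - gamma) / (1.0 - x1)

def mpl_estimate(x1, steps=500):
    grid = [i / steps for i in range(steps + 1)]
    worst = lambda d: max(lik_g(g, x1) * (g - d) ** 2 for g in grid)
    return min(grid, key=worst)

print(mpl_estimate(0.0))  # about 0.25: the MPL estimate m(0), not the ML estimate 0
```

For $x_1 = 0$ the worst weighted loss of an estimate $d$ is $\max(d^2, \frac{4}{27}(1-d)^3)$, whose minimizer is exactly $d = \frac{1}{4}$; the grid search confirms this.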

3.2.2 Imprecise Probability Models

As noted at the end of Subsection 1.1.2, the Bayesian model can be generalized by considering a whole set $\mathcal{T}$ of prior probability measures $\pi$ on $\mathcal{P}$ (with complete ignorance about which of the elements of $\mathcal{T}$ is the best prior). The resulting model consists of a set $\Pi$ of probability measures $P_{\pi}$ on $\mathcal{P} \times \Omega$, and it is updated by conditioning each of them on the observed data, but usually the induced likelihood function on $\Pi$ is not considered (some methods using the likelihood function have been cited in Subsections 1.2.1 and 1.3.2). Hence in general this model cannot be interpreted as a special case of the hierarchical one, because the prior relative plausibility measure $1^{\uparrow}$ on $\Pi$ is not updated (that is, we remain in a state of complete ignorance about which of the elements of $\Pi$ is the best Bayesian model).

A set V of probability measures on a measurable space (fi, A) can

also be interpreted as an imprecise probability measure on (Q,A): see

for example Walley (1991) and Weichselberger (2001). From V we obtain

in particular the upper probability measure supP (which is normalized,

monotonie, and subadditive) and its dual measure inf V (the lower proba¬

bility measure). Gärdenfors and Sahlin (1982) advocated the necessity of

considering the reliability of the probability measures in V: they stated

that this reliability should be described by a real-valued function p on V,without explaining precisely how p should be updated, but their model

seems to be in agreement with the hierarchical one (through identification

of p and rp-1). The axiomatic approaches of Nau (1992), de Cooman and

Walley (2002), and Giraud (2005) lead to models similar to the hierarchical

one, at least as regards the static aspect. If we start with no information

for discrimination between the probability measures in V, and we observe

an event A G A, then we obtain a likelihood function lik on V, and we

can base inferences and decisions on lik. In particular, if we are interested

in the probability of an event B G A, then we can consider the mapping

3.2 Hierarchical Model 87

g : P ↦ P(B|A) on 𝒫, and base our inferences on the profile likelihood function lik_g on [0,1], as we did in Example 1.7 for the profile likelihood functions plotted in Figure 1.1. In this regard it is important to note that lik_g does not depend on the definition of P(B|A) when P(A) = 0, since in this case lik(P) = 0. Of particular interest are the supremum and infimum of the likelihood-based confidence region for g(P) with cutoff point β ∈ (0,1): these can be considered as improved upper and lower probabilities of B, conditional on the observation of A; they correspond to the upper and lower conditional probabilities of B after the restriction of 𝒫 to the likelihood-based confidence region 𝒫_β for P with cutoff point β. It is important to note that 𝒫_β does not depend on the choice of the event B ∈ 𝒜, and that the restriction of 𝒫 to 𝒫_β is only temporary: no element of 𝒫 should really be discarded, since the relative plausibility of the elements of 𝒫 can change in the light of new data. The larger β is, the more precise are the upper and lower conditional probabilities based on 𝒫_β (in the sense that their distance is smaller), but the lower is the confidence in them: there is a sort of confidence-precision trade-off. The limits of the upper and lower conditional probabilities based on 𝒫_β as β tends to 0 correspond to the "regular extension" considered by Walley (1991, Appendix J) and resulting from the "intuitive concept of conditional probability" studied by Weichselberger and Augustin (2003), while their limits as β tends to 1 correspond to the result of a generalization of the rule for conditioning introduced by Dempster (1967) and having a central role in the theory of Shafer (1976): see also Moral and de Campos (1991) and Gilboa and Schmeidler (1993). Hence, for β varying in (0,1), the restrictions of 𝒫 to 𝒫_β build a continuum between the two extreme cases corresponding to "regular extension" and "generalized Dempster's conditioning" (inferences other than upper and lower probabilities can also be based on 𝒫_β: for instance upper and lower expected values of a random variable). As regards upper and lower conditional probabilities, other compromises between the two extreme cases have been proposed in the literature: for example, the use of the Shilkret integral ∫ g dlik as the upper probability of B conditional on the observation of A (the lower probability can be obtained as the dual measure) was considered by Chen (1993) for some particular cases of independent events A and B, while Moral (1992) proposed to use the Choquet integral or the Sugeno integral instead of the Shilkret integral in the above expression.
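By way of illustration (a sketch with toy models and events of my own choosing, not taken from the text), the confidence-precision trade-off and the two limiting cases can be computed for a finite set of models:

```python
# Toy finite setting (illustrative numbers): probability measures on
# Omega = {0, 1, 2, 3}, an observed event A, and an event B whose
# conditional probability is of interest.
A = {0, 1}
B = {1, 2}

models = [
    {0: 0.10, 1: 0.40, 2: 0.30, 3: 0.20},
    {0: 0.30, 1: 0.30, 2: 0.20, 3: 0.20},
    {0: 0.05, 1: 0.05, 2: 0.50, 3: 0.40},
]

def prob(P, E):
    return sum(P[w] for w in E)

def cond_prob_interval(models, A, B, beta):
    """Lower and upper conditional probabilities of B given A, after
    restriction to the likelihood-based confidence region with cutoff
    beta: the models whose relative likelihood P(A) / max P'(A) is at
    least beta (models with P(A) = 0 are always excluded)."""
    m = max(prob(P, A) for P in models)
    region = [P for P in models if prob(P, A) > 0 and prob(P, A) >= beta * m]
    conds = [prob(P, A & B) / prob(P, A) for P in region]
    return min(conds), max(conds)

# beta near 0: "regular extension"; beta = 1: only maximum-likelihood
# models survive ("generalized Dempster's conditioning"); in between,
# a larger beta gives a more precise but less confident interval.
for beta in (1e-9, 0.5, 1.0):
    print(beta, cond_prob_interval(models, A, B, beta))
```

The interval shrinks as β grows, which is exactly the confidence-precision trade-off described above.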

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟: if we modify the Bayesian approach by using an imprecise prior probability measure Γ on (𝒫, 𝒞), instead of a (precise) prior probability

88 3 Relative Plausibility

measure on (𝒫, 𝒞), then we obtain the imprecise Bayesian model consisting of the set Π = {P_π : π ∈ Γ} of probability measures on (𝒫 × Ω, ℰ), considered at the end of Subsection 1.1.2. When we observe an event A ∈ 𝒜, we can condition each element of the set Π on the event 𝒫 × A, and we obtain the likelihood function lik : P_π ↦ P_π(𝒫 × A) on Π. The decision criteria appearing in the literature on imprecise Bayesian models can usually be interpreted as criteria applied to the decision problem described by the loss function L″ : (P_π, d) ↦ E_{P_π}[L(P,d) | 𝒫 × A] on Π × 𝒟 without using the information contained in the likelihood function lik (that is, a part of the information provided by the data is disregarded). A simple way of using lik for decision making is to (temporarily) restrict Π to the likelihood-based confidence region Π_β for P_π with cutoff point β ∈ (0,1) when applying the decision criterion (that is, the criterion is applied to the decision problem described by L″|_{Π_β × 𝒟}). In particular, the Γ-minimax criterion applied after having restricted Π to Π_β (or equivalently, Γ to the likelihood-based confidence region Γ_β for π with cutoff point β) corresponds to the LRM_β criterion applied to the decision problem described by L″. The information contained in the likelihood function lik can be used in a less limited way by applying other likelihood-based decision criteria (such as the MPL criterion) to the decision problem described by L″. It is interesting to note that the application of the MPL criterion to the decision problem described by L″ can be interpreted as a sort of Γ-minimax criterion with respect to the set of non-normalized conditional probability measures (Moral, 1992, argued that this set describes the conditional information): in fact, it consists in minimizing the supremum (over all π ∈ Γ) of

P_π(𝒫 × A) E_{P_π}[L(P,d) | 𝒫 × A] = E_{P_π}[L(P,d) I_{𝒫×A}].
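In a finite toy setting this criterion is easy to evaluate; the following sketch (with illustrative models, priors, and losses, not numbers from the text) minimizes the supremum over Γ of the non-normalized conditional expected loss displayed above:

```python
# Toy instance (illustrative numbers): two Bernoulli models, the observed
# event A = "success", squared error loss for estimating the success
# probability, and a finite set Gamma of priors pi on the models.
thetas = (0.2, 0.8)
lik_theta = {0.2: 0.2, 0.8: 0.8}              # P_theta(A) for A = "success"
Gamma = [(0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]  # priors pi on thetas
decisions = (0.2, 0.5, 0.8)

def loss(theta, d):
    return (theta - d) ** 2

def mpl_value(d):
    # sup over pi in Gamma of the non-normalized conditional expected loss:
    #   P_pi(A) * E_pi[loss | A] = sum_theta pi(theta) P_theta(A) loss(theta, d)
    return max(
        sum(w * lik_theta[t] * loss(t, d) for w, t in zip(pi, thetas))
        for pi in Gamma
    )

best = min(decisions, key=mpl_value)
print(best, mpl_value(best))  # the observed success favors theta = 0.8
```

Note that normalizing by the overall maximum likelihood would not change the minimizing decision.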

The state of complete ignorance about the models in 𝒫 can be described by the set Γ₀ of all probability measures on (𝒫, 𝒞). As noted at the end of Subsection 1.3.1, if 𝒞 contains all singletons of 𝒫, then using the imprecise prior probability measure Γ₀ on 𝒫 and applying the MPL criterion to the decision problem described by L″ is equivalent to using the prior relative plausibility measure 1 on 𝒫 and applying the MPL criterion directly to the decision problem described by L. This property is not possessed by other likelihood-based decision criteria, such as the LRM_β one, the MLD one, or the one obtained by substituting the Choquet integral for the Shilkret integral in the MPL criterion (that is, the "MPL* criterion" of Cattaneo, 2005), but in general the likelihood-based methods can be used without problems with the imprecise prior probability measure Γ₀ on 𝒫,


while this is not the case for the methods that do not use the information contained in the likelihood function lik on Π. In fact, when the imprecise Bayesian model is updated by means of "regular extension" (and thus lik is used only to discard the P_π ∈ Π that are impossible in the light of the observed data, in the sense that P_π(𝒫 × A) = 0), it is in general not possible to get out of the state of complete ignorance, because an important part of the information provided by the data is disregarded (see also the beginning of Subsection 4.2.2). More generally, a consequence of the disregard of the information contained in the likelihood function lik is that the conclusions obtained on the basis of the "regular extension" can be useless if the imprecise prior probability measure on 𝒫 has not been selected with extreme care (see for instance Kriegler, 2005, Chapter 4): this undermines the intuitive basis of the imprecise Bayesian model.

Example 3.3. In Examples 1.5 and 1.9 we considered (for the estimation problem of Example 1.1) the imprecise Bayesian model Π with as imprecise prior probability measure on 𝒫 the family Γ of probability measures on 𝒫 corresponding to the family of beta distributions for the parameter p ∈ [0,1]. The observation of X = x induces a likelihood function lik on Π: a method using the information contained in lik (such as the MPL criterion) can lead to reasonable conclusions, while a method disregarding this part of the information provided by the data (such as the Γ-minimax criterion) will lead to vacuous conclusions. However, non-vacuous conclusions can be obtained without using the information contained in the likelihood function lik, when instead of Γ some narrower set of probability measures on 𝒫 is used. Walley (1991, Section 5.3) proposed to use a "near-ignorance prior" on 𝒫: that is, an imprecise prior probability measure on 𝒫 such that for all possible observations the upper and lower probabilities are 1 and 0, respectively; in this way, an apparent initial state of complete ignorance is obtained (an example of a near-ignorance prior is Γ itself). In particular, Walley considered the near-ignorance priors Γ_s ⊂ Γ corresponding to the sets of all the beta distributions with sum of the two parameters equal to s ∈ P: the resulting imprecise Bayesian models are special cases of the "imprecise Dirichlet models" studied in Walley (1996), where the particular choices s = 2 and s = 1 are advocated.

Let g′ be the function on Π corresponding to the function g on 𝒫 considered in Example 1.6: that is, g′ assigns to each model P_π the probability E_{P_π}[g(P) | X = x] of observing at least 3 successes in 5 future independent binary experiments with the same success probability as the n



Figure 3.2. Profile likelihood functions from Example 3.3.

observed ones. Figure 3.2 shows the graphs of the profile likelihood functions lik_{g′} on [0,1] induced by the observations X = 8 with n = 10 (first diagram) and X = 33 with n = 50 (second diagram), for the three imprecise Bayesian models with imprecise prior probability measures Γ (dotted lines), Γ₂ (dashed lines), and Γ₁ (solid lines), respectively (the functions have been scaled to have the same maximum). The profile likelihood functions lik_{g′} for the imprecise Bayesian model with prior Γ correspond to the profile likelihood functions lik_g for the model 𝒫 (plotted in Figure 1.1: dotted and dashed lines, respectively), and also to the limits, as s tends to infinity, of the profile likelihood functions lik_{g′} for the imprecise Bayesian models with priors Γ_s; while as s tends to 0 these functions tend to concentrate on γ′ = E_{π′}(g), where π′ is the probability measure on 𝒫 corresponding to the (possibly degenerate) beta distribution with parameters x and n − x (for the two observations considered, γ′ ≈ 0.905 and γ′ ≈ 0.771, respectively; as n increases, γ′ tends to the maximum likelihood estimate γ_ML considered in Example 1.7).

The profile likelihood function lik_{g′} for the imprecise Bayesian model with prior Γ always takes positive values on the whole interval (0,1), and therefore only vacuous conclusions can be obtained if the model is updated by means of "regular extension" (and the information contained in lik_{g′} is thus disregarded). But with the priors Γ_s non-vacuous conclusions can be obtained even if the model is updated by means of "regular extension", because in this case the range of the function g′ (corresponding to the support of lik_{g′}) is usually only a small subset of the interval [0,1]: for the


two observations considered, with s = 2 we obtain the probability intervals (0.758, 0.937) and (0.733, 0.790), respectively, while with s = 1 we obtain the probability intervals (0.833, 0.923) and (0.752, 0.781), respectively (the endpoints of the probability intervals are the upper and lower probabilities of observing at least 3 successes in 5 future independent binary experiments with the same success probability as the n observed ones). The last column of Table 1.1 gives the probability intervals obtained by restricting the imprecise Bayesian model Π with prior Γ to the likelihood-based confidence region Π_β for P_π with cutoff point β ≈ 0.036: this cutoff point is rather small, but for any reasonable choice of β the probability intervals based on Π_β would be much longer than the above ones (based on the "regular extension" of the imprecise Bayesian models with priors Γ₂ and Γ₁, respectively), as can easily be seen from the graphs of Figure 3.2. Anyway, the usefulness of the near-ignorance priors Γ_s is very limited: it suffices to slightly modify the model to obtain vacuous conclusions when the modified model is updated by means of "regular extension" (see for example Piatti, Zaffalon, and Trojani, 2005). ◊
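For orientation, the predictive intervals of the imprecise Dirichlet models can be approximated numerically. The following sketch (with helper names of my own; the grid over the open set of beta priors with parameter sum s only approximates the extreme points, so the values approximate, rather than reproduce, the intervals reported above) computes lower and upper posterior predictive probabilities of at least 3 successes in 5 future trials:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # logarithm of the beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_at_least(m, future, a, b):
    """P(at least m successes in `future` trials) under the beta-binomial
    predictive distribution with posterior Beta(a, b)."""
    return sum(
        comb(future, k) * exp(log_beta(k + a, future - k + b) - log_beta(a, b))
        for k in range(m, future + 1)
    )

def idm_interval(x, n, s, m=3, future=5, grid=1000):
    """Approximate lower and upper predictive probabilities over the beta
    priors Beta(t*s, (1-t)*s), t in (0, 1), with parameter sum s, after
    observing x successes in n trials."""
    vals = [
        prob_at_least(m, future, x + t * s, n - x + (1 - t) * s)
        for t in (i / grid for i in range(1, grid))
    ]
    return min(vals), max(vals)

# the two observations of Example 3.3
for x, n in ((8, 10), (33, 50)):
    for s in (2, 1):
        print(x, n, s, idm_interval(x, n, s))
```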

It is interesting to consider the possibility of replacing the statistical models in 𝒫 with imprecise statistical models: let 𝒳𝒫 be a set of imprecise probability measures on a measurable space (Ω, 𝒜); that is, each IP ∈ 𝒳𝒫 is a set of (precise) probability measures on (Ω, 𝒜). The simplest way to handle imprecise statistical models within our hierarchical model is to consider the set 𝒫 = ⋃𝒳𝒫 of (precise) statistical models, and the mapping g : 𝒫 → 𝒳𝒫 defined by g⁻¹{IP} = IP (for all IP ∈ 𝒳𝒫). To be precise, g is well-defined only if the elements of 𝒳𝒫 are disjoint, but this can always be obtained, for example by expanding the measurable space (Ω, 𝒜) to its product with the measurable space (𝒳𝒫, 2^𝒳𝒫), and replacing each P ∈ IP with the product of P and the Dirac measure δ_IP on 𝒳𝒫 (for all IP ∈ 𝒳𝒫). If we observe an event A ∈ 𝒜 (or the corresponding event A × 𝒳𝒫, when Ω has been expanded to Ω × 𝒳𝒫), then we obtain a likelihood function lik on 𝒫, and we can base inferences and decisions on the profile likelihood function lik_g on 𝒳𝒫. Since

lik_g(IP) = sup_IP lik = (sup IP)(A)  for all IP ∈ 𝒳𝒫,

lik_g is simply the pseudo likelihood function on 𝒳𝒫 based on the upper probability measures sup IP. For example, if we have a set 𝒫′ of statistical models on (Ω, 𝒜), and for some ε ∈ (0,1) we replace each P′ ∈ 𝒫′ with its ε-contamination class (that is, the imprecise model consisting of the contaminated statistical models (1 − ε) P′ + ε P, for all the probability measures P on (Ω, 𝒜)), then instead of the likelihood function lik′ on 𝒫′ we obtain the pseudo likelihood function (1 − ε) lik′ + ε, which can be interpreted as an attenuated version of lik′ (the information contained in lik′ has been discounted, in the sense of Shafer, 1976).
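The discounting effect can be seen directly in a minimal sketch (with arbitrary likelihood values): for a single observed event A, the upper probability of A over the ε-contamination class of P′ is (1 − ε) P′(A) + ε, so the likelihood is attenuated and no model is completely excluded:

```python
# Observing a single event A: the likelihood of a model P' is P'(A); over
# the eps-contamination class of P' the upper probability of A is
# (1 - eps) * P'(A) + eps, so lik' is replaced by (1 - eps) * lik' + eps.
def contaminate(lik_value, eps):
    return (1 - eps) * lik_value + eps

lik = {"P1": 0.9, "P2": 0.5, "P3": 0.0}   # illustrative values P'(A)
eps = 0.1
pseudo = {name: contaminate(v, eps) for name, v in lik.items()}
print(pseudo)  # P3 is no longer excluded: its pseudo likelihood equals eps
```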

In general, the profile likelihood function lik_g on 𝒳𝒫 favors the more imprecise models in 𝒳𝒫 (because it is based only on the upper probability measures sup IP): this behavior seems to be in agreement with the various interpretations of the imprecise probability measures, but if in a particular case it is disapproved of, then another pseudo likelihood function on 𝒳𝒫, penalizing the more imprecise models (for example by explicitly taking into account also the lower probability measures inf IP), should be used instead of lik_g. When we combine the imprecise statistical models in 𝒳𝒫 with a (precise or imprecise) prior probability measure on 𝒳𝒫, we obtain the imprecise Bayesian model consisting of the probability measures on 𝒳𝒫 × Ω that result from all the possible combinations of the averaging probability measures considered and the (precise) statistical models considered; but if we update the resulting model by means of "regular extension" (thus disregarding a part of the information provided by the data), then we can easily get unreasonable results (see for example Wilson, 2001).

Example 3.4. If in the estimation problem of Example 1.1 we replace the models for the results of each single binary experiment with their ε-contamination classes, but we still consider the n results as independent and identically distributed, then we obtain the problem of estimating p ∈ [0,1] (with squared error loss) after having observed the number of successes in n independent binary experiments with common success probability q ∈ [(1 − ε) p, (1 − ε) p + ε]. The first diagram of Figure 3.3 shows the graphs of the profile likelihood functions lik_g on [0,1] for the mapping g and the three observations considered in Example 1.6, under the contaminated models with ε = 0.1 (the three functions have been scaled to have the same maximum); that is, it shows how the graphs of Figure 1.1 are modified by the ε-contaminations of the binary experiments. From these graphs we can see that there is an unavoidable uncertainty about p: in fact, even if we knew the actual success probability q, we could only conclude that p ∈ [max{(q − ε)/(1 − ε), 0}, min{q/(1 − ε), 1}].

The second diagram of Figure 3.3 shows (for the case with n = 100) the graphs of the maximum expected losses under the contaminated models with ε = 0.1 (as functions of p ∈ [0,1]) for the estimates obtained by


Figure 3.3. Profile likelihood functions and (maximum) expected losses from Example 3.4.

applying the MPL criterion to these models (upper solid line) and to the uncontaminated models (upper dashed line), and also the graphs of their expected losses under the uncontaminated models (lower solid and dashed lines, respectively). The estimate obtained by applying the MPL criterion to the uncontaminated models was considered in Example 1.8 (the lower dashed line of the second diagram of Figure 3.3 corresponds to the solid line of the third diagram of Figure 1.3); its maximum expected loss under the contaminated models is particularly high for values of p near the endpoints of the interval [0,1]: this is a consequence of the fact that the maximum (1/2 + |p − 1/2|) ε of the possible difference between p and the actual success probability q is larger for values of p near the endpoints than for values of p near the center of the interval [0,1]. The estimate obtained by applying the MPL criterion to the contaminated models corrects this behavior by favoring the values of p near the endpoints of the interval [0,1]. ◊

4

Likelihood-Based Decision Criteria

This chapter begins with a general definition of likelihood-based decision criterion. Several additional decision-theoretic properties are then considered and related to integral representations of the decision criteria. Finally, some statistical properties of the likelihood-based decision criteria are studied, in particular their asymptotic optimality.

4.1 Decision Theoretic Properties

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and let rp be a relative plausibility measure on 𝒫 describing the uncertain knowledge about the models in 𝒫. By definition of L, all important aspects of the decisions d ∈ 𝒟 are summarized by the respective functions l_d ∈ P^𝒫; we will consider only decision criteria that compare the possible decisions on the basis of some functional V_rp : P^𝒫 → P (depending on rp), in the sense that the criteria can be expressed as follows:

minimize V_rp(l_d).

Each functional V_rp represents a total preorder ≾_rp on P^𝒫, and l_d ≾_rp l_{d′} can be interpreted as "d′ is not preferred to d on the basis of the decision criterion considered and of the relative plausibility measure rp".

For the sake of generality, we replace 𝒫 with a generic set of alternatives determining for each possible decision the loss incurred, and we thus consider decision criteria associating to each (nondegenerate) relative plausibility measure rp (on some set 𝒬_rp) a preference relation ≾_rp on P^{𝒬_rp}. When interpreting the definitions and the results, however, we regard 𝒬_rp as a set of statistical models, and each l ∈ P^{𝒬_rp} as the uncertain


loss l_d incurred by making a particular decision d. The following necessary condition of monotonicity for the preference relations ≾_rp was stated at the beginning of Section 1.1 as a direct consequence of the interpretation of a loss function:

l ≤ l′ ⇒ l ≾_rp l′  for all l, l′ ∈ P^{𝒬_rp}. (4.1)

Since the loss functions L and a L are considered equivalent (for all a ∈ P), the following condition of scale invariance for the preference relations ≾_rp is also necessary:

l ≾_rp l′ ⇒ a l ≾_rp a l′  for all a ∈ P and all l, l′ ∈ P^{𝒬_rp}. (4.2)

To avoid vacuous decision criteria, we can assume the following condition describing the ability of discrimination between the decisions with constant losses:

c ≾_rp c′ ⇒ c ≤ c′  for all c, c′ ∈ P. (4.3)

The next theorem implies that, given the preceding three conditions, the following two are sufficient for a total preorder ≾_rp to be represented by a functional V_rp : P^{𝒬_rp} → P; it can easily be proved that they are also necessary (except for the cases with inf C = 0 and sup C = ∞, respectively).

l ≾_rp c for all c ∈ C ⇒ l ≾_rp inf C  for all C ⊆ P and all l ∈ P^{𝒬_rp}. (4.4)

c ≾_rp l for all c ∈ C ⇒ sup C ≾_rp l  for all C ⊆ P and all l ∈ P^{𝒬_rp}. (4.5)

Theorem 4.1. Let 𝒬_rp be a nonempty set. A total preorder ≾_rp on P^{𝒬_rp} fulfills the conditions (4.1), (4.2), (4.3), (4.4), and (4.5) if and only if it is represented by a homogeneous, monotonic functional V_rp : P^{𝒬_rp} → P such that V_rp(c) = c for all c ∈ P; the functional V_rp is unique.

Proof. The "if" part is simple: the first two conditions follow from the monotonicity and the homogeneity of V_rp, respectively, and the last three conditions are implied by the property that V_rp(c) = c for all c ∈ P.

For the "only if" part, define V_rp(l) = sup{c ∈ P : c ≾_rp l} for all l ∈ P^{𝒬_rp}. Condition (4.5) implies that V_rp(l) ≾_rp l, and condition (4.4) implies that l ≾_rp V_rp(l), because l ≾_rp c for all c > V_rp(l); therefore l is equivalent to the constant loss V_rp(l). The desired results can easily be proved using this equivalence and the correspondence of the relation ≾_rp for the constant losses with the order ≤ on P, implied by conditions (4.1) and (4.3). □

As regards the connection between the relative plausibility measures rp and the respective preference relations ≾_rp, we can assume that if a relative plausibility measure is derived from rp by means of a function on 𝒬_rp, then the respective preference relation can be derived from ≾_rp in the corresponding way:

l ≾_{rp∘t⁻¹} l′ ⇔ l ∘ t ≾_rp l′ ∘ t  for all sets T, all t ∈ T^{𝒬_rp}, and all l, l′ ∈ P^T. (4.6)

A likelihood-based decision criterion is a function associating to each nondegenerate relative plausibility measure rp (on some set 𝒬_rp) a total preorder ≾_rp on P^{𝒬_rp}, in such a way that the conditions (4.1), (4.2), (4.3), (4.4), (4.5), and (4.6) are fulfilled (for all nondegenerate relative plausibility measures rp).

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and let lik be a likelihood function on 𝒫. If G is a set, g : 𝒫 → G is a mapping, and the loss function L depends on the models P only through g(P), in the sense that there is a function L′ on G × 𝒟 such that L(P,d) = L′[g(P), d] for all P ∈ 𝒫 and all d ∈ 𝒟, then condition (4.6) states that applying a likelihood-based decision criterion to L with respect to lik is equivalent to applying it to L′ with respect to the profile likelihood function lik_g. If another pseudo likelihood function lik′ on G (with respect to g) is expected to give better results than lik_g, then the criterion can be applied to L′ with respect to lik′; in this sense, condition (4.6) states that the profile likelihood function is the default choice of a pseudo likelihood function, in accordance with the likelihood-based inference methods.

There is a close connection between the conditions appearing in the definition of a likelihood-based decision criterion and those appearing in Theorem 3.1. In the present section we assume that the consequences of the decisions are in P, and thus no quantitative evaluation of the consequences is necessary; in particular, c₀ corresponds to 0, and so condition (3.1) is clearly fulfilled. The monotonicity (4.1) is a special case of the worst-case evaluation (3.5), and the scale invariance (4.2) corresponds to


(3.2). The only effect of assuming condition (4.3) is the exclusion of the trivial relations ≾_rp for which all the elements of P^{𝒬_rp} are equivalent: this nontriviality is not assumed in Theorem 3.1, but it is necessary and sufficient for the uniqueness of rp. Conditions (4.4) and (4.5) can be rewritten as a continuity condition corresponding to (3.3) and two conditions expressing the exclusion of "infinitesimal evaluations" and of "transfinite evaluations" (the cases with inf C = 0 and sup C = ∞, respectively): the first of these last two conditions corresponds to (3.4), and the second one is a special case of (3.5), with the difference that in Subsection 3.1.1 the boundedness of the consequences is assumed, and thus no consequence corresponds to ∞. Condition (4.6) cannot be expressed in the framework of Subsection 3.1.1 (because in that framework the existence of the relative plausibility measure rp is not assumed), but it is related to the limited version of the worst-case evaluation stated at the end of Subsection 3.1.1. Thus, apart from the assumptions that the consequences are in P, that rp describes the uncertain knowledge about the elements of 𝒬_rp, and that the preference relation on P^{𝒬_rp} is nontrivial, the difference between the conditions appearing in the definition of a likelihood-based decision criterion and those appearing in Theorem 3.1 is that the worst-case evaluation (3.5) is weakened by replacing it with two special cases (the monotonicity and the exclusion of "transfinite evaluations") and condition (4.6). The next theorem states that a likelihood-based decision criterion corresponds to a functional on the set

𝒮_P = {φ ∈ [0,1]^P : sup φ = 1};

the MPL criterion of Theorem 3.1 corresponds to the particular functional φ ↦ sup(id_P φ) on 𝒮_P.

Theorem 4.2. A function associating to each nondegenerate relative plausibility measure rp (on some set 𝒬_rp) a binary relation ≾_rp on P^{𝒬_rp} is a likelihood-based decision criterion if and only if there is a functional S : 𝒮_P → P such that each relation ≾_rp is represented by the functional V_rp : l ↦ S[(rp ∘ l⁻¹)↑] and the following three conditions are fulfilled:

S(I_{1}) = 1, (4.7)

S[φ(·/c)] = c S(φ)  for all φ ∈ 𝒮_P and all c ∈ P with 0 < c < ∞, (4.8)

sup_{[x,∞]} φ ≤ sup_{[x,∞]} ψ and sup_{[0,x]} φ ≥ sup_{[0,x]} ψ for all x ∈ P  ⇒  S(φ) ≤ S(ψ), for all φ, ψ ∈ 𝒮_P. (4.9)


Proof. Theorem 4.1 implies that we have a likelihood-based decision criterion if and only if condition (4.6) is fulfilled and each relation ≾_rp is represented by a homogeneous, monotonic functional V_rp : P^{𝒬_rp} → P such that V_rp(1) = 1 (since then V_rp(c) = c for all c ∈ P follows from homogeneity and monotonicity). Condition (4.6) corresponds to the existence of a functional S : 𝒮_P → P such that V_rp(l) = S[(rp ∘ l⁻¹)↑]. In fact, the existence of such a functional implies condition (4.6), because rp ∘ (l ∘ t)⁻¹ = (rp ∘ t⁻¹) ∘ l⁻¹; conversely, when condition (4.6) holds we can define S(φ) = V_{rp_φ}(id_P), where rp_φ denotes a relative plausibility measure on P with density φ, since

V_rp(l) = V_rp(id_P ∘ l) = V_rp[V_rp(l) ∘ l] = V_{rp∘l⁻¹}[V_rp(l)] = V_{rp∘l⁻¹}(id_P),

where the last equality is a consequence of the second one. Hence, to prove the theorem it suffices to show that the property V_rp(1) = 1, the homogeneity, and the monotonicity of the functionals V_rp correspond to the conditions (4.7), (4.8), and (4.9), respectively. The first two correspondences follow easily from the equalities (rp ∘ 1⁻¹)↑ = I_{1} and [rp ∘ (c l)⁻¹]↑(x) = (rp ∘ l⁻¹)↑(x/c), respectively. The monotonicity of the functionals V_rp follows from condition (4.9), since by defining

φ = (rp ∘ l⁻¹)↑ and ψ = [rp ∘ (l′)⁻¹]↑ (4.10)

the inequality l ≤ l′ implies the premise of condition (4.9): in fact, sup_{[x,∞]} φ = rp{l ≥ x} and sup_{[0,x]} φ = rp{l ≤ x}, and analogously for ψ and l′. To prove the converse it suffices to show that for all φ, ψ ∈ 𝒮_P satisfying the premise of condition (4.9) there are a nondegenerate relative plausibility measure rp (on some set 𝒬_rp) and two functions l, l′ ∈ P^{𝒬_rp} satisfying l ≤ l′ and (4.10). Define the set 𝒬_rp ⊆ P × [0,1) × {0,1} as follows:

𝒬_rp = ({(x,y) ∈ P × [0,1) : y < φ(x)} × {0}) ∪ ({(x,y) ∈ P × [0,1) : y < ψ(x)} × {1}),

and let rp be the nondegenerate relative plausibility measure on 𝒬_rp defined by rp↑(x,y,z) = y. Since sup_{[x,∞]} φ ≤ sup_{[x,∞]} ψ, for each (x,y) ∈ P × [0,1) such that y < φ(x) there is an x′ ∈ [x,∞] such that y < ψ(x′): define l(x,y,0) = x and l′(x,y,0) = x′. Analogously, since sup_{[0,x]} φ ≥ sup_{[0,x]} ψ, for each (x,y) ∈ P × [0,1) such that y < ψ(x) there is an x′ ∈ [0,x] such that y < φ(x′): define l(x,y,1) = x′ and l′(x,y,1) = x. We obtain two functions l, l′ ∈ P^{𝒬_rp} such that l ≤ l′; the equalities of (4.10) are also satisfied, since l(x,y,z) = x′ implies y < φ(x′), and conversely y < φ(x′) implies l(x′,y,0) = x′: therefore (rp ∘ l⁻¹)↑(x′) = φ(x′), and analogously for ψ and l′. □
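Conditions (4.7) and (4.8) can be checked numerically for the functional S corresponding to the MPL criterion, S(φ) = sup_x x φ(x). The sketch below (an illustration, not part of the proof) restricts attention to finitely supported φ, represented as dicts:

```python
# Numerical check of conditions (4.7) and (4.8) for the MPL functional
# S(phi) = sup_x x * phi(x), restricted to finitely supported phi.
def S(phi):
    # phi: dict mapping points x >= 0 to values phi(x) in [0, 1]
    return max(x * v for x, v in phi.items())

# (4.7): S(I_{1}) = 1
assert S({1.0: 1.0}) == 1.0

# (4.8): S(phi(. / c)) = c * S(phi); rescaling the support of phi by c
phi = {0.5: 1.0, 2.0: 0.25, 3.0: 0.1}
c = 4.0
phi_scaled = {c * x: v for x, v in phi.items()}   # phi_scaled(y) = phi(y / c)
assert S(phi_scaled) == c * S(phi)
print(S(phi), S(phi_scaled))
```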


Theorem 4.2 implies that a likelihood-based decision criterion associates to each nondegenerate relative plausibility measure rp (on some set 𝒬_rp) a functional V_rp on P^{𝒬_rp} such that V_rp(l) depends only on (rp ∘ l⁻¹)↑. Thus the values taken by l on {rp↑ = 0} have no influence on V_rp(l), and in particular we have

inf_{rp↑ > 0} l ≤ V_rp(l) ≤ sup_{rp↑ > 0} l  for all l ∈ P^{𝒬_rp}.

It can easily be proved that the functionals V_rp = sup_{rp↑ > 0} correspond to a likelihood-based decision criterion, which can be called the essential minimax criterion, since sup_{rp↑ > 0} l = ess_rp sup l. If 𝒬_rp = 𝒫 is a set of statistical models, and rp is updated as described in Subsection 3.1.2 when new data are observed, then the density functions of the updated relative plausibility measures will still take the constant value 0 on {rp↑ = 0}. Therefore the models in {rp↑ = 0} can be definitively discarded, since they will never have any influence on the likelihood-based decisions and inferences. A likelihood-based decision criterion thus discards the models P such that rp{P} = 0, but it can be useful only if the models P such that rp{P} is small have only a weak influence on the evaluation of a decision (the smaller rp{P}, the weaker the influence of P). That is, to be useful a likelihood-based decision criterion must satisfy some sort of continuity at 0 of the influence of a model, with respect to its relative plausibility: we shall consider this sort of continuity more precisely in Subsection 4.2.2, but it is clear that the essential minimax criterion does not satisfy it, since it uses rp only to discard the models in {rp↑ = 0}.
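The lack of this continuity is easy to exhibit in a toy sketch (with arbitrary numbers): a model with tiny relative plausibility but a huge loss dominates the essential minimax evaluation, while its influence on the MPL functional sup_P rp↑(P) l(P) is bounded by its plausibility:

```python
# A barely plausible model with a huge loss (illustrative numbers):
# it dominates the essential minimax evaluation, while its influence
# on the MPL functional sup_P rp(P) * l(P) stays bounded.
rp = {"P1": 1.0, "P2": 0.5, "P3": 0.001}   # relative plausibility density
l  = {"P1": 1.0, "P2": 2.0, "P3": 100.0}   # uncertain loss of one decision

mpl = max(rp[P] * l[P] for P in rp)                # P3 contributes only 0.1
ess_minimax = max(l[P] for P in rp if rp[P] > 0)   # P3 dominates: 100.0

print(mpl, ess_minimax)
```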

Theorem 4.2 states that a likelihood-based decision criterion corresponds to a functional S on 𝒮_P; in particular the two functions f₁, f₀ on [0,1] can be defined as follows on the basis of S:

f₁(x) = S(I_{0} + x I_{1}) and f₀(x) = S[(1 − x) I_{0} + I_{1}] − f₁(1),

for all x ∈ [0,1]. The properties of S imply that f₁ and f₀ are nondecreasing, f₁(0) = f₀(0) = 0, and f₁(1) + f₀(1) = 1. The "continuity at 0 of the influence of a model, with respect to its relative plausibility" would imply the continuity of f₁ and f₀ at 0 and 1, respectively; in fact, for the essential minimax criterion, f₁ = I_{(0,1]} and f₀ = 0. If rp is a nondegenerate relative plausibility measure on a set 𝒬, then the functional V_rp on P^𝒬 can be defined as in Theorem 4.2: for all A ⊆ 𝒬 we have

V_rp(I_A) = S[rp(𝒬∖A) I_{0} + rp(A) I_{1}] = f₁[rp(A)] + f₀[r̄p(A)],

where r̄p is the dual of the normalized representative of rp. In particular, if rp↑ = 1, then V_rp(I_A) = f₁(1) for all nonempty proper A ⊂ 𝒬; this implies that two decisions can be considered equivalent by all likelihood-based decision criteria, even when one decision dominates the other, their uncertain losses are bounded, and rp↑ > 0. Moreover, since V_rp(1) = 1, no likelihood-based decision criterion can correspond to an additive functional V_rp on P^𝒬 when 𝒬 has at least three elements.

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, let rp be a nondegenerate relative plausibility measure on 𝒫 describing the uncertain knowledge about the models in 𝒫, and let d, d′ ∈ 𝒟 be two decisions such that d dominates d′. The monotonicity (4.1) of the preference relation ≾_rp assures that d′ is not preferred to d, but as shown above, it is possible that d and d′ are equivalent according to all likelihood-based decision criteria (this is in particular the case when rp ∘ l_d⁻¹ = rp ∘ l_{d′}⁻¹). However, as noted at the beginning of Section 1.1, if we have to choose between d and d′, then we must choose d; hence, if necessary, a likelihood-based decision criterion can be refined by discarding from the set of optimal decisions those that are dominated (by other optimal decisions). Anyway, if d′ is strictly dominated by d, in the sense that inf(l_{d′} − l_d) > 0, then a necessary and sufficient condition for d to be preferred to d′ according to all likelihood-based decision criteria is that l_d is essentially bounded with respect to rp, in the sense that ess_rp sup l_d < ∞. This condition is necessary, because the functionals V_rp = ess_rp sup correspond to the essential minimax criterion, and it is sufficient, because if ε = inf(l_{d′} − l_d) and s = ess_rp sup l_d, then ((s + ε)/s) l_d ≤ l_{d′} on {rp↑ > 0}, and thus V_rp(l_d) < V_rp(l_{d′}) for all likelihood-based decision criteria, since V_rp is monotonic and homogeneous, V_rp(l_d) ≤ s, and V_rp(l_{d′}) ≥ ε.

In Subsection 1.3.1 we have seen that the MPL criterion leads to the

usual likelihood-based inference methods, when applied to some standard

form of the corresponding decision problems; this is true also for a generallikelihood-based decision criterion, if the corresponding functions /i and

/o fulfill some conditions. For the estimation problems described by the

loss functions Iw and IwE considered in Subsection 1.3.1, it can be easily

proved that when a maximum likelihood estimate exists and is (really)unique, a likelihood-based decision criterion leads (practically) to it if and

only if /i(x) < /i(l) + /o(l — x) for all x G [0, 1). When applied to the

hypothesis testing problem considered in Subsection 1.3.1 (with Ci > c^),a likelihood-based decision criterion leads practically to a likelihood ratio

102 4 Likelihood-Based Decision Criteria

test with some critical value β depending on c_2/c_1 (where "practically" means that we do not bother about the case with LR(H) = β) if and only if the function f on [0, 1] determined by f_1 and f_0 is well-defined, is strictly increasing on f⁻¹(0, 1), and satisfies sup_{[0,1)} f = 1 and inf_{(0,1]} f = 0. The critical value β is a strictly increasing function of c_2/c_1 if and only if f is continuous: in this case, β = f⁻¹(c_2/c_1) (the inverse function f⁻¹ is well-defined on (0, 1)). For the problem of selecting a confidence region considered in Subsection 1.3.1, a likelihood-based decision criterion is able to discriminate between regions d ∈ 𝒟 with different values of LR{P ∉ d} if and only if f_1 is strictly increasing. In conclusion, a likelihood-based decision criterion leads to the usual likelihood-based inference methods if and only if f_1 is continuous and strictly increasing, and f_0 is continuous on [0, 1); in fact, for the MPL criterion, f_1 = id_{[0,1]} and f_0 = 0.
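As a concrete illustration of how such criteria operate on a finite set of models, the following sketch evaluates each decision by the supremum of likelihood-weighted loss, a Shilkret-type evaluation in the spirit of the MPL criterion; the model set, loss values, and function names are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of a likelihood-based decision rule on a finite model set,
# taking the evaluation of a decision d to be V(l_d) = max_P lik(P) * L(P, d).
# All names and numbers below are illustrative.

def mpl_value(lik, loss_d):
    """Shilkret-type evaluation: supremum of likelihood-weighted loss."""
    return max(lik[p] * loss_d[p] for p in lik)

def mpl_decision(lik, loss):
    """Pick the decision minimizing the evaluation of its loss function."""
    return min(loss, key=lambda d: mpl_value(lik, loss[d]))

# Two models with relative likelihoods 1.0 and 0.2, and two decisions.
lik = {"P1": 1.0, "P2": 0.2}
loss = {
    "d1": {"P1": 1.0, "P2": 5.0},   # good for P1, bad for P2
    "d2": {"P1": 2.0, "P2": 2.0},   # hedging decision
}
# V(d1) = max(1.0*1.0, 0.2*5.0) = 1.0 and V(d2) = max(2.0, 0.4) = 2.0,
# so d1 is chosen: the implausible model P2 is discounted by its likelihood.
print(mpl_decision(lik, loss))
```

Note how the low relative likelihood of P2 attenuates its large loss, in contrast with the minimax criterion, which would pick d2.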

4.1.1 Attitudes Toward Ignorance

Let 𝒫 be a set of statistical models, and let lik : 𝒫 → [0, 1] be the likelihood function induced by the observed data. If lik = I_{{P}} for some model P ∈ 𝒫, then all likelihood-based decisions and inferences would be the same as if we were certain of P (that is, as if P were the only representation of reality that we consider). If lik > I_{{P}}, then we would be more ignorant about which of the models in 𝒫 is the best representation of reality: in fact, the data would contain less information for discrimination between P and the other models in 𝒫 (in the sense of Kullback and Leibler, 1951); the larger lik, the more ignorant we would be. In the extreme case with lik = I_{{P}}, the data contain infinite information for discrimination between P and the other models in 𝒫; while at the other extreme we have the case of complete ignorance, in which lik = 1 and the data contain no information for discrimination between the models. In this sense, if rp and rp′ are two nondegenerate relative plausibility measures on 𝒫 such that rp ≤ rp′, then we are more ignorant about which of the models in 𝒫 is the best representation of reality when rp′ describes our uncertain knowledge about the models than when rp describes it.

A likelihood-based decision criterion reveals ignorance aversion if the corresponding functionals V_rp satisfy

rp ≤ rp′ ⇒ V_rp ≤ V_{rp′}

for all pairs of nondegenerate relative plausibility measures rp, rp′ on the same set Q. This property can be expressed as follows in terms of preference

4.1 Decision Theoretic Properties 103

relations:

c ≽_rp l and rp ≤ rp′ ⇒ c ≽_{rp′} l   for all c ∈ P and all l ∈ P^Q.

Since the evaluation of the decisions with constant losses c ∈ P does not depend on our uncertain knowledge about the models, we can use these decisions to compare our preferences in different states of knowledge: ignorance aversion means that if we do not prefer a particular decision to the constant loss c, then neither would we if we were more ignorant.

Theorem 4.3. Consider a likelihood-based decision criterion and the corresponding functional S : S_P → P; the following four assertions are equivalent:

the decision criterion reveals ignorance aversion,

S is monotone,

S is 0-independent,

each relation ≽_rp is represented by the functional V_rp : l ↦ S(rp{l ≥ ·}).

Proof. Let φ, ψ ∈ S_P such that φ ≤ ψ. If the decision criterion reveals ignorance aversion, then S(φ) = V_φ(id_P) ≤ V_ψ(id_P) = S(ψ), and thus S is monotone. If φ|_{(0,∞]} = ψ|_{(0,∞]}, then S(ψ) ≤ S(φ) follows from condition (4.9), and therefore S is 0-independent when it is monotone. If S is 0-independent, and χ = (rp ∘ l⁻¹)↓, then

V_rp(l) = S(χ) = S(I_{{0}} + χ I_{(0,∞]}) = S(rp{l ≥ ·}),

where the last equality is implied by condition (4.9) with φ = I_{{0}} + χ I_{(0,∞]} and ψ : x ↦ sup_{[x,∞]} χ = rp{l ≥ x}. Finally, if the fourth assertion holds, then the ignorance aversion is a consequence of condition (4.9), since sup_{[x,∞]} rp{l ≥ ·} = rp{l ≥ x} and sup_{[0,x]} rp{l ≥ ·} = 1. □

Let H be a nonempty subset of 𝒫, and consider the (nondegenerate) relative plausibility measure I_H on 𝒫. Theorem 4.3 implies that the functional on P^𝒫 corresponding to a likelihood-based decision criterion revealing ignorance aversion is simply V_{I_H} = sup_H. In particular, in the case of complete ignorance about which of the models in 𝒫 is the best representation of reality we have V_1 = sup, and therefore the likelihood-based decision criteria revealing ignorance aversion can be considered as generalizations of the minimax criterion. Moreover, for these criteria f_1(1) = S(I_{[0,1]}) = S(I_{{1}}) = 1, and thus f_0 = 0; hence a likelihood-based decision criterion revealing ignorance aversion leads to the usual likelihood-based inference methods if and only if f_1 : [0,1] → [0,1] is an increasing bijection. The fourth assertion of Theorem 4.3 suggests the idea that the functionals V_rp corresponding to a decision criterion revealing ignorance aversion can be expressed by means of an integral respecting distributional dominance; the next theorem states that the injectivity of f_1 is a sufficient condition for this integral representation.

Theorem 4.4. Consider a likelihood-based decision criterion such that the corresponding function f_1 : [0,1] → [0,1] is strictly increasing, and let M be the class of all monotone measures taking values in the range of f_1. The decision criterion reveals ignorance aversion if and only if there is a support-based, homogeneous integral respecting distributional dominance on M such that each relation ≽_rp is represented by the functional V_rp : l ↦ ∫ l d(f_1 ∘ rp); the integral is unique.

Proof. The "if" part is simple: the decision criterion reveals ignorance aversion because f_1 is nondecreasing and an integral respecting distributional dominance is monotone with respect to measures (Theorem 2.12).

For the "only if" part, we have V_rp(I_A) = f_1[rp(A)] for all nondegenerate relative plausibility measures rp and all A ⊆ Q_rp, and so the equality ∫ l d(f_1 ∘ rp) = V_rp(l) defines an integral on the class M′ of all measures of the form f_1 ∘ μ, for all normalized, completely maxitive measures μ, since for each ν ∈ M′ there is a unique nondegenerate relative plausibility measure rp such that ν = f_1 ∘ rp. Theorem 4.3 implies that the integral respects distributional dominance, and thus it corresponds to a functional on DF(M′) that can be extended to DF(M) by means of 0-independence. The integral on M corresponding to the extended functional satisfies the desired properties, and its uniqueness can be easily proved by exploiting the fact that DF(M′) = DF(M″) for the class M″ of all measures of the form f_1 ∘ μ, for all possibility measures μ. □

The integral obtained in Theorem 4.4 can be easily (but not uniquely) extended to a support-based, homogeneous integral respecting distributional dominance on the class of all monotone measures, when f_1 is left-continuous; the next example shows that such an extension is in general not possible when f_1 is not left-continuous.


Example 4.5. Let δ be the function on P defined in Example 2.2, and consider the likelihood-based decision criterion associating to each nondegenerate relative plausibility measure rp (on some set Q_rp) the relation ≽_rp on P^{Q_rp} represented by the functional V_rp : l ↦ ∫_R l d(δ ∘ rp). This decision criterion reveals ignorance aversion and f_1 = δ|_{[0,1]} is strictly increasing, but the results of Example 2.11 imply that no integral respecting distributional dominance on the class of all monotone measures can correspond to the rectangular integral on the class of all monotone measures taking values in the range of f_1. ◊

In Theorem 4.4 the injectivity of f_1 is needed to define the integral on the basis of the functionals V_rp, but in general for all nondecreasing functions f_1 : [0,1] → [0,1] such that f_1(0) = 0 and f_1(1) = 1, and for all homogeneous integrals respecting distributional dominance on the class of all normalized, monotone measures taking values in the range of f_1, the functionals V_rp : l ↦ ∫ l d(f_1 ∘ rp) correspond to a likelihood-based decision criterion revealing ignorance aversion. In particular, if the range of f_1 is {0,1}, then the choice of the integral has no influence: V_rp = ess_{f_1∘rp} sup. If f_1 = I_{[β,1]} for some β ∈ (0,1), then the functionals V_rp = ess_{I_{[β,1]}∘rp} sup correspond to the LRM_β criterion; while if f_1 = I_{{1}}, then the functionals V_rp = ess_{I_{{1}}∘rp} sup correspond to the MLD criterion. These two likelihood-based decision criteria thus reveal ignorance aversion, and the corresponding functionals V_rp can be expressed as integrals with respect to a distortion of rp.
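On a finite model set the two {0,1}-valued distortions just mentioned reduce to elementary computations; the sketch below (with illustrative names and numbers, and with essential suprema reduced to plain maxima, an assumption valid only in the discrete case) restricts attention to a likelihood region before taking a maximum of the loss, as in the LRM_β criterion, or to the maximum-likelihood models, as in the MLD criterion.

```python
# Finite-set illustrations of the two {0,1}-distortion criteria.
# LRM_beta: maximum of the loss over the likelihood region {rp >= beta};
# MLD: maximum of the loss over the maximum-likelihood model(s).
# All names and numbers are illustrative.

def lrm_value(rp, loss_d, beta):
    """Restrict to models with relative plausibility >= beta, then maximize."""
    return max(loss_d[p] for p in rp if rp[p] >= beta)

def mld_value(rp, loss_d):
    """Restrict to the maximum-likelihood models, then maximize."""
    m = max(rp.values())
    return max(loss_d[p] for p in rp if rp[p] == m)

rp = {"P1": 1.0, "P2": 0.6, "P3": 0.1}
loss_d = {"P1": 1.0, "P2": 3.0, "P3": 10.0}
# With beta = 0.5 the implausible model P3 is discarded before the maximum.
print(lrm_value(rp, loss_d, beta=0.5))
# Only P1 attains the maximal relative plausibility.
print(mld_value(rp, loss_d))
```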

Theorem 4.3 implies that a likelihood-based decision criterion revealing ignorance aversion corresponds to a 0-independent functional S on S_P, but since the functionals V_rp : l ↦ S(rp{l ≥ ·}) on P^{Q_rp} depend on rp through its normalized representative only, in general they are not "support-based", in the sense that in general rp{l > 0} > 0 does not imply V_rp(l) = V_{rp|_{{l>0}}}(l|_{{l>0}}). In fact, the essential minimax criterion is the only likelihood-based decision criterion such that the corresponding functionals V_rp are "support-based" in this sense. However, for decision criteria revealing ignorance aversion (and thus generalizing the minimax criterion) it is reasonable to assume that the preference between two decisions does not depend on the models for which both decisions are correct; that is, the preference between l and l′ is left unchanged when we restrict attention to the subset {l > 0} ∪ {l′ > 0} = {l + l′ > 0} of Q_rp. A likelihood-based decision criterion is said to be 0-independent if for each nondegenerate relative plausibility measure rp (on some set Q_rp)

l ≽_rp l′ ⇔ l|_{{l+l′>0}} ≽_{rp|_{{l+l′>0}}} l′|_{{l+l′>0}}   for all l, l′ ∈ P^{Q_rp} such that rp{l + l′ > 0} > 0.

The next theorem implies in particular that all 0-independent likelihood-based decision criteria reveal ignorance aversion.

Theorem 4.6. A likelihood-based decision criterion is 0-independent if and only if either it is the essential minimax criterion, or there are a support-based, regular integral on the class of all finite, monotone measures, and a positive real number a such that each relation ≽_rp is represented by the functional l ↦ ∫ l d(rp^a); in this case, the integral is unique.

Proof. For the "if" part, let A ⊆ Q_rp be a set such that rp(A) > 0 and l|_{Q_rp∖A} = 0. The functionals V_rp corresponding to the likelihood-based decision criterion satisfy V_rp(l) = V_{rp|_A}(l|_A) in the case of the essential minimax criterion, while in the other case they satisfy

V_rp(l) = ∫ l d(rp^a) = ∫ l|_A d((rp^a)|_A) = [rp(A)]^a V_{rp|_A}(l|_A),

because the integral is support-based and bihomogeneous. To obtain the 0-independence it suffices to choose A = {l + l′ > 0}.

For the "only if" part, first note that the function f_1 on [0,1] corresponding to the 0-independent likelihood-based decision criterion considered is positive on (0,1]. To prove this, let rp be the (nondegenerate) relative plausibility measure on {0,1} defined by rp↓(1) = c ∈ (0,1] and rp↓(0) = 1. Then f_1(c) = V_rp(I_{{1}}) > V_rp(0) = 0, because rp{1} > 0 and V_{rp|_{{1}}}(I_{{1}}|_{{1}}) = f_1(1) > 0 = V_{rp|_{{1}}}(0|_{{1}}). We can now prove that the functional S on S_P corresponding to the decision criterion satisfies the following equality for all c ∈ (0,1] and all φ ∈ S_P:

S(I_{{0}} + c φ I_{(0,∞]}) = f_1(c) S(φ).   (4.11)

Define Q = P × {0,1} × {0,1} and A = {(x,y,z) ∈ Q : z = 1}, let rp′ be the (nondegenerate) relative plausibility measure on Q defined by (rp′)↓(x,y,z) = c z φ(x) + (1 − z), and let l, l′ be the two functions on Q defined by l(x,y,z) = x z and l′(x,y,z) = y z, respectively. If S(φ) = 0, then φ = I_{{0}}, because f_1 is positive on (0,1], and S fulfills


conditions (4.8) and (4.9). Hence if S(φ) = 0, then the equality (4.11) is obvious; while if S(φ) > 0, then rp′{l + l′ > 0} > 0, and (4.11) can be obtained as follows (the choice φ = I_{{1}} implies f_1(1) = 1):

S(I_{{0}} + c φ I_{(0,∞]}) = V_{rp′}(l) = V_{rp′}(l′) S(φ) = f_1(c) S(φ),

where the second equality is implied by 0-independence and

V_{rp′|_A}(l|_A) = S(φ) = S(φ) V_{rp′|_A}(l′|_A),

because A ∩ {l|_A + l′|_A > 0} = {l + l′ > 0} follows from (l + l′)|_{Q∖A} = 0. If

n is a positive integer, and φ = I_{{0}} + c^n I_{{1}}, then equality (4.11) implies f_1(c^{n+1}) = f_1(c) f_1(c^n). By induction we obtain that f_1(c^y) = [f_1(c)]^y holds for all positive integers y, and from this we obtain that it holds also for all positive rational numbers y. Hence the two functions y ↦ f_1(c^y) and y ↦ [f_1(c)]^y on P are nonincreasing and they correspond on the set of all positive rational numbers: since the second one is continuous, they are equal. Let c = e⁻¹ and a = −log[f_1(e⁻¹)]; for all x ∈ (0,1) we obtain

f_1(x) = [f_1(e⁻¹)]^{−log x} = x^a.

If a = 0, then f_1 = I_{(0,1]} and the decision criterion is the essential minimax criterion, because for all nondegenerate relative plausibility measures rp and all q ∈ {rp↓ > 0} we have

V_rp(l) ≥ V_rp[l(q) I_{{q}}] ≥ l(q) f_1(rp{q}) = l(q).

If a ∈ P, then let F be the functional on {φ ∈ N_P : φ(0) ∈ P} defined by

F(φ) = φ(0)^a S(φ/φ(0)).

Equality (4.11) implies that F is 0-independent, since for all φ, ψ ∈ N_P such that φ|_{(0,∞]} = ψ|_{(0,∞]} and φ(0) ≥ ψ(0) we have

F(φ) = φ(0)^a S(I_{{0}} + (ψ(0)/φ(0)) (ψ/ψ(0)) I_{(0,∞]}) = φ(0)^a f_1(ψ(0)/φ(0)) S(ψ/ψ(0)) = ψ(0)^a S(ψ/ψ(0)) = F(ψ).

Therefore F is also monotone, because for all φ, ψ ∈ N_P such that φ ≤ ψ we have

F(φ) = F[ψ(0) I_{{0}} + φ I_{(0,∞]}] ≤ F(ψ),

where the inequality follows from the monotonicity of S, which is implied by Theorem 4.3, since equality (4.11) with c = 1 expresses the


0-independence of S. Moreover, F is clearly bihomogeneous, and for all y ∈ [0,∞) we have

F(I_{[0,1]} + y I_{{0}}) = (y + 1)^a f_1[(y + 1)⁻¹] = 1.

Hence F is the restriction to {φ ∈ N_P : φ(0) ∈ P} of the calibrated, 0-independent, bihomogeneous, monotone functional φ ↦ sup_{{ψ ≤ φ : ψ(0) ∈ P}} F(ψ) on N_P, corresponding to a support-based, regular integral on the class of all monotone measures (Theorem 2.13). The desired integral representation follows from Theorem 4.3:

V_rp(l) = S(rp{l ≥ ·}) = F[(rp^a){l ≥ ·}] = ∫ l d(rp^a),

and the uniqueness of the integral on the class of all finite, monotone measures follows from the uniqueness of the functionals V_rp (implied by Theorem 4.1). □

Theorem 4.6 implies in particular that a 0-independent likelihood-based decision criterion leads to the usual likelihood-based inference methods if and only if it is not the essential minimax criterion, since for this criterion f_1 = I_{(0,1]}, while otherwise there is an a ∈ P such that f_1 : x ↦ x^a. In this case, if L is a loss function on 𝒫 × 𝒟, and lik is the likelihood function on 𝒫 induced by the observed data (while before observing the data we had no information for discrimination between the models in 𝒫), then the 0-independent decision criterion applied to the decision problem described by L consists in minimizing ∫ l_d d(lik^a), where the integral is regular and support-based on the class of all monotone measures. In fact, each support-based, regular integral on the class of all finite, monotone measures can be extended to a support-based, regular integral on the class of all monotone measures; but the extension is not unique. However, the extension used in the proof of Theorem 4.6 is the most natural one, since it is the one characterized by the following continuity property:

∫ f dμ = lim_{c↑∞} ∫ f d min{μ, c}   (4.12)

for all sets Q, all monotone measures μ on Q, and all functions f : Q → P. Thanks to the extension of the integral, the criterion can be applied also when lik is an unbounded pseudo likelihood function (obtained for example when using the densities of a continuous random object to approximate


the likelihood function). The decision criterion can be interpreted as evaluating the loss we would incur by making each decision d ∈ 𝒟 by means of the integral of l_d with respect to a completely maxitive measure whose density function is proportional to the distorted likelihood function lik^a. If a ∈ (0,1), then the distortion represents an attenuation of the likelihood function lik, while if a ∈ (1,∞), then the distortion represents an intensification of lik; in fact, lik^a contains a times the information in lik for discrimination between the models in 𝒫 (in the sense of Kullback and Leibler, 1951). Hence, a ∈ P can be interpreted as a parameter expressing the confidence in the information provided by the likelihood function lik, or more generally, by the (nondegenerate) relative plausibility measure rp. In fact, the limits as a tends to 0 of the functionals V_rp : l ↦ ∫ l d(rp^a) corresponding to the decision criterion considered are the functionals V_rp : l ↦ ∫ l d(I_{(0,1]} ∘ rp) corresponding to the essential minimax criterion (which usually amounts to disregarding the information provided by rp); while the limit of rp^a as a tends to infinity is I_{{1}} ∘ rp, and the functionals V_rp : l ↦ ∫ l d(I_{{1}} ∘ rp) correspond to the MLD criterion (which usually amounts to considering the model maximizing rp↓ as certain). The limit of ∫ l d(rp^a) as a tends to infinity is not necessarily ∫ l d(I_{{1}} ∘ rp), but if ess_rp sup l < ∞, and the integral is quasi-subadditive, then lim_{a→∞} ∫ l d(rp^a) = ∫ l d(I_{{1}} ∘ rp) follows easily from Theorem 2.23.
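The role of a as a confidence parameter can be seen numerically. The following sketch assumes a finite model set and a Shilkret-type evaluation V_a(l) = max_P rp(P)^a · l(P) (an illustrative choice of integral, not the unique one); small a approaches the essential minimax value, large a the MLD value.

```python
# Sketch of the distortion parameter a (confidence in the likelihood).
# V_a(l) = max over models of rp(P)**a * l(P); names and data are illustrative.

def v_a(rp, loss_d, a):
    return max(rp[p] ** a * loss_d[p] for p in rp)

rp = {"P1": 1.0, "P2": 0.5}
loss_d = {"P1": 1.0, "P2": 8.0}

# Near a = 0 the distortion flattens rp, so the evaluation approaches the
# (essential) minimax value max(1.0, 8.0) = 8.0.
print(v_a(rp, loss_d, a=0.001))
# For large a only the maximum-likelihood model P1 matters (MLD value 1.0).
print(v_a(rp, loss_d, a=20))
```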

The attitude toward ignorance opposed to aversion can be called attraction. A likelihood-based decision criterion reveals ignorance attraction if the corresponding functionals V_rp satisfy

rp ≤ rp′ ⇒ V_rp ≥ V_{rp′}

for all pairs of nondegenerate relative plausibility measures rp, rp′ on the same set Q. This property can be expressed as follows in terms of preference relations:

l ≽_rp c and rp ≤ rp′ ⇒ l ≽_{rp′} c   for all c ∈ P and all l ∈ P^Q;

ignorance attraction means that if we do not prefer the constant loss c to a particular decision, then neither would we if we were more ignorant. The next theorem parallels Theorem 4.3, and implies in particular that the function f_1 on [0,1] corresponding to a likelihood-based decision criterion revealing ignorance attraction satisfies f_1(1) = S(I_{[0,1]}) = S(I_{{0}}) = 0. Therefore, a decision criterion cannot reveal both ignorance aversion and attraction, and a decision criterion such that f_1(1) ∈ (0,1) reveals neither ignorance aversion nor attraction (we shall consider examples of such decision criteria in a moment).

Theorem 4.7. Consider a likelihood-based decision criterion and the corresponding functional S : S_P → P; the following four assertions are equivalent:

the decision criterion reveals ignorance attraction,

φ ≤ ψ ⇒ S(φ) ≥ S(ψ) for all φ, ψ ∈ S_P,

S is ∞-independent,

each relation ≽_rp is represented by the functional V_rp : l ↦ S(rp{l ≤ ·}), where rp{l ≤ ·} denotes the function x ↦ rp{l ≤ x} on P.

Proof. Let φ, ψ ∈ S_P such that φ ≤ ψ. If the decision criterion reveals ignorance attraction, then S(φ) = V_φ(id_P) ≥ V_ψ(id_P) = S(ψ), and thus the second assertion holds. If φ|_{[0,∞)} = ψ|_{[0,∞)}, then S(φ) ≤ S(ψ) follows from condition (4.9), and therefore S is ∞-independent when the second assertion holds. If S is ∞-independent, and χ = (rp ∘ l⁻¹)↓, then

V_rp(l) = S(χ) = S(χ I_{[0,∞)} + I_{{∞}}) = S(rp{l ≤ ·}),

where the last equality follows from condition (4.9) with φ = χ I_{[0,∞)} + I_{{∞}} and ψ : x ↦ sup_{[0,x]} χ = rp{l ≤ x}. Finally, if the fourth assertion holds, then the ignorance attraction is a consequence of condition (4.9), since sup_{[x,∞]} rp{l ≤ ·} = 1 and sup_{[0,x]} rp{l ≤ ·} = rp{l ≤ x}. □

Let H be a nonempty subset of 𝒫, and consider the (nondegenerate) relative plausibility measure I_H on 𝒫. Theorem 4.7 implies that the functional on P^𝒫 corresponding to a likelihood-based decision criterion revealing ignorance attraction is simply V_{I_H} = inf_H. In particular, in the case of complete ignorance about which of the models in 𝒫 is the best representation of reality we have V_1 = inf, and therefore the likelihood-based decision criteria revealing ignorance attraction can be considered as generalizations of the minimin criterion. The counterpart of Theorem 4.4 for a decision criterion revealing ignorance attraction can be easily obtained: it states in particular that if the corresponding function f_0 on [0,1] is strictly increasing, then each relation ≽_rp is represented by the functional V_rp : l ↦ ∫ l d(f_0 ∘ rp̄), for a (unique) support-based, homogeneous integral on the class of all monotone measures taking values in


the range of f_0, but the integral would "respect distributional dominance" only if this property were defined in terms of the "distribution function" μ{l ≤ ·} instead of μ{l ≥ ·}. However, in general for all nondecreasing functions f_0 : [0,1] → [0,1] such that f_0(0) = 0 and f_0(1) = 1, and for all homogeneous integrals respecting distributional dominance on the class of all normalized, monotone measures taking values in the range of f_0, the functionals V_rp : l ↦ ∫ l d(f_0 ∘ rp̄) correspond to a likelihood-based decision criterion revealing ignorance attraction. In particular, if the range of f_0 is {0,1}, then the choice of the integral has no influence: V_rp = ess_{f_0∘rp̄} sup. If f_0 = I_{[1−β,1]} for some β ∈ (0,1), then the functionals V_rp = ess_{I_{[1−β,1]}∘rp̄} sup = inf_{{rp↓ > β}} correspond to what can be called the likelihood-based region minimin criterion (in analogy with the LRM_β criterion): it consists in reducing 𝒫 to the likelihood-based confidence region for P with cutoff point β, before applying the minimin criterion. If f_0 = I_{(0,1]}, then the functionals V_rp = ess_{I_{(0,1]}∘rp̄} sup = lim_{β↑1} inf_{{rp↓ > β}} correspond to the ignorance attracted analogue of the MLD criterion.

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and let lik be the likelihood function on 𝒫 induced by the observed data (while before observing the data we had no information for discrimination between the models in 𝒫). If a maximum likelihood estimate P_ML of P exists, then each decision d ∈ 𝒟 that is correct for P_ML is optimal according to all likelihood-based decision criteria revealing ignorance attraction, since (lik ∘ l_d⁻¹)↓(0) = 1 implies V_lik(l_d) = 0. Moreover, if d′ ∈ 𝒟 is not correct for P_ML, and there is a topology on 𝒫 such that l_{d′} is continuous, and LR(𝒫 ∖ N) < 1 for all neighborhoods N of P_ML, then V_lik(l_{d′}) > 0 for all likelihood-based decision criteria such that f_0 is positive on (0,1]. That is, the decision criteria revealing ignorance attraction and such that f_0 is positive on (0,1] are almost equivalent to the MLD criterion, at least when a maximum likelihood estimate P_ML of P exists and there is a correct decision for P_ML. So the likelihood-based decision criteria revealing ignorance attraction are not as useless for statistical decision problems as the minimin criterion (of which they are generalizations), but since f_1 = 0, they do not lead to the usual likelihood-based inference methods (except for the maximum likelihood method, if f_0 is positive on (0,1]).

Let rp be a nondegenerate relative plausibility measure on 𝒫, and let l : 𝒫 → P be a function. It can be easily proved that if V′ is the functional on P^𝒫 corresponding to a likelihood-based decision criterion revealing ignorance aversion, and V″ is the functional on P^𝒫 corresponding to a decision criterion revealing ignorance attraction, then the following inequalities hold:

inf_{{rp↓>0}} l ≤ V″(l) ≤ lim_{β↑1} inf_{{rp↓>β}} l ≤ lim_{β↑1} sup_{{rp↓>β}} l ≤ V′(l) ≤ sup_{{rp↓>0}} l.   (4.13)

That is, a likelihood-based decision criterion revealing ignorance aversion can be interpreted as comparing the decisions d ∈ 𝒟 by means of an upper evaluation V′(l_d) of l_d, while a decision criterion revealing ignorance attraction compares them by means of a lower evaluation V″(l_d) of l_d. The sixth value in the inequalities (4.13) is the upper evaluation of l considered by the essential minimax criterion, and analogously the ignorance attracted decision criterion using the first value in (4.13) as the lower evaluation of l can be called the essential minimin criterion. The likelihood-based decision criteria using the fourth and third values in (4.13) as evaluations of l are the MLD criterion and its ignorance attracted analogue, respectively.

If δ : [0,1] → [0,1] is a nondecreasing function such that δ(0) = 0, then the dual of δ is the function δ̄ on [0,1] defined by

δ̄(x) = δ(1) − δ(1 − x)   for all x ∈ [0,1].

Note that δ̄ is nondecreasing, with δ̄(0) = 0 and δ̄(1) = δ(1); the use of the term "dual" is justified by the equality (δ̄)‾ = δ and by the property that if μ is a normalized, monotone measure, then (δ ∘ μ)‾ = δ̄ ∘ μ̄. If there are two

nondecreasing functions f_1, f_0 : [0,1] → [0,1] such that f_1(0) = f_0(0) = 0 and f_1(1) = f_0(1) = 1, and a homogeneous integral respecting distributional dominance on the class of all normalized, monotone measures, such that V′ and V″ can be written as integrals with respect to f_1 ∘ rp and f_0 ∘ rp̄, respectively, then the functionals V′ and V″ can be considered as giving conjugate upper and lower evaluations if and only if f_1 ∘ rp and f_0 ∘ rp̄ are dual to each other (and this holds for all nondegenerate relative plausibility measures rp if and only if f_1 and f_0 are dual to each other). In this case, the inequalities (4.13) can be rewritten as follows:

∫ l d(I_{{1}} ∘ rp̄) ≤ ∫ l d(f_0 ∘ rp̄) ≤ ∫ l d(I_{(0,1]} ∘ rp̄) ≤ ∫ l d(I_{{1}} ∘ rp) ≤ ∫ l d(f_1 ∘ rp) ≤ ∫ l d(I_{(0,1]} ∘ rp),   (4.14)


because the functions I_{(0,1]} and I_{{1}} on [0,1] are dual to each other. Thus in particular the essential minimax criterion and its ignorance attracted analogue (the essential minimin criterion) correspond to functionals giving conjugate upper and lower evaluations, and the same holds for the MLD criterion and its ignorance attracted analogue. The interpretation of the fifth and second values in (4.14) as conjugate upper and lower evaluations of l is particularly clear when f_1 is left-continuous and the integral is regular and symmetric on the class of all monotone measures: in this case, Theorem 2.18 can be applied, and in particular, if f_1|_{(0,1)} has range in (0,1), then the inequalities (4.14) can be rewritten as follows, using the notation of Theorem 2.18:

lim_{x↑1} inf_{{f_1∘rp↓ > 1−x}} l ≤ F(I_{[0,1]} inf_{{f_1∘rp↓ > 1−(·)}} l) ≤ lim_{x↓0} inf_{{f_1∘rp↓ > 1−x}} l ≤

≤ lim_{x↑1} sup_{{f_1∘rp↓ > x}} l ≤ F(I_{[0,1]} sup_{{f_1∘rp↓ > (·)}} l) ≤ lim_{x↓0} sup_{{f_1∘rp↓ > x}} l.
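The involutive character of the duality δ̄(x) = δ(1) − δ(1 − x) introduced above can be checked numerically; the sketch below (with illustrative names and an arbitrary sample function) verifies on a few points that the dual of the dual recovers δ for a nondecreasing δ with δ(0) = 0.

```python
# Numerical check that the duality delta_bar(x) = delta(1) - delta(1 - x)
# is an involution on nondecreasing functions with delta(0) = 0.
# The sample function and names are illustrative.

def dual(delta):
    return lambda x: delta(1.0) - delta(1.0 - x)

delta = lambda x: x ** 2          # nondecreasing, delta(0) = 0
ddual = dual(dual(delta))         # dual of the dual

# The two functions agree at the sampled points.
print(all(abs(ddual(x) - delta(x)) < 1e-12 for x in [0.0, 0.25, 0.5, 1.0]))
```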

The functionals V′ and V″ corresponding to two likelihood-based decision criteria revealing respectively ignorance aversion and attraction can thus be interpreted as giving respectively upper and lower evaluations: these can be combined to obtain intermediate evaluations, by means of an averaging function a : H(id_P) → P, where H(id_P) = {(x,y) ∈ P² : x ≥ y} is the hypograph of id_P. Using Theorem 4.2, it can be easily proved that the resulting functionals V_rp = a(V′, V″) correspond to a likelihood-based decision criterion if and only if a(1,1) = 1, the function a is positively homogeneous (that is, a(cx, cy) = c a(x,y) for all (x,y) ∈ H(id_P) and all c ∈ P), and it is nondecreasing in both arguments. In the case of complete ignorance about which of the models in 𝒫 is the best representation of reality, the resulting decision criterion corresponds to the functional V_1 = a(sup, inf) on P^𝒫; therefore the resulting decision criterion reveals ignorance aversion or attraction if and only if a(x,y) = x or a(x,y) = y for all (x,y) ∈ H(id_P), respectively, while in all other cases it reveals neither ignorance aversion nor attraction. In particular, if λ ∈ (0,1), and a_λ is the function on H(id_P) defined by a_λ(x,y) = λ y + (1 − λ) x (for all (x,y) ∈ H(id_P)), then the likelihood-based decision criteria obtained by means of the averaging function a_λ can be considered as generalizations of the Hurwicz criterion with optimism parameter λ.
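A finite-set sketch of the averaging construction a_λ(x, y) = λy + (1 − λ)x follows; here the upper and lower evaluations are taken to be the essential minimax and minimin values over the support (one admissible choice among those discussed above), and all names and numbers are illustrative.

```python
# Hurwicz-style combination of an upper (ignorance-averse) and a lower
# (ignorance-attracted) evaluation of a loss function on a finite model set.
# Names and data are illustrative.

def hurwicz_value(rp, loss_d, lam):
    support = [p for p in rp if rp[p] > 0]
    upper = max(loss_d[p] for p in support)   # essential minimax evaluation
    lower = min(loss_d[p] for p in support)   # essential minimin evaluation
    return lam * lower + (1.0 - lam) * upper  # a_lambda(upper, lower)

rp = {"P1": 1.0, "P2": 0.3, "P3": 0.0}       # P3 is outside the support
loss_d = {"P1": 2.0, "P2": 6.0, "P3": 100.0}

# With lam = 0.5 the evaluation is midway between 2.0 and 6.0, i.e. 4.0;
# lam = 0 recovers the pure upper (minimax) evaluation 6.0.
print(hurwicz_value(rp, loss_d, lam=0.5))
print(hurwicz_value(rp, loss_d, lam=0.0))
```

Note that the model with zero plausibility is ignored entirely, reflecting the support-based character of the evaluations.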

Consider a regular integral on the class of all monotone measures, and let f_1, f_0 : [0,1] → [0,1] be two nondecreasing functions such that f_1(0) = f_0(0) = 0 and f_1(1) + f_0(1) = 1. A likelihood-based decision criterion with corresponding functions f_1, f_0 can be defined for example through the functionals

V_rp : l ↦ ∫ l d(f_1 ∘ rp) + ∫ l d(f_0 ∘ rp̄),

or through the functionals

V_rp : l ↦ ∫ l d(f_1 ∘ rp + f_0 ∘ rp̄).

If f_1(1) = 1 or f_1(1) = 0, then the two definitions correspond, and the decision criterion reveals respectively ignorance aversion or attraction; but if f_1(1) ∈ (0,1), then in general the two definitions are different, although both decision criteria generalize the Hurwicz criterion with optimism parameter λ = f_0(1). In this case, the first definition corresponds to the application of the averaging function a_λ to the integrals with respect to (1/(1−λ)) f_1 ∘ rp and (1/λ) f_0 ∘ rp̄, respectively; while in general the second definition cannot be obtained as a(V′, V″) for an averaging function a on H(id_P). However, for the Choquet integral the two definitions correspond (thanks to the additivity with respect to measures), and in particular, if (1/(1−λ)) f_1 and (1/λ) f_0 are dual to each other, and the function δ = (1/(1−λ)) f_1 = ((1/λ) f_0)‾ is left-continuous, then using Theorem 2.18 we can clarify the connection with the Hurwicz criterion by expressing the functionals V_rp as follows by means of the Lebesgue integral (with respect to the Lebesgue measure on [0,1]):

V_rp : l ↦ ∫₀¹ [λ inf_{{δ∘rp↓ > x}} l + (1 − λ) sup_{{δ∘rp↓ > x}} l] dx.

Since the Choquet integral is translation equivariant, it can be used to evaluate the decisions also when the decision problem is described by a bounded utility function U : 𝒫 × 𝒟 → ℝ. In particular, Theorem 2.28 implies that if we apply the likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫_C l d(f_1 ∘ rp + f_0 ∘ rp̄) to the decision problem described by the loss function c − U, where c is an upper bound of U (we have considered this kind of loss function at the end of Subsection 3.1.1), then we obtain the following decision criterion:

maximize ∫_C u_d d(f̄_0 ∘ rp + f̄_1 ∘ rp̄),

where the integral is the extension of the Choquet integral by means of equality (2.10), and for each d ∈ 𝒟 the function u_d on 𝒫 is defined by u_d(P) = U(P, d) (for all P ∈ 𝒫). In particular, the ignorance averse decision criterion corresponding to the functionals V_rp : l ↦ ∫_C l d(f_1 ∘ rp) compares the decisions d ∈ 𝒟 by means of the lower evaluation ∫_C u_d d(f̄_1 ∘ rp̄) of u_d; while the ignorance attracted decision criterion corresponding to the functionals V_rp : l ↦ ∫_C l d(f_0 ∘ rp̄) compares them by means of the upper evaluation ∫_C u_d d(f̄_0 ∘ rp) of u_d.

By contrast, as noted at the end of Subsection 3.1.1, the Shilkret in¬

tegral cannot be used directly to evaluate the decisions when the decision

problem is described by a utility function U : V x T> —> R, because the

utility functions U and a U+ß are considered equivalent (for all a G P and

all ß G ffi). But if the positive values of C7 are interpreted as gains, and the

negative ones as losses, then the functions U and a U + ß are equivalent

only if ß = 0 (because of the peculiar meaning of the value 0), and the

Shilkret integral can be used to evaluate separately gains and losses. If we

evaluate gains and losses by means of the Shilkret integral with respect

to /i o rg, then the uncertain gain resulting from a decision d G T> can

be indifferently defined as max{«(2,0}, as Ud\{Ud>o}j or as ud\{ud>0} (andanalogously for the uncertain loss), since the Shilkret integral is support-

based:

/s/*S

max{«d,0}d(/1or2)= / ud\{Ud>0} d(/i o rjo\{ud>0}).

By contrast, the use of $\underline{rp}$ in the integrals would pose some problems, because in general if $\mathcal{H}$ is a subset of $\mathcal{P}$, then $\underline{rp|_{\mathcal{H}}}$ is different from $\underline{rp}|_{\mathcal{H}}$, since the latter depends also on $\underline{rp}(\mathcal{P} \setminus \mathcal{H})$. The decisions $d \in \mathcal{D}$ can thus be compared on the basis of the two evaluations $\int^S \max\{u_d, 0\} \, \mathrm{d}(f_1 \circ rp)$ and $\int^S \max\{-u_d, 0\} \, \mathrm{d}(f_1 \circ rp)$; for instance, maximizing their difference corresponds to maximizing $\int u_d \, \mathrm{d}(f_1 \circ rp)$, where the integral is the extension of the Shilkret integral by means of equality (2.11). It is important to note that both evaluations of gains and losses are "upper evaluations", and therefore the resulting decision criteria are ignorance attracted as regards gains, and ignorance averse as regards losses. When $f_1$ is continuous, the comparison of decisions on the basis of separate evaluations of gains and losses by means of the Shilkret integral seems to be in agreement with the theory of Shackle (1949).
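The separate treatment of gains and losses can be made concrete for a finite set of models. The sketch below uses our own helper names and a plain Shilkret integral (rather than the extension by equality (2.11)), purely as an illustration.

```python
def shilkret(f, mu):
    # Shilkret integral of a nonnegative f (dict): sup over x > 0 of
    # x * mu({f >= x}); on a finite set the sup is attained at a value of f.
    return max((v * mu(frozenset(q for q, fv in f.items() if fv >= v))
                for v in set(f.values()) if v > 0), default=0.0)

def maxitive(rp):
    return lambda A: max((rp[q] for q in A), default=0.0)

rp = {'P1': 0.5, 'P2': 1.0}     # relative plausibility of the models
u = {'P1': 3.0, 'P2': -2.0}     # utility of a decision: gain under P1, loss under P2
mu = maxitive(rp)
gain = shilkret({q: max(v, 0.0) for q, v in u.items()}, mu)   # 3 * 0.5 = 1.5
loss = shilkret({q: max(-v, 0.0) for q, v in u.items()}, mu)  # 2 * 1.0 = 2.0
net = gain - loss
```

Both component evaluations are "upper" ones: the gain is weighted by the plausibility of the most favorable models, the loss by that of the least favorable ones.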


4.1.2 Sure-Thing Principle

A likelihood-based decision criterion satisfies the sure-thing principle if for all disjoint sets $A, B$, and all nondegenerate relative plausibility measures $rp$ on $Q = A \cup B$ such that $rp(A) > 0$ and $rp(B) > 0$, the following condition holds:

$$l|_A \preceq_{rp|_A} l'|_A \;\text{ and }\; l|_B \preceq_{rp|_B} l'|_B \;\Longrightarrow\; l \preceq_{rp} l' \qquad \text{for all } l, l' \in \mathbb{P}^{Q}.$$

Note that the conditions on $rp(A)$ and $rp(B)$ are necessary only because $\preceq_{rp'}$ is undefined if $rp'$ is a relative plausibility measure with constant value 0, but when for instance $rp(B) = 0$, for all likelihood-based decision criteria we have $l|_A \preceq_{rp|_A} l'|_A$ if and only if $l \preceq_{rp} l'$. The above sure-thing principle corresponds to the first part of the "sure-thing principle" of Savage (1954), while the second part states that if one of the two preferences on the left-hand side of the implication is strict, then so is the preference on the right-hand side. This would imply that if $l \leq l'$, and $rp\{l < l'\} > 0$, then $l$ must be preferred to $l'$; therefore, no likelihood-based decision criterion can satisfy the second part of the "sure-thing principle" of Savage: the next example shows that this is true also if the dominated decisions are discarded.

Example 4.8. Let $Q = \{a, b, c\}$ be a set with three elements, and consider the relative plausibility measure $1$ on $Q$. The two functions $I_{\{a\}}$ and $I_{\{b,c\}}$ on $Q$ are considered equivalent by all likelihood-based decision criteria, since $V_{1}(I_{\{a\}}) = f_1(1) = V_{1}(I_{\{b,c\}})$. But $I_{\{a\}}|_{\{c\}} = 0$ is certainly preferred to $I_{\{b,c\}}|_{\{c\}} = 1$, while the equivalence of $I_{\{a\}}|_{\{a,b\}}$ and $I_{\{b,c\}}|_{\{a,b\}} = I_{\{b\}}|_{\{a,b\}}$ is implied by symmetry. ◊

However, the second part of the "sure-thing principle" of Savage does not really deserve its name, since we are not "sure" of the strict preference: we do not assume that the strict preference holds in both alternative cases $A$ and $B$. If we were assuming this, then we would obtain the "strict version" of the above sure-thing principle, in which each one of the three preferences is strict: it can be easily proved that if a likelihood-based decision criterion satisfies the sure-thing principle, then it satisfies also its strict version. Moreover, since relative plausibility measures are (equivalence classes of) finitely maxitive measures, the content of the sure-thing principle is left unchanged when the assumption that $A$ and $B$ are disjoint is dropped (and this is valid also for the strict version).

Consider a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$, and let $rp$ be a nondegenerate relative plausibility measure on $\mathcal{P}$ describing the uncertain knowledge about the models in $\mathcal{P}$. The sure-thing principle states that for all coverings $\mathcal{H}, \mathcal{H}'$ of $\mathcal{P}$ (that is, $\mathcal{H}, \mathcal{H}' \subseteq \mathcal{P}$ and $\mathcal{P} = \mathcal{H} \cup \mathcal{H}'$) such that $rp(\mathcal{H}) > 0$ and $rp(\mathcal{H}') > 0$, if the preference between two decisions in $\mathcal{D}$ is the same when we restrict attention to the models in $\mathcal{H}$ or to the models in $\mathcal{H}'$, then this preference holds also when we consider all the models in $\mathcal{P}$. In particular, if $d \in \mathcal{D}$ is optimal, according to a likelihood-based decision criterion satisfying the sure-thing principle, for both the decision problems described by $L|_{\mathcal{H} \times \mathcal{D}}$ and by $L|_{\mathcal{H}' \times \mathcal{D}}$, then $d$ is optimal also for the decision problem described by $L$. Moreover, if for both the restricted decision problems $d$ is the unique optimal decision in $\mathcal{D}$, then this holds also for the unrestricted decision problem; but this conclusion is in general not valid when $d$ is the unique optimal decision only for one of the two restricted decision problems (since the second part of the "sure-thing principle" of Savage does not hold).

The "if" part of the condition of 0-independence is a special case of the sure-thing principle, while the "only if" part is a special case of the second part of Savage's "sure-thing principle". Theorem 4.6 implies that a likelihood-based decision criterion can be 0-independent only if the corresponding function $f_1$ is positive on $(0, 1]$: the next theorem states that in this case the 0-independence is implied by the sure-thing principle.

Theorem 4.9. A likelihood-based decision criterion satisfying the sure-thing principle is 0-independent if and only if the corresponding function $f_1 : [0, 1] \to [0, 1]$ is positive on $(0, 1]$.

Proof. The "only if" part follows from Theorem 4.6, and to prove the "if" part it suffices to show that if the function $f_1$ is positive on $(0, 1]$, and $A$ is a subset of $Q$ such that $l|_{Q \setminus A} = 0$, and both $rp(A)$ and $rp(Q \setminus A)$ are positive, then $V_{rp}(l) = V_{rp}(I_A) \, V_{rp|_A}(l|_A)$, because $V_{rp}(I_A) \geq f_1[rp(A)] > 0$. The above equality follows from the equivalence of $l$ and $V_{rp|_A}(l|_A) \, I_A$, which is implied by the sure-thing principle and the equivalence of $l|_A$ and $V_{rp|_A}(l|_A)$. □

The sure-thing principle does not follow from the 0-independence: for instance, the likelihood-based decision criterion such that each relation $\preceq_{rp}$ is represented by the functional $l \mapsto \int^C l \, \mathrm{d}rp$ (that is, the "MPL* criterion" of Cattaneo, 2005) is 0-independent (Theorem 4.6), but the next example shows that it does not satisfy the sure-thing principle. In fact, the subsequent theorem implies that a likelihood-based decision criterion such that each relation $\preceq_{rp}$ is represented by the functional $l \mapsto \int l \, \mathrm{d}rp$ can satisfy the sure-thing principle only if the integral is maxitive.

Example 4.10. Consider the likelihood-based decision criterion corresponding to the functionals $V_{rp} : l \mapsto \int^C l \, \mathrm{d}rp$, and let $rp$ be the (nondegenerate) relative plausibility measure on $Q = \{0, 1, 2\}$ defined by $rp(0) = rp(1) = 1$ and $rp(2) = \frac{1}{2}$. We have $V_{rp|_{\{0,2\}}}(\mathrm{id}_Q|_{\{0,2\}}) = V_{rp|_{\{1\}}}(\mathrm{id}_Q|_{\{1\}}) = 1$, but $V_{rp}(\mathrm{id}_Q) = \frac{3}{2}$; that is, $\mathrm{id}_Q|_{\{0,2\}} \preceq_{rp|_{\{0,2\}}} 1|_{\{0,2\}}$ and $\mathrm{id}_Q|_{\{1\}} \preceq_{rp|_{\{1\}}} 1|_{\{1\}}$, but not $\mathrm{id}_Q \preceq_{rp} 1$. ◊
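Example 4.10 can be checked numerically. The sketch below (helper names are ours) implements the Choquet integral for finite sets, renormalizes restricted plausibilities so that their maximum is 1, and reproduces both the restricted equivalences and the violation on the whole of $Q$.

```python
def choquet(l, mu):
    # Choquet integral on a finite set: sum over increasing distinct values v
    # of (v - previous_v) * mu({l >= v}).
    total, prev = 0.0, 0.0
    for v in sorted(v for v in set(l.values()) if v > 0):
        total += (v - prev) * mu(frozenset(q for q, lv in l.items() if lv >= v))
        prev = v
    return total

def V(l, rp):
    # Choquet evaluation w.r.t. the maxitive measure of rp, renormalized to max 1
    m = max(rp.values())
    return choquet(l, lambda A: max((rp[q] / m for q in A), default=0.0))

rp = {0: 1.0, 1: 1.0, 2: 0.5}
idQ = {0: 0.0, 1: 1.0, 2: 2.0}
one = {q: 1.0 for q in rp}

# id_Q and the constant 1 are equivalent on {0,2} and on {1} ...
assert V({0: 0.0, 2: 2.0}, {0: 1.0, 2: 0.5}) == 1.0
assert V({1: 1.0}, {1: 1.0}) == 1.0
# ... but not on the whole Q: the sure-thing principle fails
assert V(idQ, rp) == 1.5 and V(one, rp) == 1.0
```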

Theorem 4.11. Consider a likelihood-based decision criterion satisfying the sure-thing principle. The corresponding functions $f_1, f_0 : [0,1] \to [0,1]$ fall into one of the following five alternative cases:

$f_0 = 0$ and $f_1 = I_{(0,1]}$,

$f_1 = 0$ and $f_0 = I_{\{1\}}$,

$f_0 = 0$ and $f_1 = I_{\{1\}}$,

$f_1 = 0$ and $f_0 = I_{(0,1]}$,

$f_0 = 0$ and $f_1 : x \mapsto x^a$ for an $a \in \mathbb{P}$.

If $f_0 = 0$, then the decision criterion reveals ignorance aversion, and for each nondegenerate relative plausibility measure $rp$ (on some set $Q_{rp}$) the corresponding functional $V_{rp}$ on $\mathbb{P}^{Q_{rp}}$ fulfills the following condition:

$$V_{rp}(\max\{l, l'\}) = \max\{V_{rp}(l), V_{rp}(l')\} \qquad \text{for all } l, l' \in \mathbb{P}^{Q_{rp}}.$$

Proof. Consider first the case with $f_1 = 0$; let $rp$ be the (nondegenerate) relative plausibility measure on $Q = \{0, 1, 2\}$ defined by $rp\{0\} = 1$ and $rp\{1\} = rp\{2\} = 1 - x$ for an $x \in [0, 1]$, and let $l = I_{\{0\}}$ and $l' = f_0(x) \, I_{\{0,1\}}$ be two functions on $Q$. We have

$$f_0(x) = V_{rp|_{\{0,1\}}}(l|_{\{0,1\}}) = V_{rp|_{\{0,1\}}}(l'|_{\{0,1\}}) = V_{rp}(l) = V_{rp}(l') = [f_0(x)]^2,$$

where the fourth equality is implied by the sure-thing principle, since $l|_{\{2\}} = 0 = l'|_{\{2\}}$. Hence the range of $f_0$ is $\{0, 1\}$; to obtain that $f_0 = I_{\{1\}}$ or $f_0 = I_{(0,1]}$, it suffices to show that $\sup\{f_0 = 0\} \in (0, 1)$ leads to a contradiction: assume thus that it is true and define $y = (1 - \sup\{f_0 = 0\})^{\frac{1}{2}}$. Let $rp'$ be the (nondegenerate) relative plausibility measure on $Q = \{0, 1, 2\}$ defined by $\underline{rp'} : k \mapsto y^k$, and consider the two functions $I_{\{0\}}$ and $I_{\{0,1\}}$ on $Q$. We have

$$V_{rp'}(I_{\{0\}}) = f_0(1 - y) = 0 < 1 = f_0(1 - y^2) = V_{rp'}(I_{\{0,1\}});$$

this contradicts the sure-thing principle, since

$$V_{rp'|_{\{1,2\}}}(I_{\{0,1\}}|_{\{1,2\}}) = f_0(1 - y) = 0 = V_{rp'|_{\{1,2\}}}(I_{\{0\}}|_{\{1,2\}})$$

and $I_{\{0,1\}}|_{\{0\}} = 1 = I_{\{0\}}|_{\{0\}}$.

Consider now the case with $f_1(1) > 0$; we first prove that the decision criterion reveals ignorance aversion (and thus in particular $f_0 = 0$). By Theorem 4.3 it suffices to show that the corresponding functional $S$ on $\mathcal{S}$ is 0-independent; that is, it suffices to show that for all $\varphi \in \mathcal{S}$ we have equality (4.11) with $c = 1$ and $f_1(1) = 1$. Define $Q$, $A$, $rp'$, $l$, and $l'$ as in the proof of Theorem 4.6, but with $c = 1$; since $l|_{Q \setminus A} = 0 = l'|_{Q \setminus A}$ and $V_{rp'|_A}(l|_A) = S(\varphi) = V_{rp'|_A}(l'|_A)$, the sure-thing principle implies

$$S(I_{\{0\}} + \varphi \, I_{(0,\infty]}) = V_{rp'}(l) = V_{rp'}(l') = V_{rp'|_A}(l'|_A) = S(\varphi).$$

We now prove that $\sup\{f_1 = 0\} \in \{0, 1\}$: analogously to the case with $f_1 = 0$, assume that $\sup\{f_1 = 0\} \in (0, 1)$ and define $y = (\sup\{f_1 = 0\})^{\frac{1}{2}}$. Let $rp$ be the (nondegenerate) relative plausibility measure on $Q = \{0, 1, 2\}$ defined by $\underline{rp} : k \mapsto y^k$, and consider the two functions $l = I_{\{2\}}$ and $l' = f_1(y) \, I_{\{1,2\}}$ on $Q$. We have $V_{rp}(l) = f_1(y^2) = 0 < [f_1(y)]^2 = V_{rp}(l')$: this contradicts the sure-thing principle, since $l|_{\{0\}} = 0 = l'|_{\{0\}}$ and $V_{rp|_{\{1,2\}}}(l|_{\{1,2\}}) = f_1(y) = V_{rp|_{\{1,2\}}}(l'|_{\{1,2\}})$. Therefore, either we have $\sup\{f_1 = 0\} = 1$ and thus $f_1 = I_{\{1\}}$, or we have $\sup\{f_1 = 0\} = 0$ and thus $f_1 = I_{(0,1]}$ or $f_1 : x \mapsto x^a$ for an $a \in \mathbb{P}$ (Theorems 4.9 and 4.6). Since the decision criterion reveals ignorance aversion, Theorem 4.3 implies

$$V_{rp}(\max\{l, l'\}) = S(rp\{\max\{l, l'\} > \cdot\}) = S(\max\{rp\{l > \cdot\}, rp\{l' > \cdot\}\}) \geq \max\{V_{rp}(l), V_{rp}(l')\},$$

where the inequality follows from the monotonicity of $S$. To prove the last statement of the theorem, it suffices thus to show that this inequality is an equality; that is, it suffices to show that for all nonincreasing $\varphi, \psi \in \mathcal{S}$ we have $S(\max\{\varphi, \psi\}) \leq \max\{S(\varphi), S(\psi)\}$. Define $A = H(\varphi) \times \{0\}$ and $B = H(\psi) \times \{1\}$, where $H(\varphi)$ and $H(\psi)$ are the hypographs of $\varphi$ and $\psi$, respectively, and thus $A \cup B \subseteq \mathbb{P} \times [0,1] \times \{0,1\}$; let $l$ and $rp$ be respectively the function and the (nondegenerate) relative plausibility measure on $A \cup B$ defined by $l(x,y,z) = x$ and $\underline{rp}(x,y,z) = y$. Since $rp\{l|_A > x\} = \varphi(x)$ and $rp(A) = 1$, we have $V_{rp|_A}(l|_A) = S(\varphi)$; and analogously $V_{rp|_B}(l|_B) = S(\psi)$. The sure-thing principle implies thus that $V_{rp}(l) \leq \max\{S(\varphi), S(\psi)\}$, but $V_{rp}(l) = S(\max\{\varphi, \psi\})$, because $rp\{l > x\} = \max\{\varphi(x), \psi(x)\}$. □
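The maxitivity condition at the end of Theorem 4.11 can be checked directly on finite examples for the functional $V_{rp}(l) = \sup l \, \underline{rp}$ (the case $f_0 = 0$, $f_1 : x \mapsto x$). The sketch below uses our own names, purely as an illustration.

```python
def mpl(l, rp):
    # Shilkret-type evaluation sup_q l(q) * rp(q) of a loss l under plausibility rp
    return max(l[q] * rp[q] for q in l)

rp = {'P1': 1.0, 'P2': 0.7, 'P3': 0.2}
l1 = {'P1': 1.0, 'P2': 3.0, 'P3': 8.0}
l2 = {'P1': 2.0, 'P2': 1.0, 'P3': 9.0}
lmax = {q: max(l1[q], l2[q]) for q in rp}

# V_rp(max{l, l'}) = max{V_rp(l), V_rp(l')}
assert mpl(lmax, rp) == max(mpl(l1, rp), mpl(l2, rp))
```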

Theorem 4.11 implies in particular that neither the LRM$_\beta$ criterion (for any $\beta \in (0,1)$), nor any likelihood-based decision criterion with $f_1(1) \in (0,1)$ (such as one generalizing the Hurwicz criterion) can satisfy the sure-thing principle. If a likelihood-based decision criterion satisfies it, and $f_0 = 0$, then the decision criterion reveals ignorance aversion, and thus in particular it generalizes the minimax criterion; but the next example shows that a likelihood-based decision criterion satisfying the sure-thing principle does not necessarily reveal ignorance attraction when $f_1 = 0$.

Example 4.12. Let $S$ be the functional on $\mathcal{S}$ defined by

$$S(\varphi) = \begin{cases} \sup\{\varphi > 0\} & \text{if } \varphi(0) = 0, \\ \dots & \text{otherwise,} \end{cases} \qquad \text{for all } \varphi \in \mathcal{S}.$$

It can be easily proved that $S$ corresponds to a likelihood-based decision criterion satisfying the sure-thing principle with $f_1 = 0$ and $f_0 = I_{\{1\}}$ (the second case of Theorem 4.11), but revealing neither ignorance aversion nor attraction. ◊

Five other examples of likelihood-based decision criteria satisfying the sure-thing principle and falling in the five alternative cases listed in Theorem 4.11 are the essential minimax criterion and its ignorance attracted analogue (the essential minimin criterion), the MLD criterion and its ignorance attracted analogue, and the MPL criterion, respectively. Let $S$ be the functional on $\mathcal{S}$ corresponding to a likelihood-based decision criterion, and consider the functional $S'$ on $\mathcal{S}$ defined by

$$S'(\varphi) = \begin{cases} S(\varphi) & \text{if } \sup\{\varphi > 0\} < \infty, \\ \infty & \text{if } \sup\{\varphi > 0\} = \infty, \end{cases} \qquad \text{for all } \varphi \in \mathcal{S}.$$

It can be easily proved that $S'$ corresponds to a likelihood-based decision criterion with the same functions $f_1, f_0$ as the original one (that is, the one corresponding to $S$). If the original decision criterion satisfies the sure-thing principle or reveals ignorance aversion, then so does the one corresponding to $S'$; but this one cannot reveal ignorance attraction, even when the original one does. The functional on $\mathcal{S}$ corresponding to the essential minimax criterion is $S : \varphi \mapsto \sup\{\varphi > 0\}$, and in this case $S' = S$; in fact, using Theorems 4.9 and 4.6 we obtain that the essential minimax criterion is the only likelihood-based decision criterion satisfying the sure-thing principle with $f_0 = 0$ and $f_1 = I_{(0,1]}$. But if $S$ is the functional on $\mathcal{S}$ corresponding to one of the other four likelihood-based decision criteria listed above, then $S'$ is different from $S$, and thus only the first one of the five alternative cases of Theorem 4.11 uniquely determines a decision criterion satisfying the sure-thing principle.

In particular, Theorems 4.9, 4.6, and 4.11 imply that if a likelihood-based decision criterion satisfies the sure-thing principle with $f_0 = 0$ and $f_1 : x \mapsto x^a$ for an $a \in \mathbb{P}$, then there is a unique support-based, regular integral on the class of all finite, monotone measures, such that each relation $\preceq_{rp}$ is represented by the functional $l \mapsto \int l \, \mathrm{d}(rp^a)$, and the integral is maxitive (on the class of all finitely maxitive measures). Since the integral is maxitive, homogeneous, and transformation invariant, we have $\int f \, \mathrm{d}\mu = \int^S f \, \mathrm{d}\mu$ when $f$ is essentially bounded with respect to $\mu$ (that is, $\mathrm{ess}_\mu \sup f < \infty$). But to obtain the Shilkret integral also when $\mathrm{ess}_\mu \sup f = \infty$, we need something more: for instance countable maxitivity, or the following continuity property, which parallels condition (4.12):

$$\int f \, \mathrm{d}\mu = \lim_{c \uparrow \infty} \int \min\{f, c\} \, \mathrm{d}\mu,$$

for all sets $Q$, all finite, monotone measures $\mu$ on $Q$, and all functions $f : Q \to \mathbb{P}$. This continuity property corresponds to the following condition for preference relations:

$$\min\{l, c\} \preceq_{rp} l' \;\text{ for all } c \in \mathbb{P} \;\Longrightarrow\; l \preceq_{rp} l' \qquad \text{for all } l, l' \in \mathbb{P}^{Q_{rp}}. \qquad (4.15)$$

The next theorem implies that this condition suffices to determine a unique likelihood-based decision criterion satisfying the sure-thing principle in each one of the three cases with $f_0 = 0$ listed in Theorem 4.11; but this is not true in the two cases with $f_1 = 0$: for instance, the condition is fulfilled by both the essential minimin criterion and the decision criterion of Example 4.12.


Theorem 4.13. Consider a likelihood-based decision criterion such that the corresponding function $f_1 : [0, 1] \to [0, 1]$ satisfies $f_1(1) > 0$, and for each nondegenerate relative plausibility measure $rp$ (on some set $Q_{rp}$) condition (4.15) is fulfilled. The decision criterion satisfies the sure-thing principle if and only if either it is the essential minimax criterion or the MLD criterion, or there is a positive real number $a$ such that each relation $\preceq_{rp}$ is represented by the functional $l \mapsto \int^S l \, \mathrm{d}(rp^a)$.

Proof. The "if" part is simple: the essential minimax criterion satisfies the sure-thing principle, because

$$V_{rp}(l) = \mathrm{ess}_{rp} \sup l = \max\{V_{rp|_A}(l|_A), V_{rp|_B}(l|_B)\};$$

the MLD criterion satisfies it because if $rp(A) = rp(B) = 1$, then

$$V_{rp}(l) = \mathrm{ess}_{I_{\{1\}} \circ rp} \sup l = \max\{V_{rp|_A}(l|_A), V_{rp|_B}(l|_B)\},$$

while if $rp(B) < 1$, then $V_{rp}(l) = V_{rp|_A}(l|_A)$. Finally, in the third case, using Theorem 2.6 we obtain

$$V_{rp}(l) = \sup l \, (\underline{rp})^a = \max\{[rp(A)]^a \, V_{rp|_A}(l|_A), \; [rp(B)]^a \, V_{rp|_B}(l|_B)\}.$$

For the "only if" part, Theorem 4.11 implies that $f_0 = 0$ and we have three alternative cases for $f_1$. If $f_1 = I_{(0,1]}$, then Theorems 4.9 and 4.6 imply that the decision criterion is the essential minimax criterion. For $f_1 = I_{\{1\}}$, we have to prove that if $b \in (\mathrm{ess}_{I_{\{1\}} \circ rp} \sup l, \infty)$, then $V_{rp}(l) \leq b$; in fact, in this case the decision criterion is the MLD criterion, because $V_{rp}(l) \geq \mathrm{ess}_{I_{\{1\}} \circ rp} \sup l$ follows from ignorance aversion (implied by Theorem 4.11). Thanks to condition (4.15), it suffices to show that $V_{rp}(\min\{l, c\}) \leq b$ for all $c \in (b, \infty)$: this follows from the last statement of Theorem 4.11, since $\min\{l, c\} \leq \max\{b, c \, I_{\{l > b\}}\}$ and $V_{rp}(I_{\{l > b\}}) = f_1(rp\{l > b\}) = 0$. Finally, if $f_1 : x \mapsto x^a$ for an $a \in \mathbb{P}$, then Theorems 4.9, 4.6, and 4.11 imply that there is a maxitive, regular integral on the class of all finite, completely maxitive measures, such that each relation $\preceq_{rp}$ is represented by the functional $V_{rp} : l \mapsto \int l \, \mathrm{d}(rp^a)$. We have to prove that $V_{rp}(l) = \int^S l \, \mathrm{d}(rp^a)$, and thanks to Theorem 2.15 and condition (4.15), it suffices to show that $\int \min\{l, c\} \, \mathrm{d}(rp^a) \leq \int^S l \, \mathrm{d}(rp^a)$ for all $c \in \mathbb{P}$. Now, for a bounded function $f$ with finite range, the value of $\int f \, \mathrm{d}(rp^a)$ is determined by maxitivity and homogeneity, since $f$ can be written as the maximum of a finite number of functions of the form $\alpha \, I_A$ with $\alpha \in \mathbb{P}$. Therefore $\int f \, \mathrm{d}(rp^a) = \int^S f \, \mathrm{d}(rp^a)$, but then $\int \min\{l, c\} \, \mathrm{d}(rp^a) = \int^S \min\{l, c\} \, \mathrm{d}(rp^a)$ follows for instance from Theorem 2.23. □
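The three families appearing in Theorem 4.13 have simple finite-case transcriptions. The sketch below is a rough illustration with our own function names; the plausibility $rp$ is assumed normalized so that its maximum is 1.

```python
def ess_minimax(l, rp):
    # essential minimax: supremum of the loss over the models with positive plausibility
    return max(l[q] for q in l if rp[q] > 0)

def mld(l, rp):
    # MLD: supremum of the loss over the maximum-plausibility models (rp = 1)
    return max(l[q] for q in l if rp[q] == 1.0)

def mpl(l, rp, a=1.0):
    # MPL family: Shilkret integral of l with respect to rp^a
    return max(l[q] * rp[q] ** a for q in l)

rp = {'P1': 1.0, 'P2': 0.5, 'P3': 0.0}
l = {'P1': 1.0, 'P2': 4.0, 'P3': 9.0}
```

On this toy problem the three functionals already disagree: the essential minimax evaluation is 4, the MLD evaluation is 1, and the MPL evaluation (with $a = 1$) is 2.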

Theorem 4.13 implies that the MPL criterion (possibly applied after having distorted the likelihood function by raising its values to the power $a \in \mathbb{P}$) is the only likelihood-based decision criterion that satisfies the sure-thing principle and the continuity condition (4.15), and that leads to the usual likelihood-based inference methods, when applied to some standard form of the corresponding decision problems. Moreover, the MPL criterion possesses also the following property. A likelihood-based decision criterion is said to be decomposable if the corresponding functionals $V_{rp}$ satisfy

$$V_{rp}(l) = V_{rp \circ t^{-1}}[V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}})] \qquad (4.16)$$

for all sets $Q$ and $T$, all nondegenerate relative plausibility measures $rp$ on $Q$, all functions $l : Q \to \mathbb{P}$, and all functions $t : Q \to T$ such that $rp\{t = \tau\} > 0$ for all $\tau \in T$; where $V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}})$ denotes the function $\tau \mapsto V_{rp|_{\{t = \tau\}}}(l|_{\{t = \tau\}})$ on $T$. Note that the condition on $rp\{t = \tau\}$ is necessary only because $V_{rp'}$ is undefined if $rp'$ is a relative plausibility measure with constant value 0; alternatively, we could drop this condition and assume simply that $V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}})$ denotes a function $l' : T \to \mathbb{P}$ such that $l'(\tau) = V_{rp|_{\{t = \tau\}}}(l|_{\{t = \tau\}})$ for all $\tau \in \{\underline{rp \circ t^{-1}} > 0\}$, since the values taken by $l'$ outside this set have no influence on $V_{rp \circ t^{-1}}(l')$.

The equality $V_{rp \circ t^{-1}}[V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}})] = V_{rp}[V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}}) \circ t]$ holds for all likelihood-based decision criteria; if a decision criterion is decomposable, then the function $V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}}) \circ t$ can be interpreted as the "conditional evaluation" of $l$ given $t$, in analogy with the conditional expectation $E(X \mid Y)$ of a random variable $X$ given a random object $Y$. The "conditional evaluation" corresponds to the conditional expectation in many respects; in particular, equality (4.16) corresponds to the property $E(X) = E[E(X \mid Y)]$. That is, a likelihood-based decision criterion is decomposable if it allows the introduction of a concept of "conditional evaluation" presenting many analogies with the concept of conditional expectation.
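For the functional $V_{rp}(l) = \sup l \, \underline{rp}$, equality (4.16) can be verified directly on a finite example. The sketch below uses our own names; conditional plausibilities are renormalized so that their maximum is 1.

```python
def mpl(l, rp):
    return max(l[q] * rp[q] for q in l)

def conditional(l, rp, t, tau):
    # evaluation of l under the renormalized restriction of rp to {t = tau}
    block = [q for q in rp if t[q] == tau]
    m = max(rp[q] for q in block)
    return max(l[q] * rp[q] / m for q in block)

rp = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.125}
l  = {1: 2.0, 2: 8.0, 3: 4.0, 4: 64.0}
t  = {1: 'A', 2: 'A', 3: 'B', 4: 'B'}

taus = set(t.values())
marginal = {tau: max(rp[q] for q in rp if t[q] == tau) for tau in taus}  # rp o t^-1
inner = {tau: conditional(l, rp, t, tau) for tau in taus}  # "conditional evaluation"

# the two-stage evaluation agrees with the direct one, as in E(X) = E[E(X|Y)]
assert mpl(l, rp) == mpl(inner, marginal)
```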

It can be easily proved that a decomposable decision criterion satisfies the sure-thing principle: it suffices to choose $T = \{0, 1\}$ and $t = I_A$. In fact, the property of decomposability can also be expressed as follows: for all sets $\mathcal{A}$ of disjoint sets, and all nondegenerate relative plausibility measures $rp$ on $Q = \bigcup \mathcal{A}$ such that $rp(A) > 0$ for all $A \in \mathcal{A}$,

$$l|_A \preceq_{rp|_A} l'|_A \;\text{ for all } A \in \mathcal{A} \;\Longrightarrow\; l \preceq_{rp} l' \qquad \text{for all } l, l' \in \mathbb{P}^{Q}.$$

The sure-thing principle corresponds to the case with finite $\mathcal{A}$, and it is thus equivalent to the finite decomposability (that is, decomposability with the additional condition that $T$ is finite). By exploiting the fact that there is a finite or countable subset $Q'$ of $Q$ such that $rp|_{Q'} \circ (l|_{Q'})^{-1} = rp \circ l^{-1}$, it can be proved that decomposability is equivalent to countable decomposability (for which $T$ and $\mathcal{A}$ are at most countable). As in the case with finite $\mathcal{A}$, also in the general case the assumption that the elements of $\mathcal{A}$ are disjoint can be dropped without consequences. But contrary to the case with finite $\mathcal{A}$, the above formulation of decomposability does not imply its strict version (in which both preferences are strict); in fact, no likelihood-based decision criterion can satisfy the strict version even for countable sets $\mathcal{A}$, since this would imply that if $l$ and $l'$ have countable range, and $l < l'$, then $l$ is preferred to $l'$, while it is possible that $rp \circ l^{-1} = rp \circ (l')^{-1}$.

Theorem 4.14. Consider a likelihood-based decision criterion such that the corresponding function $f_1 : [0, 1] \to [0, 1]$ is continuous at 0, and the corresponding function $f_0 : [0, 1] \to [0, 1]$ is continuous at 1. The decision criterion is decomposable if and only if there is a positive real number $a$ such that each relation $\preceq_{rp}$ is represented by the functional $l \mapsto \int^S l \, \mathrm{d}(rp^a)$.

Proof. The "if" part is simple: using Theorem 2.6 we obtain

$$V_{rp \circ t^{-1}}[V_{rp|_{\{t = \cdot\}}}(l|_{\{t = \cdot\}})] = \sup_{\tau} V_{rp|_{\{t = \tau\}}}(l|_{\{t = \tau\}}) \, (\underline{rp}\{t = \tau\})^a = \sup_{\tau} \sup_{\{t = \tau\}} l \, (\underline{rp})^a = V_{rp}(l).$$

For the "only if" part, let $l, t$ and $rp$ be respectively the functions and the (nondegenerate) relative plausibility measure on $Q = (0, 1) \times \{0, 1\}$ defined by $l(x, y) = y$ and $t(x, y) = x$, and $\underline{rp}(x, y) = x y + (1 - y)$. We have $rp\{t = \tau\} = 1$ and $V_{rp|_{\{t = \tau\}}}(l|_{\{t = \tau\}}) = f_1(\tau)$ for all $\tau \in T = (0, 1)$; and therefore

$$f_1(1) = V_{rp}(l) = V_{rp}(f_1 \circ t) \leq \sup(f_1 \circ t) = \sup_{(0,1)} f_1.$$

That is, $f_1$ is continuous at 1; the continuity of $f_0$ at 0 can be proved analogously. Theorem 4.11 implies thus that the decision criterion reveals ignorance aversion, and $f_1 : x \mapsto x^a$ for an $a \in \mathbb{P}$; we have to show that $V_{rp}(l) = \sup l \, (\underline{rp})^a$ for all nondegenerate relative plausibility measures $rp$ (on some set $Q_{rp}$), and all functions $l : Q_{rp} \to \mathbb{P}$. Let $l', t'$ and $rp'$ be respectively the functions and the (nondegenerate) relative plausibility measure on $Q' = Q_{rp} \times \{0, 1\}$ defined by $l'(q, y) = l(q) \, y$ and $t'(q, y) = q$, and $\underline{rp'}(q, y) = \underline{rp}(q) \, y + (1 - y)$. We have $V_{rp'}(l') = V_{rp}(l)$, because $rp'\{l' > x\} = rp(\{l > x\} \times \{1\}) = rp\{l > x\}$ for all $x \in \mathbb{P}$ (Theorem 4.3); and since $rp'\{t' = \tau\} = 1$ and $V_{rp'|_{\{t' = \tau\}}}(l'|_{\{t' = \tau\}}) = l(\tau) \, [\underline{rp}(\tau)]^a$ for all $\tau \in T = Q_{rp}$, we obtain

$$V_{rp}(l) = V_{rp'}(l') = V_{rp' \circ (t')^{-1}}[l \, (\underline{rp})^a] = \sup l \, (\underline{rp})^a,$$

because $rp' \circ (t')^{-1} = 1$ on $Q_{rp}$. □

At the beginning of the present section we noted that a likelihood-based decision criterion can be useful only if it satisfies some sort of continuity at 0 of the influence of a model, with respect to its relative plausibility; and thus in particular only if the corresponding functions $f_1$ and $f_0$ on $[0, 1]$ are continuous at 0 and 1, respectively. In this sense, Theorem 4.14 implies that the MPL criterion (possibly applied after having distorted the likelihood function by raising its values to the power $a \in \mathbb{P}$) is the only decomposable likelihood-based decision criterion that can be useful; and it is also the only decomposable decision criterion that leads to the usual likelihood-based inference methods, when applied to some standard form of the corresponding decision problems. The only consequence of the assumption about $f_1$ in Theorem 4.14 is the exclusion of the essential minimax criterion, while the assumption about $f_0$ excludes not only the essential minimin criterion, but for instance also the decision criterion of Example 4.12.

Theorem 4.2 states that a likelihood-based decision criterion compares the decisions on the basis of a functional $S : \mathcal{S} \to \mathbb{P}$ applied to the functions $\underline{rp \circ l^{-1}}$ describing the relative plausibility of the possible values of the loss incurred. That is, a decision criterion corresponds to a preference relation (represented by $S$) on the set $\mathcal{S}$, which can be identified with the set of all nondegenerate relative plausibility measures on the set $\mathbb{P}$ of the possible values of the loss. Moreover, the property of decomposability can be interpreted also as a rule for "combining" the nondegenerate relative plausibility measures $rp \circ (l|_{\{t = \tau\}})^{-1}$ on $\mathbb{P}$ (for each $\tau \in T$) on the basis of the "marginal" nondegenerate relative plausibility measure $rp \circ t^{-1}$ on $T$. Hence, if we assume the sure-thing principle (that is, the finite decomposability), then we can consider a framework similar to the one of the representation theorem of von Neumann and Morgenstern (1944): a decision criterion corresponds to a preference relation on an abstract set of consequences (the nondegenerate relative plausibility measures on $\mathbb{P}$), on which for each $r \in \mathbb{P}$ we have a binary operation (the combination of two consequences with plausibility ratio $r$).

The axiomatic approach to qualitative decision making with possibility theory of Dubois and Prade (1995) uses a similar framework, and results in particular in two decision criteria differing from one another only in their attitudes toward ignorance: one is based on an assumption of ignorance aversion, while the other is based on an assumption of ignorance attraction. By replacing these assumptions with an axiom of "qualitative monotonicity", and adapting other axioms to the new attitude toward ignorance and to the use of profile likelihood functions, we obtain the decision criterion of Giang and Shenoy (2002), briefly introduced at the end of Subsection 1.3.2. If we identify the best value $(1, 0)$ of the binary utility with the loss 0, and the worst value $(0, 1)$ with the loss 1, then the assumption of "qualitative monotonicity" corresponds to the assumption that both functions $f_1$ and $f_0$ on $[0, 1]$ are strictly increasing. Theorem 4.11 implies that this assumption is incompatible with the sure-thing principle, on which the framework of Giang and Shenoy is based; but there is no contradiction, because their criterion is not a "likelihood-based decision criterion" in the sense introduced at the beginning of the present section.

In our framework, the justification of the axiom of "qualitative monotonicity" by Giang and Shenoy can be expressed as follows: if one of the functions $f_1, f_0$ is not strictly increasing, then there is a decision problem in which a decision is considered equivalent to another decision dominating it. But this is unavoidable when the decisions are evaluated solely on the basis of the profile likelihood function on the set of the possible consequences; moreover, two decisions can be considered equivalent by the criterion of Giang and Shenoy even when one strictly dominates the other and they have only a finite number of possible consequences (that is, this decision criterion does not satisfy the strict version of the sure-thing principle). In particular, when we are in the state of complete ignorance about the models considered, and a decision has only a finite number of possible consequences, if all these consequences are better than the intermediate value $(1, 1)$, then the criterion of Giang and Shenoy evaluates the decision by its worst consequence, while if all the possible consequences are worse than $(1, 1)$, then the decision is evaluated by its best consequence; otherwise the decision is simply evaluated by the intermediate value $(1, 1)$, and thus two decisions can be considered equivalent even if one is clearly better than the other. It is interesting to note that if we consider a decision problem described by a utility function, we interpret the positive values of the utility as gains, and the negative ones as losses, and we identify the value 0 of the utility with the intermediate value $(1, 1)$ of the binary utility, then the decision criterion of Giang and Shenoy is ignorance averse as regards gains, and ignorance attracted as regards losses; in opposition to the decision criterion considered at the end of Subsection 4.1.1, and to the theory of Shackle (1949).

4.2 Statistical Properties

Consider a statistical decision problem described by a loss function $L$ on $\mathcal{P} \times \mathcal{D}$, where $\mathcal{P}$ is a set of probability measures on a measurable space $(\Omega, \mathcal{A})$. If we regard the observed data as a particular realization of a random object $X : \Omega \to \mathcal{X}$, then we can consider a loss function $L'$ on $\mathcal{P} \times \Delta$ (where $\Delta$ is a set of decision functions $\delta : \mathcal{X} \to \mathcal{D}$), as done in Subsection 1.1.1. A likelihood-based decision criterion can be applied either to the pre-data decision problem described by $L'$, or to the post-data decision problem described by $L$ (conditional on the observed realization of $X$). When applied to the pre-data decision problem described by the risk function, all likelihood-based decision criteria revealing ignorance aversion reduce to the minimax risk criterion (if no prior likelihood function is used). In general the application of a decision criterion to a pre-data decision problem is rather difficult; therefore it can be interesting to consider the decision function $\delta$ obtained by conditional application of a likelihood-based decision criterion (to the post-data decision problem): that is, $\delta(x)$ is the decision that we would select after having observed $X = x$. If $X$ is continuous (under each model in $\mathcal{P}$), then it is usually possible to base the decisions $\delta(x)$ on the pseudo likelihood functions on $\mathcal{P}$ obtained from the densities of $X$ and considered at the beginning of Section 1.2 as approximate likelihood functions. If a statistic $s : \mathcal{X} \to S$ is sufficient for $X$, then $\delta$ depends on $x$ only through $s(x)$; that is, the decision function obtained by conditional application of a likelihood-based decision criterion depends only on sufficient statistics. In general the decision function $\delta$ is not optimal from the pre-data point of view (it is not even necessarily in $\Delta$), but it can nevertheless be useful. The pre-data evaluation of $\delta$ in terms of $L'$ can be of interest even when we are primarily concerned with the post-data decision problem described by $L$; in particular, it can be interesting to consider the expected loss of the decision function $\delta$ under each model in $\mathcal{P}$.
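The conditional (post-data) application can be sketched for a toy problem: two Bernoulli-type models, a 0-1 loss for guessing the model, and an MPL-style evaluation of the normalized likelihood of the observed data. All names and the toy setup below are ours, purely for illustration.

```python
def delta(x, models, loss, likelihood):
    # post-data decision: normalize the likelihood of x into a relative
    # plausibility rp on the models, then minimize sup_P loss(P, d) * rp(P)
    lik = {P: likelihood(P, x) for P in models}
    m = max(lik.values())
    rp = {P: v / m for P, v in lik.items()}
    decisions = {d for P in models for d in loss[P]}
    return min(decisions, key=lambda d: max(loss[P][d] * rp[P] for P in models))

models = [0.2, 0.8]  # success probabilities of the two candidate models
loss = {0.2: {'say_low': 0.0, 'say_high': 1.0},
        0.8: {'say_low': 1.0, 'say_high': 0.0}}
lik = lambda theta, x: theta ** x * (1 - theta) ** (3 - x)  # x successes in 3 trials
```

Here $\delta$ depends on the data only through the success count, a sufficient statistic, in line with the remark above; the binomial coefficient cancels in the normalization.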

The pre-data performances of the decision functions obtained by conditional application of a likelihood-based decision criterion can be arbitrarily bad in particular decision problems. But the same is true for the pre-data performances of the likelihood-based inference methods, and yet these are the most appreciated general methods for the respective inference problems, because they are intuitive and their pre-data performances are usually good in actual applications. A reasonable likelihood-based decision criterion can be intuitive too, and if it leads to the likelihood-based inference methods when applied to some standard form of the corresponding decision problems, then we can expect that in many actual problems the decision functions obtained by conditional application of this criterion will perform well from the pre-data point of view. When this is not the case, the pre-data performance of these decision functions can often be improved by replacing the profile likelihood function (which is automatically employed by the decision criterion) with another pseudo likelihood function (this technique is used also for the likelihood-based inference methods).

4.2.1 Invariance Properties

The likelihood-based decision criteria are obviously "parametrization invariant", since they do not depend on a parametrization of the set $\mathcal{P}$ of statistical models considered; they possess many other invariance properties: in particular the following one.

Theorem 4.15. Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟 such that there are two functions f : 𝒫 → 𝒫 and g : 𝒟 → 𝒟, and a positive real number c satisfying L(P, d) = c L[f(P), g(d)] for all P ∈ 𝒫 and all d ∈ 𝒟. All likelihood-based decision criteria have the following property: the decision d ∈ 𝒟 is optimal with respect to the nondegenerate relative plausibility measure rp on 𝒫, if g(d) is optimal with respect to rp ∘ f⁻¹.

In particular, if rp ∘ f⁻¹ = rp, then for all likelihood-based decision criteria the set of all optimal decisions (with respect to rp) is invariant under g⁻¹, and thus if there is a unique optimal decision, then it is a fixed point of g.

4.2 Statistical Properties 129

Proof. Since for all x ∈ [0, ∞] we have

{l_d = x} = {P ∈ 𝒫 : L(P, d) = x} = f⁻¹{l_{g(d)} = x/c},

Theorem 4.2 implies

V_rp(l_d) = S(rp ∘ l_d⁻¹) = S[(rp ∘ f⁻¹) ∘ l_{g(d)}⁻¹ ∘ (x ↦ x/c)] = c V_{rp∘f⁻¹}(l_{g(d)}),

and therefore d minimizes V_rp(l_d) if d′ = g(d) minimizes V_{rp∘f⁻¹}(l_{d′}).

The invariance property stated by Theorem 4.15 can be very useful when the decision problem presents some symmetries. In particular, the next theorem states that the decision function δ obtained by conditional application of a likelihood-based decision criterion is equivariant, if it is well-defined and the decision problem is invariant. In fact, the premise of the theorem is the definition of an invariant decision problem: only the assumption about the prior relative plausibility measure rp and the restriction to discrete random objects X are not usual. But the assumption about rp is clearly satisfied when we use no prior information (that is, when rp has constant density 1); and the theorem is valid also for continuous X, if we base δ on the densities of X (and we consider the usual definition of invariant decision problem). Moreover, the theorem is valid also if condition (4.6) is restricted to the functions t that are bijective, and this restricted condition can be satisfied also when the profile likelihood function is replaced by another pseudo likelihood function.

Theorem 4.16. Let 𝒫 be a set of probability measures on a measurable space (Ω, 𝒜), and let X : Ω → 𝒳 be a random object that is discrete under each model in 𝒫. Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, with prior (nondegenerate) relative plausibility measure rp on 𝒫. Let G, G′, G″ be three permutation groups on 𝒳, 𝒫, 𝒟, respectively, and let g ↦ g′ and g ↦ g″ be two homomorphisms of G to G′ and G″, respectively, such that for all g ∈ G, all x ∈ 𝒳, all P ∈ 𝒫, and all d ∈ 𝒟 we have rp ∘ (g′)⁻¹ = rp,

P{X = g(x)} = [g′(P)]{X = x}, and L[g′(P), d] = L[P, g″(d)].

All likelihood-based decision criteria have the following property (for all g ∈ G, all d ∈ 𝒟, and all x ∈ 𝒳 such that there is a P ∈ {rp > 0} with P{X = x} > 0): the decision d is optimal given the observation X = x if and only if g″(d) is optimal given X = g(x).

In particular, if the decision function δ : 𝒳 → 𝒟 obtained by conditional application of a likelihood-based decision criterion is well-defined, then it is equivariant: that is, δ ∘ g = g″ ∘ δ for all g ∈ G.

Proof. Since rp{P} P{X = g(x)} = rp{g′(P)} [g′(P)]{X = x}, the desired result can be easily obtained by applying Theorem 4.15 twice. □

In the classical approach, when the loss function L on 𝒫 × 𝒟 describes an invariant decision problem (that is, when the premise of Theorem 4.16 is satisfied), the equivariance can be imposed on the decision functions δ ∈ Δ, assuring that they present symmetries corresponding to those of the decision problem; this is particularly useful when G′ acts transitively on 𝒫, because then L′(P, δ) (the representative value of the random loss of δ) does not depend on P. For a decision function δ obtained by conditional application of a likelihood-based decision criterion, the equivariance is assured by Theorem 4.16, without need of considering the symmetries of the decision problem (provided these are not invalidated by asymmetric prior information). However, the recognition of these symmetries simplifies the construction of δ (as in the next example); in particular, if G acts transitively on 𝒳, then δ is uniquely determined by any of its values δ(x).

Example 4.17. The estimation problem of Example 1.1 is invariant with respect to the permutation group G = {id_𝒳, n − id_𝒳} on 𝒳 = {0, …, n}, the permutation group G″ = {id_𝒟, 1 − id_𝒟} on 𝒟 = [0, 1], and the permutation group G′ on 𝒫 that corresponds to G″ by means of the parametrization P_p ↦ p of 𝒫 with parameter space 𝒟 (considered in Example 1.6); the two homomorphisms are obvious. Let δ : 𝒳 → 𝒟 be the decision function obtained by conditional application of a likelihood-based decision criterion (without using prior information). Theorem 4.16 implies that if δ is well-defined (this is for instance the case when we use the MPL, LRM_β, or MLD criteria considered in Examples 1.8 and 1.10), then δ(n − x) = 1 − δ(x) for all x ∈ 𝒳, and thus δ is uniquely determined by its values δ(x) for x < n/2. ◊
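As a numerical illustration (ours, not from the thesis), the symmetry δ(n − x) = 1 − δ(x) can be checked for this binomial estimation problem, assuming the MPL criterion evaluates a decision d by sup_p lik(p) L(p, d), the Shilkret integral of the loss with respect to the relative likelihood; the grids, the squared error loss, and all names below are our choices.

```python
import numpy as np

n = 5                                   # number of Bernoulli trials
p_grid = np.linspace(0.0, 1.0, 2001)    # parameter space (success probability p)
d_grid = np.linspace(0.0, 1.0, 201)     # candidate decisions (point estimates)

def mpl_estimate(x):
    """MPL decision for observation x: minimize sup_p lik(p) * L(p, d)."""
    lik = p_grid**x * (1.0 - p_grid)**(n - x)
    lik = lik / lik.max()               # relative likelihood (no prior information)
    loss = (p_grid[:, None] - d_grid[None, :]) ** 2   # squared error loss L(p, d)
    evaluations = (lik[:, None] * loss).max(axis=0)   # sup over models, per decision
    return float(d_grid[evaluations.argmin()])

for x in range(n + 1):
    print(x, mpl_estimate(x))
```

Up to the grid resolution, the printed estimates satisfy mpl_estimate(x) + mpl_estimate(n − x) = 1, as the equivariance result predicts.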

A very important aspect of a statistical procedure is its robustness under deviations from the assumed models (see for example Hampel, Ronchetti, Rousseeuw, and Stahel, 1986, Chapter 1). In general the decisions obtained by conditional application of a likelihood-based decision criterion are not robust, but they can often be robustified by enlarging the set 𝒫 of statistical models considered. In this regard it is important to consider the possibility of using a prior relative plausibility measure on the enlarged set 𝒫′; for instance the one with density function proportional to 1_𝒫 + β 1_{𝒫′∖𝒫} for a β ∈ [0, 1]: the prior relative plausibility of the models in 𝒫′ ∖ 𝒫 is β times the one of the models in 𝒫 (the extreme cases with β = 0 and β = 1 correspond to using the sets 𝒫 and 𝒫′, respectively, without prior information). When 𝒫 is enlarged to 𝒫′, the loss function L on 𝒫 × 𝒟 must be extended to a loss function L′ on 𝒫′ × 𝒟; this extension is simple if each model P ∈ 𝒫 is replaced by a set ℋ_P ⊆ 𝒫′ of models involving the same losses as P (with P ∈ ℋ_P and 𝒫′ = ⋃_{P∈𝒫} ℋ_P): we have L′(P′, d) = L[g(P′), d] for all P′ ∈ 𝒫′ and all d ∈ 𝒟, where g : 𝒫′ → 𝒫 is the mapping defined by g⁻¹{P} = ℋ_P (for all P ∈ 𝒫). Of course, g and L′ are well-defined only if all the sets ℋ_P are disjoint, but when this is not the case we can "disjoin" them as suggested at the end of Subsection 3.2.2 for the imprecise statistical models; in fact, if the density function of the prior relative plausibility measure is constant on each set ℋ_P, then enlarging 𝒫 to 𝒫′ corresponds to replacing each model P ∈ 𝒫 with the imprecise statistical model ℋ_P.

As regards the enlargement of 𝒫 to 𝒫′ by replacing each model P ∈ 𝒫 with a set ℋ_P of models involving the same losses as P, it is interesting to consider the stability of the decisions obtained by conditional application of a likelihood-based decision criterion. Let rp be the nondegenerate relative plausibility measure on 𝒫′ describing our uncertain knowledge about the models after having observed the data, and assume that rp|_𝒫 is nondegenerate too. For each decision d ∈ 𝒟 the uncertain loss with respect to 𝒫′ is given by l_d ∘ g : 𝒫′ → [0, ∞]; hence for all likelihood-based decision criteria the preferences between the possible decisions remain unchanged (when 𝒫 is enlarged to 𝒫′) if rp ∘ g⁻¹ = rp|_𝒫. This can happen in particular when each P ∈ 𝒫 is the most plausible model in ℋ_P in the light of the observed data, or when the models in ℋ_P that are more plausible than P in the light of the data have been sufficiently penalized by the prior relative plausibility measure on 𝒫′. More generally, for each likelihood-based decision criterion we can consider the change V_{rp∘g⁻¹}(l) − V_{rp|𝒫}(l) in the evaluation of the uncertain loss l ∈ [0, ∞]^𝒫 when 𝒫 is enlarged to 𝒫′, and in particular the continuity at rp″ = (rp|_𝒫) ∘ g of the mapping rp′ ↦ V_{rp′∘g⁻¹}(l) (defined on the set of all nondegenerate relative plausibility measures rp′ on 𝒫′), with


respect to ‖rp′ − rp″‖ (note that rp″ ∘ g⁻¹ = rp|_𝒫). If a likelihood-based decision criterion always satisfies this continuity when l is bounded, then the corresponding functions f₁, f₀ on [0, 1] are continuous; but if f₁ is positive on (0, 1], then the above continuity cannot always be satisfied when l is unbounded. Theorem 2.23 implies that the likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫ l d(f₁ ∘ rp) satisfies the above continuity for all bounded functions l, if f₁ : [0, 1] → [0, 1] is a nondecreasing surjection, and the integral is regular and quasi-subadditive on the class of all completely maxitive measures. The Shilkret integral satisfies these properties, and its values are particularly stable with respect to changes in the measure, since the Shilkret integral depends on the distribution function only through a supremum. Moreover, if f₁ : x ↦ x^α for an α ∈ (0, ∞), then the likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫^S l d(f₁ ∘ rp) (that is, the MPL criterion applied after having distorted rp by raising its values to the power α) satisfies the sure-thing principle (and its infinite version: the decomposability), which can also be interpreted as expressing some sort of stability of the preferences between the possible decisions: in particular, if a preference between two decisions holds with respect to both 𝒫 and 𝒫′ ∖ 𝒫, then this preference remains unchanged when 𝒫 is enlarged to 𝒫′.
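The role of the supremum in the Shilkret integral can be made concrete with a small numerical sketch (ours, not from the thesis): for a finitely maxitive measure given by a plausibility density on a finite set of models, the Shilkret integral of a loss l reduces to sup_P rp{P} l(P), so a perturbation of the density moves the evaluation by at most the perturbation size times the largest loss value.

```python
import numpy as np

def shilkret(loss, plaus):
    """Shilkret integral of `loss` w.r.t. the maxitive measure with density `plaus`:
    sup_{x>0} x * plaus({loss >= x}), which on a finite grid is max(plaus * loss)."""
    return float(np.max(np.asarray(plaus) * np.asarray(loss)))

loss  = np.array([0.2, 1.0, 3.0, 10.0])   # uncertain loss l on four models
plaus = np.array([1.0, 0.8, 0.3, 0.05])   # relative plausibility of each model

v1 = shilkret(loss, plaus)
# perturb the plausibility density slightly: the evaluation changes by at most
# max(loss) times the sup-norm of the perturbation
plaus2 = plaus + np.array([0.0, 0.01, -0.01, 0.0])
v2 = shilkret(loss, plaus2)
print(v1, v2)
```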

It is also interesting to consider the stability of the decisions obtained by conditional application of a likelihood-based decision criterion, when the values of the loss function L : 𝒫 × 𝒟 → [0, ∞] are changed. If some important symmetry of the decision problem is maintained, then the optimal decisions can remain unchanged; but in general when the change in the values of L is substantial, the resulting decision problem is very different from the original one, and so are the optimal decisions. As with the enlargement of 𝒫, we can consider the continuity at l ∈ [0, ∞]^𝒫 of the evaluation l′ ↦ V_rp(l′) (that is, of the functional V_rp on [0, ∞]^𝒫), with respect to ‖l′ − l‖; we obtain in particular that the likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫ l d(f₁ ∘ rp) satisfies this continuity for all functions l, if the integral is regular and quasi-subadditive on the class of all finitely maxitive measures, and f₁ : [0, 1] → [0, 1] is nondecreasing, f₁(0) = 0, and f₁(1) = 1. The decisions obtained by conditional application of a likelihood-based decision criterion revealing ignorance aversion can sometimes be "robustified" with respect to the choice of the loss function L, by replacing it with a whole set ℒ of loss functions on 𝒫 × 𝒟, and using as evaluation of each decision d ∈ 𝒟 the supremum of V_rp(l_d) as L ranges over ℒ (it is important to note that all the loss functions in ℒ


must be expressed in the same scale). Theorem 2.6 implies that if f₁ is continuous, and V_rp : l ↦ ∫^S l d(f₁ ∘ rp), then the same decisions can be obtained by applying the likelihood-based decision criterion corresponding to the functionals V_rp (that is, by applying the MPL criterion after having distorted rp by means of f₁) to the decision problem described by the loss function sup ℒ.

Example 4.18. Consider the problem of estimating the mean of a normal distribution with known variance, without prior information about the relative plausibility of the possible values of the mean (in order to simplify the results, in the present example the variance is assumed to be known, but analogous results would be obtained without this assumption). Let 𝒫 = {P_μ : μ ∈ ℝ} be a family of probability measures such that under each model P_μ the random variables X₁, …, X_n are independent and normally distributed with mean μ and variance 1. These random variables are continuous: assume that each realization X_i = x_i is observed with precision δ ∈ (0, ∞), in the sense that in fact we observe X_i ∈ [x_i − δ/2, x_i + δ/2]; for instance, if the observations are rounded off to m digits after the decimal point, then δ = 10⁻ᵐ. Suppose that δ is sufficiently small to allow the use of the function P_μ ↦ δ φ(x_i − μ) on 𝒫 (where φ is the continuous density of the standard normal distribution, with respect to the Lebesgue measure on ℝ) as approximation of the likelihood function on 𝒫 induced by the (imprecise) observation X_i = x_i. The arithmetic mean X̄ is sufficient for X₁, …, X_n, and it is normally distributed with mean μ and variance 1/n. Hence, for a likelihood-based method the problem reduces to the estimation of μ on the basis of a single observation from a normal distribution with mean μ and variance 1/n, and in this reduced problem the only reasonable point estimate is the observation itself (that is, the arithmetic mean X̄), at least if we consider overestimation and underestimation as equally harmful.

In fact, using Theorem 4.2 it can be easily proved that X̄ is an optimal decision, according to all likelihood-based decision criteria, for all symmetric, nondecreasing estimation errors; that is, for all the decision problems described by loss functions L : (P_μ, d) ↦ f(|μ − d|) on 𝒫 × ℝ, where f : [0, ∞) → [0, ∞] is nondecreasing. In general this optimal decision is not unique: for instance, according to the essential minimax criterion all the decisions in ℝ are equivalent (independently of the choice of f); but if f is strictly increasing, and lim_{x→∞} f(x) [φ(x)]ⁿ = 0, then the arithmetic mean is the unique optimal decision according to the MPL, LRM_β, and MLD


Figure 4.1. Profile likelihood functions and maximum likelihood estimates of the number of degrees of freedom from Example 4.18.

criteria. Anyway, for the decision problem described by L, if the decision obtained by conditional application of a likelihood-based decision criterion is well-defined, then it is the arithmetic mean, which is not robust with respect to deviations from the normal models (in particular it is vulnerable to outliers). Consider for instance the six observations X₁ ≈ 1.176, X₂ ≈ −0.563, X₃ ≈ 0.235, X₄ ≈ −1.443, X₅ ≈ −1.079, and X₆ = 5; the first five have been generated using the model P₀ (their arithmetic mean is X̄ ≈ −0.335), while the sixth one is an artificial outlier, strongly influencing the arithmetic mean (which is X̄ ≈ 0.554 when all six observations are considered). The likelihood functions on ℝ (with respect to the parametrization P_μ ↦ μ of 𝒫) induced by the first five observations and by all six observations are plotted in the first diagram of Figure 4.1 for μ ∈ [−2, 2] (solid and dashed lines, respectively; the two functions have been scaled to have the same maximum). These functions give no information about how well the normal models fit the data, and thus no method based only on the likelihood function can recognize X₆ as an outlier, if 𝒫 does not contain some other model to which the normal ones can be compared.

In order to robustify the decisions obtained by conditional application of a likelihood-based decision criterion, we can for example enlarge 𝒫 to the set 𝒫′ consisting of all the probability measures such that the random variables X₁, …, X_n are independent and identically distributed with variance 1 and either normal or t distribution (with any number ν ∈ (2, ∞) of degrees of freedom; the normal distribution corresponds to


the limit ν → ∞). Let m : P ↦ μ = E_P[X₁] be the function on 𝒫′ assigning to each model the expected value of the random variables X_i, and let L′ : (P, d) ↦ f(|m(P) − d|) be the obvious extension of L to 𝒫′ × 𝒟. As above, assume that the likelihood function on 𝒫′ induced by the (imprecise) observation X_i = x_i can be approximated using the continuous densities of the random variable X_i (with respect to the Lebesgue measure on ℝ). The profile likelihood functions lik_m on ℝ induced by the first five observations and by all six observations are plotted in the first diagram of Figure 4.1 for μ ∈ [−2, 2] (solid and dotted lines, respectively; the two functions have been scaled to have the same maximum). When considering only the first five observations, the enlargement of 𝒫 to 𝒫′ has no real consequence (the t models do not fit these observations considerably better than the normal models): the profile likelihood functions lik_m and lik_{m|𝒫} are almost equal on [−2, 2] (in fact, their graphs are indistinguishable), and tend very rapidly to 0 outside this interval; hence, each reasonable likelihood-based decision criterion will give (almost) the same results when applied to the decision problem described by L′ as when applied to the one described by L, if the function f is not too extreme. But when we consider all six observations, the enlargement of 𝒫 to 𝒫′ has important consequences: the profile likelihood functions lik_m and lik_{m|𝒫} are completely different; that is, the t models fit these observations much better than the normal models. In fact, the second diagram of Figure 4.1 shows the graph of the maximum likelihood estimates of the number ν of degrees of freedom for the t models in m⁻¹{μ} (as a function of μ ∈ [−2, 2]): the estimated values of ν are quite small (far away from the limit ν → ∞ corresponding to the normal models). The profile likelihood functions lik_m induced by the first five observations and by all six observations are rather similar, and thus each reasonable likelihood-based decision criterion applied to the decision problem described by L′ will give similar results when the outlier X₆ is considered and when it is discarded, if the function f is not too extreme. For instance, when applied to the decision problem described by L′ with f : x ↦ x² (that is, with squared error loss), the MPL and MLD criteria lead respectively to the estimates −0.286 and −0.323 when X₆ is considered, and to the estimate −0.335 when X₆ is discarded.
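The contrast between the normal and t profile likelihoods can be reproduced numerically; the sketch below is ours (grids, names, and the crude discretization of ν are implementation choices): since the thesis constrains all models to variance 1, the t density with ν degrees of freedom is rescaled by its standard deviation √(ν/(ν − 2)), and the profile over ν is taken as a maximum over a finite grid, with the normal model as the ν → ∞ limit.

```python
import numpy as np
from math import gamma, pi, sqrt

x = np.array([1.176, -0.563, 0.235, -1.443, -1.079, 5.0])  # data with outlier X6 = 5

def norm_loglik(mu):
    """Log-likelihood under the normal model with mean mu and variance 1."""
    return float(np.sum(-0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * pi)))

def t_unitvar_loglik(mu, nu):
    """Log-likelihood under the t(nu) model rescaled to variance 1."""
    s = sqrt(nu / (nu - 2.0))                      # sd of a standard t(nu) variable
    z = (x - mu) / s
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return float(np.sum(np.log(c / s) - (nu + 1) / 2 * np.log1p(z ** 2 / nu)))

mu_grid = np.linspace(-2, 2, 401)
nu_grid = np.geomspace(2.1, 200.0, 100)            # crude stand-in for nu in (2, inf)

prof_norm = np.array([norm_loglik(m) for m in mu_grid])
prof_t = np.array([max(max(t_unitvar_loglik(m, v) for v in nu_grid), norm_loglik(m))
                   for m in mu_grid])              # profile over nu, normal as limit

print(mu_grid[prof_norm.argmax()], mu_grid[prof_t.argmax()])
```

The normal profile peaks near the contaminated mean X̄ ≈ 0.554, while the t profile peaks near the bulk of the data (a negative value close to the five-observation mean), illustrating how the enlarged model discounts the outlier.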

The t models with unknown number of degrees of freedom have been successfully used instead of the normal models to obtain maximum likelihood estimates: see for instance Lange, Little, and Taylor (1989) and Pinheiro, Liu, and Wu (2001); since the resulting likelihood functions typically have several local maxima, even better estimates can perhaps be


obtained by using other likelihood-based decision criteria. However, the t models do not cover all the possible deviations from the normal models; in particular, they have problems with asymmetric distributions and extreme outliers. In order to cope with asymmetric distributions, several generalizations of the t models have been proposed in recent years: see for instance Azzalini and Capitanio (2003). To allow the possibility of extreme outliers, we could enlarge 𝒫 by replacing the normal distributions of the random variables X₁, …, X_n with their ε-contamination classes for some ε ∈ (0, 1), while still considering these random variables as independent and identically distributed, as done by Huber (1964). By using the MLD criterion we could then obtain reasonable results, but since the pseudo likelihood function lik′ on 𝒫 would satisfy inf lik′ > 0 (independently of the number n of observations), each likelihood-based decision criterion revealing ignorance aversion and such that the corresponding function f₁ : [0, 1] → [0, 1] is positive on (0, 1] would consider all the decisions in ℝ as equivalent, when applied (with respect to lik′) to the decision problem described by the loss function L corresponding to an unbounded function f. The problem is that under each contaminated model the random variables X₁, …, X_n are independent and identically distributed, and so for each normal model the event that all n observations will be arbitrarily extreme outliers has a probability larger than εⁿ under a suitable contaminated model; this is one of the reasons why Huber evaluated the estimators by means of the maximum asymptotic variance (and not the maximum variance) under the contaminated models.

To avoid this problem, we can for example enlarge 𝒫 to the set 𝒫″ consisting of all the probability measures such that n − k of the random variables X₁, …, X_n are independent and normally distributed with the same mean and variance 1 (while the other k random variables can have any distribution), where k is any nonnegative integer not larger than the fixed, positive integer K < n/2. Let g be the function on 𝒫″ assigning to each model P the mean μ of the majority of the random variables X₁, …, X_n, and let L″ : (P, d) ↦ f(|g(P) − d|) be the obvious extension of L to 𝒫″ × 𝒟. By using 𝒫″, we allow at most K observations to be considered as outliers: the profile likelihood function lik_g on ℝ induced by the n observations is the maximum of the likelihood functions lik_{g|𝒫} induced by all possible subsets of n − K observations; in particular, we certainly have inf lik_g = 0, and thus the above problems with unbounded functions f are resolved. Each value lik_g(μ) of the profile likelihood function induced by the n observations corresponds to the value lik(P_μ) of the likelihood function induced by the


n − K observations nearest to μ; that is, if we use no prior information, then the number of observations discarded as outliers is always the maximum possible, because there is no penalty for discarding observations.

This can be changed by using a prior relative plausibility measure on 𝒫″ with density function proportional to r ∘ k, where k is the function on 𝒫″ assigning to each model P the smallest number k such that n − k of the random variables X₁, …, X_n are independent and normally distributed with mean g(P) and variance 1, while the function r on {0, …, K} gives the prior relative plausibility of the number of outliers. Each plausibility ratio r(k)/r(0) can also be interpreted as the penalty factor for discarding k observations: if we apply the same penalty factor c ∈ (0, 1) for each discarded observation, then we obtain the function r : k ↦ cᵏ. In particular, if c = δ φ(x) for an x ∈ (0, ∞), and rp is the relative plausibility measure after having observed the data, then for each μ ∈ ℝ the value rp{g = μ} corresponds (up to a positive multiplicative constant independent of μ) to the value lik(P_μ) of the likelihood function induced by the n "huberized" observations: that is, the n observations after having replaced each value x_i such that |x_i − μ| > x with the value μ + x sign(x_i − μ) (when there are more than K such values x_i, only the K values farthest from μ are replaced). For example, when using x = 4 and observing the first five of the six realizations X_i = x_i considered above, the density function of the resulting relative plausibility measure rp ∘ g⁻¹ and the likelihood function lik_{g|𝒫} induced by these observations are proportional on the interval [−2.824, 2.557]; on the interval [−2.824, 1] these two functions are also proportional to the density function of the relative plausibility measure rp ∘ g⁻¹ obtained after having observed all six realizations X_i = x_i. Outside these intervals the three functions tend very rapidly to 0, and thus each reasonable likelihood-based decision criterion will give (almost) the same results when applied to the decision problem described by L″ (with or without considering X₆) as when applied to the one described by L without considering X₆, if the function f is not too extreme. For instance, if f : x ↦ x², then the arithmetic mean of the first five observations is the estimate obtained by applying the MPL and MLD criteria to the decision problem described by L″ with or without considering X₆; that is, the outlier X₆ has no influence on the estimate (this holds also for the LRM_β criterion, if β > 0.012). ◊


4.2.2 Asymptotic Optimality

Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟. If the data contain only little information for discrimination between the models in 𝒫, then different (conditional) decision criteria can lead to completely different conclusions, when applied to the decision problem described by L; but if the data contain a lot of information, then we can expect that the decision criteria lead to similar conclusions. Consider in particular the simplest nontrivial case, in which L is bounded, 𝒫 = {P₁, P₂}, and 𝒟 = {d₁, d₂}, and let d₁ be the best decision according to P₁; a conditional decision criterion can be reasonable only if it tends to select d₁, when in the light of the observed data P₁ tends to be infinitely more plausible than P₂ (and P₁ was not discarded by the prior information). For example, the Bayesian criterion is reasonable in this sense, while in general a decision criterion based on an imprecise Bayesian model updated by means of "regular extension" will tend to select d₁ only if the lower prior probability of P₁ is positive (that is, some prior information is needed). It can be easily proved that the following property is necessary for a likelihood-based decision criterion to be reasonable in the above sense; we shall see that it is also sufficient. A likelihood-based decision criterion is said to be consistent if the corresponding functional S satisfies

lim_{y↓0} S(1_{{1}} + y 1_{{x}}) = 1 for all x ∈ [0, ∞] ∖ {1, ∞}.
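For the MPL criterion this requirement can be checked directly (our sketch, not from the thesis): with the Shilkret-type evaluation, S applied to the distribution giving plausibility 1 to the loss value 1 and plausibility y to the loss value x equals max(1, y x), which tends to 1 as y ↓ 0 for every finite x, whereas an essential minimax evaluation max(1, x) does not.

```python
def mpl_eval(y, x):
    """MPL evaluation of a loss equal to 1 with plausibility 1 and x with plausibility y."""
    return max(1.0, y * x)

def essential_minimax_eval(y, x):
    """Essential minimax: worst loss among models with positive plausibility."""
    return max(1.0, x) if y > 0 else 1.0

x = 50.0
for y in [1.0, 0.1, 0.001]:
    print(y, mpl_eval(y, x), essential_minimax_eval(y, x))
```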

When compared with condition (4.7), the property of consistency appears as a minimal requirement of "continuity at 0 of the influence of a model, with respect to its relative plausibility". At the beginning of the previous section we noted that this sort of continuity is necessary for a likelihood-based decision criterion to be useful, and in fact a decision criterion that does not tend to select d₁ in the above simple situation cannot be really useful. The consistency of a likelihood-based decision criterion implies that the corresponding functions f₁ and f₀ on [0, 1] are continuous at 0 and 1, respectively (the continuity of f₀ at 1 corresponds to the case with x = 0, and the continuity of f₁ at 0 follows easily from the next theorem), and thus in particular the essential minimax and essential minimin criteria are not consistent. It can be easily proved that the likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫ l d(f₁ ∘ rp) is consistent, if f₁ : [0, 1] → [0, 1] is nondecreasing and continuous at 0


with f₁(0) = 0 and f₁(1) = 1, and the integral is homogeneous, quasi-subadditive, and respects distributional dominance on the class of all normalized, finitely maxitive measures taking values in the range of f₁. Hence in particular the MPL, LRM_β, and MLD criteria are consistent; and Theorem 4.14 implies that the MPL criterion (possibly applied after having distorted the likelihood function by raising its values to the power α ∈ (0, ∞)) is the only likelihood-based decision criterion that is decomposable and consistent.

Theorem 4.19. A likelihood-based decision criterion is consistent if and only if the corresponding functionals V_rp have the following property: if A and B are two disjoint sets, the sequence rp₁, rp₂, … of nondegenerate relative plausibility measures on Ω = A ∪ B satisfies lim_{n→∞} rp_n(A) = 0, and l : Ω → [0, ∞] is a function such that l|_A is bounded and l|_B has constant value c ∈ [0, ∞], then lim_{n→∞} V_{rp_n}(l) = c.

Proof. To prove the "if" part it suffices to consider the functions l such that l|_A has constant value x ∈ [0, ∞] ∖ {1, ∞}, and l|_B has constant value 1. For the "only if" part, if c′ ∈ (c, ∞), then l′ = (sup l|_A / c′) 1_A + 1_B satisfies c′ l′ ≥ l and lim_{n→∞} V_{rp_n}(l′) = 1, and thus limsup_{n→∞} V_{rp_n}(l) ≤ c′. Analogously, if c″ ∈ (0, c), then l″ = (inf l|_A / c″) 1_A + 1_B satisfies c″ l″ ≤ l and lim_{n→∞} V_{rp_n}(l″) = 1, and thus liminf_{n→∞} V_{rp_n}(l) ≥ c″. Hence lim_{n→∞} V_{rp_n}(l) = c. □

The continuity property appearing in Theorem 4.19 corresponds to a special case of the continuity of the evaluations V_rp(l) of the uncertain losses l ∈ [0, ∞]^{𝒫′} when 𝒫 is enlarged to 𝒫′, considered in the previous subsection. In particular, no likelihood-based decision criterion such that the corresponding function f₁ : [0, 1] → [0, 1] is positive on (0, 1] could satisfy it, if the assumption that l|_A is bounded were dropped. If Ω = 𝒫, and the sequence rp₁, rp₂, … describes the evolution of the uncertain knowledge about the models in 𝒫 obtained by observing more and more data, then Theorem 4.19 implies that a consistent likelihood-based decision criterion tends to evaluate a bounded uncertain loss l correctly, when l is constant on B and the models in A tend to be discarded by the information obtained. In particular, a consistent likelihood-based decision criterion would tend to select d₁ in the situation considered at the beginning of the present subsection. Consider a likelihood-based decision criterion and the corresponding functionals V_rp; a sequence d₁, d₂, … ∈ 𝒟 of decisions is said


to be obtained by applying the decision criterion to L with respect to the sequence rp₁, rp₂, … of nondegenerate relative plausibility measures on 𝒫, if lim_{n→∞} |V_{rp_n}(l_{d_n}) − inf_{d∈𝒟} V_{rp_n}(l_d)| = 0. That is, the single decisions d_n do not need to be strictly optimal according to the decision criterion: only the limiting behavior of the evaluations V_{rp_n}(l_{d_n}) is prescribed, but this suffices in the next theorems.

Theorem 4.20. Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and a consistent likelihood-based decision criterion with corresponding functionals V_rp. Let d₁, d₂, … ∈ 𝒟 be a sequence of decisions obtained by applying the decision criterion to L with respect to a sequence rp₁, rp₂, … of nondegenerate relative plausibility measures on 𝒫. If

lim_{c↑∞} limsup_{n→∞} [V_{rp_n}(l_d) − V_{rp_n}(min{l_d, c})] = 0 for all d ∈ 𝒟, (4.17)

and there are a model P ∈ 𝒫 and a topology on 𝒫 such that {l_d : d ∈ 𝒟} is equicontinuous at P, and lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 for all neighborhoods ℋ of P, then

lim_{n→∞} L(P, d_n) = inf_{d∈𝒟} L(P, d).

Proof. Let ε ∈ (0, 1), and let d′ ∈ 𝒟 be such that L(P, d′) < inf_{d∈𝒟} L(P, d) + ε. Since {l_d : d ∈ 𝒟} is equicontinuous at P, there is a neighborhood ℋ of P such that ‖l_d|_ℋ − L(P, d)‖ < ε for all d ∈ 𝒟. Condition (4.17) implies that there is a c′ ∈ (0, ∞) such that limsup_{n→∞} [V_{rp_n}(l_{d′}) − V_{rp_n}(min{l_{d′}, c′})] < ε. Let l = [L(P, d′) + ε] 1_ℋ + min{l_{d′}, c′} 1_{𝒫∖ℋ}. For sufficiently large n we have V_{rp_n}(1_ℋ) > 1 − ε (since f₀ is continuous at 1), V_{rp_n}(l) < L(P, d′) + 2ε (by Theorem 4.19), V_{rp_n}(l_{d′}) − V_{rp_n}(min{l_{d′}, c′}) < 2ε (thanks to the above choice of c′), and V_{rp_n}(l_{d_n}) < inf_{d∈𝒟} V_{rp_n}(l_d) + ε (by definition of the sequence d₁, d₂, …). Therefore we obtain

inf_{d∈𝒟} L(P, d) > L(P, d′) − ε > V_{rp_n}(l) − 3ε > V_{rp_n}(min{l_{d′}, c′}) − 3ε >
> V_{rp_n}(l_{d′}) − 5ε > inf_{d∈𝒟} V_{rp_n}(l_d) − 5ε > V_{rp_n}(l_{d_n}) − 6ε >
> V_{rp_n}(max{L(P, d_n) − ε, 0} 1_ℋ) − 6ε >
> max{L(P, d_n) − ε, 0} V_{rp_n}(1_ℋ) − 6ε >
> [L(P, d_n) − ε](1 − ε) − 6ε.

This implies the desired result, since ε ∈ (0, 1) was chosen arbitrarily. □


Note that the boundedness of the uncertain losses l_d is sufficient for condition (4.17) to be satisfied, and their continuity at P is necessary for {l_d : d ∈ 𝒟} to be equicontinuous at P. Theorem 4.20 states that the decisions obtained by applying a consistent likelihood-based decision criterion to L with respect to rp₁, rp₂, … tend to be optimal according to the model P, if condition (4.17) is satisfied (that is, unbounded functions l_d pose no problems), and there is a topology on 𝒫 that is sufficiently strong to allow the equicontinuity of {l_d : d ∈ 𝒟} at P and sufficiently weak to allow lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 for all neighborhoods ℋ of P. A sequence d₁, d₂, … of decisions depending on the realizations of a sequence of random objects is said to be (strongly) asymptotically optimal under the model P ∈ 𝒫 (for the decision problem described by L) if the decisions d_n ∈ 𝒟 are well-defined for sufficiently large n a.e. [P], and

lim_{n→∞} L(P, d_n) = inf_{d∈𝒟} L(P, d) a.e. [P].

When a sequence rp₁, rp₂, … of nondegenerate relative plausibility measures on 𝒫 is obtained by observing the realizations of a sequence of random objects, the asymptotic optimality (under the model P) of the decisions obtained by applying a consistent likelihood-based decision criterion to L with respect to rp₁, rp₂, … follows easily from the proof of Theorem 4.20, if condition (4.17) holds a.e. [P], and there is a topology on 𝒫 such that {l_d : d ∈ 𝒟} is equicontinuous at P, and lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 a.e. [P] for all neighborhoods ℋ of P. In particular, if under each model in 𝒫 the random objects are independent and identically distributed, then the last condition can often be proved by means of the next theorem, which is essentially based on the version of Schervish (1995, Subsection 7.3.2) of the theorem of Wald (1949) on the (strong) consistency of the maximum likelihood estimates; in order to state the theorem, we need some definitions.

Let 𝒫 be a set of probability measures on a measurable space (Ω, A), let X : Ω → 𝒳 be a random object that is discrete under each model in 𝒫, and for each x ∈ 𝒳 let rp_x be the relative plausibility measure on 𝒫 induced by the observation X = x. For each P ∈ 𝒫 and each ℋ ⊆ 𝒫 define

I_X(P, ℋ) = E_P( log( rp_X{P} / rp_X(ℋ) ) ),

where 0/0 is defined as 1 (or any other value in ℝ_{>0}: this choice has no influence on I_X(P, ℋ)). If ℋ = {P′}, then I_X(P, ℋ) is the expected value (under P)

142 4 Likelihood-Based Decision Criteria

of the information in X for discrimination in favor of P against P′, introduced by Kullback and Leibler (1951); they proved that I_X(P, {P′}) ∈ [0, ∞], and that I_X(P, {P′}) = 0 if and only if X has the same distribution under P as under P′. In general, I_X(P, ℋ) can be interpreted as the expected value (under P) of the information in X for discrimination in favor of P against ℋ; note that I_X(P, ℋ) is not necessarily in [0, ∞] (it is not even necessarily well-defined). For each x ∈ 𝒳 let lik_x be the likelihood function on 𝒫 induced by the observation X = x; the value

H_X(P) = E_P(−log[lik_X(P)])

is the entropy of the distribution of X under the model P ∈ 𝒫, studied by Shannon (1948). Since lik_x ≤ 1 for all x ∈ 𝒳, we have H_X(P) ∈ [0, ∞], and if H_X(P) is finite, then I_X(P, ℋ) ≥ −H_X(P) for all ℋ ⊆ 𝒫 (in particular, I_X(P, ℋ) is well-defined for all ℋ ⊆ 𝒫). In the next theorems only discrete random objects are considered, but the results are valid also for continuous ones, if the approximate likelihood functions based on the densities are used, and are bounded a.e. [P] (when this is not the case, the approximation is unreasonable); the approximation must be explicitly considered when determining the entropy of the distributions.
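The two quantities just introduced can be computed directly for discrete models. The following is an illustrative sketch (the probability vectors P and P1 are hypothetical examples, not taken from the text):

```python
import math

# Two hypothetical discrete models on the outcome space {0, 1, 2}.
P  = [0.5, 0.3, 0.2]
P1 = [0.4, 0.4, 0.2]

def information(P, Q):
    """Expected information in one observation for discrimination in favor
    of P against Q: E_P[log(P(X)/Q(X))], the Kullback-Leibler divergence."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def entropy(P):
    """Shannon entropy E_P[-log P(X)] of the distribution P."""
    return -sum(p * math.log(p) for p in P if p > 0)

kl = information(P, P1)   # nonnegative; zero if and only if the distributions agree
h  = entropy(P)           # nonnegative, since all probabilities are <= 1
```

Here `information(P, P1)` plays the role of I_X(P, {P′}) and `entropy(P)` that of H_X(P); both are automatically finite because the sample space is finite.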

Theorem 4.21. Let 𝒫 ∪ {P} be a set of probability measures on a measurable space (Ω, A), and let X₁, X₂, … : Ω → 𝒳 be a sequence of random objects such that under each model in 𝒫 ∪ {P} the random objects are discrete, independent, and identically distributed. Let rp be the prior (nondegenerate) relative plausibility measure on 𝒫, and let rp₁, rp₂, … be the sequence of relative plausibility measures on 𝒫 obtained by observing the realizations of the random objects X₁, X₂, …. If P′ ∈ {rp^λ > 0} is the unique P″ ∈ 𝒫 minimizing I_{X₁}(P, {P″}), and 𝒫 is equipped with a topology such that there are a compact subset 𝒫′ of 𝒫 and a set 𝒜 of relatively open subsets of 𝒫′ such that ⋃𝒜 = 𝒫′ ∖ {P′} and I_{X₁}(P, A) > I_{X₁}(P, {P′}) for all A ∈ 𝒜 and also for A = 𝒫 ∖ 𝒫′, then the relative plausibility measures rp₁, rp₂, … are nondegenerate a.e. [P], and lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 a.e. [P] for all neighborhoods ℋ of P′.

In particular, the set 𝒜 certainly exists if the relative topology on 𝒫′ is metrizable, the likelihood function on 𝒫′ induced by observing the realization of X₁ is upper semicontinuous a.e. [P], and H_{X₁}(P) is finite.

Proof. Let lik_k be the likelihood function on 𝒫 induced by observing the realization of X_k, and let Y_k = inf_A log( lik_k(P′) / lik_k ), where 0/0 is defined as 1, and A is a subset of 𝒫 such that I_{X₁}(P, A) > I_{X₁}(P, {P′}). Since such a set A certainly exists (for example A = 𝒫 ∖ 𝒫′), we obtain in particular that I_{X₁}(P, {P′}) is finite, and thus lik_k(P′) > 0 a.e. [P]. Therefore rp₁, rp₂, … are nondegenerate a.e. [P], and

log( rp_n{P′} / rp_n(A) ) = inf_A log( ( rp^λ(P′) ∏_{k=1}^n lik_k(P′) ) / ( rp^λ ∏_{k=1}^n lik_k ) ) ≥ log( rp{P′} / rp(A) ) + ∑_{k=1}^n Y_k

holds a.e. [P]. The strong law of large numbers implies that the right-hand side of the inequality tends to infinity a.e. [P] as n → ∞, since the Y_k are independent and identically distributed (under P), and

E_P(Y₁) = E_P( log( lik₁(P) / sup_A lik₁ ) ) − E_P( log( lik₁(P) / lik₁(P′) ) ) = I_{X₁}(P, A) − I_{X₁}(P, {P′}) > 0.

That is, lim_{n→∞} rp_n(A) = 0 a.e. [P] for all the subsets A of 𝒫 such that I_{X₁}(P, A) > I_{X₁}(P, {P′}). Let ℋ be a neighborhood of P′. There is an open neighborhood ℋ′ of P′ such that ℋ′ ⊆ ℋ; and since 𝒫′ ∖ ℋ′ is a compact subset of ⋃𝒜, there are A₁, …, A_m ∈ 𝒜 such that 𝒫′ ∖ ℋ′ is a subset of ⋃_{i=1}^m A_i. Hence

rp_n(𝒫 ∖ ℋ) ≤ max{ rp_n(𝒫 ∖ 𝒫′), rp_n(A₁), …, rp_n(A_m) },

and the right-hand side tends to 0 a.e. [P] as n → ∞.

To prove the last statement of the theorem, we restrict attention to the set 𝒫′ equipped with the relative topology. Since this is metrizable, for each P″ ∈ 𝒫′ ∖ {P′} there is a sequence ℋ₁ ⊇ ℋ₂ ⊇ … of closed neighborhoods of P″ such that ⋂_{m=1}^∞ ℋ_m = {P″}. Since 𝒫′ is compact, if lik₁|_{𝒫′} is upper semicontinuous, then each ℋ_m contains a model P_m such that lik₁(P_m) = sup_{ℋ_m} lik₁, and lim sup_{m→∞} lik₁(P_m) ≤ lik₁(P″). That is, lim_{m→∞} sup_{ℋ_m} lik₁ = lik₁(P″) a.e. [P]. Since H_{X₁}(P) is finite, we have I_{X₁}(P, A) > −∞ for all A ⊆ 𝒫′. If I_{X₁}(P, ℋ₁) = ∞, then in particular

I_{X₁}(P, ℋ₁) > I_{X₁}(P, {P′}).

If I_{X₁}(P, ℋ₁) < ∞, then sup_{ℋ₁} lik₁ > 0 a.e. [P], and

lim inf_{m→∞} I_{X₁}(P, ℋ_m) = lim inf_{m→∞} E_P( log( sup_{ℋ₁} lik₁ / sup_{ℋ_m} lik₁ ) ) + I_{X₁}(P, ℋ₁) ≥
≥ E_P( log( sup_{ℋ₁} lik₁ / lik₁(P″) ) ) + I_{X₁}(P, ℋ₁) = I_{X₁}(P, {P″}) > I_{X₁}(P, {P′}),

where the first inequality follows from Fatou's lemma. Thus in any case we have I_{X₁}(P, ℋ_m) > I_{X₁}(P, {P′}) for a sufficiently large m; let A ⊆ ℋ_m be a relatively open neighborhood of P″: since I_{X₁}(P, A) ≥ I_{X₁}(P, ℋ_m), the collection 𝒜 of these sets A (for all P″ ∈ 𝒫′ ∖ {P′}) has the desired properties. □

In Theorem 4.21, the model P generating the data is not assumed to be in 𝒫. If the premise of Theorem 4.21 is satisfied, the loss function L on 𝒫 × 𝒟 is bounded, and {l_d : d ∈ 𝒟} is equicontinuous, then the decisions obtained by applying a consistent likelihood-based decision criterion to L with respect to rp₁, rp₂, … tend to be optimal (a.e. [P]) according to the unique model P′ ∈ 𝒫 that is expected to be the most compatible with the data, in the sense that P′ minimizes the expected information for discrimination in favor of P contained in each observation. The situation is more complicated when this expected information is minimized by several models in 𝒫, because then the resulting relative plausibility measures can be asymptotically unstable on the set of these models (see for example Berk, 1966). If P ∈ 𝒫, and the distribution of X₁ under P is different from the distribution of X₁ under any other model in 𝒫, then P is the unique P″ ∈ 𝒫 minimizing I_{X₁}(P, {P″}), and thus Theorems 4.20 and 4.21 can be joined to prove the asymptotic optimality of the decisions obtained by applying a consistent likelihood-based decision criterion.

Theorem 4.22. Let 𝒫 be a finite set of probability measures on a measurable space (Ω, A), and let X₁, X₂, … : Ω → 𝒳 be a sequence of random objects such that under each model in 𝒫 the random objects are discrete, independent, and identically distributed, and their distributions are different under different models in 𝒫. Consider a statistical decision problem described by a finite loss function L on 𝒫 × 𝒟, with prior (nondegenerate) relative plausibility measure rp on 𝒫. All sequences of decisions obtained by applying a consistent likelihood-based decision criterion to L (with respect to the sequence of relative plausibility measures on 𝒫 obtained by observing the realizations of the random objects X₁, X₂, …) are asymptotically optimal under all models in {rp^λ > 0}.

Proof. Let P ∈ {rp^λ > 0}, and consider the discrete topology on 𝒫. If 𝒫′ = 𝒫, and 𝒜 is the set of all singletons of 𝒫 ∖ {P}, then the premise of Theorem 4.21 is satisfied, with P′ = P. Since l_d is bounded for all d ∈ 𝒟, and {l_d : d ∈ 𝒟} is trivially equicontinuous with respect to the discrete topology on 𝒫, the premise of Theorem 4.20 holds a.e. [P], and the desired result follows. □
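The setting of Theorem 4.22 can be simulated directly. The sketch below uses hypothetical Bernoulli models and a hypothetical loss table, and evaluates each decision by the Shilkret integral sup_P rp^λ(P) l_d(P), as the MPL criterion does in Example 4.25; with enough data the chosen decision becomes optimal under the data-generating model:

```python
import math, random

random.seed(1)

# Finite model set: three Bernoulli(p) models; the data are generated under p_true.
models = [0.2, 0.5, 0.8]
p_true = 0.5

# Hypothetical loss table: loss[P][d] for decisions d in {0, 1, 2}.
loss = {0.2: [0.0, 2.0, 5.0],
        0.5: [3.0, 0.0, 3.0],
        0.8: [5.0, 2.0, 0.0]}

def rp_density(xs):
    """Relative plausibility density on the models: the likelihood normalized
    so that its supremum is 1 (flat prior rp)."""
    loglik = {p: sum(math.log(p if x else 1 - p) for x in xs) for p in models}
    m = max(loglik.values())
    return {p: math.exp(loglik[p] - m) for p in models}

def mpl_decision(xs):
    """Minimize the Shilkret-type evaluation V_rp(l_d) = max_P rp^lambda(P) * l_d(P)."""
    rp = rp_density(xs)
    return min(range(3), key=lambda d: max(rp[p] * loss[p][d] for p in models))

xs = [1 if random.random() < p_true else 0 for _ in range(2000)]
# The plausibility concentrates on p_true = 0.5, and the chosen decision
# is the one that is optimal under p_true (d = 1).
```

The concentration of rp_n on the true model is exactly the conclusion of Theorem 4.21 in the finite, discrete-topology case used in the proof above.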


When the set 𝒫 of statistical models considered is infinite, the uncertain losses l_d can be finite but unbounded, and usually the premise of Theorem 4.21 cannot be satisfied when using the discrete topology on 𝒫; hence in general the condition (4.17) and the equicontinuity of {l_d : d ∈ 𝒟} are not trivially satisfied. In some cases they pose no problem, as in the following example, while sometimes the subsequent generalization of Theorem 4.20 can be useful.

Example 4.23. Consider the estimation problem of Example 1.1, and let the random variables X₁, X₂, … : Ω → {0, 1} describe the results of each single binary experiment (that is, X = ∑_{k=1}^n X_k). With respect to the topology on 𝒫 induced by the metric ρ : (P_p, P_{p′}) ↦ |p − p′|, both likelihood functions P_p ↦ 1 − p and P_p ↦ p on 𝒫 induced by observing the two possible realizations 0 and 1 of X₁ are continuous, 𝒫 is compact, and |l_d(P) − l_d(P′)| ≤ 2 ρ(P, P′) for all d ∈ [0, 1] and all P, P′ ∈ 𝒫. Since L is bounded and H_{X₁}(P) ≤ H_{X₁}(P_{1/2}) = log 2 for all P ∈ 𝒫, Theorems 4.20 and 4.21 imply that as n tends to infinity, all sequences of decisions obtained by applying a consistent likelihood-based decision criterion to the decision problem described by L are asymptotically optimal under all models in 𝒫. To be precise, in order to obtain this result we must consider that in the proof of Theorem 4.20, for each ε ∈ (0, 1) we use the assumption lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 only for a particular neighborhood ℋ of P. ◊
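The concentration lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 in this binomial example can be observed numerically. A sketch (the true parameter, sample size, grid discretization of 𝒫, and neighborhood radius are illustrative choices):

```python
import math, random

random.seed(0)
p_true = 0.3
n = 5000
k = sum(1 if random.random() < p_true else 0 for _ in range(n))  # successes

def log_lik(p):
    # log-likelihood of k successes in n binary experiments under P_p
    if p in (0.0, 1.0):
        return -math.inf if 0 < k < n else 0.0
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1001)]      # discretized model set
m = max(log_lik(p) for p in grid)

def rp(A):
    """Relative plausibility of a set of models: sup of the normalized likelihood."""
    return max(math.exp(log_lik(p) - m) for p in A)

neighborhood = [p for p in grid if abs(p - p_true) < 0.05]
complement   = [p for p in grid if abs(p - p_true) >= 0.05]
# rp(complement) decays exponentially in n, so it is negligible here.
```

Maximization over a set, rather than summation, is what makes rp a (completely maxitive) plausibility measure rather than a posterior probability.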

Theorem 4.24. Consider a statistical decision problem described by a loss function L on 𝒫 × 𝒟, and a consistent likelihood-based decision criterion with corresponding functionals V_rp. Let rp₁, rp₂, … be a sequence of relative plausibility measures on 𝒫, depending on the realizations of a sequence of random objects. If there are a model P ∈ 𝒫 and a set ℳ of nondegenerate relative plausibility measures on 𝒫 such that rp_n ∈ ℳ for sufficiently large n a.e. [P], and there is a subset 𝒟′ of 𝒟 such that

lim_{c→∞} sup_{rp∈ℳ} [V_rp(l_d) − V_rp(min{l_d, c})] = 0 for all d ∈ 𝒟′,

inf_{d∈𝒟′} L(P, d) = inf_{d∈𝒟} L(P, d), there is a positive real number ε such that all d′ ∈ 𝒟 satisfying V_rp(l_{d′}) ≤ inf_{d∈𝒟} V_rp(l_d) + ε for some rp ∈ ℳ are in 𝒟′, and there is a topology on 𝒫 such that {l_d : d ∈ 𝒟′} is equicontinuous at P, and lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 a.e. [P] for all neighborhoods ℋ of P, then all sequences of decisions obtained by applying the decision criterion to L with respect to rp₁, rp₂, … are asymptotically optimal under P.

Proof. It suffices to adapt the proof of Theorem 4.20. □


If the premise of Theorem 4.21 is satisfied, with P′ = P, then its conclusion can be useful for selecting a set ℳ of nondegenerate relative plausibility measures on 𝒫 such that rp_n ∈ ℳ for sufficiently large n a.e. [P]. Even when we consider a loss function L and a consistent likelihood-based decision criterion that do not fulfill the condition (4.17) and the equicontinuity of {l_d : d ∈ 𝒟} at P, it is sometimes possible to select a subset 𝒟′ of 𝒟 such that the premise of Theorem 4.24 is satisfied (for some ε ∈ ℝ_{>0}), as in the next example.

Example 4.25. Consider the estimation problem of Example 4.18, with P_μ ↦ ½ φ(x_i − μ) as approximation of the likelihood function on 𝒫 induced by the (imprecise) observation X_i = x_i. With respect to the topology on 𝒫 induced by the metric ρ : (P_μ, P_{μ′}) ↦ |μ − μ′|, the set 𝒫′ = {P_{μ′} : μ′ ∈ [μ − 2, μ + 2]} is compact for all μ ∈ ℝ, and all the likelihood functions on 𝒫 induced by observing the possible realizations of X_i are continuous. Since I_{X₁}(P_μ, 𝒫 ∖ 𝒫′) ≈ 0.398 for all μ ∈ ℝ, and H_{X₁}(P) = ½ + log(2√(2π)), Theorem 4.21 (more precisely, its version for continuous random objects) implies that lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 a.e. [P_μ] for all μ ∈ ℝ and all neighborhoods ℋ of P_μ, where rp₁, rp₂, … is the sequence of (nondegenerate) relative plausibility measures on 𝒫 obtained by observing the realizations of the random variables X₁, X₂, ….

Consider for instance the MPL criterion and the decision problem described by the loss function L : (P_μ, d) ↦ (μ − d)² (that is, the estimation of μ with squared error loss); since {l_d : d ∈ 𝒟} is nowhere equicontinuous on 𝒫, Theorem 4.20 cannot be directly applied. But for each μ ∈ ℝ the above result implies that rp_n ∈ ℳ_μ for sufficiently large n a.e. [P_μ], when ℳ_μ is defined for instance as the set of all (nondegenerate) relative plausibility measures on 𝒫 with density function proportional to a function of the form P_{μ′} ↦ [φ(x − μ′)]^m, for all x ∈ [μ − 1, μ + 1] and all positive integers m. For each μ ∈ ℝ let 𝒟′ = [μ − 2, μ + 2]; Theorems 2.21 and 2.6 imply that for all μ ∈ ℝ and all d ∈ 𝒟′

lim_{c→∞} sup_{rp∈ℳ_μ} [ ∫^S l_d d rp − ∫^S min{l_d, c} d rp ] ≤ lim_{c→∞} sup_{rp∈ℳ_μ} ∫^S max{l_d − c, 0} d rp ≤
≤ lim_{c→∞} sup_{y∈(√c, ∞)} (y² − c) exp(−(y − 3)²/2) = 0.

Moreover, for each μ ∈ ℝ we have inf_{d∈𝒟′} L(P_μ, d) = 0 = inf_{d∈𝒟} L(P_μ, d), and since the Shilkret integral respects distributional dominance (on the class of all monotone measures), for ε = 1 − 2e⁻¹ ≈ 0.264, all d′ ∈ 𝒟 ∖ 𝒟′, and all rp ∈ ℳ_μ we also have

∫^S l_{d′} d rp ≥ l_{d′}(P_x) ≥ 1 = 2e⁻¹ + ε ≥ inf_{d∈𝒟} ∫^S l_d d rp + ε.

Since sup_{d∈𝒟′} |l_d − l_d(P_μ)|, as a function P_{μ′} ↦ max{(μ′ − μ − 2)², (μ′ − μ + 2)²} − 4, is continuous at P_μ (with respect to the topology on 𝒫 induced by the metric ρ) for all μ ∈ ℝ, Theorem 4.24 implies that all sequences of decisions obtained by applying the MPL criterion to L with respect to rp₁, rp₂, … are asymptotically optimal under all models in 𝒫. This simply means that as n tends to infinity, the arithmetic mean of X₁, …, X_n is strongly consistent for μ, as implied also by the strong law of large numbers. ◊
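For the MPL criterion the evaluation of an estimate d in this example is a Shilkret integral, sup_{μ′} (μ′ − d)² rp^λ(μ′). A numerical sketch (with a hypothetical plausibility density exp(−m(μ′ − x)²/2), as in the class ℳ_μ above) confirms that the evaluation is minimized at d = x and that the minimal value is 2/(m e) ≤ 2e⁻¹, the bound used above:

```python
import math

def shilkret_eval(d, x, m, lo=-10.0, hi=10.0, steps=20001):
    """Shilkret integral of l_d(mu) = (mu - d)^2 with respect to the maxitive
    measure with density exp(-m (mu - x)^2 / 2), approximated on a grid."""
    best = 0.0
    for i in range(steps):
        mu = lo + (hi - lo) * i / (steps - 1)
        best = max(best, (mu - d) ** 2 * math.exp(-m * (mu - x) ** 2 / 2))
    return best

x, m = 0.7, 4                                    # hypothetical data summary and weight
vals = {i / 50: shilkret_eval(i / 50, x, m) for i in range(0, 71)}
d_opt = min(vals, key=vals.get)                  # minimized at d = x
v_opt = vals[d_opt]                              # approximately 2 / (m * e)
```

The exact supremum sup_t t² exp(−m t² / 2) = 2/(m e) is attained at t² = 2/m, which is why the bound 2e⁻¹ holds uniformly over all m ≥ 1.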

Of course, sometimes the premise of Theorem 4.24 cannot be satisfied,

simply because the decisions obtained by applying the likelihood-based

decision criterion considered are not asymptotically optimal, as in the next

example.

Example 4.26. In the situation of Example 4.25, consider the loss function L on 𝒫 × {d₁, d₂} defined by l_{d₁} = c₁ I_ℋ and l_{d₂} = c₂ I_{𝒫∖ℋ}, where c₁, c₂ ∈ ℝ_{>0} and ℋ = {P_μ : μ ∈ ℝ ∖ ℝ_{>0}}. That is, consider the problem of testing if the mean of a normal distribution with known variance is positive or not, without prior information about the relative plausibility of the possible values of the mean. Since L is bounded, and {l_{d₁}, l_{d₂}} is equicontinuous at P_μ for all μ ∈ ℝ ∖ {0}, with respect to the topology on 𝒫 considered in Example 4.25, Theorem 4.24 implies that all sequences of decisions obtained by applying a consistent likelihood-based decision criterion to L with respect to rp₁, rp₂, … are asymptotically optimal under all models in 𝒫 ∖ {P₀} (it suffices to choose 𝒟′ = {d₁, d₂}, and define ℳ as the set of all nondegenerate relative plausibility measures on 𝒫). But l_{d₁} and l_{d₂} are continuous at P₀ only with respect to a topology on 𝒫 such that ℋ is a neighborhood of P₀, while lim_{n→∞} rp_n(𝒫 ∖ ℋ) = 0 a.e. [P₀] does not hold, since P₀{rp_n(𝒫 ∖ ℋ) = 1} = ½ for all n. That is, the premise of Theorem 4.24 cannot be satisfied when P = P₀, and in fact it can be easily proved that no sequence of decisions obtained by applying a consistent likelihood-based decision criterion to L with respect to rp₁, rp₂, … can be asymptotically optimal under P₀. ◊

5 Point Estimation

The present chapter focuses on the estimation of points in the Euclidean space ℝᵏ: many statistical decision problems can be formulated as problems of point estimation in ℝᵏ by expressing the estimation error in some specific way (see also Wald, 1939). We shall consider the application of likelihood-based decision criteria to the problems of point estimation in ℝᵏ, with particular attention to the asymptotic efficiency of the resulting sequences of estimators, and to some important properties of those likelihood-based decision criteria that can be represented by means of the Shilkret integral.

5.1 Likelihood-Based Estimates

Let 𝒫 be a set of statistical models, and let g : 𝒫 → ℝᵏ be a mapping, where k is a positive integer. Consider the problem of estimating g(P) on the basis of a nondegenerate relative plausibility measure rp on ℝᵏ describing the uncertain knowledge about the elements of ℝᵏ; that is, consider the problem of estimating g(P) on the basis of the pseudo likelihood function rp^λ. When some data are observed, rp^λ can be defined as equivalent to the induced profile likelihood function on ℝᵏ; but other pseudo likelihood functions with respect to g can also be used, and prior information can be included. In particular, the complexity of the models in 𝒫 can be taken into account: for example by penalizing it in accordance with the criteria introduced by Akaike (1973) or Rissanen (1978). Moreover, as seen in Example 4.18, in general an estimate of g(P) based on the pseudo likelihood function rp^λ can be robust only if robustness aspects were considered when defining rp^λ.


When the estimation problem is stated without reference to a particular loss function, the usual likelihood-based inference methods can be applied to the pseudo likelihood function rp^λ, by considering it as equivalent to the profile likelihood function on ℝᵏ with respect to g. If the maximum likelihood estimate γ_ML of g(P) exists and is really unique, in the sense that rp[(γ_ML + B_ε)ᶜ] < rp{γ_ML} for all ε ∈ ℝ_{>0}, then it is certainly a most natural point estimate of g(P). But the method of maximum likelihood uses rp^λ in a very limited way: the additional consideration of a likelihood-based confidence region for g(P) with cutoff point β ∈ (0, 1) can be very informative. In the case with k = 1, such a confidence region is often an interval, and its endpoints can be generalized by the following conjugate upper and lower evaluations of g(P):

∫^C id_ℝ d(δ ∘ rp)  and  ∫_C id_ℝ d(δ ∘ rp),

where δ : [0, 1] → [0, 1] is a nondecreasing function such that δ(0) = 0 and δ(1) = 1, and the integrals are defined as the limits of the integrals with id_ℝ and rp restricted to the interval [a, b], when −a and b tend to infinity and the Choquet integral is extended by means of equality (2.10). In the case with δ = I_{(β,1]} for some β ∈ (0, 1), the above conjugate upper and lower evaluations of g(P) are well-defined and correspond respectively to the supremum and to the infimum of the likelihood-based confidence region for g(P) with cutoff point β.
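With a normal-shaped relative likelihood exp(−n(γ − x̄)²/(2σ²)), the confidence region {γ : rp^λ(γ) ≥ β} is the interval x̄ ± σ√(−2 log β)/√n. A sketch under these assumptions (the numeric inputs are hypothetical):

```python
import math

def likelihood_interval(xbar, sigma, n, beta):
    """Endpoints of the likelihood-based confidence region
    {gamma : rp(gamma) >= beta} for the normal-shaped relative likelihood
    exp(-n (gamma - xbar)^2 / (2 sigma^2))."""
    half = sigma * math.sqrt(-2.0 * math.log(beta) / n)
    return xbar - half, xbar + half

lo, hi = likelihood_interval(xbar=1.2, sigma=2.0, n=50, beta=0.15)
# the relative likelihood at each endpoint equals the cutoff beta
```

This is exactly the case δ = I_{(β,1]}: the upper and lower evaluations reduce to the endpoints of the cutoff-β region.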

It can be easily proved that if the maximum likelihood estimate γ_ML of g(P) exists, and k = 1, then the above conjugate upper and lower evaluations of g(P) are well-defined and can be expressed as

γ_ML + ∫^C id_{ℝ>0} d(δ ∘ rp ∘ t_{γ_ML}⁻¹ |_{ℝ>0})  and  γ_ML − ∫^C id_{ℝ>0} d[δ ∘ rp ∘ (−t_{γ_ML})⁻¹ |_{ℝ>0}],

respectively. Any regular integral (on the class of all monotone measures) could be substituted for the Choquet integral in these two expressions, and the resulting upper and lower evaluations of g(P) could be combined to obtain a point estimate of g(P), but the meaning of this estimate would not be very clear. In order to obtain a point estimate of g(P) depending on the pseudo likelihood function rp^λ in a less limited way than the maximum likelihood estimate γ_ML does, it is better to explicitly describe the estimation problem by means of a loss function and then apply a likelihood-based decision criterion.


A function f : ℝᵏ → [0, ∞] is said to be quasiconvex if

f[λx + (1 − λ)y] ≤ max{f(x), f(y)} for all x, y ∈ ℝᵏ and all λ ∈ (0, 1);

f is said to be strictly quasiconvex if the inequality is strict when x ≠ y. An (estimation) error function for the problem of estimating γ ∈ ℝᵏ is a function L : ℝᵏ × ℝᵏ → [0, ∞] such that

L(γ, d) = w(γ) f(γ − d) for all γ, d ∈ ℝᵏ,

where w : ℝᵏ → ℝ_{>0} is a function, and f : ℝᵏ → [0, ∞] is a quasiconvex function with f(0) = 0. The statistical decision problem of estimating g(P) with error function L is described by a loss function L′ on 𝒫 × 𝒟, where 𝒟 is a subset of ℝᵏ, and L′(P, d) = L[g(P), d] for all P ∈ 𝒫 and all d ∈ 𝒟; obviously, the definition of L outside the set {(g(P), d) : P ∈ 𝒫, d ∈ 𝒟} is irrelevant for the decision problem described by L′. Condition (4.6) implies that applying a likelihood-based decision criterion to L′ with respect to a likelihood function lik on 𝒫 is equivalent to applying it to L|_{ℝᵏ×𝒟} with respect to the profile likelihood function lik_g on ℝᵏ. Therefore, more generally, we can consider a nondegenerate relative plausibility measure rp on ℝᵏ and compare the estimates d ∈ 𝒟 ⊆ ℝᵏ by means of the evaluation V_rp(l_d) of l_d, where V_rp is the functional corresponding to a likelihood-based decision criterion, and for each d ∈ ℝᵏ the function l_d on ℝᵏ is defined by l_d(γ) = L(γ, d) (for all γ ∈ ℝᵏ).

Theorem 5.1. Consider the problem of estimating γ ∈ ℝᵏ on the basis of a nondegenerate relative plausibility measure rp on ℝᵏ, with an error function L such that f : x ↦ f′(|Mx|), where f′ is a function on [0, ∞), and M ∈ ℝᵏˣᵏ is an invertible matrix. Let 𝒢 be the closed convex hull of {rp^λ > 0}. For each d ∈ ℝᵏ there is a d′ ∈ 𝒢 such that according to no likelihood-based decision criterion d is preferred to d′.

Moreover, if 𝒢 is bounded, and L is such that f′ is strictly increasing, inf_{{rp^λ>0}} w > 0, and sup_{{rp^λ>0}} w < ∞, then for each d ∈ ℝᵏ ∖ 𝒢 there is a d′ ∈ 𝒢 that is preferred to d according to all likelihood-based decision criteria.

Proof. Let d ∉ 𝒢 (otherwise we can choose d′ = d), and let d′ be the element of 𝒢 such that Md′ is the projection of Md on the closed, convex set {Mγ : γ ∈ 𝒢}. Since |M(γ − d′)| ≤ |M(γ − d)| for all γ ∈ 𝒢, we have l_{d′}|_{{rp^λ>0}} ≤ l_d|_{{rp^λ>0}}, and the first statement of the theorem follows from Theorem 4.2.


At the beginning of Section 4.1 we noted that d′ is preferred to d according to all likelihood-based decision criteria, if ess_rp sup l_{d′} < ∞ and inf(l_d − l_{d′}) > 0. Theorem 4.2 implies that the second condition can be weakened to inf_{{rp^λ>0}}(l_d − l_{d′}) > 0, while the first one can be expressed as sup_{{rp^λ>0}} l_{d′} < ∞; these two conditions follow easily from the premise of the second statement of the theorem. □

Theorem 5.1 implies that for an important class of error functions, when applying a likelihood-based decision criterion to the problem of estimating γ ∈ ℝᵏ on the basis of a nondegenerate relative plausibility measure rp on ℝᵏ, we can restrict attention to the estimates in the closed convex hull of {rp^λ > 0}. Theorem 5.1 can be easily generalized; in particular, in the case with k = 1, the assumption about f is not necessary for the first statement, while for the second one it can be replaced by the assumption that f is strictly quasiconvex. It is important to note that for the problem of simultaneous estimation of the components of γ ∈ ℝᵏ (when k > 1), it is generally difficult to obtain a reasonable error function by combining in some way the estimation errors of the single components. In fact, the scale in which the estimation error is expressed is irrelevant (in the sense that the error functions L and aL are considered equivalent, for all a ∈ ℝ_{>0}), but the relative scale of the estimation errors of the single components can be very important. That is, either γ ∈ ℝᵏ is a genuine vector (in the sense that we are really interested in γ, and not only in its components with respect to the standard basis of ℝᵏ), in which case reasonable error functions are usually of the form considered in Theorem 5.1, or else it is probably better to estimate the k components of γ separately; therefore, the case with k = 1 is by far the most important one.

5.1.1 Asymptotic Efficiency

Let the loss function L′ on 𝒫 × 𝒟 describe the statistical decision problem of estimating g(P) with an error function L on ℝᵏ × ℝᵏ, where k is a positive integer, g : 𝒫 → ℝᵏ is a mapping, and 𝒟 is a subset of ℝᵏ containing the image of 𝒫 under g. In Subsection 4.2.2 we have seen that under suitable regularity conditions, all sequences of decisions obtained by applying a consistent likelihood-based decision criterion to L′ (with respect to the sequence of relative plausibility measures on 𝒫 obtained by observing the realizations of a sequence of random objects) are asymptotically optimal under all models in 𝒫. This implies that the corresponding sequences of estimators of g(P) are (strongly) consistent, if the error function L has the following property: inf_{B_εᶜ} f > 0 for all ε ∈ ℝ_{>0}. The condition f⁻¹{0} = {0} is necessary for this property to hold; in the case with k = 1 it is also sufficient, while in the case with k > 1 it is sufficient when f is continuous.

Under suitable regularity conditions, the method of maximum likelihood leads to an asymptotically efficient sequence of estimators; and under very similar regularity conditions, the likelihood function tends to be equivalent to the density of a normal distribution with as mean the maximum likelihood estimate (see for example Fraser and McDunnough, 1984). In this case, the next theorem can be used to show that the estimates obtained by applying a certain kind of likelihood-based decision criterion (such as the MPL criterion) tend to the maximum likelihood estimates sufficiently fast to ensure that the corresponding sequences of estimators are asymptotically efficient as well.

Theorem 5.2. Consider the problem of estimating γ ∈ ℝᵏ with an error function L such that w is continuous at x′ ∈ ℝᵏ, and f satisfies

lim_{c↓0} sup_{x∈B_δ} | c^{−α} f(c x) − |Mx|^α | = 0 for all δ ∈ ℝ_{>0},

where M ∈ ℝᵏˣᵏ is an invertible matrix, and α ∈ ℝ_{>0}. Let rp₁, rp₂, … be a sequence of nondegenerate relative plausibility measures on ℝᵏ such that

lim_{n→∞} sup_{x∈B_δ} | rp_n^λ(x_n + Σ_n x) − exp(−|x|²/2) | = 0 for all δ ∈ ℝ_{>0},

where the limit of the sequence x₁, x₂, … ∈ ℝᵏ is x′, and Σ₁, Σ₂, … ∈ ℝᵏˣᵏ is a sequence of matrices such that lim_{n→∞} √n Σ_n exists and is invertible. Consider a likelihood-based decision criterion corresponding to the functionals V_rp : l ↦ ∫ l d(f₁ ∘ rp) for a nondecreasing surjection f₁ : [0, 1] → [0, 1] and a regular integral (on the class of all monotone measures) such that the corresponding functional F on the nonincreasing functions satisfies

ψ > φ on ℝ_{>0} ⟹ lim inf_{c→∞} [F(ψ I_{[0,c)}) − F(φ I_{[0,c)})] > 0 for all nonincreasing ψ : ℝ_{>0} → [0, 1],

where φ is defined by φ(x) = f₁[exp(−x^{2/α})] (for all x ∈ ℝ_{>0}). If

lim_{δ→∞} lim sup_{n→∞} n^{α/2} | V_{rp_n}(l_{x_n}) − V_{rp_n}(l_{x_n} I_{{x_n+Σ_n x : x∈B_δ}}) | = 0, (5.1)

then all sequences d₁, d₂, … ∈ ℝᵏ satisfy

lim_{n→∞} n^{α/2} [ V_{rp_n}(l_{d_n}) − inf_{d∈ℝᵏ} V_{rp_n}(l_d) ] = 0 ⟹ lim_{n→∞} √n (d_n − x_n) = 0.

Proof. Let Σ = lim_{n→∞} √n Σ_n; since Σ is invertible, for sufficiently large n the matrices Σ_n are invertible too. Conditions (5.1) and (4.6) imply that for all ε ∈ ℝ_{>0} there is a δ_ε ∈ ℝ_{>0} such that for all δ ∈ [δ_ε, ∞) and sufficiently large n we have

n^{α/2} | V_{rp_n}(l_{x_n}) − V_{rp_n}(l_{x_n} I_{{x_n+Σ_n x : x∈B_δ}}) | ≤ ε.

For each γ ∈ ℝᵏ and each δ ∈ ℝ_{>0}, let f′_γ and μ_δ be respectively the functions and the completely maxitive measures on ℝᵏ defined by

f′_γ(x) = |MΣ(x − γ)|^α  and  μ_δ^λ(x) = f₁[exp(−|x|²/2)] if x ∈ B_δ, and μ_δ^λ(x) = 0 if x ∉ B_δ.

As n tends to infinity, the completely maxitive measures on ℝᵏ with density functions (f₁ ∘ rp_n^λ ∘ t_{x_n} ∘ Σ_n) I_{B_δ} converge uniformly to μ_δ, and the functions (w ∘ t_{x_n} ∘ Σ_n) n^{α/2} (f ∘ Σ_n) I_{B_δ} converge uniformly to w(x′) f′₀ I_{B_δ}. Hence, using Theorem 2.23 we obtain that for all δ ∈ [δ_ε, ∞) and sufficiently large n

n^{α/2} V_{rp_n}(l_{x_n}) ≤ w(x′) ∫ f′₀ I_{B_δ} dμ_δ + 2ε,

since the integral is regular and condition (2.6) is fulfilled. Let a, b ∈ ℝ_{>0} be respectively the infimum and the supremum of {|MΣx| : x ∈ ℝᵏ, |x| = 1}, and let c = w(x′)(b δ₁)^α + 2: we have n^{α/2} V_{rp_n}(l_{x_n}) ≤ c for sufficiently large n. Since for all d ∈ ℝᵏ and all n we have

n^{α/2} V_{rp_n}(l_d) ≥ f₁[rp_n^λ(x_n)] w(x_n) n^{α/2} f(x_n − d),

for a sufficiently large δ′ ∈ ℝ_{>0} and sufficiently large n we obtain

inf_{d ∉ {x_n+Σ_n x : x∈B_{δ′}}} n^{α/2} V_{rp_n}(l_d) ≥ w(x′)(a δ′)^α − 1 ≥ c + 1,

and thus d_n ∈ {x_n + Σ_n x : x ∈ B_{δ′}} for sufficiently large n. Assume that the conclusion of the theorem does not hold: then there is an ε′ ∈ ℝ_{>0} such that d_n ∉ {x_n + Σ_n x : x ∈ B_{ε′}} for infinitely many n, and thus there is a subsequence d_{n₁}, d_{n₂}, … such that lim_{m→∞} Σ_{n_m}⁻¹(d_{n_m} − x_{n_m}) = γ′ ∈ ℝᵏ with ε′ ≤ |γ′| ≤ δ′. Conditions (4.1) and (4.6) imply that for all δ ∈ ℝ_{>0} and sufficiently large n we have

V_{rp_n}(l_{d_n}) ≥ V_{rp_n ∘ (t_{x_n} ∘ Σ_n)}[ (w ∘ t_{x_n} ∘ Σ_n)(f ∘ Σ_n ∘ t_{−Σ_n⁻¹(d_n−x_n)}) I_{B_δ} ],

and as above, using Theorem 2.23 we obtain that for all δ ∈ ℝ_{>0} and sufficiently large m

n_m^{α/2} V_{rp_{n_m}}(l_{d_{n_m}}) ≥ w(x′) ∫ f′_{γ′} I_{B_δ} dμ_δ − ε.

Now, for all δ ∈ ℝ_{>0} we have

∫ f′₀ I_{B_δ} dμ_δ = (√2 b)^α F( φ I_{[0, (δ/√2)^α)} ),

since F is bihomogeneous (Theorem 2.13) and

μ_δ{ f′₀ I_{B_δ} ≥ x } = f₁[exp(−x^{2/α}/(2b²))] if 0 ≤ x ≤ (bδ)^α, and = 0 if (bδ)^α < x < ∞.

Analogously, we obtain that there is a nonincreasing ψ : ℝ_{>0} → [0, 1] such that ψ > φ on ℝ_{>0}, and for all δ ∈ ℝ_{>0}

∫ f′_{γ′} I_{B_δ} dμ_δ ≥ (√2 b)^α F( ψ I_{[0, (δ/√2)^α)} ).

Hence, there are ε, δ″ ∈ ℝ_{>0} such that for all δ ∈ [δ″, ∞) we have

w(x′) ∫ f′_{γ′} I_{B_δ} dμ_δ − w(x′) ∫ f′₀ I_{B_δ} dμ_δ > 4ε,

and thus with δ = max{δ_ε, δ″} we obtain that for infinitely many n

n^{α/2} [ V_{rp_n}(l_{d_n}) − V_{rp_n}(l_{x_n}) ] ≥ ε > 0,

in contradiction with the assumption about the sequence d₁, d₂, …. □

Theorem 5.2 can be generalized to a wider class of likelihood-based decision criteria: in particular, it holds also for the LRM and MLD criteria. In Theorem 5.2, the assumptions about L and about the sequence rp₁, rp₂, … ensure that locally near x_n the estimation problem tends to a simple form; and the optimality of x_n in this simple problem is ensured by the assumptions about f₁ and F. The condition on F is rather weak, when f₁ is not too extreme; in particular, it is satisfied for both the Shilkret integral and the Choquet integral, if there are a, b ∈ ℝ_{>0} such that f₁(x) ≤ a x^b for all x ∈ [0, 1]. Condition (5.1) implies that it suffices to consider the estimation problem locally near x_n; the validity of this condition depends on the particular problem at hand. For example, if f₁ satisfies the above requirement, w is bounded, f : x ↦ |Mx|^α, and rp_n^λ is logarithmically concave for sufficiently large n, then condition (5.1) is fulfilled for both the Shilkret integral and the Choquet integral.
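The two integrals named above can be compared directly: for a completely maxitive measure with density π, the Shilkret integral of l is sup_x l(x) π(x), while the Choquet integral is ∫₀^∞ μ{l ≥ t} dt with μ(A) = sup_A π. A discretized sketch (normal-shaped density and squared loss, both hypothetical illustrations):

```python
import math

xs = [i / 100 for i in range(-500, 501)]        # grid on [-5, 5]
density = [math.exp(-x * x / 2) for x in xs]     # maxitive (possibility) density
loss    = [x * x for x in xs]                    # l(x) = x^2

# Shilkret integral: sup_x l(x) * density(x)
shilkret = max(l * p for l, p in zip(loss, density))

# Choquet integral: integral over t of mu{l >= t}, with mu(A) = sup_A density
t_max = max(loss)
steps = 2000
dt = t_max / steps
choquet = 0.0
for j in range(steps):
    t = (j + 0.5) * dt
    level = max((p for l, p in zip(loss, density) if l >= t), default=0.0)
    choquet += level * dt

# Here shilkret is about 2/e ~ 0.736, while choquet is about 2 (mu{l >= t} = e^{-t/2}).
```

The gap between the two evaluations of the same uncertain loss is one reason why the corresponding decision criteria can select different estimates.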

In Theorem 5.2 it is assumed that locally near 0 the function f tends to be centrally symmetric about 0; the next example shows that when this condition is not fulfilled, the conclusion of the theorem does not necessarily hold, even when all other assumptions are satisfied. In fact, all asymptotically efficient sequences of estimators of g(P) are asymptotically normal with mean g(P) (under each model P ∈ 𝒫), while a biased asymptotic distribution can be advantageous when the assumption about f of Theorem 5.2 does not hold.

Example 5.3. In the situation of Examples 4.18 and 4.25, consider the problem of estimating $\mu \in \mathbb{R}$ with an error function $L$ such that $w$ is constant and $f$ is defined by

$$f(x) = \begin{cases} -2x & \text{if } -\infty < x < 0, \\ x & \text{if } 0 \le x < \infty. \end{cases}$$

This estimation error penalizes the overestimation of $\mu$ more than its underestimation; and in fact, if the estimate obtained by applying a particular likelihood-based decision criterion is well-defined, then there is a nonnegative real number $c$ such that for each $n$ this estimate is $\bar{X} - \frac{c}{\sqrt{n}}$ (where as usual $\bar{X}$ denotes the arithmetic mean of the observations $X_1, \ldots, X_n$). The corresponding sequence of estimators of $\mu$ is asymptotically efficient if and only if $c = 0$; that is, if and only if the resulting estimates are the maximum likelihood estimates $\bar{X}$. For the MPL criterion $c \approx 0.345$, and thus the corresponding sequence of estimators is not asymptotically efficient; but the expected loss is $0.915$ times the expected loss for the maximum likelihood estimates (independently of $n$ and $\mu$). ◊
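A minimal numerical sketch of this computation, assuming (as in Example 4.18) normal observations with unit variance, so that after standardization the MPL criterion minimizes $t \mapsto \sup_u f(u - t)\, e^{-u^2/2}$, with minimizer $t = -c$; the grid bounds and step sizes are arbitrary choices:

```python
import math
import numpy as np

# Example 5.3, assuming normal observations with unit variance (Example 4.18):
# the relative likelihood of gamma is exp(-n (gamma - xbar)^2 / 2).  With
# u = sqrt(n) (gamma - xbar) and t = sqrt(n) (d - xbar), the MPL criterion
# minimizes h(t) = sup_u f(u - t) exp(-u^2 / 2); the minimizer is t = -c.
u = np.linspace(-8.0, 8.0, 16001)
lik = np.exp(-0.5 * u * u)

def f(x):
    # the asymmetric estimation error of Example 5.3
    return np.where(x < 0, -2.0 * x, x)

def h(t):
    return float(np.max(f(u - t) * lik))

ts = np.linspace(-1.0, 0.0, 2001)
c = -ts[np.argmin([h(t) for t in ts])]            # about 0.345

# Expected-loss ratio for the estimate xbar - c/sqrt(n) versus xbar, using
# E[(c - Z) 1{Z < c}] = c Phi(c) + phi(c) and
# E[(Z - c) 1{Z > c}] = phi(c) - c (1 - Phi(c)) for standard normal Z.
Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
phi = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
loss_c = c * Phi + 3.0 * phi - 2.0 * c * (1.0 - Phi)
loss_0 = 3.0 / math.sqrt(2.0 * math.pi)           # the same expression at c = 0
ratio = loss_c / loss_0                           # about 0.915
```

At the minimizing $t$ the two one-sided suprema balance, which anticipates the "equilibrium point" interpretation developed in Section 5.2.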


5.2 Shilkret Integral and Quasiconvexity

Consider the problem of estimating $\gamma \in \mathbb{R}^k$ with an error function $L$, on the basis of a nondegenerate relative plausibility measure $rp$ on $\mathbb{R}^k$. Let $S$ be the functional corresponding to a likelihood-based decision criterion revealing ignorance aversion. Consider first the case with $k = 1$, and for each $d \in \mathbb{R}$ define $\varphi_d = rp\{l_d \, I_{(d,\infty)} > \cdot\}$ and $\psi_d = rp\{l_d \, I_{(-\infty,d)} > \cdot\}$. Theorem 4.3 implies that $S$ is monotonic, and $V_{rp}(l_d) = S(\max\{\varphi_d, \psi_d\})$ for all $d \in \mathbb{R}$, where $V_{rp}$ is the functional corresponding to the likelihood-based decision criterion considered. If $d, d' \in \mathbb{R}$ with $d < d'$, then $\varphi_{d'} \le \varphi_d$ and $\psi_{d'} \ge \psi_d$: that is, if we raise the estimate $d$, then the values of $\varphi_d$ decrease and those of $\psi_d$ increase, while if we reduce the estimate $d$, then the values of $\varphi_d$ increase and those of $\psi_d$ decrease; an optimal estimate corresponds thus to some sort of equilibrium point. In the case with $k > 1$, the situation is more complicated, because there are infinitely many directions in which the estimate $d$ can be moved; but when $L$ is of the form considered in Theorem 5.1, for each of these directions we can obtain two functions with the properties of $\varphi_d$ and $\psi_d$ by dividing $\mathbb{R}^k$ into two half-spaces. Theorems 5.1 and 5.2 exploit two special cases: in the first one, for a particular direction we have $\varphi_{d'} \le \varphi_d$ and $\psi_{d'} = \psi_d = 0$, and thus $d$ cannot be preferred to $d'$; in the second one, for each direction we have $\varphi_d = \psi_d$, and thus $d$ is optimal. The general case can be simplified when the likelihood-based decision criterion corresponds to the functional $V_{rp} : l \mapsto \int^S l \, d(f_1 \circ rp)$ for a nondecreasing function $f_1 : [0,1] \to [0,1]$ such that $f_1(0) = 0$ and $f_1(1) = 1$: in fact, the maxitivity of the Shilkret integral implies that $V_{rp}(l_d) = \max\{S(\varphi_d), S(\psi_d)\}$. In the case with $k = 1$, the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}$ is then quasiconvex (since it is the maximum of a nonincreasing function and a nondecreasing function); the next theorem generalizes this result.
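The decomposition for $k = 1$ can be checked on a finite grid, where the Shilkret integral with $f_1 = \mathrm{id}$ reduces to a weighted maximum; the density and the error function ($w = 1$, $f = |\cdot|$) below are illustrative choices, not taken from the text:

```python
import numpy as np

# Finite-grid illustration of V(l_d) = max{S(phi_d), S(psi_d)} for k = 1 with
# f_1 = id: the Shilkret integral of l_d with respect to the completely maxitive
# measure with density mu is then sup_gamma l_d(gamma) mu(gamma).  The density
# and the error function (w = 1, f = |.|) are illustrative choices.
gamma = np.linspace(-4.0, 4.0, 4001)
mu = np.exp(-0.5 * gamma**2) + 0.5 * np.exp(-((gamma - 2.0)**2))

ds = np.linspace(-3.0, 3.0, 601)
V, right, left = [], [], []
for d in ds:
    vals = np.abs(gamma - d) * mu
    V.append(vals.max())
    right.append(vals[gamma > d].max(initial=0.0))   # the part beyond d (phi_d)
    left.append(vals[gamma < d].max(initial=0.0))    # the part below d (psi_d)
V, right, left = np.array(V), np.array(right), np.array(left)
```

On the grid, `V` coincides with the pointwise maximum of the two parts, the first part is nonincreasing in `d` and the second nondecreasing, so their maximum is quasiconvex, as in the discussion above.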

Theorem 5.4. Consider the problem of estimating $\gamma \in \mathbb{R}^k$ with an error function $L$, on the basis of a nondegenerate relative plausibility measure $rp$ on $\mathbb{R}^k$. Let $f_1 : [0,1] \to [0,1]$ be a nondecreasing function such that $f_1(0) = 0$ and $f_1(1) = 1$, and let $V_{rp}$ be the functional $l \mapsto \int^S l \, d(f_1 \circ rp)$. The function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ is quasiconvex. Moreover, if $\{rp^\downarrow > 0\}$ is bounded, and $L$ is such that $f$ is continuous and strictly quasiconvex, $\inf_{\{rp^\downarrow > 0\}} w > 0$, and $\sup_{\{rp^\downarrow > 0\}} w < \infty$, then the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ is continuous and strictly quasiconvex.


Proof. For each $\gamma \in \mathbb{R}^k$ the function $d \mapsto l_d(\gamma)$ is quasiconvex, and

$$V_{rp}(l_d) = \int^S l_d \, d(f_1 \circ rp) = \sup_{x \in P} x \, f_1(rp\{l_d > x\}) = \sup_{x \in P} \; \sup_{\substack{\gamma_1, \gamma_2, \ldots \,\in\, \mathbb{R}^k \text{ such that} \\ l_d(\gamma_n) \ge x \text{ for infinitely many } n}} x \, f_1\!\left(\liminf_{n \to \infty} rp^\downarrow(\gamma_n)\right) = \sup_{\gamma_1, \gamma_2, \ldots \,\in\, \mathbb{R}^k} \left(\limsup_{n \to \infty} l_d(\gamma_n)\right) f_1\!\left(\liminf_{n \to \infty} rp^\downarrow(\gamma_n)\right).$$

Hence the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ is quasiconvex, since the limit superior of a sequence of quasiconvex functions is quasiconvex, and so is the supremum of a set of quasiconvex functions.

To prove the second statement of the theorem, let $d'$ and $d''$ be two distinct elements of $\mathbb{R}^k$, and let $d = \lambda d' + (1 - \lambda) d''$ for some $\lambda \in (0,1)$. The assumptions about $rp$ and $L$ imply that there are $a, b \in P$ such that on $\{rp^\downarrow > 0\}$ we have $\max\{l_{d'}, l_{d''}\} \ge l_d + a$ and $l_d \le b$; hence $V_{rp}(l_d) \le b$, and $\max\{l_{d'}, l_{d''}\} \ge \max\{(1 + \frac{a}{b})\, l_d, \; a\}$ on $\{rp^\downarrow > 0\}$. Since the Shilkret integral is monotonic, homogeneous, maxitive, and transformation invariant on the class of all finitely maxitive measures, and $f_1 \circ rp$ is finitely maxitive, using Theorem 2.8 we obtain

$$\max\{V_{rp}(l_{d'}), V_{rp}(l_{d''})\} = \int^S \max\{l_{d'}, l_{d''}\} \, d(f_1 \circ rp) \ge \int^S \max\{(1 + \tfrac{a}{b})\, l_d, \; a\} \, d(f_1 \circ rp) = \max\{(1 + \tfrac{a}{b})\, V_{rp}(l_d), \; a\} > V_{rp}(l_d);$$

and this proves the strict quasiconvexity of the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$. Since the Shilkret integral is also subadditive on the class of all finitely maxitive measures, using Theorem 2.8 again we obtain

$$|V_{rp}(l_d) - V_{rp}(l_{d'})| \le \sup_{\{rp^\downarrow > 0\}} |l_d - l_{d'}| \quad \text{for all } d, d' \in \mathbb{R}^k.$$

Hence, the continuity at $d' \in \mathbb{R}^k$ of the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ follows from the equicontinuity at $d'$ of the set of all functions $d \mapsto l_d(\gamma)$ on $\mathbb{R}^k$ with $\gamma \in \{rp^\downarrow > 0\}$; and the equicontinuity at each $d' \in \mathbb{R}^k$ of this set of functions is implied by the assumptions about $rp$ and $L$. □


Consider a likelihood-based decision criterion corresponding to the functional $V_{rp} : l \mapsto \int^S l \, d(f_1 \circ rp)$, where $f_1 : [0,1] \to [0,1]$ is a nondecreasing function such that $f_1(0) = 0$ and $f_1(1) = 1$ (for example the MPL, $\mathrm{LRM}_\beta$, or MLD criteria). The decision criterion applied to the estimation problem described by $L$ consists in minimizing $V_{rp}(l_d)$, and Theorem 5.4 states that the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ is quasiconvex. Therefore, in particular, the set of all optimal estimates in $\mathbb{R}^k$ is convex, and if the function $d \mapsto V_{rp}(l_d)$ has a strict local minimum at $d' \in \mathbb{R}^k$, then $d'$ is the unique optimal estimate in $\mathbb{R}^k$. If the premise of the second statement of the theorem is satisfied, then the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$ is also continuous and strictly quasiconvex, and so there is at most one optimal estimate in $\mathbb{R}^k$. It can be easily proved that if the premise of the second statement of Theorem 5.4 is satisfied, and either $k = 1$, or $L$ is of the form considered in Theorem 5.1, then there is a unique optimal estimate in $\mathbb{R}^k$, and it lies in the closed convex hull of $\{rp^\downarrow > 0\}$.

Consider now the case when $f_1$ is left-continuous; then $f_1 \circ rp$ is completely maxitive, and using Theorem 2.6 we obtain

$$V_{rp}(l_d) = \sup_{\gamma \in \mathbb{R}^k} f(\gamma - d) \, w(\gamma) \, f_1[rp^\downarrow(\gamma)] = \int^S (f \circ t_d) \, d\mu \quad \text{for all } d \in \mathbb{R}^k,$$

where $t_d$ denotes the translation $\gamma \mapsto \gamma - d$, and $\mu$ is the completely maxitive measure on $\mathbb{R}^k$ with density function $\mu^\downarrow = w \, (f_1 \circ rp^\downarrow)$. When $\mu$ is finite, applying the considered likelihood-based decision criterion to $L$ with respect to $rp$ thus corresponds to applying the MPL criterion to the error function $L' : (\gamma, d) \mapsto f(\gamma - d)$ with respect to the (nondegenerate) relative plausibility measure on $\mathbb{R}^k$ proportional to $\mu$. In particular, the assumptions about $w$ of Theorem 5.4 can be replaced by the assumptions that $f_1$ is left-continuous and $w \, (f_1 \circ rp^\downarrow)$ is bounded.

An algorithm for approximating the optimal estimates of $\gamma \in \mathbb{R}^k$ (that is, for minimizing the function $d \mapsto V_{rp}(l_d)$ on $\mathbb{R}^k$) is suggested by the following two expressions:

$$V_{rp}(l_d) = \sup_{\{\mu^\downarrow > 0\}} (f \circ t_d) \, \mu^\downarrow \quad \text{for all } d \in \mathbb{R}^k,$$

$$\{d \in \mathbb{R}^k : V_{rp}(l_d) \le c\} = \bigcap_{\gamma \in \{\mu^\downarrow > 0\}} \left\{d \in \mathbb{R}^k : f(\gamma - d) \le \tfrac{c}{\mu^\downarrow(\gamma)}\right\} \quad \text{for all } c \in P.$$

Let $d \in \mathbb{R}^k$ be an initial estimate, and let $\gamma_1, \ldots, \gamma_n \in \{\mu^\downarrow > 0\}$ be suitable approximations of the local maxima of $(f \circ t_d) \, \mu^\downarrow$. Then the value

$$c = \max\{f(\gamma_1 - d) \, \mu^\downarrow(\gamma_1), \; \ldots, \; f(\gamma_n - d) \, \mu^\downarrow(\gamma_n)\}$$

approximates $V_{rp}(l_d)$, and if actually $\inf_{d' \in \mathbb{R}^k} V_{rp}(l_{d'}) < c$, then the optimal estimates lie in the set

$$A = \left\{d' \in \mathbb{R}^k : f(\gamma_1 - d') \le \tfrac{c}{\mu^\downarrow(\gamma_1)}, \; \ldots, \; f(\gamma_n - d') \le \tfrac{c}{\mu^\downarrow(\gamma_n)}\right\}.$$

The set $A$ is convex (since $f$ is quasiconvex), and the estimate $d$ usually lies on the border of $A$; other estimates $d' \in A$ lying in the interior of $A$ can be expected to be better than $d$, in the sense that $V_{rp}(l_{d'}) < V_{rp}(l_d)$. The rule for selecting a new estimate $d' \in A$ can exploit the properties of the particular problem at hand, such as the differentiability of $f$ or $\mu^\downarrow$. Once the estimate $d' \in A$ has been selected, a new step of the algorithm can be carried out by using $d'$ as the initial estimate; and $\gamma_1, \ldots, \gamma_n$ can be used as starting points for approximating the local maxima of $(f \circ t_{d'}) \, \mu^\downarrow$. That is, the algorithm can be interpreted as searching for the local maxima of a function that depends on the obtained approximations of the local maxima: some steps of the maximization algorithm are alternated with some steps of an algorithm that redefines the function to be maximized (by selecting a new estimate $d'$). The algorithm is particularly simple and effective in the case with $k = 1$, but it has proved successful also in multidimensional problems with $k$ up to 10; however, the speed of convergence depends strongly on which properties of the estimation problem can be assumed and exploited when implementing the algorithm, and in particular when defining the rule for selecting the new estimate $d' \in A$.
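A minimal one-dimensional sketch of this algorithm for the MPL criterion ($f_1 = \mathrm{id}$), with $f : x \mapsto x^2$, an illustrative bimodal density $\mu^\downarrow$, and the midpoint of $A$ as a naive rule for selecting the new estimate; all three choices are assumptions made for the example:

```python
import numpy as np

# One-dimensional sketch of the algorithm for the MPL criterion (f_1 = id),
# with f(x) = x^2 and an illustrative bimodal density mu; the midpoint of the
# interval A is used as the (naive) rule for selecting the new estimate.
gamma = np.linspace(-5.0, 5.0, 20001)
mu = np.exp(-0.5 * (gamma + 2.0)**2) + 0.7 * np.exp(-2.0 * (gamma - 1.5)**2)

def V(d):
    # discretized V_rp(l_d) = sup_gamma f(gamma - d) mu(gamma)
    return float(np.max((gamma - d)**2 * mu))

def step(d):
    # one step: approximate V(d) by the local maxima of f(. - d) mu, then
    # intersect the constraints f(gamma_i - d') <= c / mu(gamma_i).
    g = (gamma - d)**2 * mu
    peaks = [i for i in range(1, len(g) - 1) if g[i - 1] <= g[i] > g[i + 1]]
    c = max(g[i] for i in peaks)              # approximates V(d)
    lo, hi = -np.inf, np.inf                  # A is an interval, since f is quasiconvex
    for i in peaks:
        r = np.sqrt(c / mu[i])                # (gamma_i - d')^2 <= c/mu_i  <=>  |gamma_i - d'| <= r
        lo, hi = max(lo, gamma[i] - r), min(hi, gamma[i] + r)
    return 0.5 * (lo + hi), c, lo, hi

d = 3.0                                       # initial estimate
for _ in range(40):
    d, c, lo, hi = step(d)

ds = np.linspace(-4.0, 4.0, 1601)
Vs = np.array([V(x) for x in ds])             # brute force, for comparison
```

On this instance the iteration settles at the equilibrium where the two local maxima balance; the discretized `V` is quasiconvex (it is a maximum of quasiconvex functions of `d`), and the sublevel set `{V <= c}` is always contained in `A`.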

5.2.1 Impossible Estimates

Consider the problem of estimating $\gamma \in \mathbb{R}^k$ with an error function $L$, on the basis of a nondegenerate relative plausibility measure $rp$ on $\mathbb{R}^k$. The elements of $\{rp^\downarrow = 0\}$ can be considered as impossible, since they are infinitely less plausible than any element of $\{rp^\downarrow > 0\}$. If $\mathcal{P}$ is a set of statistical models, $g : \mathcal{P} \to \mathbb{R}^k$ is a mapping, and $rp^\downarrow$ is proportional to the profile likelihood function $lik_g$ induced by the observation of some data, then $rp^\downarrow$ vanishes at $\gamma \in \mathbb{R}^k$ if and only if either $\gamma$ is physically impossible (in the sense that it does not lie in the image of $\mathcal{P}$ under $g$), or it is impossible in the light of the observed data (in the sense that the likelihood function on $\mathcal{P}$ induced by the observed data vanishes on $\{g = \gamma\}$). Impossible optimal estimates can be problematic, and even a unique optimal estimate lying on the border of $\{rp^\downarrow > 0\}$ can be disturbing.


Consider a likelihood-based decision criterion corresponding to the functional $V_{rp} : l \mapsto \int^S l \, d(f_1 \circ rp)$, where $f_1 : [0,1] \to [0,1]$ is a nondecreasing function such that $f_1(0) = 0$ and $f_1(1) = 1$. We have seen that under some conditions, if either $k = 1$, or $L$ is of the form considered in Theorem 5.1, then there is a unique optimal estimate $\hat{\gamma} \in \mathbb{R}^k$, and $\hat{\gamma}$ lies in the closed convex hull of $\{rp^\downarrow > 0\}$. Of course, $\hat{\gamma}$ can be impossible (in the sense that $rp^\downarrow(\hat{\gamma}) = 0$) when $\{rp^\downarrow > 0\}$ is not convex; for instance, if $k = 2$, the error function is $L : (\gamma, d) \mapsto |\gamma - d|$, and $rp^\downarrow$ is the indicator function of the unit circle $C = \{x \in \mathbb{R}^2 : |x| = 1\}$, then $\hat{\gamma} = 0$ is the unique optimal estimate according to all likelihood-based decision criteria revealing ignorance aversion. If we try to minimize the function $d \mapsto V_{rp}(l_d)$ on $\mathcal{D} = \{rp^\downarrow > 0\}$, then in general the minimum does not exist when $\mathcal{D}$ is not closed. In fact, $\hat{\gamma}$ can be impossible also when $\{rp^\downarrow > 0\}$ is convex; for example, with $k$ and $L$ as above, if $rp^\downarrow$ is the indicator function of an open unit half-disc $H = \{x \in \mathbb{R}^2 : |x| < 1, \; x \cdot y > 0\}$, where $y \in \mathbb{R}^2 \setminus \{0\}$, then $\hat{\gamma} = 0$ is the unique optimal estimate according to all likelihood-based decision criteria revealing ignorance aversion. The next theorem implies that under some conditions $\hat{\gamma}$ cannot be an extreme point of the closed convex hull of $\{rp^\downarrow > 0\}$; the above example does not contradict this theorem, since $0$ is not an extreme point of the closure of $H$ (the extreme points of the closure of $H$ are the $x \in C$ such that $x \cdot y \ge 0$). The next theorem is particularly interesting in the case with $k = 1$, because then the closed convex hull of $\{rp^\downarrow > 0\}$ is an interval $I \subseteq \mathbb{R}$, and if $\hat{\gamma}$ is not an extreme point of $I$, then it lies in the interior of $I$, and thus also in the interior of $\{rp^\downarrow > 0\}$, when this is convex.
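A quick numerical check of the unit-circle example above, with $f_1 = \mathrm{id}$ (the MPL criterion); the grid of candidate estimates is an arbitrary choice:

```python
import numpy as np

# The unit-circle example: rp density = indicator of C = {|x| = 1} in R^2,
# L(gamma, d) = |gamma - d|.  With f_1 = id, V(l_d) = sup_{|x|=1} |x - d|,
# which equals 1 + |d|; so d = 0 is optimal, although rp vanishes at 0
# (an impossible estimate).
theta = np.linspace(0.0, 2.0 * np.pi, 3601)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def V(d):
    # discretized sup over the circle of |x - d|
    return float(np.max(np.linalg.norm(circle - d, axis=1)))

ds = [np.array([r * np.cos(t), r * np.sin(t)])
      for r in (0.0, 0.3, 0.7, 1.0) for t in np.linspace(0.0, 2.0 * np.pi, 12)]
vals = [V(d) for d in ds]
```

Each candidate `d` attains `V(d)` approximately `1 + |d|` (the farthest circle point is antipodal to `d`), so the minimum over the candidates is at the origin.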

Theorem 5.5. Consider the problem of estimating $\gamma \in \mathbb{R}^k$ on the basis of a nondegenerate relative plausibility measure $rp$ on $\mathbb{R}^k$, with an error function $L$ such that $\sup_{\{rp^\downarrow > 0\}} w < \infty$ and $f : x \mapsto f'(|Mx|)$, where $f' : P \to P$ is a function continuous at $0$, and $M \in \mathbb{R}^{k \times k}$ is an invertible matrix. Let $f_1 : [0,1] \to [0,1]$ be a nondecreasing function such that $f_1(0) = 0$ and $f_1(1) = 1$, and consider a likelihood-based decision criterion corresponding to the functional $V_{rp} : l \mapsto \int^S l \, d(f_1 \circ rp)$. If $\hat{\gamma} \in \mathbb{R}^k$ is the unique optimal estimate, and $V_{rp}(l_{\hat{\gamma}}) > 0$, then $\hat{\gamma}$ is not an extreme point of the closed convex hull of $\{rp^\downarrow > 0\}$.

Proof. Let $\mathcal{G}$ be the closed convex hull of $\{rp^\downarrow > 0\}$, and assume that $\hat{\gamma}$ is an extreme point of $\mathcal{G}$. Then $M\hat{\gamma}$ is an extreme point of the closed, convex set $\mathcal{G}' = \{M\gamma : \gamma \in \mathcal{G}\}$, and thus for each $\varepsilon \in P$ there is an open half-space $H_\varepsilon \subset \mathbb{R}^k$ such that $M\hat{\gamma} \in H_\varepsilon$ and $\mathcal{G}' \cap H_\varepsilon$ is contained in the open ball of radius $\varepsilon$ centered at $M\hat{\gamma}$. For each $\varepsilon \in P$ let $\hat{\gamma}_\varepsilon$ be the element of $\mathbb{R}^k$ such that $M\hat{\gamma}_\varepsilon$ is the projection of $M\hat{\gamma}$ on the border of $H_\varepsilon$: we have $|M(\gamma - \hat{\gamma}_\varepsilon)| \le |M(\gamma - \hat{\gamma})|$ for all $\gamma \in M^{-1}(\mathbb{R}^k \setminus H_\varepsilon)$. Since $\mathcal{G}'$ is not a singleton (because otherwise $V_{rp}(l_{\hat{\gamma}}) = 0$), for sufficiently small $\varepsilon \in P$ we have $|M(\hat{\gamma}_\varepsilon - \hat{\gamma})| \le \varepsilon$; and the assumptions about $L$ thus imply that there is an $\varepsilon' \in P$ such that $l_{\hat{\gamma}_{\varepsilon'}} < V_{rp}(l_{\hat{\gamma}})$ on $\mathcal{G} \cap M^{-1}(H_{\varepsilon'})$. Since the Shilkret integral is monotonic, homogeneous, maxitive, and transformation invariant on the class of all finitely maxitive measures, and $f_1 \circ rp$ is finitely maxitive, using Theorem 2.8 we obtain

$$V_{rp}(l_{\hat{\gamma}_{\varepsilon'}}) = \int^S \max\{l_{\hat{\gamma}_{\varepsilon'}} \, I_{\mathcal{G} \cap M^{-1}(H_{\varepsilon'})}, \; l_{\hat{\gamma}_{\varepsilon'}} \, I_{M^{-1}(\mathbb{R}^k \setminus H_{\varepsilon'})}\} \, d(f_1 \circ rp) \le \int^S \max\{V_{rp}(l_{\hat{\gamma}}), \; l_{\hat{\gamma}}\} \, d(f_1 \circ rp) = V_{rp}(l_{\hat{\gamma}}),$$

in contradiction with the assumption that $\hat{\gamma}$ is uniquely optimal. □

Theorem 5.5 can be easily generalized; in particular, in the case with $k = 1$, the assumption about $f$ can be replaced by the assumption that $f$ is continuous at $0$. Moreover, as in Theorem 5.4, the assumption about $w$ of Theorem 5.5 can be replaced by the assumptions that $f_1$ is left-continuous and $w \, (f_1 \circ rp^\downarrow)$ is bounded. When $f^{-1}\{0\} = \{0\}$, a sufficient condition for $V_{rp}(l_{\hat{\gamma}})$ to be positive is that $\{f_1 \circ rp^\downarrow > 0\}$ is not a singleton; in particular, if (as in the following example) $rp^\downarrow$ is continuous on $\{rp^\downarrow > 0\}$, and $\{rp^\downarrow > 0\}$ is convex and is not a singleton, then $V_{rp}(l_{\hat{\gamma}}) = 0$ is possible only if $f_1 = I_{\{1\}}$ (that is, only for the MLD criterion).

Example 5.6. The problem of estimating (with squared error loss) the mean of a normal distribution with known variance is made much more complex by the additional assumption that the mean lies in a particular bounded interval. Casella and Strawderman (1981) proved that if $\mathcal{P}' = \{P_\mu : \mu \in (-m, m)\}$ is a family of probability measures such that under each model $P_\mu$ the random variable $X$ is normally distributed with mean $\mu$ and variance $1$, and $m \in (0, 1.056)$, then the minimax risk criterion applied to the decision problem described by the loss function $L' : (P_\mu, d) \mapsto (\mu - d)^2$ on $\mathcal{P}' \times \mathbb{R}$ leads to the estimate $m \tanh(m X)$. For larger values of $m$, the estimate obtained by applying the minimax risk criterion is in general unknown, but Donoho, Liu, and MacGibbon (1990) showed that the maximum expected loss for this estimate is at least $0.8$ times the one for the estimate $\frac{m^2}{m^2 + 1} X$. When $|X| > m + \frac{1}{m}$, this estimate is impossible; an estimate that is never impossible and has smaller expected loss (under all models in $\mathcal{P}'$) can be obtained by using the results of Moors (1981):

$$\mathrm{sign}(X) \, \min\left\{\tfrac{m^2}{m^2 + 1} |X|, \; m \tanh(m |X|)\right\}. \tag{5.2}$$

The maximum expected loss for this estimate is thus at most $1.25$ times the one for the estimate obtained by applying the minimax risk criterion.

For the hierarchical model described in Section 3.2, the assumption that the mean $\mu$ lies in the interval $(-m, m)$ simply corresponds to a particular prior relative plausibility measure on the set $\mathcal{P} = \{P_\mu : \mu \in \mathbb{R}\} \supseteq \mathcal{P}'$ considered in Example 4.18 (with $n = 1$ and $X = X_1$): the prior relative plausibility measure with density function proportional to the function $P_\mu \mapsto I_{(-m,m)}(\mu)$ on $\mathcal{P}$. No probability measure on $\mathcal{P}$ can describe this kind of prior information, but a reasonable estimate can be obtained by applying to the decision problem described by the loss function $L : (P_\mu, d) \mapsto (\mu - d)^2$ on $\mathcal{P} \times \mathbb{R}$ the Bayesian criterion with as initial averaging probability measure on $\mathcal{P}$ the one corresponding to the uniform distribution on $(-m, m)$ for the parameter $\mu$ (see Gatsonis, MacGibbon, and Strawderman, 1987). When the realization $X = x$ is observed, the above prior relative plausibility measure can be updated by combining it (as described in Subsection 3.1.2) with the relative plausibility measure on $\mathcal{P}$ whose density function is proportional to the approximate likelihood function $P_\mu \mapsto \varphi(x - \mu)$ on $\mathcal{P}$ considered in Example 4.18. A likelihood-based decision criterion can then be applied to the decision problem described by $L$: consider the decision criteria corresponding to the functionals $V_{rp} : l \mapsto \int^S l \, d(f_1 \circ rp)$ for some nondecreasing function $f_1 : [0,1] \to [0,1]$ such that $f_1(0) = 0$ and $f_1(1) = 1$. Theorems 5.1 and 5.4 imply that there is a unique optimal estimate in $\mathbb{R}$, and it lies in the closed interval $[-m, m]$; and Theorem 5.5 implies that if $f_1 \ne I_{\{1\}}$, then the estimate lies in the open interval $(-m, m)$. That is, the MLD criterion is the only one of the considered likelihood-based decision criteria that can lead to an impossible estimate, and in fact it leads to the estimate $\mathrm{sign}(X) \min\{|X|, m\}$, which is impossible when $|X| \ge m$. When $|X| < m$, this estimate corresponds to the maximum likelihood estimate $X$; more generally, if $f_1$ is not too extreme (it suffices that there are $a, b \in P$ such that $f_1(x) \le a\,x^b$ for all $x \in [0,1]$), then there is a $c \in P$ such that the resulting estimate is $X$ when $|X| \le m - c$ (for all $m > c$): for instance, $c = \sqrt{2}$ when $f_1 = \mathrm{id}_{[0,1]}$ (that is, for the MPL


Figure 5.1. Estimators and corresponding expected losses from Example 5.6.

criterion). This can be considered as a stability property: when the observation $x$ lies in the interval $(-m, m)$ and is sufficiently far away from its endpoints, the additional assumption that $\mu \in (-m, m)$ has no influence on the estimate of $\mu \in \mathbb{R}$.

The first diagram of Figure 5.1 shows the graphs of some estimates of $\mu$, as functions of the observation $x \in [0, 6]$ (all the estimators considered are odd functions), in the case with $m = 3$: the estimate (5.2) (dotted, almost polygonal line), and the estimates obtained by applying the MPL criterion (dashed line), the MLD criterion (upper solid line), the $\mathrm{LRM}_\beta$ criterion with $\beta \approx 0.259$ (so that $-2 \log \beta$ is the $90\%$-quantile of the $\chi^2$ distribution with one degree of freedom; lower solid line), and the Bayesian criterion with initial averaging probability distribution uniform on $(-3, 3)$ (dotted curved line). The second diagram of Figure 5.1 shows the graphs of the corresponding expected losses (as functions of $\mu \in (-3, 3)$): for the estimate (5.2) (dotted line with maximum $0.809$ occurring near $\mu = 0$), and for the estimates obtained by applying the MPL criterion (dashed line with maximum $0.947$), the MLD criterion (solid line with maximum $0.995$ at $\mu = 0$), the $\mathrm{LRM}_\beta$ criterion with $\beta \approx 0.259$ (solid line with supremum $1.001$ occurring at the endpoints of the interval), and the Bayesian criterion with initial averaging probability distribution uniform on $(-3, 3)$ (dotted line with supremum $1.000$ occurring at the endpoints of the interval). A consequence of the stability property considered above is that the three estimates obtained by applying a likelihood-based decision criterion have a relatively high expected loss under the model $P_0$. ◊
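A numerical sketch of this example with $m = 3$, minimizing $d \mapsto \sup_{\mu} (\mu - d)^2\, rp^\downarrow(\mu)$ on a grid; normalization of the relative plausibility does not affect the minimizer, and the grid sizes are arbitrary choices:

```python
import numpy as np

# Example 5.6 with m = 3: the updated relative plausibility of mu is
# proportional to I_(-m,m)(mu) exp(-(x - mu)^2 / 2), and the MPL estimate
# minimizes d |-> sup_mu (mu - d)^2 rp(mu).
m = 3.0
mu = np.linspace(-m, m, 4001)[1:-1]          # grid on the open interval (-m, m)
ds = np.linspace(-m, m, 2401)

def mpl_estimate(x):
    rp = np.exp(-0.5 * (x - mu)**2)          # proportional to the plausibility density
    return float(ds[np.argmin([np.max((mu - d)**2 * rp) for d in ds])])

est_inside = mpl_estimate(1.0)    # |x| <= m - sqrt(2): stability, estimate = x
est_outside = mpl_estimate(4.0)   # |x| > m: the MPL estimate stays inside (-m, m)
mld_outside = float(mu[np.argmax(np.exp(-0.5 * (4.0 - mu)**2))])  # MLD hits the border
```

For `x = 1` the bound does not bind (both points `x ± sqrt(2)` lie in `(-m, m)`), so the MPL estimate equals the observation; for `x = 4` the MPL estimate lies strictly inside the interval, while the MLD estimate sits on its border.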


Using the previous results, it can be easily proved that when the MPL criterion is applied to the problem of estimating $\gamma \in \mathbb{R}$ with an error function $L$, on the basis of a nondegenerate relative plausibility measure $rp$ on $\mathbb{R}$, if $L$ is such that $f$ is continuous and strictly quasiconvex, $w \, rp^\downarrow$ is bounded, and

$$\lim_{x \to -\infty} f(x - d) \, w(x) \, rp^\downarrow(x) = \lim_{x \to +\infty} f(x - d) \, w(x) \, rp^\downarrow(x) = 0 \quad \text{for all } d \in \mathbb{R},$$

then there is a unique optimal estimate in $\mathbb{R}$, and it lies in the interior of the convex hull of $\{rp^\downarrow > 0\}$, when this is not a singleton. Theorem 2.6 implies that applying the MPL criterion to $L$ with respect to $rp$ corresponds to applying it to the error function $L' : (\gamma, d) \mapsto f(\gamma - d)$ with respect to the combination of $rp$ with the relative plausibility measure whose density is proportional to $w$; that is, the weight function $w$ plays the role of a prior likelihood function on $\mathbb{R}$, to be combined with the pseudo likelihood function $rp^\downarrow$. More generally, when $\mathcal{P}$ is a set of statistical models, $g : \mathcal{P} \to \mathbb{R}$ is a mapping, and we estimate $\gamma = g(P)$, a weight function $w'$ can be defined directly on $\mathcal{P}$: if $rp'$ is a relative plausibility measure on $\mathcal{P}$, then applying the MPL criterion to $L'$ with respect to the relative plausibility measure induced on $\mathbb{R}$ (through $g$) by the combination of $rp'$ with the relative plausibility measure whose density is proportional to $w'$ corresponds to applying it (with respect to $rp'$) to the decision problem described by the loss function $L'' : (P, d) \mapsto w'(P) \, f[g(P) - d]$ on $\mathcal{P} \times \mathbb{R}$. When $w'$ depends on $P$ only through $g(P)$, we recover the above situation, with $rp = rp' \circ g^{-1}$ and $w' = w \circ g$. The weight function $w'$ must be chosen carefully: in general, the resulting estimates have better pre-data performances when $w'$ is chosen so that estimates with bounded expected losses are possible.

Example 5.7. Famous examples of impossible estimates are the negative estimates of variance components in random effects models (see for instance Searle, Casella, and McCulloch, 1992). Consider in particular the balanced one-way layout under normality assumptions; that is, consider a family $\mathcal{P}' = \{P_{\mu, v_a, v_e} : (\mu, v_a, v_e) \in \mathbb{R} \times P \times P\}$ of probability measures such that under each model $P_{\mu, v_a, v_e}$ the random variables $X_{i,j}$ are normally distributed with mean $\mu$ and variance $v_a + v_e$ (for all $i \in \{1, \ldots, n_a\}$ and all $j \in \{1, \ldots, n_e\}$), and the correlation coefficient of two different random variables $X_{i,j}$ and $X_{i',j'}$ is $\rho = \frac{v_a}{v_a + v_e}$ when $i = i'$, and $0$ otherwise. Then, under each model $P_{\mu, v_a, v_e}$, the random variables $\frac{S_a}{n_e v_a + v_e}$ and $\frac{S_e}{v_e}$ are $\chi^2$ distributed with $n_a - 1$ and $n_a (n_e - 1)$ degrees of freedom, respectively, where

$$S_a = n_e \sum_{i=1}^{n_a} (\bar{X}_i - \bar{X})^2 \quad \text{and} \quad S_e = \sum_{i=1}^{n_a} \sum_{j=1}^{n_e} (X_{i,j} - \bar{X}_i)^2,$$

with $\bar{X}_i = \frac{1}{n_e} \sum_{j=1}^{n_e} X_{i,j}$ and $\bar{X} = \frac{1}{n_a} \sum_{i=1}^{n_a} \bar{X}_i$. The estimates of the variance components $v_a$ and $v_e$ obtained by applying the method of analysis of variance are

$$\hat{v}_a = \frac{1}{n_e} \left( \frac{S_a}{n_a - 1} - \frac{S_e}{n_a (n_e - 1)} \right) \quad \text{and} \quad \hat{v}_e = \frac{S_e}{n_a (n_e - 1)},$$

respectively; these estimates are unbiased and have minimum variance among all unbiased estimates, but $\hat{v}_a$ can be negative. The problems of estimating $v_a$ and $v_e$ can be described as statistical decision problems by means of two loss functions $L_a$ and $L_e$ on $\mathcal{P}' \times \mathbb{R}$; consider in particular the loss functions

$$L_a : (P_{\mu, v_a, v_e}, d) \mapsto \frac{(v_a - d)^2}{(n_e v_a + v_e)^2} \quad \text{and} \quad L_e : (P_{\mu, v_a, v_e}, d) \mapsto \frac{(v_e - d)^2}{v_e^2}.$$

The squared error loss is not particularly well suited for scale parameters, but it accords with the estimates $\hat{v}_a$ and $\hat{v}_e$ (which have minimum variance among all unbiased estimates). The weights are chosen so that the decision problems are location and scale invariant, and estimates with bounded expected losses are possible (the naive choice $\frac{1}{v_a^2}$ for $L_a$ would in fact impose the estimate $d_a = 0$).
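The analysis-of-variance estimates above can be checked by simulation; the values $v_a = 0.1$, $v_e = 1$, the number of replications, and the seed are illustrative choices:

```python
import numpy as np

# Balanced one-way layout with n_a = n_e = 3 (Example 5.7): Monte Carlo check
# of the analysis-of-variance estimates; mu = 0 without loss of generality.
rng = np.random.default_rng(0)
na, ne, va, ve, reps = 3, 3, 0.1, 1.0, 20000

a = rng.normal(0.0, np.sqrt(va), (reps, na, 1))        # random effects
x = a + rng.normal(0.0, np.sqrt(ve), (reps, na, ne))   # observations X_ij

xbar_i = x.mean(axis=2)                                # group means
xbar = xbar_i.mean(axis=1)                             # grand mean
sa = ne * ((xbar_i - xbar[:, None])**2).sum(axis=1)    # S_a
se = ((x - xbar_i[:, :, None])**2).sum(axis=(1, 2))    # S_e

va_hat = (sa / (na - 1) - se / (na * (ne - 1))) / ne
ve_hat = se / (na * (ne - 1))
frac_negative = (va_hat < 0.0).mean()                  # negative estimates are frequent here
```

Averaged over the replications, both estimates are close to the true variance components (unbiasedness), while a substantial fraction of the $\hat{v}_a$ values is negative, illustrating the impossible estimates discussed in the example.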

Figure 5.2 shows the graphs of some equivariant estimators of $v_a$ and $v_e$, and of the corresponding expected losses, in the case with $n_a = n_e = 3$; the other cases are qualitatively similar, although the differences are less marked when $n_a$ and $n_e$ are larger. All the estimates considered depend on the observations $X_{i,j}$ only through $S_a$ and $S_e$: the estimates of $v_a$ and $v_e$ divided by $S_a + S_e$ are plotted (as functions of $r = \frac{S_a}{S_a + S_e} \in (0,1)$) in the first and second upper diagrams, respectively, while the corresponding expected losses are plotted (as functions of $\rho \in (0,1)$) in the first and second lower diagrams, respectively. The estimates $\hat{v}_a$ and $\hat{v}_e$ obtained by applying the method of analysis of variance (solid straight lines) have relatively large expected losses (upper solid lines); in particular, the expected loss for $\hat{v}_a$ is large when $\rho$ is small, because $\hat{v}_a$ is negative when $r$ is small: in fact, the truncated estimate $\max\{\hat{v}_a, 0\}$ has smaller expected loss (upper dashed line). This truncated estimate is biased; when the requirement of unbiasedness is dropped, many generalizations of the estimate $\hat{v}_a$ are possible. For instance, Hartung (1981) proposed an estimate of $v_a$ (dotted straight line) that minimizes the bias among the nonnegative, invariant, quadratic estimates, but the corresponding expected loss (upper dotted line) is even larger than the one for $\hat{v}_a$ when $\rho$ is small.


Figure 5.2. Estimators and corresponding expected losses from Example 5.7.

Besides $\hat{v}_a$ and $\hat{v}_e$, the most appreciated estimates of $v_a$ and $v_e$ seem to be those based on the method of maximum likelihood; consider the density-based approximation of the likelihood function on $\mathcal{P}'$ induced by the observations $X_{i,j}$. The maximum likelihood estimate of $v_a$ (dashed line) has a relatively small expected loss (lower dashed line), while for the maximum likelihood estimate of $v_e$ (lower dashed line) the expected loss (lower dashed line) is small only when $\rho$ is small; these are the estimates obtained by applying the MLD criterion to the decision problems described by $L_a$ and $L_e$, respectively, without using prior information. The restricted maximum likelihood estimates considered by Thompson (1962) are obtained by applying the method of maximum likelihood to a particular pseudo likelihood function on $\mathcal{P}'$: they are appreciated because they are nonnegative and correspond to the estimates obtained by applying the method of analysis of variance when none of these estimates is negative. In fact, the restricted maximum likelihood estimate of $v_a$ is $\max\{\hat{v}_a, 0\}$, and the one of $v_e$ (upper dashed line) is $\hat{v}_e$ when $\hat{v}_a > 0$; but the corresponding expected losses (upper dashed lines) are larger than the ones for the maximum likelihood estimates (lower dashed lines). Anyway, these two estimates of $v_a$ based on the method of maximum likelihood can be impossible (in the sense that they can be $0$); and even when the set $\mathcal{P}'$ is extended to include the models with $v_a = 0$, it can be disturbing to have estimates lying on the border of the set of possible values for $v_a$.

To obtain positive estimates, the likelihood function must be used in a less limited way; for instance, the Bayesian criterion can be applied to the decision problems described by $L_a$ and $L_e$, but the choice of the initial averaging probability measure on $\mathcal{P}'$ is not simple. To simplify this choice, the Bayesian criterion can be extended by allowing infinite measures on $\mathcal{P}'$ as initial "averaging measures"; but unfortunately the usual choices lead to measures on $\mathcal{P}'$ that are still infinite when conditioned on the observed data. Tiao and Tan (1965) proposed the infinite measure on $\mathcal{P}'$ corresponding to the one with density $\frac{1}{v_e (n_e v_a + v_e)}$ on $\mathbb{R} \times P \times P$ (with respect to the Lebesgue measure) for the parameter $(\mu, v_a, v_e)$; when conditioned on the observations $X_{i,j}$, this measure is finite and thus proportional to a probability measure that can be used in the Bayesian criterion. This initial "averaging measure" has been strongly criticized: see for instance Hill (1965) and Stone and Springer (1965); in particular, it cannot be interpreted as describing prior information about the models in $\mathcal{P}'$, since it depends on $n_e$. Anyway, Klotz, Milton, and Zacks (1969) showed that when using the (unweighted) squared error loss, the estimates of $v_a$ and $v_e$ obtained by applying the Bayesian criterion with this initial "averaging measure" have poor pre-data performances. But Portnoy (1971) showed that the estimates obtained by applying the Bayesian criterion (with this initial "averaging measure") to the decision problems described by $L_a$ and $L_e$ (dotted curved lines) have very good pre-data performances: the maximum of the expected loss for the estimate of $v_a$ (lower dotted line) seems to be nearly the minimum possible, and the expected loss for the estimate of $v_e$ (dotted line) is also particularly small.

The estimates obtained by applying the MPL criterion to the decision problems described by $L_a$ and $L_e$ without using prior information (solid curved lines) are positive too, and they have very good pre-data performances, since the corresponding expected losses (lower solid lines) are similar to the ones for the estimates studied by Portnoy. Compared to the Bayesian criterion, the MPL criterion has the important advantage of avoiding the difficult choice of the initial "averaging measure", when no prior information is available, or when we do not want to use the available prior information. ◊

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov and Csáki, 267-281.

Anscombe, F. J., and Aumann, R. J. (1963). A definition of subjective probability. Ann. Math. Stat. 34, 199-205.

Augustin, T. (2003). On the suboptimality of the generalized Bayes rule and robust Bayesian procedures from the decision theoretic point of view: A cautionary note on updating imprecise priors. In Bernard et al., 31-45.

Azzalini, A., and Capitanio, A. (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. J. R. Stat. Soc., Ser. B, Stat. Methodol. 65, 367-389.

Ballester, A., Cardús, D., and Trillas, E., eds. (1982). Proceedings of the Second World Conference on Mathematics at the Service of Man. Universidad Politécnica de Las Palmas.

Barndorff-Nielsen, O. E. (1991). Likelihood theory. In Hinkley et al., 232-264.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances.

Philos. Trans. R. Soc. Lond. 53, 370-418.

Berk, R. H. (1966). Limiting behavior of posterior distributions when the model

is incorrect. Ann. Math. Stat. 37, 51-58.

Bernard, J.-M., Seidenfeld, T., and Zaffalon, M., eds. (2003). ISIPTA '03. Carleton Scientific.

Birnbaum, A. (1962). On the foundations of statistical inference. J. Am. Stat.

Assoc. 57, 269-326.

Birnbaum, A. (1969). Concepts of statistical evidence. In Morgenbesser et al.,

112-143.

Bouchon-Meunier, B., Yager, R. R., and Zadeh, L. A., eds. (1991). Uncertainty in Knowledge Bases. Springer.


Breese, J. S., and Koller, D., eds. (2001). UAI '01. Morgan Kaufmann.

Brown, L. D., and Purves, R. (1973). Measurable selections of extrema. Ann.

Stat. 1, 902-912.

Casella, G., and Strawderman, W. E. (1981). Estimating a bounded normal

mean. Ann. Stat. 9, 870-878.

Cattaneo, M. (2005). Likelihood-based statistical decisions. In Cozman et al.,

107-116.

Chen, Y. Y. (1993). Bernoulli trials: From a fuzzy measure point of view. J.

Math. Anal. Appl. 175, 392-404.

Chernoff, H. (1954). Rational selection of decision functions. Econometrica 22, 422-443.

Choquet, G. (1954). Theory of capacities. Ann. Inst. Fourier 5, 131-295.

Cohen, L. J. (1966). A logic for evidential support. Br. J. Philos. Sci. 17, 21-43 and 105-126.

Cozman, F. G., Nau, R., and Seidenfeld, T., eds. (2005). ISIPTA '05. SIPTA.

Dahl, F. A. (2005). Representing human uncertainty by subjective likelihood

estimates. Int. J. Approx. Reasoning 39, 85-95.

Darwiche, A., and Friedman, N., eds. (2002). UAI '02. Morgan Kaufmann.

DasGupta, A., and Studden, W. J. (1989). Frequentist behavior of robust Bayes

estimates of normal means. Stat. Decis. 7, 333-361.

de Cooman, G. (2000). Integration in possibility theory. In Grabisch et al., 124-160.

de Cooman, G., Fine, T. L., and Seidenfeld, T., eds. (2001). ISIPTA '01. Shaker.

de Cooman, G., Ruan, D., and Kerre, E. E., eds. (1995). Foundations and Applications of Possibility Theory. World Scientific.

de Cooman, G., and Walley, P. (2002). A possibilistic hierarchical model for

behaviour under uncertainty. Theory Decis. 52, 327-374.

Dempster, A. P. (1967). Upper and lower probabilities induced by a multivalued

mapping. Ann. Math. Stat. 38, 325-339.

Denneberg, D. (1994). Non-Additive Measure and Integral. Kluwer.

Donoho, D. L., Liu, R. C., and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Stat. 18, 1416-1437.

Dubois, D. (1988). Possibility theory: Searching for normative foundations. In

Munier, 601-614.

Dubois, D., and Prade, H. (1988). Possibility Theory. Plenum Press.

Dubois, D., and Prade, H. (1995). Possibilistic mixtures and their applications to

qualitative utility theory — Part II: Decision under incomplete knowledge.

In de Cooman et al., 256-266.

Dubois, D., Wellman, M. P., D'Ambrosio, B., and Smets, P., eds. (1992). UAI

'92. Morgan Kaufmann.

Ellsberg, D. (1961). Risk, ambiguity, and the Savage axioms. Q. J. Econ. 75, 643-669.

Evans, M. J., Fraser, D. A. S., and Monette, G. (1986). On principles and arguments to likelihood. Can. J. Stat. 14, 181-199.

Fishburn, P. C. (1970). Utility Theory for Decision Making. Wiley.

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics.

Philos. Trans. R. Soc. Lond., Ser. A 222, 309-368.

Fisher, R. A. (1930). Inverse probability. Proc. Camb. Philos. Soc. 26, 528-535.

Fraser, D. A. S., and McDunnough, P. (1984). Further remarks on asymptotic

normality of likelihood and conditional analyses. Can. J. Stat. 12, 183-190.

Gärdenfors, P., and Sahlin, N.-E. (1982). Unreliable probabilities, risk taking,

and decision making. Synthese 53, 361-386.

Gatsonis, C., MacGibbon, B., and Strawderman, W. (1987). On the estimation

of a restricted normal mean. Stat. Probab. Lett. 6, 21-30.

Giang, P. H., and Shenoy, P. P. (2001). A comparison of axiomatic approaches to qualitative decision making using possibility theory. In Breese and Koller, 162-170.

Giang, P. H., and Shenoy, P. P. (2002). Statistical decisions using likelihood

information without prior probabilities. In Darwiche and Friedman, 170-178.

Gilboa, I., and Schmeidler, D. (1989). Maxmin expected utility with non-unique

prior. J. Math. Econ. 18, 141-153.

Gilboa, I., and Schmeidler, D. (1993). Updating ambiguous beliefs. J. Econ.

Theory 59, 33-49.

Giraud, R. (2005). Objective imprecise probabilistic information, second order

beliefs and ambiguity aversion: an axiomatization. In Cozman et al., 183-192.

Good, I. J. (1965). The Estimation of Probabilities. MIT Press.

Goutis, C., and Casella, G. (1995). Frequentist post-data inference. Int. Stat.

Rev. 63, 325-344.

Grabisch, M., Murofushi, T., and Sugeno, M., eds. (2000). Fuzzy Measures and

Integrals. Physica-Verlag.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics. Wiley.

Hartigan, J. A. (1983). Bayes Theory. Springer.

Hartung, J. (1981). Nonnegative minimum biased invariant estimation in variance component models. Ann. Stat. 9, 278-292.

Hill, B. (1965). Inference about variance components in the one-way model. J.

Am. Stat. Assoc. 60, 806-825.

Hinkley, D. V., Reid, N., and Snell, E. J., eds. (1991). Statistical Theory and

Modelling. Chapman and Hall.

Hodges, J. L., Jr., and Lehmann, E. L. (1950). Some problems in minimax point estimation. Ann. Math. Stat. 21, 182-197.

Hodges, J. L., Jr., and Lehmann, E. L. (1952). The use of previous experience

in reaching statistical decisions. Ann. Math. Stat. 23, 396-407.

Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math.

Stat. 35, 73-101.

Huber, P. J., and Strassen, V. (1973). Minimax tests and the Neyman-Pearson

lemma for capacities. Ann. Stat. 1, 251-263.

Hurwicz, L. (1951). Some specification problems and applications to econometric

models. Econometrica 19, 343-344.

Imaoka, H. (1997). On a subjective evaluation model by a generalized fuzzy

integral. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 5, 517-529.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation

problems. Proc. R. Soc. Lond., Ser. A 186, 453-461.

Klement, E. P., Mesiar, R., and Pap, E. (2004). Measure-based aggregation op¬

erators. Fuzzy Sets Syst. 142, 3-14.

Klotz, J. H., Milton, R. C., and Zacks, S. (1969). Mean square efficiency of estimators of variance components. J. Am. Stat. Assoc. 64, 1383-1402.

Kriegler, E. (2005). Imprecise Probability Analysis for Integrated Assessment of

Climate Change. PhD thesis, Universität Potsdam.

Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Ann.

Math. Stat. 22, 79-86.

Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989). Robust statistical

modeling using the t distribution. J. Am. Stat. Assoc. 84, 881-896.

Lehmann, E. L. (1959). Testing Statistical Hypotheses. Wiley.

Leonard, T. (1978). Density estimation, stochastic processes and prior information. J. R. Stat. Soc., Ser. B 40, 113-146.

Mesiar, R. (1995). Possibility measures based integrals — Theory and applications. In de Cooman et al., 99-107.

Moors, J. J. A. (1981). Inadmissibility of linearly invariant estimators in truncated parameter spaces. J. Am. Stat. Assoc. 76, 910-915.

Moral, S. (1992). Calculating uncertainty intervals from conditional convex sets

of probabilities. In Dubois et al., 199-206.

Moral, S., and de Campos, L. M. (1991). Updating uncertain information. In

Bouchon-Meunier et al., 58-67.

Morgenbesser, S., Suppes, P., and White, M., eds. (1969). Philosophy, Science,

and Method. Macmillan.

Munier, B. R., ed. (1988). Risk, Decision and Rationality. Reidel.

Nau, R. F. (1992). Indeterminate probabilities on finite sets. Ann. Stat. 20,

1737-1767.

Neyman, J., ed. (1951). Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.

Neyman, J., and Pearson, E. S. (1928). On the use and interpretation of certain

test criteria for purposes of statistical inference. Biometrika 20A, 175-240

and 263-294.

Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single

functional. Biometrika 75, 237-249.

Petrov, B. N., and Csáki, F., eds. (1973). ISIT '71. Akadémiai Kiadó.

Piatti, A., Zaffalon, M., and Trojani, F. (2005). Limits of learning from imperfect

observations under prior ignorance: the case of the imprecise Dirichlet model.

In Cozman et al., 276-286.

Pinheiro, J. C., Liu, C., and Wu, Y. N. (2001). Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate t distribution. J. Comput. Graph. Stat. 10, 249-276.

Portnoy, S. (1971). Formal Bayes estimation with application to a random effects

model. Ann. Math. Stat. 42, 1379-1402.

Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-

471.

Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Neyman, 131-148.

Savage, L. J. (1951). The theory of statistical decision. J. Am. Stat. Assoc. 46,

55-67.

Savage, L. J. (1954). The Foundations of Statistics. Wiley.

Schervish, M. J. (1995). Theory of Statistics. Springer.

Schmeidler, D. (1986). Integral representation without additivity. Proc. Am.

Math. Soc. 97, 255-261.

Schmeidler, D. (1989). Subjective probability and expected utility without additivity. Econometrica 57, 571-587.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components.

Wiley.

Severini, T. A. (2000). Likelihood Methods in Statistics. Oxford University Press.

Shackle, G. L. S. (1949). Expectation in Economics. Cambridge University Press.

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University

Press.

Shafer, G. (1982). Lindley's paradox. J. Am. Stat. Assoc. 77, 325-351.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System

Tech. J. 27, 379-423 and 623-656.

Shilkret, N. (1971). Maxitive measure and integration. Indag. Math. 33, 109-116.

Sipos, J. (1979). Integral with respect to a pre-measure. Math. Slovaca 29, 141-155.

Smets, P. (1982). Possibilistic inference from statistical data. In Ballester et al.,

611-613.

Stone, M., and Springer, B. G. F. (1965). A paradox involving quasi prior distributions. Biometrika 52, 623-627.

Sugeno, M. (1974). Theory of Fuzzy Integrals and Its Applications. PhD thesis,

Tokyo Institute of Technology.

Thompson, W. A., Jr. (1962). The problem of negative estimates of variance

components. Ann. Math. Stat. 33, 273-289.

Tiao, G. C., and Tan, W. Y. (1965). Bayesian analysis of random-effect models in

the analysis of variance — I: Posterior distribution of variance-components.

Biometrika 52, 37-53.

von Neumann, J., and Morgenstern, O. (1944). Theory of Games and Economic

Behavior. Princeton University Press.

Wakker, P. (1993). Unbounded utility for Savage's "foundations of statistics"

and other models. Math. Oper. Res. 18, 446-485.

Wald, A. (1939). Contributions to the theory of statistical estimation and testing

hypotheses. Ann. Math. Stat. 10, 299-326.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate.

Ann. Math. Stat. 20, 595-601.

Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman

and Hall.

Walley, P. (1996). Inferences from multinomial data: Learning about a bag of

marbles. J. R. Stat. Soc, Ser. B 58, 3-57.

Weber, S. (1984). ⊥-decomposable measures and integrals for Archimedean t-conorms ⊥. J. Math. Anal. Appl. 101, 114-138.

Weichselberger, K. (2001). Elementare Grundbegriffe einer allgemeineren Wahrscheinlichkeitsrechnung I. Physica-Verlag.

Weichselberger, K., and Augustin, T. (2003). On the symbiosis of two concepts

of conditional interval probability. In Bernard et al., 608-629.

Wilson, N. (2001). Modified upper and lower probabilities based on imprecise likelihoods. In de Cooman et al., 370-378.

Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets

Syst. 1, 3-28.

Index

rp 64

lil 27

f 63

P 31

r 47

f 36

r 45

MU 29

π 28, 112

^rp 95

© 79

M{/>-} 39

<P(-C) 42

V?~ 50

V?" 42

c 70

a.e. 26

approach to statistical decision

problems

Bayesian 5-7, 17-19, 64-78, 80-81,

83-84, 138

classical 4-5, 18, 83, 127-131,

141-147, 152-156

imprecise Bayesian 7-8, 12-13,

19-20, 22, 64-65, 70-78, 86-92,

138

MPL 13-20, 63-93, 116-169

asymptotic efficiency 152-156

asymptotic optimality 138-147

B,5 viii

Bayesian criterion 5

class of measures

closed under restrictions 29

closed under transformations 28

exhaustive 42

regular 29

coherence

temporal 6, 18

with respect to additive loss 6

with respect to maxitive loss 19

comonotonic functions 58

complete ignorance 3, 79-81, 88-91,

102-103, 110

conditional method 5

consequence 66

randomized 70-71

uncertain 66

V 1-2

decision 2,66

correct 2, 70, 76

domination 2

strict 101

randomized 71

decision criterion 2

Bayesian 5

Γ-minimax 7

Hurwicz 3

likelihood-based 97

0-independent 105

consistent 138

decomposable 123

revealing ignorance attraction

109

revealing ignorance aversion 102

satisfying the sure-thing principle

116

LRM^ 20, 105

minimax 3

essential 100

minimax risk 4

minimin 3

essential 112

MLD 21, 105

ignorance attracted analogue of

111

MPL 13,64

approximation algorithm

159-160

restricted Bayesian 7

decision function 4

density function 27, 47

VT 41

distribution function 38, 45

distributional dominance 39

dual function 112

ε-contamination class 91

equivariance 129-130

error function 151-152, 165

asymmetric 156

ess 36

essential minimax criterion 100

essential minimin criterion 112

essential supremum 36

estimation problem 2-3, 149-152

evaluation

conditional 123

conjugate 112-113, 150

in terms of loss 67, 76-77

in terms of utility 76-77

lower 112-113, 150

post-data 6

pre-data 4-6

upper 112-113, 150

worst-case 73, 75-76, 78

fo 100-102

/i 100-102

functional

bihomogeneous 42

calibrated 42

homogeneous 42

monotonic 42

subadditive 52

^-independent 42

Γ-minimax criterion 7

hierarchical model 81-93

Hurwicz criterion 3

hypograph 48

impossible estimate 160-169

indicator property 30

integral 30

bihomogeneous 41

bimonotonic 40

Choquet 47,61

comonotonic additive 58

generalized Imaoka 49

generalized Sugeno 32

homogeneous 30

maxitive 33

completely 33

monotonic 31

quasi-subadditive 52

rectangular 36

regularized 45

regular 41

respecting distributional dominance

39

Shilkret 31

Sipos 61

subadditive 53

support-based 38

symmetric 50

transformation invariant 35

translation equivariant 60

L 1-3

Id 2, 151

hk 8-10

hkg 9

likelihood function 8-10, 67

prior 80-81

profile 9

pseudo 9

updating 78-79

likelihood ratio 11

likelihood ratio test 11

likelihood-based confidence region

11

likelihood-based inference 10-13,

15-17, 101-102

calibration problem of 11

likelihood ratio test 11

likelihood-based confidence region 11

maximum likelihood estimate 10

likelihood-based region minimax 20

likelihood-based region minimin 111

loss function 1-3

LR 11

LRM^ criterion 20, 105

maximum likelihood decision 21

maximum likelihood estimate 10

measure 26

2-alternating 26

continuous from below 28

dual 28

maxitive

completely 27

finitely 27

monotonic 26

nonzero 26

normalized 26

possibility 28

probability 64-65,80-81

imprecise 7,86-93

lower 86-93

prior 80-81,87-91

upper 86-93

relative plausibility 63-65, 78-81

nondegenerate 63

prior 80-81

updating of 79

subadditive 26

minimax criterion 3

minimax plausibility-weighted loss

13

minimax risk criterion 4

minimin criterion 3

MLD criterion 21, 105

ignorance attracted analogue of

111

MPL criterion 13, 64

approximation algorithm 159-160

Ml¥ 42

P vii

V 1

parametrization 9

parametrization invariance 128

preference relation 66-79

monotonic 68

representation 66

quasiconvex function 151, 159

strictly 151, 159

regular extension 87-92, 138

representation theorem 66-79

restricted Bayesian criterion 7

risk function 4

robustness 92-93, 130-137

rp 63-65,78-81

Sp 98

sequence of decisions

asymptotically optimal 141

obtained by applying a likelihood-

based decision criterion 140

statistical decision problem 1-3

statistical model 1,64-65

imprecise 91-93, 131

sure-thing principle 116-127, 132

Savage's 68-69, 116-117

strict version 116-117

ty viii

total preorder 66

uncertainty aversion 69-72, 76-77

Vrv 95-102

Curriculum Vitae

Personal Data:

Name: Marco E. G. V. Cattaneo

Born: September 19, 1976 in Kaiserslautern, West Germany

Nationality: Swiss, citizen of Capriasca, Ticino

Education:

1991-1995: Liceo Cantonale di Locarno

1995-2000: ETH Zurich, study of mathematics

2002-2007: ETH Zurich, doctoral studies in mathematics

Teaching Experience:

1997-1998: ETH Zurich, undergraduate teaching assistant

2000-2002: ETH Zurich, teaching assistant in mathematics

2002-2006: ETH Zurich, teaching assistant in statistics