
STATISTICS 450/850

Estimation and Hypothesis Testing

Supplementary Lecture Notes

Don L. McLeish and Cyntha A. Struthers
Dept. of Statistics and Actuarial Science

University of Waterloo Waterloo, Ontario, Canada

Winter 2013

Contents

1 Properties of Estimators
1.1 Prerequisite Material
1.2 Introduction
1.3 Unbiasedness and Mean Square Error
1.4 Sufficiency
1.5 Minimal Sufficiency
1.6 Completeness
1.7 The Exponential Family
1.8 Ancillarity

2 Maximum Likelihood Estimation
2.1 Maximum Likelihood Method - One Parameter
2.2 Principles of Inference
2.3 Properties of the Score and Information - Regular Model
2.4 Maximum Likelihood Method - Multiparameter
2.5 Incomplete Data and The E.M. Algorithm
2.6 The Information Inequality
2.7 Asymptotic Properties of M.L. Estimators - One Parameter
2.8 Interval Estimators
2.9 Asymptotic Properties of M.L. Estimators - Multiparameter
2.10 Nuisance Parameters and M.L. Estimation
2.11 Problems with M.L. Estimators
2.12 Historical Notes

3 Other Methods of Estimation
3.1 Best Linear Unbiased Estimators
3.2 Equivariant Estimators
3.3 Estimating Equations
3.4 Bayes Estimation

4 Hypothesis Tests
4.1 Introduction
4.2 Uniformly Most Powerful Tests
4.3 Locally Most Powerful Tests
4.4 Likelihood Ratio Tests
4.5 Score and Maximum Likelihood Tests
4.6 Bayesian Hypothesis Tests

5 Appendix
5.1 Inequalities and Useful Results
5.2 Distributional Results
5.3 Limiting Distributions
5.4 Proofs

Chapter 1

Properties of Estimators

1.1 Prerequisite Material

The following topics should be reviewed:

1. Tables of special discrete and continuous distributions including the multivariate normal distribution. Location and scale parameters.

2. Distribution of a transformation of one or more random variables including change of variable(s).

3. Moment generating function of one or more random variables.

4. Multiple linear regression.

5. Limiting distributions: convergence in probability and convergence in distribution.

1.2 Introduction

Before beginning a discussion of estimation procedures, we assume that we have designed and conducted a suitable experiment and collected data X1, . . . , Xn, where n, the sample size, is fixed and known. These data are expected to be relevant to estimating a quantity of interest θ which we assume is a statistical parameter, for example, the mean of a normal distribution. We assume we have adopted a model which specifies the link between the parameter θ and the data we obtained. The model is the framework within which we discuss the properties of our estimators. Our model might specify that the observations X1, . . . , Xn are independent with


a normal distribution, mean θ and known variance σ^2 = 1. Usually, as here, the only unknown is the parameter θ. We have specified completely the joint distribution of the observations up to this unknown parameter.

1.2.1 Note:

We will sometimes denote our data more compactly by the random vector X = (X1, . . . , Xn).

The model, therefore, can be written in the form {f(x; θ); θ ∈ Ω} where Ω is the parameter space or set of permissible values of the parameter and f(x; θ) is the probability (density) function.

1.2.2 Definition

A statistic, T (X), is a function of the data X which does not depend onthe unknown parameter θ.

Note that although a statistic, T(X), is not a function of θ, its distribution can depend on θ. An estimator is a statistic considered for the purpose of estimating a given parameter. It is our aim to find a “good” estimator of the parameter θ. In the search for good estimators of θ it is often useful to know if θ is a location or scale parameter.

1.2.3 Location and Scale Parameters

Suppose X is a continuous random variable with p.d.f. f(x; θ). Let F0(x) = F(x; θ = 0) and f0(x) = f(x; θ = 0). The parameter θ is called a location parameter of the distribution if

F(x; θ) = F0(x − θ), θ ∈ ℝ

or equivalently

f(x; θ) = f0(x − θ), θ ∈ ℝ.

Let F1(x) = F(x; θ = 1) and f1(x) = f(x; θ = 1). The parameter θ is called a scale parameter of the distribution if

F(x; θ) = F1(x/θ), θ > 0

or equivalently

f(x; θ) = (1/θ) f1(x/θ), θ > 0.

1.2.4 Problem

(1) If X ∼ EXP(1, θ) then show that θ is a location parameter of the distribution. See Figure 1.1.

(2) If X ∼ EXP(θ) then show that θ is a scale parameter of the distribution. See Figure 1.2.

Figure 1.1: EXP(1, θ) p.d.f.’s (f(x) plotted against x for θ = −1, 0, 1)

1.2.5 Problem

(1) If X ∼ CAU(1, θ) then show that θ is a location parameter of the distribution.

(2) If X ∼ CAU(θ, 0) then show that θ is a scale parameter of the distribution.

Figure 1.2: EXP(θ) p.d.f.’s (f(x) plotted against x for θ = 0.5, 1, 2)

1.3 Unbiasedness and Mean Square Error

How do we ensure that a statistic T(X) is estimating the correct parameter? How do we ensure that it is not consistently too large or too small, and that as much variability as possible has been removed? We consider the problem of estimating the correct parameter first. We begin with a review of the definition of the expectation of a random variable.

1.3.1 Definition

If X is a discrete random variable with p.f. f(x; θ) and support set A then

E[h(X); θ] = Σ_{x∈A} h(x) f(x; θ)

provided the sum converges absolutely, that is, provided

E[|h(X)|; θ] = Σ_{x∈A} |h(x)| f(x; θ) < ∞.

If X is a continuous random variable with p.d.f. f(x; θ) then

E[h(X); θ] = ∫_{−∞}^{∞} h(x) f(x; θ) dx,

provided the integral converges absolutely, that is, provided

E[|h(X)|; θ] = ∫_{−∞}^{∞} |h(x)| f(x; θ) dx < ∞.

If E[|h(X)|; θ] = ∞ then we say that E[h(X); θ] does not exist.

1.3.2 Problem

Suppose that X has a CAU(1, θ) distribution. Show that E(X; θ) does notexist and that this implies E(Xk; θ) does not exist for k = 2, 3, . . ..

1.3.3 Problem

Suppose that X is a random variable with probability density function

f(x; θ) = θ / x^{θ+1},   x ≥ 1.

For what values of θ do E(X; θ) and Var(X; θ) exist?

1.3.4 Problem

If X ∼ GAM(α,β) show that

E(X^p; α, β) = β^p Γ(α + p) / Γ(α).

For what values of p does this expectation exist?

1.3.5 Problem

Suppose X is a non-negative continuous random variable with moment generating function M(t) = E(e^{tX}) which exists for t ∈ ℝ. The function M(−t) is often called the Laplace Transform of the probability density function of X. Show that

E(X^{−p}) = (1/Γ(p)) ∫_0^{∞} M(−t) t^{p−1} dt,   p > 0.

1.3.6 Definition

A statistic T (X) is an unbiased estimator of θ if E[T (X); θ] = θ for allθ ∈ Ω.

1.3.7 Example

Suppose Xi ∼ POI(iθ) i = 1, ..., n independently. Determine whether thefollowing estimators are unbiased estimators of θ:

T1 = (1/n) Σ_{i=1}^{n} X_i/i,   T2 = (2/(n + 1)) X̄ = (2/(n(n + 1))) Σ_{i=1}^{n} X_i.

Is unbiased estimation preserved under transformations? For example, if T is an unbiased estimator of θ, is T^2 an unbiased estimator of θ^2?

1.3.8 Example

Suppose X1, . . . , Xn are uncorrelated random variables with E(Xi) = μ and Var(Xi) = σ^2, i = 1, 2, . . . , n. Show that

T = Σ_{i=1}^{n} a_i X_i

is an unbiased estimator of μ if Σ_{i=1}^{n} a_i = 1. Find an unbiased estimator of σ^2 assuming (i) μ is known, (ii) μ is unknown.

If (X1, . . . , Xn) is a random sample from the N(μ, σ^2) distribution then show that S is not an unbiased estimator of σ where

S^2 = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)^2 = (1/(n − 1)) [ Σ_{i=1}^{n} X_i^2 − n X̄^2 ]

is the sample variance. What happens to E(S) as n → ∞?

1.3.9 Example

Suppose X ∼ BIN(n, θ). Find an unbiased estimator, T(X), of θ. Is [T(X)]^{−1} an unbiased estimator of θ^{−1}? Does there exist an unbiased estimator of θ^{−1}?

1.3.10 Problem

Let X1, . . . ,Xn be a random sample from the POI(θ) distribution. Find

E[X^{(k)}; θ] = E[X(X − 1) · · · (X − k + 1); θ],

the kth factorial moment of X, and thus find an unbiased estimator of θ^k, k = 1, 2, . . ..

We now consider the properties of an estimator from the point of viewof Decision Theory. In order to determine whether a given estimator orstatistic T = T (X) does well for estimating θ we consider a loss functionor distance function between the estimator and the true value which wedenote L(θ, T ). This loss function is averaged over all possible values of thedata to obtain the risk:

Risk = E[L(θ, T ); θ].

A good estimator is one with little risk; a bad estimator is one whose risk is high. One particular loss function is L(θ, T) = (T − θ)^2, which is called the squared error loss function. Its corresponding risk, called mean squared error (M.S.E.), is given by

MSE(T; θ) = E[(T − θ)^2; θ].

Another loss function is L(θ, T) = |T − θ|, which is called the absolute error loss function. Its corresponding risk, called the mean absolute error, is given by

Risk = E(|T − θ|; θ).

1.3.11 Problem

Show that

MSE(T; θ) = Var(T; θ) + [Bias(T; θ)]^2

where

Bias(T; θ) = E(T; θ) − θ.

1.3.12 Example

Let X1, . . . , Xn be a random sample from a UNIF(0, θ) distribution. Compare the M.S.E.’s of the following three estimators of θ:

T1 = 2X̄,   T2 = X(n),   T3 = (n + 1)X(1)

where

X(n) = max(X1, . . . ,Xn) and X(1) = min(X1, . . . ,Xn).
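A quick way to get a feel for this comparison, before doing the exact calculation, is simulation. The following sketch (not part of the notes; assumes Python with numpy) estimates the three M.S.E.’s by Monte Carlo for one choice of n and θ:

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 5.0, 10, 100_000

    x = rng.uniform(0, theta, size=(reps, n))
    t1 = 2 * x.mean(axis=1)            # 2 times the sample mean
    t2 = x.max(axis=1)                 # largest order statistic X_(n)
    t3 = (n + 1) * x.min(axis=1)       # (n+1) times the smallest order statistic X_(1)

    for name, t in [("T1", t1), ("T2", t2), ("T3", t3)]:
        print(name, "estimated MSE:", round(np.mean((t - theta) ** 2), 4))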

1.3.13 Problem

Let X1, . . . , Xn be a random sample from a UNIF(θ, 2θ) distribution with θ > 0. Consider the following estimators of θ:

T1 = (1/2) X(n),   T2 = X(1),   T3 = (1/3) X(n) + (1/3) X(1),   T4 = (5/14) X(n) + (2/7) X(1).

(a) Show that all four estimators can be written in the form

Za = a X(1) + (1/2)(1 − a) X(n)     (1.1)

for suitable choice of a.

(b) Find E(Za; θ) and thus show that T3 is the only unbiased estimator of θ of the form (1.1).

(c) Compare the M.S.E.’s of these estimators and show that T4 has the smallest M.S.E. of all estimators of the form (1.1).

Hint: Find Var(Za; θ), show

Cov(X(1), X(n); θ) = θ^2 / [(n + 1)^2 (n + 2)],

and thus find an expression for MSE(Za; θ).

1.3.14 Problem

Let X1, . . . , Xn be a random sample from the N(μ, σ^2) distribution. Consider the following estimators of σ^2:

S^2,   T1 = ((n − 1)/n) S^2,   T2 = ((n − 1)/(n + 1)) S^2.

Compare the M.S.E.’s of these estimators by graphing them as a function of σ^2 for n = 5.

1.3.15 Example

Let X ∼ N(θ, 1). Consider the following three estimators of θ:

T1 = X,   T2 = X/2,   T3 = 0.

Which estimator is better in terms of M.S.E.?

Now

MSE(T1; θ) = E[(X − θ)^2; θ] = Var(X; θ) = 1

MSE(T2; θ) = E[(X/2 − θ)^2; θ] = Var(X/2; θ) + [E(X/2; θ) − θ]^2
           = 1/4 + (θ/2 − θ)^2 = (θ^2 + 1)/4

MSE(T3; θ) = E[(0 − θ)^2; θ] = θ^2

The M.S.E.’s can be compared by graphing them as functions of θ. SeeFigure 1.3.

Figure 1.3: Comparison of M.S.E.’s for Example 1.3.15 (MSE(T1), MSE(T2) and MSE(T3) plotted against θ)
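The three curves in Figure 1.3 can be reproduced directly from these formulas. A minimal sketch (assuming Python with numpy and matplotlib; not part of the original notes):

    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(-2, 2, 400)

    mse_t1 = np.ones_like(theta)        # MSE(T1; theta) = 1
    mse_t2 = (theta ** 2 + 1) / 4       # MSE(T2; theta) = (theta^2 + 1)/4
    mse_t3 = theta ** 2                 # MSE(T3; theta) = theta^2

    plt.plot(theta, mse_t1, label="MSE(T1)")
    plt.plot(theta, mse_t2, label="MSE(T2)")
    plt.plot(theta, mse_t3, label="MSE(T3)")
    plt.xlabel("theta"); plt.ylabel("MSE"); plt.legend()
    plt.show()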

One of the conclusions of the above example is that there is no estimator, not even the natural one T1 = X, which outperforms all other estimators. One is better for some values of the parameter in terms of smaller risk, while another, even the trivial estimator T3, is better for other values of the parameter. In order to achieve a best estimator, it is unfortunately necessary to restrict ourselves to a specific class of estimators and select the best within the class. Of course, the best within this class will only be as good as the class itself, and therefore we must ensure that restricting ourselves to this class is sensible and not unduly restrictive. The class of all estimators is usually too large to obtain a meaningful solution. One possible restriction is to the class of all unbiased estimators.

1.3.16 Definition

An estimator T = T(X) is said to be a uniformly minimum variance unbiased estimator (U.M.V.U.E.) of the parameter θ if (i) it is an unbiased estimator of θ and (ii) among all unbiased estimators of θ it has the smallest M.S.E. and therefore the smallest variance.

1.3.17 Problem

Suppose X has a GAM(2, θ) distribution and consider the class of estimators {aX; a ∈ ℝ+}. Find the estimator in this class which minimizes the mean absolute error for estimating the scale parameter θ. Hint: Show

E(|aX − θ|; θ) = θ E(|aX − 1|; θ = 1).

Is this estimator unbiased? Is it the best estimator in the class of all functions of X?

1.4 Sufficiency

A sufficient statistic is one that, from a certain perspective, contains all the necessary information for making inferences about the unknown parameters in a given model. By making inferences we mean the usual conclusions about parameters such as estimators, significance tests and confidence intervals.

Suppose the data are X and T = T(X) is a sufficient statistic. The intuitive basis for sufficiency is that if X has a conditional distribution given T(X) that does not depend on θ, then X is of no value in addition to T in estimating θ. The assumption is that random variables carry information on a statistical parameter θ only insofar as their distributions (or conditional distributions) change with the value of the parameter. All of this, of course, assumes that the model is correct and θ is the only unknown. It should be remembered that the distribution of X given a sufficient statistic T may have a great deal of value for some other purpose, such as testing the validity of the model itself.

1.4.1 Definition

A statistic T(X) is sufficient for a statistical model {f(x; θ); θ ∈ Ω} if the distribution of the data X1, . . . , Xn given T = t does not depend on the unknown parameter θ.

To understand this definition suppose that X is a discrete random variable and T = T(X) is a sufficient statistic for the model {f(x; θ); θ ∈ Ω}. Suppose we observe data x with corresponding value of the sufficient statistic T(x) = t. To Experimenter A we give the observed data x while to Experimenter B we give only the value of T = t. Experimenter A can obviously calculate T(x) = t as well. Is Experimenter A “better off” than Experimenter B in terms of making inferences about θ? The answer is no since Experimenter B can generate data which is “as good as” the data which Experimenter A has in the following manner. Since T(X) is a sufficient statistic, the conditional distribution of X given T = t does not depend on the unknown parameter θ. Therefore Experimenter B can use this distribution and a randomization device such as a random number generator to generate an observation y from the random variable Y such that

P (Y = y|T = t) = P (X = y|T = t) (1.2)

and such that X and Y have the same unconditional distribution. So Experimenter A who knows x and Experimenter B who knows y have equivalent information about θ. Obviously Experimenter B did not gain any new information about θ by generating the observation y. All of her information for making inferences about θ is contained in the knowledge that T = t. Experimenter B has just as much information as Experimenter A, who knows the entire sample x.

Now X and Y have the same unconditional distribution because

P(X = x; θ)
= P[X = x, T(X) = T(x); θ]   since the event {X = x} is a subset of the event {T(X) = T(x)}
= P[X = x | T(X) = T(x)] P[T(X) = T(x); θ]
= P(X = x | T = t) P(T = t; θ)   where t = T(x)
= P(Y = x | T = t) P(T = t; θ)   using (1.2)
= P[Y = x | T(X) = T(x)] P[T(X) = T(x); θ]
= P[Y = x, T(X) = T(x); θ]
= P(Y = x; θ)   since the event {Y = x} is a subset of the event {T(X) = T(x)}.

The use of a sufficient statistic is formalized in the following principle:

1.4.2 The Sufficiency Principle

Suppose T(X) is a sufficient statistic for a model {f(x; θ); θ ∈ Ω}. Suppose x1, x2 are two different possible observations that have identical values of the sufficient statistic:

T(x1) = T(x2).

Then whatever inference we would draw from observing x1 we should draw exactly the same inference from x2.

If we adopt the sufficiency principle then we partition the sample space (the set of all possible outcomes) into mutually exclusive sets of outcomes in which all outcomes in a given set lead to the same inference about θ. This is referred to as data reduction.

1.4.3 Example

Let (X1, . . . , Xn) be a random sample from the POI(θ) distribution. Show that T = Σ_{i=1}^{n} X_i is a sufficient statistic for this model.
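The defining property can also be checked empirically: conditional on T = t, the distribution of (X1, . . . , Xn) should be the same whatever θ generated the data. A minimal simulation sketch (Python with numpy; not part of the notes) compares the conditional distribution of X1 given T = t for two quite different values of θ:

    import numpy as np

    rng = np.random.default_rng(0)
    n, t, reps = 5, 10, 200_000

    def cond_dist_x1(theta):
        """Empirical distribution of X1 given sum(X) = t under POI(theta)."""
        x = rng.poisson(theta, size=(reps, n))
        keep = x[x.sum(axis=1) == t]          # condition on the sufficient statistic
        return np.bincount(keep[:, 0], minlength=t + 1) / len(keep)

    # The two conditional distributions should agree up to simulation noise.
    print(np.round(cond_dist_x1(1.5), 3))
    print(np.round(cond_dist_x1(3.0), 3))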

1.4.4 Problem

Let X1, . . . ,Xn be a random sample from the Bernoulli(θ) distribution and

let T = Σ_{i=1}^{n} X_i.

(a) Find the conditional distribution of (X1, . . . , Xn) given T = t and thus show T is a sufficient statistic for this model.

(b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device.

(c) Let U = U(X1) = 1 if X1 = 1 and 0 otherwise. Find E(U) and E(U|T = t).

1.4.5 Problem

Let X1, . . . ,Xn be a random sample from the GEO(θ) distribution and let

T = Σ_{i=1}^{n} X_i.

(a) Find the conditional distribution of (X1, . . . , Xn) given T = t and thus show T is a sufficient statistic for this model.

(b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device.

(c) Find E(X1|T = t).

1.4.6 Problem

Let X1, . . . , Xn be a random sample from the EXP(1, θ) distribution and let T = X(1).

(a) Find the conditional distribution of (X1, . . . , Xn) given T = t and thus show T is a sufficient statistic for this model.

(b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device.

(c) Find E[(X1 − 1); θ] and E[(X1 − 1)|T = t].

1.4.7 Problem

Let X1, . . . , Xn be a random sample from the distribution with probability density function f(x; θ). Show that the order statistic T(X) = (X(1), . . . , X(n)) is sufficient for the model {f(x; θ); θ ∈ Ω}.

The following theorem gives a straightforward method for identifying sufficient statistics.

1.4.8 Factorization Criterion for Sufficiency

Suppose X has probability (density) function {f(x; θ); θ ∈ Ω} and T(X) is a statistic. Then T(X) is a sufficient statistic for {f(x; θ); θ ∈ Ω} if and only if there exist two non-negative functions g(.) and h(.) such that

f(x; θ) = g(T(x); θ) h(x), for all x, θ ∈ Ω.

Note that this factorization need only hold on a set A of possible values of X which carries the full probability, that is,

f (x; θ) = g(T (x); θ)h(x), for all x ∈ A, θ ∈ Ω

where P (X ∈ A; θ) = 1, for all θ ∈ Ω.

Note that the function g(T(x); θ) depends on both the parameter θ and the sufficient statistic T(X) while the function h(x) does not depend on the parameter θ.

1.4.9 Example

Let X1, . . . , Xn be a random sample from the N(μ, σ^2) distribution. Show that (Σ_{i=1}^{n} X_i, Σ_{i=1}^{n} X_i^2) is a sufficient statistic for this model. Show that (X̄, S^2) is also a sufficient statistic for this model.

1.4.10 Example

Let X1, . . . ,Xn be a random sample from the WEI(1, θ) distribution. Finda sufficient statistic for this model.

1.4.11 Example

LetX1, . . . ,Xn be a random sample from the UNIF(0, θ) distribution. Showthat T = X(n) is a sufficient statistic for this model. Find the conditionalprobability density function of (X1, . . . ,Xn) given T = t.

1.4.12 Problem

Let X1, . . . ,Xn be a random sample from the EXP(1, θ) distribution. Showthat X(1) is a sufficient statistic for this model and find the conditionalprobability density function of (X1, . . . ,Xn) given X(1) = t.

1.4.13 Problem

Use the Factorization Criterion for Sufficiency to show that if T(X) is a sufficient statistic for the model {f(x; θ); θ ∈ Ω} then any one-to-one function of T is also a sufficient statistic.

We have seen above that sufficient statistics are not unique. One-to-one functions of a statistic contain the same information as the original statistic. Fortunately, we can characterise all one-to-one functions of a statistic in terms of the way in which they partition the sample space. Note that the partition induced by the sufficient statistic provides a partition of the sample space into sets of observations which lead to the same inference about θ. See Figure 1.4.

1.4.14 Definition

The partition of the sample space induced by a given statistic T(X) is the partition or class of sets of the form {x; T(x) = t} as t ranges over its possible values.

Figure 1.4: Partition of the sample space induced by T (the sets {x; T(x) = t}, t = 1, . . . , 5)

From the point of view of statistical information on a parameter, a statistic is sufficient if it contains all of the information available in a data set about a parameter. There is no guarantee that the statistic does not contain more information than is necessary. For example, the data (X1, . . . , Xn) is always a sufficient statistic (why?), but in many cases, there is a further data reduction possible. For example, for independent observations from a N(θ, 1) distribution, the sample mean X̄ is also a sufficient statistic but it is reduced as much as possible. Of course, T = X̄^3 is a sufficient statistic since T and X̄ are one-to-one functions of each other. From X̄ we can obtain T and from T we can obtain X̄ so both of these statistics are equivalent in terms of the amount of information they contain about θ.

Now suppose the function g is a many-to-one function, which is not invertible. Suppose further that g(X1, . . . , Xn) is a sufficient statistic. Then the reduction from (X1, . . . , Xn) to g(X1, . . . , Xn) is a non-trivial reduction of the data. Sufficient statistics that have experienced as much data reduction as is possible without losing the sufficiency property are called minimal sufficient statistics.

1.5 Minimal Sufficiency

Now we wish to consider those circumstances under which a given statistic(actually the partition of the sample space induced by the given statistic)allows no further real reduction. Suppose g(.) is a many-to-one functionand hence is a real reduction of the data. Is g(T ) still sufficient? In somecases, as in the example below, the answer is “no”.

1.5.1 Problem

Let X1, . . . ,Xn be a random sample from the Bernoulli(θ) distribution.

Show that T(X) = Σ_{i=1}^{n} X_i is sufficient for this model. Show that if g is not a one-to-one function (g(t1) = g(t2) = g0 for some integers t1 and t2 where 0 ≤ t1 < t2 ≤ n) then g(T) cannot be sufficient for {f(x; θ); θ ∈ Ω}.
Hint: Find P(T = t1 | g(T) = g0).

1.5.2 Definition

A statistic T(X) is a minimal sufficient statistic for {f(x; θ); θ ∈ Ω} if it is sufficient and if for any other sufficient statistic U(X), there exists a function g(.) such that T(X) = g(U(X)).

This definition says that a minimal sufficient statistic is a function of every other sufficient statistic. In terms of the partition induced by the minimal sufficient statistic, this implies that the minimal sufficient statistic induces the coarsest partition possible of the sample space among all sufficient statistics. This partition is called the minimal sufficient partition.

1.5.3 Problem

Prove that if T1 and T2 are both minimal sufficient statistics, then they induce the same partition of the sample space.

The following theorem is useful in showing a statistic is minimally sufficient.

1.5.4 Theorem - Minimal Sufficient Statistic

Suppose the model is {f(x; θ); θ ∈ Ω} and let A = support of X. Partition A into the equivalence classes defined by

Ay = { x ; f(x; θ)/f(y; θ) = H(x, y) for all θ ∈ Ω },   y ∈ A.

This is a minimal sufficient partition. The statistic T(X) which induces this partition is a minimal sufficient statistic.

The proof of this theorem is given in Section 5.4.2 of the Appendix.

1.5.5 Example

Let (X1, . . . , Xn) be a random sample from the distribution with probability density function

f(x; θ) = θ x^{θ−1},   0 < x < 1, θ > 0.

Find a minimal sufficient statistic for the model {f(x; θ); θ ∈ Ω}.

1.5.6 Example

Let X1, . . . ,Xn be a random sample from the N(θ, θ2) distribution. Find aminimal sufficient statistic for the model f (x; θ) ; θ ∈ Ω.

1.5.7 Problem

LetX1, . . . ,Xn be a random sample from the LOG(1, θ) distribution. Provethat the order statistic (X(1), . . . ,X(n)) is a minimal sufficient statistic forthe model f (x; θ) ; θ ∈ Ω.

1.5.8 Problem

Let X1, . . . ,Xn be a random sample from the CAU(1, θ) distribution. Finda minimal sufficient statistic for the model f (x; θ) ; θ ∈ Ω.

1.5.9 Problem

Let X1, . . . , Xn be a random sample from the UNIF(θ, θ + 1) distribution. Find a minimal sufficient statistic for the model {f(x; θ); θ ∈ Ω}.

1.5.10 Problem

Let Ω denote the set of all probability density functions. Let (X1, . . . ,Xn)be a random sample from a distribution with probability density functionf ∈ Ω. Prove that the order statistic (X(1), . . . ,X(n)) is a minimal suffi-cient statistic for the model f(x); f ∈ Ω. Note that in this example theunknown “parameter” is f .

1.5.11 Problem - Linear Regression

Suppose E(Y) = Xβ where Y = (Y1, . . . , Yn)^T is a vector of independent and normally distributed random variables with Var(Yi) = σ^2, i = 1, . . . , n, X is an n × k matrix of known constants of rank k, and β = (β1, . . . , βk)^T is a vector of unknown parameters. Let

β̂ = (X^T X)^{−1} X^T Y   and   S_e^2 = (Y − Xβ̂)^T (Y − Xβ̂)/(n − k).

Show that (β̂, S_e^2) is a minimal sufficient statistic for this model.
Hint: Show

(Y − Xβ)^T (Y − Xβ) = (n − k) S_e^2 + (β̂ − β)^T X^T X (β̂ − β).

1.6 Completeness

The property of completeness is one which is useful for determining the uniqueness of estimators, for verifying, in some cases, that a minimal sufficient statistic has been found, and for finding U.M.V.U.E.’s.

Let X1, . . . , Xn denote the observations from a distribution with probability (density) function {f(x; θ); θ ∈ Ω}. Suppose T(X) is a statistic and u(T), a function of T, is an unbiased estimator of θ so that E[u(T); θ] = θ for all θ ∈ Ω. Under what circumstances is this the only unbiased estimator which is a function of T? To answer this question, suppose u1(T) and u2(T) are both unbiased estimators of θ and consider the difference h(T) = u1(T) − u2(T). Since u1(T) and u2(T) are both unbiased estimators we have E[h(T); θ] = 0 for all θ ∈ Ω. Now if the only function h(T) which satisfies E[h(T); θ] = 0 for all θ ∈ Ω is the function h(t) = 0, then the two unbiased estimators must be identical. A statistic T with this property is said to be complete. The property of completeness is really a property of the family of distributions of T generated as θ varies.

1.6.1 Definition

The statistic T = T (X) is a complete statistic for f (x; θ) ; θ ∈ Ω if

E[h(T ); θ] = 0, for all θ ∈ Ω

implies

P [h(T ) = 0; θ] = 1 for all θ ∈ Ω.

1.6.2 Example

Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution. Consider T = T(X) = (X1, Σ_{i=2}^{n} X_i). Prove that T is a sufficient statistic for the model {f(x; θ); θ ∈ Ω} but not a complete statistic.

1.6.3 Example

Let X1, . . . ,Xn be a random sample from the Bernoulli(θ) distribution.

Prove that T = T(X) = Σ_{i=1}^{n} X_i is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω}.

1.6.4 Example

Let X1, . . . , Xn be a random sample from the UNIF(0, θ) distribution. Show that T = T(X) = X(n) is a complete statistic for the model {f(x; θ); θ ∈ Ω}.

1.6.5 Problem

Prove that any one-to-one function of a complete sufficient statistic is acomplete sufficient statistic.

1.6.6 Problem

Let X1, . . . , Xn be a random sample from the N(θ, aθ^2) distribution where a > 0 is a known constant and θ > 0. Show that the minimal sufficient statistic is not a complete statistic.

1.6.7 Theorem

If T(X) is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω} then T(X) is a minimal sufficient statistic for {f(x; θ); θ ∈ Ω}.

The proof of this theorem is given in Section 5.4.3 of the Appendix.

1.6.8 Problem

The converse to the above theorem is not true. Let X1, . . . , Xn be a random sample from the UNIF(θ − 1, θ + 1) distribution. Show that T = T(X) = (X(1), X(n)) is a minimal sufficient statistic for the model. Show also that for the non-zero function

h(T) = (X(n) − X(1))/2 − (n − 1)/(n + 1),

E[h(T ); θ] = 0 for all θ ∈ Ω and therefore T is not a complete statistic.

1.6.9 Example

Let X = X1, . . . ,Xn be a random sample from the UNIF(0, θ) distribu-tion. Prove that T = T (X) = X(n) is a minimal sufficient statistic forf (x; θ) ; θ ∈ Ω.

1.6.10 Problem

Let X = X1, . . . ,Xn be a random sample from the EXP(1, θ) distribu-tion. Prove that T = T (X) = X(1) is a minimal sufficient statistic forf (x; θ) ; θ ∈ Ω.

1.6.11 Theorem

For any random variables X and Y ,

E(X) = E[E(X|Y )]

and

Var(X) = E[Var(X|Y)] + Var[E(X|Y)].

1.6.12 Theorem

If T = T (X) is a complete statistic for the model f (x; θ) ; θ ∈ Ω, thenthere is at most one function of T that provides an unbiased estimator ofthe parameter τ(θ).

1.6.13 Problem

Prove Theorem 1.6.12.

1.6.14 Theorem (Lehmann-Scheffe)

If T = T(X) is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω} and E[g(T); θ] = τ(θ), then g(T) is the unique U.M.V.U.E. of τ(θ).

1.6.15 Example

Let X1, . . . ,Xn be a random sample from the Bernoulli(θ) distribution.Find the U.M.V.U.E. of τ(θ) = θ2.

1.6.16 Example

Let X1, . . . ,Xn be a random sample from the UNIF(0, θ) distribution. Findthe U.M.V.U.E. of τ(θ) = θ.

1.6.17 Problem

Let X1, . . . , Xn be a random sample from the Bernoulli(θ) distribution. Find the U.M.V.U.E. of τ(θ) = θ(1 − θ).

1.6.18 Problem

Suppose X has a Hypergeometric distribution with p.f.

f(x; θ) = (Nθ choose x)(N − Nθ choose n − x) / (N choose n),   x = 0, 1, . . . , min(Nθ, N − Nθ);

θ ∈ Ω = {0, 1/N, 2/N, . . . , 1}.

Show that X is a complete sufficient statistic. Find the U.M.V.U.E. of θ.

1.6.19 Problem

Let X1, . . . ,Xn be a random sample from the EXP(β,μ) distribution whereβ is known. Show that T = X(1) is a complete sufficient statistic for thismodel. Find the U.M.V.U.E. of μ and the U.M.V.U.E. of μ2.

1.6.20 Problem

Suppose X1, ...,Xn is a random sample from the UNIF(a, b) distribution.Show that T = (X(1),X(n)) is a complete sufficient statistic for this model.Find the U.M.V.U.E.’s of a and b. Find the U.M.V.U.E. of the mean of Xi.

1.6.21 Problem

Let T (X) be an unbiased estimator of τ(θ). Prove that T (X) is a U.M.V.U.E.of τ(θ) if and only if E(UT ; θ) = 0 for all U(X) such that E(U) = 0 for allθ ∈ Ω.

1.6.22 Theorem (Rao-Blackwell)

If T = T(X) is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω} and U = U(X) is any unbiased estimator of τ(θ), then E(U|T) is the U.M.V.U.E. of τ(θ).
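As a concrete illustration of the improvement that comes from conditioning on T (a simulation sketch in Python, not from the notes): for a Bernoulli(θ) sample, the crude estimator U = X1 is unbiased for θ, T = Σ X_i is complete and sufficient, and conditioning U on T gives the sample proportion T/n. Both estimators are unbiased, but the conditioned one has far smaller variance.

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, reps = 0.3, 20, 100_000

    x = rng.binomial(1, theta, size=(reps, n))
    u = x[:, 0]             # crude unbiased estimator U = X1
    rb = x.mean(axis=1)     # conditioned estimator E(U | T) = T/n

    print("mean of U:     ", u.mean(), "  variance:", u.var())
    print("mean of E(U|T):", rb.mean(), "  variance:", rb.var())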

1.6.23 Problem

Let X1, . . . ,Xn be a random sample from the EXP(β,μ) distribution whereβ is known. Find the U.M.V.U.E. of τ (μ) = P (X1 > c;μ) where c ∈ < isa known constant. Hint: Let U = U(X1) = 1 if X1 ≥ c and 0 otherwise.

1.6.24 Problem

Let X1, . . . , Xn be a random sample from the DU(θ) distribution. Show that T = X(n) is a complete sufficient statistic for this model. Find the U.M.V.U.E. of θ.

1.7 The Exponential Family

1.7.1 Definition

Suppose X = (X1, . . . , Xp) has a (joint) probability (density) function of the form

f(x; θ) = C(θ) exp[ Σ_{j=1}^{k} q_j(θ) T_j(x) ] h(x)     (1.3)

for functions q_j(θ), T_j(x), h(x), C(θ). Then we say that f(x; θ) is a member of the exponential family of densities. We call (T1(X), . . . , Tk(X)) the natural sufficient statistic.

It should be noted that the natural sufficient statistic is not unique. Multiplication of T_j by a constant and division of q_j by the same constant results in the same function f(x; θ). More generally linear transformations of the T_j and the q_j can also be used.

1.7.2 Example

Prove that T (X) = (T1(X), . . . , Tk(X)) is a sufficient statistic for the modelf (x; θ) ; θ ∈ Ω where f (x; θ) has the form (1.3).

1.7.3 Example

Show that the BIN(n, θ) distribution has an exponential family distributionand find the natural sufficient statistic.

One of the important properties of the exponential family is its closureunder repeated independent sampling.

1.7.4 Theorem

Let X1, . . . , Xn be a random sample from the distribution with probability (density) function given by (1.3). Then (X1, . . . , Xn) also has an exponential family form, with joint probability (density) function

f(x1, . . . , xn; θ) = [C(θ)]^n exp{ Σ_{j=1}^{k} q_j(θ) [ Σ_{i=1}^{n} T_j(x_i) ] } Π_{i=1}^{n} h(x_i).

In other words, C is replaced by C^n and T_j(x) by Σ_{i=1}^{n} T_j(x_i). The natural sufficient statistic is

( Σ_{i=1}^{n} T_1(X_i), . . . , Σ_{i=1}^{n} T_k(X_i) ).

1.7.5 Example

Let X1, . . . ,Xn be a random sample from the POI(θ) distribution. Showthat X1, . . . ,Xn is a member of the exponential family.

1.7.6 Canonical Form of the Exponential Family

It is usual to reparameterize equation (1.3) by replacing q_j(θ) by a new parameter η_j. This results in the canonical form of the exponential family

f(x; η) = C(η) exp[ Σ_{j=1}^{k} η_j T_j(x) ] h(x).

The natural parameter space in this form is the set of all values of η for which the above function is integrable; that is,

{ η ; ∫_{−∞}^{∞} f(x; η) dx < ∞ }.

If X is discrete the integral is replaced by the sum over all x such that f(x; η) > 0.

If the statistic satisfies a linear constraint, for example,

P( Σ_{j=1}^{k} T_j(X) = 0; η ) = 1,

then the number of terms k can be reduced. Unless this is done, the parameters η_j are not all statistically meaningful. For example the data may permit us to estimate η_1 + η_2 but not allow estimation of η_1 and η_2 individually. In this case we call the parameter “unidentifiable”. We will need to assume that the exponential family representation is minimal in the sense that neither the η_j nor the T_j satisfy any linear constraints.
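As a small added illustration of the canonical form (not one of the examples in these notes), consider a single observation X from the EXP(θ) distribution with density f(x; θ) = (1/θ)e^{−x/θ}, x > 0. In the notation above,

    f(x;\theta) = \tfrac{1}{\theta}\, e^{-x/\theta}
                = C(\eta)\,\exp[\eta\, T(x)]\, h(x),
    \qquad \eta = -\tfrac{1}{\theta},\quad T(x) = x,\quad C(\eta) = -\eta,\quad h(x) = 1,

and the natural parameter space is {η; η < 0}, the set of η for which ∫_0^∞ e^{ηx} dx < ∞. It contains a one-dimensional rectangle and T and η satisfy no linear constraints, so this is a regular exponential family (in the sense of the next definition) with natural sufficient statistic T(X) = X.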

1.7. THE EXPONENTIAL FAMILY 25

1.7.7 Definition

We will say that X has a regular exponential family distribution if it is incanonical form, is of full rank in the sense that neither the Tj nor the ηjsatisfy any linear constraints, and the natural parameter space contains ak−dimensional rectangle. By Theorem 1.7.4 if Xi has a regular exponentialfamily distribution then X = (X1, . . . ,Xn) also has a regular exponentialfamily distribution.

1.7.8 Example

Show that X ∼ BIN(n, θ) has a regular exponential family distribution.

1.7.9 Theorem

If X has a regular exponential family distribution with natural sufficient statistic T(X) = (T1(X), . . . , Tk(X)) then T(X) is a complete sufficient statistic. Reference: Lehmann and Romano (2005), Testing Statistical Hypotheses (3rd edition), pp. 116-117.

1.7.10 Differentiating under the Integral

In Chapter 2, it will be important to know if a family of models has theproperty that differentiation under the integral is possible. We state thatfor a regular exponential family, it is possible to differentiate under theintegral, that is,

∂^m/∂η_i^m ∫ C(η) exp[ Σ_{j=1}^{k} η_j T_j(x) ] h(x) dx = ∫ ∂^m/∂η_i^m { C(η) exp[ Σ_{j=1}^{k} η_j T_j(x) ] } h(x) dx

for any m = 1, 2, . . . and any η in the interior of the natural parameter space.

1.7.11 Example

Let X1, . . . ,Xn be a random sample from the N(μ,σ2) distribution. Finda complete sufficient statistic for this model. Find the U.M.V.U.E.’s of μand σ2.

1.7.12 Example

Show that X ∼ N(θ, θ^2) does not have a regular exponential family distribution.

1.7.13 Example

Suppose (X1,X2,X3) have joint density

f(x1, x2, x3; θ1, θ2, θ3) = P(X1 = x1, X2 = x2, X3 = x3; θ1, θ2, θ3)
= [n!/(x1! x2! x3!)] θ1^{x1} θ2^{x2} θ3^{x3},

x_i = 0, 1, . . . ; i = 1, 2, 3; x1 + x2 + x3 = n,
0 < θ_i < 1; i = 1, 2, 3; θ1 + θ2 + θ3 = 1.

Find the U.M.V.U.E. of θ1, θ2, and θ1θ2.

Since

f(x1, x2, x3; θ1, θ2, θ3) = exp[ Σ_{j=1}^{3} q_j(θ1, θ2, θ3) T_j(x1, x2, x3) ] h(x1, x2, x3)

where

q_j(θ1, θ2, θ3) = log θ_j,   T_j(x1, x2, x3) = x_j,   j = 1, 2, 3   and   h(x1, x2, x3) = n!/(x1! x2! x3!),

(X1, X2, X3) is a member of the exponential family. But

Σ_{j=1}^{3} T_j(x1, x2, x3) = n   and   θ1 + θ2 + θ3 = 1

and thus (X1, X2, X3) is not a member of the regular exponential family. However by substituting X3 = n − X1 − X2 and θ3 = 1 − θ1 − θ2 we can show that (X1, X2) has a regular exponential family distribution. Let

η1 = log[θ1/(1 − θ1 − θ2)],   η2 = log[θ2/(1 − θ1 − θ2)]

then

θ1 = e^{η1}/(1 + e^{η1} + e^{η2}),   θ2 = e^{η2}/(1 + e^{η1} + e^{η2}).

Let

T1(x1, x2) = x1,   T2(x1, x2) = x2,
C(η1, η2) = [1/(1 + e^{η1} + e^{η2})]^n,   and   h(x1, x2) = n!/[x1! x2! (n − x1 − x2)!].

In canonical form (X1, X2) has p.f.

f(x1, x2; η1, η2) = C(η1, η2) exp[η1 T1(x1, x2) + η2 T2(x1, x2)] h(x1, x2)

with natural parameter space {(η1, η2); η1 ∈ ℝ, η2 ∈ ℝ} which contains a two-dimensional rectangle. The η_j's and the T_j's do not satisfy any linear constraints. Therefore (X1, X2) has a regular exponential family distribution with natural sufficient statistic T(X1, X2) = (X1, X2) and thus T(X1, X2) is a complete sufficient statistic.

By the properties of the multinomial distribution (see Section 5.2.2) we have X1 ∼ BIN(n, θ1), X2 ∼ BIN(n, θ2) and Cov(X1, X2) = −nθ1θ2. Since

E(X1/n; θ1, θ2) = nθ1/n = θ1   and   E(X2/n; θ1, θ2) = nθ2/n = θ2,

then by the Lehmann-Scheffe Theorem X1/n is the U.M.V.U.E. of θ1 and X2/n is the U.M.V.U.E. of θ2. Since

−nθ1θ2 = Cov(X1, X2; θ1, θ2)
       = E(X1X2; θ1, θ2) − E(X1; θ1, θ2) E(X2; θ1, θ2)
       = E(X1X2; θ1, θ2) − n^2 θ1θ2

or

E( X1X2/[n(n − 1)]; θ1, θ2 ) = θ1θ2,

then by the Lehmann-Scheffe Theorem X1X2/[n(n − 1)] is the U.M.V.U.E. of θ1θ2.

1.7.14 Example

Let X1, . . . ,Xn be a random sample from the POI(θ) distribution. Find theU.M.V.U.E. of τ(θ) = e−θ. Show that the U.M.V.U.E. is also a consistentestimator of τ(θ).

Since (X1, . . . , Xn) is a member of the regular exponential family with natural sufficient statistic T = Σ_{i=1}^{n} X_i, T is a complete sufficient statistic. Consider the random variable U(X1) = 1 if X1 = 0 and 0 otherwise. Then

E[U(X1); θ] = 1 · P(X1 = 0; θ) = e^{−θ},   θ > 0

and U(X1) is an unbiased estimator of τ(θ) = e^{−θ}. Therefore by the Rao-Blackwell Theorem E(U|T) is the U.M.V.U.E. of τ(θ) = e^{−θ}.

Since X1, . . . , Xn is a random sample from the POI(θ) distribution,

X1 ∼ POI(θ),   T = Σ_{i=1}^{n} X_i ∼ POI(nθ)   and   Σ_{i=2}^{n} X_i ∼ POI((n − 1)θ).

Thus

E(U|T = t) = 1 · P(X1 = 0 | T = t)
= P(X1 = 0, Σ_{i=1}^{n} X_i = t; θ) / P(T = t; θ)
= P(X1 = 0, Σ_{i=2}^{n} X_i = t − 0; θ) / P(T = t; θ)
= e^{−θ} · { [(n − 1)θ]^t e^{−(n−1)θ} / t! } ÷ { (nθ)^t e^{−nθ} / t! }
= (1 − 1/n)^t,   t = 0, 1, . . .

Therefore E(U|T) = (1 − 1/n)^T is the U.M.V.U.E. of τ(θ) = e^{−θ}.

Since X1, . . . , Xn is a random sample from the POI(θ) distribution then by the W.L.L.N. X̄ →p θ and by the Limit Theorems (see Section 5.3)

E(U|T) = (1 − 1/n)^T = [(1 − 1/n)^n]^{X̄} →p e^{−θ}

and therefore E(U|T) is a consistent estimator of e^{−θ}.
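A quick numerical check of this result (a Python sketch, not part of the notes): simulate many Poisson samples and compare the average of (1 − 1/n)^T with e^{−θ}.

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n, reps = 2.0, 10, 200_000

    t = rng.poisson(theta, size=(reps, n)).sum(axis=1)   # T = sum of the sample
    est = (1 - 1 / n) ** t                               # the U.M.V.U.E. (1 - 1/n)^T

    print("average of estimator:", est.mean())
    print("target exp(-theta):  ", np.exp(-theta))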

1.7.15 Example

Let X1, . . . ,Xn be a random sample from the N(θ, 1) distribution. Find theU.M.V.U.E. of τ(θ) = Φ(c − θ) = P (Xi ≤ c; θ) for some constant c whereΦ is the standard normal cumulative distribution function. Show that theU.M.V.U.E. is also a consistent estimator of τ(θ).

Since (X1, . . . , Xn) is a member of the regular exponential family with natural sufficient statistic T = Σ_{i=1}^{n} X_i, T is a complete sufficient statistic. Consider the random variable U(X1) = 1 if X1 ≤ c and 0 otherwise. Then

E[U(X1); θ] = 1 · P(X1 ≤ c; θ) = Φ(c − θ),   θ ∈ ℝ

and U(X1) is an unbiased estimator of τ(θ) = Φ(c − θ). Therefore by the Rao-Blackwell Theorem E(U|T) is the U.M.V.U.E. of τ(θ) = Φ(c − θ).

Since X1, . . . , Xn is a random sample from the N(θ, 1) distribution,

X1 ∼ N(θ, 1),   T = Σ_{i=1}^{n} X_i ∼ N(nθ, n)   and   Σ_{i=2}^{n} X_i ∼ N((n − 1)θ, n − 1).

The conditional p.d.f. of X1 given T = t is

f(x1 | T = t)
= (1/√(2π)) exp[−(x1 − θ)^2/2] × (1/√(2π(n − 1))) exp{ −[t − x1 − (n − 1)θ]^2 / [2(n − 1)] } ÷ (1/√(2πn)) exp{ −(t − nθ)^2/(2n) }
= (1/√(2π(1 − 1/n))) exp{ −(1/2) [ x1^2 + (t − x1)^2/(n − 1) − t^2/n ] }
= (1/√(2π(1 − 1/n))) exp{ −(1/[2(1 − 1/n)]) (x1 − t/n)^2 }

which is the p.d.f. of a N(t/n, 1 − 1/n) random variable. Since X1|T = t has a N(t/n, 1 − 1/n) distribution,

E(U|T) = 1 · P(X1 ≤ c | T) = Φ( (c − T/n) / √(1 − 1/n) )

is the U.M.V.U.E. of τ(θ) = Φ(c − θ).

Since X1, . . . , Xn is a random sample from the N(θ, 1) distribution then by the W.L.L.N. X̄ →p θ and by the Limit Theorems

E(U|T) = Φ( (c − T/n) / √(1 − 1/n) ) = Φ( (c − X̄) / √(1 − 1/n) ) →p Φ(c − θ)

and therefore E(U|T) is a consistent estimator of τ(θ) = Φ(c − θ).

1.7.16 Problem

Let X1, . . . , Xn be a random sample from the distribution with probability density function

f(x; θ) = θ x^{θ−1},   0 < x < 1, θ > 0.

Show that the geometric mean of the sample, (Π_{i=1}^{n} X_i)^{1/n}, is a complete sufficient statistic and find the U.M.V.U.E. of θ.

Hint: − logXi ∼ EXP(1/θ).

1.7.17 Problem

Let X1, . . . , Xn be a random sample from the EXP(β, μ) distribution where μ is known. Show that T = Σ_{i=1}^{n} X_i is a complete sufficient statistic. Find the U.M.V.U.E. of β^2.

1.7.18 Problem

Let X1, . . . , Xn be a random sample from the GAM(α, β) distribution and θ = (α, β). Find the U.M.V.U.E. of τ(θ) = αβ.

1.7.19 Problem

Let X ∼ NB(k, θ). Find the U.M.V.U.E. of θ.
Hint: Find E[(X + k − 1)^{−1}; θ].

1.7.20 Problem

Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution. Find the U.M.V.U.E. of τ(θ) = θ^2.

1.7.21 Problem

Let X1, . . . , Xn be a random sample from the N(0, θ) distribution. Find the U.M.V.U.E. of τ(θ) = θ^2.

1.7.22 Problem

Let X1, . . . , Xn be a random sample from the POI(θ) distribution. Find the U.M.V.U.E. for τ(θ) = (1 + θ)e^{−θ}.

Hint: Find P(X1 ≤ 1; θ).


Member of the REF                      Complete Sufficient Statistic
POI(θ)                                 Σ_{i=1}^{n} X_i
BIN(n, θ)                              Σ_{i=1}^{n} X_i
NB(k, θ)                               Σ_{i=1}^{n} X_i
N(μ, σ^2), σ^2 known                   Σ_{i=1}^{n} X_i
N(μ, σ^2), μ known                     Σ_{i=1}^{n} (X_i − μ)^2
N(μ, σ^2)                              (Σ_{i=1}^{n} X_i, Σ_{i=1}^{n} X_i^2)
GAM(α, β), α known                     Σ_{i=1}^{n} X_i
GAM(α, β), β known                     Π_{i=1}^{n} X_i
GAM(α, β)                              (Σ_{i=1}^{n} X_i, Π_{i=1}^{n} X_i)
EXP(β, μ), μ known                     Σ_{i=1}^{n} X_i

Not a Member of the REF                Complete Sufficient Statistic
UNIF(0, θ)                             X(n)
UNIF(a, b)                             (X(1), X(n))
EXP(β, μ), β known                     X(1)
EXP(β, μ)                              (X(1), Σ_{i=1}^{n} X_i)


1.7.23 Problem

Let X1, . . . , Xn be a random sample from the POI(θ) distribution. Find the U.M.V.U.E. for τ(θ) = e^{−2θ}. Hint: Find E[(−1)^{X1}; θ]. Show that this estimator has some undesirable properties when n = 1 and n = 2 but when n is large, it is approximately equal to the maximum likelihood estimator.

1.7.24 Problem

Let X1, . . . , Xn be a random sample from the GAM(2, θ) distribution. Find the U.M.V.U.E. of τ1(θ) = 1/θ and the U.M.V.U.E. of τ2(θ) = P(X1 > c; θ) where c > 0 is a constant.

1.7.25 Problem

In Problem 1.5.11 show that β̂ is the U.M.V.U.E. of β and S_e^2 is the U.M.V.U.E. of σ^2.

1.7.26 Problem

A Brownian Motion process is a continuous-time stochastic process X(t) which is often used to describe the value of an asset. Assume X(t) represents the market price of a given asset such as a portfolio of stocks at time t and x0 is the value of the portfolio at the beginning of a given time period (assume that the analysis is conditional on x0 so that x0 is fixed and known). The distribution of X(t) for any fixed time t is assumed to be N(x0 + μt, σ^2 t) for 0 < t ≤ 1. The parameter μ is the drift of the Brownian motion process and the parameter σ is the diffusion coefficient. Assume that t = 1 corresponds to the end of the time period so X(1) is the closing price.

Suppose that we record both the period high max_{0≤t≤1} X(t) and the close X(1). Define random variables

M = max_{0≤t≤1} X(t) − x0

and

Y = X(1) − x0.

The joint probability density function of (M,Y ) can be shown to be

f(m, y; μ, σ^2) = [2(2m − y)/(√(2π) σ^3)] exp{ (1/(2σ^2)) [2μy − μ^2 − (2m − y)^2] },

m > 0, −∞ < y < m, μ ∈ ℝ and σ^2 > 0.

(a) Show that (M,Y ) has a regular exponential family distribution.

(b) Let Z = M(M − Y). Show that Y ∼ N(μ, σ^2) and Z ∼ EXP(σ^2/2) independently.

(c) Suppose we record independent pairs of observations (Mi, Yi), i = 1, . . . , n on the portfolio for a total of n distinct time periods. Find the U.M.V.U.E.’s of μ and σ^2.

(d) Show that the estimators

V1 = (1/(n − 1)) Σ_{i=1}^{n} (Y_i − Ȳ)^2

and

V2 = (2/n) Σ_{i=1}^{n} Z_i = (2/n) Σ_{i=1}^{n} M_i(M_i − Y_i)

are also unbiased estimators of σ^2. How do we know that neither of these estimators is the U.M.V.U.E. of σ^2? Show that the U.M.V.U.E. of σ^2 can be written as a weighted average of V1 and V2. Compare the variances of all three estimators.

(e) An up-and-out call option on the portfolio is an option with exercise price ξ (a constant) which pays a total of (X(1) − ξ) dollars at the end of one period provided that this quantity is positive and provided that X(t) never exceeded the value of a barrier throughout this period of time, that is, provided that M < a. Thus the option pays

g(M, Y) = max{Y − (ξ − x0), 0} if M < a

and otherwise g(M, Y) = 0. Find the expected value of such an option, that is, find the expected value of g(M, Y).

1.8 Ancillarity

Let X = (X1, . . . , Xn) denote observations from a distribution with probability (density) function {f(x; θ); θ ∈ Ω} and let U(X) be a statistic. The information on the parameter θ is provided by the sensitivity of the distribution of a statistic to changes in the parameter. For example, suppose a modest change in the parameter value leads to a large change in the expected value of the distribution resulting in a large shift in the data. Then the parameter can be estimated fairly precisely. On the other hand, if a statistic U has no sensitivity at all in distribution to the parameter, then it would appear to contain little information for point estimation of this parameter. A statistic of the second kind is called an ancillary statistic.

1.8.1 Definition

U(X) is an ancillary statistic if its distribution does not depend on the unknown parameter θ.

Ancillary statistics are, in a sense, orthogonal or perpendicular to minimal sufficient statistics. Ancillary statistics are analogous to the residuals in a multiple regression, while the complete sufficient statistics are analogous to the estimators of the regression coefficients. It is well-known that the residuals are uncorrelated with the estimators of the regression coefficients (and independent in the case of normal errors). However, the “irrelevance” of the ancillary statistic seems to be limited to the case when it is not part of the minimal (preferably complete) sufficient statistic as the following example illustrates.

1.8.2 Example

Suppose a fair coin is tossed to determine a random variable N = 1 with probability 1/2 and N = 100 otherwise. We then observe a Binomial random variable X with parameters (N, θ). Show that the minimal sufficient statistic is (X, N) but that N is an ancillary statistic. Is N irrelevant to inference about θ?

In this example it seems reasonable to condition on an ancillary component of the minimal sufficient statistic. Conducting inference conditionally on the ancillary statistic essentially means treating the observed number of trials as if it had been fixed in advance instead of the result of the toss of a fair coin. This example also illustrates the use of the following principle:

1.8.3 The Conditionality Principle

Suppose the minimal sufficient statistic can be written in the form T = (U, A) where A is an ancillary statistic. Then all inference should be conducted using the conditional distribution of the data given the value of the ancillary statistic, that is, using the distribution of X|A.

Some difficulties arise from the application of this principle since there is no general method for constructing the ancillary statistic and ancillary statistics are not necessarily unique.

The following theorem allows us to use the properties of completenessand ancillarity to prove the independence of two statistics without findingtheir joint distribution.

1.8.4 Basu’s Theorem

Consider X with probability (density) function {f(x; θ); θ ∈ Ω}. Let T(X) be a complete sufficient statistic. Then T(X) is independent of every ancillary statistic U(X).

1.8.5 Proof

We need to show

P[U(X) ∈ B, T(X) ∈ C; θ] = P[U(X) ∈ B; θ] · P[T(X) ∈ C; θ]

for all sets B, C and all θ ∈ Ω.

Let

g(t) = P[U(X) ∈ B | T(X) = t] − P[U(X) ∈ B]

for all t ∈ A where P(T ∈ A; θ) = 1. By sufficiency, P[U(X) ∈ B | T(X) = t] does not depend on θ, and by ancillarity, P[U(X) ∈ B] also does not depend on θ. Therefore g(T) is a statistic.

Let

I{U(X) ∈ B} = 1 if U(X) ∈ B and 0 otherwise.

Then

E[I{U(X) ∈ B}] = P[U(X) ∈ B],
E[I{U(X) ∈ B} | T = t] = P[U(X) ∈ B | T = t],

and

g(t) = E[I{U(X) ∈ B} | T(X) = t] − E[I{U(X) ∈ B}].

This gives

E[g(T)] = E[ E[I{U(X) ∈ B} | T] ] − E[I{U(X) ∈ B}]
        = E[I{U(X) ∈ B}] − E[I{U(X) ∈ B}]
        = 0   for all θ ∈ Ω,

and since T is complete this implies P[g(T) = 0; θ] = 1 for all θ ∈ Ω. Therefore

P[U(X) ∈ B | T(X) = t] = P[U(X) ∈ B] for all t ∈ A and all B.     (1.4)

Suppose T has probability density function h(t; θ). Then

P[U(X) ∈ B, T(X) ∈ C; θ] = ∫_C P[U(X) ∈ B | T = t] h(t; θ) dt
= ∫_C P[U(X) ∈ B] h(t; θ) dt   by (1.4)
= P[U(X) ∈ B] · ∫_C h(t; θ) dt
= P[U(X) ∈ B] · P[T(X) ∈ C; θ],

true for all sets B, C and all θ ∈ Ω, as required. ∎

1.8.6 Example

Let X1, . . . , Xn be a random sample from the EXP(θ) distribution. Show that T(X1, . . . , Xn) = Σ_{i=1}^{n} X_i and U(X1, . . . , Xn) = (X1/T, . . . , Xn/T) are independent random variables. Find E(X1/T).
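A simulation sketch (Python, not from the notes) that illustrates Basu's Theorem in this example: the empirical distribution of X1/T does not change when θ changes, its correlation with T is near zero (a check of uncorrelatedness only, not full independence), and its sample mean can be compared with 1/n.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 5, 100_000

    def ratio_and_total(theta):
        x = rng.exponential(theta, size=(reps, n))   # EXP(theta) sample, scale theta
        t = x.sum(axis=1)
        return x[:, 0] / t, t

    r1, t1 = ratio_and_total(1.0)
    r2, t2 = ratio_and_total(10.0)

    # The distribution of X1/T does not change with theta ...
    print("mean, sd of X1/T at theta=1: ", r1.mean(), r1.std())
    print("mean, sd of X1/T at theta=10:", r2.mean(), r2.std())
    # ... and it is essentially uncorrelated with the complete sufficient statistic T.
    print("correlation of X1/T with T:  ", np.corrcoef(r1, t1)[0, 1])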

1.8.7 Example

Let X1, . . . , Xn be a random sample from the N(μ, σ^2) distribution. Prove that X̄ and S^2 are independent random variables.

1.8.8 Problem

Let X1, . . . ,Xn be a random sample from the distribution with p.d.f.

f(x; β) = 2x/β^2,   0 < x ≤ β.

(a) Show that β is a scale parameter for this model.

(b) Show that T = T(X1, . . . , Xn) = X(n) is a complete sufficient statistic for this model.

(c) Find the U.M.V.U.E. of β.

(d) Show that T and U = U (X) = X1/T are independent random variables.

(e) Find E (X1/T ).


1.8.9 Problem

Let X1, . . . ,Xn be a random sample from the GAM(α,β) distribution.

(a) Show that β is a scale parameter for this model.

(b) Suppose α is known. Show that T = T(X1, . . . , Xn) = Σ_{i=1}^{n} X_i is a complete sufficient statistic for the model.

(c) Show that T and U = U(X1, . . . , Xn) = (X1/T, . . . , Xn/T) are independent random variables.

(d) Find E(X1/T ).

1.8.10 Problem

In Problem 1.5.11 show that β̂ and S_e^2 are independent random variables.

1.8.11 Problem

Let X1, . . . ,Xn be a random sample from the EXP(β,μ) distribution.

(a) Suppose β is known. Show that T1 = X(1) is a complete sufficientstatistic for the model.

(b) Show that T1 and T2 = Σ_{i=1}^{n} (X_i − X(1)) are independent random variables.

(c) Find the p.d.f. of T2. Hint: Show

Σ_{i=1}^{n} (X_i − μ) = n(T1 − μ) + T2.

(d) Show that (T1, T2) is a complete sufficient statistic for the model {f(x1, . . . , xn; μ, β); μ ∈ ℝ, β > 0}.

(e) Find the U.M.V.U.E.’s of β and μ.

1.8.12 Problem

Let X1, . . . ,Xn be a random sample from the distribution with p.d.f.

f(x; α, β) = α x^{α−1} / β^α,   α > 0, 0 < x ≤ β.

(a) Show that if α is known then T1 = X(n) is a complete sufficient statistic for the model.

(b) Show that T1 and T2 = Π_{i=1}^{n} (X_i/T1) are independent random variables.

(c) Find the p.d.f. of T2. Hint: Show

Σ_{i=1}^{n} log(X_i/β) = log T2 + n log(T1/β).

(d) Show that (T1, T2) is a complete sufficient statistic for the model.

(e) Find the U.M.V.U.E. of α.

Chapter 2

Maximum Likelihood Estimation

2.1 Maximum Likelihood Method - One Parameter

Suppose we have collected the data x (possibly a vector) and we believe thatthese data are observations from a distribution with probability function

P (X = x; θ) = f(x; θ)

where the scalar parameter θ is unknown and θ ∈ Ω. The probability ofobserving the data x is equal to f(x; θ). When the observed value of x issubstituted into f(x; θ), then f(x; θ) is a function of the parameter θ only.In the absence of any other information, it seems logical that we shouldestimate the parameter θ using a value most compatible with the data. Forexample we might choose the value of θ which maximizes the probabilityof the observed data.

2.1.1 Definition

Suppose X is a random variable with probability functionP (X = x; θ) = f(x; θ), where θ ∈ Ω is a scalar and suppose x is the ob-served data. The likelihood function for θ is

L(θ) = P (observing the data x; θ)

= P (X = x; θ)

= f(x; θ), θ ∈ Ω.

If X = (X1, . . . , Xn) is a random sample from the probability function P(X = x; θ) = f(x; θ) and x = (x1, . . . , xn) are the observed data then the likelihood function for θ is

L(θ) = P (observing the data x; θ)

= P (X1 = x1, . . . ,Xn = xn; θ)

= Π_{i=1}^{n} f(x_i; θ),   θ ∈ Ω.

The value of θ which maximizes the likelihood L(θ) also maximizes the logarithm of the likelihood function. (Why?) Since it is easier to find the derivative of the sum of n terms rather than the product, we usually determine the maximum of the logarithm of the likelihood function.

2.1.2 Definition

The log likelihood function is defined as

l(θ) = logL(θ), θ ∈ Ω

where log is the natural logarithmic function.

2.1.3 Definition

The value of θ that maximizes the likelihood function L(θ) or equivalently the log likelihood function l(θ) is called the maximum likelihood (M.L.) estimate. The M.L. estimate is a function of the data x and we write θ̂ = θ̂(x). The corresponding M.L. estimator is denoted θ̂ = θ̂(X).

2.1.4 Example

Suppose in a sequence of n Bernoulli trials with P (Success) = θ we haveobserved x successes. Find the likelihood function L (θ), the log likelihoodfunction l (θ), the M.L. estimate of θ and the M.L. estimator of θ.

2.1.5 Example

Suppose we have collected data x1, . . . , xn and we believe these observations are independent observations from a POI(θ) distribution. Find the likelihood function, the log likelihood function, the M.L. estimate of θ and the M.L. estimator of θ.
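For this example the maximization can also be done numerically, which is a useful check on the closed-form answer. A minimal sketch (Python with numpy and scipy; the data values below are made up purely for illustration):

    import numpy as np
    from math import lgamma
    from scipy.optimize import minimize_scalar

    x = np.array([3, 1, 4, 2, 0, 2, 5, 1])      # hypothetical observed counts

    def neg_log_lik(theta):
        # -l(theta) = -sum[ x_i log(theta) - theta - log(x_i!) ]
        return -(np.sum(x * np.log(theta) - theta) - sum(lgamma(xi + 1) for xi in x))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
    print("numerical M.L. estimate:", res.x)
    print("sample mean x-bar:      ", x.mean())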


2.1.6 Problem

Suppose we have collected data x1, . . . , xn and we believe these observa-tions are independent observations from the DU(θ) distribution. Find thelikelihood function, the M.L. estimate of θ and the M.L. estimator of θ.

2.1.7 Definition

The score function is defined as

S(θ) = (d/dθ) l(θ) = (d/dθ) log L(θ),   θ ∈ Ω.

2.1.8 Definition

The information function is defined as

I(θ) = −(d^2/dθ^2) l(θ) = −(d^2/dθ^2) log L(θ),   θ ∈ Ω.

I(θ̂) is called the observed information.

In Section 2.7 we will see how the observed information I(θ̂) can be used to construct approximate confidence intervals for the unknown parameter θ. I(θ) also tells us about the concavity of the log likelihood function.

Suppose in Example 2.1.5 the M.L. estimate of θ was θ̂ = 2. If n = 10 then I(θ̂) = 10/2 = 5. If n = 25 then I(θ̂) = 25/2 = 12.5. See Figure 2.1. The log likelihood function is more concave down for n = 25 than for n = 10 which reflects the fact that as the number of observations increases we have more “information” about the unknown parameter θ.

2.1.9 Finding M.L. Estimates

If X1, . . . , Xn is a random sample from a distribution whose support set does not depend on θ then we usually find θ̂ by solving S(θ) = 0. It is important to verify that θ̂ is the value of θ which maximizes L(θ) or equivalently l(θ). This can be done using the First Derivative Test. Note that the condition I(θ̂) > 0 only checks for a local maximum.

[Figure 2.1: Poisson Log Likelihoods for n = 10 and n = 25, plotted against θ.]

Although we view the likelihood, log likelihood, score and information functions as functions of θ they are, of course, also functions of the observed data x. When it is important to emphasize the dependence on the data x we will write L(θ; x), S(θ; x), etc. Also when we wish to determine the sampling properties of these functions as functions of the random variable X we will write L(θ; X), S(θ; X), etc.

2.1.10 Definition

If θ is a scalar then the expected or Fisher information (function) is given by

J(θ) = E[I(θ; X); θ] = E[−(∂²/∂θ²) l(θ; X); θ], θ ∈ Ω.

Note:

If X1, . . . , Xn is a random sample from f(x; θ) then

J(θ) = E[−(∂²/∂θ²) l(θ; X); θ] = nE[−(∂²/∂θ²) log f(X; θ); θ]

where X has probability function f(x; θ).


2.1.11 Example

Find the Fisher information based on a random sample X1, . . . , Xn from the POI(θ) distribution and compare it to the variance of the M.L. estimator θ̂. How does the Fisher information change as n increases?

The Poisson model is used to model the number of events occurring in time or space. Suppose it is not possible to observe the number of events but only whether or not one or more events have occurred. In other words it is only possible to observe the outcomes "X = 0" and "X > 0". Let Y be the number of times the outcome "X = 0" is observed in a sample of size n. Find the M.L. estimator of θ for these data. Compare the Fisher information for these data with the Fisher information based on (X1, . . . , Xn). See Figure 2.2.

[Figure 2.2: Ratio of Fisher Information Functions, plotted against θ for 0 ≤ θ ≤ 8.]
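The comparison suggested by Figure 2.2 can be sketched numerically. The formula used below is my own calculation, not taken from the notes: with full Poisson data the Fisher information per sample is n/θ, while recording only whether X = 0 (a Bernoulli variable with success probability e^(−θ)) gives n e^(−θ)/(1 − e^(−θ)), so the ratio plotted would be θe^(−θ)/(1 − e^(−θ)).

    # A hedged sketch of the information ratio suggested by Figure 2.2.
    # Assumption (my derivation): ratio = theta*exp(-theta) / (1 - exp(-theta)).
    import numpy as np

    def info_ratio(theta):
        return theta * np.exp(-theta) / (1.0 - np.exp(-theta))

    for theta in (0.1, 0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"theta = {theta:4.1f}   ratio = {info_ratio(theta):.4f}")

The ratio is near 1 for small θ and falls toward 0 as θ grows, matching the shape of the figure.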

2.1.12 Problem

Suppose X ∼ BIN(n, θ) and we observe X. Find θ̂, the M.L. estimator of θ, the score function, the information function and the Fisher information. Compare the Fisher information with the variance of θ̂.


2.1.13 Problem

Suppose X ∼ NB(k, θ) and we observe X. Find the M.L. estimator of θ, the score function and the Fisher information.

2.1.14 Problem - Randomized Sampling

A professor is interested in estimating the unknown quantity θ which is the proportion of students who cheat on tests. She conducts an experiment in which each student is asked to toss a coin secretly. If the coin comes up a head the student is asked to toss the coin again and answer "Yes" if the second toss is a head and "No" if the second toss is a tail. If the first toss of the coin comes up a tail, the student is asked to answer "Yes" or "No" to the question: Have you ever cheated on a University test? Students are assumed to answer more honestly in this type of randomized response survey because it is not known to the questioner whether the answer "Yes" is a result of tossing the coin twice and obtaining two heads or because the student obtained a tail on the toss of the coin and then answered "Yes" to the question about cheating.

(a) Find the probability that x students answer "Yes" in a class of n students.

(b) Find the M.L. estimator of θ based on X students answering "Yes" in a class of n students. Be sure to verify that your answer corresponds to a maximum.

(c) Find the Fisher information for θ.

(d) In a simpler experiment n students could be asked to answer "Yes" or "No" to the question: Have you ever cheated on a University test? If we could assume that they answered the question honestly then we would expect to obtain more information about θ from this simpler experiment. Determine the amount of information lost in doing the randomized response experiment as compared to the simpler experiment.

2.1.15 Problem

Suppose (X1, X2) ∼ MULT(n, θ², 2θ(1 − θ)). Find the M.L. estimator of θ, the score function and the Fisher information.

2.1.16 Likelihood Functions for Continuous Models

Suppose X is a continuous random variable with probability density function f(x; θ). We will often observe only the value of X rounded to some degree of precision (say one decimal place), in which case the actual observation is a discrete random variable. For example, suppose we observe X correct to one decimal place. Then

P(we observe 1.1) = ∫_{1.05}^{1.15} f(x; θ) dx ≈ (1.15 − 1.05) · f(1.1; θ)

assuming the function f(x; θ) is quite smooth over the interval. More generally, if we observe X rounded to the nearest ∆ (assumed small) then the likelihood of the observation is approximately ∆f(observation; θ). Since the precision ∆ of the observation does not depend on the parameter, maximizing the discrete likelihood of the observation is essentially equivalent to maximizing the probability density function f(observation; θ) over the parameter.

Therefore if X = (X1, . . . , Xn) is a random sample from the probability density function f(x; θ) and x = (x1, . . . , xn) are the observed data then we define the likelihood function for θ as

L(θ) = L(θ; x) = ∏_{i=1}^n f(xi; θ), θ ∈ Ω.

See also Problem 2.8.12.

2.1.17 Example

Suppose X1, . . . , Xn is a random sample from the distribution with probability density function

f(x; θ) = θx^(θ−1), 0 ≤ x ≤ 1, θ > 0.

Find the score function, the M.L. estimator, and the information function of θ. Find the observed information. Find the mean and variance of θ̂. Compare the Fisher information and the variance of θ̂.

2.1.18 Example

Suppose X1, . . . , Xn is a random sample from the UNIF(0, θ) distribution. Find the M.L. estimator of θ.

2.1.19 Problem

Suppose X1, . . . , Xn is a random sample from the UNIF(θ, θ + 1) distribution. Show the M.L. estimator of θ is not unique.


2.1.20 Problem

Suppose X1, . . . , Xn is a random sample from the DE(1, θ) distribution. Find the M.L. estimator of θ.

2.1.21 Problem

Show that if θ̂ is the unique M.L. estimator of θ then θ̂ must be a function of the minimal sufficient statistic.

2.1.22 Problem

The word information generally implies something that is additive. Suppose X has probability (density) function f(x; θ), θ ∈ Ω and independently Y has probability (density) function g(y; θ), θ ∈ Ω. Show that the Fisher information in the joint observation (X, Y) is the sum of the Fisher information in X plus the Fisher information in Y.

Often S(θ) = 0 must be solved numerically using an iterative method such as Newton's Method.

2.1.23 Newton’s Method

Let θ^(0) be an initial estimate of θ. We may update that value as follows:

θ^(i+1) = θ^(i) + S(θ^(i))/I(θ^(i)), i = 0, 1, . . .

Notes:

(1) The initial estimate, θ^(0), may be determined by graphing L(θ) or l(θ).

(2) The algorithm is usually run until the value of θ^(i) no longer changes to a reasonable number of decimal places. When the algorithm is stopped it is always important to check that the value of θ̂ obtained does indeed maximize L(θ).

(3) This algorithm is also called the Newton-Raphson Method.

(4) I(θ) can be replaced by J(θ) for a similar algorithm which is called the method of scoring or Fisher's method of scoring.

(5) The value of θ̂ may also be found by maximizing L(θ) or l(θ) using the maximization (minimization) routines available in various statistical software packages such as Maple, S-Plus, Matlab, R, etc.

(6) If the support of X depends on θ (e.g. UNIF(0, θ)) then θ̂ is not found by solving S(θ) = 0.
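A minimal sketch of the update in 2.1.23 is given below. It is not from the notes; the Poisson score S(θ) = t/θ − n and information I(θ) = t/θ², with t = Σxi, are used purely as a stand-in example, and the data vector is hypothetical. Because the Poisson M.L. estimate is known to be x̄, the iteration can be checked directly.

    # A minimal sketch of Newton's Method for a scalar parameter (stand-in Poisson example).
    def newton(score, info, theta0, tol=1e-8, max_iter=50):
        theta = theta0
        for _ in range(max_iter):
            step = score(theta) / info(theta)   # theta(i+1) = theta(i) + S(theta(i))/I(theta(i))
            theta += step
            if abs(step) < tol:
                break
        return theta

    x = [3, 1, 4, 2, 0, 2, 3, 1, 2, 2]           # hypothetical Poisson counts
    n, t = len(x), sum(x)
    theta_hat = newton(lambda th: t / th - n, lambda th: t / th**2, theta0=1.0)
    print(theta_hat, t / n)                      # both should equal xbar = 2.0

As note (2) advises, the value returned should still be checked against a plot of l(θ) to confirm it is a maximum.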


2.1.24 Example

Suppose X1, . . . , Xn is a random sample from the WEI(1, β) distribution. Explain how you would find the M.L. estimate of β using Newton's Method. How would you find the mean and variance of the M.L. estimator of β?

2.1.25 Problem - Likelihood Function for Grouped Data

Suppose X is a random variable with probability (density) function f(x; θ) and P(X ∈ A; θ) = 1. Suppose A1, A2, . . . , Am is a partition of A and let

pj(θ) = P(X ∈ Aj; θ), j = 1, . . . , m.

Suppose n independent observations are collected from this distribution but it is only possible to determine to which one of the m sets, A1, A2, . . . , Am, the i'th observation belongs. The observed data are:

Outcome      A1   A2   ...   Am   Total
Frequency    f1   f2   ...   fm   n

(a) Show that the Fisher information for these data is given by

J(θ) = n Σ_{j=1}^m [p′j(θ)]²/pj(θ).

Hint: Since Σ_{j=1}^m pj(θ) = 1, (d/dθ)[Σ_{j=1}^m pj(θ)] = 0.

(b) Explain how you would find the M.L. estimate of θ.

2.1.26 Definition

The relative likelihood function R(θ) is defined by

R(θ) = R(θ; x) = L(θ)/L(θ̂), θ ∈ Ω.

The relative likelihood function takes on values between 0 and 1 and can be used to rank possible parameter values according to their plausibilities in light of the data. If R(θ1) = 0.1, say, then θ1 is rather an implausible parameter value because the data are ten times more probable when θ = θ̂ than they are when θ = θ1. However, if R(θ1) = 0.5, say, then θ1 is a fairly plausible value because it gives the data 50% of the maximum possible probability under the model.


2.1.27 Definition

The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region for θ. If the region is an interval of real values then it is called a 100p% likelihood interval (L.I.) for θ.

Values inside the 10% L.I. are referred to as plausible and values outside this interval as implausible. Values inside a 50% L.I. are very plausible and values outside a 1% L.I. are very implausible in light of the data.

2.1.28 Definition

The log relative likelihood function is the natural logarithm of the relative likelihood function:

r(θ) = r(θ; x) = log[R(θ)] = log L(θ) − log L(θ̂) = l(θ) − l(θ̂), θ ∈ Ω.

Likelihood regions or intervals may be determined from a graph of R(θ) or r(θ) and usually it is more convenient to work with r(θ). Alternatively, they can be found by solving r(θ) − log p = 0. Usually this must be done numerically.
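The numerical root-finding can be sketched as follows. This is not from the notes: the Poisson log relative likelihood r(θ) = nθ̂ log(θ/θ̂) − n(θ − θ̂) with n = 10 and θ̂ = 2 (the values used with Figure 2.1) is assumed here, and scipy's root finder is used on r(θ) − log p = 0 on either side of θ̂.

    # A sketch (not from the notes) of finding a 100p% likelihood interval numerically.
    import numpy as np
    from scipy.optimize import brentq

    n, theta_hat, p = 10, 2.0, 0.15

    def r(theta):
        # assumed Poisson log relative likelihood l(theta) - l(theta_hat)
        return -n * (theta - theta_hat) + n * theta_hat * np.log(theta / theta_hat)

    g = lambda theta: r(theta) - np.log(p)
    lower = brentq(g, 1e-6, theta_hat)      # root below theta_hat
    upper = brentq(g, theta_hat, 20.0)      # root above theta_hat
    print(f"{100*p:.0f}% likelihood interval: ({lower:.3f}, {upper:.3f})")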

2.1.29 Example

Plot the relative likelihood function for θ in Example 2.1.5 if n = 15 and θ̂ = 1. Find the 15% L.I. for θ. See Figure 2.3.

2.1.30 Problem

Suppose X ∼ BIN(n, θ). Plot the relative likelihood function for θ if x = 3 is observed for n = 100. On the same graph plot the relative likelihood function for θ if x = 6 is observed for n = 200. Compare the graphs as well as the 10% L.I. and 50% L.I. for θ.

2.1.31 Problem

Suppose X1, . . . , Xn is a random sample from the EXP(1, θ) distribution. Plot the relative likelihood function for θ if n = 20 and x(1) = 1. Find 10% and 50% L.I.'s for θ.


[Figure 2.3: Relative Likelihood Function for Example 2.1.29, R(θ) plotted for 0 ≤ θ ≤ 2.5.]

2.1.32 Problem

The following model is proposed for the distribution of family size in a large population:

P(k children in family; θ) = θ^k, for k = 1, 2, . . .
P(0 children in family; θ) = (1 − 2θ)/(1 − θ).

The parameter θ is unknown and 0 < θ < 1/2. Fifty families were chosen at random from the population. The observed numbers of children are given in the following table:

No. of children        0    1    2    3    4    Total
Frequency observed     17   22   7    3    1    50

(a) Find the likelihood, log likelihood, score and information functions for θ.


(b) Find the M.L. estimate of θ and the observed information.

(c) Find a 15% likelihood interval for θ.

(d) A large study done 20 years earlier indicated that θ = 0.45. Is this value plausible for these data?

(e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data?

2.1.33 Problem

The probability that k different species of plant life are found in a randomly chosen plot of specified area is

pk(θ) = (1 − e^(−θ))^(k+1)/[(k + 1)θ], k = 0, 1, . . . ; θ > 0.

The data obtained from an examination of 200 plots are given in the table below:

No. of species         0     1    2    3    ≥ 4   Total
Frequency observed     147   36   13   4    0     200

(a) Find the likelihood, log likelihood, score and information functions for θ.

(b) Find the M.L. estimate of θ and the observed information.

(c) Find a 15% likelihood interval for θ.

(d) Is θ = 1 a plausible value of θ in light of the observed data?

(e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data?

2.2 Principles of Inference

In Chapter 1 we discussed the Sufficiency Principle and the Conditionality Principle. There is another principle which is equivalent to the Sufficiency Principle. The likelihood ratios generate the minimal sufficient partition. In other words, two likelihood ratios will agree

f(x1; θ)/f(x1; θ0) = f(x2; θ)/f(x2; θ0)

if and only if the values of the minimal sufficient statistic agree, that is, T(x1) = T(x2). Thus we obtain:


2.2.1 The Weak Likelihood Principle

Suppose for two different observations x1, x2, the likelihood ratios

f(x1; θ)/f(x1; θ0) = f(x2; θ)/f(x2; θ0)

for all values of θ, θ0 ∈ Ω. Then the two different observations x1, x2 should lead to the same inference about θ.

A weaker but similar principle, the Invariance Principle, follows. This can be used, for example, to argue that for independent identically distributed observations, it is only the values of the observations (the order statistic) that should be used for inference, not the particular order in which those observations were obtained.

2.2.2 Invariance Principle

Suppose for two different observations x1, x2,

f (x1; θ) = f (x2; θ)

for all values of θ ∈ Ω. Then the two different observations x1, x2 should lead to the same inference about θ.

There are relationships among these and other principles. For example, Birnbaum proved that the Conditionality Principle and the Sufficiency Principle above imply a stronger version of a Likelihood Principle. However, it is probably safe to say that while probability theory has been quite successfully axiomatized, it seems to be difficult if not impossible to derive most sensible statistical procedures from a set of simple mathematical axioms or principles of inference.

2.2.3 Problem

Consider the model {f(x; θ); θ ∈ Ω} and suppose that θ̂ is the M.L. estimator based on the observation X. We often draw conclusions about the plausibility of a given parameter value θ based on the relative likelihood L(θ)/L(θ̂). If this is very small, for example, less than or equal to 1/N, we regard the value of the parameter θ as highly unlikely. But what happens if this test declares every value of the parameter unlikely?

Suppose f(x; θ) = 1 if x = θ and f(x; θ) = 0 otherwise, where θ = 1, 2, . . . , N. Define f0(x) to be the discrete uniform distribution on the integers 1, 2, . . . , N. In this example the parameter space is Ω = {θ; θ = 0, 1, . . . , N}. Show that the relative likelihood

f0(x)/f(x; θ̂) ≤ 1/N

no matter what value of x is observed. Should this be taken to mean that the true distribution cannot be f0?

2.3 Properties of the Score and Information - Regular Model

Consider the model {f(x; θ); θ ∈ Ω}. The following is a set of sufficient conditions which we will use to determine the properties of the M.L. estimator of θ. These conditions are not the most general conditions but are sufficiently general for most applications. Notable exceptions are the UNIF(0, θ) and the EXP(1, θ) distributions which will be considered separately.

For convenience we call a family of models which satisfy the following conditions a regular family of distributions. (See 1.7.9.)

2.3.1 Regular Model

Consider the model {f(x; θ); θ ∈ Ω}. Suppose that:

(R1) The parameter space Ω is an open interval in the real line.

(R2) The densities f(x; θ) have common support, so that the set A = {x; f(x; θ) > 0} does not depend on θ.

(R3) For all x ∈ A, f(x; θ) is a continuous, three times differentiable function of θ.

(R4) The integral ∫_A f(x; θ) dx can be twice differentiated with respect to θ under the integral sign, that is,

(∂^k/∂θ^k) ∫_A f(x; θ) dx = ∫_A (∂^k/∂θ^k) f(x; θ) dx, k = 1, 2, for all θ ∈ Ω.

(R5) For each θ0 ∈ Ω there exist a positive number c and a function M(x) (both of which may depend on θ0), such that for all θ ∈ (θ0 − c, θ0 + c)

|∂³ log f(x; θ)/∂θ³| < M(x)

holds for all x ∈ A, and

E[M(X); θ] < ∞ for all θ ∈ (θ0 − c, θ0 + c).

(R6) For each θ ∈ Ω,

0 < E{[∂² log f(X; θ)/∂θ²]²; θ} < ∞.

If these conditions hold with X a discrete random variable and the integrals replaced by sums, then we shall also call this a regular family of distributions.

Condition (R3) ensures that the function ∂ log f(x; θ)/∂θ has, for each x ∈ A, a Taylor expansion as a function of θ.

The following lemma provides one method of determining whether differentiation under the integral sign (condition (R4)) is valid.

2.3.2 Lemma

Suppose ∂g(x; θ)/∂θ exists for all θ ∈ Ω and all x ∈ A. Suppose also that for each θ0 ∈ Ω there exist a positive number c and a function G(x) (both of which may depend on θ0), such that for all θ ∈ (θ0 − c, θ0 + c)

|∂g(x; θ)/∂θ| < G(x)

holds for all x ∈ A, and

∫_A G(x) dx < ∞.

Then

(∂/∂θ) ∫_A g(x; θ) dx = ∫_A (∂/∂θ) g(x; θ) dx.

2.3.3 Theorem - Expectation and Variance of the Score Function

If X = (X1, . . . , Xn) is a random sample from a regular model {f(x; θ); θ ∈ Ω} then

E[S(θ; X); θ] = 0

and

Var[S(θ; X); θ] = E{[S(θ; X)]²; θ} = E[I(θ; X); θ] = J(θ) < ∞

for all θ ∈ Ω.
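Theorem 2.3.3 can be checked by simulation for any particular regular model. The sketch below is not part of the notes; it uses the POI(θ) model as a stand-in, for which S(θ; X) = ΣXi/θ − n and J(θ) = n/θ, and simulates many samples to compare the empirical mean and variance of the score with 0 and J(θ).

    # A simulation sketch (stand-in Poisson model) of Theorem 2.3.3.
    import numpy as np

    rng = np.random.default_rng(1)
    n, theta, reps = 20, 3.0, 20000
    x = rng.poisson(theta, size=(reps, n))
    scores = x.sum(axis=1) / theta - n        # S(theta; X) for each simulated sample

    print("mean of S:", scores.mean())        # should be near 0
    print("var  of S:", scores.var())         # should be near J(theta) = n/theta = 6.667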

2.3.4 Problem - Invariance Property of M.L.Estimators

Suppose X1, . . . , Xn is a random sample from a distribution with probability (density) function f(x; θ) where {f(x; θ); θ ∈ Ω} is a regular family. Let S(θ) and J(θ) be the score function and Fisher information respectively based on X1, . . . , Xn. Consider the reparameterization τ = h(θ) where h is a one-to-one differentiable function with inverse function θ = g(τ). Let S*(τ) and J*(τ) be the score function and Fisher information respectively under the reparameterization.

(a) Show that τ̂ = h(θ̂) is the M.L. estimator of τ where θ̂ is the M.L. estimator of θ.

(b) Show that E[S*(τ; X); τ] = 0 and J*(τ) = [g′(τ)]² J[g(τ)].

2.3.5 Problem

It is natural to expect that if we compare the information available in the original data X and the information available in some statistic T(X), the latter cannot be greater than the former since T can be obtained from X. Show that in a regular model the Fisher information calculated from the marginal distribution of T is less than or equal to the Fisher information for X. Show that they are equal for all values of the parameter if and only if T is a sufficient statistic for {f(x; θ); θ ∈ Ω}.

2.4 Maximum Likelihood Method- Multiparameter

The case of several parameters is exactly analogous to the one parameter case. Suppose θ = (θ1, . . . , θk)^T. The log likelihood function l(θ1, . . . , θk) = log L(θ1, . . . , θk) is a function of k parameters. The M.L. estimate of θ, θ̂ = (θ̂1, . . . , θ̂k)^T, is usually found by solving ∂l/∂θj = 0, j = 1, . . . , k, simultaneously.

The invariance property of the M.L. estimator also holds in the multiparameter case.


2.4.1 Definition

If θ = (θ1, . . . , θk)^T then the score vector is defined as

S(θ) = (∂l/∂θ1, . . . , ∂l/∂θk)^T, θ ∈ Ω.

2.4.2 Definition

If θ = (θ1, . . . , θk)^T then the information matrix I(θ) is a k × k symmetric matrix whose (i, j) entry is given by

−∂²l(θ)/∂θi∂θj, θ ∈ Ω.

I(θ̂) is called the observed information matrix.

2.4.3 Definition

If θ = (θ1, . . . , θk)^T then the expected or Fisher information matrix J(θ) is a k × k symmetric matrix whose (i, j) entry is given by

E[−∂²l(θ; X)/∂θi∂θj; θ], θ ∈ Ω.

2.4.4 Expectation and Variance of the Score Vector

For a regular family of distributions

E[S(θ; X); θ] = (0, . . . , 0)^T

and

Var[S(θ; X); θ] = E[S(θ; X) S(θ; X)^T; θ] = E[I(θ; X); θ] = J(θ).

2.4.5 Likelihood Regions

The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region for θ.


2.4.6 Example:

Suppose X1, . . . , Xn is a random sample from the N(μ, σ²) distribution. Find the score vector, the information matrix, the Fisher information matrix and the M.L. estimator of θ = (μ, σ²)^T. Find the observed information matrix I(μ̂, σ̂²) and thus verify that (μ̂, σ̂²) is the M.L. estimator of (μ, σ²). Find the Fisher information matrix J(μ, σ²).

Since X1, . . . , Xn is a random sample from the N(μ, σ²) distribution the likelihood function is

L(μ, σ²) = ∏_{i=1}^n (2πσ²)^(−1/2) exp[−(xi − μ)²/(2σ²)]
         = (2π)^(−n/2) (σ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²]
         = (2π)^(−n/2) (σ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (xi² − 2μxi + μ²)]
         = (2π)^(−n/2) (σ²)^(−n/2) exp[−(1/(2σ²)) (Σ_{i=1}^n xi² − 2μ Σ_{i=1}^n xi + nμ²)]
         = (2π)^(−n/2) (σ²)^(−n/2) exp[−(1/(2σ²)) (t1 − 2μt2 + nμ²)],  μ ∈ ℝ, σ² > 0

where

t1 = Σ_{i=1}^n xi² and t2 = Σ_{i=1}^n xi.

The log likelihood function is

l(μ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (xi − μ)²
         = −(n/2) log(2π) − (n/2) log(σ²) − (1/2)(σ²)^(−1) [Σ_{i=1}^n (xi − x̄)² + n(x̄ − μ)²]
         = −(n/2) log(2π) − (n/2) log(σ²) − (1/2)(σ²)^(−1) [(n − 1)s² + n(x̄ − μ)²],  μ ∈ ℝ, σ² > 0

where

s² = [1/(n − 1)] Σ_{i=1}^n (xi − x̄)².

Now

∂l/∂μ = (n/σ²)(x̄ − μ) = n(σ²)^(−1)(x̄ − μ)

and

∂l/∂σ² = −(n/2)(σ²)^(−1) + (1/2)(σ²)^(−2) [(n − 1)s² + n(x̄ − μ)²].

The equations

∂l/∂μ = 0, ∂l/∂σ² = 0

are solved simultaneously for

μ̂ = x̄ and σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)² = [(n − 1)/n] s².

Since

−∂²l/∂μ² = n/σ²,  −∂²l/∂σ²∂μ = n(x̄ − μ)/σ⁴,

−∂²l/∂(σ²)² = −(n/2)(1/σ⁴) + (1/σ⁶)[(n − 1)s² + n(x̄ − μ)²],

the information matrix is

I(μ, σ²) = [ n/σ²             n(x̄ − μ)/σ⁴
             n(x̄ − μ)/σ⁴     −(n/2)(1/σ⁴) + (1/σ⁶)[(n − 1)s² + n(x̄ − μ)²] ],  μ ∈ ℝ, σ² > 0.

Since

I11(μ̂, σ̂²) = n/σ̂² > 0 and det I(μ̂, σ̂²) = n²/(2σ̂⁶) > 0,

then by the Second Derivative Test the M.L. estimates of μ and σ² are

μ̂ = x̄ and σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)² = [(n − 1)/n] s²

and the M.L. estimators are

μ̂ = X̄ and σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)² = [(n − 1)/n] S².

The observed information is

I(μ̂, σ̂²) = [ n/σ̂²    0
              0        n/(2σ̂⁴) ].

Now

E(n/σ²; μ, σ²) = n/σ²,  E[n(X̄ − μ)/σ⁴; μ, σ²] = 0,

and

E{−(n/2)(1/σ⁴) + (1/σ⁶)[(n − 1)S² + n(X̄ − μ)²]; μ, σ²}
  = −(n/2)(1/σ⁴) + (1/σ⁶){(n − 1)E(S²; μ, σ²) + nE[(X̄ − μ)²; μ, σ²]}
  = −(n/2)(1/σ⁴) + (1/σ⁶)[(n − 1)σ² + σ²]
  = n/(2σ⁴)

since

E[(X̄ − μ); μ, σ²] = 0,  E[(X̄ − μ)²; μ, σ²] = Var(X̄; μ, σ²) = σ²/n and E(S²; μ, σ²) = σ².

Therefore the Fisher information matrix is

J(μ, σ²) = [ n/σ²    0
             0        n/(2σ⁴) ]

and the inverse of the Fisher information matrix is

[J(μ, σ²)]^(−1) = [ σ²/n    0
                    0        2σ⁴/n ].

Now

Var(X̄) = σ²/n,

Var(σ̂²) = Var[(1/n) Σ_{i=1}^n (Xi − X̄)²] = 2(n − 1)σ⁴/n² ≈ 2σ⁴/n

and

Cov(X̄, σ̂²) = (1/n) Cov(X̄, Σ_{i=1}^n (Xi − X̄)²) = 0

since X̄ and Σ_{i=1}^n (Xi − X̄)² are independent random variables. Inferences for μ and σ² are usually made using

(X̄ − μ)/(S/√n) ∼ t(n − 1) and (n − 1)S²/σ² ∼ χ²(n − 1).


The relative likelihood function is

R(μ, σ²) = L(μ, σ²)/L(μ̂, σ̂²) = (σ̂²/σ²)^(n/2) exp{n/2 − [n/(2σ²)][σ̂² + (x̄ − μ)²]}, μ ∈ ℝ, σ² > 0.

See Figure 2.4 for a graph of R(μ, σ²) for n = 350, μ̂ = 160 and σ̂² = 36.

[Figure 2.4: Normal Likelihood function for n = 350, μ̂ = 160 and σ̂² = 36, plotted as a surface over (μ, σ²).]

2.4.7 Problem - The Score Equation and the Exponential Family

Suppose X has a regular exponential family distribution of the form

f(x; η) = C(η) exp[Σ_{j=1}^k ηj Tj(x)] h(x)

where η = (η1, . . . , ηk)^T. Show that

E[Tj(X); η] = −∂ log C(η)/∂ηj, j = 1, . . . , k

and

Cov(Ti(X), Tj(X); η) = −∂² log C(η)/∂ηi∂ηj, i, j = 1, . . . , k.

Suppose that (x1, . . . , xn) are the observed data for a random sample from f(x; η). Show that the score equations

∂l(η)/∂ηj = 0, j = 1, . . . , k

can be written as

E[Σ_{i=1}^n Tj(Xi); η] = Σ_{i=1}^n Tj(xi), j = 1, . . . , k.

2.4.8 Problem

Suppose X1, . . . , Xn is a random sample from the N(μ, σ²) distribution. Use the result of Problem 2.4.7 to find the score equations for μ and σ² and verify that these are the same equations obtained in Example 2.4.6.

2.4.9 Problem

Suppose (X1, Y1), . . . , (Xn, Yn) is a random sample from the BVN(μ, Σ) distribution. Find the M.L. estimators of μ1, μ2, σ1², σ2², and ρ. You do not need to verify that your answer corresponds to a maximum. Hint: Use the result from Problem 2.4.7.

2.4.10 Problem

Suppose (X1, X2) ∼ MULT(n, θ1, θ2). Find the M.L. estimators of θ1 and θ2, the score function and the Fisher information matrix.

2.4.11 Problem

Suppose X1, . . . , Xn is a random sample from the UNIF(a, b) distribution. Find the M.L. estimators of a and b. Verify that your answer corresponds to a maximum. Find the M.L. estimator of τ(a, b) = E(Xi).

2.4.12 Problem

Suppose X1, . . . , Xn is a random sample from the UNIF(μ − 3σ, μ + 3σ) distribution. Find the M.L. estimators of μ and σ.


2.4.13 Problem

Suppose X1, . . . , Xn is a random sample from the EXP(β, μ) distribution. Find the M.L. estimators of β and μ. Verify that your answer corresponds to a maximum. Find the M.L. estimator of τ(β, μ) = xα where xα is the α percentile of the distribution.

2.4.14 Problem

In Problem 1.7.26 find the M.L. estimators of μ and σ². Verify that your answer corresponds to a maximum.

2.4.15 Problem

Suppose E(Y) = Xβ where Y = (Y1, . . . , Yn)^T is a vector of independent and normally distributed random variables with Var(Yi) = σ², i = 1, . . . , n, X is an n × k matrix of known constants of rank k and β = (β1, . . . , βk)^T is a vector of unknown parameters. Show that the M.L. estimators of β and σ² are given by

β̂ = (X^T X)^(−1) X^T Y and σ̂² = (Y − Xβ̂)^T (Y − Xβ̂)/n.

2.4.16 Newton’s Method

In the multiparameter case θ = (θ1, . . . , θk)^T Newton's method is given by:

θ^(i+1) = θ^(i) + [I(θ^(i))]^(−1) S(θ^(i)), i = 0, 1, 2, . . .

I(θ) can also be replaced by the Fisher information J(θ).

2.4.17 Example

The following data are 30 independent observations from a BETA(a, b) distribution:

0.2326, 0.0465, 0.2159, 0.2447, 0.0674, 0.3729, 0.3247, 0.3910, 0.3150,0.3049, 0.4195, 0.3473, 0.2709, 0.4302, 0.3232, 0.2354, 0.4014, 0.3720,0.5297, 0.1508, 0.4253, 0.0710, 0.3212, 0.3373, 0.1322, 0.4712, 0.4111,0.1079, 0.0819, 0.3556

The likelihood function for observations x1, x2, . . . , xn is

L(a, b) = ∏_{i=1}^n [Γ(a + b)/(Γ(a)Γ(b))] xi^(a−1) (1 − xi)^(b−1), a > 0, b > 0
        = [Γ(a + b)/(Γ(a)Γ(b))]^n [∏_{i=1}^n xi]^(a−1) [∏_{i=1}^n (1 − xi)]^(b−1).


The log likelihood function is

l(a, b) = n[log Γ(a + b) − log Γ(a) − log Γ(b) + (a − 1)t1 + (b − 1)t2]

where

t1 = (1/n) Σ_{i=1}^n log xi and t2 = (1/n) Σ_{i=1}^n log(1 − xi).

(T1, T2) is a sufficient statistic for (a, b) where

T1 = (1/n) Σ_{i=1}^n log Xi and T2 = (1/n) Σ_{i=1}^n log(1 − Xi).

Why?

Let

Ψ(z) = d log Γ(z)/dz = Γ′(z)/Γ(z),

which is called the digamma function. The score vector is

S(a, b) = [ ∂l/∂a
            ∂l/∂b ] = n [ Ψ(a + b) − Ψ(a) + t1
                          Ψ(a + b) − Ψ(b) + t2 ].

S(a, b) = [0 0]^T must be solved numerically to find the M.L. estimates of a and b.

Let

Ψ′(z) = (d/dz) Ψ(z),

which is called the trigamma function. The information matrix is

I(a, b) = n [ Ψ′(a) − Ψ′(a + b)      −Ψ′(a + b)
              −Ψ′(a + b)             Ψ′(b) − Ψ′(a + b) ]

which is also the Fisher or expected information matrix.

For the data above

t1 = (1/30) Σ_{i=1}^{30} log xi = −1.3929 and t2 = (1/30) Σ_{i=1}^{30} log(1 − xi) = −0.3594.

The M.L. estimates of a and b can be found using Newton's Method given by

(a^(i+1), b^(i+1))^T = (a^(i), b^(i))^T + [I(a^(i), b^(i))]^(−1) S(a^(i), b^(i))


for i = 0, 1, . . . until convergence. Newton's Method converges after 8 iterations beginning with the initial estimates a^(0) = 2, b^(0) = 2. The iterations are given below:

(0.6449, 2.2475)^T = (2, 2)^T + [[10.8333, −8.5147], [−8.5147, 10.8333]]^(−1) (−16.7871, 14.2190)^T
(1.0852, 3.1413)^T = (0.6449, 2.2475)^T + [[84.5929, −12.3668], [−12.3668, 4.3759]]^(−1) (26.1919, −1.5338)^T
(1.6973, 4.4923)^T = (1.0852, 3.1413)^T + [[35.8351, −8.0032], [−8.0032, 3.2253]]^(−1) (11.1198, −0.5408)^T
(2.3133, 5.8674)^T = (1.6973, 4.4923)^T + [[18.5872, −5.2594], [−5.2594, 2.2166]]^(−1) (4.2191, −0.1922)^T
(2.6471, 6.6146)^T = (2.3133, 5.8674)^T + [[12.2612, −3.9004], [−3.9004, 1.6730]]^(−1) (1.1779, −0.0518)^T
(2.7058, 6.7461)^T = (2.6471, 6.6146)^T + [[10.3161, −3.4203], [−3.4203, 1.4752]]^(−1) (0.1555, −0.0067)^T
(2.7072, 6.7493)^T = (2.7058, 6.7461)^T + [[10.0345, −3.3478], [−3.3478, 1.4450]]^(−1) (0.0035, −0.0001)^T
(2.7072, 6.7493)^T = (2.7072, 6.7493)^T + [[10.0280, −3.3461], [−3.3461, 1.4443]]^(−1) (0.0000, 0.0000)^T

The M.L. estimates are â = 2.7072 and b̂ = 6.7493.

The observed information matrix is

I(â, b̂) = [ 10.0280    −3.3461
            −3.3461      1.4443 ].

Note that since det[I(â, b̂)] = (10.0280)(1.4443) − (3.3461)² > 0 and [I(â, b̂)]11 = 10.0280 > 0, by the Second Derivative Test we have found the M.L. estimates.

A graph of the relative likelihood function is given in Figure 2.5.


[Figure 2.5: Relative Likelihood for Beta Example, plotted as a surface over (a, b).]

A 100p% likelihood region for (a, b) is given by {(a, b); R(a, b) ≥ p}. The 1%, 5% and 10% likelihood regions for (a, b) are shown in Figure 2.6. Note that the likelihood contours are elliptical in shape and are skewed relative to the ab coordinate axes. Since this is a regular model and S(â, b̂) = 0, by Taylor's Theorem we have

L(a, b) ≈ L(â, b̂) + L(â, b̂) S(â, b̂)^T (a − â, b − b̂)^T − (1/2) L(â, b̂) (a − â, b − b̂) I(â, b̂) (a − â, b − b̂)^T
        = L(â, b̂) {1 − (1/2) (a − â, b − b̂) I(â, b̂) (a − â, b − b̂)^T}

for all (a, b) sufficiently close to (â, b̂). Therefore

R(a, b) = L(a, b)/L(â, b̂)
        ≈ 1 − (1/2) (a − â, b − b̂) I(â, b̂) (a − â, b − b̂)^T
        = 1 − (1/2) [(a − â)² I11 + 2(a − â)(b − b̂) I12 + (b − b̂)² I22]

where I11, I12 and I22 denote the entries of I(â, b̂). The set of points (a, b) which satisfy R(a, b) = p is approximately the set of points (a, b) which satisfy

(a − â)² I11 + 2(a − â)(b − b̂) I12 + (b − b̂)² I22 = 2(1 − p),

which we recognize as the points on an ellipse centred at (â, b̂). The skewness of the likelihood contours relative to the ab coordinate axes is determined by the value of I12. If this value is close to zero the skewness will be small.

2.4.18 Problem

The following data are 30 independent observations from a GAM(α, β) distribution:

15.1892, 19.3316, 1.6985, 2.0634, 12.5905, 6.0094,13.6279, 14.7847, 13.8251, 19.7445, 13.4370, 18.6259,2.7319, 8.2062, 7.3621, 1.6754, 10.1070, 3.2049,21.2123, 4.1419, 12.2335, 9.8307, 3.6866, 0.7076,7.9571, 3.3640, 12.9622, 12.0592, 24.7272, 12.7624

For these data t1 = Σ_{i=1}^{30} log xi = 61.1183 and t2 = Σ_{i=1}^{30} xi = 309.8601.

Find the M.L. estimates of α and β for these data, the observed information I(α̂, β̂) and the Fisher information J(α̂, β̂). On the same graph plot the 1%, 5%, and 10% likelihood regions for (α, β). Comment.

2.4.19 Problem

Suppose X1, . . . , Xn is a random sample from the distribution with probability density function

f(x; α, β) = αβ/(1 + βx)^(α+1), x > 0; α, β > 0.

Find the Fisher information matrix J(α, β).


[Figure 2.6: Likelihood Regions for BETA(a, b) Example: the 1%, 5% and 10% regions plotted in the (a, b) plane, with the points (1.5, 8) and (2.7, 6.7) marked.]

The following data are 15 independent observations from this distribution:

9.53, 0.15, 0.77, 0.47, 4.10, 1.60, 0.42, 0.01, 2.30, 0.40,

0.80, 1.90, 5.89, 1.41, 0.11

Find the M.L. estimates of α and β for these data and the observed information I(α̂, β̂). On the same graph plot the 1%, 5%, and 10% likelihood regions for (α, β). Comment.

2.4.20 Problem

Suppose X1, . . . , Xn is a random sample from the CAU(β, μ) distribution. Find the Fisher information for (β, μ).


2.4.21 Problem

A radioactive sample emits particles at a rate which decays with time, the rate being λ(t) = λe^(−βt). In other words, the number of particles emitted in an interval (t, t + h) has a Poisson distribution with parameter ∫_t^(t+h) λe^(−βs) ds and the numbers emitted in disjoint intervals are independent random variables. Find the M.L. estimates of λ and β, λ > 0, β > 0, if the actual times of the first, second, . . . , n'th decay t1 < t2 < · · · < tn are observed. Show that β̂ satisfies the equation

β̂tn/(e^(β̂tn) − 1) = 1 − β̂t̄ where t̄ = (1/n) Σ_{i=1}^n ti.

2.4.22 Problem

In Problem 2.1.23 suppose θ = (θ1, . . . , θk)^T. Find the Fisher information matrix and explain how you would find the M.L. estimate of θ.

2.5 Incomplete Data and The E.M. Algorithm

The E.M. algorithm, which was popularized by Dempster, Laird and Rubin (1977), is a useful method for finding M.L. estimates when some of the data are incomplete, but can also be applied to many other contexts such as grouped data, mixtures of distributions, variance components and factor analysis.

The following are two examples of incomplete data:

2.5.1 Censored Exponential Data

Suppose Xi ∼ EXP(θ), i = 1, . . . , n. Suppose we only observe Xi for m observations and the remaining n − m observations are censored at a fixed time c. The observed data are of the form Yi = min(Xi, c), i = 1, . . . , n. Note that Y = Y(X) is a many-to-one mapping. (X1, . . . , Xn) are called the complete data and (Y1, . . . , Yn) are called the incomplete data.

2.5.2 “Lumped” Hardy-Weinberg Data

A gene has two forms A and B. Each individual has a pair of these genes, one from each parent, so that there are three possible genotypes: AA, AB and BB. Suppose that, in both male and female populations, the proportion of A types is equal to θ and the proportion of B types is equal to 1 − θ. Suppose further that random mating occurs with respect to this gene pair. Then the proportions of individuals with genotypes AA, AB and BB in the next generation are θ², 2θ(1 − θ) and (1 − θ)² respectively. Furthermore, if random mating continues, these proportions will remain nearly constant for generation after generation. This is the famous result from genetics called the Hardy-Weinberg Law. Suppose we have a group of n individuals and let X1 = number with genotype AA, X2 = number with genotype AB and X3 = number with genotype BB. Suppose however that it is not possible to distinguish AA's from AB's so that the observed data are (Y1, Y2) where Y1 = X1 + X2 and Y2 = X3. The complete data are (X1, X2, X3) and the incomplete data are (Y1, Y2).

2.5.3 Theorem

Suppose X, the complete data, has probability (density) function f(x; θ) and Y = Y(X), the incomplete data, has probability (density) function g(y; θ). Suppose further that f(x; θ) and g(y; θ) are regular models. Then

(∂/∂θ) log g(y; θ) = E[S(θ; X) | Y = y].

Suppose θ̂, the value which maximizes log g(y; θ), is found by solving (∂/∂θ) log g(y; θ) = 0. By the previous theorem θ̂ is also the solution to

E[S(θ; X) | Y = y; θ] = 0.

Note that θ appears in two places in the second equation, as an argument in the function S as well as an argument in the expectation E.

2.5.4 The E.M. Algorithm

The E.M. algorithm solves E[S(θ; X) | Y = y] = 0 using an iterative two-step method. Let θ^(i) be the estimate of θ from the ith iteration.

(1) E-Step (Expectation Step)

Calculate

E[log f(X; θ) | Y = y; θ^(i)] = Q(θ, θ^(i)).

(2) M-step (Maximization Step)


Find the value of θ which maximizes Q(θ, θ^(i)) and set θ^(i+1) equal to this value. θ^(i+1) is found by solving

(∂/∂θ) Q(θ, θ^(i)) = E[(∂/∂θ) log f(X; θ) | Y = y; θ^(i)] = E[S(θ; X) | Y = y; θ^(i)] = 0

with respect to θ. Note that

E[S(θ^(i+1); X) | Y = y; θ^(i)] = 0.

2.5.5 Example

Give the E.M. algorithm for the "Lumped" Hardy-Weinberg example. Find θ̂ if n = 10 and y1 = 3. Show how θ̂ can be found explicitly by solving (∂/∂θ) log g(y; θ) = 0 directly.

The complete data (X1, X2) have joint p.f.

f(x1, x2; θ) = [n!/(x1! x2! (n − x1 − x2)!)] [θ²]^(x1) [2θ(1 − θ)]^(x2) [(1 − θ)²]^(n−x1−x2)
            = θ^(2x1+x2) (1 − θ)^(2n−(2x1+x2)) · h(x1, x2),  x1, x2 = 0, 1, . . . ; x1 + x2 ≤ n; 0 < θ < 1

where

h(x1, x2) = [n!/(x1! x2! (n − x1 − x2)!)] 2^(x2).

It is easy to see (show it!) that (X1, X2) has a regular exponential family distribution with natural sufficient statistic T = T(X1, X2) = 2X1 + X2. The incomplete data are Y = X1 + X2.

For the E-Step we need to calculate

Q(θ, θ^(i)) = E[log f(X1, X2; θ) | Y = y; θ^(i)]
           = E{(2X1 + X2) log θ + [2n − (2X1 + X2)] log(1 − θ) | Y = X1 + X2 = y; θ^(i)}
             + E[log h(X1, X2) | Y = X1 + X2 = y; θ^(i)].     (2.1)

To find these expectations we note that by the properties of the multinomial distribution

X1 | X1 + X2 = y ∼ BIN(y, θ²/[θ² + 2θ(1 − θ)]) = BIN(y, θ/(2 − θ))

and

X2 | X1 + X2 = y ∼ BIN(y, 1 − θ/(2 − θ)).

Therefore

E[2X1 + X2 | Y = X1 + X2 = y; θ^(i)] = 2y[θ^(i)/(2 − θ^(i))] + y[1 − θ^(i)/(2 − θ^(i))]
                                     = y[2/(2 − θ^(i))]
                                     = y p(θ^(i))     (2.2)

where

p(θ) = 2/(2 − θ).

Substituting (2.2) into (2.1) gives

Q(θ, θ^(i)) = y p(θ^(i)) log θ + [2n − y p(θ^(i))] log(1 − θ) + E[log h(X1, X2) | Y = X1 + X2 = y; θ^(i)].

Note that we do not need to simplify the last term on the right hand side since it does not involve θ.

For the M-Step we need to solve

(∂/∂θ) Q(θ, θ^(i)) = 0.

Now

(∂/∂θ) Q(θ, θ^(i)) = y p(θ^(i))/θ − [2n − y p(θ^(i))]/(1 − θ)
                   = {y p(θ^(i))(1 − θ) − [2n − y p(θ^(i))]θ}/[θ(1 − θ)]
                   = [y p(θ^(i)) − 2nθ]/[θ(1 − θ)]

and (∂/∂θ) Q(θ, θ^(i)) = 0

if

θ = y p(θ^(i))/(2n) = [y/(2n)][2/(2 − θ^(i))] = (y/n)[1/(2 − θ^(i))].

Therefore θ^(i+1) is given by

θ^(i+1) = (y/n)[1/(2 − θ^(i))].     (2.3)


Our algorithm for finding the M.L. estimate of θ is

θ^(i+1) = (y/n)[1/(2 − θ^(i))], i = 0, 1, . . .

For the data n = 10 and y = 3 let the initial guess for θ be θ^(0) = 0.1. Note that the initial guess does not really matter in this example since the algorithm converges rapidly for any initial guess between 0 and 1.

For the given data and initial guess we obtain:

θ^(1) = (3/10)[1/(2 − 0.1)] = 0.1579
θ^(2) = (3/10)[1/(2 − θ^(1))] = 0.1629
θ^(3) = (3/10)[1/(2 − θ^(2))] = 0.1633
θ^(4) = (3/10)[1/(2 − θ^(3))] = 0.1633.

So the M.L. estimate of θ is θ̂ = 0.1633 to four decimal places.
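The fixed-point iteration (2.3) is easily coded. The sketch below is not part of the notes; it runs the iteration for these data and compares it with the closed-form estimate derived directly below.

    # A sketch of the E.M. iteration (2.3) for n = 10, y = 3.
    import math

    n, y = 10, 3
    theta = 0.1                                   # initial guess theta(0)
    for i in range(1, 6):
        theta = (y / n) * (1 / (2 - theta))       # theta(i+1) = (y/n) / (2 - theta(i))
        print(f"theta({i}) = {theta:.4f}")

    print("closed form:", 1 - math.sqrt(1 - y / n))   # 1 - sqrt(1 - y/n) = 0.1633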

In this example we can find θ̂ directly since

Y = X1 + X2 ∼ BIN(n, θ² + 2θ(1 − θ))

and therefore

g(y; θ) = (n choose y) [θ² + 2θ(1 − θ)]^y [(1 − θ)²]^(n−y), y = 0, 1, . . . , n; 0 < θ < 1
        = (n choose y) q^y (1 − q)^(n−y), where q = 1 − (1 − θ)²,

which is a binomial likelihood, so the M.L. estimate of q is q̂ = y/n.

By the invariance property of M.L. estimates the M.L. estimate of θ = 1 − √(1 − q) is

θ̂ = 1 − √(1 − q̂) = 1 − √(1 − y/n).     (2.4)

For the data n = 10 and y = 3 we obtain θ̂ = 1 − √(1 − 3/10) = 0.1633 to four decimal places, which is the same answer as we found using the E.M. algorithm.


2.5.6 E.M. Algorithm and the Regular Exponential Family

Suppose X, the complete data, has a regular exponential family distribution with probability (density) function

f(x; θ) = C(θ) exp[Σ_{j=1}^k qj(θ) Tj(x)] h(x), θ = (θ1, . . . , θk)^T

and let Y = Y(X) be the incomplete data. Then the M-step of the E.M. algorithm is given by

E[Tj(X); θ^(i+1)] = E[Tj(X) | Y = y; θ^(i)], j = 1, . . . , k.     (2.5)

2.5.7 Problem

Prove (2.5) using the result from Problem 2.4.7.

2.5.8 Example

Use (2.5) to find the M-step for the “Lumped” Hardy-Weinberg example.

Since the natural sufficient statistic is T = T(X1, X2) = 2X1 + X2 the M-Step is given by

E[2X1 + X2; θ^(i+1)] = E[2X1 + X2 | Y = y; θ^(i)].     (2.6)

Using (2.2) and the fact that

X1 ∼ BIN(n, θ²) and X2 ∼ BIN(n, 2θ(1 − θ)),

then (2.6) can be written as

2n[θ^(i+1)]² + n[2θ^(i+1)][1 − θ^(i+1)] = y[2/(2 − θ^(i))]

or

θ^(i+1) = [y/(2n)][2/(2 − θ^(i))] = (y/n)[1/(2 − θ^(i))]

which is the same result as in (2.3).

If the algorithm converges and

lim_{i→∞} θ^(i) = θ̂

(How would you prove this? Hint: Recall the Monotonic Sequence Theorem.) then

lim_{i→∞} θ^(i+1) = lim_{i→∞} (y/n)[1/(2 − θ^(i))]

or

θ̂ = (y/n)[1/(2 − θ̂)].

Solving for θ̂ gives

θ̂ = 1 − √(1 − y/n)

which is the same result as in (2.4).

2.5.9 Example

Use (2.5) to give the M-step for the censored exponential data example.

Assuming the algorithm converges, find an expression for θ̂. Show that this is the same θ̂ which is obtained when (∂/∂θ) log g(y; θ) = 0 is solved directly.

2.5.10 Problem

Suppose X1, . . . , Xn is a random sample from the N(μ, σ²) distribution. Suppose we observe Xi, i = 1, . . . , m but for i = m + 1, . . . , n we observe only that Xi > c.

(a) Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimate of μ in the case where σ² is known.

(b) Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimates of μ and σ².

Hint: If Z ∼ N(0, 1) show that

E(Z | Z > b) = φ(b)/[1 − Φ(b)] = h(b)

where φ is the probability density function and Φ is the cumulative distribution function of Z, and h is called the hazard function.

2.5.11 Problem

Let (X1, Y1), . . . , (Xn, Yn) be a random sample from the BVN(μ, Σ) distribution. Suppose that some of the Xi and Yi are missing as follows: for i = 1, . . . , n1 we observe both Xi and Yi, for i = n1 + 1, . . . , n2 we observe only Xi and for i = n2 + 1, . . . , n we observe only Yi. Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimates of (μ1, μ2, σ1², σ2², ρ).

Hint: Xi | Yi = yi ∼ N(μ1 + ρσ1(yi − μ2)/σ2, (1 − ρ²)σ1²).

2.5.12 Problem

The data in the table below were obtained through the National Crime Survey conducted by the U.S. Bureau of Census (see Kadane (1985), Journal of Econometrics, 29, 46-67). Households were visited on two occasions, six months apart, to determine if the occupants had been victimized by crime in the preceding six-month period.

                                      Second visit
First visit              Crime-free (X2 = 0)   Victims (X2 = 1)   Nonrespondents
Crime-free (X1 = 0)      392                   55                 33
Victims (X1 = 1)         76                    38                 9
Nonrespondents           31                    7                  115

Let X1i = 1 (0) if the occupants in household i were victimized (not victimized) by crime in the preceding six-month period on the first visit.

Let X2i = 1 (0) if the occupants in household i were victimized (not victimized) by crime during the six-month period between the first and second visits.

Let θjk = P(X1i = j, X2i = k), j = 0, 1; k = 0, 1; i = 1, . . . , N.

(a) Write down the probability of observing the complete data Xi = (X1i, X2i), i = 1, . . . , N and show that X = (X1, . . . , XN) has a regular exponential family distribution.

(b) Give the M-step of the E.M. algorithm for finding the M.L. estimate of θ = (θ00, θ01, θ10, θ11). Find the M.L. estimate of θ for the data in the table. Note: You may ignore the 115 households that did not respond to the survey at either visit.

Hint:

E[(1 − X1i)(1 − X2i); θ] = P(X1i = 0, X2i = 0; θ) = θ00
E[(1 − X1i)(1 − X2i) | X1i = 1; θ] = 0
E[(1 − X1i)(1 − X2i) | X1i = 0; θ] = P(X1i = 0, X2i = 0; θ)/P(X1i = 0; θ) = θ00/(θ00 + θ01)

etc.


(c) Find the M.L. estimate of the odds ratio

τ = θ00θ11/(θ01θ10).

What is the significance of τ = 1?

2.6 The Information Inequality

Suppose we consider estimating a parameter τ(θ), where θ is a scalar, using an unbiased estimator T(X). Is there any limit to how well an estimator like this can behave? The answer for unbiased estimators is in the affirmative, and a lower bound on the variance is given by the information inequality.

2.6.1 Information Inequality - One Parameter

Suppose T(X) is an unbiased estimator of the parameter τ(θ) in a regular statistical model {f(x; θ); θ ∈ Ω}. Then

Var(T) ≥ [τ′(θ)]²/J(θ).

Equality holds if and only if X has a regular exponential family distribution with natural sufficient statistic T(X).

2.6.2 Proof

Since T is an unbiased estimator of τ(θ),

∫_A T(x) f(x; θ) dx = τ(θ), for all θ ∈ Ω,

where P(X ∈ A; θ) = 1. Since f(x; θ) is a regular model we can take the derivative with respect to θ on both sides and interchange the integral and derivative to obtain:

∫_A T(x) [∂f(x; θ)/∂θ] dx = τ′(θ).

Since E[S(θ; X)] = 0, this can be written as

Cov[T, S(θ; X)] = τ′(θ)

and by the covariance inequality, this implies

Var(T) Var[S(θ; X)] ≥ [τ′(θ)]²     (2.7)

which, upon dividing by J(θ) = Var[S(θ; X)], provides the desired result.

Now suppose we have equality in (2.7). Equality in the covariance inequality obtains if and only if the random variables T and S(θ; X) are linear functions of one another. Therefore, for some (non-random) c1(θ), c2(θ), if equality is achieved,

S(θ; x) = c1(θ)T(x) + c2(θ) for all x ∈ A.

Integrating with respect to θ,

log f(x; θ) = C1(θ)T(x) + C2(θ) + C3(x)

where we note that the constant of integration C3 is constant with respect to changing θ but may depend on x. Therefore,

f(x; θ) = C(θ) exp[C1(θ)T(x)] h(x)

where C(θ) = e^(C2(θ)) and h(x) = e^(C3(x)), which is an exponential family with natural sufficient statistic T(X). ■

The special case of the information inequality that is of most interest is the unbiased estimation of the parameter θ. The above inequality indicates that any unbiased estimator T of θ has variance at least 1/J(θ). The lower bound is achieved only when f(x; θ) is a regular exponential family with natural sufficient statistic T.

Notes:
1. If equality holds then T(X) is called an efficient estimator of τ(θ).
2. The number [τ′(θ)]²/J(θ) is called the Cramer-Rao lower bound (C.R.L.B.).
3. The ratio of the C.R.L.B. to the variance of an unbiased estimator is called the efficiency of the estimator.

2.6.3 Example

Suppose X1, . . . , Xn is a random sample from the POI(θ) distribution. Show that the variance of the U.M.V.U.E. of θ achieves the Cramer-Rao lower bound for unbiased estimators of θ and find the lower bound. What is the U.M.V.U.E. of τ(θ) = θ²? Does the variance of this estimator achieve the Cramer-Rao lower bound for unbiased estimators of θ²? What is the lower bound?

2.6.4 Example

Suppose X1, . . . , Xn is a random sample from the distribution with probability density function

f(x; θ) = θx^(θ−1), 0 < x < 1, θ > 0.

Show that the variance of the U.M.V.U.E. of θ does not achieve the Cramer-Rao lower bound. What is the efficiency of the U.M.V.U.E.?

For some time it was believed that no estimator of θ could have variance smaller than 1/J(θ) at any value of θ, but this was shown to be incorrect by the following example of Hodges.

2.6.5 Problem

Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution and define

T(X) = X̄/2 if |X̄| ≤ n^(−1/4), T(X) = X̄ otherwise.

Show that E(T) ≈ θ, Var(T) ≈ 1/n if θ ≠ 0, and Var(T) ≈ 1/(4n) if θ = 0.

Show that the Cramer-Rao lower bound for estimating θ is equal to 1/n.

This example indicates that it is possible to achieve variance smaller than 1/J(θ) at one or more values of θ. It has been proved that this is the exception. In fact the set of θ for which the variance of an estimator is less than 1/J(θ) has measure 0, which means, for example, that it may be a finite set or perhaps a countable set, but it cannot contain a non-degenerate interval of values of θ.

2.6.6 Problem

For each of the following determine whether the variance of the U.M.V.U.E. of θ based on a random sample X1, . . . , Xn achieves the Cramer-Rao lower bound. In each case determine the Cramer-Rao lower bound and find the efficiency of the U.M.V.U.E.
(a) N(θ, 4)
(b) Bernoulli(θ)
(c) N(0, θ²)
(d) N(0, θ).


2.6.7 Problem

Find examples of the following phenomena in a regular statistical model.
(a) No unbiased estimator of τ(θ) exists.
(b) An unbiased estimator of τ(θ) exists but there is no U.M.V.U.E.
(c) A U.M.V.U.E. of τ(θ) exists but its variance is strictly greater than the Cramer-Rao lower bound.
(d) A U.M.V.U.E. of τ(θ) exists and its variance equals the Cramer-Rao lower bound.

2.6.8 Information Inequality - Multiparameter

The right hand side in the information inequality generalizes naturally to the multiple parameter case in which θ is a vector. For example, if θ = (θ1, . . . , θk)^T, then the Fisher information J(θ) is a k × k matrix. If τ(θ) is any real-valued function of θ then its derivative is a column vector we will denote by D(θ) = (∂τ/∂θ1, . . . , ∂τ/∂θk)^T. Then if T(X) is any unbiased estimator of τ(θ) in a regular model,

Var(T) ≥ [D(θ)]^T [J(θ)]^(−1) D(θ) for all θ ∈ Ω.

2.6.9 Example

Let X1, . . . , Xn be a random sample from the N(μ, σ²) distribution. Find the U.M.V.U.E. of σ and determine whether the U.M.V.U.E. is an efficient estimator of σ. What happens as n → ∞? Hint:

Γ(k + a)/Γ(k + b) = k^(a−b) [1 + (a + b − 1)(a − b)/(2k) + O(1/k²)] as k → ∞.

2.6.10 Problem

Let X1, . . . , Xn be a random sample from the N(μ, σ²) distribution. Find the U.M.V.U.E. of μ/σ and determine whether the U.M.V.U.E. is an efficient estimator. What happens as n → ∞?

2.6.11 Problem

Let X1, . . . , Xn be a random sample from the GAM(α, β) distribution. Find the U.M.V.U.E. of E(Xi; α, β) = αβ and determine whether the U.M.V.U.E. is an efficient estimator.


2.6.12 Problem

Consider the model in Problem 1.7.26.

(a) Find the M.L. estimators of μ and σ² using the result from Problem 2.4.7. You do not need to verify that your answer corresponds to a maximum. Compare the M.L. estimators with the U.M.V.U.E.'s of μ and σ².

(b) Find the observed information matrix and the Fisher information.

(c) Determine if the U.M.V.U.E.'s of μ and σ² are efficient estimators.

2.7 Asymptotic Properties of M.L. Estimators - One Parameter

One of the more successful attempts at justifying estimators and demonstrating some form of optimality has been through large sample theory or the asymptotic behaviour of estimators as the sample size n → ∞. One of the first properties one requires is consistency of an estimator. This means that the estimator converges to the true value of the parameter as the sample size (and hence the information) approaches infinity.

2.7.1 Definition

Consider a sequence of estimators Tn where the subscript n indicates that the estimator has been obtained from data X1, . . . , Xn with sample size n. Then the sequence is said to be a consistent sequence of estimators of τ(θ) if Tn →p τ(θ) for all θ ∈ Ω.

It is worth a reminder at this point that probability (density) functions are used to produce probabilities and are only unique up to a point. For example, if two probability density functions f(x) and g(x) were such that they produced the same probabilities, or the same cumulative distribution function, for example,

∫_{−∞}^x f(z) dz = ∫_{−∞}^x g(z) dz

for all x, then we would not consider them distinct probability densities, even though f(x) and g(x) may differ at one or more values of x. Now when we parameterize a given statistical model using θ as the parameter, it is natural to do so in such a way that different values of the parameter lead to distinct probability (density) functions. This means, for example, that


the cumulative distribution functions associated with these densities are distinct. Without this assumption it would be impossible to accurately estimate the parameter since two different parameters could lead to the same cumulative distribution function and hence exactly the same behaviour of the observations. Therefore we assume:

(R7) The probability (density) functions corresponding to different values of the parameters are distinct, that is, θ ≠ θ* implies f(x; θ) ≠ f(x; θ*).

This assumption together with assumptions (R1)−(R6) (see 2.3.1) provides sufficient conditions for the theorems given in this section.

2.7.2 Theorem - Consistency of the M.L. Estimator (Regular Model)

Suppose X1, . . . , Xn is a random sample from a model {f(x; θ); θ ∈ Ω} satisfying regularity conditions (R1)−(R7). Then with probability tending to 1 as n → ∞, the likelihood equation or score equation

Σ_{i=1}^n (∂/∂θ) log f(Xi; θ) = 0

has a root θ̂n such that θ̂n converges in probability to θ0, the true value of the parameter, as n → ∞.

The proof of this theorem is given in Section 5.4.9 of the Appendix.

The likelihood equation does not always have a unique root as the following problem illustrates.

2.7.3 Problem

Indicate whether or not the likelihood equation based on X1, . . . , Xn has a unique root in each of the cases below:
(a) LOG(1, θ)
(b) WEI(1, θ)
(c) CAU(1, θ)

The consistency of the M.L. estimator is one indication that it performs reasonably well. However, it provides no reason to prefer it to some other consistent estimator. The following result indicates that M.L. estimators perform as well as any reasonable estimator can, at least in the limit as n → ∞.


2.7.4 Theorem - Asymptotic Distribution of the M.L. Estimator (Regular Model)

Suppose (R1)−(R7) hold. Suppose θ̂n is a consistent root of the likelihood equation as in Theorem 2.7.2. Then

√(J(θ0)) (θ̂n − θ0) →D Z ∼ N(0, 1)

where θ0 is the true value of the parameter.

The proof of this theorem is given in Section 5.4.10 of the Appendix.

Note: Since J(θ) is the Fisher expected information based on a random sample from the model {f(x; θ); θ ∈ Ω},

J(θ) = E[−Σ_{i=1}^n (∂²/∂θ²) log f(Xi; θ); θ] = nE[−(∂²/∂θ²) log f(X; θ); θ]

where X has probability (density) function f(x; θ).

This theorem implies that for a regular model and sufficiently large n, θ̂n has an approximately normal distribution with mean θ0 and variance [J(θ0)]^(−1). [J(θ0)]^(−1) is called the asymptotic variance of θ̂n. This theorem also asserts that θ̂n is asymptotically unbiased and its asymptotic variance approaches the Cramer-Rao lower bound for unbiased estimators of θ.

By the Limiting Theorems it also follows that

[τ(θ̂n) − τ(θ0)] / √([τ′(θ0)]²/J(θ0)) →D Z ∼ N(0, 1).

Compare this result with the Information Inequality.
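Theorem 2.7.4 can be illustrated by simulation. The sketch below is not part of the notes: it uses the density f(x; θ) = θx^(θ−1), 0 < x < 1, which appears in several examples, together with the inverse-c.d.f. fact (my assumption for the sketch) that U^(1/θ) has this density when U ∼ UNIF(0, 1), and the standard results θ̂n = −n/Σ log Xi and J(θ) = n/θ².

    # A simulation sketch of Theorem 2.7.4 for the model f(x; theta) = theta*x^(theta-1).
    import numpy as np

    rng = np.random.default_rng(2)
    n, theta0, reps = 40, 2.0, 10000
    u = rng.uniform(size=(reps, n))
    x = u ** (1.0 / theta0)                       # X = U^(1/theta0) has the assumed density
    theta_hat = -n / np.log(x).sum(axis=1)        # M.L. estimator for each simulated sample

    z = np.sqrt(n / theta0**2) * (theta_hat - theta0)     # sqrt(J(theta0)) * (theta_hat - theta0)
    print("mean of Z:", z.mean(), "  var of Z:", z.var()) # should be roughly 0 and 1 for large n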

2.7.5 Definition

Suppose X1, . . . , Xn is a random sample from a regular statistical model {f(x; θ); θ ∈ Ω}. Suppose also that Tn = Tn(X1, . . . , Xn) is asymptotically normal with mean θ and variance σ_T²/n. The asymptotic efficiency of Tn is defined to be

{σ_T² · E[−∂² log f(X; θ)/∂θ²; θ]}^(−1)

where X has probability (density) function f(x; θ).


2.7.6 Problem

Suppose X1, . . . , Xn is a random sample from a distribution with continuous probability density function f(x; θ) and cumulative distribution function F(x; θ) where θ is the median of the distribution. Suppose also that f(x; θ) is continuous at x = θ. The sample median Tn = med(X1, . . . , Xn) is a possible estimator of θ.
(a) Find the probability density function of the median if n = 2m + 1 is odd.
(b) Prove

√n (Tn − θ) →D T ∼ N(0, 1/(4[f(θ; θ)]²)).

(c) If X1, . . . , Xn is a random sample from the N(θ, 1) distribution find the asymptotic efficiency of Tn.
(d) If X1, . . . , Xn is a random sample from the CAU(1, θ) distribution find the asymptotic efficiency of Tn.

2.8 Interval Estimators

2.8.1 Definition

Suppose X is a random variable whose distribution depends on θ. Suppose that A(x) and B(x) are functions such that A(x) ≤ B(x) for all x in the support of X and θ ∈ Ω. Let x be the observed data. Then (A(x), B(x)) is an interval estimate for θ. The interval (A(X), B(X)) is an interval estimator for θ.

Likelihood intervals are one type of interval estimator. Confidence intervals are another type of interval estimator.

We now consider a general approach for constructing confidence intervals based on pivotal quantities.

2.8.2 Definition

Suppose X is a random variable whose distribution depends on θ. The random variable Q(X; θ) is called a pivotal quantity if the distribution of Q does not depend on θ. Q(X; θ) is called an asymptotic pivotal quantity if the limiting distribution of Q as n → ∞ does not depend on θ.

For example, for a random sample X1, . . . , Xn from a N(θ, σ²) distribution where σ² is known, the statistic

T = √n (X̄ − θ)/σ

is a pivotal quantity whose distribution does not depend on θ. If X1, . . . , Xn is a random sample from a distribution, not necessarily normal, having mean θ and known variance σ², then the asymptotic distribution of T is N(0, 1) by the C.L.T. and T is an asymptotic pivotal quantity.

2.8.3 Definition

Suppose A(X) and B(X) are statistics. If P[A(X) < θ < B(X)] = p, 0 < p < 1, then (a(x), b(x)) is called a 100p% confidence interval (C.I.) for θ.

Pivotal quantities can be used for constructing C.I.'s in the following way. Since the distribution of Q(X; θ) is known we can write down a probability statement of the form

P (q1 ≤ Q(X; θ) ≤ q2) = p.

If Q is a monotone function of θ then this statement can be rewritten as

P [A(X) ≤ θ ≤ B(X)] = p

and the interval [a(x), b(x)] is a 100p% C.I.

The following theorem gives the pivotal quantity in the case in which θ is either a location parameter or a scale parameter.

2.8.4 Theorem

Let X = (X1, . . . , Xn) be a random sample from the model {f(x; θ); θ ∈ Ω} and let θ̂ = θ̂(X) be the M.L. estimator of the scalar parameter θ based on X.

(1) If θ is a location parameter then Q = Q(X) = θ̂ − θ is a pivotal quantity.
(2) If θ is a scale parameter then Q = Q(X) = θ̂/θ is a pivotal quantity.

2.8.5 Asymptotic Pivotal Quantities and Approximate Confidence Intervals

In cases in which an exact pivotal quantity cannot be constructed we can use the limiting distribution of $\hat{\theta}_n$ to construct approximate C.I.'s. Since
$$[J(\hat{\theta}_n)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1)$$
then $[J(\hat{\theta}_n)]^{1/2}(\hat{\theta}_n - \theta_0)$ is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is given by
$$\left[\hat{\theta}_n - a[J(\hat{\theta}_n)]^{-1/2},\ \hat{\theta}_n + a[J(\hat{\theta}_n)]^{-1/2}\right]$$
where $\hat{\theta}_n = \hat{\theta}_n(x_1,\ldots,x_n)$ is the M.L. estimate of θ, $P(-a < Z < a) = p$ and $Z \sim N(0,1)$.

Similarly since
$$[I(\hat{\theta}_n;X)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1)$$
where $X = (X_1,\ldots,X_n)$, then $[I(\hat{\theta}_n;X)]^{1/2}(\hat{\theta}_n - \theta_0)$ is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is given by
$$\left[\hat{\theta}_n - a[I(\hat{\theta}_n)]^{-1/2},\ \hat{\theta}_n + a[I(\hat{\theta}_n)]^{-1/2}\right]$$
where $I(\hat{\theta}_n)$ is the observed information.

Finally since
$$-2\log R(\theta_0;X) \rightarrow_D W \sim \chi^2(1)$$
then $-2\log R(\theta_0;X)$ is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is
$$\{\theta : -2\log R(\theta;x) \leq b\}$$
where $x = (x_1,\ldots,x_n)$ are the observed data, $P(W \leq b) = p$ and $W \sim \chi^2(1)$. Usually this interval must be calculated numerically.

Since
$$\frac{\tau(\hat{\theta}_n) - \tau(\theta_0)}{\sqrt{[\tau'(\hat{\theta}_n)]^2 / J(\hat{\theta}_n)}} \rightarrow_D Z \sim N(0,1)$$
an approximate 100p% C.I. for τ(θ) is given by
$$\left[\tau(\hat{\theta}_n) - a\left\{[\tau'(\hat{\theta}_n)]^2/J(\hat{\theta}_n)\right\}^{1/2},\ \tau(\hat{\theta}_n) + a\left\{[\tau'(\hat{\theta}_n)]^2/J(\hat{\theta}_n)\right\}^{1/2}\right]$$
where $P(-a < Z < a) = p$ and $Z \sim N(0,1)$.


2.8.6 Likelihood Intervals and Approximate Confidence Intervals

A 15% L.I. for θ is given by $\{\theta : R(\theta;x) \geq 0.15\}$. Since

$$-2\log R(\theta_0;X) \rightarrow_D W \sim \chi^2(1)$$
we have
$$\begin{aligned}
P[R(\theta;X) \geq 0.15] &= P[-2\log R(\theta;X) \leq -2\log(0.15)] \\
&= P[-2\log R(\theta;X) \leq 3.79] \\
&\approx P(W \leq 3.79) = P(Z^2 \leq 3.79) \quad \text{where } Z \sim N(0,1) \\
&\approx P(-1.95 \leq Z \leq 1.95) \\
&\approx 0.95
\end{aligned}$$
and therefore a 15% L.I. is an approximate 95% C.I. for θ.
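This correspondence is easy to check numerically. The following short Python sketch (using scipy, an illustrative choice) evaluates the chi-squared probabilities used above:

import numpy as np
from scipy.stats import chi2

b = -2.0 * np.log(0.15)
print(b)                                   # about 3.79
print(chi2.cdf(b, df=1))                   # about 0.948, i.e. roughly 0.95

# the likelihood level matching an exact 95% chi-squared cut-off:
print(np.exp(-chi2.ppf(0.95, df=1) / 2))   # about 0.147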

2.8.7 Example

Suppose $X_1,\ldots,X_n$ is a random sample from the distribution with probability density function
$$f(x;\theta) = \theta x^{\theta-1}, \quad 0 < x < 1, \quad \theta > 0.$$
The likelihood function for observations $x_1,\ldots,x_n$ is
$$L(\theta) = \prod_{i=1}^n \theta x_i^{\theta-1} = \theta^n\left(\prod_{i=1}^n x_i\right)^{\theta-1}, \quad \theta > 0.$$
The log likelihood and score function are
$$l(\theta) = n\log\theta + (\theta - 1)\sum_{i=1}^n\log x_i, \quad \theta > 0$$
$$S(\theta) = \frac{n}{\theta} + t = \frac{1}{\theta}(n + t\theta)$$
where $t = \sum_{i=1}^n\log x_i$. Since
$$S(\theta) > 0 \text{ for } 0 < \theta < -n/t \quad \text{and} \quad S(\theta) < 0 \text{ for } \theta > -n/t,$$
by the First Derivative Test $\hat{\theta} = -n/t$ is the M.L. estimate of θ. The M.L. estimator of θ is $\hat{\theta} = -n/T$ where $T = \sum_{i=1}^n\log X_i$.


The information function is

$$I(\theta) = \frac{n}{\theta^2}, \quad \theta > 0$$
and the Fisher information is
$$J(\theta) = E[I(\theta;X)] = E\left(\frac{n}{\theta^2}\right) = \frac{n}{\theta^2}.$$

By the W.L.L.N.
$$-\frac{T}{n} = -\frac{1}{n}\sum_{i=1}^n\log X_i \rightarrow_p E(-\log X_i;\theta_0) = \frac{1}{\theta_0}$$
and by the Limit Theorems
$$\hat{\theta}_n = -\frac{n}{T} \rightarrow_p \theta_0$$
and thus $\hat{\theta}_n$ is a consistent estimator of $\theta_0$.

By Theorem 2.7.4
$$\sqrt{J(\theta_0)}\,(\hat{\theta}_n - \theta_0) = \sqrt{\frac{n}{\theta_0^2}}\,(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1). \tag{2.8}$$

The asymptotic variance of $\hat{\theta}_n$ is equal to $\theta_0^2/n$ whereas the actual variance of $\hat{\theta}_n$ is
$$Var(\hat{\theta}_n) = \frac{n^2\theta_0^2}{(n-1)^2(n-2)}.$$
This can be shown using the fact that $-\log X_i \sim$ EXP$(1/\theta_0)$, $i = 1,\ldots,n$ independently, which means
$$-T = -\sum_{i=1}^n\log X_i \sim \text{GAM}\left(n, \frac{1}{\theta_0}\right) \tag{2.9}$$
and then using the result from Problem 1.3.4. Therefore the asymptotic variance and the actual variance of $\hat{\theta}_n$ are not identical but are close in value for large n.

An approximate 95% C.I. for θ based on
$$\sqrt{J(\hat{\theta}_n)}\,(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1)$$
is given by
$$\left[\hat{\theta}_n - 1.96/\sqrt{J(\hat{\theta}_n)},\ \hat{\theta}_n + 1.96/\sqrt{J(\hat{\theta}_n)}\right] = \left[\hat{\theta}_n - 1.96\,\hat{\theta}_n/\sqrt{n},\ \hat{\theta}_n + 1.96\,\hat{\theta}_n/\sqrt{n}\right].$$
Note that the width of the C.I., which is equal to $2(1.96)\hat{\theta}_n/\sqrt{n}$, decreases as $1/\sqrt{n}$.

An exact C.I. for θ can be obtained in this case since
$$-T\theta = \frac{n\theta}{\hat{\theta}} \sim \text{GAM}(n, 1)$$
and therefore $n\theta/\hat{\theta}$ is a pivotal quantity. Since
$$-2T\theta = \frac{2n\theta}{\hat{\theta}} \sim \chi^2(2n)$$
we can use values from the chi-squared tables. From the chi-squared tables we find values a and b such that
$$P(a \leq W \leq b) = 0.95 \quad \text{where } W \sim \chi^2(2n).$$
Then
$$P\left(a \leq \frac{2n\theta}{\hat{\theta}} \leq b\right) = 0.95$$
or
$$P\left(\frac{a\hat{\theta}}{2n} \leq \theta \leq \frac{b\hat{\theta}}{2n}\right) = 0.95$$
and a 95% C.I. for θ is
$$\left[\frac{a\hat{\theta}}{2n},\ \frac{b\hat{\theta}}{2n}\right].$$
If we choose
$$P(W \leq a) = \frac{1 - 0.95}{2} = 0.025 = P(W \geq b)$$
then we obtain an “equal-tail” C.I. for θ. This is not the narrowest C.I. but it is easier to obtain than the narrowest C.I. How would you obtain the narrowest C.I.?
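As a numerical illustration, the approximate and exact equal-tail intervals for this example can be compared with a short Python sketch; the seed, the true value of θ and the sample size are arbitrary assumptions made only to generate data.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(450)
theta_true, n = 2.0, 30
x = rng.uniform(size=n) ** (1.0 / theta_true)    # inverse c.d.f. method: F(x) = x^theta

t = np.sum(np.log(x))
theta_hat = -n / t                               # M.L. estimate

# approximate 95% C.I. based on sqrt(J(theta_hat)) = sqrt(n)/theta_hat
approx = (theta_hat - 1.96 * theta_hat / np.sqrt(n),
          theta_hat + 1.96 * theta_hat / np.sqrt(n))

# exact equal-tail 95% C.I. based on 2*n*theta/theta_hat ~ chi-squared(2n)
a, b = chi2.ppf(0.025, 2 * n), chi2.ppf(0.975, 2 * n)
exact = (a * theta_hat / (2 * n), b * theta_hat / (2 * n))

print("theta_hat  :", theta_hat)
print("approximate:", approx)
print("exact      :", exact)

For moderate n the two intervals are close, as the discussion above suggests.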

2.8.8 Example

Suppose $X_1,\ldots,X_n$ is a random sample from the POI(θ) distribution. The parameter θ is neither a location nor a scale parameter. The M.L. estimator of θ and the Fisher information are
$$\hat{\theta}_n = \bar{X}_n \quad \text{and} \quad J(\theta) = \frac{n}{\theta}.$$


By Theorem 2.7.4
$$\sqrt{J(\theta_0)}\,(\hat{\theta}_n - \theta_0) = \sqrt{\frac{n}{\theta_0}}\,(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1). \tag{2.10}$$
By the C.L.T.
$$\frac{\bar{X}_n - \theta_0}{\sqrt{\theta_0/n}} \rightarrow_D Z \sim N(0,1)$$
which is the same result.

The asymptotic variance of $\hat{\theta}_n$ is equal to $\theta_0/n$. Since the actual variance of $\hat{\theta}_n$ is
$$Var(\hat{\theta}_n) = Var(\bar{X}_n) = \frac{\theta_0}{n}$$
the asymptotic variance and the actual variance of $\hat{\theta}_n$ are identical in this case.

An approximate 95% C.I. for θ based on
$$\sqrt{J(\hat{\theta}_n)}\,(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim N(0,1)$$
is given by
$$\left[\hat{\theta}_n - 1.96/\sqrt{J(\hat{\theta}_n)},\ \hat{\theta}_n + 1.96/\sqrt{J(\hat{\theta}_n)}\right] = \left[\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n},\ \hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n}\right].$$

An approximate 95% C.I. for $\tau(\theta) = e^{-\theta}$ can be based on the asymptotic pivotal quantity
$$\frac{\tau(\hat{\theta}_n) - \tau(\theta_0)}{\sqrt{[\tau'(\hat{\theta}_n)]^2 / J(\hat{\theta}_n)}} \rightarrow_D Z \sim N(0,1).$$
For $\tau(\theta) = e^{-\theta} = P(X_1 = 0;\theta)$, $\tau'(\theta) = \frac{d}{d\theta}(e^{-\theta}) = -e^{-\theta}$ and the approximate 95% C.I. is given by
$$\left[\tau(\hat{\theta}_n) - 1.96\left\{[\tau'(\hat{\theta}_n)]^2/J(\hat{\theta}_n)\right\}^{1/2},\ \tau(\hat{\theta}_n) + 1.96\left\{[\tau'(\hat{\theta}_n)]^2/J(\hat{\theta}_n)\right\}^{1/2}\right]$$
$$= \left[e^{-\hat{\theta}_n} - 1.96\,e^{-\hat{\theta}_n}\sqrt{\hat{\theta}_n/n},\ e^{-\hat{\theta}_n} + 1.96\,e^{-\hat{\theta}_n}\sqrt{\hat{\theta}_n/n}\right] \tag{2.11}$$
which is symmetric about the M.L. estimate $\tau(\hat{\theta}_n) = e^{-\hat{\theta}_n}$.


Alternatively since
$$\begin{aligned}
0.95 &\approx P\left(\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n} \leq \theta \leq \hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n}\right) \\
&= P\left(-\hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n} \geq -\theta \geq -\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n}\right) \\
&= P\left(\exp\left(-\hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n}\right) \geq e^{-\theta} \geq \exp\left(-\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n}\right)\right) \\
&= P\left(\exp\left(-\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n}\right) \leq e^{-\theta} \leq \exp\left(-\hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n}\right)\right)
\end{aligned}$$
therefore
$$\left[\exp\left(-\hat{\theta}_n - 1.96\sqrt{\hat{\theta}_n/n}\right),\ \exp\left(-\hat{\theta}_n + 1.96\sqrt{\hat{\theta}_n/n}\right)\right] \tag{2.12}$$
is also an approximate 95% C.I. for τ(θ).

If n = 20 and $\hat{\theta}_n = 3$ then the C.I. (2.11) is equal to [0.012, 0.0876] while the C.I. (2.12) is equal to [0.0233, 0.1064].
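The two intervals quoted above can be reproduced with a few lines of Python; this is only a sketch of the arithmetic, with the values n = 20 and $\hat{\theta}_n = 3$ taken from the text.

import numpy as np

n, theta_hat = 20, 3.0
se_theta = np.sqrt(theta_hat / n)          # [J(theta_hat)]^(-1/2)
tau_hat = np.exp(-theta_hat)

# interval (2.11): delta method applied directly to tau
half = 1.96 * tau_hat * se_theta
print("(2.11):", (tau_hat - half, tau_hat + half))    # about [0.012, 0.0876]

# interval (2.12): transform the endpoints of the C.I. for theta
lo, hi = theta_hat - 1.96 * se_theta, theta_hat + 1.96 * se_theta
print("(2.12):", (np.exp(-hi), np.exp(-lo)))          # about [0.0233, 0.1064]

Interval (2.12) respects the range of τ(θ) and is not symmetric about the M.L. estimate, which explains why the two intervals differ.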

2.8.9 Example

Suppose $X_1,\ldots,X_n$ is a random sample from the EXP(1, θ) distribution with probability density function
$$f(x;\theta) = e^{-(x-\theta)}, \quad x \geq \theta.$$
The likelihood function for observations $x_1,\ldots,x_n$ is
$$L(\theta) = \prod_{i=1}^n e^{-(x_i-\theta)} \quad \text{if } x_i \geq \theta > -\infty,\ i = 1,\ldots,n$$
$$= \exp\left(-\sum_{i=1}^n x_i\right)e^{n\theta} \quad \text{if } -\infty < \theta \leq x_{(1)}$$
and L(θ) is equal to 0 if $\theta > x_{(1)}$. To maximize this function of θ we note that we want to make the term $e^{n\theta}$ as large as possible subject to $\theta \leq x_{(1)}$, which implies that $\hat{\theta}_n = x_{(1)}$ is the M.L. estimate and $\hat{\theta}_n = X_{(1)}$ is the M.L. estimator of θ.

Since the support of $X_i$ depends on the unknown parameter θ, the model is not a regular model. This means that Theorems 2.7.2 and 2.7.4 cannot be used to determine the asymptotic properties of $\hat{\theta}_n$. Since
$$P(\hat{\theta}_n \leq x;\theta_0) = 1 - \prod_{i=1}^n P(X_i > x;\theta_0) = 1 - e^{-n(x-\theta_0)}, \quad x \geq \theta_0$$
then $\hat{\theta}_n \sim \text{EXP}\left(\frac{1}{n},\theta_0\right)$. Therefore
$$\lim_{n\to\infty}E(\hat{\theta}_n) = \lim_{n\to\infty}\left(\theta_0 + \frac{1}{n}\right) = \theta_0$$
and
$$\lim_{n\to\infty}Var(\hat{\theta}_n) = \lim_{n\to\infty}\left(\frac{1}{n}\right)^2 = 0$$
so by Theorem 5.3.8, $\hat{\theta}_n \rightarrow_p \theta_0$ and $\hat{\theta}_n$ is a consistent estimator.

Since
$$P\left[n(\hat{\theta}_n - \theta_0) \leq t;\theta_0\right] = P\left(\hat{\theta}_n \leq \frac{t}{n} + \theta_0;\theta_0\right) = 1 - e^{-n(t/n + \theta_0 - \theta_0)} = 1 - e^{-t}, \quad t \geq 0$$
for n = 1, 2, …, therefore $n(\hat{\theta}_n - \theta_0) \sim \text{EXP}(1)$ for n = 1, 2, … and therefore the asymptotic distribution of $n(\hat{\theta}_n - \theta_0)$ is also EXP(1). Since we know the exact distribution of $\hat{\theta}_n$ for n = 1, 2, …, the asymptotic distribution is not needed for obtaining C.I.'s.

For this model the parameter θ is a location parameter and therefore $(\hat{\theta} - \theta)$ is a pivotal quantity and in particular
$$P(\hat{\theta} - \theta \leq t;\theta) = P(\hat{\theta} \leq t + \theta;\theta) = 1 - e^{-nt}, \quad t \geq 0.$$
Since $(\hat{\theta} - \theta)$ is a pivotal quantity, a C.I. for θ would take the form $[\hat{\theta} - b,\ \hat{\theta} - a]$ where 0 ≤ a ≤ b. Unless a = 0 this interval would not contain the M.L. estimate $\hat{\theta}$ and therefore a “one-tail” C.I. makes sense in this case. To obtain a 95% “one-tail” C.I. for θ we solve
$$0.95 = P(\hat{\theta} - b \leq \theta \leq \hat{\theta};\theta) = P(0 \leq \hat{\theta} - \theta \leq b;\theta) = 1 - e^{-nb}$$
which gives
$$b = -\frac{1}{n}\log(0.05) = \frac{\log 20}{n}.$$
Therefore
$$\left[\hat{\theta} - \frac{\log 20}{n},\ \hat{\theta}\right]$$
is a 95% “one-tail” C.I. for θ.
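A small Python sketch illustrates this interval and its coverage; the seed, the true θ and the sample size are arbitrary choices made only so that the code runs.

import numpy as np

rng = np.random.default_rng(850)
theta_true, n = 5.0, 25
x = theta_true + rng.exponential(scale=1.0, size=n)

theta_hat = x.min()                                   # M.L. estimate X_(1)
ci = (theta_hat - np.log(20.0) / n, theta_hat)
print("M.L. estimate:", theta_hat)
print("95% one-tail C.I.:", ci)

# coverage check over repeated samples (should be close to 0.95)
mins = theta_true + rng.exponential(1.0, size=(10000, n)).min(axis=1)
covered = (mins - np.log(20.0) / n <= theta_true) & (theta_true <= mins)
print("simulated coverage:", covered.mean())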


2.8.10 Problem

Let X1, . . . ,Xn be a random sample from the distribution with p.d.f.

$$f(x;\beta) = \frac{2x}{\beta^2}, \quad 0 < x \leq \beta.$$

(a) Find the likelihood function of β and the M.L. estimator of β.

(b) Find the M.L. estimator of E (X;β) where X has p.d.f. f (x;β).

(c) Show that the M.L. estimator of β is a consistent estimator of β.

(d) If n = 15 and $x_{(15)} = 0.99$, find the M.L. estimate of β. Plot the relative likelihood function for β and find 10% and 50% likelihood intervals for β.

(e) If n = 15 and $x_{(15)} = 0.99$, construct an exact 95% one-tail C.I. for β.

2.8.11 Problem

Suppose X1, . . . ,Xn is a random sample from the UNIF(0, θ) distribution.

Show that the M.L. estimator $\hat{\theta}_n$ is a consistent estimator of θ. How would you construct a C.I. for θ?

2.8.12 Problem

A certain type of electronic equipment is susceptible to instantaneous failure at any time. Components do not deteriorate significantly with age and the distribution of the lifetime is the EXP(θ) density. Ten components were tested independently with the observed lifetimes, to the nearest day, given by 70 11 66 5 20 4 35 40 29 8.

(a) Find the M.L. estimate of θ and verify that it corresponds to a local maximum. Find the Fisher information and calculate an approximate 95% C.I. for θ based on the asymptotic distribution of $\hat{\theta}$. Compare this with an exact 95% C.I. for θ.

(b) The estimate in (a) ignores the fact that the data were rounded to the nearest day. Find the exact likelihood function based on the fact that the probability of observing a lifetime of i days is given by
$$g(i;\theta) = \int_{i-0.5}^{i+0.5}\frac{1}{\theta}e^{-x/\theta}\,dx, \quad i = 1, 2, \ldots \quad \text{and} \quad g(0;\theta) = \int_0^{0.5}\frac{1}{\theta}e^{-x/\theta}\,dx.$$
Obtain the M.L. estimate of θ and verify that it corresponds to a local maximum. Find the Fisher information and calculate an approximate 95% C.I. for θ. Compare these results with those in (a).


2.8.13 Problem

The number of calls to a switchboard per minute is thought to have a POI(θ) distribution. However, because there are only two lines available, we are only able to record whether the number of calls is 0, 1, or ≥ 2. For 50 one-minute intervals the observed data were: 25 intervals with 0 calls, 16 intervals with 1 call and 9 intervals with ≥ 2 calls.

(a) Find the M.L. estimate of θ.

(b) By computing the Fisher information both for this problem and for one with full information, that is, one in which all of the values of $X_1,\ldots,X_{50}$ had been recorded, determine how much information was lost by the fact that we were only able to record the number of times X > 1 rather than the value of these X's. How much difference does this make to the asymptotic variance of the M.L. estimator?

2.8.14 Problem

Let $X_1,\ldots,X_n$ be a random sample from a UNIF(θ, 2θ) distribution. Show that the M.L. estimator $\hat{\theta}$ is a consistent estimator of θ. What is the minimal sufficient statistic for this model? Show that $\tilde{\theta} = \frac{5}{14}X_{(n)} + \frac{2}{7}X_{(1)}$ is a consistent estimator of θ which has smaller M.S.E. than $\hat{\theta}$.

2.9 Asymptotic Properties of M.L. Estimators - Multiparameter

Under similar regularity conditions to the univariate case, the conclusion of Theorem 2.7.2 holds in the multiparameter case $\theta = (\theta_1,\ldots,\theta_k)^T$, that is, each component of $\hat{\theta}_n$ converges in probability to the corresponding component of $\theta_0$. Similarly, Theorem 2.7.4 remains valid with little modification:
$$[J(\theta_0)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim \text{MVN}(0_k, I_k)$$
where $0_k$ is a k×1 vector of zeros and $I_k$ is the k×k identity matrix. Therefore for a regular model and sufficiently large n, $\hat{\theta}_n$ has approximately a multivariate normal distribution with mean vector $\theta_0$ and variance/covariance matrix $[J(\theta_0)]^{-1}$.

Consider the reparameterization
$$\tau_j = \tau_j(\theta), \quad j = 1,\ldots,m \leq k.$$
It follows that
$$\left\{[D(\theta_0)]^T[J(\theta_0)]^{-1}D(\theta_0)\right\}^{-1/2}[\tau(\hat{\theta}_n) - \tau(\theta_0)] \rightarrow_D Z \sim \text{MVN}(0_m, I_m)$$
where $\tau(\theta) = (\tau_1(\theta),\ldots,\tau_m(\theta))^T$ and D(θ) is a k×m matrix with (i, j) element equal to $\partial\tau_j/\partial\theta_i$.

2.9.1 Definition

A 100p% confidence region for the vector θ based on $X = (X_1,\ldots,X_n)$ is a region $R(X) \subset \Re^k$ which satisfies
$$P(\theta \in R(X);\theta) = p.$$

2.9.2 Asymptotic Pivotal Quantities and Approximate Confidence Regions

Since
$$[J(\hat{\theta}_n)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim \text{MVN}(0_k, I_k)$$
it follows that
$$(\hat{\theta}_n - \theta_0)^T J(\hat{\theta}_n)(\hat{\theta}_n - \theta_0) \rightarrow_D W \sim \chi^2(k)$$
and an approximate 100p% confidence region for θ based on this asymptotic pivotal quantity is the set of all θ vectors in the set
$$\{\theta : (\hat{\theta}_n - \theta)^T J(\hat{\theta}_n)(\hat{\theta}_n - \theta) \leq b\}$$
where $\hat{\theta}_n = \hat{\theta}_n(x_1,\ldots,x_n)$ is the M.L. estimate of θ and b is the value such that $P(W \leq b) = p$ where $W \sim \chi^2(k)$.

Similarly since
$$[I(\hat{\theta}_n;X)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim \text{MVN}(0_k, I_k)$$
it follows that
$$(\hat{\theta}_n - \theta_0)^T I(\hat{\theta}_n;X)(\hat{\theta}_n - \theta_0) \rightarrow_D W \sim \chi^2(k)$$
where $X = (X_1,\ldots,X_n)$. An approximate 100p% confidence region for θ based on this asymptotic pivotal quantity is the set of all θ vectors in the set
$$\{\theta : (\hat{\theta}_n - \theta)^T I(\hat{\theta}_n)(\hat{\theta}_n - \theta) \leq b\}$$
where $I(\hat{\theta}_n)$ is the observed information matrix.

Finally since
$$-2\log R(\theta_0;X) \rightarrow_D W \sim \chi^2(k)$$
an approximate 100p% confidence region for θ based on this asymptotic pivotal quantity is the set of all θ vectors in the set
$$\{\theta : -2\log R(\theta;x) \leq b\}$$
where $x = (x_1,\ldots,x_n)$ are the observed data and R(θ;x) is the relative likelihood function. Note that since
$$\{\theta : -2\log R(\theta;x) \leq b\} = \{\theta : R(\theta;x) \geq e^{-b/2}\}$$
this approximate 100p% confidence region is also a $100e^{-b/2}$% likelihood region for θ.

Approximate confidence intervals for a single parameter, say $\theta_i$, from the vector of parameters $\theta = (\theta_1,\ldots,\theta_i,\ldots,\theta_k)^T$ can also be obtained. Since
$$[J(\hat{\theta}_n)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim \text{MVN}(0_k, I_k)$$
it follows that an approximate 100p% C.I. for $\theta_i$ is given by
$$\left[\hat{\theta}_i - a\sqrt{v_{ii}},\ \hat{\theta}_i + a\sqrt{v_{ii}}\right]$$
where $\hat{\theta}_i$ is the M.L. estimate of $\theta_i$, $v_{ii}$ is the (i, i) entry of $[J(\hat{\theta}_n)]^{-1}$ and a is the value such that $P(-a < Z < a) = p$ where $Z \sim N(0,1)$.

Similarly since
$$[I(\hat{\theta}_n;X)]^{1/2}(\hat{\theta}_n - \theta_0) \rightarrow_D Z \sim \text{MVN}(0_k, I_k)$$
it follows that an approximate 100p% C.I. for $\theta_i$ is given by
$$\left[\hat{\theta}_i - a\sqrt{v_{ii}},\ \hat{\theta}_i + a\sqrt{v_{ii}}\right]$$
where $v_{ii}$ is the (i, i) entry of $[I(\hat{\theta}_n)]^{-1}$.

If τ(θ) is a scalar function of θ then
$$\left\{[D(\hat{\theta}_n)]^T[J(\hat{\theta}_n)]^{-1}D(\hat{\theta}_n)\right\}^{-1/2}[\tau(\hat{\theta}_n) - \tau(\theta_0)] \rightarrow_D Z \sim N(0,1)$$
where D(θ) is a k×1 vector with ith element equal to $\partial\tau/\partial\theta_i$. An approximate 100p% C.I. for τ(θ) is given by
$$\left[\tau(\hat{\theta}_n) - a\left\{[D(\hat{\theta}_n)]^T[J(\hat{\theta}_n)]^{-1}D(\hat{\theta}_n)\right\}^{1/2},\ \tau(\hat{\theta}_n) + a\left\{[D(\hat{\theta}_n)]^T[J(\hat{\theta}_n)]^{-1}D(\hat{\theta}_n)\right\}^{1/2}\right]. \tag{2.13}$$


2.9.3 Example

Recall from Example 2.4.17 that for a random sample from the BETA(a, b) distribution the information matrix and the Fisher information matrix are given by
$$I(a,b) = n\begin{bmatrix} \Psi'(a) - \Psi'(a+b) & -\Psi'(a+b) \\ -\Psi'(a+b) & \Psi'(b) - \Psi'(a+b) \end{bmatrix} = J(a,b).$$
Since
$$\begin{bmatrix} \hat{a} - a_0 & \hat{b} - b_0 \end{bmatrix} J(\hat{a},\hat{b}) \begin{bmatrix} \hat{a} - a_0 \\ \hat{b} - b_0 \end{bmatrix} \rightarrow_D W \sim \chi^2(2),$$
an approximate 100p% confidence region for (a, b) is given by
$$\left\{(a,b) : \begin{bmatrix} \hat{a} - a & \hat{b} - b \end{bmatrix} J(\hat{a},\hat{b}) \begin{bmatrix} \hat{a} - a \\ \hat{b} - b \end{bmatrix} \leq c\right\}$$
where $P(W \leq c) = p$. Since $\chi^2(2) = \text{GAM}(1, 2) = \text{EXP}(2)$, c can be determined using
$$p = P(W \leq c) = \int_0^c \frac{1}{2}e^{-x/2}\,dx = 1 - e^{-c/2}$$
which gives
$$c = -2\log(1-p).$$
For p = 0.95, c = −2 log(0.05) = 5.99. An approximate 95% confidence region is given by
$$\left\{(a,b) : \begin{bmatrix} \hat{a} - a & \hat{b} - b \end{bmatrix} J(\hat{a},\hat{b}) \begin{bmatrix} \hat{a} - a \\ \hat{b} - b \end{bmatrix} \leq 5.99\right\}.$$
Let
$$J(\hat{a},\hat{b}) = \begin{bmatrix} \hat{J}_{11} & \hat{J}_{12} \\ \hat{J}_{12} & \hat{J}_{22} \end{bmatrix};$$
then the confidence region can be written as
$$\left\{(a,b) : (\hat{a} - a)^2\hat{J}_{11} + 2(\hat{a} - a)(\hat{b} - b)\hat{J}_{12} + (\hat{b} - b)^2\hat{J}_{22} \leq 5.99\right\}$$
which can be seen to be the points inside and on the ellipse centred at $(\hat{a},\hat{b})$.

For the data in Example 2.4.15, $\hat{a} = 2.7072$, $\hat{b} = 6.7493$ and
$$J(\hat{a},\hat{b}) = \begin{bmatrix} 10.0280 & -3.3461 \\ -3.3461 & 1.4443 \end{bmatrix}.$$



Figure 2.7: Approximate Confidence Regions for Beta(a,b) Example

Approximate 90%, 95% and 99% confidence regions are shown in Figure 2.7. A 10% likelihood region for (a, b) is given by $\{(a,b) : R(a,b;x) \geq 0.1\}$.

Since
$$-2\log R(a_0, b_0;X) \rightarrow_D W \sim \chi^2(2) = \text{EXP}(2)$$
we have
$$\begin{aligned}
P[R(a,b;X) \geq 0.1] &= P[-2\log R(a,b;X) \leq -2\log(0.1)] \\
&\approx P(W \leq -2\log(0.1)) \\
&= 1 - e^{-[-2\log(0.1)]/2} = 1 - 0.1 = 0.9
\end{aligned}$$
and therefore a 10% likelihood region corresponds to an approximate 90% confidence region. Similarly 1% and 5% likelihood regions correspond to approximate 99% and 95% confidence regions respectively. Compare the likelihood regions in Figure 2.6 with the approximate confidence regions shown in Figure 2.7. What do you notice?

Let
$$[J(\hat{a},\hat{b})]^{-1} = \begin{bmatrix} v_{11} & v_{12} \\ v_{12} & v_{22} \end{bmatrix}.$$
Since
$$[J(a,b)]^{1/2}\begin{bmatrix} \hat{a} - a_0 \\ \hat{b} - b_0 \end{bmatrix} \rightarrow_D Z \sim \text{BVN}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$$
then for large n, $Var(\hat{a}) \approx v_{11}$, $Var(\hat{b}) \approx v_{22}$ and $Cov(\hat{a},\hat{b}) \approx v_{12}$. Therefore an approximate 95% C.I. for a is given by
$$\left[\hat{a} - 1.96\sqrt{v_{11}},\ \hat{a} + 1.96\sqrt{v_{11}}\right]$$
and an approximate 95% C.I. for b is given by
$$\left[\hat{b} - 1.96\sqrt{v_{22}},\ \hat{b} + 1.96\sqrt{v_{22}}\right].$$
For the given data $\hat{a} = 2.7072$, $\hat{b} = 6.7493$ and
$$[J(\hat{a},\hat{b})]^{-1} = \begin{bmatrix} 0.4393 & 1.0178 \\ 1.0178 & 3.0503 \end{bmatrix}$$
so the approximate 95% C.I. for a is
$$\left[2.7072 - 1.96\sqrt{0.4393},\ 2.7072 + 1.96\sqrt{0.4393}\right] = [1.4080, 4.0063]$$
and the approximate 95% C.I. for b is
$$\left[6.7493 - 1.96\sqrt{3.0503},\ 6.7493 + 1.96\sqrt{3.0503}\right] = [3.3261, 10.1725].$$
Note that a = 1.5 is in the approximate 95% C.I. for a and b = 8 is in the approximate 95% C.I. for b and yet the point (1.5, 8) is not in the approximate 95% joint confidence region for (a, b). Clearly these marginal C.I.'s for a and b must be used with care.

To obtain an approximate 95% C.I. for
$$\tau(a,b) = E(X;a,b) = \frac{a}{a+b}$$
we use (2.13) with
$$D(a,b) = \begin{bmatrix} \frac{\partial\tau}{\partial a} & \frac{\partial\tau}{\partial b} \end{bmatrix}^T = \begin{bmatrix} \frac{b}{(a+b)^2} & \frac{-a}{(a+b)^2} \end{bmatrix}^T$$
and
$$v = [D(\hat{a},\hat{b})]^T[J(\hat{a},\hat{b})]^{-1}D(\hat{a},\hat{b}) = \begin{bmatrix} \frac{\hat{b}}{(\hat{a}+\hat{b})^2} & \frac{-\hat{a}}{(\hat{a}+\hat{b})^2} \end{bmatrix}\begin{bmatrix} v_{11} & v_{12} \\ v_{12} & v_{22} \end{bmatrix}\begin{bmatrix} \frac{\hat{b}}{(\hat{a}+\hat{b})^2} \\ \frac{-\hat{a}}{(\hat{a}+\hat{b})^2} \end{bmatrix}.$$
For the given data
$$\tau(\hat{a},\hat{b}) = \frac{\hat{a}}{\hat{a}+\hat{b}} = \frac{2.7072}{2.7072 + 6.7493} = 0.28628$$
and
$$v = 0.00064706.$$
The approximate 95% C.I. for τ(a, b) = E(X; a, b) = a/(a+b) is
$$\left[0.28628 - 1.96\sqrt{0.00064706},\ 0.28628 + 1.96\sqrt{0.00064706}\right] = [0.23642, 0.33614].$$
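The interval calculations in this example can be reproduced with a short Python sketch, starting from the Fisher information matrix reported above (numpy is an illustrative choice of tool):

import numpy as np

a_hat, b_hat = 2.7072, 6.7493
J = np.array([[10.0280, -3.3461],
              [-3.3461,  1.4443]])
V = np.linalg.inv(J)                      # estimated variance/covariance matrix

# marginal approximate 95% C.I.'s for a and b
ci_a = (a_hat - 1.96 * np.sqrt(V[0, 0]), a_hat + 1.96 * np.sqrt(V[0, 0]))
ci_b = (b_hat - 1.96 * np.sqrt(V[1, 1]), b_hat + 1.96 * np.sqrt(V[1, 1]))

# delta-method C.I. for tau(a, b) = a/(a + b)
tau_hat = a_hat / (a_hat + b_hat)
D = np.array([b_hat, -a_hat]) / (a_hat + b_hat) ** 2
v = D @ V @ D
ci_tau = (tau_hat - 1.96 * np.sqrt(v), tau_hat + 1.96 * np.sqrt(v))

print(ci_a)                 # about [1.41, 4.01]
print(ci_b)                 # about [3.33, 10.17]
print(tau_hat, v, ci_tau)   # about 0.2863, 0.000647, [0.236, 0.336]

# check whether (a, b) = (1.5, 8) lies in the approximate 95% joint region
d = np.array([a_hat - 1.5, b_hat - 8.0])
print(d @ J @ d)            # well above 5.99, so (1.5, 8) is outside the region

The last line confirms the remark above: each coordinate of (1.5, 8) is inside its marginal interval, yet the point is outside the joint confidence region.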

2.9.4 Problem

In Problem 2.4.10 find an approximate 95% C.I. for Cov(X1,X2; θ1, θ2).

2.9.5 Problem

In Problem 2.4.18 find an approximate 95% joint confidence region for (α, β) and approximate 95% C.I.'s for β and τ(α, β) = E(X; α, β) = αβ.

2.9.6 Problem

In Problem 2.4.19 find approximate 95% C.I.’s for β and E(X;α,β).

2.9.7 Problem

Suppose X1, ...,Xn is a random sample from the EXP(β,μ) distribution.

Show that the M.L. estimators $\hat{\beta}_n$ and $\hat{\mu}_n$ are consistent estimators. How would you construct a joint confidence region for (β, μ)? How would you construct a C.I. for β? How would you construct a C.I. for μ?


2.9.8 Problem

Consider the model in Problem 1.7.26. Explain clearly how you would construct a C.I. for σ² and a C.I. for μ.

2.9.9 Problem

Let X1, . . . ,Xn be a random sample from the distribution with p.d.f.

$$f(x;\alpha,\beta) = \frac{\alpha x^{\alpha-1}}{\beta^\alpha}, \quad 0 < x \leq \beta, \quad \alpha > 0.$$

(a) Find the likelihood function of α and β and the M.L. estimators of α and β.

(b) Show that the M.L. estimator of α is a consistent estimator of α. Show that the M.L. estimator of β is a consistent estimator of β.

(c) If n = 15, $x_{(15)} = 0.99$ and $\sum_{i=1}^{15}\log x_i = -7.7685$, find the M.L. estimates of α and β.

(d) If n = 15, $x_{(15)} = 0.99$ and $\sum_{i=1}^{15}\log x_i = -7.7685$, construct an exact 95% equal-tail C.I. for α and an exact 95% one-tail C.I. for β.

(e) Explain how you would construct a joint likelihood region for α and β. Explain how you would construct a joint confidence region for α and β.

2.9.10 Problem

The following are the results, in millions of revolutions to failure, of endurance tests for 23 deep-groove ball bearings:

17.88  28.92  33.00  41.52  42.12  45.60  48.48  51.84
51.96  54.12  55.56  67.80  68.64  68.64  68.88  84.12
93.12  98.64  105.12  105.84  127.92  128.04  173.40

As a result of testing thousands of ball bearings, it is known that their lifetimes have a WEI(θ, β) distribution.

(a) Find the M.L. estimates of θ and β and the observed information $I(\hat{\theta},\hat{\beta})$.

(b) Plot the 1%, 5% and 10% likelihood regions for θ and β on the same graph.


(c) Plot the approximate 99%, 95% and 90% joint confidence regions for θ and β on the same graph. Compare these with the likelihood regions in (b) and comment.

(d) Calculate an approximate 95% confidence interval for β.

(e) The value β = 1 is of interest since WEI(θ, 1) = EXP(θ). Is β = 1 a plausible value of β in light of the observed data? Justify your conclusion.

(f) If X ∼ WEI(θ, β) then
$$P(X > 80;\theta,\beta) = \exp\left[-\left(\frac{80}{\theta}\right)^\beta\right] = \tau(\theta,\beta).$$

Find an approximate 95% confidence interval for τ (θ,β).

2.9.11 Example - Logistic Regression

Pistons are made by casting molten aluminum into moulds and then machining the raw casting. One defect that can occur is called porosity, due to the entrapment of bubbles of gas in the casting as the metal solidifies. The presence or absence of porosity is thought to be a function of pouring temperature of the aluminum.

One batch of raw aluminum is available and the pistons are cast in 8 different dies. The pouring temperature is set at one of 4 levels

750, 775, 800, 825

and at each level, 3 pistons are cast in the 8 dies available. The presence (1) or absence (0) of porosity is recorded for each piston and the data are given below:

Temperature   Observations (1 = porosity present, 0 = absent)     Total
750           0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 1 1     13
775           0 0 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0     11
800           0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 1     10
825           0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0      8

In Figure 2.8, the scatter plot of the proportion of pistons with porosity versus temperature shows that there is a general decrease in porosity as temperature increases.



Figure 2.8: Proportion of Defects vs Temperature

A model for these data is
$$Y_{ij} \sim \text{BIN}(1, p_i), \quad i = 1,\ldots,4,\ j = 1,\ldots,24 \text{ independently}$$
where i indicates the level of pouring temperature and j the replication. We would like to fit a curve, a function of the pouring temperature, to the probabilities $p_i$, and the most common function used for this purpose is the logistic function, $e^z/(1+e^z)$. This function is bounded between 0 and 1 and so can be used to model probabilities. We may choose the exponent z to depend on the explanatory variates resulting in
$$p_i = p_i(\alpha,\beta) = \frac{e^{\alpha+\beta(x_i-\bar{x})}}{1 + e^{\alpha+\beta(x_i-\bar{x})}}.$$
In this expression, $x_i$ is the pouring temperature at level i, $\bar{x} = 787.5$ is the average pouring temperature, and α, β are two unknown parameters. Note also that
$$\text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta(x_i - \bar{x}).$$
The likelihood function is
$$L(\alpha,\beta) = \prod_{i=1}^4\prod_{j=1}^{24}P(Y_{ij} = y_{ij};\alpha,\beta) = \prod_{i=1}^4\prod_{j=1}^{24}p_i^{y_{ij}}(1-p_i)^{1-y_{ij}}$$


and the log likelihood is

$$l(\alpha,\beta) = \sum_{i=1}^4\sum_{j=1}^{24}\left[y_{ij}\log(p_i) + (1-y_{ij})\log(1-p_i)\right].$$
Note that
$$\frac{\partial p_i}{\partial\alpha} = p_i(1-p_i) \quad \text{and} \quad \frac{\partial p_i}{\partial\beta} = (x_i-\bar{x})\,p_i(1-p_i)$$
so that
$$\begin{aligned}
\frac{\partial l(\alpha,\beta)}{\partial\alpha} &= \sum_{i=1}^4\sum_{j=1}^{24}\frac{\partial l}{\partial p_i}\cdot\frac{\partial p_i}{\partial\alpha} = \sum_{i=1}^4\sum_{j=1}^{24}\left[\frac{y_{ij}}{p_i} - \frac{1-y_{ij}}{1-p_i}\right]p_i(1-p_i) \\
&= \sum_{i=1}^4\sum_{j=1}^{24}\left[y_{ij}(1-p_i) - (1-y_{ij})p_i\right] = \sum_{i=1}^4\sum_{j=1}^{24}(y_{ij} - p_i) = \sum_{i=1}^4(y_{i.} - 24p_i)
\end{aligned}$$
where
$$y_{i.} = \sum_{j=1}^{24}y_{ij}.$$
Similarly
$$\frac{\partial l(\alpha,\beta)}{\partial\beta} = \sum_{i=1}^4(x_i-\bar{x})(y_{i.} - 24p_i).$$
The score function is
$$S(\alpha,\beta) = \begin{bmatrix} \sum_{i=1}^4(y_{i.} - 24p_i) \\ \sum_{i=1}^4(x_i-\bar{x})(y_{i.} - 24p_i) \end{bmatrix}.$$
Since
$$-\frac{\partial^2 l(\alpha,\beta)}{\partial\alpha^2} = 24\sum_{i=1}^4\frac{\partial p_i}{\partial\alpha} = 24\sum_{i=1}^4 p_i(1-p_i),$$
$$-\frac{\partial^2 l(\alpha,\beta)}{\partial\beta^2} = 24\sum_{i=1}^4(x_i-\bar{x})\frac{\partial p_i}{\partial\beta} = 24\sum_{i=1}^4(x_i-\bar{x})^2 p_i(1-p_i),$$
and
$$-\frac{\partial^2 l(\alpha,\beta)}{\partial\alpha\partial\beta} = 24\sum_{i=1}^4\frac{\partial p_i}{\partial\beta} = 24\sum_{i=1}^4(x_i-\bar{x})\,p_i(1-p_i),$$
the information matrix and the Fisher information matrix are equal and given by
$$I(\alpha,\beta) = J(\alpha,\beta) = \begin{bmatrix} 24\sum_{i=1}^4 p_i(1-p_i) & 24\sum_{i=1}^4(x_i-\bar{x})p_i(1-p_i) \\ 24\sum_{i=1}^4(x_i-\bar{x})p_i(1-p_i) & 24\sum_{i=1}^4(x_i-\bar{x})^2p_i(1-p_i) \end{bmatrix}.$$

To find the M.L. estimators of α and β we must solve
$$\frac{\partial l(\alpha,\beta)}{\partial\alpha} = 0 = \frac{\partial l(\alpha,\beta)}{\partial\beta}$$
simultaneously, which must be done numerically using a method such as Newton's method. Initial estimates of α and β can be obtained by drawing a line through the points in Figure 2.8, choosing two points on the line and then solving for α and β. For example, suppose we require that the line pass through the points (775, 11/24) and (825, 8/24). We obtain
$$-0.167 = \text{logit}(11/24) = \alpha + \beta(775 - 787.5) = \alpha + \beta(-12.5)$$
$$-0.693 = \text{logit}(8/24) = \alpha + \beta(825 - 787.5) = \alpha + \beta(37.5)$$
and these result in initial estimates $\alpha^{(0)} = -0.298$, $\beta^{(0)} = -0.0105$.

Now
$$J(\alpha^{(0)},\beta^{(0)}) = \begin{bmatrix} 23.01 & -27.22 \\ -27.22 & 17748.30 \end{bmatrix}$$
and
$$S(\alpha^{(0)},\beta^{(0)}) = [0.9533428,\ -9.521179]^T$$
and the first iteration of Newton's method gives
$$\begin{bmatrix} \alpha^{(1)} \\ \beta^{(1)} \end{bmatrix} = \begin{bmatrix} \alpha^{(0)} \\ \beta^{(0)} \end{bmatrix} + \left[J(\alpha^{(0)},\beta^{(0)})\right]^{-1}S(\alpha^{(0)},\beta^{(0)}) = \begin{bmatrix} -0.2571332 \\ -0.01097377 \end{bmatrix}.$$
Repeating this process does not substantially change these estimates, so we have the M.L. estimates
$$\hat{\alpha} = -0.2571831896, \quad \hat{\beta} = -0.01097623887.$$
The Fisher information matrix evaluated at the M.L. estimates is
$$J(\hat{\alpha},\hat{\beta}) = \begin{bmatrix} 23.09153 & -24.63342 \\ -24.63342 & 17783.63646 \end{bmatrix}.$$


The inverse of this matrix gives an estimate of the asymptotic variance/covariance matrix of the estimators:
$$[J(\hat{\alpha},\hat{\beta})]^{-1} = \begin{bmatrix} 0.0433700024 & 0.0000600749759 \\ 0.0000600749759 & 0.00005631468312 \end{bmatrix}.$$
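The Newton iteration described above is easily carried out numerically. The following Python sketch uses only the group totals $y_{i.} = (13, 11, 10, 8)$ from the data table and the starting values obtained from the fitted line; the function and variable names are of course only illustrative.

import numpy as np

x = np.array([750.0, 775.0, 800.0, 825.0])
y = np.array([13.0, 11.0, 10.0, 8.0])          # number of pistons with porosity per level
m = 24.0                                       # pistons per temperature level
xc = x - x.mean()                              # x_i - x_bar, with x_bar = 787.5

def score_and_info(alpha, beta):
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * xc)))
    w = m * p * (1.0 - p)
    S = np.array([np.sum(y - m * p), np.sum(xc * (y - m * p))])
    J = np.array([[np.sum(w),      np.sum(xc * w)],
                  [np.sum(xc * w), np.sum(xc**2 * w)]])
    return S, J

est = np.array([-0.298, -0.0105])              # initial estimates
for _ in range(10):                            # Newton's method: theta <- theta + J^{-1} S
    S, J = score_and_info(*est)
    est = est + np.linalg.solve(J, S)

print("M.L. estimates:", est)                  # about (-0.2572, -0.01098)
S, J = score_and_info(*est)
print("J(alpha_hat, beta_hat):", J)            # about [[23.09, -24.63], [-24.63, 17783.6]]
print("[J]^{-1}:", np.linalg.inv(J))

The printed values agree with the estimates and matrices quoted above.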


Figure 2.9: Fitted Model for Proportion of Defects as a Function of Temperature

A plot of
$$\hat{p}(x) = \frac{\exp\left[\hat{\alpha} + \hat{\beta}(x-\bar{x})\right]}{1 + \exp\left[\hat{\alpha} + \hat{\beta}(x-\bar{x})\right]}$$
is shown in Figure 2.9. Note that the curve is very close to a straight line over the range of x.

The 1%, 5% and 10% likelihood regions for (α, β) are shown in Figure 2.10. Note that these likelihood regions are very nearly elliptical in shape, with axes nearly parallel to the coordinate axes. The latter follows since the (1, 2) entry in the estimated variance/covariance matrix $[J(\hat{\alpha},\hat{\beta})]^{-1}$ is very close to zero, which implies that the estimators $\hat{\alpha}$ and $\hat{\beta}$ are not highly correlated. This allows us to make inferences more easily about β alone. Plausible values for β can be determined from the likelihood regions in Figure 2.10. A model with no effect due to pouring temperature corresponds to β = 0. The likelihood regions indicate that the value β = 0 is a very plausible value in light of the data for all plausible values of α.



Figure 2.10: Likelihood regions for (α,β).

The probability of a defect when the pouring temperature is x = 750 is equal to
$$\tau = \tau(\alpha,\beta) = \frac{e^{\alpha+\beta(750-\bar{x})}}{1 + e^{\alpha+\beta(750-\bar{x})}} = \frac{e^{\alpha+\beta(-37.5)}}{1 + e^{\alpha+\beta(-37.5)}} = \frac{1}{e^{-\alpha-\beta(-37.5)} + 1}.$$
By the invariance property of M.L. estimators the M.L. estimator of τ is
$$\hat{\tau} = \tau(\hat{\alpha},\hat{\beta}) = \frac{e^{\hat{\alpha}+\hat{\beta}(-37.5)}}{1 + e^{\hat{\alpha}+\hat{\beta}(-37.5)}} = \frac{1}{e^{-\hat{\alpha}-\hat{\beta}(-37.5)} + 1}$$
and the M.L. estimate is
$$\hat{\tau} = \frac{1}{e^{-\hat{\alpha}-\hat{\beta}(-37.5)} + 1} = \frac{1}{e^{0.2571831896 + 0.01097623887(-37.5)} + 1} = 0.5385.$$

To construct an approximate C.I. for τ we need an estimate of $Var(\hat{\tau};\alpha,\beta)$. Now
$$Var\left[-\hat{\alpha} - \hat{\beta}(-37.5);\alpha,\beta\right] = (-1)^2Var(\hat{\alpha};\alpha,\beta) + (37.5)^2Var(\hat{\beta};\alpha,\beta) + 2(-1)(37.5)Cov(\hat{\alpha},\hat{\beta};\alpha,\beta)$$
and using $[J(\hat{\alpha},\hat{\beta})]^{-1}$ we estimate this variance by
$$v = 0.0433700024 + (37.5)^2(0.00005631468312) - 2(37.5)(0.0000600749759) = 0.118057.$$


An approximate 95% C.I. for $-\alpha - \beta(-37.5)$ is
$$\left[-\hat{\alpha} - \hat{\beta}(-37.5) - 1.96\sqrt{v},\ -\hat{\alpha} - \hat{\beta}(-37.5) + 1.96\sqrt{v}\right]$$
$$= \left[-0.154426 - 1.96\sqrt{0.118057},\ -0.154426 + 1.96\sqrt{0.118057}\right] = [-0.827870, 0.519019].$$
An approximate 95% C.I. for
$$\tau = \tau(\alpha,\beta) = \frac{1}{e^{-\alpha-\beta(-37.5)} + 1}$$
is
$$\left[\frac{1}{\exp(0.519019) + 1},\ \frac{1}{\exp(-0.827870) + 1}\right] = [0.373082, 0.695904].$$
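As a numerical check, the calculation on the logit scale and its back-transformation can be reproduced with a few lines of Python, starting from the estimates and the inverse information matrix quoted earlier:

import numpy as np

alpha_hat, beta_hat = -0.2571831896, -0.01097623887
Vinv = np.array([[0.0433700024,    0.0000600749759],
                 [0.0000600749759, 0.00005631468312]])   # [J(alpha_hat, beta_hat)]^{-1}

eta_hat = alpha_hat + beta_hat * (750.0 - 787.5)         # alpha + beta*(-37.5)
d = np.array([1.0, -37.5])                               # gradient of eta w.r.t. (alpha, beta)
v = d @ Vinv @ d                                         # = 0.118057
print(1.0 / (1.0 + np.exp(-eta_hat)))                    # tau_hat = 0.5385

lo, hi = eta_hat - 1.96 * np.sqrt(v), eta_hat + 1.96 * np.sqrt(v)
print(1.0 / (1.0 + np.exp(-lo)), 1.0 / (1.0 + np.exp(-hi)))  # about 0.373 and 0.696

Constructing the interval on the logit scale and then transforming keeps the endpoints inside (0, 1), which a direct delta-method interval for τ need not do.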

The near linearity of the fitted function as indicated in Figure 2.9 seems to imply that we need not use the logistic function treated in this example, but that a straight line could have been fit to these data with similar results over the range of temperatures observed. Indeed, a simple linear regression would provide nearly the same fit. However, if values of $p_i$ near 0 or 1 had been observed, e.g. for temperatures well above or well below those used here, the non-linearity of the logistic function would have been important and provided some advantage over simple linear regression.

2.9.12 Problem - The Challenger Data

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23 (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend. The data are given in Table 1. (See Dalal, Fowlkes and Hoadley (1989), JASA, 84, 945-957.)


Table 1

Date        Temperature   Number of Damage Incidents
4/12/81     66            0
11/12/81    70            1
3/22/82     69            0
6/27/82     80            Not available
1/11/82     68            0
4/4/83      67            0
6/18/83     72            0
8/30/83     73            0
11/28/83    70            0
2/3/84      57            1
4/6/84      63            1
8/30/84     70            1
10/5/84     78            0
11/8/84     67            0
1/24/85     53            3
4/12/85     67            0
4/29/85     75            0
6/17/85     70            0
7/29/85     81            0
8/27/85     76            0
10/3/85     79            0
10/30/85    75            2
11/26/85    76            0
1/12/86     58            1
1/28/86     31            Challenger Accident

(a) Let
$$p(t;\alpha,\beta) = P(\text{at least one damage incident for a flight at temperature } t) = \frac{e^{\alpha+\beta t}}{1 + e^{\alpha+\beta t}}.$$
Using M.L. estimation fit the model
$$Y_i \sim \text{BIN}(1, p(t_i;\alpha,\beta)), \quad i = 1,\ldots,23$$
to the data available from the flights prior to the Challenger accident. You may ignore the flight for which information on damage incidents is not available.

(b) Plot 10% and 50% likelihood regions for α and β.


(c) Find an approximate 95% C.I. for β. How plausible is the value β = 0?

(d) Find an approximate 95% C.I. for p(t) if t = 31, the temperature on the day of the disaster. Comment.

2.10 Nuisance Parameters and M.L. Estimation

Suppose $X_1,\ldots,X_n$ is a random sample from the distribution with probability (density) function f(x;θ). Suppose also that θ = (λ, φ) where λ is a vector of parameters of interest and φ is a vector of nuisance parameters.

The profile likelihood is one modification of the likelihood which allows us to look at estimation methods for λ in the presence of the nuisance parameter φ.

2.10.1 Definition

Suppose θ = (λ, φ) with likelihood function L(λ, φ). Let $\hat{\phi}(\lambda)$ be the M.L. estimator of φ for a fixed value of λ. Then the profile likelihood for λ is given by $L(\lambda, \hat{\phi}(\lambda))$.

The M.L. estimator of λ based on the profile likelihood is, of course, the same estimator obtained by maximizing the joint likelihood L(λ, φ) simultaneously over λ and φ. If the profile likelihood is used to construct likelihood regions for λ, care must be taken since the imprecision in the estimation of the nuisance parameter φ is not taken into account.

Profile likelihood is one example of a group of modifications of the likelihood known as pseudo-likelihoods which are based on a derived likelihood for a subset of parameters. Marginal likelihood, conditional likelihood and partial likelihood are also included in this class.

Suppose that θ = (λ, φ) and the data X, or some function of the data, can be partitioned into U and V. Suppose also that
$$f(u,v;\theta) = f(u;\lambda)\cdot f(v|u;\theta).$$
If the conditional distribution of V given U depends only on φ then estimation of λ can be based on f(u;λ), the marginal likelihood for λ. If f(v|u;θ) depends on both λ and φ then the marginal likelihood may still be used for estimation of λ if, in ignoring the conditional distribution, there is little information lost.

If there is a factorization of the form
$$f(u,v;\theta) = f(u|v;\lambda)\cdot f(v;\theta)$$
then estimation of λ can be based on f(u|v;λ), the conditional likelihood for λ.
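As a numerical illustration of the profile likelihood in Definition 2.10.1, consider a GAM(α, β) model (not one of the problems below) with the scale β as the nuisance parameter; for fixed α the M.L. estimator of β is $\hat{\beta}(\alpha) = \bar{x}/\alpha$. The following Python sketch evaluates the profile log likelihood of α on a grid; the data, seed and grid limits are arbitrary assumptions made only so the code runs.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2013)
x = rng.gamma(shape=3.0, scale=2.0, size=50)
n, xbar, slx = len(x), x.mean(), np.sum(np.log(x))

def profile_loglik(alpha):
    beta_hat = xbar / alpha                      # maximizes l(alpha, beta) in beta for fixed alpha
    return ((alpha - 1.0) * slx - np.sum(x) / beta_hat
            - n * alpha * np.log(beta_hat) - n * gammaln(alpha))

grid = np.linspace(1.0, 6.0, 501)
lp = np.array([profile_loglik(a) for a in grid])
print("profile M.L. estimate of alpha:", grid[np.argmax(lp)])

# a 10% profile likelihood interval for alpha
keep = grid[lp >= lp.max() + np.log(0.10)]
print("10% profile likelihood interval:", keep.min(), keep.max())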

2.10.2 Problem

Suppose $X_1,\ldots,X_n$ is a random sample from a N(μ, σ²) distribution and that σ is the parameter of interest while μ is a nuisance parameter. Find the profile likelihood of σ. Let $U = S^2$ and $V = \bar{X}$. Find f(u;σ), the marginal likelihood of σ, and f(u|v;σ), the conditional likelihood of σ. Compare the three likelihoods.

2.11 Problems with M.L. Estimators

2.11.1 Example

This is an example to indicate that in the presence of a large number of nuisance parameters, it is possible for a M.L. estimator to be inconsistent. Suppose we are interested in the effect of environment on the performance of identical twins in some test, where these twins were separated at birth and raised in different environments. If the vector $(X_i, Y_i)$ denotes the scores of the i'th pair of twins, we might assume $X_i$ and $Y_i$ are independent N$(\mu_i,\sigma^2)$ random variables. We wish to estimate the parameter σ² based on a sample of n twins. Show that the M.L. estimator of σ² is
$$\hat{\sigma}^2 = \frac{1}{4n}\sum_{i=1}^n(X_i - Y_i)^2$$
and this is a biased and inconsistent estimator of σ². Show, however, that a simple modification results in an unbiased and consistent estimator.
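A small simulation sketch can illustrate the inconsistency (it is not a substitute for the algebraic argument requested above); σ = 1 and the distribution of the $\mu_i$ are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(7)
sigma = 1.0
for n in (100, 10000, 1000000):
    mu = rng.uniform(50.0, 150.0, size=n)    # arbitrary pair-specific means
    x = rng.normal(mu, sigma)
    y = rng.normal(mu, sigma)
    sigma2_hat = np.sum((x - y) ** 2) / (4.0 * n)
    print(n, sigma2_hat)                     # stabilizes at a value different from sigma^2 = 1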

2.11.2 Example

Recall that Theorem 2.7.2 states that under some conditions a root of the likelihood equation exists which is consistent as the sample size approaches infinity. One might wonder why the theorem did not simply make the same assertion for the value of the parameter providing the global maximum of the likelihood function. The answer is that while the consistent root of the likelihood equation often corresponds to the global maximum of the likelihood function, there is no guarantee of this without some additional conditions. This somewhat unusual example shows circumstances under which the consistent root of the likelihood equation is not the global maximizer of the likelihood function. Suppose $X_i$, $i = 1,\ldots,n$ are independent observations from the mixture density of the form
$$f(x;\theta) = \frac{\epsilon}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2/2\sigma^2} + \frac{1-\epsilon}{\sqrt{2\pi}}e^{-(x-\mu)^2/2}$$
where θ = (μ, σ²) with both parameters unknown. Notice that the likelihood function L(μ, σ) → ∞ for μ = $x_j$, σ → 0, for any j = 1, …, n. This means that the globally maximizing σ is σ = 0, which lies on the boundary of the parameter space. However, there is a local maximum of the likelihood function at some σ > 0 which provides a consistent estimator of the parameter.

2.11.3 Unidentifiability and Singular Information Matrices

Suppose we observe two independent random variables $Y_1, Y_2$ having normal distributions with the same variance σ² and means $\theta_1 + \theta_2$, $\theta_2 + \theta_3$ respectively. In this case, although the means depend on the parameter $\theta = (\theta_1,\theta_2,\theta_3)$, the value of this vector parameter is unidentifiable in the sense that, for some pairs of distinct parameter values, the probability density functions of the observations are identical. For example the parameter θ = (1, 0, 1) leads to exactly the same joint distribution of $Y_1, Y_2$ as does the parameter θ = (0, 1, 0). In this case, we might consider only the two parameters $(\phi_1,\phi_2) = (\theta_1+\theta_2,\ \theta_2+\theta_3)$ and anything derivable from this pair estimable, while parameters such as $\theta_2$ that cannot be obtained as functions of $\phi_1,\phi_2$ are consequently unidentifiable. The solution to the original identifiability problem is the reparametrization to the new parameter $(\phi_1,\phi_2)$ in this case, and in general, unidentifiability usually means one should seek a new, more parsimonious parametrization.

In the above example, compute the Fisher information matrix for the parameter $\theta = (\theta_1,\theta_2,\theta_3)$. Notice that the Fisher information matrix is singular. This means that if you were to attempt to compute the asymptotic variance of the M.L. estimator of θ by inverting the Fisher information matrix, the inversion would be impossible. Attempting to invert a singular matrix is like attempting to invert the number zero: it results in one or more components that you can consider to be infinite. Arguing intuitively, the asymptotic variance of the M.L. estimator of some of the parameters is infinite. This is an indication that asymptotically, at least, some of the parameters may not be identifiable. When parameters are unidentifiable, the Fisher information matrix is generally singular. However, when J(θ) is singular for all values of θ, this may or may not mean parameters are unidentifiable for finite sample sizes, but it does usually mean one should take a careful look at the parameters with a possible view to adopting another parametrization.

2.11.4 U.M.V.U.E.’s and M.L. Estimators: A Comparison

Should we use U.M.V.U.E.'s or M.L. estimators? There is no general consensus among statisticians.

1. If we are estimating the expectation of a natural sufficient statistic $T_i(X)$ in a regular exponential family, both M.L. and unbiasedness considerations lead to the use of $T_i$ as an estimator.

2. When sample sizes are large, U.M.V.U.E.'s and M.L. estimators are essentially the same. In that case use is governed by ease of computation. Unfortunately how large “large” needs to be is usually unknown. Some studies have been carried out comparing the behaviour of U.M.V.U.E.'s and M.L. estimators for various small fixed sample sizes. The results are, as might be expected, inconclusive.

3. M.L. estimators exist “more frequently” and when they do they are usually easier to compute than U.M.V.U.E.'s. This is essentially because of the appealing invariance property of M.L.E.'s.

4. Simple examples are known for which M.L. estimators behave badly even for large samples (see Examples 2.11.1 and 2.11.2 above).

5. U.M.V.U.E.'s and M.L. estimators are not necessarily robust. They are sensitive to model misspecification.

In Chapter 3 we examine other approaches to estimation.

2.12 Historical Notes

The concept of sufficiency is due to Fisher (1920), who in his fundamental paper of 1922 also introduced the term and stated the factorization criterion. The criterion was rediscovered by Neyman (1935) and was proved for general dominated families by Halmos and Savage (1949). The theory of minimal sufficiency was initiated by Lehmann and Scheffe (1950) and Dynkin (1951). For further generalizations, see Bahadur (1954) and Landers and Rogge (1972).

One-parameter exponential families as the only (regular) families of distributions for which there exists a one-dimensional sufficient statistic were also introduced by Fisher (1934). His result was generalized to more than one dimension by Darmois (1935), Koopman (1936) and Pitman (1936). A more recent discussion of this theorem with references to the literature is given, for example, by Hipp (1974). A comprehensive treatment of exponential families is provided by Barndorff-Nielsen (1978).

The concept of unbiasedness as “lack of systematic error” in the estimator was introduced by Gauss (1821) in his work on the theory of least squares. The first UMVU estimators were obtained by Aitken and Silverstone (1942) in the situation in which the information inequality yields the same result. UMVU estimators as unique unbiased functions of a suitable sufficient statistic were derived in special cases by Halmos (1946) and Kolmogorov (1950), and were pointed out as a general fact by Rao (1947). The concept of completeness was defined, its implications for unbiased estimation developed, and the Lehmann-Scheffe Theorem obtained in Lehmann and Scheffe (1950, 1955, 1956). Basu's theorem is due to Basu (1955, 1958).

The origins of the concept of maximum likelihood go back to the work of Lambert, Daniel Bernoulli, and Lagrange in the second half of the 18th century, and of Gauss and Laplace at the beginning of the 19th. [For details and references, see Edwards (1974).] The modern history begins with Edgeworth (1908-09) and Fisher (1922, 1925), whose contributions are discussed by Savage (1976) and Pratt (1976).

The amount of information that a data set contains about a parameter was introduced by Edgeworth (1908, 1909) and was developed systematically by Fisher (1922 and later papers). The first version of the information inequality appears to have been given by Frechet (1943), Rao (1945), and Cramer (1946). The designation “information inequality”, which replaced the earlier “Cramer-Rao inequality”, was proposed by Savage (1954).

Fisher's work on maximum likelihood was followed by a euphoric belief in the universal consistency and asymptotic efficiency of maximum likelihood estimators, at least in the i.i.d. case. The true situation was sorted out gradually. Landmarks are Wald (1949), who provided fairly general conditions for consistency, Cramer (1946), who defined the “regular” case in which the likelihood equation has a consistent asymptotically efficient root, the counterexamples of Bahadur (1958) and Hodges (Le Cam, 1953), and Le Cam's resulting theorem on superefficiency (1953).

Chapter 3

Other Methods of Estimation

3.1 Best Linear Unbiased Estimators

The problem of finding best unbiased estimators is considerably simpler if we limit the class in which we search. If we permit any function of the data, then we usually require the heavy machinery of complete sufficiency to produce U.M.V.U.E.'s. However, the situation is much simpler if we suggest some initial random variables and then require that our estimator be a linear combination of these. Suppose, for example, we have random variables $Y_1, Y_2, Y_3$ with $E(Y_1) = \alpha + \theta$, $E(Y_2) = \alpha - \theta$, $E(Y_3) = \theta$ where θ is the parameter of interest and α is another parameter. What linear combinations of the $Y_i$'s provide an unbiased estimator of θ and, among these possible linear combinations, which one has the smallest possible variance? To answer these questions, we need to know the covariances $Cov(Y_i, Y_j)$ (at least up to some scalar multiple). Suppose $Cov(Y_i, Y_j) = 0$, $i \neq j$, and $Var(Y_j) = \sigma^2$. Let $Y = (Y_1, Y_2, Y_3)^T$ and $\beta = (\alpha,\theta)^T$. The model can be written as a general linear model as
$$Y = X\beta + \varepsilon$$
where
$$X = \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 0 & 1 \end{bmatrix},$$
$\varepsilon = (\varepsilon_1,\varepsilon_2,\varepsilon_3)^T$, and the $\varepsilon_i$'s are uncorrelated random variables with $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$. Then the linear combination of the components of Y that has the smallest variance among all unbiased estimators of β is given by the usual regression formula $\hat{\beta} = (\hat{\alpha},\hat{\theta})^T = (X^TX)^{-1}X^TY$, and $\hat{\theta} = \frac{1}{3}(Y_1 - Y_2 + Y_3)$ provides the best estimator of θ in the sense of smallest variance. In other words, the linear combination of the components of Y which has smallest variance among all unbiased estimators of $a^T\beta$ is $a^T\hat{\beta}$ where $a^T = (0, 1)$. This result follows from the following theorem.

3.1.1 Gauss-Markov Theorem

Suppose $Y = (Y_1,\ldots,Y_n)^T$ is a vector of random variables such that
$$Y = X\beta + \varepsilon$$
where X is an n×k (design) matrix of known constants having rank k, $\beta = (\beta_1,\ldots,\beta_k)^T$ is a vector of unknown parameters and $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)^T$ is a vector of random variables such that $E(\varepsilon_i) = 0$, $i = 1,\ldots,n$, and
$$Var(\varepsilon) = \sigma^2 B$$
where B is a known non-singular matrix and σ² is a possibly unknown scalar parameter. Let $\theta = a^T\beta$, where a is a known k×1 vector. The unbiased estimator of θ having smallest variance among all unbiased estimators that are linear combinations of the components of Y is
$$\hat{\theta} = a^T(X^TB^{-1}X)^{-1}X^TB^{-1}Y.$$
Note that this result does not depend on any assumed normality of the components of Y but only on the first and second moment behaviour, that is, the means and the covariances. The special case when B is the identity matrix is the least squares estimator.

3.1.2 Problem

Show that if the conditions of the Gauss-Markov Theorem hold and the $\varepsilon_i$'s are assumed to be normally distributed then the U.M.V.U.E. of β is given by
$$\hat{\beta} = (X^TB^{-1}X)^{-1}X^TB^{-1}Y$$
(see Problems 1.5.11 and 1.7.25). Use this result to prove the Gauss-Markov Theorem in the case in which the $\varepsilon_i$'s are not assumed to be normally distributed.


3.1.3 Example

Suppose $T_1,\ldots,T_n$ are independent unbiased estimators of θ with known variances $Var(T_i) = \sigma_i^2$, $i = 1,\ldots,n$. Find the best linear combination of these estimators, that is, the one that results in an unbiased estimator of θ having the minimum variance among all linear unbiased estimators.

3.1.4 Problem

Suppose $Y_{ij}$, i = 1, 2; j = 1, …, n are independent random variables with $E(Y_{ij}) = \mu + \alpha_i$ and $Var(Y_{ij}) = \sigma^2$ where $\alpha_1 + \alpha_2 = 0$.

(a) Find the best linear unbiased estimator of $\alpha_1$.

(b) Under what additional assumptions is this estimator the U.M.V.U.E.? Justify your answer.

3.1.5 Problem

Suppose Yij , i = 1, 2; j = 1, . . . , ni are independent random variables with

$$E(Y_{ij}) = \alpha_i + \beta_i(x_{ij} - \bar{x}_i), \quad Var(Y_{ij}) = \sigma^2 \quad \text{and} \quad \bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}x_{ij}.$$

(a) Find the best linear unbiased estimators of $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$.

(b) Under what additional assumptions are these estimators the U.M.V.U.E.'s? Justify your answer.

3.1.6 Problem

Suppose $X_1,\ldots,X_n$ is a random sample from the N(μ, σ²) distribution. Find the linear combination of the random variables $(X_i - \bar{X})^2$, $i = 1,\ldots,n$, which minimizes the M.S.E. for estimating σ². Compare this estimator with the M.L. estimator and the U.M.V.U.E. of σ².

3.2 Equivariant Estimators

3.2.1 Definition

A model $\{f(x;\theta);\ \theta \in \Re\}$ such that $f(x;\theta) = f_0(x-\theta)$ with $f_0$ known is called a location invariant family and θ is called a location parameter. (See 1.2.3.)

In many examples the location of the origin is arbitrary. For example, if we record temperatures in degrees Celsius, the 0 point has been more or less arbitrarily chosen and we might wish that our inference methods do not depend on the choice of origin. This can be ensured by requiring that the estimator, when it is applied to shifted data, is shifted by the same amount.

3.2.2 Definition

The estimator $\hat{\theta}(X_1,\ldots,X_n)$ is location equivariant if
$$\hat{\theta}(x_1 + a,\ldots,x_n + a) = \hat{\theta}(x_1,\ldots,x_n) + a$$
for all values of $(x_1,\ldots,x_n)$ and real constants a.

3.2.3 Example

Suppose $X_1,\ldots,X_n$ is a random sample from a N(θ, 1) distribution. Show that the U.M.V.U.E. of θ is a location equivariant estimator.

Of course, location equivariant estimators do not make much sense for estimating variances; they are naturally connected to estimating the location parameter in a location invariant family.

We call a given estimator, $\hat{\theta}(X)$, minimum risk equivariant (M.R.E.) if, among all location equivariant estimators, it has the smallest M.S.E. It is not difficult to show that an M.R.E. estimator must be unbiased (Problem 3.2.6). Remarkably, best estimators in the class of location equivariant estimators are known, due to the following theorem of Pitman.

3.2.4 Theorem

Suppose $X_1,\ldots,X_n$ is a random sample from a location invariant family $\{f(x;\theta) = f_0(x-\theta),\ \theta \in \Re\}$, with known density $f_0$. Then among all location equivariant estimators, the one with smallest M.S.E. is the Pitman location equivariant estimator given by
$$\hat{\theta}(X_1,\ldots,X_n) = \frac{\displaystyle\int_{-\infty}^{\infty}u\prod_{i=1}^n f_0(X_i - u)\,du}{\displaystyle\int_{-\infty}^{\infty}\prod_{i=1}^n f_0(X_i - u)\,du}. \tag{3.2}$$
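When the integrals in (3.2) are not available in closed form they can be evaluated numerically. The sketch below does this on a grid for a standard Cauchy $f_0$ (a case mentioned later in this section); the seed, sample size and grid limits are illustrative assumptions.

import numpy as np

def pitman_location(x, f0, lo=-50.0, hi=50.0, m=20001):
    u = np.linspace(lo, hi, m)
    # product over observations of f0(x_i - u), computed in logs for stability
    logprod = np.sum(np.log(f0(x[:, None] - u[None, :])), axis=0)
    w = np.exp(logprod - logprod.max())        # unnormalized integrand
    return np.sum(u * w) / np.sum(w)           # ratio of Riemann sums; the grid spacing cancels

cauchy = lambda z: 1.0 / (np.pi * (1.0 + z ** 2))

rng = np.random.default_rng(3)
x = rng.standard_cauchy(15) + 2.0              # true location 2.0, for illustration
print("Pitman estimate :", pitman_location(x, cauchy))
print("sample median   :", np.median(x))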

3.2.5 Example

Let $X_1,\ldots,X_n$ be a random sample from the N(θ, 1) distribution. Show that the Pitman estimator of θ is the U.M.V.U.E. of θ.


3.2.6 Problem

Prove that the M.R.E. estimator is unbiased.

3.2.7 Problem

Let $(X_1, X_2)$ be a random sample from the distribution with probability density function
$$f(x;\theta) = -6\left(x - \theta - \tfrac{1}{2}\right)\left(x - \theta + \tfrac{1}{2}\right), \quad \theta - \tfrac{1}{2} < x < \theta + \tfrac{1}{2}.$$
Show that the Pitman estimator of θ is $\hat{\theta}(X_1, X_2) = (X_1 + X_2)/2$.

3.2.8 Problem

Let $X_1,\ldots,X_n$ be a random sample from the EXP(1, θ) distribution. Find the Pitman estimator of θ and compare it to the M.L. estimator of θ and the U.M.V.U.E. of θ.

3.2.9 Problem

Let $X_1,\ldots,X_n$ be a random sample from the UNIF(θ − 1/2, θ + 1/2) distribution. Find the Pitman estimator of θ. Show that the M.L. estimator is not unique in this case.

3.2.10 Problem

Suppose $X_1,\ldots,X_n$ is a random sample from a location invariant family $\{f(x;\theta) = f_0(x-\theta),\ \theta \in \Re\}$. Show that if the M.L. estimator is unique then it is a location equivariant estimator.

Since the M.R.E. estimator is an unbiased estimator, it follows that if there is a U.M.V.U.E. in a given problem and if that U.M.V.U.E. is location equivariant then the M.R.E. estimator and the U.M.V.U.E. must be identical. M.R.E. estimators are primarily used when no U.M.V.U.E. exists. For example, the Pitman estimator of the location parameter for a Cauchy distribution performs very well by comparison with any other estimator, including the M.L. estimator.

3.2.11 Definition

A model $\{f(x;\theta);\ \theta > 0\}$ such that $f(x;\theta) = \frac{1}{\theta}f_1\!\left(\frac{x}{\theta}\right)$ with $f_1$ known is called a scale invariant family and θ is called a scale parameter. (See 1.2.3.)


3.2.12 Definition

An estimator $\widehat{\theta^k} = \widehat{\theta^k}(X_1,\ldots,X_n)$ is scale equivariant if
$$\widehat{\theta^k}(cx_1,\ldots,cx_n) = c^k\,\widehat{\theta^k}(x_1,\ldots,x_n)$$
for all values of $(x_1,\ldots,x_n)$ and c > 0.

3.2.13 Theorem

Suppose $X_1,\ldots,X_n$ is a random sample from a scale invariant family $\{f(x;\theta) = \frac{1}{\theta}f_1(\frac{x}{\theta}),\ \theta > 0\}$, with known density $f_1$. The Pitman scale equivariant estimator of $\theta^k$ which minimizes
$$E\left[\left(\frac{\widehat{\theta^k} - \theta^k}{\theta^k}\right)^2;\theta\right]$$
(the scaled M.S.E.) is given by
$$\widehat{\theta^k} = \widehat{\theta^k}(X_1,\ldots,X_n) = \frac{\displaystyle\int_0^{\infty}u^{n+k-1}\prod_{i=1}^n f_1(uX_i)\,du}{\displaystyle\int_0^{\infty}u^{n+2k-1}\prod_{i=1}^n f_1(uX_i)\,du}$$
for all k for which the integrals exist.

3.2.14 Problem

(a) Show that the EXP(θ) density is a scale invariant family.

(b) Show that the U.M.V.U.E. of θ based on a random sample $X_1,\ldots,X_n$ is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these estimators?

(c) Find the Pitman scale equivariant estimator of $\theta^{-1}$.

3.2.15 Problem

(a) Show that the N(0, σ²) density is a scale invariant family.

(b) Show that the U.M.V.U.E. of σ² based on a random sample $X_1,\ldots,X_n$ is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of σ². How does the M.L. estimator of σ² compare with these estimators?

(c) Find the Pitman scale equivariant estimator of σ and compare it to the M.L. estimator of σ and the U.M.V.U.E. of σ.


3.2.16 Problem

(a) Show that the UNIF(0, θ) density is a scale invariant family.

(b) Show that the U.M.V.U.E. of θ based on a random sample $X_1,\ldots,X_n$ is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these estimators?

(c) Find the Pitman scale equivariant estimator of $\theta^2$.

3.3 Estimating Equations

To find the M.L. estimator, we usually solve the likelihood equation
$$\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(X_i;\theta) = 0. \tag{3.3}$$
Note that the function on the left hand side is a function of both the observations and the parameter. Such a function is called an estimating function. Most sensible estimators, like the M.L. estimator, can be described easily through an estimating function. For example, if we know $Var(X_i) = \theta$ for independent identically distributed $X_i$, then we can use the estimating function
$$\Psi(\theta;X) = \sum_{i=1}^n(X_i - \bar{X})^2 - (n-1)\theta$$
to estimate the parameter θ, without any other knowledge of the distribution, its density, mean etc. The estimating function is set equal to 0 and solved for θ. The above estimating function is an unbiased estimating function in the sense that
$$E[\Psi(\theta;X);\theta] = 0, \quad \theta \in \Omega. \tag{3.4}$$
This allows us to conclude that the function is at least centered appropriately for the estimation of the parameter θ. Now suppose that Ψ is an unbiased estimating function corresponding to a large sample. Often Ψ can be written as the sum of independent components, for example
$$\Psi(\theta;X) = \sum_{i=1}^n\psi(\theta;X_i). \tag{3.5}$$
Now suppose $\hat{\theta}$ is a root of the estimating equation
$$\Psi(\hat{\theta};X) = 0.$$


Then for $\hat{\theta}$ sufficiently close to θ,
$$\Psi(\theta;X) = \Psi(\theta;X) - \Psi(\hat{\theta};X) \approx (\theta - \hat{\theta})\frac{\partial}{\partial\theta}\Psi(\theta;X).$$
Now using the Central Limit Theorem, assuming that θ is the true value of the parameter and provided Ψ is a sum as in (3.5), the left hand side is approximately normal with mean 0 and variance equal to $Var[\Psi(\theta;X)]$. The term $\frac{\partial}{\partial\theta}\Psi(\theta;X)$ is also a sum of similar derivatives of the individual $\psi(\theta;X_i)$. If a law of large numbers applies to these terms, then when divided by n this sum will be asymptotically equivalent to $\frac{1}{n}E\left[\frac{\partial}{\partial\theta}\Psi(\theta;X);\theta\right]$. It follows that the root $\hat{\theta}$ will have an approximate normal distribution with mean θ and variance
$$\frac{Var[\Psi(\theta;X);\theta]}{\left\{E\left[\frac{\partial}{\partial\theta}\Psi(\theta;X);\theta\right]\right\}^2}.$$
By analogy with the relation between the asymptotic variance of the M.L. estimator and the Fisher information, we call the reciprocal of the above asymptotic variance formula the Godambe information of the estimating function. This information measure is
$$J(\Psi;\theta) = \frac{\left\{E\left[\frac{\partial}{\partial\theta}\Psi(\theta;X);\theta\right]\right\}^2}{Var[\Psi(\theta;X);\theta]}. \tag{3.1}$$
Godambe (1960) proved the following result.

3.3.1 Theorem

Among all unbiased estimating functions satisfying the usual regularity conditions (see 2.3.1), an estimating function which maximizes the Godambe information (3.1) is of the form c(θ)S(θ;X) where c(θ) is non-random.

3.3.2 Problem

Prove Theorem 3.3.1.
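Before turning to the examples, the Godambe information (3.1) can be checked by simulation. The sketch below does so for the estimating function $\Psi(\theta;X) = \sum(X_i - \theta)$ under a POI(θ) model; since this Ψ is proportional to the score, Theorem 3.3.1 implies its Godambe information equals the Fisher information n/θ. The seed, θ, n and the number of replications are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(11)
theta, n, nsim = 4.0, 30, 20000

x = rng.poisson(theta, size=(nsim, n))
psi = np.sum(x - theta, axis=1)        # Psi evaluated at the true theta
dpsi = -float(n)                       # dPsi/dtheta is constant here
godambe = dpsi ** 2 / np.var(psi)

print("Monte Carlo Godambe information:", godambe)
print("Fisher information n/theta     :", n / theta)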

3.3.3 Example

Suppose $X = (X_1,\ldots,X_n)$ is a random sample from a distribution with
$$E(\log X_i;\theta) = e^{\theta} \quad \text{and} \quad Var(\log X_i;\theta) = e^{2\theta}, \quad i = 1,\ldots,n.$$
Consider the estimating function
$$\Psi(\theta;X) = \sum_{i=1}^n(\log X_i - e^{\theta}).$$


(a) Show that Ψ(θ;X) is an unbiased estimating function.

(b) Find the estimator $\hat{\theta}$ which satisfies $\Psi(\hat{\theta};X) = 0$.

(c) Construct an approximate 95% C.I. for θ.

3.3.4 Problem

Suppose $X_1,\ldots,X_n$ is a random sample from the Bernoulli(θ) distribution. Suppose also that $(\varepsilon_1,\ldots,\varepsilon_n)$ are independent N(0, σ²) random variables independent of the $X_i$'s. Define $Y_i = \theta X_i + \varepsilon_i$, $i = 1,\ldots,n$. We observe only the values $(X_i, Y_i)$, $i = 1,\ldots,n$. The parameter θ is unknown and the $\varepsilon_i$'s are unobserved. Define the estimating function
$$\Psi[\theta;(X,Y)] = \sum_{i=1}^n(Y_i - \theta X_i).$$

(a) Show that this is an unbiased estimating function for θ.

(b) Find the estimator $\hat{\theta}$ which satisfies $\Psi[\hat{\theta};(X,Y)] = 0$. Is $\hat{\theta}$ an unbiased estimator of θ?

(c) Construct an approximate 95% C.I. for θ.

3.3.5 Problem

Consider random variables $X_1,\ldots,X_n$ generated according to a first order autoregressive process
$$X_i = \theta X_{i-1} + Z_i,$$
where $X_0$ is a constant and $Z_1,\ldots,Z_n$ are independent N(0, σ²) random variables.

(a) Show that
$$X_i = \theta^i X_0 + \sum_{j=1}^i\theta^{i-j}Z_j.$$

(b) Show that
$$\Psi(\theta;X) = \sum_{i=0}^{n-1}X_i(X_{i+1} - \theta X_i)$$
is an unbiased estimating function for θ.

(c) Find the estimator $\hat{\theta}$ which satisfies $\Psi(\hat{\theta};X) = 0$. Compare the asymptotic variance of this estimator with the Cramer-Rao lower bound.


3.3.6 Problem

Let $X_1,\ldots,X_n$ be a random sample from the POI(θ) distribution. Since $Var(X_i;\theta) = \theta$, we could use the sample variance S² rather than the sample mean $\bar{X}$ as an estimator of θ, that is, we could use the estimating function
$$\Psi(\theta;X) = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2 - \theta.$$
Find the asymptotic variance of the resulting estimator and hence the asymptotic efficiency of this estimation method. (Hint: The sample variance S² has asymptotic variance $Var(S^2) \approx \frac{1}{n}\left\{E[(X_i-\mu)^4] - \sigma^4\right\}$ where $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$.)

3.3.7 Problem

Suppose $Y_1,\ldots,Y_n$ are independent random variables such that $E(Y_i) = \mu_i$ and $Var(Y_i) = v(\mu_i)$, $i = 1,\ldots,n$, where v is a known function. Suppose also that $h(\mu_i) = x_i^T\beta$ where h is a known function, $x_i = (x_{i1},\ldots,x_{ik})^T$, $i = 1,\ldots,n$, are vectors of known constants and β is a k×1 vector of unknown parameters. The quasi-likelihood estimating equation for estimating β is
$$\left(\frac{\partial\mu}{\partial\beta}\right)^T[V(\mu)]^{-1}(Y - \mu) = 0$$
where $Y = (Y_1,\ldots,Y_n)^T$, $\mu = (\mu_1,\ldots,\mu_n)^T$, $V(\mu) = \text{diag}\{v(\mu_1),\ldots,v(\mu_n)\}$, and $\frac{\partial\mu}{\partial\beta}$ is the n×k matrix whose (i, j) entry is $\frac{\partial\mu_i}{\partial\beta_j}$.

(a) Show that this is an unbiased estimating equation for all β.

(b) Show that if $Y_i \sim$ POI$(\mu_i)$ and $\log(\mu_i) = x_i^T\beta$ then the quasi-likelihood estimating equation is also the likelihood equation for estimating β.

3.3.8 Problem

Suppose $X_1,\ldots,X_n$ is a random sample from a distribution with p.f./p.d.f. f(x;θ). It is well known that the estimator $\bar{X}$ is sensitive to extreme observations while the sample median is not. Attempts have been made to find robust estimators which are not unduly affected by outliers. One such class, proposed by Huber (1981), is the class of M-estimators. These estimators are defined as the estimators which minimize
$$\sum_{i=1}^n\rho(X_i;\theta) \tag{3.2}$$
with respect to θ for some function ρ. The “M” stands for “maximum likelihood type” since for $\rho(x;\theta) = -\log f(x;\theta)$ the estimator is the M.L. estimator. Since minimizing (3.2) is usually equivalent to solving
$$\Psi(\theta;X) = \sum_{i=1}^n\frac{\partial}{\partial\theta}\rho(X_i;\theta) = 0,$$
M-estimators may also be defined in terms of estimating functions.

Three examples of ρ functions are:

(1) $\rho(x;\theta) = (x-\theta)^2/2$

(2) $\rho(x;\theta) = |x-\theta|$

(3) $\rho(x;\theta) = \begin{cases} (x-\theta)^2/2 & \text{if } |x-\theta| \leq c \\ c|x-\theta| - c^2/2 & \text{if } |x-\theta| > c \end{cases}$

(a) For all three ρ functions, find a p.d.f. f (x; θ) such thatρ (x; θ) = −logf (x; θ) + log k.(b) For the f (x; θ) obtained for (3), graph the p.d.f. for θ = 0, c = 1 andθ = 0, c = 1.5. On the same graph plot the N(0, 1) and t(2) p.d.f.’s. Whatdo you notice?

(c) The following data, ordered from smallest to largest, were randomly generated from a t(2) distribution:

−1.75 −1.24 −1.15 −1.09 −1.02 −0.93 −0.92 −0.91
−0.78 −0.61 −0.59 −0.58 −0.44 −0.35 −0.26 −0.20
−0.18 −0.18 −0.17 −0.15 −0.08 −0.04  0.02  0.03
 0.09  0.14  0.25  0.36  0.43  0.93  1.03  1.13
 1.16  1.34  1.61  1.95  2.25  2.37  2.59  4.82

Construct a frequency histogram for these data.

(d) Find the M-estimate for θ for each of the ρ functions given above. Use c = 1 for (3).

(e) Compare these estimates with the M.L. estimate obtained by assuming that X1, . . . , Xn is a random sample from the distribution with p.d.f.

f(x; θ) = [Γ(3/2)/√(2π)] [1 + (x − θ)²/2]^{−3/2}, x ∈ ℝ,

which is a t(2) p.d.f. if θ = 0.
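For part (d), the M-estimates can be computed numerically; the sketch below (an illustration, not part of the problem) minimizes Σ ρ(xi; θ) over θ for each of the three ρ functions, using the 40 values listed above and c = 1 for (3).

import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([-1.75, -1.24, -1.15, -1.09, -1.02, -0.93, -0.92, -0.91,
                 -0.78, -0.61, -0.59, -0.58, -0.44, -0.35, -0.26, -0.20,
                 -0.18, -0.18, -0.17, -0.15, -0.08, -0.04,  0.02,  0.03,
                  0.09,  0.14,  0.25,  0.36,  0.43,  0.93,  1.03,  1.13,
                  1.16,  1.34,  1.61,  1.95,  2.25,  2.37,  2.59,  4.82])

def huber(u, c=1.0):
    return np.where(np.abs(u) <= c, u**2 / 2, c * np.abs(u) - c**2 / 2)

objectives = {
    "squared error": lambda t: np.sum((data - t)**2 / 2),   # gives the sample mean
    "absolute error": lambda t: np.sum(np.abs(data - t)),   # gives the sample median
    "Huber, c = 1":  lambda t: np.sum(huber(data - t)),
}
for name, obj in objectives.items():
    res = minimize_scalar(obj, bounds=(-2, 5), method="bounded")
    print(f"{name:15s} M-estimate: {res.x: .4f}")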


3.4 Bayes Estimation

There are two major schools of thought on the way in which statistical inference is conducted, the frequentist and the Bayesian school. Typically, these schools differ slightly on the actual methodology and the conclusions that are reached, but more substantially on the philosophy underlying the treatment of parameters. So far we have considered a parameter as an unknown constant underlying or indexing the probability density function of the data. It is only the data, and statistics derived from the data, that are random.

However, the Bayesian begins by asserting that the parameter θ is simply the realization of some larger random experiment. The parameter is assumed to have been generated according to some distribution, the prior distribution π, and the observations then obtained from the corresponding probability density function f(x; θ) interpreted as the conditional probability density of the data given the value of θ. The prior distribution π(θ) quantifies information about θ prior to any further data being gathered. Sometimes π(θ) can be constructed on the basis of past data. For example, if a quality inspection program has been running for some time, the distribution of the number of defectives in past batches can be used as the prior distribution for the number of defectives in a future batch. The prior can also be chosen to incorporate subjective information based on an expert's experience and personal judgement. The purpose of the data is then to adjust this distribution for θ in the light of the data, to result in the posterior distribution for the parameter. Any conclusions about the plausible value of the parameter are to be drawn from the posterior distribution. For a frequentist, statements like P(1 < θ < 2) are meaningless; all randomness lies in the data and the parameter is an unknown constant. Hence the effort taken in earlier courses in carefully assuring students that if an observed 95% confidence interval for the parameter is 1 ≤ θ ≤ 2 this does not imply P(1 ≤ θ ≤ 2) = 0.95. However, a Bayesian will happily quote such a probability, usually conditionally on some observations, for example, P(1 ≤ θ ≤ 2|x) = 0.95.

3.4.1 Posterior Distribution

Suppose the parameter is initially chosen at random according to the prior distribution π(θ) and then, given the value of the parameter, the observations are independent identically distributed, each with conditional probability (density) function f(x; θ). Then the posterior distribution of the parameter


is the conditional distribution of θ given the data x = (x1, . . . , xn):

π(θ|x) = c π(θ) Π_{i=1}^n f(xi; θ) = c π(θ) L(θ; x)

where

c⁻¹ = ∫_{−∞}^{∞} π(θ) L(θ; x) dθ

is independent of θ and L(θ; x) is the likelihood function. Since Bayesian inference is based on the posterior distribution, it depends only on the data through the likelihood function.

3.4.2 Example

Suppose a coin is tossed n times with probability of heads θ. It is known from “previous experience with coins” that the prior probability of heads is not always identically 1/2 but follows a BETA(10, 10) distribution. If the n tosses result in x heads, find the posterior density function for θ.
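A hedged numerical check of this example: by conjugacy the posterior is BETA(10 + x, 10 + n − x), and the sketch below compares this closed form with a brute-force grid evaluation of c π(θ) L(θ; x). The values n = 40 and x = 27 are arbitrary.

import numpy as np
from scipy import stats

n, x = 40, 27                      # illustrative data only
a, b = 10, 10                      # BETA(10, 10) prior

theta = np.linspace(0.001, 0.999, 2000)
unnorm = stats.beta.pdf(theta, a, b) * theta**x * (1 - theta)**(n - x)
post_grid = unnorm / np.trapz(unnorm, theta)          # c * pi(theta) * L(theta; x)
post_exact = stats.beta.pdf(theta, a + x, b + n - x)  # conjugate posterior

print("max |grid - exact| =", np.max(np.abs(post_grid - post_exact)))
print("posterior mean =", (a + x) / (a + b + n))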

3.4.3 Definition - Conjugate Prior Distribution

If a prior distribution has the property that the posterior distribution is in the same family of distributions as the prior then the prior is called a conjugate prior.

3.4.4 Conjugate Prior Distribution for the Exponential Family

Suppose X1, . . . ,Xn is a random sample from the exponential family

f (x; θ) = C(θ) exp[q(θ)T (x)]h(x)

and θ is assumed to have the prior distribution with parameters a, b given by

π(θ) = π(θ; a, b) = k[C(θ)]^a exp[b q(θ)]     (3.8)

where

k⁻¹ = ∫_{−∞}^{∞} [C(θ)]^a exp[b q(θ)] dθ.

Then the posterior distribution of θ, given the data x = (x1, . . . , xn), is easily seen to be given by

π(θ|x) = c[C(θ)]^{a+n} exp{q(θ)[b + Σ_{i=1}^n T(xi)]}


where

c⁻¹ = ∫_{−∞}^{∞} [C(θ)]^{a+n} exp{q(θ)[b + Σ_{i=1}^n T(xi)]} dθ.

Notice that the posterior distribution is in the same family of distributions as (3.8) and thus π(θ) is a conjugate prior. The values of the parameters of the posterior distribution reflect the choice of parameters in the prior.

3.4.5 Example

Find the conjugate prior for θ for a random sample X1, . . . , Xn from the distribution with probability density function

f(x; θ) = θx^{θ−1}, 0 < x < 1, θ > 0.

Show that the posterior distribution of θ given the data x = (x1, . . . , xn) is in the same family of distributions as the prior.

3.4.6 Problem

Find the conjugate prior distribution of the parameter θ for a random sample X1, . . . , Xn from each of the following distributions. In each case, find the posterior distribution of θ given the data x = (x1, . . . , xn).

(a) POI(θ)
(b) N(θ, σ²), σ² known
(c) N(μ, θ), μ known
(d) GAM(α, θ), α known.

3.4.7 Problem

Suppose X1, . . . , Xn is a random sample from the UNIF(0, θ) distribution. Show that the prior distribution θ ∼ PAR(a, b) is a conjugate prior.

3.4.8 Problem

Suppose X1, . . . , Xn is a random sample from the N(μ, 1/θ) distribution where μ and θ are unknown. Show that the joint prior given by

π(μ, θ) = c θ^{b1/2} exp{−(θ/2)[a1 + b2(a2 − μ)²]}, θ > 0, μ ∈ ℝ,

where a1, a2, b1 and b2 are parameters, is a conjugate prior. This prior is called a normal-gamma prior. Why? Hint: π(μ, θ) = π1(μ|θ) π2(θ).


3.4.9 Empirical Bayes

In the conjugate prior given in (3.8) there are two parameters, a and b, which must be specified. In an empirical Bayes approach the parameters of the prior are assumed to be unknown constants and are estimated from the data. Suppose the prior distribution for θ is π(θ; λ) where λ is an unknown parameter (possibly a vector) and X1, . . . , Xn is a random sample from f(x; θ). The marginal distribution of X1, . . . , Xn is given by

f(x1, . . . , xn; λ) = ∫_{−∞}^{∞} π(θ; λ) Π_{i=1}^n f(xi; θ) dθ

which depends on the data X1, . . . , Xn and λ and therefore can be used to estimate λ.
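As an informal illustration of the empirical Bayes idea (not from the notes): for the Poisson–gamma pair of Problem 3.4.12 below, the marginal distribution of each Xi is negative binomial, so the prior parameters can be estimated by maximizing the marginal likelihood numerically. The parametrization (shape a, scale b) and the simulated values are assumptions made for the sketch only.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Sketch: lambda_i ~ GAM(a, b) (shape a, scale b), X_i | lambda_i ~ POI(lambda_i).
# Marginally X_i is negative binomial; estimate (a, b) by maximum marginal likelihood.
rng = np.random.default_rng(3)
a_true, b_true, n = 2.0, 1.5, 500            # illustrative values only
x = rng.poisson(rng.gamma(a_true, b_true, size=n))

def neg_marginal_loglik(log_params):
    a, b = np.exp(log_params)                # keep a, b > 0
    # negative binomial with size a and success probability p = 1/(1 + b)
    return -np.sum(stats.nbinom.logpmf(x, a, 1.0 / (1.0 + b)))

res = minimize(neg_marginal_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print("empirical Bayes estimates (a, b):", np.exp(res.x))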

3.4.10 Example

In Example 3.4.5 find the marginal distribution of (X1, . . . , Xn) and indicate how it could be used to estimate the parameters a and b of the conjugate prior.

3.4.11 Problem

Suppose X1, . . . , Xn is a random sample from the POI(θ) distribution. If a conjugate prior is assumed for θ, find the marginal distribution of (X1, . . . , Xn) and indicate how it could be used to estimate the parameters a and b of the conjugate prior.

3.4.12 Problem

An insurance company insures n drivers. For each driver the company knows Xi, the number of accidents driver i has had in the past three years. To estimate each driver's accident rate λi the company assumes (λ1, . . . , λn) is a random sample from the GAM(a, b) distribution where a and b are unknown constants and Xi ∼ POI(λi), i = 1, . . . , n independently. Find the marginal distribution of (X1, . . . , Xn) and indicate how you would find the M.L. estimates of a and b using this distribution. Another approach to estimating a and b would be to use the estimators

â = X̄/b̂,     b̂ = [Σ_{i=1}^n Xi² / Σ_{i=1}^n Xi] − (1 + X̄).


Show that these are consistent estimators of a and b respectively.

3.4.13 Noninformative Prior Distributions

The choice of the prior distribution to be the conjugate prior is often motivated by mathematical convenience. However, a Bayesian would also like the prior to accurately represent the preliminary uncertainty about the plausible values of the parameter, and this may not be easily translated into one of the conjugate prior distributions. Noninformative priors are the usual way of representing ignorance about θ and they are frequently used in practice. It can be argued that they are more objective than a subjectively assessed prior distribution since the latter may contain personal bias as well as background knowledge. Also, in some applications the amount of prior information available is far less than the information contained in the data. In this case there seems little point in worrying about a precise specification of the prior distribution.

If in Example 3.4.2 there were no reason to prefer one value of θ over any other then a noninformative or 'flat' prior distribution for θ that could be used is the UNIF(0, 1) distribution. For estimating the mean θ of a N(θ, 1) distribution the possible values for θ are (−∞, ∞). If we take the prior distribution to be uniform on (−∞, ∞), that is,

π(θ) = c, −∞ < θ < ∞,

then this is not a proper density since

∫_{−∞}^{∞} π(θ) dθ = c ∫_{−∞}^{∞} dθ = ∞.

Prior densities of this type are called improper priors. In this case we could consider a sequence of prior distributions such as the UNIF(−M, M) which approximates this prior as M → ∞. Suppose we call such a prior density function πM. Then the posterior distribution of the parameter is given by

πM(θ|x) = c πM(θ) L(θ; x)

and it is easy to see that as M → ∞, this approaches a constant multiple of the likelihood function L(θ). This provides another interpretation of the likelihood function. We can consider it as proportional to the posterior distribution of the parameter when using a uniform improper prior on the whole real line. The language is somewhat sloppy here since, as we have seen, the uniform distribution on the whole real line really makes sense only through taking limits for uniform distributions on finite intervals.


In the case of a scale parameter, which must take positive values, such as the normal variance, it is usual to express ignorance of the prior distribution of the parameter by assuming that the logarithm of the parameter is uniform on the real line.

3.4.14 Example

Let X1, . . . , Xn be a random sample from a N(μ, σ²) distribution and assume that the prior distributions of μ and log(σ²) are independent improper uniform distributions. Show that the marginal posterior distribution of μ given the data x = (x1, . . . , xn) is such that √n(μ − x̄)/s has a t distribution with n − 1 degrees of freedom. Show also that the marginal posterior distribution of σ² given the data x is such that 1/σ² has a GAM((n − 1)/2, 2/[(n − 1)s²]) distribution.

3.4.15 Jeffreys’ Prior

A problem with noninformative prior distributions is whether the prior distribution should be uniform for θ or some function of θ, such as θ² or log(θ). It is common to use a uniform prior for τ = h(θ) where h(θ) is the function of θ whose Fisher information does not depend on the unknown parameter. This idea is due to Jeffreys and leads to a prior distribution which is proportional to [J(θ)]^{1/2}. Such a prior is referred to as a Jeffreys' prior.

3.4.16 Problem

Suppose {f(x; θ); θ ∈ Ω} is a regular model and J(θ) = E[−(∂²/∂θ²) log f(X; θ); θ] is the Fisher information. Consider the reparameterization

τ = h(θ) = ∫_{θ0}^{θ} √J(u) du,     (3.9)

where θ0 is a constant. Show that the Fisher information for the reparameterization is equal to one (see Problem 2.3.4). (Note: Since the asymptotic variance of the M.L. estimator τ̂n is equal to 1/n, which does not depend on τ, (3.9) is called a variance stabilizing transformation.)

3.4.17 Example

Find the Jeffreys' prior for θ if X has a BIN(n, θ) distribution. What function of θ has a uniform prior distribution?
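For this example the Fisher information is J(θ) = n/[θ(1 − θ)], so the Jeffreys' prior is proportional to θ^{−1/2}(1 − θ)^{−1/2}, a BETA(1/2, 1/2) distribution, and τ = arcsin(√θ) has a uniform prior. The short sketch below (an illustration only, with arbitrary n and θ values) checks the related variance-stabilizing claim by simulation: Var(arcsin(√(X/n))) ≈ 1/(4n) whatever the value of θ.

import numpy as np

# Sketch: for X ~ BIN(n, theta), tau = arcsin(sqrt(theta)) is the variance
# stabilizing transformation; Var(arcsin(sqrt(X/n))) is roughly 1/(4n) for any theta.
rng = np.random.default_rng(4)
n, reps = 100, 200_000
for theta in (0.1, 0.3, 0.5, 0.8):           # arbitrary values of theta
    x = rng.binomial(n, theta, size=reps)
    tau_hat = np.arcsin(np.sqrt(x / n))
    print(f"theta={theta:.1f}  Var(tau_hat)={tau_hat.var():.5f}  1/(4n)={1/(4*n):.5f}")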


3.4.18 Problem

Find the Jeffreys' prior distribution for a random sample X1, . . . , Xn from each of the following distributions. In each case, find the posterior distribution of the parameter θ given the data x = (x1, . . . , xn). What function of θ has a uniform prior distribution?

(a) POI(θ)
(b) N(θ, σ²), σ² known
(c) N(μ, θ), μ known
(d) GAM(α, θ), α known.

3.4.19 Problem

If θ is a vector then the Jeffreys' prior is taken to be proportional to the square root of the determinant of the Fisher information matrix. Suppose (X1, X2) ∼ MULT(n, θ1, θ2). Find the Jeffreys' prior for (θ1, θ2). Find the posterior distribution of (θ1, θ2) given (x1, x2). Find the marginal posterior distribution of θ1 given (x1, x2) and the marginal posterior distribution of θ2 given (x1, x2).

Hint: Show

∫_0^1 ∫_0^{1−x} x^{a−1} y^{b−1} (1 − x − y)^{c−1} dy dx = Γ(a)Γ(b)Γ(c)/Γ(a + b + c), a, b, c > 0.

3.4.20 Problem

Suppose E(Y) = Xβ where Y = (Y1, . . . , Yn)ᵀ is a vector of independent and normally distributed random variables with Var(Yi) = σ², i = 1, . . . , n, X is an n × k matrix of known constants of rank k and β = (β1, . . . , βk)ᵀ is a vector of unknown parameters. Let

β̂ = (XᵀX)⁻¹Xᵀy and s²e = (y − Xβ̂)ᵀ(y − Xβ̂)/(n − k)

where y = (y1, . . . , yn)ᵀ are the observed data.

(a) Find the joint posterior distribution of β and σ² given the data if the joint (improper) prior distribution of β and σ² is assumed to be proportional to σ⁻².

(b) Show that the marginal posterior distribution of σ² given the data y is such that σ⁻² has a GAM((n − k)/2, 2/[(n − k)s²e]) distribution.

(c) Find the marginal posterior distribution of β given the data y.


(d) Show that the conditional posterior distribution of β given σ² and the data y is MVN(β̂, σ²(XᵀX)⁻¹).

(e) Show that (β − β̂)ᵀXᵀX(β − β̂)/(k s²e) has an F(k, n − k) distribution.

3.4.21 Bayes Point Estimators

One method of obtaining a point estimator of θ is to use the posterior distribution and a suitable loss function.

3.4.22 Theorem

Suppose X has p.f./p.d.f. f(x; θ) and θ has prior distribution π(θ). The Bayes estimator of θ for squared error loss with respect to the prior π(θ) given X is

θ̃ = θ̃(X) = ∫_{−∞}^{∞} θ π(θ|X) dθ = E(θ|X),

which is the mean of the posterior distribution π(θ|X). This estimator minimizes

E[(θ̃ − θ)²] = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} (θ̃ − θ)² f(x; θ) dx ] π(θ) dθ.

3.4.23 Example

Suppose X1, . . . , Xn is a random sample from the distribution with probability density function

f(x; θ) = θx^{θ−1}, 0 < x < 1, θ > 0.

Using a conjugate prior for θ find the Bayes estimator of θ for squared error loss. What is the Bayes estimator of τ = 1/θ for squared error loss? Do Bayes estimators satisfy an invariance property?

3.4.24 Example

In Example 3.4.14 find the Bayes estimators of μ and σ² for squared error loss based on their respective marginal posterior distributions.

3.4.25 Problem

Prove Theorem 3.4.22. Hint: Show that E[(X − c)²] is minimized by the value c = E(X).


3.4.26 Problem

For each case in Problems 3.4.7 and 3.4.18 find the Bayes estimator of θ for squared error loss and compare the estimator with the U.M.V.U.E. as n → ∞.

3.4.27 Problem

In Problem 3.4.12 find the Bayes estimators of (λ1, . . . , λn) for squared error loss.

3.4.28 Problem

Let X1, . . . , Xn be a random sample from a GAM(α, β) distribution where α is known. Find the posterior distribution of λ = 1/β given X1, . . . , Xn if the improper prior distribution of λ is assumed to be proportional to 1/λ. Find the Bayes estimator of β for squared error loss and compare it to the U.M.V.U.E. of β.

3.4.29 Problem

In Problem 3.4.19 find the Bayes estimators of θ1 and θ2 for squared error loss using their respective marginal posterior distributions. Compare these to the U.M.V.U.E.'s.

3.4.30 Problem

In Problem 3.4.20 find the Bayes estimators of β and σ² for squared error loss using their respective marginal posterior distributions. Compare these to the U.M.V.U.E.'s.

3.4.31 Problem

Show that the Bayes estimator of θ for absolute error loss with respect to the prior π(θ) given data X is the median of the posterior distribution.

Hint:

(d/dy) ∫_{a(y)}^{b(y)} g(x, y) dx = g(b(y), y)·b′(y) − g(a(y), y)·a′(y) + ∫_{a(y)}^{b(y)} [∂g(x, y)/∂y] dx.


3.4.32 Bayesian Intervals

There remains, after many decades, a controversy between Bayesians and frequentists about which approach to estimation is more suitable to the real world. The Bayesian has advantages at least in the ease of interpretation of the results. For example, a Bayesian can use the posterior distribution given the data x = (x1, . . . , xn) to determine points a = a(x), b = b(x) such that

∫_a^b π(θ|x) dθ = 0.95

and then give a Bayesian confidence interval (a, b) for the parameter. If this results in [2, 5] the Bayesian will state that (in a Bayesian model, subject to the validity of the prior) the conditional probability given the data that the parameter falls in the interval [2, 5] is 0.95. No such probability can be ascribed to a confidence interval for frequentists, who see no randomness in the parameter to which this probability statement is supposed to apply. Bayesian confidence regions are also called credible regions in order to make clear the distinction between the interpretation of Bayesian confidence regions and frequentist confidence regions.

Suppose π(θ|x) is the posterior distribution of θ given the data x and A is a subset of Ω. If

P(θ ∈ A|x) = ∫_A π(θ|x) dθ = p

then A is called a p credible region for θ. A credible region can be formed in many ways. If (a, b) is an interval such that

P(θ < a|x) = (1 − p)/2 = P(θ > b|x)

then [a, b] is called a p equal-tailed credible region. A highest posterior density (H.P.D.) credible region is constructed in a manner similar to likelihood regions. The p H.P.D. credible region is given by {θ : π(θ|x) ≥ c} where c is chosen such that

p = ∫_{{θ : π(θ|x) ≥ c}} π(θ|x) dθ.

A H.P.D. credible region is optimal in the sense that it is the shortest interval for a given value of p.
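A numerical sketch of an H.P.D. computation (not part of the notes): for a unimodal posterior the p H.P.D. region can be approximated by accumulating grid points in order of decreasing posterior density until mass p is reached. The BETA(8, 4) posterior below is an arbitrary illustration.

import numpy as np
from scipy import stats

def hpd_interval(post, p=0.95, grid=np.linspace(1e-6, 1 - 1e-6, 20001)):
    """Approximate HPD region {theta: pi(theta|x) >= c} for a unimodal posterior."""
    dens = post.pdf(grid)
    order = np.argsort(dens)[::-1]               # highest-density points first
    keep = np.zeros_like(grid, dtype=bool)
    cum, dx = 0.0, grid[1] - grid[0]
    for idx in order:                            # add points until mass reaches p
        keep[idx] = True
        cum += dens[idx] * dx
        if cum >= p:
            break
    return grid[keep].min(), grid[keep].max()

post = stats.beta(8, 4)                          # illustrative posterior only
lo, hi = hpd_interval(post)
print("0.95 HPD region approx:", (round(lo, 3), round(hi, 3)))
print("0.95 equal-tailed region:", tuple(np.round(post.ppf([0.025, 0.975]), 3)))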


3.4.33 Example

Suppose X1, . . . , Xn is a random sample from the N(μ, σ²) distribution where σ² is known and μ has the conjugate prior. Find the p = 0.95 H.P.D. credible region for μ. Compare this to a 95% C.I. for μ.

3.4.34 Problem

Suppose (X1, . . . , X10) is a random sample from the GAM(2, 1/θ) distribution. If θ has the Jeffreys' prior and Σ_{i=1}^{10} xi = 4 then find and compare

(a) the 0.95 equal-tailed credible region for θ

(b) the 0.95 H.P.D. credible region for θ

(c) the 95% exact equal tail C.I. for θ.

Finally, although statisticians argue whether the Bayesian or the frequentist approach is better, there is really no one right way to do statistics. Some problems are best solved using a frequentist approach while others are best solved using a Bayesian approach. There are certainly instances in which a Bayesian approach seems sensible, particularly, for example, if the parameter is a measurement on a possibly randomly chosen individual (say the expected total annual claim of a client of an insurance company).

Chapter 4

Hypothesis Tests

4.1 Introduction

Statistical estimation usually concerns the estimation of the value of a parameter when we know little about it except perhaps that it lies in a given parameter space, and when we have no a priori reason to prefer one value of the parameter over another. If, however, we are asked to decide between two possible values of the parameter, the consequences of one choice of the parameter value may be quite different from another choice. For example, if we believe Yi is normally distributed with mean α + βxi and variance σ² for some explanatory variables xi, then the value β = 0 means there is no relation between Yi and xi. We need neither collect the values of xi nor build a model around them. Thus the two choices β = 0 and β = 1 are quite different in their consequences. This is often the case. An excellent example of the complete asymmetry in the costs attached to these two choices is Problem 4.4.17.

A hypothesis test involves a (usually natural) separation of the parameter space Ω into two disjoint regions, Ω0 and Ω − Ω0. By the difference between the two sets we mean those points in the former (Ω) that are not in the latter (Ω0). This partition of the parameter space corresponds to testing the null hypothesis that the parameter is in Ω0. We usually write this hypothesis in the form

H0 : θ ∈ Ω0.

The null hypothesis is usually the status quo. For example, in a test of a new drug, the null hypothesis would be that the drug had no effect, or no more of an effect than drugs already on the market. The null hypothesis



is only rejected if there is reasonably strong evidence against it. The alternative hypothesis determines what departures from the null hypothesis are anticipated. In this case, it might be simply

H1 : θ ∈ Ω−Ω0.

Since we do not know the true value of the parameter, we must base our decision on the observed value of X. The hypothesis test is conducted by determining a partition of the sample space into two sets, the critical or rejection region R and its complement R̄, which is called the acceptance region. We declare that H0 is false (in favour of the alternative) if we observe x ∈ R. When a test of hypothesis is conducted there are two types of possible errors: reject the null hypothesis H0 when it is true (Type I error) and accept H0 when it is false (Type II error).

4.1.1 Definition

The power function of a test with rejection region R is the function

β(θ) = P (X ∈ R; θ) = P (reject H0; θ), θ ∈ Ω.

Note that

β(θ) = 1 − P(accept H0; θ) = 1 − P(X ∈ R̄; θ) = 1 − P(type II error; θ) for θ ∈ Ω − Ω0.

In order to minimize the two types of possible errors in a test of hypothesis, it is obviously desirable that the power function β(θ) be small for θ ∈ Ω0 but large for θ ∈ Ω − Ω0.

The probability of rejecting H0 when it is true determines one important measure of the performance of a test, the level of significance.

4.1.2 Definition

A test has level of significance α if β(θ) ≤ α for all θ ∈ Ω0.

The level of significance is simply an upper bound on the probability of a type I error. There is no assurance that the upper bound is tight, that is, that equality is achieved somewhere. The lowest such upper bound is often called the size of the test.


4.1.3 Definition

The size of a test is equal to sup_{θ∈Ω0} β(θ).

4.1.4 Example

Suppose we toss a coin 100 times to determine if the coin is fair. Let X = number of heads observed. Then the model is

X ∼ BIN(n, θ), θ ∈ Ω = {θ : 0 < θ < 1}.

The null hypothesis is H0 : θ = 0.5 and Ω0 = {0.5}. This is an example of a simple hypothesis. A simple null hypothesis is one for which Ω0 contains a single point. The alternative hypothesis is H1 : θ ≠ 0.5 and Ω − Ω0 = {θ : 0 < θ < 1, θ ≠ 0.5}. The alternative hypothesis is not a simple hypothesis since Ω − Ω0 contains more than one point. It is an example of a composite hypothesis.

The sample space is S = {x : x = 0, 1, . . . , 100}. Suppose we choose the

rejection region to be

R = {x : |x − 50| ≥ 10} = {x : x ≤ 40 or x ≥ 60}.

The test of hypothesis is conducted by rejecting the null hypothesis H0 in favour of the alternative H1 if x ∈ R. The acceptance region is

R̄ = {x : 41 ≤ x ≤ 59}.

The power function is

β(θ) = P(X ∈ R; θ) = P({X ≤ 40} ∪ {X ≥ 60}; θ) = 1 − Σ_{x=41}^{59} (100 choose x) θ^x (1 − θ)^{100−x}.

A graph of the power function is given below. For this example Ω0 = {0.5} consists of a single point and therefore

P(type I error) = size of test = β(0.5) = P(X ∈ R; θ = 0.5)
    = P({X ≤ 40} ∪ {X ≥ 60}; θ = 0.5)
    = 1 − Σ_{x=41}^{59} (100 choose x) (0.5)^x (0.5)^{100−x}
    ≈ 0.05689.
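The size and power calculations in this example are easy to reproduce numerically; the following sketch (an illustration, not part of the original text) evaluates β(θ) = P(X ≤ 40; θ) + P(X ≥ 60; θ) for several values of θ, including θ = 0.5.

from scipy import stats

# beta(theta) = P(X <= 40) + P(X >= 60) for X ~ BIN(100, theta)
def power(theta, n=100, lower=40, upper=60):
    return stats.binom.cdf(lower, n, theta) + stats.binom.sf(upper - 1, n, theta)

for theta in (0.4, 0.45, 0.5, 0.55, 0.6):
    print(f"theta = {theta:.2f}   beta(theta) = {power(theta):.5f}")
# power(0.5) reproduces the size of the test, approximately 0.05689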



Figure 4.1: Power Function for Binomial Example

4.2 Uniformly Most Powerful Tests

Tests are often constructed by specifying the size of the test, which in turn determines the probability of the type I error, and then attempting to minimize the probability that the null hypothesis is accepted when it is false (type II error). Equivalently, we try to maximize the power function of the test for θ ∈ Ω − Ω0.

4.2.1 Definition

A test with power function β(θ) is a uniformly most powerful (U.M.P.) test of size α if, for all other tests of the same size α having power function β∗(θ), we have β(θ) ≥ β∗(θ) for all θ ∈ Ω − Ω0.

The word “uniformly” above refers to the fact that one function dominates another, that is, β(θ) ≥ β∗(θ) uniformly for all θ ∈ Ω − Ω0. When the alternative Ω − Ω0 consists of a single point θ1 then the construction of a best test is particularly easy. In this case, we may drop the word “uniformly” and refer to a “most powerful test”. The construction of a best test, by this definition, is possible under rather special circumstances. First, we often require a simple null hypothesis. This is the case when Ω0 consists of a single point θ0 and so we are testing the null hypothesis H0 : θ = θ0.

4.2.2 Neyman-Pearson Lemma

Let X have probability (density) function f(x; θ), θ ∈ Ω. Consider testing a simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1. For a constant c, suppose the rejection region defined by

R = {x : f(x; θ1)/f(x; θ0) > c}

corresponds to a test of size α. Then the test with this rejection region is a most powerful test of size α for testing H0 : θ = θ0 against H1 : θ = θ1.

4.2.3 Proof

Consider another rejection region R1 with the same size. Then

P(X ∈ R; θ0) = P(X ∈ R1; θ0) = α or ∫_R f(x; θ0) dx = ∫_{R1} f(x; θ0) dx.

Therefore

∫_{R∩R1} f(x; θ0) dx + ∫_{R∩R̄1} f(x; θ0) dx = ∫_{R∩R1} f(x; θ0) dx + ∫_{R̄∩R1} f(x; θ0) dx

and

∫_{R∩R̄1} f(x; θ0) dx = ∫_{R̄∩R1} f(x; θ0) dx.     (4.1)

For x ∈ R ∩ R̄1,

f(x; θ1)/f(x; θ0) > c or f(x; θ1) > c f(x; θ0)

and thus

∫_{R∩R̄1} f(x; θ1) dx > c ∫_{R∩R̄1} f(x; θ0) dx.     (4.2)

For x ∈ R̄ ∩ R1, f(x; θ1) ≤ c f(x; θ0), and thus

−∫_{R̄∩R1} f(x; θ1) dx ≥ −c ∫_{R̄∩R1} f(x; θ0) dx.     (4.3)

Now

β(θ1) = P(X ∈ R; θ1) = P(X ∈ R ∩ R1; θ1) + P(X ∈ R ∩ R̄1; θ1)
    = ∫_{R∩R1} f(x; θ1) dx + ∫_{R∩R̄1} f(x; θ1) dx

and

β1(θ1) = P(X ∈ R1; θ1) = ∫_{R∩R1} f(x; θ1) dx + ∫_{R̄∩R1} f(x; θ1) dx.

Therefore, using (4.1), (4.2), and (4.3) we have

β(θ1) − β1(θ1) = ∫_{R∩R̄1} f(x; θ1) dx − ∫_{R̄∩R1} f(x; θ1) dx
    ≥ c ∫_{R∩R̄1} f(x; θ0) dx − c ∫_{R̄∩R1} f(x; θ0) dx
    = c [ ∫_{R∩R̄1} f(x; θ0) dx − ∫_{R̄∩R1} f(x; θ0) dx ] = 0

and the test with rejection region R is therefore the most powerful. ■

4.2.4 Example

Suppose X1, . . . , Xn are independent N(θ, 1) random variables. We consider only the parameter space Ω = [0, ∞). Suppose we wish to test the hypothesis H0 : θ = 0 against H1 : θ > 0.

(a) Choose an arbitrary θ1 > 0 and obtain the rejection region for the most powerful test of size 0.05 of H0 against H1 : θ = θ1.

(b) Does this test depend on the value of θ1 you chose? Can you conclude that it is uniformly most powerful?


(c) Graph the power function of the test.

(d) Find the rejection region for the uniformly most powerful test of H0 : θ = 0 against H1 : θ < 0. Find and graph the power function of this test.
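For parts (a)–(c), the most powerful test rejects for large x̄, giving R = {x : x̄ > 1.645/√n} for every θ1 > 0, with power β(θ) = 1 − Φ(1.645 − √n θ). The sketch below (illustrative only, with an arbitrary sample size) evaluates this power function; part (d) is the mirror image.

import numpy as np
from scipy import stats

# Power of the size-0.05 test that rejects H0: theta = 0 when xbar > 1.645/sqrt(n).
n = 25                                        # arbitrary sample size
theta = np.linspace(0.0, 1.0, 11)
power = 1 - stats.norm.cdf(1.645 - np.sqrt(n) * theta)
for t, b in zip(theta, power):
    print(f"theta = {t:.1f}   beta(theta) = {b:.4f}")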


Figure 4.2: Power Functions for Examples 4.2.4 and 4.2.5: — β (θ), - -β1 (θ), -· β2 (θ)

4.2.5 Example

Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution. Consider the rejection region {(x1, . . . , xn) : |x̄| > 1.96/√n} for testing the hypothesis H0 : θ = 0 against H1 : θ ≠ 0. What is the size of this test? Graph the power function of this test. Is this test uniformly most powerful?

4.2.6 Problem - Sufficient Statistics and Hypothesis Tests

Suppose X has probability (density) function f(x; θ), θ ∈ Ω. Suppose also that T = T(X) is a minimal sufficient statistic for θ. Show that the rejection region of the most powerful test of H0 : θ = θ0 against H1 : θ = θ1 depends on the data X only through T.

4.2.7 Problem

Let X1, . . . , X5 be a random sample from the distribution with probability density function

f(x; θ) = θ/x^{θ+1}, x ≥ 1, θ > 0.

(a) Find the rejection region for the most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ = θ1 where θ1 > 1. Note: log(Xi) ∼ EXP(1/θ).

(b) Explain why the rejection region in (a) is also the rejection region for the uniformly most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ > 1. Sketch the power function of this test.

(c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ < 1. On the same graph as in (b) sketch the power function of this test.

(d) Explain why there is no uniformly most powerful test of H0 against H1 : θ ≠ 1. What reasonable test of size 0.05 might be used for testing H0 against H1 : θ ≠ 1? On the same graph as in (b) sketch the power function of this test.

4.2.8 Problem

Let X1, . . . , X10 be a random sample from the GAM(1/2, θ) distribution.

(a) Find the rejection region for the most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ = θ1 where θ1 < 2.

(b) Explain why the rejection region in (a) is also the rejection region for the uniformly most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ < 2. Sketch the power function of this test.

(c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ > 2. On the same graph as in (b) sketch the power function of this test.

(d) Explain why there is no uniformly most powerful test of H0 against H1 : θ ≠ 2. What reasonable test of size 0.05 might be used for testing H0 against H1 : θ ≠ 2? On the same graph as in (b) sketch the power function of this test.


4.2.9 Problem

Let X1, . . . , Xn be a random sample from the UNIF(0, θ) distribution. Find the rejection region for the uniformly most powerful test of H0 : θ = 1 against the alternative H1 : θ > 1 of size 0.01. Sketch the power function of this test for n = 10.

4.2.10 Problem

We anticipate collecting observations (X1, . . . , Xn) from a N(μ, σ²) distribution in order to test the hypothesis H0 : μ = 0 against the alternative H1 : μ > 0 at level of significance 0.05. A preliminary investigation yields σ ≈ 2. How large a sample must we take in order to have power equal to 0.95 when μ = 1?

4.2.11 Relationship Between Hypothesis Tests and Confidence Intervals

There is a close relationship between hypothesis tests and confidence intervals as the following example illustrates. Suppose X1, . . . , Xn is a random sample from the N(θ, 1) distribution and we wish to test the hypothesis H0 : θ = θ0 against H1 : θ ≠ θ0. The rejection region {x : |x̄ − θ0| > 1.96/√n} is a size α = 0.05 rejection region which has a corresponding acceptance region {x : |x̄ − θ0| ≤ 1.96/√n}. Note that the hypothesis H0 : θ = θ0 would not be rejected at the 0.05 level if |x̄ − θ0| ≤ 1.96/√n or equivalently

x̄ − 1.96/√n ≤ θ0 ≤ x̄ + 1.96/√n

which is a 95% C.I. for θ.

4.2.12 Problem

Let (X1, . . . , X5) be a random sample from the GAM(2, θ) distribution. Show that

R = {x : Σ_{i=1}^5 xi < 4.7955 θ0 or Σ_{i=1}^5 xi > 17.085 θ0}

is a size 0.05 rejection region for testing H0 : θ = θ0. Show how this rejection region may be used to construct a 95% C.I. for θ.


4.3 Locally Most Powerful Tests

It is not always possible to construct a uniformly most powerful test. For this reason, and because alternative values of the parameter close to those under H0 are the hardest to differentiate from H0 itself, one may wish to develop a test that is best able to test the hypothesis H0 : θ = θ0 against alternatives very close to θ0. Such a test is called locally most powerful.

4.3.1 Definition

A test of H0 : θ = θ0 against H1 : θ > θ0 with power function β(θ) is locally most powerful if, for any other test having the same size and having power function β∗(θ), there exists an ε > 0 such that β(θ) ≥ β∗(θ) for all θ0 < θ < θ0 + ε.

This definition asserts that there is a neighbourhood of the null hypothesis in which the test is most powerful.

4.3.2 Theorem

Suppose {f(x; θ); θ ∈ Ω} is a regular statistical model with corresponding score function

S(θ; x) = (∂/∂θ) log f(x; θ).

A locally most powerful test of H0 : θ = θ0 against H1 : θ > θ0 has rejection region

R = {x : S(θ0; x) > c},

where c is a constant determined by

P[S(θ0; X) > c; θ0] = size of test.

Since this test is based on the score function, it is also called a score test.

4.3.3 Example

Suppose X1, . . . , Xn is a random sample from a N(θ, 1) distribution. Show that the locally most powerful test of H0 : θ = 0 against H1 : θ > 0 is also the uniformly most powerful test.


4.3.4 Problem

Consider a single observation X from the LOG(1, θ) distribution. Find the rejection region for the locally most powerful test of H0 : θ = 0 against H1 : θ > 0. Is this test also uniformly most powerful? What is the power function of the test?

Suppose X = (X1, . . . , Xn) is a random sample from a regular statistical model {f(x; θ); θ ∈ Ω} and the exact distribution of

S(θ0; X) = Σ_{i=1}^n (∂/∂θ) log f(Xi; θ0)

is difficult to obtain. Since, under H0 : θ = θ0,

S(θ0; X)/√J(θ0) →D Z ∼ N(0, 1)

by the C.L.T., an approximate size α rejection region for testing H0 : θ = θ0 against H1 : θ > θ0 is given by

{x : S(θ0; x)/√J(θ0) ≥ a}

where P(Z ≥ a) = α and Z ∼ N(0, 1). J(θ0) may be replaced by I(θ0; x).

4.3.5 Example

Suppose X1, . . . , Xn is a random sample from the CAU(1, θ) distribution. Find an approximate rejection region for a locally most powerful size 0.05 test of H0 : θ = θ0 against H1 : θ < θ0. Hint: Show J(θ) = n/2.

4.3.6 Problem

Suppose X1, . . . , Xn is a random sample from the WEI(1, θ) distribution. Find an approximate rejection region for a locally most powerful size 0.01 test of H0 : θ = θ0 against H1 : θ > θ0.

Hint: Show that

J(θ) = (n/θ²)(1 + π²/6 + γ² − 2γ)

where

γ = −∫_0^∞ (log y) e^{−y} dy ≈ 0.5772

is Euler’s constant.


4.4 Likelihood Ratio Tests

Consider a test of the hypothesis H0 : θ ∈ Ω0 against H1 : θ ∈ Ω − Ω0. We have seen that for prescribed θ0 ∈ Ω0, θ1 ∈ Ω − Ω0, the most powerful test of the simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1 is based on the likelihood ratio f(x; θ1)/f(x; θ0). By the Neyman-Pearson Lemma it has rejection region

R = {x : f(x; θ1)/f(x; θ0) > c}

where c is a constant determined by the size of the test. When either the null or the alternative hypothesis is composite (i.e. contains more than one point) and there is no uniformly most powerful test, it seems reasonable to use a test with rejection region R for some choice of θ1, θ0. The likelihood ratio test does this with θ1 replaced by θ̂, the M.L. estimator over all possible values of the parameter, and θ0 replaced by the M.L. estimator of the parameter when it is restricted to Ω0. Thus, the likelihood ratio test of H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω − Ω0 has rejection region R = {x : Λ(x) > c} where

Λ(x) = sup_{θ∈Ω} f(x; θ) / sup_{θ∈Ω0} f(x; θ) = sup_{θ∈Ω} L(θ; x) / sup_{θ∈Ω0} L(θ; x)

and c is determined by the size of the test. In general, the distribution of the test statistic Λ(X) may be difficult to find. Fortunately, however, the asymptotic distribution is known under fairly general conditions. In a few cases, we can show that the likelihood ratio test is equivalent to the use of a statistic with known distribution. However, in many cases, we need to rely on the asymptotic chi-squared distribution of Theorem 4.4.8.

4.4.1 Example

Let X1, . . . , Xn be a random sample from the N(μ, σ²) distribution where μ and σ² are unknown. Consider a test of

H0 : μ = 0, 0 < σ2 <∞

against the alternative

H1 : μ ≠ 0, 0 < σ² < ∞.

(a) Show that the likelihood ratio test of H0 against H1 has rejection region R = {x : n x̄²/s² > c}.

(b) Show under H0 that the statistic T = nX̄²/S² has a F(1, n − 1) distribution and thus find a size 0.05 test for n = 20.

(c) What rejection region would you use for testing H0 : μ = 0, 0 < σ² < ∞ against the one-sided alternative H1 : μ > 0, 0 < σ² < ∞?

4.4.2 Problem

Suppose X ∼ GAM(2, β1) and Y ∼ GAM(2, β2) independently.

(a) Show that the likelihood ratio statistic for testing the hypothesis H0 : β1 = β2 against the alternative H1 : β1 ≠ β2 is a function of the statistic T = X/(X + Y).

(b) Find the distribution of T under H0.

(c) Find the rejection region for a size 0.01 test. What rejection region would you use for testing H0 : β1 = β2 against the one-sided alternative H1 : β1 > β2?

4.4.3 Problem

Let (X1, . . . , Xn) be a random sample from the N(μ, σ²) distribution and independently let (Y1, . . . , Yn) be a random sample from the N(θ, σ²) distribution where σ² is known.

(a) Show that the likelihood ratio statistic for testing the hypothesis H0 : μ = θ against the alternative H1 : μ ≠ θ is a function of T = |X̄ − Ȳ|.

(b) Find the rejection region for a size 0.05 test. Is this test U.M.P.? Why?

4.4.4 Problem

Suppose X1, . . . , Xn are independent EXP(λ) random variables and independently Y1, . . . , Ym are independent EXP(μ) random variables.

(a) Show that the likelihood ratio statistic for testing the hypothesis H0 : λ = μ against the alternative H1 : λ ≠ μ is a function of

T = Σ_{i=1}^n Xi / [Σ_{i=1}^n Xi + Σ_{i=1}^m Yi].

(b) Find the distribution of T under H0. Explain clearly how you would find a size α = 0.05 rejection region.


(c) For n = 20 find the rejection region for the one-sided alternative H1 : λ > μ for a size 0.05 test.

4.4.5 Problem

Suppose X1, . . . , Xn is a random sample from the EXP(β, μ) distribution where β and μ are unknown.

(a) Show that the likelihood ratio statistic for testing the hypothesis H0 : β = 1 against the alternative H1 : β ≠ 1 is a function of the statistic

T = Σ_{i=1}^n (Xi − X(1)).

(b) Show that under H0, 2T has a chi-squared distribution (see Problem 1.8.11).

(c) For n = 12 find the rejection region for the one-sided alternative H1 : β > 1 for a size 0.05 test.

4.4.6 Problem

Let X1, . . . , Xn be a random sample from the distribution with p.d.f.

f(x; α, β) = αx^{α−1}/β^α, 0 < x ≤ β.

(a) Show that the likelihood ratio statistic for testing the hypothesis H0 : α = 1 against the alternative H1 : α ≠ 1 is a function of the statistic

T = Π_{i=1}^n (Xi/X(n)).

(b) Show that under H0, −2 log T has a chi-squared distribution (see Problem 1.8.12).

(c) For n = 14 find the rejection region for the one-sided alternative H1 : α > 1 for a size 0.05 test.

4.4.7 Problem

Suppose Yi ∼ N(α + βxi, σ²), i = 1, 2, . . . , n independently where x1, . . . , xn are known constants and α, β and σ² are unknown parameters.


(a) Show that the likelihood ratio statistic for testing H0 : β = 0 against the alternative H1 : β ≠ 0 is a function of

T = β̂² Σ_{i=1}^n (xi − x̄)² / S²e

where

S²e = [1/(n − 2)] Σ_{i=1}^n (Yi − α̂ − β̂xi)².

(b) What is the distribution of T under H0?

4.4.8 Theorem - Asymptotic Distribution of the Likelihood Ratio Statistic (Regular Model)

Suppose X = (X1, . . . , Xn) is a random sample from a regular statistical model {f(x; θ); θ ∈ Ω} with Ω an open set in k-dimensional Euclidean space. Consider a subset of Ω defined by Ω0 = {θ(η); η ∈ open subset of q-dimensional Euclidean space}. Then the likelihood ratio statistic defined by

Λn(X) = sup_{θ∈Ω} Π_{i=1}^n f(Xi; θ) / sup_{θ∈Ω0} Π_{i=1}^n f(Xi; θ) = sup_{θ∈Ω} L(θ; X) / sup_{θ∈Ω0} L(θ; X)

is such that, under the hypothesis H0 : θ ∈ Ω0,

2 log Λn(X) →D W ∼ χ²(k − q).

Note: The number of degrees of freedom is the difference between the number of parameters that need to be estimated in the general model, and the number of parameters left to be estimated under the restrictions imposed by H0.

4.4.9 Example

Suppose X1, . . . , Xn are independent POI(λ) random variables and independently Y1, . . . , Yn are independent POI(μ) random variables.

(a) Find the likelihood ratio test statistic for testing H0 : λ = μ against the alternative H1 : λ ≠ μ.

(b) Find the approximate rejection region for a size α = 0.05 test. Be sure to justify the approximation.

(c) Find the rejection region for the one-sided alternative H1 : λ < μ for a size 0.05 test.
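A small numerical sketch of part (b) (not part of the example): the likelihood ratio statistic for H0 : λ = μ is 2 log Λn = 2[l(λ̂, μ̂) − l(λ̃, λ̃)] with λ̂ = x̄, μ̂ = ȳ and pooled estimate λ̃ = (x̄ + ȳ)/2, compared with a χ²(1) cutoff. The simulated data below are arbitrary.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 30                                     # arbitrary sample size
x = rng.poisson(3.0, n)                    # POI(lambda) sample (illustrative)
y = rng.poisson(3.8, n)                    # POI(mu) sample (illustrative)

lam_hat, mu_hat = x.mean(), y.mean()       # unrestricted M.L. estimates
lam_tilde = (x.sum() + y.sum()) / (2 * n)  # M.L. estimate under H0: lambda = mu

two_log_lambda = 2 * (x.sum() * np.log(lam_hat / lam_tilde)
                      + y.sum() * np.log(mu_hat / lam_tilde))
print("2 log Lambda_n =", round(two_log_lambda, 3))
print("chi-square(1) 0.95 point =", round(stats.chi2.ppf(0.95, 1), 3))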


4.4.10 Problem

Suppose (X1, X2) ∼ MULT(n, θ1, θ2).

(a) Find the likelihood ratio statistic for testing H0 : θ1 = θ2 = θ3 against all alternatives.

(b) Find the approximate rejection region for a size 0.05 test. Be sure to justify the approximation.

4.4.11 Problem

Suppose (X1, X2) ∼ MULT(n, θ1, θ2).

(a) Find the likelihood ratio statistic for testing H0 : θ1 = θ², θ2 = 2θ(1 − θ) against all alternatives.

(b) Find the approximate rejection region for a size 0.05 test. Be sure to justify the approximation.

4.4.12 Problem

Suppose (X1, Y1), . . . , (Xn, Yn) is a random sample from the BVN(μ, Σ) distribution with (μ, Σ) unknown.

(a) Find the likelihood ratio statistic for testing H0 : ρ = 0 against the alternative H1 : ρ ≠ 0.

(b) Find the approximate size 0.05 rejection region. Be sure to justify the approximation.

4.4.13 Problem

Suppose in Problem 2.1.25 we wish to test the hypothesis that the data arise from the assumed model. Show that the likelihood ratio statistic is given by

Λ = 2 Σ_{i=1}^k Fi log(Fi/Ei)

where Ei = n pi(θ̂) and θ̂ is the M.L. estimator of θ. What is the asymptotic distribution of Λ? Another test statistic which is commonly used is the Pearson goodness of fit statistic given by

Σ_{i=1}^k (Fi − Ei)²/Ei


4.4.14 Problem

In Example 2.1.32 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic.

4.4.15 Problem

In Example 2.1.33 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic.

4.4.16 Problem

In Example 2.9.11 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic.

4.4.17 Problem

Suppose we have n independent repetitions of an experiment in which each outcome is classified according to whether event A occurred or not as well as whether event B occurred or not. The observed data can be arranged in a 2 × 2 contingency table as follows:

            B      B̄      Total
  A        f11    f12     r1
  Ā        f21    f22     r2
  Total    c1     c2      n

Find the likelihood ratio statistic for testing the hypothesis that the events A and B are independent, that is, H0 : P(A ∩ B) = P(A)P(B).

4.4.18 Problem

Suppose E(Y) = Xβ where Y = (Y1, . . . , Yn)ᵀ is a vector of independent and normally distributed random variables with Var(Yi) = σ², i = 1, . . . , n, X is an n × k matrix of known constants of rank k and β = (β1, . . . , βk)ᵀ is a vector of unknown parameters. Find the likelihood ratio statistic for testing the hypothesis H0 : βi = 0 against the alternative H1 : βi ≠ 0 where βi is the ith element of β.


4.4.19 Signed Square-root Likelihood Ratio Statistic

Suppose X = (X1, . . . , Xn) is a random sample from a regular statistical model {f(x; θ); θ ∈ Ω} where θ = (θ1, θ2)ᵀ, θ1 is a scalar and Ω is an open set in ℝ^k. Suppose also that the null hypothesis is H0 : θ1 = θ10. Let θ̂ = (θ̂1, θ̂2) be the maximum likelihood estimator of θ and let θ̃ = (θ10, θ̃2(θ10)) where θ̃2(θ10) is the maximum likelihood value of θ2 assuming θ1 = θ10. Then by Theorem 4.4.8

2 log Λn(X) = 2l(θ̂; X) − 2l(θ̃; X) →D W ∼ χ²(1)

under H0 or equivalently

[2l(θ̂; X) − 2l(θ̃; X)]^{1/2} →D Z ∼ N(0, 1)

under H0. The signed square-root likelihood ratio statistic defined by

sign(θ̂1 − θ10)[2l(θ̂; X) − 2l(θ̃; X)]^{1/2}

can be used to test one-sided alternatives such as H1 : θ1 > θ10 or H1 : θ1 < θ10. For example, if the alternative hypothesis were H1 : θ1 > θ10 then the rejection region for an approximate size 0.05 test would be given by

{x : sign(θ̂1 − θ10)[2l(θ̂; x) − 2l(θ̃; x)]^{1/2} > 1.645}.

4.4.20 Problem - The Challenger Data

In Problem 2.8.9 test the hypothesis that β = 0. What would a sensible alternative be? Describe in detail the null and the alternative hypotheses that you have in mind and the relative costs of the two different kinds of errors.

4.4.21 Significance Tests and p-values

We have seen that a test of hypothesis is a rule which allows us to decide whether to accept the null hypothesis H0 or to reject it in favour of the alternative hypothesis H1 based on the observed data. A test of significance can be used in situations in which H1 is difficult to specify. A (pure) test of significance is a procedure for measuring the strength of the evidence provided by the observed data against H0. This method usually involves looking at the distribution of a test statistic or discrepancy measure T under H0. The p-value or significance level for the test is the probability, computed under H0, of observing a T value at least as extreme as the value observed. The smaller the observed p-value, the stronger the evidence against H0. The difficulty with this approach is how to find a statistic with 'good properties'. The likelihood ratio statistic provides a general test statistic which may be used.

4.5 Score and Maximum Likelihood Tests

4.5.1 Score or Rao Tests

In Section 4.3 we saw that the locally most powerful test was a score test. Score tests can be viewed as a more general class of tests of H0 : θ = θ0 against H1 : θ ∈ Ω − {θ0}. If the usual regularity conditions hold then under H0 : θ = θ0 we have

S(θ0; X)[J(θ0)]^{−1/2} →D Z ∼ N(0, 1)

and thus

R(X; θ0) = [S(θ0; X)]²[J(θ0)]⁻¹ →D Y ∼ χ²(1).

For a vector θ = (θ1, . . . , θk)ᵀ we have

R(X; θ0) = [S(θ0; X)]ᵀ[J(θ0)]⁻¹S(θ0; X) →D Y ∼ χ²(k).     (4.4)

The corresponding rejection region is

R = {x : R(x; θ0) > c}

where c is determined by the size of the test, that is, c satisfies P[R(X; θ0) > c; θ0] = α. An approximate value for c can be determined using P(Y > c) = α where Y ∼ χ²(k).

The test based on R(X; θ0) is asymptotically equivalent to the likelihood ratio test. In (4.4) J(θ0) may be replaced by I(θ0) for an asymptotically equivalent test. Such test statistics are called score or Rao test statistics.

4.5.2 Maximum Likelihood or Wald Tests

Suppose that θ̂ is the M.L. estimator of θ over all θ ∈ Ω and we wish to test H0 : θ = θ0 against H1 : θ ∈ Ω − {θ0}. If the usual regularity conditions hold then under H0 : θ = θ0

W(X; θ0) = (θ̂ − θ0)ᵀ J(θ0) (θ̂ − θ0) →D Y ∼ χ²(k).     (4.5)


The corresponding rejection region is

R = {x : W(x; θ0) > c}

where c is determined by the size of the test, that is, c satisfies P[W(X; θ0) > c; θ0] = α. An approximate value for c can be determined using P(Y > c) = α where Y ∼ χ²(k).

The test based on W(X; θ0) is asymptotically equivalent to the likelihood ratio test. In (4.5) J(θ0) may also be replaced by J(θ̂), I(θ0) or I(θ̂) to obtain an asymptotically equivalent test statistic.

Such statistics are called maximum likelihood or Wald test statistics.

4.5.3 Example

Suppose X ∼ POI(θ). Find the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ ≠ θ0.
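A numerical sketch for this example (illustrative values only): with θ̂ = X, S(θ; x) = x/θ − 1 and J(θ) = 1/θ, the score statistic is (X − θ0)²/θ0, while the Wald statistic with J evaluated at θ̂ is (X − θ0)²/X; using J(θ0) in the Wald statistic instead reproduces the score form.

from scipy import stats

# Sketch: score (Rao) and Wald statistics for H0: theta = theta0 with X ~ POI(theta).
def score_stat(x, theta0):
    return (x - theta0) ** 2 / theta0          # [S(theta0; x)]^2 / J(theta0)

def wald_stat(x, theta0):
    return (x - theta0) ** 2 / x               # (theta_hat - theta0)^2 * J(theta_hat)

x, theta0 = 14, 9                              # illustrative values only
crit = stats.chi2.ppf(0.95, 1)
print("score statistic:", round(score_stat(x, theta0), 3))
print("Wald statistic :", round(wald_stat(x, theta0), 3))
print("reject at the 5% level if statistic >", round(crit, 3))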

4.5.4 Problem

Find the score test statistic (4.4) and the Wald test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ ≠ θ0 based on a random sample (X1, . . . , Xn) from each of the following distributions:

(a) EXP(θ)

(b) BIN(n, θ)

(c) N(θ,σ2), σ2 known

(d) EXP(θ,μ), μ known

(e) GAM(α, θ), α known

4.5.5 Problem

Let (X1, . . . , Xn) be a random sample from the PAR(1, θ) distribution. Find the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ ≠ θ0.

4.5.6 Problem

Suppose (X1, . . . , Xn) is a random sample from an exponential family model {f(x; θ); θ ∈ Ω}. Show that the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ ≠ θ0 are identical if the maximum likelihood estimator of θ is a linear function of the natural sufficient statistic.


4.6 Bayesian Hypothesis Tests

Suppose we have two simple hypotheses H0 : θ = θ0 and H1 : θ = θ1. The prior probability that H0 is true is denoted by P(H0) and the prior probability that H1 is true is P(H1) = 1 − P(H0). The ratio P(H0)/P(H1) is the prior odds. Suppose also that the data x have probability (density) function f(x; θ). The posterior probability that Hi is true is denoted by P(Hi|x), i = 0, 1. The Bayesian aim in hypothesis testing is to determine the posterior odds based on the data x given by

P(H0|x)/P(H1|x) = [P(H0)/P(H1)] × [f(x; θ0)/f(x; θ1)].

The ratio f(x; θ0)/f(x; θ1) is called the Bayes factor. If P(H0) = P(H1) then the posterior odds are just a likelihood ratio. The Bayes factor measures how the data have changed the odds as to which hypothesis is true. If the posterior odds were equal to q then a Bayesian would conclude that H0 is q times more likely to be true than H1. A Bayesian may also decide to accept H0 rather than H1 if q is suitably large.

If we have two composite hypotheses H0 : θ ∈ Ω0 and H1 : θ ∈ Ω − Ω0 then a prior distribution for θ must be specified for each hypothesis. We denote these by π0(θ|H0) and π1(θ|H1). In this case the posterior odds are

P(H0|x)/P(H1|x) = [P(H0)/P(H1)] · B

where B is the Bayes factor given by

B = ∫_{Ω0} f(x; θ) π0(θ|H0) dθ / ∫_{Ω−Ω0} f(x; θ) π1(θ|H1) dθ.

For the hypotheses H0 : θ = θ0 and H1 : θ ≠ θ0 the Bayes factor is

B = f(x; θ0) / ∫_{θ≠θ0} f(x; θ) π1(θ|H1) dθ.

4.6.1 Problem

Suppose (X1, . . . , Xn) is a random sample from a POI(θ) distribution and we wish to test H0 : θ = θ0 against H1 : θ ≠ θ0. Find the Bayes factor if under H1 the prior distribution for θ is the conjugate prior.
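A numerical sketch related to this problem (illustrative assumptions only): taking a GAM(a, b)-type conjugate prior under H1, the marginal ∫ f(x; θ) π1(θ|H1) dθ has a closed form, but the Bayes factor can also be evaluated by numerical integration, as below; the constant Π xi! cancels between numerator and denominator.

import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(6)
theta0, n = 2.0, 25
x = rng.poisson(2.6, n)                       # illustrative data only
s = x.sum()
a, b = 2.0, 1.0                               # GAM(a, b) prior under H1 (arbitrary)

def loglik(theta):                            # log L(theta; x) without the x_i! terms
    return s * np.log(theta) - n * theta

num = np.exp(loglik(theta0))                  # f(x; theta0) up to the common constant
den, _ = quad(lambda t: np.exp(loglik(t)) * stats.gamma.pdf(t, a, scale=b), 0, np.inf)
print("Bayes factor B =", num / den)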


Chapter 5

Appendix

5.1 Inequalities and Useful Results

5.1.1 Hölder's Inequality

Suppose X and Y are random variables and p and q are positive numbers satisfying

1/p + 1/q = 1.

Then

|E(XY)| ≤ E(|XY|) ≤ [E(|X|^p)]^{1/p} [E(|Y|^q)]^{1/q}.

Letting Y = 1 we have

E(|X|) ≤ [E(|X|^p)]^{1/p}, p > 1.

5.1.2 Covariance Inequality

If X and Y are random variables with variances σ1² and σ2² respectively then

[Cov(X, Y)]² ≤ σ1²σ2².

5.1.3 Chebyshev’s Inequality

If X is a random variable with E(X) = μ and Var(X) = σ² < ∞ then

P(|X − μ| ≥ k) ≤ σ²/k²

for any k > 0.



5.1.4 Jensen’s Inequality

If X is a random variable and g (x) is a convex function then

E [g (X)] ≥ g [E (X)] .

5.1.5 Corollary

If X is a non-degenerate random variable and g(x) is a strictly convex function, then

E [g (X)] > g [E (X)] .

5.1.6 Stirling’s Formula

For large n

Γ(n + 1) ≈ √(2π) n^{n+1/2} e^{−n}.

5.1.7 Matrix Differentiation

Suppose x = (x1, . . . , xk)ᵀ, b = (b1, . . . , bk)ᵀ and A is a k × k symmetric matrix. Then

∂(xᵀb)/∂x = [∂(xᵀb)/∂x1, . . . , ∂(xᵀb)/∂xk]ᵀ = b

and

∂(xᵀAx)/∂x = [∂(xᵀAx)/∂x1, . . . , ∂(xᵀAx)/∂xk]ᵀ = 2Ax.


5.2 Distributional Results

5.2.1 Functions of Random Variables

Univariate One-to-One Transformation

Suppose X is a continuous random variable with p.d.f. f(x) and support set A. Let Y = h(X) be a real-valued, one-to-one function from A to B. Then the probability density function of Y is

g(y) = f(h⁻¹(y)) · |d h⁻¹(y)/dy|, y ∈ B.

Multivariate One-to-One Transformation

Suppose (X1, . . . , Xn) is a vector of random variables with joint p.d.f. f(x1, . . . , xn) and support set RX. Suppose the transformation S defined by

Ui = hi(X1, . . . , Xn), i = 1, . . . , n

is a one-to-one, real-valued transformation with inverse transformation

Xi = wi(U1, . . . , Un), i = 1, . . . , n.

Suppose also that S maps RX into RU. Then g(u1, . . . , un), the joint p.d.f. of (U1, . . . , Un), is given by

g(u) = f(w1(u), . . . , wn(u)) |∂(x1, . . . , xn)/∂(u1, . . . , un)|, (u1, . . . , un) ∈ RU

where

∂(x1, . . . , xn)/∂(u1, . . . , un) = det[∂xi/∂uj] (the n × n determinant whose (i, j) entry is ∂xi/∂uj) = [∂(u1, . . . , un)/∂(x1, . . . , xn)]⁻¹

is the Jacobian of the transformation.


5.2.2 Order Statistic

The following results are derived in Casella and Berger, Section 5.4.

Joint Distribution of the Order Statistic

Suppose X1, . . . , Xn is a random sample from a continuous distribution with probability density function f(x). The joint probability density function of the order statistic T = (X(1), . . . , X(n)) = (T1, . . . , Tn) is

g(t1, . . . , tn) = n! Π_{i=1}^n f(ti), −∞ < t1 < · · · < tn < ∞.

Distribution of the Maximum and the Minimum of a Vector of Random Variables

Suppose X1, . . . , Xn is a random sample from a continuous distribution with probability density function f(x), support set A, and cumulative distribution function F(x).

The probability density function of U = X(i), i = 1, . . . , n is

[n!/((i − 1)!(n − i)!)] f(u) [F(u)]^{i−1} [1 − F(u)]^{n−i}, u ∈ A.

In particular the probability density function of T = X(n) = max(X1, . . . , Xn) is

g1(t) = n f(t) [F(t)]^{n−1}, t ∈ A

and the probability density function of S = X(1) = min(X1, . . . , Xn) is

g2(s) = n f(s) [1 − F(s)]^{n−1}, s ∈ A.


The joint p.d.f. of U = X(i) and V = X(j), 1 ≤ i < j ≤ n is given by

[n!/((i − 1)!(j − 1 − i)!(n − j)!)] f(u) f(v) [F(u)]^{i−1} [F(v) − F(u)]^{j−1−i} [1 − F(v)]^{n−j}, u < v, u ∈ A, v ∈ A.

In particular the joint probability density function of S = X(1) and T = X(n) is

g(s, t) = n(n − 1) f(s) f(t) [F(t) − F(s)]^{n−2}, s < t, s ∈ A, t ∈ A.

5.2.3 Problem

If Xi ∼ UNIF(a, b), i = 1, . . . , n independently, then show

(X(1) − a)/(b − a) ∼ BETA(1, n)

(X(n) − a)/(b − a) ∼ BETA(n, 1).


5.2.4 Distribution of Sums of Random Variables

(1) If Xi ∼ POI(μi), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ POI(Σ_{i=1}^n μi).

(2) If Xi ∼ BIN(ni, p), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ BIN(Σ_{i=1}^n ni, p).

(3) If Xi ∼ NB(ki, p), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ NB(Σ_{i=1}^n ki, p).

(4) If Xi ∼ N(μi, σi²), i = 1, . . . , n independently, then Σ_{i=1}^n aiXi ∼ N(Σ_{i=1}^n aiμi, Σ_{i=1}^n ai²σi²).

(5) If Xi ∼ N(μ, σ²), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ N(nμ, nσ²) and X̄ ∼ N(μ, σ²/n).

(6) If Xi ∼ GAM(αi, β), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ GAM(Σ_{i=1}^n αi, β).

(7) If Xi ∼ GAM(1, β) = EXP(β), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ GAM(n, β).

(8) If Xi ∼ χ²(ki), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ χ²(Σ_{i=1}^n ki).

(9) If Xi ∼ GAM(αi, β), i = 1, . . . , n independently where αi is a positive integer, then (2/β) Σ_{i=1}^n Xi ∼ χ²(2 Σ_{i=1}^n αi).

(10) If Xi ∼ N(μ, σ²), i = 1, . . . , n independently, then Σ_{i=1}^n [(Xi − μ)/σ]² ∼ χ²(n).


5.2.5 Theorem - Properties of the Multinomial Distribution

Suppose (X1, . . . , Xk) ∼ MULT(n, p1, . . . , pk) with joint p.f.

f(x1, . . . , xk) = [n!/(x1! x2! · · · x_{k+1}!)] p1^{x1} p2^{x2} · · · p_{k+1}^{x_{k+1}}

xi = 0, . . . , n, i = 1, . . . , k + 1, x_{k+1} = n − Σ_{i=1}^k xi, 0 < pi < 1, i = 1, . . . , k + 1, Σ_{i=1}^{k+1} pi = 1. Then

(1) (X1, . . . , Xk) has joint m.g.f.

M(t1, . . . , tk) = (p1 e^{t1} + · · · + pk e^{tk} + p_{k+1})^n, (t1, . . . , tk) ∈ ℝ^k.

(2) Any subset of X1, . . . , X_{k+1} also has a multinomial distribution. In particular

Xi ∼ BIN(n, pi), i = 1, . . . , k + 1.

(3) If T = Xi + Xj, i ≠ j, then

T ∼ BIN(n, pi + pj).

(4)

Cov(Xi, Xj) = −n pi pj.

(5) The conditional distribution of any subset of (X1, . . . , X_{k+1}) given the rest of the coordinates is a multinomial distribution. In particular the conditional p.f. of Xi given Xj = xj, i ≠ j, is

Xi|Xj = xj ∼ BIN(n − xj, pi/(1 − pj)).

(6) The conditional distribution of Xi given T = Xi + Xj = t, i ≠ j, is

Xi|Xi + Xj = t ∼ BIN(t, pi/(pi + pj)).


5.2.6 Definition - Multivariate Normal Distribution

Let X = (X1, . . . , Xk)ᵀ be a k × 1 random vector with E(Xi) = μi and Cov(Xi, Xj) = σij, i, j = 1, . . . , k. (Note: Cov(Xi, Xi) = σii = Var(Xi) = σi².) Let μ = (μ1, . . . , μk)ᵀ be the mean vector and Σ be the k × k symmetric covariance matrix whose (i, j) entry is σij. Suppose also that Σ⁻¹ exists. If the joint p.d.f. of (X1, . . . , Xk) is given by

f(x1, . . . , xk) = [1/((2π)^{k/2} |Σ|^{1/2})] exp[−(1/2)(x − μ)ᵀΣ⁻¹(x − μ)], x ∈ ℝ^k,

where x = (x1, . . . , xk)ᵀ, then X is said to have a multivariate normal distribution. We write X ∼ MVN(μ, Σ).

5.2.7 Theorem - Properties of the MVN Distribution

Suppose X = (X1, . . . , Xk)^T ∼ MVN(μ, Σ). Then

(1) X has joint m.g.f.

M(t) = exp(μ^T t + (1/2) t^T Σ t),   t = (t1, . . . , tk)^T ∈ ℝ^k.

(2) Any subset of X1, . . . , Xk also has a MVN distribution, and in particular

Xi ∼ N(μi, σi^2),   i = 1, . . . , k.

(3) (X − μ)^T Σ^{−1} (X − μ) ∼ χ^2(k).

(4) Let c = (c1, . . . , ck)^T be a nonzero vector of constants; then

c^T X = ∑_{i=1}^k ci Xi ∼ N(c^T μ, c^T Σ c).

(5) Let A be a k × p matrix of constants of rank p; then

A^T X ∼ MVN(A^T μ, A^T Σ A).

(6) The conditional distribution of any subset of (X1, . . . , Xk) given the rest of the coordinates is a multivariate normal distribution. In particular the conditional p.d.f. of Xi given Xj = xj, i ≠ j, is

Xi | Xj = xj ∼ N(μi + ρij σi (xj − μj)/σj, (1 − ρij^2) σi^2).
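Property (4) can also be verified numerically; a minimal sketch, assuming numpy is available:

import numpy as np

# Empirical check of property (4): c^T X ~ N(c^T mu, c^T Sigma c).
rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
c = np.array([1.0, 2.0, -1.0])

Y = rng.multivariate_normal(mu, Sigma, size=200_000) @ c
print(Y.mean(), c @ mu)            # sample mean vs. theoretical mean c^T mu
print(Y.var(), c @ Sigma @ c)      # sample variance vs. theoretical variance c^T Sigma c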


In the following figures the BVN joint p.d.f. is graphed. The graphs all have the same mean vector μ = [0 0]^T but different variance/covariance matrices Σ. The axes all have the same scale.

Figure 5.1: Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [1 0; 0 1].


Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [1 0.5; 0.5 1].


Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [0.6 0.5; 0.5 1].


5.3 Limiting Distributions

5.3.1 Definition - Convergence in Probability to a Constant

The sequence of random variables X1, X2, . . . , Xn, . . . converges in probability to the constant c if for each ε > 0

lim_{n→∞} P(|Xn − c| ≥ ε) = 0.

We write Xn →p c.

5.3.2 Theorem

If X1, X2, . . . , Xn, . . . is a sequence of random variables such that

lim_{n→∞} P(Xn ≤ x) = 0 for x < b   and   lim_{n→∞} P(Xn ≤ x) = 1 for x > b,

then Xn →p b.

5.3.3 Theorem - Weak Law of Large Numbers

If X1, . . . , Xn is a random sample from a distribution with E(Xi) = μ and Var(Xi) = σ^2 < ∞ then

X̄n = (1/n) ∑_{i=1}^n Xi →p μ.

5.3.4 Problem

Suppose X1, X2, . . . , Xn, . . . is a sequence of random variables such that E(Xn) = c and lim_{n→∞} Var(Xn) = 0. Show that Xn →p c.

5.3.5 Problem

Suppose X1, X2, . . . , Xn, . . . is a sequence of random variables such that Xn/n →p b < 0. Show that lim_{n→∞} P(Xn < 0) = 1.

5.3.6 Problem

Show that if Yn →p a and

lim_{n→∞} P(|Xn| ≤ Yn) = 1


then Xn is bounded in probability, that is, there exists b > 0 such that

lim_{n→∞} P(|Xn| ≤ b) = 1.

5.3.7 Definition - Convergence in Distribution

The sequence of random variables X1, X2, . . . , Xn, . . . converges in distribution to a random variable X if

lim_{n→∞} P(Xn ≤ x) = P(X ≤ x) = F(x)

for all values of x at which F(x) is continuous. We write Xn →D X.

5.3.8 Theorem

Suppose X1, . . . , Xn, . . . is a sequence of random variables with E(Xn) = μn and Var(Xn) = σn^2. If lim_{n→∞} μn = μ and lim_{n→∞} σn^2 = 0 then Xn →p μ.

5.3.9 Central Limit Theorem

If X1, . . . , Xn is a random sample from a distribution with E(Xi) = μ and Var(Xi) = σ^2 < ∞ then

Yn = (∑_{i=1}^n Xi − nμ)/(√n σ) = √n (X̄n − μ)/σ →D Z ∼ N(0, 1).
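A minimal simulation sketch of the theorem, assuming numpy is available, using exponential data for which μ = σ = θ:

import numpy as np

# Standardized sample means from an EXP(theta) distribution should be
# approximately N(0, 1) for large n (here mu = sigma = theta).
rng = np.random.default_rng(4)
theta, n, n_rep = 2.0, 500, 100_000

xbar = rng.exponential(theta, size=(n_rep, n)).mean(axis=1)
y = np.sqrt(n) * (xbar - theta) / theta

print(y.mean(), y.std())                  # approximately 0 and 1
print(np.quantile(y, [0.025, 0.975]))     # approximately -1.96 and 1.96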

5.3.10 Limit Theorems

1. If Xn →p a and g is continuous at a, then g(Xn) →p g(a).

2. If Xn →p a, Yn →p b and g(x, y) is continuous at (a, b), then g(Xn, Yn) →p g(a, b).

3. (Slutsky) If Xn →D X, Yn →p b and g(x, b) is continuous for all x in the support of X, then g(Xn, Yn) →D g(X, b).

4. (Delta Method) If X1, X2, . . . , Xn, . . . is a sequence of random variables such that

n^b (Xn − a) →D X

for some b > 0 and if the function g(x) is differentiable at a with g′(a) ≠ 0, then

n^b [g(Xn) − g(a)] →D g′(a) X.
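The Delta Method in item 4 can likewise be illustrated by simulation; a minimal sketch, assuming numpy is available, using g(x) = x^2 with exponential data so that √n(X̄n^2 − θ^2) →D N(0, 4θ^4):

import numpy as np

# Delta method check: the variance of sqrt(n) * (Xbar^2 - theta^2) should
# approach [g'(theta)]^2 * Var(Xi) = (2*theta)^2 * theta^2 for g(x) = x^2.
rng = np.random.default_rng(5)
theta, n, n_rep = 2.0, 1_000, 100_000

xbar = rng.exponential(theta, size=(n_rep, n)).mean(axis=1)
w = np.sqrt(n) * (xbar**2 - theta**2)

print(w.var(), (2 * theta) ** 2 * theta**2)   # both approximately 64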


5.3.11 Problem

If Xn →p a > 0, Yn →p b ≠ 0 and Zn →D Z ∼ N(0, 1), then find the limiting distributions of

(1) Xn^2   (2) √Xn   (3) XnYn   (4) Xn + Yn   (5) Xn/Yn
(6) 2Zn   (7) Zn + Yn   (8) XnZn   (9) Zn^2   (10) 1/Zn


5.4 Proofs

5.4.1 Theorem

Suppose the model is {f(x; θ); θ ∈ Ω} and let A = support of X. Partition A into the equivalence classes defined by

Ay = {x : f(x; θ)/f(y; θ) = H(x, y) for all θ ∈ Ω},   y ∈ A.   (5.1)

This is a minimal sufficient partition. The statistic T(X) which induces this partition is a minimal sufficient statistic.

5.4.2 Proof

We give the proof for the case in which A does not depend on θ.

Let T(X) be the statistic which induces the partition in (5.1). To show that T(X) is sufficient we define

B = {t : t = T(x) for some x ∈ A}.

Then the set A can be written as

A = ∪_{t∈B} At

where

At = {x : T(x) = t},   t ∈ B.

The statistic T(X) induces the partition defined by At, t ∈ B. For each At we can choose and fix one element xt ∈ At. Obviously T(xt) = t. Let g(t; θ) be a function defined on B such that

g(t; θ) = f(xt; θ),   t ∈ B.

Consider any x ∈ A. For this x we can calculate T(x) = t and thus determine the set At to which x belongs, as well as the value xt which was chosen for this set. Obviously T(x) = T(xt). By the definition of the partition induced by T(X), we know that for all x ∈ At, f(x; θ)/f(xt; θ) is a constant function of θ. Therefore for any x ∈ A we can define a function

h(x) = f(x; θ)/f(xt; θ)

where T(x) = T(xt) = t.


Therefore for all x ∈ A and θ ∈ Ω we have

f(x; θ) = f(xt; θ) [f(x; θ)/f(xt; θ)]
        = g(t; θ) h(x)
        = g(T(xt); θ) h(x)
        = g(T(x); θ) h(x)

and by the Factorization Criterion for Sufficiency, T(X) is a sufficient statistic.

To show that T(X) is a minimal sufficient statistic, suppose that T1(X) is any other sufficient statistic. By the Factorization Criterion for Sufficiency, there exist functions h1(x) and g1(t; θ) such that

f(x; θ) = g1(T1(x); θ) h1(x)

for all x ∈ A and θ ∈ Ω. Let x and y be any two points in A with T1(x) = T1(y). Then

f(x; θ)/f(y; θ) = [g1(T1(x); θ) h1(x)]/[g1(T1(y); θ) h1(y)]
               = h1(x)/h1(y),

a function of x and y which does not depend on θ, and therefore by the definition of T(X) this implies T(x) = T(y). This implies that T1 induces either the same partition of A as T(X) or a finer partition of A than T(X), and therefore T(X) is a function of T1(X). Since T(X) is a function of every other sufficient statistic, T(X) is a minimal sufficient statistic.
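As a brief added illustration of how the partition (5.1) is used, suppose X1, . . . , Xn is a random sample from POI(θ), θ > 0. Then

f(x; θ)/f(y; θ) = θ^{∑_{i=1}^n xi − ∑_{i=1}^n yi} ∏_{i=1}^n (yi!/xi!),

which is free of θ for all θ > 0 if and only if ∑_{i=1}^n xi = ∑_{i=1}^n yi, so T(X) = ∑_{i=1}^n Xi induces the partition (5.1) and is therefore minimal sufficient.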

5.4.3 Theorem

If T(X) is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω} then T(X) is a minimal sufficient statistic for {f(x; θ); θ ∈ Ω}.

5.4.4 Proof

Suppose U = U(X) is a minimal sufficient statistic for the model {f(x; θ); θ ∈ Ω}. The function E(T|U) is a function of U which does not depend on θ since U is a sufficient statistic. Also, by Definition 1.5.2, U is


a function of the sufficient statistic T, which implies E(T|U) is a function of T. Let

h(T) = T − E(T|U).

Now

E[h(T); θ] = E[T − E(T|U); θ]
           = E(T; θ) − E[E(T|U); θ]
           = E(T; θ) − E(T; θ)
           = 0,   for all θ ∈ Ω.

Since T is complete this implies

P[h(T) = 0; θ] = 1,   for all θ ∈ Ω

or

P[T = E(T|U); θ] = 1,   for all θ ∈ Ω,

and therefore T is a function of U. This can only be true if T is also a minimal sufficient statistic for the model.

The regularity conditions are repeated here since they are used in the proofs that follow.

5.4.5 Regularity Conditions

Consider the model {f(x; θ); θ ∈ Ω}. Suppose that:

(R1) The parameter space Ω is an open interval in the real line.

(R2) The densities f(x; θ) have common support, so that the set A = {x : f(x; θ) > 0} does not depend on θ.

(R3) For all x ∈ A, f(x; θ) is a continuous, three times differentiable function of θ.

(R4) The integral ∫_A f(x; θ) dx can be twice differentiated with respect to θ under the integral sign, that is,

(∂^k/∂θ^k) ∫_A f(x; θ) dx = ∫_A (∂^k/∂θ^k) f(x; θ) dx,   k = 1, 2, for all θ ∈ Ω.


(R5) For each θ0 ∈ Ω there exist a positive number c and a function M(x) (both of which may depend on θ0) such that for all θ ∈ (θ0 − c, θ0 + c)

|(∂^3/∂θ^3) log f(x; θ)| < M(x)

holds for all x ∈ A, and

E[M(X); θ] < ∞ for all θ ∈ (θ0 − c, θ0 + c).

(R6) For each θ ∈ Ω,

0 < E{[(∂^2/∂θ^2) log f(X; θ)]^2; θ} < ∞.

(R7) The probability (density) functions corresponding to different values of the parameters are distinct, that is, θ ≠ θ* implies f(x; θ) ≠ f(x; θ*).

The following lemma is required for the proof of consistency of the M.L. estimator.

5.4.6 Lemma

If X is a non-degenerate random variable with model {f(x; θ); θ ∈ Ω} satisfying (R1)-(R7), then

E[log f(X; θ) − log f(X; θ0); θ0] < 0 for all θ, θ0 ∈ Ω with θ ≠ θ0.

5.4.7 Proof

Since g(x) = −log x is strictly convex and X is a non-degenerate random variable, then by the corollary to Jensen's inequality

E[log f(X; θ) − log f(X; θ0); θ0] = E{log[f(X; θ)/f(X; θ0)]; θ0}
                                  < log{E[f(X; θ)/f(X; θ0); θ0]}   for all θ, θ0 ∈ Ω with θ ≠ θ0.


Since

E[f(X; θ)/f(X; θ0); θ0] = ∫_A [f(x; θ)/f(x; θ0)] f(x; θ0) dx = ∫_A f(x; θ) dx = 1   for all θ ∈ Ω,

therefore

E[l(θ; X) − l(θ0; X); θ0] < log(1) = 0   for all θ, θ0 ∈ Ω with θ ≠ θ0.

5.4.8 Theorem

Suppose (X1, . . . , Xn) is a random sample from a model {f(x; θ); θ ∈ Ω} satisfying regularity conditions (R1)-(R7). Then with probability tending to 1 as n → ∞, the likelihood equation or score equation

∑_{i=1}^n (∂/∂θ) log f(Xi; θ) = 0

has a root θ̂n such that θ̂n converges in probability to θ0, the true value of the parameter, as n → ∞.

5.4.9 Proof

Let

l_n(θ; X) = l_n(θ; X1, . . . , Xn) = log[∏_{i=1}^n f(Xi; θ)] = ∑_{i=1}^n log f(Xi; θ),   θ ∈ Ω.

Since f(x; θ) is differentiable with respect to θ for all θ ∈ Ω, this implies l_n(θ; x) is differentiable with respect to θ for all θ ∈ Ω, and also l_n(θ; x) is a continuous function of θ for all θ ∈ Ω.

By the above lemma we have, for any δ > 0 such that θ0 ± δ ∈ Ω,

E[l_n(θ0 + δ; X) − l_n(θ0; X); θ0] < 0   (5.2)

and

E[l_n(θ0 − δ; X) − l_n(θ0; X); θ0] < 0.   (5.3)


By (5.2) and the WLLN,

(1/n)[l_n(θ0 + δ; X) − l_n(θ0; X)] →p b < 0,

which implies

lim_{n→∞} P[l_n(θ0 + δ; X) − l_n(θ0; X) < 0] = 1

(see Problem 5.3.5). Therefore there exists a sequence of constants an such that 0 < an < 1, lim_{n→∞} an = 0 and

P[l_n(θ0 + δ; X) − l_n(θ0; X) < 0] = 1 − an.

Let

An = An(δ) = {x : l_n(θ0 + δ; x) − l_n(θ0; x) < 0}

where x = (x1, . . . , xn). Then

lim_{n→∞} P(An; θ0) = lim_{n→∞} (1 − an) = 1.

Let

Bn = Bn(δ) = {x : l_n(θ0 − δ; x) − l_n(θ0; x) < 0}.

Then by the same argument as above there exists a sequence of constants bn such that 0 < bn < 1, lim_{n→∞} bn = 0 and

lim_{n→∞} P(Bn; θ0) = lim_{n→∞} (1 − bn) = 1.

Now

P(An ∩ Bn; θ0) = P(An; θ0) + P(Bn; θ0) − P(An ∪ Bn; θ0)
              = 1 − an + 1 − bn − P(An ∪ Bn; θ0)
              = 1 − an − bn + [1 − P(An ∪ Bn; θ0)]
              ≥ 1 − an − bn

since 1 − P(An ∪ Bn; θ0) ≥ 0. Therefore

lim_{n→∞} P(An ∩ Bn; θ0) = lim_{n→∞} (1 − an − bn) = 1.   (5.4)

Continuity of l_n(θ; x) for all θ ∈ Ω implies that for any x ∈ An ∩ Bn, there exists a value θ̂n(δ) = θ̂n(δ; x) ∈ (θ0 − δ, θ0 + δ) such that l_n(θ; x)


has a local maximum at θ = θ̂n(δ). Since l_n(θ; x) is differentiable with respect to θ this implies (Fermat's theorem)

∂l_n(θ; x)/∂θ = ∑_{i=1}^n (∂/∂θ) log f(xi; θ) = 0   for θ = θ̂n(δ).

Note that l_n(θ; x) may have more than one local maximum on the interval (θ0 − δ, θ0 + δ) and therefore θ̂n(δ) may not be unique. If x ∉ An ∩ Bn, then θ̂n(δ) may not exist, in which case we define θ̂n(δ) to be a fixed arbitrary value. Note also that the sequence of roots {θ̂n(δ)} depends on δ.

Let θ̂n = θ̂n(x) be the value of θ closest to θ0 such that ∂l_n(θ; x)/∂θ = 0. If such a root does not exist we define θ̂n to be a fixed arbitrary value. Since

1 ≥ P[θ̂n ∈ (θ0 − δ, θ0 + δ); θ0] ≥ P[θ̂n(δ) ∈ (θ0 − δ, θ0 + δ); θ0] ≥ P(An ∩ Bn; θ0),   (5.5)

then by (5.4) and the Squeeze Theorem we have

lim_{n→∞} P[θ̂n ∈ (θ0 − δ, θ0 + δ); θ0] = 1.

Since this is true for all δ > 0, θ̂n →p θ0.
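A minimal numerical illustration of the theorem, assuming numpy is available, using the EXP(θ) model for which the score equation has the explicit root θ̂n = X̄n:

import numpy as np

# The root of the EXP(theta) score equation is theta_hat = Xbar;
# it should approach theta0 as n grows.
rng = np.random.default_rng(6)
theta0 = 2.0

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, rng.exponential(theta0, size=n).mean())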

5.4.10 Theorem

Suppose (R1)-(R7) hold. Suppose θ̂n is a consistent root of the likelihood equation as in Theorem 5.4.8. Then

√J(θ0) (θ̂n − θ0) →D Z ∼ N(0, 1)

where θ0 is the true value of the parameter.

5.4.11 Proof

Let

S1(θ; x) = (∂/∂θ) log f(x; θ)

and

I1(θ; x) = −(∂/∂θ) S1(θ; x) = −(∂^2/∂θ^2) log f(x; θ)


be the score and information functions respectively for one observation from {f(x; θ); θ ∈ Ω}. Since {f(x; θ); θ ∈ Ω} is a regular model,

E[S1(θ; X); θ] = 0,   θ ∈ Ω   (5.6)

and

Var[S1(θ; X); θ] = E[I1(θ; X); θ] = J1(θ) < ∞,   θ ∈ Ω.   (5.7)

Let

An = {(x1, . . . , xn) : ∑_{i=1}^n (∂/∂θ) log f(xi; θ) = ∑_{i=1}^n S1(θ; xi) = 0 has a solution}

and for (x1, . . . , xn) ∈ An, let θ̂n = θ̂n(x1, . . . , xn) be the value of θ such that ∑_{i=1}^n S1(θ̂n; xi) = 0.

Expand ∑_{i=1}^n S1(θ̂n; xi) as a function of θ̂n about θ0 to obtain

∑_{i=1}^n S1(θ̂n; xi) = ∑_{i=1}^n S1(θ0; xi) − (θ̂n − θ0) ∑_{i=1}^n I1(θ0; xi)
    + (1/2)(θ̂n − θ0)^2 ∑_{i=1}^n (∂^3/∂θ^3) log f(xi; θ)|_{θ=θ*n}   (5.8)

where θ*n = θ*n(x1, . . . , xn) lies between θ0 and θ̂n, by Taylor's Theorem.

Suppose (x1, . . . , xn) ∈ An. Then the left side of (5.8) equals zero and thus

∑_{i=1}^n S1(θ0; xi) = (θ̂n − θ0) ∑_{i=1}^n I1(θ0; xi) − (1/2)(θ̂n − θ0)^2 ∑_{i=1}^n (∂^3/∂θ^3) log f(xi; θ)|_{θ=θ*n}
    = (θ̂n − θ0) [∑_{i=1}^n I1(θ0; xi) − (1/2)(θ̂n − θ0) ∑_{i=1}^n (∂^3/∂θ^3) log f(xi; θ)|_{θ=θ*n}]

or, dividing both sides by √(nJ1(θ0)) and recalling J(θ0) = nJ1(θ0),

∑_{i=1}^n S1(θ0; xi)/√(nJ1(θ0))
    = [(θ̂n − θ0)/√(nJ1(θ0))] [∑_{i=1}^n I1(θ0; xi) − (1/2)(θ̂n − θ0) ∑_{i=1}^n (∂^3/∂θ^3) log f(xi; θ)|_{θ=θ*n}]
    = √J(θ0) (θ̂n − θ0) [(1/n) ∑_{i=1}^n I1(θ0; xi)/J1(θ0)
        − ((θ̂n − θ0)/(2J1(θ0))) (1/n) ∑_{i=1}^n (∂^3/∂θ^3) log f(xi; θ)|_{θ=θ*n}].


Therefore for (X1, . . . , Xn) we have

[∑_{i=1}^n S1(θ0; Xi)/√(nJ1(θ0))] I[(X1, . . . , Xn) ∈ An]   (5.9)
    = √J(θ0) (θ̂n − θ0) [(1/n) ∑_{i=1}^n I1(θ0; Xi)/J1(θ0)
        − ((θ̂n − θ0)/(2J1(θ0))) (1/n) ∑_{i=1}^n (∂^3/∂θ^3) log f(Xi; θ)|_{θ=θ*n}] I[(X1, . . . , Xn) ∈ An]

where θ*n = θ*n(X1, . . . , Xn). By an argument similar to that used in Proof 5.4.9,

lim_{n→∞} P[(X1, . . . , Xn) ∈ An; θ0] = 1.   (5.10)

Since S1(θ0; Xi), i = 1, . . . , n, are i.i.d. random variables with mean and variance given by (5.6) and (5.7), then by the CLT

∑_{i=1}^n S1(θ0; Xi)/√(nJ1(θ0)) →D Z ∼ N(0, 1).   (5.11)

Since I1(θ0; Xi), i = 1, . . . , n, are i.i.d. random variables with mean J1(θ0), then by the WLLN

(1/n) ∑_{i=1}^n I1(θ0; Xi) →p J1(θ0)

and thus

(1/n) ∑_{i=1}^n I1(θ0; Xi)/J1(θ0) →p 1   (5.12)

by the Limit Theorems.

To complete the proof we need to show

(θ̂n − θ0) [(1/n) ∑_{i=1}^n (∂^3/∂θ^3) log f(Xi; θ)|_{θ=θ*n}] →p 0.   (5.13)

Since θ̂n →p θ0, we only need to show that

(1/n) ∑_{i=1}^n (∂^3/∂θ^3) log f(Xi; θ)|_{θ=θ*n}   (5.14)


is bounded in probability. Since θ̂n →p θ0 implies θ*n →p θ0, then by (R5)

lim_{n→∞} P{|(1/n) ∑_{i=1}^n (∂^3/∂θ^3) log f(Xi; θ)|_{θ=θ*n}| ≤ (1/n) ∑_{i=1}^n M(Xi); θ0} = 1.

Also by (R5) and the WLLN

(1/n) ∑_{i=1}^n M(Xi) →p E[M(X); θ0] < ∞.

It follows that (5.14) is bounded in probability (see Problem 5.3.6). Therefore

√J(θ0) (θ̂n − θ0) →D Z ∼ N(0, 1)

follows from (5.9), (5.11)-(5.13) and Slutsky's Theorem.
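A minimal numerical illustration of the theorem, assuming numpy is available, again using the EXP(θ) model for which θ̂n = X̄n and J1(θ) = 1/θ^2:

import numpy as np

# Check sqrt(n * J1(theta0)) * (theta_hat - theta0) ~ N(0, 1) for EXP(theta),
# where theta_hat = Xbar and J1(theta) = 1 / theta^2.
rng = np.random.default_rng(7)
theta0, n, n_rep = 2.0, 1_000, 100_000

theta_hat = rng.exponential(theta0, size=(n_rep, n)).mean(axis=1)
z = np.sqrt(n / theta0**2) * (theta_hat - theta0)

print(z.mean(), z.std())                 # approximately 0 and 1
print(np.quantile(z, [0.025, 0.975]))    # approximately -1.96 and 1.96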

Special Discrete Distributions

Binomial: X ∼ BIN(n, p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = \binom{n}{x} p^x q^{n−x},  x = 0, 1, . . . , n
  Mean: np   Variance: npq   m.g.f.: (pe^t + q)^n

Bernoulli: X ∼ Bernoulli(p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = p^x q^{1−x},  x = 0, 1
  Mean: p   Variance: pq   m.g.f.: pe^t + q

Negative Binomial: X ∼ NB(k, p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = \binom{−k}{x} p^k (−q)^x,  x = 0, 1, . . .
  Mean: kq/p   Variance: kq/p^2   m.g.f.: [p/(1 − qe^t)]^k,  t < −log q

Geometric: X ∼ GEO(p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = pq^x,  x = 0, 1, . . .
  Mean: q/p   Variance: q/p^2   m.g.f.: p/(1 − qe^t),  t < −log q

Hypergeometric: X ∼ HYP(n, M, N), n = 1, 2, . . . , N, M = 0, 1, . . . , N
  p.f.: f(x) = \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n},  x = 0, 1, . . . , n
  Mean: nM/N   Variance: n(M/N)(1 − M/N)(N − n)/(N − 1)   m.g.f.: not tractable

Poisson: X ∼ POI(μ), μ > 0
  p.f.: f(x) = μ^x e^{−μ}/x!,  x = 0, 1, . . .
  Mean: μ   Variance: μ   m.g.f.: e^{μ(e^t − 1)}

Discrete Uniform: X ∼ DU(N), N = 1, 2, . . .
  p.f.: f(x) = 1/N,  x = 1, 2, . . . , N
  Mean: (N + 1)/2   Variance: (N^2 − 1)/12   m.g.f.: (1/N)(e^t − e^{(N+1)t})/(1 − e^t),  t ≠ 0
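These entries are easy to spot-check numerically; a minimal sketch assuming scipy is available (its nbinom(k, p) counts failures before the k-th success, which matches the NB(k, p) entry above):

from scipy import stats

# Negative binomial: mean kq/p and variance kq/p^2.
k, p = 5, 0.3
q = 1 - p
print(stats.nbinom(k, p).stats(moments="mv"), (k * q / p, k * q / p**2))

# Poisson: mean and variance both equal to mu.
mu = 4.0
print(stats.poisson(mu).stats(moments="mv"))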

Special Continuous Distributions

Uniform: X ∼ UNIF(a, b), a < b
  p.d.f.: f(x) = 1/(b − a),  a ≤ x ≤ b
  Mean: (a + b)/2   Variance: (b − a)^2/12   m.g.f.: (e^{bt} − e^{at})/[(b − a)t],  t ≠ 0

Normal: X ∼ N(μ, σ^2), σ > 0
  p.d.f.: f(x) = [1/(σ√(2π))] e^{−(x−μ)^2/(2σ^2)}
  Mean: μ   Variance: σ^2   m.g.f.: e^{μt + σ^2 t^2/2}

Gamma: X ∼ GAM(α, β), α > 0, β > 0
  p.d.f.: f(x) = [1/(β^α Γ(α))] x^{α−1} e^{−x/β},  x > 0
  Mean: αβ   Variance: αβ^2   m.g.f.: (1 − βt)^{−α},  t < 1/β

Inverted Gamma: X ∼ IG(α, β), α > 0, β > 0
  p.d.f.: f(x) = [1/(β^α Γ(α))] x^{−α−1} e^{−1/(βx)},  x > 0
  Mean: 1/[β(α − 1)], α > 1   Variance: 1/[β^2(α − 1)^2(α − 2)], α > 2   m.g.f.: not tractable

Exponential: X ∼ EXP(θ), θ > 0
  p.d.f.: f(x) = (1/θ) e^{−x/θ},  x ≥ 0
  Mean: θ   Variance: θ^2   m.g.f.: (1 − θt)^{−1},  t < 1/θ

Two-Parameter Exponential: X ∼ EXP(θ, μ), θ > 0
  p.d.f.: f(x) = (1/θ) e^{−(x−μ)/θ},  x ≥ μ
  Mean: μ + θ   Variance: θ^2   m.g.f.: e^{μt}(1 − θt)^{−1},  t < 1/θ

Double Exponential: X ∼ DE(θ, μ), θ > 0
  p.d.f.: f(x) = [1/(2θ)] e^{−|x−μ|/θ}
  Mean: μ   Variance: 2θ^2   m.g.f.: e^{μt}(1 − θ^2 t^2)^{−1},  |t| < 1/θ

Weibull: X ∼ WEI(θ, β), θ > 0, β > 0
  p.d.f.: f(x) = (β/θ^β) x^{β−1} e^{−(x/θ)^β},  x > 0
  Mean: θ Γ(1 + 1/β)   Variance: θ^2[Γ(1 + 2/β) − Γ^2(1 + 1/β)]   m.g.f.: not tractable

Extreme Value: X ∼ EV(θ, μ), θ > 0
  p.d.f.: f(x) = (1/θ) e^{(x−μ)/θ − e^{(x−μ)/θ}}
  Mean: μ − γθ (γ ≈ 0.5772, Euler's constant)   Variance: π^2 θ^2/6   m.g.f.: e^{μt} Γ(1 + θt),  t > −1/θ

Cauchy: X ∼ CAU(θ, μ), θ > 0
  p.d.f.: f(x) = 1/{πθ[1 + ((x − μ)/θ)^2]}
  Mean, Variance, m.g.f.: do not exist

Pareto: X ∼ PAR(α, β), α > 0, β > 0
  p.d.f.: f(x) = βα^β/x^{β+1},  x ≥ α
  Mean: αβ/(β − 1), β > 1   Variance: α^2 β/[(β − 1)^2(β − 2)], β > 2   m.g.f.: does not exist

Logistic: X ∼ LOG(θ, μ), θ > 0
  p.d.f.: f(x) = e^{−(x−μ)/θ}/{θ[1 + e^{−(x−μ)/θ}]^2}
  Mean: μ   Variance: π^2 θ^2/3   m.g.f.: e^{μt} Γ(1 − θt) Γ(1 + θt),  |t| < 1/θ

Chi-Squared: X ∼ χ^2(ν), ν = 1, 2, . . .
  p.d.f.: f(x) = [1/(2^{ν/2} Γ(ν/2))] x^{ν/2 − 1} e^{−x/2},  x > 0
  Mean: ν   Variance: 2ν   m.g.f.: (1 − 2t)^{−ν/2},  t < 1/2

Student's t: X ∼ t(ν), ν = 1, 2, . . .
  p.d.f.: f(x) = [Γ((ν + 1)/2)/(Γ(ν/2)√(νπ))] (1 + x^2/ν)^{−(ν+1)/2}
  Mean: 0, ν > 1   Variance: ν/(ν − 2), ν > 2   m.g.f.: does not exist

Snedecor's F: X ∼ F(ν1, ν2), ν1, ν2 = 1, 2, . . .
  p.d.f.: f(x) = [Γ((ν1 + ν2)/2)/(Γ(ν1/2)Γ(ν2/2))] (ν1/ν2)^{ν1/2} x^{ν1/2 − 1} (1 + ν1 x/ν2)^{−(ν1+ν2)/2},  x > 0
  Mean: ν2/(ν2 − 2), ν2 > 2   Variance: 2ν2^2(ν1 + ν2 − 2)/[ν1(ν2 − 2)^2(ν2 − 4)], ν2 > 4   m.g.f.: does not exist

Beta: X ∼ BETA(a, b), a > 0, b > 0
  p.d.f.: f(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1},  0 < x < 1
  Mean: a/(a + b)   Variance: ab/[(a + b)^2(a + b + 1)]   m.g.f.: not tractable

Special Multivariate Distributions

Multinomial: X = (X1, X2, . . . , Xk) ∼ MULT(n, p1, . . . , pk), 0 < pi < 1, ∑_{i=1}^{k+1} pi = 1
  p.f.: f(x1, . . . , xk) = [n!/(x1! x2! · · · xk+1!)] p1^{x1} p2^{x2} · · · p_{k+1}^{x_{k+1}},  0 ≤ xi ≤ n, xk+1 = n − ∑_{i=1}^k xi
  m.g.f.: (p1 e^{t1} + · · · + pk e^{tk} + p_{k+1})^n

Bivariate Normal: X = (X1, X2)^T ∼ BVN(μ, Σ), σ1 > 0, σ2 > 0, −1 < ρ < 1, where
  μ = (μ1, μ2)^T and Σ = [σ1^2  ρσ1σ2; ρσ1σ2  σ2^2]
  p.d.f.: f(x1, x2) = [1/(2πσ1σ2√(1 − ρ^2))] exp{−[1/(2(1 − ρ^2))][((x1 − μ1)/σ1)^2 − 2ρ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)^2]}
        = [1/(2π|Σ|^{1/2})] exp[−(1/2)(x − μ)^T Σ^{−1}(x − μ)]
  m.g.f.: exp(μ^T t + (1/2) t^T Σ t),  t = (t1, t2)^T