
Mathematical Statistics

Claudia Redenbach

TU Kaiserslautern, Winter Term 2012/2013


Contents

1 Introduction

2 Repetition and Notation
  2.1 Basic notions
  2.2 Examples of distribution families
  2.3 Useful inequalities
  2.4 Concepts of convergence
  2.5 Laws of large numbers and limit theorems
  2.6 Random vectors
  2.7 Conditional expectation and probabilities

3 Parameter Estimation
  3.1 Properties of estimators
    3.1.1 Consistency, loss and risk
    3.1.2 Unbiasedness
    3.1.3 Sufficiency and completeness
    3.1.4 Exponential families
    3.1.5 Efficiency
  3.2 Bayes estimators
  3.3 Minimax estimators
  3.4 Maximum likelihood estimators
  3.5 Consistency and asymptotic normality of M-estimators
  3.6 Asymptotic comparison of estimators

4 Confidence Sets
  4.1 Definitions
  4.2 Important distribution families
  4.3 Estimators of parameters of a normal distribution and their distributions
  4.4 Confidence intervals for the parameters of some common distributions
  4.5 Two-sample problems

5 Hypothesis Testing
  5.1 Basic notions
  5.2 Tests for normally distributed data
  5.3 Likelihood ratio tests
  5.4 The χ²-test
    5.4.1 Derivation of the χ²-test
    5.4.2 Goodness-of-fit tests
    5.4.3 Test of independence
  5.5 Asymptotic results for likelihood ratio tests

6 Empirical Processes and Kolmogorov-Smirnov Test
  6.1 Definitions
  6.2 Weak convergence of stochastic processes
  6.3 The Functional Central Limit Theorem
  6.4 Goodness-of-fit tests

7 Statistical Functionals and Applications
  7.1 Statistical functionals
  7.2 Asymptotics for statistical functionals
  7.3 Robustness

8 Bootstrap
  8.1 The non-parametric bootstrap
  8.2 Applications


Chapter 1

Introduction

Example 1.0.1
A company has developed a new winter tyre. Now they want to test its performance. They measure the stopping distance of a car driving at constant speed on an icy road several times and try to answer the following questions:

(i) What is the expected stopping distance µ under the given conditions?

(ii) Can we find an interval which contains the true value of µ with high probability?

(iii) Is the expected stopping distance shorter than that of the old tyre model or of the tyre produced by a competitor?

So a typical setting looks as follows:

(i) We observe a sample x = (x1, . . . , xn) as outcome of a random experiment.

(ii) We interpret the sample as a realisation of a random variable X, i.e., x = X(ω) for ω ∈ Ω. We assume that we know the distribution of X up to some unknown parameter θ.

(iii) Given the sample x we draw some conclusions on the parameter θ:

• Find a value for θ (point estimation).

• Find a set which contains the true parameter with high probability (confidence sets).

• Test whether the data support a hypothesis on the value of θ.

• Often the information contained in the sample is reduced using a suitable transformation T(x). This transformation is called a statistic.


Chapter 2

Repetition and Notation

2.1 Basic notions

Definition 2.1.1 (σ-algebra)

(i) A σ-algebra A is a system of subsets of a set Ω such that

(a) ∅ ∈ A

(b) If A ∈ A then A^c ∈ A.

(c) If A_1, A_2, . . . ∈ A then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$.

(ii) A pair (Ω,A), where A is a σ-algebra on Ω, is called a measurable space.

(iii) The Borel σ-algebra B of R^n is the smallest σ-algebra containing all open subsets of R^n.

Definition 2.1.2 (Probability measure)
Let (Ω,A) be a measurable space. A mapping P : A → [0, 1] is a probability measure on (Ω,A) if

(i) P(∅) = 0, P(Ω) = 1

(ii) If A_1, A_2, . . . ∈ A are pairwise disjoint then

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) \qquad \text{(σ-additivity).}$$

(iii) P (Ac) = 1− P (A) for A ∈ A.

The triple (Ω,A, P) is called a probability space.

Definition 2.1.3 (Random variable)
Let (Ω,A, P) be a probability space and (S,S) a measurable space.

(i) A mapping X : Ω → S is called measurable if X^{-1}(T) ∈ A for all T ∈ S.

(ii) A random variable is a measurable mapping X : Ω→ S.

(iii) We will use capital letters for random variables X. The lower case letter x = X(ω) will represent its realisation, i.e. a fixed element in S.


Definition 2.1.4 (Distribution)
The distribution or law of a random variable X : Ω → R^n is the probability measure P_X on R^n given by

$$P_X(B) = P(X \in B), \qquad B \in \mathcal{B}.$$

We will write L(X) = P_X.

Definition 2.1.5 (Distribution function)
For a random variable X = (X_1, . . . , X_n) : (Ω,A, P) → (R^n, B) the function F_X : R^n → [0, 1] given by

$$x = (x_1, \ldots, x_n) \mapsto P(X_1 \le x_1, \ldots, X_n \le x_n)$$

is called the distribution function of X.

Definition 2.1.6 (Absolutely continuous distribution)
A distribution P_X is called absolutely continuous if it is defined by a probability density f, i.e.,

$$P_X(B) = \int_B f(x)\,dx,$$

with

(i) f(x) ≥ 0 for all x ∈ R^n,

(ii) $\int_{\mathbb{R}^n} f(x)\,dx = 1$.

Then

(i) $F_X(y) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_n} f(x_1, \ldots, x_n)\,dx_n \cdots dx_1$, and

(ii) $F_X'(x) = f(x)$ for almost all x ∈ R.

Definition 2.1.7 (Discrete distribution)
A distribution P_X is called discrete if the random variable X takes at most countably many different values x_1, x_2, . . . with probabilities p_1, p_2, . . . ≥ 0. Then

(i) $\sum_{i=1}^{\infty} p_i = 1$

(ii) $P_X(B) = \sum_{i=1}^{\infty} p_i\, \mathbb{1}_B(x_i)$

(iii) $P_X(\{x_i\}) = p_i = P(X = x_i)$

Definition 2.1.8 (Independence of random variables)
The random variables X_1, . . . , X_n are independent if

$$P(X_1 \in B_1, \ldots, X_n \in B_n) = \prod_{i=1}^{n} P(X_i \in B_i) \qquad \forall B_1, \ldots, B_n \in \mathcal{B}.$$


Definition 2.1.9 (Identically distributed random variables)
The random variables X_1, . . . , X_n are identically distributed if

L(Xk) = L(X1), k = 2, . . . , n.

Definition 2.1.10
Let X, Y be real-valued random variables.

(i) The expectation of X is given by

$$E(X) = \int_{\mathbb{R}} x\,dP_X(x) = \begin{cases} \int_{\mathbb{R}} x f(x)\,dx & \text{if } X \text{ is absolutely continuous,} \\[4pt] \sum_{i=1}^{\infty} x_i P(X = x_i) = \sum_{i=1}^{\infty} x_i p_i & \text{if } X \text{ is discrete.} \end{cases}$$

(ii) $EX^k$ and $E|X|^k$ are the k-th moment and the k-th absolute moment of X, respectively.

(iii) The variance of X is

$$\operatorname{var} X = E[(X - EX)^2] = E[X^2] - [E(X)]^2,$$

its square root is called the standard deviation of X.

(iv) The covariance of X and Y is defined as

$$\operatorname{cov}(X, Y) = E[(X - EX)(Y - EY)].$$

(v) The correlation of X and Y is

$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}.$$

X and Y are uncorrelated if corr(X, Y) = 0.

Remark 2.1.11
Let X and Y be real-valued random variables.

(i) The expectation is linear, i.e. E(aX + bY) = aE(X) + bE(Y) ∀a, b ∈ R.

(ii) For g : R → R we have

$$E[g(X)] = \int_{\mathbb{R}} g(x)\,dP_X(x) = \begin{cases} \int_{\mathbb{R}} g(x) f(x)\,dx & \text{if } X \text{ is absolutely continuous,} \\[4pt] \sum_{i=1}^{\infty} g(x_i) P(X = x_i) = \sum_{i=1}^{\infty} g(x_i) p_i & \text{if } X \text{ is discrete.} \end{cases}$$

(iii) var(aX) = a^2 var(X) ∀a ∈ R.

(iv) var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)


(v) −1 ≤ corr(X, Y) ≤ 1, with corr(X, Y) = ±1 ⟺ Y = aX + b for some b ∈ R and a ≷ 0.

Correlation is a measure of linear dependence of X and Y. Independent random variables are uncorrelated. But uncorrelated random variables are not necessarily independent since the dependence can be non-linear.

Definition 2.1.12
Let X_1, . . . , X_n be real-valued random variables.

(i) The arithmetic or sample mean is

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$

(ii) The sample median is

$$\operatorname{med}(X) = \begin{cases} X_{(k+1)} & \text{if } n = 2k + 1 \\[2pt] \tfrac{1}{2}\big(X_{(k)} + X_{(k+1)}\big) & \text{if } n = 2k. \end{cases}$$

Here we have used the order statistic

$$X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}.$$

(iii) The sample variance is

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

(iv) For n ∈ N and x_1, . . . , x_n ∈ R the empirical distribution function F_n : R → [0, 1] is defined as

$$F_n(z) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x_i \le z\}}, \qquad z \in \mathbb{R}.$$
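
As a quick added illustration (not part of the original notes), these sample quantities can be computed directly; the following is a minimal Python sketch for a small, arbitrarily chosen data vector.

    import numpy as np

    x = np.array([2.1, 3.5, 1.7, 4.2, 2.9])      # hypothetical sample x_1, ..., x_n
    n = len(x)

    mean = x.sum() / n                            # arithmetic mean
    median = np.median(x)                         # sample median via order statistics
    s2 = ((x - mean) ** 2).sum() / (n - 1)        # sample variance S_n^2

    def edf(z, data=x):
        # empirical distribution function F_n(z) = (1/n) #{i : x_i <= z}
        return np.mean(data <= z)

    print(mean, median, s2, edf(3.0))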

2.2 Examples of distribution families

Definition 2.2.1

(i) Binomial distribution B(n, p)
Ω = {0, . . . , n}, 0 < p < 1,

$$p_k = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, \ldots, n.$$

L(X) = B(n, p): E(X) = np, var X = np(1 − p).
Special case n = 1: Bernoulli distribution.

(ii) Poisson distribution Poi(λ)
Ω = N_0, λ > 0,

$$p_k = \frac{\lambda^k}{k!} \exp(-\lambda), \qquad k \in \mathbb{N}_0.$$

L(X) = Poi(λ): E(X) = λ, var X = λ.


(iii) Geometric distribution Geo(p)
Ω = N, 0 < p < 1,

$$p_k = p\,(1-p)^{k-1}, \qquad k \in \mathbb{N}.$$

L(X) = Geo(p): E(X) = 1/p, var X = (1 − p)/p^2.

(iv) Normal distribution N(µ, σ^2)
with parameters µ ∈ R, σ^2 > 0 and density

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R}.$$

L(X) = N(µ, σ^2): E(X) = µ, var X = σ^2.

(v) Uniform distribution U(a, b)
with parameters a < b and density

$$f(x) = \frac{1}{b-a}\, \mathbb{1}_{[a,b]}(x), \qquad x \in \mathbb{R}.$$

L(X) = U(a, b): E(X) = (a + b)/2, var X = (b − a)^2/12.

(vi) Gamma distribution Γ(a, b)
Ω = (0, ∞), a > 0, b > 0 with density

$$f(x) = \frac{a^b}{\Gamma(b)}\, x^{b-1} \exp(-ax)\, \mathbb{1}_{(0,\infty)}(x), \qquad x \in \mathbb{R},$$

where $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt$.

L(X) = Γ(a, b): E(X) = b/a, var X = b/a^2.
Special cases:
b = 1: Exponential distribution
a = 1/2, b = n/2: χ²-distribution

2.3 Useful inequalities

Theorem 2.3.1 (Markov inequality)
Let X be a real random variable whose expectation exists. Then

$$P(|X| \ge \varepsilon) \le \frac{1}{\varepsilon} E(|X|) \qquad \forall \varepsilon > 0.$$

Theorem 2.3.2 (Chebyshev inequality)
Let X be a real random variable whose variance exists. Then

$$P(|X - E(X)| \ge \varepsilon) \le \frac{1}{\varepsilon^2} \operatorname{var}(X) \qquad \forall \varepsilon > 0.$$
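
As a brief added illustration of Theorem 2.3.2: choosing ε = 2√var(X) gives the distribution-free bound

$$P\big(|X - E(X)| \ge 2\sqrt{\operatorname{var}(X)}\big) \le \tfrac{1}{4},$$

whatever the distribution of X.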


Theorem 2.3.3 (Jensen inequality)
Let X be a real random variable with finite expectation and let g : R → R be convex such that E[g(X)] exists. Then

$$g[E(X)] \le E[g(X)].$$

Theorem 2.3.4 (Cauchy-Schwarz inequality)
Let X and Y be real random variables such that E(X^2) and E(Y^2) exist. Then

$$[E(XY)]^2 \le E(X^2)\, E(Y^2).$$

2.4 Concepts of convergence

Definition 2.4.1
Let (X_n)_{n∈N} be a sequence of random variables.

(i) X_n converges to X in p-th mean, $X_n \xrightarrow{L^p} X$, if

$$E[|X_n - X|^p] \to 0 \quad \text{for } n \to \infty.$$

(ii) X_n converges to X in probability, $X_n \xrightarrow{p} X$, if for all ε > 0

$$P[|X_n - X| \ge \varepsilon] \to 0 \quad \text{for } n \to \infty.$$

(iii) X_n converges to X almost surely, $X_n \xrightarrow{a.s.} X$, if

$$P\Big[\lim_{n\to\infty} X_n = X\Big] = 1.$$

(iv) Assume that X, X_1, X_2, . . . have the distribution functions F, F_1, F_2, . . .. X_n converges to X in distribution, $X_n \xrightarrow{\mathcal{L}} X$, if

$$F_n(x) \to F(x)$$

for all points of continuity of F.

Remark 2.4.2
We have

$$X_n \xrightarrow{L^p} X \;\Longrightarrow\; X_n \xrightarrow{L^q} X \quad \text{for } p > q \ge 1,$$
$$X_n \xrightarrow{L^p} X \;\Longrightarrow\; X_n \xrightarrow{p} X,$$
$$X_n \xrightarrow{a.s.} X \;\Longrightarrow\; X_n \xrightarrow{p} X,$$
$$X_n \xrightarrow{p} X \;\Longrightarrow\; X_n \xrightarrow{\mathcal{L}} X.$$

Theorem 2.4.3 (Slutsky's Lemma)
Let X_n be a sequence of random variables which converges in distribution to a random variable X and let A_n and B_n be sequences of random variables which converge in probability to constants a and b. Then

$$A_n + B_n X_n \xrightarrow[n\to\infty]{\mathcal{L}} a + bX.$$


2.5 Laws of large numbers and limit theorems

Theorem 2.5.1 (Strong law of large numbers)
Let X_1, X_2, . . . be iid real random variables with EX_1 = µ < ∞. Then

$$\bar{X}_n \xrightarrow{a.s.} \mu.$$

Theorem 2.5.2 (Central limit theorem)
Let X_1, X_2, . . . be iid real random variables with EX_1 = µ and 0 < var X_1 = σ^2 < ∞. Then

$$\frac{S_n - E S_n}{\sqrt{\operatorname{var} S_n}} = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{\mathcal{L}} Z$$

where $S_n = \sum_{i=1}^{n} X_i$ and L(Z) = N(0, 1).

Theorem 2.5.3 (Central limit theorem of Moivre-Laplace)
Let L(X_n) = B(n, p) with n ∈ N and 0 < p < 1. Then

$$\lim_{n\to\infty} P\left(\frac{X_n - np}{\sqrt{np(1-p)}} \le x\right) = \Phi(x),$$

where Φ is the distribution function of a normally distributed random variable with mean 0 and variance 1.

Theorem 2.5.4 (Glivenko-Cantelli)
If X_1, X_2, . . . is a sequence of iid random variables then

$$P\left(\lim_{n\to\infty} \sup_{z\in\mathbb{R}} |F_n(z) - F_X(z)| = 0\right) = 1.$$
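
The uniform convergence in Theorem 2.5.4 can be watched numerically; the following minimal Python sketch (an added illustration with a hypothetical U(0, 1) sample, where F_X(z) = z) prints the distance sup_z |F_n(z) − F_X(z)| for growing n.

    import numpy as np

    rng = np.random.default_rng(0)

    for n in [10, 100, 1000, 10000]:
        x = np.sort(rng.uniform(size=n))          # iid U(0,1) sample, ordered
        ecdf = np.arange(1, n + 1) / n            # F_n at the jump points x_(i)
        # the supremum is attained at a jump of F_n
        d_n = np.max(np.maximum(np.abs(ecdf - x), np.abs(ecdf - 1 / n - x)))
        print(n, d_n)

The printed distances decrease towards 0, in line with the Glivenko-Cantelli theorem.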

2.6 Random vectors

A random vector X is a random variable with values in R^n. Many notions known for real-valued random variables can be generalised to the case of random vectors. For the distribution this was already done in Definition 2.1.4.

Definition 2.6.1 (Expectation)
Let X = (X_1, . . . , X_n) be a random variable in R^n.

The expectation of X is given by

$$E(X) = \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} x\,dP_X(x) = (E(X_1), \ldots, E(X_n)).$$

The covariance matrix of X is given by

$$\operatorname{cov}(X) = \big(\operatorname{cov}(X_i, X_j)\big)_{i,j=1,\ldots,n}.$$

Then cov(X) = E[(X − E(X))(X − E(X))^T].


Theorem 2.6.2
Let A be an m × n matrix and b ∈ R^m. For the random vector Y = AX + b we have

$$EY = A\,EX + b, \qquad \operatorname{cov}(Y) = A \operatorname{cov}(X)\, A^T.$$

Remark 2.6.3
Let X be a random vector. Then cov(X) is positive semidefinite, i.e., y^T cov(X) y ≥ 0 for all y ∈ R^n.

Definition 2.6.4 (Multivariate normal distribution)
A random vector X on R^n has a multivariate normal distribution with mean µ ∈ R^n and positive definite covariance matrix Σ > 0 if it has the probability density

$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\; e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad x \in \mathbb{R}^n.$$

We write L(X) = N_n(µ, Σ). The coordinates X_1, . . . , X_n are jointly normally distributed.

Remark 2.6.5

(i) Let X_1, X_2 be jointly normally distributed. Then X_1, X_2 are independent iff they are uncorrelated.

(ii) Let L(X) = N_n(µ, Σ) and Y = AX + b ∈ R^m. If AΣA^T is invertible then

$$\mathcal{L}(Y) = N_m(A\mu + b,\, A\Sigma A^T).$$

(iii) Let L(X) = N_n(µ, Σ). Then there are Z_1, . . . , Z_n iid with L(Z_i) = N(0, 1) such that

$$X = \Sigma^{\frac{1}{2}} Z + \mu, \qquad \text{where } Z = (Z_1, \ldots, Z_n)^T.$$

2.7 Conditional expectation and probabilities

Let (Ω,A, P) be a probability space and A, B ∈ A. The notions defined in this section generalise the concept of the conditional probability of the event A given the event B which is defined as

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

if P(B) > 0.

In this case, $P(\cdot\,|B) = \frac{P(\cdot \cap B)}{P(B)}$ is a probability measure. Given a nonnegative real random variable X on (Ω,A, P) we may define

$$E(X|B) = \int_{\Omega} X(\omega)\,dP(\omega|B) = \frac{\int_B X\,dP}{P(B)}.$$


Example 2.7.1
Let X and Y be discrete random variables with values in N. Define

$$E(X|Y = \cdot) : Y(\Omega) \to \mathbb{R}, \qquad y_k \mapsto E(X|Y = y_k)$$

with

$$E(X|Y = y_k) = \sum_{i=1}^{\infty} x_i\, P(X = x_i \,|\, Y = y_k) = \sum_{i=1}^{\infty} x_i\, \frac{P(X = x_i, Y = y_k)}{P(Y = y_k)}.$$

Then we set

$$E(X|Y) = E(X|Y = \cdot) \circ Y. \tag{2.1}$$

Let X, Y be real random variables. The definition (2.1) is problematic if P(Y = y) = 0, as e.g. for absolutely continuous random variables. However, if P(Y ∈ B) > 0 for B ∈ B we can define

$$E(X \,|\, Y \in B) = \frac{\int_{\{Y \in B\}} X\,dP}{P(Y \in B)}.$$

In general, we can study the integrals $\int_{\{Y \in B\}} X\,dP$, B ∈ B. Coming back to the case that Y is discrete, we get

$$\int_{\{Y \in B\}} X\,dP = \sum_{k: y_k \in B} \int_{\{Y = y_k\}} X\,dP = \sum_{k: y_k \in B} P(Y = y_k)\, E(X|Y = y_k) = \int_{\{Y \in B\}} E(X|Y = \cdot) \circ Y\,dP = \int_{\{Y \in B\}} E(X|Y)\,dP.$$

This equation is now used to get a general definition for E(X|Y ).

Definition 2.7.2
Let X : (Ω,A) → (R,B) and Y : (Ω,A) → (Ω′,A′) be random variables.

(i) The conditional expectation of X given Y is a random variable

$$E(X|Y) : (\Omega, Y^{-1}(\mathcal{A}')) \to (\mathbb{R}, \mathcal{B})$$

such that

$$E(X\, \mathbb{1}_C) = E[E(X|Y)\, \mathbb{1}_C] \qquad \forall C \in Y^{-1}(\mathcal{A}').$$

(ii) The conditional expectation of X given Y = y is a random variable

$$E(X|Y = \cdot) : (\Omega', \mathcal{A}') \to (\mathbb{R}, \mathcal{B}), \qquad y \mapsto E(X|Y = y)$$

such that

$$E(X|Y) = E(X|Y = \cdot) \circ Y.$$

In the special case X = 1_A for some A ∈ A we call E(X|Y) and E(X|Y = y) the conditional probability of A given Y and Y = y, respectively, and write P(A|Y) and P(A|Y = y).


Remark 2.7.3

(i) The conditional expectation has the following properties:

(a) E[E(X|Y )] = E(X)

(b) E[g(Y )|Y ] = g(Y ) and E[Xg(Y )|Y ] = g(Y )E(X|Y ) a.s. for all measurable g.

(c) If X and Y are independent, then E(X|Y ) = E(X) a.s.

(ii) Let X and Y be absolutely continuous random variables with joint density f(x, y) and marginal densities f_X and f_Y. Then

$$E(X|Y = y) = \frac{1}{f_Y(y)} \int_{\mathbb{R}} x\, f(x, y)\,dx.$$


Chapter 3

Parameter Estimation

In this chapter we study the following situation: Let X be a random variable with values in 𝒳 and assume that the distribution L(X) is known up to some parameter θ, i.e.,

$$\mathcal{L}(X) \in \mathcal{P}^X_\theta = \{P_\theta : \theta \in \Theta\}.$$

Here, we assume Θ ⊂ R^d for d ≥ 1 and X = (X_1, . . . , X_n) with independent random variables X_1, . . . , X_n. Based on a realisation x = (x_1, . . . , x_n) we want to find the value of θ.

Definition 3.0.1

(i) A statistic T is a measurable function on 𝒳 which does not depend on the unknown parameter θ.

(ii) A statistic T(X_1, . . . , X_n) : 𝒳 → g(Θ) = {g(θ) : θ ∈ Θ} is called an estimator. We will write

$$\widehat{g(\theta)}_n := T(X_1, \ldots, X_n)$$

for the estimator w.r.t. sample size n.

(iii) The value T(x_1, . . . , x_n) taken for the observations x_1, . . . , x_n is called an estimate.

Remark 3.0.2
Since the distribution of X is determined by the parameter θ, we will write P_θ(X ∈ B) and E_θ[g(X)] for probabilities and expectations calculated under the assumption that θ is the true parameter of the distribution of X.

Example 3.0.3
Let 𝒳 = {0, 1}^n and X = (X_1, . . . , X_n)^T with X_1, . . . , X_n iid with

$$P(X_i = 1) = p \quad \text{and} \quad P(X_i = 0) = 1 - p.$$

p can be interpreted as the probability of success in a Bernoulli experiment. Then θ = p, Θ = [0, 1] and L(X) ∈ {P_θ : θ ∈ Θ} where

$$P_\theta\big((x_1, \ldots, x_n)^T\big) = \prod_{i=1}^{n} P(X_i = x_i) = \theta^{s} (1-\theta)^{n-s}$$

with $s = \sum_{i=1}^{n} x_i$.
Goal: Estimate θ = p.
Guess: $\hat{p} = s/n$.
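
A minimal simulation sketch of this guess (added for illustration; the value of p below is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(1)
    p_true, n = 0.3, 1000                  # hypothetical true parameter and sample size
    x = rng.binomial(1, p_true, size=n)    # X_1, ..., X_n iid Bernoulli(p)

    s = x.sum()
    p_hat = s / n                          # the guess p_hat = s/n from Example 3.0.3
    print(p_hat)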


3.1 Properties of estimators

The definition of an estimator does not assume anything but measurability. So T(X_1, . . . , X_n) ≡ 5 is an estimator. Of course, this does not make sense in applications. Hence, we will introduce several properties of 'good' estimators as well as measures for the quality of an estimator.

3.1.1 Consistency, loss and risk

A reasonable criterion is that the quality of the estimator should improve with increasing sample size. Therefore, the estimator should converge to the true value if the sample size n tends to infinity.

Definition 3.1.1
A sequence T_n = T_n(X_1, . . . , X_n), n ∈ N, of estimators for g(θ) is called consistent if for all θ ∈ Θ

$$P_\theta\big(\|T_n(X_1, \ldots, X_n) - g(\theta)\| \ge \varepsilon\big) \xrightarrow{n\to\infty} 0 \qquad \forall \varepsilon > 0.$$

That means $T_n(X_1, \ldots, X_n) \xrightarrow{p} g(\theta)$ if L(X_1, . . . , X_n) = P_{θ,n} for all n ≥ 1.

Example 3.1.2
Let X_1, X_2, . . . be iid real-valued random variables with EX_i = µ. Then $\hat{\mu} = \bar{X}_n$ is a consistent estimator, which is a consequence of the law of large numbers 2.5.1.
In 3.0.3,

$$\hat{p} = \frac{1}{n} \#\{i : X_i = 1\}$$

is a consistent estimator.

If we have several consistent estimators, we can compare them using the loss function.

Definition 3.1.3

(i) A loss function is a non-negative function L(t, θ) which is increasing in the distance between the estimated value t and θ, e.g., L(t, θ) = ||t − θ||^2 (squared error loss), L(t, θ) = ||t − θ|| (absolute error loss) or

$$L(t, \theta) = \begin{cases} 0, & \|t - \theta\| \le \varepsilon \\ \|t - \theta\| - \varepsilon, & \|t - \theta\| > \varepsilon \end{cases}$$

(ε-insensitive loss function).

(ii) The expected value of a loss function, $R(\hat{\theta}, \theta) = E_\theta[L(\hat{\theta}, \theta)]$, interpreted as a function of θ is called the risk function. The risk function w.r.t. the squared error loss, $E_\theta[\|\hat{\theta} - \theta\|^2]$, is the mean squared error (MSE).

Theorem 3.1.4
Let $\hat{\theta}_n$ be a sequence of estimators for θ. If $\mathrm{MSE}_\theta(\hat{\theta}_n) \xrightarrow{n\to\infty} 0$ for all θ ∈ Θ then the sequence of estimators is consistent.

Proof:
Convergence of the MSE to zero is equivalent to convergence of $\hat{\theta}_n$ to θ in the L^2 sense. This implies convergence in probability, which is equivalent to consistency.


Remark 3.1.5

(i) The risk of an estimator $\hat{\theta}$ does not necessarily exist (e.g., if the expectation of $\hat{\theta}$ does not exist). Nevertheless, the sequence of estimators can be consistent.

(ii) An estimator should have a small risk for as many values of θ as possible. However, a uniformly best estimator $\hat{\theta}^*$ fulfilling

$$R(\hat{\theta}^*_n, \theta) = \min_{\hat{\theta}_n} R(\hat{\theta}_n, \theta) \qquad \forall \theta \in \Theta$$

does not exist. This is due to the fact that the trivial estimator $\hat{\theta}_n \equiv \theta_0$ for θ_0 ∈ Θ fulfills $R(\hat{\theta}_n, \theta_0) = 0$. Since θ_0 was chosen arbitrarily, $\hat{\theta}^*_n$ would have risk 0 for all θ ∈ Θ.

3.1.2 Unbiasedness

Definition 3.1.6

(i) The bias of an estimator $\hat{\theta}$ is given by $\mathrm{bias}_\theta(\hat{\theta}) = E_\theta(\hat{\theta}) - \theta$.

(ii) An estimator with $\mathrm{bias}_\theta(\hat{\theta}) = 0$, i.e. $E_\theta(\hat{\theta}) = \theta$, for all θ ∈ Θ is called unbiased.

On average, an unbiased estimator estimates the correct parameter value, i.e., the estimator is centred correctly.

Proposition 3.1.7
Let X_1, . . . , X_n be iid with mean µ and variance σ^2. Then $\bar{X}_n$ is an unbiased estimator for µ, and

$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \qquad \text{and} \qquad \hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2$$

are unbiased estimators for σ^2.

Proof: (i)

$$E(\bar{X}_n) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \mu$$

(ii)

$$E(\hat{\sigma}_n^2) = \frac{1}{n} \sum_{i=1}^{n} E(X_i - \mu)^2 = \sigma^2$$

(iii)

$$E(S_n^2) = \frac{1}{n-1} \sum_{i=1}^{n} E[(X_i - \bar{X}_n)^2] = \frac{n}{n-1}\, E[(X_1 - \bar{X}_n)^2]$$


Now,

$$E[(X_1 - \bar{X}_n)^2] = E[(X_1 - \mu)^2] - 2 E[(X_1 - \mu)(\bar{X}_n - \mu)] + E[(\bar{X}_n - \mu)^2] = \sigma^2 - 2\operatorname{cov}(X_1, \bar{X}_n) + E[(\bar{X}_n - \mu)^2]$$

where

$$\operatorname{cov}(X_1, \bar{X}_n) = \frac{1}{n} \sum_{i=1}^{n} E[(X_1 - \mu)(X_i - \mu)] = \frac{\sigma^2}{n},$$

$$E[(\bar{X}_n - \mu)^2] = \frac{1}{n^2} \operatorname{var}\Big(\sum_{i=1}^{n} X_i\Big) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{var} X_i = \frac{\sigma^2}{n}.$$

Hence,

$$E[(X_1 - \bar{X}_n)^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\, \sigma^2.$$

Example 3.1.8
An unbiased estimator does not necessarily have a lower MSE than a biased one. For instance, if L(X_i) = N(µ, σ^2) then

$$\mathrm{MSE}\left(\frac{n-1}{n}\, S_n^2\right) < \mathrm{MSE}(S_n^2).$$

However, $\frac{n-1}{n} S_n^2$ is biased and tends to underestimate the true value.

Theorem 3.1.9
The MSE of $\hat{\theta}$ for θ ∈ Θ ⊂ R can be expressed as

$$E_\theta[(\hat{\theta} - \theta)^2] = \operatorname{var}_\theta \hat{\theta} + (\mathrm{bias}_\theta\, \hat{\theta})^2.$$

Proof:

$$E_\theta[(\hat{\theta} - \theta)^2] = E_\theta(\hat{\theta}^2) - [E_\theta(\hat{\theta})]^2 + [E_\theta(\hat{\theta})]^2 - 2\theta\, E_\theta(\hat{\theta}) + \theta^2 = \operatorname{var}_\theta \hat{\theta} + (\mathrm{bias}_\theta\, \hat{\theta})^2.$$

The MSE of an unbiased estimator consequently reduces to its variance. Consistency of a sequence of unbiased estimators can therefore be proven by showing that their variance tends to zero.
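
As an added numerical check (a sketch with arbitrarily chosen µ, σ² and n, not part of the original notes), the decomposition of Theorem 3.1.9 and the comparison of Example 3.1.8 can be verified by Monte Carlo:

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma2, n, reps = 0.0, 4.0, 10, 200000

    s2 = np.empty(reps)                              # unbiased sample variance S_n^2
    for r in range(reps):
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        s2[r] = x.var(ddof=1)

    for est in (s2, (n - 1) / n * s2):               # S_n^2 and the biased (n-1)/n S_n^2
        mse = np.mean((est - sigma2) ** 2)
        bias = est.mean() - sigma2
        print(mse, est.var() + bias ** 2)            # both sides of Theorem 3.1.9 agree

The first line corresponds to S_n^2; its MSE is larger than that of the biased estimator in the second line, as stated in Example 3.1.8.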

Definition 3.1.10
An estimator T*_n(X_1, . . . , X_n) is called best unbiased estimator of g(θ) ∈ R if it is unbiased and satisfies

$$\operatorname{var}_\theta T^*_n(X_1, \ldots, X_n) \le \operatorname{var}_\theta T_n(X_1, \ldots, X_n)$$

for all unbiased estimators T_n. T*_n is also called uniform minimum variance unbiased estimator (UMVUE) of g(θ).

Remark 3.1.11
Definition 3.1.10 is a consequence of 3.1.5: Since no uniformly best estimator exists, we restrict the class of estimators and ask for the best unbiased estimator.


Theorem 3.1.12
If T_n is a best unbiased estimator of θ, then it is almost surely unique.

Proof:
Let T_n and T*_n be UMVUE. Then $\tilde{T}_n = \frac{1}{2}(T_n + T^*_n)$ is also unbiased. Furthermore,

$$\operatorname{var}_\theta T_n + \operatorname{var}_\theta T^*_n \overset{UMVUE}{\le} \operatorname{var}_\theta \tilde{T}_n + \operatorname{var}_\theta \tilde{T}_n = 2 \operatorname{var}_\theta \tilde{T}_n = \frac{1}{2}\big(\operatorname{var}_\theta T_n + 2 \operatorname{cov}_\theta(T_n, T^*_n) + \operatorname{var}_\theta T^*_n\big).$$

Hence,

$$0 \ge \operatorname{var}_\theta T_n + \operatorname{var}_\theta T^*_n - 2 \operatorname{cov}_\theta(T_n, T^*_n) = \operatorname{var}_\theta(T_n - T^*_n) = E_\theta[(T_n - T^*_n)^2],$$

which shows that T_n = T*_n P_θ-a.s.

Definition 3.1.13
An estimator $\hat{\theta}$ is asymptotically unbiased if

$$\lim_{n\to\infty} E_\theta(\hat{\theta}_n) = \theta \qquad \forall \theta \in \Theta.$$

The next question is how to find best unbiased estimators. This requires more work.

3.1.3 Sufficiency and completeness

When using a statistic T to make inference on a parameter θ, two samples x and y are considered equal if T(x) = T(y). Hence, T can be regarded as a means of data reduction. This is not always reasonable (e.g., T ≡ 0). So the question is how to reduce the data without losing any information on the parameter θ.

Definition 3.1.14
Let X be a sample from a distribution family P^X_θ. A statistic S is called sufficient for θ ∈ Θ if

$$P_\theta(X \in B \,|\, S(X) = t)$$

does not depend on the unknown parameter θ for all t with P_θ(S(X) = t) ≠ 0 and all B ∈ B.

A sufficient statistic S(X) contains as much information on θ as the original sample X. So S provides a data reduction without losing information on θ. Hence, it is sufficient to consider estimators T for θ of the form T(X) = g[S(X)].

Example 3.1.15
In Example 3.0.3, $S(X) = \sum_{i=1}^{n} X_i$ is a sufficient statistic for θ = p:

$$P_p(X_1 = x_1, \ldots, X_n = x_n \,|\, S(X) = k) = \frac{P_p(X_1 = x_1, \ldots, X_n = x_n, S(X) = k)}{P_p(S(X) = k)} = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} x_i \ne k \\[6pt] \dfrac{p^k (1-p)^{n-k}}{\binom{n}{k} p^k (1-p)^{n-k}} = \dfrac{1}{\binom{n}{k}} & \text{otherwise} \end{cases}$$

for all x_1, . . . , x_n ∈ {0, 1} and 0 ≤ k ≤ n.


Theorem 3.1.16 (Factorization Theorem or Neyman Criterion)
Let X be a random variable with L(X) ∈ P^X_θ, a family of discrete probability measures on (R^n, B). Then S is sufficient for P^X_θ if and only if

pθ(x) = gθ(S(x))h(x)

with measurable functions h, gθ ≥ 0.

Proof:
⟹: Set g_θ(t) = P_θ(S(X) = t) and h(x) = P_θ(X = x | S(X) = S(x)), which by sufficiency does not depend on θ. Then

$$p_\theta(x) = P_\theta(X = x) = P_\theta(X = x, S(X) = S(x)) = P_\theta(X = x \,|\, S(X) = S(x))\, P_\theta(S(X) = S(x)) = h(x)\, g_\theta(S(x)).$$

⟸:

$$P_\theta(X = x \,|\, S(X) = S(x)) = \frac{P_\theta(X = x, S(X) = S(x))}{P_\theta(S(X) = S(x))} = \frac{P_\theta(X = x)}{\sum_{y: S(y) = S(x)} P_\theta(X = y)} = \frac{g_\theta(S(x))\, h(x)}{\sum_{y: S(y) = S(x)} g_\theta(S(y))\, h(y)} = \frac{h(x)}{\sum_{y: S(y) = S(x)} h(y)},$$

which is independent of θ.

Remark 3.1.17
Theorem 3.1.16 also holds for absolutely continuous random variables. Then the density is given by f_θ(x) = g_θ(S(x)) h(x).

Theorem 3.1.18 (Rao-Blackwell)
Let S be a sufficient statistic for P^X_θ. For any unbiased estimator T(X) of g(θ) there exists another unbiased estimator $\tilde{T}(S(X))$ with

$$\operatorname{var}_\theta \tilde{T}(S(X)) \le \operatorname{var}_\theta T(X), \tag{3.1}$$

i.e. an unbiased estimator that only depends on the information contained in S and whose variance is uniformly at least as good as that of T. Such an estimator is given by

$$\tilde{T}(t) = E_\theta[T(X) \,|\, S(X) = t].$$

Proof:
We first show that $\tilde{T}(S(X))$ is unbiased:

$$E_\theta[\tilde{T}(S(X))] = E_\theta\big[E_\theta[T(X) \,|\, S(X)]\big] = E_\theta[T(X)] = g(\theta).$$

Now

$$\operatorname{var}_\theta T(X) = E_\theta[(T(X) - g(\theta))^2] = E_\theta\Big[\big(T(X) - \tilde{T}(S(X)) + \tilde{T}(S(X)) - g(\theta)\big)^2\Big]$$
$$= E_\theta\Big[\big(T(X) - \tilde{T}(S(X))\big)^2\Big] + \operatorname{var}_\theta \tilde{T}(S(X)) + 2\, E_\theta\Big[\big(T(X) - \tilde{T}(S(X))\big)\big(\tilde{T}(S(X)) - g(\theta)\big)\Big] \ge \operatorname{var}_\theta \tilde{T}(S(X))$$


since $E_\theta\big[\big(T(X) - \tilde{T}(S(X))\big)^2\big] \ge 0$ and

$$E_\theta\Big[\big(T(X) - \tilde{T}(S(X))\big)\big(\tilde{T}(S(X)) - g(\theta)\big)\Big] = E_\theta\Big[E_\theta\big[(T(X) - \tilde{T}(S(X)))(\tilde{T}(S(X)) - g(\theta)) \,\big|\, S(X)\big]\Big] = E_\theta\Big[\big(\tilde{T}(S(X)) - g(\theta)\big)\big(E_\theta[T(X)|S(X)] - \tilde{T}(S(X))\big)\Big] = 0.$$

Remark 3.1.19
Theorem 3.1.18 requires sufficiency to assure that $\tilde{T}(S(X))$ is independent of θ.

When aiming at data reduction one tries to find minimal sufficient statistics in the sense that they do not contain information which is not related to θ.

Definition 3.1.20
Let P^X_θ be a distribution family. A statistic S is called complete for P^X_θ if for all measurable functions g with E_θ[g(S(X))] = 0 for all θ ∈ Θ we have

Pθ(g(S(X)) = 0) = 1 ∀θ ∈ Θ.

Example 3.1.21
(i) Consider the situation in 3.0.3. If the parameter space is restricted to Θ = (0, 1) then the statistic $S(X) = \sum_{i=1}^{n} X_i$ is complete:

$$E_p[g(S)] = \sum_{t=0}^{n} g(t) \binom{n}{t} p^t (1-p)^{n-t} = (1-p)^n \sum_{t=0}^{n} g(t) \binom{n}{t} \left(\frac{p}{1-p}\right)^t.$$

Since 1 − p ≠ 0, we have E_p[g(S)] = 0 iff

$$0 = \sum_{t=0}^{n} g(t) \binom{n}{t} \left(\frac{p}{1-p}\right)^t = \sum_{t=0}^{n} g(t) \binom{n}{t} r^t$$

with r = p/(1 − p) ∈ (0, ∞). So E_p[g(S)] = 0 for all p ∈ (0, 1) iff this polynomial in r vanishes on (0, ∞), which holds iff all coefficients equal 0, i.e., g(t) = 0 for all t.

(ii) Consider the same situation as in (i) and set T(X) = (S(X), X_1). Then $g(T(X)) = X_1 - \frac{S(X)}{n} \ne 0$ but E_p[g(T)] = 0. Hence, T is not complete.

If the statistic S is also complete, this can be used to construct a best unbiased estimator.

Theorem 3.1.22 (Lehmann-Scheffé)
Let S be a sufficient and complete statistic for P^X_θ. If there exists an unbiased estimator T of g(θ) then $\tilde{T}(S(X))$ with $\tilde{T}(s) = E_\theta[T(X) \,|\, S(X) = s]$ is the almost surely unique best unbiased estimator.

Proof:
Let T_0 be any unbiased estimator of g(θ). Set $\tilde{T}_0(S(X)) = E_\theta[T_0(X) \,|\, S(X)]$, which is unbiased by Theorem 3.1.18. Hence

$$E_\theta[\tilde{T}(S(X)) - \tilde{T}_0(S(X))] = 0 \qquad \forall \theta \in \Theta.$$


Since S is complete, this implies $P_\theta(\tilde{T}(S(X)) = \tilde{T}_0(S(X))) = 1$ for all θ ∈ Θ. Hence,

$$\operatorname{var}_\theta \tilde{T}(S(X)) = \operatorname{var}_\theta \tilde{T}_0(S(X)) \overset{3.1.18}{\le} \operatorname{var}_\theta T_0(X).$$

Remark 3.1.23
Theorem 3.1.22 shows that a sufficient and complete statistic S is the best unbiased estimator for g(θ) = E_θ S(X) since E_θ[S(X) | S(X)] = S(X) almost surely.
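
As an added illustration combining Examples 3.0.3, 3.1.15 and 3.1.21: in the Bernoulli model with Θ = (0, 1), the statistic S(X) = Σ_{i=1}^n X_i is sufficient and complete, and $\bar{X}_n = S(X)/n$ is an unbiased estimator of p which is already a function of S(X). Hence

$$E_p[\bar{X}_n \,|\, S(X)] = \bar{X}_n,$$

so by Theorem 3.1.22 the sample mean $\bar{X}_n$ is the almost surely unique best unbiased estimator (UMVUE) of p.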

3.1.4 Exponential families

Definition 3.1.24
A distribution family {P_θ : θ ∈ Θ} is called an exponential family if its densities or probability weights are a.e. of the form

$$f_\theta(x) = c(\theta)\, h(x) \exp\left(\sum_{i=1}^{k} \gamma_i(\theta)\, T_i(x)\right)$$

with c(θ) > 0 and h(x) ≥ 0.

Example 3.1.25

(i) P_θ = N(µ, σ^2), θ = (µ, σ^2):

$$f_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\mu^2}{2\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2}\right).$$

Hence,

$$c(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\mu^2}{2\sigma^2}}, \quad h(x) = 1, \quad \gamma_1(\theta) = \frac{\mu}{\sigma^2}, \quad \gamma_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_1(x) = x, \quad T_2(x) = x^2.$$

(ii) X_1, . . . , X_n iid N(µ, σ^2), P_θ = N(µ, σ^2)^n. Then

$$c(\theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\mu^2}{2\sigma^2}}\right)^n, \quad h(x) = 1, \quad \gamma_1(\theta) = \frac{\mu}{\sigma^2}, \quad \gamma_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_1(x) = \sum_{i=1}^{n} x_i, \quad T_2(x) = \sum_{i=1}^{n} x_i^2.$$

(iii) P_θ = B(n, p), θ = p:

$$p_\theta(k) = \binom{n}{k} p^k (1-p)^{n-k} = \binom{n}{k} (1-p)^n \left(\frac{p}{1-p}\right)^k.$$

Hence,

$$h(k) = \binom{n}{k}, \quad c(\theta) = (1-p)^n, \quad \gamma_1(\theta) = \log\left(\frac{p}{1-p}\right), \quad T_1(k) = k.$$


Definition 3.1.26
Let P_θ = {P_θ : θ ∈ Θ} be an exponential family with density (or weights) f_θ(x). Set

$$\gamma = (\gamma_1, \ldots, \gamma_k)^T \in \Gamma = \big\{(\gamma_1(\theta), \ldots, \gamma_k(\theta))^T : \theta \in \Theta\big\}.$$

γ is called the natural parameter of P_θ. W.r.t. γ, the density (or weights) are of the form

$$f_\gamma(x) = c'(\gamma)\, h(x) \exp\left(\sum_{i=1}^{k} \gamma_i\, T_i(x)\right).$$

Theorem 3.1.27
Let P = {P_γ : γ ∈ Γ} be an exponential family with natural parametrisation and int(Γ) ≠ ∅. Then T = (T_1, . . . , T_k)^T is sufficient and complete.

Proof:
Sufficiency follows immediately from Theorem 3.1.16. We show completeness for the case of discrete random variables.
Let 𝒳 = {k : p_γ(k) ≠ 0} = {k : h(k) ≠ 0}. Then 𝒳 consists of those values which are taken by X with positive probability. Define

$$\mathcal{T} = T(\mathcal{X}) = \{t : T(k) = t \text{ for some } k \in \mathcal{X}\},$$

the set of values taken by T(X) with positive probability. Note that 𝒳 and 𝒯 do not depend on the parameter γ. Let g be a function with E_γ[g(T(X))] = 0 for all γ ∈ Γ. Then we have to show that P(g(T(X)) = 0) = 1, i.e., g(t) = 0 for all t ∈ 𝒯.

(i) Decompose g(t) = g^+(t) − g^−(t) with g^±(t) ≥ 0. Then

$$0 = E_\gamma[g(T(X))] = E_\gamma[g^+(T(X))] - E_\gamma[g^-(T(X))],$$

hence

$$E_\gamma[g^+(T(X))] = E_\gamma[g^-(T(X))] =: \alpha_\gamma.$$

(ii) Fix γ_0 ∈ int(Γ). Then

$$\alpha_\gamma = E_\gamma[g^\pm(T(X))] = \sum_{k\in\mathcal{X}} g^\pm(T(k))\, P_\gamma(X = k) = \sum_{k\in\mathcal{X}} g^\pm(T(k)) \frac{p_\gamma(k)}{p_{\gamma_0}(k)}\, p_{\gamma_0}(k)$$
$$= \sum_{t\in\mathcal{T}} \sum_{k: T(k) = t} g^\pm(t) \frac{c'(\gamma)}{c'(\gamma_0)}\, e^{(\gamma - \gamma_0)^T t}\, p_{\gamma_0}(k) = \frac{c'(\gamma)}{c'(\gamma_0)} \sum_{t\in\mathcal{T}} g^\pm(t)\, P_{\gamma_0}(T(X) = t)\, e^{(\gamma - \gamma_0)^T t} = \alpha_{\gamma_0} \frac{c'(\gamma)}{c'(\gamma_0)} \sum_{t\in\mathcal{T}} q^\pm(t)\, e^{(\gamma - \gamma_0)^T t}$$

with $q^\pm(t) = \frac{1}{\alpha_{\gamma_0}}\, g^\pm(t)\, P_{\gamma_0}(T(X) = t)$ if α_{γ_0} ≠ 0. Since $\sum_{t\in\mathcal{T}} q^\pm(t) = 1$, the quantities q^±(t) can be interpreted as probability weights.


(iii) Let Q^± be random variables with P(Q^± = t) = q^±(t) for t ∈ 𝒯. Consider the moment generating function ψ(u) = E(e^{u^T Q}). From (ii) we obtain

$$E\big(e^{(\gamma - \gamma_0)^T Q^+}\big) = \frac{\alpha_\gamma}{\alpha_{\gamma_0}}\, \frac{c'(\gamma_0)}{c'(\gamma)} = E\big(e^{(\gamma - \gamma_0)^T Q^-}\big) \qquad \text{for all } \gamma \in \Gamma.$$

In particular, $E(e^{u^T Q^+}) = E(e^{u^T Q^-})$ for all u in an open neighbourhood of 0. Results from probability theory imply L(Q^+) = L(Q^−), hence q^+(t) = q^−(t) for all t ∈ 𝒯. Since this implies g^+(t) = g^−(t) for all t ∈ 𝒯, we actually have g(t) = 0 for all t ∈ 𝒯.

(iv) It remains to consider the case α_{γ_0} = 0. This means E_{γ_0}[g^±(T(X))] = 0. Since g^±(t) ≥ 0, we get g^±(T(X)) = 0 a.s. which implies g(t) = 0 for all t ∈ 𝒯.

Example 3.1.28
Let X = (X_1, . . . , X_n)^T with X_1, . . . , X_n iid with L(X_1) = N(µ, σ^2). In Example 3.1.25 we have seen

$$\gamma_1 = \frac{\mu}{\sigma^2}, \quad \gamma_2 = -\frac{1}{2\sigma^2}, \quad T_1(x) = \sum_{i=1}^{n} x_i, \quad T_2(x) = \sum_{i=1}^{n} x_i^2.$$

Hence, Γ = R × (−∞, 0), so int Γ ≠ ∅ and

$$T(X) = \Big(\sum_{i=1}^{n} X_i,\; \sum_{i=1}^{n} X_i^2\Big)^T$$

is sufficient and complete.

3.1.5 Efficiency

In the following, we will make the assumption

(A1) f_θ(x) is continuously differentiable w.r.t. θ for almost all x.

Definition 3.1.29 (Fisher's information)
Let L(X) = P_θ where θ ∈ Θ ⊂ R. Assume that the density (or probability weights) is partially differentiable w.r.t. θ. Then the Fisher information of P_θ is defined as

$$I(P_\theta) := E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right].$$

For an m-dimensional parameter space, we get Fisher's information matrix

$$I(P_\theta) := \left(E_\theta\left[\frac{\partial}{\partial\theta_i} \log f_\theta(X)\, \frac{\partial}{\partial\theta_j} \log f_\theta(X)\right]\right)_{i,j=1,\ldots,m}.$$

Remark 3.1.30
In the literature, you might also find the definition

$$I(P_\theta) := \left(E_\theta\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f_\theta(X)\right]\right)_{i,j=1,\ldots,m}$$

for Fisher's information matrix. Both definitions are identical if we are allowed to interchange integration and differentiation w.r.t. θ_i.


Proposition 3.1.31
Let X = (X_1, . . . , X_n)^T where X_1, . . . , X_n are iid with L(X_i) ∈ {P_θ : θ ∈ Θ} such that the distributions P_θ are absolutely continuous and fulfill (A1). If additionally

$$\int_{\mathbb{R}} \left|\frac{\partial}{\partial\theta} f_\theta(x)\right| dx < \infty,$$

then

$$I(P^X_\theta) = n\, I(P^{X_1}_\theta).$$

Proof:
Due to the independence we have

$$f_\theta(x) = \prod_{i=1}^{n} f_{\theta,i}(x_i).$$

Hence,

$$E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right] = E_\theta\left[\left(\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\right)^2\right] = \sum_{i=1}^{n} E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\right)^2\right] + \sum_{i\ne j} E_\theta\left(\frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\, \frac{\partial}{\partial\theta} \log f_{\theta,j}(X_j)\right).$$

Due to the independence of the X_i, we have

$$E_\theta\left(\frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\, \frac{\partial}{\partial\theta} \log f_{\theta,j}(X_j)\right) = E_\theta\left(\frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\right) E_\theta\left(\frac{\partial}{\partial\theta} \log f_{\theta,j}(X_j)\right).$$

Now

$$E_\theta\left(\frac{\partial}{\partial\theta} \log f_{\theta,j}(X_j)\right) = \int_{\mathbb{R}} \frac{\partial}{\partial\theta} \log f_{\theta,j}(x)\, f_{\theta,j}(x)\,dx = \int_{\mathbb{R}} \frac{\partial}{\partial\theta} f_{\theta,j}(x)\,dx = \frac{\partial}{\partial\theta} \int_{\mathbb{R}} f_{\theta,j}(x)\,dx = 0.$$

Here, the smoothness assumptions on f_{θ,i} were used to exchange integration and differentiation. Hence,

$$E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2\right] = \sum_{i=1}^{n} E_\theta\left[\left(\frac{\partial}{\partial\theta} \log f_{\theta,i}(X_i)\right)^2\right] = n\, I(P^{X_1}_\theta).$$


Theorem 3.1.32 (Cramer-Rao inequality)
Let X = (X_1, . . . , X_n)^T with L(X) ∈ {P_{θ,n} : θ ∈ Θ}, Θ ⊂ R. Assume that P_{θ,n} has a density which fulfills (A1). Let $\hat{\theta}_n$ be an estimator with bias b_n(θ) whose derivative b'_n(θ) w.r.t. θ exists. Then we have

$$\mathrm{MSE}_\theta(\hat{\theta}_n) \ge \frac{(b_n'(\theta) + 1)^2}{I(P_{\theta,n})} \qquad \forall \theta \in \Theta.$$

Proof:
We have

$$b_n(\theta) = E_\theta(\hat{\theta}_n - \theta) = \int_{\mathbb{R}^n} (\hat{\theta}_n(x) - \theta)\, f_\theta(x)\,dx.$$

Due to the smoothness conditions on f_θ we can compute the derivative of b_n as follows:

$$b_n'(\theta) = \frac{\partial}{\partial\theta} \int_{\mathbb{R}^n} (\hat{\theta}_n(x) - \theta)\, f_\theta(x)\,dx = \int_{\mathbb{R}^n} \frac{\partial}{\partial\theta}\Big((\hat{\theta}_n(x) - \theta)\, f_\theta(x)\Big)\,dx = -\int_{\mathbb{R}^n} f_\theta(x)\,dx + \int_{\mathbb{R}^n} (\hat{\theta}_n(x) - \theta)\, \frac{\partial f_\theta}{\partial\theta}(x)\,dx.$$

Since f_θ is a density, we get

$$b_n'(\theta) + 1 = \int_{\mathbb{R}^n} (\hat{\theta}_n(x) - \theta)\, \frac{\partial}{\partial\theta} \log f_\theta(x)\, f_\theta(x)\,dx = E_\theta\left((\hat{\theta}_n(X) - \theta)\, \frac{\partial}{\partial\theta} \log f_\theta(X)\right) \le \left(E_\theta[(\hat{\theta}_n(X) - \theta)^2]\; E_\theta\Big[\Big(\frac{\partial}{\partial\theta} \log f_\theta(X)\Big)^2\Big]\right)^{\frac{1}{2}},$$

where the last step follows from the Cauchy-Schwarz inequality.

Remark 3.1.33
If $\hat{\theta}_n$ is unbiased, the Cramer-Rao inequality reduces to

$$\operatorname{var}_\theta(\hat{\theta}_n) \ge \frac{1}{I(P_{\theta,n})}.$$

Definition 3.1.34 (Efficiency)
An unbiased estimator $\hat{\theta}_n$ is called efficient if

$$\operatorname{var}_\theta(\hat{\theta}_n) = \frac{1}{I(P_{\theta,n})} \qquad \forall \theta \in \Theta.$$
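
As an added worked example: for X = (X_1, . . . , X_n) with X_1, . . . , X_n iid Bernoulli, i.e. θ = p and f_p(x) = p^x (1 − p)^{1−x}, one computes

$$\frac{\partial}{\partial p} \log f_p(X_1) = \frac{X_1}{p} - \frac{1 - X_1}{1 - p} = \frac{X_1 - p}{p(1-p)}, \qquad I(P_p^{X_1}) = \frac{E_p[(X_1 - p)^2]}{p^2(1-p)^2} = \frac{1}{p(1-p)},$$

so I(P_p^X) = n/(p(1 − p)) by Proposition 3.1.31. Since var_p(X̄_n) = p(1 − p)/n, the unbiased estimator X̄_n attains the bound of Remark 3.1.33 and is therefore efficient.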


3.2 Bayes estimators

Let X be a random variable with density f such that L(X) ∈ P^X_θ.
In Bayesian estimation the parameter θ is interpreted as a random variable with density π. The distribution given by π is called the prior distribution since it represents the initial assumption on the parameter θ. This distribution is chosen in advance and should take all available information on the problem into account.

Using Bayes' theorem, we may write

$$f(X|\theta)\, \pi(\theta) = \pi(\theta|X)\, f(X) \tag{3.2}$$

or

$$\pi(\theta|X) = \frac{f(X|\theta)\, \pi(\theta)}{\int_\Theta f(X|\theta)\, \pi(\theta)\,d\theta}. \tag{3.3}$$

π(θ|X) is called the posterior distribution. It represents the updated belief in the distribution of θ having seen the data X.

Definition 3.2.1 (Bayes risk)
Let T be an estimator of θ. The Bayesian risk is

$$R_B(T) = \int_\Theta R(T, \theta)\, \pi(\theta)\,d\theta. \tag{3.4}$$

Remark 3.2.2
Using (3.2), equation (3.4) can also be written as

$$R_B(T) = \int_\Theta \int_{\mathbb{R}} L(T(x), \theta)\, f_\theta(x)\,dx\; \pi(\theta)\,d\theta = \int_{\mathbb{R}} \int_\Theta L(T(x), \theta)\, \pi(\theta|x)\,d\theta\; f(x)\,dx.$$

Definition 3.2.3 (Bayes estimator)
The Bayes estimator T_B of θ is the estimator minimizing the Bayes risk R_B.

Remark 3.2.4

(i) The Bayes estimator is the estimator minimising the posterior expected loss.

(ii) For given X, minimising the Bayes risk is equivalent to minimising

$$\int_\Theta L(T(X), \theta)\, \pi(\theta|X)\,d\theta.$$

(iii) The Bayes risk is a weighted version of the risk function. The weight function π(θ) takes large values for θ which are likely to be the true parameter and which should therefore be estimated accurately. For unlikely θ we can accept a bigger estimation error. So if we have some knowledge or expectations on the true parameter this should be used to determine the weight function π.


(iv) In practice, the a-priori distribution is often chosen such that the estimator can be calculated analytically.

Example 3.2.5
Assume we are given some data X with L(X) = B(n, p) (as in Example 3.0.3) and we want to estimate p.
For the prior distribution we choose a beta distribution with parameters α > 0 and β > 0, i.e.

$$\pi_{\alpha,\beta}(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{C(\alpha, \beta)}, \qquad 0 < \theta < 1,$$

where

$$C(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

The beta distribution is a common choice for the prior distribution. With (3.3) we get

$$\pi(\theta|X) = \frac{\theta^{\alpha-1+X}(1-\theta)^{\beta-1+n-X}}{\int_0^1 u^{\alpha-1+X}(1-u)^{\beta-1+n-X}\,du}, \qquad \theta \in [0, 1],$$

which is a beta distribution again. Using the quadratic loss function we have to minimise

$$\int_\Theta (T(X) - \theta)^2\, \pi(\theta|X)\,d\theta.$$

Setting the derivative w.r.t. T to zero yields

$$-2 \int_\Theta (T(X) - \theta)\, \pi(\theta|X)\,d\theta = 0 \iff T(X) = \int_\Theta \theta\, \pi(\theta|X)\,d\theta.$$

Note that this formula holds for the quadratic loss function irrespective of the chosen distributions. Now we plug in the posterior distribution and get

$$T(X) = \int_0^1 \theta\, \frac{\theta^{\alpha-1+X}(1-\theta)^{\beta-1+n-X}}{\int_0^1 u^{\alpha-1+X}(1-u)^{\beta-1+n-X}\,du}\,d\theta = \frac{\int_0^1 \theta^{\alpha+X}(1-\theta)^{\beta-1+n-X}\,d\theta}{\int_0^1 u^{\alpha-1+X}(1-u)^{\beta-1+n-X}\,du} = \frac{X + \alpha}{n + \alpha + \beta}.$$

For α = β = 0 we arrive at T(X) = X/n. Note that this case is not allowed above. Hence, X/n is only a limit of Bayes estimators.
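
A minimal Python sketch of this posterior-mean estimator (added illustration; the prior parameters and the data-generating values below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p_true = 50, 0.3                  # hypothetical model parameters
    alpha, beta = 2.0, 2.0               # beta prior parameters

    X = rng.binomial(n, p_true)          # one observation X ~ B(n, p)

    t_bayes = (X + alpha) / (n + alpha + beta)   # Bayes estimator under squared error loss
    t_mle = X / n                                # limiting case alpha = beta = 0
    print(t_bayes, t_mle)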

Definition 3.2.6 (Admissibility)
An estimator T is admissible if for all estimators S

$$R(S, \theta) \le R(T, \theta) \quad \forall \theta \in \Theta$$

implies

$$R(S, \theta) = R(T, \theta) \quad \forall \theta \in \Theta.$$


Proposition 3.2.7
Let M be the closure of the interior points of Θ and M ∩ Θ = Θ. Assume that the loss function used to define the risk is continuous in θ. If π(θ) > 0 almost everywhere then the Bayes estimator T_B of θ for the a-priori density π is admissible.

Proof:
Assume that T_B is not admissible. Then there is an estimator S and θ_0 ∈ Θ with

$$R(S, \theta) \le R(T_B, \theta) \quad \forall \theta \in \Theta \qquad \text{and} \qquad R(S, \theta_0) < R(T_B, \theta_0).$$

Since Θ has no isolated points and R is continuous in θ we have R(S, θ) < R(T_B, θ) for all θ in a neighbourhood of θ_0. Hence,

$$R_B(S) = \int_\Theta R(S, \theta)\, \pi(\theta)\,d\theta < R_B(T_B),$$

in contradiction to the definition of T_B.

3.3 Minimax estimators

Definition 3.3.1
An estimator $\hat{\theta}_M$ minimizing

$$R_M(\hat{\theta}) = \max_{\theta\in\Theta} R(\hat{\theta}, \theta) \tag{3.5}$$

is called the minimax estimator of θ.

Remark 3.3.2

(i) Minimax estimators choose the smallest maximum risk, i.e., they take precautions against the worst case situation. For many θ ∈ Θ, however, there might be estimators with much smaller risk.

(ii) The definition (3.5) assumes that the risk attains its maximum. This is the case if the risk is continuous and Θ is a compact set. In other cases the maximum can be replaced by the supremum.

It is nearly impossible to calculate the minimax estimator directly. The next theorem, however, gives a relation between Bayes and minimax estimators.

Theorem 3.3.3
Let T_B be a Bayes estimator for an arbitrary a-priori distribution π such that

$$R(T_B, \theta) \le R_B(T_B) \qquad \forall \theta \in \Theta.$$

Then T_B is also the minimax estimator T_M.
If π is continuous and π(θ) > 0 for all θ ∈ Θ then

$$R(T_M, \theta) = R_B(T_M) \qquad \forall \theta \in \Theta,$$

i.e. the risk of the minimax estimator is constant.


Proof:
By definition of the Bayes risk,

$$R_B(T) = \int_\Theta R(T, \theta)\, \pi(\theta)\,d\theta \le \max_{\theta\in\Theta} R(T, \theta)$$

for any estimator T. Then R_B(T_B) ≥ R(T_B, θ) for all θ ∈ Θ implies

$$\max_{\theta\in\Theta} R(T_B, \theta) \le R_B(T_B) \le R_B(T) \le \max_{\theta\in\Theta} R(T, \theta),$$

since T_B is the corresponding Bayes estimator. Hence, T_B = T_M.
To show the second statement we assume that

$$r_{\min} := \min_{\theta\in\Theta} R(T_B, \theta) < \max_{\theta\in\Theta} R(T_B, \theta) =: r_{\max}.$$

Let

$$\Theta_0 = \Big\{\theta \in \Theta : R(T_B, \theta) < \tfrac{1}{2}(r_{\min} + r_{\max})\Big\}.$$

Since r_min < r_max this set is not empty. Since π is continuous and π(θ) > 0 we have $\int_{\Theta_0} \pi(\theta)\,d\theta > 0$. Then

$$R_B(T_B) = \int_{\Theta_0} R(T_B, \theta)\, \pi(\theta)\,d\theta + \int_{\Theta\setminus\Theta_0} R(T_B, \theta)\, \pi(\theta)\,d\theta \le \tfrac{1}{2}(r_{\min} + r_{\max}) \int_{\Theta_0} \pi(\theta)\,d\theta + r_{\max} \int_{\Theta\setminus\Theta_0} \pi(\theta)\,d\theta < r_{\max},$$

which contradicts the assumption that $\max_{\theta\in\Theta} R(T_B, \theta) \le R_B(T_B)$.

Remark 3.3.4
If we have an estimator T with a constant risk, then we only need an appropriate a-priori distribution π to show that it is the minimax estimator.

Example 3.3.5
Let L(X) = B(n, p) as in Example 3.0.3. Choose the loss function

$$L(\hat{p}, p) = \frac{(\hat{p} - p)^2}{p(1-p)}$$

which gives a higher weight to deviations for p close to 0 or 1 than for p = 1/2 (see Figure 3.1).


Figure 3.1: The loss function for p = 1/2 (solid), p = 0.1 (dashed) and p = 0.9 (dotted).

Now the risk for the estimator $\bar{X}_n$ is

$$R(\bar{X}_n, p) = E\left[\frac{(\bar{X}_n - p)^2}{p(1-p)}\right] = \frac{1}{p(1-p)} \operatorname{var}(\bar{X}_n) = \frac{1}{n},$$

which is constant. Now we need a prior distribution π such that $\bar{X}_n$ is the Bayes estimator w.r.t. π. Choose U(0, 1) for π. Then the posterior distribution is proportional to

$$f(X|p)\, \pi(p) = p^X (1-p)^{n-X}.$$

We minimize the Bayes risk pointwise, i.e., we minimize

$$\int_0^1 \frac{(\hat{p} - p)^2}{p(1-p)}\, p^X (1-p)^{n-X}\,dp$$

by differentiating w.r.t. $\hat{p}$ and get

$$\hat{p} = \frac{\int_0^1 p^X (1-p)^{n-X-1}\,dp}{\int_0^1 p^{X-1} (1-p)^{n-X-1}\,dp}.$$

This formula is equal to the formula in Example 3.2.5 if we choose α = β = 0.

3.4 Maximum likelihood estimators

Definition 3.4.1
Let x be a realisation of X with values in 𝒳 and L(X) ∈ {P_θ : θ ∈ Θ}. If P_θ(X = x) > 0 then the likelihood function is defined as

$$L(\theta|x) = P_\theta(X = x), \qquad x \in \mathcal{X},\; \theta \in \Theta.$$


If P_θ(X = x) = 0 then it is defined as

$$L(\theta|x) = \lim_{h\to 0} \frac{P_\theta(X \in [x - h, x + h])}{2h}, \qquad x \in \mathcal{X},\; \theta \in \Theta.$$

For absolutely continuous random variables, we get L(θ|x) = f_θ(x). If the estimator $\hat{\theta}$ satisfies

$$L(\hat{\theta}(X)|X) = \max_{\theta\in\Theta} L(\theta|X)$$

it is called the maximum likelihood estimator of θ.

Remark 3.4.2
It is often convenient to use the log-likelihood function

$$l(\theta|x) = \log L(\theta|x), \qquad x \in \mathcal{X},\; \theta \in \Theta,$$

instead of L(θ|x) (cf. Examples 3.4.3 and 3.4.5).

Example 3.4.3
Let X_1, . . . , X_n be iid N(µ, σ^2)-distributed. Then

$$L(\mu, \sigma^2|X_1, \ldots, X_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2\right)$$

and

$$l(\mu, \sigma^2|X_1, \ldots, X_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2.$$

Differentiating w.r.t. µ and σ^2 yields

$$\frac{\partial l}{\partial\mu}(\mu, \sigma^2|X_1, \ldots, X_n) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu),$$
$$\frac{\partial l}{\partial\sigma^2}(\mu, \sigma^2|X_1, \ldots, X_n) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (X_i - \mu)^2.$$

Setting these to zero we obtain

$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

Remark 3.4.4
From the proof of Proposition 3.1.7 we know that $\hat{\sigma}^2$ is biased. Nevertheless, we would prefer this estimator because Example 3.1.8 shows that

$$\mathrm{MSE}(\hat{\sigma}^2) \le \mathrm{MSE}(S_n^2),$$

where $S_n^2$ is the unbiased estimator for σ^2 given in 3.1.7.
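
An added numerical sketch of Example 3.4.3 (the sample below is simulated with arbitrary parameters): maximising the log-likelihood numerically reproduces the closed-form estimators µ̂ = X̄_n and σ̂² = (1/n) Σ (X_i − X̄_n)².

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    x = rng.normal(2.0, 3.0, size=500)           # hypothetical N(2, 9) sample
    n = len(x)

    def neg_loglik(params):
        mu, log_s2 = params                      # parametrise by log(sigma^2) so sigma^2 > 0
        s2 = np.exp(log_s2)
        return 0.5 * n * np.log(2 * np.pi * s2) + ((x - mu) ** 2).sum() / (2 * s2)

    res = minimize(neg_loglik, x0=[0.0, 0.0])
    mu_hat, s2_hat = res.x[0], np.exp(res.x[1])

    # compare with the closed-form ML estimators from Example 3.4.3
    print(mu_hat, x.mean())
    print(s2_hat, ((x - x.mean()) ** 2).mean())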


Example 3.4.5
Let X_1, . . . , X_N be i.i.d. with L(X_1) = B(n, p) where n is known. Then

$$L(p|X) = \prod_{i=1}^{N} \binom{n}{X_i} p^{X_i} (1-p)^{n - X_i}$$

and

$$l(p|X) = \Big(\sum_{i=1}^{N} X_i\Big) \log(p) + \Big(Nn - \sum_{i=1}^{N} X_i\Big) \log(1-p) + \sum_{i=1}^{N} \log\binom{n}{X_i}.$$

The derivative is

$$\frac{\partial l}{\partial p}(p|X) = \frac{\sum_{i=1}^{N} X_i}{p} - \frac{Nn - \sum_{i=1}^{N} X_i}{1 - p},$$

which leads to

$$\hat{p} = \frac{\bar{X}_N}{n}.$$

Example 3.4.6 (ML-estimator not unique)
Let X_1, . . . , X_n be iid with L(X_i) = U(θ_0 − 1/2, θ_0 + 1/2). This implies θ_0 − 1/2 ≤ X_1, . . . , X_n ≤ θ_0 + 1/2, which is equivalent to θ_0 − 1/2 ≤ X_(1) < X_(n) ≤ θ_0 + 1/2. Hence, the likelihood function becomes

$$L(\theta|X) = \begin{cases} 1, & \theta \in [X_{(n)} - \tfrac{1}{2},\, X_{(1)} + \tfrac{1}{2}] \\ 0, & \theta \notin [X_{(n)} - \tfrac{1}{2},\, X_{(1)} + \tfrac{1}{2}]. \end{cases}$$

Obviously,

$$\max_\theta L(\theta|X) = 1 \qquad \forall \theta \in \Big[X_{(n)} - \tfrac{1}{2},\, X_{(1)} + \tfrac{1}{2}\Big].$$

Hence, L(θ|X) is maximised for any statistic $\hat{\theta}$ such that $\hat{\theta}(X) \in \big[X_{(n)} - \tfrac{1}{2},\, X_{(1)} + \tfrac{1}{2}\big]$.

Theorem 3.4.7
Let X_1, . . . , X_n be iid with L(X_i) = P_θ with θ ∈ Θ. Assume that there exists a maximum likelihood estimator $\hat{\theta}$ for θ. Let Y_1, . . . , Y_n be given by Y_i = g(X_i) with an injective function g. Then

$$\hat{\theta}\big(g^{-1}(Y_1), \ldots, g^{-1}(Y_n)\big)$$

is a maximum likelihood estimator for the parameter θ of the distribution P_θ ∘ g^{−1} of Y_i.

Proof:
Since we do not want to distinguish between the discrete and the absolutely continuous case, we will write H_θ for either the density or the probability weights of X_i in the following. For θ ∈ Θ we have

$$L(\theta|Y_1, \ldots, Y_n) = H_\theta \circ g^{-1}(Y_1, \ldots, Y_n) = H_\theta\big(g^{-1}(Y_1), \ldots, g^{-1}(Y_n)\big) = H_\theta(X_1, \ldots, X_n) = L(\theta|X_1, \ldots, X_n).$$

The estimator $\hat{\theta}(X_1, \ldots, X_n) = \hat{\theta}(g^{-1}(Y_1), \ldots, g^{-1}(Y_n))$ maximizes L(θ|X_1, . . . , X_n). Hence, it also maximizes L(θ|Y_1, . . . , Y_n).


Example 3.4.8
Let Y_1, . . . , Y_n be iid lognormally distributed with parameters µ and σ^2, i.e. L(log Y_1) = N(µ, σ^2). To determine the estimator for θ = (µ, σ^2) ∈ Θ = R × (0, ∞) from the observations Y_i, we determine the density of Y_i.
A lognormally distributed random variable Y is obtained via the transformation Y = exp(X) with L(X) = N(µ, σ^2). Hence, P(Y < y) = 0 for y < 0. For y > 0 we apply the transformation theorem for densities, which states that the density of Y = g(X) is

$$f_Y(y) = f_X(g^{-1}(y))\, |J_{g^{-1}}(y)|,$$

where $J_{g^{-1}}$ is the Jacobian of g^{−1}. Here, we have g(x) = exp(x), g^{−1}(y) = log(y) and $J_{g^{-1}}(y) = \frac{1}{y}$. Therefore the density of Y is given by

$$f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi\sigma^2}}\, \dfrac{1}{y} \exp\left(-\dfrac{(\log y - \mu)^2}{2\sigma^2}\right), & y > 0 \\[8pt] 0, & y \le 0. \end{cases}$$

Then the log-likelihood function is

$$l(\theta|Y_1, \ldots, Y_n) = -\frac{n}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{n} \log(Y_i) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\log Y_i - \mu)^2.$$

The derivative of the log-likelihood function differs from that for a normal random variable given in Example 3.4.3 only by the logarithm of the argument. Therefore we get

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \log Y_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (\log Y_i - \hat{\mu})^2.$$

3.5 Consistency and asymptotic normality of M-estimators

Definition 3.5.1 (M-estimator)
Let Q_n : 𝒳 × Θ → R be a functional of the random vector X = (X_1, . . . , X_n) and of the unknown parameter θ ∈ Θ. An M-estimator $\hat{\theta}_n$ of θ fulfills

$$Q_n(X, \hat{\theta}_n(X)) \ge Q_n(X, \theta) \qquad \forall \theta \in \Theta.$$

Remark 3.5.2

(i) The M in the name M-estimator stands for "maximum", as an M-estimator maximises a suitable functional.

(ii) If estimators are based on the minimisation of a functional, e.g. Bayes estimators, the minimisation can be interpreted as a maximisation if the sign of the functional is changed.

(iii) ML-estimators, Bayes estimators and minimax estimators are examples of M-estimators.

Theorem 3.5.3 (Consistency of M-estimators)
Assume that the following conditions hold:


(i) The parameter space Θ ⊂ Rm is compact.

(ii) Qn(X, θ) is continuous in θ and measurable w.r.t. X.

(iii) (1/n) Q_n(X, θ) converges in probability and uniformly in θ to a deterministic function Q : Θ → R.

(iv) Q(θ) attains a unique global maximum in θ_0 ∈ Θ.

Then the M-estimator $\hat{\theta}_n$ fulfills

$$\hat{\theta}_n(X) \xrightarrow[n\to\infty]{p} \theta_0.$$

Proof:
The existence of $\hat{\theta}_n(X)$ follows from assumptions (i) and (ii). We have to show that

$$P(\|\hat{\theta}_n(X) - \theta_0\| > \varepsilon) \xrightarrow{n\to\infty} 0 \qquad \forall \varepsilon > 0.$$

Choose ε > 0 and set

$$\delta = Q(\theta_0) - \max_{\|\theta - \theta_0\| > \varepsilon} Q(\theta) \overset{(iv)}{>} 0.$$

If $\|\hat{\theta}_n - \theta_0\| > \varepsilon$ then

$$\delta \le Q(\theta_0) - Q(\hat{\theta}_n) = Q(\theta_0) - \frac{1}{n} Q_n(X, \theta_0) + \underbrace{\frac{1}{n} Q_n(X, \theta_0) - \frac{1}{n} Q_n(X, \hat{\theta}_n)}_{\le 0 \text{ by definition of } \hat{\theta}_n} + \frac{1}{n} Q_n(X, \hat{\theta}_n) - Q(\hat{\theta}_n) \le 2 \sup_\theta \left|\frac{1}{n} Q_n(X, \theta) - Q(\theta)\right|.$$

Hence,

$$P(\|\hat{\theta}_n - \theta_0\| > \varepsilon) \le P\left(\sup_\theta \Big|\frac{1}{n} Q_n(X, \theta) - Q(\theta)\Big| \ge \frac{\delta}{2}\right) \overset{(iii)}{\xrightarrow[n\to\infty]{}} 0.$$

Remark 3.5.4

(i) There are many similar theorems with slightly different technical assumptions.

(ii) The critical point is usually to show the uniform convergence of (1/n) Q_n.

(iii) If Θ is not compact, one can try to construct a compact subset C ⊂ Θ with θ_0 ∈ C and show that $P(\hat{\theta}_n \in C) \xrightarrow{n\to\infty} 1$.

Theorem 3.5.5 (Asymptotic normality of M-estimators)
In addition to the assumptions of Theorem 3.5.3 we assume that

(i) the Hessian matrix $\left(\frac{\partial^2 Q_n}{\partial\theta_i \partial\theta_j}\right)_{i,j}$ exists and is continuous in an open neighbourhood of θ_0,


(ii)

$$\frac{1}{n} \left(\frac{\partial^2}{\partial\theta_i \partial\theta_j} Q_n(\theta^*_n)\right)_{i,j} \xrightarrow[\theta^*_n \to \theta_0]{p} A(\theta_0),$$

where A(θ_0) is a deterministic and invertible m × m matrix, and

(iii)

$$\frac{1}{\sqrt{n}} \frac{\partial}{\partial\theta} Q_n(\theta_0) \xrightarrow{\mathcal{L}} N(0, B(\theta_0))$$

where

$$B(\theta_0) = \lim_{n\to\infty} \frac{1}{n}\, E\left[\frac{\partial}{\partial\theta} Q_n(\theta_0)\, \frac{\partial}{\partial\theta} Q_n(\theta_0)^T\right].$$

Then

$$\sqrt{n}\,(\hat{\theta}_n(X) - \theta_0) \xrightarrow{\mathcal{L}} N\Big(0,\; A(\theta_0)^{-1} B(\theta_0) \big(A(\theta_0)^{-1}\big)^T\Big).$$

Proof:
We perform a Taylor expansion of $\frac{\partial Q_n}{\partial\theta}(\hat{\theta}_n)$ around θ_0:

$$\frac{\partial}{\partial\theta} Q_n(\hat{\theta}_n) = \frac{\partial}{\partial\theta} Q_n(\theta_0) + \frac{\partial^2}{\partial\theta\,\partial\theta^T} Q_n(\theta^*_n)\, (\hat{\theta}_n - \theta_0),$$

where θ*_n is between $\hat{\theta}_n$ and θ_0. Since $\hat{\theta}_n$ maximizes Q_n, the left-hand side equals 0. Hence,

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) = -\left(\frac{1}{n} \frac{\partial^2}{\partial\theta\,\partial\theta^T} Q_n(\theta^*_n)\right)^{+} \left(\frac{1}{\sqrt{n}} \frac{\partial Q_n}{\partial\theta}(\theta_0)\right),$$

where + denotes the pseudo-inverse of the Hessian matrix. Now $\theta^*_n \to \theta_0$ as $\hat{\theta}_n \to \theta_0$. Hence,

$$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow[n\to\infty]{\mathcal{L}} -A(\theta_0)^{-1} Z,$$

where L(Z) = N(0, B(θ_0)), by (ii), (iii), and Slutsky's Lemma 2.4.3. Then the desired result follows with 2.6.5 (ii).

Now we apply these theorems to the maximum likelihood estimators for iid data. Consider only Θ ⊂ R.

Theorem 3.5.6 (Consistency of maximum likelihood estimators)
Let X_1, . . . , X_n be iid with L(X_i) = P_{θ_0} with P_{θ_0} ∈ {P_θ : θ ∈ Θ} such that P_θ has density f_θ and

(a) Θ ⊂ R is compact,

(b) supp f_θ = {x : f_θ(x) > 0} does not depend on θ,

(c) f_θ is continuously differentiable in θ and $\frac{\partial}{\partial\theta} f_\theta(x)$ fulfills

$$\Big|\frac{\partial}{\partial\theta} f_\theta(x)\Big| \le L(x)\, f_\theta(x) \qquad \forall x \in \mathbb{R},\; \theta \in \Theta,$$

with some L(x) > 0,


(d) Eθ0 [L(X1)] <∞

(e) Eθ0 [log fθ(X1)] exists for all θ and has a unique global maximum in θ0.

If the maximum likelihood estimator $\hat{\theta}_n$ exists then it is consistent.

Proof:
We want to apply Theorem 3.5.3 with

$$Q_n(X, \theta) = l(\theta|X) = \sum_{i=1}^{n} \log f_\theta(X_i) \qquad \text{and} \qquad Q(\theta) = E_{\theta_0}[\log f_\theta(X_1)] = \int_{\mathbb{R}} \log f_\theta(x)\, f_{\theta_0}(x)\,dx.$$

Q(θ) exists due to assumption (e). We have to check that this choice fulfills the assumptions of Theorem 3.5.3.

(i) Is fulfilled by assumption (a).

(ii) Q_n is continuous in θ if log f_θ(x) is continuous in θ, which follows from (b) and (c).

(iii) The law of large numbers implies

$$\frac{1}{n} Q_n(X, \theta) \xrightarrow[n\to\infty]{p} Q(\theta).$$

However, we need this convergence uniformly in θ. Using the mean-value theorem we get

$$\log f_\theta(x) - \log f_\eta(x) = \frac{\partial}{\partial\theta} \log f_\theta(x)\Big|_{\theta = \theta^*(x)}\, (\theta - \eta)$$

with θ*(x) between θ and η. But

$$\left|\frac{\partial}{\partial\theta} \log f_\theta(x)\Big|_{\theta = \theta^*(x)}\right| = \frac{\big|\frac{\partial}{\partial\theta} f_{\theta^*}(x)\big|}{f_{\theta^*}(x)} \le L(x).$$

Hence,

$$|\log f_\theta(x) - \log f_\eta(x)| \le L(x)\, |\theta - \eta|,$$

i.e., log f_θ(x) is Lipschitz continuous with E_{θ_0}[L(X_1)] < ∞ by (d). Then the uniform convergence follows from Theorem 3.5.7 below.

(iv) Follows from (e).

Theorem 3.5.7
Let X_1, . . . , X_n be iid real random variables, Θ ⊂ R compact, and g : R × Θ → R measurable such that

(i) E|g(X1, θ)| <∞ for all θ ∈ Θ.

(ii) g(x, θ) is Lipschitz continuous in θ with Lipschitz constant L(x).

(iii) E[L(X1)] <∞


Then

$$\sup_{\theta\in\Theta} \left|\frac{1}{n} \sum_{i=1}^{n} g(X_i, \theta) - E[g(X_1, \theta)]\right| \xrightarrow[n\to\infty]{p} 0.$$

Proof: (i) Let δ > 0. Since Θ is compact there are K ≥ 1 and θ_1, . . . , θ_K ∈ Θ such that for all θ ∈ Θ there is a k ≤ K with |θ − θ_k| < δ.

(ii) Let U, V, U_1, . . . , U_K be real random variables and ε > 0. Note that

$$|U + V| > \varepsilon \;\Rightarrow\; |U| > \frac{\varepsilon}{2} \text{ or } |V| > \frac{\varepsilon}{2},$$

hence

$$P(|U + V| > \varepsilon) \le P\Big(|U| > \frac{\varepsilon}{2}\Big) + P\Big(|V| > \frac{\varepsilon}{2}\Big). \tag{3.6}$$

Furthermore, $\sup_{k\le K} U_k > \varepsilon \Rightarrow U_1 > \varepsilon$ or $U_2 > \varepsilon$ or . . . or $U_K > \varepsilon$, hence

$$P\Big(\sup_{k\le K} U_k > \varepsilon\Big) \le \sum_{k=1}^{K} P(U_k > \varepsilon). \tag{3.7}$$

(iii) Let ε > 0 and define g_0(x, θ) = g(x, θ) − E[g(X_1, θ)]. Then

$$P\left(\sup_{\theta\in\Theta} \Big|\frac{1}{n} \sum_{i=1}^{n} g_0(X_i, \theta)\Big| > \varepsilon\right) = P\left(\sup_{k\le K}\; \sup_{\theta\in\Theta, |\theta - \theta_k| < \delta} \Big|\frac{1}{n} \sum_{i=1}^{n} g_0(X_i, \theta)\Big| > \varepsilon\right)$$
$$= P\left(\sup_{k\le K}\; \sup_{\theta\in\Theta, |\theta - \theta_k| < \delta} \Big|\frac{1}{n} \sum_{i=1}^{n} \big(g_0(X_i, \theta) - g_0(X_i, \theta_k)\big) + \frac{1}{n} \sum_{i=1}^{n} g_0(X_i, \theta_k)\Big| > \varepsilon\right)$$
$$\overset{(3.6)}{\le} P\left(\sup_{\eta,\theta\in\Theta, |\theta - \eta| < \delta} \Big|\frac{1}{n} \sum_{i=1}^{n} \big(g_0(X_i, \theta) - g_0(X_i, \eta)\big)\Big| > \frac{\varepsilon}{2}\right) + P\left(\sup_{k\le K} \Big|\frac{1}{n} \sum_{i=1}^{n} g_0(X_i, \theta_k)\Big| > \frac{\varepsilon}{2}\right).$$

For |θ − η| < δ, we have

$$\Big|\frac{1}{n} \sum_{i=1}^{n} \big(g_0(X_i, \theta) - g_0(X_i, \eta)\big)\Big| \le \frac{1}{n} \sum_{i=1}^{n} |g(X_i, \theta) - g(X_i, \eta)| + E\big[|g(X_1, \theta) - g(X_1, \eta)|\big]$$
$$\overset{(ii)}{\le} \frac{1}{n} \sum_{i=1}^{n} L(X_i)\, |\theta - \eta| + E[L(X_1)]\, |\theta - \eta| \le \frac{1}{n} \sum_{i=1}^{n} L(X_i)\, \delta + E[L(X_1)]\, \delta = \delta\left(\frac{1}{n} \sum_{i=1}^{n} \big(L(X_i) - E[L(X_1)]\big) + 2 E[L(X_1)]\right).$$

Page 40: Mathematical Statistics - TU Kaiserslautern · Mathematical Statistics Claudia Redenbach ... 8.2 Applications ... 4 Chapter 2 Repetition and Notation

36 Chapter 3 Parameter Estimation

Therefore,

P

(sup

η,θ∈Θ,|θ−θk|<δ

∣∣∣∣∣ 1nn∑i=1

(g0(Xi, θ)− g0(Xi, η)

)∣∣∣∣∣ > ε

2

)

≤ P

(1

n

n∑i=1

L(Xi)− E[L(X1)] >ε

2δ− 2E[L(X1)]

).

If δ is small enough such that ε2δ− 2E[L(X1)] > 0 this expression converges to 0 by

the law of large numbers. Furthermore,

P

(supk≤K

∣∣∣∣∣ 1nn∑i=1

g0(Xi, θk))∣∣∣∣∣ > ε

2

)≤

(7.2)

K∑k=1

P

(∣∣∣∣∣ 1nn∑i=1

g0(Xi, θk)

∣∣∣∣∣ > ε

2

)which converges to 0 by the law of large numbers.

Remark 3.5.8
An example where assumption (b) is not fulfilled is Pθ = U(0, θ). In this case fθ is not continuous in θ.

Theorem 3.5.9 (Asymptotic normality of maximum likelihood estimators)
In addition to the assumptions of Theorem 3.5.6, let the following be satisfied:

(f) fθ is twice continuously differentiable w.r.t. θ.

(g) ψ′(θ, x) := ∂²/∂θ² log fθ(x) is Lipschitz continuous in θ, i.e.

|ψ′(θ, x) − ψ′(η, x)| ≤ H(x)|θ − η|

with E[H(X1)] < ∞.

(h) 0 < I(Pθ0) < ∞.

Then

√n(θ̂n − θ0) →L N(0, 1/I(Pθ0)) as n → ∞.

Proof:
We check the conditions of Theorem 3.5.5.

(i) follows from (f).

(ii) We decompose

(1/n) ∂²/∂θ² Qn(X, θ*n) = (1/n) Σ_{i=1}^n [ψ′(θ*n, Xi) − ψ′(θ0, Xi)]  (term (1))  +  (1/n) Σ_{i=1}^n ψ′(θ0, Xi)  (term (2)).

Then

|(1)| ≤ (1/n) Σ_{i=1}^n H(Xi) |θ*n − θ0| →p E[H(X1)] · 0 = 0 as n → ∞,

and

(2) →p E[ψ′(θ0, X1)] = −I(Pθ0) ≠ 0 as n → ∞

by the law of large numbers.

(iii) We have

(1/√n) ∂/∂θ Qn(θ0) = (1/√n) Σ_{i=1}^n ∂/∂θ log fθ0(Xi).

Now,

Eθ0[ ∂/∂θ log fθ0(Xi) ] = ∫_R ∂/∂θ log fθ0(x) fθ0(x) dx = ∫_R ∂/∂θ fθ0(x) dx = ∂/∂θ ∫_R fθ0(x) dx = 0.

Furthermore,

var( ∂/∂θ log fθ0(X1) ) = E[ ( ∂/∂θ log fθ0(X1) )² ] = I(Pθ0).

Hence, the Central Limit Theorem for iid data yields

(1/√n) ∂/∂θ Qn(θ0) →L N(0, I(Pθ0)) as n → ∞.

Remark 3.5.10
Theorem 3.5.9 also states that the maximum likelihood estimator is asymptotically efficient, as it attains the Cramér-Rao bound asymptotically.
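As a numerical illustration of Theorems 3.5.6 and 3.5.9 (not part of the original notes), the following Python sketch simulates iid Exp(λ0) samples, for which the MLE is λ̂n = 1/X̄n and I(P_λ0) = 1/λ0², and checks that √n(λ̂n − λ0) behaves like N(0, λ0²). It assumes numpy and scipy are available; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of the asymptotic normality of the MLE for Exp(lambda0):
# sqrt(n)*(hat(lambda)_n - lambda0) -> N(0, 1/I(P_lambda0)) = N(0, lambda0^2).
rng = np.random.default_rng(0)
lambda0, n, reps = 2.0, 500, 5000

samples = rng.exponential(scale=1.0 / lambda0, size=(reps, n))
mle = 1.0 / samples.mean(axis=1)                        # hat(lambda)_n = 1 / X_bar_n
standardized = np.sqrt(n) * (mle - lambda0) / lambda0   # should be approx. N(0, 1)

print("empirical mean:", standardized.mean())           # approx. 0
print("empirical std :", standardized.std())            # approx. 1
# Kolmogorov-Smirnov distance to the standard normal as a rough check
print("KS statistic  :", stats.kstest(standardized, "norm").statistic)
```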

3.6 Asymptotic comparison of estimators

Consider estimation of the parameter θ ∈ Θ based on a sample X1, . . . , Xn of iid random variables.

Definition 3.6.1
Let T1,n and T2,m be estimators for θ based on sample sizes n and m, respectively. Choose n and m (depending on θ) such that

MSEθ(T1,n) = MSEθ(T2,m).   (3.8)

(i) n/m is called the relative efficiency of T2,m relative to T1,n. If n/m > 1, then T2,m is better, as it needs fewer observations to achieve the same quality.

(ii) If for some 0 < a < ∞ and n, m satisfying (3.8)

n/m → a as n, m → ∞,

independently of θ, then a is called the asymptotic relative efficiency ARE(T2, T1) of T2 relative to T1.

Example 3.6.2
Let T1 and T2 both be asymptotically normal such that

√n(Ti,n − θ) →L N(0, σi²), i = 1, 2.

Then for large n we have

biasθ(Ti,n) = E[Ti,n] − θ ≈ 0,

hence

MSEθ(Ti,n) ≈ varθ(Ti,n) ≈ (1/n) σi², i = 1, 2.

Then (3.8) implies

(1/n) σ1² ≈ (1/m) σ2²,

hence

n/m = σ1²/σ2² = ARE(T2, T1).

So T2 is better than T1 if σ2² < σ1². Sometimes this is also used as a definition for the ARE.
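A small simulation sketch (my own illustration, assuming numpy) of the asymptotic relative efficiency: for iid N(θ, σ²) data the sample median has asymptotic variance πσ²/2 while the sample mean has asymptotic variance σ², so ARE(mean, median) = π/2 ≈ 1.57.

```python
import numpy as np

# Estimate n*var for the sample mean and sample median of N(0,1) data and
# compare their ratio with the theoretical ARE(mean, median) = pi/2 ~ 1.571.
rng = np.random.default_rng(1)
n, reps = 200, 20000
x = rng.normal(size=(reps, n))

var_mean = n * np.mean(x, axis=1).var()      # approx. sigma^2 = 1
var_median = n * np.median(x, axis=1).var()  # approx. pi/2 ~ 1.571

print("n*var(mean)   :", var_mean)
print("n*var(median) :", var_median)
print("ARE(mean, median) ~", var_median / var_mean, " (theory:", np.pi / 2, ")")
```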


Chapter 4

Confidence Sets

An estimate for a parameter θ will typically show a certain deviation from the true value. Therefore, it is of interest to ask for a set which contains the parameter with a high probability.

4.1 Definitions

Definition 4.1.1 (Confidence set)
Let Θ ⊂ Rd be the parameter space and 0 < γ < 1 a prespecified probability. A γ-confidence region for θ ∈ Θ is a random set A(X) ⊆ Θ determined by the data vector X with the property

Pθ(θ ∈ A(X)) ≥ γ ∀θ ∈ Θ.

For d = 1, A(X) is a random interval [g(X), h(X)] ⊆ Θ such that

Pθ(g(X) ≤ θ ≤ h(X)) ≥ γ ∀θ ∈ Θ.

So a γ-confidence region is a random set which contains the true parameter with probability at least γ.

Definition 4.1.2 (Quantile)
Let F be a distribution function. Then

c = inf{s : F(s) ≥ γ}

is called the γ-quantile of F. The median of F is the 0.5-quantile.

4.2 Important distribution families

When computing confidence intervals for common distribution families such as the normal distribution, the following distribution families will play a major role.

Definition 4.2.1 (χ²-distribution)
Let X1, . . . , Xn be iid and L(Xi) = N(0, 1). Then the random variable

Z = X1² + . . . + Xn²

is χ²-distributed with n degrees of freedom (L(Z) = χ²_n).


Proposition 4.2.2
The χ²_n-distribution has the density function

f_n(z) = (2^{−n/2} / Γ(n/2)) z^{n/2−1} e^{−z/2}, z > 0.

Proof:
First we show that X1² has the density function

f_1(z) = (1/√(2πz)) e^{−z/2} for z > 0:

P(X1² ≤ t) = P(−√t ≤ X1 ≤ √t) = 2 · (1/√(2π)) ∫_0^{√t} e^{−s²/2} ds = (1/√(2π)) ∫_0^t (1/√z) e^{−z/2} dz.

Now we know that the sum X + Y of independent random variables has the density function

(f ∗ g)(y) = ∫_R f(x) g(y − x) dx,

where ∗ denotes the convolution and f and g are the densities of X and Y, respectively. Induction now gives the desired result.

Definition 4.2.3 (t-distribution)
Let X0, X1, . . . , Xn be iid and N(0, 1)-distributed. The distribution of

V = √n X0 / √(X1² + . . . + Xn²)

is called Student's or t-distribution with n degrees of freedom (L(V) = t_n).

Proposition 4.2.4
The t_n-distribution has the density function

f_n(v) = Γ((n+1)/2) / (√(πn) Γ(n/2)) · (1 + v²/n)^{−(n+1)/2}, v ∈ R.

Proof:
If V is t_n-distributed then V has the following form:

V = √n X0 / √Z,

where L(X0) = N(0, 1), L(Z) = χ²_n and X0 and Z are independent. The joint density of (X0, Z) therefore has the product form

g(x, z) = const · e^{−x²/2} z^{n/2−1} e^{−z/2}, x ∈ R, z > 0.

It follows:

P( √n X0/√Z ≤ t ) = ∫∫_{(x,z)∈R²: √n x ≤ t√z} g(x, z) dx dz
= const ∫_0^∞ z^{n/2−1} e^{−z/2} ∫_{−∞}^{t√(z/n)} e^{−x²/2} dx dz
= const ∫_0^∞ z^{(n−1)/2} e^{−z/2} ∫_{−∞}^{t/√n} e^{−s²z/2} ds dz   [substitution s = x/√z]
= const ∫_{−∞}^{t/√n} ∫_0^∞ z^{(n−1)/2} e^{−z(1+s²)/2} dz ds
= const ∫_{−∞}^{t/√n} (1 + s²)^{−(n+1)/2} ∫_0^∞ u^{(n−1)/2} e^{−u/2} du ds   [substitution u = z(1+s²)]
= const ∫_{−∞}^t (1 + v²/n)^{−(n+1)/2} dv   [substitution v = √n s].

Definition 4.2.5 (F-distribution)
Let X1, . . . , Xn and Y1, . . . , Ym be iid N(0, 1)-distributed. The distribution of

U = (m/n) · (X1² + . . . + Xn²) / (Y1² + . . . + Ym²)

is called F-distribution with n and m degrees of freedom (L(U) = F_{n,m}).

Proposition 4.2.6
The F_{n,m}-distribution has the density function

f_{n,m}(u) = Γ((n+m)/2) (n/m)^{n/2} / (Γ(n/2) Γ(m/2)) · u^{n/2−1} / (1 + (n/m)u)^{(n+m)/2}, u > 0.

Proof:
If V is F_{n,m}-distributed, then V has the form

V = (m X)/(n Z),

where L(X) = χ²_n, L(Z) = χ²_m and X and Z are independent. The rest of the proof is now similar to the proof of Proposition 4.2.4.


4.3 Estimators of parameters of a normal distribution and their distributions

Theorem 4.3.1
Let X1, . . . , Xn be iid N(µ, σ²)-distributed. Then

L( √n(X̄n − µ)/σ ) = N(0, 1).

Proof:
√n(X̄n − µ)/σ is a linear combination of normally distributed random variables and hence normally distributed again. We compute its parameters:

E[ √n(X̄n − µ)/σ ] = (√n/σ) ( E[(1/n) Σ_{i=1}^n Xi] − µ ) = (√n/σ) (1/n) Σ_{i=1}^n (E[Xi] − µ) = 0,

var( √n(X̄n − µ)/σ ) = (n/σ²) var( (1/n) Σ_{i=1}^n Xi ) = (1/(σ²n)) Σ_{i=1}^n var(Xi) = 1.

Theorem 4.3.2
Let X1, . . . , Xn be iid N(µ, σ²)-distributed and let

s_n² = (1/(n−1)) Σ_{j=1}^n (Xj − X̄n)²

be an estimator of the variance σ². Then

L( (n−1)s_n² / σ² ) = χ²_{n−1}.

Proof:
Let Yi = Xi − µ. Then Y1, . . . , Yn are iid N(0, σ²)-distributed and

s_n² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄n)² = (1/(n−1)) Σ_{i=1}^n (Yi − Ȳn)².

We use the substitution Z = QY, where Q is an n × n rotation matrix, i.e. the transposed matrix Q^T is the inverse of Q. The first row of Q should be (1/√n, . . . , 1/√n). This leads to

Z1 = (1/√n) Σ_{i=1}^n Yi = √n Ȳn,   (4.1)

Σ_{i=2}^n Zi² = Σ_{i=1}^n Zi² − Z1² = Z^T Z − Z1² = Y^T Q^T Q Y − n Ȳn² = Σ_{i=1}^n Yi² − n Ȳn² = Σ_{i=1}^n (Yi − Ȳn)² = (n−1)s_n².   (4.2)

The random vector Z has a multivariate normal distribution with

E(Zi) = [Q E(Y)]_i = 0,
E(ZZ^T) = E(QYY^TQ^T) = Q E(YY^T) Q^T = Q σ² I_{n×n} Q^T = σ² I_{n×n}.

Hence, Z1, . . . , Zn are iid with L(Zi) = N(0, σ²). Together with (4.2) this immediately yields

L( (n−1)s_n² / σ² ) = χ²_{n−1}

as claimed.

Corollary 4.3.3
Let X1, . . . , Xn be iid N(µ, σ²)-distributed. Then X̄n and s_n² are independent.

Proof:
The corollary follows immediately from (4.1), (4.2) and the fact that the Zi are iid with L(Zi) = N(0, σ²): X̄n is a function of Z1 alone, while s_n² is a function of Z2, . . . , Zn only.

Theorem 4.3.4
Let X1, . . . , Xn be iid N(µ, σ²)-distributed. Then

L( √n(X̄n − µ)/s_n ) = t_{n−1}.

Proof:

√n(X̄n − µ)/s_n = [ √n(X̄n − µ)/σ ] · √( (n−1)σ² / ((n−1)s_n²) ) = √(n−1) X0 / √Z

with L(X0) = N(0, 1) by Theorem 4.3.1 and Z = (n−1)s_n²/σ², such that L(Z) = χ²_{n−1} by Theorem 4.3.2. Since X0 and Z are independent by Corollary 4.3.3, the result follows from the definition of the t-distribution.

4.4 Confidence intervals for the parameters of some common distributions

Theorem 4.4.1
Let X1, . . . , Xn be iid exponentially distributed with parameter λ > 0 (L(Xi) = Exp(λ)). Then

L(2λnX̄n) = χ²_{2n}.

Proof:
From

P(2λXj > t) = P(Xj > t/(2λ)) = e^{−t/2}, t ≥ 0,

it follows that

L(2λXj) = Exp(1/2) = χ²_2

by comparison of the densities. From the independence of the Xi it follows that

L(2λnX̄n) = L( Σ_{j=1}^n 2λXj ) = χ²_{2n}.

Definition 4.4.2
Let 0 < α < 1 and m ∈ N. In the following, we will write

u_α for the α-quantile of N(0, 1),
χ²_{m,α} for the α-quantile of χ²_m,
t_{m,α} for the α-quantile of t_m,
F_{n,m,α} for the α-quantile of F_{n,m}.

The distribution function of the standard normal distribution is denoted by Φ.
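These quantiles are tabulated in classical references; in practice they can also be obtained numerically. The following minimal sketch (my own addition, assuming scipy is available; ppf denotes the quantile function) prints a few of the quantiles used later in this chapter.

```python
from scipy import stats

alpha = 0.95
print("u_0.95         =", stats.norm.ppf(alpha))         # ~ 1.645
print("chi2_{10,0.95} =", stats.chi2.ppf(alpha, df=10))  # ~ 18.31
print("t_{15,0.95}    =", stats.t.ppf(alpha, df=15))     # ~ 1.753
print("F_{5,7,0.95}   =", stats.f.ppf(alpha, 5, 7))      # ~ 3.97
```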

Example 4.4.3 (Exact confidence interval for the parameter of an exponential distribution)
Let X1, . . . , Xn be iid exponentially distributed with parameter λ. The maximum likelihood estimator for λ is λ̂ = 1/X̄n. Since L(2λnX̄n) = χ²_{2n}, we have

Pλ( χ²_{2n,(1−γ)/2}/(2nX̄n) ≤ λ ≤ χ²_{2n,(1+γ)/2}/(2nX̄n) ) = Pλ( χ²_{2n,(1−γ)/2} ≤ 2λnX̄n ≤ χ²_{2n,(1+γ)/2} ) = 1 − (1−γ)/2 − (1−γ)/2 = γ.

So we obtain a confidence interval at level γ as

Iλ = [ χ²_{2n,(1−γ)/2}/(2nX̄n) , χ²_{2n,(1+γ)/2}/(2nX̄n) ].

(Note that (1+γ)/2 = 1 − (1−γ)/2.)
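A short numerical sketch of this interval (my own illustration, assuming numpy/scipy; the data are simulated and not part of the notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, lam_true, gamma = 50, 3.0, 0.95
x = rng.exponential(scale=1.0 / lam_true, size=n)

s = n * x.mean()                                          # n * X_bar_n
lower = stats.chi2.ppf((1 - gamma) / 2, df=2 * n) / (2 * s)
upper = stats.chi2.ppf((1 + gamma) / 2, df=2 * n) / (2 * s)
print("MLE lambda_hat =", 1 / x.mean())
print("exact", gamma, "confidence interval:", (lower, upper))
```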

The next task is to find confidence intervals for the parameters of the normal distribution. It will turn out that one has to discuss four different cases, depending on which of the parameters are known.


Example 4.4.4 (Exact confidence interval for the mean of a normal distribution N(µ, σ²) if σ² is known)
Let X1, . . . , Xn be iid N(µ, σ²)-distributed where σ² is known. From Theorem 4.3.1 we have that

L( √n(X̄n − µ)/σ ) = N(0, 1).

So, it follows that

Pµ( X̄n − u_{(1+γ)/2} σ/√n ≤ µ ≤ X̄n + u_{(1+γ)/2} σ/√n ) = Pµ( −u_{(1+γ)/2} ≤ (√n/σ)(X̄n − µ) ≤ u_{(1+γ)/2} ) = Φ(u_{(1+γ)/2}) − Φ(−u_{(1+γ)/2}) = γ.

In this computation we have used the symmetry of the normal distribution, by which

u_{1−α} = −u_α.

Thus, we have the confidence interval

Iµ = [ X̄n − u_{(1+γ)/2} σ/√n ; X̄n + u_{(1+γ)/2} σ/√n ]

for µ with confidence level γ.

Example 4.4.5 (Exact confidence interval for the mean of a normal distribution N(µ, σ²) if σ² is unknown)
Let X1, . . . , Xn be iid N(µ, σ²)-distributed random variables where σ² is unknown. We estimate σ² by

s_n² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄n)².

From Theorem 4.3.4 we have that

L( √n(X̄n − µ)/s_n ) = t_{n−1}.

Then, using the symmetry of the t_{n−1}-distribution, we have

Pµ( X̄n − t_{n−1,(1+γ)/2} s_n/√n ≤ µ ≤ X̄n + t_{n−1,(1+γ)/2} s_n/√n ) = Pµ( −t_{n−1,(1+γ)/2} ≤ (√n/s_n)(X̄n − µ) ≤ t_{n−1,(1+γ)/2} ) = γ.

Thus, the exact confidence interval is

Iµ = [ X̄n − t_{n−1,(1+γ)/2} s_n/√n ; X̄n + t_{n−1,(1+γ)/2} s_n/√n ].
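A minimal sketch of this t-interval (my own illustration, assuming numpy/scipy; the data vector is purely illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0])   # illustrative sample
n, gamma = len(x), 0.95

xbar = x.mean()
s = x.std(ddof=1)                                         # s_n with divisor n - 1
q = stats.t.ppf((1 + gamma) / 2, df=n - 1)
ci = (xbar - q * s / np.sqrt(n), xbar + q * s / np.sqrt(n))
print("I_mu =", ci)
# The same interval is returned by
# stats.t.interval(gamma, n - 1, loc=xbar, scale=s / np.sqrt(n))
```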


Example 4.4.6 (Exact confidence interval for the variance of a normal distribution N(µ, σ²) if µ is known)
Let X1, . . . , Xn be iid N(µ, σ²)-distributed random variables with known µ. Let σ̂n = Σ_{i=1}^n (Xi − µ)². The random variable σ̂n/σ² is the sum of n independent squares of N(0, 1)-distributed random variables and hence χ²_n-distributed, i.e.

P_{σ²}( σ̂n/χ²_{n,(1+γ)/2} ≤ σ² ≤ σ̂n/χ²_{n,(1−γ)/2} ) = P_{σ²}( χ²_{n,(1−γ)/2} ≤ σ̂n/σ² ≤ χ²_{n,(1+γ)/2} ) = γ.

Therefore, the confidence interval for σ² with level γ is given by

I_{σ²} = [ σ̂n/χ²_{n,(1+γ)/2} ; σ̂n/χ²_{n,(1−γ)/2} ].

Example 4.4.7 (Exact confidence interval for the variance of a normal distribution N(µ, σ²) if µ is unknown)
Let X1, . . . , Xn be iid N(µ, σ²)-distributed random variables with unknown µ. According to Theorem 4.3.2, we have L( (n−1)s_n²/σ² ) = χ²_{n−1}. This yields

P_{σ²}( (n−1)s_n²/χ²_{n−1,(1+γ)/2} ≤ σ² ≤ (n−1)s_n²/χ²_{n−1,(1−γ)/2} ) = P_{σ²}( χ²_{n−1,(1−γ)/2} ≤ (n−1)s_n²/σ² ≤ χ²_{n−1,(1+γ)/2} ) = γ.

So the confidence interval for σ² with level γ results in

I_{σ²} = [ (n−1)s_n²/χ²_{n−1,(1+γ)/2} ; (n−1)s_n²/χ²_{n−1,(1−γ)/2} ].

If the exact distribution of a parameter estimator is not known, one can use asymptotic results to obtain approximate confidence intervals.

Definition 4.4.8 (Asymptotic confidence set)
Let Θ ⊂ R be the parameter space and 0 < γ < 1 a prespecified probability. An asymptotic γ-confidence set for the parameter θ ∈ Θ is a random set

A(X) ⊆ Θ

determined by the random vector X = (X1, . . . , Xn) with the property

lim inf_{n→∞} Pθ(θ ∈ A(X)) ≥ γ ∀θ ∈ Θ.

Definition 4.4.9
An estimator θ̂n is called asymptotically normal with rate of consistency a_n if for a suitable σ > 0

a_n(θ̂n − θ) →L N(0, σ²) as n → ∞.

Typically, a_n = √n (e.g. in Theorem 3.5.9).


Proposition 4.4.10
Let θ̂n be an asymptotically normal estimator for θ with rate a_n and let σ̂n² →p σ² as n → ∞. Choose 0 < γ < 1. Then

[ θ̂n − (σ̂n/a_n) u_{(1+γ)/2} , θ̂n + (σ̂n/a_n) u_{(1+γ)/2} ]

is an asymptotic confidence interval for θ.

Proof:
By the asymptotic normality we have

Pθ( −u_{(1+γ)/2} ≤ a_n(θ̂n − θ)/σ ≤ u_{(1+γ)/2} ) → γ = P( −u_{(1+γ)/2} ≤ Z ≤ u_{(1+γ)/2} ) as n → ∞

if L(Z) = N(0, 1). Since a_n(θ̂n − θ)/σ →L Z and σ/σ̂n →p 1, we may apply Slutsky's lemma (Theorem 2.4.3) and replace σ by σ̂n.

Example 4.4.11 (Approximate confidence interval for the parameter p of the binomial distribution)
Let X1, . . . , Xn be iid binomially distributed B(1, p) with parameter p, 0 < p < 1. Then Y = X1 + . . . + Xn is B(n, p)-distributed. We use p̂n = Y/n as an estimator for p. By Theorem 2.5.3, the random variable

√n(p̂n − p)/√(p(1−p)) = (Y − np)/√(np(1−p))

is asymptotically N(0, 1)-distributed. Furthermore, √(p̂n(1−p̂n)) is a consistent estimator for the standard deviation of a B(1, p)-distributed random variable. Using Proposition 4.4.10 we obtain the following asymptotic confidence interval for p:

I_p = [ p̂n − u_{(1+γ)/2} √(p̂n(1−p̂n)/n) ; p̂n + u_{(1+γ)/2} √(p̂n(1−p̂n)/n) ].

Example 4.4.12 (Application)
In a poll before an election, 6% out of 2000 people voted for the pirate party. How sure can the party be to overcome the 5% hurdle? Let

Yi = 1 if person i voted for the pirates, and Yi = 0 if person i voted against the pirates.

Then X = Σ_{i=1}^n Yi fulfills L(X) = B(n, p). We observe X = 120, hence our estimate for p is

p̂ = X/n = 0.06.

With u_{(1+0.95)/2} ≈ 1.96 the approximate 95%-confidence interval is

I_p = [4.96%, 7.04%].
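The interval of Example 4.4.12 can be reproduced with a few lines (a sketch of my own, assuming numpy/scipy):

```python
import numpy as np
from scipy import stats

n, x, gamma = 2000, 120, 0.95
p_hat = x / n
u = stats.norm.ppf((1 + gamma) / 2)            # ~ 1.96
half = u * np.sqrt(p_hat * (1 - p_hat) / n)
print("I_p = [%.4f, %.4f]" % (p_hat - half, p_hat + half))   # ~ [0.0496, 0.0704]
```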


4.5 Two-sample problems

In this section we assume that we have two samples whose parameters should be compared by constructing confidence intervals for suitable functions of these parameters.

The setting is as follows: we have two samples

X1 = (X11, . . . , X1n1) and X2 = (X21, . . . , X2n2).

The random variables Xi1, . . . , Xini, i = 1, 2, are assumed to be iid with L(Xij) = P_{θi}, i = 1, 2, such that θi ∈ Θ ⊂ Rd.

Definition 4.5.1 (Paired and unpaired samples)
If X1 and X2 are independent, we speak of unpaired samples. If they are dependent, the samples are paired.

Now we consider a transformation g : Θ × Θ → R of the parameter vectors and look for a confidence interval for g(θ1, θ2). In the following examples, we will only consider the case

L(Xi) = N(µi, σi²), i = 1, 2.

Example 4.5.2 (Confidence interval for the difference µ1 − µ2 for known variances)
Let X1 and X2 be independent and assume that σ1², σ2² are known. Consider

g(µ1, µ2) = µ1 − µ2.

For the sample means X̄ini we have

• L(X̄ini) = N(µi, σi²/ni), i = 1, 2,
• X̄1n1 and X̄2n2 are independent.

Hence,

L(X̄1n1 − X̄2n2) = N( µ1 − µ2, σ1²/n1 + σ2²/n2 ),

such that

L( (X̄1n1 − X̄2n2 − µ1 + µ2) / √(σ1²/n1 + σ2²/n2) ) = N(0, 1).

Now a computation similar to that in Example 4.4.4 yields the γ-confidence interval

I_{µ1−µ2} = [ X̄1n1 − X̄2n2 − u_{(1+γ)/2} √(σ1²/n1 + σ2²/n2) ; X̄1n1 − X̄2n2 + u_{(1+γ)/2} √(σ1²/n1 + σ2²/n2) ].

Example 4.5.3 (Confidence interval for σ1²/σ2² for unknown means)
Let X1 and X2 be independent with unknown µ1, µ2 and consider

g(σ1, σ2) = σ1²/σ2².

By Theorem 4.3.2 we have

L( (ni − 1) s_{ini}² / σi² ) = χ²_{ni−1}, i = 1, 2,

and s_{1n1}² and s_{2n2}² are independent. Hence,

L( [ (n1−1) (n2−1)s_{2n2}²/σ2² ] / [ (n2−1) (n1−1)s_{1n1}²/σ1² ] ) = L( (s_{2n2}²/s_{1n1}²) · (σ1²/σ2²) ) = F_{n2−1,n1−1}.

Then a simple computation yields the γ-confidence interval

I_{σ1²/σ2²} = [ (s_{1n1}²/s_{2n2}²) F_{n2−1,n1−1,(1−γ)/2} ; (s_{1n1}²/s_{2n2}²) F_{n2−1,n1−1,(1+γ)/2} ].
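A hedged sketch of this variance-ratio interval (my own illustration, assuming numpy/scipy; the two samples are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n1, n2, gamma = 30, 40, 0.95
x1 = rng.normal(loc=1.0, scale=2.0, size=n1)   # sigma_1^2 = 4
x2 = rng.normal(loc=0.0, scale=1.0, size=n2)   # sigma_2^2 = 1

ratio = x1.var(ddof=1) / x2.var(ddof=1)        # s^2_{1n1} / s^2_{2n2}
lower = ratio * stats.f.ppf((1 - gamma) / 2, n2 - 1, n1 - 1)
upper = ratio * stats.f.ppf((1 + gamma) / 2, n2 - 1, n1 - 1)
print("confidence interval for sigma_1^2 / sigma_2^2:", (lower, upper))
```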

Example 4.5.4 (Confidence interval for the difference µ1 − µ2 for paired samples)
Let n1 = n2 = n and consider the case of paired samples X1 and X2. Then the random variables

Zi = X1i − X2i

are iid with L(Zi) = N(µ1 − µ2, σ²) with unknown variance σ². We can apply the result from Example 4.4.5 and obtain a γ-confidence interval

I_{µ1−µ2} = [ Z̄n − t_{n−1,(1+γ)/2} s_n/√n ; Z̄n + t_{n−1,(1+γ)/2} s_n/√n ],

where s_n² is the sample variance of the sample Z1, . . . , Zn.


Chapter 5

Hypothesis Testing

The aim of test theory is to decide based on a sample whether a given hypothesis on its distribution is true.

Consider the following setting (parametric test): let X = (X1, . . . , Xn) with L(X) ∈ {Pθ : θ ∈ Θ} and Θ ⊂ Rd. Choose a disjoint partition Θ = Θ0 ∪ Θ1 and assume that L(X) = P_{θ0}. Decide whether θ0 ∈ Θ0 or θ0 ∈ Θ1.

5.1 Basic notions

Definition 5.1.1 (Null hypothesis)
H0 : θ0 ∈ Θ0 is called the (null) hypothesis.
H1 : θ0 ∈ Θ1 is called the alternative.

Definition 5.1.2 (Test)
A test is specified by a subset S ⊂ Rn. If X ∉ S, we reject the null hypothesis that θ0 ∈ Θ0 in favor of θ0 ∈ Θ1. If X ∈ S, we do not reject the null hypothesis. S is called the acceptance region, its complement is the critical region.

A test is often constructed using a test statistic T whose distribution is known under the null hypothesis H0. The acceptance region is typically of the form

S = {x ∈ Rn : T(x) ∈ C0} for some set C0,

while

S^c = {x ∈ Rn : T(x) ∈ C1} for a set C1 such that C0 ∩ C1 = ∅.

Remark 5.1.3
Note that Definition 5.1.2 says that the null hypothesis is not rejected rather than saying that it is accepted. X ∈ S only tells us that the data do not give significant evidence that H0 is false. This is no proof that it is true! This procedure can be compared to a trial: as long as there is no sufficient evidence that the defendant is guilty, we will keep the assumption of his innocence.


Definition 5.1.4 (Errors of type I and II)
(i) The error of type I is the rejection of the null hypothesis H0 although it is true. Its probability is

α = P_{θ0}(X ∉ S), θ0 ∈ Θ0.

(ii) The error of type II is keeping the null hypothesis H0 although H1 is true. Its probability is

β = P_{θ0}(X ∈ S), θ0 ∈ Θ1.

            H0 is true          H0 is false
reject H0   error of type I     correct decision
keep H0     correct decision    error of type II

Of course, one is interested in a test which makes the error probabilities as small as possible. However, only one of the two probabilities can be controlled. In the example of the trial, the type I error would be to convict an innocent defendant; the error of type II is to absolve a guilty defendant. Typically, the type I error is considered worse.

Definition 5.1.5 (Significance level)
A test has significance level α if the probability for the error of type I is bounded by α, i.e.

Pθ(X ∉ S) ≤ α ∀θ ∈ Θ0.

A good test will also have a small probability of a type II error. Hence, this probability can be used as a quality criterion.

Definition 5.1.6 (Power and operation characteristic)
The probability that H0 is correctly rejected,

Q(S, θ0) = P_{θ0}(X ∉ S) = 1 − β, θ0 ∈ Θ1,

is called the power of the test based on S. The operation characteristic of the test is

A(S, θ0) = P_{θ0}(X ∈ S) = 1 − α for θ0 ∈ Θ0, and β for θ0 ∈ Θ1.

Remark 5.1.7
(i) The desirable values of the operation characteristic are A(S, θ0) ≈ 1 for θ0 ∈ Θ0 and A(S, θ0) ≈ 0 for θ0 ∈ Θ1.

(ii) There is a link between the risk of an estimator and the error probability of a test. Consider the loss function

L_S(x, θ) = 1 if x ∈ S, θ ∈ Θ1 or x ∉ S, θ ∈ Θ0 (error), and 0 else (correct decision).

Then

R(S, θ) = Eθ L_S(X, θ) = Pθ(X ∉ S) = α for θ ∈ Θ0, and Pθ(X ∈ S) = β for θ ∈ Θ1,

is the error probability.


Definition 5.1.8

(i) The test defined by S is called unbiased if its power fulfills

Q(S, θ0) ≥ α, θ0 ∈ Θ1,

i.e. the probability of correctly rejecting H0 is at least as high as the probability of falsely rejecting H0.

(ii) The test defined by S is called consistent if

Q(S, θ0) → 1 as n → ∞, θ0 ∈ Θ1.

Remark 5.1.9 (Construction of a test)

(i) Find a model for the data depending on the parameter θ ∈ Θ.

(ii) Formulate H0 and H1.

(iii) Choose the significance level α.

(iv) Choose a test statistic T and determine its distribution under the null hypothesis.

(v) Using the distribution of T, compute C0 and C1.

Then the test statistic is evaluated for the given data and we decide whether to reject the null hypothesis or not.

Note:

• The significance level α is chosen before actually performing the test. Do not choose the level which produces the desired results!

• H0 and H1 should be formulated such that H1 is the hypothesis you want to prove. In this case you can be sure to control the probability of a false decision if you reject H0 in favor of H1. In contrast, if the null hypothesis is not rejected, the error probability can be very high.

Example 5.1.10
A particular medicine is known to help with probability p0 = 5/6. The company has developed a new drug which helped 550 out of 600 patients in a test. Is the new drug better than the old one?

Find a stochastic model: let X denote the number of successes of the new drug. Then L(X) = B(n, p) with 0 < p < 1 and n = 600. We have H0 : p ≤ 5/6 (the new drug is not better) and H1 : p > 5/6 (the new drug is better). Hence, Θ0 = [0, 5/6] and Θ1 = (5/6, 1]. Choose α = 0.05. We should reject H0 if the new drug helps in suitably many cases. Hence, set S = {x : x ≤ c} for a suitable constant c. We need

P_p(X > c) ≤ α for all p ≤ p0 = 5/6.

Figure 5.1: The operation characteristic for the test in Example 5.1.10 (black) and the optimal operation characteristic (red).

Since P_p(X > c) ≤ P_{p0}(X > c), we try to solve the equation

P_{p0}(X > c) = α.

Using Theorem 2.5.3 we have

0.05 = P_{p0}(X > c) ≈ 1 − Φ( (c − np0)/√(np0(1−p0)) ),

hence

(c − np0)/√(np0(1−p0)) ≈ Φ^{−1}(0.95) = 1.65,

which yields c ≈ 515. In our example X = 550 > c, so we reject H0 at the 5% level.

For the operation characteristic, we get

A(S, p) = P_p(X ≤ c) = P_p( (X − np)/√(np(1−p)) ≤ (c − np)/√(np(1−p)) ) ≈ Φ( (c − np)/√(np(1−p)) )

by Theorem 2.5.3.
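The critical value c ≈ 515 and the operation characteristic can be checked numerically. The following sketch (my own addition, assuming numpy/scipy) also prints the exact binomial quantile next to the normal approximation used above.

```python
import numpy as np
from scipy import stats

n, p0, alpha = 600, 5 / 6, 0.05

# Normal approximation as used in the notes
c_approx = n * p0 + stats.norm.ppf(1 - alpha) * np.sqrt(n * p0 * (1 - p0))
# Exact binomial quantile: smallest c with P_{p0}(X > c) <= alpha
c_exact = stats.binom.ppf(1 - alpha, n, p0)
print("c (normal approx.):", c_approx)     # ~ 515
print("c (exact binomial):", c_exact)

# Operation characteristic A(S, p) = P_p(X <= c) for a few values of p
for p in (0.83, 5 / 6, 0.86, 0.90, 0.92):
    print("p = %.3f   A(S, p) = %.3f" % (p, stats.binom.cdf(c_approx, n, p)))
```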

Remark 5.1.11
Tests are closely related to confidence intervals. Let [g(X), h(X)] be a (1 − α)-confidence interval for θ. Consider

H0 : θ = θ0
H1 : θ ≠ θ0.

Then we can define the acceptance region

S = {x : θ0 ∈ [g(x), h(x)]}.

This leads to a test with level α since

P_{θ0}(X ∉ S) = 1 − P_{θ0}(θ0 ∈ [g(X), h(X)]) ≤ 1 − (1 − α) = α.

Instead of giving an answer such as "rejection" or "no rejection", statistics software typically returns the so-called p-value, which is a measure for the compatibility of the data with the null hypothesis.

Definition 5.1.12 (p-value)
Let (x1, . . . , xn) be a realisation of the random vector (X1, . . . , Xn). Consider a test based on the test statistic T(X1, . . . , Xn). The p-value of the test is the smallest significance level for which the null hypothesis is rejected when observing the value t = T(x1, . . . , xn).

Remark 5.1.13
(i) The null hypothesis is rejected at level α if α ≥ p.

(ii) The p-value can be interpreted as the probability to get a sample as extreme as or more extreme than the data given that H0 is true.

(iii) The significance level should be chosen before having seen the value of p!

5.2 Tests for normally distributed data

An important application is to find tests for the parameters of a normally distributed random variable. As in the case of confidence intervals, we have to distinguish which of the parameters are known.

Definition 5.2.1 (One-sided Gauss test)
Model: X1, . . . , Xn iid N(µ, σ²)-distributed, σ² known.
Problem:
H0 : µ = µ0, Θ0 = {µ0}
H1 : µ > µ0, Θ1 = (µ0, ∞)

The null hypothesis is thus one-sided. As X̄n is a sufficient statistic for µ, we only consider tests of the form

S = {x = (x1, . . . , xn) : x̄n ≤ c}, c ∈ R.

For fixed level α, we need c ∈ R such that

α = P_{µ0}(X ∉ S) = P_{µ0}(X̄n > c) = P_{µ0}( √n(X̄n − µ0)/σ > √n(c − µ0)/σ ) = 1 − P_{µ0}( √n(X̄n − µ0)/σ ≤ √n(c − µ0)/σ ) = 1 − Φ( √n(c − µ0)/σ ),

since L( √n(X̄n − µ0)/σ ) = N(0, 1). Hence, we choose

c = (σ/√n) Φ^{−1}(1 − α) + µ0.

In summary:
Test statistic: T(X) = √n(X̄n − µ0)/σ, L(T(X)) = N(0, 1) under H0.
Acceptance region: C0 = {T(X) ≤ u_{1−α}}.

Remark 5.2.2
In Definition 5.2.1 it makes no difference whether we use Θ0 = {µ0} or Θ0 = (−∞, µ0]. Since the alternative hypothesis is one-sided, µ0 is the critical value: if the true value µ is smaller than µ0, the value of the test statistic should be even smaller.

Sometimes we are interested in deviations from µ0 in both directions.

Definition 5.2.3 (Two-sided Gauss test)
Model: X1, . . . , Xn iid N(µ, σ²)-distributed and σ² known.
Problem:
H0 : µ = µ0
H1 : µ ≠ µ0, Θ1 = (−∞, µ0) ∪ (µ0, ∞)
Test statistic: T(X) = √n(X̄n − µ0)/σ, L(T(X)) = N(0, 1) under H0.
Acceptance region: C0 = {|T(X)| ≤ u_{1−α/2}}.

We compare the powers of the one-sided and the two-sided test. For the one-sided test, we consider µ > µ0 and get

Pµ(T(X) > u_{1−α}) = Pµ( √n(X̄n − µ0)/σ > u_{1−α} ) = Pµ( √n(X̄n − µ)/σ > u_{1−α} − √n(µ − µ0)/σ )
= 1 − Φ( u_{1−α} − √n(µ − µ0)/σ ) = Φ( −u_{1−α} + √n(µ − µ0)/σ ).

Figure 5.2: The powers of the one-sided (black) and the two-sided (red) Gauss test.

Then

α = Φ(−u_{1−α}) < Φ( −u_{1−α} + √n(µ − µ0)/σ ),

so the test is unbiased. A similar computation for the two-sided test yields

Pµ(|T(X)| > u_{1−α/2}) = 1 − Pµ( −u_{1−α/2} ≤ T(X) ≤ u_{1−α/2} )
= 1 − Φ( u_{1−α/2} − √n(µ − µ0)/σ ) + Φ( −u_{1−α/2} − √n(µ − µ0)/σ )
= Φ( −u_{1−α/2} + √n(µ − µ0)/σ ) + Φ( −u_{1−α/2} − √n(µ − µ0)/σ ).

This test can also be shown to be unbiased. Both tests are obviously consistent. Figure 5.2 shows that the one-sided test has a higher power.

If σ² is unknown, it is replaced by its estimator s_n², which leads to the t-test.

Definition 5.2.4 (One-sided t-test)
Model: X1, . . . , Xn iid N(µ, σ²)-distributed, σ² unknown.
Problem:
H0 : µ = µ0, Θ0 = {µ0}
H1 : µ > µ0, Θ1 = (µ0, ∞)
Test statistic: T(X) = √n(X̄n − µ0)/s_n, L(T(X)) = t_{n−1} under H0.
Acceptance region: C0 = {T(X) ≤ t_{n−1,1−α}}.

Example 5.2.5
A medical doctor has reason to believe that people with a tendency to heart attacks have a higher iron content in their blood than healthy people. A study has shown that the iron content in healthy men can be modelled by a random variable Z with L(Z) = N(115, σ²). We assume that the iron content in the blood of men with high risk of heart attacks is N(µ, σ²)-distributed. Test H0 : µ ≤ 115 versus H1 : µ > 115 at a level of 5%. Blood tests taken from n = 16 men gave x̄n = 126.625 and s_n = 20.545. Hence,

T(x1, . . . , x16) = 2.263 > 1.753 = t_{15,0.95},

such that we reject H0.
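The test statistic of Example 5.2.5 can be reproduced from the summary statistics; with raw data one could instead call scipy.stats.ttest_1samp. A sketch of my own (assuming scipy; the numbers are those of the example):

```python
import numpy as np
from scipy import stats

n, xbar, s, mu0, alpha = 16, 126.625, 20.545, 115.0, 0.05

T = np.sqrt(n) * (xbar - mu0) / s
crit = stats.t.ppf(1 - alpha, df=n - 1)
p_value = stats.t.sf(T, df=n - 1)              # one-sided p-value
print("T =", T, " critical value =", crit)     # 2.263 > 1.753 -> reject H0
print("p-value =", p_value)
```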

Definition 5.2.6 (Two-sided t-test)
Model: X1, . . . , Xn iid N(µ, σ²)-distributed and σ² unknown.
Problem:
H0 : µ = µ0
H1 : µ ≠ µ0, Θ1 = (−∞, µ0) ∪ (µ0, ∞)
Test statistic: T(X) = √n(X̄n − µ0)/s_n, L(T(X)) = t_{n−1} under H0.
Acceptance region: C0 = {|T(X)| ≤ t_{n−1,1−α/2}}.

As for confidence intervals, the task of performing a statistical test can be generalized to two-sample problems.

Definition 5.2.7 (Two-sided two-sample t-test)
Model: X1, . . . , Xn iid N(µ1, σ²)-distributed and Y1, . . . , Ym iid N(µ2, σ²)-distributed.
Problem:
H0 : µ1 = µ2
H1 : µ1 ≠ µ2
Test statistic:

T(X, Y) = (X̄n − Ȳm) / ( √(1/n + 1/m) · s ), L(T(X, Y)) = t_{n+m−2} under H0,

with

s² = ( (n−1)s_x² + (m−1)s_y² ) / (n + m − 2),

where

s_x² = (1/(n−1)) Σ_{j=1}^n (Xj − X̄n)², s_y² = (1/(m−1)) Σ_{j=1}^m (Yj − Ȳm)².

Acceptance region: C0 = {|T(x, y)| ≤ t_{n+m−2,1−α/2}}.

5.3 Likelihood ratio tests

A good test should have a large power for as many θ ∈ Θ1 as possible. Similar to the case of estimators, we can ask for uniformly best tests in this sense.

Definition 5.3.1 (Uniformly most powerful test)
A level α test with acceptance region S is called uniformly most powerful (UMP) if

Q(S, θ) = Pθ(X ∉ S) ≥ Pθ(X ∉ S′) = Q(S′, θ) ∀θ ∈ Θ1,

whenever S′ defines an arbitrary level α test.

Definition 5.3.2 (Size of a test)
A test with acceptance region S has size α if it has level α and

Pθ(X ∉ S) = α for some θ ∈ Θ0.

We have seen that the error probabilities, and hence the power of a test, can be interpreted as a risk. For point estimators, we know that there is no uniformly best estimator. How is the situation in the case of tests?

Theorem 5.3.3 (Neyman-Pearson Lemma)
Let X1, . . . , Xn be random variables with common distribution Pθ, where θ ∈ Θ = {θ0, θ1}. We test

H0 : θ = θ0 versus H1 : θ = θ1.

Let L(θ|x) be the likelihood function. For the test statistic T we use

T(X) = L(θ0|X) / L(θ1|X).

If there is a k such that

C0 = {T(x) ≥ k}, C1 = {T(x) < k}

leads to a test with size α, then this is the uniformly most powerful test.

Proof:
Let A0 and A1 denote the acceptance and critical region, respectively, of another test statistic S with size α. Then

P_{θ0}(T(X) ∈ C1) = P_{θ0}(S(X) ∈ A1) = α.   (5.1)

We have to show that

P_{θ1}(T(X) ∈ C1) ≥ P_{θ1}(S(X) ∈ A1).

We introduce the following simplifying notation:

Ci = {ω ∈ Ω : T(X(ω)) ∈ Ci}, i ∈ {0, 1},
Ai = {ω ∈ Ω : S(X(ω)) ∈ Ai}, i ∈ {0, 1}.

We have to show that P_{θ1}(C1) − P_{θ1}(A1) ≥ 0:

P_{θ1}(C1) − P_{θ1}(A1) = [P_{θ1}(C1 ∩ A1) + P_{θ1}(C1 \ A1)] − [P_{θ1}(C1 ∩ A1) + P_{θ1}(A1 \ C1)] = P_{θ1}(C1 \ A1) − P_{θ1}(A1 \ C1).

From the construction of the test T it follows that

ω ∈ C1 \ A1 ⊂ C1 ⇒ L(θ1|X(ω)) > (1/k) L(θ0|X(ω)),
ω ∈ A1 \ C1 ⊂ C0 ⇒ L(θ1|X(ω)) ≤ (1/k) L(θ0|X(ω)).

It plays no role whether the likelihood function is a density or a discrete probability in order to obtain

P_{θ1}(C1 \ A1) ≥ (1/k) P_{θ0}(C1 \ A1),
P_{θ1}(A1 \ C1) ≤ (1/k) P_{θ0}(A1 \ C1).

Using this we get

P_{θ1}(C1 \ A1) − P_{θ1}(A1 \ C1) ≥ (1/k) [ [P_{θ0}(C1 \ A1) + P_{θ0}(C1 ∩ A1)] − [P_{θ0}(A1 \ C1) + P_{θ0}(C1 ∩ A1)] ] = (1/k) [P_{θ0}(C1) − P_{θ0}(A1)] = 0

by (5.1).

Remark 5.3.4
In the case of discrete distributions it can be impossible to find a k such that the resulting test has size α. In this case we can use a randomized test: we first determine a C such that

P_{θ0}(T(X) < C) < α and P_{θ0}(T(X) ≤ C) > α.

Then we reject H0 if T(X) < C and keep H0 if T(X) > C. If T(X) = C, we perform a Bernoulli experiment with success probability

p = ( α − P_{θ0}(T(X) < C) ) / P_{θ0}(T(X) = C).

We reject H0 in case of a success and keep it otherwise. Then

P_{θ0}(H0 rejected) = P_{θ0}(T(X) < C) + p P_{θ0}(T(X) = C) = α.

Theorem 5.3.3 can be generalised to this setting.

Example 5.3.5
A biologist wants to study the animals in a forest. For that, he sets out 33 traps to see which species are caught. When checking the traps he finds the following numbers of animals caught:

# animals caught   0    1    2    3    4    5    6
# traps            9   11    4    5    1    2    1

We assume that the numbers X1, . . . , X33 of animals caught per trap are iid with L(Xi) = Poi(λ). We test

H0 : λ = λ0 = 1 versus H1 : λ = λ1 = 3.

The likelihood function is

L(λ|X) = e^{−33λ} λ^{Σ_{i=1}^{33} Xi} / Π_{i=1}^{33} (Xi!).

So the critical region according to the Neyman-Pearson lemma is

{ e^{−33λ0} λ0^{Σ Xi} / Π_{i=1}^{33} (Xi!) } / { e^{−33λ1} λ1^{Σ Xi} / Π_{i=1}^{33} (Xi!) } < k,

which corresponds to

e^{−33λ0} λ0^{Σ Xi} / ( e^{−33λ1} λ1^{Σ Xi} ) < k
⟺ −33λ0 + 33λ1 + [log(λ0) − log(λ1)] Σ_{i=1}^{33} Xi < log(k)
⟺ [log(λ0) − log(λ1)] Σ_{i=1}^{33} Xi < log(k) + 33(λ0 − λ1).

Since λ1 > λ0, the critical region should be of the form

Σ_{i=1}^{33} Xi > C,

where the constant C depends on the null hypothesis and the significance level but not on the alternative. Since L( Σ_{i=1}^{33} Xi ) = Poi(33λ), we have

L( Σ_{i=1}^{33} Xi ) = Poi(33)

under H0. Since 54 animals were caught, the p-value is given as

P_{λ0}( Σ_{i=1}^{33} Xi ≥ 54 ) ≈ 5 · 10^{−4},

such that H0 is rejected for any reasonable significance level.

If we interchange H0 and H1, the critical region is of the form Σ_{i=1}^{33} Xi < C′ for another constant C′. In this case, L( Σ_{i=1}^{33} Xi ) = Poi(99) and the p-value is

P_{λ1}( Σ_{i=1}^{33} Xi ≤ 54 ) ≈ 6 · 10^{−7}.

Hence, both hypotheses would be rejected! For the estimated mean, we get λ̂ = 54/33 ≈ 1.64. The values of λ0 and λ1 are both too far from this value, which causes the rejections. Consequently, the model λ ∈ {1, 3} was misspecified for the given problem.
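The two p-values of Example 5.3.5 can be verified directly with the Poisson distribution. A sketch of my own (assuming scipy; sf denotes the survival function P(X > k), so sf(53) = P(X ≥ 54)):

```python
from scipy import stats

total = 54        # animals caught in 33 traps

# H0: lambda = 1  ->  sum of counts ~ Poi(33); p-value = P(sum >= 54)
p_H0 = stats.poisson.sf(total - 1, mu=33)
# Interchanged hypotheses: sum ~ Poi(99); p-value = P(sum <= 54)
p_H1 = stats.poisson.cdf(total, mu=99)

print("P_{lambda=1}(sum >= 54) =", p_H0)   # roughly 5e-4
print("P_{lambda=3}(sum <= 54) =", p_H1)   # roughly 6e-7
```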

The Neyman-Pearson approach rejects H0 if the observation X = x is more probable under the alternative. However, having only single point hypotheses is not very realistic in practice. In order to generalise the Neyman-Pearson approach, we proceed as follows: for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 we estimate θ within both Θ0 and Θ1 by maximum likelihood. Then we perform a Neyman-Pearson test of

H0 : θ = θ̂0 versus H1 : θ = θ̂1.


Definition 5.3.6 (Likelihood ratio test)
The fraction

sup_{θ∈Θ0} L(θ|X) / sup_{θ∈Θ1} L(θ|X)

is called the likelihood ratio, and the test with the acceptance region

C0 = { sup_{θ∈Θ0} L(θ|X) / sup_{θ∈Θ1} L(θ|X) ≥ c_α }

is the likelihood ratio test.

Proposition 5.3.7
Let X1, . . . , Xn be iid N(µ, σ²)-distributed. The t-test given in Definitions 5.2.4 and 5.2.6 is the likelihood ratio test of the hypothesis H0 : µ = µ0 against the alternative H1 : µ > µ0 or H1 : µ ≠ µ0, respectively.

Proof:
We restrict ourselves to the two-sided alternative µ ≠ µ0. It is known from Example 3.4.3 that

X̄n = (1/n) Σ_{j=1}^n Xj   and   σ̂² = (1/n) Σ_{j=1}^n (Xj − X̄n)²

are the maximum likelihood estimators of µ and σ². For known mean µ = µ0 we have the maximum likelihood estimator

σ̂0² = (1/n) Σ_{j=1}^n (Xj − µ0)².

The likelihood ratio is

T(X) = L(µ0, σ̂0²|X) / L(X̄n, σ̂²|X)
= Π_{j=1}^n [ (1/√(2πσ̂0²)) exp( −(Xj − µ0)²/(2σ̂0²) ) ] / [ (1/√(2πσ̂²)) exp( −(Xj − X̄n)²/(2σ̂²) ) ]
= (σ̂²/σ̂0²)^{n/2} · exp( −(1/(2σ̂0²)) Σ_{j=1}^n (Xj − µ0)² ) / exp( −(1/(2σ̂²)) Σ_{j=1}^n (Xj − X̄n)² )
= (σ̂²/σ̂0²)^{n/2}

due to the definition of σ̂² and σ̂0². This means that the likelihood ratio test has the acceptance region

{ σ̂²/σ̂0² ≥ c_α },

with c_α chosen according to the significance level α. Since

σ̂0² = σ̂² + (µ0 − X̄n)²,

this is equivalent to

(X̄n − µ0)²/σ̂² ≤ 1/c_α − 1,

which holds exactly when

√n |X̄n − µ0| / s ≤ ( (n−1)(1/c_α − 1) )^{1/2}

with

s² = (n/(n−1)) σ̂².

The likelihood ratio test thus fails to reject H0 exactly when the two-sided t-statistic is below a certain bound, which is determined by the significance level. If we investigate the one-sided alternative, then we just have to consider the two cases X̄n ≤ µ0 and X̄n > µ0 separately.

Remark 5.3.8
Although likelihood ratio tests use the same construction as the Neyman-Pearson test, they are in general not uniformly most powerful.

5.4 The χ2-test

5.4.1 Derivation of the χ2-test

Definition 5.4.1
We construct the χ²-test in the following way:
Model: Z = (Z1, . . . , Zd) multinomially distributed with parameters (n, p1, . . . , pd), where pk ≥ 0, p1 + . . . + pd = 1 and Z1 + . . . + Zd = n.
Problem:
H0 : θ = (p1, . . . , pd) ∈ Θ0 = { θ0 = (p_1^{(0)}, . . . , p_d^{(0)}) }
H1 : θ ≠ θ0.
Test statistic:

χ² = Σ_{k=1}^d (Zk − n p_k^{(0)})² / (n p_k^{(0)}).

Acceptance region:

{ χ² ≤ χ²_{d−1,1−α} }.


Remark 5.4.2
In the test defined above, H0 is not rejected if the observed values do not differ too much from the expectations under H0. The deviation is measured using a weighted mean of the squared differences. The weights are chosen such that the distribution of the test statistic is known asymptotically (see Theorem 5.4.4).

Proposition 5.4.3
Let

Uk = (1/√n)(Zk − n p_k^{(0)}), k = 1, . . . , d,

where Z = (Z1, . . . , Zd) is multinomially distributed with parameters (n, p_1^{(0)}, . . . , p_d^{(0)}). Then the random vector U = (U1, . . . , Ud)^T fulfills

U →L N(0, Σ) as n → ∞,

where Σ = (σ_kl)_{1≤k,l≤d} and

σ_kl = cov_{θ0}(Uk, Ul) = p_k^{(0)}(1 − p_k^{(0)}) if k = l, and −p_k^{(0)} p_l^{(0)} if k ≠ l.

Proof:
Let X1, . . . , Xn be iid random variables with values in the finite set {ξ1, . . . , ξd} and

P(Xj = ξk) = p_k^{(0)}, j = 1, . . . , n, k = 1, . . . , d.

We consider the vector ε^{(j)} = (ε_1^{(j)}, . . . , ε_d^{(j)})^T given by

ε_k^{(j)} = 1 if Xj = ξk, and 0 if Xj ≠ ξk.

We let

Z_k* = Σ_{j=1}^n ε_k^{(j)},
U_k* = (1/√n)(Z_k* − n p_k^{(0)}).

(Z_1*, . . . , Z_d*) and (U_1*, . . . , U_d*) have the same distribution as (Z1, . . . , Zd) and (U1, . . . , Ud). The random vectors ε^{(1)}, . . . , ε^{(n)} are iid with

E(ε^{(j)}) = (p_1^{(0)}, . . . , p_d^{(0)})^T = θ_0^T.

Therefore

U* = (U_1*, . . . , U_d*)^T = (1/√n) Σ_{j=1}^n ( ε^{(j)} − θ_0^T )

is the standardized sum of iid random vectors with covariance matrix Σ given by

σ_kl = cov( ε_k^{(j)}, ε_l^{(j)} ) = E( (ε_k^{(j)} − p_k^{(0)})(ε_l^{(j)} − p_l^{(0)}) ) = −p_k^{(0)} p_l^{(0)} if k ≠ l (since ε_k^{(j)} ε_l^{(j)} = 0), and p_k^{(0)}(1 − p_k^{(0)}) if k = l.

From the multivariate central limit theorem it follows that

L(U) = L(U*) →L N(0, Σ).

Theorem 5.4.4
Under the hypothesis, i.e. if Z = (Z1, . . . , Zd) is multinomially distributed with parameters (n, p_1^{(0)}, . . . , p_d^{(0)}), the χ²-statistic is asymptotically χ²_{d−1}-distributed:

χ² = Σ_{k=1}^d (Zk − n p_k^{(0)})² / (n p_k^{(0)}) →L χ²_{d−1} as n → ∞.

Proof:
We set Yk = Uk / √(p_k^{(0)}) with Uk from Proposition 5.4.3 such that

χ² = Σ_{k=1}^d Yk², Y = (Y1, . . . , Yd)^T →L N(0, Σ̃),

where Σ̃ = (σ̃_kl)_{1≤k,l≤d} with

σ̃_kl = cov_{θ0}(Yk, Yl) = 1 − p_k^{(0)} if k = l, and −√(p_k^{(0)} p_l^{(0)}) if k ≠ l.

As Σ̃ is a covariance matrix, it is symmetric and positive semi-definite, i.e. it has non-negative eigenvalues λ1 ≥ . . . ≥ λd ≥ 0 and a corresponding orthonormal system of eigenvectors e1, . . . , ed. We see immediately that

ed = ( √(p_1^{(0)}), . . . , √(p_d^{(0)}) )^T

is an eigenvector with eigenvalue λd = 0. Let v = (v1, . . . , vd)^T be a vector which is orthogonal to ed, i.e.

Σ_{i=1}^d √(p_i^{(0)}) vi = 0.

This yields

(Σ̃v)_k = (1 − p_k^{(0)}) vk − Σ_{l≠k} √(p_k^{(0)} p_l^{(0)}) vl = vk − √(p_k^{(0)}) Σ_{l=1}^d √(p_l^{(0)}) vl = vk,

i.e. v is always an eigenvector with eigenvalue 1. This means that e1, . . . , e_{d−1} are eigenvectors with eigenvalues λ1 = . . . = λ_{d−1} = 1. Let O be the orthogonal matrix with columns e1, . . . , ed and

V = (V1, . . . , Vd)^T = O^T Y.

It follows from the orthogonality of O that

χ² = Σ_{k=1}^d Yk² = Σ_{k=1}^d Vk²

and

Vd = ed^T Y = Σ_{k=1}^d √(p_k^{(0)}) Yk = Σ_{k=1}^d Uk = (1/√n) Σ_{k=1}^d (Zk − n p_k^{(0)}) = 0,

since Z1 + . . . + Zd = n and p_1^{(0)} + . . . + p_d^{(0)} = 1. We obtain

cov_{θ0}(Vk, Vl) = E_{θ0}(ek^T Y Y^T el) = ek^T Σ̃ el = λl ek^T el = 1 if k = l < d, and 0 otherwise.

So, we have shown that

χ² = Σ_{k=1}^{d−1} Vk² with (V1, . . . , V_{d−1})^T →L N(0, I_{d−1}),

where I_{d−1} is the (d−1)-dimensional identity matrix. Now the assertion follows immediately from the definition of the χ²-distribution.

Example 5.4.5
We investigate the colour and shape of a sample of n = 556 peas and want to test whether the observed frequencies of the two features are compatible with the theory of genetic inheritance.

phenotype         Zk    p_k^{(0)} by theory   n p_k^{(0)}
smooth, yellow    315         9/16              312.75
smooth, green     108         3/16              104.25
wrinkly, yellow   101         3/16              104.25
wrinkly, green     32         1/16               34.75

We model the phenotype as a multinomially distributed random variable. The individual observations are assumed to be independent. The hypotheses are

H0 : p = p^{(0)} and H1 : p ≠ p^{(0)}.

We choose α = 0.05 and get

χ² = Σ_{k=1}^4 (Zk − n p_k^{(0)})² / (n p_k^{(0)}) = 0.47 and χ²_{3,0.95} = 7.81.

Hence, we cannot reject H0.
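A sketch reproducing the χ²-statistic of Example 5.4.5 (my own illustration, assuming numpy/scipy):

```python
import numpy as np
from scipy import stats

observed = np.array([315, 108, 101, 32])
p0 = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * p0                 # n * p_k^(0)

chi2 = ((observed - expected) ** 2 / expected).sum()
print("chi2 =", chi2)                          # ~ 0.47
print("chi2_{3,0.95} =", stats.chi2.ppf(0.95, df=3))   # ~ 7.81
# Equivalently, scipy computes the statistic and an asymptotic p-value:
print(stats.chisquare(observed, f_exp=expected))
```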


Remark 5.4.6
The χ²-test is actually an approximation of the likelihood ratio test for multinomially distributed random variables. Let Z = (Z1, . . . , Zd) be multinomially distributed with parameter

θ ∈ Θ = { (p1, . . . , pd) : p1, . . . , pd ≥ 0, Σ_{i=1}^d pi = 1 }.

The likelihood function is

L(θ|Z) = Pθ(Z1, . . . , Zd) = ( n choose Z1 . . . Zd ) Π_{k=1}^d pk^{Zk}.

The maximum likelihood estimator is obtained as in the case of a binomial distribution as p̂k = Zk/n for k = 1, . . . , d. Then the likelihood ratio is

λ(Z) = L(θ0|Z) / max_{θ∈Θ} L(θ|Z) = Π_{k=1}^d ( n p_k^{(0)} / Zk )^{Zk}.

H0 is not rejected if λ(Z) exceeds a certain threshold. Since the logarithm is strictly increasing, we can also consider log(λ(Z)). Since the distribution of log(λ(Z)) is difficult to determine, we use an approximation. Using the Taylor expansion of log(1 + x), we obtain

log(λ(Z)) = Σ_{k=1}^d Zk log( n p_k^{(0)} / Zk )
≈ Σ_{k=1}^d Zk ( n p_k^{(0)}/Zk − 1 ) − (1/2) Σ_{k=1}^d Zk ( n p_k^{(0)}/Zk − 1 )²
= −(1/2) Σ_{k=1}^d Zk ( n p_k^{(0)}/Zk − 1 )²
≈ −(1/2) Σ_{k=1}^d (n p_k^{(0)} − Zk)² / (n p_k^{(0)}).

The first term in the Taylor expansion vanishes since Σ_{i=1}^d Zi = n and Σ_{i=1}^d p_i^{(0)} = 1. The last approximation is obtained by the law of large numbers, as Zk/n → p_k^{(0)} as n → ∞ under H0.

Remark 5.4.7
The χ²-test can be generalised to the case of higher-dimensional hypotheses, i.e. hypotheses of the form

H0 : θ ∈ Θ0 = { (p1, . . . , pd) : pk = pk(δ), k = 1, . . . , d, δ ∈ ∆ ⊆ R^q }.

We call q the dimensionality of the hypothesis. In that case, we must estimate the parameter δ to get an estimate of the probabilities pk(δ).


Theorem 5.4.8
Let Z = (Z1, . . . , Zd) be multinomially distributed with parameters (n, p1, . . . , pd). Under the hypothesis pk = pk(δ), k = 1, . . . , d, we have that

χ² = Σ_{k=1}^d (Zk − n pk(δ̂))² / (n pk(δ̂)) →L χ²_{d−q−1},

where q is the dimensionality of the hypothesis and δ̂ is the maximum likelihood estimator of δ.

Proof:
The proof of this theorem is similar to the proof of Theorem 5.4.4.

Remark 5.4.9
For each estimated parameter the number of degrees of freedom is reduced by one, as the estimators introduce restrictions on the Zk. Hence, we only consider the case q < d − 1.

5.4.2 Goodness-of-fit tests

We test if the distribution of the data belongs to a given class of distributions.
Problem: X1, . . . , Xn iid with values in X. Does L(Xj) belong to {Pθ : θ ∈ Θ}?

To investigate this with the χ²-test, we divide X into d disjoint subsets X1, . . . , Xd with X = ∪_{i=1}^d Xi and consider the indicator variables

ε_k^{(j)} = 1 if Xj ∈ Xk, and 0 if Xj ∉ Xk, j = 1, . . . , n, k = 1, . . . , d,

and the counting variables

Zk = Σ_{j=1}^n ε_k^{(j)}, k = 1, . . . , d.

If L(Xj) ∈ {Pθ : θ ∈ Θ}, say L(Xj) = Pθ, then Z = (Z1, . . . , Zd) is multinomially distributed with parameters (n, p1(θ), . . . , pd(θ)), where pk(θ) = Pθ(Xj ∈ Xk). Therefore, we are in the situation of Theorem 5.4.8, where q is the dimension of Θ, i.e. we can test the hypothesis L(Xj) ∈ {Pθ : θ ∈ Θ} by a χ²-test.

Remark 5.4.10
The number d of subsets should increase with the sample size n, but not as fast as n. One should have

n pi(θ) → ∞ as n → ∞, ∀θ ∈ Θ, i = 1, . . . , d,

and n pi(θ) ≥ 5 for all i. There is no general recipe on how to choose the subsets. If some information on possible alternative distributions is available, one should choose many subsets in areas where the groups of distributions are rather different.

Example 5.4.11
A shop owner counts the number of customers entering his shop in intervals of five minutes. His counts are as follows:

Number of customers k    0     1     2     3    4   ≥ 5
Zk                      381   360   173   65   13    8

He interprets his counts as an iid sample X1, . . . , Xn (n = 1000) of random variables with values in N0. He conjectures that L(Xi) = Poi(λ) for some λ > 0. So we test

H0 : L(Xi) ∈ {Poi(λ) : λ > 0} versus H1 : L(Xi) ∉ {Poi(λ) : λ > 0}.

The maximum likelihood estimator of λ is

λ̂ = X̄n = 0.994.

With that,

k                          0     1     2     3    4   ≥ 5
E_λ̂[Zk] = n pk(λ̂)       369   367   184   61   15    4

Hence,

χ² = Σ_{k=0}^5 (Zk − E_λ̂[Zk])² / E_λ̂[Zk] = 5.72.

Here, d − q − 1 = 6 − 1 − 1 = 4, so we compare this with χ²_{4,0.95} = 9.5. This means that H0 cannot be rejected.
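A sketch of the goodness-of-fit computation in Example 5.4.11 (my own illustration, assuming numpy/scipy; λ̂ is taken from the notes and the last class collects all counts ≥ 5). With unrounded expected counts the statistic comes out somewhat larger than the 5.72 obtained from the rounded table above, but it stays below the critical value χ²_{4,0.95} ≈ 9.49, so the conclusion is unchanged.

```python
import numpy as np
from scipy import stats

observed = np.array([381, 360, 173, 65, 13, 8])
n = observed.sum()                       # 1000
lam_hat = 0.994                          # ML estimate X_bar_n from the notes

# Cell probabilities under Poi(lam_hat): P(k) for k = 0,...,4 and P(X >= 5)
p = [stats.poisson.pmf(k, lam_hat) for k in range(5)]
p.append(1 - sum(p))
expected = n * np.array(p)

chi2 = ((observed - expected) ** 2 / expected).sum()
df = len(observed) - 1 - 1               # d - q - 1 with q = 1 estimated parameter
print("chi2 =", chi2, " critical value =", stats.chi2.ppf(0.95, df))
```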

5.4.3 Test of independence

The χ²-test can be used to test for independence, too.
Model: let (X1, Y1), . . . , (Xn, Yn) be iid with values in {1, . . . , mx} × {1, . . . , my} and distribution specified by

P(Xj = k, Yj = l) = p_kl, k = 1, . . . , mx, l = 1, . . . , my.

Let p_1^{(x)}, . . . , p_{mx}^{(x)} and p_1^{(y)}, . . . , p_{my}^{(y)} be the marginal distributions of Xj and Yj:

p_k^{(x)} = P(Xj = k) = Σ_{l=1}^{my} p_kl, k = 1, . . . , mx,
p_l^{(y)} = P(Yj = l) = Σ_{k=1}^{mx} p_kl, l = 1, . . . , my.

Problem: H0 : Xj, Yj independent against H1 : Xj, Yj dependent. The hypothesis is then equivalent to

H0 : ∀(k, l): p_kl = p_k^{(x)} p_l^{(y)},
H1 : ∃(k, l): p_kl ≠ p_k^{(x)} p_l^{(y)}.

The parameter space

Θ = { (p_11, p_12, . . . , p_{mx my}) : Σ_{k=1}^{mx} Σ_{l=1}^{my} p_kl = 1, p_kl ≥ 0, k = 1, . . . , mx, l = 1, . . . , my }

has dimension mx·my − 1. The dimension of the hypothesis space

Θ0 = { (p_11, p_12, . . . , p_{mx my}) : p_kl = p_k^{(x)} p_l^{(y)}, Σ_{k=1}^{mx} p_k^{(x)} = 1, Σ_{l=1}^{my} p_l^{(y)} = 1, p_1^{(x)}, . . . , p_{mx}^{(x)}, p_1^{(y)}, . . . , p_{my}^{(y)} ≥ 0 }

for the independence assumption is q = (mx − 1) + (my − 1).

Test statistic: let Z_kl be the number of pairs (Xj, Yj) with Xj = k, Yj = l. We use the notation

Z_{k·} = Σ_{l=1}^{my} Z_kl, Z_{·l} = Σ_{k=1}^{mx} Z_kl.

The maximum likelihood estimators under the hypothesis are

p̂_k^{(x)} = Z_{k·}/n, p̂_l^{(y)} = Z_{·l}/n,

and therefore p̂_kl = p̂_k^{(x)} p̂_l^{(y)}. The χ²-statistic from Theorem 5.4.8 has the form

χ² = Σ_{k=1}^{mx} Σ_{l=1}^{my} ( Z_kl − n p̂_k^{(x)} p̂_l^{(y)} )² / ( n p̂_k^{(x)} p̂_l^{(y)} ) = Σ_{k=1}^{mx} Σ_{l=1}^{my} ( Z_kl − Z_{k·}Z_{·l}/n )² / ( Z_{k·}Z_{·l}/n )

and is, under the hypothesis, asymptotically χ²-distributed with

mx·my − (mx − 1) − (my − 1) − 1 = (mx − 1)(my − 1)

degrees of freedom.

Acceptance region:

C0 = { χ² < χ²_{(mx−1)(my−1),1−α} }.

Remark 5.4.12
Assume that we observe (Xj, Yj) where Xj has r possible outcomes and Yj has s possible outcomes. The information for a test of independence is often displayed in a contingency table:

        1    ···   s
  1    Z11  ···  Z1s   Z1·
  ⋮     ⋮          ⋮     ⋮
  r    Zr1  ···  Zrs   Zr·
       Z·1  ···  Z·s   Z·· = n


Example 5.4.13
A study published in Mann et al., British Medical Journal, 1975, investigates whether smoking is related to heart attacks. The data is

              Heart Attack   No Heart Attack
Smoker             45               83          128
Non-Smoker         14               74           88
                   59              157          216

We test H0: smoking and heart attacks are independent. We have χ² = 9.74, while χ²_{1,0.99} = 6.63. So H0 is rejected at the 1%-level.
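The χ²-statistic of Example 5.4.13 can be reproduced with scipy.stats.chi2_contingency (a sketch of my own; correction=False switches off the Yates continuity correction so that the plain χ²-statistic from the notes is obtained for this 2 × 2 table):

```python
import numpy as np
from scipy import stats

table = np.array([[45, 83],
                  [14, 74]])

chi2, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print("chi2 =", chi2)        # ~ 9.74
print("df   =", df)          # (2-1)*(2-1) = 1
print("p    =", p_value)     # < 0.01, so H0 is rejected at the 1% level
```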

5.5 Asymptotic results for likelihood ratio tests

We consider the following situation: let X1, . . . , Xn be iid with L(Xi) ∈ {Pθ : θ ∈ Θ} and Θ ⊆ R. Pθ should have density fθ. Test

H0 : θ = θ0 against H1 : θ ≠ θ0.

In this section we will use asymptotic results to derive the constant c_α determining the acceptance region of a likelihood ratio test based on the test statistic

λn(X) = L(θ0|X) / sup_{θ∈Θ} L(θ|X) = L(θ0|X) / L(θ̂n|X),

i.e. C0 = {x : λn(x) ≥ c_α}. This is equivalent to finding c*_α such that

C0 = { x : −2 log λn(x) = 2[l(θ̂n|X) − l(θ0|X)] ≤ c*_α }.

We will also consider two alternative tests:

Definition 5.5.1 (Wald’s test and Rao’s test)

(i) Let θn be the maximum likelihood estimator of θ. Wald’s test has the test statistic

Wn = nI(Pθn)(θn − θ0)2

and acceptance regionCW

0 = Wn ≤ cWα .

(ii) Let ψ(θ|X) := ∂∂θl(θ|X). Rao’s test has the test statistic

Rn =1

n

ψ2(θ0|X)

I(Pθ0)

and acceptance regionCR

0 = Rn ≤ cRα.

Remark 5.5.2

(i) The intuition behind Wald’s and Rao’s test is as follows:

Page 76: Mathematical Statistics - TU Kaiserslautern · Mathematical Statistics Claudia Redenbach ... 8.2 Applications ... 4 Chapter 2 Repetition and Notation

72 Chapter 5 Hypothesis Testing

• Wald’s test: Due to the consistency of the ML-estimator, we have θnp−→ θ, so

H0 : θ = θ0 is plausible if (θn−θ0)2 is small. Standardising by multiplication withnI(Pθn) yields an invariant limit distribution.

• Rao’s test: By definition of the ML-estimator, we have ψ(θn|X) = 0. By conti-nuity assumptions, ψ(θ|X) ≈ 0 if θn ≈ θ. So H0 is plausible if ψ(θ0|X) ≈ 0.

(ii) If Θ ⊂ Rd, Wn and Rn are defined as

Wn = n(θn − θ0)T I(Pθn)(θn − θ0)

Rn =1

nψT (θ0|X)I(Pθ0)−1ψ(θ0|X)

In the following, we will show that the likelihood ratio test, Wald’s and Rao’s test areasymptotically equivalent.

We assume that the assumptions of Theorem 3.5.9 are satisfied. Then

√n(θn − θ0)

L−−−→n→∞

N

(0,

1

I(Pθ0)

)under H0, which implies

√nI(Pθ0)(θn − θ0)

L−−−→n→∞

N (0, I(Pθ0)) .

The proof of Theorem 3.5.9 further yields

1√nψ(θ0|X) =

1√n

∂θQn(θ0)

L−−−→n→∞

N (0, I(Pθ0)) ,

where Qn(θ) = l(θ|X) =∑n

i=1 log fθ(Xi).This states that the limit variables have the same distribution. In the following, we will

see that they are the same.

Lemma 5.5.3
Under the assumptions of Theorem 3.5.9 and if H0 holds, we have

√n ( (1/n) ψ(θ0|X) − I(P_{θ0})(θ̂n − θ0) ) →p 0 as n → ∞.

Proof:
We make a Taylor expansion of order 1:

0 = (1/√n) ψ(θ̂n|X) = (1/√n) ψ(θ0|X) + (1/n) ψ′(θ*n|X) √n(θ̂n − θ0).

In the proof of Theorem 3.5.9 we have seen that

(1/n) ψ′(θ*n|X) →p −I(P_{θ0}) as n → ∞,

which shows the assertion.


Lemma 5.5.4
Under the assumptions of Theorem 3.5.9 and if H0 holds, we have

(i) (1/n) ψ²(θ0|X) / I(P_{θ0}) →L χ²_1 as n → ∞,

(ii) n I(P_{θ0}) (θ̂n − θ0)² →L χ²_1 as n → ∞.

Proof:

(i) By the proof of Theorem 3.5.9,

(1/√n) ψ(θ0|X) = (1/√n) ∂/∂θ Qn(θ0) →L N(0, I(P_{θ0})).

Hence,

(1/√n) ψ(θ0|X) / √(I(P_{θ0})) →L N(0, 1),

and the square of this expression converges to a χ²_1-distributed random variable.

(ii) By Lemma 5.5.3 the difference of both terms converges to zero in probability. So the assertion follows from (i) and Slutsky's lemma (Un →L Z and Vn →p 0 implies Un + Vn →L Z).

Lemma 5.5.5
If Θ ⊂ Rd and the assumptions of Lemma 5.5.4 are fulfilled in this multivariate situation, then

(i) (1/n) ψ^T(θ0|X) I(P_{θ0})^{−1} ψ(θ0|X) →L χ²_d as n → ∞,

(ii) n (θ̂n − θ0)^T I(P_{θ0}) (θ̂n − θ0) →L χ²_d as n → ∞.

Lemma 5.5.6
Under the assumptions of Theorem 3.5.9 and if H0 holds, we have

(i) [l(θ̂n|X) − l(θ0|X)] − (1/2) n I(P_{θ0}) (θ̂n − θ0)² →p 0 as n → ∞,

(ii) 2[l(θ̂n|X) − l(θ0|X)] →L χ²_1 as n → ∞.

Proof:

(i) We perform a Taylor expansion around θ̂n:

l(θ0|X) − l(θ̂n|X) = l′(θ̂n|X)(θ0 − θ̂n) + (1/2)(θ0 − θ̂n)² l″(θ*n|X).

Here, l′(θ̂n|X) = 0 and (1/n) l″(θ*n|X) →p −I(P_{θ0}), as was seen in the proof of Theorem 3.5.9.

(ii) Follows from (i), Lemma 5.5.4 (ii) and Slutsky's lemma.

Theorem 5.5.7
Under the assumptions of Theorem 3.5.9 and if H0 holds, we have

−2 log λn, Wn, Rn →L χ²_1 as n → ∞.

Proof:
The statement for λn follows from Lemma 5.5.6 (ii), the statement for Rn is obtained from Lemma 5.5.4 (i), and the statement for Wn is obtained using Lemma 5.5.4 (ii), I(P_{θ̂n}) →p I(P_{θ0}) and Slutsky's lemma.

Theorem 5.5.8
If Θ ⊂ Rd and the assumptions of Lemma 5.5.5 are fulfilled in this multivariate situation, then

−2 log λn, Wn, Rn →L χ²_d as n → ∞.

From this we can conclude that c*_α, c_α^W and c_α^R can be chosen as χ²_{d,1−α}.

Example 5.5.9
Let $\mathcal{L}(X) = B(n, p)$ and test
\[
H_0: p = p_0 \quad \text{versus} \quad H_1: p \neq p_0.
\]
Then
\[
L(p|X) = \binom{n}{X} p^X (1-p)^{n-X},
\]
\[
l(p|X) = \log\binom{n}{X} + X\log p + (n-X)\log(1-p),
\]
\[
\psi(p|X) = \frac{\partial}{\partial p} l(p|X) = \frac{X}{p} - \frac{n-X}{1-p},
\]
which yields $\hat{p} = \frac{X}{n}$, and
\[
I(P_p) = E_p\psi^2(p|X) = \frac{1}{p^2(1-p)^2} E_p(X - np)^2 = \frac{np(1-p)}{p^2(1-p)^2} = \frac{n}{p(1-p)}.
\]
For the LR-statistic, we get
\[
\lambda_n = \frac{L(p_0|X)}{L(\hat{p}|X)} = \left(\frac{p_0}{\hat{p}}\right)^X \left(\frac{1-p_0}{1-\hat{p}}\right)^{n-X}
\]


and
\[
-2\log\lambda_n = -2X\log\frac{np_0}{X} - 2(n-X)\log\frac{n(1-p_0)}{n-X}.
\]
The Rao statistic is
\[
R_n = \frac{\psi^2(p_0|X)}{I(P_{p_0})} = \frac{p_0(1-p_0)}{n}\left(\frac{X}{p_0} - \frac{n-X}{1-p_0}\right)^2 = \frac{1}{np_0(1-p_0)}(X - np_0)^2,
\]
which is the test statistic of the $\chi^2$-test: a binomial distribution is a multinomial distribution with $d = 2$. Hence, set $Z_1 = X$, $Z_2 = n - X$, $p_1 = p$ and $p_2 = 1 - p$. Then
\[
\chi^2 = \frac{(X - np_0)^2}{np_0} + \frac{(n - X - n(1-p_0))^2}{n(1-p_0)} = \frac{1}{p_0(1-p_0)}\,\frac{(X - np_0)^2}{n}.
\]
The Wald statistic, finally, is
\[
W_n = I(P_{\hat{p}})(\hat{p} - p_0)^2 = \frac{n}{\hat{p}(1-\hat{p})}\,\frac{(X - np_0)^2}{n^2} \approx \frac{1}{np_0(1-p_0)}(X - np_0)^2
\]
as $\hat{p} \to p_0$ under $H_0$.
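
The three statistics of Example 5.5.9 are easy to compare numerically. The following Python sketch computes $-2\log\lambda_n$, $W_n$ and $R_n$ together with their asymptotic $\chi^2_1$ p-values; the observed count, the sample size and $p_0$ are illustrative assumptions, not values from the notes.

```python
import numpy as np
from scipy.stats import chi2

def binomial_test_statistics(X, n, p0):
    """LR, Wald and Rao statistics for H0: p = p0 in the B(n, p) model (0 < X < n)."""
    p_hat = X / n
    # -2 log lambda_n (likelihood ratio statistic)
    lr = -2 * X * np.log(n * p0 / X) - 2 * (n - X) * np.log(n * (1 - p0) / (n - X))
    # Wald statistic: I(P_phat) * (phat - p0)^2
    wald = n / (p_hat * (1 - p_hat)) * (p_hat - p0) ** 2
    # Rao (score) statistic: psi(p0|X)^2 / I(P_p0)
    rao = (X - n * p0) ** 2 / (n * p0 * (1 - p0))
    return lr, wald, rao

# illustrative numbers (assumptions for this sketch)
X, n, p0 = 58, 100, 0.5
for name, stat in zip(["-2 log lambda", "Wald", "Rao"],
                      binomial_test_statistics(X, n, p0)):
    print(f"{name:14s} {stat:6.3f}  p-value {chi2.sf(stat, df=1):.4f}")
```

For these numbers all three statistics are close to each other, which is exactly the asymptotic equivalence shown above.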

What happens if we test $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ and $\Theta_0$ is not a one-point set?

Assumption (R):
$\Theta_0$ can be represented as
\[
\Theta_0 = \{\theta \in \Theta;\; R_1(\theta) = \dots = R_q(\theta) = 0\} \quad \text{for some } 1 \le q \le d,
\]
where $\Theta \subseteq \mathbb{R}^d$, $R_i: \Theta \to \mathbb{R}$. Let $R_1, \dots, R_q$ be not redundant, i.e. $\Theta_0$ cannot be represented with fewer constraints. More precisely, we assume that the $q \times d$ Jacobi matrix
\[
C_\theta = \left(\frac{\partial R_i}{\partial \theta_k}\right)_{\substack{i=1,\dots,q\\ k=1,\dots,d}}
\]
has rank $q$.

Or, equivalently,

Assumption (G):
$\Theta_0$ can be represented as $\Theta_0 = g(\mathcal{M})$, with $g: \mathcal{M} \to \Theta_0$, $\mathcal{M} \subseteq \mathbb{R}^{d-q}$, i.e. $\theta_k = g_k(\delta_1, \dots, \delta_{d-q})$ for $k = 1, \dots, d$. Let $g$ be not redundant, such that the $d \times (d-q)$ Jacobi matrix
\[
D_\delta = \left(\frac{\partial g_k}{\partial \delta_i}\right)_{\substack{k=1,\dots,d\\ i=1,\dots,d-q}}
\]
has rank $d - q$. Then $\hat\theta^0_n = g(\hat\delta_n)$, where $\hat\delta_n$ is the maximum likelihood estimator in the reparametrised model under $H_0$,
\[
\mathcal{L}(X_j) \in \{P^0_\delta : \delta \in \Delta\}, \qquad P^0_\delta = P_{g(\delta)} = P_\theta \text{ for } \theta \in \Theta_0,
\]


as well as the ML-estimator under the constraints given in assumption (R).

Statistics:

LR-statistic: $\lambda_n = \dfrac{\sup_{\theta \in \Theta_0} L(\theta|X)}{\sup_{\theta \in \Theta} L(\theta|X)}$ or, equivalently, $-2\log\lambda_n$

Wald statistic: $W_n = n R^T(\hat\theta_n)\left[C_{\hat\theta_n}\, I(P_{\hat\theta_n})^{-1}\, C^T_{\hat\theta_n}\right]^{-1} R(\hat\theta_n)$ with $R(\theta) = (R_1(\theta), \dots, R_q(\theta))^T$

Rao statistic: $R_n = \dfrac{1}{n}\,\psi^T(\hat\theta^0_n|X)\, D_{\hat\delta_n}\left[D^T_{\hat\delta_n}\, I(P_{\hat\theta^0_n})\, D_{\hat\delta_n}\right]^{-1} D^T_{\hat\delta_n}\, \psi(\hat\theta^0_n|X)$

Since the maximum likelihood estimator $\hat\theta^0_n$ under the constraints $R_1(\theta) = \dots = R_q(\theta) = 0$ is computed using the Lagrange multiplier technique, Rao's test is (in this setting) also called the Lagrange multiplier test.

Theorem 5.5.10
Under the assumptions of Theorem 3.5.9 as well as (R), (G), we have under $H_0$:
\[
-2\log\lambda_n \xrightarrow[n\to\infty]{L} \chi^2_q \quad \text{(Wilks' Theorem)}
\]
\[
W_n,\; R_n \xrightarrow[n\to\infty]{L} \chi^2_q
\]

Proof:
Serfling, Approximation Theorems of Mathematical Statistics, Theorem 4.4.4


Chapter 6

Empirical Processes and Kolmogorov-Smirnov Test

In this chapter we study convergence results for stochastic processes. The main goal is to derive a functional central limit theorem which can be used, for instance, to define non-parametric goodness-of-fit tests.

6.1 Definitions

Definition 6.1.1 (Stochastic process)
A stochastic process is a collection $\{X(t) : t \in I\}$ of random variables on $(\Omega, \mathcal{A}, P)$, where $I \subset \mathbb{R}$ is an interval. The function $t \mapsto X(t, \omega)$ for fixed $\omega \in \Omega$ is called a sample path.

Definition 6.1.2 (Wiener process)
A stochastic process $\{W(t) : t \ge 0\}$ is called Wiener process (or Brownian motion) if

(i) $W(0) = 0$ a.s.

(ii) $t \mapsto W(t)$ is a.s. continuous, i.e. $P(W(\cdot) \in C[0,\infty)) = 1$

(iii) For $0 \le t_0 < t_1 < \dots < t_n$, the increments $W(t_0), W(t_1) - W(t_0), \dots, W(t_n) - W(t_{n-1})$ are independent.

(iv) For $0 \le s < t$: $\mathcal{L}(W(t) - W(s)) = N(0, t-s)$.

Definition 6.1.3 (Gaussian process)
A stochastic process $\{X(t) : t \in I\}$ is called a Gaussian process if its finite-dimensional distributions are normal, i.e. if for any $t_1, \dots, t_n \in I$ it holds that
\[
\mathcal{L}(X(t_1), \dots, X(t_n)) = N_n(a, \Sigma).
\]

Example 6.1.4

(i) A Wiener process is a Gaussian process with
\[
E[W(s)] = 0 \quad \text{and} \quad \operatorname{cov}(W(s), W(t)) = \min(s, t).
\]
Furthermore, any Gaussian process with almost surely continuous sample paths and $E[W(s)] = 0$, $\operatorname{cov}(W(s), W(t)) = \min(s, t)$ is a Wiener process.


Figure 6.1: Three simulated sample paths of a Brownian motion (left) and a Brownian bridge (right).

(ii) A Gaussian process $\{B(t) : 0 \le t \le 1\}$ with almost surely continuous sample paths and $E[B(t)] = 0$, $\operatorname{cov}(B(t), B(s)) = \min(s, t) - st$ is called Brownian bridge.

If $W(\cdot)$ is a Wiener process, then $B(t) := W(t) - tW(1)$, $0 \le t \le 1$, is a Brownian bridge. It has the property that $B(0) = B(1) = 0$ and can be interpreted as a Wiener process conditioned on $W(1) = 0$.
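
The construction $B(t) = W(t) - tW(1)$ translates directly into a simulation, similar to the paths shown in Figure 6.1. The Python sketch below (grid size and random seed are arbitrary choices made only for illustration) approximates a Wiener path by a scaled Gaussian random walk and turns it into a Brownian bridge.

```python
import numpy as np

def wiener_path(n_steps, rng):
    """Approximate a Wiener process on [0, 1] by a scaled Gaussian random walk."""
    increments = rng.normal(0.0, np.sqrt(1.0 / n_steps), size=n_steps)
    W = np.concatenate([[0.0], np.cumsum(increments)])  # W(0) = 0
    return np.linspace(0.0, 1.0, n_steps + 1), W

rng = np.random.default_rng(0)
t, W = wiener_path(1000, rng)
B = W - t * W[-1]          # Brownian bridge: B(t) = W(t) - t W(1)
print(B[0], B[-1])         # both endpoints are exactly 0
```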

6.2 Weak convergence of stochastic processes

In this section, we will introduce a notion of convergence of a sequence $\{X_n(t) : t \in I\}$ of stochastic processes. This should generalise the convergence in law for random variables or vectors. Hence, we want
\[
(X_n(t_1), \dots, X_n(t_k)) \xrightarrow[n\to\infty]{L} (X(t_1), \dots, X(t_k))
\]
and
\[
H(X_n(\cdot)) \xrightarrow[n\to\infty]{L} H(X(\cdot))
\]
for all continuous functions $H$ with values in $\mathbb{R}$.

Definition 6.2.1 (Weak convergence)
Let $X_n, X$ be random variables with values in a metric space $(S, d)$. Then $X_n$ converges weakly to $X$ ($X_n \xrightarrow[n\to\infty]{w} X$) if
\[
E[H(X_n)] \xrightarrow[n\to\infty]{} E[H(X)]
\]
for all continuous and bounded functions $H: S \to \mathbb{R}$.


Remark 6.2.2
Let $S = \mathbb{R}^d$. Then $X_n \xrightarrow[n\to\infty]{w} X$ iff $X_n \xrightarrow[n\to\infty]{L} X$ (Portmanteau Theorem).

Lemma 6.2.3 (Continuous mapping theorem)
Let $X_n \xrightarrow[n\to\infty]{w} X$ and $H: S \to \mathbb{R}$ continuous. Then:
\[
H(X_n) \xrightarrow[n\to\infty]{w} H(X), \quad \text{i.e.} \quad H(X_n) \xrightarrow[n\to\infty]{L} H(X).
\]

Lemma 6.2.4
$X_n \xrightarrow[n\to\infty]{w} X$ iff $H(X_n) \xrightarrow[n\to\infty]{w} H(X)$ for all continuous $H: S \to \mathbb{R}$.

The condition given in the definition of weak convergence is hard to check in practice. For special choices of $S$ we will investigate alternative conditions which assure weak convergence.

Special case I: $S = C[0,1]$, $d(f,g) = \|f - g\|_\infty$

Theorem 6.2.5
Let $\{X_n(t) : 0 \le t \le 1\}$, $\{X(t) : 0 \le t \le 1\}$ be stochastic processes in $C[0,1]$, i.e. with a.s. continuous sample paths. Assume that

(i) $(X_n(t_1), \dots, X_n(t_k)) \xrightarrow[n\to\infty]{L} (X(t_1), \dots, X(t_k))$ for all $t_1, \dots, t_k \in [0,1]$,

(ii) $X_n(\cdot)$ is tight, i.e. for each $\varepsilon > 0$ there exists a compact set $K$ such that
\[
P(X_n \in K) > 1 - \varepsilon \quad \forall n.
\]
Then: $X_n(\cdot) \xrightarrow[n\to\infty]{w} X(\cdot)$ in $C[0,1]$.

Proof:
Billingsley, Convergence of Probability Measures, Theorem 8.1

Remark 6.2.6
One can show (Billingsley, Theorem 8.2) that tightness of $X_n(\cdot)$ is equivalent to

(i) $\forall \eta > 0$ there exists $\tau > 0$ such that $P(|X_n(0)| > \tau) \le \eta$ for all $n$.

(ii) $\forall \varepsilon, \eta > 0$ there exist $0 < \delta < 1$ and $n_0$ such that
\[
P\left(\sup_{|s-t| < \delta} |X_n(s) - X_n(t)| \ge \varepsilon\right) \le \eta \quad \forall n \ge n_0.
\]

Special case II: $S = D[0,1]$, the space of cadlag functions (right-continuous, and limits from the left exist), i.e.
\[
x_n \downarrow x \Rightarrow f(x_n) \to f(x), \qquad x_n \uparrow x \Rightarrow f(x_n) \to c =: f(x-).
\]


Definition 6.2.7 (Skorohod metric)
Define
\[
d_S(x(\cdot), y(\cdot)) = \inf_{\lambda \in \Lambda} \max\left(\sup_{0 \le t \le 1} |\lambda(t) - t|,\; \sup_{0 \le s \le 1} |x(s) - y(\lambda(s))|\right),
\]
where $\Lambda = \{\lambda : [0,1] \to [0,1],\ \lambda \text{ strictly increasing and continuous},\ \lambda(0) = 0,\ \lambda(1) = 1\}$. Then $d_S$ is a metric on $D[0,1]$ and called Skorohod metric.

Remark 6.2.8

(i) $d_S$ is a metric on $D[0,1]$.

(ii) Choosing $\lambda_0(t) = t$ one sees that $d_S(f,g) \le \|f - g\|_\infty$. Hence, convergence w.r.t. $\|\cdot\|_\infty$ implies convergence w.r.t. $d_S$.

(iii) If $f_n \in D[0,1]$ and $f$ is continuous, then
\[
d_S(f_n, f) \to 0 \quad \text{iff} \quad \|f_n - f\|_\infty \to 0.
\]

Theorem 6.2.9
Let $\{X_n(t) : 0 \le t \le 1\}$ and $\{X(t) : 0 \le t \le 1\}$ be stochastic processes in $D[0,1]$. Assume that $X(\cdot) \in C[0,1]$ and

(i) $(X_n(t_1), \dots, X_n(t_k)) \xrightarrow[n\to\infty]{L} (X(t_1), \dots, X(t_k))$ for all $t_1, \dots, t_k \in [0,1]$, $k \ge 1$,

(ii) $X_n(\cdot)$ is tight.

Then: $X_n(\cdot) \xrightarrow[n\to\infty]{w} X(\cdot)$ in $D[0,1]$.

As for the case of $C[0,1]$, there are characterisations of tightness in $D[0,1]$ which are easier to check in practice (see Billingsley, Theorems 15.2, 15.3, and 15.6).

6.3 The Functional Central Limit Theorem

Theorem 6.3.1 (FCLT or Donsker's Theorem)
Let $X_1, X_2, \dots$ be i.i.d. real random variables with $E[X_1] = 0$, $0 < \operatorname{var} X_1 = \sigma^2 < \infty$. Define $S_n(t) = \frac{1}{\sqrt{n\sigma^2}} S_{\lfloor nt \rfloor} \in D[0,1]$, where $S_0 = 0$, $S_k = \sum_{i=1}^k X_i$ and $\lfloor x \rfloor = \max\{k \in \mathbb{Z};\ k \le x\}$. Then
\[
\{S_n(t) : 0 \le t \le 1\} \xrightarrow[n\to\infty]{w} \{W(t) : 0 \le t \le 1\} \quad \text{in } D[0,1],
\]
where $W(\cdot)$ is a Wiener process.

Corollary 6.3.2
If $H: D[0,1] \to \mathbb{R}$ is continuous, then $H(S_n(\cdot)) \xrightarrow[n\to\infty]{L} H(W(\cdot))$.
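
Donsker's theorem is easy to visualise by simulation. The sketch below (sample size, the centred exponential increment distribution and the functional $H(x) = \sup_t x(t)$ are illustrative choices, not prescribed by the notes) builds the rescaled partial-sum path $S_n(\cdot)$ and applies a continuous functional to it, as in Corollary 6.3.2.

```python
import numpy as np

def rescaled_partial_sum_path(X, sigma2):
    """Return S_n(t) on the grid t = k/n, i.e. S_k / sqrt(n * sigma^2), k = 0, ..., n."""
    n = len(X)
    S = np.concatenate([[0.0], np.cumsum(X)])
    return S / np.sqrt(n * sigma2)

rng = np.random.default_rng(1)
n = 10_000
X = rng.exponential(1.0, size=n) - 1.0   # centred increments: E[X] = 0, var X = 1
path = rescaled_partial_sum_path(X, sigma2=1.0)

# H(x) = sup_t x(t) is continuous on D[0,1]; by Corollary 6.3.2 the value below is
# approximately distributed like sup_{0 <= t <= 1} W(t)
print("H(S_n) =", path.max())
```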


6.4 Goodness-of-fit tests

Model: $X_1, \dots, X_n$ i.i.d. with continuous distribution function $F(z) = P(X_j \le z)$.

Test: $H_0: F = F_0$ against $H_1: F \neq F_0$

Statistics:
Kolmogorov-Smirnov statistic: $D_n := \sup_{-\infty < x < \infty} |\hat{F}_n(x) - F_0(x)|$

Cramer-von-Mises statistic: $C_n = \int_{-\infty}^{\infty} |\hat{F}_n(x) - F_0(x)|^2\, dF_0(x)$

By Glivenko-Cantelli we have
\[
\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| \to 0 \quad \text{almost surely.}
\]
Hence, $D_n$ and $C_n$ should be small under $H_0$.

Lemma 6.4.1
Under $H_0$, $D_n$ is distribution-free, i.e. $\mathcal{L}(D_n)$ does not depend on $F_0$ for all continuous distribution functions $F_0$.

Proof:
For the sake of simplicity we assume that $F_0$ is strictly increasing. Then
\[
X_i \le z \iff F_0(X_i) \le F_0(z). \tag{6.1}
\]
Define random variables $Y_i = F_0(X_i)$, $i = 1, \dots, n$. Then the random variables $Y_1, \dots, Y_n$ are independent and
\[
P(Y_i \le y) = P(F_0(X_i) \le F_0(F_0^{-1}(y))) = P(X_i \le F_0^{-1}(y)) = F_0(F_0^{-1}(y)) = y.
\]
Hence, $Y_1, \dots, Y_n$ are i.i.d. with $\mathcal{L}(Y_i) = U[0,1]$. Let
\[
\hat{G}_n(t) = \frac{1}{n}\sum_{i=1}^n 1\!\!1_{[0,t]}(Y_i).
\]
By (6.1),
\[
\hat{F}_n(z) = \frac{1}{n}\sum_{i=1}^n 1\!\!1_{[0, F_0(z)]}(Y_i) = \hat{G}_n(F_0(z)),
\]
which implies
\[
D_n = \sup_z |\hat{F}_n(z) - F_0(z)| = \sup_z |\hat{G}_n(F_0(z)) - F_0(z)| = \sup_{0 \le t \le 1} |\hat{G}_n(t) - t|.
\]
For non-strictly increasing $F_0$ consider $F_0^+(x) := \min\{y : F_0(y) > x\}$. This has the property that $F_0^+(x) \le z$ if and only if $x \le F_0(z)$. Now use $F_0^+$ instead of $F_0^{-1}$ above.

Using this lemma, the critical values of the test statistic $D_n$ can be computed by Monte Carlo simulation for the case of a $U[0,1]$ distribution. However, we will derive the asymptotic distribution of $D_n$.
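
Since $D_n$ is distribution-free, its critical values can be approximated by simulating uniform samples only. The following Python sketch (the number of replications, the sample size and the level are arbitrary illustration values) estimates the $(1-\alpha)$-quantile of $D_n$.

```python
import numpy as np

def ks_statistic_uniform(u):
    """D_n = sup_t |G_n(t) - t| for a U[0,1] sample, evaluated at the jump points."""
    n = len(u)
    u = np.sort(u)
    j = np.arange(1, n + 1)
    return np.max(np.maximum(j / n - u, u - (j - 1) / n))

def mc_critical_value(n, alpha=0.05, reps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    stats = [ks_statistic_uniform(rng.uniform(size=n)) for _ in range(reps)]
    return np.quantile(stats, 1 - alpha)

# critical value c with P(D_n > c) ≈ alpha under H0, here for n = 20
print(mc_critical_value(n=20))
```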


Definition 6.4.2 (Empirical process)
The normalization $B_n(x) = \sqrt{n}(\hat{F}_n(x) - F(x))$ is called empirical process. It is a random variable in $D(\mathbb{R})$.

Lemma 6.4.3
Let $E_1, \dots, E_{n+1}$ be i.i.d. $\mathrm{Exp}(1)$-distributed and $R_k = \sum_{j=1}^k E_j$, $k = 1, \dots, n+1$. Then
\[
\mathcal{L}\left(U_{(1)}, \dots, U_{(n)}\right) = \mathcal{L}\left(\frac{R_1}{R_{n+1}}, \dots, \frac{R_n}{R_{n+1}}\right),
\]
where $U_{(1)}, \dots, U_{(n)}$ are the order statistics of $U_1, \dots, U_n$ i.i.d. with $\mathcal{L}(U_i) = U[0,1]$.

Proof:
Breiman, Probability, Prop. 13.15

Theorem 6.4.4
Let $X_1, X_2, \dots$ be i.i.d. with continuous distribution function $F_0$. Then
\[
\sqrt{n} D_n = \sqrt{n} \sup_{-\infty < x < \infty} |\hat{F}_n(x) - F_0(x)| \xrightarrow[n\to\infty]{L} \sup_{0 \le t \le 1} |B(t)|,
\]
where $\{B(t) : 0 \le t \le 1\}$ is a Brownian bridge.

Proof:
Let $U_1, \dots, U_n$ be i.i.d., $\mathcal{L}(U_i) = U[0,1]$, and let $\hat{G}_n$ be the corresponding empirical distribution function. In the proof of Lemma 6.4.1 we have seen that
\[
\sqrt{n} D_n \stackrel{L}{=} \sqrt{n} \sup_{0 \le t \le 1} |\hat{G}_n(t) - t|.
\]
The maximum distance of $\hat{G}_n(t)$ and $t$ occurs at the jump points of $\hat{G}_n$, which are located at $U_{(j)}$ with $\hat{G}_n(U_{(j)}) = \frac{j}{n}$. Hence,
\[
\begin{aligned}
\sqrt{n} D_n &\stackrel{L}{=} \sqrt{n} \sup_{1 \le j \le n} \max\left(\left|\frac{j}{n} - U_{(j)}\right|, \left|\frac{j-1}{n} - U_{(j)}\right|\right) \\
&\stackrel{L}{=} \sqrt{n} \sup_{1 \le j \le n} \left|\frac{j}{n} - U_{(j)}\right| + O\left(\frac{1}{\sqrt{n}}\right) \\
&\stackrel{L}{=} \sqrt{n} \sup_{1 \le j \le n} \left|\frac{j}{n} - \frac{R_j}{R_{n+1}}\right| + O\left(\frac{1}{\sqrt{n}}\right) \\
&\stackrel{L}{=} \frac{n}{R_{n+1}} \sup_{1 \le j \le n} \left|\frac{R_j - j}{\sqrt{n}} - \frac{j}{n}\,\frac{R_{n+1} - n}{\sqrt{n}}\right| + O\left(\frac{1}{\sqrt{n}}\right).
\end{aligned}
\]
$R_{n+1}$ is a sum of i.i.d. random variables with mean and variance 1. This implies $\frac{R_{n+1}}{n} \to 1$ by the law of large numbers. Furthermore, $R_j - j$ is a sum of i.i.d. random variables with mean 0 and variance 1. We set $S_j = R_j - j$ and
\[
X^{(n)}(t) = \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}}.
\]


Then
\[
\lim_{n\to\infty} \sqrt{n} D_n = \lim_{n\to\infty} \sup_{0 \le t \le 1} |X^{(n)}(t) - t X^{(n)}(1)|.
\]
By the FCLT, $X^{(n)} \to W$ where $W$ is a Wiener process. As a consequence,
\[
\sqrt{n} D_n \xrightarrow[n\to\infty]{L} \sup_{0 \le t \le 1} |B(t)|,
\]
where $B$ is a Brownian bridge.

Remark 6.4.5

(i) The distribution of $\sup_{0 \le t \le 1} |B(t)|$ is called Kolmogorov distribution. Its distribution function is given by
\[
F(x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2}.
\]

(ii) The null hypothesis of a Kolmogorov-Smirnov test is rejected if
\[
\sqrt{n} D_n > K_{1-\alpha},
\]
where $K_{1-\alpha}$ is the $(1-\alpha)$-quantile of the Kolmogorov distribution.

(iii) The KS-test is an example of a nonparametric test. In contrast to the $\chi^2$-test it can also be used for small sample sizes.

(iv) The KS-test does not make any assumptions on the distribution family forming the null hypothesis. Hence, it can be used for a wide range of applications. However, it has small power in general.

(v) The results for the distribution of the KS-statistic are no longer valid if the parameters of the distribution are estimated from the data. Hence, the KS-test does not yield correct results in that case. For normally distributed data, the Lilliefors test can be used instead.
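
The series in (i) converges very quickly, so the asymptotic critical value $K_{1-\alpha}$ can be computed directly. The sketch below (the truncation at 100 terms and the bisection tolerance are choices made for this illustration) evaluates the Kolmogorov distribution function, inverts it numerically, and thereby gives the decision rule from (ii).

```python
import numpy as np

def kolmogorov_cdf(x, terms=100):
    """F(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2), the Kolmogorov distribution."""
    if x <= 0:
        return 0.0
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))

def kolmogorov_quantile(p, lo=1e-3, hi=3.0, tol=1e-8):
    """Invert the Kolmogorov cdf by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

K = kolmogorov_quantile(0.95)   # K_{1-alpha} for alpha = 0.05, approximately 1.358
print("K_0.95 =", K)
# asymptotic test decision: reject H0 if sqrt(n) * D_n > K
```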

Analogously, we can define the Kolmogorov-Smirnov test for two-sample problems, i.e. for testing whether two samples have the same distribution.

Theorem 6.4.6
Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be i.i.d. with distribution functions $F$ and $G$, respectively (both continuous). Furthermore, let $\hat{F}_n(x)$, $\hat{G}_m(x)$ be the corresponding empirical distribution functions. Under the hypothesis $H_0: F = G$,
\[
\sqrt{\frac{mn}{m+n}}\, T_{m,n} := \sqrt{\frac{mn}{m+n}} \sup_{-\infty < x < \infty} |\hat{F}_n(x) - \hat{G}_m(x)| \xrightarrow[n\to\infty]{L} \sup_{0 \le t \le 1} |B(t)|.
\]

There are also other goodness-of-fit tests. One example is the Cramer-von-Mises test, which is based on the following result.


Theorem 6.4.7
Under the assumptions of Theorem 6.4.4, if $H_0$ holds we get for the Cramer-von-Mises statistic
\[
nC_n = n\int |\hat{F}_n(x) - F_0(x)|^2\, dF_0(x) \xrightarrow[n\to\infty]{L} \int_0^1 B(t)^2\, dt.
\]


Chapter 7

Statistical Functionals and Applications

7.1 Statistical functionals

In this section we consider the asymptotic behaviour of statistics which can be represented as functionals $T(\hat{F}_n)$ of the empirical distribution $\hat{F}_n$.

Literature: R. Serfling, Approximation Theorems of Mathematical Statistics, Chapter 6

Model: $X_1, \dots, X_n$ i.i.d. with distribution (function) $F$ on $\mathbb{R}$

Consider the empirical distribution function
\[
\hat{F}_n(x) = \frac{1}{n}\sum_{j=1}^n 1\!\!1_{(-\infty, x]}(X_j) = \frac{1}{n}\sum_{j=1}^n \Delta_{X_j}(x), \tag{7.1}
\]
where $\Delta_z(x)$ denotes the distribution function of a point mass in $z$:
\[
\Delta_z(x) = \begin{cases} 1 & x \ge z \\ 0 & x < z. \end{cases}
\]
We will study estimators or test statistics which may be written as
\[
T_n = T(\hat{F}_n),
\]
where $T: \mathcal{F} \to \mathbb{R}$ is a functional on $\mathcal{F}$, the set of all distribution functions on $\mathbb{R}$, i.e.
\[
\mathcal{F} = \{F : \mathbb{R} \to [0,1] :\ F \text{ increasing, right-continuous},\ F(x) \xrightarrow[x\to-\infty]{} 0,\ F(x) \xrightarrow[x\to\infty]{} 1\}.
\]

The Glivenko-Cantelli theorem implies $\sup_{x\in\mathbb{R}} |\hat{F}_n(x) - F(x)| \xrightarrow[n\to\infty]{p} 0$. Hence, for smooth $T$ we would expect
\[
T_n = T(\hat{F}_n) \xrightarrow[n\to\infty]{} T(F),
\]
i.e. $T_n$ is a consistent estimator of $T(F)$.


Example 7.1.1

(i) Sample moments
For some measurable $h$, we set
\[
T(F) = \int h(x)\, dF(x) = Eh(X_1), \qquad T(\hat{F}_n) = \frac{1}{n}\sum_{j=1}^n h(X_j).
\]
For $h(x) = x^k$, we get the $k$-th sample moment as an estimate of the $k$-th moment $EX_1^k$ (if it exists).

(ii) Sample variance: Let
\[
T(F) = \int \Big(x - \int z\, dF(z)\Big)^2 dF(x) = \operatorname{var} X_1, \qquad T(\hat{F}_n) = \frac{1}{n}\sum_{j=1}^n \Big(X_j - \frac{1}{n}\sum_{k=1}^n X_k\Big)^2.
\]

(iii) Parametric model $M$: $\mathcal{L}(X_j) \in \{P_\theta,\ \theta \in \Theta\}$
Let $F_\theta$ be the distribution function of $P_\theta$. If $F \neq F_\theta$ for all $\theta \in \Theta$, the model $M$ is misspecified. The estimator $\hat\theta_n$ calculated from a misspecified model using maximum likelihood is called quasi- or pseudo maximum likelihood (QML) estimator. It is a special kind of $M$-estimator.

The log likelihood of the sample under model $M$ is
\[
l(\theta|X_1, \dots, X_n) = \sum_{j=1}^n l(\theta|X_j) = n\int l(\theta|x)\, d\hat{F}_n(x).
\]
Hence, we get the QML estimate as
\[
\hat\theta_n = T(\hat{F}_n) = \arg\max_{\theta \in \Theta} \int l(\theta|x)\, d\hat{F}_n(x).
\]
Under appropriate assumptions, $\hat\theta_n \to \theta_0$ even if $M$ is misspecified, where
\[
\theta_0 = T(F) = \arg\max_{\theta \in \Theta} \int l(\theta|x)\, dF(x).
\]
$\theta_0$ is the parameter of that distribution in $M$ which is closest to $F = \mathcal{L}(X_j)$. More precisely,
\[
\theta_0 = \arg\min_{\theta \in \Theta} KL(F, F_\theta),
\]
where $KL(F, G)$ denotes the Kullback-Leibler distance between two distributions $F, G$ (here: having densities $f, g$):
\[
KL(F, G) = \int \log f(x)\, dF(x) - \int \log g(x)\, dF(x) = \int f(x)\log\frac{f(x)}{g(x)}\, dx
\]
(see the numerical QML sketch after this example).


(iv) Kolmogorov-Smirnov test statistic
\[
T(\hat{F}_n) = \sup_x |\hat{F}_n(x) - F_0(x)|, \qquad T(F) = \sup_x |F(x) - F_0(x)|
\]

(v) Cramer-von-Mises test statistic
\[
T(\hat{F}_n) = \int (\hat{F}_n(x) - F_0(x))^2\, dF_0(x), \qquad T(F) = \int (F(x) - F_0(x))^2\, dF_0(x)
\]

7.2 Asymptotics for statistical functionals

Asymptotic normality of maximum likelihood estimators was proven by linearisation, i.e. a Taylor expansion of order 1.

Idea: Define derivatives of functionals $T(F)$ to get something like a Taylor expansion:
\[
T(\hat{F}_n) - T(F) = \sum_{k=1}^m \frac{1}{k!} T^{(k)}(F; \hat{F}_n - F) + R_{m,n} = V_{m,n} + R_{m,n}.
\]
To prove asymptotic normality, consider $m = 1$:
\[
T(\hat{F}_n) - T(F) = V_{1,n} + R_{1,n} = T'(F; \hat{F}_n - F) + R_{1,n} = \frac{1}{n}\sum_{j=1}^n T'(F; \Delta_{X_j} - F) + R_{1,n}, \tag{7.2}
\]
as, by (7.1), $\hat{F}_n - F = \frac{1}{n}\sum_{j=1}^n (\Delta_{X_j} - F)$ and $T'(F; \cdot)$ will be linear. The first term in (7.2) will be asymptotically normal by the CLT. One only has to show $\sqrt{n} R_{1,n} \xrightarrow{p} 0$.

Definition 7.2.1 (Gateaux derivative)
Let $T: \mathcal{F} \to \mathbb{R}$ be a functional, $F, G \in \mathcal{F}$ (so that $(1-\lambda)F + \lambda G \in \mathcal{F}$, $0 \le \lambda \le 1$). If
\[
T'(F; G - F) = \lim_{\lambda \to 0+} \frac{T((1-\lambda)F + \lambda G) - T(F)}{\lambda}
\]
exists, it is called Gateaux derivative of $T$ at $F$ in direction $G - F$. If $T'(F; \cdot)$ exists for all $G \in \mathcal{F}$ and if it is a linear functional, then it is called the Gateaux derivative of $T$ and $T$ is called Gateaux differentiable.


For the calculation of the Gateaux derivative, we set $g(\lambda) = T((1-\lambda)F + \lambda G)$, $0 \le \lambda \le 1$; then $T'(F; G - F) = g'(0)$.

Example 7.2.2
Consider $T(F) = \int h(x)\, dF(x)$, which already is a linear functional. Then
\[
T'(F; G - F) = T(G) - T(F) = T(G - F),
\]
i.e. $T'(F; \cdot) = T(\cdot)$ for all $F$, if the definition of $T$ is extended to signed measures (i.e. to differences $G - F$ of distribution functions).

Definition 7.2.3 (Higher order derivatives)
The higher-order derivatives of $T$ are given by
\[
T^{(k)}(F; G - F) = \frac{d^k}{d\lambda^k} g(\lambda)\Big|_{\lambda=0}.
\]
If $g(\lambda)$ is regular enough, we get a Taylor expansion
\[
g(1) - g(0) = \sum_{k=1}^m \frac{1}{k!} g^{(k)}(0) + \frac{1}{(m+1)!} g^{(m+1)}(\lambda^*)
\]
for some $0 \le \lambda^* \le 1$. In terms of $T$:
\[
T(G) - T(F) = \sum_{k=1}^m \frac{1}{k!} T^{(k)}(F; G - F) + R_m(G).
\]
For $G = \hat{F}_n$, $R_{m,n} = R_m(\hat{F}_n)$; that is the expansion discussed at the beginning of this section.

It turns out that the asymptotic distribution of a differentiable statistical functional depends on which is the first nonvanishing term in the Taylor expansion of $T(F)$. If it is the linear term, the limit distribution is normal.

Theorem 7.2.4
Let $X_1, \dots, X_n$ be i.i.d. with distribution $F$ and let the functional $T: \mathcal{F} \to \mathbb{R}$ allow for a Taylor expansion of order 1:
\[
T(\hat{F}_n) - T(F) = V_{1,n} + R_{1,n},
\]
where $\sqrt{n} R_{1,n} \xrightarrow{p} 0$ and, for some measurable $h(F; \cdot)$,
\[
V_{1,n} = \frac{1}{n}\sum_{j=1}^n h(F; X_j).
\]
Then
\[
\sqrt{n}\,(T(\hat{F}_n) - T(F) - \mu(T; F)) \xrightarrow[n\to\infty]{L} N(0, \sigma^2(T; F))
\]
with
\[
\mu(T; F) = E[h(F; X_1)] = \int h(F; x)\, dF(x),
\]
\[
0 < \sigma^2(T; F) = \operatorname{var} h(F; X_1) = \int (h(F; x) - \mu(T; F))^2\, dF(x) < \infty.
\]


Proof:
The random variables $h(F; X_i)$, $i = 1, \dots, n$, are i.i.d. with mean $\mu(T; F)$ and variance $\sigma^2(T; F)$. The Central Limit Theorem for i.i.d. data yields
\[
\sqrt{n}\,\frac{V_{1,n} - \mu(T; F)}{\sigma(T; F)} \xrightarrow[n\to\infty]{L} N(0, 1).
\]
The assertion follows using the assumption on $R_{1,n}$ and Slutsky's lemma.

Remark 7.2.5
If $T$ is Gateaux differentiable, choose (compare (7.1))
\[
h(F; x) = T'(F; \Delta_x - F).
\]

Example 7.2.6

(i) Sample moments
The $k$-th central moment of $\mathcal{L}(X_i)$ is
\[
m_k = E(X_1 - EX_1)^k = \int \Big(x - \int z\, dF(z)\Big)^k dF(x) = T(F).
\]
The corresponding sample moment is
\[
\hat{m}_k = T(\hat{F}_n) = \frac{1}{n}\sum_{j=1}^n (X_j - \bar{X}_n)^k.
\]
Set $\mu_F = \int x\, dF(x) = EX_1$ and $F_\lambda = F + \lambda(G - F)$. Then
\[
\mu_{F_\lambda} = \mu_F + \lambda(\mu_G - \mu_F),
\]
\[
T(F_\lambda) = \int (x - \mu_{F_\lambda})^k dF(x) + \lambda \int (x - \mu_{F_\lambda})^k d(G - F)(x).
\]
Hence,
\[
\begin{aligned}
\frac{d}{d\lambda} T(F_\lambda) &= \int \frac{d}{d\lambda}(x - \mu_{F_\lambda})^k dF(x) + \int (x - \mu_{F_\lambda})^k d(G - F)(x) + \lambda \int \frac{d}{d\lambda}(x - \mu_{F_\lambda})^k d(G - F)(x) \\
&= \int \frac{d}{d\lambda}(x - \mu_F - \lambda(\mu_G - \mu_F))^k dF_\lambda(x) + \int (x - \mu_{F_\lambda})^k d(G - F)(x).
\end{aligned}
\]
For $\lambda = 0$, we get
\[
\begin{aligned}
T'(F; G - F) &= -k(\mu_G - \mu_F)\int (x - \mu_F)^{k-1} dF(x) + \int (x - \mu_F)^k d(G - F)(x) \\
&= \int \left[(x - \mu_F)^k - k m_{k-1} x\right] d(G - F)(x) \\
&= \int \left[(x - \mu_F)^k - k m_{k-1} x\right] dG(x) - (m_k - k m_{k-1}\mu_F).
\end{aligned}
\]


Setting $h(F; x) = (x - \mu_F)^k - k m_{k-1} x - (m_k - k m_{k-1}\mu_F)$ and using $\mu(T; F) = 0$, by Theorem 7.2.4
\[
\sqrt{n}(\hat{m}_k - m_k) \xrightarrow[n\to\infty]{L} N(0, \sigma^2(T; F))
\]
with
\[
\sigma^2(T; F) = m_{2k} - m_k^2 - 2k\, m_{k+1} m_{k-1} + k^2 m_{k-1}^2 m_2.
\]
It remains to check that the remainder term satisfies $\sqrt{n} R_{1,n} \xrightarrow{p} 0$.

(ii) Sample quantiles
If $F$ is strictly increasing and continuous, then $q_\alpha = F^{-1}(\alpha) = T(F)$. The sample quantile is given by $\hat{q}_{\alpha,n} = T(\hat{F}_n)$, where the generalised inverse $\hat{F}_n^{-1}$ has to be defined appropriately. Using $\lceil u \rceil = \min\{k \in \mathbb{Z};\ k \ge u\}$ for rounding upwards, we get
\[
\hat{q}_{\alpha,n} = X_{(\lceil \alpha n \rceil)},
\]
where $X_{(1)} \le \dots \le X_{(n)}$ denote the order statistics. For fixed $x$,
\[
T'(F; \Delta_x - F) = \frac{\alpha - 1\!\!1_{[x,\infty)}(q_\alpha)}{f(q_\alpha)} = h(F; x)
\]
if $F$ has density $F' = f$. Therefore, using (7.1) and
\[
\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n 1\!\!1_{(-\infty, x]}(X_i) = \frac{1}{n}\sum_{i=1}^n 1\!\!1_{[X_i, \infty)}(x),
\]
we get
\[
T'(F; \hat{F}_n - F) = \frac{\alpha - \hat{F}_n(q_\alpha)}{f(q_\alpha)}.
\]
As $\mu(T; F) = 0$ and $\sigma^2(T; F) = \alpha(1-\alpha)/f^2(q_\alpha)$, by Theorem 7.2.4
\[
\sqrt{n}(\hat{q}_{\alpha,n} - q_\alpha) \xrightarrow[n\to\infty]{L} N\left(0, \frac{\alpha(1-\alpha)}{f^2(q_\alpha)}\right)
\]
(see the simulation sketch after this example). Showing $\sqrt{n} R_{1,n} \xrightarrow{p} 0$ requires some work, but it follows from the Bahadur representation of sample quantiles.

Theorem 7.2.7 (Bahadur, 1966)
If $F$ is twice continuously differentiable at $q_\alpha$ with $F'(q_\alpha) = f(q_\alpha) > 0$, then
\[
\hat{q}_{\alpha,n} = q_\alpha + \frac{\alpha - \hat{F}_n(q_\alpha)}{f(q_\alpha)} + R_{1,n}
\]
with $R_{1,n} = O\left(\left(\frac{\log n}{n}\right)^{3/4}\right)$ a.s.

(iii) $\alpha$-trimmed mean
Assume that $\mathcal{L}(X_j)$ is symmetric around $\mu = \int x\, dF(x)$ with density $f(x)$.
\[
\bar{X}^\alpha_n = \frac{1}{n - 2[\alpha n]} \sum_{j=[\alpha n]+1}^{n-[\alpha n]} X_{(j)} = T(\hat{F}_n),
\]


i.e. a fraction $\alpha$ of the sample at the left and right hand side is eliminated before averaging.
\[
T(F) = \frac{1}{1-2\alpha} \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\, dF(x) = \frac{1}{1-2\alpha} \int_\alpha^{1-\alpha} F^{-1}(u)\, du.
\]
For fixed $x$,
\[
h(F; x) = T'(F; \Delta_x - F) = \frac{1}{1-2\alpha}
\begin{cases}
F^{-1}(\alpha) - \mu & x < F^{-1}(\alpha) \\
x - \mu & F^{-1}(\alpha) \le x \le F^{-1}(1-\alpha) \\
F^{-1}(1-\alpha) - \mu & x > F^{-1}(1-\alpha),
\end{cases}
\]
$\mu(T; F) = 0$ and
\[
\sigma^2(T; F) = \frac{1}{(1-2\alpha)^2}\left[\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} (x - \mu)^2\, dF(x) + 2\alpha(F^{-1}(\alpha) - \mu)^2\right].
\]
We get from Theorem 7.2.4:
\[
\sqrt{n}(\bar{X}^\alpha_n - \mu) \xrightarrow[n\to\infty]{L} N(0, \sigma^2(T; F)).
\]
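
The asymptotic variance $\alpha(1-\alpha)/f^2(q_\alpha)$ from part (ii) can be checked by simulation. The sketch below (standard normal data, $\alpha = \tfrac{1}{2}$, and the chosen sample sizes are assumptions made for illustration only) compares the empirical variance of $\sqrt{n}(\hat q_{\alpha,n} - q_\alpha)$ for the sample median with the theoretical value $\tfrac{1}{4}/\varphi(0)^2 = \pi/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 5000
alpha = 0.5                      # sample median of standard normal data, q_alpha = 0

stats = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    q_hat = np.sort(x)[int(np.ceil(alpha * n)) - 1]   # X_(ceil(alpha n))
    stats[r] = np.sqrt(n) * (q_hat - 0.0)

theoretical = alpha * (1 - alpha) * 2 * np.pi          # alpha(1-alpha)/f(0)^2 = pi/2
print(stats.var(), theoretical)
```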

7.3 Robustness

Literature: P. Huber, Robust Statistics

For the i.i.d. data $X_1, \dots, X_n$ with $\mathcal{L}(X_j) = F$ consider the parametric model
\[
M: \mathcal{L}(X_j) \in \mathcal{P}_\Theta = \{P_\theta,\ \theta \in \Theta\}.
\]
However, $M$ only approximates the truth and is not necessarily perfectly correct. We have seen that a maximum likelihood estimator for $\theta$ can even be computed if $\mathcal{L}(X_j) \notin \mathcal{P}_\Theta$. Using the theory of statistical functionals, it can be shown to be asymptotically normal. However, it is no longer asymptotically efficient, as the variance of the limiting distribution will differ from that in the correctly specified case.

Robust statistics deals with the effects of misspecification. An estimator or test is called robust if it works reasonably well even if the model is not perfectly correct. More precisely, we want the law of $T_n$ under the true distribution $F$ to be similar to the law of $T_n$ under the disturbed distribution $G$.

Definition 7.3.1 (Gross-error model)
Let $\varepsilon > 0$. Assume that $\mathcal{L}(X_i)$ is given by
\[
F = (1 - \varepsilon)F_\theta + \varepsilon G \quad \text{for some } \theta \in \Theta,\ G \in \mathcal{F},
\]
i.e. a fraction $\varepsilon$ of the data are arbitrary "outliers" or "bad data". To measure how much $F$ differs from $F_\theta$ we need a metric on distribution functions, e.g. the Kolmogorov distance
\[
d_K(F, G) = \sup_x |F(x) - G(x)|
\]


or the Levy distance
\[
d_L(F, G) = \inf\{\varepsilon > 0 : F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon\ \forall x\}.
\]
For probability measures $P, Q$ on $\mathbb{R}^d$, we consider the Prohorov distance
\[
d_P(P, Q) = \inf\{\varepsilon > 0 : P(A) \le Q(A^\varepsilon) + \varepsilon\ \forall A \in \mathcal{B}\},
\]
where $A^\varepsilon = \{x : \inf_{y \in A} \|x - y\| \le \varepsilon\}$.

Definition 7.3.2 (Qualitative robustness)
A sequence $T_n$ of estimators or test statistics is called qualitatively robust at $F = F_0$ if the sequence of maps
\[
F \mapsto \mathcal{L}_F(T_n), \quad n \ge 1,
\]
is equicontinuous at $F_0$, i.e. if for all $\varepsilon > 0$ there are $\delta > 0$ and $N_0$ such that for all $F$ and $n \ge N_0$
\[
d_L(F_0, F) \le \delta \implies d_P(\mathcal{L}_{F_0}(T_n), \mathcal{L}_F(T_n)) \le \varepsilon.
\]

Remark 7.3.3
A result of Hampel states that for statistics of the form $T_n = T(\hat{F}_n)$, qualitative robustness is essentially equivalent to continuity of $T$.

Let $X_1, \dots, X_n$ be i.i.d. with a distribution from a parametric model $M_\Theta$. A good robust statistic should

(i) have a good (not necessarily optimal) efficiency if $M_\Theta$ really holds,

(ii) have a performance which changes only slightly when small deviations from the model occur (few outliers or many tiny errors, e.g. by rounding),

(iii) not lead to completely wrong conclusions in case of large deviations from the model.

These issues require some quantitative measures:

(i) the asymptotic relative efficiency (ARE) w.r.t. the asymptotically optimal statistic, e.g. the maximum likelihood estimator, under model $M_\Theta$,

(ii) the influence curve,

(iii) the breakdown point.

Definition 7.3.4 (Influence curve and gross error sensitivity)

(i) If $T$ is Gateaux differentiable at $F$ in direction $\Delta_x - F$ for all $x$, then
\[
IC(x; T, F) = T'(F; \Delta_x - F) = h(F; x), \quad x \in \mathbb{R},
\]
is the influence curve of the estimator $T_n = T(\hat{F}_n)$ of $T(F)$.

(ii) $\gamma^* = \sup_x |IC(x; T, F)|$ is called gross error sensitivity.


(iii) $\lambda^* = \sup_{x \neq y} \left|\dfrac{IC(x; T, F) - IC(y; T, F)}{x - y}\right|$ is called local shift sensitivity.

Remark 7.3.5
The influence curve can be motivated as follows: let $T_n = T(\hat{F}_n)$ for a sample $X_1, \dots, X_n$ of "good" data. Now add an observation $X_{n+1} = x$. How strongly can this observation influence the estimate if it is particularly bad? Consider
\[
T_{n+1} - T_n = T\left(\frac{n}{n+1}\hat{F}_n + \frac{1}{n+1}\Delta_x\right) - T(\hat{F}_n).
\]
For $n \to \infty$ we have $\hat{F}_n \to F$ and
\[
(n+1)\left(T(\hat{F}_{n+1}) - T(\hat{F}_n)\right) = \frac{T\left(\hat{F}_n + \frac{1}{n+1}(\Delta_x - \hat{F}_n)\right) - T(\hat{F}_n)}{\frac{1}{n+1}} \to T'(F; \Delta_x - F).
\]
$\gamma^*$ measures the influence of outliers, $\lambda^*$ measures the influence of small changes, e.g. due to rounding. If $\gamma^* = \infty$, a single outlier in a very large sample can change the value of the estimate considerably. Robust estimates should have $\gamma^* < \infty$ (bounded influence).

Example 7.3.6
Let $F$ be symmetric around $\mu = EX_j$.

(i) Sample mean: $T(F) = \int y\, dF(y) = \mu$, $T_n = T(\hat{F}_n) = \bar{X}_n$. Due to the linearity of $T$, we have
\[
IC(x; T, F) = T'(F; \Delta_x - F) = T(\Delta_x) - T(F) = x - \mu.
\]
The influence curve is unbounded, i.e. $\bar{X}_n$ is not robust.

(ii) $\alpha$-trimmed mean: $T_n = \bar{X}^\alpha_n$. By Example 7.2.6, we have $T(F) = \frac{1}{1-2\alpha}\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\, dF(x)$ and
\[
IC(x; T, F) = \frac{1}{1-2\alpha}
\begin{cases}
F^{-1}(\alpha) - \mu & x < F^{-1}(\alpha) \\
x - \mu & F^{-1}(\alpha) \le x \le F^{-1}(1-\alpha) \\
F^{-1}(1-\alpha) - \mu & x > F^{-1}(1-\alpha).
\end{cases}
\]
By the symmetry of $F$, we have $|F^{-1}(\alpha) - \mu| = |F^{-1}(1-\alpha) - \mu|$, i.e. $\gamma^* = \frac{1}{1-2\alpha}|F^{-1}(1-\alpha) - \mu| < \infty$.

For a quantitative version of requirement (iii), we assume that the distribution $F$ of the data lies in some neighbourhood $\mathcal{F}_\varepsilon$ of the model distribution $F_0$, e.g. the

Levy neighbourhood $\mathcal{F}_\varepsilon = \{F : d_L(F, F_0) < \varepsilon\}$, or the
gross error neighbourhood $\mathcal{F}_\varepsilon = \{F : F = (1-\varepsilon)F_0 + \varepsilon G,\ G \in \mathcal{F}\}$.


We consider the maximum bias
\[
b(\varepsilon) = \sup_{F \in \mathcal{F}_\varepsilon} |T(F) - T(F_0)|.
\]
If $\varepsilon = 1$ then $\mathcal{F}_1 = \mathcal{F}$, so the model has nothing to do with the real data. Hence, $b(1)$ is the worst case bias. How large may $\varepsilon$ be before this breakdown happens?

Definition 7.3.7 (Breakdown point)
The (asymptotic) breakdown point $\varepsilon^*$ of $T$ at $F_0$ is
\[
\varepsilon^* = \sup\{\varepsilon;\ b(\varepsilon) < b(1)\}.
\]

Remark 7.3.8

(i) $\varepsilon^*$ is the limit of the finite-sample breakdown point $\varepsilon^*_n$ given in the following manner. Let $X_1, \dots, X_n$ be i.i.d. with $\mathcal{L}(X_j) = F_0 \in \mathcal{F}_\Theta$ and $\mu = T(F_0)$ the location parameter of interest. Add $k$ data at arbitrary locations $z_1, \dots, z_k$ to get an extended sample
\[
X^*_1, \dots, X^*_{n+k} = X_1, \dots, X_n, z_1, \dots, z_k.
\]
The corresponding empirical distribution function is given as
\[
\hat{F}^*_{n+k} = \frac{n}{n+k}\hat{F}_n + \frac{k}{n+k}\Delta^{(k)}, \qquad \Delta^{(k)} = \frac{1}{k}\sum_{i=1}^k 1\!\!1_{[z_i, \infty)}.
\]
Let $\tau_k = \sup_{z_1, \dots, z_k} |T(\hat{F}^*_{n+k}) - \mu|$ and $k^* = \sup\{k \ge 1;\ \tau_k < \infty\}$. Set $\varepsilon^*_n = \frac{k^*}{n + k^*}$.
As $\hat{F}_n \to F_0$, we have $\varepsilon^*_n \to \varepsilon^*$ for $n \to \infty$.

(ii) $\varepsilon^*$ is the largest fraction of outliers which an estimate can handle before giving completely false results.

Example 7.3.9

(i) sample mean $\bar{X}_n$: $\varepsilon^* = 0$

(ii) $\alpha$-trimmed mean $\bar{X}^\alpha_n$: $\varepsilon^* = \alpha$

(iii) sample median $\tilde{X}_n$: $\varepsilon^* = \frac{1}{2}$

For translation invariant estimates of the mean, $\varepsilon^* = \frac{1}{2}$ is the optimal value. Intuitively: if more than 50 % of the data are arbitrarily "bad", it is no longer possible to estimate the mean of the "good" data.
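
These breakdown points can be seen empirically by contaminating a clean sample with ever larger outliers. In the sketch below (sample size, contamination fraction and outlier magnitudes are illustrative assumptions), the mean is driven arbitrarily far away, while the median and the $\alpha$-trimmed mean stay close to the true location as long as the contamination fraction is below their breakdown point.

```python
import numpy as np
from scipy.stats import trim_mean   # trim_mean(x, alpha) removes a fraction alpha on each side

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=100)      # "good" data, true location 0

for magnitude in [10.0, 1e3, 1e6]:
    contaminated = np.concatenate([x, np.full(10, magnitude)])   # ~9% gross outliers
    print(f"outlier size {magnitude:>9.0e}: "
          f"mean {contaminated.mean():>12.2f}  "
          f"median {np.median(contaminated):6.3f}  "
          f"10%-trimmed mean {trim_mean(contaminated, 0.1):6.3f}")
```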


Chapter 8

Bootstrap

In the previous chapters we have frequently been confronted with the problem that the distribution of interesting statistics was not known exactly. So far, we used asymptotic methods to solve this problem. If the available sample is very small, this can be problematic.

Idea: If a number of independent realisations of the statistic (e.g. obtained from independent samples) were available, the distribution of the statistic could be approximated by the empirical distribution of the sample of statistics.

How do we get such independent realisations? We use resampling techniques, i.e. we generate pseudo samples $X^*_1, \dots, X^*_n$ from the original sample such that their distribution is similar to that of $X_1, \dots, X_n$, i.e.
\[
\mathcal{L}(X^*_1, \dots, X^*_n | X_1, \dots, X_n) \approx \mathcal{L}(X_1, \dots, X_n).
\]

Convention: We use an upper $*$ to denote characteristics which are defined conditional on the original sample $X_1, \dots, X_n$, e.g.
\[
\mathcal{L}^*(X^*_1, \dots, X^*_n) = \mathcal{L}(X^*_1, \dots, X^*_n | X_1, \dots, X_n).
\]

Method 1: Parametric bootstrap

• Assume that $\mathcal{L}(X_i) = P_\theta$.

• Compute an estimate $\hat\theta$ of $\theta$ using $X_1, \dots, X_n$.

• Generate $X^*_1, \dots, X^*_n$ by sampling from the distribution $P_{\hat\theta}$.

Method 2: Non-parametric bootstrap
Resample from the values observed in the sample $X_1, \dots, X_n$. This technique will be investigated in more detail in the following.


8.1 The non-parametric bootstrap

Let $X_1, \dots, X_n$ be i.i.d. Assume that we are studying a statistic $T_n = T_n(X_1, \dots, X_n)$ whose distribution should be studied by a non-parametric bootstrap.

• Let $U_1, \dots, U_n$ be i.i.d. with $P(U_1 = k) = \frac{1}{n}$, $k = 1, \dots, n$, independent of $X$.

• Set $X^*_i = X_{U_i}$, $i = 1, \dots, n$.

• Evaluate the bootstrap statistic $T^*_n = T_n(X^*_1, \dots, X^*_n)$.

That means that the bootstrap samples are generated by drawing with replacement from the values observed in the original sample.

Lemma 8.1.1
Let $U_1$ be such that $P(U_1 = k) = \frac{1}{n}$, $k = 1, \dots, n$, independent of $X = (X_1, \dots, X_n)$, and let $X^*_1 = X_{U_1}$. Then
\[
P^*(X^*_1 \le x) = P(X^*_1 \le x | X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^n 1\!\!1_{\{X_i \le x\}} = \hat{F}_n(x).
\]

Remark 8.1.2

(i) Since $\hat{F}_n$ is a good estimator for $F$, this tells us that the distribution of $X^*_1, \dots, X^*_n$ should be close to the distribution of $X_1, \dots, X_n$. Hence, the distribution of $T_n$ can be approximated by the conditional distribution of $T^*_n$.

(ii) Bootstrapping consequently involves two approximation steps:

• $T_n$ is approximated by $T^*_n$. We need to prove that this is justified.

• The distribution of $T^*_n$ is approximated by the empirical distribution function of $T^*_{n,1}, \dots, T^*_{n,m}$. This is justified by the law of large numbers. In principle, the number $m$ of bootstrap samples can be very large if computation time allows.

In the following, we will discuss the statistic
\[
T_n = \sqrt{n}(\bar{X}_n - \mu).
\]
In general, the distribution of $T_n$ is not known. But by the Central Limit Theorem it can be approximated by $N(0, \sigma^2)$. Alternatively, we can approximate the distribution of $T_n$ using the bootstrap statistic
\[
T^*_n = \sqrt{n}(\bar{X}^*_n - \bar{X}_n).
\]
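
The resampling scheme above is only a few lines of code. The sketch below (the gamma data, the sample size and the number of bootstrap replications are illustrative assumptions) draws bootstrap replicates of $T^*_n = \sqrt{n}(\bar{X}^*_n - \bar{X}_n)$, whose empirical distribution approximates that of $T_n = \sqrt{n}(\bar{X}_n - \mu)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.gamma(shape=2.0, scale=1.0, size=n)      # original sample, mu = 2

def bootstrap_T(x, m, rng):
    """m replicates of T*_n = sqrt(n) (mean(X*) - mean(X)), resampling with replacement."""
    n = len(x)
    idx = rng.integers(0, n, size=(m, n))        # U_1, ..., U_n for each replicate
    return np.sqrt(n) * (x[idx].mean(axis=1) - x.mean())

T_star = bootstrap_T(x, m=5000, rng=rng)
# e.g. quantiles of T_n approximated by the bootstrap distribution
print(np.quantile(T_star, [0.025, 0.975]))
```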

Definition 8.1.3 (Mallows distance)
Let $F, G: \mathbb{R}^d \to \mathbb{R}$ be distribution functions having second moments. The Mallows distance of $F$ and $G$ is
\[
d_M(F, G) = \inf\{\|X - Y\|_2 : F_X = F,\ F_Y = G\}.
\]


Remark 8.1.4
Convergence of $F_n$ to $F$ in the Mallows distance is equivalent to
\[
F_n(x) \to F(x) \quad \text{and} \quad E[\|X_n\|^2] \to E[\|X\|^2],
\]
where $X_n$ and $X$ have distribution functions $F_n$ and $F$, respectively.

Theorem 8.1.5
Let $X_1, \dots, X_n$ be i.i.d. with $E[X_1] = \mu$, $E[X_1^2] < \infty$. Let $T_n = \sqrt{n}(\bar{X}_n - \mu)$ and $T^*_n = \sqrt{n}(\bar{X}^*_n - \bar{X}_n)$. Then
\[
d_M(T_n, T^*_n) = d_M(\mathcal{L}(T^*_n), \mathcal{L}(T_n)) \to 0 \quad \text{a.s.}
\]

Proof (Sketch):
Let $\mathcal{L}(Z) = N(0, \sigma^2)$. Then
\[
d_M(T^*_n, T_n) \le d_M(T^*_n, Z) + d_M(Z, T_n).
\]
We show that the right side converges to zero a.s. Since $T_n \xrightarrow{L} Z$ by the CLT and $E[T_n^2] = \sigma^2 = E[Z^2]$, Remark 8.1.4 shows $d_M(T_n, Z) \to 0$. For $T^*_n$ we have

(i)
\[
T^*_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n (X^*_i - \bar{X}_n) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y^*_{i,n}
\]
with $E^*[Y^*_{1,n}] = 0$ and $\operatorname{var}^* Y^*_{1,n} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2 \to \sigma^2$. Apply Lindeberg's CLT to show that $T^*_n \xrightarrow{L} Z$. This requires some work to prove the assumptions.

(ii)
\[
E^*[T^{*2}_n] = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2 \to \sigma^2 \quad \text{a.s.}
\]
So $d_M(T^*_n, Z) \to 0$ follows with Remark 8.1.4.

The next question is which of the two approximations (asymptotic normality or bootstrap) works better. In our example, we get the following result.

Theorem 8.1.6
Let $X_1, \dots, X_n$ be i.i.d. absolutely continuous with expectation $\mu$, variance $\sigma^2$ and $E[X_1^4] < \infty$. For all $x \in \mathbb{R}$ we have

(i)
\[
P\left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le x\right) - \Phi(x) = O\left(\frac{1}{\sqrt{n}}\right)
\]
and the rate can in general not be improved.

(ii)
\[
P\left(\frac{\sqrt{n}(\bar{X}^*_n - \bar{X}_n)}{\sqrt{\operatorname{var}^* X^*_1}} \le x\right) - P\left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le x\right) = O\left(\frac{1}{n}\right)
\]

This shows that the convergence of the bootstrap approximation to the true distribution is faster than that of the normal approximation.


8.2 Applications

Example 8.2.1 (Estimation of a standard error)
Consider an i.i.d. sample $x = (x_1, \dots, x_n)$ and an estimator $\hat\theta_n$ for a parameter of $\mathcal{L}(X_i)$. We want to estimate the standard error
\[
se(\hat\theta_n) = \sqrt{\operatorname{var}\hat\theta_n}.
\]

(i) Generate $B$ bootstrap samples $x^{*1}, \dots, x^{*B}$.

(ii) Estimate $\theta$ on each sample to obtain $\hat\theta^{*i}$, $i = 1, \dots, B$.

(iii) Estimate $se(\hat\theta_n)$ by
\[
\hat{se}_B = \left(\frac{1}{B-1}\sum_{i=1}^B (\hat\theta^{*i} - \bar\theta^*)^2\right)^{\frac{1}{2}},
\]
i.e. by the sample standard deviation of the sample $\hat\theta^{*1}, \dots, \hat\theta^{*B}$.
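
Steps (i)-(iii) translate directly into code. In the following Python sketch the estimator is the sample median and the data are exponential; both choices, the number of bootstrap samples $B$ and the seed are assumptions made only for illustration.

```python
import numpy as np

def bootstrap_se(x, estimator, B=1000, seed=0):
    """Bootstrap estimate of the standard error of estimator(x)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_star = np.array([estimator(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    return theta_star.std(ddof=1)   # sample standard deviation of theta*^1, ..., theta*^B

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=50)
print("bootstrap se of the sample median:", bootstrap_se(x, np.median))
```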

Example 8.2.2 (Bootstrap confidence interval)
We start by studying the structure of the confidence interval for the mean $\mu$ of a normal distribution with unknown variance $\sigma^2$ given in Example 4.4.5, namely
\[
I_\mu = \left[\bar{X}_n - t_{n-1, \frac{1+\gamma}{2}}\frac{s_n}{\sqrt{n}};\ \bar{X}_n + t_{n-1, \frac{1+\gamma}{2}}\frac{s_n}{\sqrt{n}}\right].
\]
Since $\hat\mu = \bar{X}_n$ is an estimator for $\mu$ with variance $\operatorname{var}\hat\mu = \frac{\sigma^2}{n}$, the boundaries of $I_\mu$ have the form
\[
\hat\mu \pm q_{\frac{1+\gamma}{2}}\,\hat{se}(\hat\mu),
\]
where $q_\alpha$ is the $\alpha$-quantile of the distribution of $\frac{\hat\mu - \mu}{\hat{se}(\hat\mu)}$.

We copy this structure: let $\hat\theta$ be an estimator for $\theta$ and $\hat{se}$ be an estimator for its standard error (e.g. obtained by the bootstrap procedure introduced above). We estimate the distribution of
\[
Z = \frac{\hat\theta - \theta}{\hat{se}}
\]
as follows:

(i) Generate $B$ bootstrap samples $x^{*1}, \dots, x^{*B}$.

(ii) Compute
\[
Z^{*i} = \frac{\hat\theta^{*i} - \hat\theta}{\hat{se}^{*i}},
\]
where $\hat{se}^{*i}$ is an estimate of the standard error of $\hat\theta^{*i}$.

(iii) Estimate $\hat{q}_{\frac{1-\gamma}{2}}$ and $\hat{q}_{1-\frac{1-\gamma}{2}}$ via
\[
\frac{\#\{Z^{*i} \le \hat{q}_\alpha\}}{B} = \alpha.
\]


(iv) The bootstrap confidence interval with level $\gamma$ is then given as
\[
I^*_\theta = \left[\hat\theta - \hat{q}_{\frac{1+\gamma}{2}}\,\hat{se},\ \hat\theta - \hat{q}_{\frac{1-\gamma}{2}}\,\hat{se}\right].
\]
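
A minimal sketch of this bootstrap-$t$ interval for the mean is given below; here $\hat{se}^{*i}$ is computed from the within-replicate standard deviation, and the data, $B$, the seed and the level are again illustrative assumptions.

```python
import numpy as np

def bootstrap_t_ci(x, gamma=0.95, B=2000, seed=0):
    """Bootstrap-t confidence interval for the mean of an i.i.d. sample x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_hat = x.mean()
    se_hat = x.std(ddof=1) / np.sqrt(n)

    z_star = np.empty(B)
    for b in range(B):
        xs = x[rng.integers(0, n, size=n)]
        se_star = xs.std(ddof=1) / np.sqrt(n)          # se*^i within the bootstrap sample
        z_star[b] = (xs.mean() - theta_hat) / se_star  # Z*^i = (theta*^i - theta_hat)/se*^i
    q_lo, q_hi = np.quantile(z_star, [(1 - gamma) / 2, (1 + gamma) / 2])
    return theta_hat - q_hi * se_hat, theta_hat - q_lo * se_hat

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=40)
print(bootstrap_t_ci(x))
```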

Example 8.2.3 (A cautionary example)
Let $X_1, \dots, X_n$ be i.i.d. with $\mathcal{L}(X_i) = U(0, \theta)$. Then $\hat\theta = X_{(n)}$ is the maximum likelihood estimator for $\theta$. With probability $(1 - \frac{1}{n})^n$, $X_{(n)}$ is not contained in the bootstrap sample. Consequently, $X_{(n)}$ is contained in the bootstrap sample with probability
\[
1 - \left(1 - \frac{1}{n}\right)^n \to 1 - e^{-1} \approx 0.63.
\]
Hence, $P(\hat\theta^* = \hat\theta) \approx 0.63$, i.e. the maximum likelihood estimator is reproduced in most samples, such that the bootstrap samples do not yield additional information.

On the contrary, for a parametric bootstrap, we have $\mathcal{L}(X^*_i) = U(0, \hat\theta)$ such that $P(\hat\theta^* = \hat\theta) = 0$.

Example 8.2.4 (Bootstrap test)
When constructing the acceptance region of a test by bootstrap, we face the following problem: we need to determine the distribution of the test statistic under the null hypothesis. However, we do not know whether the data follow the null hypothesis.

Consider a two-sample problem: $X_1, \dots, X_n$ i.i.d. with mean $\mu_1$ and variance $\sigma_1^2$ and, independently, $Y_1, \dots, Y_m$ i.i.d. with mean $\mu_2$ and variance $\sigma_2^2$. Test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$ using
\[
T_{n,m} = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{1}{n}\hat\sigma_1^2 + \frac{1}{m}\hat\sigma_2^2}}.
\]
We do not know whether the two means are the same; hence, we bootstrap from the centred samples:
\[
X^*_i = X_{U_{1,i}} - \bar{X}_n,\ i = 1, \dots, n, \qquad Y^*_j = Y_{U_{2,j}} - \bar{Y}_m,\ j = 1, \dots, m.
\]
The bootstrap samples follow $H_0$ as both samples have mean $0$ by construction. The acceptance region can then be computed via
\[
P^*(T^*_{m,n} > c^*_\alpha) \le \alpha,
\]
determined by the empirical distribution of
\[
T^*_{m,n} = \frac{\bar{X}^*_n - \bar{Y}^*_m}{\sqrt{\frac{1}{n}\hat\sigma_1^{2*} + \frac{1}{m}\hat\sigma_2^{2*}}}.
\]
This test is related to the two-sample t-test. However, neither normally distributed data nor $\sigma_1^2 = \sigma_2^2$ has to be assumed.
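
A minimal sketch of this bootstrap two-sample test follows. The data-generating distributions, sample sizes, number of replications and level are assumptions for illustration; the rejection rule compares $|T_{n,m}|$ with a bootstrap critical value because the alternative here is two-sided.

```python
import numpy as np

def two_sample_stat(x, y):
    """T_{n,m} = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m)."""
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def bootstrap_two_sample_test(x, y, alpha=0.05, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = two_sample_stat(x, y)
    xc, yc = x - x.mean(), y - y.mean()          # centred samples: both follow H0
    t_star = np.empty(B)
    for b in range(B):
        xs = xc[rng.integers(0, len(x), size=len(x))]
        ys = yc[rng.integers(0, len(y), size=len(y))]
        t_star[b] = two_sample_stat(xs, ys)
    c_alpha = np.quantile(np.abs(t_star), 1 - alpha)
    return abs(t_obs) > c_alpha                   # True = reject H0: mu_1 = mu_2

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=40)
y = rng.exponential(scale=1.0, size=55) - 0.5    # mean 0.5, different shape and variance
print("reject H0:", bootstrap_two_sample_test(x, y))
```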