
A First Course in Mathematical Statistics

MATH 472 Handout, Spring 04

Michael Nussbaum∗

May 3, 2004

∗Department of Mathematics, Malott Hall, Cornell University, Ithaca NY 14853-4201, e-mail [email protected], http://www.math.cornell.edu/~nussbaum


CONTENTS

0.2 Preface
0.3 References

1 Introduction
1.1 Hypothesis testing
1.2 What is statistics ?
1.3 Confidence intervals
1.3.1 The Law of Large Numbers
1.3.2 Confidence statements with the Chebyshev inequality

2 Estimation in parametric models
2.1 Basic concepts
2.2 Bayes estimators
2.3 Admissible estimators
2.4 Bayes estimators for Beta densities
2.5 Minimax estimators

3 Maximum likelihood estimators

4 Unbiased estimators
4.1 The Cramer-Rao information bound
4.2 Countably infinite sample space
4.3 The continuous case

5 Conditional and posterior distributions
5.1 The mixed discrete / continuous model
5.2 Bayesian inference
5.3 The Beta densities
5.4 Conditional densities in continuous models
5.5 Bayesian inference in the Gaussian location model
5.6 Bayes estimators (continuous case)
5.7 Minimax estimation of Gaussian location

6 The multivariate normal distribution

7 The Gaussian location-scale model
7.1 Confidence intervals
7.2 Chi-square and t-distributions
7.3 Some asymptotics

8 Testing Statistical Hypotheses
8.1 Introduction
8.2 Tests and confidence sets
8.3 The Neyman-Pearson Fundamental Lemma
8.4 Likelihood ratio tests

9 Chi-square tests
9.1 Introduction
9.2 The multivariate central limit theorem
9.3 Application to multinomials
9.4 Chi-square tests for goodness of fit
9.5 Tests with estimated parameters
9.6 Chi-square tests for independence

10 Regression
10.1 Regression towards the mean
10.2 Bivariate regression models
10.3 The general linear model
10.3.1 Special cases of the linear model
10.4 Least squares and maximum likelihood estimation
10.5 The Gauss-Markov Theorem

11 Linear hypotheses and the analysis of variance
11.1 Testing linear hypotheses
11.2 One-way layout ANOVA
11.3 Two-way layout ANOVA

12 Some nonparametric tests
12.1 The sign test
12.2 The Wilcoxon signed rank test

13 Exercises
13.1 Problem set H1
13.2 Problem set H2
13.3 Problem set H3
13.4 Problem set H4
13.5 Problem set H5
13.6 Problem set H6
13.7 Problem set H7
13.8 Problem set E1
13.9 Problem set H8
13.10 Problem set H9
13.11 Problem set H10
13.12 Problem set E2

14 Appendix: tools from probability, real analysis and linear algebra
14.1 The Cauchy-Schwartz inequality
14.2 The Lebesgue Dominated Convergence Theorem

0.2 Preface

"Spring. 4 credits. Prerequisite: MATH 471 and knowledge of linear algebra such as taught in MATH 221. Some knowledge of multivariate calculus helpful but not necessary. Classical and recently developed statistical procedures are discussed in a framework that emphasizes the basic principles of statistical inference and the rationale underlying the choice of these procedures in various settings. These settings include problems of estimation, hypothesis testing, large sample theory." (The Cornell Courses of Study 2000-2001.)
This course is a sequel to the introductory probability course MATH 471. These notes will be used as a basis for the course in combination with a textbook (to be found among the references given below).

0.3 References

[BD] Bickel, P. and Doksum, K., Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1 (2nd Edition), Prentice Hall, 2001.

[CB] Casella, G. and Berger, R., Statistical Inference, Duxbury Press, 1990.

[D] Durrett, R., The Essentials of Probability, Duxbury Press, 1994.

[DE] Devore, J., Probability and Statistics for Engineering and the Sciences, Duxbury - Brooks/Cole, 2000.

[FPP] Freedman, D., Pisani, R., and Purves, R., Statistics (3rd Edition), 1997.

[HC] Hogg, R. V. and Craig, A. T., Introduction to Mathematical Statistics (5th Edition), Prentice-Hall, 1995.

[HT] Hogg, R. V. and Tanis, E. A., Probability and Statistical Inference (6th Edition), Prentice-Hall, 2001.

[LM] Larsen, R. and Marx, M., An Introduction to Mathematical Statistics and its Applications, Prentice Hall, 2001.

[M] Moore, D., The Basic Practice of Statistics (2nd Edition), W. H. Freeman and Co., 2000.

[R] Rice, J., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.

[ROU] Roussas, G., A Course in Mathematical Statistics (2nd Edition), Academic Press, 1997.

[RS] Rohatgi, V. and Ehsanes Saleh, A. K., An Introduction to Probability and Statistics, John Wiley, 2001.

[SH] Shao, Jun, Mathematical Statistics, Springer Verlag, 1998.

[TD] Tamhane, A. and Dunlop, D., Statistics and Data Analysis: from Elementary to Intermediate, Prentice Hall, 2000.


Chapter 1
INTRODUCTION

1.1 Hypothesis testing

This course is a sequel to the introductory probability course MATH 471, the basis of which has been the book "The Essentials of Probability" by R. Durrett (quoted as [D] henceforth). Some statistical topics are already introduced there. We start by discussing some sections and examples (essentially reproducing, sometimes extending, the text of [D]).
Testing biasedness of a roulette wheel ([D] chap. 5.4, p. 244). Suppose we run a casino and we wonder if our roulette wheel is biased. A roulette wheel has 18 outcomes that are red, 18 that are black and 2 that are green. If we bet $1 on red then we win $1 with probability p = 18/38 = 0.4736 and lose $1 with probability 20/38 (see [D] p. 81 on gambling theory for roulette). To phrase the biasedness question in statistical terms, let p be the probability that red comes up and introduce two hypotheses:

H0 : p = 18/38 null hypothesis

H1 : p ≠ 18/38 alternative hypothesis

To test whether the null hypothesis is true, we spin the roulette wheel n times and let Xi = 1 if red comes up on the ith trial and 0 otherwise, so that X̄n is the fraction of times red comes up in the first n trials. The test is specified by giving a critical region Cn so that we reject H0 (that is, decide H0 is incorrect) when X̄n ∈ Cn. One possible choice in this case is

Cn = { x : |x − 18/38| > 2 √((18/38)·(20/38)) / √n }.

This choice is motivated by the fact that if H0 is true then, using the central limit theorem (ξ is a standard normal variable),

P( X̄n ∈ Cn ) = P( |X̄n − µ| / (σ/√n) ≥ 2 ) ≈ P(|ξ| ≥ 2) = 0.05.     (1.1)

Rejecting H0 when it is true is called a type I error. In this test we have set the type I error to be 5%. The basis for the approximation "≈" is the central limit theorem. Indeed the results Xi, i = 1, . . . , n of the n trials are independent random variables all having the same distribution (or probability law). This distribution is a Bernoulli law, where Xi takes only the values 0 and 1:

P (Xi = 0) = 1− p, P (Xi = 1) = p.


If µ is the expectation of this law then

µ = EX1 = p.

If σ² is the variance then

σ² = EX1² − (EX1)² = p − p² = p(1 − p).

The central limit theorem (CLT) gives

(X̄n − µ) / (σ/√n)  ⇒L  N(0, 1)     (1.2)

where N(0, 1) is the standard normal distribution and ⇒L signifies convergence in law (in distribution). Thus as n → ∞

P( |X̄n − µ| / (σ/√n) ≥ 2 ) → P(|ξ| ≥ 2).

So in fact our reasoning is based on a large sample approximation for n → ∞. The relation P(|ξ| ≥ 2) = 0.05 is then taken from a tabulation of the standard normal law N(0, 1).

Now √((18/38)·(20/38)) = 0.4993 ≈ 1/2, so to simplify the arithmetic the test can be formulated as

reject H0 if |X̄n − 18/38| > 1/√n,

or in terms of the total number of reds Sn = X1 + . . . + Xn,

reject H0 if |Sn − 18n/38| > √n.

Suppose now that we spin the wheel n = 3800 times and get red 1868 times. Is the wheel biased? We expect 18n/38 = 1800 reds, so the excess number of reds is |Sn − 1800| = 68. Given the large number of trials, this might not seem like a large excess. However √3800 = 61.6 and 68 > 61.6, so we reject H0 and think: "if H0 were correct then we would see an observation this far from the mean less than 5% of the time."
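This computation is easy to check numerically; the following is a small sketch in Python, using only the numbers quoted above.

```python
from math import sqrt

# Roulette example: reject H0: p = 18/38 if |S_n - 18n/38| > sqrt(n)
# (simplified critical value, approximate 5% type I error).
n = 3800                      # number of spins
S_n = 1868                    # observed number of reds
excess = abs(S_n - 18 * n / 38)
threshold = sqrt(n)
print(excess, round(threshold, 1), excess > threshold)   # 68.0 61.6 True -> reject H0
```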

Testing academic performance ([D] chap. 5.4, p. 244). Do married college students with children do less well because they have less time to study, or do they do better because they are more serious? The average GPA at the university is 2.48, so we might set up the following hypothesis test concerning µ, the mean grade point average of married students with children:

H0 : µ = 2.48 null hypothesis

H1 : µ ≠ 2.48 alternative hypothesis.

Suppose that to test this hypothesis we have records X1, . . . , Xn of 25 married college students with children. Their average GPA is X̄n = 2.35 and sample standard deviation σ̂n = 0.5. Recall that the standard deviation of a sample X1, . . . , Xn with sample mean X̄n is

σ̂n = √( (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)² ).


Using (1.1) from the last example we see that to have a test with a type I error of 5% we should reject H0 if

|X̄n − 2.48| > 2σ/√n.

The basis is again the central limit theorem: no particular assumption is made about the distribution of the Xi, they are just independent and identically distributed random variables (i.i.d. r.v.'s) with a finite variance σ² (standard deviation σ). We again have the CLT (1.2) and we can take µ = 2.48 to test our hypothesis. But contrary to the previous example, the value of σ is then still undetermined (in the previous example both µ and σ are given by p). Thus σ is unknown, but we can estimate it by the sample standard deviation σ̂n. Taking n = 25 we see that

2(0.5)/√25 = 0.2 > 0.13 = |X̄n − 2.48|

so we are not 95% certain that µ ≠ 2.48. Note the inconclusiveness of the outcome: the result of the test is the negative statement "we are not 95% certain that H0 is not true", but not that there is particularly strong evidence for H0. That nonsymmetric role of the two hypotheses is specific to the setup of statistical testing; this will be discussed in detail later.
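In code, the comparison is one line (a Python sketch with the numbers given above):

```python
from math import sqrt

# Academic performance example: H0: mu = 2.48, using the estimated standard deviation.
n, xbar, s = 25, 2.35, 0.5
margin = 2 * s / sqrt(n)                  # approximate 5% critical distance from 2.48
print(abs(xbar - 2.48), margin)           # 0.13 < 0.2 -> do not reject H0
```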

Testing the difference of two means ([D] p. 245). Suppose we have independent random samples of size n1 and n2 from two populations with unknown means µ1, µ2 and variances σ1², σ2², and we want to test

H0 : µ1 = µ2 null hypothesis

H1 : µ1 ≠ µ2 alternative hypothesis.

Now the CLT implies that

X̄1 ≈ N( µ1, σ1²/n1 ),    −X̄2 ≈ N( −µ2, σ2²/n2 ).

Indeed these are just reformulations of (1.2) using properties of the normal law:

L(ξ) = N(0, 1) if and only if L( (σ/√n)ξ + µ ) = N( µ, σ²/n ),

and L(−ξ) = L(ξ) = N(0, 1). Here we are using the standard notation L(ξ) for "probability law of ξ" (law of ξ, distribution of ξ). We have assumed that the two samples are independent, so if H0 is correct,

X̄1 − X̄2 ≈ N( 0, σ1²/n1 + σ2²/n2 ).

Based on the last result, if we want a test with a type I error of 5% then we should

reject H0 if |X̄1 − X̄2| > 2 √( σ1²/n1 + σ2²/n2 ).

For a concrete example we consider a study of passive smoking reported in the New England Journal of Medicine (cf. [D] p. 246). A measurement of the size S of lung airways called "FEF 25-75%" was taken for 200 female nonsmokers who were in a smoky environment and for 200 who were not.


In the first group the average value of S was 2.72 with a standard deviation of 0.71, while in the second group the average was 3.17 with a standard deviation of 0.74 (larger values are better). To see that there is a significant difference between the averages we note that

2 √( σ1²/n1 + σ2²/n2 ) = 2 √( (0.71)²/200 + (0.74)²/200 ) = 0.14503

while |X̄1 − X̄2| = 0.45. With these data, H0 is rejected, based on reasoning similar to the first example: "if H0 were true, then what we are seeing would be very improbable, i.e. would have probability not higher than 0.05." But again the reasoning is based on a normal approximation, i.e. a belief that a sample size of 200 is large enough.
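The arithmetic can be reproduced in a few lines (a Python sketch with the figures quoted above):

```python
from math import sqrt

# Passive smoking example: two-sample comparison with approximate 5% type I error.
n1 = n2 = 200
mean1, sd1 = 2.72, 0.71     # smoky environment
mean2, sd2 = 3.17, 0.74     # non-smoky environment
threshold = 2 * sqrt(sd1**2 / n1 + sd2**2 / n2)
print(abs(mean1 - mean2), round(threshold, 5))   # 0.45 > 0.14503 -> reject H0
```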

1.2 What is statistics ?

The Merriam-Webster Dictionary says:
"Main Entry: sta·tis·tics
Pronunciation: st&-'tis-tiks
Function: noun plural but singular or plural in construction
Etymology: German Statistik study of political facts and figures, from New Latin statisticus of politics, from Latin status state
Date: 1770
1 : a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data
2 : a collection of quantitative data"
In ENCYCLOPÆDIA BRITANNICA we find:
"Statistics: the science of collecting, analyzing, presenting, and interpreting data. Governmental needs for census data as well as information about a variety of economic activities provided much of the early impetus for the field of statistics. Currently the need to turn the large amounts of data available in many applied fields into useful information has stimulated both theoretical and practical developments in statistics. Data are the facts and figures that are collected, analyzed, and summarized for presentation and interpretation. Data may be classified as either quantitative or qualitative. Quantitative data measure either how much or how many of something, and qualitative data provide labels, or names, for categories of like items." .......... "Sample survey methods are used to collect data from observational studies, and experimental design methods are used to collect data from experimental studies. The area of descriptive statistics is concerned primarily with methods of presenting and interpreting data using graphs, tables, and numerical summaries. Whenever statisticians use data from a sample—i.e., a subset of the population—to make statements about a population, they are performing statistical inference. Estimation and hypothesis testing are procedures used to make statistical inferences. Fields such as health care, biology, chemistry, physics, education, engineering, business, and economics make extensive use of statistical inference. Methods of probability were developed initially for the analysis of gambling games. Probability plays a key role in statistical inference; it is used to provide measures of the quality and precision of the inferences. Many of the methods of statistical inference are described in this article. Some of these methods are used primarily for single-variable studies, while others, such as regression and correlation analysis, are used to make inferences about relationships among two or more variables."
The subject of this course is statistical inference. Let us again quote Merriam-Webster:
"Main Entry: in·fer·ence
Pronunciation: 'in-f(&-)r&n(t)s, -f&rn(t)s


Function: noun
Date: 1594
1 : the act or process of inferring: as a : the act of passing from one proposition, statement, or judgment considered as true to another whose truth is believed to follow from that of the former b : the act of passing from statistical sample data to generalizations (as of the value of population parameters) usually with calculated degrees of certainty."

1.3 Confidence intervals

Suppose a country votes for president, and there are two candidates, A and B. An opinion poll institute wants to predict the outcome, by sampling a limited number of voters. We assume that 10 days ahead of the election, all voters have formed an opinion, no one intends to abstain, and all voters are willing to answer the opinion poll if asked (these assumptions are not realistic, but are made here in order to explain the principle). The proportion intending to vote for A is p, where 0 < p < 1, so if a voter is picked at random and asked, the probability that he favors A is p. The proportion p is unknown; if p > 1/2 then A wins the election.
The institute samples n voters, and assigns value 1 to a variable Xi if the vote intention is A, 0 otherwise (i = 1, . . . , n). The institute selects the sample in a random fashion, throughout the voter population, so that the Xi can be assumed to be independent Bernoulli B(1, p) random variables. (Again in practice the choice is not entirely random, but follows some elaborate scheme in order to capture different parts of the population, but we disregard this aspect.) Recall that p is unknown; an estimate of p is required to form the basis of a prediction (< 1/2 or > 1/2 ?). In the theory of statistical inference a common notation for estimates based on a sample of size n is p̂n. Suppose that the institute decides to use the sample mean

X̄n = n⁻¹ ∑_{i=1}^n Xi

as an estimate of p, so p̂n = X̄n.

1.3.1 The Law of Large Numbers

We have for X̄n = p̂n and any ε > 0

P(|p̂n − p| > ε) → 0 as n → ∞.     (1.3)

In words, for any small fixed number ε the probability that p̂n is outside the interval (p − ε, p + ε) can be made arbitrarily small by selecting n sufficiently large. If the institute samples enough voters, it can believe that its estimate p̂n is close enough to the true value. In statistics, estimates which converge to the true value in the above probability sense are called consistent estimates (recall that the convergence type (1.3) is called convergence in probability). As a basic requirement the institute needs a good estimate p̂n to base its prediction upon. The LLN tells the institute that it actually pays to get more opinions.
Suppose the institute has sampled a large number of voters, and the estimate p̂n turns out to be > 1/2, but near it. It is natural to proceed with caution in this case, as the reputation of the institute depends on the reliability of its published results. Results which are deemed unreliable will not be published. A controversy might arise within the institute:
Researcher a: 'We spent a large amount of money on this poll and we have a really large n. So let us go ahead and publish the result that A will win.'


Researcher b: 'This result is too close to the critical value. I do not claim that B will win, but I favor not publishing a prediction.'
Clearly a sound and rational criterion is needed for a decision whether to publish or not. A method for this should be fixed in advance.

1.3.2 Confidence statements with the Chebyshev inequality

Recall Chebyshev's inequality ([D] chap. 5.1, p. 222). If Y is a random variable with finite variance Var(Y) and y > 0 then

P(|Y − EY| ≥ y) ≤ Var(Y)/y².

Applying this for Y = X̄n = p̂n, we obtain

P(|p̂n − p| > ε) ≤ Var(X1)/(nε²) = p(1 − p)/(nε²)     (1.4)

for any ε > 0. Suppose we wish to guarantee that

P(|p̂n − p| > ε) ≤ α     (1.5)

for an α given in advance (e.g. α = 0.05 or α = 0.01). Now p(1 − p) ≤ 1/4, so

P(|p̂n − p| > ε) ≤ 1/(4nε²) ≤ α

provided we select ε = 1/√(4nα).
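As a quick numerical illustration (a Python sketch; the poll size n and the level α are arbitrary choices):

```python
from math import sqrt

# Chebyshev-based guarantee: P(|p_hat - p| > eps) <= 1/(4 n eps^2) <= alpha
# holds for eps = 1/sqrt(4 n alpha).
def chebyshev_eps(n, alpha):
    return 1 / sqrt(4 * n * alpha)

n, alpha = 2000, 0.05
eps = chebyshev_eps(n, alpha)
print(eps)   # 0.05 -> the interval (p_hat - 0.05, p_hat + 0.05) covers p with prob. >= 0.95
```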

The Chebyshev inequality thus allows us to quantitatively assess the accuracy when the sample size is given (or alternatively, to determine the necessary sample size to attain a given desired level of accuracy ε). The convergence in probability (or LLN) (1.3) is just a qualitative statement about p̂n; it is actually also derived from the Chebyshev inequality (cf. the proof of the LLN).
Another way of phrasing (1.5) would be: 'the probability that the interval (p̂n − ε, p̂n + ε) covers p is at least 95%', or

P((p̂n − ε, p̂n + ε) ∋ p) ≥ 1 − α.     (1.6)

Such statements are called confidence statements, and the interval (p̂n − ε, p̂n + ε) is a confidence interval. Note that p̂n is a random variable, so the interval is in fact a random interval. Therefore the element sign ∈ is written in reverse form ∋ to stress the fact that in (1.6) the interval is random, not p (p is merely unknown).
To be even more precise, we note that the probability law depends on p, so we should properly write Pp (as is done usually in statistics, where the probabilities depend on an unknown parameter). So we have

Pp((p̂n − ε, p̂n + ε) ∋ p) ≥ 1 − α.     (1.7)

When α is a preset value, and (p̂n − ε, p̂n + ε) is known to fulfill (1.7), the common practical point of view is: 'we believe that our true unknown p is within distance ε of p̂n.' When p̂n happens to be more than ε away from 1/2 (and e.g. larger), then the opinion poll institute has enough evidence; this immediately implies 'we believe that our true unknown p is greater than 1/2', and they can go ahead and publish the result. They know that if the true p is actually less than 1/2, then the


outcome they see (p̂n ≥ 1/2 + ε) has less than 0.05 probability:

Pp(p̂n ≥ 1/2 + ε) = 1 − Pp(p̂n − ε < 1/2)
                 ≤ 1 − Pp(p̂n − ε < p)
                 ≤ 1 − Pp(p̂n − ε < p < p̂n + ε)
                 = 1 − Pp((p̂n − ε, p̂n + ε) ∋ p)
                 ≤ α.

Note that the α is to some extent arbitrary, but common values are α = 0.05 (95% confidence) and α = 0.01 (99% confidence).
The reasoning "when I observe a fact (an outcome of a random experiment) and I know that under a certain hypothesis this fact would have less than 5% probability, then I reject this hypothesis" is very common; it is the basis of statistical testing theory. In our case of confidence intervals, the 'fact' (event) would be "1/2 is not within (p̂n − ε, p̂n + ε)" and the hypothesis would be p = 1/2. When we reject p = 1/2 because of p̂n ≥ 1/2 + ε, then we can also reject all values p < 1/2.
But this type of decision rule (rational decision making, testing) cannot give reasonable certainty in all cases. When 1/2 is in the 95% confidence interval, the institute would be well advised to be cautious, and not publish the result. It just means 'unfortunately, I did not observe a fact which would be very improbable under the hypothesis', so there is not enough evidence against the hypothesis; nothing can really be ruled out.
In summary: the prediction is suggested by p̂n; the confidence interval is a rational way of deciding whether to publish or not.
Note that, contrary to the above testing examples, the confidence interval did not involve any large sample approximation. However such arguments (normal approximation, estimating Var(X1) by p̂n(1 − p̂n)) can alternatively be used here.


Chapter 2
ESTIMATION IN PARAMETRIC MODELS

2.1 Basic concepts

After hypothesis testing and confidence intervals, let us introduce the third major branch of statistical inference: parameter estimation.
Recall that a random variable is a number depending on the outcome of an experiment. When the space of outcomes is Ω, then a random variable X is a function on Ω with values in a space X, written X(ω). The ω is often omitted.
We also need the concept of a realization of a random variable. This is a particular value x ∈ X which the random variable has taken, i.e. when ω has taken a specific value. Conceptually, by "random variable" we mean the whole function ω ↦ X(ω), whereas the realization is just a value x ∈ X (the "data" in a statistical context).
Population and sample. Suppose a large shipment of transistors is to be inspected for defective ones. One would like to know the proportion of defective transistors in the shipment; assume it is p where 0 ≤ p ≤ 1 (p is unknown). A sample of n transistors is taken from the shipment and the proportion of defective ones is calculated. This is called the sample proportion:

p̂ = (#defectives in sample) / n.     (2.1)

Here the shipment is called the population in the statistical problem and p is called a parameter of the population. When we randomly select one individual (transistor) from the population, this transistor is defective with probability p. We may define a random variable X1 in the following way:

X1 = 1 if defective

X1 = 0 otherwise.

That is, X1 takes value 1 with probability p and value 0 with probability 1 − p. Such a random variable is called a Bernoulli random variable and the corresponding probability distribution is the Bernoulli distribution, or Bernoulli law, written B(1, p). The sample space of X1 is the set {0, 1}.
When we take a sample of n transistors, this should be a simple random sample, which means the following:
a) each individual is equally likely to be included in the sample;
b) results for different individuals are independent one from another.
In mathematical language, a simple random sample of size n yields a set X1, . . . , Xn of independent, identically distributed random variables (i.i.d. random variables). They are identically distributed,


in the above example, because they all follow the Bernoulli law B(1, p) (they are a random selection from the population which has population proportion p). The X1, . . . , Xn are independent as random variables because of property b) of a simple random sample.
Denote by X = (X1, . . . , Xn) the totality of observations, or the vector of observations. This is now a random variable with values in the n-dimensional Euclidean space R^n. (We also call this a random vector, or also a random variable with values in R^n. Some probability texts assume random variables to take values in R only; the higher dimensional objects are called random elements or random vectors.) The sample space of X is now the set of all sequences of length n which consist of 0's and 1's, written also {0, 1}^n. In general, we denote by X the sample space of an observed random variable X.

Notation Let X be a random variable with values in a space X. We write L(X) for the probability distribution (or the law) of X.

Recall that the probability distribution (or the law) of X is given by the totality of the values X can take, together with the associated probabilities. That definition is true for discrete random variables (the totality of values is finite or countable); for continuous random variables the probability density function defines the distribution. When X is real valued, either discrete or continuous, the law of X can also be described by the distribution function

F(x) = P(X ≤ x).

In the above example, each individual Xi is Bernoulli: L(Xi) = B(1, p), but the law of X = (X1, . . . , Xn) is not Bernoulli: it is the law of n i.i.d. random variables having Bernoulli law B(1, p). (In probability theory, such a law is called the product law, written B(1, p)⊗n.) Note that in our statistical problem above, p is not known, so we have a whole set of laws for X: all the laws B(1, p)⊗n where p ∈ [0, 1].
The parametric estimation problem. Let X be an observed random variable with values in X and L(X) be the probability distribution (or the law) of X. Assume that L(X) is unknown, but known to be from a certain set of laws:

L(X) ∈ {Pϑ; ϑ ∈ Θ}.

Here ϑ is an index (a parameter) of the law and Θ is called the parameter space (the set of admitted ϑ). The problem is to estimate the parameter ϑ based on a realization of X. The set {Pϑ; ϑ ∈ Θ} is also called a parametric family of laws.
In the sequel we assume Θ is a subset of the real line R and X = (X1, . . . , Xn) where X1, . . . , Xn are independent random variables. Here n is called the sample size. In most of this section we confine ourselves to the case that X is a finite set (with the exception of some examples). In the above example, ϑ takes the role of the population proportion p; since the population proportion is known to be in [0, 1], the parameter space Θ would be [0, 1].

Definition 2.1.1 (i) A statistic T is an arbitrary function of the observed random variable X.
(ii) As an estimator T of ϑ we admit any mapping with values in Θ,

T : X → Θ.

In this case, for any realization x the statistic T(x) gives the estimated value of ϑ.


Note that T = T(X) is also a random variable. Statistical terminology is such that an "estimate" is a realized value of that random variable, i.e. T(x) above (the estimated value of ϑ), whereas "estimator" denotes the whole function T (also called "estimating function"). Sometimes the words "estimate" and "estimator" are used synonymously.
Thus an estimator is a special kind of statistic. Other instances of statistics are those used for building tests or confidence intervals.

Notation Since the distribution of X depends on an unknown parameter, we stress this dependence and write

Pϑ(X ∈ B), Eϑ h(X) etc.

for probabilities, expectations etc. which are computed under the assumption that ϑ is the true parameter of X.

For later reference we write the model of i.i.d. Bernoulli random variables in the following form.

Model M1 A random vector X = (X1, . . . , Xn) is observed, with values in X = {0, 1}^n; the distribution of X is the joint law B(1, p)⊗n of n independent and identically distributed Bernoulli random variables Xi each having law B(1, p), where p ∈ [0, 1].

This fits into the above parametric estimation problem as follows: we have to set ϑ = p, Θ = [0, 1] and Pϑ = B(1, p)⊗n.

Remark 2.1.2 We set p = ϑ and write Pp(·) for probabilities depending on the unknown p. Thus for a particular value x = (x1, . . . , xn) ∈ {0, 1}^n we have

Pp(X = x) = ∏_{i=1}^n Pp(Xi = xi) = ∏_{i=1}^n p^{xi}(1 − p)^{1−xi}.

Note that the above probability can be written

p^{∑_{i=1}^n xi}(1 − p)^{n − ∑_{i=1}^n xi} = p^z(1 − p)^{n−z},

where z denotes the value z = ∑_{i=1}^n xi. Thus the probability Pp(X = x) depends only on the number of realized 1's among the x1, . . . , xn, or the number of "successes" in n Bernoulli trials (the number of times the original event has occurred). ¤
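As a small illustration of this remark (a Python sketch; the sample x and the value of p are arbitrary):

```python
# Joint probability P_p(X = x) for i.i.d. Bernoulli(p) observations:
# it depends on x only through z = sum(x), the number of successes.
def joint_prob(x, p):
    z = sum(x)
    return p**z * (1 - p)**(len(x) - z)

p = 0.3
print(joint_prob([1, 0, 0, 1, 0], p))   # same value ...
print(joint_prob([0, 0, 1, 1, 0], p))   # ... since both samples have z = 2
```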

The problem is to estimate the parameter p from the data X1, . . . , Xn. (Note that now we used the word "data" for the random variables, not for the realizations; but this also corresponds to usage in statistics.) As estimators all mappings from X into [0, 1] are admitted, for instance the relative frequency of observed 1's:

T(X) = n⁻¹ ∑_{j=1}^n Xj = X̄n = p̂n,

i.e. the sample mean X̄n, which coincides with the sample proportion p̂n from (2.1). In the sequel we shall identify certain requirements which "good" estimators have to fulfill, and we shall develop criteria to quantify the performance of estimators.
Let us give some other examples of parametric estimation problems. In each case, we observe a random vector X = (X1, . . . , Xn) consisting of independent and identically distributed random


variables Xi. For the Xi we shall specify a parametric family of laws {Qϑ, ϑ ∈ Θ}; then this defines the parametric family of laws {Pϑ, ϑ ∈ Θ} for X, i.e. the specification L(X) ∈ {Pϑ, ϑ ∈ Θ}. It suffices to give the family of laws of X1; then L(X1) ∈ {Qϑ, ϑ ∈ Θ} determines the family {Pϑ, ϑ ∈ Θ}.
(i) Poisson family: {Po(λ), λ > 0}. Here Θ = (0, ∞).
(ii) Normal location family: {N(µ, σ²), µ ∈ R} where σ² is fixed (known). Here ϑ = µ, Θ = R and the expectation parameter µ of the normal law describes its location on the real line.
(iii) Uniform family: {U(0, ϑ), ϑ > 0} where U(0, ϑ) is the uniform law with endpoints 0 and ϑ, having density

pϑ(x) = ϑ⁻¹ for 0 ≤ x ≤ ϑ, and 0 otherwise.

Choosing the best estimator. In the above example involving transistors the sample proportion p̂ suggested itself as a reasonable estimator of the population proportion p. However it is not a priori clear that this is the best; we may also consider other functions of the observations X = (X1, . . . , Xn). First of all, we have to define what it means that an estimator is good. A quantitative comparison of estimators is made possible by the approach of statistical decision theory. We choose a loss function L(t, ϑ) which measures the loss (inaccuracy) if the unknown parameter ϑ is estimated by a value t. We stress that t must be chosen as an estimate depending on the data, so the criterion becomes more complicated (randomness intervenes) and the choice of the loss function is just a first step. The loss is assumed to be nonnegative, i.e. the minimal possible loss is zero. Natural choices, in case ϑ ∈ Θ ⊆ R, are the distance of t and ϑ (estimation error)

L(t, ϑ) = |t − ϑ|

or the quadratic estimation error

L(t, ϑ) = (t − ϑ)².     (2.2)

Another possible choice is

L(t, ϑ) = 1[δ,∞)(|t − ϑ|)     (2.3)

for some δ > 0, which means that emphasis is put on the distance being less than δ, not on its actual value.
As has been said above, the value of the loss becomes random when we insert for t an estimate based on the data. The loss then is a random variable L(T(X), ϑ), where T(X) is the estimator. To judge the performance of an estimator, it is natural to proceed to the average or expected loss for given ϑ.

Definition 2.1.3 The risk of an estimator T at parameter value ϑ is

R(T, ϑ) = EϑL(T (X), ϑ).

The risk R(T, ·) as a function of ϑ is called the risk function of the estimator T .

Note that in our present model (ϑ = p) we do not have to worry about the existence of the expected value (since the law B(1, p)⊗n has finite support). (For later generalizations, note that since L is nonnegative, if its expectation is not finite then it must be +∞, which may then also be regarded as the value of the risk.)


Thus the random nature of the loss is successfully dealt with by taking the expectation, but for judging an estimator T(X), the fact remains that ϑ is unknown. Thus we still have a whole risk function ϑ ↦ R(T, ϑ) as a criterion for performance. But it is desirable to express the quality of T by just one number and then try to minimize it. There is a further choice involved for the method to obtain such a number; in the sequel we shall discuss several approaches.
Suppose that, rather than reducing the problem in this fashion, we try to minimize the whole risk function simultaneously, i.e. try to find an estimator T* such that

R(T*, ϑ) = min_T R(T, ϑ) for all ϑ ∈ Θ.

Such an estimator would be called a uniformly best estimator. In general such an estimator will not exist: for each ϑ0 ∈ Θ consider the estimator

Tϑ0(x) = ϑ0.

This estimator ignores the data and always selects ϑ0; then R(Tϑ0, ϑ0) = 0, i.e. this estimator is very good if ϑ0 is true. Thus if T* were uniformly best it would have to compete with Tϑ0, i.e. fulfill

R(T*, ϑ) = 0 for all ϑ ∈ Θ,

i.e. T* always achieves 0 risk. If the risk adequately expresses a distance to the parameter ϑ this means that a sure decision is possible, which is not realistic for statistical problems, and possible only if the problem itself is degenerate or trivial. Thus (we argued informally) uniformly best estimators do not exist in general.
There are two ways out of this dilemma:

• reduce the problem to minimizing one characteristic, as argued above (i.e. the maximal risk or an average risk over Θ)

• restrict the class of estimators so as to rule out "unreasonable" competitors like Tϑ0 which are likely to be very bad for most ϑ ∈ Θ. Within a restricted class of estimators, a uniformly best one may very well exist.

Consistency of estimators. At the least, an estimator should converge towards the true (unknown) parameter to be estimated when the sample size increases. To emphasize the dependence of the data vector on the sample size n we write the statistical model

L(X(n)) ∈ {Pϑ,n; ϑ ∈ Θ}.

Recall that convergence in probability for a sequence of random variables is denoted by the symbol →P.

Definition 2.1.4 A sequence Tn = Tn(X(n)) of estimators (each based on a sample of size n) for the parameter ϑ is called consistent if for all ϑ ∈ Θ

Pϑ,n( |Tn(X(n)) − ϑ| > ε ) → 0 as n → ∞, for all ε > 0,

or in other notation

Tn = Tn(X(n)) →P ϑ as n → ∞, if L(X(n)) = Pϑ,n.


Proposition 2.1.5 In Model M1 the estimator

Tn(X1, . . . , Xn) = X̄n

is consistent for the probability of success p.

Proof. Since X1, . . . , Xn are i.i.d. Bernoulli B(1, p) random variables, we have EX1 = p and the result immediately follows from the law of large numbers.
The consistency requirement restricts the class of estimators to "reasonable" ones; consistency can be seen as a minimal requirement. But there are still many mappings of the observation space into the parameter space which define consistent sequences of estimators.
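A quick Monte Carlo illustration of this consistency (a sketch; the true p, the tolerance ε and the sample sizes are arbitrary choices):

```python
import random

# Estimate P(|X_bar_n - p| > eps) in Model M1 by simulation, for increasing n.
def exceed_prob(n, p, eps, reps=2000):
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() < p for _ in range(n)) / n
        count += abs(xbar - p) > eps
    return count / reps

p, eps = 0.4, 0.05
for n in [10, 100, 1000]:
    print(n, exceed_prob(n, p, eps))   # decreases towards 0 as n grows
```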

2.2 Bayes estimators

In the Bayesian approach, one assumes that an a priori distribution is given on the parameter space, reflecting one's prior belief about the parameter ϑ ∈ Θ. In the case of Model M1, assume that this prior distribution is given in the form of a density g on the interval [0, 1]. For an estimator T, this allows us to reduce the risk function R(T, p), p ∈ [0, 1] to just one number, by integration:

B(T) = ∫₀¹ R(T, p) g(p) dp     (2.4)
     = ∫₀¹ Ep(T − p)² g(p) dp     (2.5)

(the latter equality being true in the case of quadratic loss). This quantity B(T) is called the integrated risk or mixed risk of the estimator T. A Bayes estimator TB is then defined by the property of minimizing the integrated risk:

B(TB) = inf_T B(T)     (2.6)

and the Bayes risk is the minimal integrated risk, i.e. (2.6).
The name "Bayesian" derives from the analogy with the Bayes formula: if Bi, i = 1, . . . , k is a partition of the sample space then

P(A) = ∑_{i=1}^k P(A|Bi) · P(Bi).

In our example involving the parameter p above, p takes continuous values in the interval (0, 1), but to make the connection to the Bayes formula above, consider a model where the parameter ϑ takes only finitely many values (Θ = {ϑ1, . . . , ϑk}). Consider the case that P is the joint distribution of (X, U) where X is the data and U is a random variable which takes the k possible values of the parameter ϑ (U ∈ Θ). For A = "X ∈ A" and Bi = "U = ϑi" we get

P(X ∈ A) = ∑_{i=1}^k P(X ∈ A | U = ϑi) P(U = ϑi).

Here gi = P(U = ϑi) can be construed as a prior probability that the parameter ϑ takes the value ϑi, and the conditional probabilities P(X ∈ A | U = ϑi) can be construed as a family of probability measures depending on ϑ ∈ Θ. In other words, in the Bayesian approach


a family of probability measures {Pϑ; ϑ ∈ Θ} is understood as a conditional distribution given U = ϑ. The marginal distribution of ϑ "awaits specification" as a prior distribution based on belief or experience. In our example (2.5), the probabilities gi = P(U = ϑi), i = 1, . . . , k are replaced by a probability density g(p), p ∈ (0, 1). It is also possible of course to admit only finitely many values of p: pi, i = 1, . . . , k with prior probabilities gi; then (2.5) would read

B(T) = ∑_{i=1}^k Epi(T − pi)² gi.

The expectations Epi(T − pi)² for a given pi can then be interpreted as conditional expectations given U = ϑi, and the integrated risk B(T) above is then an unconditional expectation, namely B(T) = E(T − p)², the expected squared loss (T − p)² with respect to the joint distribution of the observations X and the random parameter p.

In the philosophical foundations of statistics, or in theories of how statistical decisions should be made in the real world, the Bayesian approach has developed into an important separate school of thought; those who believe that prior distributions on ϑ should always be applied (and are always available) are sometimes called "Bayesians", and Bayesian statistics is the corresponding part of Mathematical Statistics.

In Model M1, consider the following family of prior densities for the Bernoulli parameter p: for α, β > 0

gα,β(p) = p^{α−1}(1 − p)^{β−1} / B(α, β),  p ∈ [0, 1],

where

B(α, β) = Γ(α)Γ(β) / Γ(α + β).     (2.7)

The corresponding distributions are called the Beta distributions; here B(α, β) stands for the beta function defined by (2.7). Recall that

Γ(α) = ∫₀^∞ x^{α−1} exp(−x) dx.

Thus we consider a whole family (α, β > 0) of possible prior distributions for p, allowing a wide range of choices for "prior belief". The plot below shows three different densities from this family; let us mention that the uniform density on [0, 1] is also in this class (for α = β = 1). We will discuss the Beta family in more detail later, establishing also that B(α, β) is the correct normalization factor. It will also become clear that Bayesian methods are very useful to prove non-Bayesian optimality properties of estimators.


Beta densities for (α, β) = (2, 6), (2, 2), (6, 2)
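These densities can be evaluated directly from (2.7) (a Python sketch; the (α, β) pairs are those in the plot caption):

```python
from math import gamma

# Beta prior density g_{alpha,beta}(p) = p^(a-1) (1-p)^(b-1) / B(a, b),
# with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) as in (2.7).
def beta_density(p, a, b):
    B = gamma(a) * gamma(b) / gamma(a + b)
    return p**(a - 1) * (1 - p)**(b - 1) / B

for a, b in [(2, 6), (2, 2), (6, 2)]:
    print(a, b, [round(beta_density(p / 10, a, b), 3) for p in range(1, 10)])
```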

Proposition 2.2.1 In Model M1, let g be an arbitrary prior density for the parameter p ∈ [0, 1]. The corresponding Bayes estimator of p is

TB(x) = ∫₀¹ p^{z(x)+1}(1 − p)^{n−z(x)} g(p) dp / ∫₀¹ p^{z(x)}(1 − p)^{n−z(x)} g(p) dp     (2.8)

for x = (x1, . . . , xn) ∈ {0, 1}^n and

z(x) = ∑_{i=1}^n xi.

Remark. (i) Note that the Bayes estimator depends on the sample x only via the statistic z(x), or equivalently, via the sample mean X̄n(x) = n⁻¹ z(x).
(ii) The function

gx(p) = p^{z(x)}(1 − p)^{n−z(x)} g(p) / ∫₀¹ p^{z(x)}(1 − p)^{n−z(x)} g(p) dp

is a probability density on [0, 1], depending on the observed x. We see from (2.8) that TB(x) is the expectation for that density:

TB(x) = ∫₀¹ p gx(p) dp.

Proof. For any estimator T and X = {0, 1}^n

B(T) = ∑_{x∈X} ∫₀¹ (T(x) − p)² p^{z(x)}(1 − p)^{n−z(x)} g(p) dp.

To minimize B(T), the value T(x) should be chosen optimally; one could try to do this for every term in the sum, given x. Thus we look for the minimum of

bx(t) = ∫₀¹ (t − p)² p^{z(x)}(1 − p)^{n−z(x)} g(p) dp.


This can be written as a polynomial of degree 2 in t:

bx(t) = c0 − 2c1 t + c2 t²

where

c2 = ∫₀¹ p^{z(x)}(1 − p)^{n−z(x)} g(p) dp,
c1 = ∫₀¹ p^{z(x)+1}(1 − p)^{n−z(x)} g(p) dp,
c0 = ∫₀¹ p^{z(x)+2}(1 − p)^{n−z(x)} g(p) dp.

Since

0 ≤ p^{z(x)}(1 − p)^{n−z(x)} ≤ 1,

all these integrals are finite, and c2 > 0. It follows that the unique minimum t0 of bx(t) can be obtained by setting the derivative to 0. The solution of

b′x(t) = 2t c2 − 2c1 = 0

is t0 = c1/c2; we have 0 ≤ c1 ≤ c2 since p ≤ 1. This implies that 0 ≤ t0 ≤ 1, and

TB(x) = t0(x)

is of the form (2.8).
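Formula (2.8) is also easy to evaluate numerically for any prior density g (a Python sketch using a simple Riemann sum; the prior and the data below are arbitrary choices):

```python
# Numerical evaluation of the Bayes estimator (2.8) for a prior density g on [0, 1].
def bayes_estimate(x, g, grid_size=10000):
    n, z = len(x), sum(x)
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    w = [p**z * (1 - p)**(n - z) * g(p) for p in ps]      # unnormalized posterior weights
    return sum(p * wi for p, wi in zip(ps, w)) / sum(w)

def uniform_prior(p):
    return 1.0                            # the uniform density on [0, 1] (alpha = beta = 1)

x = [1, 0, 1, 1, 0, 1, 0, 1]              # z = 5 successes in n = 8 trials
print(bayes_estimate(x, uniform_prior))   # close to (z + 1)/(n + 2) = 0.6
```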

2.3 Admissible estimators

Bayes estimators also enjoy an optimality property in terms of the original risk function p ↦ R(TB, p), even before a weighted average over the parameter p is taken.

Definition 2.3.1 An estimator T of a parameter ϑ is called admissible, if for every estimator S of ϑ, the relation

R(S, ϑ) ≤ R(T, ϑ) for all ϑ ∈ Θ     (2.9)

implies

R(S, ϑ) = R(T, ϑ) for all ϑ ∈ Θ.     (2.10)

Thus admissibility means that there can be no estimator S which is uniformly at least as good ((2.9) holds), and strictly better for one ϑ0 ∈ Θ: then R(S, ϑ0) < R(T, ϑ0) contradicts (2.10). Non-admissibility of T means that T can be improved by another estimator S.

Proposition 2.3.2 Suppose that in Model M1, the prior density g is such that g(p) > 0 for all p ∈ [0, 1] with the exception of a finite number of points p. Then the Bayes estimator TB for this prior density is admissible, for quadratic loss.


Proof. Suppose that TB is not admissible. Thus there is an estimator S and a p0 ∈ [0, 1] with

R(S, p) ≤ R(TB, p) for all p ∈ [0, 1], and R(S, p0) < R(TB, p0).     (2.11)

Note that R(S, p) is continuous in p (continuous on p ∈ (0, 1), and right resp. left continuous at the endpoints 0 and 1):

R(S, p) = ∑_{x∈X} (S(x) − p)² p^{z(x)}(1 − p)^{n−z(x)}

for z(x) = ∑_{i=1}^n xi, x ∈ X. Thus (2.11) implies that there must be a whole neighborhood of p0 within which S is better: for some ε > 0

R(S, p) < R(TB, p),  p ∈ [p0 − ε, p0 + ε] ∩ [0, 1].

It follows that

B(S) = ∫₀¹ R(S, p) g(p) dp < ∫₀¹ R(TB, p) g(p) dp = B(TB).

This contradicts the fact that TB is a Bayes estimator; thus TB must be admissible.

2.4 Bayes estimators for Beta densities

Let us specify (2.8) to the case of Beta densities gα,β(p) introduced above. We have

TB(x) = ∫₀¹ p^{z(x)+1}(1 − p)^{n−z(x)} gα,β(p) dp / ∫₀¹ p^{z(x)}(1 − p)^{n−z(x)} gα,β(p) dp     (2.12)
      = ∫₀¹ p^{α+z(x)}(1 − p)^{β+n−z(x)−1} dp / ∫₀¹ p^{α+z(x)−1}(1 − p)^{β+n−z(x)−1} dp.     (2.13)

With a partial integration we obtain for γ, δ > 0

∫₀¹ p^γ(1 − p)^{δ−1} dp = [ −(1/δ) p^γ(1 − p)^δ ]₀¹ + (γ/δ) ∫₀¹ p^{γ−1}(1 − p)^δ dp     (2.14)
                        = (γ/δ) ∫₀¹ p^{γ−1}(1 − p)^δ dp.     (2.15)

(Technical remark: for γ < 1, the function p^{γ−1} tends to ∞ as p → 0, so the second integral in (2.14) is improper [a limit of ∫_t^1 for t ↘ 0], similarly in case of δ < 1. Hence, strictly speaking, one should argue in terms of ∫_t^1 first and then take a limit. Note that these limits exist since the function p^{γ−1} is integrable on (0, 1) for γ > 0:

∫_t^1 p^{γ−1}(1 − p)^δ dp ≤ ∫_t^1 p^{γ−1} dp = [ p^γ/γ ]_t^1 = (1/γ)(1 − t^γ) ≤ 1/γ. )


Relation (2.15) implies

∫₀¹ p^{γ−1}(1 − p)^{δ−1} dp = ∫₀¹ (p + 1 − p) p^{γ−1}(1 − p)^{δ−1} dp
                            = ∫₀¹ p^γ(1 − p)^{δ−1} dp + ∫₀¹ p^{γ−1}(1 − p)^δ dp
                            = (1 + δ/γ) ∫₀¹ p^γ(1 − p)^{δ−1} dp

(for the last equality, we reversed the roles of δ and γ in (2.15)). Setting now γ = α + z(x), δ = β + n − z(x), we obtain from (2.13)

TB(x) = ∫₀¹ p^γ(1 − p)^{δ−1} dp / ∫₀¹ p^{γ−1}(1 − p)^{δ−1} dp = (1 + δ/γ)⁻¹ = γ/(γ + δ) = (α + z(x))/(α + β + n),

thus (:= means defining equality)

Tα,β(X) := TB(X) = (X̄n + α/n)/(1 + α/n + β/n).
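In closed form this is simple to compute (a Python sketch; the data and the prior parameters are arbitrary choices):

```python
# Bayes estimator under a Beta(alpha, beta) prior: T_{alpha,beta}(x) = (alpha + z)/(alpha + beta + n).
def beta_bayes_estimate(x, alpha, beta):
    n, z = len(x), sum(x)
    return (alpha + z) / (alpha + beta + n)

x = [1, 0, 1, 1, 0, 1, 0, 1]                 # z = 5 successes in n = 8 trials
print(beta_bayes_estimate(x, 2, 2))          # 7/12 = 0.583..., pulled towards 1/2
print(sum(x) / len(x))                       # sample mean 0.625 for comparison
```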

We already know that the Bayes estimator is a function of the sample mean; we specified formula (2.8). Let us discuss some limiting cases.
Limiting case A) Sample size n is fixed, α → 0, β → 0. In the limit X̄n is obtained. However, the family of densities gα,β does not converge to a density for α → 0, β → 0, since the function

g*(p) = p⁻¹(1 − p)⁻¹

is not integrable. Indeed

∫_t^{1/2} p⁻¹(1 − p)⁻¹ dp ≥ ∫_t^{1/2} p⁻¹ dp = log(1/2) − log t → ∞ for t → 0.

This means that X̄n is not a Bayes estimator for one of the densities gα,β.
Limiting case B) α and β are fixed, sample size n → ∞. In this case we have for p ∈ (0, 1)

Tα,β(X) = X̄n + op(n^{−γ}) for all 0 ≤ γ < 1.     (2.16)

Definition 2.4.1 A sequence of r.v.'s Zn is called op(n^{−γ}) if

n^γ Zn →P 0 as n → ∞.

Relation (2.16) thus means that

n^γ |Tα,β(X) − X̄n| →P 0.

To see (2.16), note that

(X̄n + α/n)/(1 + α/n + β/n) − X̄n = (α/n − X̄n(β/n + α/n))/(1 + α/n + β/n)


and since X̄n →P p, it is obvious that the above quantity is op(n^{−γ}). At the same time, X̄n converges in probability to p, but more slowly: we have

X̄n = p + op(n^{−γ}) for all 0 ≤ γ < 1/2.     (2.17)

Indeed, Varp(X̄n) = p(1 − p)/n, hence

Varp(n^γ(X̄n − p)) = n^{2γ} Varp(X̄n − p) = n^{2γ} Varp(X̄n) = n^{2γ−1} p(1 − p),

which tends to 0 for 0 ≤ γ < 1/2 but not for γ = 1/2. Rather, for γ = 1/2 we have by the central limit theorem

n^{1/2}(X̄n − p) →d N(0, p(1 − p)),

so that (X̄n − p) is not op(n^{−1/2}).
The interpretation of (2.16), (2.17) is that as n → ∞, the influence of the a priori information diminishes. The Bayes estimators all become close to the sample mean (and to each other) at rate n^{−γ}, γ < 1, whereas they converge to the true p only at rate n^{−γ}, γ < 1/2.
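A small simulation makes the two rates visible (a sketch; p, γ, α, β and the sample sizes are arbitrary choices, and single realizations are of course noisy):

```python
import random

# Compare n^gamma |T_{alpha,beta} - X_bar_n| with n^gamma |X_bar_n - p| for gamma = 0.75:
# the first tends to 0 (since gamma < 1), the second does not (since gamma > 1/2).
p, alpha, beta, gamma_exp = 0.3, 2.0, 2.0, 0.75
for n in [100, 1000, 10000]:
    xbar = sum(random.random() < p for _ in range(n)) / n
    t_ab = (alpha + n * xbar) / (alpha + beta + n)
    print(n, n**gamma_exp * abs(t_ab - xbar), n**gamma_exp * abs(xbar - p))
```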

2.5 Minimax estimators

Let us consider the case α = β = n^{1/2}/2. For the risk we obtain

R(Tα,β, p) = Ep(Tα,β − p)² = (1/(n + α + β)²) Ep( nX̄n + α − (n + α + β)p )²     (2.18)
           = (1/(n + n^{1/2})²) { Ep( nX̄n − np )² + (α(1 − p) − βp)² }     (2.19)

(for the last equality, note that E(Y − EY + a)² = E(Y − EY)² + a² for nonrandom a). Set q := 1 − p and note that nX̄n has the binomial distribution B(n, p), so that

Ep( nX̄n − np )² = Var(nX̄n) = npq.

Hence (2.19) equals

R(Tα,β, p) = (n/(4(n + n^{1/2})²)) (4pq + (p − q)²) = (n/(4(n + n^{1/2})²)) (p + q)²
           = 1/(4n(1 + n^{−1/2})²) =: mn.

The above reasoning is valid for all p ∈ [0, 1]; it means that the risk R(Tα,β, p) for this special choice of α, β does not depend upon p. In addition we know that Tα,β is admissible. This implies that Tα,β is a minimax estimator; let us define that important notion.

Definition 2.5.1 In model M1, for any estimator T set

M(T) = max_{0≤p≤1} R(T, p).     (2.20)

An estimator TM is called minimax if

M(TM) = min_T M(T) = min_T max_{0≤p≤1} R(T, p).


Note that the maximum in (2.20) exists since R(T, p) is continuous in p for every estimator. The minimax approach is similar to the Bayes approach, in that one characteristic of the risk function R(T, p), p ∈ [0, 1] is selected as the performance criterion for an estimator. In this case the characteristic is the worst case risk.

Theorem 2.5.2 In model M1, the Bayes estimator Tα,β for α = β = n^{1/2}/2 is a minimax estimator for quadratic loss.

Proof. Let T be an arbitrary estimator of p. Since Tα,β is admissible according to Proposition 2.3.2, there must be a p0 ∈ [0, 1] such that

R(T, p0) ≥ R(Tα,β, p0) = mn.     (2.21)

If there were no such p0, then we would have

R(T, p) < R(Tα,β, p)

for all p, which contradicts the admissibility of Tα,β. Now (2.21) implies

M(T) ≥ mn = M(Tα,β).

The estimator Tα,β = TM has the form

TM = (X̄n + α/n)/(1 + α/n + β/n) = (X̄n + n^{−1/2}/2)/(1 + n^{−1/2}).     (2.22)

Let us compare the risk functions of the minimax estimator TM and of the sample mean. We have

R(X̄n, p) = Varp(X̄n) = p(1 − p)/n,
R(TM, p) = mn = 1/(4n(1 + n^{−1/2})²).

For n = 30 the two risk functions are plotted below.

Risk of sample mean (line) and minimax risk (dots), n=30

Page 28: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

28 Estimation in parametric models

It is clearly visible that the ”problem region” for the sample mean is the area around p = 1/2; theminimax estimator is better in the center, at the expense of the tail regions, and thus achieves asmaller overall risk.It is instructive to look at the form of the minimax estimator TM itself. The function fM(x) =x+n−1/2/21+n−1/2

is plotted below, also for n = 30.

The minimax estimator as a function of the sample mean

This function has values fM(1/2) = 1/2 and fM(0) =n−1/2/21+n−1/2

and it is linear; by symmetry weobtain fM(1) = 1− fM(0). This means that the values of the sample mean are ”moved” towardsthe value 1/2, which is the one where the risk would be maximal. This can be understood as a kindof prudent behaviour of the minimax estimator, which tends to be closer to the value 1/2 since itis more damaging to make an error if this value were true. We can also write

Xn =1

2+ (Xn − 1/2),

TM =1

2+(Xn − 1/2)1 + n−1/2

where it is seen that the minimax estimator takes the distance of Xn to 1/2 and ”shrinks” it by afactor (1 + n−1/2)−1 which is < 1.

Page 29: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Chapter 3MAXIMUM LIKELIHOOD ESTIMATORS

In the general statistical model where X is an observed random variable with values in X , X is acountable set (i.e. X is a discrete r.v.) and Pϑ;ϑ ∈ Θ is a family of probability laws on X :

L(X) ∈ Pϑ;ϑ ∈ Θ ,

consider the probability function for given ϑ ∈ Θ, i.e. Pϑ(X = x). For each x ∈ X , the function

Lx(ϑ) = Pϑ(X = x)

is called the likelihood function of ϑ given x. The name reflects the heuristic principle that whenobservations are realized, i.e. X took the value x, the most ”likely” parameter values of ϑ are thosewhere Pϑ(X = x) is maximal. This does not mean that Lx(ϑ) gives a probability distribution onthe parameter space; the likelihood principle has its own independent rationale, on a purelyheuristic basis.Under special conditions however the likelihood function can be interpreted as a probability func-tion. Consider the case that Θ = ϑ1, . . . , ϑk is a finite set, and consider a prior distribution onΘ which is uniform:

P (U = ϑi) = k−1, i = 1, . . . , k.

Understanding Pϑ(X = x) as a conditional distribution given ϑ, i.e. setting (as always in theBayesian approach)

P (X = x|U = ϑ) = Pϑ(X = x),

we immediately obtain a posterior distribution of ϑ, i.e. the conditional distribution of U givenX = x:

P (U = ϑ|X = x) =P (U = ϑ,X = x)

P (X = x)(3.1)

=Pϑ(X = x)P (U = ϑ)

P (X = x)=

Pϑ(X = x)

k · P (X = x)

= Lx(ϑ) · (k · P (X = x))−1 . (3.2)

Here the factor (k · P (X = x))−1 does not depend on ϑ, so that in this case, the posterior probabilityfunction of ϑ is proportional to the likelihood function Lx(ϑ). Recall that the marginal probabilityfunction of X is

P (X = x) =kXi=1

Pϑi(X = x)P (U = ϑi)

Page 30: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

30 Maximum likelihood estimators

Thus in this special case, the likelihood principle can be derived from the Bayesian approach, for a”noninformative” prior distribution (i.e. the uniform distribution on Θ). However in cases whereΘ there is no natural uniform distribution on Θ, such as for Θ = R or Θ = Z+ (the nonnegativeintegers), such a reasoning is not straightforward. (A limiting argument for a sequence of priordistributions is often possible).A maximum likelihood estimator (MLE) of ϑ is an estimator T (x) = TML(x) such that

Lx(TML(x)) = maxϑ∈Θ

Lx(ϑ),

i.e. for each given x, the estimator is a value of ϑ which maximizes the likelihood.In the case of model M1, the probability function for given p = ϑ is

Pp(X = x) = pz(x)(1− p)n−z(x) = Lx(p).

for x ∈ X . In this case the parameter space is Θ = [0, 1], on which there is a natural uniformdistribution (the beta density for α = β = 1). It can be shown that also in this case, somethinganalogous to (3.2) holds, i.e. the likelihood function is proportional to the density of the posteriordistribution. Without proving this statement, let us directly compute the maximum likelihoodestimate.Assume first that x is such that z(x) =

Pni=1 xi ∈ (0, n). Then the likelihood function has Lx(0) =

Lx(1) = 0 and is positive and continuously differentiable on the open interval p ∈ (0, 1). Thus alsothe logarithmic likelihood function

lx(p) = logLx(p)

is continuously differentiable, and since log is a monotone function, local extrema of the likelihoodfunction in (0, 1) coincide with local extrema of the log-likelihood in p ∈ (0, 1). We have

lx(p) = z(x) log p+ (n− z(x)) log(1− p),

l0x(p) = z(x)p−1 − (n− z(x))(1− p)−1.

We look for zeros of this function in p ∈ (0, 1); the local extrema are among these. We obtain

z(x)p−1 = (n− z(x))(1− p)−1,

z(x)(1− p) = (n− z(x))p,

np = z(x).

Thus if z(x) =Pn

i=1 xi ∈ (0, n) then p = n−1z(x) = xn gives the unique local extremum of lx(p)on (0, 1), which must be a maximum of Lx(p).If either z(x) = 0 or z(x) = n then

Lx(p) = (1− p)n or Lx(p) = p

such that p = 0 or p = 1 resp. are maximum likelihood estimates. We have established

Proposition 3.0.3 In modelM1, the sample mean Xn is the unique maximum likelihood estimator(MLE) of the parameter p ∈ [0, 1]:

TML(X) = Xn.

Page 31: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Maximum likelihood estimators 31

We now turn to the continuous likelihood principle. Let X be a random variable with valuesin Rk such that L(X) ∈ Pϑ;ϑ ∈ Θ, and each law Pϑ has a density pϑ(x) on Rk. For each x ∈ Rk,the function

Lx(ϑ) = pϑ(x)

is called the likelihood function of ϑ given x. A maximum likelihood estimator (MLE) of ϑ is anestimator T (x) = TML(x) such that

Lx(TML(x)) = maxϑ∈Θ

Lx(ϑ),

i.e. for each given x, the estimator is a value of ϑ which maximizes the likelihood.Let us consider an example in which all densities are Gaussian.

Model M2 Observed are n independent and identically random variables X1, . . . ,Xn, each havinglaw N(µ, σ2), where σ2 > 0 is given (known) and µ ∈ R is unknown.

This is also called the Gaussian location model (or Gaussian shift model). Consider the case ofsample size n = 1. We can represent Xi as

Xi = µ+ ξi

where ξi are i.i.d. centered normal: L(ξi) = N(0, σ2). The parameter ϑ is µ and parameter spaceΘ is R.

Proposition 3.0.4 In the Gaussian location model M2, the sample mean Xn is the unique maxi-mum likelihood estimator of the expectation parameter µ ∈ R:

TML(X) = Xn.

Proof. We have

Lx (µ) =nYi=1

1

(2πσ2)12

exp

Ã−(xi − µ)2

2σ2

!

=1

(2πσ2)n2

exp

Ã− 1

2σ2

nXi=1

(xi − µ)2!

thus the logarithmic likelihood function is

lx(µ) = logLx(µ) = −1

2σ2

nXi=1

(xi − µ)2 − n

2log(2πσ2).

Note that

d

dµlx (µ) = − 1

2σ2

nXi=1

(−2 (xi − µ))

=1

σ2

ÃnXi=1

xi − nµ

!= 0

Page 32: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

32 Maximum likelihood estimators

if and only if

µ =

Pni=1 xin

= xn.

Thus lx (µ) which is continuously differentiable on R has a unique local extremum at µ = xn. Sincelx (µ)→ −∞ for µ→ ±∞, this must be a maximum.We can now ask what happens if the variance σ2 is also unknown. In that connection we introducethe Gaussian location-scale model

Model M3 Observed are n independent and identically random variables X1, . . . ,Xn, each havinglaw N(µ, σ2), where µ ∈ R and σ2 > 0 are both unknown.

In addition to the sample mean Xn = n−1Pn

i=1Xi, consider the statistic

S2n =1

n− 1

nXi=1

(Xi − Xn)2

This statistic is called the sample variance . The empirical second (central) moment (e.s.m.) is

S2n =n− 1n

S2n =1

n

nXi=1

(Xi − Xn)2

At first glance, it would seem more natural to call the expression S2n the sample variance, since itis the sample analog of the variance of the random variable X1 :

σ2 = Var(X1) = E (X1 −EX1)2 .

However it is customary in statistics to call S2n the sample variance; the reason is unbiasednessfor σ2 , which will be discussed later. Note that we need a sample size n ≥ 2 for S2n and S2n to benonzero.

Proposition 3.0.5 In the Gaussian location-scale model M3, for a sample size n ≥ 2, if S2n >0 then sample mean and e.s.m. (Xn, S

2n) are the unique maximum likelihood estimators of the

parameter ϑ = (µ, σ2) ∈ Θ = R× (0,∞):

TML(X) = (Xn, S2n)

The event S2n > 0 has probability one for all ϑ ∈ Θ.

Proof. We write xn, s2n for sample mean and e.s.m. when x1, . . . , xn are realized data. Note that

s2n = n−1nXi=1

x2i − (xn)2.

HencenXi=1

(xi − µ)2 =nXi=1

x2i − 2nxnµ+ nµ2

=nXi=1

x2i − nx2n + n (xn − µ)2 = ns2n + n (xn − µ)2 .

Page 33: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Maximum likelihood estimators 33

Thus the likelihood function Lx (ϑ) is

Lx (ϑ) =1

(2πσ2)n2

exp

Ã− 1

2σ2

nXi=1

(xi − µ)2!

=1

(2πσ2)n/2exp

Ã− s

2n + (xn − µ)2

2σ2n−1

!(3.3)

=1

σn−1/2ϕ

µxn − µ

σn−1/2

¶· 1

n1/2(2πσ2)(n−1)/2exp

µ− s2n2σ2n−1

¶. (3.4)

where

ϕ(x) =1

(2π)1/2exp

µ−x

2

2

¶is the standard normal density. We see that the first factor is a normal density in xn and the secondfactor does not depend on µ. To find MLE’s of µ and σ2 we first maximize for fixed σ2 over allpossible µ ∈ R. The first factor is the likelihood function in model M2 for a sample size n = 1,variance n−1σ2 and an observed value x1 = xn. This gives an MLE µ = xn. We can insert thisvalue into (3.3); we now have to maximize

Lx

¡xn, σ

2¢=

1

(2πσ2)n/2exp

µ− s2n2σ2n−1

¶over σ2 > 0. For notational convenience, we set γ = σ2; equivalently, one may minimize

lx(γ) = − logLx (xn, γ) =n

2log γ +

ns2n2γ

.

Note that if s2n > 0, for γ → 0 we have lx(γ) → ∞ and for γ → ∞ also lx(γ) → ∞, so that aminimum exists and is a zero of the derivative of lx. The event S2n > 0 has probability 1 sinceotherwise xi = xn, i = 1, . . . , n. i.e. all xi are equal, which clearly has probability 0 for independentcontinuous random variables Xi. We obtain

l0x(γ) =n

2γ− ns2n2γ2

= 0,

γ = s2n

as the unique zero, so σ2 = s2n is the MLE of σ2.

For our next example, let us introduce the double exponential distribution DE(µ, λ). It has density

f(x) = (2λ)−1 exp¡− |x− µ|λ−1

¢for given µ ∈ R, λ > 0. We shall assume λ = 1 here; clearly the above is a density for any µ sinceZ ∞

0exp(−x)dx = 1.

Clearly DE(µ, λ) has finite moments of any order, and by symmetry reasons µ is the expectation.We introduce the double exponential location model .

Page 34: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

34 Maximum likelihood estimators

Model M4 Observed are n independent and identically random variables X1, . . . ,Xn, each havinglaw DE(µ, 1), where µ ∈ R is unknown.

We will show that the MLE in this case is the sample median. For any vector (x1, . . . , xn), letx(1) ≤ x(2) ≤ . . . ≤ x(n) be the vector of ordered values; this is always uniquely defined. Forn i.i.d. random variables X = (X1, . . . ,Xn), define the order statistics to be components ofthe vector X(1), . . . ,X(n). Recall that a statistic was any function of the data; thus for given i,the i-th order statistic is a well defined data function. In particular X(n) = maxi=1,...,nXi etc,X(1) = mini=1,...,nXi. Define the sample median as

med(X) = X((n+1)/2) if n is uneven12(X(n/2) +X(n/2+1)) if n is even

In other words, the sample median is the central order statistic if n is uneven, and the center ofthe two central order statistics if n is even.

Proposition 3.0.6 In the double exponential location model M4, the sample median med(X) is amaximum likelihood estimator of the location parameter µ ∈ Θ = R: TML(X) = med(X).

Note that uniqueness is not claimed here.Proof. The likelihood function is

Lx (µ) =nYi=1

(2)−1 exp (− |xi − µ|)

= (2)−n exp

Ã−

nXi=1

|xi − µ|!

and maximizing it is equivalent to minimizing − log Lx (µ), or minimizing

lx(µ) =nXi=1

|xi − µ| .

For given x this is a piecewise linear continuous function in µ; for µ > x(n) it is a linear functionin µ tending to ∞ for µ → ∞, and for µ > x(1) it is also a linear function in µ tending to ∞ forµ → −∞. Thus the minimum must be attained in the range of the sample, i. e. in the interval[x(1), x(n)]. Assume that x(1) < x(1) < . . . x(n), i.e. no two values of the sample are equal (there areno ties). That event has probability one since the Xi are continuous random variables. Inside aninterval (x(i), x(i+1)), 1 ≤ i ≤ n− 1 the derivative of lx(µ) is

l0x(µ) =iX

j=1

1 +nX

j=i+1

(−1) = i− (n− i) = 2i− n.

Consider first the case of even n. Then l0x(µ) is negative for i < n/2, positive for i > n/2 and 0for i = n/2. Thus the minimum in µ is attained by any value µ ∈ [x(n/2), x(n/2+1)], in particularby the center of that interval which is the sample median. If n is uneven then l0x(µ) is negative fori < n/2 and positive for i > n/2. Since the function lx(µ) is continuous, the minimum is attainedin the beginning point of the first interval where l0x(µ) is positive, which is X((n+1)/2).

Page 35: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Maximum likelihood estimators 35

Since the sample median minimizesPn

i=1 |Xi − µ| it may be called a least absolute deviationestimator . In contrast the sample minimizes the sum of squares

Pni=1 (Xi − µ)2, i.e. it is a least

squares estimator . Note that the sample mean remains the same when X(1) →∞ whereas Xn

also tends to infinity in that case. By that reason, the sample median is often applied independentlyof a maximum likelihood justification, which holds when the data are double exponential.In analogy to the sample median, the population median is defined as as a "half point" of thedistribution of a random variable X representing the population. A value m is a median of a r.v.X if simultaneously P (X ≥ m) ≥ 1/2 and P (X ≤ m) ≥ 1/2 hold. For continuous distributions, wealways have P (X = m) = 0, and therefore m is a solution of P (X > m) = 1 − P (X < m) = 1/2which may not be unique.

Page 36: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

36 Maximum likelihood estimators

Page 37: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Chapter 4UNBIASED ESTIMATORS

Consider again the general parametric estimation problem: let X be a random variable with valuesin X and L(X) be the distribution (or the law) of X. Assume that L(X) is known up to a ϑ froma parameter space Θ ⊆ Rk:

L(X) ∈ Pϑ;ϑ ∈ Θ .The problem is to estimate a real valued function g(ϑ) based on a realization of X. Up to now weprimarily considered the case where Θ is an interval in R and g(ϑ) = ϑ (i.e. we are interested inestimation of ϑ itself), but the Gaussian location-scale modelM3 was an instance of a model witha two-dimensional parameter ϑ = (µ, σ2). Here we might set g(ϑ) = µ.Consider an estimator T of g(ϑ) such that EϑT exists. In our i.i.d. Bernoulli model M1, that istrue for every estimator.

Definition 4.0.7 (i) The quantityEϑT − g(ϑ)

is called the bias of the estimator T .(ii) If

EϑT = g(ϑ) for all ϑ ∈ Θthen the estimator T is called unbiased for g(ϑ).

In model M1 we haveEpXn = p for all p ∈ [0, 1],

i.e. the sample mean is an unbiased estimator for the parameter p. Similarly, in the Gaussianlocation and location-scale models, the sample mean is unbiased for µ = X1.Unbiasedness is sometimes considered as a value in itself, i.e. a desirable property for an estimator,independently of the risk optimality. Recall that the risk function (for quadratic loss) for estimationof a real valued ϑ ∈ Θ was defined as

R(T, ϑ) = Eϑ (T (X)− ϑ)2 .

Suppose T is an estimator with R(T, ϑ) <∞. Then

R(T, ϑ) = Eϑ (T (X)−EϑT +EϑT − ϑ))2 (4.1)

= Eϑ (T (X)−EϑT )2 + (EϑT − ϑ)2

= VarϑT (X) + (EϑT − ϑ)2 . (4.2)

The last line is called the bias-variance decomposition of the quadratic risk of T ; it holds forany T with finite risk at ϑ (or equivalently with VarϑT (X) < ∞); T need not be unbiased. If T

Page 38: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

38 Unbiased estimators

is unbiased then the second term in (4.2), i.e. the squared bias vanishes; thus for unbiased T thequadratic risk is the variance.We saw that in model M1 the estimator

TM =Xn + n−1/2/2

1 + n−1/2

is minimax with respect to quadratic risk R(T, ϑ), and it is clearly biased. The unbiased estimatorXn performs less well in the minimax sense. It is thus a matter of choice how to judge theperformance: one may strictly adhere to the quadratic risk as a criterion, leaving the bias problemaside, or impose unbiasedness as an a priori requirement for ”good” estimators.If the latter point of view is taken, within the restrict the class of unbiased estimators it is oftenpossible to find uniformly best elements (optimal for all values of ϑ), as we shall see now.

4.1 The Cramer-Rao information bound

An information bound , relative to a statistical decision problem, is a bound for the best possi-ble performance of a decision procedure, either within all possible procedures or within a certainrestricted class of methods. We shall discuss a bound which relates to the class of unbiased esti-mators.Let us formalize a set of assumptions which was already made several times.

Model Mf The sample space X for the observed random variable X is finite, and L(X) ∈Pϑ, ϑ ∈ Θ, where Θ is an open (possibly infinite) interval in R.

The case Θ = R is included. Note that model M1 is a special case, if the open interval Θ = (0, 1)is taken as parameter space. With this model we associate a family of probability functions pϑ(x),x ∈ X , ϑ ∈ Θ. Let T be an unbiased estimator of the parameter:

EϑT (X) = ϑ, (4.3)

Since X is finite, the expectation always exists for all ϑ ∈ Θ. Let us add a differentiabilityassumption on the dependence of pϑ(x) on ϑ.

Assumption D1 For every x ∈ X , the probability function pϑ(x) is differentiable in ϑ, in everypoint ϑ ∈ Θ.

Now (4.3) can be written Xx∈X

T (x)pϑ(x) = ϑ

and both sides are differentiable in ϑ. Taking the derivative we obtainXx∈X

T (x)p0ϑ(x) = 1. (4.4)

where p0ϑ is the derivative wrt to ϑ. The first derivative of a probability function has an interestingproperty. We can also differentiate the equalityX

x∈Xpϑ(x) = 1

Page 39: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

The Cramer-Rao information bound 39

valid for any probability function pϑ(x). Differentiation givesXx∈X

p0ϑ(x) = 0. (4.5)

This means that in (4.4) we can replace T (x) by T (x) + c where c is any constant; indeed (4.5)implies that the right side of (4.4) is still 1. Choosing c = −ϑ, we obtainX

x∈X(T (x)− ϑ) p0ϑ(x) = 1. (4.6)

Define now the score function

lϑ(x) =

½p0ϑ(x)/pϑ(x) if pϑ(x) > 00 if pϑ(x) = 0.

Note that if pϑ(x) = 0 at ϑ = ϑ0 ∈ Θ then p0ϑ0(x) must also be 0: since pϑ(x) is nonnegative anddifferentiable, it has a local minimum at ϑ0 and hence p0ϑ0(x) = 0. This fact and (4.6) imply

1 =Xx∈X

(T (x)− ϑ) lϑ(x)pϑ(x) = Eϑ (T (X)− ϑ) lϑ(X) (4.7)

Let us apply now the Cauchy-Schwarz inequality : for any two random variables Z1, Z2 on acommon probability space

|EZ1Z2|2 ≤¡EZ21

¢ ¡Z22¢

provided the left side exists, i.e. EZ2i <∞, i = 1, 2. Squaring both sides of (4.7), we obtain

1 ≤ Eϑ (T (X)− ϑ)2 ·Eϑl2ϑ(X).

The quantityIF (ϑ) = Eϑl

2ϑ(X) (4.8)

is called the Fisher information of the parametric family pϑ, ϑ ∈ Θ of probability functions, atpoint ϑ ∈ Θ.

Theorem 4.1.1 (Cramer-Rao bound ) In model Mf , under smoothness assumption D1, as-sume that IF (ϑ) > 0, ϑ ∈ Θ. Then for every unbiased estimator T of ϑ

VarϑT (X) ≥ (IF (ϑ))−1, ϑ ∈ Θ (4.9)

where IF (ϑ) is the Fisher information (4.8).

Note that the score function lϑ(x) can also be written

lϑ(x) =

½ddϑ log pϑ(x) if pϑ(x) > 00 if pϑ(x) = 0

so that the Fisher information is

IF (ϑ) = Eϑl2ϑ(X) = Eϑ

µd

dϑlog pϑ(X)

¶2(4.10)

=X

x:pϑ(x)>0

¡p0ϑ(x)

¢2/pϑ(x).

Page 40: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

40 Unbiased estimators

The form (4.10) involving the logarithm of pϑ is convenient in many cases for computation of theFisher information.Note also that for the score function we have as a consequence of (4.5)

Eϑlϑ(X) = 0. (4.11)

The Cramer-Rao bound (4.9) gives a benchmark against which to measure the performance of anyunbiased estimator. An unbiased estimator attaining the Cramer-Rao bound at ϑ is called a bestunbiased estimator (or uniformly best if that is true for all ϑ ∈ Θ; ”uniformly” is usually omitted).Another terminology is uniformly minimum variance unbiased estimator (UMVUE).

Example 4.1.2 Consider the case of ModelM1 for sample size n = 1, i.e. we observe one Bernoullir.v. with law B(1, p), p ∈ (0, 1). The probability function is, for x ∈ 0, 1

qp(x) = (1− p)1−xpx.

For the score function we obtain

lp(x) =d

dplog(1− p)1−xpx = (1− x)

d

dplog(1− p) + x

d

dplog p

= −(1− x)1

(1− p)+ x

1

p

=x(1− p)− (1− x)p

p(1− p)=

x− p

p(1− p).

We check again (4.11) in this case from Ep(X − p) = 0. The Fisher information is thus

IF (p) = Epl2p(X) = (p(1− p))−2VarpX = (p(1− p))−1.

Thus p(1 − p) is the Cramer-Rao bound. It follows that X is a best unbiased estimator for allp ∈ (0, 1), i.e. uniform minimum variance unbiased. ¤

The next theorem specializes the situation to the case of independent and identically distributeddata.

Theorem 4.1.3 Suppose that X = (X1, . . . ,Xn) are independent and identically distributed wherethe statistical model for X1 is a model of typeMf , with probability function qϑ, satisfying assumptionD1. Assume that the Fisher information for qϑ satisfies IF (ϑ) > 0, ϑ ∈ Θ. Then for every unbiasedestimator T of ϑ

VarϑT (X) ≥ n−1(IF (ϑ))−1, ϑ ∈ Θ. (4.12)

where IF (ϑ) is the Fisher information for qϑ.

Proof. The model for the vectorX = (X1, . . . ,Xn) is also of typeMf , and the probability functionis pϑ(x) =

Qni=1 qϑ(xi). It suffices to find the Fisher information of pϑ, call it IF,n(ϑ). We have

IF,n(ϑ) = Eϑ

µd

dϑlog pϑ(X)

¶2= Eϑ

ÃnXi=1

d

dϑlog qϑ(Xi)

!2.

Page 41: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Countably infinite sample space 41

Expanding the square, we note for the mixed terms (i 6= j), when lϑ is the score function for qϑ

Eϑlϑ(Xi)lϑ(Xj) = Eϑlϑ(Xi) ·Eϑlϑ(Xj) = 0

in view of independence of Xi and Xj and (4.11). Hence

IF,n(ϑ) =nXi=1

Eϑl2ϑ(Xi) = n · IF (ϑ). (4.13)

Now Theorem 4.1.1 is valid for the whole model for X = (X1, . . . ,Xn) and implies (4.12).

Interpreting the Fisher information. To draw an analogy to concepts from physics or me-chanics, let t be a time parameter and and suppose x(t) = (xi(t))i=1,2,3 describes a movement inthree dimensional space R3 (a path in space, as a function of time). The velocity v(t0) at time t0is the length of the gradient, i.e.

v(t0) =°°x0(t)°° = Ã 3X

i=1

(x0i(t))2

!1/2. (4.14)

If the sample space X has k elements and all pϑ(xi) > 0

IF (ϑ) =kXi=1

¡ddϑpϑ(xi)

¢2pϑ(xi)

(4.15)

= 4kXi=1

µd

dϑp1/2ϑ (xi)

¶2. (4.16)

Thus IF (ϑ) is similar to a squared velocity if we identify ϑ with time, i.e. we consider the ”curve”

described in Rk by the vector³p1/2ϑ (x1), . . . , p

1/2ϑ (xk)

´when ϑ varies in an interval Θ. Note that

the vector qϑ :=³p1/2ϑ (x1), . . . , p

1/2ϑ (xk)

´has length one (kqϑk2 = 1) since pϑ(xi) sum up to one.

Thus the family (qϑ, ϑ ∈ Θ) describes a parametric curve on the surface of the unit sphere in Rk.The Fisher information can thus be interpreted as a ”sensitivity” of the location qϑ ∈ Rk withrespect to small changes in the parameter, at position ϑ.Note that in Theorem (4.1.1) we assumed that IF (ϑ) > 0. Consider the case where pϑ(x) doesnot depend on ϑ; formally then pϑ(·) is differentiable, and l2ϑ = 0 , IF (ϑ) = 0. Then Theorem(4.1.1) has to be interpreted as meaning that unbiased estimators T do not exist (indeed thereis insufficient information in the family to enable EϑT = ϑ; the left side does not depend on ϑ).The relation IF (ϑ) = 0 may also occur in other families or at particular points ϑ; then unbiasedestimators do not exist.

4.2 Countably infinite sample space

Let us now consider the case where the sample space X is still discrete but only countable (notnecessarily finite), e.g. the case where it consists of the nonnegative integers X = Z+.

Model Md The sample space X for the observed random variable X is countable, and L(X) ∈Pϑ, ϑ ∈ Θ, where Θ is an open (possibly infinite) interval in R.

Page 42: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

42 Unbiased estimators

The proof of the Cramer-Rao bound is very similar here, only we need to be sure that diffentiationis possible under the infinite sum sign. If X = x1, x2, . . . then we need to differentiate both sidesof the unbiasedness equation

∞Xi=1

T (xi)pϑ(xi) = ϑ (4.17)

If ϑ and ϑ0 = ϑ+∆ are two close values then we would take the ratio of differences on both sides

∞Xi=1

T (xi) (pϑ(xi)− pϑ+∆(xi))∆−1 = 1 (4.18)

and compute the derivative by letting ∆→ 0. Both sides are differentiable in ϑ. Here for every xi

(pϑ(xi)− pϑ+∆(xi))∆−1 → p0ϑ(xi).

hence if pϑ(xi) > 0 then

pϑ(xi)− pϑ+∆(xi)

∆pϑ(xi)→ p0ϑ(xi)

pϑ(xi)= lϑ(xi) as ∆→ 0.

Condition D2 (i) The probability function pϑ(x) is positive and differentiable in ϑ, at every x ∈ Xand every ϑ ∈ Θ(ii) For every ϑ ∈ Θ, there exist an ε = εϑ > 0 and a function bϑ(x) satisfying Eϑb

2ϑ(X) <∞

and ¯pϑ(x)− pϑ+∆(x)

∆pϑ(x)

¯≤ bϑ(x) for all |∆| ≤ ε and all x ∈ X .

Using that condition, we find for the right hand side of (4.18)

∞Xi=1

T (xi) (pϑ(xi)− pϑ+∆(xi))∆−1 =

∞Xi=1

µT (xi)

pϑ(xi)− pϑ+∆(xi)

∆pϑ(xi)

¶pϑ(xi)

=∞Xi=1

r∆(xi)pϑ(xi).

The sequence of functions for ∆→ 0

r∆(x) = T (xi)pϑ(x)− pϑ+∆(x)

∆pϑ(x)

converges to a limit function r0(x) = T (x)lϑ(x) pointwise (i.e. for every x ∈ X ). We would like toshow ∞X

i=1

r∆(xi)pϑ(xi)→∞Xi=1

r0(xi)pϑ(xi) = EϑT (X)lϑ(X) (4.19)

since in that case we could infer that EϑT (X)lϑ(X) = 1 (differentiate both sides of (4.17) underthe integral sign). By a result from real analysis, the Lebesgue dominated convergence theorem(see Appendix) it suffices to show that there is a function r(x) ≥ 0 and a δ > 0 such that

|r∆(x)| ≤ r(x) for all |∆| ≤ δ, and Eϑr(X) <∞.

Page 43: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Countably infinite sample space 43

To establish that, we use

|r∆(x)| =¯T (xi)

pϑ(x)− pϑ+∆(x)

∆pϑ(x)

¯≤ |T (xi)bϑ(x)| =: r(x)

and the Cauchy-Schwarz inequality

Eϑr(X) ≤¡EϑT

2(X)¢1/2 ¡

Eϑb2ϑ(X)

¢1/2<∞

according to condition D2 and the finiteness of EϑT2(X), which is natural to assume for the

estimator. T (x). Thus we have established

EϑT (X)lϑ(X) = 1.

. The same technique allows us to differentiate the relation

∞Xi=1

pϑ(xi) = 1

under the series sign, i.e. to obtain

∞Xi=1

p0ϑ(xi) = Eϑlϑ(X) = 0.

Theorem 4.2.1 (Cramer-Rao bound, discrete case) In model Md, assume that the Fisherinformation exists and is positive:

0 < IF (ϑ) = Eϑ

µp0ϑ(X)

pϑ(X)

¶2<∞.

for all ϑ ∈ Θ, and also that condition D2 holds. Then for for every unbiased estimator T of ϑ withfinite variance

VarϑT (X) ≥ (IF (ϑ))−1, ϑ ∈ Θ. (4.20)

Remark 4.2.2 Clearly an analog of Theorem 4.1.3 holds, we do not state it explicitly. It is evidentthat (4.13) still holds. One would have to verify that it suffices to impose Condition D2 only onthe law of X1, but we omit this argument.

Example. Let X have Poisson law Po(ϑ), where ϑ > 0 is unknown. Let us check condition D2

(here Θ = (0,∞)). We have for xk = k, k = 0, 1, . . .

pϑ(xk) =ϑk

k!exp(−ϑ)

This is continuously differentiable in ϑ, for every k ≥ 0, and

p0ϑ(xk) =1

k!

³kϑk−1 − ϑk

´exp(−ϑ) =

¡kϑ−1 − 1

¢ ϑkk!exp(−ϑ).

Page 44: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

44 Unbiased estimators

By the mean value theorem, for every ∆ and every k ≥ 0 there exists ∆∗(k) (lying in the intervalbetween 0 and ∆) such that

pϑ+∆(k)− pϑ(k) = ∆ · p0ϑ+∆∗(k)(k).

so that for |∆| ≤ ε, ε > 0 sufficiently small¯pϑ(k)− pϑ+∆(k)

∆pϑ(k)

¯=

¯¯p0ϑ+∆∗(k)(k)

pϑ(k)

¯¯

=

¯k

ϑ+∆∗(k)− 1¯ ¯ϑ+∆∗(k)

ϑ

¯kexp(−∆∗(k))

≤µ

k

ϑ− ε+ 1

¶³1 +

ε

ϑ

´kexp ε =: bϑ(k)

Let us denote by C0, C1, C2 etc. constants which do not depend on k (but may depend on ε andϑ). We have

k

ϑ− ε+ 1 ≤ C0 (k + 1) ≤ C1k,

b2ϑ(k) ≤ C1k2C2k2 ≤ C1 exp(2k) · C2k2 = C3 exp(C4k).

Thus for Eϑb2ϑ(X) <∞ it suffices to show that the Poisson law has finite exponential moments of

all order, i.e. for all c > 0 and all ϑ > 0

Eϑ exp (cX) <∞.

To see this, note

Eϑ exp (cX) =∞Xk=1

exp(ck)ϑk

k!exp(−ϑ) =

∞Xk=1

(ϑ exp c)k

k!exp(−ϑ)

= exp(ϑ exp c− ϑ).

¤

Page 45: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

The continuous case 45

4.3 The continuous case

Model Mc The observed random variable X = (X1, . . . ,Xk) is continuous with values in Rk andL(X) ∈ Pϑ, ϑ ∈ Θ. Each law Pϑ is described by a joint density pϑ(x) = pϑ(x1, . . . , xk), andΘ is an open subset of Rd.

The essential work for the Cramer-Rao bound was already done in the previous subsection; indeedthis time we need differentiation under the integral sign which is analogous to the infinite seriescase. The reasoning is very similar. As estimator T is called unbiased if Eϑ |T | <∞ and EϑT = ϑfor all ϑ ∈ Θ. We again start with the unbiasedness relationZ

T (x)pϑ(x)dx = ϑ

and we need to differentiate the left side under the integral sign. Let us assume that our parameterϑ is one dimensional, as in the previous subsections; multivariate versions of the Cramer-Rao boundcan be derived but need not interest us here.

Condition D3 (i) The parameter space Θ is an open (possibly infinite) interval in R. There is asubset S ⊆ Rk such that for all ϑ ∈ Θ, we have pϑ(x) > 0 for x ∈ S, pϑ(x) = 0 for x /∈ S.(ii) The density pϑ(x) is positive and differentiable in ϑ, at every x ∈ S and every ϑ ∈ Θ(iii) For every ϑ ∈ Θ, there exist an ε = εϑ > 0 and a function bϑ(x) satisfying Eϑb

2ϑ(X) <∞

and ¯pϑ(x)− pϑ+∆(x)

∆pϑ(x)

¯≤ bϑ(x) for all |∆| ≤ ε and all x ∈ X .

We can define a score function lϑ(x)

lϑ(x) = ddϑ log pϑ(x) if pϑ(x) > 00 if pϑ(x) = 0

and the Fisher information

IF (ϑ) = Eϑl2ϑ(X) =

ZS

¡p0ϑ(x)

¢2/pϑ(x)dx.

Theorem 4.3.1 (Cramer-Rao bound, continuous case) In Model Mc, assume that that thesmoothness condition D3 holds, and that the Fisher information exists and is positive: 0 < IF (ϑ) <∞ for all ϑ ∈ Θ. Then for for every unbiased estimator T of ϑ with finite variance

VarϑT (X) ≥ (IF (ϑ))−1, ϑ ∈ Θ. (4.21)

The proof is exactly as in the preceding countably infinite (discrete case), but the infinite seriesP∞i=1 T (xj)pϑ(xj) has to be substituted by an integral

RT (x)pϑ(x)dx. Remark 4.2.2 on the i.i.d.

case can be made here analogously.

Example 4.3.2 Consider the Gaussian location model M2 for sample size n = 1, i.e. we observea Gaussian r.v. with law N(ϑ, σ2) where σ2 > 0 is known, Θ = R. Let us verify condition D3. Forease of notation assume σ2 = 1. If ϕ is the standard Gaussian density then pϑ(x) = ϕ(x−ϑ), and

Page 46: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

46 Unbiased estimators

by a transformation y = x− ϑ it suffices to show the condition at parameter point ϑ = 0. Now thedensity

pϑ(x) = ϕ(x− ϑ)

is continuously differentiable in ϑ with derivative

p0ϑ(x) = −ϕ0(x− ϑ)

Since

ϕ(x) =1

(2π)1/2exp

µ−x

2

2

¶we have

ϕ0(x) = −xϕ(x).Now for |∆| ≤ ε

|ϕ(x−∆)− ϕ(x)| =¯Z 0

−∆ϕ(x+ u)dy

¯≤ ∆ sup

|u|≤ε

¯ϕ0(x− u)

¯.

Also note that

sup|u|≤ε

¯ϕ0(x− u)

¯≤ (|x|+ ε) sup

|u|≤εϕ(x− u),

ϕ(x− u) = (2π)−1/2 exp

µ−(x− u)2

2

¶= ϕ(x) exp

¡xu− u2/2

¢so that for |∆| ≤ ε, ε > 0 sufficiently small and ϑ = 0¯

pϑ(x)− pϑ+∆(x)

∆pϑ(x)

¯≤

¯sup|u|≤ε |ϕ0(x+ u)|

ϕ(x)

¯≤ (|x|+ ε)

¯¯ sup|u|≤ε

exp¡xu− u2/2

¢¯¯≤ (|x|+ ε) exp (|x|ε)

Since |x| ≤ C1 exp (|x|ε) for an appropriate constant C1 (depending on ε), we have¯pϑ(x)− pϑ+∆(x)

∆pϑ(x)

¯≤ C2 =: bϑ(x).

Now we need to show thatEϑb

2ϑ(x) = C22E0 exp (|X|2ε) <∞,

The right side clearly has a finite expectation under L(X) = N(0, 1). Indeed for all t > 0 we have

1

(2π)1/2

Zexp(t|x|) exp

µ−x

2

2

¶dx

≤ 21

(2π)1/2

Zexp(tx) exp

µ−x

2

2

¶dx

= 21

(2π)1/2

Zexp

µ−(x− t)2

2

¶dx · exp(t

2

2) = 2 exp(

t2

2).

. ¤

Page 47: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

The continuous case 47

Proposition 4.3.3 In the Gaussian location model M2, the Fisher information w.r.t. the expec-tation parameter µ ∈ R is n/σ2, and the sample mean Xn is is a UMVUE of µ.

Proof. We haveV arµXn = σ2/n

so we need only find the Fisher information. Condition D3 is easily verified, analogously to thecase n = 1: in (3.4) we found an expression for the joint density

nYi=1

pµ(xi) =1

σn−1/2ϕ

µxn − µ

σn−1/2

¶· 1

n1/2(2πσ2)(n−1)/2exp

µ− s2n2σ2n−1

¶. (4.22)

where s2n = n−1Pn

i=1(xi − xn)2 is the sample variance. The second factor does not depend on µ

(σ2 is fixed), and therefore cancels in the term (pµ(x)− pµ+∆(x))/pµ(x) in condition D3. The firstfactor is the density (as a function of xn) of the law N(µ, n−1σ2); thus reasoning as in the aboveexample (4.3.2), we finally have to show that

E0 exp¡2|xn|nσ−2ε

¢<∞

which follows as above. To find the Fisher information, we can use the factorization (4.22) again.Indeed

IF,n(µ) = Eµ

Ãd

dµlog

nYi=1

pµ(xi)

!2

= Eµ

µd

dµlog

1

σn−1/2ϕ

µxn − µ

σn−1/2

¶¶2so here the Fisher information is the same as if we observed only xn with law N(µ, n−1σ2), i.e. ina Gaussian location model with variance γ2 = n−1σ2. The score function in such a model is

lµ(x) =d

dµlogϕ

µx− µ

σn−1/2

¶=

d

µ−(x− µ)2

2γ2

¶= γ−2(x− µ),

andIF (µ) = γ−4E(x− µ)2 = γ−2 = nσ−2.

We will now discuss an example where the Cramer-Rao bound does not hold; cf. Casella andBerger [CB], p. 312, Example 7.3.5. Suppose that X1, . . . ,Xn are i.i.d with uniform density on[0, ϑ]. That means the density is

pϑ(x) =1

ϑ, 0 ≤ x ≤ ϑ.

We might try to formally apply the Cramer-Rao bound, neglecting the differentiability conditionfor a moment. For one observation, the score function is

lϑ(x) =∂

∂ϑlog pϑ(x) =

− 1ϑ , 0 < x < ϑ0, x > ϑ

,

Page 48: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

48 Unbiased estimators

hence (formally)

IF (ϑ) = Eϑl2ϑ(x) =

1

ϑ2

which suggests that for any unbiased estimator based on n observations

VarϑT (X) ≥ϑ2

n. (4.23)

However an estimator can be found which is better. Consider the maximum, i.e the largest orderstatistic

X[n] = maxi=1,...,n

Xi

. The density of X[n] (qϑ, say) can be found as follows: for t ≤ ϑ

Pϑ¡X[n] ≤ t

¢= Pϑ (X1 ≤ t, . . . ,Xn ≤ t)

=nYi=1

Pϑ(Xi ≤ t) = (t/ϑ)n ,

and Pϑ¡X[n] ≤ t

¢= 1 for t > ϑ, hence

qϑ(t) =d

dt(t/ϑ)n = ntn−1/ϑn, 0 ≤ t ≤ ϑ

and qϑ(t) = 0 for t > ϑ. For the expectation of X[n] we get

EϑX[n] = ϑ−nn

Z ϑ

0y · yn−1dy = ϑ−nn

∙1

n+ 1yn+1

¸ϑ0

= ϑn

n+ 1.

Remark 4.3.4 Assume that ϑ = 1, i.e. we have the uniform distribution on [0, 1]. Then

ϑ−Eϑ maxi=1,...,n

Xi =1

n+ 1

which can be interpreted as follows: when n points are randomly thrown into [0, 1] (independently,with uniform law) then the largest of these value tends to be 1

n+1 away from the right boundary.The same is true by symmetry for the smallest value and the left boundary. ¤

The estimator X[n] is thus biased, but the bias can easily be corrected: the estimator

Tc(X) =n+ 1

nX[n]

(which moves X[n] up towards the interval boundary) is unbiased. To find its variance we note

EϑX2[n] = ϑ−nn

Z ϑ

0y2 · yn−1dy = ϑ−nn

∙1

n+ 2yn+2

¸ϑ0

= ϑ2n

n+ 2,

Page 49: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

The continuous case 49

Varϑ (Tc(X)) =

µn+ 1

n

¶2 ³VarϑX

2[n]

´= ϑ2

µn+ 1

n

¶2Ã n

n+ 2−µ

n

n+ 1

¶2!

= ϑ2µ(n+ 1)2

n(n+ 2)− 1¶

=ϑ2

n(n+ 2)(4.24)

which is smaller than the bound ϑ2/n suggested by (4.23). In fact the variance (4.24) decreasesmuch faster with n than the bound (4.23), by an additional factor 1/(n+ 2).Theorem 4.3.1 is in fact not applicable since the conditions are violated (the support depends on ϑ,and the density is not differentiable everywhere). That suggests that the statistical model whereϑ describes the support of the density [0, ϑ] is more informative than the ”regular” cases wherethe Cramer-Rao bound applies. Indeed if X[n] = y then all values ϑ < y can be excluded withcertainty, and certainty is a lot of information in a statistical sense.

Page 50: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

50 Unbiased estimators

Page 51: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Chapter 5CONDITIONAL AND POSTERIOR DISTRIBUTIONS

5.1 The mixed discrete / continuous model

In section 2.2 we introduced the integrated risk with respect to a prior density g; it was mentionedalready that in the Bayesian approach, the parametric family Pϑ, ϑ ∈ Θ for the observations X isunderstood as a conditional distribution given ϑ. In model M1, the data X were discrete and theassumed prior distribution for p was continuous (the Beta densities.) It is clear that this shouldgive rise to a joint distribution of observation and parameter. Most basic probability coursestreat joint distributions of two r.v.’s (X,U) only for the case that the joint law is either discreteor continuous (hence X and U are of the same type). Let us fill in some technical details here forthis mixed case.When X and U are both discrete or both continuous then a joint distribution is given by a jointprobability function q(x, u), or a joint density q(x, u) respectively. In our mixed case where Θ isan interval in R and our sample space X for X is finite we can expect that a joint distribution isgiven by q(x, u) which is of mixed type: q(x, u) ≥ 0 for all x ∈ X , u ∈ Θ and

Xx∈X

ZΘq(x, u)du = 1. (5.1)

Such a function defines a joint distribution Q of X and U : for any subset A of X and any intervalB in Θ

Q(X ∈ A,U ∈ B) = Q(A×B) =Xx∈A

ZBq(x, u)du.

It is then possible to extend the class of sets where the probabilities Q are defined: consider setsC ∈ S of form

C =[x∈A

x ×Bx

where A is an arbitrary subset of X and for each x ∈ A, Bx is a finite union of intervals in Θ.Define

Q(C) =Xx∈A

ZBx

q(x, u)du (5.2)

Let S0 be the collection of all such sets C as above; the function Q on sets C ∈ S0 fulfills all theaxioms of a probability.Suppose now that Pϑ, ϑ ∈ Θ is a parametric family of distributions on X and g(ϑ) is a density onΘ. Setting

q(x, ϑ) = Pϑ(x)g(ϑ)

Page 52: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

52 Conditional and posterior distributions

we have an object q as described above: it is nonnegative and (5.1) is fulifilled:

Xx∈X

ZΘPϑ(x)g(ϑ)dϑ =

ÃXx∈X

Pϑ(x)

!g(ϑ)dϑ

=

ZΘg(ϑ)dϑ = 1

since g is a density. The joint distribution of X and U thus defined gives rise to marginal andconditional probabilities, for x ∈ X

P (X = x) = P (X = x,U ∈ Θ) =ZΘPϑ(x)g(ϑ)dϑ, (5.3)

P (U ∈ B|X = x) =P (U ∈ B,X = x)

P (X = x)(5.4)

=

RB Pϑ(x)g(ϑ)dϑRΘ Pϑ(x)g(ϑ)dϑ

.

If Pϑ(x) > 0 for all x and ϑ (which is the case in M1 if Θ = (0, 1)) then also P (X = x) > 0, andthe conditional probability (5.4) is well defined. Then it is immediate (and shown in probabilitycourses) that the function

Qx(B) = P (U ∈ B|X = x)

fulfills all the axioms of a probability on Θ; it defines the conditional law of U given X = x. In thestatistical context this is called the posterior distribution of ϑ given X = x.It is obvious that this distribution has a density on Θ = (a, b): for B = (a, t] we have

Qx((a, t]) =

Z t

aPϑ(x)g(ϑ)

µZΘPu(x)g(u)du

¶−1dϑ

which identifies the density of Qx as

qx(ϑ) = Pϑ(x)g(ϑ)

µZΘPu(x)g(u)du

¶−1(5.5)

This density is called the conditional density , or in the context of Bayesian statistics, theposterior density of ϑ given X = x. The formula (5.5) is very simple: for given prior densityg(ϑ), adjoin the probability function Pϑ(x) (when X = x is observed) and renormalize Pϑ(x)g(ϑ)such that it integrates to one w.r.t. ϑ (i.e. becomes a density).

Remark 5.1.1 The formula (5.5) suggests an analog for the case that X is a continuous r.v. aswell, with density pϑ(x) say. In this case the formula (5.4) cannnot be used to define a conditionaldistribution since all events X = x have probability 0 for the laws Pϑ(x) and thus also for themarginal law of X. For continuous r.v.’s X the conditional (posterior) density qx(ϑ) is directlydefined by replacing the probability Pϑ(x) in (5.5) by the density pϑ(x); cf. [D], sect. 3.8 and ourdiscussion to follow in later sections.

Exercise. To prepare for the purely continuous case, let us see what happens when we reverse theroles of ϑ and X in (5.5), i.e. we take the marginal probability function for X given by (5.3) and

Page 53: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Bayesian inference 53

combine it with the conditional density for ϑ given by (5.4). Consider the expression qx(ϑ)P (X = x)and, analogously to (5.5), divide it by its sum over all possible values of x (x ∈ X ). Call the resultqϑ(x). Show that for any ϑ with g(ϑ) > 0, the relation

qϑ(x) = Pϑ(x), x ∈ X

holds. This result justifies to call Pϑ(x) a conditional probability function under U = ϑ:

Pϑ(x) = P (X = x|U = ϑ) ,

even though U is continuous and the event U = ϑ has probability 0.

Remark 5.1.2 Consider the uniform prior density on Θ: g(ϑ) = (b− a)−1. Then

qx(ϑ) = Lx(ϑ)

µZΘLx(u)du

¶−1i.e. the posterior density is proportional as a function of ϑ to the likelihood function Lx(ϑ) (thenormalizing factor does not depend on ϑ).

5.2 Bayesian inference

Computation of a posterior distribution and of a Bayes estimator (which is associated, as we shallsee) can be subsumed under the term Bayesian inference . Let us compute the posterior densityin our case of model M1 and prior densities of the Beta family. We have

Pp(x)g(p) = pz(x)(1− p)n−z(x) · pα−1(1− p)β−1

B(α, β)

=pz(x)+α−1(1− p)n−z(x)+β−1

B(α, β).

The posterior density is proportional to the Beta density gα+z(x),β+n−z(x), and hence must coincidewith this density:

qx(p) = gα+z(x),β+n−z(x)(p).

We see that if the prior density is in the Beta class, then the posterior density is also in this class,for any observed data x. Such a family is called a conjugate family of prior distributionsThe next subsection presents a more technical discussion of the Beta family. Using the formulagiven below for the expectation, we find the expected value of the posterior distribution as

E(U |X = x) =α+ z(x)

α+ z(x) + β + n− z(x)(5.6)

=α+ z(x)

α+ β + n=

Xn + α/n

1 + α/n+ β/n= Tα,β(X),

i.e. it coincides with the Bayes estimator for the prior density gα,β already found in section 2.4.This is no coincidence, as the next proposition shows. Notationally, the separate symbol U for ϑas a random variable is only needed for expressions like P (X = x|U = ϑ); we suppress U and setU = ϑ wherever possible. Recall that E(U |X = x) is a general notation for conditional expectation([D] sec. 4.6), i.e the expectation of a conditional law. In our statistical context E(ϑ|X = x) iscalled the posterior expectation .

Page 54: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

54 Conditional and posterior distributions

Proposition 5.2.1 In the statistical model Mf (the sample space X for X is finite and Θ is anopen interval in R), let Θ be a finite interval and g be a prior density on Θ. Then, for a quadraticloss function a Bayes estimator TB of ϑ is given by the posterior expectation

TB(x) = E(ϑ|X = x), x ∈ X .

Proof. Note that for a r.v. U = ϑ taking values in a finite interval, the expectation and all highermoments always exists, so both for prior and posterior distributions the expectation exists. Theintegrated risk for any estimator is (it was defined previously for the special case of Model M1)

B(T ) =

ZΘR(T, ϑ)g(ϑ)dϑ

=

ZΘEϑ (T − ϑ)2 g(ϑ)dϑ.

Obviously this can be written

B(T ) =

Xx

(T (x)− ϑ)2 Pϑ(x)g(ϑ)dϑ

=Xx

ZΘ(T (x)− ϑ)2 Pϑ(x)g(ϑ)dϑ.

From (5.5) we havePϑ(x)g(ϑ) = qx(ϑ)P (X = x),

where P (X = x) is the marginal probability function of X. Hence

B(T ) =Xx

µZΘ(T (x)− ϑ)2 qx(ϑ)dϑ

¶P (X = x). (5.7)

Let q(ϑ) be an arbitrary density in ϑ, Eq(·), be expectation under q and a be a constant (a doesnot depend on ϑ). Then we claim

Eq (a− ϑ)2 ≥ Eq (Eqϑ− ϑ)2 = Varqϑ (5.8)

with equality if and only if a = ϑ. (Note that Eq (a− ϑ)2 is always finite in our model). Indeed

Eq (a− ϑ)2 = Eq (a−Eqϑ− (ϑ−Eqϑ))2

= (a−Eqϑ)2 +Varqϑ

in view of Eq(ϑ − Eqϑ) = 0, which proves (5.8) and our claim. Now apply this result to theexpression in round brackets under the sum sign in (5.7) and obtain that for any given x, T (x) =Eqxϑ = E(ϑ|X = x) = TB(x) is the unique minimizer. Hence

B(T ) ≥ZΘEϑ (TB(X)− ϑ)2 g(ϑ)dϑ

= E (TB(X)− ϑ)2 .

where the last expectation refers to the joint distribution X and ϑ.In the last display we wrote TB(x) = E(ϑ|X = x) as a random variable when the conditioning x isseen as random. The common notation for this random variable is E(ϑ|X).

Page 55: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

The Beta densities 55

5.3 The Beta densities

Let us discuss more carefully the Beta class of densities which were used as prior distributions inmodel Md,1. Consider the following family of prior densities for the Bernoulli parameter p: forα, β > 0

gα,β(x) =xα−1(1− x)β−1

B(α, β), x ∈ [0, 1].

where B(α, β) is the beta function

B(α, β) =Γ(α)Γ(β)

Γ(α+ β). (5.9)

Recall that the Gamma function is defined (for α > 0) as

Γ(α) =

Z ∞

0xα−1 exp(−x)dx

and satisfiesΓ(α+ 1) = αΓ(α). (5.10)

It was already argued that for α, β > 0, the function x 7→ xα−1(1− x)β−1 is integrable on [0, 1] (cf.relation (2.15). Relation (5.9) is proved below. The moments are (k integer)

mk : =

Z 1

0xkgα,β(x)dx

=Γ(α+ β)

Γ(α)Γ(β)

Z 1

0xk+α−1(1− x)β−1dx

=Γ(α+ β)

Γ(α)Γ(β)· Γ(k + α)Γ(β)

Γ(k + α+ β).

Invoking (5.10), we find

mk =(k − 1 + α)(k − 2 + α) . . . (α)

(k − 1 + α+ β)(k − 2 + α+ β) . . . (α+ β).

especially (for a r.v. U with beta density gα,β)

EU = m1 =α

α+ β, (5.11)

EU2 = m2 =(α+ 1)α

(1 + α+ β)(α+ β),

Var(U) = EU2 − (EU)2 = (α+ 1)(α+ β)α− (1 + α+ β)α2

(1 + α+ β)(α+ β)2

=αβ

(1 + α+ β)(α+ β)2.

Thus α = β implies EU = 1/2; in particular the prior distribution for which the Bayes estimatoris minimax (α = β = n1/2/2, cf. Theorem 2.5.2) has expected value 1/2.

Page 56: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

56 Conditional and posterior distributions

We now show that

B(α, β) =

Z 1

0uα−1 (1− u)β−1 du =

Γ(α)Γ(β)

Γ(α+ β)(5.12)

Consider the gamma distribution for parameter α: it has density for x > 0

fα(x) =1

Γ(α)xα−1 exp(−x).

Consider independent r.v.s X, Y with gamma laws for parameters α, β, and their joint density

gα,β(x, y) = fα(x)fβ(y) =1

Γ(α)Γ(β)xα−1yβ−1 exp(−(x+ y)).

Consider the change of variables x = rt, y = r(1 − t) for 0 < r < ∞, 0 ≤ t ≤ 1. (Indeed every(x, y) with x > 0, y > 0 can be uniquely represented in this form: r = x+ y, t = x/(x+ y)). TheJacobian matrix is µ

∂∂trt

∂∂rrt

∂∂tr(1− t) ∂

∂rr(1− t)

¶=

µr t−r (1− t)

¶with determinant (1− t)r + rt = r. The new density in variables t, r is

g∗α,β(t, r) =1

Γ(α)Γ(β)(rt)α−1(r(1− t))β−1r exp(−r)

=1

Γ(α)Γ(β)rα+β−1 exp(−r)tα−1(1− t)β−1.

When we integrate over r, the result is the marginal density of t (call this fα,β(t)) and hence mustintegrate to one. We find from the definition of the Gamma function

fα,β(t) =

Z ∞

0g∗α,β(t, r)dr =

Γ(α+ β)

Γ(α)Γ(β)tα−1(1− t)β−1.

This is the density of the Beta law; since this density integrates to one, we proved (5.12).

5.4 Conditional densities in continuous models

Consider a two dimensional random variable (X,Y ) with joint density p(x, y). Consider an intervalBh = (x−h/2, x+h/2) and assume P (X ∈ Bh) > 0. The conditional probability P (Y ∈ A|X ∈ Bh)is well defined:

P (Y ∈ A|X ∈ Bh) =P (Y ∈ A,X ∈ Bh)

P (X ∈ Bh)(5.13)

It is clear that P (A|X ∈ Bh) considered as a function of A, is a probability distribution, i.e. theconditional distribution of Y given X ∈ Bh. This distribution has a density: for A = Y ≤ y weobtain

P (Y ≤ y|X ∈ Bh) =

Z y

−∞

ZBh

p(x, y)dxdy (P (X ∈ Bh))−1 (5.14)

which shows that the density is

p (y|X ∈ Bh) =

ZBh

p(x, y)dx (P (X ∈ Bh))−1 .

Page 57: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Conditional densities in continuous models 57

When X is continuous, it is desirable to define conditional distributions under an event X = x.Indeed, when X is realized, conditioning on an interval Bh which contains x does not correspond tothe actual information we have. We might be tempted to condition on smaller and smaller intervals,all containing the realized value x, and each of these conditions would reflect more accurately theinformation we have, yet none of them would correspond to the actual information X = x.It is natural to try to pass to a limit here as h→ 0. Let pX(x) be the marginal density of X:

pX(x) =

Zp(x, y)dy.

Without trying to be rigorous in this limit argument, for Bh = (x− h/2, x+ h/2), h→ 0Z(x−h/2,x+h/2)

p(x, y)dx ≈ hp(x, y),

P (X ∈ Bh) =

Z(x−h/2,x+h/2)

pX(x)dx ≈ hpX(x)

Based on this heuristic, we introduce the conditional density given X = x by definition: ifpX(x) > 0 then

p (y|x) = p(x, y)

pX(x). (5.15)

The figure is meant to illustrate the idea of the of p (y|x) as giving the ”relative weight ” given todifferent y’s by the joint density, once the other argument x is fixed.The expression p (y|x) is defined only in case pX(x) > 0. In that case however it clearly is a density,since it is nonnegative and, as a function of y, it integrates to one. Then, for such x for whichpX(x) > 0, we define the conditional distribution of Y given X = x: for any A

P (Y ∈ A|X = x) =

ZAp (y|x) dy.

Note that formula (5.15) is analogous to the conditional probability function in the discrete case:if X and Y can take only finitely many values x, y and

p(x, y) = P (Y = y,X = x), pX(x) = P (X = x)

are the joint and marginal probability functions then if pX(x) > 0, by the classical definition ofconditional probability

p (Y = y|X = x) =P (Y = y,X = x)

P (X = x)=

p(x, y)

pX(x).

which looks exactly like (5.15), but all the terms involved are probability functions, not densities.To state a connection between conditional densities and independence, we need a slightly moreprecise definition. First note that densities are not unique; they can be modified in certain pointsor subsets without affecting the corresponding probability measure.

Definition 5.4.1 Let Z be a random variable with values in Rk and density q on Rk. A version ofq is a density q on Rk such that for all sets A ⊂ Rk for which the k-dimensional volume (measure)is defined, Z

Aq(z)dz =

ZAq(z)dz

Page 58: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

58 Conditional and posterior distributions

Figure 1 Conditional density of Y given X = x. The altitude lines symbolize the joint density of Xand Y . On the dark strip, in the limit when h tends to 0, this gives the conditional density of Y givenX = x

In particular, if A = A1∪A2 and A2 has volume 0 then the density q can be arbitrarily modified onA2. On the real line, A2 might consists of a countable number of points; on Rk, A2 might consistsof hyperplanes, smooth surfaces etc.

Definition 5.4.2 Suppose that Z = (Z1, . . . , Zk) is a continuous random variable with values inRk and with joint density p(z) = p(z1, . . . , zk). Set X = Z1, Y = (Z2, . . . , Zk) and p(x, y) = p(z).A version of the conditional density of Y given X = x is any function p (y|x) with properties:(i) For all x, p (y|x) is a density in y, i.e.

p (y|x) ≥ 0,Z

p (y|x) dy = 1

(ii) There is a version of p(x, y) such that for pX(x) =Rp(x, y)dy we have

p(x, y) = p (y|x) pX(x) (5.16)

Lemma 5.4.3 A conditional density as defined above exists.

Proof. If x is such that pX(x) > 0 then we can set

p (y|x) = p(x, y)

pX(x). (5.17)

Clearly this is a density in y sinceZp(x, y)

pX(x)dy =

ZpX(x)

pX(x)dy = 1.

Page 59: A First Course in Mathematical Statisticsweb472/spring04/stat04.pdf · A First Course in Mathematical Statistics ... [TD] Tamhane, A and Dunlop, D., Statistics and Data Analysis:

Conditional densities in continuous models 59

Let A = x ∈ R : pX(x) = 0. Clearly P (X ∈ A) = 0, and hence for

A0 = (x, y) : x ∈ A

we obtainP ((X,Y ) ∈ A0) = P (X ∈ A) = 0.

Now modify p(x, y) on A0, to obtain another version p, namely set

p(x, y) = 0 for (x, y) ∈ A0.

Clearly p is a version of p: for any event BZBp(z)dz = P (Z ∈ B) = P (Z ∈ B ∩A0)

=

ZB∩A0

p(z)dz =

ZB∩A0

p(z)dz =

ZBp(z)dz.

For this version p(x, y) we have: pX(x) = 0 implies p(x, y) = 0 and hence for such x, p (y|x) canbe chosen as an arbitrary density to fulfill (5.16).

Proposition 5.4.4 (X,Y ) with joint density p(x, y) are independent if and only if there is a ver-sion of p (y|x) which does not depend on x.

Proof. If X and Y are independent then p(x, y) = pX(x)pY(y). Thus pY(y) = p(y|x) is such a version. Conversely, if p(y|x) is such a version then

pY(y) = ∫ p(x, y) dx = ∫ p(y|x) pX(x) dx = p(y|·) ∫ pX(x) dx = p(y|·),

hence

p(x, y) = pY(y) pX(x),

which implies that (X, Y) are independent.
In case all occurring joint and marginal densities are positive, there is really no need to consider modifications; the conditional densities can just be taken as (5.17).
Let now f(Y) be any function of the random variable Y. The conditional expectation of f(Y) given X = x is

E(f(Y)|X = x) = ∫ f(y) p(y|x) dy,

i.e. the expectation with respect to the conditional distribution of Y given X = x, given by the conditional density p(y|x). Note that this conditional expectation depends on x, i.e. it is a function of x, the realization of the random variable X. Sometimes it is useful to "keep in mind" the original random nature of this realization, i.e. to consider the conditional expectation as a function of the random variable X. The common notation for this random variable is

E(f(Y)|X);


it is a random variable which is a function of X. For the expectation we then have

E E(f(Y)|X) = E ∫ f(y) p(y|X) dy = ∫ ( ∫ f(y) p(y|x) dy ) pX(x) dx
            = ∫ f(y) p(x, y) d(x, y) = ∫ f(y) pY(y) dy = E f(Y),

i.e. the expectation of the conditional expectation yields the expectation. These notions are the same in the case of discrete random variables; only in the continuous case we had to pay attention to the fact that P(X = x) = 0.
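The identity E E(f(Y)|X) = E f(Y) can also be checked numerically. The following Python sketch is not part of the handout; numpy, the correlation value ρ = 0.6 and the choice f(y) = y² are illustrative assumptions. For a standard bivariate normal pair with correlation ρ one has E(Y²|X = x) = ρ²x² + (1 − ρ²) in closed form, so the average of the conditional expectation should match E Y² = 1.

import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.6, 200_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # Y | X=x ~ N(rho*x, 1-rho^2)

cond_exp = rho**2 * x**2 + (1 - rho**2)   # E(Y^2 | X) evaluated at the simulated X
print(cond_exp.mean())                    # approx E[E(Y^2|X)]
print((y**2).mean())                      # approx E[Y^2] = 1

Both printed values should be close to 1, in line with the display above.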

5.5 Bayesian inference in the Gaussian location model

Suppose that we observe a continuous random variable X = (X1, . . . , Xn) with unknown density pϑ(x), x ∈ R^n, ϑ ∈ Θ, where Θ is an open subset of R^d (this was called model Mc before). In a Bayesian statistical approach, assume that ϑ ”becomes random” as well. Suppose we have a prior density π(ϑ) on the parameter space Θ and wish to build a joint density of X and ϑ from this.
We need some regularity conditions on functions and sets. A set A ⊆ R^k is called measurable if its k-dimensional volume is defined (the volume may be 0 or ∞). One also defines measurable functions on R^k; we do not give the definition here but remark only that measurability is necessary for integrals ∫ f(x) dx to be defined (thus densities must be measurable). All continuous functions are measurable, and also functions which are continuous except on a set of volume 0. In this course, we need not dwell on these technicalities (measure theory); we assume that all occurring sets and functions are measurable.

Proposition 5.5.1 Suppose that in model Mc a density π(ϑ) on Θ is given such that π(ϑ) > 0, ϑ ∈ Θ (and that pϑ(x) is jointly measurable as a function of (x, ϑ)). Then the function

p(x, ϑ) = pϑ(x)π(ϑ)   (5.18)

is a density on R^n × Θ. When this is construed as a joint density of (X, ϑ), then pϑ(x) is a version of the conditional density p(x|ϑ).

Proof. The function p(x, ϑ) is nonnegative; we have

∫_Θ ∫_{R^n} p(x, ϑ) dx dϑ = ∫_Θ ( ∫_{R^n} pϑ(x) dx ) π(ϑ) dϑ = ∫_Θ π(ϑ) dϑ = 1,

thus p(x, ϑ) is a density. Then π(ϑ) is the marginal density of ϑ, derived from the joint density: indeed

∫ p(x, ϑ) dx = ∫ pϑ(x)π(ϑ) dx = π(ϑ).

Then we immediately see from (5.18) that pϑ(x) is a version of the conditional density p(x|ϑ).
This justifies the notation that for a parametric family of densities, one writes interchangeably pϑ(x) or p(x|ϑ). Then p(ϑ|x) is again called the posterior density. If Θ is an interval in R then the conditional expectation

E(ϑ|X = x) = ∫ ϑ p(ϑ|x) dϑ,


if it exists, is again called the posterior expectation. Let us discuss the case of the Gaussian location model, first for sample size n = 1. We can represent X = X1 as

X = µ + ξ

where ξ is centered normal: L(ξ) = N(0, σ²). The parameter ϑ is µ and the parameter space Θ is R. Assume that µ becomes random as well: L(µ) = N(m, τ²), where m and τ² are known (these are sometimes called hyperparameters). For Bayesian statistical inference, we wish to compute the conditional distribution of µ given X, i.e. the posterior distribution of the parameter. The joint density of X and µ is

p(x, µ) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²)) · (1/(√(2π)τ)) exp(−(µ − m)²/(2τ²))
        = (1/(2πστ)) exp(−(x − µ)²/(2σ²) − (µ − m)²/(2τ²)).

To find the marginal density of X, we define ξ = X − µ and compute the joint distribution of ξ and µ. First find the conditional law L(ξ|µ), which is the law of X − µ when µ is fixed. This is N(0, σ²), and since it does not depend on µ, ξ and µ are independent in their joint law (Proposition 5.4.4). In other words, we have

X = µ + ξ

where µ, ξ are independent with laws N(m, τ²), N(0, σ²) respectively. By the properties of the normal distribution, we conclude that X has marginal law L(X) = N(m, σ² + τ²), with density (ϕ is the standard Gaussian density)

pX(x) = (1/(σ² + τ²)^{1/2}) ϕ((x − m)/(σ² + τ²)^{1/2}) = (1/((2π)^{1/2}(σ² + τ²)^{1/2})) exp(−(x − m)²/(2(σ² + τ²)))

and the posterior density is

p(µ|x) = p(x, µ)/pX(x)
       = ((σ² + τ²)^{1/2}/((2π)^{1/2}στ)) exp(−(x − µ)²/(2σ²) − (µ − m)²/(2τ²) + (x − m)²/(2(σ² + τ²))).

We conjecture that this is a Gaussian density (if X and µ were independent then p(µ|x) would equal the marginal density of µ, which certainly is Gaussian). We shall establish that p(µ|x) is indeed Gaussian with variance

σ²τ²/(σ² + τ²) := ρ².   (5.19)

Note that for µ* = µ − m, x* = x − m,

−(x − µ)²/(2σ²) − (µ − m)²/(2τ²) + (x − m)²/(2(σ² + τ²)) = −(1/(2ρ²)) (µ* − (τ²/(σ² + τ²)) x*)²,

and setting

β = τ²/(σ² + τ²),   (5.20)


we obtain

p(µ|x) = (1/ρ) ϕ((µ − m − β(x − m))/ρ).

Proposition 5.5.2 In the Gaussian location model M2 for sample size n = 1 and a normal prior distribution L(µ) = N(m, τ²), the posterior distribution is

L(µ|X = x) = N(m + β(x − m), ρ²)

where ρ and β are defined by (5.19), (5.20). The normal family {N(m, τ²), m ∈ R, τ² > 0} is a conjugate family of prior distributions.

Let us interpret this result. The posterior distribution of µ has expected value m + β(x − m); note that 0 < β < 1. The prior distribution of µ had expectation m, so the posterior expectation of µ intuitively represents a ”compromise” between the prior belief and the empirical evidence about µ, i.e. X = (µ + ξ) = x. Indeed

E(µ|X = x) = m + β(x − m) = m(1 − β) + βx,

so E(µ|X = x) is a convex (linear) combination of the data x and the prior mean m (it always lies between these two points). In other words, the data x are ”shrunken” towards m when x − m is multiplied by β. A similar shrinkage effect was observed for the Bayes estimator in the Bernoulli model M1 when the prior mean was 1/2 (recall that the minimax estimator there was Bayes for a Beta prior with α = β, which has mean 1/2). Moreover

ρ² = σ²τ²/(σ² + τ²) < min(τ², σ²).

The posterior variance ρ² is thus smaller than both the prior variance and the variance of the data given ϑ (i.e. σ²). It is seen that the information in the data and in the prior distribution is ”added up” to give a smaller variability (a posteriori) than is contained in either of the two sources. In fact

1/ρ² = 1/σ² + 1/τ².

The inverse variance of a distribution can be seen as a measure of concentration, sometimes called precision. Then the precision of the posterior is the sum of the precisions of the data and of the prior distribution.
The posterior expectation again can be shown to be a Bayes estimator for quadratic loss. Before establishing this, let us discuss some limiting cases.

Case 1 Variance τ² of the prior density is large: τ² → ∞. In this case

β = τ²/(σ² + τ²) → 1,   ρ² → σ².   (5.21)

A large prior variance means that the prior density is more and more spread out, i.e. more ”diffuse” or noninformative. β → 1 then means that the prior information counts less and less in comparison to the data, and the posterior expectation of µ tends to x. This means that in the limit the only evidence we have about µ is the realized value X = x.


Case 2 Variance τ² of the prior density is small: τ² → 0. In this case

β = τ²/(σ² + τ²) → 0,   ρ² → 0.   (5.22)

A small prior variance means that the prior density is more and more concentrated around m. Then (5.22) means that the ”belief” that µ is near m becomes overwhelmingly strong, and forces the posterior distribution to concentrate around m as well.

Case 3 Variance σ² of the data (given µ) is large: σ² → ∞. In this case

β = τ²/(σ² + τ²) → 0,   ρ² → τ².

The posterior density tends to the prior density, since the quality of the information X = x becomes inferior (large variance σ²).

Case 4 Variance σ² of the data (given µ) is small: σ² → 0. We expect the prior distribution to matter less and less, since the data are more and more reliable. Indeed

β = τ²/(σ² + τ²) → 1,   ρ² → 0,

which is similar to (5.21) for Case 1.

In the case of prior mean m = 0, the quantity

r = τ²/σ²

is often called the signal-to-noise ratio. Recall that in the Gaussian location model (Model Md,1) for n = 1 the data are

X = µ + ξ

where L(µ) = N(0, τ²) and L(ξ) = N(0, σ²). The parameter µ can be seen as a ”signal” which is observed with noise ξ. For m = 0 we have

τ² = Eµ²,   σ² = Eξ²,

so that τ², σ² represent the mean squared values of the signal and of the noise. The parameters of the posterior density can be expressed in terms of the signal-to-noise ratio r and σ²:

β = τ²/(σ² + τ²) = r/(1 + r),   ρ² = σ²τ²/(σ² + τ²) = σ² · r/(1 + r),

and the discussion of the limiting cases 1-4 above could have been given in terms of r.
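To make the formulas concrete, here is a small Python sketch; it is not part of the handout, and numpy as well as the numerical values m = 0, τ² = 4, σ² = 1, x = 1.5 are illustrative assumptions. It computes β, ρ² and the posterior N(m + β(x − m), ρ²) for one observation, and also recovers them from the signal-to-noise ratio r = τ²/σ².

import numpy as np

def gaussian_posterior(x, m, tau2, sigma2):
    """Posterior of mu given one observation X = x in the model
    X = mu + xi, xi ~ N(0, sigma2), with prior mu ~ N(m, tau2)."""
    beta = tau2 / (sigma2 + tau2)           # shrinkage factor (5.20)
    rho2 = sigma2 * tau2 / (sigma2 + tau2)  # posterior variance (5.19)
    return m + beta * (x - m), rho2

m, tau2, sigma2, x = 0.0, 4.0, 1.0, 1.5
mean, var = gaussian_posterior(x, m, tau2, sigma2)
r = tau2 / sigma2                           # signal-to-noise ratio
print(mean, var)                            # 1.2, 0.8
print(r / (1 + r), sigma2 * r / (1 + r))    # the same beta and rho2 expressed via r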

Remark 5.5.3 The source of randomness of the parameter ϑ may not only be ”prior belief”, as argued until now, but may be entirely natural and part of the model, along with the randomness of the noise ξ. This randomness assumption even dominates in the statistical literature related to communication and signal transmission (assumption of a random signal). The problem of estimation of ϑ then becomes the problem of prediction. Predicting ϑ still means to find an estimator


T(X) depending on the data, but for assessing the quality of a predictor, the randomness of ϑ is always taken into account. The risk of a predictor is the expected loss

E L(T(X), ϑ)

w.r. to the joint distribution of X and ϑ. This coincides with the mixed risk of an estimator

B(T) = Eπ (Eϑ L(T(X), ϑ))

in Bayesian statistics when a prior distribution π on ϑ is used to build the joint distribution of (X, ϑ). Thus an optimal predictor is the same as a Bayes estimator, when the loss functions L coincide.

In statistical communication theory (or information theory), a family of probability distributions Pϑ, ϑ ∈ Θ on some sample space X is often called a noisy (communication) channel, the parameter ϑ is the signal at the input and the observation X is called the signal at the output. It is assumed that the signal ϑ is intrinsically random, e.g. the finite set Θ = {ϑ1, . . . , ϑk} may represent letters of an alphabet which occur at different frequencies (or probabilities) π(ϑ). Thus π(ϑ) may be given naturally, by the frequency of certain letters in a given language. Typical examples of noisy channels are

• the binary channel: Θ = {0, 1}, Pϑ = B(1, pϑ) where pϑ ∈ (0, 1), ϑ = 0, 1. The signals are either 0 or 1, and given ϑ = 1 the channel produces the correct signal 1 with probability p1 and the distorted signal 0 with probability 1 − p1; analogously for ϑ = 0. It is natural to assume here p1 > 1/2, p0 < 1/2, since otherwise the channel would be entirely distorted (it would give the wrong signal with higher probability than the correct one). A natural prior π here would be the uniform one (π(0) = π(1) = 1/2), since in most data streams it is probably not sensible to assume that 0 is more frequent than 1.

• the n-fold product of the binary channel: Θ = {0, 1}^n, Pϑ = ⊗_{i=1}^n B(1, p_{ϑ(i)}), where ϑ(i) is the i-th component of a sequence of 0's and 1's of length n and ⊗_{i=1}^n signifies the n-fold product of laws. The numbers pϑ ∈ (0, 1), ϑ = 0, 1 are as above in the simple binary channel. Here a signal is a sequence of length n, and the channel transmits this sequence such that each component is sent independently through a simple binary channel. The difference to the previous case is that the signal is a sequence of length n, not just 0 or 1; thus n = 8 gives card(Θ) = 2^8 = 256 and this suffices to encode and send all the ASCII signs. Here it is natural to assume a non-uniform distribution π on the alphabet Θ, since ASCII signs (e.g. letters) do not all occur with equal probability.

• the Gaussian channel. Let ϑ ∈ R and Pϑ = N(ϑ, σ²) where σ² is fixed. Here the signal at the output has the form

X = ϑ + ξ   (5.23)

where L(ξ) = N(0, σ²), i.e. the channel transmits the signal in additive Gaussian noise. Here the real numbers ϑ are ”codes” agreed upon for any other signal, such as sequences of 0's and 1's as above. Thus the channel (5.23) itself coincides with the Gaussian location model. However, since there are usually only a finite number of signals possible, in information theory one considers prior distributions π for the signal ϑ which are concentrated on finite sets Θ ⊂ R.


After this digression, our next task is Bayesian inference in the Gaussian location model for general sample size n. Our prior distribution for ϑ will again be N(m, τ²). We have

Xi = ϑ+ ξi

where ξi are N(0, σ2) and independent. Consider the density of X given ϑ = µ:

pµ(x) = p(x|µ) = p(x1, . . . , xn|µ) = ∏_{i=1}^n (2πσ²)^{−1/2} exp(−(xi − µ)²/(2σ²))
      = (1/(σn^{−1/2})) ϕ((x̄n − µ)/(σn^{−1/2})) · (1/(n^{1/2}(2πσ²)^{(n−1)/2})) exp(−s²n/(2σ²n^{−1})),

where the last line is the representation found in (3.4) in the proof of Proposition 3.0.5, with ϕ the standard normal density and s²n the sample variance. The first factor is a normal density in x̄n and the second factor (call it p(s²n) for now) does not depend on µ. If we denote by ϕ_{µ,σ}(x) the density of the normal law N(µ, σ²) then

pµ(x) = ϕ_{µ, n^{−1/2}σ}(x̄n) · p(s²n).

If π(µ) is the prior density then

p(µ|x) = pµ(x)π(µ) / ∫ pµ(x)π(µ) dµ = ϕ_{µ, n^{−1/2}σ}(x̄n)π(µ) / ∫ ϕ_{µ, n^{−1/2}σ}(x̄n)π(µ) dµ.

This coincides with the posterior distribution for one observation (n = 1) at value X1 = x̄n and variance n^{−1}σ². In other words, the posterior distribution given X = x may be computed as if we observed only the sample mean X̄n, taking into account that its law is N(µ, n^{−1}σ²). Thus the posterior density depends on the vector x only via the function x̄n of x.

Proposition 5.5.4 In the Gaussian location model M2 for general sample size n and a normal prior distribution L(µ) = N(m, τ²), the posterior distribution is

L(µ|X = x) = N(m + β(x̄n − m), ρ²)   (5.24)

where ρ and β are defined by

β = τ²/(n^{−1}σ² + τ²),   ρ² = n^{−1}σ²τ²/(n^{−1}σ² + τ²).   (5.25)

The normal family {N(m, τ²), m ∈ R, τ² > 0} is a conjugate family of prior distributions.

We again emphasize the special role of the sample mean.

Corollary 5.5.5 In Model M2 for general sample size n, consider the statistic X̄n and regard it as the data in a derived model:

L(X̄n|µ) = N(µ, n^{−1}σ²)

(Gaussian location for sample size n = 1 and variance n^{−1}σ²). In this model, the normal prior L(µ) = N(m, τ²) leads to the same posterior distribution (5.24) for µ.


Remark 5.5.6 Sufficient statistics. Properties of statistics (data functions) T(X) like these suggest that T(X) may contain all the relevant information in the sample, i.e. that it suffices to take the statistic T(X) and perform all inference about the parameter ϑ in the parametric model for T(X) derived from the original family {Pϑ, ϑ ∈ Θ} for X:

{Qϑ, ϑ ∈ Θ} = {L(T(X)|ϑ), ϑ ∈ Θ},

which is a family of distributions on the space T where T(X) takes its values. This idea is called the sufficiency principle, and T would be a sufficient statistic. At this point we do not rigorously define this concept; we note only that it is of fundamental importance in the theory of statistical inference.

The expectation of the posterior distribution in Proposition 5.5.4 is m + β(x̄n − m). It is natural to call this the conditional expectation of µ given X = x, or the posterior expectation, written E(µ|X = x). (We have not shown so far that it is unique; more accurately we should call it a version of the conditional expectation.) Clearly for n → ∞ we have β → 1 and E(µ|X = x) will be close to the sample mean. Moreover, the above corollary shows that in a discussion of limiting cases as in Cases 1-4 above, all statements carry over when X (the one observation for n = 1) is replaced by the sample mean. In addition, the noise variance in this discussion is now replaced by n^{−1}σ². Thus e.g. Case 4 can be taken to mean that as the sample size n increases, the prior distribution matters less and less (indeed we have more and more information in the sample). The analog of the signal-to-noise ratio would be

r = nτ²/σ² → ∞ as n → ∞.

Again we see that intuitively a large sample size n amounts to ”small noise”.
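As a numerical illustration of Proposition 5.5.4 and Corollary 5.5.5, the following Python sketch (not from the handout; numpy, the seed and the parameter values are illustrative assumptions) computes the posterior N(m + β(x̄n − m), ρ²) from a simulated sample; note that it uses the data only through the sample mean.

import numpy as np

def posterior_n(x, m, tau2, sigma2):
    """Posterior of mu for an i.i.d. N(mu, sigma2) sample x with prior mu ~ N(m, tau2),
    computed from the sample mean as in (5.24)-(5.25)."""
    n = len(x)
    xbar = np.mean(x)
    beta = tau2 / (sigma2 / n + tau2)
    rho2 = (sigma2 / n) * tau2 / (sigma2 / n + tau2)
    return m + beta * (xbar - m), rho2

rng = np.random.default_rng(1)
mu_true, sigma2, m, tau2, n = 0.7, 1.0, 0.0, 2.0, 50
x = mu_true + np.sqrt(sigma2) * rng.standard_normal(n)

print(posterior_n(x, m, tau2, sigma2))
# for large n, beta is close to 1 and rho2 close to 0: the data dominate the prior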

5.6 Bayes estimators (continuous case)

The following discussion of Bayes estimators as posterior expectations is analogous to the case of a finite sample space (Proposition 5.2.1). Consider again the general continuous statistical model Mc where a density π(ϑ) on Θ is given such that π(ϑ) > 0, ϑ ∈ Θ. Consider the mixed quadratic risk of an estimator T,

B(T) = ∫_Θ R(T, ϑ)π(ϑ) dϑ = ∫_Θ Eϑ(T(X) − ϑ)² π(ϑ) dϑ = E(T(X) − ϑ)²,

where the last expectation is taken w.r. to the joint distribution of X and ϑ. A Bayes estimator is an estimator TB which minimizes B(T):

B(TB) = inf_T B(T).

It is also called a best predictor, depending on the context (cf. Remark 5.5.3).
Recall that in the statistical model Mf (the sample space X for X is finite and Θ is a finite interval in R), with a prior density g on Θ and a quadratic loss function, a Bayes estimator TB of ϑ is given by the posterior expectation

TB(x) = E(ϑ|X = x), x ∈ X.


Proposition 5.6.1 In model Mc assume a prior density π(ϑ) on Θ such that π(ϑ) > 0, ϑ ∈ Θ.Suppose that for all realizations x of X, there is a version of the posterior density with finite secondmoment (E(ϑ2|X = x) < ∞). Then the posterior expectation T (x) = E(ϑ|X = x) is a Bayesestimator for quadratic loss.

Proof. In Proposition 5.2.1 the finite second moment was automatically ensured by the condition that Θ was a finite interval; otherwise the proof is entirely analogous. Under the present condition, E(ϑ|X = x) exists. We have

B(T) = ∫_{R^n×Θ} (T(x) − ϑ)² p(x|ϑ)π(ϑ) dx dϑ
     = ∫_{R^n} ( ∫_Θ (T(x) − ϑ)² p(ϑ|x) dϑ ) pX(x) dx
     = ∫_{R^n} E((T(X) − ϑ)² | X = x) pX(x) dx.

Invoking relation (5.8) again, we find for any x and TB(x) = E(ϑ|X = x)

∫_Θ (T(x) − ϑ)² p(ϑ|x) dϑ ≥ ∫_Θ (TB(x) − ϑ)² p(ϑ|x) dϑ.

This holds true even if the left side is infinite. The right side is the variance of p(ϑ|x) and is finite under the assumptions. Hence

B(T) ≥ E(TB(X) − ϑ)² = B(TB).

Corollary 5.6.2 In Model M2 for general sample size n and a normal prior distribution L(µ) = N(m, τ²), the Bayes estimator for quadratic loss is

TB(X) = m + β(X̄n − m)

where β is given by (5.25). The Bayes risk is

B(TB(X)) = E(TB(X) − µ)² = ρ² = n^{−1}σ²τ²/(n^{−1}σ² + τ²).   (5.26)

Proof. The second statement follows from the fact that if Var(µ|X) denotes the variance of the posterior density p(µ|X) then, from the proof of Proposition 5.6.1,

E(TB(X) − µ)² = E(Var(µ|X)),

together with Proposition 5.5.4, which gives

Var(µ|X = x) = ρ².

Thus Var(µ|X = x) does not depend on x and E(Var(µ|X)) = ρ².
Note that speaking of ”the” Bayes estimator is justified here: indeed when we start with the normal conditional density for our data X1, . . . , Xn and a normal prior, then there is always a version of the posterior density which is normal. As long as we do not arbitrarily modify these normal densities at certain points (which is theoretically allowed) we obtain a unique posterior expectation for all data points x.
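The Bayes risk formula (5.26) can be checked by simulation. The sketch below is not from the handout; numpy, the seed and the parameter values are illustrative assumptions. It draws µ from the prior, then the sample mean given µ, applies TB and compares the average squared error with ρ².

import numpy as np

rng = np.random.default_rng(2)
m, tau2, sigma2, n, reps = 0.0, 1.5, 1.0, 10, 200_000

beta = tau2 / (sigma2 / n + tau2)
rho2 = (sigma2 / n) * tau2 / (sigma2 / n + tau2)

mu = m + np.sqrt(tau2) * rng.standard_normal(reps)            # mu ~ N(m, tau2)
xbar = mu + np.sqrt(sigma2 / n) * rng.standard_normal(reps)   # Xbar_n | mu ~ N(mu, sigma2/n)
t_bayes = m + beta * (xbar - m)                               # Bayes estimator of Corollary 5.6.2

print(np.mean((t_bayes - mu) ** 2))   # Monte Carlo estimate of the Bayes risk
print(rho2)                           # exact value from (5.26)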


5.7 Minimax estimation of Gaussian location

The Bayes estimators also have interesting properties with regard to the original risk function R(T, ϑ), without integration over Θ. One such statement (admissibility) was proved in Proposition 2.3.2. It is easy to find the risk R(T, ϑ) of the Bayes estimator: the usual bias-variance decomposition gives

Eϑ(TB(X) − ϑ)² = Eϑ(TB(X) − EϑTB(X))² + (EϑTB(X) − ϑ)².

Consider the Gaussian location model with a mean zero Gaussian prior L(µ) = N(0, τ²). According to Corollary 5.6.2 we have TB(X) = βX̄n and hence

EµTB(X) − µ = (τ²/(n^{−1}σ² + τ²)) EµX̄n − µ = τ²µ/(n^{−1}σ² + τ²) − µ = −(n^{−1}σ²/(n^{−1}σ² + τ²)) µ,

Varµ(TB(X)) = β² Varµ(X̄n) = (τ²/(n^{−1}σ² + τ²))² n^{−1}σ²,

hence

R(TB, µ) = Eµ(TB(X) − µ)² = (1/(n^{−1}σ² + τ²))² (τ⁴n^{−1}σ² + (n^{−1}σ²)²µ²)
         = n^{−1}σ² (τ²/(n^{−1}σ² + τ²))² (1 + (n^{−1}σ²/τ⁴) µ²).

The sample mean is an unbiased estimator, hence

R(X̄n, µ) = Eµ(X̄n − µ)² = Varµ(X̄n) = n^{−1}σ².   (5.27)

Comparing the two risks, we note the following facts.

A) Since

τ²/(n^{−1}σ² + τ²) < 1,

for small values of µ we have

R(TB, µ) < R(X̄n, µ),

i.e. for small values of µ the estimator TB is better.

B) For |µ| → ∞, R(TB, µ) → ∞, whereas R(X̄n, µ) remains constant. For large values of µ the estimator X̄n is better.

C) As τ → ∞, the risk of TB(X) approaches the risk of X̄n, at any given value of µ.

The sample mean recommends itself as particularly prudent, in the sense that it takes into account possibly large values of µ, whereas the Bayes estimator is more ”optimistic” in the sense that it is geared towards smaller values of µ.
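The comparison in A)-C) can be reproduced numerically. The following Python sketch (not from the handout; numpy and the chosen grid of µ values are illustrative assumptions) evaluates R(TB, µ) and R(X̄n, µ) for n^{−1}σ² = 1, as in the figure below.

import numpy as np

def risk_bayes(mu, tau2, noise_var):
    """R(T_B, mu) for the N(0, tau2) prior; noise_var stands for n^{-1} sigma^2."""
    beta = tau2 / (noise_var + tau2)
    return beta**2 * noise_var + (1 - beta) ** 2 * mu**2   # variance + squared bias

noise_var = 1.0
mu_grid = np.linspace(-4, 4, 9)
for tau in (1.0, 2.0):
    print(tau, np.round(risk_bayes(mu_grid, tau**2, noise_var), 3))
print("sample mean risk:", noise_var)   # constant in mu, cf. (5.27)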


Figure: Risks of the Bayes estimators for τ = 1 and τ = 2, and risk of the sample mean (dotted), for n^{−1}σ² = 1.

Recall Definition 2.5.1 of a minimax estimator; the concept is the same in the general case: for any estimator T set

M(T) = sup_{ϑ∈Θ} R(T, ϑ).   (5.28)

Here, more generally, sup is written in place of the maximum, since it is not claimed that the supremum is attained. Conceptually, the criterion is again the worst case risk. An estimator TM is called minimax if

M(TM) = min_T M(T).

Theorem 5.7.1 In the Gaussian location model for µ ∈ Θ = R (Model M2), the sample mean X̄n is a minimax estimator for quadratic loss.

Proof. Suppose that for an estimator T0 we have

M(T0) < M(X̄n).   (5.29)

Above in (5.27) it was shown that R(X̄n, µ) = n^{−1}σ² does not depend on µ, hence M(X̄n) = n^{−1}σ². It follows that there is ε > 0 such that

R(T0, µ) ≤ n^{−1}σ² − ε for all µ ∈ R.

Take a prior distribution N(0, τ²) for µ; then for the mixed (integrated) risk B(T0) of the estimator T0

B(T0) = Eτ R(T0, µ) ≤ n^{−1}σ² − ε.

If TB,τ is the Bayes estimator for N(0, τ²) given by Corollary 5.6.2, then

B(TB,τ) = ρ² = n^{−1}σ²τ²/(n^{−1}σ² + τ²) → n^{−1}σ² as τ → ∞.


Hence for τ large enough we have

B(T0) < B(TB,τ),

which contradicts the fact that TB,τ is the Bayes estimator. Hence there can be no T0 with (5.29).

It is essential for this argument that the parameter space Θ is the whole real line. It turns out (see the exercise below) that for e.g. Θ = [−K, K] we can find an estimator TB,τ which is uniformly strictly better than X̄n, so that X̄n is no longer minimax.
Let us consider the case Θ = [−K, K] in more detail; here we have the a priori restriction |µ| ≤ K. It is a more complicated problem to find a minimax estimator here. A common approach is to simplify the problem again, by looking for minimax estimators within a restricted class of estimators. For simplicity consider the case of the Gaussian location model for n = 1, σ² = 1. A linear estimator is any estimator

T(X) = aX + b

where a, b are real numbers. The appropriate worst case risk is again (5.28), for the present parameter space Θ. A minimax linear estimator is a linear estimator TLM such that

M(TLM) = sup_{µ∈Θ} R(TLM, µ) ≤ min_{T linear} M(T).

Exercise 5.7.2 Consider the Gaussian location model with restricted parameter space Θ = [−K, K], where K > 0, sample size n = 1 and σ² = 1. (i) Find the minimax linear estimator TLM. (ii) Show that TLM is strictly better than the sample mean X̄n = X, everywhere on Θ = [−K, K] (this implies that X is not admissible). (iii) Show that TLM is Bayesian in the unrestricted model Θ = R for a certain prior distribution N(0, τ²), and find the τ².


Chapter 6
THE MULTIVARIATE NORMAL DISTRIBUTION

Recall the Gaussian location model (Model M2), for sample size n = 1. We can represent X as

X = µ + ξ

where ξ is centered normal: L(ξ) = N(0, σ²). In a Bayesian statistical approach, assume that µ becomes random as well: L(µ) = N(0, τ²), in such a way that it is independent of the ”noise” ξ. Thus ξ and µ are independent normal r.v.’s. For Bayesian statistical inference, we computed the joint distribution of X and µ; this is well defined as the distribution of the r.v. (µ + ξ, µ). The joint density was

p(x, µ) = (1/(2πστ)) exp(−(x − µ)²/(2σ²) − µ²/(2τ²)).   (6.1)

Here we started with a pair of independent Gaussian r.v.’s (ξ, µ) and obtained a pair (X, µ). We saw that the marginal of X is N(0, σ² + τ²); the marginal of µ is N(0, τ²). We have a joint distribution of (X, µ) in which both marginals are Gaussian, but (X, µ) are not independent: indeed (6.1) is not the product of its marginals. Note that we took a linear transformation of (ξ, µ):

X = 1 · ξ + 1 · µ,
µ = 0 · ξ + 1 · µ.

We could have written ξ = σx1 and µ = τx2 for independent standard normals x1, x2; then the linear transformation is

X = σ · x1 + τ · x2,   (6.2)
µ = 0 · x1 + τ · x2.

Let us consider a more general situation. We have independent standard normals x1, x2, and consider

y1 = m11x1 + m12x2,
y2 = m21x1 + m22x2,

where the linear transformation is nonsingular: m11m22 − m21m12 ≠ 0. Nonsingularity is true for (6.2): it means just στ > 0. Define a matrix

M = ( m11 m12 )
    ( m21 m22 )


and vectors

x = (x1, x2)^T,   y = (y1, y2)^T.

Then nonsingularity means |M| ≠ 0 (where |M| is the determinant of M). Let us find the joint distribution of y1, y2, i.e. the law of the vector y. For this we immediately proceed to the k-dimensional case, i.e. we consider i.i.d. standard normal x1, . . . , xk, and let

x = (x1, . . . , xk)^T,

M be a nonsingular k × k matrix and y = Mx. Let A be a measurable set in R^k, i.e. a set which has a well defined volume (finite or infinite). Then (ϕ is the standard normal density)

P(y ∈ A) = P(Mx ∈ A) = ∫_{Mx∈A} ( ∏_{i=1}^k ϕ(xi) ) dx1 . . . dxk
         = (1/(2π)^{k/2}) ∫_{Mx∈A} exp(−(1/2) ∑_{i=1}^k xi²) dx1 . . . dxk
         = (1/(2π)^{k/2}) ∫_{Mx∈A} exp(−(1/2) x^T x) dx,

where x^T is the transpose of x, x^T x is the inner product (scalar product) and x^T x = ∑_{i=1}^k xi². Now let M^{−1} be the inverse of M and write

x = M^{−1}Mx,   x^T x = (Mx)^T (M^{−1})^T M^{−1} Mx.

Set Σ = MM^T; this is a nonsingular symmetric matrix. Recall the following matrix rules:

(M^{−1})^T M^{−1} = (MM^T)^{−1} = Σ^{−1},   |Σ| = |M| |M^T| = |M|²,   |M^{−1}| = |M|^{−1}.

We thus obtain x^T x = (Mx)^T Σ^{−1} Mx, and

P(y ∈ A) = (1/(2π)^{k/2}) ∫_{Mx∈A} exp(−(1/2)(Mx)^T Σ^{−1} Mx) dx.

This suggests a multivariate change of variables: setting y = Mx, we have to set x = M^{−1}y and (formally)

dx = |M^{−1}| dy = |M|^{−1} dy.

Writing |M| = |Σ|^{1/2}, we obtain

P(y ∈ A) = (1/((2π)^{k/2}|Σ|^{1/2})) ∫_{y∈A} exp(−(1/2) y^T Σ^{−1} y) dy.


Definition 6.1.3 The distribution in R^k given by the joint density of y1, . . . , yk

p(y) = (1/((2π)^{k/2}|Σ|^{1/2})) exp(−(1/2) y^T Σ^{−1} y)   (6.3)

for some Σ = MM^T, |Σ| ≠ 0 (y = (y1, . . . , yk)^T), is called the k-variate normal distribution Nk(0, Σ).

Theorem 6.1.4 Let x1, . . . , xk be i.i.d. standard normal random variables, M a nonsingular k × k matrix and

x = (x1, . . . , xk)^T,   y = Mx.

Then y has the distribution Nk(0, Σ), where Σ = MM^T.

Clearly all matrices Σ = MM^T with |M| ≠ 0 are possible here. Let us describe the class of possible matrices Σ differently.

Lemma 6.1.5 Any matrix Σ = MM^T with |M| ≠ 0 is positive definite: for any x ∈ R^k which is not the zero vector,

x^T Σ x > 0.

Proof. For any x ≠ 0 (0 is the null vector) and z = M^T x,

x^T Σ x = x^T MM^T x = (M^T x)^T (M^T x) = z^T z > 0,

since z^T z = 0 would mean z = 0 and thus x = (M^T)^{−1} z = 0, which was excluded by assumption.
Recall the following basic fact from linear algebra.

Proposition 6.1.6 (Spectral decomposition) Any positive definite k × k-matrix Σ can be written

Σ = C^T Λ C

where Λ is a diagonal k × k-matrix

Λ = diag(λ1, . . . , λk)

with positive diagonal elements λi > 0, i = 1, . . . , k (called eigenvalues or spectral values), and C is an orthogonal k × k-matrix:

C^T C = CC^T = Ik

(Ik is the unit k × k-matrix). If all the λi, i = 1, . . . , k, are different, then Λ is unique and C is unique up to sign changes of its row vectors.

Lemma 6.1.7 Every positive definite k × k-matrix Σ can be written as MM^T, where M is a nonsingular k × k-matrix.


Proof. It is easy to take a square root Λ^{1/2} of a diagonal matrix Λ: let Λ^{1/2} be the diagonal matrix with diagonal elements λi^{1/2}; then Λ^{1/2}Λ^{1/2} = Λ. Now take M = C^T Λ^{1/2} where C, Λ are from the spectral decomposition:

MM^T = C^T Λ^{1/2} Λ^{1/2} C = C^T Λ C = Σ,

and M is nonsingular:

|M| = |C^T| |Λ^{1/2}| = |C^T| ∏_{i=1}^k λi^{1/2},

and for any orthogonal matrix C one has |C^T| = ±1, since

|C^T|² = |C^T C| = |Ik| = 1.
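The construction in the proof can be mirrored numerically. The following Python sketch (not from the handout; numpy, the seed and the example Σ are illustrative assumptions) builds M = C^T Λ^{1/2} from an eigendecomposition of Σ, samples y = Mx for i.i.d. standard normal x, and checks the empirical covariance against Σ.

import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # a positive definite example

lam, C_T = np.linalg.eigh(Sigma)        # Sigma = C_T @ diag(lam) @ C_T.T (columns are eigenvectors)
M = C_T @ np.diag(np.sqrt(lam))         # then M @ M.T = Sigma

x = rng.standard_normal((2, 100_000))   # i.i.d. standard normal columns
y = M @ x                               # each column ~ N_2(0, Sigma)

print(M @ M.T)                          # reproduces Sigma
print(np.cov(y))                        # empirical covariance, close to Sigma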

As a result, the k-variate normal distribution Nk(0, Σ) is defined for any positive definite matrix Σ.
Recall that for any random vector y in R^k, the covariance matrix is defined by

(Cov(y))_{i,j} = Cov(yi, yj) = E(yi − Eyi)(yj − Eyj),

if the expectations exist. (Here (A)_{i,j} is the (i, j) entry of a matrix A.) This existence is guaranteed by the condition ∑_{i=1}^k Eyi² < ∞ (via the Cauchy-Schwarz inequality).

Lemma 6.1.8 The law Nk(0, Σ) has expectation 0 and covariance matrix Σ.

Proof. In the present case we have y = Mx, hence

yi = ∑_{r=1}^k m_{ir} x_r,

which immediately shows Eyi = 0. Furthermore,

Cov(yi, yj) = E yi yj = E( ∑_{r=1}^k m_{ir} x_r )( ∑_{s=1}^k m_{js} x_s ) = E( ∑_{r,s=1}^k m_{ir} m_{js} x_r x_s ).

Since

E x_r x_s = 1 if r = s, and 0 otherwise,

we obtain

Cov(yi, yj) = ∑_{r=1}^k m_{ir} m_{jr} = (MM^T)_{i,j} = (Σ)_{i,j}.

For the matrix Σ one commonly writes

(Σ)_{i,j} = σ_{ij},   (Σ)_{i,i} = σ_{ii} = σi².


Definition 6.1.9 Let x be a random vector with distribution Nk(0, Σ), where Σ is a positive definite matrix, and let µ be a (nonrandom) vector from R^k. The distribution of the random vector

y = x + µ

is called the k-variate normal distribution Nk(µ, Σ).

The following result is obvious.

Lemma 6.1.10 (i) The law Nk(µ, Σ) has expectation µ and covariance matrix Σ.
(ii) The density is

ϕ_{µ,Σ}(y) := (1/((2π)^{k/2}|Σ|^{1/2})) exp(−(1/2)(y − µ)^T Σ^{−1} (y − µ)).

Lemma 6.1.11 L(x) = Nk(0, Ik) if and only if x is a vector of i.i.d. standard normals.

Proof. In this case the joint density is

ϕ_{µ,Σ}(x) = ϕ_{0,Ik}(x) = ∏_{i=1}^k ϕ(xi),

which proves that the components xi are i.i.d. standard normal.

Lemma 6.1.12 Let L(x) = Nk(µ, Σ), where Σ is positive definite, and let A be a (nonrandom) l × k matrix with rank l (this implies l ≤ k). Then

L(Ax) = Nl(Aµ, AΣA^T).

Proof. It suffices to show

L(Ax − Aµ) = L(A(x − µ)) = Nl(0, AΣA^T),

so we can assume µ = 0. Let also x = Mξ where L(ξ) = Nk(0, Ik) and Σ = MM^T. Then Ax = AMξ, so for l = k the claim is immediate. In general, for l ≤ k, consider also the l × l-matrix AΣA^T; note that it is positive definite:

a^T AΣA^T a > 0

for every nonzero l-vector a, since A^T a is then a nonzero k-vector (A has rank l). Let

AΣA^T = C^T Λ C

be a spectral decomposition. Define D = Λ^{−1/2}C; then

Ax = AMξ,
DAx = DAMξ.


Suppose first that l = k. Then DAx is multivariate normal with expectation 0 and covariance matrix

DAM(DAM)^T = DAMM^T A^T D^T = DAΣA^T D^T = DC^T ΛCD^T
            = Λ^{−1/2}C(C^T ΛC)C^T Λ^{−1/2} = Λ^{−1/2}ΛΛ^{−1/2} = Il.

Hence

L(DAx) = Nl(0, Il),

i.e. DAx is a vector of standard normals.
If l < k then we find a (k − l) × k matrix F such that

DAMF^T = 0,   FF^T = I_{k−l}.

For this it suffices to select the rows of F as k − l orthonormal vectors which form a basis of the subspace of R^k which is orthogonal to the subspace spanned by the rows of DAM. Then for the k × k matrix

F0 = ( DAM )
     (  F  )

we have

F0 F0^T = ( Il 0      )
          ( 0  I_{k−l} ) = Ik,

i.e. F0 is orthogonal. Hence F0ξ is multivariate normal Nk(0, Ik), i.e. a vector of independent standard normals. Since DAMξ consists of the first l elements of F0ξ, it is a vector of l standard normals. We have shown again that DAx is a vector of standard normals.
Note that D^{−1} = C^T Λ^{1/2}, since

DC^T Λ^{1/2} = Λ^{−1/2}CC^T Λ^{1/2} = Λ^{−1/2}Λ^{1/2} = Il.

Hence for Ax = D^{−1}DAx

L(Ax) = Nl(0, D^{−1}(D^{−1})^T) = Nl(0, C^T Λ^{1/2}Λ^{1/2}C) = Nl(0, AΣA^T).
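A quick numerical check of Lemma 6.1.12 (Python sketch, not from the handout; numpy, the seed and the example A, µ, Σ are illustrative assumptions): sample x ~ Nk(µ, Σ) and compare the empirical mean and covariance of Ax with Aµ and AΣA^T.

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])        # l = 2, k = 3, rank 2

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows ~ N_3(mu, Sigma)
Ax = x @ A.T                                           # rows ~ N_2(A mu, A Sigma A^T)

print(Ax.mean(axis=0), A @ mu)                         # empirical vs exact mean
print(np.cov(Ax.T))                                    # empirical covariance
print(A @ Sigma @ A.T)                                 # exact covariance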

Two random vectors x, y with dimensions k, l respectively are said to have a joint normal distribution if the vector

z = ( x )
    ( y )

has a (k + l)-variate normal distribution. The covariance matrix Cov(x, y) of x, y is defined by

(Cov(x, y))_{i,j} = Cov(xi, yj), i = 1, . . . , k, j = 1, . . . , l;

it is a k × l-matrix. We then have a block structure for the joint covariance matrix Cov(z):

Cov(z) = ( Cov(x)    Cov(x, y) )
         ( Cov(y, x) Cov(y)    ).


Theorem 6.1.13 Two random vectors x, y with a joint normal distribution are independent if and only if they are uncorrelated, i.e. if

Cov(x, y) = 0   (6.4)

(where 0 stands for the null matrix).

Proof. Independence here means that the joint density is the product of its marginals. Assume that both x, y are centered, i.e. have zero expectation (otherwise the expectation can be subtracted, with (6.4) still true). Write Σ = Cov(z), Σ11 = Cov(x), Σ22 = Cov(y), Σ12 = Cov(x, y), Σ21 = Σ12^T. Then (6.4) means that

Σ = ( Σ11 0   )
    ( 0   Σ22 ),

where 0 represents null matrices of appropriate dimensions. A matrix of such a structure is called block diagonal. It is easy to see that

Σ^{−1} = ( Σ11^{−1} 0        )
         ( 0        Σ22^{−1} ).

Indeed, both inverses Σ11^{−1}, Σ22^{−1} exist, since for the determinants

|Σ| = |Σ11| |Σ22|,   (6.5)

and the matrix multiplication rules show

Σ^{−1}Σ = ( Σ11^{−1}Σ11 0           )
          ( 0           Σ22^{−1}Σ22 ) = I_{k+l}.

For the joint density of x, y (the density of z^T = (x^T, y^T)) we obtain

p(x, y) = ϕ_{0,Σ}(z) = (1/((2π)^{(k+l)/2}|Σ11|^{1/2}|Σ22|^{1/2})) exp(−(1/2)(x^T Σ11^{−1} x + y^T Σ22^{−1} y))
        = ϕ_{0,Σ11}(x) · ϕ_{0,Σ22}(y).

This immediately implies that the marginal of x is

p(x) = ∫ p(x, y) dy = ϕ_{0,Σ11}(x)

and the marginal of y is ϕ_{0,Σ22}(y). Thus ϕ_{0,Σ}(z) is the product of its marginals.

Conversely, assume that ϕ_{0,Σ}(z) is the product of its marginals, i.e. x, y are independent. This implies that for any real valued functions f(x), g(y) of the two vectors we have Ef(x)g(y) = Ef(x)Eg(y). Let e_{i(k)} be the i-th unit vector in R^k, so e_{i(k)}^T x = xi; then for the covariance matrix we obtain

Cov(x, y)_{i,j} = E xi yj = E(e_{i(k)}^T x · e_{j(l)}^T y) = E e_{i(k)}^T x · E e_{j(l)}^T y = 0,

since both x, y are centered.
Exercise. Prove (6.5) for a block diagonal Σ, using the spectral decomposition for each of the blocks Σ11, Σ22 and the rule |AB| = |A||B|.
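For jointly normal vectors, zero covariance can also be checked against independence numerically. The following Python sketch (not from the handout; numpy, the seed and the block-diagonal Σ are illustrative assumptions) samples z with Cov(x, y) = 0 and verifies the factorization Ef(x)g(y) ≈ Ef(x)Eg(y) for one choice of f and g.

import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 2.0, 0.0],
                  [0.0, 0.0, 0.7]])      # block diagonal: x = (z1, z2), y = z3, Cov(x, y) = 0

z = rng.multivariate_normal(np.zeros(3), Sigma, size=300_000)
x, y = z[:, :2], z[:, 2]

f = np.abs(x[:, 0] + x[:, 1])            # f(x)
g = y**2                                 # g(y)
print(np.mean(f * g))                    # E f(x) g(y)
print(np.mean(f) * np.mean(g))           # E f(x) E g(y); close to the line above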


Chapter 7
THE GAUSSIAN LOCATION-SCALE MODEL

Recall the Gaussian location-scale model which was already introduced (see page 32):

Model M3 Observed are n independent and identically distributed random variables X1, . . . , Xn, each having law N(µ, σ²), where µ ∈ R and σ² > 0 are both unknown.

The parameter in this model is two dimensional: ϑ = (µ, σ²) ∈ Θ = R × (0, ∞).

7.1 Confidence intervals

To illustrate the consequences of an unknown variance σ², let us look at the problem of constructing a confidence interval, beginning with the location model, Model M2. Suppose the confidence level 1 − α is given (e.g. α = 0.05 or α = 0.01), and try to construct a random interval [µ−, µ+]:

Pµ([µ−, µ+] ∋ µ) ≥ 1 − α,   (7.1)

meaning that the probability that the interval [µ−, µ+] covers µ is at least 1 − α (e.g. 95%). Note that both µ− and µ+ are random variables (functions of the data), so the interval is in fact a random interval. Therefore the element sign ∈ is written in the reverse form ∋, to stress the fact that in (7.1) the interval is random, not µ (µ is merely unknown). When σ² is known it is easy to build a confidence interval based on the sample mean: since

L(X̄n) = N(µ, n^{−1}σ²),

it follows that

L((X̄n − µ)n^{1/2}/σ) = N(0, 1).   (7.2)

Definition 7.1.1 Suppose that P is a continuous law on R with density p such that p(x) > 0 for x > 0 and P((0, ∞)) ≥ 1/2. For every α ∈ (0, 1/2), the uniquely defined number zα > 0 fulfilling

∫_{zα}^∞ p(x) dx = α

is called the (upper) α-quantile of the distribution P.

The word ”upper” is usually omitted for the standard normal distribution, since it is symmetric around 0. In our case it follows immediately from (7.2) that

Pµ(|(X̄n − µ)n^{1/2}/σ| ≤ zα/2) = 1 − α,


hence

[µ−, µ+] = [X̄n − σzα/2/√n, X̄n + σzα/2/√n]   (7.3)

is a 1 − α-confidence interval for µ.
Obviously we have to know the variance σ² for this confidence interval, so the procedure breaks down for the location-scale model. In Proposition 3.0.5, page 32, we already encountered the sample variance:

S²n = n^{−1} ∑_{i=1}^n (Xi − X̄n)² = n^{−1} ∑_{i=1}^n Xi² − (X̄n)².

This appears as a reasonable estimate to substitute for the unknown σ², for various reasons. First, S²n is the variance of the empirical distribution Pn: when x1, . . . , xn are observed, the empirical distribution is a discrete law which assigns probability n^{−1} to each point xi. From the point of view that x1, . . . , xn should be identified with the random variables X1, . . . , Xn, this is a random probability distribution, with distribution function

Fn(t) = n^{−1} ∑_{i=1}^n 1_{(−∞,t]}(Xi).

This is the empirical distribution function (e.d.f.). Obviously if Z is a random variable with law Pn then (assuming x1, . . . , xn fixed)

x̄n = ∑_{i=1}^n n^{−1} xi = EZ,
s²n = ∑_{i=1}^n n^{−1} xi² − (x̄n)² = Var(Z).

Analogously to the sample mean, we write S²n for s²n when this is construed as a random variable.

For the expectation of S²n we obtain (where the ξi are independent standard normals)

ES²n = E n^{−1} ∑_{i=1}^n (Xi − X̄n)² = E n^{−1} ∑_{i=1}^n (Xi − µ − (X̄n − µ))²
     = E n^{−1} ∑_{i=1}^n (σξi − σξ̄n)² = σ² E( n^{−1} ∑_{i=1}^n ξi² − (ξ̄n)² ).

Since Eξi² = 1 and L(ξ̄n) = N(0, n^{−1}), we have E(ξ̄n)² = n^{−1}, and thus

ES²n = σ²(1 − n^{−1}) = σ² (n − 1)/n.

Thus S²n is not unbiased, but an unbiased estimator is

S̄²n = (n/(n − 1)) S²n = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)².

Suppose we plug in S̄²n for σ² in the confidence interval (7.3). We cannot expect this to yield a 1 − α-confidence interval, since (7.3) is based on the standard normal law for (X̄n − µ)n^{1/2}/σ via (7.2), and the quantity

Tµ = Tµ(X) = (X̄n − µ)n^{1/2} / S̄n


cannot be expected to have a normal law. (Here S̄n = (S̄²n)^{1/2}.) However, we can hope to identify this distribution, and then base a confidence interval upon it (by taking a quantile).
Note that Tµ is not a statistic, since it depends on the unknown parameter µ. However, reverting again to the representation Xi − µ = σξi, we obtain

Tµ(X) = σξ̄n n^{1/2} / ((1/(n − 1)) ∑_{i=1}^n σ²(ξi − ξ̄n)²)^{1/2} = ξ̄n n^{1/2} / ((1/(n − 1)) ∑_{i=1}^n (ξi − ξ̄n)²)^{1/2}.

We see that the distribution of Tµ does not depend on the parameters µ and σ²; it depends only on the sample size n (i.e. the number of i.i.d. standard normals ξi involved).

7.2 Chi-square and t-distributions

Definition 7.2.1 Let X1, . . . , Xn be independent N(0, 1). The distribution of the statistic

T = T(X) = X̄n n^{1/2} / S̄n

is called the t-distribution with n − 1 degrees of freedom (denoted t_{n−1}). The statistic T is called the t-statistic.

Lemma 7.2.2 The t-distribution is symmetric around 0, i.e. for x ≥ 0

P(T ≥ x) = P(T ≤ −x).

Proof. This follows immediately from the fact that L(−ξi) = N(0, 1) = L(ξi).
This suggests that a confidence interval based on the t-distribution can be built in the same fashion as for the standard normal, i.e. by taking an upper quantile. It remains to find the actual form of the t-distribution, in order to compute its quantiles. The following result prepares this derivation.

Theorem 7.2.3 Let X1, . . . , Xn be n i.i.d. r.v.’s each having law N(µ, σ²).
(i) The sample mean X̄n and the sample variance S²n are independent random variables.
(ii) The bias corrected sample variance S̄²n = nS²n/(n − 1) can be represented as

S̄²n = (σ²/(n − 1)) ∑_{i=1}^{n−1} ξi²

where ξ1, . . . , ξ_{n−1} are i.i.d. standard normals, independent of X̄n.

Proof. Let X be the vector X = (X1, . . . , Xn)^T; then X has a multivariate normal distribution with covariance matrix σ²In. To describe the expectation vector, let 1n = (1, . . . , 1)^T be the vector in R^n consisting of 1's. Then

L(X) = Nn(µ1n, σ²In).


Consider the linear subspace of R^n of dimension n − 1 which is orthogonal to 1n. Let b1, . . . , b_{n−1} be an orthonormal basis of this subspace and bn = n^{−1/2}1n. Then bn also has norm 1 (‖bn‖ = 1), the whole set b1, . . . , bn is an orthonormal basis of R^n, and the n × n-matrix B with rows b1^T, b2^T, . . . , bn^T is orthogonal. Define Y = BX; then

L(Y) = Nn(µB1n, σ²In)

and the components Y1, . . . , Yn are independent. (Indeed when the covariance matrix of a multivariate normal is of the form σ²In, or more generally a diagonal matrix, then the joint density of the Yi is the product of its marginals.) Moreover

Yn = bn^T X = n^{−1/2}1n^T X = n^{1/2}X̄n,   (7.4)
EYj = E bj^T X = µ bj^T 1n = 0, j = 1, . . . , n − 1.

It follows that Y1, . . . , Y_{n−1} are independent N(0, σ²) random variables and are independent of X̄n. Now

∑_{i=1}^n Yi² = ‖Y‖² = X^T B^T B X = X^T X = ∑_{i=1}^n Xi².

On the other hand, in view of (7.4),

∑_{i=1}^n Yi² = ∑_{i=1}^{n−1} Yi² + nX̄n²,

so that

∑_{i=1}^{n−1} Yi² = ∑_{i=1}^n Xi² − nX̄n² = nS²n.

This shows that S²n is a function of Y1, . . . , Y_{n−1} and hence independent of X̄n, which establishes (i). Dividing both sides in the last display by n − 1 and setting ξi = Yi/σ establishes (ii).

Definition 7.2.4 Let X1, . . . , Xn be independent N(0, 1). The distribution of the statistic

χ² = χ²(X) = ∑_{i=1}^n Xi²

is called the chi-square distribution with n degrees of freedom, denoted χ²n.

To find the form of the t-distribution, we need to find the density of the ratio of a normal and the square root of an independent χ²-variable. We begin by deriving the density of the χ²-distribution. The following lemma immediately follows from the above definition.


Figure 1 The densities of χ²n for n = 5, 10, 15, . . . , 30.

Lemma 7.2.5 Let Y1, Y2 be independent r.v.’s with laws

L(Y1) = χ²k,   L(Y2) = χ²l,   k, l ≥ 1.

Then

L(Y1 + Y2) = χ²_{k+l}.

Proposition 7.2.6 The density of the law χ²n is

fn(x) = (1/(2^{n/2}Γ(n/2))) x^{n/2−1} exp(−x/2), x ≥ 0.

Comment. In Section 5.3 a family of Gamma densities was introduced as

fα(x) = (1/Γ(α)) x^{α−1} exp(−x)

for α > 0. Define more generally, for some γ > 0,

f_{α,γ}(x) = (1/(Γ(α)γ^α)) x^{α−1} exp(−xγ^{−1}).

The corresponding law is called the Γ(α, γ)-distribution. The chi-square distribution χ²n thus coincides with the distribution Γ(n/2, 2). The figure shows the χ²n-densities for 6 values of n (n = 5, 10, . . . , 30). From the definition it is clear that χ²n has expectation n.

Proof of the Proposition. Recall that for α > 0

Γ(α) = ∫_0^∞ x^{α−1} exp(−x) dx,   Γ(α + 1) = αΓ(α).


Our proof will proceed by induction. Start with χ²1: let X1 be N(0, 1); then

P(X1² ≤ t) = P(−t^{1/2} ≤ X1 ≤ t^{1/2}) = 2 ∫_0^{t^{1/2}} (1/(2π)^{1/2}) exp(−z²/2) dz.

A change of variable x = z², dz = (1/(2x^{1/2})) dx gives

P(X1² ≤ t) = ∫_0^t (1/(2πx)^{1/2}) exp(−x/2) dx

and we obtain

f1(x) = (1/(2^{1/2}π^{1/2})) x^{1/2−1} exp(−x/2), x ≥ 0.

Now

Γ(1/2) = π^{1/2}   (7.5)

follows from the fact that f1 integrates to one and the definition of the gamma function. We have obtained the density of χ²1 as claimed.
For the induction step, we assume that L(Y1) = χ²n and L(Y2) = χ²1. By the previous lemma, f_{n+1} is the convolution of fn and f1: taking the densities to be zero for negative arguments, we obtain

f_{n+1}(x) = ∫_0^∞ fn(y) f1(x − y) dy.

(Indeed the convolution of densities is the operation applied to the densities of two independent r.v.’s for obtaining the density of the sum.) Hence

f_{n+1}(x) = ∫_0^x (1/(2^{n/2}Γ(n/2))) y^{n/2−1} exp(−y/2) · (1/(2^{1/2}Γ(1/2))) (x − y)^{−1/2} exp(−(x − y)/2) dy
           = (1/(2^{(n+1)/2}Γ(n/2)Γ(1/2))) exp(−x/2) ∫_0^x y^{n/2−1}(x − y)^{−1/2} dy.

To compute the integral, we note (with a change of variables u = y/x)

∫_0^x y^{n/2−1}(x − y)^{−1/2} dy = x · x^{n/2−1} · x^{−1/2} ∫_0^x (y/x)^{n/2−1}(1 − y/x)^{−1/2} (1/x) dy
 = x^{(n+1)/2−1} ∫_0^1 u^{n/2−1}(1 − u)^{−1/2} du = x^{(n+1)/2−1} B(n/2, 1/2),

where B(n/2, 1/2) is the Beta integral:

B(α, β) = ∫_0^1 u^{α−1}(1 − u)^{β−1} du = Γ(α)Γ(β)/Γ(α + β)

(cf. Section 5.3 on the Beta densities). Thus, collecting results, we get

f_{n+1}(x) = (1/(2^{(n+1)/2}Γ(n/2)Γ(1/2))) exp(−x/2) x^{(n+1)/2−1} Γ(n/2)Γ(1/2)/Γ((n + 1)/2)
           = (1/(2^{(n+1)/2}Γ((n + 1)/2))) x^{(n+1)/2−1} exp(−x/2),

which is the form of f_{n+1}(x) claimed.
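As a quick numerical cross-check of Proposition 7.2.6 (Python sketch, not from the handout; numpy, scipy, the seed and the chosen n are illustrative assumptions): the sum of n squared standard normals should have mean n, and its distribution should coincide with Γ(n/2, 2), i.e. with scipy's chi2(n).

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 5, 200_000

chi2_samples = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)
print(chi2_samples.mean())                           # close to n
print(np.quantile(chi2_samples, [0.5, 0.95]))        # empirical quantiles
print(stats.chi2(df=n).ppf([0.5, 0.95]))             # chi^2_n quantiles
print(stats.gamma(a=n/2, scale=2).ppf([0.5, 0.95]))  # same law: Gamma(n/2, 2)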


Proposition 7.2.7 (i) The law tn is the distribution of

n^{1/2} Z1/√Z2,

where Z1, Z2 are independent r.v.’s with standard normal and χ²n-distribution, respectively.
(ii) The density of the law tn is

fn(x) = (Γ((n + 1)/2)/((πn)^{1/2}Γ(n/2))) (1 + x²/n)^{−(n+1)/2}, x ∈ R.

Proof. (i) follows immediately from Definition 7.2.1 and Theorem 7.2.3: for a r.v. T with t_{n−1}-distribution we obtain, if the ξi are defined as in the proof of Theorem 7.2.3,

T = X̄n n^{1/2}/S̄n = n^{1/2}X̄n / ((1/(n − 1)) ∑_{i=1}^{n−1} ξi²)^{1/2} = (n − 1)^{1/2} Z1 / Z2^{1/2},

where Z1 = n^{1/2}X̄n is standard normal and Z2 = ∑_{i=1}^{n−1} ξi² has the χ²_{n−1}-distribution.
To prove (ii), we note that the law tn is symmetric, so we can proceed via the law of the squared variable nZ1²/Z2. Here L(Z1²) = χ²1, and the joint density of Z1², Z2 is

g(t, u) = (1/(2^{1/2}Γ(1/2))) t^{1/2−1} exp(−t/2) · (1/(2^{n/2}Γ(n/2))) u^{n/2−1} exp(−u/2), t, u ≥ 0.

Consequently

P(Z1²/Z2 ≤ x) = ∫_{t/u≤x, u≥0, t≥0} (1/(2^{(n+1)/2}Γ(1/2)Γ(n/2))) u^{n/2−1} t^{−1/2} exp(−(t + u)/2) dt du.

Substituting t by v = t/u (so that dt = u dv), the above integral becomes

∫_{0≤v≤x, u≥0} (1/(2^{(n+1)/2}Γ(1/2)Γ(n/2))) u^{n/2−1}(vu)^{−1/2} exp(−(vu + u)/2) u dv du
 = ∫_{0≤v≤x, u≥0} (1/(2^{(n+1)/2}Γ(1/2)Γ(n/2))) u^{(n+1)/2−1} v^{−1/2} exp(−((v + 1)u)/2) dv du.

Another substitution of u by z = (v + 1)u makes this

∫_{0≤v≤x} ∫_{z≥0} (1/(2^{(n+1)/2}Γ(1/2)Γ(n/2))) (1/(v + 1)^{(n+1)/2−1}) z^{(n+1)/2−1} v^{−1/2} exp(−z/2) (1/(v + 1)) dz dv.

Here the terms depending on z, together with the appropriate constants (when we also divide and multiply by Γ((n + 1)/2)), form the density of χ²_{n+1}, so that the expression becomes, after integrating out z,

∫_{0≤v≤x} (Γ((n + 1)/2)/(Γ(1/2)Γ(n/2))) v^{−1/2}/(v + 1)^{(n+1)/2} dv = P(Z1²/Z2 ≤ x).


A final change of variables s = (nv)^{1/2}, dv = n^{−1}2s ds gives

P(0 ≤ n^{1/2}|Z1|/Z2^{1/2} ≤ a) = 2 ∫_{0≤s≤a} (Γ((n + 1)/2)/((nπ)^{1/2}Γ(n/2))) (1 + s²/n)^{−(n+1)/2} ds
 = 2 P(0 ≤ n^{1/2}Z1/Z2^{1/2} ≤ a).

Now differentiating w.r. to a gives the form of the density (ii).
Let us visualize the forms of the normal and the t-distribution.

Figure: The t-density with 3 degrees of freedom against the standard normal (dotted).

Clearly the t-distribution has heavier tails, which means that the quantiles (now called tα/2 in place of zα/2) are farther out and the confidence interval is wider. A narrower confidence interval, for the same level α, is preferable. Thus the absence of knowledge of the variance σ² is reflected in less sharp confidence statements.

Theorem 7.2.8 Let tα/2 be the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. Then in Model M3, the interval

[µ−, µ+] = [X̄n − S̄n tα/2/√n, X̄n + S̄n tα/2/√n]

is a 1 − α-confidence interval for the unknown expected value µ.
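A hedged Python sketch of this interval (not from the handout; numpy, scipy, the seed and the chosen level are illustrative assumptions):

import numpy as np
from scipy import stats

def t_confidence_interval(x, alpha=0.05):
    """1 - alpha confidence interval for mu in Model M3, as in Theorem 7.2.8."""
    n = len(x)
    xbar = np.mean(x)
    s_bar = np.std(x, ddof=1)                       # sqrt of bias-corrected sample variance
    t_quant = stats.t.ppf(1 - alpha / 2, df=n - 1)  # upper alpha/2-quantile of t_{n-1}
    half = s_bar * t_quant / np.sqrt(n)
    return xbar - half, xbar + half

rng = np.random.default_rng(7)
x = rng.normal(loc=1.0, scale=2.0, size=20)
print(t_confidence_interval(x))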

The t-distribution was found by Gosset (”The probable error of a mean”, Biometrika 6, 1908), who wrote under the pseudonym ”Student”. The distribution is frequently called ”Student’s t”. A notion of ”studentization” has been derived from it: in the Gaussian location model M2, the statistic

Z = X̄n n^{1/2}/σ


is sometimes called the Z-statistic. It is standard normal when µ = 0 and can therefore be used to test the hypothesis µ = 0 (for the theory of testing hypotheses cf. later sections). In the location-scale model, σ is not known, and by substituting S̄n one forms the t-statistic

T = X̄n n^{1/2}/S̄n.

The procedure of substituting the unknown variance σ² by its estimate S̄²n is called studentization.

Remark 7.2.9 Consider the absolute moment of order r (integer) of the tn-distribution:

E|T|^r = m_r = Cn ∫_0^∞ x^r (1 + x²/n)^{−(n+1)/2} dx,

where Cn is the appropriate normalizing constant. For x → ∞, the integrand is of order x^r x^{−(n+1)} = x^{r−n−1}, so this integral is finite only for r < n. For r = n, since ∫^∞ x^{−1} dx = ∞, the r-th moment does not exist. This illustrates the fact that the t-distribution has ”heavy tails” compared to the normal distribution; indeed E|Z|^r < ∞ for all r if Z is standard normal.
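The heavy tails can be seen numerically; this Python sketch (not from the handout; scipy and the chosen points are illustrative assumptions) compares the tail probabilities P(|T| > x) for t3 with those of the standard normal.

from scipy import stats

for x in (2.0, 3.0, 5.0):
    tail_t = 2 * stats.t(df=3).sf(x)      # P(|T| > x) for t_3
    tail_z = 2 * stats.norm.sf(x)         # P(|Z| > x) for N(0, 1)
    print(x, tail_t, tail_z)
# the t_3 tail probabilities decay only polynomially, the normal ones much faster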

7.3 Some asymptotics

We saw in the plot of the t-distribution that already for 3 degrees of freedom it appears close to the normal. Write

n^{1/2}Z1/√Z2 = Z1/√(n^{−1}Z2),

where Z1, Z2 are independent r.v.’s with standard normal and χ²n-distribution, respectively. Since

n^{−1}Z2 = n^{−1} ∑_{i=1}^n ξi²

for some i.i.d. standard normals ξi, we have by the law of large numbers

n^{−1} ∑_{i=1}^n ξi² →P Eξ1² = 1.

This suggests that the law of n^{1/2}Z1/√Z2, i.e. the law tn, should become close to the law of Z1 as n → ∞. Let us formally prove that statement. We begin with a recap of some probability notions.

Definition 7.3.1 Let Fn, n = 1, 2, . . . be a sequence of distribution functions. The Fn are said to converge in distribution to a limit F (a distribution function) if

Fn(t) → F(t) as n → ∞

for every point of continuity t of F.

A point of continuity is obviously a point where F is continuous. Any distribution function F has left and right side limits at every point t, so continuity at t means that these limits coincide there. When F is


continuous, then in the above statement t must run through all t ∈ R. For instance, the distribution function of every N(µ, σ²) is continuous.
It is also said that a sequence of r.v.’s Yn converges in distribution (or in law), written

Yn ⇒d F,

when the d.f.’s of Yn converge in distribution to F. One also writes Yn ⇒L Y for a r.v. Y having that distribution function F (or also Fn ⇒L F, Fn ⇒D F).

Example 7.3.2 Convergence in probability: if F is the d.f. of the random variable Y = 0 (which is always 0!) then

F(t) = 1_{[0,∞)}(t),

i.e. F(t) jumps only at 0. Now Yn ⇒d Y is equivalent to Yn →P 0 (Exercise).

Example 7.3.3 Let L(Yn) = B(n, λn^{−1}) and L(Y) = Po(λ); then for all events A

sup_A |P(Yn ∈ A) − P(Y ∈ A)| → 0,

which implies Yn ⇒d Y (take A = (−∞, t]).

Example 7.3.4 (Central Limit Theorem). Let Y1, . . . , Yn be independent identically distributed r.v.’s with distribution function F and finite second moment EY1² < ∞. Let

Ȳn = n^{−1} ∑_{i=1}^n Yi

be the average (or sample mean) and σ² = Var(Y1). Then for fixed F and n → ∞

n^{1/2}(Ȳn − EY1) ⇒d N(0, σ²).   (7.6)

The normal law N(0, σ²) is continuous, so in the CLT the relation ⇒d means convergence of the d.f. of the standardized sum n^{1/2}(Ȳn − EY1) to the normal d.f. at every t ∈ R.

Lemma 7.3.5 Suppose Xn is a sequence of r.v.’s which converges in distribution to a continuous limit law P0:

L(Xn) ⇒d P0,

and let Yn be a sequence of r.v.’s which converges in probability to 0:

Yn →P 0.

Then

L(Xn + Yn) ⇒d P0.


Note that no independence assumptions were made.
Proof. Let Fn be the distribution function of Xn and F0 the d.f. of the law P0. Convergence in distribution means that

P(Xn ≤ t) = Fn(t) → F0(t)

for every continuity point t of the limit d.f. F0. We assumed that P0 is a continuous law, so this means convergence for every t. Now for ε > 0

P(Xn + Yn ≤ t) = P({Xn + Yn ≤ t} ∩ ({|Yn| ≤ ε} ∪ {|Yn| > ε}))
               = P({Xn + Yn ≤ t} ∩ {|Yn| ≤ ε}) + P({Xn + Yn ≤ t} ∩ {|Yn| > ε}).

The first term on the right is

P({Xn ≤ t − Yn} ∩ {|Yn| ≤ ε}) ≤ P({Xn ≤ t + ε} ∩ {|Yn| ≤ ε}) ≤ P(Xn ≤ t + ε).

The other term is

P({Xn + Yn ≤ t} ∩ {|Yn| > ε}) ≤ P(|Yn| > ε) → 0 as n → ∞.

On the other hand we have

P(Xn ≤ t + ε) → F0(t + ε) as n → ∞.

Hence for every δ > 0 we can find m1 such that for all n ≥ m1

P(Xn + Yn ≤ t) ≤ F0(t + ε) + 2δ.   (7.7)

Now take the same ε > 0; we have

P(Xn ≤ t − ε) = P({Xn ≤ t − ε} ∩ {|Yn| ≤ ε}) + P({Xn ≤ t − ε} ∩ {|Yn| > ε})
             ≤ P({Xn + Yn ≤ t} ∩ {|Yn| ≤ ε}) + P(|Yn| > ε)
             ≤ P(Xn + Yn ≤ t) + P(|Yn| > ε).

Consequently

P(Xn + Yn ≤ t) ≥ P(Xn ≤ t − ε) − P(|Yn| > ε).

Using again the two limits for the probabilities on the right, for every δ > 0 we can find m2 such that for all n ≥ m2

P(Xn + Yn ≤ t) ≥ F0(t − ε) − 2δ.   (7.8)

Taking m = max(m1,m2) and collecting (7.7), (7.8), we obtain for n ≥ m

F0(t− ε)− 2δ ≤ P (Xn + Yn ≤ t) ≤ F0(t+ ε) + 2δ.

Since F0 is continuous at t, and ε was arbitrary, we can select ε such that

F0(t+ ε) ≤ F0(t) + δ,

F0(t− ε) ≥ F0(t)− δ

so that for n large enough

F0(t)− 3δ ≤ P (Xn + Yn ≤ t) ≤ F0(t) + 3δ

and since δ was also arbitrary, the result follows.


Lemma 7.3.6 Under the assumptions of Lemma 7.3.6, we have

XnYn →P 0.

Proof. Let ε > 0 and δ > 0 be arbitrary and given. Suppose |XnYn| ≥ ε. Then, for every T > 0, either |Xn| > T, or if that is not the case, then |Yn| ≥ ε/T. Hence

P(|XnYn| ≥ ε) ≤ P(|Xn| ≥ T) + P(|Yn| ≥ ε/T).   (7.9)

Now for every T > 0

P(|Xn| > T) = 1 − P(Xn ≤ T) + P(Xn < −T) ≤ 1 − Fn(T) + Fn(−T).

Since Fn converges to F0 at both points T, −T, we find m1 = m1(T) (depending on T) such that for all n ≥ m1

P (|Xn| ≥ T ) ≤ 1− F0(T ) + F0(−T ) + δ

Select now T large enough such that

1− F0(T ) ≤ δ, F0(−T ) ≤ δ.

Then for all n ≥ m1(T)

P(|Xn| ≥ T) ≤ 3δ.

On the other hand, once T is fixed, in view of the convergence in probability to 0 of |Yn|, one can find m2 such that for all n ≥ m2

P (|Yn| ≥ ε/T ) ≤ δ.

In view of (7.9) we have for all n ≥ m = max(m1,m2)

P (|XnYn| ≥ ε) ≤ 4δ.

Since δ > 0 was arbitrary, the result is proved.
We need an auxiliary result which, despite its simplicity, is still frequently cited with a name attached to it.

Lemma 7.3.7 (Slutsky's theorem). Suppose a sequence of random variables Xn converges in probability to a number x0 (Xn →P x0 as n → ∞). Suppose f is a real valued function defined in a neighborhood of x0 and continuous there. Then

f(Xn)→p f(x0), n→∞

Proof. Consider an arbitrary ε > 0. Select δ > 0 small enough such that (x0 − δ, x0 + δ) lies in the above-mentioned neighborhood of x0 and also fulfills the condition that |z − x0| ≤ δ implies |f(z) − f(x0)| ≤ ε (by continuity of f such a δ can be found). Then the event |f(Xn) − f(x0)| > ε implies |Xn − x0| > δ and hence

P (|f(Xn)− f(x0)| > ε) ≤ P (|Xn − x0| > δ) .

Since the latter probability tends to 0 as n→∞, we also have

P (|f(Xn)− f(x0)| > ε)→ 0 as n→∞

and since ε was arbitrary, the result is proved.
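The statement of Lemma 7.3.7 can also be illustrated numerically. The sketch below (an aside, with arbitrary choices: Xn is a sample mean of exponentials, so x0 = 1, and f(x) = x^{−1/2}) estimates P(|f(Xn) − f(x0)| > ε) for growing n.

```python
# Illustration of Lemma 7.3.7: if X_n ->_P x_0 and f is continuous at x_0,
# then f(X_n) ->_P f(x_0).  Here x_0 = 1 and f(x) = x**(-1/2).
import numpy as np

rng = np.random.default_rng(1)
eps, reps = 0.05, 20_000
for n in (10, 100, 1000):
    Xn = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    bad = np.mean(np.abs(Xn**-0.5 - 1.0) > eps)   # estimated P(|f(X_n) - f(x_0)| > eps)
    print(f"n = {n:5d}   P(|f(X_n) - f(x_0)| > {eps}) = {bad:.4f}")
```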


Theorem 7.3.8 The t-distribution with n degrees of freedom converges (in distribution) to the standard normal law as n → ∞.

Proof. Let

Xn = Z1 / √(n⁻¹Z2),

where Z1, Z2 are independent r.v.'s with standard normal and χ²n-distribution, respectively. We know already that

n⁻¹Z2 →P 1.

By Slutsky's theorem, for the r.v.

Y1,n := 1 / √(n⁻¹Z2)

we have Y1,n →P 1, since the function g(x) = x^{−1/2} is continuous at 1 and defined for x > 0. (Indeed we need consider only those x, since n⁻¹Z2 > 0 with probability one.) Now

Xn = Z1 Y1,n = Z1 + Z1 (Y1,n − 1).

Now Y1,n − 1 →P 0, hence by Lemma 7.3.6, Z1(Y1,n − 1) →P 0. Moreover Z1 is a constant sequence with law N(0, 1), which certainly converges in law to N(0, 1). Then by Lemma 7.3.5, Xn ⇒d N(0, 1).

We can translate this limiting statement about the t-distribution into a confidence statement.

Theorem 7.3.9 Let z*α/2 be the upper α/2-quantile of N(0, 1). Then in the Gaussian location-scale model Mc,2, the interval

[µ*−, µ*+] = [X̄n − Sn z*α/2/√n, X̄n + Sn z*α/2/√n]

is an asymptotic α-confidence interval for the unknown expected value µ:

lim_{n→∞} Pµ,σ²([µ*−, µ*+] ∋ µ) ≥ 1 − α.

Here the same quantiles as in the exact interval (7.3) are used, but Sn replaces the unknown σ. In summary: if σ² is unknown, one has the choice between an exact confidence interval (which keeps level 1 − α) based on the t-distribution, or an asymptotic interval (which keeps the confidence level only approximately) based on the normal law. The normal interval would be shorter in general: consider e.g. 10 degrees of freedom and α = 0.05; then for the t-distribution we have zα/2 = 2.228, whereas the normal quantile is z*α/2 = 1.96 (cf. the tables of the normal and t-distributions on pp. 608/609 of [CB]).
Note that in subsection (1.3.2) we discussed a nonasymptotic confidence interval for the Bernoulli parameter p based on the Chebyshev inequality, and mentioned that alternatively a large sample approximation based on the CLT could also have been used. In this section we developed the tools for this.
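The two intervals can be compared on a concrete sample. The sketch below is only an illustration; it assumes that Sn is computed with the 1/(n − 1) normalization (the convention used for the t-interval), and the data and parameters are made up.

```python
# Exact t-interval vs. asymptotic normal interval (Theorem 7.3.9) on one sample.
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(2)
n, alpha = 11, 0.05
x = rng.normal(loc=5.0, scale=2.0, size=n)
xbar, s = x.mean(), x.std(ddof=1)          # S_n with 1/(n-1) normalization (assumed)

z_t = t.ppf(1 - alpha / 2, df=n - 1)       # upper alpha/2 quantile of t_{n-1}
z_n = norm.ppf(1 - alpha / 2)              # upper alpha/2 quantile of N(0,1)
print("exact t-interval:   ", (xbar - z_t * s / np.sqrt(n), xbar + z_t * s / np.sqrt(n)))
print("asymptotic interval:", (xbar - z_n * s / np.sqrt(n), xbar + z_n * s / np.sqrt(n)))
```

As expected, the normal interval is somewhat shorter, at the price of only approximate coverage.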


Chapter 8
TESTING STATISTICAL HYPOTHESES

8.1 Introduction

Consider again the basic statistical model where X is an observed random variable with values in X and the law L(X) is known up to a parameter ϑ from a parameter space Θ: L(X) ∈ {Pϑ; ϑ ∈ Θ}. This time we do not restrict the nature of Θ; this may be a general set, possibly even a set of laws (in this case ϑ is identified with L(X)). Suppose the parameter set Θ is divided into two subsets: Θ = Θ0 ∪ Θ1 where Θ0 ∩ Θ1 = ∅. The problem is to decide on the basis of an observation of X whether the unknown ϑ belongs to Θ0 or to Θ1. Thus two hypotheses are formulated:

H : ϑ ∈ Θ0, the hypothesis,
K : ϑ ∈ Θ1, the alternative.

Of course both of these are "hypotheses", but in testing theory they are treated in a nonsymmetric way (to be explained). In view of this nonsymmetry, one of them is called the hypothesis and the other the alternative. It is traditional to write them as above, with the letter H for the first (the hypothesis) and K for the second (the alternative).

Example 8.1.1 Assume that a new drug has been developed, which is supposed to have a higher probability p of success when applied to an average patient. The new drug will be introduced only if a high degree of certainty can be obtained that it is better. Suppose p0 is the known probability of success of the old drug. Clinical trials are performed to test the hypothesis that p > p0. For the new drug, n patients are tested independently, and success of the drug is measured (we assume that only success or failure of the treatment can be seen in each case). Let the j-th experiment (patient) for the new drug be Xj; assume that the Xj are independent B(1, p). Thus the observations are X = (X1, . . . , Xn), where the Xj are i.i.d. Bernoulli r.v.'s, ϑ = p and Θ = (0, 1). The hypotheses are Θ0 = (0, p0] and Θ1 = (p0, 1).

The motivation for a nonsymmetric treatment of the hypotheses is evident in this example: if the statistical evidence is inconclusive, one would always stay with the old drug. There can be no question of treating H and K the same way. Recall that in section 5.5 we briefly discussed the problem of estimating a signal ϑ ∈ {0, 1} (binary channel, Gaussian channel), where basically both values 0 and 1 are treated the same way, e.g. we use a Bayesian decision rule for prior probabilities 1/2. In contrast, here one will decide 1, i.e. decide in favor of the new drug, only if there is "reasonable statistical certainty".

Definition 8.1.2 A test is a decision rule characterized by an acceptance region S ⊂ X, i.e. a (measurable) subset of the sample space, such that


X ∈ S means that ϑ ∈ Θ0 is accepted,
X ∈ Sᶜ means that ϑ ∈ Θ0 is rejected (thus ϑ ∈ Θ1 is accepted).
The complement Sᶜ is called the critical region (rejection region).

Formally, a test φ is usually defined as a statistic which is the indicator function of a set Sᶜ, such that

φ(X) = 1_{Sᶜ}(X),

where a value φ(X) = 1 is understood as a decision that ϑ ∈ Θ0 is rejected (and 0 that it is accepted). In the above example, a reasonable test would be given by a rejection region

Sᶜ = { x : n⁻¹ Σ_{i=1}^n xi > c }

for realizations x = (x1, . . . , xn), where c is some number fixed in advance. One would decide "ϑ ∈ Θ1" if the number of successes in the sample is large enough.
Tests and estimators have in common that they are statistics which are also decision rules. But the nature of the decisions is different, which is reflected in the loss functions. The natural common loss function for testing problems is "loss is 0 if the decision is correct, and 1 otherwise". Thus if d ∈ {0, 1} is one of the two possible decisions, the loss function is

L(d, ϑ) = d if ϑ ∈ Θ0,  1 − d if ϑ ∈ Θ1.

As for estimation, the risk of a decision rule φ at parameter value ϑ is the expected loss when ϑ is the true parameter. The decision rule is φ(X) = 1_{Sᶜ}(X); its risk is

R(φ, ϑ) = EϑL(φ(X), ϑ) = Eϑφ(X) = Pϑ(Sᶜ) if ϑ ∈ Θ0,
R(φ, ϑ) = EϑL(φ(X), ϑ) = 1 − Eϑφ(X) = Pϑ(S) if ϑ ∈ Θ1.

Thus the risk coincides with the probability of error in each case: for ϑ ∈ Θ0, an erroneous decision is made when X ∈ Sᶜ; when ϑ ∈ Θ1 is true, the error in the decision occurs when X ∈ S. Since both probabilities of error are functions of Eϑ(1 − φ(X)) = Pϑ(S), one can equivalently work with the operating characteristic (OC):

OC(φ, ϑ) = Pϑ(S) = 1 − R(φ, ϑ) if ϑ ∈ Θ0,
OC(φ, ϑ) = Pϑ(S) = R(φ, ϑ) if ϑ ∈ Θ1.

A test with zero risk would require a set S such that Pϑ(S) = 1 if ϑ ∈ Θ0 and Pϑ(S) = 0 if ϑ ∈ Θ1. This is possible only in degenerate cases: if such an S ⊂ X exists, the families {Pϑ; ϑ ∈ Θ0} and {Pϑ; ϑ ∈ Θ1} are said to be orthogonal. In this case "sure" decisions are possible according to whether X ∈ S or not, and one is led outside the realm of statistics.

Example 8.1.3 Let U(a, b) be the uniform law on the interval (a, b), Θ = {0, 1} and P0 = U(0, 1), P1 = U(1, 2). Here S = (0, 1) gives zero error probabilities.


In typical statistical situations, the probability Pϑ(S) depends continuously on the parameter, and the sets Θ0, Θ1 border each other. In this case the risk R(φ, ϑ) cannot be near 0 on the common border.

Example 8.1.4 Let Θ = R, Pϑ = N(ϑ, σ²) (σ² fixed), Θ0 = (−∞, 0], Θ1 = (0, ∞) (Gaussian location model, Model Mc,1, for sample size n = 1). Reasonable acceptance regions are

Sa = {x : x ≤ a}

for some a. The OC of any such test φa = 1_{Sᶜa} is (if ξ is a standard normal r.v. such that X = ϑ + σξ)

OC(φa, ϑ) = Pϑ(Sa) = Pϑ(X ≤ a) = P(ξ ≤ (a − ϑ)/σ) = Φ((a − ϑ)/σ),

where Φ is the standard normal distribution function.

Example 8.1.1 (continued). Suppose p0 = 0.6 and use a critical region {X̄n > c} for c = 0.8. We assume n is not small and use the De Moivre-Laplace central limit theorem for an approximation to the OC. We have

OC(φ, p) = Pp(X̄n ≤ c) = Pp( n^{1/2}(X̄n − p) / (p(1 − p))^{1/2} ≤ n^{1/2}(c − p) / (p(1 − p))^{1/2} )
         ≈ Φ( n^{1/2}(c − p) / (p(1 − p))^{1/2} ),

where ≈ means approximate equality. For a visualization of this approximation to the OC, see the plot below. ¤
These examples show that it is not generally possible to keep error probabilities uniformly small under both hypotheses.
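The quality of this normal approximation is easy to inspect numerically. The following sketch (an illustration added here, using the numbers from the example: p0 = 0.6, c = 0.8, n = 15) compares the approximate OC with the exact OC computed from the binomial law.

```python
# Normal approximation to the OC of the critical region {Xbar_n > c} in the
# Bernoulli model, compared with the exact binomial OC P_p(sum X_i <= n*c).
import numpy as np
from scipy.stats import norm, binom

n, c = 15, 0.8
for p in (0.6, 0.7, 0.8, 0.9):
    approx = norm.cdf(np.sqrt(n) * (c - p) / np.sqrt(p * (1 - p)))
    exact = binom.cdf(np.floor(n * c), n, p)
    print(f"p = {p:.1f}   OC approx {approx:.3f}   exact {exact:.3f}")
```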

Definition 8.1.5 Suppose that L(X) ∈ {Pϑ; ϑ ∈ Θ} and φ is a test for H : ϑ ∈ Θ0 vs. K : ϑ ∈ Θ1 with acceptance region S.
(i) An error of the first kind is: φ takes value 1 when ϑ ∈ Θ0 is true.
An error of the second kind is: φ takes value 0 when ϑ ∈ Θ1 is true.
(ii) φ has significance level (or level) α if

Pϑ(φ(X) = 1) ≤ α for all ϑ ∈ Θ0.

(iii) For ϑ ∈ Θ1, the probability

β(ϑ) = Pϑ(φ(X) = 1)

is called the power of the test φ.


OC of the critical region {x > 2} for testing H : ϑ ≤ 0 vs. K : ϑ > 0 for the family N(ϑ, 1), ϑ ∈ R

Approximation to the OC for the critical region {X̄n > c} for testing H : p ≤ p0 vs. K : p > p0 in the i.i.d. Bernoulli model (Model Md,1), with p0 = 0.6, c = 0.8, n = 15

Thus significance level α means that the probability of an error of the first kind is uniformly at most α. The power is the probability of not making an error of the second kind for a given ϑ in the alternative. In terms of the risk, φ has level α if R(φ, ϑ) ≤ α for all ϑ ∈ Θ0, and the power is

β(ϑ) = 1 − R(φ, ϑ), ϑ ∈ Θ1.


In terms of the OC, φ has level α if OC(φ, ϑ) ≥ 1 − α for all ϑ ∈ Θ0, and the power is

β(ϑ) = 1 − OC(φ, ϑ), ϑ ∈ Θ1.

In Example 8.1.1 it is particularly apparent why the error of the first kind is very sensitive, so that its probability should be kept under a given small α. When actually the old drug is better (ϑ ∈ Θ0), but we decide erroneously that the new one is better, it is a very painful error indeed, with potentially grave consequences. We wish a decision procedure which limits the probability of such a catastrophic misjudgment. But given this restriction, opportunities for switching to a better drug should not be missed, i.e. when the new drug is actually better, then the decision should be able to detect it with as high a probability as possible.
For drug testing, procedures like this (significance level α should be kept for some small α) are required by law in every developed country. In general statistical practice, common values are α = 0.05 and α = 0.01.
If one of the hypothesis sets Θ0, Θ1 consists of only one element, the respective hypothesis (or alternative) is called simple, otherwise it is called composite. An important special case of testing theory is the one where both hypothesis and alternative are simple, i.e. Θ0 = {ϑ0}, Θ1 = {ϑ1} (which means the whole parameter space Θ consists of the two elements ϑ0, ϑ1); in this case the Neyman-Pearson fundamental lemma applies (see below).
A test where Θ0 is simple is called a significance test. The question to be decided there is only whether the hypothesis H : ϑ = ϑ0 can be rejected with significance level α; the alternatives are usually not specified.

8.2 Tests and confidence sets

A confidence set is a random subset A(X) of the parameter space which covers the true parameter with probability at least 1 − α:

Pϑ(A(X) ∋ ϑ) ≥ 1 − α, ϑ ∈ Θ.

Confidence intervals A(X) = [µ−(X), µ+(X)] were treated in detail in section 7.1 for the Gaussian location-scale model and in the introductory section 1.3. Confidence sets are also called domain estimators of the parameter ϑ (estimators which pick a value of ϑ rather than a covering set are called point estimators). There is a close connection between confidence sets and significance tests.

1. Suppose a confidence set A(X) for level 1 − α is given. Let ϑ0 ∈ Θ be arbitrary and consider a simple hypothesis H : ϑ = ϑ0 (vs. alternative K : ϑ ≠ ϑ0). Construct a test φϑ0 by

φϑ0(X) = 1 − 1_{A(X)}(ϑ0),

where 1_{A(X)} is the indicator of the confidence set, as a function of ϑ0. In other words, H : ϑ = ϑ0 is rejected if ϑ0 is outside the confidence set. Then

Pϑ0(φϑ0 = 1) = 1 − Pϑ0(A(X) ∋ ϑ0) ≤ α,

hence φϑ0 is an α-significance test for H : ϑ = ϑ0.


2. We saw that a confidence set generates a family of significance tests, one for each ϑ0 ∈ Θ. Assume now conversely that such a family φϑ, ϑ ∈ Θ, is given, and that they all observe level α. Define

A(X) = {ϑ ∈ Θ : φϑ(X) = 0}.

Then

Pϑ(A(X) ∋ ϑ) = Pϑ(φϑ(X) = 0) = 1 − Pϑ(φϑ(X) = 1) ≥ 1 − α,

i.e. we have found a 1 − α confidence set A(X).
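The duality in 1. and 2. is easy to mimic in code. The sketch below (an aside; it assumes the Gaussian location-scale setting of section 7.1 and the 1/(n − 1) convention for Sn) builds the t-interval and the corresponding family of tests from one another.

```python
# Duality between a confidence interval and a family of significance tests,
# illustrated with the t-interval for the mean.
import numpy as np
from scipy.stats import t

def t_interval(x, alpha=0.05):
    n = len(x)
    half = t.ppf(1 - alpha / 2, n - 1) * np.std(x, ddof=1) / np.sqrt(n)
    return x.mean() - half, x.mean() + half

def test_from_interval(x, mu0, alpha=0.05):
    lo, hi = t_interval(x, alpha)
    return not (lo <= mu0 <= hi)          # phi_{mu0}(X) = 1  <=>  mu0 outside A(X)

def interval_from_tests(x, grid, alpha=0.05):
    # A(X) = { mu0 in grid : phi_{mu0}(X) = 0 }
    return [mu0 for mu0 in grid if not test_from_interval(x, mu0, alpha)]

x = np.random.default_rng(7).normal(1.0, 1.0, size=25)
print("interval:", t_interval(x))
print("recovered from tests:", interval_from_tests(x, np.linspace(0, 2, 21))[:3], "...")
```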

For a more general setting, let γ(ϑ) be a function of the parameter (with values in an arbitrary set Γ). A confidence set for γ(ϑ) is defined by

Pϑ(A(X) ∋ γ(ϑ)) ≥ 1 − α, ϑ ∈ Θ.

For instance ϑ might have two components, ϑ = (ϑ1, ϑ2), and γ(ϑ) = ϑ1. Then the above family of tests should be indexed by γ ∈ Γ, and φγ0 has level α for a hypothesis H : γ(ϑ) = γ0. This hypothesis is composite if γ is not one-to-one (then φγ0 cannot be called a significance test).
As an example, consider the Gaussian location-scale model with unknown σ² (Model Mc,2). Here ϑ = (µ, σ²) ∈ R × (0, ∞). Consider the quantity

Tµ = Tµ(X) = (X̄n − µ) n^{1/2} / Sn

used to build a confidence interval

[µ−, µ+] = [X̄n − Sn zα/2/√n, X̄n + Sn zα/2/√n]

for the unknown expected value µ. The level 1 − α was kept for all unknown σ² > 0, i.e. we have a confidence interval for the parameter function γ(ϑ) = µ.
As we remarked, Tµ(X) depends on the parameter µ and therefore is not a statistic. Such a function of both the observations and the parameter, used to build a confidence interval, is called a pivotal quantity. The knowledge of the law of the pivotal quantity under the respective parameter is the basis for a confidence interval. When looking at the significance test derived from Tµ(X), for a hypothesis H : µ = µ0, we find that the test is

φµ0(X) = 1 if |Tµ0(X)| > zα/2,   (8.1)

where zα/2 is the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. From this point of view, when µ = µ0 is a "known" hypothesis, Tµ0(X) does not depend on an unknown parameter, and is thus a statistic. In the example, it is the t-statistic for testing H : µ = µ0, and the test (8.1) is the two sided t-test. The basic result implied by Theorem 7.2.8 about this test is the following.

Theorem 8.2.1 In the Gaussian location-scale model Mc,2, for sample size n, consider the hypothesis H : µ = µ0. Let zα/2 be the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. Then the two sided t-test (8.1) has level α for any unknown σ² > 0.
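A minimal sketch of the two sided t-test (8.1) follows; it is an illustration only, with made-up data, and it assumes the 1/(n − 1) normalization for Sn (which is what makes Tµ0 exactly t_{n−1}-distributed under H).

```python
# Two sided t-test for H : mu = mu0 at level alpha, as in (8.1).
import numpy as np
from scipy.stats import t

def two_sided_t_test(x, mu0, alpha=0.05):
    n = len(x)
    s = np.std(x, ddof=1)                       # S_n, 1/(n-1) normalization (assumed)
    T = np.sqrt(n) * (np.mean(x) - mu0) / s     # t-statistic T_{mu0}(X)
    crit = t.ppf(1 - alpha / 2, df=n - 1)       # upper alpha/2 quantile of t_{n-1}
    return abs(T) > crit, T, crit

x = np.random.default_rng(3).normal(loc=0.4, scale=1.0, size=20)
reject, T, crit = two_sided_t_test(x, mu0=0.0)
print(f"T = {T:.3f}, critical value = {crit:.3f}, reject H: {reject}")
```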


Analogously, when σ² is known, the test

φµ0(X) = 1 if |Zµ0(X)| > z*α/2,   (8.2)

where z*α/2 is the upper α/2-quantile of N(0, 1) and

Zµ = Zµ(X) = (X̄n − µ) n^{1/2} / σ

is the Z-statistic, is called the two sided Gauss test for H : µ = µ0. The t-statistic Tµ0(X) can then be construed as a studentized Z-statistic Zµ0(X).
Let us investigate the relationship between the power of φϑ and the confidence interval. Suppose that we have two 1 − α confidence sets A*(X), A(X) for ϑ itself, where

A∗(X) ⊆ A(X)

i.e. A*(X) is contained in A(X) (in the case of intervals, A*(X) would be shorter or of equal length). For the respective families φϑ0, φ*ϑ0, ϑ0 ∈ Θ, of α-significance tests this means that

φϑ0(X) = 1 implies φ*ϑ0(X) = 1,

hence

Pϑ(φϑ0 = 1) ≤ Pϑ(φ*ϑ0 = 1) for all ϑ ∈ Θ.

At ϑ ≠ ϑ0 these are precisely the respective powers of the two tests φϑ0, φ*ϑ0. (At ϑ = ϑ0 the relation implies that for A(X) to keep level 1 − α, it is sufficient that A*(X) keeps this level.) Thus φ*ϑ0 has uniformly better (or at least as good) power for all ϑ ≠ ϑ0.
It was mentioned that shorter confidence intervals are desirable (given a confidence level), since they enable "sharper" decision making. Translating this into a power relation for tests, we have made the idea more transparent. The assumed inclusion A*(X) ⊆ A(X) implies a larger critical region for φ*ϑ0: {φϑ0 = 1} ⊆ {φ*ϑ0 = 1}. This does not describe all situations in which φ*ϑ0 might have better power (and thus A*(X) is better in some sense); we shall not further investigate the "power" of confidence intervals here but will concentrate on tests instead.
However, asymptotic confidence intervals should be discussed briefly. The statement of Theorem 7.3.9 can be translated immediately into the language of test theory. When the law of the observed r.v. X depends on n, we write Xn and L(Xn) ∈ {Pϑ,n; ϑ ∈ Θ} for the family of laws. Here the observation space Xn might also depend on n, as in the case of n i.i.d. observed variables.

Definition 8.2.2 (i) A sequence of tests φn = φn(Xn) for testing H : ϑ ∈ Θ0 vs. K : ϑ ∈ Θ1 has asymptotic level α if

limsup_{n→∞} Pϑ,n(φn(Xn) = 1) ≤ α for all ϑ ∈ Θ0.

(ii) The sequence φn is consistent if it has asymptotic power one, i.e.

lim_{n→∞} Pϑ,n(φn(Xn) = 1) = 1 for all ϑ ∈ Θ1.


We saw in Theorem 7.3.9 that in Model Mc,2, the interval

[µ*−, µ*+] = [X̄n − Sn z*α/2/√n, X̄n + Sn z*α/2/√n]

is an asymptotic α-confidence interval for the unknown expected value µ:

liminf_{n→∞} Pµ,σ²([µ*−, µ*+] ∋ µ) ≥ 1 − α,

where z*α/2 is a normal quantile. Thus if φ*µ0 is the derived test for H : µ = µ0, then

limsup_{n→∞} Pµ0,σ²(φ*µ0 = 1) = 1 − liminf_{n→∞} Pµ0,σ²([µ*−, µ*+] ∋ µ0) ≤ α,

so that every φ*µ0 is an asymptotic α-test.

Exercise. Consider the test φ*µ0 (i.e. the test based on the t-statistic as in (8.1), where the quantile z*α/2 of the standard normal is used in place of zα/2). Show that in the Gaussian location-scale model Mc,2, for sample size n, this test is consistent as n → ∞ on the pertaining alternative

Θ1(µ0) = {(µ, σ²) : µ ≠ µ0}.

Consistency of one sided Gauss tests (OC-plot)


Consistency of two sided Gauss tests (OC-plot)

Above, the first plot, in the Gaussian location model with σ² = 1, gives the OC of a test of H : µ ≤ µ0 vs. K : µ > µ0 of the type

φ(X) = 1 if X̄n > cn, and 0 otherwise,

where cn is selected such that φ is an α-test, for α = 0.05 and sample sizes n = 1, 2, 4 respectively. This is the same situation as in the first plot on p. 96, only α is selected as one of the common values (in the other figure we just took c1 = 2 and did not care about the resulting α), and three sample sizes are plotted. This is a one sided Gauss test; we have not yet discussed the respective theory, but it can be observed that these tests keep level α on the whole composite hypothesis H : µ ≤ µ0. Moreover, the behaviour of consistency is visible (for larger n, the power increases).
The second plot concerns the simple hypothesis H : µ = µ0 in the same model (σ² = 1, α = 0.05, n = 1, 2, 4) and the two sided Gauss test (8.2) derived from the confidence interval (7.3). The middle OC-line for n = 2 is dotted.

8.3 The Neyman-Pearson Fundamental Lemma

We saw that in testing theory, a basic goal is to maximize the power of tests while keeping significance level α.

Definition 8.3.1 Suppose that L(X) ∈ {Pϑ; ϑ ∈ Θ}. An α-test φ = φ(X) for testing H : ϑ ∈ Θ0 vs. K : ϑ ∈ Θ1 is uniformly most powerful (UMP) if for every other test ψ which is an α-test for the problem,

sup_{ϑ∈Θ0} Eϑψ ≤ α,

we have

Eϑψ ≤ Eϑφ for all ϑ ∈ Θ1.


Typically it is not possible to find such a UMP test; some tests do better at particular points in the alternative, at the expense of the power at other points. An example is given in the following plot.

Power of the one sided and the two sided Gauss test for H : µ = 0 vs. K : µ ≠ 0

In the Gaussian location model with σ² = 1 and n = 1, for the hypotheses H : µ = 0 vs. K : µ ≠ 0, we look at the OC of two tests:

one sided Gauss test: φ1(X) = 1 if X̄n > c1, and 0 otherwise;
two sided Gauss test: φ2(X) = 1 if |X̄n| > c2, and 0 otherwise,

where c1, c2 are selected such that both φ1, φ2 are α-tests for α = 0.05. This is the same situation as in the two OC-plots at the end of section 8.2, where now we put the two tests in one picture and omit the larger sample sizes n = 2, 4. (Here of course X̄n = X1 if n = 1, but the same curves appear whenever Var(X̄n) = 1, i.e. σ²/n = 1.)
We see that φ1 is better than φ2 for alternatives µ > 0, but it is much worse for alternatives µ < 0. If we really are interested also in alternatives µ < 0 (i.e. we wish to detect these, and not just be content with a statement µ ≤ 0) we should apply the two sided test; the one sided φ1 is totally implausible for the two sided alternative µ ≠ 0. However, even though it is implausible, φ1 has better power for µ > 0.
In one special situation it is possible to find a UMP test, namely when both hypothesis and alternative are simple. In this case, we have only one point in the alternative, and maximizing the power turns out to be possible. Recall the continuous statistical model, first defined in section 4.3:

Model Mc The observed random variable X = (X1, . . . , Xk) is continuous with values in R^k and L(X) ∈ {Pϑ, ϑ ∈ Θ}. Each law Pϑ is described by a joint density pϑ(x) = pϑ(x1, . . . , xk), and Θ ⊆ R^d.

(Earlier we required that Θ be an open set, but this is omitted now).


Definition 8.3.2 Assume Model Mc, and that Θ = {ϑ0, ϑ1} consists of only two elements. A test φ for the hypotheses
H : ϑ = ϑ0
K : ϑ = ϑ1
is called a Neyman-Pearson test of level α if

φ(x) = 1 if pϑ1(x) > c pϑ0(x), and 0 otherwise,

where the value c is chosen such that

Pϑ0(φ(X) = 1) = α.   (8.3)

We should first show that a Neyman-Pearson test exists. Of course we can take any c and build a test according to the above rule. This rule seems plausible: given x, each of the two densities can be regarded as a likelihood. We might say that ϑ1 is more "likely" if the ratio of likelihoods pϑ1(x)/pϑ0(x) is sufficiently large. We recognize an application of the likelihood principle (recall that this consists in regarding the density as a function of ϑ ∈ Θ when x is already observed, and assigning corresponding likelihoods to each ϑ).
The question is only whether c can be chosen such that (8.3) holds. Define a random variable

L = L(X) = pϑ1(X)/pϑ0(X) if pϑ0(X) > 0, and 0 otherwise,

where X has distribution Pϑ0, and let FL be its distribution function

FL(t) = Pϑ0(L(X) ≤ t).

Recall that distribution functions F are monotone increasing, right continuous, and 0 at −∞, 1 at ∞. To ensure (8.3), we make

Assumption L. The distribution function FL(t) is continuous.

In this case

Pϑ0(L(X) = t) = 0

for all t, and for c > 0 consider the probability Pϑ0(φ(X) = 1). Since

Pϑ0(pϑ0(X) = 0) = 0,

we have

Pϑ0(φ(X) = 1) = Pϑ0(φ(X) = 1 ∩ pϑ0(X) > 0)
             = Pϑ0(L(X) > c ∩ pϑ0(X) > 0)
             = Pϑ0(L(X) > c) = 1 − Pϑ0(L(X) ≤ c) = 1 − FL(c),

which is continuous and monotone in c, with limit 0 as c → ∞. Furthermore, FL(t) = 0 for all t < 0 since L(X) is nonnegative, and in view of the continuity of FL (Assumption L) we have

FL(c) → 0, i.e. 1 − FL(c) → 1, as c → 0.

Thus a value c with (8.3) exists for every α ∈ (0, 1).


Example 8.3.3 Let Pϑ = N(ϑ, 1), ϑ ∈ Θ (Gaussian location). Here

L(x) = pϑ1(x)/pϑ0(x) = exp( −((x − ϑ1)² − (x − ϑ0)²)/2 ) = exp( x(ϑ1 − ϑ0) ) exp( −(ϑ1² − ϑ0²)/2 ).

Now for any t > 0

Pϑ0(L(X) = t) = Pϑ0( X(ϑ1 − ϑ0) − (ϑ1² − ϑ0²)/2 = log t ) = Pϑ0(X = t0(ϑ1, ϑ0, t)),

where t0 is a well defined number if ϑ1 ≠ ϑ0. However, for a normal X this probability is 0 for any t0. Moreover L(X) = 0 cannot happen since exp(z) > 0 for all z, so that Assumption L is fulfilled if ϑ1 ≠ ϑ0.

Example 8.3.4 Let U(a, b) be the uniform law on the interval [a, b] and Pϑ0 = U(0, 1), Pϑ1 = U(0, a) where 0 < a < 1. Then pϑ1(x) = a⁻¹ for x ∈ [0, a], 0 otherwise, and

L(x) = pϑ1(x) = a⁻¹ if x ∈ [0, a], and 0 otherwise.

Thus if X has law Pϑ0 then L takes only the values a⁻¹ and 0, and

P(L(X) = a⁻¹) = a,
P(L(X) = 0) = 1 − a.

Thus FL is not continuous; it jumps at 0 and at a⁻¹ and is constant at other points.

In the latter example we cannot guarantee the existence of a Neyman-Pearson test. We will remedy this situation later; let us first prove the optimality of Neyman-Pearson tests (abbreviated N-P tests henceforth). We do not claim that there is only one N-P test for a given level α (the c may not be unique, and also one can use different versions of the densities, e.g. modifying them in some points, etc.).

Theorem 8.3.5 (Neyman-Pearson fundamental lemma). In Model Mc, for Θ = {ϑ0, ϑ1}, under Assumption L, any N-P test of level α (0 < α < 1) is a most powerful α-test for the hypotheses H : ϑ = ϑ0 vs. K : ϑ = ϑ1.

Proof. Let φ be a N-P test and SNP its acceptance region

SNP = {x : pϑ1(x) ≤ c pϑ0(x)}.

We can assume c > 0 here, since

1 − α = Pϑ0(L(X) ≤ c),

and for c = 0, α < 1 this would mean that Pϑ0(L(X) = 0) > 0, which we excluded by Assumption L. Let ψ be any α-test and S its acceptance region. We have to show

Pϑ1(S) ≥ Pϑ1(SNP).


Define A = S \ SNP, A′ = SNP \ S; then

S = (S ∩ SNP) ∪ A,  SNP = (S ∩ SNP) ∪ A′,

and it suffices to show

Pϑ1(A) ≥ Pϑ1(A′).   (8.4)

Now since ψ is an α-test, we have

Pϑ0(S) ≥ 1 − α = Pϑ0(SNP),

which implies

Pϑ0(A) ≥ Pϑ0(A′).   (8.5)

Since A ⊆ Sᶜ_NP, we have for any x ∈ A that pϑ1(x) > c pϑ0(x), hence

Pϑ0(A) = ∫_A pϑ0(x) dx ≤ c⁻¹ ∫_A pϑ1(x) dx = c⁻¹ Pϑ1(A),   (8.6)

and since A′ ⊆ SNP, we have

Pϑ0(A′) ≥ c⁻¹ ∫_{A′} pϑ1(x) dx = c⁻¹ Pϑ1(A′).   (8.7)

Relations (8.7), (8.6), (8.5) imply (8.4).
Let us consider the case where Assumption L is not fulfilled. For any test, the quantity Pϑ0(φ(X) = 1) is called the size of the test. We saw in the above proof that the assumption that the Neyman-Pearson test φ has size exactly α was essential. Recall that

Pϑ0(φ(X) = 1) = 1 − FL(c)

and that FL(c) is continuous from the right. When Assumption L is not fulfilled, the following situation may occur: there is a c0 such that for all c < c0, FL(c) < 1 − α, and at c0 we have FL(c0) > 1 − α. In other words, the function 1 − FL(c) jumps in such a way that α is not attained. In order to deal with this situation, let us generalize the notion of a test function.

Definition 8.3.6 A randomized test φ (based on the data X) is any statistic such that 0 ≤ φ(X) ≤ 1.

When the value of φ is between 0 and 1, the interpretation is that the decision between hypothesis H and alternative K is taken randomly, such that φ is the probability of deciding K. Thus, given the data X = x, a Bernoulli random variable Z is generated with law (conditional on x) L(Z) = B(1, φ(x)), and the decision is Z. The former nonrandomized test functions are special cases: when φ(x) = 1 or φ(x) = 0, the Bernoulli r.v. Z is degenerate and takes the corresponding value 1 or 0 with probability one. For a randomized test we have, for ϑ = ϑ0 or ϑ = ϑ1, and writing Pϑ(Z = ·) for the unconditional probability in Z (when X is random),

Pϑ(Z = 1) = Eϑ P(Z = 1 | X) = Eϑ φ(X),   (8.8)
Pϑ(Z = 0) = 1 − Eϑ φ(X),


so that both the errors of the first and second kind are a function of the expected value of φ(X) under the respective hypothesis.
This method of introducing artificial randomness into the decision process should be regarded with common sense reservations from a practical point of view. However, randomized tests provide a completion of the theory and therefore a better understanding of the basic problems of statistics. For instance, the inclusion of randomized tests allows one to state that there is always a level α test: take φ(X) = α, independently of the data. But the power of that trivial test is also α, i.e. not very good.
In the above situation, when there is no c such that FL(c) = 1 − α, consider the left limit of FL at c0:

FL,−(c0) := lim_{c↗c0} FL(c) = Pϑ0(L(X) < c0)

(this always exists for monotone functions); then the height of the jump of FL at c0 is the probability that the r.v. L(X) takes the value c0:

Pϑ0(L(X) = c0) = FL(c0) − FL,−(c0) = Pϑ0(L(X) ≤ c0) − Pϑ0(L(X) < c0) > 0.

Define

γα = ( α − Pϑ0(L(X) > c0) ) / Pϑ0(L(X) = c0);

then

α = Pϑ0(L(X) > c0) + γα Pϑ0(L(X) = c0).   (8.9)

Moreover, since FL,−(c0) is a limit of values which are all < 1 − α,

FL,−(c0) ≤ 1 − α,

and by assumption FL(c0) > 1 − α, hence

0 < γα = ( FL(c0) − (1 − α) ) / ( FL(c0) − FL,−(c0) ) ≤ 1.

That allows us to construe γα as the value of a randomized test φ, which is taken if the event L(X) = c0 occurs. We can then define Neyman-Pearson tests for any statistical model, provided the likelihood ratio L(X) is defined as a random variable. Then the distribution function FL is defined, and we may construct a level α test as above.

Definition 8.3.7 Assume Model Md (discrete) or Model Mc (continuous), and that Θ = {ϑ0, ϑ1}. Let pϑ be either probability functions or densities and define the likelihood ratio as a function of the data x:

L = L(x) = pϑ1(x)/pϑ0(x) if pϑ0(x) > 0, and +∞ otherwise.

A test φ for the hypotheses
H : ϑ = ϑ0
K : ϑ = ϑ1


Figure 1 Construction of the randomized Neyman-Pearson test

is called a randomized Neyman-Pearson test of level α if there exist c ∈ [0, ∞) and γ ∈ [0, 1] such that

φ(x) = 1 if L(x) > c,  γ if L(x) = c,  0 if L(x) < c,

and such that

Pϑ0(L(X) > c) + γ Pϑ0(L(X) = c) = α.   (8.10)

Note that we modified the definition of L(x): formerly we took L(x) = 0 if pϑ0(x) = 0. But under H such values of x do not occur anyway (or only with probability 0), so that FL as considered before remains the same and a level α is attained. The modification ensures that if we decide on the basis of L (reject if L is large enough), then pϑ0(x) = 0 implies that the decision is always 1.
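The construction of c and γ can be carried out explicitly in discrete models. The sketch below (an illustration with made-up numbers) does this for X ~ B(n, p) and H : p = p0 vs. K : p = p1 with p1 > p0; since the likelihood ratio is then increasing in the number of successes x, the event {L > c} corresponds to {x > k} for an integer k, and γ is chosen so that (8.10) holds exactly.

```python
# Randomized Neyman-Pearson test for a binomial observation, H: p = p0 vs. K: p = p1 > p0.
from scipy.stats import binom

def np_test_binomial(n, p0, alpha):
    # smallest k with P_{p0}(X > k) <= alpha
    k = 0
    while binom.sf(k, n, p0) > alpha:
        k += 1
    gamma = (alpha - binom.sf(k, n, p0)) / binom.pmf(k, n, p0)
    return k, gamma           # reject if x > k, reject with probability gamma if x == k

k, gamma = np_test_binomial(n=20, p0=0.5, alpha=0.05)
print(f"reject if x > {k}, randomize with gamma = {gamma:.3f} at x = {k}")
# size check: P(X > k) + gamma * P(X = k) equals alpha
print(binom.sf(k, 20, 0.5) + gamma * binom.pmf(k, 20, 0.5))
```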

Theorem 8.3.8 (Neyman-Pearson fundamental lemma, general case). In Model Md (discrete) or Model Mc (continuous), for Θ = {ϑ0, ϑ1} and any α (0 < α < 1), a N-P test of level α exists and is a most powerful α-test (among all randomized tests) for the hypotheses H : ϑ = ϑ0 vs. K : ϑ = ϑ1.

Proof. We have shown existence above: L(X) is a well defined r.v. if X has law Pϑ0 (it takes the value ∞ only with probability 0); it has distribution function FL. If a c0 exists such that FL(c0) = 1 − α, then take c = c0 and γ = 0. Otherwise, find c and γ as above, fulfilling (8.9). Only the properties of distribution functions were used for establishing (8.9), i.e. (8.10). Let Z be the "randomizing" random variable, which has conditional law L(Z) = B(1, φ(x)) given X = x. Then the probability under H that H is rejected is

Pϑ0(Z = 1) = Eϑ0 φ(X) = 1 · Pϑ0(L(X) > c) + γ · Pϑ0(L(X) = c) = α,   (8.11)


i.e. the test has indeed size α. The optimality proof is analogous to Theorem 8.3.5. We assume that the pϑ(x) are densities; the case of probability functions requires only changes in notation. According to (8.11), let φ be a N-P test with given c, γ and

S> := {x : L(x) > c},  S= := {x : L(x) = c},  S< := {x : L(x) < c}.

Then for any randomized α-test ψ

Eϑ1ψ(X) = ∫_{S>} ψ(x) pϑ1(x) dx + ∫_{S= ∪ S<} ψ(x) pϑ1(x) dx
        ≤ ∫_{S>} ψ(x) pϑ1(x) dx + c ∫_{S= ∪ S<} ψ(x) pϑ0(x) dx
        = ∫_{S>} ψ(x) (pϑ1(x) − c pϑ0(x)) dx + c ∫ ψ(x) pϑ0(x) dx.

The second term on the right is bounded from above by cα, since ψ is a level α test. For the first term, since pϑ1(x) − c pϑ0(x) > 0 on S> and ψ(x) ≤ 1, we obtain an upper bound by substituting 1 for ψ. Hence

Eϑ1ψ(X) ≤ ∫_{S>} (pϑ1(x) − c pϑ0(x)) dx + cα
        = ∫_{S>} φ(x) (pϑ1(x) − c pϑ0(x)) dx + c ∫ φ(x) pϑ0(x) dx
        = ∫_{S>} φ(x) pϑ1(x) dx + c ∫_{S=} φ(x) pϑ0(x) dx
        = ∫_{S> ∪ S=} φ(x) pϑ1(x) dx ≤ Eϑ1φ(X).

In some cases the Neyman-Pearson lemma allows the construction of UMP tests for composite hypotheses. Consider the Gaussian location model (Model Mc,1), for sample size n, and the hypotheses H : µ ≤ µ0, K : µ > µ0. Consider the one sided Gauss test φµ0: for the test statistic

Zµ0(X) = (X̄n − µ0) n^{1/2} / σ

the test is defined by

φµ0(X) = 1 if Zµ0(X) > zα   (8.12)

(0 otherwise), where zα is the upper α-quantile of N(0, 1). As was argued in Example 8.3.3, Assumption L is fulfilled here; hence for Neyman-Pearson tests of two simple hypotheses within this model randomization is not needed. We have composite hypotheses now, but the following can be shown.

Proposition 8.3.9 In the Gaussian location model (Model Mc,1), for sample size n and the test problem H : µ ≤ µ0 vs. K : µ > µ0, for any 0 < α < 1 the one sided Gauss test (8.12) is a UMP α-test.


Proof. Note first that φµ0 is an α-test for H : µ ≤ µ0. Indeed, for any µ ≤ µ0,

Pµ(Zµ0(X) > zα) = Pµ( (X̄n − µ) n^{1/2}/σ > zα + (µ0 − µ) n^{1/2}/σ )
                ≤ Pµ( (X̄n − µ) n^{1/2}/σ > zα ) = α.

Consider now any point µ1 > µ0. We claim that for the simple hypotheses H* : µ = µ0, K* : µ = µ1 the test φµ0 is a Neyman-Pearson test of level α. Indeed, when pµ0, pµ1 are the densities, then for x = (x1, . . . , xn) (ϕ is the standard normal density)

L(x) = Π_{i=1}^n ϕ((xi − µ1)/σ) / ϕ((xi − µ0)/σ)
     = exp( −( Σ_{i=1}^n (xi − µ1)² − Σ_{i=1}^n (xi − µ0)² ) / (2σ²) )
     = exp( n x̄n (µ1 − µ0)/σ² ) exp( −(nµ1² − nµ0²)/(2σ²) ).

Since µ1 > µ0, L(x) is a monotone increasing function of x̄n, and L(x) > c is equivalent to x̄n > c* for some c*. In turn, x̄n is a monotone function of Zµ0(x) = (x̄n − µ0) n^{1/2}/σ, thus L(x) > c is equivalent to Zµ0(x) > c**. We find c** from the level α condition:

Pµ0(Zµ0(X) > c**) = α,

and since Zµ0(X) is standard normal under µ0, we find c** = zα. Thus φµ0 is a Neyman-Pearson α-test for H* : µ = µ0 vs. K* : µ = µ1. Any test ψ of level α for the original composite hypotheses is also a test of level α for H* : µ = µ0 vs. K* : µ = µ1, so that the fundamental lemma (Theorem 8.3.5) implies

Eµ1ψ ≤ Eµ1φµ0.

Since µ1 > µ0 was arbitrary, the proposition follows.
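The power of the one sided Gauss test has a closed form, Pµ(Zµ0(X) > zα) = 1 − Φ(zα − (µ − µ0)n^{1/2}/σ), which the following sketch evaluates for a few alternatives (an illustration only; µ0, σ, n and α are arbitrary choices).

```python
# One sided Gauss test (8.12): power on the alternative mu > mu0.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 10, 0.05
z_alpha = norm.ppf(1 - alpha)                  # upper alpha quantile of N(0, 1)

def power(mu):
    return 1 - norm.cdf(z_alpha - (mu - mu0) * np.sqrt(n) / sigma)

for mu in (0.0, 0.25, 0.5, 1.0):
    print(f"mu = {mu:.2f}   power {power(mu):.3f}")
```

At µ = µ0 the value is exactly α, and the power increases towards 1 as µ moves away from µ0, in line with the OC-plots above.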

8.4 Likelihood ratio tests

We saw that the Neyman-Pearson lemma is closely connected to the likelihood principle: the N-P test rejects if L(x) = pϑ1(x)/pϑ0(x) is too large at the observed x. The density (or probability function) pϑ(x) is a likelihood of ϑ when x is fixed (observed) and ϑ varies, and the decision is taken as a function of the two likelihoods. That idea can be carried over to general composite hypotheses.

Definition 8.4.1 Assume Model Md (discrete) or Model Mc (continuous), and that Θ = Θ0 ∪ Θ1 where Θ0 ∩ Θ1 = ∅. Let pϑ be either probability functions or densities and define the likelihood ratio as a function of the data x:

L = L(x) = sup_{ϑ∈Θ1} pϑ(x) / sup_{ϑ∈Θ0} pϑ(x) if sup_{ϑ∈Θ0} pϑ(x) > 0, and +∞ otherwise.

A (possibly randomized) test φ for the hypotheses
H : ϑ ∈ Θ0


K : ϑ ∈ Θ1
is called a likelihood ratio test (LR test) if there exists c ∈ [0, ∞) such that

φ(x) = 1 if L(x) > c,  0 if L(x) < c.

Note that for L(x) = c we made no requirement; any value φ(x) in [0, 1] is possible, so that the test is possibly a randomized one. Neyman-Pearson tests are a special case for simple hypotheses. One interpretation is the following. Suppose that the suprema over both hypotheses are attained, so that for certain ϑ̂i(x) ∈ Θi, i = 0, 1, we have

sup_{ϑ∈Θi} pϑ(x) = max_{ϑ∈Θi} pϑ(x) = p_{ϑ̂i(x)}(x), i = 0, 1.

Then the ϑ̂i(x) are maximum likelihood estimators (MLE) of ϑ under the assumption ϑ ∈ Θi, and the LR test can be interpreted as a Neyman-Pearson test for the simple hypotheses H : ϑ = ϑ̂0(x) vs. K : ϑ = ϑ̂1(x). Of course this is pure heuristics and none of the Neyman-Pearson optimality theory applies, since the "hypotheses" have been formed on the basis of the data.
Consider the Gaussian location-scale model and recall the form of the t-statistic for given µ0:

Tµ0(X) = (X̄n − µ0) n^{1/2} / Sn.

The two sided t-test was already defined (cp. (8.1)); it rejects when |Tµ0(X)| is too large. The one sided t-test is the test which rejects when Tµ0(X) is too large (in analogy to the one sided Gauss test for known σ²).

Proposition 8.4.2 Consider the Gaussian location-scale model (Model Mc,2), for sample size n.
(i) For the hypotheses H : µ ≤ µ0 vs. K : µ > µ0, the one sided t-test is a LR test.
(ii) For the hypotheses H : µ = µ0 vs. K : µ ≠ µ0, the two sided t-test is a LR test.

Proof. In relation (3.4) in the proof of Proposition 3.0.5, we obtained the following form of the density pµ,σ² of the data x = (x1, . . . , xn):

pµ,σ²(x) = (2πσ²)^{−n/2} exp( −(Sn² + (x̄n − µ)²) / (2σ²n⁻¹) ),   (8.13)

Sn² = n⁻¹ Σ_{i=1}^n (xi − x̄n)².   (8.14)

Consider first the two sided case (ii). To find MLE's of µ and σ² under µ ≠ µ0, we first maximize for fixed σ² over all possible µ ∈ R. This gives the unrestricted MLE µ̂ = x̄n, and since x̄n = µ0 with probability 0, we obtain that µ̂1 = x̄n is the MLE of µ with probability 1 under K. We now have to maximize

lx(σ²) = (σ²)^{−n/2} exp( −Sn² / (2σ²n⁻¹) )

over σ² > 0. For notational convenience we set γ = σ²; equivalently, one may minimize

l̄x(γ) = − log lx(γ) = (n/2) log γ + nSn² / (2γ).


Note that if Sn² > 0, then l̄x(γ) → ∞ for γ → 0 and also l̄x(γ) → ∞ for γ → ∞, so that a minimum exists and is a zero of the derivative of l̄x. The event Sn² > 0 has probability 1, since otherwise xi = x̄n, i = 1, . . . , n, i.e. all xi are equal, which clearly has probability 0 for independent continuous xi. We obtain

l̄x′(γ) = n/(2γ) − nSn²/(2γ²) = 0,  γ = Sn²,

as the unique zero, so σ̂1² = Sn² is the MLE of σ² under K. Thus

max_{µ≠µ0, σ²>0} pµ,σ²(x) = (2πσ̂1²)^{−n/2} exp( −(Sn² + (x̄n − µ̂1)²) / (2σ̂1²n⁻¹) ) = (σ̂1²)^{−n/2} (2π)^{−n/2} exp(−n/2).

Now under H the MLE of µ is µ̂0 = µ0. Defining

S0,n² := Sn² + (x̄n − µ0)² = n⁻¹ Σ_{i=1}^n (xi − µ0)²,

we see that the MLE of σ² under H is σ̂0² = S0,n². Hence

max_{µ=µ0, σ²>0} pµ,σ²(x) = (σ̂0²)^{−n/2} (2π)^{−n/2} exp(−n/2),

and the likelihood ratio L is

L(x) = max_{µ≠µ0, σ²>0} pµ,σ²(x) / max_{µ=µ0, σ²>0} pµ,σ²(x) = (σ̂0²)^{n/2} / (σ̂1²)^{n/2} = ( (Sn² + (x̄n − µ0)²) / Sn² )^{n/2}.

Note that, since the standard deviation appearing in the t-statistic uses the normalization 1/(n − 1) while Sn² in (8.14) uses 1/n, we have

Tµ0(X) = (n − 1)^{1/2} (X̄n − µ0) / Sn,

hence

L(X) = ( 1 + Tµ0(X)²/(n − 1) )^{n/2}.

Thus L(x) is a strictly monotone increasing function of |Tµ0|, which proves (ii).

Consider now claim (i). For the hypotheses H : µ ≤ µ0 vs. K : µ > µ0, the one sided t-test rejects when the t-statistic

Tµ0(X) = (n − 1)^{1/2} (X̄n − µ0) / Sn

is too large (with a proper choice of critical value, such that an α-test results). It is easy to see that the rejection region Tµ0(X) > zα, where zα is the upper α-quantile of the t_{n−1}-distribution,


leads to an α-test for H (exercise). To show equivalence to the LR test, note that when maximizing pµ,σ²(x) over the alternative, the supremum is not attained (µ > µ0 is an open interval). However, the supremum is the same as the maximum over µ ≥ µ0, which is attained by certain maximum likelihood estimators µ̂1, σ̂1². (We will find these, and also the MLE's µ̂0, σ̂0² under H.)
The density pµ,σ² of the data x = (x1, . . . , xn) is again (8.13), (8.14). To find MLE's of µ and σ² under µ > µ0, we first maximize for fixed σ² over all possible µ. When x̄n > µ0 the solution is µ̂ = x̄n. When x̄n ≤ µ0, the problem is to minimize (x̄n − µ)² under the condition µ > µ0. This minimum is not attained (µ can be selected arbitrarily close to µ0, such that still µ > µ0, which makes (x̄n − µ)² arbitrarily close to (x̄n − µ0)², never attaining this value). However, in this case

inf_{µ>µ0} (x̄n − µ)² = min_{µ≥µ0} (x̄n − µ)² = (x̄n − µ0)².

Thus the MLE of µ under µ ≥ µ0 is µ̂1 = max(x̄n, µ0). This is not the MLE under K, but it gives the supremal value of the likelihood under K for given σ². To continue, we have to maximize in σ². Now

(x̄n − µ̂1)² = (min(0, x̄n − µ0))²,

and defining

Sn,1² := Sn² + (x̄n − µ̂1)²,

we obtain

sup_{µ>µ0, σ²>0} pµ,σ²(x) = sup_{σ²>0} (2πσ²)^{−n/2} exp( −Sn,1² / (2σ²n⁻¹) ).

The maximization in σ² is now analogous to the argument given above for part (ii). The maximizing value is σ̂1² = Sn,1², and the maximized likelihood (which is also the supremal likelihood under K) is

sup_{µ>µ0, σ²>0} pµ,σ²(x) = (σ̂1²)^{−n/2} (2π)^{−n/2} exp(−n/2).

Now under the hypothesis, since µ ≤ µ0 is a closed interval, the MLE's can be found straightforwardly. An argument analogous to the one above gives

µ̂0 = min(x̄n, µ0),
min_{µ≤µ0} (x̄n − µ)² = (x̄n − µ̂0)² = (max(0, x̄n − µ0))²,
σ̂0² = Sn,0², where Sn,0² := Sn² + (x̄n − µ̂0)²,
sup_{µ≤µ0, σ²>0} pµ,σ²(x) = (σ̂0²)^{−n/2} (2π)^{−n/2} exp(−n/2).

Thus the likelihood ratio is

L(x) = (σ̂0²/σ̂1²)^{n/2} = ( (Sn² + (x̄n − µ̂0)²) / (Sn² + (x̄n − µ̂1)²) )^{n/2}.


Suppose first that the t-statistic Tµ0(X) takes values ≤ 0; this is equivalent to x̄n ≤ µ0. In this case µ̂0 = x̄n, µ̂1 = µ0, hence

L(x) = ( Sn² / (Sn² + (x̄n − µ0)²) )^{n/2} = ( 1 / (1 + (x̄n − µ0)²/Sn²) )^{n/2} = ( 1 / (1 + Tµ0(X)²/(n − 1)) )^{n/2}.

Thus for nonpositive values of Tµ0(X), the likelihood ratio L(x) is a monotone decreasing function of the absolute value of Tµ0(X), which means it is monotone increasing in Tµ0(X) on values Tµ0(X) ≤ 0.
Consider now nonnegative values of Tµ0(X): Tµ0(X) ≥ 0. Then x̄n ≥ µ0, hence µ̂0 = µ0, µ̂1 = x̄n and

L(x) = (σ̂0²/σ̂1²)^{n/2} = ( (Sn² + (x̄n − µ0)²) / Sn² )^{n/2} = ( 1 + Tµ0(X)²/(n − 1) )^{n/2}.

Thus for values Tµ0(X) ≥ 0, the likelihood ratio L(x) is monotone increasing in Tµ0(X). The two ranges of values of Tµ0(X) we considered overlap (in Tµ0(X) = 0), and we showed that L(x) is a monotone increasing function of Tµ0(X) on both of them. Hence L(x) is a monotone increasing function of Tµ0(X).
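The identity L(X) = (1 + Tµ0(X)²/(n − 1))^{n/2} from part (ii) is easy to cross-check numerically. The sketch below (an aside, with made-up data) computes the likelihood ratio directly from Sn² and S0,n² (both with the 1/n normalization of (8.14)) and compares it with the t-statistic formula.

```python
# Numerical cross-check of L(X) = (1 + T^2/(n-1))^{n/2} for the two sided case.
import numpy as np

def lr_two_sided(x, mu0):
    n = len(x)
    s2 = np.mean((x - x.mean()) ** 2)              # S_n^2, 1/n normalization as in (8.14)
    s2_0 = s2 + (x.mean() - mu0) ** 2              # S_{0,n}^2
    return (s2_0 / s2) ** (n / 2)

rng = np.random.default_rng(4)
x = rng.normal(1.0, 2.0, size=12)
s2 = np.mean((x - x.mean()) ** 2)
T = np.sqrt(len(x) - 1) * (x.mean() - 0.0) / np.sqrt(s2)   # T_{mu0} with mu0 = 0
print(lr_two_sided(x, 0.0), (1 + T**2 / (len(x) - 1)) ** (len(x) / 2))   # should coincide
```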


Chapter 9
CHI-SQUARE TESTS

9.1 Introduction

Consider the following problem related to Mendelian heredity. Two characteristics of pea plants (phenotypes) are observed: form, which may be smooth or wrinkled, and color, which may be yellow or green. Thus there are 4 combinations of form and color (4 combined phenotypes). Mendelian theory predicts certain frequencies of these in the total population of pea plants; call them M1, . . . , M4 (here M1 is the frequency of smooth/yellow etc.); assume these are normed as Σ_{j=1}^4 Mj = 1. We observe a sample of n pea plants; let the observed frequency be Zj for each phenotype (j = 1, . . . , 4); then Σ_{j=1}^4 Zj = n. We wish to find out whether these observations support the Mendelian hypothesis that (M1, . . . , M4) are the frequencies of the phenotypes in the total population.

Model Md,2 The observed random vector Z = (Z1, . . . , Zk) has a multinomial distribution Mk(n, p) with unknown probability vector p = (p1, . . . , pk) (Σ_{j=1}^k pj = 1, pj ≥ 0, j = 1, . . . , k).

Recall the basic facts about the multinomial law. Consider a random k-vector Y of the form (0, . . . , 0, 1, 0, . . . , 0), where exactly one component is 1 and the others are 0. The probability that the 1 is at position j is pj; thus Y can describe an individual falling into one of k categories (phenotypes in the above example). This Y is said to have law Mk(1, p). If Y1, . . . , Yn are i.i.d. with law Mk(1, p), then Z = Σ_{i=1}^n Yi has the law Mk(n, p). (The Yi may be called counting vectors.) The probability function is

P(Z = (z1, . . . , zk)) = ( n! / Π_{j=1}^k zj! ) Π_{j=1}^k pj^{zj},   (9.1)

where Σ_{j=1}^k zj = n and the zj ≥ 0 are integers. Since the j-th component of Y1 is Bernoulli B(1, pj), the j-th component Zj of Z has binomial law B(n, pj). The Zj are not independent; in fact Σ_{j=1}^k Zj = n. For k = 2 all the information is in Z1, since Z2 = n − Z1; thus for k = 2 observing a multinomial M2(n, (p1, p2)) is equivalent to observing a binomial B(n, p1).
In Model Md,2 consider the hypotheses
H : p = p0
K : p ≠ p0.
The test we wish to find thus is a significance test. Recall the basic rationale of hypothesis testing: what we wish to statistically ascertain at level α is K; if K is accepted, then it can be claimed that "the deviation from the null hypothesis* H is statistically significant", and there can be reasonable confidence in the truth of K. On the contrary, when H is accepted, no statistical significance

∗The hypothesis H is often called the ”null hypothesis”, even if it is not of the form ”ϑ = 0”


claim can be attached to this result. When formulating a test problem, the statement for which "reasonable statistical certainty" is desired is taken as K.
Let us find the likelihood ratio test for this problem. Setting ϑ = p, ϑ0 = p0 = (p0,1, . . . , p0,k), denoting by pϑ(z) the probability function (9.1) of Mk(n, p) and setting

Θ = { p : pj ≥ 0, Σ_{j=1}^k pj = 1 },  Θ1 = Θ \ {ϑ0},

we obtain the likelihood ratio statistic

L(z) = sup_{ϑ∈Θ1} pϑ(z) / pϑ0(z) = sup_{p∈Θ1} Π_{j=1}^k pj^{zj} / Π_{j=1}^k p0,j^{zj}.

Consider the numerator; let us first maximize over p ∈ Θ (this will be justified below). If some of the zj are 0, we can set the corresponding pj = 0 (making the other pj larger; we set 0⁰ = 1). We now maximize over the pj such that zj > 0. Taking a logarithm, we have to maximize

Σ_{j: zj>0} zj log pj

over all p ∈ Θ. Since log x ≤ x − 1, we have

Σ_{j: zj>0} zj log( pj / (n⁻¹zj) ) ≤ Σ_{j: zj>0} zj ( pj / (n⁻¹zj) − 1 ) = n Σ_{j: zj>0} pj − Σ_{j: zj>0} zj ≤ 0,

so that

Σ_{j: zj>0} zj log pj ≤ Σ_{j: zj>0} zj log(n⁻¹zj),

and for pj = n⁻¹zj equality is attained. This is the unique maximizer, since log x = x − 1 only for x = 1. We proved

Proposition 9.1.1 In the multinomial Model Md,2, with L(Z) = Mk(n, p) and no restriction on the parameter p, the maximum likelihood estimator p̂ is

p̂(Z) = n⁻¹Z.

The interpretation is that p̂ is a vector valued sample mean of the counting vectors Y1, . . . , Yn. In this sense, we have a generalization of the result for binomial observations (Proposition 3.0.3).
Recall that for the LR statistic we have to find the supremum over Θ1 = Θ \ {p0}, i.e. only the one point p0 is taken out. Since the target function p ↦ Π_{j=1}^k pj^{zj} is continuous (with 0⁰ = 1) on Θ, we have

L(z) = sup_{ϑ∈Θ1} pϑ(z) / pϑ0(z) = sup_{ϑ∈Θ} pϑ(z) / pϑ0(z) = max_{ϑ∈Θ} pϑ(z) / pϑ0(z) = Π_{j=1}^k ( n⁻¹zj / p0,j )^{zj}.


Since the logarithm is a monotone function, the acceptance region S (complement of the critical / rejection region) can also be written

S = { z : log (L(z))⁻¹ = Σ_{j=1}^k zj log( n p0,j / zj ) ≥ c }.

Even the logarithm is a relatively involved function of the data, so it is difficult to find its distribution under H and to determine the critical value c from that. We will use a Taylor approximation of the logarithm to simplify it. The basis is the observation that the estimator p̂(Z) is consistent, i.e. converges in probability to the true probability vector p:

p̂(Z) = n⁻¹Z →p p.

Under the hypothesis, this true vector is p0, so all values n⁻¹zj/p0,j converge to one. Note the Taylor expansion

log(1 + x) = x − x²/2 + o(x²) as x → 0,

where o(x²) is a term which is of smaller order than x² (such that o(x²)/x² → 0). Thus, assuming that each term p0,j/(n⁻¹zj) − 1 is small, we obtain

log (L(z))⁻¹ = Σ_{j=1}^k zj log( 1 + p0,j/(n⁻¹zj) − 1 )
            ≈ Σ_{j=1}^k zj ( p0,j/(n⁻¹zj) − 1 ) − (1/2) Σ_{j=1}^k zj ( p0,j/(n⁻¹zj) − 1 )².

Here the first term on the right vanishes, since the p0,j sum to one and the zj sum to n. We obtain

log (L(z))⁻¹ ≈ − (1/2) Σ_{j=1}^k zj ( p0,j/(n⁻¹zj) − 1 )².

We need not make the approximation "≈" more rigorous if we do not insist on using the likelihood ratio test. In fact we will use the LR principle only to find a reasonable test (which should then be shown to have asymptotic level α). In this spirit, we proceed with another approximation, n⁻¹zj ≈ p0,j, to obtain

−(1/2) Σ_{j=1}^k zj ( p0,j/(n⁻¹zj) − 1 )² = −(1/2) Σ_{j=1}^k zj ( (p0,j − n⁻¹zj) / (n⁻¹zj) )²
                                        ≈ −(1/2) Σ_{j=1}^k n ( p0,j − n⁻¹zj )² / p0,j.

Definition 9.1.2 In the multinomial Model Md,2, with L(Z) = Mk(n, p), the χ²-statistic relative to a given parameter vector p0 is

χ²(Z) = Σ_{j=1}^k ( n^{1/2}( p0,j − n⁻¹Zj ) )² / p0,j.


The name is derived from the asymptotic distribution of this statistic, which we will establish below (the statistic itself does not have a χ²-distribution). The hypothesis H : p = p0 will be rejected if χ²(Z) is too large; as shown above, that idea was obtained from the likelihood ratio principle.
But χ²(Z) has an interpretation of its own, as a measure of deviation from the hypothesis. Indeed, the n⁻¹Zj are consistent estimators of the true parameter p, so the sum of squares Σ_{j=1}^k (p0,j − p̂j)², with p̂j = n⁻¹Zj, can be seen as a measure of departure from H. In the chi-square statistic, we have a weighted sum of squares with weights p0,j⁻¹.
We know that since each Zj has a marginal binomial law, for each j we have a convergence in distribution

n^{1/2}( n⁻¹Zj − p0,j ) ⇒L N(0, p0,j(1 − p0,j)),   (9.2)

i.e. a limiting normal law by the CLT under H. The χ² distribution is a sum of squares of independent normals. However, the Zj are not independent in the multinomial law, so we need more than the CLT for each Zj: in fact a multivariate CLT for the joint law of (Z1, . . . , Zk) is required.
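For orientation, the χ²-statistic itself is easy to compute. The sketch below uses the Mendel-type setting with k = 4 categories, illustrative counts, and the 9 : 3 : 3 : 1 frequencies as p0; the use of the χ²_{k−1} law as reference distribution anticipates the limit result to be established in the following sections and is an assumption at this point.

```python
# Chi-square statistic of Definition 9.1.2 for k = 4 categories.
import numpy as np
from scipy.stats import chi2

Z = np.array([315, 101, 108, 32])                  # illustrative observed counts
p0 = np.array([9, 3, 3, 1]) / 16                   # hypothesized Mendelian frequencies
n = Z.sum()

chi2_stat = np.sum((n * p0 - Z) ** 2 / (n * p0))   # sum_j n(p_{0,j} - Z_j/n)^2 / p_{0,j}
p_value = chi2.sf(chi2_stat, df=len(Z) - 1)        # asymptotic chi^2_{k-1} reference law (assumed)
print(f"chi-square statistic {chi2_stat:.3f}, approximate p-value {p_value:.3f}")
```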

9.2 The multivariate central limit theorem

Recall the Central Limit Theorem (CLT) for i.i.d. real valued random variables (Example 7.3.4). Recall also that since there are no further conditions on the law of Y1, the CLT is valid for both continuous and discrete Yi. Here the symbol ⇒L refers to convergence in distribution (or in law), see Definition 7.3.1.
Suppose now that Y1, . . . , Yn are i.i.d. random vectors of dimension k; we also write Y for a random vector which has the distribution of Y1. Assume that

E‖Y‖² < ∞.   (9.3)

Here ‖Y‖² = Σ_{j=1}^k Yj², where ‖Y‖ is the Euclidean norm of the random vector Y. Recall that for any random vector Y in R^k, the covariance matrix is defined by

(Cov(Y))j,l = Cov(Yj, Yl) = E(Yj − EYj)(Yl − EYl), 1 ≤ j, l ≤ k,

if the expectations exist. (Here (A)j,l is the (j, l) entry of a matrix A.) This existence is guaranteed by condition (9.3), as a consequence of the Cauchy-Schwarz inequality:

|Cov(Yj, Yl)|² ≤ Var(Yj) Var(Yl) ≤ (EYj²)(EYl²) ≤ (E‖Y‖²)².

Note that for expectations of vectors and matrices, the following convention holds: the expectation of a vector (matrix) is the vector (matrix) of expectations. This means

EY = (EYj)_{j=1,...,k},

and since the k × k matrix whose components are YjYl can be written

YY⊤ = (YjYl)_{j,l=1,...,k},


the covariance matrix can be written

Cov(Y) = E(Y − EY)(Y − EY)⊤.

Our starting point for the multivariate CLT is the observation that if t is a nonrandom k-vector, then the r.v.'s t⊤Yi, i = 1, . . . , n, are real-valued i.i.d. r.v.'s with finite second moment. Indeed, since for any vectors t, x we have

(t⊤x)² = t⊤xx⊤t,

we obtain

Var(t⊤Y) = E(t⊤Y − Et⊤Y)² = E(t⊤(Y − EY))²
         = E t⊤(Y − EY)(Y − EY)⊤t = t⊤E(Y − EY)(Y − EY)⊤t = t⊤Cov(Y)t < ∞,

since Cov(Y) is a finite matrix. Thus according to the (univariate) CLT, if

σt² = t⊤Cov(Y)t

is not 0, then

n^{1/2}( n⁻¹ Σ_{i=1}^n t⊤Yi − Et⊤Y ) ⇒L N(0, σt²).   (9.4)

Define the sample mean of the random vectors Yi by

Ȳn = n⁻¹ Σ_{i=1}^n Yi;

then (9.4) can be written

n^{1/2} t⊤( Ȳn − EȲn ) ⇒L N(0, t⊤Cov(Y)t).

This suggests a multivariate normal distribution Nk(0, Σ) with Σ = Cov(Y) as the limit law for the vector n^{1/2}(Ȳn − EȲn). Indeed, recall that for a nonsingular (positive definite) matrix Σ,

L(Z) = Nk(0, Σ) implies that L(t⊤Z) = N(0, t⊤Σt) for every t ≠ 0

(Lemma 6.1.12; actually the converse is also true, cf. Proposition 9.2.3 below). In the sequel, for the multivariate CLT we will impose the condition that Cov(Y) is nonsingular. This means that t⊤Cov(Y)t > 0 for every t ≠ 0.
For an interpretation of this condition, note that if it is violated then there exists a t ≠ 0 such that t⊤Cov(Y)t = 0 (this number is the variance of a random variable and thus it cannot be negative). Then the r.v. t⊤(Y − EY) is 0 with probability 1. Define the hyperplane (linear subspace) in R^k

H = { x ∈ R^k : t⊤x = 0 };

then Y − EY ∈ H with probability one. The condition that Cov(Y) is nonsingular thus excludes this case of a "degenerate" random vector, which is actually concentrated on a (translate of a) linear subspace of R^k. However, a multivariate CLT is still possible if the vector Y is linearly transformed (to a space of lower dimension).


Definition 9.2.1 Let $Q_n$, $n = 1, 2, \ldots$ be a sequence of distributions in $\mathbb R^k$. The $Q_n$ are said to converge in distribution to a limit $Q_0$ if for random vectors $Y_n$ such that $\mathcal L(Y_n) = Q_n$, $\mathcal L(Y_0) = Q_0$,

$$t^\top Y_n \overset{\mathcal L}{\Rightarrow} Q_{0,t} \text{ as } n \to \infty, \qquad Q_{0,t} = \mathcal L(t^\top Y_0),$$

for every $t \in \mathbb R^k$, $t \ne 0$. In this case one writes

$$Y_n \overset{\mathcal L}{\Rightarrow} Q_0.$$

Note that it is not excluded here that the limit law has a singular covariance matrix ($t^\top Y_0$ might be 0 with probability one for certain $t$, or even for all $t$). However for the multivariate CLT we will exclude this case by assumption, since we did not systematically treat the multivariate normal $N_k(0,\Sigma)$ with singular $\Sigma$.

With this definition, it is not immediately clear that the limit law $Q_0$ is unique. It is desirable to have this uniqueness; otherwise there could be two different limit laws $Q_0$, $Q_0^*$ such that (for $\mathcal L(Y_0^*) = Q_0^*$)

$$\mathcal L(t^\top Y_0) = \mathcal L(t^\top Y_0^*) \text{ for all } t \ne 0.$$

That this is not possible will follow from Proposition 9.2.3 below.

Theorem 9.2.2 (Multivariate CLT) Let $Y_1,\ldots,Y_n$ be i.i.d. random vectors of dimension $k$, each with distribution $Q$, fulfilling condition

$$E\|Y_1\|^2 < \infty. \qquad (9.5)$$

Let

$$\bar Y_n = n^{-1}\sum_{i=1}^n Y_i$$

and assume that the covariance matrix $\Sigma = \mathrm{Cov}(Y_1)$ is nonsingular. Then for fixed $Q$ and $n \to \infty$

$$n^{1/2}\left(\bar Y_n - EY_1\right) \overset{\mathcal L}{\Rightarrow} N_k(0,\Sigma).$$

Proof. With our definition of convergence in distribution, it is an immediate consequence of the univariate (one dimensional) CLT and the properties of the multivariate normal distribution (following the pattern outlined above, where everything is reduced to the one dimensional case). The uniqueness of the limiting law follows from the next statement. It means that we can in fact use all properties of the multivariate normal when it is a limiting law (e.g. independence of components when they are uncorrelated).

Proposition 9.2.3 Let $Q$, $Q^*$ be the distributions of two random vectors $Y$, $Y^*$ with values in $\mathbb R^k$. Then

$$\mathcal L(Y) = \mathcal L(Y^*) \text{ if and only if } \mathcal L(t^\top Y) = \mathcal L(t^\top Y^*) \text{ for all } t \ne 0.$$


Proof. The complete argument is beyond the scope of this course; let us discuss some elements (cp. also the arguments for the proof of the univariate CLT in [D]). Suppose that for all $t \in \mathbb R^k$, the expression

$$M_Y(t) = E\exp(t^\top Y)$$

is finite (i.e. the expectation is finite). In that case, $M_Y$ is called the moment generating function (m.g.f.) of $\mathcal L(Y)$. Analogously to the one dimensional case, it can be shown that the m.g.f. determines $\mathcal L(Y)$ uniquely (that is the key argument). Thus if $\mathcal L(t^\top Y) = \mathcal L(t^\top Y^*)$ then their univariate m.g.f.'s coincide:

$$E\exp(u\,t^\top Y) = E\exp(u\,t^\top Y^*),$$

and conversely, if that is the case for all $u$ and $t$ then $M_Y = M_{Y^*}$, hence $\mathcal L(Y) = \mathcal L(Y^*)$.

Existence of the m.g.f. is a strong additional assumption on a distribution. The proof in the general case (without any conditions on the laws $\mathcal L(Y)$, $\mathcal L(Y^*)$) is based on the so-called characteristic function of a random vector

$$\varphi_Y(t) = E\exp(i\,t^\top Y)$$

where the complex-valued expression

$$\exp(iz) = \cos(z) + i\sin(z)$$

occurs (for $z = t^\top Y$). Since

$$|\exp(iz)| = 1$$

(absolute value for complex numbers), no special strong assumptions have to be made for the existence of the characteristic function: it exists for any random vector and can be used in the proof in much the same way as above. The essential part is again that $\varphi_Y$ uniquely determines $\mathcal L(Y)$.

Let $A$ be a subset of $\mathbb R^k$. A set is called regular if it has a volume and a boundary of zero volume (the boundary is the intersection of $\bar A$ and $\overline{A^c}$, where $A^c$ is the complement and $\bar A$ is the closure of $A$). Rectangles and balls are regular. The following statement is similar to Proposition 9.2.3, in the sense that advanced mathematical tools are needed for its proof, and we only quote it here.

Proposition 9.2.4 Let $Y_n$ be a sequence of random vectors in $\mathbb R^k$ such that

$$Y_n \overset{\mathcal L}{\Rightarrow} Q_0 \text{ as } n \to \infty$$

where $Q_0$ is a continuous law in $\mathbb R^k$ ($Q_0$ has a density). Then

$$P(Y_n \in A) \to Q_0(A)$$

for all regular sets $A \subset \mathbb R^k$.
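The multivariate CLT can also be checked empirically. The following sketch is an editorial addition (not part of the original handout); it assumes Python with NumPy and simulates i.i.d. random vectors with a nonsingular covariance matrix, comparing the sample covariance of $n^{1/2}(\bar Y_n - EY)$ over many replications with $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 2000

# i.i.d. 2-vectors Y = (U, U + V) with U, V independent exponential(1);
# then EY = (1, 2) and Cov(Y) = [[1, 1], [1, 2]] (nonsingular)
def sample_mean():
    u = rng.exponential(1.0, n)
    v = rng.exponential(1.0, n)
    y = np.column_stack([u, u + v])
    return y.mean(axis=0)

means = np.array([sample_mean() for _ in range(reps)])
z = np.sqrt(n) * (means - np.array([1.0, 2.0]))   # n^(1/2) (Ybar_n - EY)

print(np.cov(z, rowvar=False))   # should be close to [[1, 1], [1, 2]]
```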

9.3 Application to multinomials

Let us apply the multivariate CLT to the multinomial random vector $Z$. Since the components are linearly dependent (they sum to $n$), we cannot expect a nonsingular covariance matrix. Recall that if $\mathcal L(Z) = M_k(n,p)$ then $Z = \sum_{i=1}^n Y_i$ where the $Y_i$ are independent $M_k(1,p)$. If $\mathcal L(Y) = M_k(1,p)$ then for $(Y_1,\ldots,Y_k)^\top = Y$

$$EY_j = p_j,$$

and for $j = l$, since $Y_j$ is binomial,

$$EY_jY_l = EY_j^2 = EY_j = p_j,$$

while for $j \ne l$

$$EY_jY_l = P(Y_j = 1, Y_l = 1) = 0$$

(the random vector $Y$ has a 1 in exactly one position). We can now write down the covariance matrix:

$$\mathrm{Cov}(Y_j, Y_l) = EY_jY_l - EY_jEY_l = \begin{cases} p_j - p_jp_l, & j = l \\ -p_jp_l, & j \ne l. \end{cases}$$

Introduce transformed variables $\tilde Y_j = (Y_j - p_j)/p_j^{1/2}$; then

$$\mathrm{Cov}(\tilde Y_j, \tilde Y_l) = p_j^{-1/2}p_l^{-1/2}\,\mathrm{Cov}(Y_j, Y_l) = \begin{cases} 1 - p_j^{1/2}p_l^{1/2}, & j = l \\ -p_j^{1/2}p_l^{1/2}, & j \ne l. \end{cases}$$

Let $\Lambda$ be the diagonal matrix with diagonal elements $p_j^{-1/2}$. Then, for the vectors

$$\tilde Y = \Lambda(Y - p), \qquad \tilde p = \Lambda p = (p_1^{1/2},\ldots,p_k^{1/2})^\top,$$

we have $(\tilde Y_1,\ldots,\tilde Y_k)^\top = \tilde Y$ and

$$\mathrm{Cov}(\tilde Y) = I_k - \tilde p\tilde p^\top. \qquad (9.6)$$

Let $e_1,\ldots,e_k$ be an orthonormal basis of vectors in $\mathbb R^k$ such that $e_k = \tilde p$. That is possible since $\tilde p$ has length 1:

$$\|\tilde p\|^2 = \sum_{j=1}^k \left(p_j^{1/2}\right)^2 = 1.$$

Then $e_1,\ldots,e_{k-1}$ are all orthogonal to $\tilde p$. Let $G$, $G_0$ be matrices of dimension $k \times k$ and $(k-1) \times k$,

$$G = \begin{pmatrix} e_1^\top \\ \vdots \\ e_k^\top \end{pmatrix}, \qquad G_0 = \begin{pmatrix} e_1^\top \\ \vdots \\ e_{k-1}^\top \end{pmatrix}. \qquad (9.7)$$

Here $G$ is an orthogonal matrix and $G_0$ has orthonormal rows. Moreover we have

$$e_k^\top \tilde Y = \tilde p^\top \tilde Y = \sum_{j=1}^k p_j^{1/2}p_j^{-1/2}(Y_j - p_j) \qquad (9.8)$$
$$= \sum_{j=1}^k (Y_j - p_j) = 1 - 1 = 0. \qquad (9.9)$$

Lemma 9.3.1 Let $\mathcal L(Z) = M_k(n,p)$, let $\Lambda$ be the diagonal matrix with diagonal elements $p_j^{-1/2}$ and let the $(k-1) \times k$-matrix $F$ be defined by

$$F = G_0\Lambda$$

where $G_0$ is defined by (9.7). Then

$$\|F(Z - np)\|^2 = \sum_{j=1}^k p_j^{-1}(Z_j - np_j)^2 \qquad (9.10)$$

and the random vector $FZ$ has covariance matrix

$$\mathrm{Cov}(FZ) = nI_{k-1}. \qquad (9.11)$$

Proof. Note that

$$\Lambda(Z - np) = \sum_{i=1}^n \tilde Y_i$$

where the $Y_i$ are i.i.d. $M_k(1,p)$ and $\tilde Y_i = \Lambda(Y_i - p)$. Above it was shown (9.9) that

$$e_k^\top \tilde Y_i = \tilde p^\top \tilde Y_i = 0, \quad i = 1,\ldots,n.$$

Hence

$$e_k^\top\Lambda(Z - np) = 0.$$

This implies

$$\sum_{j=1}^k p_j^{-1}(Z_j - np_j)^2 = \|\Lambda(Z - np)\|^2 = \|G\Lambda(Z - np)\|^2 = \sum_{j=1}^k \left(e_j^\top\Lambda(Z - np)\right)^2 = \sum_{j=1}^{k-1} \left(e_j^\top\Lambda(Z - np)\right)^2 = \|G_0\Lambda(Z - np)\|^2 = \|F(Z - np)\|^2,$$

thus the first claim (9.10) is proved. For the second claim, we note that in view of (9.6) and the additivity of covariance matrices for independent vectors,

$$\mathrm{Cov}(FZ) = n\,\mathrm{Cov}(FY_1) = n\,\mathrm{Cov}(G_0\tilde Y_1) = nG_0\left(I_k - \tilde p\tilde p^\top\right)G_0^\top = nG_0G_0^\top = nI_{k-1}$$

(computation rules for covariance matrices can be obtained from the rules for the multivariate normal, cp. Lemma 6.1.12: $\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)A^\top$).

The following is the Central Limit Theorem for a multinomial random variable, which generalizes the de Moivre-Laplace CLT for binomials (sums of i.i.d. Bernoulli r.v.'s, cp. (1.2) and (9.2)). Since the components of the multinomial are dependent, we need to multiply by a $(k-1) \times k$-matrix $F$ first; otherwise we would get a multivariate normal limiting distribution with singular covariance matrix.


Proposition 9.3.2 Let $\mathcal L(Z) = M_k(n,p)$. Then for the $(k-1) \times k$-matrix $F$ defined above we have

$$n^{1/2}F\left(n^{-1}Z - p\right) \overset{\mathcal L}{\Rightarrow} N_{k-1}(0, I_{k-1}) \text{ as } n \to \infty.$$

Proof. We can represent $Z$ as a sum of i.i.d. vectors

$$Z = \sum_{i=1}^n Y_i$$

where each $Y_i$ is $M_k(1,p)$. Hence

$$FZ = \sum_{i=1}^n FY_i;$$

the $FY_i$ are again i.i.d. vectors, with expectation $Fp$ and with covariance matrix $I_{k-1}$, according to (9.11) for $n = 1$. For the multivariate CLT, the second moment condition is fulfilled trivially since the vector $Y_1$ takes only $k$ possible values. The multivariate CLT (Theorem 9.2.2) yields the result.

The next result justifies the name of the $\chi^2$-statistic, by establishing an asymptotic distribution.

Theorem 9.3.3 Let $\mathcal L(Z) = M_k(n,p)$. Then for the $\chi^2$-statistic

$$\chi^2(Z) = \sum_{j=1}^k \frac{\left(n^{1/2}\left(n^{-1}Z_j - p_j\right)\right)^2}{p_j}$$

we have

$$\chi^2(Z) \overset{\mathcal L}{\Rightarrow} \chi^2_{k-1} \text{ as } n \to \infty.$$

Proof. We have according to (9.10)

$$\chi^2(Z) = n^{-1}\sum_{j=1}^k p_j^{-1}(Z_j - np_j)^2 = n^{-1}\|F(Z - np)\|^2 = \left\|n^{1/2}F(n^{-1}Z - p)\right\|^2.$$

Proposition 9.3.2 above implies that the expression inside $\|\cdot\|^2$ is asymptotically $(k-1)$-dimensional standard normal. Denote this expression

$$V_n = n^{1/2}F(n^{-1}Z - p).$$

For convergence in law to $\chi^2_{k-1}$ we have to show that

$$P\left(\|V_n\|^2 \le t\right) \to F(t)$$

where $F$ is the distribution function of the law $\chi^2_{k-1}$, at every continuity point $t$ of $F$. Since this law has a density, $F$ is continuous, so it has to be shown for every $t$ (it suffices for $t \ge 0$). The set $\left\{x : \|x\|^2 \le t\right\}$ is a ball in $\mathbb R^{k-1}$ and hence regular in the sense of Proposition 9.2.4. Thus if $\xi$ is a random vector with law $N_{k-1}(0, I_{k-1})$ then

$$P\left(\|V_n\|^2 \le t\right) \to P\left(\|\xi\|^2 \le t\right).$$

By definition of the $\chi^2_{k-1}$ distribution, we have

$$P\left(\|\xi\|^2 \le t\right) = F(t),$$

so that the theorem follows from Proposition 9.2.4.
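Theorem 9.3.3 is easy to illustrate by simulation. The sketch below is an editorial addition (not from the handout; it assumes Python with NumPy/SciPy): it draws many multinomial vectors, computes $\chi^2(Z)$ for each, and compares empirical quantiles with those of $\chi^2_{k-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, np.array([0.5, 0.3, 0.15, 0.05])       # k = 4 cells
Z = rng.multinomial(n, p, size=5000)               # 5000 multinomial draws

chi2_vals = ((Z - n * p) ** 2 / (n * p)).sum(axis=1)

# empirical vs. theoretical chi^2_{k-1} quantiles (here k - 1 = 3)
for q in (0.5, 0.9, 0.95):
    print(q, np.quantile(chi2_vals, q), stats.chi2.ppf(q, df=3))
```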

9.4 Chi-square tests for goodness of fit

In a goodness-of-fit test we test whether the distribution of the data is different from a given specified distribution $Q_0$. More generally, this term is also used when we test whether the distribution is outside a specified family (such as the normal family). In the first case, the hypothesis $H$ is simple (consists of $Q_0$) and the test may also be called a significance test, but the terminology "goodness-of-fit" emphasizes that we specifically focus on the actual shape of the distribution. Goodness of fit to whole families of distributions (taken as $H$) will be discussed in the next section.

Theorem 9.4.1 Consider Model $M_{d,2}$: the observed random $k$-vector $Z$ has law $\mathcal L(Z) = M_k(n,p)$ where $p$ is unknown. Consider the hypotheses

$$H : p = p_0, \qquad K : p \ne p_0.$$

Let $z_\alpha$ be the upper $\alpha$-quantile of the distribution $\chi^2_{k-1}$. The test $\varphi(Z)$ defined by

$$\varphi(Z) = \begin{cases} 1 & \text{if } \chi^2(Z) > z_\alpha \\ 0 & \text{otherwise} \end{cases} \qquad (9.12)$$

where

$$\chi^2(Z) = \sum_{j=1}^k \frac{(np_{0,j} - Z_j)^2}{np_{0,j}} \qquad (9.13)$$

is the $\chi^2$-statistic relative to $H$, is an asymptotic $\alpha$-test, i.e. under $p = p_0$

$$\limsup_{n\to\infty} P(\varphi(Z) = 1) \le \alpha.$$

The form (9.13) of the $\chi^2$-statistic is easy to memorize: take observed frequency minus expected frequency, square it, divide by expected frequency, and sum over components (components are also called cells).
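The test (9.12)-(9.13) translates directly into code. The following minimal sketch is an editorial addition (not part of the handout; Python with NumPy/SciPy assumed).

```python
import numpy as np
from scipy import stats

def chi2_gof_test(z, p0, alpha=0.05):
    """Asymptotic alpha-test of H: p = p0 from cell counts z, cf. (9.12)-(9.13)."""
    z, p0 = np.asarray(z, float), np.asarray(p0, float)
    n, k = z.sum(), len(z)
    expected = n * p0
    chi2_stat = ((expected - z) ** 2 / expected).sum()     # formula (9.13)
    z_alpha = stats.chi2.ppf(1.0 - alpha, df=k - 1)        # upper alpha-quantile
    return chi2_stat, z_alpha, chi2_stat > z_alpha          # reject iff last entry is True

# small invented example with k = 3 cells
print(chi2_gof_test([52, 29, 19], [0.5, 0.3, 0.2]))
```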

Remark 9.4.2 On quantiles. In the literature (especially in tables) it is more common to use lower quantiles, i.e. values $q_\gamma$ such that for a given random variable $X$

$$P(X \le q_\gamma) \ge \gamma, \qquad P(X \ge q_\gamma) \ge 1 - \gamma.$$

For $\gamma = 1/2$ one obtains a median (i.e. a theoretical median of the random variable $X$; note that quantiles $q_\gamma$ need not be unique). When $X$ has a continuous strictly monotone distribution function $F$ (at least in a neighborhood of $q_\gamma$) then $q_\gamma$ is the unique value with

$$F(q_\gamma) = \gamma.$$

Lower quantiles are often written in the form $\chi^2_{k;\gamma}$ if $F$ corresponds to $\chi^2_k$. Thus for the upper quantile $z_\alpha$ used above we have

$$z_\alpha = \chi^2_{k-1;1-\alpha}.$$

Example 9.4.3 Consider again the heredity example (beginning of Section 9.1). Suppose the values of $(M_1,\ldots,M_4) = (p_{0,1},\ldots,p_{0,4})$ predicted by the theory are

$$(p_{0,1},\ldots,p_{0,4}) = \frac{1}{100^2}\left(91^2,\; 9\cdot 91,\; 9\cdot 91,\; 9^2\right) = (0.8281,\; 0.0819,\; 0.0819,\; 0.0081)$$

(we do not claim that these are the correct values corresponding to Mendelian theory). Suppose we have 1000 observations with observed frequency vector

$$Z = (822, 96, 75, 7)$$

(these are hypothetical, freely invented data). We have

$$\chi^2(Z) = \sum_{j=1}^4 \frac{(np_{0,j} - Z_j)^2}{np_{0,j}} = \frac{(828.1 - 822)^2}{828.1} + \frac{(81.9 - 96)^2}{81.9} + \frac{(81.9 - 75)^2}{81.9} + \frac{(8.1 - 7)^2}{8.1} = 3.2031.$$

At significance level $\alpha = 0.05$, we find $z_\alpha = \chi^2_{3;0.95} = 7.82$. The hypothesis is not rejected.
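As a quick cross-check of the arithmetic in Example 9.4.3 (an editorial addition, not part of the handout; SciPy assumed), scipy.stats.chisquare computes the same statistic from observed and expected frequencies.

```python
import numpy as np
from scipy import stats

z = np.array([822, 96, 75, 7])
p0 = np.array([0.8281, 0.0819, 0.0819, 0.0081])
res = stats.chisquare(f_obs=z, f_exp=1000 * p0)
print(res.statistic)                      # approx 3.2031
print(stats.chi2.ppf(0.95, df=3))         # approx 7.81, so H is not rejected
```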

Example 9.4.4 Suppose we have a die and want to test whether it is fair, i.e. whether all six outcomes are equally probable. For $n$ independent trials, the frequency vector for outcomes $(1,\ldots,6)$ is

$$Z = (Z_1,\ldots,Z_6)$$

and the expected frequency vector would be

$$np_0 = \left(\frac n6, \ldots, \frac n6\right).$$

This can easily be simulated on a computer; in fact one would then test whether the computer die (i.e. a uniform random number taking values in the integers $1,\ldots,6$) actually has the uniform distribution. The random number generator of QBasic (an MS-DOS Basic version) was tested in this way, with $n = 10000$, and a result

$$z = (1686, 1707, 1739, 1583, 1643, 1642).$$

We have $n/6 = 1666.667$ and a value for the $\chi^2$-statistic

$$\chi^2(z) = \sum_{j=1}^6 \frac{(n/6 - z_j)^2}{n/6} = 9.2408.$$

The quantile at $\alpha = 0.05$ is $z_\alpha = \chi^2_{5;0.95} = 11.07$, so the hypothesis of a uniform distribution cannot be rejected.
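The same experiment is easy to repeat with a modern generator. The sketch below is an editorial addition (not part of the handout; NumPy/SciPy assumed): it reproduces the statistic for the QBasic counts above and then runs the identical test on 10000 draws from NumPy's integer generator.

```python
import numpy as np
from scipy import stats

def die_chi2(counts):
    n = counts.sum()
    return ((counts - n / 6) ** 2 / (n / 6)).sum()

qbasic = np.array([1686, 1707, 1739, 1583, 1643, 1642])
print(die_chi2(qbasic))                           # approx 9.2408

rng = np.random.default_rng()
rolls = rng.integers(1, 7, size=10000)            # the "computer die"
counts = np.bincount(rolls, minlength=7)[1:]
print(die_chi2(counts), stats.chi2.ppf(0.95, 5))  # compare with 11.07
```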


Exercise. Suppose you are intent on proving that the number generator is bad, and run the above simulation program 20 times. You claim that the random number generator is bad when the test rejects at least once. Are you still doing an $\alpha$-test? (Assume that $n$ above is large enough so that the level of one test is practically $\alpha$.)

The $\chi^2$-test can also be used to test hypotheses that the data follow a specific distribution, not necessarily multinomial. Suppose observations are i.i.d. real valued $X_1,\ldots,X_n$, with distribution $Q$. Suppose $Q_0$ is a specific distribution, and consider hypotheses

$$H : Q = Q_0, \qquad K : Q \ne Q_0.$$

This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets or cells $A_1,\ldots,A_k$:

$$\bigcup_{j=1}^k A_j = \mathbb R, \qquad A_i \cap A_j = \emptyset, \; j \ne i.$$

The $A_j$ are often called cells or bins; they are usually intervals. For a real r.v. $X$, define an indicator vector

$$Y(X) = (Y_1,\ldots,Y_k), \qquad Y_j = \mathbf 1_{A_j}(X), \; j = 1,\ldots,k, \qquad (9.14)$$

i.e. $Y(X)$ indicates into which of the $k$ cells the r.v. $X$ falls. Then obviously $Y(X)$ has a multinomial distribution:

$$\mathcal L(Y(X)) = M_k(1,p), \qquad p = p(Q) = (Q(A_1),\ldots,Q(A_k)).$$

This $p(Q)$ is the vector of cell probabilities corresponding to the given partition. Thus, in the above problem, the vectors $Y_i = Y(X_i)$ are multinomial; they are sometimes called binned data. Then

$$Z := \sum_{i=1}^n Y(X_i) \qquad (9.15)$$

is multinomial $M_k(n,p)$ with the above value of $p$. When $Q$ takes the value $Q_0$ then also the vector of cell probabilities takes the value

$$p_0 = p(Q_0) = (Q_0(A_1),\ldots,Q_0(A_k)).$$

From the initial hypotheses $H$, $K$ one obtains derived hypotheses

$$H' : p(Q) = p_0, \qquad K' : p(Q) \ne p_0.$$

From Theorem 9.4.1 it is clear that as $n \to \infty$, the $\chi^2$-test based on $Z$ for the multinomial hypothesis $H'$ is again an asymptotic $\alpha$-test.

Corollary 9.4.5 Suppose observations are i.i.d. real valued $X_1,\ldots,X_n$, with distribution $Q$. Consider hypotheses $H : Q = Q_0$, $K : Q \ne Q_0$. For a partition of the real line into nonintersecting cells $A_1,\ldots,A_k$, define the vector of cell frequencies $Z$ by (9.15). Then the $\chi^2$-test $\varphi(Z)$ defined by (9.12) based on $Z$ is an asymptotic $\alpha$-test as $n \to \infty$.
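For a continuous $Q_0$ the procedure of Corollary 9.4.5 amounts to binning the data and applying (9.13) with cell probabilities $Q_0(A_j)$. A minimal sketch, added editorially (not from the handout; NumPy/SciPy assumed), testing a standard-normal random number generator:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_normal(2000)                    # data; H: Q = N(0, 1)

# cells: (-inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 1.5], (1.5, inf) -> k = 5
inner = np.array([-1.5, -0.5, 0.5, 1.5])
idx = np.searchsorted(inner, x)                  # cell index 0..4 for each observation
z = np.bincount(idx, minlength=5)                # cell frequencies, cf. (9.15)

cdf = stats.norm.cdf(inner)
p0 = np.diff(np.concatenate(([0.0], cdf, [1.0])))   # cell probabilities under Q0

chi2_stat = ((len(x) * p0 - z) ** 2 / (len(x) * p0)).sum()
print(chi2_stat, stats.chi2.ppf(0.95, df=len(z) - 1))
```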


This $\chi^2$-test has a very wide range of applicability; it is not specified whether $Q_0$ is discrete or continuous. Every distribution $Q_0$ gives rise to a specific multinomial distribution $M_k(n,p(Q_0))$ which is then tested. For instance, a random number generator for standard normal variables can be tested in this way. On the real line, at least one of the cells $A_j$ contains an unbounded interval. However, there is a certain arbitrariness involved in the choice of the cells $A_1,\ldots,A_k$. In fact partitioning the data into groups amounts to a "coarsening" of the hypothesis: there are certainly distributions $Q \ne Q_0$ which have the same cell probabilities, i.e. $p(Q_0) = p(Q)$. These cannot be told apart from $Q_0$ by this method. If one chooses a large number of groups $k$, the number of observations in each cell may be small, so that the approximation based on the CLT appears less credible.

Remark 9.4.6 A family of probability distributions $\mathcal P = \{P_\vartheta,\; \vartheta \in \Theta\}$, indexed by $\vartheta$, is called parametric if all $\vartheta \in \Theta$ are finite dimensional vectors ($\Theta \subseteq \mathbb R^k$ for some $k$); otherwise $\mathcal P$ is called nonparametric. In hypothesis testing, any hypothesis corresponds to some $\mathcal P$, thus the terminology is extended to hypotheses. Any simple hypothesis (consisting of one probability distribution $P = Q_0$) is parametric. In Corollary 9.4.5, the alternative $K : Q \ne Q_0$ is nonparametric: the set of all distributions $Q = \mathcal L(X_1)$ cannot be parametrized by a finite dimensional vector (take e.g. only all discrete distributions $\ne Q_0$ characterized by probabilities $q_1,\ldots,q_r$, $r$ arbitrarily large).

Thus we encountered the first nonparametric hypothesis, in the form of the alternative $Q \ne Q_0$. In this sense, the $\chi^2$-test for goodness of fit in Corollary 9.4.5 is a nonparametric test; in a narrower sense this term is used for tests which have level $\alpha$ on a nonparametric hypothesis. (However this $\chi^2$-test actually tests the hypothesis on the cell probabilities $p(Q) = p(Q_0)$, with asymptotic level $\alpha$, and the set of all $Q$ fulfilling this hypothesis is also nonparametric.)

9.5 Tests with estimated parameters

Back in the multinomial model, consider now the situation where the hypothesis is not the simple $H : p = p_0$ but composite: let $\mathbf H$ be a $(d+1)$-dimensional linear subspace of $\mathbb R^k$ ($0 \le d \le k-1$) and assume the hypothesis is $H : p \in \mathbf H$. An example would be the hypothesis $p_1 = p_2$ (when $k > 2$). Now $p$ is already in a $(k-1)$-dimensional affine manifold

$$S_P = \left\{x : \mathbf 1^\top x = 1,\; x_j \ge 0,\; j = 1,\ldots,k\right\}, \qquad (9.16)$$

which is called the probability simplex in $\mathbb R^k$. It is the set of all $k$-dimensional probability vectors. (Here $\mathbf 1 = (1,\ldots,1)^\top \in \mathbb R^k$.) Instead of fixing $p_0$ as before, we now have only linear restrictions on $p$: if $\mathbf H^\perp$ is the orthogonal complement of $\mathbf H$ and $h_1,\ldots,h_{k-d-1}$ an orthonormal basis of it, then

$$h_j^\top p = 0, \quad j = 1,\ldots,k-d-1. \qquad (9.17)$$

If there is a probability vector in $\mathbf H$, then $h_1,\ldots,h_{k-d-1},\mathbf 1$ must be linearly independent (otherwise $\mathbf 1$ would be a linear combination of the $h_j$, and we could not have $\mathbf 1^\top p = 1$), and it follows that

$$H_0 = S_P \cap \mathbf H \qquad (9.18)$$

has dimension $k - (k - d - 1 + 1) = d$. (The dimension of $H_0$ should be understood in the sense that there are $d+1$ points $x_0,\ldots,x_d$ in $H_0$ such that $x_j - x_0$, $j = 1,\ldots,d$, are linearly independent, and no more such points. Similarly, $S_P$ has dimension $k-1$.) When $d = 0$ then $H_0$ consists of only


one point $p_0$. The setting can be visualized as follows.

[Figure: The probability simplex for $k = 3$ intersected with the linear space spanned by $\mathbf 1$ (dimension 1). The intersection is the single point $k^{-1}\mathbf 1$ (dimension $d = 0$).]

[Figure: The probability simplex intersected with a linear subspace $\mathbf H$ (dimension 2). The intersection is $H_0$ (dimension $d = 1$).]

The multinomial data vector $n^{-1}Z$ also takes values in $S_P$, which means intuitively that there are $k-1$ "degrees of freedom". Our parameter vector $p$ varies in $H_0$ with dimension $d$, which means that there are $d$ "free" parameters under the hypothesis which must be estimated. We now claim


that the corresponding $\chi^2$-statistic has a limiting $\chi^2$-distribution with degrees of freedom

$$\dim(S_P) - \dim(H_0) = (k-1) - d = k - d - 1.$$

Let us discuss what we mean by estimated parameters. A guiding principle is still the likelihood ratio principle: consider the LR statistic

$$L(z) = \frac{\sup_{\vartheta\in\Theta_1} p_\vartheta(z)}{\sup_{\vartheta\in\Theta_0} p_\vartheta(z)} \qquad (9.19)$$

where $\Theta_0$, $\Theta_1$ are the parameter spaces under $H$, $K$ respectively. In the case $\Theta_0 = \{p_0\}$ this led us to the $\chi^2$-statistic relative to $p_0$,

$$\chi^2(Z) = \sum_{j=1}^k \frac{\left(n^{1/2}\left(p_{0,j} - n^{-1}Z_j\right)\right)^2}{p_{0,j}}.$$

Since now in (9.19) we also have to maximize over the hypothesis, we should expect that in place of $p_{0,j}$ we now obtain estimated values under the hypothesis: $\hat p = \hat p(Z)$, the maximum likelihood estimator under $H$.

Write $o_p(n^{-r})$ for a random vector such that $n^r\|o_p(n^{-r})\| \to_p 0$, $r \ge 0$.

Lemma 9.5.1 The MLE $\hat p$ in a multinomial model $M_k(n,p)$, $p \in H_0$, for a parameter space $H_0$ given by (9.18) fulfills

$$F(\hat p - p) = \Pi F\left(n^{-1}Z - p\right) + o_p(n^{-1/2}) \qquad (9.20)$$

where $F$ is the $(k-1) \times k$-matrix defined in Lemma 9.3.1, $p$ is the true parameter vector, and $\Pi$ is a $(k-1) \times (k-1)$ projection matrix of rank $d$.

Comment. The result means that the $(k-1)$-vector $F(\hat p - p)$ "almost" lies in the $d$-dimensional linear subspace of $\mathbb R^{k-1}$ associated to the projection $\Pi$. This is related to the fact that both $p$ and $\hat p$ are in the $d$-dimensional manifold $H_0$.

Proof. We present only a sketch, suppressing some technical arguments. Let $p$ denote the true value and $\hat p$ the vector over which one maximizes (and ultimately the maximizing value). Consider the log-likelihood; up to an additive term which does not depend on $\hat p$ it is

$$\sum_{j=1}^k Z_j \log \hat p_j.$$

Maximizing this is the same as minimizing

$$-2\sum_{j=1}^k Z_j \log\left(\frac{\hat p_j}{n^{-1}Z_j}\right).$$

A preliminary argument (which we cannot give here) shows that $\hat p$ is consistent, i.e. $\hat p \to_p p$. Also $n^{-1}Z \to_p p$, so that $\hat p_j/(n^{-1}Z_j) \to_p 1$. A Taylor expansion of the logarithm yields

$$-2\sum_{j=1}^k Z_j \log\left(1 + \frac{\hat p_j}{n^{-1}Z_j} - 1\right) = -2\sum_{j=1}^k Z_j\left(\frac{\hat p_j}{n^{-1}Z_j} - 1\right) + \sum_{j=1}^k Z_j\left(\frac{\hat p_j}{n^{-1}Z_j} - 1\right)^2 + o_p(1).$$


Here the first term vanishes and the remainder is

$$\sum_{j=1}^k \frac{n\left(n^{-1}Z_j - \hat p_j\right)^2}{n^{-1}Z_j} + o_p(1).$$

In another approximation step, $n^{-1}Z_j$ is replaced by its limit $p_j$ to yield

$$\sum_{j=1}^k \frac{n\left(n^{-1}Z_j - \hat p_j\right)^2}{p_j} + o_p(1).$$

The above expression is the one which $\hat p$ minimizes. Similarly to (9.10) we now have

$$\sum_{j=1}^k p_j^{-1}(n^{-1}Z_j - \hat p_j)^2 = \left\|F(n^{-1}Z - \hat p)\right\|^2,$$

since with the choice of the vector $e_k$ and $\Lambda$ as in Lemma 9.3.1 we have $\Lambda e_k = \Lambda\Lambda p = \Lambda^2 p = \mathbf 1$ and hence

$$e_k^\top\Lambda(n^{-1}Z - \hat p) = 1 - 1 = 0.$$

Furthermore, denoting

$$q = F(\hat p - p), \qquad Z^* = F(n^{-1}Z - p), \qquad (9.21)$$

we obtain

$$\left\|F(n^{-1}Z - \hat p)\right\|^2 = \|Z^* - q\|^2 \qquad (9.22)$$

and $q$ minimizes

$$n\|Z^* - q\|^2 + o_p(1). \qquad (9.23)$$

Let us disregard the requirement that all components of $\hat p$ must be nonnegative; it can be shown that since $n^{-1}Z \in S_P$, this requirement is fulfilled automatically for the minimizer $\hat p$ if $n$ is large enough. With this agreement, the vector $q$ varies in the set

$$H_1 = \left\{F(x - p),\; x \in S \cap \mathbf H\right\} \qquad (9.24)$$

where

$$S = \left\{x : \mathbf 1^\top x = 1\right\}.$$

This set $H_1$ can be described as follows. The set $S$ is an affine subspace of $\mathbb R^k$; for the given $p \in S_P$ it can be represented as

$$S = \left\{p + z : \mathbf 1^\top z = 0\right\}.$$

Then since $p \in \mathbf H$, we have

$$H_{01} := \left\{x - p,\; x \in S \cap \mathbf H\right\} = \left\{z : \mathbf 1^\top z = 0,\; z \in \mathbf H\right\} = \left\{z : h_j^\top z = 0,\; j = 1,\ldots,k-d-1,\; \mathbf 1^\top z = 0\right\}$$


and since $\mathbf 1$ and the $h_j$ are linearly independent, $H_{01}$ is a $d$-dimensional linear subspace of $\mathbb R^k$. Then $H_1 = \left\{Fy,\; y \in H_{01}\right\}$ is a linear subspace of $\mathbb R^{k-1}$, and by the construction of the matrix $F$ the space $H_1$ has dimension no smaller than that of $H_{01}$, i.e. also dimension $d$. (Indeed, if the dimension were less, then there would be a nonzero element $z$ of $H_{01}$ orthogonal to all rows of $F$. Since these rows are orthogonal to $p$ by construction, cp. (9.7), $z$ must be a multiple of $p$, which implies $\mathbf 1^\top p = 0$, in contradiction to $\mathbf 1^\top p = 1$.) Let $\Pi$ be the projection matrix onto $H_1$ in $\mathbb R^{k-1}$. Since $q$ minimizes the distance to $Z^*$ within the space $H_1$, we must have (approximately)

$$q = \Pi Z^*,$$

i.e. $q$ is the projection of $Z^*$ onto $H_1$. This is already (9.20) up to the size $o_p(n^{-1/2})$ of the error term, for which a more detailed argument based on (9.23) is necessary.

Consider now the $\chi^2$ statistic relative to $H$, with (maximum likelihood) estimated parameter $\hat p = \hat p(Z)$. We obtain

$$\chi^2(Z) = \sum_{j=1}^k \frac{n\left(\hat p_j(Z) - n^{-1}Z_j\right)^2}{\hat p_j(Z)}.$$

To find the asymptotic distribution, we substitute the denominator by its probability limit $p_j$ (the true parameter):

$$\chi^2(Z) = \sum_{j=1}^k \frac{n\left(\hat p_j(Z) - n^{-1}Z_j\right)^2}{p_j} + o_p(1) \approx n\left\|F(n^{-1}Z - \hat p)\right\|^2 = n\|Z^* - q\|^2 = n\left\|(I_{k-1} - \Pi)Z^* + o_p(n^{-1/2})\right\|^2$$

according to the approximation of the Lemma above. Now the matrix $\Pi^* = I_{k-1} - \Pi$ is also a projection matrix, namely onto the orthogonal complement of $H_1$, of rank $k-1-d$. It can be represented as

$$\Pi^* = CC^\top$$

where $C$ is a $(k-1) \times (k-1-d)$ matrix with orthonormal columns (such that $C^\top C = I_{k-1-d}$). Thus

$$\chi^2(Z) = \left\|n^{1/2}C^\top Z^* + o_p(1)\right\|^2 + o_p(1).$$

Since

$$n^{1/2}Z^* = n^{1/2}F(n^{-1}Z - p) \overset{\mathcal L}{\Rightarrow} N_{k-1}(0, I_{k-1}),$$

it follows that

$$n^{1/2}C^\top Z^* \overset{\mathcal L}{\Rightarrow} N_{k-1-d}(0, I_{k-1-d}).$$

This implies

$$\chi^2(Z) \overset{\mathcal L}{\Rightarrow} \chi^2_{k-d-1}.$$

We see that if $d$ parameters must be estimated under $H$, then the degrees of freedom in the limiting $\chi^2$ law are $k-d-1$. We argued for a hypothesis $H : p \in H_0$ where $H_0$ is a $d$-dimensional affine manifold in $\mathbb R^k$ described in (9.18) ($0 \le d < k-1$).


Theorem 9.5.2 Consider Model $M_{d,2}$: the observed random $k$-vector $Z$ has law $\mathcal L(Z) = M_k(n,p)$ where $p$ is unknown. Let $H_0$ be a $d$-dimensional set of probability vectors of the form (9.18). Consider the hypotheses

$$H : p \in H_0, \qquad K : p \notin H_0.$$

Let $\chi^2_{k-d-1;1-\alpha}$ be the lower $(1-\alpha)$-quantile of the distribution $\chi^2_{k-d-1}$. The test $\varphi(Z)$ defined by

$$\varphi(Z) = \begin{cases} 1 & \text{if } \chi^2(Z) > \chi^2_{k-d-1;1-\alpha} \\ 0 & \text{otherwise} \end{cases} \qquad (9.25)$$

where

$$\chi^2(Z) = \sum_{j=1}^k \frac{(n\hat p_j - Z_j)^2}{n\hat p_j} \qquad (9.26)$$

is the $\chi^2$-statistic and $\hat p = (\hat p_1,\ldots,\hat p_k)$ is the MLE of $p$ relative to $H_0$, is an asymptotic $\alpha$-test.

Consider now a hypothesis of the form $H : p \in \mathcal P$ where $\mathcal P$ is a parametric family of probability vectors

$$\mathcal P = \{p_\vartheta,\; \vartheta \in \Theta\}$$

and $\Theta \subseteq \mathbb R^d$. Under some smoothness conditions, and assuming that the mapping $\vartheta \mapsto p_\vartheta$ is one-to-one, the set $\mathcal P$ can be regarded as a "smooth" subset of the probability simplex $S_P$. We can assume that at every point $p \in \mathcal P$ there is a tangent set of the form $H_0$ (or a tangent affine subspace $H_1$) which has the same dimension $d$. In this sense, "locally" we are back in the previous case of Theorem 9.5.2. Here "locally" means that if the MLE $\hat\vartheta$ of $\vartheta$ is consistent, it will point us to a vicinity of the true parameter $\vartheta$, i.e. of the true underlying probability vector $p_\vartheta$, and in this vicinity we can substitute $\mathcal P$ by its tangent space $H_0$ at $p_\vartheta$. This is the outline of the proof that the $\chi^2$-statistic with estimated parameters

$$\chi^2(Z) = \sum_{j=1}^k \frac{\left(np_j(\hat\vartheta) - Z_j\right)^2}{np_j(\hat\vartheta)}, \qquad (9.27)$$

where $\hat p = (p_1(\hat\vartheta),\ldots,p_k(\hat\vartheta))$, still has a limiting $\chi^2$-distribution with $k-d-1$ degrees of freedom. The essential condition is that $d$ parameters are to be estimated, and $\vartheta \mapsto p_\vartheta$ is smooth and one-to-one.

In conjunction with binned data, this procedure can be used to test that the data are in a specific class of distributions, not necessarily multinomial. Suppose observations are i.i.d. real valued $X_1,\ldots,X_n$, with distribution $Q$. Suppose $\mathcal Q = \{Q_\vartheta,\; \vartheta \in \Theta\}$ is a specific class of distributions, and consider hypotheses

$$H : Q \in \mathcal Q, \qquad K : Q \notin \mathcal Q.$$

This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets or cells $A_1,\ldots,A_k$, as discussed above. We obtain a vector of cell probabilities

$$p(Q) = (Q(A_1),\ldots,Q(A_k)).$$


The observed vector of cell frequencies $Z$ is multinomial $M_k(n,p)$ with the above value of $p$. When $Q$ takes values inside the family $\mathcal Q$, then $p(Q)$ also takes values inside a family $\mathcal P$ defined by

$$\mathcal P := \{p(Q_\vartheta),\; \vartheta \in \Theta\}.$$

From the initial hypotheses $H$, $K$ one obtains derived hypotheses

$$H' : p(Q) \in \mathcal P, \qquad K' : p(Q) \notin \mathcal P.$$

Under smoothness conditions on $\mathcal Q$ it is clear that as $n \to \infty$, the $\chi^2$-test based on $Z$, with estimated parameters relative to the hypothesis $H'$, is again an asymptotic $\alpha$-test. Here the degrees of freedom in the limiting $\chi^2$-distribution are $k-d-1$ if the family $\mathcal Q$ has $d$ parameters.

It should be stressed however that the estimator $\hat\vartheta$ should be the MLE based on the binned multinomial data $Z$, not on the original data $X_1,\ldots,X_n$. Thus, for a test of normality, strictly speaking, one cannot use the sample mean and sample variance as estimators, but one has to get the binned data first and then estimate mean and variance from these multinomial data by maximum likelihood.
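To make the procedure concrete, here is a hedged sketch, added editorially (not from the handout; NumPy/SciPy assumed), of a binned goodness-of-fit test for the Poisson family: the single parameter $\lambda$ is estimated by maximizing the multinomial likelihood of the binned data, so $d = 1$ and the limiting law is $\chi^2_{k-2}$.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=500)                    # data; H: Q is some Poisson law

# cells: {0}, {1}, ..., {6} and a tail cell {7, 8, ...}  ->  k = 8
support = np.arange(7)
z = np.array([(x == j).sum() for j in support] + [(x >= 7).sum()])
n, k = z.sum(), len(z)

def cell_probs(lam):
    p = stats.poisson.pmf(support, lam)
    return np.append(p, 1.0 - p.sum())            # last cell collects the tail

# MLE of lambda from the *binned* data: maximize the multinomial log-likelihood
negloglik = lambda lam: -np.sum(z * np.log(cell_probs(lam)))
lam_hat = optimize.minimize_scalar(negloglik, bounds=(0.1, 20.0), method="bounded").x

expected = n * cell_probs(lam_hat)
chi2_stat = ((expected - z) ** 2 / expected).sum()
df = k - 1 - 1                                    # d = 1 estimated parameter
print(lam_hat, chi2_stat, stats.chi2.ppf(0.95, df))
```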

9.6 Chi-square tests for independence

Consider a bivariate random variable $X = (X_1,X_2)$ taking values in the finite set of pairs $(j,l)$, $1 \le j \le r$, $1 \le l \le s$, where $r,s \ge 2$, with probabilities

$$P((X_1,X_2) = (j,l)) = p_{jl}.$$

These probabilities give the joint distribution of $(X_1,X_2)$, with marginal distributions

$$P(X_1 = j) = \sum_{l=1}^s p_{jl} =: q_{1,j}, \qquad P(X_2 = l) = \sum_{j=1}^r p_{jl} =: q_{2,l}.$$

We are interested in the problem whether $X_1$, $X_2$ are independent, i.e. whether the joint distribution is the product of its marginals:

$$p_{jl} = q_{1,j}q_{2,l}, \quad j = 1,\ldots,r,\; l = 1,\ldots,s. \qquad (9.28)$$

Suppose that there are $n$ i.i.d. observations $X_1,\ldots,X_n$, all having the distribution of $X$. This can easily be transformed into a hypothesis about a multinomial distribution. Call the pairs $(j,l)$ cells; it is not important that they are pairs of natural numbers, these can just be symbols for certain categories. Thus there are $rs$ cells; they can be written as an $r \times s$-matrix. Define a counting variable $Y_i$ associated to observation $X_i = (X_{1i},X_{2i})$: $Y_i$ is an $r \times s$-matrix such that

$$Y_i = (Y_{i,jl})_{j=1,\ldots,r;\, l=1,\ldots,s}, \qquad Y_{i,jl} = \begin{cases} 1 & \text{if } (X_{1i},X_{2i}) = (j,l) \\ 0 & \text{otherwise.} \end{cases}$$

These $Y_i$ can be identified with vectors of dimension $k = rs$; when they are looked upon as vectors, they have a multinomial distribution

$$\mathcal L(Y_i) = M_k(1,p)$$


where $p$ is the $r \times s$-matrix of cell probabilities $p_{jl}$. We can also define counting vectors for each variable $X_1$, $X_2$ separately:

$$Y_{1,i} = (Y_{1,i,j})_{j=1,\ldots,r}, \qquad Y_{2,i} = (Y_{2,i,l})_{l=1,\ldots,s},$$
$$Y_{1,i,j} = \begin{cases} 1 & \text{if } X_{1i} = j \\ 0 & \text{otherwise,} \end{cases} \qquad Y_{2,i,l} = \begin{cases} 1 & \text{if } X_{2i} = l \\ 0 & \text{otherwise.} \end{cases}$$

Then the counting matrix $Y_i$ is obtained as

$$Y_i = Y_{1,i}Y_{2,i}^\top.$$

Again we have $n$ of these observed counting vectors, and the matrix (or vector) of observed cell frequencies is

$$Z = \sum_{i=1}^n Y_i.$$

Let us stress again that we identify an $r \times s$ matrix with an $rs$-vector here. We write $M_{r\times s}(n,p)$ for the multinomial distribution of an $r \times s$-matrix with corresponding matrix of probabilities $p$. (This can be identified with $M_{rs}(n,p)$ when $p$ is construed as a vector.) We can also define cell frequencies for the variables $X_1$, $X_2$ separately:

$$Z_1 = \sum_{i=1}^n Y_{1,i}, \qquad Z_2 = \sum_{i=1}^n Y_{2,i}. \qquad (9.29)$$

The hypothesis of independence of $X_1$, $X_2$ translates into a hypothesis about the shape of the probability matrix $p$: according to (9.28), for

$$q_1 = (q_{1,j})_{j=1,\ldots,r}, \qquad q_2 = (q_{2,l})_{l=1,\ldots,s},$$

we have

$$H : p = q_1q_2^\top, \qquad K : \text{"$p$ is not of this form".}$$

The hypothesis $H$ can be written in the form $H : p \in \mathcal P_I$ where $\mathcal P_I$ is a parametric family of probability vectors (the lower index $I$ in $\mathcal P_I$ stands for "independence"):

$$\mathcal P_I = \left\{q_1q_2^\top,\; q_1 \in S_{P,r},\; q_2 \in S_{P,s}\right\}$$

where $S_{P,r}$ is the probability simplex in $\mathbb R^r$. Indeed, in the case of independence we have just the two marginals, which are two probability vectors in $\mathbb R^r$, $\mathbb R^s$ respectively. These marginal probability vectors have $r-1$ and $s-1$ independent parameters respectively (the respective first $r-1$, $s-1$ components). Thus $\mathcal P_I$ can be smoothly parametrized by an $(r+s-2)$-dimensional parameter $\vartheta \in \Theta$, where $\Theta$ is a subset of $\mathbb R^{r+s-2}$ (but we do not make this explicit). Thus the hypotheses are $H : p \in \mathcal P_I$, $K : p \notin \mathcal P_I$.

Define "marginal" cell frequencies

$$Z_{j\cdot} = \sum_{l=1}^s Z_{jl}, \qquad Z_{\cdot l} = \sum_{j=1}^r Z_{jl}.$$

These coincide with the components of the vectors $Z_1$, $Z_2$ defined in (9.29):

$$Z_{j\cdot} = \#\{i : X_{1i} = j\}, \qquad Z_{\cdot l} = \#\{i : X_{2i} = l\}.$$


Proposition 9.6.1 In the multinomial Model $M_{d,2}$, when $Z$ is a multinomial $r \times s$ matrix with law $\mathcal L(Z) = M_{r\times s}(n,p)$, $r,s \ge 2$, the maximum likelihood estimator $\hat p$ under the hypothesis of independence $p \in \mathcal P_I$ is

$$\hat p = \hat q_1\hat q_2^\top = (\hat q_{1,j}\hat q_{2,l})_{j=1,\ldots,r;\, l=1,\ldots,s}$$

where

$$\hat q_1 = n^{-1}Z_1, \qquad \hat q_2 = n^{-1}Z_2$$

and $Z_1$, $Z_2$ are the vectors of marginal cell frequencies

$$Z_1 = (Z_{j\cdot})_{j=1,\ldots,r}, \qquad Z_2 = (Z_{\cdot l})_{l=1,\ldots,s}.$$

Proof. The probability function for $Z$ is (for a matrix $z = (z_{jl})_{j=1,\ldots,r;\, l=1,\ldots,s}$)

$$P(Z = z) = C(n,z)\prod_{j=1}^r\prod_{l=1}^s p_{jl}^{z_{jl}}$$

where $C(n,z)$ is a factor which does not depend on the parameters. Independence means $p_{jl} = q_{1,j}q_{2,l}$, so the likelihood function is

$$l(q_1,q_2) = C(n,z)\prod_{j=1}^r\prod_{l=1}^s q_{1,j}^{z_{jl}}q_{2,l}^{z_{jl}} = C(n,z)\left(\prod_{j=1}^r\prod_{l=1}^s q_{1,j}^{z_{jl}}\right)\left(\prod_{j=1}^r\prod_{l=1}^s q_{2,l}^{z_{jl}}\right) = C(n,z)\prod_{j=1}^r q_{1,j}^{z_{j\cdot}}\prod_{l=1}^s q_{2,l}^{z_{\cdot l}}.$$

Now the factor $\prod_{l=1}^s q_{2,l}^{z_{\cdot l}}$ is the likelihood (up to a factor) for the multinomial vector $Z_2$ defined in (9.29) with parameter $q_2$, and $\prod_{j=1}^r q_{1,j}^{z_{j\cdot}}$ is proportional to the likelihood for $Z_1$. Thus maximizing over $q_1$, $q_2$ amounts to maximizing the product of two multinomial likelihoods, each in its own parameter $q_1$ or $q_2$. The maximizer of each likelihood is the unrestricted MLE in a multinomial model for $Z_1$ or $Z_2$, thus according to Proposition 9.1.1

$$\hat q_1 = n^{-1}Z_1, \qquad \hat q_2 = n^{-1}Z_2.$$

This proves the result.

We can now write down the $\chi^2$-statistic with estimated parameters (estimated under the independence hypothesis $p \in \mathcal P_I$), according to (9.27):

$$\chi^2(Z) = \sum_{j=1}^r\sum_{l=1}^s \frac{\left(Z_{jl} - n(n^{-1}Z_{j\cdot})(n^{-1}Z_{\cdot l})\right)^2}{n(n^{-1}Z_{j\cdot})(n^{-1}Z_{\cdot l})} = \sum_{j=1}^r\sum_{l=1}^s \frac{(nZ_{jl} - Z_{j\cdot}Z_{\cdot l})^2}{nZ_{j\cdot}Z_{\cdot l}}. \qquad (9.30)$$


The dimension of $Z$ is $k = rs$, while the number of estimated parameters is $d = r-1+s-1$, so according to the result in the previous section, as $n \to \infty$

$$\chi^2(Z) \overset{\mathcal L}{\Rightarrow} \chi^2_{k-d-1} = \chi^2_{rs-1-(r-1)-(s-1)} = \chi^2_{(r-1)(s-1)}.$$

We thus obtain an asymptotic $\alpha$-test for independence.

We assumed initially that the two real random variables $(X_1,X_2)$ take discrete values $(j,l)$, $1 \le j \le r$, $1 \le l \le s$. But obviously, when these both take real values, one can use two partitions $A_{1,j}$, $j = 1,\ldots,r$ and $A_{2,l}$, $l = 1,\ldots,s$ and define cells

$$B_{j,l} = A_{1,j} \times A_{2,l}.$$

For $n$ i.i.d. data $(X_{1i},X_{2i})$, this gives rise to observed cell frequencies

$$Z_{jl} = \#\{i : (X_{1i},X_{2i}) \in B_{j,l}\},$$

which form a multinomial matrix $Z$ as above. The hypothesis of independence of $X_1$ and $X_2$ gives rise to a derived hypothesis $p \in \mathcal P_I$ as above. Thus the $\chi^2$-statistic can again be used to test independence.

A contingency table is a matrix of the form

                l = 1   ...    l = s
             -----------------------------
    j = 1  |  Z_{11}  ...    Z_{1s}  |  Z_{1.}
     ...   |   ...    Z_{jl}  ...    |   ...
    j = r  |  Z_{r1}  ...    Z_{rs}  |  Z_{r.}
             -----------------------------
               Z_{.1}  ...    Z_{.s}

It serves as a symbolic aid in computing the $\chi^2$-statistic (9.30). The $\chi^2$-test for independence is also called the $\chi^2$-test in a contingency table.

Exercise. Test your random number generator for independence in consecutive pairs. If $N_1, N_2, \ldots$ is the sequence generated, take pairs $X_1 = (N_1,N_2)$, $X_2 = (N_3,N_4), \ldots$, and test independence of the first from the second component. Note: if they are not independent then presumably the pairs $X_i$ are also not independent, so the alternatives which one might formulate are different from the above. Still the contingency table provides an asymptotic $\alpha$-test.
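A minimal sketch of the exercise in code, added editorially (not part of the handout; NumPy/SciPy assumed): consecutive pairs of uniform draws are binned into an $r \times s$ contingency table, and (9.30) is compared with the $\chi^2_{(r-1)(s-1)}$ quantile. scipy.stats.chi2_contingency performs the same computation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
u = rng.random(20000)
x1, x2 = u[0::2], u[1::2]                        # consecutive pairs (N1,N2), (N3,N4), ...

edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # r = s = 4 cells per coordinate
table = np.histogram2d(x1, x2, bins=[edges, edges])[0]   # contingency table Z_jl

n = table.sum()
row, col = table.sum(axis=1), table.sum(axis=0)          # Z_j. and Z_.l
expected = np.outer(row, col) / n
chi2_stat = ((table - expected) ** 2 / expected).sum()   # equivalent to (9.30)

df = (4 - 1) * (4 - 1)
print(chi2_stat, stats.chi2.ppf(0.95, df))
print(stats.chi2_contingency(table)[0])                  # same statistic
```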


Chapter 10

REGRESSION

10.1 Regression towards the mean

To introduce the term, we start with a quotation from the Merriam-Webster dictionary* for that entry.

Regression: a: a trend or shift toward a lower or less perfect state: as a : progressive decline of a manifestation of disease b :... c : reversion to an earlier mental or behavioral level d : a functional relationship between two or more correlated variables that is often empirically determined from data and is used especially to predict values of one variable when given values of the others <the regression of y on x is linear>; specifically : a function that yields the mean value of a random variable under the condition that one or more independent variables have specified values.

Let us explain the origin of usage d in mathematical statistics. Around 1886 the biometrician Francis Galton observed the size of pea plants and their offspring. He observed an effect which can equivalently be described by observing the body height of fathers and sons in the human species; since the latter is a popular example, we shall phrase his results like this.† Predictably, he noticed that tall fathers tended to have tall sons and short fathers tended to have short sons. At the same time, he noticed that the sons of tall fathers tended to be shorter than their fathers, while the sons of short fathers tended to be less short. Thus the respective height of sons tended to be closer to the overall average body height of the total population. He called this effect "regression towards the mean".

This phenomenon can be confirmed in a probabilistic model, when we assume a joint normal distribution for the height of fathers $X$ and sons $Y$. Let $(X,Y)$ have a bivariate normal distribution with mean vector $(a,a)$, where $a > 0$. This $a$ is the average total population height; for simplicity assume it is the same for $X$ and $Y$, i.e. height does not increase from one generation to the next. It should also be assumed that $X$ and $Y$ have the same variance:

$$\mathrm{Var}(X) =: \sigma^2_X = \mathrm{Var}(Y) =: \sigma^2_Y$$

and that they are positively correlated:

$$\mathrm{Cov}(X,Y) =: \sigma_{XY} > 0$$

(recall that the correlation between $X$ and $Y$ is

$$\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}).$$

*http://www.m-w.com
†Sources differ as to what he actually observed; the textbook has fathers and sons (p. 555), while the "Encyclopedia of Statistical Sciences" quotes Galton about seeds.


Thus

$$\begin{pmatrix} X \\ Y \end{pmatrix} = N_2(a\mathbf 1, \Sigma), \qquad \Sigma = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}.$$

The average body height of sons, given the height of the father, is described by the conditional expectation $E(Y|X=x)$. To find it, we state a basic result on the conditional distribution $\mathcal L(Y|X=x)$ in a bivariate normal. Some special cases appeared already (Proposition 5.5.2).

Proposition 10.1.1 Suppose that $X$ and $Y$ have a joint bivariate normal distribution with expectation vector $\mu = (\mu_X,\mu_Y)^\top$ and positive definite covariance matrix $\Sigma$:

$$\begin{pmatrix} X \\ Y \end{pmatrix} = N_2(\mu,\Sigma), \qquad \Sigma = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}. \qquad (10.1)$$

Then

$$\mathcal L(Y|X=x) = N\left(\mu_Y + \beta(x - \mu_X),\; \sigma^2_{Y|X}\right) \qquad (10.2)$$

where

$$\beta = \frac{\sigma_{XY}}{\sigma^2_X}, \qquad \sigma^2_{Y|X} = \sigma^2_Y - \frac{\sigma^2_{XY}}{\sigma^2_X}.$$

Proof. Recall the form of the joint density $p$ of $X$ and $Y$ (Lemma 6.1.10): for $z = (x,y)^\top$

$$p(z) = p(x,y) = \frac{1}{2\pi|\Sigma|^{1/2}}\exp\left(-\frac12(z-\mu)^\top\Sigma^{-1}(z-\mu)\right).$$

The marginal density of $x$ is

$$p_1(x) = \frac{1}{(2\pi)^{1/2}\sigma_X}\exp\left(-\frac{(x-\mu_X)^2}{2\sigma^2_X}\right).$$

The density of $\mathcal L(Y|X=x)$ (conditional density) is given by

$$p(y|x) = \frac{p(x,y)}{p_1(x)},$$

and the conditional expectation $E(Y|X=x)$ can be read off that density. Note that

$$|\Sigma| = \sigma^2_X\sigma^2_Y - \sigma^2_{XY},$$

$$\Sigma^{-1} = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}^{-1} = \frac{1}{|\Sigma|}\begin{pmatrix} \sigma^2_Y & -\sigma_{XY} \\ -\sigma_{XY} & \sigma^2_X \end{pmatrix}.$$

For ease of notation write $\tilde x = x - \mu_X$, $\tilde y = y - \mu_Y$. This gives

$$p(x,y) = \frac{1}{2\pi|\Sigma|^{1/2}}\exp\left(-\frac{1}{2|\Sigma|}\left(\tilde x^2\sigma^2_Y - 2\tilde x\tilde y\sigma_{XY} + \tilde y^2\sigma^2_X\right)\right),$$

$$p(y|x) = \frac{\sigma_X}{(2\pi)^{1/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2|\Sigma|}\left(\tilde x^2\sigma^2_Y - 2\tilde x\tilde y\sigma_{XY} + \tilde y^2\sigma^2_X\right) + \frac{\tilde x^2}{2\sigma^2_X}\right).$$


Note that

$$\tilde x^2\sigma^2_Y - 2\tilde x\tilde y\sigma_{XY} + \tilde y^2\sigma^2_X = \tilde x^2\sigma^2_Y + \sigma^2_X\left(\tilde y - \tilde x\sigma_{XY}\sigma_X^{-2}\right)^2 - \tilde x^2\sigma^2_{XY}\sigma_X^{-2} = \sigma^2_X(\tilde y - \beta\tilde x)^2 + \tilde x^2\sigma^2_{Y|X},$$

$$|\Sigma|\,\sigma_X^{-2} = \sigma^2_Y - \sigma^2_{XY}\sigma_X^{-2} = \sigma^2_{Y|X}.$$

The terms involving $\tilde x^2$ now cancel out, and we obtain

$$p(y|x) = \frac{1}{(2\pi)^{1/2}\sigma_{Y|X}}\exp\left(-\frac{1}{2\sigma^2_{Y|X}}(\tilde y - \beta\tilde x)^2\right).$$

In view of

$$\tilde y - \beta\tilde x = y - \mu_Y - \beta(x - \mu_X),$$

this proves (10.2).

Corollary 10.1.2 If $X$, $Y$ have a bivariate normal distribution, then
(i) $E(Y|X=x)$ is a linear function of $x$;
(ii) the variance of $\mathcal L(Y|X=x)$ (conditional variance) does not depend on $x$.

Definition 10.1.3 Let $X$ and $Y$ have a joint bivariate normal distribution (10.1).
(i) The quantity

$$\beta = \frac{\sigma_{XY}}{\sigma^2_X}$$

is called the regression coefficient for the regression of $Y$ on $X$.
(ii) The linear function

$$y = E(Y|X=x) = \mu_Y + \beta(x - \mu_X) \qquad (10.3)$$

is called the regression function or regression line (for $Y$ on $X$).

Thus $\beta$ is the slope of the regression line and $\mu_Y - \beta\mu_X$ is its intercept. Note that the regression line always goes through the point $(\mu_X,\mu_Y)$ given by the two means. Indeed for $x = \mu_X$ in (10.3) we obtain $y = \mu_Y$. Furthermore, the absolute value of $\beta$ is bounded by

$$|\beta| \le \frac{\sigma_X\sigma_Y}{\sigma^2_X} = \frac{\sigma_Y}{\sigma_X}$$

as a consequence of the Cauchy-Schwarz inequality. For the conditional expectation we obtain

$$E(Y|X=x) = \mu_Y + \beta(x - \mu_X).$$

For the father/son height model of Galton, we assumed that $\mu_Y = \mu_X = a$, furthermore $\sigma_Y = \sigma_X$, and positive correlation: $\sigma_{XY} > 0$. This implies

$$0 < \beta \le 1.$$

Here equality is actually not possible: since $\Sigma$ is positive definite, it cannot be singular, hence $|\Sigma| = \sigma^2_X\sigma^2_Y - \sigma^2_{XY} > 0$, so that

$$|\sigma_{XY}| < \sigma_X\sigma_Y,$$


which means that

$$0 < \beta < 1.$$

This gives the desired mathematical explanation of Galton's "regression toward the mean". We have

$$E(Y|X=x) = a + \beta(x-a) = (1-\beta)a + \beta x,$$

which means that $E(Y|X=x)$ is a convex combination of $x$ and $a$; the average height of sons, given the height $x$ of fathers, is always "pulled toward the mean height" $a$. It is less than $x$ if $x > a$ and is greater than $x$ if $x < a$.

It is interesting to note that an analogous phenomenon is observed for the relationship of sons with given height to their fathers. Indeed, reversing the roles of $X$ and $Y$, we obtain the second regression line

$$x = E(X|Y=y) = \mu_X + \beta'(y - \mu_Y), \qquad \beta' = \frac{\sigma_{XY}}{\sigma^2_Y}. \qquad (10.4)$$

Under the assumption $\sigma^2_Y = \sigma^2_X$ we have $\beta' = \beta$, hence $0 < \beta' < 1$, so that the fathers of tall sons tend to be shorter, etc. In (10.4) $x$ is given as a function of $y$; when we put it in the same form as the first regression line, with $y$ a function of $x$, we obtain

$$y = \mu_Y + \frac{1}{\beta'}(x - \mu_X),$$

and it turns out that the other regression line also goes through the point $(\mu_X,\mu_Y)$, but has different slope $1/\beta'$. This slope is higher ($1/\beta' > 1$) if $\sigma^2_Y = \sigma^2_X$. The linear function (10.4) is said to pertain to the regression of $X$ on $Y$.

Back in the general bivariate normal $N_2(\mu,\Sigma)$, consider the random variable

$$\eta = Y - E(Y|X=x) = Y - \mu_Y - \beta(x - \mu_X);$$

we know from (10.2) that it has conditional law $N(0,\sigma^2_{Y|X})$, given $x$. It is often called the residual (random variable). With the conditional law of $\eta$, we can form the joint law of $\eta$ and $X$; since the conditional law does not depend on $x$ (Corollary 10.1.2 (ii)), it turns out that $\eta$ and $X$ are independent.
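The "regression toward the mean" effect and the independence of the residual from $X$ are easy to see in a simulation. A minimal sketch, added editorially (not from the handout; NumPy assumed), with $a = 175$, equal variances and correlation 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)
a, sigma, rho = 175.0, 7.0, 0.5
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([a, a], cov, size=200000).T   # fathers, sons

beta = rho                          # here beta = sigma_XY / sigma_X^2 = rho
tall = x > a + sigma                # condition on tall fathers
print(x[tall].mean(), y[tall].mean())     # sons' mean is pulled toward a

eta = y - a - beta * (x - a)              # residual
print(np.corrcoef(eta, x)[0, 1])          # approx 0 (eta and X independent)
```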

Corollary 10.1.4 If $X$, $Y$ have a bivariate normal distribution, then the residual

$$\eta = Y - E(Y|X) = Y - \mu_Y - \beta(X - \mu_X)$$

and $X$ are independent.

Recall that $E(Y|X)$ denotes the conditional expectation as a random variable, i.e. $E(Y|X=x)$ when $X$ is understood as random. As a consequence, we can write any bivariate normal distribution as

$$Y = \mu_Y + \beta\xi + \eta \qquad (10.5)$$
$$X = \mu_X + \xi \qquad (10.6)$$


where $\xi$, $\eta$ are independent normal with laws $N(0,\sigma^2_X)$, $N(0,\sigma^2_{Y|X})$ respectively. If $\mu_X = 0$ we can write

$$Y = \mu_Y + \beta X + \eta.$$

Assume that we wish to obtain a representation (10.5), (10.6) in terms of standard normals. Define $\eta_0 = \sigma^{-1}_{Y|X}\eta$, $\xi_0 = \sigma^{-1}_X\xi$; then for the matrix

$$M = \begin{pmatrix} \sigma_X & 0 \\ \beta\sigma_X & \sigma_{Y|X} \end{pmatrix} = \begin{pmatrix} \sigma_X & 0 \\ \sigma_{XY}\sigma^{-1}_X & \left(\sigma^2_Y - \sigma^2_{XY}\sigma^{-2}_X\right)^{1/2} \end{pmatrix}$$

we have, for $Z = (X,Y)^\top$, $U = (\xi_0,\eta_0)^\top$,

$$Z = \mu + MU, \qquad \mathcal L(U) = N_2(0,I_2). \qquad (10.7)$$

Since $\Sigma$ is the covariance matrix of $Z$ we should have

$$\Sigma = MM^\top$$

according to the rules for the multivariate normal. The above relation can indeed be verified, and represents a decomposition of $\Sigma$ into a product of a lower triangular matrix with its transpose (a Cholesky decomposition).
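A quick numerical check of $\Sigma = MM^\top$ and of the representation (10.7), added editorially (not from the handout; NumPy assumed; np.linalg.cholesky returns exactly such a lower triangular factor):

```python
import numpy as np

sx, sy, sxy = 2.0, 3.0, 2.5
Sigma = np.array([[sx**2, sxy], [sxy, sy**2]])

beta = sxy / sx**2
s_cond = np.sqrt(sy**2 - sxy**2 / sx**2)           # sigma_{Y|X}
M = np.array([[sx, 0.0], [beta * sx, s_cond]])

print(np.allclose(M @ M.T, Sigma))                 # True: Sigma = M M^T
print(np.allclose(M, np.linalg.cholesky(Sigma)))   # True: M is the Cholesky factor

# simulate via (10.7): Z = mu + M U with U standard normal
rng = np.random.default_rng(5)
U = rng.standard_normal((2, 100000))
Z = np.array([[1.0], [2.0]]) + M @ U
print(np.cov(Z))                                   # approx Sigma
```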

Next we define a regression function for possibly nonnormal distributions.

Definition 10.1.1 Let $X$, $Y$ have a continuous or discrete joint distribution, where the first moment of $Y$ exists: $E|Y| < \infty$. The regression function (for $Y$ on $X$) is defined as

$$r(x) := E(Y|X=x).$$

In the normal case, we have seen that $r$ is a linear function, determined by the regression coefficient and the two means. In the general case, one should verify that $E(Y|X=x)$ exists, and clarify the problem of uniqueness. Let us do that in the continuous case.

Proposition 10.1.1 Let $X$, $Y$ have a continuous joint distribution, where $E|Y| < \infty$.
(i) There is a version of the conditional density $p(y|x)$ such that for this density, $E(Y|X=x)$ exists for all $x$.
(ii) For all versions of $p(y|x)$, the law of the random variable $E(Y|X) = r(X)$ is the same, thus $\mathcal L(E(Y|X))$ is unique.

Proof. Recall Definition 5.4.2: any version of the conditional density $p(y|x)$ fulfills

$$p(x,y) = p(y|x)p_X(x). \qquad (10.8)$$

Now $E|Y| < \infty$ means that

$$\infty > \int\!\!\int |y|\,p(x,y)\,dy\,dx = \int\!\!\int |y|\,p(y|x)p_X(x)\,dy\,dx = \int p_X(x)\left\{\int |y|\,p(y|x)\,dy\right\}dx.$$

Let $A$ be the set of $x$ such that $g(x) = \int|y|\,p(y|x)\,dy$ is infinite. For this $A$ we must have $\int_A p_X(x)\,dx = 0$, or else the whole expression above would be infinite. (This needs a few lines of reasoning with integrals.) Hence we must have $P(X \in A) = 0$. For these $x$ we can modify our version of $p(y|x)$, e.g. take it as the standard normal density (as in the proof of Lemma 5.4.3), so that $\int|y|\,p(y|x)\,dy$ is also finite for these $x$. Thus we have found a version $p(y|x)$ such that

$$E(Y|X=x) = \int y\,p(y|x)\,dy \qquad (10.9)$$

exists for all $x$. This proves (i).

For (ii), multiply (10.8) by $y$ and integrate over $y$ to obtain

$$\int y\,p(x,y)\,dy = r(x)p_X(x). \qquad (10.10)$$

Note that two densities for a probability law must coincide except on a set of probability 0. This holds for $p(x,y)$, and it implies that two versions of the function $\int y\,p(x,y)\,dy$ must coincide for almost all $x$ (in terms of the law of $X$). But also the versions of the density $p_X(x)$ coincide except on a set of probability 0 in terms of $X$; and then (10.10) implies such a property for $r(x)$. This means that $\mathcal L(r(X))$ is uniquely determined.

Note that strictly speaking, we are not entitled to speak of "the" regression function $r(x)$ above, as it is not unique. However the law of the random variable $E(Y|X) = r(X)$ is unique.

The next statement recalls a "best prediction" property of the conditional expectation. In the framework of discrete distributions, this was already discussed in Remark 2.4, in connection with the properties of the conditional expectation $E(Y|X)$ as a random variable. For the normal case, this was exercise H5.2.

Proposition 10.1.2 Let $X$, $Y$ have a continuous or discrete joint distribution, where the second moment of $Y$ exists: $E|Y|^2 < \infty$. Then the regression function has the property that

$$E(Y - r(X))^2 = \min_{f\in\mathcal M_X} E(Y - f(X))^2$$

where $\mathcal M_X$ is the set of all (measurable) functions of $X$ such that $E(f(X))^2 < \infty$.

Proof. This resembles other calculations with conditional expectations (cf. Remark 2.4, p. 9). A little more work is needed now to ensure that all expectations exist. We concentrate on the continuous case. First note that under the conditions, $E(Y - f(X))^2$ is finite for every $f \in \mathcal M_X$. We claim that also $r \in \mathcal M_X$. Indeed the Cauchy-Schwarz inequality gives

$$|r(x)|^2 = \left|\int y\,p(y|x)\,dy\right|^2 \le \int p(y|x)\,dy \cdot \int y^2p(y|x)\,dy = \int y^2p(y|x)\,dy.$$

The last integral can be shown to be finite for all $x$ when $E|Y|^2 < \infty$, similarly to Proposition 10.1.1 above (possibly with a modification of $p(y|x)$). Thus

$$E(r(X))^2 \le \int p_X(x)\left(\int y^2p(y|x)\,dy\right)dx = \int\!\!\int y^2p(x,y)\,dx\,dy = EY^2 < \infty,$$


which proves $r \in \mathcal M_X$. This implies that all expectations in the following reasoning are finite. We have for any $f \in \mathcal M_X$

$$E(Y - f(X))^2 = E(Y - r(X) + r(X) - f(X))^2 = E(Y - r(X))^2 + 2E\left[(Y - r(X))(r(X) - f(X))\right] + E(r(X) - f(X))^2,$$

and the middle term vanishes, when we use the formulae

$$E(\cdot) = E(E(\cdot|X)), \qquad E(Yh(X)|X) = h(X)E(Y|X).$$

Hence

$$E(Y - f(X))^2 \ge E(Y - r(X))^2.$$

The regression function is thus a characteristic of the joint distribution of $X$ and $Y$. In general $r(x) := E(Y|X=x)$ is a nonlinear regression function; it is linear if the joint distribution is normal.

In the general case, it is no longer true that the residual $\eta = Y - r(X)$ and $X$ are independent; it can only be shown that they are uncorrelated (Exercise). However one can build a bivariate distribution of $X$, $Y$ from independent $\eta$ (with zero mean) and $X$:

$$Y = r(X) + \eta. \qquad (10.11)$$

Assume that $E|r(X)| < \infty$, so that $Y$ has finite expectation and hence $E(Y|X)$ exists. It then follows that

$$E(Y|X) = E(r(X)|X) + E(\eta|X) = r(X) + E\eta = r(X),$$

so $r$ is the regression function for $(X,Y)$.

10.2 Bivariate regression models

For the joint distribution of $X$, $Y$ the relation (10.11) describes approximately the dependence of $Y$ on $X$, or a "noisy" causal relationship between $X$ and $Y$. Suppose one has i.i.d. observations $Z_i = (X_i,Y_i)^\top$, and one wishes to draw inferences on this dependence, i.e. on the regression function $r(x)$:

$$Y_i = r(X_i) + \eta_i$$

where the $\eta_i$ are the corresponding residuals. Since the regression function $r(x) = E(Y|X=x)$ depends only on the conditional distribution of $Y$ given $X$, and $X$ is observed, it makes sense to take a conditional point of view, and assume that $X_i = x_i$ where the $x_i$ are nonrandom values. These can be taken to be the realized $X_i$. Thus

$$Y_i = r(x_i) + \eta_i$$

where the $\eta_i$ are still independent. (Exercise: show that the joint conditional density of all $\eta_i = Y_i - E(Y_i|X_i)$ given all $X_i$ is the product of the individual conditional densities of the $\eta_i$.)

Definition 10.2.1 Suppose that $x_i$, $i = 1,\ldots,n$ are nonrandom values, and $\eta_i$, $i = 1,\ldots,n$ are i.i.d. random variables with zero expectation, $\mathcal L(\eta_i) = Q$. Let $\mathcal R$ be a set of functions on $\mathbb R$ and $\mathcal Q$ be a set of probability laws on $\mathbb R$. A bivariate regression model is given by observed data

$$Y_i = r(x_i) + \eta_i,$$

where it is assumed that $r \in \mathcal R$ and $Q \in \mathcal Q$.
(i) A linear regression model is obtained when $\mathcal R$ is assumed to be a set of linear functions

$$r(x) = \alpha + \beta x$$

(where linear restrictions on $\alpha$, $\beta$ may be present).
(ii) A normal linear regression model is obtained when in (i) $\mathcal Q$ is assumed to be a set of normal laws $N(0,\sigma^2)$ ($\sigma^2$ fixed or unknown).
(iii) A nonlinear regression model (wide sense) is obtained when for some parameter set $\Theta \subseteq \mathbb R^k$, $k \ge 1$,

$$\mathcal R = \{r_\vartheta,\; \vartheta \in \Theta\}$$

and all functions $r_\vartheta(x)$ are nonlinear in $x$.
(iv) A nonparametric regression model is obtained when $\mathcal R$ is a class of functions of $x$ which cannot be smoothly parametrized by some $\vartheta \in \Theta \subseteq \mathbb R^k$, e.g. a set of differentiable functions

$$\mathcal R = \left\{r : r' \text{ exists},\; |r'(x)| \le C \text{ for all } x\right\}.$$

In a nonparametric regression model, the functions are also nonlinear in $x$, but the term nonlinear regression is reserved for families $\{r_\vartheta,\; \vartheta \in \Theta\}$ indexed by a finite dimensional $\vartheta$. A typical example of nonlinear regression is polynomial regression

$$r(x) = \sum_{j=0}^k \vartheta_j x^j$$

or more generally

$$r_\vartheta(x) = \sum_{j=0}^k \vartheta_j\varphi_j(x)$$

where the $\varphi_j$ form a system of functions (e.g. trigonometric regression). However in these examples, the function values $r_\vartheta(x)$ depend linearly on the parameter $\vartheta = (\vartheta_j)_{j=0,\ldots,k}$. We shall see below that these can be treated similarly to the linear case. Therefore the name nonlinear regression model in a narrow sense is reserved for models where $r_\vartheta(x)$ depends on $\vartheta$ nonlinearly, e.g.

$$r_\vartheta(x) = \sin(\vartheta x).$$

In the linear case, we have

$$Y_i = \alpha + \beta x_i + \eta_i, \quad i = 1,\ldots,n, \qquad (10.12)$$

and commonly it is assumed that the variance exists: $EY_1^2 < \infty$. The linear case (without normality assumption) is simply called a linear model.

Note that the normal location and location-scale models are special cases of the normal linear model, when a restriction $\alpha = 0$ is assumed and $x_i = 1$, $i = 1,\ldots,n$. Here the $x_i$ do not resemble realizations of i.i.d. random variables; this possibility enlarges considerably the scope of the linear model. Indeed the $x_i$ can be chosen or designed, e.g. taken all as 1 as above or as an equidistant grid (or mesh):

$$x_i = i/n, \quad i = 1,\ldots,n.$$

The set $x_i$, $i = 1,\ldots,n$ is also called a regression design. Especially the case of nonparametric regression with an equidistant grid resembles the problems of function interpolation from noisy data (also called smoothing) in numerical analysis.
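A hedged sketch of such a model on an equidistant design, added editorially (not from the handout; NumPy assumed). The fit by np.polyfit is ordinary least squares, anticipating the estimation methods for linear models discussed below.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = np.arange(1, n + 1) / n                  # equidistant design x_i = i/n
alpha, beta, sigma = 1.0, 2.0, 0.5
y = alpha + beta * x + sigma * rng.standard_normal(n)   # Y_i = alpha + beta x_i + eta_i

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)           # least-squares line
print(alpha_hat, beta_hat)                              # close to (1.0, 2.0)
```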


10.3 The general linear model

Our starting point above was a bivariate normal distribution of variables $(X,Y)$, and the description of $E(Y|X)$ as the best predictor of $Y$ given $X$. The same reasoning is possible when we have several variables $X_1,\ldots,X_k$, i.e. a whole vector $X = (X_1,\ldots,X_k)^\top$, and we are interested in an approximate causal relationship between $X$ and $Y$. This of course allows modelling of much more complex relationships. The $X_1,\ldots,X_k$ are called regressor variables and $Y$ the regressand. Again $E(Y|X)$ can be shown to be a linear function of $X$ (in the jointly normal case), but we forgo this and proceed directly to setting up a linear regression model in several nonrandom regressors.

Let $x_1,\ldots,x_n$ be a set of nonrandom $k$-vectors. These might be independent realizations of a random vector $X$, but they might also be designed values. We assume that for some vector $\beta = (\beta_1,\ldots,\beta_k)^\top$, observations are

$$Y_i = x_i^\top\beta + \varepsilon_i, \quad i = 1,\ldots,n,$$

where the $\varepsilon_i$ are i.i.d. with $E\varepsilon_1 = 0$, $E\varepsilon_1^2 < \infty$. This is a direct generalization of (10.12).

Definition 10.3.1 Let $X$ be a nonrandom $n \times k$-matrix of rank $k$, and $\varepsilon$ be a random $n$-vector such that

$$E\varepsilon = 0, \qquad \mathrm{Cov}(\varepsilon) = \sigma^2I_n.$$

(i) A (general) linear model is given by an observed $n$-vector

$$Y = X\beta + \varepsilon$$

where $X$ is fixed (known), $\varepsilon$ is unobserved and $\beta \in \mathbb R^k$ is unknown, and $\sigma^2$ is either known or unknown ($\sigma^2 > 0$).
(ii) A normal linear model is obtained when $\varepsilon$ is assumed to have a normal distribution.

Notation and terminology. Now $X$ is an $n \times k$-matrix, whereas in the previous section $X$ was a random variable (the regressor variable), and, generalizing this, in the first paragraph of this section $X = (X_1,\ldots,X_k)$ was a random vector of regressor variables. This reflects the situation that the matrix $X$ may arise from independent realizations $x_i$ of the random vector $X$, in conjunction with a conditional point of view (the $x_i$ are considered nonrandom, and form the rows of the matrix $X$). Therefore we keep the symbol $X$ for the matrix above; $X$ may be called the regression matrix. The columns of $X$ ($\xi_1,\ldots,\xi_k$, say) may be called nonrandom regressor variables; they correspond to the random regressors $X_1,\ldots,X_k$.

In the normal linear model, we have $\mathcal L(\varepsilon) = N_n(0,\sigma^2I_n)$, and the components $\varepsilon_i$ are i.i.d. $N(0,\sigma^2)$. Hence $\mathcal L(Y) = N_n(X\beta,\sigma^2I_n)$. In the general case, the $\varepsilon_i$ are only uncorrelated; however, when modelling real world phenomena by a linear model, r.v.'s which are uncorrelated but not independent will not often occur.

Example 10.3.2 Let U have the uniform distribution on [0, 1] and

Z = cos(2πU), Y = sin(2πU).

Then Z² + Y² = 1 and the pair (Z, Y) takes values on the unit circle, which implies that Z, Y are not independent (their joint distribution on R² is not the product of its marginals).


But Z, Y are uncorrelated:

EZ = ∫_0^1 cos(2πu) du = 0,   EY = 0,

Cov(Z, Y) = EZY = ∫_0^1 cos(2πu) sin(2πu) du = 0.  ¤
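A quick numerical illustration of this example (not from the text; a minimal numpy sketch, with an arbitrary sample size and seed):

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)
z = np.cos(2 * np.pi * u)
y = np.sin(2 * np.pi * u)

# The sample correlation is close to 0, as the computation above predicts ...
print(np.corrcoef(z, y)[0, 1])            # approximately 0
# ... yet Z and Y are functionally dependent: Z^2 + Y^2 = 1 exactly.
print(np.max(np.abs(z**2 + y**2 - 1)))    # 0 up to rounding error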

Identifiability.

Let us explain the assumption rank(X) = k. Let x_i^T, i = 1, . . . , n, be the rows of the n × k-matrix X and ξ_j, j = 1, . . . , k, the columns:

X = ⎛ x_1^T ⎞
    ⎜  ...  ⎟ = ( ξ_1, . . . , ξ_k ).
    ⎝ x_n^T ⎠

Recall that rank(X) = k is equivalent to each of the following two conditions:

Lin(x_1, . . . , x_n) = R^k,    (10.13)
ξ_1, . . . , ξ_k are linearly independent n-vectors,    (10.14)

(where Lin(·) denotes the linear space spanned by a set of vectors, also called the linear span or linear hull).

Definition 10.3.3 Let P = {P_ϑ; ϑ ∈ Θ} be a family of probability laws on a general space Z. The parameter ϑ is called identifiable in P if for all pairs ϑ_1, ϑ_2 ∈ Θ, ϑ_1 ≠ ϑ_2 implies P_{ϑ_1} ≠ P_{ϑ_2}.

If P is a statistical model, i.e. the set of possible (assumed) distributions of an observed random variable Z with values in Z, then nonidentifiability means that there exist two parameters ϑ_1 ≠ ϑ_2 which lead to the same law of Z. These cannot be distinguished in any statistical sense: for the testing problem H : ϑ = ϑ_1 vs. K : ϑ = ϑ_2 the trivial randomized test φ(Z) = α is a most powerful α-test. A parameter ϑ is just an index or name for a probability law; identifiability means that no law in P has two names.
Identifiability is thus a basic condition for a statistical model if inference on the parameter ϑ is desired. If ϑ is nonidentifiable in P then it is advisable to reparametrize, i.e. give the laws other names which are identifiable.
Assume for a moment that the assumption rank(X) = k is not part of the definition of the linear model (Definition 10.3.1(i)).

Lemma 10.3.4 In the normal linear model, β is identifiable if and only if rank(X) = k.

Proof. Assume rank(X) = k and β_1 ≠ β_2. Since

EY = Xβ,

it suffices to show that Xβ_1 ≠ Xβ_2. If Xβ_1 = Xβ_2, then we would have X(β_1 − β_2) = 0 and β_1 − β_2 ≠ 0, which contradicts (10.14).


Conversely, assume identifiability; recall L(Y) = N_n(Xβ, σ²I_n). If rank(X) < k then (10.14) is violated, i.e. ξ_1, . . . , ξ_k are linearly dependent. Hence there is β ≠ 0 such that Xβ = 0. Since also X0 = 0, the parameters β and 0 lead to the same distribution L(Y) = N_n(0, σ²I_n), hence β is not identifiable. This contradicts the assumption, hence rank(X) = k.
In the linear model without the normality assumption another parameter is present, namely the distribution of the random noise vector ε. When this law L(ε) is assumed unknown, except for the assumptions Eε = 0, Cov(ε) = σ²I_n, then the parameter is ϑ = (β, L(ε)) and L(Y) = P_ϑ is indexed by ϑ. In this situation we call β = β(ϑ) identifiable if P_{ϑ_1} = P_{ϑ_2} implies β(ϑ_1) = β(ϑ_2). It is easy to see (exercise) that also in this model the condition rank(X) = k is necessary and sufficient for identifiability of β.

10.3.1 Special cases of the linear model

1. Bivariate linear regression. We have

Yi = α+ βxi + εi, i = 1, . . . , n

Here k = 2, the rows of X are x_i^T = (1, x_i), β = (α, β)^T, and identifiability holds as soon as not all x_i are equal.

2. Normal location-scale model. Here

Y_i = β + ε_i,  i = 1, . . . , n,

k = 1, X = (1, . . . , 1)^T, and the ε_i are i.i.d. N(0, σ²), σ² > 0.

3. Polynomial regression. Here, for some design points x_i, i = 1, . . . , n,

Y_i = Σ_{j=1}^k β_j φ_j(x_i) + ε_i,  i = 1, . . . , n,

where φ_j(x) = x^{j−1}. The matrix X is made up of the columns

ξ_j = (φ_j(x_1), . . . , φ_j(x_n))^T,  j = 1, . . . , k.    (10.15)

Note that we have obtained as a special case of the linear model one which was earlier classified as a nonlinear regression model (in a wide sense), since the functions

r(x) = Σ_{j=1}^k β_j φ_j(x)

are nonlinear in x (polynomials). However, they are linear in the parameter β = (β_1, . . . , β_k)^T, and for purposes of estimating β this can be treated as a linear model.

Lemma 10.3.5 In the linear model arising from polynomial regression, the parameter β is identifiable if and only if among the design points x_i, i = 1, . . . , n, there are at least k different points.

Proof. Identifiability means linear independence of the vectors ξ_j in (10.15). This in turn means that for any coefficients β_1, . . . , β_k, the relation

r(x_i) = Σ_{j=1}^k β_j φ_j(x_i) = 0,  i = 1, . . . , n,    (10.16)


implies β_j = 0, j = 1, . . . , k. Now r(x) is a polynomial of degree at most k − 1; if not all β_j vanish, then r can have at most k − 1 different zeros. Thus if among the design points x_i, i = 1, . . . , n, there are at least k different points, then X has rank k, and thus β is identifiable. Conversely, assume that only x_1, . . . , x_{k−1} are different. Consider the polynomial r_0 of degree k − 1 having these points as zeros:

r_0(t) = Π_{j=1}^{k−1} (t − x_j).

Let β_1, . . . , β_k be the coefficients of r_0; then (10.16) holds for r = r_0, hence the vectors ξ_j are linearly dependent.
The fact that some of the design points x_i may be the same opens up the possibility of a design with replication: for a fixed number m ≥ k, take m different points x*_j, j = 1, . . . , m, and repeated measurements at these points, l times say, so that n = ml. The entire design may then be written with a double index:

x_jk = x*_j,  j = 1, . . . , m,  k = 1, . . . , l,

and each of the x*_j appears l times. As a result, using double index notation again, we obtain observations

Y_jk = r(x*_j) + ε_jk,  j = 1, . . . , m,  k = 1, . . . , l,    (10.17)

which suggests taking averages Y_j· = l^{-1} Σ_{k=1}^l Y_jk to obtain a simplified model, with more accurate "data" Y_j·. The choice of such a replicated design may be advantageous.
Comment on notation: we now use k also as a running index, even though above k denoted the number of functions φ_j involved, i.e. the dimension of the regression parameter β. In the sequel we will use d for the dimension of β. The reason is that the use of k in expressions such as Y_jk is traditional, in connection with regression and replicated designs.
4. Analysis of variance. Consider a model

Yjk = µj + εjk, j = 1, . . . ,m, k = 1, . . . , l (10.18)

where the ε_jk are independent noise variables. Here again a replication structure is present, similarly to (10.17), but no particular form is assumed for the function r. Thus, if r is an arbitrary function in (10.17), we might as well write µ_j = r(x*_j) and assume the µ_j unrestricted. The case m = 2 (for normal ε_jk) was already encountered in the two sample problem. Suppose there are two "treatments", j = 1, 2, with respective observations Y_jk, j = 1, 2, and one wishes to test whether the treatments have an effect: H : µ_1 = µ_2 vs. K : µ_1 ≠ µ_2. To explain the name "analysis of variance", let us find the likelihood ratio test. In HW 7.1 we found a certain t-test for this problem (cf. also HW 6.1); this will turn out to be the LR test.
Define the two sample means and variances

y_i· = l^{-1} Σ_{k=1}^l y_ik,   S²_il = l^{-1} Σ_{k=1}^l (y_ik − y_i·)²,  i = 1, 2,

and also the pooled estimates (where n = 2l)

y_·· = n^{-1} (l y_1· + l y_2·) = (y_1· + y_2·)/2,    (10.19)

S²_n = n^{-1} ( Σ_{k=1}^l (y_1k − y_··)² + Σ_{k=1}^l (y_2k − y_··)² ).    (10.20)


Note that the pooled sample variance can be decomposed: since

Σ_{k=1}^l (y_1k − y_··)² = Σ_{k=1}^l (y_1k − y_1·)² + l (y_1· − y_··)²,

we obtain

S²_n = ½ (S²_1l + S²_2l) + ½ Σ_{j=1,2} (y_j· − y_··)².    (10.21)

Note that the second term has the form of a sample variance, for a "sample" of size 2 with "observations" y_1·, y_2·. The first term can be seen as the "variability within groups" and the second term as the "variability between groups".

Theorem 10.3.6 In the Gaussian two sample problem (10.18), L(ε_jk) = N(0, σ²), σ² > 0 unknown, for the hypotheses
H : µ_1 = µ_2
K : µ_1 ≠ µ_2
the LR statistic is

L(y_1, y_2) = ( 1 + [½ Σ_{j=1,2} (y_j· − y_··)²] / [½ (S²_1l + S²_2l)] )^{n/2}.

The LR test is equivalent to a t-test which rejects when |T| is too large, where

T = l^{1/2} (y_1· − y_2·) / ( S̄²_1l + S̄²_2l )^{1/2},   S̄²_il = (l/(l − 1)) S²_il,  i = 1, 2,

and L(T) = t_{2l−2} under H.

Comment. This form of the likelihood ratio statistic explains the name "analysis of variance". The pooled sample variance is decomposed according to (10.21), and the LR test rejects if the variability between groups is too large compared to the variability within groups.
Proof. The argument is very similar to the proof of Proposition 8.4.2 about the LR tests for the one sample Gaussian location-scale model; it can be considered the "two sample analog". Since we have two independent samples with different expectations µ_i and the same variance, the joint density p_{µ_1,µ_2,σ²} of all the data y_1 = (y_11, . . . , y_1l), y_2 = (y_21, . . . , y_2l) is, under the alternative,

p_{µ_1,µ_2,σ²}(y_1, y_2) = Π_{i=1,2} (2πσ²)^{-l/2} exp( −( S²_il + (y_i· − µ_i)² ) / (2σ² l^{-1}) ).    (10.22)

Maximizing the likelihood under the alternative, we obtain the estimates µ̂_i = y_i· and, for n = 2l,

max_{µ_1≠µ_2, σ²>0} p_{µ_1,µ_2,σ²}(y_1, y_2)
= (2πσ̂²)^{-l} exp( −(S²_1l + S²_2l) / (2σ̂² l^{-1}) )
= (2πσ̂²)^{-n/2} exp( −[(S²_1l + S²_2l)/2] / (2σ̂² n^{-1}) )
= (σ̂²)^{-n/2} (2π)^{-n/2} exp(−n/2),


where

σ̂² = (S²_1l + S²_2l)/2.

Under the hypothesis µ_1 = µ_2 = µ the data y_1k, y_2k are identically distributed, as N(µ, σ²). Thus we can treat the two samples as one pooled sample of size n, and form the pooled mean and variance estimates (10.19), (10.20). To maximize the likelihood in µ, σ², we just refer to the results in the one sample case (Proposition 8.4.2) to obtain

max_{µ, σ²>0} p_{µ,µ,σ²}(y_1, y_2) = (σ̂²_0)^{-n/2} (2π)^{-n/2} exp(−n/2),

where σ̂²_0 = S²_n is the pooled variance (10.20).

Thus the likelihood ratio is

L(y_1, y_2) = [ max_{µ_1≠µ_2, σ²>0} p_{µ_1,µ_2,σ²}(y_1, y_2) ] / [ max_{µ, σ²>0} p_{µ,µ,σ²}(y_1, y_2) ]
= ( σ̂²_0 / σ̂² )^{n/2} = ( S²_n / [(S²_1l + S²_2l)/2] )^{n/2}.

The LR test compares the sample variance under the hypothesis, S²_n, with the sample variance under the alternative, which can be taken to be (S²_1l + S²_2l)/2. Using (10.21), we obtain

L(y_1, y_2) = ( 1 + Σ_{j=1,2} (y_j· − y_··)² / (S²_1l + S²_2l) )^{n/2}.

As a consequence of (10.19), we have

y1· − y·· = (y1· − y2·)/2, y2· − y·· = (y2· − y1·)/2

and hence

L(y_1, y_2) = ( 1 + T² / (2(l − 1)) )^{n/2}.

The statistic T has a t-distribution with 2l − 2 degrees of freedom, since y_1·, y_2·, S²_1l, S²_2l are all independent,

T = (l/2)^{1/2} (y_1· − y_2·) / ( (S̄²_1l + S̄²_2l)/2 )^{1/2},
L( (l/2)^{1/2} (y_1· − y_2·) ) = N(0, σ²),
L( (2(l − 1)/σ²) · (S̄²_1l + S̄²_2l)/2 ) = χ²_{2l−2}.

The ANOVA (analysis of variance) model (10.18) is a special case of the linear model. Write

β = (µ_1, . . . , µ_m)^T,  n = ml,    (10.23)

X = diag(1_l, . . . , 1_l)  (m diagonal blocks),    (10.24)


where 1_l is the l-vector consisting of 1's, and X is the ml × m matrix in which the column vectors 1_l are arranged in a diagonal fashion (with 0's elsewhere),

Y = (Y_11, . . . , Y_1l, Y_21, . . . , Y_ml)^T,

and the n-vector ε is formed analogously. Then

Y = Xβ + ε,

rank(X) = m, and the hypothesis H : µ_1 = . . . = µ_m can be expressed as

H : β ∈ Lin(1_m),

where Lin(1_m) is the linear subspace of R^m spanned by the column vector 1_m. This is a special case of a linear hypothesis about β (i.e. β is in some specified subspace of R^m).
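For concreteness, the matrix X in (10.24) is easy to build and check numerically; the following is a small illustrative sketch (the values of m and l are arbitrary choices, not from the text):

import numpy as np

m, l = 4, 3
# X is the (ml x m) matrix with the column vectors 1_l arranged diagonally, as in (10.24)
X = np.kron(np.eye(m), np.ones((l, 1)))
print(X.shape, np.linalg.matrix_rank(X))   # (12, 4) and rank m = 4, so beta is identifiable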

10.4 Least squares and maximum likelihood estimation

Consider again the general linear model

Y = Xβ + ε, (10.25)

Eε = 0, Cov(ε) = σ2In. (10.26)

and the problem of estimating β, with known or unknown σ². If X is an n × k-matrix and rank(X) = k (identifiability), then the expectation of the random vector Y lies in the linear subspace of R^n spanned by the matrix X (or by the k columns ξ_j of X):

Lin(X) = Lin(ξ_1, . . . , ξ_k) = { Xβ, β ∈ R^k } = { z = Σ_{j=1}^k ξ_j β_j,  β_j arbitrary }.

An equivalent way of writing the linear model is

EY ∈Lin(X), Cov(Y) = σ2In.

Indeed, no distributional assumptions about ε are made in the general linear model, so the above is all the information one has. To obtain an estimator of β, one could try to apply the principle of maximum likelihood. However, since the distribution of ε (Q, say) is unspecified, it would have to be considered a parameter, along with β. Then Q and β specify the distribution of Y, and hence a likelihood for any realization Y = y. But maximizing it in Q does not lead to satisfactory results, since the class for Q (arbitrary distributions on R^n) is too large. Thus one has to look for a different principle to get an estimator of β. The distribution Q of the noise is not of primary interest in most cases; it can be considered a nuisance parameter. The regression parameter β is the parameter of interest, since it describes the dependence of Y on X, and possibly also σ². Of course one could assume normality and then find maximum likelihood estimators. However, another principle can be invoked, without a normality assumption.


Definition 10.4.1 In the general linear model (10.25), (10.26), with observed vector Y ∈ R^n, β ∈ R^k and rank(X) = k, a least squares estimator (LSE) β̂ of β is a function β̂ = β̂(Y) of the observations such that

‖Y − Xβ̂‖² = min_{β ∈ R^k} ‖Y − Xβ‖²,

where ‖·‖² is the squared Euclidean norm in R^n.

The name "least squares" derives from the fact that the squared Euclidean norm of any vector z ∈ R^n is the sum of squares of its components z_i:

‖z‖² = Σ_{i=1}^n z_i².

Recall that another form of writing the linear model was

Y_i = x_i^T β + ε_i,  i = 1, . . . , n,

where the x_i^T are the rows of X (the x_i are k-vectors). Thus another way of describing the least squares estimator β̂ is to say that for given Y it is a minimizer of

Ls_Y(β) = Σ_{i=1}^n ( Y_i − x_i^T β )².

This expression, depending on the observations Y, which is to be minimized in β, is also called the least squares criterion.
Exercise. Show that if β is the true parameter in (10.25), (10.26), then β provides a best approximation to the data Y in an average sense:

E‖Y − Xβ‖² = min_{γ ∈ R^k} E‖Y − Xγ‖².

This minimization property can serve as a justification for the least squares criterion. If we could compute E‖Y − Xγ‖² for any γ, we would take the minimizer and obtain the true β. However, to find the expectation involved we would already have to know β, so we just take ‖Y − Xγ‖² and minimize it. In this sense, β̂ can be considered an empirical analog of β.

Theorem 10.4.2 Consider the general linear model (10.25), (10.26), in the case k < n, with observed vector Y ∈ R^n, β ∈ R^k and rank(X) = k, and with σ² either known or unknown (σ² > 0).
(i) The LSE β̂ of β is uniquely determined and given by

β̂ = (X^T X)^{-1} X^T Y.    (10.27)

(ii) Under the normality assumption L(ε) = N_n(0, σ²I_n), with probability one the LSE β̂ coincides with the maximum likelihood estimator (MLE) of β, both in the case of known σ² and of unknown σ² (σ² > 0).


Proof. Note that rank(X) = k implies that (X^T X)^{-1} exists. The key argument is that the matrix

Π_X = X (X^T X)^{-1} X^T

represents the linear projection operator in R^n onto the linear subspace Lin(X). Indeed, note that Π_X is a projection matrix, i.e. idempotent (Π_X Π_X = Π_X) and symmetric (Π_X^T = Π_X). To see these two properties, note

Π_X Π_X = X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = X (X^T X)^{-1} X^T = Π_X,

and, using the matrix rule (AB)^T = B^T A^T (which implies (H^{-1})^T = (H^T)^{-1} for symmetric nonsingular H),

Π_X^T = ( X (X^T X)^{-1} X^T )^T = X ( (X^T X)^{-1} )^T X^T = X ( (X^T X)^T )^{-1} X^T = Π_X.

Hence Π_X is a projection matrix. It has rank k, and it leaves the space Lin(X) invariant: if z ∈ Lin(X) then z = Xa for some a ∈ R^k, and

Π_X z = Π_X Xa = X (X^T X)^{-1} X^T Xa = Xa = z.

Also, for any y the vector Π_X y is in Lin(X):

Π_X y = X (X^T X)^{-1} X^T y = X ỹ,  where ỹ = (X^T X)^{-1} X^T y.

Moreover, consider the orthogonal complement of Lin(X), i.e. Lin(X)^⊥. Any z ∈ Lin(X)^⊥ is mapped into 0 by the linear map Π_X: since X^T z = 0, we have

Π_X z = X (X^T X)^{-1} X^T z = 0.

These facts establish that Π_X is indeed the matrix representing the projection onto Lin(X). (As a consequence, I_n − Π_X is the projection operator onto Lin(X)^⊥.)
It is well known that the projection operator has a minimizing property: for any y ∈ R^n, Π_X y gives the element of Lin(X) closest to y (in the sense of ‖·‖). Indeed, for any z ∈ Lin(X),

‖y − z‖² = ‖y − Π_X y + Π_X y − z‖².    (10.28)

Note that Π_X y − z ∈ Lin(X), and

y − Π_X y = (I_n − Π_X) y ∈ Lin(X)^⊥,

since I_n − Π_X projects onto Lin(X)^⊥. It follows that, when we compute (10.28) via ‖z‖² = z^T z, since the two vectors are orthogonal we get a sum of squared norms:

‖y − z‖² = ‖y − Π_X y‖² + ‖Π_X y − z‖².


The right side is minimized for z = Π_X y.
Apply this minimizing property of Π_X to obtain

‖Y − Xβ‖² ≥ ‖Y − Π_X Y‖²,

where equality is achieved for

Xβ̂ = Π_X Y = X (X^T X)^{-1} X^T Y.

Pre-multiply by X^T to obtain

X^T Xβ̂ = X^T Y,

which gives the unique solution

β̂ = (X^T X)^{-1} X^T Y.    (10.29)

Part (i) is proved. For (ii), write down the likelihood function: since L(Y) = N_n(Xβ, σ²I_n), it is

L_Y(β, σ²) = (2πσ²)^{-n/2} exp( −‖Y − Xβ‖² / (2σ²) ).

For known σ², it is obvious that maximizing L_Y in β is equivalent to minimizing the least squares criterion Ls_Y(β) = ‖Y − Xβ‖².
For unknown σ², restricted only by σ² > 0, minimize L_Y(β, σ²) first in β, for given σ². The solution is again β̂ given by (10.29). We then have to maximize

L_Y(β̂, σ²) = (2πσ²)^{-n/2} exp( −n^{-1} Ls_Y(β̂) / (2σ² n^{-1}) )

in σ². This procedure was already carried out in the proof of Proposition 3.0.5 (insert now n^{-1} Ls_Y(β̂) for S²_n). Provided that Ls_Y(β̂) > 0, the solution is

σ̂² = n^{-1} Ls_Y(β̂).

Thus (ii) is proved if we show that Ls_Y(β̂) > 0 happens with probability one. Indeed Ls_Y(β̂) = 0 means that Y ∈ Lin(X). Under normality with covariance matrix σ²I_n it is obvious that Y ∈ Lin(X) happens with probability 0 if k < n, since any proper linear subspace of R^n has probability 0 (indeed, for any nonrandom vector z ∈ R^n, z ≠ 0, the event z^T Y = 0 has probability 0, since z^T Y is normally distributed with variance σ²‖z‖² > 0).
It is easy to see that if k = n then the LSE of β is still given by (10.27) and coincides with the MLE under normality if σ² is known; but if σ² is unknown then Ls_Y(β̂) = 0 and the MLE of σ² under normality does not exist (or should be taken as 0, with the likelihood function taking the value ∞). The assumption k < n is realistic, since k represents the number of independent regressor variables in most cases and can be expected to be less than n.
Exercise. Consider the special cases of the linear model discussed in the previous subsection.
1. Bivariate linear regression:

Yi = α+ βxi + εi, i = 1, . . . , n


Here k = 2, the rows of X are x_i^T = (1, x_i), β = (α, β)^T, and identifiability holds as soon as not all x_i are equal. Show that the LSEs of α, β are

α̂_n = Ȳ_n − β̂_n x̄_n,   β̂_n = Σ_{i=1}^n (Y_i − Ȳ_n)(x_i − x̄_n) / Σ_{i=1}^n (x_i − x̄_n)²,    (10.30)

where x̄_n is the mean of the nonrandom x_i.

Remark 10.4.3 Note that the formula for β̂_n is analogous to the regression coefficient in a bivariate normal distribution for (X, Y) according to Definition 10.1.3:

β = σ_XY / σ²_X.

Therefore β̂_n is also called the empirical regression coefficient (and β the theoretical coefficient). The regression function for (X, Y) was found to be

y = E(Y|X = x) = µ_y + β(x − µ_x) = α + βx,

which shows that α̂_n is also the analog of α = µ_y − βµ_x.

Since bivariate regression is very important for applications (it is programmed into scientific pocket calculators), we summarize again what was done there.

Fitting a straight line to data (bivariate linear regression). Given pairs (Y_i, x_i), i = 1, . . . , n, find the straight line y = α + βx which best fits the data, in the sense that the least squares criterion

Σ_{i=1}^n (Y_i − α − βx_i)²

is minimal. The solutions α̂_n, β̂_n are given by (10.30). The fitted straight line y = α̂_n + β̂_n x passes through the point (x̄_n, Ȳ_n) with slope β̂_n.
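As a numerical illustration of the general formula (10.27) and the explicit solutions (10.30), the following minimal numpy sketch fits a straight line to simulated data (the data, seed and true parameter values are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)                       # roughly the equidistant design x_i = i/n
alpha_true, beta_true, sigma = 1.0, 2.0, 0.5
Y = alpha_true + beta_true * x + sigma * rng.standard_normal(n)

# General LSE (10.27): beta_hat = (X^T X)^{-1} X^T Y, design matrix rows (1, x_i)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Explicit bivariate formulas (10.30)
b_n = np.sum((Y - Y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
a_n = Y.mean() - b_n * x.mean()

print(beta_hat)        # [alpha_hat, beta_hat]
print(a_n, b_n)        # the same two numbers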

2. Location-scale model. (normality not assumed). Here

Yi = β + εi, i = 1, . . . , n,

This can be obtained from 1. above when α = 0 is assumed and x_i = 1. Show that the LSE of β is the sample mean:

β̂_n = Ȳ_n.    (10.31)

3. ANOVA. Here

Y_jk = µ_j + ε_jk,  j = 1, . . . , m,  k = 1, . . . , l.

Show that the LSE of µ_j is the group mean:

µ̂_j = Y_j· = l^{-1} Σ_{k=1}^l Y_jk,  j = 1, . . . , m.


10.5 The Gauss-Markov Theorem

Consider again the general linear model:

Model (LM) (General linear model). Observations are

Y = Xβ + ε, (10.32)

Eε = 0, Cov(ε) = σ2In. (10.33)

where Y is a random n-vector, X is an n × k-matrix with rank(X) = k, β ∈ R^k is unknown, ε is an unobservable random n-vector, and L(ε) satisfies (10.33), where σ² is either known (Model (LM1)) or unknown, restricted by σ² > 0 (Model (LM2)).

Recall that the normal linear model is obtained when we add the assumption L(ε) = N(0, σ²I_n). This might be called (NLM) (possibly (NLM1) or (NLM2)).
In (LM) we are now interested in optimality properties of the least squares estimator of β,

β̂ = (X^T X)^{-1} X^T Y.

Above in (10.31) it was remarked that the sample mean Ȳ_n is a special case of β̂, for X = 1 (the n-vector consisting of 1's). In Section 5.7 it was shown that Ȳ_n is a minimax estimator under normality, for known σ² (i.e. in the Gaussian location model). In Proposition 4.3.3 it was shown that in the Gaussian location model the sample mean is also a uniformly minimum variance unbiased estimator (UMVUE). Both optimality properties depend on the normality assumption: e.g. for the second optimality, within the class of unbiased estimators, it was shown that the quadratic risk of Ȳ_n attains the Cramer-Rao bound, which is the inverse Fisher information. The Fisher information I_F is a function of the density (or probability function) of the data: recall the general form of I_F for a density p_ϑ depending on ϑ,

I_F(ϑ) = E_ϑ ( (∂/∂ϑ) log p_ϑ(Y) )².

In the linear model, the distribution of Y is left unspecified. We might ask whether the sample mean, or more generally β̂, still has any optimality properties. A natural choice for the risk is E‖β̂ − β‖², i.e. the expected loss for the squared Euclidean distance.

Note that β̂ is unbiased:

Eβ̂ = E (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T EY = (X^T X)^{-1} X^T Xβ = β.

Without normality, it cannot be shown that β̂ is best unbiased. But when we further restrict the class of estimators admitted for comparison, to linear estimators, β̂ turns out to be optimal. Recall that a similar device was already employed for the sample mean and minimaxity: in Exercise 5.7.2 a minimax linear estimator was proposed, i.e. an estimator which is minimax within the class of linear ones.


Definition 10.5.1 Consider the general linear model (LM).
(i) A linear estimator of β is any estimator b̂ of the form

b̂ = AY,

where A is a (nonrandom) k × n-matrix.
(ii) A linear estimator b̃ is called a best linear unbiased estimator (BLUE) if b̃ is unbiased,

E b̃ = β,    (10.34)

and if

E‖b̃ − β‖² ≤ E‖b̂ − β‖²    (10.35)

for all linear unbiased b̂. Relations (10.34), (10.35) are assumed to hold for all values of the unknown parameters (β ∈ R^k, and L(ε) as specified).

Theorem 10.5.2 (Gauss, Markov) In the linear model (LM), the least squares estimator is theunique BLUE.

Proof. Consider a linear unbiased estimator b̂ = AY. The unbiasedness condition implies

β = E b̂ = E AY = AXβ

for all β ∈ R^k, hence

AX = I_k.    (10.36)

The loss is

‖b̂ − β‖² = ‖AY − β‖² = ‖AXβ + Aε − β‖²    (10.37)
         = ‖Aε‖² = (Aε)^T Aε = ε^T A^T A ε.    (10.38)

For the risk we have to compute the expected loss. For any n × n-matrix B = (B_ij)_{i,j=1,...,n}, define the trace as

tr[B] = Σ_{i=1}^n B_ii.

In other words, the trace is the sum of the diagonal elements. Note that for n × k-matrices B, C and for n-vectors z, y we have

tr[B C^T] = Σ_{i=1}^n Σ_{j=1}^k B_ij (C^T)_ji = tr[C^T B],    (10.39)

tr[B^T B] = Σ_{i=1}^n Σ_{j=1}^k B²_ij = tr[B B^T],    (10.40)

tr[z y^T] = Σ_{i=1}^n z_i y_i = y^T z,   tr[z z^T] = z^T z = ‖z‖².
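The trace identities (10.39), (10.40) are easy to verify numerically; a small sketch with arbitrary matrix sizes (not from the text):

import numpy as np

rng = np.random.default_rng(8)
n, k = 5, 3
B, C = rng.standard_normal((n, k)), rng.standard_normal((n, k))
z, y = rng.standard_normal(n), rng.standard_normal(n)

print(np.isclose(np.trace(B @ C.T), np.trace(C.T @ B)))   # (10.39)
print(np.isclose(np.trace(B.T @ B), np.sum(B ** 2)))      # (10.40)
print(np.isclose(np.trace(np.outer(z, y)), y @ z))        # tr[z y^T] = y^T z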


From (10.39) we see that C ↦ tr[BC] is a linear operation on matrices, so that when C is a random matrix, then for the expectation we have

E tr[BC] = tr[B EC].

Applying this to the quadratic loss (10.37), (10.38), we have

E‖b̂ − β‖² = E ε^T A^T A ε = E tr[A ε ε^T A^T] = E tr[A^T A ε ε^T]
= tr[A^T A E(ε ε^T)] = tr[A^T A σ² I_n] = σ² tr[A^T A].

The problem thus is to minimize tr[A^T A] under the unbiasedness restriction AX = I_k from (10.36). From (10.40) we see that if vec(A) is the kn-vector formed from all elements of A, then

tr[A^T A] = Σ_{i=1}^k Σ_{j=1}^n A²_ij = ‖vec(A)‖²,

i.e. the problem is to minimize the length of the vector vec(A) under the set of affine linear restrictions AX = I_k. Consider first the special case k = 1; then a = A^T and X are vectors of dimension n, and the set {a : a^T X = 1} is an affine hyperplane in R^n. To minimize the norm of a within this set means taking a perpendicular to the hyperplane, i.e. having the same direction as X. This gives a = X(X^T X)^{-1} as the minimizer.
This argument is generalized to k ≥ 1 as follows. Let Π_X = X (X^T X)^{-1} X^T be the projection operator onto Lin(X) in the space R^n. We have

tr[A^T A] = tr[A A^T] = tr[A (I_n − Π_X + Π_X) A^T]
= tr[A (I_n − Π_X) A^T] + tr[A Π_X A^T]
= tr[A (I_n − Π_X) A^T] + tr[(X^T X)^{-1}],

in view of AX = I_k. Here the term tr[A(I_n − Π_X)A^T] is nonnegative, since for C = A(I_n − Π_X) we have (recall that I_n − Π_X is idempotent, since it is a projection)

tr[A(I_n − Π_X)A^T] = tr[C C^T] ≥ 0,

since tr[C C^T] is a sum of squares by (10.40). Thus

E‖b̂ − β‖² ≥ σ² tr[(X^T X)^{-1}].

This lower bound is attained for A = (X^T X)^{-1} X^T:

(I_n − Π_X) A^T = (I_n − Π_X) X (X^T X)^{-1} = 0    (10.41)


since I_n − Π_X projects onto Lin(X)^⊥. Thus β̂ is a BLUE.
It remains to show uniqueness. This follows from the fact that (10.41) is necessary for attainment of the lower bound, and this implies

A Π_X = A.

The left side is

A Π_X = AX (X^T X)^{-1} X^T = (X^T X)^{-1} X^T,

and the result is proved.
The BLUE property of the Gauss-Markov Theorem is not a very strong optimality, since the class of estimators is quite restricted. On the other hand, it says that β̂ is uniformly best within that class. Recall that the sample mean Ȳ_n is a special case, so we have obtained another optimality property of the sample mean. This also suggests further optimality properties of β̂ under normality (minimaxity, best unbiased estimator); these can indeed be established, but we do not discuss this here.
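A small Monte Carlo experiment can illustrate the Gauss-Markov theorem: the empirical risk of the LSE is close to σ²tr[(X^T X)^{-1}], while another linear unbiased estimator (here a weighted one, an arbitrary choice made only for illustration) has larger risk. This is only an illustrative sketch with simulated data, not from the text:

import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 30, 3, 1.0
X = rng.standard_normal((n, k))
beta = np.array([1.0, -2.0, 0.5])
W = np.diag(rng.uniform(0.5, 2.0, size=n))     # arbitrary unequal weights

A_ls = np.linalg.solve(X.T @ X, X.T)           # LSE: A = (X^T X)^{-1} X^T
A_w = np.linalg.solve(X.T @ W @ X, X.T @ W)    # another A with AX = I_k, hence unbiased

def risk(A, n_rep=20_000):
    # empirical E ||A Y - beta||^2 with uncorrelated noise, Cov(eps) = sigma^2 I_n
    eps = sigma * rng.standard_normal((n_rep, n))
    est = (X @ beta + eps) @ A.T
    return np.mean(np.sum((est - beta) ** 2, axis=1))

print(risk(A_ls), sigma**2 * np.trace(np.linalg.inv(X.T @ X)))  # close to the lower bound
print(risk(A_w))                                                # larger, as the theorem predicts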


Chapter 11
LINEAR HYPOTHESES AND THE ANALYSIS OF VARIANCE

11.1 Testing linear hypotheses

In the normal linear model (NLM),

Y = Xβ + ε,

recall that the parameter β varies in the whole space R^k. Consider a linear subspace S of R^k of dimension s < k, and a hypothesis H : β ∈ S. Such a problem, i.e. testing a linear hypothesis, arises naturally in a number of situations.
1. Bivariate linear regression. We have

Y_i = α + βx_i + ε_i,  i = 1, . . . , n.

Here k = 2, the rows of X are x_i^T = (1, x_i), β = (α, β)^T. A linear hypothesis could be H : β = 0, meaning that the nonrandom regressor variable x_i has no influence upon Y_i. If the x_i are i.i.d. realizations of a normal random variable X, L(X) = N(µ_x, σ²_X), then the regression coefficient β is (see Definition 10.1.3)

β = σ_XY / σ²_X.

The hypothesis then means that σ_XY = 0, i.e. X and Y are independent random variables, or equivalently X and Y are uncorrelated. Indeed the correlation coefficient is

ρ = σ_XY / (σ_X σ_Y)

and it is related to β by

β = ρ σ_Y / σ_X

(in "Galton's case" of fathers and sons, where σ²_X = σ²_Y, they are actually equal). Using the linear algebra formalism of (NLM), the linear subspace S would be the subspace of R² spanned by the vector (1, 0)^T, i.e.

S = Lin((1, 0)^T).

2. Normal location-scale model. Here

Yi = β + εi, i = 1, . . . , n,

k = 1, X = (1, . . . , 1)^T, and the ε_i are i.i.d. N(0, σ²), σ² > 0. The only possible linear hypothesis is H : β = 0.


3. Polynomial regression. Here for some design points xi, i = 1, . . . , n

Y_i = Σ_{j=1}^k β_j φ_j(x_i) + ε_i,  i = 1, . . . , n,

the functions φ_j are φ_j(x) = x^{j−1} and β = (β_1, . . . , β_k)^T. A linear hypothesis could be H : β_k = 0, meaning that the regression function

r(x) = Σ_{j=1}^k β_j φ_j(x)

is actually a polynomial of degree at most k − 2, and not of degree k − 1 as postulated in the model (LM). A special case is 1. above (bivariate linear regression) for k = 2. The hypothesis means that the mean function r(x) is a polynomial of lesser degree, or that the model is less complex. Of course, as always with testing problems, the statistically significant result would be a rejection; for e.g. k = 3 this means that it is not possible to describe the mean function of the Y_i by a straight line, and a higher degree polynomial is needed.

Yjk = µj + εjk, j = 1, . . . ,m, k = 1, . . . , l (11.1)

where the ε_jk are independent noise variables. The index j is associated with m "treatments" or groups, and one might wish to test whether the treatments have any effect, H : µ_1 = . . . = µ_m, or in other words, whether the groups differ in the measured characteristic Y_jk. The linear subspace S of R^m would be S = Lin(1), where 1 = (1, . . . , 1)^T ∈ R^m, and dim(S) = 1 (note that the m here corresponds to k in (10.32), (10.33)).
5. General case. Recall that the ξ_j were the columns of the matrix X, so that model (LM) can be written

Y = Σ_{j=1}^k β_j ξ_j + ε,

where the j-th column may have arisen from n independent realizations of the j-th component of a random vector X, or as designed nonrandom values. In either case, ξ_j may be construed as one of k regressor variables which influence the regressand Y*. Such a variable ξ_j is also called an explanatory variable, or one of k "independent" variables, when Y is the "dependent" variable. Thus a hypothesis H : β_j = 0 postulates that ξ_j is without influence on Y and can be dispensed with. Clearly the polynomial regression above is a special case, for ξ_j = (φ_j(x_1), . . . , φ_j(x_n))^T (designed nonrandom values). Here again, what is sought is the "statistical certainty" of a rejection, when it can be claimed that ξ_j actually does influence the measured quantity Y.
In the normal linear model (NLM), to find a test statistic for the problem
H : β ∈ S

∗Note that throughout math and statistics software, a vector of values is frequently called a ”variable”.


K : β ∉ S
we could again apply the likelihood ratio principle. That was already done in a special case of ANOVA, namely the two sample problem (cf. Theorem 10.3.6). We repeat some of that reasoning, with general notation, assuming first σ² unknown. The density of Y is (as a function of y ∈ R^n)

p_{β,σ²}(y) = (2πσ²)^{-n/2} exp( −‖y − Xβ‖² / (2σ²) ).

The LR statistic is

L(y) = [ max_{β ∉ S, σ²>0} p_{β,σ²}(y) ] / [ max_{β ∈ S, σ²>0} p_{β,σ²}(y) ].

To find the numerator, we first maximize over β ∉ S and then over σ² > 0. Maximizing first over β ∈ R^k, we obtain the LSE of β:

β̂ = (X^T X)^{-1} X^T Y.

We now claim that with probability one β̂ ∉ S, so that β̂ is also the maximizer over β ∉ S. Note that

L(β̂) = N_k(β, σ² (X^T X)^{-1})

and that the matrix (X^T X)^{-1} is nonsingular. It was already argued that a multivariate normal vector with nonsingular covariance matrix takes values in a given lower dimensional subspace with probability 0 (conclusion of the proof of Theorem 10.4.2). Thus P(β̂ ∉ S) = 1 and almost surely

max_{β ∉ S, σ²>0} p_{β,σ²}(y) = max_{σ²>0} p_{β̂,σ²}(y)
= max_{σ²>0} (2πσ²)^{-n/2} exp( −‖(I_n − Π_X)y‖² / (2σ²) ).

This maximization problem has been solved already in the proof of Proposition 3.0.5 (insert now n^{-1}‖(I_n − Π_X)y‖² for S²_n): the result is

max_{β ∉ S, σ²>0} p_{β,σ²}(y) = (σ̂²)^{-n/2} (2π)^{-n/2} exp(−n/2),

σ̂² = n^{-1} ‖(I_n − Π_X)y‖².    (11.2)

Key idea for linear hypotheses in linear models. To find the maximized likelihood under the hypothesis, we note that β ∈ S implies that Xβ varies in a linear subspace of Lin(X). Indeed, if S is a k × s-matrix such that S = Lin(S), then every β ∈ S can be represented as β = Sb for some b ∈ R^s, thus

Xβ = XSb,

where rank(XS) = s, and we see that

{Xβ, β ∈ S} = Lin(XS).

We can now apply the results for least squares (or ML) estimation of β ∈ R^k to estimation of b ∈ R^s. We could immediately write down an LSE for b and a derived one for β ∈ S, but we skip this and proceed directly to the MLE of σ² under the hypothesis, analogously to (11.2):

σ̂²_0 = n^{-1} ‖(I_n − Π_{XS})y‖²,


where the projection I_n − Π_X is replaced by I_n − Π_{XS}. Write the respective linear subspaces of R^n as

𝒳 := Lin(X),   S_0 := Lin(XS),

and Π_𝒳 = Π_X, Π_{S_0} = Π_{XS} for the associated projection operators in R^n. It follows that the maximized likelihood under H is

max_{β ∈ S, σ²>0} p_{β,σ²}(y) = (σ̂²_0)^{-n/2} (2π)^{-n/2} exp(−n/2),

σ̂²_0 = n^{-1} ‖(I_n − Π_{S_0})y‖².    (11.3)

Thus the LR statistic is

L(y) = ( σ̂²_0 / σ̂² )^{n/2} = ( ‖(I_n − Π_{S_0})y‖² / ‖(I_n − Π_𝒳)y‖² )^{n/2}.    (11.4)

For a further transformation, note that the matrix Π_𝒳 − Π_{S_0} is again a projection matrix, namely onto the orthogonal complement of S_0 in 𝒳. Denote this orthogonal complement by 𝒳 ⊖ S_0 (the space of all x ∈ 𝒳 which are orthogonal to S_0). We then have a decomposition

R^n = 𝒳^⊥ ⊕ (𝒳 ⊖ S_0) ⊕ S_0,    (11.5)

where ⊕ denotes the orthogonal sum, 𝒳^⊥ = R^n ⊖ 𝒳, and all three spaces are orthogonal. We have a corresponding decomposition of I_n (which is the projection onto R^n),

I_n = (I_n − Π_𝒳) + (Π_𝒳 − Π_{S_0}) + Π_{S_0},

and all three matrices on the right are projection matrices, orthogonal to one another. (Exercise: show that Π_𝒳 − Π_{S_0} is the projection operator onto 𝒳 ⊖ S_0.) As a consequence,

‖(I_n − Π_{S_0})y‖² = ‖(I_n − Π_𝒳 + Π_𝒳 − Π_{S_0})y‖² = ‖(I_n − Π_𝒳)y‖² + ‖(Π_𝒳 − Π_{S_0})y‖²,

and we obtain from (11.4)

L(y) = ( 1 + ‖(Π_𝒳 − Π_{S_0})y‖² / ‖(I_n − Π_𝒳)y‖² )^{n/2}.

Note that the form obtained for the two sample problem in Theorem 10.3.6 is a special case; there we could further argue that the LR test is equivalent to a certain t-test.

Definition 11.1.1 Let Z_1, Z_2 be independent r.v.'s having χ²-distributions with k_1 and k_2 degrees of freedom, respectively. The F-distribution with k_1, k_2 degrees of freedom (denoted F_{k_1,k_2}) is the distribution of

Y = (k_1^{-1} Z_1) / (k_2^{-1} Z_2).

It is possible to write down the density of the F-distribution explicitly, with methods similar to those for the t-distribution (Proposition 7.2.7). Note that for k_1 = 1, Z_1 is the square of a standard normal, hence Y is the square of a t-distributed r.v. with k_2 degrees of freedom.
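This relation between the F- and t-distributions can be checked via quantiles; a small sketch using scipy (the degrees of freedom and level are arbitrary choices):

from scipy.stats import f, t

k2, alpha = 10, 0.05
# For k1 = 1: the upper-alpha F quantile equals the squared upper-alpha/2 t quantile
print(f.ppf(1 - alpha, 1, k2), t.ppf(1 - alpha / 2, k2) ** 2)   # the two numbers agree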


Definition 11.1.2 In the normal linear model (NLM), consider a linear hypothesis given by a linear subspace S ⊂ R^k, dim(S) = s < k:
H : β ∈ S
K : β ∉ S.
The F-statistic for this problem is

F(Y) = [ ‖(Π_𝒳 − Π_{S_0})Y‖² / (k − s) ] / [ ‖(I_n − Π_𝒳)Y‖² / (n − k) ],    (11.6)

where 𝒳 = Lin(X), S_0 = Lin(XS) and S = Lin(S).

Proposition 11.1.3 In the normal linear model (NLM), under a hypothesis H : β ∈ S, the pertaining F-statistic has an F-distribution with k − s, n − k degrees of freedom.

Proof. Note that under the hypothesis,

EY = Xβ ∈ S_0,

hence

(Π_𝒳 − Π_{S_0})Y = (Π_𝒳 − Π_{S_0})(Xβ + ε) = (Π_𝒳 − Π_{S_0})ε.

But EY = Xβ ∈ 𝒳 is already part of the model assumption (LM); it holds under H in particular. Thus

(I_n − Π_𝒳)Y = (I_n − Π_𝒳)ε.

Let now e_1, . . . , e_n be an orthonormal basis of R^n such that

Lin(e_1, . . . , e_s) = S_0,   Lin(e_{s+1}, . . . , e_k) = 𝒳 ⊖ S_0,   Lin(e_{k+1}, . . . , e_n) = 𝒳^⊥,

i.e. the basis conforms to the decomposition (11.5). Then the three projection operators are given by

Π_{S_0} y = Σ_{i=1}^s (e_i^T y) e_i,  etc.,

hence

‖(I_n − Π_𝒳)ε‖² = Σ_{i=k+1}^n (e_i^T ε)²,   ‖(Π_𝒳 − Π_{S_0})ε‖² = Σ_{i=s+1}^k (e_i^T ε)².

Note that z_i = σ^{-1} e_i^T ε, i = 1, . . . , n, are i.i.d. standard normal. For the F-statistic we obtain

F(Y) = [ Σ_{i=s+1}^k z_i² / (k − s) ] / [ Σ_{i=k+1}^n z_i² / (n − k) ].

This proves the result.
An immediate consequence is the following.


Theorem 11.1.4 In the normal linear model (NLM), consider a linear hypothesis given by a linear subspace S ⊂ R^k, dim(S) = s < k:
H : β ∈ S
K : β ∉ S.
Let F(Y) be the pertaining F-statistic and F_{k−s,n−k;1−α} the lower 1 − α-quantile of the distribution F_{k−s,n−k}. The F-test which rejects when

F(Y) > F_{k−s,n−k;1−α}

is an α-test.
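To illustrate (11.6) and Theorem 11.1.4 numerically, the following sketch computes the F-statistic via the two projections, for the polynomial regression example with hypothesis β_k = 0 (the data, sizes and level are arbitrary choices; scipy is used only for the F quantile):

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
n, k = 40, 3
x = np.linspace(-1.0, 1.0, n)
X = np.column_stack([x**j for j in range(k)])   # columns phi_j(x) = x^(j-1): 1, x, x^2
S = np.eye(k)[:, :k-1]                          # hypothesis beta_k = 0, i.e. beta in Lin(S)

def proj(M):
    # projection matrix onto Lin(M)
    return M @ np.linalg.solve(M.T @ M, M.T)

P_X, P_S0 = proj(X), proj(X @ S)
beta = np.array([1.0, 0.5, 0.0])                # true beta satisfies H in this simulation
Y = X @ beta + 0.7 * rng.standard_normal(n)

s = k - 1
F = (Y @ (P_X - P_S0) @ Y / (k - s)) / (Y @ (np.eye(n) - P_X) @ Y / (n - k))
print(F, f_dist.ppf(0.95, k - s, n - k))        # reject at level 0.05 if F exceeds the quantile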

11.2 One-way layout ANOVA

Consider the model

Y_jk = µ_j + ε_jk,  k = 1, . . . , l_j,  j = 1, . . . , m,    (11.7)

where the ε_jk are i.i.d. normal noise variables and the µ_j are unknown parameters. The model is slightly more general than (10.18), since we admit a different number of observations l_j for each group j. The total number of observations then is n = Σ_{j=1}^m l_j. Clearly this is again a special case of the normal linear model (NLM). Assume that l_j > 1 for at least one j.
The groups j = 1, . . . , m are also called factors or treatments. The question of interest is "do the factors have an influence upon the measured quantity Y_jk?" This corresponds to the hypothesis H : µ_1 = . . . = µ_m. At first sight, the question appears similar to the one in bivariate regression, where one asks whether a measured regressor quantity x_i has an influence upon the regressand Y_i (the contrary is the linear hypothesis β = 0). This in turn is related to the question "are the r.v.'s Y and X correlated?". However, the difference is that in ANOVA no numerical value x_i is attached to the groups; the groups or factors are just different categories (they are qualitative in nature). Thus, even if one is willing to assume that individuals are randomly selected from one of the groups, it does not make sense to compute a correlation or regression: it is not clear which values X_i should be associated with the groups. An example is the drug testing problem, where one has two samples, one each for the old and the new drug (m = 2). For ANOVA with m groups, an example would be that j represents different social groups and Y_jk the cholesterol level, or j might represent different regions of space and Y_jk the size of the cosmic background radiation.
Recall Theorem 10.3.6, where it was shown that the two sample problem (with H : µ_1 = µ_2) can be treated by a t-test, and that it is a special case of ANOVA. The general ANOVA case for m ≥ 2 can be treated by an F-test; it suffices to formulate the test problem as a linear hypothesis in a (normal) linear model and apply Theorem 11.1.4.
To write down the F-statistic,

F(Y) = [ ‖(Π_𝒳 − Π_{S_0})Y‖² / (k − s) ] / [ ‖(I_n − Π_𝒳)Y‖² / (n − k) ],    (11.8)

it suffices to identify the linear spaces and projection operators involved. We have

β = (µ_1, . . . , µ_m)^T,   n = Σ_{j=1}^m l_j,    (11.9)

X = diag(1_{l_1}, . . . , 1_{l_m}),    (11.10)


where 1_l is the l-vector consisting of 1's, and X is the n × m matrix in which the column vectors 1_{l_1}, . . . , 1_{l_m} are arranged in a diagonal fashion (with 0's elsewhere),

Y = (Y_11, . . . , Y_{1l_1}, Y_21, . . . , Y_{ml_m})^T,    (11.11)

and the n-vector ε is formed analogously. Then

Y = Xβ + ε,

rank(X) = m, and the hypothesis H : µ_1 = . . . = µ_m can be expressed as

H : β ∈ S = Lin(1_m),

where Lin(1_m) is a one dimensional linear space (dim(S) = s = 1). Thus

S_0 = {Xβ, β ∈ S} = Lin(1_n),

which can also be seen by noting that if all µ_j are equal to one value µ, then Y_jk = µ + ε_jk.
To find the value ‖(Π_𝒳 − Π_{S_0})Y‖², note that

Π_{S_0} = 1_n (1_n^T 1_n)^{-1} 1_n^T = n^{-1} 1_n 1_n^T,
Π_{S_0} Y = 1_n (n^{-1} 1_n^T Y) = 1_n Y_··,

where Y_·· is the overall sample mean:

Y_·· = n^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} Y_jk.

Furthermore let

Y_j· = l_j^{-1} Σ_{k=1}^{l_j} Y_jk

be the mean within the j-th group. We claim that

Π_𝒳 Y = X (X^T X)^{-1} X^T Y = X (Y_1·, . . . , Y_m·)^T.    (11.12)

Indeed, X^T Y gives the m-vector of the sums Σ_{k=1}^{l_j} Y_jk for each group, and X^T X is an m × m diagonal matrix with the l_j as diagonal elements. But we can also write

Π_{S_0} Y = 1_n Y_·· = X (Y_··, . . . , Y_··)^T,

hence

Π_𝒳 Y − Π_{S_0} Y = X (Y_1· − Y_··, . . . , Y_m· − Y_··)^T,

‖(Π_𝒳 − Π_{S_0})Y‖² = Σ_{j=1}^m l_j (Y_j· − Y_··)².


Similarly we obtain from (11.12) and (11.11)

(I_n − Π_𝒳)Y = (Y_11 − Y_1·, . . . , Y_{1l_1} − Y_1·, Y_21 − Y_2·, . . . , Y_{ml_m} − Y_m·)^T,

‖(I_n − Π_𝒳)Y‖² = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Y_j·)².

Thus the F-statistic (11.8) takes the form

F(Y) = [ (m − 1)^{-1} Σ_{j=1}^m l_j (Y_j· − Y_··)² ] / [ (n − m)^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Y_j·)² ].    (11.13)

The terms

D_w = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Y_j·)²,   D_b = Σ_{j=1}^m l_j (Y_j· − Y_··)²

can be called the sum of squares within groups and the sum of squares between groups, respectively. Recall that we introduced this terminology essentially already in the two sample problem (just before Theorem 10.3.6; there we divided by n and called this "variability"). Consider also the quantity

D_0 = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Y_··)² = ‖(I_n − Π_{S_0})Y‖².

This is the total sum of squares; then n^{-1} D_0 = S²_n is the total sample variance. We have the decomposition

D_0 = D_w + D_b    (11.14)

as an immediate consequence of the identity

‖(I_n − Π_{S_0})Y‖² = ‖(I_n − Π_𝒳)Y‖² + ‖(Π_𝒳 − Π_{S_0})Y‖².

The decomposition (11.14) of the total sample variance S²_n generalizes (10.21); it gives its name to the test procedure "analysis of variance". The hypothesis H of equality of means is rejected when the between groups sum of squares D_b is too large compared to the within groups sum of squares D_w.
Note that the quantities

d_w = (n − m)^{-1} D_w,   d_b = (m − 1)^{-1} D_b

are both unbiased estimates of σ² under the hypothesis (cf. the proof of Proposition 11.1.3); they can be called the respective "mean sums of squares". The F-statistic (11.13) involves these quantities d_w, d_b, which differ only by a factor from the sums of squares D_w, D_b.
A common way of visualizing all the quantities involved is the ANOVA table:

                    sum of squares   degrees of freedom   mean sum of squares       F-value
between groups      D_b              m − 1                d_b = (m − 1)^{-1} D_b    F(Y) = d_b/d_w
within groups       D_w              n − m                d_w = (n − m)^{-1} D_w
total               D_0              n − 1
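For illustration, the one-way F-statistic (11.13) and the quantities in the table can be computed directly; a minimal sketch (group sizes, data and seed are arbitrary choices; scipy supplies the p-value):

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
groups = [rng.normal(loc=mu, scale=1.0, size=l)            # Y_jk, j = 1, ..., m
          for mu, l in [(0.0, 8), (0.3, 10), (0.1, 12)]]
m = len(groups)
n = sum(len(g) for g in groups)

grand_mean = np.concatenate(groups).mean()
Db = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between groups
Dw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within groups

db, dw = Db / (m - 1), Dw / (n - m)
F = db / dw
p_value = f_dist.sf(F, m - 1, n - m)
print(F, p_value)     # reject H: mu_1 = ... = mu_m at level alpha if p_value <= alpha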


11.3 Two-way layout ANOVA

Return to the cholesterol / social group example and suppose we have data from various countries, where it can be assumed that the general level of cholesterol varies from country to country (e.g. via different nutritional habits). Nevertheless we are still interested in the same question, namely whether the cholesterol level differs across social groups or not. This can be modeled by a "two way layout", where we allow for additional effects (the countries). It will become clear why the simpler ANOVA in the previous subsection is called "one way layout".
In the previous subsection we could have written the nonrandom group means µ_j as

µ_j = µ + δ_j,  j = 1, . . . , m,   Σ_{j=1}^m δ_j = 0.

Any vector β = (µ_1, . . . , µ_m)^T can be written uniquely in this form: set

µ = m^{-1} Σ_{j=1}^m µ_j,   δ_j = µ_j − µ.

The one way layout can be written

Y_jk = µ + δ_j + ε_jk,  k = 1, . . . , l_j,  j = 1, . . . , m,

and the hypothesis is

δ_1 = . . . = δ_m = 0.

The δ_j are called factor effects and µ is called the main effect. Then the Y_jk can be decomposed as

Y_jk = Y_·· + δ̂_j + ε̂_jk,  k = 1, . . . , l_j,  j = 1, . . . , m,    (11.15)
δ̂_j = Y_j· − Y_··,   ε̂_jk = Y_jk − Y_j·,

where the ε̂_jk are called residuals and the δ̂_j can be interpreted as estimates of the factor effects δ_j. The relation (11.15) can be written as a decomposition of the data vector Y:

Y = Π_{S_0}Y + (Π_𝒳 − Π_{S_0})Y + (I_n − Π_𝒳)Y.

The F-statistic (11.13) then takes the form

F(Y) = [ (m − 1)^{-1} Σ_{j=1}^m l_j δ̂_j² ] / [ (n − m)^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} ε̂_jk² ].

This indicates how the two way layout can be treated. The model is

Y_ijk = µ_ij + ε_ijk,  k = 1, . . . , l,  i = 1, . . . , q,  j = 1, . . . , m,

where the ε_ijk are i.i.d. normal variables with variance σ². For simplicity we assume that all groups (i, j) have an equal number of observations l. The index i is called the first factor and j is called the second factor. Again this is a normal linear model, but the matrix X has an involved form. It is somewhat laborious to work out all the projections and derived sums of squares; we therefore


forgo the projection approach and use the more elementary multiple "dot" notation. Note that the projection approach is still needed for a rigorous proof that the test statistics below have an F-distribution. The array of nonrandom mean values µ_ij can be written

µ_ij = µ + α_i + β_j + γ_ij,    (11.16)

Σ_{i=1}^q α_i = Σ_{j=1}^m β_j = 0,    (11.17)

where

µ = µ_·· = q^{-1} Σ_{i=1}^q µ_i· = m^{-1} Σ_{j=1}^m µ_·j is the global effect,
α_i = µ_i· − µ_·· is the main effect of value i of the first factor,
β_j = µ_·j − µ_·· is the main effect of value j of the second factor,
γ_ij = µ_ij − µ_i· − µ_·j + µ_·· is the interaction of value i of the first factor and value j of the second factor

(note that relations (11.16), (11.17) follow immediately from the definitions above of the quantities involved). It follows also that

Σ_{i=1}^q γ_ij = qµ_·j − qµ_·· − q(µ_·j − µ_··) = 0 for all j = 1, . . . , m,
Σ_{j=1}^m γ_ij = 0 for all i = 1, . . . , q.

Assume now that there is no interaction between the factors 1 and 2, i.e. γ_ij = 0. In this case from (11.16) we obtain

Yijk = µ+ αi + βj + εijk, k = 1, . . . , l, i = 1, . . . , q, j = 1, . . . ,m.

Set n = mql. The two hypotheses of interest are
H1 : α_1 = . . . = α_q = 0
H2 : β_1 = . . . = β_m = 0.
In our example, if factor 1 (indexed by i) is social group and factor 2 (indexed by j) is country, then H1 would be "the cholesterol level does not depend on social group, even though it may vary across countries" and H2 would be "the cholesterol level does not depend on country, even though it may vary across social groups".
The analog of (11.14) is

D_0 = D_w + D_b1 + D_b2,    (11.18)

D_b1 = lm Σ_{i=1}^q (Y_i·· − Y_···)²,   D_b2 = lq Σ_{j=1}^m (Y_·j· − Y_···)²,

D_w = Σ_{i,j,k} ε̂²_ijk,   D_0 = Σ_{i,j,k} (Y_ijk − Y_···)²,


where

ε̂_ijk = Y_ijk − Y_i·· − Y_·j· + Y_···

are the residuals. The two-way ANOVA table is

                            sum of squares   degrees of freedom   mean sum of squares         F-value
1. factor, between groups   D_b1             q − 1                d_b1 = D_b1/(q − 1)         d_b1/d_w
2. factor, between groups   D_b2             m − 1                d_b2 = D_b2/(m − 1)         d_b2/d_w
residuals (within groups)   D_w              n − m − q + 1        d_w = D_w/(n − m − q + 1)
total                       D_0              n − 1

Here the F -value in the row for the first factor is for testing H1.
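A minimal sketch of the two-way computation (arbitrary sizes and simulated data, not from the text; the residuals and sums of squares follow the formulas above, and scipy supplies the p-values):

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
q, m, l = 3, 4, 5                        # first factor, second factor, replications
n = q * m * l
Y = rng.standard_normal((q, m, l))       # Y_ijk; in this simulation both hypotheses hold

Ydot = Y.mean()                          # Y_...
Yi = Y.mean(axis=(1, 2))                 # Y_i..
Yj = Y.mean(axis=(0, 2))                 # Y_.j.

Db1 = l * m * np.sum((Yi - Ydot) ** 2)
Db2 = l * q * np.sum((Yj - Ydot) ** 2)
resid = Y - Yi[:, None, None] - Yj[None, :, None] + Ydot   # residuals eps_ijk
Dw = np.sum(resid ** 2)

db1, db2, dw = Db1 / (q - 1), Db2 / (m - 1), Dw / (n - m - q + 1)
print(db1 / dw, f_dist.sf(db1 / dw, q - 1, n - m - q + 1))   # test of H1
print(db2 / dw, f_dist.sf(db2 / dw, m - 1, n - m - q + 1))   # test of H2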

Remark 11.3.1 We briefly outline the associated projection arguments needed for the proof of the F-distributions. Let Y be the n-vector of the data Y_ijk, e.g. arranged in lexicographic fashion:

Y = (Y_111, Y_112, . . . , Y_11l, Y_121, . . . , Y_qml)^T,

let S_00 be the subspace of R^n spanned by the vector 1_n, and

S_10 = { Z : Z_ijk = α_i for some α_i with Σ_{i=1}^q α_i = 0 },
S_01 = { Z : Z_ijk = β_j for some β_j with Σ_{j=1}^m β_j = 0 }.

Note that S_10, S_01 and S_00 are mutually orthogonal in R^n: e.g. for any Z_1 ∈ S_10 and Z_2 ∈ S_01 we have Z_1^T Z_2 = 0, etc. Consider the linear space 𝒳 spanned by these three subspaces, i.e. the set of all linear combinations Σ_{r=1}^3 λ_r Z_r where Z_1 ∈ S_10, Z_2 ∈ S_01, Z_3 ∈ S_00. This can be represented as

𝒳 = { Z : Z_ijk = µ + α_i + β_j for some µ, α_i, β_j with Σ_{i=1}^q α_i = 0, Σ_{j=1}^m β_j = 0 },

or in short notation (using the orthogonal sum operation ⊕)

𝒳 = S_10 ⊕ S_00 ⊕ S_01.

To this corresponds a representation of projection operators:

Π_𝒳 = Π_{S_00} + Π_{S_10} + Π_{S_01},

where Π_{S_00} = Π_{1_n} = n^{-1} 1_n 1_n^T. The basic assumption of no interaction, γ_ij = 0 in (11.16), means that

EY ∈ 𝒳.

(If the assumption of no interaction is not made, then we can only claim EY ∈ 𝒳_0, where 𝒳_0 = { Z : Z_ijk = µ_ij for some µ_ij }.)


The basic (normal) linear model (under the assumption of no interaction) can be expressed as

L(Y) = N_n(EY, σ²I_n),  where EY ∈ 𝒳.

To write it in the form Y = Xβ + ε we would need a matrix X which spans the space 𝒳; we can avoid this here. The two-way ANOVA decomposition (11.18) can be written as

D_0 = ‖(I_n − Π_{S_00})Y‖² = ‖(I_n − Π_𝒳)Y‖² + ‖Π_{S_10}Y‖² + ‖Π_{S_01}Y‖²
    = D_w + D_b1 + D_b2.

In this linear model, the expression

d_w = (n − m − q + 1)^{-1} D_w = (n − m − q + 1)^{-1} ‖(I_n − Π_𝒳)Y‖²

is an unbiased estimator of σ², and D_w/σ² has law χ²_{n−m−q+1}. The two linear hypotheses are
H1 : α_1 = . . . = α_q = 0, or equivalently EY ∈ S_01 ⊕ S_00,
H2 : β_1 = . . . = β_m = 0, or equivalently EY ∈ S_10 ⊕ S_00.
Under H1, we have Π_{S_10} EY = 0, and hence the expression

d_b1 = (q − 1)^{-1} D_b1 = (q − 1)^{-1} ‖Π_{S_10}Y‖²

is independent of d_w and such that D_b1/σ² has law χ²_{q−1}. Thus d_b1/d_w has law F_{q−1, n−m−q+1}.

The textbook in chap. 11.3 treats the case where l = 1 and the β_j are random variables (RCB, randomized complete block design). This model is very similar to the two way layout treated here. The theory of ANOVA with its associated design problems has many further ramifications.


Chapter 12
SOME NONPARAMETRIC TESTS

Recall Remark 9.4.6: a family of probability distributions P = {P_ϑ, ϑ ∈ Θ}, indexed by ϑ, is called parametric if all ϑ ∈ Θ are finite dimensional vectors (Θ ⊆ R^k for some k); otherwise P is called nonparametric. In hypothesis testing, any hypothesis corresponds to some P, thus the terminology is extended to hypotheses. Any simple hypothesis (consisting of one probability distribution P = {Q_0}) is parametric. The understanding is that a nonparametric family P is too large to be indexed by a finite dimensional ϑ; it should be indexed by some set of functions or other objects (e.g. the associated distribution functions, or the densities if they exist, or the infinite series of their Fourier coefficients, or the probability laws themselves). Typically some restrictions are then placed on the functions involved, such as: the density is symmetric around 0, or is differentiable with derivative bounded by a constant, etc.
In the χ²-goodness-of-fit test treated in Corollary 9.4.5, we already encountered a nonparametric alternative K : Q ≠ Q_0. In a more narrow sense, a nonparametric test is one in which the hypothesis H is a nonparametric set. It was also noted that the χ²-test for goodness of fit actually tests the hypothesis on the cell probabilities p(Q) = p(Q_0), with asymptotic level α, and the set of all Q fulfilling this hypothesis is also nonparametric.
A genuinely nonparametric test was the χ²-test for independence (the χ²-test in a contingency table) treated in Section 9.6: the hypothesis consists of all joint distributions of two r.v.'s X_1, X_2 which are the product of their marginals.

12.1 The sign test

Suppose that for a pair of (possibly dependent) random variables X, Y, one is interested in the difference Z = X − Y, more precisely in the median of Z, i.e. med(Z). If X and Y represent some measurements or "treatments", then med(Z) > 0 would mean that X is "better" in some sense than Y. Assume that Z has a continuous distribution, so that P(Z = 0) = 0. If one observes n independent pairs of r.v.'s (X_i, Y_i), i = 1, . . . , n, all having the distribution of (X, Y), then the corresponding hypotheses would be
H : med(Z) = 0
K : med(Z) > 0.
A possible test statistic is

S = Σ_{i=1}^n sgn(Z_i),

where Z_i = X_i − Y_i and

sgn(x) = 1 if x > 0,  0 if x = 0,  −1 if x < 0.


It is plausible to reject H if the number of positive Z_i is too large compared to the number of negative Z_i. This is the (one sided) sign test for H. Define

S_0 = Σ_{i=1}^n 1_{(0,∞)}(Z_i);

then S = 2S_0 − n. Moreover, if p = P(Z > 0), then the statistic S_0 has a binomial distribution B(n, p). Under H we have

P(Z < 0) = P(Z > 0),

so that p = 1/2 and the distribution of S_0 under H is B(n, 1/2). Since S is a strictly monotone function of S_0, the sign test can also be based on S and the binomial distribution under H (rejection when S is too large). Note that since B(n, 1/2) is a discrete distribution, in order to achieve exact size α under the hypothesis, for any given α one has to use a randomized test in general (alternatively, for a nonrandomized α-test the size may be less than α).
This is a prototypical nonparametric test; the hypothesis H contains all distributions Q for Z which have median 0, and this is a large nonparametric class of laws. The statistic S is distribution-free under H: its law is that of 2S_0 − n with S_0 ∼ B(n, 1/2), which does not depend on Q in H.
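A minimal sketch of the (nonrandomized) one sided sign test, using the binomial law of S_0 under H (the data and level are arbitrary choices; scipy supplies the binomial distribution):

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(6)
z = rng.normal(loc=0.2, scale=1.0, size=40)   # Z_i = X_i - Y_i; here med(Z) > 0

n = len(z)
s0 = np.sum(z > 0)                            # S_0 = number of positive Z_i
p_value = binom.sf(s0 - 1, n, 0.5)            # P(S_0 >= s0) under H, i.e. under B(n, 1/2)
print(s0, p_value)                            # reject H at level alpha if p_value <= alpha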

12.2 The Wilcoxon signed rank test

The hypothesis H : med(Z) = 0 in the sign test has the disadvantage that it includes distributionsfor Z which are very skewed, in the sense e.g. that most of the positive values may be concentratedaround a large value (µ+ > 0 say) whereas the probability mass on the negative half-axis may beconcentrated around a moderate negative value (µ− < 0 say). In that case one would tend to saythat X is ”better” than Y in some sense, even though the median of Z = X − Y is still 0. Thusone formulates the hypothesis of symmetryH : P (X − Y > c) = P (Y −X > c) for all c > 0or equivalentlyH : P (Z > c) = P (Z < −c) for all c > 0.Assume that Zi, i = 1, . . . , n are i.i.d. having the distribution of Z = X − Y (a continuousdistribution). The rank of Zi among Z1, . . . , Zn denoted by Ri is defined to be the number of Zj ’ssatisfying Zj ≤ Zi, j = 1, . . . , n. The rank of |Zi| among |Z1| , . . . , |Zn| is similarly defined anddenoted by Ri. Because of the continuity assumption, we assume that no two Zi are equal andthat no Zi is zero. Define signed ranks

Si = sgn(Zi) Ri

and the Wilcoxon signed rank statistic

Wn = Σ_{i=1}^n Si.

Under the hypothesis of symmetry, about one half of the Si would be negative, so that Wn would be close to 0. Thus it seems plausible to reject H if Wn is too large (one sided test) or if |Wn| is too large (two sided test).


Lemma 12.2.1 Under the hypothesis H of symmetry, Wn has the same distribution as

Vn = Σ_{i=1}^n Mi · i    (12.1)

where the Mi are independent Rademacher random variables, i.e.

P(Mi = 1) = P(Mi = −1) = 1/2

(or Mi = 2Bi − 1 where Bi are Bernoulli B(1, 1/2)).

Proof. If Z has a symmetric distribution then sgn(Z) and |Z| are independent:

P(|Z| > c | sgn(Z) = 1) = P(|Z| > c | Z > 0) = 2P(Z > c)
= 2P(Z < −c) = P(|Z| > c | Z < 0) = P(|Z| > c | sgn(Z) = −1),

thus the conditional distribution of |Z| given sgn(Z) does not depend on sgn(Z), which means independence. Thus Wn has the same law as

V*n = Σ_{i=1}^n Mi Ri

where the Mi are i.i.d. Rademacher variables independent of |Z1|, . . . , |Zn|, hence independent of the ranks Ri. Moreover, a random permutation of (M1, . . . ,Mn) (independent of the Mi) does not change the law of the vector (M1, . . . ,Mn); since (R1, . . . , Rn) is such a random permutation of the indices (1, . . . , n), it follows that L(V*n) = L(Vn).

Note that EMi = 0, so that EWn = 0 under H. The variance of Mi is

Var(M1) = EM1^2 = (1/2)(−1)^2 + (1/2)(1)^2 = 1,

so that

Var(Wn) = Var(Vn) = Σ_{i=1}^n i^2 = n(n + 1)(2n + 1)/6 =: vn

(the last formula can easily be proved by induction). To obtain a test, we can now use a normal approximation for the law of Wn or find the quantiles of its exact law. The normal approximation for the law of Vn seems plausible since the Mi are i.i.d. with zero mean and the weights i in (12.1) are deterministic. An appropriate central limit theorem (Lyapunov CLT) gives

Wn / vn^{1/2} =⇒d N(0, 1).

The corresponding asymptotic α-test of H then rejects if Wn > zα vn^{1/2}, where zα is the upper α-quantile of N(0, 1).

For the exact distribution, define

W+n = Σ_{i : sgn(Zi)=1} Ri,   W−n = Σ_{i : sgn(Zi)=−1} Ri.


The test can also be based on W+n since

W+n + W−n = Σ_{i=1}^n i = n(n + 1)/2,
Wn = W+n − W−n = 2W+n − n(n + 1)/2.

By Lemma 12.2.1, the law of W+n coincides with that of

V+n = Σ_{i=1}^n Bi · i

where the Bi are i.i.d. Bernoulli B(1, 1/2). This distribution has been tabulated in the past (one sided critical values, without randomization, for selected values of α and n = 1, . . . , 20; cf. e.g. the table given in Rohatgi and Saleh, Introduction to Prob. and Statistics, 2nd Ed.), and it can easily be included in statistical software today. Note that the two sided critical values for Wn can be obtained from the fact that Wn has a symmetric distribution around 0.

Further justification. Recall that when the Zi are i.i.d. normal N(µ, σ2) with unknown σ2, the hypothesis of symmetry around 0 reduces to H : µ = 0. For this the t-test is available, based on T = n^{1/2} Z̄n/Sn, or when σ2 is known we could even use a Z-test based on the statistic Z0 = n^{1/2} Z̄n/σ and its normal distribution. The one sided Z-test was a UMP test against the alternative K : µ > 0, and the two sided Z- and t-tests can also be shown to have an optimality property (UMP unbiased tests). When the assumption of normality of Q = L(Zi) is not justified, for testing symmetry we could still try to use the t-test: we have

n^{1/2} Z̄n/Sn =⇒d N(0, 1)

if the second moment of Q exists. In that case we would obtain an asymptotic α-test for the hypothesis of symmetry, but this breaks down if the class of admitted distributions is too large, i.e. the second moment of Q may not exist. For instance, we may have Q(·) = Q0(· − µ) where Q0 is a Cauchy distribution (which is symmetric about 0). Symmetry of Q then means µ = 0, and T is not an appropriate test statistic (it does not provide an asymptotic α-test). In contrast, the Wilcoxon signed rank statistic Wn is distribution free under the hypothesis of symmetry: according to Lemma 12.2.1, its law does not depend on Q as long as Q is symmetric.

Nonparametric alternatives. A random variable Z1 with distribution function F1 is stochastically larger than Z2 with distribution function F2 if

P(Z1 > x) ≥ P(Z2 > x) for all x ∈ R, and there exists x0 with P(Z1 > x0) > P(Z2 > x0).

If, for distribution functions, we define the symbol F1 ≺ F2 to mean "F1 ≤ F2 and F1(x0) < F2(x0) for at least one x0", then the statement "Z1 is stochastically larger than Z2" is equivalent to

F1 ≺ F2.

Invariance considerations. Let z be a point in Rn (thought of as representing a realization of Z = (Z1, . . . , Zn)) and let π be a transformation π(z) = (τ(z1), . . . , τ(zn)) where τ is a continuous,



strictly monotone increasing and odd real valued function on R (odd means τ(x) = −τ(−x)). If Q = L(Z) is a symmetric law then the law of τ(Z) is also symmetric:

P(τ(Z) > c) = P(Z > τ^{−1}(c)) = P(Z < −τ^{−1}(c)) = P(Z < τ^{−1}(−c)) = P(τ(Z) < −c) for all c > 0.

Note that the statistic

L(Z) = (sgn(Zi), Ri)_{i=1,...,n}

is invariant under any such transformation π applied to Z. It can be shown that L is maximal invariant under the group of transformations of Rn given by all such π, i.e. any other invariant map is a function of L(Z). Thus the set of all signs sgn(Zi) and of all ranks Ri of |Zi| is a maximal invariant. This provides a justification for the use of the Wilcoxon signed rank statistic Wn.
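As a computational illustration of this section (a minimal sketch in Python, not part of the original handout; the function names and the simulated data are only for demonstration), the statistic Wn, its normal approximation based on vn, and the exact null law of W+n obtained from the representation in Lemma 12.2.1 can be computed as follows.

# Illustrative sketch (Python, not part of the handout): the Wilcoxon signed
# rank statistic Wn, its normal approximation, and the exact null law of W+n
# from the representation V+n = sum_i Bi * i with Bi ~ B(1, 1/2).
import numpy as np
from itertools import product
from scipy.stats import norm

def wilcoxon_signed_rank(z):
    """Return Wn = sum_i sgn(Zi) * Ri, where Ri is the rank of |Zi|.

    Assumes no ties and no zeros among the Zi (continuity assumption)."""
    z = np.asarray(z, dtype=float)
    ranks = np.abs(z).argsort().argsort() + 1   # Ri: ranks of |Zi|, values 1..n
    return int(np.sum(np.sign(z) * ranks))

def normal_approx_p_value(w, n):
    """One sided p-value P(Wn >= w) using Wn / vn^(1/2) ~ N(0,1) approximately."""
    v_n = n * (n + 1) * (2 * n + 1) / 6
    return norm.sf(w / np.sqrt(v_n))

def exact_null_distribution_w_plus(n):
    """Exact null law of W+n by enumerating all 2^n sign patterns (small n only)."""
    counts = {}
    for signs in product([0, 1], repeat=n):           # Bi in {0, 1}
        w_plus = sum(i * b for i, b in zip(range(1, n + 1), signs))
        counts[w_plus] = counts.get(w_plus, 0) + 1
    total = 2 ** n
    return {k: c / total for k, c in sorted(counts.items())}

# Example: simulated data, symmetric about 0 but heavy tailed
rng = np.random.default_rng(1)
z = rng.standard_cauchy(12)
w = wilcoxon_signed_rank(z)
print(w, normal_approx_p_value(w, len(z)))
print(exact_null_distribution_w_plus(5))              # exact law of W+n for n = 5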


Chapter 13
EXERCISES

13.1 Problem set H1

Exercise H1.1. Show that in model M1, with parameter p ∈ Θ = [0, 1] and risk function R(T, p) defined by

R(T, p) = Ep (T (X)− p)2 (*)

there is no estimator T (X) such that

R(T, p) = 0 for all p ∈ Θ.

Exercise H1.2. Let pi, i = 1, . . . , k be a finite subset of (0, 1) (where k ≥ 2), and consider a statistical model M1^0 with the same data X and parameter p as M1, but where p is now restricted to Θ = {p1, . . . , pk}. Assume that each parameter value pi ∈ Θ is assigned a prior probability qi > 0 where Σ_{i=1}^k qi = 1. For any estimator T, define the mixed risk

B(T) = Σ_{i=1}^k R(T, pi) qi

where R(T, p) is again the quadratic risk (*).
a) Find the form of the Bayes estimator TB (i.e. the minimizer of B(T)) and show that it is unique.
b) Show that TB is admissible in the model M1^0.

13.2 Problem set H2

Exercise H2.1. Let X1, . . . ,Xn be independent and identically distributed with Poisson law Po(λ), where λ ≥ 0 is unknown. (The Poisson law with parameter λ = 0 is defined as the one point distribution at 0, where X1 = 0 with probability one.) Find the maximum likelihood estimator (MLE) of λ (proof).

Exercise H2.2. Let X1, . . . ,Xn be independent and identically distributed such that X1 has the uniform law on the set {1, . . . , r} for some integer r ≥ 1 (i.e. Pr(X1 = k) = 1/r, k = 1, . . . , r). In the statistical model where r ≥ 1 is unknown, find the MLE of r (proof).

Exercise H2.3. Let X1, . . . ,Xn be independent and identically distributed such that X1 has the geometric law Geom(p), i.e.

Pp(X1 = k) = (1 − p)^{k−1} p, k = 1, 2, . . .

i) In the statistical model where p ∈ (0, 1] is unknown, find the MLE of p (proof).
ii) Compute (or find in a textbook) the expectation of X1 under p.


Exercise H2.4. Let X1, . . . ,Xn be independent and identically distributed such that X1 has the uniform law on the interval [0, ϑ] for some ϑ > 0 (i.e. X1 has the uniform density pϑ(x) = ϑ^{−1} 1_[0,ϑ](x). Here 1_A(x) is the indicator function of a set A: 1_A(x) = 1 if x ∈ A, 1_A(x) = 0 otherwise). In the statistical model where ϑ > 0 is unknown, find the MLE of ϑ (proof).
Hint: it can be assumed that not all Xi are 0, since this event has probability 0 under any ϑ > 0.

Exercise H2.5. Let X1, . . . ,Xn be independent and identically distributed such that X1 has the exponential law Exp(λ) on [0,∞) with parameter λ (i.e. X1 has density pλ(x) = λ^{−1} exp(−xλ^{−1}) 1_[0,∞)(x)).
i) In the statistical model where λ > 0 is unknown, find the MLE of λ (proof).
ii) Recall the expectation of X1 under λ.

13.3 Problem set H3

Exercise H3.1. Let X1, . . . ,Xn be independent and identically distributed with Poisson law Po(ϑ), where ϑ ∈ Θ = (0,∞) is unknown.
(i) Compute the Fisher information at each ϑ ∈ Θ.
(ii) Assume that condition D2 for the validity of the Cramer-Rao bound is fulfilled (for n = 1 it is shown in the handout). Find a minimum variance unbiased estimator (UMVUE) of ϑ for this model.

Exercise H3.2. Let X be an observed (integer-valued) random variable with binomial law B(n, ϑ), where ϑ ∈ Θ = (0, 1) is unknown.
(i) Compute the Fisher information at each ϑ ∈ Θ.
(ii) Clearly condition D1 for the validity of the Cramer-Rao bound is fulfilled (the sample space is finite, the probability function differentiable). Find a minimum variance unbiased estimator (UMVUE) of ϑ for this model.

Exercise H3.3. Let X1, . . . ,Xn be independent and identically distributed with geometric law Geom(ϑ^{−1}), i.e.

Pϑ(X1 = k) = (1 − ϑ^{−1})^{k−1} ϑ^{−1},

where ϑ ∈ Θ = (1,∞) is unknown. (Note that in Exercise H2.3 the family Geom(p), p ∈ (0, 1] was considered. Here we just took ϑ = p^{−1} as parameter and also excluded the value p = 1.)
(i) Compute the Fisher information at each ϑ ∈ Θ.
(ii) Assume that condition D2 for the validity of the Cramer-Rao bound is fulfilled. Find a minimum variance unbiased estimator (UMVUE) of ϑ for this model.
Hint: Here the variance of X1 is important; compute it or look it up in a book.

13.4 Problem set H4

Exercise H4.1. Let X1, . . . ,Xn be independent and identically distributed such that X1 has the exponential law Exp(λ) on [0,∞) with parameter λ (i.e. X1 has density pλ(x) = λ^{−1} exp(−xλ^{−1}) 1_[0,∞)(x)), where λ is unknown and varies in the set Θ = (0,∞).
(i) Compute the Fisher information at each λ ∈ Θ.
(ii) Assume that condition D3 for the validity of the Cramer-Rao bound is fulfilled. Find a minimum variance unbiased estimator (UMVUE) of λ for this model.


Exercise H4.2. Consider the Gaussian scale model: observations are X1, . . . ,Xn, independent and identically distributed such that X1 has the normal law N(0, σ2) with variance σ2, where σ2 is unknown and varies in the set Θ = (0,∞).
(i) Compute the Fisher information IF(σ2) for one observation X1.
(ii) Assume that condition D3 for the validity of the Cramer-Rao bound is fulfilled. Show that for n observations in the Gaussian scale model, the sample variance

S2 = n^{−1} Σ_{i=1}^n Xi^2

is a uniformly best unbiased estimator.
Hints: (a) Note that σ2 is treated as the parameter, not σ; so it may be convenient to write ϑ for σ2 when taking derivatives.
(b) Note that

Var_{σ2} X1^2 = 2σ^4.

A short proof runs as follows. We have X1 = σZ for standard normal Z, so it suffices to prove Var Z^2 = 2. Now Var Z^2 = E Z^4 − (E Z^2)^2, so it suffices to prove E Z^4 = 3. For the standard normal density ϕ we have, by partial integration, using ϕ′(x) = −xϕ(x),

∫ x^4 ϕ(x) dx = − ∫ x^3 ϕ′(x) dx = 3 ∫ x^2 ϕ(x) dx = 3.

Exercise H4.3. In the handout, sec. 8.3, a family of Gamma densities was introduced as

fα(x) = (1/Γ(α)) x^{α−1} exp(−x)

for α > 0. Define more generally, for some γ > 0,

fα,γ(x) = (1/(Γ(α) γ^α)) x^{α−1} exp(−x γ^{−1}).

The corresponding law is called the Γ(α, γ)-distribution.
(i) Let X1, . . . ,Xn be independent and identically distributed with Poisson law Po(ϑ), where ϑ ∈ Θ = (0,∞) is unknown. Assume that the results on Bayesian inference in section 8 carry over from finite to countable sample space. Show that the family Γ(α, γ), α > 0, γ > 0 is a conjugate family of prior distributions.
(ii) Show that for a r.v. U with law Γ(α, γ)

EU = αγ.

(iii) In the above Poisson model, with prior Γ(α, γ), find the posterior expectation E(ϑ|X) of ϑ and discuss its relation to the sample mean X̄n for large n (α, γ fixed).
Remark: Note that if Proposition 8.1 carries over to the case of countable sample space then E(ϑ|X) is a Bayes estimator for quadratic risk.


13.5 Problem set H5

Exercise H5.1. Consider the Gaussian location model with restricted parameter space µ ∈ Θ = [−K,K], where K > 0, sample size n = 1 and σ2 = 1. A linear estimator is of the form T(X) = aX + b where a, b are nonrandom (fixed) real numbers.
(i) Find the minimax linear estimator TLM (note that all a, b are allowed).
(ii) Show that TLM is strictly better than the sample mean X̄n = X, everywhere on Θ = [−K,K] (this implies that X is not admissible).
(iii) Show that TLM is Bayesian in the unrestricted model Θ = R for a certain prior distribution N(0, τ2), and find the τ2.

Exercise H5.2. Consider the binary channel for information transmission, as a statistical model: Θ = {0, 1}, Pϑ = B(1, pϑ) where pϑ ∈ (0, 1), ϑ = 0, 1. Assume also symmetry: p1 = 1 − p0 and 1/2 < p1. Note that for estimating ϑ, the quadratic loss coincides with the 0-1-loss: for t, ϑ ∈ Θ we have

(t − ϑ)2 = 0 if t = ϑ,  (t − ϑ)2 = 1 if t ≠ ϑ.

Consider estimators with values in Θ (note there exist only four possible estimators here, i.e. maps {0, 1} → {0, 1}: T1(x) = x, T2(x) = 1 − x, T3(x) = 0, T4(x) = 1).
(i) Find the maximum likelihood estimator of ϑ ∈ Θ.
(ii) For a prior distribution Q on Θ with q0 = Q(0), q1 = Q(1), and the above 0-1-loss, find the Bayes estimator with values in Θ. Note: the posterior expectation cannot be used since it will be between 0 and 1 in general.
(iii) Assume that q1 tends to 1. Find the value z such that if q1 > z then the Bayes estimator is T4 (disregards the data and always takes 1 as estimated value).

Exercise H5.3. Consider the binary Gaussian channel: Θ = {0, 1}, Pϑ = N(µϑ, 1) where µ0 < µ1 are some fixed values (a restricted Gaussian location model for sample size n = 1). Note that estimators with values in Θ are described by indicators of sets A ⊂ R such that T(x) = 1A(x) (i.e. A = {x : T(x) = 1}).
(i) As in H5.2.
(ii) As in H5.2.
(iii) Show that the Bayes estimator for a uniform prior (q1 = 1/2) is minimax.

13.6 Problem set H6

Exercise H6.1. Let X1, . . . ,Xn1 be independent N(µ1, σ2) and Y1, . . . , Yn2 be independent N(µ2, σ2), also independent of X1, . . . ,Xn1 (n1, n2 > 1). For each of the two samples, form the sample means X̄, Ȳ and the bias corrected sample variances

S2(1) = (n1 − 1)^{−1} Σ_{i=1}^{n1} (Xi − X̄)^2,  S2(2) = (n2 − 1)^{−1} Σ_{i=1}^{n2} (Yi − Ȳ)^2.

Consider the statistic

Z = (X̄ − Ȳ) / (σ (1/n1 + 1/n2)^{1/2}),

which is standard normal if µ1 = µ2. In a model where µ1, µ2 are unknown but σ2 is known, it obviously can be used to build a confidence interval for the difference µ1 − µ2.


For the case that in addition σ2 is unknown, find a statistic which has a t-distribution if µ1 = µ2 (this would then be called a "studentized" statistic), and find the degrees of freedom.

13.7 Problem set H7

Exercise H7.1. Let X1, . . . ,Xn1 , be independent N(µ1, σ2) and Y1, . . . , Yn2 be independent

N(µ2, σ2), also independent of X1, . . . ,Xn1 (n1, n2 > 1). In a model where these r.v.’s are observed,

and µ1, µ2 and σ2 are all unknown, find an α-test for the hypothesis

H : µ1 = µ2.

Exercise H7.2. Let zα/2,n be the upper α/2-quantile of the t-distribution with n degrees of freedom and z*α/2 the respective quantile for the standard normal distribution. Use the tables at the end of the textbook or a computer program to find
a) zα/2,n for n = 5, n = 20 and z*α/2 for the value α = 0.05,
b) the same for α = 0.01.

Exercise H7.3. In the introductory subsection 1.3.2 "Confidence statements with the Chebyshev inequality" (early in the handout, p. 10) we constructed a confidence interval for the parameter p for n i.i.d. Bernoulli observations X1, . . . ,Xn (model Md,1) of the form [X̄n − ε, X̄n + ε], where X̄n is the sample mean and ε = εn,α = 2(nα)^{−1/2}. Using the Chebyshev inequality, it was shown that this has coverage probability at least 1 − α.
(i) Use the central limit theorem

n^{1/2}(X̄n − p) =⇒d N(0, p(1 − p)) as n → ∞

and the upper bound p(1 − p) ≤ 1/4 to construct an asymptotic 1 − α confidence interval of the form [X̄n − ε*n,α, X̄n + ε*n,α] for p, involving a quantile of the standard normal distribution N(0, 1).
(ii) Show that the ratio of the lengths of the two confidence intervals, i.e. ε*n,α/εn,α, does not depend on n, and find its numerical values for α = 0.05 and α = 0.01, using the table of N(0, 1) on p. 608 of the textbook.
(iii) Use the property of the standard normal distribution function Φ(x)

1 − Φ(x) ≤ x^{−1} exp(−x^2/2)

([D] p. 108) to prove that ε*n,α/εn,α → 0 for α → 0 (n fixed).
Comment: the interval based on the normal approximation turns out to be shorter, and this effect becomes more pronounced for smaller α.

Exercise H7.4. Consider the test φ*µ0 (i.e. the test based on the t-statistic as in (8.1) of the handout, where the quantile z*α/2 of the standard normal is used in place of zα/2). On p. 94 of the handout it is argued that this is an asymptotic α-test. Show that in the Gaussian location-scale model Mc,2, for sample size n, this test is consistent as n → ∞ on the pertaining alternative

Θ1(µ0) = {(µ, σ2) : µ ≠ µ0}.

Comment: The proof of consistency of the two-sided Gauss test, which is illustrated in the figure on p. 95, is similar but more direct since σ2 is known there.
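For Exercise H7.2 above, the quantiles can be obtained from statistical software; the following minimal sketch (in Python with scipy, not part of the original handout; the function names are only for illustration) shows one way to query them.

# Illustrative sketch (Python, not part of the handout): upper quantiles of the
# t- and standard normal distributions, as asked for in Exercise H7.2.
from scipy.stats import t, norm

def upper_quantile_t(alpha_half, n):
    """Upper alpha/2-quantile z_{alpha/2, n} of the t-distribution with n df."""
    return t.ppf(1 - alpha_half, df=n)

def upper_quantile_normal(alpha_half):
    """Upper alpha/2-quantile z*_{alpha/2} of N(0, 1)."""
    return norm.ppf(1 - alpha_half)

for alpha in (0.05, 0.01):
    print(alpha,
          upper_quantile_t(alpha / 2, 5),
          upper_quantile_t(alpha / 2, 20),
          upper_quantile_normal(alpha / 2))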


13.8 Problem set E1

Exercise E1.1. (10%) Let X1, . . . ,Xn be independent identically distributed with unknown distribution Q, where it is known only that

Var(X1) ≤ K

for some known positive K. Then also µ = EX1 exists. Consider hypotheses

H : µ = µ0
K : µ ≠ µ0.

Find an α-test (exact α-test, i.e. the level is observed for every n, not just asymptotically as n → ∞). (Hint: Chebyshev inequality, p. 10 or [D], p. 222.)

Exercise E1.2. Let X1, . . . ,Xn be independent Poisson Po(λ). Consider some λ0, λ1 such that 0 < λ0 < λ1.

(i) (5%) Consider simple hypotheses

H : λ = λ0
K : λ = λ1.

Find a most powerful α-test.

Note: the distribution of any proposed test statistic can be expected to be discrete, so that a randomized test might be most powerful. For the solution, this aspect can be ignored; just indicate the statistic, its distribution under H and the type of rejection region (such as "reject when T is too large").

(ii) (10%) Consider composite hypotheses

H : λ = λ0
K : λ > λ0.

Find a uniformly most powerful (UMP) α-test.

Hint: take a solution of (i) which does not depend on λ1.

(iii) (10%) Consider composite hypotheses

H : λ ≤ λ0
K : λ > λ0.

Find a uniformly most powerful (UMP) α-test.

Hint: take a solution of (ii) and show that it preserves level α on H : λ ≤ λ0. Properties of the Poisson distribution are useful.

Exercise E1.3. (20%) Consider the Gaussian location-scale model (Model Mc,2) for sample size n, i.e. observations are i.i.d. X1, . . . ,Xn with distribution N(µ, σ2) where µ ∈ R and σ2 > 0 are unknown. For a certain σ0^2 > 0, consider hypotheses H : σ2 ≤ σ0^2 vs. K : σ2 > σ0^2.

Find an α-test with rejection region of the form (c,∞) (i.e. a one-sided test) where c is a quantile of a χ2-distribution. (Note: it is not asked to find the LR test; but the test should have level α. This includes an argument that the level is observed on all parameters in the hypothesis H : σ2 ≤ σ0^2.)

Hint: A good estimator of σ2 might be a starting point.

Exercise E1.4 (Two sample problem, F-test for variances). Let X1, . . . ,Xn be independent N(µ1, σ1^2) and Y1, . . . , Yn be independent N(µ2, σ2^2), also independent of X1, . . . ,Xn (n > 1), where


µ1, σ1^2 and µ2, σ2^2 are all unknown. Define the statistics

F = F(X,Y) = SX^2 / SY^2,    (13.1)

SX^2 = n^{−1} Σ_{i=1}^n (Xi − X̄n)^2,  SY^2 = n^{−1} Σ_{i=1}^n (Yi − Ȳn)^2

(here (X,Y) symbolizes the total sample). Define the F-distribution with k1, k2 degrees of freedom (denoted Fk1,k2) as the distribution of k1^{−1} Z1 / (k2^{−1} Z2) where the Zi are independent r.v.'s having χ2-distributions with k1 and k2 degrees of freedom, respectively.
i) (15%) Show that F(X,Y) has an F-distribution if σ1^2 = σ2^2, and find the degrees of freedom.
ii) (20%) For hypotheses H : σ1^2 ≤ σ2^2 vs. K : σ1^2 > σ2^2, find an α-test with rejection region of the form (c,∞) (i.e. a one-sided test) where c is a quantile of an F-distribution. (Note: it is not asked to find the LR test; but the test should have level α. This includes an argument that the level is observed on all parameters in the hypothesis H : σ1^2 ≤ σ2^2.)

Exercise E1.5 (F-test for equality of variances). Consider the two sample problem of exercise E1.4, but with hypotheses H : σ1^2 = σ2^2 vs. K : σ1^2 ≠ σ2^2.
i) (5%) Find the likelihood ratio test and show that it is equivalent to a test which rejects if the F-statistic (13.1) is outside a certain interval of the form [c^{−1}, c].
ii) (5%) Show that the c of i) can be chosen as the upper α/2 quantile of the distribution Fr,r for a certain r > 0.

13.9 Problem set H8

Exercise H8.1 (Exercise 8.59 e, p. 399 textbook). A famous medical experiment was conducted by Joseph Lister in the late 1800s. Mortality associated with surgery was quite high and Lister conjectured that the use of a disinfectant, carbolic acid, would help. Over a period of several years Lister performed 75 amputations with and without using carbolic acid. The data are

                      Carbolic acid used?
                      Yes    No
Patient lived?  Yes   34     19
                No     6     16

Use these data to test whether the use of carbolic acid is associated with patient mortality.

Exercise H8.2. Let X1, . . . ,Xn be independent N(µ1, 1) and Y1, . . . , Yn be independent N(µ2, 1), also independent of X1, . . . ,Xn, and consider hypotheses

H : µ1 = µ2 = 0
K : (µ1, µ2) ≠ (0, 0).

Find the likelihood ratio test and show that it is equivalent to a test based on a statistic which has a certain χ2-distribution under H (thus the critical value can be taken as a quantile of this χ2-distribution).

Exercise H8.3. Let X1, . . . ,Xn be independent Poisson Po(λ1) and Y1, . . . , Yn be independent Po(λ2), also independent of X1, . . . ,Xn. Let λ = (λ1, λ2) be the parameter vector, λ0 a particular value for this vector (with positive components), and consider hypotheses


H : λ = λ0
K : λ ≠ λ0.

Find an asymptotic α-test. Hint: find a statistic similar to the χ2-statistic in the multinomial case (Definition 9.1.2, p. 111 handout) and its asymptotic distribution under H.

Exercise H8.4 (Adapted from exercise 8.60, p. 399 textbook). Let X = (X1, . . . ,Xk) have a multinomial law Mk(n, p) with unknown p = (p1, p2, . . . , pk), where k > 2. Consider hypotheses on the first two components

H : p1 = p2
K : p1 ≠ p2.

A test that is often used, called McNemar's test, rejects H if

(X1 − X2)^2 / (X1 + X2) > χ2_{1;1−α}    (13.2)

where χ2_{1;1−α} is the lower 1 − α quantile of the χ2_1 distribution. (The textbook writes an upper α-quantile, called χ2_{1,α} there; it coincides with the lower quantile χ2_{1;1−α}.)
(i) Find the maximum likelihood estimator p̂ of the parameter p under the hypothesis.
(ii) Show that the appropriate χ2-statistic with estimated parameter p̂ (the maximum likelihood estimator under H as above), as defined in relation (9.26) on p. 127 handout, coincides with McNemar's statistic (13.2) (exact equality, not approximate with an error term).
Comment: It follows that McNemar's test is the χ2-test for this problem and is an asymptotic α-test, cf. Theorem 9.5.2, p. 127 handout.

13.10 Problem set H9

Exercise H9.1. Consider the general linear model

Y = Xβ + ε,

Eε = 0, Cov(ε) = σ2 In,

where X is an n × k-matrix with rank(X) = k. Show that β provides a best approximation to the data Y in an average sense:

E ‖Y − Xβ‖^2 = min_{γ ∈ Rk} E ‖Y − Xγ‖^2.

Exercise H9.2. Consider the bivariate linear regression model:

Yi = α+ βxi + εi, i = 1, . . . , n

where not all xi are equal, α, β are real valued and the εi are uncorrelated with variance σ2.
(i) Show that the LSE of α, β are

α̂n = Ȳn − β̂n x̄n,  β̂n = Σ_{i=1}^n (Yi − Ȳn)(xi − x̄n) / Σ_{i=1}^n (xi − x̄n)^2,    (13.3)

where x̄n is the mean of the nonrandom xi.


(ii) Find the distribution of β̂n when the εi are independent normal: L(εi) = N(0, σ2).
Hint: for (ii), a possibility is to find the projection matrix Π1 projecting onto the space Lin(1), where 1 is the n-vector consisting of 1's, and use

(Y1 − Ȳn, . . . , Yn − Ȳn)⊤ = (In − Π1)(Y1, . . . , Yn)⊤.

More elementary arguments are also possible.

Exercise H9.3. Consider the general linear model as in H9.1, but with the assumption Cov(ε) = Σ, where Σ is a known positive definite (symmetric) n × n-matrix. Find the BLUE (best linear unbiased estimator) of β, as defined in Def. 10.5.1.
Hint: Recall Lemma 6.1.7, p. 70 and the fact that Cov(Aε) = AΣA⊤ for any n × n-matrix A.

Exercise H9.4. Suppose that a r.v. Y and the random k-vector X have a joint normal distribution Nk+1(0, Σ), where Σ is a positive definite (k + 1) × (k + 1)-matrix. Write Σ in partitioned form

Σ = ( ΣXX   ΣXY
      ΣYX   σY^2 )

where

ΣXY = ΣYX⊤ = E XY

is a k-vector, σY^2 = E Y^2 and ΣXX = Cov(X). Show that

E(Y |X) = X⊤β,  where β = ΣXX^{−1} ΣXY.    (13.4)

Comment. Compare this with the form of β in the case k = 1 (i.e. β = σX^{−2} σXY as in Def. 10.1.3, p. 135 handout), and with the form of the LSE in the linear model (Theorem 10.4.2, p. 149): β̂ = (X⊤X)^{−1} X⊤Y.
Hint. Consider independent X*, ε where X* has the same law as X and L(ε) = N(0, σε^2) for some σε^2 > 0 (X* is a random k-vector, ε a random variable), and define

Y* = X*⊤β + ε

with β as above. Find a value of the variance σε^2 such that (X*, Y*) have the same joint distribution as (X, Y). This solves the problem, since then

E(Y |X) = E(Y ∗|X∗).

13.11 Problem set H10

A PRACTICE EXAM (2 1/2 hours time)

Exercise H10.1. (15 %) A company registered 100 cases within a year where some employee was missing exactly one day at work. These were distributed among the days of the week as follows:

Day   M    T    W    Th   F
No.   22   19   16   18   25

Test the hypothesis that these one day absences are uniformly distributed among the days of the week, at level α = 0.05.


(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining distribution], resulting decision.)

Exercise H10.2. (15 %) The personnel manager of a bank wants to find out whether the chance to successfully pass a job interview depends on the sex of the applicant. For 35 randomly selected applicants, 21 of which were male, the results of the interview were evaluated. It turned out that exactly 16 applicants passed the interview, 5 of which were female. Use a χ2-test in a contingency table to test whether interview result and sex are independent, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining distribution], resulting decision.)
Comment: most sources would recommend Fisher's exact test here, but this was not treated and the χ2-test is also applicable.

Exercise H10.3. (20 %) Suppose 18 wheat fields have been divided into m = 3 groups, with l1 = 5, l2 = 7 and l3 = 6 members. There are three kinds of fertilizer Φj, and group j of fields is treated with fertilizer Φj, j = 1, 2, 3. The yield results for all fields are given in the following table.

Φ1   781  655  611  789  596
Φ2   545  786  976  663  790  568  720
Φ3   696  660  639  467  650  380

Assume that these values are realizations of independent random variables Yjk with distribution N(µj, σ2), k = 1, . . . , lj, j = 1, 2, 3, where the mean values µj correspond to fertilizer Φj. Test the hypothesis that the three group means coincide, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining distribution], resulting decision.)

Exercise H10.4. (25 %) Consider a normal linear model of type NLM2 (cf. section 10.5, p. 152 handout), for dimension k = 1, i.e. observations are

Y = Xβ + ε,

where X is a nonrandom n × 1-matrix (i.e. an n-vector in this case), β is an unknown real valued parameter and

L(ε) = Nn(0, σ2In)

where σ2 > 0 is unknown. Assume also that rank(X) = 1 (identifiability condition, i.e. X ≠ 0). Consider some value β0 and hypotheses

H : β ≤ β0
K : β > β0.

Let β̂ be the LSE of β and define the statistic

Tβ0(Y) = √(n − 1) (β̂ − β0)(X⊤X)^{1/2} / (Y⊤(In − ΠX)Y)^{1/2},    (13.5)

where ΠX is the projection matrix onto the linear space Lin(X). Show that Tβ0(Y) can be used as a test statistic to construct an α-test, and indicate the distribution and the quantile used to find the rejection region.
Comment: Note that this is not a linear hypothesis on β.
Hint: in the case that all elements of X are 1 we obtain the Gaussian location-scale model (mean value β), for which the present hypothesis testing problem was discussed extensively.


Exercise H10.5. (25 %) Suppose that the random vector Z = (X,Y) has a bivariate normal distribution with EX = EY = 0 and covariance matrix given by

Var(X) = σX^2, Var(Y) = σY^2, Cov(X,Y) = ρ σX σY,

where ρ = EXY/(σX σY) is the correlation coefficient between X and Y. Suppose that Zi = (Xi, Yi), i = 1, . . . , n are i.i.d. observed random 2-vectors, each having the distribution of Z. Define the empirical covariance matrix by

SX^2 = n^{−1} Σ_{i=1}^n Xi^2,  SY^2 = n^{−1} Σ_{i=1}^n Yi^2,  SXY = n^{−1} Σ_{i=1}^n XiYi

(note that we do not use centered data Xi − X̄n etc. here for empirical variances / covariances since EX = EY = 0 is known) and the empirical correlation coefficient

ρ̂ = SXY / (SX SY)

(where SX = (SX^2)^{1/2} etc.).
Consider hypotheses

H : ρ = 0
K : ρ ≠ 0.

Define the statistic

T0(Z) = √(n − 1) ρ̂ / √(1 − ρ̂^2)    (13.6)

(where Z represents all the data Zi, i = 1, . . . , n). Show that T0(Z) can be used as a test statistic to construct an α-test, and indicate the distribution and the quantile used to find the rejection region.
Hint: Note that this is closely related to the previous exercise H10.4. In H10.4, set β0 = 0, consider the two sided problem H : β = β0 vs. K : β ≠ β0 (then H is a linear hypothesis) and look for similarities of the statistics (13.5) and (13.6). The distribution of (13.5) was found in H10.4, but now the Xi are random. How does this affect the distribution of the test statistic?
Further comment: when it is not assumed that EX = EY = 0, the definition of ρ̂ has to be modified in an obvious way, by using centered data Xi − X̄n, Yi − Ȳn, and √(n − 2) appears in place of √(n − 1). In this form the test is found in the literature.

13.12 Problem set E2

ANOTHER PRACTICE EXAM (2 1/2 hours time)

Exercise E2.1. (25 %) A course in economics was taught to two groups of students, one in a classroom situation and the other on TV. There were 24 students in each group. These students were first paired according to cumulative grade-point averages and background in economics, and then assigned to the courses by a flip of a coin (this was repeated 24 times). At the end of the course


each class was given the same final examination. Use the Wilcoxon signed rank test (level α = 0.05, normal approximation to the test statistic) to test that the two methods of teaching are equally effective, against a two-sided alternative. The differences in final scores for each pair of students, the TV student's score having been subtracted from the corresponding classroom student's score, were as follows:

14   −4   −6   −2   −1   18
 6   12    8   −4   13    7
 2    6   21    7   −2   11
−3  −14   −2   17   −4   −5

Hint: treatment of ties. Let Z1, . . . , Zn be the data. If some of the |Zi| are equal (i.e. ties occur) then they are assigned values Ri which are the averages of the ranks. Example: 4 values |Zi1|, . . . , |Zi4| have the same absolute value c and only two of the other |Zi| are smaller. The |Zi1|, . . . , |Zi4| would then occupy ranks 3, 4, 5, 6. Since they are tied, they are all assigned the average rank (3+4+5+6)/4 = 4.5. The next rank assigned is then 7 (or higher if there is another tie).
Remark: For the two-sided version of the Wilcoxon signed rank test, as described in section 12.2 of the handout, the last sentence on p. 175 should read as "The corresponding asymptotic α-test of H then rejects if |Wn| > zα/2 vn^{1/2}, where zα/2 is the upper α/2-quantile of N(0, 1)".

As a starting point, to limit the bookkeeping effort in the solution, the following table gives the ordered absolute values of the data |Zi|, starring those that were originally negative. Below each entry (every other row of the table) is the rank Ri of |Zi| (in the notation of the handout), where ties are treated as indicated:

|Zi|:  1*   2*   2*   2*   2    3*
Ri:    1    3.5  3.5  3.5  3.5  6
|Zi|:  4*   4*   4*   5*   6*   6
Ri:    8    8    8    10   12   12
|Zi|:  6    7    7    8    11   12
Ri:    12   14.5 14.5 16   17   18
|Zi|:  13   14*  14   17   18   21
Ri:    19   20.5 20.5 22   23   24

Solution.

Exercise E2.2. Consider the one-way layout ANOVA treated in handout sec. 11.2, relation (11.7):

Yjk = µj + εjk, k = 1, . . . , l, j = 1, . . . ,m, (13.7)

where the εjk are i.i.d. normal N(0, σ2) noise variables and the µj are unknown parameters, and there is an equal number of observations l > 1 for each factor j. The total number of observations is n = ml. In this case the F test given by the statistic (11.13) (handout) can be considered an average t test.
(i) (25 %) Let i and j be two different factor indices from 1, . . . ,m. Show that a t test of

H : µi = µj
K : µi ≠ µj

can be based on the statistic

Tij(Y) = (Ȳi· − Ȳj·) / (2 dw/l)^{1/2}


where

dw = (n − m)^{−1} Σ_{j=1}^m Σ_{k=1}^l (Yjk − Ȳj·)^2

is the mean sum of squares within groups. (More precisely, show that Tij has a t-distribution under H and find the degrees of freedom.)
(ii) (25 %) Show that

(1/(m(m − 1))) Σ_{j=1}^m Σ_{i=1}^m Tij^2(Y) = F(Y)

where F(Y) is the F statistic (11.13) (handout). This relation shows that F(Y) is an average of the (nonzero) Tij^2(Y).

Exercise E2.3. (25 %) Consider again the model (13.7) of exercise E2.2, with the same assumptions, but in an asymptotic framework where the number l of observations in each group tends to infinity (m stays fixed). Consider the F-statistic

F(Y) = [(m − 1)^{−1} Σ_{j=1}^m l (Ȳj· − Ȳ··)^2] / [(n − m)^{−1} Σ_{j=1}^m Σ_{k=1}^l (Yjk − Ȳj·)^2].

Find the limiting distribution, as l → ∞, of the statistic (m − 1)F(Y) under the hypothesis of equality of means H : µ1 = . . . = µm. (It follows that this distribution can be used to obtain an asymptotic α-test of H.)
Hint: Limiting distributions of other test statistics in the handout have been obtained e.g. in Theorem 7.3.8 and Theorem 9.3.3.


Chapter 14
APPENDIX: TOOLS FROM PROBABILITY, REAL ANALYSIS AND LINEAR ALGEBRA

14.1 The Cauchy-Schwarz inequality

Proposition 14.1.1 (Cauchy-Schwarz inequality) Suppose that for the random variables Yi, i = 1, 2, the second moments EYi^2 exist. Then the expectation of Y1 · Y2 exists, and

|EY1Y2| ≤ (EY1^2)^{1/2} (EY2^2)^{1/2}.

Proof. For any λ > 0,

0 ≤ (λ^{1/2} Y1 ± λ^{−1/2} Y2)^2 = λY1^2 ± 2Y1Y2 + λ^{−1}Y2^2,

thus

∓Y1Y2 ≤ (1/2)(λY1^2 + λ^{−1}Y2^2).

This proves that EY1Y2 exists and

|EY1Y2| ≤ (1/2)(λ EY1^2 + λ^{−1} EY2^2).

If both EY1^2, EY2^2 > 0 then for λ = (EY2^2)^{1/2}/(EY1^2)^{1/2} we obtain the assertion. If one of them is 0 (EY2^2 = 0, say), then by taking λ > 0 arbitrarily small, we obtain |EY1Y2| = 0.
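As a quick numerical illustration (a minimal sketch in Python, not part of the original handout; the simulated variables are only for demonstration), the inequality can be checked empirically, with expectations replaced by sample averages. Note that the inequality also holds exactly for such empirical averages, since these are expectations under the empirical distribution.

# Illustrative check (Python, not part of the handout) of the Cauchy-Schwarz
# inequality |E Y1 Y2| <= (E Y1^2)^(1/2) (E Y2^2)^(1/2), with expectations
# replaced by sample averages over a large simulation.
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
y1 = rng.normal(size=n)
y2 = 0.7 * y1 + rng.normal(size=n)       # dependent on Y1, so E Y1 Y2 != 0

lhs = abs(np.mean(y1 * y2))              # |E Y1 Y2| (empirical)
rhs = np.sqrt(np.mean(y1**2)) * np.sqrt(np.mean(y2**2))
print(lhs, rhs, lhs <= rhs)              # the bound also holds for the
                                         # empirical averages (finite sums)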

14.2 The Lebesgue Dominated Convergence Theorem

This is a result from real analysis for measure theoretic integrals, which contain both sums and integrals as special cases. We formulate here a special case relating to expectations of random variables. For a statement and proof in full generality see Durrett [D], Appendix, (5.6), p. 468.

Theorem 14.2.1 (Lebesgue) Let X be a random variable taking values in a sample space X and let rn(x), n = 1, 2, . . . be a sequence of functions on X such that rn(x) → r0(x) for all x ∈ X. Assume furthermore that there exists a function r(x) ≥ 0 such that |rn(x)| ≤ r(x) for all x ∈ X and all n (domination property), and E r(X) < ∞. Then

E rn(X)→ E r0(X), as n→∞.
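As an illustration of the theorem (a minimal sketch in Python, not part of the original handout; the choice of rn and the Monte Carlo approximation are only for demonstration), take X ~ N(0, 1), rn(x) = x^2 1{|x| ≤ n} and the dominating function r(x) = x^2 with E r(X) = 1 < ∞; then E rn(X) → E r0(X) = E X^2 = 1.

# Illustrative sketch (Python, not part of the handout): dominated convergence
# for X ~ N(0,1) and r_n(x) = x^2 * 1{|x| <= n}, which converges pointwise to
# r_0(x) = x^2 and is dominated by r(x) = x^2 with E r(X) = 1 < infinity.
# Hence E r_n(X) -> E r_0(X) = 1; here E r_n(X) is approximated by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10**6)                      # large sample approximating L(X)

def expected_r_n(n):
    r_n = np.where(np.abs(x) <= n, x**2, 0.0)   # r_n(X) evaluated on the sample
    return r_n.mean()

for n in [1, 2, 3, 4, 5]:
    print(n, expected_r_n(n))                   # values increase towards E X^2 = 1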