
INTRODUCTION TO STATISTICS

William J. Anderson
McGill University

Contents

1 Sampling Distributions
   1.1 The Basic Distributions
   1.2 Applications of these Distributions
   1.3 Order Statistics

2 Estimation
   2.1 Methods of Estimation
      2.1.1 Maximum Likelihood Estimation
      2.1.2 Method of Moments
      2.1.3 Bayesian Estimation
   2.2 Properties of Estimators
      2.2.1 Unbiasedness and Efficiency
      2.2.2 Consistency
      2.2.3 Sufficiency
   2.3 Minimum Variance Revisited

3 Confidence Intervals

4 Theory of Hypothesis Testing
   4.1 Introduction and Definitions
   4.2 How to Choose the Critical Region – Case of Simple Hypotheses
   4.3 How to Choose the Critical Region – Case of Composite Hypotheses
   4.4 Some Last Topics
      4.4.1 Large Sample Tests
      4.4.2 p-value
      4.4.3 Bayesian Tests of Hypotheses
      4.4.4 Relationship Between Tests and Confidence Sets†

5 Hypothesis Testing: Applications
   5.1 The Bivariate Normal Distribution
   5.2 Correlation Analysis
   5.3 Normal Regression Analysis†

6 Linear Models
   6.1 Regression
   6.2 Experimental Design
      6.2.1 The Completely Randomized Design
      6.2.2 Randomized Block Designs

7 Chi-Square Tests
   7.1 Tests Concerning k Independent Binomial Populations
   7.2 Chi-Square Test for the Parameters of a Multinomial Distribution
   7.3 Goodness of Fit Tests
   7.4 Contingency Tables

8 Non-Parametric Methods of Inference
   8.1 The Sign Test
   8.2 The Mann-Whitney, or U-Test
   8.3 Tests for Randomness Based on Runs

Chapter 1

Sampling Distributions

Reference: WMS 7th ed., chapter 7

1.1 The Basic Distributions.

Definition. A random variable X with density function given by

   f(x) = x^{(n/2)−1} e^{−x/2} / (2^{n/2} Γ(n/2))   if x > 0,
   f(x) = 0   otherwise,

where n ≥ 1 is an integer, is said to have a chi-square distribution with n degrees of freedom (df). Briefly, we write X ∼ χ²_n.

The chi-square distribution is just a gamma distribution with α = n/2 and β = 2. We therefore have

   E(X) = n,   Var(X) = 2n,   M_X(t) = 1/(1 − 2t)^{n/2}.

Also, (X − n)/√(2n) is approximately N(0, 1), by the CLT.

Proposition 1.1.1

(1) Let X_1, X_2, . . . , X_m be independent chi-square random variables, with n_1, n_2, . . . , n_m degrees of freedom, and let X = X_1 + · · · + X_m. Then X has the chi-square distribution with n = n_1 + · · · + n_m d.f.

(2) Suppose that X = X_1 + X_2 where X_1 and X_2 are independent and X and X_1 have distributions χ²_n and χ²_{n_1} respectively, where n_1 < n. Then X_2 ∼ χ²_{n_2} where n_2 = n − n_1.

Proof. We have

   M_X(t) = M_{X_1}(t) M_{X_2}(t) · · · M_{X_m}(t) = [1/(1 − 2t)^{n_1/2}] · · · [1/(1 − 2t)^{n_m/2}] = 1/(1 − 2t)^{n/2},

so X is as claimed. The proof of (2) is similar.

Proposition 1.1.2 Let X_1, X_2, . . . , X_n be i.i.d., each with distribution N(0, 1), and let Y = X_1² + · · · + X_n². Then Y ∼ χ²_n.

Proof. If Z ∼ N(0, 1), then

   Pr{Z² ≤ w} = Pr{−√w ≤ Z ≤ √w} = 2 Pr{0 ≤ Z ≤ √w} = (2/√(2π)) ∫_0^{√w} e^{−u²/2} du = ∫_0^w v^{−1/2} e^{−v/2} / √(2π) dv,

so that (because Γ(1/2) = √π) we have Z² ∼ χ²_1. The result then follows from Proposition 1.1.1.

Definition. Given α with 0 < α < 1, we define χ²_{α,n} to be the unique number such that

   Pr{X > χ²_{α,n}} = α,

where X is a χ² random variable with n degrees of freedom. χ²_{α,n} is called a critical value of the χ²-distribution.

Definition. A random variable T with density function given by

   f(t) = [Γ((n+1)/2) / (√(πn) Γ(n/2))] (1 + t²/n)^{−(n+1)/2},   −∞ < t < +∞,

where n ≥ 1 is an integer, is said to have the t-distribution with n degrees of freedom. Briefly, we write T ∼ t_n.

Remarks. The t density function is symmetric about 0 and very similar to the standard normal density function, except that it is lower in the middle and has fatter tails. In fact, it is easy to see that the density f(t) tends to the standard normal density as n → ∞.

Proposition 1.1.3 Let X ∼ N(0, 1) and Y ∼ χ²_n be independent. Then

   T = X / √(Y/n)

has the t-distribution with n degrees of freedom.

Definition. Given α with 0 < α < 1, we define t_{α,n} to be the unique number such that

   Pr{T > t_{α,n}} = α,

where T is a t-random variable with n degrees of freedom. t_{α,n} is called a critical value of the t-distribution.

Definition. A random variable Y with density function given by

   g(y) = [Γ((n_1+n_2)/2) / (Γ(n_1/2) Γ(n_2/2))] (n_1/n_2)^{n_1/2} y^{n_1/2 − 1} (1 + (n_1/n_2) y)^{−(n_1+n_2)/2}   if y > 0,
   g(y) = 0   if y ≤ 0,

where n_1, n_2 ≥ 1 are integers, is said to have the F-distribution with n_1, n_2 degrees of freedom. Briefly, we write Y ∼ F_{n_1,n_2}.

Proposition 1.1.4 Let X_1 ∼ χ²_{n_1} and X_2 ∼ χ²_{n_2} be independent. Then

   Y = (X_1/n_1) / (X_2/n_2)

has the F-distribution with n_1, n_2 degrees of freedom.

Corollary 1.1.5 If Y ∼ F_{n_1,n_2}, then 1/Y ∼ F_{n_2,n_1}.

Definition. Given α with 0 < α < 1, we define F_{α,n_1,n_2} to be the unique number such that

   Pr{Y > F_{α,n_1,n_2}} = α,

where Y is an F-random variable with n_1, n_2 degrees of freedom. F_{α,n_1,n_2} is called a critical value of the F-distribution.

Problem. Show that F_{1−α,n_2,n_1} = 1/F_{α,n_1,n_2}.

Solution. We have

   α = P[Y > F_{α,n_1,n_2}] = P[1/Y < 1/F_{α,n_1,n_2}] = 1 − P[1/Y > 1/F_{α,n_1,n_2}],

so P[1/Y > 1/F_{α,n_1,n_2}] = 1 − α. This gives the required result, since 1/Y ∼ F_{n_2,n_1}.
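These critical values are tabulated in WMS, but they can also be checked numerically. The following sketch uses the SciPy library (an assumption made here for illustration, not part of the notes) to compute a few critical values and verify the reciprocal identity above.

```python
# A minimal numerical check of the critical values and of the identity
# F_{1-alpha, n2, n1} = 1 / F_{alpha, n1, n2}, using scipy.stats.
from scipy import stats

alpha, n1, n2 = 0.05, 5, 8

chi2_crit = stats.chi2.ppf(1 - alpha, df=10)        # chi-square critical value, 10 d.f.
t_crit = stats.t.ppf(1 - alpha, df=10)              # t critical value, 10 d.f.
F_crit = stats.f.ppf(1 - alpha, dfn=n1, dfd=n2)     # F_{alpha, n1, n2} (upper-tail value)

print(chi2_crit, t_crit, F_crit)
# Reciprocal identity: F_{1-alpha, n2, n1} should equal 1 / F_{alpha, n1, n2}.
print(stats.f.ppf(alpha, dfn=n2, dfd=n1), 1 / F_crit)
```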

1.2 Applications of these Distributions.

The following lemma will be extremely useful throughout the course.

Lemma 1.2.1 Let x_1, . . . , x_n and y_1, . . . , y_n be two sets of n numbers (which may be the same), and let x̄ = (x_1 + · · · + x_n)/n and ȳ = (y_1 + · · · + y_n)/n. Then

   ∑_{i=1}^n (x_i − c)(y_i − d) = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) + n(x̄ − c)(ȳ − d),   (1.1)

for any numbers c and d. In the case where the two sets are the same, (1.1) becomes

   ∑_{i=1}^n (x_i − µ)² = ∑_{i=1}^n (x_i − x̄)² + n(x̄ − µ)²,   (1.2)

for any number µ.

Proof. Using the fact that ∑_{i=1}^n (x_i − x̄) = 0,

   ∑_{i=1}^n (x_i − c)(y_i − d) = ∑_{i=1}^n [(x_i − x̄) + (x̄ − c)][(y_i − ȳ) + (ȳ − d)]
      = ∑_{i=1}^n [(x_i − x̄)(y_i − ȳ) + (x_i − x̄)(ȳ − d) + (x̄ − c)(y_i − ȳ) + (x̄ − c)(ȳ − d)]
      = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) + (ȳ − d) ∑_{i=1}^n (x_i − x̄) + (x̄ − c) ∑_{i=1}^n (y_i − ȳ) + n(x̄ − c)(ȳ − d)
      = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) + n(x̄ − c)(ȳ − d).

Definition. A set of independent random variables X_1, X_2, · · · , X_n, each having the distribution function F, is said to be a simple random sample from the distribution F. Let us denote the mean and variance of F by µ and σ², and define

   X̄ = (X_1 + X_2 + · · · + X_n)/n,   s² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)²

to be the sample mean and sample variance, respectively.

Proposition 1.2.2

(1) E(X̄) = µ and Var(X̄) = σ²/n.

(2) For large n, X̄ is approximately N(µ, σ²/n).

(3) E(s²) = σ².

(4) X̄ and X_i − X̄ are uncorrelated.

Proof. (1) is obvious and (2) follows from the CLT. For (3), we have from (1.2) that

   E(s²) = (1/(n−1)) E[∑_{i=1}^n (X_i − µ)² − n(X̄ − µ)²]
         = (1/(n−1)) [∑_{i=1}^n E(X_i − µ)² − n E(X̄ − µ)²]
         = (1/(n−1)) [∑_{i=1}^n σ² − n · σ²/n] = σ².

For (4), we have

   Cov(X̄, X_i − X̄) = Cov(X̄, X_i) − Cov(X̄, X̄) = (1/n) ∑_{j=1}^n Cov(X_j, X_i) − σ²/n = Cov(X_i, X_i)/n − σ²/n = 0.

Proposition 1.2.3 Let X_1, X_2, . . . , X_n be a random sample from N(µ, σ²). Then

(1) X̄ ∼ N(µ, σ²/n).

(2) X̄ and s² are independent.

(3) (n − 1)s²/σ² ∼ χ²_{n−1}.

(4) the random variable

   (X̄ − µ)/(s/√n)

has the t-distribution with n − 1 degrees of freedom.

Proof.

(1) M_{X̄}(t) = M_{X_1+···+X_n}(t/n) = (M_{X_1}(t/n))^n = (e^{µt/n + σ²t²/2n²})^n = e^{µt + σ²t²/2n}, which is the moment generating function of N(µ, σ²/n).

(2) It can be shown that X̄ and X_i − X̄ have a bivariate normal distribution. Since they are uncorrelated, they must also be independent. Then X̄ and (X_i − X̄)² are also independent for each i, and so X̄ and ∑_{i=1}^n (X_i − X̄)² are independent.

(3) From Lemma 1.2.1, we have

   ∑_{i=1}^n ((X_i − µ)/σ)² = (n − 1)s²/σ² + ((X̄ − µ)/(σ/√n))²,

where the left-hand side is χ²_n and the last term on the right is χ²_1, and so the result follows from Proposition 1.1.1.

(4) We can write

   (X̄ − µ)/(s/√n) = [(X̄ − µ)/(σ/√n)] / √{[(n−1)s²/σ²]/(n − 1)}.

But this is of the form X/√(Y/m) where X ∼ N(0, 1) and Y ∼ χ²_m, and so the result follows by Proposition 1.1.3.
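Proposition 1.2.3 is easy to check by simulation. The sketch below (an illustration using NumPy and SciPy, not part of the original notes) draws many normal samples and compares the empirical distribution of (n − 1)s²/σ² with the χ²_{n−1} distribution.

```python
# Monte Carlo check of Proposition 1.2.3 (3): (n-1)s^2/sigma^2 ~ chi-square(n-1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)          # sample variances
w = (n - 1) * s2 / sigma**2               # pivotal quantity

# Compare a few empirical quantiles with chi-square(n-1) quantiles.
for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(w, q), stats.chi2.ppf(q, df=n - 1))
```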


1.3 Order Statistics.

Definition. Let X_1, · · · , X_n be a random sample from the continuous distribution F(x) with density f(x). For each outcome ω ∈ Ω, define

   Y_1(ω) = the smallest of X_1(ω), · · · , X_n(ω),
   Y_2(ω) = the second smallest of X_1(ω), · · · , X_n(ω),
   Y_3(ω) = the third smallest of X_1(ω), · · · , X_n(ω),
   ...
   Y_n(ω) = the largest of X_1(ω), · · · , X_n(ω).

The random variables Y_1, · · · , Y_n are called the order statistics. In particular, Y_r is called the rth order statistic.

Example. Suppose X1(ω) = 7, X2(ω) = 4, X3(ω) = −5. Then Y1(ω) = −5, Y2(ω) = 4, Y3(ω) = 7.

Proposition 1.3.1 The rth order statistic Y_r has density function given by

   g_r(y) = [n! / ((r − 1)!(n − r)!)] [F(y)]^{r−1} f(y) [1 − F(y)]^{n−r}.

Proof. For small h > 0, we have, using the trinomial distribution,

   Pr{y < Y_r ≤ y + h} = the probability that r − 1 sample values fall below y, one falls in (y, y + h], and n − r fall above y + h
      = [n! / ((r − 1)! 1! (n − r)!)] [F(y)]^{r−1} [F(y + h) − F(y)] [1 − F(y + h)]^{n−r}.

Dividing both sides by h and letting h → 0 then gives the required result.

Definition. Given a random sample X_1, · · · , X_n of size n, we define the sample median to be

   X̃ = Y_{m+1}   if n = 2m + 1,
   X̃ = (Y_m + Y_{m+1})/2   if n = 2m.

Note that X̃ is the “middle sample value” if n is odd, or the mean of the two “middle values” if n is even.

Remark. From Proposition 1.3.1 (with r = m + 1), the density function of the median for a sample of size 2m + 1 is

   g_{m+1}(y) = [(2m + 1)! / (m! m!)] [F(y)]^m f(y) [1 − F(y)]^m.

Example. For the case of a sample of size 2m + 1 from the exponential distribution with mean θ, the density function of the sample median is

   g_{m+1}(y) = [(2m + 1)! / (m! m!)] [1 − e^{−y/θ}]^m (1/θ) e^{−(m+1)y/θ}   if y > 0,
   g_{m+1}(y) = 0   if y ≤ 0.


Chapter 2

Estimation

Reference: WMS 7th ed., chapters 8,9

Statistical Inference. Let X be a random quantity whose distribution depends on a parameter θ. We do not know the “true” value of θ, only that θ belongs to the set Θ, called the parameter set. Statistical inference is concerned with using the observed value x of X to obtain information about this true value of θ. It takes basically two forms:

(1) Estimation – in which the observation x is used to come up with a plausible value for θ as the “true” value.

(2) Hypothesis Testing – in which x is used to decide between two hypotheses concerning the “true” value of θ.

During most (but not all) of this course, the classical situation will prevail. That is, X will be a random vector (X_1, X_2, · · · , X_n), where X_1, X_2, · · · , X_n are independent and identically distributed number-valued random variables, each with distribution function F_θ(x). Such an X is called a simple random sample, or just a “random sample”, taken from the distribution F_θ(x). On the other hand, θ need not be a numerical parameter. For example, θ may be in the form of a vector, such as (µ, σ²).

Definition. Let t(x) be a function of x which does not depend on θ. The random variable t(X) (e.g. t(X_1, X_2, · · · , X_n) in the case of a random sample) is called a statistic. A statistic used to estimate θ is called an estimator of θ, and is generically denoted by θ̂. The value of the random variable θ̂ is called the estimate.

For example, given a random sample X_1, X_2, · · · , X_n from a distribution with mean µ and variance σ²,

   µ̂ = X̄ = (1/n) ∑_{i=1}^n X_i   and   σ̂² = s² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)²

are commonly used estimators of µ and σ².

In §2.1 – 2.3, we study point estimation. Interval estimation is taken up in Chapter 3.

2.1 Methods of Estimation.

The problem to be considered in this section is: given a parameter θ to be estimated, how do we derive a suitable estimator of it? We will examine three methods of estimation – maximum likelihood estimation, the method of moments, and Bayesian estimation. A fourth method, that of least squares, will be used later on in the course.


2.1.1 Maximum Likelihood Estimation.

We shall suppose that the observation X is either discrete with probability function f_θ(x), or continuous with density function f_θ(x).

Definition. The function L(θ) = f_θ(x) of θ, where x is considered fixed, is called the likelihood function. The method of maximum likelihood consists of choosing as our estimate θ̂ that value of θ for which L(θ) is a maximum. In other words, θ̂ is a member of Θ such that

   L(θ̂) = sup{L(θ) | θ ∈ Θ}.

That is, we take for θ̂ that value of θ which is most likely to have produced the observation x. Then θ̂ is called the maximum likelihood estimator (MLE) of θ.

ℓ(θ) = log L(θ) is called the log likelihood function. It should be noted that since log x is a strictly increasing function of x, ℓ(θ) attains its maximum value at the same θ as does L(θ).

Example. Suppose θ = 0 or 1/2, and f_θ(x) is given in the following table.

            θ = 0    θ = 1/2
   x = 1      0        .1
   x = 2      1        .9

If x = 1 is observed, then θ̂ = 1/2. If x = 2 is observed, then θ̂ = 0.

Problem. Suppose f_θ(x) is given in the following table. Find MLE’s of θ. This example shows that MLE’s are not unique. (A later example involving the uniform distribution shows they may not even exist.)

            θ = 0    θ = 1/4    θ = 1/2
   x = 1      .1       .2         0
   x = 2      .4       .4         .3
   x = 3      .5       .4         .7

Example. A manufacturer of lightbulbs knows that the lifetime of his bulbs is a random variable with exponential density function

   g_θ(t) = (1/θ) e^{−t/θ}   if t > 0,   and 0 if t ≤ 0,

where θ > 0. He wants to estimate θ.

(1) Suppose he selects at random n bulbs, sets them burning, and separately measures their lifetimes T_1, T_2, · · · , T_n. Then the observation is X = (T_1, T_2, · · · , T_n), and is a random sample from the above density function. The likelihood function is then the joint density function of T_1, · · · , T_n, which, because of independence, is

   L(θ) = f_θ(t_1, t_2, · · · , t_n) = g_θ(t_1) · · · g_θ(t_n) = (1/θ^n) e^{−n t̄/θ}   if t_1, . . . , t_n > 0,   and 0 otherwise.

The log likelihood function is then ℓ(θ) = −n log θ − n t̄/θ. Differentiating, we get

   ℓ′(θ) = −n/θ + n t̄/θ² = (n/θ)(t̄/θ − 1),

which is > 0 if θ < t̄, = 0 if θ = t̄, and < 0 if θ > t̄, so that the MLE must be θ̂ = t̄.


(2) Consider the following alternative sampling scheme. He sets n bulbs burning, and then c units of time later, observes the number x of bulbs which are still burning. We shall determine the MLE of θ in this case. The number X of bulbs still burning at time c is a binomial random variable with parameters n and p = e^{−c/θ}, and therefore the likelihood function is

   L(θ) = (n choose x) p^x (1 − p)^{n−x}   where p = e^{−c/θ}.

The log likelihood function is ℓ(θ) = log (n choose x) + x log p + (n − x) log(1 − p), so

   ℓ′(θ) = (x/p) p′(θ) − [(n − x)/(1 − p)] p′(θ) = p′(θ) (x − np)/(p(1 − p)).

Since p′(θ) > 0, ℓ′(θ) = 0 when p = x/n. Since p = e^{−c/θ}, solving for θ gives

   θ̂ = −c / log(x/n).

We leave it to the reader to verify that this critical value θ̂ actually gives a maximum.

(3) As a third alternative, suppose he sets n bulbs burning, and observes the time Y until the first bulb burns out. That is, if T_1, . . . , T_n denote the lifetimes, then Y is just the first order statistic for the sample. Since

   P{Y > y} = P{T_1 > y, . . . , T_n > y} = P{T_1 > y} · · · P{T_n > y} = e^{−ny/θ}

for y > 0, the likelihood function is

   L(θ) = (n/θ) e^{−ny/θ}   if y > 0,   and 0 otherwise.

It is an easy matter to show that the MLE of θ is θ̂ = ny.
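For scheme (2), the closed form θ̂ = −c/log(x/n) can be checked against a direct numerical maximization of the log likelihood. The sketch below does this with SciPy; the numbers n, c and x are invented purely for illustration.

```python
# Numerical check of the MLE in sampling scheme (2): theta_hat = -c / log(x/n).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, c, x = 50, 100.0, 20        # hypothetical: 20 of 50 bulbs still burning at time c

def neg_log_lik(theta):
    p = np.exp(-c / theta)     # survival probability at time c
    return -binom.logpmf(x, n, p)

closed_form = -c / np.log(x / n)
numerical = minimize_scalar(neg_log_lik, bounds=(1.0, 1000.0), method="bounded").x
print(closed_form, numerical)  # the two estimates should agree closely
```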

Example. Given a random sample X_1, · · · , X_n from the distribution N(µ, σ²), find the maximum likelihood estimators of µ and σ².

Solution. Here we have θ = (µ, σ²). The likelihood function is the joint density function of X_1, · · · , X_n, and because of independence, we have

   L(µ, σ²) = [1/(σ√(2π))] e^{−((x_1 − µ)/σ)²/2} · · · [1/(σ√(2π))] e^{−((x_n − µ)/σ)²/2} = [1/(σ^n (2π)^{n/2})] e^{−(1/2σ²) ∑_{i=1}^n (x_i − µ)²}.

The log likelihood function is

   ℓ(µ, σ²) = −(n/2) log σ² − (n/2) log 2π − (1/2σ²) ∑_{i=1}^n (x_i − µ)²,

so

   ∂ℓ/∂µ = (1/σ²) ∑_{i=1}^n (x_i − µ) = (n/σ²)(x̄ − µ)

and

   ∂ℓ/∂σ² = −n/(2σ²) + (1/2σ⁴) ∑_{i=1}^n (x_i − µ)² = −(n/2σ⁴) [σ² − (1/n) ∑_{i=1}^n (x_i − µ)²].


Setting these partial derivatives equal to zero and solving simultaneously, we find that

   µ̂ = x̄   and   σ̂² = (1/n) ∑_{i=1}^n (x_i − x̄)².

We shall leave it to the reader to check that ℓ(µ, σ²) actually achieves its maximum at these values, and that therefore these are the MLE’s.

Remark. z = f(x, y) has a maximum at (x_0, y_0) if (1) f_1 = f_2 = 0 at (x_0, y_0), (2) f_{11}f_{22} − f_{12}f_{21} > 0 at (x_0, y_0), and (3) f_{11} < 0 at (x_0, y_0).
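As a quick illustration (not from the notes), the MLEs µ̂ = x̄ and σ̂² = (1/n)∑(x_i − x̄)² can be computed directly from simulated data; note that σ̂² uses the divisor n, not n − 1.

```python
# MLEs for a normal sample: mu_hat is the sample mean, sigma2_hat uses divisor n.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=3.0, size=500)   # simulated sample with mu = 10, sigma = 3

mu_hat = data.mean()
sigma2_hat = data.var(ddof=0)      # divisor n   -> the MLE
s2 = data.var(ddof=1)              # divisor n-1 -> the unbiased sample variance s^2

print(mu_hat, sigma2_hat, s2)
```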

Example. Given a random sample X_1, · · · , X_n from the uniform density

   g_θ(x) = 1/θ   if 0 ≤ x ≤ θ,   and 0 otherwise,

find the MLE of θ.

Solution. The likelihood function is

   L(θ) = 1/θ^n   if 0 ≤ x_1, x_2, · · · , x_n ≤ θ,   and 0 otherwise
        = 1/θ^n   if 0 ≤ y_1 < y_n ≤ θ,   and 0 otherwise,

where y_r denotes the rth order statistic. The figure below shows a sketch of the graph of L(θ) versus θ, from which we see that the MLE of θ is θ̂ = y_n = max(x_1, · · · , x_n).

[Figure: graph of L(θ) versus θ — L(θ) is zero for θ < y_n, jumps to its maximum value 1/y_n^n at θ = y_n, and decreases thereafter.]

Note that calculus would not work in this case since the function L(θ) is not differentiable at y_n. Also note that if the definition of g_θ(x) is changed to

   g_θ(x) = 1/θ   if 0 ≤ x < θ,   and 0 otherwise,

then L(θ) becomes

   L(θ) = 1/θ^n   if 0 ≤ y_1 < y_n < θ,   and 0 otherwise,

so that no MLE can exist.

Proposition 2.1.1 (Invariance Property for MLE’s) If θ̂ is an MLE of θ, and if τ is some function defined on Θ, then τ(θ̂) is an MLE of τ(θ).

Proof. First suppose τ is a one-to-one function. Let η = τ(θ) and η̂ = τ(θ̂), and let g_η(x) be the likelihood function of x using the parameter η (so that g_{τ(θ)}(x) = f_θ(x)). Then

   g_{η̂}(x) = g_{τ(θ̂)}(x) = f_{θ̂}(x) = sup_θ f_θ(x) = sup_η g_η(x).

If τ is not one-to-one, it is not clear what is meant by the MLE of τ(θ). We therefore proceed as follows: we define the induced likelihood function L*(η) = sup_{θ: τ(θ)=η} L(θ), and say that η̂ is an MLE of τ(θ) if η̂ is a maximum of L*(η). Then

   L*(τ(θ̂)) = sup_{θ: τ(θ)=τ(θ̂)} L(θ) = L(θ̂) = sup_θ L(θ) = sup_η sup_{θ: τ(θ)=η} L(θ) = sup_η L*(η),

so τ(θ̂) is an MLE of τ(θ).

2.1.2 Method of Moments.

This method is based on the strong law of large numbers: if X_1, X_2, X_3, . . . is a sequence of i.i.d. random variables with finite kth moment µ′_k = E(X^k), then

   (1/n) ∑_{i=1}^n X_i^k −→ µ′_k   as n → ∞.

Thus if X_1, · · · , X_n is a random sample and n is large, and if we put

   m_k = (1/n) ∑_{i=1}^n X_i^k,

then we have m_k ≈ µ′_k, k ≥ 1.

The method of moments consists of solving as many of the equations

   m_k = µ′_k,   k ≥ 1,

starting with the case k = 1, as are necessary to identify the unknown parameters.

Example. Suppose X_1, · · · , X_n is a random sample from an exponential distribution with mean θ. Then µ′_1 = θ and m_1 = X̄, so the moment estimator is θ̂ = X̄.

Example. Suppose X_1, · · · , X_n is a random sample from a gamma distribution with parameters α and β. Find moment estimators of α and β.

Solution. We have µ′_1 = αβ and µ′_2 = αβ² + α²β². Hence m_1 = α̂β̂ and m_2 = α̂β̂²(1 + α̂). Solving these two equations for α̂ and β̂, we find

   α̂ = m_1² / (m_2 − m_1²),   β̂ = (m_2 − m_1²) / m_1.

2.1.3 Bayesian Estimation.

Reference: WMS 7th ed., chapter 16

In Bayesian estimation, we assume that the parameter θ is actually a random variable with a distribution to be called the prior distribution. We also have in mind a certain “loss” function L(θ, θ̂) which specifies the loss or penalty when the true parameter value is θ and our estimate of it is θ̂. Examples of possible loss functions are L(θ, θ̂) = (θ − θ̂)² (the squared loss function), and L(θ, θ̂) = |θ − θ̂|.

Definition. The Bayes’ estimator of θ is that estimator θ̂ = t(X) for which the mean loss

   E[L(θ, t(X))]

is a minimum. It can be shown that for the squared loss function, the Bayes’ estimator is given by θ̂ = t(X) where

   t(x) = E(θ | X = x),   (2.1)

where X is the observation. This is a result of a proposition given in the 357 supplement.

The conditional expectation in (2.1) is calculated as follows. Let h(θ) denote the prior density (or probability) function of θ. The conditional density (or probability) function f(x|θ) of X given θ is the likelihood function f_θ(x). Then the conditional density (or probability) function of θ given X = x is

   φ(θ|x) = f_θ(x) h(θ) / g(x)   (2.2)

where g(x) is the marginal density (probability) function of X. φ(θ|x) is called the posterior density (probability) function of θ. Finally, E(θ | X = x) is computed as the mean of the posterior density function.

Example 1. Estimate the parameter θ of a binomial distribution given the number X of successes in n trials, and given that the prior distribution of θ is a beta distribution with parameters α and β; that is,

   h(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}   if 0 < θ < 1,   and 0 otherwise.

Solution. We shall need the fact that the mean of such a beta density h(θ) is α/(α + β). The likelihood function is

   f_θ(x) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n,

and the joint density function of X and θ is therefore

   f(x, θ) = f_θ(x) h(θ) = [Γ(α + β)/(Γ(α)Γ(β))] (n choose x) θ^{α+x−1} (1 − θ)^{β+n−x−1}.

Integrating, we find the marginal probability function of X to be

   g(x) = ∫_0^1 f(x, θ) dθ = [Γ(α + β)/(Γ(α)Γ(β))] (n choose x) ∫_0^1 θ^{α+x−1} (1 − θ)^{β+n−x−1} dθ = [Γ(α + β)/(Γ(α)Γ(β))] (n choose x) Γ(α + x)Γ(β + n − x)/Γ(α + β + n).

Hence the posterior density of θ is

   φ(θ|x) = f(x, θ)/g(x) = [Γ(α + β + n)/(Γ(α + x)Γ(β + n − x))] θ^{α+x−1} (1 − θ)^{β+n−x−1},   0 < θ < 1,

which is another beta density, this time with parameters α′ = α + x and β′ = β + n − x. The Bayes estimate is therefore the mean of this density, namely

   θ̂ = E(θ | X = x) = (α + x)/(α + β + n).
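A short sketch of this beta–binomial update (illustrative only; the prior parameters and the data below are invented) shows how the posterior mean combines the prior mean with the sample proportion.

```python
# Beta-binomial Bayes estimate: the posterior is Beta(alpha + x, beta + n - x),
# and the Bayes estimator under squared loss is its mean (alpha + x)/(alpha + beta + n).
alpha, beta = 2.0, 3.0      # hypothetical prior Beta(2, 3); prior mean = 0.4
n, x = 25, 17               # hypothetical data: 17 successes in 25 trials

prior_mean = alpha / (alpha + beta)
sample_proportion = x / n
posterior_mean = (alpha + x) / (alpha + beta + n)

# The posterior mean always lies between the prior mean and x/n.
print(prior_mean, sample_proportion, posterior_mean)
```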


Notice how both the information in the prior and the observation have been used to estimate θ. Also note that (2.2) can be written as

   φ(θ|x) = K f_θ(x) h(θ)   (2.3)

where K is a normalization constant depending on x. The point is that the distribution on the right-hand side of (2.3) might be easily recognizable, and then φ(θ|x) (or even E(θ | X = x)) could perhaps be written down directly, without going through the fuss of determining g(x).

Example 2. Estimate the parameter θ of a uniform distribution on (0, θ), on the basis of a single observation from that distribution, and given that the prior distribution of θ is the Pareto distribution with density function

   h(θ) = α θ_0^α / θ^{α+1}   if θ ≥ θ_0,   and 0 if θ < θ_0,

where θ_0 > 0 and α > 0.

Solution. The mean of the Pareto law is

   ∫_{−∞}^{+∞} θ h(θ) dθ = ∫_{θ_0}^∞ α θ_0^α / θ^α dθ = α θ_0/(α − 1)   if α > 1,   and +∞ if α ≤ 1.

From (2.3), we have

   φ(θ|x) = K f_θ(x) h(θ) = K · (1/θ) · α θ_0^α / θ^{α+1}   if 0 < x < θ and θ ≥ θ_0,   and 0 otherwise.

Thus the posterior distribution is again Pareto, but with parameters θ_0 ∨ x and α + 1. It follows that the Bayes’ estimator is

   E(θ | X = x) = (α + 1) max{θ_0, x} / α.

Example 3. Estimate the mean µ of a normal distribution with known variance σ², on the basis of the mean X̄ of a random sample of size n, and given that the prior distribution of µ is N(µ_0, σ_0²).

Solution. After carrying out the details, we find that φ(µ|x̄) ∼ N(µ_1, σ_1²), where

   µ_1 = (n x̄ σ_0² + µ_0 σ²)/(n σ_0² + σ²),   1/σ_1² = n/σ² + 1/σ_0².

Hence the Bayesian estimator of µ is

   E(µ | X̄ = x̄) = µ_1 = (n x̄ σ_0² + µ_0 σ²)/(n σ_0² + σ²).

Remark. The method of this section cannot be used to generate estimators in the “classical” situation where θ is not random.

2.2 Properties of Estimators.

We have seen in §2.1 ways of generating estimators. Now how do we choose the appropriate one to use? There are many criteria for “goodness” of estimators. In this section we will look at only four – unbiasedness, efficiency, consistency, and sufficiency.


2.2.1 Unbiasedness and Efficiency.

Definition. An estimator θ̂ of θ is unbiased if E(θ̂) = θ. The bias of an estimator θ̂ is B(θ̂) = E(θ̂) − θ. The mean square error of θ̂ is MSE(θ̂) = E(θ̂ − θ)². Observe that

   MSE(θ̂) = E[(θ̂ − E(θ̂)) + (E(θ̂) − θ)]² = E[(θ̂ − E(θ̂))² + 2(θ̂ − E(θ̂))B(θ̂) + [B(θ̂)]²] = Var(θ̂) + 2B(θ̂)E[θ̂ − E(θ̂)] + [B(θ̂)]²,

so that

   MSE(θ̂) = Var(θ̂) + [B(θ̂)]².

The standard error of θ̂ is the standard deviation of θ̂.

Definition. Given two estimators θ̂_1 and θ̂_2 of θ, we say θ̂_1 is relatively more efficient than θ̂_2 if E(θ̂_1 − θ)² ≤ E(θ̂_2 − θ)². The ratio

   E(θ̂_2 − θ)² / E(θ̂_1 − θ)²

is called the relative efficiency of θ̂_1 with respect to θ̂_2.

Observe that when θ̂_1 and θ̂_2 are unbiased for θ, θ̂_1 is relatively more efficient than θ̂_2 if Var(θ̂_1) ≤ Var(θ̂_2). Obviously unbiasedness and efficiency are two desirable properties of estimators. A good strategy for choosing an estimator among the many available might then be as follows. We agree to restrict ourselves to unbiased estimators, and then among all unbiased estimators of θ, to choose the most efficient one. Such an estimator is called a minimum variance unbiased estimator (MVUE). We remark, though, that among all unbiased estimators of θ, there may not be one with minimum variance. On the other hand, there may be more than one. The following theorem helps very much in verifying whether a given unbiased estimator is an MVUE.

Theorem 2.2.1 (The Cramér–Rao inequality) Let θ̂ = t(X) be an estimator of θ, and let f_θ(x) denote the likelihood function. Then

   Var(θ̂) ≥ |φ′(θ)|² / E[(∂ log f_θ(X)/∂θ)²],   (2.4)

where φ(θ) = E[t(X)] = ∫ t(x) f_θ(x) dx is the mean of the estimator (so that φ(θ) = θ if t(X) is unbiased). When X is a random sample X_1, X_2, · · · , X_n from a density or probability function g_θ(x), (2.4) becomes

   Var(θ̂) ≥ |φ′(θ)|² / (n E[(∂ log g_θ(X_1)/∂θ)²]).   (2.5)

If θ̂ is unbiased (so φ′(θ) = 1) and equality holds in (2.4) (or in (2.5)), then θ̂ is an MVUE.

Proof. We shall apply the inequality |Cov(U, V)|² ≤ Var(U)Var(V) to the random variables U = θ̂ and V = (∂/∂θ) log f_θ(X). Using the fact that

   [(∂/∂θ) log f_θ(x)] · f_θ(x) = (∂/∂θ) f_θ(x),

we have

   E(V) = E[(∂/∂θ) log f_θ(X)] = ∫ (∂/∂θ) log f_θ(x) · f_θ(x) dx = ∫ (∂/∂θ) f_θ(x) dx = (∂/∂θ) ∫ f_θ(x) dx = 0   (2.6)


and

   E(UV) = E[θ̂ (∂/∂θ) log f_θ(X)] = ∫ t(x) (∂/∂θ) log f_θ(x) · f_θ(x) dx = ∫ t(x) (∂/∂θ) f_θ(x) dx = (∂/∂θ) ∫ t(x) f_θ(x) dx = φ′(θ),

so that Cov(θ̂, (∂/∂θ) log f_θ(X)) = φ′(θ). Thus

   |φ′(θ)|² = |Cov(θ̂, (∂/∂θ) log f_θ(X))|² ≤ Var(θ̂) Var((∂/∂θ) log f_θ(X))   (2.7)

and therefore

   Var(θ̂) ≥ |φ′(θ)|² / Var((∂/∂θ) log f_θ(X)).   (2.8)

(2.4) follows from (2.8), by virtue of (2.6). If X is a random sample, then f_θ(x) = f_θ(x_1, x_2, . . . , x_n) = g_θ(x_1) · · · g_θ(x_n), so

   ∂ log f_θ(x)/∂θ = ∑_{i=1}^n ∂ log g_θ(x_i)/∂θ

and then

   Var(∂ log f_θ(X)/∂θ) = ∑_{i=1}^n Var(∂ log g_θ(X_i)/∂θ) = n Var(∂ log g_θ(X_1)/∂θ) = n E[(∂ log g_θ(X_1)/∂θ)²],

where we used the fact that the variance of a sum of independent random variables is the sum of the variances. Hence (2.5) follows from (2.8) and (2.6) again.

Remarks.

(1) If equality holds in (2.7), θ̂ is said to be an efficient estimator of θ.

(2) The quantity

   I(θ) = E[(∂ log f_θ(X)/∂θ)²] = Var((∂/∂θ) log f_θ(X))

in the denominator of (2.4) is called the Fisher information of the observation X.

(3) Recall that equality holds in (2.7) if and only if θ̂ = t(X) and (∂/∂θ) log f_θ(X) are linearly related with probability 1; that is, if

   P[(∂/∂θ) log f_θ(X) = a(θ) t(X) + b(θ)] = 1,

which means the likelihood function must be of the form

   f_θ(x) = e^{w(θ)t(x) + B(θ) + H(x)} = h(x) c(θ) e^{w(θ)t(x)},

where w(θ) = ∫ a(θ) dθ. Density or probability functions that have this form are said to be of exponential type.

The exact definition of an exponential family is given at the end of §2.2.

Example. Let X_1, · · · , X_n be a random sample from N(µ, σ²). Show that the sample mean X̄ is an MVUE of µ.

Solution. We already know that X̄ is unbiased for µ. Hence we check for equality in (2.5). g(x_1) is the usual normal density, so log g(x_1) = −log(σ√(2π)) − (x_1 − µ)²/2σ². Hence

   (∂/∂µ) log g(x_1) = (x_1 − µ)/σ²,

so that

   n E[(∂ log g(X_1)/∂µ)²] = (n/σ²) E[((X_1 − µ)/σ)²] = n/σ².

On the other hand, Var(X̄) = σ²/n, so in fact we do have equality in (2.5).

Example. In a sequence of n Bernoulli trials with probability θ of success, we observe the total number X of successes. Show that θ̂ = X/n is an MVUE for θ.

Solution. Obviously θ̂ is unbiased. On the one hand we have Var(θ̂) = θ(1 − θ)/n. On the other hand,

   (∂/∂θ) log f_θ(x) = (∂/∂θ) log[(n choose x) θ^x (1 − θ)^{n−x}] = (∂/∂θ)[log (n choose x) + x log θ + (n − x) log(1 − θ)] = x/θ − (n − x)/(1 − θ) = (x − nθ)/(θ(1 − θ)),

and so

   E[(∂ log f_θ(X)/∂θ)²] = E[((X − nθ)/(θ(1 − θ)))²] = n/(θ(1 − θ)).

Hence we have equality in (2.4).
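The Fisher information n/(θ(1 − θ)) and the bound θ(1 − θ)/n can be confirmed by simulation; the sketch below (illustrative only, with arbitrary n and θ) estimates the variance of θ̂ = X/n and compares it with the Cramér–Rao bound.

```python
# Simulation check that theta_hat = X/n attains the Cramer-Rao bound theta(1-theta)/n.
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 40, 0.3, 200_000

X = rng.binomial(n, theta, size=reps)
theta_hat = X / n

empirical_var = theta_hat.var()
cramer_rao_bound = theta * (1 - theta) / n   # 1/I(theta), with I(theta) = n/(theta(1-theta))
print(empirical_var, cramer_rao_bound)       # these should nearly coincide
```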

2.2.2 Consistency.

Definition. Let X_1, · · · , X_n be a random sample from F_θ. The estimator θ̂ = t(X_1, · · · , X_n) is consistent for θ if θ̂ → θ in probability as n → ∞; that is,

   Pr{|θ̂ − θ| > ε} → 0 as n → ∞

for every ε > 0.

Theorem 2.2.2 θ̂ is consistent for θ if

(1) θ̂ is unbiased for θ,

(2) Var(θ̂) → 0 as n → ∞.

Proof. By Chebyshev’s inequality, we have

   Pr{|θ̂ − θ| > ε} = Pr{|θ̂ − E(θ̂)| > ε} ≤ Var(θ̂)/ε² → 0 as n → ∞.

Example. Show that for random samples from a normal distribution N(µ, σ²), the sample variance s² is consistent for σ².

Solution. We have shown that s² is unbiased. Since (n − 1)s²/σ² has a chi-square distribution with n − 1 degrees of freedom, it has variance 2(n − 1), and so

   Var(s²) = Var[(σ²/(n − 1)) · ((n − 1)s²/σ²)] = (σ⁴/(n − 1)²) · 2(n − 1) = 2σ⁴/(n − 1) → 0

as n → ∞.


2.2.3 Sufficiency.

A “good” estimator θ̂ should utilize all the information about θ in the observation, as opposed to an estimator which does not. This motivates the following definition.

Definition. The statistic θ̂ = t(X) is said to be sufficient for θ if the conditional density (probability) function f(x | t(X) = w) of X given t(X) does not depend on θ for any w for which it is well-defined.

Remarks.

(1) The interpretation of the definition is that θ̂ is sufficient if, once we know the value of θ̂, the remaining information in the observation says nothing about θ. θ̂ contains in itself all the relevant information about θ.

(2) For example, suppose we have data x_1, · · · , x_n from a random sample from N(µ, σ²). We would think that the sample mean x̄ would contain all the information about µ, and that we could throw away the data. In other words, x̄ would seem to be sufficient for µ.

(3) Let φ(θ̂) be a one-to-one function of θ̂. If θ̂ is sufficient for θ, then

   f(x | φ(θ̂) = θ′) = f(x | θ̂ = φ^{−1}(θ′))

does not depend on θ, so φ(θ̂) is also sufficient for θ.

Example. Suppose we have a sequence of n Bernoulli trials, each with probability θ of resulting in success. Define X_i = 1 if the ith trial results in success, and 0 otherwise. Then X_1, X_2, · · · , X_n is a random sample from a Bernoulli distribution. Let Y = ∑_{i=1}^n X_i be the number of successes in the sample. Then

   f(x_1, · · · , x_n | Y = y) = Pr{X_1 = x_1, · · · , X_n = x_n; Y = y} / Pr{Y = y} = Pr{X_1 = x_1, · · · , X_n = x_n} / Pr{Y = y}
      = Pr{X_1 = x_1} · · · Pr{X_n = x_n} / Pr{Y = y} = θ^y (1 − θ)^{n−y} / [(n choose y) θ^y (1 − θ)^{n−y}] = 1/(n choose y)

if ∑_{i=1}^n x_i = y, and 0 otherwise. In either case, it does not depend on θ, so Y is sufficient for θ. By remark (3), θ̂ = Y/n is also sufficient for θ.

Theorem 2.2.3 (Neyman–Pearson Factorization Criterion) θ̂ = t(X) is sufficient for θ if and only if the density (probability) function f_θ(x) of X can be factored as

   f_θ(x) = g_θ[t(x)] h(x)   (2.9)

where g_θ[t(x)] depends on θ and on x only through t(x), and h(x) depends on x but not on θ.

Proof. We give the proof only in the case where X is discrete.

Suppose t(X) is sufficient. Define g_θ[w] = P_θ[t(X) = w]. If g_θ[t(x)] > 0, then h(x) = Pr{X = x | t(X) = t(x)} is well defined and does not depend on θ, and we have

   f_θ(x) = Pr{X = x} = Pr{X = x, t(X) = t(x)} = Pr{X = x | t(X) = t(x)} Pr{t(X) = t(x)} = h(x) g_θ[t(x)],

so (2.9) holds. If g_θ[t(x)] = 0, then f_θ(x) = Pr{X = x} ≤ P[t(X) = t(x)] = 0, so f_θ(x) = 0. Once again, (2.9) holds with both sides zero.

Conversely, suppose that the factorization in (2.9) holds, and let w be such that f(x | t(X) = w) exists (i.e. P[t(X) = w] > 0). If t(x) ≠ w, then

   f(x | t(X) = w) = P[X = x, t(X) = w] / P[t(X) = w] = 0,

so does not depend on θ. If t(x) = w, then

   P[t(X) = w] = ∑_{z ∈ t^{−1}(w)} P[X = z] = ∑_{z ∈ t^{−1}(w)} g_θ[t(z)] h(z) = ∑_{z ∈ t^{−1}(w)} g_θ[w] h(z) = g_θ[w] ∑_{z ∈ t^{−1}(w)} h(z),

and so

   f(x | t(X) = w) = P[X = x, t(X) = w] / P[t(X) = w] = P[X = x] / P[t(X) = w] = g_θ[t(x)] h(x) / (g_θ[w] ∑_{z ∈ t^{−1}(w)} h(z)) = h(x) / ∑_{z ∈ t^{−1}(w)} h(z),

which does not depend on θ.

Example. Suppose X_1, · · · , X_n is a random sample from N(µ, σ²).

(1) Suppose σ² is known. Show that X̄ is sufficient for µ.

(2) Suppose both µ and σ² are unknown. Show that (X̄, s²) is sufficient for θ = (µ, σ²).

Solution. We use the identity

   ∑_{i=1}^n (x_i − µ)² = ∑_{i=1}^n (x_i − x̄)² + n(x̄ − µ)²

and the factorization theorem. We have

(1)

   f(x_1, · · · , x_n) = [1/(σ√(2π))^n] e^{−(1/2σ²) ∑_{i=1}^n (x_i − µ)²} = [1/(σ√(2π))^n] e^{−(n/2σ²)(x̄ − µ)²} · e^{−(1/2σ²) ∑_{i=1}^n (x_i − x̄)²} = g_µ(x̄) h(x_1, · · · , x_n),

and so X̄ is sufficient for µ.

(2)

   f(x_1, · · · , x_n) = [1/(σ√(2π))^n] e^{−(n/2σ²)(x̄ − µ)² − (n−1)s²/2σ²} = g_{µ,σ²}(x̄, s²) · 1,

so (X̄, s²) is sufficient for (µ, σ²).

Remarks.

(1) Suppose θ̂ = t(X) is sufficient for θ. If a unique maximum likelihood estimator of θ exists, it will be a function of θ̂. This follows from the factorization f_θ(x) = g_θ[t(x)] h(x).

(2) Suppose that the likelihood function is expressible in the form

   f_θ(x) = h(x) c(θ) e^{w(θ)t(x)}.

Then the factorization criterion applies with

   g_θ[t(x)] = c(θ) e^{w(θ)t(x)}.

Hence t(X) is a sufficient statistic.


Examples.

(1) The exponential density

   f(x) = (1/θ) e^{−x/θ}   if x > 0,   and 0 otherwise,

where θ > 0, is of exponential type. Here, we have

   h(x) = I_{(0,∞)}(x),   c(θ) = w(θ) = 1/θ,   t(x) = −x.

(2) The binomial probability function

   f_θ(x) = (n choose x) θ^x (1 − θ)^{n−x} = (n choose x)(1 − θ)^n e^{x log(θ/(1−θ))},   x = 0, 1, . . . , n;  0 < θ < 1,

is of exponential type, where

   h(x) = (n choose x),   c(θ) = (1 − θ)^n,   w(θ) = log(θ/(1 − θ)),   t(x) = x.

Here is the general definition of an exponential family.

Definition. A family {f_θ(x), θ ∈ Θ} of density or probability functions is called an exponential family if it can be expressed as

   f_θ(x) = h(x) c(θ) e^{∑_{i=1}^k w_i(θ) t_i(x)},

where h(x) ≥ 0 and c(θ) ≥ 0 for all x and θ; t_1(x), . . . , t_k(x) are real valued functions of the observation x (which itself may be a point of some arbitrary space) which do not depend on θ, and w_1(θ), . . . , w_k(θ) are real valued functions of the parameter θ ∈ Θ (again where Θ may be some arbitrary space) which do not depend on x.

Using the Neyman–Pearson factorization criterion, we see that the statistic t(X) = (t_1(X), . . . , t_k(X)) is sufficient for θ.

Example. The normal distribution N(µ, σ²) with density function

   f_{µ,σ²}(x) = [1/(σ√(2π))] e^{−((x−µ)/σ)²/2} = [e^{−µ²/2σ²}/(σ√(2π))] e^{−x²/2σ² + µx/σ²},

where µ and σ² are unknown, is of exponential type, with

   h(x) ≡ 1,   c(µ, σ²) = e^{−µ²/2σ²}/(σ√(2π)),   w(µ, σ²) = (1/2σ², µ/σ²),   t(x) = (−x², x).

2.3 Minimum Variance Revisited.

References: WMS 7th ed., section 9.5, and the appendix to this chapter.


Definition. Let X and Y be random variables and suppose f_{Y|X}(y|x) is the conditional density (or conditional probability) function of Y given that X = x. Then if x is such that f_{Y|X}(y|x) is defined (that is, if f_X(x) > 0), we define

   E(φ(Y) | X = x) = ∫_{−∞}^∞ φ(y) f_{Y|X}(y|x) dy   if Y is a continuous random variable,
   E(φ(Y) | X = x) = ∑_y φ(y) f_{Y|X}(y|x)   if Y is a discrete random variable,

to be the conditional expectation of φ(Y) given that X = x. Temporarily let h(x) = E(φ(Y) | X = x). Then we define E(φ(Y) | X) = h(X) to be the conditional expectation of φ(Y) given X.

Notice that one gets E(φ(Y) | X) by replacing x in the expression for E(φ(Y) | X = x) by X. Thus whereas E(φ(Y) | X = x) is a function of the numerical variable x, E(φ(Y) | X) is a random variable.

Proposition 2.3.1 Let X and Y be random variables and assume Y has finite mean and variance. Then E[E(Y|X)] = E(Y) and Var(E(Y|X)) ≤ Var(Y), with equality if and only if P{Y = E(Y|X)} = 1.

Theorem 2.3.2 (Rao–Blackwell) Suppose T = t(X) is a sufficient statistic for θ. Let U = u(X) be an unbiased estimator of θ, and define φ(w) = E(U | T = w). Then φ does not depend on θ, so that φ(T) is a statistic. φ(T) is unbiased for θ and Var(φ(T)) ≤ Var(U). If U is MVUE, then P{U = φ(T)} = 1.

Proof. By sufficiency, f(x | T = w) does not depend on θ, so φ(w) = ∫ u(x) f(x | T = w) dx, being an expected value computed using f(x | T = w), does not, either. We have φ(T) = E(U | T), so Eφ(T) = E[E(U | T)] = E(U) = θ. Finally, we have Var(φ(T)) = Var(E(U | T)) ≤ Var(U) by the previous proposition. If U is minimum variance, we have equality.

Thus, assuming there exists a sufficient statistic, we can always reduce variance by conditioning on it, and an MVUE must be a function of the sufficient statistic.

Definition. A family {g_θ(y), θ ∈ Θ} of densities or probability functions is called complete if

   E_θ h(Y) = 0 for all θ ∈ Θ

implies that P_θ{h(Y) = 0} = 1 for all θ ∈ Θ. Here, P_θ is the probability corresponding to the density (or probability) function g_θ(y).

Example.† Suppose that Y has the binomial distribution

   g_θ(y) = (n choose y) θ^y (1 − θ)^{n−y},   y = 0, 1, . . . , n,

where 0 ≤ θ ≤ 1. Then for θ ∈ [0, 1),

   E_θ h(Y) = ∑_{y=0}^n h(y) (n choose y) θ^y (1 − θ)^{n−y} = (1 − θ)^n ∑_{y=0}^n h(y) (n choose y) (θ/(1 − θ))^y.

Putting φ = θ/(1 − θ), we see that if E_θ h(Y) = 0 for all θ ∈ [0, 1), then

   ∑_{y=0}^n h(y) (n choose y) φ^y = 0

for all φ > 0, which implies that h(y) = 0 for all y = 0, 1, . . . , n. Hence the binomial family {g_θ, θ ∈ [0, 1]} is complete.

Definition. A statistic Y = t(X) is complete if the family {g_θ(y) | θ ∈ Θ} of distributions of t(X) is complete.


Remark. Suppose that the statistic T = t(X) is complete, and that φ_1(T) and φ_2(T) are unbiased for θ. Then

   E_θ[φ_1(T) − φ_2(T)] = E_θ[φ_1(T)] − E_θ[φ_2(T)] = θ − θ = 0

for all θ ∈ Θ, so by completeness, φ_1(T) ≡ φ_2(T) in the sense that P_θ{φ_1(T) = φ_2(T)} = 1 for all θ ∈ Θ.

Proposition 2.3.3 (Lehmann–Scheffé) Suppose that T is a complete sufficient statistic for θ, and that θ̂ = φ(T) is unbiased for θ. Then θ̂ is the unique MVUE of θ.

Proof. T must be of the form T = t(X). Suppose that U is any unbiased estimator of θ, and define φ_2(w) = E(U | T = w). By the Rao–Blackwell theorem, V = φ_2(T) is an unbiased estimator of θ with Var(V) ≤ Var(U). Moreover, by the previous remark, we must have θ̂ ≡ V, and so Var(θ̂) = Var(V) ≤ Var(U). Hence θ̂ is an MVUE.

Suppose W is another MVUE. Then by the remark following the Rao–Blackwell theorem, W must be of the form W = ρ(T) for some function ρ(t). Once again, by the remark, we have W ≡ θ̂.

Example. Let X_1, . . . , X_n be a random sample from the Bernoulli distribution with parameter θ, and let X = X_1 + · · · + X_n. We know that X is sufficient for θ, and also, from the above example, complete. Then X̄ = X/n, being unbiased for θ, must be the unique MVUE.

Example. Let X_1, . . . , X_n be a random sample from the exponential distribution with density

   f(x) = (1/θ) e^{−x/θ}   if x > 0,   and 0 otherwise.

Suppose we want to estimate θ². We know that X̄ is sufficient for θ. It can also be shown that X̄ is complete. We have

   E(X̄²) = Var(X̄) + (E X̄)² = θ²/n + θ² = ((n + 1)/n) θ²,

so (n/(n + 1)) X̄² is the MVUE of θ².
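A quick simulation (illustrative only, not from the notes) confirms that (n/(n + 1)) X̄² is unbiased for θ² while X̄² itself overshoots by the factor (n + 1)/n.

```python
# Check that (n/(n+1)) * Xbar^2 is unbiased for theta^2 in the exponential model.
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 4.0, 10, 200_000

xbar = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print(np.mean(xbar**2), (n + 1) / n * theta**2)        # E(Xbar^2) = ((n+1)/n) theta^2
print(np.mean(n / (n + 1) * xbar**2), theta**2)        # the corrected estimator is unbiased
```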

Remark. The Lehmann–Scheffé theorem is very useful for finding MVUEs. However, the verification that a given statistic is complete can be quite difficult. It so happens that members of an exponential family are complete as well as sufficient. Most (if not all) of the distributions in WMS are from exponential families. Hence their approach in setting a problem to find an MVUE is to require the reader to prove the statistic is sufficient and to find a function of it which is unbiased. Completeness is swept under the rug.


Chapter 3

Confidence Intervals

Reference: WMS 7th ed., chapter 8

In the previous chapter, we discussed point estimation. For example, the sample mean X̄ is a point estimate of the population mean µ. We do not expect X̄ to coincide exactly with the true value of µ, only to be “close” to it. It would be desirable to have some idea of just how close our estimate is to the true value, namely to be able to say that

   x̄ − ε < µ < x̄ + ε   (3.1)

for some ε > 0. This is not possible, but we can say that (3.1) holds with a certain degree of confidence, which we shall now make clear.

Definition. Let θ be a parameter of a given distribution, and let 0 < α < 1. Let t_1(X) and t_2(X) be two statistics such that

(1) t_1(X) ≤ t_2(X),

(2) Pr{t_1(X) < θ < t_2(X)} = 1 − α.

Let θ_1 and θ_2 be values of t_1(X) and t_2(X) respectively. Then

   θ_1 < θ < θ_2

is called a (1 − α) × 100% confidence interval (or interval estimate) for θ.

We are now going to give several examples of standard confidence intervals. All of them will be derived by the use of a pivot.

Definition. A pivot is a random variable of the form g(X, θ) whose distribution does not depend on θ, and where the function g(x, θ) is for each x a monotonic function of the parameter θ.

Example 1. Random sample from N(µ, σ²), σ² known. Confidence interval for µ. We use the pivot

   (X̄ − µ)/(σ/√n),

which has the distribution N(0, 1). We have

   1 − α = Pr{−z_{α/2} < (X̄ − µ)/(σ/√n) < z_{α/2}} = Pr{X̄ − z_{α/2} σ/√n < µ < X̄ + z_{α/2} σ/√n}

after some rearrangement (the lower and upper limits here playing the roles of t_1(X) and t_2(X)), and so a (1 − α) × 100% confidence interval for µ is

   x̄ − z_{α/2} σ/√n < µ < x̄ + z_{α/2} σ/√n.   (3.2)


Remarks. If n is large, say n ≥ 25, we may invoke the CLT to find that the pivot has (approximately) the N(0, 1) distribution regardless of the population distribution. Hence the confidence interval is valid for any population if n is large. Moreover, it is likely that σ will be unknown. If n is large, we may replace σ in (3.2) by s.

Length of C.I. Let

   LCL = x̄ − z_{α/2} σ/√n,   UCL = x̄ + z_{α/2} σ/√n

be the lower and upper confidence limits, respectively. Then

   L = UCL − LCL = 2 z_{α/2} σ/√n

is the length of the confidence interval. Observe that

(1) L varies inversely as the square root of the sample size.

(2) the larger the degree 1 − α of confidence, the bigger L is.

The above formula for length can be inverted to give

   √n = 2 z_{α/2} σ/L,

allowing us to compute the sample size required to achieve a certain length of confidence interval. If you don’t know what σ is, you could use Range ≈ 4σ.

Example. (8.71, p. 424) We want to estimate µ to within 2 with a 95% c.i., so we want n such that L = 4. From past experience, we can take σ = 10. Hence

   √n = 2 × 1.96 × 10/4 ≈ 10,

so n ≥ 100.
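The z-interval and the sample-size formula are easy to script. The sketch below (using SciPy for the critical value; the sample mean used at the end is invented for illustration) reproduces the calculation.

```python
# z confidence interval for mu (sigma known) and the sample size needed for length L.
import numpy as np
from scipy import stats

alpha, sigma = 0.05, 10.0
z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2} = 1.96

# Sample size so that the 95% interval has length L = 4:
L = 4.0
n_required = (2 * z * sigma / L) ** 2
print(n_required)                          # 96.04, so n >= 97; the notes round 9.8 up to 10, giving n >= 100

# Interval from a hypothetical sample mean xbar = 52.3 with n = 100:
xbar, n = 52.3, 100
half_width = z * sigma / np.sqrt(n)
print(xbar - half_width, xbar + half_width)
```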

Interpretation of C.I. Suppose samples are repeatedly drawn and 90% confidence intervals constructed from them.

[Figure: many confidence intervals plotted against a horizontal line at the true value µ; most of the intervals cover µ, a few do not.]

In the long run, 90% of such c.i.’s so constructed will contain the true value of µ.


Example 2. Random sample from N(µ, σ²), σ² unknown. Confidence interval for µ. We use the pivot

   (X̄ − µ)/(s/√n),

which has the t-distribution with n − 1 d.f. We have

   1 − α = Pr{−t_{α/2,n−1} < (X̄ − µ)/(s/√n) < t_{α/2,n−1}},

and just as in example 1, a (1 − α) × 100% confidence interval for µ is

   x̄ − t_{α/2,n−1} s/√n < µ < x̄ + t_{α/2,n−1} s/√n.   (3.3)

Example 3. Random sample from N(µ, σ²), µ unknown. Confidence interval for σ². We use the pivot

   (n − 1)s²/σ²,

which has the χ²-distribution with n − 1 d.f. We have

   1 − α = Pr{χ²_{1−α/2,n−1} < (n − 1)s²/σ² < χ²_{α/2,n−1}},

and in the same way as above, a (1 − α) × 100% confidence interval for σ² is

   (n − 1)s²/χ²_{α/2,n−1} < σ² < (n − 1)s²/χ²_{1−α/2,n−1}.   (3.4)

Numerical Example. (8.114, 8.115, p. 439) Given a random sample 785, 805, 790, 793, 802 from N(µ, σ²),

(1) find a 90% c.i. for µ.

(2) find a 90% c.i. for σ².

Solution. We have n = 5, x̄ = 795, s = 8.34, α/2 = .05.

(1) t_{α/2,n−1} = t_{.05,4} = 2.132. Substituting into (3.3), the 90% c.i. for µ is

   795 ± 2.132 × 8.34/√5 = 795 ± 7.95,   or (787.05, 802.95).

(2) χ²_{.95,4} = .710721 and χ²_{.05,4} = 9.48773. Substituting into (3.4), the 90% c.i. for σ² is

   4 × 8.34²/9.48773 ≤ σ² ≤ 4 × 8.34²/.710721,   that is, 29.30 ≤ σ² ≤ 391.15.   (3.5)

Note also that a 90% c.i. for σ is 5.41 ≤ σ ≤ 19.78.
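These two intervals can be reproduced with SciPy; the sketch below recomputes them from the raw data.

```python
# 90% confidence intervals for mu and sigma^2 from the sample 785, 805, 790, 793, 802.
import numpy as np
from scipy import stats

data = np.array([785, 805, 790, 793, 802], dtype=float)
n, xbar, s2 = len(data), data.mean(), data.var(ddof=1)
alpha = 0.10

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)             # t_{.05,4} = 2.132
print(xbar - t_crit * np.sqrt(s2 / n), xbar + t_crit * np.sqrt(s2 / n))

chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)             # chi^2_{.95,4} = 0.7107
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)         # chi^2_{.05,4} = 9.4877
print((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)     # roughly 29.3 and 391.2, as in (3.5)
```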


Example 4. Large sample confidence interval for the parameter p of a binomial distribution. Let X be the number of successes in n trials, where n is large. We use the pivot

   (X − np)/√(np(1 − p)),

which because of the CLT has approximately the N(0, 1) distribution. We have

   1 − α = Pr{−z_{α/2} < (X − np)/√(np(1 − p)) < z_{α/2}} = Pr{(X − np)² < z²_{α/2} n p(1 − p)}
         = Pr{(n² + n z²_{α/2}) p² − (z²_{α/2} n + 2nX) p + X² < 0} = Pr{p_1(X) < p < p_2(X)},

where p_1(x) and p_2(x) are the roots of the quadratic, namely

   p_1(x), p_2(x) = [x + z²_{α/2}/2 ∓ z_{α/2} √(z²_{α/2}/4 + x − x²/n)] / (n + z²_{α/2}).   (3.6)

Thus our confidence interval is p_1(x) < p < p_2(x). However, since n is large, we can divide the top and bottom of (3.6) by n to find the approximate (1 − α) × 100% confidence interval for p

   p̂ − z_{α/2} √(p̂(1 − p̂)/n) < p < p̂ + z_{α/2} √(p̂(1 − p̂)/n),   (3.7)

where p̂ = x/n.

An Easier Derivation. We have

   1 − α = Pr{−z_{α/2} < (X − np)/√(np(1 − p)) < z_{α/2}} = Pr{−z_{α/2} √(np(1 − p)) < X − np < z_{α/2} √(np(1 − p))}
         = Pr{−z_{α/2} √(np(1 − p)) < np − X < z_{α/2} √(np(1 − p))}
         = Pr{X − z_{α/2} √(np(1 − p)) < np < X + z_{α/2} √(np(1 − p))}
         = Pr{X/n − z_{α/2} √(p(1 − p)/n) < p < X/n + z_{α/2} √(p(1 − p)/n)},

so, estimating p(1 − p) under the square root by p̂(1 − p̂), a (1 − α) × 100% c.i. for p is

   p̂ − z_{α/2} √(p̂(1 − p̂)/n) < p < p̂ + z_{α/2} √(p̂(1 − p̂)/n),

where p̂ = x/n.
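Both forms of the interval are easy to compare numerically. The sketch below computes the exact-quadratic interval (3.6) and the large-sample interval (3.7) for made-up data (n and x are invented for illustration).

```python
# Compare the quadratic interval (3.6) with the large-sample interval (3.7).
import numpy as np
from scipy import stats

n, x, alpha = 100, 37, 0.05
z = stats.norm.ppf(1 - alpha / 2)
p_hat = x / n

# (3.7): simple large-sample interval.
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half, p_hat + half)

# (3.6): roots of the quadratic in p.
centre = x + z**2 / 2
spread = z * np.sqrt(z**2 / 4 + x - x**2 / n)
print((centre - spread) / (n + z**2), (centre + spread) / (n + z**2))
```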

Example 5. Confidence interval for the difference µ_1 − µ_2 of means of two normal distributions N(µ_1, σ_1²) and N(µ_2, σ_2²). We shall assume we have independent random samples of sizes n_1 and n_2 from these distributions, and that σ_1² and σ_2² are known. The pivot is

   [X̄_1 − X̄_2 − (µ_1 − µ_2)] / √(σ_1²/n_1 + σ_2²/n_2)

and has the distribution N(0, 1). In the usual way, we derive

   x̄_1 − x̄_2 − z_{α/2} √(σ_1²/n_1 + σ_2²/n_2) < µ_1 − µ_2 < x̄_1 − x̄_2 + z_{α/2} √(σ_1²/n_1 + σ_2²/n_2).

The same remarks apply here as in example 1.


Example 6. Small sample confidence interval for the difference µ_1 − µ_2 of means of two normal distributions N(µ_1, σ_1²) and N(µ_2, σ_2²). We shall assume we have independent random samples of sizes n_1 and n_2 from these distributions, and that σ_1² and σ_2² are unknown. However, a technical assumption we shall have to make is that σ_1² = σ_2² = σ², say. The pivot is

   [X̄_1 − X̄_2 − (µ_1 − µ_2)] / √(s_p²/n_1 + s_p²/n_2),

where

   s_p² = [(n_1 − 1)s_1² + (n_2 − 1)s_2²] / (n_1 + n_2 − 2)

is the pooled sample variance. Note that

   (n_1 + n_2 − 2) s_p²/σ² = (n_1 − 1)s_1²/σ_1² + (n_2 − 1)s_2²/σ_2²

and so has a χ²-distribution with n_1 + n_2 − 2 degrees of freedom. Using this information, our pivot can be written as

   { [X̄_1 − X̄_2 − (µ_1 − µ_2)] / √(σ²/n_1 + σ²/n_2) } / √{ [(n_1 + n_2 − 2)s_p²/σ²] / (n_1 + n_2 − 2) }

and so has the t-distribution with n_1 + n_2 − 2 degrees of freedom. The rest is as usual, and we find that a (1 − α) × 100% confidence interval is

   x̄_1 − x̄_2 − t_{α/2,n_1+n_2−2} s_p √(1/n_1 + 1/n_2) < µ_1 − µ_2 < x̄_1 − x̄_2 + t_{α/2,n_1+n_2−2} s_p √(1/n_1 + 1/n_2).   (3.8)

Numerical Example. (No. 8.120, p. 440)

                                  Method 1   Method 2
   No. of children in group          11         14
   x̄                                 64         69
   s²                                52         71

Solution. We have

   s_p² = [(10 × 52) + (13 × 71)] / 23 = 62.74,   t_{.025,23} = 2.069,

so substituting into (3.8), a 95% c.i. for µ_1 − µ_2 is

   64 − 69 ± 2.069 √(62.74 (1/11 + 1/14)),   which is −5 ± 6.60.
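The pooled-variance interval is reproduced below (SciPy supplies t_{.025,23}); only the summary statistics from the table are needed.

```python
# 95% pooled-variance t interval for mu1 - mu2 from the summary statistics above.
import numpy as np
from scipy import stats

n1, xbar1, s2_1 = 11, 64.0, 52.0
n2, xbar2, s2_2 = 14, 69.0, 71.0
alpha = 0.05

df = n1 + n2 - 2
sp2 = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df          # pooled variance, 62.74
t_crit = stats.t.ppf(1 - alpha / 2, df=df)              # 2.069

half = t_crit * np.sqrt(sp2 * (1 / n1 + 1 / n2))
print(xbar1 - xbar2 - half, xbar1 - xbar2 + half)       # about -11.60 and 1.60, i.e. -5 +/- 6.60
```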

Example 7. Confidence interval for the ratio σ_1²/σ_2² of variances of two normal distributions N(µ_1, σ_1²) and N(µ_2, σ_2²). We shall assume we have independent random samples of sizes n_1 and n_2 from these distributions, and that µ_1 and µ_2 are unknown. The pivot is

   (s_2²/σ_2²) / (s_1²/σ_1²),

which has the F-distribution with n_2 − 1, n_1 − 1 degrees of freedom. In the usual way, we find the confidence interval to be

   (s_1²/s_2²) F_{1−α/2,n_2−1,n_1−1} < σ_1²/σ_2² < (s_1²/s_2²) F_{α/2,n_2−1,n_1−1}.

Sometimes the fact that

   F_{1−α/2,n_2−1,n_1−1} = 1/F_{α/2,n_1−1,n_2−1}

is used in this c.i.

Example 8. Large sample confidence interval for the difference p_1 − p_2 of parameters of two independent binomial random variables. Let X_1 and X_2 be independent binomial random variables with parameters n_1, p_1 and n_2, p_2. Define p̂_1 = x_1/n_1 and p̂_2 = x_2/n_2. The pivot

   [p̂_1 − p̂_2 − (p_1 − p_2)] / √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)

has (approximately) the distribution N(0, 1), and we find

   1 − α = Pr{p̂_1 − p̂_2 − z_{α/2} √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2) < p_1 − p_2 < p̂_1 − p̂_2 + z_{α/2} √(p_1(1 − p_1)/n_1 + p_2(1 − p_2)/n_2)}.

By estimating p_1 and p_2 under the square root by p̂_1 and p̂_2, we obtain the confidence interval as

   p̂_1 − p̂_2 − z_{α/2} √(p̂_1(1 − p̂_1)/n_1 + p̂_2(1 − p̂_2)/n_2) < p_1 − p_2 < p̂_1 − p̂_2 + z_{α/2} √(p̂_1(1 − p̂_1)/n_1 + p̂_2(1 − p̂_2)/n_2).

Example 9. Confidence interval for the mean θ of an exponential distribution. Suppose we have a random sample X_1, · · · , X_n from the exponential distribution with mean θ. We take as pivot the random variable

   (2/θ) ∑_{i=1}^n X_i,

which has the χ² distribution with 2n degrees of freedom (check its moment generating function). We therefore get the confidence interval

   2∑_{i=1}^n x_i / χ²_{α/2,2n} < θ < 2∑_{i=1}^n x_i / χ²_{1−α/2,2n}.

Extra Notes – Large Sample Confidence Intervals. Here, we assume our estimator θ̂ is such that for large sample size n,

   (θ̂ − θ)/σ_θ̂

has approximately the distribution N(0, 1). Here, σ_θ̂ is the standard error (the standard deviation of θ̂). Then we can write

   1 − α = P[−z_{α/2} ≤ (θ̂ − θ)/σ_θ̂ ≤ z_{α/2}] = P[−z_{α/2} σ_θ̂ ≤ θ̂ − θ ≤ z_{α/2} σ_θ̂] = P[θ̂ − z_{α/2} σ_θ̂ ≤ θ ≤ θ̂ + z_{α/2} σ_θ̂],

resulting in the (1 − α) × 100% c.i.

   θ̂ − z_{α/2} σ_θ̂ ≤ θ ≤ θ̂ + z_{α/2} σ_θ̂.


Bayesian Credible Sets. The Bayesian analog of a classical confidence interval is called a credible set.

Definition. A (1 − α) × 100% credible set for θ is a subset C of Θ such that

   1 − α ≤ P(C|x) = ∫_C φ(θ|x) dθ   (continuous case),   or   1 − α ≤ P(C|x) = ∑_{θ∈C} φ(θ|x)   (discrete case).

Obviously, there can be many such sets. One usually looks for the one that has minimal length.


Chapter 4

Theory of Hypothesis Testing

Reference: WMS 7th ed., chapter 10

4.1 Introduction and Definitions.

We have an observation X from a distribution F belonging to a family 𝓕 of distributions. Let 𝓕_0 ⊂ 𝓕, and let 𝓕_1 = 𝓕 \ 𝓕_0. Then based on X, we want to decide whether F ∈ 𝓕_0 or F ∈ 𝓕_1; that is, we wish to decide which of the hypotheses

   H_0 : F ∈ 𝓕_0
   H_1 : F ∈ 𝓕_1

is true. H_0 is called the null hypothesis, and H_1 the alternative hypothesis. If 𝓕, 𝓕_0, and 𝓕_1 are parametrized as 𝓕 = {F_θ, θ ∈ Θ}, 𝓕_0 = {F_θ, θ ∈ Θ_0}, and 𝓕_1 = {F_θ, θ ∈ Θ_1}, where Θ_0 ∩ Θ_1 = ∅ and Θ_0 ∪ Θ_1 = Θ, these two hypotheses can be (and usually are) equivalently written as

   H_0 : θ ∈ Θ_0
   H_1 : θ ∈ Θ_1.

The only way to base a decision on X is to choose a subset C ⊂ R_X, called the critical region (or region of rejection) for the test, with the understanding that if X ∈ C, then H_0 will be rejected (and H_1 accepted), and if X ∉ C, then H_0 will be accepted (and H_1 rejected).

Because X is random, it is possible that the value x of X might lie in C even though H_0 is true. This would cause us to erroneously reject H_0. With this in mind, let us enumerate the four possible things that can happen in a test, as shown in the following table.

                   H_0 is true         H_1 is true
   Accept H_0      correct decision    type II error
   Accept H_1      type I error        correct decision

When H_0 is true and the test procedure causes us to reject H_0 (that is, if the value x of X is in C), we have made an error of type I. When H_1 is true and we are led to accept H_0 (that is, if X ∉ C), we have made an error of type II. Of great importance will be the probability

   α = Pr_{H_0}{X ∈ C}


of making a type I error, and the probability

   β = Pr_{H_1}{X ∉ C}

of making a type II error. Here, the notation Pr_H{·} means “calculate the probability of the event {·} assuming H is true”. The number 1 − β will be called the power of the test.

If the set of distributions specified by a hypothesis H consists of a single distribution, then H is called a simple hypothesis. Otherwise, H is a composite hypothesis.

Example 1.1. Suppose an urn is filled with 7 marbles, of which θ are red and the rest are green. We want to test the hypotheses

   H_0 : θ = 3
   H_1 : θ = 5.

Both of these are simple hypotheses – each specifies a specific value of θ. To carry out the test, we select a random sample of size three, without replacement. The rule will be: reject H_0 (and accept H_1) if at least two of the marbles in the sample are red; otherwise accept H_0 (and reject H_1).

Discussion. The sample space S here is the set of (7 choose 3) = 35 triples such as {R_1, R_3, G_1}, and the critical region is that part of it consisting of all triples with at least two R’s. X is the particular triple in S that occurs. Let Y denote the number of red marbles in the sample. Y is determined from X and so is a statistic. Certainly Y is zero if θ is zero, and is three if θ is seven. Otherwise, the probability function of Y is given by

   Pr{Y = y} = (θ choose y)(7 − θ choose 3 − y) / (7 choose 3),   y = 0, 1, . . . , min(3, θ).

Let us calculate the probability of making type I and type II errors. The probability of a type I error is

   α = Pr_{H_0}{X ∈ C} = Pr_{θ=3}{Y ≥ 2} = [(3 choose 2)(4 choose 1) + (3 choose 3)(4 choose 0)] / (7 choose 3) = 13/35

and the probability of a type II error is

   β = Pr_{H_1}{X ∉ C} = Pr_{θ=5}{Y < 2} = (5 choose 1)(2 choose 2) / (7 choose 3) = 5/35.

The power of the test is 30/35. Of course, a different rule would lead to different values for α and β. If we think that α is too high, we could change the rule to: reject H_0 if Y = 3. In that case, α would be 1/35, but β would increase. To decrease both α and β simultaneously, we would have to increase the sample size. Of course, when the sample size is 7, both α and β are zero since no error can be made.
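Since Y has a hypergeometric distribution, α and β can be computed directly. The sketch below reproduces 13/35 and 5/35 with scipy.stats.hypergeom.

```python
# alpha and beta for the urn test: reject H0 if Y >= 2, where Y is hypergeometric
# with population size 7, theta red marbles, and a sample of 3 drawn without replacement.
from scipy.stats import hypergeom

M, draws = 7, 3                               # 7 marbles in the urn, sample of size 3

# Type I error: theta = 3 red marbles, P(Y >= 2).
alpha = hypergeom.pmf(2, M, 3, draws) + hypergeom.pmf(3, M, 3, draws)
# Type II error: theta = 5 red marbles, P(Y < 2) = P(Y <= 1).
beta = hypergeom.cdf(1, M, 5, draws)

print(alpha, 13 / 35)                         # 0.3714... = 13/35
print(beta, 5 / 35)                           # 0.1428... = 5/35
```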

Example 1.2. Suppose we have a normal distribution N(µ, σ²) with known variance σ² = 625, and we want to test the hypotheses

   H_0 : µ = 100
   H_1 : µ = 70.

We select a simple random sample of size 10 from this distribution. The rule is: reject H_0 if X̄ < 80.


Discussion. Here again, both hypotheses are simple. The observation is a simple random sample of size 10, the sample space S is the set of all ten-tuples (x_1, · · · , x_10) of real numbers, and the critical region is C = {(x_1, · · · , x_10) ∈ S | x̄ < 80}. The probability of making a type I error is

   α = Pr_{H_0}{X ∈ C} = Pr_{µ=100}{X̄ < 80} = Pr{(X̄ − µ)/(σ/√n) < (80 − 100)/(25/√10)} = Pr{Z < −2.53} = 0.0057

and the probability of a type II error is

   β = Pr_{H_1}{X ∉ C} = Pr_{µ=70}{X̄ ≥ 80} = Pr{(X̄ − µ)/(σ/√n) ≥ (80 − 70)/(25/√10)} = Pr{Z ≥ 1.26} = 0.1040.

Example 1.3. Suppose we have the same urn problem as in example 1.1, but this time we wish to test the composite hypotheses

   H_0 : θ ≤ 3   (4.1)
   H_1 : θ > 3.

As before, we select a sample of size 3, without replacement, and we reject H_0 if the number Y of red balls in the sample is at least two. What are the probabilities of type I and type II errors?

Since knowing that H_0 is true does not pinpoint a particular value of θ, we must calculate an α value, to be denoted by α(θ), for each value of θ with θ ≤ 3; that is, for each value of θ assumed under H_0. Similarly, we must calculate a value β(θ) for each value of θ assumed under H_1.

Definition. The function

   P(θ) = P_θ(reject H_0) = α(θ)       if θ is assumed under H_0,
                          = 1 − β(θ)   if θ is assumed under H_1,

is called the power function of the test. If α(θ) ≤ α for all values of θ assumed under H_0, then the test is said to be of level α. The number max{α(θ) : θ is assumed under H_0} is called the size of the test. (These definitions are as given in the graduate texts Casella and Berger, p. 385, and Shao, p. 126.)

Remarks. WMS use some different terminology.

(1) They would write the hypotheses in (4.1) as
\[ H_0 : \theta = 3 \qquad H_1 : \theta > 3, \]
and the hypotheses in (4.2) as
\[ H_0 : \mu = 100 \qquad H_1 : \mu < 100. \]
This is justified if, for example, $\max_{\mu \ge 100} \alpha(\mu) = \alpha(100)$.

(2) WMS write their null hypotheses as simple, say as in

H0 : θ = θ0

H1 : θ < θ0.

They then define α(θ0) to be the level of the test. Thus, they confuse size and level.


Let us now return to example 3 and compute the power function for the test. We have
\[ P(\theta) = P_\theta(\text{reject } H_0) = P_\theta\{Y = 2\} + P_\theta\{Y = 3\}, \]
which can be computed from (1.1). The results are in the following table:

θ       0    1    2      3       4       5       6    7
P(θ)    0    0    5/35   13/35   22/35   30/35   1    1

Example 1.4. Suppose as in example 1.2 we have a normal distribution N(µ, 625) and we wish to test the composite hypotheses
\[ H_0 : \mu \ge 100 \qquad H_1 : \mu < 100. \tag{4.2} \]
As before, we select a simple random sample of size 10 from this distribution and reject H0 if $\bar X < 80$. Find the power function for this test.

Solution. We have
\[ P(\mu) = P_\mu\{\bar X < 80\} = P_\mu\Big\{\frac{\bar X - \mu}{\sigma/\sqrt n} < \frac{80 - \mu}{25/\sqrt{10}}\Big\} = P\Big\{Z < \frac{80 - \mu}{25/\sqrt{10}}\Big\} \]
where $Z \sim N(0,1)$. We get the following table for selected values of µ:

µ       50   60      70      75      80      85      90      100     110
P(µ)    1    .9941   .8962   .7357   .5000   .2643   .1038   .0059   0

which gives the graph below.

[Figure: the power function P(µ) plotted for 50 ≤ µ ≤ 120; the curve decreases from 1 down to 0 as µ increases.]
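The whole power curve can be generated directly; a sketch with my own variable names:

    import numpy as np
    from scipy.stats import norm

    sigma, n, cutoff = 25, 10, 80
    mu = np.arange(50, 121, 5)
    power = norm.cdf((cutoff - mu) / (sigma / np.sqrt(n)))   # P_mu(Xbar < 80)
    for m, p in zip(mu, power):
        print(m, round(p, 4))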

Example 1.5. For the test of example 1.2, find a critical region of size α = .05.


Solution. The critical region is of the form $\bar X \le k$, where we have to find k. We have
\[ .05 = \alpha = P_{\mu=100}\{\bar X \le k\} = P_{\mu=100}\Big\{\frac{\bar X - \mu}{\sigma/\sqrt n} \le \frac{k - 100}{25/\sqrt{10}}\Big\} = P\Big\{Z \le \frac{k - 100}{25/\sqrt{10}}\Big\}, \]
so
\[ \frac{k - 100}{25/\sqrt{10}} = -z_{.05} = -1.645. \]
Solving, we find k = 87. Hence, in order that the level of the test be .05, the rule should be: reject H0 if $\bar x \le 87$. In this case we also have
\[ \beta = P_{\mu=70}\{\bar X > 87\} = P\Big\{Z > \frac{87 - 70}{25/\sqrt{10}}\Big\} = P\{Z > 2.15\} = .016. \]

4.2 How to Choose the Critical Region – Case of Simple Hypotheses.

In the examples of §4.1, the critical region (or equivalently the rule) was chosen sensibly, but with no particular method. In this section, we will learn how to choose “good” critical regions when both hypotheses are simple.

We have noticed in the examples of §4.1 that different critical regions give different values of α and β, and that changing the critical region to decrease α has the effect of increasing β, and vice-versa. A “good” critical region should be one for which both α and β are as small as possible, but it seems we cannot make both simultaneously small (unless the sample size is increased). Hence our method will be as follows:

(1) Decide beforehand on a suitable value of α. Typical values are .01 or .05.

(2) Among all tests (i.e. critical regions) of level α, choose one that has maximal power (i.e. minimal β). This is called a most powerful test (critical region) of level α.

In the remainder of this section, H0 and H1 denote two simple hypotheses, and f0(x) and f1(x) denote the likelihood (i.e. density or probability) functions of the observation X under H0 and H1 respectively.

Example 2.1. Two hypotheses are to be tested on the basis of an observation X with range set {0, 1, 2, 3, 4, 5, 6}. f0(x) and f1(x) are given in the following table.

x        0    1    2    3     4     5     6
f0(x)    .2   .1   .1   .05   .25   .3    0
f1(x)    .1   .1   .2   .2    .1    .15   .15

Find a most powerful critical region of level α = .3.

Solution. The critical regions of level .3 are as follows. In each case, the powers (the 1 − β's) are given in parentheses.

Size = .3 :  {0,1} (.2), {0,1,6} (.35), {0,2} (.3), {0,2,6} (.45), {3,4} (.3), {3,4,6} (.45), {5} (.15), {5,6} (.3)
Size = .25 : {0,3} (.3), {4} (.1), {4,6} (.25), {1,2,3} (.5)
Size = .2 :  {0} (.1), {0,6} (.25), {1,2} (.3)
Size = .15 : {1,3} (.3), {2,3} (.4)
Size = .1 :  {1} (.1), {2} (.2)
Size = .05 : {3} (.2).


Hence either {0, 2, 6} or {3, 4, 6} is a most powerful critical region of size .3. In both cases, the power is .45. The most powerful critical region of level .3 is {1, 2, 3}. It has size .25 and power .5.
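For small discrete problems like this one, the search over critical regions can simply be brute-forced; a minimal sketch (names are mine):

    from itertools import combinations

    x_values = range(7)
    f0 = [.2, .1, .1, .05, .25, .3, 0]
    f1 = [.1, .1, .2, .2, .1, .15, .15]

    best = None
    for r in range(1, 8):
        for C in combinations(x_values, r):
            size  = sum(f0[x] for x in C)      # alpha of this critical region
            power = sum(f1[x] for x in C)      # 1 - beta
            if size <= .3 + 1e-12 and (best is None or power > best[2]):
                best = (C, size, power)
    print(best)    # ((1, 2, 3), 0.25, 0.5): most powerful of level .3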

In general, there may be infinitely many critical regions of size α, so how do we find a most powerful one? We reason as follows. For an x ∈ C, in which case we are going to reject H0, f0(x) should be “small”, while f1(x) should be relatively “large”. Hence x ∈ C should correspond to small values of f0(x)/f1(x). This is the thinking behind the Neyman-Pearson Lemma.

Theorem 4.2.1 (The Neyman-Pearson Lemma) Let C be a critical region of size α. If there exists a constant k > 0 such that
\[ f_0(x) \le k f_1(x) \quad \text{for } x \in C, \tag{4.3} \]
\[ f_0(x) > k f_1(x) \quad \text{for } x \notin C, \tag{4.4} \]
then C is a most powerful critical region of level (i.e. at most size) α.

Proof. (†) Let C be such a critical region, and let D be any other critical region of level α. Then
\[ \int_{C\cap D} f_0(x)\,dx + \int_{C\setminus D} f_0(x)\,dx = \int_C f_0(x)\,dx = \alpha \ge \int_D f_0(x)\,dx = \int_{C\cap D} f_0(x)\,dx + \int_{D\setminus C} f_0(x)\,dx, \]
so that
\[ \int_{C\setminus D} f_0(x)\,dx \ge \int_{D\setminus C} f_0(x)\,dx. \]
Using this with (4.3) and (4.4), we then have
\begin{align*}
1 - \beta_C = \int_C f_1(x)\,dx &= \int_{C\cap D} f_1(x)\,dx + \int_{C\setminus D} f_1(x)\,dx \\
&\ge \int_{C\cap D} f_1(x)\,dx + \int_{C\setminus D} \frac{f_0(x)}{k}\,dx \\
&\ge \int_{C\cap D} f_1(x)\,dx + \int_{D\setminus C} \frac{f_0(x)}{k}\,dx \\
&\ge \int_{C\cap D} f_1(x)\,dx + \int_{D\setminus C} f_1(x)\,dx = \int_D f_1(x)\,dx = 1 - \beta_D.
\end{align*}

The Neyman-Pearson lemma not only tells us when a critical region is most powerful; it also tells us exactly how to construct a most powerful critical region of size α. What we do is

(1) Step 1. Define $C = \{x : f_0(x) \le k f_1(x)\}$, and then

(2) Step 2. Find k from the relation $P_{H_0}\{X \in C\} = \alpha$.

Note that the test constructed has size α, but will be the most powerful test of level α.

Example 2.2. We have a normal distribution N(µ, σ²) with known variance σ². Based on a random sample X1, . . . , Xn, we wish to test

H0 : µ = µ0

H1 : µ = µ1

where µ1 < µ0. Find a most powerful critical region of size α.


Solution. As usual, the likelihood function is
\[ f_\mu(x) = f(x_1, \ldots, x_n) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2} = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2\right]} \]
and so
\[ \frac{f_0(x)}{f_1(x)} = e^{-\frac{n}{2\sigma^2}\left[(\bar x-\mu_0)^2 - (\bar x-\mu_1)^2\right]} = e^{\frac{n(\mu_0-\mu_1)}{2\sigma^2}\left[2\bar x - (\mu_0+\mu_1)\right]}. \]
Hence
\[ C = \Big\{\frac{f_0(x)}{f_1(x)} \le k\Big\} = \Big\{\frac{n(\mu_0-\mu_1)}{2\sigma^2}\big[2\bar x - (\mu_0+\mu_1)\big] \le \log k\Big\} = \big\{2\bar x - (\mu_0+\mu_1) \le k'\big\} = \{\bar x \le k''\}, \]
where we used the fact that µ1 < µ0. This tells us the “form” of the critical region. Now we find k''. We have
\[ \alpha = P_{\mu=\mu_0}\{\bar X \le k''\} = P_{\mu=\mu_0}\Big\{\frac{\bar X - \mu}{\sigma/\sqrt n} \le \frac{k'' - \mu_0}{\sigma/\sqrt n}\Big\} = P\Big\{Z \le \frac{k'' - \mu_0}{\sigma/\sqrt n}\Big\}, \]
which implies that
\[ \frac{k'' - \mu_0}{\sigma/\sqrt n} = -z_\alpha. \]
Solving, we find that $k'' = \mu_0 - z_\alpha \sigma/\sqrt n$. The rule for the most powerful test is therefore:
\[ \text{reject } H_0 \text{ at level } \alpha \text{ if } \bar x \le \mu_0 - z_\alpha \frac{\sigma}{\sqrt n}. \]

Alternate Version of the Test. We can also write
\[ C = \Big\{\frac{\bar X - \mu_0}{\sigma/\sqrt n} \le k'''\Big\} = \{Z \le k'''\}, \]
and find k''' from $\alpha = P\{Z \le k'''\}$. This gives $k''' = -z_\alpha$, and the test is: calculate
\[ z = \frac{\bar x - \mu_0}{\sigma/\sqrt n} \]
and reject H0 at level α if $z \le -z_\alpha$.
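As a sketch of how this rule might be applied to data (the function name and the sample mean used below are my own):

    from math import sqrt
    from scipy.stats import norm

    def z_test_lower(xbar, mu0, sigma, n, alpha=0.05):
        """Most powerful level-alpha test of H0: mu = mu0 vs H1: mu < mu0 (sigma known)."""
        z = (xbar - mu0) / (sigma / sqrt(n))
        return z, z <= -norm.ppf(1 - alpha)     # (test statistic, reject H0?)

    print(z_test_lower(xbar=78.4, mu0=100, sigma=25, n=10))   # made-up xbar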

Remark.† As in example 2.2, suppose we use the Neyman-Pearson method to find a critical region C of size α which is most powerful of level α for the hypotheses
\[ H_0 : \theta = \theta_0 \qquad H_1 : \theta = \theta_1, \]
and that C is independent of θ1 for θ1 ∈ Θ1 (so that C is most powerful of level α for testing this pair of hypotheses for any θ1 ∈ Θ1). Then C is uniformly most powerful of level α for the hypotheses
\[ H_0 : \theta = \theta_0 \qquad H_1 : \theta \in \Theta_1. \]
That is, if PC(θ) is the power function of C and PD(θ) is the power function of any other critical region D of level α, then PC(θ) ≥ PD(θ) for all θ ∈ Θ1. If moreover the null hypothesis is H0 : θ ∈ Θ0 and α(θ) ≤ α for all θ ∈ Θ0, then C is UMP for testing
\[ H_0 : \theta \in \Theta_0 \qquad H_1 : \theta \in \Theta_1. \]


Rationale:† For if D is another critical region for this test with $P_{\theta_0}[X \in D] \le \alpha$, and if θ′ ∈ Θ1, then C and D are both critical regions for testing
\[ H_0 : \theta = \theta_0 \qquad H_1 : \theta = \theta', \]
so the power of D (at θ′) is at most the power of C there. That is, $P_D(\theta') \le P_C(\theta')$. This is true of all θ′ ∈ Θ1.

Example 2.2 again.† The size-α critical region $\{\bar x \le \mu_0 - z_\alpha \sigma/\sqrt n\}$ obtained in example 2.2 is the same for any µ1 < µ0, so it is UMP of level α for testing
\[ H_0 : \mu = \mu_0 \qquad H_1 : \mu < \mu_0, \]
or equivalently
\[ H_0 : \mu \ge \mu_0 \qquad H_1 : \mu < \mu_0. \]
Note (†): this method will not allow us to handle the two-sided test
\[ H_0 : \mu = \mu_0 \qquad H_1 : \mu \ne \mu_0. \]

That is handled in the next section.

Example 2.3. We observe a binomial random variable with parameter θ. Find a most powerful critical region of size α for testing
\[ H_0 : \theta = \theta_0 \qquad H_1 : \theta = \theta_1 \]
where θ1 > θ0.

Solution. We have $f_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$ for x = 0, 1, . . . , n, and so
\begin{align*}
C &= \Big\{x \,\Big|\, \binom{n}{x}\theta_0^x(1-\theta_0)^{n-x} \le k \binom{n}{x}\theta_1^x(1-\theta_1)^{n-x}\Big\} = \Big\{x \,\Big|\, \Big(\frac{\theta_0}{\theta_1}\Big)^x \Big(\frac{1-\theta_0}{1-\theta_1}\Big)^{n-x} \le k\Big\} \\
&= \Big\{x \,\Big|\, x\Big[\log\Big(\frac{\theta_0}{\theta_1}\Big) + \log\Big(\frac{1-\theta_1}{1-\theta_0}\Big)\Big] \le \log k - n\log\Big(\frac{1-\theta_0}{1-\theta_1}\Big)\Big\} = \Bigg\{x \,\Bigg|\, x \ge \frac{\log k - n\log\big(\frac{1-\theta_0}{1-\theta_1}\big)}{\log\big(\frac{\theta_0}{\theta_1}\big) + \log\big(\frac{1-\theta_1}{1-\theta_0}\big)}\Bigg\}.
\end{align*}
Note that the inequality changed direction because $\log\big(\frac{\theta_0}{\theta_1}\big) + \log\big(\frac{1-\theta_1}{1-\theta_0}\big) < 0$ due to the assumption θ1 > θ0.

Thus C is of the form C = {x ≥ k′}, where k′ is such that $\alpha = P_{\theta=\theta_0}\{X \ge k'\}$. There is no nice formula for k′ in this case, and because X is discrete, we cannot always ensure a test of exactly size α. Hence we will define
\[ k_\alpha(\theta) = \text{the smallest value of } k \text{ such that } P_\theta\{X \ge k\} \le \alpha. \]
The reason for choosing the “smallest” is to maximize the power. Then the test is: reject H0 at level α if $x \ge k_\alpha(\theta_0)$. For a numerical example, suppose that n = 20, α = .05, and we want to test
\[ H_0 : \theta = .3 \qquad H_1 : \theta = .5. \]
From binomial tables, or using a calculator, one finds that $k_{.05}(.3) = 10$. So the test is: reject H0 if x ≥ 10.
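A sketch of how $k_\alpha(\theta_0)$, the attained size, and the power could be computed (my own function name):

    from scipy.stats import binom

    def k_alpha(n, theta0, alpha):
        """Smallest k with P_theta0(X >= k) <= alpha, X ~ Binomial(n, theta0)."""
        for k in range(n + 1):
            if binom.sf(k - 1, n, theta0) <= alpha:   # sf(k-1) = P(X >= k)
                return k
        return n + 1

    k = k_alpha(20, .3, .05)
    print(k)                                  # 10
    print(binom.sf(k - 1, 20, .3))            # attained size ≈ 0.048
    print(binom.sf(k - 1, 20, .5))            # power at theta = .5 ≈ 0.588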


Example 2.4. We have a normal distribution N(µ, σ²) with known mean µ. Based on a random sample X1, . . . , Xn, we wish to test
\[ H_0 : \sigma^2 = \sigma_0^2 \qquad H_1 : \sigma^2 = \sigma_1^2 \]
where $\sigma_1^2 > \sigma_0^2$. Find a most powerful critical region of size α.

Solution. The likelihood function is
\[ f_{\sigma^2}(x) = f(x_1, \ldots, x_n) = \Big(\frac{1}{2\pi\sigma^2}\Big)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2} \]
and so
\[ \frac{f_{\sigma_0^2}(x)}{f_{\sigma_1^2}(x)} = \Big(\frac{\sigma_1^2}{\sigma_0^2}\Big)^{n/2} e^{-\frac{1}{2}\big(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}\big)\sum_{i=1}^n (x_i-\mu)^2}. \]
After simplifying, we get $C = \{\hat\sigma^2 \ge k\}$, where $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$. Step 2 becomes
\[ \alpha = P_{H_0}[X \in C] = P_{\sigma^2=\sigma_0^2}[\hat\sigma^2 \ge k] = P_{\sigma^2=\sigma_0^2}\Big[\frac{n\hat\sigma^2}{\sigma^2} \ge k'\Big], \]
where $\frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_n$, implying that $k' = \chi^2_{\alpha,n}$. Hence the test is: reject H0 at level α if $\frac{n\hat\sigma^2}{\sigma_0^2} \ge \chi^2_{\alpha,n}$.

4.3 How to Choose the Critical Region - Case of Composite Hypotheses.

In this section, we have an observation X from a distribution with parameter θ, and we want to test
\[ H_0 : \theta \in \Theta_0 \qquad H_1 : \theta \in \Theta_1 \]
where Θ is the set of parameter values θ, and Θ = Θ0 ∪ Θ1 with Θ0 and Θ1 disjoint. If C is a critical region for the test, then
\[ P_C(\theta) = P_\theta(\text{reject } H_0) = P_\theta(X \in C) = \begin{cases} \alpha(\theta) & \text{if } \theta \in \Theta_0, \\ 1-\beta(\theta) & \text{if } \theta \in \Theta_1 \end{cases} \]
is the power function corresponding to C.

Definition. Let C and D be two critical regions of level α (i.e. both size(C) and size(D) are at most α). We say that C is uniformly more powerful than D if $P_C(\theta) \ge P_D(\theta)$ for all θ ∈ Θ1. A test (or critical region) of level α is uniformly most powerful if it is uniformly more powerful than any other test of level α.

As in the previous section, we decide beforehand on a value α for the level of the test. Then we search among all critical regions for one of size less than or equal to α which is uniformly most powerful. Unfortunately, such uniformly most powerful tests do not always exist. The following method of finding critical regions, called the likelihood ratio method, tends to give tests with excellent qualities.

The Likelihood Ratio Method. Let $f_\theta(x)$ denote the density or probability function of X. Then
\[ \lambda(x) = \frac{\max_{\theta\in\Theta_0} f_\theta(x)}{\max_{\theta\in\Theta} f_\theta(x)} \]
is called the likelihood ratio statistic. We take $C = \{x \mid \lambda(x) \le k\}$ as the critical region, where k is chosen so that $\max_{\theta\in\Theta_0} P_\theta\{X \in C\} = \alpha$.


Remark. When the hypotheses are simple, we have Θ = {θ0, θ1} and Θ0 = {θ0}. It is easy to show that the C defined in the previous paragraph is the same as the C which results from the Neyman-Pearson lemma.

Example 3.1. Given a random sample X1, . . . , Xn from N(µ, σ²) where σ² is known, find the likelihood ratio test of size α for testing
\[ H_0 : \mu = \mu_0 \qquad H_1 : \mu \ne \mu_0. \]

Solution. Here we have Θ0 = {µ0}, Θ = ℝ, and
\[ f_\mu(x) = f(x_1, \ldots, x_n) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}. \]
We have
\[ \max_{\mu\in\Theta_0} f_\mu(x) = f_{\mu_0}(x) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu_0)^2}, \qquad \max_{\mu\in\Theta} f_\mu(x) = f_{\bar x}(x) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar x)^2}, \]
and using our identity
\[ \sum_{i=1}^n (x_i-\mu)^2 = \sum_{i=1}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2, \]
we get
\[ \lambda(x) = e^{-n(\bar x-\mu_0)^2/2\sigma^2}. \]
This gives
\[ C = \{\lambda \le k\} = \Big\{-\frac{n(\bar x-\mu_0)^2}{2\sigma^2} \le \log k\Big\} = \Big\{\Big|\frac{\bar x-\mu_0}{\sigma/\sqrt n}\Big| \ge k'\Big\}. \]
We find k′ from
\[ \alpha = P_{\mu=\mu_0}\Big\{\Big|\frac{\bar x-\mu_0}{\sigma/\sqrt n}\Big| \ge k'\Big\}, \]
which gives $k' = z_{\alpha/2}$. Hence the test is: calculate
\[ z = \frac{\bar x-\mu_0}{\sigma/\sqrt n} \]
and reject H0 if $|z| \ge z_{\alpha/2}$.

Example 3.2.† Same as example 3.1, but this time the hypotheses are

H0 : µ ≤ µ0

H1 : µ > µ0.


Solution. This time
\[ \lambda(x) = \frac{\max_{\mu\le\mu_0} f_\mu(x)}{\max_{-\infty<\mu<\infty} f_\mu(x)}. \]
If we write
\[ f_\mu(x) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2\right]}, \]
we see that $f_\mu(x)$ has its maximum when $(\bar x-\mu)^2$ has its minimum. Subject to the condition µ ≤ µ0, $(\bar x-\mu)^2$ has its minimum at
\[ \hat\mu = \begin{cases} \mu_0 & \text{if } \bar x \ge \mu_0, \\ \bar x & \text{if } \bar x \le \mu_0. \end{cases} \]
Hence
\[ \lambda(x) = \begin{cases} \dfrac{f_{\mu_0}(x)}{f_{\bar x}(x)} = e^{-n(\bar x-\mu_0)^2/2\sigma^2} & \text{if } \bar x \ge \mu_0, \\[1ex] \dfrac{f_{\bar x}(x)}{f_{\bar x}(x)} = 1 & \text{if } \bar x \le \mu_0, \end{cases} \]
and so, because k must be less than one in order to have size α < 1,
\[ C = \{\lambda(x) \le k,\ \bar x \ge \mu_0\} \cup \{\lambda(x) \le k,\ \bar x < \mu_0\} = \{e^{-n(\bar x-\mu_0)^2/2\sigma^2} \le k,\ \bar x \ge \mu_0\} = \Big\{\Big|\frac{\bar x-\mu_0}{\sigma/\sqrt n}\Big| \ge k',\ \bar x \ge \mu_0\Big\} = \Big\{\frac{\bar x-\mu_0}{\sigma/\sqrt n} \ge k',\ \bar x \ge \mu_0\Big\}, \]
where k′ ≥ 0. Now we have to determine k′, using the fact that the test is to have level α. Since for µ ≤ µ0,
\[ P_\mu\Big\{\frac{\bar X-\mu_0}{\sigma/\sqrt n} \ge k'\Big\} \]
is increasing in µ, then
\[ \alpha = \max_{\mu\le\mu_0} P_\mu\Big\{\frac{\bar X-\mu_0}{\sigma/\sqrt n} \ge k'\Big\} = P_{\mu_0}\Big\{\frac{\bar X-\mu_0}{\sigma/\sqrt n} \ge k'\Big\}, \]
implying that $k' = z_\alpha$.

Example 3.3.† Given a random sample X1, . . . , Xn from N(µ, σ²) where σ² is unknown, find the likelihood ratio test of size α for testing
\[ H_0 : \mu = \mu_0 \qquad H_1 : \mu \ne \mu_0. \]

Solution. This time we have
\[ \lambda(x) = \frac{\max_{\mu=\mu_0,\,\sigma^2>0} f_{\mu,\sigma^2}(x)}{\max_{-\infty<\mu<\infty,\,\sigma^2>0} f_{\mu,\sigma^2}(x)} = \frac{f_{\mu_0,\tilde\sigma^2}(x)}{f_{\bar x,\hat\sigma^2}(x)} \]
where
\[ \tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu_0)^2, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2. \]
Thus,
\[ \lambda(x) = \Big(\frac{\hat\sigma^2}{\tilde\sigma^2}\Big)^{n/2} = \Big(\frac{\sum_{i=1}^n (x_i-\mu_0)^2}{\sum_{i=1}^n (x_i-\bar x)^2}\Big)^{-n/2} = \Big(1 + \frac{n(\bar x-\mu_0)^2}{\sum_{i=1}^n (x_i-\bar x)^2}\Big)^{-n/2} = \Big(1 + \frac{t^2}{n-1}\Big)^{-n/2}, \]


where
\[ t = \frac{\bar x - \mu_0}{s/\sqrt n}. \]
Hence the critical region is
\[ C = \Big\{\Big(1 + \frac{t^2}{n-1}\Big)^{-n/2} \le k\Big\} = \Big\{\Big(1 + \frac{t^2}{n-1}\Big)^{n/2} \ge k^{-1}\Big\} = \{|t| \ge k^*\}. \]
Since $\alpha = P_{\mu=\mu_0}\{|t| \ge k^*\}$, then $k^* = t_{\alpha/2,n-1}$. The test is
\[ \text{Reject } H_0 \text{ at level } \alpha \text{ if } \Big|\frac{\bar X - \mu_0}{s/\sqrt n}\Big| \ge t_{\alpha/2,n-1}. \]
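This is the familiar one-sample t-test; a minimal sketch with made-up data:

    import numpy as np
    from scipy import stats

    x = np.array([9.8, 10.4, 10.1, 9.5, 10.9, 10.3, 9.9, 10.6])   # made-up sample
    mu0, alpha = 10.0, 0.05

    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
    reject = abs(t) >= stats.t.ppf(1 - alpha / 2, df=len(x) - 1)
    print(t, reject)
    print(stats.ttest_1samp(x, mu0))    # same statistic, with a two-sided p-value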

Example 3.4.† Given a random sample X1, . . . , Xn from N(µ, σ²) where µ is unknown, find the likelihood ratio test of size α for testing
\[ H_0 : \sigma^2 = \sigma_0^2 \qquad H_1 : \sigma^2 \ne \sigma_0^2. \]

Solution. This time we have
\[ \lambda(x) = \frac{\max_{-\infty<\mu<\infty,\,\sigma^2=\sigma_0^2} f_{\mu,\sigma^2}(x)}{\max_{-\infty<\mu<\infty,\,\sigma^2>0} f_{\mu,\sigma^2}(x)} = \frac{f_{\bar x,\sigma_0^2}(x)}{f_{\bar x,\hat\sigma^2}(x)} \]
where
\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2. \]
Thus,
\[ \lambda(x) = \frac{\Big(\frac{1}{\sigma_0\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i-\bar x)^2}}{\Big(\frac{1}{\hat\sigma\sqrt{2\pi}}\Big)^n e^{-\frac{1}{2\hat\sigma^2}\sum_{i=1}^n (x_i-\bar x)^2}} = \Big(\frac{\hat\sigma^2}{\sigma_0^2}\Big)^{n/2} e^{-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i-\bar x)^2}\, e^{n/2} = \Big(\frac{(n-1)s^2}{\sigma_0^2}\Big)^{n/2} e^{-\frac{(n-1)s^2}{2\sigma_0^2}}\, e^{n/2}\, n^{-n/2}, \]
where
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2 = \frac{n}{n-1}\hat\sigma^2. \]
Let us write
\[ Y = \frac{(n-1)s^2}{\sigma_0^2} \sim \chi^2_{n-1} \quad \text{under } H_0. \]
Then
\[ \lambda(x) = Y^{n/2} e^{-Y/2} \Big(\frac{e}{n}\Big)^{n/2}, \]
so the critical region is of the form
\[ C = \{Y^{n/2} e^{-Y/2} \le k\} = \{Y \le k_1\} \cup \{Y \ge k_2\}. \]
Hence $\alpha = P_{H_0}(C) = P_{H_0}\{Y \le k_1\} + P_{H_0}\{Y \ge k_2\}$. We will choose k1 and k2 so that
\[ P_{H_0}\{Y \le k_1\} = \frac{\alpha}{2} = P_{H_0}\{Y \ge k_2\}. \]
This gives $k_1 = \chi^2_{1-\alpha/2,n-1}$ and $k_2 = \chi^2_{\alpha/2,n-1}$. Thus the level-α test is: calculate
\[ Y = \frac{(n-1)s^2}{\sigma_0^2} \]
and reject H0 if either $Y \le \chi^2_{1-\alpha/2,n-1}$ or $Y \ge \chi^2_{\alpha/2,n-1}$.
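A sketch of this two-sided variance test (function name and data are mine):

    import numpy as np
    from scipy.stats import chi2

    def var_test(x, sigma0_sq, alpha=0.05):
        """Two-sided test of H0: sigma^2 = sigma0^2 for a normal sample."""
        n = len(x)
        Y = (n - 1) * np.var(x, ddof=1) / sigma0_sq
        lower = chi2.ppf(alpha / 2, n - 1)        # chi2_{1-alpha/2, n-1} in the notes' notation
        upper = chi2.ppf(1 - alpha / 2, n - 1)    # chi2_{alpha/2, n-1}
        return Y, (Y <= lower) or (Y >= upper)

    x = np.array([4.7, 5.3, 5.1, 4.2, 5.8, 4.9, 5.5])   # made-up data
    print(var_test(x, sigma0_sq=0.25))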


4.4 Some Last Topics.

4.4.1 Large Sample Tests.

We have a random sample X1, . . . , Xn from $F_\theta$. Suppose that for n large (≥ 25), the statistic $\frac{\hat\theta - \theta}{\sigma_{\hat\theta}}$, where $\sigma_{\hat\theta}$ is the standard error (standard deviation of $\hat\theta$), has approximately the distribution N(0, 1). Then to test
\[ H_0 : \theta = \theta_0 \qquad H_1 : \theta \ne \theta_0 \quad (\theta > \theta_0) \quad (\theta < \theta_0) \]
at level α, calculate $z = \frac{\hat\theta - \theta_0}{\sigma_{\hat\theta}}$ and reject H0 if
\[ |z| > z_{\alpha/2} \quad (z > z_\alpha) \quad (z < -z_\alpha) \]
respectively.

4.4.2 p-value.

(Also called p-level or observed level of significance.) The p-value of a test is that value of α for which the observed value of the test statistic is on the border between accepting and rejecting H0.

4.4.3 Bayesian Tests of Hypotheses.

In Bayesian statistics, one chooses between

H0 : θ ∈ Θ0

H1 : θ ∈ Θ1

by calculating P (Θ0|x) and P (Θ1|x), and deciding accordingly.

4.4.4 Relationship Between Tests and Confidence Sets.†

Let E denote the set of possible values of the observation X, and let Θ be the set of all values of the parameter θ.

Definition. A confidence set C(X) for θ is a subset of Θ consisting of parameter values θ which are consistent with the observation X. The function $P_\theta\{\theta \in C(X)\}$ is called the coverage probability. The confidence coefficient of C(X) is $1-\alpha = \inf_{\theta\in\Theta} P_\theta\{\theta \in C(X)\}$.

In the chapter on confidence intervals, we had Θ ⊂ ℝ, and C(x) was an interval of the form $[\theta_L(x), \theta_U(x)]$, or $(-\infty, \theta_U(x)]$, or $[\theta_L(x), +\infty)$. In this case we have interval estimators, or confidence intervals.

The following proposition shows that every test statistic corresponds to a confidence set, and vice-versa.

Proposition 4.4.1 For each θ0 ∈ Θ, let A(θ0) be the acceptance region (i.e. complement of the rejection region) for a size α test of H0 : θ = θ0 against H1 : θ ≠ θ0. For each x ∈ E, define
\[ C(x) = \{\theta \in \Theta \mid x \in A(\theta)\}. \]
Then C(X) is a 1 − α confidence set for θ. Conversely, for each x ∈ E, let C(x) be a 1 − α confidence set for θ. For each θ ∈ Θ, define
\[ A(\theta) = \{x \in E : \theta \in C(x)\}. \]
Then A(θ0) is an acceptance region for a size α test of H0 : θ = θ0 against H1 : θ ≠ θ0.


Proof. Since $\theta \in C(x) \iff x \in A(\theta)$, and $P_\theta\{X \in A(\theta)^c\} = \alpha$ for all θ, then
\[ \inf_{\theta\in\Theta} P_\theta\{\theta \in C(X)\} = \inf_{\theta\in\Theta} P_\theta\{X \in A(\theta)\} = 1 - \sup_{\theta\in\Theta} P_\theta\{X \in A(\theta)^c\}, \]
so if one side is 1 − α, then so is the other.

Remarks. There is no guarantee that the confidence set obtained by this method will be connected; for example, an interval if Θ ⊂ ℝ. But in most one-dimensional cases, one-sided tests give one-sided intervals and two-sided tests give two-sided intervals.

Example. Given a r.s. X1, . . . , Xn from N(µ, σ²) with σ² unknown, find a c.i. of the form $[\theta_L(x), \theta_U(x)]$ for µ.

Solution. We will invert the test for H0 : µ = µ0 against H1 : µ ≠ µ0. The acceptance region of size α for this test is
\[ A(\mu_0) = \Big\{x : \Big|\frac{\bar x-\mu_0}{s/\sqrt n}\Big| \le t_{\alpha/2,n-1}\Big\}, \]
so it follows that
\[ C(x) = \Big\{\mu : \Big|\frac{\bar x-\mu}{s/\sqrt n}\Big| \le t_{\alpha/2,n-1}\Big\} = \Big\{\mu : \bar x - t_{\alpha/2,n-1}\frac{s}{\sqrt n} \le \mu \le \bar x + t_{\alpha/2,n-1}\frac{s}{\sqrt n}\Big\}. \]
Hence $\theta_L(x) = \bar x - t_{\alpha/2,n-1}\frac{s}{\sqrt n}$ and $\theta_U(x) = \bar x + t_{\alpha/2,n-1}\frac{s}{\sqrt n}$.


Chapter 5

Hypothesis Testing: Applications

5.1 The Bivariate Normal Distribution.

Reference: MWS 7th ed., section 5.10

Definition. The random variables X and Y have a bivariate normal distribution if their joint density function is
\[ f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\, e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^2\right]}, \qquad -\infty < x, y < \infty. \]
Recall that the marginal distributions of X and Y are $N(\mu_x, \sigma_x^2)$ and $N(\mu_y, \sigma_y^2)$, and that ρ is the correlation coefficient between X and Y.

In general, the random vector X = (X1, . . . , Xn) has the multivariate normal distribution N(µ, Σ) with mean vector µ and covariance matrix Σ (symmetric and positive definite) if X has joint density function given by
\[ f(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}, \qquad x \in \mathbb{R}^n. \]

Proposition 5.1.1 Let (X1, Y1), . . . , (Xn, Yn) be a random sample of size n from the bivariate normal distribution. Then the maximum likelihood estimators of $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, and ρ are
\[ \hat\mu_x = \bar X = \frac{1}{n}\sum_{i=1}^n X_i, \quad \hat\mu_y = \bar Y = \frac{1}{n}\sum_{i=1}^n Y_i, \quad \hat\sigma_x^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2, \quad \hat\sigma_y^2 = \frac{1}{n}\sum_{i=1}^n (Y_i-\bar Y)^2, \]
and
\[ r = \hat\rho = \frac{\sum_{i=1}^n (X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^n (X_i-\bar X)^2}\sqrt{\sum_{i=1}^n (Y_i-\bar Y)^2}}. \]

r is called the sample correlation coefficient.

Proof. † Using (1.1) and (1.2), we can write
\[ \sum_{i=1}^n (x_i-\mu_x)^2 = n\hat\sigma_x^2 + n(\bar x-\mu_x)^2, \qquad \sum_{i=1}^n (y_i-\mu_y)^2 = n\hat\sigma_y^2 + n(\bar y-\mu_y)^2, \]
\[ \sum_{i=1}^n (x_i-\mu_x)(y_i-\mu_y) = n\hat\rho\hat\sigma_x\hat\sigma_y + n(\bar x-\mu_x)(\bar y-\mu_y). \]


The likelihood function is then
\[ L = L(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho) = \prod_{i=1}^n f(x_i, y_i) = \frac{\exp\Big\{-\frac{1}{2(1-\rho^2)}\sum_{i=1}^n\Big[\Big(\frac{x_i-\mu_x}{\sigma_x}\Big)^2 - 2\rho\Big(\frac{x_i-\mu_x}{\sigma_x}\Big)\Big(\frac{y_i-\mu_y}{\sigma_y}\Big) + \Big(\frac{y_i-\mu_y}{\sigma_y}\Big)^2\Big]\Big\}}{(2\pi\sigma_x\sigma_y)^n (1-\rho^2)^{n/2}} \]
\[ = \frac{\exp\Big\{-\frac{n}{2(1-\rho^2)}\Big[\frac{\hat\sigma_x^2}{\sigma_x^2} + \frac{(\bar x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho\hat\rho\hat\sigma_x\hat\sigma_y}{\sigma_x\sigma_y} - \frac{2\rho(\bar x-\mu_x)(\bar y-\mu_y)}{\sigma_x\sigma_y} + \frac{\hat\sigma_y^2}{\sigma_y^2} + \frac{(\bar y-\mu_y)^2}{\sigma_y^2}\Big]\Big\}}{(2\pi\sigma_x\sigma_y)^n (1-\rho^2)^{n/2}}. \]
(Note that $\bar x$, $\bar y$, $\hat\sigma_x^2$, $\hat\sigma_y^2$, and $\hat\rho$ are jointly sufficient.) Then
\[ \log L = -n\log 2\pi - \frac{n}{2}\log\big[\sigma_x^2\sigma_y^2(1-\rho^2)\big] - \frac{n}{2(1-\rho^2)}\Big[\frac{\hat\sigma_x^2}{\sigma_x^2} + \frac{(\bar x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho\hat\rho\hat\sigma_x\hat\sigma_y}{\sigma_x\sigma_y} - \frac{2\rho(\bar x-\mu_x)(\bar y-\mu_y)}{\sigma_x\sigma_y} + \frac{\hat\sigma_y^2}{\sigma_y^2} + \frac{(\bar y-\mu_y)^2}{\sigma_y^2}\Big]. \]
Differentiating with respect to $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, and ρ, setting the results equal to zero, and solving the resulting five equations gives the required estimators.

The only thing new here is r. We are going to use r to make inferences about ρ.

Computational Formula for r.
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}, \]
where $S_{xy} = \sum_{i=1}^n (x_i-\bar x)(y_i-\bar y) = \sum_{i=1}^n x_iy_i - n\bar x\bar y$.

5.2 Correlation Analysis.

Reference: MWS 7th ed., section 11.8

Maximum Likelihood Ratio Test for ρ.

Given a random sample (X1, Y1), . . . , (Xn, Yn) from a bivariate normal distribution $N(\mu_x, \sigma_x^2; \mu_y, \sigma_y^2; \rho)$, let us find the maximum likelihood ratio test for
\[ H_0 : \rho = 0 \qquad H_1 : \rho \ne 0. \]

Solution.† If ρ = 0, L becomes
\[ L = \frac{\exp\Big\{-\frac{n}{2}\Big[\frac{1}{\sigma_x^2}\big(\hat\sigma_x^2 + (\bar x-\mu_x)^2\big) + \frac{1}{\sigma_y^2}\big(\hat\sigma_y^2 + (\bar y-\mu_y)^2\big)\Big]\Big\}}{(2\pi)^n(\sigma_x^2\sigma_y^2)^{n/2}}. \]
The maximum likelihood estimators of $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$ are easily seen to be $\bar x$, $\bar y$, $\hat\sigma_x^2$, $\hat\sigma_y^2$, and so the numerator of the maximum likelihood ratio statistic λ(x) is
\[ L_{\text{num}} = \frac{e^{-n/2}}{(2\pi)^n(\hat\sigma_x^2\hat\sigma_y^2)^{n/2}}. \]


In the unconstrained case, the MLE's of $\mu_x$, $\mu_y$, $\sigma_x^2$, $\sigma_y^2$, ρ are $\bar x$, $\bar y$, $\hat\sigma_x^2$, $\hat\sigma_y^2$, $\hat\rho$, and so the denominator of λ(x) is
\[ L_{\text{den}} = \frac{e^{-n/2}}{(2\pi)^n\big(\hat\sigma_x^2\hat\sigma_y^2(1-\hat\rho^2)\big)^{n/2}}. \]
It follows that
\[ \lambda(x) = \frac{L_{\text{num}}}{L_{\text{den}}} = (1-\hat\rho^2)^{n/2}. \]
The critical region is therefore
\[ C = \{(1-\hat\rho^2)^{n/2} \le k\} = \{\hat\rho^2 \ge k'\}. \]
Next, by defining
\[ u_i = \frac{x_i-\mu_x}{\sigma_x}, \qquad v_i = \frac{y_i-\mu_y}{\sigma_y}, \qquad i = 1, \ldots, n, \]
we see that $x_i - \bar x = \sigma_x(u_i-\bar u)$ and $y_i - \bar y = \sigma_y(v_i-\bar v)$, so that $\hat\rho$ can be written as
\[ \hat\rho = \frac{\sum_{i=1}^n (u_i-\bar u)(v_i-\bar v)}{\sqrt{\sum_{i=1}^n (u_i-\bar u)^2}\sqrt{\sum_{i=1}^n (v_i-\bar v)^2}}. \]
Here, the $(u_i, v_i)$'s are values from the bivariate normal N(0, 0; 1, 1; ρ). It follows that the distribution of $\hat\rho$ depends only on the parameter ρ. The question is: what is the distribution of $\hat\rho$ when ρ = 0? The answer is that if ρ = 0, the statistic
\[ t = \frac{\sqrt{n-2}\,\hat\rho}{\sqrt{1-\hat\rho^2}} \tag{5.1} \]
has a t-distribution with n − 2 degrees of freedom. Since |t| is an increasing function of $|\hat\rho|$, the critical region for rejecting H0 : ρ = 0 can be written as $C = \{|t| \ge k''\}$, and the level-α test becomes: reject H0 if $|t| \ge t_{\alpha/2,n-2}$. Also, t is an increasing function of $\hat\rho$, and so we are led to the following.

Summary. To test
\[ H_0 : \rho = 0 \qquad H_1 : \rho \ne 0 \quad (\rho > 0) \quad (\rho < 0) \]
at level α, we calculate t as in (5.1) and reject H0 if
\[ |t| \ge t_{\alpha/2,n-2} \quad (t > t_{\alpha,n-2}) \quad (t < -t_{\alpha,n-2}). \]

Large Sample Inference for ρ.

It can be shown that for large n, the statistic $\frac{1}{2}\log\frac{1+r}{1-r}$ is approximately normally distributed with mean $\frac{1}{2}\log\frac{1+\rho}{1-\rho}$ and variance $\frac{1}{n-3}$. More precisely,
\[ z \stackrel{\text{def}}{=} \frac{\frac{1}{2}\log\frac{1+r}{1-r} - \frac{1}{2}\log\frac{1+\rho}{1-\rho}}{\sqrt{\frac{1}{n-3}}} \sim N(0,1) \]
approximately as n → ∞. We can use this fact to test more general hypotheses concerning ρ, and to construct confidence intervals for ρ.


Tests of Hypotheses. To test
\[ H_0 : \rho = \rho_0 \qquad H_1 : \rho \ne \rho_0 \quad (\rho > \rho_0) \quad (\rho < \rho_0) \]
at level α, we calculate
\[ z = \frac{\frac{1}{2}\log\frac{1+r}{1-r} - \frac{1}{2}\log\frac{1+\rho_0}{1-\rho_0}}{\sqrt{\frac{1}{n-3}}}, \]
and reject H0 if
\[ |z| \ge z_{\alpha/2} \quad (z > z_\alpha) \quad (z < -z_\alpha). \]
Note that $\frac{1}{2}\log\frac{1+\rho}{1-\rho}$ is an increasing function of ρ – this is the reason for the test.

Confidence Intervals. We have
\[ -z_{\alpha/2} < z = \frac{\sqrt{n-3}}{2}\log\left[\left(\frac{1+r}{1-r}\right)\left(\frac{1-\rho}{1+\rho}\right)\right] < z_{\alpha/2} \]
with probability 1 − α. This gives
\[ e^{-\frac{2z_{\alpha/2}}{\sqrt{n-3}}}\left(\frac{1-r}{1+r}\right) < \frac{1-\rho}{1+\rho} < e^{\frac{2z_{\alpha/2}}{\sqrt{n-3}}}\left(\frac{1-r}{1+r}\right) \]
with probability 1 − α. Isolating ρ in the centre, we find that a (1 − α) × 100% confidence interval for ρ is
\[ \frac{1+r-(1-r)e^{\frac{2z_{\alpha/2}}{\sqrt{n-3}}}}{1+r+(1-r)e^{\frac{2z_{\alpha/2}}{\sqrt{n-3}}}} < \rho < \frac{1+r-(1-r)e^{-\frac{2z_{\alpha/2}}{\sqrt{n-3}}}}{1+r+(1-r)e^{-\frac{2z_{\alpha/2}}{\sqrt{n-3}}}}. \]

Example. We have the following data for a secretary concerning

x = minutes to do a task in the morning,
y = minutes to do the task in the afternoon.

The data are as follows:

x   8.2   9.6   7.0   9.4   10.9   7.1   9.0   6.6   8.4   10.5
y   8.7   9.6   6.9   8.5   11.3   7.6   9.2   6.3   8.4   12.3

(1) Test the hypotheses
\[ H_0 : \rho = 0 \qquad H_1 : \rho \ne 0 \]
at level .01.

(2) Find a 99% confidence interval for ρ.


Solution. We have n = 10, $\sum x_i = 86.7$, $\sum y_i = 88.8$, $\sum x_i^2 = 771.35$, $\sum y_i^2 = 819.34$, and $\sum x_iy_i = 792.92$. Using the computational formula for r, we get r = .936.

(1) We get
\[ z = \frac{\frac{1}{2}\log\frac{1+.936}{1-.936} - \frac{1}{2}\log\frac{1+0}{1-0}}{\sqrt{\frac{1}{7}}} = 4.5, \]
which exceeds $z_{\alpha/2} = z_{.005} = 2.58$. Hence we reject H0 at level .01. There is a correlation between X and Y.

(2) For α = .01 and the formula given above, a 99% confidence interval for ρ is
\[ .623 < \rho < .991. \]

Remark. As pointed out previously, we can also test the hypotheses
\[ H_0 : \rho = 0 \qquad H_1 : \rho \ne 0 \]
using the statistic
\[ t = \frac{\sqrt{n-2}\,r}{\sqrt{1-r^2}}, \]
which, under the hypothesis that ρ = 0, has the t-distribution with n − 2 degrees of freedom. For the data given in the example above, we get
\[ t = \frac{\sqrt{8}\times .936}{\sqrt{1-.936^2}} = 7.52. \]
Since $t_{\alpha/2,n-2} = t_{.005,8} = 3.355$, we reject H0 at level .01.
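The correlation, the two test statistics, and the large-sample interval can be reproduced as follows (a sketch; the tanh/arctanh form of the interval is equivalent to the formula given above):

    import numpy as np
    from scipy import stats

    x = np.array([8.2, 9.6, 7.0, 9.4, 10.9, 7.1, 9.0, 6.6, 8.4, 10.5])
    y = np.array([8.7, 9.6, 6.9, 8.5, 11.3, 7.6, 9.2, 6.3, 8.4, 12.3])
    n = len(x)

    r, _ = stats.pearsonr(x, y)                             # sample correlation ≈ .936
    t = np.sqrt(n - 2) * r / np.sqrt(1 - r**2)              # exact t test of rho = 0
    z = np.sqrt(n - 3) * 0.5 * np.log((1 + r) / (1 - r))    # Fisher z statistic

    zc = stats.norm.ppf(1 - 0.01 / 2)                       # z_{.005}
    lo = np.tanh(np.arctanh(r) - zc / np.sqrt(n - 3))       # 99% confidence limits
    hi = np.tanh(np.arctanh(r) + zc / np.sqrt(n - 3))
    print(round(r, 3), round(t, 2), round(z, 2), (round(lo, 3), round(hi, 3)))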

5.3 Normal Regression Analysis.†

Reference: MWS 7th ed., section 11.8

Let X and Y be random variables. Define e = Y − E(Y|X). Then E(e) = 0 and
\[ \operatorname{Var}(e) = \operatorname{Var}(Y) - 2\operatorname{Cov}[Y, E(Y|X)] + \operatorname{Var}(E(Y|X)) = \operatorname{Var}(Y) - \operatorname{Var}(E(Y|X)); \tag{5.2} \]
moreover, since $E(eX) = E[(Y - E(Y|X))X] = E[XY - XE(Y|X)] = E[XY - E(XY|X)] = 0$, then Cov(e, X) = 0 and so e and X are uncorrelated.

Proposition 5.3.1 Suppose that X and Y have a bivariate normal distribution. Then we have the representation
\[ Y = a + bX + e, \tag{5.3} \]
where e is independent of X and has the distribution N(0, σ²).

Proof. As previously shown, E(Y|X) = a + bX for certain constants a, b given by
\[ a = \mu_y - b\mu_x, \qquad b = \rho\frac{\sigma_y}{\sigma_x}. \]
Then X and e = Y − a − bX also have a bivariate normal distribution. Since they are uncorrelated, they are independent. Hence we have the representation in (5.3) where X and e are independent normal random variables. The distribution of e is N(0, σ²) where, by (5.2), we have
\[ \sigma^2 = \operatorname{Var}(Y) - \operatorname{Var}(a + bX) = \sigma_y^2 - b^2\sigma_x^2 = \sigma_y^2(1-\rho^2). \]
The maximum likelihood estimators of a and b are
\[ \hat b = \hat\rho\frac{\hat\sigma_y}{\hat\sigma_x} = \frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}, \qquad \hat a = \hat\mu_y - \hat b\hat\mu_x = \bar y - \hat b\bar x. \tag{5.4} \]


Remarks.

(1) The maximum likelihood estimators in (5.4) are the same as the least squares estimators in chapter 6 on linear models. However (in contrast to linear models, where the $x_i$'s are fixed), the distributions of $\hat a$ and $\hat b$ are either unknown or extremely complicated.

(2) Since $b = \rho\sigma_y/\sigma_x$, the test H0 : b = 0 is the same as the test H0 : ρ = 0.


Chapter 6

Linear Models

6.1 Regression.

Reference: WMS 7th ed., chapter 11

Consider the following laboratory experiment on Charles' Law. We have a closed vessel filled with a certain volume of gas. We heat the vessel and measure the corresponding pressures inside the vessel. We obtain the observations (x1, y1), . . . , (xn, yn), where yi is the pressure in the vessel corresponding to the temperature xi. We plot the points (x1, y1), . . . , (xn, yn).

[Figure: scatterplot of the points (x1, y1), . . . , (xn, yn) in the (x, y) plane.]

This is called a scatterplot. The distribution of points on the scatterplot indicates that there probably is a linear relationship between the variables x and y of the form
\[ y = \beta_0 + \beta_1 x. \tag{6.1} \]
However, the points do not exactly fall on a straight line, due to experimental error or perhaps some other factor. But if we do assume that the true relationship between x and y is of the form in (6.1), then the relationship between the xi's and the yi's is given by the model
\[ y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, 2, \ldots, n, \tag{6.2} \]
where the ei's are the “errors”. Note that ei is the vertical distance of the point (xi, yi) from the line y = β0 + β1x.


[Figure: the scatterplot with the line y = β0 + β1x drawn in; ei is the vertical distance from the point (xi, yi) to the line.]

Our object in this chapter will be to use the data (x1, y1), . . . , (xn, yn) to estimate the parameters β0 and β1.

Least Squares Estimation of β0 and β1. The least squares estimates $\hat\beta_0$ and $\hat\beta_1$ are those values of β0 and β1 for which the sum
\[ S = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]
of the squares of the vertical deviations is a minimum. Differentiating S with respect to β0 and β1, we find
\[ \frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = -2\Big(\sum_{i=1}^n y_i - n\beta_0 - \beta_1\sum_{i=1}^n x_i\Big), \]
\[ \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = -2\Big(\sum_{i=1}^n x_iy_i - \beta_0\sum_{i=1}^n x_i - \beta_1\sum_{i=1}^n x_i^2\Big). \]
Setting these derivatives equal to zero, we obtain
\[ \sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^n x_i = 0, \tag{6.3} \]
\[ \sum_{i=1}^n x_iy_i - \hat\beta_0\sum_{i=1}^n x_i - \hat\beta_1\sum_{i=1}^n x_i^2 = 0, \tag{6.4} \]
which are called the normal equations. Note that the first of the two equations can be expressed in the useful form
\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x. \]
The solutions are
\[ \hat\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_iy_i)}{n\sum x_i^2 - (\sum x_i)^2}, \qquad \hat\beta_1 = \frac{n\sum x_iy_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{S_{xy}}{S_{xx}}, \]
where
\[ S_{xy} = \sum_{i=1}^n (x_i-\bar x)(y_i-\bar y) = \sum_{i=1}^n x_iy_i - n\bar x\bar y, \]
and the least squares line is
\[ \hat y = \hat\beta_0 + \hat\beta_1 x. \]


Example. An environmentalist is concerned about mercury emissions from a battery manufacturing plant in Sorel, Quebec, into the St. Lawrence river. She measures mercury concentration at several locations downriver from the plant. Her results are

x   1.7    2.3    2.4    2.6    2.8    3.3    3.7    4.1    4.5    7.9    8.3    9.8
y   32.8   23.6   26.9   21.8   22.7   19.5   19.3   19.9   13.5   8.4    8.2    5.8

where x is the distance downriver in kilometers and y is the mercury concentration in parts per million. Find the least squares prediction line $\hat y = \hat\beta_0 + \hat\beta_1 x$. At a point 4.3 kilometers downriver, what will be the predicted mercury concentration?

Solution. The scatterplot (as well as the least squares line to be determined) is shown in the figure below.

We have n = 12 and
\[ \sum x_i = 53.4, \quad \sum y_i = 222.4, \quad \sum x_iy_i = 764.2, \quad \sum x_i^2 = 317.52, \quad \sum y_i^2 = 4849.38, \]
so
\[ \bar x = 4.45, \quad \bar y = 18.53, \quad S_{xx} = 79.89, \quad S_{yy} = 727.567, \quad S_{xy} = -225.48, \]
which gives
\[ \hat\beta_0 = 31.093, \qquad \hat\beta_1 = -2.822. \]
Thus the least squares prediction line is
\[ \hat y = 31.093 - 2.822x. \]
For x = 4.3, the predicted concentration is $\hat\beta_0 + \hat\beta_1 x = 31.093 - (2.822 \times 4.3) = 18.96$.
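A sketch that reproduces these estimates with numpy:

    import numpy as np

    x = np.array([1.7, 2.3, 2.4, 2.6, 2.8, 3.3, 3.7, 4.1, 4.5, 7.9, 8.3, 9.8])
    y = np.array([32.8, 23.6, 26.9, 21.8, 22.7, 19.5, 19.3, 19.9, 13.5, 8.4, 8.2, 5.8])

    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    Sxx = np.sum((x - x.mean())**2)
    b1 = Sxy / Sxx                   # ≈ -2.822
    b0 = y.mean() - b1 * x.mean()    # ≈ 31.093
    print(b0, b1, b0 + b1 * 4.3)     # prediction at x = 4.3 ≈ 18.96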

Remarks.

(1) Suppose we have a set of data with a scatterplot of the form shown below.


[Figure: scatterplot exhibiting a curved, non-linear trend.]

Then we might want to fit a model of the form
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + e. \tag{6.5} \]
In this case, we have
\[ S = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2)^2, \]
and as above we find $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ by deriving and solving the three normal equations.

(2) A linear model is one, such as in (6.1) or (6.5), which is linear in the coefficients. The reason for treating linear models is that they lead to the linear normal equations, which are easy to solve. Non-linear models can lead to normal equations which cannot be solved exactly.

(3) However, it is sometimes possible to convert a non-linear model into a linear one by an appropriate transformation of the data. For example, suppose that a scatterplot indicates that the model
\[ y = \beta_0\beta_1^x + e \]
might be appropriate. Then we could estimate β0 and β1 by actually fitting the model
\[ \log y = \log\beta_0 + x\log\beta_1 + e. \]

Normal Regression Analysis. So far, we have not made any assumptions about the distributions involved, and both the xi's and yi's could be values of random variables. Now, in order to obtain confidence intervals and make tests of hypotheses about β0 and β1, we must make distributional assumptions. From this point on, we shall assume that the xi's are fixed numbers and not the values of random variables, and that the xi's and yi's are related by the model
\[ y_i = \beta_0 + \beta_1 x_i + e_i, \qquad e_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n, \]
where the ei's are uncorrelated. Notice that we have
\[ y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \qquad i = 1, \ldots, n. \]


Remark. Now that there are distributions involved, we could use maximum likelihood estimation to estimate β0 and β1. The likelihood function is
\[ L(\beta_0, \beta_1) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y_i-\beta_0-\beta_1 x_i)^2/2\sigma^2}. \]
By calculating ∂log L/∂β0 and ∂log L/∂β1, we find that the maximum likelihood estimates are the same as the least squares estimates.

Sampling Distributions of $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_0 + \hat\beta_1 x$. We can write
\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i-\bar x)y_i}{S_{xx}} = \sum\Big[\frac{(x_i-\bar x)}{S_{xx}}\Big]y_i. \]
Thus $\hat\beta_1$ is a linear combination of independent normals, and so is itself normal. Moreover, we have
\[ E(\hat\beta_1) = \sum\Big[\frac{(x_i-\bar x)}{S_{xx}}\Big]E(y_i) = \frac{\sum (x_i-\bar x)(\beta_0+\beta_1 x_i)}{S_{xx}} = \frac{\beta_0\sum (x_i-\bar x) + \beta_1\sum x_i(x_i-\bar x)}{S_{xx}} = \frac{0 + \beta_1\sum (x_i-\bar x)^2}{S_{xx}} = \beta_1, \]
and
\[ \operatorname{Var}(\hat\beta_1) = \sum\Big[\frac{(x_i-\bar x)}{S_{xx}}\Big]^2\operatorname{Var}(y_i) = \frac{\sum (x_i-\bar x)^2\,\sigma^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}. \]
In summary, then, we have
\[ \hat\beta_1 \sim N\Big(\beta_1, \frac{\sigma^2}{S_{xx}}\Big). \]
Similarly, we get
\[ \hat\beta_0 \sim N\Big(\beta_0, \sigma^2\Big[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\Big]\Big). \]
Moreover, we can show that
\[ \operatorname{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2\bar x}{S_{xx}}, \]
and so the estimated line $\hat\beta_0 + \hat\beta_1 x$ has distribution
\[ \hat\beta_0 + \hat\beta_1 x \sim N\Big(\beta_0 + \beta_1 x, \sigma^2\Big[\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}\Big]\Big). \]
Finally, let $y = \beta_0 + \beta_1 x + e$ be the value resulting from a future measurement, and suppose we wish to predict y. Taking $\hat y = \hat\beta_0 + \hat\beta_1 x$ as our prediction, we see that $E(y - \hat y) = 0$ and
\[ \operatorname{Var}(y - \hat y) = \operatorname{Var}(y) + \operatorname{Var}(\hat y) - 2\operatorname{Cov}(y, \hat y) = \sigma^2 + \sigma^2\Big[\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}\Big], \]
where we used the fact that $\operatorname{Cov}(y, \hat y) = 0$ (since e is independent of $e_1, \ldots, e_n$). It follows that
\[ y - \hat y \sim N\Big(0, \sigma^2\Big[1 + \frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}\Big]\Big). \]
The problem insofar as deriving tests and confidence intervals for β0 and β1 is that σ² is not likely to be known, and therefore must be estimated. Let us define
\[ SS(\text{Res}) = \text{least squares minimum} = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \]


(SS(Res) is called SSE in WMS). Then
\[ SS(\text{Res}) = \sum_{i=1}^n\big[(y_i-\bar y) + \hat\beta_1(\bar x - x_i)\big]^2 = \sum_{i=1}^n (y_i-\bar y)^2 - 2\hat\beta_1\sum_{i=1}^n (y_i-\bar y)(x_i-\bar x) + \hat\beta_1^2\sum_{i=1}^n (x_i-\bar x)^2, \]
giving
\[ SS(\text{Res}) = S_{yy} - \hat\beta_1 S_{xy}, \]
a useful computational formula for SS(Res). $SS(\text{Res})/\sigma^2$ can be seen to have a chi-square distribution with n − 2 degrees of freedom, independent of $\hat\beta_0$ and $\hat\beta_1$, and therefore
\[ \hat\sigma^2 \stackrel{\text{def}}{=} \frac{SS(\text{Res})}{n-2} \]
(called s² in WMS) is an unbiased estimator of σ². Then
\[ t \stackrel{\text{def}}{=} \frac{(\hat\beta_1 - \beta_1)\sqrt{S_{xx}}}{\hat\sigma} = \frac{(\hat\beta_1 - \beta_1)\big/\sqrt{\sigma^2/S_{xx}}}{\sqrt{\dfrac{SS(\text{Res})}{\sigma^2}\Big/(n-2)}} \]
has a t-distribution with n − 2 degrees of freedom. Similar results hold for $\hat\beta_0$, $\hat\beta_0 + \hat\beta_1 x$, and $y - \hat y$. Hence our tests and confidence intervals are based on the facts that
\[ t \stackrel{\text{def}}{=} \frac{(\hat\beta_1 - \beta_1)\sqrt{S_{xx}}}{\hat\sigma}, \qquad t \stackrel{\text{def}}{=} \frac{\hat\beta_0 - \beta_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}}, \qquad t \stackrel{\text{def}}{=} \frac{\hat\beta_0 + \hat\beta_1 x - \beta_0 - \beta_1 x}{\hat\sigma\sqrt{\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}}}, \qquad t \stackrel{\text{def}}{=} \frac{y - \hat y}{\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}}} \]
all have t-distributions with n − 2 degrees of freedom.

Example. Let us consider the data from our previous example.

(1) Suppose we want to test the hypotheses
\[ H_0 : \beta_1 = \beta \qquad H_1 : \beta_1 \ne \beta \]
at level α. The method is: calculate
\[ t = \frac{(\hat\beta_1 - \beta)\sqrt{S_{xx}}}{\hat\sigma} \]
and reject H0 at level α if $|t| > t_{\alpha/2,n-2}$.

In the numerical example, are the data sufficient to indicate that mercury concentration depends on distance? That is, let us test the above hypotheses with β = 0. We have $\hat\beta_1 = -2.822$ and $S_{xx} = 79.89$. For SS(Res), we get $SS(\text{Res}) = 727.567 - (2.822 \times 225.48) = 91.18$, so $\hat\sigma^2 = 91.18/10 = 9.118$. Finally, we find $t = -2.822\sqrt{79.89}/\sqrt{9.118} = -8.35$. Since $t_{.025,10} = 2.228$, we reject H0 at level .05.
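A sketch checking this t statistic, continuing with the mercury data used above:

    import numpy as np
    from scipy import stats

    x = np.array([1.7, 2.3, 2.4, 2.6, 2.8, 3.3, 3.7, 4.1, 4.5, 7.9, 8.3, 9.8])
    y = np.array([32.8, 23.6, 26.9, 21.8, 22.7, 19.5, 19.3, 19.9, 13.5, 8.4, 8.2, 5.8])
    n = len(x)

    Sxx = np.sum((x - x.mean())**2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    Syy = np.sum((y - y.mean())**2)
    b1 = Sxy / Sxx
    ss_res = Syy - b1 * Sxy
    sigma2_hat = ss_res / (n - 2)

    t = b1 * np.sqrt(Sxx) / np.sqrt(sigma2_hat)      # test of H0: beta1 = 0
    print(t, stats.t.ppf(0.975, n - 2))              # ≈ -8.35 vs 2.228: reject H0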

(2) Suppose we want a 95% confidence interval for β0. The method is as follows: we know that
\[ -t_{\alpha/2,n-2} < \frac{\hat\beta_0 - \beta_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}} < t_{\alpha/2,n-2} \]
with probability 1 − α. Unravelling and isolating β0 in the usual way, we find that
\[ \hat\beta_0 - t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}} < \beta_0 < \hat\beta_0 + t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}} \]


is a (1 − α) × 100% confidence interval for β0. In our case, we get 31.093 ± 3.872, so that
\[ 27.22 < \beta_0 < 34.96 \]
is a 95% confidence interval for β0.

(3) A (1 − α) × 100% confidence interval for β1 is
\[ \hat\beta_1 - t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{S_{xx}}} < \beta_1 < \hat\beta_1 + t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{S_{xx}}}. \]

(4) A (1 − α) × 100% confidence interval for the true line E(y) = β0 + β1x is easily found to be
\[ \hat\beta_0 + \hat\beta_1 x - t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}} < \beta_0 + \beta_1 x < \hat\beta_0 + \hat\beta_1 x + t_{\alpha/2,n-2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}}. \]

(5) A (1 − α) × 100% confidence interval for a future value y = β0 + β1x + e is easily found to be
\[ \hat y - t_{\alpha/2,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}} < y < \hat y + t_{\alpha/2,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x-\bar x)^2}{S_{xx}}}, \]
where $\hat y = \hat\beta_0 + \hat\beta_1 x$.


6.2 Experimental Design.

Reference: WMS 7th ed., chapters 12, 13

In this final section, we study experimental designs involving a single factor, also called one-way analysis of variance.

6.2.1 The Completely Randomized Design.

Suppose we want to test the effects of k different fertilizers (called treatments) on a certain type of wheat. To do this, we plant a total of N test plots with this wheat, of which n1 are fertilized with fertilizer 1, n2 with fertilizer 2, and so on (so that N = n1 + n2 + · · · + nk). We then measure the resulting yields for each plot. For example, if k = 3, a typical set of observations would be the following:


Fertilizer 1   Fertilizer 2   Fertilizer 3
     50             60             40
     60             60             50
     60             65             50
     65             70             60
     70             75             60
     80             80             60
     75             70             65
     80                            65
     85
     75

Let $x_{ij}$ = the yield from the jth plot receiving fertilizer i (the jth observation on treatment i), j = 1, 2, . . . , ni; i = 1, . . . , k. Then we may write $x_{ij} = \mu_i + e_{ij}$, where $\mu_i$ represents the effect of the ith treatment, and $e_{ij}$ is the random error for the jth plot receiving fertilizer i. $e_{ij}$ is the sum total of effects due to uncontrolled factors, such as precipitation, soil fertility, and so on. Let us put
\[ \mu = \frac{1}{N}\sum_{i=1}^k n_i\mu_i, \qquad \alpha_i = \mu_i - \mu, \quad i = 1, 2, \ldots, k. \]
Then our model becomes
\[ x_{ij} = \mu + \alpha_i + e_{ij}, \qquad j = 1, \ldots, n_i, \quad i = 1, \ldots, k, \]
where

µ = the general effect,
αi = the deviation from the general effect for the ith treatment,

and we notice that $\sum_{i=1}^k n_i\alpha_i = 0$ (called a side condition). This is called the one-way classification model.

Estimation of Parameters. We shall begin by estimating the parameters µ, αi by the method of least squares. We will take $\hat\mu$ and $\hat\alpha_i$ to be the values of µ and αi for which
\[ SS = \sum_{i=1}^k\sum_{j=1}^{n_i} e_{ij}^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij} - \mu - \alpha_i)^2 \]
is a minimum. Differentiation gives
\[ \frac{\partial SS}{\partial\mu} = -2\sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij} - \mu - \alpha_i) = -2\Big(x_{\cdot\cdot} - N\mu - \sum_{i=1}^k n_i\alpha_i\Big), \]
\[ \frac{\partial SS}{\partial\alpha_i} = -2\sum_{j=1}^{n_i} (x_{ij} - \mu - \alpha_i) = -2(x_{i\cdot} - n_i\mu - n_i\alpha_i), \]
where
\[ x_{i\cdot} = \sum_{j=1}^{n_i} x_{ij} \qquad \text{and} \qquad x_{\cdot\cdot} = \sum_{i=1}^k x_{i\cdot} = \sum_{i=1}^k\sum_{j=1}^{n_i} x_{ij}. \]
We get the normal equations
\[ x_{i\cdot} - n_i\hat\mu - n_i\hat\alpha_i = 0, \quad i = 1, \ldots, k, \qquad x_{\cdot\cdot} - N\hat\mu = 0, \]
from which
\[ \hat\mu = \frac{x_{\cdot\cdot}}{N}, \qquad \hat\alpha_i = \frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}. \]
The minimum sum of squares is
\[ S = \sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij} - \hat\mu - \hat\alpha_i)^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{i\cdot}}{n_i}\Big)^2. \]

Derivation of the Treatment Sum of Squares. We shall want to test the hypotheses
\[ H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_k\,(= 0), \ \text{equivalently } \mu_1 = \cdots = \mu_k, \qquad H_1 : \text{not } H_0. \]
If H0 is true, then the observations come from the model $x_{ij} = \mu + e_{ij}$, j = 1, . . . , ni; i = 1, . . . , k. The least squares estimate of µ is $\hat\mu = x_{\cdot\cdot}/N$, and comes from minimizing $SS = \sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij} - \mu)^2$. The minimum sum of squares in this case is
\[ S_0 = \sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij} - \hat\mu)^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{\cdot\cdot}}{N}\Big)^2. \]
Now S0 is the variability in the observations not explained by the parameter µ, while S is the variability in the observations not explained by µ, α1, . . . , αk. Hence S0 − S is the variability in the observations explained by the treatment effects α1, . . . , αk. Note that

\begin{align*}
S_0 - S &= \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{\cdot\cdot}}{N}\Big)^2 - \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{i\cdot}}{n_i}\Big)^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(\frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}\Big)\Big(x_{ij} - \frac{x_{\cdot\cdot}}{N} + x_{ij} - \frac{x_{i\cdot}}{n_i}\Big) \\
&= \sum_{i=1}^k \Big(\frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}\Big)\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{\cdot\cdot}}{N} + x_{ij} - \frac{x_{i\cdot}}{n_i}\Big) = \sum_{i=1}^k \Big(\frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}\Big)\Big(x_{i\cdot} - n_i\frac{x_{\cdot\cdot}}{N} + x_{i\cdot} - x_{i\cdot}\Big) \\
&= \sum_{i=1}^k n_i\Big(\frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}\Big)^2.
\end{align*}

It is customary to write

S0 = SS(T) = total sum of squares = total variability in the observations,
S0 − S = SS(Tr) = treatment sum of squares = variability due to treatments,
S = SS(E) = error sum of squares = variability unexplained by µ, α1, . . . , αk.

Note from the expressions for SS(Tr) and SS(E) that SS(Tr) is the “between treatment” variability, while SS(E) is the “within treatment” variability.

Summary. SS(T) = SS(Tr) + SS(E), where
\[ SS(T) = \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{\cdot\cdot}}{N}\Big)^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} x_{ij}^2 - \frac{x_{\cdot\cdot}^2}{N}, \]
\[ SS(Tr) = \sum_{i=1}^k n_i\Big(\frac{x_{i\cdot}}{n_i} - \frac{x_{\cdot\cdot}}{N}\Big)^2 = \sum_{i=1}^k \frac{x_{i\cdot}^2}{n_i} - \frac{x_{\cdot\cdot}^2}{N}, \]
\[ SS(E) = \sum_{i=1}^k\sum_{j=1}^{n_i} \Big(x_{ij} - \frac{x_{i\cdot}}{n_i}\Big)^2 = \sum_{i=1}^k\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^k \frac{x_{i\cdot}^2}{n_i}. \]


If the null hypothesis H0 : α1 = · · · = αk = 0 is true, then we expect SS(Tr) to be a small part of SS(T), and SS(E) a relatively large part. Hence, the ratio SS(Tr)/SS(E) should be small if H0 is true and large if H0 is false.

Now assume that $e_{ij} \sim N(0, \sigma^2)$ for all i, j, and are independent. Then if H0 is true, $x_{ij} \sim N(\mu, \sigma^2)$ and are independent; from the formulas for SS(T), SS(Tr), and SS(E), it would appear they have chi-square distributions. In fact, we have
\[ \frac{SS(T)}{\sigma^2} \sim \chi^2 \text{ with } N-1 \text{ d.f.}, \qquad \frac{SS(Tr)}{\sigma^2} \sim \chi^2 \text{ with } k-1 \text{ d.f.}, \qquad \frac{SS(E)}{\sigma^2} \sim \chi^2 \text{ with } N-k \text{ d.f.}, \]
and SS(Tr) and SS(E) are independent. Hence, under H0,
\[ F = \frac{\frac{SS(Tr)}{\sigma^2}\big/(k-1)}{\frac{SS(E)}{\sigma^2}\big/(N-k)} = \frac{SS(Tr)/(k-1)}{SS(E)/(N-k)} \]
has the F-distribution with k − 1, N − k degrees of freedom.

Summary. To test
\[ H_0 : \alpha_1 = \alpha_2 = \cdots = \alpha_k\,(= 0) \ \text{(i.e. no treatment effects)} \qquad H_1 : \text{not } H_0, \]
calculate
\[ F = \frac{SS(Tr)/(k-1)}{SS(E)/(N-k)} \]
and reject H0 at level α if $F \ge F_{\alpha,k-1,N-k}$.

ANOVA Table. (ANOVA = “Analysis of Variance”.) The data and computations for a given problem are usually summarized in the form of an ANOVA table as follows.

Source of     Degrees of   Sum of    Mean Square              F
Variation     Freedom      Squares
Treatments    k − 1        SS(Tr)    MS(Tr) = SS(Tr)/(k−1)    F_Tr = MS(Tr)/MS(E)
Errors        N − k        SS(E)     MS(E) = SS(E)/(N−k)
Total         N − 1        SS(T)

Example 1. For the wheat and fertilizer example given at the beginning of this chapter, we obtain
\[ x_{1\cdot} = 700, \quad x_{2\cdot} = 480, \quad x_{3\cdot} = 450, \quad x_{\cdot\cdot} = 1630, \qquad n_1 = 10, \quad n_2 = 7, \quad n_3 = 8, \quad N = 25, \]
\[ \sum_{i=1}^k\sum_{j=1}^{n_i} x_{ij}^2 = 109{,}200, \qquad \sum_{i=1}^k \frac{x_{i\cdot}^2}{n_i} = 107{,}226.79. \]
Hence we get

Source of     Degrees of   Sum of    Mean Square   F
Variation     Freedom      Squares
Treatments    2            950.79    475.4         5.30
Errors        22           1973.21   89.69
Total         24           2924

Since F.05,2,22 = 3.44, we reject H0 at level .05.
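These ANOVA computations can be confirmed with scipy (a sketch):

    from scipy.stats import f_oneway

    fert1 = [50, 60, 60, 65, 70, 80, 75, 80, 85, 75]
    fert2 = [60, 60, 65, 70, 75, 80, 70]
    fert3 = [40, 50, 50, 60, 60, 60, 65, 65]

    F, p = f_oneway(fert1, fert2, fert3)
    print(F, p)     # F ≈ 5.30; p < .05, so reject H0 at level .05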


Randomization. The concept of randomization, which is part of the title of this section, has entered in an important but as yet unspecified way into the theory of this section.

Definition. A randomized design is one in which the plots (test units) are assigned randomly to the treatments. Complete randomization refers to assigning all the test units randomly to the treatments (as opposed to randomization within blocks, for example, in the next section).

The purpose of randomization is as follows. In most experiments, especially undesigned ones, there will be one or more extraneous factors, whose effects in this section have been considered a part of the eij's. Randomization ensures that each treatment has an equal chance of being favoured or handicapped by an extraneous factor.

To make this clear, let us consider an example. Suppose we have three fertilizers to be applied to three plots each, as in the drawing below.

[Figure: nine plots arranged in three rows. Row 1 (low swampy ground): plots labelled 1, 2, 3. Row 2 (flat fertile ground): plots labelled 1, 2, 3. Row 3 (high rocky ground): plots labelled 1, 2, 3. The numbers indicate the fertilizer applied to each plot.]

Notice that we have introduced an extraneous factor, type of terrain, which may have been unnoticed by the experimenter. Suppose that fertilizer 1 was assigned to the three left plots, fertilizer 2 to the middle three plots, and fertilizer 3 to the right three plots. How could we be sure that any differences detected between fertilizers are not in fact due to the difference in type of terrain? A bias has been introduced into the experiment, and the factor “type of fertilizer” has been confounded by the factor “type of terrain”. To protect ourselves from this bias, we use the device of randomization: we choose n1 = 3 plots at random from the N = 9 plots and apply fertilizer 1 to these three; then n2 = 3 plots are chosen from the remaining six, and fertilizer 2 is applied; finally, fertilizer 3 is applied to the remaining three plots.

The following is another example, which we will return to in the next section.

Example 2. A certain person can drive to work along four different routes, and the following are the numbers of minutes in which he timed himself on five different occasions for each route.

              Route 1   Route 2   Route 3   Route 4
Monday          22        25        26        26
Tuesday         26        27        29        28
Wednesday       25        28        33        27
Thursday        25        26        30        30
Friday          31        29        33        30

The ANOVA table is


Source of     Degrees of   Sum of    Mean Square   F
Variation     Freedom      Squares
Treatments    3            52.8      17.6          2.8
Errors        16           100.4     6.28
Total         19           153.2

Since F.05,3,16 = 3.24, we do not reject H0 at level .05.

6.2.2 Randomized Block Designs

Suppose an extraneous factor is present, such as type of terrain in the fertilizer example above. Then unless the plots are assigned as in figure 7.2.1 above (and we have just seen why such an assignment should be avoided by randomization), the within treatment sum of squares SS(E) will contain variation due to the type of terrain. Since the test statistic F has SS(E) in its denominator, this may result in F being so small as to make the test inconclusive, even though there is a difference between fertilizers. To eliminate the effect of this extraneous factor, we adopt the randomized block design.

Definition. A randomized block design is a design in which the nk test units are partitioned into n blocks depending on the extraneous factor, and the k treatments are assigned so that every treatment is represented once in every block. Within blocks, the treatments should be assigned randomly.

For our fertilizer-terrain example, there are three blocks. The first block consists of the three plots in the low swampy terrain, the second of the three plots on flat fertile ground, and the third block of the three plots on high rocky ground. A possible assignment of plots to treatments is shown in the figure below.

[Figure: the same nine plots, with fertilizers assigned within blocks: low swampy ground – 1, 2, 3; flat fertile ground – 3, 1, 2; high rocky ground – 2, 3, 1.]

Let us consider another example. In example 2 of the previous section, we were unable to reject the null hypothesis because SS(E) formed a large part of SS(T) relative to SS(Tr). Looking at the data, though, it appears that route 1 is certainly better than route 3, since the sample means are $\bar x_1 = 25.8$ and $\bar x_3 = 30.2$.

Again recall that SS(E) is the within treatment variability. Is it possible that SS(E) is being inflated by a second factor? We go back to the original person who took the observations and find that yes, the times were measured on different days of the week, as shown in the table. We must do a randomized block design with the weekdays as blocks.

Analysis. Our model becomes
\[ x_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \qquad i = 1, \ldots, k; \quad j = 1, \ldots, n, \]


where

µ = the grand mean,
αi = the ith treatment effect, with $\sum_{i=1}^k \alpha_i = 0$,
βj = the jth block effect, with $\sum_{j=1}^n \beta_j = 0$.

As before, we estimate µ, αi, and βj by the method of least squares. We have
\[ SS = \sum_{i=1}^k\sum_{j=1}^n (x_{ij} - \mu - \alpha_i - \beta_j)^2, \]
\[ \frac{\partial SS}{\partial\mu} = -2\sum_{i=1}^k\sum_{j=1}^n (x_{ij} - \mu - \alpha_i - \beta_j) = -2(x_{\cdot\cdot} - nk\mu), \]
\[ \frac{\partial SS}{\partial\alpha_i} = -2\sum_{j=1}^n (x_{ij} - \mu - \alpha_i - \beta_j) = -2(x_{i\cdot} - n\mu - n\alpha_i), \]
\[ \frac{\partial SS}{\partial\beta_j} = -2\sum_{i=1}^k (x_{ij} - \mu - \alpha_i - \beta_j) = -2(x_{\cdot j} - k\mu - k\beta_j). \]
We obtain the equations
\[ x_{\cdot\cdot} - nk\hat\mu = 0, \qquad x_{i\cdot} - n\hat\mu - n\hat\alpha_i = 0, \qquad x_{\cdot j} - k\hat\mu - k\hat\beta_j = 0, \]
which give
\[ \hat\mu = \frac{x_{\cdot\cdot}}{nk}, \qquad \hat\alpha_i = \frac{x_{i\cdot}}{n} - \frac{x_{\cdot\cdot}}{nk}, \qquad \hat\beta_j = \frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{nk}. \]

The minimum sum of squares is
\[ SS_{\min} = \sum_{i=1}^k\sum_{j=1}^n (x_{ij} - \hat\mu - \hat\alpha_i - \hat\beta_j)^2 = \sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{i\cdot}}{n} - \frac{x_{\cdot j}}{k} + \frac{x_{\cdot\cdot}}{kn}\Big)^2 \]
and represents that part of the total variation SS(T) not explained by treatment effects or block effects. Hence the variability explained by block effects must be
\begin{align*}
\underbrace{\sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{i\cdot}}{n}\Big)^2}_{\text{old } SS(E)} - \underbrace{\sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{i\cdot}}{n} - \frac{x_{\cdot j}}{k} + \frac{x_{\cdot\cdot}}{kn}\Big)^2}_{\text{new } SS(E)}
&= \sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{i\cdot}}{n}\Big)^2 - \sum_{i=1}^k\sum_{j=1}^n \Big[\Big(x_{ij} - \frac{x_{i\cdot}}{n}\Big) - \Big(\frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{kn}\Big)\Big]^2 \\
&= 2\sum_{j=1}^n \Big(\frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{kn}\Big)\sum_{i=1}^k \Big(x_{ij} - \frac{x_{i\cdot}}{n}\Big) - \sum_{i=1}^k\sum_{j=1}^n \Big(\frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{kn}\Big)^2 \\
&= k\sum_{j=1}^n \Big(\frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{kn}\Big)^2.
\end{align*}


Hence we write
\[ SS(T) = \sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{\cdot\cdot}}{kn}\Big)^2, \qquad SS(Tr) = n\sum_{i=1}^k \Big(\frac{x_{i\cdot}}{n} - \frac{x_{\cdot\cdot}}{kn}\Big)^2, \]
\[ SS(Bl) = k\sum_{j=1}^n \Big(\frac{x_{\cdot j}}{k} - \frac{x_{\cdot\cdot}}{kn}\Big)^2, \qquad SS(E) = \sum_{i=1}^k\sum_{j=1}^n \Big(x_{ij} - \frac{x_{i\cdot}}{n} - \frac{x_{\cdot j}}{k} + \frac{x_{\cdot\cdot}}{kn}\Big)^2, \]
and we have SS(T) = SS(Tr) + SS(Bl) + SS(E). Now assume that $e_{ij} \sim N(0, \sigma^2)$ for all i, j.

Tests concerning treatment effects. Define
\[ F_{Tr} = \frac{SS(Tr)/(k-1)}{SS(E)/[(n-1)(k-1)]}. \]
Reject H0 : α1 = · · · = αk = 0 and accept H1 : αi ≠ 0 for some i if $F_{Tr} \ge F_{\alpha,k-1,(n-1)(k-1)}$.

Tests concerning block effects. Define
\[ F_{Bl} = \frac{SS(Bl)/(n-1)}{SS(E)/[(n-1)(k-1)]}. \]
Reject H0 : β1 = · · · = βn = 0 and accept H1 : βj ≠ 0 for some j if $F_{Bl} \ge F_{\alpha,n-1,(n-1)(k-1)}$.

ANOVA Table.

Source of     Degrees of       Sum of    Mean Square                   F
Variation     Freedom          Squares
Treatments    k − 1            SS(Tr)    MS(Tr) = SS(Tr)/(k−1)         F_Tr = MS(Tr)/MS(E)
Blocks        n − 1            SS(Bl)    MS(Bl) = SS(Bl)/(n−1)         F_Bl = MS(Bl)/MS(E)
Errors        (n − 1)(k − 1)   SS(E)     MS(E) = SS(E)/[(n−1)(k−1)]
Total         nk − 1           SS(T)

Example 3. Let us redo example 2 concerning the four routes. We get

Source of     Degrees of   Sum of    Mean Square   F
Variation     Freedom      Squares
Treatments    3            52.8      17.6          7.75
Blocks        4            73.2      18.3          8.06
Errors        12           27.2      2.27
Total         19           153.2

Since F.05,3,12 = 3.49 and F.05,4,12 = 3.26, we reject both hypotheses.


Chapter 7

Chi-Square Tests

7.1 Tests Concerning k Independent Binomial Populations.

Suppose we have observations X1, . . . , Xk from k independent binomial distributions with parameters (n1, p1), . . . , (nk, pk).

Case 1.

Suppose we want to test

H0 : p1 = p1,0, . . . , pk = pk,0

H1 : not H0

Background: Each Xi is a binomial random variable, so that if ni is large enough, then
\[ \frac{X_i - n_ip_i}{\sqrt{n_ip_i(1-p_i)}} \]
is approximately N(0, 1), and then
\[ \frac{(X_i - n_ip_i)^2}{n_ip_i(1-p_i)} \]
will have approximately a χ²-distribution with 1 degree of freedom. By independence, if all ni are large enough, then
\[ \sum_{i=1}^k \frac{(X_i - n_ip_i)^2}{n_ip_i(1-p_i)} \]
will have approximately a χ²-distribution with k degrees of freedom. By large enough, we mean that $n_ip_i \ge 5$ and $n_i(1-p_i) \ge 5$ for all i = 1, . . . , k.

Test of Hypotheses: Let
\[ \chi^2 \stackrel{\text{def}}{=} \sum_{i=1}^k \frac{(X_i - n_ip_{i,0})^2}{n_ip_{i,0}(1-p_{i,0})}. \]
Assume that $n_ip_{i,0} \ge 5$ and $n_i(1-p_{i,0}) \ge 5$ for all i = 1, . . . , k. If H0 is true, we expect χ² to be “small”. Hence the α-level test is: reject H0 if $\chi^2 \ge \chi^2_{\alpha,k}$.


Example. Suppose we wish to test
\[ H_0 : p_1 = p_2 = p_3 = .3 \qquad H_1 : \text{not } H_0 \]
at level .05. The observations are
\[ x_1 = 155,\ n_1 = 250; \qquad x_2 = 118,\ n_2 = 200; \qquad x_3 = 87,\ n_3 = 150. \]
We have
\[ \chi^2 = \frac{(155 - 250\times .3)^2}{250\times .3\times .7} + \frac{(118 - 200\times .3)^2}{200\times .3\times .7} + \frac{(87 - 150\times .3)^2}{150\times .3\times .7} = 258. \]
Since $\chi^2_{.05,3} = 7.815$, we reject H0 at level .05.

Case 2.

However, perhaps it was the value .3 that was incorrect. Suppose we want to test
\[ H_0 : p_1 = \cdots = p_k \qquad H_1 : \text{not } H_0. \]
This is the more usual case. Since no common value of the pi's is given under H0, we estimate it by the pooled estimate
\[ \hat p = \frac{x_1 + \cdots + x_k}{n_1 + \cdots + n_k}. \]
Hence setting
\[ \chi^2 \stackrel{\text{def}}{=} \sum_{i=1}^k \frac{(X_i - n_i\hat p)^2}{n_i\hat p(1-\hat p)}, \]
we reject H0 at level α if $\chi^2 \ge \chi^2_{\alpha,k-1}$. (The loss of a degree of freedom is because the common value is estimated by $\hat p$.)

Example. Carry out the test
\[ H_0 : p_1 = p_2 = p_3 \qquad H_1 : \text{not } H_0 \]
at level .05, using the same observations as in the previous example.

Solution. We have
\[ \hat p = \frac{155 + 118 + 87}{250 + 200 + 150} = .6, \]
and so
\[ \chi^2 = \frac{(155 - 250\times .6)^2}{250\times .6\times .4} + \frac{(118 - 200\times .6)^2}{200\times .6\times .4} + \frac{(87 - 150\times .6)^2}{150\times .6\times .4} = .75. \]
Since $\chi^2_{.05,2} = 5.991$, we do not reject H0 at level .05.
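A sketch of both versions of the test:

    import numpy as np
    from scipy.stats import chi2

    x = np.array([155, 118, 87])
    n = np.array([250, 200, 150])

    # Case 1: H0 specifies p_i = .3 for every population
    p0 = 0.3
    chisq1 = np.sum((x - n * p0)**2 / (n * p0 * (1 - p0)))
    print(chisq1, chi2.ppf(0.95, df=len(x)))        # 258 vs 7.815: reject

    # Case 2: H0 only says the p_i are equal; pool the estimate
    p_hat = x.sum() / n.sum()
    chisq2 = np.sum((x - n * p_hat)**2 / (n * p_hat * (1 - p_hat)))
    print(chisq2, chi2.ppf(0.95, df=len(x) - 1))    # 0.75 vs 5.991: do not reject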


7.2 Chi-Square Test for the Parameters of a Multinomial Distribution.

Reference: WMS 7th ed., chapter 14

Suppose that the random vector (X1, . . . , Xk) has the multinomial distribution
\[ P\{X_1 = x_1, \ldots, X_k = x_k\} = \begin{cases} \frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k} & \text{if } x_1 + \cdots + x_k = n, \\ 0 & \text{otherwise}, \end{cases} \]
with parameters n ≥ 1 and p1, . . . , pk. Then
\[ \chi^2 = \sum_{i=1}^k \frac{(X_i - np_i)^2}{np_i} \tag{7.1} \]
is called Pearson's chi-square statistic. It can be shown (see the appendix to this chapter) that, as n → ∞, the distribution of χ² tends to the chi-square distribution with k − 1 degrees of freedom. Thus, for large n, χ² has approximately a chi-square distribution with k − 1 degrees of freedom. The usual convention is that if $np_i \ge 5$ for all i = 1, . . . , k, the approximation is considered “good”.

Remark. The reason that χ² has a chi-square distribution is as follows: under the conditions $np_i \ge 5$ for i = 1, . . . , k, the random variables X1, . . . , Xk have (by the CLT) approximately a multivariate normal distribution. Since npi is the mean of Xi, then χ² is the sum of squares of normalized (almost) normal random variables, and should therefore have a chi-square distribution. We lose a degree of freedom because p1 + · · · + pk = 1.

Chi-Square Tests. Suppose we want to test
\[ H_0 : p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0} \qquad H_1 : \text{not } H_0 \tag{7.2} \]
where $p_{1,0}, \ldots, p_{k,0}$ are given values and $p_{1,0} + \cdots + p_{k,0} = 1$.

Method: Assume that the conditions $np_{i,0} \ge 5$ are satisfied for each i = 1, . . . , k. We calculate the value
\[ \chi^2 = \sum_{i=1}^k \frac{(X_i - np_{i,0})^2}{np_{i,0}}. \]
If H0 is true, then Xi should be “close” to $np_{i,0}$ for each i, and χ² should be “small”. Hence the test is: reject H0 at level α if $\chi^2 \ge \chi^2_{\alpha,k-1}$.

Example. A group of rats, one by one, proceed down a ramp to one of five doors, with the following results:

Door                               1    2    3    4    5
Number of rats which choose door   23   36   31   30   30

Are the data sufficient to indicate that the rats show a preference for certain doors? That is, test the hypotheses
\[ H_0 : p_1 = p_2 = p_3 = p_4 = p_5 = \tfrac{1}{5} \qquad H_1 : \text{not } H_0. \]
Use α = .01.


Solution. We have n = 150, so $np_{i,0} = 30 \ge 5$ for all i = 1, . . . , 5. Since
\[ \chi^2 = \frac{(23-30)^2}{30} + \frac{(36-30)^2}{30} + \frac{(31-30)^2}{30} + \frac{(30-30)^2}{30} + \frac{(30-30)^2}{30} = 2.87, \]
and since $\chi^2_{.01,4} = 13.277$, we do not reject H0 at level .01. No, the data are not sufficient.
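scipy reproduces this directly (a sketch):

    from scipy.stats import chisquare, chi2

    observed = [23, 36, 31, 30, 30]
    chisq, p = chisquare(observed)          # equal expected frequencies by default
    print(chisq, p, chi2.ppf(0.99, df=4))   # 2.87, p ≈ 0.58, critical value 13.277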

Other Applications. The chi-square statistic can be used to test hypotheses for populations other than the multinomial. Suppose we have a population, each of whose members can be of one of k categories S1, S2, . . . , Sk. Let pi be the (true) proportion of members of the population which are of category Si, i = 1, 2, . . . , k. Note that p1 + · · · + pk = 1. Suppose we select a random sample of size n from this population. If an observation is of category i, we say that it falls into cell i. Let

ni = the number of observations in the sample that fall into cell i, i = 1, . . . , k.

(Note: we are using ni rather than Xi.) The numbers n1, . . . , nk are called the observed cell frequencies. Suppose we wish to test the hypotheses in (7.2). Let us define
\[ e_i = np_{i,0}, \qquad i = 1, \ldots, k. \]
Note that for each i, ei is the number of observations in the sample we would expect if H0 were true. The numbers e1, . . . , ek are therefore called the expected cell frequencies, and $\sum_{i=1}^k e_i = n$. With this notation, Pearson's chi-square statistic becomes
\[ \chi^2 = \sum_{i=1}^k \frac{(n_i - e_i)^2}{e_i}. \]
If the sample size n is large enough that all expected cell frequencies ei are at least 5, then χ² has approximately the chi-square distribution with k − 1 degrees of freedom. Also, we expect χ² to be small if H0 is true. Hence the test is: reject H0 at level α if $\chi^2 \ge \chi^2_{\alpha,k-1}$.

It is traditional and useful to arrange the observed and expected cell frequencies into a table. In the above example, we got the following table of observed and expected cell frequencies.

Cell                       1    2    3    4    5
Observed Cell Frequency    23   36   31   30   30
Expected Cell Frequency    30   30   30   30   30

Remark. If in the process of using the multinomial test, there are t independent parameters which must be estimated from the sample data, the number of degrees of freedom of χ² drops to k − t − 1. Hence our test would become: calculate χ² and reject H0 if $\chi^2 \ge \chi^2_{\alpha,k-t-1}$. This will be useful to remember when we come to goodness of fit tests and contingency tables in the next two sections.

7.3 Goodness of Fit Tests.

We have a random sample X1, . . . , Xn from an unknown distribution F, and we want to test
\[ H_0 : F = F' \tag{7.3} \]
\[ H_1 : F \ne F' \tag{7.4} \]
where F′ is a given distribution function.


Method. Let X denote a random variable with distribution F. Partition the range set of X into s subsets called cells or categories, of which the ith will be denoted by Ci. We shall say that an observation x falls in the ith cell if x ∈ Ci. Let P denote probabilities calculated under F, and P′ probabilities calculated under F′. Define
\[ \theta_i = P(X \in C_i), \qquad \theta_i' = P'(X \in C_i), \qquad i = 1, \ldots, s. \]
Then carry out the multinomial test
\[ H_0 : \theta_i = \theta_i' \text{ for all } i = 1, \ldots, s \tag{7.5} \]
\[ H_1 : \text{not } H_0. \tag{7.6} \]
Since the truth of H0 : F = F′ implies the truth of H0 : θi = θi′, i = 1, . . . , s, rejecting H0 : θi = θi′, i = 1, . . . , s causes us to reject H0 : F = F′.

Example 1. Let X = the number of cars sold per day by a certain car dealer, measured over a period of100 days. We want to test

H0 : X is Poisson with parameter λ = 3.5
H1 : not H0

at level α = .05. The observations are as follows (the rightmost column of ei's will be used in Example 2).

Cell   Number of cars   ni    ei (λ = 3.5)   ei (λ = 4.42)
0      0                  1        3.02           1.20
1      1                  4       10.57           5.32
2      2                 11       18.50          11.76
3      3                 16       21.58          17.32
4      4                 26       18.88          19.14
5      5                 21       13.22          16.92
6      6                 12        7.71          12.46
7      7                  5        3.85           7.87
8      8, 9, . . .        4        2.67           8.00

n = ∑ ni = 100

The expected cell frequencies are computed from

ei = n · Pr{X = i} = 100 × (3.5^i e^{−3.5})/i!,   i = 0, 1, . . . , 7,

e8 = n · Pr{X ≥ 8} = 100 − ∑_{i=0}^{7} ei.

Note that cell 8 is the interval x ≥ 8. But the observations imply that four observations were precisely 8 (needed to calculate λ below). Also note that e7 and e8 are less than 5. Hence we must combine cells 7 and 8. The result is that a new cell 7 is created with n7 = 9 and e7 = 6.52. Similarly, we combine cells 0 and 1 into a new cell 1 with n1 = 5 and e1 = 13.59. For χ², we obtain

χ² = ∑_{i=1}^{7} (ni − ei)²/ei = 20.51,      χ²_{.05,6} = 12.5916,

so we reject H0 at level .05.
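A Python sketch of this calculation (again assuming scipy is available; the exact statistic comes out to about 20.50, the small difference from 20.51 being due to the rounding of the ei's in the table):

from scipy.stats import poisson, chi2

counts = [1, 4, 11, 16, 26, 21, 12, 5, 4]      # observed n_i for cells 0, ..., 8
probs = [poisson.pmf(i, 3.5) for i in range(8)]
probs.append(1 - sum(probs))                   # cell 8 is {8, 9, ...}
e = [100 * p for p in probs]

# Combine cells 0-1 and cells 7-8 so that every expected frequency is >= 5.
n_obs = [counts[0] + counts[1]] + counts[2:7] + [counts[7] + counts[8]]
e_exp = [e[0] + e[1]] + e[2:7] + [e[7] + e[8]]

stat = sum((n - ei) ** 2 / ei for n, ei in zip(n_obs, e_exp))
print(round(stat, 2), round(chi2.ppf(0.95, df=len(n_obs) - 1), 4))
# prints about 20.5 and 12.5916, so H0 is rejected at level .05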

Example 2. In the above example, the ni's look as though they could come from a Poisson distribution, but with a mean larger than 3.5. Perhaps it was the value λ = 3.5 which was incorrect and inflated the value of χ². Let us use the same data and this time test

H0 : X is Poisson
H1 : not H0.


In order to calculate the ei’s, we must supply an estimate of λ. For our estimator, we take the sample mean

λ̂ = x̄ = [(1 × 0) + (4 × 1) + · · · + (5 × 7) + (4 × 8)]/100 = 4.42,

computed from the original data. The expected cell frequencies are computed as before, but with 3.5 replaced by 4.42, and are given in the rightmost column in the above table. Once again, we have to combine cells 0 and 1 into a new combined cell 1 with n1 = 5 and e1 = 6.52. We find

χ² = ∑_{i=1}^{8} (ni − ei)²/ei = 6.99,      χ²_{.05,6} = 12.5916,

(note: s − t − 1 = 8 − 1 − 1 = 6) and so we do not reject H0.

Can we accept H0? There are really two parts to this question. First, we have really tested the hypotheses in (7.5) and (7.6) and we have no idea of the power of the test. Secondly, even if we could accept the null hypothesis in (7.5), that does not mean we can accept the null hypothesis in (7.3). Hence the answer is a double NO! It is customary to conclude in a weaker way, by saying

(1) A Poisson distribution with λ = 4.42 provides a good fit to the data (especially if χ² is small compared to χ²_{s−t−1}), or

(2) the data are not inconsistent with the assumption of a Poisson population with λ = 4.42.

Example 3. We have the following data concerning times X to burnout for a certain brand of battery.

Cell   No. of Hours   ni    ei
1      0 − 5          37    39.34
2      5 − 10         20    23.87
3      10 − 15        17    14.47
4      15 − 20        13     8.78
5      20 − 25         8     5.32
6      25 − ∞          5     8.20

n = 100

We wish to test

H0 : X is exponential
H1 : not H0

at level .05.

Solution. The ei’s are calculated from

e1 = n Pr{0 < X ≤ 5} = 100 ∫_0^5 (1/θ) e^{−x/θ} dx,
   ...
e5 = n Pr{20 < X ≤ 25} = 100 ∫_{20}^{25} (1/θ) e^{−x/θ} dx,
e6 = n Pr{X > 25} = 100 ∫_{25}^{∞} (1/θ) e^{−x/θ} dx,

or e6 can be calculated from ∑_{i=1}^{6} ei = n. We use the sample mean θ̂ = x̄ = 10 of our original observations as our estimate of θ. The ei's can now be calculated from the formulae above, with θ taken to be 10, and are shown in the above table. We obtain χ² = 5.84 and χ²_{.05,4} = 9.488. Hence we do not reject H0. An exponential distribution with mean 10 gives a good fit to the data.
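A Python sketch of this fit (assuming scipy for the chi-square quantile; the statistic comes out to about 5.83 rather than 5.84 because the table rounds the ei's):

from math import exp
from scipy.stats import chi2

n_obs = [37, 20, 17, 13, 8, 5]
edges = [0, 5, 10, 15, 20, 25]                 # cell 6 is (25, infinity)
theta = 10.0                                   # sample mean, our estimate of theta

F = lambda x: 1 - exp(-x / theta)              # exponential cdf with mean theta
probs = [F(edges[i + 1]) - F(edges[i]) for i in range(5)] + [1 - F(25)]
e_exp = [100 * p for p in probs]

stat = sum((n - e) ** 2 / e for n, e in zip(n_obs, e_exp))
df = 6 - 1 - 1                                 # one parameter (theta) estimated
print(round(stat, 2), round(chi2.ppf(0.95, df), 3))
# prints about 5.83 and 9.488, so H0 is not rejected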


7.4 Contingency Tables

Consider the following data in the form of a contingency table resulting from a random sample of 350 students.

                            Interest in Mathematics
                            low    avg.   high
Ability in      low          65     40     15
Statistics      avg.         54     63     29
                high         12     45     27

We want to test

H0 : ability in statistics and interest in mathematics are independent
H1 : not H0.

Theory. Suppose each observation from a population has two attributes, measurable by the variables X and Y. We want to test

H0 : X and Y are independent
H1 : not H0.

In the above example, X = ability in statistics and Y = interest in mathematics (both are measurable by some means). Let A1, . . . , Ar be r categories for the variable X, and B1, . . . , Bc be c categories for the variable Y. These give rise to r × c categories for the vector (X, Y), as in the following table.

       B1   B2   . . .   Bc
A1
A2
...
Ar

Let θij = Pr{X ∈ Ai, Y ∈ Bj}, i = 1, . . . , r; j = 1, . . . , c. Then

θi· = ∑_{j=1}^{c} θij = Pr{X ∈ Ai},   i = 1, . . . , r,

θ·j = ∑_{i=1}^{r} θij = Pr{Y ∈ Bj},   j = 1, . . . , c.

We shall carry out the multinomial test of hypotheses

H0 : θij = θi·θ·j for all i = 1, . . . , r; j = 1, . . . , c

H1 : not H0.

Note that rejection of this H0 also rejects H0 : X and Y are independent. For this multinomial test, the expected cell frequencies are given (as usual) by

eij = nθ′ij = nθi·θ·j

where θi· and θ·j are unknown and will be estimated by

θ̂i· = ni·/n,    θ̂·j = n·j/n,    i = 1, . . . , r; j = 1, . . . , c,


where ni· = ∑_{j=1}^{c} nij and n·j = ∑_{i=1}^{r} nij. The expected cell frequencies then become

eij = n · (ni·/n)(n·j/n) = ni· n·j / n.

In this process, we have estimated t = (r − 1) + (c − 1) parameters (since ∑_{i=1}^{r} θi· = 1 = ∑_{j=1}^{c} θ·j). The test is therefore: calculate

χ² = ∑_{i=1}^{r} ∑_{j=1}^{c} (nij − eij)²/eij

and reject H0 at level α if χ² > χ²_{α,(r−1)(c−1)}. (Note: the number of degrees of freedom is s − t − 1 = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).)

Example 1. For our student example, the expected cell frequency table is

                            Interest in Mathematics
                            low      avg.     high
Ability in      low        44.91    50.74    24.34
Statistics      avg.       54.65    61.74    29.62
                high       31.44    35.52    17.04

and we get χ² = 31.67. Since χ²_{.01,4} = 13.277, we reject H0 at level .01.

Example 2. 1200 U.S. stores are classified according to type and location, with the following results:

Observed Cell Frequencies
                     N     S     E     W
Clothing stores     219   200   181   180
Grocery stores       39    52    89    60
Other                42    48    30    60

Expected Cell Frequencies
                     N     S     E     W
Clothing stores     195   195   195   195
Grocery stores       60    60    60    60
Other                45    45    45    45

We get χ² = 38.07. Since χ²_{.05,6} = 12.592, we reject the hypothesis that store type and location are independent at level .05.
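Contingency-table calculations of this kind can be checked with scipy's chi2_contingency, which computes the expected frequencies ni·n·j/n and the degrees of freedom (r − 1)(c − 1) automatically. A sketch for the store data, assuming numpy and scipy are available:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[219, 200, 181, 180],     # the store data of Example 2
                     [ 39,  52,  89,  60],
                     [ 42,  48,  30,  60]])

stat, p_value, df, expected = chi2_contingency(observed)
print(round(stat, 2), df)        # prints 38.07 and 6
print(expected)                  # rows of 195's, 60's and 45's, as in the table above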


Chapter 8

Non-Parametric Methods of Inference

Reference: WMS 7th ed., chapter 15.

In this chapter, we discuss non-parametric, or distribution-free, methods of testing hypotheses. They are called non-parametric because we do not assume a knowledge of the form of the population distribution.

8.1 The Sign Test

Suppose we have a random sample of size n from an unknown continuous distribution with median m and mean µ. We want to test

H0 : m = m0

H1 : m ≠ m0.

Note that if the underlying distribution is known to be symmetrical, then this is also a test of

H0 : µ = µ0

H1 : µ ≠ µ0.

Method: Consider a sample value > m0 a “success” and assign it a “+”. Consider a sample value < m0 a “failure” and assign it a “−”. Any sample values equal to m0 are dropped from the sample. Let n be the sample size after any deletions, and let X = the number of +’s. If H0 is true, X has the binomial distribution with parameters n and p = 0.5. Hence we will reject H0 if X ≤ k′α/2 or if X ≥ kα/2. Two useful points are:

• kα/2 = n− k′α/2.

• If n ≥ 20, then

Z = (X − n/2)/√(n/4)

has approximately the standard normal distribution, and we would reject H0 if |Z| > zα/2.

Example 1. The following observations represent prices (in thousands of dollars) of a certain brand of automobile at various dealerships across Canada, and are known to come from a symmetrical distribution:

−      +      −      −      +      −      −      −      −      −
18.1   20.3   18.3   15.6   22.5   16.8   17.6   16.9   18.2   17.0
19.3   16.5   19.5   18.6   20.0   18.8   19.1   17.5   18.5   18.0
−      −      +      −      +      −      −      −      −      −


We want to test

H0 : µ = 19.4
H1 : µ ≠ 19.4.

We get X = 4. Since k.025 = 15 and k′.025 = 5, and X = 4 ≤ k′.025, we reject H0 at level .05.
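A small Python sketch of this sign test (assuming scipy is available; here the decision is expressed through the exact two-sided binomial p-value rather than the critical values k and k′):

from scipy.stats import binom

prices = [18.1, 20.3, 18.3, 15.6, 22.5, 16.8, 17.6, 16.9, 18.2, 17.0,
          19.3, 16.5, 19.5, 18.6, 20.0, 18.8, 19.1, 17.5, 18.5, 18.0]
m0 = 19.4

kept = [x for x in prices if x != m0]          # drop any observations equal to m0
n = len(kept)                                  # 20
X = sum(x > m0 for x in kept)                  # number of "+" signs: 4

# Under H0, X ~ Bin(n, 1/2); two-sided p-value:
p_value = 2 * min(binom.cdf(X, n, 0.5), binom.sf(X - 1, n, 0.5))
print(n, X, round(p_value, 4))                 # prints 20, 4, 0.0118 < .05: reject H0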

The Sign Test for a Matched Pairs Experiment.

Suppose (X, Y) has joint probability (density) function f(x, y). We have data (X1, Y1), . . . , (Xn, Yn) from (X, Y), and we want to test

H0 : f(x, y) ≡ f(y, x)
H1 : not H0.

We let W = X − Y. Under H0, P[X − Y < 0] = P[X − Y > 0], so we can carry out a sign test using W. Here is the theoretical justification:

Proposition 8.1.1 If the joint probability (or density) function of X and Y is symmetric, then P[X − Y = z] = P[X − Y = −z].

Proof. We are assuming that f(x, y) = f(y, x) for all x, y. Then

P[X − Y = z] = ∑_y P[X − Y = z, Y = y] = ∑_y P[X = y + z, Y = y] = ∑_y f(y + z, y)
             = ∑_y f(y, y + z) = ∑_w f(w − z, w) = P[X − Y = −z],

where the fourth equality uses the symmetry of f and the fifth is the substitution w = y + z.

Example 2. The numbers of defective memory modules produced per day by two operators A and B are observed for 30 days. The observations are:

+       +       +               +       −       +       +       +       +
(2,5)   (6,8)   (5,6)   (9,9)   (6,10)  (7,4)   (1,5)   (4,8)   (3,6)   (5,6)
+       −       +       +       +               +       +       −       +
(6,10)  (8,6)   (2,5)   (1,7)   (5,9)   (3,3)   (6,7)   (4,9)   (4,2)   (5,7)
+       +       +       −       +       +       +       −       +       +
(1,2)   (7,9)   (4,6)   (7,5)   (6,8)   (7,10)  (2,4)   (5,4)   (6,8)   (3,5)

We want to test

H0 : A and B perform equally well
H1 : A is better than B

at level .01. We assign a “+” to a pair (x, y) with x < y, a “−” if x > y, or we delete the pair (x, y) if x = y. Then the number of +’s is W = 23 and the sample size is n = 28. We obtain

z = (23 − 28/2)/√(28/4) = 3.40,

and since z.01 = 2.326, we reject H0 at level .01.
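The same computation in Python (a sketch; the pairs are typed in exactly as in the table above):

from math import sqrt

pairs = [(2,5), (6,8), (5,6), (9,9), (6,10), (7,4), (1,5), (4,8), (3,6), (5,6),
         (6,10), (8,6), (2,5), (1,7), (5,9), (3,3), (6,7), (4,9), (4,2), (5,7),
         (1,2), (7,9), (4,6), (7,5), (6,8), (7,10), (2,4), (5,4), (6,8), (3,5)]

kept = [(x, y) for x, y in pairs if x != y]    # delete the tied pairs
n = len(kept)                                  # 28
W = sum(x < y for x, y in kept)                # number of "+" signs: 23

z = (W - n / 2) / sqrt(n / 4)                  # normal approximation
print(n, W, round(z, 2))                       # prints 28, 23, 3.4; since z.01 = 2.326, reject H0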

Remark. In the sign test, we have to assume f(x, y) = f(y, x). As the following example shows, it is not enough to assume X and Y have the same distributions, as in Mendenhall.


Example. Suppose X and Y have joint probability function given by

        y = 1   y = 2   y = 3
x = 1     0       a       b
x = 2     b       0       a
x = 3     a       b       0

where a, b ≥ 0, a ≠ b and a + b = 1/3. Then X and Y have the same (uniform) distribution, but

P[X − Y > 0] = P[X = 3, Y = 1] + P[X = 3, Y = 2] + P[X = 2, Y = 1] = a + 2b = 1/3 + b,
P[X − Y < 0] = P[X = 1, Y = 3] + P[X = 2, Y = 3] + P[X = 1, Y = 2] = 2a + b = 1/3 + a,

which differ because a ≠ b.

8.2 The Mann-Whitney, or U-Test

We have samples x1, . . . , xn1 and y1, . . . , yn2 from two densities f(x) and g(y) respectively, which are known to satisfy

f(x) ≡ g(x− θ)

for some number θ. That is, we are assuming that the two underlying distributions differ only in location along the horizontal axis. We want to test

H0 : θ = 0
H1 : θ ≠ 0   (or θ > 0, i.e. f shifted to the right of g; or θ < 0, i.e. f shifted to the left of g).

Method: Combine the two samples, order the resulting combined sample, and assign ranks in order of increasing size. If there is a tie, assign to each of the tied observations the mean of the ranks which they jointly occupy. Define

Ux = n1n2 + n1(n1 + 1)/2 − Wx,

where Wx = the sum of the ranks assigned to the values from the x-sample. Note that the minimum value of Wx occurs when the first sample occupies the n1 lowest ranks, and then Wx min = 1 + 2 + · · · + n1 = n1(n1 + 1)/2. The maximum value of Wx occurs when the first sample occupies the highest n1 ranks, and then Wx max = (n2 + 1) + (n2 + 2) + · · · + (n2 + n1 − 1) + (n2 + n1) = n1n2 + n1(n1 + 1)/2. Hence

n1(n1 + 1)/2 ≤ Wx ≤ n1n2 + n1(n1 + 1)/2,

and so

0 ≤ Ux ≤ n1n2.

It may be shown that if H0 is true, then Ux is symmetric about its middle value, and

E(Wx) = n1(n1 + n2 + 1)/2,      Var(Wx) = n1n2(n1 + n2 + 1)/12,

and so

E(Ux) = n1n2/2,      Var(Ux) = n1n2(n1 + n2 + 1)/12.

Uy is defined in the same way. The above expressions are also true for Uy, provided n1 and n2 are interchanged. We also have

Ux + Uy = n1n2.


Case 1. n1, n2 small. We use the exact distribution of Ux. We reject H0 at level α if

U ≤ U0 (for H1 : θ ≠ 0),   Ux ≤ U0 (for H1 : θ > 0),   or Uy ≤ U0 (for H1 : θ < 0),

where U = min{Ux, Uy}, and U0 is such that P{U ≤ U0} = α/2 (two-tailed test), or P{Ux ≤ U0} = α (θ > 0), or P{Uy ≤ U0} = α (θ < 0).

Case 2. n1, n2 large (i.e. n1 > 8, n2 > 8). Then

Z = (Ux − E(Ux))/√Var(Ux)

has approximately a standard normal distribution. We would therefore reject H0 at level α if |Z| > zα/2.

Example 3. Consider the following observations, representing the potencies of two samples of penicillin, one from company X, the other from company Y.

Rank         10     14     6      9      12     13     4      17     11     7
Company X    54.8   62.6   51.2   54.5   56.0   59.4   48.2   70.5   55.2   51.9
Company Y    50.1   44.6   65.9   72.2   74.6   39.4   52.8   78.2   45.8   68.7
Rank         5      2      15     18     19     1      8      20     3      16

Are these data sufficient to indicate that the distributions of potencies of penicillin from these two companies differ in location? Use α = .05.

Solution. We get n1 = n2 = 10, Wx = 103 and so Ux = 52. Hence Uy = 100 − 52 = 48 and U = 48. From the tables, we find U0 = 23, so we do not reject. Alternatively, using the normal approximation,

Z = (52 − 50)/√175 = 0.15.

Since z.025 = 1.96, we do not reject H0.
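The ranking and the U statistics are easy to compute in Python (a sketch assuming scipy; note that, if we have its conventions right, scipy.stats.mannwhitneyu reports Wx − n1(n1 + 1)/2, which corresponds to Uy in our notation, so here the formulas of this section are applied directly instead):

from math import sqrt
from scipy.stats import rankdata

x = [54.8, 62.6, 51.2, 54.5, 56.0, 59.4, 48.2, 70.5, 55.2, 51.9]
y = [50.1, 44.6, 65.9, 72.2, 74.6, 39.4, 52.8, 78.2, 45.8, 68.7]
n1, n2 = len(x), len(y)

ranks = rankdata(x + y)                    # ranks in the combined sample (ties get mean ranks)
Wx = ranks[:n1].sum()                      # 103
Ux = n1 * n2 + n1 * (n1 + 1) / 2 - Wx      # 52
Uy = n1 * n2 - Ux                          # 48

Z = (Ux - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(Wx, Ux, Uy, round(Z, 2))             # prints 103.0, 52.0, 48.0, 0.15: do not reject H0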

Remarks. If Vx = the number of observations in the x-sample which precede each observation in the y-sample (summed over the y-sample), then Vx = Ux. (For if ry1 , . . . , ryn2 denote the ordered ranks of the observations in the y-sample, then Vx = (ry1 − 1) + (ry2 − 2) + · · · + (ryn2 − n2) = Wy − n2(n2 + 1)/2 = n1n2 − Uy = Ux.)

8.3 Tests for Randomness Based on Runs.

Consider the following example. A long line of trees is observed one by one, and each tree is classified as S (healthy) or F (diseased). The result is

SSSFSSSSSFFFSSSSSSFFFFSSSSFFSSSSSSSSSSFFFFSS. (8.1)

We want to test

H0 : the ordering of the sample is random
H1 : not H0.

Observe that the S’s and F’s seem to be clustered, perhaps because diseased trees tend to infect neighbouringtrees. On the other hand, this ordering may be due merely to chance. We want to test whether or not thisparticular ordering of 30 S’s and 14 F’s is due to chance.


Remark. Note that the opposite case of SFSFSFFSF . . . where S’s and F’s virtually alternate is also non-random.

Theory. Suppose a sequence such as in (8.1) is made up of m S’s and n F’s. A maximal subsequence consisting of like symbols is called a run. In (8.1), the runs are SSS, F, SSSSS, FFF, SSSSSS, FFFF, SSSS, FF, SSSSSSSSSS, FFFF, and SS, of lengths 3, 1, 5, 3, 6, 4, 4, 2, 10, 4, and 2 respectively. Thus, there are 6 runs of S and 5 runs of F, for a total of 11 runs.

In general, let R denote the number of runs of both types (so that R = 11 in the above example). Then the minimum possible value of R is rmin = 2 and the maximum value is

rmax = 2m if m = n,   and   rmax = 2 min{m, n} + 1 if m ≠ n.

In order to compute the probability function of R, we shall begin by counting the number N(r) of ways that m S’s and n F’s can be ordered so that r runs result. Suppose first that r = 2k, so there are k runs of S and k runs of F.

• N(r) = the number of ways which start with S plus the number of ways that start with F .

• The number of ways in which m S’s can form k runs is the same as the number of ways that k − 1 bars can be put in m − 1 spaces (a bar represents a run of F’s), and is therefore the binomial coefficient C(m − 1, k − 1).

• Each such way generates the number C(n − 1, k − 1) of ways in which n F’s can be put into k boxes, such that each box contains at least one F.

Combining the above three points, we find that

N(r) = 2 C(m − 1, k − 1) C(n − 1, k − 1).

When r = 2k + 1, we similarly find that

N(r) = C(m − 1, k) C(n − 1, k − 1) + C(m − 1, k − 1) C(n − 1, k).

Now suppose that the sequence of m S’s and n F’s is “generated at random”, in the sense that all C(m + n, m) orderings are equally likely. Then we arrive at

Pr{R = r} = 2 C(m − 1, k − 1) C(n − 1, k − 1) / C(m + n, m)                                if r = 2k,

Pr{R = r} = [C(m − 1, k) C(n − 1, k − 1) + C(m − 1, k − 1) C(n − 1, k)] / C(m + n, m)      if r = 2k + 1.      (8.2)

From this, we can calculate that

E(R) = 2mn/(m + n) + 1,      Var(R) = 2mn(2mn − m − n)/[(m + n)²(m + n − 1)].      (8.3)
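For small m and n, the exact probability function (8.2) can be tabulated directly; here is a minimal Python sketch (the function name is ours), which also confirms the mean formula in (8.3):

from math import comb

def runs_pmf(m, n):
    # P(R = r) for a random arrangement of m S's and n F's, following (8.2).
    total = comb(m + n, m)
    pmf = {}
    for k in range(1, min(m, n) + 1):
        pmf[2 * k] = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1) / total
        odd = (comb(m - 1, k) * comb(n - 1, k - 1)
               + comb(m - 1, k - 1) * comb(n - 1, k)) / total
        if odd > 0:
            pmf[2 * k + 1] = odd
    return pmf

pmf = runs_pmf(30, 14)
print(round(sum(pmf.values()), 10))                    # 1.0 (sanity check)
print(round(sum(r * p for r, p in pmf.items()), 2))    # 20.09, i.e. 2mn/(m + n) + 1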


Method: If m and n are both large (m,n ≥ 10), then under H0,

Z = (R − E(R))/√Var(R)

(where E(R) and Var(R) are given in (8.3)) has approximately the standard normal distribution, and so we would reject H0 at level α if |Z| > zα/2. Otherwise (if m and n are not large), we must use the probability function of R given in (8.2) directly.

Example. For the example with the trees, we have m = 30, n = 14, and so E(R) = 20.1 and Var(R) = 8.03. We found R = 11, and therefore

Z = (11 − 20.1)/√8.03 = −3.21.

Since z.025 = 1.96, we reject H0 at level .05.
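Finally, the whole runs test on the tree data can be reproduced in a few lines of Python (a sketch; itertools.groupby conveniently counts maximal runs of like symbols):

from math import sqrt
from itertools import groupby

seq = "SSSFSSSSSFFFSSSSSSFFFFSSSSFFSSSSSSSSSSFFFFSS"   # the sequence (8.1)
m, n = seq.count("S"), seq.count("F")                   # 30 and 14
R = sum(1 for _ in groupby(seq))                        # number of runs: 11

ER = 2 * m * n / (m + n) + 1
VarR = 2 * m * n * (2 * m * n - m - n) / ((m + n) ** 2 * (m + n - 1))
Z = (R - ER) / sqrt(VarR)
print(R, round(ER, 2), round(VarR, 2), round(Z, 2))
# prints 11, 20.09, 8.03, -3.21; |Z| > 1.96, so H0 is rejected at level .05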