
MATH 170S Introduction to Probability and Statistics II

Lecture Note

Hanbaek Lyu

DEPARTMENT OF MATHEMATICS, UNIVERSITY OF CALIFORNIA, LOS ANGELES, CA 90095

Email address: [email protected]

www.hanaeklyu.com


Contents

Chapter 1. Review of basic probability theory
  1. Probability measure and probability space
  2. Random variables

Chapter 2. Descriptive Statistics
  1. Basic sample statistics
  2. Exploratory Data Analysis
  3. Order statistics
  4. Order statistics and the q-q plot

Chapter 3. Maximum Likelihood Estimation
  1. Definition and Examples of MLE
  2. Unbiased estimators
  3. Method of moments

Chapter 4. Linear regression
  1. Conditional expectation and variance
  2. A simple linear regression problem

Chapter 5. Sufficient Statistics
  1. Definition and motivating example
  2. Sufficient statistics by factorization theorem
  3. Rao-Blackwell Theorem

Chapter 6. Bayesian Estimation
  1. Bayes’ Theorem and Inference – Discrete setting
  2. Bayes’ Theorem and Inference – Continuous setting
  3. More Bayesian concepts

Chapter 7. Interval estimation
  1. Estimating mean of normal distribution with known variance
  2. Central limit theorem
  3. Confidence interval for the mean with unknown variance
  4. Confidence intervals of difference of two means
  5. Confidence interval for proportions
  6. Sample size
  7. Distribution-free confidence intervals for percentiles

Chapter 8. Statistical testing
  1. Tests about one mean
  2. Test about two means
  3. Tests about proportions
  4. The Wilcoxon Tests
  5. Power of a statistical test
  6. Best critical regions
  7. Likelihood ratio tests
  8. Chi-square goodness-of-fit tests

Bibliography


CHAPTER 1

Review of basic probability theory

1. Probability measure and probability space

Many things in life are uncertain. Can we 'measure' and compare such uncertainty so that it helps us make more informed decisions? Probability theory provides a systematic way of doing so.

We begin with idealizing our situation. Let Ω be a finite set, called the sample space. This is the collection of all possible outcomes that we can observe (think of the six sides of a die). We are going to perform some experiment on Ω, and the outcome could be any subset E of Ω, which we call an event. Let us denote the collection of all events E ⊆ Ω by 2^Ω. A probability measure on Ω is a function P : 2^Ω → [0,1] such that for each event E ⊆ Ω, it assigns a number P(E) ∈ [0,1] and satisfies the following properties:

(i) P(∅) = 0 and P(Ω) = 1.
(ii) If two events E1, E2 ⊆ Ω are disjoint, then P(E1 ∪ E2) = P(E1) + P(E2).

In words, P(E) is our quantification of how likely it is that the event E occurs in our experiment.

Exercise 1.1.1. Let P be a probability measure on a sample space Ω. Show the following.

(i) Let E = {x1, x2, ..., xk} ⊆ Ω be an event. Then P(E) = ∑_{i=1}^k P({xi}).

(ii) ∑_{x∈Ω} P({x}) = 1.

If P is a probability measure on a sample space Ω, we call the pair (Ω,P) a probability space. This is our idealized world where we can precisely measure the uncertainty of all possible events. Of course, there could be many (in fact, infinitely many) different probability measures on the same sample space.

Exercise 1.1.2 (coin flip). Let Ω = {H, T} be a sample space. Fix a parameter p ∈ [0,1], and define a function Pp : 2^Ω → [0,1] by Pp(∅) = 0, Pp({H}) = p, Pp({T}) = 1 − p, Pp({H, T}) = 1. Verify that Pp is a probability measure on Ω for each value of p.

A typical way of constructing a probability measure is to specify how likely it is to see each individual element in Ω. Namely, let f : Ω → [0,1] be a function that sums up to 1, i.e., ∑_{x∈Ω} f(x) = 1. Define a function P : 2^Ω → [0,1] by

P(E) = ∑_{x∈E} f(x). (1)

Then this is a probability measure on Ω, and f is called a probability distribution on Ω. For instance, the probability distribution on {H, T} we used to define Pp in Exercise 1.1.2 is f(H) = p and f(T) = 1 − p.

Example 1.1.3 (Uniform probability measure). Let Ω = {1, 2, ..., m} be a sample space and let P be the uniform probability measure on Ω, that is,

P({x}) = 1/m for all x ∈ Ω. (2)

Then for the event A = {1, 2, 3}, we have

P(A) = P({1} ∪ {2} ∪ {3}) (3)
= P({1}) + P({2}) + P({3}) (4)
= 1/m + 1/m + 1/m = 3/m. (5)


Likewise, if A ⊆ Ω is any event and if we let |A| denote the size (number of elements) of A, then

P(A) = |A|/m. (6)

For example, let Ω = {1, 2, 3, 4, 5, 6}² be the sample space of a roll of two fair dice. Let A be the event that the sum of the two dice is 5. Then

A = {(1,4), (2,3), (3,2), (4,1)}, (7)

so |A| = 4. Hence P(A) = 4/36 = 1/9. N
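As a quick sanity check of formula (6), here is a minimal Python sketch (my addition, not part of the original notes; it only uses the standard library) that enumerates the 36 equally likely outcomes of two dice and recovers P(A) = 1/9.

from fractions import Fraction
from itertools import product

# Sample space of a roll of two fair dice: all ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))

# Event A: the sum of the two dice equals 5.
A = [w for w in omega if sum(w) == 5]

# Uniform probability measure: P(A) = |A| / |Omega|.
P_A = Fraction(len(A), len(omega))
print(len(A), P_A)   # 4, 1/9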

2. Random variables

Given a finite probability space (Ω,P), a (discrete) random variable (RV) is any real-valued function X : Ω → R. We can think of it as the outcome of some experiment on Ω (e.g., the height of a randomly selected friend). We often forget the original probability space and specify a RV by its probability mass function (PMF) fX : R → [0,1],

fX(x) = P(X = x) = P({ω ∈ Ω | X(ω) = x}). (8)

Namely, P(X = x) is the likelihood that the RV X takes value x.

Example 1.2.1. Say you win $1 if a fair coin lands heads and lose $1 if it lands tails. We can set up our probability space (Ω,P) by Ω = {H, T} and P = the uniform probability measure on Ω. The RV X : Ω → R for this game is X(H) = 1 and X(T) = −1. The PMF of X is given by fX(1) = P(X = 1) = P({H}) = 1/2 and likewise fX(−1) = 1/2.

Exercise 1.2.2. Let (Ω,P) be a probability space and X : Ω → R be a RV. Let fX be the PMF of X, that is, fX(x) = P(X = x) for all x. Show that fX adds up to 1, that is,

∑_x fX(x) = 1, (9)

where the summation runs over all numerical values x that X can take.

There are two useful statistics of a RV that summarize its two most important properties: its average and its uncertainty. First, if one has to guess the value of a RV X, what would be the best choice? It is the expectation (or mean) of X, defined as below:

E(X) = ∑_x x P(X = x). (10)

Exercise 1.2.3. For any RV X and real numbers a,b ∈R, show that

E[aX +b] = aE[X ]+b. (11)

On the other hand, say you play two different games: in the first game, you win or lose $1 depending on a fair coin flip, and in the second game, you win or lose $10. In both games, your expected winning is 0. But the two games differ in how much the outcome fluctuates around the mean. This notion of fluctuation is captured by the following quantity called variance:

Var(X ) = E[(X −E(X ))2]. (12)

Namely, it is the expected squared difference between X and its expectation E(X ).

Exercise 1.2.4. Let X be a RV. Show that Var(X ) = E[X 2]−E[X ]2.
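For concreteness, here is a small Python check (my addition, not from the notes) of the identity Var(X) = E[X²] − E[X]², computed directly from the PMFs of the two coin-flip games described above.

def expectation(pmf):
    # E[X] = sum over x of x * P(X = x); pmf maps values to probabilities.
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # Var(X) = E[(X - E[X])^2], eq. (12).
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

for pmf in ({1: 0.5, -1: 0.5}, {10: 0.5, -10: 0.5}):
    second_moment = sum(x ** 2 * p for x, p in pmf.items())   # E[X^2]
    mean = expectation(pmf)
    print(variance(pmf), second_moment - mean ** 2)           # the two numbers agree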

Here are some of the simplest and yet most important RVs.

Exercise 1.2.5 (Bernoulli RV). A RV X is a Bernoulli variable with (success) probability p ∈ [0,1] if it takes value 1 with probability p and 0 with probability 1 − p. In this case we write X ∼ Bernoulli(p). Show that E(X) = p and Var(X) = p(1 − p).


Exercise 1.2.6 (Indicator variables). Let (Ω,P) be a probability space and let E ⊆ Ω be an event. The indicator variable of the event E, which is denoted by 1E, is the RV such that 1E(ω) = 1 if ω ∈ E and 1E(ω) = 0 if ω ∈ E^c. Show that 1E is a Bernoulli variable with success probability p = P(E).

Example 1.2.7 (Uniform RV). Let Ω = {x1, x2, ..., xn} be a finite sample space of real numbers. A RV X is a uniform variable on Ω, denoted as X ∼ Uniform(Ω), if it takes each element of Ω with equal probability. That is,

P(X = xi) = 1/n = 1/|Ω| for all 1 ≤ i ≤ n. (13)

Let X ∼ Uniform(Ω) as above. Then

E[X] = ∑_{i=1}^n xi · (1/n) = (1/n) ∑_{i=1}^n xi. (14)

Also,

Var(X) = ∑_{i=1}^n (xi − E[X])² · (1/n) = (1/n) ∑_{i=1}^n (xi − E[X])². (15)

The following exercise ties the expectation and the variance of a RV into a problem of finding a point estimator that minimizes the mean squared error.

Exercise 1.2.8 (Variance as minimum MSE). Let X be a RV. Let x ∈ R be a number, which we consider as a 'guess' (or 'estimator' in Statistics) of X. Let E[(X − x)²] be the mean squared error (MSE) of this estimation.

(i) Show that

E[(X − x)²] = E[X²] − 2xE[X] + x² (16)
= (x − E[X])² + E[X²] − E[X]² (17)
= (x − E[X])² + Var(X). (18)

(ii) Conclude that the MSE is minimized when x = E[X] and the global minimum is Var(X). In this sense, E[X] is the 'best guess' for X and Var(X) is the corresponding MSE.

To define a discrete RV, it was enough to specify its PMF. For a continuous RV, the probability density function (PDF) plays the analogous role of the PMF. We also need to replace the summation ∑ with an integral ∫ dx. Namely, X is a continuous RV if there is a function fX : R → [0,∞) such that for any interval (a,b], the probability that X takes a value in (a,b] is given by integrating fX over that interval:

P(X ∈ (a,b]) = ∫_a^b fX(x) dx. (19)

The cumulative distribution function (CDF) of a RV X (either discrete or continuous), denoted by FX, is defined by

FX(x) = P(X ≤ x). (20)

By the definition of the PDF, we get

FX(x) = ∫_{−∞}^x fX(t) dt. (21)

Conversely, PDFs can be obtained by differentiating corresponding CDFs.

Exercise 1.2.9. Let X be a continuous RV with PDF fX. Let a be a continuity point of fX, that is, fX is continuous at a. Show that FX(x) is differentiable at x = a and

(dFX/dx)|_{x=a} = fX(a). (22)


The expectation of a continuous RV X with PDF fX is defined by

E(X) = ∫_{−∞}^{∞} x fX(x) dx, (23)

and its variance Var(X ) is defined by the same formula (12).


CHAPTER 2

Descriptive Statistics

1. Basic sample statistics

One of the core problems in statistics is making inference on an unknown random variable X using its sample values x1, x2, ..., xn (not necessarily distinct). From these sample values of X, how can we 'infer' the true distribution of X? The basic idea is to approximate the distribution of X by its empirical distribution.

Definition 2.1.1. Let X be a RV, and x1, ..., xn be n sample values of X. Then the empirical distribution of X is the probability distribution on the set S = {x1, ..., xn} with

f(xi) = (1/n) · (# of times that the value xi appeared). (24)

Remark 2.1.2. If the sample values x1, ..., xn are all distinct, then the empirical distribution becomes the uniform distribution on the set S = {x1, ..., xn}.

Example 2.1.3. Say we have a coin with unknown probability p of coming up heads. We want to figure out the true value of p. A natural way to do this would be simply flipping the coin many times, and seeing how many heads come up versus tails. For example, suppose we flip this coin 10 times and the result was

(x1, x2, ..., x10) = (H, H, T, H, H, T, T, H, H, T). (25)

Then the corresponding empirical distribution is given by

f(H) = 6/10, f(T) = 4/10. (26)

Can we conclude that p ≈ 6/10? N

In many cases, computing the exact distribution of an unknown RV may be costly or difficult. Instead, computing its mean and variance gives a rough idea of how large and how 'spread out' it is, respectively. Since we do not have the full distribution of X, but only its sample values x1, ..., xn, we can only approximately compute the true mean E[X] and variance Var(X).

Definition 2.1.4. Let X be a RV and x1, ..., xn be its sample values. Define the following:

(sample mean) x̄ = (1/n) ∑_{i=1}^n xi, (27)

(variance of the empirical dist.) ν = (1/n) ∑_{i=1}^n (xi − x̄)², (28)

(sample variance) s² = (1/(n − 1)) ∑_{i=1}^n (xi − x̄)² = (n/(n − 1)) ν, (29)

(sample standard deviation) s = √(s²). (30)

Remark 2.1.5. Why do we have n − 1 instead of n in the definition of the sample variance? In later sections, we will see that s² is in some sense a 'better estimator' than ν for the population variance Var(X). See Exercise 3.2.4.
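As a concrete illustration of Definition 2.1.4, here is a short Python sketch (my addition; the sample values are made up rather than taken from any data in the notes) computing all four quantities and checking the relation s² = (n/(n − 1)) ν.

import math

sample = [2.1, 3.4, 2.9, 4.0, 3.6]                      # hypothetical sample values x_1, ..., x_n
n = len(sample)

xbar = sum(sample) / n                                  # sample mean, eq. (27)
nu = sum((x - xbar) ** 2 for x in sample) / n           # variance of the empirical dist., eq. (28)
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)     # sample variance, eq. (29)
s = math.sqrt(s2)                                       # sample standard deviation, eq. (30)

print(xbar, nu, s2, s)
print(abs(s2 - n / (n - 1) * nu) < 1e-12)               # check s^2 = (n/(n-1)) * nu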


Exercise 2.1.6. Let X ∼ Bernoulli(p) for an unknown parameter p. Say after some experiment, we have its sample values

(x1, x2, ..., x10) = (1, 1, 0, 1, 1, 0, 0, 1, 1, 0). (31)

Compute the corresponding sample mean, variance of the empirical distribution, sample variance, and sample standard deviation.

Exercise 2.1.7. Let X ∼ Uniform({1, 2, 3, 4, 5, 6}), which can also be thought of as the outcome of rolling a fair die. Suppose we have sample values

(x1, x2, ..., x10) = (2, 4, 2, 5, 6, 1, 3, 3, 2, 6). (32)

Compute the corresponding sample mean, variance of the empirical distribution, sample variance, and sample standard deviation.

Example 2.1.8 (Excerpted from [HTZ77]). Let X be the weight of a candy bar produced in factory A. Due to the uncertainty in the production process, it is reasonable to think of X as a random variable. In order to gain knowledge of X, we may collect the actual weights of n candy bars from this factory. This will give us sample values of X, which are shown in the table below:

FIGURE 1. Sample values of candy bar weights.

We can of course directly think of the empirical distribution corresponding to the above sample values of X. But in some cases (especially for huge sample sizes), it is often convenient to group the sample values and consider the resulting 'coarse-grained' empirical distribution. We will illustrate this through this example.

We start by observing that the minimum and maximum sample values are 20.5 and 26.7, respectively. The interval [20.5, 26.7] has length 6.2, so it can be covered with k = 7 sub-intervals of length 0.9. So we will divide the interval [20.45, 26.75] into k = 7 sub-intervals of equal length 0.9. For each sub-interval, its midpoint is called its class mark. We then count how many sample values fall in each of these seven intervals and make a histogram, where the ith sub-interval Ii has height equal to its frequency fi.

A more probabilistic way of making such a histogram is called the relative frequency histogram or the density histogram. Namely, for each sub-interval Ii,

Area of the density histogram over Ii = fi/n ≈ P(X ∈ Ii). (33)

The corresponding density histogram of our running example is shown below. N

Exercise 2.1.9. Compute the sample mean and sample variance of the raw sample values given in Example 2.1.8. Also compute the sample mean and sample variance of the grouped data, using the class marks weighted by their respective frequencies. How do they compare?


2. Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods, first promoted by John Tukey. Since modern data sets are usually too large to be comprehended by the naked eye, advanced EDA techniques such as dimensionality reduction, clustering, and data visualization are becoming very important. These can lead to a better understanding of the data set, yielding better modeling or more insightful hypotheses.

In this section, we will learn elementary methods of EDA, such as the stem-and-leaf display, order statistics, sample percentiles, and the box plot.

Example 2.2.1 (Ordered stem-and-leaf display, Excerpted from [HTZ77]). Suppose the scores in Midterm 1 of Math 170S turned out to be as below:

Drawing a histogram of the sample data would be one way to visually express the data set. Tukey suggested the following more informative 'version' of the histogram, which does not 'collapse' sample values into a class as a histogram does. For instance, we think of the sample value 93 as a stem of 9 with a leaf of 3. So 'stems' would be the first digits of the exam scores, and 'leaves' would be the second digits. We can also order the leaves from the smallest to the largest for a cleaner representation. Finally, we also record


the frequency of each stem, which equals the number of its leaves. The resulting diagram is as above, which is called an (ordered) stem-and-leaf display.

N

Given a set of sample values x1, x2, ..., xn, we can rearrange these values so that they are ordered from the smallest to the largest. Namely, let y1 be the smallest value in the sample; let y2 be the second smallest value in the sample, and so on. So we can rearrange the sample as y1 ≤ y2 ≤ ··· ≤ yn. We call yk the kth order statistic of the sample.

The median of the sample, roughly speaking, is the value m so that half of the sample values are < m and the other half are > m; so m is the 'middle' value. One should not confuse the median with the sample mean. For example, if the sample values are 1, 3, 8, then the median is 3, but the mean is (1 + 3 + 8)/3 = 4. But what if there is no middle value? Say the sample values are 1, 3, 5, 8. Then the median should be somewhere 'between' 3 and 5. In this case, we take the average (3 + 5)/2 = 4 to be the median.

In general, fix p ∈ (0,1). The (100p)th sample percentile (or sample percentile of order p) of the sample, which we denote by πp, is the value such that (approximately) np of the sample values are < πp, and n(1 − p) of them are > πp. In particular, if p = 1/2, then the 50th sample percentile is exactly the median. If (n + 1)p is an integer r, then according to Exercise 2.2.2 we choose the rth order statistic yr as the (100p)th sample percentile πp. What if (n + 1)p is not an integer, but an integer r plus a fractional part δ ∈ (0,1)? We then take the weighted average of yr and yr+1 with weights 1 − δ and δ. In general, if (n + 1)p = r + δ for a unique integer r and δ ∈ [0,1), then

πp = (1−δ)yr +δyr+1. (34)

The reasoning behind this formula is provided in the following exercise.

Exercise 2.2.2. Suppose we have a sample of values x1, x2, ..., xn. Let yk denote the kth order statistic of this sample. Fix p ∈ (0,1). Write (n + 1)p = r + δ for a unique integer r and δ ∈ [0,1). Show the following:


(i) (# of sample values ≤ yr ) = r = np + (p −δ).

(ii) (# of sample values > yr ) = n − r = n(1−p)−p +δ.

Example 2.2.3 (Order statistics and sample percentiles, Excerpted from [HTZ77]). The following table shows the order statistics of the n = 50 exam scores in Example 2.2.1. We will compute some sample percentiles. First, let p = 1/2. Since (n + 1)p = 51 · (1/2) = 25.5, the 50th sample percentile is

π0.5 = (0.5)y25 + (0.5)y26 = (71 + 71)/2 = 71. (35)

For p = 1/4, we have (n + 1)p = 51 · (1/4) = 12.75, so the 25th sample percentile is

π0.25 = (0.25)y12 + (0.75)y13 = (0.25)(58) + (0.75)(59) = 58.75. (36)

Lastly, for p = 3/4, we have (n + 1)p = 51 · (3/4) = 38.25, so the 75th sample percentile is

π0.75 = (0.75)y38 + (0.25)y39 = (0.75)(81) + (0.25)(82) = 81.25. (37)

N
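The interpolation rule (34) is easy to code. The sketch below is my addition; since the full table of 50 exam scores is not reproduced here, it uses a small made-up sample just to exercise the formula.

def sample_percentile(values, p):
    # Sample percentile of order p via eq. (34): write (n+1)p = r + delta,
    # then pi_p = (1 - delta) * y_r + delta * y_{r+1} (order statistics are 1-indexed).
    y = sorted(values)
    n = len(y)
    pos = (n + 1) * p
    r = int(pos)                 # integer part
    delta = pos - r              # fractional part in [0, 1)
    if r < 1:                    # pragmatic edge cases when (n+1)p falls outside [1, n]
        return y[0]
    if r >= n:
        return y[-1]
    return (1 - delta) * y[r - 1] + delta * y[r]

data = [34, 58, 59, 71, 71, 81, 82, 97]              # hypothetical scores, not the real data set
print(sample_percentile(data, 0.5))                  # the median
print(sample_percentile(data, 0.25), sample_percentile(data, 0.75))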

Special names are given to certain sample percentiles. We have noted that the 50th percentile is the median of the sample. The 25th, 50th, and 75th percentiles are also called the first, second, and third quartiles of the sample, respectively. We also use special notation for them:

q1 = π0.25, q2 = m = π0.5, q3 = π0.75. (38)

The five-number summary of a sample consists of its three quartiles as well as the minimum y1 and the maximum yn. The box plot is a diagrammatic representation of this five-number summary. See the following example.

Example 2.2.4 (Five-number summary and box plot). We use the sample of 50 exam scores in Example 2.2.1. According to the computation in Example 2.2.3, the five-number summary of the sample is

y1 = 34, q1 = 58.75, q2 = m = 71, q3 = 81.25, y50 = 97. (39)

The corresponding box plot is shown below. The middle line corresponds to the median, whereas the rectangular box is bounded by the first and third quartiles. The vertical line spans from the minimum y1 to the maximum y50.

N

Exercise 2.2.5. Suppose we have the following sample of Google's stock price for the past 50 weeks (unit: $ per share).

320 326 325 318 322 320 329 317 316 331 320 320 317 329 316 308 321

319 322 335 318 313 327 314 329 323 327 323 324 314 308 305 328 330

322 310 324 314 312 318 313 320 324 311 317 325 328 319 310 324

(i) Compute the sample mean x and sample standard deviation s.


(ii) Draw the ordered stem-and-leaf display. How many sample values lie within x̄ ± s, and within x̄ ± 2s?
(iii) Give the five-number summary of the sample. Draw the corresponding box plot.

3. Order statistics

3.1. Distribution of Order Statistics. Order statistics are the observations of a random sample, rearranged in magnitude from the smallest to the largest. They are extremely important in statistics for determining basic sample statistics such as the sample median, sample quantiles, the sample range, and the empirical CDF. In recent years, they have also been used in nonparametric inference and robust procedures.

In this section, we assume that we have n independent and identically distributed (i.i.d.) continuous random variables X1, X2, ..., Xn. According to the following exercise, with probability 1 there are no ties among X1, ..., Xn, so ordering them is unambiguous.

Exercise 2.3.1. Let X1, · · · , Xn be i.i.d. continuous RVs. Show that

(i) P(Xi = x) = 0 for all 1 ≤ i ≤ n and x ∈ R.
(ii) Use conditioning and (i) to show that

P(Xi = Xj) = E[P(Xi = Xj | Xj)] = 0. (40)

(iii) From (ii), deduce that

P(there is a tie in X1, ..., Xn) = P(⋃_{1≤i<j≤n} {Xi = Xj}) ≤ ∑_{1≤i<j≤n} P(Xi = Xj) = 0. (41)

Now we define the order statistics for the random sample X1, X2, · · · , Xn . Let

Y1 = smallest of X1, X2, · · · , Xn (42)

Y2 = second smallest of X1, X2, · · · , Xn (43)

... (44)

Yn = largest of X1, X2, · · · , Xn . (45)


We call Yk the kth order statistic of the sample X1, ..., Xn.

Below, we will derive the CDF of the order statistics. To begin, we start with the largest one, Yn. Note that for each y ∈ R,

P(Yn ≤ y) = P(X1 ≤ y, X2 ≤ y, ..., Xn ≤ y) (46)
= P(X1 ≤ y)P(X2 ≤ y)···P(Xn ≤ y) (47)
= P(X1 ≤ y)^n, (48)

where we have used the fact that the Xi's are independent for the second equality, and that they have identical distributions for the third. Next, consider the minimum Y1.

P(Y1 ≤ y) = 1−P(Y1 > y) (49)

= 1−P(X1 > y, X2 > y, · · · , Xn > y) (50)

= 1−P(X1 > y)P(X2 > y) · · ·P(Xn > y) (51)

= 1−P(X1 > y)n . (52)

So we can reduce the CDFs of Yn and Y1 to functions of that of X1. We generalize this observation in Exercise 2.3.2.

There is also a quite reasonable heuristic argument that gives the PDF of Yk. Consider the following infinitesimal probability

P(y ≤ Yk < y + h) = P(smallest k of the Xi's are < y + h, kth smallest one lies in [y, y + h)) (53)
= P(exactly k − 1 Xi's are ≤ y, one is in [y, y + h), the others are > y + h) (54)
+ O(P(more than one Xi is in [y, y + h))). (55)

Since Xi ’s are i.i.d., we can write

P(y ≤ Yk < y + h) = n (n−1 choose k−1) P(X1 ≤ y)^{k−1} P(y ≤ X1 < y + h) P(X1 > y + h)^{n−k} (56)
+ O(P(more than one Xi is in [y, y + h))). (57)

Note that the coefficient above is the number of ways to choose one Xi to put inside [y, y + h) and k − 1 Xi's among the remaining n − 1 to put inside (−∞, y); the rest go to [y + h, ∞). Also note that P(y ≤ X1 < y + h) ≈ fX1(y) · h for small h¹, and the probability that two or more Xi's fall in [y, y + h) is small, of order O(h²). A bit more precisely,

P(y ≤ X1, X2 < y +h) =P(y ≤ X1 < y +h)2 = ( fX1 (y) ·h)2 =O(h2), (58)

so this yields

P(more than one Xi is in [y, y + h)) = P(some two Xi's are in [y, y + h)) (59)
= (n choose 2) P(y ≤ X1, X2 < y + h) = O(h²). (60)

Combining with previous estimates, we get

fYk(y) = lim_{h→0} P(y ≤ Yk < y + h)/h = [n!/((k − 1)!(n − k)!)] P(X1 ≤ y)^{k−1} P(X1 > y)^{n−k} fX1(y). (61)

A more standard derivation of the PDF of Yk is also given in Exercise 2.3.2.

Exercise 2.3.2. Let Y1, · · · ,Yn be the order statistics of the sample X1, · · · , Xn .

1Use Taylor expansion of the PDF of Yk or mean value theorem for integrals


(i) Show that

P(Xi > y for exactly one 1 ≤ i ≤ n) = P(⋃_{i=1}^n {Xi > y and Xj ≤ y for all other j's}) (62)
= ∑_{i=1}^n P(Xi > y and Xj ≤ y for all j ≠ i) (63)
= ∑_{i=1}^n P(Xi > y) P(Xj ≤ y for all j ≠ i) (64)
= ∑_{i=1}^n P(X1 > y) P(X1 ≤ y)^{n−1} (65)
= n P(X1 > y) P(X1 ≤ y)^{n−1}. (66)

(ii) Show that

P(Xi > y for exactly two i's) = P(⋃_{1≤i<j≤n} {Xi, Xj > y and Xk ≤ y for all other k's}) (67)
= ∑_{1≤i<j≤n} P(Xi, Xj > y) P(Xk ≤ y for all k ∉ {i, j}) (68)
= ∑_{1≤i<j≤n} P(X1 > y)² P(X1 ≤ y)^{n−2} (69)
= (n choose 2) P(X1 > y)² P(X1 ≤ y)^{n−2}. (70)

(iii) Using a similar argument as before, show that

P(Xi > y for exactly k i's) = (n choose k) P(X1 > y)^k P(X1 ≤ y)^{n−k}. (71)

(iv) Use (iii) to conclude that

P(Yk ≤ y) = P(Xi ≤ y for at least k i's) (72)
= ∑_{ℓ=k}^n P(Xi ≤ y for exactly ℓ i's) (73)
= ∑_{ℓ=k}^n P(Xi > y for exactly n − ℓ i's) (74)
= ∑_{ℓ=k}^n (n choose ℓ) P(X1 > y)^{n−ℓ} P(X1 ≤ y)^ℓ. (75)

(v) Differentiate the CDF of Yk in (iv) to get its PDF fYk:

fYk(y) = [n!/((k − 1)!(n − k)!)] P(X1 ≤ y)^{k−1} P(X1 > y)^{n−k} fX1(y). (76)
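Formulas (75) and (76) are easy to check by simulation. The following sketch is my addition (it uses numpy and scipy, which the notes do not assume); it compares a Monte Carlo estimate of P(Yk ≤ y) with the binomial sum (75) for i.i.d. standard normal samples.

import numpy as np
from math import comb
from scipy.stats import norm

rng = np.random.default_rng(0)
n, k, y = 5, 3, 0.2                      # sample size, order index, evaluation point

# Monte Carlo: fraction of trials in which the k-th order statistic is <= y.
samples = rng.standard_normal((100_000, n))
Yk = np.sort(samples, axis=1)[:, k - 1]
mc_cdf = np.mean(Yk <= y)

# Formula (75): sum over ell >= k of (n choose ell) * P(X > y)^(n-ell) * P(X <= y)^ell.
F = norm.cdf(y)
exact_cdf = sum(comb(n, ell) * (1 - F) ** (n - ell) * F ** ell for ell in range(k, n + 1))

print(mc_cdf, exact_cdf)                 # the two numbers should be close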

Exercise 2.3.3. Let X1, X2, · · · , X5 be i.i.d. Exp(λ) RVs. Recall that the PDF of Exp(λ) RVs is given by

fX (t ) =λe−λt 1(t ≥ 0). (77)

Let Y1 < Y2 < ·· · < Y5 denote the order statistics of X1, · · · , X5.

(i) Compute the CDFs of Y1 < ··· < Y5.
(ii) Compute the PDFs of Y1 < ··· < Y5.
(iii) Plot the PDFs of Y1 < ··· < Y5 assuming λ = 1.


4. Order statistics and the q-q plot

Suppose for an unknown RV X, we have a sample x1, ..., xn with sample mean x̄ = 3.4 and sample variance s² = 1.3. It is probably reasonable to claim that E[X] ≈ 3.4 and Var(X) ≈ 1.3 (we will learn in what precise sense we can do this). Furthermore, if we would like to claim that X is in fact a normal RV with mean 3.4 and variance 1.3, how do we test this claim and quantitatively see whether it is reasonable? Moreover, if the sample size n is small, then the sample mean and variance may not be good approximations of their population counterparts. In this case, can we still test whether our sample arises from a normal distribution?

The core problem here is to compare the empirical distribution of X with a normal distribution with unknown mean and variance. A widely used technique in EDA for such distribution comparisons is the quantile-quantile plot (or q-q plot). The basic idea is to compare the order statistics of the sample with the quantiles of the hypothetical distribution. Namely, let y1 < ··· < yn denote the order statistics of the sample x1, ..., xn. For each integer 1 ≤ r ≤ n, let pr = r/(n + 1). Then according to (34), we have

yr = πpr . (78)

Suppose X has a distribution close to N(µ, σ²). If we denote by qp the quantile of order p of this hypothetical distribution, then we should have

yr = πpr ≈ qpr for all 1 ≤ r ≤ n. (79)

In this case, the scatter plot of the points (yr, qpr) will lie close to the line y = x.

But what if we do not know µ and σ²? The following exercise shows that we can simply use N(0,1) instead of N(µ, σ²), and see whether the resulting points of percentiles lie on a straight line.

Exercise 2.4.1. Let Z ∼ N (0,1) and X ∼ N (µ,σ2). Show the following.

(i) σZ + µ ∼ N(µ, σ²).
(ii) P(X ≤ t) = P(Z ≤ (t − µ)/σ).
(iii) Let qp be the quantile of X of order p, i.e., P(X ≤ qp) = p. Recall that for Z ∼ N(0,1), zα is defined so that P(Z > zα) = α. Then use (ii) to derive that

qp =µ+σz1−p . (80)

That is, the points (z1−p , qp ) lie on the line y =µ+σx.

Example 2.4.2 (Excerpted from [HTZ77]). Suppose for an unknown RV X, we have the following sample of size n = 30:

1.24 1.36 1.28 1.31 1.35 1.20 1.39 1.35 1.41 1.31 1.28 1.26 1.37 1.49 1.32

1.40 1.33 1.28 1.25 1.39 1.38 1.34 1.40 1.27 1.33 1.36 1.43 1.33 1.29 1.34

We can compute x̄ = 1.33 and s² = 0.0040. From this, can we conclude that X approximately follows N(1.33, 0.0040)? To help answer this question, we shall construct a q-q plot of the standard normal percentiles that correspond to p = 1/31, 2/31, ..., 30/31 versus the ordered observations.

Figure 2 shows three q-q plots for the sample in this example against three hypothetical distributions: the t-distribution with 20 degrees of freedom, N(0,1), and N(3,5). According to the q-q plots, can we conclude that X is approximately normal? N


FIGURE 2. q-q plot of the sample in this example against three hypothetical distributions.
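A normal q-q plot of this kind can be produced in a few lines. The sketch below is my addition (it uses numpy, scipy, and matplotlib, and simulated data in place of the sample above): it plots the ordered observations against the standard normal percentiles z_{1−p_r} with p_r = r/(n + 1); by Exercise 2.4.1 the points should fall near a line with slope σ and intercept µ when the data are approximately normal.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(loc=1.33, scale=0.063, size=30)   # simulated stand-in for the sample of size 30

n = len(x)
y = np.sort(x)                                   # order statistics y_1 <= ... <= y_n
p = np.arange(1, n + 1) / (n + 1)                # p_r = r / (n + 1)
z = norm.ppf(p)                                  # standard normal quantiles (these equal z_{1-p_r})

slope, intercept = np.polyfit(z, y, 1)           # rough estimates of sigma and mu
plt.scatter(z, y)
plt.plot(z, intercept + slope * z, color="red")
plt.xlabel("standard normal percentiles")
plt.ylabel("ordered observations")
plt.show()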


CHAPTER 3

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a classical notion in statistics, which has a prominent use in the context of modern machine learning. Arguably it is one of the most important concepts that we will learn in 170S.

1. Definition and Examples of MLE

We have mentioned that the core problem in statistics is to infer an unknown RV X from its sample values x1, ..., xn. When we have absolutely no knowledge of X, we can apply the methods of EDA to the sample seen in the previous sections, using the sample mean, variance, quantiles, etc. However, in many cases, we can narrow down our inference problem by imposing a probabilistic (statistical) model for X. Namely, say we know X is a Bernoulli RV with unknown success probability p. We then only need to estimate the parameter p using our sample, not the entire distribution of X. MLE gives a systematic approach to this parameter estimation problem. Below we give the pipeline of MLE.

Pipeline of MLE.
Input: An unknown RV X with distribution fX;θ, parameterized by θ in a parameter space Ω, together with sample values x1, x2, ..., xn.
Objective: Obtain an estimate θ̂ of θ using the sample.
Method: Choose θ̂ ∈ Ω so that it maximizes the following likelihood function:

L(x1, ..., xn; θ̂) = ∏_{i=1}^n fX;θ̂(xi). (81)

Here the RV θ̂(X1, ..., Xn) is called the maximum likelihood estimator of θ, and its observed value θ̂(x1, ..., xn) is called the maximum likelihood estimate of θ.

What is the reasoning behind maximizing the likelihood function as above? The underlying assumption is that you are seeing the sample values x1, ..., xn because they were the most likely ones to see! Namely, suppose we have designed the sampling procedure well enough that we get i.i.d. samples X1, ..., Xn, each with distribution fX;θ. Then we have

L(x1, ..., xn; θ̂) = P(X1 = x1, ..., Xn = xn; θ̂). (82)

Namely, the likelihood function on the LHS is the probability of observing the sequence of specific sample values x1, ..., xn under the i.i.d. assumption and assuming the parameter value θ̂. Hence, MLE chooses θ̂ to be the value of the parameter in Ω under which the probability of obtaining the current sample is maximized.

Example 3.1.1 (MLE for Bernoulli RVs). Suppose X ∼ Bernoulli(p) for some unknown p ∈ [0,1]. Suppose we have sample values x1, ..., xn obtained from an i.i.d. sampling of X. Denote x̄ = n^{−1} ∑_{i=1}^n xi. We will show that the MLE for p is X̄, that is,

p̂ = X̄. (83)

Let X1, · · · , Xn be i.i.d. with distribution Bernoulli(p). Note that

P(Xi = xi; p) = { p if xi = 1; 1 − p if xi = 0 } (84)


= p^{xi} (1 − p)^{1−xi}. (85)

Hence the likelihood function is given by

L(x1, ..., xn; p) = ∏_{i=1}^n P(Xi = xi; p) (86)
= ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} (87)
= p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi}. (88)

In order to maximize the likelihood function, it suffices to maximize its logarithm¹. The log likelihood function is given by

l(x1, ..., xn; p) := log L(x1, ..., xn; p) = n x̄ log p + (n − n x̄) log(1 − p). (89)

To maximize the log likelihood function, we take the partial derivative in p:

∂l(x1, ..., xn; p)/∂p = n x̄/p − (n − n x̄)/(1 − p). (90)

Setting this equal to zero, we obtain

x̄/p = (1 − x̄)/(1 − p). (91)

Rearranging, we obtain p = x̄. Hence we have (83) as desired. N
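The conclusion p̂ = x̄ can be double-checked numerically. The sketch below (mine, not from the notes) evaluates the log likelihood (89) on a grid of p values for a toy sample and confirms that the maximizer is essentially the sample mean.

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])          # toy Bernoulli sample
n, xbar = len(x), x.mean()

p_grid = np.linspace(0.001, 0.999, 9999)
# Log likelihood (89): l(p) = n*xbar*log(p) + (n - n*xbar)*log(1 - p).
loglik = n * xbar * np.log(p_grid) + (n - n * xbar) * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(loglik)]
print(p_hat_grid, xbar)     # grid maximizer agrees with the sample mean up to grid resolution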

Exercise 3.1.2 (MLE for Binomial RV). Let X ∼ Binomial(n, p) where n is known but p is not.

(i) Write x = x1 + ··· + xm. Show that the log likelihood function is given by

l(x1, ..., xm; p) = (∑_{i=1}^m log (n choose xi)) + x log p + (mn − x) log(1 − p). (92)

(ii) Show that the MLE for p is X̄/n.

Exercise 3.1.3 (MLE for Geometric RV). Let X ∼ Geom(p), which is a discrete RV with PMF P(X = k) = (1 − p)^{k−1} p, k = 1, 2, 3, ....

(i) Show that the log likelihood function is given by

l(x1, ..., xn; p) = n log p + ((∑_{i=1}^n xi) − n) log(1 − p). (93)

(ii) Show that the MLE for p is 1/X̄.²

Exercise 3.1.4 (MLE for Poisson RV). Let X ∼ Poisson(λ), which is a discrete RV with PMF P(X = k) = λ^k e^{−λ}/k!, k = 0, 1, 2, ....

(i) Show that the log likelihood function is given by

l(x1, ..., xn; λ) = (log λ) ∑_{i=1}^n xi − nλ − log(x1! x2! ··· xn!). (94)

(ii) Show that the MLE for λ is X̄.³

¹ Since log is a strictly increasing function.
² Recall that E[Geom(p)] = 1/p, so we are saying the MLE for the population mean is X̄.
³ Recall that E[Poisson(λ)] = λ, so we are saying the MLE for the population mean is X̄.


Example 3.1.5 (MLE for Exponential RVs). Suppose X ∼ Exp(λ) for some unknown λ > 0. We have sample values x1, ..., xn obtained from an i.i.d. sampling of X. Denote x̄ = n^{−1} ∑_{i=1}^n xi. We will show that the MLE for λ is 1/X̄, that is,

1/λ̂ = X̄. (95)

Let X1, · · · , Xn be i.i.d. with distribution Exp(λ), which have the following PDF

fX (x;λ) =λe−λx 1(x ≥ 0). (96)

Noting that x1, · · · , xn ≥ 0, it follows that the likelihood function is given by

L(x1, ..., xn; λ) = ∏_{i=1}^n fX(xi; λ) (97)
= λ^n ∏_{i=1}^n e^{−λxi} = λ^n e^{−λ ∑_{i=1}^n xi} = λ^n e^{−λn x̄}. (98)

So the log likelihood function is given by

l(x1, ..., xn; λ) = n log λ − λn x̄. (99)

To maximize the log likelihood function, we take the partial derivative in λ:

∂l(x1, ..., xn; λ)/∂λ = n/λ − n x̄. (100)

Setting this equal to zero, we obtain 1/λ = x̄. Hence 1/λ̂ = X̄, as desired. N

Example 3.1.6 (MLE for Normal RVs). Suppose X ∼ N(µ, σ²) for some unknown µ ∈ R and σ > 0. Let X1, ..., Xn be i.i.d. with distribution N(µ, σ²). Denote the sample mean and the variance of the empirical distribution by

X̄ = n^{−1} ∑_{i=1}^n Xi, V = n^{−1} ∑_{i=1}^n (Xi − X̄)². (101)

We will show that the MLEs for µ and σ² are given by the sample mean and the variance of the empirical distribution:

µ̂ = X̄, σ̂² = V. (102)

Recall that N (µ,σ2) has the following PDF

fX(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)). (103)

It follows that the likelihood function is given by

L(x1, ..., xn; µ, σ²) = ∏_{i=1}^n fX(xi; µ, σ²) (104)
= (2πσ²)^{−n/2} ∏_{i=1}^n exp(−(xi − µ)²/(2σ²)) (105)
= (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^n (xi − µ)²). (106)

So the log likelihood function is given by

l(x1, ..., xn; µ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)². (107)


To maximize the log likelihood function, we take partial derivatives in µ and σ2:

∂l(x1, ..., xn; µ, σ²)/∂µ = (1/σ²) ∑_{i=1}^n (xi − µ) = (n/σ²)(x̄ − µ), (108)

∂l(x1, ..., xn; µ, σ²)/∂σ² = −(n/2)(1/σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)². (109)

Setting these equations equal to zero, we obtain

x̄ = µ and −n + (1/σ²) ∑_{i=1}^n (xi − µ)² = 0. (110)

Substituting the first equation into the second, we obtain

σ² = (1/n) ∑_{i=1}^n (xi − x̄)². (111)

This shows (102). N
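The closed-form MLEs (102) can also be recovered by a generic numerical optimizer, which is a useful check when no closed form exists. The sketch below is my addition (it uses scipy.optimize, which the notes do not rely on): it minimizes the negative log likelihood (107) and compares the result with x̄ and the empirical variance.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)           # simulated N(3, 4) sample

def neg_loglik(params):
    mu, log_sigma2 = params                            # optimize log(sigma^2) so that sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    # Negative of eq. (107).
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, x.mean())                                # both approximate the MLE mu_hat = xbar
print(sigma2_hat, np.mean((x - x.mean()) ** 2))        # both approximate sigma2_hat = V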

Exercise 3.1.7. Consider a RV X with the following PDF

fX;θ(x) = (θ/(1 − θ)) x^{(2θ−1)/(1−θ)} 1(0 < x ≤ 1). (112)

(i) Show that E[X] = θ.
(ii) If X1, ..., Xn are i.i.d. from the same distribution fX;θ above, find the MLE θ̂ for θ.
(iii) Conclude that the MLE of the population mean is not always the sample mean.

2. Unbiased estimators

Suppose we have a RV X whose distribution has an unknown parameter θ. Let X1, ..., Xn be i.i.d. copies of X. In the previous section, we have seen various examples where the method of MLE gives a function u such that u(X1, ..., Xn) is an estimator of the parameter θ (e.g., u(X1, ..., Xn) = X̄). We say this estimator is unbiased if

E[u(X1, · · · , Xn)] = θ. (113)

Exercise 3.2.1. Let X1, · · · , Xn be i.i.d. copies of a RV X of unknown expectation E[X ] =µ. Show that

u(X1, ..., Xn) = X̄ (114)

is an unbiased estimator of µ.

Example 3.2.2 (MLE for Uniform RV). Let X1, ..., Xn be i.i.d. samples from the Uniform([0, θ]) distribution, and let Y1 < ··· < Yn be their order statistics. Recall that their PDF is given by

fX(x) = (1/θ) 1(x ∈ [0, θ]). (115)

Hence the likelihood function is given by

L(x1, · · · , xn ;θ) = θ−n1(x1, · · · , xn ∈ [0,θ]). (116)

In order to maximize this, we need to choose θ as small as possible. But since the xi's have to be between 0 and θ, we need to have θ ≥ max(x1, ..., xn). Thus the MLE θ̂ for θ is

θ̂ = max(X1, ..., Xn) = Yn. (117)

Is this an unbiased estimator for θ? Recall from the previous section that

P(Yn ≤ y) =P(X1, · · · , Xn ≤ y) =P(X1 ≤ y)n = (y/θ)n1(y ∈ [0,θ]). (118)


Hence the PDF of Yn is

fYn(y) = (d/dy) P(Yn ≤ y) = n θ^{−n} y^{n−1} 1(y ∈ [0, θ]). (119)

It follows that

E[Yn] = ∫_0^θ n θ^{−n} y^n dy = n θ^{−n} [y^{n+1}/(n + 1)]_0^θ = (n/(n + 1)) θ. (120)

Hence Yn is not an unbiased estimator of θ. However, ((n + 1)/n) Yn is an unbiased estimator of θ, since

E[((n + 1)/n) Yn] = ((n + 1)/n) E[Yn] = ((n + 1)/n) · (n/(n + 1)) θ = θ. (121)

In particular, the MLE is not always an unbiased estimator. N
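The bias computed in (120)-(121) is easy to see in simulation. The following sketch (mine, not part of the notes) estimates E[Yn] and E[((n + 1)/n) Yn] for Uniform([0, θ]) samples with θ = 1.

import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 1.0, 5, 200_000

samples = rng.uniform(0.0, theta, size=(trials, n))
Yn = samples.max(axis=1)                       # MLE theta_hat = max(X_1, ..., X_n)

print(Yn.mean(), n / (n + 1) * theta)          # E[Y_n] is close to (n/(n+1)) * theta, eq. (120)
print(((n + 1) / n * Yn).mean(), theta)        # the corrected estimator is (nearly) unbiased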

Exercise 3.2.3 (Pythagorean theorem for variances). Let X1, ..., Xn be i.i.d. samples from some distribution with mean µ and finite variance. Let X̄ = n^{−1} ∑_{i=1}^n Xi and S² = (n − 1)^{−1} ∑_{i=1}^n (Xi − X̄)².

(i) Show that

∑_{i=1}^n (Xi − µ)² = ∑_{i=1}^n (Xi − X̄ + (X̄ − µ))² (122)
= ∑_{i=1}^n (Xi − X̄)² + 2 ∑_{i=1}^n (Xi − X̄)(X̄ − µ) + n(X̄ − µ)² (123)
= [∑_{i=1}^n (Xi − X̄)²] + n(X̄ − µ)². (124)

(ii) From (i), deduce that

(n − 1)S² = [∑_{i=1}^n (Xi − µ)²] − n(X̄ − µ)². (125)

Exercise 3.2.4 (Sample variance is unbiased). Let X1, ..., Xn be i.i.d. samples from some distribution with mean µ and finite variance σ². Let X̄ = n^{−1} ∑_{i=1}^n Xi and S² = (n − 1)^{−1} ∑_{i=1}^n (Xi − X̄)². We will show that the sample variance S² is an unbiased estimator of the population variance σ².

(i) Show that

E[∑_{i=1}^n (Xi − X̄)(X̄ − µ)] = 0. (126)

(ii) Show that

E[(X̄ − µ)²] = E[n^{−2} (∑_{i=1}^n (Xi − µ))²] = n^{−1} σ². (127)

(iii) Use Exercise 3.2.3 and show that

E[∑_{i=1}^n (Xi − µ)²] = E[∑_{i=1}^n (Xi − X̄)²] + n E[(X̄ − µ)²]. (128)

(iv) From (ii) and (iii), deduce that

E[∑_{i=1}^n (Xi − X̄)²] = (n − 1)σ². (129)

(v) From (iv), show that

E[S2]=σ2. (130)


Exercise 3.2.5 (Sample variance is consistent). Let X1, ..., Xn be i.i.d. samples from some distribution with mean µ and finite variance σ². Let X̄ = n^{−1} ∑_{i=1}^n Xi and S² = (n − 1)^{−1} ∑_{i=1}^n (Xi − X̄)². We will show that the sample variance S² is a consistent estimator of σ², that is,

lim_{n→∞} S² = σ² in probability. (131)

(This justifies the heuristic of approximating the unknown σ² with the observed value s² of S².)

(i) Use the Strong Law of Large Numbers to deduce that

P(lim_{n→∞} n^{−1} ∑_{i=1}^n (Xi − µ)² = σ²) = 1, and P(lim_{n→∞} (X̄ − µ)² = 0) = 1. (132)

(ii) Use Exercise 3.2.3 to write

S² = (n/(n − 1)) [(1/n) ∑_{i=1}^n (Xi − µ)²] − (n/(n − 1)) (X̄ − µ)². (133)

Use (i) to deduce that

P(lim_{n→∞} S² = σ²) = 1. (134)

(You may use the fact that if Xn → x and Yn → y almost surely as n → ∞, then Xn + Yn → x + y and XnYn → xy almost surely as n → ∞.)

(iii) From (ii), deduce (131). (You may use the fact that almost sure convergence implies convergence in probability.)

Example 3.2.6 (MLE for normal RV revisited). From Example 3.1.6, we have seen that the MLEs for the mean µ and variance σ² of N(µ, σ²) RVs are given by

µ̂ = X̄, σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄)². (135)

From Exercise 3.2.1, we know that X̄ is an unbiased estimator of µ. However, what about the variance? According to Exercise 3.2.4 (iv), we have

E[(1/n) ∑_{i=1}^n (Xi − X̄)²] = ((n − 1)/n) σ². (136)

Hence the MLE for σ2 is a biased estimator. If we had n −1 instead of n, we would have

E[(1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²] = (1/(n − 1)) (n − 1)σ² = σ². (137)

Hence the sample variance S2 is an unbiased estimator of σ2, as we have also seen in Exercise 3.2.4. N

3. Method of moments

Before the method of Maximum Likelihood Estimation was popularized by R. A. Fisher between 1912 and 1922, a different method for parameter estimation, called the method of moments, was widely used. It still carries some advantages over MLE nowadays, especially in its computational efficiency. Recall that the kth moment of a RV X is E[X^k]. Basically, the method of moments is to set

kth sample moment = kth population moment. (138)

Example 3.3.1 (Gamma distribution). A continuous random variable X is a Gamma RV with parameters α, β if it has the following PDF

fX;α,β(x) = (β^α/Γ(α)) x^{α−1} e^{−βx} 1(x > 0), (139)


where Γ(x) is the Gamma function defined by

Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt. (140)

In this case we write X ∼ Gamma(α, β).

If X1, ..., Xn are i.i.d. samples of X ∼ Gamma(α, β), then the likelihood function is

L(x1, ..., xn; α, β) = [β^α/Γ(α)]^n (x1 x2 ··· xn)^{α−1} exp(−β ∑_{i=1}^n xi). (141)

Now maximizing the above function is computationally difficult, especially due to the presence of the Gamma function in the parameters.

Instead, using the fact that Gamma(α, β) has mean α/β and variance α/β², the method of moments gives

α/β = X̄, α/β² = V, (142)

where V = (1/n) ∑_{i=1}^n (Xi − X̄)² is the variance of the empirical distribution. This gives the following method of moments estimators:

α̂ = X̄²/V, β̂ = X̄/V. (143)

N
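Here is a quick numerical illustration of the estimators (143). This sketch is my addition; it assumes that numpy's gamma sampler is parameterized by shape α and scale 1/β, which is a convention of the library rather than anything stated in the notes.

import numpy as np

rng = np.random.default_rng(4)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=50_000)

xbar = x.mean()
V = np.mean((x - xbar) ** 2)                 # variance of the empirical distribution

alpha_hat = xbar ** 2 / V                    # method of moments, eq. (143)
beta_hat = xbar / V

print(alpha_hat, beta_hat)                   # should be close to (3.0, 2.0)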

Exercise 3.3.2. Show that Maximum Likelihood Estimation and the Method of Moments give the same estimators for N(µ, σ²).


CHAPTER 4

Linear regression

Suppose we have a data set consisting of the pairs of numbers (x1, y1), ..., (xn, yn). Think of yk as the price of a stock you are holding and xk as the NASDAQ composite index at time k. You want to obtain a simple functional relationship explaining the dependent variable yk in terms of the independent variable xk. If we think of the data as a sample for the pair of random variables (X, Y), then what you want to compute is the 'best guess' on Y given X. This is well known to be nothing but the conditional expectation E[Y | X]. Below we give a brief recap of conditional expectation and variance.

1. Conditional expectation and variance

Let X, Y be discrete RVs. Recall that the expectation E[Y] is the 'best guess' on the value of Y when we do not have any prior knowledge of Y. But suppose we have observed that some possibly related RV X takes the value x. What should be our best guess on Y, leveraging this added information? This is called the conditional expectation of Y given X = x, which is defined by

E[Y | X = x] = ∑_y y P(Y = y | X = x). (144)

This best guess on Y given X = x, of course, depends on x. So it is a function of x. Now if we do not know what value X might take, then we plug in X for x, and E[Y | X] becomes a RV, which is called the conditional expectation of Y given X.

As we have defined the conditional expectation, we can define the variance of a RV Y given that another RV X takes a particular value. Recall that the (unconditioned) variance of Y is defined by

Var(Y) = E[(Y − E(Y))²]. (145)

Note that there are two places where we take an expectation. Given X, we should improve both expectations, so the conditional variance of Y given X is defined by

Var(Y |X ) = E[(Y −E[Y |X ])2 |X ]. (146)

Exercise 4.1.1. Let X and Y be RVs. Then show that

Var(Y |X ) = E[Y 2 |X ]−E[Y |X ]2. (147)

The following exercise explains in what sense the conditional expectation E[Y | X] is the best guess on Y given X, and shows that the minimum possible mean squared error is exactly the conditional variance Var(Y | X).

Exercise 4.1.2. Let X, Y be RVs. For any function g : R → R, consider g(X) as an estimator of Y. Let E[(Y − g(X))² | X] be the mean squared error.

(i) Show that

E[(Y − g (X ))2 |X ] = E[Y 2 |X ]−2g (X )E[Y |X ]+ g (X )2 (148)

= (g (X )−E[Y |X ])2 +E[Y 2 |X ]−E[Y |X ]2 (149)

= (g (X )−E[Y |X ])2 +Var(Y |X ). (150)

(ii) Conclude that the mean squared error is minimized when g(X) = E[Y | X] and the global minimum is Var(Y | X).


2. A simple linear regression problem

Suppose we have two unknown, dependent RVs X and Y whose relationship we want to figure out. In order to do this, we have i.i.d. random samples (X1, Y1), ..., (Xn, Yn), each with the same distribution as (X, Y). When we have to guess a functional relationship between two RVs Y and X, our first choice should be a linear function. We will take our probabilistic model of Y in terms of X as follows:

Yi |Xi =α+βXi +εi , (151)

where εi ∼ N(0, σ²) denotes independent Gaussian 'noise'. Namely, given that we observe the value of Xi, we think of Yi as the linear transform α + βXi plus some Gaussian noise εi occurring whenever we make a measurement. This probabilistic model has three parameters, α, β, and σ². We will use MLE to obtain estimators for each of them.

Proposition 4.2.1. The MLEs for the parameters α, β, and σ² in the linear model in (151) are given by

α̂ = Ȳ − β̂ X̄ (152)

β̂ = [(∑_{i=1}^n Xi Yi) − n^{−1} (∑_{i=1}^n Yi)(∑_{i=1}^n Xi)] / [(∑_{i=1}^n Xi²) − n^{−1} (∑_{i=1}^n Xi)²] = ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ) / ∑_{i=1}^n (Xi − X̄)² (153)

σ̂² = (1/n) ∑_{i=1}^n (Yi − α̂ − β̂ Xi)². (154)

PROOF. First, observe that the Yi | Xi = xi's are independent with distribution N(α + βxi, σ²). Thus the likelihood function is

L(x1, y1, ..., xn, yn; α, β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(yi − α − βxi)²/(2σ²)) (155)
= (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^n (yi − α − βxi)²). (156)

So the log likelihood function is

l(x1, y1, ..., xn, yn; α, β, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (yi − α − βxi)². (157)

In order to maximize this, we differentiate it with respect to α, β, and σ2. This gives

∂l/∂α = σ^{−2} ∑_{i=1}^n (yi − α − βxi) = 0 (158)

∂l/∂β = σ^{−2} ∑_{i=1}^n xi (yi − α − βxi) = 0 (159)

∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (yi − α − βxi)² = 0. (160)

These equations yield

(∑_{i=1}^n yi) − n α̂ − β̂ (∑_{i=1}^n xi) = 0,
(∑_{i=1}^n xi yi) − α̂ (∑_{i=1}^n xi) − β̂ (∑_{i=1}^n xi²) = 0,
n σ̂² − ∑_{i=1}^n (yi − α̂ − β̂ xi)² = 0. (161)

Combining the first two, we get

(∑_{i=1}^n xi yi) − [n^{−1} (∑_{i=1}^n yi) − β̂ n^{−1} (∑_{i=1}^n xi)] (∑_{i=1}^n xi) − β̂ (∑_{i=1}^n xi²) = 0, (162)


which yields

β̂ [(∑_{i=1}^n xi²) − n^{−1} (∑_{i=1}^n xi)²] = (∑_{i=1}^n xi yi) − n^{−1} (∑_{i=1}^n yi)(∑_{i=1}^n xi). (163)

This gives the first desired expression for β̂. The second expression for β̂ is easy to verify from the first by simply expanding the products and sums. Then, using the first and last equations in (161), we get the MLEs for α and σ², as desired.
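Proposition 4.2.1 translates directly into code. The following sketch is my addition (it uses synthetic data, not anything from the notes): it computes α̂, β̂, and σ̂² from formulas (152)-(154) and cross-checks the result against numpy's least squares routine.

import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=n)       # true alpha = 1.5, beta = 0.8, sigma = 0.5

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # eq. (153)
alpha_hat = ybar - beta_hat * xbar                                     # eq. (152)
sigma2_hat = np.mean((y - alpha_hat - beta_hat * x) ** 2)              # eq. (154)

# Cross-check with the least squares solution B = (X^T X)^{-1} X^T Y of Exercise 4.2.2.
X = np.column_stack([np.ones(n), x])
B, *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, beta_hat, sigma2_hat)
print(B)                                             # first entry ~ alpha_hat, second ~ beta_hat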

Exercise 4.2.2 (Method of Least Squares). Suppose we have data samples (x1, y1), ..., (xn, yn). Let y = (y1, ..., yn)^T and x = (x1, ..., xn)^T. We would like to find the best approximation of the vector y by a linear transform of x. Namely, we want to find α and β such that

(y1, ..., yn)^T ≈ (α, ..., α)^T + (βx1, ..., βxn)^T = [1 x1; ...; 1 xn] (α, β)^T, (164)

where [1 x1; ...; 1 xn] denotes the n × 2 matrix whose ith row is (1, xi).

Write the above matrix equation as Y ≈ XB. The least squares estimator B̂ for this problem is obtained by solving the following optimization problem:

minimize ‖Y − XB‖₂² (165)
subject to B ∈ R^{2×1}, (166)

where ‖·‖₂ denotes the Euclidean norm. (cf. For matrix algebra and calculus, see The Matrix Cookbook.)

(i) Show that (165) is equivalent to the following optimization problem:

minimize ∑_{i=1}^n (yi − α − βxi)² (167)
subject to α, β ∈ R. (168)

(ii) Show that

‖Y − XB‖₂² = (Y − XB)^T (Y − XB) (169)
= Y^T Y − 2Y^T XB + B^T X^T XB. (170)

(iii) Show that

∂_B ‖Y − XB‖₂² = 2X^T XB − 2X^T Y. (171)

Conclude that the solution of the optimization problem in this exercise is given by

B̂ = (X^T X)^{−1} X^T Y. (172)

Verify that this gives

α̂ = ȳ − β̂ x̄ (173)

β̂ = [(∑_{i=1}^n xi yi) − n^{−1} (∑_{i=1}^n yi)(∑_{i=1}^n xi)] / [(∑_{i=1}^n xi²) − n^{−1} (∑_{i=1}^n xi)²] = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)². (174)

Exercise 4.2.3. Let α̂, β̂, and σ̂² be the MLEs for the parameters α, β, and σ² in the simple linear regression model

Y | X = α + βX + ε, (175)

where ε ∼ N(0, σ²) is independent Gaussian noise. (See Proposition 4.2.1.) Suppose we are given a sample X1 = x1, ..., Xn = xn.


(i) Using the fact that Yi |Xi = xi is a N (α+βxi ,σ2) RV, show that

E[β̂(x1, ..., xn)] = β, Var(β̂(x1, ..., xn)) = σ² / ∑_{i=1}^n (xi − x̄)². (176)

(ii) Use Proposition 4.2.1 and (i) to show that

E[α̂(x1, ..., xn)] = α, Var(α̂(x1, ..., xn)) = σ² (1/n + x̄² / ∑_{i=1}^n (xi − x̄)²). (177)

Exercise 4.2.4 (Excerpted from [HTZ77]). The midterm and final exam scores of 10 students in a statistics course are tabulated as shown below.

Midterm: 70 74 80 84 80 67 70 64 74 82
Final: 87 79 88 98 96 73 83 79 91 94 (178)

(i) Calculate the least squares regression line for these data. (See Proposition 4.2.1 and Exercise 4.2.2.)
(ii) Plot the points and the least squares regression line on the same graph.
(iii) Find the value of σ̂² (see Proposition 4.2.1).


CHAPTER 5

Sufficient Statistics

1. Definition and motivating example

Let X1, ..., Xn be i.i.d. copies of a RV X whose distribution has an unknown parameter θ. We may construct an estimator Y = u(X1, ..., Xn) for θ. So far we have seen two ways to do so, Maximum Likelihood Estimation and the Method of Moments, but there are many others. To be the most effective, we would like to extract as much information as possible from the given sample X1, ..., Xn. For example, if our estimator uses only the first three samples X1, X2, X3, then it is probably not the most informative one, and we should be making use of all the other samples.

Roughly speaking, an estimator (or statistic) Y = u(X1, ..., Xn) for a parameter θ is called a sufficient statistic for θ if there is no other statistic v(X1, ..., Xn) that provides any additional information about θ.

Definition 5.1.1. Let X1, ..., Xn be i.i.d. RVs whose distribution has a parameter θ. A statistic T = T(X1, ..., Xn) for θ is sufficient if the following conditional probability

Pθ(X1 = x1, ..., Xn = xn | T = t) (179)

does not depend on θ.

Note that the conditional probability in (179) is typically a function of both t and θ. In the following motivating example, we will see that in some cases, knowing the value of T gets rid of the dependence on θ.

Example 5.1.2 (Sufficient statistic for Bernoulli RV). Let X1, ..., Xn be i.i.d. RVs with distribution Bernoulli(p). Recall that we can write down the full joint distribution of the n samples:

f(x1, ..., xn; p) = ∏_{i=1}^n P(Xi = xi; p) (180)
= ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} (181)
= p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi}. (182)

From Example 3.1.1, we have seen that the sample mean X̄ is the MLE for p.

Let Y := X1 + ··· + Xn = nX̄. Suppose we know that Y = y (and hence X̄ = y/n). Is it possible to use the values of the sample X1, ..., Xn in some clever way to extract further information about p? The answer is negative. To see this, let us compute the conditional joint distribution of (X1, ..., Xn) on the event {Y = y}: for any x1, ..., xn such that ∑_{i=1}^n xi = y,

P(X1 = x1, ..., Xn = xn | nX̄ = y) = P(X1 = x1, ..., Xn = xn) / P(nX̄ = y) (183)
= p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi} / [(n choose y) p^y (1 − p)^{n−y}] (184)
= p^y (1 − p)^{n−y} / [(n choose y) p^y (1 − p)^{n−y}] (185)


= 1/(n choose y). (186)

Indeed, the conditional joint distribution does not depend on the parameter p, so it does not contain any additional information about p. Hence Y (and hence X̄) is a sufficient statistic for p. N
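The conclusion of Example 5.1.2 can also be seen empirically: conditioned on the sum, every arrangement of the sample is equally likely, no matter what p is. The sketch below is my addition (not from the notes); it estimates the conditional probability of one particular arrangement for two different values of p.

import numpy as np
from math import comb

rng = np.random.default_rng(6)
n, y = 5, 2
target = (1, 1, 0, 0, 0)                       # one particular arrangement with sum y

for p in (0.3, 0.8):
    samples = (rng.random((400_000, n)) < p).astype(int)
    on_event = samples[samples.sum(axis=1) == y]          # condition on the event {sum = y}
    hits = np.all(on_event == np.array(target), axis=1).mean()
    print(p, hits, 1 / comb(n, y))             # both conditional frequencies are close to 1/C(n, y) = 0.1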

2. Sufficient statistics by factorization theorem

The following factorization theorem, due to Neyman and Fisher, provides a convenient way to compute sufficient statistics, which we will illustrate through a number of examples.

Theorem 5.2.1 (Neyman-Fisher). Let X1, ..., Xn denote RVs with joint distribution f(x1, ..., xn; θ), which depends on the parameter θ. A statistic T = T(X1, ..., Xn) is sufficient for θ if and only if there exist functions φ and g such that

f(x1, ..., xn; θ) = φ(T(x1, ..., xn); θ) g(x1, ..., xn), (187)

where g does not depend on θ.

PROOF. (Optional*) We prove the assertion when the Xi's are discrete RVs; the argument for the continuous case is the same. First suppose T = T(X1, · · · , Xn) is a sufficient statistic for θ. Then, writing t = T(x1, · · · , xn), we have
\[
P_\theta(X_1 = x_1, \dots, X_n = x_n) = P_\theta(T = t)\, P_\theta(X_1 = x_1, \dots, X_n = x_n \mid T = t). \tag{188}
\]
Note that the conditional probability on the right hand side does not depend on θ, since T is sufficient for θ. Thus the above gives the desired factorization in (187).

Conversely, suppose (187) holds. Then
\[
\begin{aligned}
P_\theta(T = t) &= \sum_{\substack{x_1, \dots, x_n \\ T(x_1,\dots,x_n) = t}} P_\theta(X_1 = x_1, \dots, X_n = x_n) && (189)\\
&= \sum_{\substack{x_1, \dots, x_n \\ T(x_1,\dots,x_n) = t}} \varphi(t; \theta)\, g(x_1, \dots, x_n) && (190)\\
&= \varphi(t; \theta) \sum_{\substack{x_1, \dots, x_n \\ T(x_1,\dots,x_n) = t}} g(x_1, \dots, x_n). && (191)
\end{aligned}
\]
Thus we have
\[
\begin{aligned}
P_\theta(X_1 = x_1, \dots, X_n = x_n \mid T = t) &= \frac{P_\theta(X_1 = x_1, \dots, X_n = x_n, T = t)}{P_\theta(T = t)} && (192)\\
&= \frac{P_\theta(X_1 = x_1, \dots, X_n = x_n)}{P_\theta(T = t)} && (193)\\
&= \frac{\varphi(t; \theta)\, g(x_1, \dots, x_n)}{\varphi(t; \theta) \sum_{T(x_1,\dots,x_n) = t} g(x_1, \dots, x_n)} && (194)\\
&= \frac{g(x_1, \dots, x_n)}{\sum_{T(x_1,\dots,x_n) = t} g(x_1, \dots, x_n)}. && (195)
\end{aligned}
\]

Hence T = T (X1, · · · , Xn) is a sufficient statistic for θ, as desired.

Example 5.2.2 (Sufficient statistic for Poisson RV). Let X1, · · · , Xn be i.i.d. with distribution Poisson(λ). In order to compute a sufficient statistic for λ, we write down the full joint distribution of the n samples:
\[
\begin{aligned}
f(x_1, \dots, x_n; \lambda) &= \prod_{i=1}^n f_{X_i}(x_i; \lambda) && (196)\\
&= \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} && (197)\\
&= \Big(\lambda^{\sum_{i=1}^n x_i}\, e^{-n\lambda}\Big)\Big(\frac{1}{x_1!\, x_2! \cdots x_n!}\Big). && (198)
\end{aligned}
\]
Hence any constant multiple of ∑_{i=1}^n Xi is a sufficient statistic for λ. Recall that X̄ is also an unbiased estimator of E[Poisson(λ)] = λ. Hence it is natural to choose u(X1, · · · , Xn) = X̄. N

Example 5.2.3 (Sufficient statistic for Normal RV). Let X1, · · · , Xn be i.i.d. with distribution N(µ, 1). In order to compute a sufficient statistic for µ, we write down the full joint distribution of the n samples:
\[
\begin{aligned}
f(x_1, \dots, x_n; \mu) &= \prod_{i=1}^n f_{X_i}(x_i; \mu) && (199)\\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(x_i - \mu)^2}{2}\Big) && (200)\\
&= (2\pi)^{-n/2} \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\Big). && (201)
\end{aligned}
\]
To separate the dependence of each xi on µ, we note that
\[
\begin{aligned}
\sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 && (202)\\
&= \sum_{i=1}^n \big[(x_i - \bar{x})^2 + 2(x_i - \bar{x})(\bar{x} - \mu) + (\bar{x} - \mu)^2\big] && (203)\\
&= \Big(\sum_{i=1}^n (x_i - \bar{x})^2\Big) + n(\bar{x} - \mu)^2, && (204)
\end{aligned}
\]
where for the last equality we have used ∑_{i=1}^n (xi − x̄) = 0. So we can decompose the joint PDF as
\[
f(x_1, \dots, x_n; \mu) = \Big[(2\pi)^{-n/2} \exp\Big(-\frac{n(\bar{x} - \mu)^2}{2}\Big)\Big]\Big[\exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2\Big)\Big]. \tag{205}
\]
Hence X̄ is a sufficient statistic for µ. N

Exercise 5.2.4. Let X1, · · · , Xn be i.i.d. RVs with the following PDF of exponential form
\[
f_{X;\theta}(x) = \exp\big(K(x)p(\theta) + S(x) + q(\theta)\big)\,\mathbf{1}(x \in A), \tag{206}
\]
where the support A does not depend on the parameter θ.

(i) Show that the joint likelihood function is given by
\[
f(x_1, \dots, x_n; \theta) = \exp\Big(p(\theta)\Big(\sum_{i=1}^n K(x_i)\Big) + n\,q(\theta)\Big)\exp\Big(\sum_{i=1}^n S(x_i)\Big)\,\mathbf{1}(x_1, \dots, x_n \in A). \tag{207}
\]
Deduce that ∑_{i=1}^n K(Xi) is a sufficient statistic for θ.
(ii) Derive sufficient statistics for the parameters of Bernoulli(p), Exp(λ), and Poisson(λ) using (i).

3. Rao-Blackwell Theorem

Theorem 5.3.1 (Rao-Blackwell). Let X1, · · · , Xn be i.i.d. RVs of distribution fX;θ with parameter θ. Let T1 = T1(X1, · · · , Xn) and T2 = T2(X1, · · · , Xn) be two statistics for θ. Let T3 = E[T2 | T1].

(i) One can improve T2 by conditioning on T1. That is, for the new statistic T3,
\[
\mathrm{Var}(T_3) \le \mathrm{Var}(T_2). \tag{208}
\]
(ii) If T2 is unbiased, then T3 is also an unbiased estimator for θ.
(iii) If T1 is sufficient for θ, then T3 depends only on the sample values X1, · · · , Xn and not on θ.


PROOF. Recall the law of total variance:

Var(T2) = E[Var(T2 |T1)]+Var(E[T2 |T1]). (209)

Since Var(T2 |T1) ≥ 0,

Var(T3) = Var(E[T2 |T1]) ≤ E[Var(T2 |T1)]+Var(E[T2 |T1]) = Var(T2). (210)

This shows (i). To show (ii), suppose E[T2] = θ. Then

θ = E[T2] = E[E[T2 |T1]] = E[T3]. (211)

Hence T3 is an unbiased estimator for θ.

Now we show (iii). By definition of conditional expectation, write
\[
T_3 = E[T_2 \mid T_1] = \sum_t t\, P_\theta\big(T_2(X_1, \dots, X_n) = t \mid T_1\big). \tag{212}
\]
Since T1 is sufficient for θ, the conditional probabilities in the last expression do not depend on θ. This shows that T3 does not depend on θ.

Remark 5.3.2. Notice that parts (i) and (ii) in the Rao-Blackwell theorem above do not require sufficiency of T1. What, then, is the meaning of part (iii)? The conditional expectation T3 = E[T2 | T1] defining T3 may in general depend on θ. If that is the case, then we will not be able to compute the value of T3 by looking only at the sample values X1, · · · , Xn (e.g., consider T3 = X̄ + θ). This is problematic, since a ‘statistic’ should be computable just from the sample values in order to give insight into θ.

Exercise 5.3.3. Let X1, · · · , Xn be i.i.d. Poisson(λ) RVs.

(i) Show that X1 is an unbiased estimator for λ.
(ii) Show that T = X1 + ··· + Xn is a sufficient statistic for λ.
(iii) (Rao-Blackwellization) Show that
\[
E[X_1 \mid T] = T/n = \bar{X}. \tag{213}
\]
Deduce from the Rao-Blackwell theorem that X̄ is an unbiased estimator for λ with Var(X̄) ≤ Var(X1) = λ. (In fact, Var(X̄) = Var(X1)/n = λ/n.)
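The following small simulation sketch of Exercise 5.3.3, assuming numpy is available (λ, n, and all names below are illustrative, not from the text), shows the point of Rao-Blackwellization: the crude unbiased estimator X1 and its Rao-Blackwellization X̄ have the same mean, but X̄ has far smaller variance.

\begin{verbatim}
# Rao-Blackwellization for Poisson: compare X_1 with X-bar = E[X_1 | T].
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 25, 100_000
X = rng.poisson(lam, size=(reps, n))

crude = X[:, 0]               # T_2 = X_1, unbiased but noisy
rb = X.mean(axis=1)           # T_3 = E[X_1 | T] = X-bar (Rao-Blackwellized)

print("mean of X_1  :", crude.mean(), " variance:", crude.var())   # roughly lam, lam
print("mean of X-bar:", rb.mean(),    " variance:", rb.var())      # roughly lam, lam/n
\end{verbatim}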


CHAPTER 6

Bayesian Estimation

In this section, we discuss another widely used method of parameter estimation, called Bayesian estimation. Just like in MLE, we would like to estimate an unknown parameter θ in a distribution fX;θ from analyzing an i.i.d. sample X1, · · · , Xn drawn from that distribution. One of the key aspects of Bayesian estimation is to quantify our prior knowledge of the unknown parameter θ as yet another probabilistic model, which we call a belief. After observing new data X1 = x1, · · · , Xn = xn, our current best belief about θ (the prior distribution) is updated to a new belief (the posterior distribution) according to Bayes' Theorem. Since we are making an additional modeling assumption on θ, we will be able to draw more quantitative inference on θ using Bayesian estimation.

One of the early bottlenecks in Bayesian estimation was the computational overhead in computing posterior distributions. Due to the recent explosion in our computational capacity, coupled with advanced sampling techniques such as MCMC, Bayesian methods have become one of the indispensable tools in modern statistics and machine learning.

1. Bayes’ Theorem and Inference – Discrete setting

If we have a model for the bitcoin price, then we can tune the parameters accordingly and attempt to predict its future prices. But how do we tune the parameters? Say the bitcoin price suddenly dropped by 20% overnight (no surprise there). Which factor is the most likely to have caused this drop? In general, Bayes' Theorem can be used to infer likely factors when we are given the effect or outcome. This is all based on conditional probability and partitioning.

We begin with the following trivial but important observation:

P(B |A)P(A) =P(A∩B) =P(A|B)P(B). (214)

And this is all we need.

Theorem 6.1.1 (Bayes' Theorem). Let (Ω,P) be a probability space, and let A1, · · · , Ak ⊆ Ω be events of positive probability that form a partition of Ω. Then, for any event B of positive probability,
\[
P(A_1 \mid B) = \frac{P(B \mid A_1)P(A_1)}{P(B)} = \frac{P(B \mid A_1)P(A_1)}{P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + \cdots + P(B \mid A_k)P(A_k)}. \tag{215}
\]

PROOF. Note that (214) yields the first equality. Partitioning Ω and rewriting P(B) accordingly gives the second equality.

What is so important about Bayes' theorem is its interpretation as a means of inference, which is one of the fundamental tools in modern machine learning.

Example 6.1.2. Suppose Bob has a coin with unknown probability of heads, which we denote by Θ. This is called the parameter. Suppose Alice knows that Bob has one of three kinds of coins A, B, and C, with probability of heads pA = 0.2, pB = 0.5, and pC = 0.8, respectively. This piece of information is called the model. Since Alice has no other information, she initially assumes that Bob has each of the three coins equally likely. Namely, she assumes the uniform distribution over the sample space Ω = {0.2, 0.5, 0.8}. This knowledge is called the prior.


Now Bob flips his coin 10 times, gets 7 heads, and reports this information, which we call the Data, to Alice. Now that Alice has more information, she needs to update her prior to a posterior, which is the probability distribution on Ω that best explains the Data. Namely, it is the conditional probability distribution P(Θ = θ | Data).

First, using our prior, we compute P(Data), the probability of seeing this particular data at hand. This can be done by partitioning:
\[
\begin{aligned}
P(\text{Data}) &= P(\text{Data} \mid \Theta = 0.2)P(\Theta = 0.2) + P(\text{Data} \mid \Theta = 0.5)P(\Theta = 0.5) + P(\text{Data} \mid \Theta = 0.8)P(\Theta = 0.8) && (216\text{-}217)\\
&= P(7/10 \text{ heads} \mid \Theta = 0.2)\tfrac{1}{3} + P(7/10 \text{ heads} \mid \Theta = 0.5)\tfrac{1}{3} + P(7/10 \text{ heads} \mid \Theta = 0.8)\tfrac{1}{3} && (218)\\
&= \frac{1}{3}\Big(\tbinom{10}{7}(0.2)^7(0.8)^3 + \tbinom{10}{7}(0.5)^7(0.5)^3 + \tbinom{10}{7}(0.8)^7(0.2)^3\Big) \approx 0.1064, && (219)
\end{aligned}
\]
where \(\binom{10}{7} = 120\) is the number of ways to choose 7 out of 10 objects.

Second, we reformulate the first equality of Theorem 6.2.1 as
\[
P(\Theta \mid \text{Data}) = \frac{P(\text{Data} \mid \Theta)\,P(\Theta)}{P(\text{Data})}. \tag{220}
\]
Hence we can compute the posterior distribution by
\[
\begin{aligned}
P(\Theta = 0.2 \mid \text{Data}) &= \frac{P(\text{Data} \mid \Theta = 0.2)P(\Theta = 0.2)}{P(\text{Data})} = \frac{\binom{10}{7}(0.2)^7(0.8)^3 \cdot \tfrac{1}{3}}{0.1064} \approx 0.0025, && (221)\\
P(\Theta = 0.5 \mid \text{Data}) &= \frac{P(\text{Data} \mid \Theta = 0.5)P(\Theta = 0.5)}{P(\text{Data})} = \frac{\binom{10}{7}(0.5)^7(0.5)^3 \cdot \tfrac{1}{3}}{0.1064} \approx 0.3670, && (222)\\
P(\Theta = 0.8 \mid \text{Data}) &= \frac{P(\text{Data} \mid \Theta = 0.8)P(\Theta = 0.8)}{P(\text{Data})} = \frac{\binom{10}{7}(0.8)^7(0.2)^3 \cdot \tfrac{1}{3}}{0.1064} \approx 0.6305. && (223)
\end{aligned}
\]
Note that according to the posterior distribution, Θ = 0.8 is the most likely value, which is natural given that we have 7 heads out of 10 flips. However, our knowledge is always incomplete, so our posterior knowledge is still a probability distribution on the sample space.

What if Bob flips his coin another 10 times and reports only 3 heads to Alice? Then she will have to use her current prior π = [0.0025, 0.3670, 0.6305] (which was obtained as the posterior in the previous round) to compute yet another posterior using the new data. This new posterior is likely to give higher weight to Θ = 0.2. N
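The posterior computation of Example 6.1.2 is easy to reproduce numerically. The following Python sketch (assuming numpy is available; the variable names are ours) recovers P(Data) ≈ 0.1064 and the posterior [0.0025, 0.3670, 0.6305].

\begin{verbatim}
# Discrete Bayesian update: uniform prior on {0.2, 0.5, 0.8}, data = 7 heads in 10 flips.
import numpy as np
from math import comb

thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])
n, heads = 10, 7

likelihood = np.array([comb(n, heads) * t**heads * (1 - t)**(n - heads) for t in thetas])
evidence = np.sum(likelihood * prior)        # P(Data), about 0.1064
posterior = likelihood * prior / evidence    # about [0.0025, 0.3670, 0.6305]
print(evidence, posterior)
\end{verbatim}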

Exercise 6.1.3. Suppose we have a prior distribution π = [0.0025, 0.3670, 0.6305] on the sample space Ω = {0.2, 0.5, 0.8} for the inference problem of the unknown parameter Θ. Suppose we are given the data that ten independent flips of the probability-Θ coin come up heads twice. Compute the posterior distribution using this data and Bayesian inference.

Exercise 6.1.4. A test for pancreatic cancer is assumed to be correct 95% of the time: if a person has the cancer, the test results in positive with probability 0.95, and if the person does not have the cancer, then the test results in negative with probability 0.95. From recent medical research, it is known that only 0.05% of the population have pancreatic cancer. Given that a person just tested positive, what is the probability of having the cancer?

2. Bayes’ Theorem and Inference – Continuous setting

We begin by recalling the following basic observation, from which Bayes' theorem can be easily derived:

P(B |A)P(A) =P(A∩B) =P(A|B)P(B). (224)


In the case of events involving continuous random variables taking a certain value, we interpret the above probabilities as probability densities. The following is a random variable version of Bayes' Theorem discussed in the previous subsection.

Theorem 6.2.1 (Bayes' Theorem). Let X and Θ be random variables. Then
\[
P(\Theta = \theta \mid X = x) = \frac{P(X = x \mid \Theta = \theta)\,P(\Theta = \theta)}{P(X = x)}. \tag{225}
\]

PROOF. Follows from (224).

Remark 6.2.2. Depending on whether X and Θ are discrete or continuous, we interpret the probabilities in the above theorem as PDFs or PMFs, accordingly.

Bayesian inference is usually carried out via the following procedure.

Bayesian inference:

(i) To explain an observable x, we choose a probabilistic model p(x | θ), which is a probability distribution on the possible values x of the observable, depending on a parameter θ.

(ii) Choose a probability distribution π(θ), called the prior distribution, that expresses our beliefs about the parameter θ to the best of our current knowledge.

(iii) After observing data D = {x1, · · · , xn}, we update our beliefs and compute the posterior distribution p(θ | D) according to Bayes' Theorem.

When we generate data D = {x1, · · · , xn}, we assume that each xi is obtained as an independent sample Xi of the observable from its true distribution. According to Bayes' Theorem, we can compute the posterior distribution as
\[
P(\Theta = \theta \mid X_1 = x_1, \dots, X_n = x_n) = \frac{P(X_1 = x_1, \dots, X_n = x_n \mid \Theta = \theta)\,P(\Theta = \theta)}{P(X_1 = x_1, \dots, X_n = x_n)}, \tag{226}
\]
or in a more compact form,
\[
p(\theta \mid x_1, \dots, x_n) = \frac{p(x_1, \dots, x_n \mid \theta)\,\pi(\theta)}{p(x_1, \dots, x_n)}. \tag{227}
\]
By a slight abuse of notation, we use lowercase p to denote either a PMF or a PDF, depending on the context.

The conditional probability p(x1, · · · , xn | θ) is called the likelihood function, which is the probability of obtaining the independent data sample D = {x1, · · · , xn} according to our probability model p(x | θ), assuming model parameter Θ = θ. By the independence between samples, the likelihood function factors into the product of the marginal likelihoods of the samples:
\[
p(x_1, \dots, x_n \mid \theta) = P(X_1 = x_1, \dots, X_n = x_n \mid \Theta = \theta) = \prod_{i=1}^n P(X_i = x_i \mid \Theta = \theta) = \prod_{i=1}^n p(x_i \mid \theta). \tag{228\text{-}229}
\]
On the other hand, the joint probability P(X1 = x1, · · · , Xn = xn) of obtaining the data sample D = {x1, · · · , xn} from our probability model can be computed by conditioning on the values of the model parameter Θ:
\[
P(X_1 = x_1, \dots, X_n = x_n) = E\big[P(X_1 = x_1, \dots, X_n = x_n \mid \Theta)\big] = \int_{-\infty}^{\infty} p(x_1, \dots, x_n \mid \theta)\,\pi(\theta)\,d\theta. \tag{230\text{-}231}
\]

Example 6.2.3 (Signal detection). A binary signal X ∈ {−1, 1} is transmitted, and we are given that
\[
P(X = 1) = p, \qquad P(X = -1) = 1 - p \tag{232}
\]
for some p ∈ [0,1], as the prior distribution of X. There is white noise in the channel, so we receive a perturbed signal
\[
Y = X + Z, \tag{233}
\]
where Z ∼ N(0,1) is a standard normal variable.

Conditional on Y = y, what is the probability that the actual signal X is 1? In order to use Bayes' theorem, first observe that Y | X = x ∼ N(x, 1). Hence
\[
f_{Y \mid X = x}(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(z - x)^2}{2}}. \tag{234}
\]
So by Bayes' Theorem,
\[
P(X = 1 \mid Y = y) = \frac{f_{Y \mid X=1}(y)\,P(X = 1)}{f_Y(y)} = \frac{\frac{p}{\sqrt{2\pi}}\, e^{-\frac{(y-1)^2}{2}}}{\frac{p}{\sqrt{2\pi}}\, e^{-\frac{(y-1)^2}{2}} + \frac{1-p}{\sqrt{2\pi}}\, e^{-\frac{(y+1)^2}{2}}}. \tag{235\text{-}236}
\]
Thus we get
\[
P(X = 1 \mid Y = y) = \frac{p\,e^{y}}{p\,e^{y} + (1-p)\,e^{-y}}, \tag{237}
\]
and similarly,
\[
P(X = -1 \mid Y = y) = \frac{(1-p)\,e^{-y}}{p\,e^{y} + (1-p)\,e^{-y}}. \tag{238}
\]
This is our posterior distribution of X after observing Y = y. N

Example 6.2.4 (Bernoulli model and uniform prior). Bob has a coin with unknown probability Θ of heads. Alice has no information whatsoever, so her prior distribution π for Θ is the uniform distribution Uniform([0,1]). Bob flips his coin independently n times; let X1, · · · , Xn be the outcomes, so the Xi's are i.i.d. Bernoulli(Θ) variables. Let xi be the observed value of Xi, and let sn = x1 + ··· + xn be the number of heads in the n flips. Given the data D = {x1, · · · , xn}, Alice wants to compute her posterior distribution on Θ.

The likelihood function is given by
\[
p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{s_n}(1-\theta)^{n - s_n}. \tag{239\text{-}240}
\]
Since π ≡ 1, the posterior distribution is given by
\[
p(\theta \mid x_1, \dots, x_n) = \frac{\theta^{s_n}(1-\theta)^{n-s_n}}{p(x_1, \dots, x_n)}. \tag{241}
\]
Note that
\[
p(x_1, \dots, x_n) = \int_0^1 \theta^{s_n}(1-\theta)^{n-s_n}\,d\theta = \frac{1}{\binom{n}{s_n}(n+1)}, \tag{242}
\]
where the last equality follows from Exercise 6.2.6. Hence
\[
p(\theta \mid x_1, \dots, x_n) = \binom{n}{s_n}(n+1)\,\theta^{s_n}(1-\theta)^{n-s_n} = \frac{(n+1)!}{s_n!\,(n-s_n)!}\,\theta^{(s_n+1)-1}(1-\theta)^{(n-s_n+1)-1}. \tag{243\text{-}244}
\]
Hence, according to Exercise 6.2.7, we can write
\[
\Theta \mid x_1, \dots, x_n \sim \mathrm{Beta}(s_n + 1,\; n - s_n + 1). \tag{245}
\]

N
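A quick numerical sanity check of (245), assuming numpy is available (the sample size and number of heads below are made up for illustration): the grid-normalized posterior under the uniform prior coincides with the Beta(sn + 1, n − sn + 1) density written in (243)-(244).

\begin{verbatim}
# Uniform prior + Bernoulli likelihood gives a Beta(s+1, n-s+1) posterior.
import numpy as np
from math import comb

n, s = 12, 9                                   # illustrative: 9 heads in 12 flips
theta = np.linspace(0.0, 1.0, 2001)
dtheta = theta[1] - theta[0]

unnorm = theta**s * (1 - theta)**(n - s)       # likelihood times the (constant) uniform prior
posterior = unnorm / (unnorm.sum() * dtheta)   # numerical normalization on the grid

beta_pdf = comb(n, s) * (n + 1) * theta**s * (1 - theta)**(n - s)   # Beta(s+1, n-s+1), from (243)
print(np.max(np.abs(posterior - beta_pdf)))    # small (numerical quadrature error only)
\end{verbatim}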

Exercise 6.2.5. Bob has a coin with unknown probability Θ of heads. Alice has the following Beta prior (see Exercise 6.2.7 for the definition of the Beta distribution):

π= Beta(α,β). (246)

Suppose that Bob gives Alice the data Dn = {x1, · · · , xn}, which is the outcome of n independent coin flips. Denote sn = x1 + ··· + xn. Show that Alice's posterior distribution is Beta(α + sn, β + n − sn). Namely,

Θ |Dn ∼ Beta(α+ sn ,β+n − sn). (247)

Exercise 6.2.6. Let Y ∼ Uniform([0,1]) and, conditionally on Y, let X ∼ Binomial(n, Y).

(i) Use iterated expectation for probability to write
\[
P(X = k) = \binom{n}{k}\int_0^1 y^k (1-y)^{n-k}\,dy. \tag{248}
\]
(ii) Write \(A_{n,k} = \int_0^1 y^k (1-y)^{n-k}\,dy\). Use integration by parts to show that
\[
A_{n,k} = \frac{k}{n-k+1}\,A_{n,k-1} \tag{249}
\]
for all 1 ≤ k ≤ n. Conclude that for all 0 ≤ k ≤ n,
\[
A_{n,k} = \frac{1}{\binom{n}{k}}\,\frac{1}{n+1}. \tag{250}
\]
(iii) Conclude that X ∼ Uniform({0, 1, · · · , n}).

Exercise 6.2.7 (Beta distribution). A random variable X taking values in [0,1] has the Beta distribution with parameters α and β, which we denote by Beta(α,β), if it has PDF
\[
f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha - 1}(1-x)^{\beta - 1}, \tag{251}
\]
where Γ(z) is the Euler Gamma function defined by
\[
\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx. \tag{252}
\]
(i) Use integration by parts to show the following recursion:
\[
\Gamma(z+1) = z\,\Gamma(z). \tag{253}
\]
Deduce that Γ(n) = (n − 1)! for all integers n ≥ 1.
(ii) Let X ∼ Beta(k + 1, n − k + 1). Use (i) to show that
\[
f_X(x) = \frac{n!\,(n+1)}{k!\,(n-k)!}\,x^k (1-x)^{n-k} = \frac{x^k (1-x)^{n-k}}{1/\big(\binom{n}{k}(n+1)\big)}. \tag{254}
\]
Use Exercise 6.2.6 to verify that the above function is indeed a PDF (i.e., it integrates to 1).
(iii)* Show that if X ∼ Beta(α,β), then
\[
E[X] = \frac{\alpha}{\alpha + \beta}, \qquad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \tag{255}
\]


3. More Bayesian concepts

Example 6.3.1 (Bayesian estimator). Recall Exercise 6.2.5, where we want to infer an unknown parameter θ in Bernoulli(θ), starting from the Beta prior π = Beta(α,β). After observing data D = (x1, · · · , xn) consisting of sn = ∑_{i=1}^n xi successes, our posterior distribution is the Beta distribution with parameters α + sn and β + n − sn. In other words,
\[
\Theta \mid D \sim \mathrm{Beta}(\alpha + s_n,\; \beta + n - s_n). \tag{256}
\]
Now suppose we want a point estimate θ̂ of the unknown parameter from our posterior distribution. Any choice of estimator will result in some kind of error, so we would like our estimator to minimize an error function of our choice. A standard choice is the mean squared error (MSE), which in our case is computed conditionally on the observed data. That is, we want our estimator θ̂ to satisfy
\[
\hat{\theta} = \operatorname*{arg\,min}\; E[(\Theta - \hat{\theta})^2 \mid D]. \tag{257}
\]
According to Exercise 4.1.2, the above MSE is minimized when θ̂ = E[Θ | D], and the minimum MSE is the conditional variance Var(Θ | D). Recalling the mean of the Beta distribution from Exercise 6.2.7, we obtain the following Bayesian estimator:
\[
\hat{\theta} = E[\Theta \mid D] = E[\mathrm{Beta}(\alpha + s_n,\; \beta + n - s_n)] = \frac{\alpha + s_n}{\alpha + \beta + n}. \tag{258}
\]

N
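The following short sketch (assuming numpy; the prior parameters, true success probability, and names are illustrative only) compares the Bayesian estimator (258) with the MLE sn/n; as n grows, the prior's influence fades and the two estimates agree.

\begin{verbatim}
# Posterior-mean (Bayes) estimate vs. the MLE for Bernoulli data.
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, true_p = 2.0, 2.0, 0.3

for n in (5, 50, 500):
    x = rng.random(n) < true_p
    s = x.sum()
    mle = s / n
    bayes = (alpha + s) / (alpha + beta + n)   # posterior mean of Beta(alpha+s, beta+n-s)
    print(f"n = {n:4d}:  MLE = {mle:.3f},  Bayes estimate = {bayes:.3f}")
\end{verbatim}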

Example 6.3.2 (Excerpted from [HTZ77], Berry 1996). We are concerned with breakage of glass panels in high-rise buildings. One such case involved 39 panels, and of the 39 panels that broke, it was known that 3 broke due to nickel sulfide (NiS) stones found in them. Loss of evidence prevented the causes of breakage of the other 36 panels from being known. So the court wanted to know whether the manufacturer of the panels or the builder was at fault for the breakage of these 36 panels. In other words, we would like to know the probability θ of a glass panel being broken due to NiS stones.

We are going to use Bayesian estimation with a Beta prior π = Beta(α,β), whose parameters α and β we determine using some external knowledge. From expert testimony, it was thought that usually about 5% of breakage is caused by NiS stones. Hence we should have
\[
0.05 = E_\pi[\theta] = E[\mathrm{Beta}(\alpha, \beta)] = \frac{\alpha}{\alpha + \beta}. \tag{259}
\]
Moreover, the expert thought that if two panels from the same lot break and one breakage was caused by NiS stones, then, due to the pervasive nature of the manufacturing process, the probability of the second panel breaking due to NiS stones increases to about 95%. Hence, after observing one data sample of a broken glass panel, our posterior distribution is Beta(α + 1, β), and 95% should equal the mean of this posterior distribution:
\[
0.95 = E[\Theta \mid \text{1st panel broken due to NiS}] = E[\mathrm{Beta}(\alpha + 1, \beta)] = \frac{\alpha + 1}{\alpha + \beta + 1}. \tag{260}
\]
Solving these two equations, we find
\[
\alpha = \frac{1}{360}, \qquad \beta = \frac{19}{360}. \tag{261}
\]
Recall that we know that three glass panels were broken due to NiS. Then the posterior estimate that the fourth glass panel is also broken due to NiS, given that the first three were so, is given by
\[
E[\Theta \mid \text{first three panels broken due to NiS}] = E[\mathrm{Beta}(\alpha + 3, \beta)] = \frac{\alpha + 3}{\alpha + \beta + 3}. \tag{262}
\]


A similar calculation holds for the next panel, and so on. Hence, by multiplying out the Bayesian estimates of the conditional probabilities (predictive probabilities),
\[
\begin{aligned}
P\big(\text{remaining 36 panels are broken due to NiS} \,\big|\, \text{first 3 broken due to NiS}\big)
&= \prod_{i=3}^{38} P\big((i+1)\text{th panel is broken due to NiS} \,\big|\, \text{first } i \text{ broken due to NiS}\big) && (263\text{-}264)\\
&\approx \Big(\frac{\alpha+3}{\alpha+\beta+3}\Big)\Big(\frac{\alpha+4}{\alpha+\beta+4}\Big)\cdots\Big(\frac{\alpha+38}{\alpha+\beta+38}\Big) = 0.8664. && (265)
\end{aligned}
\]

Hence, according to our Bayesian estimation, the probability that all remaining 36 panels were broken due to NiS stones, given that the first three were so, is about 87%, which is the value needed in the court's decision. N
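The product in (265) can be reproduced with a few lines of Python (standard library only):

\begin{verbatim}
# Predictive probability that panels 4 through 39 all broke due to NiS stones.
alpha, beta = 1 / 360, 19 / 360

prob = 1.0
for i in range(3, 39):                      # i = 3, 4, ..., 38
    prob *= (alpha + i) / (alpha + beta + i)
print(round(prob, 4))                       # about 0.8664
\end{verbatim}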


CHAPTER 7

Interval estimation

1. Estimating mean of normal distribution with known variance

Let (Xt)t≥0 be a sequence of i.i.d. RVs with finite mean µ and variance σ². We have seen many times that the sample mean X̄ arises as a natural estimator for the population mean µ (from MLE and MoM). Recall that the sample mean is an unbiased estimator of µ:
\[
E[\bar{X}] = \mu. \tag{266}
\]
Suppose after some experiment we obtained a data set of samples D = (x1, · · · , xn). From this we will estimate µ ≈ x̄. But what do we really mean by this approximation? Can we say something more quantitative about the location of µ? Namely, we want to make statements like

“µ is contained in the interval [x̄ − ε, x̄ + ε] with confidence at least α”. (267)

More precisely, this may be rephrased as
\[
P\big(|\mu - \bar{x}| \le \varepsilon\big) \ge \alpha. \tag{268}
\]
However, what is really random in the above probability? µ is an unknown parameter, and x̄ is an observed sample mean. So what we really mean here is the following:
\[
P\big(|\mu - \bar{X}| \le \varepsilon\big) \ge \alpha. \tag{269}
\]
Namely, if we randomly sample the RVs X1, · · · , Xn and take their sample mean X̄ (which is still random), then this random estimator X̄ should be close to the unknown parameter µ with some large probability.

In order to compute the probability on the left hand side, we need to know the distribution of the random sample mean X̄. In this subsection, we first consider a special case where we know the distribution of X̄ exactly. Our derivation of what is called a ‘confidence interval’ in this section relies on the following two assumptions:

A1. We have i.i.d. samples X1, · · · , Xn from the normal distribution N(µ,σ²).
A2. We know the population variance Var(X) = σ².

Example 7.1.1 (Confidence interval for the mean of normal distribution). Let X1, · · · , Xn be i.i.d. RVs with distribution N(µ,σ²). Suppose we know the variance σ², but not the mean µ. First note that
\[
E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n] = nE[X_1] = n\mu, \tag{270}
\]
\[
\mathrm{Var}(X_1 + \cdots + X_n) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) = n\,\mathrm{Var}(X_1) = n\sigma^2, \tag{271}
\]
where the first uses linearity of expectation and the second uses independence between the Xi's. Moreover, recalling that the normal distribution is additive, we have the precise distribution of the sum:
\[
X_1 + X_2 + \cdots + X_n \sim N(n\mu,\, n\sigma^2). \tag{272}
\]
Since E[X̄] = E[X1] = µ and Var(X̄) = n^{-2} Var(X1 + ··· + Xn) = σ²/n, it follows that
\[
\bar{X} \sim N(\mu,\; \sigma^2/n). \tag{273}
\]
From this we deduce
\[
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \tag{274}
\]
From this we can construct a confidence interval for µ. Namely, recall that the ‘z-score’ for ‘confidence’ α ∈ [0,1], denoted by zα, is defined so that P(N(0,1) > zα) = α. Hence we have
\[
P\Big(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) = 1 - \alpha. \tag{275}
\]
Rewriting, this gives
\[
P\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{276}
\]
That is, the probability that the random interval [X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n] contains the population mean µ is 1 − α. In this sense, this random interval is called the 100(1−α)% confidence interval for µ. For instance, noting that z0.05/2 = 1.96, [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] is a 95% confidence interval for µ. N
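The meaning of (276) can be checked by simulation. The sketch below (assuming numpy; the values of µ, σ, and n are arbitrary choices of ours) repeatedly draws samples from N(µ,σ²) and records how often the interval X̄ ± 1.96σ/√n contains µ; the frequency should be close to 0.95.

\begin{verbatim}
# Coverage check for the known-variance normal confidence interval.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 10.0, 2.0, 25, 50_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)            # z_{0.025} = 1.96 from the normal table
covered = (xbar - half <= mu) & (mu <= xbar + half)
print(covered.mean())                       # close to 0.95
\end{verbatim}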

Exercise 7.1.2 (Excerpted from [HTZ77]). Let X equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume that the distribution of X is N(µ, 1296). If a random sample of n = 27 bulbs is tested until they burn out, yielding a sample mean of x̄ = 1478 hours, what is the corresponding 95% confidence interval for µ?

2. Central limit theorem

In this subsection, we construct a confidence interval for the population mean µ in the case of an unknown population distribution but known population variance σ². The main tool we use is the celebrated Central Limit Theorem, one of the most important results in probability and statistics.

As before, let (Xt)t≥0 be a sequence of i.i.d. RVs with finite mean µ and variance σ². Let Sn = X1 + ··· + Xn for n ≥ 1. We have calculated the mean and variance of the sample mean Sn/n:
\[
E[S_n/n] = \mu, \qquad \mathrm{Var}(S_n/n) = \sigma^2/n. \tag{277}
\]
Since Var(Sn/n) → 0 as n → ∞, we expect the sequence of RVs Sn/n to converge to its mean µ in probability. This is the famous law of large numbers.

The central limit theorem is a limit theorem for the sample mean in a different regime; namely, it describes the ‘fluctuation’ of the sample mean around its expectation as n → ∞. For this purpose, we need to standardize the sample mean so that its mean is zero and its variance is one. Namely, let
\[
Z_n = \frac{S_n/n - \mu}{\sigma/\sqrt{n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}}, \tag{278}
\]
so that
\[
E[Z_n] = 0, \qquad \mathrm{Var}(Z_n) = 1. \tag{279}
\]
Since the variance is kept at 1, we should not expect the sequence of RVs (Zn)n≥0 to converge to some constant in probability, as in the law of large numbers situation. Instead, Zn should converge to some other RV, if it converges in any sense. The central limit theorem states that Zn becomes more and more likely to behave like a standard normal RV Z ∼ N(0,1).

Let us state the central limit theorem.

Theorem 7.2.1 (CLT). Let (Xk)k≥1 be i.i.d. RVs and let Sn = ∑_{k=1}^n Xk, n ≥ 1. Suppose E[X1] = µ and Var(X1) = σ² < ∞. Let Z ∼ N(0,1) be a standard normal RV and define
\[
Z_n = \frac{S_n - \mu n}{\sigma\sqrt{n}} = \frac{S_n/n - \mu}{\sigma/\sqrt{n}}. \tag{280}
\]

Then Zn converges to Z as n →∞ in distribution, namely,

\[
\lim_{n\to\infty} P(Z_n \le z) = P(Z \le z). \tag{281}
\]

As a typical application of CLT, we can approximate Binomial(n, p) variables by normal RVs.


Exercise 7.2.2. Let (Xn)n≥1 be a sequence of i.i.d. Bernoulli(p) RVs. Let Sn = X1 + ··· + Xn.

(i) Let Zn = (Sn − np)/√(np(1−p)). Use the CLT to deduce that, as n → ∞, Zn converges to the standard normal RV Z ∼ N(0,1) in distribution.

(ii) Conclude that if Yn ∼ Binomial(n, p), then
\[
\frac{Y_n - np}{\sqrt{np(1-p)}} \Rightarrow Z \sim N(0,1). \tag{282}
\]
(iii) From (ii), deduce that we have the following approximation
\[
P(Y_n \le x) \approx P\Big(Z \le \frac{x - np}{\sqrt{np(1-p)}}\Big), \tag{283}
\]
which becomes more accurate as n → ∞.
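A quick numerical illustration of the approximation (283), using only the Python standard library (the values of n, p, and x are arbitrary): the exact Binomial CDF and its normal approximation are compared at a single point.

\begin{verbatim}
# Normal approximation to the Binomial CDF at one point.
from math import comb, erf, sqrt

n, p, x = 100, 0.3, 35
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))
z = (x - n * p) / sqrt(n * p * (1 - p))
approx = 0.5 * (1 + erf(z / sqrt(2)))       # P(Z <= z) for Z ~ N(0,1)
print(exact, approx)                        # the two numbers should be close
\end{verbatim}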

Example 7.2.3 (Confidence interval for the mean with known variance). Let (Xt)t≥0 be a sequence of i.i.d. RVs with unknown mean µ but known variance σ². Here we do not know whether the Xi's are drawn from a normal distribution. By the CLT, we have
\[
Z_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \Longrightarrow Z \sim N(0,1) \tag{284}
\]
as the sample size n → ∞. In other words, Zn becomes more and more likely to behave like a standard normal RV, so for any z ∈ R,
\[
P(Z_n \le z) \approx P(Z \le z) \tag{285}
\]
when n is large enough.

From this we can construct a confidence interval for µ similarly as in Example 7.1.1. Namely,
\[
P\Big(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) \approx 1 - \alpha. \tag{286}
\]
Rewriting, this gives
\[
P\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big) \approx 1 - \alpha. \tag{287}
\]
That is, the probability that the random interval [X̄ − zα/2 σ/√n, X̄ + zα/2 σ/√n] contains the population mean µ is approximately 1 − α. In this sense, this random interval is called the 100(1−α)% confidence interval for µ. For instance, noting that z0.05/2 = 1.96, [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] is a 95% confidence interval for µ. N

Exercise 7.2.4. Let X equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume only that Var(X) = 1296 (the distribution of X is otherwise unknown). If a random sample of n = 27 bulbs is tested until they burn out, yielding a sample mean of x̄ = 1478 hours, what is the corresponding (approximate) 95% confidence interval for µ?

3. Confidence interval for the mean with unknown variance

In this subsection, we construct a confidence interval for the population mean µ in the case of a normal population distribution with unknown population variance σ².

Let (Xt)t≥0 be a sequence of i.i.d. RVs with normal distribution N(µ,σ²), where both the mean µ and the variance σ² are unknown. Using our argument in Example 7.1.1, we can construct the following 100(1−α)% confidence interval:
\[
P\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{288}
\]
However, since we do not know σ, this does not really give us a valid confidence interval for µ.

Recall that the underlying observation for the above equation was that the standardized sample mean from a normal distribution follows the standard normal distribution:
\[
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1). \tag{289}
\]
But since σ is unknown, a reasonable move here is to replace σ by one of its estimators. A natural choice would be the sample variance
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2, \tag{290}
\]
which we have shown to be an unbiased estimator of σ² in Exercise 3.2.4. So what happens if we replace σ by S in (289)? It turns out that this substitution changes the distribution from the standard normal N(0,1) to what is known as the ‘t-distribution with n − 1 degrees of freedom’, which is based on the chi-square distribution.

Definition 7.3.1 (chi-square distribution). Let Z1, · · · , Zk be i.i.d. standard normal RVs. Let V = Z1² + ··· + Zk². Then the distribution of V is called the chi-square distribution with k degrees of freedom, which we denote by V ∼ χ²(k).

Definition 7.3.2 (t-distribution). Let Z ∼ N(0,1) and V ∼ χ²(k) be independent. Let T = Z/√(V/k). Then the distribution of T is called the t-distribution with k degrees of freedom, which we denote by T ∼ t(k).

FIGURE 1. PDF of χ2- and t-distributions for some values of degrees of freedom.

Exercise 7.3.3. (i) Let Z ∼ N(0,1). Show that Z² ∼ χ²(1) has the following MGF (for t < 1/2):
\[
E[e^{tZ^2}] = \int_{-\infty}^{\infty} e^{tx^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}\,dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-(1-2t)x^2/2}\,dx = \frac{1}{\sqrt{1-2t}}. \tag{291}
\]
(ii) Let V ∼ χ²(k). Use (i) to deduce that
\[
E[e^{tV}] = (1-2t)^{-k/2}. \tag{292}
\]
(iii) Let V ∼ χ²(k). Show that V has the following PDF:
\[
f_V(x) = \frac{x^{\frac{k}{2}-1}\, e^{-\frac{x}{2}}}{2^{k/2}\,\Gamma(k/2)}\,\mathbf{1}(x \ge 0), \tag{293}
\]
where Γ(·) is the Gamma function. (Hint: Compute the MGF of the above PDF, and show that it equals the MGF of χ²(k).)


Exercise 7.3.4. (Optional) Let T ∼ t(k). Show that T has the following PDF:
\[
f_T(t) = \frac{\Gamma\big(\frac{k+1}{2}\big)}{\sqrt{\pi k}\,\Gamma\big(\frac{k}{2}\big)}\Big(1 + \frac{t^2}{k}\Big)^{-\frac{k+1}{2}}, \tag{294}
\]
where Γ(·) is the Gamma function.

The following proposition is the theoretical basis of this subsection; its (optional) proof is provided at the end of the subsection.

Proposition 7.3.5. Let X1, · · · , Xn be i.i.d. samples from N(µ,σ²). Then
\[
\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1). \tag{295}
\]

Using this result, we can now construct a confidence interval in the case of a normal distribution with unknown variance.

Example 7.3.6 (Confidence interval for the mean of normal with unknown variance). Let (Xt)t≥0 be a sequence of i.i.d. RVs with distribution N(µ,σ²), where both µ and σ² are unknown. By Proposition 7.3.5, we have
\[
\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1). \tag{296}
\]
From this we can construct a confidence interval for µ similarly as in Example 7.1.1. Namely, define the ‘t-score’ tα(k) by P(T ≥ tα(k)) = α, where T ∼ t(k). Then
\[
P\Big(-t_{\alpha/2}(n-1) \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2}(n-1)\Big) = 1 - \alpha. \tag{297}
\]
Rewriting, this gives
\[
P\Big(\mu \in \Big[\bar{X} - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{298}
\]
That is, the probability that the random interval [X̄ − tα/2(n−1) S/√n, X̄ + tα/2(n−1) S/√n] contains the population mean µ is 1 − α. In this sense, this random interval is called the 100(1−α)% confidence interval for µ. For instance, noting that t0.05(19) = 1.729, a 90% confidence interval for µ with n = 20 would be [x̄ − 1.729 s/√n, x̄ + 1.729 s/√n]. N
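As a concrete sketch of (298), the following Python snippet (assuming numpy; the data below are made-up illustrative values with n = 20, and t0.05(19) = 1.729 is taken from a t-table as in the text) computes a 90% confidence interval for µ.

\begin{verbatim}
# t-interval for the mean when the variance is unknown (n = 20, 90% confidence).
import numpy as np

x = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 6.3, 5.1, 5.0, 5.8,
              4.6, 5.4, 6.0, 5.3, 5.6, 4.7, 5.9, 5.2, 5.5, 5.1])   # illustrative data
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)           # sample mean and sample standard deviation
t = 1.729                                   # t_{0.05}(n-1) for n = 20
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))
\end{verbatim}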

Exercise 7.3.7 (Excerpted from [HTZ77]). Let X equal the amount of butterfat in pounds produced by a typical cow during a 305-day milk production period between her first and second calves. Assume that the distribution of X is N(µ,σ²). To estimate µ, a farmer measured the butterfat production for n = 20 cows and obtained the following data:

481 537 513 583 453 510 570 500 457 555

618 327 350 643 499 421 505 637 599 392

(i) Give a point estimate of µ.
(ii) Give a 90% confidence interval for µ.

Exercise 7.3.8 (Excerpted from [HTZ77]). An automotive supplier of interior parts places several electrical wires in a harness. A pull test measures the force required to pull spliced wires apart. A customer requires that each wire spliced into the harness must withstand a pull force of 20 pounds. Let X equal the pull force required to pull 20 gauge wires apart. Assume that X ∼ N(µ,σ²). The following data give 20 observations of X:

28.8 24.4 30.1 25.6 26.4 23.9 22.1 22.5 27.6 28.1 (299)


20.8 27.7 24.4 25.1 24.6 26.3 28.2 22.2 26.3 24.4 (300)

(i) Find point estimates for µ and σ.
(ii) Find a 99% one-sided confidence interval for µ that provides a lower bound for µ. That is, first find the smallest constant c such that
\[
P\Big(\bar{X} - c\,\frac{S}{\sqrt{n}} \le \mu\Big) \ge 0.99. \tag{301}
\]
Then obtain the desired one-sided confidence interval for µ by computing x̄ and s from the given sample.

PROOF OF PROPOSITION 7.3.5. (Optional*) We first write
\[
\frac{\bar{X} - \mu}{S/\sqrt{n}} = \Big(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\Big)\Big(\frac{1}{S/\sigma}\Big). \tag{302}
\]
Since the Xi's are i.i.d. with distribution N(µ,σ²), we know that the standardized sample mean (X̄ − µ)/(σ/√n) has distribution N(0,1). Hence, by the definition of the t-distribution, it suffices to show that (n − 1)(S/σ)² ∼ χ²(n − 1) and that it is independent of the standardized sample mean.

We first show that (n − 1)(S/σ)² ∼ χ²(n − 1). By using Exercise 3.2.4 (iii), we write
\[
\sum_{i=1}^n (X_i - \mu)^2 = \Big(\sum_{i=1}^n (X_i - \bar{X})^2\Big) + n(\bar{X} - \mu)^2 = (n-1)S^2 + n(\bar{X} - \mu)^2. \tag{303\text{-}304}
\]
Divide both sides by σ² to get
\[
\sum_{i=1}^n \Big(\frac{X_i - \mu}{\sigma}\Big)^2 = (n-1)(S/\sigma)^2 + \Big(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\Big)^2. \tag{305}
\]
Notice that the left hand side has distribution χ²(n), and the last term on the right hand side has distribution χ²(1). We claim that the two RVs on the right hand side are independent. The assertion then follows by taking MGFs, since
\[
\mathrm{MGF}(\chi^2(n)) = \mathrm{MGF}\big((n-1)(S/\sigma)^2\big)\,\mathrm{MGF}(\chi^2(1)), \tag{306}
\]
so that
\[
\mathrm{MGF}\big((n-1)(S/\sigma)^2\big) = \frac{\mathrm{MGF}(\chi^2(n))}{\mathrm{MGF}(\chi^2(1))} = \frac{[\mathrm{MGF}(\chi^2(1))]^n}{\mathrm{MGF}(\chi^2(1))} = [\mathrm{MGF}(\chi^2(1))]^{n-1} = \mathrm{MGF}(\chi^2(n-1)). \tag{307}
\]

It remains to show that ∑_{i=1}^n (Xi − X̄)² and n(X̄ − µ)² are independent. For this, it is enough to show that the RV X̄ is independent of the random vector X := (X1 − X̄, X2 − X̄, · · · , Xn − X̄). We will do this by using the joint MGF and the factorization theorem. Note that
\[
\begin{aligned}
\mathrm{MGF}_{\bar{X},\mathbf{X}}(s, t_1, \dots, t_n) &= E\Big[\exp\Big(s\bar{X} + t_1(X_1 - \bar{X}) + \cdots + t_n(X_n - \bar{X})\Big)\Big] && (308)\\
&= E\Big[\exp\Big(\sum_{i=1}^n t_i X_i + \Big(s - \sum_{i=1}^n t_i\Big)\bar{X}\Big)\Big] && (309)\\
&= E\Big[\exp\Big(\sum_{i=1}^n t_i X_i + \big((s/n) - \bar{t}\big)\sum_{i=1}^n X_i\Big)\Big] && (310)\\
&= E\Big[\exp\Big(\sum_{i=1}^n \big(t_i - \bar{t} + (s/n)\big)X_i\Big)\Big], && (311)
\end{aligned}
\]
where we have denoted t̄ = (t1 + ··· + tn)/n. Now since the Xi's are i.i.d. with N(µ,σ²) distribution, we have
\[
\begin{aligned}
E\Big[\exp\Big(\sum_{i=1}^n \big(t_i - \bar{t} + (s/n)\big)X_i\Big)\Big] &= \prod_{i=1}^n \mathrm{MGF}_{X_i}\big(t_i - \bar{t} + (s/n)\big) && (312)\\
&= \prod_{i=1}^n \exp\Big(\mu\big(t_i - \bar{t} + (s/n)\big) + \frac{\sigma^2}{2}\big(t_i - \bar{t} + (s/n)\big)^2\Big) && (313)\\
&= \exp\Big(\sum_{i=1}^n \Big[\mu\big(t_i - \bar{t} + (s/n)\big) + \frac{\sigma^2}{2}\big(t_i - \bar{t} + (s/n)\big)^2\Big]\Big) && (314)\\
&= \exp\Big(s\mu + \frac{\sigma^2}{2}\Big(\sum_{i=1}^n (t_i - \bar{t})^2 + \frac{s^2}{n}\Big)\Big) && (315)\\
&= \exp\Big(s\mu + \frac{\sigma^2 s^2}{2n}\Big)\exp\Big(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big). && (316)
\end{aligned}
\]
This shows that the joint MGF of X̄ and X factorizes, which implies their independence. This shows the assertion.

4. Confidence intervals of difference of two means

In this section, we study how to statistically compare two unknown RVs X and Y by sampling and constructing confidence intervals. We start with the ideal situation in which both X and Y are normal with known variances.

Example 7.4.1 (Confidence interval for the difference of two means I). Let X ∼ N(µX, σ²X) and Y ∼ N(µY, σ²Y), where both σX and σY are known, but the means are unknown. We want to know whether µX ≠ µY. To test this, draw i.i.d. samples X1, · · · , Xn for X and Y1, · · · , Ym for Y. Recall that
\[
\bar{X} \sim N(\mu_X,\; \sigma_X^2/n), \qquad \bar{Y} \sim N(\mu_Y,\; \sigma_Y^2/m). \tag{317}
\]
If it were the case that µX ≠ µY, then we would expect X̄ and Ȳ to differ as well. To make this quantitative, we consider the RV X̄ − Ȳ. Note that
\[
E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y, \qquad \mathrm{Var}(\bar{X} - \bar{Y}) = \mathrm{Var}(\bar{X}) + \mathrm{Var}(\bar{Y}) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}. \tag{318}
\]
We make the further assumption that the Xi's and Yj's are independent of each other. Then X̄ and Ȳ are also independent, so
\[
\bar{X} - \bar{Y} \sim N\Big(\mu_X - \mu_Y,\; \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\Big). \tag{319}
\]
Thus if we standardize X̄ − Ȳ and form a statistic Z, we get
\[
Z := \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} \sim N(0,1). \tag{320}
\]
From this we can construct a confidence interval for µX − µY. Namely, (320) implies
\[
P\Big(-z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}} \le (\bar{X} - \bar{Y}) - (\mu_X - \mu_Y) \le z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}}\Big) = 1 - \alpha. \tag{321}
\]
Rewriting, this gives
\[
P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}}\Big]\Big) = 1 - \alpha. \tag{322}
\]
That is, the probability that the random interval in the above display contains the difference µX − µY of the population means is 1 − α. In this sense, this random interval is called the 100(1−α)% confidence interval for µX − µY.


For instance, let n = 15, m = 8, x̄ = 70.1, ȳ = 75.3, σ²X = 60, σ²Y = 40, and 1 − α = 0.90. Note that 1 − α/2 = 0.95 = P(Z ≤ 1.645) for Z ∼ N(0,1). Hence zα/2 = 1.645, so
\[
z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} = 1.645\sqrt{\frac{60}{15} + \frac{40}{8}} = 1.645 \cdot 3 = 4.935. \tag{323}
\]
Noting that x̄ − ȳ = −5.2, we conclude that [−5.2 − 4.935, −5.2 + 4.935] = [−10.135, −0.265] is a 90% confidence interval for µX − µY. In particular, we can claim that, with at least 90% confidence, µY is larger than µX by at least 0.265. N

Example 7.4.2 (Confidence interval for the difference of two means II). Let X and Y be independent RVs such that E[X] = µX, Var(X) = σ²X, E[Y] = µY, and Var(Y) = σ²Y. Suppose we do not know whether X and Y have normal distributions, but we know σX and σY. Then, by using the CLT, all our confidence interval estimates in the previous example hold asymptotically. Namely, by the CLT,
\[
\bar{X} \Rightarrow N(\mu_X,\; \sigma_X^2/n), \qquad \bar{Y} \Rightarrow N(\mu_Y,\; \sigma_Y^2/m) \tag{324}
\]
as n, m → ∞. As all samples are independent, this yields that, as n, m → ∞,
\[
\bar{X} - \bar{Y} \Rightarrow N\Big(\mu_X - \mu_Y,\; \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\Big). \tag{325}
\]
Thus when n and m are large, X̄ − Ȳ approximately follows the normal distribution above. This implies the following 100(1−α)% (approximate) confidence interval for µX − µY:
\[
P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}}\Big]\Big) \approx 1 - \alpha. \tag{326}
\]

For instance, let n = 60, m = 40, x̄ = 70.1, ȳ = 75.3, σ²X = 60, σ²Y = 40, and 1 − α = 0.90. Note that 1 − α/2 = 0.95 = P(Z ≤ 1.645) for Z ∼ N(0,1). Hence zα/2 = 1.645, so
\[
z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} = 1.645\sqrt{\frac{60}{60} + \frac{40}{40}} = 1.645\cdot\sqrt{2} \approx 2.326. \tag{327}
\]
Noting that x̄ − ȳ = −5.2, we conclude that [−5.2 − 2.326, −5.2 + 2.326] = [−7.526, −2.874] is a 90% confidence interval for µX − µY. In particular, we can claim that, with at least 90% confidence, µY is larger than µX by at least 2.87. N

Example 7.4.3 (Confidence interval for the difference of two means III). Let X1, · · · , Xn be i.i.d. from N(µX, σ²X) and Y1, · · · , Ym be i.i.d. from N(µY, σ²Y). Also assume that all such samples are independent of each other. In this discussion, we suppose that σX and σY are both unknown.

As in the previous section for estimating a single mean, we may try to replace the population variances with sample variances. This would result in the following random interval:
\[
\Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\sqrt{\tfrac{S_X^2}{n} + \tfrac{S_Y^2}{m}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\sqrt{\tfrac{S_X^2}{n} + \tfrac{S_Y^2}{m}}\Big]. \tag{328}
\]
This may heuristically serve as a 100(1−α)% (approximate) confidence interval for µX − µY, at least when n, m ≫ 1. However, this is not rigorously justified (as opposed to Example 7.3.6).

In fact, we can use the t-distribution to handle this situation, at least when we further assume that σX = σY = σ (that is, unknown but equal variances). Namely, recall that
\[
\frac{(n-1)S_X^2}{\sigma^2} \sim \chi^2(n-1), \qquad \frac{(m-1)S_Y^2}{\sigma^2} \sim \chi^2(m-1). \tag{329}
\]
As all samples are independent, the two RVs above are also independent. It follows that their sum is a χ² RV with (n − 1) + (m − 1) = n + m − 2 degrees of freedom:
\[
U = \frac{(n-1)S_X^2}{\sigma^2} + \frac{(m-1)S_Y^2}{\sigma^2} \sim \chi^2(n+m-2). \tag{330}
\]
Moreover, at the end of the proof of Proposition 7.3.5, we have shown that the sample mean and sample variance are independent. Hence if we define Z to be the standardization of the difference of sample means (see (320)), then U and Z are independent. Thus by the definition of the t-distribution (see Definition 7.3.2), we have
\[
\begin{aligned}
T := \frac{Z}{\sqrt{U/(n+m-2)}} &= \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{m}}}\cdot\frac{1}{\sqrt{\Big(\frac{(n-1)S_X^2}{\sigma^2} + \frac{(m-1)S_Y^2}{\sigma^2}\Big)\big/(n+m-2)}} && (331)\\
&= \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}}}\cdot\frac{1}{\sqrt{\frac{1}{n} + \frac{1}{m}}} \;\sim\; t(n+m-2). && (332)
\end{aligned}
\]
Denoting the pooled estimator
\[
S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}, \tag{333}
\]

this implies the following 100(1−α)% confidence interval for µX − µY:
\[
P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - t_{\alpha/2}(n+m-2)\,S_p\sqrt{\tfrac{1}{n} + \tfrac{1}{m}},\; (\bar{X} - \bar{Y}) + t_{\alpha/2}(n+m-2)\,S_p\sqrt{\tfrac{1}{n} + \tfrac{1}{m}}\Big]\Big) = 1 - \alpha. \tag{334}
\]

For instance, consider the problem of comparing the means of N(µX,σ²) and N(µY,σ²) for unknown σ. Suppose we have n = 9 samples from the first distribution with x̄ = 81.31, s²X = 60.76, and m = 15 samples from the second distribution with ȳ = 78.61, s²Y = 48.24. The positive square root of the pooled estimator is given by
\[
s_p = \sqrt{\frac{(9-1)(60.76) + (15-1)(48.24)}{9+15-2}} \approx 7.2658. \tag{335}
\]
Noting that t0.025(22) = 2.074, we compute the endpoints of the corresponding 95% confidence interval for µX − µY as
\[
(81.31 - 78.61) \pm (2.074)(7.2658)\sqrt{\frac{1}{9} + \frac{1}{15}} = -3.6538,\; 9.0538. \tag{336}
\]
Since this interval contains 0, we cannot conclude that µX ≠ µY with high confidence as before. N
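A short sketch reproducing the pooled-variance computation (335)-(336), using only the Python standard library (t0.025(22) = 2.074 from a t-table):

\begin{verbatim}
# Pooled-variance t confidence interval for the difference of two means.
from math import sqrt

n, m = 9, 15
xbar, ybar = 81.31, 78.61
s2x, s2y = 60.76, 48.24
t = 2.074

sp = sqrt(((n - 1) * s2x + (m - 1) * s2y) / (n + m - 2))    # about 7.2658
half = t * sp * sqrt(1 / n + 1 / m)
print(xbar - ybar - half, xbar - ybar + half)               # about [-3.65, 9.05]
\end{verbatim}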

Remark 7.4.4. What if we know neither the population variances nor whether they are equal? See the discussion at the end of [HTZ77, Sec. 7.2]. We give an alternative solution to this in Example 7.4.6.

Example 7.4.5 (Confidence interval for the difference of two means IV). Suppose we are interested in two measurements (e.g., height and weight) for each subject in some experiment. We model this as a pair (X, Y) of RVs. We are interested in whether E[X] ≠ E[Y]. In order to test this, we make i.i.d. observations (X1, Y1), · · · , (Xn, Yn). Notice that the pairs are independent as random vectors, but each Xi and Yi may be dependent (height and weight of the same i-th subject). To build a confidence interval for E[X] − E[Y], we start from the observation that X1 − Y1, · · · , Xn − Yn are i.i.d. RVs. Denote their sample mean by D̄. First note that
\[
\mu_D := E[\bar{D}] = E[X_1 - Y_1] = E[X] - E[Y], \tag{337}
\]
\[
\sigma_D^2 := \mathrm{Var}(X_1 - Y_1) = \mathrm{Var}(X - Y), \qquad \text{so that} \qquad \mathrm{Var}(\bar{D}) = \frac{\sigma_D^2}{n}. \tag{338}
\]
(Note here that Var(X − Y) ≠ Var(X) + Var(Y) in general, since X and Y are dependent.) Being the sample mean of n i.i.d. RVs, D̄ is approximately normal by the CLT:
\[
\bar{D} \Rightarrow N\big(\mu_D,\; \sigma_D^2/n\big). \tag{339}
\]
Hence if we know σD, this gives the following (approximate) 100(1−α)% confidence interval for µD:
\[
P\Big(\mu_D \in \Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\frac{\sigma_D}{\sqrt{n}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\frac{\sigma_D}{\sqrt{n}}\Big]\Big) \approx 1 - \alpha. \tag{340}
\]
In case σD is unknown, but the Xi − Yi's are normal, we may use the sample variance S²D of the differences Xi − Yi in place of σ²D and use the t-distribution. Namely, we have
\[
\frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \sim t(n-1). \tag{341}
\]
This gives
\[
P\Big(\mu_D \in \Big[\bar{D} - t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}},\; \bar{D} + t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{342}
\]
Recalling that D̄ = X̄ − Ȳ, the random interval
\[
\Big[(\bar{X} - \bar{Y}) - t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}},\; (\bar{X} - \bar{Y}) + t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}}\Big] \tag{343}
\]
contains E[X] − E[Y] with probability 1 − α.

Example 7.4.6 (Confidence interval for the difference of two means V). We revisit the problem in Example 7.4.3. Let X1, · · · , Xn be i.i.d. from N(µX, σ²X) and Y1, · · · , Ym be i.i.d. from N(µY, σ²Y), with n < m. Also assume that all such samples are independent of each other and that σX and σY are unknown. In the case σX = σY, we can use the pooled estimator and make use of the full samples, as discussed in Example 7.4.3. In this discussion, suppose we do not know whether σX = σY. Since all we are interested in is the difference µX − µY, we will use only the first n samples from the larger sample Y1, · · · , Ym, and form the following n i.i.d. samples of differences:
\[
X_1 - Y_1,\; X_2 - Y_2,\; \cdots,\; X_n - Y_n. \tag{344}
\]
These are i.i.d. RVs with mean µX − µY and unknown variance. We can use the sample variance of the above differences in place of this unknown variance. This reduces our problem to the exact situation of the previous example. Namely, we use the sample variance S²D of the differences Xi − Yi and the t-distribution, so that
\[
\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{S_D/\sqrt{n}} \sim t(n-1), \tag{345}
\]
where Ȳ here denotes the sample mean of Y1, · · · , Yn. Thus the random interval
\[
\Big[(\bar{X} - \bar{Y}) - t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}},\; (\bar{X} - \bar{Y}) + t_{\alpha/2}(n-1)\frac{S_D}{\sqrt{n}}\Big] \tag{346}
\]
contains µX − µY with probability 1 − α.

Example 7.4.7 (Excerpted and modified from [HTZ77]). An experiment was conducted to compare people's reaction times to a red light versus a green light. When signaled with either the red or the green light, the subject was asked to hit a switch to turn off the light. When the switch was hit, a clock was turned off and the reaction time in seconds was recorded. The following results give the reaction times for ten subjects, with some missing data:

Suppose we do know that these samples are from normal distributions, but we know neither the values of their variances nor whether they are equal to each other.


Subject   Red (x)   Green (y)   Difference (x − y)
   1       0.30       0.43          -0.13
   2       0.23       0.32          -0.09
   3       0.41       0.58          -0.17
   4       0.53       0.46           0.07
   5       0.24       0.27          -0.03
   6       0.36       0.41          -0.05
   7       0.38       0.38           0.00
   8       0.51       0.61          -0.10
   9        ?         0.83            ?
  10       0.49        ?              ?

Since each X and Y in the same row are from the same subject, they are likely to be dependent on each other. By disregarding the last two incomplete rows, we may use the approach in Example 7.4.5. Namely, using only the first 8 rows, we compute
\[
\bar{d} = -0.0625, \qquad s_d = 0.0765. \tag{347}
\]
Noting that t0.025(7) = 2.365, we compute the following 95% confidence interval for µX − µY:
\[
-0.0625 \pm 2.365\,\frac{0.0765}{\sqrt{8}}, \qquad \text{or} \qquad [-0.1265,\; 0.0015]. \tag{348}
\]
Since this interval contains 0, we cannot conclude that µX ≠ µY with 95% confidence.

Next, suppose that the samples X and Y, even from the same subject, are independent. Also assume that X and Y have the same variance. Then we can use the approach in Example 7.4.3 with the pooled estimator. This approach makes use of the data in all 10 rows of the table (you may carry out this computation). However, even if we do not have this equal-variance assumption, we may proceed as in Example 7.4.6 and obtain the same conclusion as in the previous paragraph. This illustrates our discussion in Example 7.4.6. N
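The paired computation (347)-(348) can be reproduced from the first eight rows of the table; a sketch assuming numpy is available (t0.025(7) = 2.365 from a t-table):

\begin{verbatim}
# Paired t confidence interval from the first eight (complete) rows.
import numpy as np

red   = np.array([0.30, 0.23, 0.41, 0.53, 0.24, 0.36, 0.38, 0.51])
green = np.array([0.43, 0.32, 0.58, 0.46, 0.27, 0.41, 0.38, 0.61])

d = red - green
n = len(d)
dbar, sd = d.mean(), d.std(ddof=1)          # about -0.0625 and 0.0765
half = 2.365 * sd / np.sqrt(n)
print(dbar - half, dbar + half)             # about [-0.1265, 0.0015]
\end{verbatim}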

5. Confidence interval for proportions

Example 7.5.1. Let E_A be the event that a randomly selected voter supports candidate A. Using a poll, we would like to estimate p = P(E_A), which can be understood as the proportion of supporters of candidate A. As before, we observe a sequence of i.i.d. indicator variables Xk = 1(E_A). Let p̂n = n^{-1}(X1 + ··· + Xn) be the empirical proportion of supporters of A out of n samples. We know by the WLLN that p̂n converges to p in probability. But if we want to guarantee a certain significance level α for an error bound ε, how many samples should we take?

Let (Xn)n≥1 be a sequence of i.i.d. Bernoulli(p) RVs. According to the CLT, we immediately deduce the following convergence in distribution:
\[
\frac{\hat{p}_n - p}{\sqrt{p(1-p)/n}} \Rightarrow Z \sim N(0,1). \tag{349}
\]
Hence for any ε > 0, we have
\[
\begin{aligned}
P\big(|\hat{p}_n - p| \le \varepsilon\big) &= P\Bigg(\bigg|\frac{\hat{p}_n - p}{\sqrt{p(1-p)/n}}\bigg| \le \frac{\varepsilon\sqrt{n}}{\sqrt{p(1-p)}}\Bigg) && (350)\\
&\ge P\Bigg(\bigg|\frac{\hat{p}_n - p}{\sqrt{p(1-p)/n}}\bigg| \le 2\varepsilon\sqrt{n}\Bigg) && (351)\\
&\approx P\big(|Z| \le 2\varepsilon\sqrt{n}\big) = 2P\big(0 \le Z \le 2\varepsilon\sqrt{n}\big), && (352)
\end{aligned}
\]
where for the inequality we have used the fact that p(1−p) ≤ 1/4 for all 0 ≤ p ≤ 1. The last expression is at least 1 − α if and only if
\[
P\big(0 \le Z \le 2\varepsilon\sqrt{n}\big) \ge 0.5 - (\alpha/2). \tag{353}
\]
This shows the following implication (for n large):
\[
n \ge \Big(\frac{z_{\alpha/2}}{2\varepsilon}\Big)^2 \;\Longrightarrow\; P\big(|\hat{p}_n - p| \le \varepsilon\big) \ge 1 - \alpha. \tag{354}
\]
See Exercise 7.6.3 for more details.

For instance, from the table of the standard normal distribution, we know that P(0 ≤ Z ≤ 1.96) = 0.475. Hence
\[
n \ge \Big(\frac{0.98}{\varepsilon}\Big)^2 \tag{355}
\]
implies that p̂n is within ε of p with probability at least 0.95. For instance, ε = 0.01 gives n ≥ 9604. N

In the discussion above, we made use of the simple fact p(1−p) ≤ 1/4 for all p ∈ [0,1] to remove the dependence of the error term on the unknown parameter p. In the exercise below, we present a slightly more precise analysis.

Exercise 7.5.2. Let X1, · · · , Xn be i.i.d. samples from Bernoulli(p) and denote p̂n = n^{-1}(X1 + ··· + Xn).

(i) Show that
\[
\bigg|\frac{\hat{p}_n - p}{\sqrt{p(1-p)/n}}\bigg| \le z \;\iff\; n(\hat{p}_n - p)^2 \le z^2 p(1-p) \;\iff\; \big(1 + (z^2/n)\big)p^2 - 2\big(\hat{p}_n + (z^2/(2n))\big)p + \hat{p}_n^2 \le 0. \tag{356\text{-}357}
\]
(ii) Solving the quadratic inequality in (i) for p and using the CLT, show that an approximate 100(1−α)% confidence interval for p is given by the endpoints
\[
\frac{\hat{p}_n + (z^2/(2n))}{1 + (z^2/n)} \pm \frac{z\sqrt{\hat{p}_n(1-\hat{p}_n)/n + (z/(2n))^2}}{1 + (z^2/n)} \tag{358}
\]
for z = zα/2.
(iii) For n ≫ 1, show that an approximate 100(1−α)% confidence interval is given by the endpoints
\[
\hat{p}_n \pm z\sqrt{\hat{p}_n(1-\hat{p}_n)/n}. \tag{359}
\]

Example 7.5.3 (Excerpted from [HTZ77]). In a certain political campaign, one candidate has a poll taken at random among the voting population. The results are that 185 out of n = 351 voters favor this candidate. Even though p̂ = 185/351 ≈ 0.527, should the candidate feel very confident of winning? From Exercise 7.5.2, an approximate 95% confidence interval for the fraction p of the voting population who favor the candidate is
\[
0.527 \pm 1.96\sqrt{\frac{(0.527)(0.473)}{351}}, \tag{360}
\]
or, equivalently, [0.475, 0.579]. Thus, there is a good possibility that p is less than 50%, and the candidate should certainly take this possibility into account in campaigning. N
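The interval in Example 7.5.3 follows from the simple endpoints of Exercise 7.5.2(iii); a short computation using only the Python standard library:

\begin{verbatim}
# Approximate 95% confidence interval for a proportion (poll example).
from math import sqrt

n, successes = 351, 185
phat = successes / n                        # about 0.527
half = 1.96 * sqrt(phat * (1 - phat) / n)
print(phat - half, phat + half)             # about [0.475, 0.579]
\end{verbatim}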

6. Sample size

So far, our focus was on obtaining a confidence interval for the population mean at a chosen confidence level 1 − α, given i.i.d. samples X1, · · · , Xn from an unknown distribution fX;θ. In other words, we tried to get the most out of a given sample. In this section, we consider the reverse question: if we want a confidence interval for the population mean with a certain confidence level and error bound, how large does the sample size n have to be? We have already seen an example of this discussion in Example 7.5.1. In fact, using the confidence intervals we have derived, we can estimate the needed sample size via an easy computation.

Exercise 7.6.1 (Sample size for estimating the mean). Let X1, · · · , Xn be i.i.d. samples from some unknown distribution with mean µ. Let X̄ and S² denote the sample mean and sample variance. Fix α ∈ (0,1) and ε > 0.

(i) Suppose the population distribution is N(µ,σ²) for known σ² > 0. Recall that we have the following 100(1−α)% confidence interval for µ:
\[
P\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{361}
\]
Deduce that
\[
n \ge (2z_{\alpha/2}\sigma/\varepsilon)^2 \;\iff\; z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \frac{\varepsilon}{2} \;\iff\; P\Big(\mu \in \Big[\bar{X} - \frac{\varepsilon}{2},\; \bar{X} + \frac{\varepsilon}{2}\Big]\Big) \ge 1 - \alpha. \tag{362}
\]
(ii) Suppose the population distribution is not necessarily normal but has known variance σ² > 0. Recall that the CLT gives us the following approximate 100(1−α)% confidence interval for µ:
\[
P\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big) \approx 1 - \alpha. \tag{363}
\]
Deduce that, approximately for large n,
\[
n \ge (2z_{\alpha/2}\sigma/\varepsilon)^2 \;\iff\; z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \frac{\varepsilon}{2} \;\iff\; P\Big(\mu \in \Big[\bar{X} - \frac{\varepsilon}{2},\; \bar{X} + \frac{\varepsilon}{2}\Big]\Big) \ge 1 - \alpha. \tag{364}
\]
(iii) Suppose the population distribution is normal but with unknown variance σ². Recall that we have the following 100(1−α)% confidence interval for µ:
\[
P\Big(\mu \in \Big[\bar{X} - t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}}\Big]\Big) = 1 - \alpha. \tag{365}
\]
Deduce that
\[
n \ge (2t_{\alpha/2}(n-1)S/\varepsilon)^2 \;\iff\; t_{\alpha/2}(n-1)\frac{S}{\sqrt{n}} \le \frac{\varepsilon}{2} \;\iff\; P\Big(\mu \in \Big[\bar{X} - \frac{\varepsilon}{2},\; \bar{X} + \frac{\varepsilon}{2}\Big]\Big) \ge 1 - \alpha. \tag{366}
\]
(iv) In each of the cases (i)-(ii), suppose σ² = 5. If we would like to get a 95% confidence interval for µ of length ≤ 0.01, what is the smallest sample size n that guarantees this?
(v) In the case of (iii), suppose s² = 5 and n = 30. If we would like to get a confidence interval of length ≤ 0.01, what is the largest confidence we can guarantee?
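A minimal helper for the sample-size bound in parts (i)-(ii) of the exercise above, using only the Python standard library (the illustrative numbers in the last line are ours, not the exercise's values):

\begin{verbatim}
# Smallest n with z*sqrt(sigma^2/n) <= eps/2, i.e. confidence-interval length <= eps.
from math import ceil

def sample_size(sigma2, eps, z):
    """Sample size guaranteeing a 100(1-alpha)% interval of total length at most eps."""
    return ceil((2 * z * sigma2**0.5 / eps) ** 2)

print(sample_size(sigma2=4.0, eps=0.5, z=1.96))   # e.g. a 95% interval of length 0.5
\end{verbatim}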

Exercise 7.6.2 (Sample size for estimating the difference of two means). Let X1, · · · , Xn and Y1, · · · , Ym be i.i.d. samples from unknown distributions with means µX and µY, respectively. Fix α ∈ (0,1) and ε > 0.

(i) Suppose the population distributions for the Xi's and Yi's are normal with known variances σ²X and σ²Y, respectively. Recall that we have the following 100(1−α)% confidence interval for µX − µY:
\[
P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}}\Big]\Big) = 1 - \alpha. \tag{367}
\]
Deduce that
\[
z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} \le \frac{\varepsilon}{2} \;\iff\; P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - \frac{\varepsilon}{2},\; (\bar{X} - \bar{Y}) + \frac{\varepsilon}{2}\Big]\Big) \ge 1 - \alpha. \tag{368}
\]
(ii) Suppose the population distributions for the Xi's and Yi's are not necessarily normal but have known variances σ²X and σ²Y, respectively. Recall that the CLT gives us the following approximate 100(1−α)% confidence interval for µX − µY:
\[
P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}},\; (\bar{X} - \bar{Y}) + z_{\alpha/2}\sqrt{\tfrac{\sigma_X^2}{n} + \tfrac{\sigma_Y^2}{m}}\Big]\Big) \approx 1 - \alpha. \tag{369}
\]
Deduce that (approximately for large n and m)
\[
z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} \le \frac{\varepsilon}{2} \;\iff\; P\Big(\mu_X - \mu_Y \in \Big[(\bar{X} - \bar{Y}) - \frac{\varepsilon}{2},\; (\bar{X} - \bar{Y}) + \frac{\varepsilon}{2}\Big]\Big) \ge 1 - \alpha. \tag{370}
\]

(iii) Suppose the population distribution are normal but with unknown but equal variances. Recall thatwe have the following 100(1−α)% confidence interval for µX −µY :

P

(µX −µY ∈

[(X − Y )− tα/2(n +m −2)Sp

√1

n+ 1

m, (X − Y )+ tα/2(n +m −2)Sp

√1

n+ 1

m

])= 1−α.

(371)

where S2p = (n−1)S2

X +(m−1)S2Y

n+m−2 . Deduce that

tα/2(n +m −2)sp

√1

n+ 1

m≤ ε

2⇐⇒ P

(µ ∈

[X − ε

2, X + ε

2

])≥ 1−α. (372)

(iv) Let (X1,Y1), · · · , (Xn ,Yn) be i.i.d. copies of a random vector (X ,Y ) with unknown mean and vari-ance. Suppose Xi −Yi ’s have normal distribution. Let S2

D denote the sample variance for thedifferences X1 −Y1, · · · , Xn −Yn . Recall that we have the following

P

((E[X ]−E[Y ]) ∈

[(X − Y )− tα/2(n −1)

SDpn

, (X − Y )+ tα/2(n −1)SDp

n

])= 1−α. (373)

Deduce that

zα/2SDp

n≤ ε

2⇐⇒ P

((E[X ]−E[Y ]) ∈

[(X − Y )− ε

2,(X − Y )+ ε

2

])≥ 1−α. (374)

(v) In each cases of (i)-(ii), suppose σ2X = 5 and σ2

Y = 3. If we would like to get a 95% confidence intervalfor E[X ]−E[Y ] of length ≤ 0.01, what is the choice of sample sizes n and m such that n +m isminimized?

(vi) In the cases (iii), suppose s2X = 5, s2

Y = 3, n = 20, and m = 30, depending on the assumption. If wewould like to get a confidence interval of length ≤ 0.01, what is the largest confidence we canguarantee?

(v) In the cases of (iv), suppose n = 20,∑n

i=1 x2i = 100,

∑ni=1 y2

i = 60, and∑n

i=1 xi yi = 60. If we would liketo get a confidence interval of length ≤ 0.01, what is the largest confidence we can guarantee?

Exercise 7.6.3 (Sample size for estimating proportion). Let X1, · · · , Xn be i.i.d. samples from Bernoulli(p)for unknown p. Let X and S2 denote the sample mean and sample variance. Fix α ∈ (0,1) and ε> 0. Letpn = n−1(X1 +·· ·+Xn).

(i) Use CLT to show that we have the following convergence in distribution as n →∞:

pn −p√p(1−p)/n

⇒ Z ∼ N (0,1). (375)

(ii) Show that t (1− t ) ≤ 1/4 for all t ∈ [0,1]. Use this and (i) to show that

P(|pn −p| ≤ ε)=P(∣∣∣∣∣ pn −p√

p(1−p)/n

∣∣∣∣∣≤ εp

n√p(1−p)

)(376)

Page 53: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

7. DISTRIBUTION-FREE CONFIDENCE INTERVALS FOR PERCENTILES 53

≥P(∣∣∣∣∣ pn −p√

p(1−p)/n

∣∣∣∣∣≤ 2εp

n

)≈P(|Z | ≤ 2ε

pn

)= 2P(0 ≤ Z ≤ 2εp

n), (377)

(iii) Suppose 2εp

n ≥ zα/2. Show that

2P(0 ≤ Z ≤ 2εp

n) ≥ 2P(0 ≤ Z ≤ zα/2) = 1−α. (378)

(iv) From (ii) and (iii), deduce the following implication (for n large)

n ≥( zα/2

)2=⇒ P

(|pn −p| ≤ ε)≥ 1−α. (379)

(v) Suppose we want to estimate p with confidence at least 99% with error at most 0.01. How largeshould our sample size n be?

7. Distribution-free confidence intervals for percentiles

The main question we will address in this section is how we can estimate population percentiles us-ing sample percentiles. Our main tool are order statistics, which we have discussed in Section 2. Themethods presented here does not make any assumption on the population distribution (except that it iscontinuous), hence the name “distribution-free confidence intervals”.

Given a random variable X with distribution fX , its median is the number m such that P(X ≤ m) =1/2. In case there are several such values of m, we may take the smallest of such:

m = m(X ) := infx ∈R |P(X ≤ x) = 1/2. (380)

For instance, if X ∼ N (3,4), then its median is m = 3. In general, for given p ∈ [0,1], the percentile of orderp for X (or for its distribution fX ) is defined by

πp =πp (X ) := infx ∈R |P(X ≤ x) = p. (381)

Recall the definition of sample percentile of order p in Section 2.We begin our discussion with the following motivating example.

Example 7.7.1. Suppose we have i.i.d. samples X1, X2, · · · , X5 of an unknown continuous RV X , for whichwe have no idea what its distribution looks like. We are interested in estimating its median m = m(X ).For this, we order the samples to obtain the following order statistics

Y1 < Y2 < Y3 < Y4 < Y5. (382)

From this sample, it is tempting that the ‘middle value’ Y3 might be close to the population median m.However, since we have only five samples, we will go for a more conservative estimate and claim that mis between Y1 and Y5. In what confidence can we make this claim? Namely, we would like to computethe following probability

P(m(X ) ∈ (Y1,Y5)

)=P (Y1 < m(X ) < Y5) . (383)

In order to compute the above probability, observe that by definition of the median m = m(X ) and sinceX is continuous, P(Xi < m) = 1/2 for all i = 1,2, · · · ,5. For Y1 < m(X ), we need at least one Xi such thatXi < m. For m < Y5, we need at least one X j such that X j > m. By viewing the event Xi < i a ‘success’ ati th trial, we want a Binomial(5,1/2) RV to take values from 1 up to 4. Namely,

P (Y1 < m(X ) < Y5) =P (k Xi ’s are < m and the rest are > m for k = 1,2,3,4) (384)

=4∑

k=1P (k sucesses out of 5 trials) (385)

=P (1 ≤ Binomial (5,1/2) ≤ 4) . (386)

This gives

P (Y1 < m(X ) < Y5) = 1−P ( Binomial (5,1/2) = 0)−P ( Binomial (5,1/2) = 5) (387)

Page 54: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

7. DISTRIBUTION-FREE CONFIDENCE INTERVALS FOR PERCENTILES 54

= 1−(

1

2

)5

−(

1

2

)5

= 15

16≈ 0.94. (388)

This means that the population median m is in the interval (Y1,Y5) with confidence 94%. N

Example 7.7.2 (Excerpted from [HTZ77]). The lengths in centimeters of n = 9 fish of a particular speciescaptured off the New England coast were 32.5, 27.6, 29.3, 30.1, 15.5, 21.7, 22.8, 21.2, and 19.0. Thus, theobserved order statistics are

15.5 < 19.0 < 21.2 < 21.7 < 22.8 < 27.6 < 29.3 < 30.1 < 32.5. (389)

Before the sample is drawn, we know that

P (Y2 < m < Y8) =P (2 ≤ Binomial (5,1/2) < 8) (390)

=P ( Binomial (5,1/2) ≤ 7)−P ( Binomial (5,1/2) ≤ 1) (391)

≈ 0.9805−0.0195 = 0.9610, (392)

where for the last computation, we have used the binomial table in [HTZ77, Table II, Appendix B]. Fromthe samples drawn above, we have y2 = 19.0 and y8 = 30.1. Thus the interval (19.0,30.1) is a 96% confi-dence interval for the the population median m. N

When we have large sample size, we can use CLT for a normal approximation for binomial distribu-tion (see Exercise 7.2.2).

Exercise 7.7.3. Let X1, · · · , Xn be i.i.d. samples of a continuous RV X , and let Y1 < ·· · < Yn be their orderstatistics. Fix p ∈ [0,1] and let πp =πp (X ) denote the percentile of X of order p.

(i) For each Xi , show that P(Xi <πp ) = p. We will call the event Xi <πp a ‘success at the i th trial’.(ii) Show that

P(Yi <πp < Y j

)=P(i ≤ # of successes in n trials < j ) (393)

=P(i ≤ Binomial(n, p) < j

)(394)

=P(i −0.5 ≤ Binomial(n, p) ≤ j −0.5

). (395)

(iii) Use CLT (or Exercise 7.2.2) and (ii) to deduce that

P(Yi <πp < Y j

)≈P(i −0.5−np√

np(1−p)≤ N (0,1) < j −0.5−np√

np(1−p)

). (396)

Example 7.7.4. Let Y1 < ·· · < Y16 be order statistics of n = 16 i.i.d. samples X1, · · · , Xn of a continuous RVX . According to Exercise 7.7.3 (ii) and the binomial table in [HTZ77, Table II, Appendix B],

P (Y5 < m < Y12) =P(5 ≤ Binomial (16,1/2) ≤ 11) (397)

=P( Binomial (16,1/2) ≤ 11)−P( Binomial (16,1/2) ≤ 4) (398)

= 0.9616−0.0384 = 0.9232. (399)

On the other hand, using the normal approximation in Exercise 7.7.3 (iii),

P (Y5 < m < Y12) ≈P(

5−0.5−8p4

≤ N (0,1) ≤ 12−0.5−8p4

)(400)

=P (−1.75 ≤ N (0,1) ≤ 1.75) = 0.9198, (401)

where the last numerical value can be computed from the standard normal table given at the end of thisnote. Thus the normal approximation here gives quite accurate result in computing the confidence. N

Exercise 7.7.5 (Excerpted from [HTZ77]). Let the following numbers represent the order statistics of then = 27 observations obtained in a random sample from a certain population of incomes (measured inhundreds of dollars):

261 269 271 274 279 280 283 284 286 287 292 293 296 300 (402)

Page 55: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

7. DISTRIBUTION-FREE CONFIDENCE INTERVALS FOR PERCENTILES 55

304 305 313 321 322 329 341 343 356 364 391 417 476 (403)

We are interested in estimating the 25th percentile, π0.25, of the population.Since (n +1)p = 28(1/4) = 7, the seventh order statistic, namely, y7 = 283, would be a point estimate

of π0.25. For interval estimate, we would like to use the interval (y4, y10) for estimating π0.25. Computethe confidence of this interval estimate.

Page 56: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

CHAPTER 8

Statistical testing

1. Tests about one mean

Suppose we have an unknown RV X . According to some external information, we know that its dis-tribution is either N (1,5) or N (3,5). How can we test which one is correct? We will propose statisticaltesting procedures for this type of problems, which are based on analysis of sample statistics and confi-dence intervals that we have developed before.

Example 8.1.1 (Simple alternative, Excerpted from [HTZ77]). Let X equal the breaking strength of a steelbar. If the bar is manufactured by process I, we know that X ∼ N (50,36). It is hoped that if process II (anew process) is used, then it will be that X ∼ N (55,36). Given a large number of steel bars manufac-tured by process II, how could we test whether the five-unit increase in the mean breaking strength wasrealized?

There are two hypothesis on the mean µ of X ∼ N (µ,36) that we want to test:

1. Null hypothesis H0 :µ= 502. Alternative hypothesis H1 :µ= 55.

Both of the above hypotheses are simple hypotheses, since they completely describe the distribution of X .Otherwise, they would be called a composite hypothesis. We will be testing which of the above hypothesisis more likely to be true, using our given samples:

3. Sampled values x1, x2, · · · , xn of X manufactured by process II

Since we are testing about the unknown mean of X , we will use the usual test statistic:

4. Test statistic: X = n−1(X1 +·· ·+Xn).

Next, we set up a decision rule based on the observed sample mean x. Since large values of x wouldindicate H1 is more likely than H0, our rule may be of the form “Reject H0 and accept H1 if x ≥ 53”.

5. Critical region: C = (x1, · · · , xn) | x ≥ 53. Reject H0 and accept on C , otherwise test is inconclusive.

The decision rule above could lead to the wrong conclusion, in each cases when H0 or H1 is true.Hence we will compute two types of errors as below:

6. Two types of error probabilities:

α=P(Type I error) =P(Reject H0 when H0 is true) (404)

β=P(Type II error) =P(Not reject H0 when H1 is true) (405)

In our case, we can compute these error probabilities using the known distribution of X under eachhypothesis:

X ; H0 ∼ N (50,36/n), X ; H1 ∼ N (55,36/n). (406)

Hence we can compute

α=P(X > 53; H0) =P(

X −50

6/p

n> 53−50

6/p

n; H0

)=P

(N (0,1) > 53−50

6/p

n

)(407)

β=P(X < 53; H1) =P(

X −55

6/p

n< 53−55

6/p

n; H1

)=P

(N (0,1) < 53−55

6/p

n

). (408)

56

Page 57: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

1. TESTS ABOUT ONE MEAN 57

Assuming n = 16, we get the following values

α= 0.0228, β= 0.0913. (409)

Hence the probability of making both typo of errors are quite small. See Figure 1.

FIGURE 1. PDF of X under H0 and H1 with type I and II errors.

N

Example 8.1.2 (Composite alternative, Excerpted and modified from [HTZ77]). Assume that the under-lying distribution is normal with unknown mean µ but known variance σ2 = 100. Say we are testing thesimple null hypothesis H0 :µ= 60 against the composite alternative hypothesis H1 :µ> 60 with a samplemean X based on n = 52 observations. Suppose that we obtain the observed sample mean of x = 62.75,which is a bit larger than µ and biased toward H1. How should we make our conclusion from this data?

As in the previous example, our decision rule will be based on whether the observed sample mean xis biased toward H1 or not. In this example, the critical interval will be of the form

C = (x1, · · · , xn) | x > θ, (410)

where θ is a certain threshold that we will set. Then the type I error will be

type I error =P(X > θ; H0) =P(

X −60

10/p

52> θ−60

10/p

52; H0

)=P

(N (0,1) > θ−60

10/p

52

). (411)

Note that we can tolerate type I error up to probability 0.05 (also called the significance level). We canmake type I error arbitrary small by making the rejection threshold θ large, but then the resulting testwould not be very useful. So we would like to find smallest value of θ which guarantees type I errorto occur with probability at most 0.05. This is when (θ− 60)/(10/

p52) = z0.05, so our optimal test for

significance level α= 0.05 would be

Reject H0 if x > 60+ z0.0510p52

= 60+ (1.745)10p52

= 62.281. (412)

In our case, x = 62.75 so we will reject H0 according to the criterion above.There is an alternative way to form a significance level α= 0.05 test. Note that

P(X ≥ x; H0) < 0.05 ⇐⇒ P

(X −60

10/p

52≥ x −60

10/p

52; H0

)< 0.05 (413)

⇐⇒ P

(N (0,1) ≥ x −60

10/p

52; H0

)< 0.05 (414)

⇐⇒ x −60

10/p

52> z0.05 (415)

Page 58: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

1. TESTS ABOUT ONE MEAN 58

FIGURE 2. Illustration of p-value for H0 :µ= 60 and H1 :µ> 60.

⇐⇒ x > 60+ z0.0510p52

. (416)

Hence we may reformulate the test in (477) as

Reject H0 if P(X ≥ x; H0) < 0.05. (417)

Here the probability P(X ≥ x; H0) is called the p-value associated with observed sample mean x for thisexample. N

In the previous example, we have introduced the p-value associated with sample mean x. Roughlyspeaking, it is the probability of obtaining another random sample mean that is more extreme than theobserved value x under H0. As in the MLE, our grounding belief is that we are observing a data becauseit was likely to occur. Hence if such a p-value is small, this means our observed data is extremely unlikelyunder H0, which goes against our belief, so we better reject H0. Below we formally introduce the p-valuedepending on the composite alternate hypotheses.

p-value associated with x =P(

getting a sample X1, · · · , Xn uner H0 s.t. X is biasedtoward H0 more than the observed sample mean x

)(418)

=

P(X ≥ x; H0) if H1 :µ>µ0

P(X ≥ x; H0) if H1 :µ<µ0

P(|X −µ0| ≥ |x −µ0|; H0) if H1 :µ 6=µ0

(419)

By solving the equation (p-value) <α in the case when the RV X of interest is known to follow normaldistribution with variance σ2, we can summarize our test about one mean of significance level α.

Remark 8.1.3. If we do not know X has normal distribution but know the variance Var(X ) =σ2, we canuse CLT to get the above critical regions approximately for large n. Hence the same hypothesis testingholds approximately for large n.

When we do not know population variance but still know that it has normal distribution, then wecan use t-scores instead of z-scores for testing one mean.

Example 8.1.4 (Excerpted from [HTZ77]). Let X (in millimeters) equal the growth in 15 days of a tumorinduced in a mouse. Assume that the distribution of X is N (µ,σ2). We shall test the null hypothesisH0 : µ = µ0 = 4.0 against the two-sided alternative hypothesis H1 : µ 6= 4.0. If we use n = 9 observationsand a significance level of α= 0.10, the critical region is given by∣∣∣∣ x −4

s/p

9

∣∣∣∣≥ tα/2(8) = 1.860. (422)

Page 59: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

1. TESTS ABOUT ONE MEAN 59

Algorithm 1 Hypothesis testing for one mean, known variance σ2

1: Input:2: Null hypothesis H0 : X ∼ N (µ0,σ2) (or only Var(X ) =σ2 without normal distribution)3: Alternative hypothesis H1 : one of µ<µ0 or µ>µ0 or µ 6=µ0, where µ= E[X ]4: Observed sample: x1, x2, · · · , xn .5: Sample mean x = n−1(x1 +·· ·+xn)6: significance level: α ∈ (0,1)7: Test statistic: X = n−1(X1 +·· ·+Xn)8: Do: Compute the critical region associated with x:

x >µ0 + zασpn

if H1 :µ>µ0

x <µ0 − zασpn

if H1 :µ<µ0

|x −µ0| > zα/2σpn

if H1 :µ 6=µ0

(420)

9: Do:10: If Inside critical region, Reject H0

11: Else Test inconclusive and cannot reject H0

Algorithm 2 Hypothesis testing for one mean, unknown variance

1: Input:2: Null hypothesis H0 : X ∼ N (µ0, ?)3: Alternative hypothesis H1 : one of µ<µ0 or µ>µ0 or µ 6=µ0, where µ= E[X ]4: Observed sample: x1, x2, · · · , xn .5: Sample mean x = n−1(x1 +·· ·+xn) and sample variance s2 = 1

n−1

∑ni=1(xi − x)2.

6: significance level: α ∈ (0,1)7: Test statistic: X = n−1(X1 +·· ·+Xn)8: Do: Compute the critical region associated with x:

x >µ0 + tαspn

if H1 :µ>µ0

x <µ0 − tαspn

if H1 :µ<µ0

|x −µ0| > tα/2spn

if H1 :µ 6=µ0

(421)

9: Do:10: If Inside critical region, Reject H0

11: Else Test inconclusive and cannot reject H0

Specifically, if we are given the data n = 9, x = 4.3, and s = 1.2, then

4.3−4

1.2/p

9= 0.75 < 1.860. (423)

Thus in this case cannot reject H0 :µ= 4.0 at significance level α= 0.1.In terms of the p-value, let T ∼ t (8). Then

p-value =P(|T | > 0.075) = 2P(T > 0.075) = 2(0.2374) = 0.4748. (424)

Since this p-value is much larger than the significance level α = 0.1, we cannot reject H0 : µ = 4.0 atsignificance level α= 0.1.

N

Exercise 8.1.5 (Excerpted and modified from [HTZ77]). Twenty-four girls in the 9th and 10th grades wereput on an ultra heavy rope-jumping program. Someone thought that such a program would increase

Page 60: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

2. TEST ABOUT TWO MEANS 60

FIGURE 3. Test about mean of tumor growth

their speed in the 40-yard dash. Let W equal the difference in time to run the 40-yard dash—the “before-program time” minus the “after-program time.” Assume that the distribution of W is approximatelyN (µW ,σ2

W ). We shall test the null hypothesis H0 :µW = 0 against the alternative hypothesis H1 :µW > 0.

(i) Show that the test statistic and the critical region that has an α= 0.05 significance level are given by

t = w −0

sW /p

24> t0.05(23) = 1.714, (425)

where w is the observed value of the sample mean for the random variable W .(ii) The following data give the difference in time that it took each girl to run the 40-yard dash, with

positive numbers indicating a faster time after the program:

0.28 0.01 0.13 0.33 −0.03 0.07 −0.18 −0.14 (426)

−0.33 0.01 0.22 0.29 −0.08 0.23 0.08 0.04 (427)

−0.30 −0.08 0.09 0.70 0.33 −0.34 0.50 0.06 (428)

Compute the observed value of the test statistic t . Can we reject the null hypothesis H0 :µW = 0at significance level α= 0.05? Also, compute the p-value.

(iii) Find the smallest value of significance level α under which we can reject the null hypothesis H0 :µW = 0 using the data given in (ii). Does it equal to the p-value found in (ii)? Explain why.

2. Test about two means

In this section, we study how to statistically compare the means of two unknown RVs X and Y . Ourstatistical testing we will develop in this section is based on the confidence interval for difference of twomeans, which are summarized in Exercise 7.6.2.

We start with the ideal situation when both X and Y are normal with known variances.

Example 8.2.1 (Normal with known variances). Let X ∼ N (µX ,σ2X ) and Y ∼ N (µY ,σ2

Y ), where their vari-ances are known. Suppose we want to test the null hypothesis H0 : µX = µY against the alternativehypothesis H1 : µX 6= µY . Suppose we have i.i.d. samples for X1, · · · , Xn and Y1, · · · ,Ym for X and Y ,respectively. Our critical region for rejecting the null hypothesis H0 :µX =µY will be of the form

C = (x1, · · · , xn , y1, · · · , ym) | |x − y | ≥ θ, (429)

for some rejection threshold θ. To compute the corresponding type I error, note that

type I error =P(reject H0 :µX =µY |H0) (430)

=P(|X − Y | ≥ θ |H0). (431)

Page 61: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

2. TEST ABOUT TWO MEANS 61

Recall from Example 7.4.2 that

Z := (X − Y )− (µX −µY )√σ2

Xn + σ2

Ym

∼ N (0,1). (432)

Since µX =µY under H0, we have

type I error =P(|X − X | ≥ θ |H0) =P

|Z | ≥ |θ− (µX −µY )|√σ2

Xn + σ2

Ym

|H0

=P

|Z | ≥ |θ|√σ2

Xn + σ2

Ym

. (433)

Hence it follows that

type I error ≤α ⇐⇒ |θ|√σ2

Xn + σ2

Ym

≥ zα/2. (434)

Hence this gives the following critical region for rejecting the null hypothesis H0 :µX =µY at significancelevel α:

|x − y | ≥ zα/2

√σ2

X

n+ σ2

Y

m. (435)

N

Example 8.2.2 (Known variances, not necessarily normal). Suppose we have two RVs X and Y of interest,and their variances are known as σ2

X and σ2Y , respectively, but not that they follow normal distribution.

Using CLT, all our discussion in the 8.5.4 holds approximately for large sample sizes n and m.Suppose we want to test the null hypothesis H0 :µX =µY against the alternative hypothesis H1 :µX 6=

µY . Suppose we have i.i.d. samples for X1, · · · , Xn and Y1, · · · ,Ym for X and Y , respectively. Our criticalregion for rejecting the null hypothesis H0 :µX =µY will be of the form

C = (x1, · · · , xn , y1, · · · , ym) | |x − y | ≥ θ, (436)

for some rejection threshold θ. By CLT, we have

Z := (X − Y )− (µX −µY )√σ2

Xn + σ2

Ym

⇒ N (0,1) (437)

as both n,m →∞. Hence we can approximately compute the type I error:

type I error =P(reject H0 :µX =µY |H0) (438)

=P(|X − Y | ≥ θ |H0) (439)

=P

|Z | ≥ |θ|√σ2

Xn + σ2

Ym

≈P

|N (0,1)| ≥ |θ|√σ2

Xn + σ2

Ym

(440)

Recall from Example 7.4.2 that

Z := (X − Y )− (µX −µY )√σ2

Xn + σ2

Ym

∼ N (0,1). (441)

Hence it follows that, approximately for large n,m,

type I error ≤α ⇐⇒ |θ|√σ2

Xn + σ2

Ym

≥ zα/2. (442)

Page 62: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

2. TEST ABOUT TWO MEANS 62

Hence this gives the following approximate critical region for rejecting the null hypothesis H0 : µX = µY

at significance level α:

|x − y | ≥ zα/2

√σ2

X

n+ σ2

Y

m. (443)

N

Example 8.2.3 (Normal with unknown, equal variances). Let X ∼ N (µX ,σ2) and Y ∼ N (µY ,σ2), wheretheir variances are known. Suppose we want to test the null hypothesis H0 :µX =µY against the alterna-tive hypothesis H1 : µX 6= µY . Suppose we have i.i.d. samples for X1, · · · , Xn and Y1, · · · ,Ym for X and Y ,respectively. We first recall from Example 7.4.3 that

T := (X − Y )− (µX −µY )

Sp

√1n + 1

m

∼ t (n +m −2), (444)

where S2p denotes the pulled estimator

S2p = (n −1)S2

X + (m −1)S2Y

n +m −2. (445)

Hence in this case, we will form our critical region for rejecting the null hypothesis H0 :µX =µY as

C =

(x1, · · · , xn , y1, · · · , ym) :

∣∣∣∣∣ x − y

spp

(1/n)+ (1/m)

∣∣∣∣∣≥ θ

, (446)

for some rejection threshold θ. To compute the corresponding type I error,Hence we can (exactly) compute the type I error in this case:

type I error =P(reject H0 :µX =µY |H0) (447)

=P(∣∣∣∣∣ X − Y

Spp

(1/n)+ (1/m)

∣∣∣∣∣≥ θ |H0

)=P (|T | ≥ θ |H0) =P (|t (n +m −2)| ≥ θ) . (448)

Hence it follows that

type I error ≤α ⇐⇒ θ ≥ tα/2(n +m −2). (449)

Hence this gives the following critical region for rejecting the null hypothesis H0 :µX =µY at significancelevel α:

|x − y |sp

√1n + 1

m

≥ tα/2(n +m −2). (450)

N

Exercise 8.2.4 (Paired test, known variance of the difference). Let X , Y be RVs. Denote E[X ] = µx andE[Y ] = µY . Suppose we want to test the null hypothesis H0 : µX = µY against the alternative hypothesisH1 : µX 6= µY . Suppose we have i.i.d. pairs (X1,Y1), · · · , (Xn ,Yn) from the joint distribution of (X ,Y ).Further assume that we know the value of σ2 = Var(X −Y ).

(i) Noting that X1 −Y1, · · · , Xn −Yn are i.i.d. with finite variance σ2, show using CLT that, as n →∞,

Z := (X − Y )− (µX −µY )

σ/p

n=⇒ N (0,1). (451)

(ii) Our critical region for rejecting the null hypothesis H0 :µX =µY will be of the form

C = (x1, · · · , xn , y1, · · · , ym) | |x − y | ≥ θ, (452)

for some rejection threshold θ > 0. Show that

type I error =P(reject H0 :µX =µY |H0) (453)

Page 63: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

2. TEST ABOUT TWO MEANS 63

=P(|X − Y | ≥ θ |H0) (454)

=P(|Z | ≥ θ

σ/p

n; H0

)≈P

(|N (0,1)| ≥ θ

σ/p

n

). (455)

(iii) Conclude that (approximately for large n)

type I error ≤α ⇐⇒ θ

σ/p

n≥ zα/2 ⇐⇒ θ ≥ zα/2

σpn

. (456)

(iv) Show that the critical region of rejecting H0 :µX =µY in favor of H1 :µY 6=µY with significance levelα is given by

|x − y | ≥ zα/2σpn

. (457)

Exercise 8.2.5 (Paired test, known normality of the difference). Let X , Y be RVs. Denote E[X ] = µx andE[Y ] = µY . Suppose we want to test the null hypothesis H0 : µX = µY against the alternative hypothesisH1 : µX 6= µY . Suppose we have i.i.d. pairs (X1,Y1), · · · , (Xn ,Yn) from the joint distribution of (X ,Y ).Further assume that we know the X −Y follows a normal distribution.

(i) Noting that X1 −Y1, · · · , Xn −Yn are i.i.d. with normal distribution, show that (exactly)

T := (X − Y )− (µX −µY )

S/p

n∼ t (n −1), (458)

where S2 = 1n−1

∑ni=1((Xi −Yi )− (X − Y ))2 denotes the sample variance.

(ii) Our critical region for rejecting the null hypothesis H0 :µX =µY will be of the form

C =

(x1, · · · , xn , y1, · · · , ym)∣∣∣ |x − y |

s/p

n≥ θ

, (459)

for some rejection threshold θ > 0. Show that

type I error =P(reject H0 :µX =µY ; H0) (460)

=P( |X − Y |

S/p

n≥ θ ; H0

)=P (|t (n −1)| ≥ θ) . (461)

(iii) Conclude that (exactly for all n)

type I error ≤α ⇐⇒ θ ≥ tα/2(n −1). (462)

(iv) Show that the critical region of rejecting H0 :µX =µY in favor of H1 :µX 6=µY with significance levelα is given by

|x − y |s/p

n≥ tα/2(n −1). (463)

Example 8.2.6 (Excerpted from [HTZ77]). A product is packaged by a machine with 24 filler heads num-bered 1 to 24, with the odd-numbered heads on one side of the machine and the even on the otherside. Let X and Y equal the fill weights in grams when a package is filled by an odd-numbered headand an even-numbered head, respectively. Assume that the distributions of X and Y are N (µX ,σ2) andN (µY ,σ2), respectively, and that X and Y are independent. We would like to test the null hypothesisH0 : µX = µY against the alternative hypothesis H1 : µx 6= µY . To perform the test, after the machine hasbeen setup and is running, we shall select one package at random from each filler head and weigh it. Thetest statistic is that given by Equation 8.2-1 with n = m = 12. At an α= 0.10 significance level, the criticalregion is given by

|x − x|xp

√1

12 + 112

≥ t0.05(22) = 1.717. (464)

Page 64: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

3. TESTS ABOUT PROPORTIONS 64

Suppose we have the following 12 observations each for X and Y :

X : 1071 1076 1070 1083 1082 1067 1078 1080 1075 1084 1075 1080 (465)

Y : 1074 1069 1075 1067 1068 1079 1082 1064 1070 1073 1072 1075 (466)

We can compute x = 1076.75, s2X = 29.30, y = 1072.33, and s2

Y = 26.24. Hence the observed value of thetest statistic is

t = |1076.75−1072.33|√(11)(29.30)+(11)(26.24)

221

12 + 112

= 2.05. (467)

Since t = 2.05 > 1.717 = t0.05(22), we reject the null hypothesis H0 : µX = µY at significance level α= 0.1.N

Exercise 8.2.7 (Excerpted from [HTZ77]). Plants convert carbon dioxide (CO2) in the atmo-sphere, alongwith water and energy from sunlight, in to the energy they need for growth and reproduction. Experi-ments were performed under normal atmospheric air conditions and in air with enriched CO2 concen-trations to determine the effect on plant growth. The plants were given the same amount of water andlight for a four-week period. The following table gives the plant growths in grams:

Normal Air: 4.67 4.21 2.18 3.91 4.09 5.24 2.94 4.71 4.04 5.79 3.80 4.38 (468)

Enriched Air: 5.04 4.52 6.18 7.01 4.36 1.81 6.22 5.70 (469)

On the basis of these data, determine whether CO2-enriched atmosphere increases plant growth.

3. Tests about proportions

Suppose we have a Bernoulli RV X of unknown success probability p. The goal of this section isto develop statistical testing procedures that infers the value of p from a given set of i.i.d. samplesX1, X2, · · · , Xn .

Example 8.3.1. Suppose X ∼ Bernoulli(p) with unknown p. Say we are testing the simple null hypothesisH0 : p = p0 against the composite alternative hypothesis H1 : p > p0. Let X1, · · · , Xn be i.i.d. samples ofX and let pn = n−1(X1 +·· ·+Xn) denote the sample frequency. We know pn is an unbiased estimator forp (e.g., E[pn] = p) and strong law of large numbers tells us pn → p almost surely as n →∞. But we onlyhave a finite number of samples (say, n = 30) and we need to make a decision whether p = p0 or p > p0.

As before, our decision rule will be based on whether the observed sample frequency pn is biasedtoward H1 or not. In this example, the critical interval will be of the form

C = (x1, · · · , xn) | pn > θ, (470)

where θ is a certain threshold that we will set. Note that we can tolerate type I error up to probabilityα (say 0.05) (also called the significance level). We can make type I error arbitrary small by making therejection threshold θ large, but then the resulting test would not be very useful. So we would like to findsmallest value of θ which guarantees type I error to occur with probability at most α.

The key observation here is that

npn = (X1 +·· ·+Xn) ∼ Binomial(n, p). (471)

So the type I error will be

type I error ≤α=P(pn > θ; H0) =P(npn > nθ; H0

)=P(Binomial(n, p0) > nθ). (472)

If we use the binomial table, the above equation will give us the exact type I error. But when n is suffi-ciently large, we can proceed by using CLT:

P(pn > θ; H0) =P(pn > θ; H0

)=P(pn −p0√

p0(1−p0)/n> θ−p0√

p0(1−p0)/n; H0

)(473)

Page 65: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

3. TESTS ABOUT PROPORTIONS 65

=P(

npn −np0√np0(1−p0)

> nθ−np0√np0(1−p0)

; H0

)(474)

≈P(

N (0,1) > nθ−np0√np0(1−p0)

). (475)

Thus we have (approximately for large n)

type I error ≤α ⇐⇒ nθ−np0√np0(1−p0)

≥ zα ⇐⇒ θ ≥ p0 + zα√

p0(1−p0)/n. (476)

Hence our optimal test for significance level α would be

Reject H0 if pn > p0 + zα√

p0(1−p0)/n. (477)

N

A similar argument as above gives hypothesis testing about the null H0 : p = p0 against various al-ternatives H1 : p > p0, H1 : p < p0, and H1 : p 6= p0 population proportion against various alternativehypothesis.

Algorithm 3 Hypothesis testing for proportion

1: Input:2: Null hypothesis H0 : X ∼ Bernoulli(p0)3: Alternative hypothesis H1 : one of p > p0 or p < p0 or p 6= p0, where p = E[X ]4: Observed sample: x1, x2, · · · , xn .5: Sample mean x = n−1(x1 +·· ·+xn) and sample variance s2 = 1

n−1

∑ni=1(xi − x)2.

6: significance level: α ∈ (0,1)7: Test statistic: pn = n−1(X1 +·· ·+Xn)8: Do: Compute the critical region associated with x:

pn > p0 + zα√

p0(1−p0)/n if H1 : p > p0

pn < p0 − zα√

p0(1−p0)/n if H1 : p < p0

|pn −p0| > zα/2√

p0(1−p0)/n if H1 : p 6= p0

(478)

9: Do:10: If Inside critical region, Reject H0

11: Else Test inconclusive and cannot reject H0

Example 8.3.2 (Excerpted from [HTZ77]). It was claimed that many commercially manufactured diceare not fair because the “spots” are really indentations, so that, for example, the 6-side is lighter thanthe1-side. Let p equal the probability of rolling a 6 with one of these dice. To test H0 : p = 1/6 against thealternative hypothesis H1 : p > 1/6, several such dice will be rolled to yield a total of n = 8000 observa-tions. Let Y equal the number of times that 6 resulted in the 8000 trials. The test statistic is

Z = Y −np√np(1−p)

= (Y /n)−p√p(1−p)/n

= (Y /8000)− (1−6)p(1/6)(5/6)/8000

. (479)

If we use a significance level of α= 0.05, the critical region is

z > z0.05 = 1.645. (480)

Suppose the results of the experiment yielded y = 1389 success. Then the observed value of the teststatistic is

z = (1389/8000)− (1−6)p(1/6)(5/6)/8000

= 1.67 > 1.645. (481)

Page 66: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

4. THE WILCOXON TESTS 66

Hence the null hypothesis is rejected at significance level α= 0.05, and the experimental results indicatethat these dice favor a 6 more than a fair die would. N

4. The Wilcoxon Tests

When we cannot verify normality of the unknown random variable subject to our test, and if we donot know its variance, then our hypothesis testing on its mean, which is based on either normality (sothat we can compute the distribution of sample mean) or known variance (so that we can apply CLT),do not apply. In this case we may use ‘non-parameteric’ testing methods, which are essentially testingpercentiles rather than mean or variance. The simplest of such is called the sign test, which is explainedbelow.

Example 8.4.1 (Sign test). Let X be a continuous RV of unknown distribution with median m = m(X ).We would like to test the null hypothesis H0 : m = m0 against the alternative hypothesis H1 : m > m0. LetX1, · · · , Xn be i.i.d. samples of X . We will use the following statistic

W =n∑

i=11(Xi > m0) = (#Xi ’s s.t. Xi > m0). (482)

If H1 is true, then it is more likely to exceed m0 than even. Hence it is reasonable to use the critical regionof the form

C = (x1, · · · , xn) |w > c (483)

to reject H0 in favor of H1. Under the null hypothesis H0 : m = m0, each Xi exceeds its (hypothetical)median m0 with probability 1/2. Hence W ; H0 follows Binomial(n,1/2) distribution. This gives

type I error =P(reject H0 : m = m0 ; H0) (484)

=P(W > c ; H0) (485)

=P (Binomial(n,1/2) > c) . (486)

While one can use a binomial table to compute the above probability for any given c, it is more conve-nient to use normal approximation by CLT when n is large. This gives

type I error =P (Binomial(n,1/2) > c) (487)

≈P(

N (0,1) > c −n/2pn/4

). (488)

Thus we have (approximately for large n)

type I error ≤α ⇐⇒ c −n/2pn/4

≥ zα ⇐⇒ c ≥ n

2+ zα

pn/4. (489)

Hence our optimal test for significance level α would be

Reject H0 : m = m0 if w > n

2+ zα

pn/4. (490)

For instance, suppose a random sample of size 20 yielded the following data

6.5 5.7 6.9 5.3 4.1 9.8 1.7 7.0 2.1 19.0 (491)

18.9 16.9 10.4 44.1 2.9 2.4 4.8 18.9 4.6 7.9 (492)

Consider H0 : m = 6.2 and H1 : m > 6.2. Then the (approximate) critical region for rejecting H0 in favor ofH1 with significance level α= 0.05 is given by

w > 10+ z0.05p

20/4 = 10+1.645p

5 = 13.67. (493)

Since there are only w = 9 samples that exceed the claimed median m = 6.2, we cannot reject H0 in favorof H1 at significance level 5%. N

Page 67: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

4. THE WILCOXON TESTS 67

One of the objection to the above sign test is that it does not leverage the information of magnitudeof sample data, but only the sign of Xi −m0. Below we present the Wilcoxon signed rank test, whichcombines the sign test with extra information about the magnitudes of the differences Xi −m0 betweendata and claimed median.

Example 8.4.2 (Wilkcoxon’s test). Let X be a continuous RV of unknown distribution with median m =m(X ). We would like to test the null hypothesis H0 : m = m0 against the alternative hypothesis H1 : m >m0. Let X1, · · · , Xn be i.i.d. samples of X . We will use the following signed rank statistic

W =n∑

i=1sgn(Xi −m0) ·Rank(|Xi −m0|) , (494)

where sgn(a) = 1 if a > 0 and −1 if a < 0, and Rank(|Xi −m0|) denotes the rank (from the smallest) of themagnitude |Xi −m0| among the n values |X1 −m0|, |X2 −m0|, · · · , |Xn −m0|.

For instance, suppose we have the following n = 10 sample values of X :

5.0 3.9 5.2 5.5 2.8 6.1 6.4 2.6 2.0 4.3 (495)

Suppose we want to test H0 : m = 3.7 against H1 : H1 > 3.7. Then

xi −m0 : 1.3 0.2 1.5 1.8 −0.9 2.4 2.7 −1.1 −1.7 0.6|xi −m0| : 1.3 0.2 1.5 1.8 0.9 2.4 2.7 1.1 1.7 0.6

Ranks : 5 1 6 8 3 9 10 4 7 2Signed Ranks : 5 1 6 8 −3 9 10 −4 −7 2

(496)

Then the observed value w of the signed rank statistic W is

w = 5+1+6+8+ (−3)+9+10+ (−4)+ (−7)+2 = 27. (497)

Notice that large values of W supports H1 : m > m0, since if the actual median m is larger thanclaimed median m0, then each Xi −m0 is more likely to be positive than negative. Hence, as in the caseof the sign test, we use the critical region of the form

C = (x1, · · · , xn) |w > c (498)

to reject H0 in favor of H1.In order to compute type I error of this test, we need to know the distribution of the signed rank

statistic under the null hypothesis H0 : m = m0. Observe that, under H0, each Xi −m0 is either positive ornegative with equal probability. Also, since we are adding up the signed ranks and order of summationdoes not matter, we can rewrite

W =n∑

k=1ξk k, (499)

where ξ1, · · · ,ξk are i.i.d. RVs with P(ξ1 = 1) = P(ξ1 = −1) = 1/2. It is easy to see that E[W ] = 0 andVar(W ) =∑n

k=1 k2 = n(n +1)(2n +1)/6. So if we have a normal approximation of W , we should have

Wn −E[Wn]pWn

= Wnpn(n +1)(2n +1)/6

=⇒ N (0,1) (500)

as n →∞. However, note that the increments ξk k are independent, not but identically distributed. Sostandard CLT does not apply here. Nonetheless, a more general version of CLT (called Lyapunov’s CLT)can be used to obtain a normal approximation of W (see Exercise 8.4.3) as above.

Now we can proceed in a standard way:

type I error =P(reject H0 : m = m0 ; H0) (501)

=P(W > c ; H0) (502)

=P(

Wpn(n +1)(2n +1)/6

> cpn(n +1)(2n +1)/6

; H0

)(503)

Page 68: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

4. THE WILCOXON TESTS 68

≈P(

N (0,1) > cpn(n +1)(2n +1)/6

). (504)

Thus we have (approximately for large n)

type I error ≤α ⇐⇒ cpn(n +1)(2n +1)/6

≥ zα⇐⇒ c ≥ zα√

n(n +1)(2n +1)/6. (505)

Hence our optimal test for significance level α would be

Reject H0 : m = m0 if w ≥ zα√

n(n +1)(2n +1)/6. (506)

For the example we considered above, we had n = 10 and w = 25. Noting that z0.1 = 1.282, the criticalregion for rejecting H0 : m = 3.7 in favor of H1 : m > 3.7 at significance level α= 0.1 is given by

w > (1.282)√

10(11)(21)/6 = 25.147 (507)

As our observed valued of signed rank statistic is w = 27, we reject H0 : m = 3.7 in favor of H1 : m > 3.7 atsignificance level α= 0.1. N

Exercise 8.4.3 (Normal approximation of signed rank statistic). Let V1, · · · ,Vn be independent RVs wherefor each 1 ≤ k ≤ n,

P(Vk = k) =P(Vk =−k) = 1/2. (508)

Define a RV Wn =∑nk=1 Vk .

(i) Show that E[Wn] = 0 and Var(Wn) =∑nk=1 k2 = n(n +1)(2n +1)/6.

(ii) Show that for any δ> 0,∑ni=1E[|Xk −E[Xk ]|2+δ]√∑n

k=1 Var(Xk )2+δ =

∑ni=1 k2+δ

pn(n +1)(2n +1)/6

2+δ ≤ n3+δ

(n/3)3+3δ/2→ 0 as n →∞. (509)

Hence Wn verifies the following Lyapunov’s condition

limn→∞

∑ni=1E[|Xk −E[Xk ]|2+δ]√∑n

k=1 Var(Xk )2+δ = 0 for some δ> 0. (510)

Then Lyapunov’s central limit theorem tells that as n →∞,

Wn −E[Wn]pVar(Wn)

= Wnpn(n +1)(2n +1)/6

=⇒ N (0,1). (511)

(Note that Vk ’s are independent but not identically distributed, so standard CLT does not applyto Wn .)

Exercise 8.4.4. The weights of the contents of n1 = 8 and n2 = 8 tins of cinnamon packaged by companiesA and B , respectively, selected at random, yielded the following observations of X and Y :

x : 117.1 121.3 127.8 121.9 117.4 124.5 119.5 115.1y : 123.5 125.3 126.5 127.9 122.1 125.6 129.8 117.2

(512)

Let T = X −Y . As a means of comparing X and Y , we want to test H0 : m(T ) = 0 against H1 : m(T ) < 0.

(i) Compute the sample values of differences x − y . This gives a n = 8 sample for T .(ii) Compute the signed rank statstic w for the sample of T in (i) for the null hypothesis H0 : m(T ) = 0.(iii) Can we reject H0 : m(T ) = 0 in favor of H1 : m(T ) < 0 at significance level α = 0.05? What does it

imply on the original RVs X and Y ?

Page 69: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

5. POWER OF A STATISTICAL TEST 69

5. Power of a statistical test

So far, we have derived statistical test based on the following procedure:

(i) Find a good test statistic (e.g., unbiased and sufficient). In most cases sample mean or sample fre-quency.

(ii) Derive the distribution of the test statistic under the null hypothesis (using normality of populationor CLT for known variance).

(iii) By considering the null H0 and the alternative H1 hypotheses, form a critical region.(iv) Compute type I error, and find the largest critical region such that type I error is at most a given

significance level.

In this section, we will also consider type II error, which is the probability of not being able to rejectH0 when H1 is true.

Example 8.5.1 (Power function of a test). Let X ∼ Bernoulli(p) for unknown p. We have the null hypoth-esis H0 : p = p0 and the composite alternative H1 : p < p0. Let X1, · · · , Xn be i.i.d. samples of X . We willuse the test statistic Y = ∑n

i=1 Xi , which has Binomial(n, p) distribution. Since E[Y ] = np, if Y is muchless than np0, this would support H1. Hence our critical region will be of the form

C = (x1, · · · , xn) | y < c. (513)

From the argument in Example 8.3.1, we know that

Type I error ≤α ⇐⇒ c ≤ np0 − zα√

np0(1−p0). (514)

So for a given significance level α, we will choose our critical region by setting

c = cα := np0 − zα√

p0(1−p0)n. (515)

Now, what about the type II error? Since H1 : p < p0 is a composite hypothesis, we cannot computethe distribution of the test statistic Y under H1. However, we can compute the probability of not beingable to reject H0 : p = p0 according to the critical region we found above assuming a specific value ofp = p1 < p0. Namely, note that

P(not reject H0 |p = p1) =P(Y ≥ cα |p = p1) (516)

= ∑cα≤k≤n

(n

k

)pk

1 (1−p1)n−k . (517)

Notice that the last expression is a function in the assumed value p1 of the true success probability p. Wemay as well consider the complement of the above probability:

K (p1) :=P(reject H0 |p = p1) =P(Y < cα |p = p1) (518)

= ∑0≤k≤cα

(n

k

)pk

1 (1−p1)n−k . (519)

The above function in the assumped value p1 of p is called the power function of the test. Each valueK (p1) is called the power of the test at p = p1.

For instance, suppose n = 20, and we have the following test

Reject H0 : p = 1/2 in favor of H1 : p < 1/2 ⇐⇒ y = x1 +·· ·+xn ≤ 6. (520)

Then the power function of this test is given by

K (p) =P(Binomial(20, p) ≤ 6

)= ∑0≤k≤6

(20

6

)pk (1−p)20−k (521)

≈P(

N (0,1) ≤ 6−20p√20p(1−p)

), (522)

Page 70: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

5. POWER OF A STATISTICAL TEST 70

where the last expression comes from the normal approximation. Either using the exact binomal prob-ability or the approximate normal probability, This power function entails type I error, and also type II

FIGURE 4. Plot of the power function K (p) =P(Binomial(20, p) ≤ 6; p)

error for simple alternatives. Namely,

Type I error =P(reject H0 |p = 1/2) = K (1/2) = ∑0≤k≤6

(20

6

)(1/2)k (1/2)20−k = 0.0577. (523)

Also, if we were to use the simple alternative H1 : p = 1/10, then

Type II error =P(not reject H0 |p = 1/10) = 1−P(reject H0 |p = 1/10) (524)

= 1−K (1/10) = 1− ∑0≤k≤6

(20

6

)(1/10)k (9/10)20−k = 1−0.9976 = 0.0024. (525)

N

Example 8.5.2 (Determining critical region and sample size from power function). Let X ∼ Bernoulli(p)for unknown p. We have the null hypothesis H0 : p = 2/3 and the composite alternative H1 : p = 1/3.Let X1, · · · , Xn be i.i.d. samples of X . We will use the test statistic Y = ∑n

i=1 Xi , which has Binomial(n, p)distribution. Since E[Y ] = np, if Y is much less than 2n/3, this would support H1. Hence our criticalregion will be of the form

C = (x1, · · · , xn) | y < c. (526)

There are two parameters in this test; the rejection threshold c and the sample size n. We will choosethese parameters such that

Type I error ≤ 0.05 and Type II error ≤ 0.1. (527)

From the discussion in Example 8.5.1, the power function of the above test is given by

K (p) =P(reject H0 ; p) (528)

=P(Y < c ; p) (529)

=P(Binomial(n, p) ≤ c

)(530)

≈P(

N (0,1) ≤ c −np√np(1−p)

). (531)

Page 71: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

5. POWER OF A STATISTICAL TEST 71

From the error conditions, we have0.05 ≥ Type I error = K (2/3) ≈P(N (0,1) ≤ c−2n/3p

2n/9

)0.1 ≥ Type II error = 1−K (1/3) ≈P

(N (0,1) > c−n/3p

2n/9

) (532)

Solving these, we get

c −2n/3p2n/9

≤−z0.05 andc −n/3p

2n/9≥ z0.1. (533)

An optimal test is achieved if we choose c and n so that the above inequalites are equalities. This gives

c ≤ (2n/3)− z0.05p

2n/9 (534)

c ≥ (n/3)+ z0.1p

2n/9, (535)

so n has to satisfy

(n/3)+ z0.1p

2n/9 ≤ (2n/3)− z0.05p

2n/9. (536)

Simplifying, we getn

3≥p

2n/9(z0.05 + z0.1), (537)

so squaring both sides gives

n ≥ 2(z0.05 + z0.1)2 = 2(1.645+1.282)2 = 17.606. (538)

Thus the minimial required sample size is n = 18. On the other hand, for c this gives

(18/3)+1.282p

2 ·18/9 ≤ (2 ·18/3)−1.645p

2 ·18/9, (539)

so

8.564 ≤ c ≤ 8.71. (540)

We may choose any value of c within this range, say, c = 8.6. Then the test

Reject H0 : p = 2/3 in favor of H1 : p = 1/3 ⇐⇒ y = x1 +·· ·+x18 ≤ 8.6. (541)

has type I error ≤ 0.05 and type II error ≤ 0.1. N

Exercise 8.5.3 (Paired test, known variance of the difference). Let X , Y be RVs. Denote E[X ] = µx andE[Y ] =µY . Suppose we want to test the null hypothesis H0 :µX −µY = 0 against the alternative hypothesisH1 : µX −µY =−3. Suppose we have i.i.d. pairs (X1,Y1), · · · , (Xn ,Yn) from the joint distribution of (X ,Y ).Further assume that we know the value of σ2 = Var(X −Y ).

(i) Noting that X1 −Y1, · · · , Xn −Yn are i.i.d. with finite variance σ2, show using CLT that, as n →∞,

Z := (X − Y )− (µX −µY )

σ/p

n=⇒ N (0,1). (542)

(ii) Our critical region for rejecting the null hypothesis H0 :µX −µY = 0 in favor of H1 :µX −µY =−3 willbe of the form

C = (x1, · · · , xn , y1, · · · , ym) | x − y ≤ c, (543)

for some rejection threshold c > 0. Show that the power function of this test is given by

K (µD ) =P(reject H0 :µX =µY ; µX −µY =µD ) (544)

=P(X − Y ≤ c ; µX −µY =µD ) (545)

=P(

Z ≤ c −µD

σ/p

n; µX −µY =µD

)≈P

(N (0,1) ≤ c −µD

σ/p

n

). (546)

Page 72: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

6. BEST CRITICAL REGIONS 72

(iii) Conclude that (approximately for large n)

Type I error ≤α ⇐⇒ c

σ/p

n≤−zα, (547)

Type II error ≤β ⇐⇒ c +3

σ/p

n≥ zβ. (548)

(iv) Suppose we want both the type I and II errors to be at most 0.05. Find conditions for c and n thatwould guarantee this.

Exercise 8.5.4 (Paired test, known normality of the difference). Let X , Y be RVs. Denote E[X ] = µx andE[Y ] = µY . Suppose we want to test the null hypothesis H0 : µX = µY against the alternative hypothesisH1 : µX −µY =−3. Suppose we have i.i.d. pairs (X1,Y1), · · · , (Xn ,Yn) from the joint distribution of (X ,Y ).Further assume that we know the X −Y follows a normal distribution.

(i) Noting that X1 −Y1, · · · , Xn −Yn are i.i.d. with normal distribution, show that (exactly)

T := (X − Y )− (µX −µY )

S/p

n∼ t (n −1), (549)

where S2 = 1n−1

∑ni=1((Xi −Yi )− (X − Y ))2 denotes the sample variance.

(ii) Our critical region for rejecting the null hypothesis H0 :µX =µY in favor of H1 :µX −µY =−3 will beof the form

C =

(x1, · · · , xn , y1, · · · , ym)∣∣∣ x − y

s/p

n≤ c

, (550)

for some rejection threshold c < 0. Show that the power function of this test is given by

K (µD ) =P(reject H0 :µX =µY ; µX −µY =µD ) (551)

=P(

X − Y

S/p

n≤ c ; µX −µY =µD

)(552)

=P(

(X − Y )−µD

S/p

n≤ c − µD

S/p

n; µX −µY =µD

)=P

(t (n −1) ≤ c − µD

S/p

n

). (553)

(iii) Conclude that

Type I error ≤α ⇐⇒ c ≤−tα(n −1), (554)

Type II error ≤β ⇐⇒ c ≥ tβ(n −1)− 3

s/p

n. (555)

(iv) Suppose we want both the type I and II errors to be at most 0.05. Find (numerically) conditions forc and n that would guarantee this.

6. Best critical regions

Suppose we want to design a test for the null hypothesis H0 and the alternative hypothesis H1 on anunknown RV of interest X , based on n i.i.d. samples X1, X2, · · · , Xn of X . Our decision rule will be of theform

Reject H0 in favor of H1 if (X1, · · · , Xn) ∈C , (556)

where C is a critical region of our choice. Recall that the type I and type II errors of this test are given by

Type I error =P((X1, · · · , Xn) ∈C ; H0), (557)

Type II error =P((X1, · · · , Xn) ∉C ; H1). (558)

Suppose we want significance level of α. In other words, we need to choose the critical region C suchthat

P((X1, · · · , Xn) ∈C ; H0) =α. (559)

Page 73: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

6. BEST CRITICAL REGIONS 73

However, there could be many choices of such critical region C . Among such, it would desirable tochoose the one that gives the smallest type II error, that is, P((X1, · · · , Xn) ∈C ; H1), which is also called thesize of C , is maximized. This leads us to the following definition of a best critical region.

Definition 8.6.1 (Best critical region). Consider the test of the null hypothesis H0 against the simplealternative hypothesis H1. Let C be a critical region of size α. Then C is a best critical region of size α if,for every other critical region D of size α, we have

P((X1, · · · , Xn) ∈C ; H1) ≥P((X1, · · · , Xn) ∈ D ; H1). (560)

That is, when H1 is true, the probability of rejecting H0 with the use of the critical region C is at least asgreat as the corresponding probability with the use of any other critical region D of size α.

Thus, a best critical region of sizeα is the critical region that has the greatest power among all criticalregions of sizeα. The Neyman-Pearson lemma gives sufficient conditions for a best critical region of sizeα.

Theorem 8.6.2 (Neyman-Pearson Lemma). Let X be a RV of pdf or pmf fX (x;θ), depending on an un-known parameter θ ∈ θ0,θ1. Let X1, · · · , Xn be i.i.d. samples of X . Denote the joint likelihood function

L(θ) = L(x1, · · · , xn ;θ) =n∏

i=1fX (xi ;θ). (561)

Suppose that there exists a positive constant k and a subset C of the sample space such that

(a) P((X1, · · · , Xn) ∈C ; θ = θ0) =α ∈ (0,1),

(b)L(θ0)

L(θ1)≤ k for (x1, · · · , xn) ∈C , and

(c)L(θ0)

L(θ1)≥ k for (x1, · · · , xn) ∉C .

Then C is a best critical region of size α for testing the simple null hypothesis H0 : θ = θ0 against the simplealternative hypothesis H1 : θ = θ1.

PROOF. (Optional*) Denote X = (X1, · · · , Xn). Fix any critical region D of sizeα. We would like to showthat

P(X ∈C ; θ = θ1) =P(X ∈ D ; θ = θ1). (562)

We first write

P(X ∈C ; θ) =P(X ∈C ∩D ; θ)+P(X ∈C ∩Dc ; θ) (563)

P(X ∈ D ; θ) =P(X ∈ D ∩C ; θ)+P(X ∈ D ∩C c ; θ). (564)

Since both critical regions C and D have size α, we have

0 =α−α=P(X ∈C ; θ = θ0)−P(X ∈ D ; θ = θ0) (565)

=P(X ∈C ∩Dc ; θ = θ0)−P(X ∈ D ∩C c ; θ = θ0). (566)

Also, from condition (a), L(θ0) ≤ kL(θ1) for all points in C , and hence in C ∩Dc . So we get

P(X ∈C ∩Dc ; θ = θ0) ≤ kP(X ∈C ∩Dc ; θ = θ1). (567)

On the other hand, from condition (b), L(θ0) ≥ kL(θ1) for all points not in C , and hence in D ∩C c . So

P(X ∈ D ∩C c ; θ = θ0) ≥ kP(X ∈ D ∩C c ; θ = θ1). (568)

Now combining the above equation and inequalities above,

P(X ∈C ; θ = θ1)−P(X ∈ D ; θ = θ1) =P(X ∈C ∩D ; θ = θ1)+P(X ∈C ∩Dc ; θ = θ1) (569)

≥P(X ∈C ∩D ; θ = θ1)+ 1

kP(X ∈C ∩Dc ; θ = θ0). (570)

Page 74: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

6. BEST CRITICAL REGIONS 74

Similarly, we have

P(X ∈ D ; θ = θ1)−P(X ∈ D ; θ = θ1) =P(X ∈ D ∩C ; θ = θ1)+P(X ∈ D ∩C c ; θ = θ1) (571)

≤P(X ∈ D ∩C ; θ = θ1)+ 1

kP(X ∈ D ∩C c ; θ = θ0). (572)

Taking the difference, we obtain

P(X ∈C ; θ = θ1)−P(X ∈ D ; θ = θ1) ≥ 1

k

(P(X ∈C ∩Dc ; θ = θ0)−P(X ∈ D ∩C c ; θ = θ0)

)= 0, (573)

as desired.

Example 8.6.3 (Excerpted from [HTZ77]). Let X1, X2, · · · , Xn be a random sample from the normal dis-tribution N (µ,36). We shall find the best critical region for testing the simple hypothesis H0 : µ = 50against the simple alternative hypothesis H1 :µ= 55. Using the ratio of the likelihood functions, namely,L(50)/L(55), we shall find those points in the sample space for which this ratio is less than or equal tosome positive constant k.

We first compute the ratio of the likelihood functions:

L(50)

L(55)= (2π ·36)−n/2 exp

[− 12·36

∑ni=1(xi −50)2

](2π ·36)−n/2 exp

[− 12·36

∑ni=1(xi −55)2

] (574)

= exp

[− 1

72

n∑i=1

(xi −50)2 − (xi −55)2

](575)

= exp

[− 1

72

n∑i=1

(2xi −105)(5)

]. (576)

Then note thatL(50)

L(55)≤ k ⇐⇒ log

L(50)

L(55)≤ logk ⇐⇒ − 5

72

n∑i=1

(2xi −105) ≤ logk. (577)

The last condition is equivalent ton∑

i=1xi ≥ 105n

2− 36

5logk. (578)

Thus we have shown thatL(50)

L(55)≤ k ⇐⇒ x ≥ c := 105

2− 36

5nlogk. (579)

Hence according to the Neyman-Pearson lemma, a best critical region is given by

C = (x1, · · · , xn) | x ≥ c, (580)

where c is chosen so that the size of the critical region (i.e., type I error) is α.For instance, suppose n = 16 and α= 0.05. Then we will choose c so that

0.05 ≥P(X ≥ c ; µ= 50) (581)

=P(

X −50

6/p

16≥ c −50

6/p

16; µ= 50

)=P

(N (0,1) ≥ c −50

3/2

). (582)

Thus we would want (c −50)/(3/2) ≥ z0.05 = 1.645. This gives c ≥ 50+ (3/2)(1.645) ≈ 52.47. According tothe previous discussion, the following critical region

C = (x1, · · · , xn) | x ≥ 52.47 (583)

is a best critical region for rejecting H0 :µ= 50 against H1 :µ= 55 at significance level α= 0.05. N

Exercise 8.6.4. Let X1, · · · , Xn be i.i.d. samples from Poisson(λ). We want to test the null hypothesisH0 :λ=2 against the alternative hypothesis H1 :λ= 5.

Page 75: MATH 170S Introduction to Probability and Statistics II ......MATH 170S Introduction to Probability and Statistics II Lecture Note Hanbaek Lyu DEPARTMENT OF MATHEMATICS, UNIVERSITY

6. BEST CRITICAL REGIONS 75

(i) Compute the ratio L(2)/L(5) of joint likelihood functions. Show that

L(2)

L(5)≤ k ⇐⇒

(n∑

i=1xi

)log(2/5)+3n ≤ logk. (584)

(ii) Using the Neyman-Pearson lemma, conclude that a best critical region C for testing H0 : p = 2 againstH1 : p = 5 is given by

C =

(x1, · · · , xn)∣∣∣ n∑

i=1xi ≥ c

. (585)

(iii) Let n = 30. Use CLT and (ii) and obtain a best critical region C for testing H0 : p = 2 against H1 : p = 5at significance level α= 0.05.

Definition 8.6.5. A test defined by a critical region C of size α is a uniformly most powerful test if it is amost powerful test against each simple alternative in H1. The critical region C is called a uniformly mostpowerful critical region of size α.

Example 8.6.6 (Excerpted from [HTZ77]). Let X1, X2, · · · , Xn be a random sample from the normal distri-bution N (µ,36). We shall find the best critical region for testing the simple hypothesis H0 :µ= 50 againstthe simple alternative hypothesis H1 :µ=µ1, where µ1 > 50.

We first compute the ratio of the likelihood functions:

L(50)

L(µ1)= (2π ·36)−n/2 exp

[− 12·36

∑ni=1(xi −50)2

](2π ·36)−n/2 exp

[− 12·36

∑ni=1(xi −µ1)2

] (586)

= exp

[− 1

72

n∑i=1

(xi −50)2 − (xi −µ1)2

](587)

= exp

[− 1

72

n∑i=1

(2xi −50−µ1)(50−µ1)

]. (588)

Then note that

L(50)/L(µ1) ≤ k ⇐⇒ log(L(50)/L(µ1)) ≤ log k ⇐⇒ −((µ1 − 50)/72) Σ_{i=1}^n (2x_i − 50 − µ1) ≤ log k.   (589)

Since µ1 > 50, the last condition is equivalent to

Σ_{i=1}^n x_i ≥ (50 + µ1)n/2 − (36/(µ1 − 50)) log k.   (590)

Thus we have shown that

L(50)/L(µ1) ≤ k ⇐⇒ x̄ ≥ c := (50 + µ1)/2 − (36/((µ1 − 50)n)) log k.   (591)

Hence according to the Neyman-Pearson lemma, a best critical region is given by

C = {(x_1, · · · , x_n) | x̄ ≥ c},   (592)

where c is chosen so that the size of the critical region (i.e., the type I error) is α. It is important to notice that the chosen value of c depends only on α, although the value of k depends on µ1.

For instance, suppose n = 16 and α = 0.05. Then we will choose c so that

0.05 ≥ P(X̄ ≥ c ; µ = 50)   (593)

= P((X̄ − 50)/(6/√16) ≥ (c − 50)/(6/√16) ; µ = 50) = P(N(0,1) ≥ (c − 50)/(3/2)).   (594)


Thus we would want (c − 50)/(3/2) ≥ z_{0.05} = 1.645. This gives c ≥ 50 + (3/2)(1.645) ≈ 52.47. According to the previous discussion, the following critical region

C = {(x_1, · · · , x_n) | x̄ ≥ 52.47}   (595)

is a best critical region for rejecting H0 : µ = 50 against H1 : µ = µ1 at significance level α = 0.05. Moreover, this holds for all µ1 > 50. Hence C is a uniformly most powerful critical region of size α. N

Exercise 8.6.7. Let X1, · · · , Xn be i.i.d. samples from Bernoulli(p). We want to find a uniformly most powerful test for the null hypothesis H0 : p = p0 against the one-sided alternative hypothesis H1 : p = p1 > p0.

(i) Compute the ratio L(p0)/L(p1) of the joint likelihood functions, where p1 > p0. Show that

L(p0)/L(p1) ≤ k ⇐⇒ [p0(1 − p1)/(p1(1 − p0))]^{Σ_{i=1}^n x_i} [(1 − p0)/(1 − p1)]^n ≤ k   (596)

⇐⇒ Σ_{i=1}^n x_i ≥ c := (log k − n log[(1 − p0)/(1 − p1)]) / log[p0(1 − p1)/(p1(1 − p0))].   (597)

(Note that log[p0(1 − p1)/(p1(1 − p0))] < 0 since p1 > p0, so dividing by it reverses the inequality.)

(ii) Using the Neyman-Pearson lemma, conclude that a best critical region C for testing H0 : p = p0 against H1 : p = p1 > p0 is given by

C = {(x_1, · · · , x_n) | Σ_{i=1}^n x_i ≥ c}.   (598)

(iii) Let n = 30. Use the CLT and (ii) to obtain a best critical region C for testing H0 : p = 0.2 against H1 : p = 0.5 at significance level α = 0.05. (A numerical sketch of this computation is given below.)
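For part (iii), one possible route is to approximate Σ_{i=1}^n Xi ∼ Binomial(30, 0.2) under H0 by a normal distribution. The following Python sketch (our own code, assuming numpy and scipy; it is not part of the exercise) carries out that computation; using the exact binomial quantile instead of the CLT is a reasonable refinement.

import numpy as np
from scipy.stats import norm, binom

n, p0, alpha = 30, 0.2, 0.05

# Under H0, sum(X_i) ~ Binomial(n, p0); the CLT approximation is N(n*p0, n*p0*(1-p0)).
mean0 = n * p0
sd0 = np.sqrt(n * p0 * (1 - p0))

# Reject H0 when sum(x_i) >= c, with c chosen so the (approximate) type I error is alpha.
c_clt = mean0 + sd0 * norm.ppf(1 - alpha)
print("CLT cutoff   c ~", round(c_clt, 2))

# For comparison, the exact binomial tail gives a (conservative) integer cutoff.
c_exact = int(binom.ppf(1 - alpha, n, p0)) + 1
print("exact cutoff c =", c_exact, " tail prob =", binom.sf(c_exact - 1, n, p0))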

Remark 8.6.8. If a sufficient statistic Y = u(X1, · · · , Xn) exists for an unknown parameter θ, then by the factorization theorem (Thm 5.2.1), we have

L(θ0)/L(θ1) = [φ(u(x_1, · · · , x_n); θ0) g(x_1, · · · , x_n)] / [φ(u(x_1, · · · , x_n); θ1) g(x_1, · · · , x_n)]   (599)

= φ(u(x_1, · · · , x_n); θ0) / φ(u(x_1, · · · , x_n); θ1).   (600)

Thus L(θ0)/L(θ1) ≤ k provides a critical region that is a function of the observations x_1, x_2, · · · , x_n only through the observed value of the sufficient statistic y = u(x_1, x_2, · · · , x_n). Hence, best critical regions and uniformly most powerful critical regions (at least the ones given by the Neyman-Pearson lemma) are based upon sufficient statistics when they exist.

7. Likelihood ratio tests

In this section, we develop a general framework for deriving statistical tests for composite hypotheses (both null and alternative).

Consider an unknown RV X with distribution fX(x; θ) that has unknown parameter(s). Let Ω denote the space of all possible values of θ (e.g., the set of all weights in a deep neural network). Suppose we are given a sample (x_1, x_2, · · · , x_n) of X. For each subset Ω′ ⊆ Ω of the parameter space, we define its maximum likelihood as

L(Ω′) = L(x_1, · · · , x_n ; Ω′) := sup_{θ′∈Ω′} L(x_1, · · · , x_n ; θ = θ′).   (601)

Namely, this is the largest likelihood of the sample (x_1, · · · , x_n) assuming the parameters can take values only inside the subset Ω′. The key observation here is that

L(x_1, · · · , x_n ; Ω′) is large ⇐⇒ Ω′ can explain the given sample well.   (602)


We now introduce the likelihood ratio test. We take two disjoint subsets Ω0 and Ω1 of the parameter space Ω. Our null and alternative hypotheses are then formulated as

H0 : θ ∈ Ω0,  H1 : θ ∈ Ω1.   (603)

We define the likelihood ratio associated with these hypotheses and the sample (x_1, · · · , x_n) by

λ = L(Ω0)/L(Ω1) = L(x_1, · · · , x_n ; Ω0) / L(x_1, · · · , x_n ; Ω1).   (604)

The critical region for the likelihood ratio test is given by

Reject H0 : θ ∈ Ω0 in favor of H1 : θ ∈ Ω1 ⇐⇒ λ = L(Ω0)/L(Ω1) ≤ k,   (605)

where k is some constant chosen so that the associated type I error is at most a desired significance level. The rationale behind this likelihood ratio test is the following. If Ω0 is a much less likely set of parameter values than Ω1 (i.e., it has weaker power to explain the given sample), then we should have L(Ω0) ≪ L(Ω1), or, equivalently, λ ≪ 1. This is why we form a critical region of the form λ ≤ k.

In this section, we will mostly consider the case of ‘complementary hypotheses’, Ω1 = Ω \ Ω0 (for instance, µ = 0 vs. µ ≠ 0). In this case, the critical region for the likelihood ratio test simplifies to

Reject H0 : θ ∈ Ω0 in favor of H1 : θ ∉ Ω0 ⇐⇒ λ = L(Ω0)/L(Ω) ≤ k.   (606)

See Exercise 8.7.1 for a justification.

Exercise 8.7.1 (Likelihood ratio test for complementary hypotheses). Fix Ω0 ⊆ Ω and let Ω1 = Ω \ Ω0. Fix a constant 0 < k < 1. Show the following equivalences:

L(Ω0)/L(Ω1) ≤ k ⇐⇒ L(Ω0) ≤ k L(Ω \ Ω0)   (607)

⇐⇒ L(Ω0) ≤ k L(θ′) for some θ′ ∉ Ω0 ⇐⇒ L(Ω0) ≤ k L(Ω) ⇐⇒ L(Ω0)/L(Ω) ≤ k.   (608)

Example 8.7.2 (Excerpted from [HTZ77]). Let X1, X2, · · · , Xn be a random sample from the normal distribution N(µ, 5). We shall use the likelihood ratio test to find a critical region for testing the simple hypothesis H0 : µ = 162 against the composite alternative hypothesis H1 : µ ≠ 162. Here we have a single parameter µ, so our parameter space is Ω = R. To compute the maximum likelihood function of the null hypothesis, note that

L(162) = (2π · 5)^{−n/2} exp[−(1/(2·5)) Σ_{i=1}^n (x_i − 162)²].   (609)

On the other hand, the full maximum likelihood is given by

L(Ω) = sup_{θ∈R} (2π · 5)^{−n/2} exp[−(1/(2·5)) Σ_{i=1}^n (x_i − θ)²].   (610)

By a computation we made in Example 3.1.6, we know that the full maximum likelihood in this Gaussian case is achieved when θ = x̄ = n^{−1}(x_1 + ··· + x_n). Hence

L(Ω) = L(x̄) = (2π · 5)^{−n/2} exp[−(1/(2·5)) Σ_{i=1}^n (x_i − x̄)²].   (611)

Thus the likelihood ratio is given by

λ = L(162)/L(x̄) = [(2π · 5)^{−n/2} exp(−(1/(2·5)) Σ_{i=1}^n (x_i − 162)²)] / [(2π · 5)^{−n/2} exp(−(1/(2·5)) Σ_{i=1}^n (x_i − x̄)²)]   (612)

= exp[−(1/10) Σ_{i=1}^n ((x_i − 162)² − (x_i − x̄)²)]   (613)

= exp[−(1/10) Σ_{i=1}^n (162 − x̄)²] = exp[−(n/10)(162 − x̄)²],   (614)

where we have used the ‘Pythagorean theorem’

Σ_{i=1}^n (x_i − a)² = Σ_{i=1}^n (x_i − x̄)² + Σ_{i=1}^n (x̄ − a)² = (Σ_{i=1}^n (x_i − x̄)²) + n(a − x̄)²   (615)

with a = 162.

On the one hand, a value of x̄ close to 162 would tend to support H0, and in that case λ is close to 1. On the other hand, an x̄ that differs from 162 by too much would tend to support H1. See Figure 5 for an illustration.

FIGURE 5. The likelihood ratio test for testing H0 : µ = 162 against H1 : µ ≠ 162.

Now we compute the critical region, which is given by

λ = exp[−(n/10)(162 − x̄)²] ≤ k ⇐⇒ −(n/10)(162 − x̄)² ≤ log k   (616)

⇐⇒ |162 − x̄| ≥ c := √(−(10/n) log k).   (617)

In order to guarantee significance level α, we compute the Type I error of this test, using the fact that

X̄ ∼ N(µ, 5/n), which under H0 is N(162, 5/n),   (618)

noting that the i.i.d. samples X1, · · · , Xn are drawn from a normal distribution N(µ, 5) and H0 assumes µ = 162. This gives

Type I error = P(reject H0 ; H0)   (619)

= P(|X̄ − 162| ≥ c)   (620)

= P(|X̄ − 162|/√(5/n) ≥ c/√(5/n)) = P(|N(0,1)| ≥ c/√(5/n)).   (621)

Thus we have

Type I error ≤ α ⇐⇒ c/√(5/n) ≥ z_{α/2}.   (622)


In summary, the critical region of the likelihood ratio test for testing H0 : µ = 162 against H1 : µ ≠ 162 at significance level α is given by

C = {(x_1, · · · , x_n) : |x̄ − 162| ≥ √(5/n) z_{α/2}}.   (623)

This is precisely the z-test we had before in Algorithm 2. N
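As a quick illustration, the rejection rule (623) can be applied directly in code. The sketch below (our own code, assuming numpy and scipy; the data are simulated, not from the text) computes the cutoff √(5/n)·z_{α/2} and applies the test to one simulated sample.

import numpy as np
from scipy.stats import norm

mu0, var, n, alpha = 162.0, 5.0, 25, 0.05

# Cutoff from (623): reject H0 when |xbar - mu0| >= sqrt(var/n) * z_{alpha/2}.
cutoff = np.sqrt(var / n) * norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(1)
x = rng.normal(loc=mu0, scale=np.sqrt(var), size=n)  # simulate one sample under H0
xbar = x.mean()
print(f"xbar = {xbar:.3f}, cutoff = {cutoff:.3f}, reject H0: {abs(xbar - mu0) >= cutoff}")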

Example 8.7.3 (Testing mean for unknown variance). Let X1, X2, · · · , Xn be an i.i.d. sample from the normal distribution N(µ, σ²), where µ and σ² are both unknown. Here we have two parameters µ and σ², so our parameter space is Ω = R × [0,∞). We shall use the likelihood ratio test to find a critical region for testing the null hypothesis H0 : µ = µ0 against the alternative hypothesis H1 : µ ≠ µ0. We set

Ω0 = {(µ, σ²) ∈ Ω | µ = µ0},  Ω1 = {(µ, σ²) ∈ Ω | µ ≠ µ0} = Ω \ Ω0.   (624)

Note that because σ² is a variable (unknown), both of the hypotheses are composite. To compute the maximum likelihood function of the null hypothesis, note that

L(Ω0) = sup_{(µ,σ²)∈Ω0} (2π · σ²)^{−n/2} exp[−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²]   (625)

= (2π · n^{−1} Σ_{i=1}^n (x_i − µ0)²)^{−n/2} exp[−(1/(2 · n^{−1} Σ_{i=1}^n (x_i − µ0)²)) Σ_{i=1}^n (x_i − µ0)²]   (626)

= (2π · n^{−1} Σ_{i=1}^n (x_i − µ0)²)^{−n/2} exp[−n/2],   (627)

where we have used the fact that the MLE of σ² for the normal density N(µ0, σ²) is σ̂² = n^{−1} Σ_{i=1}^n (x_i − µ0)² (check this by differentiating L(Ω0) with respect to σ² and finding the value of σ² that makes the derivative zero). On the other hand, the full maximum likelihood is given by

L(Ω) = sup_{(µ,σ²)∈Ω} (2π · σ²)^{−n/2} exp[−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²]   (628)

= (2π · n^{−1} Σ_{i=1}^n (x_i − x̄)²)^{−n/2} exp[−(1/(2 · n^{−1} Σ_{i=1}^n (x_i − x̄)²)) Σ_{i=1}^n (x_i − x̄)²]   (629)

= (2π · n^{−1} Σ_{i=1}^n (x_i − x̄)²)^{−n/2} exp[−n/2],   (630)

where we have used the fact that the unconstrained MLEs of µ and σ² for the normal density N(µ, σ²) are given by µ̂ = x̄ and σ̂² = n^{−1} Σ_{i=1}^n (x_i − x̄)² (see Example 3.1.6).

Thus the likelihood ratio is given by

λ = L(Ω0)/L(Ω) = (2π · n^{−1} Σ_{i=1}^n (x_i − µ0)²)^{−n/2} / (2π · n^{−1} Σ_{i=1}^n (x_i − x̄)²)^{−n/2} = (Σ_{i=1}^n (x_i − x̄)² / Σ_{i=1}^n (x_i − µ0)²)^{n/2}.   (631)

Using the ‘Pythagorean theorem’ (615) with a = µ0, we can rewrite the likelihood ratio as

λ = (Σ_{i=1}^n (x_i − x̄)² / ([Σ_{i=1}^n (x_i − x̄)²] + n(x̄ − µ0)²))^{n/2} = (1 + n(x̄ − µ0)²/Σ_{i=1}^n (x_i − x̄)²)^{−n/2}.   (632)

Note that λ is close to 1 if x̄ ≈ µ0, and λ is small when x̄ and µ0 differ by a lot. Now we compute the critical region, which is given by

λ = (1 + n(x̄ − µ0)²/Σ_{i=1}^n (x_i − x̄)²)^{−n/2} ≤ k ⇐⇒ n(x̄ − µ0)²/Σ_{i=1}^n (x_i − x̄)² ≥ k^{−2/n} − 1   (633)

⇐⇒ (x̄ − µ0)² / [((1/(n−1)) Σ_{i=1}^n (x_i − x̄)²)/n] ≥ (n − 1)(k^{−2/n} − 1)   (634)

⇐⇒ |x̄ − µ0|/(s/√n) ≥ c := √((n − 1)(k^{−2/n} − 1)),   (635)

where s denotes the sample standard deviation. In order to guarantee significance level α, we compute the Type I error of this test. Recall that (see Proposition 7.3.5)

T = (X̄ − µ0)/(S/√n) ∼ t(n − 1) under H0.   (636)

This gives

Type I error = P(reject H0 ; H0)   (637)

= P(|T| ≥ c ; H0)   (638)

= P(|t(n − 1)| ≥ c).   (639)

Thus we have

Type I error ≤ α ⇐⇒ c ≥ t_{α/2}(n − 1).   (640)

In summary, the critical region of the likelihood ratio test for testing H0 : µ = µ0 against H1 : µ ≠ µ0 with unknown σ² at significance level α is given by

C = {(x_1, · · · , x_n) : |x̄ − µ0|/(s/√n) ≥ t_{α/2}(n − 1)}.   (641)

This is precisely the t-test we had before in Algorithm 2. N
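For a concrete check, the rule (641) is just the usual two-sided one-sample t-test, so it can be compared against a library implementation. Below is a minimal Python sketch (our own hypothetical data and names, assuming numpy and scipy).

import numpy as np
from scipy.stats import t, ttest_1samp

mu0, alpha = 162.0, 0.05
x = np.array([158.3, 165.1, 160.7, 163.9, 159.2, 164.5, 161.8, 162.6])  # hypothetical sample
n = len(x)

# Test statistic |xbar - mu0| / (s / sqrt(n)), compared with t_{alpha/2}(n-1) as in (641).
T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
crit = t.ppf(1 - alpha / 2, df=n - 1)
print(f"|T| = {abs(T):.3f}, critical value = {crit:.3f}, reject H0: {abs(T) >= crit}")

# Cross-check with scipy's built-in two-sided t-test (reject iff p-value <= alpha).
print(ttest_1samp(x, popmean=mu0))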

Example 8.7.4 (Testing variance for unknown mean). Let X1, X2, · · · , Xn be an i.i.d. sample from the normal distribution N(µ, σ²), where µ and σ² are both unknown. Here we have two parameters µ and σ², so our parameter space is Ω = R × [0,∞). We shall use the likelihood ratio test to find a critical region for testing the null hypothesis H0 : σ² = σ0² against the alternative hypothesis H1 : σ² ≠ σ0². We set

Ω0 = {(µ, σ²) ∈ Ω | σ² = σ0²},  Ω1 = {(µ, σ²) ∈ Ω | σ² ≠ σ0²} = Ω \ Ω0.   (642)

Note that because µ is a variable (unknown), both of the hypotheses are composite. To compute the maximum likelihood function of the null hypothesis, note that

L(Ω0) = sup_{(µ,σ²)∈Ω0} (2π · σ²)^{−n/2} exp[−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²]   (643)

= (2π · σ0²)^{−n/2} exp[−(1/(2σ0²)) Σ_{i=1}^n (x_i − x̄)²],   (644)

where we have used the fact that the MLE of µ for the normal density N(µ, σ0²) is the sample mean x̄. On the other hand, as in the previous example, the full maximum likelihood is given by

L(Ω) = sup_{(µ,σ²)∈Ω} (2π · σ²)^{−n/2} exp[−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²]   (645)

= (2π · n^{−1} Σ_{i=1}^n (x_i − x̄)²)^{−n/2} exp[−n/2].   (646)

Thus the likelihood ratio is given by

λ = L(Ω0)/L(Ω) = [(2π · σ0²)^{−n/2} exp(−(1/(2σ0²)) Σ_{i=1}^n (x_i − x̄)²)] / [(2π · n^{−1} Σ_{i=1}^n (x_i − x̄)²)^{−n/2} exp(−n/2)] = (w/n)^{n/2} exp[−w/2 + n/2],   (647)

where we denote

w = (1/σ0²) Σ_{i=1}^n (x_i − x̄)².   (648)

Now we compute the critical region as

λ = (w/n)^{n/2} exp[−w/2 + n/2] ≤ k ⇐⇒ n log(w/n) − w + n ≤ 2 log k   (649)

⇐⇒ w ≤ c1 or w ≥ c2,   (650)

where c1 and c2 are constants that depend on n and k (see Exercise 8.7.5). In principle, we can compute the Type I error of the above test by noting that

W = (1/σ0²) Σ_{i=1}^n (X_i − X̄)² = (n − 1)(S/σ0)² ∼ χ²(n − 1) under H0,   (651)

which we have shown in the proof of Proposition 7.3.5. In practice, most statisticians tend to choose c1 and c2 so that both regions {W ≤ c1} and {W ≥ c2} get equal probability α/2. For this choice, the resulting critical region will be

Reject H0 : σ² = σ0² in favor of H1 : σ² ≠ σ0² if w ≤ χ²_{1−α/2}(n − 1) or w ≥ χ²_{α/2}(n − 1).   (652)

N
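A small Python sketch of the equal-tails rule (652) (our own code and made-up data; numpy and scipy are assumed):

import numpy as np
from scipy.stats import chi2

sigma0_sq, alpha = 5.0, 0.05
x = np.array([3.1, 7.4, 5.0, 2.2, 6.8, 4.9, 5.5, 8.1, 4.2, 6.0])  # hypothetical sample
n = len(x)

# w = (1/sigma0^2) * sum of (x_i - xbar)^2, compared with chi-square quantiles as in (652).
w = np.sum((x - x.mean()) ** 2) / sigma0_sq
lo = chi2.ppf(alpha / 2, df=n - 1)      # lower cutoff, i.e. chi^2_{1-alpha/2}(n-1) in the notation above
hi = chi2.ppf(1 - alpha / 2, df=n - 1)  # upper cutoff, i.e. chi^2_{alpha/2}(n-1)
print(f"w = {w:.2f}, cutoffs = ({lo:.2f}, {hi:.2f}), reject H0: {w <= lo or w >= hi}")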

Exercise 8.7.5. Let X1, X2, · · · , Xn be a random sample from the normal distribution N(µ, σ²), where µ and σ² are both unknown. We set

Ω0 = {(µ, σ²) ∈ Ω | σ² = σ0²},  Ω1 = {(µ, σ²) ∈ Ω | σ² ≠ σ0²} = Ω \ Ω0.   (653)

According to Example 8.7.4, we know that

λ = L(Ω0)/L(Ω) = (w/n)^{n/2} exp[−w/2 + n/2],   (654)

where

w = (1/σ0²) Σ_{i=1}^n (x_i − x̄)².   (655)

(i) For n = 30, draw the graph of the function f(x) = n log(x/n) − x + n (using a graphing calculator or a software package; a Python sketch is given after this exercise).

(ii) Draw the region {f(x) ≤ −2} = {w ≤ c1} ∪ {w ≥ c2} and approximately compute c1 and c2. Using the fact that W = (1/σ0²) Σ_{i=1}^n (X_i − X̄)² ∼ χ²(29), compute the probabilities

P(W ≤ c1),  P(W ≥ c2),   (656)

which add up to the Type I error of the critical region {2 log λ ≤ −2} = {λ ≤ 1/e}. Are they the same?

(iii) Assume that n = 30, w = 10, σ0² = 5, and α = 0.05. Using the following approximate (and symmetric) critical region, determine whether we can reject H0 : σ² = σ0² in favor of H1 : σ² ≠ σ0² at significance level α = 0.05:

Reject H0 : σ² = σ0² in favor of H1 : σ² ≠ σ0² if w ≤ χ²_{1−α/2}(n − 1) or w ≥ χ²_{α/2}(n − 1).   (657)
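A possible Python sketch for parts (i)–(ii), assuming numpy, scipy, and (optionally) matplotlib; the root-finding brackets are our own choices:

import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

n = 30
f = lambda x: n * np.log(x / n) - x + n  # this is 2*log(lambda) viewed as a function of w

# (i) f increases up to its maximum f(n) = 0 at x = n and decreases afterwards.
# import matplotlib.pyplot as plt
# xs = np.linspace(1, 80, 400); plt.plot(xs, f(xs)); plt.axhline(-2); plt.show()

# (ii) Solve f(x) = -2 on each side of the maximum to get c1 < n < c2.
c1 = brentq(lambda x: f(x) + 2, 1e-6, n)
c2 = brentq(lambda x: f(x) + 2, n, 200)
p_lo, p_hi = chi2.cdf(c1, df=n - 1), chi2.sf(c2, df=n - 1)
print(f"c1 ~ {c1:.2f}, c2 ~ {c2:.2f}")
print(f"P(W <= c1) ~ {p_lo:.4f}, P(W >= c2) ~ {p_hi:.4f}, total ~ {p_lo + p_hi:.4f}")

Comparing P(W ≤ c1) and P(W ≥ c2) shows that the exact likelihood ratio cutoffs typically do not split the type I error evenly, which is why the symmetric approximation in (657) is used in practice.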

8. Chi-square goodness-of-fit tests

In this section, we will study how to test a hypothesis about the entire distribution of an unknown random variable, not just about its mean or variance. Roughly speaking, we check how well our hypothetical distribution ‘fits’ the empirical distribution given by the sample values of the random variable. If they differ by a lot, then we reject the hypothetical distribution.


8.1. Chi-square goodness-of-fit test. Let X be a discrete RV taking r distinct values x1, x2, · · · , xr. We would like to formulate a hypothesis about its entire PMF, which we will denote by fX:

fX = [P(X = x1), P(X = x2), · · · , P(X = xr)] = [p1, p2, · · · , pr].   (658)

Consider the following complementary hypotheses:

H0 : fX = [p1, p2, · · · , pr]   (659)

H1 : fX ≠ [p1, p2, · · · , pr].   (660)

In order to test the above hypotheses, we first draw some i.i.d. samples X1, X2, · · · , Xn of X. This will give us the following empirical distribution

f̂X = [N1/n, N2/n, · · · , Nr/n],   (661)

where for each 1 ≤ i ≤ r, Ni denotes the number of samples with value xi:

Ni = Σ_{k=1}^n 1(Xk = xi).   (662)

Since the Xk’s are i.i.d. and each Xk takes value xi with probability pi,

Ni ∼ Binomial(n, pi).   (663)

In particular,

E[Ni] = npi,  Var(Ni) = npi(1 − pi).   (664)

Furthermore, according to the CLT we also have the following normal approximation:

(Ni − npi)/√(npi(1 − pi)) =⇒ N(0,1)   (665)

as the sample size n tends to infinity.

However, the above ‘large sample approximation’ only applies to each Ni separately, whereas we would like to approximate the entire empirical distribution f̂X as n → ∞. The difficulty here is that the counts N1, · · · , Nr are not independent, since they satisfy the following deterministic condition:

N1 + N2 + ··· + Nr = n.   (666)

The following classical theorem due to Pearson overcomes this issue and provides the basis of ‘goodness-of-fit’ tests. See the end of this section for a proof of this result.

Theorem 8.8.1 (Pearson, 1900). Let N1, · · · , Nr be as before. Then as the sample size n → ∞,

Σ_{i=1}^r (Ni − npi)²/(npi) =⇒ χ²(r − 1),   (667)

where =⇒ denotes convergence in distribution and χ²(r − 1) denotes the chi-square distribution with r − 1 degrees of freedom.
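One way to get a feel for Theorem 8.8.1 is to simulate it. The sketch below (our own code, assuming numpy and scipy) draws many multinomial samples, computes the statistic in (667) for each, and compares a few empirical quantiles with those of χ²(r − 1).

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])  # hypothesized PMF, r = 4
n, reps = 500, 20_000

counts = rng.multinomial(n, p, size=reps)              # N_1, ..., N_r for each replicate
stats = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)  # Pearson statistic in (667)

for q in (0.5, 0.9, 0.95, 0.99):
    print(q, round(np.quantile(stats, q), 2), round(chi2.ppf(q, df=len(p) - 1), 2))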

In light of Pearson’s theorem, we formulate the chi-square goodness-of-fit test as follows.

Example 8.8.2 (Chi-square goodness-of-fit test). Let X be a discrete RV of interest, taking r distinct values x1, x2, · · · , xr. Consider the following complementary hypotheses on the PMF fX of X:

H0 : fX = [p1, p2, · · · , pr]   (668)

H1 : fX ≠ [p1, p2, · · · , pr].   (669)

Form the following test statistic:

T = Σ_{i=1}^r (Ni − npi)²/(npi).   (670)


Clearly, T tends to be small if H0 is true (since then Ni ≈ npi), and large values of T indicate that H0 is far from the truth. Hence we may form the critical region as

Reject H0 in favor of H1 if T ≥ θ,   (671)

where θ ≥ 0 is a threshold that we will choose in response to a desired significance level α. By using Pearson’s Theorem (Theorem 8.8.1), we can approximately compute the type I error as

Type I error = P(Reject H0 ; H0)   (672)

= P(T ≥ θ ; H0)   (673)

≈ P(χ²(r − 1) ≥ θ).   (674)

Hence, approximately as n → ∞,

Type I error ≤ α ⇐⇒ P(χ²(r − 1) ≥ θ) ≤ α ⇐⇒ θ ≥ χ²_α(r − 1).   (675)

Thus the optimal choice of θ is θ = χ²_α(r − 1). N
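In code, the whole procedure of Example 8.8.2 takes only a few lines; scipy also packages it as scipy.stats.chisquare. The sketch below uses made-up counts and hypothesized probabilities of our own choosing.

import numpy as np
from scipy.stats import chi2, chisquare

alpha = 0.05
p = np.array([0.25, 0.25, 0.25, 0.25])  # hypothesized PMF (a fair four-sided die), r = 4
N = np.array([18, 30, 22, 30])          # hypothetical observed counts, n = 100
n = N.sum()

T = ((N - n * p) ** 2 / (n * p)).sum()  # Pearson statistic (670)
cutoff = chi2.ppf(1 - alpha, df=len(p) - 1)
print(f"T = {T:.2f}, cutoff = {cutoff:.2f}, reject H0: {T >= cutoff}")

# scipy's built-in version returns the same statistic together with a p-value.
print(chisquare(f_obs=N, f_exp=n * p))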

Example 8.8.3 (Excerpted from [HTZ77]). If persons are asked to record a string of random digits, such as

3 7 2 4 1 9 7 2 1 5 0 8 · · ·   (676)

we usually find that they are reluctant to record the same or even the two closest numbers in adjacent positions. Indeed, if the digits are i.i.d. uniform over {0, 1, · · · , 9}, then the probability that the next digit is the same as the preceding one is q1 = 1/10, the probability that the next is only one away from the preceding one (assuming that 0 is one away from 9) is q2 = 2/10, and the probability of all other possibilities is q3 = 7/10.

Suppose we have an algorithm that generates strings of digits, and we would like to test whether it generates random strings. Let

p1 = frequency that two consecutive digits generated by the algorithm are the same   (677)

p2 = frequency that two consecutive digits generated by the algorithm differ by 1   (678)

p3 = frequency that two consecutive digits generated by the algorithm differ by ≥ 2.   (679)

By comparing with an actual (uniform) random number generator, we form the following null hypothesis:

H0 : [p1, p2, p3] = [q1, q2, q3] = [1/10, 2/10, 7/10].   (680)

Our test statistic is

T = Σ_{i=1}^3 (Ni − nqi)²/(nqi),   (681)

where

N1 = # of times that the next digit is the same as the preceding one   (682)

N2 = # of times that the next digit is one away from the preceding one   (683)

N3 = # of times that the next digit is two or more away from the preceding one.   (684)

The critical region at significance level α = 0.05 is t ≥ χ²_{0.05}(2) = 5.991.

Suppose the observed sequence of 51 digits is as follows:

583194679263087513621954803714604382739856187035252.   (685)

By examining this sequence, we find

N1 = 0,  N2 = 8,  N3 = 42.   (686)


There are n = 50 pairs of consecutive digits in this string. The computed chi-square statistic is

t = (0 − 50(1/10))²/(50(1/10)) + (8 − 50(2/10))²/(50(2/10)) + (42 − 50(7/10))²/(50(7/10)) = 6.8 > 5.991 = χ²_{0.05}(2).   (687)

Thus, we conclude that the algorithm does not seem to be generating random numbers. N
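The counts in (686) and the statistic in (687) can be reproduced directly from the digit string. A short Python sketch (our own code, assuming scipy):

from scipy.stats import chi2

digits = "583194679263087513621954803714604382739856187035252"
q = [0.1, 0.2, 0.7]

# Classify each of the 50 consecutive pairs: same digit, one away (mod 10), or further apart.
N = [0, 0, 0]
for a, b in zip(digits, digits[1:]):
    d = abs(int(a) - int(b))
    if d == 0:
        N[0] += 1
    elif d == 1 or d == 9:  # 0 and 9 count as one away
        N[1] += 1
    else:
        N[2] += 1

n = len(digits) - 1  # number of consecutive pairs, here 50
t = sum((N[i] - n * q[i]) ** 2 / (n * q[i]) for i in range(3))
print(N, round(t, 2), "cutoff:", round(chi2.ppf(0.95, df=2), 3))  # expect [0, 8, 42], 6.8, 5.991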

The proof of Pearson’s theorem uses a random-vector version of the CLT, which is called the ‘multivariate CLT’. We state this result without proof.

Theorem 8.8.4 (Multivariate CLT). Let X = (X1, · · · , Xr) be a random vector such that each entry has finite variance. Let Σ denote the covariance matrix of X, that is, the r × r matrix whose (i, j) entry is given by

Σ[i, j] = Cov(Xi, Xj) = E[Xi Xj] − E[Xi] E[Xj].   (688)

Let (X_k)_{k≥1} be a sequence of i.i.d. copies of X. Then as n → ∞,

Z_n := (1/√n) Σ_{k=1}^n (X_k − E[X]) =⇒ N(0, Σ),   (689)

where =⇒ denotes convergence in distribution and N(0, Σ) denotes the multivariate normal distribution with mean 0 and covariance matrix Σ.

PROOF OF THEOREM 8.8.1. (Optional*) Recall that we have an unknown RV X that takes r values x1, · · · , xr, and we have i.i.d. samples X1, · · · , Xn of X. Every time we draw a sample Xk, we observe the r-dimensional row vector of indicators

X_k = [1(Xk = x1), 1(Xk = x2), · · · , 1(Xk = xr)].   (690)

In words, if Xk takes the j-th value xj, then the j-th entry of X_k equals 1 and all of its other entries are zero. To compute the covariance matrix Σ of X_k, observe that

Cov(1(Xk = xi), 1(Xk = xj)) = E[1(Xk = xi) 1(Xk = xj)] − E[1(Xk = xi)] E[1(Xk = xj)]   (691)

= E[1(Xk = xi) 1(Xk = xj)] − pi pj   (692)

= pi(1 − pi) if i = j, and −pi pj if i ≠ j.   (693)

Thus the covariance matrix Σ is the r × r matrix whose diagonal entries are Σ[i, i] = pi(1 − pi) and whose off-diagonal entries are Σ[i, j] = −pi pj for i ≠ j.   (694)

Hence according to the multivariate CLT, as n → ∞,

Z_n := (1/√n) Σ_{k=1}^n (X_k − E[X_k]) = [(N1 − np1)/√n, (N2 − np2)/√n, · · · , (Nr − npr)/√n] =⇒ N(0, Σ).   (695)

A slight modification will show that

[(N1 − np1)/√(np1), (N2 − np2)/√(np2), · · · , (Nr − npr)/√(npr)] =⇒ N(0, Σ′),   (696)

where the new covariance matrix Σ′ has diagonal entries Σ′[i, i] = 1 − pi and off-diagonal entries Σ′[i, j] = −√(pi pj) for i ≠ j.   (697)


This¹ shows that

Σ_{i=1}^r (Ni − npi)²/(npi) =⇒ Σ_{i=1}^r Zi²,   (698)

where the Gaussian random vector [Z1, · · · , Zr] follows the multivariate normal distribution N(0, Σ′). To finish up, we use the result of Exercise 8.8.5, which states that

Z1² + ··· + Z_{r−1}² + Zr² is equal in distribution to Z′1² + ··· + Z′_{r−1}²,   (699)

where [Z′1, · · · , Z′_{r−1}] is multivariate normal with mean zero and covariance matrix I_{r−1}. Since the right-hand side follows χ²(r − 1), the assertion then follows.

Exercise 8.8.5. Let g = [g1, · · · , gr] ∼ N(0, Ir) and let p = [√p1, · · · , √pr]. Note that ‖p‖ = 1. Using the standard basis [e1, · · · , er], write

g − (g · p)p = Z1 e1 + ··· + Z_{r−1} e_{r−1} + Zr er.   (700)

(i) Show that [Z1, · · · , Zr] ∼ N(0, Σ′), where the covariance matrix Σ′ is given by (697).

(ii) By changing the standard basis [e1, · · · , er] to an orthonormal basis [q1, · · · , q_{r−1}, p], write

g = Z′1 q1 + ··· + Z′_{r−1} q_{r−1} + (g · p)p.   (701)

Using the fact that the covariance matrix of a multivariate normal does not change under a change of orthonormal basis, deduce that [Z′1, · · · , Z′_{r−1}] ∼ N(0, I_{r−1}).

(iii) From (i) and (ii), show that

Z1² + ··· + Z_{r−1}² + Zr² = ‖g − (g · p)p‖² = Z′1² + ··· + Z′_{r−1}².   (702)

Conclude that

Z1² + ··· + Z_{r−1}² + Zr² ∼ χ²(r − 1).   (703)

¹In fact, we need to appeal to a functional version of the multivariate CLT to deduce that the convergence in distribution of a sequence of random vectors transfers to a function of their coordinates.


Bibliography

[HTZ77] Robert V. Hogg, Elliot A. Tanis, and Dale L. Zimmerman, Probability and statistical inference, vol. 993, Macmillan, New York, 1977.



Table of standard normal probabilities ℙ(0 ≤ 𝑍 ≤ 𝑧), where 𝑍 ∼ 𝑁(0,1).

TABLE 1. Standard normal table


Table of t-scores t_α(k), which is the unique value such that ℙ(T > t_α(k)) = α for T ∼ t(k).

TABLE 2. Table of t-scores