
Nonparametric Testing
Multinomial Distribution, Chi-square goodness of fit tests, Empirical CDFs



Goals

Up until now, we’ve been performing statistical inferences about specific parameters of interest.

Given a problem, we’ve been modeling the setup with a known probability function. Then we work to learn what we can about parameters of interest in the probability density or mass function.

Our goal now is to consider what we might do when we’re interested in performing inference about the probability function itself.



Motivating Example: Birth Dates of Hockey Players

The Toronto Globe and Mail published an article in 1987, written by Neil Campbell, called “NHL career can be preconceived”. The article claims that organized hockey has

“turned half the boys in hockey-playing countries into second class citizens. The disadvantaged are those unlucky enough to have been born in the second half of the calendar year.”

Basically, the claim is that boys born earlier in the year are much more likely to end up playing organized hockey.

This example is from Paul Gingrich at the University of Regina.



Birth Dates of Hockey Players

Quarter         # Players
Jan. to Mar.           84
Apr. to Jun.           77
Jul. to Sep.           35
Oct. to Dec.           34

From the Western Hockey League (WHL) Yearbook for 1987-88

So, it looks like the data support his claim.

What confounding variable might affect these results?




Births per quarter (shown for 1967-1970)

Quarter         Birth Count   Birth Prop.   # Players
Jan. to Mar.         97,487         0.242          84
Apr. to Jun.        104,761         0.260          77
Jul. to Sep.        103,974         0.258          35
Oct. to Dec.         97,186         0.241          34
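As a quick check on the table, the birth proportions can be recomputed from the counts and set beside each quarter's share of players. This is a minimal Python sketch; the variable names are illustrative and not part of the original slides.

```python
# Birth counts (1967-1970) and WHL player counts by quarter, copied from the table.
quarters = ["Jan. to Mar.", "Apr. to Jun.", "Jul. to Sep.", "Oct. to Dec."]
birth_counts = [97487, 104761, 103974, 97186]
player_counts = [84, 77, 35, 34]

total_births = sum(birth_counts)
total_players = sum(player_counts)

# Share of births and share of players that fall in each quarter.
birth_props = [b / total_births for b in birth_counts]
player_props = [k / total_players for k in player_counts]

for q, bp, pp in zip(quarters, birth_props, player_props):
    print(f"{q:<13} births {bp:.3f}   players {pp:.3f}")
```

The births are spread almost evenly over the quarters, while the player shares are not, which is what motivates a formal test.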

We want to answer some question about birth quarters and chance of playing organized hockey.

What hypothesis do you think is reasonable?




We could form hypotheses like “Is the number of players by quarter in the hockey league described by a discrete uniform distribution with equal probabilities?”

Given the birth counts we have, a better hypothesis might simply be: Are the birth counts and hockey player counts coming from the same discrete distribution?

So, how do we test this?



Multinomial Distribution

The multinomial is an extension of the binomial distribution where we allow $t$ different outcomes $r_1, \dots, r_t$, with probabilities $p_1, \dots, p_t$.

For a set of $n$ trials, we define $X_i$ to be the number of times that outcome $r_i$ occurs, with $p_i = P(r_i)$. Then the vector $(X_1, \dots, X_t)$ has a multinomial distribution with

$$P(X_1 = k_1, \dots, X_t = k_t) = \frac{n!}{k_1! \, k_2! \cdots k_t!} \, p_1^{k_1} p_2^{k_2} \cdots p_t^{k_t}$$

where $k_i \in \{0, 1, \dots, n\}$ for $i = 1, 2, \dots, t$ and $\sum_{i=1}^{t} k_i = n$.
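The probability mass function can be evaluated directly from this formula. The sketch below uses only the Python standard library; the function name `multinomial_pmf` is an illustrative choice, and the inputs are made-up values.

```python
import math

def multinomial_pmf(counts, probs):
    """P(X_1 = k_1, ..., X_t = k_t) for counts (k_1..k_t) and probabilities (p_1..p_t)."""
    n = sum(counts)
    # Multinomial coefficient n! / (k_1! k_2! ... k_t!), kept as an exact integer.
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)
    # Multiply by p_1^k_1 * ... * p_t^k_t.
    prob = float(coef)
    for k, p in zip(counts, probs):
        prob *= p ** k
    return prob

# With t = 2 outcomes this reduces to the binomial pmf: C(3, 2) * 0.5^3.
print(multinomial_pmf([2, 1], [0.5, 0.5]))  # 0.375
```

Summing the function over every count vector with the same total $n$ should give 1, which is a useful sanity check.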



Multinomial Distribution

The mean of the multinomial distribution is the vector

$$E(X_1, \dots, X_t) = (n p_1, \dots, n p_t)$$

And it has covariance

$$\begin{pmatrix}
np_1(1-p_1) & -np_1p_2 & -np_1p_3 & \cdots & -np_1p_t \\
-np_2p_1 & np_2(1-p_2) & -np_2p_3 & \cdots & -np_2p_t \\
-np_3p_1 & -np_3p_2 & np_3(1-p_3) & \cdots & -np_3p_t \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-np_tp_1 & -np_tp_2 & -np_tp_3 & \cdots & np_t(1-p_t)
\end{pmatrix}$$
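The mean vector and covariance matrix are easy to build entry by entry. A minimal sketch follows, assuming plain Python lists for vectors and matrices; the function names and the example values of $n$ and $p$ are illustrative only.

```python
def multinomial_mean(n, probs):
    """Mean vector (n p_1, ..., n p_t)."""
    return [n * p for p in probs]

def multinomial_cov(n, probs):
    """Covariance matrix: n p_i (1 - p_i) on the diagonal, -n p_i p_j off it."""
    t = len(probs)
    return [
        [n * probs[i] * (1 - probs[i]) if i == j else -n * probs[i] * probs[j]
         for j in range(t)]
        for i in range(t)
    ]

mean = multinomial_mean(10, [0.2, 0.3, 0.5])
cov = multinomial_cov(10, [0.2, 0.3, 0.5])
for row in cov:
    print([round(v, 2) for v in row])
```

Because the counts always add up to $n$, every row of the covariance matrix sums to zero, so the matrix is singular.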







We can use the multinomial to test general equality of two distributions.

How would we do it for a discrete distribution?
For a finite sample space, we can formulate a hypothesis where the probability of each outcome is the same in the two distributions. For an infinite sample space, we can bin the tail(s) into one category and then do the same thing.

How would we do it for a continuous distribution?
We can bin the continuous distribution, and then check if the probability in each bin is the same.
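The binning idea for a continuous distribution can be sketched as follows: the hypothesized CDF gives a probability for each bin, and the sample gives a count for each bin. The standard normal null distribution, the cutoffs, and the sample values here are all assumptions chosen for illustration.

```python
import bisect
from statistics import NormalDist

def bin_probabilities(cdf, cutoffs):
    """Probability of each bin (-inf, c_1], (c_1, c_2], ..., (c_m, inf) under cdf."""
    cum = [0.0] + [cdf(c) for c in cutoffs] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]

def bin_counts(sample, cutoffs):
    """Number of observations falling in each of the same bins."""
    counts = [0] * (len(cutoffs) + 1)
    for x in sample:
        # Number of cutoffs strictly below x is the index of the bin containing x.
        counts[bisect.bisect_left(cutoffs, x)] += 1
    return counts

cutoffs = [-1.0, 0.0, 1.0]                      # illustrative cutoffs
probs = bin_probabilities(NormalDist(0, 1).cdf, cutoffs)
counts = bin_counts([-2, -1, -0.5, 0, 0.5, 3], cutoffs)  # made-up sample
print([round(p, 4) for p in probs], counts)
```

The pair (`counts`, `probs`) is then a multinomial observation together with its hypothesized probability vector.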

draw a picture



Goodness of Fit Tests

For a multinomial random vector $(X_1, \dots, X_t)$ with associated probability vector $(p_1, \dots, p_t)$,

$$D = \sum_{i=1}^{t} \frac{(X_i - np_i)^2}{np_i} \overset{\text{approx}}{\sim} \chi^2_{(t-1)}$$

This is Karl Pearson’s chi-square statistic.

The generalized likelihood ratio test would (asymptotically) result in the same statistic.



Goodness of Fit Tests

Using the multinomial, we can then test if one sample has a specific distribution by testing:

$$H_0 : p_1 = p_{10}, \dots, p_t = p_{t0} \quad \text{vs.} \quad H_1 : p_i \neq p_{i0} \text{ for some } i$$

Given an observation $(X_1 = k_1, \dots, X_t = k_t)$, we can calculate valid p-values:

$$\text{p-val} = P\left( \chi^2_{(t-1)} > \sum_{i=1}^{t} \frac{(k_i - np_{i0})^2}{np_{i0}} \right)$$
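The p-value calculation can be sketched with the Python standard library alone. The helper `chi2_sf` below is not a library function: it is a hand-written upper tail probability for integer degrees of freedom, based on the closed-form finite sums for the chi-square distribution. The counts in the example are hypothetical.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with integer df > x), using closed-form finite sums."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-y) * sum_{j<m} y^j / j!
        return math.exp(-y) * sum(y ** j / math.factorial(j) for j in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(y)) + exp(-y) * sum_{j=1..m} y^(j - 1/2) / Gamma(j + 1/2)
    m = (df - 1) // 2
    tail = sum(y ** (j - 0.5) / math.gamma(j + 0.5) for j in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def pearson_gof(counts, null_probs):
    """Pearson statistic D and its approximate p-value on t - 1 degrees of freedom."""
    n = sum(counts)
    d = sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, null_probs))
    return d, chi2_sf(d, len(counts) - 1)

# Hypothetical counts for three equally likely outcomes (illustration only).
d, p_value = pearson_gof([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
print(round(d, 3), round(p_value, 3))
```

In practice a statistics library would normally supply the chi-square tail probability; the hand-written version is used here only to keep the sketch self-contained.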



Testing the Hockey Data


What’s our hypothesis?





With $n = 230$ players in total:

$$H_0 : (p_1 = 0.242,\ p_2 = 0.260,\ p_3 = 0.258,\ p_4 = 0.241)$$

$$H_1 : \text{not } H_0$$




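$$\begin{aligned}
D &= \sum_{i=1}^{4} \frac{(X_i - np_i)^2}{np_i} \\
&= \frac{(84 - 230 \times 0.242)^2}{230 \times 0.242} + \frac{(77 - 230 \times 0.260)^2}{230 \times 0.260} + \frac{(35 - 230 \times 0.258)^2}{230 \times 0.258} + \frac{(34 - 230 \times 0.241)^2}{230 \times 0.241} \\
&= 37.6
\end{aligned}$$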
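The arithmetic can be checked in a few lines of Python. The tail probability uses the closed form for a chi-square with 3 degrees of freedom, $P(\chi^2_3 > x) = \operatorname{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$. This is a sketch for verification, with variable names chosen here for illustration.

```python
import math

counts = [84, 77, 35, 34]                  # players per quarter
null_probs = [0.242, 0.260, 0.258, 0.241]  # rounded birth proportions (sum to 1.001)

n = sum(counts)                            # 230
expected = [n * p for p in null_probs]
D = sum((k - e) ** 2 / e for k, e in zip(counts, expected))

# Upper tail of chi-square with t - 1 = 3 degrees of freedom (closed form).
p_value = math.erfc(math.sqrt(D / 2)) + math.sqrt(2 * D / math.pi) * math.exp(-D / 2)

print(f"D = {D:.1f}, p-value = {p_value:.2e}")
```

The resulting p-value is far below any conventional significance level, so the player counts do not look like a draw from the birth distribution.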




[Figure: density of the $\chi^2$ distribution with $4 - 1 = 3$ degrees of freedom, dchisq(x, df = 3), plotted for x from 0 to 40.]



Choosing Bins

By popular request, a bit more info on choosing bins (particularly in the continuous distribution setting).

• There is no general optimal bin width (it depends on the distribution).

• A rule of thumb is to choose bins so that the expected count in each bin $i$ is at least 5. This is to ensure a valid $\chi^2$ approximation. One thing to do is choose the number of bins to be $n/5$ and to choose your cutoffs so the bins have equal probability.

• Another method: choose bins that have a width of $s/3$, and form the lower and upper tail bins at $\bar{X} \pm 6s$, where $s$ is the sample standard deviation.

• Any reasonable set of bins should give reasonable results.
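The equal-probability rule of thumb can be sketched as below: with $n/5$ bins of equal probability, each bin has an expected count of 5. The standard normal null distribution and the sample size are assumptions for illustration, and `equal_probability_cutoffs` is a hypothetical helper name.

```python
from statistics import NormalDist

def equal_probability_cutoffs(n, dist):
    """Cutoffs splitting dist into k = n // 5 bins of equal probability 1/k."""
    k = max(n // 5, 1)
    # The j-th cutoff is the j/k quantile of the hypothesized distribution.
    cutoffs = [dist.inv_cdf(j / k) for j in range(1, k)]
    return k, cutoffs

n = 50                                         # illustrative sample size
k, cutoffs = equal_probability_cutoffs(n, NormalDist(0, 1))
expected_per_bin = n / k

print(k, expected_per_bin)
print([round(c, 3) for c in cutoffs])
```

The outermost bins are the open tails below the first cutoff and above the last one, so no separate tail handling is needed.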



Empirical CDFs

Given a sample of iid RVs $X_1, \dots, X_n$, we know that their (cumulative) distribution function is

$$F_X(t) = P(X \le t)$$

Last class we were considering ways to use the density function $f_X = \frac{dF_X}{dx}$ to perform nonparametric tests, but it seems just as reasonable to use the distribution function.

Given a set of observations $x_1, \dots, x_n$ from iid RVs, we need to estimate the CDF.



Building the ECDF

We call our estimate the empirical cumulative distribution function (ecdf). We denote it $F_n(t)$.

Let’s consider a specific value of $t = t^*$.

We want to find an estimator $F_n(t^*)$ of $F(t^*)$.

• What do you think is a reasonable form for $F_n(t^*)$?

• What can we say about its distribution?

• Is it biased?

• What about its asymptotic distribution?



Building the ECDF

Since our observations were drawn from the same underlying distribution, each observation has the same probability: $1/n$.

$$F_n(t^*) = \frac{\text{number of obs. less than or equal to } t^*}{n} = \frac{1}{n} \sum_{i=1}^{n} 1\{x_i \le t^*\}$$

It’s a sum of indicators (Bernoullis).

So, $n F_n(t^*) \sim \text{Bin}(n, F(t^*))$.

And $E[F_n(t^*)] = \frac{1}{n} \times n F(t^*) = F(t^*)$, so $F_n$ is unbiased for each $t$.

And we also get that, in distribution,

$$\sqrt{n}\,\left(F_n(t^*) - F(t^*)\right) \to N\left(0,\ F(t^*)(1 - F(t^*))\right)$$
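A small sketch of the ECDF, followed by a simulation that checks the mean and variance claims. The Uniform(0, 1) population (so that $F(t) = t$), the sample size, the seed, and the value of $t^*$ are assumptions chosen for illustration.

```python
import bisect
import random
import statistics

def ecdf(sample):
    """Return F_n, where F_n(t) = (number of x_i <= t) / n."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(t):
        # bisect_right counts how many sorted observations are <= t.
        return bisect.bisect_right(xs, t) / n

    return F_n

F_n = ecdf([3, 1, 2, 2])
print(F_n(2))  # 0.75

# Simulation: repeated Uniform(0, 1) samples, so the true CDF is F(t) = t.
random.seed(0)
n, reps, t_star = 50, 2000, 0.3
estimates = [ecdf([random.random() for _ in range(n)])(t_star) for _ in range(reps)]

mean_est = statistics.fmean(estimates)            # should be close to F(t*) = 0.3
scaled_var = n * statistics.pvariance(estimates)  # should be close to 0.3 * 0.7 = 0.21
print(round(mean_est, 3), round(scaled_var, 3))
```

The simulated mean sitting near $F(t^*)$ reflects unbiasedness, and the scaled variance sitting near $F(t^*)(1 - F(t^*))$ matches the variance in the normal limit.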



Empirical CDFs

[Figure: example of an empirical CDF (taken from Wikipedia).]



Empirical CDFs

What’s the mean of the random variable that has the empirical CDF as its CDF?

What’s the variance of the random variable that has the empirical CDF as its CDF?



Empirical CDFs

What’s the mean of the random variable that has the empirical CDF as its CDF?

We can think of this as the mean value of the discrete probability distribution of a randomly chosen member of the list $\{X_1, X_2, \dots, X_n\}$. The index $i$ of this randomly chosen member of the list is equally likely (i.e., has probability $1/n$) to be any of the values $1, 2, \dots, n$.

Usually we find the (discrete RV) mean by

$$\mu = E[X] = \sum x_i \, p(x_i)$$

Now, we have

$$\mu_{F_n} = E[F_n] = \sum \frac{1}{n} X_i = \bar{X}$$

So the mean of the ECDF is the sample mean! Seems reasonable.



Empirical CDFs

What’s the variance of the random variable that has the empirical CDF as its CDF?

We do the same thing, and we know the mean already.

$$\text{Var}(F_n) = \sum (X_i - \bar{X})^2 \, p(X_i) = \frac{1}{n} \sum (X_i - \bar{X})^2$$

So we get the (biased) sample variance. This also seems reasonable.
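Both results can be checked numerically by treating the sample as a discrete distribution with mass $1/n$ on each observation. The data values below are made up for illustration.

```python
import statistics

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical observations
n = len(sample)
mass = 1 / n                               # probability attached to each observation

# Mean and variance of the discrete distribution defined by the ECDF.
ecdf_mean = sum(x * mass for x in sample)
ecdf_var = sum((x - ecdf_mean) ** 2 * mass for x in sample)

print(ecdf_mean, statistics.fmean(sample))      # both are the sample mean
print(ecdf_var, statistics.pvariance(sample))   # both divide by n (biased)
print(statistics.variance(sample))              # divides by n - 1 (unbiased)
```

The last line shows the usual unbiased sample variance for comparison; it is larger by the factor $n/(n-1)$.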



Sample Quantiles
