

Page 1: Chapter 7: Hypothesis testing - Lund Observatory · Hypothesis testing: The goal of hypothesis testing is to decide, based on a sample from the population, which of the two complementary

ASTM21 Chapter 7: Hypothesis testing p.

Chapter 7: Hypothesis testing

• Classical hypothesis testing

• Significance level and p-value

• The chi-square test for histogram data

• Parametric and non-parametric tests

• The chi-square goodness-of-fit test

• The Kolmogorov-Smirnov test

1

Page 2

Hypothesis testing: Some definitions

2

Hypothesis (pl. hypotheses): A hypothesis is a statement about a population parameter.

Hypothesis testing: The goal of hypothesis testing is to decide, based on a sample from the population, which of two complementary hypotheses is true.

Null and alternative hypothesis: The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis (H0) and the alternative hypothesis (H1).

Hypothesis test: A hypothesis test is a rule that specifies for which sample values H0 should be rejected.

Test statistic: Typically, the test is specified in terms of a test statistic t(X), such that H0 is rejected if t exceeds a certain value (= the critical value).

Page 3

Classical hypothesis testing: Role of the null hypothesis

3

H0 and H1 are not treated on an equal basis: the null hypothesis describes a well-defined model (theory), while the alternative hypothesis could be “anything else”.

Example: We want to test whether the frequency of exoplanetary companions depends on the mass of the host star. The null hypothesis is that the frequency is the same for stars of all masses. The alternative hypothesis implies a dependence on mass, but without specifying anything about its character (e.g., whether the frequency increases or decreases with stellar mass, or perhaps peaks at a certain mass).

The null hypothesis sometimes describes what is currently held to be the “truth”. The alternative hypothesis then implies that established theory is wrong.

Example: We want to investigate whether Newton's law of gravitation holds at large distances. The null hypothesis (that Newton's law is valid at all distances) should not be rejected lightly, but only if there is overwhelming evidence against it.

Here H0 and H1 are clearly not interchangeable. The possible outcomes of the test are: “reject H0” (better than “accept H1”) or “do not reject H0” (better than “accept H0”).

A scientific theory can never be verified, but it must be falsifiable (cf. Karl Popper’s criterion of demarcation).

Page 4

Significance level

4

The outcome of the test can be wrong in two different ways:

P(Type I Error) = α (this is the significance level of the test)
P(Type II Error) = β (the power of the test is 1 − β)

Both probabilities depend on the quality of the data and on how the test is designed. For a well-defined null hypothesis, it may be possible to compute the significance level α of a test. It is usually not possible to compute the power 1 − β of the test, because we do not know what the data should look like under the alternative hypothesis.

A test may be designed for a pre-chosen significance level, e.g., α = 0.1, 0.05, 0.01, 0.001. The corresponding critical value is denoted tα: P(t > tα | H0) = α.

                        Decision
                        Do not reject H0                   Reject H0
Truth: H0 true          OK                                 Type I Error (false positive)
Truth: H1 true          Type II Error (false negative)     OK

Page 5

Yes, no, or maybe? Using the p-value instead

5

There is no single “correct” value for the significance level α. If the consequences of making a Type I Error would be serious, then a very small value should be chosen. Otherwise a standard value of 0.05 may be appropriate.

Very often (especially in scientific papers), it is not necessary to make a yes/no statement about whether the null hypothesis should be rejected: it may be more informative to give the p-value resulting from the test. The p-value is the probability of getting a value of the test statistic at least as extreme as the actually observed value purely by chance, if the null hypothesis is true. [Under H0 one expects p ~ U(0,1).]

The p-value can be computed from the (known) distribution of the test statistic under the null hypothesis.

If a test gives the p-value p = 0.03, the null hypothesis would be rejected at significance level α = 0.05, but not at the more conservative significance level α = 0.01.

Providing the p-value leaves the final verdict to the reader, whose conclusion will depend on their prior belief in the null hypothesis.

Page 6

Example 1: The two-point correlation function

In P1 you were asked to decide which of six datasets was not uniform random.

You could formulate this problem as a hypothesis test with null hypothesis:

As a test statistic we could use the computed value of the two-point correlation function w3(θ) for some suitable angle θ. E.g. for the bin 5 < θ ≤ 10 units, the result looks like this:

6

Dataset    w3(θ) × 100
A          0.212
B          0.547
C          0.125
D          6.141
E          0.578
F          0.052

For convenience, let t = w3(θ) × 100 be the test statistic (for the bin 5 < θ ≤ 10 units).

Clearly t is much larger for dataset D than for any of the other sets. But how significant is it? What is the p-value?

The p-value could be estimated by Monte Carlo simulations. How?

Given a significance level α (e.g., α = 0.01), how could one determine the critical value of the test?

H0: the points are uniformly distributed.

Note: A-F are a random permutation of 1-6!
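The Monte Carlo question above can be sketched as follows: simulate many datasets under H0, compute the statistic for each, and count how often it is at least as large as the observed value. This is a minimal sketch in Python; the toy statistic `pair_fraction` (fraction of point pairs closer than some distance) is an invented stand-in for the w3 estimator used in P1, and the field size follows the slides.

```python
import numpy as np

def mc_pvalue(t_obs, statistic, n_points, n_sims=200, seed=None):
    """Estimate a p-value by Monte Carlo: simulate datasets under H0
    (points uniform on a 1000x1000 field), compute the statistic for
    each, and count how often it is >= the observed value."""
    rng = np.random.default_rng(seed)
    t_sim = np.array([statistic(rng.uniform(0.0, 1000.0, size=(n_points, 2)))
                      for _ in range(n_sims)])
    # the +1 keeps the estimated p-value away from exactly zero
    return (1 + np.sum(t_sim >= t_obs)) / (n_sims + 1)

def pair_fraction(xy, rmax=10.0):
    """Toy clustering statistic: fraction of point pairs closer than rmax
    (illustrative only; not the actual w3 estimator)."""
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1)
    iu = np.triu_indices(len(xy), k=1)
    return np.mean(d2[iu] < rmax ** 2)
```

The critical value for a pre-chosen α can be read off the same simulations as the (1 − α) quantile of the simulated statistics, e.g. np.quantile(t_sim, 1 - alpha).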

Page 7

Chi-square test for histogram data

The chi-square test is useful for testing if observed frequencies (e.g., in a histogram) are consistent with expected or predicted frequencies.

Given N objects divided into n classes (bins), with observed numbers Oi (i = 1, …, n).

Under the null hypothesis the theoretical frequencies are fi; thus the expected numbers are Ei = N fi.

The test statistic is

t = Σ_{i=1}^{n} (Oi − Ei)² / Ei

Under the null hypothesis, t is (approximately) chi-square distributed with ν = n − 1 degrees of freedom. The −1 comes from the constraint that the sum of the observed numbers is fixed (= N).

(If the fi are derived by fitting a model to the data, with m fitted parameters, then the number of degrees of freedom is ν = n − m − 1.)

For a given significance level (e.g., α = 0.01), the critical value of t can be looked up in a table (next slide), or computed in MATLAB: tcrit = chi2inv(1−α, ν).

Alternatively, the p-value can be computed as p = 1 − chi2cdf(t, ν).
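For readers working in Python rather than MATLAB, the same two quantities are available in scipy (an equivalence, not part of the original slides; the numbers below come from the following pages):

```python
from scipy import stats

# critical value: MATLAB chi2inv(1 - alpha, nu)  ==  scipy chi2.ppf(1 - alpha, nu)
t_crit = stats.chi2.ppf(0.99, 3)     # matches chi2inv(0.99, 3) on the next slide

# p-value: MATLAB 1 - chi2cdf(t, nu)  ==  scipy chi2.sf(t, nu)
p = stats.chi2.sf(491.449, 399)      # dataset D in Example 2
```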

7

Page 8

Critical values for the chi-square distribution

8

Abramowitz and Stegun, Handbook of Mathematical Functions (10th corrected printing, 1972)

[Table of critical values χ²α(ν): ν across the columns, α down the rows.]

In MATLAB: χ²crit = chi2inv(1−α, ν), e.g., chi2inv(0.99, 3) = 11.3448667301444

Page 9

Example 2: The galaxy/random fields again

This is a simple application of the chi-square test for histogram data.

To investigate clustering in the P1 data on an angular scale of θ ≈ 50 units (for example), one can divide the whole area (of size 1000×1000 units) into smaller squares of size 50×50 units, and count the number of points in each small square.

In this case the “histogram” bins are the small squares. Thus N = 9404 and n = (1000/50)² = 400.

Under the null hypothesis, the expected number of points in each bin is Ei = N/n = 23.51.

Oi is the number of points in the ith small square (i = 1, 2, ..., n).

Result, using p = 1 – chi2cdf(t, 399):

9

Dataset    t          p
A          420.585    0.219392
B          397.957    0.505330
C          391.917    0.590414
D          491.449    0.001061
E          420.670    0.218533
F          428.412    0.149149
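The computation behind this table can be sketched as follows. The P1 datasets themselves are not reproduced here, so synthetic uniform points stand in for one dataset; the field and cell sizes follow the slide.

```python
import numpy as np
from scipy import stats

def grid_chisquare(xy, field=1000.0, cell=50.0):
    """Chi-square test for uniformity: count points in cell x cell squares
    and compare the counts with the uniform expectation E = N/n."""
    edges = np.arange(0.0, field + cell, cell)
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
    O = counts.ravel()
    n = O.size                      # number of bins (small squares) = 400
    E = len(xy) / n                 # expected number per square under H0
    t = np.sum((O - E) ** 2 / E)
    p = stats.chi2.sf(t, n - 1)     # nu = n - 1 degrees of freedom
    return t, p

# synthetic stand-in for one dataset: 9404 uniformly distributed points
rng = np.random.default_rng(42)
t, p = grid_chisquare(rng.uniform(0.0, 1000.0, size=(9404, 2)))
```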

Page 10

Example 3: A chi-square test for the Poisson distribution

The table shows the number of soldiers kicked to death by a horse, in 14 different cavalry corps of the Prussian army, each year from 1875 to 1894 (Bortkiewicz 1898).

The number of cases is 144 + 91 + 32 + 11 + 2 = 280 and the total number of deaths is 144·0 + 91·1 + 32·2 + 11·3 + 2·4 = 196.

If the accidents are independent, the corps all have the same size, and the risk is constant over the corps and years, the numbers should follow the Poisson distribution with constant parameter λ (= mean number per year). The null hypothesis is that the counts ~ Pois(λ) for some λ (m = 1).

The MLE of λ is the sample mean, 196/280 = 0.700. The expected counts are:

n = 6 (number of bins)
O = [144, 91, 32, 11, 2, 0]
E = [139.04, 97.33, 34.07, 7.95, 1.39, 0.22]

ν = 4 (degrees of freedom) ⇒ critical value for α = 0.1 is 7.78 ⇒ H0 cannot be rejected
(the p-value is 1 − chi2cdf(2.37, 4) = 0.668055402204685 > α ⇒ H0 cannot be rejected)

10

Deaths per year    Number of cases
0                  144
1                   91
2                   32
3                   11
4                    2
≥ 5                  0

⇒ t = 2.37
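The numbers above can be reproduced with a short script (scipy used here in place of the slides' MATLAB):

```python
import numpy as np
from scipy import stats

O = np.array([144, 91, 32, 11, 2, 0])        # cases with 0,1,2,3,4,>=5 deaths
N = O.sum()                                   # 280 corps-years
lam = (O[:5] * np.arange(5)).sum() / N        # MLE of lambda: 196/280 = 0.7

pk = stats.poisson.pmf(np.arange(5), lam)     # P(0), ..., P(4) under Pois(lam)
probs = np.append(pk, 1.0 - pk.sum())         # last bin collects >= 5
E = N * probs                                 # expected counts

t = np.sum((O - E) ** 2 / E)                  # chi-square statistic, ~2.37
nu = len(O) - 1 - 1                           # n - m - 1: one fitted parameter
p = stats.chi2.sf(t, nu)                      # ~0.67, so H0 is not rejected
```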

Page 11

Parametric and non-parametric tests

Parametric test:

This is based on a parameterized family of models (using a fixed number of parameters). Uncertainties are assumed to be understood (e.g. that errors are Gaussian).

Method: Estimate the parameters of the model, calculate a test statistic, and hence the p-value.

Example: Chi-square goodness-of-fit of a model to data

Non-parametric (distribution-free) test:

Does not require a parameterized model of the distribution of the tested variables. The test typically compares the data to a given (fixed) distribution, or compares different sets of data.

Method: Calculate a statistic, and hence the p-value.

Example: Kolmogorov-Smirnov test

11

Page 12

Chi-square goodness-of-fit test (parametric)

Recall that the chi-square distribution χ²(ν) is the distribution of the sum of the squares of ν independent, centred, unit normal variables: if z1, …, zν ~ N(0,1) independently, then z1² + … + zν² ~ χ²(ν).

The chi-square distribution is therefore useful to test whether a given model fits the data as well as can be expected, under the assumption that the errors in the data are Gaussian with known σ.

Let xi (i = 1, …, n) be the data and μi(θ) the model fitted to the data. If each data point has a standard uncertainty σi, we may compute the goodness-of-fit statistic.

The parameter vector θ can be estimated by minimizing t (chi-square fitting, see Ch. 6).

Under certain regularity conditions (that the model is non-degenerate and not strongly non-linear) it is found that the minimum t is chi-square distributed with ν = n − dim(θ) degrees of freedom.

Thus the null hypothesis H0 (the model fits the data) can be rejected at significance level α if t exceeds the critical value (see table on p. 8).

12

For data xi (i = 1, …, n) with model values μi(θ) and standard uncertainties σi, the goodness-of-fit statistic is

t = Σ_{i=1}^{n} [xi − μi(θ)]² / σi²
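A minimal sketch of the procedure in Python: a straight-line model with known Gaussian errors is fitted by minimizing t, and the minimum is then compared with χ²(n − dim(θ)). The model, noise level, and data below are invented for illustration.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
sigma = np.full(x.size, 0.5)                    # known standard uncertainties
y = 2.0 + 0.3 * x + rng.normal(0.0, 0.5, x.size)

def t_of(theta):
    """Goodness-of-fit statistic t = sum((y_i - mu_i(theta))^2 / sigma_i^2)."""
    a, b = theta
    return np.sum((y - (a + b * x)) ** 2 / sigma ** 2)

res = optimize.minimize(t_of, x0=[0.0, 0.0])     # chi-square fitting
t_min = res.fun
nu = x.size - 2                                  # nu = n - dim(theta)
p = stats.chi2.sf(t_min, nu)                     # model rejected if p < alpha
```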

Page 13

The histogram chi-square test (p. 7) can be used to test if data follow a given distribution. A disadvantage of the method is that the histogram is not a unique statistic of the data (it depends on the choice of bin sizes, etc.), and the outcome of the test is often sensitive to these choices.

The K-S test is a very popular alternative for one-dimensional, continuous data, because it does not require binning. There are two different situations in which the K-S test can be used:

1. Is the data set {xi} (i = 1, ..., n) consistent with the given cdf F (x)?

2. Are the two independent data sets {xi} (i = 1, ..., n) and {xj*} (j = 1, ..., m) mutually consistent (i.e., are they both consistent with the same, but unknown, cdf)?

The K-S test uses the statistic D = max |Sn(x) – F(x)| (case 1) or D = max |Sn(x) – Rm(x)| (case 2). Sn(x) and Rm(x) are the cumulative fractions of the samples [Sn(x) = fraction of data < x, etc.].

Important features of the K-S test are:
• no binning of the data is required
• it is invariant under (monotonic) transformations of the variable x
• it is not so sensitive to differences in the wings of the distributions (this could be good or bad)
• it is “distribution free”: under the null hypothesis, the distribution of D is independent of F(x)
• two-dimensional variants of it have been formulated, but are less straightforward to use.

Kolmogorov-Smirnov (K-S) test

13

Page 14

Case 1: Test if the data {xi} (i = 1...n) follow the given pdf f (x) ⇔ cdf F(x)

Test statistic: D = max |Sn(x) – F(x)|, where Sn(x) = cumulative fraction

K-S test (one-sample test)

[Figure: model cdf F(x) and empirical cumulative fraction Sn(x), with the maximum difference D marked.]

The distribution of D is approximately given by

P(D > observed) ≈ QKS( (√n + 0.12 + 0.11/√n) · D )

where

QKS(λ) = 2 Σ_{j=1}^{∞} (−1)^{j−1} exp(−2 j² λ²)

This is asymptotically accurate for large n, and in practice good enough if n ≳ 20.
At significance level α, H0 is rejected if QKS( (√n + 0.12 + 0.11/√n) · D ) < α.

MATLAB function: h = kstest(x,CDF,alpha)
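In Python, scipy.stats.kstest performs the same one-sample test (the normal sample below is synthetic, for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)

# D = max |S_n(x) - F(x)| against the standard normal cdf
result = stats.kstest(x, stats.norm.cdf)
D, p = result.statistic, result.pvalue
```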

14

Page 15

K-S test (one-sample test) - An important remark!

The K-S test assumes that the data are tested against a fixed distribution F(x).

In many real situations, the distribution depends on some parameters θ, which may be adjusted to fit the data (e.g., by minimizing D). That is, the test statistic is

D* = min_θ max_x | Sn(x) − F(x; θ) |

The distribution of D* is not the same as for D. That is, the significance (p-value) is no longer given by the previous formula involving QKS.

In such cases it is recommended to use Monte Carlo experiments (Ch. 9) to find the empirical distribution of D* and hence the p-value of the test.

Monte Carlo experiments are of course very useful in many other cases as well, for example when the theoretical distribution of the test statistic is not known, or only approximately known (e.g., the K-S test with n < 20), or too difficult to derive.
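A sketch of such a Monte Carlo calibration, for the common case of a normal distribution whose mean and standard deviation are estimated from the data (a Lilliefors-type test; the function names are illustrative). For the normal family D* is invariant to location and scale, so simulating N(0,1) samples under H0 suffices.

```python
import numpy as np
from scipy import stats

def dstar(sample):
    """K-S statistic after fitting the normal parameters to the sample."""
    mu, sd = sample.mean(), sample.std(ddof=1)
    return stats.kstest(sample, stats.norm(mu, sd).cdf).statistic

def mc_pvalue_dstar(x, n_sims=500, seed=None):
    """Empirical p-value of D*: simulate samples satisfying H0, refit the
    parameters for each, and compare the simulated D* with the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d_obs = dstar(x)
    d_sim = np.array([dstar(rng.normal(0.0, 1.0, x.size))
                      for _ in range(n_sims)])
    return (1 + np.sum(d_sim >= d_obs)) / (n_sims + 1)
```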

15

Page 16

Case 2: Test if the two samples (of size n and m) follow the same (but unknown) pdf

Statistic: D = max |Sn(x) – Rm(x)|, where Sn(x), Rm(x) = cumulative fractions

K-S test (two-sample test)

[Figure: cumulative fractions Sn(x) and Rm(x) versus x, with the maximum difference D marked. Vertical axis: cumulative fraction from 0 to 1; horizontal axis: values of x from −5 to 10.]

The distribution of D is approximately given by

P(D > observed) ≈ QKS( (√ne + 0.12 + 0.11/√ne) · D ),   where ne = n m / (n + m) is the effective number of data points,

and QKS is the same function as in Case 1.

At significance level α, H0 is rejected if QKS( (√ne + 0.12 + 0.11/√ne) · D ) < α.

MATLAB function: h = kstest2(x1,x2,alpha) (no reference cdf is needed for the two-sample test)
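The Python equivalent is scipy.stats.ks_2samp (the two samples below are synthetic, for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x1 = rng.normal(0.0, 1.0, 100)   # sample of size n
x2 = rng.normal(0.0, 1.0, 80)    # sample of size m

# D = max |S_n(x) - R_m(x)|; no reference cdf is needed
result = stats.ks_2samp(x1, x2)
D, p = result.statistic, result.pvalue
```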

16