
1

Hypothesis testing and confidence intervals

Colm O’Dushlaine 12/08

2

Overview
- From scientific questions to statistical hypotheses
- Null vs. alternative hypotheses
- Deciding on your “measurement” and test statistic
- Chi-squared distribution
- Visualising a hypothesis test as a probability distribution
- Significance: what is an “exceptional outcome”?
- Critical values, rejection regions
- Choosing alpha and discussion of significance
- Two-sided vs. one-sided tests
- P-value vs. CI
- Limitations and criticisms

3

Recap from last talk...
- Point estimates: terms and definitions
- Sampling distribution of the mean: basic concept of a sample and the inferences we can make about the population at large
- Standard error of the mean: how it relates to the C.I.
- Confidence intervals: how they relate to sample estimates about the population
- t distribution: when it is used and why
- Odds ratios: what they are, relationship with relative risk
- Correlation: uses, misuses

4

Statistical inference in general

Statistical inference: The process of drawing conclusions about a population based on information in a sample

Unlikely to see this published:

“In our study of a new antihypertensive drug we found an effective 10% reduction in blood pressure for those on the new therapy. However, the effects seen are only specific to the subjects in our study. We cannot say this drug will work for hypertensive people in general”

2 methods (frequentist) for statistical inference:
- Hypothesis tests
- Confidence intervals

5

Hypothesis test - wikipedia…

A statistical hypothesis test is a method of making statistical decisions using experimental data. It is sometimes called confirmatory data analysis, in contrast to exploratory data analysis

Null-hypothesis tests: “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”

6

Some of those responsible…

Fisher
- Coined the phrase “null hypothesis”
- Founded “population genetics”
- Staunch eugenicist. Wanted to introduce forced sterilisation

Pearson (and son)
- Founded “mathematical statistics”
- World’s 1st stats dept, at UCL
- Aggressive eugenicist!
- Committed socialist, refused an OBE

Neyman
- Just a regular academic...not a eugenicist!
- 39 PhD students

7

Hypothesis testing

Used to investigate the validity of a claim about the value of a population characteristic

For example: the claim that the mean potency of a batch of tablets is 500 mg per tablet, i.e. µ0 = 500 mg

8

PROCEDURE
1. Specify Null and Alternative hypotheses
2. Specify test statistic
3. Define what constitutes an exceptional outcome
4. Calculate test statistic and determine whether or not to reject the Null Hypothesis

9

Step 1 – specify Null and Alternative
- Specify the hypothesis to be tested and the alternative that will be decided upon if this is rejected
- The hypothesis to be tested is referred to as the Null Hypothesis (labelled H0)
- The alternative hypothesis is labelled Ha
- For the earlier example this gives:

$H_0: \mu = 500\,\text{mg}$
$H_a: \mu \neq 500\,\text{mg}$

10

Step 1 (ctd.)

The Null Hypothesis is assumed to be true unless the data clearly demonstrate otherwise

12

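Step 2 - specify a test statistic
- Specify a test statistic which will be used to measure departure from $H_0: \mu = \mu_0$, where $\mu_0$ is the value specified under the Null Hypothesis, e.g. in this example $\mu_0 = 500$
- For hypothesis tests on sample means, the test statistic is:

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

where $\bar{x}$ is the sample mean, $s$ the sample standard deviation and $n$ the sample size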

13

General format of test statistic

$\text{test statistic} = \frac{\text{observed value} - \text{hypothesised value}}{\text{SE of observed value}}$

14

Step 2 (ctd.)

- The test statistic $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ is a ‘signal to noise ratio’, i.e. it measures how far $\bar{x}$ is from $\mu_0$ in terms of standard error units
- The t distribution with df = n-1 describes the distribution of the test statistic if the Null Hypothesis is true
- In this example, the test statistic t has a t distribution with df = 25
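The ‘signal to noise’ reading can be made concrete with a small calculation. This is a minimal sketch using only the Python standard library; the six potency values and the helper name `one_sample_t` are invented for illustration and do not come from the slides.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t-test of the mean against mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample SD (n - 1 in the denominator)
    se = s / math.sqrt(n)          # standard error of the mean: the "noise"
    t = (mean - mu0) / se          # "signal" measured in standard error units
    return t, n - 1


# Hypothetical potencies in mg, for illustration only
potencies = [498.0, 502.0, 495.0, 501.0, 497.0, 499.0]
t_stat, df = one_sample_t(potencies, mu0=500.0)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

A value of t near 0 means the sample mean sits within about one standard error of the hypothesised value; large positive or negative values mean it sits many standard errors away.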

16

Step 3 - define what constitutes an exceptional outcome

- Define what will be an exceptional outcome: a value of the test statistic is exceptional if it has only a small chance of occurring when the null hypothesis is true
- The probability chosen to define an exceptional outcome is called the significance level of the test and is labelled α
- Conventionally (traditionally), an α of 0.05 is chosen

17

Step 3 - define what constitutes an exceptional outcome (ctd.)

- α = 0.05 gives cut-off values on the sampling distribution of t called critical values
- Values of the test statistic t lying beyond the critical values lead to rejection of the null hypothesis
- For this example (a two-tail test) the critical values for a t distribution with df = 25 are ±2.06
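The 2.06 cut-off can be reproduced numerically. A minimal sketch, assuming only the Python standard library: since that library has no t distribution, `t_pdf`, `t_cdf` and `t_critical` are illustrative helpers written for this example (Simpson's rule plus bisection), not functions from any package.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |x|
    return 0.5 + area if x > 0 else 0.5 - area


def t_critical(alpha, df):
    """Two-sided critical value: the point leaving alpha/2 in each tail."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection works because the CDF is increasing
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


crit = t_critical(alpha=0.05, df=25)
print(f"Reject H0 if |t| > {crit:.2f}")
```

The result should agree with the tabulated 2.06 to two decimal places. In practice a statistics package would supply the quantile directly; the hand-rolled integration is only there to keep the sketch self-contained.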

18

Student’s t distribution (recap)
- Closely related to the standard normal distribution Z
- Symmetric and bell-shaped
- Has mean = 0 but a larger standard deviation than Z
- Exact shape depends on a parameter called degrees of freedom (df), which is related to sample size

[Figure: overlay plot of density against t quantile (-5 to 5) for the standard normal, t (df = 3) and t (df = 10). Critical values lie within the tails of the distribution]

20

Step 4 - calculate test statistic
- Calculate the test statistic and see if it lies in the critical region
- For the example:

$t = \frac{490.096 - 500}{10.783 / \sqrt{26}} = -4.683$

- t = -4.683 is < -2.06 (the critical value presented 3 slides back) so the hypothesis that the batch potency is 500 mg/tablet is rejected
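The arithmetic for the tablet example can be checked directly. A minimal sketch: the summary statistics (mean 490.096, standard deviation 10.783, n = 26) and the critical value 2.06 are the ones quoted on the slides; the variable names are chosen here for illustration.

```python
import math

# Summary statistics from the tablet example
x_bar = 490.096   # sample mean potency (mg)
s = 10.783        # sample standard deviation (mg)
n = 26            # sample size, so df = n - 1 = 25
mu0 = 500.0       # value claimed under the null hypothesis

se = s / math.sqrt(n)          # standard error of the mean
t_stat = (x_bar - mu0) / se    # distance from mu0 in standard error units

t_crit = 2.06                  # two-sided critical value for df = 25, alpha = 0.05
reject_null = abs(t_stat) > t_crit

print(f"SE = {se:.3f}, t = {t_stat:.3f}, reject H0: {reject_null}")
```

Comparing |t| with the critical value covers both tails at once, which matches the two-tail test used in the example.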

21

P value

The P value associated with a hypothesis test is the probability of getting sample values as extreme as or more extreme than those actually observed, assuming the null hypothesis to be true

22

Example (ctd.)

P value = probability of observing a more extreme value of t, given that the null hypothesis is true

The observed t value was -4.683, so the P value is the probability of getting a value more extreme than ± 4.683

This P value is calculated as the area under the t distribution below -4.683 plus the area above 4.683, i.e., 0.00008474
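The two tail areas can be approximated by numerical integration. A minimal sketch, assuming only the Python standard library: `t_pdf` and `two_sided_p` are helpers written for this example, and the integration uses Simpson's rule rather than a library routine.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_obs, df, steps=2000):
    """Area below -|t_obs| plus area above +|t_obs| under the t density."""
    a = abs(t_obs)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3       # area between 0 and |t_obs|
    return 2 * (0.5 - central_half)    # by symmetry both tails are equal


p_value = two_sided_p(-4.683, df=25)
print(f"two-sided p = {p_value:.2e}")
```

The printed value should be of the same order as the 0.00008474 quoted above, i.e. somewhat below 1 in 10,000.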

23

Example (ctd.)

Less than 1 in 10,000 chance of observing a value of t more extreme than -4.683 if the Null Hypothesis is true

Evidence in favour of the alternative hypothesis is very strong

24

P value (ctd.)

[Figure: t distribution (df = 25) with the p-value area marked in the tails below -4.683 and above 4.683]

25

Limitations of the p-value
- NO REFERENCE TO MAGNITUDE
- Approaches based on estimation and C.I.’s are therefore generally considered to be superior
- Promotes publication bias: only articles with significant p-values are considered interesting

N.B. Freiman et al. (1978) took 71 ‘negative’ clinical trial results (p>0.1), constructed CIs, and found that about half were compatible with a 50% improvement. So these could be of clinical significance after all!

26

Two-tail and one-tail tests

- The test described in the previous example is a two-tail test
- The null hypothesis is rejected if either an unusually large or unusually small value of the test statistic is obtained
- i.e. the rejection region is divided between the two tails

27

One-tail tests

- Reject the null hypothesis only if the observed value of the test statistic is too large (upper-tail test), or only if it is too small (lower-tail test)
- In both cases the critical region is entirely in one tail so the tests are one-tail tests, e.g. 1-sided t-test
- Generally **not appropriate**: perhaps justified when there are strong prior expectations, e.g. that a new version of a treatment will be better than the old version but certainly not worse…but we don’t really know even this without doing an experiment!
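The practical difference between the two kinds of test shows up in the cut-offs. A minimal sketch using the standard normal as a large-sample stand-in for the t distribution (an assumption made here to keep the code short); the observed statistic 1.8 is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: large-sample approximation to t
alpha = 0.05

two_tail_crit = z.inv_cdf(1 - alpha / 2)  # alpha split between both tails
one_tail_crit = z.inv_cdf(1 - alpha)      # all of alpha in the upper tail

z_obs = 1.8  # hypothetical observed test statistic
p_two = 2 * (1 - z.cdf(abs(z_obs)))       # extreme in either direction
p_one = 1 - z.cdf(z_obs)                  # extreme in the upper direction only

print(f"critical values: two-tail ±{two_tail_crit:.3f}, one-tail {one_tail_crit:.3f}")
print(f"p-values for z = {z_obs}: two-tail {p_two:.3f}, one-tail {p_one:.3f}")
```

The same observation is not significant at 0.05 in the two-tail test but is significant in the one-tail test, which is why the choice has to be justified before looking at the data.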

28

Statistical versus practical significance
- When we reject a null hypothesis it is usual to say the result is statistically significant at the chosen level of significance
- But we should also always consider the practical significance of the magnitude of the difference between the estimate (of the population characteristic) and the value stated by the null hypothesis

29

Statistical versus practical significance (ctd.)
- Rejection of the null hypothesis at some effect size has no bearing on the practical significance at the observed effect size
- A statistically significant finding may not be relevant in practice due to other, larger effects of more concern, while a true effect of practical significance may not appear statistically significant if the test lacks the power (Ricardo) to detect it, e.g. because the sample is too small
- Appropriate specification of both the hypothesis and the test of said hypothesis is therefore important to provide inference of practical utility

30

Confidence Interval

A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence (the confidence level), the value of the characteristic will be captured inside the interval

e.g. we claim with 95% confidence that the population mean lies between 15.6 and 17.2

recap
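For the tablet example used earlier, a 95% confidence interval follows from the same ingredients as the t-test. A minimal sketch: the summary statistics and the critical value 2.06 are those quoted on the earlier slides, while the interval itself is computed here and does not appear in the original deck.

```python
import math

# Summary statistics from the tablet example
x_bar, s, n = 490.096, 10.783, 26
t_crit = 2.06   # two-sided critical value for df = 25 at the 95% level
mu0 = 500.0     # value claimed under the null hypothesis

se = s / math.sqrt(n)      # standard error of the mean
margin = t_crit * se       # half-width of the interval
lower, upper = x_bar - margin, x_bar + margin

print(f"95% CI for mean potency: ({lower:.2f}, {upper:.2f}) mg")
print(f"Claimed value {mu0} inside interval: {lower <= mu0 <= upper}")
```

The interval gives the range of plausible mean potencies, and the claimed 500 mg falls outside it, which is the same conclusion the two-tail test reached at α = 0.05.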

31

Estimation vs. Hypothesis testing
- P-value does not quantify the effect, and quantifying is essential, e.g. blood pressure: how much is it reduced? How consistent is the effect? A 30% reduction in all people is more important than a 50% reduction in some
- Recall Freiman (1978) clinical trial analysis
- Should present both, but the p-value can be estimated from the C.I., so the latter is more important
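One way to see how a p-value can be recovered from a confidence interval: work back from the interval to the standard error, then form the test statistic. A minimal sketch, assuming the interval is a symmetric, normal-based 95% CI; the limits 15.6 and 17.2 are the ones from the confidence interval slide, and the null value 15.0 is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()               # standard normal
lower, upper = 15.6, 17.2      # the 95% CI quoted on the earlier slide
null_value = 15.0              # hypothetical null value, not from the slides

estimate = (lower + upper) / 2                   # centre of a symmetric interval
se = (upper - lower) / (2 * z.inv_cdf(0.975))    # half-width divided by about 1.96
z_stat = (estimate - null_value) / se            # test statistic against the null
p_value = 2 * (1 - z.cdf(abs(z_stat)))           # two-sided p-value

print(f"estimate = {estimate:.1f}, SE = {se:.3f}, z = {z_stat:.2f}, p = {p_value:.4f}")
```

The reverse is not possible: a p-value alone does not reveal the estimate or its precision, which is the sense in which the interval carries more information.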

32

Trivial example of p-value weakness
- Compared 217k tandem repeats
- Mean length among variable repeats = 57.13, mean among invariant repeats = 48.43
- t-test p<0.0005, Mann-Whitney p<0.0005
- But scale of effect is very small
- Significance reflects very large sample size and doesn’t really inform on scale of effect
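Why a very large sample produces a tiny p-value even for a small effect can be seen from the test statistic itself. A minimal sketch with invented numbers (a difference of 0.5 against a standard deviation of 10); none of these values come from the tandem repeat data.

```python
import math


def t_statistic(diff, sd, n):
    """Signal-to-noise ratio: a difference divided by its standard error."""
    return diff / (sd / math.sqrt(n))


# Hypothetical effect: a difference of one twentieth of a standard deviation
diff, sd = 0.5, 10.0
results = {n: t_statistic(diff, sd, n) for n in (25, 2_500, 250_000)}

for n, t in results.items():
    print(f"n = {n:>7}: t = {t:.2f}")
```

The effect is identical in every row. Only n changes, and the statistic grows with the square root of n, so significance here reflects sample size rather than the scale of the effect.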

33

Another example of p-value weakness
- In addition to not describing the scale of an effect, it doesn’t describe the consistency of an effect
- Drug A reduces migraines by 30%
- Drug B reduces migraines by 50%, but only for some people. No effect on others
- P-value will not give this information

34

Other criticisms

Meta:
- Criticism of the application, or of the interpretation, rather than of the method

Philosophical:
- What about borderline cases? “... surely, God loves the .06 nearly as much as the .05” (Rosnow, R.L. & Rosenthal, R. (1989))
- .05 is arbitrary and traditional: “It is usual and convenient for experimenters to take 5% as a standard level of significance...prepared to ignore all results which fail to reach this standard, and...eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results” (Fisher ‘56)

35

Other criticisms (ctd.)

Pedagogic:
- Null-hypothesis testing just answers the question of “how well the findings fit the possibility that chance factors alone might be responsible”
- Students expect hypothesis testing to be a statistical tool for illumination of the research hypothesis by the sample. It is not. The test asks indirectly whether the sample can illuminate the research hypothesis

Practical:
- Published test results are often contradicted
- Mathematical models support the conjecture that most published medical research test results are flawed
- Null-hypothesis testing has not achieved the goal of a low error probability in medical journals

36

So basically p-values are bad!

...or at least limited

Alternatives?

- Non-frequentist Bayesian inference methods
- Other, less biased frequentist approaches, e.g. Killeen (2005). "An alternative to null-hypothesis significance tests". Psychol Sci 16 (5): 345–53. Estimate the probability of duplicating a result

38

Some resources

- http://davidmlane.com/hyperstat/
- http://www.ats.ucla.edu/stat/stata/
- Ricardo’s maths book collection
- Ricardo, Eleisa, Carlos
- I would say wikipedia, but...

39

Next week

Assumptions, error (Ricardo)
