1
Hypothesis testing and confidence intervals
Colm O’Dushlaine 12/08
2
Overview
Scientific questions and statistical hypotheses
Null vs. Alternate Hypotheses
Deciding on your “measurement” and test statistic
Chi-squared distribution
Visualising a HT as a probability distribution
Significance: what is an “exceptional outcome”?
Critical values, rejection regions
Choosing alpha and discussion of significance
Two-sided vs. one-sided tests
P-value vs. CI
Limitations and criticisms
3
Recap from last talk...
Point estimates
  Terms and definitions
Sampling distribution of the mean
  Basic concept of a sample and inferences we can make about population at large
Standard error of the mean
  How it relates to C.I.
Confidence intervals
  How they relate to sample estimates about the population
t distribution
  When used and why
Odds ratios
  What they are
  Relationship with relative risk
Correlation
  Uses, misuses
4
Statistical inference in general
Statistical inference: The process of drawing conclusions about a population based on information in a sample
Unlikely to see this published:
“In our study of a new antihypertensive drug we found an effective 10% reduction in blood pressure for those on the new therapy. However, the effects seen are only specific to the subjects in our study. We cannot say this drug will work for hypertensive people in general”
2 methods (frequentist) for statistical inference:
  Hypothesis tests
  Confidence intervals
5
Hypothesis test - wikipedia…
A statistical hypothesis test is a method of making statistical decisions using experimental data. It is sometimes called confirmatory data analysis, in contrast to exploratory data analysis
Null-hypothesis tests: “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”
6
Some of those responsible…
Fisher
  Originally coined phrase “null hypothesis”
  Founded “population genetics”
  Staunch eugenicist. Wanted to introduce forced sterilisation
Pearson (and son)
  Founded “mathematical statistics”
  World’s 1st stats dept in UCL
  Aggressive eugenicist!
  Committed socialist, refused an OBE
Neyman
  Just a regular academic...not a eugenicist!
  39 PhD students
7
Hypothesis testing
Used to investigate the validity of a claim about the value of a population characteristic
For example: the mean potency of a batch of tablets is 500 mg per tablet, i.e. µ0 = 500 mg
8
PROCEDURE
1. Specify Null and Alternative hypotheses
2. Specify test statistic
3. Define what constitutes an exceptional outcome
4. Calculate test statistic and determine whether or not to reject the Null Hypothesis
9
Step 1 – specify Null and Alternative
Specify the hypothesis to be tested and the alternative that will be decided upon if this is rejected
The hypothesis to be tested is referred to as the Null Hypothesis (labelled H0)
The alternative hypothesis is labelled Ha
For the earlier example this gives:
$$H_0: \mu = 500\ \text{mg}$$
$$H_a: \mu \neq 500\ \text{mg}$$
10
Step 1 (ctd.)
The Null Hypothesis is assumed to be true unless the data clearly demonstrate otherwise
12
Step 2 - specify a test statistic
Specify a test statistic which will be used to measure departure from H0: µ = µ0, where µ0 is the value specified under the Null Hypothesis, e.g. µ0 = 500 in this example
For hypothesis tests on sample means, the test statistic is:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
13
General format of test statistic
$$\text{test statistic} = \frac{\text{observed value} - \text{hypothesised value}}{\text{SE of hypothesised value}}$$
14
Step 2 (ctd.)
The test statistic
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
is a ‘signal to noise ratio’, i.e. it measures how far x̄ is from µ0 in terms of standard error units
The t distribution with df = n-1 describes the distribution of the test statistic if the Null Hypothesis is true
In this example, the test statistic t has a t distribution with df = 25
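As a minimal sketch of the statistic on this slide, the Python below computes t and its degrees of freedom from a raw sample. The function name `one_sample_t` and the five potency values are hypothetical and chosen only for illustration; they are not data from the talk.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t test of H0: mu = mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return (mean - mu0) / se, n - 1

# Hypothetical tablet potencies in mg (illustration only).
potencies = [498.0, 502.0, 495.0, 499.0, 501.0]
t, df = one_sample_t(potencies, mu0=500.0)
print(f"t = {t:.4f}, df = {df}")
```

The numerator is the "signal" (distance of the sample mean from µ0) and the denominator is the "noise" (the standard error), so t is expressed in standard error units.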
16
Step 3 - define what constitutes an exceptional outcome
Define what will be an exceptional outcome
  A value of the test statistic is exceptional if it has only a small chance of occurring when the null hypothesis is true
The probability chosen to define an exceptional outcome is called the significance level of the test and is labelled α
Conventionally (traditionally), an α of 0.05 is chosen
17
Step 3 - define what constitutes an exceptional outcome (ctd.)
α = 0.05 gives cut-off values on the sampling distribution of t called critical values
  Values of the test statistic t lying beyond the critical values lead to rejection of the null hypothesis
For this example the critical value for a t distribution with df = 25 is 2.06
18
Student’s t distribution
Closely related to the standard normal distribution Z
  Symmetric and bell-shaped
  Has mean = 0 but has a larger standard deviation
Exact shape depends on a parameter called degrees of freedom (df) which is related to sample size
recap
[Figure: overlay plot of density against t quantile (-5 to 5) for the standard normal, t (df = 3) and t (df = 10) distributions. Critical values lie within the tails of the distribution.]
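The "larger standard deviation" of the t distribution can be made concrete with the standard result that, for df > 2, its variance is df / (df - 2). The short sketch below evaluates this for a few values of df; the helper name `t_sd` is hypothetical.

```python
import math

def t_sd(df):
    """Standard deviation of Student's t; finite only for df > 2."""
    if df <= 2:
        raise ValueError("standard deviation is not finite for df <= 2")
    return math.sqrt(df / (df - 2))

sds = {df: t_sd(df) for df in (3, 10, 25, 100)}
for df, sd in sds.items():
    print(f"df = {df:>3}: sd = {sd:.4f}")
```

The standard deviation shrinks towards 1 (the standard normal value) as df grows, which is why the t curves with small df have visibly heavier tails.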
20
Step 4 - calculate test statistic
Calculate the test statistic and see if it lies in the critical region
For the example:
$$t = \frac{490.096 - 500}{10.783 / \sqrt{26}} = -4.683$$
t = -4.683 is < -2.06 (the critical value from Step 3) so the hypothesis that the batch potency is 500 mg/tablet is rejected
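The arithmetic of this step can be checked with a few lines of Python. The numbers (sample mean 490.096, standard deviation 10.783, n = 26, critical value 2.06) are the ones quoted in the worked example; the variable names are only illustrative.

```python
import math

# Summary values from the worked example.
x_bar, mu0, s, n = 490.096, 500.0, 10.783, 26
t_crit = 2.06   # two-sided critical value for df = 25, alpha = 0.05

se = s / math.sqrt(n)            # standard error of the mean
t = (x_bar - mu0) / se           # test statistic
reject_h0 = abs(t) > t_crit      # two-sided decision rule

print(f"SE = {se:.4f}, t = {t:.3f}, reject H0: {reject_h0}")
```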
21
P value
The P value associated with a hypothesis test is the probability of getting sample values as extreme or more extreme than those actually observed, assuming the null hypothesis to be true
22
Example (ctd.)
P value = probability of observing a more extreme value of t, given that the null hypothesis is true
The observed t value was -4.683, so the P value is the probability of getting a value more extreme than ± 4.683
This P value is calculated as the area under the t distribution below -4.683 plus the area above 4.683, i.e., 0.00008474
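The tail-area calculation described above can be sketched without a statistics library by integrating the t density numerically. This is only an illustration under stated assumptions: the helper names are hypothetical, and composite Simpson's rule is my choice of integration method, not something specified in the talk. The result should land close to the value quoted on the slide.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def two_sided_p(t, df):
    """Area under the t density below -|t| plus the area above +|t|."""
    central = simpson(lambda x: t_pdf(x, df), -abs(t), abs(t))
    return 1.0 - central

p = two_sided_p(-4.683, 25)
print(f"two-sided P value = {p:.8f}")
```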
23
Example (ctd.)
Less than 1 in 10,000 chance of observing a value of t more extreme than -4.683 if the Null Hypothesis is true
Evidence in favour of the alternative hypothesis is very strong
24
P value (ctd.)
[Figure: overlay plot of the t distribution (df = 25) with the p-value area marked in the tails beyond -4.683 and 4.683.]
25
Limitations of the p-value
NO REFERENCE TO MAGNITUDE
Approaches based on estimation and C.I.’s are therefore generally considered to be superior
Promotes publication bias: only articles with significant p-values are considered interesting
N.B. Freiman (1978): 71 ‘negative’ clinical trial results (p>0.1)…constructed CI’s…found that about half were compatible with a 50% improvement. Therefore, potentially of clinical significance after all!!
26
Two-tail and one-tail tests
The test described in the previous example is a two-tail test
The null hypothesis is rejected if either an unusually large or unusually small value of the test statistic is obtained, i.e. the rejection region is divided between the two tails
27
One-tail tests
Reject the null hypothesis only if the observed value of the test statistic is
  Too large (upper-tail test), or
  Too small (lower-tail test)
In both cases the critical region is entirely in one tail so the tests are one-tail tests, e.g. 1-sided t-test
Generally **not appropriate**: perhaps justified when there are strong prior expectations, e.g. a new version of a treatment will be better than the old version but certainly not worse…but we don’t really know even this without doing an experiment!
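A small numerical sketch shows why the choice of tails matters. It uses a standard normal (z) statistic rather than t purely to keep the code short, and the observed value z = 1.8 is hypothetical. The same result is "significant" at α = 0.05 in a one-tail test but not in a two-tail test.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8  # hypothetical observed statistic, in the predicted direction
one_sided = normal_upper_tail(z)            # all of alpha in the upper tail
two_sided = 2 * normal_upper_tail(abs(z))   # alpha split between both tails

print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
```

Because the one-sided P value is half the two-sided one whenever the effect is in the predicted direction, choosing a one-tail test after seeing the data makes borderline results look stronger than they are.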
28
Statistical versus practical significance
When we reject a null hypothesis it is usual to say the result is statistically significant at the chosen level of significance
But we should also always consider the practical significance of the magnitude of the difference between the estimate (of the population characteristic) and the value stated by the null hypothesis
29
Statistical versus practical significance (ctd.)
Rejection of the null hypothesis at some effect size has no bearing on the practical significance at the observed effect size
A statistically significant finding may not be relevant in practice due to other, larger effects of more concern, while a true effect of practical significance may not appear statistically significant if the test lacks the power (Ricardo) to detect it, e.g. due to sample size deficits
Appropriate specification of both the hypothesis and the test of said hypothesis is therefore important to provide inference of practical utility
30
Confidence Interval
A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence (the confidence level), the value of the characteristic will be captured inside the interval
e.g. we claim with 95% confidence that the population mean lies between 15.6 and 17.2
recap
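As a hedged illustration of how an interval is built, the sketch below constructs a 95% confidence interval as mean ± critical value × standard error. It reuses the tablet-potency numbers from the Step 4 slide (mean 490.096, s = 10.783, n = 26, critical value 2.06); applying them here is my choice for illustration, not an example taken from this slide.

```python
import math

# Summary values from the tablet-potency example.
x_bar, s, n = 490.096, 10.783, 26
t_crit = 2.06    # two-sided critical value for df = 25, 95% confidence
mu0 = 500.0      # value claimed under the null hypothesis

se = s / math.sqrt(n)
lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
contains_mu0 = lower <= mu0 <= upper

print(f"95% CI: ({lower:.2f}, {upper:.2f}); contains 500: {contains_mu0}")
```

The interval excludes 500 mg, which agrees with the two-sided test at α = 0.05 rejecting the null hypothesis, and it additionally shows how far below 500 mg the mean potency plausibly lies.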
31
Estimation vs. Hypothesis testing
P-value does not quantify the effect, and quantification is essential, e.g. blood pressure: how much reduced? How consistent an effect? A 30% reduction in all people is more important than a 50% reduction in some
Recall Freiman (1978) clinical trial analysis
Should present both, but the p-value can be estimated from the C.I., so the latter is more important
32
Trivial example of p-value weakness
Compared 217k tandem repeats
Mean length among variable repeats = 57.13, mean among invariant repeats = 48.43
t-test p<0.0005, Mann-Whitney p<0.0005
But scale of effect is very small
Significance reflects very large sample size and doesn’t really inform on scale of effect
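The dependence of the P value on sample size can be sketched with a two-sample z test for a fixed, tiny standardised difference. Everything here is an assumption for illustration: the effect size d = 0.02, the group sizes and the known unit variance are hypothetical and are not the tandem-repeat data.

```python
import math

def two_sample_z(d, n_per_group):
    """z for a standardised mean difference d, equal group sizes, unit variance."""
    return d * math.sqrt(n_per_group / 2)

def two_sided_p_from_z(z):
    """Two-sided P value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

d = 0.02  # hypothetical, practically negligible effect
results = {n: two_sided_p_from_z(two_sample_z(d, n)) for n in (100, 10_000, 100_000)}
for n, p in results.items():
    print(f"n per group = {n:>7}: p = {p:.6f}")

p_small = results[100]
p_large = results[100_000]
```

The effect is identical in every row; only the sample size changes, so the shrinking P value says nothing about the scale of the effect.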
33
Another example of p-value weakness
In addition to not describing scale of an effect, it doesn’t describe the consistency of an effect
Drug A reduces migraines by 30%
Drug B reduces migraines by 50%, but only for some people. No effect on others
P-value will not give this information
34
Other criticisms
Meta:
  Criticism of the application, or of the interpretation, rather than of the method
Philosophical:
  What about borderline cases? “... surely, God loves the .06 nearly as much as the .05” (Rosnow, R.L. & Rosenthal, R. (1989))
  .05 is arbitrary and traditional: “It is usual and convenient for experimenters to take 5% as a standard level of significance...prepared to ignore all results which fail to reach this standard, and...eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results” (Fisher ‘56)
35
Other criticisms (ctd.)
Pedagogic:
  Null-hypothesis testing just answers the question of “how well the findings fit the possibility that chance factors alone might be responsible”
  Students expect hypothesis testing to be a statistical tool for illumination of the research hypothesis by the sample. It is not. The test asks indirectly whether the sample can illuminate the research hypothesis
Practical:
  Published test results are often contradicted. Mathematical models support the conjecture that most published medical research test results are flawed. Null-hypothesis testing has not achieved the goal of a low error probability in medical journals
36
So basically p-values are bad!
...or at least limited
Alternatives?
Non-frequentist Bayesian inference methods
Other, less biased frequentist approaches, e.g. Killeen (2005). "An alternative to null-hypothesis significance tests". Psychol Sci 16 (5): 345–53.
  Estimate the probability of duplicating a result
37
38
Some resources
http://davidmlane.com/hyperstat/
http://www.ats.ucla.edu/stat/stata/
Ricardo’s maths book collection
Ricardo, Eleisa, Carlos
I would say wikipedia, but...
39
Next week
Assumptions, error (Ricardo)