Hypothesis Testing


Page 1

Hypothesis Testing

Page 2

Statistical Inference – dealing with parameter and model uncertainty:

Confidence Intervals (credible intervals)

Hypothesis Tests

Goodness-of-fit

Model Selection (AIC)

Model averaging

Bayesian Model Updating

Page 3

Statistical Testing of Hypotheses

Objective: determine whether parameters differ from hypothesized values.

The testing procedure is framed as a comparison of null and alternative hypotheses:

Null hypothesis: $H_0: \theta = \theta_0$

Alternative hypothesis: $H_a: \theta \neq \theta_0$

Compound (1-sided) alternatives: $H_a: \theta > \theta_0$ or $H_a: \theta < \theta_0$

Page 4

Procedure for Null Hypothesis Testing

Specify the null and alternative hypotheses.

Compute the test statistic: a random variable that summarizes the sampling distribution expected if the null hypothesis is true (e.g., the difference between sample means for two groups when the true means are equal).

Compare the observed value of the statistic to this distribution. The test is a binary decision at significance level α.

Two types of incorrect decisions:

Rejecting H0 when it is true (Type I error), Pr = α

Not rejecting H0 when it is false (Type II error), Pr = β

Power of the test = 1 − β
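As an illustration of the full procedure, here is a minimal sketch in Python; the data, the choice of a two-sample t test, and α = 0.05 are all made up for this example:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two groups (made-up data)
group1 = np.array([4.1, 3.8, 5.0, 4.6, 4.3, 3.9, 4.8])
group2 = np.array([5.2, 4.9, 5.5, 5.1, 4.7, 5.4, 5.0])

# H0: the two group means are equal; Ha: they differ (two-sided)
t_stat, p_value = stats.ttest_ind(group1, group2)

alpha = 0.05  # significance level, chosen before looking at the data
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f}: {decision}")
```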

Page 5

P-values

Probability of obtaining a test statistic at least as extreme as the observed one, given that the null hypothesis is true.

Not Pr(the null hypothesis is true).

A measure of the consistency of the data with the null, not of the strength of evidence for the alternative.

Dependent on the null hypothesis (if the null is that the groups differ by 1 rather than 0, the p-value will be different).

Dependent on sample size.

Does not provide information on the size or precision of the estimated effect (i.e., it is not a measure of biological relevance, and it is not a confidence interval).
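The dependence on sample size is easy to see by simulation. In this sketch (simulated normal data; the fixed true effect of 0.3 SD is an arbitrary choice), the same true effect yields very different p-values as n grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
effect = 0.3  # fixed true difference between group means (in SD units)

# Same true effect, increasing sample size: the p-value shrinks with n
for n in (10, 50, 200, 1000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    p = stats.ttest_ind(a, b).pvalue
    print(f"n = {n:4d}: p = {p:.4f}")
```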

Page 6

Reality vs. conclusion:

We don't reject H0 (the null hypothesis):

If H0 is true, Ha false: probability 1 − α (e.g., 0.95). Correct decision: 95/100 times when there is no effect, we'll correctly say there is no effect.

If H0 is false, Ha true: probability β (e.g., 0.20). Type II error: saying there is no difference when there really is one. 20/100 times when there is an effect, we'll say there is no effect.

We reject H0, accept Ha (the alternative hypothesis):

If H0 is true, Ha false: probability α (e.g., 0.05). Type I error: saying there is a difference when there is no difference. 5/100 times when there is no effect, we'll say there is one.

If H0 is false, Ha true: probability 1 − β (e.g., 0.80). POWER: saying there is a difference when there is one. 80/100 times when there is an effect, we'll say there is one.

Page 7

Comments: Lower α, lower power; higher α, higher power.

Lower α is conservative in terms of rejecting the null when it's true (i.e., less prone to saying there's an effect when there really isn't).

Higher α increases the chance of a Type I error, decreases the chance of a Type II error, and decreases the rigor of the test.
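A quick Monte Carlo check of this trade-off (simulated normal data; the sample size, effect size, and replicate count are arbitrary illustrative choices): lowering α reduces the Type I error rate but also reduces power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, effect, reps = 25, 0.5, 2000

for alpha in (0.01, 0.05, 0.10):
    type1 = power = 0
    for _ in range(reps):
        a = rng.normal(0, 1, n)
        b_null = rng.normal(0, 1, n)      # H0 true: no difference in means
        b_alt = rng.normal(effect, 1, n)  # H0 false: true difference = effect
        type1 += stats.ttest_ind(a, b_null).pvalue < alpha
        power += stats.ttest_ind(a, b_alt).pvalue < alpha
    print(f"alpha = {alpha:.2f}: Type I rate ~ {type1 / reps:.3f}, "
          f"power ~ {power / reps:.3f}")
```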

Page 8

Sample Design: Choosing a Sample Size

Can choose based on a target precision level (e.g., confidence interval width) or power (hypothesis testing).

Requires assumptions and tentative parameter values (e.g., effect size); therefore it is an exercise in approximation.

Might identify cases where the minimal sufficient sample size would bust the budget or be logistically impractical to achieve.
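As a sketch of power-based sample size choice: the standard normal-approximation formula for a two-sided, two-sample comparison of means gives $n$ per group $\approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2$. The δ and σ below are exactly the "tentative parameter values" that must be assumed.

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test of means."""
    z_a = norm.ppf(1 - alpha / 2)  # critical value for the Type I error rate
    z_b = norm.ppf(power)          # quantile for the desired power (1 - beta)
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

# Tentative values: detect a difference of 0.5 units when sigma = 1
print(n_per_group(delta=0.5, sigma=1.0))  # ~63 per group
```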

Page 9

Likelihood Ratio Tests

Compare the fit of a hypothesized model to another model (generally containing more parameters): a null model vs. an alternative model with additional parameters.

Based on maximum likelihood estimation theory: evaluate the MLE under the restricted and the more general parameterizations, then calculate the likelihood ratio.

The statistic is chi-square distributed, with degrees of freedom equal to the difference in the number of parameters between the models:

$$\chi^2 = -2 \log_e \left[ \frac{L(\hat{\theta}_0 \mid x)}{L(\hat{\theta}_a \mid x)} \right]$$
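A minimal sketch of a likelihood ratio test (made-up Poisson count data; the null model has one common rate, the alternative a separate rate per group, so df = 1):

```python
import numpy as np
from scipy import stats

# Made-up counts for two groups
x1 = np.array([3, 5, 4, 6, 2, 5])
x2 = np.array([7, 8, 6, 9, 7, 8])

def poisson_loglik(x, lam):
    """Poisson log-likelihood of data x at rate lam."""
    return np.sum(stats.poisson.logpmf(x, lam))

# Restricted (null) model: one common rate; the MLE is the pooled mean
pooled = np.concatenate([x1, x2])
ll_null = poisson_loglik(pooled, pooled.mean())

# General (alternative) model: a separate rate for each group
ll_alt = poisson_loglik(x1, x1.mean()) + poisson_loglik(x2, x2.mean())

# -2 log likelihood ratio ~ chi-square, df = difference in parameter count
lr = -2 * (ll_null - ll_alt)
df = 2 - 1
p = stats.chi2.sf(lr, df)
print(f"LR = {lr:.2f}, df = {df}, p = {p:.4f}")
```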

Page 10

Goodness of Fit (GOF)

"Absolute" fit of the model.

The goal is to determine whether the data are reflective of the statistical model.

The test statistic is generated from the probability model, using the estimated parameters.

Is there variation in the data that is out of the ordinary and not reflected in our statistical model?

Page 11

Pearson's $\chi^2$ GOF Test

Logic: if the model is "correct", expected and observed frequencies for each multinomial cell should be similar.

Imagine we roll a die 1000 times and want to determine whether P(x=1) = P(x=2) = … = P(x=6) = 1/6 is a good model.

If the sample size is adequate (expect at least 2 per cell),

$$\sum_i \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} \sim \chi^2(\text{df} = \#\text{cells} - 1)$$
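For the die example, a sketch (simulated rolls; scipy.stats.chisquare defaults to equal expected frequencies, which is exactly the null model here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=1000)           # 1000 simulated fair-die rolls
observed = np.bincount(rolls, minlength=7)[1:]  # counts for faces 1..6

# Pearson chi-square GOF against H0: all faces equally likely
chi2_stat, p = stats.chisquare(observed)        # df = 6 cells - 1 = 5
print(f"chi2 = {chi2_stat:.2f}, p = {p:.4f}")
```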

Page 12

General GOF with Large Samples

Pearson's $\chi^2$

Direct use of the deviance:

$$\text{deviance} = -2 \log_e \left[ \frac{L(\hat{\theta}_0 \mid x)}{L(\hat{\theta}_{\text{saturated}} \mid x)} \right]$$

Page 13

Bootstrap GOF Test

Compute ML estimates for the parameters, then produce an empirical distribution of the test statistic:

Simulate capture histories for each released animal: assume parameter = MLE, "flip coins" to determine survival and capture for each period.

Repeat for {Ri} animals, estimate the parameters, and compute the deviance.

Compare the original deviance with the empirical distribution (i.e., at what percentile does it fall?).
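The capture-history details are model-specific, but the generic logic of the parametric bootstrap GOF test can be sketched with a simple stand-in model (a one-rate Poisson here; all values are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
observed = rng.poisson(4.0, size=50)  # stand-in for the real data set

def deviance_one_rate(x):
    """Deviance of the one-rate Poisson model vs. the saturated model."""
    ll_model = np.sum(stats.poisson.logpmf(x, x.mean()))
    ll_sat = np.sum(stats.poisson.logpmf(x, np.maximum(x, 1e-12)))
    return -2 * (ll_model - ll_sat)

dev_obs = deviance_one_rate(observed)

# Parametric bootstrap: simulate from the fitted model (parameter = MLE),
# refit each simulated data set, and collect its deviance
boot = np.array([deviance_one_rate(rng.poisson(observed.mean(),
                                               size=observed.size))
                 for _ in range(1000)])

# Where does the observed deviance fall in the empirical distribution?
print(f"P(simulated deviance >= observed) ~ {np.mean(boot >= dev_obs):.3f}")
```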

Page 14

What indicates lack of fit?

With a GOF test, the hope and purpose is to accept the null hypothesis.

This is counter to the usual logic of statistical hypothesis testing.

What is a "significant" P-value in this setting?

Page 15

What might cause lack of fit?

Inadequate model structure for detection or survival, e.g.:

Age dependence, size dependence, etc.

Trap dependence

Those released earlier survive at a different rate

Non-random temporary emigration

Lack of independence among animals

Page 16

Solutions

Inadequate model structure? Improve it.

Goal: subdivide animals sufficiently that p and S are equal within a group.

Warning: inadequate model structure doesn't always result in lack of fit, e.g.:

Permanent emigration (confounded with S)

Random temporary emigration (confounded with p)

Random ring loss (confounded with S)

Lack of independence? Correct for overdispersion: inflate variances using quasi-likelihood.

Page 17

Adjusting Variances for Overdispersion

Based on quasi-likelihood theory:

c-hat = deviance / df

adjusted variance = c-hat × (ML variance)
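In code form (the deviance, df, and ML variance below are placeholder numbers, not from a real model):

```python
deviance, df = 245.3, 180  # placeholder: model deviance and its df
ml_variance = 0.0012       # placeholder: ML variance of some estimate

c_hat = deviance / df               # quasi-likelihood inflation factor
adj_variance = c_hat * ml_variance  # adjusted variance
adj_se = adj_variance ** 0.5        # standard errors scale by sqrt(c-hat)
print(f"c-hat = {c_hat:.2f}, adjusted variance = {adj_variance:.6f}")
```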

Page 18

Bootstrap Adjustment for Overdispersion

For each simulated sample:

compute the deviance

compute c-hat = deviance / df

Bootstrap c-hat = (observed deviance) / (mean simulated deviance), or (observed c-hat) / (mean simulated c-hat)

Note: one could replace the deviance with Pearson's $\chi^2$, or the mean with the median.
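A sketch of the bootstrap c-hat computation (placeholder deviance values; in practice they come from the simulated samples described above):

```python
import numpy as np

# Placeholders: observed deviance and deviances from the simulated samples
dev_obs = 62.4
boot_devs = np.array([48.2, 51.7, 45.9, 55.3, 49.8, 52.1, 47.4, 50.6])

c_hat = dev_obs / boot_devs.mean()          # observed / mean simulated deviance
c_hat_med = dev_obs / np.median(boot_devs)  # median is less outlier-sensitive
print(f"bootstrap c-hat = {c_hat:.2f} (median-based: {c_hat_med:.2f})")
```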