Part III The General Linear Model Chapter 9 Regression


Part III The General Linear Model

Chapter 9 Regression

GLM, applied to regression

• Example 9.3.1 from Snedecor and Cochran (1989)
• Interested in the relationship between:
– phosphorus content of corn (Pcorn, in ppm) and phosphorus levels in soil samples (Psoil, in ppm).

1. Construct Model

Verbal: Phosphorus content of corn (Pcorn) depends on phosphorus content of soil (Psoil).

Graphical: plot of Pcorn against Psoil.

Formal:

             Name    Units   Dimensions   Measurement Scale
Response     Pcorn   ppm     ______       ______
Explanatory  Psoil   ppm     ______       ______

2. Execute analysis. Place data in model format:

$P_{corn} = \alpha + \beta_{P_{soil}} \cdot P_{soil} + \epsilon$

lm1 <- lm(Pcorn~Psoil, data=corn)

2. Execute analysis. Compute fitted values and residuals.

fits <- fitted(lm1)
resid <- residuals(lm1)
cbind(corn, fits, resid)

3. Evaluate Model. Plot residuals against fitted values.

plot(fits, resid, pch=16)

Check for a linear trend.

3. Evaluate Model.

• We use theoretical distributions (χ², t, F) to calculate the p-value, so we need to check their assumptions:
– Fixed variance (errors homogeneous)
– Normally distributed errors
– Independent errors
– Unbiased estimate (errors sum to zero)

3. Evaluate Model. Homogeneous errors.

3. Evaluate Model. Normal errors.

3. Evaluate Model. Independent errors.

This is a textbook example; we have no information on the spatial layout of the samples or on the collection sequence. We will assume independence.

3. Evaluate Model. Conclusion.

Residuals appear to be homogeneous, but not normal. We assume independence because we do not have enough information to evaluate this assumption.

We may need to use an empirical distribution to compute p-values or confidence limits.

4. State population and whether sample is representative.

• Population: all values of phosphorus in corn, given knowledge of phosphorus in the soil
• Sample (n = 9): representative if the 9 soil types represent the range of possible soil types

5. Decide on mode of inference. Is hypothesis testing appropriate?

• Since the relationship between soil phosphorus and phosphorus content in corn is unknown, we proceed with hypothesis testing

6. State HA / Ho, test statistic and α

HA:
Ho:
Statistic: F-ratio
α: 5%

$P_{corn} = \alpha + \beta_{P_{soil}} \cdot P_{soil} + \epsilon$

7. ANOVA: Partition df according to model.

n = 9

dftotal = ________ = _____

dfmodel = 1

dfres = dftotal – dfmodel = _____

7. ANOVA: Calculate SS, partition according to model.

Null model: Pcorn = mean(Pcorn)
SS total: 2274.00

Regression model: Pcorn = 61.58 + 1.417*Psoil
SS residual: 800.43

SS improvement? __________

7. ANOVA: Partition df, SS according to model. Complete ANOVA table
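The partitioning in step 7 can be sketched in a few lines of R, starting only from the values quoted on the slides (n = 9, SS total = 2274.00, SS residual = 800.43). The object names (`ss_tot`, `anova_tab`, and so on) are illustrative, not taken from the original analysis. The resulting F-ratio should agree with the F1,7 = 12.89 reported later.

```r
# Sums of squares and sample size as reported on the slides
n      <- 9
ss_tot <- 2274.00   # null model: Pcorn = mean(Pcorn)
ss_res <- 800.43    # regression model

# Partition df and SS according to the model
df_tot   <- n - 1
df_model <- 1
df_res   <- df_tot - df_model
ss_model <- ss_tot - ss_res          # the "SS improvement"

# Mean squares, F-ratio, and p-value from the F distribution
ms_model <- ss_model / df_model
ms_res   <- ss_res / df_res
f_ratio  <- ms_model / ms_res
p_value  <- pf(f_ratio, df_model, df_res, lower.tail = FALSE)

# Assemble the ANOVA table (hypothetical layout)
anova_tab <- data.frame(
  Source = c("Psoil", "Residual", "Total"),
  df     = c(df_model, df_res, df_tot),
  SS     = c(ss_model, ss_res, ss_tot),
  MS     = c(ms_model, ms_res, NA),
  F      = c(f_ratio, NA, NA),
  p      = c(p_value, NA, NA)
)
print(anova_tab)
```

With the fitted model available, `anova(lm1)` would give the same table directly.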

7. ANOVA: Calculate Type I error from F distribution.

Packages compute the p-value and place it in the ANOVA table.

p = 0.00885

8. Recompute p-value if necessary.

• p-values can be inaccurate if assumptions are violated

• Distortion depends on sample size
– As a rule of thumb, distortion is greatest if n < 30
– less serious if 30 < n < 100
– usually not serious if n > 100

• When assumptions are not met, recompute Type I error if two conditions are met:
1. n small
2. p near α

• For due diligence, recompute the p-value using randomization
– Free of distributional assumptions

• In 4000 randomizations there were 27 instances of an F-ratio greater than 12.89
– Empirical p-value: 27/4000 = 0.00675
– Theoretical p-value: 0.008854
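A randomization test of this kind can be sketched as follows. The slides do not show the data frame, so the nine (Psoil, Pcorn) pairs below are an assumption; they were chosen because they reproduce the reported SS total (2274.00) and fitted line (61.58 + 1.417*Psoil), and should be checked against the original source. The helper `f_ratio()` and the seed are also illustrative, so the empirical p-value will not match the slide's 27/4000 exactly.

```r
# ASSUMPTION: these nine data pairs are not shown on the slides; they are
# consistent with the reported SS total and regression equation.
corn <- data.frame(
  Psoil = c(1, 4, 5, 9, 11, 13, 23, 23, 28),
  Pcorn = c(64, 71, 54, 81, 76, 93, 77, 95, 109)
)

# F-ratio for the regression of a response vector y on x (hypothetical helper)
f_ratio <- function(y, x) {
  anova(lm(y ~ x))[["F value"]][1]
}

# Observed F-ratio
f_obs <- f_ratio(corn$Pcorn, corn$Psoil)

# Shuffle the response to break any link with Psoil, refit, and record F
set.seed(1)                      # arbitrary seed, for repeatability only
n_rand <- 4000
f_rand <- replicate(n_rand, f_ratio(sample(corn$Pcorn), corn$Psoil))

# Empirical p-value: proportion of randomized F-ratios at least as large
p_emp <- mean(f_rand >= f_obs)
c(F_observed = f_obs, p_empirical = p_emp)
```

Shuffling the response leaves both marginal distributions intact, so the reference distribution for F is built from the data themselves rather than from normal-error theory.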

9. Declare and report decision about model terms.

• p = 0.00675 (via randomization, hence no distributional assumptions)
– p < α = 5%, so reject Ho in favor of HA

• Report decision with evidence:
– There was a significant increase in available phosphorus with increase in soil phosphorus (F1,7 = 12.89, p = 0.00675 by randomization)

10. Report and interpret parameters of biological interest.

• Regression equation: $P_{corn} = 61.58 + 1.417 \cdot P_{soil}$


Chapter 9.2 Regression. Explanatory Variable Fixed into Classes

GLM, applied to regression: X variable fixed into classes

• Example: Galton’s Law

• Quantity of interest is the stature (height) of sons in relation to stature (height) of their fathers.

• Data collected by Francis Galton at end of the 19th century.

• 1st application of regression

1. Construct Model

Verbal: There is a positive relation between heights of sons and fathers.

Explanatory: _____________

Response: _____________

Model: __________________

Formal:

Symbol   Units    Dimensions   Measurement Scale
Hson     ______   ______       ______
Hf       ______   ______       ______

$H_{son} = \alpha + \beta_{H_f} \cdot H_f + \varepsilon$

$H_{son} = \hat{\alpha} + \hat{\beta}_{H_f} \cdot H_f + \varepsilon$

$H_{son} = a + b_{H_f} \cdot H_f + e$

2. Execute analysis. Place data in model format:

lm1 <- lm(Hson~Hf, weights=Nfamily, data=Heights)

$H_{son} = \alpha + \beta_{H_f} \cdot H_f + \varepsilon$

2. Execute analysis. Compute fitted values and residuals.

coefficients(lm1)
(Intercept)          Hf
 33.2855960   0.5225171

$H_{son} = \alpha + \beta_{H_f} \cdot H_f + \varepsilon$

$H_{son} = 33.29 + 0.52 \cdot H_f + \varepsilon$

63.667 = ____ + ____
65.643 = ____ + ____
…

3. Evaluate Model

□ Straight line model ok?

□ Errors homogeneous?

□ Errors normal?

□ Errors independent?

4. State population and whether sample is representative.

• Population is all possible measurements, given the measurement protocol, if we repeated the study thousands of times
• We infer a population consisting of thousands of runs of the same experiment, using the same protocol

5. Decide on mode of inference. Is hypothesis testing appropriate?

• Might expect a 1:1 ratio
• Undertake hypothesis testing?
• Use confidence limits

10. Report and interpret parameters of biological interest.

• Compute confidence limits from standard error of the slope parameter

summary(lm1)$coefficients

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.28560    1.64243   20.27 2.61e-12 ***
Hf            0.52252    0.02424   21.55 1.06e-12 ***
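As a rough worked example, approximate 95% limits for the slope follow from the estimate and its standard error. The residual degrees of freedom are not shown on the slides, so the multiplier of 2 used here is an assumption standing in for the exact t quantile; the exact limits would be slightly wider.

```latex
\hat{\beta}_{H_f} \pm 2 \cdot SE(\hat{\beta}_{H_f})
  = 0.52252 \pm 2 \times 0.02424
  = 0.52252 \pm 0.04848
  \quad\Rightarrow\quad (0.474,\; 0.571)
```

In R, `confint(lm1)` would return the exact limits using the appropriate t quantile.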

$H_{son} = 33.29 + 0.52 \cdot H_f + \varepsilon$

• Confidence limits do not include the hypothesis of $\beta_{H_f} = 1$ (a 1:1 ratio)
• Nor do they include $\beta_{H_f} = 0$ (i.e. no relationship)
• $\hat{\beta}_{H_f}$ is tightly centered around a value of ~0.5
– Great! But why?

Chapter 9.3 Regression. Explanatory Variable Measured with Error

GLM, applied to regression: Explanatory Variable Measured with Error

• Adds bias to regression parameter estimates
• Example:
– Relation between number of eggs and body size in cabezon fish (Box 14.12, Sokal and Rohlf 1995)
– What is the magnitude of the bias?

1. Construct Model

• Verbal
– Does egg number Neggs depend on body mass M?
• Graphical
• Formal
– Response: Neggs
– Explanatory: M

$N_{eggs} = \alpha + \beta_M \cdot M + \varepsilon$

units?

dimensions?

measurement scale?

2. Execute analysis. Place data in model format:

lm1 <- lm(Neggs~M, data=data)

$N_{eggs} = \alpha + \beta_M \cdot M + \varepsilon$

• The package first estimates the parameters of the general linear model, and then computes fitted values and residuals
• The fitted model, centered on mean body mass $\bar{M}$, is:

$N_{eggs} = \hat{\beta}_o + \hat{\beta}_M \cdot (M - \bar{M}) + \varepsilon$

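3. Evaluate Model

$N_{eggs} = \alpha + \beta_M \cdot M + \varepsilon$

Body mass is measured with error: the recorded $M$ is the true mass plus $\delta$, where $\delta$ is measurement error.

If $\delta$ and $\varepsilon$ are normal and independent, the estimated slope will be smaller than the true slope by a factor of $k$, the reliability ratio:

$k = \sigma^2_{true} / (\sigma^2_{true} + \sigma^2_{\delta})$

$\sigma_{\delta}$ is unknown, but no worse than measurement resolution (1 hectogram).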

□ Structure?

□ Straight line model ok?

□ Errors homogeneous?

□ Errors normal?

□ Errors independent?
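The size of the bias can be gauged with a short calculation. This sketch assumes the worst case named on the slide, a measurement-error standard deviation equal to the 1 hectogram resolution, and estimates the true variance of body mass as the observed variance minus the error variance. The names `sd_delta`, `k` and `b_corrected` are illustrative; the slope 1.87 is the estimate reported in step 10.

```r
# Body mass (hectograms) from the cabezon data table in this section
M <- c(14, 17, 24, 25, 27, 33, 34, 37, 40, 41, 42)

# ASSUMPTION: measurement-error SD set to the 1 hectogram resolution,
# the upper bound given on the slide
sd_delta <- 1

# Observed variance of M = true variance + measurement-error variance
var_obs  <- var(M)
var_true <- var_obs - sd_delta^2

# Reliability ratio: factor by which the estimated slope is attenuated
k <- var_true / var_obs

# Slope reported on the slides, and the slope after undoing the attenuation
b_obs       <- 1.87
b_corrected <- b_obs / k

round(c(reliability = k, b_observed = b_obs, b_corrected = b_corrected), 3)
```

Because body mass varies far more among fish than the measurement resolution, the ratio stays close to 1, and the bias in the slope is small under this assumption.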

 M   Neggs     Res   Lag.Res
14      61   15.05        NA
17      37  -14.56     15.05
24      65    0.35    -14.56
25      69    2.48      0.35
27      54  -16.26      2.48
33      93   11.52    -16.26
34      87    3.65     11.52
37      89    0.04      3.65
40     100    5.43      0.04
41      90   -6.43      5.43
42      97   -1.30     -6.43

4. State population and whether sample is representative.

a) All measurements that could have been made on the fish by this protocol
b) All cabezon fish
c) All fish that could have been collected when the collection was made
d) Measurements from the 11 cabezon fish reported here

5. Decide on mode of inference. Is hypothesis testing appropriate?

• We want to know if the relationship between body size and egg count deviates from 1:1
• Use confidence limits

10. Report and interpret parameters of biological interest.

• Compute confidence limits

confint(lm1)

                2.5 %    97.5 %
(Intercept) -4.098376 43.632008
M            1.117797  2.622113

Neggs =  Fits  +    Res
   61 = 45.95  +  15.05
   37 = 51.56  + -14.56
   65 = 64.65  +   0.35
   69 = 66.52  +   2.48
   54 = 70.26  + -16.26
   93 = 81.48  +  11.52
   87 = 83.35  +   3.65
   89 = 88.96  +   0.04
  100 = 94.57  +   5.43
   90 = 96.43  +  -6.43
   97 = 98.30  +  -1.30

• Check limits free of assumptions – randomization

Residuals reassigned at random, and the resulting response values (fit + reassigned residual):

Res (reassigned)   Neggs (new)
   3.65              49.60
   2.48              54.04
 -14.56              50.09
   0.04              66.56
  15.05              85.31
  -1.30              80.17
   5.43              88.78
  -6.43              82.52
   0.35              94.92
 -16.26              80.18
  11.52             109.83
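The randomization check can be sketched in R by repeatedly reassigning the residuals to the fitted values, refitting the regression, and collecting the slopes. The data are those tabulated in this section. The number of randomizations, the seed, and the use of simple percentile limits are assumptions, so the limits will be close to, but not identical with, those reported on the slides.

```r
# Cabezon data from the table in this section
cabezon <- data.frame(
  M     = c(14, 17, 24, 25, 27, 33, 34, 37, 40, 41, 42),
  Neggs = c(61, 37, 65, 69, 54, 93, 87, 89, 100, 90, 97)
)

lm1  <- lm(Neggs ~ M, data = cabezon)
fits <- fitted(lm1)
res  <- residuals(lm1)

# Reassign residuals at random (without replacement), rebuild the response
# as fit + reassigned residual, refit, and keep the slope each time
set.seed(1)        # arbitrary seed, for repeatability only
n_rand <- 4000     # ASSUMPTION: number of randomizations is not stated here
slopes <- replicate(n_rand, {
  new_y <- fits + sample(res)
  coef(lm(new_y ~ cabezon$M))[[2]]
})

# Percentile limits for the slope (kiloeggs/hectogram)
limits <- quantile(slopes, c(0.025, 0.975))
slope_hat <- coef(lm1)[["M"]]
slope_hat
limits
```

These limits depend only on the observed residuals, not on a theoretical t distribution, which is why they can differ from the `confint(lm1)` output.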

• Report conclusions with evidence:
– $\hat{\beta}_M$ = 1.87 with 95% confidence limits of 1.28 to 2.48 kiloeggs/hectogram
– The interval does not include 0, so there is a relationship
– It also excludes a 1:1 ratio
– We conclude that in this species, large fish invest disproportionately more in eggs (per unit of body mass) than do small fish

Chapter 9.4 Exponential Function, using Linear Regression

Exponential functions

• Exponential rates are common in biology
• Example: intrinsic rate of population increase
• Example: specific growth rate, $W_t = W_0 e^{kt}$
– $W_0$ = initial weight (kg)
– $W_t$ = recapture weight (kg)
– $t$ = time from initial capture to recapture (days)
– $k$ = exponential growth rate (%/day)

• Example data: growth of 6 lungfish in 2001 in Lake Baringo, Kenya

Initial (kg)   End (kg)   Time (days)
1.32           1.46       50
1.30           1.48       64
1.60           1.84       65
0.76           0.90       56
0.60           0.65       20
2.74           2.86       48

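1. Construct Model

• Verbal
– Growth of lungfish is exponential, with fixed growth rate k
• Graphical
• Formal
– Have to linearize to apply regression: $\ln(W_t / W_0) = k \cdot t$, where $W_0$ and $W_t$ are the initial and recapture weights and $t$ is the time between them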

2. Execute analysis.
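A minimal sketch of this step, using the six lungfish records tabulated above. The log ratio of recapture to initial weight is regressed on days between captures; the column names `W0`, `Wt` and `lnratio` are illustrative, while `t` matches the term name in the `confint` output shown in step 10. The model is fitted with an intercept, as that output implies.

```r
# Lungfish data from the table above (Lake Baringo, 2001)
lungfish <- data.frame(
  W0 = c(1.32, 1.30, 1.60, 0.76, 0.60, 2.74),  # initial weight (kg)
  Wt = c(1.46, 1.48, 1.84, 0.90, 0.65, 2.86),  # recapture weight (kg)
  t  = c(50, 64, 65, 56, 20, 48)               # days between captures
)

# Linearize: ln(Wt / W0) = k * t
lungfish$lnratio <- log(lungfish$Wt / lungfish$W0)

# Fit the regression; the slope on t estimates the growth rate k (per day)
lm1 <- lm(lnratio ~ t, data = lungfish)
k_hat <- coef(lm1)[["t"]]
ci    <- confint(lm1)

k_hat
ci

# Express the slope and its limits in %/day
100 * c(estimate = k_hat, lower = ci["t", 1], upper = ci["t", 2])
```

With only six fish and a narrow range of recapture times, the limits on the slope are wide, which is the point taken up in step 10.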

3. Evaluate Model

□ Straight line model ok?

□ Errors homogeneous?

□ Errors normal?

□ Errors independent?

4. State population and whether sample is representative.

• All measurements that could have been made on the fish by this protocol

5. Decide whether to use hypothesis testing.

• The research objective is to estimate specific growth rate of fish.
• We will examine the parameters and compute confidence limits (skip to step 10).

10. Report and interpret parameters of biological interest.

• Compute confidence limits

confint(lm1)

                   2.5 %      97.5 %
(Intercept) -0.133723588 0.197839514
t           -0.001595261 0.004696776

L = Lower limit = -0.160 %/day
U = Upper limit = 0.470 %/day

• Limits include zero, suggesting no growth. Yet all 6 fish were larger upon recapture, an improbable result if there were no growth:
– 0.5^6 = 0.0156
• But was growth exponential?

• The estimate of growth rate is approximately 0.16 %/day (the midpoint of the limits above), or about 5% per month, but the estimate is not reliable!