Lecture 4 Regression Analysis NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA John Birks


REGRESSION ANALYSIS

Introduction, Aims, and Main Uses

Response model

Types of response variables y

Types of predictor variables x

Types of response curves

Transformations

Types of regression

Null hypothesis, alternative hypothesis, type I and II errors, α and β

Quantitative response variable

Nominal explanatory (predictor) variables

Quantitative explanatory (predictor) variables

General linear model

REGRESSION ANALYSIS continued

Presence/absence response variable

Nominal explanatory (predictor) variables

Quantitative explanatory (predictor) variables

Generalised linear model (GLM)

Multiple linear regression

Multiple logit regression

Selecting explanatory variables

Nominal or nominal and quantitative explanatory variables

Assessing assumptions of regression model

Simple weighted average regression

Model II regression

Software for basic regression analysis

INTRODUCTION

Explore relationships between variables and their environment

+/– or abundances for species (responses)

Individual species, one or more environmental variable (predictors)

AIMS

1. To describe response variable as a function of one or more explanatory variables. This RESPONSE FUNCTION usually cannot be chosen so that the function will predict responses without error. Try to make these errors as small as possible and to average them to zero.

2. To predict the response variable under some new value of an explanatory variable. The value predicted by the response function is the expected response, the response with the error averaged out. cf. CALIBRATION

3. To express a functional relationship between two variables thought, a priori, to be related by a simple mathematical relationship, but where only one of the variables is known exactly. cf. MODEL II REGRESSION

Species abundance or presence/absence - response variable Y

Environmental variables - explanatory or predictor variables X

MAIN USES

(1) Estimate ecological parameters for species, e.g. optimum, amplitude (tolerance) - ESTIMATION AND DESCRIPTION

(2) Assess which explanatory variables contribute most to a species response and which explanatory variables appear to be unimportant. Statistical testing - MODELLING

(3) Predict species responses (+/–, abundance) from sites with observed values of explanatory variables - PREDICTION

(4) Predict environmental variables from species data - CALIBRATION or ‘INVERSE REGRESSION’

Fox (2002)

Sokal & Rohlf (1995)

Draper & Smith (1981)

Montgomery & Peck (1992)

Crawley (2002, 2005)

RESPONSE MODEL

Y = b0 + b1x + ε

Y = response variable, x = explanatory variable, ε = error

Systematic part - regression equation

Error part - statistical distribution of error

Ey = b0 + b1x SYSTEMATIC PART

b0, b1 fixed but unknown coefficients

b0 = intercept

b1 = slope

Error part is the distribution of ε, the random variation of the observed response around the expected response.

Aim is to estimate the systematic part from data while taking account of the error part of the model. In fitting a straight line, the systematic part is simply estimated by estimating b0 and b1.

Least squares estimation – error part assumed to be normally distributed.
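A minimal sketch of how b0 and b1 in the straight-line response model can be estimated by least squares. The data values and the names `fit_line`, `water_table` and `log_cover` are hypothetical and chosen only for illustration; they are not the lecture's data.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the systematic part Ey = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y about their means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical observations (assumption: not taken from the lecture)
water_table = [10.0, 20.0, 30.0, 40.0, 50.0]
log_cover = [4.1, 3.6, 3.4, 2.9, 2.5]

b0, b1 = fit_line(water_table, log_cover)

# Residuals = observed response minus expected response (the error part)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(water_table, log_cover)]
print(b0, b1, sum(residuals))
```

With an intercept in the model, the least-squares residuals sum to zero, which is the sense in which the errors are made "to average to zero".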

TYPES OF RESPONSE VARIABLES - y

Quantitative (log transformation)

% quantitative

Nominal including +/–

TYPES OF EXPLANATORY or PREDICTOR VARIABLES - x

Quantitative

Nominal

Ordinal (ranks) - treat as nominal 1/0 if few classes, quantitative if many classes

TYPES OF RESPONSE CURVES

If one explanatory variable x, consists of fitting curves through data.

What type of curve?

(i) EDA scatter plots of y and x.

(ii) Underlying theory and available knowledge.

Shapes of response curves. The expected response (Ey) is plotted against the environmental variable (x). The curves can be constant (a: horizontal line), monotonic increasing (b: sigmoid curve, c: straight line), monotonic decreasing (d: sigmoid curve), unimodal (e: parabola, f: symmetric, Gaussian curve, g: asymmetric curve and a block function) or bimodal (h).

Response curves derived from a bimodal curve by restricting the sampling interval. The curve is bimodal in the interval a-f, unimodal in a-c and in d-f, monotonic in b-c and c-e and almost constant in c-d. Ey = expected response; x = environmental variable.

TRANSFORMATIONS

Usually needed

TYPES OF REGRESSION

(LR = Linear Regression)

Response variable y | One explanatory variable x, nominal | One explanatory variable x, quantitative | Many explanatory variables x, nominal | Many explanatory variables x, quantitative
Quantitative | ANOVA | Linear and non-linear regression | Multiple LR with nominal dummy variables | Multiple LR
+/- | χ2 test | Logit | [Log-linear contingency tables] | Multiple logit regression

Also weighted averaging regression and model II regressions

Tests of statistical hypotheses are probabilistic

Can just as well estimate the degree to which an effect is felt as judge whether the effect exists or not.

As a result, can compute probabilities of two types of error.

NULL HYPOTHESIS, ALTERNATIVE HYPOTHESIS, TYPE I ERROR, TYPE II ERROR, α, AND β

Null hypothesis H0 ‘y not correlated with x’

No difference, no association, no correlation. Hypothesis to be tested, usually by some type of significance test.

Alternative hypothesis H1 Postulates non-zero difference, association, correlation. Hypothesis against which null hypothesis is tested.

Type I error (α): probability that we have mistakenly rejected a true null hypothesis

Type II error (β): probability that we have mistakenly failed to reject a false null hypothesis

DECISION

TRUTH | Accept H0 | Reject H0
H0 true | No error: 1 - α | Type I error: α
H0 false | Type II error: β | No error: 1 - β

Power of a test is simply the probability of not making a type II error, namely 1 - β. The higher the power, the more likely it is to show, statistically, an effect that really exists.

α: 0.05, 0.01, 0.001

β: Rarely estimated. Function of the critical value of α, sample size, and the magnitude of effect being looked for.

Type I error:

Error that results when the null hypothesis is FALSELY REJECTED

Type II error:

Error that results when the null hypothesis is FALSELY ACCEPTED

QUANTITATIVE RESPONSE VARIABLE, NOMINAL EXPLANATORY VARIABLE

Relative cover (log-transformed) of a plant species (●) in relation to the soil types of clay, peat and sand. The horizontal arrows indicate the mean value in each type. The solid vertical bars show the 95% confidence interval for the expected values in each type and the dashed vertical lines the 95% prediction interval for the log-transformed cover in each type.

Plant cover (y), 3 soil types (x)

Assume responses are independent.

Response model - Systematic part: 3 expected responses, one for each soil type. Error part – observed responses vary around expected responses in each soil type. Normally distributed, and variance within each soil type is the same.

ANALYSIS OF VARIANCE (ANOVA)

Estimate:

Expected responses in 3 soil types. Least squares: the sum over all sites of squared differences between observed and expected response is made minimal. The parameter that minimises this SS within each soil type is the mean.

Difference between Ey and observed response is the residual. Least squares minimises the sum of squared vertical distances, the residual SS.

Ey, standard error, and 95% confidence interval = Estimate ± t0.05(v) × s.e.

t0.05(v) = 5% critical value in 2-tailed test. Degrees of freedom (v) = n - q, where q = number of parameters

Means and ANOVA table of the transformed relative cover of the above figure

Term | mean | s.e. | 95% confidence interval
Clay | 1.70 | 0.33 | (1.00, 2.40)
Peat | 3.17 | 0.38 | (2.37, 3.97)
Sand | 2.33 | 0.38 | (1.53, 3.13)
Overall mean | 2.33 | |

ANOVA table (ms = ss/df)

d.f. formula | Source | d.f. | s.s. | m.s. | F
q - 1 | Regression | 2 | 7.409 | 3.704 | 4.24
n - q | Residual | 17 | 14.826 | 0.872 |
n - 1 | Total | 19 | 22.235 | 1.17 |

q = parameters = 3, n = number of objects = 20

Total ss (n - 1 = 19 df) = Regression ss (q - 1 = 2 df) + Residual ss (n - q = 17 df)

F = ms regression (df = q - 1 = 2) / ms residual (df = n - q = 17)

Critical value of F at 5% level is 3.59

95% confidence interval: Estimate ± t0.05(v) × s.e.

Value of t0.05(v) depends on number of degrees of freedom (v) of the residual; with v = 17, t0.05(17) = 2.11

R2 = 1 – (residual sum of squares / total sum of squares) = 1 - (14.826/22.235) = 0.333

R2adj = 1 – (residual variance / total variance) = 1 - (0.872/1.17) = 0.25
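The arithmetic behind the ANOVA table can be retraced in a few lines. This is only a sketch that starts from the sums of squares quoted on the slide (7.409 and 14.826, with n = 20 and q = 3); the variable names are illustrative.

```python
# Quantities quoted in the ANOVA table above
n, q = 20, 3                 # number of objects, number of parameters
ss_regression = 7.409
ss_residual = 14.826
ss_total = ss_regression + ss_residual

# Degrees of freedom
df_regression = q - 1        # 2
df_residual = n - q          # 17
df_total = n - 1             # 19

# Mean squares: ms = ss / df
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
ms_total = ss_total / df_total

# Variance ratio and the two coefficients of determination
f_ratio = ms_regression / ms_residual
r2 = 1 - ss_residual / ss_total
r2_adj = 1 - ms_residual / ms_total

f_critical = 3.59            # 5% critical value of F(2, 17), as quoted on the slide
print(round(f_ratio, 2), round(r2, 3), round(r2_adj, 2), f_ratio > f_critical)
```

The unrounded ratio comes out marginally above the tabulated 4.24 because the slide's mean squares are themselves rounded; the conclusion (F exceeds 3.59) is unaffected.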

QUANTITATIVE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE

Straight line fitted by least-squares regression of log-transformed relative cover on mean water-table. The vertical bar on the far right has length equal to twice the sample standard deviation (T), the other two smaller vertical bars are twice the length of the residual standard deviation (R). The dashed line is a parabola fitted to the same data (●).

Error part – responses independent and normally distributed around expected values (Ey)

Straight line fitted by least-squares: parameter estimates and ANOVA table for the transformed relative cover of the figure above

Term | Parameter | Estimate | s.e. | t (= estimate/se)
Constant | b0 | 4.411 | 0.426 | 10.35
Water-table | b1 | -0.037 | 0.00705 | -5.25

ANOVA table

d.f. formula | Source | d.f. | s.s. | m.s. | F
parameters - 1 | Regression | 1 | 13.45 | 13.45 | 27.56 (d.f. 1, 18)
n - parameters | Residual | 18 | 8.78 | 0.488 |
n - 1 | Total | 19 | 22.23 | 1.17 |

R2adj = 0.58, R2 = 0.61, r = 0.78

QUANTITATIVE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE

Does expected response depend on water table?

F = 27.56 >> 4.4 (critical value at 5%, df (1, 18))

F = MS regression (df = parameters – 1) / MS residual (df = n – parameters)

Does slope b1 = 0?

$t \text{ of } b_1 = \dfrac{b_1}{\mathrm{se}(b_1)} = -5.25, \qquad \left[\dfrac{b_1}{\mathrm{se}(b_1)}\right]^2 = F$

Absolute value of critical value of two-tailed t-test at 5%: t0.05,18 = 2.10

|t| > 2.10, so b1 not equal to 0 [exactly equivalent to F test]

Construct 95% confidence interval for b1

estimate ± t0.05,v × se = -0.052 / -0.022

Does not include 0, so 0 is an unlikely value for b1

Check assumptions of response model

Plot residuals against x and Ey
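The t-test and confidence interval for the slope can be checked numerically. The sketch below uses the estimate and standard error of the water-table slope reported in the parameter table (b1 = -0.037, s.e. = 0.00705) and the critical value 2.10 quoted above; variable names are illustrative.

```python
# Values taken from the parameter table and the text above
b1 = -0.037          # estimated slope for water-table
se_b1 = 0.00705      # its standard error
t_critical = 2.10    # two-tailed 5% critical value of t with 18 d.f.
f_ratio = 27.56      # F from the ANOVA table

# t statistic for the null hypothesis b1 = 0
t_value = b1 / se_b1

# 95% confidence interval: estimate +/- t * se
ci_low = b1 - t_critical * se_b1
ci_high = b1 + t_critical * se_b1

print(round(t_value, 2))                     # compare with -5.25 in the table
print(round(ci_low, 3), round(ci_high, 3))   # interval for b1
print(round(t_value ** 2, 2), f_ratio)       # t squared is close to F
```

With a single explanatory variable, t squared and F agree up to rounding of the tabulated inputs, which is why the two tests are equivalent.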

Polynomial regression

Could we fit a curve to these data better than a straight line?

Parabola Ey = b0 + b1x + b2x²

Parabola fitted by least-squares regression: parameter estimates and ANOVA table for the transformed relative cover of above figure.

Term | Parameter | Estimate | s.e. | t
Constant | b0 | 3.988 | 0.819 | 4.88
Water-table | b1 | -0.0187 | 0.0317 | -0.59
(Water-table)² | b2 | -0.000169 | 0.000284 | -0.59

b2 not different from 0

ANOVA table

Source | d.f. | s.s. | m.s. | F
Regression | 2 | 13.63 | 6.815 | 13.97
Residual | 17 | 8.61 | 0.506 |
Total | 19 | 22.23 | 1.17 |

R2adj = 0.57

(R2adj = 0.58 for linear model)

1 extra parameter, 1 less d.f.

GENERAL LINEAR MODEL

Response variable Y = EY + e

where EY is the expected value of Y for particular values of the predictors and e is the variability ("error") of the true values around the expected values EY.

The expected value of the response variable is a function of the predictor variables EY = f(X1, ..., Xm)

EY = systematic component, e = stochastic or error component.

Simple linear regression EY = f(X) = b0 + b1X

Polynomial regression EY = b0 + b1X + b2X²

Null model EY = b0

Regression Analysis Summary

$\mathrm{E}Y = \hat{Y} = b_0 + \sum_{j=1}^{p} b_j x_j$

Fitted values allow you to estimate the error component, the regression residuals

$e_i = Y_i - \hat{Y}_i$

Total sum of squares (variability of response variable)

$\mathrm{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ where $\bar{Y}$ = mean of Y

This can be partitioned into

(i) The variability of Y explained by the fitted model, the regression or model sum of squares

$\mathrm{MSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$

(ii) The residual sum of squares

$\mathrm{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$

Under the null hypothesis that the response variable is independent of the predictor variables, MSS and RSS are expected to be equal once each is divided by its respective number of degrees of freedom.
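The partition of the total sum of squares can be verified numerically. The sketch below fits a straight line to hypothetical data (the values of `x` and `y` are assumptions for illustration) and checks that TSS = MSS + RSS.

```python
# Hypothetical data (assumption: not from the lecture)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(y)
p = 1  # number of predictor variables

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares straight line EY = b0 + b1*X
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * xi for xi in x]

# Sums of squares as defined above
tss = sum((yi - mean_y) ** 2 for yi in y)
mss = sum((fi - mean_y) ** 2 for fi in fitted)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Each sum of squares divided by its degrees of freedom gives the F ratio
f_ratio = (mss / p) / (rss / (n - p - 1))
print(tss, mss + rss, f_ratio)
```

The equality holds exactly (up to floating-point error) only for least-squares fits that include an intercept.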

PARABOLA FITTED TO LOG-ABUNDANCE DATA, i.e. fitting a Gaussian unimodal response curve to original abundance data

z = c exp[-0.5(x-u)²/t²], where z is the original abundance value

Gaussian response curve with its three ecologically important parameters: maximum (c), optimum (u) and tolerance (t). Vertical axis: species abundance. Horizontal axis: environmental variable. The range of occurrence of the species is seen to be about 4t.

loge z = b0 + b1x + b2x² = loge(c) - 0.5(x-u)²/t²

Optimum u = -b1 / (2b2)

Tolerance t = 1 / √(-2b2)

Maximum c = exp(b0 + b1u + b2u²)

If b2 is positive, the fitted curve has a minimum rather than a maximum

Approximate SE of u and t can be calculated
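A short sketch of how the Gaussian parameters follow from the coefficients of a parabola fitted to log-abundance, using the relations u = -b1/(2b2), t = 1/√(-2b2) and c = exp(b0 + b1u + b2u²). The function name and the example values (c = 20, u = 5, t = 1.5) are hypothetical; the example builds the coefficients from known parameters and then recovers them.

```python
import math


def gaussian_parameters(b0, b1, b2):
    """Return (optimum u, tolerance t, maximum c) from log z = b0 + b1*x + b2*x**2."""
    if b2 >= 0:
        # With b2 >= 0 the parabola has no maximum, so no optimum can be defined
        raise ValueError("b2 must be negative for a unimodal (Gaussian) curve")
    u = -b1 / (2 * b2)
    t = 1 / math.sqrt(-2 * b2)
    c = math.exp(b0 + b1 * u + b2 * u ** 2)
    return u, t, c


# Hypothetical 'true' Gaussian curve used to generate the coefficients
true_c, true_u, true_t = 20.0, 5.0, 1.5
b2 = -1 / (2 * true_t ** 2)
b1 = true_u / true_t ** 2
b0 = math.log(true_c) - true_u ** 2 / (2 * true_t ** 2)

u, t, c = gaussian_parameters(b0, b1, b2)
print(u, t, c)
```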

PRESENCE-ABSENCE RESPONSE VARIABLE, NOMINAL EXPLANATORY VARIABLE

Numbers of fields in which Achillea ptarmica is present and absent in meadows with different types of agricultural use and frequency of occurrence of each type (unpublished data from Kruijne et al., 1967). The types are pure hayfield (ph), hay pastures (hp), alternate pasture (ap) and pure pasture (pp).

Achillea ptarmica (response) | ph | hp | ap | pp | Total
Present | 37 | 40 | 27 | 9 | 113
Absent | 109 | 356 | 402 | 558 | 1425
Total | 146 | 396 | 429 | 567 | 1538
Frequency | 0.254 | 0.101 | 0.063 | 0.016 | 0.073

Explanatory variable: agricultural use

$\chi^2 = \sum \dfrac{(o - e)^2}{e}$

o = observed frequency, e = expected frequency

(r-1)(c-1) degrees of freedom

Relative frequency of occurrence is 113/1538 = 0.073

Under null hypothesis, the expected number of fields with Achillea ptarmica present is, pure hayfield (ph) 0.073 x 146 = 10.7, hay pasture (hp) 0.073 x 396, etc. Calculated χ² = 102.1 compared with critical value of 7.81 at 0.05 level with 3 df. Conclude that occurrence of A. ptarmica depends on field type.
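The χ² statistic for the Achillea ptarmica table can be recomputed directly from the observed counts. This is a sketch using only the counts in the table above; the variable names are illustrative. Working with unrounded expected frequencies gives a value close to, but not exactly, the 102.1 quoted on the slide.

```python
# Observed counts per field type (ph, hp, ap, pp), from the table above
present = [37, 40, 27, 9]
absent = [109, 356, 402, 558]
observed = [present, absent]

row_totals = [sum(row) for row in observed]
col_totals = [p + a for p, a in zip(present, absent)]
grand_total = sum(row_totals)

# Expected count under the null hypothesis of no association:
# (row total * column total) / grand total
chi_square = 0.0
for r, row in enumerate(observed):
    for c, o in enumerate(row):
        e = row_totals[r] * col_totals[c] / grand_total
        chi_square += (o - e) ** 2 / e

degrees_of_freedom = (len(observed) - 1) * (len(present) - 1)
critical_value = 7.81   # 5% critical value of chi-square with 3 d.f., as quoted

print(round(chi_square, 1), degrees_of_freedom, chi_square > critical_value)
```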

Sigmoid curve fitted by logit regression of the presences (● at p = 1) and absences (● at p = 0) of a species on acidity (pH). In the display, the sigmoid curve looks like a straight line but it is not. The curve expresses the probability (p) of occurrence of the species in relation to pH.

PRESENCE-ABSENCE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE

Straight line (a), exponential curve (b) and sigmoid curve (c) representing equations 1, 2, and 3, respectively.

1: Ey = b0 + b1x (can be negative)

2: Ey = exp(b0 + b1x) (can be > 1)

3: Ey = p = [exp(b0 + b1x)] / [1 + exp(b0 + b1x)]

(b0 + b1x) = linear predictor

Systematic part – defined as shown

Error part – response can only have two values, therefore binomial error distribution

Cannot estimate parameters by least-squares regression as errors are not normally distributed and do not have constant variance

LOGIT REGRESSION – special case of GLM

GENERALISED LINEAR MODEL

Not the same as General Linear Model, more generalised

Software: GLIM, GENSTAT, R or S-PLUS

Logit: $\log_e\left[\dfrac{p}{1-p}\right]$ = linear predictor

or p = [exp (linear predictor)] / [1 + exp (linear predictor)]

Estimation in GLM by maximum likelihood.

Likelihood is defined for a set of parameter values as the probability of responses actually observed when that set of values is the true set of parameter values. ML chooses the set of parameter values for which likelihood is maximum.

Measure deviation of observed responses from fitted responses, not by residual SS as in least-squares, but by RESIDUAL DEVIANCE.

[Least-squares principle equivalent to ML if errors are independent and follow normal distribution].

Least-squares regression is one type of GLM.

Solved iteratively.
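A minimal sketch of logit regression fitted by maximum likelihood, solved iteratively by Newton-Raphson. It is not the algorithm of any particular package (GLIM, GENSTAT, R), and the presence/absence data and all names (`fit_logit`, `ph`, `present`) are hypothetical.

```python
import math


def inverse_logit(eta):
    """p = exp(eta) / (1 + exp(eta)), the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logit(x, y, iterations=25):
    """Maximum-likelihood estimates (b0, b1) for logit(p) = b0 + b1*x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [inverse_logit(b0 + b1 * xi) for xi in x]
        w = [pi * (1 - pi) for pi in p]          # binomial variance weights
        # Score vector (first derivatives of the log-likelihood)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative second derivatives)
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton-Raphson update: b <- b + H^-1 g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def deviance(x, y, b0, b1):
    """Residual deviance for 0/1 data: -2 * log-likelihood."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = inverse_logit(b0 + b1 * xi)
        total += -2.0 * math.log(p if yi == 1 else 1.0 - p)
    return total


# Hypothetical presence (1) / absence (0) data along a pH gradient
ph = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
present = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logit(ph, present)
residual_deviance = deviance(ph, present, b0, b1)

# Null model: constant probability equal to the overall frequency of occurrence
mean_p = sum(present) / len(present)
null_deviance = deviance(ph, present, math.log(mean_p / (1 - mean_p)), 0.0)
print(b0, b1, residual_deviance, null_deviance)
```

The sketch assumes the data are not perfectly separated along x; with complete separation the maximum-likelihood estimates do not exist and the iteration would diverge.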

GENERALISED LINEAR MODEL (GLM)

Sigmoid curve fitted by logit regression: parameter estimates and deviance table for the presence-absence data of the above figure.

Term | Parameter | Estimate | s.e. | t
Constant | b0 | 2.03 | 1.98 | 1.03
pH | b1 | -0.484 | 0.357 | -1.36 (|t| not > 2.111)

Source | d.f. | Deviance | Mean deviance
Residual | 33 | 43.02 | 1.304

Not different from a horizontal line, as the t-test does not reject b1 = 0

GAUSSIAN LOGIT CURVE

If we take for linear predictor the logit transformation of p

loge [p/(1-p)] = linear predictor

p = [exp (linear predictor)] / [1 + exp (linear predictor)]

For a parabola (b0 + b1x + b2x²) we get

p = [exp (b0 + b1x + b2x²)] / [1 + exp (b0 + b1x + b2x²)]

or $\log_e\left[\dfrac{p}{1-p}\right] = b_0 + b_1 x + b_2 x^2$

Parabola (a), Gaussian curve (b) and Gaussian logit curve (c) representing the equations, respectively.

Gaussian logit curve fitted by logit regression of the presences (● at p = 1) and absences (● at p = 0) of a species on acidity (pH). u = optimum; t = tolerance; pmax = maximum probability of occurrence.

Gaussian logit curve fitted by logit regression: parameter estimates and deviance table for presence-absence data

Term | Parameter | Estimate | s.e. | t
Constant | b0 | -128.8 | 51.1 | -2.52
pH | b1 | 49.4 | 19.8 | 2.50
pH² | b2 | -4.68 | 1.90 | -2.47

All |t| > t of 1.96

Source | d.f. | Deviance | Mean deviance
Residual | 32 | 23.17 | 0.724

u = -b1 / (2b2)

t = 1 / √(-2b2)

pmax = 1 / {1 + exp(-b0 – b1u – b2u²)}

t-tests of b2, b1 and b0

Deviance tests - Gaussian logit curve → linear logit (sigmoidal) → null model

Drop in deviance > χ² critical value of 3.84 (1 d.f., 5% level)

Residual deviance of a model is compared with that of an extended model. The additional parameters in the extended model (e.g. Gaussian logit) are significant when the drop in residual deviance is larger than the critical value of a χ2 distribution with k degrees of freedom (k=number of additional parameters)

Example:

Gaussian logit model – residual deviance = 23.17

Sigmoidal model – residual deviance = 43.02

43.02 - 23.17 = 19.85, which is >> χ²0.05(1) = 3.84
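The optimum, tolerance and deviance test for the Gaussian logit example can be retraced as follows. This is a sketch under stated assumptions: b1 = 49.4 is taken from the parameter table; b0 is taken as -128.8 and b2 as -4.68, the values implied by the quoted standard errors and t-values (estimate = t × s.e., with negative t). Because pmax is very sensitive to rounding of the coefficients, no particular value for it is claimed.

```python
import math

# Coefficients of the Gaussian logit curve (see assumptions in the lead-in)
b0, b1, b2 = -128.8, 49.4, -4.68

# Ecological parameters
u = -b1 / (2 * b2)               # optimum on the pH axis
t = 1 / math.sqrt(-2 * b2)       # tolerance
p_max = 1 / (1 + math.exp(-(b0 + b1 * u + b2 * u ** 2)))  # maximum probability

# Deviance test: sigmoidal (linear logit) model versus Gaussian logit model
deviance_sigmoid = 43.02
deviance_gaussian_logit = 23.17
drop_in_deviance = deviance_sigmoid - deviance_gaussian_logit
extra_parameters = 1
chi2_critical = 3.84             # 5% critical value, 1 d.f., as quoted above
b2_is_significant = drop_in_deviance > chi2_critical

print(round(u, 2), round(t, 3), drop_in_deviance, b2_is_significant)
```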

RESPONSE VARIABLE WITH MANY ZERO VALUES

Counts 0,1,2,3... Log-linear or Poisson regression

Log Ey = linear predictor

Can be (b0 + b1x) exponential curve

(b0 + b1x + b2x²) Gaussian curve (if b2 < 0)

[Poisson error distribution, link function log]

Can transform to PSEUDOSPECIES (as in TWINSPAN) and use as +/– response variables in logit regression.

MULTIPLE LEAST-SQUARES LINEAR REGRESSION

QUANTITATIVE RESPONSE VARIABLE, MANY QUANTITATIVE EXPLANATORY VARIABLES

Response variable expressed as a function of two or more explanatory variables. Not the same as separate analyses because of correlations between explanatory variables and interaction effects.

Planes

Ey = b0 + b1x1 + b2x2 (x1, x2 = explanatory variables)

b0 – expected response when x1 and x2 = 0

b1 – rate of change in expected response along x1 axis

b2 – rate of change in expected response along x2 axis

b1 measures change of Ey with x1 for a fixed value of x2

b2 measures change of Ey with x2 for a fixed value of x1

A straight line displays the linear relationship between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data (●). (a = intercept; b = slope or regression coefficient).

A plane displays the linear relation between the abundance value (y) of a species and two environmental variables (x1 and x2) fitted to artificial data (●).

Three-dimensional view of a plane fitted by least-squares regression of responses (●) on two explanatory variables x1 and x2. The residuals, i.e. the vertical distances between the responses and the fitted plane, are shown. Least-squares regression determines the plane by minimization of the sum of these squared vertical distances.

Estimates of b0, b1, b2 and standard errors and t (estimate / se)

ANOVA total SS, residual SS, regression SS

$R^2 = 1 - \dfrac{\text{Residual SS}}{\text{Total SS}}, \qquad R^2_{adj} = 1 - \dfrac{\text{Residual MS}}{\text{Total MS}}$

Ey = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ... + bmxm

MULTICOLLINEARITY

Selection of explanatory variables: Forward selection, Backward selection, ‘Best-set’ selection

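REGRESSION AND ANOVA

In multiple regression, where yi are n independent observations of the response variable, the familiar linear model is:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$ (A1)

where the xij's (k predictor variables) are known constants, β0, β1, …, βk are unknown parameters and the εi's are independent normal random variables. In matrix notation, the model is written as y = Xβ + ε, with matrices:

$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n_T} \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n_T 1} & \cdots & x_{n_T k} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{n_T} \end{bmatrix}$

where nT = total number of replicates. The least squares estimates b of the parameters are obtained by the normal equations:

X’Xb = X’y (A2)

And taking the inverse of X’X, we have:

b = [X’X]⁻¹ [X’y] (A3)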
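The normal equations X’Xb = X’y can be solved directly for a small problem. The sketch below builds X’X and X’y in plain Python and solves the system by Gaussian elimination rather than by forming the inverse explicitly; the helper `solve` and the data (generated exactly from the plane y = 1 + 2x1 - 0.5x2) are hypothetical.

```python
def solve(a, rhs):
    """Solve the linear system a @ b = rhs by Gaussian elimination with pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * b[c] for c in range(r + 1, n))
        b[r] = (aug[r][n] - s) / aug[r][r]
    return b


# Hypothetical predictors; the responses lie exactly on y = 1 + 2*x1 - 0.5*x2
x1 = [0.0, 1.0, 0.0, 1.0, 2.0, 1.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0, 2.0]
y = [1.0 + 2.0 * a - 0.5 * b for a, b in zip(x1, x2)]

# Design matrix X: a column of ones followed by the predictor columns
X = [[1.0, a, b] for a, b in zip(x1, x2)]
k = len(X[0])

# X'X and X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]

b = solve(xtx, xty)   # least-squares estimates (b0, b1, b2)
print(b)
```

Solving the system instead of inverting X’X is a numerical design choice; both give the estimates of equation A3 when X’X is non-singular.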

In a similar fashion, consider the linear model for a one-way ANOVA:

$y_{ij} = \mu + \tau_i + \varepsilon_{ij}$ (A4)

where yij is the value of the jth replicate in the ith treatment, μ is the overall parametric mean, τi is the effect of the ith treatment and εij is the random normal error associated with that replicate. The model for the expectation of y in any particular treatment is:

$E(y_i) = \mu + t_i$ (A5)

with ti the ith treatment effect. If there were, for example, three treatments, the model could be written as:

$E(y) = \mu X_0 + t_1 X_1 + t_2 X_2 + t_3 X_3$ (A6)

The values of Xi required to reproduce the model E(yi) = μ + ti for a given yi, using equation A6, are:

$X_0 = 1 \quad \text{and} \quad X_i = \begin{cases} 1 & \text{if the } i\text{th treatment is applied} \\ 0 & \text{otherwise} \end{cases}$

This can be expressed by the following matrices:

$\mathbf{y} = \mathbf{X}\mathbf{b}: \quad \begin{bmatrix} y_{11} \\ \vdots \\ y_{1j} \\ y_{21} \\ \vdots \\ y_{2j} \\ y_{31} \\ \vdots \\ y_{3j} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ t_1 \\ t_2 \\ t_3 \end{bmatrix}$

where the columns of the matrix X correspond to X0, X1, X2 and X3, respectively. A least-squares solution may again be obtained by the equation:

X’Xb = X’y (A7)

RESPONSE SURFACES

PARABOLA: Ey = b0 + b1x + b2x²

QUADRATIC SURFACE: Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² (5 parameters)

If log Y: Gaussian curve (parabola), bivariate Gaussian response surface (quadratic surface) if b2 and b4 are both negative

t-tests of the coefficients against 0

Test if surface is unimodal in direction of x1 by null hypothesis b2 ≥ 0 against b2 < 0 (t of b2)

b4 – test if surface is unimodal in direction of x2

Can also test if x2 influences abundance of y in addition to x1, i.e. do b3 and b4 = 0?

MORE COMPLEX MODELS

Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x3 + b6x3² + ... + btxm²

Hence need for selecting explanatory variables

PRESENCE-ABSENCE RESPONSE VARIABLE, MANY QUANTITATIVE EXPLANATORY VARIABLES

MULTIPLE LOGIT REGRESSION

Multiple logit regression, 2 explanatory variables:

$\log_e\left[\dfrac{p}{1-p}\right] = b_0 + b_1 x_1 + b_2 x_2$

Test for effects of x1 and x2: t-tests of b1 and b2.

Bivariate Gaussian logit surface, 2 explanatory variables:

$\log_e\left[\dfrac{p}{1-p}\right] = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_2 + b_4 x_2^2$

Three-dimensional view of a bivariate Gaussian logit surface with the probability of occurrence (p) plotted vertically and the two explanatory variables x1 and x2 plotted in the horizontal plane.

Elliptical contours of the probability of occurrence p plotted in the plane of the explanatory variables x1 and x2. One main axis of the ellipses is parallel to the x1 axis and the other to the x2 axis.

INTERACTION EFFECTS OF X1 AND X2

Product terms x1x2

Ey = b0 + b1x1 + b2x2 + b3x1x2

= (b0 + b2x2) + (b1 + b3x2) x1

Intercept and slope, and hence the effect of x1, depend on x2

Effect of x2 also depends on x1

If b3 = 0, NO INTERACTION between x1 and x2

Quadratic surface

Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x1x2

If b2 + b4 < 0 and 4b2b4 – b5² > 0, have unimodal surface with ellipsoidal contours, but axes not necessarily parallel to the x1 and x2 axes

Can calculate overall optimum

u1 = (b5b3 – 2b1b4) / d, where d = 4b2b4 – b5²

u2 = (b5b1 – 2b3b2) / d

Gaussian logit surface

$\log_e\left[\dfrac{p}{1-p}\right] = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_2 + b_4 x_2^2 + b_5 x_1 x_2$

If b5 ≠ 0, optimum with respect to x1 does depend on value of x2.

If b5 = 0, optimum with respect to x1 does not depend on values of x2, i.e. NO INTERACTION
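The overall optimum of a quadratic surface with a product term follows from setting both partial derivatives to zero, which gives the expressions for u1 and u2 above. The sketch below applies them and checks that the gradient vanishes at the result; the coefficient values and the function name are hypothetical.

```python
def surface_optimum(b1, b2, b3, b4, b5):
    """Optimum (u1, u2) of Ey = b0 + b1*x1 + b2*x1**2 + b3*x2 + b4*x2**2 + b5*x1*x2."""
    d = 4 * b2 * b4 - b5 ** 2
    # Conditions for a unimodal surface (a maximum with elliptical contours)
    if not (b2 + b4 < 0 and d > 0):
        raise ValueError("surface is not unimodal with a maximum")
    u1 = (b5 * b3 - 2 * b1 * b4) / d
    u2 = (b5 * b1 - 2 * b3 * b2) / d
    return u1, u2


# Hypothetical coefficients (assumption: chosen only for illustration)
b1, b2, b3, b4, b5 = 4.0, -1.0, 6.0, -2.0, 1.0
u1, u2 = surface_optimum(b1, b2, b3, b4, b5)

# Partial derivatives of Ey at (u1, u2); both should be zero at the optimum
gradient_x1 = b1 + 2 * b2 * u1 + b5 * u2
gradient_x2 = b3 + 2 * b4 * u2 + b5 * u1
print(u1, u2, gradient_x1, gradient_x2)
```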

SELECTING EXPLANATORY VARIABLES

• If model is balanced, parameters can be entered or removed in any order

• Adequate model: Non-significantly different from the best model

• Best subset method for selecting variables

Try all possible combinations, select the best

Look at the others as well

• Automatic selection of variables does not necessarily give the best subset

Backward elimination: Start with all variables, then remove variables starting with the worst, and continue until all remaining are significant

Forward selection: Start with nothing, add best, as long as the new variables are significant

Stepwise: Start with forward selection, but try backward elimination after every step

J.D. Olden & D.A. Jackson (2000) Ecoscience 7, 501-510. Torturing data for the sake of generality: how valid are our regression models?

AKAIKE INFORMATION CRITERION (AIC)

Index of fit that takes account of the parsimony of the model by penalising for the number of parameters. The more parameters in a model, the better the fit. You get a perfect fit if you have a parameter for every data point but the model has no explanatory power.

Trade-off between goodness of fit and the number of parameters required by parsimony.

AIC useful as it explicitly penalises any superfluous parameters in the model by adding 2p to the variance or deviance.

AIC = -2 x (maximised log-likelihood) + 2 x (number of parameters)

Small values are indicative of a good fit to the data.

In multiple regression, AIC is just the residual variance plus twice the number of regression coefficients (including the intercept).

Used to compare the fit of alternative models with different numbers of parameters, and thus useful in model selection.

Smaller the AIC, better the fit.

Given the alternative models involving different numbers of parameters, select the model with the lowest AIC.
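A small sketch of model comparison by AIC = -2 × (maximised log-likelihood) + 2 × (number of parameters). It reuses the residual deviances of the two logit models quoted earlier in the lecture (43.02 and 23.17) and relies on one stated assumption: for ungrouped presence/absence (0/1) data the residual deviance equals -2 × the maximised log-likelihood, so AIC can be computed as deviance + 2p. The function and dictionary names are illustrative.

```python
def aic_from_deviance(residual_deviance, n_parameters):
    """AIC for 0/1 data, assuming residual deviance = -2 * maximised log-likelihood."""
    return residual_deviance + 2 * n_parameters


# (residual deviance, number of parameters) for the two logit models
models = {
    "sigmoid (b0, b1)": (43.02, 2),
    "Gaussian logit (b0, b1, b2)": (23.17, 3),
}

aic = {name: aic_from_deviance(dev, p) for name, (dev, p) in models.items()}

# Select the model with the lowest AIC
best_model = min(aic, key=aic.get)
print(aic, best_model)
```

Here the extra parameter of the Gaussian logit model costs 2 units of AIC but reduces the deviance by far more, so it is preferred.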


MANY EXPLANATORY NOMINAL OR NOMINAL AND QUANTITATIVE VARIABLES

Three soil types - clay, peat, sand

Clay - reference class

Peat - dummy variable x2

Sand - dummy variable x3

x2 = 1 when peat, 0 when clay or sand

x3 = 1 when sand, 0 when clay or peat

k classes, k – 1 dummy variables

Systematic part Ey = b1 + b2x2 + b3x3

b1 = expected response in reference class (clay)

b2 = difference in expected response between peat and clay

b3 = difference in expected response between sand and clay

Multiple logit regression - +/– response variable, one continuous variable (x1) and one nominal variable (3 classes (x2, x3))

$\log_e\left[\dfrac{p}{1-p}\right] = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_2 + b_4 x_3$
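The dummy-variable coding for the three soil types can be illustrated as follows. The sketch codes clay as the reference class, computes b1, b2 and b3 from the class means (for a one-way layout the least-squares expected responses are the class means), and then checks that these coefficients satisfy the normal equations X’(y - Xb) = 0. The response values are hypothetical.

```python
# Hypothetical responses with their soil type (assumption: not the lecture data)
soil = ["clay", "clay", "clay", "peat", "peat", "peat", "sand", "sand"]
y = [1.5, 1.9, 1.7, 3.0, 3.4, 3.1, 2.2, 2.4]


def dummy_row(soil_type):
    """Row of the design matrix: intercept, x2 (peat), x3 (sand); clay is the reference."""
    return [1, 1 if soil_type == "peat" else 0, 1 if soil_type == "sand" else 0]


X = [dummy_row(s) for s in soil]


def class_mean(soil_type):
    values = [yi for s, yi in zip(soil, y) if s == soil_type]
    return sum(values) / len(values)


# Coefficients expressed through the class means
b = [
    class_mean("clay"),                       # b1: expected response in reference class
    class_mean("peat") - class_mean("clay"),  # b2: peat minus clay
    class_mean("sand") - class_mean("clay"),  # b3: sand minus clay
]

# Fitted values and residuals
fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Normal equations: each column of X is orthogonal to the residuals
score = [sum(row[j] * e for row, e in zip(X, residuals)) for j in range(3)]
print(b, score)
```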

Response curves for Equisetum fluviatile fitted by multiple logit regression of the occurrence of E. fluviatile in freshwater ditches on the logarithm of electrical conductivity (EC) and soil type surrounding the ditch (clay, peat, sand). Data from de Lange (1972).

Residual deviance tests to test if maxima are different by dropping x2 and x3.

ASSESSING ASSUMPTIONS OF REGRESSION MODEL

Regression diagnostics – Faraway (2005) chapter 4

Linear least-squares regression

1. relationship between Y and X is linear, perhaps after transformation

2. variance of random error is constant for all observations

3. errors are normally distributed

4. errors for n observations are independently distributed

Assumption (2) required to justify choosing estimates of b parameters so as to minimise residual SS and needed in tests of t and F values. Clearly in minimising SS residuals, essential that no residuals should be larger than others.

Assumption (3) needed to justify significance tests and confidence intervals.

RESIDUAL PLOTS

Plot (Y – Ŷ) against Ŷ or X

Residual plots from the multiple regression of gene frequencies on environmental variables for Euphydryas editha:

(a) standardised residuals plotted against Y values from the regression equation,

(b) standardised residuals against X1,

(c) standardised residuals against X2,

(d) standardised residuals against X3,

(e) standardised residuals against X4, and

(f) normal probability plot.

Normal probability plot – plot ordered standardised residuals against expected values assuming standard normal distribution. If (Yi – Ŷi) is the ith ordered standardised residual, its expected value is the value of the standard normal distribution that exceeds a proportion {i – (⅜)} / (n + (¼)) of the values in the full population.

$\text{Standardised residual} = \dfrac{Y - \hat{Y}}{\sqrt{\mathrm{MSE}}}$
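A sketch of the two calculations behind a normal probability plot: standardising residuals by √MSE, and finding the expected normal values from the proportions {i – ⅜}/(n + ¼). It uses `statistics.NormalDist` from the Python standard library; the observed and fitted values, and the assumption of two fitted parameters, are hypothetical.

```python
import math
from statistics import NormalDist


def standardised_residuals(observed, fitted, n_parameters):
    """(Y - Yhat) / sqrt(MSE), with MSE = residual SS / (n - number of parameters)."""
    residuals = [yi - fi for yi, fi in zip(observed, fitted)]
    mse = sum(e * e for e in residuals) / (len(residuals) - n_parameters)
    return [e / math.sqrt(mse) for e in residuals]


def expected_normal_values(n):
    """Standard normal quantiles at the proportions (i - 3/8) / (n + 1/4), i = 1..n."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 3 / 8) / (n + 1 / 4)) for i in range(1, n + 1)]


# Hypothetical observed and fitted responses from a model with 2 parameters
observed = [2.1, 2.9, 4.2, 4.8, 6.1]
fitted = [2.0, 3.0, 4.0, 5.0, 6.0]

std_res = standardised_residuals(observed, fitted, n_parameters=2)
scores = expected_normal_values(len(std_res))

# Points for the plot: ordered standardised residuals against expected values
plot_points = list(zip(scores, sorted(std_res)))
print(plot_points)
```

If the normality assumption holds, the plotted points should fall close to a straight line.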

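SIMPLE WEIGHTED AVERAGE REGRESSION

OPTIMA

+/– data (sum over the nk sites at which species k is present):

$\hat{u}_k = \dfrac{1}{n_k} \sum_{i=1}^{n_k} x_i$

Abundance data (yik = abundance of species k at site i):

$\hat{u}_k = \dfrac{\sum_{i=1}^{n} y_{ik} x_i}{\sum_{i=1}^{n} y_{ik}}$

TOLERANCES

+/– data:

$\hat{t}_k = \left[ \dfrac{1}{n_k} \sum_{i=1}^{n_k} (x_i - \hat{u}_k)^2 \right]^{1/2}$

Abundance data:

$\hat{t}_k = \left[ \dfrac{\sum_{i=1}^{n} y_{ik} (x_i - \hat{u}_k)^2}{\sum_{i=1}^{n} y_{ik}} \right]^{1/2}$

Software: WACALIB, CALIB, C2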
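A minimal sketch of weighted-averaging estimates of a species optimum and tolerance, using the abundance-weighted mean and abundance-weighted standard deviation of the environmental variable. Passing 0/1 values instead of abundances gives the presence/absence versions. This is not the implementation used in WACALIB, CALIB or C2, and the data are hypothetical.

```python
import math


def wa_optimum(abundance, x):
    """Weighted average of x, with species abundances as weights."""
    return sum(yi * xi for yi, xi in zip(abundance, x)) / sum(abundance)


def wa_tolerance(abundance, x):
    """Weighted standard deviation of x about the weighted-average optimum."""
    u = wa_optimum(abundance, x)
    weighted_ss = sum(yi * (xi - u) ** 2 for yi, xi in zip(abundance, x))
    return math.sqrt(weighted_ss / sum(abundance))


# Hypothetical site values of the environmental variable and species abundances
x = [4.0, 5.0, 6.0, 7.0, 8.0]
abundance = [0, 2, 6, 2, 0]

u_hat = wa_optimum(abundance, x)
t_hat = wa_tolerance(abundance, x)

# Presence/absence version: weights are 1 where the species occurs, 0 elsewhere
presence = [1 if yi > 0 else 0 for yi in abundance]
u_hat_pa = wa_optimum(presence, x)
print(u_hat, t_hat, u_hat_pa)
```

Sites where the species is absent receive zero weight, so absences do not influence the estimates.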

ter Braak & Looman (1986) Vegetatio 65: 3-11

+/– data - WA just as good as GLR when:

1. species is rare and has narrow tolerance

2. distribution of environmental variable amongst sites is reasonably homogeneous over range of species occurrences

3. site scores (xi) are closely spaced in comparison with species amplitude or tolerance

Abundance data:

1. Poisson distributed

2. sites homogeneously distributed

DISREGARDS ABSENCES - DEPENDS ON DISTRIBUTION OF EXPLANATORY VARIABLE X

WEIGHTED AVERAGES ARE GOOD ESTIMATES

... of species optima if:

1. Sites x are evenly distributed about optimum u

2. Sites are close to each other

... of gradient values if:

1. Species optima u are evenly distributed about site x

2. All species have equal response widths t

3. All species have equal maximum abundance h

4. Optima u are close to each other

Conditions are strictly true only for infinite gradients.

J. Oksanen (2002)

BIAS AND TRUNCATION IN WEIGHTED AVERAGING

Weighted averages are usually good estimates of Gaussian optima, unless the response is truncated. Overestimation at the low end of the gradient, underestimation at the high end of the gradient.

Slight bias towards the gradient centre: shrinkage of WA estimates


J. Oksanen (2002)

MODEL II REGRESSION

When both the response and predictor variables of the model are random (not controlled by the researcher), there is error associated with measurements of both x and y.

This is model II regression

Examples:

Body mass and length

In vivo fluorescence and chlorophyll a

Respiration rate and biomass

Want to estimate the parameters of the equation that describes the relationship between pairs of random variables.

Must use model II regression for parameter estimation, as the slope found by ordinary least-squares regression (model I regression) may be biased by the presence of measurement error in the predictor variable.

MODEL II REGRESSION METHODS

Choice of model II regression method depends on the reasons for use and on the features of data

Method | Use and data | Test possible
OLS | Error on y >> error on x | Yes
MA | Distribution is bivariate normal; variables are in the same physical units or dimensionless; variance of error about the same for x and y |
RMA | Distribution is bivariate normal; error variance on each axis proportional to variance of corresponding variable; check scatter diagram: no outliers | Yes
SMA | Distribution is bivariate normal; error variance on each axis proportional to variance of corresponding variable; correlation r is significant | No
OLS | Distribution is not bivariate normal; relationship between x and y is linear | Yes
OLS | To compute forecasted (fitted) or predicted y values (regression equation and confidence intervals are irrelevant) | Yes
MA | To compare observations to model predictions | Yes

OLS = ordinary least squares regression

MA = major axis regression

SMA = standard major axis regression

RMA = ranged major axis regression

Software: MODEL II (www.fas.umontreal.ca/biol/legendre)

MODEL II REGRESSION METHODS (continued)

(1) Major axis regression (MA) is the first principal component of the scatter of points. This axis minimises the squared Euclidean distances between the points and the regression line instead of the vertical distances as in OLS

(2) Standard major axis regression (SMA) is a way to make the variables dimensionally homogeneous prior to regression.

i) standardise variables x and y (subtract mean, divide by standard deviation)

ii) compute MA regression on standardised x and y

iii) back-transform the slope estimate to the original units by multiplying it by sy/sx where s = standard deviations of y and x.

MODEL II REGRESSION METHODS (continued)

(3) Ranged major axis regression (RMA)

A disadvantage of SMA regression is that the standardisation makes the variances equal.

In RMA, variables are made dimensionally homogeneous by ranging

i) transform variables x and y by ranging:

$y'_i = \dfrac{y_i - y_{min}}{y_{max} - y_{min}}, \qquad x'_i = \dfrac{x_i - x_{min}}{x_{max} - x_{min}}$

ii) compute MA regression on ranged y and x

iii) back-transform the slope estimate to the original units by multiplying it by the ratio of the ranges (ymax – ymin)/(xmax – xmin)

(4) Ordinary least squares regression (OLS)

Assumes no error on x. If error on y >> error on x, OLS can be used to estimate the slope parameter
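The four slope estimates can be compared with a short sketch. The MA slope is computed as the slope of the first principal axis of the covariance matrix of x and y, the SMA slope as sy/sx carrying the sign of the covariance, and the RMA slope by ranging, MA regression and back-transformation as described above. Function names and data are hypothetical, and this is not the code of the MODEL II program.

```python
import math


def moments(x, y):
    """Sample variances of x and y and their covariance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    return sxx, syy, sxy


def ols_slope(x, y):
    sxx, _, sxy = moments(x, y)
    return sxy / sxx


def ma_slope(x, y):
    """Slope of the first principal axis (requires a non-zero covariance)."""
    sxx, syy, sxy = moments(x, y)
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


def sma_slope(x, y):
    """Ratio of standard deviations, with the sign of the covariance."""
    sxx, syy, sxy = moments(x, y)
    return math.copysign(math.sqrt(syy / sxx), sxy)


def rma_slope(x, y):
    """Range both variables to [0, 1], fit MA, then back-transform the slope."""
    x_range, y_range = max(x) - min(x), max(y) - min(y)
    x_ranged = [(xi - min(x)) / x_range for xi in x]
    y_ranged = [(yi - min(y)) / y_range for yi in y]
    return ma_slope(x_ranged, y_ranged) * y_range / x_range


# Hypothetical paired measurements, both subject to error
length = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mass = [1.4, 1.8, 3.3, 3.9, 4.6, 6.2]

slopes = {
    "OLS": ols_slope(length, mass),
    "MA": ma_slope(length, mass),
    "SMA": sma_slope(length, mass),
    "RMA": rma_slope(length, mass),
}
print(slopes)
```

Because the OLS slope equals r × sy/sx, its absolute value can never exceed that of the SMA slope; the methods coincide only when the points lie exactly on a line.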

STATISTICAL TESTING FOR MODEL II REGRESSION

Confidence intervals – with all methods, confidence intervals are large when n is small. Become smaller as n reaches about 60, after which they change very slowly. Model II regression should ideally be used with data sets with 60 or more observations. Confidence intervals for slope and intercept possible for MA, SMA, RMA, and OLS.

Statistical significance of slope – can be assessed by permutation tests for the slopes of MA, OLS, and RMA and for the correlation coefficient r. Cannot test by permutation the slope in SMA as the slope estimate is sy/sx and for all permuted data sy/sx is constant. All one can do is to test the correlation rxy instead of testing bSMA.

General advice is to compute MA, RMA, SMA, and OLS and evaluate results carefully in light of the features of the data (magnitude of errors, distributions) and the purpose of the regression.

Legendre & Legendre (1998) pp. 500-517

McArdle (1988) Can. J. Zool. 66, 2329-2339

COMPUTING SOFTWARE FOR REGRESSION ANALYSIS

Basic regression

MINITAB

SYSTAT

GENSTAT or GLIM

STATISTIX (SX)

R or S-PLUS

Weighted average regression

C2

Model II regression

MODEL II
