Generalized Linear Models

Contents

1 Introducing GLMs
  1.1 Examples of GLMs
  1.2 Inference with GLMs
  1.3 glm in R
  1.4 glm in R: heart attack example
2 Theory of GLMs
  2.1 The exponential family of distributions
  2.2 Fitting generalized linear models
  2.3 The sampling distribution of β̂
  2.4 Comparing models by hypothesis testing
    2.4.1 Deviance
    2.4.2 Model comparison with unknown φ
  2.5 φ̂ and Pearson's statistic
  2.6 Residuals and model checking
    2.6.1 Pearson residuals
    2.6.2 Deviance residuals
    2.6.3 Residual plots
3 Linking computation and theory
  3.1 Model formulae and the specification of GLMs
    3.1.1 An example
4 Using the distributional results
  4.1 Confidence interval calculation
  4.2 Single parameter tests
  4.3 Hypothesis testing by model comparison
    4.3.1 Known scale parameter example
    4.3.2 Unknown scale parameter testing example
5 Model selection more generally
  5.1 Hypothesis testing based model selection
  5.2 Prediction error based model selection
  5.3 Remarks on model selection
  5.4 Interpreting model coefficients



1 Introducing GLMs

A linear model is a statistical model that can be written

y_i = X_i β + ε_i,   ε_i ~ N(0, σ²) i.i.d.,

where y_i is a response variable, X is a model matrix with elements usually depending on some predictor variables, and the ε_i are random variables. β is a vector of unknown parameters, and the purpose of statistical inference with a linear model is to learn about β from the data.

An exactly equivalent way of writing the linear model is

E(y_i) ≡ μ_i = X_i β,   y_i ~ N(μ_i, σ²) independent.

Generalized linear models extend linear models by allowing some non-linearity in the model structure and much more flexibility in the specification of the distribution of the response variable y_i. Specifically, a GLM is a statistical model that can be written as

E(y_i) ≡ μ_i = h(X_i β),   y_i independent, with an exponential family distribution,

where h is any smooth monotonic function and the exponential family of distributions includes distributions such as the Poisson, Gaussian (normal), binomial and gamma. A feature of exponential family distributions is that their shape is largely determined by their mean, μ_i, and possibly one other scale parameter, usually denoted φ (e.g. for the normal distribution φ = σ²; for the Poisson, φ = 1). For such distributions it is always possible to find a variance function V of μ_i such that

var(y_i) = V(μ_i)φ.

As we will see later, it is actually possible to relax the GLM distributional assumption and simply specify V, using the theory of quasi-likelihood.

For historical reasons it is usual to write GLMs in terms of the (smooth monotonic) link function, g, which is the inverse function of h, i.e.

g(μ_i) = X_i β,   y_i independent, with an exponential family distribution.

Examples of commonly used link functions are the log, square root and logit (log odds ratio) functions (see later). The model is written this way because statisticians were (and are) used to thinking about models for transformed response data (i.e. transformed y_i). However, it is important to realize that modelling some data using a log link (for example) is very different to modelling log(y_i) using a linear model. Note that Xβ is known as the linear predictor (and is often given the symbol η).

    1.1 Examples of GLMs

    Example 1: AIDS in Belgium.

[Figure: new AIDS cases per year in Belgium plotted against year since 1980.]

(Predictor variables are also referred to as explanatory variables or covariates.)


The above figure shows new AIDS cases each year in Belgium, at the start of the epidemic. In the early stages of an epidemic an exponential increase model is often appropriate, and a GLM can be used to fit such a model. If y_i is the number of new AIDS cases per year and t_i denotes the number of years since 1980, then a suitable model for the data might be

E(y_i) ≡ μ_i = γ exp(δ t_i),   y_i ~ Poi(μ_i) independent.

Taking logs of both sides of the above equation and defining β₁ ≡ log(γ) and β₂ ≡ δ, we can re-write the model as

log(μ_i) = β₁ + t_i β₂,   y_i ~ Poi(μ_i) independent,

which is clearly a GLM with a log link and a model matrix whose ith row is X_i = [1, t_i].
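The re-parameterization above is easy to check numerically. The following sketch (not part of the original notes; the parameter values are invented for illustration) confirms that μ_i = γ exp(δ t_i) and log(μ_i) = β₁ + β₂ t_i describe the same mean once β₁ = log(γ) and β₂ = δ:

```python
import math

# Hypothetical parameter values, for illustration only.
gamma, delta = 2.5, 0.3
beta1, beta2 = math.log(gamma), delta   # beta1 = log(gamma), beta2 = delta

for t in range(1, 14):                  # years since 1980
    mu_original = gamma * math.exp(delta * t)   # mu = gamma * e^(delta t)
    mu_glm = math.exp(beta1 + beta2 * t)        # log(mu) = beta1 + beta2 t
    assert math.isclose(mu_original, mu_glm)

print("parameterizations agree")
```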

    Example 2: Hen harriers and Grouse

[Figure: daily Consumption.Rate of grouse by hen harriers plotted against Grouse.Density.]

The above plot shows the daily consumption rate of grouse by hen harriers (a type of bird of prey) plotted against the density of grouse on various grouse moors. If c_i denotes consumption rate and d_i is grouse density, then ecological theory suggests a saturating model for the data

E(c_i) ≡ μ_i = α d_i³ / (δ + d_i³),   c_i independent, gamma distributed,

where α and δ are parameters to be estimated. Using the inverse link function we get

1/μ_i = 1/α + (δ/α)(1/d_i³),   c_i independent, gamma distributed.

So defining new parameters β₁ = 1/α and β₂ = δ/α we get the GLM

1/μ_i = β₁ + β₂(1/d_i³),   c_i independent, gamma distributed,

i.e. the GLM with an inverse link, gamma distribution for the response, and model matrix whose ith row is X_i = [1, 1/d_i³].
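Again, the re-parameterization can be verified numerically. The sketch below (not from the notes; α and δ values are invented) checks that 1/μ = β₁ + β₂/d³ with β₁ = 1/α, β₂ = δ/α reproduces the saturating curve, and that μ approaches the asymptote α = 1/β₁ as density grows:

```python
import math

# Hypothetical values for alpha and delta, chosen only for illustration.
alpha, delta = 0.3, 2.0e5
beta1, beta2 = 1.0 / alpha, delta / alpha

for d in (10.0, 40.0, 80.0, 120.0):      # grouse densities
    mu = alpha * d**3 / (delta + d**3)   # saturating form
    inv_mu = beta1 + beta2 / d**3        # GLM form: 1/mu = beta1 + beta2/d^3
    assert math.isclose(1.0 / mu, inv_mu)

# As d grows, mu approaches the asymptote alpha = 1/beta1.
mu_large = alpha * 1e6**3 / (delta + 1e6**3)
assert abs(mu_large - alpha) < 1e-9
print("inverse-link parameterization agrees")
```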

Example 3: Heart Attacks and Creatine Kinase

The following data are from a study examining the efficacy of blood creatine kinase levels as a diagnostic when patients present with symptoms that may indicate a heart attack.

CK level          20  60 100 140 180 220 260 300 340 380 420 460
Heart Attack       2  13  30  30  21  19  18  13  19  15   7   8
Not Heart Attack  88  26   8   5   0   1   1   1   1   0   0   0

Here is a plot of the proportion of patients who subsequently turned out to have had a heart attack, against their blood CK levels on admission to hospital.


[Figure: observed Proportion HA (proportion with heart attack) plotted against CK level.]

A convenient model that captures the saturating nature of the relationship between the observed proportions, p_i, and the CK levels, x_i, is the logistic model

E(p_i) = e^(β₁+β₂x_i) / (1 + e^(β₁+β₂x_i)).

If y_i is the number of heart attack victims observed out of N_i patients with CK level x_i then

μ_i ≡ E(y_i) = N_i E(p_i) = N_i e^(β₁+β₂x_i) / (1 + e^(β₁+β₂x_i)),

and treating the patients as independent we have y_i ~ bin(μ_i/N_i, N_i). This model doesn't look linear, but if we apply the logit link function to both sides it becomes

log( μ_i / (N_i − μ_i) ) = β₁ + β₂ x_i,

which is clearly a GLM.
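The logit-link algebra above can be checked directly: since μ/(N − μ) = p/(1 − p) = e^η, applying the logit recovers the linear predictor exactly. A short sketch (coefficients and group size are invented, not fitted values from the notes):

```python
import math

# Hypothetical coefficients and group size, for illustration only.
b1, b2, N = -3.0, 0.02, 50

for x in (20, 140, 300, 460):   # CK levels from the data table
    eta = b1 + b2 * x
    mu = N * math.exp(eta) / (1 + math.exp(eta))   # mu = N e^eta / (1 + e^eta)
    logit = math.log(mu / (N - mu))                # logit link applied to mu
    assert math.isclose(logit, eta)
print("logit link recovers the linear predictor")
```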

Example 4: Linear models!

Any linear model is just a special case of a GLM. The link function is the identity link and the response distribution is Gaussian.

    1.2 Inference with GLMs

Inference with GLMs is based on the theory of maximum likelihood estimation. That is, given parameters β, we can write down f(y; β), the probability or probability density function of the response y. Plugging the observed data y_obs into f, and treating it as a function of β, we get the likelihood function for β,

L(β) = f(y_obs; β).

The idea is that values of β that make the observed data appear relatively probable are more likely to be correct than those that make them appear relatively improbable. Taking this notion to its logical conclusion, the most likely values for the parameters are those that cause the likelihood to be as large as possible: these are the maximum likelihood estimates, β̂. For GLMs the likelihood is actually maximized w.r.t. β by iteratively re-weighted least squares (IRLS), so that successively improved estimates of β are found by fitting working linear models to suitably transformed response data.

The estimates, β̂, do not depend on the scale parameter, φ, but when an estimate of φ is required (e.g. the variance of the Gaussian) then this is usually obtained separately, and not by MLE.

As we will see, large sample results turn out to imply that

β̂ ∼ N(β, (XᵀWX)⁻¹φ)   (1)

(Footnote: it is usual not to distinguish notationally between the observed data and the arguments of the p(d)f, so this will not be done in the rest of these notes.)


where W is a diagonal matrix such that W_ii = {V(μ_i) g′(μ_i)²}⁻¹. This result can be used to obtain approximate confidence intervals for elements of β.

Model comparison is done in one of two ways. Let l(β) = log{L(β)}. A hypothesis test that the smaller of two nested models is correct is conducted using the generalized likelihood ratio test result. If β̂₀ are the MLEs of a reduced version of a model with MLEs β̂₁, then if the reduced model is really correct we have that

2{l(β̂₁) − l(β̂₀)} ∼ χ²_(dim(β₁)−dim(β₀)).

If the quantity on the l.h.s. is too large for consistency with the distribution on the r.h.s., then we would doubt the hypothesis. If φ is unknown then this result is not directly usable, and an F-ratio based generalization is required.

The second way of comparing models is via Akaike's Information Criterion (AIC, which Akaike himself called An Information Criterion). Rather than sticking with the simpler of two models until there is strong evidence that this is untenable, as with hypothesis testing, one instead chooses the model that is estimated to do the best job of predicting new replicate data, to which it was not fitted. Using this approach, whichever model minimizes

AIC = −2 l(β̂) + 2 dim(β)

is selected.

Model checking for GLMs is performed using residuals, in the same way as for linear models. The difference is that the distribution of GLM residuals will depend on the response distribution used, which makes raw residuals difficult to interpret. For this reason residuals are usually standardized, so that they behave somewhat like the residuals from a linear model. For example, a simple standardization is to divide each residual by its model-predicted scaled standard deviation, so that all standardized residuals should have the same variance if the model is correct, i.e.

ε̂_i = (y_i − μ̂_i) / √V(μ̂_i);

these are called Pearson residuals.

    1.3 glm in R

R has a function called glm for fitting GLMs to data. glm functions much like lm, except that the r.h.s. of the model formula now defines the way in which the link function of the expected response depends on the predictor variables. In addition you have to tell glm what response variable distribution and what link function to use: this is done using the family argument, as we will see.

To see glm in action, consider the hen harrier data again. The data are stored in a data frame called harrier. First plot them.

    with(harrier,plot(Grouse.Density,Consumption.Rate,ylim=c(0,.4),xlim=c(0,130)))

[Figure: plot of the harrier data, Consumption.Rate against Grouse.Density.]

    R is available free from http://cran.r-project.org


Now fit the model discussed previously:

hm <- glm(Consumption.Rate ~ I(1/Grouse.Density^3),
          family=Gamma(link="inverse"), data=harrier)
hm

Call: glm(formula = Consumption.Rate ~ I(1/Grouse.Density^3),
    family = Gamma(link = "inverse"), data = harrier)

Coefficients:
          (Intercept)  I(1/Grouse.Density^3)
            4.676e+00              5.386e+05

Degrees of Freedom: 32 Total (i.e. Null);  31 Residual
Null Deviance:     16.17
Residual Deviance: 10.47   AIC: -92.38

    As with any model, we should check residuals before proceeding.

par(mfrow=c(2,2)) ## get all plots on one page
plot(hm)          ## plot residuals in various ways

[Figure: the four default residual plots for hm: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

These residual plots are interpreted in much the same way as the equivalent plots for a linear model, with two differences: (i) if the response distribution is not normal/Gaussian then we don't expect the normal QQ plot to show an exact straight line relationship, and (ii) if our response is binary then checking is more difficult, with many plots being effectively useless.

For the harrier model there are some clear patterns in the residuals, with the data for some densities being entirely above or below the fitted line. For this simple example, it can help to plot model predictions and data on the same plot. To make predictions at a series of grouse densities, we can use the predict function.

    pd

    df       AIC
hm   3 -92.38289
hm1  3 -94.17016
hm2  3 -92.63887

So AIC actually supports the model with a quadratic dependence on density. Such a model saturates more slowly than the original model, but also has problematic residual plots. Finally, for a larger summary of a model, use the summary function...

> summary(hm1)

Call:
glm(formula = Consumption.Rate ~ I(1/Grouse.Density^2),
    family = Gamma(link = "inverse"), data = harrier)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.05990  -0.43304  -0.01186   0.20319   1.12450

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.627      1.315   1.998 0.054558 .
I(1/Grouse.Density^2) 17674.876   4359.132   4.055 0.000314 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.3044807)

    Null deviance: 16.1657  on 32  degrees of freedom
Residual deviance:  9.9461  on 31  degrees of freedom
AIC: -94.17

Number of Fisher Scoring iterations: 5

The output here is much the same as that for a linear model, except that residual sums of squares are replaced by deviances, of which more later. Note also that an estimate of the scale parameter of the gamma distribution is provided: for some other distributions this parameter is just the known constant 1.

    1.4 glm in R: heart attack example

Before getting more rigorous about the theory of GLMs it is worth going over one more practical example in R. Again consider the heart attack data and model from section 1.1. First read in the data and plot them.

    ck

[Figure: observed Proportion Heart Attack plotted against Creatinine kinase level.]

Recall that our basic model for these data is that, if y_i is the number of heart attack victims out of N_i patients at CK level x_i, then

y_i ~ binom(μ_i/N_i, N_i),

where E(y_i) ≡ μ_i, g(μ_i) = β₀ + β₁x_i and g is the logit link

g(μ_i) = log( μ_i / (N_i − μ_i) ).

When using binomial models, we need to somehow supply the model fitting function with information about N_i as well as y_i. R offers two ways of doing this with glm.

1. The response variable can be the observed proportion of successful binomial trials, in which case an array giving the number of trials must be supplied as the weights argument to glm. For binary data, no weights vector need be supplied, as the default weights of 1 suffice.

2. The response variable can be supplied as a two column array, in which the first column gives the number of binomial successes, and the second column is the number of binomial failures.

For the current example the second method will be used. Supplying two arrays on the l.h.s. of the model formula involves using cbind. Here is a glm call which will fit the heart attack model:

> mod.0 <- glm(cbind(ha,ok) ~ ck, family=binomial, data=heart)

The printed output also reports the model deviance, defined as

D = 2{l_max − l(β̂)}φ,

where l_max is the largest value that the log-likelihood could take for the data being fitted (which is the maximized log-likelihood for a model with one parameter per datum). The deviance is defined in this way so that it behaves a little bit like the residual sum of squares for a linear model. We'll cover deviance in more depth later. For the moment note that for distributions for which φ = 1, the deviance should often approximately follow a χ²_(n−dim(β)) distribution if the model is correct (although the approximation is not great).

In the output, the Null deviance is the deviance for a model with just a constant term, while the Residual deviance is the deviance of the fitted model. These can be combined to give the proportion deviance explained, a generalization of r², as follows:

    > (271.7-36.93)/271.7

    [1] 0.864078

AIC is the Akaike Information Criterion for the model (it could also have been extracted using AIC(mod.0)). Notice that the deviance is quite high for the χ²₁₀ random variable that it should approximate if the model is fitting well. In fact

> 1-pchisq(36.93,10)
[1] 5.819325e-05

shows that there is a very small probability of a χ²₁₀ random variable being as large as 36.93. The residual plots also suggest a poor fit.
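The pchisq value can be cross-checked by hand: for an even number of degrees of freedom 2k, the χ² survival function has the closed form P(X > x) = e^(−x/2) Σ_{i<k} (x/2)^i / i!. A verification sketch (in Python rather than R, purely illustrative):

```python
import math

def chisq_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-squared variable with EVEN df,
    via the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    assert df % 2 == 0
    h = x / 2.0
    return math.exp(-h) * sum(h**i / math.factorial(i) for i in range(df // 2))

# Reproduces R's 1 - pchisq(36.93, 10).
p = chisq_sf_even_df(36.93, 10)
print(p)  # approximately 5.82e-05
```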

> op <- par(mfrow=c(2,2))
> plot(mod.0)

[Figure: the four default residual plots for mod.0: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

Again, the plots have much the same interpretation as the model checking plots for an ordinary linear model, except that it is now standardized residuals that are plotted (actually deviance residuals, see later), the predicted values are on the scale of the linear predictor rather than the response, and some departure from a straight line relationship in the Normal QQ plot is often to be expected. The plots are not easy to interpret when there are so few data, but there appears to be a trend in the mean of the residuals plotted against fitted value, which would cause concern. Furthermore, the first point has very high influence. Note that the interpretation of the residuals would be much more difficult for binary data (see later).

Notice how the problems do not stand out so clearly from a plot of the fitted values overlaid on the raw estimated probabilities:


> plot(heart$ck,p,xlab="Creatinine kinase level",
+      ylab="Proportion Heart Attack")
> lines(heart$ck,fitted(mod.0))

[Figure: fitted values from mod.0 overlaid on the observed Proportion Heart Attack against Creatinine kinase level.]

Note also that the fitted values provided by glm for binomial models are the estimated p_i's, rather than the estimated μ_i's.

The trend in the mean of the residuals suggests trying a cubic linear predictor, rather than the initial straight line.

> mod.2 <- glm(cbind(ha,ok) ~ ck + I(ck^2) + I(ck^3), family=binomial, data=heart)
> mod.2

Call: glm(formula=cbind(ha,ok)~ck+I(ck^2)+I(ck^3),family=binomial,data=heart)

Coefficients:
(Intercept)           ck      I(ck^2)      I(ck^3)
 -5.786e+00    1.102e-01   -4.648e-04    6.448e-07

Degrees of Freedom: 11 Total (i.e. Null);  8 Residual
Null Deviance:     271.7
Residual Deviance: 4.252   AIC: 33.66
> par(mfrow=c(2,2))
> plot(mod.2)

[Figure: the four default residual plots for mod.2.]

Clearly 4.252 is not too large for consistency with a χ²₈ distribution (it is less than the expected value, in fact) and the AIC has improved substantially. The residual plots now show less clear patterns than for the previous model, although if we had more data then such a departure from constant variance would be a cause for concern. Furthermore, the fit is clearly closer to the data now:

par(mfrow=c(1,1))
plot(heart$ck,p,xlab="Creatinine kinase level",
     ylab="Proportion Heart Attack")
lines(heart$ck,fitted(mod.2))

[Figure: fitted values from mod.2 overlaid on the observed Proportion Heart Attack against Creatinine kinase level.]

We can also get R to test the null hypothesis that mod.0 is correct against the alternative that mod.2 is required. Somewhat confusingly, the anova function is used to do this, although it is a generalized likelihood ratio test that is being performed, and not an analysis of variance.

> anova(mod.0,mod.2,test="Chisq")
Analysis of Deviance Table

Model 1: cbind(ha, ok) ~ ck
Model 2: cbind(ha, ok) ~ ck + I(ck^2) + I(ck^3)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        10     36.929
2         8      4.252  2   32.676 8.025e-08

A p-value this low indicates very strong evidence against the null hypothesis: we really do need model 2. Note that this comparison of models has a much firmer theoretical basis than the examination of the individual deviances had.
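The tabulated p-value can also be reproduced by hand, since for 2 degrees of freedom the χ² survival function reduces to P(X > x) = e^(−x/2). A one-line check (Python, purely illustrative):

```python
import math

# For 2 degrees of freedom the chi-squared survival function is simply
# P(X > x) = exp(-x/2), so the p-value in the analysis of deviance table
# follows directly from the change in deviance.
dev_change = 36.929 - 4.252        # = 32.677, on 10 - 8 = 2 df
p = math.exp(-dev_change / 2.0)
print(p)  # approximately 8.0e-08, matching the table
```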

    2 Theory of GLMs

This section will cover the theory of GLMs in more depth, filling in the details of the framework outlined in section 1.2. The section starts by reviewing key results for exponential family distributions, then covers model estimation, before covering variance estimation, model comparison etc.

    2.1 The exponential family of distributions

The response variable in a GLM can have any distribution from the exponential family. A distribution belongs to the exponential family of distributions if its probability density function, or probability mass function, can be written as

f(y) = exp[ {yθ − b(θ)}/a(φ) + c(y, φ) ],

where b, a and c are arbitrary functions, φ an arbitrary scale parameter, and θ is known as the canonical parameter of the distribution (in the GLM context, θ will completely depend on the model parameters β, but it is not necessary to make this explicit yet).


For example, it is easy to see that the normal distribution is a member of the exponential family, since

f(y) = (1/√(2πσ²)) exp[ −(y − μ)²/(2σ²) ]
     = exp[ (−y² + 2μy − μ²)/(2σ²) − log(√(2πσ²)) ]
     = exp[ (yμ − μ²/2)/σ² − y²/(2σ²) − log(√(2πσ²)) ],

which is of exponential form, with θ = μ, b(θ) = θ²/2 (= μ²/2), a(φ) = φ = σ² and c(φ, y) = −y²/(2φ) − log(√(2πφ)).
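The algebra above is easy to verify numerically: the exponential family form with θ = μ, b(θ) = θ²/2, a(φ) = φ = σ² must reproduce the usual normal density exactly. A sketch (not part of the notes):

```python
import math

def normal_pdf(y, mu, sigma2):
    """Standard form of the N(mu, sigma2) density."""
    return math.exp(-(y - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def expfam_pdf(y, mu, sigma2):
    """Exponential family form: theta = mu, b(theta) = theta^2/2,
    a(phi) = phi = sigma2, c(phi, y) = -y^2/(2 phi) - log(sqrt(2 pi phi))."""
    theta, phi = mu, sigma2
    b = theta**2 / 2
    c = -y**2 / (2 * phi) - math.log(math.sqrt(2 * math.pi * phi))
    return math.exp((y * theta - b) / phi + c)

for y in (-1.3, 0.0, 2.7):
    assert math.isclose(normal_pdf(y, 0.5, 2.0), expfam_pdf(y, 0.5, 2.0))
print("the two forms of the normal density agree")
```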

It is possible to obtain general expressions for the mean and variance of exponential family distributions, in terms of a, b and φ. The log likelihood of θ, given a particular y, is simply log[f(y)] considered as a function of θ. That is,

l(θ) = {yθ − b(θ)}/a(φ) + c(y, φ),

and so

∂l/∂θ = {y − b′(θ)}/a(φ).

Treating l as a random variable, by replacing the particular observation y by the random variable Y, enables the expected value of ∂l/∂θ to be evaluated:

E(∂l/∂θ) = {E(Y) − b′(θ)}/a(φ).

Using the general result that E(∂l/∂θ) = 0 (at the true value of θ) and re-arranging implies that

E(Y) = b′(θ).   (2)

i.e. the mean of any exponential family random variable is given by the first derivative of b w.r.t. θ, where the form of b depends on the particular distribution. This equation is the key to linking the model parameters, β, of a GLM to the canonical parameters of the exponential family. In a GLM, the parameters β determine the mean of the response variable, and, via (2), they thereby determine the canonical parameter for each response observation.

Differentiating the log likelihood once more yields

∂²l/∂θ² = −b″(θ)/a(φ),

and plugging this into the general result E(∂²l/∂θ²) = −E[(∂l/∂θ)²] (the derivatives are evaluated at the true θ value) gives

−b″(θ)/a(φ) = −E[{Y − b′(θ)}²]/a(φ)²,

which re-arranges to the second useful general result:

var(Y) = b″(θ)a(φ).

a could in principle be any function of φ, and when working with GLMs there is no difficulty in handling any form of a, if φ is known. However, when φ is unknown matters become awkward, unless we can write a(φ) = φ/ω, where ω is a known constant. This restricted form in fact covers all the cases of practical interest here. a(φ) = φ/ω allows the possibility of, for example, unequal variances in models based on the normal distribution, but in most cases ω is simply 1. Hence we now have

var(Y) = b″(θ)φ/ω.   (3)

In subsequent sections it will often be convenient to consider var(Y) as a function of μ ≡ E(Y), and, since μ and θ are linked via (2), we can always define a variance function V(μ) = b″(θ)/ω, such that var(Y) = V(μ)φ.
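Results (2) and (3) can be checked numerically for a concrete family. For the Poisson, θ = log(μ), b(θ) = e^θ and φ = ω = 1, so both the mean and the variance should equal b′(θ) = b″(θ) = μ. A sketch (not from the notes; the value of μ is arbitrary) verifies this directly from the probability mass function:

```python
import math

# For the Poisson: theta = log(mu), b(theta) = exp(theta), phi = omega = 1,
# so (2) and (3) predict E(Y) = b'(theta) = mu and var(Y) = b''(theta) = mu.
mu = 3.7
theta = math.log(mu)
pmf = [math.exp(-mu) * mu**k / math.factorial(k) for k in range(80)]  # truncated pmf
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean)**2 * p for k, p in enumerate(pmf))

assert abs(mean - math.exp(theta)) < 1e-9   # E(Y) = b'(theta)
assert abs(var - math.exp(theta)) < 1e-9    # var(Y) = b''(theta) * phi / omega
print("Poisson mean and variance match b' and b''")
```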


2.2 Fitting generalized linear models

Recall that a GLM models an n-vector of independent response variables, Y, where μ ≡ E(Y), via

g(μ_i) = X_i β

and

Y_i ~ f_{θ_i}(y_i),

where f_{θ_i}(y_i) indicates an exponential family distribution, with canonical parameter θ_i, which is determined by μ_i (via equation 2) and hence ultimately by β. Given a vector y, an observation of Y, maximum likelihood estimation of β is possible. Since the Y_i are mutually independent, the likelihood of β is

L(β) = ∏_{i=1}^{n} f_{θ_i}(y_i),

and hence the log-likelihood of β is

l(β) = Σ_{i=1}^{n} log[f_{θ_i}(y_i)] = Σ_{i=1}^{n} {y_i θ_i − b_i(θ_i)}/a_i(φ) + c_i(φ, y_i),

where the dependence of the right hand side on β is through the dependence of the θ_i on β. Notice that the functions a, b and c may vary with i: this allows different binomial denominators, n_i, for each observation of a binomial response, or different (but known to within a constant) variances for normal responses, for example. φ, on the other hand, is assumed to be the same for all i. As discussed in the previous section, for practical work it suffices to consider only cases where we can write a_i(φ) = φ/ω_i, where ω_i is a known constant (usually 1), in which case

l(β) = Σ_{i=1}^{n} ω_i{y_i θ_i − b_i(θ_i)}/φ + c_i(φ, y_i).

Maximization proceeds by partially differentiating l w.r.t. each element of β, setting the resulting expressions to zero and solving for β. Now

∂l/∂β_j = (1/φ) Σ_{i=1}^{n} ω_i ( y_i ∂θ_i/∂β_j − b_i′(θ_i) ∂θ_i/∂β_j ),

and by the chain rule

∂θ_i/∂β_j = (∂θ_i/∂μ_i)(∂μ_i/∂β_j),

so that, differentiating (2), we get

∂μ_i/∂θ_i = b_i″(θ_i)  ⇒  ∂θ_i/∂μ_i = 1/b_i″(θ_i),

which then implies that

∂l/∂β_j = (1/φ) Σ_{i=1}^{n} [y_i − b_i′(θ_i)] / {b_i″(θ_i)/ω_i} × ∂μ_i/∂β_j.

Substituting from (2) and (3) into this last equation implies that the equations to solve for β̂ are

Σ_{i=1}^{n} (y_i − μ_i)/V(μ_i) × ∂μ_i/∂β_j = 0  ∀ j.   (4)


However, these equations are exactly the equations that would have to be solved in order to find β by non-linear weighted least squares, if the weights V(μ_i) were known in advance and were independent of β. In that case the least squares objective would be

S = Σ_{i=1}^{n} (y_i − μ_i)²/V(μ_i),   (5)

where μ_i depends non-linearly on β, but the weights V(μ_i) are treated as fixed. Finding the least squares estimates involves solving ∂S/∂β_j = 0 ∀ j, but this system of equations is easily seen to be (4), when the V(μ_i) terms are treated as fixed.

This correspondence suggests a fitting method. Iterate the following two steps to convergence.

1. Given the current μ̂_i estimates, evaluate the V(μ̂_i) values.

2. Find a value of β which reduces

Σ_i (y_i − μ_i)²/V(μ̂_i)

(the dependence on β is through μ, but not through V(μ̂)). Let this improved parameter vector be denoted β*, and use it to update μ̂.

At convergence β̂ must satisfy (4).

To implement this method we need to be able to find the required improved parameter vectors at step 2. To do this, just replace μ_i by its first order Taylor expansion around μ̂_i, so that

y_i − μ_i ≈ y_i − μ̂_i − Σ_j (∂μ_i/∂β_j)(β_j − β̂_j),

with exact equality at β = β̂ (derivatives evaluated at the current β̂). Now, writing the linear predictor as η_i = X_i β,

∂μ_i/∂β_j = (dμ_i/dη_i)(∂η_i/∂β_j) = X_ij / g′(μ_i).

Hence

Σ_i (y_i − μ_i)²/V(μ̂_i) ≈ Σ_i ( g′(μ̂_i)y_i − g′(μ̂_i)μ̂_i − X_iβ + X_iβ̂ )² / { g′(μ̂_i)² V(μ̂_i) }   (6)
                         = Σ_i w_i (z_i − X_iβ)²,   (7)

where z_i = g′(μ̂_i)(y_i − μ̂_i) + X_iβ̂ and w_i = {g′(μ̂_i)² V(μ̂_i)}⁻¹. But (7) is just a weighted linear least squares problem, which is easily minimized w.r.t. β using standard least squares methods, making it easy to find an improved β*.

Hence we arrive at the following GLM fitting algorithm. Iterate the following steps to convergence:

1. Given the current $\hat\eta$ and $\hat\mu$ estimates, calculate pseudodata $z$ and weights $w$, as defined above.

2. Minimize
$$\sum_i w_i(z_i - \mathbf{X}_i\beta)^2$$
w.r.t. $\beta$ to obtain an improved estimate $\hat\beta$.

3. Evaluate a new linear predictor estimate $\hat\eta = \mathbf{X}\hat\beta$ and new fitted values $\hat\mu_i = g^{-1}(\hat\eta_i)$.

If, as is usual, the method converges to a fixed $\hat\beta$ then this must satisfy (4) and is hence the MLE of $\beta$. The iteration can be started by setting $\hat\mu = y$ (with modification to avoid e.g. $\log(0)$). The method is known as Iteratively Re-weighted Least Squares (IRLS).
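The algorithm above can be sketched in a few lines of R. The following is a minimal illustration for a Poisson model with log link, for which $V(\mu)=\mu$ and $g'(\mu)=1/\mu$; the simulated data and variable names are invented for illustration, and no convergence test is used:

```r
## Minimal IRLS sketch for a Poisson GLM with log link.
## Here z = eta + (y - mu)/mu and w = 1/(g'(mu)^2 V(mu)) = mu.
set.seed(1)
x <- runif(50)
X <- cbind(1, x)                         # model matrix
y <- rpois(50, exp(1 + 2 * x))           # simulated response
mu <- y + 0.1                            # starting fitted values (avoids log(0))
for (iter in 1:25) {
  eta  <- log(mu)                        # current linear predictor
  z    <- eta + (y - mu) / mu            # pseudodata
  w    <- mu                             # IRLS weights
  beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # weighted least squares step
  mu   <- as.numeric(exp(X %*% beta))    # updated fitted values
}
as.numeric(beta)   # agrees with coef(glm(y ~ x, family = poisson))
```

The same loop, with $z$ and $w$ redefined, works for any family/link pair; glm's own IRLS implementation adds step halving and proper convergence checks.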


2.3 The sampling distribution of $\hat\beta$

There is a general maximum likelihood estimation result that if $\hat\beta$ is an MLE (and some technical conditions are met) then
$$\hat\beta \sim N(\beta, \mathcal{I}^{-1}),$$
where $\mathcal{I}$ is the information matrix, with elements $\mathcal{I}_{ij} = E(\partial l/\partial\beta_i\,\partial l/\partial\beta_j)$. The result is exact in the large sample limit, or for the normal response, identity link case. To use this result we need to find $\mathcal{I}$ for a GLM.

First define the vector $u$ such that $u_j = \partial l/\partial\beta_j$. Then $\mathcal{I} = E(uu^T)$. From results already established, we have that
$$u_j = \frac{\partial l}{\partial \beta_j} = \frac{1}{\phi}\sum_{i=1}^n \frac{X_{ij}(y_i - \mu_i)}{V(\mu_i)g'(\mu_i)}.$$
If we define diagonal matrices $\mathbf{G}$ and $\mathbf{V}$, where $G_{ii} = g'(\mu_i)$ and $V_{ii} = V(\mu_i)$, then this last result becomes
$$u = \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}(y - \mu)/\phi.$$
Hence,
$$E(uu^T) = \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}E[(Y-\mu)(Y-\mu)^T]\mathbf{V}^{-1}\mathbf{G}^{-1}\mathbf{X}/\phi^2 = \mathbf{X}^T\mathbf{G}^{-1}\mathbf{V}^{-1}\mathbf{V}\phi\mathbf{V}^{-1}\mathbf{G}^{-1}\mathbf{X}/\phi^2 = \mathbf{X}^T\mathbf{W}\mathbf{X}/\phi,$$
since $E[(Y-\mu)(Y-\mu)^T] = \mathbf{V}\phi$ and $\mathbf{W} = \mathbf{V}^{-1}\mathbf{G}^{-2}$. So we end up with
$$\hat\beta \sim N\left(\beta, (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\phi\right). \qquad (8)$$
For distributions with known scale parameter, $\phi$, this result can be used directly to find confidence intervals for the parameters, but if the scale parameter is unknown (e.g. for the normal distribution), then it must be estimated, and intervals must be based on an appropriate $t$ distribution. Scale parameter estimation is covered later.
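Result (8) can be checked numerically against the covariance matrix that R itself computes. The sketch below (invented data) rebuilds $(\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\phi$ for a Poisson fit, where $\phi = 1$ and $W_{ii} = \mu_i$ for the log link, and compares it with the output of vcov:

```r
## Check (8): vcov(fit) should equal (X'WX)^{-1} phi at the converged fit.
## For a Poisson log-link model, W_ii = mu_i and phi = 1.
set.seed(2)
x <- runif(40)
y <- rpois(40, exp(0.5 + x))
fit <- glm(y ~ x, family = poisson)
X  <- model.matrix(fit)
mu <- fitted(fit)
Vb <- solve(t(X) %*% (mu * X))   # (X'WX)^{-1}, with phi = 1
max(abs(Vb - vcov(fit)))         # essentially zero
```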

2.4 Comparing models by hypothesis testing

Consider testing
$$H_0: g(\mu) = \mathbf{X}_0\beta_0$$
against
$$H_1: g(\mu) = \mathbf{X}_1\beta_1,$$
where $\mu$ is the expectation of a response vector, $Y$, whose elements are independent random variables from the same member of the exponential family of distributions, and where $\mathbf{X}_0 \subset \mathbf{X}_1$. If we have an observation, $y$, of the response vector, then a generalized likelihood ratio test can be performed. Let $l(\hat\beta_0)$ and $l(\hat\beta_1)$ be the maximized log-likelihoods of the two models. If $H_0$ is true then, in the large sample limit,
$$2[l(\hat\beta_1) - l(\hat\beta_0)] \sim \chi^2_{p_1 - p_0}, \qquad (9)$$
where $p_i$ is the number of (identifiable) parameters ($\beta_i$) in model $i$. If the null hypothesis is false, then model 1 will tend to have a substantially higher likelihood than model 0, so that twice the difference in log-likelihoods would be too large for consistency with the relevant $\chi^2$ distribution.

The approximate result (9) is only directly useful if the log-likelihoods of the models concerned can be calculated. In the case of GLMs estimated by IRLS, this is only the case if the scale parameter, $\phi$, is known. Hence the result can be used directly with Poisson and binomial models, but not with the normal, gamma or inverse Gaussian distributions, where the scale parameter is unknown. What to do in these latter cases will be discussed shortly.

Of course, for the normal distribution and identity link we use the results of chapter 1.


2.4.1 Deviance

When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar way to the residual sum of squares in ordinary linear modelling. This quantity is the deviance of the model, defined as
$$D = 2[l(\hat\beta_{\max}) - l(\hat\beta)]\phi \qquad (10)$$
$$= \sum_{i=1}^n 2\omega_i\left[y_i(\tilde\theta_i - \hat\theta_i) - b(\tilde\theta_i) + b(\hat\theta_i)\right], \qquad (11)$$
where $l(\hat\beta_{\max})$ indicates the maximized log-likelihood of the saturated model: the model with one parameter per data point. $l(\hat\beta_{\max})$ is the highest value that the log-likelihood could possibly have, given the data, and is evaluated by simply setting $\hat\mu = y$ and evaluating the log-likelihood. $\tilde\theta_i$ and $\hat\theta_i$ denote the maximum likelihood estimates of the canonical parameters, for the saturated model and the model of interest, respectively. Notice how the deviance is defined to be independent of $\phi$.

Related to the deviance is the scaled deviance,
$$D^* = D/\phi,$$
which does depend on the scale parameter. For the binomial and Poisson distributions, where $\phi = 1$, the deviance and scaled deviance are the same, but this is not the case more generally.

By the generalized likelihood ratio test result (9), we might expect that, if the model is correct, then approximately
$$D^* \sim \chi^2_{n-p}, \qquad (12)$$
in the large sample limit. Actually such an argument is bogus, since the limiting argument justifying (9) relies on the number of parameters in the model staying fixed while the sample size tends to infinity, but the saturated model has as many parameters as data. Asymptotic results are available for some exponential family distributions to justify (12) as a large sample approximation under many circumstances, and it is exact for the normal case. Note, however, that it breaks down entirely for the binomial with binary data.

Given the definition of deviance, it is easy to see that the log-likelihood ratio statistic in (9) can be re-expressed as $D^*_0 - D^*_1$. So, under $H_0$,
$$D^*_0 - D^*_1 \sim \chi^2_{p_1 - p_0} \qquad (13)$$
(in the large sample limit), where $D^*_i$ is the scaled deviance of model $i$, which has $p_i$ identifiable parameters. But again, this is only useful if the scale parameter is known, so that $D^*$ can be calculated.
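For a known-$\phi$ family, (13) is easy to verify directly. The sketch below (invented Poisson data, in which z has no real effect) computes $D_0 - D_1$ from two nested fits and the corresponding $\chi^2$ p-value, which is exactly what anova reports:

```r
## The GLRT via deviances, result (13), for nested Poisson models (phi = 1).
set.seed(3)
x <- runif(40); z <- runif(40)
y  <- rpois(40, exp(1 + x))                 # z has no real effect
m0 <- glm(y ~ x,     family = poisson)
m1 <- glm(y ~ x + z, family = poisson)
stat <- deviance(m0) - deviance(m1)         # D0 - D1
pchisq(stat, df = 1, lower.tail = FALSE)    # same p-value as anova(m0, m1, test = "Chisq")
```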

2.4.2 Model comparison with unknown $\phi$

Under $H_0$ we have the approximate results
$$D^*_0 - D^*_1 \sim \chi^2_{p_1-p_0} \quad \text{and} \quad D^*_1 \sim \chi^2_{n-p_1},$$
and, if $D^*_0 - D^*_1$ and $D^*_1$ are treated as asymptotically independent, this implies that
$$F = \frac{(D^*_0 - D^*_1)/(p_1 - p_0)}{D^*_1/(n - p_1)} \sim F_{p_1-p_0,\,n-p_1},$$
in the large sample limit (a result which is exactly true in the ordinary linear model special case, of course). The useful property of $F$ is that it can be calculated without knowing $\phi$, which cancels from the top and bottom of the ratio, yielding, under $H_0$, the approximate result
$$F = \frac{(D_0 - D_1)/(p_1 - p_0)}{D_1/(n - p_1)} \sim F_{p_1-p_0,\,n-p_1}. \qquad (14)$$
The advantage of this result is that it can be used for hypothesis testing based model comparison when $\phi$ is unknown. The disadvantages are the dubious distributional assumption for $D^*_1$, and the independence approximation, on which it is based.
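As a concrete illustration of (14), the statistic can be assembled directly from the two deviances. The gamma data below are invented; note that anova(m0, m1, test="F") performs the analogous comparison, though it plugs in a Pearson-based estimate of $\phi$ rather than $D_1/(n-p_1)$:

```r
## The F statistic of (14) assembled from deviances, for nested gamma fits.
set.seed(4)
x <- runif(60); z <- runif(60)
y  <- rgamma(60, shape = 2, scale = exp(1 + x) / 2)
m0 <- glm(y ~ x,     family = Gamma(link = "log"))
m1 <- glm(y ~ x + z, family = Gamma(link = "log"))
Fstat <- ((deviance(m0) - deviance(m1)) / 1) /     # (D0 - D1)/(p1 - p0)
         (deviance(m1) / df.residual(m1))          # D1/(n - p1)
pf(Fstat, 1, df.residual(m1), lower.tail = FALSE)  # approximate p-value
```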


2.5 $\hat\phi$ and Pearson's statistic

As we have seen, the MLEs of the parameters, $\beta$, can be obtained without knowing the scale parameter, $\phi$, but, in those cases in which this parameter is unknown, it must usually be estimated. Approximate result (12) provides one obvious estimator. The expected value of a $\chi^2_{n-p}$ random variable is $n - p$, so equating the observed $D^* = D/\phi$ to its approximate expected value and re-arranging, we get
$$\hat\phi_D = D/(n - p). \qquad (15)$$
A second estimator is based on the Pearson statistic, which is defined as
$$X^2 = \sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}.$$
Clearly $X^2/\phi$ would be the sum of squares of a set of zero mean, unit variance, random variables, having $n - p$ degrees of freedom, suggesting that if the model is adequate then approximately $X^2/\phi \sim \chi^2_{n-p}$: this approximation turns out to be well founded. Setting the observed Pearson statistic to its expected value, and re-arranging, yields
$$\hat\phi = X^2/(n - p).$$
Note that it is straightforward to show that
$$X^2 = \sum_{i=1}^n w_i(z_i - \mathbf{X}_i\hat\beta)^2,$$
where $w_i$ and $z_i$ are the IRLS weights and pseudodata, evaluated at convergence.
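This Pearson-based $\hat\phi$ is the estimate that summary.glm reports as the dispersion for families with unknown scale. A quick check, on invented gamma data (for which $V(\mu) = \mu^2$):

```r
## Pearson estimate of phi for a gamma fit, vs the dispersion summary() reports.
set.seed(5)
x <- runif(50)
y <- rgamma(50, shape = 2, scale = exp(1 + x) / 2)   # true phi = 1/2
fit <- glm(y ~ x, family = Gamma(link = "log"))
X2 <- sum((y - fitted(fit))^2 / fitted(fit)^2)       # Pearson statistic, V(mu) = mu^2
X2 / df.residual(fit)                                # phi-hat = X^2/(n - p)
```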

    2.6 Residuals and model checking

We have now assembled the basic theory required for inference with GLMs, but before using the distributional results for inference, it is always necessary to check that the model meets its assumptions well enough that the results are likely to be valid. Model checking is perhaps the most important part of applied statistical modelling. In the case of ordinary linear models, it is based on examination of the model residuals, which contain all the information in the data not explained by the systematic part of the model. Examination of residuals is also the chief means of model checking in the case of GLMs, but in this case the standardization of residuals is both necessary and a little more difficult.

For GLMs the main reason for not simply examining the raw residuals, $\hat\epsilon_i = y_i - \hat\mu_i$, is the difficulty of checking the validity of the assumed mean-variance relationship from the raw residuals. For example, if a Poisson model is employed, then the variance of the residuals should increase in direct proportion to the size of the fitted values ($\hat\mu_i$). However, if raw residuals are plotted against fitted values, it takes an extraordinary ability to judge whether the residual variability is increasing in proportion to the mean, as opposed to, say, the square root or square of the mean. For this reason it is usual to standardize GLM residuals in such a way that, if the model assumptions are correct, the standardized residuals should have approximately equal variance, and behave, as far as possible, like residuals from an ordinary linear model.

2.6.1 Pearson residuals

The most obvious way to standardize the residuals is to divide them by a quantity proportional to their standard deviation according to the fitted model. This gives rise to the Pearson residuals
$$\hat\epsilon^p_i = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}},$$
which should have approximately zero mean and variance $\phi$, if the model is correct. (Recall that if $\{Z_i : i = 1,\ldots,n\}$ are a set of i.i.d. $N(0,1)$ r.v.s then $\sum_i Z_i^2 \sim \chi^2_n$.) These residuals should not display any trend in mean or variance when plotted against the fitted values, or any covariates (whether included in the model or not). The name "Pearson residuals" relates to the fact that the sum of squares of the Pearson residuals gives the Pearson statistic discussed in section 2.5.

Note that the Pearson residuals are the residuals of the working linear model from the converged IRLS method, multiplied by the square roots of the converged IRLS weights.

2.6.2 Deviance residuals

In practice the distribution of the Pearson residuals can be quite asymmetric around zero, so that their behaviour is not as close to ordinary linear model residuals as might be hoped for. The deviance residuals are often preferable in this respect. The deviance residuals are arrived at by noting that the deviance plays much the same role for GLMs that the residual sum of squares plays for ordinary linear models: indeed, for an ordinary linear model the deviance is the residual sum of squares. In the ordinary linear model case, the deviance is made up of the sum of the squared residuals; that is, the residuals are the square roots of the components of the deviance, with the appropriate sign attached.

So, writing $d_i$ for the component of the deviance contributed by the $i$th datum (i.e. the $i$th term in the summation in (11)), we have
$$D = \sum_{i=1}^n d_i$$
and, by analogy with the ordinary linear model, we can define
$$\hat\epsilon^d_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i}.$$
As required, the sum of squares of these deviance residuals gives the deviance itself.

Now, if the deviance were calculated for a model where all the parameters were known, then (12) would become $D^* \sim \chi^2_n$, and this might suggest that for a single datum $d_i/\phi \sim \chi^2_1$, implying that $\hat\epsilon^d_i \sim N(0, \phi)$. Of course (12) cannot reasonably be applied to a single datum, but nonetheless it suggests that we might expect the deviance residuals to behave something like $N(0, \phi)$ random variables, for a well fitting model, especially in cases for which (12) is expected to be a reasonable approximation.

    2.6.3 Residual plots

Once you have standardized residuals, you should plot them to try to find evidence that the model assumptions are not met. The main useful plots are:

• Standardized residuals against fitted values. A trend in the mean of the residuals violates the independence assumption, and often means that something is wrong with the model for the mean of the response: perhaps a missing dependence, or the wrong link function. A trend in the variability of the residuals is diagnostic of a problem with the assumed mean-variance relationship, i.e. with the assumed response distribution.

• Standardized residuals against all potential predictor variables (whether selected or omitted from the model). Trends in the mean of the residuals can be very useful for pinpointing missing dependencies of the mean response on the predictors.

• Normal QQ plots can be useful for highlighting problems with the distributional assumptions, in cases where the response distribution can be well approximated by a normal distribution (with appropriate non-constant variance). For example, Poisson residuals for a response with a fairly high mean fall into this category.

• Plots of standardized residuals against leverage are useful for highlighting single points that have a very high influence on the model fitting. Leverage is a measure of how influential a data point could be, based on the distance of its predictor variables from the predictors of the other data.

All plots are useful for spotting potential outliers: points which do not fit well with the pattern of the rest of the data. These deserve special attention, to check that they are not somehow erroneous, or that they are not telling you something important about the system that the data relate to. Note that R always labels the three largest outliers in a residual plot with their data frame row numbers. Of course the fact that they are labelled does not in itself mean that they are problematic.

    3 Linking computation and theory

To use the theoretical results effectively you need to be able to specify any GLM you want to fit, in R, and extract the quantities required by the theory from R output of various sorts.

    3.1 Model formulae and the specification of GLMs

Specification of the response distribution and link function is via the family argument of glm, and the examples already covered provide sufficient illustration of this. Specification of the structure of the linear predictor is more involved, and now that we have covered a number of examples, a more formal discussion of model formulae is appropriate.

    The main components of a formula are all present in the following example

    y ~ a*b + x:z + offset(v) -1

    Note the following:

• ~ separates the response variable, on the left, from the linear predictor, on the right. So in the example y is the response and a, b, x, z and v are the predictors.

• + indicates that the response depends on what is to the left of + and what is to the right of it. Hence within formulae + should be thought of as "and" rather than "the sum of".

• : indicates that the response depends on the interaction of the variables to the left and right of :. Interactions are obtained by forming the element-wise products of all model matrix columns corresponding to the two terms that are interacting, and appending the resulting columns to the model matrix (although, of course, some identifiability constraints may be required).

• * means that the response depends on whatever is to the left of * and whatever is to the right of it, and their interaction, i.e. a*b is just a shorter way of writing a + b + a:b.

• offset(v) indicates that a column should be included in the model matrix, specified by v, whose corresponding parameter has the known value 1.

• -1 means that the default intercept term should not be included in the model. Note that, for models involving factor variables, this often has no real impact on the model structure, but simply reduces the number of identifiability constraints by one, while changing the interpretation of some parameters.

Because of the way that some symbols change their usual meaning in model formulae, it is necessary to take special measures if the usual meaning is to be restored to arithmetic operations within a formula. This is accomplished by using the identity function I(), which simply evaluates its argument and returns it. For example, if we wanted to fit the model
$$y_i = \beta_0 + \beta_1(x_i + z_i) + \beta_2 v_i + \epsilon_i$$
then we could do so using the model formula

y ~ I(x+z) + v

See section 6.3 of the MA20035 notes for a reminder of what interactions of factor variables are.
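These rules are easy to explore interactively, since model.matrix will expand the right hand side of any formula. A small sketch, using invented values for the predictor names above:

```r
## How formula symbols expand into model matrix columns.
a <- factor(c("lo", "lo", "hi", "hi"))   # two invented factors...
b <- factor(c("p", "q", "p", "q"))
x <- c(1, 2, 3, 4)                       # ...and two numeric predictors
z <- c(0.5, 1, 1.5, 2)
colnames(model.matrix(~ a * b + x:z))    # a*b gives a + b + a:b; x:z is one product column
colnames(model.matrix(~ x + z))          # '+' means 'and': separate x and z columns
colnames(model.matrix(~ I(x + z)))       # I() restores arithmetic: a single summed column
```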


Note that there is no need to protect arithmetic operations within arguments to other functions in this way. For example
$$y_i = \beta_0 + \beta_1 \log(x_i + z_i) + \beta_2 v_i + \epsilon_i$$
would be fitted correctly by

y ~ log(x+z) + v

    3.1.1 An example

Consider a study looking at the relationship between smoking, drinking and blood pressure. A group of adult male patients were selected randomly from a GP practice. Each patient had their blood pressure, $y_i$, measured, along with their smoker status (smoker or non-smoker), their alcohol consumption rate (none, within recommended limits, or heavy) and their age in years, $x_i$. The following initial model was proposed:

$$E(y_i) = \alpha + \beta_j + \gamma_k + \delta_{jk} + \alpha_j x_i \quad \text{if patient $i$ is in smoker class $j$, drinker class $k$,}$$
with $y_i \sim$ gamma. So the model has main effects for smoking and drinking, an interaction of these factors, and a separate linear dependence on age for smokers and non-smokers (an interaction of age and smoking). Remember that variables are said to interact when the effect of one predictor itself depends on the value of another predictor.

The model is a GLM with an identity link and gamma distribution. Suppose that we have the following predictor variables for 10 patients:

age, x   44 38 39 41 44 37 44 44 42 41
smoke     F  F  F  F  F  T  T  T  T  T
drink     1  2  3  1  2  3  1  2  3  1

Note that smoke and drink are factor variables here, while age is a continuous predictor. Here is the corresponding linear predictor (which gives $E(y_i)$ directly in this case), in identifiable form:

$$\begin{pmatrix}\mu_1\\\mu_2\\\mu_3\\\mu_4\\\mu_5\\\mu_6\\\mu_7\\\mu_8\\\mu_9\\\mu_{10}\end{pmatrix} =
\begin{pmatrix}
1&0&0&0&0&0&44&0\\
1&0&1&0&0&0&38&0\\
1&0&0&1&0&0&39&0\\
1&0&0&0&0&0&41&0\\
1&0&1&0&0&0&44&0\\
1&1&0&1&0&1&0&37\\
1&1&0&0&0&0&0&44\\
1&1&1&0&1&0&0&44\\
1&1&0&1&0&1&0&42\\
1&1&0&0&0&0&0&41
\end{pmatrix}
\begin{pmatrix}\alpha\\\beta_2\\\gamma_2\\\gamma_3\\\delta_{22}\\\delta_{23}\\\alpha_1\\\alpha_2\end{pmatrix}$$

In R the model distribution and link would be specified using Gamma(link="identity"), while the model formula to specify the response and linear predictor could be written as:

y ~ smoke + drink + smoke:drink + smoke:age

This form is the clearest translation of the model structure into R, but note that any of the following would give the same model (although the identifiability constraints may change between them, which will alter the meaning of some of the parameters):

y ~ smoke*drink + age:smoke
y ~ smoke*drink + age*smoke
y ~ smoke*drink + age:smoke - 1
. . . and more!


Note that within R you can check the model matrix of a GLM using the model.matrix command. The argument of model.matrix can be a fitted GLM object, a model formula, or even just the rhs of a formula. For example

    model.matrix(~smoke+drink+smoke:drink+age:smoke)

    generates the model matrix given earlier in this section.
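The claim is easy to verify by typing the predictor values in from the table above (a minimal sketch; only the predictors are needed, since model.matrix ignores the response):

```r
## Rebuilding the example's model matrix from the predictor table.
age   <- c(44, 38, 39, 41, 44, 37, 44, 44, 42, 41)
smoke <- factor(c("F", "F", "F", "F", "F", "T", "T", "T", "T", "T"))
drink <- factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))
X <- model.matrix(~ smoke + drink + smoke:drink + age:smoke)
dim(X)        # 10 x 8, matching the matrix displayed earlier
X[6, ]        # row for patient 6: smoker, drink level 3, age 37
```

Because age has no main effect in the formula, the age:smoke term gets a full dummy expansion (columns smokeF:age and smokeT:age), giving a separate age slope for each smoking class.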

    4 Using the distributional results

Once a model passes basic residual checking, we're in a position to treat it as a good enough model to do some formal statistics. This involves using the distributional results (8), (12) and (14).

    4.1 Confidence interval calculation

Result (8) is useful for finding confidence intervals for model parameters, and linear transformations of them. Let $\mathbf{V}_{\hat\beta} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\hat\phi$, the estimated covariance matrix of $\hat\beta$ ($\phi$ is known to be 1 in some cases). Let $\hat\sigma_{\hat\beta_i}$ be the square root of the $i$th diagonal element of $\mathbf{V}_{\hat\beta}$, i.e. the estimated standard error of $\hat\beta_i$.

Confidence intervals for $\beta_i$. Using standard theory for normally distributed estimators:

1. A $100(1-\alpha)\%$ CI for $\beta_i$, when $\phi$ is known (e.g. Poisson or binomial cases), is
$$\hat\beta_i \pm t_\infty(1-\alpha/2)\,\hat\sigma_{\hat\beta_i}$$
where $t_\infty(1-\alpha/2)$ is the $1-\alpha/2$ critical point of a standard normal distribution.

2. A $100(1-\alpha)\%$ CI for $\beta_i$, when $\phi$ is unknown (e.g. gaussian or gamma cases), is
$$\hat\beta_i \pm t_{n-\dim(\beta)}(1-\alpha/2)\,\hat\sigma_{\hat\beta_i}$$
where $t_k(1-\alpha/2)$ is the $1-\alpha/2$ critical point of a $t_k$ distribution.

    Except in the normal response, identity link case, both results are only approximate, since they are basedon (8), which is only approximate.

R reports the $\hat\beta_i$ and $\hat\sigma_{\hat\beta_i}$ values in the Estimate and Std. Error columns of the Coefficients table of a glm fitted model summary. The function vcov is used to extract $\mathbf{V}_{\hat\beta}$ from a glm fitted model object.

Confidence intervals for the linear predictor and expected response. Since $\hat\beta$ has an approximately normal distribution, so does any linear transformation of it, such as the linear predictor for the $i$th observation, $\hat\eta_i = \mathbf{X}_i\hat\beta$. There is a standard result that if $Z$ and $U$ are random vectors with covariance matrices $\mathbf{V}_z$ and $\mathbf{V}_u$, and $Z = \mathbf{B}U$ where $\mathbf{B}$ is a matrix of fixed coefficients, then $\mathbf{V}_z = \mathbf{B}\mathbf{V}_u\mathbf{B}^T$. Applying this implies that
$$\hat\sigma^2_{\hat\eta_i} = \mathrm{var}(\hat\eta_i) = \mathbf{X}_i\mathbf{V}_{\hat\beta}\mathbf{X}_i^T.$$
So we have, approximately, that $\hat\eta_i \sim N(\eta_i, \sigma^2_{\hat\eta_i})$. Hence a $100(1-\alpha)\%$ CI for $\eta_i$ is
$$\hat\eta_i \pm t_k(1-\alpha/2)\,\hat\sigma_{\hat\eta_i}$$
where $t_k$ is $t_{n-\dim(\beta)}$ if $\phi$ is unknown, and $t_\infty$ otherwise. Given a CI for $\eta_i$, an equivalent CI for $\mu_i = E(y_i)$ is easily obtained:
$$\left[g^{-1}\!\left(\hat\eta_i - t_k(1-\alpha/2)\hat\sigma_{\hat\eta_i}\right),\; g^{-1}\!\left(\hat\eta_i + t_k(1-\alpha/2)\hat\sigma_{\hat\eta_i}\right)\right].$$
Derivation of this latter interval is easy. If $(a, b)$ is a 95% CI for $\eta_i$, then it includes the true $\eta_i$ with probability 0.95. But in that case $[g^{-1}(a), g^{-1}(b)]$ must include the true $\mu_i = g^{-1}(\eta_i)$ with probability 0.95, making it a 95% CI for $\mu_i$.

The R function predict(mod,type="link",se=TRUE) will return the fitted values $\hat\eta_i$ and associated standard errors $\hat\sigma_{\hat\eta_i}$, for model mod, in the fit and se.fit elements of the object it returns. If predict is supplied with new values for the predictors, then it will produce predictions and standard errors for the linear predictor corresponding to these, instead.
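Put together, a CI for the expected response takes three lines. The sketch below uses invented Poisson data, so $\phi$ is known, the normal critical point applies, and $g^{-1} = \exp$:

```r
## Approximate 95% CI for mu_i = E(y_i): link-scale interval, then back-transform.
set.seed(7)
x <- runif(30)
y <- rpois(30, exp(1 + x))
fit <- glm(y ~ x, family = poisson)
p  <- predict(fit, type = "link", se.fit = TRUE)
lo <- exp(p$fit - qnorm(0.975) * p$se.fit)   # g^{-1}(eta-hat - t_inf(0.975) * se)
hi <- exp(p$fit + qnorm(0.975) * p$se.fit)   # g^{-1}(eta-hat + t_inf(0.975) * se)
```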


4.2 Single parameter tests

Using the same notation as in the previous section, result (8) is also the basis for simple hypothesis testing about single parameters. Consider testing $H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$. Under $H_0$ we have, approximately,
$$\hat\beta_j \sim N(0, \sigma^2_{\hat\beta_j}).$$
If $\phi$ is known then this becomes
$$\hat\beta_j/\hat\sigma_{\hat\beta_j} \sim N(0, 1)$$
and we can calculate a p-value for the hypothesis test in the usual way, by evaluating
$$\Pr\left[|Z| \geq |\hat\beta_j/\hat\sigma_{\hat\beta_j}|\right] \quad \text{where } Z \sim N(0, 1).$$
With $\phi$ estimated we use
$$\hat\beta_j/\hat\sigma_{\hat\beta_j} \sim t_{n-\dim(\beta)}$$
under $H_0$, and we can calculate a p-value by evaluating
$$\Pr\left[|T| \geq |\hat\beta_j/\hat\sigma_{\hat\beta_j}|\right] \quad \text{where } T \sim t_{n-\dim(\beta)}.$$
In R glm summary output, $\hat\beta_j/\hat\sigma_{\hat\beta_j}$ is reported in the t value or z value column, while the corresponding p-values are in the Pr(>|t|) or Pr(>|z|) columns.
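The summary columns are simple to reproduce by hand. The following sketch (invented Poisson data, so the $N(0,1)$ reference distribution applies) recomputes the z value and p-value for one coefficient:

```r
## Single-parameter z test by hand, matching the summary() table.
set.seed(8)
x <- runif(40)
y <- rpois(40, exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)
b  <- coef(fit)["x"]
se <- sqrt(diag(vcov(fit)))["x"]
z  <- b / se                                 # the 'z value' column
p  <- 2 * pnorm(abs(z), lower.tail = FALSE)  # the 'Pr(>|z|)' column
```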

    4.3 Hypothesis testing by model comparison

    Hypothesis tests of the sort developed in section 2.4 are easily performed in R, as follows:

1. Fit the GLM embodying the null hypothesis, m0, say.

2. Fit the model embodying the alternative hypothesis, m1, say. m1 must be an extended version of m0, so that m0 is nested in m1.

3. Use anova(m0,m1,test="Chisq") to compare the models by direct use of the generalized likelihood ratio test (13), if $\phi$ is known. Alternatively use anova(m0,m1,test="F") to compare the models using the F ratio test (14), when $\phi$ is not known.

    4.3.1 Known scale parameter example

    Case-control studies are an important type of study, in which a group of patients with some disease (thecases) are compared to a randomly selected group of healthy subjects from the same population as thecases (the controls). Variables that might be associated with the disease are also collected for all subjects.If a variable is really associated with the disease then it ought to be predictive of whether a randomlyselected patient in the study is a case or a control. Such predictivity can be assessed using GLMs, of thelogistic regression type.

    For example consider a study looking at 143 cases of malignant melanoma (a serious skin cancer)in white male patients aged 25 to 55, classified according to skin type (A, B or C for celtic, middleeuropean type or Mediterranean), compared to 356 white male controls aged 25-55 (selected withoutfurther reference to age or skin type). Patients were divided into 3 groups according to an age factorvariable, as well as being classified into 3 groups by the skin factor variable (so there are 9 groups intotal). The data are in a data frame md1:

  mel   age skin  n
1  15 25-35    A 54
2   8 25-35    B 52
3   7 25-35    C 44
4  26 35-45    A 75
5  18 35-45    B 52
6   6 35-45    C 42
7  30 45-55    A 67
8  25 45-55    B 66
9   8 45-55    C 47

Consider testing the null hypothesis that skin type is not associated with melanoma, against the alternative that it is. If we neglect the possibility of an interaction, then the null model, mel0, is the binomial GLM in which the probability of being a case depends only on the age factor, while the alternative, mel1, adds the skin factor to the linear predictor. Having fitted both, the models are compared as follows:

> anova(mel0,mel1,test="Chisq")
Analysis of Deviance Table
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         6    20.5062
2         4     3.4389  2  17.0673    0.0002

test="Chisq" specifies that a generalized likelihood ratio test is to be performed using (13). Here the p-value is very low: there is a very small probability of observing this large a difference in deviance between the two models if mel0 really generated the data. This strongly suggests that mel0 is incorrect. In other words, there is strong evidence in favour of mel1 and an effect of skin type on melanoma risk. The next step would be to examine the model coefficients to ascertain the nature and size of the effect.

Note that these case-control studies can only be used to look at the relative risk of melanoma given different risk factors. The study tells us nothing about the absolute risk of melanoma, because we have chosen the ratio of cases to controls, rather than observing it in the population of interest.
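The comparison is easy to reproduce. The glm calls below are a reconstruction (the notes' exact fitting code was lost in extraction), modelling the number of cases out of n in each group as binomial, with md1 typed in from the table above:

```r
## Reproducing the melanoma model comparison (glm calls reconstructed).
md1 <- data.frame(
  mel  = c(15, 8, 7, 26, 18, 6, 30, 25, 8),
  n    = c(54, 52, 44, 75, 52, 42, 67, 66, 47),
  age  = factor(rep(c("25-35", "35-45", "45-55"), each = 3)),
  skin = factor(rep(c("A", "B", "C"), 3)))
mel0 <- glm(cbind(mel, n - mel) ~ age,        family = binomial, data = md1)
mel1 <- glm(cbind(mel, n - mel) ~ age + skin, family = binomial, data = md1)
anova(mel0, mel1, test = "Chisq")   # deviance drop 17.07 on 2 df, p = 0.0002
```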

    4.3.2 Unknown scale parameter testing example

The dataset motori is derived from the dataset motorins in R library faraway. It contains insurance company data from Sweden, on payouts (Payment) in relation to number Insured, km travelled (a numeric variable with 5 discrete values), Make of car (a factor variable with 9 levels), and number of years no claims Bonus. An initial model for the data is
$$E(\mathtt{Payment}_i) = \mathtt{Insured}_i \times \mathtt{risk}_i, \quad \text{where } \mathtt{Payment}_i \sim \text{gamma},$$
so
$$\log\{E(\mathtt{Payment}_i)\} = \log(\mathtt{Insured}_i) + \log(\mathtt{risk}_i).$$
$\log(\mathtt{Insured}_i)$ is an example of a model offset: a predictor variable whose coefficient is fixed at 1. $\log(\mathtt{risk}_i)$ can be modelled using a linear model structure, to give
$$\log\{E(\mathtt{Payment}_i)\} = \log(\mathtt{Insured}_i) + \beta_j\,\mathtt{km}_i + \alpha_j + \gamma\,\mathtt{Bonus}_i, \quad \text{if $i$ is from Make $j$.}$$
Consider testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_9$ against the alternative that the $\beta_j$ are not all equal. First fit models embodying the two hypotheses: g0, with a single common km slope, and gl, with a separate km slope for each Make (the formulae are those shown in the anova output below).

Since $\phi$ is not known for the gamma distribution, we now perform an F-ratio test comparing the models:

> anova(g0,gl,test="F")
Analysis of Deviance Table

Model 1: Payment ~ offset(log(Insured)) + km + Make + Bonus
Model 2: Payment ~ offset(log(Insured)) + km * Make + Bonus
  Resid. Df Resid. Dev Df Deviance     F Pr(>F)
1       284    155.056
2       276    151.890  8    3.166 0.752 0.6455

    It appears that the dependence of claim rate on km travelled can be assumed not to vary with car make.

    5 Model selection more generally

The hypothesis tests considered above are examples of rather simple model selection problems. A question was formulated in terms of which of two alternative versions of a model generated the observed response data, and one of the models was selected (with preference being given to the simpler model). Often our uncertainty does not amount to a straightforward choice between two alternatives, but instead we wish to find the best model for a set of data, from some rather large set of possibilities.

Typically we can write down the most complex model we think is reasonable for a set of data, but believe that a number of the model coefficients should really be zero. Model selection is about identifying which coefficients those are. There are two basic strategies:

1. We may want to favour simplicity, and try to find the simplest model that we can get away with from the available possibilities. This suggests developing approaches based on successive application of the hypothesis testing methods already developed.

2. We may simply want to find the best model for prediction from among the candidates. That is, we wish to find the model that would be best at predicting new data from the system that generated the original data. This can be done by selecting between models on the basis of Akaike's Information Criterion, or the alternative Bayesian Information Criterion.

    5.1 Hypothesis testing based model selection

The method of backward selection is one way of performing model selection using hypothesis testing methods. It works like this:

1. Decide on a threshold p-value, $\alpha$, below which a term will always be retained in a model.

2. Fit the largest model under consideration.

3. Evaluate p-values for testing equality to zero of all model terms, except factor variables or their interactions that are involved in higher order interactions in the model (e.g. don't look at p-values for two factors if their interaction is present in the model).

4. Refit the model without the single term with the highest p-value above $\alpha$. If there are no such terms to drop, then stop. Otherwise return to step 3.

Notice that only one term at a time is dropped at step 4. This is very important. If two covariates are highly correlated then dropping one can make a major difference to the p-value associated with the other. This means that dropping more than one variable at a time is dangerous.

The p-values at step 3 are obtained either from the t-ratio or z-ratio results of section 4.2, or by comparing models with and without the term concerned, using the GLRT or F-ratio results of section 4.3. For terms involving factor variables, the second option is usually the only viable one, since such terms usually have multiple coefficients.


As an example of backwards selection, consider the semiconductor electrical resistance data given in Faraway (2005) as the wafer data frame. Four factors (x1 to x4) in the manufacturing process were believed to influence semiconductor resistance, resist, and an experiment was conducted to try out all combinations of two levels of each. The data are as follows:

> wafer
   x1 x2 x3 x4 resist
1   -  -  -  -  193.4
2   +  -  -  -  247.6
3   -  +  -  -  168.2
4   +  +  -  -  205.0
5   -  -  +  -  303.4
6   +  -  +  -  339.9
7   -  +  +  -  226.3
8   +  +  +  -  208.3
9   -  -  -  +  220.0
10  +  -  -  +  256.4
11  -  +  -  +  165.7
12  +  +  -  +  203.5
13  -  -  +  +  285.0
14  +  -  +  +  268.0
15  -  +  +  +  169.1
16  +  +  +  +  208.5

An initial model, wm, is fitted with all interactions of the factors up to 3rd order, and single term deletions from it are then examined:

> drop1(wm,test="F")
Single term deletions

Model:
resist ~ x1 * x2 * x3 + x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 * x4
         Df Deviance     AIC F value  Pr(F)
<none>       0.008  129.726
x1:x2:x3  1  0.009  127.764  0.0380 0.8775
x1:x3:x4  1  0.011  128.035  0.3094 0.6769
x1:x2:x4  1  0.029  130.144  2.4173 0.3639
x2:x3:x4  1  0.011  128.012  0.2867 0.6871

The rows of the table are labelled with the names of the dropped terms. The reported p-value in each row is for testing the null hypothesis that the model without the dropped term is adequate (against the alternative that the full model is needed). The AIC for each model under consideration is also reported. The p-values suggest dropping the interaction x1:x2:x3; refitting without it gives a reduced model, wm1.

> drop1(wm1,test="F")
Single term deletions

Model:
resist ~ x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 * x4
         Df Deviance     AIC F value  Pr(F)
<none>       0.009  128.322
x1:x3:x4  1  0.011  126.918  0.5961 0.5208
x1:x4:x2  1  0.029  130.981  4.6577 0.1636
x3:x4:x2  1  0.011  126.874  0.5523 0.5348

Notice how they have changed from the previous set of p-values. Now the x3:x4:x2 interaction is the one to drop. Repeating these steps, we eventually end up with a model, wm7, for which:

> drop1(wm7,test="F")
Single term deletions

Model:
resist ~ x1 * x3 + x3 * x4 + x2 * x3
       Df Deviance     AIC F value   Pr(F)
<none>     0.036  139.199
x1:x3   1  0.060  142.540  5.3311 0.04977 *
x3:x4   1  0.069  144.510  7.2970 0.02702 *
x3:x2   1  0.067  144.061  6.8491 0.03079 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So if $\alpha = 0.05$, this is the final model: all main effects are present, along with 3 two-way interactions. Once a model has been selected, the coefficient estimates would be examined and interpreted, possibly with the aid of confidence intervals. Note, however, that inference performed with the final fitted model tends to overstate what can really be concluded from the data, since it does not allow for the uncertainty in model selection.

Forward selection is sometimes used as an alternative to backward selection. It starts from a very simple model, and tries adding terms, using hypothesis tests to ascertain whether an addition is worthwhile. The difficulty with it is that the distributional results on which the hypothesis tests are based rely on at least the alternative model being correct (if possibly over-complex). With forward selection this assumption is always violated at the outset.

    5.2 Prediction error based model selection

    Suppose that a vector of response data y was really generated from a pdf f0(y), and that a GLM for y implies that it was generated from pdf f(y). A measure of the mismatch between model and reality is provided by the Kullback-Leibler distance:

    K = ∫ {log[f0(y)] − log[f(y)]} f0(y) dy.
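    As a quick numerical sanity check of this definition (a made-up illustration, unconnected to any model in the notes): if f0 is the N(0,1) density and f is the N(1,1) density, then K works out to exactly 1/2, and a Monte Carlo average of the log density ratio under f0 reproduces this:

    ```r
    ## Monte Carlo check of the K-L distance between f0 = N(0,1) and f = N(1,1);
    ## the exact value is (mu0 - mu1)^2 / (2 sigma^2) = 0.5
    set.seed(2)
    y <- rnorm(1e5)                              # draws from f0
    K.hat <- mean(dnorm(y, 0, 1, log = TRUE) -
                  dnorm(y, 1, 1, log = TRUE))    # average of log[f0/f] under f0
    K.hat                                        # approximately 0.5
    ```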

    A model with a low K is obviously a good thing. In fact it is possible to estimate the expected K-L distance for any particular model fit. It can be shown that selecting between models in order to minimize this estimated expected K-L distance amounts to selecting models on the basis of their ability to minimize Akaike's Information Criterion:

    AIC = −2l(β̂) + 2 dim(β)

    where l(β̂) is the maximized log-likelihood of the model.


    An alternative to AIC is the Bayesian Information Criterion, BIC, which penalizes model complexity more heavily. It is defined as

    BIC = −2l(β̂) + log_e(n) dim(β).

    AIC and BIC can be used in place of hypothesis testing in backward model selection: the model with the lowest AIC/BIC score always being the one selected at each stage. Unlike hypothesis testing methods, AIC and BIC can be used to compare models that are not nested (although the comparisons are a bit more reliable in the nested case).
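    Both definitions are easy to verify against R's built-in AIC and BIC functions when the scale parameter is known, e.g. for a Poisson model. The following sketch uses simulated data rather than any example from the notes:

    ```r
    ## AIC and BIC computed directly from the definitions for a Poisson GLM
    ## (phi = 1 known), checked against R's AIC() and BIC()
    set.seed(3)
    n <- 50
    x <- runif(n)
    y <- rpois(n, exp(1 + 0.5 * x))
    m <- glm(y ~ x, family = poisson)

    p   <- length(coef(m))           # dim(beta)
    l   <- as.numeric(logLik(m))     # maximized log likelihood
    aic <- -2 * l + 2 * p
    bic <- -2 * l + log(n) * p
    c(aic, AIC(m), bic, BIC(m))      # hand computations match the built-ins
    ```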

    As an example of using AIC, let's redo the wafer model selection example using AIC for backwards selection.

    > library(faraway)
    > data(wafer)
    > wm <- glm(resist ~ x1*x2*x3 + x1*x3*x4 + x1*x2*x4 + x2*x3*x4,
                family = Gamma(link = "log"), data = wafer)
    > drop1(wm)
    Single term deletions

    Model:
    resist ~ x1 * x2 * x3 + x1 * x3 * x4 + x1 * x2 * x4 + x2 * x3 *
        x4

             Df Deviance     AIC
    <none>        0.008  129.726
    x1:x2:x3  1   0.009  127.764
    x1:x3:x4  1   0.011  128.035
    x1:x2:x4  1   0.029  130.144
    x2:x3:x4  1   0.011  128.012

    Since no test was specified, drop1 simply evaluates the AIC for the full model (wm in this case) and versions of it omitting each possible single term. The model with the lowest AIC is then selected. In this instance it is the model that omits x1:x2:x3, so that term would be dropped. The easiest way to refit a model omitting some terms is to use the update function, as follows. . .

    > wm1 <- update(wm, . ~ . - x1:x2:x3)
    > drop1(wm1)
    Single term deletions

    Model:
    resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
        x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4 + x2:x3:x4

             Df Deviance     AIC
    <none>        0.009  128.322
    x1:x3:x4  1   0.011  126.918
    x1:x2:x4  1   0.029  130.981
    x2:x3:x4  1   0.011  126.874

    Notice one wrinkle: the AIC reported for wm1 is 128.322, but when we used drop1 before on wm, it suggested that the AIC for wm1 would be 127.764. This happens because we need a scale parameter estimate in order to evaluate the AIC, and drop1 always uses the same estimate for all the models it


    compares, based on the largest model it is considering. Hence the two calls to drop1 give different AIC estimates for the same model, because the different calls are using different scale parameter estimates. (This is nothing to do with having used update, by the way.) Strictly speaking the AIC should be evaluated using the MLE of φ, in which case this problem would not occur, but it makes the computation much more expensive (and less reliable) if we do this. Of course the problem does not arise if φ is known.
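    The dependence on the scale parameter estimate can be seen directly in a small sketch with simulated Gamma data (not the wafer data): plugging different estimates of φ into the Gamma log likelihood gives different AIC values. That R's own AIC for a Gamma GLM plugs in the deviance-based estimate, deviance/n, is an implementation detail assumed here.

    ```r
    ## Sketch: for a Gamma GLM the AIC depends on which estimate of the scale
    ## parameter phi is plugged into the log likelihood (simulated data)
    set.seed(7)
    n <- 50
    x <- runif(n)
    y <- rgamma(n, shape = 5, scale = exp(1 + x) / 5)
    m <- glm(y ~ x, family = Gamma(link = "log"))

    phi.dev  <- m$deviance / n          # deviance-based estimate (used by R's AIC)
    phi.pear <- summary(m)$dispersion   # Pearson-based estimate
    ll <- function(phi)                 # Gamma log likelihood at the fitted means
      sum(dgamma(y, shape = 1/phi, scale = fitted(m) * phi, log = TRUE))
    p <- 3                              # 2 coefficients + phi
    c(-2 * ll(phi.dev) + 2 * p,         # matches AIC(m)
      -2 * ll(phi.pear) + 2 * p)        # a different value for the same model
    ```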

    Continuing . . .

    > wm2 <- update(wm1, . ~ . - x2:x3:x4)
    > drop1(wm2)
    Single term deletions

    Model:
    resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
        x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4

             Df Deviance     AIC
    <none>        0.011  130.224
    x2:x3     1   0.043  136.823
    x1:x3:x4  1   0.014  128.925
    x1:x2:x4  1   0.031  133.701
    > wm3 <- update(wm2, . ~ . - x1:x3:x4)
    > drop1(wm3)
    Single term deletions

    Model:
    resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +
        x3:x4 + x2:x4 + x1:x2:x4

             Df Deviance     AIC
    <none>        0.014  131.582
    x1:x3     1   0.038  136.723
    x2:x3     1   0.045  138.879
    x3:x4     1   0.047  139.386
    x1:x2:x4  1   0.034  135.503

    . . . at which point we would select wm3 and proceed to examine its coefficients and interpret the model fit. Notice that the AIC-selected model is quite a bit more complex than the model selected by hypothesis testing. This is typical. BIC, in contrast, selects simpler models than AIC, and for large sample sizes can select simpler models than hypothesis testing based methods as well (although this depends on the α level used, of course).

    Given the rather algorithmic nature of the selection process, it is possible to automate it entirely. The step function will perform the whole backwards selection-by-AIC process for you, with one function call. . .

    > wm3a <- step(wm)
    ...

    Step:  AIC= 128.32
    resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +

    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4 + x2:x3:x4

               Df Deviance     AIC
    - x2:x3:x4  1    0.011 126.874
    - x1:x3:x4  1    0.011 126.918
    <none>           0.009 128.322
    - x1:x2:x4  1    0.029 130.981

    Step:  AIC= 130.22
    resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 + x1:x4 +

    x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4

    Call:  glm(formula = resist ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x2:x3 +
        x1:x4 + x3:x4 + x2:x4 + x1:x3:x4 + x1:x2:x4, family = Gamma(link = "log"),
        data = wafer)

    Coefficients:
    (Intercept)          x1+          x2+          x3+          x4+      x1+:x2+
         5.2586       0.2843      -0.1273       0.4626       0.1502      -0.1228
        x1+:x3+      x2+:x3+      x1+:x4+      x3+:x4+      x2+:x4+  x1+:x3+:x4+
        -0.2071      -0.1783      -0.1852      -0.2339      -0.1863       0.1018
    x1+:x2+:x4+
         0.2845

    Degrees of Freedom: 15 Total (i.e. Null);  3 Residual
    Null Deviance:      0.6978
    Residual Deviance: 0.01108     AIC: 130.2

    Notice that the finally selected model is a little different from the one that was selected using drop1. This is again down to how the scale parameter is handled: it is done differently in step and drop1, and for this model this has made a slight difference to the finally selected model.

    Finally note that another possibility with AIC/BIC is all subsets model selection, in which every possible sub-model of the most complex model is considered, and the one with the lowest AIC/BIC is finally selected.
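    All-subsets selection is easy to sketch directly for a small number of candidate terms (simulated data again; the real wafer example would involve many more sub-models, and marginality restrictions between interactions and main effects would also need respecting):

    ```r
    ## All-subsets selection sketch: fit every model built from subsets of
    ## three candidate terms and keep the one with the lowest AIC
    set.seed(6)
    n <- 60
    dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
    dat$y <- rpois(n, exp(1 + dat$x1))             # only x1 matters

    terms.all <- c("x1", "x2", "x3")
    ## all 2^3 = 8 subsets, coded as TRUE/FALSE indicator rows
    grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(terms.all)))
    aics <- apply(grid, 1, function(use) {
      rhs <- if (any(use)) terms.all[use] else "1" # "1" = intercept only
      AIC(glm(reformulate(rhs, "y"), family = poisson, data = dat))
    })
    terms.all[unlist(grid[which.min(aics), ])]     # terms in the best model
    ```

    The number of sub-models grows as 2 to the power of the number of candidate terms, which is why all-subsets search is only practical for fairly small term sets.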

    5.3 Remarks on model selection

    Finally, let's review the reasons for model selection.

    1. We do model selection because we are often uncertain about the exact form that a model should take, even though it is often possible to write down a model that we expect to be complicated enough that, for some parameter values, it should be a reasonable approximation to the truth.

    2. Selection is important for interpretational reasons: simpler models are easier to interpret than complex ones.

    3. Model selection also tends to improve the precision of estimates and the accuracy of model predictions. If a model contains more terms than necessary then we will inevitably use up information in the data in estimating the associated coefficients, which in turn means that the important terms are less precisely estimated.


    Whether model selection is performed using AIC or hypothesis testing depends on the purposes of the analysis. If we want to develop a model for prediction purposes then it makes sense to use AIC, but if our interest lies in trying to understand relationships between the predictors and the response, it may be preferable to use hypothesis testing based methods, to try and avoid including model terms unless there is good evidence that they are needed.

    Finally note that there is a difficult problem associated with model selection:

    It is common practice to use model selection methods to choose one model from a large set of potential models, but then to treat the selected model exactly as if it were the only model we ever considered when it comes to calculating confidence intervals etc. In doing this we neglect the uncertainty associated with model selection, and will therefore tend to overstate how precisely we know the coefficients of the selected model (and how precise its predictions are). This issue is an active area of current statistical research.

    5.4 Interpreting model coefficients

    Once a model is selected and checked, you will usually want to examine and interpret its estimated coefficients. For many of the examples that we have met the interpretation of parameters is obvious, but for complex models it can be less easy, especially with factor variables, when identifiability constraints are needed. The failsafe way to check the meaning of each coefficient in practical modelling is to examine the model matrix. For example, the summary for model wm3 from the previous section is:

    > summary(wm3)
    ...
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)   5.27133    0.05060 104.176 5.09e-08 ***
    x1+           0.25858    0.06532   3.958  0.01670 *
    x2+          -0.12711    0.06532  -1.946  0.12354
    x3+           0.43720    0.05843   7.483  0.00171 **
    x4+           0.12466    0.06532   1.908  0.12899
    x1+:x2+      -0.12222    0.08263  -1.479  0.21321
    x1+:x3+      -0.15622    0.05843  -2.674  0.05559 .
    x2+:x3+      -0.17825    0.05843  -3.051  0.03800 *
    x1+:x4+      -0.13430    0.08263  -1.625  0.17941
    x3+:x4+      -0.18306    0.05843  -3.133  0.03508 *
    x2+:x4+      -0.18610    0.08263  -2.252  0.08743 .
    x1+:x2+:x4+   0.28450    0.11686   2.435  0.07162 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ...

    This can look intimidating, although in fact the parameter names are pretty helpful here. They basically tell you the circumstances under which the coefficient of a factor will be added to the model. For example, if x1 is in the + state for some response measurement, then we include the x1+ term in the model (which just amounts to adding 0.25858 to the linear predictor in this case, since x1 is a factor). If x1 and x2 are in the + state for some response measurement, then terms x1+, x2+ and x1+:x2+ are included, and so on.

    To make things completely clear, however, look at the model matrix (and original data frame).


    > wafer
       x1 x2 x3 x4 resist
    1   -  -  -  -  193.4
    2   +  -  -  -  247.6
    3   -  +  -  -  168.2
    4   +  +  -  -  205.0
    5   -  -  +  -  303.4
    6   +  -  +  -  339.9
    7   -  +  +  -  226.3
    8   +  +  +  -  208.3
    9   -  -  -  +  220.0
    10  +  -  -  +  256.4
    11  -  +  -  +  165.7
    12  +  +  -  +  203.5
    13  -  -  +  +  285.0
    14  +  -  +  +  268.0
    15  -  +  +  +  169.1
    16  +  +  +  +  208.5

    > model.matrix(wm3)
       (Intercept) x1+ x2+ x3+ x4+ x1+:x2+ x1+:x3+ x2+:x3+ x1+:x4+ x3+:x4+ x2+:x4+ x1+:x2+:x4+
    1            1   0   0   0   0       0       0       0       0       0       0           0
    2            1   1   0   0   0       0       0       0       0       0       0           0
    3            1   0   1   0   0       0       0       0       0       0       0           0
    4            1   1   1   0   0       1       0       0       0       0       0           0
    5            1   0   0   1   0       0       0       0       0       0       0           0
    6            1   1   0   1   0       0       1       0       0       0       0           0
    7            1   0   1   1   0       0       0       1       0       0       0           0
    8            1   1   1   1   0       1       1       1       0       0       0           0
    9            1   0   0   0   1       0       0       0       0       0       0           0
    10           1   1   0   0   1       0       0       0       1       0       0           0
    11           1   0   1   0   1       0       0       0       0       0       1           0
    12           1   1   1   0   1       1       0       0       1       0       1           1
    13           1   0   0   1   1       0       0       0       0       1       0           0
    14           1   1   0   1   1       0       1       0       1       1       0           0
    15           1   0   1   1   1       0       0       1       0       1       1           0
    16           1   1   1   1   1       1       1       1       1       1       1           1

    It is now clear that the intercept is the expected resistance for a wafer where none of the factors are in the + state. The coefficients x1+ to x4+ give the expected increase in resistivity when just one of the factors is in the + state (referring back to the summary, x1 and x3 seem to lead to a significant increase in resistance, on their own). So what about the interactions? Look at x1+:x2+ as an example: it is an adjustment that is added on when x1 and x2 are both in the + state together, i.e. it is how much the expected resistivity differs from what you would expect if the effects of x1 and x2 both being + simply added to each other. Referring back to the summary, it seems that when two factors are in the + state, the resistivity is lower than you would expect from just looking at their effects on their own (although not all the interaction coefficients are significantly different from 0).
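    The same factor-coding logic can be checked on a tiny made-up example. Here a Gaussian identity-link model is used (rather than the log-link Gamma model of the wafer data), so that the coefficients act directly on the response scale:

    ```r
    ## Toy illustration of factor coding (made-up data, identity link):
    ## the model matrix has an intercept column plus a dummy column "x+"
    d <- data.frame(x = factor(c("-", "+", "-", "+"), levels = c("-", "+")),
                    y = c(10, 14, 9, 15))
    m <- glm(y ~ x, family = gaussian, data = d)
    model.matrix(m)   # "x+" column is 1 only where x is "+"
    coef(m)           # intercept = mean at "-"; "x+" = shift when x is "+"
    ```

    With these numbers the intercept is 9.5 (the mean response at "-") and the x+ coefficient is 5 (the change in expected response when x moves to "+").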

    Confidence interval calculation for parameters is covered in section 4.1.


    Index

    AIC, 5, 7, 10, 25, 27, 29
    AIDS example, 3
    BIC, 25, 28, 29
    binomial distribution, 4, 14
    binomial GLMs in R, 9, 24
      fitted values, 11
    blood pressure example, 21
    canonical parameter, 12
    case-control study, 23
    confidence intervals
      and scale parameter, 22
      for fitted values, 7, 22
      for linear transformations of β, 22
      for parameters, 22
      transforming scale, 22
    correlated covariates, 25
    deviance, 8, 9, 17, 19
      approximate distribution, 10, 17
      null, 10
      proportion explained, 10
      scaled, 17
    distribution of β̂, 4, 16
    exponential family, 12
      log likelihood, 13
      mean, 13
      variance, 13
    F test, 17
    formula
      I(), 20
    gamma distribution, 3, 21
    generalized likelihood ratio test, 5, 12, 16, 17, 24
    GLM
      for binomial data, 9
      in R, 5
      likelihood, 14
      standard form, 2, 14
    harrier example, 3, 6
    heart attack example, 3, 8
    hypothesis test
      χ², 16, 24
      F ratio, 17
      model comparison, 23
    hypothesis testing, 5, 16
      known scale parameter, 24
      unknown scale parameter, 25
    hypothesis tests
      single parameter, 23
    information matrix, 16
    interpreting model coefficients, 31
    IRLS, 4, 15
      starting values, 15
    iterative weights, 5, 15, 18
    Kullback-Leibler distance, 27
    leverage, 19
    likelihood, 13
    likelihood of GLM, 4
    linear predictor, 2, 21
    link
      identity, 21
    link function, 2
      default, 9
      inverse, 3, 6
      logit, 4
    log likelihood, 17
    logistic regression model, 4, 23
    maximum likelihood estimation, 4, 14
      large sample results, 16
    melanoma example, 23
    model checking, 5
    model comparison, 5, 23
    model matrix, 21, 31
    model selection, 25
      all subsets, 30
      backward selection, 25
      conditioning on final model, 31
      forward selection, 27
      motivation, 30
      prediction error based, 27
    p-value, 10, 12, 23, 24
    Pearson statistic, 18
    Poisson distribution, 3
    pseudodata, 15, 18
    QQ plots, 19
    R, 5
    R function
      anova, 12, 24
      cbind, 9
      drop1, 26
      glm, 5, 24, 26
      I(), 6
      model.matrix, 31
      plot, 5, 7
      predict, 7, 22
      step, 29
      summary, 6, 31
      update, 28
      with, 5
    R model formula, 6, 20, 28
    r², 10
    residual plots, 7, 10, 19
    residuals, 5, 18
      deviance, 19
      outliers, 20
      Pearson, 5, 18
      raw, 18
      standardized, 18
    response scale, 7
    saturated model, 17
    scale parameter, 2, 8, 12, 17
      estimate, 18
      known, 16
      unknown, 17, 29
    variance function, 2, 13
    wafer example, 26, 28
