
STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes

The goal of this course is to teach methods for the analysis of discrete response data and to develop a general framework for the analysis of discrete data and other data types for which the assumptions of the classical linear model (CLM) do not hold.

Such a framework is provided by a class of models known as generalized linear models or GLMs.

• GLMs extend the class of CLMs (a.k.a. normal-theory linear models, Gauss-Markov models, or, confusingly, general linear models).

CLMs can be extended in other important ways:

• E.g., by the inclusion of both traditional (fixed) regression parameters and “random” parameters, better known as random effects.

– Such linear mixed models (or LMMs) are very useful for handling correlation and multiple sources of variability (covered in STAT 8630).

• By allowing more general forms of nonlinear regression functions (STAT 8230).

• More general classes are possible: generalized linear mixed models (GLMMs; covered in STAT 8630), nonlinear mixed models (NLMMs; covered in STAT 8230), etc.

In this class, however, we concentrate on GLMs and extensions of GLMs suitable for the analysis of independent, discrete (or otherwise non-normal) data.

But first, we need to begin by introducing some “pre-model” ideas: descriptive statistics, measures of association, model-free inference in simple data tables, etc.


Non-model-based Concepts and Methods for Discrete Data:

What do we mean by discrete data?

A discrete random variable is a random variable that can take on a finite or countably infinite number of possible values.

• In practice, all random variables are discrete, due to limitations in the precision of measurement.

• Typically, though, variables that theoretically have an underlying continuous scale are treated as continuous in statistical analyses, unless the scale of measurement is extremely coarse. E.g., weight, height, time elapsed.

The practical basis of the distinction is: does the variable take on enough values with positive probability to be well approximated by a continuous distribution?

• Therefore, we’re concerned with random variables that can assume only a small number of values.

This includes many qualitative, or nominally scaled, categorical variables,

– Religion (Christian, Hindu, Jewish), Gender (male, female), etc.

and also ordinally scaled categorical variables.

– Agreement (strongly agree, agree, neutral, disagree, strongly disagree), Pain (mild, moderate, severe), etc.

As presented, none of these characteristics (Religion, Agreement, etc.) are even, strictly speaking, random variables. They only become so, and become analyzable, when we assign numbers to their values.

– Religion (1=Christian, 2=Moslem, 3=Jewish), Gender (1=male, 2=female), Agreement (1=strongly agree, 2=agree, 3=neutral, 4=disagree, 5=strongly disagree), etc.

The other main types of discrete variables we’ll work with are grouped continuous variables (e.g., < 1, 1–5, > 5 years of service) and counts (e.g., number of hospitalizations).


2-Way Contingency Tables:

A 2-way contingency table (a.k.a. cross-tabulation) is simply a two-way array containing the joint distribution of two categorical random variables.

• Contingency tables can be used to display either the joint frequency distribution or the joint probability distribution.

Let X and Y denote two categorical random variables having I and J levels, respectively.

Let πij = Pr(X = i, Y = j). Then the joint probability distribution for the pair (X, Y) can be displayed as an I × J table.

The collection of πij's, i = 1, . . . , I, j = 1, . . . , J, can be written succinctly as {πij}. {πij} is the joint distribution of (X, Y).

The probability distribution of X ignoring Y (i.e., averaging over Y) is called the marginal distribution of X because of its natural place in the margin of the contingency table.

• The marginal distribution of X is given by {πi+} where πi+ = Pr(X = i).

• The marginal distribution of Y is given by {π+j} where π+j = Pr(Y = j).


The conditional distribution of X given Y = j may also be of interest. This distribution is given by the set of probabilities π1|j, . . . , πI|j, where

πi|j = Pr(X = i | Y = j) = Pr(X = i, Y = j) / Pr(Y = j) = πij / π+j

Joint frequency distributions can also be displayed in contingency tables simply by replacing πij's with nij's:

• Here, nij = the number of “subjects” out of a random sample of size n whose response was (X, Y) = (i, j).


The sample analog of the joint probability distribution is the joint relative frequency distribution, obtained from the joint frequency distribution simply by dividing through by n:

• Here, pij = nij/n is the proportion of the sample for which the response was (X, Y) = (i, j).

Perhaps the most basic question of interest in a two-way table situation is whether or not X and Y are independent. That is, is it true that

πij = πi+π+j , for all i, j?

Note that under independence,

πj|i = πij / πi+ = (πi+ π+j) / πi+ = π+j,   for all i, j.

• That is, Y is independent of X iff the conditional distribution of Y is the same as the marginal distribution of Y, for each row of the table.

• This is a natural way to think about independence, especially if X is the explanatory variable and Y is the response.
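As a quick illustration of these sample quantities, here is a minimal Python sketch (hypothetical counts; the course handouts use SAS) that computes the joint, marginal, and row-conditional sample distributions from a table of counts and informally compares each row's conditional distribution to the column marginals:

```python
import numpy as np

# Hypothetical 2 x 3 table of counts n_ij (rows = X, columns = Y)
n = np.array([[20., 30., 50.],
              [10., 15., 25.]])

p = n / n.sum()                              # joint sample proportions p_ij
p_row = p.sum(axis=1)                        # marginals p_{i+}
p_col = p.sum(axis=0)                        # marginals p_{+j}
p_cond = n / n.sum(axis=1, keepdims=True)    # row-conditional p_{j|i}

print(p, p_row, p_col, sep="\n")
print(p_cond)   # under independence each row here matches p_{+j};
                # these hypothetical counts were chosen so that it does
```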


Another question that often is of interest in two-way tables formed from ordinal variables is: are large values of Y more (less) likely when X = i than when X = i′?

I.e., if we think of there being an underlying continuous distribution for Y (that has been grouped or categorized to form Y), do the underlying conditional densities look like:

Or, equivalently, do the underlying c.d.f.s look like this:

Mathematically, this happens when for two rows i and i′ we have

Fj|i′ ≤ Fj|i, for all j

where Fj|i denotes the conditional c.d.f. of Y given that X = i:

Fj|i = ∑h≤j πh|i


The simplest case of a two-way table results when X and Y are both binary (dichotomous, Bernoulli, quantal). In this case we have a 2 × 2 table:

There are several ways to compare probabilities in a table like this.

1. Risk Difference.

In epidemiology and biostatistics, 2 × 2 tables often involve a response Y = disease status (1=disease present, 2=absent) and an explanatory variable X (e.g., gender: 1=female, 2=male):

In this case, π1|i = probability, or risk, of being diseased for gender i.


An obvious way to compare the disease risks is to take the risk difference:

π1|1 − π1|2 = (1 − π2|1) − (1 − π2|2) = −(π2|1 − π2|2)

• Notice that comparisons of conditional probabilities that Y = 1 with the risk difference are equivalent to comparisons of conditional probabilities that Y = 2 (differ only by sign).

• In the 2× 2 table, X,Y are independent iff

π1|1 − π1|2 = 0

• In the I × 2 table, X,Y are independent iff

π1|i − π1|i′ = 0 for all i, i′

and for the I × J table, X,Y are independent iff

πj|i − πj|i′ = 0 for all i, i′ ∈ 1, . . . , I and all j

2. Relative Risk.

One feature (drawback) of the risk difference is that it ignores the rarity of the disease; e.g., the risk difference is .01 both when the (female, male) risks are (.51, .50) and when they are (.02, .01).

• In the latter case the risk is twice as great for females!

This suggests that for some purposes, the ratio of risks is a better basis of comparison. For 2 × 2 tables the relative risk is

π1|1 / π1|2

• The relative risk takes values in [0, +∞), and a value of 1.0 corresponds to independence.


3. Odds ratio.

Another common way to think about chance is in terms of the odds of an event occurring rather than the probability of the event.

• The odds of an event are simply the probability of the event occurring divided by the probability that the event does not occur.

In a 2 × 2 table, the odds that Y = 1 in the first row (given that X = 1) is

Ω1 = π1|1 / π2|1 = (π11/π1+) / (π12/π1+) = π11 / π12

and the odds in the second row is Ω2 = π1|2 / π2|2 = π21 / π22.

The odds ratio compares the chances of Y = 1 at the two levels of X where “chances” are in terms of odds rather than probabilities (risks). We’ll denote the odds ratio as θ:

θ = Ω1/Ω2 = (π11/π12) / (π21/π22) = (π11 π22) / (π12 π21)

• Because of this last representation, θ is sometimes called the cross-product ratio.

• Odds ratios take values in [0, +∞), and, when all cell probabilities are positive, a value of 1.0 corresponds to independence.

• When θ > 1, Y = 1 is more likely in the first row than in the second.

• When θ < 1, Y = 1 is more likely in the second row than in the first.

• Note that θ is not symmetric about 1; e.g., θ = 4 and θ = 0.25 represent equally strong associations between X and Y, but in the opposite direction.


The sample version of θ replaces πij's by pij's, or, equivalently, by nij's:

θ̂ = (p11 p22)/(p12 p21) = (n11 n22/n²)/(n12 n21/n²) = (n11 n22)/(n12 n21)

The odds ratio has several nice invariance properties, not all of which are possessed by the relative risk and risk difference.

• θ is invariant to interchange of row variable and column variable (if we let Y be the row variable and X the column variable we get the same value of θ).

⇒ Explanatory/response variable distinction doesn’t affect θ.

• If we switch either the row order or the column order the result is to invert the value of θ (same strength of association but direction changed to match the row or column change).

• The sample version of θ is unaffected if we multiply through by a constant in either a row or a column.

– This last result is true of the relative risk and risk difference with respect to rows, but not columns.

It follows from these facts that the odds ratio is the only one of the three association measures that is appropriate for cross-sectional, prospective, and retrospective study designs.
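The invariance properties listed above are easy to verify numerically; here is a minimal Python sketch using a hypothetical 2 × 2 table of counts:

```python
import numpy as np

def odds_ratio(t):
    # sample cross-product ratio n11*n22 / (n12*n21)
    return t[0, 0] * t[1, 1] / (t[0, 1] * t[1, 0])

n = np.array([[30., 10.],
              [20., 40.]])

print(odds_ratio(n))                             # 6.0
print(odds_ratio(n.T))                           # rows/columns interchanged: still 6.0
print(odds_ratio(n[::-1, :]))                    # row order switched: 1/6
print(odds_ratio(n * np.array([[10.], [1.]])))   # first row scaled by 10: still 6.0
```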


Prospective Study (Clinical Trials, Cohort Studies): exposed and unexposed groups are identified and followed over time to compare incidence of disease.

Retrospective Study (Case-Control Study): diseased and disease-free subjects are identified and their exposure history investigated.

Cross-sectional Study: Sample of n subjects of unknown disease and exposure status is identified and disease status and prior exposure status are assessed simultaneously.

                           Disease Status
                            D        D̄
Exposure       E           n11      n12      n1+
Status         Ē           n21      n22      n2+
                           n+1      n+2      n

                 Prospective    Retrospective    Cross-sectional
Row totals:      fixed          random           random
Col. totals:     random         fixed            random
Grand total:     fixed          fixed            fixed

In a cross-sectional design, we fix n, and then measure n11, n12, n21, n22, from which we form p11 = n11/n, p12 = n12/n, p21 = n21/n, p22 = n22/n, which are appropriate estimators of π11, π12, π21, π22.

• Since we can estimate the πij's, we can estimate the πj|i's (π1|1, π1|2); and since all three association measures can be computed from these quantities, we can estimate risk difference, relative risk, and odds ratio.


Compared to a cross-sectional design, in a prospective design, we artificially inflate or deflate n1+, n2+ to fixed sizes. That is, we multiply through each row by a constant.

• We still have information appropriate to estimate the πj|i's:

π̂j=1|i=1 = n11/n1+,   π̂j=1|i=2 = n21/n2+

so we can still estimate risk difference, relative risk, and odds ratio.

Compared to a cross-sectional design, in a retrospective design, we artificially inflate or deflate n+1, n+2 to fixed sizes. That is, we multiply through each column by a constant.

• We now have information appropriate to estimate the πi|j's (the probability of being exposed or not exposed given your disease status): π̂i=1|j=1 = n11/n+1, π̂i=1|j=2 = n12/n+2.

• However, we can’t form appropriate estimators of the πj|i's, so we can’t get at the risk difference or relative risk.

• In contrast, we do still have information from which it is appropriate to estimate θ, because

θ = (π11 π22)/(π12 π21) = [(π11/π+1)(π22/π+2)] / [(π12/π+2)(π21/π+1)] = (πi=1|j=1 πi=2|j=2)/(πi=1|j=2 πi=2|j=1)

and we can estimate the πi|j ’s.


Examples:

A clinical trial of aspirin use to prevent heart attack: Approximately equal numbers of subjects (11,034 and 11,037) were assigned to placebo and aspirin use groups.

                              Heart Attack
                              Yes        No
Aspirin      Placebo          189     10,845      11,034
Use          Aspirin          104     10,933      11,037
                              293     21,778      22,071

Risk estimates (πj|i’s):

π1|1 = 189/11034 = .0171, π1|2 = 104/11037 = .0094

• ⇒ risk difference estimate = .0171 − .0094 = .0077 (risk is greater on placebo).

• Relative risk estimate = .0171/.0094 = 1.82.

• Odds ratio estimate is (189)(10933)/[10845(104)] = 1.83.

• The similarity of the odds ratio and relative risk estimates here illustrates a general phenomenon: for rare events (small risks), θ ≈ relative risk.
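These calculations are easy to reproduce; the following is a minimal Python sketch (the course handouts use SAS) for the aspirin table:

```python
import numpy as np

# Rows = (placebo, aspirin), columns = (heart attack, no heart attack)
n = np.array([[189., 10845.],
              [104., 10933.]])

risk = n[:, 0] / n.sum(axis=1)                            # estimated P(Y = 1 | row)
risk_diff = risk[0] - risk[1]
rel_risk = risk[0] / risk[1]
odds_ratio = n[0, 0] * n[1, 1] / (n[0, 1] * n[1, 0])

print(risk.round(4))                 # [0.0171 0.0094]
print(round(risk_diff, 4))           # 0.0077
print(round(rel_risk, 2))            # 1.82
print(round(odds_ratio, 2))          # 1.83
```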


A case-control study of oral contraceptive use and heart attack. 58 female heart attack victims were identified and each of these “cases” was matched to three “control” subjects of similar age, etc., who had not suffered heart attacks.

                               Heart Attack
                               Yes       No
Contraceptive     Yes          23        34       57
Use               No           35       132      167
                               58       166      224

In this case, the probability of heart attack given whether or not oral contraceptives have been used is not estimable. However, we can estimate probabilities of contraceptive use given presence or absence of heart attack:

π̂i=1|j=1 = 23/58 = .397,   π̂i=1|j=2 = 34/166 = .205

And from these quantities we can estimate the odds ratio:

⇒ θ̂ = [(23/58)(1 − 34/166)] / [(34/166)(1 − 23/58)] = 23(132)/[34(35)] = 2.551


The odds ratio (and risk difference and relative risk) quantifies the association between two levels of X (two rows) and two levels of Y (two columns).

• For 2 × 2 tables, that’s the whole story (the odds ratio summarizes all of the association in the table).

For I × J tables, odds ratios can be constructed for all of the combinations of pairs of rows ((I choose 2) of them) combined with pairs of columns ((J choose 2) of them).

• Many of these (I choose 2)(J choose 2) odds ratios are redundant.

A (non-unique) minimal set to describe all of the association in an I × J table requires only (I − 1)(J − 1) odds ratios.

Possibilities:

1. Adjacent category odds ratios:

θij = (πij πi+1,j+1)/(πi,j+1 πi+1,j),   i = 1, . . . , I − 1, j = 1, . . . , J − 1

2. Reference category odds ratios (last category = reference):

θij = (πij πIJ)/(πiJ πIj),   i = 1, . . . , I − 1, j = 1, . . . , J − 1

3. Reference category odds ratios (first category = reference):

θij = (πij π11)/(πi1 π1j),   i = 2, . . . , I, j = 2, . . . , J

• When the response scale is nominal, it is difficult to summarize the information in a minimal set of (I − 1)(J − 1) odds ratios as a single number with little loss of information.

• Several such measures exist, however, including the concentration coefficient and the uncertainty coefficient. See Agresti, §2.4.2.


For ordinal scaled responses, reduction of the odds ratios in an I × J table down to a single number is easier and more appropriate.

For interval and ratio scaled random variables, bivariate analyses often focus on linear association (Pearson correlation). For ordinal variables, though, linearity is not a meaningful concept. Instead, measures of linear association can be replaced by measures of monotonicity.

• X and Y have a monotone increasing (decreasing) relationship if for (Xa, Ya), (Xb, Yb) measured on two subjects a and b, Xa > Xb implies Ya ≥ Yb (Ya ≤ Yb). The relationship is strictly monotone increasing (decreasing) if Xa > Xb implies Ya > Yb (Ya < Yb).

There are several measures of the degree to which monotonicity tends to hold; these include Goodman and Kruskal’s γ, Kendall’s τb, and Somers’ d.

• All of these parameters are defined in terms of the probabilities of concordance and discordance and are estimated by counting numbers of concordant and discordant pairs.

• A pair of subjects on whom we’ve measured X and Y is concordant if the subject with the higher X-value is also the subject with the higher Y-value.

• A pair of subjects is discordant if the subject with the higher X-value is also the subject with the lower Y-value.

• A pair of subjects is tied if the subjects have the same X-value or the same Y-value.


Example — Clothing and Intelligence

The table below was published in 1894 in Biometrika. It contains a cross-classification of X = “standard of clothing” and Y = “intelligence” for 1725 school children. Both X and Y were subjectively measured on ordinal scales.

                                       Intelligence
Clothing                     A,B     C     D     E     F     G
1=Very badly clad             17    13    22    10    10     1       73
2=Poor but passable           39    58    70    61    33     4      265
3=Well clad                   41   100   202   255   138    15      751
4=Very well clad              33    48   113   209   194    39      636

Total                        130   219   407   535   375    59     1725

Intelligence was rated as follows: A = mentally deficient, B = slow and dull, C = dull, D = slow but intelligent, E = fairly intelligent, F = distinctly capable, G = very able. Categories A and B are combined in the above table.

• A pair where subject 1 is in the (1, D) cell and subject 2 is in the (2, E) cell is concordant (subject 2 is higher on both X and Y).

• A pair where subject 1 is in the (1, D) cell and subject 2 is in the (3, C) cell is discordant (subject 1 is lower on X and higher on Y).

• A pair where subject 1 is in the (1, D) cell and subject 2 is in the (1, F) cell is tied.


The measures of monotonicity mentioned above all are defined in terms of Πc and Πd, the probabilities of concordance and discordance, respectively, for a randomly selected pair of bivariate observations.

For these parameters, the association is said to be positive (negative) if Πc − Πd > 0 (Πc − Πd < 0), where

Πc = 2 ∑i ∑j πij ( ∑h>i ∑k>j πhk ),     Πd = 2 ∑i ∑j πij ( ∑h>i ∑k<j πhk )

The most important of these measures is G&K’s γ. Gamma (γ) is defined as

γ = (Πc − Πd)/(Πc + Πd) = Πc/(Πc + Πd) − Πd/(Πc + Πd).

• From the last expression above we obtain an interpretation for γ: γ is equal to the difference in the conditional probabilities of concordance and discordance, given that the pair is not tied.

Since γ is a difference in probabilities, it takes values in [−1, +1] with perfect positive (negative) association occurring when γ = 1 (γ = −1).

The sample estimate of γ is

γ̂ = (C − D)/(C + D)

where C = the total number of concordant pairs and D = the total number of discordant pairs.


Back to the Example:

C = 17(58 + 70 + 61 + 33 + 4 + 100 + · · · + 39)
    + 13(70 + 61 + 33 + 4 + 202 + · · · + 39) + 22(61 + · · · + 39)
    + · · · + 138(39)
  = 507067

and

D = 13(39 + 41 + 33) + 22(39 + 58 + 41 + 100 + 33 + 48) + · · ·
    + 15(33 + 48 + 113 + 209 + 194)
  = 254066

so

γ̂ = (507067 − 254066)/(507067 + 254066) = .3324

• Here we have mild positive association. Intelligence tends to increase with quality of clothing (keep in mind, though, that there’s a lot wrong with a study of this kind).

• See also handout clothes.sas.
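As a cross-check on the hand calculation above, here is a minimal Python sketch (an alternative to the SAS handout clothes.sas) that counts concordant and discordant pairs for this table and reproduces γ̂:

```python
import numpy as np

# Clothing (rows, ordered) by intelligence (columns, ordered) counts
n = np.array([[17,  13,  22,  10,  10,  1],
              [39,  58,  70,  61,  33,  4],
              [41, 100, 202, 255, 138, 15],
              [33,  48, 113, 209, 194, 39]])

I, J = n.shape
C = D = 0
for i in range(I):
    for j in range(J):
        C += n[i, j] * n[i+1:, j+1:].sum()   # partner higher on both X and Y
        D += n[i, j] * n[i+1:, :j].sum()     # partner higher on X, lower on Y

gamma = (C - D) / (C + D)
print(C, D, round(gamma, 4))   # 507067, 254066, 0.3324 as in the notes
```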


Inference for Two-way Tables

• See Ch. 3 of Agresti.

The data generation mechanism that leads to a given contingency table is usually modelled with one of three sampling models:

1. Poisson Sampling;
2. Multinomial Sampling; or
3. Product Multinomial Sampling.

(Rarely, a fourth sampling model based on the hypergeometric distribution comes up, but we’ll talk about that later.)

Example 1 (piston ring failures):

For quality control purposes, an engine manufacturer kept track of the number of piston ring failures that occurred in its engines while under warranty over a certain period of time. Failures were cross-classified by cylinder and position on the piston.

                       Position
Cylinder No.       N      C      S
      1           17     17     12      46
      2           11      9     13      33
      3           11      8     19      38
      4           14      7     28      49
                  53     41     72     166

• Notice that in this case all margins are random.


Example 2 (malignant melanoma):

Medical researchers decided to investigate the incidence of malignant melanoma (skin cancer) by tumor type and location (site) of tumor. To do this they examined hospital records from 400 patients treated for malignant melanoma.

                                  Site
Tumor Type     Head/Neck     Trunk     Extremities
    A              22           2           10          34
    B              16          54          115         185
    C              19          33           73         125
    D              11          17           28          56
                   68         106          226         400

• Notice that in this case the bottom right table margin (400) is fixed by design; all other margins are random.

Example 3 (flu vaccine):

In a clinical trial for a flu vaccine approximately half of a total sample size of 73 subjects were randomized to each of two groups: placebo and vaccine. The response of interest was antibody level, which was categorized as low, medium, and high.

                             Antibody Level
Treatment Group        Low      Medium      High
    Placebo             25         8          5        38
    Vaccine              6        18         11        35
                        31        26         16        73

• Notice that in this case the bottom right margin as well as both row margins are fixed by design. The column margins are random.

These 3 tables illustrate 3 distinct sampling situations ⇒ we must have 3 distinct sampling models.


Sampling Models:

Let nij = count in the (i, j)th cell of an I × J contingency table (we won’t distinguish notationally between the random variable nij and its realized value; the reference will typically be clear from the context). Let N = IJ, the total number of cells in the table, and let mij = E(nij), i = 1, . . . , I, j = 1, . . . , J, denote the expected cell frequencies.

1. Poisson sampling. Assume n11, . . . , nIJ's are independent Poisson r.v.'s with means m11, . . . , mIJ, respectively.

Recall the Poisson distribution: nij ∼ Poisson(mij) means that nij has probability mass function

f(nij; mij) = exp(−mij) mij^nij / nij!   for nij = 0, 1, 2, . . ., and 0 otherwise,

and E(nij) = var(nij) = mij.

• The Poisson distribution is appropriate when nij can be thought of as the count of events that occur according to a Poisson process with rate mij.

• More generally, the Poisson distribution is useful for counts of events that occur randomly through time or space without upper bound.

Recall that the likelihood function for a random variable Y is equal to the probability density (mass) function for that random variable, but thought of as a function of the parameters of that density rather than as a function of the value of the random variable.

The likelihood function for random variables Y1, . . . , Yn (i.e., for the random vector Y = (Y1, . . . , Yn)T) is the joint p.d.f. of Y1, . . . , Yn treated as a function of the parameters of the density.


Likelihood function for nij :

L(mij; nij) = f(nij; mij) = exp(−mij) mij^nij / nij!

⇒ the likelihood for n = n11, . . . , nIJ is

L(m; n) = ∏i,j [ e^(−mij) mij^nij / nij! ]

where m = (m11, . . . , mIJ)T.

• Notice that if n11, . . . , nIJ are all independent Poisson random variables, then n = ∑i,j nij is random, too. In fact, n ∼ Poisson(∑i,j mij).

⇒ appropriate for a situation like example 1, piston ring failures.

2. Multinomial Sampling.

What if n is fixed, as in the melanoma example?

In that case the Poisson sampling model is not appropriate. If n = 400, then no single cell count can exceed 400 (nij ≤ n ∀ i, j).

Instead we can think of the N cells in the contingency table as N distinct, mutually-exclusive, and exhaustive outcomes for a single categorical response variable, with outcome probabilities π11, π12, . . . , πIJ.

This approach leads to the multinomial sampling model.


The Multinomial Distribution

Let Z be a response variable taking N possible values which we’ll label 1, 2, . . . , N. In some sense, Z is really a multivariate response, because we can replace Z with x = (x1, . . . , xN−1)T where

xr = 1 if Z = r (r = 1, . . . , N − 1), and xr = 0 otherwise.

Consideration of Z is equivalent to considering x because

Z = r if and only if x = (0, . . . , 0, 1, 0, . . . , 0)T

and

Pr(Z = r) = Pr(xr = 1)

Now suppose we have n copies of x: x1, . . . , xn. Then the sum of these vectors

y = ∑_{i=1}^{n} xi

follows the multinomial distribution and

Pr(y = (n1, . . . , nN−1)T) = [ n! / ( (∏_{r=1}^{N−1} nr!) (n − n+)! ) ] ( ∏_{r=1}^{N−1} πr^nr ) (1 − π+)^(n − n+)

where πr = Pr(Zi = r), π+ = ∑_{r=1}^{N−1} πr, and n+ = ∑_{r=1}^{N−1} nr.

• We write this as y ∼ Mult(n,π), where π = (π1, . . . , πN−1)T .

While y is the vector consisting of the number of units falling in each of the first N − 1 categories, y/n gives the proportion of the sample falling in each of these categories. Moments of ȳ = y/n are

E(ȳ) = π,   var(ȳ) = (1/n)(diag(π) − ππT)

              [ π1(1 − π1)     −π1π2         · · ·    −π1πN−1          ]
      = (1/n) [ −π2π1          π2(1 − π2)    · · ·    −π2πN−1          ]
              [   ...            ...         . . .      ...            ]
              [ −πN−1π1        −πN−1π2       · · ·    πN−1(1 − πN−1)   ]
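To see these moment formulas in action, here is a minimal Python simulation sketch (hypothetical probabilities, N = 3 categories) comparing the empirical mean and covariance of ȳ = y/n to π and (1/n)(diag(π) − ππT):

```python
import numpy as np

rng = np.random.default_rng(0)

pi_full = np.array([0.2, 0.5, 0.3])      # all N = 3 category probabilities
pi = pi_full[:-1]                        # pi holds the first N - 1 = 2 of them
n, reps = 200, 100_000

y = rng.multinomial(n, pi_full, size=reps)[:, :-1]   # keep first N - 1 counts
ybar = y / n

print(ybar.mean(axis=0))                              # close to pi
print(np.cov(ybar, rowvar=False))                     # empirical var(y/n)
print((np.diag(pi) - np.outer(pi, pi)) / n)           # (1/n)(diag(pi) - pi pi^T)
```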


For the two-way table where ∑i,j nij = n, any subset of IJ − 1 of the IJ cell counts follows a multinomial distribution. Equivalently, if n = (n11, . . . , nIJ)T and π = (π11, . . . , πIJ)T, the likelihood function is given by

L(π; n) = Pr(n = n; π) = n! ∏i,j ( πij^nij / nij! )

• In contrast to the Poisson sampling model, we’ve parameterized the likelihood in terms of the cell probabilities (the πij's) rather than the expected cell counts (the mij's), but these quantities are related by

mij = nπij , i = 1, . . . , I, j = 1, . . . , J.

3. Product Multinomial Sampling.

In Poisson sampling, no margins were fixed; in multinomial sampling the grand total was fixed. If row (or column) totals are fixed, then it’s appropriate to think of the table as made up of independent multinomial samples corresponding to the rows.

• In the flu vaccine example, two independent samples were taken: one of the 38 placebo patients, and a second of the 35 vaccine patients.

• ⇒ the two rows of the flu vaccine table can be thought of as independent multinomials.

First row: (n11, n12)T ∼ Mult(38, (πj=1|i=1, πj=2|i=1)T).

Second row: (n21, n22)T ∼ Mult(35, (πj=1|i=2, πj=2|i=2)T).

The likelihood for the entire table is the product of the multinomial likelihoods for each row (independence).

For row totals fixed, the likelihood is given by

∏_{i=1}^{I} [ ni+! ∏_{j=1}^{J} ( πj|i^nij / nij! ) ]

• For row totals fixed, multinomial probabilities are related to expected cell counts via

mij = ni+ πj|i,   i = 1, . . . , I, j = 1, . . . , J.


Estimation:

We’ve already utilized the sample proportions (pij's where pij = nij/n) as estimates of the corresponding cell probabilities (the πij's) — for example, to form an estimator of the odds ratio in a 2 × 2 table: θ̂ = p11p22/(p12p21).

pij as an estimator of πij can be motivated as a method of moments (MOM) estimator — we estimate a population moment (in this case, a mean) with the corresponding sample moment.

This estimator can also be derived as the maximum likelihood estimator (MLE).

Maximum Likelihood Estimation:

Suppose we have a discrete random variable Y (possibly a vector) with observed value y. Suppose Y has probability mass function f(y; θ), θ ∈ Θ.

The likelihood function, L(θ; y), is defined to equal the probability mass function (more generally, the density) but viewed as a function of θ, not y:

L(θ; y) = f(y;θ)

Therefore, the likelihood at θ0, say, has the interpretation

L(θ0; y) = Pr(Y = y when θ = θ0)

= Pr(observing the obtained data when θ = θ0)

Logic of ML: choose the value of θ that makes this probability largest

⇒ θ̂, the MLE.

Important Property of MLEs: Consider a reparameterization, ϕ = h(θ) where h is one-to-one, h : Θ → Φ.

If θ̂ is a MLE of θ, then ϕ̂ = h(θ̂) is a MLE of ϕ (MLE is invariant to parameterization).


We use the same procedure when Y is continuous with density function f(y; θ): maximize L(θ; y) = f(y; θ).

Often, our data come from a random sample so that we observe y corresponding to Yn×1, a vector of independent r.v.'s. In this case

L(θ; y) = ∏_{i=1}^{n} f(yi; θ)

Since it’s easier to work with sums than products, it’s useful to note that in general

argmax_θ L(θ; y) = argmax_θ log L(θ; y),   where ℓ(θ; y) ≡ log L(θ; y) is the log-likelihood.

It can be shown that under all three sampling models, the log-likelihood function is strictly concave and has a unique maximum (Birch, 1963; Haberman, 1973, 1974; Baker et al. 1985).

It follows that the MLEs can be found by solving the likelihood equations in which we set partial derivatives of the log-likelihood with respect to each parameter equal to zero.


Multinomial Sampling.

The multinomial loglikelihood is

ℓ(π; n) = ∑i ∑j nij log(πij) + log(n!) − ∑i ∑j log(nij!)

where the first term is the kernel of ℓ and the remaining terms don’t involve π.

• Maximizing the kernel of ℓ(π;n) is equivalent to maximizing ℓ(π;n).

To find the MLE of π we must maximize the kernel of ℓ(π; n) subject to the side constraint ∑i ∑j πij = 1.

Question: How do we maximize a function f(θ) subject to equality constraints g1(θ) = 0, . . . , gr(θ) = 0?

Answer: Use the method of Lagrange multipliers.

First construct the “Lagrangian” L where

L = f(θ)− λ1g1(θ)− · · · − λrgr(θ)

Then maximize L with respect to θ, λ = (λ1, . . . , λr)T. The maximizer θ̂ of the Lagrangian is the constrained maximizer of f(θ).

• The λi’s are called Lagrange multipliers.


To obtain the multinomial MLE we first construct the Lagrangian (r = 1):

L = ∑i ∑j nij log πij − λ ( ∑i ∑j πij − 1 )

Then maximize with respect to π, λ:

∂L/∂πij = nij/πij − λ,   i = 1, . . . , I, j = 1, . . . , J

∂L/∂λ = 1 − ∑i ∑j πij

Setting these partials to zero implies that the maximizers λ̂ and π̂ij, i = 1, . . . , I, j = 1, . . . , J, satisfy

λ̂ π̂ij = nij,   i = 1, . . . , I, j = 1, . . . , J

∑i ∑j π̂ij = 1

Adding the first IJ equations together we get

λ̂ ∑i,j π̂ij = ∑i,j nij = n

which combined with the last equation yields λ̂ = n.

So, the MLE of πij subject to the constraint ∑i,j πij = 1 is

π̂ij = nij/n = pij, the sample proportion.

Since mij = nπij, the MLE of the expected count in the (i, j)th cell is

m̂ij = n π̂ij = nij, the observed cell count.


Poisson Sampling.

For Poisson sampling, obtaining the MLEs of the expected cell counts is a bit easier since there are no constraints on these parameters.

In this case, the loglikelihood is given by

ℓ(m; n) = ∑i ∑j [ nij log(mij) − mij ] − ∑i ∑j log(nij!)

where the first (double) sum is the kernel of ℓ.

The likelihood equations are given by

∂ℓ/∂mij evaluated at mij = m̂ij equals 0,   i = 1, . . . , I, j = 1, . . . , J

or

nij/m̂ij − 1 = 0  ⇒  m̂ij = nij,   i = 1, . . . , I, j = 1, . . . , J

• MLEs are the same under Poisson, multinomial sampling!

Product Multinomial Sampling.

Let πc = (πj=1|i=1, . . . , πj=J|i=I)T be the vector of conditional probabilities conditioning on the value of X, the row variable. Then the product multinomial loglikelihood is given by

ℓ(πc; n) = ∑i log ni+! + ∑i ∑j ( nij log πj|i − log nij! )

The πj|i's must satisfy ∑j πj|i = 1 for each i, so we must maximize ℓ(πc; n) subject to these constraints.


Lagrangian:

L = ∑i ∑j nij log πj|i − ∑i λi ( ∑j πj|i − 1 )

Partials:

∂L/∂πj|i = nij/πj|i − λi,   i = 1, . . . , I, j = 1, . . . , J

∂L/∂λi = ∑j πj|i − 1,   i = 1, . . . , I

So the MLEs must satisfy

nij = λ̂i π̂j|i,   i = 1, . . . , I, j = 1, . . . , J

∑j π̂j|i = 1,   i = 1, . . . , I

Summing the first set of equations over j for each i and using the second set of equations we get λ̂i = ni+, i = 1, . . . , I.

Plugging λ̂i = ni+ back into the first set of equations, we get the MLEs:

π̂j|i = nij/ni+,   i = 1, . . . , I, j = 1, . . . , J

• Since mij = ni+ πj|i, we get m̂ij = ni+ π̂j|i = nij, under product multinomial sampling.

• Again, we get the same MLEs for expected cell counts.


Goodness of Fit

Suppose we have a multinomial vector (n1, . . . , nN−1)T with probabilities (π1, . . . , πN−1)T based on a sample of size n.

Let π = (π1, . . . , πN−1, πN)T, n = (n1, . . . , nN−1, nN)T where πN = 1 − ∑_{i=1}^{N−1} πi and nN = n − ∑_{i=1}^{N−1} ni (i.e., ∑_{i=1}^{N} πi = 1 and ∑_{i=1}^{N} ni = n).

Consider the problem of testing H0 : π = π0 where π0 = (π10, . . . , πN0)T and ∑i πi0 = 1.

Under H0, the expected category frequencies are mi = nπi0, i = 1, . . . , N. A test statistic for comparing observed category frequencies with expected (or estimated expected) frequencies is called a goodness of fit statistic.

The two most important goodness of fit statistics are

1. The Pearson chi-squared statistic (Pearson statistic).
2. The likelihood ratio or deviance statistic.

The Pearson statistic is given by

X² = ∑_{i=1}^{N} (ni − mi)²/mi

or, in the two-way table,

X² = ∑_{i=1}^{I} ∑_{j=1}^{J} (nij − mij)²/mij.

The likelihood ratio statistic is given by

G² = 2 ∑_{i=1}^{N} ni log(ni/mi)

or, in the two-way table,

G² = 2 ∑_{i=1}^{I} ∑_{j=1}^{J} nij log(nij/mij).


Asymptotically, both of these statistics have χ²(N − 1) distributions (χ²(IJ − 1) in the two-way table case). This result holds for fixed N as n → ∞.

• The sample size necessary for the limiting χ² distribution to provide a good approximation to the exact distributions of X² and G² depends upon the statistic, n, N, and the degree of “sparseness” in the table.

• For fixed number of cells, the exact distribution of X² usually converges to χ² more quickly than that of G². Agresti recommends a minimum ratio n/N of at least 5 for G², while n/N can be as small as 1 for accuracy of the χ² approximation to X², as long as the table does not contain both very small and moderately large expected frequencies.

• Section 9.8.4 of Agresti provides further guidelines concerning the accuracy of the χ² approximations to the distributions of X² and G².

Example – Mendel’s Genetics Experiments:

Gregor Mendel is widely considered to be the father of modern genetics. In 1865 he published results of an experiment in which he crossed pea plants of a pure yellow strain with plants of a pure green strain. He hypothesized the existence of genes in the parent plants that controlled color and that came in one of two varieties: dominant and recessive. In this case the color gene was hypothesized to be either y or g with y being dominant. Offspring receive one gene from each parent, so there are four gene-pairs:

(y, y), (y, g), (g, y) – all resulting in a yellow colored offspring (y is dominant); and
(g, g) – which results in a green colored offspring.

Mendel hypothesized that in second generation crosses of pure yellow and pure green plants, these four gene-pairs would be equally likely so that 75% of the second generation plants would be yellow, 25% green.


Observed data:

                Color
    Yellow      Green
     6022        2001        8023

Expected data:

                        Color
    Yellow                     Green
    8023(.75) = 6017.25        8023(.25) = 2005.75        8023

Test statistics:

X² = (6022 − 6017.25)²/6017.25 + (2001 − 2005.75)²/2005.75 = 0.015

G² = 2[ 6022 log(6022/6017.25) + 2001 log(2001/2005.75) ] = 0.015

Here N − 1 = 1, so each statistic has an approximate χ²(1) distribution, which gives an approximate p-value of p = .88 for each statistic. ⇒ fail to reject H0 ⇒ data support Mendel’s theory.
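The Mendel calculation is a one-liner in most software; a minimal Python sketch (using scipy for the χ² tail probability):

```python
import numpy as np
from scipy import stats

observed = np.array([6022., 2001.])                    # yellow, green
expected = observed.sum() * np.array([0.75, 0.25])     # 6017.25, 2005.75

X2 = ((observed - expected) ** 2 / expected).sum()     # Pearson statistic
G2 = 2 * (observed * np.log(observed / expected)).sum()

df = len(observed) - 1                                 # N - 1 = 1
print(round(X2, 3), round(G2, 3))                      # both about 0.015
print(round(stats.chi2.sf(X2, df), 3))                 # approximate p-value
```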

Mendel reported the results of several experiments of this sort. R.A. Fisher used the reproductive property of the χ² distribution to summarize those results.

The reproductive property says that if X²1, . . . , X²k are independent χ² random variables with degrees of freedom ν1, . . . , νk, respectively, then ∑_{i=1}^{k} X²i ∼ χ²( ∑_{i=1}^{k} νi ).

By combining X² statistics in this way across many of Mendel’s experiments, Fisher obtained X² = 42 on d.f. = 84, which yields p = .99996.

• This result says that under H0, the probability is .00004 that the observed cell counts should fit the expected cell counts at least as closely as Mendel’s data did.

• Did Mendel “cook” his data? Maybe so.


Where do the test statistics X² and G² come from and why are their (asymptotic) distributions χ²?

Notice that in the N = 2 case (Mendel example), H0 : (π1, π2)T = (π10, π20)T is equivalent to H0 : π1 = π10 since π2 = 1 − π1 and π20 = 1 − π10.

That is, the problem is that of testing a binomial probability equal to some null value. In the Mendel example we observed a binomial: 6022 successes (yellow plants) out of 8023 trials, and we wanted to know if p = 6022/8023 was consistent with an underlying probability of success π = π0 where π0 = .75.

By the Central Limit Theorem, p ∼ N(π, π(1 − π)/n) approximately, for large n, which leads to a z-test for H0 (STAT 2000):

z = (p − π0)/√(π0(1 − π0)/n) ∼ N(0, 1), approximately,

or, equivalently,

z² = n(p − π0)[π0(1 − π0)]⁻¹(p − π0) ∼ χ²(1), approximately.

With a little algebra, it’s easy to show that z² = X² in the N = 2 case.

More generally, for any N, let p = (n1/n, . . . , nN−1/n)T, π0 = (π10, . . . , πN−1,0)T, and Σ0 = var(√n(p − π0)) evaluated under H0. Then the Pearson statistic X² can be written as a quadratic form:

X² = n(p − π0)T Σ0⁻¹ (p − π0)

In the N = 2 (binomial) case, √n(p − π0) converged to N(0, π0(1 − π0)) under H0 (CLT). In the N > 2 (multinomial) case, √n(p − π0) converges to N(0, Σ0). Therefore, the quadratic form

n(p − π0)T Σ0⁻¹ (p − π0)

converges in distribution to a χ²(N − 1).


How about G2?

Under likelihood-based inference the classical tool for testing hypotheses is the Likelihood Ratio Statistic:

λ = L(θ̂; y) / L(θ̃; y),   where θ̂ = MLE under H0 ∪ HA and θ̃ = MLE under H0

Logic:

If λ ≈ 1 ⇒ no evidence against H0.

If λ >> 1 then assuming H0 is true has made our data much less likely than it would have been without assuming H0 ⇒ reject H0.

How large does λ need to be to reject?

Answer: large in comparison to its distribution (above the 100(1 − α)th percentile of its distribution).

Wilks is famous for having proved that under mild regularity conditions, 2 log λ has a limiting χ²(ν) distribution as n → ∞. Here ν = the difference between the dimensions of the parameter space for θ under H0 ∪ HA and H0.

• I.e., ν = the number of restrictions placed on θ by H0.

In our problem H0 : π = π0 places N − 1 restrictions on π: (1) π1 = π10, (2) π2 = π20, . . ., (N − 1) πN−1 = πN−1,0.

• Under both H0 and HA we require ∑_{i=1}^{N} πi = 1, so πN = πN0 is not an additional restriction once restrictions (1)–(N − 1) are required.

• Another way to think about the d.f. is that the dimension of the parameter space for π under H0 ∪ HA is N − 1 (all N of the πi's can vary freely except for one restriction) and the dimension under H0 is 0 (none of the πi's can vary freely). Therefore the degrees of freedom are N − 1 − 0 = N − 1.


Under multinomial sampling, the likelihood is n! ∏_{i=1}^{N} πi^ni/ni!, so

λ = [ n! ∏_{i=1}^{N} π̂i^ni/ni! ] / [ n! ∏_{i=1}^{N} π̃i^ni/ni! ] = ∏_{i=1}^{N} (π̂i/π̃i)^ni

where π̂i = ni/n is the unrestricted MLE, and π̃i = πi0 is the restricted MLE (under H0 : π = π0).

Since π̂i/π̃i = ni/mi0, where mi0 = nπi0 is the expected cell frequency under the null hypothesis, we have

λ = ∏_{i=1}^{N} (ni/mi0)^ni

so

2 log(λ) = 2 ∑i ni log(ni/mi0) = G² ∼ χ²(N − 1) asymptotically.

The fact that X² and G² have the same asymptotic χ² distribution says that these tests are asymptotically equivalent.

• In fact, X² and G² are special cases of a family of test statistics which are all asymptotically equivalent — the power-divergence family.

• The power divergence family is defined to include statistics of the form

2/(ϕ(ϕ + 1)) ∑_{i=1}^{N} ni [ (ni/mi)^ϕ − 1 ]

with special cases given by different choices of ϕ ∈ (−∞, +∞).

• The special cases corresponding to ϕ = 0 and ϕ = −1 are defined as the limits as ϕ → 0 and ϕ → −1, respectively.

• Note that ϕ = 1 gives X² and ϕ → 0 gives G², but other important cases include ϕ = −2 (Neyman’s modified chi-square statistic, ∑_{i=1}^{N} (ni − mi)²/ni), ϕ = −1/2 (the Freeman-Tukey statistic, 4 ∑_{i=1}^{N} (√ni − √mi)²), and ϕ = 2/3 (Cressie and Read found that the χ² approximation in small samples is best for this choice of ϕ).
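A small Python sketch of the power-divergence family (with the ϕ = 0 and ϕ = −1 cases handled as limits), applied to the Mendel counts to show that ϕ = 1 reproduces X² and ϕ near 0 reproduces G²:

```python
import numpy as np

def power_divergence(n, m, phi):
    """2/(phi(phi+1)) * sum_i n_i[(n_i/m_i)^phi - 1], with limits at phi = 0, -1."""
    n, m = np.asarray(n, float), np.asarray(m, float)
    if phi == 0:                       # limit: likelihood ratio statistic G^2
        return 2 * np.sum(n * np.log(n / m))
    if phi == -1:                      # limit: modified likelihood ratio statistic
        return 2 * np.sum(m * np.log(m / n))
    return 2 / (phi * (phi + 1)) * np.sum(n * ((n / m) ** phi - 1))

n, m = [6022, 2001], [6017.25, 2005.75]
for phi in (1, 2/3, 0, -1/2, -2):
    print(phi, round(power_divergence(n, m, phi), 4))
```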


Goodness of fit with estimated expected frequencies:

One of the most common uses of the Pearson statistic in a two-way table is to test independence between the row and column variables, X and Y.

                                  Site
Tumor Type     Head/Neck     Trunk     Extremities
    A              22           2           10          34
    B              16          54          115         185
    C              19          33           73         125
    D              11          17           28          56
                   68         106          226         400

For example, in the above table a natural question to ask is, “Does tumor type depend upon tumor site?”

Under independence,

πij = πi+π+j , i = 1, . . . , I, j = 1, . . . , J.

If we knew the πi+'s and the π+j's, then we could test H0 : πij = πi+π+j ∀ i, j using

X² = ∑_{i=1}^{I} ∑_{j=1}^{J} (nij − mij)²/mij ∼ χ²(IJ − 1), approximately,

where mij = nπi+π+j.


The usual case, however, is one in which we don’t know the marginal probabilities. The best we can do is to estimate mij with m̂ij = n pi+ p+j and substitute estimates for the expected cell counts in X²:

X² = ∑_{i=1}^{I} ∑_{j=1}^{J} (nij − m̂ij)²/m̂ij

How does this affect the distribution of X2?

Answer: When the mij's are estimated by ML (or any best asymptotically normal (BAN) estimator) then X² ∼ χ²(N − 1 − s) asymptotically (N = IJ in the two-way table), where s = the number of “nonredundant” parameters estimated to obtain the m̂ij's.

• Proof: see Read and Cressie (1988, Appendix 6), or Bishop et al. (1975, §14.9).

• BAN: consistent, asymptotically normal, and asymptotically efficient.

• s “non-redundant” parameters: the Jacobian of the transformation from the model parameters to m is of rank s.

• Analogous result holds for G² and other power-divergence statistics.

• All results on asymptotic distributions of power-divergence statistics assume N fixed, n → ∞ (we can get into trouble when N increases with n) and hold under all three sampling models.

In the two-way table under independence, m̂ij = n π̂i+ π̂+j = n pi+ p+j. We estimated π1+, . . . , πI+ with p1+, . . . , pI+ (I − 1 non-redundant parameters) plus we estimated π+1, . . . , π+J with p+1, . . . , p+J (J − 1 non-redundant parameters).

So, with expected cell frequencies estimated under independence,

X², G² ∼ χ²(IJ − 1 − [I − 1] − [J − 1]) = χ²([I − 1][J − 1]), approximately.


Independence in the Melanoma Example: (See melanoma.sas.)

Observed cell counts:

                                  Site
Tumor Type     Head/Neck     Trunk     Extremities
    A              22           2           10          34
    B              16          54          115         185
    C              19          33           73         125
    D              11          17           28          56
                   68         106          226         400

Expected cell counts:

                                       Site
Tumor Type     Head/Neck             Trunk       Extremities
    A          34(68)/400 = 5.78      9.01         19.21          34
    B          31.45                 49.025       104.525        185
    C          21.25                 33.125        70.625        125
    D           9.52                 14.84         31.64          56
               68                   106           226            400

So,

X² = (22 − 5.78)²/5.78 + · · · + (28 − 31.64)²/31.64
   = 45.517 + · · · + .4188 = 65.8129

and G² = 51.795. Both statistics have approximate χ²(12 − 1 − 3 − 2) = χ²(6) distributions, which yields p-values < .0001 in each case.

• Tumor type and tumor site are not independent.

• Notice in melanoma.lst that the contributions to the overall chi-square statistic vary considerably over the entire table. This suggests that the dependence between tumor type and site may be limited to (or at least strongest for) certain categories of X and Y. It would be of interest here to partition X² (or G²) to better understand the nature of the dependence between tumor type and site. See §3.3 in Agresti.
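For comparison with melanoma.sas, here is a minimal Python sketch that reproduces the X² test of independence (and G²) for the melanoma table:

```python
import numpy as np
from scipy import stats

# Rows = tumor type (A-D), columns = site (head/neck, trunk, extremities)
n = np.array([[22.,   2.,  10.],
              [16.,  54., 115.],
              [19.,  33.,  73.],
              [11.,  17.,  28.]])

X2, p, df, m_hat = stats.chi2_contingency(n, correction=False)
G2 = 2 * np.sum(n * np.log(n / m_hat))    # likelihood ratio statistic

print(round(X2, 4), df, p)                # 65.8129 on 6 df, p < .0001
print(round(G2, 3))                       # 51.795
```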


Large-Sample Confidence Intervals

Suppose we would like to form a 100(1 − α)% confidence interval around θ, the odds ratio; or around γ, Goodman and Kruskal’s measure of monotonicity; or suppose we want to test a hypothesis on the odds ratio of the form H0 : θ = θ0, or on the relative risk: H0 : π1|1/π1|2 = RR0.

In each of these cases we require an exact, or at least approximate, distribution for some function f(π̂), and a standard error for f(π̂).

A general method for obtaining the asymptotic distribution (including asymptotic s.e.s) of such a function is provided by the delta-method.

δ−Method (see §14.1 of Agresti):

Let X be a r.v. with known mean and variance:

E(X) = µ, var(X) = E[(X − µ)2] = σ2

Let Y = g(X) where g is a twice differentiable function.

Suppose that exact calculation of E(Y) and var(Y) is difficult. Approximations may be obtained based on a Taylor series expansion of g(X) about µ.

By Taylor’s Theorem

g(X) = g(µ) + g′(µ)(X − µ) + (1/2) g′′(µ*)(X − µ)²

where µ* lies between X and µ.

Assuming the 3rd term is small and can be ignored, we have the approximations

E(Y) ≈ E[g(µ) + g′(µ)(X − µ)] = g(µ)

var(Y) = E[(g(X) − E g(X))²] ≈ E[(g(µ) + g′(µ)(X − µ) − g(µ))²] = E[(g′(µ)(X − µ))²] = [g′(µ)]² E[(X − µ)²] = [g′(µ)]² var(X)


Now let Xn be a sequence of r.v.'s such that the asymptotic distribution of √n(Xn − µ) is N(0, σ²(µ)). I.e., suppose

√n(Xn − µ) →d N(0, σ²(µ))

For a sequence of (non-random) variables xn, by Taylor’s Theorem

g(xn) = g(µ) + (xn − µ)g′(µ) + (1/2)(xn − µ)² g′′(µ*)
      = g(µ) + (xn − µ)g′(µ) + O(|xn − µ|²)

Substituting the random sequence Xn for xn and rearranging,

√n (g(Xn) − g(µ)) = √n (Xn − µ) g′(µ) + √n Op(|Xn − µ|²)
                  = √n (Xn − µ) g′(µ) + Op(n^(−1/2)),

where the remainder term is op(1) because |Xn − µ|² = Op(n⁻¹).

⇒ √n (g(Xn) − g(µ)) has the same limiting distribution as √n (Xn − µ) g′(µ). Or,

⇒ √n (g(Xn) − g(µ)) →d N(0, σ²(µ)[g′(µ)]²).

Here, if g′(·) and σ²(·) are continuous at µ, the asymptotic variance σ²(µ)[g′(µ)]² can be consistently estimated by σ²(Xn)[g′(Xn)]², so (e.g.)

g(Xn) ± 1.96 σ(Xn)|g′(Xn)|/√n

is an approximate (large-sample) 95% confidence interval for g(µ).


Multivariate Case:

Now let X = (X1, . . . , Xp)T be a random vector with known mean and variance-covariance matrix:

E(X) = µ, var(X) = Σ

Let Y = g(X1, . . . , Xp) where g is a continuous function with first and second partial derivatives.

Let Y = g(X1, . . . , Xp) where g is a continuous function with first andsecond partial derivatives.

A Taylor series expansion of g(X) about µ is given by

g(X) = g(µ)+

p∑i=1

∂g

∂µi(Xi−µi)+

1

2

p∑i=1

p∑j=1

∂2g

∂µi∂µj(Xi−µi)(Xj−µj)+ · · ·

where∂g

∂µi=

∂g(X)

∂Xi

∣∣∣X=µ

,∂2g

∂µi∂µj=

∂2g(X)

∂Xi∂Xj

∣∣∣X=µ

Let ∂g(µ)/∂µT = ( ∂g∂µ1

, . . . , ∂g∂µp

) be the row vector of partials with respect

to the elements of µ. Then, ignoring 3rd and higher-order terms,

⇒ Y = g(X).= g(µ) +

(∂g(µ)

∂µT

)(X− µ)

⇒ E(Y ).= E(g(µ)) +

(∂g(µ)

∂µT

)E(X− µ) = g(µ)

and

var(Y ) = E[g(X)− E(g(X))2].= E[g(µ) +

(∂g(µ)

∂µT

)(X− µ)− g(µ)2] = E[

(∂g(µ)

∂µT

)(X− µ)2]

= E[

(∂g(µ)

∂µT

)(X− µ)(X− µ)T

(∂g(µ)

∂µT

)T

]

=

(∂g(µ)

∂µT

)E[(X− µ)(X− µ)T ]

(∂g(µ)

∂µT

)T

=

(∂g(µ)

∂µT

(∂g(µ)

∂µT

)T


Now let Y = (Y1, . . . , Yu)T , where

Yi = gi(X1, . . . , Xp) = gi(X), i = 1, . . . , u

From the univariate results above,

E(Yi) ≈ gi(µ),   var(Yi) ≈ (∂gi(µ)/∂µT) Σ (∂gi(µ)/∂µT)T,   i = 1, . . . , u

The off-diagonal elements in var(Y) can also be approximated:

cov(Yi, Yj) = E[(Yi − E(Yi))(Yj − E(Yj))]
            ≈ E[(∂gi(µ)/∂µT)(X − µ) · (∂gj(µ)/∂µT)(X − µ)]
            = E[(∂gi(µ)/∂µT)(X − µ)(X − µ)T (∂gj(µ)/∂µT)T]
            = (∂gi(µ)/∂µT) E[(X − µ)(X − µ)T] (∂gj(µ)/∂µT)T
            = (∂gi(µ)/∂µT) Σ (∂gj(µ)/∂µT)T

Now let (∂g(µ)/∂µT) denote the u × p matrix with ith row equal to ∂gi(µ)/∂µT. That is, (∂g(µ)/∂µT) has (i, j)th element ∂gi(X)/∂Xj evaluated at X = µ.

Then the results above can be summarized as

E(Y) ≈ g(µ),   var(Y) ≈ (∂g(µ)/∂µT) Σ (∂g(µ)/∂µT)T


Now let Xn denote a p-dimensional sequence of random vectors such that

√n(Xn − µ) →d Np(0, Σ(µ))

Let g(Xn) = (g1(Xn), . . . , gu(Xn))T be a u-vector-valued function admitting the following expansion:

gi(x) = gi(µ) + ∑_{j=1}^{p} (xj − µj) [∂gi(x)/∂xj evaluated at x = µ] + O(||x − µ||),   i = 1, . . . , u

Then the δ-method’s distributional result is

√n (g(Xn) − g(µ)) ∼ Nu( 0, (∂g(µ)/∂µT) Σ(µ) (∂g(µ)/∂µT)T ), asymptotically.

• Again, ∂g(µ)/∂µT and Σ can be evaluated at Xn to obtain a consistent estimate for the asymptotic variance of g(Xn) for large-sample confidence regions and hypothesis tests.


δ−Method for log-odds ratio:

The MLE of the odds ratio in the 2 × 2 table is

θ̂ = (p11 p22)/(p12 p21).

Recall that θ is not symmetric around the independence value of 1. Furthermore, θ̂ is a multiplicative function of the pij's.

In contrast,

log(θ̂) = log p11 + log p22 − log p12 − log p21

is symmetric around zero and is an additive function. The result is that log(θ̂) converges in distribution to normality much faster than θ̂.

• For this reason, inference concerning the odds ratio is often done through the log-odds ratio. For example, we construct a 95% CI for log(θ) and then exponentiate the endpoints to obtain a 95% interval for θ.

Let p = (p11, p12, p21)T and π = (π11, π12, π21)

T . Notice that log(θ) canbe written as a function of p:

log(θ) = log p11 + log(1− p11 − p12 − p21)− log p12 − log p21 = g(p)

where √n(p− π)

d−→ N(0,Σ(π))

and

Σ(π) = diag(π)− ππT =

π11(1− π11) −π11π12 −π11π21

−π11π12 π12(1− π12) −π12π21

−π11π21 −π12π21 π21(1− π21)

46

Page 47: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Therefore, we can apply the multivariate δ−method. In the case of the

log odds ratio, g(p) = log(θ) is a scalar-valued function, so u = 1 and

∂g(π)

∂πT=

(∂g(π)

∂π11,∂g(π)

∂π12,∂g(π)

∂π21

)=

(1

π11− 1

π22,− 1

π22− 1

π12,− 1

π22− 1

π21

)=

(π22 − π11

π11π22,−π12 − π22

π12π22,−π21 − π22

π21π22

)

It follows that

√nlog(θ)− log(θ) d−→ N

(0,

(∂g(π)

∂πT

)Σ(π)

(∂g(π)

∂πT

)T)

or, for large n,

log(θ).∼ N

(log(θ),

1

n

(∂g(π)

∂πT

)Σ(π)

(∂g(π)

∂πT

)T)

The asymptotic variance of log(θ) can be consistently estimated by sub-stituting p for π which leads to an estimated asymptotic variance of

1

n

(∂g(π)

∂πT

∣∣∣∣π=p

)Σ(p)

(∂g(π)

∂πT

∣∣∣∣π=p

)T

=1

n

(p22 − p11p11p22

,−p12 − p22

p12p22,−p21 − p22

p21p22

)

×

p11(1− p11) −p11p12 −p11p21−p11p12 p12(1− p12) −p12p21−p11p21 −p12p21 p21(1− p21)

p22−p11

p11p22−p12−p22

p12p22−p21−p22

p21p22

=

1

n11+

1

n12+

1

n21+

1

n22(after some algebra).

• ⇒ For large n, log(θ) is approximately normally distributed with

asymptotic standard error (ASE)(

1n11

+ 1n12

+ 1n21

+ 1n22

)1/2.

47

Page 48: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Oral Contraceptive Use and Heart Attack:

Recall the results of this case-control study:

Contraceptive Use

HeartAttack

Yes No

Yes 23 34 57

No 35 132 167

58 166 224

which yielded θ = 2.551 ⇒ log(θ) = log(2.551) = 0.937.

The ASE for log(θ) is

(1

23+

1

34+

1

35+

1

132

)1/2

=√.109 = .330

So an approximate (large-sample) 95% confidence interval for log(θ)is given by

0.937± 1.96(.330) = (0.289, 1.584)

which leads to an approximate 95% CI for θ given by

(e0.289, e1.584) = (1.336, 4.873)

• See oralcon.sas.

• Agresti recommends adding .5 to each cell of the table to improvethe estimator of θ and to improve the estimator of its asymptoticstandard deviation.

• Inference for relative risk, risk difference, gamma, other measuresof association is conducted using the same approach that we haveillustrated here with θ.

48

Page 49: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Fisher’s Exact Test

In addition to the Poisson, multinomial and product multinomial samplingmodels, occasionally a two-way contingency table arises in such a way sothat both sets of margins are fixed. In this case the appropriate samplingis based on the hypergeometric distribution.

Recall the hypergeometric: Consider a population of n red and black balls,of which n1 are red and n2 = n− n1 are black. Suppose a random sampleof size m is drawn from the population. Let X be the number of red ballsin the sample. Then X is a random variable whose possible values are0, 1, 2, . . . ,m. The probability function of X is given by

Pr(X = x) =

(n1

x

)(n−n1

m−x

)(nm

)This situation can be summarized by a 2× 2 table:

Ball Color

SampledYes No

Red X n1 −X n1

Black m−X n2 − (m−X) n2

m n−m n

• Notice here that the margins of the table are fixed by the populationdistribution of red vs. black (row margins) and the sample size m(column margins).

49

Page 50: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Therefore, in a 2× 2 table like so:

n11 n12 n1+

n21 n22 n2+

n+1 n+2 n

with fixed row and column margins, n11 (any cell count actually) followsa hypergeometric distribution with probability function(

n1+

n11

)(n2+

n+1−n11

)(n

n+1

) , n11 = m−, . . . ,m+

where

m− = max(0, n1+ + n+1 − n), m+ = min(n+1, n1+).

(n11 ≥ n1++n+1−n iff n ≥ n1++n+1−n11, so this condition just ensuresthat n11 is not so small that it forces n22 to be negative to satisfy the fixedmargins.)

Example — Fisher’s Tea Taster

R.A. Fisher motivated his exact test with the following, now fa-mous, example. A British woman of Fisher’s acquaintance claimedto be able to distinguish between tea with milk that was preparedby adding milk to the cup first and tea with milk that was preparedby adding tea to the cup first. To test her claim, she was given 8cups of tea, in four of which milk was added first. Since she knewthat there were four cups of each type, she made four guesses of eachtype. The results of this experiment appear below.

50

Page 51: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

True Poured First

GuessPouredFirstMilk Tea

Milk 3 1 4

Tea 1 3 4

4 4 8

The question of interest here is whether or not there is a positiveassociation between the row and column variables (between the truthand her guesses).

That is the question of interest can be expressed as a choice betweenthe following hypotheses on the odds ratio θ:

H0 : θ = 1 vs. HA : θ > 1

In this one-sided alternative situation, the p−value is defined to bethe probability of observing a result at least as extreme of moreextreme in the direction of HA computed under H0.

Under H0, the observed result, θ = 9 is extreme in the sense thatthe woman guessed surprisingly well. The probability of the observedresult is

Pr(θ = 9) = Pr(n11 = 3) =

(43

)(41

)(84

) = 0.229

The observed result is that she guessed 3 out of 4 of each type cor-rectly. The only more extreme result would be that she guessed 4out of 4 correctly. I.e., the only more extreme result is n11 = 4. Thisresult has probability of occurring under H0 given by

Pr(n11 = 4) =

(44

)(40

)(84

) = 0.014

Therefore, an exact p−value for our test is p = .229 + .014 = .243.⇒ there is insufficient evidence to support the woman’s claim.

51

Page 52: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

• Notice that the discreteness of the data allows only a few exactp−values to occur. In particular, the only result that would havebeen conclusive evidence against H0 at α = .05 would have been ifshe had been 100% accurate in her guesses.

• In the example above, the discreteness of the null distribution makesit impossible to achieve a significance level of α = 0.05; if we testat α = 0.05, we only reject when n11 = 4, which happens withprobability 0.014 under H0. So the type I error is .014, not 0.05, andwe end up with a conservative test.

• There are several ways to try to “fix” this problem, but a simple ap-proach to reduce the conservatism is to use the mid-p-value ratherthan the p-value. For a one-sided alternative, the mid-p-value isdefined to be one half the observed data probability plus the proba-bilities of all results more extreme than that observed.

– For the lady tasting tea example, the mid-p-value is

1

2Pr(n11 = 3) + Pr(n11 = 4) =

1

20.2286 + 0.014 = 0.129.

• Note that the mid-p-value is not guaranteed to yield Type I errorrate ≤ α, but it usually does, and tends to perform much better inpractice than the ordinary p-value for highly discrete distributions.

• See also teadrinker.sas.

In the previous example both margins were fixed by design, providing aclear justification for Fisher’s exact test.

More generally, however, Fisher’s exact test is often used under multino-mial, product multinomial, and Poisson sampling models by basing thesignificance of a test of association on the conditional distribution of thetest statistic given both of the margins rather than on the unconditionaldistribution of the test statistic.

It is straight-forward to show that if we assume any one of the threeusual sampling models and then condition on both margins, we obtain thehypergeometric distribution for the conditional distribution of nij.

52

Page 53: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Why would we want to use the conditional distribution rather thanthe unconditional distribution?

1. Other methods for testing association we have discussed (e.g., X2,G2 tests of independence) are asymptotic, and the accuracy of theχ2 approximations to the distributions of these test statistics maybe poor when table contains small cell counts.

⇒ want an exact distributional result rather than a large-smaple approximation in small cell count situations.

2. The exact (unconditional) distribution is desirable, but it is usuallyunavailable because it depends upon unknown nuisance parameters.

– A standard way of eliminating nuisance parameters in statis-tical inference is to condition on sufficient statistics for them.In two-way tables this leads to conditioning on the margins.

3. Basing inference on the conditional distribution rather than the marginaldistribution typically involves some loss of information. However,often this loss of information is small (margins contain little infor-mation about the association in a table) and this disadvantage is thelesser of two evils compared with using an inaccurate asymptoticapproximation.

Fisher’s exact test generalizes from the 2×2 case to the I×J case. In theI × J case, the conditional distribution of the cell counts nij given thecell margins is the multiple hypergeometric distribution:

(∏

i ni+!)(∏

j n+j !)

n!∏

i

∏j nij !

and p−values can be obtained by summing multiple hypergeometric prob-abilities for tables at least as extreme as the one obtained.

• We will discuss conditional vs. unconditional inference including thetopic of exact inference more thoroughly later in the course.

53

Page 54: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Statistical Modelling

Questions:

(i.) What is the purpose of the analysis?

– Is it probabilistic or descriptive?

Descriptive – methods in which probability models don’t ex-plicitly enter into the analysis

– Tabulations of means, quantiles, etc.– Graphical representations (histograms, boxplots, scatter-

plots, etc.).

(ii.) Which is the response variable and which the explanatory variable?Is such a distinction appropriate?

– Are we trying to predict?– Are we trying to describe dependence of Y (’s) on X(’s)?– Are we trying to describe interrelationships among Y (’s)?

(iii.) Are outliers present?

– graphical techniques, influence diagnostics (residuals, lever-ages, Cook’s distance, etc.)

– can outliers be omitted?– analyze with and without outliers?

(iv.) Are the observations independent?

– How were the data collected? (through time? clustering? re-peated measures?)

– unaccounted for correlation among observations can lead toinvalid estimates of error and incorrect inferences.

54

Page 55: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

(v.) Is transformation of the data appropriate?

– Under given scale do model assumptions hold? (E.g., constantvariance? normality? additive error?)

(vi.) How complex is the problem?

– often the problem appear most complex initially but simplifiesafter further inspection

– very often its useful to think about the problem in terms ofsimplest (STAT6210, STAT6220) methods first.

(vii.) Is more than one source of variability present?

– random effects vs. fixed effects?(are factor levels observed of interest in themselves or as arepresentative random sample from some population of levelsof interest?)

Models

• Often data can be thought of as signal and noise.

Transmission and reception of information (e.g., radio) involves amessage (signal) which is distorted by static (noise, or error).

signal – deterministicnoise – random

• A statistical model of the data accounts for both signal and noise

– (the latter makes it probabilistic).

• Often the signal and noise metaphor is right on the mark. In othercases the signal component is a mathematical description of the mainfeatures, noise component is “everything else” (unexplained compo-nent).

• Many models (parametric models) involve unknown constants calledparameters.

– In such cases, “Fitting the model” amounts to estimating theparameters of the model.

55

Page 56: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Models should be

1. As simple as conveniently possible (parsimony)

2. Valid over an appropriate range of conditions (scope, or generaliz-ability)

3. Consistent with the data.

• Items 1–3 above are often opposed to one another

– By including enough parameters, we can make a model fit thedata as well as we please, but we sacrifice parsimony, general-izability.

Generalized Linear Models (GLMs) are the main subject of thiscourse. They are generalizations of:

Classical Linear Models (CLMs)

Y : Response Variable

X1, . . . , Xp: Explanatory variables.

We typically observe n independent copies (a random sample):

Yi

Xi1, . . . , Xip, i = 1, . . . , n the random variables

yixi1, . . . , xip

, i = 1, . . . , n the observations

In the CLM (and most other “regression” contexts, including GLMs), theexplanatory variables are fixed by the design governing the data collection,or, if not fixed by design, the explanatory variables are conditioned on.That is, we seek to describe, or make inference about the conditionaldistribution of the response given that the explanatory variables are equalto their observed values.

• Thus in all that follows (unless explicitly stated otherwise) the ex-planatory variables can be regarded as non-random.

56

Page 57: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Model:

Yi = β1Xi1 + β2Xi2 + · · ·+ βpXip + ei, i = 1, . . . , n,

where ei is a random variable with

1. E(ei) = 0 for all i.2. var(ei) = σ2 (constant) for all i.3. e1, . . . , en are i.i.d.

• Our model says that E(Yi) = µi is a linear function of the β’s (forall i).

Matrix Notation:

Y = Xβ + e,

where

Y =

Y1

Y2...Yn

, X =

1 X12 · · · X1p

1 X22 · · · X2p

......

. . ....

1 Xn2 · · · Xnp

, β =

β1...βp

, e =

e1e2...en

X is known as the design matrix or model matrix.

Most common examples:

1. Regression2. Analysis of variance (fixed effects only).3. Analysis of covariance.

57

Page 58: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Benefits of CLMs

1. Model is very flexible.

2. Easy to fit.

OLS estimate of β (denoted β) can be found by solving

(XTX)β = XTY (normal equations)

If (XTX)−1 exists, β is given by

β = (XTX)−1XTY

which is the B.L.U.E. according to the Gauss-Markov Theorem (ir-respective of the distribution of the ei’s).

3. Model is relatively easy to interpret.

Limitations of CLMs

1. E(Y) = µ = Xβ is unlimited in range, but in many problems therange is restricted. E.g., when Y is a vector of binary responses, µis a vector of probabilities which should be in [0, 1].

2. Most inference (C.I.’s, hypothesis tests) in CLMs assumes a normaldistribution for ei’s.

3. Additive error.

GLMs (Nelder and Wedderburn, 1972, JRSSB) solve problems 1–3.

58

Page 59: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Examples of GLMs:

a. Suppose we have Bernoulli responses: Yi =01.

⇒ µi = E(Yi) = Pr(Yi = 1) ∈ [0, 1]var(Yi) = µi(1− µi) (not constant)

Suppose we have a covariate X (e.g., dose level) and response Y(e.g., live(0)/die(1)). We’d like to model the mean response (probabilityof death) as a function of dose.

One possibility: A CLM with µ = β1 + β2X.

With this linear model we run into problem 1:

For especially large or small values, we will be predicting µ to be outsideof [0, 1].

Solution: use a nonlinear model where we map β1 + β2X into [0, 1].

E.g., µ = F (β1 + β2X) where F is some c.d.f.

Choices:

1. F = Φ (standard normal c.d.f.)

⇒ µ = Φ(β1 + β2X)

⇒ Φ−1(µ) = β1 + β2X

which is known as the Probit Model.

59

Page 60: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

• Notice that this model is nonlinear in a very specific way. We mapa linear function of the β’s (β1 + β2X) into the permissible range ofµ via a nonlinear function (in this case Φ).

• We call the linear function of the β’s the linear predictor of theGLM and it is denoted η (here, η = β1 + β2X).

• We call the function that relates µ to η the link function (here itis Φ−1).

2. The unit logistic c.d.f.

The logistic distribution has density function

fY (y;µ, σ) =exp(y − µ)/σ

σ1 + exp(y − µ)/σ2

for −∞ < y < ∞, where µ, σ are location and scale parameters,respectively.

The unit logistic distribution has µ = 0, σ = 1, and has c.d.f. F (x) =ex

1+ex . This gives

µ =exp(β1 + β2X)

1 + exp(β1 + β2X).

Solving for η = β1 + β2X, we get

log

1− µ

)︸ ︷︷ ︸“logit” link

= β1 + β2X = η

• This model is called a logistic regression model.

Note that here and in the previous probit example we haven’t fully specifythe model yet, just the “signal” part. Our models must also account fornoise as well.

• In the CLM we do this by specifying a distribution for an error term.This is an indirect way of implying a distribution for Y .

• In a GLM we drop the device of an error term and specify the dis-tribution for Y directly.

60

Page 61: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Both the probit and logistic models are completed by specifying the errordistribution to be Bernoulli.

b. Another example for 0,1 responses:

Suppose that the random variable of interest, Z, is Poisson withmean λ. However, we don’t observe Z, only whether or not Z = 0.That is, we observe

Y =

0, if Z = 01, if Z > 0

• For example, Z might be the number of infected samples, but becauseseveral samples were combined before presence/absence of infectionwas measured, we only observe Y .

If we had Z ∼ Poisson(λ), we might model λ with a loglinear modelrelating λ to some linear predictor η involving relevant covariates:

log(λ) = β1 + β2X2 + · · ·+ βpXp (∗)

• Why loglinear? Because with log(λ) = η or, equivalently, λ = eη wemap the unconstrained η to [0,∞) with the exponential function.

We don’t have Z, though, we only observe Y . What do we know aboutY ?:

µ = E(Y ) = Pr(Y = 1) = 1− Pr(Z = 0)︸ ︷︷ ︸a Poisson Probability

= 1− e−λ

Solving for λ,− log(1− µ) = λ

orlog(− log(1− µ)) = log(λ)

Adding (*) we get

log(− log(1− µ))︸ ︷︷ ︸complementary log-log link

= β1 + β2X2 + · · ·+ βpXp = η

61

Page 62: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

• This model comes up in Dilution Bioassay.

Example

• Suppose that there is an unknown concentration ρ0 of an organismin solution.

• To determine ρ0 we proceed by diluting the solution in powers of 2,so that the xth dilution has concentration

ρx =ρ02x

, x = 1, 2, . . . , n.

• After each dilution we examine a plate treated with the solution. Ifplate is streaked ⇒ at least one organism is present.

In this example there is an underlying random variable Z which is a Poissoncount of the number of species present that varies with x so that thePoisson mean depends on x. Instead of observing Z, though, we observe

Y =

0, Z = 0 (plate is not streaked)1, Z ≥ 1 (plate is streaked)

Let µ = Pr(plate is streaked). Then

µ = Pr(Y = 1) = 1− Pr(Z = 0) = 1− e−ρx

⇒ log(− log(1− µ)) = log(ρx) = log(ρ0)︸ ︷︷ ︸β1

− (log 2)︸ ︷︷ ︸β2

x = η

• Therefore we can estimate ρ0 as eβ1 where β1 is the intercept from asimple regression of Y on X using a complementary log-log link andBernoulli error distribution.

62

Page 63: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

3. Log-linear Models

Suppose we measure two categorical variables, one with r levels and theother with c levels, on each of n subjects where n is fixed (non-random).

We may summarize the resulting data (n bivariate responses) in a two-waycontingency table or cross-classification:

Here, Yij = the count of how many of the n subjects responded withcategory i on the first variable and category j on the second variable.

• With n fixed, it is reasonable to assume that the Yij ’s are multinomialwith probabilities πij , i = 1, . . . , r, j = 1, . . . , c, and number of trialsn. In this case

E(Yij) = nπij

Under an assumption that the two variables are independent, πij = πi·π·jwhere πi· is the marginal probability of responding with category i tovariable 1 and π·j is the marginal probability of responding with categoryj to variable 2.

⇒ µij = E(Yij) = nπi·π·j

⇒ log(µij) = log n+ log πi· + log π·j = ηij

We can rewrite this model in the GLM form as

logµij = β1 + β21i=1 + β31i=2 + · · ·+ βr+11i=r

+ βr+21j=1 + · · ·+ βr+c+11j=c

and we can fit the model with log link and multinomial error distribution.

• We’ll see later that we can obtain the same fit by using the Poissondistribution in place of the multinomial as the error distribution.

63

Page 64: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Components of a GLM

In a CLM we assume:

Yi = β1Xi1 + β2Xi2 + · · ·+ βpXip + ei, i = 1, . . . , n

e1, . . . , eniid∼ 0, σ2(N(0, σ2), if we want inference, ML)

Here there are three parts:

Sytematic Component: ηi = β1Xi1 + · · · + βpXip or ηi = xTi β in

matrix notation.

Random Component: Y1, . . . , Ynind∼ N(µi, σ

2) (usually assumedthrough the ei’s)

In addition we assume µi = ηi ∀i.

More generally, in GLMs we have

1. Systematic Component:ηi = xT

i β

2. Random Component: Yi’s are independent r.v.’s each with E(Yi) =µi and each with density

fY (yi; θi, ϕ) = exp

yiθi − b(θi)

ai(ϕ)+ c(yi, ϕ)

, (∗)

Here, ϕ is a (constant) scale parameter (typically a nuisance param.)and θi is a location parameter (typically of interest) and can beexpressed as some function of the mean, µi.

• (*) denotes the Exponential Dispersion (E.D.) Family of distribu-tions

– A generalization of the linear exponential family.

64

Page 65: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

How generalized?

• If ϕ is known, then (*) is the linear exponential family with canonicalparameter θi.

• If ϕ is unknown, (*) may or may not be a 2-parameter exponentialfamily.

In GLMs there is a third component:

3. The link between the R.C. and the S.C.:

g(µi) = ηi

All we require of g is that it be one-to-one and differentiable.

Examples of E.D. Family Distributions:

1. Normal Distribution:

f(yi;µi, σ2) =

1√2πσ2

exp

−(yi − µi)

2

2σ2

= exp

[yiµi −

µ2i

2

]1

σ2− y2i

2σ2− 1

2log(2πσ2)

Here, θi = µi, ϕ = σ2, b(θi) = θ2i /2, ai(ϕ) = ϕ, and c(yi, ϕ) =−[y2i /ϕ+ log(2πϕ)]/2.

• Notice, that (for example) b(θi) is a function of θi, so we noticeb(θi) = µ2

i /2 and then make it a function of its argument by usingthe relationship between θi and µi.

65

Page 66: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

2. Gamma Distribution

f(yi;µi, ν) =

µi

)νyν−1i e−νyi/µi

Γ(ν)

= exp

[− yiµi

− logµi

]ν + (ν − 1) log(yi) + ν log ν − log Γ(ν)

Here, θi = −µ−1

i , b(θi) = log(−θi), ϕ = ν, ai(ϕ) = ϕ−1 andc(yi;ϕ) = (ϕ− 1) log yi + ϕ log ϕ− log Γ(ϕ).

3. Poisson Distribution

f(yi;µi) =µyi

i e−µi

yi!

= exp yi logµi − µi − log(yi!)

Here, θi = log µi. b(θi) = eθi , ϕ = 1, ai(ϕ) = ϕ, and c(yi, ϕ) =− log(yi!).

4. Binomial Distribution

f(yi;πi) =

(ni

yi

)πyi

i (1− πi)ni−yi

= exp

yi log

(πi

1− πi

)+ ni log(1− πi) + log

(ni

yi

)

Here, θi = log(

πi

1−πi

), b(θi) = ni log(1 + eθi ), ϕ = 1, ai(ϕ) = ϕ, and

c(yi, ϕ) = log(ni

yi

).

66

Page 67: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Mean and Variance in E.D. Family

In the exponential dispersion family a special relationship exists betweenthe mean and variance. To demonstrate this relationship we first need toreview some results about likelihood and score functions.

Likelihood and Score Functions

Consider a single observation of a random variable Y which has densityfY (y; θ) which involves the scalar parameter θ.

Likelihood function: L(θ; y) = f(y; θ)

Log-likelihood function: ℓ(θ; y) = logL(θ; y)

The first derivative of ℓ with respect to θ,

U = U(θ) =∂ℓ(θ; y)

∂θ=

∂f(y; θ)/∂θ

f(y; θ)

is called the score function (or, sometimes, the efficient score function).

Results:*

1. E(U) = 0

2. var(U) = E(U2) = −E(∂U/∂θ).

Proofs:

1.

E(U) =

∫∂ℓ(θ; y)

∂θf(y; θ)dy, def. of exp. val.

=

∫∂f(y; θ)/∂θ

f(y; θ)f(y; θ)dy =

∫∂f(y; θ)

∂θdy

∗=

∂θ

∫f(y; θ)dy assuming reg. conditions to switch diff. and int.

=∂

∂θ1 = 0

* Under mild regularity conditions that hold for E.D. family

67

Page 68: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

2. var(U) = E(U2)− [E(U)]2︸ ︷︷ ︸

=0

= E(U2). In addition,

∂U

∂θ=

∂2ℓ(θ; y)

∂θ2=

∂2f(y;θ)∂θ2 f(y; θ)−

(∂f(y;θ)

∂θ

)2[f(y; θ)]

2

So, using the definition of expected value,

−E

(∂U

∂θ

)= −

∫ ∂2f(y;θ)∂θ2 f(y; θ)−

(∂f(y;θ)

∂θ

)2[f(y; θ)]

2

f(y; θ)dy

= −∫

∂2f(y; θ)

∂θ2dy +

∫(∂f(y; θ)/∂θ)

2

f(y; θ)dy

∗= − ∂2

∂θ2

∫f(y; θ)dy︸ ︷︷ ︸=0

+

∫ [∂f(y; θ)/∂θ

f(y; θ)

]2︸ ︷︷ ︸

=U2

f(y; θ)dy

= E(U2)

So we’ve shown that var(U) = −E(

∂2ℓ∂θ2

)≡ I(θ), which is known as

the Fisher information about θ contained in y, or, more simply,the information.

68

Page 69: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Example – N(µ, σ2), µ unknown, σ2 known:

f(y;µ) =1√2πσ2

exp

−(y − µ)2

2σ2

⇒ ℓ(µ; y) =

yµ− µ2/2

σ2+ stuff not involving µ

It follows that

U =∂ℓ(µ; y)

∂µ= (y − µ)/σ2

⇒ E(U) = (E(y)− µ) /σ2 = (µ− µ)/σ2 = 0

and

I(µ) = −E

(∂2ℓ

∂µ2

)= −E

(∂

∂µ

(y − µ

σ2

))= −E

(− 1

σ2

)=

1

σ2

and

var(U) = var

(y − µ

σ2

)= var

( y

σ2

)=

1

σ4var(y) =

1

σ2

69

Page 70: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Now we’ll use results 1 and 2 about score functions to reveal the relation-ship between the mean and variance in E.D. family.

For E.D. family,

ℓ(θi, ϕ; yi) =yiθi − b(θi)

ai(ϕ)+ c(yi, ϕ)

⇒ Ui =∂ℓ

∂θi=

yi − ∂b(θi)/∂θiai(ϕ)

E(Ui) = 0 then implies E(yi) = ∂b(θi)/∂θi or

µi = b′(θi)

In addition,

∂Ui

∂θi= −∂2b(θi)/∂θ

2i

ai(ϕ)and var(Ui) =

var(yi)

a2i (ϕ)

So, var(Ui) = −E(

∂Ui

∂θi

)implies

var(yi)

a2i (ϕ)=

∂2b(θi)/∂θ2i

ai(ϕ)

⇒ var(yi) =∂2b(θi)

∂θ2iai(ϕ) =

(∂

∂θiµi

)ai(ϕ)

⇒ var(yi) = v(µi)ai(ϕ)

where v(µi) = ∂µi/∂θi = b′′(θi) is known as the variance function.

• We’ve established that in a GLM, the E.D. family distributionalassumption implies that the response variance is a function of theresponse mean.

– Note that var(yi) may be a function of the mean in a trivialway, as in the Normal distribution where v(µ) = 1, ϕ = σ2,ai(ϕ) = ϕ so var(yi) = (1)σ2.

70

Page 71: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

In most GLMs, ai(ϕ) is of the form

ai(ϕ) = ϕ/wi

where wi is a known prior weight specific to the ith observation.

• In what follows, we will restrict attention to this special case.

• Notice that ai(ϕ) = ϕ/wi implies

var(yi) = ϕv(µi)/wi

• We will sometimes use the notation

y ∼ ED(µ, ϕ,w)

to denote that y has an E.D. distribution with mean parameter µ =µ(θ), scale parameter ϕ and known weight w. (Note that only µ andϕ are unknown parameters).

Example – Binomial Distribution:

Recall (p.66) that we were able to write the binomial frequency func-tion in the following form:

f(yi;πi) = exp

yi log

(πi

1− πi

)+ ni log(1− πi) + log

(ni

yi

)(∗)

and we said that this was an example of an E.D. distribution with θi =

log(

πi

1−πi

), b(θi) = ni log(1+eθi ), ϕ = 1, ai(ϕ) = ϕ, and c(yi, ϕ) = log

(ni

yi

).

• This is correct, but it is slightly inconvenient for b(θi) to dependupon ni as well as θi.

– Notice that this inconvenience disappears when ni = 1, that iswhen yi is Bernoulli.

71

Page 72: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

• When ni > 1, another way to think about the binomial distribu-tion’s relationship to the exponential dispersion family is not yi ∼ED(µi, ϕ = 1, wi = 1) where b(θi) = ni log(1+ eθi ) and µi = b′(θi) =nie

θi/(1 + eθi) = niπi, but instead yi/ni ∼ ED(µi, ϕ = 1, wi = ni),where b(θi) = log(1 + eθi ) and µi = b′(θi) = eθi/(1 + eθi) = πi.

Notice we can rewrite the RHS of (*) as

exp

[yini

log

(πi

1− πi

)+ log(1− πi)

]ni + log

(ni

yi

)

where now θi = log(

πi

1−πi

), b(θi) = log(1 + eθi ), ϕ = 1, wi = ni,

ai(ϕ) = ϕ/wi, and c(yi, ϕ) = log(ni

yi

).

Variance Functions:

Distribution Variance Function

Poisson v(µ) = µ = eθ

Binomial v(µ) = nµ(1− µ) = neθ/(1 + eθ)2

Binomial/n v(µ) = µ(1− µ) = eθ/(1 + eθ)2

Normal v(µ) = 1Gamma v(µ) = µ2 = (−1/θ)2

72

Page 73: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Example – Binomial Distribution

Let Y = the number of successes out of n trials, so Y ∼ Bin(n, π), whereπ = probability of success on each trial, µ = E(Y ). What is µ?

From ED distributional form,

θ = log

1− π

)⇒ π =

1 + eθ

and b(θ) = n log(1 + eθ), a(ϕ) = ϕ = 1 so

µ = b′(θ) = neθ

1 + eθ= nπ

and

var(Y ) = a(ϕ)b′′(θ) = neθ

1 + eθ1

1 + eθ= nπ(1− π)

Another Example

Now let P = Y/n = the proportion of successes, so P ∼ (1/n)Bin(n, π).Then

f(p;π) = Pr(P = p|n, π) = Pr(Y = np|n, π) =(n

np

)πnp(1− π)n−np

= exp

np log

(πi

1− πi

)+ n log(1− π) + log

(ni

yi

)

where θ = log(

π1−π

), b(θ) = log(1− π) = log(1 + eθ), and a(ϕ) = ϕ/w =

1/n. So,

µ = E(P ) = b′(θ) =eθ

1 + eθ= π,

and

var(P ) = a(ϕ)b′′(θ) =1

nπ(1− π)

73

Page 74: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Link Functions:

When considering models for binary random variables, we considered in-verse c.d.f. links. Why?

For a binary random variable Y we want to map the real line (range ofη = xTβ) to [0, 1], the range of µ = E(Y ) (a probability).

• All c.d.f.’s do just that.

So, it made sense to consider models of the form

µ = F (η)

for some c.d.f. F , or equivalently,

F−1(µ) = η

1. Logit link (inverse unit logist c.d.f.) log(

µ1−µ

)2. Probit link (inverse std. normal c.d.f.) Φ−1(µ) = η.

3. Complementary log-log link log(− log(1− µ)) = η where µ = proba-bility of success. This model is equivalent to the model with log-loglink, log(− log(µ)) = η where now µ = probability of failure. Thelog-log link is the inverse c.d.f. for the extreme value or Gumbeldistribution.

For count data the inverse link function should again map the range of η(−∞,+∞) to the range of µ which is now [0,∞). A natural choice is theexponential function:

µ = eη, or, equivalently, log(µ) = η

74

Page 75: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

For each error distribution a variety of link functions are possible beyondthose mentioned above. However, for each error distribution there is onelink that is particularly convenient: the canonical link function.

• Canonical links are the links for which θ = η.

• For canonical links there exists a sufficient statistic equal in dimen-sion to β.

Both of these properties make models with canonical links mathematicallyconvenient.

Canonical Links:

Error Distribution Canonical Link

Normal g(µ) = µPoisson g(µ) = log µBinomial/n g(µ) = log(µ/(1− µ))Gamma g(µ) = µ−1

• For canonical links the sufficient statistic is XT︸︷︷︸p×n

Y︸︷︷︸n×1

a p× 1 vector

with components∑n

i=1 xijYi, j = 1, . . . , p.

• Canonical links are convenient, but that alone is not enough to dic-tate their use. Choice of link should be made on

1. model fit2. model interpretability3. whether or not link is canonical (convenience)

75

Page 76: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Grouped vs. Ungrouped Data

In the CLM setting we typically consider ungrouped data where (yi,xi) isthe original observation on unit i

Unit 1...

Unit i...

Unit n

, y =

y1...yi...yn

, X =

x11 · · · x1p

......

...xi1 · · · xip

......

...xn1 · · · xnp

• For example, deaths in a clinical trial: we observe the binary (0=dead,1=alive)variables y1, . . . , yn on the n subjects participating in the trial.

If some of the rows of the X matrix are identical, it is often convenient togroup the data.

In this case index i refers to the ith group of units with the same covariatevector. We also record ni = the number of units in the ith group.

Group 1...

Group i...

Group g

,

n1...ni...ng

, y =

y1...yi...yn

, X =

x11 · · · x1p

......

...xi1 · · · xip

......

...xg1 · · · xgp

• Here we analyze yi the i

th group mean (proportion dead, in example).

– Alternatively, we could analyze the ith group sum (numberdead, in example).

If ungrouped data (yi,xi), i = 1, . . . , n, are modelled with a GLM with

E(yi) = µi = g−1(xTi β), var(yi) = ϕv(µi) (wi = 1)

then the group averages are also in E.D. family so an equivalent GLMmodels the group averages with the same error distribution and

E(yi) = g−1(xTi β) (same mean)

and var(yi) = ϕv(µi)/ni (wi = ni)

76

Page 77: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Why look at grouped data?

1. Some statistical procedures are valid only for grouped data.

• E.g., goodness of fit tests like the deviance, Pearson chi-squarestatistics, inappropriate for binary and count data that can’tbe grouped.

2. Two types of asymptotics are available for grouped data: i.) asymp-totics as ni → ∞ ∀i = 1, . . . , g or ii.) asymptotics as n → ∞without requiring ni → ∞ ∀i. In ungrouped data asymptotics arenecessarily as n → ∞.

3. Computing and memory savings can be substantial with groupeddata.

Methods of Estimation in GLMs

1. Least Squares (OLS, WLS/GLS)2. Maximum Likelihood (ML)3. Conditional ML4. Quasi-likelihood (QL)5. Estimating Functions6. Other methods – Extended QL, Pseudo ML, Pseudo likelihood, etc.

1. Least Squares (OLS, WLS, GLS):

OLS

For a model of the form

Yi = hi(β1, . . . , βp) + ei, i = 1, . . . , n,

where hi’s are known functions of β = (β1, . . . , βp)T ∈ B and e1, . . . , en

satisty

i. E(ei) = 0ii. var(ei) = σ2 > 0 (constant)iii. cov(ei, ej) = 0 for i = j.

we would like to choose β so that hi(β) is close to Yi for all i.

77

Page 78: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

One approach: Choose β to minimize the sum of squared deviationsbetween the Yi’s and the hi’s.

Let S(β) =

n∑i=1

(Yi − hi(β)︸ ︷︷ ︸=µi

)2 (Least Squares Criterion)

Estimate β by minimizing S(β).

If the hi’s are differentiable and B is open, then β must satisfy

∂S(β)

∂βj= 0, j = 1, . . . , p,

which are called the normal equations.

∂S(β)

∂βj= −2

∑i

(Yi − hi(β))∂hi(β)

∂βj

78

Page 79: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

WLS

• The assumption of constant variance can be relaxed. This leads toweighted least squares (WLS).

AssumeYi = hi(β) + ei, i = 1, . . . , n, (∗)

where now ei’s satisfy

i. E(ei) = 0ii. var(ei) = σ2/wi (no longer constant, but instead changing with

a known weight wi).iii. cov(ei, ej) = 0 for i = j.

Model (∗) does not fit into the OLS framework. However, noticethat the following transformed model does:

√wiYi =

√wihi(β) +

√wiei︸ ︷︷ ︸=ei

This model has an error term ei which satisfies the OLS assumptions.Therefore we estimate β by minimizing

S(β) =∑i

(√wiYi −

√wihi(β))

2

=∑i

wi(Yi − hi(β))2 (the Weighted LS Criterion)

In matrix notation, we want

argminβ

(Y − h(β))TW(Y − h(β)),

where

W =

w1 0 · · · 00 w2 · · · 0...

.... . .

...0 0 · · · wn

where wi ∝ 1/var(yi).

79

Page 80: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

GLS

More generally, suppose V = var(Y) where now V is not necessarilydiagonal. (That is, let’s now relax assumption iii.)

UsingV−1 in place ofW we obtain theGeneralized Least SquaresCriterion, and the GLS estimators of β are given by

argminβ

(Y − h(β))TV−1(Y − h(β))

• If V is known, then βGLS is BLUE in the linear version of thismodel (i.e., when h(β) = Xβ), but it is typically necessary toestimate V to obtain the GLS estimator, in which case it isnot necessarily optimal in any sense.

• GLS is most useful in a linear model where h(β) = Xβ in

which case β is the solution of

(XTV−1X)β = XTV−1Y (the normal equations)

2. Maximum Likelihood:

Suppose we have a discrete random variable Y (possibly a vector)with observed value y. Suppose Y has frequency function f(y;θ),θ ∈ Θ.

The likelihood function, L(θ; y) is defined to equal the frequencyfunction (more generally, the density) but viewed as a function of θ,not y:

L(θ; y) = f(y;θ)

Therefore, the likelihood at θ0, say, has the interpretation

L(θ0; y) = Pr(Y = y when θ = θ0)

= Pr(observing the obtained data when θ = θ0)

Logic of ML: choose the value of θ that makes this probability largest

⇒ θ, the MLE.

80

Page 81: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

We use the same procedure when Y is continuous with density func-tion f(y;θ): maximize L(θ; y) = f(y;θ)

Often, our data come from a random sample so that we observe ycorresponding to Yn×1, a vector of independent r.v.’s. In this case

L(θ;y) =n∏

i=1

f(yi;θ)

Since its easier to work with sums than products its useful to notethat in general

argmaxθ

L(θ; y) = argmaxθ

logL(θ; y)︸ ︷︷ ︸≡ℓ(θ;y)

Therefore, we define a MLE of θ as a θ so that

ℓ(θ, y) ≥ ℓ(θ; y) ∀θ ∈ Θ

If Θ is an open set, then θ must satisfy (if it exists)

∂ℓ(θ)

∂θj= 0, j = 1, . . . , dim(θ)

or in vector form

∂ℓ(θ; y)

∂θ= 0, (the score equation)

• Difficulty: usually the score equation can’t be solved explicitly – insuch cases we need to use a numerical method (Newton-Raphson,Fisher-Scoring, etc.).

81

Page 82: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

In the GLM context, for a single observation

ℓ(θi; yi) = log f(yi; θi)

which we can write as a function of µi since θi = θi(µi):

ℓ(µi; yi) = log(f(yi; θi(µi)) =yiθi(µi)− b(θi(µi))

ai(ϕ)+ c(yi;ϕ)

For all n observations,

ℓ(µ;y) =

n∑i=1

(yiθi(µi)− b(θi(µi))

ai(ϕ)+ c(yi;ϕ)

)

Equivalent to maximizing ℓ(µ;y) with respect to µ, we can minimizethe scaled deviance:

D∗(y;µ) = 2 [ℓ(y;y)− ℓ(µ;y)]

Here, we have written the scaled deviance as a function of µ. Itsminimizing value w.r.t. µ (the MLE of µ) we denote µ.

For some types of GLMs the value of D∗ at µ for a given model givesan important measure of goodness of fit for that model.

Why? Because

D∗(y; µ) = 2 [ℓ(y;y)− ℓ(µ;y)]

compares the quality of the fit of the current model (given by ℓ(µ,y))with the quality of the fit of the best-fitting model possible, the modelwith µ = y (given by ℓ(y;y)).

82

Page 83: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Notice thatD∗(y; µ) = 2 log λ

where

λ =L(µmax;y)

L(µ;y), (likelihood ratio)

where µmax = y is the MLE under a model with no restrictions (thesaturated model), and µ is the MLE under the current model (whichcontains certain restrictions of the µi’s).

• λ is the likelihood ratio for comparing the current model withthe saturated model.

Notice in the CLM with normally distributed errors, and known σ2:

f(y;µ) =1

(2πσ2)n/2exp

−∑i

(yi − µi)2

2σ2

⇒ ℓ(µ;y) = −n

2log(2πσ2)−

∑i

(yi − µi)2

2σ2

andℓ(y;y) = −n

2log(2πσ2)− 0

⇒ D∗(y;µ) = 2 [ℓ(y;y)− ℓ(µ;y)] =1

σ2

∑i

(yi − µi)2

so that D ≡ σ2D∗ (the (unscaled) deviance) is identical to SSE .

• Therefore, in the CLM,

Minimum Deviance = ML = OLS

83

Page 84: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Asymptotics of ML

Inference in GLMs relies on the asymptotic properties of MLEs. Undercertain “regularity conditions” (see Fahrmeir and Tutz, §2.2.1) we have thefollowing results for a MLE β of β, the regression parameter in a GLM.

Asymptotic Existence and Uniqueness: The probability that β existsand is (locally) unique tends to 1 as n → ∞.

• For many important models, the log-likelihood ℓ(β;y) is concaveso that local and global maxima coincide. For strictly concave log-likelihoods the MLE is even unique whenever it exists.

Consistency: As n → ∞, βnp−→ β (weak consistency) and βn → β

with probability 1 (strong consistency). Here, βn denotes the sequence ofMLEs based on samples of size n.

Asymptotic Normality:

√n(βn − β)

d−→ N(0, nI(β)−1)

where I(β) denotes the Fisher information matrix. This result impliesthat for n large

β.∼ N(β, I(β)−1).

basic idea for proof: (one-dimensional case)

(This proof applies to any MLE β for some parameter β under the assumedregularity conditions, not just the MLE of a GLM regression parameter.See Fahrmeir and Kaufmann, 1985, for more details.)

84

Page 85: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Since βn maximizes ℓ(β) =∑n

i log f(yi;β) (I’ve dropped the argument

y here to simplify the notation), it follows that ℓ′(βn) = 0. By Taylor’sTheorem,

0 = ℓ′(βn) = ℓ′(β) + ℓ′′(β)(βn − β) +1

2ℓ′′′(β∗

n)(βn − β)2,

where β∗n is a point that lies between βn and β and which therefore goes

in probability to β (by the consistency of βn).

⇒ (βn − β) =−ℓ′(β)

ℓ′′(β) + 12ℓ

′′′(β∗n)(βn − β)

⇒√n(βn − β) =

− 1√nℓ′(β)

1nℓ

′′(β) + 12nℓ

′′′(β∗n)(βn − β)

Now let’s consider the terms of this ratio one at a time:

1√nℓ′(β) =

1√nU(β)

where U(β) =∑

∂∂β log f(yi;β) is the score function, a sum of independent

r.v.’s each with mean 0, where var(U) = I(β). Therefore, a CLT for non-identically distributed r.v.s yields

1√nU(β)

d−→ N(0, n−1I(β)) (CLT)

In addition, a weak law of large numbers gives

1

nℓ′′ =

1

n

∑ ∂2 log f(yi;β)

∂β2︸ ︷︷ ︸independent

p−→ − 1

nI(β) (its expected value)

85

Page 86: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

and because

1

nℓ′′′(β∗

n) = mean of n indep. r.v.’sp−→ some constant

and(βn − β)

p−→ 0 (consistency)

it follows that1

nℓ′′′(β∗

n)(βn − β)p−→ 0

So we now have that

√n(βn − β) =

1√nU(β)

− 1nℓ

′′(β) + op(1)

where1√nU(β)

d−→ N(0, n−1I(β))

and − 1nℓ

′′(β)p−→ n−1I(β). So, by Slutzky’s Theorem,

√n(βn − β)

d−→ nI−1(β)×N(0, n−1I(β)) as n → ∞= N(0, nI−1(β))

or βn.∼ N(β, I−1(β)) in large samples. .

• Note that I(β) can be estimated consistently by I(β).

• In addition, the expected information I(β) = −Eℓ′′(β) can bereplaced by the observed information, or negative Hessian, −ℓ′′(β),without changing the asymptotic result.

86

Page 87: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Important Property: Functional Invariance of MLE

Let Y have density f(y; θ), θ ∈ Θ and consider a reparameterization,ϕ = h(θ) where h is one-to-one, h : Θ → Φ.

If θ is a MLE of θ, then ϕ = h(θ) is a MLE of ϕ (MLE is invariant toparameterization).

Computation of MLE for β in GLMs: (Fitting GLMs)

Assume for now that X is of full rank (p).

Need to solve∂ℓ(β;y)

∂βj= 0, j = 1, . . . , p,

where

ℓ(β;y) =

n∑i=1

ℓi(β; yi) =

n∑i=1

yiθi(β)− b(θi(β))

ai(ϕ)+ c(yi, ϕ)

.

By the chain rule,

∂ℓi∂βj

=∂ℓi∂θi

∂θi∂µi

∂µi

∂ηi

∂ηi∂βj

=yi − b′(θi)

ai(ϕ)

1

b′′(θi)

1

g′(µi)xij

Let di = g′(µi) = ∂ηi/∂µi, ui = 1var(yi)d2

i

= 1ai(ϕ)b′′(θi)d2

i

. Then we can

write∂ℓ

∂βj=∑i

∂ℓi∂βj

=∑i

(yi − µi)uidixij

So, we need to solve

n∑i=1

(yi − µi)uidixij = 0 j = 1, . . . , p

where µi, ui, and di all depend on β. Notice that this is a nonlinearproblem for which (in general) no closed-form solution exists.

How do we solve such a problem?

87

Page 88: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Newton-Raphson: For f : R → R we want to solve f(x) = 0. Need tofind the x∗ satisfying f(x∗) = 0. We require |f ′(x∗)| > 0.

A Taylor series expansion of f(x∗) gives

0 = f(x∗) = f(x) + f ′(x)(x∗ − x) + · · ·

which implies that

x∗ ≈ x− f(x)

f ′(x)

for x close to x∗. This implies the iteration

x(m+1) = x(m) − f(x(m))

f ′(x(m))

or, in multiple dimension

x(m+1) = x(m) −[

∂x(m)Tf(x(m))

]−1

f(x(m)).

In our case, f is the score vector, ∂ℓ

∂β

⇒ β(m+1) = β(m) −

[(∂2ℓ

∂β∂βT

)β=β(m)

]−1(∂ℓ

∂β

)β=β(m)

88

Page 89: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

The Fisher scoring algorithm replaces the “observed information” ma-

trix − ∂2ℓ

∂β∂βTwith its expected value, the Fisher information matrix, giv-

ing the iteration

β(m+1) = β(m) +

[−E

(∂2ℓ

∂β∂βT

)β=β(m)

]−1(∂ℓ

∂β

)β=β(m)

or

β(m+1) = β(m) +[I(m)n

]−1

S(m), (∗)

where I(m)n , S(m) are the Fisher information matrix and score vector, re-

spectively, each evaluated at β = β(m).

• Fisher scoring is often used in place of Newton-Raphson because theexpected information is easier to compute than observed information.

Fisher scoring may be applied directly via (*) to fit GLMs. However, anequivalentmore convenient algorithm exists known as iteratively reweightedleast squares or IRLS.

89

Page 90: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

IRLS:

Pre-multiplying both sides of (*) by I(m)n we have

I(m)n β(m+1) = I(m)

n β(m) + S(m) (∗∗)

In addition, for GLMs In(β) has (j, k)th element

−E

[∂2ℓ

∂βj∂βk

]= E

[(∂ℓ

βj

)(∂ℓ

βk

)]= E

[(∑i

(yi − µi)uidixij

)(∑i

(yi − µi)uidixik

)]=∑i

E[(yi − µi)

2u2i d

2ixijxik

]since all cross products have mean 0 by the independence of the yi’s.Continuing,

−E

[∂2ℓ

∂βj∂βk

]=∑i

var(yi)u2i d

2ixijxik

=∑i

ai(ϕ)b′′(θi)u

2i d

2ixijxik =

∑i

uixijxik

Thus,

In(β) = XTUX, where U =

u1 0 · · · 00 u2 · · · 0...

.... . .

...0 0 · · · un

90

Page 91: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

The r.h.s. of (**) is the vector with jth element

n∑i=1

p∑k=1

u(m)i xijxikβ

(m)k +

∑i

(yi − µ(m)i )u

(m)i d

(m)i xij

=

n∑i=1

(xiju

(m)i

[p∑

k=1

xikβ(m)k + (yi − µ

(m)i )d

(m)i

])

where the superscript (m) denotes that the quantity is evaluated at β(m).

Thus the r.h.s. of (**) can be written asXTUz where z is a n−dimensionalvector with ith element

zi =

p∑k=1

xikβ(m)k + (yi − µ

(m)i )d

(m)i

= η(m)i + (yi − µ

(m)i )

(∂ηi∂µi

)β=β(m)

Thus value of β at the (m+ 1)st iteration satisfies

XTUXβ(m+1) = XTUz

orXTV−1Xβ(m+1) = XTV−1z (∗ ∗ ∗)

where z and V = U−1 = diag(u−11 , . . . , u−1

n ) are evaluated at β(m). Noticethat (***) is the WLS normal equation for a regression of the workingvariate z on X.

91

Page 92: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

The algorithm for obtaining the MLE of β in a GLM can be summarizedas follows:

1. Obtain starting values.

– these could be for β, but in any given iteration the quantitiesthat depend on the previous iteration’s β−value do so onlythrough µ = µ(β). Therefore, we only need starting values forµ which its convenient to take as the data themselves: µ(0) = y(minor adjustments to the original data may be needed to avoidlog(0), other problems).

2. Compute β(m+1) via the weighted least squares regression of z on X

where zi = η(m)i +(yi−µ

(m)i ) ∂ηi

∂µiwith weight matrixV = diag(u−1

1 , . . . , u−1n )

where

u−1i =

ϕ

wiv(µi)

(∂ηi∂µi

)2

(These ui’s are called the iterative weights). In this step both z,V are evaluated at β(m) (or at µ(0) when m = 0).

3. Repeat step 2 until some convergence criterion is obtained. For ex-ample, stop when

||β(m) − β(m−1)||||β(m−1)||

< ϵ

for some prechosen ϵ > 0. At convergence, the MLE is β = β(m).

92

Page 93: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

• This algorithm is known as iteratively reweighted least squares (IRLS)because it consists of a series of WLS calculations where weights arerecomputed at each iteration.

• IRLS = Fisher scoring for E.D. family.

• IRLS = Newton-Raphson for E.D. family with canonical link.

• IRLS implemented in PROC GENMOD, the R function glm(), etc.

• Notice that in the IRLS algorithm, the form of the loglikelihood onlyenters into the algorithm through the first two moments. Therefore,IRLS can be used to fit more general models outside of the GLMclass that are defined by first and second moment assumptions only,rather than full likelihood specifications (Quasi-likelihood models).

What about ϕ?

Notice that ϕ appears on both sides of the WLS normal equation (***),

thus cancelling out of the calculation of β.

Therefore, ϕ need not be updated throughout the IRLS algorithm. ϕ cansimply be estimated at convergence. We’ll come back to how we estimateϕ (what method and formulas) later.

By the ML theory we’ve reviewed we know the asymptotic variance of β:

avar(β) = In(β)−1

which we can estimate as follows:

ˆavar(β) = (XT V−1X)−1,

where V is V evaluated at β,ϕ.

93

Page 94: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Aliasing: (section 3.5, M&N)

So far, we’ve assumed X is of full rank. Often this will not be the case.

If columns x1, . . . ,xr of X form a linearly independent set, then one ormore of β1, . . . , βr are said to be aliased.

In this case β is not estimable because η = Xβ doesn’t uniquely deter-mine β; however, η and hence µ remains estimable, so it is useful to knowhow to fit the model and do inference on µ and other estimable functionsof β in this situation.

Example:

Suppose that length, breadth and area measurements are made on leaveswhich have the property that area=constant×length×breadth. Now sup-pose we form a GLM for calcium concentration in the leaves, say, whichhas linear predictor

η = β0 + β1X1 + β2X2 + β3X3

involving covariatesX1 = log length

X2 = log breadth

X3 = log area

Since area=constant×length×breadth, we have

X3 = c+X1 +X2,

where c = log of the original constant in the formula for area. Hence ηmay be expressed in terms of X1 and X2 as

η = β0 + β1X1 + β2X2 + β3(c+X1 +X2)

= β0 + β3c+ (β1 + β3)X1 + (β2 + β3)X2.

Thus we can distinguish

β0 + β3c, β1 + β3, and β2 + β3,

but not the four parameters β0, β1, β2, β3 separately.

• In this example, the aliasing is intrinsic to the problem. Aliasing willoccur whatever the sizes of the leaves (assuming area, length andbreadth are all measured without error and the leaves all conform tothe relationship area=constant×length×breadth).

94

Page 95: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

Two types of Aliasing:

1. Intrinsic Aliasing – the specification of the linear structure containsredundancy whatever the observed values of X; E.g., the leaves ex-ample, or the effects model in the one-way layout.

2. Extrinsic Aliasing – An anomoly of the data makes the columns ofX linearly dependent; E.g., no observations were obtained for onelevel of a factor yielding a 0 column in the X matrix.

Solutions to the Problem of Aliasing:

1. Use a generalized inverse to find one particular (of many) solution

β; then estimate and interpret estimable functions of β.

In solving(XTV−1X)β(m+1) = XTV−1z

obtain β(m+1) as

β(m+1) = (XTV−1X)−XTV−1z

where M− indicates a generalized inverse of the (perhaps singular)matrix M. That is, M− has the property MM−M = M.

This approach is satisfactory because, fortunately, statistically im-

portant functions of β such as η, µ are estimable. That is, η(β),

µ(β) will be the same regardless of which β we choose.

• Choice of a particular β doesn’t change our model, it only determinesthe manner in which we choose to express the linear structure in ourmodel.

95

Page 96: STAT 8620, Categorical Data Analysis & Generalized Linear ... · STAT 8620, Categorical Data Analysis & Generalized Linear Models — Lecture Notes The goal of this course is to teach

2. We can avoid a generalized inverse by imposing constraints on β sothat the solution is unique.

– Its important to realize that only constraints on the param-eter estimates (not the parameters themselves) are necessaryto solve the normal equations and to derive most things of in-terest in fitting the model (e.g., in a linear model setting, theSSE , the analysis of variance, the error variance estimate, andthe b.l.u.e. of any estimable function).

– However, it is often desirable to restrict model parameters in certain ways for interpretability reasons. In such cases the same constraints are applied to parameter estimates as well to obtain a solution of the normal equations.

– Common choices of constraints include the baseline (regression) constraints, in which p − rank(X) of the elements of β are set equal to zero corresponding to the p − rank(X) linear dependencies in the design matrix. The usual (anova, sum-to-zero, conventional) constraints require certain of the parameter estimates to sum to zero (e.g., the estimates of all effects corresponding to the levels of a factor must sum to 0).

– PROC GENMOD in SAS allows control over the constraints used to identify the model through options on the CLASS statement. In R it is also possible to choose parameter constraints via options(contrasts = ...), as sketched below.
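A minimal R sketch (hypothetical one-factor data): switching the global contrasts option between the baseline (treatment) and sum-to-zero constraints changes the reported coefficients but not the fitted model:

set.seed(2)
trt <- factor(rep(c("A", "B", "C"), each = 4))            # a hypothetical 3-level factor
y   <- rpois(12, lambda = rep(c(2, 4, 6), each = 4))      # hypothetical counts

options(contrasts = c("contr.treatment", "contr.poly"))   # baseline constraint (R's default)
fit.base <- glm(y ~ trt, family = poisson)

options(contrasts = c("contr.sum", "contr.poly"))         # sum-to-zero ("anova") constraint
fit.sum <- glm(y ~ trt, family = poisson)

coef(fit.base); coef(fit.sum)                 # different parameterizations ...
all.equal(fitted(fit.base), fitted(fit.sum))  # ... identical fitted means

The constraint choice only re-expresses the same linear structure; estimable quantities such as the fitted means are unchanged.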


Goodness of Fit and Hypothesis Testing:

At this point we consider fit of the linear structure. We assume for now that the error distribution and link specifications are satisfactory.

To talk about model fit and selection it's useful to define several cases of the linear structure:

i. If we include n linearly independent explanatory variables ⇒ the MLEs of the µi's are the yi's themselves (a perfect fit). This situation is known as the full or saturated model. This model involves no summarization (reduction) of the data.

ii. At the other extreme is the null model where we assume µi = µ ∀ i. This model is almost always too simple.

iii. The current model is the model currently under investigation, lying somewhere in between the full and null models.

Sometimes it's useful to identify two more models:

iv. The minimal model is the simplest model we wish to consider. Typically it's more complex than (ii) because we know (e.g.) that certain effects must be included to account for the experimental design and/or to allow hypothesis tests of a priori interest.

v. The maximal model is the most complex model we're willing to consider. Typically it's less complex than the full model.


Most testing problems of interest in GLMs can be expressed as linear hypotheses of the form

H0 : Cβ = b versus H1 : Cβ ≠ b

where C is an s × p matrix of constants with full row rank and b is an s-vector of constants.

Most commonly, we want to test

H0 : βr = 0 versus H1 : βr ≠ 0

where βr is an r−dimensional sub-vector of β.

A hypothesis of this form includes as a special case a goodness of fit test for the overall adequacy of the model: for this test the hypotheses are of the form

H0 : βn−p = 0 versus H1 : βn−p ≠ 0

where βn−p is the sub-vector of βn, the full model parameter, that must be set to zero to obtain the current model (with dim(β) = p).

Under likelihood-based inference the classical tool for testing hypotheses is the Likelihood Ratio Statistic:

λ = L(β̂; y) / L(β̃; y),   β̂ = MLE under the larger model,
                           β̃ = MLE under the smaller model

  = L(µ̂; y) / L(µ̃; y),   µ̂ = µ(β̂), µ̃ = µ(β̃)

Logic: if λ ≈ 1 ⇒ no evidence against H0 ⇒ the models fit about equally well, so we should adopt the smaller model. If λ ≫ 1, then reject H0 and adopt the larger model.

How large does λ need to be to reject?

Answer: large in comparison to its distribution under H0 (above the 100(1 − α)th percentile of its null distribution).


The following result is due to Wilks:

Let

λ = L(θ̂) / L(θ̃),   θ̂ = MLE under model 1,
                     θ̃ = MLE under model 2,

where model 2 is nested within model 1. Then under mild regularity conditions, we have the following asymptotic result:

2 log λ ∼ χ²(t1 − t2)   (asymptotically),

where ti = the number of independent parameters estimated under model i, i = 1, 2.
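A small R sketch (simulated Poisson data with hypothetical covariates x1 and x2) showing 2 log λ computed from the log-likelihoods, the chi-square cutoff it is compared against, and the equivalent deviance comparison that anova() reports:

set.seed(3)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- rpois(50, exp(0.5 + 0.4 * x1))             # x2 has no real effect here

fit2 <- glm(y ~ x1,      family = poisson)       # smaller model (model 2)
fit1 <- glm(y ~ x1 + x2, family = poisson)       # larger model (model 1); model 2 is nested in it

lr <- as.numeric(2 * (logLik(fit1) - logLik(fit2)))   # 2 log(lambda)
lr
qchisq(0.95, df = 1)                             # t1 - t2 = 1; reject H0 if lr exceeds this
anova(fit2, fit1, test = "Chisq")                # same statistic, as a difference of deviances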

Recall we originally expressed the scaled deviance

D∗(y;µ) = 2[ℓ(y;y)− ℓ(µ;y)]

as a function of µ, and we noticed that maximizing the likelihood is equivalent to minimizing the deviance (ML = Min Dev).

For a particular model, though, with MLE µ̂, the scaled deviance evaluated at µ̂ becomes our primary tool for inference in a GLM context:

D∗(y; µ̂) = 2[ℓ(y; y) − ℓ(µ̂; y)]

• Notice that D∗(y; µ̂) is just 2 log λ for comparing the current model to the full model; the sketch below verifies this numerically.
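A quick R check of this identity (Poisson data simulated for illustration; this anticipates the Poisson example below): the deviance reported by glm() is exactly twice the gap between the saturated and fitted log-likelihoods:

set.seed(4)
x <- rnorm(40)
y <- rpois(40, exp(0.3 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)

ll.fit <- sum(dpois(y, lambda = fitted(fit), log = TRUE))  # l(mu.hat; y), the current model
ll.sat <- sum(dpois(y, lambda = y, log = TRUE))            # l(y; y), the full (saturated) model
2 * (ll.sat - ll.fit)                                      # 2 log(lambda) for current vs. full
deviance(fit)                                              # the same number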


Examples – Normal Distribution, σ² Known.

We’ve already seen that

D∗(y; µ̂) = 2[ −(1/(2σ²)) ∑i (yi − yi)² + (1/(2σ²)) ∑i (yi − µ̂i)² ]

          = (1/σ²) ∑i (yi − µ̂i)² = (1/σ²) SSE

⇒ D(y; µ̂) = SSE.
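A one-line check in R (simulated data for illustration): for the Gaussian family, the (unscaled) deviance reported by glm() is just the residual sum of squares:

set.seed(5)
x <- rnorm(30); y <- 1 + 2 * x + rnorm(30)
fit <- glm(y ~ x, family = gaussian)     # same fit as lm(y ~ x)
deviance(fit)                            # D(y; mu.hat)
sum((y - fitted(fit))^2)                 # SSE: identical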

Poisson Distribution

f(yi) = e^(−µi) µi^(yi) / yi!,   i = 1, . . . , n

⇒ ℓ(µ; y) = log ∏i e^(−µi) µi^(yi) / yi! = ∑i [−µi + yi log µi − log yi!]

⇒ D∗(y; µ̂) = 2 ∑i [−yi + yi log yi − log yi! − (−µ̂i + yi log µ̂i − log yi!)]

            = 2 ∑i [yi log(yi/µ̂i) − (yi − µ̂i)] = D(y; µ̂),   since ϕ = 1 here

Note that if the residuals yi − µ̂i sum to zero, then

D∗(y; µ̂) = 2 ∑i yi log(yi/µ̂i) = G²,

the G² likelihood ratio chi-square statistic often used for goodness of fit in contingency table analysis.
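An R verification of the Poisson deviance formula (simulated counts; the ifelse() enforces the convention 0 · log 0 = 0 when yi = 0). Because a Poisson log-linear model with an intercept has score equations that force ∑i (yi − µ̂i) = 0, the deviance and G² coincide here:

set.seed(6)
x <- rnorm(60)
y <- rpois(60, exp(0.4 + 0.6 * x))
fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

term <- ifelse(y == 0, 0, y * log(y / mu))   # y_i log(y_i / mu.hat_i), with 0 log 0 = 0
2 * sum(term - (y - mu))                     # the deviance formula above
deviance(fit)                                # matches

sum(y - mu)                                  # essentially 0 (canonical link + intercept)
2 * sum(term)                                # G^2: equals the deviance here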
