48
Modeling a Count Response Bruce A Craig Department of Statistics Purdue University Reading: Faraway Ch. 5, Agresti Ch. 4, KNNL Ch. 14 STAT 526 Topic 8 1

Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Modeling a Count Response

Bruce A Craig

Department of StatisticsPurdue University

Reading: Faraway Ch. 5, Agresti Ch. 4, KNNL Ch. 14

STAT 526 Topic 8 1

Page 2: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Poisson Distribution

Let Y represent the count of events (e.g., successes) thatoccur in some fixed interval of measure (e.g., time interval,region of area)

Unlike binomial/Bernoulli, no upper bound on support,Y = 0, 1, 2, . . .

Poisson distribution appropriate model when

The number of events that occur in two nonoverlapping unitsof measure are independentThe probability that an event occurs in a unit of measure isthe same for all units of equal size and is proportional to thesize of the unitThe probability that more than one event occurs in a unit ofmeasure is negligible for very small-sized units. In otherwords, the events occur one at a time.

STAT 526 Topic 8 2

Page 3: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Poisson Distribution

Probability density function

P(Y = y) =exp{−λ}λy

y !

Can show an exponential family distribution

E (Y ) = Var(Y ) = λ

Features:

If Yiind∼ Pois(λi ), then

i

Yi ∼ Pois(∑

i

λi )

Arises naturally when time between events is iidexponentialApproximates B(m, p) when m is large and p is smallwith λ = mp.Approaches Normality as λ increases

STAT 526 Topic 8 3

Page 4: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Poisson(λ) for Various λ

STAT 526 Topic 8 4

Page 5: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Poisson Regression

Model

Yiind∼ Pois(λi )

log λi = xiβ

– βj = diff in log E(Y) for unit change in xj (all others fixed)

– mean at xj + 1 is mean at xj times exp{βj}

log is common (canonical) link but others may be usedLog-likelihood

l(β) = log

n∏

i=1

[

λyii exp{−λi}

yi !

]

=

n∑

i=1

(yi log λi − λi − log yi !)

=

n∑

i=1

(yi xiβ − exp xiβ) + constant =⇒ X′y = X′µ̂

STAT 526 Topic 8 5

Page 6: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Goodness of Fit

Deviance

D = 2n

i=1

(

yi logyi

µ̂i

− (yi − µ̂i )

)

= 2

n∑

i=1

(

yi logyi

λ̂i

− (yi − λ̂i )

)

Also known as the G 2 statisticDifferent deviance values for grouped or ungrouped data

Pearson X 2

X 2 =n

i=1

(yi − µ̂i )2

V (µ̂i )=

n∑

i=1

(yi − λ̂i )2

λ̂i

STAT 526 Topic 8 6

Page 7: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Inference About β

Same general procedure as with other GLMs

To test H0 : βj = 0 versus Ha : βj 6= 0, useWald testLikelihood ratio testScore test

For inference about µi , compute η̂i and its SE.Backtransform if interested in CI

For prediction of Y , use Poisson pdf with λ = µ̂ tocompute appropriate quantiles

STAT 526 Topic 8 7

Page 8: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Example

Is the diversity in plant species on an island related tovarious geographical features of the island?

Study looked at 30 Galapagos islands obtaining a speciescount (Y ) and five geographical features

Area of the island (km2)Highest elevation of the island (m)Distance to Santa Cruz island (km)Distance to the nearest island (km)Area of adjacent island (km2)

Also have count of endemic (unique) species as anotheroutcome variable

STAT 526 Topic 8 8

Page 9: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Examine the Data Structure

> library(faraway)

> str(gala)

’data.frame’: 30 obs. of 7 variables:

$ Species : num 58 31 3 25 2 18 24 10 8 2 ...

$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...

$ Area : num 25.09 1.24 0.21 0.1 0.05 ...

$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...

$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...

$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...

$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...

## Will remove the Endemics variable for easier analysis

> gala1 = gala[,-2]

STAT 526 Topic 8 9

Page 10: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Fitting the Full GLM> fit1 <- glm(Species~., family=poisson, data=gala1)

> summary(fit1)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***

Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***

Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***

Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***

Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***

Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3510.73 on 29 degrees of freedom

Residual deviance: 716.85 on 24 degrees of freedom

AIC: 889.68

Number of Fisher Scoring iterations: 5

STAT 526 Topic 8 10

Page 11: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Goodness of Fit Tests

Residual Deviance

Tests H0 : our model vs Ha: saturated modelDistribution poorly approximated by χ2

> pchisq(fit1$deviance,fit1$df.residual,lower.tail=FALSE)

[1] 7.073157e-136

Pearson X 2

Test H0 : our model vs Ha: saturated modelDistribution better approximated by χ2

Gets even better when Poisson approaches Normal

> pchisq( sum(residuals(fit1, type="pearson")^2 ), fit1$df.residual, lower.tail=FALSE)

[1] 2.18719e-145 # reject H0

Can look more into model diagnostics to find possiblereasons for poor fit

STAT 526 Topic 8 11

Page 12: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Randomized Quantile Residuals

Can again use these residuals to assess fit

Warning: Problem with extreme outliers / overdispersion> fit1res = qres.pois(fit1)

> fit1res

[1] -2.3704014 2.2765322 -5.5201747 0.7896027 -4.6562257

[6] -3.3413838 -1.3724017 -0.3180478 -4.6393914 -6.2867257

[11] Inf 0.5979665 8.2095362 -6.6106086 2.8081647

[16] -1.2164403 -0.5969734 -5.1684852 -8.2777086 -1.2047194

[21] -3.9679243 2.1827066 3.9796196 -7.4357526 7.9414445

[26] 0.1319390 Inf 1.1363113 -3.6259643 0.6910023

Obs #11: y = 97 and µ̂ = 27.86135

Obs #27: y = 285 and µ̂ = 158.1653

Can set Inf values to large value (e.g., 10)

STAT 526 Topic 8 12

Page 13: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Diagnosticsqplot(predict(fit1,type="link"),fit1res,geom=c(’point’),ylab="Residual")

fit1res1 = fit1res

fit1res1[is.infinite(fit1res1)] = 10

qqnorm(fit1res1)

abline(a=0,b=1,col="red")

STAT 526 Topic 8 13

Page 14: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Mean versus Variance

Model assumes mean and variance are the sameCan compare observed variance with µ̂ (log scale)plot(log(predict(fit1,type="response")),xlab="Estimated Mean",

log((gala1$Species-predict(fit1,type="response"))^2),

ylab="Approx Variance")

abline(a=0,b=1)

STAT 526 Topic 8 14

Page 15: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Quasi-Poisson

Goodness of fit tests and diagnostics suggest overdispersion

Similar to binomial, can estimate dispersion parameter φ

> phihat = sum(residuals(fit1,type="pearson")^2)/fit1$df.residual

> phihat

[1] 31.74914

> summary(fit1,dispersion=phihat)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.1548079 0.2915897 10.819 < 2e-16 ***

Area -0.0005799 0.0001480 -3.918 8.95e-05 ***

Elevation 0.0035406 0.0004925 7.189 6.53e-13 ***

Nearest 0.0088256 0.0102621 0.860 0.390

Scruz -0.0057094 0.0035251 -1.620 0.105

Adjacent -0.0006630 0.0001653 -4.012 6.01e-05 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 31.74914)

STAT 526 Topic 8 15

Page 16: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Altering the Structural Form

Scatterplot of log(Species) versus predictors showpredictors highly skewed and relationship nonlinear

Consider model using the log of each predictor

STAT 526 Topic 8 16

Page 17: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Fitting the Full GLM

> fit2 <- glm(Species~., family=poisson, data=gala2)

> summary(fit2)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.296705 0.285277 11.556 < 2e-16 ***

Area 0.349908 0.018011 19.428 < 2e-16 ***

Elevation 0.034825 0.057003 0.611 0.54125

Nearest -0.041491 0.013926 -2.979 0.00289 **

Scruz -0.030435 0.011573 -2.630 0.00854 **

Adjacent -0.089823 0.006925 -12.972 < 2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3510.73 on 29 degrees of freedom

Residual deviance: 360.39 on 24 degrees of freedom

AIC: 533.22

STAT 526 Topic 8 17

Page 18: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Diagnostics

qplot(predict(fit1,type="link"),fit1res,geom=c(’point’),ylab="Residual")

qqnorm(fit2res)

abline(a=0,b=1,col="red")

STAT 526 Topic 8 18

Page 19: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Quasi-Poisson

Even with better form, overdispersion appears relevant

> phihat = sum(residuals(fit2,type="pearson")^2)/fit2$df.residual

> phihat

[1] 16.6475

> summary(fit2,dispersion=phihat)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.29671 1.16397 2.832 0.00462 **

Area 0.34991 0.07349 4.762 1.92e-06 ***

Elevation 0.03482 0.23258 0.150 0.88098

Nearest -0.04149 0.05682 -0.730 0.46527

Scruz -0.03044 0.04722 -0.645 0.51923

Adjacent -0.08982 0.02825 -3.179 0.00148 **

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 16.6475)

STAT 526 Topic 8 19

Page 20: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Summary

Substantial change in deviance using new structural form> anova(fit1,fit2)

Analysis of Deviance Table

Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent

Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent

Resid. Df Resid. Dev Df Deviance

1 24 716.85

2 24 360.39 0 356.45

Overdispersion, however, still present

Change in log(y) for a unit change in log(x) approximatespercent change in y for percent change in x (elasticity)

STAT 526 Topic 8 20

Page 21: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Rate Model: Motivation

Each Yi represents a diff interval in space or time

Number of events may depend on the size of the interval# of crimes in cities (population size)# of customers served by workers (time worked)# cars running red lights in different intersections (how busy)

Sometimes can consider problem in binomial setting butoften still better to model as Poisson

counts can be small as compared to the totalthe total trials may not be easily known

Goal: express a common effect of covariates on all Yi ,while accounting for differences in ’exposure’

’exposure’ needs to be defined by a variable

STAT 526 Topic 8 21

Page 22: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Rate Model: Formulation

Consider the following model:

Yiind∼ Pois(λi )

with

λi = exposurei exp{xiβ}

or

log(λi ) = log(exposurei ) + xiβ

This model is equivalent to using log exposure as apredictor in Poisson GLM with the coefficient fixed at 1

The log exposure variable is often called an offset in thiscontext

STAT 526 Topic 8 22

Page 23: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Example

Experiment to study the effect of gamma radiation on thenumbers of chromosomal abnormalities (ca)

Three dose amounts (in Grays)Nine dose rates (in Grays/hour)

The number of cells (cells, in hundreds) exposed to adose/rate combination varied across runs

Thus, cells can be viewed as the ’exposure’ variableCan have more than one abnormality per cell> library(faraway)

> head(dicentric)

cells ca doseamt doserate

1 478 25 1 0.10

2 1907 102 1 0.25

3 2258 149 1 0.50

4 2329 160 1 1.00

5 1238 75 1 1.50

6 1491 100 1 2.00

STAT 526 Topic 8 23

Page 24: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Exploring the Data

Outcome: proportion of cells with an abnormalityInteraction plot:> with(dicentric,interaction.plot(doseamt,doserate,ca/cells,legend=FALSE))

> legend(’topleft’,c(’0.1’,’0.25’,’0.5’,’1’,’1.5’,’2’,’2.5’,’3’,’4’),

lty=9:1,cex=0.8)

STAT 526 Topic 8 24

Page 25: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Exploring the Data

Interaction plot with numeric factor on x axis> with(dicentric,interaction.plot(doserate,log(doseamt),ca/cells,legend=FAL

> legend(’topleft’,c(’1’,’2.5’,’5’),lty=3:1,cex=0.6)

Can more easily see the change in slopes?

STAT 526 Topic 8 25

Page 26: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Exploring the Data

Plot with numeric factor on x axis> ggplot(dicentric,aes(x=lograte,y=ca/cells,col=as.factor(doseamt)))+

geom_point()+geom_smooth(method="lm", se=F)+

theme(legend.position = "none")

STAT 526 Topic 8 26

Page 27: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Model Without an Offset

The coefficient of log(cells) is close to 1 (offset reasonable)

> dicentric$doseF <- factor(dicentric$doseamt)

> fit1 = glm(ca ~ log(cells) + log(doserate)*doseF,

family=poisson, data=dicentric)

> summary(fit1)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.76534 0.38116 -7.255 4.02e-13 ***

log(cells) 1.00252 0.05137 19.517 < 2e-16 ***

log(doserate) 0.07200 0.03547 2.030 0.042403 *

doseF2.5 1.62984 0.10273 15.866 < 2e-16 ***

doseF5 2.76673 0.12287 22.517 < 2e-16 ***

log(doserate):doseF2.5 0.16111 0.04837 3.331 0.000866 ***

log(doserate):doseF5 0.19316 0.04299 4.493 7.03e-06 ***

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 916.127 on 26 degrees of freedom

Residual deviance: 21.748 on 20 degrees of freedom

AIC: 211.15

STAT 526 Topic 8 27

Page 28: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

GLM Model With an Offset

Very little difference in estimates here

> fit2 <- glm(ca ~ offset(log(cells)) + log(doserate)*doseF,

family=poisson, data=dicentric)

> summary(fit2)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.74671 0.03426 -80.165 < 2e-16 ***

log(doserate) 0.07178 0.03518 2.041 0.041299 *

doseF2.5 1.62542 0.04946 32.863 < 2e-16 ***

doseF5 2.76109 0.04349 63.491 < 2e-16 ***

log(doserate):doseF2.5 0.16122 0.04830 3.338 0.000844 ***

log(doserate):doseF5 0.19350 0.04243 4.561 5.1e-06 ***

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 4753.00 on 26 degrees of freedom

Residual deviance: 21.75 on 21 degrees of freedom

AIC: 209.16

STAT 526 Topic 8 28

Page 29: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Alternative Approach I

Consider LM using proportions as response> propca = dicentric$ca / dicentric$cells

> fit3 = lm(propCA ~ log(doserate)*doseF, data=dicentric)

> summary(fit3)

Coefficients:

Estimate Std.Error tvalue Pr(>|t|)

(Intercept) 0.063489 0.019528 3.251 0.00382 **

log(doserate) 0.004573 0.016692 0.274 0.78680

doseF2.5 0.276315 0.027616 10.005 1.92e-09 ***

doseF5 1.004119 0.027616 36.359 < 2e-16 ***

log(doserate):doseF2.5 0.063933 0.023606 2.708 0.01317 *

log(doserate):doseF5 0.239129 0.023606 10.130 1.54e-09 ***

Residual standard error: 0.05858 on 21 degrees of freedom

Multiple R-squared: 0.9874,Adjusted R-squared: 0.9844

F-statistic: 330 on 5 and 21 DF, p-value: < 2.2e-16

Coefficients defined on data scale

STAT 526 Topic 8 29

Page 30: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Alternative Approach I

Model assumes Normality and constant variance

Residual plot: evidence of unequal varianceClearly incorrect model choice> plot(fit3$fitted,fit3$residuals)

> abline(h=0)

STAT 526 Topic 8 30

Page 31: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Alternative Approach II

Consider LM using log(proportion) as response> fit4 <- lm(log(propca) ~ log(doserate)*doseF,data=dicentric)

> summary(fit4)

Coefficients:

Estimate Std.Error tvalue Pr(>|t|)

(Intercept) -2.76243 0.03352 -82.402 < 2e-16 ***

log(doserate) 0.07561 0.02866 2.638 0.015364 *

doseF2.5 1.64378 0.04741 34.672 < 2e-16 ***

doseF5 2.77866 0.04741 58.610 < 2e-16 ***

log(doserate):doseF2.5 0.16483 0.04053 4.067 0.000553 ***

log(doserate):doseF5 0.19480 0.04053 4.807 9.47e-05 ***

...

Residual standard error: 0.1006 on 21 degrees of freedom

Multiple R-squared: 0.9943,Adjusted R-squared: 0.9929

F-statistic: 729.4 on 5 and 21 DF, p-value: < 2.2e-16

Analogous to using offset in Poisson regression

Parameter estimates similar to those in Poisson GLM

SE are more optimistic than in Poisson regression

STAT 526 Topic 8 31

Page 32: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Alternative Approach II

Model still assumes Normality and constant variance

Residual plot: evidence of unequal variance?May be ok choice but Poisson GLM preferred> plot(fit4$fitted,fit4$residuals)

> abline(h=0)

STAT 526 Topic 8 32

Page 33: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Pairwise Comparisons

When interaction, often compare levels of A at fixed B

Example 1: µdose5,doserate1 vs µdose2.5,doserate1

Using Poisson GLM with offset

> library(gmodels)

> contrast.function <- c(0, 0, -1, 1, 0, 0)

> names(contrast.function) <- names(coef(fit2))

> contrast.function

(Intercept) log(doserate) doseF2.5 doseF5 1 log(1) 0 1 0 log(1)

0 0 -1 1 - 1 log(1) 1 0 log(1) 0

log(doserate):doseF2.5 log(doserate):doseF5

0 0

> test.result <- estimable(fit2, contrast.function)

> test.result

Estimate Std. Error X^2 value DF Pr(>|X^2|)

(0 0 -1 1 0 0) 1.135667 0.04460504 648.2372 1 0

STAT 526 Topic 8 33

Page 34: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Pairwise Comparisons

Example 2: µdose5,doserate3 vs µdose2.5,doserate3

> contrast.function <- c(0, 0, -1, 1, -1.0986, 1.0986)

> names(contrast.function) <- names(coef(fit2)) 1 log(3) 0 1 0 log(3)

> test.result <- estimable(fit2, contrast.function) - 1 log(3) 1 0 log(3) 0

> test.result

Estimate Std. Error X^2 value DF Pr(>|X^2|)

(0 0 -1 1 -1.0986 1.0986) 1.171129 0.05390965 471.9289 1 0

Difference is larger (expected because of interaction)

Note that comparisons are made on the link scale

log(µ̂ij)− log(µ̂i ′j) = log(µ̂ij/µ̂i ′j) 6= log(µ̂ij − µ̂i ′j)

STAT 526 Topic 8 34

Page 35: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Grouped versus Ungrouped Analysis

Since Yiind∼ Pois(λi), the aggregate

i

Yi ∼ Pois(∑

i

λi)

As in binomial/Bernoulli response, log-likelihood only involvessums of yi with same covariate patternsIn previous example, could add total cells and counts ofabnormalities for entries with same doseamt and doserate

Rate models: individual vs groupeddifferent deviancessame parameter estimatessame comparison of nested models

Models with no offset: indiv. vs groupeddifferent deviancessame parameters except interceptsame comparison of nested models

STAT 526 Topic 8 35

Page 36: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Negative Binomial Distribution

Typical motivationIndependent trials with P(success) = p

Z= the number of trials until kth successZ ∼ NB(p, k), Z = k , k + 1, . . .

A more useful motivationThe distribution of Y |λ ∼ Pois(λ) and λ ∼ G (k , α)

Probability distribution

P(Z = z) =

(

z − 1

k − 1

)

pk(1− p)z−k z = k , k + 1, . . .

or

P(Y = y) =

(

y + k − 1

k − 1

)(

α

1 + α

)y (1

1 + α

)k

y = 0, 1, . . .

Using parameterization Y = Z − k and p = (1 + α)−1

STAT 526 Topic 8 36

Page 37: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Negative Binomial Distribution

Under this reparameterizationE (Y ) = k(1− p)/p = kα = µVar(Y ) = k(1− p)/p2 = kα(1 + α) = µ(1 + µ/k)Variance approaches Poisson variance as k → ∞

Log-likelihood is

l =

n∑

i=1

(

yi logαi

αi + 1− k log(1 + αi )

)

+ constant

=n

i=1

(

yi logµi

k + µi

+ k logk

k + µi

)

+ constant

Consider k as fixed or additional parameter to estimate

Use log (µi/(k + µi)) = xiβ as link function in GLM

STAT 526 Topic 8 37

Page 38: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Return to Galapagos Islands Study

Can fix k based on exploratory plots of mean-variancerelationship

R: Use glm (link=log) in the MASS package

> fit.nb <- glm(Species~ log(Area)+log(Adjacent),

family=negative.binomial(1), data=gala)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.27257 0.15304 21.384 < 2e-16 ***

log(Area) 0.35100 0.03773 9.304 6.52e-10 ***

log(Adjacent) -0.03204 0.04015 -0.798 0.432

...

(Dispersion parameter for Negative Binomial(1) family

taken to be 0.4650222)

Null deviance: 54.069 on 29 degrees of freedom

Residual deviance: 13.965 on 27 degrees of freedom

AIC: 292.97

STAT 526 Topic 8 38

Page 39: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Return to Galapagos Islands Study

Given dispersion parameter < 1, k too small

Could try various k or estimate using ML (glm.nb)

> fit.nb1 <- glm.nb(Species~ log(Area)+log(Adjacent),data=gala)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.27777 0.14495 22.613 <2e-16 ***

log(Area) 0.34973 0.03541 9.875 <2e-16 ***

log(Adjacent) -0.03316 0.03737 -0.887 0.375

(Dispersion parameter for Negative Binomial(2.6196)

family taken to be 1)

Null deviance: 134.240 on 29 degrees of freedom

Residual deviance: 32.741 on 27 degrees of freedom

AIC: 284.99

Theta: 2.620

Std. Err.: 0.753

STAT 526 Topic 8 39

Page 40: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Return to Galapagos Islands Study

Not a big difference in estimates and SEs across the two fits

CI for Theta= k almost includes 1

Poisson limiting case of NB when k → ∞

Thus, test H0 : φ = 1/k = 0 vs Ha : φ > 0

One-sided hypothesisUnder H0, φ is on the boundary of supportVerbeke and Molenberghs (2000):

LR = 2 [logLik(NB model)− logLik(Poisson model)]

∼1

2χ20 +

1

2χ21

STAT 526 Topic 8 40

Page 41: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Return to Galapagos Islands Study

For this study, the two model fits result in

logLik(NB) = 32.741logLik(Poisson) = 395.54

2 [395.54− 32.741] = 362.799 >1

2χ20(0.95) +

1

2χ21(0.95)

= 1.920729

Conclusion: reject H0

Caution: quality of approximation can be poor

STAT 526 Topic 8 41

Page 42: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Zero-Inflated Models

Again refers to there being an excess of zero counts thatcannot be adequately modeled using a dispersionparameter

Will consider both zero-inflated Poisson (ZIP) andzero-inflated negative-binomial (ZINB) models

Can think of model structure in two waysAdditional point mass at zero

P(Y = 0) = p∗i + (1− p∗i )gi (0)

P(Y = y) = (1− p∗i )gi (y) y > 0

Hurdle model

P(Y = 0) = f1(0)

P(Y = y) =1− f1(0)

1− f2(0)f2(y) y > 0

STAT 526 Topic 8 42

Page 43: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Hurdle Model

Consider a latent variable Zi that must exceed some valuefor these to be a non-zero value

This score often called a hurdle

Consider using binary model for this part

Must use a rescaled truncated distribution for second part

In R: hurdle function in pscl packagedist=c(“poisson”,”negbin”,”geometric”)zero.dist = c(“binomial”, “poisson”,”negbin”,”geometric”)link = c(“logit”,”probit”,”cloglog”,”cauchit”,”log”)

STAT 526 Topic 8 43

Page 44: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Mixture Model

Already discussed this model in relation to the binomial

Consider that there is a proportion p∗ that will alwaysrespond with a zero

Remaining 1− p∗ follow specified distribution

In R: zeroinfl function in pscl packagedist=c(“poisson”,”negbin”,”geometric”)link = c(“logit”,”probit”,”cloglog”,”cauchit”,”log”)Model specified using y ∼ x1 + x2 | z1 + z2 + z3

STAT 526 Topic 8 44

Page 45: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Example

915 biochemistry graduate studentsOutcome : Number of articles produced in last 3 yrs ofPhD> mod = glm(art ~ ., data=bioChemists,family=poisson)

> summary(mod)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.304617 0.102981 2.958 0.0031 **

femWomen -0.224594 0.054613 -4.112 3.92e-05 ***

marMarried 0.155243 0.061374 2.529 0.0114 *

kid5 -0.184883 0.040127 -4.607 4.08e-06 ***

phd 0.012823 0.026397 0.486 0.6271

ment 0.025543 0.002006 12.733 < 2e-16 ***

Null deviance: 1817.4 on 914 degrees of freedom

Residual deviance: 1634.4 on 909 degrees of freedom

AIC: 3314.1

STAT 526 Topic 8 45

Page 46: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Example: Hurdle Model> library(pscl)

> modh = hurdle(art~.,data=bioChemists)

> summary(modh)

Count model coefficients (truncated poisson with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.67114 0.12246 5.481 4.24e-08 ***

femWomen -0.22858 0.06522 -3.505 0.000457 ***

marMarried 0.09649 0.07283 1.325 0.185209

kid5 -0.14219 0.04845 -2.934 0.003341 **

phd -0.01273 0.03130 -0.407 0.684343

ment 0.01875 0.00228 8.222 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.23680 0.29552 0.801 0.4230

femWomen -0.25115 0.15911 -1.579 0.1144

marMarried 0.32623 0.18082 1.804 0.0712 .

kid5 -0.28525 0.11113 -2.567 0.0103 *

phd 0.02222 0.07956 0.279 0.7800

ment 0.08012 0.01302 6.155 7.52e-10 ***

Number of iterations in BFGS optimization: 12

Log-likelihood: -1605 on 12 Df

STAT 526 Topic 8 46

Page 47: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Example: Mixture Model> modzi = zeroinfl(art~.,data=bioChemists)

> summary(modzi)

Count model coefficients (poisson with log link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.640838 0.121307 5.283 1.27e-07 ***

femWomen -0.209145 0.063405 -3.299 0.000972 ***

marMarried 0.103751 0.071111 1.459 0.144565

kid5 -0.143320 0.047429 -3.022 0.002513 **

phd -0.006166 0.031008 -0.199 0.842378

ment 0.018098 0.002294 7.888 3.07e-15 ***

Zero-inflation model coefficients (binomial with logit link):

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.577059 0.509386 -1.133 0.25728

femWomen 0.109746 0.280082 0.392 0.69518

marMarried -0.354014 0.317611 -1.115 0.26502

kid5 0.217097 0.196482 1.105 0.26919

phd 0.001274 0.145263 0.009 0.99300

ment -0.134114 0.045243 -2.964 0.00303 **

Number of iterations in BFGS optimization: 21

Log-likelihood: -1605 on 12 Df

STAT 526 Topic 8 47

Page 48: Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some fixed interval

Summary

Very difficult to choose between the two

Be careful interpreting modelsHurdle: probability of a positive countMixture: probability of a zero count

Bland-Altman plot better describes prediction differences

STAT 526 Topic 8 48