Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some ﬁxed interval

Modeling a Count Response

Bruce A Craig

Department of StatisticsPurdue University

Reading: Faraway Ch. 5, Agresti Ch. 4, KNNL Ch. 14

STAT 526 Topic 8 1

Poisson Distribution

Let Y represent the count of events (e.g., successes) thatoccur in some fixed interval of measure (e.g., time interval,region of area)

Unlike binomial/Bernoulli, no upper bound on support,Y = 0, 1, 2, . . .

Poisson distribution appropriate model when

The number of events that occur in two nonoverlapping unitsof measure are independentThe probability that an event occurs in a unit of measure isthe same for all units of equal size and is proportional to thesize of the unitThe probability that more than one event occurs in a unit ofmeasure is negligible for very small-sized units. In otherwords, the events occur one at a time.

STAT 526 Topic 8 2

Poisson Distribution

Probability density function

P(Y = y) =exp{−λ}λy

y !

Can show an exponential family distribution

E (Y ) = Var(Y ) = λ

Features:

If Yiind∼ Pois(λi ), then

∑

i

Yi ∼ Pois(∑

i

λi )

Arises naturally when time between events is iidexponentialApproximates B(m, p) when m is large and p is smallwith λ = mp.Approaches Normality as λ increases

STAT 526 Topic 8 3

Poisson(λ) for Various λ

STAT 526 Topic 8 4

Poisson Regression

Model

Yiind∼ Pois(λi )

log λi = xiβ

– βj = diff in log E(Y) for unit change in xj (all others fixed)

– mean at xj + 1 is mean at xj times exp{βj}

log is common (canonical) link but others may be usedLog-likelihood

l(β) = log

n∏

i=1

[

λyii exp{−λi}

yi !

]

=

n∑

i=1

(yi log λi − λi − log yi !)

=

n∑

i=1

(yi xiβ − exp xiβ) + constant =⇒ X′y = X′µ̂

STAT 526 Topic 8 5

Goodness of Fit

Deviance

D = 2n

∑

i=1

(

yi logyi

µ̂i

− (yi − µ̂i )

)

= 2

n∑

i=1

(

yi logyi

λ̂i

− (yi − λ̂i )

)

Also known as the G 2 statisticDifferent deviance values for grouped or ungrouped data

Pearson X 2

X 2 =n

∑

i=1

(yi − µ̂i )2

V (µ̂i )=

n∑

i=1

(yi − λ̂i )2

λ̂i

STAT 526 Topic 8 6

Inference About β

Same general procedure as with other GLMs

To test H0 : βj = 0 versus Ha : βj 6= 0, useWald testLikelihood ratio testScore test

For inference about µi , compute η̂i and its SE.Backtransform if interested in CI

For prediction of Y , use Poisson pdf with λ = µ̂ tocompute appropriate quantiles

STAT 526 Topic 8 7

Example

Is the diversity in plant species on an island related tovarious geographical features of the island?

Study looked at 30 Galapagos islands obtaining a speciescount (Y ) and five geographical features

Area of the island (km2)Highest elevation of the island (m)Distance to Santa Cruz island (km)Distance to the nearest island (km)Area of adjacent island (km2)

Also have count of endemic (unique) species as anotheroutcome variable

STAT 526 Topic 8 8

Examine the Data Structure

> library(faraway)

> str(gala)

’data.frame’: 30 obs. of 7 variables:

$ Species : num 58 31 3 25 2 18 24 10 8 2 ...

$ Endemics : num 23 21 3 9 1 11 0 7 4 2 ...

$ Area : num 25.09 1.24 0.21 0.1 0.05 ...

$ Elevation: num 346 109 114 46 77 119 93 168 71 112 ...

$ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...

$ Scruz : num 0.6 26.3 58.7 47.4 1.9 ...

$ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...

## Will remove the Endemics variable for easier analysis

> gala1 = gala[,-2]

STAT 526 Topic 8 9

Fitting the Full GLM> fit1 <- glm(Species~., family=poisson, data=gala1)

> summary(fit1)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***

Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***

Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***

Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***

Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***

Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3510.73 on 29 degrees of freedom

Residual deviance: 716.85 on 24 degrees of freedom

AIC: 889.68

Number of Fisher Scoring iterations: 5

STAT 526 Topic 8 10

Goodness of Fit Tests

Residual Deviance

Tests H0 : our model vs Ha: saturated modelDistribution poorly approximated by χ2

> pchisq(fit1$deviance,fit1$df.residual,lower.tail=FALSE)

[1] 7.073157e-136

Pearson X 2

Test H0 : our model vs Ha: saturated modelDistribution better approximated by χ2

Gets even better when Poisson approaches Normal

> pchisq( sum(residuals(fit1, type="pearson")^2 ), fit1$df.residual, lower.tail=FALSE)

[1] 2.18719e-145 # reject H0

Can look more into model diagnostics to find possiblereasons for poor fit

STAT 526 Topic 8 11

Randomized Quantile Residuals

Can again use these residuals to assess fit

Warning: Problem with extreme outliers / overdispersion> fit1res = qres.pois(fit1)

> fit1res

[1] -2.3704014 2.2765322 -5.5201747 0.7896027 -4.6562257

[6] -3.3413838 -1.3724017 -0.3180478 -4.6393914 -6.2867257

[11] Inf 0.5979665 8.2095362 -6.6106086 2.8081647

[16] -1.2164403 -0.5969734 -5.1684852 -8.2777086 -1.2047194

[21] -3.9679243 2.1827066 3.9796196 -7.4357526 7.9414445

[26] 0.1319390 Inf 1.1363113 -3.6259643 0.6910023

Obs #11: y = 97 and µ̂ = 27.86135

Obs #27: y = 285 and µ̂ = 158.1653

Can set Inf values to large value (e.g., 10)

STAT 526 Topic 8 12

Diagnosticsqplot(predict(fit1,type="link"),fit1res,geom=c(’point’),ylab="Residual")

fit1res1 = fit1res

fit1res1[is.infinite(fit1res1)] = 10

qqnorm(fit1res1)

abline(a=0,b=1,col="red")

STAT 526 Topic 8 13

Mean versus Variance

Model assumes mean and variance are the sameCan compare observed variance with µ̂ (log scale)plot(log(predict(fit1,type="response")),xlab="Estimated Mean",

log((gala1$Species-predict(fit1,type="response"))^2),

ylab="Approx Variance")

abline(a=0,b=1)

STAT 526 Topic 8 14

Quasi-Poisson

Goodness of fit tests and diagnostics suggest overdispersion

Similar to binomial, can estimate dispersion parameter φ

> phihat = sum(residuals(fit1,type="pearson")^2)/fit1$df.residual

> phihat

[1] 31.74914

> summary(fit1,dispersion=phihat)

Coefficients:


(Intercept) 3.1548079 0.2915897 10.819 < 2e-16 ***

Area -0.0005799 0.0001480 -3.918 8.95e-05 ***

Elevation 0.0035406 0.0004925 7.189 6.53e-13 ***

Nearest 0.0088256 0.0102621 0.860 0.390

Scruz -0.0057094 0.0035251 -1.620 0.105

Adjacent -0.0006630 0.0001653 -4.012 6.01e-05 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 31.74914)

STAT 526 Topic 8 15

Altering the Structural Form

Scatterplot of log(Species) versus predictors showpredictors highly skewed and relationship nonlinear

Consider model using the log of each predictor

STAT 526 Topic 8 16

Fitting the Full GLM

> fit2 <- glm(Species~., family=poisson, data=gala2)

> summary(fit2)

Coefficients:


(Intercept) 3.296705 0.285277 11.556 < 2e-16 ***

Area 0.349908 0.018011 19.428 < 2e-16 ***

Elevation 0.034825 0.057003 0.611 0.54125

Nearest -0.041491 0.013926 -2.979 0.00289 **

Scruz -0.030435 0.011573 -2.630 0.00854 **

Adjacent -0.089823 0.006925 -12.972 < 2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1




AIC: 533.22

STAT 526 Topic 8 17

Diagnostics

qplot(predict(fit1,type="link"),fit1res,geom=c(’point’),ylab="Residual")

qqnorm(fit2res)

abline(a=0,b=1,col="red")

STAT 526 Topic 8 18

Quasi-Poisson

Even with better form, overdispersion appears relevant

> phihat = sum(residuals(fit2,type="pearson")^2)/fit2$df.residual

> phihat

[1] 16.6475

> summary(fit2,dispersion=phihat)

Coefficients:


(Intercept) 3.29671 1.16397 2.832 0.00462 **

Area 0.34991 0.07349 4.762 1.92e-06 ***

Elevation 0.03482 0.23258 0.150 0.88098

Nearest -0.04149 0.05682 -0.730 0.46527

Scruz -0.03044 0.04722 -0.645 0.51923

Adjacent -0.08982 0.02825 -3.179 0.00148 **

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

(Dispersion parameter for poisson family taken to be 16.6475)

STAT 526 Topic 8 19

Summary

Substantial change in deviance using new structural form> anova(fit1,fit2)

Analysis of Deviance Table

Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent

Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent

Resid. Df Resid. Dev Df Deviance

1 24 716.85

2 24 360.39 0 356.45

Overdispersion, however, still present

Change in log(y) for a unit change in log(x) approximatespercent change in y for percent change in x (elasticity)

STAT 526 Topic 8 20

Rate Model: Motivation

Each Yi represents a diff interval in space or time

Number of events may depend on the size of the interval# of crimes in cities (population size)# of customers served by workers (time worked)# cars running red lights in different intersections (how busy)

Sometimes can consider problem in binomial setting butoften still better to model as Poisson

counts can be small as compared to the totalthe total trials may not be easily known

Goal: express a common effect of covariates on all Yi ,while accounting for differences in ’exposure’

’exposure’ needs to be defined by a variable

STAT 526 Topic 8 21

Rate Model: Formulation

Consider the following model:

Yiind∼ Pois(λi )

with

λi = exposurei exp{xiβ}

or

log(λi ) = log(exposurei ) + xiβ

This model is equivalent to using log exposure as apredictor in Poisson GLM with the coefficient fixed at 1

The log exposure variable is often called an offset in thiscontext

STAT 526 Topic 8 22

Example

Experiment to study the effect of gamma radiation on thenumbers of chromosomal abnormalities (ca)

Three dose amounts (in Grays)Nine dose rates (in Grays/hour)

The number of cells (cells, in hundreds) exposed to adose/rate combination varied across runs

Thus, cells can be viewed as the ’exposure’ variableCan have more than one abnormality per cell> library(faraway)

> head(dicentric)

cells ca doseamt doserate

1 478 25 1 0.10

2 1907 102 1 0.25

3 2258 149 1 0.50

4 2329 160 1 1.00

5 1238 75 1 1.50

6 1491 100 1 2.00

STAT 526 Topic 8 23

Exploring the Data

Outcome: proportion of cells with an abnormalityInteraction plot:> with(dicentric,interaction.plot(doseamt,doserate,ca/cells,legend=FALSE))

> legend(’topleft’,c(’0.1’,’0.25’,’0.5’,’1’,’1.5’,’2’,’2.5’,’3’,’4’),

lty=9:1,cex=0.8)

STAT 526 Topic 8 24

Exploring the Data

Interaction plot with numeric factor on x axis> with(dicentric,interaction.plot(doserate,log(doseamt),ca/cells,legend=FAL

> legend(’topleft’,c(’1’,’2.5’,’5’),lty=3:1,cex=0.6)

Can more easily see the change in slopes?

STAT 526 Topic 8 25

Exploring the Data

Plot with numeric factor on x axis> ggplot(dicentric,aes(x=lograte,y=ca/cells,col=as.factor(doseamt)))+

geom_point()+geom_smooth(method="lm", se=F)+

theme(legend.position = "none")

STAT 526 Topic 8 26

Model Without an Offset

The coefficient of log(cells) is close to 1 (offset reasonable)

> dicentric$doseF <- factor(dicentric$doseamt)

> fit1 = glm(ca ~ log(cells) + log(doserate)*doseF,

family=poisson, data=dicentric)

> summary(fit1)

Coefficients:


(Intercept) -2.76534 0.38116 -7.255 4.02e-13 ***

log(cells) 1.00252 0.05137 19.517 < 2e-16 ***

log(doserate) 0.07200 0.03547 2.030 0.042403 *

doseF2.5 1.62984 0.10273 15.866 < 2e-16 ***

doseF5 2.76673 0.12287 22.517 < 2e-16 ***

log(doserate):doseF2.5 0.16111 0.04837 3.331 0.000866 ***

log(doserate):doseF5 0.19316 0.04299 4.493 7.03e-06 ***




AIC: 211.15

STAT 526 Topic 8 27

GLM Model With an Offset

Very little difference in estimates here

> fit2 <- glm(ca ~ offset(log(cells)) + log(doserate)*doseF,

family=poisson, data=dicentric)

> summary(fit2)

Coefficients:


(Intercept) -2.74671 0.03426 -80.165 < 2e-16 ***

log(doserate) 0.07178 0.03518 2.041 0.041299 *

doseF2.5 1.62542 0.04946 32.863 < 2e-16 ***

doseF5 2.76109 0.04349 63.491 < 2e-16 ***






AIC: 209.16

STAT 526 Topic 8 28

Alternative Approach I

Consider LM using proportions as response> propca = dicentric$ca / dicentric$cells

> fit3 = lm(propCA ~ log(doserate)*doseF, data=dicentric)

> summary(fit3)

Coefficients:

Estimate Std.Error tvalue Pr(>|t|)

(Intercept) 0.063489 0.019528 3.251 0.00382 **

log(doserate) 0.004573 0.016692 0.274 0.78680

doseF2.5 0.276315 0.027616 10.005 1.92e-09 ***

doseF5 1.004119 0.027616 36.359 < 2e-16 ***

log(doserate):doseF2.5 0.063933 0.023606 2.708 0.01317 *


Residual standard error: 0.05858 on 21 degrees of freedom

Multiple R-squared: 0.9874,Adjusted R-squared: 0.9844

F-statistic: 330 on 5 and 21 DF, p-value: < 2.2e-16

Coefficients defined on data scale

STAT 526 Topic 8 29

Alternative Approach I

Model assumes Normality and constant variance

Residual plot: evidence of unequal varianceClearly incorrect model choice> plot(fit3$fitted,fit3$residuals)

> abline(h=0)

STAT 526 Topic 8 30

Alternative Approach II

Consider LM using log(proportion) as response> fit4 <- lm(log(propca) ~ log(doserate)*doseF,data=dicentric)

> summary(fit4)

Coefficients:

Estimate Std.Error tvalue Pr(>|t|)

(Intercept) -2.76243 0.03352 -82.402 < 2e-16 ***

log(doserate) 0.07561 0.02866 2.638 0.015364 *

doseF2.5 1.64378 0.04741 34.672 < 2e-16 ***

doseF5 2.77866 0.04741 58.610 < 2e-16 ***



...

Residual standard error: 0.1006 on 21 degrees of freedom

Multiple R-squared: 0.9943,Adjusted R-squared: 0.9929

F-statistic: 729.4 on 5 and 21 DF, p-value: < 2.2e-16

Analogous to using offset in Poisson regression

Parameter estimates similar to those in Poisson GLM

SE are more optimistic than in Poisson regression

STAT 526 Topic 8 31

Alternative Approach II

Model still assumes Normality and constant variance

Residual plot: evidence of unequal variance?May be ok choice but Poisson GLM preferred> plot(fit4$fitted,fit4$residuals)

> abline(h=0)

STAT 526 Topic 8 32

Pairwise Comparisons

When interaction, often compare levels of A at fixed B

Example 1: µdose5,doserate1 vs µdose2.5,doserate1

Using Poisson GLM with offset

> library(gmodels)

> contrast.function <- c(0, 0, -1, 1, 0, 0)

> names(contrast.function) <- names(coef(fit2))

> contrast.function

(Intercept) log(doserate) doseF2.5 doseF5 1 log(1) 0 1 0 log(1)

0 0 -1 1 - 1 log(1) 1 0 log(1) 0

log(doserate):doseF2.5 log(doserate):doseF5

0 0

> test.result <- estimable(fit2, contrast.function)

> test.result

Estimate Std. Error X^2 value DF Pr(>|X^2|)

(0 0 -1 1 0 0) 1.135667 0.04460504 648.2372 1 0

STAT 526 Topic 8 33

Pairwise Comparisons

Example 2: µdose5,doserate3 vs µdose2.5,doserate3

> contrast.function <- c(0, 0, -1, 1, -1.0986, 1.0986)

> names(contrast.function) <- names(coef(fit2)) 1 log(3) 0 1 0 log(3)

> test.result <- estimable(fit2, contrast.function) - 1 log(3) 1 0 log(3) 0

> test.result

Estimate Std. Error X^2 value DF Pr(>|X^2|)

(0 0 -1 1 -1.0986 1.0986) 1.171129 0.05390965 471.9289 1 0

Difference is larger (expected because of interaction)

Note that comparisons are made on the link scale

log(µ̂ij)− log(µ̂i ′j) = log(µ̂ij/µ̂i ′j) 6= log(µ̂ij − µ̂i ′j)

STAT 526 Topic 8 34

Grouped versus Ungrouped Analysis

Since Yiind∼ Pois(λi), the aggregate

∑

i

Yi ∼ Pois(∑

i

λi)

As in binomial/Bernoulli response, log-likelihood only involvessums of yi with same covariate patternsIn previous example, could add total cells and counts ofabnormalities for entries with same doseamt and doserate

Rate models: individual vs groupeddifferent deviancessame parameter estimatessame comparison of nested models

Models with no offset: indiv. vs groupeddifferent deviancessame parameters except interceptsame comparison of nested models

STAT 526 Topic 8 35

Negative Binomial Distribution

Typical motivationIndependent trials with P(success) = p

Z= the number of trials until kth successZ ∼ NB(p, k), Z = k , k + 1, . . .

A more useful motivationThe distribution of Y |λ ∼ Pois(λ) and λ ∼ G (k , α)

Probability distribution

P(Z = z) =

(

z − 1

k − 1

)

pk(1− p)z−k z = k , k + 1, . . .

or

P(Y = y) =

(

y + k − 1

k − 1

)(

α

1 + α

)y (1

1 + α

)k

y = 0, 1, . . .

Using parameterization Y = Z − k and p = (1 + α)−1

STAT 526 Topic 8 36

Negative Binomial Distribution

Under this reparameterizationE (Y ) = k(1− p)/p = kα = µVar(Y ) = k(1− p)/p2 = kα(1 + α) = µ(1 + µ/k)Variance approaches Poisson variance as k → ∞

Log-likelihood is

l =

n∑

i=1

(

yi logαi

αi + 1− k log(1 + αi )

)

+ constant

=n

∑

i=1

(

yi logµi

k + µi

+ k logk

k + µi

)

+ constant

Consider k as fixed or additional parameter to estimate

Use log (µi/(k + µi)) = xiβ as link function in GLM

STAT 526 Topic 8 37

Return to Galapagos Islands Study

Can fix k based on exploratory plots of mean-variancerelationship

R: Use glm (link=log) in the MASS package

> fit.nb <- glm(Species~ log(Area)+log(Adjacent),

family=negative.binomial(1), data=gala)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.27257 0.15304 21.384 < 2e-16 ***

log(Area) 0.35100 0.03773 9.304 6.52e-10 ***

log(Adjacent) -0.03204 0.04015 -0.798 0.432

...

(Dispersion parameter for Negative Binomial(1) family

taken to be 0.4650222)



AIC: 292.97

STAT 526 Topic 8 38


Given dispersion parameter < 1, k too small

Could try various k or estimate using ML (glm.nb)

> fit.nb1 <- glm.nb(Species~ log(Area)+log(Adjacent),data=gala)

Coefficients:


(Intercept) 3.27777 0.14495 22.613 <2e-16 ***

log(Area) 0.34973 0.03541 9.875 <2e-16 ***

log(Adjacent) -0.03316 0.03737 -0.887 0.375

(Dispersion parameter for Negative Binomial(2.6196)

family taken to be 1)



AIC: 284.99

Theta: 2.620

Std. Err.: 0.753

STAT 526 Topic 8 39


Not a big difference in estimates and SEs across the two fits

CI for Theta= k almost includes 1

Poisson limiting case of NB when k → ∞

Thus, test H0 : φ = 1/k = 0 vs Ha : φ > 0

One-sided hypothesisUnder H0, φ is on the boundary of supportVerbeke and Molenberghs (2000):

LR = 2 [logLik(NB model)− logLik(Poisson model)]

∼1

2χ20 +

1

2χ21

STAT 526 Topic 8 40


For this study, the two model fits result in

logLik(NB) = 32.741logLik(Poisson) = 395.54

2 [395.54− 32.741] = 362.799 >1

2χ20(0.95) +

1

2χ21(0.95)

= 1.920729

Conclusion: reject H0

Caution: quality of approximation can be poor

STAT 526 Topic 8 41

Zero-Inflated Models

Again refers to there being an excess of zero counts thatcannot be adequately modeled using a dispersionparameter

Will consider both zero-inflated Poisson (ZIP) andzero-inflated negative-binomial (ZINB) models

Can think of model structure in two waysAdditional point mass at zero

P(Y = 0) = p∗i + (1− p∗i )gi (0)

P(Y = y) = (1− p∗i )gi (y) y > 0

Hurdle model

P(Y = 0) = f1(0)

P(Y = y) =1− f1(0)

1− f2(0)f2(y) y > 0

STAT 526 Topic 8 42

Hurdle Model

Consider a latent variable Zi that must exceed some valuefor these to be a non-zero value

This score often called a hurdle

Consider using binary model for this part

Must use a rescaled truncated distribution for second part

In R: hurdle function in pscl packagedist=c(“poisson”,”negbin”,”geometric”)zero.dist = c(“binomial”, “poisson”,”negbin”,”geometric”)link = c(“logit”,”probit”,”cloglog”,”cauchit”,”log”)

STAT 526 Topic 8 43

Mixture Model

Already discussed this model in relation to the binomial

Consider that there is a proportion p∗ that will alwaysrespond with a zero

Remaining 1− p∗ follow specified distribution

In R: zeroinfl function in pscl packagedist=c(“poisson”,”negbin”,”geometric”)link = c(“logit”,”probit”,”cloglog”,”cauchit”,”log”)Model specified using y ∼ x1 + x2 | z1 + z2 + z3

STAT 526 Topic 8 44

Example

915 biochemistry graduate studentsOutcome : Number of articles produced in last 3 yrs ofPhD> mod = glm(art ~ ., data=bioChemists,family=poisson)

> summary(mod)

Coefficients:


(Intercept) 0.304617 0.102981 2.958 0.0031 **

femWomen -0.224594 0.054613 -4.112 3.92e-05 ***

marMarried 0.155243 0.061374 2.529 0.0114 *

kid5 -0.184883 0.040127 -4.607 4.08e-06 ***

phd 0.012823 0.026397 0.486 0.6271

ment 0.025543 0.002006 12.733 < 2e-16 ***



AIC: 3314.1

STAT 526 Topic 8 45

Example: Hurdle Model> library(pscl)

> modh = hurdle(art~.,data=bioChemists)

> summary(modh)

Count model coefficients (truncated poisson with log link):


(Intercept) 0.67114 0.12246 5.481 4.24e-08 ***

femWomen -0.22858 0.06522 -3.505 0.000457 ***

marMarried 0.09649 0.07283 1.325 0.185209

kid5 -0.14219 0.04845 -2.934 0.003341 **

phd -0.01273 0.03130 -0.407 0.684343

ment 0.01875 0.00228 8.222 < 2e-16 ***

Zero hurdle model coefficients (binomial with logit link):


(Intercept) 0.23680 0.29552 0.801 0.4230

femWomen -0.25115 0.15911 -1.579 0.1144

marMarried 0.32623 0.18082 1.804 0.0712 .

kid5 -0.28525 0.11113 -2.567 0.0103 *

phd 0.02222 0.07956 0.279 0.7800

ment 0.08012 0.01302 6.155 7.52e-10 ***

Number of iterations in BFGS optimization: 12

Log-likelihood: -1605 on 12 Df

STAT 526 Topic 8 46

Example: Mixture Model> modzi = zeroinfl(art~.,data=bioChemists)

> summary(modzi)

Count model coefficients (poisson with log link):


(Intercept) 0.640838 0.121307 5.283 1.27e-07 ***

femWomen -0.209145 0.063405 -3.299 0.000972 ***

marMarried 0.103751 0.071111 1.459 0.144565

kid5 -0.143320 0.047429 -3.022 0.002513 **

phd -0.006166 0.031008 -0.199 0.842378

ment 0.018098 0.002294 7.888 3.07e-15 ***

Zero-inflation model coefficients (binomial with logit link):


(Intercept) -0.577059 0.509386 -1.133 0.25728

femWomen 0.109746 0.280082 0.392 0.69518

marMarried -0.354014 0.317611 -1.115 0.26502

kid5 0.217097 0.196482 1.105 0.26919

phd 0.001274 0.145263 0.009 0.99300

ment -0.134114 0.045243 -2.964 0.00303 **

Number of iterations in BFGS optimization: 21

Log-likelihood: -1605 on 12 Df

STAT 526 Topic 8 47

Summary

Very difficult to choose between the two

Be careful interpreting modelsHurdle: probability of a positive countMixture: probability of a zero count

Bland-Altman plot better describes prediction differences

STAT 526 Topic 8 48

Documents

Modeling a Count Response - Purdue Universitybacraig/notes526/topic8a.pdf · Poisson Distribution Let Y represent the count of events (e.g., successes) that occur in some ﬁxed interval