
Application of the hyper-Poisson generalized linear model for analyzing motor vehicle crashes

S. Hadi Khazraee1

Graduate Research Assistant Zachry Department of Civil Engineering

Texas A&M University Tel. (979) 845-6003

Email: [email protected]

Antonio Jose Sáez-Castillo, Ph.D. Associate Professor

Department of Statistics and Operations Research University of Jáen, Spain

Tel. +34 953648578 Email: [email protected]

Srinivas Reddy Geedipally, Ph.D., P.E. Assistant Research Engineer

Texas A&M Transportation Institute Texas A&M University System

Tel. (817) 462-0519 Email: [email protected]

Dominique Lord, Ph.D., P.Eng. Associate Professor

Zachry Department of Civil Engineering Texas A&M University

Tel. (979) 458-3949 Email: [email protected]

1 Corresponding author

Application of the hP GLM for crash data modeling


ABSTRACT

The hyper-Poisson distribution can handle both over- and under-dispersion, and its

generalized linear model formulation allows the dispersion of the distribution to be observation-

specific and dependent on model covariates. This study’s objective is to examine the potential

applicability of a newly proposed generalized linear model framework for the hyper-Poisson

distribution in analyzing motor vehicle crash count data. The hyper-Poisson generalized

linear model was first fitted to the intersection crash data from Toronto, characterized by over-

dispersion, and then to the crash data from railway-highway crossings in Korea, characterized by

under-dispersion.

The results of this study are promising. When fitted to the Toronto data set, the goodness-of-

fit measures indicated that the hyper-Poisson model with a variable dispersion parameter

provided a statistical fit as good as the traditional negative binomial model. The hyper-Poisson

model was also successful in handling the under-dispersed data from Korea; the model

performed as well as the gamma probability model and the Conway-Maxwell-Poisson model

previously developed for the same data set.

The advantages of the hyper-Poisson model studied in this paper are noteworthy. Unlike the negative binomial model, which has difficulties in handling under-dispersed data, the hyper-Poisson model can handle both over- and under-dispersed crash data. Moreover, the effect of each variable on the expected number of crashes is easily interpretable in this new model, although interpretation is not a major issue for the Conway-Maxwell-Poisson model either.

Keywords: hyper-Poisson, under-dispersion, dispersion parameter



1. INTRODUCTION

Motor vehicle crash count data are often characterized by over-dispersion, meaning that the

variance of crash counts on a roadway entity is greater than the mean. It is however possible,

although rare, to find crash datasets with under-dispersion, i.e., variance lower than the mean(1),

especially in crash data with low sample means(2). The most commonly used distribution in crash

count data modeling, the negative binomial (NB)/Poisson-gamma, can only accommodate over-

dispersion and will have convergence issues and produce incorrect parameter estimates while

modeling under-dispersed data(1).

Researchers in various fields have proposed numerous alternative models to handle under-

dispersed count data. For instance, the generalized Poisson(3), the weighted Poisson(4), and the

Poisson polynomial(5) models are all extensions of the Poisson model that can handle both over-

and under-dispersed count data.

Of all the models capable of handling both over- and under-dispersion, the Conway-

Maxwell-Poisson distribution (COM-Poisson) has probably gained the most attention, especially

in highway safety. The COM-Poisson distribution was first introduced by Conway and

Maxwell(6) for modeling queues and service rates, and later explored by Shmueli et al.(7) for its

statistical properties(1). The COM-Poisson generalized linear model (GLM) has been applied to

crash data by Lord et al.(8; 9), and Geedipally and Lord(10). Several studies have found both the

COM-Poisson distribution and its regression model to be very flexible in dealing with count data

with a wide range of characteristics(e.g. 11; 12). Despite its flexibility for modeling count data,

Francis et al.(13) have warned about the limitation of COM-Poisson GLM in dealing with over-

dispersed data sets with low sample mean values.



Another approach used to handle under-dispersion in crash count data modeling is the

gamma probability distribution. This approach has been used with two different

parameterizations. The first parameterization, proposed by Winkelmann(14) and applied to crash

data first by Oh et al.(2), assumes that the time elapsed between each two successive crashes

(waiting time) follows a gamma distribution. This approach implies that crash events are

“dependent in the sense that the occurrence of at least one event (in contrast to none) up to time t

influences the probability of a further occurrence in t+∆t”(14). Nonetheless, while crash counts

can sometimes have a temporal correlation, they are often described as independent

observations(15). Recently, Daniels et al.(16; 17) used a different parameterization of the gamma

model in which they assumed that the crash frequency itself follows a continuous gamma density

function. Two major theoretical shortcomings exist for this assumption: it implies that crash

counts of zero are not possible, and that non-integer crash counts may be observed (15). Both

implications are obviously fallacious.

The final model worth mentioning is the double-Poisson distribution model proposed by Efron et al.(18). Although this model is not very popular among researchers, Zou et al.(15) applied it to crash count data and found it to be flexible. Nonetheless, they noticed that the distribution does not handle under-dispersion as reliably as it does over-dispersion.

Very recently, Saez-Castillo and Conde-Sanchez(19) formulated a generalized linear model

(GLM) framework for a two-parameter generalization of the Poisson distribution, called the

hyper-Poisson distribution(20). The primary objective of this study is to examine the potential

application of the hyper-Poisson GLM in the field of highway safety to model crash count data.

The hyper-Poisson distribution can handle both under- and over-dispersion. In addition, the

regression model examined in this study allows the dispersion of the distribution to vary among



observations. Such observation-specific dispersion structure for crash counts on roadway entities

is consistent with the findings of recent research in highway safety. A handful of studies have

addressed shortcomings in the assumption of fixed dispersion among all observations and have

suggested that the model dispersion can potentially depend on the covariates(e.g. 21; 22; 23). Mitra

and Washington(24) advised that the observation-specific structure can be especially important

when the mean function is misspecified, such as in models where the mean only depends on the

entering traffic flow. In the hyper-Poisson model, the covariates enter the mean function at the

same time that they influence the dispersion of the distribution. The dual link structure of the hyper-Poisson GLM is similar to that suggested by Guikema and Coffelt(25) for the COM-Poisson regression model.

In this study, the hyper-Poisson (hP) GLM is first fitted to crash data from the signalized

intersections in Toronto to examine the model performance in handling over-dispersed count

data. The objective is to ensure that the model can provide an adequate fit to the majority of

crash count data sets which are characterized by over-dispersion. The modeling results for

Toronto data are compared to those obtained by the NB GLM. The hP model is also fitted to a

data set from railway-highway crossings (RHXs) in Korea which is characterized by under-

dispersion. For this data set, the modeling results are compared to those for gamma probability

distribution from Oh et al.(2) and COM-Poisson GLM from Lord et al.(9).

2. BACKGROUND

This section describes the characteristics of the hP distribution and the corresponding

generalized linear regression model. The first part discusses the hP distribution and its

characteristics and the second part describes an extension of the distribution to model crash

frequency data.



2.1. Hyper-Poisson Distribution

Bardwell and Crow(20) derived a two-parameter generalization of the Poisson distribution.

They called the proposed distribution the hyper-Poisson (hP, hereafter) family because it

turned out to be a subclass of the three-parameter hypergeometric series distribution and reduced

to the Poisson distribution in a special case. Using the original notations, the probability mass

function (pmf) of the hP distribution with parameters θ1 and θ2 is stated as follows:

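$$f(Y = y \mid \theta_1, \theta_2) = \frac{1}{{}_1F_1(1; \lambda; \theta_2)} \, \frac{\Gamma(\lambda)}{\Gamma(\lambda + y)} \, \theta_2^{\,y} \tag{1}$$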

0112 (2)

$${}_1F_1(1; \lambda; \theta_2) = \sum_{r=0}^{\infty} \frac{\Gamma(\lambda)}{\Gamma(\lambda + r)} \, \theta_2^{\,r} \tag{3}$$

where Y is the response variable (discrete crash count in this study), θ2 is the location parameter, λ is defined as the dispersion parameter, and 1F1(1; λ; θ2) is the confluent hypergeometric function with first argument equal to 1(26). If λ = 1, the distribution reduces to the Poisson (with variance equal to the mean); λ > 1 results in an over-dispersed distribution, “super-Poisson”, whereas λ < 1 produces an under-dispersed distribution, “sub-Poisson”(20).
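As an illustration of the pmf and of the Poisson special case, the following is a minimal sketch in Python using only the standard library. It assumes the pmf has the Gamma-function form implied by the recurrence and log-likelihood given elsewhere in the paper; the function names `hyp1f1_one`, `hp_pmf`, and `poisson_pmf` and the parameter values are hypothetical choices made for this example, and the series is summed term by term rather than with a library routine.

```python
import math

def hyp1f1_one(lam, theta, tol=1e-15, max_terms=100_000):
    """Series for 1F1(1; lam; theta): sum over r of Gamma(lam)/Gamma(lam+r) * theta**r."""
    term, total = 1.0, 1.0          # the r = 0 term equals 1
    for r in range(max_terms):
        term *= theta / (lam + r)   # ratio of term r+1 to term r
        total += term
        if term <= tol * total:     # remaining terms are negligible
            break
    return total

def hp_pmf(y, lam, theta):
    """hP probability of count y, evaluated on the log scale for stability."""
    log_p = (math.lgamma(lam) - math.lgamma(lam + y)
             + y * math.log(theta) - math.log(hyp1f1_one(lam, theta)))
    return math.exp(log_p)

def poisson_pmf(y, mean):
    """Ordinary Poisson probability, used only as a reference."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))

# Hypothetical parameter values: lam < 1 is the under-dispersed case.
under = [hp_pmf(y, 0.5, 2.0) for y in range(60)]
print("total probability:", sum(under))
print("lam = 1 vs Poisson at y = 3:", hp_pmf(3, 1.0, 2.0), poisson_pmf(3, 2.0))
```

With λ = 1 the two printed probabilities should agree to rounding error, which mirrors the reduction to the Poisson distribution described above.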

It can be verified from Equation (1) that the hP distribution satisfies the following recurrence

condition:

$$(\lambda + y)\, f_{y+1} = \theta_2\, f_y \tag{4}$$

Summing Equation (4) over all y’s yields the following expression for the mean (µ):

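$$\mu = \theta_2 - (\lambda - 1)(1 - f_0) \tag{5}$$

$$\mu = \theta_2 - (\lambda - 1)\, \frac{{}_1F_1(1; \lambda; \theta_2) - 1}{{}_1F_1(1; \lambda; \theta_2)} \tag{6}$$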



It is clear from Equation (6) that when λ = 1, the location parameter θ2 matches the mean. In

this case, Equation (2) suggests θ1 = θ2 and Equation (3) yields 1F1(1; λ; θ2) = exp(θ2), so the distribution in Equation (1) reduces to the Poisson with the mean θ2. However, as indicated by

Equation (6), θ2 is not equal to the mean in any other case. The mean and θ2 can become

significantly different as λ deviates from 1.
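The step from the recurrence condition to the expression for the mean can be spelled out as follows. This is a sketch of the summation argument, writing f_y for the probability of a count equal to y and using the fact that the probabilities sum to one.

```latex
\begin{align*}
\sum_{y=0}^{\infty} (\lambda + y)\, f_{y+1}
  &= \theta_2 \sum_{y=0}^{\infty} f_y = \theta_2, \\
\sum_{y=0}^{\infty} (\lambda + y)\, f_{y+1}
  &= \sum_{y=0}^{\infty} \bigl[(\lambda - 1) + (y + 1)\bigr] f_{y+1}
   = (\lambda - 1)(1 - f_0) + \mu, \\
\Rightarrow \quad \mu
  &= \theta_2 - (\lambda - 1)(1 - f_0),
  \qquad f_0 = \frac{1}{{}_1F_1(1; \lambda; \theta_2)} .
\end{align*}
```

Setting λ = 1 removes the second term, which is why the location parameter coincides with the mean only in the Poisson case.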

Equation (6) provides an explicit expression of the mean in terms of θ2 and λ. Nonetheless, θ2

and λ cannot be directly expressed in terms of the mean and the other parameter because they

appear as the arguments of the hypergeometric series, which does not have an explicit inverse

function. This will give rise to a major computational difficulty in regression modeling as

described later in this section.

From Equation (4) and using the method of moments, the following relationship between the

distribution variance (σ²) and mean (µ) is obtained(19):

$$\sigma^2 = \theta_2 + \bigl(\theta_2 - (\lambda - 1)\bigr)\,\mu - \mu^2 \tag{7}$$
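The variance-mean relationship can be checked numerically. The sketch below, in standard-library Python, builds the hP probabilities directly from the recurrence condition, computes the mean and variance by summation, and compares the variance with the closed-form expression; the function names and the parameter values (λ = 0.4 and λ = 3 with θ2 = 3) are hypothetical and chosen only for illustration.

```python
def hp_probs(lam, theta, y_max=400):
    """hP probabilities f_0..f_ymax from the recurrence (lam + y) f_{y+1} = theta f_y."""
    weights = [1.0]
    for y in range(y_max):
        weights.append(weights[-1] * theta / (lam + y))
    total = sum(weights)            # truncated series for 1F1(1; lam; theta)
    return [w / total for w in weights]

def moments(probs):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

def variance_from_mean(mean, lam, theta):
    """Closed-form variance: theta + (theta - (lam - 1)) * mean - mean**2."""
    return theta + (theta - (lam - 1.0)) * mean - mean ** 2

# Hypothetical parameter values: one under-dispersed, one over-dispersed case.
results = {}
for lam in (0.4, 3.0):
    mean, var = moments(hp_probs(lam, 3.0))
    results[lam] = (mean, var, variance_from_mean(mean, lam, 3.0))
    print(f"lam={lam}: mean={mean:.4f}, variance={var:.4f}")
```

In both cases the summed variance and the closed-form value should agree to rounding error, with the variance below the mean for λ < 1 and above it for λ > 1.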

A comparison between the hP distribution variance, as shown above, and that from the NB

distribution would be interesting. The relationship between the variance and the mean in the

negative binomial distribution is stated below:

$$\sigma^2 = \mu + \alpha \mu^2 \tag{8}$$

where α is the over-dispersion parameter. A negative estimate of α is indicative of under-dispersion (σ² < µ). However, the NB model is inappropriate for modeling under-dispersed data because the estimated variance will be negative for observations with α < -1/µi(27).
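The threshold on α follows directly from factoring the NB variance function. The short worked example below uses hypothetical values (µi = 2, α = -0.6) purely to show a case in which the implied variance is negative.

```latex
\begin{align*}
\sigma_i^2 = \mu_i + \alpha \mu_i^2 = \mu_i (1 + \alpha \mu_i) < 0
  \;&\Longleftrightarrow\; \alpha < -\frac{1}{\mu_i} \qquad (\mu_i > 0), \\
\text{e.g. } \mu_i = 2,\ \alpha = -0.6:\quad
  \sigma_i^2 &= 2 + (-0.6)(2)^2 = -0.4 .
\end{align*}
```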

In this paper, for the sake of convenience in comparing the NB and hP distributions, α is

referred to as the “dispersion parameter” of the NB distribution. As Equation (8) indicates for the

NB distribution, the coefficient of the second-degree term in the variance function is allowed to



vary, whereas in the hP distribution variance function it is the coefficient of the first-degree term of the mean that can vary, while the second-degree coefficient is fixed at -1 (see Equation 7).

This allows for higher flexibility of the NB distribution, compared to the hP distribution, to deal

with highly over-dispersed data sets, as demonstrated later in the results section.

Furthermore, Saez-Castillo and Conde-Sanchez(19) showed how the over-dispersed case of

the hP distribution (i.e., when λ > 1) can be viewed as a compound Poisson distribution whose mean follows a confluent hypergeometric distribution. The interested reader is referred to their work for

the derivation. This finding provides an interpretational basis for application of the hP

distribution and regression model to crash data; crash counts are Poisson distributed with a mean

which itself follows a probability distribution (confluent hypergeometric in this case) to account

for the heterogeneity among the individual entities (sites). Indeed, the confluent hypergeometric

error term captures the variation in the mean caused by the factors not accounted for by the

model. Hence, in the over-dispersion context, the hP distribution is comparable to other

compound Poisson distributions, such as the negative binomial/Poisson-gamma distribution.

2.2. Generalized Linear Model

Saez-Castillo and Conde-Sanchez(19) developed an hP GLM framework to model discrete

count data. In this approach, both the mean and the dispersion parameter of the hP distribution

can depend on the covariates. Denoting Yi as the observed crash count at site i, the GLM

assumes that Yi follows an hP distribution with the mean and dispersion parameter stated as

below:

$$\ln(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \tag{9}$$

$$\ln(\lambda_i) = \delta_0 + \sum_{k=1}^{q} \delta_k z_{ik} \tag{10}$$



where, xij’s and zik’s are the covariates used to estimate the mean and dispersion parameter of

observation i, respectively, and βj’s and δk’s are the regression parameters to be estimated by the

model. The p covariates used to estimate the mean are not necessarily identical to the q

covariates used to estimate the dispersion parameter.

This study adopted the GLM as formulated above to model motor vehicle crashes. The dual

link structure of the hyper-Poisson GLM is similar to that suggested by Guikema and Coffelt(25)

for the COM-Poisson regression model. The first link function, in Equation (9), describes the

mean as a function of covariates. The covariate-dependent mean function allows for inference

about the influence of the changes in the covariates on the expected number of crashes (µ). The

same would not be possible had the location parameter instead been modeled as a function of the covariates. Given the estimated values of µ and λ, the location parameter (θ2) can be determined by solving Equation (6) numerically. The variance of each observation can then be determined by Equation (7).
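Recovering θ2 from µ and λ is a one-dimensional root-finding problem, sketched below in standard-library Python. The sketch uses simple bisection, which is an assumption of this example and not necessarily the routine used by the authors; it relies on the mean being increasing in θ2 and on the bracket 0 ≤ θ2 ≤ µ + max(λ - 1, 0), which follows from the expression for the mean because 0 ≤ 1 - f0 ≤ 1. All function names are hypothetical.

```python
def hyp1f1_one(lam, theta, tol=1e-15, max_terms=100_000):
    """Series for 1F1(1; lam; theta), summed term by term."""
    term, total = 1.0, 1.0
    for r in range(max_terms):
        term *= theta / (lam + r)
        total += term
        if term <= tol * total:
            break
    return total

def hp_mean(theta, lam):
    """Mean as a function of the location and dispersion parameters."""
    f0 = 1.0 / hyp1f1_one(lam, theta)
    return theta - (lam - 1.0) * (1.0 - f0)

def solve_theta(mu, lam, tol=1e-13):
    """Bisection for the theta whose implied mean equals mu."""
    lo, hi = 0.0, mu + max(lam - 1.0, 0.0)   # bracket derived from the mean formula
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if hp_mean(mid, lam) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def hp_variance(mu, lam):
    """Variance implied by a given mean and dispersion parameter."""
    theta = solve_theta(mu, lam)
    return theta + (theta - (lam - 1.0)) * mu - mu ** 2

# Hypothetical values of the mean and dispersion parameter.
for lam in (0.5, 1.0, 2.0):
    print(lam, solve_theta(3.0, lam), hp_variance(3.0, lam))
```

For λ = 1 the solver should return θ2 equal to the mean and a variance equal to the mean, as expected for the Poisson case.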

The second link function of the GLM, in Equation (10), is added to increase the flexibility of

the distribution and enable analysis of data with potential over- or under-dispersion depending on

the values of the covariates. As mentioned earlier in the introduction, there are notable

advantages in allowing the dispersion characteristic of the crash count distribution to depend on

the covariates.

3. METHODOLOGY

This section describes the methodology used to fit the hP regression models to the crash data.

The first part presents the functional form of each model, and the second part describes the

procedure adopted to estimate the models.

3.1. Model Functional Form

3.1.1. Toronto data



For the Toronto intersection crash data, the following common and simple functional form

was adopted:

$$\mu_i = \beta_0 \, F_{Maj\_i}^{\beta_1} \, F_{Min\_i}^{\beta_2} \tag{11}$$

where FMaj_i and FMin_i denote the average annual daily traffic (AADT) on the major and minor

approach to the intersection, respectively. Such a flow-only crash model for intersections is

consistent with the base safety prediction models suggested by the Highway Safety Manual(28)

and also with several other studies that have modeled the Toronto dataset in the past(e.g., 23; 8).
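One way to see how this multiplicative form fits the log-linear GLM is to take logarithms of both sides. Under this reading, which is an interpretation added here for illustration, the covariates are the logarithms of the two traffic flows and the GLM intercept corresponds to the logarithm of the leading coefficient.

```latex
\begin{equation*}
\ln(\mu_i) = \ln(\beta_0) + \beta_1 \ln\bigl(F_{Maj\_i}\bigr) + \beta_2 \ln\bigl(F_{Min\_i}\bigr)
\end{equation*}
```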

The hP GLM was applied to the Toronto data in two steps: first, with a constant dispersion

parameter (i.e., λi = δ0 for all i), and next, with an observation-specific dispersion parameter.

The observation-specific structure is especially important here because the mean function is

misspecified, since the mean is allowed to depend on entering traffic flows only. The dispersion

parameter has the following form:

$$\lambda_i = \delta_0 \, F_{Maj\_i}^{\delta_1} \, F_{Min\_i}^{\delta_2} \tag{12}$$

This was done to evaluate the improvement in fit when the dispersion parameter is allowed

to vary depending on the covariates. The hP model results were compared to those obtained by

using the NB GLMs (and the maximum likelihood method for model estimation) with a fixed

and a variable dispersion parameter. When variable, the dispersion parameter of the NB model

(αi) followed a similar functional form as in Equation (12):

$$\alpha_i = \delta_0 \, F_{Maj\_i}^{\delta_1} \, F_{Min\_i}^{\delta_2} \tag{13}$$

3.1.2. Korea RHX data

For the Korea RHX data, the objective was to compare the hP model fit mainly with that

obtained by the gamma probability model, documented by Oh et al.(2), and COM-Poisson GLM,



documented by Lord et al.(9). An interested reader may refer to their work for background

information on the COM-Poisson and gamma probability models. The same functional form for

the expected number of crashes was therefore used here:

$$\mu_i = \beta_0 \, F_i^{\beta_1} \exp\!\left( \sum_{j=2}^{n} \beta_j x_{ij} \right) \tag{14}$$

where Fi is the average daily vehicle traffic (ADT) on site i, and xij is the covariate j at site i.

Various functional forms with different variables were evaluated to model the dispersion

parameter but none of them were found to be significant. This supports the previous finding that

since the functional form describing the mean function contains several covariates, the varying

dispersion parameter is not needed(24).

3.2. Model Estimation

The GLMs in this study were estimated using the method of maximum likelihood. The goal

was to find the set of βj and δk parameters that would maximize the joint likelihood (or log-

likelihood, equivalently) of observations y1,…, yn. From Equation (1), the log-likelihood function

is:

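$$\log L(y_1, \ldots, y_n) = \sum_{i=1}^{n} \log\bigl(\Gamma(\lambda_i)\bigr) + \sum_{i=1}^{n} y_i \log(\theta_{2i}) - \sum_{i=1}^{n} \log\bigl(\Gamma(\lambda_i + y_i)\bigr) - \sum_{i=1}^{n} \log\bigl({}_1F_1(1; \lambda_i; \theta_{2i})\bigr) \tag{15}$$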

The optimization was carried out using an iterative procedure evaluating the log-likelihood

function at different combinations of βj’s and δk’s until the maximum log-likelihood was

reached. Nevertheless, as Equation (15) indicates, the log-likelihood function depends on θ2i and

λi, while we model µi and λi as a function of covariates. θ2i in Equation (15) must therefore be

replaced with its expression in terms of µi. As specified earlier, no closed form expression exists

for θ2i. Consequently, evaluation of the log-likelihood function at each iteration required solving



the nonlinear Equation (6) to find the value of θ2 corresponding to the estimated µi and λi for

each observation.

The code developed by Saez-Castillo and Conde-Sanchez(19) in the software R(29) was used in

this study. The program uses functions nlm and optim to maximize the log-likelihood, and

optimize to solve Equation (6) numerically.
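The structure of the estimation procedure described above can be sketched as follows. This is not the R code used in the study; it is a standard-library Python illustration in which every function name and data value is hypothetical, θ2i is recovered by bisection for each observation, and the log-likelihood is evaluated for given coefficient vectors. The maximization itself (performed with nlm and optim in the study) is deliberately left out.

```python
import math

def hyp1f1_one(lam, theta, tol=1e-15, max_terms=100_000):
    """Series for 1F1(1; lam; theta), summed term by term."""
    term, total = 1.0, 1.0
    for r in range(max_terms):
        term *= theta / (lam + r)
        total += term
        if term <= tol * total:
            break
    return total

def solve_theta(mu, lam, tol=1e-13):
    """Bisection for the location parameter that reproduces the mean mu."""
    lo, hi = 0.0, mu + max(lam - 1.0, 0.0)
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        mean_mid = mid - (lam - 1.0) * (1.0 - 1.0 / hyp1f1_one(lam, mid))
        if mean_mid < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def hp_loglik(beta, delta, X, Z, y):
    """hP log-likelihood with log links for the mean and the dispersion parameter."""
    total = 0.0
    for x_i, z_i, y_i in zip(X, Z, y):
        mu = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        lam = math.exp(sum(d * z for d, z in zip(delta, z_i)))
        theta = solve_theta(mu, lam)          # inner numerical step
        total += (math.lgamma(lam) - math.lgamma(lam + y_i)
                  + y_i * math.log(theta) - math.log(hyp1f1_one(lam, theta)))
    return total

def poisson_loglik(beta, X, y):
    """Poisson log-likelihood with a log link, used as a reference."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))
        total += y_i * eta - math.exp(eta) - math.lgamma(y_i + 1)
    return total

# Hypothetical data: first column of X and Z is the intercept.
X = [[1.0, 0.2], [1.0, 1.0], [1.0, 1.7]]
Z = [[1.0], [1.0], [1.0]]
y = [0, 2, 1]
beta = [-0.3, 0.8]

ll_equi = hp_loglik(beta, [0.0], X, Z, y)     # delta_0 = 0 gives lambda = 1
ll_under = hp_loglik(beta, [-0.7], X, Z, y)   # lambda < 1
print(ll_equi, poisson_loglik(beta, X, y), ll_under)
```

In practice the negative of `hp_loglik` would be handed to a general-purpose optimizer over the β and δ coefficients; the agreement of the first two printed values is a quick consistency check, since λ = 1 corresponds to the Poisson model.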

4. DATA DESCRIPTION

This section provides an overview of the two data sets used in this research. As discussed

above, the datasets come from Toronto and Korea.

The Toronto data set contains crash count data collected in 1995 at 868 four-legged

signalized intersections in Toronto. Several research studies (e.g., 30; 23; 31) have used this data set

for the purpose of crash count modeling and have found it to be of good quality. The Toronto

intersection data are characterized by over-dispersion, as commonly seen in most crash data sets.


Table I presents the summary statistics of the variables in this data set.

The Korea data set contains crash count data collected at 162 railway-highway crossings in

Korea. This data set was first used by Oh et al. (2) to fit Poisson and gamma probability models,

and later by Lord et al. (9) to fit a COM-Poisson model. Although the data show signs of slight

over-dispersion (sample mean = 0.33, sample variance = 0.36), both studies observed under-

dispersion when crashes were modeled conditional on the mean. Out of the many explanatory

variables initially considered for model estimation in these studies, only a few were found to be

statistically significant at 10% level and were included in the final model. The hP model in this

study was estimated using the variables (covariates) that were found to be significant in the

Poisson, gamma distribution, or COM-Poisson models.



Table I presents these variables and their characteristics.

5. RESULTS

This section presents the modeling results for the hP GLM. The first part of this section

presents the results for the model fitted to the Toronto intersection data and the second part

shows the results for the data from Korea railway-highway crossings.

5.1. Toronto Data

Table II summarizes the modeling results for the hP

GLM with a fixed and a varying dispersion parameter and compares the results with those

obtained from the NB model. The NB GLM with a fixed dispersion parameter was estimated

with glm.nb in R, whereas the NB GLM with a variable dispersion parameter was estimated with

PROC NLMIXED in SAS (32). All models were estimated using the maximum likelihood method.

The values in parentheses indicate the standard error of the parameter estimates.

As Table II indicates, there is no significant difference

in the mean prediction bias (MPB), mean absolute deviance (MAD), and mean squared predictive error (MSPE) of the models considered for the Toronto data. The only notable

trend is the reduction in the bias (MPB) in both the hP and NB models when the dispersion

parameter is allowed to vary. The MAD and MSPE measures of fit vary only slightly from one

model to the other. This is due to the very similar estimates of mean function parameters (β’s).

Note that the MPB, MAD, and MSPE are all only dependent on the mean function and not on the

dispersion parameter. Similar β parameters, therefore, have resulted in similar values for these

measures of fit.
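The dependence of these three measures on the fitted means alone can be seen from their usual definitions. The sketch below assumes the definitions commonly used in the crash-modeling literature (predicted minus observed, averaged over sites); the exact formulas used in the study are not stated in this section, and the observed and predicted values are hypothetical.

```python
def mpb(observed, predicted):
    """Mean prediction bias: average of (predicted - observed); positive means over-prediction."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def mad(observed, predicted):
    """Mean absolute deviance: average absolute prediction error."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def mspe(observed, predicted):
    """Mean squared predictive error: average squared prediction error."""
    return sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed crash counts and fitted means for four sites.
observed = [0, 1, 2, 3]
predicted = [0.5, 1.0, 1.5, 3.5]

print(mpb(observed, predicted), mad(observed, predicted), mspe(observed, predicted))
```

None of the three functions takes a dispersion parameter as input, so two models with similar mean-function coefficients will produce similar values regardless of how dispersion is modeled.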

On the other hand, the AIC measure depends not only on the mean function, but also on the

dispersion parameter. The reason is that the AIC depends on the model likelihood function

which, in both the hP and NB model cases, has the dispersion parameter as an input. Thus,



models with similar mean function parameters (β’s) may have significantly different AIC’s (e.g.,

compare hP with fixed and varying dispersion parameter in Table II).

Table II indicates that, when the dispersion parameter is

constant, the AIC of the NB model (5077.3) is considerably lower than that of the hP model

(5157.3). The difference in AIC is large enough to infer that the NB model with a fixed

dispersion parameter outperforms the hP model with the same condition.

Nonetheless, when the dispersion parameter is allowed to vary depending on the covariates, the

hP model’s fit improves notably (AIC reduces from 5157.3 to 5088.4). Conversely, the NB

model with a variable dispersion parameter is not a significant improvement as two of the

dispersion parameter function coefficients (δ0 and δ1) are found to be statistically insignificant

(at α=0.10) and the reduction in AIC is also marginal (from 5077.3 to 5067.9). As a rule of

thumb, when the change in AIC is less than 10, the difference is usually deemed to be

insignificant (9). Thus, with a variable dispersion parameter, the hP model performs almost as

well as the NB model.
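The AIC comparisons above reduce to simple differences, which the following sketch reproduces from the values quoted in the text. The `aic` helper uses the standard definition (twice the number of parameters minus twice the log-likelihood) and is included only for reference; the dictionary keys are hypothetical labels.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2.0 * n_params - 2.0 * log_likelihood

# AIC values quoted in the text for the Toronto data.
reported_aic = {
    "hP fixed": 5157.3,
    "hP varying": 5088.4,
    "NB fixed": 5077.3,
    "NB varying": 5067.9,
}

RULE_OF_THUMB = 10.0  # a change smaller than this is usually deemed insignificant

hp_gain = reported_aic["hP fixed"] - reported_aic["hP varying"]
nb_gain = reported_aic["NB fixed"] - reported_aic["NB varying"]
fixed_gap = reported_aic["hP fixed"] - reported_aic["NB fixed"]

print(f"hP improvement with varying dispersion: {hp_gain:.1f}")
print(f"NB improvement with varying dispersion: {nb_gain:.1f}")
print(f"hP minus NB, fixed dispersion:          {fixed_gap:.1f}")
print("NB improvement below rule of thumb:", nb_gain < RULE_OF_THUMB)
```

The reduction for the hP model (about 68.9) is well above the rule-of-thumb threshold, while the reduction for the NB model (about 9.4) falls just below it.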

The variance-mean relationship structure of the hP and NB distributions is the key to

explaining the findings above. In the variance-mean function of the NB distribution shown by

Equation (8), the over-dispersion parameter is the coefficient of the second-degree term of the

mean, whereas in the hP distribution variance-mean function shown by Equation (7), the

dispersion parameter can only affect the first-degree coefficient of the mean. Thus, the variance

of the NB distribution is more sensitive to the changes in the dispersion parameter and can

increase at a faster rate. Figure 1(a) shows the mean-variance relationship of the hP and NB

models with fixed dispersion parameters for Toronto data. Clearly, the NB model variance



increases more rapidly and so the NB model better fits the over-dispersed Toronto data set than

the hP model with a fixed dispersion parameter.

Once the dispersion parameter of the hP distribution is allowed to vary, the variance-mean

relationship becomes more flexible and the hP model becomes more capable of fitting over-

dispersed crash counts. As illustrated in Figure 1(b) for models with variable dispersion, the hP

model mean-variance relationship becomes more similar to that of the NB model. When the

mean is less than 25 crashes, the variances of the two distributions closely resemble each other. As the mean

gets larger, however, the variance of the NB model increases at a higher rate than the hP model

and the difference between the variances becomes more significant.

Figure 2(a) illustrates the frequency distribution of the

varying dispersion parameter of the hP distribution across all observations. It is important to note

that even for such an over-dispersed data set, two of the observations have λ’s less than 1 and are

therefore under-dispersed (conditional on the mean). Despite the very small number of under-

dispersed observations in the Toronto data set, this finding illustrates how the hP model (with a

variable dispersion parameter) can identify data points with under-dispersion, while the NB

model fails to do so.

Figure 2(b) shows the distribution of the varying dispersion

parameter (α) of the NB model. The NB distribution is under-dispersed if α < 0, equi-dispersed if

α = 0, and over-dispersed otherwise. As shown by Figure 2(b), the NB

model did not identify any under-dispersed observations. It is probable that the NB model would

not have performed as well if a great number of observations were under-dispersed (conditional

on the mean).



It is also interesting to compare the hP model performance in fitting over-dispersed crash

data with that obtained by using the COM-Poisson model. Geedipally and Lord (10) fitted the

COM-Poisson GLM with a variable shape parameter to the Toronto data using a full Bayesian

(FB) approach with non-informative (vague) prior distributions on the parameters. Figure 3

illustrates the comparison of the mean-variance relationship of the hP and COM-Poisson models.

The variances from the two models closely resemble each other over the entire range of the mean. The hP

model can thus be expected to perform as well as the COM-Poisson.

5.2. Korea RHX Data

Both Oh et al. (2) and Lord et al. (9) examined the application of the NB model to the under-

dispersed (conditional on the mean) data from Korea railway-highway crossings, and deemed it

to be inappropriate. These two studies also considered the Poisson model and, despite the relatively good fit that it provided, the authors noted that the Poisson model should

not be used because the data are under-dispersed. Lord et al. (9) also noted that fitting the Poisson

GLM to such under-dispersed data can have a significant effect on standard errors. Therefore, the

current study compared the hP model fit to the two models found successful by the aforementioned researchers, i.e., the gamma probability and COM-Poisson models.

The Poisson, gamma probability, and COM-Poisson models for Korea RHX data (2,9) were

originally developed using 31 candidate explanatory variables. According to Lord et al. (9), eight

of these variables were found significant in at least one of the three models. These eight variables

constituted the pool of candidate explanatory variables for the hP model developed in this study

(see Table I). Disregarding the remaining 23 variables, it can be assumed that all final models

were estimated using a common set of candidate variables.



To obtain greater accuracy and prevent inclusion of correlated variables in the model, a

stepwise forward procedure with the likelihood ratio test was adopted to identify the significant

variables in this study. First, the dominant traffic flow (AADT) variable was introduced into the

model (mean function) and resulted in a log-likelihood value equal to -111.57. Then, the other

covariates entered the model in the order in which they contributed to the increase in log-

likelihood/parameter. A variable was added to the model only if the increase in the log-

likelihood was significant according to the likelihood ratio test (LRT). The significance level of

the LRT was selected at α = 0.1 for the sake of consistency with other models developed for the

Korea data with which the hP model was intended to be compared. The final model obtained

from this stepwise procedure includes the following six variables in its mean function: AADT,

presence of speed hump, train detector distance, presence of commercial area, presence of track

circuit controller, and presence of a guide. The log-likelihood of the final model is -96.77.
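A single step of the forward procedure can be sketched as below. The sketch assumes that each candidate covariate adds exactly one parameter, so that the likelihood ratio statistic is compared with a chi-square distribution with one degree of freedom; the candidate names and their log-likelihood values are hypothetical, while the two log-likelihoods -111.57 and -96.77 are the ones reported above.

```python
import math

def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def forward_step(loglik_current, candidates, alpha=0.10):
    """Pick the candidate with the largest log-likelihood; keep it only if the LRT is significant.

    candidates maps a covariate name to the log-likelihood of the current
    model extended by that single covariate.
    """
    best = max(candidates, key=candidates.get)
    p_value = chi2_sf_1df(lr_statistic(loglik_current, candidates[best]))
    return (best, p_value) if p_value < alpha else (None, p_value)

# Overall gain from the flow-only model to the final six-variable model.
overall = lr_statistic(-111.57, -96.77)
print(f"overall LR statistic: {overall:.2f}")

# Hypothetical first step after the traffic flow variable has entered.
step = forward_step(-111.57, {"speed_hump": -107.90, "commercial_area": -110.80})
print(step)
```

The overall statistic of 29.60 spans five added variables, so it is shown only as arithmetic; each individual step in the procedure would be tested separately as in `forward_step`.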

Table III presents the modeling results for the hP

distribution model and the comparison with the other models. All models were estimated using

the maximum likelihood method. The same set of variables as those in the COM-Poisson model

were found significant in the hP model. However, it is necessary to note that the coefficients

estimated for the COM-Poisson model are for the centering parameter and not for the mean

(E[Y]) as in the case of other distributions in Table III (see

(9), for more details on the COM-Poisson GLM).

The dispersion parameter of the hP model (0.298) confirms the finding of the previous

studies that the Korea data are under-dispersed (conditional on the mean) (see also 15). Using the

AIC values, the hP model provides a fit as good as those of the COM-Poisson and gamma models.



It is important to note that despite the similar quality of statistical fit, the three models

compared in Table III each include a distinct set of variables. This comparison is still meaningful

because all three models were estimated using a common pool of explanatory variables. The

presence of a certain variable in one model and not in the other is attributable to the correlation

among variables, meaning that the inclusion of a certain set of variables eliminates the need for

one or more other variables. The considerably large difference between parameter estimates in

different models is due to the distinct set of significant variables in each model.

Similar to the Toronto data application, the hP model for the Korea data performs

very well in terms of the bias; the MPB of the hP model is very close to zero, indicating that the

model neither over-predicts nor under-predicts the crashes. The COM-Poisson model also has a

relatively small bias, but the value of MPB for the gamma model indicates that this model over-predicts the crashes. The MAD and MSPE of the hP model are almost as low as those of the COM-Poisson model and lower than those of the gamma model. Overall, the hP and COM-Poisson

models performed almost equally well, slightly outperforming the gamma model.

6. CONCLUSIONS

The results of this study for the application of the hP GLM to crash data modeling are

promising. The hP GLM with a covariate-dependent dispersion parameter could fit the over-

dispersed data from Toronto almost as well as the popular NB model. When applied to the

under-dispersed data from Korea, the hP model had an equally good performance compared to

the COM-Poisson and gamma probability models.

The hP model can handle under-dispersion, while the NB model is incapable of doing so

properly. Lord et al.(9) showed that application of the NB model to under-dispersed data can

result in unstable and unreliable parameter estimates, hence mis-specified models. In modeling


over-dispersed crash data, however, the authors acknowledge that the NB model is usually preferable

to the hP model because the variance-mean relationship structure of the NB model offers more

flexibility when the variance increases very rapidly with the increase in the mean. The NB model

becomes especially useful when the data are highly over-dispersed. Nonetheless, this study

showed that the hP GLM with covariate-dependent dispersion can perform satisfactorily even

with an over-dispersed data set.
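
The variance-mean argument can be illustrated numerically. The sketch below assumes the usual Poisson-gamma (NB2) parameterization, Var(Y) = μ + αμ², which this section does not state explicitly, and uses the fixed dispersion estimate α = 0.1398 from Table II. The function name is hypothetical.

```python
def nb_variance(mean, alpha):
    """Variance of an NB2 count with the given mean and dispersion alpha."""
    return mean + alpha * mean ** 2


ALPHA_FIXED = 0.1398  # NB fixed dispersion estimate, Table II

for mean in (1, 10, 25, 50):
    var = nb_variance(mean, ALPHA_FIXED)
    print(f"mean={mean:>2}  variance={var:7.2f}  variance/mean={var / mean:5.2f}")
```

The quadratic term makes the variance-to-mean ratio grow linearly with the mean, which is the flexibility referred to above for highly over-dispersed data.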

The GLM formulation of the hP model studied in this research has an advantage over the

COM-Poisson GLM. In the hP model, the mean (E[Y]) is expressed in terms of the covariates,

whereas in the COM-Poisson model, the centering parameter, which is approximately equal to

the mode, is a function of covariates. Thus, the hP GLM permits direct interpretation of the

effect of each variable on the expected mean of crashes, whereas the COM-Poisson GLM describes its effect on

the expected mode of the crash distribution. For instance, one might look at the sign of the

variable coefficients in the hP model and directly quantify the effect on the expected mean of

crashes with an increase in the value of each variable.
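
As an illustration of this kind of reading, the sketch below converts hP coefficients from Table III into multiplicative effects on the expected number of crashes. It assumes a log link, E[Y] = exp(x'β), which is suggested by the Ln(ADT) term but not stated in this section. The dictionary and function names are hypothetical.

```python
import math

# hP coefficient estimates for the binary variables, taken from Table III.
HP_COEFFICIENTS = {
    "presence of commercial area": 0.965,
    "presence of track circuit controller": -0.924,
    "presence of guide": -0.665,
    "presence of speed hump": -1.080,
}


def mean_multiplier(coefficient, change=1.0):
    """Factor applied to E[Y] when a covariate rises by `change` (log link assumed)."""
    return math.exp(coefficient * change)


for name, beta in HP_COEFFICIENTS.items():
    factor = mean_multiplier(beta)
    print(f"{name}: expected crashes multiplied by {factor:.2f}")
```

Under the log-link assumption, a negative coefficient such as the one for a speed hump corresponds to a factor below one (about 0.34, other variables held fixed), while a positive coefficient such as the one for a commercial area corresponds to a factor above one.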

When compared to the gamma model, the hP model is preferred because it does not suffer from

the theoretical issues associated with the gamma model formulation, as discussed in the first

section of the paper.

This paper reports the first steps of ongoing research on the application of the hP

GLM in crash data modeling. Many aspects of the application need to be further

investigated. For example, the hP model performance should be examined over a greater range of

dispersion characteristics, likely through simulated data. It is also recommended to examine the

fit of the hP model to crash frequency data from roadway segments and its use for identifying hazardous

sites.



REFERENCES

1. Lord D, Mannering F. The Statistical Analysis of Crash-Frequency Data: a Review and

Assessment of Methodological Alternatives. Transportation Research - Part A, 2010;44(5):291–

305.

2. Oh J, Washington SP, Nam D. Accident Prediction Model for Railway–Highway

Interfaces. Accident Analysis & Prevention, 2006;38(2):346–56.

3. Consul P, Famoye F. Generalized Poisson Regression-Model. Communications in

Statistics-Theory and Methods, 1992;21(1):89–109.

4. Castillo J, Pérez-Casany M. Overdispersed and Underdispersed Poisson Generalizations.

Journal of Statistical Planning and Inference, 2005;134:486–500.

5. Cameron AC, Johansson P. Count Data Regression Using Series Expansions: with

Applications. Journal of Applied Econometrics, 1997;12(3):203–23.

6. Conway RW, Maxwell WL. A Queuing Model with State Dependent Service Rates.

Journal of Industrial Engineering, 1962;12:132–6.

7. Shmueli G, Minka T, Kadane JB, Borle S, Boatwright P. A Useful Distribution for

Fitting Discrete Data: Revival of the Conway–Maxwell–Poisson Distribution. Journal of the

Royal Statistical Society Series C, 2005;54(1):127–42.


8. Lord D, Guikema SD, Geedipally S. Application of the Conway-Maxwell-Poisson

Generalized Linear Model for Analyzing Motor Vehicle Crashes. Accident Analysis &

Prevention, 2008;40(3):1123–34.

9. Lord D, Geedipally SR, Guikema SD. Extension of the Application of Conway–

Maxwell–Poisson Models: Analyzing Traffic Crash Data Exhibiting Underdispersion. Risk

Analysis, 2010;30(8):1268–76.

10. Geedipally SR, Lord D. Examination of Crash Variances Estimated by Poisson-Gamma

and Conway–Maxwell–Poisson Models. Transportation Research Record, 2011;2241:59–67.

11. Sellers KF, Shmueli G. A Flexible Regression Model for Count Data. Annals of Applied

Statistics, 2010;4(2):943–61.

12. Sellers K, Borle S, Shmueli G. The COM‐Poisson Model for Count Data: A Survey of

Methods and Application. Applied Stochastic Models in Business and Industry, 2012;28(2):104–

16.

13. Francis RA, Geedipally SR, Guikema SD, Dhavala SS, Lord D, LaRocca S.

Characterizing the Performance of the Conway–Maxwell Poisson Generalized Linear Model.

Risk Analysis, 2012;32(1):167–83.


14. Winkelmann R. Duration Dependence and Dispersion in Count-Data Models. Journal of

Business & Economic Statistics, 1995;13(4):467–74.

15. Zou Y, Geedipally SR, Lord D. Evaluating the Double Poisson Generalized Linear

Model. Accident Analysis & Prevention, 2013; forthcoming.

16. Daniels S, Brijs T, Nuyts E, Wets G. Explaining Variation in Safety Performance of

Roundabouts. Accident Analysis & Prevention, 2010;42(2):393–402.

17. Daniels S, Brijs T, Nuyts E, Wets G. Extended Prediction Models for Crashes at

Roundabouts. Safety Science, 2011;49(2):198–207.

18. Efron B. Double Exponential-Families and their Use in Generalized Linear-Regression.

Journal of the American Statistical Association, 1986;81(395):709–21.

19. Sáez-Castillo AJ, Conde-Sánchez A. A Hyper-Poisson Regression Model for

Overdispersed and Underdispersed Count Data. Computational Statistics and Data Analysis,

2012;61:148-57.

20. Bardwell GE, Crow EL. A Two-Parameter Family of Hyper-Poisson Distributions.

Journal of the American Statistical Association, 1964;59(305):133–41.


21. Hauer E. Overdispersion in Modeling Accidents on Road Sections and in Empirical

Bayes Estimation. Accident Analysis and Prevention, 2001;33(6):799–808.

22. Heydecker BG, Wu J. Identification of Sites for Road Accident Remedial Work by

Bayesian Statistical Methods: An Example of Uncertain Inference. Advances in Engineering

Software, 2001;32:859–69.

23. Miaou S‐P, Lord D. Modeling Traffic‐Flow Relationships at Signalized Intersections:

Dispersion Parameter, Functional Form and Bayes vs Empirical Bayes. Transportation Research

Record, 2003;1840:31–40.

24. Mitra S, Washington SP. On the Nature of Over‐Dispersion in Motor Vehicle Crash

Prediction Models. Accident Analysis and Prevention, 2007;39(3):459‐68.

25. Guikema SD, Coffelt JP. A Flexible Count Data Regression Model for Risk Analysis.

Risk Analysis, 2008;28(1):213–23.

26. Johnson NL, Kotz S, Kemp AW. Univariate Discrete Distributions, 3rd ed. New York:

Wiley; 2005.

27. Saha K, Paul S. Bias-Corrected Maximum Likelihood Estimator of the Negative Binomial

Dispersion Parameter. Biometrics, 2005;61(1):179–85.


28. American Association of State Highway and Transportation Officials (AASHTO),

Highway Safety Manual. 1st ed. AASHTO; 2010.

29. R Development Core Team, R: A Language and Environment for Statistical Computing.

Vienna (Austria): R Foundation for Statistical Computing; 2011.

30. Lord D. The Prediction of Accidents on Digital Networks: Characteristics and Issues

Related to the Application of Accident Prediction Models [dissertation]. [Toronto (ON)]:

University of Toronto; 2000.

31. Miranda-Moreno LF, Fu L. Traffic Safety Study: Empirical Bayes or Full Bayes? 84th

Annual Meeting of the Transportation Research Board, Washington, DC, 2007.

32. SAS Institute Inc. SAS System for Windows. 9th ver. Cary (NC); 2002.

33. Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical

Information-Theoretic Approach, 2nd ed. Springer-Verlag; 2002.

34. Oh J, Lyon C, Washington SP, Persaud BN, Bared J. Validation of the FHWA Crash

Models for Rural Intersections: Lessons Learned. Transportation Research Record,

2003;1840:41-9.


TABLES AND FIGURES

Table I: Summary statistics of the data sets in this study

| Data set | Variable | Min. | Max. | Average (SD) | Frequency |
|---|---|---|---|---|---|
| Toronto | Crashes | 0 | 54 | 11.56 (10.02) | 868 |
| Toronto | Major approach AADT | 5,469 | 72,178 | 28,044.81 (10,660.4) | 868 |
| Toronto | Minor approach AADT | 53 | 42,644 | 11,010.18 (8,599.40) | 868 |
| Korea | Crashes | 0 | 3 | 0.33 (0.60) | 162 |
| Korea | Highway AADT | 10 | 61,199 | 4,617 (10,391.57) | 162 |
| Korea | Average daily railway traffic (rail.trf) | 32 | 203 | 70.29 (37.34) | 162 |
| Korea | Train detector distance (dist.trn.dtc) | 0 | 1,329 | 824.5 (328.38) | 162 |
| Korea | Time duration between activation of warning signals and gates (wrn.time) | 0 | 232 | 25.46 (25.71) | 162 |
| Korea | Presence of commercial area (p.comm): 1 (yes) | ― | ― | ― | 149 (91.98%) |
| Korea | Presence of commercial area (p.comm): 0 (no) | ― | ― | ― | 13 (8.02%) |
| Korea | Presence of a speed hump (p.hump): 1 (yes) | ― | ― | ― | 134 (82.72%) |
| Korea | Presence of a speed hump (p.hump): 0 (no) | ― | ― | ― | 28 (17.28%) |
| Korea | Presence of a track circuit controller (p.trck.cric.cont): 1 (yes) | ― | ― | ― | 113 (69.75%) |
| Korea | Presence of a track circuit controller (p.trck.cric.cont): 0 (no) | ― | ― | ― | 49 (30.25%) |
| Korea | Presence of a guide (p.guide): 1 (yes) | ― | ― | ― | 126 (77.78%) |
| Korea | Presence of a guide (p.guide): 0 (no) | ― | ― | ― | 36 (22.22%) |

― = not applicable

Table II: Modeling results for the hP and NB GLMs with the Toronto data

| Estimate | Hyper-Poisson, fixed dispersion parameter | Hyper-Poisson, varying dispersion parameter | Negative binomial, fixed dispersion parameter | Negative binomial, varying dispersion parameter |
|---|---|---|---|---|
| Ln(β0) | -10.22(0.4464) | -10.32(0.4325) | -10.25(0.465) | -10.30(0.4555) |
| β1 | 0.6076(0.0462) | 0.6265(0.04422) | 0.6207(0.04652) | 0.6203(0.04601) |
| β2 | 0.6981(0.02205) | 0.6876(0.02161) | 0.6853(0.02152) | 0.6918(0.02227) |
| λ | 25.7492 | ― | ― | ― |
| α | ― | ― | 0.1398(0.0122) | ― |
| Ln(δ0) | ― | -17.936(2.709) | ― | -1.6275(2.4381) |
| δ1 | ― | 1.4652(0.2677) | ― | 0.3223(0.2345) |
| δ2 | ― | 0.6492(0.1073) | ― | -0.3936(0.1002) |
| AIC¹ | 5157.3 | 5088.4 | 5077.3 | 5067.9 |
| MPB² | 0.033 | 0.006 | -0.045 | -0.003 |
| MAD³ | 4.142 | 4.143 | 4.142 | 4.141 |
| MSPE⁴ | 32.617 | 32.670 | 32.699 | 32.649 |

¹ Akaike information criterion (33); ² Mean prediction bias (34); ³ Mean absolute deviance (34); ⁴ Mean squared predictive error (34); ― = not applicable


Table III: Parameter Estimates and GOF Measures of Three Different Models for the Korea Data

| Variables | COM-Poisson | Gamma | Hyper-Poisson |
|---|---|---|---|
| Constant | -6.657(1.206) | -3.438(1.008) | -5.513(0.756) |
| Ln(ADT) | 0.648(0.139) | 0.230(0.076) | 0.472(0.057) |
| Average daily railway traffic | - | 0.004(0024) | - |
| Presence of commercial area | 1.474(0.513) | 0.651(0.287) | 0.965(0.370) |
| Train detector distance | 0.0021(0.0007) | 0.001(0.0004) | 0.0017(0.0006) |
| Time duration between the activation of warning signals and gates | - | 0.004(0.002) | - |
| Presence of track circuit controller | -1.305(0.431) | - | -0.924(0.303) |
| Presence of guide | -88(0.512) | - | -0.665(0.294) |
| Presence of speed hump | -1.495(0.531) | -1.58(0.859) | -1.080(0.441) |
| Shape parameter | 2.349(0.634) | 2.062(0.758) | - |
| Dispersion parameter | - | - | 0.298(0.189) |
| AIC | 210.7 | 211.38 | 209.54 |
| MPB | -0.007 | 0.179 | 0.004 |
| MAD | 0.348 | 0.459 | 0.357 |
| MSPE | 0.236 | 0.308 | 0.246 |

Values in parentheses are standard errors; - = not applicable


Figure 1: Crash variance vs. mean for the Toronto data obtained by the models with (a) fixed, (b) variable dispersion parameter.

[Plots not reproduced. Both panels: x-axis Mean (0 to 50), y-axis Variance (0 to 350). Panel (a) series: hP (constant dispersion), NB (constant dispersion). Panel (b) series: hP (variable dispersion), NB (variable dispersion).]


Figure 2: Frequency distribution of (a) the varying dispersion parameter of the hP model for the Toronto data, (b) the varying dispersion parameter of the NB model for the Toronto data.

[Histograms not reproduced. Panel (a): x-axis λ, y-axis Frequency; bin counts 0-1: 2, 1-10: 264, 10-20: 255, 20-30: 122, 30-40: 72, 40-50: 49, 50-60: 30, 60-70: 25, 70-80: 25, >80: 24. Panel (b): x-axis α, y-axis Frequency; bin counts 0.0-0.1: 60, 0.1-0.2: 594, 0.2-0.3: 187, 0.3-0.4: 21, 0.4-0.5: 6, >0.5: 0.]


Figure 3: Crash variance-mean relationship of the COM-Poisson vs. the hP model for the Toronto data.

[Plot not reproduced. x-axis Mean (0 to 50), y-axis Variance (0 to 350). Series: hP (variable dispersion), COM (variable shape parameter).]
