Examining the Influence of Link Function Misspecification in Conventional Regression Models for Developing Crash Modification Factors

 

By

Lingtao Wu*, Ph.D.

Texas A&M Transportation Institute

Texas A&M University System, 3135 TAMU

College Station, Texas 77843-3135

Phone: 979-845-7214, fax: 979-845-6481

Email: [email protected]

and

Dominique Lord, Ph.D.

Professor, Zachry Department of Civil Engineering

Texas A&M University, 3136 TAMU

College Station, Texas 77843-3136

Phone: 979-458-3949, fax: 979-845-6481

Email: [email protected]

February 7, 2017

*Corresponding author


ABSTRACT

This study further examined the use of regression models for developing crash modification factors (CMFs), focusing specifically on misspecification of the link function. The primary objectives were to validate the accuracy of CMFs derived from commonly used regression models (i.e., generalized linear models, or GLMs, with additive linear link functions) when some of the variables have nonlinear relationships with crash risk, and to quantify the bias as a function of the nonlinearity. Using the concept of artificial realistic data, various linear and nonlinear crash modification functions (CM-Functions) were assumed for three variables. Crash counts were randomly generated based on these CM-Functions. CMFs were then derived from regression models for three different scenarios, and the results were compared with the assumed true values. The main findings are summarized as follows: (1) when some variables have nonlinear relationships with crash risk, the CMFs for these variables derived from the commonly used GLMs are all biased, especially in areas away from the baseline conditions (e.g., boundary areas); (2) as the nonlinearity increases (i.e., the nonlinear relationship becomes stronger), the bias becomes more significant; (3) the quality of CMFs for other variables having linear relationships can be influenced when they are mixed with variables having nonlinear relationships, but the accuracy may still be acceptable; and (4) the misuse of the link function for one or more variables can also lead to biased estimates for other parameters. This study highlights the importance of the link function when using regression models to develop CMFs.

Keywords: Crash modification factor, generalized linear model, functional form, misspecification, highway safety


1. INTRODUCTION

Before-after and cross-sectional studies are the two main approaches used to estimate crash modification factors (CMFs) (Bonneson and Pratt, 2008). Before-after studies are generally believed to produce more accurate CMFs and are preferred whenever they are feasible. However, several limitations have restricted the use of before-after studies for developing high-quality CMFs, such as small sample size, regression-to-the-mean (RTM), multiple treatment effects, and site selection bias. Due to limited investments and time constraints, there might be too few sites available for conducting a sound before-after analysis. Sites with an unusually high number of crashes are often selected for treatment in practice, so the results will be affected by the RTM and site selection biases (they are separate, but related biases) if these are not properly accounted for (Lord and Kuo, 2012). Meanwhile, multiple treatments are commonly implemented simultaneously to reduce the number and severity of crashes, which makes it difficult to estimate the effects of single treatments with a before-after analysis. Given these limitations, the use of cross-sectional studies, particularly using regression models (also known as safety performance functions or SPFs), for estimating CMFs has become popular in recent decades. Cross-sectional analyses are nonetheless not perfect, and drawbacks have been identified for estimating CMFs (Gross et al., 2010; Lord and Mannering, 2010). Some researchers have criticized the use of SPFs for developing CMFs because they may not properly capture the cause-and-effect relationship between crashes and variables (Hauer, 2005a; Hauer, 2005b; Hauer, 2010; Lord and Mannering, 2010; Hauer, 2014; Hauer, 2015). Considering that many CMFs are being developed using SPFs, it is necessary to investigate the accuracy of these CMFs further.

A recent attempt to evaluate the use of SPFs for developing CMFs was performed by Wu et al. (2015). The main conclusion was that SPFs can generate reliable CMFs when the premise of cross-sectional studies is met or when variables having significant influences on crash risk are not omitted from the models. An important assumption within the study was that the variables had linear relationships (in logarithmic form) with the predicted number of crashes. This was consistent with the commonly used generalized linear models (GLMs), as the authors pointed out. This linear relationship is monotonic, and the same increase in a variable will always bring a fixed change in safety (either benefits or losses, depending on the CMFs). In other words, the pre-assumed relationship between variables and crash risk is consistent with the link function specified in the SPF. Both are linear additive functions, or so-called canonical log link functions. However, some variables may have nonlinear and/or non-monotonic relationships rather than linear ones (Hauer, 2004; Gross et al., 2009). Until the true relationships are known to safety analysts, linear functions have been commonly adopted in GLMs for their simplicity (Hauer, 2015). [Note that generalized nonlinear models (GNMs) have been developed recently, and a number of studies have utilized nonlinear functions in GNMs, which showed better fitting results. Nevertheless, the true functions are still unknown.] Under these conditions, the CMFs derived from SPFs may not be able to accurately capture the true safety effects of variables, especially in boundary areas.

This study is, in essence, a continuation of recent work (i.e., Wu et al., 2015) on the validation of CMFs derived from cross-sectional studies using SPFs. It focuses on conditions in which the relationships between variables and crash risk are nonlinear, that is, the link function is misspecified in the regression models. It should be noted that regression models have a number of potential problems in addition to the link function, such as unobserved heterogeneity, omitted variables, overfitting, and the independence assumption, among others (Gross and Jovanis, 2007a; Gross and Jovanis, 2007b; Lord and Mannering, 2010; Mannering et al., 2016; Park et al., 2016; Park and Abdel-Aty, 2016; Wu and Lord, 2016). Though multiple limitations are likely to exist when a regression model is estimated, it is very difficult to examine the influence of a single problem on the modeling output and CMF development when these limitations are combined. The limitations related to omitted-variable bias and variable interactions have been documented separately, and the results can be found in Wu et al. (2015) and Wu and Lord (2016). This paper focuses on the link functional form only. Thus, the primary objectives of this paper are to examine the accuracy of CMFs derived from the commonly used GLMs with linear functional forms when some variables have nonlinear effects on crash risk, and to quantify the bias. The secondary objective is to identify how the nonlinear relationships affect the coefficients for other variables that have a linear relationship, which has never been evaluated before. It is worth mentioning that there has been a debate about whether or not cross-sectional studies are able to derive reliable cause-effect results, not only in traffic safety studies, but also in other fields where this kind of statistical method is used, such as psychology and epidemiology (Elvik, 2011; Hauer, 2013; Hauer, 2015). Hauer (2013) pointed out that even perfect SPFs may produce untrustworthy CMFs, mainly due to the violation of the “ceteris paribus” assumption.¹ The objective of this study is not to prove whether or not cross-sectional analyses are able to reveal the cause and effect of traffic collisions. Rather, it is to document the potential problems associated with the commonly and “blindly” used GLMs for developing CMFs. Specifically, this study focused on the misuse of the link functional form, assuming that other conditions have been satisfied.

¹ In the context of developing a CMF for a variable with an SPF, this assumption means that all other variables in the SPF, and those not included in the SPF, are kept constant. For example, if a CMF for lane width is to be developed from an SPF, it is assumed that all the characteristics associated with safety (e.g., AADT, roadway alignment, parking, etc.) are constant except the width of the lane.

2. BACKGROUND

Considerable effort has been devoted over the past decades to the development of crash prediction models. A considerable portion of this work focused on statistical distribution models, e.g., Poisson, Poisson-gamma (negative binomial or NB), Zero-Inflated Poisson, Conway-Maxwell-Poisson, etc. (Lord and Mannering, 2010). Very few studies have examined the influence of the link function on the predicted number of crashes. Hauer (2015) pointed out that the functional forms used in regression models play vital roles in the analyses. Without the correct functions, the cause-and-effect conclusions derived from the modeling results cannot be trusted. However, so far, there is no theory for identifying the appropriate function. In practice, most safety analysts have adopted the linear additive form, probably for its simplicity. In these models, there is only one coefficient for each variable. Thus, the CMF derived for that variable is in log-linear form: a one-unit change in the variable always multiplies the expected number of crashes by the same factor. However, this kind of CMF may not properly account for the safety effect of the treatment because some variables may have nonlinear influences on crashes (Hauer, 2004; Hauer et al., 2004; Lee et al., 2015). El-Basyouny and Sayed (2010) compared the generalized link function against the commonly used linear additive function in modeling crash data. The predicted number of crashes showed significant differences under particular situations. Meanwhile, attempts have been made to explore the nonlinear effects of some variables on safety.

Hauer et al. (2004) developed a statistical model to predict non-intersection crash frequency on urban four-lane undivided roadways. Several variables were considered in the analysis. Based on the estimated parameters, some variables were found to have nonlinear effects on safety. For example, the degree of curve was found to have a “U-shape” effect on on-the-road crashes. If true, this indicates that some flat curves may be safer than a tangent, whereas sharp curves are associated with higher crash risk.

Xie and Zhang (2008) applied generalized additive models (GAMs) in traffic crash modeling. In contrast to GLMs, GAMs use nonparametric smooth functions instead of parametric terms. The analysis indicated that GAMs performed better than GLMs in terms of goodness-of-fit (GOF) and prediction performance. This method was later utilized to develop CMFs for rural frontage segments in Texas (Li et al., 2011). Analyses showed that nonlinear relationships existed between crash risk and changes in lane and shoulder widths on frontage roads. The result was slightly different from that of a previous GLM-based study (Lord and Bonneson, 2007).

In order to capture the nonlinear relationships between variables and crashes, some neural network models have also been introduced into safety analyses. Xie et al. (2007) proposed a Bayesian neural network (BNN) model for predicting motor vehicle crashes. BNN models were estimated using the Texas frontage road data, the same data used in several previous studies (Lord and Bonneson, 2007; Li et al., 2011). The authors conducted sensitivity analyses of the trained BNN model for two sites. It was found that right shoulder width showed quadratic relationships with predicted crash counts at two sites, and lane width showed an “inverse U-shape” relation with crash counts at one site. Li et al. (2008) later conducted a continuation of this work. The researchers applied support vector machine (SVM) models to predict crashes on frontage roads in Texas, aiming to capture nonlinear relationships between explanatory and dependent variables. The results were quite similar to those obtained with the BNN method.


Recently, Lao et al. (2014) proposed a generalized nonlinear model (GNM) based approach to better describe non-monotonic relationships between variables and crash rates. Compared to GLMs, the major improvement of the GNM was the use of piecewise functions to capture the pattern between dependent and independent variables. This made it more flexible for extracting complex relationships between the two. Rear-end crashes were modeled using the GNM and GLM methods. The comparison showed that GNMs outperformed GLMs. In addition, some factors were found to be significant in GNMs, but not in GLMs. Lee et al. (2015) later assessed the safety effects of changing lane width using GNMs. Various nonlinear link functions were used for the effects of lane width on crash rates, and nonlinear CM-Functions were estimated for changing lane width. It was noted that the CM-Function for lane width showed an “inverse U-shape” curve. It combined two quadratic functions, and the 12-ft lane was found to be associated with the highest crash rates. This result contradicted some past studies, which concluded that widening lanes could consistently reduce crash frequency (AASHTO, 2010). Moreover, Park and Abdel-Aty (2015a) assessed the safety effects of multiple roadside treatments (i.e., poles, trees, etc.) using GLM, GNM, and multivariate adaptive regression splines (MARS) models. The MARS model could capture both nonlinear relationships and interaction impacts between variables. Results generally showed that GNMs provided slightly better fits than GLMs, and the MARS model outperformed the other two. This indicated that the roadside treatments had nonlinear effects on crash risk. Park and Abdel-Aty (2016) further estimated the safety effectiveness of multiple cross-sectional features on urban arterials using both GLM and GNM. Analysis results revealed that the GNM outperformed the GLM. Studies on the nonlinear relationships between variables and crash risk have started to emerge recently. For example, Zeng et al. (2016a) modeled crash frequency using an optimized neural network, and the researchers found that AADT, speed limit, the presence of a bus stop, and rainfall have nonlinear effects on crash risk. Zeng et al. (2016b) developed a GNM-based mixed multinomial logit approach for crash data analysis, and nonlinear relationships between factors and crash density as well as severity were examined. Park et al. (2016) developed CMFs using a finite mixture modeling approach. Median width was found to have a “U-shape” relationship with crash frequency, and widening a narrower median is generally more effective than widening a wider median. Banihashemi (2015; 2016) and Wu et al. (2017) developed CM-Functions for horizontal curves on different types of roadways using cross-sectional methods. All three CM-Functions were in a nonlinear form. More specifically, the study by Wu et al. (2017) indicated that the nonlinear CM-Function was more accurate than a linear CM-Function in predicting crash numbers at horizontal curves.

In almost all of these studies, analyses using nonlinear methods showed better results than the commonly used GLM approach. This suggests that some variables indeed have nonlinear and/or non-monotonic effects on crash frequency, and that the CMFs derived using conventional GLMs may not be able to adequately capture this kind of relationship.

3. METHODOLOGY

3.1 Quantifying Nonlinearity

This section describes how the accuracy of CMFs derived from SPFs was investigated when some variables had nonlinear relationships. Intuitively, if the nonlinear relationship is weak (the CM-Function curve is quite flat), the accuracy of CMFs derived from SPFs should be similar to that in the previous study (Wu et al., 2015). Conversely, if the nonlinear relationship is strong (the curve is sharp), the accuracy of CMFs may be affected. A measurement is necessary to describe how flat or sharp the curve is. This section introduces the concept of quantifying nonlinearity.

First, the closest line to a curve is defined. For a given integrable curve $y = f(x)$ over $[m, n]$, the closest line to this curve is defined as the straight line $y = kx + c$ that minimizes the area between the two. This definition is illustrated in Figure 1. The dashed curve represents the given function $y = f(x)$, and the solid line represents the closest line to this curve, $y = kx + c$. This line minimizes the area between the two (the shaded area in Figure 1). For a given range, in general, the larger the area is, the stronger the nonlinearity of the curve tends to be. In particular, if the given function is linear, the closest line is the function itself, and the area is equal to zero.

Second, the average vertical distance between a curve and its closest line is defined. Although the area between a curve and its closest line can reflect the nonlinearity of the curve, the area still depends on the range: a wider range is more likely to yield a larger area. In practice, variables affecting traffic crashes usually differ in their ranges of possible values, so a standardized measurement is necessary to quantify the nonlinearity. The average vertical distance (AVD) between a curve and its closest line is defined as the area between the two divided by the range. So, in Figure 1, the AVD is calculated by dividing the shaded area by $n - m$. This way, the AVD itself can be used to quantify the nonlinearity of a curve regardless of its range. The larger this distance is, the stronger the nonlinearity of the curve. If the given function is linear, the AVD is zero.

Figure 1 Example illustrating the closest line to a curve.

The details for calculating the coefficients of the line (i.e., $k$ and $c$) and the AVD are shown below. The objective is to minimize the area, shown in Equation 1.

$$Area = \int_m^n \left| f(x) - (kx + c) \right| \, dx \tag{1a}$$

Or equivalently,

$$Z = \int_m^n \left[ f(x) - (kx + c) \right]^2 dx \tag{1b}$$

$k$ and $c$ can be easily derived through mathematical manipulation, as shown below.

$$k = \frac{(n - m)\int_m^n x f(x)\,dx - \int_m^n f(x)\,dx \int_m^n x\,dx}{(n - m)\int_m^n x^2\,dx - \left[\int_m^n x\,dx\right]^2} \tag{2a}$$

$$c = \frac{\int_m^n f(x)\,dx \int_m^n x^2\,dx - \int_m^n x f(x)\,dx \int_m^n x\,dx}{(n - m)\int_m^n x^2\,dx - \left[\int_m^n x\,dx\right]^2} \tag{2b}$$
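The intermediate step behind Equation 2 can be sketched as follows. This is an illustrative derivation, not reproduced from the paper: it takes the squared-difference objective $Z$ of Equation 1b, sets its partial derivatives with respect to $k$ and $c$ to zero, and solves the resulting pair of linear equations. The shorthand symbols $S_x$, $S_{xx}$, $S_f$ and $S_{xf}$ are introduced here for convenience only.

```latex
\begin{align*}
S_x &= \int_m^n x\,dx, \quad
S_{xx} = \int_m^n x^2\,dx, \quad
S_f = \int_m^n f(x)\,dx, \quad
S_{xf} = \int_m^n x f(x)\,dx \\[4pt]
\frac{\partial Z}{\partial k} &= -2\int_m^n x\,[f(x) - kx - c]\,dx = 0
  \;\Rightarrow\; k\,S_{xx} + c\,S_x = S_{xf} \\
\frac{\partial Z}{\partial c} &= -2\int_m^n [f(x) - kx - c]\,dx = 0
  \;\Rightarrow\; k\,S_x + c\,(n - m) = S_f \\[4pt]
k &= \frac{(n - m)\,S_{xf} - S_f\,S_x}{(n - m)\,S_{xx} - S_x^2}, \qquad
c = \frac{S_f\,S_{xx} - S_{xf}\,S_x}{(n - m)\,S_{xx} - S_x^2}
\end{align*}
```

Substituting the integrals back for the shorthand symbols gives the expressions for $k$ and $c$ in Equations 2a and 2b.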


The area can be calculated by substituting $k$ and $c$ into Equation 1a, and the AVD is then calculated by dividing the area by $n - m$. The AVD will be used to measure the nonlinearity of CM-Functions.
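The calculation described above can be sketched numerically. The following is a minimal illustration, assuming the integrals in Equations 1a and 2 are approximated with the trapezoidal rule on a fine grid; the function names (`integrate`, `closest_line`, `avd`) and the grid size are hypothetical choices made for this sketch, not part of the original study.

```python
import numpy as np


def integrate(y, x):
    """Trapezoidal-rule approximation of the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def closest_line(f, m, n, num=20001):
    """Slope k and intercept c of the closest line (Equations 2a and 2b)."""
    x = np.linspace(m, n, num)
    fx = f(x)
    s_x = integrate(x, x)          # integral of x
    s_xx = integrate(x ** 2, x)    # integral of x^2
    s_f = integrate(fx, x)         # integral of f(x)
    s_xf = integrate(x * fx, x)    # integral of x * f(x)
    denom = (n - m) * s_xx - s_x ** 2
    k = ((n - m) * s_xf - s_f * s_x) / denom
    c = (s_f * s_xx - s_xf * s_x) / denom
    return k, c


def avd(f, m, n, num=20001):
    """Average vertical distance: area from Equation 1a divided by (n - m)."""
    k, c = closest_line(f, m, n, num)
    x = np.linspace(m, n, num)
    area = integrate(np.abs(f(x) - (k * x + c)), x)
    return area / (n - m)
```

For a straight line the returned AVD is zero up to rounding error, and it grows as the curve bends further away from its closest line over the same range.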

3.2 Simulation Protocol

To assess the influence of link function misspecification on the accuracy of CMFs, this study used the artificial realistic data approach proposed by Hauer (2014), since the true relationship between variables and crash risk is unknown with real crash data. The simulation experiment was essentially the same as that used in the first study on this topic (Wu et al., 2015), with the changes necessary to accommodate the particular requirements of this study. The simulation procedure is briefly described below.

Step 1: Assign Initial CM-Functions

Assume CM-Functions for highway geometric features of interest. This study assumed multiple kinds of relationships (i.e., linear and nonlinear) between variables and crash risk. For example, we can assume the CM-Function for lane width as in Equation 3.

$$CMF = f(LW) = \exp(a \cdot LW^2 + b \cdot LW + c) \tag{3}$$

Where,

$CMF$ = the specific CMF for a segment with LW-ft lanes;

$LW$ = the lane width of the segment, in ft; and,

$a$, $b$, $c$ = coefficients of the assumed CM-Function for lane width.

Note that there is a large (even infinite) number of nonlinear forms that may represent the relationships between variables and crash frequency. This study utilized the quadratic form for two reasons: (1) this form was developed in a previous study based on real crash data (Lee et al., 2015); and (2) the quadratic form is very flexible. The nonlinearity can be easily adjusted by changing its coefficients. In addition, it is able to capture both monotonic and non-monotonic relationships (over some particular range). Other researchers may use other forms to reflect the underlying relationships between crash risk and variables of interest.
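As a small illustration of Equation 3, the sketch below evaluates a quadratic CM-Function for lane width. The coefficient values and the helper `normalising_constant`, which chooses $c$ so that the CMF equals 1 at a 12-ft base lane width, are assumptions made for this example; the coefficients actually used in the study are not given in this section.

```python
import numpy as np


def cm_function(lw, a, b, c):
    """Equation 3: CMF = exp(a * LW^2 + b * LW + c), for lane width LW in ft."""
    lw = np.asarray(lw, dtype=float)
    return np.exp(a * lw ** 2 + b * lw + c)


def normalising_constant(a, b, base=12.0):
    """Hypothetical helper: the value of c that makes CMF(base) equal to 1."""
    return -(a * base ** 2 + b * base)


# Illustrative coefficients only: exponent has its turning point at LW = 11 ft.
a, b = 0.02, -0.44
c = normalising_constant(a, b)
lane_widths = np.arange(8, 14)          # 8 to 13 ft
cmf_values = cm_function(lane_widths, a, b, c)
```

Setting `a` closer to zero flattens the curve toward the log-linear case, while larger magnitudes of `a` give stronger nonlinearity over the same lane-width range.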

Step 2: Calculate Mean Values

Calculate the true crash mean for each segment using basic SPFs and the assumed CMFs. The equations used to calculate the true crash mean are called theoretical functions, and they are presented later in this paper (i.e., Section 5).

Step 3: Generate Discrete Counts

Generate random counts given that the mean for a segment was gamma distributed with inverse dispersion parameter $\phi$ (the reciprocal of the dispersion parameter) and mean equal to 1.
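Step 3 can be sketched with the conventional Poisson-gamma construction. This is an assumed implementation rather than the authors' code: a gamma multiplier with mean 1 and shape equal to the inverse dispersion parameter scales the true mean of each segment, and the count is then drawn from a Poisson distribution. The function name `simulate_counts` is hypothetical.

```python
import numpy as np


def simulate_counts(true_mean, phi, rng):
    """Draw Poisson-gamma (negative binomial) counts for each segment.

    true_mean : array of true crash means from Step 2
    phi       : inverse dispersion parameter (gamma shape)
    rng       : numpy.random.Generator
    """
    true_mean = np.asarray(true_mean, dtype=float)
    # Gamma multiplier with shape phi and scale 1/phi: mean 1, variance 1/phi.
    multiplier = rng.gamma(shape=phi, scale=1.0 / phi, size=true_mean.shape)
    # Poisson draw around the randomly perturbed mean.
    return rng.poisson(true_mean * multiplier)


rng = np.random.default_rng(2017)
counts = simulate_counts([0.8, 2.5, 4.0], phi=1.0, rng=rng)
```

Under this construction the counts have mean equal to the true mean and variance equal to the mean plus the squared mean divided by `phi`, so smaller values of `phi` produce more overdispersed data.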

Step 4: Estimate CMFs from the Simulated Crash Data Using NB Regression Models

For the purpose of this study, the most commonly used GLM and functional form (i.e., linear form) were used, as shown in Equation 4 (Lord and Bonneson, 2007). This equation is called the considered function. Note that the symbol describing the mean of the site in the estimated models, $\mu_i$, differs from the one used for the simulation.

$$E(\mu_i) = \beta_0 \cdot L_i \cdot AADT_i^{\beta_1} \cdot \exp\left(\sum_{j=2}^{n} \beta_j x_j\right) \tag{4}$$

Where,

$E(\mu_i)$ = the estimated crash mean during a period for segment $i$;

$x_j$ = a series of variables, such as the lane width of segment $i$; and,

$\beta_0, \beta_1, \ldots, \beta_n$ = coefficients to be estimated.

For the GOF of the models, the following three measures were used: (1) Akaike information criterion (AIC), (2) mean absolute deviance (MAD), and (3) mean-squared predictive error (MSPE). More information about MAD and MSPE is documented in Lord et al. (2008).
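The two prediction-based measures can be sketched as below, assuming their usual definitions: MAD as the mean absolute difference between predicted and observed counts, and MSPE as the mean squared difference. The function names are hypothetical, and Lord et al. (2008) should be consulted for the exact forms used in the study.

```python
import numpy as np


def mean_absolute_deviance(observed, predicted):
    """MAD: average absolute difference between predicted and observed counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(predicted - observed)))


def mean_squared_predictive_error(observed, predicted):
    """MSPE: average squared difference between predicted and observed counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((predicted - observed) ** 2))
```

For both measures, smaller values indicate that the fitted model reproduces the simulated counts more closely.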

Once the model was fitted and the coefficients were estimated using the simulated crash data, the CM-Function for variable $j$ was then derived as (Lord and Bonneson, 2007; Gross et al., 2010):

$$CMF_{x,j} = \exp\left[\beta_j (x - x_{0,j})\right] \tag{5}$$

Where,

$\beta_j$ = estimated coefficient for variable $j$;

$x$ = value of variable $j$, such as lane width or curve density;

$x_{0,j}$ = base condition defined for variable $j$, usually 12 ft for lane width; and,

$CMF_{x,j}$ = CMF specific to variable $j$ with value of $x$.

This also indicated that the CMF derived from the SPF for variable $j$ was $CMF_j = \exp(\beta_j)$, meaning the expected crash frequency would be multiplied or divided by $CMF_j$ if variable $j$ increased or decreased by one unit.
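A brief sketch of Equation 5 follows. The coefficient value used here is purely illustrative and is not an estimate from the study; `cmf_curve` is a hypothetical name.

```python
import numpy as np


def cmf_curve(beta_j, x, x0):
    """Equation 5: CMF for variable j at value x relative to base condition x0."""
    x = np.asarray(x, dtype=float)
    return np.exp(beta_j * (x - x0))


# Illustrative only: a lane-width coefficient of -0.05 with a 12-ft base condition.
lane_widths = np.arange(8, 14)
cmfs = cmf_curve(-0.05, lane_widths, 12.0)
```

Because the exponent is linear in `x`, the ratio between CMFs at consecutive lane widths is always `exp(beta_j)`, which is the constant per-unit effect described above. A fitted log-linear model therefore cannot reproduce a CM-Function whose per-unit effect changes across the range.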

Repeat Steps 2 to 4 100 times, then calculate the mean and the standard deviation of the estimated CMF values for each variable.

Step 5: Evaluate the CMF Derived from the NB Model

Two indexes, estimation bias and error percentage, were used to evaluate the CMF derived from SPFs. They are shown in Equations 6 and 7. The smaller the error percentage, the more accurate the CMF derived from SPFs.

$$\Delta_j = CMF_{j\_Assumed} - CMF_{j\_SPF} \tag{6}$$

$$e_j = \frac{\Delta_j}{CMF_{j\_Assumed}} \times 100 \tag{7}$$

Where,

$\Delta_j$ = estimation bias of the CMF for variable $j$ with a specific value;

$e_j$ = error percentage of the CMF for variable $j$ with a specific value (%);

$CMF_{j\_Assumed}$ = assumed specific CMF value for variable $j$; and

$CMF_{j\_SPF}$ = specific CMF derived from the SPF for variable $j$.
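The two indexes can be computed as in the sketch below. The sign convention (assumed value minus SPF-derived value) and the absence of an absolute value are assumptions made for this illustration, and the function names are hypothetical.

```python
def estimation_bias(cmf_assumed, cmf_spf):
    """Estimation bias: assumed (true) CMF minus the CMF derived from the SPF."""
    return cmf_assumed - cmf_spf


def error_percentage(cmf_assumed, cmf_spf):
    """Error percentage: bias relative to the assumed CMF, in percent."""
    return estimation_bias(cmf_assumed, cmf_spf) / cmf_assumed * 100.0


# Illustrative values only: an assumed CMF of 1.20 and an SPF-derived CMF of 1.08.
bias = estimation_bias(1.20, 1.08)
error = error_percentage(1.20, 1.08)
```

With this convention a positive bias means the SPF underestimates the assumed CMF; taking the magnitude of the error percentage gives a direction-free measure of accuracy.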

Note the meaning of the term “bias” used above to quantify the quality of CMFs. In mathematics and statistics, bias is defined as a systematic (built-in) error that makes all values or estimates wrong in the same direction and by a certain amount (Pierce, 2014). In this paper, bias specifically means the difference between the true CMF for a variable and that estimated from regression models. It can also be described as misspecification error (as some CMFs are misestimated in the models). However, to simplify the description, the issue of misspecification is referred to as “bias” in the rest of this paper.

3.3 Scenarios

Three scenarios were analyzed in this study, as described below. To distinguish them from those in previous work, the scenarios were named “Nonlinear” followed by a number.

Scenario Nonlinear I: Consider one variable only, in nonlinear form. Only lane width was considered, and it was assumed to have nonlinear effects on crashes. The main objective was to examine the bias of the CMF for a variable with different levels of nonlinearity.

Scenario Nonlinear II: Consider three variables, only one in nonlinear form. Three variables, lane width, curve density and pavement friction, were considered in this scenario. Curve density and pavement friction were assumed to have linear effects on safety. The CM-Functions for lane width were assumed to be in nonlinear forms. The main objective was to examine the influence of nonlinear variables on the accuracy of CMFs for linear variables.

Scenario Nonlinear III: Consider three variables, two in nonlinear form. This scenario was similar to Scenario Nonlinear II, but both lane width and curve density were assumed to have nonlinear relationships (and were considered independent). The CMF for pavement friction was in linear form. The main objective was to examine the influence of nonlinear variables on the accuracy of CMFs for both linear and nonlinear variables.

For all three scenarios, the assumed nonlinear relationships varied from weak to strong. Thus, each scenario contained a number of sub-scenarios. In addition, the inverse dispersion parameter, ϕ, took the values 0.5, 1.0 and 2.0 in each sub-scenario to reflect different traffic characteristics.

As discussed above, a number of different issues can affect the development of CMFs from SPFs. However, it would be difficult to properly understand and quantify the bias if multiple limitations were analyzed simultaneously. For the purpose of this study, the main assumptions were: (1) the variables were independent of each other (i.e., there was no correlation); and (2) there were no interactions between variables in their effects on safety.

4. DATA DESCRIPTION

This study utilized the same roadway segments (i.e., rural highways in Texas) as in Wu et al. (2015). The dataset contained segment length, Annual Average Daily Traffic (AADT), lane width, curve density (i.e., curves/mile) and pavement friction. The segment length and AADT were based on actual values from the Texas data, while the other three are hypothetical variables created for this study. They were generated using the same method as in Wu et al. (2015), i.e., independent discrete or continuous uniform distributions. For lane width, the parameters were 8 and 13. For curve density, they were 0 and 16. For pavement friction, they were 16 and 48. It is worth mentioning, however, that the variables may not be independent and uniformly distributed in practice. On the contrary, some of them are usually correlated. For example, when constructing two highways, one with higher demand (i.e., AADT) and the other with lower demand, it is common for the former to be designed to a higher standard, e.g., wider lanes and shoulders. Thus, AADT and lane width are correlated. In addition, highway design manuals (AASHTO, 2004) recommend a lane width of 12 ft for most highways. So 12-ft lanes may be prevalent, and lane width does not follow a discrete uniform distribution in practice. This might affect the regression results and hence the CMFs for the variables. For comparative purposes, however, this study utilized the same dataset as the previous one. As a result, the issue of variable correlation does not have to be considered in this study (Lee et al., 2015).
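The generation of the three hypothetical variables can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that lane width is the discrete uniform variable (integer feet from 8 to 13) and that curve density and pavement friction are continuous uniform on (0, 16) and (16, 48). The function name, the dictionary keys and the seed are hypothetical.

```python
import random

def generate_hypothetical_variables(n_sites, seed=0):
    """Draw lane width, curve density and pavement friction independently."""
    rng = random.Random(seed)  # seeded for reproducibility (assumed choice)
    sites = []
    for _ in range(n_sites):
        sites.append({
            "lane_width": rng.randint(8, 13),          # discrete uniform, ft
            "curve_density": rng.uniform(0, 16),       # continuous uniform, curves/mile
            "pavement_friction": rng.uniform(16, 48),  # continuous uniform
        })
    return sites

# The study dataset has 1,492 segments.
sites = generate_hypothetical_variables(1492)
```

Because the draws are independent, any correlation among the three variables in a generated sample is due to chance only, which is the simplifying assumption stated above.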

In total, there were 1,492 segments. The summary statistics of these variables are shown in Table 1. In the simulated data, the number of sites within each category of lane width (8 to 13 ft) is approximately 249.

Table 1 Summary Statistics of Highway Segments

| Variable | Sample Size | Min. | Max. | Mean (SD c) |
|---|---|---|---|---|
| Length (mile) | 1,492 | 0.1 | 6.3 | 0.55 (0.67) |
| AADT | 1,492 | 502 | 24,800 | 6,643.9 (3,996.4) |
| Lane Width (ft) | 1,492 | 8.0 | 13.0 | 10.47 (1.74) |
| CD a (per mile) | 1,492 | 0.02 | 16.0 | 8.1 (4.66) |
| PF b | 1,492 | 16.0 | 47.9 | 31.9 (9.08) |

Note: a: CD = Curve Density; b: PF = Pavement Friction; c: SD = Standard Deviation.

As in the previous study (Wu et al., 2015), it is important to point out that this study selected three geometric features, and that the CMFs were assumed mainly on the basis of practical values (i.e., from the HSM, CMF Clearinghouse, etc.) so as to reflect as closely as possible the characteristics of variables that can influence crash risk. However, this is not a requirement. With the simulation protocol, other researchers can use variables and ranges based on the roadway entities for which they have detailed information.

5. RESULTS

5.1 Scenario Nonlinear I: Consider lane width only

In this scenario, three nonlinear CM-Functions were assumed for lane width, giving three sub-scenarios: Nonlinear I-1, I-2 and I-3. The first two CM-Functions for lane width were quadratic functions (in logarithmic form), shown in Equations 8 and 9.

$$\ln(CMF) = 0.1\,LW^2 - 2.22\,LW + 12.28 \qquad (8)$$

$$\ln(CMF) = 0.2\,LW^2 - 4.22\,LW + 21.88 \qquad (9)$$

The third one was a combination of two piecewise quadratic functions. This nonlinear function, shown in Equation 10, was developed by Lee et al. (2015) based on real crash data. Note that, in the study by Lee et al. (2015), lanes narrower than 9 ft were considered to have the same CMF as a 9-ft lane. To keep the analyses consistent and simple, this study assumed that an 8-ft lane had a different CMF from a 9-ft lane, and it was calculated directly using Equation 10.

$$\ln(CMF) = \begin{cases} -0.11\,(LW - 12)^2 + 0.30, & LW \le 12 \\ -0.08\,(LW - 12)^2 + 0.30, & LW > 12 \end{cases} \qquad (10)$$

The assumed CM-Functions and their characteristics (closest line, area and AVD) for the three sub-scenarios are summarized in Table 2. It can be seen that the AVD of I-2 was higher than those of I-1 and I-3. The latter two were close to each other. This made the assumed CM-Function in I-2 relatively strong in nonlinearity, and the other two relatively weak.

Table 2 Assumed CM-Functions for Lane Width (Scenario Nonlinear I)

| # * | ln(CMF) a | Line b | Area c | AVD | Level d |
|---|---|---|---|---|---|
| NL I-1 | $0.1\,LW^2 - 2.22\,LW + 12.28$ | $-0.123\,LW + 1.46$ | 0.802 | 0.160 | Weak |
| NL I-2 | $0.2\,LW^2 - 4.22\,LW + 21.88$ | $-0.023\,LW + 0.24$ | 1.603 | 0.321 | Strong |
| NL I-3 | $-0.11\,(LW-12)^2 + 0.30$ for $LW \le 12$; $-0.08\,(LW-12)^2 + 0.30$ for $LW > 12$ | $0.339\,LW - 4.05$ | 0.886 | 0.177 | Weak |

Note: * # - sub-scenario number, NL - Nonlinear; a: LW - lane width, in ft; b: Line - the closest line to the curve; c: Area - the area between the curve and its closest line; d: Level - the relative nonlinear level.
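The Area and AVD columns of Table 2 can be approximately reproduced with the sketch below. It rests on two assumptions that are not restated in this section: that the closest line is the least-squares line over the continuous 8 to 13 ft range, and that AVD is the area divided by the range width. Under these assumptions the quadratic rows (I-1 and I-2) are matched to rounding for Area and AVD, and approximately for the line coefficients. The function name and the integration step count are hypothetical.

```python
def closest_line_and_avd(a, b, c, lo, hi, steps=100_000):
    """Least-squares line for a*x^2 + b*x + c on [lo, hi], plus area and AVD."""
    mid = (lo + hi) / 2.0
    width = hi - lo
    # Closed-form continuous least-squares fit of a line to a quadratic.
    slope = 2.0 * a * mid + b
    intercept = c + a * (width ** 2 / 12.0 - mid ** 2)
    # Midpoint-rule integral of |curve - line| over the range.
    h = width / steps
    area = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        area += abs(a * x * x + b * x + c - (slope * x + intercept)) * h
    return slope, intercept, area, area / width

# Sub-scenario I-1 (Equation 8) over lane widths of 8 to 13 ft.
slope, intercept, area, avd = closest_line_and_avd(0.1, -2.22, 12.28, 8.0, 13.0)
print(f"line: {slope:.3f} LW + {intercept:.2f}, area = {area:.3f}, AVD = {avd:.3f}")
```

Doubling the quadratic coefficient, as in I-2, doubles the area and the AVD, which is why I-2 is labelled strong relative to I-1.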


The theoretical function of the generated crash counts in these three sub-scenarios is shown in Equation 11. The specific CMF for lane width, $CMF_{LW,i}$, was calculated using Equations 8 to 10.

$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \qquad (11)$$
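The data-generating step behind Equation 11 can be sketched as below: the true mean is computed from segment length, AADT and the assumed lane-width CMF, and a negative binomial count is drawn as a gamma-Poisson mixture with inverse dispersion parameter ϕ. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the lane-width function is Equation 8 (sub-scenario I-1, read with positive quadratic and constant terms and a negative linear term), and the simple Poisson sampler is only suitable for small to moderate means.

```python
import math
import random

BASE_RATE = 2.67e-4  # constant in Equation 11

def ln_cmf_lane_width(lw):
    """Equation 8 (sub-scenario I-1)."""
    return 0.1 * lw ** 2 - 2.22 * lw + 12.28

def true_mean(length, aadt, lw):
    """Equation 11: N_true = 2.67e-4 * L * AADT * CMF_LW."""
    return BASE_RATE * length * aadt * math.exp(ln_cmf_lane_width(lw))

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for small to moderate means."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def nb_draw(mean, phi, rng):
    """Negative binomial draw: variance = mean + mean**2 / phi."""
    multiplier = rng.gammavariate(phi, 1.0 / phi)  # gamma with mean 1, variance 1/phi
    return poisson_draw(mean * multiplier, rng)

rng = random.Random(2017)  # hypothetical seed
mu = true_mean(length=0.55, aadt=6644, lw=10)  # roughly the Table 1 mean values
count = nb_draw(mu, phi=0.5, rng=rng)
```

A smaller ϕ gives more overdispersed counts for the same mean, which is how the three values 0.5, 1.0 and 2.0 represent different data characteristics.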

The considered functional form in SPFs is shown in Equation 12. As can be seen, the link function is misspecified in the considered regression model.

$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2\,LW_i) \qquad (12)$$

Table 3 presents the CMFs derived from SPFs as well as other results (i.e., ϕ and GOF measures). First, the MAD and MSPE of the nonlinear forms were considerably higher than those of the linear ones [i.e., Scenario I in the previous study (Wu et al., 2015)]. This indicated that the CMFs in this scenario might have higher bias. Second, as the level of nonlinearity increased, the MAD and MSPE also increased. In other words, when the relationship between the variable and crash risk became strongly nonlinear, the normal GLMs were likely to produce biased CMFs. Finally, under nonlinear relationships, the inverse dispersion parameters estimated from SPFs were biased (see the column “ϕ” in Table 3).


Table 3 Results of Scenario Nonlinear I

| ϕ c | # * | CMF (SD) a | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|---|
| 0.5 | NL I-1 | 0.88 (0.01) | 0.57 | 11413.39 | 0.31 | 0.39 |
| 0.5 | NL I-2 | 0.98 (0.02) | 0.81 | 11522.30 | 0.64 | 1.57 |
| 0.5 | NL I-3 | 1.33 (0.02) | 0.56 | 7770.30 | 0.15 | 0.11 |
| 1.0 | NL I-1 | 0.87 (0.02) | 1.08 | 11492.77 | 0.31 | 0.39 |
| 1.0 | NL I-2 | 0.98 (0.02) | 1.34 | 11509.57 | 0.64 | 1.58 |
| 1.0 | NL I-3 | 1.34 (0.03) | 1.08 | 7803.04 | 0.15 | 0.13 |
| 2.0 | NL I-1 | 0.88 (0.03) | 2.11 | 11233.07 | 0.32 | 0.40 |
| 2.0 | NL I-2 | 0.98 (0.03) | 2.41 | 11138.83 | 0.64 | 1.60 |
| 2.0 | NL I-3 | 1.35 (0.03) | 2.11 | 7690.76 | 0.15 | 0.15 |

Note: * # - sub-scenario number, NL - Nonlinear; a: mean of CMFs from 100 experiments, with SD the standard deviation of the 100 CMFs; b: the inverse dispersion parameter derived from SPFs; c: the theoretical inverse dispersion parameter in each sub-scenario; d, e, f: each is the mean value of the corresponding GOF measure over the 100 results.
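For readers who want to recompute the two prediction-error measures reported in Table 3, a minimal sketch is given below. It assumes the usual definitions (MAD as the mean absolute difference and MSPE as the mean squared difference between predicted and observed crash counts); the function names and the small example lists are hypothetical.

```python
def mad(predicted, observed):
    """Mean absolute deviation between predicted and observed counts."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def mspe(predicted, observed):
    """Mean squared prediction error between predicted and observed counts."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Hypothetical predictions and counts for three sites.
predicted = [1.0, 2.0, 3.0]
observed = [1, 4, 0]
print(mad(predicted, observed), mspe(predicted, observed))
```

Because MSPE squares the differences, it penalizes a few large errors more heavily than MAD does, which is consistent with the wider spread of MSPE values across sub-scenarios.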

To verify the above assumption, the curves of the assumed CM-Functions and the CMFs derived from SPFs are illustrated in Figure 2. Due to space limitations, Figure 2 only presents the curves for an inverse dispersion parameter equal to 0.5. In addition, the specific CMFs for lane widths (8, 9, 10, 11 and 13 ft) are presented in Table 4 for explicit comparison. The CMFs for the 12-ft lane are excluded from Table 4, because 12 ft is the base condition for lane width and the CMFs are equal to 1.0 in both the assumed and derived CM-Functions.

Figure 2(a) shows the CM-Functions in sub-scenario I-1 (weak nonlinearity). It can be seen that the assumed true CMF for lane width first decreased and then increased. But the CMF derived from the SPF was 0.88, meaning that the expected number of crashes would consistently fall by 12 percent whenever the lane was widened by 1 ft. When the lane width was less than about 9 ft or more than 12 ft, the CMF was underestimated. The opposite held when the lane width was between about 9 and 12 ft. The bias was more obvious around the boundary areas.

Figure 2(b) presents the CM-Functions in sub-scenario I-2 (strong nonlinearity). The overall result was similar to that of I-1, but the bias or error around the boundary areas in this sub-scenario was much higher than in I-1. More interestingly, the CMF derived from SPFs in this sub-scenario was 0.98, very close to 1.0, indicating that lane width had a minor influence on crash risk. Based on this finding, safety analysts might wrongly conclude that widening lanes has little effect on reducing collisions. However, the assumed true safety effect of lane width was far from this statement.

The results of I-3 (piecewise nonlinear functions) are shown in Figure 2(c). The CMF for lane width derived from SPFs was 1.33. When the lane width was between 9 and 12 ft, the two curves were close to each other. However, the bias was significantly high when the lane became relatively wide or narrow. The error reached nearly 90 percent at the point of the 8-ft lane. On the side of wider lanes, the true CMF decreased as lane width increased, but the CMF derived from SPFs increased continuously.
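The comparisons described for Figure 2 can be recomputed with the sketch below. It evaluates the assumed CM-Function (normalized to the 12-ft base condition) and the CM-Function implied by a log-linear SPF, which is a constant ratio per foot. The function names are hypothetical, and the signs used for Equations 8 and 10 (negative quadratic coefficients and a constant of +0.30 in the piecewise form) are assumptions of this sketch.

```python
import math

def true_cmf(ln_cmf, lw, base=12.0):
    """Assumed CMF relative to the 12-ft base condition."""
    return math.exp(ln_cmf(lw) - ln_cmf(base))

def derived_cmf(per_foot_cmf, lw, base=12.0):
    """CM-Function implied by a log-linear SPF: a constant ratio per foot."""
    return per_foot_cmf ** (lw - base)

def ln_cmf_i1(lw):
    """Equation 8 (sub-scenario I-1)."""
    return 0.1 * lw ** 2 - 2.22 * lw + 12.28

def ln_cmf_i3(lw):
    """Equation 10 (sub-scenario I-3, piecewise quadratic)."""
    coef = 0.11 if lw <= 12 else 0.08
    return -coef * (lw - 12) ** 2 + 0.30

# Per-foot CMFs 0.88 (I-1) and 1.33 (I-3) are the phi = 0.5 values in Table 3.
for label, ln_cmf, per_foot in (("I-1", ln_cmf_i1, 0.88), ("I-3", ln_cmf_i3, 1.33)):
    for lw in (8, 9, 10, 11, 13):
        t, d = true_cmf(ln_cmf, lw), derived_cmf(per_foot, lw)
        print(f"{label}  LW={lw:2d}  assumed={t:.2f}  derived={d:.2f}  "
              f"error={100 * (d - t) / t:+.0f}%")
```

The single per-foot ratio can only bend the curve one way, so it cannot follow an assumed function that falls and then rises; the mismatch is therefore largest at the ends of the lane-width range.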

Similar results were found for the other inverse dispersion parameters (not documented here due to space limitations). It can therefore be concluded that none of the CMFs derived from SPFs reflected the true safety effects accurately. They were all biased, especially in boundary areas. Regression analysis with the commonly used linear link functions could produce biased CMFs when the variable had a nonlinear relationship with crash risk. As the nonlinearity increased, the bias became significant. In addition, the misuse of the linear link function also led to biased estimates for other parameters, which might play important roles in safety analyses. For example, the inverse dispersion parameter is important in calculating the weights in empirical Bayes (EB) analyses (Hauer et al., 2002; Wu et al., 2014; Zou et al., 2015). As a result, biased dispersion parameters lead to biased EB estimates of crashes.
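The effect of a biased inverse dispersion parameter on an EB estimate can be illustrated with the sketch below. It assumes the commonly used single-period form of the EB weight, w = 1 / (1 + μ/ϕ), where μ is the SPF prediction; other formulations (for example, length-dependent dispersion) exist and are not covered. The function name, the SPF prediction of 2 crashes and the observed count of 5 are hypothetical; the values 0.5 and 0.81 are the theoretical and estimated ϕ for NL I-2 in Table 3.

```python
def eb_estimate(spf_prediction, observed, phi):
    """Return (weight, EB estimate) using w = 1 / (1 + mu / phi)."""
    weight = 1.0 / (1.0 + spf_prediction / phi)
    return weight, weight * spf_prediction + (1.0 - weight) * observed

# Hypothetical site: SPF predicts 2 crashes, 5 were observed.
w_true, eb_true = eb_estimate(2.0, 5, phi=0.5)    # theoretical phi
w_bias, eb_bias = eb_estimate(2.0, 5, phi=0.81)   # phi estimated under misspecification
print(f"true phi:   w = {w_true:.3f}, EB = {eb_true:.2f}")
print(f"biased phi: w = {w_bias:.3f}, EB = {eb_bias:.2f}")
```

An overestimated ϕ places more weight on the SPF prediction and less on the observed count, so the EB estimate is pulled toward a prediction that is itself affected by the misspecified functional form.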

In addition to the quadratic form, several other types of nonlinear functional forms have been assumed. CMFs were generated following the same procedure described above and similar results were found (the detailed analyses are not documented here).

[Figure 2 panels: (a) I-1; (b) I-2; (c) I-3]

Figure 2 CM-Functions for lane width in scenario Nonlinear I (ϕ = 0.5).

Table 4 Bias and Error of CMFs for Lane Width in Scenario Nonlinear I

| ϕ d | Scenario | LW = 8 ft: Th. a | SPF b | Bias | E c | LW = 9 ft: Th. | SPF | Bias | E | LW = 10 ft: Th. | SPF | Bias | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | IV-1 | 2.44 | 0.76 | 1.68 | 68.9 | 1.45 | 0.81 | 0.63 | 43.8 | 1.05 | 0.87 | 0.18 | 16.9 |
| 0.5 | IV-2 | 2.44 | 0.20 | 2.24 | 91.8 | 1.07 | 0.30 | 0.77 | 72.0 | 0.70 | 0.45 | 0.25 | 36.1 |
| 0.5 | IV-3 | 0.17 | 0.63 | -0.46 | 278.3 | 0.36 | 0.70 | -0.34 | 93.7 | 0.64 | 0.79 | -0.15 | 24.1 |
| 1.0 | IV-1 | 2.44 | 0.76 | 1.68 | 68.8 | 1.45 | 0.81 | 0.63 | 43.7 | 1.05 | 0.87 | 0.17 | 16.7 |
| 1.0 | IV-2 | 2.44 | 0.21 | 2.23 | 91.5 | 1.07 | 0.31 | 0.76 | 71.4 | 0.70 | 0.46 | 0.25 | 35.2 |
| 1.0 | IV-3 | 0.17 | 0.63 | -0.47 | 282.1 | 0.36 | 0.71 | -0.35 | 95.1 | 0.64 | 0.80 | -0.16 | 24.8 |
| 2.0 | IV-1 | 2.44 | 0.75 | 1.69 | 69.4 | 1.45 | 0.80 | 0.64 | 44.4 | 1.05 | 0.86 | 0.18 | 17.4 |
| 2.0 | IV-2 | 2.44 | 0.19 | 2.25 | 92.3 | 1.07 | 0.28 | 0.79 | 73.4 | 0.70 | 0.43 | 0.27 | 38.3 |
| 2.0 | IV-3 | 0.17 | 0.57 | -0.40 | 243.2 | 0.36 | 0.66 | -0.29 | 80.0 | 0.64 | 0.75 | -0.12 | 18.2 |

Table 4 Continued

| ϕ d | Setting | LW = 11 ft: Th. | SPF | Bias | E | LW = 13 ft: Th. | SPF | Bias | E |
|---|---|---|---|---|---|---|---|---|---|
| 0.5 | IV-1 | 0.93 | 0.93 | -0.01 | 0.8 | 1.32 | 1.07 | 0.25 | 18.8 |
| 0.5 | IV-2 | 0.69 | 0.67 | 0.02 | 2.4 | 2.17 | 1.49 | 0.68 | 31.3 |
| 0.5 | IV-3 | 0.89 | 0.89 | 0.00 | 0.4 | 0.93 | 1.12 | -0.20 | 21.2 |
| 1.0 | IV-1 | 0.93 | 0.93 | -0.01 | 0.9 | 1.32 | 1.07 | 0.25 | 18.8 |
| 1.0 | IV-2 | 0.69 | 0.67 | 0.01 | 1.7 | 2.17 | 1.48 | 0.69 | 31.8 |
| 1.0 | IV-3 | 0.89 | 0.89 | 0.00 | 0.2 | 0.93 | 1.12 | -0.19 | 20.9 |
| 2.0 | IV-1 | 0.93 | 0.93 | 0.00 | 0.4 | 1.32 | 1.08 | 0.24 | 18.5 |
| 2.0 | IV-2 | 0.69 | 0.66 | 0.03 | 4.1 | 2.17 | 1.52 | 0.65 | 30.1 |
| 2.0 | IV-3 | 0.89 | 0.87 | 0.03 | 2.8 | 0.93 | 1.15 | -0.22 | 24.2 |

Note: a: theoretical CMF (assumed true specific CMFs for lane widths of 8, 9, 10, 11 and 13 ft); b: CMFs derived from SPFs (i.e., specific CMFs derived from regression models for corresponding lane widths); c: error percentage, %; d: the theoretical inverse dispersion parameter (ϕ) in each sub-scenario.


5.2 Scenario Nonlinear II: Consider three variables, only one in nonlinear form

In this scenario, the assumed CMFs for curve density and pavement friction were 1.072 and 0.973, respectively. Both were in linear form. The same three nonlinear CM-Functions for lane width were used. Three sub-scenarios, II-1, II-2 and II-3, were analyzed, as shown in Table 5.

Table 5 Assumed CM-Functions (Scenario Nonlinear II)

| # * | ln(CMF) for Lane Width (Nonlinear Level) | CMF for Curve Density | CMF for Pavement Friction |
|---|---|---|---|
| NL II-1 | $0.1\,LW^2 - 2.22\,LW + 12.28$ (W) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |
| NL II-2 | $0.2\,LW^2 - 4.22\,LW + 21.88$ (S) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |
| NL II-3 | $-0.11\,(LW-12)^2 + 0.30$ for $LW \le 12$; $-0.08\,(LW-12)^2 + 0.30$ for $LW > 12$ (W) | $1.072^{CD}$ | $0.973^{(PF-32)}$ |

Note: * # - sub-scenario number, NL - Nonlinear; LW - lane width, in ft; CD - curve density, number of curves per mi; PF - pavement friction; (W) - weak, (S) - strong.

The nonlinear level of the assumed CM-Functions for lane width in each sub-scenario was the same as the corresponding one in Nonlinear I. It was relatively strong in II-2, and weak in II-1 and II-3.

The theoretical function of the generated crash counts and the considered functional form in this scenario are shown in Equations 13 and 14, respectively.

$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} \qquad (13)$$

$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2\,LW_i + \beta_3\,CD_i + \beta_4\,PF_i) \qquad (14)$$
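The three-variable mean in Equation 13 can be written out as a short sketch. It uses the II-1 lane-width function together with the linear-form CMFs 1.072 per curve per mile and 0.973 per unit of pavement friction (base 32) from Table 5. The function name is hypothetical, and the sign reading of the lane-width quadratic is an assumption of this sketch.

```python
import math

def true_mean_three_variables(length, aadt, lw, cd, pf):
    """Equation 13 for sub-scenario II-1."""
    cmf_lw = math.exp(0.1 * lw ** 2 - 2.22 * lw + 12.28)  # nonlinear in ln(CMF)
    cmf_cd = 1.072 ** cd           # ln(CMF) is linear in CD
    cmf_pf = 0.973 ** (pf - 32)    # ln(CMF) is linear in PF, base condition 32
    return 2.67e-4 * length * aadt * cmf_lw * cmf_cd * cmf_pf

# One extra curve per mile multiplies the mean by 1.072, whatever the other values.
base = true_mean_three_variables(1.0, 5000, 11, 4.0, 30.0)
plus_one_curve = true_mean_three_variables(1.0, 5000, 11, 5.0, 30.0)
print(plus_one_curve / base)
```

The curve density and pavement friction terms are exactly log-linear, so the functional form in Equation 14 can represent them with β3 = ln(1.072) and β4 = ln(0.973); only the lane-width term falls outside what that form can express.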

The CMFs for the three variables and other modeling results of each sub-scenario are documented in Table 6. The overall result was quite similar to that in Nonlinear I. The MAD and MSPE were higher than those of linear relationships (i.e., Scenario II in the previous study). II-2 consistently had the highest MAD and MSPE. Meanwhile, the inverse dispersion parameters estimated from SPFs were biased again. The CMFs for lane width derived in this scenario were slightly different from those of Nonlinear I. The CMFs for curve density and pavement friction were very close to their true values. However, the MAD and MSPE of this scenario were higher than those of Nonlinear I under the same assumed CM-Function for lane width.

Figures 3(a) to (c) illustrate the curves of CM-Functions for lane width (ϕ = 0.5). The results were very close to those of the corresponding sub-scenario in Nonlinear I. The CMFs were all biased, especially in boundary areas. The bias of II-2 was always higher than those of II-1 and II-3 (except over a very small range around 9 ft).

The CMFs for curve density derived from SPFs in the three sub-scenarios (ϕ = 0.5) were 1.071, 1.069, and 1.073, respectively. They were quite close to the assumed true value, 1.072. The CM-Function curves are shown in Figure 3(d). The CMFs were generally acceptable. However, when comparing the results between the three sub-scenarios, it can be observed that the bias and error percentage in II-2 (strong) were always higher than those in II-1 and II-3 (weak). So, as the nonlinearity between lane width and crash risk increased, the bias of the CMF for curve density became significant. That is to say, even if the link function for one variable is correct, the accuracy of the CMF for this variable can still be influenced if incorrect or improper link functions are used for other variables in the model.

The CMFs for pavement friction produced from the three sub-scenarios (ϕ = 0.5) were 0.972, 0.971 and 0.973, respectively. The CM-Function curves are shown in Figure 3(e). The results were similar to those of curve density. Overall, the bias and error percentage were relatively small. II-2 was consistently the highest in terms of bias and error percentage.

Table 6 Results of Scenario Nonlinear II

| ϕ c | # * | CMF (SD) a: LW | CMF (SD) a: CD | CMF (SD) a: PF | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|---|---|---|
| 0.5 | NL II-1 | 0.88 (0.014) | 1.071 (0.007) | 0.972 (0.003) | 0.57 | 14677.2 | 0.62 | 1.60 |
| 0.5 | NL II-2 | 0.98 (0.015) | 1.069 (0.006) | 0.971 (0.003) | 0.79 | 14988.9 | 1.27 | 6.63 |
| 0.5 | NL II-3 | 1.33 (0.021) | 1.073 (0.006) | 0.973 (0.003) | 0.57 | 10441.8 | 0.29 | 0.52 |
| 1.0 | NL II-1 | 0.88 (0.015) | 1.072 (0.007) | 0.972 (0.003) | 1.08 | 14842.6 | 0.62 | 1.65 |
| 1.0 | NL II-2 | 0.98 (0.021) | 1.069 (0.008) | 0.971 (0.004) | 1.31 | 14908.6 | 1.26 | 6.72 |
| 1.0 | NL II-3 | 1.35 (0.024) | 1.073 (0.008) | 0.974 (0.004) | 1.07 | 10520.1 | 0.30 | 0.60 |
| 2.0 | NL II-1 | 0.88 (0.022) | 1.071 (0.008) | 0.972 (0.004) | 2.08 | 14439.9 | 0.63 | 1.76 |
| 2.0 | NL II-2 | 0.98 (0.026) | 1.069 (0.010) | 0.970 (0.005) | 2.36 | 14370.7 | 1.26 | 6.80 |
| 2.0 | NL II-3 | 1.35 (0.034) | 1.072 (0.010) | 0.974 (0.005) | 2.08 | 10277.6 | 0.31 | 0.66 |

Note: the same notes as those in Table 3.

[Figure 3 panels: (a) Lane width in II-1; (b) Lane width in II-2; (c) Lane width in II-3; (d) Curve density; (e) Pavement friction]

Figure 3 CM-Functions of variables in scenario Nonlinear II (ϕ = 0.5).

5.3 Scenario Nonlinear III: Consider three variables, two in nonlinear form

To simplify the analyses, only the first two nonlinear CM-Functions for lane width in Scenarios Nonlinear I and II were used in this scenario; the third one, the piecewise function, was removed. The assumed CMF for pavement friction was 0.973, the same as that in Scenario Nonlinear II.

Two quadratic CM-Functions for curve density were assumed, as shown in Equations 15 and 16, respectively.

$$\ln(CMF_i) = 8.7 \times 10^{-4}\,CD_i^2 + 5.56 \times 10^{-2}\,CD_i \qquad (15)$$

$$\ln(CMF_i) = 3.5 \times 10^{-3}\,CD_i^2 + 1.39 \times 10^{-2}\,CD_i \qquad (16)$$

The closest line, area and AVD for Equation 15 are $6.95 \times 10^{-2}\,CD - 0.037$, 0.229 and 0.014, respectively. Those for Equation 16 are $6.99 \times 10^{-2}\,CD - 0.149$, 0.920 and 0.057, respectively. Both the area and the AVD of the second function are much higher than those of the first one. So, the nonlinear level of the second one is stronger than that of the first.
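These closest-line values can be checked with a short derivation. It assumes (as an assumption of this sketch, not a definition restated in this section) that the closest line is the least-squares line over the continuous range $0 \le CD \le w$ with $w = 16$, and that AVD is the area divided by $w$. For a curve of the form $f(CD) = a\,CD^2 + b\,CD$ with $a > 0$:

```latex
\begin{aligned}
\text{slope} &= a\,w + b, \qquad
\text{intercept} = -\frac{a\,w^{2}}{6}, \qquad
\text{Area} = \frac{a\,w^{3}}{9\sqrt{3}}, \qquad
\text{AVD} = \frac{\text{Area}}{w} \\[4pt]
\text{Eq. 15:}\quad
\text{slope} &= 16\,(8.7\times10^{-4}) + 5.56\times10^{-2} \approx 6.95\times10^{-2}, \quad
\text{intercept} = -\frac{(8.7\times10^{-4})(256)}{6} \approx -0.037, \\
\text{Area} &= \frac{(8.7\times10^{-4})(4096)}{9\sqrt{3}} \approx 0.229, \quad
\text{AVD} \approx 0.014 \\[4pt]
\text{Eq. 16:}\quad
\text{slope} &= 16\,(3.5\times10^{-3}) + 1.39\times10^{-2} \approx 6.99\times10^{-2}, \quad
\text{intercept} = -\frac{(3.5\times10^{-3})(256)}{6} \approx -0.149, \\
\text{Area} &= \frac{(3.5\times10^{-3})(4096)}{9\sqrt{3}} \approx 0.920, \quad
\text{AVD} \approx 0.057
\end{aligned}
```

Under these assumptions the area scales linearly with the quadratic coefficient $a$, so the roughly fourfold larger coefficient in Equation 16 gives a roughly fourfold larger area and AVD, while the two closest lines keep almost the same slope.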

In total, there were four sub-scenarios in this scenario, as shown in Table 7. It can be seen that the nonlinear level of III-1 was weak in both lane width and curve density, and that of III-4 was strong in both. III-2 and III-3 each combined one weak and one strong relationship.

Table 7 Assumed CM-Functions (Scenario Nonlinear III)

| # * | ln(CMF) for Lane Width (Nonlinear Level) | ln(CMF) for Curve Density (Nonlinear Level) | CMF for Pavement Friction |
|---|---|---|---|
| NL III-1 | $0.1\,LW^2 - 2.22\,LW + 12.28$ (W) | $8.7 \times 10^{-4}\,CD^2 + 5.56 \times 10^{-2}\,CD$ (W) | $0.973^{(PF-32)}$ |
| NL III-2 | $0.2\,LW^2 - 4.22\,LW + 21.88$ (S) | $8.7 \times 10^{-4}\,CD^2 + 5.56 \times 10^{-2}\,CD$ (W) | $0.973^{(PF-32)}$ |
| NL III-3 | $0.1\,LW^2 - 2.22\,LW + 12.28$ (W) | $3.5 \times 10^{-3}\,CD^2 + 1.39 \times 10^{-2}\,CD$ (S) | $0.973^{(PF-32)}$ |
| NL III-4 | $0.2\,LW^2 - 4.22\,LW + 21.88$ (S) | $3.5 \times 10^{-3}\,CD^2 + 1.39 \times 10^{-2}\,CD$ (S) | $0.973^{(PF-32)}$ |

Note: the same notes as those in Table 5.

The theoretical function of the generated crash counts and the considered functional form used in this scenario were identical to those in Nonlinear II (i.e., Equations 13 and 14). They are reproduced below as Equations 17 and 18, respectively.

$$N_{true,i} = N_{spf,i} \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} = 2.67 \times 10^{-4} \times L_i \times AADT_i \times CMF_{LW,i} \times CMF_{CD,i} \times CMF_{PF,i} \qquad (17)$$

$$E(\Lambda) = \beta_0 \times L_i \times AADT_i^{\beta_1} \times \exp(\beta_2\,LW_i + \beta_3\,CD_i + \beta_4\,PF_i) \qquad (18)$$

The CMFs for the three variables as well as other results produced from the modeling are presented in Table 8. As in Nonlinear II, the MAD and MSPE were higher than those of linear relationships. But, surprisingly, they were always the highest in III-2 (combination of strong and weak) rather than in III-4 (strong in both). The inverse dispersion parameters estimated from SPFs were biased again in this scenario. The CMF for lane width derived in this scenario was nearly the same as that of Nonlinear II with the corresponding assumed CM-Function. The CMFs for curve density were slightly different from those in Nonlinear II. The CMFs for pavement friction were very close to the true value.

Figures 4(a) to (d) illustrate the CM-Function curves for lane width in each sub-scenario (ϕ = 0.5). The results were nearly identical to those of the corresponding sub-scenarios in Nonlinear II. The CMFs were all biased, especially around boundary areas. The calculation indicated that the bias in III-1 and III-3 (weak in lane width) was significantly lower than that in III-2 and III-4 (strong in lane width). It seems that the changes in the nonlinearity of curve density had no significant influence on the CMF for lane width.

The CMFs for curve density derived from SPFs in the four sub-scenarios (ϕ = 0.5) were 1.073, 1.072, 1.075 and 1.074, respectively. The curves are shown in Figures 4(e) to (h). The CMFs derived from SPFs were overestimated in all four sub-scenarios. When comparing the results between the two assumed CM-Functions for curve density, the bias and error percentage of III-3 and III-4 (strong in curve density) were always much higher than those of III-1 and III-2 (weak in curve density), except over a small range around 16. In short, the CMFs for curve density derived from SPFs were all biased when the relationship was nonlinear. The bias increased when the nonlinear level became stronger. Another interesting finding was that the highest bias of the CMF for curve density did not appear around the boundary areas, but near the middle. As can be seen in Figures 4(e) to (h), as curve density increased from the baseline point (i.e., zero), the bias first increased and then decreased. The highest bias was around 11. This was probably because the baseline for curve density was at the far left of the range. If the baseline were at some point in the middle (e.g., 8 or 10), the result might be similar to that of lane width: intuitively, the bias would be small around the baseline and large in the boundary areas. Nevertheless, the CMFs were still biased.

The CMFs for pavement friction produced from the four sub-scenarios (ϕ = 0.5) were 0.973, 0.972, 0.973 and 0.972, respectively. The CM-Function curves are shown in Figure 4(i). The overall results were nearly the same as those of Nonlinear II. Both bias and error percentage were relatively small.

Table 8 Results of Scenario Nonlinear III

| ϕ c | # | CMF (SD) a: LW | CMF (SD) a: CD | CMF (SD) a: PF | ϕ b | AIC d | MAD e | MSPE f |
|---|---|---|---|---|---|---|---|---|
| 0.5 | NL III-1 | 0.88 (0.014) | 1.073 (0.006) | 0.973 (0.002) | 0.57 | 14455.8 | 0.60 | 2.05 |
| 0.5 | NL III-2 | 0.97 (0.017) | 1.072 (0.005) | 0.972 (0.002) | 0.79 | 14670.4 | 1.21 | 6.80 |
| 0.5 | NL III-3 | 0.88 (0.015) | 1.075 (0.005) | 0.973 (0.003) | 0.57 | 13835.8 | 0.55 | 1.61 |
| 0.5 | NL III-4 | 0.97 (0.016) | 1.074 (0.006) | 0.972 (0.003) | 0.79 | 13980.7 | 1.08 | 5.47 |
| 1.0 | NL III-1 | 0.88 (0.015) | 1.072 (0.008) | 0.972 (0.003) | 1.07 | 14648.4 | 0.61 | 2.15 |
| 1.0 | NL III-2 | 0.97 (0.020) | 1.074 (0.008) | 0.972 (0.003) | 1.31 | 14621.6 | 1.21 | 6.89 |
| 1.0 | NL III-3 | 0.88 (0.018) | 1.073 (0.008) | 0.973 (0.003) | 1.08 | 13965.8 | 0.55 | 1.71 |
| 1.0 | NL III-4 | 0.97 (0.021) | 1.074 (0.008) | 0.972 (0.004) | 1.32 | 13978.6 | 1.08 | 5.55 |
| 2.0 | NL III-1 | 0.88 (0.021) | 1.073 (0.010) | 0.972 (0.005) | 2.09 | 14243.5 | 0.62 | 2.26 |
| 2.0 | NL III-2 | 0.98 (0.027) | 1.073 (0.010) | 0.971 (0.005) | 2.36 | 14118.3 | 1.22 | 7.16 |
| 2.0 | NL III-3 | 0.88 (0.022) | 1.073 (0.010) | 0.973 (0.004) | 2.09 | 13592.0 | 0.56 | 1.81 |
| 2.0 | NL III-4 | 0.98 (0.027) | 1.074 (0.011) | 0.971 (0.004) | 2.36 | 13420.4 | 1.08 | 5.70 |

Note: the same notes as those in Table 3.

[Figure 4 panels: (a) Lane width in III-1; (b) Lane width in III-2; (c) Lane width in III-3; (d) Lane width in III-4; (e) Curve density in III-1; (f) Curve density in III-2; (g) Curve density in III-3; (h) Curve density in III-4; (i) Pavement friction]

Figure 4 CM-Functions for variables in scenario Nonlinear III (ϕ = 0.5).

6. DISCUSSIONS AND CONCLUSIONS

This paper has documented an extensive study validating the use of regression models for estimating CMFs, with emphasis on misspecification of the link functions connecting variables and crash risk. Several nonlinear CM-Functions, mixed with linear ones, were assumed and used to generate crash counts. Then, CMFs were estimated using GLM methods and their accuracy was analyzed. The main conclusions can be summarized as follows: (1) the CMFs for variables having nonlinear effects on safety derived from SPFs are biased when using the common GLMs; (2) with the increase in nonlinearity, the bias tends to become significant (as expected); (3) the quality of CMFs for other variables having linear relationships can be influenced when mixed with those having nonlinear relationships, but the accuracy may still be acceptable; and (4) the misuse of linear link functions for one or more variables also leads to biased estimates of other parameters. This study showed the importance of link functions when modeling crashes, which is consistent with several previous studies (Miaou and Lord, 2003; El-Basyouny and Sayed, 2010; Lord and Mannering, 2010). This also explains why nonlinear models outperformed GLMs in previous studies (Lee et al., 2015; Stefanova et al., 2015; Park and Abdel-Aty, 2016). Improper functions may lead to misleading conclusions and biased CMFs. The conventional GLM should not be used for either modeling crashes or developing CMFs when the variables have strong nonlinear effects on safety, as Lee et al. (2015) and Zeng et al. (2016a) suggested.

This study is not without limitations. First, a solid model is the basis for predicting crashes and developing reliable CMFs (Wu et al., 2015). This study only considered the most frequently used one (i.e., the NB distribution), and it was used in both data generation and modeling. Second, some variables were independently and uniformly generated in this study, which might not be the case in practice (Park and Abdel-Aty, 2015b; Wu and Lord, 2016). Preliminary analyses indicated that variable distribution and correlation could also affect the modeling results. Lee et al. (2015) pointed out that variable correlation should be considered when modeling crashes. To simplify the analyses and properly quantify the biased estimates, these issues were not addressed in this study. Finally, sample size influences the modeling significantly (Lord, 2006; Lord and Miranda-Moreno, 2008). The dataset used in this study contained 1,492 observations, which should be large enough. In reality, many studies have used only hundreds or even dozens of observations due to the limited availability of crash and highway data. These issues will affect the quality of CMFs and need further consideration. Nevertheless, the simulation protocol proposed in this paper can still be applied to evaluate CMFs under different conditions.

ACKNOWLEDGEMENTS

The authors wish to thank Dr. Ezra Hauer for his comments on a previous version of the paper.

597

REFERENCES 598

Aashto, 2004. A policy on geometric design of highways and streets, 2004 Washington, D.C. : 599 American Association of State Highway and Transportation Officials, Washington, D.C. 600

Aashto, 2010. Highway safety manual, 1st Edition ed. American Association of State Highway 601 and Transportation Officials, Washington, D.C. 602

Banihashemi, M., 2015. Is horizontal curvature a significant factor of safety in rural multilane 603 highways? Transportation Research Record: Journal of the Transportation Research Board 604 (2515), 50-56. 605

Banihashemi, M., 2016. Effect of horizontal curves on urban arterial crashes. Accident Analysis 606 & Prevention 95, 20-26. 607

Bonneson, J., Pratt, M., 2008. Procedure for developing accident modification factors from 608 cross-sectional data. Transportation Research Record: Journal of the Transportation Research 609 Board 2083, 40-48. 610

El-Basyouny, K., Sayed, T., 2010. Application of generalized link functions in developing 611 accident prediction models. Safety Science 48 (3), 410-416. 612

Elvik, R., 2011. Assessing causality in multivariate accident models. Accident Analysis & Prevention 43 (1), 253-264.

Gross, F., Jovanis, P.P., 2007a. Estimation of safety effectiveness of changes in shoulder width with case control and cohort methods. Transportation Research Record: Journal of the Transportation Research Board 2019, 237-245.

Gross, F., Jovanis, P.P., 2007b. Estimation of the safety effectiveness of lane and shoulder width: Case-control approach. Journal of Transportation Engineering 133 (6), 362-369.

Gross, F., Jovanis, P.P., Eccles, K., Chen, K.-Y., 2009. Safety evaluation of lane and shoulder width combinations on rural, two lane, undivided roads. FHWA, U.S. Department of Transportation, Washington, D.C.

Gross, F., Persaud, B., Lyon, C., 2010. A guide to developing quality crash modification factors. FHWA, U.S. Department of Transportation, Washington, D.C.

Hauer, E., 2004. Statistical road safety modeling. Transportation Research Record: Journal of the Transportation Research Board 1897, 81-87.

Hauer, E., 2005a. Cause and effect in observational cross-section studies on road safety. In: Proceedings of the 84th Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.

Hauer, E., 2005b. Fishing for safety information in murky waters. Journal of Transportation Engineering 131 (5), 340-344.

Hauer, E., 2010. Cause, effect and regression in road safety: A case study. Accident Analysis & Prevention 42 (4), 1128-1135.

Hauer, E., 2013. Even perfect regressions may not tell the effect of interventions. In: Proceedings of the Transportation Research Board (TRB) 92nd Annual Meeting, Washington, D.C.

Hauer, E., 2014. Trustworthiness of safety performance functions. In: Proceedings of the 93rd Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.

Hauer, E., 2015. The art of regression modeling in road safety. Springer, USA.

Hauer, E., Council, F.M., Mohammedshah, Y., 2004. Safety models for urban four-lane undivided road segments. Transportation Research Record: Journal of the Transportation Research Board 1897, 96-105.

Hauer, E., Harwood, D.W., Council, F.M., Griffith, M.S., 2002. Estimating safety by the empirical Bayes method - a tutorial. Transportation Research Record: Journal of the Transportation Research Board 1784, 126-131.

Lao, Y., Zhang, G., Wang, Y., Milton, J., 2014. Generalized nonlinear models for rear-end crash risk analysis. Accident Analysis & Prevention 62, 9-16.

Lee, C., Abdel-Aty, M., Park, J., Wang, J.-H., 2015. Development of crash modification factors for changing lane width on roadway segments using generalized nonlinear models. Accident Analysis & Prevention 76, 83-91.

Li, X., Lord, D., Zhang, Y., 2011. Development of accident modification factors for rural frontage road segments in Texas using generalized additive models. Journal of Transportation Engineering 137 (1), 74-83.

Li, X., Lord, D., Zhang, Y., Xie, Y., 2008. Predicting motor vehicle crashes using support vector machine models. Accident Analysis & Prevention 40 (4), 1611-1618.

Lord, D., 2006. Modeling motor vehicle crashes using Poisson-gamma models: Examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis & Prevention 38 (4), 751-766.

Lord, D., Bonneson, J.A., 2007. Development of accident modification factors for rural frontage road segments in Texas. Transportation Research Record: Journal of the Transportation Research Board 2023, 20-27.

Lord, D., Guikema, S.D., Geedipally, S.R., 2008. Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accident Analysis & Prevention 40 (3), 1123-1134.

Lord, D., Kuo, P.-F., 2012. Examining the effects of site selection criteria for evaluating the effectiveness of traffic safety countermeasures. Accident Analysis & Prevention 47, 52-63.

Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transportation Research Part A 44 (5), 291-305.

Lord, D., Miranda-Moreno, L.F., 2008. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: A Bayesian perspective. Safety Science 46 (5), 751-770.

Mannering, F.L., Shankar, V., Bhat, C.R., 2016. Unobserved heterogeneity and the statistical analysis of highway accident data. Analytic Methods in Accident Research 11, 1-16.

Miaou, S.-P., Lord, D., 2003. Modeling traffic crash flow relationships for intersections - dispersion parameter, functional form, and Bayes versus empirical Bayes methods. Transportation Research Record: Journal of the Transportation Research Board 1840, 31-40.

Park, B.-J., Lord, D., Wu, L., 2016. Finite mixture modeling approach for developing crash modification factors in highway safety analysis. Accident Analysis & Prevention 97, 274-287.

Park, J., Abdel-Aty, M., 2015a. Assessing the safety effects of multiple roadside treatments using parametric and nonparametric approaches. Accident Analysis & Prevention 83, 203-213.

Park, J., Abdel-Aty, M., 2015b. Development of adjustment functions to assess combined safety effects of multiple treatments on rural two-lane roadways. Accident Analysis & Prevention 75, 310-319.

Park, J., Abdel-Aty, M., 2016. Evaluation of safety effectiveness of multiple cross sectional features on urban arterials. Accident Analysis & Prevention 92, 245-255.

Pierce, R., 2014. Definition of bias. Accessed October 2, 2016. http://www.mathsisfun.com/definitions/bias.html

Stefanova, T., Burkhardt, J.M., Filtness, A., Wullems, C., Rakotonirainy, A., Delhomme, P., 2015. Systems-based approach to investigate unsafe pedestrian behaviour at level crossings. Accident Analysis & Prevention 81, 167-186.

Wu, L., Lord, D., 2016. Investigating the influence of dependence between variables on crash modification factors developed using regression models. In: Proceedings of the 95th Annual Meeting of the Transportation Research Board (TRB), Washington, D.C.

Wu, L., Lord, D., Zou, Y., 2015. Validation of crash modification factors derived from cross-sectional studies with regression models. Transportation Research Record: Journal of the Transportation Research Board 2514, 88-96.

Wu, L., Zou, Y., Lord, D., 2014. Comparison of Sichel and negative binomial models in hot spot identification. Transportation Research Record: Journal of the Transportation Research Board 2460, 107-116.

Xie, Y., Lord, D., Zhang, Y., 2007. Predicting motor vehicle collisions using Bayesian neural network models: An empirical analysis. Accident Analysis & Prevention 39 (5), 922-933.

Xie, Y., Zhang, Y., 2008. Crash frequency analysis with generalized additive models. Transportation Research Record: Journal of the Transportation Research Board 2061, 39-45.

Zeng, Q., Huang, H., Pei, X., Wong, S.C., Gao, M., 2016a. Rule extraction from an optimized neural network for traffic crash frequency modeling. Accident Analysis & Prevention 97, 87-95.

Zeng, Z., Zhu, W., Ke, R., Ash, J., Wang, Y., Xu, J., Xu, X., 2016b. A generalized nonlinear model-based mixed multinomial logit approach for crash data analysis. Accident Analysis & Prevention 99 (Pt A), 51-65.

Zou, Y., Wu, L., Lord, D., 2015. Modeling over-dispersed crash data with a long tail: Examining the accuracy of the dispersion parameter in negative binomial models. Analytic Methods in Accident Research 5–6, 1-16.
