A Bayesian network based framework for real-time crash prediction on the basic freeway segments of urban expressways

Accident Analysis and Prevention 45 (2012) 373– 381

Moinul Hossain ∗, Yasunori Muromachi 1

Department of Built Environment, Tokyo Institute of Technology, Nagatsuta-machi, Midori-ku, Yokohama, Kanagawa 226-8502, Japan

Article info

Article history: Received 3 February 2011; Received in revised form 7 July 2011; Accepted 11 August 2011

Keywords: Real-time crash prediction model; Random multinomial logit; Bayesian belief net; Basic freeway segments

Abstract

The concept of measuring the crash risk for a very short time window in the near future is gaining more practicality due to the recent advancements in the fields of information systems and traffic sensor technology. Although some real-time crash prediction models have already been proposed, they are still primitive in nature and require substantial improvements to be implemented in real life. This manuscript investigates the major shortcomings of the existing models and offers solutions to overcome them with an improved framework and modeling method. It employs the random multinomial logit model to identify the most important predictors as well as the most suitable detector locations to acquire data to build such a model. Afterwards, it applies Bayesian belief net (BBN) to build the real-time crash prediction model. The model has been constructed using high resolution detector data collected from the Shibuya 3 and Shinjuku 4 expressways under the jurisdiction of Tokyo Metropolitan Expressway Company Limited, Japan. It has been specifically built for the basic freeway segments and it predicts the chance of formation of a hazardous traffic condition within the next 4–9 min for a particular 250 meter long road section. The performance evaluation results reflect that at an average threshold value the model is able to successfully classify 66% of the future crashes with a false alarm rate of less than 20%.

∗ Corresponding author. Tel.: +8801938296371; fax: +81 45 924 5524. E-mail addresses: [email protected], [email protected] (M. Hossain), [email protected] (Y. Muromachi).
1 Tel.: +81 45 924 5524; fax: +81 45 924 5524.

0001-4575/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.aap.2011.08.004

1. Background

Road crash was conventionally believed to be a complex phenomenon involving the interaction of factors related to three major components: road geometry and environment, vehicle and human (Sabey and Staughton, 1975; Treat et al., 1977). Oh et al. (2001) introduced a fourth component, the traffic dynamics, suggesting that crashes involving safe vehicles regularly occur due to the sudden formation of disrupted traffic conditions even on geometrically correct roads under favorable driving conditions. This created the opportunity to address a shortcoming of the conventional crash prediction models, which employ aggregated measures of traffic flow variables (e.g., speed limits for speed, AADT for flow, etc.) to identify hazardous locations. Since then, a small group of researchers, mainly from North America, have been promoting the idea of predicting crashes in real-time by using high-resolution detector data (Oh et al., 2001, 2005, 2006; Abdel-Aty et al., 2004, 2005, 2008; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2005, 2006a,b, 2007; Lee et al., 2003, 2006). They have advocated for developing a proactive system capable of timely spotting an evolving hazardous condition that can be countervailed with various traffic smoothing measures. Although this new concept of real-time crash prediction exhibits huge promise, being in its infancy, the available models are still conceptual. As far as the authors of this paper are aware, none of these models have been implemented in a practical scenario so far. The major shortcomings of the existing models can largely be classified into three groups:

i) Location of detector: the performance of the proposed models relies heavily on the location of the detectors that are selected with respect to the crash location to fathom the risk of a future crash. The majority of the previous studies have been conducted in the USA, more precisely on Interstates 4 (Abdel-Aty et al., 2004, 2005; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2005, 2006a,b, 2007), 5 (Zheng et al., 2010), 405 (Oh et al., 2006) and 880 (Oh et al., 2005). The rest took place on the Gardiner Expressway of Toronto, Canada (Lee et al., 2003) and on the expressways near the Utrecht region, Europe (Abdel-Aty et al., 2008). Most of these studies advocated for collecting data from both upstream and downstream of the crash location. However, the locations of the detectors varied due to high inter-detector spacing. Where the study sections on I-4 have an inter-detector spacing of around 0.8 kilometers, the inter-detector spacing on I-5 varies between 0.6 and 3.9 kilometers. For the Utrecht region expressways (Dutch) the detector spacing is significantly different from that of the I-4 as it never exceeded 800 meters but has a high standard deviation (Abdel-Aty et al., 2008). Hence, it is difficult for the expressway authorities interested in installing real-time hazard monitoring systems to decide how they will lay out detectors on future urban expressways. Likewise, if they are interested in monitoring specific locations on an existing expressway with existing detectors, then they may require further guidance on selecting the appropriate detector combinations to extract data for the system.

ii) Variable space: the potential variable space of the existing studies has been substantially large and diverse considering the crash sample size. Although the most common predictors have been the average, standard deviation and coefficient of variation of speed, flow and occupancy aggregated at different upstream and downstream detector locations with respect to the crash location (Abdel-Aty et al., 2004, 2005; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2005, 2006b, 2007), some also involved density, queue length, exposure (Lee et al., 2003), longitudinal (Lee et al., 2003) and lateral (Lee et al., 2003, 2006; Pande and Abdel-Aty, 2006a) difference in traffic flow variables, safe stopping distance of individual vehicles (Oh et al., 2006), average flow ratio calculated from the peak flow (Lee et al., 2006), road geometry (directly or indirectly) (Pande and Abdel-Aty, 2006b; Lee et al., 2003), etc. This is because (a) road crash is a highly complex phenomenon that involves a wide range of variables, and (b) on many occasions the quality as well as availability of detector data need to be compensated with surrogate variables. This induces a classical situation involving a large variable space and a small sample size, and it requires a suitable method to select the most important variables. Where some studies employed engineering judgment to choose the variables, others applied statistical methods (testing the significance by developing logistic regression models with one variable at a time) as a solution to the problem (Abdel-Aty et al., 2004, 2005; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2005, 2006b, 2007; Zheng et al., 2010). A more robust approach was adopted by Pande and Abdel-Aty (2006a), who applied classification trees, and Abdel-Aty et al. (2008), who chose random forest to rank the variable importance. However, both random forest and classification trees can be susceptible to interval data as they can be biased towards variables with a high number of categories (Strobl et al., 2007). Moreover, the studies considered either the traffic conditions in the upstream and the downstream or their longitudinal variation separately to model crash risk. They found positive correlation in both situations, indicating that considering both variable types together might have improved the prediction performance of the models.

iii) Modeling method: the typical modeling methods employed for real-time crash prediction models so far can be broadly classified into statistical methods and artificial intelligence or data mining based methods. The former includes matched case–control logistic regression (Abdel-Aty et al., 2004, 2005; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2007; Lee et al., 2006; Zheng et al., 2010), aggregate log linear model (Lee et al., 2003), Bayesian statistics (Oh et al., 2001), etc. The latter encompasses different kinds of neural networks (Abdel-Aty et al., 2008; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2006b; Oh et al., 2005), fuzzy logic (Oh et al., 2006) and classification trees (Pande and Abdel-Aty, 2006a). Traffic flow variables, e.g., speed, flow and occupancy, are highly correlated in nature (Gazis, 2002). Thus, when modeled using statistical approaches, most of them get dropped as part of the modeling process. Hence, it is important to employ methods that can accommodate correlated variables and make the best use of every available piece of information to improve the prediction success. Neural network based modeling methods (e.g., probabilistic neural network) can accommodate correlated variables. However, they expect sufficient prior knowledge regarding the problem domain exhibited through the interrelationship among the predictors. Furthermore, studies of this nature are highly resource demanding and many times data on all the variables are not available during the time of modeling. Therefore, a modeling method that can accommodate future new variables as well as knowledge from new data in the course of time without requiring rebuilding or recalibrating the whole model is highly desirable.

This study addresses the aforementioned shortcomings by proposing a Bayesian belief net (BBN, also known as Bayesian network) based framework to develop real-time crash prediction models. The research selects an urban expressway harboring uniformly yet densely packed detectors. It applies the variable importance measure of random multinomial logit (RMNL), a recently introduced hybrid of conventional multinomial logit and random forest methods that can handle interval data, to identify and rank the most important variables. Later, it applies BBN as the modeling method.

The manuscript is organized into five sections. The introductory section has laid out the background and stated the purpose and objective of the study. Section 2 describes the activities involving experimental design, data extraction and processing. Section 3 presents a brief but self-contained introduction to RMNL and BBN. Section 4 discusses the model building and evaluation process. The concluding section summarizes the salient contributions and findings of the study and identifies its limitations and the scope for future work.

2. Study area and data preparation

2.1. Study area

The study area has been chosen based on: (i) quality of detector data, (ii) accuracy in reported crash time and (iii) sample size. Considering these, and based on the recommendation of the Tokyo Metropolitan Expressway Company Limited, the Shibuya 3 and Shinjuku 4 routes under their jurisdiction are chosen as the study area. Both expressways have mostly two lanes in each direction. They are two of the busiest expressways in Japan and are situated in the heart of the Tokyo metropolitan area. Another salient feature of these expressways is their level of sophistication, as they harbor 210 detectors within just 25.4 kilometers (Shibuya 3 = 11.9 kilometers; Shinjuku 4 = 13.5 kilometers, in each direction) with an inter-detector spacing of roughly 250 meters. This provides an excellent opportunity to experiment with different detector combinations and identify the most suitable detector layout plan for monitoring hazardous traffic condition formation in real-time. The detectors in the study area store data on speed, vehicle count, occupancy and number of heavy vehicles per lane for each 8 milliseconds round the clock (24 h a day, 365 days a year). Later the data of all the lanes are aggregated for every 5 min by the authority. The crash data contain date, time in minutes, location (to the nearest 10 meters), vehicles involved, type of crash and other related information. The data have been supplied by Tokyo Metropolitan Expressway in two stages. The first dataset contains detector data and corresponding crash data from December, 2007 to March, 2008 for Shibuya 3 expressway and December, 2007 to November, 2008 for Shinjuku 4 expressway. The second dataset covers data for the time period between April, 2008 and October, 2009 for Shibuya 3 expressway and December, 2008 and October, 2009 for Shinjuku 4 expressway.


2.2. Experimental design and data extraction

The study addresses real-time crash prediction as a classification issue. A traffic condition persisting before a crash is assumed to be associated with the crash and is often referred to as the pre-crash condition, whereas a traffic condition matched on specific criteria that did not lead to a crash is labeled as a normal traffic condition. Then the model is built to maximize the overall classification accuracy. The experimental design encompasses two activities: defining the pre-crash and normal traffic conditions, and identifying the detector(s) to be used as reference for the traffic data extraction. Oh et al. (2001, 2005) defined the pre-crash condition as a time period starting just before the crash and extending up to a 5 min time period before the crash. Zheng et al. (2010) followed a similar approach with the exception that the duration was 10 min. Abdel-Aty et al. (2008) compared the time period recommended by Oh et al. (2001, 2005) with a time period that is between 5 and 10 min before the crash occurrence and found the latter to be more significant. Pande et al. (2005) justified the selection of a 5 min aggregation by comparing the 3 min and 5 min aggregations. This study defines the pre-crash condition as a 5 min time period ending at least 4 min before the crash occurrence. The time of crash has been reported to the nearest 1 min and the detector data have been aggregated for 5 min. Thus, if a crash had occurred on 14 January, 2008 at 9:39 am then the corresponding pre-crash condition will be the traffic data from 9:30 am to 9:35 am on that day. Regarding the normal traffic conditions, there exist two prominent approaches. Oh et al. (2001, 2005) defined it as a 5 min time period ending 30 min prior to the crash occurrence. Abdel-Aty et al. (2008) defined it as a time period same as the pre-crash time window but taking place on all other same days of the week throughout the dataset. This study followed a similar approach but took the further precaution of eliminating all the normal traffic condition data for which a crash had occurred within 1 h before or after the extracted normal traffic condition on those days. Later, a subset of the normal traffic condition data was selected to compare with the pre-crash condition. The approach is similar to the matched case–control design (Breslow and Day, 1980), where the pre-crash and normal condition datasets represent respectively the cases and the controls.
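The window rule above can be made concrete with a short sketch. It is only an illustration of the stated definition (a 5 min aggregation interval ending at least 4 min before the reported crash time); the function name `pre_crash_window` and the assumption that aggregation intervals are aligned to clock minutes :00, :05, :10, ... are ours, not the paper's.

```python
from datetime import datetime, timedelta

AGG_MIN = 5   # detector aggregation interval in minutes (from the text)
LEAD_MIN = 4  # minimum gap between window end and crash time (from the text)

def pre_crash_window(crash_time):
    """Return (start, end) of the latest 5 min interval ending >= 4 min before the crash."""
    latest_end = crash_time - timedelta(minutes=LEAD_MIN)
    # Snap down to the previous 5 min boundary (assumed alignment: :00, :05, ...).
    floored_minute = (latest_end.minute // AGG_MIN) * AGG_MIN
    end = latest_end.replace(minute=floored_minute, second=0, microsecond=0)
    return end - timedelta(minutes=AGG_MIN), end

# The example from the text: a crash reported at 9:39 am on 14 January 2008.
start, end = pre_crash_window(datetime(2008, 1, 14, 9, 39))
print(start.time(), end.time())  # 09:30:00 09:35:00
```

Under this rule the gap between the end of the window and the crash is always at least 4 min and less than 9 min, which matches the 4–9 min prediction horizon quoted in the abstract.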

In almost all of the previous studies reviewed in this manuscript, the detectors used for data extraction were referred to by their relative order of presence with respect to the crash location rather than by their approximate distance. Among the few who have expressed the detector locations on a numeric scale, Abdel-Aty et al. (2008) considered three detector arrangements: the first upstream and downstream detectors away from the crash location by at least 250 meters (position 1), 500 meters (position 2) and 1000 meters (position 3). Their outcome indicates that position 2 classifies pre-crash conditions the best. However, it is difficult to use their detector spacing in this study as they did not consider the influence of ramp areas. Previous studies regarding traffic conditions relating to crashes on highways suggest that the variation in traffic flow variables is significant between the basic freeway segments and the ramp areas (Jovanis and Chang, 1986). Therefore, in this manuscript, a new experimental design has been introduced. This study divides the whole road length into 250 meter road sections, which is the approximate inter-detector spacing of the study area, in such a way that every section harbors one detector. Then, for every crash case, data from two upstream detectors, two downstream detectors and the detector within the section are extracted from the detector database following the definition of the pre-crash traffic condition (see Fig. 1). The two upstream and two downstream detectors are approximately 250 meters and 500 meters apart with reference to the section under consideration. This way, the outcome of the model is bound to a location (250 meter section) as well as a time (at least 4 min reaction time to implement the interventions). Corresponding normal traffic condition data have been extracted in a similar way. The crash cases falling within 375 meters of the ramps are not considered in this study because, as mentioned earlier, the authors recommend separate models for the basic freeway segments and the ramp areas due to their highly distinguishable traffic patterns. However, the proposed framework can also be followed to build real-time crash prediction models for the ramp areas.

Fig. 1. Selected positions of detectors for data extraction.

2.3. Data preparation

For every crash case, the data point contained information on the 5 min cumulative vehicle count, the number of heavy vehicles, average speed and average occupancy from all five detectors. For convenience of understanding, a nomenclature for these variables is introduced in the form dαX, where d represents detector, α is the position of the detector (between 1 and 5) and X is the variable (q for 5 min cumulative vehicle count, p for 5 min cumulative count of heavy vehicles, v for average speed over 5 min and o for average occupancy over 5 min). Apart from considering only the traffic at detector locations, in order to capture their spatial variation, another new variable is introduced in the form dαβX, where α and β represent respectively the downstream and the upstream detector positions and X represents the variable under consideration ('q', 'p', 'v' or 'o'). Hence, d24v represents the difference in speed between the downstream detector position 2 and the upstream detector position 4. At this point, the study introduces another variable called 'congestion index' (CI) at each detector position to capture the combined impact of speed and flow. Dias et al. (2009) argued that the level of congestion is a less biased representation of speed, as a speed that can be considered low for a specific road section may emerge as too high for another road section due to the road geometry and other physical constraints. Appreciating the logic, this study adds CI at all 5 positions to the previously declared variable space. The CI for any detector position can be calculated as:

$$\text{congestion index (CI)} = \begin{cases} \dfrac{\text{free flow speed} - \text{speed}}{\text{free flow speed}}, & \text{when CI} > 0 \\ 0, & \text{when CI} \le 0 \end{cases} \qquad (1)$$

The free flow speed for each detector position is calculated by preparing the speed-flow and speed-occupancy diagrams and choosing the value by observation. The variable is represented as dαi. Thus, the final variable space consists of 65 predictors, of which values for 25 predictors (5 variables for each detector position × 5 detector positions = 25) are directly yielded by the detectors and the remaining 40 are calculated by comparing the longitudinal difference in values for these predictors among different detector positions.
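As a minimal sketch of how the derived predictors could be computed from 5 min detector records: the congestion index follows Eq. (1), while the helper names, the dictionary layout, the numeric readings and the sign convention of the spatial difference (downstream minus upstream) are assumptions made here for illustration only.

```python
def congestion_index(speed, free_flow_speed):
    """Eq. (1): relative speed drop below free flow, set to zero when not positive."""
    ci = (free_flow_speed - speed) / free_flow_speed
    return ci if ci > 0 else 0.0

def spatial_difference(readings, downstream_pos, upstream_pos, var):
    """Value of d<downstream><upstream><var>; downstream minus upstream is an assumed convention."""
    return readings[downstream_pos][var] - readings[upstream_pos][var]

# Hypothetical 5 min readings at detector positions 2 and 4 (illustrative values only).
readings = {
    2: {"v": 40.0, "o": 30.0},
    4: {"v": 60.0, "o": 18.0},
}
# Assumed free flow speeds; in the study they are read off speed-flow diagrams.
free_flow = {2: 80.0, 4: 80.0}

d2i = congestion_index(readings[2]["v"], free_flow[2])  # 0.5
d4i = congestion_index(readings[4]["v"], free_flow[4])  # 0.25
d24v = spatial_difference(readings, 2, 4, "v")          # -20.0
d24o = spatial_difference(readings, 2, 4, "o")          # 12.0
```

Normalizing by the free flow speed of each position is what makes the index comparable between sections whose geometry supports different speeds.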


As mentioned earlier, data were received for this study in two stages. The first stage data cover a period from December, 2007 to March, 2008 for Shibuya 3 and December, 2007 to November, 2008 for Shinjuku 4 expressway. From this, 189 crash cases (basic freeway segments only) and their corresponding 6478 normal traffic condition data points have complete information on all the variables yielded from the five detector positions. This dataset facilitated the process of variable importance ranking. The second dataset is for a time period from April, 2008 to July, 2009 and December, 2008 to July, 2009 for the aforementioned expressways, respectively. A combined dataset has been considered for the model building and evaluation of the real-time crash prediction model. From this dataset, the crash data of the last two months have been kept for evaluation and the rest of the data have been dedicated to model building.

3. Methodology

Random multinomial logit (RMNL) and Bayesian belief net (BBN) are the two major methods employed in this study. Where the former has been adopted to scientifically identify and rank the major predictors from a large variable space, the latter serves in building the real-time crash prediction model.

3.1. Random multinomial logit (RMNL)

Crash is a complex phenomenon and a wide range of variables needs to be accounted for explicitly to make predictions. Nevertheless, when the data are limited, researchers do not have the luxury of utilizing a large number of predictors for model building and it becomes important to intelligently choose a small subset from the available variable space and maximize the prediction success. Logistic regression has so far been a common choice for the researchers building real-time crash prediction models (Abdel-Aty et al., 2004, 2005; Abdel-Aty and Pande, 2005; Pande and Abdel-Aty, 2005, 2006b, 2007). However, logistic regression is weak at handling too many highly correlated predictors at a time. Abdel-Aty et al. (2008) made a significant improvement over this approach by applying random forest, which measures the variable importance by creating a collection of classification trees generated by random sampling of data as well as random variable selection. Although the method is nowadays widely accepted for its stability, unbiased results and capability to handle a large variable space with a small sample size, it can still be susceptible to bias when one or a group of variables have a relatively larger number of classes (Strobl et al., 2007). Prinzie and Poel (2008) first came up with the idea of developing a hybrid model by combining the benefits of both random forest and logit models and they named it random multinomial logit (RMNL).

RMNL is based on the idea (similar to random forest) of multiple random sampling of data and building models based on a randomly selected variable space. However, it differs from random forest by replacing the generated classification trees with trees of multinomial logit models. This study deals with a problem domain having a dichotomous outcome. Therefore, it replaces the multinomial logit models of RMNL with logistic regression. The step by step process of measuring variable importance is as follows:

i) Let $L$ be the complete dataset with $M$ predictors (here, 65 predictors as mentioned in Section 2) and $N$ records (including both traffic conditions leading to crash and normal traffic conditions). Let $L_b$ be the $b$-th bootstrap sample created by randomly selecting $n$ samples with replacement from $L$. The rest of the data, i.e., $L - L_b$, are called the out of bag (OOB) data of the $b$-th bootstrap sample. In this study, $n$ is 2/3rd of the complete dataset and the OOB is the remaining 1/3rd of the data. Let $B$ be the total number of bootstrap samples created from the complete dataset.

ii) The next step creates $B$ trees of logit models using the $B$ bootstrap samples. However, for the $b$-th tree $T_b$, instead of constructing a logit model with all $M$ predictors, $m$ predictors are randomly selected out of the available $M$ and a logistic regression model is built with them.

iii) Estimating OOB error rate: the misclassification rate is calculated for each and every bootstrap iteration. For example, for the tree $T_b$, the $L - L_b$ dataset is used to calculate its misclassification rate $r_b$. This is achieved by comparing the predicted outcome of the $L - L_b$ dataset with the actual outcome. Lastly, the $r_b$ of all the $B$ trees of logistic regression models are aggregated to calculate the OOB error rate.

iv) Variable importance: the idea of variable importance in RMNL is the same as that of random forest. It is measured by permuting the values of each variable (one variable at a time) in each of the $B$ trees and then averaging the change in error rate caused by altering the values of each variable. The variable whose permutation yields the highest error rate is considered the most important variable. The underlying idea is that any error in measuring its value has the highest impact on the classification performance. Thus, the values of the $j$-th predictor of the $M$ predictors in $L - L_b$ are permuted and the new dataset is used to calculate the misclassification rate $r_b^j$. Here, $|r_b - r_b^j|$ is the variable importance $V_j$ of the $j$-th variable in the $b$-th tree. The process is repeated for the $B$ logistic regression trees and the final variable importance is calculated by averaging the $V_j$ of each variable. Here, $j = 1$ to $M$.
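The four steps can be sketched in a few dozen lines. This is a simplified illustration on synthetic data, not the authors' R implementation: the plain gradient-ascent logistic fit, the toy dataset, the parameter defaults and the choice to average $V_j$ only over the trees in which predictor $j$ was selected are all assumptions made here.

```python
import math
import random

def fit_logit(X, y, epochs=100, lr=0.5):
    """Fit logistic regression by batch gradient ascent; returns [bias, w1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def error_rate(w, X, y):
    """Misclassification rate with a 0.5 probability cut-off (z > 0)."""
    wrong = 0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        wrong += int((z > 0) != (yi == 1))
    return wrong / len(y)

def rmnl_importance(X, y, B=30, m=2, seed=0):
    rng = random.Random(seed)
    N, M = len(X), len(X[0])
    n = round(2 * N / 3)
    totals, counts = [0.0] * M, [0] * M
    for _ in range(B):
        # Step i: bootstrap sample L_b and its out-of-bag rows L - L_b.
        in_bag = [rng.randrange(N) for _ in range(n)]
        bag = set(in_bag)
        oob = [i for i in range(N) if i not in bag]
        if not oob:
            continue
        # Step ii: logistic "tree" T_b on m randomly chosen predictors.
        feats = rng.sample(range(M), m)
        w = fit_logit([[X[i][j] for j in feats] for i in in_bag],
                      [y[i] for i in in_bag])
        # Step iii: OOB misclassification rate r_b.
        Xo = [[X[i][j] for j in feats] for i in oob]
        yo = [y[i] for i in oob]
        r_b = error_rate(w, Xo, yo)
        # Step iv: permute one predictor at a time and record |r_b - r_b^j|.
        for k, j in enumerate(feats):
            col = [row[k] for row in Xo]
            rng.shuffle(col)
            Xperm = [row[:k] + [c] + row[k + 1:] for row, c in zip(Xo, col)]
            totals[j] += abs(r_b - error_rate(w, Xperm, yo))
            counts[j] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Synthetic data: predictor 0 drives the outcome, predictors 1 and 2 are noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] + 0.3 * data_rng.gauss(0, 1) > 0 else 0 for row in X]
importance = rmnl_importance(X, y)
```

On data like this the informative predictor should receive by far the largest score, because permuting it pushes the OOB error towards chance while permuting a noise predictor changes little.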

3.2. Bayesian belief net (BBN)

3.2.1. The concept

Bayesian belief net (BBN), also widely known as the Bayesian network, is a relatively new method in the artificial intelligence (AI) probability and uncertainty community with multifarious usage (e.g., reasoning under uncertainty, making predictions of highly uncertain phenomena, etc.). Rather than building a model focusing on the problem, BBN models the system, which can then be used to understand a phenomenon or make predictions about events. BBN is highly effective in situations where inferences are not warranted logically but, rather, probabilistically. It is a directed acyclic graph (DAG) which illustrates a factorization of a joint probability distribution over the variables that are represented by the nodes of the DAG, where the factorization is given by the directed links of the DAG (Charniak, 1991). Hence, BBN is a graphical modeling method represented with a graph and a basic equation. The graphical part of the BBN consists of two types of elements: a set of nodes and a set of directed edges, which are also known as arcs. The nodes are the variables, each having a finite set of mutually exclusive states, and their inter-relationship is represented with the arcs. Though this inter-relationship in general refers to cause and effect, that is not a strict requirement for building a BBN (Jensen and Nielsen, 2007). To explain the numerical part, let us define a BBN over a universe of variables $U = \{A_1, \ldots, A_n\}$. Then the BBN can be specified with a joint probability distribution $P(U)$ that can be obtained from the product of all conditional probability tables within the BBN:

$$P(U) = \prod_{i=1}^{n} P(A_i \mid \mathrm{pa}(A_i)) \qquad (2)$$

This is known as the chain rule for BBN (Jensen and Nielsen, 2007). Here, $\mathrm{pa}(A_i)$ stands for the parents of $A_i$. Now, if new findings $e_1, \ldots, e_m$ on some of the variables within $U$ are obtained and if the joint probability distribution $P(U)$ is known then Eq. (2) can be re-written as Eq. (3):

$$P(U, e) = \prod_{i=1}^{n} P(A_i \mid \mathrm{pa}(A_i)) \prod_{j=1}^{m} e_j \qquad (3)$$

Now the probability of any variable $A$ within $U$ can be calculated through marginalizing $P(U, e)$ as shown in Eq. (4):

$$P(A \mid e) = \frac{\sum_{U \setminus A} P(U, e)}{P(e)} = \frac{\sum_{U \setminus A} P(U, e)}{\sum_{A} P(A, e)} \qquad (4)$$

Fig. 2. Parent divorcing mechanism in BBN.
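To make Eqs. (2)–(4) tangible, the sketch below evaluates them by brute-force enumeration on a toy network of three binary variables, A1 → A3 ← A2. The structure and every probability in it are invented for illustration; they are not taken from the crash model. Practical BBN software uses far more efficient propagation schemes than enumeration.

```python
from itertools import product

# Hypothetical network A1 -> A3 <- A2; all variables are binary (0/1).
parents = {"A1": [], "A2": [], "A3": ["A1", "A2"]}
# cpt[var][parent states] = P(var = 1 | parents); values are illustrative only.
cpt = {
    "A1": {(): 0.3},
    "A2": {(): 0.6},
    "A3": {(0, 0): 0.05, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.9},
}
order = ["A1", "A2", "A3"]

def joint(assign):
    """Eq. (2): product of P(A_i | pa(A_i)) over all variables."""
    p = 1.0
    for var in order:
        p_one = cpt[var][tuple(assign[q] for q in parents[var])]
        p *= p_one if assign[var] == 1 else 1.0 - p_one
    return p

def posterior(query, evidence):
    """Eq. (4): marginalize P(U, e) over everything but the query, then divide by P(e)."""
    scores = {0: 0.0, 1: 0.0}
    for states in product([0, 1], repeat=len(order)):
        assign = dict(zip(order, states))
        # Eq. (3): each finding acts as a 0/1 factor that removes inconsistent states.
        if all(assign[v] == s for v, s in evidence.items()):
            scores[assign[query]] += joint(assign)
    p_e = scores[0] + scores[1]
    return {s: scores[s] / p_e for s in scores}

post = posterior("A1", {"A3": 1})  # P(A1 | A3 = 1)
```

With these numbers, observing A3 = 1 raises the probability of A1 = 1 from the prior 0.3 to 0.21 / 0.308, roughly 0.68.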

3.2.2. Parent divorcing technique

A major drawback of BBN is its computational complexity. Imagine a BBN (or a part of one) where a variable $Y$ has a large number of parents $X_1, \ldots, X_n$. If $Y$ has two states ($y_1$, $y_2$) and each of its parents has 10 states then its conditional probability table $P(Y \mid X_1, \ldots, X_n)$ will contain $2 \times 10^n$ cells (see Fig. 2a). This will increase the calculation difficulty of the model substantially. Olesen et al. (1989) introduced a modeling technique widely known as 'parent divorcing' that can be used in such situations to reduce the complexity of graphical models. It introduces a layer of intermediate variables $I_1, \ldots, I_n$ in such a way that every intermediate variable $I_i$ becomes the child of its corresponding parent $X_i$ and $I_i$ becomes the parent of $Y$ (see Fig. 2b). Here $I_i$ has fewer states than $X_i$. In this example, if each $I_i$ has only 3 states then the conditional probability table $P(Y \mid I_1, \ldots, I_n)$ will have only $2 \times 3^n$ cells.
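The saving can be checked with simple arithmetic. The sketch below counts table cells for a hypothetical case of n = 4 parents; the helper `cpt_cells` is ours, and it also counts the small tables $P(I_i \mid X_i)$ that the intermediate layer adds, which the paragraph above does not spell out.

```python
def cpt_cells(child_states, parent_state_counts):
    """Number of cells in the conditional probability table P(child | parents)."""
    cells = child_states
    for k in parent_state_counts:
        cells *= k
    return cells

n = 4  # hypothetical number of parents

direct = cpt_cells(2, [10] * n)         # P(Y | X1..Xn): 2 * 10**4 = 20000 cells
divorced_y = cpt_cells(2, [3] * n)      # P(Y | I1..In): 2 * 3**4 = 162 cells
intermediates = n * cpt_cells(3, [10])  # n tables P(Ii | Xi): 4 * 30 = 120 cells

print(direct, divorced_y + intermediates)  # 20000 282
```

The direct table grows exponentially in the number of states of the original parents, whereas the divorced structure grows exponentially only in the much smaller number of intermediate states.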
Fig. 3. Variable importance for (a) top 10 variables within the variable space, (b) eac

is and Prevention 45 (2012) 373– 381 377

4. Model building

The BBN based real-time crash prediction model has been builtand evaluated in three interlinked phases with three datasets. Thefirst phase is for variable selection that involves 189 crash data andits corresponding 6478 normal traffic condition data. The secondphase constructs the model by finalizing the acyclic directed graphafter simplifying it using parent divorcing technique and then gen-erates the conditional probability tables for each variable. The lastphase evaluates the model performance using a separate datasetwhich lays out a strategy for implementing the model in real-lifesituations.

4.1. Variable selection

Random multinomial logit (RMNL) method was applied on thedataset destined for variable selection to identify and rank the mostimportant variables. A step-wise iteration method has been fol-lowed in which at each step 100 logistic regression trees weregrown by randomly selecting 4 variables at a time. There was no dif-ference in the yielded raw importance value up to the 4th decimalpoint after growing 500 trees. The selected outcomes of the anal-ysis are presented in Fig. 3. Here, Fig. 3a shows the top ten mostimportant variables. Fig. 3b presents the cumulative importance ofdetectors in each location which has been calculated by summingthe importance of all five variables (e.g., for detector in position 2,d2 is the summation the importance of d2q, d2p, d2v, d2o, d2i).Subsequently, Fig. 3c exhibits the importance of detector combina-tion calculated based on their spatial variation (e.g., for the spatialvariation between positions 2 and 4, d24 is the summation of theimportance of d24q, d24p, d24v, d24o and d24i). Fig. 3b accentu-ates that the variables yielded by detector position 2 (i.e., d2) havethe highest impact in distinguishing between crash prone and nor-mal traffic conditions followed by detector position 4 (i.e., d4). Eventhe spatial variations in traffic flow variables are best captured bythe detector combination of positions 2 and 4 (Fig. 3c). Thus, if areal-time crash prediction model is to be built using a combina-tion of detectors in two locations then detector positions 2 and4 should be chosen to maximize the prediction success. Regard-ing the variables to be considered in the model yielded by thesetwo detectors, the four most important variables are: d2i, d4i, d24oand d24v. The variable d3i, although emerged as more importantthan d24o and d24v, it was not considered for model building asthen it will require data from 3 detectors to be extracted. More-over, d3 has a relatively lower importance than d4. 
The selectionof the aforementioned 4 variables has several advantages. As thevariables come from two separate detectors, failure of one detector

same time, it will not depend on data to be extracted from too manydetectors at a time. The newly developed model will consider bothpoint data and longitudinal variation. Moreover, considering fewer

Fig. 3. … each detector position, and (c) detector combinations (for spatial variation only).


Table 2
Outcome of the logistic regression models.

              Estimate   Std. err.   Z-value   Pr(>|z|)
(a) Model 1
(Intercept)   −4.562     0.067       −66.43    <2E−16***
d2i            2.486     0.138        17.97    <2E−16***
d24o          −0.132     0.005       −26.22    <2E−16***

(b) Model 2
(Intercept)   −4.736     0.072       −65.72    <2E−16***
d4i            3.178     0.132        24.11    <2E−16***


number of variables ensures faster calculation time during real-time operation.

There is no ready-made software package available yet for RMNL. Therefore, this study followed the instructions presented by Prinzie and Poel (2008) and implemented the RMNL algorithm in the R program (Ihaka and Gentleman, 1996).
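For readers who want a feel for the general idea, the following is a rough, self-contained sketch of an RMNL-style importance ranking for a binary outcome. It is not the authors' R implementation and not a faithful reproduction of Prinzie and Poel (2008). Each base model is a plain logistic regression fitted on a bootstrap sample with a random subset of variables, and importance is taken as the drop in out-of-bag accuracy after permuting one variable. All function names, the synthetic data and the settings (20 models, 2 variables per model) are assumptions chosen for illustration.

```python
import math
import random

def fit_logit(X, y, epochs=100, lr=1.0):
    """Fit a binary logit model by batch gradient descent; returns [intercept, *slopes]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    """Share of rows whose predicted class (z > 0 means class 1) matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        hits += int((z > 0) == (yi == 1))
    return hits / len(y)

def rmnl_importance(X, y, n_models=20, m=2, seed=0):
    """Average out-of-bag permutation importance over bagged logit models."""
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    total, used = [0.0] * k, [0] * k
    for _ in range(n_models):
        feats = rng.sample(range(k), m)              # random variable subset
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        w = fit_logit([[X[i][f] for f in feats] for i in boot],
                      [y[i] for i in boot])
        X_oob = [[X[i][f] for f in feats] for i in oob]
        y_oob = [y[i] for i in oob]
        base = accuracy(w, X_oob, y_oob)
        for pos, f in enumerate(feats):
            column = [row[pos] for row in X_oob]
            rng.shuffle(column)                      # break the link to the outcome
            permuted = [row[:pos] + [c] + row[pos + 1:]
                        for row, c in zip(X_oob, column)]
            total[f] += base - accuracy(w, permuted, y_oob)
            used[f] += 1
    return [t / u if u else 0.0 for t, u in zip(total, used)]

# Synthetic demonstration data: only variable 0 carries information about y.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]
importance = rmnl_importance(X, y)
print(importance)
```

On this synthetic data the informative variable should receive a clearly larger importance than the noise variables, which is the behaviour a variable-ranking step relies on.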

4.2. The model construction

In this phase, data were extracted from the combined dataset for all the crash cases having complete information on the variables yielded by detector positions 2 and 4. The dataset contains 722 crash cases and their corresponding 26,899 normal traffic condition data points. Subsequently, the dataset was divided into two parts: one for model building, and the last two months’ crash data (77 crash cases with the corresponding 2695 normal traffic condition data points) for evaluation. The resulting dataset for model building encompasses 645 crash cases (including the 189 crashes employed for variable selection) and 24,204 normal traffic condition data points. Of these crashes, 472 (73.5%), 76 (11.8%), 94 (14.6%) and 3 (<1%) are rear-end, side-swipe, single vehicle and tipping over crashes, respectively. Thus, rear-end crashes account for roughly three quarters of all the crashes considered for model building.
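The bookkeeping of this split can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the variable names are illustrative. Recomputed from the printed counts, the rear-end share comes out at about 73.2%, slightly below the printed 73.5%, which may reflect rounding or a small typographical slip in the source.

```python
# Counts quoted in the text.
total_crash, total_normal = 722, 26899
build_crash, build_normal = 645, 24204
eval_crash, eval_normal = 77, 2695

# The two parts should add up to the combined dataset.
split_is_consistent = (
    build_crash + eval_crash == total_crash
    and build_normal + eval_normal == total_normal
)

# Crash types within the model-building dataset.
crash_types = {"rear-end": 472, "side-swipe": 76, "single vehicle": 94, "tipping over": 3}
types_add_up = sum(crash_types.values()) == build_crash

# Percentage share of each crash type, recomputed from the counts.
shares = {name: 100 * count / build_crash for name, count in crash_types.items()}
print(split_is_consistent, types_add_up)
print({name: round(share, 1) for name, share in shares.items()})
```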

After extracting the data and deciding upon the variables, the next step is to determine the graphical structure of the BBN. One of the most plausible structures is to select ‘crash’ as a child node of the parent nodes d2i, d4i, d24o and d24v. In BBN jargon, the variable ‘crash’ (a dichotomous outcome: crash prone or normal traffic condition) is referred to as the ‘outcome variable’ and d2i, d4i, d24o and d24v are the ‘information variables’. Nevertheless, as all the information variables are numerical in nature, even if all four of them have only 5 states, the resulting conditional probability table P(crash|d2i, d4i, d24o, d24v) will have 2 × 5^4 = 1250 cells. This creates an ideal situation for applying the ‘parent divorcing’ technique (see Section 3.2.2) to reduce the complexity of the conditional probability table for ‘crash’ by introducing two new intermediate variables, ‘risk ratio A’ and ‘risk ratio B’, in such a way that every two information variables become parents of one of these intermediate variables and the two intermediate variables become the only parents of ‘crash’. Here, ‘risk ratio A’ and ‘risk ratio B’ have been derived by building two separate logistic regression models, each with two of the four information variables as independent variables and ‘crash’ as the dependent variable. Using these two logistic regression models and the values of their corresponding independent variables, the probability of a crash prone traffic condition can be calculated. However, in order to execute the aforementioned steps, it is important to ensure that two highly correlated variables are not grouped together in either of the two logistic regression models. Hence, a Pearson correlation test was performed on the four information variables and the results are presented in Table 1.

Table 1
Correlation among information variables (Pearson correlation, ρ).

Variable   d2i    d4i     d24v     d24o
d2i        1.0    0.878   −0.265    0.210
d4i               1.0      0.121   −0.186
d24v                       1.0     −0.730
d24o                                1.0

Table 2 (continued)
(b) Model 2
d24v           0.031     0.003        10.53    <2E−16***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

Table 1 implies that the congestion indices at the two detector positions are highly correlated (i.e., |ρ| > 0.7) but that they have rather poor correlation (i.e., |ρ| < 0.3) with the variables representing spatial differences. Therefore, the two logistic regression models for the two risk ratios are developed as follows. Model 1: dependent variable crash (traffic condition leading to crash/normal traffic condition), independent variables d2i and d24o. Model 2: dependent variable crash, independent variables d4i and d24v. The results of the models are shown in Table 2. Later, the probability of a traffic condition being crash prone was calculated for the modeling dataset using the results of Table 2 and stored in the intermediate variables ‘risk ratio A’ (for Model 1) and ‘risk ratio B’ (for Model 2). Afterwards, the new intermediate variables were reclassified, with each variable having four states: high risk, moderate risk, low risk and very low risk. Selecting the break points for these classes is crucial for the success of the model and depends largely on the objective of the researcher. Lower break points for the ‘high risk’ and ‘moderate risk’ categories increase the prediction success for crash prone traffic conditions, but sacrifice the prediction efficiency for normal traffic conditions and cause excessive false alarms. This study takes a conservative approach by choosing to identify only the high risk traffic conditions in order to reduce the false alarm rate, and sets high break points for the aforementioned two risky categories. The procedure followed to identify the break points is demonstrated with a sample calculation in Appendix A.
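The step from the Table 2 coefficients to the two risk ratios can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the helper name `crash_probability` is an assumption, the Model 2 input values are hypothetical, and the coefficients are the rounded values printed in Table 2.

```python
import math

# Rounded coefficients as printed in Table 2.
MODEL_1 = {"intercept": -4.562, "d2i": 2.486, "d24o": -0.132}
MODEL_2 = {"intercept": -4.736, "d4i": 3.178, "d24v": 0.031}

def crash_probability(model, values):
    """Logistic regression probability of a crash prone traffic condition."""
    z = model["intercept"] + sum(model[name] * x for name, x in values.items())
    return 1.0 / (1.0 + math.exp(-z))

# Inputs of the Appendix A sample point for Model 1.
risk_ratio_a = crash_probability(MODEL_1, {"d2i": 0.87, "d24o": 0.70})

# Hypothetical inputs for Model 2, for illustration only.
risk_ratio_b = crash_probability(MODEL_2, {"d4i": 0.80, "d24v": 5.0})

print(round(risk_ratio_a, 4), round(risk_ratio_b, 4))
```

With the rounded coefficients the Model 1 value comes out at roughly 0.0765, close to the 0.0767 printed in Appendix A; the small gap would be consistent with the authors having used unrounded estimates, though the text does not say so.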

Afterwards, the prior probability distributions P(d2i), P(d4i), P(d24v) and P(d24o) and the conditional probability tables P(risk ratio A|d2i, d24o), P(risk ratio B|d4i, d24v) and P(crash|risk ratio A, risk ratio B) need to be calculated. As the values of d2i, d4i, d24v and d24o are numerical, several histograms were prepared to check the distribution of these variables separately for the crash and normal traffic condition cases, and each of the four variables was reclassified into eight states to ensure that each state had representative data from both crash and normal traffic conditions. Therefore, both the P(risk ratio A|d2i, d24o) and P(risk ratio B|d4i, d24v) tables contained 4 × 8 × 8 = 256 cells and P(crash|risk ratio A, risk ratio B) contained 2 × 4 × 4 = 32 cells. Finally, the BBN was developed following the literature in Section 3.2. For that, the academic version of Hugin Expert (Hugin Expert Tutorial, 2011), a software tool for building BBNs, was used. Hugin Expert takes the information variables and the outcome variable along with their states in the form of a DAG and conditional probability tables as input, and computes and provides the BBN as output. The resulting model is illustrated in Fig. 4. To elaborate, the node d2i indicates that 23.72% of the samples (crash and normal together) have a congestion index at detector location d2 of less than 0.04, 7.16% of the samples have a congestion index at the same location between 0.04 and 0.06, and so on.
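The effect of parent divorcing on table size can be verified with a small calculation. The sketch below simply multiplies the number of states of a node by those of its parents; the function name `cpt_cells` is an illustrative assumption.

```python
from math import prod

def cpt_cells(child_states, parent_states):
    """Number of cells in a conditional probability table."""
    return child_states * prod(parent_states)

# 'crash' as a direct child of four parents with 5 states each (the example in the text).
direct_5_states = cpt_cells(2, [5, 5, 5, 5])

# The same direct structure with the eight states actually used per information variable.
direct_8_states = cpt_cells(2, [8, 8, 8, 8])

# Structure after parent divorcing: two intermediate nodes with 4 states each.
risk_ratio_table = cpt_cells(4, [8, 8])
crash_table = cpt_cells(2, [4, 4])
divorced_total = 2 * risk_ratio_table + crash_table

print(direct_5_states, direct_8_states, risk_ratio_table, crash_table, divorced_total)
```

With eight states per information variable, the direct structure would need 8192 cells for ‘crash’ alone, whereas the three tables of the divorced structure need 544 cells in total.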

4.3. Model evaluation

The crash data from the last two months of the study period were kept for evaluation of the newly built BBN model. The dataset contains 77 crash cases and the corresponding 2695 normal traffic condition cases. Of these 2695 normal traffic condition data points, 385 (77 × 5 = 385) were randomly selected to evaluate the model. Next, each data point in the evaluation dataset was entered into the BBN individually and its associated probability of belonging to a crash prone traffic condition was calculated. From Fig. 4, it can be ascertained that, based on the prior probabilities of the information and intermediate variables, if no new evidence is entered into the BBN then the average probability of a traffic condition being associated with a crash is 4.56%. This value was used as the minimum threshold for evaluating the performance of the model. The threshold value was then raised up to 15% and the resulting classification accuracy is presented in Fig. 5.

Fig. 4. The BBN of real-time crash prediction model.

Fig. 5. Performance evaluation of the BBN.

The results show that at a threshold value of 4.56% the model is able to successfully classify 66% of the crashes with a false alarm rate of less than 20%. If the threshold value is raised to 7%, the model can still predict 58% of the crashes and 87% of the normal traffic conditions, with an overall classification accuracy of 82%. With a threshold value as high as 14%, the model classifies 30% of the crash cases with a false alarm rate of less than 3%. It is very difficult to compare the results obtained in this study with the previous literature, as the evaluation methods as well as the modeling approaches vary substantially. However, the results are to some extent comparable to Abdel-Aty et al. (2008), who evaluated the modeling performance with three different detector combinations. Their best model could predict 61% of the crash cases from a separate evaluation dataset with a 21% false alarm rate.
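The threshold-based evaluation described above can be sketched as follows. This is a minimal illustration, assuming a data point is flagged as crash prone when its predicted probability is at or above the threshold (the text does not state whether the comparison is strict) and that both classes are present in the data. The function name and the six sample points are hypothetical. The last lines recompute the overall accuracy at the 7% threshold from the reported class-wise rates and the 77 crash and 385 normal cases.

```python
def evaluate(probabilities, labels, threshold):
    """Return sensitivity, false alarm rate and overall accuracy at one threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        flagged = p >= threshold  # assumption: inclusive comparison
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "false_alarm_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical predicted probabilities and labels (1 = crash), for illustration only.
probs = [0.02, 0.05, 0.10, 0.03, 0.08, 0.01]
labels = [0, 1, 1, 0, 0, 0]

at_prior = evaluate(probs, labels, 0.0456)
at_seven = evaluate(probs, labels, 0.07)
print(at_prior)
print(at_seven)

# Consistency check of the reported figures at the 7% threshold.
reported_overall = (0.58 * 77 + 0.87 * 385) / (77 + 385)
print(round(reported_overall, 2))
```

The recomputed overall accuracy of about 0.82 agrees with the 82% reported in the text for the 7% threshold.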

5. Conclusion and discussion

The manuscript investigates the major shortcomings of the current initiatives for predicting crash risk on urban expressways in real time and offers solutions to overcome them with an improved framework and modeling method. The question of which detector positions data should be extracted from to perform prediction has been answered by conducting the study on a heavily instrumented urban expressway, the Tokyo Metropolitan Expressway. The issue of a large variable space with a small sample size has been addressed by introducing the random multinomial logit model to identify and rank the most important variables. The study also underlined that the variables under consideration in real-time crash prediction models are highly correlated by nature, and thus the modeling methods employed must be robust enough to accommodate those variables. Moreover, it highlights that surrogate variables may often be needed to model such a problem due to the lack of data availability. These models may also need to be updated with partially available new data on some variables. For this, the model must be able to update itself in a modular way. The model will also need to accommodate new variables in the future without being completely rebuilt. Considering all these specific requirements, this study has introduced the Bayesian belief net (BBN) as a modeling method. The proposed model also binds its result to a space (a 250 meter section) and a time (the next 4–9 min), which will be necessary for researchers involved in countermeasure design. To elaborate, when evidence is entered, the model predicts the chance that the traffic condition of the 250 meter long section under consideration in the basic freeway segment becomes hazardous within the next 4–9 min.

The findings regarding the most appropriate detector position suggest that a detector placed approximately 250 meters downstream from the centroid of the section under consideration can capture the abnormality in the traffic conditions with the highest precision. The second best detector position is a location 250 meters upstream from the centroid of the section under surveillance. The study identifies that the traffic conditions upstream and downstream, as well as the difference in the traffic flow parameters between these locations, have a high impact on the precise detection of hazardous conditions. A new variable called ‘congestion index’ has also been introduced through this study; it was calculated by comparing the instantaneous speed of the stream with the free flow speed at that location. The final variable set used in the model comprised the congestion index downstream and upstream and the difference in speed and occupancy between the upstream and the downstream. The study has also demonstrated how the complexity of the model can be reduced by combining the classical ‘parent divorcing’ technique with a conventional BBN.


Table A1
Descriptive statistics for Rule 1 and Rule 2 for Model 1.

Rule      Min      1st Q    Median   Mean     3rd Q    Max.
Rule 1    0.0003   0.0442   0.1558   0.2367   0.3800   0.9268


From the performance evaluation point of view, the model has been built following a conservative approach to capture mainly highly hazardous road traffic conditions while maintaining a low false alarm rate. This approach is important as the road authorities are expected to take countermeasures such as warning the drivers with variable message signs (VMS), controlling the driving speed through variable speed limits (VSL), maintaining the level of congestion through ramp metering and sometimes even drastic measures such as main line metering and lane change prohibition. Therefore, as the knowledge regarding proactive evasive measures to countervail traffic conditions leading to a crash is still at an early stage of development, it is expected that actions will be taken only when the risk of a crash is substantially high. Regarding the performance evaluation of the model, it was also taken into account that a crash often does not occur even under a hazardous traffic condition due to the skills of drivers, and that several crashes take place due to factors that cannot be captured with traffic flow variables. Therefore, the 66% success rate in capturing hazardous traffic conditions with a false alarm rate of less than 20% can be considered satisfactory. The study presents a choice for the road authorities through Fig. 5. Decision makers may choose different threshold values based on their needs when deciding upon evasive actions. Moreover, the proposed model is robust from the maintenance point of view, too. It is built with data from two detectors and thus it can still yield results even if one detector fails. It is also important to highlight that the model need not be implemented throughout the basic freeway segments. The expressway authorities are not required to lay out detectors throughout the length of the expressway either. As the model binds crash risk to both time and space, the authorities can identify hazardous locations on the road with conventional crash prediction models, lay out detectors as recommended in this study and monitor those specific areas only. This provides the expressway authorities with a higher level of flexibility from the financial point of view.

Although the study presents a new framework coupled with modern modeling methods to bridge the gap between conceptual research and practical implementability, it has its own limitations. The manuscript addresses the issue of real-time hazardous traffic condition monitoring for the basic freeway segments only and recommends a new study to develop separate models for the ramp vicinities following a similar framework. The research mainly focuses on timely detection of hazardous traffic condition formation and does not provide much insight into the underlying crash mechanism. The study also does not cover issues related to appropriate intervention design. Nevertheless, the new methodological framework demonstrated in this study can be used as a core, and further studies on understanding crash mechanisms, improving prediction performance and designing countermeasures can be conducted around it.

Acknowledgement

The authors would like to express their gratitude to Tokyo Metropolitan Expressway Company Limited for facilitating the study.

Appendix A. Selecting the break points for the intermediate variables

For illustration purposes, we demonstrate here how the four categories for the intermediate variable ‘risk ratio A’ are selected. Let a crash data point have the following values associated with it:

d2i = 0.87 (the congestion index is unitless).

Table A1 (continued)

Rule         Min      1st Q    Median   Mean     3rd Q    Max.
Rule 2 (a)   0.0258   0.0151   0.0134   0.0127   0.0109   0.0000

(a) All the values are negative.

d24o = 0.70 (the difference in occupancy between detector positions 2 and 4; this is unitless as well, and the occupancy in the detector database is calculated on a 0–1 scale).

Thus, using the Model 1 parameters in Table 2, the probability of a crash prone traffic condition for this data point can be calculated as 1/(1 + exp(−(−4.562 + 2.486 × 0.87 − 0.132 × 0.7))) = 1/(1 + exp(2.4913)) = 0.0767.

Now, the average probability that a data point in the modeling dataset belongs to a crash prone traffic condition is 645/(645 + 24,204) = 0.026.

Thus, the excess probability for the data point is 0.0767 − 0.026 = 0.0507.

In this way, the excess probability is calculated for all the data points (both crash prone, i.e., crash = 1, and normal traffic data, i.e., crash = 0). Next, we calculate the descriptive statistics (mean, median, 1st quartile, 3rd quartile, minimum, maximum) of the excess probabilities under these rules:

Rule 1: for crash = 1 and excess probability > 0.
Rule 2: for crash = 0 and excess probability < 0.

The results are shown in Table A1.

Now, the median value 0.1558 (for Rule 1) is used as the break point between ‘high risk’ and ‘moderate risk’, 0.0 is used as the break point between ‘moderate risk’ and ‘low risk’ (i.e., average probability of crash = 1) and −0.0134 is used as the break point between ‘low risk’ and ‘very low risk’. Thus, a data point will be classified as ‘high risk’ only when its associated risk is substantially higher than the average risk and, similarly, a data point will be classified as low risk only when its associated probability is substantially lower than the average risk.
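The resulting classification rule for ‘risk ratio A’ can be written down compactly. The sketch below uses the break points quoted above; how a value lying exactly on a break point is treated is not stated in the text, so the strict comparisons here are an assumption, as is the function name.

```python
# Average probability of a crash prone condition in the modeling dataset.
BASE_RATE = 645 / (645 + 24204)

# Break points on the excess probability for 'risk ratio A' (from Table A1 and the text).
HIGH_BREAK, MODERATE_BREAK, LOW_BREAK = 0.1558, 0.0, -0.0134

def risk_state(probability, base_rate=BASE_RATE):
    """Map a Model 1 probability to one of the four states of 'risk ratio A'."""
    excess = probability - base_rate
    if excess > HIGH_BREAK:
        return "high risk"
    if excess > MODERATE_BREAK:
        return "moderate risk"
    if excess > LOW_BREAK:
        return "low risk"
    return "very low risk"

# The sample point of this appendix (probability 0.0767) and three hypothetical ones.
for p in (0.0767, 0.30, 0.02, 0.005):
    print(p, risk_state(p))
```

Under these break points the sample data point, with an excess probability of about 0.05, falls into the ‘moderate risk’ state.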

A similar process is followed to categorize ‘risk ratio B’ as well.
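Because the same procedure applies to both intermediate variables, it can be expressed as one reusable function. The sketch below derives the three break points from a list of model probabilities and crash labels using Rule 1 and Rule 2; the function name and the eight sample points are hypothetical, and the sketch assumes each rule selects at least one data point.

```python
from statistics import median

def break_points(probabilities, labels):
    """Derive (high, moderate, low) break points on the excess probability.

    Rule 1: median excess among crash cases with positive excess.
    Rule 2: median excess among normal cases with negative excess.
    """
    base_rate = sum(labels) / len(labels)  # average probability of crash = 1
    excess = [p - base_rate for p in probabilities]
    rule_1 = [e for e, y in zip(excess, labels) if y == 1 and e > 0]
    rule_2 = [e for e, y in zip(excess, labels) if y == 0 and e < 0]
    return median(rule_1), 0.0, median(rule_2)

# Hypothetical probabilities and labels (1 = crash), for illustration only.
probs = [0.9, 0.6, 0.7, 0.1, 0.2, 0.3, 0.05, 0.15]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

high, moderate, low = break_points(probs, labels)
print(high, moderate, low)
```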

References

Abdel-Aty, M., Uddin, N., Pande, A., Abdalla, M.F., Hsia, L., 2004. Predicting freeway crashes from loop detector data by matched case–control logistic regression. Transport. Res. Rec. 1897, 88–95.

Abdel-Aty, M., Pande, A., 2005. Identifying crash propensity using specific traffic speed conditions. J. Safety Res. 36 (1), 97–108.

Abdel-Aty, M., Uddin, N., Pande, A., 2005. Split models for predicting multivehicle crashes during high-speed and low-speed operating conditions on freeways. Transport. Res. Rec. 1908, 51–58.

Abdel-Aty, M., Pande, A., Das, A., Knibbe, W.J., 2008. Assessing safety on Dutch freeways with data from infrastructural-based intelligent transportation systems. Transport. Res. Rec. 2083, 153–161.

Breslow, N.E., Day, N.E., 1980. Statistical methods in cancer research. Volume I – the analysis of case–control studies. IARC Sci. Publ. 32 (32), 5–338.

Charniak, E., 1991. Bayesian network without tears. AI Mag. 12 (4), 50–63.

Dias, C., Miska, M., Kuwahara, M., Warita, H., 2009. Relationship between congestion and traffic accidents on expressways: an investigation with Bayesian belief networks. In: Proceedings of 40th Annual Meeting of Infrastructure Planning (JSCE), Japan.

Gazis, D.C., 2002. Traffic Theory. Kluwer Academic Publishers, USA.

Hugin Expert Tutorial, 2011. Accessed on: 6 January, 2011. Accessed from: http://www.hugin.com/developer/tutorials.

Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5 (3), 299–314.

Jensen, F.V., Nielsen, T.D., 2007. Bayesian Networks and Decision Graphs. Springer, NY.

Jovanis, P.P., Chang, H.L., 1986. Modeling the relationship of accidents to miles traveled. Transport. Res. Rec. 1068, 42–51.

Lee, C., Hellinga, B., Saccomanno, F., 2003. Real-time crash prediction model for the application to crash prevention in freeway traffic. Transport. Res. Rec. 1840, 67–77.

Lee, C., Abdel-Aty, M., Hsia, L., 2006. Potential real-time indicators of sideswipe crashes on freeways. Transport. Res. Rec. 1953, 41–49.


Oh, C., Oh, J., Ritchie, S., Chang, M., 2001. Real-time estimation of freeway accident likelihood. In: Proceedings of the 80th Annual Meeting of Transportation Research Board, Washington, DC.

Oh, C., Oh, J., Ritchie, S., Chang, M., 2005. Real time hazardous traffic condition warning system: framework and evaluation. IEEE Trans. Intell. Transp. Syst. 6 (3), 265–272.

Oh, C., Park, S., Ritchie, S.G., 2006. A method for identifying rear-end collision risks using inductive loop detectors. Accid. Anal. Prev. 38, 295–301.

Olesen, K.G., Kjaerulff, U., Jensen, F., Jensen, F.V., Falck, B., Andreassen, S., Andersen, S.K., 1989. A MUNIN network for the median nerve – a case study on loops. Appl. Artif. Intell., 3.

Pande, A., Abdel-Aty, M., 2005. A freeway safety strategy for advanced proactive traffic management. J. Intell. Transport. Syst. 9 (3), 145–158.

Pande, A., Abdel-Aty, M., Hsia, L., 2005. Spatiotemporal variation of risk preceding crashes on freeways. Transport. Res. Rec. 1908, 26–36.

Pande, A., Abdel-Aty, M., 2006a. Assessment of freeway traffic parameters leading to lane-change related collisions. Accid. Anal. Prev. 38, 936–948.

Pande, A., Abdel-Aty, M., 2006b. Comprehensive analysis of the relationship between real-time traffic surveillance data and rear-end crashes on freeways. Transport. Res. Rec. 1953, 31–40.

Pande, A., Abdel-Aty, M., 2007. Multiple-model framework for assessment of real-time crash risk. Transport. Res. Rec. 2019, 99–107.

Prinzie, A., Poel, D.V., 2008. Random forests for multiclass classification: random multinomial logit. Expert Syst. Appl. 34 (3), 1721–1732.

Sabey, B.E., Staughton, G.C., 1975. Interacting roles of environment, vehicle and road user in accidents. In: Proceedings of the 5th international conference of the International Association for Accident and Traffic Medicine, London.

Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T., 2007. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinf. 8 (25).

Treat, J.R., Tumbas, N.S., McDonald, S.T., Shinar, D., Hume, R.D., Mayer, R.E., Stanisfer, R.L., Castellan, N.J., 1977. Tri-level study of the causes of traffic accidents. Report No. DOT-HS-034-3-535-77 (TAC).

Zheng, Z., Ahn, S., Monsere, C.M., 2010. Impact of traffic oscillations on freeway crash occurrences. Accid. Anal. Prev. 42 (2), 626–636.
