Building Statistical Models to Predict Durations of Subway System
Incidents
by
Jacob Louie, M.Eng Candidate
#997677125
Supervisors: Dr. Amer Shalaby and Dr. Khandker Nurul Habib
Graduate Department of Civil Engineering University of Toronto
Building Statistical Models to predict durations of Subway System
Incidents
Jacob Louie
Masters of Engineering
Civil Engineering
University of Toronto
2015 (expected)
Abstract
This study aims at comparing the accuracy and goodness of fit of Ordinary Least
Squares Regression, Ordered Logit Models, and Survival Analysis in predicting the
duration of subway incidents in Toronto. The study considers as variables the cause or
type of incident, the location of incident, the time of incident, and the train-type involved.
It was found that the Survival model, specifically the Log-Logistic Accelerated Failure
Time model, provided the highest accuracy and most explanatory power. The effect of
non-causal variables was also statistically significant.
Acknowledgements
I would like to thank my supervisors Professor Amer Shalaby and Professor Khandker Nurul
Habib for their continuous guidance throughout my last two semesters at the University of
Toronto. The last eight months have been a steep learning curve for me, for which I am grateful
to have overcome so far. I would also like to thank all my friends who live in the ITS lab at the
University of Toronto and ARUP who have exchanged their ideas with me. I look forward to the
coming months to continue to improve upon my work for TRB.
Contents
Acknowledgements
List of Tables
List of Figures
1. Introduction
    1.2 Literature Review
2. Data Description and Preprocessing
    2.1 General Data Description
    2.2 Trends in the Toronto Subway Performance
    2.3 Data Preprocessing for Model Development
    2.4 Variable Definitions Used in the Models
    2.5 Data Partitioning for Training and Testing
3. Descriptions of model formulations considered
    3.1 Ordinary Least Squares (OLS) Regression
    3.2 Ordered Logit Model
    3.3 Survival Analysis
4. Discussion of Results
    4.1 Linear Regression Model
    4.2 Ordered Logit Model
        4.2.1 Ordered Logit Model without simulation
        4.2.2 Ordered Logit Model by simulation
    4.3 Survival Analysis
5. Summary and Discussion of Results
    5.1 Assessment of goodness of fit
    5.2 Comparison of Parameter Estimates between the three models
    5.2 Interpretation of Parameter Estimates
6. Conclusion
    6.1 Recommendations for future work
List of Tables
Table 1: Whether or not a line reached its on-time performance target
Table 2: Variables describing time, location, and vehicle type involved
Table 3: Variables describing the incident type
Table 4: The threshold definitions for the output levels of the ordered logit model
Table 5: Confusion table that shows the count of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column names are the actual outcomes.
Table 6: Confusion table that shows the percentage of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column labels are the actual outcomes
Table 7: Predicted probabilities for record #2067
Table 8: CDF for record #2067
Table 9: Confusion table resulting counts from the simulated ordered logit model
Table 10: Confusion table resulting percentages from the simulated ordered logit model.
Table 11: Fit statistics for the four common AFT specifications. The Log-logistic appears to be the best
Table 12: Parameter estimates and effects comparison between the parametric and non-parametric formulations. There is considerable discrepancy between the two formulations.
Table 13: Parameter Estimates for the Log-logistic AFT, OLS Regression, and Ordered Logit Model
Table 14: List of incident types on which the three models disagree
Table 15: Incident types along with their mean duration and standard deviation. Non consensual incident types are highlighted in yellow.
Table 16: Parameter estimates and average durations of the final model (log-logistic AFT)
List of Figures
Figure 1: The distribution of incidents over time
Figure 2: Histogram of incidents, assuming all incidents had a non-zero duration
Figure 3: Frequency of incident occurrence by count
Figure 4: Incident Types by total duration over 2013. Total number of hours of delay was 498.
Figure 5: Incident Types by Count, 2013. Total number of incidents exceeded 10 000
Figure 6: Boxplots showing the distribution of duration delays for each incident type.
Figure 7: Residual and QQ plots for OLS regression does not support assumption of normally distributed errors
Figure 8: Cox-Snell Residual plots for the Exponential, Weibull, and Log-Logistic specifications. The data omits incidents less than 2 minutes.
Figure 9: Cox Snell Residual plots for the Log-logistic specification. Data includes all incidents less than 2 minutes as well, with incidents less than 1 minute assigned a random number.
1. Introduction
In public transit systems, delays and incidents can negatively affect service throughout the
system, as well as damage the transit agency's image. Aside from prevention of incidents,
minimisation of the duration of the incidents is one way to mitigate the negative impacts on
service system-wide, and is of particular interest in this paper. Even if durations cannot be
minimised, predictions for delay durations are needed to help agencies decide whether or not
an incident is disruptive enough to warrant a formal response and to help customers decide on
alternative routes (Straphangers Campaign, 2014). Consequently, the purpose of this study is to
build statistical models that can predict the durations of incidents on the Toronto Transit
Commission (TTC) subway system. A list of incidents on the subway system over the year 2013
was provided by the TTC.
To accomplish this goal, three popular models were considered: Ordinary Least Squares (OLS)
regression, the ordered logit model, and the hazard model (also known as survival analysis or
the accelerated failure time (AFT) model). These models will be compared on their predictive
capability, their accuracy, the quality of their estimates (i.e. whether or not the signs and
magnitudes are logical), and their limitations.
This report will begin with a literature review of predicting the duration of delays in the
transportation context. A brief description of each model formulation, including their limitations
and methods to interpret their parameter estimates will then be given. This will be followed by a
discussion of the models’ goodness of fit statistics, and accuracy rate against holdout data. The
report will end with a comparative assessment between the models and an interpretation of the
parameter estimates.
1.2 Literature Review
Survival analysis has been used to predict the clearance time of highway incidents in the past.
In a study on highway incident duration in Washington State by Nam and Mannering (2000), the
total incident duration was broken down into time to detection of incident, time for responding
agency to react to incident, time for the response crew to travel to incident, and time for
response crew to clear the incident once they arrived. Separate models were constructed for
each time segment. The authors used different Accelerated Failure Time (AFT) models to
predict the effect of different covariates on the expected delay time itself and found that
variables describing time of year, location, and weather were significant predictors, although
their effects were not stable year-to-year. The data set identified the exact time of the onset of
the incident, the time of incident detection, and the time of incident clearance which allowed the
authors to study the time components separately.
Detecting an incident can be highly dependent on the manner of detection. For example, in an
incident that involved a fire that started on a subway train in Seoul Korea, the response
procedure to the fire was delayed as the operator of the train did not receive the information of
the fire through the passenger intercom (Roh, n.d.). Consequently, the severity of the incident
increased dramatically.
Valenti, Lelli, and Cucina (2010) compared regression, hazard, and neural network models
in predicting the duration of traffic incidents. Each observation in their data also included which
response agencies were involved (EMS, fire, police, etc.), and how many response personnel
and response vehicles were at the site. The researchers found that most of the models had
difficulty predicting both short duration incidents and long duration incidents simultaneously.
Weng et al. (2014) also built an Accelerated Failure Time (AFT) model to predict the duration of
subway delays in Hong Kong from a dataset spanning 6 years. With a large multi-year dataset,
they were able to test the temporal transferability of their model, as well as consider very high
severity but rare incidents, such as crashes and fatalities resulting from accidents. The authors
used the incident type as the only covariate in the model, and had not yet considered other
variables such as the time of day, time of year, day of week, prevailing headways, and location.
Sharman (2014) also constructed and applied a survival model to predict the dwell times of
urban delivery vehicles. The data set of dwell times was obtained through the preprocessing of
“passively-collected GPS data”. Sharman found that the AFT formulation, specifically the log-
logistic specification, as described in section 3.3, yielded the best results.
Giuliano (1989) uses analysis of variance to compare the duration of freeway incidents against
different categorical variables, including whether the incident occurred during day or night time,
the number of lanes closed or affected by the incident, incident type, and whether or not trucks
were involved. Giuliano found that although incident type was statistically significant, there was
a large variation within each incident type.
In the transportation context, there seem to be more duration modelling studies of highway
traffic incidents than of public transit incidents. Past research on duration analysis of
public transit incidents was not at the level of detail that studies on highway incidents were able
to achieve, as far as incorporating variables other than incident type or cause is
concerned. Incorporating variables on time of occurrence, location, and passenger volumes
remains a goal for Weng et al. (2014) and is the motivation for our study as well.
2. Data Description and Preprocessing
2.1 General Data Description
In the 2013 subway incident data set provided by the TTC, the following information is given for
each recorded incident: the date, time, day of week, station, bound, line, resulting delay in
minutes, the vehicle #, the operator #, the guard #, a description of the incident, and a code
representing the incident type.
There were roughly 12 000 subway incidents in the year 2013. 55% of these incidents register a
delay of 0 minutes. The smallest non-zero delay was 2 minutes. The remaining delays were
given to the nearest minute. It is also worth noting that other transit agencies, such as the MTA
in New York City, the MTR in Hong Kong, and the MRT in Singapore, record incidents that
exceed 8, 8, and 5 minutes, respectively (SC, 2014; Weng, et al., 2014). On the other hand, it
can also be argued that short duration incidents of even less than 2 minutes can still be
influential on the subway system’s performance when headways are extremely short.
Figure 1 shows the distribution of incident occurrence over time of day. Of interest from this
figure are the two peaks in incident occurrence during the morning peak period and evening
peak period, defined as 6h-9h and 15h-19h, respectively. It is not surprising to see the frequency
of incident occurrence rising with passenger demand.
Figure 1: The distribution of incidents over time
From Figure 2, the vast majority of subway incidents have short durations, and the frequency of
incident durations decreases with increasing durations. The overall distribution of incident
duration exhibits a rough exponential distribution.
Figure 2: Histogram of incidents, assuming all incidents had a non-zero duration
Figure 3 shows the location distribution of the incidents. Many incidents seem to be concentrated
at terminals, at major interchanges, and locations near yards. The prevalence of incidents is
also higher on the Yonge-University-Spadina (YUS) line than on the Bloor-Danforth (BD) line,
possibly due to higher passenger traffic overall.
[Figure 1 chart: Distribution of incidents throughout the Day. x-axis: Time of Day (hourly); y-axis: Count]
[Figure 2 chart: Histogram of incident durations. x-axis: Time (minutes); y-axis: Percentage]
Figure 3: Frequency of incident occurrence by count
2.2 Trends in the Toronto Subway Performance
By the end of 2014, the TTC moved 1.8 million daily riders (TTC, 2015), a 2% growth since
2013. Work is now underway to update the signalling system to increase the capacity, causing
an increase in track-level maintenance activity (TTC, 2013).
Table 1 lists the relative performance of each rapid transit line since 2013.
Table 1: Whether or not a line reached its on-time performance target
Line                            2015   2014   2013
Yonge-University-Spadina (YUS)  Below  Below  Below
Bloor-Danforth (BD)             Below  Below  Above
Sheppard (SHP)                  Above  Above  Above
Scarborough RT (SRT)            Above  Below  Above
The Yonge-University-Spadina (YUS) line has consistently been the worst performing subway
line in Toronto since at least 2013, with an average percentage of on time performance peaking
at less than 94% (TTC, 2015). In contrast, other lines regularly met or surpassed their
percentage of on-time performance target of 96%. This difference in performance could be due
to a number of factors. As the subway system ages and demand increases, the need to update
the infrastructure becomes more urgent (Man, Misra, & Shalaby, 2014). Work has been
underway on the YUS line to upgrade the signalling system and other critical infrastructure.
Also, the new Toronto Rocket (TR) trains were the cause of many door-related delays when
they were first introduced (TTC, 2013, p. 5).
Other large cities are experiencing similar problems. In New York City, the top delay causes
include sickness, doors being held by passengers, speed-control and signal issues, and
security. Some of these causes, such as police stoppages, have become more prevalent in light
of terrorist attacks in recent history (Chan & McGinty, 2005). In Toronto, security was also the
leading incident type, accounting for the largest share of delay time, followed by door problems.
The breakdown of delays by cause is given in Figure 4 and Figure 5.
Figure 4: Incident Types by total duration over 2013. Total number of hours of delay was 498.
Figure 5: Incident Types by Count, 2013. Total number of incidents exceeded 10 000
As can be seen from the above pie charts, lengthier delays do not necessarily correspond with
more frequent delays. The variation of incident durations for each incident type is also very
large, as shown in Figure 6. It would be inadequate to simply describe each incident type with a
single point average.
Figure 6: Boxplots showing the distribution of duration delays for each incident type.
2.3 Data Preprocessing for Model Development
In an attempt to improve the model fit, it was assumed that incidents with 0 minute delay that
can conceivably affect train operations were actually anywhere between 0 and 2 minutes. These
incidents were reassigned a delay duration of 1 minute. As for the remaining incidents of 0
minute delay, different options were tested, including omitting them entirely (due to lack of
relevance to train operations), or reassigning them a random number of less than 1 minute. In
the context of Survival Analysis, these zero delay incidents could have been treated as “left-
censored”, in which the only knowledge of the incident duration is that it was less than the
minimum threshold of 2 minutes (Allison, 2010).
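As a rough illustration of the reassignment rules described above, the following Python sketch applies them to a single record. The function name, the `affects_operations` flag, and the `omit_irrelevant` switch are hypothetical and are not taken from the TTC data or from the original analysis; deciding which incident types conceivably affect train operations is left to the analyst.

```python
import random


def adjust_zero_delay(delay_min, affects_operations, rng, omit_irrelevant=False):
    """Return the delay in minutes to use for modelling, or None to omit the record.

    Hypothetical helper: the argument names are illustrative assumptions.
    """
    if delay_min > 0:
        # Recorded non-zero delays are kept as they are.
        return float(delay_min)
    if affects_operations:
        # Assumed to lie between 0 and 2 minutes, so 1 minute is assigned.
        return 1.0
    if omit_irrelevant:
        # Option 1 for the remaining zero-delay incidents: drop them.
        return None
    # Option 2: assign a random duration of less than 1 minute.
    return rng.uniform(0.01, 0.99)


rng = random.Random(2013)  # fixed seed so the example is repeatable
adjusted = [
    adjust_zero_delay(5, True, rng),
    adjust_zero_delay(0, True, rng),
    adjust_zero_delay(0, False, rng),
    adjust_zero_delay(0, False, rng, omit_irrelevant=True),
]
```

The left-censoring treatment mentioned above is a different approach and is not covered by this sketch.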
2.4 Variable Definitions Used in the Models
Different formulations for the headway or time covariates were tested. The prevailing headways
were derived using the TTC’s service summaries. However, in all of the estimated models, the
coefficient for headway is very small, but statistically significant. The final model uses a dummy
indicator variable to represent whether or not the incident occurred during the peak hour. In this
study, the peak hour was defined as 6h-9h or 15h-19h on Mondays-Fridays.
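The peak hour indicator defined above could be derived from an incident timestamp along the following lines. This is a minimal sketch: the function name is hypothetical, and treating the end of each window as exclusive (so that 9:00 and 19:00 fall outside the peak) is an assumption, since the text does not state how the boundaries were handled.

```python
from datetime import datetime


def peak_hour_indicator(timestamp):
    """Return 1 if the incident falls in 6h-9h or 15h-19h on a weekday, else 0."""
    if timestamp.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return 0
    in_morning_peak = 6 <= timestamp.hour < 9
    in_evening_peak = 15 <= timestamp.hour < 19
    return 1 if (in_morning_peak or in_evening_peak) else 0


# 7 January 2013 was a Monday; 5 January 2013 was a Saturday.
monday_morning = peak_hour_indicator(datetime(2013, 1, 7, 8, 30))
monday_midday = peak_hour_indicator(datetime(2013, 1, 7, 12, 0))
saturday_morning = peak_hour_indicator(datetime(2013, 1, 5, 8, 30))
```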
Multiple formulations for location were also possible. Earlier models divided the entire
subway system into 13 zones according to the incident’s location and direction on the line. A
continuous variable to represent location from a fixed point, such as the line’s terminal, was also
tested. In the end, the final model uses an indicator variable of whether or not the incident
occurred at an interchange station.
In light of the Seoul subway incident discussed in section 1.2 (Roh, n.d.), it was decided to
include a variable that represents whether or not the train was equipped with an intercom. In
2013, the only trains that were equipped with passenger intercoms were the new “Toronto
Rocket” (TR) trains.
There were over 100 different code types to represent incident type or cause. These had to be
aggregated to make the target model more tractable. Many of these code types were duplicates
of the same incident type, and a number of these codes had very few observations. The 100
codes were reducible to a minimum of 35 distinct categories without losing details on the nature
of the incident. Further reductions were possible by categorising the incident types based on
their severity level (Affecting customer safety, potential property damage, criminal, affecting
train operations only, and other).
Weng, et al. (2014) also identified 6 main incident categories in their study of the Hong Kong
subway system: Power infrastructure failure, vehicle failure, turnout malfunction, crash, and
operation error. However, many of the incidents in the TTC’s data set are not classifiable into
these categories.
Man, et al. (2014) proposed 10 categories derived from surveys and interviews with various
stakeholders of the public transit system. The proposed categories are Passenger-related,
Failure of infrastructure, Congestion delay, Weather-related, Accidents, Construction, Signals,
Power, Demand surge, and Other. Although the TTC data can be classified into these
categories, there would be a number of categories with differing severity levels that would
become aggregated with each other (e.g. PAA activations by customers and suicides can both
be considered “Passenger-related” incidents).
In this paper, it was decided to use all 35 incident types in the final model. If fewer parameters
are needed in the future, one way to further reduce the number of categories is to compare
suspected similar categories using the following formula:
Equation 1
$$t_{df=1,\,\alpha/n} = \frac{\beta_j - \beta_k}{\sqrt{Var(\beta_j) + Var(\beta_k) - 2 \cdot Covar(\beta_j, \beta_k)}}$$
Several methods exist to reduce the probability of Type I errors when making multiple post-hoc
comparisons. In the Bonferroni correction method (Wesstein, n.d.), the level of significance α
(for example, α=5% for a 95% confidence level) is divided by the number of comparisons n, and
each individual comparison is tested at the level α/n in order to limit the number of Type I errors
that will occur. Bonferroni correction is often considered overly conservative as the number of
pairwise comparisons gets large. In the current model formulation, the number of potential
comparisons is up to
$$\binom{35}{2} = 595 \text{ different comparisons}$$
The number of comparisons can be reduced by deciding beforehand the pairs of coefficients to
test. This can be the subject of future work.
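To make the pairwise comparison concrete, the sketch below computes the test statistic of Equation 1 for one pair of coefficients, together with the Bonferroni-adjusted significance level for all pairs of the 35 incident types. The function names and the coefficient, variance, and covariance values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def pairwise_t_statistic(beta_j, beta_k, var_j, var_k, cov_jk):
    """Test statistic of Equation 1 for the difference between two coefficients."""
    return (beta_j - beta_k) / math.sqrt(var_j + var_k - 2.0 * cov_jk)


def bonferroni_level(alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons


n_categories = 35
n_comparisons = math.comb(n_categories, 2)              # number of distinct pairs
per_test_alpha = bonferroni_level(0.05, n_comparisons)  # alpha / n

# Hypothetical estimates for two incident-type coefficients.
t_stat = pairwise_t_statistic(0.80, 0.50, 0.010, 0.015, 0.0)
```

With 595 comparisons, each test would have to be carried out at a level of roughly 0.0084%, which illustrates why the correction is regarded as conservative.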
The variables used in the final model can be classified into four groups: incident location, incident
time, whether or not the train had an intercom system, and incident type. In the final model, all
variables were binary (1 indicating yes, and 0 indicating no). The detailed definitions of all of the
variables are given in Table 2 and Table 3.
Table 2: Variables describing time, location, and vehicle type involved
Variable name Description
Peak_hour1 If the incident occurred during rush hour, from 6am-9am, or from 3pm-6pm, Monday to Friday
Interchange_station1 If the incident occurred at an interchange station
Intercom1 If the train involved in the incident is equipped with an intercom system
Table 3: Variables describing the incident type
Variable name Definition
CategoryCOMMUNICATIONS Problems with Communication infrastructure with Transit Control
CategoryDEBRIS___TRACK Debris or objects at Track Level
CategoryDISORDERLY_PATRON Disorderly Patron
CategoryDOOR Door mechanical problems
CategoryDOOR_PASSENGER Door problems caused by passengers
CategoryDOOR_PERSONNEL_MISTAKE Door opened off platform or in tunnel by personnel
CategoryESCALATOR_ELEVATOR_STAIRS Escalator/Elevator/Stair problem
CategoryFIRE___TRACK Fire at track level
CategoryFIRE_IN_STATION Fire elsewhere in station
CategoryFIRE_ON_TRAIN Fire on train
CategoryHOLDUP_ALARM_ACTIVATED Holdup alarm activated
CategoryINJURY Injury to personnel or customer
CategoryNO_POWER Loss of Power – includes traction and station power loss
CategoryONBOARD_MECHANICAL_MAJOR Major vehicle problems
CategoryONBOARD_MECHANICAL_MINOR Minor vehicle problems
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION Operator and/or guard not in position for duties
CategoryOPERATOR_OVERSHOT_PLATFORM Operator overshot platform
CategoryOTHER Other
CategoryPAA_ACTIVATED_BY_CUSTOMER PAA alarm activated by customer or unknown person
CategoryPERSONNEL_ERROR Mistake in operation by personnel or supervisor
CategorySECURITY_OTHER Incidents involving security or police
CategorySIGNAL_SWITCH Problems involving wayside signals or track switches
CategorySPEED_CONTROL Speed control restricting operations – includes Emergency Brake problems failing to release and gliding brakes
CategorySUICIDE_ Moving trains coming in contact with person
CategoryTRACK_PROBLEM Defective track structure
CategoryTRACTION_POWER_PROBLEM Traction Power Problem
CategoryTRAIN_STOP_CONTACTED Train tripped past a train stop
CategoryTRAINING_DEPARTMENT Incident caused by personnel in training
CategoryUNAUTHORISED_TRACK_LEVEL Unauthorised person at track level
CategoryUNSANITARY_UNHEALTHY Train taken out of service due to unsanitary conditions
CategoryWEATHER Incident caused by extreme weather
CategoryWORK_REFUSAL Delay caused by personnel refusing to work due to bad working environment
CategoryWORKZONE_PROBLEMS Delays caused by track level activity
CategoryYARDHOUSE_PROBLEM Incidents offline in the Yardhouse
2.5 Data Partitioning for Training and Testing
The data set was divided into a training set and a testing set. The training set consists of 70% of
the records. Each model was fitted using the training set. Predictions were generated for the
testing set using the fitted model. The model was evaluated by taking the mean square error
between the predictions of the testing set and the actual durations of the testing set. Note that
only the ordinary least squares regression and survival analysis could be compared in this
manner, as the ordered logit model cannot generate predictions at 1 minute resolution. As
the training set should contain the same proportions of incident durations as the testing set and
the data set as a whole, the data set was divided into the 5 duration classes defined for the
ordered logit model in section 4.2, and 70% of each class was then sampled. It was also possible
to stratify the data by incident type, but that would require sampling 70% from each of 35 groups,
which requires considerably more effort.
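The partitioning and evaluation steps described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the record layout, and the random seed are hypothetical, and the original analysis may have used different tooling.

```python
import random
from collections import defaultdict


def stratified_split(records, class_of, train_frac=0.7, seed=0):
    """Sample train_frac of each class into the training set; the rest is for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[class_of(record)].append(record)

    train, test = [], []
    for members in by_class.values():
        members = list(members)
        rng.shuffle(members)
        cut = int(train_frac * len(members) + 0.5)  # round to the nearest record
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def mean_squared_error(actual, predicted):
    """Mean square error between actual and predicted durations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical records: (record id, duration class), 10 records in each of 5 classes.
records = [(i, c) for c in range(5) for i in range(10)]
train, test = stratified_split(records, class_of=lambda r: r[1])
```

Stratifying by incident type instead would only require passing a different `class_of` function, at the cost of very small groups for rare incident types.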
3. Descriptions of model formulations considered
As described earlier, this study will test three popular statistical models: The Ordinary Least
Squares Regression, the Ordered Logit Model, and Survival Analysis. A brief introduction of
each model is given in the remainder of section 3.
3.1 Ordinary Least Squares (OLS) Regression
It is necessary to model the log of the duration delay T rather than T itself, as this would ensure
that T remains positive. The regression equation is given in Equation 2.
Equation 2
$$\log(T_i) = \beta_0 + \sum_{j=1}^{\text{Covariates}} \beta_{i,j}\, x_{i,j} + \sigma \varepsilon$$
where β’s are the explanatory variable parameter estimates, x’s are the explanatory variables,
and ε is the random error term that is normally distributed with a variance of 1. If the variance of
the error component is not 1, this would be captured in the scale factor σ. The goodness of fit is
represented by the R² statistic, where
Equation 3
$$R^2 = 1 - \frac{SSE}{SST}, \quad SSE = \sum_{i=1}^{N} (y_i - y_i^*)^2, \quad SST = \sum_{i=1}^{N} (y_i - y_{ave})^2$$
To reward parsimonious models, the adjusted R² of Equation 4 can be used instead of the R² of
Equation 3:
Equation 4
$$R^2_{adj} = 1 - (1 - R^2) \cdot \frac{N - 1}{N - p - 1}$$
where N is the sample size and p is the number of predictor variables.
Lengthier durations are predicted by positive parameter estimates in Equation 2, whereas
shorter durations are predicted by negative parameter estimates.
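The goodness of fit measures of Equation 3 and Equation 4 can be computed as in the short sketch below. The function names and the numbers are hypothetical; in this study the observed values y would be the logs of the incident durations and the fitted values would come from the regression of Equation 2.

```python
def r_squared(y, y_fitted):
    """R^2 = 1 - SSE / SST, as in Equation 3."""
    y_ave = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fitted))
    sst = sum((yi - y_ave) ** 2 for yi in y)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 of Equation 4 for sample size n and p predictor variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observed and fitted values on the log scale.
y = [1.0, 2.0, 3.0, 4.0]
y_fitted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y, y_fitted)
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)
```

The adjusted value is never larger than the unadjusted one, and the gap grows with the number of predictors p, which is how the measure rewards parsimony.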
3.2 Ordered Logit Model
An ordered logit model discretizes the output variable into categorical levels that have a natural
order. One example of the use of an ordered model is predicting the severity of a car accident,
where the ordered levels are minor, major property damage, injury, or death. In this study, the
ordered class levels can be categorised as very short, short, medium, long, and very long.
As there were many more observations of shorter durations than longer durations, the
discretization intervals were shorter for the shorter duration classes. The continuous but latent
variable y* helps predict the resulting class, and is given in Equation 5:
Equation 5
$$y^* = x'\beta + u$$
where x and β are as defined in section 3.1, and u is a random error term that follows the
logistic distribution. The actual output class yi is determined by comparing y* with a set of
threshold levels α, as given in Equation 6 and Equation 7.
Equation 6
$$y_i = j$$
where yi is the discretized version of y* and is the output level for observation i, if
Equation 7
$$\alpha_{j-1} \le y^* < \alpha_j$$
Consequently, the probability of observing class j is given in Equation 8 and Equation 9.
Equation 8
$$p_{ij} = P(\text{observation } i \text{ falls in class } j) = P(y_i = j) = P(\alpha_{j-1} \le y^* < \alpha_j)$$
Equation 9
$$P(\alpha_{j-1} \le y^* < \alpha_j) = F(\alpha_j - x'\beta) - F(\alpha_{j-1} - x'\beta)$$
For logit models, the cumulative distribution function is given in Equation 10.
Equation 10
$$F(z) = \frac{e^z}{1 + e^z}$$
The ordered logit model can only predict the approximate duration of an incident, in the form of a
duration class. In light of this, it is not possible to directly compare the goodness of fit of an
ordered model with a regression model or a survival model, which output a predicted duration
that is continuous. Although there may exist mathematically rigorous ways to define the output
classes, the classes were defined based on the log of the duration as explained in section 4.2.
Parameter estimates with positive coefficients increase the probability of longer durations.
Negative coefficients increase the probability of shorter durations.
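The class probabilities of an ordered logit model can be illustrated with the sketch below, which uses the standard formulation in which the probability of a class is the difference between the logistic CDF evaluated at the two thresholds that bound it, each shifted by the linear predictor x'β. The function names, the threshold values, and the linear predictor values are hypothetical and are not the estimates reported in section 4.2.

```python
import math


def logistic_cdf(z):
    """Logistic CDF; algebraically equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probabilities(xb, thresholds):
    """Class probabilities given the linear predictor xb and increasing thresholds.

    J - 1 interior thresholds define J ordered classes; the outermost
    thresholds are taken as minus and plus infinity.
    """
    cumulative = [logistic_cdf(a - xb) for a in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities


# Hypothetical thresholds separating five duration classes
# (very short, short, medium, long, very long).
thresholds = [-1.0, 0.0, 1.0, 2.0]
p_baseline = ordered_logit_probabilities(0.0, thresholds)
p_shifted = ordered_logit_probabilities(1.0, thresholds)
```

Raising the linear predictor from 0 to 1 moves probability mass from the shortest class towards the longest class, which is the sense in which positive coefficients increase the probability of longer durations.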
3.3 Survival Analysis
Hazard modelling and survival analysis have been used to analyse the distributions of durations
of incidents and time for an event to occur (Machin, Cheung, & Parmar, 2006). Initially used
extensively in the medical field to analyse time for death or failure to occur, these methods have
appeared in the transportation context to analyse the durations of highway incidents (Nam &
Mannering, 1998). In this study, survival analysis will be used to predict the duration of a
subway incident.
Hazard models examine the conditional probability of an event occurring at any point in time,
given that it has not yet occurred. The hazard function is given in Equation 11:
Equation 11
\[ h(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)} \]
where h(t) is the hazard, the conditional rate of event occurrence at time t given that the event has
not yet occurred, f(t) is the probability density function of the event time, and F(t) is the cumulative
distribution function of the event time. F(t) can be interpreted as the probability of an
event having already occurred before or at time t. Therefore 1-F(t)=S(t) is the probability that an
event has not yet occurred at time t, or the probability of survival at t. The following relationships
can also be proven:
Equation 12
\[ h(t) = -\frac{d}{dt}\log(S(t)) \]
Equation 13
\[ H(t) = -\log(S(t)) \]
where H(t) is the cumulative hazard rate.
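The relationships in Equations 11 to 13 can be checked numerically. The sketch below uses a log-logistic survival function as an example; the rate `lam` and shape `p` are illustrative values, not estimates from this study.

```python
import math

def survival(t, lam, p):
    """Log-logistic survival function S(t) = 1 / (1 + (lam*t)^p)."""
    return 1.0 / (1.0 + (lam * t) ** p)

def density(t, lam, p):
    """Log-logistic density f(t) = -dS/dt."""
    return lam * p * (lam * t) ** (p - 1) / (1.0 + (lam * t) ** p) ** 2

def hazard(t, lam, p):
    """Equation 11: h(t) = f(t) / S(t)."""
    return density(t, lam, p) / survival(t, lam, p)

def cumulative_hazard(t, lam, p):
    """Equation 13: H(t) = -log S(t)."""
    return -math.log(survival(t, lam, p))

def numerical_hazard(t, lam, p, step=1e-6):
    """Equation 12: h(t) = -d/dt log S(t), by central differences."""
    upper = math.log(survival(t + step, lam, p))
    lower = math.log(survival(t - step, lam, p))
    return -(upper - lower) / (2.0 * step)

LAM, P = 0.2, 2.0  # illustrative values only
for t in (1.0, 5.0, 20.0):
    print(t, round(hazard(t, LAM, P), 4), round(numerical_hazard(t, LAM, P), 4))
```

With a shape value above 1, the printed hazard rises and then falls as t grows, which is one of the hazard shapes the log-logistic form can take.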
The hazard function itself can be expressed parametrically or non-parametrically. Examples of
parametric hazard models include the Exponential, Weibull, Log-normal, Log-logistic, and
Generalised-Gamma model. An Exponential hazard model assumes a constant hazard over
time – that the conditional probability of event occurrence does not change over time. Weibull
hazard models allow for monotonically increasing or decreasing hazards. Log-Normal and Log-Logistic
models allow for decreasing, concave-up hazards, or for hazards that increase to a
maximum and then decrease monotonically. Some hazard functions can be expressed in
the form of Equation 14.
Equation 14
\[ h(t) = \beta_0 + \sum_{j=1}^{\text{Covariates}} \beta_{i,j}\, x_{i,j} \]
In “proportional hazard” models, the covariates on the right-hand side of Equation 14 proportionally
increase or decrease the hazard function. Of the above-mentioned
parametric models, only the Exponential and Weibull models can be expressed as proportional
hazards.
In proportional hazard models, positive coefficients increase the hazard of event occurrence,
whereas negative coefficients decrease the hazard of event occurrence.
Survival analysis can also be used to examine the effect of covariates on the expected time to
the event itself, rather than the hazard of the event occurrence, in “Accelerated Failure Time”
(AFT) models in Equation 15.
Equation 15
\[ \log(T_i) = \beta_0 + \sum_{j=1}^{\text{Covariates}} \beta_{i,j}\, x_{i,j} + \sigma \varepsilon \]
The covariates on the right hand side of equation 15 directly increase or decrease the expected
log(T). Examples of parametric AFT models include the Exponential, Weibull, Log-Normal, Log-
Logistic, and Generalised-Gamma model. Note that the Log-normal, Log-logistic, and
Generalised-Gamma cannot be expressed as proportional hazard models.
Sometimes, both the hazard model and the AFT model can be interpreted in similar ways. An
increasing hazard rate can be interpreted that the event occurrence is increasingly likely as time
progresses, thereby shortening the expected duration. Conversely, a decreasing hazard rate
can be interpreted as the event being decreasingly likely to occur with time, lengthening the
expected duration. However, this interpretation means that the coefficient signs will have
opposite meanings between hazard and AFT models – Longer durations are expected from
negative coefficients in hazard models, but from positive coefficients in AFT models, and vice-
versa.
For Exponential specifications, ϵ has an extreme value distribution with the scale parameter σ
being fixed to 1. For the Weibull specification, ϵ also has an extreme value distribution, but the
scale parameter σ can take on any positive value. This allows an increasing or a decreasing
hazard. For the Log-Normal specification, ϵ is normally distributed and the model reduces to a
linear regression equation of log(T). In the absence of censored data, a log-normal AFT model
is identical to the ordinary least squares regression model in Equation 2 (Allison, 2010). For
the Log-Logistic specification, ϵ has the distribution shown in Equation 16.
Equation 16
\[ PDF(\varepsilon_j) = \frac{e^{\varepsilon_j}}{\left(1 + e^{\varepsilon_j}\right)^2} \]
This probability density function is symmetric about its mean of zero, and is bell-shaped with
heavier tails than the normal distribution.
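The symmetry and the heavier tails can be illustrated with a short calculation. The sketch below evaluates Equation 16 and compares the probability of falling more than k standard deviations from the mean under the logistic and normal distributions; the choice of k = 3 and k = 4 is only illustrative.

```python
import math

def logistic_pdf(e):
    """Equation 16, evaluated through |e| so large inputs do not overflow."""
    z = math.exp(-abs(e))
    return z / (1.0 + z) ** 2

def logistic_two_sided_tail(k):
    """P(|e| > k standard deviations) for the standard logistic distribution."""
    sd = math.pi / math.sqrt(3.0)  # standard deviation of the standard logistic
    return 2.0 / (1.0 + math.exp(k * sd))

def normal_two_sided_tail(k):
    """P(|z| > k standard deviations) for the normal distribution."""
    return math.erfc(k / math.sqrt(2.0))

for k in (3.0, 4.0):
    print(k, logistic_two_sided_tail(k), normal_two_sided_tail(k))
```

Far from the mean, the logistic tail probability is several times the normal one, which is the sense in which the distribution is heavier-tailed.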
Because the objective is to build a model that would predict the duration of the delay itself,
rather than the probability of incident clearance over time, the AFT formulation is more relevant
for this study’s purposes.
4. Discussion of Results
Linear Regression, Ordered Logit, and Survival Models were estimated using R. Goodness of fit
statistics and diagnostic plots for each model are presented in the following sections. All
resulting parameter estimates are shown in Table 13 in Section 5.2.
4.1 Linear Regression Model
For a regression model to be appropriate, the residuals should be normally distributed with
constant variance about the predicted value. The residual plot in Figure 7 does not show
any patterns that suggest obvious improvements to the model. However, the
residual plot suggests that the variance is not constant and that the residuals are not normally
distributed. The deviations on the left and right sides of the Normal Q-Q plot suggest that the
errors have a distribution that is heavier-tailed than the normal distribution.
Figure 7: Residual and QQ plots for the OLS regression do not support the assumption of normally distributed errors
The mean square error on the test set was 127.8303, and the R2 of the model
on the training set was 0.5433. The low R2 value indicates that the model was not able to
capture almost half of the variation. These results argue against using OLS regression.
4.2 Ordered Logit Model
In constructing an ordered logit model, two methods were considered: fitting the ordered logit
model directly to the data and generating predictions from the parameter estimates, and
generating predictions by simulation.
4.2.1 Ordered Logit Model without simulation
Time intervals were defined based on the log of the duration.
Equation 17
\[ y_i^* = \log(\text{Delay}) \]
The thresholds for each level are then defined in Table 4.
Table 4: The threshold definitions for the output levels of the ordered logit model. Each interval is closed on the left and open on the right.
y    y* lower    y* upper    Delay lower (min)    Delay upper (min)
2    0           1           1                    2.718
3    1           2           2.718                7.389
4    2           3           7.389                20.086
5    3           4           20.086               54.598
6    4           5           54.598               148.413
7    5           infinity    148.413              infinity
The parameter estimates for the ordered logit model are once again shown in Table 13 in
Section 5.2. Although it is possible to calculate the P-values for these statistics using the t-
values, Ripley (2013) claims that such statistical tests are “wildly misleading”. To assess the
accuracy of the ordered logit model’s prediction, a confusion table is generated that shows the
count (Table 5) and percentage (Table 6) of correct classifications and misclassifications.
Table 5: Confusion table that shows the count of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column labels are the actual outcomes.
Predicted       Actual P2    Actual P3    Actual P4    Actual P5    Row total
P2              3884         969          46           4            4903
P3              550          3610         566          57           4783
P4              11           70           128          41           250
P5              0            0            1            14           15
Column total    4445         4649         741          116          9951
Table 6: Confusion table that shows the percentage of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column labels are the actual outcomes. Percentages are taken within each actual class (column).
Predicted    Actual P2    Actual P3    Actual P4    Actual P5
P2           87.37908     20.84319     6.207827     3.448276
P3           12.37345     77.65111     76.38327     49.13793
P4           0.247469     1.5057       17.27395     35.34483
P5           0            0            0.134953     12.06897
The total correct classification percentage is calculated by summing the diagonal values in
Table 5 and dividing by the total number of records used.
Equation 18
\[ \text{Percent Correct Classification} = \frac{\sum_{i=2}^{5} n_{i,i}}{N} \]
where ni,i is the diagonal count of class i, and N is the total number of records used by the
ordered logit model, which is 9951. In this model, the resulting accuracy rating is
\[ \text{Percent Correct Classification} = \frac{3884 + 3610 + 128 + 14}{9951} = 76.7\% \]
Although the model correctly predicts the class of records more often than not, most of the
correct classifications are in the shorter duration classes.
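The calculation in Equation 18 can be reproduced from the counts in Table 5. The sketch below also computes the share of each actual class that was classified correctly, which corresponds to the diagonal of Table 6; the variable names are illustrative.

```python
# Counts from Table 5: rows are predicted classes P2..P5, columns are actual classes P2..P5.
CONFUSION = [
    [3884, 969, 46, 4],
    [550, 3610, 566, 57],
    [11, 70, 128, 41],
    [0, 0, 1, 14],
]

def overall_accuracy(matrix):
    """Equation 18: sum of the diagonal counts divided by the number of records."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def share_correct_by_actual_class(matrix):
    """Diagonal count divided by the column total, for each actual class."""
    size = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [matrix[c][c] / column_totals[c] for c in range(size)]

accuracy = overall_accuracy(CONFUSION)
by_class = share_correct_by_actual_class(CONFUSION)
print(round(100 * accuracy, 1))
print([round(100 * share, 2) for share in by_class])
```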
From Table 6, the shorter duration events tended to have the highest percentage of correct
classifications by the model. The longer duration events have lower percentages of correct
classifications. This is not surprising, because the problem arises when one of the groups has a
very small share of the observed data set. Artificially inflating the underrepresented
share can solve this problem but can bias the estimated thresholds (Train, 2002).
Alternatively, simulation can be used to see if the accuracy improves.
4.2.2 Ordered Logit Model by simulation
Predictions were simulated by drawing a random number from a uniform distribution between 0
and 1 for each observation (Train, 2009). The corresponding prediction for that random number
was found by comparing it with the CDF of that individual. This process was repeated for each
individual 1000 times. The final prediction for each individual was taken as the most frequently
predicted output class for that individual. A sample calculation follows:
The predicted probabilities of each class for observation #2067 are shown in Table 7.
Table 7: Predicted probabilities for record #2067
Record    P1    P2         P3         P4          P5
2067      0     0.41741    0.57082    0.010842    0.000928
The cumulative distribution function (CDF) for observation #2067 is calculated using Equation 19.
Equation 19
\[ P(y_i \le j) = \sum_{k=1}^{j} P(y_i = k) \]
The resulting CDF is given in Table 8.
Table 8: CDF for record #2067
Cumulative probability    Definition                Value
P(<=1)                    P(1)                      0
P(<=2)                    P(1)+P(2)                 0.41741
P(<=3)                    P(1)+P(2)+P(3)            0.98823
P(<=4)                    P(1)+P(2)+P(3)+P(4)       0.999072
P(<=5)                    P(<=4)+P(5)               1
Next, a random number η is drawn from the uniform distribution between 0 and 1, which
represents the value of the CDF for that observation.
Equation 20
\[ \eta = F(x) \]
The value of η dictates the location on the CDF of the predicted event. For example, if η=0.45,
then η falls between P(<=2)=0.41741 and P(<=3)=0.98823. The predicted class is the one whose
probability is the difference between these two cumulative probabilities, which is class 3 (P3).
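The sample calculation above can be sketched as follows, using the probabilities from Table 7. The function names and the random seed are illustrative assumptions and are not taken from the study's R code.

```python
import bisect
import random
from collections import Counter

def cumulative(probs):
    """Equation 19: running sum of the class probabilities."""
    total = 0.0
    cdf = []
    for p in probs:
        total += p
        cdf.append(total)
    cdf[-1] = 1.0  # guard against floating-point drift in the final entry
    return cdf

def class_for_draw(cdf, eta):
    """Return the 1-based class whose CDF interval contains eta."""
    return min(bisect.bisect_right(cdf, eta), len(cdf) - 1) + 1

def simulate_prediction(probs, draws=1000, seed=0):
    """Draw eta repeatedly and return the most frequently predicted class."""
    rng = random.Random(seed)
    cdf = cumulative(probs)
    counts = Counter(class_for_draw(cdf, rng.random()) for _ in range(draws))
    return counts.most_common(1)[0][0]

RECORD_2067 = [0.0, 0.41741, 0.57082, 0.010842, 0.000928]  # Table 7
cdf_2067 = cumulative(RECORD_2067)
print(class_for_draw(cdf_2067, 0.45))
print(simulate_prediction(RECORD_2067))
```

One consequence of taking the most frequent class over many draws is that the final prediction tends toward the class with the highest predicted probability.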
The results are shown in Table 9 and Table 10 below:
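Table 9: Confusion table of counts from the simulated ordered logit model. The row labels are the predicted outcomes, and the column labels are the actual outcomes.
Predicted    Actual P2    Actual P3    Actual P4    Actual P5
P2           3231         1149         67           9
P3           1177         2960         472          51
P4           30           479          169          35
P5           7            61           33           21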
Table 10: Confusion table of percentages from the simulated ordered logit model. Percentages are taken within each actual class (column).
Predicted    Actual P2    Actual P3    Actual P4    Actual P5
P2           72.68841     24.71499     9.041835     7.758621
P3           26.47919     63.66961     63.69771     43.96552
P4           0.674916     10.30329     22.80702     30.17241
P5           0.15748      1.31211      4.453441     18.10345
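Simulation modestly improved the share of correct classifications for the longer-duration classes (P4
and P5) but reduced it for the shorter-duration classes (P2 and P3), so the accuracy did not improve
overall relative to the fitted model. Furthermore, neither the fitted ordered logit model nor the
simulated model predicts any event in the first class, even though there are over 2000 of these
events. These results argue against pursuing ordered logit models in the future.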
4.3 Survival Analysis
As explained in the background section, the AFT interpretation of survival models made more
sense for this study than the hazard model itself, although the results of the AFT can
provide insight into the time-dependency of the hazard.
4.3.1 Heterogeneity
Unobserved heterogeneity had to be considered for two reasons:
1. Not all attributes of each observed delay could be used in model building, such as Guard
and Operator. This is because these variables must be treated as categorical variables,
and there are too many distinct categories. By intentionally omitting these covariates
from the model building, unobserved heterogeneity is introduced. Accounting for
heterogeneity can improve the model fit.
2. Some guards, operators, and train cars experience repeated delays over the year 2013,
creating possible dependencies between some observations.
It is not critical whether parametric or nonparametric heterogeneity is used (Keifer, 1988,
as cited in Sharman, 2014). Common parametric heterogeneity forms include Gamma and
Gaussian distributed heterogeneity. Both Gamma and Gaussian heterogeneity distributions are
supported in R; however, at most one heterogeneity term, replacing one variable, is
allowed. Although the 35 incident types could have been aggregated into a smaller number of
categories, this aggregation would have introduced additional sources of heterogeneity,
precluding us from replacing other variables such as Operator# or Guard# with the random
frailty term.
4.3.2 Overall assessment of model fit, quality, and limitations
There are a number of ways to check the adequacy of the proposed hazard model specification. One method is to examine the graphical plot of the Cox-Snell residuals. The plot is constructed, with log{-log[S(CSi)]} on the y-axis and log(CSi) on the x-axis, where
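Equation 21
\[ CS_i = -\log(S(t_i \mid x_i)) = H(t_i) \]
and S(ti|xi) is the parametric survival curve that was fitted using maximum likelihood techniques. S(CSi) is the non-parametric Kaplan-Meier survival curve, given by Equation 22.
Equation 22
\[ S(t) = \prod_{k=1}^{t} \left(1 - \frac{d_k}{n_k}\right) = S(t-1)\left(1 - \frac{d_t}{n_t}\right) \]
where d_k is the number of events occurring at time k and n_k is the number of observations still at risk at time k.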
For a specification to be a good one, the resulting plot should closely align with a slope of 1 and an intercept of 0 (Machin, et al., 2006). The Exponential, Weibull, and Log-Logistic specifications were tested.
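A minimal sketch of the two ingredients of this diagnostic follows: the Cox-Snell residual of Equation 21 for a log-logistic AFT model, and the Kaplan-Meier product of Equation 22. It assumes no censoring and uses the parameterisation log(T) = μ + σε with a standard logistic ε; the function names and the values in the example are illustrative and not taken from the study.

```python
import math
from collections import Counter

def cox_snell_loglogistic(t, mu, sigma):
    """Equation 21 for a log-logistic AFT: CS = -log S(t|x) = log(1 + e^z),
    where z = (log t - mu) / sigma and mu is the linear predictor."""
    z = (math.log(t) - mu) / sigma
    return math.log1p(math.exp(z))

def kaplan_meier(values):
    """Equation 22 for uncensored data: list of (value, S(value)) pairs."""
    events = Counter(values)
    at_risk = len(values)
    survival = 1.0
    curve = []
    for v in sorted(events):
        survival *= 1.0 - events[v] / at_risk
        curve.append((v, survival))
        at_risk -= events[v]
    return curve

# Illustrative durations and parameters, not data from the study.
durations = [1.0, 2.0, 2.0, 3.0]
residuals = [cox_snell_loglogistic(t, mu=0.8, sigma=0.5) for t in durations]
curve = kaplan_meier(residuals)
print(curve)
```

For the plot itself, log(-log S) from the Kaplan-Meier curve would be plotted against the log of each residual; the last point has S = 0 when there is no censoring and is normally dropped.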
Figure 8: Cox-Snell Residual plots for the Exponential, Weibull, and Log-Logistic specifications. The data omits incidents less than 2 minutes.
From the Cox-Snell residual plots in Figure 8, it can be concluded that the Exponential, Weibull, and Log-Logistic specifications are all inadequate. The source of the problem could have been a misspecification of the covariate form in the linear predictor of Equation 14 (Lindqvist, Aaserud, & Kvaløy, 2012) or the inappropriate choice of the Exponential, Weibull, or Log-Logistic survival form. Out of the three specifications, the Log-Logistic model comes closest to a straight line, but at the wrong slope. Weng et al. (2014) arrived at similar results, but considered the specification to be appropriate because of the straightness of the line alone. Note that the above residual plots are based on the data where delays less than 2 minutes were excluded.
In a subsequent test, it was assumed that all incidents recorded as zero that can conceivably
affect train service were actually 1 minute. Although survival analysis is designed to handle
right-censored data, dealing with left-censored data is not as straightforward (Huston & Juarez-Colunga,
2009). It is unclear whether or not the R function “survreg” can handle left-censored
data. Therefore, in this analysis, the remaining zero-delay incidents were assigned a
uniformly distributed random number between 0 and 0.6. Treating these data as left-censored
should be the subject of future research. The resulting Cox-Snell residual plot is shown in Figure
9.
Figure 9: Cox Snell Residual plots for the Log-logistic specification. Data includes all incidents less than 2 minutes as well, with incidents less than 1 minute assigned a random number.
The Log-Logistic remains the best out of the three parametric specifications, and the slope of the residuals plots is much closer to 1. There is not yet theoretical justification for assuming a nonzero duration for all records in the data set, except that leaving the 0 minute durations as 0 would make the model difficult to estimate (the log of 0 is undefined). The possibility that some incident types followed different AFT specifications and scale parameters was also tested, but was found to worsen the model fit.
Next, the competing specifications are compared based on other fit statistics, such as the AIC or
McFadden’s ρ2 in Table 11.
Table 11: Fit statistics for the four common AFT specifications. The Log-logistic appears to be the best
Specification    Log Likelihood    Df    AIC
Exponential      -23253.38         38    46582.76
Weibull          -23002.41         36    46076.81
Log-normal       -22973.58         39    46025.17
Log-logistic     -22129.88         39    44337.76
The comparison of the AIC values also suggests that the log-logistic specification is the most
appropriate.
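The AIC values in Table 11 follow from the standard definition AIC = -2·LL + 2·Df. A short sketch, using only the log likelihoods and degrees of freedom from the table, recomputes them and picks the specification with the lowest value.

```python
# Log likelihood and degrees of freedom from Table 11.
MODELS = {
    "Exponential": (-23253.38, 38),
    "Weibull": (-23002.41, 36),
    "Log-normal": (-22973.58, 39),
    "Log-logistic": (-22129.88, 39),
}

def aic(log_likelihood, df):
    """Akaike information criterion: -2 * LL + 2 * df."""
    return -2.0 * log_likelihood + 2.0 * df

aic_values = {name: aic(ll, df) for name, (ll, df) in MODELS.items()}
best = min(aic_values, key=aic_values.get)
for name, value in aic_values.items():
    print(name, round(value, 2))
print("lowest AIC:", best)
```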
The adjusted approximate McFadden’s ρ2 in Equation 23 was used because the number of
observations used was very large, which would have dampened the effect of including too many
covariates (Sharman, 2014).
Equation 23
\[ \rho^2_{adjusted} = 1 - \frac{LL_{full} - r}{LL_{null}} = 1 - \frac{-22129.88 - 38}{-25982.18} = 0.1468 \]
where LLfull is the log likelihood of the fitted model, LLnull is the log likelihood of the null model, and r is the number of estimated parameters. This ρ2adjusted is two orders of magnitude larger than the ρ2adjusted value calculated by Sharman
(2014) in his study, who also recommended the log-logistic AFT as the final model.
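The arithmetic in Equation 23 can be verified directly from the values quoted there; the function name is illustrative.

```python
def adjusted_rho_squared(ll_full, ll_null, r):
    """Equation 23: 1 - (LL_full - r) / LL_null."""
    return 1.0 - (ll_full - r) / ll_null

# Values as quoted in Equation 23.
rho2 = adjusted_rho_squared(ll_full=-22129.88, ll_null=-25982.18, r=38)
print(round(rho2, 4))
```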
Finally, a nonparametric model is estimated to ensure that the parameter estimates of the AFT
model are not biased (Bhat, 1996, as cited in Sharman, 2014). A nonparametric hazard model
assumes a constant hazard within each time interval defined. Non-parametric modelling also
assumes proportional hazards and does not lend itself as easily to predicting the expected
duration directly. For example, an incident type, such as “suicide”, may make more sense as a
covariate in an AFT model, as the occurrence of such a high severity event leads to longer
clearing times. In the nonparametric proportional hazards model, however, suicides
have a constant effect on the risk of duration termination regardless of how long the incident has
lasted. A negative coefficient suggests that the risk of duration termination is low, but it is
unclear whether that means there would be a low risk of the incident clearing in an earlier time
interval or a later time interval. Conceivably, this ambiguity can be solved by including
interaction terms with incident types and time intervals to see how the effect of incident type
changes over time. Presently, however, the model fails to converge on a solution when
interaction terms are included.
In order to make the nonparametric hazard model directly comparable with the AFT model, it
was assumed that a high hazard rate corresponds to shorter durations, and low hazard rate
corresponds to longer durations, regardless of the time interval. The covariate effects are
compared in Table 12. The time intervals defined follow those of the Ordered Logit
Model in section 4.2.
There is considerable discrepancy between the parametric and nonparametric models. It is
therefore possible that the parametric estimates are biased. The problem may also be stemming
from an inappropriate specification of the nonparametric model as well (for example,
inappropriate choice of time intervals).
The baseline hazards, shown as ji at the bottom of Table 12, reach a peak at the second
interval and decrease afterward. Although this may support the Log-Logistic hazard, which
increases to a maximum before decreasing, the decrease in hazard is not monotonic. This is
another discrepancy between the non-parametric model and the Log-Logistic specification.
The non-parametric model cannot be used alone to make predictions of durations, as setting up
the model requires prior knowledge on the approximate duration of the delay as an input
covariate.
Although the effects of the covariates in the log-logistic AFT model are not consistent with the
results of the non-parametric model, there is more evidence to support the Log-Logistic
specification:
• the closely-fitting Cox-Snell residuals plot is nearly straight with a slope of 1,
• the lowest AIC value belongs to the log-logistic specification,
• the ρ2 value is 0.14, which is larger than that calculated in similar studies, and
• the distribution of the errors is heavy-tailed, as observed in section 4.1.
Table 12: Parameter estimates and effects comparison between the parametric and non-parametric formulations. There is considerable discrepancy between the two formulations.
Variable    Coeff. (parametric)    Coeff. (nonparametric)    Parametric effect    Nonparametric effect    ! if effects do not match
(Intercept) 0.8108*** 2.6844*** ↑ ↓ !
CategoryCOMMUNICATIONS 0.9036*** 0.949*** ↑ ↓ !
CategoryDEBRIS___TRACK 0.2325 1.4545*** ↑ ↓ !
CategoryDISORDERLY_PATRON 0.1446. -0.1205 ↑ ↑
CategoryDOOR 0.4872*** -0.0035 ↑ ↑
CategoryDOOR_PASSENGER 0.2979** -0.6719*** ↑ ↑
CategoryDOOR_PERSONNEL_MISTAKE 1.4548*** 1.6519*** ↑ ↓ !
CategoryESCALATOR_ELEVATOR_STAIRS -0.778*** -3.74*** ↓ ↑ !
CategoryFIRE___TRACK 1.4131*** 1.5765*** ↑ ↓ !
CategoryFIRE_IN_STATION 0.5096 2.0547*** ↑ ↓ !
CategoryFIRE_ON_TRAIN 1.657*** 1.7441*** ↑ ↓ !
CategoryHOLDUP_ALARM_ACTIVATED -2.2132*** -1.4357*** ↓ ↑ !
CategoryINJURY -0.2485** -0.1403. ↓ ↑ !
CategoryNO_POWER -0.4871 1.913*** ↓ ↓
CategoryONBOARD_MECHANICAL_MAJOR 0.425* -0.5354* ↑ ↑
CategoryONBOARD_MECHANICAL_MINOR 0.5091*** -0.2815* ↑ ↑
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 1.0791*** -0.4456*** ↑ ↑
CategoryOPERATOR_OVERSHOT_PLATFORM -0.391*** -0.6236*** ↓ ↑ !
CategoryOTHER -0.195* -0.6105*** ↓ ↑ !
CategoryPAA_ACTIVATED_BY_CUSTOMER -0.5859*** -0.5876*** ↓ ↑ !
CategoryPERSONNEL_ERROR -0.3647 -1.1146*** ↓ ↑ !
CategorySECURITY_OTHER 0.2643** 0.4597*** ↑ ↓ !
CategorySIGNAL_SWITCH -0.0248 0.2432** ↓ ↓
CategorySPEED_CONTROL -0.672*** 0.3727*** ↓ ↓
CategorySUICIDE_ 3.1148*** 2.9083*** ↑ ↓ !
CategoryTRACK_PROBLEM 0.0761 0.2495 ↑ ↓ !
CategoryTRACTION_POWER_PROBLEM 0.8795*** 0.6019*** ↑ ↓ !
CategoryTRAIN_STOP_CONTACTED -0.5181*** -2.1302*** ↓ ↑ !
CategoryTRAINING_DEPARTMENT -1.1104*** 0.0756 ↓ ↓
CategoryUNAUTHORISED_TRACK_LEVEL 0.9244*** 1.2434*** ↑ ↓ !
CategoryUNSANITARY_UNHEALTHY 0.4704*** -0.5106*** ↑ ↑
CategoryWEATHER -0.197 1.7318*** ↓ ↓
CategoryWORK_REFUSAL -0.1564 1.3354*** ↓ ↓
CategoryWORKZONE_PROBLEMS 0.9513*** -1.0524*** ↑ ↑
CategoryYARDHOUSE_PROBLEM 0.6248*** 0.3307* ↑ ↓ !
Peak_hour1 0.0552** 0.1055*** ↑ ↓ !
Interchange_station1 -0.0549* 0.1201*** ↓ ↓
Intercom1 0.1347*** -0.3911*** ↑ ↑
frailty.gamma(Guard, ↑
Scale 0.529
Frailty.Gamma 0.367
j1 0.5927***
j2 7.2736***
j3 -1.246***
j4 -0.8438***
j5 0
The log-logistic AFT is also consistent with the findings of other AFT studies in transportation
(Jones, et al., 1991, as cited in Valenti, et al., 2010; Sharman, 2014; Weng, et al., 2014). The
scale parameter being less than 1 suggests that the hazard increases to a maximum, and then
decreases afterward (Machin, Cheung, & Parmar, 2006; Allison, 2010). An increasing hazard
means that an event is increasingly likely to be cleared as time progresses. A decreasing
hazard means that an incident is decreasingly likely to come to an end as time progresses, and
this is analogous to incidents with extremely long durations. The suggested parametric
distribution and scale parameter can be supported by the wide spread of event durations as
shown by the boxplots in Figure 6.
The mean square error of the fitted model against the test set was 56.64, which is also
considerably better than the mean square error of 127 from the OLS regression model in
section 4.1.
5. Summary and Discussion of Results
5.1 Assessment of goodness of fit
Between the OLS regression model and the AFT model, the AFT model was the better
performing model on the holdout data, with a lower mean square error of 56, versus 127 for
the OLS model. Furthermore, the R2 value for the OLS regression model was only about 0.55,
which means that the regression model failed to explain nearly half of the variation. The McFadden
adjusted ρ2 value for the AFT model was only 0.14, but this cannot be interpreted as a poor fit in
the same way as R2 (Allison, 2010).
The Ordered Logit Model overpredicted shorter duration incidents and underpredicted longer
duration incidents. Generating predictions by simulating random draws from each observation’s
CDF did not improve the accuracy.
5.2 Comparison of Parameter Estimates between the three models
Table 13 includes the resulting parameter estimates from all three models. Note that a
significance indicator was attached to the parameter estimates for the Ordered Logit Model for
comparison purposes only, although the significance interpretation may not be valid, as
discussed in section 4.2.1. All three models agree, by their negative signs, that incidents at
interchange stations have shorter durations. Only the regression model predicts that
peak hour incidents are cleared faster (by its negative sign), but not significantly so. All three
models also predict that trains with intercoms have longer expected delay durations.
Table 13: Parameter Estimates for the Log-logistic AFT, OLS Regression, and Ordered Logit Model
Variable    β Log-logistic AFT    β OLS Regression    β Ordered Logit Model
(Intercept) 0.8108*** 1.71874***
2|3 Ordered Logit Model Only -3.9025
3|4 Ordered Logit Model Only 0.8644
4|5 Ordered Logit Model Only 3.4158
CategoryCOMMUNICATIONS 0.9036*** -0.50832*** -1.33136*
CategoryDEBRIS___TRACK 0.2325 -0.51197*** -2.25413***
CategoryDISORDERLY_PATRON 0.1446. -0.17944* -0.56732*
CategoryDOOR 0.4872*** -0.3837*** -1.74628***
CategoryDOOR_PASSENGER 0.2979** -0.61444*** -2.29766***
CategoryDOOR_PERSONNEL_MISTAKE 1.4548*** 0.30377 1.30081*
CategoryESCALATOR_ELEVATOR_STAIRS -0.778*** -0.07482 -1.49211
CategoryFIRE___TRACK 1.4131*** 0.57698*** 1.55113***
CategoryFIRE_IN_STATION 0.5096 -0.2971 -1.63022
CategoryFIRE_ON_TRAIN 1.657*** 0.75405** 2.05242*
CategoryHOLDUP_ALARM_ACTIVATED -2.2132*** 0.24078 0.31606
CategoryINJURY -0.2485** 0.02 -0.11177
CategoryNO_POWER -0.4871 1.48598*** 0.07684
CategoryONBOARD_MECHANICAL_MAJOR 0.425* -0.43562* -2.18963***
CategoryONBOARD_MECHANICAL_MINOR 0.5091*** -0.40336*** -1.74338***
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 1.0791*** -0.58654*** -2.53207***
CategoryOPERATOR_OVERSHOT_PLATFORM -0.391*** -1.19377*** -4.23457***
CategoryOTHER -0.195* -0.21884** -0.96065***
CategoryPAA_ACTIVATED_BY_CUSTOMER -0.5859*** -1.38509*** -4.90224***
CategoryPERSONNEL_ERROR -0.3647 -0.11267 -1.59531*
CategorySECURITY_OTHER 0.2643** 0.00113 -0.16588
CategorySIGNAL_SWITCH -0.0248 -0.97164*** -3.67877***
CategorySPEED_CONTROL -0.672*** -1.53454*** -5.73978***
CategorySUICIDE_ 3.1148*** 2.18616*** 6.00801***
CategoryTRACK_PROBLEM 0.0761 -0.77326*** -3.2738***
CategoryTRACTION_POWER_PROBLEM 0.8795*** -0.03314 -0.10878
CategoryTRAIN_STOP_CONTACTED -0.5181*** -1.34075*** -4.88328***
CategoryTRAINING_DEPARTMENT -1.1104*** -0.5995 -1.6665
CategoryUNAUTHORISED_TRACK_LEVEL 0.9244*** -0.03376 0.17715
CategoryUNSANITARY_UNHEALTHY 0.4704*** -0.30677*** -1.40978***
CategoryWEATHER -0.197 0.34076 0.27399
CategoryWORK_REFUSAL -0.1564 -0.17394 -1.02694
CategoryWORKZONE_PROBLEMS 0.9513*** 0.4357** 0.75202.
CategoryYARDHOUSE_PROBLEM 0.6248*** -0.10716 -0.34776
Peak_hour1 0.0552** -0.02063 0.11864*
Interchange_station1 -0.0549* -0.0412* -0.16484*
Intercom1 0.1347*** 0.03714* 0.11086*
R2 N/A 0.55 N/A
ρ2adjusted 0.14 N/A N/A
% of Correct classifications 76.7%
There is considerable discrepancy between the coefficient signs of certain incident types listed
in Table 14.
Table 14: List of incident types on which the three models disagree.
Variable Name
CategoryCOMMUNICATIONS
CategoryDEBRIS___TRACK
CategoryDISORDERLY_PATRON
CategoryDOOR
CategoryDOOR_PASSENGER
CategoryFIRE_IN_STATION
CategoryHOLDUP_ALARM_ACTIVATED
CategoryINJURY
CategoryNO_POWER
CategoryONBOARD_MECHANICAL_MAJOR
CategoryONBOARD_MECHANICAL_MINOR
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION
CategorySECURITY_OTHER
CategoryTRACK_PROBLEM
CategoryTRACTION_POWER_PROBLEM
CategoryUNAUTHORISED_TRACK_LEVEL
CategoryUNSANITARY_UNHEALTHY
CategoryWEATHER
CategoryYARDHOUSE_PROBLEM
The source of the discrepancy is unclear. The standard deviation for the duration for each
incident type was examined as a possible source. Although it was expected that the lack of
consensus may have been attributable to the very high variance of some incident types, this
was not the case. As shown in Table 15, no discernible trend exists that would
relate lack of consensus on sign with high variance. It should also be noted that, among the
incident types on which the three models agree, the coefficients of the Log-Logistic AFT model
are larger in magnitude than those of the OLS regression model only for
incident types with positive coefficients. This suggests that the Log-Logistic AFT model
may be giving more weight to extremely long-duration events, whereas the OLS regression may
be dismissing these events as “outliers”. This is not surprising given that the log-logistic model
allows for a wider spread of errors with their heavier-tailed distribution than the normal
distribution.
As discussed in Section 4.3.2, the Log-Logistic AFT achieved the best accuracy and had an
acceptable goodness of fit, unlike the other two models. Furthermore, as the Log-Logistic AFT is
better able to capture extremely lengthy durations, it is recommended to use the Log-Logistic
AFT as the final model.
Table 15: Incident types along with their mean duration and standard deviation, sorted by standard deviation. Non-consensus incident types (those listed in Table 14) are highlighted in yellow.
Incident type    Mean delay (min)    Std. dev. of delay (min)
CategoryNO_POWER 56.56648 152.936
CategoryWEATHER 44.38469 134.1471
CategorySUICIDE_ 61.625 32.86716
CategorySECURITY_OTHER 7.190962 32.57716
CategoryWORKZONE_PROBLEMS 10.46374 13.48527
CategoryCOMMUNICATIONS 6.282051 11.58481
CategoryFIRE_ON_TRAIN 16.5 11.16116
CategoryDEBRIS___TRACK 6.730769 10.69788
CategoryFIRE___TRACK 12.69528 10.02139
Grand Total (all incident types) 3.164316 9.86407
CategorySIGNAL_SWITCH 3.619211 8.665953
CategoryDOOR_PERSONNEL_MISTAKE 12 7.646015
CategoryUNAUTHORISED_TRACK_LEVEL 7.954935 7.050333
CategoryTRACK_PROBLEM 3.701149 6.446768
CategoryFIRE_IN_STATION 5.666667 6.372288
CategoryTRACTION_POWER_PROBLEM 6.444444 5.393159
CategoryOTHER 2.724066 4.626019
CategoryINJURY 3.149353 4.609265
CategoryYARDHOUSE_PROBLEM 4.973442 4.282172
CategoryASSAULT 3.593119 4.160723
CategoryDISORDERLY_PATRON 3.747075 3.462135
CategoryDOOR 4.464286 3.028167
CategoryPERSONNEL_ERROR 2.237552 2.571571
CategoryHOLDUP_ALARM_ACTIVATED 0.979396 2.53193
CategoryWORK_REFUSAL 2.369286 2.469779
CategoryTRAIN_STOP_CONTACTED 1.938897 2.465431
CategoryONBOARD_MECHANICAL_MINOR 4.148387 2.289994
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 3.542959 1.659106
CategoryUNSANITARY_UNHEALTHY 3.999085 1.621515
CategorySPEED_CONTROL 1.474679 1.595119
CategoryONBOARD_MECHANICAL_MAJOR 3.722222 1.48742
CategoryTRAINING_DEPARTMENT 1.213223 1.438952
CategoryOPERATOR_OVERSHOT_PLATFORM 2.08867 1.41492
CategoryDOOR_PASSENGER 3.395833 1.385331
CategoryPAA_ACTIVATED_BY_CUSTOMER 1.703869 1.254668
CategoryESCALATOR_ELEVATOR_STAIRS 0.332081 0.673997
5.3 Interpretation of Parameter Estimates
The parameter estimates from the Log-Logistic AFT model, along with the resulting expected
durations, are repeated in Table 16. The value of including the “Interchange_station1”,
“Peak_hour1”, and “Intercom1” variables is tested after they are interpreted.
On average, incidents are cleared 5% faster at interchange stations than at other stations. This
is not surprising, as incidents at interchange stations may affect more than one line, and hence,
larger parts of the system than incidents at non-interchange stations. Interchange stations also
tend to handle more passenger traffic than non-interchange stations. Furthermore, incidents are
more prevalent at interchange stations than at other stations. In light of this, the transit agency
probably assigned higher priority to clearing incidents at interchange stations.
The coefficient for Peak_hour1 is positive, suggesting that incidents are cleared faster during
off-peak hours. This does not match prior expectations, because the periods with short
prevailing headways correspond to the times of higher passenger volumes. Although it may be
in the transit agency’s interest to assign higher priority to clearing peak-period incidents, heavy
traffic conditions during rush hour may impede the quick performance of response teams, which
would add to the total delay duration. The actual effect of rush hour on clearance time is small,
increasing expected duration by only 5%.
The coefficient for “Intercom1” is positive and highly significant. Although it can be argued that the sign
should be negative, since an intercom should improve the detection and identification of the
nature of an incident, the presence of an intercom was not enough to overcome the other
problems that the Toronto Rocket trains experienced in 2013. Furthermore, it is possible that
passengers were not yet accustomed to using the newly available intercom effectively. Although
this variable was statistically significant, Toronto Rocket trains may no longer
increase the expected duration in later years, as technical bugs in the new technology are
resolved.
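The percentages quoted above follow from exponentiating the AFT coefficients, since the model is linear in the logarithm of duration. The short Python sketch below reproduces them from the estimates in Table 16; the function and variable names are illustrative and are not taken from the original analysis.

```python
import math

# Coefficients of the three non-causal variables, copied from Table 16.
coefficients = {
    "Peak_hour1": 0.0552,
    "Interchange_station1": -0.0549,
    "Intercom1": 0.1347,
}


def percent_change(beta):
    """Percent change in expected duration implied by an AFT coefficient."""
    return (math.exp(beta) - 1.0) * 100.0


for name, beta in coefficients.items():
    multiplier = math.exp(beta)  # multiplicative effect on duration
    print(f"{name}: multiplier {multiplier:.3f}, change {percent_change(beta):+.1f}%")
```

The multipliers printed by the sketch match the “multiplicative effect” column of Table 16 (1.057, 0.947, and 1.144 to three decimals).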
The value of including the variables “Interchange_station1”, “Peak_hour1”, and “Intercom1” in
the final model is tested using the Likelihood Ratio test in Equation 24:
Equation 24
$$LR = -2(LL_o - LL_a) \sim \chi^2_{dof=1,\ \alpha=0.05} = 3.841$$
where LLo is the log likelihood of the model without these variables, and LLa is the log likelihood
of the model with the variables. There is 1 degree of freedom because each
variable is tested individually, rather than all three variables simultaneously.
A variable adds value at the 95% confidence level when its LR exceeds the critical value of 3.841.
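As a check on the arithmetic, the sketch below recomputes the LR statistic of Equation 24 from the log likelihoods reported at the bottom of Table 16 and compares it with the critical value of 3.841. The p-value helper is an addition for illustration: it relies on the identity P(χ² > x) = erfc(√(x/2)), which holds for one degree of freedom only. All names are hypothetical.

```python
import math

CHI2_CRITICAL = 3.841  # chi-square critical value, dof = 1, alpha = 0.05 (Equation 24)


def likelihood_ratio(ll_o, ll_a):
    """LR = -2 (LLo - LLa), as in Equation 24."""
    return -2.0 * (ll_o - ll_a)


def p_value_dof1(lr):
    """Upper-tail chi-square probability; valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(max(lr, 0.0) / 2.0))


# (LLo, LLa) pairs copied from the bottom rows of Table 16.
log_likelihoods = {
    "Peak_hour1": (-23765.2, -23751.3),
    "Interchange_station1": (-23765.2, -23763.3),
    "Intercom1": (-23765.2, -23667.0),
}

for name, (ll_o, ll_a) in log_likelihoods.items():
    lr = likelihood_ratio(ll_o, ll_a)
    verdict = "adds value" if lr > CHI2_CRITICAL else "does not add value"
    print(f"{name}: LR = {lr:.1f}, p = {p_value_dof1(lr):.3g}, {verdict} at the 95% level")
```

With the Table 16 values, the statistics come out at 27.8, 3.8, and 196.4, so only “Interchange_station1” falls below the threshold, and only marginally.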
From the bottom of Table 16, “Peak_hour1” and “Intercom1” add value to the final
model (LR = 27.8 and 196.4, both above 3.841). Conversely, “Interchange_station1” falls just short of adding value at the 95%
confidence level (LR = 3.8 < 3.841). The latter variable is the least statistically significant of the three,
and only serves to reduce the predicted duration, making the model slightly more
optimistic about the clearance time. It is nevertheless insightful to include all three variables
in the final model.
The expected base delay (off-peak, involving a train without an intercom, and in a non-interchange
station) of different incident types is calculated by Equation 25:
Equation 25
$$Delay_{expected,\ base\ case} = \exp(\beta_o + \beta_{type}\, x_{type})$$
It was found that:
• incidents involving workzone problems, suicides, fires on trains, fires on tracks, and doors
opening off the platform all lengthen the expected duration,
• incidents involving speed control problems, PAA activations, train stop contacts,
personnel error, and work refusals tend to have the shortest durations, producing delays
of 2 minutes or less, and
• vehicle mechanical problems, including door problems, are cleared in only 3 to 4
minutes on average.
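Equation 25 can be verified against the “expected base delay” column of Table 16 with a few lines of Python. The sketch below uses the intercept and a handful of incident-type coefficients from that table; the dictionary keys and function name are illustrative, and the base category (Assault) is represented by a type coefficient of zero.

```python
import math

INTERCEPT = 0.8108  # beta_o from Table 16; base category is Assault

# A few incident-type coefficients (beta_type) copied from Table 16.
type_coefficients = {
    "ASSAULT (base)": 0.0,
    "SUICIDE_": 3.1148,
    "FIRE_ON_TRAIN": 1.657,
    "DOOR": 0.4872,
    "SPEED_CONTROL": -0.672,
}


def expected_base_delay(beta_type, beta_o=INTERCEPT):
    """Equation 25 with x_type = 1: delay in minutes for the base case
    (off-peak, no intercom, non-interchange station)."""
    return math.exp(beta_o + beta_type)


for incident_type, beta_type in type_coefficients.items():
    print(f"{incident_type}: {expected_base_delay(beta_type):.2f} minutes")
```

The results agree with Table 16 to rounding, for example roughly 50.7 minutes for suicides and 1.15 minutes for speed control problems.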
Table 16 Parameter estimates and average durations of the final model (log-logistic AFT)
Parameter                                       Estimate β  Multiplicative      Expected base    LLo       LLa       LR
                                                            effect on base      delay (minutes)
                                                            category (Assault)
(Intercept)                                     0.8108***   2.249707            2.249707
CategorySUICIDE_                                3.1148***   22.52892            50.68348
CategoryFIRE_ON_TRAIN                           1.657***    5.243557            11.79647
CategoryDOOR_PERSONNEL_MISTAKE                  1.4548***   4.283627            9.636905
CategoryFIRE___TRACK                            1.4131***   4.108673            9.24331
CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION  1.0791***   2.942031            6.618707
CategoryWORKZONE_PROBLEMS                       0.9513***   2.589073            5.824656
CategoryUNAUTHORISED_TRACK_LEVEL                0.9244***   2.520356            5.670062
CategoryCOMMUNICATIONS                          0.9036***   2.468474            5.553343
CategoryTRACTION_POWER_PROBLEM                  0.8795***   2.409695            5.421107
CategoryYARDHOUSE_PROBLEM                       0.6248***   1.867872            4.202166
CategoryFIRE_IN_STATION                         0.5096      1.664625            3.744919
CategoryONBOARD_MECHANICAL_MINOR                0.5091***   1.663793            3.743047
CategoryDOOR                                    0.4872***   1.627752            3.661965
CategoryUNSANITARY_UNHEALTHY                    0.4704***   1.600634            3.600958
CategoryONBOARD_MECHANICAL_MAJOR                0.425*      1.52959             3.44113
CategoryDOOR_PASSENGER                          0.2979**    1.347027            3.030416
CategorySECURITY_OTHER                          0.2643**    1.302519            2.930286
CategoryDEBRIS___TRACK                          0.2325      1.26175             2.838569
CategoryDISORDERLY_PATRON                       0.1446.     1.155577            2.59971
CategoryTRACK_PROBLEM                           0.0761      1.07907             2.427592
CategorySIGNAL_SWITCH                           -0.0248     0.975505            2.1946
CategoryWORK_REFUSAL                            -0.1564     0.855217            1.923988
CategoryOTHER                                   -0.195*     0.822835            1.851137
CategoryWEATHER                                 -0.197      0.821191            1.847438
CategoryINJURY                                  -0.2485**   0.77997             1.754704
CategoryPERSONNEL_ERROR                         -0.3647     0.694405            1.562208
CategoryOPERATOR_OVERSHOT_PLATFORM              -0.391***   0.67638             1.521657
CategoryNO_POWER                                -0.4871     0.614406            1.382233
CategoryTRAIN_STOP_CONTACTED                    -0.5181***  0.595651            1.340041
CategoryPAA_ACTIVATED_BY_CUSTOMER               -0.5859***  0.556605            1.252197
CategorySPEED_CONTROL                           -0.672***   0.510686            1.148894
CategoryESCALATOR_ELEVATOR_STAIRS               -0.778***   0.459324            1.033344
CategoryTRAINING_DEPARTMENT                     -1.1104***  0.329427            0.741115
CategoryHOLDUP_ALARM_ACTIVATED                  -2.2132***  0.10935             0.246006
Peak_hour1                                      0.0552**    1.056752                             -23765.2  -23751.3  27.8* > 3.841
Interchange_station1                            -0.0549*    0.94658                              -23765.2  -23763.3  3.8 < 3.841
Intercom1                                       0.1347***   1.144193                             -23765.2  -23667    196.4* > 3.841
Incidents that threaten customer safety and simultaneously affect train operations tend to have
longer average durations than incidents that either threaten customer safety without
affecting train operations, or affect train operations without affecting customer safety. This
may reflect the TTC’s “safety first” attitude, which requires a more thorough
response to life-threatening incidents. These results are also consistent with the findings of
Weng, et al. (2014), who explained that crews responding to incidents involving potential
casualties have to prioritise rescuing human lives over salvaging property and
equipment. Conversely, incidents involving personnel errors and mechanical failures that do not
threaten safety do not require external response agencies, such as EMS, Police, or Fire
Services, and hence tend to have shorter durations. This conclusion is also corroborated by
Weng, et al. (2014).
Note that incident duration within each incident type still has high variation, consistent with the
findings of Giuliano (1989), and supported by the heavier tails of the error distribution.
Predictions based on the average may not always be accurate.
6. Conclusion
In this study, we estimated and compared the quality and predictive capabilities of three
different models: OLS Regression, Ordered Logit Model, and AFT models. The models were
tested on their goodness of fit values and their error rate when used to make predictions for the
holdout data. The AFT model performed better than the OLS Regression model with a slightly
lower mean-square error rate. The parameter estimates were assessed and interpreted, and the
results were found to be consistent with prior expectations and previous studies (Weng, et al.,
2014). Although incident duration in subway systems has been studied in the past using incident
type as the explanatory variable, our study was able to determine that the effect of other
additional variables such as station type, train type, and time of day on incident duration was
statistically significant.
6.1 Recommendations for future work
Weng, et al. (2014) also considered a mixed effects model, as variation in clearance time can
be introduced by other factors, such as the Operator, Guard, and specific vehicle in the fleet.
A mixed effects model was tried in this study, but it did not always converge because of the excessive number of
parameters.
Other formulations for the incident type variables will be tested to see if the predictive model can
be made more parsimonious. Changes to the location, train type, and time variables will also be
considered to see if more insights can be gained.
References
Aaserud, S., Kvaløy, J. T., & Lindqvist, B. H. (2013). Residuals and functional form in
accelerated life regression models. In Risk Assessment and Evaluation of
Predictions (pp. 61-65). Springer New York.
Allison, P. D. (2010). Survival analysis using SAS: a practical guide. SAS Institute.
Bhat, C.R. (1996). A hazard-based duration model of shopping activity with nonparametric
baseline specification and nonparametric control for unobserved heterogeneity.
Transportation Research Part B: Methodological, 30, 189-207.
Chan, S., & McGinty, J. (2005, March 26). Think the Subway's Running Later? You're Right. The
New York Times. Retrieved from
http://www.nytimes.com/2005/03/26/nyregion/26subway.html?pagewanted=all
Giuliano, G. (1989). Incident characteristics, frequency, and duration on a high volume urban
freeway. Transportation Research Part A: General, 23(5), 387-396.
Huston, C., & Juarez-Colunga, E. (2009). Guidelines for computing summary statistics for data-
sets containing non-detects. Department of Statistics and Actuarial Science, Simon
Fraser University for the Bulkley Valley Research Center with assistance from the BC
Ministry of Environment.
Kiefer, N. M. (1988). Economic duration data and hazard functions. Journal of Economic
Literature, 26, 646-679.
Machin, D., Cheung, Y. B., & Parmar, M. (2006). Survival analysis: a practical approach. John
Wiley & Sons.
Man, J., Aarshabh, M., & Shalaby, A. (2014). A survey-based approach to
understanding the resilience implications of day-to-day
disruptions on transit operation (Unpublished dissertation). University of
Toronto, Toronto, Ontario.
Nam, D., & Mannering, F. (2000). An exploratory hazard-based analysis of highway incident
duration. Transportation Research Part A: Policy and Practice, 34(2), 85-102.
Ripley, B. (2013, October 19). RE: No P.values in polr summary. Message posted to
http://r.789695.n4.nabble.com/No-P-values-in-polr-summary-tp4678547p4678590.html
Roh, S. K. (n.d.). A study on the emergency response manual for urban transit
fires. In Proceedings of the 7th Asia-Oceania Symposium on Fire Science and
Technology (pp. 29-38). Retrieved from http://www.iafss.org/publications/aofst/7/17/view
Sarle, W.S. (1997), Neural Network FAQ, part 1 of 7: Introduction, periodic posting to the
Usenet newsgroup comp.ai.neural-nets. Retrieved from:
ftp://ftp.sas.com/pub/neural/FAQ.html
Sharman, B. (2014). Behavioural Modelling of Urban Freight Transportation: Activity and Inter-
Arrival Duration Models Estimated Using GPS Data (Unpublished doctoral dissertation).
University of Toronto, Toronto, Ontario.
Straphangers Campaign. (2014). Methodology: Straphangers Campaign Analysis of MTA Alerts
of Subway Incidents/Delays in 2011 and 2013. Retrieved from
http://www.straphangers.org/alerts/14/Methodology.pdf
Toronto Transit Commission. (2013). Chief Executive Officer’s Report – February 2013 Update.
[Toronto]. Retrieved from
https://www.ttc.ca/About_the_TTC/Commission_reports_and_information/Commission_meetings/2013/February_25/Reports/CHIEF_EXECUTIVE_OFFI.pdf
Toronto Transit Commission. (2014). Chief Executive Officer’s Report – November/December
2014 Update. [Toronto]. Retrieved from
https://www.ttc.ca/About_the_TTC/Commission_reports_and_information/Commission_meetings/2014/December_9/Reports/CHIEF_EXECUTIVE_OFFICERS_REPORT_NOVEMBER_DECEMBER_2014_UPDAT.pdf
Toronto Transit Commission. (2015). Chief Executive Officer’s Report – January 2015 Update.
[Toronto]. Retrieved from
https://www.ttc.ca/About_the_TTC/Commission_reports_and_information/Commission_meetings/2015/January_21/Reports/CHIEF_EXECUTIVE_OFFICER_S_REPORT_JANUARY_2015_UPDATE.pdf
Train, K. E. (2009). Discrete choice methods with simulation. Cambridge University Press.
Weng, J., Zheng, Y., Yan, X., & Meng, Q. (2014). Development of a subway operation incident
delay model using accelerated failure time approaches. Accident Analysis &
Prevention, 73, 12-19.
Weisstein, E. (n.d.). Bonferroni Correction. MathWorld – A Wolfram Web Resource. Retrieved
from http://mathworld.wolfram.com/BonferroniCorrection.html