

Building Statistical Models to Predict Durations of Subway System Incidents

by

Jacob Louie, M.Eng Candidate

#997677125

Supervisors: Dr. Amer Shalaby and Dr. Khandker Nurul Habib

Graduate Department of Civil Engineering University of Toronto


Building Statistical Models to Predict Durations of Subway System Incidents

Jacob Louie

Master of Engineering

Civil Engineering

University of Toronto

2015 (expected)

Abstract

This study compares the accuracy and goodness of fit of Ordinary Least Squares Regression, Ordered Logit Models, and Survival Analysis in predicting the duration of subway incidents in Toronto. The variables considered are the cause or type of incident, the location of the incident, the time of the incident, and the train type involved. The Survival model, specifically the Log-Logistic Accelerated Failure Time model, provided the highest accuracy and the most explanatory power. The effect of non-causal variables was also statistically significant.


Acknowledgements

I would like to thank my supervisors Professor Amer Shalaby and Professor Khandker Nurul

Habib for their continuous guidance throughout my last two semesters at the University of

Toronto. The last eight months have been a steep learning curve for me, and I am grateful

for the progress I have made so far. I would also like to thank all my friends who live in the ITS lab at the

University of Toronto and ARUP who have exchanged their ideas with me. I look forward to the

coming months to continue to improve upon my work for TRB.


Contents

Acknowledgements
List of Tables
List of Figures
1. Introduction
1.2 Literature Review
2. Data Description and Preprocessing
2.1 General Data Description
2.2 Trends in the Toronto Subway Performance
2.3 Data Preprocessing for Model Development
2.4 Variable Definitions Used in the Models
2.5 Data Partitioning for Training and Testing
3. Descriptions of model formulations considered
3.1 Ordinary Least Squares (OLS) Regression
3.2 Ordered Logit Model
3.3 Survival Analysis
4. Discussion of Results
4.1 Linear Regression Model
4.2 Ordered Logit Model
4.2.1 Ordered Logit Model without simulation
4.2.2 Ordered Logit Model by simulation
4.3 Survival Analysis
5. Summary and Discussion of Results
5.1 Assessment of goodness of fit
5.2 Comparison of Parameter Estimates between the three models
5.2 Interpretation of Parameter Estimates
6. Conclusion
6.1 Recommendations for future work


List of Tables

Table 1: Whether or not a line reached its on-time performance target

Table 2: Variables describing time, location, and vehicle type involved

Table 3: Variables describing the incident type

Table 4: The threshold definitions for the output levels of the ordered logit model

Table 5: Confusion table that shows the count of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column names are the actual outcomes.

Table 6: Confusion table that shows the percentage of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column labels are the actual outcomes

Table 7: Predicted probabilities for record #2067

Table 8: CDF for record #2067

Table 9: Confusion table resulting counts from the simulated ordered logit model

Table 10: Confusion table resulting percentages from the simulated ordered logit model.

Table 11: Fit statistics for the four common AFT specifications. The Log-logistic appears to be the best

Table 12: Parameter estimates and effects comparison between the parametric and non-parametric formulations. There is considerable discrepancy between the two formulations.

Table 13: Parameter Estimates for the Log-logistic AFT, OLS Regression, and Ordered Logit Model

Table 14: List of incident types on which the three models disagree

Table 15: Incident types along with their mean duration and standard deviation. Non consensual incident types are highlighted in yellow.

Table 16: Parameter estimates and average durations of the final model (log-logistic AFT)


List of Figures

Figure 1: The distribution of incidents over time

Figure 2: Histogram of incidents, assuming all incidents had a non-zero duration

Figure 3: Frequency of incident occurrence by count

Figure 4: Incident Types by total duration over 2013. Total number of hours of delay was 498.

Figure 5: Incident Types by Count, 2013. Total number of incidents exceeded 10 000

Figure 6: Boxplots showing the distribution of duration delays for each incident type.

Figure 7: Residual and QQ plots for OLS regression does not support assumption of normally distributed errors

Figure 8: Cox-Snell Residual plots for the Exponential, Weibull, and Log-Logistic specifications. The data omits incidents less than 2 minutes.

Figure 9: Cox Snell Residual plots for the Log-logistic specification. Data includes all incidents less than 2 minutes as well, with incidents less than 1 minute assigned a random number.


1. Introduction

In public transit systems, delays and incidents can negatively affect service throughout the

system, as well as damage the transit agency's image. Aside from prevention of incidents,

minimisation of the duration of the incidents is one way to mitigate the negative impacts on

service system-wide, and is of particular interest in this paper. Even if durations cannot be

minimised, predictions for delay durations are needed to help agencies decide whether or not

an incident is disruptive enough to warrant a formal response and to help customers decide on

alternative routes (Straphangers Campaign, 2014). Consequently, the purpose of this study is to

build statistical models that can predict the durations of incidents on the Toronto Transit

Commission (TTC) subway system. A list of incidents on the subway system over the year 2013

was provided by the TTC.

To accomplish this goal, three popular models were considered: Ordinary Least Squares (OLS)

regression, the ordered logit model, and the hazard model (also known as survival analysis, of which the

accelerated failure time (AFT) model is one form). These models will be compared on their predictive

capability, their accuracy, the quality of estimates (i.e. whether or not the signs and magnitudes

are logical), and their limitations.

This report will begin with a literature review of predicting the duration of delays in the

transportation context. A brief description of each model formulation, including their limitations

and methods to interpret their parameter estimates will then be given. This will be followed by a

discussion of the models’ goodness of fit statistics, and accuracy rate against holdout data. The

report will end with a comparative assessment between the models and an interpretation of the

parameter estimates.

1.2 Literature Review

Survival analysis has been used to predict the clearance time of highway incidents in the past.

In a study on highway incident duration in Washington State by Nam and Mannering (2000), the

total incident duration was broken down into time to detection of incident, time for responding

agency to react to incident, time for the response crew to travel to incident, and time for

response crew to clear the incident once they arrived. Separate models were constructed for

each time segment. The authors used different Accelerated Failure Time (AFT) models to

predict the effect of different covariates on the expected delay time itself and found that

variables describing time of year, location, and weather were significant predictors, although

their effects were not stable year-to-year. The data set identified the exact time of the onset of

the incident, the time of incident detection, and the time of incident clearance which allowed the

authors to study the time components separately.

Detecting an incident can be highly dependent on the manner of detection. For example, in an

incident that involved a fire that started on a subway train in Seoul, Korea, the response

procedure to the fire was delayed as the operator of the train did not receive the information of

the fire through the passenger intercom (Roh, n.d.). Consequently, the severity of the incident

increased dramatically.

Valenti, Lelli, and Cucina (2010) also compared regression, hazard, and neural network models in trying to predict the duration of traffic incidents. Each observation in their data also included which response agencies were involved (EMS, fire, police, etc.), and how many response personnel and response vehicles were at the site. The researchers found that most of the models had difficulty predicting both short duration incidents and long duration incidents simultaneously.

Footnote (Straphangers Campaign, 2014): http://www.straphangers.org/alerts/14/Methodology.pdf

Weng et al. (2014) also built an Accelerated Failure Time (AFT) model to predict the duration of

subway delays in Hong Kong from a dataset spanning 6 years. With a large multi-year dataset,

they were able to test the temporal transferability of their model, as well as consider very high

severity but rare incidents, such as crashes and fatalities resulting from accidents. The authors

used incident type as the only covariate in the model, and did not yet consider other

variables such as the time of day, time of year, day of week, prevailing headways, and location.

Sharman (2014) also constructed and applied a survival model to predict the dwell times of

urban delivery vehicles. The data set of dwell times was obtained through the preprocessing of

“passively-collected GPS data”. Sharman found that the AFT formulation, specifically the log-

logistic specification, as described in section 3.3, yielded the best results.

Giuliano (1989) used analysis of variance to compare the duration of freeway incidents against

different categorical variables, including whether the incident occurred during day or night time,

the number of lanes closed or affected by the incident, incident type, and whether or not trucks

were involved. Giuliano found that although incident type was statistically significant, there was

a large variation within each incident type.

In the transportation context, there seem to be more duration modelling studies of highway

traffic incidents than of public transit incidents. Past research on the duration analysis of

public transit incidents has not reached the level of detail that studies on highway incidents were able

to achieve, as far as incorporating other variables aside from incident type or cause is

concerned. Incorporating variables on time of occurrence, location, and passenger volumes

remains a goal for Weng et al. (2014) and is the motivation for our study as well.

2. Data Description and Preprocessing

2.1 General Data Description

In the 2013 subway incident data set provided by the TTC, the following information is given for

each recorded incident: the date, time, day of week, station, bound, line, resulting delay in

minutes, the vehicle #, the operator #, the guard #, a description of the incident, and a code

representing the incident type.

There were roughly 12 000 subway incidents in the year 2013. 55% of these incidents register a

delay of 0 minutes. The smallest non-zero delay was 2 minutes. The remaining delays were

given to the nearest minute. It is also worth noting that other transit agencies, such as the MTA

in New York City, the MTR in Hong Kong, and the MRT in Singapore, record only incidents that

exceed 8, 8, and 5 minutes, respectively (SC, 2014; Weng, et al., 2014). On the other hand, it

can also be argued that short duration incidents of even less than 2 minutes can still be

influential on the subway system’s performance when headways are extremely short.

Figure 1 shows the distribution of incident occurrence over time of day. Of interest from this

figure are the two peaks in incident occurrence during the morning peak period and evening

peak period, defined as 6h-9h and 15h-19h, respectively. It is not surprising to see the frequency

of incident occurrence rising with passenger demand.


Figure 1: The distribution of incidents over time

From Figure 2, the vast majority of subway incidents have short durations, and the frequency of

incident durations decreases with increasing durations. The overall distribution of incident

duration exhibits a rough exponential distribution.

Figure 2: Histogram of incidents, assuming all incidents had a non-zero duration

Figure 3 shows the location distribution of the incidents. Many incidents seem to be concentrated

at terminals, at major interchanges, and locations near yards. The prevalence of incidents is

also higher on the Yonge-University-Spadina (YUS) line than on the Bloor-Danforth (BD) line,

possibly due to higher passenger traffic overall.

[Figure 1 chart: “Distribution of incidents throughout the Day”. x-axis: Time of Day (hourly, 1:00 to 0:00); y-axis: Count (0 to 1200).]

[Figure 2 chart: “Histogram of incident durations”. x-axis: Time (minutes, 0 to 160); y-axis: Percentage (0 to 0.45).]


Figure 3: Frequency of incident occurrence by count

2.2 Trends in the Toronto Subway Performance

By the end of 2014, the TTC moved 1.8 million daily riders (TTC, 2015), a 2% growth since

2013. Work is now underway to update the signalling system to increase the capacity, causing

an increase in track-level maintenance activity (TTC, 2013).

Table 1 lists the relative performance of each rapid transit line since 2013.

Table 1: Whether or not a line reached its on-time performance target

Line                            2015   2014   2013
Yonge-University-Spadina (YUS)  Below  Below  Below
Bloor-Danforth (BD)             Below  Below  Above
Sheppard (SHP)                  Above  Above  Above
Scarborough RT (SRT)            Above  Below  Above

The Yonge-University-Spadina (YUS) line has consistently been the worst performing subway

line in Toronto since at least 2013, with its average on-time performance peaking

at less than 94% (TTC, 2015). In contrast, other lines regularly met or surpassed their

on-time performance target of 96%. This difference in performance could be due

to a number of factors. As the subway system ages and demand increases, the need to update

the infrastructure becomes more urgent (Man, Misra, & Shalaby, 2014). Work has been

underway on the YUS line to upgrade the signalling system and other critical infrastructure.

Also, the new Toronto Rocket (TR) trains were the cause of many door-related delays when

they were first introduced (TTC, 2013, p. 5).


Other large cities are experiencing similar problems. In New York City, the top delay causes

include sickness, doors being held by passengers, speed-control and signal issues, and

security. Some of these causes, such as police stoppages, have become more prevalent in light

of terrorist attacks in recent history (Chan & McGinty, 2005). In Toronto, security was also the

leading incident type, accounting for the largest share of delay time, followed by door problems.

The breakdown of delays by cause is given in Figure 4 and Figure 5.

Figure 4: Incident Types by total duration over 2013. Total number of hours of delay was 498.

Figure 5: Incident Types by Count, 2013. Total number of incidents exceeded 10 000


As can be seen from the pie charts in Figure 4 and Figure 5, the incident types that account for the most delay time are not necessarily

the most frequent. The variation of incident durations within each incident type is also very

large, as shown in Figure 6. It would be inadequate to simply describe each incident type with a

single point average.

Figure 6: Boxplots showing the distribution of duration delays for each incident type.

2.3 Data Preprocessing for Model Development

To improve the model fit, incidents recorded with a 0 minute delay that could conceivably affect train operations were assumed to have actually lasted between 0 and 2 minutes, and were reassigned a delay duration of 1 minute. For the remaining incidents with a 0 minute delay, different options were tested, including omitting them entirely (due to lack of relevance to train operations) or reassigning them a random duration of less than 1 minute. In the context of Survival Analysis, these zero delay incidents could alternatively have been treated as “left-censored”, in which case the only knowledge of the incident duration is that it was less than the minimum threshold of 2 minutes (Allison, 2010).
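The reassignment rules described in this section can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: the function name `preprocess_delay`, the flag `affects_operations`, and the `zero_policy` options are hypothetical names introduced here, the decision of which zero-delay incidents affect train operations is assumed to be made beforehand by the analyst, and drawing the random duration from a uniform distribution is also an assumption.

```python
import random
from typing import Optional


def preprocess_delay(delay_min: float, affects_operations: bool,
                     zero_policy: str = "random",
                     rng: Optional[random.Random] = None) -> Optional[float]:
    """Return the duration used for modelling, or None if the record is omitted.

    All names are hypothetical; only the rules themselves come from the text.
    """
    if delay_min > 0:
        return delay_min            # recorded delays (2 minutes or more) are kept
    if affects_operations:
        return 1.0                  # assumed to lie between 0 and 2 minutes
    if zero_policy == "omit":
        return None                 # option 1: drop the record entirely
    rng = rng or random.Random()
    # option 2: a random duration of less than 1 minute (uniform is an assumption)
    return rng.uniform(0.01, 0.99)
```

Keeping the two zero-delay options behind a single switch makes it straightforward to refit each model under both treatments and compare the resulting fit.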

2.4 Variable Definitions Used in the Models

Different formulations for the headway or time covariates were tested. The prevailing headways were derived using the TTC’s service summaries. However, in all of the estimated models, the coefficient for headway was statistically significant but very small. The final model uses a dummy indicator variable to represent whether or not the incident occurred during the peak hour. In this study, the peak hour was defined as 6h-9h or 15h-19h on Mondays-Fridays.
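As an illustration, the peak indicator defined above could be derived from an incident timestamp as in the sketch below. The function name `peak_hour_indicator` is hypothetical, and treating each period as a half-open interval (so that 9:00 and 19:00 exactly fall outside the peak) is an assumption, since the text does not state how the boundaries were handled.

```python
from datetime import datetime


def peak_hour_indicator(timestamp: datetime) -> int:
    """Return 1 if the incident falls in the weekday peak period, else 0.

    Peak periods follow the text: 6h-9h or 15h-19h, Monday to Friday.
    Treating each period as half-open [start, end) is an assumption.
    """
    is_weekday = timestamp.weekday() < 5          # Monday=0 ... Friday=4
    in_am_peak = 6 <= timestamp.hour < 9
    in_pm_peak = 15 <= timestamp.hour < 19
    return int(is_weekday and (in_am_peak or in_pm_peak))
```

For example, an incident at 8:30 on a Monday would be coded 1, while the same time on a Saturday would be coded 0.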

Multiple formulations for location were also possible. Earlier models divided the entire

subway system into 13 zones according to the incident’s location and direction on the line. A

continuous variable to represent location from a fixed point, such as the line’s terminal, was also


tested. In the end, the final model uses an indicator variable of whether or not the incident

occurred at an interchange station.

In light of the Seoul subway incident discussed in section 1.2 (Roh, n.d.), it was decided to

include a variable that represents whether or not the train was equipped with an intercom. In

2013, the only trains that were equipped with passenger intercoms were the new “Toronto

Rocket” (TR) trains.

There were over 100 different code types to represent incident type or cause. These had to be

aggregated to make the target model more tractable. Many of these code types were duplicates

of the same incident type, and a number of these codes had very few observations. The 100

codes were reducible to a minimum of 35 distinct categories without losing details on the nature

of the incident. Further reductions were possible by categorising the incident types based on

their severity level (Affecting customer safety, potential property damage, criminal, affecting

train operations only, and other).

Weng, et al. (2014) also identified 6 main incident categories in their study of the Hong Kong

subway system: Power infrastructure failure, vehicle failure, turnout malfunction, crash, and

operation error. However, many of the incidents in the TTC’s data set are not classifiable into

these categories.

Man, et al., (2014) proposed 10 categories derived from surveys and interviews with various

stakeholders of the public transit system. The proposed categories are Passenger-related,

Failure of infrastructure, Congestion delay, Weather-related, Accidents, Construction, Signals,

Power, Demand surge, and Other. Although the TTC data can be classified into these

categories, there would be a number of categories with differing severity levels that would

become aggregated with each other (e.g. PAA activations by customers and suicides can both

be considered “Passenger-related” incidents).

In this paper, it was decided to use all 35 incident types in the final model. If fewer parameters

are needed in the future, one way to further reduce the number of categories is to compare

suspected similar categories using the following formula:

Equation 1

$$t_{df=1,\,\alpha/n} = \frac{\beta_j - \beta_k}{\sqrt{\mathrm{Var}(\beta_j) + \mathrm{Var}(\beta_k) - 2\,\mathrm{Covar}(\beta_j, \beta_k)}}$$

Several methods exist to reduce the probability of Type I errors when making multiple post-hoc comparisons. In the Bonferroni correction method (Wesstein, n.d.), the level of significance α (for example, α = 5% for a 95% confidence level) is divided by the number of comparisons n to be made, in order to limit the probability of a Type I error occurring across the whole set of comparisons. The Bonferroni correction is often considered overly conservative when the number of pairwise comparisons gets large. In the current model formulation, the number of potential comparisons is up to

$$\binom{35}{2} = 595 \text{ different comparisons}$$

The number of comparisons can be reduced by deciding beforehand the pairs of coefficients to

test. This can be the subject of future work.
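The arithmetic behind this comparison procedure can be sketched as follows. The function names are hypothetical, and the coefficient, variance, and covariance values would in practice come from the fitted model's estimated covariance matrix, which is not reproduced here.

```python
import math


def n_pairwise(k: int) -> int:
    """Number of distinct pairs among k categories, i.e. k choose 2."""
    return math.comb(k, 2)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons


def coef_difference_t(b_j: float, b_k: float,
                      var_j: float, var_k: float, cov_jk: float) -> float:
    """Test statistic of Equation 1 for the difference between two coefficients."""
    return (b_j - b_k) / math.sqrt(var_j + var_k - 2.0 * cov_jk)
```

With 35 incident types, `n_pairwise(35)` gives 595, so a 5% overall level would correspond to a per-comparison level of 0.05/595, which illustrates why the correction becomes very strict when all pairs are tested.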


The variables used in the final model can be classified into four groups: incident location, incident

time, whether or not the train had an intercom system, and incident type. In the final model, all

variables were binary (1 indicating yes, and 0 indicating no). The detailed definitions of all of the

variables are given in Table 2 and Table 3.

Table 2: Variables describing time, location, and vehicle type involved

Variable Name          Description
Peak_hour1             If the incident occurred during rush hour, from 6am-9am, or from 3pm-6pm, Monday to Friday
Interchange_station1   If the incident occurred at an interchange station
Intercom1              If the train involved is equipped with an intercom system

Table 3: Variables describing the incident type

Variable name Definition

CategoryCOMMUNICATIONS Problems with Communication infrastructure with Transit Control

CategoryDEBRIS___TRACK Debris or objects at Track Level

CategoryDISORDERLY_PATRON Disorderly Patron

CategoryDOOR Door mechanical problems

CategoryDOOR_PASSENGER Door problems caused by passengers

CategoryDOOR_PERSONNEL_MISTAKE Door opened off platform or in tunnel by personnel

CategoryESCALATOR_ELEVATOR_STAIRS Escalator/Elevator/Stair problem

CategoryFIRE___TRACK Fire at track level

CategoryFIRE_IN_STATION Fire elsewhere in station

CategoryFIRE_ON_TRAIN Fire on train

CategoryHOLDUP_ALARM_ACTIVATED Holdup alarm activated

CategoryINJURY Injury to personnel or customer

CategoryNO_POWER Loss of Power – includes traction and station power loss

CategoryONBOARD_MECHANICAL_MAJOR Major vehicle problems

CategoryONBOARD_MECHANICAL_MINOR Minor vehicle problems

CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION Operator and/or guard not in position for duties

CategoryOPERATOR_OVERSHOT_PLATFORM Operator overshot platform

CategoryOTHER Other

CategoryPAA_ACTIVATED_BY_CUSTOMER PAA alarm activated by customer or unknown person

CategoryPERSONNEL_ERROR Mistake in operation by personnel or supervisor

CategorySECURITY_OTHER Incidents involving security or police

CategorySIGNAL_SWITCH Problems involving wayside signals or track switches

CategorySPEED_CONTROL Speed control restricting operations – includes Emergency Brake problems failing to release and gliding brakes

CategorySUICIDE_ Moving trains coming in contact with person

CategoryTRACK_PROBLEM Defective track structure

CategoryTRACTION_POWER_PROBLEM Traction Power Problem

CategoryTRAIN_STOP_CONTACTED Train tripped past a train stop

CategoryTRAINING_DEPARTMENT Incident caused by personnel in training

CategoryUNAUTHORISED_TRACK_LEVEL Unauthorised person at track level

CategoryUNSANITARY_UNHEALTHY Train taken out of service due to unsanitary conditions

CategoryWEATHER Incident caused by extreme weather

CategoryWORK_REFUSAL Delay caused by personnel refusing to work due to bad working environment

CategoryWORKZONE_PROBLEMS Delays caused by track level activity

CategoryYARDHOUSE_PROBLEM Incidents offline in the Yardhouse


2.5 Data Partitioning for Training and Testing

The data set was divided into a training set and a testing set. The training set consists of 70% of the records. Each model was fitted using the training set, and predictions were then generated for the testing set using the fitted model. The model was evaluated by taking the mean square error between the predicted and actual durations of the testing set. Note that only the ordinary least squares regression and survival analysis could be compared in this manner, as the ordered logit model cannot generate predictions at a 1 minute resolution. As the training set should contain the same proportion of short and long incidents as the testing set and the data set as a whole, the data set was divided into the 5 classes defined for the ordered logit model in section 4.2 before sampling 70% of each class. It was also possible to break the data down by incident type, but that would require sampling 70% from each of 35 groups, which requires considerably more effort.
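The partitioning and evaluation steps described in this section can be sketched as below. This is an illustrative sketch under stated assumptions rather than the procedure actually coded for the study: the names `stratified_split`, `class_of`, and `mean_square_error` are hypothetical, records are assumed to be held in a plain list, and the random seed is arbitrary.

```python
import random
from collections import defaultdict


def stratified_split(records, class_of, train_frac=0.7, seed=0):
    """Sample train_frac of each duration class for training; the rest is for testing.

    class_of is a caller-supplied function mapping a record to its class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[class_of(rec)].append(rec)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)                      # random order within the class
        n_train = round(train_frac * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


def mean_square_error(predicted, actual):
    """Mean of squared differences between predicted and actual durations."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

Because the sampling is done class by class, the same function would also cover stratification by incident type if `class_of` returned the incident category instead of the duration class.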

3. Descriptions of model formulations considered

As described earlier, this study will test three popular statistical models: the Ordinary Least

Squares Regression, the Ordered Logit Model, and Survival Analysis. A brief introduction of

each model is given in the remainder of section 3.

3.1 Ordinary Least Squares (OLS) Regression

It is necessary to model the log of the delay duration T rather than T itself, as this ensures

that the predicted T remains positive. The regression equation is given in Equation 2.

Equation 2

$$\log(T_i) = \beta_0 + \sum_{j=1}^{\text{Covariates}} \beta_{i,j}\, x_{i,j} + \sigma \varepsilon$$

where β’s are the explanatory variable parameter estimates, x’s are the explanatory variables,

and ϵ is the random error term that is normally distributed with a variance of 1. If the variance of

the error component is not 1, this would be captured in the scale factor σ. The goodness of fit is

represented by the R² statistic, where

Equation 3

$$R^2 = 1 - \frac{SSE}{SST}, \qquad SSE = \sum_{i=1}^{N} (y_i - y_i^*)^2, \qquad SST = \sum_{i=1}^{N} (y_i - y_{ave})^2$$

To reward parsimonious models, the adjusted R² of Equation 4 can be used instead of Equation 3:

Equation 4

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{N - 1}{N - p - 1}$$

where N is the sample size and p is the number of predictor variables.

Lengthier durations are predicted by positive parameter estimates in Equation 2, whereas

shorter durations are predicted by negative parameter estimates.
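The two goodness of fit measures of this section can be computed as in the sketch below. The function names are hypothetical; `actual` and `predicted` are assumed to be on the scale on which the model was fitted, here the log of the duration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST, as in Equation 3."""
    mean = sum(actual) / len(actual)
    sse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    sst = sum((y - mean) ** 2 for y in actual)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for sample size n and p predictor variables (Equation 4)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

With a large number of indicator variables, such as one per incident type, the adjusted value can be noticeably lower than the unadjusted one when the sample is small, which is the penalty on non-parsimonious models described above.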


3.2 Ordered Logit Model

An ordered logit model discretizes the output variable into categorical levels that have a natural

order. One example of the use of an ordered model is predicting the severity of a car accident,

where the ordered levels are minor, major property damage, injury, or death. In this study, the

ordered class levels can be categorised as very short, short, medium, long, and very long.

As there were many more observations of shorter durations than longer durations, the

discretization intervals were shorter for the shorter duration classes. The continuous but latent

variable y* helps predict the resulting class, and is given in Equation 5:

Equation 5

$$y^* = x'\beta + u$$

where x and β are as defined in section 3.1, and u is the random error term, which follows a logistic distribution in the logit model. The actual output class y_i is determined by y* and the threshold levels α, as given in Equation 6 and Equation 7.

Equation 6

$$y_i = j$$

where y_i is the discretized version of y* and is the output level for observation i, if

Equation 7

$$\alpha_{j-1} \le y^* < \alpha_j$$

Consequently, the probability of observing class j is given in Equation 8 and Equation 9.

Equation 8

$$p_{ij} = P(\text{observation } i \text{ will select } j) = P(y_i = j) = P(\alpha_{j-1} \le y^* < \alpha_j)$$

Equation 9

$$P(\alpha_{j-1} \le y^* < \alpha_j) = F(\alpha_j - x'\beta) - F(\alpha_{j-1} - x'\beta)$$

where F is the cumulative distribution function of u. For logit models, the cumulative distribution function is given in Equation 10.

Equation 10

$$F(z) = \frac{e^z}{1 + e^z}$$

The ordered logit model can only predict the duration of an incident approximately, as a duration class. In light

of this, it is not possible to directly compare the goodness of fit of an ordered model with a

regression model or a survival model, which output a predicted duration that is continuous.

Although there may exist mathematically rigorous ways to define the output classes, the classes

were defined based on the log of the duration as explained in section 4.2.

Parameter estimates with positive coefficients increase the probability of longer durations.

Negative coefficients increase the probability of shorter durations.
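As a minimal sketch of how class probabilities arise in an ordered logit model, the Python snippet below evaluates the logistic CDF at adjacent thresholds and takes differences. The models in this study were estimated in R; this snippet is illustrative only. The thresholds are borrowed from Table 13, while the function names and the two values of the linear predictor x'β are hypothetical.

```python
import math

def logistic_cdf(z):
    """Logistic CDF, F(z) = e^z / (1 + e^z), in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def ordered_logit_probs(xb, thresholds):
    """Return P(y = j) for every class, given linear predictor xb and sorted thresholds."""
    # Cumulative probabilities P(y* < alpha_j), padded with 0 and 1 at the two ends.
    cum = [0.0] + [logistic_cdf(a - xb) for a in thresholds] + [1.0]
    return [cum[j + 1] - cum[j] for j in range(len(cum) - 1)]

# Thresholds 2|3, 3|4 and 4|5 from Table 13 (used here for illustration only).
thresholds = [-3.9025, 0.8644, 3.4158]

# Hypothetical linear predictors for two observations.
low = ordered_logit_probs(-1.0, thresholds)
high = ordered_logit_probs(2.0, thresholds)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

A larger linear predictor moves probability mass toward the longer duration classes, which is why positive coefficients raise the probability of longer durations.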

3.3 Survival Analysis

Hazard modelling and survival analysis have been used to analyse the distributions of durations

of incidents and time for an event to occur (Machin, Cheung, & Parmar, 2006). Initially used

extensively in the medical field to analyse time for death or failure to occur, these methods have

appeared in the transportation context to analyse the durations of highway incidents (Nam &

Mannering, 1998). In this study, survival analysis will be used to predict the duration of a

subway incident.

Hazard models examine the conditional probability of an event occurring at any point in time,

given that it has not yet occurred. The hazard function is given in Equation 11:

Equation 11

$$ h(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)} $$

where h(t) is the conditional probability of event occurrence at time t, f(t) is the probability density function of event occurrence at time t, and F(t) is the cumulative distribution function of event occurrence at time t. F(t) can be interpreted as the probability of an

event having already occurred before or at time t. Therefore 1-F(t)=S(t) is the probability that an

event has not yet occurred at time t, or the probability of survival at t. The following relationships

can also be proven:

Equation 12

$$ h(t) = -\frac{d}{dt}\log(S(t)) $$

Equation 13

$$ H(t) = -\log(S(t)) $$

where H(t) is the cumulative hazard rate.
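Both relationships follow directly from the definition of the hazard. A brief derivation is sketched here, assuming that S(t) is differentiable and that S(0) = 1:

$$
\begin{aligned}
f(t) &= \frac{dF(t)}{dt} = -\frac{dS(t)}{dt} \\
h(t) &= \frac{f(t)}{S(t)} = -\frac{1}{S(t)}\,\frac{dS(t)}{dt} = -\frac{d}{dt}\log(S(t)) \\
H(t) &= \int_0^t h(s)\,ds = -\log(S(t)) + \log(S(0)) = -\log(S(t))
\end{aligned}
$$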

The hazard function itself can be expressed parametrically or non-parametrically. Examples of

parametric hazard models include the Exponential, Weibull, Log-normal, Log-logistic, and

Generalised-Gamma model. An Exponential hazard model assumes a constant hazard over

time – that the conditional probability of event occurrence does not change over time. Weibull

hazard models allow for monotonically increasing or decreasing hazards. Log-Normal and Log-Logistic models allow for decreasing, concave-up hazards, or for hazards that increase to a maximum and then decrease monotonically. Some hazard functions can be expressed in the form of Equation 14.

Equation 14

$$ h(t) = \beta_o + \sum_{j=1} \beta_{i,j} \, x_{i,j} $$

In “proportional hazard” models, the covariates on the right-hand side of Equation 14 proportionally increase or decrease the hazard function. Of the above-mentioned

parametric models, only the Exponential and Weibull models can be expressed as proportional

hazards.

In proportional hazard models, positive coefficients increase the hazard of event occurrence,

whereas negative coefficients decrease the hazard of event occurrence.

Survival analysis can also be used to examine the effect of covariates on the expected time to

the event itself, rather than the hazard of the event occurrence, in “Accelerated Failure Time”

(AFT) models in Equation 15.

Equation 15

$$ \log(T_i) = \beta_o + \sum_{j=1} \beta_{i,j} \, x_{i,j} + \sigma \varepsilon $$

The covariates on the right hand side of equation 15 directly increase or decrease the expected

log(T). Examples of parametric AFT models include the Exponential, Weibull, Log-Normal, Log-

Logistic, and Generalised-Gamma model. Note that the Log-normal, Log-logistic, and

Generalised-Gamma cannot be expressed as proportional hazard models.

Sometimes, both the hazard model and the AFT model can be interpreted in similar ways. An increasing hazard rate means that the event is increasingly likely to occur as time progresses, thereby shortening the expected duration. Conversely, a decreasing hazard rate means that the event is decreasingly likely to occur with time, lengthening the expected duration. However, this interpretation means that the coefficient signs have opposite meanings in hazard and AFT models: longer durations are expected from negative coefficients in hazard models, but from positive coefficients in AFT models, and vice versa.

For Exponential specifications, ϵ has an extreme value distribution with the scale parameter σ

being fixed to 1. For the Weibull specification, ϵ also has an extreme value distribution, but the

scale parameter σ can take on any positive value. This allows an increasing or a decreasing

hazard. For the Log-Normal specification, ϵ is normally distributed and the model reduces to a

linear regression equation of log(T). In the absence of censored data, a log-normal AFT model

is identical to the ordinary least squares regression model in Equation 2 (Allison, 2010). For

the Log-Logistic specification, ϵ has a distribution as shown in Equation 16.

Equation 16

$$ PDF(\varepsilon_j) = \frac{e^{\varepsilon_j}}{(1 + e^{\varepsilon_j})^2} $$

This probability density function is symmetric about a mean of zero and is bell-shaped, with heavier tails than the normal distribution.

Because the objective is to build a model that would predict the duration of the delay itself,

rather than the probability of incident clearance over time, the AFT formulation is more relevant

for this study’s purposes.

4. Discussion of Results

Linear Regression, Ordered Logit, and Survival Models were estimated using R. Goodness-of-fit statistics and diagnostic plots for each model are presented in the following sections. All resulting parameter estimates are shown in Table 13 in Section 5.2.

4.1 Linear Regression Model

For a regression model to be appropriate, the residuals should be normally distributed with constant variance about the predicted value. The residual plot in Figure 7 does not show any patterns that suggest obvious improvements to the model. However, the residual plot suggests that the variance is not constant and that the residuals are not normally distributed. The deviations on the left and right sides of the Normal Q-Q plot suggest that the errors have a distribution that is heavier-tailed than the normal distribution.

Figure 7: Residual and Q-Q plots for the OLS regression, which do not support the assumption of normally distributed errors

The mean square error on the test set was 127.8303 and the R2 of the model on the training set was 0.5433. The low R2 value indicates that the model was not able to capture almost half of the variation. These results argue against using OLS regression.

4.2 Ordered Logit Model

In constructing an ordered logit model, two methods were considered: to fit the ordered logit

model directly to the data and generate predictions from the parameter estimates, and to

generate predictions by simulation.

4.2.1 Ordered Logit Model without simulation

Time intervals were defined based on the log of the duration.

Equation 17

$$ y_i^* = \log(\text{Delay}) $$

The thresholds for each level are then defined in Table 4.

Table 4: The threshold definitions for the output levels of the ordered logit model

y    y* lower (incl.)   y* upper (excl.)   Delay lower (incl.)   Delay upper (excl.)
2    0                  1                  1                     2.718
3    1                  2                  2.718                 7.389
4    2                  3                  7.389                 20.086
5    3                  4                  20.086                54.598
6    4                  5                  54.598                148.413
7    5                  Infinity           148.413               Infinity
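The mapping in Table 4 can be sketched as a short Python function (illustrative only; the study itself used R). The function name is hypothetical, and the handling of delays below 1 minute is an assumption, since Table 4 does not list a level for them.

```python
import math

def duration_class(delay):
    """Map a delay to the ordered level y of Table 4 using the natural log of the delay."""
    if delay < 1:
        # Assumption: Table 4 does not define a level for delays below 1; level 1 is used here.
        return 1
    # Each level spans one unit of log(delay); level 7 is open-ended.
    return min(int(math.floor(math.log(delay))) + 2, 7)

for delay in (1, 3, 10, 30, 60, 200):
    print(delay, duration_class(delay))
```

Because the levels are equally spaced on the log scale, the intervals are narrow for short delays and wide for long ones, which matches the uneven discretization described in section 3.2.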

The parameter estimates for the ordered logit model are shown in Table 13 in Section 5.2. Although it is possible to calculate the P-values for these estimates using the t-values, Ripley (2013) claims that such statistical tests are “wildly misleading”. To assess the accuracy of the ordered logit model’s predictions, a confusion table is generated that shows the count (Table 5) and percentage (Table 6) of correct classifications and misclassifications.

Table 5: Confusion table that shows the count of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column names are the actual outcomes.

        P2      P3      P4      P5      Total
P2      3884    969     46      4       4903
P3      550     3610    566     57      4783
P4      11      70      128     41      250
P5      0       0       1       14      15
Total   4445    4649    741     116     9951

Table 6: Confusion table that shows the percentage of correct classifications and misclassifications. The row labels are the predicted outcomes, and the column labels are the actual outcomes

      P2          P3          P4          P5
P2    87.37908    20.84319    6.207827    3.448276
P3    12.37345    77.65111    76.38327    49.13793
P4    0.247469    1.5057      17.27395    35.34483
P5    0           0           0.134953    12.06897

The total correct classification percentage is calculated by summing the diagonal values in

Table 5 and dividing by the total number of records used.

Equation 18

$$ \text{Percent Correct Classification} = \frac{\sum_{i=2}^{5} n_{i,i}}{N} $$

where ni,i is the diagonal count of class i, and N is the total number of records used by the ordered logit model, which is 9951. In this model, the resulting accuracy rating is

$$ \text{Percent Correct Classification} = \frac{3884 + 3610 + 128 + 14}{9951} = 76.7\% $$

Although the model correctly predicts the class of records more often than not, most of the

correct classifications are in the shorter duration classes.
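The overall accuracy and the per-class percentages can be reproduced from the counts in Table 5. The following Python sketch is illustrative only (the study used R), and the variable names are hypothetical.

```python
# Counts from Table 5: rows are predicted classes P2..P5, columns are actual classes P2..P5.
confusion = [
    [3884, 969, 46, 4],
    [550, 3610, 566, 57],
    [11, 70, 128, 41],
    [0, 0, 1, 14],
]
n_classes = len(confusion)

# Overall accuracy: sum of the diagonal divided by the number of records.
n_total = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = n_correct / n_total

# Share of each actual class that was predicted correctly (the diagonal of Table 6).
col_totals = [sum(row[j] for row in confusion) for j in range(n_classes)]
per_class = [confusion[j][j] / col_totals[j] for j in range(n_classes)]

print(f"{n_correct} of {n_total} correct: {accuracy:.1%}")
print([f"{p:.1%}" for p in per_class])
```

The per-class shares make the imbalance visible: the overall figure is dominated by the two shortest classes, which contain most of the records.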

From Table 6, the shorter duration events have the highest percentages of correct classifications by the model, and the longer duration events have the lowest. This is not surprising, because this problem arises when one of the groups has a very small share of the observed data set. Artificially inflating the underrepresented share can solve this problem but can make the estimated thresholds biased (Train, 2002).

Alternatively, simulation can be used to see if the accuracy improves.

4.2.2 Ordered Logit Model by simulation

Predictions were simulated by drawing a random number from a uniform distribution between 0

and 1 for each observation (Train, 2009). The corresponding prediction for that random number

was found by comparing it with the CDF of that individual. This process was repeated for each

individual 1000 times. The final prediction for each individual was taken as the most frequently

predicted output class for that individual. A sample calculation follows:

The predicted probabilities of each class for observation #2067 are shown in Table 7.

Table 7: Predicted probabilities for record #2067

Record   P1   P2        P3        P4         P5
2067     0    0.41741   0.57082   0.010842   0.000928

The cumulative distribution function (CDF) for observation #2067 is calculated using Equation 19.

Equation 19

$$ P(y_i \le j) = \sum_{k=1}^{j} P(y_i = k) $$

The resulting CDF is given in Table 8.

Table 8: CDF for record #2067

Cumulative probability   Definition             Value
P(<=1)                   P(1)                   0
P(<=2)                   P(1)+P(2)              0.41741
P(<=3)                   P(1)+P(2)+P(3)         0.98823
P(<=4)                   P(1)+P(2)+P(3)+P(4)    0.999072
P(<=5)                   P(<=4)+P(5)            1

Next, a random number η is drawn from the uniform distribution between 0 and 1, which

represents the value of the CDF for that observation.

Equation 20

$$ \eta = F(x) $$

The value of η dictates the location on the CDF of the predicted event. For example, if η=0.45, then η represents an event whose CDF value lies between P(<=2) and P(<=3). The predicted class is the one whose probability is the difference between these two cumulative probabilities, which is class 3 (P3). The results are shown in Table 9 and Table 10 below:
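The simulation procedure described in this section can be sketched in Python for a single record. This is an illustration under stated assumptions, not the R code used in the study: the function names and the random seed are hypothetical, and the probabilities are those of record #2067 in Table 7.

```python
import bisect
import itertools
import random
from collections import Counter

# Predicted class probabilities P1..P5 for record #2067 (Table 7).
probs = [0.0, 0.41741, 0.57082, 0.010842, 0.000928]
cdf = list(itertools.accumulate(probs))

def class_for_draw(eta, cdf):
    """Return the 1-based class whose CDF interval contains the uniform draw eta."""
    idx = bisect.bisect_right(cdf, eta)
    # Guard against floating-point rounding in the last cumulative value.
    return min(idx, len(cdf) - 1) + 1

def simulate_prediction(cdf, n_draws=1000, rng=None):
    """Draw n_draws uniform numbers and return the most frequently drawn class."""
    rng = rng or random.Random()
    draws = Counter(class_for_draw(rng.random(), cdf) for _ in range(n_draws))
    return draws.most_common(1)[0][0]

print(class_for_draw(0.45, cdf))
print(simulate_prediction(cdf, rng=random.Random(2013)))
```

Using a binary search on the cumulative probabilities keeps each draw cheap, which matters when 1000 draws are repeated for every record.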

Table 9: Confusion table resulting counts from the simulated ordered logit model

      P2      P3      P4     P5
P2    3231    1149    67     9
P3    1177    2960    472    51
P4    30      479     169    35
P5    7       61      33     21

Table 10: Confusion table resulting percentages from the simulated ordered logit model.

      P2          P3          P4          P5
P2    72.68841    24.71499    9.041835    7.758621
P3    26.47919    63.66961    63.69771    43.96552
P4    0.674916    10.30329    22.80702    30.17241
P5    0.15748     1.31211     4.453441    18.10345

Simulation improved the classification of the longer duration classes only slightly over the fitted model (Table 10 versus Table 6), and the overall share of correct classifications fell to 64.1% (6381 of 9951). Furthermore, both the fitted ordered logit model and the simulated model fail to predict any event in the first class, even though there are over 2000 of these events. These results argue against pursuing ordered logit models in the future.

4.3 Survival Analysis

As explained in the background section, the AFT interpretation of survival models made more

sense for this study rather than the hazard model itself, although the results of the AFT can

provide insight into the time-dependency of the hazard.

4.3.1 Heterogeneity

Unobserved heterogeneity had to be considered for two reasons:

1. Not all attributes of each observed delay could be used in model building, such as Guard

and Operator. This is because these variables must be treated as categorical variables,

and there are too many distinct categories. By intentionally omitting these covariates

from the model building, unobserved heterogeneity is introduced. Accounting for

heterogeneity can improve the model fit.

2. Some guards, operators, and train cars experience repeated delays over the year 2013,

creating possible dependencies between some observations.

It is not critical whether parametric heterogeneity or nonparametric heterogeneity is used (Keifer, 1988,

as cited in Sharman, 2014). Common parametric heterogeneity forms include Gamma and

Gaussian distributed heterogeneity. Both Gamma and Gaussian heterogeneity distributions are

supported in R; however, a maximum of one heterogeneity term to replace one variable is

allowed. Although the 35 incident types could have been aggregated into a smaller number of

categories, this aggregation would have introduced additional sources of heterogeneity,

precluding us from replacing other variables such as Operator# or Guard# with the random

frailty term.

4.3.2 Overall assessment of model fit, quality, and limitations

There are a number of ways to check the adequacy of the proposed hazard model specification. One method is to examine the graphical plot of the Cox-Snell residuals. The plot is constructed, with log{-log[S(CSi)]} on the y-axis and log(CSi) on the x-axis, where

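Equation 21

$$ CS_i = -\log\big(S(t_i \mid x_i)\big) = H(t_i) $$

and S(ti|xi) is the parametric survival curve that was fitted using maximum likelihood techniques. S(CSi) is the non-parametric Kaplan-Meier survival curve of the residuals, given by Equation 22.

Equation 22

$$ S(t) = \prod_{k=1}^{t} \left(1 - \frac{d_k}{n_k}\right) = S(t-1)\left(1 - \frac{d_t}{n_t}\right) $$

where dk is the number of events that end at time k and nk is the number of observations still at risk just before time k.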
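The product-limit form of the Kaplan-Meier estimator can be sketched in Python as follows. This is an illustration only: it assumes that every duration is fully observed (no censoring), and the function name and the sample durations are hypothetical rather than taken from the study's data.

```python
from collections import Counter

def kaplan_meier(durations):
    """Kaplan-Meier survival estimate for fully observed (uncensored) durations.

    Returns a list of (time, S(time)) pairs, one per distinct event time.
    """
    events = Counter(durations)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(events):
        d_t = events[t]                   # number of events that end at time t
        survival *= 1.0 - d_t / at_risk   # S(t) = S(t-1) * (1 - d_t / n_t)
        curve.append((t, survival))
        at_risk -= d_t                    # these observations leave the risk set
    return curve

# Hypothetical delay durations in minutes, for illustration only.
curve = kaplan_meier([1, 2, 2, 3, 5, 5, 5, 8])
print(curve)
```

For the Cox-Snell check, the same estimator would be applied to the residuals CSi instead of the raw durations.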

For a specification to be a good one, the resulting plot should closely align with a slope of 1 and an intercept of 0 (Machin, et al., 2006). The Exponential, Weibull, and Log-Logistic specifications were tested.

Figure 8: Cox-Snell Residual plots for the Exponential, Weibull, and Log-Logistic specifications. The data omits incidents less than 2 minutes.

From the Cox-Snell residual plots in Figure 8, it can be concluded that the Exponential, Weibull, and Log-Logistic specifications are all inadequate. The source of the problem could be a misspecification of the covariate form in the linear predictor of Equation 14 (Lindqvist, Aaserud, & Kvaløy, 2012) or an inappropriate choice of the Exponential, Weibull, or Log-Logistic survival form. Out of the three specifications, the Log-Logistic model comes closest to a straight line, but at the wrong slope. Weng et al. (2014) arrived at similar results, but considered the specification to be appropriate because of the straightness of the line alone. Note that the above residual plots are based on the data where delays less than 2 minutes were excluded.

In a subsequent test, it was assumed that all incidents recorded as zero that can conceivably affect train service were actually 1 minute. Although survival analysis is designed to handle right-censored data, dealing with left-censored data is not as straightforward (Huston & Juarez-Colunga, 2009). It is unclear whether or not the R function “survreg” can handle left-censored data. Therefore, in this analysis, the remaining zero-delay duration incidents were assigned a uniformly distributed random number between 0 and 0.6. Treating these data as left-censored should be the subject of future research. The resulting Cox-Snell residual plot is shown in Figure 9.

Figure 9: Cox Snell Residual plots for the Log-logistic specification. Data includes all incidents less than 2 minutes as well, with incidents less than 1 minute assigned a random number.

The Log-Logistic remains the best out of the three parametric specifications, and the slope of the residuals plots is much closer to 1. There is not yet theoretical justification for assuming a nonzero duration for all records in the data set, except that leaving the 0 minute durations as 0 would make the model difficult to estimate (the log of 0 is undefined). The possibility that some incident types followed different AFT specifications and scale parameters was also tested, but was found to worsen the model fit.

Next, the competing specifications are compared based on other fit statistics, such as the AIC or

McFadden’s ρ2 in Table 11.

Table 11: Fit statistics for the four common AFT specifications. The Log-logistic appears to be the best

Specification   Log Likelihood   Df   AIC
Exponential     -23253.38        38   46582.76
Weibull         -23002.41        36   46076.81
Log-normal      -22973.58        39   46025.17
Log-logistic    -22129.88        39   44337.76

The comparison of the AIC values also suggests that the log-logistic specification is the most

appropriate.

The adjusted approximate McFadden’s ρ2 in Equation 23 was used because the number of

observations used was very large, which would have dampened the effect of including too many

covariates (Sharman, 2014).

Equation 23

$$ \rho^2_{adjusted} = 1 - \frac{LL_{full} - r}{LL_{null}} = 1 - \frac{-22129.88 - 38}{-25982.18} = 0.1468 $$

This ρ2adjusted is two orders of magnitude larger than the ρ2adjusted value calculated by Sharman (2014) in his study, who also recommended the log-logistic AFT as the final model.
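The fit statistics above can be checked with a few lines of Python. This sketch is illustrative only (the study used R); the function names are hypothetical, and the inputs are the log-likelihoods and degrees of freedom reported in Table 11 and Equation 23.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = -2 * LL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def adjusted_rho_squared(ll_full, ll_null, n_params):
    """Adjusted McFadden rho-squared: 1 - (LL_full - r) / LL_null."""
    return 1.0 - (ll_full - n_params) / ll_null

# (log-likelihood, degrees of freedom) as reported in Table 11.
table_11 = {
    "Exponential": (-23253.38, 38),
    "Weibull": (-23002.41, 36),
    "Log-normal": (-22973.58, 39),
    "Log-logistic": (-22129.88, 39),
}

aics = {name: aic(ll, df) for name, (ll, df) in table_11.items()}
best = min(aics, key=aics.get)
rho2 = adjusted_rho_squared(-22129.88, -25982.18, 38)
print(best, round(aics[best], 2), round(rho2, 4))
```

The lowest AIC identifies the preferred specification, since the AIC rewards a higher log-likelihood and penalizes additional parameters.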

Finally, a nonparametric model is estimated to ensure that the parameter estimates of the AFT

model are not biased (Bhat, 1996, as cited in Sharman, 2014). A nonparametric hazard model

assumes a constant hazard within each time interval defined. Non-parametric modelling also

assumes proportional hazards and does not lend itself as easily to predicting the expected

duration directly. For example, an incident type, such as “suicide”, may make more sense as a

covariate in an AFT model, as the occurrence of such a high severity event leads to longer

clearing times. In the nonparametric proportional hazards model, however, suicides have a constant effect on the risk of duration termination regardless of how long the incident has lasted. A negative coefficient suggests that the risk of duration termination is low, but it is

unclear whether that means there would be a low risk of the incident clearing in an earlier time

interval or a later time interval. Conceivably, this ambiguity can be solved by including

interaction terms with incident types and time intervals to see how the effect of incident type

changes over time. Presently, however, the model fails to converge on a solution when

interaction terms are included.

In order to make the nonparametric hazard model directly comparable with the AFT model, it

was assumed that a high hazard rate corresponds to shorter durations, and low hazard rate

corresponds to longer durations, regardless of the time interval. The covariate effects are

compared in Table 12. The time intervals defined follow those of the Ordered Logit

Model in section 4.2.

There is considerable discrepancy between the parametric and nonparametric models. It is

therefore possible that the parametric estimates are biased. The problem may also stem from an inappropriate specification of the nonparametric model (for example, an inappropriate choice of time intervals).

The baseline hazards, shown as j1 to j5 at the bottom of Table 12, reach a peak at the second

interval and decreases afterward. Although this may support the Log-Logistic hazard, which

increases to a maximum before decreasing, the decrease in hazard is not monotonic. This is

another discrepancy between the non-parametric model and the Log-Logistic specification.

The non-parametric model cannot be used alone to make predictions of durations, as setting up

the model requires prior knowledge on the approximate duration of the delay as an input

covariate.

Although the effects of the covariates in the log-logistic AFT model are not consistent with the results of the non-parametric model, there is more evidence to support the Log-Logistic specification:

• the closely-fitting Cox-Snell residuals plot is nearly straight with a slope of 1,

• the lowest AIC value belongs to the log-logistic specification,

• the ρ2 value is 0.14, which is larger than that calculated in similar studies, and

• the distribution of the errors follows a heavy-tailed distribution as observed in section 4.1.

Table 12: Parameter estimates and effects comparison between the parametric and non-parametric formulations. There is considerable discrepancy between the two formulations.

Variable   Coeff. para.   Coeff. nonpar.   Para. effect   NPara. effect   ! if doesn’t match

(Intercept) 0.8108*** 2.6844*** ↑ ↓ !

CategoryCOMMUNICATIONS 0.9036*** 0.949*** ↑ ↓ !

CategoryDEBRIS___TRACK 0.2325 1.4545*** ↑ ↓ !

CategoryDISORDERLY_PATRON 0.1446. -0.1205 ↑ ↑

CategoryDOOR 0.4872*** -0.0035 ↑ ↑

CategoryDOOR_PASSENGER 0.2979** -0.6719*** ↑ ↑

CategoryDOOR_PERSONNEL_MISTAKE 1.4548*** 1.6519*** ↑ ↓ !

CategoryESCALATOR_ELEVATOR_STAIRS -0.778*** -3.74*** ↓ ↑ !

CategoryFIRE___TRACK 1.4131*** 1.5765*** ↑ ↓ !

CategoryFIRE_IN_STATION 0.5096 2.0547*** ↑ ↓ !

CategoryFIRE_ON_TRAIN 1.657*** 1.7441*** ↑ ↓ !

CategoryHOLDUP_ALARM_ACTIVATED -2.2132*** -1.4357*** ↓ ↑ !

CategoryINJURY -0.2485** -0.1403. ↓ ↑ !

CategoryNO_POWER -0.4871 1.913*** ↓ ↓

CategoryONBOARD_MECHANICAL_MAJOR 0.425* -0.5354* ↑ ↑

CategoryONBOARD_MECHANICAL_MINOR 0.5091*** -0.2815* ↑ ↑

CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 1.0791*** -0.4456*** ↑ ↑

CategoryOPERATOR_OVERSHOT_PLATFORM -0.391*** -0.6236*** ↓ ↑ !

CategoryOTHER -0.195* -0.6105*** ↓ ↑ !

CategoryPAA_ACTIVATED_BY_CUSTOMER -0.5859*** -0.5876*** ↓ ↑ !

CategoryPERSONNEL_ERROR -0.3647 -1.1146*** ↓ ↑ !

CategorySECURITY_OTHER 0.2643** 0.4597*** ↑ ↓ !

CategorySIGNAL_SWITCH -0.0248 0.2432** ↓ ↓

CategorySPEED_CONTROL -0.672*** 0.3727*** ↓ ↓

CategorySUICIDE_ 3.1148*** 2.9083*** ↑ ↓ !

CategoryTRACK_PROBLEM 0.0761 0.2495 ↑ ↓ !

CategoryTRACTION_POWER_PROBLEM 0.8795*** 0.6019*** ↑ ↓ !

CategoryTRAIN_STOP_CONTACTED -0.5181*** -2.1302*** ↓ ↑ !

CategoryTRAINING_DEPARTMENT -1.1104*** 0.0756 ↓ ↓

CategoryUNAUTHORISED_TRACK_LEVEL 0.9244*** 1.2434*** ↑ ↓ !

CategoryUNSANITARY_UNHEALTHY 0.4704*** -0.5106*** ↑ ↑

CategoryWEATHER -0.197 1.7318*** ↓ ↓

CategoryWORK_REFUSAL -0.1564 1.3354*** ↓ ↓

CategoryWORKZONE_PROBLEMS 0.9513*** -1.0524*** ↑ ↑

CategoryYARDHOUSE_PROBLEM 0.6248*** 0.3307* ↑ ↓ !

Peak_hour1 0.0552** 0.1055*** ↑ ↓ !

Interchange_station1 -0.0549* 0.1201*** ↓ ↓

Intercom1 0.1347*** -0.3911*** ↑ ↑

frailty.gamma(Guard, ↑

Scale 0.529

Frailty.Gamma 0.367

j1 0.5927***

j2 7.2736***

j3 -1.246***

j4 -0.8438***

j5 0

The log-logistic AFT is also consistent with the findings of other AFT studies in transportation

(Jones, et al., 1991, as cited in Valenti, et al., 2010; Sharman, 2014; Weng, et al., 2014). The

scale parameter being less than 1 suggests that the hazard increases to a maximum, and then

decreases afterward (Machin, Cheung, & Parmar, 2006; Allison, 2010). An increasing hazard

means that an event is increasingly likely to be cleared as time progresses. A decreasing

hazard means that an incident is decreasingly likely to come to an end as time progresses, and

this is analogous to incidents with extremely long durations. The suggested parametric

distribution and scale parameter can be supported by the wide spread of event durations as

shown by the boxplots in Figure 6.
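The rise-then-fall shape of the log-logistic hazard for a scale parameter below 1 can be illustrated with a short Python sketch. It uses the intercept (0.8108) and scale (0.529) from Table 12 as a baseline, and it ignores all other covariates and the frailty term, which is a simplifying assumption. The function names are hypothetical, and the study itself used R.

```python
import math

def loglogistic_hazard(t, mu, sigma):
    """Hazard of a log-logistic AFT model with log(T) = mu + sigma * eps."""
    alpha = math.exp(mu)    # median duration, where S(alpha) = 0.5
    gamma = 1.0 / sigma     # shape parameter
    u = (t / alpha) ** gamma
    # S(t) = 1 / (1 + u), so h(t) = f(t) / S(t) = (gamma / t) * u / (1 + u).
    return (gamma / t) * u / (1.0 + u)

def hazard_peak_time(mu, sigma):
    """Time at which the hazard peaks; a peak exists only when sigma < 1."""
    if sigma >= 1.0:
        raise ValueError("the hazard decreases monotonically when sigma >= 1")
    gamma = 1.0 / sigma
    return math.exp(mu) * (gamma - 1.0) ** (1.0 / gamma)

mu, sigma = 0.8108, 0.529   # intercept and scale from Table 12 (baseline only)
t_peak = hazard_peak_time(mu, sigma)
for t in (t_peak / 2, t_peak, 2 * t_peak):
    print(round(t, 2), round(loglogistic_hazard(t, mu, sigma), 4))
```

With a scale of 1 or more the same formula yields a hazard that only decreases, which is why the estimated scale of 0.529 is what implies a peak.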

The mean square error of the fitted model against the test set was 56.64, which is also

considerably better than the mean square error of 127 from the OLS regression model in

section 4.1.

5. Summary and Discussion of Results

5.1 Assessment of goodness of fit

Between the OLS regression model and the AFT model, the AFT model was the better performing model on the holdout data, with a lower mean square error of 56, versus 127 for the OLS model. Furthermore, the R2 value for the OLS regression model was only about 0.55, which means that the regression model failed to explain almost half of the variation. The McFadden adjusted ρ2 value for the AFT model was only 0.14, but this cannot be interpreted as a poor fit in the same way as R2 (Allison, 2010).

The Ordered Logit Model overpredicted shorter duration incidents and underpredicted longer

duration incidents. Generating predictions by simulating random draws from each observation’s

CDF did not improve the accuracy.

5.2 Comparison of Parameter Estimates between the three models

Table 13 includes the resulting parameter estimates from all three models. Note that a significance indicator was attached to the parameter estimates for the Ordered Logit Model for comparison purposes only, although the significance interpretation may not be valid, as discussed in section 4.2.1. All three models agree, by their negative signs, that incidents at interchange stations have shorter durations. Only the regression model predicts that peak hour incidents are cleared faster (by its negative sign), but not significantly so. All three models also predict that trains with intercoms increase the expected delay duration.

Table 13: Parameter Estimates for the Log-logistic AFT, OLS Regression, and Ordered Logit Model

Variable   β Log-logistic AFT   β OLS Regression   β Ordered Logit Model

(Intercept) 0.8108*** 1.71874***

2|3 Ordered Logit Model Only -3.9025

3|4 Ordered Logit Model Only 0.8644

4|5 Ordered Logit Model Only 3.4158

CategoryCOMMUNICATIONS 0.9036*** -0.50832*** -1.33136*

CategoryDEBRIS___TRACK 0.2325 -0.51197*** -2.25413***

CategoryDISORDERLY_PATRON 0.1446. -0.17944* -0.56732*

CategoryDOOR 0.4872*** -0.3837*** -1.74628***

CategoryDOOR_PASSENGER 0.2979** -0.61444*** -2.29766***

CategoryDOOR_PERSONNEL_MISTAKE 1.4548*** 0.30377 1.30081*

CategoryESCALATOR_ELEVATOR_STAIRS -0.778*** -0.07482 -1.49211

CategoryFIRE___TRACK 1.4131*** 0.57698*** 1.55113***

CategoryFIRE_IN_STATION 0.5096 -0.2971 -1.63022

CategoryFIRE_ON_TRAIN 1.657*** 0.75405** 2.05242*

CategoryHOLDUP_ALARM_ACTIVATED -2.2132*** 0.24078 0.31606

CategoryINJURY -0.2485** 0.02 -0.11177

CategoryNO_POWER -0.4871 1.48598*** 0.07684

CategoryONBOARD_MECHANICAL_MAJOR 0.425* -0.43562* -2.18963***

CategoryONBOARD_MECHANICAL_MINOR 0.5091*** -0.40336*** -1.74338***

CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 1.0791*** -0.58654*** -2.53207***

CategoryOPERATOR_OVERSHOT_PLATFORM -0.391*** -1.19377*** -4.23457***

CategoryOTHER -0.195* -0.21884** -0.96065***

CategoryPAA_ACTIVATED_BY_CUSTOMER -0.5859*** -1.38509*** -4.90224***

CategoryPERSONNEL_ERROR -0.3647 -0.11267 -1.59531*

CategorySECURITY_OTHER 0.2643** 0.00113 -0.16588

CategorySIGNAL_SWITCH -0.0248 -0.97164*** -3.67877***

CategorySPEED_CONTROL -0.672*** -1.53454*** -5.73978***

CategorySUICIDE_ 3.1148*** 2.18616*** 6.00801***

CategoryTRACK_PROBLEM 0.0761 -0.77326*** -3.2738***

CategoryTRACTION_POWER_PROBLEM 0.8795*** -0.03314 -0.10878

CategoryTRAIN_STOP_CONTACTED -0.5181*** -1.34075*** -4.88328***

CategoryTRAINING_DEPARTMENT -1.1104*** -0.5995 -1.6665

CategoryUNAUTHORISED_TRACK_LEVEL 0.9244*** -0.03376 0.17715

CategoryUNSANITARY_UNHEALTHY 0.4704*** -0.30677*** -1.40978***

CategoryWEATHER -0.197 0.34076 0.27399

CategoryWORK_REFUSAL -0.1564 -0.17394 -1.02694

CategoryWORKZONE_PROBLEMS 0.9513*** 0.4357** 0.75202.

CategoryYARDHOUSE_PROBLEM 0.6248*** -0.10716 -0.34776

Peak_hour1 0.0552** -0.02063 0.11864*

Interchange_station1 -0.0549* -0.0412* -0.16484*

Intercom1 0.1347*** 0.03714* 0.11086*

R2 N/A 0.55 N/A

ρ2adjusted 0.14 N/A N/A

% of Correct classifications 76.7%

There is considerable discrepancy between the coefficient signs of certain incident types listed

in Table 14.

Table 14: List of incident types on which the three models disagree.

Variable Name

CategoryCOMMUNICATIONS

CategoryDEBRIS___TRACK

CategoryDISORDERLY_PATRON

CategoryDOOR

CategoryDOOR_PASSENGER

CategoryFIRE_IN_STATION

CategoryHOLDUP_ALARM_ACTIVATED

CategoryINJURY

CategoryNO_POWER

CategoryONBOARD_MECHANICAL_MAJOR

CategoryONBOARD_MECHANICAL_MINOR

CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION

CategorySECURITY_OTHER

CategoryTRACK_PROBLEM

CategoryTRACTION_POWER_PROBLEM

CategoryUNAUTHORISED_TRACK_LEVEL

CategoryUNSANITARY_UNHEALTHY

CategoryWEATHER

CategoryYARDHOUSE_PROBLEM

The source of the discrepancy is unclear. The standard deviation for the duration for each

incident type was examined as a possible source. Although it was expected that the lack of

consensus may have been attributable to the very high variance of some incident types, this

was not the case. As shown in Table 15, no discernible trend exists that would relate lack of consensus on sign with high variance. It should also be noted that, among the incident types on which the three models agree in sign and whose coefficients are positive, the coefficient of the Log-Logistic AFT model is larger in magnitude than that of the OLS regression model. This suggests that the Log-Logistic AFT model may be giving more weight to extremely long-duration events, whereas the OLS regression may be dismissing these events as “outliers”. This is not surprising given that the log-logistic model, with its heavier-tailed distribution, allows for a wider spread of errors than the normal distribution.

As discussed in Section 4.3.2, the Log-Logistic AFT achieved the best accuracy and had an

acceptable goodness of fit, unlike the other two models. Furthermore, as the Log-Logistic AFT is

better able to capture extremely lengthy durations, it is recommended to use the Log-Logistic

AFT as the final model.

Table 15: Incident types along with their mean duration and standard deviation. Non consensual incident types are highlighted in yellow.

Incident type   Mean of Min_Delay   Std. dev. of Min_Delay

CategoryNO_POWER 56.56648 152.936

CategoryWEATHER 44.38469 134.1471

CategorySUICIDE_ 61.625 32.86716

CategorySECURITY_OTHER 7.190962 32.57716

CategoryWORKZONE_PROBLEMS 10.46374 13.48527

CategoryCOMMUNICATIONS 6.282051 11.58481

CategoryFIRE_ON_TRAIN 16.5 11.16116

CategoryDEBRIS___TRACK 6.730769 10.69788

CategoryFIRE___TRACK 12.69528 10.02139

CategoryGrand Total 3.164316 9.86407

CategorySIGNAL_SWITCH 3.619211 8.665953

CategoryDOOR_PERSONNEL_MISTAKE 12 7.646015

CategoryUNAUTHORISED_TRACK_LEVEL 7.954935 7.050333

CategoryTRACK_PROBLEM 3.701149 6.446768

CategoryFIRE_IN_STATION 5.666667 6.372288

CategoryTRACTION_POWER_PROBLEM 6.444444 5.393159

CategoryOTHER 2.724066 4.626019

CategoryINJURY 3.149353 4.609265

CategoryYARDHOUSE_PROBLEM 4.973442 4.282172

CategoryASSAULT 3.593119 4.160723

CategoryDISORDERLY_PATRON 3.747075 3.462135

CategoryDOOR 4.464286 3.028167

CategoryPERSONNEL_ERROR 2.237552 2.571571

CategoryHOLDUP_ALARM_ACTIVATED 0.979396 2.53193

CategoryWORK_REFUSAL 2.369286 2.469779

CategoryTRAIN_STOP_CONTACTED 1.938897 2.465431

CategoryONBOARD_MECHANICAL_MINOR 4.148387 2.289994

CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION 3.542959 1.659106

CategoryUNSANITARY_UNHEALTHY 3.999085 1.621515

CategorySPEED_CONTROL 1.474679 1.595119

CategoryONBOARD_MECHANICAL_MAJOR 3.722222 1.48742

CategoryTRAINING_DEPARTMENT 1.213223 1.438952

CategoryOPERATOR_OVERSHOT_PLATFORM 2.08867 1.41492

CategoryDOOR_PASSENGER 3.395833 1.385331

CategoryPAA_ACTIVATED_BY_CUSTOMER 1.703869 1.254668

CategoryESCALATOR_ELEVATOR_STAIRS 0.332081 0.673997

5.2 Interpretation of Parameter Estimates

The parameter estimates from the Log-Logistic AFT model, along with the resulting expected durations, are repeated in Table 16. The relative value of the “Interchange_station1”, “Peak_hour1”, and “Intercom1” variables is tested after they are interpreted.

On average, incidents are cleared 5% faster at interchange stations than at other stations. This

is not surprising, as incidents at interchange stations may affect more than one line, and hence,

larger parts of the system than incidents at non-interchange stations. Interchange stations also

tend to handle more passenger traffic than non-interchange stations. Furthermore, incidents are

more prevalent at interchange stations than at other stations. In light of this, the transit agency

probably assigned higher priority to clearing incidents at interchange stations.

The coefficient for Peak_hour1 is positive, suggesting that incidents are cleared faster during off-peak hours. This does not match prior expectations, because the periods with short prevailing headways correspond to the times of higher passenger volumes. Although it may be in the transit agency’s interest to assign higher priority to clearing peak-period incidents, heavy traffic conditions during rush hour may impede the response teams, which would add to the total delay duration. The actual effect of rush hour on clearance time is small, increasing expected duration by only about 6% (exp(0.0552) ≈ 1.057).

The coefficient for “Intercom1” is positive and statistically significant. It can be argued that the sign should be negative, since an intercom should improve the detection and identification of the nature of an incident. However, the presence of an intercom was not enough to overcome the other problems that the Toronto Rocket trains experienced in 2013. Furthermore, it is possible that passengers were not yet accustomed to using the newly available intercom effectively. Although this variable was statistically significant, Toronto Rocket trains may no longer increase the expected duration in later years, as technical bugs in the new technology are resolved.
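As a worked illustration of how the percentage effects quoted above follow from the coefficients, the block below assumes the usual log-linear AFT form, in which switching a dummy variable from 0 to 1 multiplies the expected duration by exp(β). The symbols T and β_k are introduced here for illustration only; the coefficient values are those reported in Table 16.

```latex
\[
\frac{T(x_k = 1)}{T(x_k = 0)} = \exp(\beta_k), \qquad
\%\,\text{change} = 100\,\bigl(\exp(\beta_k) - 1\bigr)
\]
\[
\begin{aligned}
\text{Peak\_hour1:} &\quad \exp(0.0552) \approx 1.057 \;\Rightarrow\; +5.7\% \\
\text{Interchange\_station1:} &\quad \exp(-0.0549) \approx 0.947 \;\Rightarrow\; -5.3\% \\
\text{Intercom1:} &\quad \exp(0.1347) \approx 1.144 \;\Rightarrow\; +14.4\%
\end{aligned}
\]
```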

The value of including the variables “Interchange_station1”, “Peak_hour1”, and “Intercom1” in the final model is tested using the Likelihood Ratio test in Equation 24.

Equation 24

$$LR = -2\,(LL_o - LL_a) \sim \chi^2_{dof=1}, \qquad \chi^2_{dof=1,\ \alpha=0.05} = 3.841$$

where LLo is the log likelihood of the model without these variables, and LLa is the log likelihood

of the model with the variables. The degrees of freedom is 1 because the value of each

individual variable is tested, rather than the value of all three variables simultaneously.

From the bottom of Table 16, “Peak_hour1” and “Intercom1” add value to the final model: their LR statistics (27.8 and 196.4) exceed the critical value of 3.841. Conversely, “Interchange_station1” does not add significant value at the 95% confidence level, since its LR statistic (3.8) falls just below the critical value. It is the least statistically significant of the three variables, and serves only to reduce the predicted duration, making the model slightly more optimistic about the clearance time. All three variables are nevertheless retained in the final model for the insight they provide.
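The Likelihood Ratio test described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function and variable names are hypothetical, the log-likelihoods are copied from the bottom of Table 16, and the p-value uses the closed form for a chi-square variable with 1 degree of freedom.

```python
import math

# Critical value of the chi-square distribution (1 dof, alpha = 0.05), as in Equation 24.
CHI2_CRITICAL = 3.841


def likelihood_ratio(ll_o, ll_a):
    """Equation 24: LR = -2 (LL_o - LL_a)."""
    return -2.0 * (ll_o - ll_a)


def chi2_1dof_p_value(lr):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(lr / 2.0))


# (LL_o, LL_a) pairs copied from the bottom of Table 16.
log_likelihoods = {
    "Peak_hour1": (-23765.2, -23751.3),
    "Interchange_station1": (-23765.2, -23763.3),
    "Intercom1": (-23765.2, -23667.0),
}

results = {}
for name, (ll_o, ll_a) in log_likelihoods.items():
    lr = likelihood_ratio(ll_o, ll_a)
    # Store the statistic, its p-value, and whether it exceeds the critical value.
    results[name] = (lr, chi2_1dof_p_value(lr), lr > CHI2_CRITICAL)

for name, (lr, p, adds_value) in results.items():
    print(f"{name}: LR = {lr:.1f}, p = {p:.4f}, exceeds 3.841: {adds_value}")
```

Reporting the p-value alongside the pass/fail comparison makes it visible how close “Interchange_station1” is to the 5% threshold.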

The expected base delay (off-peak, involving a train without an intercom, and in a non-interchange station) of different incident types is calculated by Equation 25:

Equation 25

$$\text{Delay}_{\text{expected, base case}} = \exp\left(\beta_o + \beta_{type}\, x_{type}\right)$$

It was found that:

• incidents involving work zone problems, suicides, fires on trains, fires on tracks, and doors opening off the platform all lengthen the expected duration,

• incidents involving speed control problems, PAA activations, contacted train stops, personnel errors, and work refusals tend to have the shortest durations, producing delays of 2 minutes or less, and

• vehicle mechanical problems, including door problems, are cleared in 3 to 4 minutes on average.
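The base-case calculation in Equation 25 can be reproduced with a short Python sketch. The coefficients below are copied from Table 16 for a handful of incident types; the dictionary and function names are illustrative assumptions, not part of the original analysis.

```python
import math

# Coefficients copied from Table 16 (log-logistic AFT model).
INTERCEPT = 0.8108
TYPE_COEFFICIENTS = {
    "SUICIDE_": 3.1148,
    "FIRE_ON_TRAIN": 1.657,
    "DOOR": 0.4872,
    "SPEED_CONTROL": -0.672,
    "HOLDUP_ALARM_ACTIVATED": -2.2132,
}


def expected_base_delay(incident_type):
    """Equation 25 with x_type = 1 for the given incident type (minutes)."""
    return math.exp(INTERCEPT + TYPE_COEFFICIENTS[incident_type])


base_delays = {t: expected_base_delay(t) for t in TYPE_COEFFICIENTS}
for incident_type, minutes in base_delays.items():
    print(f"{incident_type}: {minutes:.2f} minutes")
```

The results should agree with the “Expected base delay” column of Table 16 up to rounding of the published coefficients.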


Table 16 Parameter estimates and average durations of the final model (log-logistic AFT)

| Parameter | Estimate β | Multiplicative effect on base category (Assault) | Expected base delay (minutes) | LLo | LLa | LR |
|---|---|---|---|---|---|---|
| (Intercept) | 0.8108*** | 2.249707 | 2.249707 | | | |
| CategorySUICIDE_ | 3.1148*** | 22.52892 | 50.68348 | | | |
| CategoryFIRE_ON_TRAIN | 1.657*** | 5.243557 | 11.79647 | | | |
| CategoryDOOR_PERSONNEL_MISTAKE | 1.4548*** | 4.283627 | 9.636905 | | | |
| CategoryFIRE___TRACK | 1.4131*** | 4.108673 | 9.24331 | | | |
| CategoryOPERATOR_NOT_AVAILABLE_NOT_IN_POSITION | 1.0791*** | 2.942031 | 6.618707 | | | |
| CategoryWORKZONE_PROBLEMS | 0.9513*** | 2.589073 | 5.824656 | | | |
| CategoryUNAUTHORISED_TRACK_LEVEL | 0.9244*** | 2.520356 | 5.670062 | | | |
| CategoryCOMMUNICATIONS | 0.9036*** | 2.468474 | 5.553343 | | | |
| CategoryTRACTION_POWER_PROBLEM | 0.8795*** | 2.409695 | 5.421107 | | | |
| CategoryYARDHOUSE_PROBLEM | 0.6248*** | 1.867872 | 4.202166 | | | |
| CategoryFIRE_IN_STATION | 0.5096 | 1.664625 | 3.744919 | | | |
| CategoryONBOARD_MECHANICAL_MINOR | 0.5091*** | 1.663793 | 3.743047 | | | |
| CategoryDOOR | 0.4872*** | 1.627752 | 3.661965 | | | |
| CategoryUNSANITARY_UNHEALTHY | 0.4704*** | 1.600634 | 3.600958 | | | |
| CategoryONBOARD_MECHANICAL_MAJOR | 0.425* | 1.52959 | 3.44113 | | | |
| CategoryDOOR_PASSENGER | 0.2979** | 1.347027 | 3.030416 | | | |
| CategorySECURITY_OTHER | 0.2643** | 1.302519 | 2.930286 | | | |
| CategoryDEBRIS___TRACK | 0.2325 | 1.26175 | 2.838569 | | | |
| CategoryDISORDERLY_PATRON | 0.1446. | 1.155577 | 2.59971 | | | |
| CategoryTRACK_PROBLEM | 0.0761 | 1.07907 | 2.427592 | | | |
| CategorySIGNAL_SWITCH | -0.0248 | 0.975505 | 2.1946 | | | |
| CategoryWORK_REFUSAL | -0.1564 | 0.855217 | 1.923988 | | | |
| CategoryOTHER | -0.195* | 0.822835 | 1.851137 | | | |
| CategoryWEATHER | -0.197 | 0.821191 | 1.847438 | | | |
| CategoryINJURY | -0.2485** | 0.77997 | 1.754704 | | | |
| CategoryPERSONNEL_ERROR | -0.3647 | 0.694405 | 1.562208 | | | |
| CategoryOPERATOR_OVERSHOT_PLATFORM | -0.391*** | 0.67638 | 1.521657 | | | |
| CategoryNO_POWER | -0.4871 | 0.614406 | 1.382233 | | | |
| CategoryTRAIN_STOP_CONTACTED | -0.5181*** | 0.595651 | 1.340041 | | | |
| CategoryPAA_ACTIVATED_BY_CUSTOMER | -0.5859*** | 0.556605 | 1.252197 | | | |
| CategorySPEED_CONTROL | -0.672*** | 0.510686 | 1.148894 | | | |
| CategoryESCALATOR_ELEVATOR_STAIRS | -0.778*** | 0.459324 | 1.033344 | | | |
| CategoryTRAINING_DEPARTMENT | -1.1104*** | 0.329427 | 0.741115 | | | |
| CategoryHOLDUP_ALARM_ACTIVATED | -2.2132*** | 0.10935 | 0.246006 | | | |
| Peak_hour1 | 0.0552** | 1.056752 | | -23765.2 | -23751.3 | 27.8* > 3.841 |
| Interchange_station1 | -0.0549* | 0.94658 | | -23765.2 | -23763.3 | 3.8 < 3.841 |
| Intercom1 | 0.1347*** | 1.144193 | | -23765.2 | -23667 | 196.4* > 3.841 |

Incidents that threaten customer safety and simultaneously affect train operations tend to have longer average durations than incidents that either threaten customer safety without affecting train operations, or affect train operations without threatening customer safety. This may reflect the TTC’s “safety first” attitude, which requires a more thorough response to life-threatening incidents. These results are also consistent with the findings of Weng, et al. (2014), who explained that crews responding to incidents involving potential casualties have to prioritise rescuing people over salvaging property and equipment. Conversely, incidents involving personnel errors and mechanical failures that do not threaten safety do not require external response agencies, such as EMS, Police, or Fire Services, and hence tend to have shorter durations. This conclusion is also corroborated by Weng, et al. (2014).


Note that incident duration still varies widely within each incident type, which is consistent with the findings of Giuliano (1989) and is reflected in the heavy tails of the error distribution. Predictions based on the average may therefore not always be accurate.

6. Conclusion

In this study, we estimated and compared the quality and predictive capabilities of three different models: OLS Regression, the Ordered Logit Model, and AFT models. The models were assessed on their goodness of fit and on their error rates when used to make predictions for the holdout data. The AFT model performed better than the OLS Regression model, with a slightly lower mean-square error. The parameter estimates were assessed and interpreted, and the results were found to be consistent with prior expectations and previous studies (Weng, et al., 2014). Although incident duration in subway systems has been studied in the past using incident type as the explanatory variable, our study was able to determine that the effect of additional variables such as station type, train type, and time of day on incident duration was statistically significant.

6.1 Recommendations for future work

Weng, et al. (2014) also considered a mixed effects model, as variation in clearance time can be introduced by other factors, such as the Operator, the Guard, and the specific vehicle in the fleet. A mixed effects model was tried in this study, but it did not always converge because of the excessive number of parameters.

Other formulations for the incident type variables will be tested to see if the predictive model can

be made more parsimonious. Changes to the location, train type, and time variables will also be

considered to see if more insights can be learned.


References

Aaserud, S., Kvaløy, J. T., & Lindqvist, B. H. (2013). Residuals and functional form in

accelerated life regression models. In Risk Assessment and Evaluation of

Predictions (pp. 61-65). Springer New York.

Allison, P. D. (2010). Survival analysis using SAS: a practical guide. SAS Institute.

Bhat, C.R. (1996). A hazard-based duration model of shopping activity with nonparametric baseline specification and nonparametric control for unobserved heterogeneity. Transportation Research Part B: Methodological, 30, 189-207.

Chan, S., & McGinty, J. (2005, March 26). Think the Subway's Running Later? You're Right. The

New York Times. Retrieved from

http://www.nytimes.com/2005/03/26/nyregion/26subway.html?pagewanted=all

Giuliano, G. (1989). Incident characteristics, frequency, and duration on a high volume urban

freeway. Transportation Research Part A: General, 23(5), 387-396.

Huston, C., & Juarez-Colunga, E. (2009). Guidelines for computing summary statistics for data-

sets containing non-detects. Department of Statistics and Actuarial Science, Simon

Fraser University for the Bulkley Valley Research Center with assistance from the BC

Ministry of Environment.

Kiefer, N. M. (1988). Economic duration data and hazard functions. Journal of Economic

Literature, 26, 646-679.

Machin, D., Cheung, Y. B., & Parmar, M. (2006). Survival analysis: a practical approach. John

Wiley & Sons.

Man, J., Aarshabh, M., & Shalaby, A. (2014). A survey-based approach to understanding the resilience implications of day-to-day disruptions on transit operation (Unpublished dissertation). University of Toronto, Toronto, Ontario.

Nam, D., & Mannering, F. (2000). An exploratory hazard-based analysis of highway incident

duration. Transportation Research Part A: Policy and Practice, 34(2), 85-102.

Ripley, B. (2013, October 19). RE: No P.values in polr summary. Message posted to

http://r.789695.n4.nabble.com/No-P-values-in-polr-summary-tp4678547p4678590.html

Roh, S. K. (n.d.). A study on the emergency response manual for urban transit fires. In Proceedings of the 7th Asia-Oceania Symposium on Fire Science and Technology (pp. 29-38). Retrieved from http://www.iafss.org/publications/aofst/7/17/view

Sarle, W.S. (1997), Neural Network FAQ, part 1 of 7: Introduction, periodic posting to the

Usenet newsgroup comp.ai.neural-nets. Retrieved from:

ftp://ftp.sas.com/pub/neural/FAQ.html

Sharman, B. (2014). Behavioural Modelling of Urban Freight Transportation: Activity and Inter-

Arrival Duration Models Estimated Using GPS Data (Unpublished doctoral dissertation).

University of Toronto, Toronto, Ontario.


Straphangers Campaign. (2014). Methodology: Straphangers Campaign Analysis of MTA Alerts

of Subway Incidents/Delays in 2011 and 2013. Retrieved from

http://www.straphangers.org/alerts/14/Methodology.pdf

Toronto Transit Commission. (2013). Chief Executive Officer’s Report – February 2013 Update.

[Toronto]. Retrieved from

https://www.ttc.ca/About_the_TTC/Commission_reports_and_information/Commission_meetings/2013/February_25/Reports/CHIEF_EXECUTIVE_OFFI.pdf

Toronto Transit Commission. (2014). Chief Executive Officer’s Report – November/December

2014 Update. [Toronto]. Retrieved from


meetings/2014/December_9/Reports/CHIEF_EXECUTIVE_OFFICERS_REPORT_NOV

EMBER_DECEMBER_2014_UPDAT.pdf

Toronto Transit Commission. (2015). Chief Executive Officer’s Report – January 2015 Update.

[Toronto]. Retrieved from


meetings/2015/January_21/Reports/CHIEF_EXECUTIVE_OFFICER_S_REPORT_JAN

UARY_2015_UPDATE.pdf

Train, K. E. (2009). Discrete choice methods with simulation. Cambridge University Press.

Weng, J., Zheng, Y., Yan, X., & Meng, Q. (2014). Development of a subway operation incident

delay model using accelerated failure time approaches. Accident Analysis &

Prevention, 73, 12-19.

Weisstein, E. (n.d.). Bonferroni Correction. MathWorld – A Wolfram Web Resource. Retrieved

from http://mathworld.wolfram.com/BonferroniCorrection.html
