
Minimizing Flight Delay, Data Expo 2009 (stat-computing.org/dataexpo/2009/posters/dey-phillips-steele.pdf)


The College of William & Mary

Minimizing Flight Delay

Tanujit Dey • David Phillips • Patrick Steele*

*Undergraduate researcher

Data Expo 2009, Washington, DC

Introduction

[Figure: Southwest Airlines, 1987–2008. Panels: 1987, 1997, 2002, 2008.]

Motivations:

• Over time, flight networks have grown in size and complexity, and delays on flight legs have grown similarly.

• How can individuals and airlines make better decisions regarding flight travel?

Goal: Design a visual decision support tool that can find a flight plan with the smallest predicted delay.

Data

• Years: 1987, 1992, 1997, 2002, 2005–2008

• Airlines: American, American Eagle, Continental, Delta, Skywest, Southwest, United

• Variables: Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, UniqueCarrier, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay

• Omitted data: All cancelled flights

Stochastic IP

We use techniques from integer programming and stochastic optimization. A (linear) integer program (IP) is an optimization problem of the form

$$\min_{\gamma} \left\{ \sum_{j=1}^{m} C_j \gamma_j \;:\; \sum_{j=1}^{m} a_{ij} \gamma_j = b_i \ \ \forall i, \quad \gamma_j \in \{0, 1\} \ \ \forall j \right\},$$

where $C_j, a_{ij} \in \mathbb{R}$ are given and the $\gamma_j$ represent yes/no decisions.

• Examples include finding the minimum-cost assignment of airplanes to flights, routing service delivery vehicles, and scheduling sports teams.

• A stochastic IP (SIP) has some or all of the $C_j$ and $a_{ij}$ random. E.g., if the $C_j$ were random variables, then the IP would be a SIP.

• Every SIP has an associated deterministic IP in which the random variables are replaced by non-random parameters.

• The solution and associated objective value of an SIP are random variables, so solutions are usually found in expectation or in probability.

Flight Graphs

• A graph for the airline $F$ is $N_F = (V, E)$, where $V$ is the set of nodes representing airports and $E \subseteq V \times V$ is the set of edges representing flight legs, i.e.,

$$V = \{ i : i \text{ is an airport from our data} \},$$

e.g., LAX, IAD, ORD $\in V$, and

$$E = \{ (i, j) : \exists \text{ a flight of } F \text{ from } i \text{ to } j \text{ in 2005--2008} \}.$$

E.g., if there is a flight of $F$ from LAX to IAD in the data, then (LAX, IAD) $\in E$. Edges are directed: $(i, j) \neq (j, i)$.

• A path in $N_F$ is an ordered set of edges,

$$P = ((i_1, i_2), (i_2, i_3), \ldots, (i_{k-1}, i_k)),$$

so that $i_j \neq i_\ell$ for all $j \neq \ell$. We define $|P| = k$.

[Figure: United Airlines, 1987 and 2008. Node size = number of flights, color = delay per flight, opacity = probability of delay.]

Shortest paths with random distances (SPRD)

For a given origin, destination, month/weekday of travel, and maximum number of legs allowed, we solve the following SIP. For all $(i, j) \in E$, we define $\gamma_{ij}$ as the indicator that $(i, j)$ is on the shortest path, and $\gamma$ as the vector of the $\gamma_{ij}$. We define the Shortest Paths Problem with Random Distances (SPRD) as the SIP

$$\min \left\{ \sum_{(i,j) \in E} C_{ij} \gamma_{ij} \;:\; \gamma \in \Omega \right\},$$

where the $C_{ij}$ are random variables representing delay and $\Omega$ is the set of arc indicators corresponding to paths from the origin to the destination. Thus,

$$\Omega = \left\{ \gamma \;:\; \underbrace{\sum_{j : (i,j) \in E} \gamma_{ij}}_{\text{flights out of } i} - \underbrace{\sum_{j : (j,i) \in E} \gamma_{ji}}_{\text{flights into } i} = b_i \ \ \forall i \in V \right\},$$

where $b_i = 0$ for all $i$ other than the origin and destination, $b_i = 1$ at the origin, and $b_i = -1$ at the destination.
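As a small illustration of the flow-balance constraints that define Ω, the following sketch builds the indicator vector γ for one path and computes (flights out of i) minus (flights into i) at each node. The function name `flow_balance` and the example route are hypothetical and not taken from the poster.

```python
from collections import defaultdict


def flow_balance(path_nodes):
    """Return b_i = (flights out of i) - (flights into i) for a path.

    path_nodes is the ordered list of airports visited, e.g. [s, ..., t].
    """
    # gamma[(i, j)] = 1 exactly for the arcs used by the path.
    gamma = {(u, v): 1 for u, v in zip(path_nodes, path_nodes[1:])}
    b = defaultdict(int)
    for (i, j), g in gamma.items():
        b[i] += g  # arc (i, j) leaves i
        b[j] -= g  # arc (i, j) enters j
    return dict(b)


# Hypothetical two-leg route from LAX to IAD via ORD.
balance = flow_balance(["LAX", "ORD", "IAD"])
print(balance)
```

For this route the origin gets 1, the intermediate airport gets 0, and the destination gets -1, which matches the values of b_i stated above.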

Our solution method

1. For each $(i, j) \in E$, estimate the distribution of $C_{ij}$.

2. Repeat 500 times:

(a) Randomly generate a realization $c_{ij} \sim C_{ij}$ for all $(i, j) \in E$.

(b) Solve the (deterministic) shortest path problem.

(c) Save the shortest path found.
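The loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each arc's delay distribution is approximated by a list of observed nonnegative delays and sampled uniformly from that list. The names `shortest_path`, `sample_paths`, and `delay_samples`, and all the numbers, are hypothetical.

```python
import heapq
import random
from collections import Counter


def shortest_path(costs, s, t):
    """Dijkstra on {(i, j): cost >= 0}; returns a shortest s-t path as a node tuple, or None."""
    adj = {}
    for (i, j), c in costs.items():
        adj.setdefault(i, []).append((j, c))
    dist, prev, done = {s: 0.0}, {}, set()
    heap = [(0.0, s)]
    while heap:
        d, i = heapq.heappop(heap)
        if i in done:
            continue
        done.add(i)
        if i == t:
            break
        for j, c in adj.get(i, []):
            if j not in done and d + c < dist.get(j, float("inf")):
                dist[j], prev[j] = d + c, i
                heapq.heappush(heap, (d + c, j))
    if t not in done:
        return None
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return tuple(reversed(path))


def sample_paths(delay_samples, s, t, runs=500, seed=0):
    """Steps 2(a)-(c): draw a realization, solve, and record the path found."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        realization = {arc: rng.choice(obs) for arc, obs in delay_samples.items()}
        counts[shortest_path(realization, s, t)] += 1
    return counts


# Hypothetical delay observations (minutes) per arc.
delay_samples = {
    ("IAD", "ORD"): [5, 10, 40],
    ("ORD", "LAX"): [0, 15, 20],
    ("IAD", "LAX"): [20, 25, 30],
}
counts = sample_paths(delay_samples, "IAD", "LAX")
print(counts.most_common())
```

The resulting counts give the frequency with which each path was shortest over the 500 realizations, which is the kind of quantity the arc thickness in the poster's figures encodes.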

Cascading dependencies

Model

In order to predict delay, we performed a multiple linear regression on the response DelayLevel, a categorical variable defined as follows.

DelayLevel        0      1       2       3        4
Delay (minutes)   0–15   15–30   30–60   60–120   120+

Due to the volume of the yearly data sets, we randomly sampled (without replacement) 70% of the data to perform the multiple regression, and averaged the estimated coefficients of significant variables over 500 runs. These bagged estimates were then used to predict DelayLevel, and the predictions were linearly extrapolated by our sampling methods to predict delay for any origin-destination pair.
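The two ingredients of this model, binning delays into DelayLevel and averaging coefficients over repeated 70% subsamples, can be sketched as below. Several assumptions are mine and not from the poster: the bins are treated as half-open intervals (so exactly 15 minutes falls in level 1), negative delays map to level 0, a one-predictor least-squares fit stands in for the multiple regression, and no significance filtering is applied. All function names are hypothetical.

```python
import bisect
import random

# Upper edges of levels 0..3; anything at or above 120 minutes is level 4.
LEVEL_EDGES = [15, 30, 60, 120]


def delay_level(minutes):
    """Map a delay in minutes to DelayLevel 0..4 (half-open bins, an assumption)."""
    return bisect.bisect_right(LEVEL_EDGES, minutes)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def subsample_average(xs, ys, runs=500, frac=0.7, seed=0):
    """Average (a, b) over fits on random subsamples drawn without replacement."""
    rng = random.Random(seed)
    n = len(xs)
    m = max(2, int(frac * n))
    total_a = total_b = 0.0
    for _ in range(runs):
        idx = rng.sample(range(n), m)
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        total_a += a
        total_b += b
    return total_a / runs, total_b / runs


print([delay_level(m) for m in (0, 15, 45, 119, 120)])
xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # hypothetical noiseless data
print(subsample_average(xs, ys))
```

With noiseless data every subsample recovers the same line, so the averaged coefficients equal the true ones. With real data the averaging reduces the variance of the estimates.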

Sampling methods

Given an origin-destination pair $s, t \in V$, a maximum number of flight legs $k$, a month $m$, a weekday $w$, and an airline, we must find distributions $C_{ij}$ for flight legs $(i, j)$ in one of the following sets. Let $P(s, t)$ denote a path from $s$ to $t$ and

$$E(s, t, k) = \{ (i, j) \in E : (i, j) \in P(s, t),\ |P(s, t)| = k \}.$$

For $(i, j)$, let

$$S(i, j, m, w, \tau) = \{ (i, j)_{d,t} : (i, j) \text{ has a flight on date } d \text{ at time } t \geq \tau,\ \text{month} = m,\ \text{weekday} = w \}.$$

• Sample in one of the following two ways:

1. Naive:

– For each arc $(i, j) \in E(s, t, k)$, independently generate $(i, j)_{d,t} \sim \mathrm{Unif}(S(i, j, m, w, 0))$.

2. Cascade:

– Generate a specific date, $d \sim \mathrm{Unif}\{ 1/1/05, \ldots, 12/31/08 : \text{month} = m,\ \text{weekday} = w \}$.

(a) Set $\tau = 0$. For $d$, generate

$$(s, j)_{d,t_j} \sim \mathrm{Unif}\{ (s, j)_{\delta,\sigma} \in S(s, j, m, w, \tau) : \delta = d \}$$

for every $j$ with $(s, j) \in E(s, t, k)$.

(b) For each $j$, set $\tau = t_j$ and repeat.

• For each arc $(i, j)$, apply the estimated delay formula to the sampled $(i, j)_{d,t_j}$ to obtain $c_{ij}$.
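The difference between the two sampling schemes can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. Flight records are assumed to be pre-filtered to the chosen month and weekday, each record carries a single time field `t` as in the definition of S, and an arc with no qualifying flight is left as `None`. The `Flight` record, the function names, and the data are hypothetical.

```python
import random
from collections import namedtuple

# Hypothetical flight record: one row per historical flight on an arc.
Flight = namedtuple("Flight", "origin dest date t")


def naive_sample(flights, arcs, rng):
    """Independently pick one historical flight per arc (tau = 0, any date)."""
    out = {}
    for arc in arcs:
        pool = [f for f in flights if (f.origin, f.dest) == arc]
        out[arc] = rng.choice(pool) if pool else None
    return out


def cascade_sample(flights, arcs, s, rng):
    """Pick one date d, then sample arcs outward from s with time >= tau."""
    d = rng.choice(sorted({f.date for f in flights}))
    out = {}
    frontier = [(s, 0)]  # (airport, earliest allowed time tau)
    while frontier:
        i, tau = frontier.pop(0)
        for arc in arcs:
            if arc[0] != i or arc in out:
                continue
            pool = [f for f in flights
                    if (f.origin, f.dest) == arc and f.date == d and f.t >= tau]
            out[arc] = rng.choice(pool) if pool else None
            if out[arc] is not None:
                # Step (b): the next leg must not start before this one.
                frontier.append((arc[1], out[arc].t))
    return d, out


flights = [
    Flight("BOS", "ORD", "2005-01-03", 800),
    Flight("BOS", "ORD", "2005-01-03", 1400),
    Flight("ORD", "LAX", "2005-01-03", 1000),
    Flight("ORD", "LAX", "2005-01-03", 1700),
    Flight("BOS", "ORD", "2005-01-10", 900),
    Flight("ORD", "LAX", "2005-01-10", 700),
    Flight("ORD", "LAX", "2005-01-10", 1200),
]
arcs = [("BOS", "ORD"), ("ORD", "LAX")]
rng = random.Random(0)
print(naive_sample(flights, arcs, rng))
print(cascade_sample(flights, arcs, "BOS", rng))
```

Under the naive scheme the two legs may come from different dates and incompatible times. Under the cascade scheme they share a date and respect the time ordering, which is what lets delays on one leg be paired with plausible follow-on legs.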

[Figure: Naive vs. Cascade sampling for Albuquerque, NM to Jackson Hole, WY and for Boston, MA to Los Angeles, CA. Arc thickness indicates frequency on a shortest path; color indicates expected delay.]


Finding shortest paths

A deterministic problem

Given a set of predicted delay times ($c_{ij}$) on the arcs and an origin-destination pair $(s, t)$, find the path from $s$ to $t$ of $k$ flight legs or fewer with minimum delay.

Our algorithm

1. Find $E(s, t, k)$ via breadth-first search (BFS), which finds $s$-reachable nodes using a FIFO queue.

[Figure: BFS starting from IAD on an example graph with airports LAX, BOS, IAD, SEA, MIA, ORD, DFW, LGA, PHF. The panels show the following steps.]

• Mark IAD found (yellow). FIFO queue = [IAD].

• Mark IAD done (orange). Mark unfound neighbors yellow. Set queue = [BOS, MIA]. Mark arcs to unfound neighbors blue.

• Mark BOS done. No unfound neighbors. Set queue = [MIA].

• Mark MIA done. Mark unfound neighbors. Set queue = [ORD, PHF]. Mark arcs to unfound neighbors.

• Continue until the queue is empty, which implies that all IAD-reachable nodes have been found and that the blue edges form paths from IAD. White nodes are not IAD-reachable.

Legend: white = not yet found; yellow = found, in queue; orange = found, out of queue.
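The BFS step can be sketched as below. The adjacency list is hypothetical: it is chosen only to agree with the queue contents in the walk-through (IAD reaches BOS and MIA, MIA reaches ORD and PHF), since the full example graph is not recoverable from the text. The optional `max_legs` cutoff is my assumption about how a limit of k legs might be enforced.

```python
from collections import deque


def bfs(adj, s, max_legs=None):
    """Return ({node: legs from s}, tree arcs) for nodes reachable from s."""
    legs = {s: 0}            # found nodes and their distance in legs
    tree_arcs = []           # the "blue" arcs to newly found neighbors
    queue = deque([s])       # FIFO queue of found, not yet done nodes
    while queue:
        i = queue.popleft()  # node i is now "done"
        if max_legs is not None and legs[i] >= max_legs:
            continue         # do not expand beyond the leg limit
        for j in adj.get(i, []):
            if j not in legs:
                legs[j] = legs[i] + 1
                tree_arcs.append((i, j))
                queue.append(j)
    return legs, tree_arcs


# Hypothetical adjacency list consistent with the walk-through.
adj = {
    "IAD": ["BOS", "MIA"],
    "BOS": ["IAD"],
    "MIA": ["ORD", "PHF"],
    "LAX": ["SEA"],
    "SEA": ["LAX"],
}
legs, tree_arcs = bfs(adj, "IAD")
print(legs)
print(tree_arcs)
```

Nodes absent from `legs` (here LAX and SEA) correspond to the white, unreachable nodes in the figure.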

2. Find the shortest path from $s \in V$ via Dijkstra’s algorithm (requires $C_{ij} \geq 0$).

• Dijkstra’s algorithm uses $d(i)$ = estimate of the length of the shortest path from $s$ to $i$.

• Relax($i$): for all $(i, j) \in E$, if $d(j) > d(i) + C_{ij}$, set $d(j) = d(i) + C_{ij}$.

• At each step, it finds the node $p$ with $d(p) = \min\{ d(i) : i \text{ not relaxed} \}$, then calls Relax($p$).

[Figure: Dijkstra’s algorithm from IAD on an example graph with airports IAD, LAX, PHF, DFW and arc lengths 3, 10, 4, 1, 4, 2, 5. Node labels are written "node, d(node)". The panels show the following steps.]

• Find shortest paths from IAD. Set d(IAD) = 0 and d(i) = ∞ for i ≠ IAD. Labels: IAD, 0; LAX, ∞; PHF, ∞; DFW, ∞.

• d(IAD) = min d(i) over un-Relaxed i. Call Relax(IAD) and mark IAD Relaxed (orange). Labels: IAD, 0; LAX, 3; PHF, 10; DFW, ∞.

• d(LAX) = min d(i) among un-Relaxed nodes. Call Relax(LAX) and mark LAX Relaxed (orange). Labels: IAD, 0; LAX, 3; PHF, 7; DFW, 5.

• d(DFW) = min d(i) among un-Relaxed nodes. Relax(DFW) has no effect. Mark DFW. Labels: IAD, 0; LAX, 3; PHF, 7; DFW, 5.

• Relax(PHF) has no effect. Mark PHF. Blue edges (lengths 3, 4, 2) are on shortest paths from IAD. Labels: IAD, 0; LAX, 3; PHF, 7; DFW, 5.

Legend: "i, d(i)" is an un-Relaxed node; "j, d(j)" is a Relaxed node, and d(j) won’t change.
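The Relax-based description can be sketched directly, using a linear scan for the minimum rather than a priority queue. Only four arcs of the example are recoverable from the labels in the walk-through (IAD to LAX = 3, IAD to PHF = 10, LAX to PHF = 4, LAX to DFW = 2). The endpoints of the remaining arcs (lengths 1, 4, 5) are unknown, so they are omitted here. The function name and return values are hypothetical.

```python
import math


def dijkstra(arcs, s):
    """Dijkstra via repeated Relax(p); arcs is {(i, j): c_ij >= 0}."""
    nodes = {s} | {n for arc in arcs for n in arc}
    d = {i: math.inf for i in nodes}
    d[s] = 0
    pred = {}               # pred[j] = previous node on a shortest path to j
    order = []              # order in which nodes are relaxed
    unrelaxed = set(nodes)
    while unrelaxed:
        p = min(unrelaxed, key=lambda i: d[i])
        if d[p] == math.inf:
            break           # remaining nodes are not reachable from s
        unrelaxed.remove(p)
        order.append(p)
        # Relax(p): try to improve d(j) through every arc (p, j).
        for (i, j), c in arcs.items():
            if i == p and d[j] > d[p] + c:
                d[j] = d[p] + c
                pred[j] = p
    return d, pred, order


# The four arcs recoverable from the walk-through.
arcs = {
    ("IAD", "LAX"): 3,
    ("IAD", "PHF"): 10,
    ("LAX", "PHF"): 4,
    ("LAX", "DFW"): 2,
}
d, pred, order = dijkstra(arcs, "IAD")
print(d)
print(order)
```

On these arcs the labels evolve as in the panels: PHF drops from 10 to 7 once LAX is relaxed, and the predecessor arcs are the three blue edges of lengths 3, 4, and 2.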

Runtimes

Algorithm step         Two flight legs or fewer   Three flight legs or fewer
BFS computations       < 1 second                 < 1 second
Cascade sampling       ∼ 30 seconds               ∼ 400 seconds
Dijkstra’s algorithm   < 1 second                 ∼ 3 seconds
Total runtime          < 32 seconds               < 7 minutes

Conclusions

[Figure panels: SFO–BTV, United, Mon. in Dec.; PHL–PDX, Continental, Wed. in Mar.; IAD–LAS, Delta, Fri. in June; SAN–JFK, American, Sun. in Sep.]

• Cascade sampling predicts cascade effects of delay better than Naive sampling.

• Cascade effects and delay patterns on any flight route within airlines, times/dates and airports can be visually compared.

• Runtimes are modest, with sampling as the computational bottleneck.

• Use $\alpha_{ij} C_{ij} + c((i, j))$ for costs to find objectives such as total travel time, weighted delay with flight costs, etc.

[Figure: MIA–SEA, all Mondays over all months. Panels: Continental, United.]

[Figure: Minimum overall flight times with delay, United Airlines. Panels: Dulles, Baltimore. Points at the same radius are the same distance away from the center. Color indicates delay.]